# Feature Engineering
Based on the results of the exploratory data analysis, the dataset is reduced to the features that may be useful for the intended purpose.

The following features are removed from the dataset:
- FarmID
- State
- District
- Sub-District
- HDate
- CNext
- ExpYield
- geometry
- CHeight

In [24]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

In [25]:
# Load the training data and drop the features listed above
df = pd.read_csv('../data/Train.csv')
df.drop(columns=['FarmID', 'State', 'District', 'Sub-District', 'HDate', 'CNext', 'ExpYield', 'geometry', 'CHeight'], inplace=True)

`SDate` is reduced to its month, which is extracted from the date string by position:

In [26]:
# The month is read from fixed positions in the date string (characters -7 and -6)
df['SMonth'] = df['SDate'].map(lambda x: x[-7:-5])
df.drop(columns='SDate', inplace=True)
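
As a sketch of a less position-dependent alternative, the month can also be obtained by parsing the strings as dates. The example below assumes a day-month-year format (`%d-%m-%Y`), which the notebook does not state, and uses made-up sample dates in a hypothetical `sample` frame; `format` would need to match the actual data.

```python
import pandas as pd

# Hypothetical sample; the dd-mm-yyyy format is an assumption, not confirmed by the dataset
sample = pd.DataFrame({'SDate': ['25-11-2022', '13-12-2022', '02-02-2023']})

# Parse the strings as dates and keep only the month as an integer (1-12)
sample['SMonth'] = pd.to_datetime(sample['SDate'], format='%d-%m-%Y').dt.month
sample = sample.drop(columns='SDate')

print(sample['SMonth'].tolist())
```

Unlike slicing, parsing raises an error when a value does not match the expected format, which can surface malformed dates early.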

The categorical variables are converted to numbers with label encoding. This technique is chosen for its simplicity, but note that it implies an order among the categories and is therefore better suited to ordinal variables, which these are not.

In [27]:
label_encoder = LabelEncoder()
# The same encoder is refit for every column, so only the last mapping (Season) remains in label_encoder.classes_
df['y'] = label_encoder.fit_transform(df.category.values)  # target variable
df['crop'] = label_encoder.fit_transform(df.Crop.values)
df['clast'] = label_encoder.fit_transform(df.CLast.values)
df['ctransp'] = label_encoder.fit_transform(df.CTransp.values)
df['irritype'] = label_encoder.fit_transform(df.IrriType.values)
df['irrisource'] = label_encoder.fit_transform(df.IrriSource.values)
df['season'] = label_encoder.fit_transform(df.Season.values)
# Drop the original string columns now that encoded copies exist
df.drop(columns=['category', 'Crop', 'CLast', 'CTransp', 'IrriType', 'IrriSource', 'Season'], inplace=True)
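
Since the features are nominal rather than ordinal, one-hot encoding is a common alternative that avoids implying an order. The following is only a minimal sketch using `pd.get_dummies`, not the approach taken in this notebook; the `sample` frame and its category values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample with two of the nominal columns; the values are made up
sample = pd.DataFrame({
    'IrriType': ['Flood', 'Sprinkler', 'Drip', 'Flood'],
    'Season': ['Kharif', 'Rabi', 'Rabi', 'Zaid'],
})

# One indicator column per category, so no artificial order is introduced
encoded = pd.get_dummies(sample, columns=['IrriType', 'Season'], dtype=int)

print(encoded.columns.tolist())
```

The trade-off is a wider table: each categorical column is replaced by as many columns as it has distinct values.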

The numerical variables `CropCoveredArea` and `WaterCov` are standardized (zero mean, unit variance) as follows:

In [28]:
scaler = StandardScaler()
# StandardScaler expects 2D input, hence the single-column DataFrames
df['crop_covered_area'] = scaler.fit_transform(pd.DataFrame(df.CropCoveredArea))
df['water_cov'] = scaler.fit_transform(pd.DataFrame(df.WaterCov))
# Drop the original unscaled columns
df.drop(columns=['CropCoveredArea', 'WaterCov'], inplace=True)
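
To make the transformation concrete, the sketch below standardizes both columns in a single call and checks the result against the formula `(x - mean) / std`. The `sample` frame is a small illustrative stand-in for the real data, and the names `cols`, `scaled` and `manual` are introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative sample standing in for the real columns
sample = pd.DataFrame({'CropCoveredArea': [97, 82, 92, 91, 94],
                       'WaterCov': [87, 94, 99, 92, 97]})

cols = ['CropCoveredArea', 'WaterCov']
scaler = StandardScaler()
scaled = scaler.fit_transform(sample[cols])  # each column is standardized independently

# StandardScaler uses the population standard deviation (ddof=0)
manual = (sample[cols] - sample[cols].mean()) / sample[cols].std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

Fitting both columns in one call also keeps both sets of statistics in `scaler.mean_` and `scaler.scale_`, so the same scaler can later be applied to new data with `scaler.transform`.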


The dataset ready for the first training is:

In [21]:
df.head()

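| Unnamed: 0 | CropCoveredArea | IrriCount | WaterCov | SMonth | y | crop | clast | ctransp | irritype | irrisource | season | crop_covered_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 97 | 4 | 87 | 11 | 1 | 5 | 0 | 3 | 1 | 1 | 1 | 1.362941 |
| 1 | 82 | 5 | 94 | 11 | 1 | 5 | 0 | 3 | 1 | 0 | 1 | 0.363057 |
| 2 | 92 | 3 | 99 | 12 | 1 | 5 | 0 | 3 | 1 | 0 | 1 | 1.029647 |
| 3 | 91 | 5 | 92 | 2 | 0 | 5 | 0 | 3 | 1 | 0 | 1 | 0.962988 |
| 4 | 94 | 5 | 97 | 12 | 0 | 5 | 0 | 3 | 1 | 0 | 1 | 1.162964 |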
