# Feature Engineering
Based on the findings of the exploratory data analysis, the dataset is reduced to the features that may be useful for the intended purpose.

The following features are removed from the dataset:
- FarmID
- State
- District
- Sub-District
- HDate
- CNext
- ExpYield
- geometry
- CHeight

In [91]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

In [92]:
df=pd.read_csv('../data/Train.csv')
df.drop(columns=['FarmID','State','District','Sub-District','HDate','CNext','ExpYield','geometry','CHeight'],inplace=True)

The SDate column is reduced to its month, which is stored in a new SMonth column:

In [93]:
df['SMonth']=df['SDate'].map(lambda x:x[-7:-5])
df.drop(columns='SDate',inplace=True)
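
The slice above works only if every date string has the same layout. As a minimal sketch of a more defensive alternative, the month can be extracted by parsing the date explicitly. The `DD-MM-YYYY` format and the sample values below are assumptions inferred from the slice `x[-7:-5]`, not taken from the dataset:

```python
import pandas as pd

# Hypothetical sample; the 'DD-MM-YYYY' layout is an assumption inferred from x[-7:-5]
sample = pd.DataFrame({'SDate': ['15-11-2023', '03-01-2024']})

# Parsing with an explicit format raises an error on malformed dates
# instead of silently slicing the wrong characters
sample['SMonth'] = pd.to_datetime(sample['SDate'], format='%d-%m-%Y').dt.month

print(sample['SMonth'].tolist())
```

Note that this variant yields integer months, whereas the slicing approach yields two-character strings.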

The categorical variables are converted to numerical values through label encoding. This technique is chosen for simplicity, but it is better suited to ordinal variables, which these are not: the integer codes impose an arbitrary order on the categories.

In [94]:
label_encoder1=LabelEncoder()
label_encoder2=LabelEncoder()
label_encoder3=LabelEncoder()
label_encoder4=LabelEncoder()
label_encoder5=LabelEncoder()
label_encoder6=LabelEncoder()
label_encoder7=LabelEncoder()
df['y']=label_encoder1.fit_transform(df.category.values)
df['crop']=label_encoder2.fit_transform(df.Crop.values)
df['clast']=label_encoder3.fit_transform(df.CLast.values)
df['ctransp']=label_encoder4.fit_transform(df.CTransp.values)
df['irritype']=label_encoder5.fit_transform(df.IrriType.values)
df['irrisource']=label_encoder6.fit_transform(df.IrriSource.values)
df['season']=label_encoder7.fit_transform(df.Season.values)
df.drop(columns=['category', 'Crop','CLast','CTransp', 'IrriType','IrriSource','Season'],inplace=True)
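
Because label encoding imposes an artificial order on nominal categories, one-hot encoding is a common alternative for the input features. The following is only a sketch of that option, not the approach used in this notebook; the category values are hypothetical:

```python
import pandas as pd

# Hypothetical values; the real categories of IrriType are not listed in this notebook
sample = pd.DataFrame({'IrriType': ['Drip', 'Flood', 'Sprinkler', 'Drip']})

# One binary column per category, so no ordering between categories is implied
one_hot = pd.get_dummies(sample, columns=['IrriType'], prefix='irritype', dtype=int)

print(one_hot)
```

If this option were adopted, the test set would need the same dummy columns in the same order as the training set.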

The numerical variables, in turn, are standardized to zero mean and unit variance:

In [95]:
scaler1 = StandardScaler()
scaler2 = StandardScaler()
df['crop_covered_area']=scaler1.fit_transform(pd.DataFrame(df.CropCoveredArea))
df['water_cov']=scaler2.fit_transform(pd.DataFrame(df.WaterCov))
df.drop(columns=['CropCoveredArea','WaterCov'],inplace=True)
df.sort_index(axis='columns',inplace=True)
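
To make explicit what the scaling step computes, the sketch below compares `StandardScaler` with the manual formula `(x - mean) / std`, using the population standard deviation (`ddof=0`). The sample values are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for the WaterCov column
sample = pd.DataFrame({'WaterCov': [10.0, 20.0, 30.0, 40.0]})

# StandardScaler expects a 2D input; ravel() flattens the (n, 1) result
scaler = StandardScaler()
scaled = scaler.fit_transform(sample[['WaterCov']]).ravel()

# Same computation by hand, with the population standard deviation (ddof=0)
manual = (sample['WaterCov'] - sample['WaterCov'].mean()) / sample['WaterCov'].std(ddof=0)

print(np.allclose(scaled, manual))
```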

The dataset, now ready for the first training run, is exported:

In [96]:
df.to_csv('../data/train1.csv',index=False)

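## Test Data Preparation

The test data goes through the same steps, reusing the encoders and scalers fitted on the training data so that both datasets share the same encoding and scale.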

In [97]:
df_test=pd.read_csv('../data/Test.csv')
# Drop unused columns
df_test.drop(columns=['FarmID','State','District','Sub-District','HDate','CNext','ExpYield','geometry','CHeight'],inplace=True)
# Date to Month
df_test['SMonth']=df_test['SDate'].map(lambda x:x[-7:-5])
df_test.drop(columns='SDate',inplace=True)
# Label Encoding
df_test['crop']=label_encoder2.transform(df_test.Crop.values)
df_test['clast']=label_encoder3.transform(df_test.CLast.values)
df_test['ctransp']=label_encoder4.transform(df_test.CTransp.values)
df_test['irritype']=label_encoder5.transform(df_test.IrriType.values)
df_test['irrisource']=label_encoder6.transform(df_test.IrriSource.values)
df_test['season']=label_encoder7.transform(df_test.Season.values)
df_test.drop(columns=['Crop','CLast','CTransp', 'IrriType','IrriSource','Season'],inplace=True)
# Scaling
df_test['crop_covered_area']=scaler1.transform(pd.DataFrame(df_test.CropCoveredArea))
df_test['water_cov']=scaler2.transform(pd.DataFrame(df_test.WaterCov))
df_test.drop(columns=['CropCoveredArea','WaterCov'],inplace=True)
# Sorting columns
df_test.sort_index(axis='columns',inplace=True)
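
One caveat when reusing fitted encoders: `LabelEncoder.transform` raises a `ValueError` if the test data contains a category that was not present during fitting. A minimal sketch of a check that could be run before the transformation, with hypothetical category values:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical categories; 'Manual' plays the role of a value absent from the training set
encoder = LabelEncoder().fit(['Drip', 'Flood', 'Sprinkler'])
test_values = np.array(['Drip', 'Manual', 'Flood'])

# Categories present in the test data but unknown to the fitted encoder
unseen = np.setdiff1d(test_values, encoder.classes_).tolist()
print(unseen)

# transform() fails on unseen labels, so they must be handled beforehand
try:
    encoder.transform(test_values)
    raised = False
except ValueError:
    raised = True
```

How to treat such values (dropping rows, mapping them to a placeholder category, or refitting) depends on the data and is left open here.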

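The mapping between the target classes and their encoded values is printed for reference, and the prepared test set is then exported: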

In [105]:
print(label_encoder1.classes_)
print(label_encoder1.transform(label_encoder1.classes_))

['Diseased' 'Healthy' 'Pests' 'Stressed']
[0 1 2 3]


In [99]:
df_test.to_csv('../data/test1.csv',index=False)

## Sample Submission

In [138]:
submission=pd.read_csv('../data/Test.csv')
results=pd.read_csv('../data/ypred2.csv')
results=label_encoder1.inverse_transform(results['0'].values)

In [141]:
submission=pd.concat([submission.FarmID,pd.Series(results)],axis=1)

In [143]:
submission.columns=['ID','Target']

In [145]:
submission.to_csv('../data/submission1.csv',index=False)
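
The submission steps above can also be sketched in a single expression, with a few sanity checks before writing the file. The farm IDs and predictions below are hypothetical; the class names match the encoder output printed earlier:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encoder fitted on the four target classes shown in the printed mapping
encoder = LabelEncoder().fit(['Diseased', 'Healthy', 'Pests', 'Stressed'])

farm_ids = pd.Series([101, 102, 103])   # hypothetical IDs
predicted = pd.Series([1, 3, 0])        # hypothetical encoded predictions

# Decode the predictions and pair them with the IDs
submission = pd.DataFrame({
    'ID': farm_ids,
    'Target': encoder.inverse_transform(predicted.to_numpy()),
})

# Sanity checks: one row per farm, no missing labels
assert len(submission) == len(farm_ids)
assert submission['Target'].notna().all()

print(submission)
```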