# Machine Learning

This notebook applies machine learning techniques to model the data studied in the exploratory_analysis.ipynb notebook. First, we import the required libraries:

In [255]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, f1_score

## Feature Engineering

First, the data extracted from the satellite images (see the satellite processing folder) is added to the training set:

In [225]:
df = pd.read_csv('../data/Train.csv')
df_encoded = pd.read_csv('../data/train_encoded.csv')
df = pd.concat([df,df_encoded.ndvi,df_encoded.evi,df_encoded.ndwi,df_encoded.gndvi,df_encoded.savi,df_encoded.msavi],axis=1)

About one thousand farms have no associated satellite data, so the new columns contain only missing values for them. These records are dropped:

In [226]:
df.dropna(inplace=True)

A brief multivariate analysis shows the mean and standard deviation of each satellite index per category:

In [227]:
df_satellite = pd.concat([df.ndvi,df.ndwi,df.gndvi,df.savi,df.msavi,df.category],axis=1)
df_satellite_gb = df_satellite.groupby('category')
df_satellite_gb.agg(['mean','std'])

| category | ndvi mean | ndvi std | ndwi mean | ndwi std | gndvi mean | gndvi std | savi mean | savi std | msavi mean | msavi std |
|---|---|---|---|---|---|---|---|---|---|---|
| Diseased | 0.568699 | 1.460792 | 10.042921 | 2.67773 | 0.254835 | 0.694027 | 0.839677 | 2.176318 | 3878.101041 | 1390.030283 |
| Healthy | 0.682831 | 1.637722 | 10.215223 | 2.771421 | 0.295578 | 0.899111 | 1.015285 | 2.448178 | 3845.65839 | 1505.893835 |
| Pests | 0.72541 | 1.8005 | 10.218539 | 2.6466 | 0.257164 | 0.553859 | 1.085837 | 2.700968 | 3879.532629 | 1503.934969 |
| Stressed | 0.912816 | 2.26162 | 10.440244 | 2.919329 | 0.362572 | 1.271307 | 1.345018 | 3.316586 | 3763.6025 | 1556.208114 |


Next, the dataset is reduced to the features that may be useful for the intended purpose.

The following features are removed from the dataset:
- FarmID
- State
- District
- Sub-District
- HDate
- CNext
- ExpYield
- geometry
- CHeight
- evi (Problematic feature)

In [228]:
df.drop(columns=['FarmID','State','District','Sub-District','HDate','CNext','ExpYield','geometry','CHeight','evi'],inplace=True)

The SDate column is reduced to its month:

In [229]:
df['SMonth'] = df['SDate'].map(lambda x:x[-7:-5])
df.drop(columns='SDate',inplace=True)
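
The slice above relies on the month sitting at a fixed character position. As a hedged alternative, the month could be extracted by parsing the date explicitly. This is a minimal sketch that assumes day-month-year strings such as `25-11-2023`; the format, the toy values and the `toy` frame are assumptions, not taken from the dataset:

```python
import pandas as pd

# Hypothetical sowing dates; the day-month-year layout is an assumption.
toy = pd.DataFrame({'SDate': ['25-11-2023', '03-03-2024']})

# Parse with an explicit format so that a malformed date raises an error
# instead of silently producing a wrong slice.
toy['SMonth'] = pd.to_datetime(toy['SDate'], format='%d-%m-%Y').dt.month

s_months = toy['SMonth'].tolist()
print(s_months)
```

Note that this yields integer months, whereas the slice used in the notebook keeps the month as a two-character string.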

The category column is converted to numerical through a fixed encoding:

In [230]:
category_encoding = {'Healthy':0,
                     'Diseased':1,
                     'Pests':2,
                     'Stressed':3}

df['y'] = df.category.map(category_encoding)
df.drop(columns=['category'],inplace=True)

The categorical variables are converted to numerical values through label encoding. This is chosen for simplicity, but note that label encoding is better suited to ordinal variables, which these are not.

In [231]:
label_encoder1 = LabelEncoder()
label_encoder2 = LabelEncoder()
label_encoder3 = LabelEncoder()
label_encoder4 = LabelEncoder()
label_encoder5 = LabelEncoder()
label_encoder6 = LabelEncoder()
df['crop'] = label_encoder1.fit_transform(df.Crop.values)
df['clast'] = label_encoder2.fit_transform(df.CLast.values)
df['ctransp'] = label_encoder3.fit_transform(df.CTransp.values)
df['irritype'] = label_encoder4.fit_transform(df.IrriType.values)
df['irrisource'] = label_encoder5.fit_transform(df.IrriSource.values)
df['season'] = label_encoder6.fit_transform(df.Season.values)
df.drop(columns=['Crop','CLast','CTransp', 'IrriType','IrriSource','Season'],inplace=True)
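
Since label encoding implies an order that these variables do not have, one-hot encoding is a common alternative. The following is only a sketch on made-up values (the `toy` frame and its levels are hypothetical), not the approach used in this notebook:

```python
import pandas as pd

# Hypothetical values; the real Crop and IrriType levels are not shown here.
toy = pd.DataFrame({
    'Crop': ['Wheat', 'Rice', 'Wheat'],
    'IrriType': ['Drip', 'Flood', 'Flood'],
})

# One indicator column per level, so no artificial order is implied.
one_hot = pd.get_dummies(toy, columns=['Crop', 'IrriType'], dtype=int)
print(one_hot.columns.tolist())
```

The trade-off is a wider feature matrix, which matters for distance-based models such as k-nearest neighbors.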

The numerical variables, in turn, are standardized as follows:

In [232]:
scaler1 = StandardScaler()
scaler2 = StandardScaler()
scaler3 = StandardScaler()
scaler4 = StandardScaler()
scaler5 = StandardScaler()
scaler6 = StandardScaler()
scaler7 = StandardScaler()

df['crop_covered_area'] = scaler1.fit_transform(pd.DataFrame(df.CropCoveredArea))
df['water_cov'] = scaler2.fit_transform(pd.DataFrame(df.WaterCov))
df['ndvi'] = scaler3.fit_transform(pd.DataFrame(df.ndvi))
df['ndwi'] = scaler4.fit_transform(pd.DataFrame(df.ndwi))
df['gndvi'] = scaler5.fit_transform(pd.DataFrame(df.gndvi))
df['savi'] = scaler6.fit_transform(pd.DataFrame(df.savi))
df['msavi'] = scaler7.fit_transform(pd.DataFrame(df.msavi))


df.drop(columns=['CropCoveredArea','WaterCov'],inplace=True)
df.sort_index(axis='columns',inplace=True)

## Test Data Preparation

This step prepares the provided test data so that it can be evaluated by the models built later:

In [233]:
df_test = pd.read_csv('../data/Test.csv')
#Satellite data
df_test_encoded = pd.read_csv('../data/test_encoded.csv')
df_test = pd.concat([df_test,df_test_encoded.ndvi,df_test_encoded.evi,df_test_encoded.ndwi,df_test_encoded.gndvi,
                   df_test_encoded.savi,df_test_encoded.msavi],axis=1)

Records with missing values in the test data cannot simply be dropped, as was done for the training data, because every record must be present for a submission to the platform to be valid.

A different approach is therefore needed: missing values are filled with the variable's mean.

In [234]:
for column in ['ndvi','evi','ndwi','gndvi','savi','msavi']:
    # Assign the result back: fillna(inplace=True) on a selected column may not update df_test
    df_test[column] = df_test[column].fillna(df_test[column].mean())
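
The mean imputation described above can be illustrated on a toy column. The sketch below uses made-up values and hypothetical frame names (`train_toy`, `test_toy`); the second variant, which fills with the training mean, is an assumption shown for comparison and not the choice made in this notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test frames (values are made up).
train_toy = pd.DataFrame({'ndvi': [0.2, 0.4, 0.6]})
test_toy = pd.DataFrame({'ndvi': [0.1, np.nan, 0.5]})

# Variant 1: fill with the test column's own mean, as described above.
filled_test_mean = test_toy['ndvi'].fillna(test_toy['ndvi'].mean())

# Variant 2 (an assumption): fill with the mean learned from the training data,
# so that no statistic is computed from the test set.
filled_train_mean = test_toy['ndvi'].fillna(train_toy['ndvi'].mean())

print(filled_test_mean.tolist())
print(filled_train_mean.tolist())
```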

Now the remaining transformations are applied:

In [235]:
# Not used columns
df_test.drop(columns = ['FarmID','State','District','Sub-District','HDate','CNext',
                      'ExpYield','geometry','CHeight','evi'],inplace=True)
# Date to Month
df_test['SMonth'] = df_test['SDate'].map(lambda x:x[-7:-5])
df_test.drop(columns = 'SDate',inplace = True)
# Label Encoding
df_test['crop'] = label_encoder1.transform(df_test.Crop.values)
df_test['clast'] = label_encoder2.transform(df_test.CLast.values)
df_test['ctransp'] = label_encoder3.transform(df_test.CTransp.values)
df_test['irritype'] = label_encoder4.transform(df_test.IrriType.values)
df_test['irrisource'] = label_encoder5.transform(df_test.IrriSource.values)
df_test['season'] = label_encoder6.transform(df_test.Season.values)
df_test.drop(columns = ['Crop','CLast','CTransp', 'IrriType','IrriSource','Season'],inplace=True)
# Scaling
df_test['crop_covered_area'] = scaler1.transform(pd.DataFrame(df_test.CropCoveredArea))
df_test['water_cov'] = scaler2.transform(pd.DataFrame(df_test.WaterCov))
# Reuse the scalers fitted on the training data (transform, not fit_transform)
df_test['ndvi'] = scaler3.transform(pd.DataFrame(df_test.ndvi))
df_test['ndwi'] = scaler4.transform(pd.DataFrame(df_test.ndwi))
df_test['gndvi'] = scaler5.transform(pd.DataFrame(df_test.gndvi))
df_test['savi'] = scaler6.transform(pd.DataFrame(df_test.savi))
df_test['msavi'] = scaler7.transform(pd.DataFrame(df_test.msavi))
df_test.drop(columns = ['CropCoveredArea','WaterCov'],inplace=True)
# Sorting columns
df_test.sort_index(axis='columns',inplace=True)
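
For the model to see training and test features on the same scale, the scaling statistics are normally learned from the training data only and then reused on the test data. A minimal sketch of that pattern, on made-up numbers and with a single scaler for several columns (the frame names and values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the training and test frames (values are made up).
train_toy = pd.DataFrame({'ndvi': [0.0, 1.0, 2.0], 'savi': [10.0, 20.0, 30.0]})
test_toy = pd.DataFrame({'ndvi': [1.0, 3.0], 'savi': [20.0, 40.0]})

cols = ['ndvi', 'savi']
scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data...
train_scaled = pd.DataFrame(scaler.fit_transform(train_toy[cols]), columns=cols)
# ...and transform reuses those same statistics on the test data.
test_scaled = pd.DataFrame(scaler.transform(test_toy[cols]), columns=cols)

print(test_scaled)
```

A test value equal to the training mean maps to zero here, which would not be guaranteed if the scaler were refitted on the test data.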

## Training

In [236]:
y_train=df.y
X_train=df.drop(columns='y')
X_test=df_test

### KNeighbors Algorithm

Because of the *curse of dimensionality* only three variables are used with this algorithm:

In [282]:
# Only the ndvi, savi and water_cov variables are kept
X_train_knn = pd.concat([X_train.ndvi,X_train.savi,X_train.water_cov],axis=1)


# Hold out 20% of the original training data for validation
X_train_knn,X_test_knn,y_train_knn,y_test_knn = train_test_split(X_train_knn,y_train,test_size=0.2,
                                                               random_state=42, shuffle=True)

for n in range(1,6):
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train_knn,y_train_knn)
    y_pred_knn = knn.predict(X_test_knn)
    f1_score_knn = f1_score(y_test_knn,y_pred_knn,average="weighted")
    print('The weighted average f1-score for',n,'neighbors is',f1_score_knn)

The weighted average f1-score for 1 neighbors is 0.69600555973801
The weighted average f1-score for 2 neighbors is 0.7326263204638507
The weighted average f1-score for 3 neighbors is 0.7378557350508214
The weighted average f1-score for 4 neighbors is 0.7315497662425016
The weighted average f1-score for 5 neighbors is 0.7304993757802747


From the scores above, the best number of neighbors is 3.
(Different variable combinations were also tested, although they are not shown here.)
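
A single hold-out split can make the choice of the number of neighbors sensitive to how the data happened to be divided. As a hedged alternative, the same comparison could be run with cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the three selected features and four classes, so its scores say nothing about the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 3 features, 4 classes (an assumption for illustration).
X_toy, y_toy = make_classification(n_samples=300, n_features=3, n_informative=3,
                                   n_redundant=0, n_classes=4,
                                   n_clusters_per_class=1, random_state=42)

cv_scores = {}
for k in range(1, 6):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated weighted F1, averaged over the folds
    scores = cross_val_score(knn_k, X_toy, y_toy, cv=5, scoring='f1_weighted')
    cv_scores[k] = scores.mean()

# Keep the k with the highest mean score
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores)
```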

Now we train the model on the full training data and use it to predict the actual test data:

In [285]:
X_train_knn = pd.concat([X_train.ndvi,X_train.savi,X_train.water_cov],axis=1)
X_test_knn = pd.concat([X_test.ndvi,X_test.savi,X_test.water_cov],axis=1)

# Use the best value found above; after the loop, n would still hold 5
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_knn,y_train)
y_pred_knn = knn.predict(X_test_knn)

## Submissions

In [298]:
def create_submission(submission_number,y_pred):
    '''
    Creates a file ready to submit to the ZINDI platform
    '''
    submission = pd.read_csv('../data/Test.csv') # Gets the original test file which has the Farm IDs
    decoded_categories = []
    for i in y_pred:
        if i==0: decoded_categories.append('Healthy')
        elif i==1: decoded_categories.append('Diseased')
        elif i==2: decoded_categories.append('Pests')
        elif i==3: decoded_categories.append('Stressed')
    submission = pd.concat([submission.FarmID,pd.Series(decoded_categories)],axis=1)
    submission.columns = ['ID','Target']
    submission.to_csv('../data/submissions/submission'+str(submission_number)+'.csv',index=False)
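
The decoding step inside `create_submission` could also be derived from the `category_encoding` dictionary defined earlier, so that both directions share one source. A minimal sketch, with hypothetical predictions in `y_pred_toy`:

```python
import pandas as pd

category_encoding = {'Healthy': 0, 'Diseased': 1, 'Pests': 2, 'Stressed': 3}
# Invert the encoding so that both directions come from one dictionary.
category_decoding = {code: name for name, code in category_encoding.items()}

y_pred_toy = [0, 3, 2, 1]  # hypothetical predictions
decoded = pd.Series(y_pred_toy).map(category_decoding)

# An unknown code would become NaN here, which makes a mismatch easy to detect.
assert decoded.notna().all()
print(decoded.tolist())
```

Unlike the `if`/`elif` chain, which skips an unexpected value and shortens the output, the mapped series always keeps one entry per prediction.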

Submission #2

In [299]:
create_submission(2,y_pred_knn)