
**Supervised Learning Project**

This project develops supervised machine learning models on a dataset collected from the readings of a multi-spectral imaging sensor
installed on a drone that is intended to map a specific geographical area.

In [108]:
#Load libraries
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from numpy import mean, std
from scipy.stats import sem
from matplotlib import pyplot
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
import random
from sklearn.model_selection import KFold
from sklearn import preprocessing
from sklearn.metrics import make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

In [72]:
#Connect to GitHub for faster access
!git clone https://github.com/melihkurtaran/MachineLearning.git

fatal: destination path 'MachineLearning' already exists and is not an empty directory.


In [73]:
# CSV to DataFrame 
ds_09 = pd.read_csv("MachineLearning/SupervisedLearning/ds_09.csv")
ds_09.head()

Unnamed: 0,V01,V02,V03,V04,V05,V06,V07,V08,V09,V10,...,V29,V30,V31,V32,V33,V34,V35,V36,class,target
0,14.65,,46.46,15.74,49.84,17.95,21.59,60.14,2.85,8.99,...,59.1,13.22,0.0,11.56,3.7,17.89,13.76,15.75,4,0.5272
1,13.97,,19.38,5.99,21.44,31.61,45.76,26.47,2.52,26.66,...,32.93,12.02,14.3,17.82,3.15,11.77,14.46,16.38,1,0.4937
2,12.14,53.27,62.25,11.42,28.51,33.03,42.41,52.27,4.68,25.59,...,25.71,12.5,23.18,14.6,4.08,5.87,6.52,14.25,2,0.5796
3,8.29,16.06,15.41,6.97,14.81,16.53,29.76,36.73,3.14,20.58,...,31.74,9.67,8.43,21.87,6.03,10.2,12.54,7.82,5,0.4098
4,10.02,47.28,45.67,10.43,13.07,,35.22,19.96,3.34,35.5,...,,12.2,18.88,23.72,3.16,15.35,8.81,14.93,1,0.5465


The dataset has 36 features (V01 to V36) and two label columns: `class` for the classification tasks and `target` for the regression task.

# **T1 - Dataset preparation**

The dataset needs to be preprocessed before it is used to train models.

## **(a) Removing missing values and outliers**

Each sample has 38 columns (36 features plus `class` and `target`). Since `class` and `target` are never missing, `thresh` needs to be set to 34 to drop samples with more than 4 missing feature values.

In [74]:
# samples with more than 4 missing feature values are dropped
print("Size before dropping: " + str(len(ds_09)))
ds_09 = ds_09.dropna(axis=0, thresh=34) # thresh: Require that many non-NA values
print("Size after dropping: " + str(len(ds_09)))

Size before dropping: 1000
Size after dropping: 967
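
To see how `thresh` behaves, here is a minimal sketch on a toy frame. The column names and values are made up for illustration. `thresh` counts the non-missing values across all columns of a row, so the required count is the number of columns minus the number of missing values tolerated.

```python
import numpy as np
import pandas as pd

# Toy frame: 4 hypothetical feature columns plus a label column that is never missing
toy = pd.DataFrame({
    "f1": [1.0, np.nan, np.nan],
    "f2": [2.0, 2.5, np.nan],
    "f3": [3.0, 3.5, np.nan],
    "f4": [4.0, 4.5, 4.8],
    "label": [0, 1, 1],
})

max_missing = 1                        # tolerate at most one missing value per row
thresh = toy.shape[1] - max_missing    # 5 columns - 1 = 4 non-NA values required
kept = toy.dropna(axis=0, thresh=thresh)

# Row 0 has 5 non-NA values, row 1 has 4, row 2 has only 2
print(kept.index.tolist())
```

The same arithmetic gives 38 - 4 = 34 for the real dataset.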


Filling remaining null values with the mean

In [75]:
# the remaining missing values are filled using the average value

for col in ds_09.columns[ds_09.isnull().any(axis=0)]:  # only the columns that contain NaN values, for better performance
    ds_09[col] = ds_09[col].fillna(ds_09[col].mean())  # assign back instead of a chained inplace fillna

In [76]:
# We can see that we do not have any missing values anymore
ds_09.isnull().values.any()

False

Removing outliers

In [77]:
# samples with at least one value whose absolute z-score is 3 or more (i.e. an outlier) are discarded
print("Size before removing outlier samples: " + str(len(ds_09)))
# the code below computes the z-score of every value column by column (class and target included) and removes all rows that have an outlier in at least one column
ds_09 = ds_09[(np.abs(stats.zscore(ds_09)) < 3).all(axis=1)] # axis=1 ensures that, for each row, all columns satisfy the constraint
print("Size after removing outlier samples: " + str(len(ds_09)))

Size before removing outlier samples: 967
Size after removing outlier samples: 869
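
The cell above computes z-scores over every column of `ds_09`, including `class` and `target`. If only the feature columns should be screened, one possible variant is sketched below on synthetic data. The frame `demo`, its column names, and the injected outlier are assumptions made for illustration and are not part of the assignment data.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 hypothetical features and an integer class label
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["V01", "V02", "V03"])
demo["class"] = rng.integers(0, 3, size=200)
demo.loc[0, "V02"] = 25.0   # inject one obvious outlier in the first row

# Screen only the feature columns, leaving the label out of the z-score test
feature_cols = [c for c in demo.columns if c != "class"]
z = np.abs(stats.zscore(demo[feature_cols].to_numpy(), axis=0))

# Keep a row only if every feature has |z| < 3
mask = (z < 3).all(axis=1)
cleaned = demo[mask]
print(len(demo), "->", len(cleaned))
```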


## **(b) Dimensionality Reduction**

Keep only features that account for up to 95% of the variance of the data.
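
One common way to apply a variance threshold like this is PCA with a fractional `n_components`. The sketch below runs on synthetic data, and the array `X_demo` and its shape are made up. Two caveats: PCA returns new components rather than a subset of the original features, and it picks the smallest number of components whose cumulative explained variance exceeds the threshold. It is usually applied to standardized data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# Standardize first so that no feature dominates because of its scale
X_scaled = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) asks PCA to keep enough components to explain that share of variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", X_reduced.shape[1])
print("variance explained:", pca.explained_variance_ratio_.sum())
```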

## **(c) Standardization**

Mu-sigma standardization (subtract the mean, then divide by the standard deviation) is used to normalize the features.
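
As a quick check of what mu-sigma standardization does, the sketch below compares a manual computation with `StandardScaler` on a tiny made-up array (`X_demo` is illustrative only). `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)          # ddof=0, the convention StandardScaler follows
manual = (X_demo - mu) / sigma      # z = (x - mu) / sigma, column by column

scaled = StandardScaler().fit_transform(X_demo)
print(np.allclose(manual, scaled))
```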

In [78]:
X = ds_09.copy()
X.drop(['class', 'target'], axis=1, inplace=True)
y = ds_09['class']

In [79]:
# define mu-sigma standardizer scaler
ss = StandardScaler()
  
# transform data
X = pd.DataFrame(ss.fit_transform(X),columns = X.columns)
X.head()

Unnamed: 0,V01,V02,V03,V04,V05,V06,V07,V08,V09,V10,...,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36
0,1.214996,-0.054054,1.152872,2.357207,2.216477,-1.496285,-1.158382,2.376606,-0.895622,-2.346782,...,-0.316122,-1.53932,2.279143,0.406997,-3.20781,-0.729132,0.408947,0.383839,1.212194,-0.099109
1,0.986969,-0.054054,-1.939001,-1.175792,-0.726378,-0.047792,1.256262,-0.882371,-1.204081,-0.161146,...,-1.312862,0.261232,-0.129392,0.010242,-0.254762,0.577851,-0.075171,-0.88251,1.432997,0.053858
2,0.373306,0.726206,2.955703,0.791817,0.006227,0.102783,0.921588,1.614855,0.814919,-0.293497,...,1.431467,-1.003106,-0.793879,0.168944,1.579018,-0.094431,0.743429,-2.103338,-1.071547,-0.463317
3,-0.917733,-2.285514,-2.392278,-0.82068,-1.41339,-1.64686,-0.342179,0.110712,-0.624553,-0.913193,...,-1.802451,-0.385048,-0.238912,-0.766735,-1.466957,1.423423,2.459851,-1.207375,0.827364,-2.024555
4,-0.337604,0.241384,1.062673,0.433082,-1.593692,-0.044691,0.203289,-1.512486,-0.437608,0.93229,...,1.481963,-0.760398,0.015906,0.069755,0.691039,1.809672,-0.066369,-0.141737,-0.349203,-0.298209


## **(d) Calculate IR**

The Imbalance Ratio (IR) is the ratio between the number of samples from the majority class and the number of samples from the minority class.

In [80]:
ds_09['class'].value_counts() #Observe majority and minority class

1    230
4    224
5    148
0    132
3     68
2     67
Name: class, dtype: int64

In [81]:
IR = ds_09['class'].value_counts().max() / ds_09['class'].value_counts().min()
print('Imbalance Ratio: ' + str(IR))

Imbalance Ratio: 3.4328358208955225


# **T2 - Classifier design (I)**

Four different classifiers are defined in this task.

## **(a) Find the best configuration for each model**

The best parameters are found with GridSearchCV.

Since the data is imbalanced, the F1 score is a better guide than accuracy, so all optimization is based on the weighted F1 score.
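
To make "weighted F1" concrete, the sketch below reproduces `average='weighted'` by hand on a small made-up label vector: each class gets its own F1 score, and the scores are averaged using the number of true samples per class as weights.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up imbalanced labels: class 0 has 6 samples, classes 1 and 2 have 2 each
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 2, 1])

per_class = f1_score(y_true, y_pred, average=None)   # one F1 score per class
support = np.bincount(y_true)                         # true samples per class

# Support-weighted mean of the per-class F1 scores
manual = np.sum(per_class * support) / support.sum()
weighted = f1_score(y_true, y_pred, average="weighted")

print(per_class, manual, weighted)
```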

### **1) Gaussian Naive Bayes (GNB)**

In [82]:
from sklearn.naive_bayes import GaussianNB
model_gnb = GaussianNB()

### **2) Logistic Regression (LR)**

In [83]:
from sklearn.linear_model import LogisticRegression
model_lr = LogisticRegression(solver='saga', max_iter=5000, tol=0.001)
# note: recent scikit-learn releases expect penalty=None instead of the string 'none'
param_lr = [{'penalty': ['l1','l2','none'],'C': [0.1, 1, 10]}]

gs_lr = GridSearchCV(model_lr,
                      param_grid=param_lr,
                      scoring='f1_weighted',
                      cv=5)
gs_lr.fit(X, y)
gs_lr.best_params_



{'C': 0.1, 'penalty': 'l2'}

Best C value is 0.1 and best penalty value is l2.

In [84]:
model_lr = LogisticRegression(solver='saga', max_iter=5000, tol=0.001, C=0.1, penalty='l2')

### **3) Decision Tree (DT)**

In [85]:
from sklearn import tree
model_dt = tree.DecisionTreeClassifier()
param_dt = [{'criterion': ['gini','entropy'],'max_depth': [3, 5, None],'min_samples_leaf': [1, 5, 10]}]

gs_dt = GridSearchCV(model_dt,
                      param_grid=param_dt,
                      scoring='f1_weighted',
                      cv=5)
gs_dt.fit(X, y)
gs_dt.best_params_

{'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1}

The best criterion is entropy, `max_depth` is best left at its default of None, and the best `min_samples_leaf` is 1.

In [93]:
model_dt = tree.DecisionTreeClassifier(criterion='entropy', max_depth=None, min_samples_leaf=1)
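
As an alternative to copying the best parameters by hand, `GridSearchCV` (with its default `refit=True`) exposes the refitted winner as `best_estimator_`, and `sklearn.base.clone` returns an unfitted copy with the same hyperparameters. The sketch below uses synthetic data from `make_classification`, and the names `X_demo`, `y_demo` and `search` are illustrative.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real data
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

param_grid = {"criterion": ["gini", "entropy"],
              "max_depth": [3, 5, None],
              "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid=param_grid,
                      scoring="f1_weighted",
                      cv=5)
search.fit(X_demo, y_demo)

best_fitted = search.best_estimator_   # already refitted on all of X_demo
fresh_copy = clone(best_fitted)        # same hyperparameters, not fitted yet
print(search.best_params_)
```

The unfitted copy is the one to pass to a later cross-validation, since `cross_validate` fits its own clones on each fold.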

### **4) Support Vector Machine (SVM)**

In [114]:
from sklearn import svm
model_svm = svm.SVC(gamma='scale', probability=True, max_iter=5000, tol=0.05)
param_svm = [{'kernel': ['linear','poly','rbf'],'C': [0.1, 1, 10]}]

gs_svm = GridSearchCV(model_svm,
                      param_grid=param_svm,
                      scoring='f1_weighted',
                      cv=5)
gs_svm.fit(X, y)
gs_svm.best_params_



{'C': 10, 'kernel': 'rbf'}

The best C value is 10 and the best kernel is rbf.

In [115]:
model_svm = svm.SVC(gamma='scale', probability=True, max_iter=5000, tol=0.05, C=10, kernel='rbf')

## **(b) Estimate the accuracy, precision, recall and F1 scores**

In [117]:
# specifying the evaluation metrics
scoring = {'acc': 'accuracy',
           'pre': make_scorer(precision_score, average='weighted'),
           'rec': make_scorer(recall_score, average='weighted'),
           'f1': make_scorer(f1_score, average='weighted')}

# creating the cross-validation using 3-repetition, 5-fold 
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3)

models = [model_gnb,model_lr,model_dt,model_svm]

for m in models: # for each classifier
  # evaluate the classifier using 3-repetition, 5-fold cross-validation
  scores = cross_validate(m, X, y, scoring=scoring, cv=cv, n_jobs=-1)

  print('MODEL:' + str(m)) # print which model is being evaluated

  # print the mean and twice the standard deviation of the evaluation scores
  for metric in scoring:
    mean_score = np.mean(scores['test_' + metric])  # renamed so the imported numpy mean/std are not shadowed
    std_score = np.std(scores['test_' + metric])
    print(f'{metric}: {mean_score:.4f} (+/- {std_score * 2:.4f})')
  print()
 


MODEL:GaussianNB()
acc: 0.5243 (+/- 0.0562)
pre: 0.5665 (+/- 0.0745)
rec: 0.5243 (+/- 0.0562)
f1: 0.5207 (+/- 0.0528)

MODEL:LogisticRegression(C=0.1, max_iter=5000, solver='saga', tol=0.001)
acc: 0.5922 (+/- 0.0658)
pre: 0.6093 (+/- 0.0661)
rec: 0.5922 (+/- 0.0658)
f1: 0.5896 (+/- 0.0657)

MODEL:DecisionTreeClassifier(criterion='entropy')
acc: 0.5389 (+/- 0.0691)
pre: 0.5445 (+/- 0.0731)
rec: 0.5389 (+/- 0.0691)
f1: 0.5377 (+/- 0.0701)

MODEL:SVC(C=10, max_iter=5000, probability=True, tol=0.05)
acc: 0.7614 (+/- 0.0769)
pre: 0.7698 (+/- 0.0714)
rec: 0.7614 (+/- 0.0769)
f1: 0.7609 (+/- 0.0776)
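
In the output above, `rec` matches `acc` for every model. This is expected rather than a coincidence: support-weighted recall sums, over the classes, (class share) x (correct in class / class size), which simplifies to the total number of correct predictions divided by the number of samples. A small check on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)

# Made-up 6-class labels; roughly 60% of predictions are copied from the truth
y_true = rng.integers(0, 6, size=500)
y_pred = np.where(rng.random(500) < 0.6, y_true, rng.integers(0, 6, size=500))

acc = accuracy_score(y_true, y_pred)
rec_w = recall_score(y_true, y_pred, average="weighted")
print(acc, rec_w)
```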



## **(c) Rank the models**

In imbalanced datasets, where the minority class is under-represented, it is important to choose an appropriate evaluation metric. The F1 score is commonly used for such datasets because it is the harmonic mean of precision and recall, so it penalizes both false positives and false negatives. Therefore, the four models are ranked by their F1 scores.

Ranking

1.   Support Vector Machine with F1 score around **0.76**
2.   Logistic Regression with F1 score around **0.59**
3.   Decision Tree with F1 score around **0.54**
4.   Gaussian Naive Bayes with F1 score around **0.52**
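
The ranking can also be derived programmatically. The sketch below sorts the mean weighted F1 scores copied from the cross-validation output above. The dictionary `f1_means` is an illustrative helper and is not part of the original notebook.

```python
# Mean weighted F1 per model, copied from the cross-validation output
f1_means = {"GNB": 0.5207, "LR": 0.5896, "DT": 0.5377, "SVM": 0.7609}

# Sort from best to worst score
ranking = sorted(f1_means.items(), key=lambda kv: kv[1], reverse=True)

for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score:.2f}")
```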





# **T3 - Classifier design (II)**

# **T4 - Regression**

# **T5 - Model exploitation**

In [90]:
df_t5 = pd.read_csv('MachineLearning/SupervisedLearning/im_x_09.txt', sep=" ", header=None, index_col=False)
df_t5_y = pd.read_csv('MachineLearning/SupervisedLearning/im_y_09.txt', sep=" ", header=None, index_col=False)
df_t5_t = pd.read_csv('MachineLearning/SupervisedLearning/im_t_09.txt', sep=" ", header=None, index_col=False)
df_t5['class'] = df_t5_y
df_t5['target'] = df_t5_t

df_t5.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28,29,30,31,32,33,34,35,class,target
0,10.56,53.9,28.49,9.56,35.65,38.17,45.36,22.96,3.43,27.41,...,27.15,15.54,18.56,19.07,2.82,14.04,14.95,19.99,0,0.557017
1,11.94,34.57,45.58,15.47,39.31,31.83,32.69,49.95,2.83,39.21,...,48.16,13.69,7.2,17.95,3.33,6.18,10.37,13.16,0,0.59442
2,6.29,45.65,31.93,11.38,39.51,36.15,29.35,37.18,5.32,20.21,...,32.7,16.9,16.1,12.59,2.09,11.36,14.21,17.1,0,0.519508
3,10.34,52.78,26.24,12.49,29.88,39.67,38.91,36.1,2.06,35.34,...,26.78,13.29,13.92,15.35,2.51,15.46,10.57,16.88,0,0.536709
4,8.25,53.67,30.91,3.57,18.04,27.73,28.09,19.02,3.41,31.42,...,34.27,12.07,15.83,23.11,2.11,18.25,6.78,9.4,0,0.510596
