<a href="https://colab.research.google.com/github/melihkurtaran/MachineLearning/blob/main/SupervisedLearning/Supervised_Learning_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Supervised Learning Project**

In this project, a dataset collected from the readings of a multi-spectral imaging sensor, installed on a drone intended
to map a specific geographical area, is used to develop supervised machine learning models.

In [38]:
#Load libraries
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler

In [39]:
# Clone the GitHub repository so the data files can be read locally
!git clone https://github.com/melihkurtaran/MachineLearning.git

fatal: destination path 'MachineLearning' already exists and is not an empty directory.


In [40]:
# CSV to DataFrame 
ds_09 = pd.read_csv("MachineLearning/SupervisedLearning/ds_09.csv")
ds_09.head()

Unnamed: 0,V01,V02,V03,V04,V05,V06,V07,V08,V09,V10,...,V29,V30,V31,V32,V33,V34,V35,V36,class,target
0,14.65,,46.46,15.74,49.84,17.95,21.59,60.14,2.85,8.99,...,59.1,13.22,0.0,11.56,3.7,17.89,13.76,15.75,4,0.5272
1,13.97,,19.38,5.99,21.44,31.61,45.76,26.47,2.52,26.66,...,32.93,12.02,14.3,17.82,3.15,11.77,14.46,16.38,1,0.4937
2,12.14,53.27,62.25,11.42,28.51,33.03,42.41,52.27,4.68,25.59,...,25.71,12.5,23.18,14.6,4.08,5.87,6.52,14.25,2,0.5796
3,8.29,16.06,15.41,6.97,14.81,16.53,29.76,36.73,3.14,20.58,...,31.74,9.67,8.43,21.87,6.03,10.2,12.54,7.82,5,0.4098
4,10.02,47.28,45.67,10.43,13.07,,35.22,19.96,3.34,35.5,...,,12.2,18.88,23.72,3.16,15.35,8.81,14.93,1,0.5465


There are 36 features (V01 to V36) and 2 output columns: `class`, used in the classification tasks, and `target`, used in the regression task.

# **T1 - Dataset preparation**

The dataset needs to be preprocessed before it is used to train models.

## **(a) Removing missing values and outliers**

Each sample has 38 columns. The class and target columns are never missing, so thresh needs to be set to 34 (38 - 4) to drop samples with more than 4 missing feature values.

In [41]:
# samples with more than 4 missing feature values are dropped
print("Size before dropping: " + str(len(ds_09)))
ds_09 = ds_09.dropna(axis=0, thresh=34) # thresh: Require that many non-NA values
print("Size after dropping: " + str(len(ds_09)))

Size before dropping: 1000
Size after dropping: 967
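
The `thresh` argument counts non-missing values rather than missing ones, which is why it is derived as "number of columns minus tolerated gaps". The sketch below illustrates this on a small made-up frame; the names `toy`, `max_missing` and `kept` are hypothetical and are not part of the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 6 columns
toy = pd.DataFrame(
    [
        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],           # 0 missing values
        [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],     # 2 missing values
        [np.nan, np.nan, np.nan, 4.0, 5.0, 6.0],  # 3 missing values
    ],
    columns=list("abcdef"),
)

max_missing = 2                       # tolerate at most 2 missing values per row
thresh = toy.shape[1] - max_missing   # 6 - 2 = 4 non-NA values required
kept = toy.dropna(axis=0, thresh=thresh)

print(kept.index.tolist())  # rows 0 and 1 are kept, row 2 is dropped
```

Applied to this dataset, the same arithmetic gives 38 - 4 = 34.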


The remaining missing values are filled with the mean of their column.

In [42]:
# the remaining missing values are filled using the average value of the column

for i in ds_09.columns[ds_09.isnull().any(axis=0)]:  # only columns that contain NaN values, for better performance
    ds_09[i] = ds_09[i].fillna(ds_09[i].mean())  # assign back instead of relying on a chained inplace fill

In [43]:
# We can see that we do not have any missing values anymore
ds_09.isnull().values.any()

False

Removing outliers

In [44]:
# samples with at least one value whose absolute z-score is 3 or more (i.e. an outlier) are discarded
print("Size before removing outlier samples: " + str(len(ds_09)))
# z-scores are computed column by column (over every column of ds_09, including class and target);
# a row is kept only if all of its z-scores are below 3 in absolute value
ds_09 = ds_09[(np.abs(stats.zscore(ds_09)) < 3).all(axis=1)]  # axis=1: every column of the row must satisfy the constraint
print("Size after removing outlier samples: " + str(len(ds_09)))

Size before removing outlier samples: 967
Size after removing outlier samples: 869
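
The cell above computes z-scores over every column, including `class` and `target`. If the intent is to screen feature values only, one possible variant restricts the computation to the feature columns. This is a sketch on made-up data, not the method used to produce the sizes reported above; `toy`, `feature_cols` and `filtered` are hypothetical names.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical toy data: three features plus class and target columns
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["V01", "V02", "V03"])
toy["class"] = rng.integers(0, 3, size=200)
toy["target"] = rng.random(200)
toy.loc[5, "V02"] = 25.0  # planted outlier in one feature

# Compute z-scores on the feature columns only
feature_cols = [c for c in toy.columns if c not in ("class", "target")]
z = np.abs(stats.zscore(toy[feature_cols].to_numpy(), axis=0))

# Keep rows whose feature z-scores are all below 3 in absolute value
mask = (z < 3).all(axis=1)
filtered = toy.loc[mask]

print(len(toy), len(filtered))
```

Whether the output columns should take part in outlier screening is a design choice; excluding them avoids discarding samples only because their label or target value is rare.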


## **(b) Dimensionality Reduction**

Keep only the features that account for up to 95% of the variance of the data.
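
No code accompanies this step in the notebook. A minimal sketch follows, assuming that PCA (imported at the top of the notebook) is the intended technique. In scikit-learn, passing a fraction as `n_components` keeps the smallest number of principal components whose cumulative explained variance exceeds that fraction. The data here is synthetic, and the names `X_toy`, `X_std` and `X_reduced` are hypothetical. Because PCA is sensitive to feature scale, the sketch standardizes the data first, which is also an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 10 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# Standardize, then keep the components that explain at least 95% of the variance
X_std = StandardScaler().fit_transform(X_toy)
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_std)

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

Note that the retained columns are linear combinations of the original features, not a subset of them.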

## **(c) Standardization**

Mu-sigma standardization is used to normalize the features, so that each one has zero mean and unit variance.
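
For reference, this transformation can be written as follows, where $x_{ij}$ is the value of feature $j$ for sample $i$, and $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$ over the samples (the symbols are introduced here for illustration):

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
$$

scikit-learn's `StandardScaler` uses the population standard deviation (denominator $n$, not $n - 1$) for $\sigma_j$.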

In [48]:
X = ds_09.copy()
X.drop(['class', 'target'], axis=1, inplace=True)

In [49]:
# define mu-sigma standardizer scaler
ss = StandardScaler()
  
# transform data
X = pd.DataFrame(ss.fit_transform(X),columns = X.columns)
X.head()

Unnamed: 0,V01,V02,V03,V04,V05,V06,V07,V08,V09,V10,...,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36
0,1.214996,-0.054054,1.152872,2.357207,2.216477,-1.496285,-1.158382,2.376606,-0.895622,-2.346782,...,-0.316122,-1.53932,2.279143,0.406997,-3.20781,-0.729132,0.408947,0.383839,1.212194,-0.099109
1,0.986969,-0.054054,-1.939001,-1.175792,-0.726378,-0.047792,1.256262,-0.882371,-1.204081,-0.161146,...,-1.312862,0.261232,-0.129392,0.010242,-0.254762,0.577851,-0.075171,-0.88251,1.432997,0.053858
2,0.373306,0.726206,2.955703,0.791817,0.006227,0.102783,0.921588,1.614855,0.814919,-0.293497,...,1.431467,-1.003106,-0.793879,0.168944,1.579018,-0.094431,0.743429,-2.103338,-1.071547,-0.463317
3,-0.917733,-2.285514,-2.392278,-0.82068,-1.41339,-1.64686,-0.342179,0.110712,-0.624553,-0.913193,...,-1.802451,-0.385048,-0.238912,-0.766735,-1.466957,1.423423,2.459851,-1.207375,0.827364,-2.024555
4,-0.337604,0.241384,1.062673,0.433082,-1.593692,-0.044691,0.203289,-1.512486,-0.437608,0.93229,...,1.481963,-0.760398,0.015906,0.069755,0.691039,1.809672,-0.066369,-0.141737,-0.349203,-0.298209


## **(d) Calculate IR**

The Imbalance Ratio (IR) is the ratio between the number of samples in the majority class and the number of samples in the minority class.

In [53]:
ds_09['class'].value_counts() #Observe majority and minority class

1    230
4    224
5    148
0    132
3     68
2     67
Name: class, dtype: int64

In [56]:
IR = ds_09['class'].value_counts().max() / ds_09['class'].value_counts().min()
print('Imbalance Ratio: ' + str(IR))

Imbalance Ratio: 3.4328358208955225
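
As a check on the printed value, the counts above give class 1 as the majority class (230 samples) and class 2 as the minority class (67 samples):

$$
\mathrm{IR} = \frac{N_{\text{majority}}}{N_{\text{minority}}} = \frac{230}{67} \approx 3.43
$$

The six class counts also add up to 869, the number of samples left after outlier removal.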


# **T2 - Classifier design (I)**

# **T3 - Classifier design (II)**

# **T4 - Regression**

# **T5 - Model exploitation**

In [47]:
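# Load the T5 data: features (im_x), class labels (im_y) and regression targets (im_t)
df_t5 = pd.read_csv('MachineLearning/SupervisedLearning/im_x_09.txt', sep=" ", header=None, index_col=False)
df_t5_y = pd.read_csv('MachineLearning/SupervisedLearning/im_y_09.txt', sep=" ", header=None, index_col=False)
df_t5_t = pd.read_csv('MachineLearning/SupervisedLearning/im_t_09.txt', sep=" ", header=None, index_col=False)
# Attach the labels and targets (single-column frames) as columns of the feature frame
df_t5['class'] = df_t5_y.iloc[:, 0]
df_t5['target'] = df_t5_t.iloc[:, 0]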

df_t5.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,28,29,30,31,32,33,34,35,class,target
0,10.56,53.9,28.49,9.56,35.65,38.17,45.36,22.96,3.43,27.41,...,27.15,15.54,18.56,19.07,2.82,14.04,14.95,19.99,0,0.557017
1,11.94,34.57,45.58,15.47,39.31,31.83,32.69,49.95,2.83,39.21,...,48.16,13.69,7.2,17.95,3.33,6.18,10.37,13.16,0,0.59442
2,6.29,45.65,31.93,11.38,39.51,36.15,29.35,37.18,5.32,20.21,...,32.7,16.9,16.1,12.59,2.09,11.36,14.21,17.1,0,0.519508
3,10.34,52.78,26.24,12.49,29.88,39.67,38.91,36.1,2.06,35.34,...,26.78,13.29,13.92,15.35,2.51,15.46,10.57,16.88,0,0.536709
4,8.25,53.67,30.91,3.57,18.04,27.73,28.09,19.02,3.41,31.42,...,34.27,12.07,15.83,23.11,2.11,18.25,6.78,9.4,0,0.510596
