## <center><font color=blue>CITS5508 Labsheet 2</font></center>
 __Name: Anitha Raghupathy__ <br> 
 __Student Number: 22773933__ <br>

### Aim
To clean and transform the data, apply two different classification algorithms, and compare their results.

In [None]:
#libraries used in this exercise are listed below
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.metrics import classification_report

### Data Exploration
This data set contains training and testing data from a remote sensing study which mapped different forest types based on their spectral characteristics at visible-to-near infrared wavelengths, using ASTER satellite imagery. The output (forest type map) can be used to identify and/or quantify the ecosystem services (e.g. carbon storage, erosion protection) provided by the forest.

In [None]:
train = pd.read_csv('training.csv')
test = pd.read_csv('testing.csv')

In [None]:
train.shape

The training dataset consists of 325 rows and 28 columns.

In [None]:
test.shape

The testing dataset consists of 198 rows and 28 columns.

In [None]:
train.columns

The _target variable_ is 'class' and the other 27 columns are _independent variables_.
The column names are listed below.

In [None]:
train.info()

Each column has 325 non-null entries, i.e. there are __no missing values__ in the dataset. The datatype of 'class' is object, so it is a _categorical variable_; the other columns are _numerical variables_.
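The same check can be made explicitly rather than read off `info()`. The sketch below uses a small made-up frame (the `toy` values are hypothetical, not taken from `training.csv`) to show how missing values can be counted per column and how columns can be split by dtype.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; values are invented for illustration.
toy = pd.DataFrame({
    "class": ["A", "B", "C", "D"],
    "b1": [39.0, 84.0, np.nan, 53.0],
    "b2": [36.0, 30.0, 27.0, 31.0],
})

# Count missing entries per column; all zeros would mean no missing values.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Split columns into numerical and non-numerical (categorical) ones.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numerical_cols, categorical_cols)
```

On the real data, `train.isnull().sum().sum()` would be expected to return 0 if the observation above holds.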

In [None]:
train.describe()

Columns b1 to b9 contain only positive values and have similar ranges, whereas the remaining columns also contain negative values.

### Data Visualization

In [None]:
%matplotlib inline
train.hist(bins=50, figsize=(20,15))
plt.show()

The histograms reveal that the majority of the columns are approximately normally distributed. However, the data in a few columns are skewed.
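Skewness can also be quantified instead of judged by eye. This is a minimal sketch on made-up columns; the cut-off of 1 is an assumed rule of thumb, not a value taken from this labsheet.

```python
import pandas as pd

# Toy columns: one symmetric, one with a long right tail (invented values).
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 1.0, 2.0, 10.0],
})

# Sample skewness per column: about 0 for symmetric data, positive for a right tail.
skewness = toy.skew()

# Flag columns whose absolute skewness exceeds the assumed threshold.
skew_threshold = 1.0
skewed_cols = skewness[skewness.abs() > skew_threshold].index.tolist()
print(skewness)
print(skewed_cols)
```

Applied to the real data, `train.skew(numeric_only=True)` would give one value per numerical column.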

In [None]:
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(train)

The pair plot shows the pairwise relationships between the columns, with scatter plots off the diagonal and histograms on it.
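A correlation matrix gives a numeric counterpart to the scatter plots. The sketch below uses invented columns and an assumed threshold of 0.9 to list strongly correlated pairs.

```python
import pandas as pd

# Toy columns (invented): b2 is exactly 2 * b1, b3 is only loosely related.
toy = pd.DataFrame({
    "b1": [1.0, 2.0, 3.0, 4.0],
    "b2": [2.0, 4.0, 6.0, 8.0],
    "b3": [1.0, 3.0, 2.0, 4.0],
})

# Pearson correlation between every pair of columns.
corr = toy.corr()

# Collect each unordered pair once if its absolute correlation is above the threshold.
threshold = 0.9
cols = list(corr.columns)
strong_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(strong_pairs)
```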

### Data Cleaning

In [None]:
X_train = train.drop("class", axis=1)
y_train = train["class"].copy()
X_test = test.drop("class", axis=1)
y_test = test["class"].copy()

The training dataset is divided into X_train and y_train. X_train holds the independent (feature) columns and y_train is the target column of the training data. Similarly, X_test holds the features of the testing data and y_test is its target column.

In [None]:
def AttributeRemover(df):
    """Drop, in place, every column whose name starts with 'pred_minus_obs'."""
    cols_to_drop = df.filter(regex='^pred_minus_obs').columns
    df.drop(cols_to_drop, axis=1, inplace=True)

AttributeRemover(X_train)
AttributeRemover(X_test)

For convenience, we drop all the columns starting with 'pred_minus_obs'.
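As an alternative sketch, the same column selection can be done without modifying the frame in place. The column names below are placeholders chosen to resemble the prefix mentioned above; they are not copied from the dataset.

```python
import pandas as pd

# Toy frame with two band columns and two prefixed columns (names are placeholders).
toy = pd.DataFrame({
    "b1": [1, 2],
    "b2": [3, 4],
    "pred_minus_obs_x": [0.1, 0.2],
    "pred_minus_obs_y": [0.3, 0.4],
})

# `filter(regex=...)` selects matching column names; `^` anchors the match at the start.
prefixed_cols = toy.filter(regex=r"^pred_minus_obs").columns

# `drop(columns=...)` returns a new frame and leaves `toy` unchanged.
reduced = toy.drop(columns=prefixed_cols)
print(list(reduced.columns))
```

Returning a new frame avoids surprises when the same DataFrame is reused later in the notebook.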

### Data Transformation

In [None]:
scaler = StandardScaler()
# Fit the scaler on the training data only, then reuse its statistics for the test data
X_train_scaled = scaler.fit_transform(X_train.values)
X_test_scaled = scaler.transform(X_test.values)

In [None]:
# Wrap the scaled arrays (not the scaler object) back into DataFrames
X_train_scaled_df = pd.DataFrame(X_train_scaled, index=X_train.index, columns=X_train.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, index=X_test.index, columns=X_test.columns)

Both the training and test datasets are scaled using standardization (zero mean and unit variance per feature).
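A common convention is to learn the scaling statistics from the training data only and apply them unchanged to the test data, so that no information from the test set leaks into preprocessing. The sketch below illustrates this on tiny made-up arrays (`X_tr` and `X_te` are hypothetical names).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny invented arrays standing in for the training and test features.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)  # learns mean and std from the training data
X_te_s = scaler.transform(X_te)      # reuses the training mean and std

# The test row is expressed relative to the training distribution.
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(X_te_s, manual)
```

Calling `fit_transform` on the test set instead would rescale it with its own mean and standard deviation, so the two sets would no longer share a common scale.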

In [None]:
y_train.value_counts()

The training set is unbalanced. The first two classes have over 100 records, whereas the other two have under 50 records.
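Class balance can be summarized as proportions and as a single ratio. This sketch uses placeholder labels and invented counts, not the counts from the dataset.

```python
import pandas as pd

# Placeholder labels with deliberately uneven counts (invented).
labels = pd.Series(["A"] * 6 + ["B"] * 5 + ["C"] * 2 + ["D"] * 1)

counts = labels.value_counts()                # absolute counts per class
shares = labels.value_counts(normalize=True)  # fraction of records per class

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
print(shares)
print(imbalance_ratio)
```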

In [None]:
y_test.value_counts()

The test data is well balanced. All classes have around 50 observations.

### Data Modelling
#### Stochastic Gradient Descent Classifier

In [None]:
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)

The SGD classifier is trained with a one-versus-one (OvO) strategy on the 4 class labels. Note that it is fitted on X_train, not on the scaled copy created above.

In [None]:
len(ovo_clf.estimators_)

OvO trains one binary classifier per pair of classes, so the 4 classes give 4 × 3 / 2 = 6 classifiers.
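The count of 6 follows from the number of unordered class pairs, n(n-1)/2. The sketch below checks this on random synthetic data with placeholder labels "A" to "D"; the data is not meant to resemble the forest dataset.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier

# Placeholder class labels; OvO builds one binary classifier per unordered pair.
class_labels = ["A", "B", "C", "D"]
pairs = list(combinations(class_labels, 2))
n = len(class_labels)
expected = n * (n - 1) // 2

# Random synthetic features, 10 samples per class, only to make the fit runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.repeat(class_labels, 10)

ovo = OneVsOneClassifier(SGDClassifier(random_state=42)).fit(X, y)
n_estimators = len(ovo.estimators_)
print(pairs)
print(expected, n_estimators)
```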

In [None]:
y_test_pred = ovo_clf.predict(X_test)

In [None]:
conf_mx = confusion_matrix(y_test, y_test_pred)

In [None]:
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

The confusion matrix above shows that the classifier is good at classifying groups 1 and 4.
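A grayscale plot can be hard to read precisely, so per-class recall and precision can be derived from the matrix itself. This is a sketch on invented labels; in scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented true and predicted labels for three placeholder classes.
y_true = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "B", "C", "A", "C", "C"]
class_labels = ["A", "B", "C"]

cm = confusion_matrix(y_true, y_pred, labels=class_labels)

# Recall: correct predictions divided by the number of true members (row sums).
per_class_recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions (column sums).
per_class_precision = cm.diagonal() / cm.sum(axis=0)

print(cm)
print(dict(zip(class_labels, per_class_recall)))
print(dict(zip(class_labels, per_class_precision)))
```

The already imported `classification_report(y_test, y_test_pred)` would print the same quantities in tabular form.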

#### Support Vector Machine

In [None]:
# Pipeline steps are passed as a list of (name, estimator) tuples
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X_train, y_train)

SVC handles multiclass problems with an OvO strategy internally. Additionally, a pipeline is used to standardize the features before training the model. The hyperparameters used are an RBF kernel with gamma=5 and C=0.001.
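Rather than fixing `gamma` and `C` by hand, the already imported `GridSearchCV` could search over them. The sketch below runs on synthetic data from `make_classification`; the grid values are illustrative assumptions, not settings tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class problem standing in for the real features and labels.
X, y = make_classification(
    n_samples=120, n_features=9, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf")),
])

# Step parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "svm_clf__C": [0.001, 0.1, 1, 10],
    "svm_clf__gamma": [0.01, 0.1, 1, 5],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because the scaler sits inside the pipeline, it is refitted on each training fold, so the cross-validation scores are not affected by the held-out fold.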

In [None]:
y_test_pred2 = rbf_kernel_svm_clf.predict(X_test)

In [None]:
conf_mx2 = confusion_matrix(y_test, y_test_pred2)

In [None]:
plt.matshow(conf_mx2, cmap=plt.cm.gray)
plt.show()

This classifier performs poorly in classifying groups 1, 2 and 3.

In [None]:
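# Same pipeline with a linear kernel (the variable name is kept from the RBF model above);
# gamma has no effect when kernel="linear"
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="linear", gamma=1, C=0.1))
])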
rbf_kernel_svm_clf.fit(X_train, y_train)

In [None]:
y_test_pred3 = rbf_kernel_svm_clf.predict(X_test)

In [None]:
# Use the predictions of the linear-kernel model (y_test_pred3), not y_test_pred2
conf_mx3 = confusion_matrix(y_test, y_test_pred3)

In [None]:
plt.matshow(conf_mx3, cmap=plt.cm.gray)
plt.show()

This model standardizes the training set with StandardScaler inside the pipeline and uses a linear kernel with C=0.1.

### Result

Comparing both models, the SGD classifier performs better than the SVM on the given dataset.
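A comparison like this can be backed by numbers in addition to the confusion matrix plots. The sketch below computes accuracy and macro-averaged F1 for two sets of invented predictions; `model_1` and `model_2` are hypothetical names and the values say nothing about the actual models.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented labels and predictions for two hypothetical models.
y_true = ["A", "B", "C", "D", "A", "B", "C", "D"]
predictions = {
    "model_1": ["A", "B", "C", "D", "A", "B", "C", "A"],
    "model_2": ["A", "A", "A", "D", "A", "A", "C", "A"],
}

# Accuracy is the share of correct predictions; macro F1 weights every class equally.
summary = {
    name: {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
    for name, y_pred in predictions.items()
}
for name, scores in summary.items():
    print(name, scores)
```

With the notebook's own variables, the same calls could be applied to `y_test` together with `y_test_pred`, `y_test_pred2` and `y_test_pred3`.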