## Importing required Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sqlite3
import warnings

## Loading Data

In [2]:
con = sqlite3.connect("Assignment2021.sqlite") #Assuming that the file is in root folder. 
df = pd.read_sql_query("SELECT * from data", con) #data is saved in df dataframe

DatabaseError: Execution failed on sql 'SELECT * from data': no such table: data
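
The error above says that the database has no table called `data`. Note that `sqlite3.connect` silently creates an empty database when the path does not exist, which is one common cause of this message. A quick way to see which tables a SQLite file really contains is to query `sqlite_master`. The sketch below uses an in-memory database and a made-up table name (`measurements`) purely for illustration; `list_tables` is a hypothetical helper, not part of the assignment code.

```python
import sqlite3

import pandas as pd


def list_tables(connection):
    """Return the names of all tables visible through an open SQLite connection."""
    query = "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    return pd.read_sql_query(query, connection)["name"].tolist()


# In-memory database with a hypothetical table, only for demonstration.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE measurements (id INTEGER, value REAL)")

tables = list_tables(demo_con)
print(tables)
```

For the real file, calling `list_tables(con)` on the existing connection should show whether `data` is present; an empty list would suggest that the path is wrong.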

## Changing Display Settings for better View

In [None]:
warnings.filterwarnings('ignore')
pd.options.display.max_columns = 50  # It would allow us to look at all columns at once
pd.options.display.max_rows = 200
np.random.seed(31415)

# Data Exploration

First, we will look at the shape of the dataframe. After that, we will look at the information about its dimensions.

In [None]:
#Printing Shape of Dataframe 

print(f"The shape of dataframe is {df.shape}")

So, the dataframe has 1200 rows and 32 dimensions. Now, we will look at the datatypes of all the dimensions.

In [None]:
#Printing information about dataframe dimensions 
df.info()

#### Observations from the information on dimensions 

1. 6 dimensions are of integer datatype 
2. 3 dimensions are of object datatype
3. Remaining 23 dimensions are of float datatype
4. Att00 is missing 9 observations 
5. Att09 is missing 581 observations
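
The missing-value counts listed above can also be computed directly instead of being read off `df.info()`. A minimal sketch, assuming a toy frame whose column names only mimic the notebook's (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names only mimic the notebook's.
toy = pd.DataFrame({
    "Att00": [1.0, np.nan, 3.0, 4.0],
    "Att09": [np.nan, np.nan, 2.5, np.nan],
    "class": [0, 1, 2, 1],
})

missing_count = toy.isna().sum()          # missing values per column
missing_share = toy.isna().mean() * 100   # the same, as a percentage of rows

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary[summary["missing"] > 0])    # show only columns with gaps
```

Running the same three lines on `df` would give the count and the share of missing values for every dimension in one table.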

### Now we will explore our dataframe further

In [None]:
#printing descriptive statistics about all the columns of dataframe
df.describe(include = 'all')

#### Observations from descriptive statistics

1. There is a huge contrast in the range of the dimensions. For example, the inter-quartile range for Att00 is 4.557, whereas the inter-quartile range for Att02 is 58704.75
2. Dimensions with 'object' datatype have the following number of unique values - 
    2.1. Att01 -> 10
    2.2. Att08 -> 3
    2.3. Att29 -> 7
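
The inter-quartile ranges quoted above are the difference between the 75th and the 25th percentile of each column. A minimal sketch on made-up values (the column names `small` and `large` are hypothetical):

```python
import pandas as pd

# Hypothetical values chosen so that the two columns differ greatly in scale.
toy = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0, 5.0],
    "large": [10000.0, 20000.0, 30000.0, 40000.0, 50000.0],
})

# IQR = 75th percentile minus 25th percentile, computed per column.
iqr = toy.quantile(0.75) - toy.quantile(0.25)
print(iqr)
```

A contrast like this is the reason the data is standardised later, since distance-based models such as k-NN would otherwise be dominated by the large-scale dimensions.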
    
### Now we will create histograms for all the numerical dimensions to understand their distribution better. 

In [None]:
numericDF = df.select_dtypes(include=np.number)
numericDF.hist(bins=30, figsize=(15, 10))

'''
Author: Reka Horvath
Date: n.d.
Availability: https://realpython.com/pandas-plot-python/

'''

#### Observations based on distribution of numerical dimensions: 
1. Most of the dimensions have an almost-normal distribution
2. Index has a uniform distribution. Also, the Index dimension does not add any extra information to the dataframe (not that this has anything to do with it being uniformly distributed. This just looked like a good time to point that out.)
3. Att13, Att20 and Att24 have uniformly distributed data. (This is a bit unusual. We will check if they are scaled-up versions of each other during data preparation.)
4. Att21 and Att23 have binary data (I am planning to use one-hot encoding. Therefore, handling them differently does not look like a necessity.)

### Now we will explore non-numeric dimensions 

Author: n.d.

Date: 2021

Availability: https://www.geeksforgeeks.org/multi-plot-grid-in-seaborn/


In [None]:
sns.set()
for i in ['Att01', 'Att08', 'Att29']:
    plt.figure() 
    sns.histplot(binwidth=0.5, x=i, data=df, stat="count", multiple="stack")

#### Observations based on the distribution of categorical data: 

1. For Att01, the data is not equally divided among the 10 categories. Category 'LWYW' occurs 317 times. On the other hand, 'TRRP' occurs only once. 
2. Similarly, the data is imbalanced across the 'Att08' and 'Att29' categories too. 

### Now, let us explore our target dimension -> 'class'

In [None]:
df['class'].plot(kind = 'hist')

The class ratio is 2:3:5. The data is imbalanced (technically, any ratio other than 1:1:1 is an imbalance). However, it is too early to say whether we need to do anything to address this imbalance. 
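
A histogram shows the imbalance visually; exact counts and shares come from `value_counts`. A minimal sketch, assuming made-up labels 0, 1 and 2 in a 2:3:5 ratio (which real class has which share is not shown in this notebook):

```python
import pandas as pd

# Hypothetical labels in a 2:3:5 ratio, mirroring the ratio quoted above.
labels = pd.Series([0] * 2 + [1] * 3 + [2] * 5, name="class")

counts = labels.value_counts().sort_index()                # absolute counts
shares = labels.value_counts(normalize=True).sort_index()  # fractions of the total
print(pd.DataFrame({"count": counts, "share": shares}))
```

On the real data, `df['class'].value_counts(normalize=True)` would give the same kind of table.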







# Data Preparation

We will make a few tweaks to the data to make it better suited for classification. 

#### Baby steps first. Let's start with removing the index dimension. 

In [None]:
df.drop('index', axis=1, inplace = True) #axis = 1 is to select 'dimension' axis

### Handling Missing Data 

#### 1. Att09 

There are 581 (48%) instances missing for this dimension. I believe that we must try the easiest approach first. If we are able to reach the business goal with the easiest approach, it will save time and effort. Therefore, I am dropping Att09 from the analysis.

In [None]:
#dropping Att09 because about 48% of its values are missing
df.drop('Att09', axis = 1, inplace = True)

### Handling Missing Data 

#### 2. Att00 

There are 9 (less than 1%) instances missing for this dimension. As before, I believe that we must try the easiest approach first. Therefore, I am dropping all instances where Att00 is missing. Previously, I dropped the entire column because 48% of its data was missing. Here, however, less than 1% of the data is missing, so dropping the entire dimension would not be justified.

In [None]:
df.dropna(inplace = True, subset = ['Att00'])

### Handling duplicate instances

We will check for duplicate instances and drop any that we find

In [None]:
duplicate = df[df.duplicated()]
print(duplicate.shape)

As no instance is duplicated, we need not do anything

### Handling Categorical Variables (where datatype is object)

I am using one-hot encoding for this purpose

In [None]:
df = pd.get_dummies(df)
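
To make the effect of one-hot encoding concrete, here is a minimal sketch on a made-up frame (the category values `A` and `B` are hypothetical). `pd.get_dummies` expands only the object columns and leaves numeric columns untouched; passing `dtype=float` is my own choice here so that the new columns are numeric rather than boolean.

```python
import pandas as pd

# Hypothetical toy frame: one numeric and one categorical (object) column.
toy = pd.DataFrame({
    "Att00": [0.5, 1.5, 2.5],
    "Att08": ["A", "B", "A"],
})

# Each category becomes its own 0/1 column named <column>_<category>.
encoded = pd.get_dummies(toy, dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

This naming scheme is why columns such as `Att08_VEVT` and `Att01_TRRP` appear later in the notebook.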

### Standardising Data

I am also separating class dimension before performing standardisation

In [None]:
dfY = df['class']
dfX = df.drop(['class'], axis = 1)
dfX=(dfX-dfX.mean())/dfX.std()
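
The last line of the cell above is a z-score: subtract the column mean and divide by the column standard deviation. A minimal sketch on made-up numbers, showing that every column ends up with mean 0 and standard deviation 1:

```python
import numpy as np
import pandas as pd

# Hypothetical columns on very different scales.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# Same formula as in the notebook. pandas uses the sample standard
# deviation (ddof=1) by default.
scaled = (toy - toy.mean()) / toy.std()

print(scaled.mean().round(6).tolist())
print(scaled.std().round(6).tolist())
```

Two caveats, stated as assumptions about how this might matter here: scikit-learn's `StandardScaler` divides by the population standard deviation (ddof=0), so its output differs slightly from this formula, and a column with zero variance would turn into NaN because of the division by zero.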

### Dropping highly correlated data 

I am plotting how many data dimensions will be lost for different correlation cut-offs

Author: NISHA DAGA

Date: June, 2017

Availability: https://stackoverflow.com/questions/29294983/how-to-calculate-correlation-between-all-columns-and-remove-highly-correlated-on

In [None]:
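cutoffs = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]  # correlation cut-offs to try
# The correlation matrix does not depend on the cut-off, so compute it once
corr_matrix = dfX.corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal); np.bool was removed in NumPy 1.24, so use bool
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
count = []
for i in cutoffs:
    to_drop = [column for column in upper.columns if any(upper[column] > i)]
    count.append(len(to_drop))
plt.plot(cutoffs, count)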

It seems that 6 dimensions are copies of some existing dimensions. Apart from them, no pair of dimensions has a Pearson coefficient greater than 0.7

In [None]:
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
dfX.drop(to_drop, axis=1, inplace=True)
print('dimensions that have been dropped are', to_drop)
dfX.drop(['Att08_VEVT', 'Att01_TRRP'], axis=1, inplace=True)
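
The upper-triangle trick used above can be hard to follow on 40-odd columns, so here is a minimal sketch on a made-up frame. The column `copy` is an exact rescaling of `x` and `noise` is unrelated; all names and values are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical frame: "copy" is a rescaled "x", "noise" is independent.
toy = pd.DataFrame({
    "x": base,
    "copy": 3.0 * base + 1.0,
    "noise": rng.normal(size=200),
})

toy_corr = toy.corr().abs()
# Keep only the strict upper triangle so that each pair is inspected once
# and a column is never compared with itself.
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
toy_upper = toy_corr.where(mask)

# A column is dropped if it correlates strongly with any EARLIER column.
toy_drop = [col for col in toy_upper.columns if (toy_upper[col] > 0.95).any()]
print(toy_drop)
```

Because only the upper triangle is inspected, the first column of a correlated pair is kept and the later one is dropped, so one representative of each group of near-copies survives.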

### Detaching the dataset for prediction

In [None]:
dfX_pred = dfX[991:1191]
dfX_work = dfX[0:991]
dfY_work = dfY[0:991]
print(dfX_work.shape, dfY_work.shape, dfX_pred.shape)

This ends the data preparation section. We have three dataframes. 

__dfX_work__ : 991 x 38

__dfY_work__ : 991 x 1

__dfX_pred__ : 200 x 38

# Data Classification

It is important to recognise that there is class imbalance.
My strategy: do nothing at first. See if we can achieve the business goal (75% accuracy) without addressing the class imbalance issue. If we are not able to do that, then I will try oversampling.
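
Oversampling is not needed in this notebook, but for reference, a minimal sketch of what random oversampling could look like is given below. `oversample` is a hypothetical helper written for this illustration, and the toy data is made up. It resamples every class with replacement up to the size of the largest class.

```python
import pandas as pd


def oversample(features, target, random_state=0):
    """Randomly resample every class (with replacement) up to the size of
    the largest class. Assumes features and target share a unique index."""
    largest = target.value_counts().max()
    parts = [
        group.sample(n=largest, replace=True, random_state=random_state)
        for _, group in features.groupby(target)
    ]
    balanced_X = pd.concat(parts)
    balanced_y = target.loc[balanced_X.index]  # pick the matching labels
    return balanced_X.reset_index(drop=True), balanced_y.reset_index(drop=True)


# Hypothetical data in a 2:3:5 class ratio.
toy_X = pd.DataFrame({"f": range(10)})
toy_y = pd.Series([0] * 2 + [1] * 3 + [2] * 5, name="class")

bal_X, bal_y = oversample(toy_X, toy_y)
print(bal_y.value_counts().sort_index().tolist())
```

If this were used, it would probably be best applied to the training split only, so that duplicated rows do not leak into the test data.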

## Test-Train Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dfX_work, dfY_work, test_size=0.20)
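
Since the classes are imbalanced, one option (not used in this notebook) is a stratified split, which keeps the class ratio roughly the same in the train and test parts. A minimal sketch on made-up data, using the `stratify` and `random_state` parameters of `train_test_split`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data in a 2:3:5 class ratio.
toy_X = pd.DataFrame({"f": range(100)})
toy_y = pd.Series([0] * 20 + [1] * 30 + [2] * 50, name="class")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.20, stratify=toy_y, random_state=31415
)

# Class counts in the 20-row test part should mirror the overall ratio.
test_counts = y_te.value_counts().sort_index().tolist()
print(test_counts)
```

Passing `random_state` explicitly also makes the split reproducible without relying on the global NumPy seed.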
ACCURACY = [] # we will append all the accuracies as we move forward

## 1. Random Forest Classifier

In [None]:
def plot_grid_search(cv_results, grid_param_1, grid_param_2, name_param_1, name_param_2):
    '''
    Author: David Alvarez and Mike Lewis
    Date: 2018
    Availability: https://stackoverflow.com/questions/37161563/how-to-graph-grid-scores-from-gridsearchcv
    '''
    # Get Test Scores Mean and std for each grid search
    scores_mean = cv_results['mean_test_score']
    scores_mean = np.array(scores_mean).reshape(len(grid_param_2),len(grid_param_1))

    scores_sd = cv_results['std_test_score']
    scores_sd = np.array(scores_sd).reshape(len(grid_param_2),len(grid_param_1))

    # Plot Grid search scores
    _, ax = plt.subplots(1,1)

    # Param1 is the X-axis, Param 2 is represented as a different curve (color line)
    for idx, val in enumerate(grid_param_2):
        ax.plot(grid_param_1, scores_mean[idx,:], '-o', label= name_param_2 + ': ' + str(val))

    ax.set_title("Grid Search Scores", fontsize=20, fontweight='bold')
    ax.set_xlabel(name_param_1, fontsize=16)
    ax.set_ylabel('CV Average Score', fontsize=16)
    ax.legend(loc="best", fontsize=15)
    ax.grid('on')

In [None]:
#Code inspired from practical sessions held by Dr Paul Hancock at Curtin University in Sem 2 of 2021
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
n_estimators = [50, 100, 150, 200, 300]
criterion = ['gini', 'entropy']
params = [{'n_estimators' : n_estimators,
           'criterion' : criterion
          }]
clf_rff = GridSearchCV(RandomForestClassifier(), param_grid=params, scoring='accuracy',cv=5)
clf_rff.fit(X_train,y_train)
y_pred=clf_rff.predict(X_test)
plot_grid_search(clf_rff.cv_results_, n_estimators, criterion, 'N Estimators', 'Criterion')


As evident from the graph above, the best n_estimators value is 200 with the entropy criterion. Therefore, we will create a model with these parameters and evaluate it on the test data
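
Instead of reading the best setting off the graph, it can also be taken from the fitted search object through `best_params_`, `best_score_` and `best_estimator_`. A minimal sketch on synthetic data (the data and the tiny grid are stand-ins chosen to run quickly, not the notebook's settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=4, n_classes=3, random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "criterion": ["gini", "entropy"]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)            # best combination found by the search
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
best_model = search.best_estimator_   # refit on all of X_demo (refit=True is the default)
```

In the notebook, `clf_rff.best_params_` would play the same role, and `clf_rff.best_estimator_` could be used directly instead of refitting a new model by hand.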

In [None]:
rff = RandomForestClassifier(n_estimators = 200, criterion = 'entropy')
rff.fit(X_train,y_train)
y_pred=rff.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
ACCURACY.append(['Random Forest', metrics.accuracy_score(y_test, y_pred)])

## 2. k-NN Classifier

In [None]:
#Code inspired from practical sessions held by Dr Paul Hancock at Curtin University in Sem 2 of 2021
from sklearn.neighbors import KNeighborsClassifier
n_neighbors = [1, 2, 3, 5, 6, 8, 10]
p = [1, 2]
params = [{'n_neighbors' : n_neighbors,
           'p' : p
          }]
clf_knn = GridSearchCV(KNeighborsClassifier(), param_grid=params, scoring='accuracy',cv=5)
clf_knn.fit(X_train,y_train)
y_pred=clf_knn.predict(X_test)
plot_grid_search(clf_knn.cv_results_, n_neighbors, p, 'Number of neighbors', 'type of distance')

It is evident that the best results are obtained when the number of neighbors = 2 with Euclidean distance

In [None]:
knn = KNeighborsClassifier(n_neighbors = 2) #by default p = 2 for KNeighborsClassifier
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
ACCURACY.append(['knn', metrics.accuracy_score(y_test, y_pred)])

## 3. Gaussian Naive-Bayes Classifier

In [None]:
#Code inspired from practical sessions held by Dr Paul Hancock at Curtin University in Sem 2 of 2021
from sklearn.naive_bayes import GaussianNB

var_smoothing = np.logspace(0,-9, num=100)

params = [{'var_smoothing' : var_smoothing
          }]
clf_nb = GridSearchCV(GaussianNB(), param_grid=params, scoring='accuracy',cv=5)
clf_nb.fit(X_train,y_train)
y_pred=clf_nb.predict(X_test)
p = [1] #dummy second parameter so that the two-parameter plotting function can be reused unchanged
plot_grid_search(clf_nb.cv_results_, var_smoothing, p, 'var_smoothing', '')

In [None]:
print('the best parameter for naive Bayes is : ', clf_nb.best_params_)

In [None]:
nb = GaussianNB(var_smoothing = 0.81113) 
nb.fit(X_train,y_train)
y_pred=nb.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
ACCURACY.append(['Naive Bayes', metrics.accuracy_score(y_test, y_pred)])

## 4. Decision Tree Classifier

In [None]:
#Code inspired from practical sessions held by Dr Paul Hancock at Curtin University in Sem 2 of 2021
from sklearn.tree import DecisionTreeClassifier

min_samples_leaf = [i for i in range(1, 10)]

criterion = ['gini', 'entropy']
params = [{'min_samples_leaf' : min_samples_leaf,
          'criterion': criterion}]
clf_dt = GridSearchCV(DecisionTreeClassifier(), param_grid=params, scoring='accuracy',cv=5)
clf_dt.fit(X_train, y_train)
y_pred=clf_dt.predict(X_test)
plot_grid_search(clf_dt.cv_results_, min_samples_leaf, criterion, 'Minimum Sample Leaves', 'Criterion')


It is evident that the best setting is a minimum number of samples per leaf = 3 with the entropy criterion

In [None]:
dt = DecisionTreeClassifier(criterion = 'entropy', min_samples_leaf = 3) 
dt.fit(X_train,y_train)
y_pred=dt.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
ACCURACY.append(['Decision Tree', metrics.accuracy_score(y_test, y_pred)])

In [None]:
print(ACCURACY)
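
A list of `[name, accuracy]` pairs is easier to compare once it is sorted. A minimal sketch, assuming the same layout as `ACCURACY`; only the two accuracies quoted in the Model Selection section are used here, because the other two values are not stated in this notebook.

```python
import pandas as pd

# Same [name, accuracy] layout as ACCURACY, with only the two quoted figures.
accuracy_demo = [["Decision Tree", 0.7185], ["Random Forest", 0.8341]]

ranking = (
    pd.DataFrame(accuracy_demo, columns=["model", "accuracy"])
    .sort_values("accuracy", ascending=False)   # best model first
    .reset_index(drop=True)
)
print(ranking)
```

Passing the real `ACCURACY` list to the same expression would rank all four classifiers.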

# Model Selection 

The two best models that we have, in order of accuracy, are: 

__Random Forest Classifier__ : 83.41% accuracy

__Decision Tree Classifier__ : 71.85% accuracy


## Saving predictions done by these two models in a sqlite file with the name 'Answers.sqlite'

In [None]:
answers = pd.DataFrame()
answers['index'] = range(1000, 1200)
answers['Predict1'] = rff.predict(dfX_pred).astype(np.int32)
answers['Predict2'] = dt.predict(dfX_pred).astype(np.int32)
answers = answers.set_index('index')
print(answers)

In [None]:
conn = sqlite3.connect('Answers.sqlite')
answers.to_sql('data', conn)
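
It can be worth reading the table back to confirm that the predictions were stored as intended. A minimal sketch, assuming a small made-up stand-in for `answers` and an in-memory database instead of `Answers.sqlite`:

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the answers frame built above.
demo = pd.DataFrame(
    {"Predict1": [0, 1, 2], "Predict2": [0, 2, 2]},
    index=pd.Index([1000, 1001, 1002], name="index"),
)

demo_conn = sqlite3.connect(":memory:")
# if_exists="replace" lets the cell be re-run; the default ("fail") raises
# an error when the table already exists.
demo.to_sql("data", demo_conn, if_exists="replace")

# Read the table back, restoring the saved index column as the index.
restored = pd.read_sql_query("SELECT * FROM data", demo_conn, index_col="index")
demo_conn.close()
print(restored)
```

With the real file, the same `read_sql_query` call on `conn` should return 200 rows with the columns `Predict1` and `Predict2`.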

### The End