# COMP6940 Assignment 3: Classification and Clustering

#### Kevan Lee Lum    816003573

What are classification and clustering in the first place? From the top answer on StackOverflow (even though it has not been accepted, poor guy): in general, in classification you have a set of predefined classes and want to know which class a new object belongs to. Clustering tries to group a set of objects and find whether there is some relationship between the objects. In other words, classification is supervised learning and clustering is unsupervised learning. Easy enough. In both cases, a model is created in order to predict the behaviour of other objects.
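To make the difference concrete, here is a minimal sketch on made-up toy points (the data and variable names are assumptions for illustration, not part of the assignment). The classifier is given the labels `y`; the clustering algorithm never sees them.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy data (hypothetical): two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, only used by the classifier

# Classification (supervised): learn from labelled points, then label a new one.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
new_label = clf.predict([[4.9, 5.0]])[0]

# Clustering (unsupervised): group the same points without ever seeing y.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(new_label)
print(cluster_ids)
```

The cluster ids are arbitrary names for the groups, so they need not match the values in `y` even when the grouping is the same.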

The dataset here is the UCI Credit Approval Dataset, where each record is a credit card application. All attribute names and values have been changed to meaningless symbols to maintain confidentiality. The dataset has been cleaned to remove missing attributes. The data is stored in a comma-separated file (csv). Each line describes an instance using 16 columns: the first 15 columns represent the attributes of the application, and the last column is the ground truth label for credit card approval. Note: The last column should not be treated as an attribute. The objective of this exercise is to build a model that will help to predict whether a credit card will be approved based on these 15 attributes.

In [1]:
import pandas as pd
import numpy as np

from scipy.stats import randint as sp_randint

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

## Q1. Cleaning data 

### Clean the dataset and do any type conversions necessary 

In [2]:
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,b,30.83,0,u,g,w,v,1.25,t,t.1,01,f,g.1,00202,0.1,+
0,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
1,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
2,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
4,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360,0,+


We see that the file does not have a header row, so pandas has used the first record as the column names. Let's assign our own header so that we don't lose that first row.

In [3]:
# Build generic names col1..colN, one per column in the file
headers = ["col" + str(num) for num in range(1, len(df.columns) + 1)]

In [4]:
df = pd.read_csv('data.csv', header=None, names=headers)
df.head()

Unnamed: 0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [5]:
df.dtypes

col1      object
col2      object
col3     float64
col4      object
col5      object
col6      object
col7      object
col8     float64
col9      object
col10     object
col11      int64
col12     object
col13     object
col14     object
col15      int64
col16     object
dtype: object

Why is col2 not a float? Try to fix that...

In [6]:
# df["col2"] = df["col2"].apply(lambda x: float(str(x)))
# df.dtypes

Running the above raises "ValueError: could not convert string to float: '?'". This suggests that null values are represented by "?", so let's read in the df again treating "?" as null.

In [7]:
df = pd.read_csv('data.csv', header=None, names=headers, na_values="?" )
df.dtypes

col1      object
col2     float64
col3     float64
col4      object
col5      object
col6      object
col7      object
col8     float64
col9      object
col10     object
col11      int64
col12     object
col13     object
col14    float64
col15      int64
col16     object
dtype: object

Everything looks fine I think
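The effect of `na_values` can be checked on a tiny in-memory sample. This is only a sketch: the three rows below are made up, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample with "?" standing in for a missing value.
raw = "a,30.83,202\nb,?,43\na,24.5,?\n"

as_is = pd.read_csv(io.StringIO(raw), header=None)
with_na = pd.read_csv(io.StringIO(raw), header=None, na_values="?")

print(as_is.dtypes)    # columns containing "?" are read as strings, not numbers
print(with_na.dtypes)  # the same columns are now float64, with NaN where "?" was
```

This also explains why an integer-looking column such as col14 shows up as float64: NaN is a float, so a column that contains it cannot stay int64.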

### Ensure there are no null values, (imputing any if encountered)

So previously I did determine that there are null values represented by "?". Looking at the categorical columns first...

In [8]:
cat = df.dtypes[df.dtypes == "object"].index
df_missing = {}
for c in cat:
    n_missing = df[c].isnull().sum()    # number of null entries in this column
    if n_missing > 0:
        df_missing[c] = {'length': int(n_missing)}

df_missing


{'col1': {'length': 12},
 'col4': {'length': 6},
 'col5': {'length': 6},
 'col6': {'length': 9},
 'col7': {'length': 9}}

Ok so we have some null values in 5 of the columns. What to do with them? There are too few nulls to justify deleting the columns, so we can replace them with the most common value instead. value_counts returns the frequency of each value, sorted from most to least common.

In [9]:
mappings_df = {}
for key in df_missing.keys():
    mappings_df[key] = df[key].value_counts().index[0]    # most frequent value (public .index, not the private ._index)
print(mappings_df)

{'col5': 'g', 'col1': 'b', 'col4': 'u', 'col7': 'v', 'col6': 'c'}


In [10]:
for k in mappings_df:
    df[k] = df[k].fillna(mappings_df[k])    # assign back instead of filling in place on a column selection

# Re-check the categorical columns for remaining nulls
cat = df.dtypes[df.dtypes == "object"].index
df_missing = {}
for c in cat:
    n_missing = df[c].isnull().sum()
    if n_missing > 0:
        df_missing[c] = {'length': int(n_missing)}

df_missing

{}
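The most-common-value imputation used above can be sketched on a toy column (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
s = pd.Series(["u", "y", "u", np.nan, "u", np.nan, "y"])

# value_counts sorts by frequency and leaves NaN out by default,
# so the first index entry is the most common non-null value.
most_common = s.value_counts().index[0]
filled = s.fillna(most_common)    # returns a new Series; s itself is unchanged

print(most_common)
print(filled.tolist())
```

`s.mode()[0]` should give the same value and reads a little more directly.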

It appears that there are no more nulls in the categorical. Now looking at the non-categorical...

In [11]:
nan_cols = df.columns[df.isnull().any()]
print(nan_cols)

Index(['col2', 'col14'], dtype='object')


Ok so we see that 2 columns have some nulls. We can replace with the mean, but first do a describe to make sure that makes sense.

In [12]:
df.describe()

Unnamed: 0,col2,col3,col8,col11,col14,col15
count,678.0,690.0,690.0,690.0,677.0,690.0
mean,31.568171,4.758725,2.223406,2.4,184.014771,1017.385507
std,11.957862,4.978163,3.346513,4.86294,173.806768,5210.102598
min,13.75,0.0,0.0,0.0,0.0,0.0
25%,22.6025,1.0,0.165,0.0,75.0,0.0
50%,28.46,2.75,1.0,0.0,160.0,5.0
75%,38.23,7.2075,2.625,3.0,276.0,395.5
max,80.25,28.0,28.5,67.0,2000.0,100000.0


In [13]:
for c in nan_cols:
    df[c] = df[c].fillna(df[c].mean())    # assign back instead of filling in place on a column selection
    
nan_cols = df.columns[df.isnull().any()]
print(nan_cols)

Index([], dtype='object')


Ok great. On to the next thing

### Encode all categorical attributes

Encoding converts the categorical attributes into numerical values so that they can be used by sklearn, or by any other algorithm that requires numbers.

In [14]:
for c in cat:
    df[c] = df[c].astype('category')
    df[c] = df[c].cat.codes
    
df.head()

Unnamed: 0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16
0,1,30.83,0.0,1,0,12,7,1.25,1,1,1,0,0,202.0,0,0
1,0,58.67,4.46,1,0,10,3,3.04,1,1,6,0,0,43.0,560,0
2,0,24.5,0.5,1,0,10,3,1.5,1,0,0,0,0,280.0,824,0
3,1,27.83,1.54,1,0,12,7,3.75,1,1,5,1,0,100.0,3,0
4,1,20.17,5.625,1,0,12,7,1.71,1,0,0,0,2,120.0,0,0
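To see what `cat.codes` actually assigns, here is a small sketch on a made-up column. The `lookup` dictionary is a hypothetical helper for mapping codes back to the original symbols.

```python
import pandas as pd

# Hypothetical column with three distinct symbols.
s = pd.Series(["g", "p", "g", "s"]).astype("category")

codes = s.cat.codes                          # one integer code per row
lookup = dict(enumerate(s.cat.categories))   # code -> original symbol

print(codes.tolist())
print(lookup)
```

The codes follow the sorted order of the symbols, which is arbitrary for this dataset. Distance-based models such as KNN will still treat code 2 as "further" from 0 than 1 is, so one-hot encoding (for example `pd.get_dummies`) is a common alternative.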


Before scaling, separate the 15 attributes (x) from the label in the last column (y), since the label should not be treated as an attribute.

In [15]:
x  = df.drop("col16", axis=1)
y = df["col16"]

### Scale the attributes of the dataset 

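Scaling puts all of the columns on a similar scale (StandardScaler gives each column zero mean and unit variance) so that they can be compared. This is important for PCA, which would otherwise be dominated by the columns with the largest values.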

In [16]:
scaler = StandardScaler()
scaler.fit(x)
x = scaler.transform(x)
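A quick sanity check of what StandardScaler does, on made-up numbers rather than the credit data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical attributes on very different scales.
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])

X_scaled = StandardScaler().fit_transform(X)   # fit and transform in one call

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```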

### Perform PCA to obtain attributes which explain 95% of the variance in the data

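PCA attempts to reduce the number of dimensions of the dataset, for less intensive calculation, while keeping most of the variance. Passing 0.95 tells sklearn to keep the smallest number of components that together explain 95% of the variance; here that turns out to be 13 components instead of the original 15 attributes.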

In [17]:
pca = PCA(0.95)
pca.fit(x)
pca.n_components_
x = pca.transform(x)
x = pd.DataFrame(x)
x.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.65286,-0.774701,-0.399894,1.387732,0.525603,-0.805988,0.419147,1.300375,-0.771403,0.506222,-0.815474,-0.671759,-0.098399
1,2.428716,0.162798,-0.610089,-0.671336,-1.608986,-0.781,0.398719,1.00764,0.393714,1.095577,-1.126044,0.224518,-0.704029
2,-0.178628,-1.087993,-0.53395,0.434695,-1.812954,-0.205526,0.396901,0.763085,0.274859,0.409848,-0.10011,-1.362535,0.515935
3,1.44198,-0.278353,0.293573,1.700763,0.701062,-1.059722,0.570916,-0.202567,-0.379249,-0.534559,-0.988278,-0.057016,0.275485
4,-0.5097,-1.520026,0.847717,-0.221366,1.492885,-0.438917,2.083965,1.97974,1.408254,-1.088073,0.180449,-1.038724,-0.12361
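How `PCA(0.95)` picks the number of components can be seen from the cumulative explained variance. The sketch below uses synthetic data (an assumption, built so that two columns are near-duplicates of two others) rather than the credit dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: 200 rows, 5 columns, where the last two columns
# are almost copies of the first two (so they add little new variance).
base = rng.normal(size=(200, 3))
X = np.hstack([base, base[:, :2] + 0.01 * rng.normal(size=(200, 2))])

pca = PCA(0.95).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(pca.n_components_)  # number of components kept
print(cumulative)         # running total of explained variance, crossing 0.95 at the last entry
```

On the real data, printing `np.cumsum(pca.explained_variance_ratio_)` after `pca.fit(x)` would show the same kind of running total.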


For classification, we need a dataset to train the model, and one to test the model. We can split up the original dataset using sklearn's train_test_split with its default parameters. The default split is 0.75 for training and 0.25 for testing.

In [18]:
x_train, x_test, y_train, y_test = train_test_split(x, y)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((517, 13), (517,), (173, 13), (173,))

All ready for classification
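One thing to be aware of: the scaler and PCA above were fitted on the whole dataset before splitting, so the test rows had some influence on the preprocessing. A possible alternative, sketched here on synthetic stand-in data (the data and names are assumptions), is to split first and wrap the steps in a sklearn pipeline so that they are fitted on the training rows only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Hypothetical stand-in data: 200 rows, 6 numeric attributes, binary label.
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() learns the scaling and the PCA projection from the training rows only;
# score() then applies those same fitted steps to the test rows.
model = make_pipeline(StandardScaler(), PCA(0.95), KNeighborsClassifier())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(test_accuracy)
```

The same pipeline object can also be passed to `cross_val_score` or `RandomizedSearchCV`, which then refit the preprocessing inside each fold.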

## Q2. Random Forest Classifier

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

### Part 1: Using the RandomForest Classifier provided by the sklearn library

### Initialize the classifier with default arbitrary parameters

In [19]:
random_forest = RandomForestClassifier()

### Train the classifier

In [20]:
random_forest.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### Determine the recall score of the classifier

In [21]:
y_pred = random_forest.predict(x_test)
acc_random_forest = round(accuracy_score(y_test,y_pred) * 100, 2)
acc_random_forest

82.08
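Note that the cell above reports accuracy (via `accuracy_score`), which is not the same thing as recall. The sketch below shows the difference on made-up labels, using sklearn's `recall_score`.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1 is the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = accuracy_score(y_true, y_hat)  # share of all predictions that are correct (7 of 10)
recall = recall_score(y_true, y_hat)      # share of actual positives that were found (2 of 4)

print(accuracy, recall)
```

In this notebook `+` ended up encoded as 0 (see the col16 values in the head() output after encoding), so recall for approvals would need `recall_score(y_test, y_pred, pos_label=0)`; the default `pos_label=1` would measure recall for rejections.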

### Part 2: Using the RandomizedSearchCV module provided by the sklearn library

RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. Hyper-parameters are parameters that are not directly learnt within estimators. 

In [22]:
random_forest.get_params()

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

We see that there are a lot of parameters that can be played with. Let's define the ones we are concerned with.
1. n_estimators: The number of trees in the forest
2. max_features: The number of features to consider when looking for the best split
3. max_depth: Maximum depth of the tree
4. min_samples_split: The minimum number of samples required to split an internal node
5. min_samples_leaf: The minimum number of samples required to be at a leaf node
6. bootstrap: Whether bootstrap samples are used when building trees. A bootstrap sample is drawn at random from the training set with replacement, so some rows appear more than once and others are left out.

### Do parameter tuning to obtain the optimal parameters to initialize the RandomForest Classifier

In [23]:
param_dist = {"n_estimators": sp_randint(1,20),
              "max_features": sp_randint(1, 11),
              "max_depth": [3, None],
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False]}

In [44]:
random_search = RandomizedSearchCV(random_forest, param_distributions=param_dist,
                                   n_iter=50)

A larger number of iterations has a better chance of finding good parameters, at the cost of a longer run time, since every sampled setting is cross-validated.
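Each `sp_randint(low, high)` entry in `param_dist` is a distribution that RandomizedSearchCV samples from. A small sketch of what such a distribution produces (the sample size and seed are arbitrary choices):

```python
from scipy.stats import randint as sp_randint

dist = sp_randint(1, 20)                        # integers from 1 to 19; the upper bound is excluded
samples = dist.rvs(size=1000, random_state=0)   # draw 1000 values with a fixed seed

print(samples.min(), samples.max())
```

So `sp_randint(1, 11)` for max_features can give at most 10, which is safely below the 13 columns left after PCA.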

### Determine the recall score of the classifier

In [45]:
random_search.fit(x_train, y_train)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
          fit_params=None, iid=True, n_iter=100, n_jobs=1,
          param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7efd0f89e8d0>, 'bootstrap': [True, False], 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7efd0f89eb70>, 'max_depth': [3, None], 'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7efd0faefa58>, 'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7efd0faef240>},
     

In [46]:
random_search.score(x_train, y_train)

0.9032882011605415

In [47]:
y_pred = random_search.predict(x_test)
acc_random_forest = round(accuracy_score(y_test,y_pred) * 100, 2)
acc_random_forest

84.39
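After fitting, the search object also exposes which setting won. This is a sketch on synthetic stand-in data with a deliberately small search (the data, the reduced parameter grid and `n_iter=5` are assumptions to keep it quick).

```python
import numpy as np
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
# Hypothetical stand-in data: the label depends on the first column only.
X = rng.normal(size=(120, 5))
y = (X[:, 0] > 0).astype(int)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": sp_randint(5, 20),
                         "max_depth": [3, None]},
    n_iter=5, cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)  # the sampled setting with the best mean cross-validation score
print(search.best_score_)   # that mean cross-validated accuracy
```

On the notebook's own `random_search`, `random_search.best_params_` and `random_search.best_score_` would give the corresponding values. `best_score_` is usually a fairer number to quote than the score on `x_train`, since the model has already seen the training rows.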

## Q3. KNN Classifier

The K nearest neighbours algorithm predicts the class of an object by finding the k training objects closest to it and taking a majority vote among their labels.
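The idea fits in a few lines of numpy. This is only an illustrative sketch with Euclidean distance and made-up points (`knn_predict` is a hypothetical helper, not sklearn's implementation).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, point, k=5):
    """Label `point` by majority vote among its k nearest training rows."""
    distances = np.linalg.norm(X_train - point, axis=1)  # Euclidean distance to every row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    votes = Counter(y_train[nearest].tolist())           # count the labels of those rows
    return votes.most_common(1)[0][0]

# Hypothetical training data: two small groups of points.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

label = knn_predict(X_train, y_train, np.array([4.5, 5.0]), k=3)
print(label)
```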

### Part 1: Using the KNN Classifier provided by the sklearn library

### Initialize the classifier with default value for n_neighbors

In [28]:
knn = KNeighborsClassifier()     # the default n_neighbors is 5

### Train the classifier

In [29]:
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

### Determine the recall score of the classifier

In [30]:
y_pred = knn.predict(x_test)
acc_knn = round(accuracy_score(y_test,y_pred) * 100, 2)
acc_knn

84.39

Higher than the default random forest classifier (82.08), and the same as the tuned one (84.39). Cool

### Part 2: Using the cross_val_score module provided by the sklearn library

Cross validation is used to estimate the accuracy of the model before it is applied to the test set, and it is used to guard against overfitting of the model. It is particularly useful for small datasets, since more of the data can be used in training the model as opposed to setting aside a separate validation dataset as well as a test dataset. 10 fold cross validation means that the training dataset is split into 10 smaller datasets; each one in turn is held out as the validation dataset while the model is trained on the other 9, and the 10 scores are averaged. Apparently 10 fold is the most common, even though 3 fold is the default in the sklearn version used here (newer versions default to 5 fold).
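The fold mechanics can be seen directly with sklearn's `KFold`. The sketch uses a made-up 10-row dataset and 5 folds to keep the output short.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # hypothetical 10-row dataset

fold_sizes = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    # every row lands in the validation part of exactly one fold
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
```

When `cross_val_score` is given a classifier and an integer `cv`, it uses the stratified variant (`StratifiedKFold`), which also keeps the class proportions similar in every fold.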

### Perform 10 fold cross validation to obtain the optimal value to use for n_neighbor

In [31]:
neighbors = list(range(1, 50, 2))     # odd numbers from 1 to 49; an odd k avoids tied votes between the two classes
cv_scores = {}

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=10, scoring='accuracy')
    cv_scores[k] = scores.mean()

In [32]:
opt_k = max(cv_scores, key=lambda k: cv_scores[k])
opt_k

5

The optimal value to use for n_neighbors is 5? That's the default we used before, so we'll get the same answer.

### Retrain the classifier

In [33]:
knn = KNeighborsClassifier(n_neighbors = 5)

In [34]:
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

### Determine the recall score of the classifier

In [35]:
y_pred = knn.predict(x_test)
acc_knn = round(accuracy_score(y_test,y_pred) * 100, 2)
acc_knn

84.39

I have learnt that the score (accuracy, as computed here) is highly dependent on the training and test datasets, which are chosen at random every time the code is run. Two classification techniques were used in this exercise: Random Forest and KNN. Other classification techniques include logistic regression, Gaussian naive Bayes, and the perceptron.
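If repeatable numbers are wanted, one option is to fix the seed of the split. A minimal sketch on made-up data (the seed value 42 and the `stratify` choice are assumptions, not what was run above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 40% positive labels.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle so every run gives the same split;
# stratify=y keeps the class ratio the same in the training and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)

print(X_train.shape, X_test.shape, y_train.sum(), y_test.sum())
```

Passing `random_state` to RandomForestClassifier and RandomizedSearchCV as well would remove the remaining run-to-run variation from those steps.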