# Homework 1 - Naive Bayes Classifier
### Mihovil Mandic '21
### COSC074 Winter 2019

Notes:
- Columns 1 through 6 of the given CSV file represent independent variables
- The last column (“Label”) represents the dependent variable (0 or 1)
- 3 functions:
    - <strong>train(data)</strong>, where the input data is the training dataset (a pandas DataFrame read from the CSV file), and the output is a trained classifier C
    - <strong>predict(C, row)</strong>, where input C is a classifier object returned by train(...), input row is an array holding the IVs of a single row of the CSV test dataset (one row at a time, to account for variable dataset sizes), and the output is 0 or 1.
    - <strong>check(idx1, idx2)</strong>, validates on a held-out split of the training dataframe, limited to the feature columns from idx1 up to (but not including) idx2, and displays the accuracy and classification report
    
- <strong>I've selected 4 features for my classifier, Feature_2 to Feature_5; explained at the end of this notebook</strong>

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

train_file = "hw1_trainingset.csv"
test_file = "hw1_testset.csv"    # Unlabeled test set (an input, not an output)
trainset = pd.read_csv(train_file)
testset = pd.read_csv(test_file)

# Returns a trained classifier object
def train(data):
    gnb = GaussianNB()
    X = data.loc[:, 'Feature_2':'Feature_5']    # Dropped Feature_1 and Feature_6 - explained below
    y = data['Label']
    gnb.fit(X, y)
    return gnb

# Takes a classifier object and a single row of feature values,
# given as a 2D array of shape (1, n_features)
# Returns the class label as an int64
def predict(C, row):
    return C.predict(row)[0]    # predict() returns a length-1 array; take its only element

## Using the functions to classify the test set and writing the result into the final CSV file
<b>Note: Created the new Label column with a default value of -1. This value is then changed row by row according to prediction (0 or 1)</b>

In [2]:
nrows = len(testset.index)    # 2400
testset['Label'] = -1   # Default value for the new column

# Classifier object
clf = train(trainset)

# Iterating over 2400 rows
for i in range(nrows):
    row = testset.iloc[i].values[1:5].reshape(1, -1)   # Looking at values in columns Feature_2 to Feature_5
    testset.at[i, 'Label'] = predict(clf, row)    # Give the row to the predict function and save the value

print('Labeled testset: ')
print(testset)

# Write into a new CSV file, without the row index column
testset.to_csv("hw1_testset_classified.csv", index=False)

Labeled testset: 
      Feature_1  Feature_2  Feature_3  Feature_4  Feature_5  Feature_6  Label
0      13876.40  140984.00  -5240.000         33        315       5844      0
1       9958.06   27101.60  -3330.730         27        174       1415      0
2       8716.96   10783.70  -3113.440         34        156        654      0
3      15707.20  116106.00  -5050.000         54        271       4426      0
4       5189.89    1928.00  -1113.000         14         68        149      0
5      15522.30  115720.00  -5350.000         47        271       4426      0
6       7971.42   12121.00  -2564.000         27        110        765      0
7      14985.50  140936.00  -4070.000         49        315       5844      0
8       7471.31   13998.90  -2089.160         62        120        809      0
9       4014.91    2176.70  -1127.700         20         54        165      0
10      7297.55   14181.00  -3814.000          8        131        804      0
11     15531.30  131051.00  -6548.000         
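
The row-by-row loop above matches the required predict(C, row) interface, but scikit-learn classifiers can also label a whole table in one call. The following is a minimal sketch of that alternative, not the method used for the submitted file. It assumes synthetic stand-in data (`demo_train`, `demo_test`) and a hypothetical `FEATURES` list, since the real CSV files are not loaded here.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical helper list: the four features kept by the classifier
FEATURES = ["Feature_2", "Feature_3", "Feature_4", "Feature_5"]

# Synthetic stand-in data (assumption): same column names, random values
rng = np.random.default_rng(0)
cols = ["Feature_{}".format(i) for i in range(1, 7)]
demo_train = pd.DataFrame(rng.normal(size=(200, 6)), columns=cols)
demo_train["Label"] = (demo_train["Feature_3"] > 0).astype(int)
demo_test = pd.DataFrame(rng.normal(size=(50, 6)), columns=cols)

clf_demo = GaussianNB().fit(demo_train[FEATURES], demo_train["Label"])

# One call labels every row; selecting columns by name avoids positional slicing
demo_test["Label"] = clf_demo.predict(demo_test[FEATURES])

# Row-by-row predictions, for comparison with the batch result
row_by_row = [
    clf_demo.predict(demo_test.loc[[i], FEATURES])[0]
    for i in range(len(demo_test))
]
print(demo_test["Label"].value_counts())
```

Selecting the columns by name also keeps the feature choice in one place, so it cannot drift out of sync with the slice used in train().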

## Checking the Label frequency counts in both sets

In [3]:
print("Trainset - Label counts")
print(trainset['Label'].value_counts())
print("\nTestset - Label counts")
print(testset['Label'].value_counts())

Trainset - Label counts
0    3230
1    2370
Name: Label, dtype: int64

Testset - Label counts
0    2050
1     350
Name: Label, dtype: int64


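Note: The difference in class ratios between the training set and the test set is noticeable: roughly 58% of the training rows have Label 0, whereas about 85% of the test rows were assigned Label 0. The test labels are predictions rather than ground truth, so the gap may reflect the classifier's low recall for Label 1 (see the appendix) as well as a genuine difference between the two datasets.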
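
The class proportions behind this comparison can be computed directly from the counts printed above. This is a small sketch that copies those counts into hypothetical `train_counts` and `test_counts` Series instead of reading the CSV files.

```python
import pandas as pd

# Counts copied from the value_counts() output above
train_counts = pd.Series({0: 3230, 1: 2370})
test_counts = pd.Series({0: 2050, 1: 350})    # predicted labels, not ground truth

# Share of each label within its own set
train_share = train_counts / train_counts.sum()
test_share = test_counts / test_counts.sum()

summary = pd.DataFrame({"trainset": train_share, "testset (predicted)": test_share})
print(summary.round(3))
```

On the real dataframes, `trainset['Label'].value_counts(normalize=True)` gives the same proportions in one step.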

## Checking statistical parameters for both sets

In [4]:
print(trainset.describe())
print(testset.describe())

          Feature_1     Feature_2      Feature_3    Feature_4    Feature_5  \
count   5600.000000  5.600000e+03    5600.000000  5600.000000  5600.000000   
mean   10971.668209  7.646888e+04   -4389.410329    58.234464   211.824107   
std     5286.310454  1.280057e+05    6672.021675    48.152214   130.152914   
min     2060.160000  4.222420e+01 -359930.000000     0.000000    33.000000   
25%     7260.657500  9.782892e+03   -5660.960000    26.000000   119.000000   
50%     9785.010000  3.191295e+04   -3368.850000    44.000000   180.000000   
75%    13542.450000  8.334033e+04   -2159.777500    78.000000   271.000000   
max    45110.200000  1.857700e+06  136000.000000   480.000000   981.000000   

          Feature_6        Label  
count   5600.000000  5600.000000  
mean    2931.682500     0.423214  
std     3856.844886     0.494113  
min        5.000000     0.000000  
25%      584.000000     0.000000  
50%     1626.000000     0.000000  
75%     3539.000000     1.000000  
max    40369.0000

### Note: Feature_3 in the test set is distributed differently from Feature_3 in the training set. Its mean is much lower because its maximum is lower.
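
A side-by-side table makes this kind of shift easier to read than two separate describe() printouts. The sketch below shows one way to build it. The data is synthetic (the `demo_train` and `demo_test` frames and their parameters are assumptions chosen only for illustration), so the numbers it prints say nothing about the real Feature_3.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): two samples with different distributions
rng = np.random.default_rng(1)
demo_train = pd.DataFrame({"Feature_3": rng.normal(-4000, 6000, size=500)})
demo_test = pd.DataFrame({"Feature_3": rng.normal(-6000, 3000, size=300)})

# Keep only the summary statistics of interest
stats = ["mean", "std", "min", "max"]
comparison = pd.DataFrame({
    "trainset": demo_train["Feature_3"].describe()[stats],
    "testset": demo_test["Feature_3"].describe()[stats],
})
comparison["difference"] = comparison["testset"] - comparison["trainset"]
print(comparison)
```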


--------------------------------------------------------------------------------------------------------------------
## Appendix: Validation and accuracy check
Without any feature selection, and by using the <b>train_test_split</b> function to hold out part of the training set for validation, the accuracy was between 0.59 and 0.64, depending on the random_state used for the split.

In [5]:
# Function that validates the classifier on a held-out split; receives from-to column indices as arguments
def check(idx1, idx2):
    X = trainset.columns[idx1:idx2] # feature column names at positions idx1 to idx2 - 1
    y = trainset.columns[-1] # target (class label - last column)

    X_train, X_test, y_train, y_test = train_test_split(trainset[X], trainset[y], random_state=1)

    clf = GaussianNB()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))
    print('Accuracy is:', accuracy_score(y_test, y_pred))
    return

check(0, 6) # All six feature columns (positions 0 - 5)

              precision    recall  f1-score   support

           0       0.64      0.88      0.74       838
           1       0.60      0.27      0.37       562

   micro avg       0.63      0.63      0.63      1400
   macro avg       0.62      0.57      0.55      1400
weighted avg       0.62      0.63      0.59      1400

Accuracy is: 0.6328571428571429
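
Because the accuracy of a single split depends on random_state, averaging over several folds gives a steadier estimate. The following is a minimal sketch of 5-fold stratified cross-validation with scikit-learn's `cross_val_score`. It was not part of the original analysis, and it runs on synthetic data (`X_demo`, `y_demo` are assumptions), so its scores are not comparable to the ones reported above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): six features, one of them informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = (X_demo[:, 2] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# Five stratified folds; every row is used for validation exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy: {:.3f} (std {:.3f})".format(scores.mean(), scores.std()))
```

The standard deviation across folds gives a rough sense of how much of a difference between two feature sets could be due to the split alone.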


### Feature selection
I am going to select the K best features using the <b>SelectKBest</b> class offered by the scikit-learn library.
The scoring function used was <b>f_classif</b> (the ANOVA F-value), which accepts negative feature values as they are; I didn't want to manipulate the data because the negative values might be important, and I didn't know what the features represent.

In this part of the code I check which <b>k=4</b> features score best.
The printed scores show that the four highest-scoring features are Feature_2 to Feature_5; Feature_1 and Feature_6 score lowest.

In [6]:
array = trainset.values
X = array[:, 0:6]
y = array[:, 6]

# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, y)

# summarize scores
np.set_printoptions(precision=3)
print("Scores")
for idx, score in enumerate(fit.scores_, start=1):
    print("Feature_{0}: {1} ".format(idx, score))
features = fit.transform(X)

# summarize selected features
# print(features[0:6,:])

# check accuracy again, but drop the lowest scored columns (positions 0 and 5, i.e. Feature_1 and Feature_6)
check(1, 5)

Scores
Feature_1: 70.56736208879721 
Feature_2: 85.36832830778845 
Feature_3: 134.48631971158613 
Feature_4: 89.47029490159723 
Feature_5: 107.79603861454493 
Feature_6: 72.45654062902496 
              precision    recall  f1-score   support

           0       0.65      0.89      0.75       838
           1       0.63      0.27      0.38       562

   micro avg       0.64      0.64      0.64      1400
   macro avg       0.64      0.58      0.56      1400
weighted avg       0.64      0.64      0.60      1400

Accuracy is: 0.6421428571428571


--------------------------------------------------------------------------
Accuracy has slightly improved (from 0.633 to 0.642), and the precision for Label 1 has risen from 0.60 to 0.63, while its recall stays at 0.27.
Thus I will drop Feature_1 and Feature_6.
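
Instead of reading the kept features off the printed scores, the fitted selector can report them itself through `get_support()`. Here is a minimal sketch of that, on synthetic data in which Feature_2 to Feature_5 are deliberately made informative (an assumption made only so the example has a known answer).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (assumption): six noise features and a random binary label
rng = np.random.default_rng(0)
cols = ["Feature_{}".format(i) for i in range(1, 7)]
y_demo = rng.integers(0, 2, size=400)
X_demo = pd.DataFrame(rng.normal(size=(400, 6)), columns=cols)

# Shift Feature_2..Feature_5 by class so that they carry signal
for name in cols[1:5]:
    X_demo[name] += 2.0 * y_demo

selector = SelectKBest(score_func=f_classif, k=4).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
selected = X_demo.columns[selector.get_support()].tolist()
print(selected)
```

The resulting list of names could then be used to index the dataframe directly, which avoids hard-coding column positions.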