# Classification: Nearest Neighbors and Naive Bayes

## Classification using Nearest Neighbors

(a) Perform k-nearest neighbours (k-NN) on the given dataset ($X_{knn}$ and $y_{knn}$, where $X_{knn}$ stores the feature vectors representing the movies and $y_{knn}$ stores the 0-1 label for each movie) to classify whether a given movie is a comedy (label 1) or not a comedy (label 0). Split the dataset into train (80%), validation (10%) and test (10%) sets. Run k-NN for different values of k (1, 3, 7, 15, 31, 63). Using the validation set, select the k that gives the best accuracy score.

(i) Report the validation accuracy for every value of k.
<br>(ii) Report the accuracy score obtained by running k-NN on the test set with the best k.

In [70]:
#Import required packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
import matplotlib
from scipy import stats
from tqdm import tqdm

In [3]:
## write your code here.
X = np.loadtxt('X_knn.csv')
y = np.loadtxt('y_knn.csv')

In [131]:
##Splitting data set
X_train, X_, y_train, y_ = train_test_split(X, y, test_size= 0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_,y_,test_size = 0.5,random_state=42)
y_val = y_val[:,np.newaxis]
y_test = y_test[:,np.newaxis]

In [127]:
def euclidian_distance(a, b):
    return np.sqrt(np.sum((a - b)**2, axis=1))

def kneighbors(X_val, n_neighbors):

    dist = []
    neigh_ind = []

    point_dist = [euclidian_distance(x_val, X_train) for x_val in X_val]

    for row in point_dist:
        enum_neigh = enumerate(row)
        sorted_neigh = sorted(enum_neigh,
                                  key=lambda x: x[1])[:n_neighbors]

        ind_list = [tup[0] for tup in sorted_neigh]
        dist_list = [tup[1] for tup in sorted_neigh]

        dist.append(dist_list)
        neigh_ind.append(ind_list)

    return np.array(neigh_ind)


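K = [1,3,7,15,31,63]
for k in tqdm(K):
    # Indices of the k nearest training points for every validation point
    neighbour = kneighbors(X_val,k)
    # Labels of those neighbours, one row per validation point
    y_close = y_train[neighbour]
    # Majority vote over each row of neighbour labels
    y_pred = np.array(stats.mode(y_close,axis = 1))[0]
    score = float(sum(y_pred == y_val))/ float(len(y_val))
    print("Validation score K =",k,":",score*100)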

Validation score K = 1 : 84.89999999999999
Validation score K = 3 : 84.7
Validation score K = 7 : 84.8
Validation score K = 15 : 85.9
Validation score K = 31 : 85.6
Validation score K = 63 : 84.8

K = 15 gives the highest validation accuracy (85.9%), so it is the value used on the test set.
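The neighbour search above recomputes and fully sorts all distances for every k, which is why each iteration takes about half a minute. A possible faster variant is sketched below. It is not the notebook's implementation: the function name `knn_predict` and the toy arrays are illustrative assumptions, and the majority vote assumes 0/1 labels with an odd k.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (
        np.sum(X_query ** 2, axis=1)[:, None]
        - 2.0 * X_query @ X_train.T
        + np.sum(X_train ** 2, axis=1)[None, :]
    )
    # Indices of the k smallest distances per row; their internal order
    # does not matter for voting, so a partial sort is enough.
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    votes = y_train[idx]  # shape (n_query, k)
    # With 0/1 labels and odd k, a mean above 0.5 is a strict majority for class 1
    return (votes.mean(axis=1) > 0.5).astype(int)

# Hypothetical toy data: two well-separated clusters
X_demo = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y_demo = np.array([0, 0, 0, 1, 1, 1])
Q_demo = np.array([[0.2, 0.1], [5.5, 5.5]])
print(knn_predict(X_demo, y_demo, Q_demo, 3))
```

Because the squared distance preserves the ordering of the Euclidean distance, the square root can be skipped when only the neighbour ranking is needed.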

In [132]:
###Test set accuracy
neighbour = kneighbors(X_test,15)
y_close = y_train[neighbour]
y_pred = np.array(stats.mode(y_close,axis = 1))[0]
score = float(sum(y_pred == y_test))/ float(len(y_test))
print("Test set score K =",15,":",score*100)

Test set score K = 15 : 86.9


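(b) State why an even value of k should not be chosen in k-NN.

With two classes, an even k can split the neighbours' votes evenly (k/2 for each class), which leaves the majority vote undecided. An odd k guarantees a strict majority, so k is chosen odd to avoid ties.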
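The tie can be seen in a small voting example. The helper `majority_vote` below is a hypothetical illustration, not part of the assignment code; it returns every label that reaches the top vote count, so more than one entry signals a tie.

```python
from collections import Counter

def majority_vote(labels):
    """Return the sorted list of labels sharing the highest vote count."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(label for label, c in counts.items() if c == top)

# k = 2: one neighbour from each class, so the vote is tied
print(majority_vote([0, 1]))
# k = 3: three binary votes can never split evenly
print(majority_vote([0, 1, 1]))
```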

## Learning Naive Bayes' classifier  

### From Continuous Distribution of data

Here, the form of the data distribution is already known ($X$ represents the data points and $Y$ the 0-1 binary class label, where 0 is the negative class and 1 is the positive class).
<br>Consider the following one-dimensional (1-D) Gaussian distributions, whose means and variances are unknown. You need to estimate the means ($\mu_-$ for the negative class and $\mu_+$ for the positive class) and the variances ($\sigma^{2}_{-}$ for the negative class and $\sigma^{2}_+$ for the positive class) from the given data:
<br>(1) Assume $X \mid Y=0 \sim \mathcal{N}(\mu_- , \sigma^{2}_-)$
<br>(2) Assume $X \mid Y=1 \sim \mathcal{N}(\mu_+ , \sigma^{2}_+)$


*Generating artificial datasets in the next cell*

In [240]:
## This cell is for generating datasets. Students should not change anything in this cell. 
## You can compare your mean and variance estimates by the actual ones used to generate these datasets

import numpy as np
X_pos = np.random.randn(1000,1)+np.array([[2.]])
X_neg = np.random.randn(1000,1)+np.array([[4.]])
X_train_pos = X_pos[:900]
X_train_neg = X_neg[:900]
X_test_pos = X_pos[900:]
X_test_neg = X_neg[900:]
X_train = np.concatenate((X_train_pos, X_train_neg), axis=0)
X_test = np.concatenate((X_test_pos, X_test_neg), axis=0)
Y_train = np.concatenate(( np.ones(900),np.zeros(900) ))
Y_test = np.concatenate(( np.ones(100), np.zeros(100) ))
Y_test = Y_test[:,np.newaxis]

## X_train, X_test, Y_train, Y_test are your datasets to work with ####



<br>**Instructions to follow for learning a Bayesian classifier:** *(Code the formulae for estimating the different parameters yourself)*
<br>a) Use the training dataset to estimate the means ($\hat{\mu_+}$, $\hat{\mu_-}$) and variances ($\hat{\sigma^{2}_+}$, $\hat{\sigma^{2}_-}$) for both the positive and negative classes
<br>b) Estimate the prior probability $P(Y=1)$, referred to as $\hat{a}$
<br>c) Estimate the classifier function/posterior probability $P(Y=1|X = x)$, referred to as $\hat{\eta}(x)$
<br>d) Find the threshold value ($x^*$) for classification by equating the estimated classifier function ($\hat{\eta}(x)$) with the threshold probability of 0.5
<br>e) Classify the test dataset into the two classes using this threshold value ($x^*$) and find the **accuracy** of the prediction

Return: $\hat{\mu_+}$, $\hat{\mu_-}$, $\hat{\sigma^{2}_+}$, $\hat{\sigma^{2}_-}$, $\hat{a}$, $x^*$ and the accuracy from the code written

*Hint: $X \mid Y=0 \sim \mathcal{N}(\mu_- , \sigma^{2}_-)$ implies $P_{X \mid Y=0} = \mathcal{N}(\mu_- , \sigma^{2}_-)$*
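Step d) can be worked out in closed form. The derivation below is a sketch under the Gaussian assumptions stated above, using the same estimate symbols. Setting $\hat{\eta}(x^*) = 0.5$ means the two weighted class densities are equal:

$$\hat{a}\,\mathcal{N}(x^*;\hat{\mu_+},\hat{\sigma^{2}_+}) = (1-\hat{a})\,\mathcal{N}(x^*;\hat{\mu_-},\hat{\sigma^{2}_-})$$

Taking logarithms and multiplying through by $2\hat{\sigma^{2}_+}\hat{\sigma^{2}_-}$ gives a quadratic in $x^*$:

$$(\hat{\sigma^{2}_+}-\hat{\sigma^{2}_-})\,{x^*}^2 + 2(\hat{\sigma^{2}_-}\hat{\mu_+}-\hat{\sigma^{2}_+}\hat{\mu_-})\,x^* + \hat{\sigma^{2}_+}\hat{\mu_-}^2 - \hat{\sigma^{2}_-}\hat{\mu_+}^2 - \hat{\sigma^{2}_+}\hat{\sigma^{2}_-}\ln\frac{\hat{\sigma^{2}_+}}{\hat{\sigma^{2}_-}} + 2\hat{\sigma^{2}_+}\hat{\sigma^{2}_-}\ln\frac{\hat{a}}{1-\hat{a}} = 0$$

With equal priors ($\hat{a} = 0.5$) the last term vanishes. If, in addition, the two variances are equal to a common $\hat{\sigma}^2$, the quadratic term drops out and the threshold reduces to the midpoint of the means; for a general prior it becomes

$$x^* = \frac{\hat{\mu_+}+\hat{\mu_-}}{2} - \frac{\hat{\sigma}^2}{\hat{\mu_+}-\hat{\mu_-}}\ln\frac{\hat{a}}{1-\hat{a}}$$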


In [241]:
## write your code here.  
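# Note: the datasets were generated in the previous cell, so seeding here does not affect them
np.random.seed(10)
def fitNormal(x):
    """
    PARAMETERS:
    x : input 1D data
    
    RETURNS:
    mu : mean of distribution
    var : variance of distribution (maximum-likelihood estimate, ddof=0)
    """
    mu = np.mean(x)
    var = np.var(x)
    
    return [mu, var]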


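def find_threshold(mu_pos,var_pos,mu_neg,var_neg):
    """Solve eta(x) = 0.5 for x, assuming equal class priors (a = 0.5)."""
    if var_pos == var_neg:
        # Equal variances: the boundary is the midpoint of the two means
        return (mu_pos + mu_neg)/2
    else:
        # Unequal variances: equating the two Gaussian densities gives A*x**2 + B*x + C = 0
        A = var_pos-var_neg
        B = 2*((var_neg*mu_pos)-(var_pos*mu_neg))
        C = var_pos*mu_neg**2-var_neg*mu_pos**2-var_pos*var_neg*np.log(var_pos/var_neg)
        D = B**2-4*A*C
        # For the estimates obtained here, the '-' root is the one lying between the two class means
        return ((-B-np.sqrt(D))/(2*A))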
        

X_train_pos = X_train[Y_train==1]
X_train_neg = X_train[Y_train==0]
mu_pos, var_pos = fitNormal(X_train_pos)
mu_neg, var_neg = fitNormal(X_train_neg)
a = len(Y_train[Y_train==1])/len(Y_train)
x_thresh = find_threshold(mu_pos,var_pos,mu_neg,var_neg)
Y_pred = (X_test < x_thresh).astype(int)
score = float(sum(Y_pred == Y_test))/ float(len(Y_test))
print("mu_pos:",mu_pos)
print("mu_neg",mu_neg)
print("var_pos:",var_pos)
print("var_neg:",var_neg)
print("a_hat:",a)
print("x_thresh:",x_thresh)
print("Test set accuracy:",score*100)

mu_pos: 1.93150541977847
mu_neg 4.011580775217707
var_pos: 0.9788711095557733
var_neg: 0.9413264569474702
a_hat: 0.5
x_thresh: 2.9726882457635204
Test set accuracy: 84.0
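As a sanity check on the threshold, the posterior $\hat{\eta}(x)$ can be evaluated directly at $x^*$, where it should come out at 0.5. The sketch below plugs in the estimates printed above; the helper names `gaussian_pdf` and `posterior_pos` are illustrative assumptions, not functions from the notebook.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_pos(x, mu_pos, var_pos, mu_neg, var_neg, prior_pos):
    """P(Y=1 | X=x) from Bayes' rule with Gaussian class-conditionals."""
    num = prior_pos * gaussian_pdf(x, mu_pos, var_pos)
    den = num + (1.0 - prior_pos) * gaussian_pdf(x, mu_neg, var_neg)
    return num / den

# Estimates copied from the output printed above
est = dict(mu_pos=1.93150541977847, var_pos=0.9788711095557733,
           mu_neg=4.011580775217707, var_neg=0.9413264569474702,
           prior_pos=0.5)
x_star = 2.9726882457635204

eta_at_threshold = posterior_pos(x_star, **est)
print(eta_at_threshold)             # expected to be very close to 0.5
print(posterior_pos(1.0, **est))    # left of x*: positive class more likely
print(posterior_pos(5.0, **est))    # right of x*: negative class more likely
```

Since the positive class has the smaller mean, points to the left of $x^*$ are assigned label 1, which matches the `X_test < x_thresh` rule used in the cell above.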


### From Discrete distribution of data

Unlike the first exercise on learning the Naive Bayes classifier, which dealt with continuously distributed data, here you work with discrete data, which means estimating a probability mass function (PMF).

Age  | Income | Status  | Buy
-----|--------|-------- |----
<=20 |  low   | students| yes
<=20 |  high  | students| yes
<=20 | medium | students| no
<=20 | medium | married | no
<=20 |  high  | married | yes
21-30|  low   | married | yes
21-30|  low   | married | no 
21-30| medium | students| no
21-30|  high  | students| yes
 >30 |  high  | married | no
 >30 |  high  | married | yes
 >30 | medium | married | yes
 >30 | medium | married | no
 >30 | medium | students| no
 
Consider the training dataset above. Take any random data point ($X_{i}$), where $X_{i} = (X_{i,1} = Age, X_{i,2} = Income, X_{i,3} = Status)$, and its corresponding label ($Y_{i} = Buy$). A "yes" in Buy corresponds to label 1 and a "no" in Buy corresponds to label 0.

<br>**Instructions to follow for learning a Bayesian classifier:** *(Code the formulae for estimating the different parameters yourself)*
<br>a) Estimate the prior probability $P(Y=1)$, referred to as $\hat{a}$
<br>b) Estimate the likelihood for each feature: $P(X_{i,j} = x |Y = y_{i})$, where $i$ is the data point counter, $j \in \{1,2,3\}$ and $y_{i} \in \{0,1\}$
<br>c) Estimate the total likelihood: $P(X_{i} = x |Y = y_{i})$
<br>d) Calculate the posterior probability $P(Y = 1|X_{i} = x_{test}) = p_{test}$, where $x_{test} = (Age = 21-30, Income = medium, Status = married)$


Return: $\hat{a}$, the total likelihood and $p_{test}$
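For reference, steps c) and d) follow from the Naive Bayes assumption that the features are conditionally independent given the label. A sketch of the two formulas, in the notation used above:

$$P(X_{i} = x \mid Y = y) = \prod_{j=1}^{3} P(X_{i,j} = x_j \mid Y = y)$$

$$p_{test} = P(Y = 1 \mid X_{i} = x_{test}) = \frac{\hat{a}\,P(X_{i} = x_{test} \mid Y = 1)}{\hat{a}\,P(X_{i} = x_{test} \mid Y = 1) + (1-\hat{a})\,P(X_{i} = x_{test} \mid Y = 0)}$$

The denominator is $P(X_{i} = x_{test})$, obtained by summing the joint probability over both classes; without it the result is a joint probability, not a posterior.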


In [292]:
## write your code here.
### Data frame defn
c1 = ['<=20','<=20','<=20','<=20','<=20','21-30','21-30','21-30','21-30','>30','>30','>30','>30','>30']
c2 = ['low','high','medium','medium','high','low','low','medium','high','high','high','medium','medium','medium']
c3 = ['students','students','students','married','married','married','married','students','students','married','married','married','married','students']
buy = ['yes','yes','no','no','yes','yes','no','no','yes','no','yes','yes','no','no']
c4 = [1 if b == 'yes' else 0 for b in buy]  # encode yes -> 1, no -> 0
df = pd.DataFrame({'Age':c1,'Income':c2,'Status':c3,'Buy':c4})

In [294]:
df

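|   | Age   | Income | Status   | Buy |
|---|-------|--------|----------|-----|
| 0 | <=20  | low    | students | 1   |
| 1 | <=20  | high   | students | 1   |
| 2 | <=20  | medium | students | 0   |
| 3 | <=20  | medium | married  | 0   |
| 4 | <=20  | high   | married  | 1   |
| 5 | 21-30 | low    | married  | 1   |
| 6 | 21-30 | low    | married  | 0   |
| 7 | 21-30 | medium | students | 0   |
| 8 | 21-30 | high   | students | 1   |
| 9 | >30   | high   | married  | 0   |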


In [329]:
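a_hat = np.sum((df['Buy'] == 1))/len(df)  # prior P(Y=1)
likelihood = []
X_test = ['21-30','medium','married']
# Per-feature conditional probabilities P(X_j = x | Y = 1), estimated from the positive-class rows
df2 = df[df['Buy'] == 1]
prob1 = dict(df2['Age'].value_counts()/len(df2))
prob2 = dict(df2['Income'].value_counts()/len(df2))
prob3 = dict(df2['Status'].value_counts()/len(df2))
for i in range(len(c4)):
    P1 = prob1[df.at[i,'Age']]
    P2 = prob2[df.at[i,'Income']]
    P3 = prob3[df.at[i,'Status']]
    # Naive Bayes: the total likelihood is the product of the per-feature likelihoods.
    # Note that every row is scored under Y = 1 here, whatever its own label.
    likelihood.append(P1*P2*P3)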

In [330]:
likelihood

[0.05247813411078716,
 0.10495626822157432,
 0.02623906705539358,
 0.034985422740524776,
 0.1399416909620991,
 0.0466472303206997,
 0.0466472303206997,
 0.017492711370262388,
 0.06997084548104955,
 0.0932944606413994,
 0.0932944606413994,
 0.02332361516034985,
 0.02332361516034985,
 0.017492711370262388]

In [327]:
a_hat

0.5

In [336]:
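# This is the unnormalized posterior (joint probability) P(Y=1) * P(x_test | Y=1);
# dividing it by P(x_test), summed over both classes, gives the posterior P(Y=1 | x_test).
p_test = a_hat*prob1[X_test[0]]*prob2[X_test[1]]*prob3[X_test[2]]
print(p_test)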

0.011661807580174925
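The value printed above is the joint probability $P(Y=1)\,P(x_{test} \mid Y=1)$. To turn it into the posterior asked for in step d), it has to be divided by the sum of the joint probabilities of both classes. The sketch below redoes the calculation in plain Python on the same 14 rows; the names `ROWS`, `class_likelihood` and `x_query` are illustrative assumptions, not part of the notebook.

```python
# (Age, Income, Status, Buy) rows copied from the training table; Buy: yes -> 1, no -> 0
ROWS = [
    ("<=20", "low", "students", 1), ("<=20", "high", "students", 1),
    ("<=20", "medium", "students", 0), ("<=20", "medium", "married", 0),
    ("<=20", "high", "married", 1), ("21-30", "low", "married", 1),
    ("21-30", "low", "married", 0), ("21-30", "medium", "students", 0),
    ("21-30", "high", "students", 1), (">30", "high", "married", 0),
    (">30", "high", "married", 1), (">30", "medium", "married", 1),
    (">30", "medium", "married", 0), (">30", "medium", "students", 0),
]

def class_likelihood(rows, x, label):
    """P(X = x | Y = label) as a product of per-feature relative frequencies."""
    subset = [r for r in rows if r[-1] == label]
    lik = 1.0
    for j, value in enumerate(x):
        lik *= sum(r[j] == value for r in subset) / len(subset)
    return lik

x_query = ("21-30", "medium", "married")
prior_pos = sum(r[-1] == 1 for r in ROWS) / len(ROWS)

joint_pos = prior_pos * class_likelihood(ROWS, x_query, 1)          # P(Y=1, x)
joint_neg = (1.0 - prior_pos) * class_likelihood(ROWS, x_query, 0)  # P(Y=0, x)
posterior = joint_pos / (joint_pos + joint_neg)                     # P(Y=1 | x)

print(joint_pos)   # matches the value printed above
print(posterior)   # (4/343) / (4/343 + 20/343) = 1/6
```

Under these estimates the normalized posterior is 1/6, so the test point would be classified as label 0 ("no"). No smoothing is applied here; a feature value that never occurs in a class would give that class a likelihood of zero.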
