# <img src="https://img.icons8.com/dusk/64/000000/mind-map.png" style="height:50px;display:inline"> EE 044165/6 - Technion - Intro to Machine Learning Lab

## Part 2 - K-NN and Perceptron

### <img src="https://img.icons8.com/bubbles/50/000000/checklist.png" style="height:50px;display:inline"> Agenda

* Recap of Part 1
    * Loading the Data
    * Data Representation
    * Train-Test Separation
    * Naive Bayes
* K Nearest Neighbors (K-NN)
* The Perceptron
* Final Comparison

#### Notes
* To run a code block, select it with the mouse and press Ctrl + Enter to run it, or Shift + Enter to run it and move on to the next block.
* To get a description of a function or class, run `help(name_of_function)`.
* To display line numbers in a code block, select the block, press ESC and then 'L'.

In [None]:
# imports for the lab
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from helper_functions import email_pipeline
from tqdm import tqdm
# import K-NN classifier
from sklearn.neighbors import KNeighborsClassifier
# import TF-IDF pre-processor
from sklearn.feature_extraction.text import TfidfTransformer

## <img src="https://img.icons8.com/dusk/64/000000/rewind.png" style="height:50px;display:inline"> Recap of Part 1
We will now repeat the process of loading the data, pre-processing it, and splitting it.

#### Copy & Paste relevant code from the previous lab

In [None]:
# load the data
"""
Your Code Here
"""
"""Starter code
# email_data = 
"""
email_data = pd.read_csv('./email_data.csv')
# let's look at 15 random samples from it.
email_data.sample(15)

In [None]:
X = email_data['Content'].values
y = email_data['Label'].values == 'S'  # True (1) for Spam, False (0) for Ham

# split to train and test
"""
Your Code Here
"""
X_train, X_test, y_train, y_test = 



"""
Your Code Here
"""
# transform using email_pipeline
X_train_augmented =
X_test_augmented =


# get statistics
print("num train samples: ", X_train.shape[0])
print("num test samples: ", X_test.shape[0])
print("shape after augmentation: ", X_train_augmented.shape)
print("fraction of spam in the original: ", np.mean(y == 1))
print("fraction of spam in the train set: ", np.mean(y_train == 1))

In [None]:
# run this cell
# (it is just the estimate_likelihood_params function from the previous lab)

def estimate_likelihood_params(X, y, dist_type="gaussian", c=0.5, num_classes=2):
    """
    Estimate the parameters of the likelihood P(X|y,theta)
    :param X: features
    :param y: labels
    :param dist_type: type of distribution: "gaussian", "bernoulli", "multinomial", "multinomial_smooth"
    :param c: smoothing parameter for "multinomial_smooth"
    :param num_classes: number of classes
    :return likelihood_params
    """
    if isinstance(X, csr_matrix):
        X = X.todense()
    n_samples = X.shape[0]
    n_feat = X.shape[1]
    params = {'type': dist_type}
    if dist_type == 'gaussian':
        mu_s = np.zeros((num_classes, n_feat))
        sigmaSqr_s = np.zeros((num_classes, n_feat))
        for i_class in range(num_classes):
            mu_s[i_class] = X[y == i_class].mean(axis=0)
            sigmaSqr_s[i_class] = np.square(X[y == i_class] - mu_s[i_class]).mean(axis=0)
        params['mu'] = mu_s
        params['sigmaSqr'] = sigmaSqr_s

    elif dist_type == 'bernoulli':
        p_s = np.zeros((num_classes, n_feat))
        for i_class in range(num_classes):
            x_i = X[y == i_class]
            # change to 0-1 (binary features)
            x_i[x_i > 0] = 1
            p_s[i_class] = x_i.mean(axis=0)
        params['p'] = p_s

    elif dist_type == 'multinomial':
        p_s = np.zeros((num_classes, n_feat))
        for i_class in range(num_classes):
            x_i = X[y == i_class]
            T = np.sum(x_i)
            p_s[i_class] = np.sum(x_i, axis=0) / T
        params['p'] = p_s
    elif dist_type == 'multinomial_smooth':
        p_s = np.zeros((num_classes, n_feat))
        for i_class in range(num_classes):
            x_i = X[y == i_class]
            T = np.sum(x_i[:]) + c * n_feat
            p_s[i_class] = (c + np.sum(x_i, axis=0)) / T
        params['p'] = p_s
    else:
        print("unknown distribution!")
        return
    return params

In [None]:
# copy and paste your implemented Naive Bayes Classifier and implemented function calc_err to use later

"""
Your Code Here
Paste here your MlabNaiveBayes class and your calc_err function
"""






In [None]:
#  complete the following function

def evaluate_classifier(clf, X, y, test_size=0.2, num_repeats=20):
    current_errors = np.zeros(num_repeats)
    for i_rep in tqdm(range(num_repeats)):
        """
        Your Code Here
        """
        # split and pre-process
        X_train, X_test, y_train, y_test =
        X_augmented_train =
        X_augmented_test =
        # train
        
        
        # test
        y_pred =
        # calculate error
        current_errors[i_rep] =
     
    error_mean = np.mean(current_errors)
    error_std = np.std(current_errors)
    return error_mean, error_std


## <img src="https://img.icons8.com/dusk/64/000000/rewind.png" style="height:50px;display:inline"> K Nearest Neighbors
We will now use a K-NN classifier to complete the classification task. You will use Scikit-Learn's K-NN classifier, `KNeighborsClassifier`.

Usage:

`clf = KNeighborsClassifier(n_neighbors=K, p=2)` or `KNeighborsClassifier(n_neighbors=K, metric='cosine')`

`clf.fit(X_augmented_train, y_train)`

`y_pred = clf.predict(X_augmented_test)`
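
To see the three distance options side by side before running them on the emails, here is a minimal sketch on made-up data. The arrays `X_toy_train`, `y_toy_train`, and `X_toy_test`, and the dictionary `toy_preds`, are hypothetical names introduced only for this illustration. They are not part of the lab's dataset or of the expected solution.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical "word count" features: two words, six training messages.
# Class 1 messages mostly use word 0, class 0 messages mostly use word 1.
X_toy_train = np.array([[5, 0], [4, 1], [6, 1], [0, 5], [1, 4], [1, 6]])
y_toy_train = np.array([1, 1, 1, 0, 0, 0])
X_toy_test = np.array([[10, 1], [1, 10]])

# The same classifier configured with each of the three distances.
distance_options = {
    "L2": {"p": 2},
    "L1": {"p": 1},
    "cosine": {"metric": "cosine"},
}

toy_preds = {}
for name, kwargs in distance_options.items():
    clf = KNeighborsClassifier(n_neighbors=3, **kwargs)
    clf.fit(X_toy_train, y_toy_train)
    toy_preds[name] = clf.predict(X_toy_test)
    print(name, toy_preds[name])
```

On this well-separated toy set all three distances agree. On the real, high-dimensional email features they may not, which is what the next cells measure.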

In [None]:
# using distances:
# L2 - p=2 [KNeighborsClassifier(n_neighbors=K, p=2)]
# L1 - p=1 [KNeighborsClassifier(n_neighbors=K, p=1)]
# Cosine Distance - metric='cosine' [KNeighborsClassifier(n_neighbors=K, metric='cosine')]

"""
Your Code Here
"""

# num neighbors
K =

# L2 distance

print("l2 error: {}, std: {}".format(l2_error, l2_error_std))

# L1 distance

print("l1 error: {}, std: {}".format(l1_error, l1_error_std))

# Cosine distance

print("cosine dist error: {}, std: {}".format(cos_error, cos_error_std))



In [None]:
# summary table
summary_df = pd.DataFrame(np.concatenate([np.array([l2_error, l1_error, cos_error]).reshape(-1, 1),
                                       np.array([1 - l2_error, 1 - l1_error, 1 - cos_error]).reshape(-1, 1),
                                       np.array([l2_error_std, l1_error_std, cos_error_std]).reshape(-1,1)],axis=1),
                       columns=['Error', 'Accuracy', 'Error STD'], index=['L2', 'L1', 'Cosine'])
summary_df

In [None]:
# performance vs. K
K_s = [1, 3, 5, 7, 15]  # num neighbors
# using distances:
# Cosine Distance - metric='cosine' [KNeighborsClassifier(n_neighbors=K, metric='cosine')]
num_repeats = 10
K_errors = np.zeros(len(K_s))
K_errors_std = np.zeros(len(K_s))

"""
 Your Code Here
 """


In [None]:
# plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111)
"""
Your Code Here
Use ax.errorbar()
"""

ax.set_xlabel("number of neighbors (K)")
ax.set_ylabel("error %")
ax.xaxis.set_major_locator(MaxNLocator(integer=True))
ax.grid()
ax.set_title("Test Error vs. Number of Neighbors (N={} Repeats)".format(num_repeats))
plt.show()

### <img src="https://img.icons8.com/color/96/000000/transformer.png" style="height:50px;display:inline"> TF-IDF Transformation
We will now apply the TF-IDF transformation as another pre-processing stage of the data.

Usage:

`tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)`

`X_augmented_tfidf_train = tfidf_transformer.fit_transform(X_augmented_train)`

`X_augmented_tfidf_test = tfidf_transformer.transform(X_augmented_test)`
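
As a rough illustration of what the transformer does, the sketch below applies it to a tiny hypothetical count matrix. The names `counts_train`, `counts_test`, and `toy_tfidf` are made up for this example. A word that appears in every training document gets the lowest IDF weight, and with the default settings each output row is scaled to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical word-count matrix: 3 documents x 3 words.
# Word 0 appears in every document; words 1 and 2 appear in one document each.
counts_train = np.array([[3, 1, 0],
                         [2, 0, 0],
                         [4, 0, 2]])
counts_test = np.array([[1, 1, 0]])

toy_tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_train = toy_tfidf.fit_transform(counts_train)  # learn IDF weights on the train set only
tfidf_test = toy_tfidf.transform(counts_test)        # reuse the train IDF weights on the test set

print("idf weights:", toy_tfidf.idf_)
print("train rows:\n", tfidf_train.toarray())
print("test row:", tfidf_test.toarray())
```

Note that the test counts are only passed through `transform`, never `fit_transform`, so no information from the test set leaks into the IDF weights.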

In [None]:
help(TfidfTransformer)

In [None]:
K = 3  # num neighbors
# using distances:
# Cosine Distance - metric='cosine' [KNeighborsClassifier(n_neighbors=K, metric='cosine')]

num_repeats = 10
test_size = 0.2

current_errors = np.zeros(num_repeats)
current_errors_tfidf = np.zeros(num_repeats)

for i_rep in tqdm(range(num_repeats)):
    """
    Your Code Here
    """
    """ Starter code     """
    # split and pre-process
    X_train, X_test, y_train, y_test =
    X_augmented_train =
    X_augmented_test =

    # train without TF-IDF
    clf =
    # test
    y_pred =
    # calculate error
    current_errors[i_rep] =

    # train with TF-IDF
    tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
    X_augmented_tfidf_train =
    X_augmented_tfidf_test =
    clf_tfidf =
    # test
    y_pred_tfidf =
    # calculate error
    current_errors_tfidf[i_rep] =


cos_error = np.mean(current_errors)
cos_error_std = np.std(current_errors)
print("cosine dist error: {}, std: {}".format(cos_error, cos_error_std))
cos_error_tfidf = np.mean(current_errors_tfidf)
cos_error_std_tfidf = np.std(current_errors_tfidf)
print("cosine dist error with TF-IDF: {}, std: {}".format(cos_error_tfidf, cos_error_std_tfidf))


In [None]:
# summary table
summary_df = pd.DataFrame(np.concatenate([np.array([cos_error, cos_error_tfidf]).reshape(-1, 1),
                                       np.array([1 - cos_error, 1 - cos_error_tfidf]).reshape(-1, 1),
                                       np.array([cos_error_std, cos_error_std_tfidf]).reshape(-1,1)],axis=1),
                       columns=['Error', 'Accuracy', 'Error STD'], index=['Cosine', 'Cosine with TF-IDF'])
summary_df

## <img src="https://img.icons8.com/dusk/64/000000/artificial-intelligence.png" style="height:50px;display:inline"> The Perceptron
We will now implement the Perceptron and test its performance.
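
As a reminder, the standard textbook form of the Perceptron rule is written below. This assumes labels $y_i \in \{-1, +1\}$, a feature vector extended with a constant 1 (written $\tilde{x}_i$ here), and a learning rate $\alpha$. The version taught in the course may differ in details such as how a score of exactly zero is handled.

$$
\hat{y}_i = \operatorname{sign}\left(w \, \tilde{x}_i^{\top}\right), \qquad \tilde{x}_i = [x_i, \; 1]
$$

$$
\text{if } \; y_i \left(w \, \tilde{x}_i^{\top}\right) \le 0: \qquad w \leftarrow w + \alpha \, y_i \, \tilde{x}_i
$$

A sample that is already classified correctly leaves $w$ unchanged, so an epoch with zero updates means the training set is separated by the current weights.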

In [None]:
# Perceptron
class MlabPerceptron():
    "This class implements a Perceptron Classifier"

    def __init__(self, num_epochs=10, alpha=0.5):
        """
        Initialize the classifier
        :param num_epochs: how many epochs to run on the data
        :param alpha: learning rate
        """
        self.num_epochs = num_epochs
        self.alpha = alpha
        self.w = None  # no weights

    def fit(self, X, y, verbose=False):
        """
        Train the classifier
        :param X: features
        :param y: labels
        :param verbose: if True, print the number of updates after each epoch
        """
        if isinstance(X, csr_matrix):
            X = X.todense()
        y = np.array(y, dtype=int)  # np.int was removed in NumPy 1.24; use the built-in int
        y[y == 0] = -1  # convert 0 -> -1
        num_samples = X.shape[0]
        num_features = X.shape[1]
        # initialize weights
        self.w = np.ones((1, num_features + 1))
        # train
        for epoch in range(self.num_epochs):
            num_updates = 0  # how many updates were performed
            """
            Your Code Here
            Remember that you have 2 stopping criteria
            1. Reached maximum number of epochs
            2. No more updates to the weights
            """
            for i_samp in range(num_samples):
                # concatenate 1 to feature vector
                sample = np.append(X[i_samp], np.ones((1,1)), axis=1)
                
                
            # end for i_samp
            if num_updates == 0:
                
            
            if verbose:
                print("epoch {}: {} updates".format(epoch, num_updates))
        # end for epoch

    def predict(self, X):
        """
        Predict labels for features
        :param X: features
        :return y_pred: predictions
        """
        if self.w is None:
            print("can't call 'predict' before 'fit'")
            return
        num_samples = X.shape[0]
        num_features = X.shape[1]
        if isinstance(X, csr_matrix):
            X = X.todense()
        """
        Your Code Here
        """
        """ Starter code     """
        
        for i_samp in range(num_samples):
            # concatenate 1 to feature vector
            sample = np.append(X[i_samp], np.ones((1,1)), axis=1)

            
        # end for i_samp
        y_pred[y_pred == -1] = 0  # convert back -1 -> 0
        return y_pred

In [None]:

# let's see it in action

X = email_data['Content'].values
y = email_data['Label'].values == 'S'  # True (1) for Spam, False (0) for Ham

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_augmented_train = email_pipeline.fit_transform(X_train)
X_augmented_test = email_pipeline.transform(X_test)

# train and test
perc_clf = MlabPerceptron(num_epochs=10, alpha=0.5)
perc_clf.fit(X_augmented_train, y_train, verbose=True)
y_pred = perc_clf.predict(X_augmented_test)
print("Perceptron error (using only 10 epochs): ", calc_err(y_pred, y_test))

In [None]:
# let's look at the weights
print(perc_clf.w)

In [None]:
num_repeats = 20
num_epochs = 50

"""
Your Code Here
"""


print("perceptron error: {}, std: {}".format(perc_error, perc_error_std))

## <img src="https://img.icons8.com/dusk/64/000000/prize.png" style="height:50px;display:inline"> Credits
* Icons from <a href="https://icons8.com/">Icons8</a> - https://icons8.com
* Datasets from <a href="https://www.kaggle.com/">Kaggle</a> - https://www.kaggle.com/
* Notebook made by <a href="mailto:taldanielm@campus.technion.ac.il">Tal Daniel</a>
* Updates: Ron Amit (March 2020)