Assignment: Classification
Classification refers to categorizing the given data into classes. For example,

Given an image of hand-written character, identifying the character (multi-class classification)
Given an image, annotating it with all the objects present in the image (multi-label classification)
Classifying an email as spam or non-spam (binary classification)
Classifying a tumor as benign or malignant and so on
In this assignment, we will be building a classifier to classify emails as spam or non-spam. We will be using the Kaggle dataset Spam or Not Spam Dataset for this task.

Note: You cannot load any libraries other than the mentioned ones.

Data pre-processing
The first step in every machine learning algorithm is to process the raw data in some meaningful representations. We will be using the Bag-of-Words representation to process the text. It comprises of following steps:

Process emails line-by-line to extract all the words.
Replace extracted words by their stem (root) word. This is known as stemming and lematization.
Remove stop words like and, or, is, am, and so on.
Assign a unique index to each word. This forms the vocabulary.
Represent each email as a binary vector of length equal to the size of the vocabulary such that the 
 element of the vector is 1 iff the 
 word is present in the email.
Here we provide you with the function signature along with the expected functionality. You are expected to complete them accordingly.

In [1]:
import numpy as np
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [2]:
#takes an email as an argument
# read email line-by-line and extract all the words
# return list of extracted words
def read_email(email):
  list_of_words=email.split(" ")
  return list_of_words
  
# takes a list of words as an argument
# replace each word by their stem word
# return list of stem words
def stemming(list_of_words):
  list_of_stem_words=[]
  ps= PorterStemmer()
  for word in list_of_words:
    list_of_stem_words.append(ps.stem(word))
  return list_of_stem_words

# takes a list of stem-words as an argument
# remove stop words
# return list of stem words after removing stop words
def remove_stop_words(list_of_stem_words):
  stem_no_stop_words=[]
  list_of_stop_words=set(stopwords.words('english'))
  for word in list_of_stem_words:
    if word not in list_of_stop_words:
      stem_no_stop_words.append(word)
  return stem_no_stop_words

# takes a list of stem-words as an argument
# add new words to the vocabulary and assign a unique index to them
# returns new vocabulary
def build_vocabulary(stem_no_stop_words):
  vocab=[]
  for list in stem_no_stop_words:
    for word in list:
      if word not in vocab:
        vocab.append(word)
  vocab.pop(0)
  return vocab

# takes a list of stem-words and vocabulary as an argument
# returns bow representation
def get_bow(stem_no_stop_words,vocab):
  email_bow=[]
  for list in stem_no_stop_words:
    element_email_bow=[]
    for word in vocab:
      if word in list:
        element_email_bow.append(1)
      else:
        element_email_bow.append(0)
    email_bow.append(element_email_bow)
  return email_bow

# read the entire dataset
# convert emails to bow and maintain their labels
# call function text_to_bow()
def read_data():
  list_of_lines = []
  with open('spam_or_not_spam.csv', 'r', encoding='utf8') as file:
        for line in file:
                elements = line.split(',')
                list_of_lines.append([elements[0],elements[1][0][0]])
  list_of_lines.pop(0)
  list_of_stem_words = []
  dataset= []
  for line in list_of_lines:
    r= read_email(line[0])
    s=stemming(r)
    f=remove_stop_words(s)
    list_of_stem_words.append(f)
  vocab= build_vocabulary(list_of_stem_words)
  email_bow = get_bow(list_of_stem_words,vocab)
  n = len(list_of_lines)
  for i in range(0,n):
    dataset.append([email_bow[i],int(list_of_lines[i][1])])
  return dataset, vocab

In [3]:
data,vocab=read_data()
print(vocab)

LookupError: 
**********************************************************************
  Resource [93mstopwords[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('stopwords')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mcorpora/stopwords[0m

  Searched in:
    - 'C:\\Users\\DELL/nltk_data'
    - 'D:\\Anaconda\\nltk_data'
    - 'D:\\Anaconda\\share\\nltk_data'
    - 'D:\\Anaconda\\lib\\nltk_data'
    - 'C:\\Users\\DELL\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


In [None]:
import matplotlib.pyplot as plt

In [None]:
# visuallze data distribution
def data_vis(dataset, vocab):
  list_of_all_words=[0 for word in vocab]
  list_of_spam=[0 for word in vocab]
  list_of_non_spam=[0 for word in vocab]

  # print(dataset)
  for line in dataset:
    for i in range(0,len(line[0])):
      if line[0][i] == 1:
        list_of_all_words[i]+=1
        if line[1]==0:
          list_of_non_spam[i]+=1
        else:
          list_of_spam[i]+=1

  plt.bar(vocab, list_of_all_words, color = "purple")
  plt.title("Plot for Spam and Non-Spam emails")
  plt.show()

  plt.bar(vocab, list_of_spam, color = "purple")
  plt.title("Plot for Non-Spam emails")
  plt.show()

  plt.bar(vocab, list_of_non_spam, color = "purple")
  plt.title("Plot for Spam emails")
  plt.show()
  return

data_vis(data,vocab)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

In [None]:
# split dataset
def split(dataset):
  np.random.shuffle(dataset)
  separation_index= int(len(dataset)*8/10)
  train_dataset=[]
  test_dataset=[]
  index=0
  for data in dataset:
    if index < separation_index:
      train_dataset.append(data)
    else:
      test_dataset.append(data)
    index=index+1
  return train_dataset, test_dataset

# learn a SVM model
# use the model to make prediction
# return the model predictions on train and test dataset
def svm_classifier(train_dataset, test_dataset):
  train_data=[]
  train_label=[]
  test_data=[]
  test_label=[]

  for data in train_dataset:
    train_data.append(data[0])
    train_label.append(data[1])

  for data in test_dataset:
    test_data.append(data[0])
    test_label.append(data[1])

  clf = svm.SVC()
  clf.fit(train_data,train_label)

  svm_train_predictions= clf.predict(train_data)
  svm_test_predictions = clf.predict(test_data)

  return svm_train_predictions, svm_test_predictions

# implement k-NN algorithm
# use the model to make prediction
# return the model predictions on train and test dataset
def knn_classifier(train_dataset, test_dataset):
  train_data=[]
  train_label=[]
  test_data=[]
  test_label=[]

  for data in train_dataset:
    train_data.append(data[0])
    train_label.append(data[1])

  for data in test_dataset:
    test_data.append(data[0])
    test_label.append(data[1])

  knn= KNeighborsClassifier(n_neighbors=3)
  knn.fit(train_data,train_label)

  knn_train_predictions= knn.predict(train_data)
  knn_test_predictions = knn.predict(test_data)

  return knn_train_predictions, knn_test_predictions

train_data, test_data = split(data)
svm_train_predictions, svm_test_predictions = svm_classifier(train_data, test_data)
knn_train_predictions, knn_test_predictions = knn_classifier(train_data, test_data)

In [None]:
from sklearn import metrics

In [None]:
# compute accuracy 
def compute_accuracy(true_labels, predicted_labels):
  accuracy= metrics.accuracy_score(true_labels,predicted_labels)
  return accuracy

# compute AUC score 
def compute_auc(true_labels, predicted_labels):
  false_positive_rate,true_positive_rate,threshold = metrics.roc_curve(true_labels,predicted_labels)
  auc_score=metrics.auc(false_positive_rate,true_positive_rate)
  return auc_score

# write code to print train and test accuracy and AUC score of SVM and k-NN classifier

In [None]:
train_dataset, test_dataset = split(data)
svm_train_predictions,svm_test_predictions=svm_classifier(train_dataset, test_dataset)
knn_train_predictions,knn_test_predictions=knn_classifier(train_dataset, test_dataset)

true_train_label=[]
true_test_label=[]
for data in train_dataset:
    true_train_label.append(data[1])

for data in test_dataset:
    true_test_label.append(data[1])

print("SVM Classifier")
print("Train accuracy: ",compute_accuracy(true_train_label,svm_train_predictions))
print("Train auc_score: ",compute_auc(true_train_label,svm_train_predictions))
print("Test accuracy: ",compute_accuracy(true_test_label,svm_test_predictions))
print("Test auc_score: ",compute_auc(true_test_label,svm_test_predictions))

print("KNN Classifier")
print("Train accuracy: ",compute_accuracy(true_train_label,knn_train_predictions))
print("Train auc_score: ",compute_auc(true_train_label,knn_train_predictions))
print("Test accuracy: ",compute_accuracy(true_test_label,knn_test_predictions))
print("Test auc_score: ",compute_auc(true_test_label,knn_test_predictions))