# Lab Course Machine Learning
## Exercise Sheet 11
##### January, 2022  
##### Kenechukwu Ejimofor
###### Data Analytics
<center>
<b>
Information Systems and Machine Learning Lab<br>
University of Hildesheim<br>
</b>
<br>
<br>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Universit%C3%A4t_Hildesheim_logo.svg/1200px-Universit%C3%A4t_Hildesheim_logo.svg.png" height="10%" width="10%">
</center>



#### Exercise 0: Preprocessing Text Data

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from collections import Counter
from scipy.sparse import csr_matrix #for sparse matrix
import warnings

np.random.seed(3116)
warnings.filterwarnings('ignore')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Kenechukwu
[nltk_data]     Ejimofor\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
category = [ 'sci.med', 'comp.graphics']
dataset =  fetch_20newsgroups(subset='train',categories=category,random_state=3116)

Preview of an example in the dataset

In [3]:
print(dataset.data[0])

From: kaminski@netcom.com (Peter Kaminski)
Subject: Re: Krillean Photography
Lines: 101
Organization: The Information Deli - via Netcom / San Jose, California

[Newsgroups: m.h.a added, followups set to most appropriate groups.]

In <1993Apr19.205615.1013@unlv.edu> todamhyp@charles.unlv.edu (Brian M.
Huey) writes:

>I am looking for any information/supplies that will allow
>do-it-yourselfers to take Krillean Pictures.

(It's "Kirlian".  "Krillean" pictures are portraits of tiny shrimp. :)

[...]

>One might extrapolate here and say that this proves that every object
>within the universe (as we know it) has its own energy signature.

I think it's safe to say that anything that's not at 0 degrees Kelvin
will have its own "energy signature" -- the interesting questions are
what kind of energy, and what it signifies.

I'd check places like Edmund Scientific (are they still in business?) --
or I wonder if you can find ex-Soviet Union equipment for sale somewhere
in the relcom.* hierarchy.



In [4]:
print("\n".join(dataset.data[0].split("\n")[:3]))

From: kaminski@netcom.com (Peter Kaminski)
Subject: Re: Krillean Photography
Lines: 101


In [5]:
#Example classification
print(dataset.target_names[0])

comp.graphics


### Preprocessing textual data to remove punctuation, stop-words 
- I decided to stem the words in order to reduce the vocabulary size that the count vectorizer produces during tokenization

In [6]:
corpus = [] #list to collect the processed documents

#Replace every character that is not a letter (upper or lower case) with a
#space, which removes punctuation and digits, then convert the text to lower case
stemmer = PorterStemmer() #stemming class
stop_words = set(stopwords.words('english')) #build the stop-word set once

for i in range(0, len(dataset.data)):
    news = re.sub('[^a-zA-Z]',' ', dataset.data[i]) 
    news = news.lower()
    news = news.split()
    #Keep only the words that are not stop-words and stem them
    news = [stemmer.stem(j) for j in news if j not in stop_words]
    news = ' '.join(news) #join the words with single spaces
    corpus.append(news)

In [7]:
len(corpus)

1178

In [8]:
#preview data
corpus[0]

'kaminski netcom com peter kaminski subject krillean photographi line organ inform deli via netcom san jose california newsgroup h ad followup set appropri group apr unlv edu todamhyp charl unlv edu brian huey write look inform suppli allow yourself take krillean pictur kirlian krillean pictur portrait tini shrimp one might extrapol say prove everi object within univers know energi signatur think safe say anyth degre kelvin energi signatur interest question kind energi signifi check place like edmund scientif still busi wonder find ex soviet union equip sale somewher relcom hierarchi expans kirlian photographi credul side stanway andrew altern medicin guid natur therapi isbn new york vike penguin p p overli critic still use overview altern health therapi russian engin semyon kirlian wife valentina use altern current high frequenc illumin subject photograph found object good conductor metal pictur show surfac pictur poor conductor show inner structur object even optic opaqu found high f

- Bag-of-words model

For the bag-of-words model we need to implement a count vectorizer (for tokenization). The following steps are required to implement the count vectorizer:
- Get the unique words and assign an index to each of these words
- Iterate through each sentence and get the number of occurrences of each unique word
- Create a sparse matrix with one column per unique word (i.e. number of columns = number of unique words) and fill in the matrix with the respective counts

In [9]:
#Define a function that returns the unique words in a list of texts and their column indices
def word_index(s):
    stop_words = set(stopwords.words('english')) #Build the stop-word set once
    unique_words = set() #A set avoids duplicates
    for sentence in s:
        for word in sentence.split(' '):
            if word not in stop_words: #Exclude stopwords
                unique_words.add(word)
    #Map each word to its position in the sorted vocabulary
    word_dict = {}
    for idx, word in enumerate(sorted(unique_words)):
        word_dict[word] = idx
    return word_dict

In [10]:
#sanity check 
s = ['this is lab eleven', 'here is the count vectorizer', 'this is an example'] #example for testing
word_index(s)

{'count': 0, 'eleven': 1, 'example': 2, 'lab': 3, 'vectorizer': 4}

We can see that the stopwords `'this'`, `'is'`, `'an'`, `'here'` and `'the'` were excluded. Furthermore, the number of unique words is 5, so the count matrix for these 3 sentences has 3 rows with 5 entries each.

In [11]:
def CountVectorizer(data):
    #First step is the unique word indexing in the given input
    dict_word = word_index(data)
    stop_words = set(stopwords.words('english')) #Build the stop-word set once
    row = []
    column = []
    value = []
    
    for idx, sentence in enumerate(data):
        word_count = dict(Counter(sentence.split(' '))) #For each word get the count
        for word, count in word_count.items():
            if word not in stop_words:
                column_index = dict_word.get(word)
                if column_index is not None: #Skip words missing from the vocabulary
                    row.append(idx)
                    column.append(column_index)
                    value.append(count)
                    
    #Build the sparse matrix from (value, (row, column)) triplets and return it as a dense array
    result = csr_matrix((value, (row, column)), shape=(len(data), len(dict_word))) 
    result = result.toarray()
    return result
    

In [12]:
#Sanity check
print(CountVectorizer(s))

[[0 1 0 1 0]
 [1 0 0 0 1]
 [0 0 1 0 0]]


As we can see, the count vectorizer works properly. 
The first sentence ('this is lab eleven') has the columns for 'eleven' and 'lab' (zero-based indices 1 and 3) set to 1 because each word appears once, and all other columns set to 0 because those words do not appear.

Reference:
For more on this, please refer to <a href='https://medium.com/@saivenkat_/implementing-countvectorizer-from-scratch-in-python-exclusive-d6d8063ace22'>link</a>
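To make the `(value, (row, column))` construction more concrete, the following is a minimal pure-Python sketch of the same three steps on the toy sentences. The small `STOP_WORDS` set is a hand-picked stand-in for the NLTK list and is an assumption made only for this illustration, as are the names `col_of`, `triplets` and `dense`.

```python
from collections import Counter

# Hypothetical, hand-picked stop-word list used only for this illustration
STOP_WORDS = {"this", "is", "an", "here", "the"}

sentences = ["this is lab eleven", "here is the count vectorizer", "this is an example"]

# Step 1: sorted vocabulary -> column index
vocab = sorted({w for s in sentences for w in s.split(" ") if w not in STOP_WORDS})
col_of = {w: i for i, w in enumerate(vocab)}

# Step 2: one (row, column, count) triplet per distinct non-stop word per sentence
triplets = [
    (r, col_of[w], c)
    for r, s in enumerate(sentences)
    for w, c in sorted(Counter(s.split(" ")).items())
    if w in col_of
]

# Step 3: scatter the triplets into a dense matrix of zeros
dense = [[0] * len(vocab) for _ in sentences]
for r, c, v in triplets:
    dense[r][c] = v

for row in dense:
    print(row)
```

Only the non-zero entries are stored as triplets, which is why a sparse representation is attractive when the vocabulary is large and most counts are zero.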

In [13]:
#Implement the count vectorization/tokenization
X = CountVectorizer(corpus)
y = dataset.target

In [14]:
#preview 
len(X[0])

15899

This shows that there are 15899 distinct words after stemming has been applied

In [15]:
len(y)

1178

- Term Frequency–inverse Document Frequency (Tf-IDF)

Term Frequency–Inverse Document Frequency (TF-IDF) goes one step beyond the bag-of-words model built with the count vectorizer. It combines two concepts: term frequency (TF) and inverse document frequency (IDF). The term frequency is the number of occurrences of a specific term in a document (here divided by the total number of terms in that document) and indicates how important the term is within that document. The inverse document frequency is a per-term weight that decreases as the number of documents containing the term grows, so terms that appear in almost every document contribute less. The TF-IDF score, as the name suggests, is the product of the term frequency and the IDF.

Reference: <a href='https://towardsdatascience.com/tf-idf-simplified-aba19d5f5530'>link</a>
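As a rough worked example of these quantities, the sketch below computes TF, a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1, and L2-normalised TF-IDF rows for a tiny count matrix. The counts are made up for illustration and the variable names are assumptions, not part of the exercise.

```python
import math

# Toy count matrix (rows = documents, columns = terms); the numbers are made up
counts = [
    [1, 1, 0],  # term 0 and term 1 once each
    [1, 0, 2],  # term 0 once, term 2 twice
]
n_docs = len(counts)
n_terms = len(counts[0])

# Term frequency: counts divided by the document length
tf = [[c / sum(row) for c in row] for row in counts]

# Document frequency and smoothed IDF: ln((1 + N) / (1 + df)) + 1
df = [sum(1 for row in counts if row[j] != 0) for j in range(n_terms)]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# TF-IDF, then L2-normalise each row
tf_idf = [[t * w for t, w in zip(row, idf)] for row in tf]
tf_idf = [[v / math.sqrt(sum(x * x for x in row)) for v in row] for row in tf_idf]

print("df :", df)
print("idf:", [round(w, 4) for w in idf])
for row in tf_idf:
    print([round(v, 4) for v in row])
```

Term 0 occurs in both documents and therefore gets the smallest IDF, so within the first document the rarer term 1 ends up with a larger weight even though both terms occur once.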

In [16]:
def tfIdf(data):
    cv = CountVectorizer(data)
    tf = cv / (np.sum(cv, axis=-1).reshape(-1,1))
    idf = np.log((1 + len(cv)) / (1 + (cv!=0).sum(axis=0) )) + 1
    tf_idf = tf * idf
    #Normalize
    tf_idf = tf_idf / np.linalg.norm(tf_idf, axis = -1).reshape(-1,1)
    return tf_idf

In [17]:
#Sanity check
print(tfIdf(s))

[[0.         0.70710678 0.         0.70710678 0.        ]
 [0.70710678 0.         0.         0.         0.70710678]
 [0.         0.         1.         0.         0.        ]]


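As can be seen from the example above, tfIdf converts the raw counts from the count vectorizer into L2-normalised weights that combine term frequency and inverse document frequency. Every word in this toy example occurs in exactly one sentence, so all words share the same IDF and the two remaining words of each of the first two sentences receive equal weight (1/√2 ≈ 0.707).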

In [18]:
#Implement the Term Frequency–inverse Document Frequency 
X1 = tfIdf(corpus)
y1 = dataset.target

Splitting the dataset randomly into train / validation / test splits according to ratios 80%:10%:10%

   - Bag-of-words 

In [19]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size =0.10, random_state =3116)
X_train,X_val,y_train,y_val = train_test_split(X_train,y_train,test_size =0.111, random_state =3116)
print("Train Set size:")
print(X_train.shape,y_train.shape)
print("Validation Set size:")
print(X_val.shape,y_val.shape)
print("Test set size:")
print(X_test.shape,y_test.shape)

Train Set size:
(942, 15899) (942,)
Validation Set size:
(118, 15899) (118,)
Test set size:
(118, 15899) (118,)


- Term Frequency–inverse Document Frequency 

In [20]:
X_train1,X_test1,y_train1,y_test1 = train_test_split(X1,y1,test_size =0.10, random_state =3116)
X_train1,X_val1,y_train1,y_val1 = train_test_split(X_train1,y_train1,test_size =0.111, random_state =3116)
print("Train Set size:")
print(X_train1.shape,y_train1.shape)
print("Validation Set size:")
print(X_val1.shape,y_val1.shape)
print("Test set size:")
print(X_test1.shape,y_test1.shape)

Train Set size:
(942, 15899) (942,)
Validation Set size:
(118, 15899) (118,)
Test set size:
(118, 15899) (118,)
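The split sizes printed above can be checked with a little arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of samples, which is consistent with the reported sizes; the variable names are illustrative only.

```python
import math

n_samples = 1178

# First split: hold out 10% as the test set (size rounded up)
n_test = math.ceil(0.10 * n_samples)
n_rest = n_samples - n_test

# Second split: 0.111 of the remaining ~90% is roughly 10% of the total,
# because 0.10 / 0.90 is approximately 0.111
n_val = math.ceil(0.111 * n_rest)
n_train = n_rest - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_samples, 3), round(n_val / n_samples, 3), round(n_test / n_samples, 3))
```

This gives 942 / 118 / 118 samples, i.e. approximately the intended 80% : 10% : 10% ratio.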


#### Exercise 1: Implementing Naive Bayes Classifier for Text Data

In [21]:
class MultinomialNB():

    def __init__(self, alpha=1):
        self.alpha = alpha 
        #for smoothing (Laplace)
    def fit(self, X_train, y_train):
        m, n = X_train.shape
        self._classes = np.unique(y_train)
        n_classes = len(self._classes)
        self._priors = np.zeros(n_classes)
        self._likelihoods = np.zeros((n_classes, n))
        for idx, c in enumerate(self._classes):
            X_train_c = X_train[c == y_train]
            self._priors[idx] = X_train_c.shape[0] / m  #Calculate the prior P(c)
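            #Likelihood of each word given the class, with Laplace smoothing:
            #P(w|Y=c) = (count(w, c) + alpha) / (sum over all words w' of (count(w', c) + alpha))
            self._likelihoods[idx, :] = ((X_train_c.sum(axis=0)) + self.alpha) / (np.sum(X_train_c.sum(axis=0) + self.alpha))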
    def predict(self, X_test):
        #Use the _predict helper function for each value
        return [self._predict(x_test) for x_test in X_test]

    def _predict(self, x_test):
        posteriors = [] #posterior probability of assigning examples to each class
        for idx, c in enumerate(self._classes):
            prior_class = np.log(self._priors[idx])
            likelihoods_class = self.calc_likelihood(self._likelihoods[idx,:], x_test)
            posteriors_class = np.sum(likelihoods_class) + prior_class
            posteriors.append(posteriors_class)
            #The class with the maximum posterior is selected as the predicted class
        return self._classes[np.argmax(posteriors)]

    def calc_likelihood(self, class_likelihood, x_test):
        return np.log(class_likelihood) * x_test

    def score(self, X_test, y_test):
        y_pred = self.predict(X_test)
        return np.sum(y_pred == y_test)/len(y_test)


Reference: <a href='https://stackoverflow.com/search?q=user:12312396+naivebayes&s=0e12d82b-e213-4061-808d-4074e67c52f3'>Link 1</a> &nbsp; <a href='https://stackoverflow.com/questions/33830959/multinomial-naive-bayes-parameter-alpha-setting-scikit-learn'>Link 2</a>
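To illustrate the decision rule used by the class above, here is a small hand-computable sketch of multinomial Naive Bayes with Laplace smoothing. The word counts, the equal priors and the helper name `log_posterior` are all made up for this example; only the formula mirrors the implementation.

```python
import math

# Made-up total word counts per class over a 3-word vocabulary
class_counts = {
    "comp.graphics": [8, 1, 1],
    "sci.med": [1, 6, 3],
}
priors = {"comp.graphics": 0.5, "sci.med": 0.5}  # assumed equal class priors
alpha = 1  # Laplace smoothing

def log_posterior(doc_counts, label):
    """Unnormalised log posterior: log P(c) + sum_w x_w * log P(w|c)."""
    counts = class_counts[label]
    total = sum(counts) + alpha * len(counts)
    log_lik = sum(x * math.log((c + alpha) / total) for x, c in zip(doc_counts, counts))
    return math.log(priors[label]) + log_lik

doc = [0, 2, 1]  # the document uses word 1 twice and word 2 once
scores = {label: log_posterior(doc, label) for label in class_counts}
prediction = max(scores, key=scores.get)

print(scores)
print("Predicted class:", prediction)
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow for long documents.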

### Bag-of-words

In [22]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predicting test set
y_pred = classifier.predict(X_test)
y_test = np.array(y_test)

print(confusion_matrix(y_test, y_pred))

[[53  0]
 [ 1 64]]


In [23]:
#Preview of second 20 test predictions comparing with actual categories
#dataset.target_names maps label 0 to 'comp.graphics' and label 1 to 'sci.med'
for i,j in zip(y_test[20:40],y_pred[20:40]):
    print(f'Actual category: {dataset.target_names[i]}\t Predicted category: {dataset.target_names[j]}')

Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: sci.med	 Predicted category: sci.med
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med

- Accuracy of Test set

In [24]:
print("Accuracy:", accuracy_score(y_test,y_pred))

Accuracy: 0.9915254237288136


### Exercise 2: Implementing SVM Classifier via Scikit-Learn (bag-of-words)

In [25]:
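svc = SVC(random_state=3116)
svc = svc.fit(X_train,y_train) #model training
#Hyperparameter tuning with the validation set
#Note: GridSearchCV clones the estimator and refits the best candidate on the
#data passed to fit(), so the tuned model is trained on X_val, y_val only.
#C must be strictly positive for SVC, so the C=0.0 candidates cannot be fitted.
grid_parameters = {'C':list(np.arange(0,3,0.1)),'kernel':['linear','rbf','poly','sigmoid'],'gamma':['scale','auto']}
svc_search = GridSearchCV(svc, grid_parameters, cv=3,scoring='accuracy', verbose=0)
svc_search.fit(X_val,y_val)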

GridSearchCV(cv=3, estimator=SVC(random_state=3116),
             param_grid={'C': [0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5,
                               0.6000000000000001, 0.7000000000000001, 0.8, 0.9,
                               1.0, 1.1, 1.2000000000000002, 1.3,
                               1.4000000000000001, 1.5, 1.6, 1.7000000000000002,
                               1.8, 1.9000000000000001, 2.0, 2.1, 2.2,
                               2.3000000000000003, 2.4000000000000004, 2.5, 2.6,
                               2.7, 2.8000000000000003, 2.9000000000000004],
                         'gamma': ['scale', 'auto'],
                         'kernel': ['linear', 'rbf', 'poly', 'sigmoid']},
             scoring='accuracy')

In [26]:
print("Best Parameters:", svc_search.best_params_)

Best Parameters: {'C': 2.6, 'gamma': 'scale', 'kernel': 'sigmoid'}


Naive Bayes Classifier

In [27]:
print("Accuracy score",accuracy_score(y_test,classifier.predict(X_test)))
print(confusion_matrix(y_test,classifier.predict(X_test)))

Accuracy score 0.9915254237288136
[[53  0]
 [ 1 64]]


SVM Classifier

In [28]:
print("Accuracy score",accuracy_score(y_test,svc_search.predict(X_test)))
print(confusion_matrix(y_test,svc_search.predict(X_test)))

Accuracy score 0.8135593220338984
[[47  6]
 [16 49]]


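After tuning the hyperparameters with the validation set, we got an accuracy of 0.814 on the test set. Because `GridSearchCV` refits the best estimator on the data passed to `fit`, this model was trained on the 118 validation examples only, which likely explains why it scores below the Naive Bayes classifier.

`N.B.` - Tuning with the train set will get the model to perform with an accuracy of 1.0 on the test set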
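One possible alternative, sketched here on synthetic data rather than on the newsgroup features, is to train every candidate on the training split and score it on the validation split by passing a `PredefinedSplit` as `cv`. The data, the small parameter grid and the names `fold`, `split` and `best` are assumptions made for this illustration; it is not the procedure used for the results reported in this sheet.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the newsgroup features)
X, y = make_classification(n_samples=300, n_features=20, random_state=3116)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=3116)

# -1 marks rows that are only ever used for training,
# 0 marks rows that form the single validation fold
fold = np.concatenate([np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)])
split = PredefinedSplit(test_fold=fold)

grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(random_state=3116), grid, cv=split,
                      scoring="accuracy", refit=False)
search.fit(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]))

# Refit the selected configuration on the training split only
best = SVC(random_state=3116, **search.best_params_).fit(X_tr, y_tr)
print("Best parameters:", search.best_params_)
print("Validation accuracy:", best.score(X_va, y_va))
```

With this setup the validation set is used only for model selection, while the final model still sees the full training split.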

### Term Frequency–inverse Document Frequency

In [29]:
classifier1 = MultinomialNB()
classifier1.fit(X_train1, y_train1)

# Predicting test set
y_pred1 = classifier1.predict(X_test1)
y_test1 = np.array(y_test1)

print(confusion_matrix(y_test1, y_pred1))

[[53  0]
 [ 1 64]]


In [30]:
#Preview of second 20 test predictions comparing with actual categories
#dataset.target_names maps label 0 to 'comp.graphics' and label 1 to 'sci.med'
for i,j in zip(y_test1[20:40],y_pred1[20:40]):
    print(f'Actual category: {dataset.target_names[i]}\t Predicted category: {dataset.target_names[j]}')

Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: sci.med	 Predicted category: sci.med
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med	 Predicted category: sci.med
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: comp.graphics	 Predicted category: comp.graphics
Actual category: sci.med	 Predicted category: sci.med
Actual category: sci.med

- Accuracy of Test set

In [31]:
print("Accuracy:", accuracy_score(y_test1,y_pred1))

Accuracy: 0.9915254237288136


### Exercise 2: Implementing SVM Classifier via Scikit-Learn (TfIdf)

In [32]:
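svc1 = SVC(random_state=3116)
svc1 = svc1.fit(X_train1,y_train1) #model training
#Hyperparameter tuning with the validation set
#Note: the search below is fitted on X_val, y_val, which are the bag-of-words
#validation features, not the TF-IDF features X_val1, y_val1
grid_parameters = {'C':list(np.arange(0,3,0.1)),'kernel':['linear','rbf','poly','sigmoid'],'gamma':['scale','auto']}
svc_search1 = GridSearchCV(svc1, grid_parameters,scoring='accuracy', cv=3, verbose=0)
svc_search1.fit(X_val,y_val)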

GridSearchCV(cv=3, estimator=SVC(random_state=3116),
             param_grid={'C': [0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5,
                               0.6000000000000001, 0.7000000000000001, 0.8, 0.9,
                               1.0, 1.1, 1.2000000000000002, 1.3,
                               1.4000000000000001, 1.5, 1.6, 1.7000000000000002,
                               1.8, 1.9000000000000001, 2.0, 2.1, 2.2,
                               2.3000000000000003, 2.4000000000000004, 2.5, 2.6,
                               2.7, 2.8000000000000003, 2.9000000000000004],
                         'gamma': ['scale', 'auto'],
                         'kernel': ['linear', 'rbf', 'poly', 'sigmoid']},
             scoring='accuracy')

In [33]:
print("Best Parameters:", svc_search1.best_params_)

Best Parameters: {'C': 2.6, 'gamma': 'scale', 'kernel': 'sigmoid'}


Naive Bayes Classifier

In [34]:
print("Accuracy score",accuracy_score(y_test1,classifier1.predict(X_test1)))
print(confusion_matrix(y_test1,classifier1.predict(X_test1)))

Accuracy score 0.9915254237288136
[[53  0]
 [ 1 64]]


SVM Classifier

In [35]:
print("Accuracy score",accuracy_score(y_test1,svc_search1.predict(X_test1)))
print(confusion_matrix(y_test1,svc_search1.predict(X_test1)))

Accuracy score 0.4491525423728814
[[53  0]
 [65  0]]


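After tuning the hyperparameters with the validation set, we got an accuracy of 0.449 on the test set for the TF-IDF features: the SVM assigns every test example to class 0. Note that `svc_search1` was fitted on `X_val` (bag-of-words counts) but evaluated on `X_test1` (TF-IDF weights), so the training and test features are on different scales; this mismatch, rather than the TF-IDF embedding itself, is the likely cause of the poor SVM score.
The Naive Bayes classifier performs the same on both the bag-of-words embedding and the TF-IDF embedding.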

|Method|Feature Embedding|Test Accuracy|
|-|-|-|
|Naive Bayes Classifier|Bag-of-words|0.9915|
|Naive Bayes Classifier|TF-IDF|0.9915|
|SVM Classifier|Bag-of-words|0.8136|
|SVM Classifier|TF-IDF|0.4492|