# Exercise 4

## Task 1

Using the files _rural.txt_ and _science.txt_, train and test classifiers provided in scikit-learn. The goal of this task is to explore different features, not only of the classifier but also of the vectorizer:

a) Each file (rural and science) contains sentence-wise documents. Your job is to create a list of documents and their corresponding labels (rural, science). This data structure (which one you use is up to you) will later serve as input for your vectorizer. For example, you can think of it as a table whose first column contains the sentences of both files and whose second column contains the class label of each sentence.

In [83]:
import pandas as pd
import numpy as np
import sklearn.metrics as metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import naive_bayes

# read files into individual dataframes and add a label column ('s' = science, 'r' = rural)
data_science = pd.read_csv("science.txt", sep='\t', header=None)
data_science[1] = 's'
data_rural = pd.read_csv("rural.txt", sep='\t', header=None)
data_rural[1] = 'r'

# combine dataframes (DataFrame.append was removed in pandas 2.0; pd.concat keeps the original indices)
data_total = pd.concat([data_rural, data_science])
display(data_total)

|     | 0 | 1 |
| --- | --- | --- |
| 0 | PM denies knowledge of AWB kickbacks | r |
| 1 | The Prime Minister has denied he knew AWB was ... | r |
| 2 | Letters from John Howard and Deputy Prime Mini... | r |
| 3 | In one of the letters Mr Howard asks AWB manag... | r |
| 4 | The Opposition's Gavan O'Connor says the lette... | r |
| ... | ... | ... |
| 592 | Liddicoat hopes this will one day make it poss... | s |
| 593 | If you understand what they are then ... you m... | s |
| 594 | An ancient ancestor of today's crocodiles look... | s |
| 595 | The discovery of a six-foot-long, bipedal and ... | s |
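
The same table can also be built without the CSV parser. The following is a minimal sketch, assuming each file holds exactly one sentence per line and is UTF-8 encoded; `read_csv` interprets tab and quote characters, so a sentence containing them may be parsed differently from the raw line. The helper names `load_labeled_sentences` and `build_dataset` and the column names `text` and `label` are hypothetical and not part of the exercise.

```python
from pathlib import Path

import pandas as pd


def load_labeled_sentences(path, label):
    """Read one sentence per line from `path` and attach `label` to each."""
    # Assumption: the file is UTF-8 encoded with one sentence per line.
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Drop empty lines so that every row holds an actual sentence.
    sentences = [line.strip() for line in lines if line.strip()]
    return pd.DataFrame({"text": sentences, "label": [label] * len(sentences)})


def build_dataset(rural_path, science_path):
    """Stack both classes into one table with a fresh 0..n-1 index."""
    frames = [
        load_labeled_sentences(rural_path, "r"),
        load_labeled_sentences(science_path, "s"),
    ]
    return pd.concat(frames, ignore_index=True)
```

Calling `build_dataset("rural.txt", "science.txt")` would then return a table comparable to `data_total`, except that the row index is not repeated across the two classes.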


b) Split the data into train (70%) and test (30%) sets and use the _tf-idf vectorizer_ provided by scikit-learn and explained in class to train the following classifiers, also provided by scikit-learn: _naive_bayes.GaussianNB()_ and _svm.LinearSVC()_. __Hint:__ Note that the Gaussian NB classifier takes a dense matrix as input, whereas the output of the vectorizer is a sparse matrix. Use _my_matrix.toarray()_ for this conversion.

In [84]:
text_train, text_test, label_train, label_test = train_test_split(data_total[0], data_total[1], test_size=0.30, random_state=1234, shuffle=True)

vectorizer = TfidfVectorizer()
bow_matrix = vectorizer.fit_transform(text_train)
print(bow_matrix.toarray())
bayes_classifier = naive_bayes.GaussianNB()
bayes_classifier.fit(bow_matrix.toarray(), label_train)
svm_classifier = svm.LinearSVC()
svm_classifier.fit(bow_matrix, label_train)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
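
Since the task asks to explore the vectorizer as well as the classifier, the following minimal sketch shows how a few `TfidfVectorizer` settings change the size of the feature space. The toy sentences and the particular settings compared are assumptions chosen for illustration; they are not taken from the exercise data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences (invented for illustration, not from rural.txt / science.txt).
toy_texts = [
    "The farmers sold wheat at the market",
    "Wheat prices rose after the drought",
    "Scientists found a fossil of an ancient crocodile",
    "The fossil shows how ancient reptiles walked",
]

# Candidate vectorizer configurations to compare.
configs = {
    "default (unigrams)": TfidfVectorizer(),
    "unigrams + bigrams": TfidfVectorizer(ngram_range=(1, 2)),
    "English stop words removed": TfidfVectorizer(stop_words="english"),
}

feature_counts = {}
for name, vectorizer in configs.items():
    matrix = vectorizer.fit_transform(toy_texts)
    # One row per sentence, one column per distinct feature.
    feature_counts[name] = matrix.shape[1]
    print(f"{name}: {matrix.shape[1]} features")
```

On the real data, each configuration could be substituted for `TfidfVectorizer()` in the cell above and the evaluation repeated to see whether the scores change.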

c) Evaluate both classifiers using only the test set, report accuracy, recall, precision, and f-measure, and explain any differences in performance.

In [117]:
test_data_matrix = vectorizer.transform(text_test)

# test and evaluate Bayes classifier
results_bayes = bayes_classifier.predict(test_data_matrix.toarray())
correct_bayes = np.sum(np.equal(results_bayes, label_test))

print("Bayes: ")
print()
print("Accuracy: " + str(metrics.accuracy_score(label_test, results_bayes)))
print()
print("Precision Science: " + str(metrics.precision_score(label_test, results_bayes, pos_label = 's')))
print("Recall Science: " + str(metrics.recall_score(label_test, results_bayes, pos_label = 's')))
print("F-Measure Science: " + str(metrics.f1_score(label_test, results_bayes, pos_label = 's')))
print()
print("Precision Rural: " + str(metrics.precision_score(label_test, results_bayes, pos_label = 'r')))
print("Recall Rural: " + str(metrics.recall_score(label_test, results_bayes, pos_label = 'r')))
print("F-Measure Rural: " + str(metrics.f1_score(label_test, results_bayes, pos_label = 'r')))

print()

# test and evaluate SVM classifier
results_svm = svm_classifier.predict(test_data_matrix.toarray())
correct = np.sum(np.equal(results_svm, label_test))

print("SVM: ")
print()
print("Accuracy: " + str(metrics.accuracy_score(label_test, results_svm)))
print()
print("Precision Science: " + str(metrics.precision_score(label_test, results_svm, pos_label = 's')))
print("Recall Science: " + str(metrics.recall_score(label_test, results_svm, pos_label = 's')))
print("F-Measure Science: " + str(metrics.f1_score(label_test, results_svm, pos_label = 's')))
print()
print("Precision Rural: " + str(metrics.precision_score(label_test, results_svm, pos_label = 'r')))
print("Recall Rural: " + str(metrics.recall_score(label_test, results_svm, pos_label = 'r')))
print("F-Measure Rural: " + str(metrics.f1_score(label_test, results_svm, pos_label = 'r')))


Bayes: 

Accuracy: 0.9296636085626911

Precision Science: 0.9590643274853801
Recall Science: 0.9111111111111111
F-Measure Science: 0.9344729344729344

Precision Rural: 0.8974358974358975
Recall Rural: 0.9523809523809523
F-Measure Rural: 0.924092409240924

SVM: 

Accuracy: 0.9480122324159022

Precision Science: 0.9267015706806283
Recall Science: 0.9833333333333333
F-Measure Science: 0.9541778975741241

Precision Rural: 0.9779411764705882
Recall Rural: 0.9047619047619048
F-Measure Rural: 0.9399293286219081


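The svm.LinearSVC classifier scores higher on every metric except precision on the science texts (about 0.03 lower) and recall on the rural texts (about 0.05 lower). These two exceptions mirror each other: in a two-class problem, every rural sentence wrongly labeled as science lowers both rural recall and science precision, so the SVM errs mostly toward predicting science, whereas the naive Bayes classifier errs mostly toward predicting rural. Overall, however, the two classifiers perform similarly (accuracy 0.948 vs. 0.930). All scores of both classifiers are above 0.90 except the rural precision of the naive Bayes classifier (0.897).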
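
How the per-class scores depend on each other can be illustrated with a confusion matrix. The sketch below uses invented toy labels (an assumption, not the exercise data) in which a single science sentence is misclassified as rural, and computes the per-class precision, recall, and f-measure with scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Invented toy labels: one science sentence is wrongly predicted as rural.
y_true = ["s", "s", "s", "s", "r", "r", "r", "r"]
y_pred = ["s", "s", "s", "r", "r", "r", "r", "r"]
labels = ["r", "s"]

# Rows are true classes, columns are predicted classes, both ordered as `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Each returned array holds one value per entry of `labels`.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels
)
for i, label in enumerate(labels):
    print(f"{label}: precision={precision[i]:.2f} recall={recall[i]:.2f} f1={f1[i]:.2f}")
```

The single error lowers the recall of the science class and the precision of the rural class, while the other two scores stay at 1.0.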

## Task 2

Using the same splits from __Task 1__:

a) Use spaCy to extract vector representations of each document (sentence) in your data.

b) Train new instances of both classifiers, but this time the input should consist only of the vector representations obtained from spaCy and the labels.

c) Report accuracy, recall, precision, and f-measure and explain differences to the results of __tf-idf__.
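
For parts a) and b), a minimal sketch of the feature extraction might look as follows. It assumes that a spaCy pipeline that ships with word vectors (for example `en_core_web_md`) has been installed separately; the helper name `spacy_features` is hypothetical. By default, `doc.vector` is the average of the token vectors, so each sentence becomes one dense row.

```python
import numpy as np


def spacy_features(texts, nlp):
    """Return a dense (n_documents, n_dimensions) matrix of document vectors.

    `nlp` is a loaded spaCy pipeline (assumption: one that includes word vectors).
    """
    # nlp.pipe processes the texts as a stream; doc.vector is one vector per document.
    return np.vstack([doc.vector for doc in nlp.pipe(texts)])


# Intended use with the splits from Task 1 (not executed here):
#   import spacy
#   nlp = spacy.load("en_core_web_md")  # assumed model name
#   train_vectors = spacy_features(text_train, nlp)
#   test_vectors = spacy_features(text_test, nlp)
#   naive_bayes.GaussianNB().fit(train_vectors, label_train)
#   svm.LinearSVC().fit(train_vectors, label_train)
```

Because the resulting matrix is already dense, no `toarray()` conversion is needed before passing it to the Gaussian NB classifier.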