# Assignment 1: Classifying Documents
The assignment is to develop, in NLTK, OpenNLP, SketchEngine or GATE/Annie, a Naïve Bayes classifier able to detect a single class in one of the corpora available as attachments to the chosen package, by distinguishing ENGLISH from NON-ENGLISH. In particular, the classifier has to be:

- Trained on a split subset of the chosen corpus, either by using an existing partition of the sample documents into training and test sets or by using one of the available random splitters;

- Devised as a pipeline of any chosen format, including the simplest version based on word2vec on a list of words obtained by one of the available lexical resources.

## Introduction:
I decided to use only NLTK to define the pipeline, fetch the data and analyze the results. The classifier is trained on full documents, so the results on the test set are near-perfect (often perfect) in most training runs. <br>
The catch is that this classifier should only be used on documents of the same size or bigger: smaller documents (or even single paragraphs) are detected unreliably and should therefore be avoided. <br>
The script can be divided into five distinct parts:
- ***Data Fetching***
- ***Pipeline***
- ***Feature Extraction***
- ***Training the Model***
- ***Analyzing the Results***

In [1]:
#************************ IMPORTS ************************#
import nltk
import asyncio
import random
import math 
import collections
from tqdm import tqdm 

## Data Fetching:
In this step the documents are fetched, labelled according to their class, and shuffled.
The script takes three languages into consideration: ***English***, ***French*** and ***Dutch***.<br>
The dataset consists of 46 documents in total, of which 26 are in English and the other 20 are NonEnglish (i.e. Dutch and French). The English documents do not all come from the same background: 10 of them are from **europarl_raw** (a dataset containing speeches by members of the European Parliament) and the other 16 are from the **state_union** dataset (State of the Union addresses by American presidents). <br>
In my opinion, even though American and British English differ, in a formal context (e.g. a politician's speech) they tend to be more similar than in a casual setting.

In [2]:
from nltk.corpus import europarl_raw # European Parliament speeches
from nltk.corpus import state_union as union # State of the Union addresses by American presidents
# CORPUS DATA 

# Creating lists containing all the needed file ids
en_ids = list(europarl_raw.english.fileids())
dutch_ids = list(europarl_raw.dutch.fileids())
fr_ids = list(europarl_raw.french.fileids())
union_ids = list(union.fileids())
# Keep only the first quarter of the State of the Union speeches
union_ids = union_ids[:math.floor(len(union_ids)/4)]
print("English Documents Count:", len(en_ids)+len(union_ids))
print("NonEnglish Documents Count:", len(dutch_ids)+len(fr_ids))

# Loading the ENGLISH europarl corpus and adding the English label
documents = [(europarl_raw.english.raw(fileid), "English") for fileid in en_ids]

# Loading the State of the Union speeches and adding the English label
for fileid in union_ids:
    documents.append((union.raw(fileid), "English"))

# Loading the FRENCH corpus and adding the NonEnglish label
for fileid in fr_ids:
    documents.append((europarl_raw.french.raw(fileid), "NonEnglish"))

# Loading the DUTCH corpus and adding the NonEnglish label
for fileid in dutch_ids:
    documents.append((europarl_raw.dutch.raw(fileid), "NonEnglish"))
    
random.shuffle(documents)
print("Done")

English Documents Count: 26
NonEnglish Documents Count: 20
Done


## Pipeline:
In this step the script passes the data through a pipeline, with the aim of reducing the number of words in our vocabulary *V* and deleting uninformative data. The script also counts the frequency of each word across the whole corpus; this is essential because the next step uses this information to choose the features (i.e. words) of our model.
The pipeline consists of:
- **Tokenization**: converting the text into a list of sentences and/or words;
- **Stop-word removal**: removing words that don't add meaning to the text. The script uses NLTK's stop-word sets for each language instead of removing the *N* most common words;
- **Stemming**: reducing words to their root. This is done using the Porter stemmer, an old stemmer (1979) that is still a viable option;
- **Lemmatizing**: reducing words to their dictionary form (lemma). Unlike stemming, which only strips suffixes to obtain a root, it replaces each word with an actual word that carries the same core meaning;
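
The first two stages and the frequency count can be illustrated without downloading any corpus. The snippet below is only a sketch: the regular-expression tokenizer, the tiny `TOY_STOP_WORDS` set and the `preprocess` helper are hypothetical stand-ins for NLTK's tokenizers and stop-word lists, and stemming and lemmatization are left out.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stop words; the script uses NLTK's full lists instead.
TOY_STOP_WORDS = {"the", "is", "a", "of", "le", "la", "de", "het", "een"}

def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

docs = [
    "The parliament is the voice of the people.",
    "Le parlement est la voix du peuple.",
]

# Count every surviving token over the whole (toy) corpus.
counts = Counter()
for doc in docs:
    counts.update(preprocess(doc))

print(preprocess(docs[0]))  # ['parliament', 'voice', 'people']
print(counts.most_common(3))
```

A single shared stop-word set, as used here, is what lets the same loop run over documents in any of the three languages.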

In [3]:
# STOPWORDS
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))  
stop_words.update(stopwords.words("french"))  # update() merges every word; add() would insert a single generator object
stop_words.update(stopwords.words("dutch"))

# STEMMER 
from nltk.stem import PorterStemmer 
stemmer = PorterStemmer()  

# LEMMATIZER
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [4]:
# TOKENIZATION, STOP WORDS REMOVAL, STEMMING AND LEMMATIZING

from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize
fdist = FreqDist() # Counts the occurrences of each word to build the bag of words
data = [0 for _ in range(len(documents))]

for i,(text,label) in enumerate(tqdm(documents)):
    appo = ([],label)

    sents = sent_tokenize(text)
    for sent in sents:
        words = word_tokenize(sent) 
        for word in words:
            if word.casefold() not in stop_words:
                stemmed = stemmer.stem(word.lower()) # Stemming 
                lemmatized = lemmatizer.lemmatize(stemmed) # Lemmatization
                fdist[lemmatized] += 1 # Increases Word Counter inside the Bag of Words
                appo[0].append(lemmatized) # Saves the Result

    data[i] = appo 
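# most_common() sorts by frequency; list(fdist) would only follow insertion order
top_words = [word for word, _ in fdist.most_common(2000)]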

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [00:58<00:00,  1.27s/it]


## Feature Extraction:
After the pipeline, all document data is transformed into a form the classifier can read: for every document the script creates a dictionary that maps each of the 2000 most common words (across all the data) to *True* if the word is present in the text and to *False* otherwise. Once the program has created the list of feature sets, one per document, it simply splits it in two, using 70% of the data for training and the remaining 30% for testing.
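
As a rough illustration of the paragraph above, here is the same idea on three toy documents. The names `toy_docs` and `contains_features`, the data and the vocabulary size of 3 are made up for the example; the real script works on the processed tokens and a 2000-word vocabulary.

```python
from collections import Counter

# Toy tokenized documents with their labels (hypothetical data).
toy_docs = [
    (["union", "state", "peopl"], "English"),
    (["union", "parlement", "peupl"], "NonEnglish"),
    (["state", "union", "nation"], "English"),
]

# Vocabulary: the most frequent tokens over all documents.
freq = Counter(tok for tokens, _ in toy_docs for tok in tokens)
vocab = [tok for tok, _ in freq.most_common(3)]

def contains_features(tokens, vocab):
    """Map every vocabulary word to True/False depending on its presence."""
    present = set(tokens)
    return {f"contains({w})": (w in present) for w in vocab}

featuresets = [(contains_features(toks, vocab), label) for toks, label in toy_docs]

# 70/30 split by position (the list is assumed to be shuffled already).
cut = int(len(featuresets) * 0.7)
train, test = featuresets[:cut], featuresets[cut:]
print(vocab)
print(len(train), len(test))
```

Because the split is positional, shuffling the documents beforehand is what makes it behave like a random split.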

In [5]:
# Feature Extraction: 
def feature_extractor(document, top_words):
    """Return a dict telling, for each top word, whether the document contains it."""
    document_set = set(document)  # set membership tests are O(1)
    features = {}
    for word in top_words:
        features['contains({})'.format(word)] = (word in document_set)
    return features

featuresets = [(feature_extractor(d, top_words), c) for (d, c) in tqdm(data)]
train_test_split = math.floor(len(featuresets) * 0.7 )
train_set, test_set = featuresets[:train_test_split], featuresets[train_test_split:]

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 46/46 [00:00<00:00, 272.28it/s]


## Training the Model: 

With the Naive Bayes classifier, every feature gets a say in determining which label should be assigned to a given input value (in this case, a document). <br>
The Naive Bayes classifier is trained on 70% of the feature sets. Each feature set therefore has to be labelled with the correct class (English or NonEnglish) by creating a tuple *(FeatureSet, Label)*. The reason for this split is the small number of documents in the corpus (only 46): during testing, classifiers trained on fewer documents gave subpar results, especially in terms of the most informative features shown in the next section.
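
In formula form, and under the usual Naive Bayes assumption that features are conditionally independent given the label, the decision for a document with feature values $f_1, \dots, f_n$ can be written as follows (the symbols $c$, $f_i$ and $\hat{c}$ are notation introduced here, not names from the script):

$$
\hat{c} = \arg\max_{c \in \{\text{English},\ \text{NonEnglish}\}} P(c) \prod_{i=1}^{n} P(f_i \mid c)
$$

Here $P(c)$ is the prior of the label estimated from the training set, and $P(f_i \mid c)$ is the probability that feature $i$ (e.g. `contains(also)`) takes its observed value in documents of class $c$.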


Example of the decision process during classification of a Naive Bayes classifier:
<img src="./images/naive-bayes-triangle.png" width="400" height="400" />

In [8]:

print(f"Number of training Documents: {len(train_set)} --- Number of testing Documents: {len(test_set)}")
classifier = nltk.NaiveBayesClassifier.train(train_set) 


Number of training Documents: 32 --- Number of testing Documents: 14


## Analyzing the results:
Lastly, the script analyzes the results the classifier returns on the test data. It calculates *Accuracy*, *Precision*, *Recall* and *F1 score*, and builds a *Confusion Matrix*. In this **binary classification problem**, computing these metrics requires choosing one of the two labels as the positive class, then extracting the test documents that carry that label and the documents that the classifier assigns to that same, previously chosen label (as represented in the image below).<br>
All the metrics are very high. This is probably related to the small size of the corpus (the dataset has only 46 documents) and to the length of the documents: the longer a document is, the more likely it is to contain features that are significant for a language. The latter is especially true in a field like politics, where much of the terminology is language-dependent, unlike, for example, Computer Science corpora.
<p align="center">
<img src="./images/precision-recall.png" width="700" height="500" />

In the image above:
- **Relevant documents** are all the documents that belong to the positive class (English);
- **Retrieved documents** are all the documents that the classifier identifies as belonging to the positive class (English).
 </p>
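
The set-based definitions above translate almost directly into code. The following sketch uses made-up document indices and hypothetical helper names (`precision_of`, `recall_of`, `f1_of`); it mirrors the definitions and is not NLTK's implementation.

```python
def precision_of(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else None

def recall_of(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else None

def f1_of(p, r):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Made-up example: indices of test documents.
relevant = {0, 1, 2, 3}      # truly English
retrieved = {1, 2, 3, 4, 5}  # classified as English

p = precision_of(relevant, retrieved)  # 3 / 5
r = recall_of(relevant, retrieved)     # 3 / 4
print(p, r, f1_of(p, r))
```

Guarding against empty sets matters on a test set this small, where a class may receive no predictions at all.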

In [9]:
from nltk.metrics.scores import (precision, recall)
from nltk.metrics import ConfusionMatrix
print("Testing and Metrics: ")
refsets =  collections.defaultdict(set)
testsets = collections.defaultdict(set)
labels = []
tests = []
for i,(feats,label) in enumerate(test_set):
    refsets[label].add(i)
    result = classifier.classify(feats)
    testsets[result].add(i)
    labels.append(label)
    tests.append(result)
    
cm = ConfusionMatrix(labels, tests)
print("Accuracy:",nltk.classify.accuracy(classifier, test_set))
prec = precision(refsets['English'], testsets['English'])
print( 'Precision:', prec )
rec = recall(refsets['English'], testsets['English'])
print( 'Recall:', rec )
f1 = 2 *(prec*rec)/(prec+rec)
print("F1 score:", f1)
print(cm.pretty_format(sort_by_count=True, show_percents=True, truncate=9))
classifier.show_most_informative_features(35)


Testing and Metrics: 
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
           |             N |
           |             o |
           |             n |
           |      E      E |
           |      n      n |
           |      g      g |
           |      l      l |
           |      i      i |
           |      s      s |
           |      h      h |
-----------+---------------+
   English | <57.1%>     . |
NonEnglish |      . <42.9%>|
-----------+---------------+
(row = reference; col = test)

Most Informative Features
        contains(achiev) = False          NonEng : Englis =     12.2 : 1.0
          contains(also) = False          NonEng : Englis =     12.2 : 1.0
        contains(believ) = False          NonEng : Englis =     12.2 : 1.0
         contains(bring) = False          NonEng : Englis =     12.2 : 1.0
          contains(come) = False          NonEng : Englis =     12.2 : 1.0
        contains(cooper) = False          NonEng : Englis =     12.2 : 1.0
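
The `12.2 : 1.0` ratios can be reproduced by hand. The sketch below assumes NLTK's default estimator for `NaiveBayesClassifier.train` (expected likelihood estimation, i.e. add-0.5 smoothing over the two feature values `True`/`False`). The class counts are inferred rather than printed by the script: 57.1% of the 14 test documents are English, so 8 of the 26 English documents are in the test set and 18 in the training set, which leaves 14 NonEnglish training documents. That exactly one English training document lacks the word is an assumption, chosen because it matches the printed ratio.

```python
def ele_prob(count, total, bins=2, gamma=0.5):
    """Add-gamma (Lidstone) estimate; gamma = 0.5 gives expected likelihood estimation."""
    return (count + gamma) / (total + gamma * bins)

n_train_english = 18     # inferred: 26 English documents minus 8 in the test set
n_train_nonenglish = 14  # inferred: 32 training documents minus 18 English ones

# P(contains(w) = False | label)
p_false_noneng = ele_prob(n_train_nonenglish, n_train_nonenglish)  # word absent from every NonEnglish document
p_false_eng = ele_prob(1, n_train_english)  # assumed: word absent from a single English document

ratio = p_false_noneng / p_false_eng
print(round(ratio, 1))  # 12.2
```

Read this way, each listed stem is one that almost every English training document contains and no NonEnglish training document does, so its absence is strong evidence for NonEnglish.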
          