
 **Text Classification of a movie review [dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/)** using machine learning classifiers.

Steps/tasks: 

1.   Dataset collection
2.   Mapping the training data and the labels 
3.   Data preprocessing
4.   Vectorising the text data (converting it into numerical features)
5.   Training a classifier from a standard library (sklearn)
6.   Evaluating the performance in terms of accuracy



In [None]:
# importing the necessary libraries


import numpy as np
import re
import nltk
from sklearn.datasets import load_files
nltk.download('stopwords')
import pickle
from nltk.corpus import stopwords
nltk.download('wordnet')

We will be using the Cornell movie review [dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/). The dataset contains 2000 documents, each of which is a review by an individual reviewer and is either positive or negative.

In [None]:
## These commands download the dataset and extract it into the current working directory

!wget https://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
!tar -xvzf review_polarity.tar.gz



After extracting the dataset, a `txt_sentoken` folder will be created. This folder contains two sub-folders, `neg` and `pos`, and each of these categories has 1000 review documents.

---
The next step is to map each document to its category so that both can be accessed as needed.

Since there are only two categories, this can be done by traversing all the documents in the `neg` and `pos` folders and assigning each one class 0 or 1 (for example, 0 for `neg` and 1 for `pos`, or vice versa).
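The manual traversal described above can be sketched as follows. This is a minimal illustration, not the notebook's method: the helper name `load_labelled_documents`, the `*.txt` file pattern, and the choice of 0 for `neg` and 1 for `pos` are assumptions, and the demo builds a tiny throwaway folder instead of reading the real dataset.

```python
import tempfile
from pathlib import Path


def load_labelled_documents(root, class_names=("neg", "pos")):
    """Read every .txt file under root/<class_name> and label it by folder.

    The label is the index of the folder in class_names
    (assumed here: 0 for neg, 1 for pos).
    """
    texts, labels = [], []
    for label, class_name in enumerate(class_names):
        for path in sorted((Path(root) / class_name).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8", errors="ignore"))
            labels.append(label)
    return texts, labels


# Demo on a toy directory with one review per class.
with tempfile.TemporaryDirectory() as tmp:
    for class_name, text in [("neg", "a dull film"), ("pos", "a great film")]:
        folder = Path(tmp) / class_name
        folder.mkdir()
        (folder / "review_0.txt").write_text(text, encoding="utf-8")
    texts, labels = load_labelled_documents(tmp)

print(texts, labels)
```

For the real data, the call would be `load_labelled_documents("txt_sentoken")` once the archive has been extracted.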


You can also use the `load_files` function from [sklearn.datasets](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html), which has already been imported in the first code cell.

In [None]:
# load_files takes the path of the parent directory (the one containing the category subfolders) as its argument
# and returns the training data (all documents combined in a single list) together with their categories

# movie_data = load_files("txt_sentoken", shuffle=False)
movie_data = load_files("txt_sentoken")

In [None]:
X, y = movie_data.data, movie_data.target


# here X is the list of all text documents and y is a numpy array of all the categories
# such that document i, i.e. X[i], has the category y[i]

In [None]:
print(f'Total number of documents {len(X)}')


Total number of documents 2000


In [None]:
## Preprocessing the data
## The labels do not need to be processed as they are already 0 and 1

## Process the data X and store the cleaned text in the documents list

documents = []

from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

for sen in range(0, len(X)):
    ## load_files returns raw bytes when no encoding is given, so decode first
    ## (str() on bytes would leave a stray "b" prefix and escaped newlines in the text)
    document = X[sen].decode('utf-8', errors='ignore')

    ## Remove all the special characters or the non-word characters
    document = re.sub(r'\W', ' ', document)

    ## Remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    ## Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document)

    ## Converting to lowercase
    document = document.lower()

    ### Lemmatization

    ## split the document into words
    document = document.split()
    ## use stemmer.lemmatize() to lemmatize each word of the document
    document = [stemmer.lemmatize(word) for word in document]
    ## join the individual tokens back into a single string
    document = ' '.join(document)

    documents.append(document)

Vectorizing the text in order to convert the text documents into numerical values using [Bag of words](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
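To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words representation. It is only an illustration of the concept: the helper `bag_of_words` is hypothetical, and splitting on whitespace is an assumption, whereas `CountVectorizer` applies its own tokenisation and, as configured in this notebook, also drops stop words.

```python
from collections import Counter


def bag_of_words(docs):
    """Return a sorted vocabulary and one row of word counts per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocabulary = sorted({token for doc in tokenised for token in doc})
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        # One column per vocabulary word, holding how often it occurs in this document.
        rows.append([counts[token] for token in vocabulary])
    return vocabulary, rows


vocabulary, rows = bag_of_words(["great great film", "dull film"])
print(vocabulary)  # ['dull', 'film', 'great']
print(rows)        # [[0, 1, 2], [1, 1, 0]]
```

Each document becomes a fixed-length vector whose length equals the vocabulary size, which is the form a classifier can consume.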

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words=stopwords.words('english'))
X_bow = vectorizer.fit_transform(documents).toarray()

In [None]:
## Splitting the training and testing instances into 80:20 ratio, 80% training and 20% testing

from sklearn.model_selection import train_test_split
X_bow_train, X_bow_test, y_bow_train, y_bow_test = train_test_split(X_bow, y, test_size=0.2, random_state=0)

In [None]:
## Classifying the `pos` and `neg` categories using Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_bow_train, y_bow_train)

In [None]:

y_bow_pred = gnb.predict(X_bow_test)
## y_bow_pred holds the categories predicted for the test data by the trained model "gnb"

In [None]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_bow_test, y_bow_pred))

## accuracy compares the actual categories (y_bow_test) with the predicted categories (y_bow_pred)
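
Accuracy is simply the fraction of test documents whose predicted category matches the actual one. A minimal hand-written sketch follows; the helper name `accuracy` and the toy labels are illustrative assumptions, and `accuracy_score` remains the function used in this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


# Three of the four toy predictions match the actual labels.
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
print(score)  # 0.75
```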

Tasks:
Try the following classifiers and compare their accuracies with that of `GaussianNB`:

1.   [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

2.   [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
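
One possible way to organise this comparison is to fit each classifier in a loop and collect the accuracies in a dictionary. The sketch below assumes scikit-learn is installed; the helper `compare_classifiers`, the fixed `random_state=0`, and the toy feature vectors are illustrative assumptions, not results from the movie review data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X_train, y_train, X_test, y_test):
    """Fit each classifier on the training split and return its test accuracy."""
    classifiers = {
        "GaussianNB": GaussianNB(),
        "RandomForestClassifier": RandomForestClassifier(random_state=0),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    }
    scores = {}
    for name, classifier in classifiers.items():
        classifier.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, classifier.predict(X_test))
    return scores


# Toy count vectors standing in for the vectorised reviews.
toy_X_train = [[2, 0], [3, 0], [0, 2], [0, 3]]
toy_y_train = [1, 1, 0, 0]
toy_X_test = [[4, 0], [0, 4]]
toy_y_test = [1, 0]

scores = compare_classifiers(toy_X_train, toy_y_train, toy_X_test, toy_y_test)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

With the notebook's variables, the call would be `compare_classifiers(X_bow_train, y_bow_train, X_bow_test, y_bow_test)`.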

----



*   Repeat the same tasks, i.e. classification with each of the classifiers, but using [TFIDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) as the vectorizer instead of the Bag of Words vectorizer, and report the differences
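
As a rough illustration of what TF-IDF weighting does, the sketch below computes it by hand in pure Python. It follows what I understand to be scikit-learn's default formulation (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the helper `tfidf_matrix` and the pre-tokenised toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_matrix(tokenised_docs):
    """Return a sorted vocabulary and one L2-normalised TF-IDF row per document."""
    n_docs = len(tokenised_docs)
    vocabulary = sorted({token for doc in tokenised_docs for token in doc})
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in tokenised_docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}
    rows = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


vocabulary, rows = tfidf_matrix([["good", "movie"], ["bad", "movie"]])
print(vocabulary)  # ['bad', 'good', 'movie']
print(rows[0])
```

In the first row, `good` receives a higher weight than `movie` because `movie` occurs in both documents and is therefore less discriminative, which is the main difference from plain word counts.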


## For your reference, the above example is taken from [this tutorial](https://stackabuse.com/text-classification-with-python-and-scikit-learn/)

# **Assignment**
**Spoken language** classification task: Given a dataset consisting of person names and their spoken languages, train a classifier to predict the spoken language of a given unknown name. So, the person name is the data and the spoken language is the label.

## Dataset [link](https://download.pytorch.org/tutorial/data.zip)

## Dataset organisation
    |-data
    |    |-names
    |          |-English 
    |          |-French
    .          .
    .          .
    .          . 
    |          |-Japanese      
    |-eng-fra.txt 

Here, each file inside the `names` folder is a collection of person names for one `language`, and your first task is to map each of the names to its language in a data structure that you can access.
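
One possible shape for this mapping is a pair of parallel lists, mirroring `X` and `y` from the movie review task. The sketch below is a starting point only and rests on assumptions: that each language file is plain text with one name per line, that the language can be read from the file name, and that the helper `load_names` and the toy files are purely illustrative.

```python
import tempfile
from pathlib import Path


def load_names(names_dir):
    """Return parallel lists of names and languages, one entry per name."""
    names, languages = [], []
    for path in sorted(Path(names_dir).iterdir()):
        if not path.is_file():
            continue
        language = path.stem  # file name without its extension, e.g. "English"
        for line in path.read_text(encoding="utf-8").splitlines():
            name = line.strip()
            if name:  # skip blank lines
                names.append(name)
                languages.append(language)
    return names, languages


# Demo on a toy folder with two made-up language files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "English.txt").write_text("Smith\nJones\n", encoding="utf-8")
    (Path(tmp) / "French.txt").write_text("Dubois\n", encoding="utf-8")
    names, languages = load_names(tmp)

print(list(zip(names, languages)))
```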

**Just ignore the `eng-fra.txt`** 

Finally, apply all the steps that were done in the movie review task to train classifiers for this language classification task.

## NOTE:
*  The assignment can be done either by a single person or in a group (max 10 in a group)
*  Submit your assignment using this google [form](https://forms.gle/h3n5PoHtBiXDnN7p6).
*  Submit on or before ~~next Tuesday (8/03/2022)~~ **23/03/2022**.
*  The submission file should be a report in a pdf format which should include
        1. The names and roll numbers of all the members
        2. The objectives/goal of the experiments
        3. Results (accuracy) should be reported in a tabular format comparing the effects of
            1. choice of the vectoriser (Bag of words or TFIDF)
            2. choice of the classifiers
            3. use of lemmatization (whether it improves the model or not)
        4. Finally, conclude with your observations
        5. Also, include your code as a colab link in the pdf itself.


