NLP Coursework Assignment

CW heavily weighted on preprocessing steps (Feature extraction)!!!

Scientific method: 
Motivate, justify, explain 

Structure
Step 1 Baseline model: 
    Baseline model – select basic ML model (Naive Bayes) 
        Research papers, justify why we are starting with NB – proven results for text classification (spam filtering).
Step 2 Feature engineering:
    Analyse and compare different feature extraction methods and preprocessing. Input different features into same model. Analyse results. Compare feature extraction and preprocessing using same model. See what produces best results.  
        Preprocessing – tokenization and linguistic analysis (Zipf's law). 
        Latent Dirichlet Allocation (LDA) - Unsupervised.   
        Feature extraction (Information retrieval techniques) – feature vector construction: 
        Word frequency. 
        Word classification frequency (Num of neg/pos words). 
        Stemming and Lemmatization  
        PoS  
        TF-IDF, bigram models.  
Step 3:
    Then choose preprocessing based on results from baseline model. Expand baseline model to more complex model. 
    Implement different complex models to explore results and what works best with same preprocessing. Compare methods from different areas, e.g. SVM, clustering model and NN.  
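
For the Zipf's law analysis listed under Step 2, a rank-frequency table is one possible starting point. The sketch below is illustrative only: `rank_frequency` and the toy token list are hypothetical names and data, not part of the coursework code.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) rows, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]

# Hypothetical token list standing in for the tokenised reviews.
tokens = "the film was the best film the cast made".split()
table = rank_frequency(tokens)
for rank, word, freq in table:
    # Under Zipf's law, rank * frequency stays roughly constant on a large corpus.
    print(rank, word, freq, rank * freq)
```

On a real corpus, the same table could be used to choose cut-off ranks for the most and least frequent words before comparing feature sets.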


<hr>
<h1>Introduction:</h1>
Building software that can distinguish between positive and negative movie reviews is a binary classification task in machine learning. Arguably, the most important aspect of classifying a review as negative or positive is the review's features. These features define the review's characteristics and general opinion, and so determine how the review is interpreted. The analysis and construction of these features is therefore key to building a successful classification model. Classification falls into a branch of machine learning called supervised learning, which involves constructing a statistical model from labelled data sets that can then be used to predict the class of a new, unseen data point. This model is constructed from features engineered from the review text. There are many techniques within natural language processing (NLP) that can be used to extract useful features from language. The accuracy of the classification model relies heavily on extracting useful and revealing features, because model training constructs a prediction function from the statistical properties of these features. 
<br> <br>

<img src="images/text_classificaton_model.png" width="50%" alt="Source: https://monkeylearn.com/text-classification/"> 


The image above shows the process of training an ML model for binary text classification.

More specifically, the binary classification task could be approached in two similar ways. The first is to train a model on Boolean-labelled data, so that it predicts either positive or negative directly. The second is to train a model on data labelled with review ratings, so that it performs multi-class prediction over the ratings 0-4 and 7-9; a threshold can then be applied to produce a binary classification of positive or negative. Both approaches require a sufficient and equal number of data points for each class so that the training data remains balanced.
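
The second approach can be illustrated with a small sketch. The function name `rating_to_sentiment` and the cut-off of 5 are assumptions chosen for illustration; any cut-off between the negative ratings (0-4) and the positive ratings (7-9) would give the same labels.

```python
def rating_to_sentiment(rating, threshold=5):
    """Map a predicted review rating to a binary sentiment label.

    The threshold of 5 is an assumed cut-off between the two rating ranges.
    """
    return "pos" if rating >= threshold else "neg"

# Hypothetical multi-class predictions from a rating classifier.
predicted_ratings = [0, 3, 4, 7, 9]
labels = [rating_to_sentiment(r) for r in predicted_ratings]
print(labels)
```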

Feature engineering is a key step in constructing a successful text classifier. The text domain has a high-dimensional feature space: each word and each phrase can be interpreted and translated into many meaningful representations, some of which may not be relevant or beneficial for the review classification task. Irrelevant features may add noise and reduce prediction accuracy. The feature extraction process is an opportunity to reduce the high dimensionality of the review text's feature space, improving the efficiency and accuracy of our classifier. We therefore need to test different feature vectors, evaluating their relevance and usefulness with a defined process and metrics. According to John, Kohavi, and Pfleger (1994) [2], there are two main types of feature selection methods: wrappers and filters. The wrapper approach tests different features with the same baseline model and evaluates performance via model accuracy, allowing the identification and construction of an optimal feature vector that can be used to train a future model. Filters, by contrast, use evaluation metrics to determine a feature's ability to differentiate between classes. The latter method is described as much more suitable for identifying features for text classification, because wrappers require training a classifier for each feature subset and so become far more computationally expensive than filters [1]. However, as Mladenic showed [3], when using the filter selection method the choice of evaluation metric is paramount. A suitable metric for this text classification problem should take the problem domain and algorithm into account, since a metric that is effective in one domain may be less effective than alternatives in another [3].  

As a student of NLP, I will use a hybrid approach of both wrappers and filters, both to investigate the features and to compare the two feature selection processes. In contrast to Chen, Huang, Tian, and Qu [1], I estimate that with the wrapper approach the computational expense of training a classifier for each subset of feature vectors will be manageable. I am also interested in how the feature testing results from a baseline model compare with the results from an evaluation metric. Does the evaluation metric reflect the accuracy produced by the model? Does the metric indicate the relative performance of the baseline model? This comparison will also allow me to investigate different evaluation metrics.  
 
The goal of my hybrid approach and investigative comparison is to verify the integrity of both wrappers and filters as feature selection methods for text classification models and, most importantly, to engineer features that give the best performance from our text classifier. 

<br><hr>
<h1>Feature Analysis & Selection:</h1>
<h5>Wrapper - Baseline model:</h5>As my wrapper and baseline model, I will be using a Naive Bayes classifier, because it has been demonstrated to achieve relatively impressive results in text classification problems such as spam detection. The Naive Bayes classifier is also computationally inexpensive, so the overhead of training against many feature vector subsets will be relatively low. I will use my baseline model to measure the effect of feature vectors on predictive success, and ultimately to identify a feature vector which delivers optimal results. 
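
A minimal sketch of the wrapper idea: several candidate feature extractors are fed into the same Naive Bayes model and compared by cross-validated accuracy. It assumes scikit-learn's `CountVectorizer` as the feature extractor, which is not what the code in this notebook uses, and a small made-up corpus in place of the review data. The helper name `wrapper_scores` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the review data.
texts = [
    "a great and moving film", "great acting and a great plot",
    "wonderful film with a moving story", "a wonderful and great cast",
    "a dull and boring film", "boring plot and terrible acting",
    "terrible film with a dull story", "a dull and terrible cast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Candidate feature extractors; each one feeds the same baseline model.
candidates = {
    "unigram counts": CountVectorizer(),
    "unigram presence": CountVectorizer(binary=True),
    "unigram + bigram counts": CountVectorizer(ngram_range=(1, 2)),
}

def wrapper_scores(candidates, texts, labels, folds=2):
    """Return the mean cross-validated accuracy for each feature extractor."""
    scores = {}
    for name, vectorizer in candidates.items():
        model = make_pipeline(vectorizer, MultinomialNB())
        scores[name] = cross_val_score(model, texts, labels, cv=folds).mean()
    return scores

scores = wrapper_scores(candidates, texts, labels)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Keeping the model fixed means any difference in score can be attributed to the features rather than to the classifier.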

<h5>Filters - Evaluation metrics:</h5><br>
As Mladenic showed [3], a feature selection metric may be effective in one domain yet less effective than alternative metrics in another. It is therefore important to choose an evaluation metric carefully, considering our domain of binary classification. Mladenic and Grobelnik [3] demonstrated that the binary-class Odds Ratio was an effective measure when testing feature vectors for the Naive Bayes classifier. The Odds Ratio metric can also be applied to multi-class datasets, allowing the investigation of multi-class prediction accompanied by a binary threshold. 
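
As an illustration of a filter metric, the odds ratio of a word w is commonly written as log[ P(w|pos) (1 - P(w|neg)) / ((1 - P(w|pos)) P(w|neg)) ]. The sketch below is a hedged example rather than the exact procedure of [3]: it estimates each probability as the fraction of documents in a class that contain the word, and the `smoothing` term is an assumption added to avoid division by zero. The function name and toy documents are hypothetical.

```python
import math

def odds_ratio(word, pos_docs, neg_docs, smoothing=0.5):
    """Log odds ratio of `word` appearing in positive vs negative documents.

    P(word | class) is estimated as the smoothed fraction of documents in
    that class containing the word. The smoothing value is an assumption.
    """
    def smoothed_prob(docs):
        containing = sum(1 for doc in docs if word in doc)
        return (containing + smoothing) / (len(docs) + 2 * smoothing)

    p_pos = smoothed_prob(pos_docs)
    p_neg = smoothed_prob(neg_docs)
    return math.log((p_pos * (1 - p_neg)) / ((1 - p_pos) * p_neg))

# Hypothetical documents, each represented as a set of its words.
pos_docs = [{"great", "film"}, {"great", "cast"}, {"moving", "film"}]
neg_docs = [{"dull", "film"}, {"boring", "plot"}, {"dull", "cast"}]
for word in ["great", "dull", "film"]:
    print(word, round(odds_ratio(word, pos_docs, neg_docs), 3))
```

Words with a large positive score lean towards the positive class, words with a large negative score lean towards the negative class, and scores near zero suggest the word does little to separate the classes.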


import collections
import os
import re

import nltk
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

class PreProcess():

    def __init__(self):
        # Read data and format into data structure
        self.neg_reviews = ['data/neg/'+x for x in os.listdir('data/neg')]
        self.pos_reviews = ['data/pos/'+x for x in os.listdir('data/pos')]
        # Data structure: dict(file_name -> ([text], rating, {features}))
        self.vocabulary = {}
        self.data = self.readReviewData(self.neg_reviews) | self.readReviewData(self.pos_reviews)
        print("Total unique words: ",len(self.vocabulary.keys()))
        
    # Reads files, removes html and returns dict(key=filename,list of words) 
    def readReviewData(self, files):
        review_dict = {}
        for txt_file in files:
            with open(txt_file) as txt:
                lines = txt.readlines()
            # Strip HTML and tokenise.
            text = re.sub('<[^<]+?>', ' ',lines[0].replace(",",", ").replace(".",". "))
            text = nltk.tokenize.word_tokenize(text, language='english')
            # Collect all words into vocabulary
            self.updateVocabulary(text)
            rating = self.getRating(txt_file)
            # Second index is for rating.
            # Third index is for features.
            review_dict[txt_file] = (text,rating,{})
        return review_dict

    def updateVocabulary(self, text):
        for word in text:
            count = self.vocabulary.get(word,0)
            self.vocabulary[word] = count + 1 

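    def getRating(self, file_name):
        # Reads the single character before a 4-character extension such as
        # '.txt', so a two-digit rating such as 10 would be read as 0.
        return int(file_name[-5])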

    def collectWordFreq(self, review_data):
        # Loop through the reviews and attach a word-frequency feature to each.
        for filename in tqdm(review_data):
            review = review_data.get(filename)
            # Store the word frequencies in the review's feature dict (index 2).
            review_word_freq = self.countWordFreq(review[0])
            review[2]['word frequency']=review_word_freq

    # Count frequency of words in text, with add-one (Laplace) smoothing. 
    def countWordFreq(self, text):
        # Every vocabulary word starts at 1, so unseen words keep a count of 1
        # and a word seen n times ends with n + 1.
        word_freq_dict = dict.fromkeys(self.vocabulary, 1)
        for word in text:
            word_freq_dict[word] = word_freq_dict.get(word, 1) + 1
        # Sort by word so every review yields features in the same column order.
        return collections.OrderedDict(sorted(word_freq_dict.items()))

    def zipf_analysis(self, review_data):
        # Process 3 different datasets to train model:
        # 1 - Head, body & tail.
        # 2 - Head & body.
        # 3 - Body & tail.
        pass

    def extractFeatures(self, feature_list):
        review_data = self.data
        if 'word frequency' in feature_list:
            self.collectWordFreq(review_data)
            feature_vector=[]
            ratings_vector=[]
            for review in review_data:
                feature_vector.append(list(review_data[review][2]['word frequency'].values()))
                ratings_vector.append(review_data[review][1])
            return ratings_vector, feature_vector
        # Only 'word frequency' is implemented; fail clearly instead of returning None.
        raise ValueError("No supported feature in feature_list: %r" % (feature_list,))
    
def MNB_Classifier(X_train, X_test, y_train, y_test):
    # MultinomialNB has no `verbose` parameter; force_alpha requires scikit-learn >= 1.2.
    mnb_clf = MultinomialNB(force_alpha=True)
    print("Training Multinomial Naive Bayes Classifier.")
    mnb_clf.fit(X_train,y_train)
    y_pred = mnb_clf.predict(X_test)
    analyseResults(y_test, y_pred)
    # Return the predictions so the caller's assignment receives them.
    return y_pred

def analyseResults(y_test, y_pred):
    print("\nConfusion Matrix: ")
    print(confusion_matrix(y_test, y_pred))
    print("MNB accuracy (%):", metrics.accuracy_score(y_test, y_pred)*100)
            
preprocess = PreProcess()
ratings, feature_vector = preprocess.extractFeatures(['word frequency'])
X_train, X_test, y_train, y_test = train_test_split(feature_vector, ratings, test_size=0.2, random_state=42)
y_pred = MNB_Classifier(X_train, X_test, y_train, y_test)



Total unique words:  48388


 29%|██▉       | 1176/4000 [03:57<09:30,  4.95it/s]


KeyboardInterrupt: 
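
The interrupted run above builds and sorts a dictionary over the whole vocabulary for every review. One possible alternative, swapped in here rather than taken from the code above, is scikit-learn's `CountVectorizer`, which produces a sparse count matrix that `MultinomialNB` accepts directly. The token lists and labels below are hypothetical stand-ins for the tokenised reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical pre-tokenised reviews and labels standing in for the real data.
tokenised_reviews = [
    ["a", "great", "film"], ["great", "plot"],
    ["a", "dull", "film"], ["dull", "plot"],
]
sentiments = ["pos", "pos", "neg", "neg"]

# A callable analyzer passes each token list through unchanged, so the
# existing tokenisation is reused rather than replaced.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(tokenised_reviews)  # sparse matrix of word counts

# The default alpha=1.0 applies add-one (Laplace) smoothing inside the model,
# so the counts themselves do not need to be smoothed beforehand.
clf = MultinomialNB()
clf.fit(X, sentiments)

prediction = clf.predict(vectorizer.transform([["great", "great"]]))
print(X.shape, prediction)
```

Only the non-zero counts are stored, which avoids visiting every vocabulary word for every review.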

References: 

[1] Chen, J., Huang, H., Tian, S., & Qu, Y., 2009.
Feature selection for text classification with Naïve Bayes.
Expert Systems with Applications, 36(3), Part 1, pp. 5432-5435.

[2] John, G.H., Kohavi, R., & Pfleger, K., 1994.
Irrelevant Features and the Subset Selection Problem.
In Proceedings of the 11th International Conference on Machine Learning (pp. 121-129). Morgan Kaufmann, San Francisco.

[3] Mladenic, D., & Grobelnik, M., 1999. 
Feature selection for unbalanced class distribution and Naive Bayes. 
In Proceedings of the 16th International Conference on Machine Learning (pp. 258–267). San Francisco.