## Summary, aims & improvements

This notebook contains the code to load reviews from the IMDb review dataset and classify them as positive or negative. It contains instructions to download the dataset and preprocess the reviews from the command line. I have then created classes for 3 core parts of the process: loading and cleaning the data, vectorising the data, and training a model to predict the sentiment of a movie review. I have also created a master class that combines instances of these 3 parts so that the model can make a prediction on a review from the raw text data with just one function call.

Workflow:
- Instructions to download dataset and some minor preprocessing
- Basic data cleaning on reviews, with print statements to notify user if dataset hasn't been downloaded or extracted
- Vectorisation: tf-idf by default, but the user can easily change the method
- Model training: I trained 2 classifiers. The first is a logistic regression model, as it is a simple model whose interpretability helps to validate the training procedure. The second is an SVM, as SVMs tend to perform well on sparse datasets. Hyperparameters were tuned on the fly with a validation set.
- Testing: recorded the test accuracy of both models. Note that I also implemented a classifier fine-tuned from BERT in a separate notebook
- Productionise: create a class that combines the data cleaning, vectorisation and model to make predictions on raw text data. I have saved each component and the master class and reloaded them to demonstrate how this could be used in production
- Tests: I have completed some minor end-to-end tests, e.g. empty reviews and extremely long reviews

Results (Accuracy):
- Logistic Regression : 0.88024
- SVM : 0.88324
- BERT : 0.92

Positives about this system:
- The vectorisation class can easily provide different vectorisation methods by changing one argument when it is called, e.g. 'bag of words'.
- Code is modular and I have demonstrated how each instance of each class can be saved and loaded to create the master class. This is also done for the master class to demonstrate how this could easily be put into production
- I have an easily callable function for end to end predictions of reviews from raw text data, suitable for production needs

Production:
In production I would simply load the instance of the master class that is created after training and use it to make predictions on raw text data. An API would be created to handle requests for predictions, and to load data from another source, e.g. a CSV file, and pass it to the model. I would also use software like Docker to manage the versions of dependencies.
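
As a rough sketch of that idea, the snippet below reads reviews from a CSV source and answers a JSON request. `StubModel`, `reviews_from_csv`, `handle_request`, the `review` column name and the JSON layout are all hypothetical names chosen for illustration; the stub only stands in for the loaded master class and is assumed to expose a `predict` method.

```python
import csv
import io
import json


class StubModel:
    """Hypothetical stand-in for the loaded master class; only predict() is assumed."""

    def predict(self, reviews):
        # Toy rule so the example runs without the trained model.
        return [1 if "good" in review.lower() else 0 for review in reviews]


def reviews_from_csv(handle, column="review"):
    """Read the review text from one column of a CSV file-like object."""
    return [row[column] for row in csv.DictReader(handle)]


def handle_request(body, model):
    """Turn a JSON request body into a JSON response holding the predictions."""
    payload = json.loads(body)
    predictions = model.predict(payload["reviews"])
    return json.dumps({"predictions": [int(p) for p in predictions]})


model = StubModel()
csv_text = "id,review\n1,A good film\n2,Dull and slow\n"
batch = reviews_from_csv(io.StringIO(csv_text))
print(model.predict(batch))
print(handle_request('{"reviews": ["good fun"]}', model))
```

A real service would wrap `handle_request` in a web framework and would also need to decide what to return when the model rejects a query.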

If labelled data is available while in production, I would monitor the accuracy of the system to check that it is not drifting from the accuracy achieved on the test set, along with the precision and recall scores. If labelled data is unavailable, I would monitor the proportion of positive predictions compared to negative ones, as a shift could signify a change in the distribution of the labels between the dataset and the data being used in production.
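
A minimal sketch of both checks is given below. The function names are hypothetical, the baseline of 0.5 assumes production data is as evenly split as the training set, and the tolerance of 0.1 is an arbitrary value that would need tuning.

```python
def positive_rate(predictions):
    """Fraction of predictions equal to 1 (positive)."""
    if len(predictions) == 0:
        raise ValueError("no predictions to summarise")
    return sum(1 for p in predictions if p == 1) / len(predictions)


def drift_alert(predictions, baseline_rate=0.5, tolerance=0.1):
    """True when the positive rate moves more than `tolerance` from the baseline."""
    return abs(positive_rate(predictions) - baseline_rate) > tolerance


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class, for use when labels are available."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(drift_alert([1, 0, 1, 0]))     # balanced batch
print(drift_alert([1, 1, 1, 1, 0]))  # mostly positive batch
print(precision_recall([1, 0, 1, 1], [1, 1, 0, 1]))
```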

Analysis & Improvements:
- SVM only produces a very minor accuracy boost compared to logistic regression at the cost of lost interpretability. With more time I would explore more kernels for the SVM to try and boost the accuracy
- Data cleaning is very basic. I would like to experiment with stemming and n-grams to try and improve results
- BERT achieves superior results with about an hour's training on Google Colab with a GPU (separate notebook). You can skip training and load the model I trained from the repo. Because it leverages transfer learning, it would likely also perform better on out-of-distribution (OOD) inputs as a result of knowledge transfer from a large corpus. I have not productionised this model, though, and would like to do so with more time.
- Only a few end-to-end tests are implemented. I would like to add more and also test each class individually rather than only the whole system.
- I would like to automate the download and extraction of the dataset so that it does not have to be done from the command line.
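
The last point could be approached with the standard library alone. The sketch below is one possible shape, not something this notebook currently does: `ensure_dataset` and its parameters are hypothetical, the URL is my assumption about where the archive is hosted and should be checked, and the step that builds `full_train.txt` and `full_test.txt` is not covered.

```python
import os
import tarfile
import urllib.request

# Assumed download location of the archive; verify it before relying on it.
DATASET_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def ensure_dataset(archive="aclImdb_v1.tar.gz", extract_to=".",
                   expected_dir="aclImdb", url=DATASET_URL):
    """Download the archive if it is missing and extract it if needed."""
    target = os.path.join(extract_to, expected_dir)
    if os.path.isdir(target):
        # Already extracted, nothing to do.
        return target
    if not os.path.isfile(archive):
        urllib.request.urlretrieve(url, archive)
    # Only extract archives from a source you trust.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(extract_to)
    return target
```

Calling `ensure_dataset()` once before `DataLoader()` would then replace the manual download and extraction, and repeated calls would do nothing once the `aclImdb` directory exists.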

In [None]:
import os
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

## Load & Clean Data



In [None]:
class DataLoader():
    def __init__(self):
        self.data_path = 'aclImdb/data/'
        if not os.path.isfile('aclImdb_v1.tar.gz'):
            print ("Haven't downloaded dataset")
            
        if not os.path.isdir(self.data_path):
            print("haven't extracted data")
            
    def read_data(self):
        data_path = self.data_path
        reviews_train = []
        reviews_test = []
     
        # context managers close the files once they have been read
        with open(data_path + 'full_train.txt', 'r', encoding="utf8") as f:
            for line in f:
                reviews_train.append(line.strip())

        with open(data_path + 'full_test.txt', 'r', encoding="utf8") as f:
            for line in f:
                reviews_test.append(line.strip())

        return reviews_train, reviews_test
    
    def clean_data(self, reviews):
        ## remove these characters (raw strings avoid invalid escape sequence warnings)
        REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
        ## replace these characters with a space
        REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
        
        reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
        reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
        
        return reviews
    
    def read_and_clean(self):
        reviews_train, reviews_test = self.read_data()
        reviews_train = self.clean_data(reviews_train)
        reviews_test = self.clean_data(reviews_test)
        
        return reviews_train, reviews_test
        
     

In [None]:
loader = DataLoader()
reviews_train, reviews_test = loader.read_and_clean()

## Vectorise Reviews
Select one vectorisation method from 'one hot encode', 'bag of words' and 'tf-idf'.
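
To make the difference between the first two options concrete, here is a small pure-Python illustration on a toy corpus. It is a simplification written for this explanation, not scikit-learn's implementation: tokens are split on whitespace, and the helper names are hypothetical. tf-idf starts from the same counts and then down-weights words that appear in many documents.

```python
from collections import Counter

docs = ["good good film", "bad film"]

# Vocabulary: every distinct token, in sorted order.
vocab = sorted({token for doc in docs for token in doc.split()})


def bag_of_words(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocab]


def one_hot(doc):
    """Record only whether each vocabulary word occurs at all."""
    return [1 if count > 0 else 0 for count in bag_of_words(doc)]


print(vocab)                  # ['bad', 'film', 'good']
print(bag_of_words(docs[0]))  # [0, 1, 2]
print(one_hot(docs[0]))       # [0, 1, 1]
```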

In [None]:
class Vectoriser():    
    def __init__(self, mode = 'tf-idf'):
        
        if mode == 'tf-idf':
            vectorizer = TfidfVectorizer()
        
        elif mode == 'one hot encode':
            vectorizer = CountVectorizer(binary=True)
        
        elif mode == 'bag of words':
            vectorizer = CountVectorizer(binary=False)
            
        else:
            # fail loudly rather than leaving the instance without a vectoriser
            raise ValueError(
                "mode must be one of: 'tf-idf', 'one hot encode', 'bag of words'"
            )
        
        self.vectorizer = vectorizer
        
    def fit(self, reviews_train, reviews_test):
        vectorizer = self.vectorizer        
        vectorizer.fit(reviews_train)
        X = vectorizer.transform(reviews_train)
        X_test = vectorizer.transform(reviews_test)
        
        self.vec_method = vectorizer
        
        return X, X_test

        
        

In [None]:
## I have picked tf-idf to vectorise the data
vectoriser = Vectoriser(mode = 'tf-idf')
X, X_test = vectoriser.fit(reviews_train, reviews_test)

## the train and test sets have an equal number of reviews, are evenly split between positive and negative reviews, and are ordered with the positive reviews first

target = [1 if i < 12500 else 0 for i in range(25000)]


## Train a Classifier

Here we will train a logistic regression classifier to give an interpretable baseline model. We will also train an SVM to try to improve upon that without much more training time. We will then try to leverage transfer learning by deploying BERT (in a separate notebook) to further increase the accuracy at the cost of increased training time and model size.

#### Logistic Regression 

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.75
)

## tune the regularisation parameter on validation set
best_acc = 0
c_opt = 0
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    
    acc = accuracy_score(y_val, lr.predict(X_val))
    print ("Accuracy for C=%s: %s" 
           % (c, acc))
    
    if acc > best_acc:
        best_acc = acc
        c_opt = c
    

    
log_reg_model = LogisticRegression(C=c_opt)
log_reg_model.fit(X_train, y_train)

acc = accuracy_score(target, log_reg_model.predict(X_test))
print('')
print ("Test Accuracy for C=%s: %s" 
       % (c_opt, acc))

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_to_coef = {
    word: coef for word, coef in zip(
        vectoriser.vec_method.get_feature_names_out(), log_reg_model.coef_[0]
    )
}
print('words likely to lead to positive review')
for best_positive in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1], 
    reverse=True)[:5]:
    print (best_positive)
    
    

In [None]:
print('words likely to lead to negative review')
for best_negative in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1])[:5]:
    print (best_negative)

#### SVM
Small boost in accuracy for loss in interpretability 

In [None]:
best_acc = 0
c_opt = 0


for c in [0.01, 0.05, 0.25, 0.5, 1,1.5,2]:
    
    svm = LinearSVC(C=c)
    svm.fit(X_train, y_train)
    acc = accuracy_score(y_val, svm.predict(X_val))
    print ("Accuracy for C=%s: %s" 
           % (c, acc))
    
    if acc > best_acc:
        c_opt = c
        best_acc = acc
        
        
svm_mod = LinearSVC(C=c_opt)
svm_mod.fit(X_train, y_train)

acc = accuracy_score(target, svm_mod.predict(X_test))
print('')
print ("Test Accuracy for C=%s: %s" 
       % (c_opt, acc))

## Production and testing 

Create an instance of the final model by passing instances of the selected trained model and the associated Vectoriser() and DataLoader().
The logic protects against very long queries and empty queries. Both a single query as a string and a list of queries are accepted.

Save the trained model using the joblib module, and the instances of the data loader and vectoriser using the pickle module.

In [None]:
import joblib 
import pickle

joblib.dump(svm_mod, 'saved_model.joblib') 
# vectoriser, loader
with open('vectoriser.pickle', 'wb') as pickle_file:
    pickle.dump(vectoriser, pickle_file)
    
with open('loader.pickle', 'wb') as pickle_file:
    pickle.dump(loader, pickle_file)

# pickle.dump(loader, 'loader.pickle')

with open('loader.pickle', 'rb') as f:
    prod_loader = pickle.load(f)
    
with open('vectoriser.pickle', 'rb') as f:
    prod_vec = pickle.load(f)


prod_model = joblib.load('saved_model.joblib')

In [None]:
class FinalModel():
    def __init__(self, model, vectoriser, data_loader, max_len):
        self.model = model
        self.vectoriser = vectoriser
        self.data_loader = data_loader
        self.max_len = max_len
        
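    def valid_query(self, query):
        valid = True
        # check for None first, because len(None) raises a TypeError
        if (query is None) or (len(query) == 0):
            valid = False
            print('empty query!')
        # compare against the instance's limit rather than a global variable
        elif len(query) > self.max_len:
            print('query too long!')
            valid = False
        return valid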
        
        
    def predict(self, query):
        data_loader = self.data_loader
        vectoriser = self.vectoriser
        model = self.model
        
        if not isinstance(query,list):
            query = [query]
        clean_query = data_loader.clean_data(query)
        
        valid = [self.valid_query(single_query) for single_query in clean_query]
        if not all(valid):
            print('Invalid query(s)!')
            
            # collect the positions and raw text of the invalid queries
            idxs = []
            bad_querys = []
            for i, bl in enumerate(valid):
                if not bl:
                    idxs.append(i)
                    bad_querys.append(query[i])
                    
            print('bad_indexs : ', idxs)
            print('bad querys : ', bad_querys)
            return None
        
        X = self.vectoriser.vec_method.transform(clean_query)
        pred = self.model.predict(X)
        return pred

In [None]:
# specify the maximum acceptable length of a review as a little longer than the longest in the training data
max_len = max(len(review) for review in reviews_train) * 1.1


## test end-to-end queries from raw text on the final model using the test data and check the accuracy is the same

query = reviews_test.copy()

production_model = FinalModel(prod_model, prod_vec, prod_loader, max_len)

predictions = production_model.predict(query)

acc = accuracy_score(target, predictions)
print('')
print ("Test Accuracy for final model : %s" 
       % ( acc))

Now that we have created a class that can make predictions from unprocessed text data, we can save it as a complete model and reload it to demonstrate how we could use this in production.

Here I also implement some tests. Note that currently the model returns no predictions for a batch if any single query in it is deemed invalid.
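
One possible way around that limitation is sketched below: predict only on the valid queries and return `None` in the positions of the invalid ones. This is not what `FinalModel` does today. `predict_valid_only` and `toy_predict` are hypothetical names, and for simplicity the length check is applied to the raw text rather than the cleaned text.

```python
def predict_valid_only(queries, predict_fn, max_len):
    """Return one result per query: a prediction, or None if the query is invalid."""
    results = [None] * len(queries)
    valid_idxs = [
        i for i, query in enumerate(queries)
        if isinstance(query, str) and 0 < len(query) <= max_len
    ]
    if valid_idxs:
        # Predict on the valid subset, then put each result back in its original slot.
        preds = predict_fn([queries[i] for i in valid_idxs])
        for i, pred in zip(valid_idxs, preds):
            results[i] = int(pred)
    return results


def toy_predict(batch):
    """Toy stand-in for the trained pipeline so the example runs on its own."""
    return [1 if "good" in query else 0 for query in batch]


print(predict_valid_only(["", "bad", "good", "x" * 50], toy_predict, max_len=20))
# [None, 0, 1, None]
```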

In [None]:
## save and load the final model to demonstrate how this could be used in production, then test it against new queries
with open('production_model.pickle', 'wb') as pickle_file:
    pickle.dump(production_model, pickle_file)

# pickle.dump(loader, 'loader.pickle')

with open('production_model.pickle', 'rb') as f:
    loaded_production_model = pickle.load(f)


In [None]:
test_querys = ['', 'bad', 'good', 'too long '*int(max_len)]
for query in test_querys:
    print(loaded_production_model.predict(query))

In [None]:
loaded_production_model.predict(test_querys)