# Embedding a Machine Learning Model into a Web Application
In the previous chapter we built a machine learning model that classifies movie reviews as positive or negative, using a LogisticRegression classifier. We then built an out-of-core learning model with a stochastic gradient descent classifier (SGDClassifier), which is far less computationally expensive than the standard model. Here we'll embed that same out-of-core sentiment analysis model in a web application. The topics covered in this chapter are:
* Saving the current state of a trained machine learning model
* Using SQLite databases for storage
* Developing a web application using the popular Flask web framework
* Deploying a machine learning application to a public web server

We'll start by reusing the code that we developed for the out-of-core sentiment analysis model in the previous chapter.

First, we define a tokenizer function that cleans the unprocessed text data from the movie_data.csv file and separates it into word tokens while removing stop words:

In [1]:
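import numpy as np
import re
from nltk.corpus import stopwords
# Requires the NLTK stop-word corpus: nltk.download('stopwords')
stop = stopwords.words('english')
def tokenizer(text):
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) or ;-( before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace non-word characters with spaces, then append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized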
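
To see what the tokenizer produces, here is a small self-contained sketch. It repeats the function and, as an assumption made only so that the example runs without the NLTK corpus, swaps the full English stop-word list for a hypothetical three-word set.

```python
import re

# Hypothetical miniature stop-word set, used only so this sketch runs
# without downloading the NLTK stop-word corpus.
stop = {'this', 'is', 'a'}

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return [w for w in text.split() if w not in stop]

tokens = tokenizer('</a>This :) is :( a test :-)!')
print(tokens)  # ['test', ':)', ':(', ':)']
```

The emoticons are moved to the end of the token list and lose their "nose" character (`:-)` becomes `:)`), so different spellings of the same emoticon map to one token.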


Next we define a generator function, stream_docs, that reads in and returns one document at a time.

In [2]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) #skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label
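
The slicing in stream_docs relies on a fixed row layout. The following sketch applies the same two slices to a hypothetical row (the review text here is made up), assuming each row ends with a comma, a one-digit label, and a newline.

```python
# Hypothetical CSV row in the layout assumed for movie_data.csv:
# the quoted review text, a comma, a one-digit label, and a newline.
line = '"Great movie, would watch again",1\n'

text = line[:-3]       # drop the last three characters: ',', the label, '\n'
label = int(line[-2])  # the label is the second-to-last character

print(repr(text))  # '"Great movie, would watch again"'
print(label)       # 1
```

Because the slices count from the end of the line, commas inside the review text are harmless. The approach does assume that every row, including the last one, ends with a newline; a final row without one would shift both slices by one character.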

To verify that our stream_docs function works correctly, let's read in the first document from the movie_data.csv file:

In [3]:
next(stream_docs(path='./movie_data.csv'))

('"Just watched this movie over the weekend, and I must say I thoroughly enjoyed it. The 2 Italo American actors are excellent as usual (Michael Imperioli and John Ventimiglia). It is obvious that the director was influenced by 2 great films of the past directed by Italians. Primarily he was influenced by Dino Risi and his film IL SORPASSO. It is the story of 2 young men who meet by chance and become friends. One is extroverted and the other is introverted. They enjoy the whole day together and by the end of the day, the shy one learns that there is more to life than his usual routine monotony. The same thing happens to Albert De Santi. Unfortunately, IL SORPASSO has a very similar ending and this apparently influenced the director of ON THE RUN because he uses the same technique but with a twist. I had expected something but was surprised to see that it turned out to be the opposite. If you watch both movies you will understand. The other film that influenced the director is AFTER HOU

We'll now define a function, get_minibatch, that takes a document stream from the stream_docs function and returns the number of documents specified by the size parameter:

In [4]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
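
The behavior at the end of the stream is worth a quick check. This sketch repeats get_minibatch and feeds it a hypothetical in-memory stream of five documents instead of the CSV file.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical stream of five (text, label) pairs standing in for stream_docs.
toy_stream = iter([('good', 1), ('bad', 0), ('fine', 1), ('awful', 0), ('great', 1)])

first = get_minibatch(toy_stream, size=2)
second = get_minibatch(toy_stream, size=2)
third = get_minibatch(toy_stream, size=2)  # only one document is left

print(first)   # (['good', 'bad'], [1, 0])
print(second)  # (['fine', 'awful'], [1, 0])
print(third)   # (None, None)
```

Note that an incomplete final batch is discarded rather than returned: the third call consumes the last document but still yields `(None, None)`. This does not matter when the dataset size is a multiple of the batch size, as it is in this chapter.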


Unfortunately, we can't use the *CountVectorizer* for out-of-core learning, since it requires holding the complete vocabulary in memory. Likewise, the *TfidfVectorizer* needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, scikit-learn implements another useful vectorizer for text processing, the *HashingVectorizer*, which is data independent and makes use of the hashing trick via the 32-bit MurmurHash3 algorithm by Austin Appleby.
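
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is an illustration rather than scikit-learn's implementation: MD5 from the standard library stands in for MurmurHash3, and the function names `hashed_index` and `hash_vectorize` are hypothetical.

```python
import hashlib

N_FEATURES = 2**21  # same number of columns as the HashingVectorizer in this chapter

def hashed_index(token, n_features=N_FEATURES):
    # Assumption: MD5 is used here only as a stand-in for MurmurHash3.
    digest = hashlib.md5(token.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'little') % n_features

def hash_vectorize(tokens, n_features=N_FEATURES):
    # Sparse representation of one document: {column index: token count}
    counts = {}
    for tok in tokens:
        idx = hashed_index(tok, n_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts

vec = hash_vectorize(['love', 'this', 'movie', 'love'])
print(vec)
```

Each token's column index is computed from the token alone, so no vocabulary has to be built or stored and no fitting step is needed. The price is that distinct tokens can collide in the same column, which a large feature space such as 2**21 makes unlikely. The real HashingVectorizer additionally applies a sign to each hashed feature and normalizes the resulting vectors by default, both of which this sketch leaves out.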

In [5]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error='ignore',
                        n_features=2**21,
                        preprocessor=None,
                        tokenizer=tokenizer)
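# loss='log_loss' selects logistic regression (named 'log' before scikit-learn 1.1);
# the old n_iter argument is omitted because partial_fit runs one pass per call
clf = SGDClassifier(loss='log_loss', random_state=1)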
doc_stream = stream_docs(path='./movie_data.csv')

Having set up all the complementary functions, we can now start the out-of-core learning using the following code:

In [6]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:42


We initialized the progress bar object with 45 iterations and, in the following for loop, iterated over 45 minibatches of 1,000 documents each.  
Having completed the incremental learning process, we'll use the last 5,000 documents to evaluate the performance of the model.

In [7]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %0.3f' % clf.score(X_test, y_test))

Accuracy: 0.868


As we can see, the accuracy of the model is 87 percent, slightly below the accuracy we achieved using grid search for hyperparameter tuning. However, out-of-core learning is very memory efficient and took less than a minute to complete. Finally, we can use the last 5,000 documents to update the model.

In [8]:
clf = clf.partial_fit(X_test, y_test)

Now, we have the out-of-core sentiment analysis classifier up and ready.

## Serializing fitted scikit-learn estimators

Training a machine learning model can be computationally quite expensive. We certainly don't want to retrain our model every time we close our Python interpreter and want to make a new prediction, or every time we reload our web application. One option for **model persistence** is Python's built-in pickle module, which allows us to serialize and deserialize Python object structures to compact bytecode. This lets us save our classifier in its current state and reload it when we want to classify new samples, without having to learn the model from the training data all over again.

In [9]:
import pickle
import os
dest = os.path.join('movieclassifier', 'pkl_objects')
os.makedirs(dest, exist_ok=True)
# The with blocks close each file as soon as the object has been written
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=4)
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)

In the preceding code, we created a movieclassifier directory where we will later store the files and data for our web application. Within the movieclassifier directory, we created a pkl_objects subdirectory to save the serialized Python objects to our local drive. Using pickle's dump function, we then serialized the trained logistic regression model as well as the stop-word list from the NLTK library, so that we don't have to install the NLTK vocabulary on our server.
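
The dump and load calls always come as a pair. The following round-trip sketch uses a hypothetical dictionary in place of a fitted estimator and a temporary directory in place of pkl_objects, so it runs without any trained model.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted estimator: any picklable Python
# object follows the same dump/load pattern.
model_state = {'weights': [0.25, -1.5, 3.0], 'classes': [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_state.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model_state, f, protocol=4)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model_state)  # True
```

The restored object is an equal but independent copy of the original. Keep in mind that unpickling can execute arbitrary code, so pickle files should only be loaded from sources we trust.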

We don't need to pickle the HashingVectorizer, since it does not need to be fitted. Instead, we can create a new Python script file from which we can import the vectorizer into our current Python session.

In [10]:
%%writefile movieclassifier/vectorizer.py
from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

cur_dir = os.path.dirname(__file__)
with open(os.path.join(cur_dir, 'pkl_objects', 'stopwords.pkl'), 'rb') as f:
    stop = pickle.load(f)

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized
vect = HashingVectorizer(decode_error='ignore',
                        n_features=2**21,
                        preprocessor=None,
                        tokenizer=tokenizer)


Overwriting movieclassifier/vectorizer.py


After executing the preceding code cells, we can restart the IPython kernel to check that the objects were serialized correctly.  
First, change the current working directory to movieclassifier.

In [1]:
import os
os.chdir('movieclassifier')

In [2]:
import pickle
import re
import os
from vectorizer import vect
with open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb') as f:
    clf = pickle.load(f)


Having successfully loaded the vectorizer and unpickled the classifier, we can now use these objects to preprocess document samples and predict their sentiment.

In [4]:
import numpy as np
label = {0:'negative', 1:'positive'}
example = ['I love this movie']
X = vect.transform(example)
print('Prediction: %s\nProbability: %0.2f%%' %(label[clf.predict(X)[0]], np.max(clf.predict_proba(X))*100))

Prediction: positive
Probability: 86.91%


Note that the predict_proba method returns an array with a probability value for each unique class label; we used np.max to select the probability of the predicted class.
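
To make the relationship between the probability array and the printed result concrete, here is a sketch in plain Python. A nested list stands in for the NumPy array, and the probability values are illustrative assumptions, not output from the trained model.

```python
label = {0: 'negative', 1: 'positive'}

# Hypothetical result of clf.predict_proba(X) for a single document:
# one row with one probability per class, in the order of the class labels.
proba = [[0.13, 0.87]]

row = proba[0]
predicted_class = row.index(max(row))  # position of the largest probability
confidence = max(row)                  # what np.max returns for a single row

print('Prediction: %s\nProbability: %0.2f%%' % (label[predicted_class], confidence * 100))
```

Taking the maximum over the whole array only works because we classify one document at a time. For several documents at once, the maximum would have to be taken per row, for example with `np.max(proba, axis=1)`.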

## Setting up a SQLite database for data storage

Here we'll set up a simple SQLite database to collect optional feedback about the predictions from the users of the web application. We can use this feedback to update our classification model. SQLite is an open source SQL database engine that doesn't require a separate server to operate, which makes it ideal for smaller projects and simple web applications.  
Fortunately, the Python standard library already includes an API, sqlite3, that allows us to work with SQLite databases.

In [8]:
import sqlite3
import os
conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')

example1 = 'I love this movie'
c.execute("INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))", (example1, 1))
example2 = 'I disliked this movie'
c.execute("INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now')) ", (example2, 0))
conn.commit()
conn.close()

In the preceding code, we created a new SQLite database inside the movieclassifier directory and stored two examples of movie reviews.
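
The `?` placeholders in the INSERT statements pass the review text to SQLite as data rather than as part of the SQL string. The sketch below illustrates why that matters, using a throwaway in-memory database and a hypothetical review that contains quote characters.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway database, nothing is written to disk
c = conn.cursor()
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')

# Hypothetical review containing both single and double quotes.
tricky = "It's a \"must-see\"; I didn't expect that"
c.execute("INSERT INTO review_db (review, sentiment, date) "
          "VALUES (?, ?, DATETIME('now'))", (tricky, 1))
conn.commit()

c.execute('SELECT review, sentiment FROM review_db')
rows = c.fetchall()
conn.close()
print(rows)
```

The review is stored and read back unchanged. Building the statement with string formatting instead would break on the quotes and would leave the application open to SQL injection, which is a real concern once the text comes from users of a web form.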

To check if the entries have been stored in the database table correctly, we'll now reopen the connection to the database and use SQL's SELECT command to fetch all the rows that have been committed between the beginning of the year 2018 and the current date. 

In [9]:
conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()
c.execute(" SELECT * FROM review_db WHERE date BETWEEN '2018-01-01 00:00:00' AND DATETIME('now')")
results = c.fetchall()
conn.close()
print(results)

[('I love this movie', 1, '2018-02-10 04:55:42'), ('I disliked this movie', 0, '2018-02-10 04:55:42')]
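
Since the stored feedback is meant for updating the classifier later, it can be convenient to read it back in batches. The following is a minimal sketch under stated assumptions: the helper name `iter_feedback_batches` is hypothetical, the third review is made up, and an in-memory database replaces reviews.sqlite.

```python
import sqlite3

def iter_feedback_batches(conn, batch_size):
    """Yield (reviews, labels) lists with at most batch_size rows each."""
    c = conn.cursor()
    c.execute('SELECT review, sentiment FROM review_db ORDER BY rowid')
    while True:
        rows = c.fetchmany(batch_size)
        if not rows:
            break
        reviews = [r[0] for r in rows]
        labels = [r[1] for r in rows]
        yield reviews, labels

# In-memory database with the same schema as review_db above.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')
conn.executemany(
    "INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))",
    [('I love this movie', 1), ('I disliked this movie', 0), ('Not bad at all', 1)])
conn.commit()

batches = list(iter_feedback_batches(conn, batch_size=2))
conn.close()
print(batches)
```

Each batch has the same shape as the output of get_minibatch, so it could in principle be passed through `vect.transform` and then to `clf.partial_fit` in the same way as the training minibatches.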
