# Embedding a Machine Learning Model into a Web Application

In the previous chapters, you learned about the many different machine learning concepts and algorithms that can help us with better and more efficient decision-making. However, machine learning techniques are not limited to offline applications and analysis; they can also serve as the predictive engine of your web services. For example, popular and useful applications of machine learning models in web applications include spam detection in submission forms, search engines, recommendation systems for media or shopping portals, and many more.

In this chapter, you will learn how to embed a machine learning model into a web application that can not only classify, but also learn from data in real time. The topics that we will cover are as follows: 

* Saving the current state of a trained machine learning model
* Using SQLite databases for data storage
* Developing a web application using the popular Flask web framework
* Deploying a machine learning application to a public web server

# Serializing fitted scikit-learn estimators

Training a machine learning model can be computationally quite expensive, as we have seen previously. Surely we do not want to retrain our model every time we close our Python interpreter and want to make a new prediction or reload our web application? One option for model persistence is Python's built-in *pickle* module, which allows us to serialize and deserialize Python object structures to compact byte streams so that we can save our classifier in its current state and reload it if we want to classify new samples, without needing the model to learn from the training data all over again. Before you execute the following code, please make sure that you have trained the out-of-core logistic regression model from the last section of the previous chapter and have it ready in your current Python session:

In [4]:
'''
import pickle 
import os

dest = os.path.join('movieclassifier', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)
    
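# Use with blocks so that each file is closed (and flushed) after writing
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=4)
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)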
''';

Using the preceding code, we created a *movieclassifier* directory where we will later store the files and data for our web application. Within this *movieclassifier* directory, we created a *pkl_objects* subdirectory to save the serialized Python objects to our local drive. Via the *dump* function of the *pickle* module, we then serialized the trained logistic regression model as well as the stop word set from the **Natural Language Toolkit (NLTK)** library, so that we do not have to install the NLTK vocabulary on our server.

The *dump* function takes as its first argument the object that we want to pickle, and for the second argument we provided an open file object that the Python object will be written to. Via the *wb* argument inside the *open* function, we opened the file in binary mode for pickle, and we set *protocol=4* to choose pickle protocol 4, an efficient protocol that was added in Python 3.4 and can be read by Python 3.4 or newer.
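The write and read steps described above can be tried in isolation. The following is a minimal sketch, not the code used for the movie classifier: the `stop_words` set is a hypothetical stand-in for the pickled objects, and a temporary directory is used so that no project files are touched.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for an object we want to persist
stop_words = {'the', 'a', 'and'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'stopwords.pkl')

    # 'wb' opens the file for writing in binary mode; the with block
    # closes the file even if dump raises an exception
    with open(path, 'wb') as f:
        pickle.dump(stop_words, f, protocol=4)

    # 'rb' opens the file for reading in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The deserialized object compares equal to the original
print(restored == stop_words)
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.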

We do not need to pickle *HashingVectorizer*, since it does not need to be fitted. Instead, we can create a new Python script file from which we can import the vectorizer into our current Python session. Now, copy the following code and save it as *vectorizer.py* in the *movieclassifier* directory: 

In [5]:
'''
from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(os.path.join(cur_dir,'pkl_objects','stopwords.pkl'), 'rb'))

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    # Raw strings keep the regex backslashes from being read as string escapes
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',n_features=2**21,preprocessor=None,tokenizer=tokenizer)
''';
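To see what the tokenizer in *vectorizer.py* does to a raw review, the following sketch applies the same three regular expression steps to a short example. The function name `demo_tokenizer`, the tiny `demo_stop` set, and the sample sentence are assumptions made for this illustration; *vectorizer.py* loads the real stop word set from *stopwords.pkl*.

```python
import re

# Hypothetical, tiny stop word set used only for this illustration
demo_stop = {'this', 'is', 'a'}

def demo_tokenizer(text, stop_words):
    # 1. Strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # 2. Collect emoticons before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # 3. Replace runs of non-word characters with a space, then
    #    append the emoticons without their "nose" character
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    # 4. Split on whitespace and drop stop words
    return [w for w in text.split() if w not in stop_words]

tokens = demo_tokenizer('This is <b>great</b> :-)', demo_stop)
print(tokens)  # ['great', ':)']
```

The HTML tags and stop words are removed, while the emoticon survives as its own token, which is useful because emoticons often carry sentiment.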

After we have pickled the Python objects and created the *vectorizer.py* file, it would now be a good idea to restart our Python interpreter or IPython Notebook kernel to test if we can deserialize the objects without error. 

From your terminal, navigate to the directory that contains the *movieclassifier* directory (the import and file paths in the following code are relative to it), start a new Python session, and execute the following code to verify that you can import the *vectorizer* and unpickle the classifier:

In [8]:
import pickle
import re
import os
from movieclassifier.vectorizer import vect

clf = pickle.load(open(os.path.join('movieclassifier', 'pkl_objects', 'classifier.pkl'), 'rb'))

After we have successfully loaded the *vectorizer* and unpickled the classifier, we can now use these objects to preprocess document samples and make predictions about their sentiment:

In [10]:
import numpy as np

label = {0:'negative', 1:'positive'}

example = ['I love this movie']
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%'
      % (label[clf.predict(X)[0]], np.max(clf.predict_proba(X))*100))

Prediction: positive
Probability: 86.25%


Since our classifier returns the class labels as integers, we defined a simple Python dictionary to map these integers to their sentiment. We then used *HashingVectorizer* to transform the simple example document into a word vector *X*. Finally, we used the *predict* method of the logistic regression classifier to predict the class label, as well as the *predict_proba* method to return the corresponding probability of our prediction. Note that the *predict_proba* method call returns an array with a probability value for each unique class label. Since the class label with the largest probability corresponds to the class that is returned by the *predict* call, we used the *np.max* function to return the probability of the predicted class.
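The relationship between the predicted label and the largest probability can be illustrated without a trained model. In the following sketch, `proba_row` is a hypothetical plain Python list standing in for one row of the array returned by *predict_proba*, with made-up values listed in class order 0, 1:

```python
label = {0: 'negative', 1: 'positive'}

# Hypothetical probabilities for one document, in class order [0, 1];
# predict_proba would return one such row per input document
proba_row = [0.1375, 0.8625]

# The predicted class is the index of the largest probability ...
predicted_class = max(range(len(proba_row)), key=lambda i: proba_row[i])
# ... and the largest probability is the confidence in that prediction
confidence = max(proba_row)

print('Prediction: %s\nProbability: %.2f%%'
      % (label[predicted_class], confidence * 100))
```

For a binary problem, the two probabilities sum to one, so the reported value is never below 50 percent.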

# Setting up an SQLite database for data storage

In this section, we will set up a simple SQLite database to collect optional feedback about the predictions from users of the web application. We can use this feedback to update our classification model. SQLite is an open source SQL database engine that does not require a separate server to operate, which makes it ideal for smaller projects and simple web applications. Essentially, a SQLite database can be understood as a single, self-contained database file that allows us to directly access storage files.

Furthermore, SQLite does not require any system-specific configuration and is supported by all common operating systems. It has gained a reputation for being very reliable as it is used by popular companies, such as Google, Mozilla, Adobe, Apple, Microsoft, and many more.

Fortunately, following Python's *batteries included* philosophy, there is already an API in the Python standard library, *sqlite3*, which allows us to work with SQLite databases. 

By executing the following code, we will create a new SQLite database inside the *movieclassifier* directory and store two example movie reviews: 

In [2]:
import sqlite3
import os

bd_file = os.path.join('movieclassifier', 'reviews.sqlite')

if os.path.exists(bd_file):
    os.remove(bd_file)

conn = sqlite3.connect(bd_file)
c = conn.cursor()
c.execute("CREATE TABLE review_db"\
          " (review TEXT, sentiment INTEGER, date TEXT)")

example1 = "I love this movie"
c.execute("INSERT INTO review_db"\
          " (review, sentiment, date) VALUES"\
          " (?, ?, DATETIME('now'))", (example1, 1))

example2 = "I disliked this movie"
c.execute("INSERT INTO review_db"\
          " (review, sentiment, date) VALUES"\
          " (?, ?, DATETIME('now'))", (example2, 0))

conn.commit()
conn.close()

Following the preceding code example, we created a connection (*conn*) to a SQLite database file by calling the *connect* function of the *sqlite3* module, which created the new database file *reviews.sqlite* in the *movieclassifier* directory. Please note that SQLite does not implement a replace function for existing tables, which is why the code first deletes any existing database file; otherwise, executing the *CREATE TABLE* statement a second time would raise an error.

Next, we created a cursor via the *cursor* method, which allows us to traverse the database records using the versatile SQL syntax. Via the first *execute* call, we then created a new database table, *review_db*, which we use to store and access database entries. In the same statement, we defined three columns for this table: *review*, *sentiment*, and *date*. We used these to store two example movie reviews and their respective class labels (sentiments).

Using the *DATETIME('now')* SQL command, we also added date and timestamps to our entries. In addition to the timestamps, we used the question mark symbol (*?*) to pass the movie review texts (*example1* and *example2*) and the corresponding class label (1 and 0) as positional arguments to the *execute* method, as members of a tuple. Lastly, we called the *commit* method to save the changes that we made to the database and closed the connection via the *close* method. 
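A practical benefit of the question mark placeholders is that the *sqlite3* module handles the quoting of the values for us. The following sketch repeats the insert on an in-memory database (the special name `':memory:'`) so that *reviews.sqlite* is left untouched; the review text in `tricky` is a made-up example containing quote characters, and `contextlib.closing` is used here only to make sure the connection is closed.

```python
import sqlite3
from contextlib import closing

# ':memory:' creates a temporary database in RAM, so no file is touched
with closing(sqlite3.connect(':memory:')) as conn:
    c = conn.cursor()
    c.execute("CREATE TABLE review_db"
              " (review TEXT, sentiment INTEGER, date TEXT)")

    # Hypothetical review containing single and double quotes
    tricky = "It's \"great\"; I'd watch it again"

    # The values are passed separately from the SQL text, so the quotes
    # in the review cannot break or alter the statement
    c.execute("INSERT INTO review_db"
              " (review, sentiment, date) VALUES"
              " (?, ?, DATETIME('now'))", (tricky, 1))
    conn.commit()

    c.execute("SELECT review, sentiment FROM review_db")
    rows = c.fetchall()

print(rows)
```

Building the same statement with string formatting would require manual escaping and would open the door to SQL injection once the reviews come from users of the web application.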

To check if the entries have been stored in the database table correctly, we will now reopen the connection to the database and use the SQL *SELECT* command to fetch all rows in the database table that have been committed between the beginning of the year 2017 and today:

In [3]:
conn = sqlite3.connect(bd_file)
c = conn.cursor()

c.execute("SELECT * FROM review_db WHERE date"\
          " BETWEEN '2017-01-01 00:00:00' AND DATETIME('now')")
results = c.fetchall()

conn.close()
print(results)

[('I love this movie', 1, '2018-04-13 09:50:09'), ('I disliked this movie', 0, '2018-04-13 09:50:09')]
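The *BETWEEN* filter works on a *TEXT* column because *DATETIME* produces strings in the form `YYYY-MM-DD HH:MM:SS` (in UTC), and strings in this format sort in chronological order. The following sketch demonstrates this on an in-memory database with two hypothetical rows that carry fixed timestamps, so that the result does not depend on the current date; the date bounds are passed as placeholders here, which is an illustrative choice rather than the query used above.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(':memory:')) as conn:
    c = conn.cursor()
    c.execute("CREATE TABLE review_db"
              " (review TEXT, sentiment INTEGER, date TEXT)")

    # Hypothetical rows with fixed timestamps, one before and one after 2017
    c.executemany("INSERT INTO review_db"
                  " (review, sentiment, date) VALUES (?, ?, ?)",
                  [('old review', 0, '2016-12-31 23:59:59'),
                   ('new review', 1, '2018-04-13 09:50:09')])

    # The comparison is a plain string comparison on the date column
    c.execute("SELECT review FROM review_db"
              " WHERE date BETWEEN ? AND ?",
              ('2017-01-01 00:00:00', '2018-12-31 23:59:59'))
    in_range = [row[0] for row in c.fetchall()]

print(in_range)  # ['new review']
```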
