<h1 style="font-size: 35px;">Embedding a Machine Learning Model into a Flask Web Application</h1><hr style="border: 1px solid #f00" >

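Install the following through _pip_ or _conda_:
* jupyter notebook
* numpy
* scikit-learn
* nltk
* Flask
* WTForms
* PyPrind

The `re`, `pickle`, and `sqlite3` modules are also used, but they ship with Python's standard library and need no separate installation.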

<br>
<br>

# Training a model for movie review classification

In this section we train a logistic regression model for movie review classification. Execute the following code blocks to train the model that we will serialize in the next section.

In real-world applications we often work with large datasets that may exceed our computer's memory. Not everyone has access to a supercomputer, so we apply a technique called _out-of-core learning_, which fits the model incrementally on small chunks of the data.

**Stochastic gradient descent** is an optimization algorithm that updates the model's weights using one sample at a time. In this section, we will use the **partial_fit** method of the **SGDClassifier** in scikit-learn to stream the documents directly from our local drive and train a logistic regression model on small minibatches of documents.
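
To make the "one sample at a time" idea concrete, here is a minimal sketch of stochastic gradient descent for logistic regression in plain Python. It is not how scikit-learn implements **SGDClassifier**; the toy data, the learning rate `lr`, and the helper names `sgd_step` and `predict` are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    # Logistic function; maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, bias, x, y, lr=0.1):
    # One stochastic gradient descent update from a single sample (x, y).
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    error = sigmoid(z) - y  # gradient of the log loss with respect to z
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias

# Hypothetical toy data: the label is 1 when the first feature dominates.
samples = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.5], 0)]

weights, bias = [0.0, 0.0], 0.0
rng = random.Random(1)
for epoch in range(50):
    rng.shuffle(samples)
    for x, y in samples:
        # The weights change after every single sample, not after a full pass.
        weights, bias = sgd_step(weights, bias, x, y)

def predict(x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return int(sigmoid(z) >= 0.5)
```

Because each update needs only one sample (or one small minibatch), the full dataset never has to be in memory at once, which is what makes out-of-core learning possible.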

In [2]:
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

#nltk.download()

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
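    # raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace non-word characters with spaces, then append the emoticons without noses
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')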
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In the code example above, we define a **tokenizer** function that cleans the unprocessed text data from **movie_data.csv** and splits it into word tokens while removing stop words.
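
As a quick illustration of what the tokenizer does, the sketch below runs the same function on a short made-up string. The tiny `stop` set is a hypothetical stand-in for NLTK's English stop word list, so that the example runs without downloading any NLTK data.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
stop = {'a', 'is', 'this', 'the', 'and'}

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace punctuation with spaces, then append the emoticons (noses removed).
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return [w for w in text.split() if w not in stop]

tokens = tokenizer('</a>This :) is a <b>test</b> :-)!')
print(tokens)  # ['test', ':)', ':)']
```

The HTML tags and stop words disappear, while the two emoticons survive at the end of the token list.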

Next we define a generator function called **stream_docs**, which reads in and returns one document at a time.
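
The slicing inside the generator relies on every line ending with a comma, a one-digit label, and a newline. The sketch below shows this on a throwaway file; the file name `toy_movie_data.csv` and its two reviews are made up, and the explicit `encoding='utf-8'` is an assumption about how the file was written.

```python
import os
import tempfile

def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # A line looks like: <text>,<digit>\n
            # line[:-3] drops the trailing ",<digit>\n"; line[-2] is the digit.
            text, label = line[:-3], int(line[-2])
            yield text, label

# Hypothetical two-review file in the same layout as movie_data.csv.
rows = 'review,sentiment\n"Great movie",1\n"Awful, boring plot",0\n'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_movie_data.csv')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(rows)
    docs = list(stream_docs(path))

print(docs)  # [('"Great movie"', 1), ('"Awful, boring plot"', 0)]
```

Commas inside the review text are harmless because the label is always taken from the end of the line.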

<br>

To verify that the **stream_docs** function works correctly, we read the first document from the **movie_data.csv** file. The call returns a tuple of the review text and the corresponding class label.

In [3]:
next(stream_docs(path='movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

<br>

We will now define a function called **get_minibatch** that will take a document stream from the **stream_docs** function and return a particular number of documents specified by the _size_ parameter.

In [6]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y
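
The following sketch exercises **get_minibatch** on a small in-memory stream; the five toy documents are hypothetical. It also shows one consequence of the `except` branch: when the stream runs out in the middle of a batch, the function returns `(None, None)` and the partially collected documents are discarded.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical stream of five labelled documents.
toy_stream = iter([('doc %d' % i, i % 2) for i in range(5)])

first = get_minibatch(toy_stream, size=2)   # (['doc 0', 'doc 1'], [0, 1])
second = get_minibatch(toy_stream, size=2)  # (['doc 2', 'doc 3'], [0, 1])
third = get_minibatch(toy_stream, size=2)   # (None, None): only one document was left
```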

<br>

We will be using **HashingVectorizer** from scikit-learn, which converts text into fixed-length feature vectors without keeping a vocabulary in memory.

**HashingVectorizer** is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 algorithm by Austin Appleby (https://sites.google.com/site/murmurhash/).
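
The hashing trick itself can be sketched in a few lines. In this illustration `hashlib.md5` is only a convenient, deterministic hash from the standard library; it is not the MurmurHash3 function that **HashingVectorizer** uses, and the helper name `hashed_counts` and the 16-column size are assumptions chosen for readability.

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    # Map each token to a column index via a hash, then count occurrences.
    counts = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        index = int.from_bytes(digest[:4], 'little') % n_features
        counts[index] += 1
    return counts

vec_a = hashed_counts(['great', 'movie', 'great'])
vec_b = hashed_counts(['great', 'movie', 'great'])
```

No vocabulary is built or stored: the same token always lands in the same column, so the mapping does not depend on the data seen so far. With few columns, different tokens can share an index, which is the hash collision discussed in this section.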

In [4]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

# Note: newer scikit-learn releases spell these arguments loss='log_loss' and max_iter=1
clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
doc_stream = stream_docs(path='movie_data.csv')

Using the preceding code, we initialized **HashingVectorizer** with our **tokenizer** function and set the number of features to 2<sup>21</sup>. We also initialized a logistic regression classifier by setting the _loss_ parameter of the **SGDClassifier** to _log_. Choosing a large number of features in the **HashingVectorizer** reduces the chance of hash collisions, but it also increases the number of coefficients in our logistic regression model.

<br>

After setting up all the complementary functions, we can now start the _out-of-core learning_ using the following code.

We use the PyPrind package to estimate the progress of our learning algorithm. We initialize the progress bar object with 45 iterations, and in the following _for_ loop we iterate over 45 minibatches of 1,000 documents each.

In [None]:
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0%                          100%
[#####                         ] | ETA: 00:01:18

<br>

Having completed the incremental learning process, we will use the last 5,000 documents to evaluate the performance of our model.

In [8]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.867


<br>

The accuracy of the model is 87 percent, which is not bad. The _out-of-core learning_ approach is very memory-efficient and took less than a minute to complete. Finally, we can use the last 5,000 documents to update our model.

In [9]:
clf = clf.partial_fit(X_test, y_test)

<br>
<br>

# Serializing fitted scikit-learn estimators

After training the logistic regression model as shown in the last section, we can now save the classifier along with the stop words, Porter Stemmer, and `HashingVectorizer` as serialized objects to our local disk so that we can use the fitted classifier in our web application later.

<br>

Training a machine learning model can be computationally quite expensive, as we have seen in the last section. We don't want to retrain our model every time we close our Python interpreter and want to make a new prediction or reload our web application. One option for model persistence is Python's built-in pickle module (https://docs.python.org/2.7/library/pickle.html), which allows us to serialize and de-serialize Python object structures to a compact byte stream, so that we can save our classifier in its current state and reload it to classify new samples without learning the model from the training data all over again.

Using the following code, we create a _movieclassifier_ directory where we will later store the files and data for our web application. Within this _movieclassifier_ directory, we create a **pkl_objects** subdirectory to save the serialized Python objects to our local drive. Via pickle's **dump** method, we then serialize the trained logistic regression model as well as the stop word set from the NLTK library so that we don't have to install the NLTK vocabulary on our server. The **dump** method takes as its first argument the object that we want to pickle, and as its second argument an open file object that the Python object will be written to. Via the _wb_ argument inside the **open** function, we open the file in binary mode for pickle, and we set _protocol=2_, the newest pickle protocol that Python 2 can still read.

In [10]:
import pickle
import os

dest = os.path.join('movieclassifier', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)

# Use context managers so each file is flushed and closed after writing
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=2)
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=2)
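
The dump-and-load round trip described above can be tried in isolation. This is a small sketch that uses a temporary directory and a made-up `stop_words` list in place of the real stop words and classifier.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the objects pickled above.
stop_words = ['a', 'an', 'the']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stopwords.pkl')
    with open(path, 'wb') as f:  # 'wb': write in binary mode
        pickle.dump(stop_words, f, protocol=2)
    with open(path, 'rb') as f:  # 'rb': read in binary mode
        restored = pickle.load(f)

print(restored == stop_words)  # True: same content, but a new object
```

Only unpickle files that you created yourself or otherwise trust, because loading a pickle can execute arbitrary code.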

Next, we save the `HashingVectorizer` in a separate file so that we can import it later.

In [11]:
%%writefile movieclassifier/vectorizer.py
from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(
                os.path.join(cur_dir, 
                'pkl_objects', 
                'stopwords.pkl'), 'rb'))

def tokenizer(text):
    # raw strings keep the regex backslashes from being read as string escapes
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) \
                   + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)

Writing movieclassifier/vectorizer.py


After executing the preceding code cells, we can restart the IPython notebook kernel to check whether the objects were serialized correctly.

First, change the current Python directory to `movieclassifier`:

In [1]:
import os
os.chdir('movieclassifier')

In [2]:
import pickle
import re
import os
from vectorizer import vect

clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))

In [3]:
import numpy as np
label = {0:'negative', 1:'positive'}

example = ['I love this movie']
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], clf.predict_proba(X).max()*100))

Prediction: positive
Probability: 82.53%


<br>
<br>

# Setting up a SQLite database for data storage 

Before you execute this code, please make sure that you are currently in the `movieclassifier` directory.

In [4]:
import sqlite3

conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')

example1 = 'I love this movie'
c.execute("INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))", (example1, 1))

example2 = 'I disliked this movie'
c.execute("INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))", (example2, 0))

conn.commit()
conn.close()

In [5]:
conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()

c.execute("SELECT * FROM review_db WHERE date BETWEEN '2015-01-01 10:10:10' AND DATETIME('now')")
results = c.fetchall()

conn.close()

In [6]:
print(results)

[(u'I love this movie', 1, u'2016-04-21 23:26:57'), (u'I disliked this movie', 0, u'2016-04-21 23:26:57')]
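
To experiment with these SQL statements without touching **reviews.sqlite**, one option is an in-memory database. The sketch below uses the same table layout and `?` placeholders as above; the `:memory:` database and the extra `WHERE sentiment = ?` query are illustrative choices, not part of the application.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway database, nothing is written to disk
c = conn.cursor()
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')

rows = [('I love this movie', 1), ('I disliked this movie', 0)]
# The ? placeholders let sqlite3 handle quoting of the inserted values.
c.executemany(
    "INSERT INTO review_db (review, sentiment, date) VALUES (?, ?, DATETIME('now'))",
    rows)
conn.commit()

c.execute('SELECT review, sentiment FROM review_db WHERE sentiment = ?', (1,))
positive = c.fetchall()
conn.close()

print(positive)  # [('I love this movie', 1)]
```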


<br>

# Developing a web application with Flask

Install the Flask library if it's not in your current Python environment.

```bash
pip install flask
```
or
```bash
conda install flask
```

## Our first Flask web application

First, create a directory tree like this one in your current directory:

```text
flask_app_1/
    app.py
    templates/
        first_app.html
```

<br>
<br>

The **app.py** file will contain the main code that will be executed. The _templates_ directory is where Flask looks for the HTML templates that the application renders; static files such as CSS and JavaScript go in a separate _static_ directory, which we add in the next app.

Let's now create **app.py** inside your flask_app_1 directory by creating a text file in Jupyter Notebook and renaming it to **app.py**.

```python
from flask import Flask, render_template
app = Flask(__name__)

@app.route('/')
def index():
    return render_template('first_app.html')

if __name__ == '__main__':
    app.run()
```

Here, we run our application as a single module. We initialize a new Flask instance with the argument **__name__** to let Flask know that it can find the HTML template folder (_templates_) in the same directory. Next, we use the **@app.route('/')** decorator to specify the URL that triggers the execution of the **index()** function, which renders the **first_app.html** file. Finally, the **if __name__ == '__main__':** statement ensures that **app.run()** starts the server only when this script is executed directly.
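
If the decorator syntax is unfamiliar, the toy class below may help to show the general idea of mapping a URL to a function. It is a deliberately simplified sketch and not Flask's implementation; `TinyApp` and its `handle` method are hypothetical names invented for this example.

```python
class TinyApp:
    # Toy stand-in for a web application object; NOT Flask's implementation.
    def __init__(self):
        self.routes = {}

    def route(self, url):
        def decorator(view_func):
            self.routes[url] = view_func  # remember which function serves this URL
            return view_func
        return decorator

    def handle(self, url):
        # Look up the registered function and call it, or report a missing page.
        view_func = self.routes.get(url)
        return view_func() if view_func else '404 Not Found'

app = TinyApp()

@app.route('/')
def index():
    return 'Hi, this is my first app!'

home = app.handle('/')         # 'Hi, this is my first app!'
missing = app.handle('/nope')  # '404 Not Found'
```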

<br>
<br>

Now let's add some HTML content to the **first_app.html** file:

```html
<!DOCTYPE html>
<html>
    <head>
        <title>First App</title>
    </head>
    <body>
        <div>Hi, this is my first Flask web app!</div>
    </body>
</html>
```

Now, let's start our web application by executing the following command from the terminal or command prompt inside the flask_app_1 directory:

```bash
python app.py
```

We should now see a line like the following in the terminal or command prompt:
```text
* Running on http://127.0.0.1:5000/
```
This line contains the address of your local server. Enter the address in your web browser to see the application.

<br>
<br>

## Form validation and rendering

In our next app, we will extend our simple Flask web application with HTML form elements to learn how to collect data from a user using the **WTForms** library (https://wtforms.readthedocs.org/en/latest/).

```bash
pip install wtforms
```

This web app prompts the user to enter their name. After the user presses the **Say Hello** button, the form is validated and a new HTML page displays the user's name.

The new directory structure will look like this:
```text
flask_app_2/
    app.py
    static/
        style.css
    templates/
        _formhelpers.html
        first_app.html
        hello.html
```

<br>
<br>

Now let's modify the app.py file:
```python
from flask import Flask, render_template, request
from wtforms import Form, TextAreaField, validators

app = Flask(__name__)

class HelloForm(Form):
    sayhello = TextAreaField('',[validators.DataRequired()])

@app.route('/')
def index():
    form = HelloForm(request.form)
    return render_template('first_app.html', form=form)

@app.route('/hello', methods=['POST'])
def hello():
    form = HelloForm(request.form)
    if request.method == 'POST' and form.validate():
        name = request.form['sayhello']
        return render_template('hello.html', name=name)
    return render_template('first_app.html', form=form)

if __name__ == '__main__':
    app.run(debug=True)
```

Using **wtforms**, we extended the **index()** function with a text field that we will embed in our start page using the _TextAreaField_ class. The attached _DataRequired_ validator checks whether the user has provided input text.

Then we defined a new function, **hello()**, which renders the HTML page **hello.html** if the form has been validated. The route accepts the _POST_ method, which transports the form data to the server in the message body.

Finally, we set the argument _debug=True_ to activate Flask's debugger.

<br>

Using the **Jinja2** (http://jinja.pocoo.org) template engine, we will create a generic macro in **\_formhelpers.html**, which we will later import in our **first_app.html** to render the text field. 
```html
{% macro render_field(field) %}
    <dt>{{ field.label }}
    <dd>{{ field(**kwargs)|safe }}
    {% if field.errors %}
        <ul class=errors>
        {% for error in field.errors %}
            <li>{{ error }}</li>
        {% endfor %}
        </ul>
    {% endif %}
    </dd>
{% endmacro %}
```

<br>

Inside your static directory, add this code to the **style.css** file:
```CSS
body {
    font-size: 2em;
}
```

Now let's add this code to the **first_app.html**:
```html
<!DOCTYPE html>
<html>
    <head>
        <title>First app</title>
        <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
    </head>
    <body>
        {% from "_formhelpers.html" import render_field %}
        <div>What's your name?</div>
        <form method=post action="/hello">
            <dl>
                {{ render_field(form.sayhello) }}
            </dl>
            <input type=submit value='Say Hello' name='submit_btn'>
        </form>
    </body>
</html>
```

Lastly, we create a **hello.html** file that will be rendered via the line **return render_template('hello.html', name=name)** inside the **hello()** function, which we defined in the _app.py_ script, to display the text that a user submitted via the text field. The code is as follows:
```html
<!DOCTYPE html>
<html>
    <head>
        <title>First app</title>
        <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
    </head>
    <body>
        <div>Hello {{ name }}</div>
    </body>
</html>
```

In the flask_app_2 directory, run the command below in your terminal or command prompt:
```bash
python app.py
```

# Movie Classifier Web Application

Now we are going to embed the movie classifier into a Flask web application.

#### Breakdown
The web app prompts the user to enter a movie review. After the review is submitted, the user sees a new page that shows the predicted class label and the probability of the prediction. The user can then give feedback about this prediction by clicking the _Correct_ or _Incorrect_ button. Either feedback button leads to a simple _thank you_ screen with a _Submit another review_ button that redirects the user back to the start page.

This is what the directory tree will look like:
```text
movieclassifier/
    app.py
    pkl_objects/
        classifier.pkl
        stopwords.pkl
    reviews.sqlite
    static/
        style.css
    templates/
        _formhelpers.html
        results.html
        reviewform.html
        thanks.html
    vectorizer.py
```
You've already created this directory and most of its contents with the earlier code for the movie classifier. We are just adding **app.py**, **static/**, and **templates/** inside the movieclassifier directory.

<br>

Let's now write the **app.py** file. Since this is a long piece of code, we will break it into two parts:
```python
# Part one
from flask import Flask, render_template, request
from wtforms import Form, TextAreaField, validators
import pickle
import sqlite3
import os
import numpy as np

# import HashingVectorizer from local dir
from vectorizer import vect

app = Flask(__name__)

######## Preparing the Classifier
cur_dir = os.path.dirname(__file__)
clf = pickle.load(open(os.path.join(cur_dir, 'pkl_objects/classifier.pkl'), 'rb'))
db = os.path.join(cur_dir, 'reviews.sqlite')

def classify(document):
    label = {0: 'negative', 1: 'positive'}
    X = vect.transform([document])
    y = clf.predict(X)[0]
    proba = np.max(clf.predict_proba(X))
    return label[y], proba

def train(document, y):
    X = vect.transform([document])
    clf.partial_fit(X, [y])

def sqlite_entry(path, document, y):
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute("INSERT INTO review_db (review, sentiment, date)"\
    " VALUES (?, ?, DATETIME('now'))", (document, y))
    conn.commit()
    conn.close()

# Part two
```

We imported the **HashingVectorizer** and unpickled the logistic regression classifier. Next, we defined a **classify()** function that returns the predicted class label and the corresponding probability for a given text document. The **train()** function updates the classifier, provided that a document and a class label are given. The **sqlite_entry()** function stores a submitted movie review in the SQLite database along with its class label and a timestamp. Note that the _clf_ object will be reset to its original, pickled state if we restart the web application.

```python 
# Part two

class ReviewForm(Form):
    moviereview = TextAreaField('',
                                [validators.DataRequired(),
                                validators.length(min=15)])
@app.route('/')
def index():
    form = ReviewForm(request.form)
    return render_template('reviewform.html', form=form)

@app.route('/results', methods=['POST'])
def results():
    form = ReviewForm(request.form)
    if request.method == 'POST' and form.validate():
        review = request.form['moviereview']
        y, proba = classify(review)
        return render_template('results.html',
                                content=review,
                                prediction=y,
                                probability=round(proba*100, 2))
    return render_template('reviewform.html', form=form)

@app.route('/thanks', methods=['POST'])
def feedback():
    feedback = request.form['feedback_button']
    review = request.form['review']
    prediction = request.form['prediction']
    
    inv_label = {'negative': 0, 'positive': 1}
    y = inv_label[prediction]
    if feedback == 'Incorrect':
        y = int(not(y))
    train(review, y)
    sqlite_entry(db, review, y)
    return render_template('thanks.html')

if __name__ == '__main__':
    app.run(debug=True)

```

We defined a **ReviewForm** class that instantiates a _TextAreaField_, which will be rendered in the **reviewform.html** template file (the landing page of our web app). This, in turn, is rendered by the **index()** function. With the **validators.length(min=15)** parameter, we require the user to enter a review that contains at least 15 characters.

The **feedback()** function fetches the predicted class label from the **results.html** template if a user clicked on the _Correct_ or _Incorrect_ feedback button, and then transforms the predicted sentiment into an integer class label that will be used to update the classifier via **train()**. If the user marked the prediction as _Incorrect_, the label is flipped first. A new entry is also written to the SQLite database, and the **thanks.html** template is rendered.
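
The label handling inside **feedback()** can be isolated into a small function for clarity. The helper name `corrected_label` is hypothetical; the logic mirrors the lines in **app.py** that map the predicted string back to an integer and flip it when the user disagrees.

```python
inv_label = {'negative': 0, 'positive': 1}

def corrected_label(prediction, feedback):
    # Start from the predicted class; flip it when the user disagrees.
    y = inv_label[prediction]
    if feedback == 'Incorrect':
        y = int(not y)
    return y

print(corrected_label('positive', 'Correct'))    # 1
print(corrected_label('positive', 'Incorrect'))  # 0
```

Either way, the value that reaches **train()** and the database is the label the user considers correct, not necessarily the one the model predicted.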

<br>

Let's now create some HTML code for **reviewform.html**:
```html
<!DOCTYPE html>
<html>
    <head>
        <title>Movie Classification</title>
    </head>
    <body>
        <h2>Please enter your movie review:</h2>
        {% from "_formhelpers.html" import render_field %}
        <form method=post action="/results">
            <dl>
                {{ render_field(form.moviereview, cols='30', rows='10') }}
            </dl>
            <div>
                <input type=submit value='Submit review' name='submit_btn'>
            </div>
        </form>
    </body>
</html>
```

Here, we simply imported the same **\_formhelpers.html** template that we defined earlier. Its **render_field()** macro renders a _TextAreaField_ where a user can provide a movie review and submit it via the _Submit review_ button displayed at the bottom of the page. This _TextAreaField_ is 30 columns wide and 10 rows tall.

<br>

Our next template, **results.html**, looks a little bit more interesting:
```html
<!DOCTYPE html>
<html>
    <head>
        <title>Movie Classification</title>
        <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
    </head>
    <body>
        <h3>Your movie review:</h3>
        <div>{{ content }}</div>
        <h3>Prediction:</h3>
        <div>This movie review is <strong>{{ prediction }}</strong>
            (probability: {{ probability }}%).</div>
        <div id='button'>
            <form action="/thanks" method="post">
                <input type=submit value='Correct' name='feedback_button'>
                <input type=submit value='Incorrect' name='feedback_button'>
                <input type=hidden value='{{ prediction }}' name='prediction'>
                <input type=hidden value='{{ content }}' name='review'>
            </form>
        </div>
        <div id='button'>
            <form action="/">
                <input type=submit value='Submit another review'>
            </form>
        </div>
    </body>
</html>
```

First, we inserted the submitted review as well as the results of the prediction in the corresponding fields _{{ content }}_, _{{ prediction }}_, and _{{ probability }}_. You may notice that we used the _{{ content }}_ and _{{ prediction }}_ placeholder variables a second time in the form that contains the _Correct_ and _Incorrect_ buttons. This is a workaround to _POST_ those values back to the server to update the classifier and store the review in case the user clicks on one of those two buttons.

<br>

Let's now modify the **style.css** file in your **static/** directory:
```CSS
body{
    width: 600px;
}
#button{
    padding-top: 20px;
}
```

<br>

Input this code into your **thanks.html** file in your **templates/** directory:
```html
<!DOCTYPE html>
<html>
    <head>
        <title>Movie Classification</title>
    </head>
    <body>
        <h3>Thank you for your feedback!</h3>
        <div id='button'>
            <form action="/">
                <input type=submit value='Submit another review'>
            </form>
        </div>
    </body>
</html>
```