In [1]:
# -*- coding: utf-8 -*-
# @author: tongzi
# @description: combining different models for ensemble learning
# @created date: 2019/09/04
# @last modification: 2019/09/16

In [2]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### Preparing the movie dataset into a more convenient format

In [3]:
import pyprind

In [4]:
import os

In [5]:
basepath = 'aclImdb'
pbar = pyprind.ProgBar(50000)
df = pd.DataFrame()
labels = {'pos': 1, 'neg': 0}

In [6]:
rows = []
for s in ('test', 'train'):
    for label in ('pos', 'neg'):
        path = os.path.join(basepath, s, label)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            # collect rows in a list; DataFrame.append was removed in pandas 2.0
            rows.append([txt, labels[label]])
            pbar.update()
df = pd.DataFrame(rows)



0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:04:38


In [7]:
df.columns = ['review', 'sentiment']

In [8]:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

#### Transforming words into feature vectors  
To construct a bag-of-words model based on the word counts in the respective
documents, we can use the CountVectorizer class implemented in scikit-learn. As
we will see in the following code section, CountVectorizer takes an array of text
data, which can be documents or sentences, and constructs the bag-of-words model
for us:

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
count = CountVectorizer()

In [11]:
docs = np.array([
    'I am a boy',
    'I like basketball',
    "But I don't play basketball for a long time"
])

In [12]:
bag = count.fit_transform(docs)

In [13]:
count.vocabulary_

{'am': 0,
 'boy': 2,
 'like': 6,
 'basketball': 1,
 'but': 3,
 'don': 4,
 'play': 8,
 'for': 5,
 'long': 7,
 'time': 9}

In [14]:
bag.toarray()

array([[1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1, 0, 1, 1, 1]], dtype=int64)

The values in the feature vectors are also called the raw term frequencies, $tf(t, d)$: the number of times a term $t$ occurs in a document $d$.

>To summarize the concept of the n-gram representation, the 1-gram and 2-gram representations of a document such as "the sun is shining" would be constructed as follows:  
• 1-gram: "the", "sun", "is", "shining"  
• 2-gram: "the sun", "sun is", "is shining"  
The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter. While a 1-gram representation is used by default, we could switch to a 2-gram representation by initializing a new CountVectorizer instance with *ngram_range=(2,2)*.
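
The n-gram construction described in the note can be sketched in a few lines of plain Python. This is a hand-rolled illustration, not CountVectorizer's implementation: the helper name `ngrams` is hypothetical, and it splits on whitespace only, whereas CountVectorizer also lowercases the text and applies its own token pattern.

```python
def ngrams(text, n):
    """Return the contiguous n-token sequences of a whitespace-tokenized text."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = ngrams('the sun is shining', 1)
bigrams = ngrams('the sun is shining', 2)
print(unigrams)  # ['the', 'sun', 'is', 'shining']
print(bigrams)   # ['the sun', 'sun is', 'is shining']
```

A document with fewer than n tokens simply yields an empty list.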

#### Assessing word relevancy via term frequency-inverse document frequency  
  
  When we are analyzing text data, we often encounter words that occur across
multiple documents from both classes. These frequently occurring words typically
don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called **term frequency-inverse document frequency (tf-idf)** that can be used to downweight these frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:  
$$\text{tf-idf}(t, d) = tf(t, d) \times idf(t, d)$$


Here, $tf(t, d)$ is the term frequency that we introduced in the previous section, and $idf(t, d)$ is the inverse document frequency, which can be calculated as follows:  
$$idf(t, d) = \log \frac{n_d}{1+df(d, t)}$$

Here $n_d$ is the total number of documents, and $df(d, t)$ is the number of documents $d$ that contain the term $t$.

The scikit-learn library implements yet another transformer, the TfidfTransformer
class, that takes the raw term frequencies from the CountVectorizer class as input
and transforms them into tf-idfs:

In [15]:
from sklearn.feature_extraction.text import TfidfTransformer

In [16]:
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)

In [17]:
np.set_printoptions(precision=2)

In [18]:
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.71 0.   0.71 0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.61 0.   0.   0.   0.   0.8  0.   0.   0.  ]
 [0.   0.3  0.   0.39 0.39 0.39 0.   0.39 0.39 0.39]]


However, if we manually calculated the tf-idfs of the individual terms in our
feature vectors, we would notice that TfidfTransformer calculates the tf-idfs slightly
differently from the standard textbook equations that we defined previously.
The inverse document frequency implemented in scikit-learn is
computed as follows:  
$$idf(t, d) = \log \frac{1+n_d}{1+df(d, t)}$$

Similarly, the tf-idf computed in scikit-learn deviates slightly from the default equation we defined earlier:  
$$\text{tf-idf}(t,d) = tf(t, d) \times \left(idf(t,d)+1 \right)$$

By default (norm='l2'), scikit-learn's TfidfTransformer applies L2-normalization, which returns a vector of length 1 by dividing an un-normalized feature vector $v$ by its L2-norm:  
$$v_{norm} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + ... + v_n^2}} = \frac{v}{\sqrt{\sum_{i=1}^{n} v_i^2}}$$
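
To check these equations against the TfidfTransformer output printed earlier, the calculation can be repeated by hand with NumPy. This is a minimal sketch that assumes the natural logarithm; the count matrix is copied from `bag.toarray()` above, and the variable names (`tf`, `df_t`, `tfidf_manual`) are chosen only for this illustration.

```python
import numpy as np

# Raw term counts for the three example documents (rows) over the
# ten-word vocabulary (columns), copied from bag.toarray() above.
tf = np.array([[1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
               [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
               [0, 1, 0, 1, 1, 1, 0, 1, 1, 1]], dtype=float)

n_d = tf.shape[0]                      # total number of documents
df_t = (tf > 0).sum(axis=0)            # number of documents containing each term
idf = np.log((1 + n_d) / (1 + df_t))   # smoothed inverse document frequency
raw = tf * (idf + 1)                   # un-normalized tf-idf
# divide every row by its L2-norm so that each document vector has length 1
tfidf_manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

print(np.round(tfidf_manual, 2))
```

Rounded to two decimals, the rows agree with the values printed by TfidfTransformer above, for example 0.61 and 0.8 in the second document.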

#### Cleaning text data

In [19]:
df.loc[0, 'review']

"I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge."

In [20]:
df.loc[0, 'review'][-50:]

'and I suggest that you go see it before you judge.'

In [21]:
import re
def preprocessor(text):
    # raw strings keep the regex escapes (\), \(, \W) from being read as string escapes
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    return text

In [22]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

Since we will make use of the cleaned text data over and over again during the
next sections, let us now apply our preprocessor function to all the movie reviews
in our DataFrame:

In [23]:
df['review'] = df['review'].apply(preprocessor)

#### Processing documents into tokens

In [24]:
from nltk.stem.porter import PorterStemmer

In [25]:
porter = PorterStemmer()

In [26]:
def tokenizer(text):
    return text.split()

In [27]:
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [28]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [38]:
from nltk import word_tokenize
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

#### Train a logistic regression model for document classification

In [39]:
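# Slice by position: after shuffling, the index labels no longer follow row order,
# and label-based .loc slices include both endpoints
X_train = df['review'].values[:25000]
y_train = df['sentiment'].values[:25000]
X_test = df['review'].values[25000:]
y_test = df['sentiment'].values[25000:]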

In [40]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

In [41]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)

In [42]:
# recent scikit-learn versions expect stop_words to be a list (or None), not a set
param_grid = [{'vector__ngram_range': [(1, 1)],
               'vector__stop_words': [list(stop), None],
               'vector__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},

              {'vector__ngram_range': [(1, 1)],
               'vector__stop_words': [list(stop), None],
               'vector__tokenizer': [tokenizer, tokenizer_porter],
               'vector__use_idf': [False],
               'vector__smooth_idf': [False],
               'vector__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]

In [43]:
# the liblinear solver supports both the 'l1' and 'l2' penalties searched above
lr_tfidf = Pipeline([('vector', tfidf),
                     ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

In [46]:
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', 
                          cv=5, verbose=1, n_jobs=-1)

In [None]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


In [None]:
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV accuracy: {gs_lr_tfidf.best_score_:.3f}')


In [None]:
clf = gs_lr_tfidf.best_estimator_

In [None]:
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

#### Working with bigger data: online algorithms and out-of-core learning  
If you executed the code examples in the previous section, you may have noticed that constructing the feature vectors for the 50,000 movie reviews during grid search can be computationally quite expensive. In many real-world applications, it is not uncommon to work with even larger datasets that exceed our computer's memory. Since not everyone has access to supercomputer facilities, we will now apply a technique called out-of-core learning, which allows us to work with such large datasets by fitting the classifier incrementally on smaller batches of the dataset.

In [None]:
def tokenizer(text):
    # strip HTML tags, keep emoticons, lowercase, and drop stop words
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = (re.sub(r'[\W]+', ' ', text.lower())
            + ' '.join(emoticons).replace('-', ''))
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

In [None]:
def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip the CSV header line
        for line in csv:
            # each line ends with ',<label>\n', so the label is the second-to-last character
            text, label = line[:-3], int(line[-2])
            yield text, label
            
        

In [None]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        # the stream is exhausted; return whatever was collected
        pass
    return docs, y
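
The streaming idea behind these two helpers can be tried without the full dataset. The sketch below is a self-contained variation that reads from an in-memory CSV; the helper names `stream_rows` and `take_batch` and the three fake reviews are made up for this illustration and are not part of the notebook.

```python
import io
from itertools import islice

def stream_rows(handle):
    """Yield (text, label) pairs one line at a time from a 'review,sentiment' CSV."""
    next(handle)  # skip the header line
    for line in handle:
        # the last three characters of each line are ',<label>\n'
        yield line[:-3], int(line[-2])

def take_batch(stream, size):
    """Collect up to `size` items; a shorter or empty batch signals the end of the stream."""
    batch = list(islice(stream, size))
    texts = [text for text, _ in batch]
    labels = [label for _, label in batch]
    return texts, labels

# hypothetical three-row file standing in for movie_data.csv
fake_csv = io.StringIO('review,sentiment\n"good film",1\n"bad film",0\n"fine",1\n')
stream = stream_rows(fake_csv)
first = take_batch(stream, 2)
second = take_batch(stream, 2)
print(first)
print(second)
```

Only one batch is held in memory at a time, which is what makes the approach suitable for files larger than the available RAM.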

Unfortunately, we can't use CountVectorizer for out-of-core learning since it
requires holding the complete vocabulary in memory. Also, TfidfVectorizer
needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, scikit-learn implements another useful vectorizer for text processing, HashingVectorizer. HashingVectorizer is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 function by Austin Appleby (https://sites.google.com/site/murmurhash/):

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

In [None]:
vect = HashingVectorizer(decode_error='ignore', n_features=2**21,
                         preprocessor=None, tokenizer=tokenizer)
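
To see why a hashing vectorizer needs no stored vocabulary, the hashing trick can be sketched in plain Python. This is only an illustration under stated assumptions: `zlib.crc32` stands in for MurmurHash3, the helper name `hash_features` is hypothetical, and details of the real HashingVectorizer such as sign alternation and normalization are left out.

```python
import zlib

def hash_features(tokens, n_features=2**21):
    """Map tokens to a sparse {column: count} dict without storing a vocabulary."""
    counts = {}
    for tok in tokens:
        # crc32 is a stand-in here; scikit-learn uses MurmurHash3 instead
        col = zlib.crc32(tok.encode('utf-8')) % n_features
        counts[col] = counts.get(col, 0) + 1
    return counts

vec_a = hash_features('this movie was great'.split())
vec_b = hash_features('great movie'.split())
# the same token always lands in the same column, in any document or batch
print(sorted(set(vec_a) & set(vec_b)))
```

Because the column index depends only on the token itself, each mini-batch can be transformed independently. The price is that two different tokens can collide in the same column, which a large `n_features` makes unlikely.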

In [None]:
# 'log_loss' is the logistic loss ('log' before scikit-learn 1.1); n_iter no longer exists
clf = SGDClassifier(loss='log_loss', random_state=1)

In [None]:
doc_stream = stream_docs(path='./movie_data.csv')

In [None]:
import pyprind
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1024)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

Having completed the incremental learning process, we will use the last 5,000 documents to evaluate the performance of our model:

In [None]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Out-of-core learning is very memory efficient and took less than a minute to complete. Finally, we can use the last 5,000 documents to update our model:

In [None]:
clf = clf.partial_fit(X_test, y_test)

>A more modern alternative to the bag-of-words model is word2vec,
an algorithm that Google released in 2013 (Efficient Estimation of Word
Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean,
arXiv preprint arXiv:1301.3781, 2013). The word2vec algorithm is an
unsupervised learning algorithm based on neural networks that attempts
to automatically learn the relationship between words. The idea behind
word2vec is to put words that have similar meanings into similar clusters,
and via clever vector-spacing, the model can reproduce certain words
using simple vector math, for example, king – man + woman = queen.
The original C-implementation with useful links to the relevant papers
and alternative implementations can be found at https://code.google.com/p/word2vec/.

### Topic modeling with Latent Dirichlet Allocation  
  
In this section, we will introduce a popular technique for topic modeling called Latent Dirichlet Allocation (**LDA**). However, note that while Latent Dirichlet
Allocation is often abbreviated as LDA, it is not to be confused with Linear
discriminant analysis, a supervised dimensionality reduction technique that we
introduced in Chapter 5, Compressing Data via Dimensionality Reduction.

#### Decomposing text documents with LDA

LDA is a generative probabilistic model that tries to find groups of words that appear frequently together across different documents. These frequently appearing words represent our topics, assuming that each document is a mixture of different words. The input to an LDA is the bag-of-words model we discussed earlier in this chapter. Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:  
• A document to topic matrix  
• A word to topic matrix  
  
LDA decomposes the bag-of-words matrix in such a way that if we multiply those
two matrices together, we would be able to reproduce the input, the bag-of-words matrix, with the lowest possible error. In practice, we are interested in those topics that LDA found in the bag-of-words matrix. The only downside may be that we must define the number of topics beforehand—the number of topics is a hyperparameter of LDA that has to be specified manually.
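
The shapes involved in this decomposition can be illustrated with random stand-in matrices. The sketch below does not fit an LDA model. It assumes toy sizes, draws each row from a Dirichlet distribution so that it sums to 1, and stores the word-to-topic factor with topics as rows. The product then approximates per-document word proportions rather than raw counts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 4, 6, 2   # toy sizes, chosen only for illustration

# Each row is a probability distribution, standing in for LDA's two factors.
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)    # shape (4, 2)
topic_word = rng.dirichlet(np.ones(n_words), size=n_topics)  # shape (2, 6)

# Multiplying the factors gives one word distribution per document,
# which has the same shape as the bag-of-words matrix.
reconstruction = doc_topic @ topic_word
print(reconstruction.shape)  # (4, 6)
```

The number of topics is the inner dimension of the product, which is why it has to be fixed before the decomposition is computed.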

#### LDA with scikit-learn

In [None]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english', 
                       max_df=0.1, max_features=5000)


In [None]:
X = count.fit_transform(df['review'].values)

Notice that we set the maximum document frequency of words to be considered
to 10 percent ( max_df=.1 ) to exclude words that occur too frequently across
documents. The rationale behind the removal of frequently occurring words is that these might be common words appearing across all documents and are therefore less likely associated with a specific topic category of a given document. Also, we limited the number of words to be considered to the most frequently occurring 5,000 words ( max_features=5000 ), to limit the dimensionality of this dataset so that it improves the inference performed by LDA. However, both max_df=.1 and max_features=5000 are hyperparameter values that I chose arbitrarily, and readers are encouraged to tune them while comparing the results.
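
The effect of the two settings can be sketched on a toy corpus in plain Python. This is a simplified illustration rather than CountVectorizer's code: the helper name `select_vocabulary`, the toy documents, and the alphabetical tie-breaking are assumptions made for the example.

```python
from collections import Counter

def select_vocabulary(tokenized_docs, max_df=0.1, max_features=5000):
    """Keep the most frequent terms whose document frequency is at most max_df."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()    # number of documents containing each term
    term_freq = Counter()   # total occurrences of each term
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
        term_freq.update(tokens)
    allowed = [t for t in term_freq if doc_freq[t] / n_docs <= max_df]
    # rank the surviving terms by total frequency and keep the top max_features
    allowed.sort(key=lambda t: (-term_freq[t], t))
    return allowed[:max_features]

toy_docs = [['the', 'plot', 'was', 'dull'],
            ['the', 'cast', 'was', 'great'],
            ['the', 'plot', 'twist', 'twist']]
vocab = select_vocabulary(toy_docs, max_df=0.7, max_features=2)
print(vocab)  # ['plot', 'twist']
```

Here "the" occurs in every document and is dropped by the document-frequency cut, while the cap on the vocabulary size keeps only the two most frequent remaining terms.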

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
# n_components is the number of topics (the older n_topics argument was removed)
lda = LatentDirichletAllocation(n_components=10, random_state=123,
                                learning_method='batch')

In [None]:
X_topics = lda.fit_transform(X)

By setting learning_method='batch' , we let the lda estimator do its estimation based on all available training data (the bag-of-words matrix) in one iteration, which is slower than the alternative 'online' learning method but can lead to more accurate results (setting learning_method='online' is analogous to online or mini-batch learning that we discussed in Chapter 2, Training Simple Machine Learning Algorithms for Classification, and in this chapter).

>The scikit-learn library's implementation of LDA uses the Expectation-Maximization (**EM**) algorithm to update its parameter estimates
iteratively. We haven't discussed the EM algorithm in this chapter, but
if you are curious to learn more, please see the excellent overview on
Wikipedia (https://en.wikipedia.org/wiki/Expectation–maximization_algorithm) and the detailed tutorial on how it is used
in LDA in Colorado Reed's tutorial, Latent Dirichlet Allocation: Towards a
Deeper Understanding, which is freely available at http://obphio.us/pdfs/lda_tutorial.pdf.

After fitting the LDA, we now have access to the components_ attribute of the lda instance, which stores a matrix containing the word importance (here, for 5,000 words) for each of the 10 topics in increasing order:

In [None]:
lda.components_.shape

To analyze the results, let's print the five most important words for each of the 10 topics. Note that the word importance values are ranked in increasing order. Thus, to print the top five words, we need to sort the topic array in reverse order:

In [None]:
n_top_words = 5
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {topic_idx+1}')
    # [:-n_top_words - 1:-1] takes the last n_top_words indices in reverse (largest first)
    print(' '.join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
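
The reverse slice used to pick the top words is easy to get wrong, so here is a small standalone check on a toy array. The weights and the five-word vocabulary are invented for the illustration and have nothing to do with the fitted model.

```python
import numpy as np

weights = np.array([0.2, 3.1, 0.7, 5.4, 1.9])               # toy word weights for one topic
words = np.array(['plot', 'alien', 'song', 'war', 'cop'])   # hypothetical vocabulary

k = 3
# argsort() orders indices from smallest to largest weight; the slice
# [:-k - 1:-1] walks backwards from the end and stops after k items.
top_idx = weights.argsort()[:-k - 1:-1]
print(top_idx)         # [3 1 4]
print(words[top_idx])  # ['war' 'alien' 'cop']
```

Dropping the leading minus sign, as in `[:k - 1:-1]`, would instead return every index except the k smallest.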

### **Summary**  
In this chapter, we learned how to use machine learning algorithms to classify text documents based on their polarity, which is a basic task in sentiment analysis in the field of NLP. Not only did we learn how to encode a document as a feature vector using the bag-of-words model, but we also learned how to weight the term frequency by relevance using tf-idf.  
  
Working with text data can be computationally quite expensive due to the large
feature vectors that are created during this process; in the last section, we learned how to utilize out-of-core or incremental learning to train a machine learning algorithm without loading the whole dataset into a computer's memory.  
  
Lastly, we introduced the concept of topic modeling using LDA to categorize the movie reviews into different categories in an unsupervised fashion.

## Chapter 9 Embedding a Machine Learning Model into a Web Application

In this chapter, we will learn how to embed a machine learning model into a web application that can not only classify, but also learn from data in real time. The topics that we will cover are as follows:  
(1) Saving the current state of a trained machine learning model  
(2) Using an SQLite database for data storage  
(3) Developing a web application using the popular Flask web framework  
(4) Deploying a machine learning application to a public web server