# Creating a Sentiment Classifier Model in spaCy

In this notebook, we illustrate how to build our own text classifier model in spaCy. spaCy supports training several kinds of components, described at https://spacy.io/usage/training. These include named entity recognition, dependency parsing and text categorization (which covers sentiment). Earlier on, we used the built-in sm and md spaCy models.

Since this is a sentiment course, we will train sentiment classes. A key reason to build our own model is that sentiment classification is often domain-specific. In this case, our training data comes from the airline sentiment dataset on Kaggle, available at https://www.kaggle.com/welkin10/airline-sentiment-analysis.

The model can be improved further, for example by removing stopwords or adding n-grams. This can be done as part of your project assignment.
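As a starting point for those improvements, here is a minimal sketch of stopword removal and n-gram extraction in plain Python. The `STOPWORDS` set, the helper names and the sample sentence are all made up for illustration; they are not part of the notebook or of spaCy.

```python
# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "is", "was", "to", "and", "of", "my"}

def remove_stopwords(text):
    """Lowercase the text, split on whitespace and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

def ngrams(tokens, n=2):
    """Return all n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stopwords("The flight was delayed and my bag is lost")
print(tokens)             # unigrams left after stopword removal
print(ngrams(tokens, 2))  # bigrams built from those tokens
```

In practice you would probably start from a fuller list such as spaCy's `spacy.lang.en.stop_words.STOP_WORDS`. Be careful with negations like "not", which carry sentiment and are often included in stopword lists.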

In [1]:
from __future__ import unicode_literals, print_function

import copy

import spacy
import re
from spacy.util import minibatch, compounding
import pandas as pd

Reading in the data.

In [2]:
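df = pd.read_csv('data/airline_sentiment.csv', encoding="utf-8")
df = df[["text", "airline_sentiment"]]
# drop rows with a missing or empty label (read_csv loads empty fields as NaN)
df = df.dropna(subset=["airline_sentiment"])
df = df[df['airline_sentiment'] != '']
print(len(df))

def clean_string(mystring):
    # keep only ASCII letters, digits and spaces; everything else is removed
    return re.sub(r'[^A-Za-z0-9 ]+', '', mystring)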

14640
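To see what the cleaning step does to a tweet, here is a small self-contained sketch. The sample text is made up, and the extra whitespace normalization at the end is an optional addition, not something the notebook does.

```python
import re

def clean_string(mystring):
    # keep only ASCII letters, digits and spaces
    return re.sub(r'[^A-Za-z0-9 ]+', '', mystring)

sample = "@VirginAmerica it's great!!! :)"  # made-up tweet for illustration
cleaned = clean_string(sample)
print(repr(cleaned))

# Optional extra step (an assumption, not in the notebook):
# collapse repeated spaces and trim the ends.
normalized = " ".join(cleaned.split())
print(repr(normalized))
```

Note that this cleaning also strips apostrophes, exclamation marks and emoticons, which may themselves carry sentiment, so it is worth checking whether it helps or hurts on your data.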


In spaCy, to create a new model, first call spacy.blank.

Refer to https://spacy.io/api/textcategorizer for more details on the text categorizer. More details on the training pipeline can be found at https://spacy.io/api.

Below, we create the dataset and train the model to do text classification only, into the three categories positive, negative and neutral. spaCy trains the text categorizer as a neural network; here we use the simple_cnn architecture.

In [3]:
my_nlp = spacy.blank("en")  # create blank Language class

textcat = my_nlp.create_pipe("textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"})
my_nlp.add_pipe(textcat, last=True)

textcat.add_label("positive")
textcat.add_label("negative")
textcat.add_label("neutral")

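train_split = 0.1  # not used below; the slice at the end keeps the first 14,000 rows

# template with every sentiment label set to 0
sentiment_values = df['airline_sentiment'].unique()
labels_default = dict((v, 0) for v in sentiment_values)

train_data = []
for i, row in df.iterrows():

    # copy the template and set this row's label to 1
    label_values = copy.deepcopy(labels_default)
    label_values[row['airline_sentiment']] = 1

    train_data.append((clean_string(row['text']), {"cats": label_values}))

# keep the first 14,000 examples for training
train_data = train_data[:14000]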
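The cell above defines `train_split` but then simply keeps the first 14,000 rows. One possible alternative is to shuffle the examples and hold out a fraction for evaluation. The sketch below uses a hypothetical helper, `split_train_eval`, and toy examples in the same `(text, {"cats": {...}})` format as `train_data`.

```python
import random

def split_train_eval(examples, eval_fraction=0.1, seed=0):
    """Shuffle a copy of the examples and split off an evaluation set."""
    examples = list(examples)              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(examples)  # seeded for reproducibility
    n_eval = int(len(examples) * eval_fraction)
    return examples[n_eval:], examples[:n_eval]

# Toy data in the same shape as train_data (contents are made up).
toy = [("text %d" % i, {"cats": {"positive": 1, "negative": 0, "neutral": 0}})
       for i in range(20)]

train_part, eval_part = split_train_eval(toy, eval_fraction=0.1)
print(len(train_part), len(eval_part))
```

Shuffling matters if the CSV is ordered, for example by airline or by date, since a plain slice would then train and evaluate on different slices of the domain.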

The spaCy model is trained here.

In [4]:
other_pipes = [pipe for pipe in my_nlp.pipe_names if pipe != "textcat"]
n_iter= 2
init_tok2vec=None 


with my_nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = my_nlp.begin_training()
    if init_tok2vec is not None:
        with init_tok2vec.open("rb") as file_:
            textcat.model.tok2vec.from_bytes(file_.read())
    print("Training the model...")
    print('{:^5}\t'.format('LOSS'))
    batch_sizes = compounding(16.0, 32.0, 2.0)  # batch size starts at 16 and doubles, capped at 32
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=batch_sizes)
        
        for batch in batches:
            texts, annotations = zip(*batch)
            my_nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            print('{0:.3f}'.format(losses['textcat']))  # running total of the loss for this iteration


Training the model...
LOSS 	
0.948
1.951
2.968
4.090
4.989
6.132
7.129
8.127
9.121
10.253
11.169
12.031
12.913
13.703
14.644
15.590
16.631
17.468
18.402
19.293
20.226
21.109
21.998
22.838
23.788
24.557
25.294
26.004
26.907
27.686
28.426
29.140
29.850
30.455
31.216
32.035
32.699
33.479
34.224
35.040
35.594
36.240
36.981
37.546
38.042
38.712
39.341
39.902
40.416
41.081
41.696
42.365
43.006
43.442
44.000
44.484
44.897
45.299
46.036
46.613
47.054
47.440
47.851
48.533
48.777
49.194
49.700
50.253
50.535
51.010
51.570
52.080
52.577
53.096
53.569
54.207
54.725
55.153
55.535
55.913
56.484
56.951
57.502
58.063
58.570
59.011
59.484
60.024
60.607
61.201
61.743
62.361
62.830
63.269
63.883
64.356
64.829
65.323
65.903
66.443
66.949
67.336
67.716
68.329
69.070
69.554
70.123
70.745
71.203
71.693
72.211
72.940
73.334
74.043
74.717
75.350
75.824
76.222
76.653
77.104
77.588
78.131
78.628
78.986
79.541
79.921
80.520
80.949
81.186
81.791
82.396
82.863
83.202
83.842
84.462
84.746
85.429
85.990
86.403
87.029
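Two details of the training cell are easier to see in isolation. First, `compounding(16.0, 32.0, 2.0)` produces a batch-size schedule. Second, the printed loss keeps growing because the `losses` dictionary accumulates across batches within an iteration. The sketch below mimics both in plain Python. `compounding_sketch` is a simplified re-implementation written for this note (it assumes start <= stop), not spaCy's own code, and `per_batch_losses` is a hypothetical helper.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, ... with each value capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

# First five batch sizes for the schedule used in the training cell.
print(list(islice(compounding_sketch(16.0, 32.0, 2.0), 5)))

def per_batch_losses(running_totals):
    """Recover per-batch losses from a running total by differencing."""
    previous = 0.0
    result = []
    for total in running_totals:
        result.append(round(total - previous, 3))
        previous = total
    return result

# First four values taken from the output above.
print(per_batch_losses([0.948, 1.951, 2.968, 4.090]))
```

Differencing the first few printed values gives a per-batch loss of roughly 1.0, and the increments shrink later in the run, which is what you would hope to see while the model is learning.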


In [5]:
# test the trained model
test_text = ["This movie sucked", "The food is yummy.", "I don't know what to say."]

for i in test_text:
    doc = my_nlp(i)
    print(i, sorted(doc.cats.items(), key=lambda val: val[1], reverse=True))

This movie sucked [('negative', 0.9971849322319031), ('neutral', 0.00280904327519238), ('positive', 6.0481247601273935e-06)]
The food is yummy. [('neutral', 0.541243314743042), ('positive', 0.44526243209838867), ('negative', 0.013494227081537247)]
I don't know what to say. [('neutral', 0.8193414807319641), ('negative', 0.18003545701503754), ('positive', 0.0006230857688933611)]
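The sorted lists above can be reduced to a single predicted label plus a rough confidence signal. The sketch below uses hypothetical helpers, `top_label` and `margin`, applied to the scores printed for "The food is yummy."

```python
def top_label(cats):
    """Return the (label, score) pair with the highest score."""
    return max(cats.items(), key=lambda item: item[1])

def margin(cats):
    """Difference between the two highest scores; small values mean a close call."""
    ranked = sorted(cats.values(), reverse=True)
    return ranked[0] - ranked[1]

# Scores copied from the output above for "The food is yummy."
cats = {'neutral': 0.541243314743042,
        'positive': 0.44526243209838867,
        'negative': 0.013494227081537247}

label, score = top_label(cats)
print(label, round(score, 3), "margin:", round(margin(cats), 3))
```

Here the top label is neutral, with positive less than 0.1 behind. That may reflect the domain-specific training data: a model trained on airline tweets has probably seen few examples about food.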


Here, the spaCy model is saved and reloaded to test it.

In [6]:
output_dir = "models\\my_nlp_sm"
my_nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text[0])  # use the reloaded model, not the in-memory one
print(test_text[0], doc2.cats)

Saved model to models\my_nlp_sm
Loading from models\my_nlp_sm
This movie sucked {'positive': 6.0481247601273935e-06, 'negative': 0.9971849322319031, 'neutral': 0.00280904327519238}
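A single test sentence says little about model quality. One way to go further is to measure accuracy on labelled examples that were not used for training. The sketch below is written against a generic `predict` callable so it runs without a trained model. The `accuracy` helper, the stand-in `fake_predict` function and the toy examples are all hypothetical.

```python
def accuracy(predict, examples):
    """Fraction of examples whose top predicted label matches the gold label.

    predict:  callable mapping a text to a dict of label -> score
    examples: list of (text, {"cats": {label: 0 or 1}}) pairs
    """
    if not examples:
        return 0.0
    correct = 0
    for text, annotations in examples:
        gold = max(annotations["cats"], key=annotations["cats"].get)
        scores = predict(text)
        guess = max(scores, key=scores.get)
        correct += int(guess == gold)
    return correct / len(examples)

def fake_predict(text):
    # Stand-in for a real model: "late" means negative, anything else positive.
    if "late" in text:
        return {"positive": 0.1, "negative": 0.8, "neutral": 0.1}
    return {"positive": 0.7, "negative": 0.1, "neutral": 0.2}

held_out = [
    ("flight was late", {"cats": {"positive": 0, "negative": 1, "neutral": 0}}),
    ("great crew", {"cats": {"positive": 1, "negative": 0, "neutral": 0}}),
    ("what time is boarding", {"cats": {"positive": 0, "negative": 0, "neutral": 1}}),
]
print(accuracy(fake_predict, held_out))
```

With the reloaded model, the predictor could be `lambda text: nlp2(text).cats`. The rows beyond the first 14,000 would be a natural held-out set, provided they are kept aside before `train_data` is sliced.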
