# Creating a Sentiment Classifier Model in spaCy

In this notebook, we illustrate how to build our own text classifier model in spaCy. spaCy supports training several kinds of components, described at https://spacy.io/usage/training, including named entity recognition, dependency parsing and text categorization (which we use for sentiment). Earlier on, we used the built-in sm and md spaCy models.

Since this is a sentiment course, we will train sentiment classes. A key reason to build our own model is that sentiment classification is often domain specific. In this case, our training data comes from an airline sentiment dataset available on Kaggle at https://www.kaggle.com/welkin10/airline-sentiment-analysis.

The model can be improved further, for example by removing stopwords or adding n-grams. This can be done as part of your project assignment.
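
As a starting point for those improvements, here is a minimal sketch of stopword removal and n-gram extraction in plain Python. The stopword set and the sample sentence are hypothetical and deliberately tiny; in practice a fuller list such as spaCy's `spacy.lang.en.stop_words.STOP_WORDS` would likely be a better choice.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it", "was"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def ngrams(tokens, n=2):
    """Return all n-grams over tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sentence, split on whitespace for simplicity.
tokens = "the flight was delayed and the crew was rude".split()
content = remove_stopwords(tokens)
print(content)             # ['flight', 'delayed', 'crew', 'rude']
print(ngrams(content, 2))  # ['flight delayed', 'delayed crew', 'crew rude']
```

Whether these steps help depends on the data: removing stopwords also removes negations if words like "not" are in the list, which can hurt sentiment classification.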

In [1]:
from __future__ import unicode_literals, print_function

import copy

import spacy
import re
from spacy.util import minibatch, compounding
import pandas as pd

Reading in the data and defining a helper to clean the text.

In [2]:
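df = pd.read_csv('data/airline_sentiment.csv', encoding="utf-8")
df = df[["text", "airline_sentiment"]]
# drop rows with an empty sentiment label
df = df[df['airline_sentiment'] != '']
print(len(df))

def clean_string(mystring):
    # keep only letters, digits and spaces (raw string avoids an invalid escape)
    return re.sub(r'[^A-Za-z0-9 ]+', '', mystring)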

14640
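
To see what the cleaning step does to a tweet, the sketch below applies the same character class as clean_string to a hypothetical sample (the tweet text is made up, not taken from the dataset).

```python
import re

def clean_string(mystring):
    # Same character class as the notebook: keep letters, digits and spaces.
    return re.sub(r"[^A-Za-z0-9 ]+", "", mystring)

sample = "@airline it's been 3 hrs!! #delayed"  # hypothetical tweet
cleaned = clean_string(sample)
print(cleaned)  # airline its been 3 hrs delayed
```

Note that apostrophes are removed as well, so "it's" becomes "its" and "isn't" becomes "isnt". Mentions and hashtags lose their leading symbol but keep their text.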


In spaCy, a new model is created by first calling spacy.blank.

Refer to https://spacy.io/api/textcategorizer for more details. More details on the training pipeline can be found at https://spacy.io/api.

Below, we create the dataset and train only a text classifier with three categories: positive, negative and neutral. The simple_cnn architecture used here is a convolutional neural network, so spaCy is using deep learning for its training.
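
Each training example in this notebook is a (text, annotations) tuple whose "cats" entry maps every label to 1 or 0. The sketch below shows one way to build that structure. The helper name make_cats and the sample text are hypothetical and not part of spaCy.

```python
LABELS = ["positive", "negative", "neutral"]

def make_cats(label, labels=LABELS):
    """Build a one-hot dict: 1 for the gold label, 0 for every other label."""
    if label not in labels:
        raise ValueError("unknown label: {!r}".format(label))
    return {name: int(name == label) for name in labels}

# Hypothetical example in the (text, {"cats": ...}) shape used for training.
example = ("late flight again", {"cats": make_cats("negative")})
print(example)
# ('late flight again', {'cats': {'positive': 0, 'negative': 1, 'neutral': 0}})
```

Raising on an unknown label is a design choice in this sketch: it surfaces typos or unexpected values in the label column early instead of silently producing an all-zero dict.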

In [3]:
my_nlp = spacy.blank("en")  # create blank Language class

textcat = my_nlp.create_pipe("textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"})
my_nlp.add_pipe(textcat, last=True)

textcat.add_label("positive")
textcat.add_label("negative")
textcat.add_label("neutral")

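train_split = 0.1  # not used below; the training set is taken by slicing

# template dict with every sentiment label set to 0
sentiment_values = df['airline_sentiment'].unique()
labels_default = dict((v, 0) for v in sentiment_values)

train_data = []
for i, row in df.iterrows():

    # copy the template and mark the row's own label with 1
    label_values = copy.deepcopy(labels_default)
    label_values[row['airline_sentiment']] = 1

    train_data.append((clean_string(row['text']), {"cats": label_values}))

# keep the first 14000 examples for training
train_data = train_data[:14000]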

The spaCy model is trained here.
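
Two details of the training cell are easy to misread: the batch-size schedule produced by compounding(16.0, 32.0, 2.0), and the fact that the printed loss is a running total. The sketch below re-implements the schedule in plain Python under the assumption that it starts at the first value, multiplies by the third each step, and is capped at the second. The function name compounding_sketch and the per-batch loss values are hypothetical.

```python
from itertools import islice

def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, ... capped at stop (assumes start <= stop)."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound

# First five batch sizes for the settings used in this notebook.
sizes = list(islice(compounding_sketch(16.0, 32.0, 2.0), 5))
print(sizes)  # [16.0, 32.0, 32.0, 32.0, 32.0]

# The losses dict accumulates across batches, so printed values only grow.
losses = {"textcat": 0.0}
running = []
for batch_loss in [0.5, 0.25, 0.25]:  # hypothetical per-batch losses
    losses["textcat"] += batch_loss
    running.append(losses["textcat"])
print(running)  # [0.5, 0.75, 1.0]
```

To judge whether training is improving, compare the difference between consecutive printed values, or the total at the end of each iteration, rather than the raw running total.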

In [12]:
other_pipes = [pipe for pipe in my_nlp.pipe_names if pipe != "textcat"]
n_iter = 2
init_tok2vec = None  # optional path to pretrained tok2vec weights


with my_nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = my_nlp.begin_training()
    if init_tok2vec is not None:
        with init_tok2vec.open("rb") as file_:
            textcat.model.tok2vec.from_bytes(file_.read())
    print("Training the model...")
    print('{:^5}\t'.format('LOSS'))
    # batch size starts at 16 and doubles each step, capped at 32
    batch_sizes = compounding(16.0, 32.0, 2.0)
    for i in range(n_iter):
        losses = {}  # reset once per iteration
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=batch_sizes)

        for batch in batches:
            texts, annotations = zip(*batch)
            my_nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            # losses accumulates across batches, so this prints a running total
            print('{0:.3f}'.format(losses['textcat']))


Training the model...
LOSS 	
0.562
0.927
1.454
1.773
2.030
2.279
2.619
2.959
3.262
3.652
3.980
4.334
4.629
5.076
5.460
5.838
6.176
6.599
6.887
7.232
7.473
7.869
8.173
8.621
8.964
9.353
9.536
9.877
10.199
10.456
10.767
10.954
11.220
11.445
11.610
11.917
12.155
12.377
12.549
12.826
13.060
13.423
13.758
13.904
14.025
14.289
14.446
14.587
14.853
15.172
15.477
15.842
16.195
16.419
16.662
16.882
17.032
17.234
17.698
17.961
18.218
18.453
18.587
19.084
19.218
19.363
19.686
19.980
20.167
20.454
20.827
21.058
21.320
21.646
21.877
22.258
22.510
22.777
23.092
23.242
23.640
23.867
24.233
24.491
24.748
24.999
25.271
25.572
25.906
26.106
26.459
26.887
27.178
27.422
27.742
27.971
28.237
28.521
28.812
29.119
29.396
29.593
29.858
30.179
30.564
30.874
31.190
31.497
31.735
32.010
32.256
32.700
32.886
33.441
33.874
34.306
34.595
34.716
35.050
35.309
35.629
36.003
36.256
36.510
36.979
37.180
37.583
37.911
38.045
38.346
38.623
38.911
39.176
39.571
40.043
40.189
40.442
40.740
41.004
41.369
41.820
42.209
42.54

In [18]:
# test the trained model
test_text = ["This movie isn't beautiful", "The food is yummy.", "I don't know what to say."]

for i in test_text:
    doc = my_nlp(i)
    print(i, sorted(doc.cats.items(), key=lambda val: val[1], reverse=True))

This movie isn't beautiful [('positive', 0.5213156342506409), ('neutral', 0.39566105604171753), ('negative', 0.06633545458316803)]
The food is yummy. [('positive', 0.3966348469257355), ('negative', 0.3453865945339203), ('neutral', 0.23590366542339325)]
I don't know what to say. [('neutral', 0.736530065536499), ('negative', 0.1698307991027832), ('positive', 0.044965170323848724)]
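
To turn the scores in doc.cats into a single prediction, one simple option is to take the highest-scoring label. The sketch below uses the scores from the first test sentence above. The helper name top_label is hypothetical.

```python
# Scores copied from the first test output above.
cats = {
    "positive": 0.5213156342506409,
    "neutral": 0.39566105604171753,
    "negative": 0.06633545458316803,
}

def top_label(cats):
    """Return (label, score) for the highest-scoring category."""
    label = max(cats, key=cats.get)
    return label, cats[label]

print(top_label(cats))  # ('positive', 0.5213156342506409)
```

Here the model labels "This movie isn't beautiful" as positive, which is a reminder that a model trained on airline tweets may not transfer well to other domains or handle negation reliably.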


Here, the spaCy model is saved to disk and reloaded to test it.

In [29]:
output_dir = "models/my_nlp_sm"
my_nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text[0])  # use the reloaded model, not the in-memory one
print(test_text[0], doc2.cats)

TypeError: __init__() got an unexpected keyword argument 'encoding'

