# Creating a Sentiment Classifier Model in spaCy

In this notebook, we illustrate how to build our own text classifier model in spaCy. spaCy supports several kinds of training, described at https://spacy.io/usage/training. Components that can be trained include the named entity recognizer, the dependency parser and the text categorizer (used here for sentiment). Earlier we used the sm and md models from spaCy's built-in models.

Since this is a sentiment course, we will train sentiment classes. A key reason to build our own model is that sentiment classification is often domain-specific. Here, the training data is the airline sentiment dataset available on Kaggle at https://www.kaggle.com/welkin10/airline-sentiment-analysis.

The model can be improved further, for example by removing stopwords or adding n-gram features. This can be done as part of your project assignment.
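Stopword removal and n-grams, mentioned above as possible improvements, can be sketched in plain Python as follows. This is only an illustration: `SAMPLE_STOPWORDS`, `remove_stopwords` and `ngrams` are hypothetical names, and the stopword list is a tiny hand-picked sample rather than spaCy's built-in list. Negations such as "not" are deliberately left out of the list, since they usually carry sentiment.

```python
# Hypothetical helpers for illustration only.
# SAMPLE_STOPWORDS is a tiny hand-picked sample, not spaCy's stopword list.
SAMPLE_STOPWORDS = {"the", "is", "a", "an", "to", "and", "of"}


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase the text, split on whitespace and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


def ngrams(tokens, n=2):
    """Return all n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = remove_stopwords("The flight is late and the staff is rude")
print(tokens)             # ['flight', 'late', 'staff', 'rude']
print(ngrams(tokens, 2))  # [('flight', 'late'), ('late', 'staff'), ('staff', 'rude')]
```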

In [5]:
from __future__ import unicode_literals, print_function

import copy

import spacy
import re
from spacy.util import minibatch, compounding
import pandas as pd

Reading in data. 

In [7]:
df = pd.read_csv('data/airline_sentiment.csv', encoding="utf-8")
df = df[["text", "airline_sentiment"]]
# drop rows whose sentiment label is an empty string
df = df[df['airline_sentiment'] != '']
print(df.dtypes)
print(len(df))

def clean_string(mystring):
    # keep only letters, digits and spaces
    return re.sub(r'[^A-Za-z0-9 ]+', '', mystring)

text                 object
airline_sentiment    object
dtype: object
14640


In spaCy, a new model is created by first calling spacy.blank.

Refer to https://spacy.io/api/textcategorizer for more details on the text categorizer. The rest of the training pipeline is documented at https://spacy.io/api.

Below we create the data set and train only a text classifier over the 3 categories: positive, negative and neutral. The text categorizer is a neural network model; here we select its simple_cnn architecture.

In [9]:
my_nlp = spacy.blank("en")  # create blank Language class

# spaCy v2-style API: build the text categorizer, then add it to the pipeline
textcat = my_nlp.create_pipe("textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"})
my_nlp.add_pipe(textcat, last=True)

textcat.add_label("positive")
textcat.add_label("negative")
textcat.add_label("neutral")

train_split = 0.1

sentiment_values = df['airline_sentiment'].unique()
labels_default = dict((v, 0) for v in sentiment_values)

train_data = []
for i, row in df.iterrows():

    label_values = copy.deepcopy(labels_default)
    label_values[row['airline_sentiment']] = 1

    train_data.append((clean_string(row['text']), {"cats": label_values}))
    
# keep the first 14,000 examples for training; the rest are left out
train_data = train_data[:14000]

14640
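The cell above keeps the first 14,000 rows for training in their original order. A shuffled split into training and evaluation parts is a common alternative, sketched below. `split_train_eval` is a hypothetical helper, not part of spaCy, and `toy_data` is a made-up stand-in for the `(text, {"cats": ...})` tuples built above.

```python
import random


def split_train_eval(data, eval_fraction=0.1, seed=0):
    """Shuffle a copy of `data` and hold out `eval_fraction` of it for evaluation."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Made-up stand-in for the (text, {"cats": ...}) tuples used for training.
toy_data = [
    ("text %d" % i, {"cats": {"positive": 1, "negative": 0, "neutral": 0}})
    for i in range(20)
]
train_part, eval_part = split_train_eval(toy_data, eval_fraction=0.1)
print(len(train_part), len(eval_part))  # 18 2
```

With the real data, the same call could be applied to `train_data` before slicing, so that the held-out examples are never seen during training.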


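The spaCy model is trained here for n_iter passes over the training data. The value printed after each batch is the running textcat loss accumulated within the current pass.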

In [4]:
other_pipes = [pipe for pipe in my_nlp.pipe_names if pipe != "textcat"]
n_iter = 2
init_tok2vec = None 


with my_nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = my_nlp.begin_training()
    if init_tok2vec is not None:
        with init_tok2vec.open("rb") as file_:
            textcat.model.tok2vec.from_bytes(file_.read())
    print("Training the model...")
    print('{:^5}\t'.format('LOSS'))
    batch_sizes = compounding(16.0, 32.0, 2.0)  # batch size starts at 16 and doubles up to a cap of 32
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=batch_sizes)
        
        for batch in batches:
            texts, annotations = zip(*batch)
            my_nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
            print('{0:.3f}'.format(losses['textcat']))  # running total of the loss in this pass


Training the model...
LOSS 	
0.003
0.003
0.004
0.005
0.005
0.006
0.006
0.007
0.008
0.008
0.009
0.010
0.010
0.011
0.012
0.012
0.013
0.013
0.014
0.015
0.015
0.015
0.016
0.016
0.017
0.017
0.018
0.018
0.019
0.019
0.019
0.020
0.020
0.021
0.021
0.021
0.022
0.022
0.022
0.023
0.023
0.023
0.024
0.024
0.025
0.025
0.025
0.025
0.026
0.026
0.026
0.027
0.027
0.028
0.028
0.028
0.029
0.029
0.030
0.030
0.030
0.030
0.031
0.031
0.031
0.032
0.032
0.032
0.033
0.033
0.033
0.034
0.034
0.035
0.035
0.036
0.036
0.036
0.037
0.037
0.037
0.038
0.038
0.039
0.039
0.039
0.040
0.040
0.041
0.041
0.041
0.042
0.042
0.042
0.043
0.043
0.044
0.044
0.044
0.045
0.045
0.045
0.046
0.046
0.046
0.047
0.047
0.048
0.048
0.048
0.049
0.049
0.049
0.050
0.051
0.051
0.051
0.052
0.052
0.052
0.053
0.053
0.054
0.054
0.054
0.055
0.055
0.056
0.056
0.056
0.057
0.057
0.057
0.058
0.058
0.059
0.059
0.060
0.060
0.061
0.061
0.062
0.062
0.063
0.063
0.064
0.064
0.065
0.065
0.066
0.066
0.066
0.067
0.068
0.068
0.069
0.069
0.070
0.070
0.070
0.071
0.071
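The batch-size schedule passed to minibatch in the training loop can be hard to picture. The sketch below is a plain-Python re-implementation written for illustration (`compounding_sketch` is a hypothetical name, and it assumes start is not larger than stop); it is not spaCy's own code, but it should produce the same sequence as `compounding(16.0, 32.0, 2.0)`.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start * compound, ... with every value capped at stop.

    Illustrative re-implementation; assumes start <= stop.
    """
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound


# First four batch sizes for the schedule used in the training loop.
first_sizes = list(islice(compounding_sketch(16.0, 32.0, 2.0), 4))
print(first_sizes)  # [16.0, 32.0, 32.0, 32.0]
```

So only the first batch has 16 examples; every later batch has 32.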

In [7]:
# test the trained model
test_text = ["Education price is attractive", "interesting. now is the wait for 10th Gen for 16", "Yes correct more worth it to get that.", "sweet! wait for thermal tests before deciding to get or not", "Yes. It is ridiculously cheaper than before."]

for i in test_text:
    doc = my_nlp(i)
    print(i, sorted(doc.cats.items(), key=lambda val: val[1], reverse=True), "\n")

Education price is attractive [('neutral', 0.5909589529037476), ('negative', 0.36612898111343384), ('positive', 0.04291209205985069)] 

interesting. now is the wait for 10th Gen for 16 [('negative', 0.8104421496391296), ('positive', 0.14652833342552185), ('neutral', 0.04302957281470299)] 

Yes correct more worth it to get that. [('negative', 0.8470313549041748), ('neutral', 0.12522996962070465), ('positive', 0.027738623321056366)] 

sweet! wait for thermal tests before deciding to get or not [('negative', 0.9695537090301514), ('positive', 0.0164778009057045), ('neutral', 0.01396855153143406)] 

Yes. It is ridiculously cheaper than before. [('negative', 0.9859248995780945), ('neutral', 0.009494852274656296), ('positive', 0.004580210894346237)] 
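Inspecting a few sentences by eye gives only a rough impression of the model. A minimal sketch of scoring predictions against labelled examples is shown below. `top_label` and `accuracy` are hypothetical helpers, and the scores and gold labels are made up for illustration; in practice the score dictionaries would come from `doc.cats` on held-out texts.

```python
def top_label(cats):
    """Return the label with the highest score in a doc.cats-style dict."""
    return max(cats, key=cats.get)


def accuracy(predicted_cats, gold_labels):
    """Fraction of examples whose top-scoring label matches the gold label."""
    correct = sum(top_label(c) == g for c, g in zip(predicted_cats, gold_labels))
    return correct / float(len(gold_labels))


# Made-up scores and gold labels, for illustration only.
predicted = [
    {"positive": 0.7, "negative": 0.2, "neutral": 0.1},
    {"positive": 0.1, "negative": 0.8, "neutral": 0.1},
    {"positive": 0.3, "negative": 0.5, "neutral": 0.2},
    {"positive": 0.2, "negative": 0.2, "neutral": 0.6},
]
gold = ["positive", "negative", "neutral", "neutral"]
print(accuracy(predicted, gold))  # 0.75
```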



Here, the spaCy model is saved to disk and reloaded to test it.

In [6]:
output_dir = "models/my_nlp_sm"
my_nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
doc2 = nlp2(test_text[0])  # use the reloaded model, not the in-memory one
print(test_text[0], doc2.cats)

Saved model to models/my_nlp_sm
Loading from models/my_nlp_sm
This movie isn't beautiful {'positive': 0.033161554485559464, 'negative': 0.9193779826164246, 'neutral': 0.047460537403821945}


