
![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/collab/Training/binary_text_classification/NLU_training_sentiment_classifier_demo.ipynb)



# Training a Sentiment Analysis Classifier with NLU 
With the [ClassifierDL model](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl-multi-class-text-classification) from Spark NLP you can achieve state-of-the-art results on multi-class text classification problems.

This notebook showcases the following features:

- How to train the deep learning classifier
- How to store a pipeline to disk
- How to load the pipeline from disk (Enables NLU offline mode)



Source - https://colab.research.google.com/drive/1f-EORjO3IpvwRAktuL4EvZPqPr2IZ_g8?usp=sharing



## 1. Install Java 8 and NLU

In [1]:
!pip install pyspark==3.0.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark==3.0.1
  Downloading pyspark-3.0.1.tar.gz (204.2 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612247 sha256=a28262ba1ac19b1502582b296a8165e5ed85b790f76dca2cc56fdc1c208ef5cd
  Stored in directory: /root/.cache/pip/wheels/4d/60/70/4b354ff632e827ce13a755d886b704e306089e6c275be8aba4
Successfully built pyspark
Installing collected packages: py4j, py

In [2]:
import os
from sklearn.metrics import classification_report
! apt-get update -qq > /dev/null   
# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]  
! pip install nlu

import nlu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlu
  Downloading nlu-4.2.0-py3-none-any.whl (639 kB)
Collecting spark-nlp>=4.2.0
  Downloading spark_nlp-4.4.0-py2.py3-none-any.whl (486 kB)
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Installing collected packages: spark-nlp, dataclasses, nlu
Successfully installed dataclasses-0.6 nlu-4.2.0 spark-nlp-4.4.0
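
Re-running the install cell prepends the Java `bin` directory to `PATH` again each time. A minimal sketch of an idempotent update follows; `prepend_once` is a hypothetical helper for illustration, not part of NLU or the standard library.

```python
import os

def prepend_once(path_value, new_dir, sep=os.pathsep):
    """Put new_dir at the front of a PATH-style string, dropping earlier copies of it."""
    entries = [e for e in path_value.split(sep) if e and e != new_dir]
    return sep.join([new_dir] + entries)

# Example values; the Java directory matches the JAVA_HOME used in the cell above.
java_bin = "/usr/lib/jvm/java-8-openjdk-amd64/bin"
path = "/usr/local/bin:/usr/bin"

once = prepend_once(path, java_bin, sep=":")
twice = prepend_once(once, java_bin, sep=":")
print(once)
print(twice == once)  # applying it a second time changes nothing
```

In the notebook this could be used as `os.environ["PATH"] = prepend_once(os.environ["PATH"], os.environ["JAVA_HOME"] + "/bin")`.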


# 2. Download Stock Market Sentiment dataset 
https://www.kaggle.com/yash612/stockmarket-sentiment-dataset

In [3]:
! wget http://ckl-it.de/wp-content/uploads/2020/11/stock_data.csv


--2023-04-18 06:56:24--  http://ckl-it.de/wp-content/uploads/2020/11/stock_data.csv
Resolving ckl-it.de (ckl-it.de)... 217.160.0.108, 2001:8d8:100f:f000::209
Connecting to ckl-it.de (ckl-it.de)|217.160.0.108|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 479973 (469K) [text/csv]
Saving to: ‘stock_data.csv’


2023-04-18 06:56:25 (606 KB/s) - ‘stock_data.csv’ saved [479973/479973]



In [4]:
import nlu
sentiment = nlu.load('sentiment')

sentimentdl_glove_imdb download started this may take some time.
Approximate size to download 8.7 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [5]:
sentiment.predict("I'm very very not at all happy")

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_converter,sentiment,sentiment_confidence,word_embedding_glove
0,I'm very very not at all happy,"[-0.2865465581417084, 0.25398728251457214, 0.2...",pos,0.999995,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."


In [6]:
import pandas as pd
train_path = '/content/stock_data.csv'

train_df = pd.read_csv(train_path)
# the text data to use for classification should be in a column named 'text'
# the label column must be named 'y' and be of type str
train_df.columns=['text','y']
train_df.y = train_df.y.astype(str)
train_df.y = train_df.y.str.replace('-1','negative')
train_df.y = train_df.y.str.replace('1','positive')
train_df

Unnamed: 0,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,positive
4,OI Over 21.37,positive
...,...,...
5786,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
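
The two chained `str.replace` calls above work only because `'-1'` is replaced before `'1'`. A minimal sketch of an explicit, order-independent mapping follows; `LABEL_NAMES` and `to_label_name` are hypothetical names introduced here for illustration.

```python
# Hypothetical helper, not part of NLU or pandas.
LABEL_NAMES = {"-1": "negative", "1": "positive"}

def to_label_name(raw):
    """Map a raw dataset label (-1 or 1, as int or str) to a readable class name."""
    key = str(raw).strip()
    if key not in LABEL_NAMES:
        # Fail loudly instead of silently passing an unknown label to training.
        raise ValueError(f"unexpected label: {raw!r}")
    return LABEL_NAMES[key]

raw_labels = [1, -1, "1", "-1"]
named = [to_label_name(v) for v in raw_labels]
print(named)
```

With pandas, the same function can be applied to the label column via `train_df.y.map(to_label_name)`.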


# 3. Train Deep Learning Classifier using nlu.load('train.sentiment')

Your dataset's label column should be named 'y' and the feature column with the text data should be named 'text'.

In [7]:
import nlu 
# load a trainable pipeline by specifying the train. prefix and fit it on a dataset with label and text columns
# NLU picks the sentence embeddings; in this run sent_small_bert_L2_128 is downloaded (see the output below), not the Universal Sentence Encoder (USE)
trainable_pipe = nlu.load('train.sentiment')
fitted_pipe = trainable_pipe.fit(train_df)

# predict with the trainable pipeline on dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')
# the sentence detector that is part of the pipe generates some NaNs; let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      2106
    positive       0.64      1.00      0.78      3685

    accuracy                           0.64      5791
   macro avg       0.32      0.50      0.39      5791
weighted avg       0.40      0.64      0.49      5791



Unnamed: 0,document,sentence_embedding_small_bert_L2_128,sentiment,sentiment_confidence,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,"[-0.9530858993530273, 0.21358267962932587, 0.1...",positive,1.0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,"[-0.47259706258773804, 0.5354136228561401, -0....",positive,1.0,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,"[0.30400291085243225, 0.2286294549703598, -0.5...",positive,1.0,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,"[-1.7079013586044312, -0.4847279489040375, -0....",positive,1.0,MNTA Over 12.00,positive
4,OI Over 21.37,"[-2.3011534214019775, 0.26495108008384705, -0....",positive,1.0,OI Over 21.37,positive
...,...,...,...,...,...,...
5786,Industry body CII said #discoms are likely to ...,"[-0.2165522426366806, 0.6153535842895508, 0.04...",positive,1.0,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...","[-0.19915248453617096, 0.26074403524398804, 0....",positive,1.0,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,"[-0.43615204095840454, 0.9346762895584106, -0....",positive,1.0,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...","[-0.6081280708312988, 0.2732299864292145, 0.25...",positive,1.0,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
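
The report above shows recall 1.00 for `positive` and 0.00 for `negative`, which is exactly what a model that always answers `positive` would produce. The sketch below recomputes that majority-class baseline from the support counts printed in the report; it uses plain Python and no NLU call.

```python
# Support counts copied from the classification report above.
support = {"negative": 2106, "positive": 3685}

total = sum(support.values())
majority_label = max(support, key=support.get)

# Always predicting the majority class is right exactly for the majority rows.
baseline_accuracy = support[majority_label] / total

# For the majority class: precision equals the accuracy, recall is 1.0.
precision = baseline_accuracy
recall = 1.0
f1 = 2 * precision * recall / (precision + recall)

print(majority_label, round(baseline_accuracy, 2), round(f1, 2))
# positive 0.64 0.78
```

These numbers match the `positive` row of the report, so at this point the fitted classifier has not yet beaten the trivial baseline.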


# Test the fitted pipe on a new example

In [8]:
fitted_pipe.predict("Bitcoin is going to the moon!")

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]


Unnamed: 0,sentence,sentence_embedding_small_bert_L2_128,sentiment,sentiment_confidence
0,Bitcoin is going to the moon!,"[-1.0531491041183472, -0.2827453911304474, -0....",positive,1.0


## Configure pipe training parameters

In [9]:
trainable_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['bert_sentence_embeddings@sent_small_bert_L2_128'] has settable params:
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setBatchSize(8)              | Info: Size of every batch | Currently set to : 8
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setEngine('tensorflow')      | Info: Deep Learning engine used for this model | Currently set to : tensorflow
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setIsLong(False)             | Info: Use Long type instead of Int type for inputs buffer - Some Bert models require Long instead of Int. | Currently set to : False
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setMaxSentenceLength(128)    | Info: Max sentence length to process | Currently set to : 128
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setDimension(128)            | I

## Retrain with new parameters

In [19]:
print(trainable_pipe.keys())

dict_keys(['bert_sentence_embeddings@sent_small_bert_L2_128', 'document_assembler', 'sentiment_dl@sent_small_bert_L2_128'])


In [22]:
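# Train longer!
# Note: the next line only looks up the classifier component; on its own it changes no parameter.
# To actually train longer, call a setter on it, e.g. .setMaxEpochs(...) as used in the embeddings example further down.
trainable_pipe['sentiment_dl@sent_small_bert_L2_128']
fitted_pipe = trainable_pipe.fit(train_df)
# predict with the trainable pipeline on dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# the sentence detector that is part of the pipe generates some NaNs; let's drop them first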
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00      2106
    positive       0.64      1.00      0.78      3685

    accuracy                           0.64      5791
   macro avg       0.32      0.50      0.39      5791
weighted avg       0.40      0.64      0.49      5791



Unnamed: 0,document,sentence_embedding_small_bert_L2_128,sentiment,sentiment_confidence,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,"[[-0.9530858993530273, 0.21358267962932587, 0....",positive,1.0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,"[[-0.47259706258773804, 0.5354136228561401, -0...",positive,1.0,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,"[[0.30400291085243225, 0.2286294549703598, -0....",positive,1.0,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,"[[-1.7079013586044312, -0.4847279489040375, -0...",positive,1.0,MNTA Over 12.00,positive
4,OI Over 21.37,"[[-2.3011534214019775, 0.26495108008384705, -0...",positive,1.0,OI Over 21.37,positive
...,...,...,...,...,...,...
5786,Industry body CII said #discoms are likely to ...,"[[-0.2165522426366806, 0.6153535842895508, 0.0...",positive,1.0,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...","[[-0.19915248453617096, 0.26074403524398804, 0...",positive,1.0,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,"[[-0.43615204095840454, 0.9346762895584106, -0...",positive,1.0,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...","[[-0.6081280708312988, 0.2732299864292145, 0.2...",positive,1.0,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
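
Both classification reports so far are computed on the same rows the pipeline was fitted on, so they say little about unseen text. A minimal sketch of a stratified hold-out split in plain Python follows; `stratified_split` is a hypothetical helper and the rows are synthetic stand-ins for `(text, y)` pairs.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, test_fraction=0.2, seed=42):
    """Split rows into train/test so each label keeps roughly the same share in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Synthetic data: 10 negative and 20 positive rows.
rows = [("text %d" % i, "positive" if i % 3 else "negative") for i in range(30)]
train_rows, test_rows = stratified_split(rows, label_of=lambda r: r[1])
print(len(train_rows), len(test_rows))
# 24 6
```

In the notebook, the pipeline would then be fitted on the training part only and `classification_report` computed on predictions for the held-out part. scikit-learn's `train_test_split` with its `stratify` argument is a ready-made alternative.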


# Try training with different Embeddings

In [23]:
# We can use nlu.print_components(action='embed_sentence') to see every possible sentence embedding we could use. Let's use BERT!
nlu.print_components(action='embed_sentence')

For language <am> NLU provides the following Models : 
nlu.load('am.embed_sentence.xlm_roberta') returns Spark NLP model_anno_obj sent_xlm_roberta_base_finetuned_amharic
For language <de> NLU provides the following Models : 
nlu.load('de.embed_sentence.bert.base_cased') returns Spark NLP model_anno_obj sent_bert_base_cased
For language <el> NLU provides the following Models : 
nlu.load('el.embed_sentence.bert.base_uncased') returns Spark NLP model_anno_obj sent_bert_base_uncased
For language <en> NLU provides the following Models : 
nlu.load('en.embed_sentence') returns Spark NLP model_anno_obj tfhub_use
nlu.load('en.embed_sentence.albert') returns Spark NLP model_anno_obj albert_base_uncased
nlu.load('en.embed_sentence.bert') returns Spark NLP model_anno_obj sent_bert_base_uncased
nlu.load('en.embed_sentence.bert.base_uncased_legal') returns Spark NLP model_anno_obj sent_bert_base_uncased_legal
nlu.load('en.embed_sentence.bert.finetuned') returns Spark NLP model_anno_obj sbert_setfit_

In [26]:
print(trainable_pipe.keys())

dict_keys(['bert_sentence_embeddings@sent_small_bert_L2_128', 'trainable_sentiment_dl', 'document_assembler'])
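
The key of the classifier component differs between the two `keys()` outputs in this notebook (`sentiment_dl@sent_small_bert_L2_128` earlier, `trainable_sentiment_dl` here), so hardcoding it is fragile. A minimal sketch of looking the key up by a fragment follows; `find_component_key` is a hypothetical helper, not an NLU function.

```python
def find_component_key(keys, fragment):
    """Return the single key containing fragment; raise if none or several match."""
    matches = [k for k in keys if fragment in k]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one key containing {fragment!r}, got {matches}")
    return matches[0]

# Key lists copied from the two keys() outputs in this notebook.
keys_first = ["bert_sentence_embeddings@sent_small_bert_L2_128",
              "document_assembler",
              "sentiment_dl@sent_small_bert_L2_128"]
keys_second = ["bert_sentence_embeddings@sent_small_bert_L2_128",
               "trainable_sentiment_dl",
               "document_assembler"]

print(find_component_key(keys_first, "sentiment_dl"))
print(find_component_key(keys_second, "sentiment_dl"))
```

Assuming the pipeline supports indexing by key as in the cells of this notebook, the result could be used as `trainable_pipe[find_component_key(trainable_pipe.keys(), 'sentiment_dl')]`.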


In [27]:
trainable_pipe = nlu.load('embed_sentence.bert train.sentiment')
# We usually need to train longer and use a smaller LR for non-USE based sentence embeddings
# We could tune the hyperparameters further with hyperparameter tuning methods like grid search
# Longer training also tends to give better accuracy
trainable_pipe['trainable_sentiment_dl'].setMaxEpochs(1)  
trainable_pipe['trainable_sentiment_dl'].setLr(0.0005) 
fitted_pipe = trainable_pipe.fit(train_df)
# predict with the trainable pipeline on dataset and get predictions
preds = fitted_pipe.predict(train_df,output_level='document')

# the sentence detector that is part of the pipe generates some NaNs; let's drop them first
preds.dropna(inplace=True)
print(classification_report(preds['y'], preds['sentiment']))

preds

sent_small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]
              precision    recall  f1-score   support

    negative       0.89      0.01      0.02      2106
     neutral       0.00      0.00      0.00         0
    positive       0.67      0.93      0.78      3685

    accuracy                           0.60      5791
   macro avg       0.52      0.31      0.26      5791
weighted avg       0.75      0.60      0.50      5791



Unnamed: 0,document,sentence_embedding_bert,sentiment,sentiment_confidence,text,y
0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,"[-0.9530858993530273, 0.21358267962932587, 0.1...",positive,0.0,Kickers on my watchlist XIDE TIT SOQ PNK CPW B...,positive
1,user: AAP MOVIE. 55% return for the FEA/GEED i...,"[-0.47259706258773804, 0.5354136228561401, -0....",positive,0.0,user: AAP MOVIE. 55% return for the FEA/GEED i...,positive
2,user I'd be afraid to short AMZN - they are lo...,"[0.30400291085243225, 0.2286294549703598, -0.5...",positive,0.0,user I'd be afraid to short AMZN - they are lo...,positive
3,MNTA Over 12.00,"[-1.7079013586044312, -0.4847279489040375, -0....",positive,0.0,MNTA Over 12.00,positive
4,OI Over 21.37,"[-2.3011534214019775, 0.26495108008384705, -0....",positive,0.0,OI Over 21.37,positive
...,...,...,...,...,...,...
5786,Industry body CII said #discoms are likely to ...,"[-0.2165522426366806, 0.6153535842895508, 0.04...",neutral,0.0,Industry body CII said #discoms are likely to ...,negative
5787,"#Gold prices slip below Rs 46,000 as #investor...","[-0.19915248453617096, 0.26074403524398804, 0....",neutral,0.0,"#Gold prices slip below Rs 46,000 as #investor...",negative
5788,Workers at Bajaj Auto have agreed to a 10% wag...,"[-0.43615204095840454, 0.9346762895584106, -0....",neutral,0.0,Workers at Bajaj Auto have agreed to a 10% wag...,positive
5789,"#Sharemarket LIVE: Sensex off day’s high, up 6...","[-0.6081280708312988, 0.2732299864292145, 0.25...",neutral,0.0,"#Sharemarket LIVE: Sensex off day’s high, up 6...",positive
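
The report above contains a `neutral` class that never occurs in the training labels. One plausible explanation, stated here as an assumption, is a confidence threshold: to my recollection Spark NLP's SentimentDL has `threshold` and `thresholdLabel` parameters that replace low-confidence predictions with a fallback label. The sketch below only illustrates that idea in plain Python; the default values and the function name are assumptions, not Spark NLP code.

```python
def label_with_threshold(class_scores, threshold=0.6, fallback_label="neutral"):
    """Return the best-scoring label, or fallback_label if its score is below threshold.

    The threshold of 0.6 and the 'neutral' fallback are assumed values for illustration.
    """
    best_label = max(class_scores, key=class_scores.get)
    if class_scores[best_label] < threshold:
        return fallback_label
    return best_label

print(label_with_threshold({"negative": 0.45, "positive": 0.55}))  # neutral
print(label_with_threshold({"negative": 0.10, "positive": 0.90}))  # positive
```

If this is what happens here, the output of `print_info()` should list the corresponding setter. To keep the fallback class out of the averages, `classification_report` accepts a `labels` argument that restricts the report to the listed classes.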


# 5. Let's save the model

In [28]:
stored_model_path = './models/classifier_dl_trained' 
fitted_pipe.save(stored_model_path)

Stored model_anno_obj in ./models/classifier_dl_trained
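
This notebook does not show what `save` does when the target directory already exists. A minimal sketch that sidesteps the question by choosing an unused path first follows; `next_free_path` is a hypothetical helper, and the temporary directory only stands in for `./models`.

```python
import os
import tempfile

def next_free_path(base_path):
    """Return base_path if unused, else base_path_v2, base_path_v3, ..."""
    if not os.path.exists(base_path):
        return base_path
    version = 2
    while os.path.exists(f"{base_path}_v{version}"):
        version += 1
    return f"{base_path}_v{version}"

with tempfile.TemporaryDirectory() as root:
    base = os.path.join(root, "classifier_dl_trained")
    first = next_free_path(base)   # nothing exists yet, so the base path is free
    os.makedirs(first)             # simulate a saved model directory
    second = next_free_path(base)  # base is taken now, so a suffix is added
    print(os.path.basename(first), os.path.basename(second))
    # classifier_dl_trained classifier_dl_trained_v2
```

In the notebook this could be used as `stored_model_path = next_free_path('./models/classifier_dl_trained')` before calling `fitted_pipe.save(stored_model_path)`.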


# 6. Let's load the model from disk
This makes offline NLU usage possible!   
Call nlu.load(path=path_to_the_pipe) to load a model or pipeline from disk.

In [29]:
hdd_pipe = nlu.load(path=stored_model_path)

preds = hdd_pipe.predict('Tesla plans to invest 10M into the ML sector')
preds



Unnamed: 0,document,sentence_embedding_from_disk,sentiment,sentiment_confidence
0,Tesla plans to invest 10M into the ML sector,"[-0.0711158961057663, 0.9532930254936218, -1.0...",positive,0.0


In [30]:
hdd_pipe.print_info()

The following parameters are configurable for this NLU pipeline (You can copy paste the examples) :
>>> component_list['document_assembler'] has settable params:
component_list['document_assembler'].setCleanupMode('shrink')                                  | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
>>> component_list['bert_sentence_embeddings@sent_small_bert_L2_128'] has settable params:
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setBatchSize(8)              | Info: Size of every batch | Currently set to : 8
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setCaseSensitive(False)      | Info: whether to ignore case in tokens for embeddings matching | Currently set to : False
component_list['bert_sentence_embeddings@sent_small_bert_L2_128'].setDimension(128)            | Info: Number of embedding dimensions | Currently set to : 128
component_list['bert_sen