# Text Classification with ClassifierDL

**Relevant blogpost:** https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32

## Colab Setup

In [1]:
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.4.1 -s 5.3.2 -g

--2024-07-28 18:05:28--  https://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 3.86.22.73
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|3.86.22.73|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2024-07-28 18:05:28--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’

Installing PySpark 3.2.3 and Spark NLP 5.3.2

2024-07-28 18:05:29 (14.5 MB/s) - written to stdout [1191/1191]

setup Colab for PySpark 3.2.3 and Spark NLP 5.3.2

In [2]:
import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd
import os

spark = sparknlp.start(gpu=True)  # GPU session; use sparknlp.start() for CPU-only training

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 5.3.2
Apache Spark version: 3.2.3


## ClassifierDL with Word Embeddings and Text Preprocessing

### Load Dataset

In [4]:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from functools import reduce
from pyspark.sql.functions import explode, split, desc, col, lower, regexp_replace, udf, trim, to_json, struct

In [22]:
trainDataset = (
    spark
    .read
    .option("header", False)
    .option("delimiter", ",")
    .option("multiLine", "true")
    #.schema(custom_schema)
    .csv("/content/all-data.csv")
    .select(col("_c0").alias("sentiment"), col("_c1").alias("text"))
  )

trainDataset.show(truncate=50)

+---------+--------------------------------------------------+
|sentiment|                                              text|
+---------+--------------------------------------------------+
|  neutral|According to Gran , the company has no plans to...|
|  neutral|Technopolis plans to develop in stages an area ...|
| negative|The international electronic industry company E...|
| positive|With the new production plant the company would...|
| positive|According to the company 's updated strategy fo...|
| positive|FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is agg...|
| positive|For the last quarter of 2010 , Componenta 's ne...|
| positive|In the third quarter of 2010 , net sales increa...|
| positive|Operating profit rose to EUR 13.1 mn from EUR 8...|
| positive|Operating profit totalled EUR 21.1 mn , up from...|
| positive|TeliaSonera TLSN said the offer is in line with...|
| positive|STORA ENSO , NORSKE SKOG , M-REAL , UPM-KYMMENE...|
| positive|A purchase agreement for 7,200 tons of gasol

In [23]:
trainDataset.count()

10688

In [24]:
trainDataset, testDataset = trainDataset.randomSplit([0.9, 0.1], seed = 2018)


In [25]:
from pyspark.sql.functions import col

trainDataset.groupBy("sentiment") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+---------+-----+
|sentiment|count|
+---------+-----+
|  neutral| 5445|
| positive| 2885|
| negative| 1314|
+---------+-----+



In [26]:
testDataset.groupBy("sentiment") \
      .count() \
      .orderBy(col("count").desc()) \
      .show()

+---------+-----+
|sentiment|count|
+---------+-----+
|  neutral|  564|
| positive|  330|
| negative|  150|
+---------+-----+
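The two `groupBy` outputs are easier to compare as proportions than as raw counts. The sketch below is plain Python, not part of the original notebook. It takes the counts printed above and computes each label's share of the train and test splits. The helper name `class_shares` is a hypothetical choice for illustration.

```python
# Class counts copied from the train/test groupBy outputs above.
train_counts = {"neutral": 5445, "positive": 2885, "negative": 1314}
test_counts = {"neutral": 564, "positive": 330, "negative": 150}

def class_shares(counts):
    """Return each label's fraction of the total number of rows."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

train_shares = class_shares(train_counts)
test_shares = class_shares(test_counts)

# Print the share of every label in both splits side by side.
for label in train_counts:
    print(f"{label:>8}: train {train_shares[label]:.1%}, test {test_shares[label]:.1%}")
```

In both splits `neutral` makes up more than half of the rows and `negative` is the smallest class at roughly 14%, so the random split appears to have kept the label distribution close to the original.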



In [28]:

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bert_embeddings = BertEmbeddings.pretrained(name='small_bert_L8_512', lang='en') \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifier_dl = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("sentiment") \
    .setMaxEpochs(150) \
    .setLr(0.001) \
    .setBatchSize(8) \
    .setEnableOutputLogs(True)
    #.setOutputLogsPath('logs')

bert_clf_pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        bert_embeddings,
        embeddingsSentence,
        classifier_dl
])

small_bert_L8_512 download started this may take some time.
Approximate size to download 149.1 MB
[OK!]
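The `SentenceEmbeddings` stage above uses the `AVERAGE` pooling strategy, which collapses the per-token vectors of a document into a single vector for the classifier. The idea can be sketched in plain Python as follows. This is an illustration of element-wise averaging under toy assumptions, not Spark NLP's actual implementation, and the function name `average_pool` and the vector values are hypothetical.

```python
def average_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    n = len(token_vectors)
    # Element-wise mean across all tokens.
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Toy 4-dimensional vectors standing in for per-token BERT embeddings
# (hypothetical values; the real model emits much wider vectors).
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 4.0],
]
sentence_vector = average_pool(tokens)
print(sentence_vector)  # [2.0, 1.0, 1.0, 4.0]
```

Because the result has the same width regardless of how many tokens the text contains, the classifier always receives a fixed-size input.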


In [29]:
# Clear logs from earlier runs; -f avoids an error if the directory does not exist yet
!rm -rf /root/annotator_logs

In [30]:
%%time
bert_clf_pipelineModel = bert_clf_pipeline.fit(trainDataset)

CPU times: user 13.9 s, sys: 1.42 s, total: 15.3 s
Wall time: 41min 56s


In [31]:
log_files = os.listdir("/root/annotator_logs")
log_files

['ClassifierDLApproach_6fa0ef078241.log']

In [32]:
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 150 - learning_rate: 0.001 - batch_size: 8 - training_examples: 9644 - classes: 3
Epoch 0/150 - 10.89s - loss: 1029.6616 - acc: 0.6819502 - batches: 1206
Epoch 1/150 - 10.05s - loss: 960.5566 - acc: 0.7457469 - batches: 1206
Epoch 2/150 - 9.93s - loss: 924.2237 - acc: 0.7774896 - batches: 1206
Epoch 3/150 - 8.78s - loss: 903.5121 - acc: 0.79533195 - batches: 1206
Epoch 4/150 - 10.15s - loss: 889.98846 - acc: 0.80643153 - batches: 1206
Epoch 5/150 - 9.92s - loss: 879.7088 - acc: 0.81649375 - batches: 1206
Epoch 6/150 - 8.91s - loss: 871.216 - acc: 0.8253112 - batches: 1206
Epoch 7/150 - 9.91s - loss: 864.2659 - acc: 0.8310166 - batches: 1206
Epoch 8/150 - 10.13s - loss: 858.3437 - acc: 0.83599585 - batches: 1206
Epoch 9/150 - 10.15s - loss: 853.1787 - acc: 0.8407676 - batches: 1206
Epoch 10/150 - 8.70s - loss: 848.5461 - acc: 0.8450208 - batches: 1206
Epoch 11/150 - 9.95s - loss: 844.35504 - acc: 0.8477178 - batches: 1206
Epoch 12/150 - 9.97s - loss: 840.7591 
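To track loss and accuracy over epochs, the log lines can be parsed into structured records. The following is a minimal sketch that assumes the line layout shown in the log above. The regular expression and the helper `parse_epochs` are illustrative, not part of Spark NLP, and only the first few log lines are reproduced here.

```python
import math
import re

# Lines copied from the training log above.
log_text = """\
Training started - epochs: 150 - learning_rate: 0.001 - batch_size: 8 - training_examples: 9644 - classes: 3
Epoch 0/150 - 10.89s - loss: 1029.6616 - acc: 0.6819502 - batches: 1206
Epoch 1/150 - 10.05s - loss: 960.5566 - acc: 0.7457469 - batches: 1206
Epoch 2/150 - 9.93s - loss: 924.2237 - acc: 0.7774896 - batches: 1206
"""

EPOCH_RE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+) - batches: (\d+)"
)

def parse_epochs(text):
    """Return one dict per complete 'Epoch ...' line in the log text."""
    rows = []
    for m in EPOCH_RE.finditer(text):
        rows.append({
            "epoch": int(m.group(1)),
            "seconds": float(m.group(3)),
            "loss": float(m.group(4)),
            "acc": float(m.group(5)),
            "batches": int(m.group(6)),
        })
    return rows

epochs = parse_epochs(log_text)
for row in epochs:
    print(row["epoch"], row["loss"], row["acc"])

# The logged batch count is consistent with ceil(training_examples / batch_size).
print(math.ceil(9644 / 8))  # 1206
```

Lines that are cut off before the `batches` field do not match the pattern and are skipped, so a partially written log can still be parsed.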

In [33]:
# Evaluate on the test dataset with scikit-learn
from sklearn.metrics import classification_report
preds = bert_clf_pipelineModel.transform(testDataset)
preds_df = preds.select('sentiment', 'text', 'class.result').toPandas()
# class.result is a list holding a single label per row; unwrap it
preds_df['result'] = preds_df['result'].apply(lambda x: x[0])
print(classification_report(preds_df['sentiment'], preds_df['result']))

              precision    recall  f1-score   support

    negative       0.55      0.59      0.57       150
     neutral       0.86      0.84      0.85       564
    positive       0.81      0.82      0.82       330

    accuracy                           0.80      1044
   macro avg       0.74      0.75      0.75      1044
weighted avg       0.80      0.80      0.80      1044
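The `macro avg` and `weighted avg` rows of the report can be recomputed from the per-class rows: the macro average weights every class equally, while the weighted average weights each class by its support. The sketch below uses the rounded values printed above, so small rounding differences from scikit-learn's internal, unrounded computation are possible in general. The helper names are hypothetical.

```python
# Per-class scores and support copied from the classification report above.
report = {
    "negative": {"precision": 0.55, "recall": 0.59, "f1": 0.57, "support": 150},
    "neutral":  {"precision": 0.86, "recall": 0.84, "f1": 0.85, "support": 564},
    "positive": {"precision": 0.81, "recall": 0.82, "f1": 0.82, "support": 330},
}

def macro_avg(metric):
    """Unweighted mean of a metric over all classes."""
    return sum(row[metric] for row in report.values()) / len(report)

def weighted_avg(metric):
    """Mean of a metric over all classes, weighted by class support."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total

for metric in ("precision", "recall", "f1"):
    print(f"{metric:>9}: macro {macro_avg(metric):.2f}, weighted {weighted_avg(metric):.2f}")
```

The macro averages sit below the weighted ones because the `negative` class, which has the lowest scores, counts for a full third of the macro average but only 150 of the 1044 test rows in the weighted one.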



## Save Model and Zip It for Models Hub Upload/Download

In [34]:
!rm -fr /content/sentiment-analysis-model

In [35]:
bert_clf_pipelineModel.save('sentiment-analysis-model')
!zip -r /content/sentiment-analysis-model.zip /content/sentiment-analysis-model
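Outside Colab, where the `zip` shell command may not be available, the archive could also be built with Python's standard library. The sketch below is one possible approach using `shutil.make_archive`. The helper `zip_model_dir` is hypothetical, and the demo runs on a throwaway directory with a dummy file rather than on the real saved pipeline.

```python
import os
import shutil
import tempfile
import zipfile

def zip_model_dir(model_dir, archive_base):
    """Zip model_dir into archive_base + '.zip' and return the archive path."""
    parent = os.path.dirname(os.path.abspath(model_dir))
    name = os.path.basename(os.path.abspath(model_dir))
    # root_dir/base_dir keep the top-level folder name inside the archive.
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=name)

# Demo on a throwaway directory standing in for the saved pipeline folder.
workdir = tempfile.mkdtemp()
demo_model = os.path.join(workdir, "sentiment-analysis-model")
os.makedirs(os.path.join(demo_model, "metadata"))
with open(os.path.join(demo_model, "metadata", "part-00000"), "w") as fh:
    fh.write("{}")

archive_path = zip_model_dir(demo_model, os.path.join(workdir, "sentiment-analysis-model"))
with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()
print(names)
```

Unlike the shell command above, which stores entries under the full `content/...` path, this variant stores them relative to the model folder's parent directory.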

In [None]:
from pyspark.ml import PipelineModel
# Load from the same path the pipeline was saved to above
loaded_pipeline_model = PipelineModel.load('sentiment-analysis-model')

In [37]:
preds1 = loaded_pipeline_model.transform(testDataset)
preds_df1 = preds1.select('sentiment','text',"class.result").toPandas()
preds_df1['result'] = preds_df1['result'].apply(lambda x : x[0])
print (classification_report(preds_df1['sentiment'], preds_df1['result']))

              precision    recall  f1-score   support

    negative       0.55      0.59      0.57       150
     neutral       0.86      0.84      0.85       564
    positive       0.81      0.82      0.82       330

    accuracy                           0.80      1044
   macro avg       0.74      0.75      0.75      1044
weighted avg       0.80      0.80      0.80      1044

