# Classical Pipeline
Using Apache Spark and Spark NLP

In this notebook, a set of sentences is classified by sentiment (positive or negative), and the accuracy of this classical approach is calculated. That accuracy is then compared with the accuracy of the hybrid and quantum pipelines.

The pipeline has three major steps:

1.   Extract: the three input files are read and merged into a single file.
2.   Transform: the data is cleaned by removing duplicate sentences, and each sentence is classified as positive or negative.
3.   Load: the result is displayed in this Jupyter notebook, which is then stored in a GitHub repository.
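
The three steps can be outlined in plain Python as below. This is only an illustrative sketch: the function names `extract`, `transform`, and `load`, the in-memory inputs, and the keyword-based `toy_classifier` are assumptions that stand in for the input files and the pretrained Spark NLP model used later in this notebook.

```python
# Hypothetical ETL outline; all names and the toy classifier are illustrative.

def extract(sources):
    """Extract: concatenate the lines of several sources into one list."""
    merged = []
    for lines in sources:
        merged.extend(lines)
    return merged


def transform(lines, classify):
    """Transform: parse 'label sentence' lines, drop duplicates, classify."""
    rows = []
    for line in lines:
        parts = line.split(" ", 1)
        if len(parts) == 2:
            rows.append((parts[0], parts[1].strip()))
    unique_rows = list(dict.fromkeys(rows))  # keep first occurrence, in order
    return [(label, text, classify(text)) for label, text in unique_rows]


def load(rows):
    """Load: serialize the result (here as CSV-like text instead of a file)."""
    return "\n".join(f"{label},{text},{pred}" for label, text, pred in rows)


def toy_classifier(text):
    # Stand-in for the pretrained sentiment model (assumption).
    lowered = text.lower()
    return 1 if "good" in lowered or "thank" in lowered else 0


sources = [["1 Thank you .\n", "1 Thank you .\n"], ["0 Bad quality .\n"]]
result = transform(extract(sources), toy_classifier)
print(load(result))
```

Keeping the steps as separate functions makes it straightforward to swap the toy classifier for a real model without touching the extract or load code.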



In [None]:
%pip install spark-nlp
%pip install pyspark

In [None]:
# run the John Snow Labs Colab setup script for Spark NLP
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [9]:
import warnings

# ignore warnings about deprecated features
warnings.filterwarnings("ignore", category=FutureWarning)

Merge the 3 files into 1 file

In [10]:
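import glob
import os

output_path = "data/sentences_file_result.txt"

# all .txt files in the data folder, excluding the merged output itself,
# so that re-running this cell does not read the file it is writing
read_files = [
    f for f in glob.glob("data/*.txt")
    if os.path.basename(f) != os.path.basename(output_path)
]

# merge the input files into one file called sentences_file_result.txt
with open(output_path, "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(infile.read())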

Transformation

In [11]:
import pandas as pd
file_path = 'data/sentences_file_result.txt'

data = []
with open(file_path, 'r') as file:
    for line in file:
        # split the labels and sentences
        parts = line.split(' ', 1)
        if len(parts) == 2:
            # assign the value of label to number and the sentence to text
            # strip() function returns the sentence without spaces at the beginning and at the end
            number, text = parts
            data.append((number, text.strip()))

df = pd.DataFrame(data, columns=['label', 'text'])

df

Unnamed: 0,label,text
0,1,I am glad I took the leap of faith .
1,1,I ordered May 31 .
2,1,Thank you .
3,1,Thank you .
4,1,Thank you .
...,...,...
648,1,The app has all the features that I need in a ...
649,1,Value for money display very good performance ...
650,1,Thank you .
651,1,Good packaging and promote delivery .
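
The parsing above relies on `str.split(' ', 1)`, which splits only at the first space, so the label is separated from the sentence while the spaces inside the sentence are preserved. The sketch below shows that behaviour on made-up input lines; the helper name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Return (label, sentence), or None when the line has no space to split on."""
    parts = line.split(" ", 1)  # maxsplit=1: split at the first space only
    if len(parts) != 2:
        return None
    label, text = parts
    return label, text.strip()


print(parse_line("1 Thank you .\n"))    # ('1', 'Thank you .')
print(parse_line("0 Bad quality .\n"))  # ('0', 'Bad quality .')
print(parse_line("\n"))                 # None: blank lines are skipped
```

Note that the label stays a string (`'1'` or `'0'`) after parsing, which is why the evaluation cell later casts the `label` column to a numeric type.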


In [12]:
# remove duplicate sentences
df_clean = df.drop_duplicates()

df_clean

Unnamed: 0,label,text
0,1,I am glad I took the leap of faith .
1,1,I ordered May 31 .
2,1,Thank you .
5,1,Was unused at first with ordering a refurbishe...
7,0,Very disappointed with the shell ordered: it i...
...,...,...
646,1,Product received as expected great .
647,1,Doesn't connect consistently with either ios o...
648,1,The app has all the features that I need in a ...
649,1,Value for money display very good performance ...
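
Called without arguments, `drop_duplicates()` compares whole rows and keeps the first occurrence, so a sentence that appears with two different labels is kept once per label. Surviving rows also keep their original index, which is why the index in the output above jumps from 2 to 5. The plain-Python sketch below mimics the row-wise rule on made-up rows; `drop_duplicate_rows` is a hypothetical helper, not part of pandas.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each (label, text) row, preserving order."""
    return list(dict.fromkeys(rows))


rows = [
    ("1", "Thank you ."),
    ("1", "Thank you ."),    # exact duplicate: removed
    ("0", "Thank you ."),    # same text, different label: kept
    ("0", "Bad quality ."),
]
print(drop_duplicate_rows(rows))
# [('1', 'Thank you .'), ('0', 'Thank you .'), ('0', 'Bad quality .')]
```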


In [13]:
import sparknlp
spark = sparknlp.start()

In [14]:
# importing the pretrained sentiment analysis pipeline
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('analyze_sentimentdl_glove_imdb', 'en')

analyze_sentimentdl_glove_imdb download started this may take some time.
Approx size to download 154.1 MB
[OK!]


In [15]:
# list containing the labels of sentences found using spark-nlp
classified_label = []

for sentence in df_clean['text']:
    result = pipeline.annotate(sentence)
    sentiment = result['sentiment'][0]
    if sentiment == 'pos':
        # positive review
        classified_label.append(1.0)
    elif sentiment == 'neg':
        # negative review
        classified_label.append(0.0)
    else:
        # neutral review
        classified_label.append(-1.0)

# stop the spark nlp session
spark.stop()

print(classified_label)

[1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, -1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1

In [16]:
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Initialize Spark
spark = SparkSession.builder.appName("SentenceClassifier").getOrCreate()

# add the labels found using sparknlp to the pandas dataframe
# (classified_label has one entry per row of df_clean, in the same order)
labeled_df = df_clean.assign(classified_label=classified_label)

# convert the pandas dataframe into a spark dataframe
classified_df = spark.createDataFrame(labeled_df)

classified_df.show()

# Cast the "label" column to double, since the evaluator requires numeric columns of type double
# the classified_label column already holds floats from the cell above
classified_df_sp = classified_df.withColumn("label", classified_df.label.cast('double'))

# Filter out rows with a -1 value for "classified_label"
classified_df_sp = classified_df_sp.filter(classified_df_sp["classified_label"] != -1)


# Calculate accuracy using MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(predictionCol="classified_label", labelCol="label", metricName="accuracy")
accuracy = evaluator.evaluate(classified_df_sp)

print("Accuracy:", accuracy)

# Stop the Spark session
spark.stop()


+-----+--------------------+----------------+
|label|                text|classified_label|
+-----+--------------------+----------------+
|    1|I am glad I took ...|             1.0|
|    1|  I ordered May 31 .|             0.0|
|    1|         Thank you .|             1.0|
|    1|Was unused at fir...|             0.0|
|    0|Very disappointed...|             0.0|
|    0|        A bit ugly .|             0.0|
|    0|Manufacturing def...|             0.0|
|    0|Very disappointed...|             1.0|
|    0|The case is great...|             1.0|
|    0|       Bad quality .|             0.0|
|    0|        Overpriced .|             0.0|
|    1|I ordered the pho...|             1.0|
|    1|A very reasonable...|             1.0|
|    1|I ordered an phon...|             1.0|
|    1|I ordered an phon...|             1.0|
|    1|      Good quality .|             1.0|
|    1|Fast, friendly, r...|             1.0|
|    1|Got this phone fo...|             1.0|
|    1|I ordered an iPho...|      
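
As a cross-check on the evaluator, accuracy here is the fraction of rows whose predicted label equals the true label, computed after dropping the rows predicted as neutral (-1). A minimal plain-Python sketch on made-up values follows; the helper name `accuracy_ignoring_neutral` is hypothetical.

```python
def accuracy_ignoring_neutral(labels, predictions, neutral=-1.0):
    """Fraction of correct predictions among rows not predicted as neutral."""
    pairs = [(float(y), p) for y, p in zip(labels, predictions) if p != neutral]
    if not pairs:
        raise ValueError("no non-neutral predictions to evaluate")
    correct = sum(1 for y, p in pairs if y == p)
    return correct / len(pairs)


# made-up example: labels are strings, as produced by the parsing cell
labels = ["1", "1", "0", "0", "1"]
predictions = [1.0, 0.0, 0.0, -1.0, 1.0]
print(accuracy_ignoring_neutral(labels, predictions))  # 0.75
```

Because neutral rows are removed before counting, the reported accuracy describes only the sentences the model committed to as positive or negative.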

In [17]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
spark_version = spark.version
print("Apache Spark version:", spark_version)
spark.stop()

Apache Spark version: 3.2.3


In [18]:
sparknlp.version()

'5.1.4'

In [19]:
!python -V

Python 3.10.12
