
## Vivekn Sentiment Analysis

In the following example, we walk through Sentiment Analysis training and prediction using Spark NLP Annotators.

The ViveknSentimentApproach annotator implements the [Vivek Narayanan algorithm](https://arxiv.org/pdf/1305.6143.pdf). It can be trained either from a column in the training dataset whose rows are labelled 'positive' or 'negative', or from a folder full of positive text and a folder full of negative text. Using n-grams and negation of sequences, this statistical model can achieve high accuracy if trained properly.
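
To make "n-grams and negation of sequences" concrete, here is a minimal pure-Python sketch of the idea, assuming a simplified presence-based Naive Bayes with Laplace smoothing. It is an illustration only, not Spark NLP's implementation: the helper names (`mark_negation`, `features`, `train`, `predict`), the negation word list, and the toy training sentences are all hypothetical.

```python
import math
import re
from collections import Counter

NEGATIONS = {"not", "no", "never"}      # assumed negation words (illustrative)
PUNCTUATION = {".", ",", "!", "?", ";"}

def mark_negation(tokens):
    """Prefix tokens after a negation word with 'not_' until punctuation."""
    out, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False             # punctuation ends the negated span
            continue
        if tok in NEGATIONS or tok.endswith("n't"):
            negated = not negated       # a second negation flips the state back
            continue
        out.append("not_" + tok if negated else tok)
    return out

def features(text):
    """Set of unigrams and bigrams (presence only) after negation marking."""
    tokens = re.findall(r"[a-z']+|[.,!?;]", text.lower())
    unigrams = mark_negation(tokens)
    bigrams = [a + " " + b for a, b in zip(unigrams, unigrams[1:])]
    return set(unigrams + bigrams)

def train(labelled_texts):
    """Count documents and feature occurrences per label."""
    doc_counts = Counter()
    feat_counts = {"positive": Counter(), "negative": Counter()}
    for text, label in labelled_texts:
        doc_counts[label] += 1
        feat_counts[label].update(features(text))
    return doc_counts, feat_counts

def predict(model, text):
    """Pick the label with the highest smoothed log-probability.

    Assumes every label appears at least once in the training data.
    """
    doc_counts, feat_counts = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in feat_counts:
        score = math.log(doc_counts[label] / total_docs)
        for feat in features(text):
            # Laplace-smoothed P(feature present | label)
            score += math.log((feat_counts[label][feat] + 1) / (doc_counts[label] + 2))
        scores[label] = score
    return max(scores, key=scores.get)

toy_corpus = [
    ("I love this movie, it is great", "positive"),
    ("What a wonderful and happy day", "positive"),
    ("I hate this movie, it is terrible", "negative"),
    ("This is not good", "negative"),
]
model = train(toy_corpus)
print(predict(model, "not good at all"))
print(predict(model, "I love it"))
```

Because "good" after "not" becomes the separate feature `not_good`, the classifier can learn that a negated positive word signals negative sentiment.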

Spark can be leveraged during training by using the `ReadAs.Dataset` setting. Spark is used during prediction by default.

We also include a spell checker in this pipeline, which corrects our sentences for better Sentiment Analysis accuracy.

### Spark `2.4` and Spark NLP `1.8.4`

#### 1. Call necessary imports and set the resource path to read local data files

In [None]:
#Imports
import time
import sys
import os
#sys.path.append('../../')

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession  # needed to build the session in step 2
from pyspark.sql.functions import array_contains
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Finisher

#Set the location of the resource directory
resource_path = "../../../src/test/resources/"


#### 2. Load SparkSession if not already there

In [None]:
spark = SparkSession.builder \
    .appName("VivekNarayanSentimentApproach")\
    .master("local[*]")\
    .config("spark.driver.memory","8G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.8.4")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

#### 3. Load a Spark dataset and put it in memory

In [None]:
#Load the input data to be annotated
data = spark.read \
    .parquet(resource_path + "sentiment.parquet") \
    .limit(1000) \
    .cache()
data.show()

#### 4. Create the document assembler, which will put target text column into Annotation form

In [None]:
### Document assembler
document_assembler = DocumentAssembler() \
            .setInputCol("text")\
            .setOutputCol("document")


In [None]:
### Example: Check out the output of the document assembler
assembled = document_assembler.transform(data)
assembled.show(5)

#### 5. Create the sentence detector to split every document into sentences

In [None]:
### Sentence detector
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

In [None]:
### Example: Check out the output of the sentence detector
sentence_data = sentence_detector.transform(assembled)
sentence_data.show(5)

#### 6. The tokenizer will match standard tokens

In [None]:
### Tokenizer
tokenizer = Tokenizer() \
            .setInputCols(["sentence"]) \
            .setOutputCol("token")


In [None]:
### Example: Check out the output of the tokenizer
tokenized = tokenizer.transform(sentence_data)
tokenized.show(5)
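
As a rough picture of what the last two stages produce, the following pure-Python sketch splits a document into sentences and each sentence into tokens. It uses naive regular expressions as a stand-in and is not the rule set Spark NLP's `SentenceDetector` or `Tokenizer` actually applies; `split_sentences` and `tokenize` are hypothetical helper names.

```python
import re

def split_sentences(document):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    """Words (optionally with an internal apostrophe) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

doc = "I loved it. The ending wasn't great!"
sentences = split_sentences(doc)
tokens = [tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

The nesting mirrors the pipeline: one document yields several sentences, and each sentence yields its own list of tokens.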

#### 7. Normalizer will clean out the tokens

In [None]:
normalizer = Normalizer() \
            .setInputCols(["token"]) \
            .setOutputCol("normal")
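
Conceptually, normalization strips unwanted characters from each token and drops tokens that end up empty. The sketch below illustrates this in pure Python, assuming a cleanup pattern that removes everything except ASCII letters; the `normalize` helper and its default pattern are assumptions for illustration and may differ from the annotator's actual defaults.

```python
import re

def normalize(tokens, pattern=r"[^A-Za-z]", lowercase=False):
    """Remove characters matching `pattern` and discard empty results."""
    cleaned = []
    for tok in tokens:
        tok = re.sub(pattern, "", tok)
        if lowercase:
            tok = tok.lower()
        if tok:                      # punctuation-only tokens disappear here
            cleaned.append(tok)
    return cleaned

print(normalize(["Great", "!!!", "movie2", "it's"]))
print(normalize(["Great", "MOVIE"], lowercase=True))
```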

#### 8. The spell checker will correct normalized tokens; it is trained with a dictionary of English words

In [None]:
### Spell Checker
spell_checker = NorvigSweetingApproach() \
            .setInputCols(["normal"]) \
            .setOutputCol("spell") \
            .setDictionary( resource_path+ "spell/words.txt")
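
The approach is named after Norvig's well-known spelling corrector, whose core idea is to generate every string one edit away from a word and keep those found in the dictionary. The sketch below shows that idea in pure Python under simplifying assumptions: it has no word frequencies, so ties are broken alphabetically, and the `edits1` and `correct` helpers and the tiny dictionary are hypothetical rather than Spark NLP's implementation.

```python
import string

def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, dictionary):
    """Return the word itself if known, else a known word one edit away."""
    if word in dictionary:
        return word
    candidates = sorted(edits1(word) & dictionary)  # alphabetical tie-break
    return candidates[0] if candidates else word    # unknown words pass through

dictionary = {"happy", "movie", "great"}
print(correct("hapy", dictionary))
print(correct("mvoie", dictionary))
print(correct("xyzzy", dictionary))
```

Correcting tokens before sentiment scoring matters because a misspelled word would otherwise be an unseen feature for the classifier.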


#### 9. Create the ViveknSentimentApproach and set resources to train it

In [None]:
sentiment_detector = ViveknSentimentApproach() \
    .setInputCols(["spell", "sentence"]) \
    .setOutputCol("sentiment") \
    .setPruneCorpus(0) \
    .setPositiveSource(resource_path+"vivekn/positive") \
    .setNegativeSource(resource_path+"vivekn/negative")


#### 10. The finisher will convert the sentiment analysis annotations into plain output values

In [None]:
finisher = Finisher() \
    .setInputCols(["sentiment"]) \
    .setIncludeMetadata(False)
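
The later cells filter on a `finished_sentiment` column with `array_contains`, which suggests the finisher turns each row's annotations into an array of plain result strings. The following pure-Python sketch mimics that reduction on hand-written records; the dictionary layout and the `finish` helper are assumptions for illustration, not the library's internal representation.

```python
# Hypothetical annotation records for one row with two sentences
annotations = [
    {"annotatorType": "sentiment", "begin": 0, "end": 10,
     "result": "positive", "metadata": {"confidence": "0.9"}},
    {"annotatorType": "sentiment", "begin": 12, "end": 30,
     "result": "negative", "metadata": {"confidence": "0.7"}},
]

def finish(annotations, include_metadata=False):
    """Keep only the result values, optionally paired with their metadata."""
    if include_metadata:
        return [(a["result"], a["metadata"]) for a in annotations]
    return [a["result"] for a in annotations]

finished_sentiment = finish(annotations)
print(finished_sentiment)
print("negative" in finished_sentiment)   # plain-Python analogue of array_contains
```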


#### 11. Fit and predict over data

In [None]:
pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    normalizer,
    spell_checker,
    sentiment_detector,
    finisher
])

start = time.time()
sentiment_data = pipeline.fit(data).transform(data)

end = time.time()
print("Time elapsed pipeline process: " + str(end - start))

#### 12. Check the result

In [None]:
sentiment_data.show(5,False)

In [None]:
type(sentiment_data)


In [None]:
# Negative Sentiments
for r in sentiment_data.where(array_contains(sentiment_data.finished_sentiment, "negative")).take(5):
    print(r['text'].strip(),"->",r['finished_sentiment'])

In [None]:
# Positive Sentiments
for r in sentiment_data.where(array_contains(sentiment_data.finished_sentiment, "positive")).take(5):
    print(r['text'].strip(),"->",r['finished_sentiment'])

#### 13. The pipeline can also be used directly on an array of dummy text

In [None]:
dummy_data = spark.createDataFrame([["I am happy and like this spark NLP"], ["Have to say something bad now"]], ["text"])
dummy_data.show()

In [None]:
pipeline.fit(dummy_data).transform(dummy_data).show()

#### 14. The pipeline can be saved to disk for future reuse, either before or after fitting the model

In [None]:
new_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    normalizer,
    spell_checker,
    sentiment_detector,
    finisher
])

start = time.time()
new_pipeline.write().overwrite().save("./ps")
end = time.time()
print("Time elapsed in write before fitting pipelines: " + str(end - start))
start = time.time()
new_pipeline.fit(data).write().overwrite().save("./ms")
end = time.time()
print("Time elapsed in write after fitting pipelines: " + str(end - start))

#### 15. Pipelines can be easily loaded back into memory

In [None]:

start = time.time()
p = Pipeline.read().load("./ps")
pm = PipelineModel.read().load("./ms")
end = time.time()
print("Time elapsed in read pipelines: " + str(end - start))

In [None]:
# Using the fitted pipeline read from disk
start = time.time()
data_transformed = pm.transform(data)
data_transformed.where(array_contains(data_transformed.finished_sentiment, "negative")).show()
print(data_transformed.count())
end = time.time()
print("Time elapsed in using loaded pipelines: " + str(end - start))