A distributed sentiment analysis pipeline with PySpark
In this project, we take a pre-trained sentiment analysis model and deploy its prediction pipeline onto a distributed Spark cluster simulated with VirtualBox VMs.
- Set up and test a sentiment analysis prediction pipeline
  - Note: prediction only, not training
- Deploy on a Spark cluster running on VMs (for realistic simulation purposes)
- Test with the `animal-crossing` review dataset
- Gather and review the dataset and results with some analysis and visualizations
- `VirtualBox`, `Linux`
- `Anaconda`, `Python`
- `Java`, `Scala`
- `Spark`, `PySpark`
- `JupyterNotebook`
- `VADER` from the `nltk` toolkit
We use VADER (Valence Aware Dictionary and sEntiment Reasoner), a pre-trained model from the NLTK toolkit. NLTK is open-source software; you can find more details at (Link).
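As a rough illustration of what a VADER prediction looks like, here is a minimal sketch built on NLTK's `SentimentIntensityAnalyzer`. The helper names `score_text` and `label_from_compound` are hypothetical, and the 0.05 cutoff on the compound score is a commonly used convention that we assume here, not something this project specifies. Calling `score_text` assumes `nltk` is installed and the `vader_lexicon` resource has been downloaded.

```python
def label_from_compound(compound, threshold=0.05):
    """Map VADER's compound score (in [-1, 1]) to a coarse label.

    The 0.05 threshold is an assumed convention, not taken from this project.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_text(text):
    """Return VADER's polarity scores for one review text."""
    # Imported lazily so the labelling helper works without nltk installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # Requires a one-time nltk.download("vader_lexicon").
    analyzer = SentimentIntensityAnalyzer()
    # The result is a dict with the keys neg, neu, pos and compound.
    return analyzer.polarity_scores(text)
```

For a single review, `label_from_compound(score_text(review)["compound"])` would then give a coarse positive/neutral/negative label.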
We found the Animal Crossing Reviews dataset, which contains 4 CSV files about VillagerDB and Metacritic.
In this project, we use only critic.csv (Metacritic reviews of Animal Crossing). You can find the file at `data/animal-crossing.csv`. It has 107 rows and 4 columns (`grade`, `publication`, `text`, `date`).
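A minimal sketch of reading that file with the standard library, assuming the four column names listed above appear in the header row and that `grade` holds integer scores. The function name `load_reviews` is hypothetical and not part of the project's scripts.

```python
import csv

EXPECTED_COLUMNS = ["grade", "publication", "text", "date"]


def load_reviews(csv_file):
    """Read review rows from an open CSV file and convert `grade` to int."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        row["grade"] = int(row["grade"])  # assumption: grades are integers
        rows.append(row)
    return rows


# Usage (path taken from this README):
# with open("data/animal-crossing.csv", newline="", encoding="utf-8") as f:
#     reviews = load_reviews(f)
```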
Some insights about the data
Distribution of grades in the reviews. | Categories based on grade. |
---|---|
More EDA and information about the dataset can be found at Ref
Our experiments run on Linux VMs in VirtualBox. A detailed explanation of how to set up the environment is available in our docs (How to setup spark cluster on VMs).
- Spark Cluster (`VirtualBox`)
  - 1 master node: `spark-master`
  - 2 slaves: `spark-master` and `spark-slave-1`
- Pipeline description
  - A data streaming server: simulated with `stream-server.py`, which opens a socket at port `9999` and sends messages to any client that connects to it
  - Spark Master: connects to the streaming server's socket using `SparkStreaming` and distributes the received data in batches to all workers for processing
  - Spark Workers: each holds an instance of our model and processes each batch of messages to generate sentiment scores
  - Result Merging: after prediction, results are saved into the database (here we simulate that by having workers send their predictions back to the master, which stores them in a `.csv` file)
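The streaming-server step of the pipeline can be sketched with Python's standard `socket` module. This is not the project's `stream-server.py`, only an illustration under simplifying assumptions: it serves a single client, sends one message per line, and the function names `open_server` and `send_lines` are hypothetical.

```python
import socket
import time


def open_server(host="localhost", port=9999):
    """Create a TCP socket that listens on (host, port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines, delay=0.0):
    """Wait for one client, then send each message as a newline-terminated line."""
    conn, _addr = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
            time.sleep(delay)  # pause between messages to imitate a live stream
```

Newline-terminated UTF-8 text is used because Spark's socket sources read the stream line by line.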
- How to recreate? 🤿
  - Set up the VMs and environment as described in our environment section
  - Go to `spark-master` and start the cluster with `start-all.sh`
  - (Optional) Pickle the model with `model-pickle.py`
  - Run the streaming server: `python stream-server.py`
  - Run the Spark job to get data from the socket and predict in batches: `spark-submit spark-stream-model.py` (to write the output to a log file, use `spark-submit ... > ./logs/log-file.log`)
  - Observe the results and draw your own conclusions from
    - Spark web UI (`spark-master:8080`)
    - Spark history server (`spark-master:18080`)
    - `./data/model-output.csv`
    - `console` or `log` (an example log is included in `./logs/`)
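To make the Spark job step more concrete, here is a hedged sketch of how a job could read from the socket and score each batch on the workers. It assumes Structured Streaming's socket source and `foreachBatch`; the project's `spark-stream-model.py` may well be written differently (for example with the DStream API). The names `score_partition`, `make_batch_handler`, `start_query` and `score_fn` are hypothetical, and `spark` is an existing `SparkSession` passed in by the caller.

```python
import csv


def score_partition(rows, score_fn):
    """Runs on a worker: score every message in one partition of a batch."""
    for row in rows:
        text = row["value"]  # the socket source exposes each line as `value`
        yield (text, score_fn(text))


def make_batch_handler(score_fn, output_path):
    """Build the function that Spark calls once per micro-batch."""

    def handle_batch(batch_df, batch_id):
        # Workers score their partitions; collect() returns results to the driver.
        scored = batch_df.rdd.mapPartitions(
            lambda rows: score_partition(rows, score_fn)
        ).collect()
        # The driver appends the predictions to a CSV file.
        with open(output_path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(scored)

    return handle_batch


def start_query(spark, score_fn, host="localhost", port=9999,
                output_path="./data/model-output.csv"):
    """Connect to the streaming server and start processing batches."""
    lines = (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .load()
    )
    handler = make_batch_handler(score_fn, output_path)
    return lines.writeStream.foreachBatch(handler).start()
```

Collecting every batch to the driver is only reasonable for small batches like the ones in this experiment; a real deployment would write to a database or a distributed sink instead.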
- Look at our sample log and sample output for more insights
- Spark cluster specs
- Event timeline of the cluster executing `spark-stream-model.py`
- Details for the execution of data frame #1 (batch 1, size 58, query 5)

  The above query is related to Job 5 (whose details can be found here)
- Histogram

  The histogram is concentrated on the right side with a longer tail on the left, which means the distribution is left-skewed: most values fall on the right and fewer on the left. This indicates that most reviews express positive or very positive opinions, while relatively few use neutral or negative words.
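The skew can also be checked numerically instead of by eye. Below is a minimal sketch of the moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 1.5); a negative value indicates a left-skewed distribution. The function name `skewness` is hypothetical and this check is not part of the project's scripts.

```python
def skewness(values):
    """Moment-based skewness: negative for left-skewed data, positive for right-skewed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical, so there is no skew to measure
    return m3 / m2 ** 1.5
```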
- Wordcloud (popular words in review text)

  Animal Crossing is a social video game developed by Nintendo. In Animal Crossing, the player character is a human who lives on an island with animals. The player can carry out many activities to develop the island as they want. Words like "Animal", "island", and "series" appear most frequently. Players who like the game generally use positive words such as "best", "fun", and "relaxing".
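A word cloud is a visualization of word frequencies, so the underlying counts can be sketched with the standard library alone. The tokenization rule (lowercase letters and apostrophes) and the function name `top_words` are assumptions made for illustration; the project's notebook may tokenize differently.

```python
import re
from collections import Counter


def top_words(texts, n=10, stopwords=frozenset()):
    """Return the n most frequent words across all review texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)
```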
- Bar graph (most popular adjectives in the 2 grade groups)

  It is also interesting to look at the most common adjectives to see how people describe their experience. People who like the game tend to use more descriptive adjectives in their reviews, such as "best", "great", and "perfect". In contrast, people who feel negatively about the game generally do not seem to describe it as clearly.
- Correlation between Score and Text Length

  There is no clear relationship between score and text length, so we conclude that the two variables are not correlated.
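One way to quantify this is the Pearson correlation coefficient, where values near 0 indicate no linear relationship. The sketch below is a plain standard-library implementation; the function name `pearson` is hypothetical, and we do not claim this is how the project's plot was produced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)
```

Here `xs` would be the review grades and `ys` the length of each review text, for example `len(text)`.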
Documents included in this project are:
- How to setup spark cluster on VMs (❗Our main testing environment for this project)
- How to setup Spark on Windows (not used for experiment)
Here's a video demo-ing our results
- Visit `/docs` for all the documents related to setting up Spark, ...
- `/scripts`: scripts for various testing purposes
- `/windows`: supporting tools needed to run Spark on Windows
- `/data`: our input data and aggregated output data
- `/models`: where we store our pickled models (binary) for distributed reuse at workers
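Pickling a model for reuse at the workers can be sketched with Python's `pickle` module. This is an illustration only, not the project's `model-pickle.py`; the function names `save_model` and `load_model` are hypothetical.

```python
import pickle


def save_model(model, path):
    """Serialize a model object to a binary file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled model (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Each worker can then call `load_model` once and reuse the same object for every batch rather than rebuilding the model per message.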
Original teammates include @phoenisbuster and @RedEvilBK
Created by @produdez - feel free to contact me or follow my blog on medium ❤️!