Sparkimental

A distributed sentiment analysis pipeline with PySpark

Description

In this project, we take a pre-trained sentiment analysis model and deploy its prediction pipeline onto a distributed Spark cluster simulated with VirtualBox VMs.

Aim

Set up and test a sentiment analysis prediction pipeline
- Note: prediction only, not training
Deploy it on a Spark cluster running on VMs (for a realistic simulation)
Test with the Animal Crossing review dataset
Gather and review the dataset and results with some analysis and visualizations

Technology

VirtualBox, Linux
Anaconda, Python
Java, Scala
Spark, PySpark
Jupyter Notebook

Model

VADER from the NLTK toolkit

We use a pre-trained VADER (Valence Aware Dictionary and sEntiment Reasoner) model from the NLTK toolkit. NLTK is open-source software; more details are available at (Link)
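As a rough illustration of how VADER is typically called through NLTK, the sketch below wraps `SentimentIntensityAnalyzer.polarity_scores` and maps the resulting compound score to a coarse label. The function names and the 0.05 threshold are assumptions made for this example (the threshold follows a commonly used convention), not necessarily what this project does.

```python
def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label.

    The 0.05 threshold is an assumed, commonly used convention.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def score_text(text):
    """Score one review with NLTK's VADER.

    Requires nltk and its vader_lexicon resource to be installed.
    """
    # Imported lazily so compound_to_label stays usable without nltk.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(text)  # keys: neg, neu, pos, compound
    return scores["compound"], compound_to_label(scores["compound"])
```

To try `score_text`, run `nltk.download("vader_lexicon")` once beforehand so the lexicon is available locally.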

Data

We use the Animal Crossing Reviews dataset, which contains 4 CSV files covering VillagerDB and Metacritic data.

In this project, we use only critic.csv (Metacritic reviews of Animal Crossing). The file is stored at data/animal-crossing.csv and has 107 rows and 4 columns (grade, publication, text, date).
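A minimal sketch of loading a file with these four columns using only the standard library is shown below. The helper name `load_reviews`, the integer conversion of `grade`, and the sample rows are assumptions for illustration; the rows are made up and not taken from the dataset.

```python
import csv
import io

EXPECTED_COLUMNS = ["grade", "publication", "text", "date"]


def load_reviews(handle):
    """Read review rows from an open CSV file and check the expected header."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for row in reader:
        row["grade"] = int(row["grade"])  # assumption: grades are integers
        rows.append(row)
    return rows


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "grade,publication,text,date\n"
    '90,Example Outlet,"A relaxing, charming game.",2020-03-16\n'
    "70,Another Outlet,Fun but repetitive.,2020-03-20\n"
)
reviews = load_reviews(sample)
print(len(reviews), reviews[0]["grade"])
```

With the real file, the same helper could be called as `load_reviews(open("data/animal-crossing.csv", newline="", encoding="utf-8"))`.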

Some insights about the data

Figures: distribution of grades across the reviews; categories based on grade.

More EDA and information about the dataset can be found at Ref

Environment

Our experiments are run on Linux VMs in VirtualBox. A detailed explanation of how to set up the environment is available in our doc (How to setup spark cluster on VMs).

Experiment Details

Spark Cluster (VirtualBox)
- 1 master node: spark-master
- 2 worker nodes: spark-master (which also runs a worker) and spark-slave-1
Pipeline description
- A data streaming server: simulated with stream-server.py, which opens a socket on port 9999 and sends messages to any client that connects to it
- Spark Master: connects to the streaming server's socket using Spark Streaming and distributes the received data in batches to all workers for processing
- Spark Workers: each has an instance of our model and processes each batch of messages to generate sentiment scores
- Result Merging: after prediction, results are saved to the database (here we simulate that by having the workers send their predictions back to the master, which stores them in a .csv file)
How to recreate? 🤿
1. Set up the VMs and environment as described in the Environment section
2. Go to spark-master and start the cluster with start-all.sh
3. (Optional) Pickle the model with model-pickle.py
4. Run the streaming server
```
    python stream-server.py
```
5. Run the Spark job to read data from the socket and predict in batches
```
    spark-submit spark-stream-model.py

    # use this if you want to write the output to a log file
    spark-submit ... > ./logs/log-file.log
```
6. Observe results and draw your own conclusions from
  - Spark web-ui (spark-master:8080)
  - Spark history server (spark-master:18080)
  - ./data/model-output.csv
  - console or log (an example log is included in ./logs/)
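The streaming-server step can be sketched roughly as follows. This is not the code in stream-server.py; it is a minimal stand-in, assuming messages are sent as newline-terminated UTF-8 lines (the form Spark's socket source reads). All function names here are hypothetical, and the demo uses an ephemeral port rather than 9999 so it does not clash with a running server.

```python
import socket
import threading
import time


def open_server(host="localhost", port=9999):
    """Create a listening TCP socket; 9999 matches the port described above."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((host, port))
    server_sock.listen(1)
    return server_sock


def serve_messages(server_sock, messages, delay=0.0):
    """Accept one client and send it each message as a newline-terminated line."""
    conn, _addr = server_sock.accept()
    with conn:
        for message in messages:
            conn.sendall((message + "\n").encode("utf-8"))
            time.sleep(delay)  # optional pacing so the consumer sees several batches


def start_in_background(server_sock, messages):
    """Run serve_messages on a daemon thread and return the thread."""
    thread = threading.Thread(
        target=serve_messages, args=(server_sock, messages), daemon=True
    )
    thread.start()
    return thread


# Demo on an ephemeral port (port 0), with made-up messages.
server = open_server("127.0.0.1", 0)
port = server.getsockname()[1]
worker = start_in_background(server, ["great game", "too repetitive"])
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    # Read until the server closes the connection.
    received = b"".join(iter(lambda: client.recv(1024), b"")).decode("utf-8")
worker.join(timeout=5)
server.close()
print(received.splitlines())
```

In the real pipeline the client role is played by the Spark job, which would connect to port 9999 on the host running the server.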

Results

Spark Results

See our sample log and sample output for more insights
Spark cluster specs
Event timeline of the cluster executing spark-stream-model.py
Execution details for the first data frame (batch 1, size 58, query 5)

The query above is related to Job 5 (details can be found here)

Output Visualizations

Histogram

The histogram is concentrated on the right side, which indicates a left-skewed distribution: most values fall on the right and relatively few on the left. In other words, most reviews express positive or very positive opinions, and relatively few use neutral or negative wording.
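One way to check the visual impression of left skew numerically is to compute a skewness coefficient, which is negative for left-skewed data. The sketch below uses the moment-based coefficient g1; the function name and the scores are made up for illustration and are not the project's output.

```python
def skewness(values):
    """Return the moment-based skewness coefficient g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up compound scores: most are high, a few are low.
scores = [0.9, 0.85, 0.8, 0.95, 0.7, 0.9, 0.1, -0.4]
print(skewness(scores))  # a negative value indicates left skew
```

The same function could be applied to the score column of ./data/model-output.csv to quantify the skew seen in the histogram.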
Wordcloud (popular words in review text)

Animal Crossing is a social video game developed by Nintendo. In Animal Crossing, the player character is a human who lives on an island with animals. The player can carry out many activities to develop the island as they want. Words such as Animal, island, and series appear most frequently. People who like the game generally use positive words such as best, fun, and relaxing.
Bar graph (most popular adjectives in the 2 grading groups)

One potentially interesting view is the most common adjectives, which show how people describe their experience. People who like the game tend to use more descriptive adjectives in their reviews, calling it the best, great, or perfect. In general, people who feel negatively about the game do not seem to describe it as clearly.

Correlation between Score and Text Length

The plot shows no clear relationship between score and text length, so we conclude that the two variables are not correlated.
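A correlation claim like this can be backed by a Pearson correlation coefficient, where values near 0 suggest no linear relationship. The sketch below is a plain-Python version; the function name and the sample numbers are assumptions for illustration, not values from this project.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up text lengths and scores, for illustration only.
lengths = [1, 2, 3, 4]
scores = [1, -1, -1, 1]
print(pearson_r(lengths, scores))  # close to 0: no linear relationship
```

Note that a coefficient near 0 only rules out a linear relationship; a scatter plot is still useful for spotting non-linear patterns.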

Docs

Documents included in this project are:

How to setup spark cluster on VMs (❗Our main testing environment for this project)
How to setup Spark on Windows (not used for experiment)

Demo

Here's a video demo-ing our results

Extra Notes

/docs - all the documents related to setting up Spark, ...
/scripts - scripts for various testing purposes
/windows - supporting tools needed to run Spark on Windows
/data - our input data and aggregated output data
/models - where we store our pickled models (binary) for distributed reuse on the workers

Acknowledgements

Original teammates include @phoenisbuster and @RedEvilBK

Contact

Created by @produdez - feel free to contact me or follow my blog on Medium ❤️!
