sentiment_analysis

Installation and Usage:

1) Install MySQL
2) Open the MySQL shell
3) Create a database named "sentiment_store"
4) Enter your MySQL username and password on lines 8 and 9
5) Run pip install -r requirements.txt
6) Run main.py
    -> This combines the 2 data sources and stores the result in a table called data_dump
    -> It then reads the data back from data_dump
    -> Cleans the entire dataset and prepares it for the ML model
    -> Trains the ML model and generates insights

What does the business need? -> The movie producers need to understand if their movie is popular or not.

Data Loading: All data is loaded into a MySQL database called sentiment_store. The data is then processed for evaluation using the techniques described below. Once the data has been cleaned, it is passed to several machine learning classification and clustering models. These models generate the insights and the sentiment analysis engine, which can then be stored and used in future applications.
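
The load-then-read flow described above can be sketched as follows. This is only an illustration: it uses sqlite3 from the Python standard library as a stand-in for MySQL so that it runs without a server, and the sample rows, the column names (review, label) and the helper names are assumptions, not the project's actual schema or code.

```python
import sqlite3

# Hypothetical rows standing in for the two review sources.
# Assumed label encoding: 1 = positive, 0 = negative.
source_a = [("A moving, beautifully shot film.", 1), ("Dull and far too long.", 0)]
source_b = [("Great cast, great script.", 1), ("I walked out halfway through.", 0)]


def load_data_dump(conn, *sources):
    """Combine all sources into one data_dump table and return its row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data_dump "
        "(review TEXT NOT NULL, label INTEGER NOT NULL)"
    )
    for rows in sources:
        conn.executemany("INSERT INTO data_dump (review, label) VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM data_dump").fetchone()[0]


def read_data_dump(conn):
    """Read every (review, label) pair back for preprocessing."""
    return conn.execute("SELECT review, label FROM data_dump").fetchall()


# In-memory SQLite database as a stand-in for the sentiment_store MySQL database.
conn = sqlite3.connect(":memory:")
total = load_data_dump(conn, source_a, source_b)
rows = read_data_dump(conn)
```

Keeping the combined data in a single table means the cleaning and modelling steps only ever read from one place, regardless of how many sources feed it.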

Data understanding and prep:

We inspected the combined dataset and found that the classes are well balanced, which is good for training a classifier.
 -> TODO: Attach Image
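
A balance check of this kind can be sketched in a few lines. The labels below are made up for illustration and the function name class_balance is hypothetical; the real check would run on the labels read from data_dump.

```python
from collections import Counter


def class_balance(labels):
    """Return the share of each label as a fraction of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical labels; a balanced binary dataset gives roughly 0.5 per class.
labels = ["pos", "neg", "pos", "neg", "neg", "pos"]
shares = class_balance(labels)
```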

Further, we can see that the text contains many tokens that can throw the model off track and need to be cleaned.
Example: tags like "</br>", extra spaces, punctuation marks and so on.

Cleaning of the data has been done in 5 stages:
    1) Tokenize (using TweetTokenizer)
    2) Handle emojis
        -> These are very important for sentiment analysis and need to be preserved.
    3) Convert all text to lowercase
    4) Remove HTML tags
    5) Remove punctuation
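
The stages above can be sketched as one small function. This is a minimal sketch under stated assumptions, not the project's PreProcessingPipeline: the regular expression is a simplified stand-in for TweetTokenizer so that the example needs only the standard library, and the emoticon table is hypothetical apart from the ":)" to "happy" mapping.

```python
import re
import string

# Simplified stand-in for TweetTokenizer. Alternatives are tried in order:
# an HTML tag, a basic emoticon, a word (optionally with an apostrophe part),
# and finally any single non-space character.
TOKEN_RE = re.compile(r"<[^>]+>|[:;]-?[()DP]|\w+(?:'\w+)?|\S")

# Assumed emoticon-to-word table; only ":)" -> "happy" comes from this README.
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}


def clean(text):
    """Apply the five cleaning stages and return the remaining tokens."""
    tokens = TOKEN_RE.findall(text)                     # 1) tokenize
    tokens = [EMOTICONS.get(t, t) for t in tokens]      # 2) emoticons -> words
    tokens = [t.lower() for t in tokens]                # 3) lowercase
    tokens = [t for t in tokens                         # 4) drop HTML tags
              if not (t.startswith("<") and t.endswith(">"))]
    tokens = [t for t in tokens                         # 5) drop punctuation-only tokens
              if any(ch not in string.punctuation for ch in t)]
    return tokens


cleaned = clean("Loved it :)<br />Best movie EVER!!!")
```

Emoticons are converted to words before punctuation is removed; in the reverse order a token such as ":)" would be discarded as punctuation and its sentiment signal lost.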

Tokenizer:

    TweetTokenizer (from NLTK) was chosen because it performs better than the other tokenizers by taking emojis and similar tokens into account.

Stemming:

    Stemming was not done because it may destroy some of our features.
    For example: "complicate" (negative) and "compliment" (positive) are stemmed to => "comply"

Handling emojis:

    Emojis are very important, so we replace each emoji with the text it represents.
    Example:
        :) is replaced with "happy"

NLTK stop words:

    Removing the NLTK stop words would remove important features.
    Example: "mustn't", "shouldn't", etc.
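
The effect can be illustrated with a small example. The stop word set below is a hand-picked subset chosen for illustration, not the full NLTK list, and the helper name remove_stop_words is hypothetical.

```python
# Small illustrative subset of stop words; not the full NLTK English list.
STOP_WORDS_SUBSET = {"the", "is", "a", "this", "you", "not", "mustn't", "shouldn't"}

# Negations that carry sentiment and are worth keeping as features.
NEGATIONS = {"not", "mustn't", "shouldn't"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the given stop word set."""
    return [t for t in tokens if t not in stop_words]


tokens = ["you", "shouldn't", "watch", "this", "movie"]

# With the negation in the stop list, the warning turns into a recommendation.
stripped = remove_stop_words(tokens, STOP_WORDS_SUBSET)

# Excluding negations from the stop list preserves the original meaning.
kept_negations = remove_stop_words(tokens, STOP_WORDS_SUBSET - NEGATIONS)
```

Here stripped reads as "watch movie" while kept_negations reads as "shouldn't watch movie", which is the loss of signal this section describes.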

CONTRIBUTION BREAKDOWN :

Data assembly/ management : Jagwant Sehgal
Data preprocessing and preparation : Chaitanya Mehta
Machine learning model generation and tuning : Chaitanya Mehta and Jagwant Sehgal
Clustering model generation : Chaitanya Mehta
Analysis : Jagwant Sehgal
