# Sentiment Analysis

---

### Case Study: Sentiment Analysis for Tweets

> Goal: Derive the sentiment (positive, negative) of political tweets

Data: 1,000 "popular" English tweets directed at:

- U.S. Senate Republican Minority Leader Mitch McConnell
- U.S. Senate Democratic Majority Leader Chuck Schumer

#### Approaches

- Train your own custom supervised learning model (Bag of Words, Word Vectors)
- Use a pre-trained sentiment model (https://docs.aws.amazon.com/comprehend/latest/dg/how-sentiment.html)
- Unsupervised "learning" based on a hard-coded vocabulary and rules (`vaderSentiment`)

> What are the advantages and disadvantages of each approach?

In [4]:
!pip install vaderSentiment



In [5]:
!pip install sqlalchemy



In [6]:
import os
import pandas as pd
import numpy as np
import tweepy
import seaborn as sns
import pymongo
import sqlalchemy

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

---
# ETL Job

1. Get tweet data from a MongoDB database
2. Annotate each tweet with a sentiment score
3. Convert the data into a pandas DataFrame
4. Send the data to a PostgreSQL database
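The four steps can be sketched end to end with stand-ins from the standard library. This is only an illustration under stated assumptions: a list of dictionaries stands in for the MongoDB collection, `placeholder_score` is a hypothetical scorer (the notebook uses VADER for this step), a list of tuples stands in for the DataFrame, and an in-memory `sqlite3` database stands in for PostgreSQL.

```python
import sqlite3

# Stand-in for documents fetched from a MongoDB collection (made-up data).
raw_tweets = [
    {"_id": 1, "text": "I like that :)"},
    {"_id": 2, "text": "This is terrible"},
]


def placeholder_score(text):
    """Hypothetical scorer; the notebook uses VADER for this step."""
    return 1.0 if "like" in text else -1.0


def extract(source):
    # Step 1: copy each document so the source stays untouched.
    return [dict(doc) for doc in source]


def transform(docs, scorer):
    # Steps 2 and 3: annotate each document, then flatten it into a row.
    for doc in docs:
        doc["sentiment"] = scorer(doc["text"])
    return [(doc["text"], doc["sentiment"]) for doc in docs]


def load(rows, connection):
    # Step 4: write the rows to a SQL table, replacing any old contents.
    connection.execute("DROP TABLE IF EXISTS tweets")
    connection.execute("CREATE TABLE tweets (text TEXT, sentiment REAL)")
    connection.executemany("INSERT INTO tweets VALUES (?, ?)", rows)
    connection.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
load(transform(extract(raw_tweets), placeholder_score), conn)
stored = conn.execute(
    "SELECT text, sentiment FROM tweets ORDER BY rowid"
).fetchall()
print(stored)
```

Keeping the three stages as separate functions makes it easy to swap each stand-in for the real component later.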

## 1. Get tweet data from a MongoDB database

For the project, you will use your own MongoDB instance running inside a Docker container:

```
pymongo.MongoClient(host="mongodb", port=27017)
```

In [None]:
connection_url = "mongodb+srv://minty-carlo:spiced99@cluster0.8km2h.mongodb.net"
client = pymongo.MongoClient(host=connection_url)

In [None]:
# TODO: connect to the twitter database
db = client.???

In [None]:
# TODO: count the number of documents in the tweets collection
db.tweets.???

In [None]:
# TODO: fetch all documents from the tweets collection
tweets = []
for tweet in ???:
    tweets.append(???)

In [None]:
tweets[0]

## 2. Annotate each tweet with a sentiment score


### What to do if we don't have any labeled data?

> VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media

https://github.com/cjhutto/vaderSentiment

Take a look at the VADER GitHub repository and try to answer these questions:

1. Locate the "lexicon" (dictionary). What does the dictionary contain, and what do the values in the file represent?
2. Locate the implementation of the "rules":
    - Does VADER take punctuation into account?
    - Which words intensify a sentiment?
    - What happens if one word is in ALL CAPS? What if the whole text is in ALL CAPS?
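To make the idea of "lexicon plus rules" concrete, here is a toy scorer in plain Python. It is a sketch, not VADER's implementation: the lexicon entries, the intensifier list, and all numeric increments are invented for illustration, and `alpha=15` in the squashing step is an assumption. Whether VADER applies the same rules is exactly what the questions above ask you to check in the source.

```python
import math

# Hypothetical mini-lexicon: word -> valence (values invented for this sketch).
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.5, "awful": -3.0}
# Hypothetical intensifiers and increments (not taken from VADER).
TOY_BOOSTERS = {"very": 0.3, "really": 0.3}
CAPS_BOOST = 0.7
EXCLAMATION_BOOST = 0.3


def toy_compound(text, alpha=15):
    """Score text with a lexicon lookup plus three simple rules."""
    tokens = text.replace("!", " ").split()
    # The caps rule only applies when the text mixes upper and lower case.
    mixed_case = any(t.isupper() for t in tokens) and not all(
        t.isupper() for t in tokens
    )
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token.lower())
        if valence is None:
            continue
        sign = 1 if valence > 0 else -1
        # Rule 1: a preceding intensifier pushes the valence away from zero.
        if i > 0 and tokens[i - 1].lower() in TOY_BOOSTERS:
            valence += sign * TOY_BOOSTERS[tokens[i - 1].lower()]
        # Rule 2: an ALL CAPS word inside mixed-case text is emphasized.
        if mixed_case and token.isupper():
            valence += sign * CAPS_BOOST
        total += valence
    # Rule 3: exclamation marks (capped at 4) amplify the overall score.
    if total != 0:
        total += math.copysign(
            min(text.count("!"), 4) * EXCLAMATION_BOOST, total
        )
    # Squash the unbounded sum into the open interval (-1, 1).
    return total / math.sqrt(total * total + alpha)


for sample in ["good", "very good", "the food is GREAT", "bad!!"]:
    print(sample, round(toy_compound(sample), 3))
```

The toy tokenizer only splits on whitespace, so a word with attached punctuation such as `good,` is not found in the lexicon; a real tool has to handle that case.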

In [None]:
s = SentimentIntensityAnalyzer()

In [None]:
sentiment = s.polarity_scores('I like that :) :)')
sentiment
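`polarity_scores` returns a dictionary with `neg`, `neu`, `pos`, and `compound` entries. If a coarse label is more convenient than a number, the compound score can be bucketed. The sketch below uses the ±0.05 thresholds suggested in the VADER README; `label_from_scores` is a hypothetical helper, and the example dictionary is made up rather than real VADER output.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a polarity_scores-style dict to a coarse label via its compound score."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up dictionary in the shape polarity_scores returns.
example_scores = {"neg": 0.0, "neu": 0.3, "pos": 0.7, "compound": 0.8}
print(label_from_scores(example_scores))  # positive
```

The thresholds are a convention, not a law; widening them trades fewer mislabeled neutral tweets for fewer labeled tweets overall.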

In [None]:
# TODO: perform sentiment analysis for each tweet and store the compound score
for tweet in tweets:
    sentiment = ???
    tweet['sentiment'] = ???  

## 3. Convert data into a DataFrame

In [None]:
tweets_df = pd.DataFrame(tweets)
tweets_df.head(3)

In [None]:
# TODO: make a boxplot of the sentiment distribution separately for each mention (@LeaderMcConnell vs. @SenSchumer)
sns.boxplot(x=???, y=???, 
            data=???, 
            palette=['indianred', 'steelblue']
)
sns.despine()
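A boxplot draws the median and quartiles of each group, so it can help to compute those numbers directly and compare them with the plot. The sketch below does this with the standard library only. The rows are made up, and the keys `mention` and `sentiment` are hypothetical; check the columns of `tweets_df` for the names in your data.

```python
from collections import defaultdict
from statistics import quantiles

# Made-up rows; "mention" and "sentiment" are hypothetical column names.
rows = [
    {"mention": "@LeaderMcConnell", "sentiment": -0.6},
    {"mention": "@LeaderMcConnell", "sentiment": -0.2},
    {"mention": "@LeaderMcConnell", "sentiment": 0.0},
    {"mention": "@LeaderMcConnell", "sentiment": 0.3},
    {"mention": "@LeaderMcConnell", "sentiment": 0.5},
    {"mention": "@SenSchumer", "sentiment": -0.4},
    {"mention": "@SenSchumer", "sentiment": -0.1},
    {"mention": "@SenSchumer", "sentiment": 0.1},
    {"mention": "@SenSchumer", "sentiment": 0.4},
    {"mention": "@SenSchumer", "sentiment": 0.7},
]


def summarize(rows, group_key, value_key):
    """Return min, quartiles, and max of value_key for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    summary = {}
    for name, values in groups.items():
        # quantiles needs at least two values per group.
        q1, q2, q3 = quantiles(values, n=4, method="inclusive")
        summary[name] = {
            "min": min(values),
            "q1": q1,
            "median": q2,
            "q3": q3,
            "max": max(values),
        }
    return summary


summary = summarize(rows, "mention", "sentiment")
for name, stats in summary.items():
    print(name, stats)
```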

## 4. Send the data to a PostgreSQL database

> How do you connect to your PostgreSQL database inside Docker?

In [None]:
# is the postgresql database running?
! docker-compose ps

In [None]:
# ! cat ../docker-compose.yml

In [None]:
# SQLAlchemy can't handle MongoDB's ObjectId type, so drop the _id column
tweets_df.drop('_id', axis=1, inplace=True)

In [None]:
# TODO: establish a connection to the database server
engine = sqlalchemy.create_engine('postgresql://<user-name>:<password>@<hostname>:<port>/<database-name>')
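One way to fill in the placeholders without hard-coding credentials is to read them from environment variables and assemble the URL in a small helper. This is a sketch under assumptions: `build_postgres_url` is a hypothetical helper, and the variable names and default values below are placeholders, so use whatever your `docker-compose.yml` actually defines. The user name and password are URL-escaped so that characters such as `@` or `/` do not break the URL.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Assemble a PostgreSQL connection URL, escaping special characters."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical environment variable names with placeholder defaults.
url = build_postgres_url(
    user=os.environ.get("POSTGRES_USER", "postgres"),
    password=os.environ.get("POSTGRES_PASSWORD", "change-me"),
    host=os.environ.get("POSTGRES_HOST", "localhost"),
    port=os.environ.get("POSTGRES_PORT", "5432"),
    database=os.environ.get("POSTGRES_DB", "postgres"),
)
```

The resulting string can be passed to `sqlalchemy.create_engine`. From another container in the same Docker network, the hostname is typically the service name from the compose file rather than `localhost`.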

In [None]:
tweets_df.to_sql('tweets', engine, if_exists='replace', index=False)