# Twitter Sentiment Analysis

## Introduction

Our project is a sentiment analysis using Twitter data centered around the 2016 Presidential election. We aim to analyze geotagged tweets from different states in the United States to determine how happy or unhappy each state is. In addition, we aim to explore how the sentiment of each state compares to its political leanings and see if there were any changes before and after the election.

This analysis will be organized as follows:

- [Data Scraping](#Data-scraping)
- [Data Analysis](#Data-analysis)
- [Data Visualization](#Data-visualization)
- [Conclusions](#Conclusions)

## Data Scraping

To scrape the data from Twitter we used [Twint](https://github.com/haccer/twint), (Twitter Intelligence Tool), an advanced Twitter scraper that allowed us to fetch tweets for specific time periods and tweets that related to specific topics and trends. The benefits of using Twint as opposed to Twitter's API are that Twitter's API only allows users to fetch tweets from the past 7 days and limits the number of tweets to 3200 tweets. Twint allows users to fetch almost all tweets and can fetch tweets from any time period. In addition, Twint is not rate limited and does not require a Twitter account.

We chose to collect tweets from 6 months before the 2016 Presidential election and 6 months after the election. In addition, we filtered all the tweets and only included ones which contained the word "election" in order to analyze the political sentiment within the tweets.

Below is an example of the code used to extract all tweets sent 6 months before the election from Philadelphia, Pennsylvania. We used this code to extract tweets from each state both before and after the election.

As seen in the output, some users include a location on their profile to indicate where they are from. For the purposes of our analysis, however, we chose to use the location from which the user sent the tweet as the user's location. Thus, only tweets for which the geotagged location was available were included.

In [6]:
import twint

In [5]:
# Configure
c = twint.Config()

# 6 months before election
c.Since = "2016-05-08"
c.Until = "2016-11-07"

# tweets mentioning the election
c.Search = "election"

c.Format = "Tweet: {tweet} | Location: {location}"
c.Location = True
c.Count = True
# Philadelphia
c.Geo = '40.0024137,-75.2581183,50km'

# Run
twint.Search(c)

Tweet: City of Philadelphia files for injunction to halt Septa strike for Election Day FOX 29  http://toplocalnow.com/us/philadelphia  | Location: Philadelphia
Tweet: City of Philadelphia files injunction to temporarily halt the SEPTA strike for Election Day.  http://toplocalnow.com/us/philadelphia  | Location: Philadelphia
Tweet: UPDATE: Bruce Springsteen to join Clintons, Obamas at Independence Hall rally on night before Election Day…  https://twitter.com/i/web/status/795371867829891072 … | Location: Philadelphia
Tweet: As Democratic presidential candidate Hillary Clinton made a final blitz ahead of Election Day…  https://www.instagram.com/p/BMfEdlugPhL/  | Location: Washington, DC
Tweet: Spent the day canvassing in Philly.  Election Day is this Tuesday.  VOTE VOTE VOTE…  https://www.instagram.com/p/BMe61gHBYwp/  | Location: LosAngeles
Tweet: "Curiosity killed the cat". The Curio will be televising the election results Tuesday 7…  https://www.instagram.com/p/BMe5_CsDEp_/  | Location:

<twint.search.Search at 0x113260400>

After fetching the tweets for each state, we exported the results to csv files. We had 100 total csv files (tweets from before and after the election for each state). 

## Data Analysis

To analyze our data, we looked at the csv files containing the tweets from each state before and after the election and performed sentiment analysis on them. We went about this process by:

1. Scraping the tweet text from the csv files
2. Cleaning tweets to remove stopwords and other unneccesary parts
3. Training a Naive Bayes classifier to classify tweets as positive or negative
4. Running the classifier on our cleaned tweets to determine average sentiment

First we scraped the tweets from the csv file using scrape_tweets.

In [14]:
import nltk
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
import csv
import re
import os
import pandas as pd

In [15]:
# scrape tweets from csv file
def scrape_tweets(filename):
    tweets = []
    # assume csv files are in directory named 'data'
    with open('data/' + filename) as f:
        reader = csv.reader(f)
        next(reader) # skip header
        data = [r for r in reader]
        for item in data:
            tweets.append(item[3])
    return tweets   

Next, we want to clean our scraped tweets. We do this using clean_tweets where we remove stopwords and other non-essential parts of the tweets.

In [16]:
# clean scraped tweets
def clean_tweets(tweets):
    cleaned_tweets = []
    stop_words = set(stopwords.words('english'))
    for tweet in tweets:
        cleaned_tweet = re.sub("[^a-zA-Z]", " ", tweet).lower()
        for w in tweet.split(" "):
            if w not in stop_words:
                cleaned_tweet += w + " "
        cleaned_tweet = cleaned_tweet.strip()
        cleaned_tweets += [cleaned_tweet]
    return cleaned_tweets

Now we want to train a naive bayes classifier to tag a tweet as positive or negative. We use NLTK's NaiveBayesClassifier to do so in train_naive_bayes using training data from pos_tweet.txt and neg_tweets.txt.

In [17]:
# train naive bayes classifier
# some of this code has been adapted from
# https://www.twilio.com/blog/2017/09/sentiment-analysis-python-messy-data-nltk.html
def format_sentence(s):
    return({word: True for word in nltk.word_tokenize(s)})

def train_naive_bayes():
    positive = []
    with open("./pos_tweets.txt") as f:
        for i in f: 
            positive.append([format_sentence(i), 'positive'])
    negative = []
    with open("./neg_tweets.txt") as f:
        for i in f: 
            negative.append([format_sentence(i), 'negative'])  
    training = positive + negative
    classifier = NaiveBayesClassifier.train(training)
    return classifier

Now that we have a classifer from the previous step, we can evaluate our cleaned tweets. We determine average sentiment using evaluate_sentiment where we calculate the ratio of positive tweets to total tweets.

In [18]:
# determine average sentiment accross tweets
def evaluate_sentiment(tweets, classifier):
    total_tweets = len(tweets)
    positive_tweets = 0
    for tweet in tweets:
        sentiment = classifier.classify(format_sentence(tweet))
        if sentiment == "positive":
            positive_tweets += 1
    average_happiness = positive_tweets/total_tweets
    return average_happiness

Since we have all of the individual steps hammered down now, we can put them together to perform sentiment analysis on our 100 csv files. We do this using perform_sentiment_analysis. This returns a dictionary with the average sentiment of each state before and after the election.

In [19]:
# determine average sentiment accross all states
def perform_sentiment_analysis(csv_files):
    results = dict()
    classifier = train_naive_bayes() # only need to train once
    for file in csv_files:
        tweets = scrape_tweets(file)
        cleaned_tweets = clean_tweets(tweets)
        average_happiness = evaluate_sentiment(cleaned_tweets, classifier)
        key = file.replace(".csv", "")
        results[key] = average_happiness
    return results

In [20]:
csv_files = []
for root, dirs, files in os.walk('data'):
    for file in files:
        if file.endswith('.csv'):
            csv_files.append(file)
scores_dict = perform_sentiment_analysis(csv_files)
print(scores_dict)

{'ak_after': 0.6, 'ak_before': 0.0, 'al_after': 0.5616438356164384, 'al_before': 0.35185185185185186, 'ar_after': 0.625, 'ar_before': 0.5714285714285714, 'az_after': 0.472636815920398, 'az_before': 0.4409937888198758, 'ca_after': 0.49676898222940225, 'ca_before': 0.4650602409638554, 'co_after': 0.5714285714285714, 'co_before': 0.5192307692307693, 'ct_after': 0.6470588235294118, 'ct_before': 0.7, 'de_after': 0.5613207547169812, 'de_before': 0.5654320987654321, 'fl_after': 0.5, 'fl_before': 0.417910447761194, 'ga_after': 0.5166240409207161, 'ga_before': 0.5179968701095462, 'hi_after': 0.5074626865671642, 'hi_before': 0.5416666666666666, 'ia_after': 0.7777777777777778, 'ia_before': 0.875, 'id_after': 0.45, 'id_before': 1.0, 'il_after': 0.4806378132118451, 'il_before': 0.4146341463414634, 'in_after': 0.45365853658536587, 'in_before': 0.6078431372549019, 'ks_after': 0.21428571428571427, 'ks_before': 0.2222222222222222, 'ky_after': 0.4444444444444444, 'ky_before': 0.5714285714285714, 'la_aft

## Data Visualization

Next we organized the results into two separate data frames in order to visualize the results. We formatted the data frame in order to export them to csv files and used R to create heat maps to visualize the data.

In [21]:
statesBefore = []
statesAfter = []
scoresBefore = []
scoresAfter = []
for key in scores_dict:
    if key.find('before') != -1:
        statesBefore.append(key[:2])
        scoresBefore.append(scores_dict[key])
    else:
        statesAfter.append(key[:2])
        scoresAfter.append(scores_dict[key])

In [22]:
sentBefore = pd.DataFrame({'state': statesBefore, 'score': scoresBefore})
sentAfter = pd.DataFrame({'state': statesAfter, 'score': scoresAfter})

states = ['alaska', 'alabama', 'arkansas', 'arizona', 'california', 
          'colorado', 'connecticut', 'delaware', 'florida', 'georgia', 
          'hawaii', 'iowa', 'idaho', 'illinois', 'indiana', 'kansas', 
          'kentucky', 'louisiana', 'massachusetts', 'maryland', 'maine', 
          'michigan', 'minnesota', 'missouri', 'mississippi', 'montana',
         'north carolina', 'north dakota', 'nebraska', 'new hampshire',
         'new jersey', 'new mexico', 'nevada', 'new york', 'ohio', 'oklahoma',
         'oregon', 'pennsylvania', 'rhode island', 'south carolina',
         'south dakota', 'tennessee', 'texas', 'utah', 'virginia', 
         'vermont', 'washington', 'wisconsin', 'west virginia', 'wyoming']

sentBefore['state'] = states
sentAfter['state'] = states

sentBefore.to_csv('sentBefore.csv', encoding='utf-8')
sentAfter.to_csv('sentAfter.csv', encoding='utf-8')

Below are heat maps displaying the sentiments calculated for each state both before and after the election (click for larger image): 
[<img src="https://www.dropbox.com/s/01grcwmt6sp44kl/sentBefore.png?dl=0">](https://www.dropbox.com/s/01grcwmt6sp44kl/sentBefore.png?dl=0)

[<img src="https://www.dropbox.com/s/l0hx9dg2cskdkzh/sentAfter.png?dl=0">](https://www.dropbox.com/s/l0hx9dg2cskdkzh/sentAfter.png?dl=0)

Displayed below is a heat map depicting the difference in sentiment values before and after the election (click for larger image):
[<img src="https://www.dropbox.com/s/6b2eq0hymta8sk6/scoreDiff.png?dl=0">](https://www.dropbox.com/s/6b2eq0hymta8sk6/scoreDiff.png?dl=0)

In order to compare the results of our analysis to each state's political leanings, below is a 2016 Presidential election map by county and vote share:
[<img src="https://brilliantmaps.com/wp-content/uploads/2016nationwidecountymapshadedbyvoteshare.png">](https://brilliantmaps.com/wp-content/uploads/2016nationwidecountymapshadedbyvoteshare.png)


## Conclusions

Overall, it appears that the majority of states' sentiment values increased after the Presidential election. The states whose values decreased are mostly located in the northwestern region of the United States. These are states whose sentiment values were very high before the election. Among the states whose sentiment values increased, most increased by a small amount.

West Virginia has the most positive change in sentiment value while Montana had the most negative change. Both of these states voted largely Republican in the election. In general, there does not appear to be any relationship with political inclination.

### Things to Consider

A little over 1 percent of Twitter users have geolocation turned on which enables us to see the exact location from which they send their tweets. Because of this, we were unable to obtain a representative sample of users from each state. Additionally, only 24 percent of online adults use Twitter. Twitter users are generally younger and only 13 percent of people aged 50 to 65 use Twitter. This also affects the results as many people who use Twitter are not of voting age.

Another important fact to consider is that sentiment analysis is difficult when evaluating tweets with sarcasm and other paralinguistic features.

In [24]:
scoreDiff = sentAfter['score'] - sentBefore['score']
diff = pd.DataFrame({'state': states, 'scoreDiff': scoreDiff})
diff

Unnamed: 0,scoreDiff,state
0,0.6,alaska
1,0.209792,alabama
2,0.053571,arkansas
3,0.031643,arizona
4,0.031709,california
5,0.052198,colorado
6,-0.052941,connecticut
7,-0.004111,delaware
8,0.08209,florida
9,-0.001373,georgia


In [30]:
# state whose sentiment value increased the most
diff.loc[diff['scoreDiff'].idxmax()]

scoreDiff         0.623529
state        west virginia
Name: 48, dtype: object

In [31]:
# state whose sentiment value decreased the most
diff.loc[diff['scoreDiff'].idxmin()]

scoreDiff         -1
state        montana
Name: 25, dtype: object