# Introduction

Trump vs Biden 2020 Presidential Campaign Twitter Sentiment
One of the hottest topics in 2020 was the presidential election between Donald J. Trump and former Vice President Joe Biden. This election was full of firsts: it was the first election held during a worldwide pandemic, the first election to have 3 states whose margin of victory was under 1%, and the first incumbent president not to concede. Because of the election's uniqueness and the sharply contrasting personalities of the candidates, I decided to analyze the sentiment of each candidate’s tweets to find out how much their social media attitude affected their Twitter impressions and their general approval.

In the analysis, we mapped each candidate's sentiment as a time series and measured, through retweets and likes, how popular their most and least liked tweets were. We wanted to see if there was a correlation between negative or positive sentiment and popularity for each candidate.

We got the motivation for this topic from an article published by Cambridge University Press titled “Differences in negativity bias underlie variations in political ideology”. The article discusses how negative thoughts gain more attention and popularity than positive thoughts do, and how they stay in our memory for a longer period. This links to negativity bias in politics, which gave us the idea of applying sentiment analysis to political tweets. The 2020 presidential election was a natural setting for an analysis of negative political bias.

# Import Resources
## Read Libraries

In [109]:
import tweepy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from textblob import TextBlob
import datetime
import re
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lalin90\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lalin90\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Read Data

### Web Scraping

We will get the most recent Trump and Biden tweets from the Twitter API, covering August 2020 through November 2, 2020.

The script below gets up to 1,800 of the most recent tweets from a Twitter account (9 pages of 200 tweets) and stores them as a `pkl` file. We only run it occasionally because Twitter limits how many API requests we can make.

In [110]:
# Don't run these on every pass; the results are stored in the `pkl` files.
# Credentials are redacted here; supply your own keys.
# AccessToken = 'YOUR_ACCESS_TOKEN'
# AccessTokenSecret = 'YOUR_ACCESS_TOKEN_SECRET'
# API_key = 'YOUR_API_KEY'
# API_key_secret = 'YOUR_API_KEY_SECRET'
# auth = tweepy.OAuthHandler(API_key, API_key_secret)
# auth.set_access_token(AccessToken, AccessTokenSecret)
# api = tweepy.API(auth)
# user = api.me()
# print(user.name)
# tweets = []
# for page in range(1, 10):
#     tweets.extend(api.user_timeline(screen_name="realDonaldTrump", count=200, page=page))
#
# print("Number of tweets extracted: {}.".format(len(tweets)))

In [111]:
#with open('trump_tweets.pkl', 'wb') as f:
#    pickle.dump(tweets, f)

The `pkl` files will be named `trump_tweets.pkl` and `biden_tweets.pkl` accordingly.
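
To illustrate the caching idea described above, here is a minimal sketch that only calls a fetch function when no `pkl` file exists yet. The names `load_or_fetch` and `fake_fetch` are hypothetical helpers invented for this example, not part of the original notebook or of tweepy.

```python
import os
import pickle
import tempfile

def load_or_fetch(path, fetch):
    """Return cached data from `path`; call `fetch()` only if no cache exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = fetch()  # the only place an API request would be made
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data

# Hypothetical stand-in for the real API call; it records each invocation.
calls = []
def fake_fetch():
    calls.append(1)
    return ["tweet one", "tweet two"]

cache_path = os.path.join(tempfile.mkdtemp(), "demo_tweets.pkl")
first = load_or_fetch(cache_path, fake_fetch)   # fetches and writes the cache
second = load_or_fetch(cache_path, fake_fetch)  # served from the pickle file
print(len(calls))
```

With a wrapper like this, rerunning the notebook would not spend additional API requests unless the cache file is deleted.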

### Import CSV files

We will import the historical Twitter data for Trump and Biden from the following Kaggle datasets (up to 8/30/2020):
- https://www.kaggle.com/rohanrao/joe-biden-tweets
- https://www.kaggle.com/codebreaker619/donald-trump-tweets-dataset

We renamed the CSV files to `TrumpOld.csv` and `BidenOld.csv` respectively.


In [112]:
trumpOld = pd.read_csv("./Data/TrumpOld.csv")
bidenOld = pd.read_csv("./Data/BidenOld.csv")

# Data Wrangling

## Exploratory Analysis

First, we'll print out 5 tweets to evaluate their format.

In [113]:
# Use a context manager so the file handle is closed after loading
with open('./Data/trump_tweets.pkl','rb') as file:
    ttweets = pickle.load(file)
for tweet in ttweets[:5]:
    print(tweet.text)

These states in question should immediately be put in the Trump Win column. Biden did not win, he lost by a lot!… https://t.co/w7y0zDaYdL
Big legal win in Pennsylvania!
RT @realDonaldTrump: STOP THE COUNT!
Jobless Claims Dip to 751,000, Lowest Since March https://t.co/dzuJpS78nb via @BreitbartNews
Fmr NV AG Laxalt: ‘No Question‘ Trump Would Have Won Nevada ‘Convincingly‘ Without Mail-in Voting https://t.co/pm4Wpfr6x0 via @BreitbartNews


In [114]:
# Use a context manager so the file handle is closed after loading
with open('./Data/biden_tweets.pkl','rb') as file:
    btweets = pickle.load(file)
for tweet in btweets[:5]:
    print(tweet.text)

I extend my deep condolences to the loved ones of the peacekeepers, including 6 American service members, who died… https://t.co/h5ZF41fR9C
RT @Transition46: President-elect Biden spoke this morning with His Holiness Pope Francis. https://t.co/om635SC3M9 https://t.co/DYuiiphOE0
Because of the Affordable Care Act:

- People with pre-existing conditions are protected
- More than 20 million Ame… https://t.co/GglK0KyJe7
Ron Klain’s deep, varied experience and capacity to work with people all across the political spectrum is precisely… https://t.co/KOx0BvNlae
This Veterans Day, I feel the full weight of the honor and the responsibility that has been entrusted to me by the… https://t.co/VjsNzut0R3


#### Conclusion
Based on the first 5 tweets from each candidate, the text contains links, retweet identifiers (RT), and punctuation that we'll have to clean before applying NLP or sentiment analysis.

We'll also examine the "keys" for each tweet, which are the variables associated with the tweet object. 

We can use Twitter's official developer documentation to examine what each variable means:<br>
https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet

In [115]:
ttweets[0].__dict__.keys()

dict_keys(['_json', 'created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'source', 'source_url', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'author', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'quoted_status_id', 'quoted_status_id_str', 'quoted_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'lang'])

## Make DataFrames

### Filter out retweets from Twitter API data

According to the documentation, tweet objects returned by Twitter's API carry a `retweeted_status` attribute only when the tweet is a retweet. We will use its presence to remove retweets from the `pkl` data.

In [116]:
town_tweets = [tweet for tweet in ttweets if not hasattr(tweet, 'retweeted_status')]
town_tweets[0].text

'These states in question should immediately be put in the Trump Win column. Biden did not win, he lost by a lot!… https://t.co/w7y0zDaYdL'

In [117]:
bown_tweets = [tweet for tweet in btweets if not hasattr(tweet, 'retweeted_status')]
bown_tweets[0].text

'I extend my deep condolences to the loved ones of the peacekeepers, including 6 American service members, who died… https://t.co/h5ZF41fR9C'
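
The `hasattr` filter used above can be tried without any API access. The sketch below uses `SimpleNamespace` objects as hypothetical stand-ins for tweepy status objects; the sample texts are made up for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for tweet objects: only the retweet carries
# a `retweeted_status` attribute.
statuses = [
    SimpleNamespace(text="An original tweet"),
    SimpleNamespace(text="RT @someone: hello",
                    retweeted_status=SimpleNamespace(text="hello")),
]

# Keep only tweets that do not have the attribute at all.
own = [s for s in statuses if not hasattr(s, "retweeted_status")]
own_texts = [s.text for s in own]
print(own_texts)
```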

### Make DataFrames from `pkl` files

We'll make dataframes for Trump and Biden from the `pkl` files.

In [118]:
dfTrump = pd.DataFrame(data = [[tweet.id, tweet.created_at, tweet.text, tweet.favorite_count, tweet.retweet_count] for tweet in town_tweets],
                 columns= ['id', 'date', 'tweet', 'likes', 'retweets'])
dfTrump = dfTrump[dfTrump.date <= pd.to_datetime('2020-11-02',  infer_datetime_format=True)]
print(dfTrump.shape)

(1305, 5)


In [119]:
dfBiden = pd.DataFrame(data = [[tweet.id, tweet.created_at, tweet.text, tweet.favorite_count, tweet.retweet_count] for tweet in bown_tweets],
                 columns= ['id', 'date', 'tweet', 'likes', 'retweets'])
dfBiden = dfBiden[dfBiden.date <= pd.to_datetime('2020-11-02',  infer_datetime_format=True)]
print(dfBiden.shape)

(1492, 5)
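
One detail worth checking in the date filter above: a bare date such as `'2020-11-02'` parses to midnight, so `<=` keeps tweets only up to the first instant of November 2. The sketch below uses made-up timestamps to show the difference; whether the later cutoff is preferable depends on the intended window, so treat it as an assumption rather than a correction.

```python
import pandas as pd

# Toy data: one tweet before, one during, and one after November 2.
demo = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-11-01 23:59:00",
        "2020-11-02 15:00:00",
        "2020-11-03 08:00:00",
    ]),
    "tweet": ["a", "b", "c"],
})

cutoff = pd.Timestamp("2020-11-02")  # midnight at the start of Nov 2

# Same comparison as in the notebook: excludes the afternoon tweet "b".
through_midnight = demo[demo.date <= cutoff]

# Alternative: strictly before the start of the next day includes all of Nov 2.
through_end_of_day = demo[demo.date < cutoff + pd.Timedelta(days=1)]

print(list(through_midnight.tweet), list(through_end_of_day.tweet))
```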


### Merge `pkl` and historical `csv` files

We will also merge the historical tweets data with the recently web scraped tweet data and assure there are no duplicates. 

We will also need to remove retweets, adjust the column order/names, convert the date column to a datetime type, and <b>only include data from 4/25/2019 onwards, as that's the date when Joe Biden announced his presidential campaign.</b>

In [120]:
#Remove retweets using isRetweet column
trumpOld = trumpOld.loc[trumpOld.isRetweet == "f" , ["id", "date", "text", "favorites", "retweets"]].copy()
#prepare columns for joining
trumpOld.columns = ["id", "date", 'tweet', 'likes', 'retweets']
#refactor date column to datetime dtype
trumpOld.date = pd.to_datetime(trumpOld.date, infer_datetime_format=True)
#Only include data on or after April 25, 2019
trumpOld = trumpOld[trumpOld.date >=pd.to_datetime('2019-04-25',  infer_datetime_format=True)]
print(trumpOld.shape)

(7484, 5)


In [121]:
#prepare columns for joining
bidenOld = bidenOld.loc[:, ["id", "timestamp", "tweet", "likes", "retweets"]].copy()
#change column names
bidenOld.columns = ["id", "date", 'tweet', 'likes', 'retweets']
#refactor date column to datetime dtype
bidenOld.date = pd.to_datetime(bidenOld.date, infer_datetime_format=True)
#Only include data on or after April 25, 2019
bidenOld = bidenOld[bidenOld.date >=pd.to_datetime('2019-04-25',  infer_datetime_format=True)]
print(bidenOld.shape)

(3264, 5)


Finally we will join the datasets together.

In [122]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
TrumpFinal = pd.concat([trumpOld, dfTrump])
print(TrumpFinal.shape)

(8789, 5)


In [123]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
BidenFinal = pd.concat([bidenOld, dfBiden])
print(BidenFinal.shape)

(4756, 5)


### Drop Duplicate tweets

In [124]:
TrumpFinal = TrumpFinal.drop_duplicates(subset=['tweet']).copy()

In [125]:
BidenFinal = BidenFinal.drop_duplicates(subset=['tweet']).copy()

In [126]:
print(TrumpFinal.shape)
print(BidenFinal.shape)

(8617, 5)
(4755, 5)


Comparing the shapes before and after, Trump had 172 duplicate tweets (8,789 down to 8,617) while Biden had 1.
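
As a small sketch of how the duplicate count can be read off directly rather than by comparing shapes, the example below uses a made-up three-row frame. `duplicated` marks every repeat after the first occurrence, which matches what `drop_duplicates` removes by default.

```python
import pandas as pd

# Toy frame with one repeated tweet text.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "tweet": ["MAGA", "Vote!", "MAGA"],
})

# Count rows that repeat an earlier tweet text.
n_dupes = int(demo.duplicated(subset=["tweet"]).sum())

# Keep the first occurrence of each text, as in the notebook.
deduped = demo.drop_duplicates(subset=["tweet"]).copy()

print(n_dupes, deduped.shape)
```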

## Clean Tweet content

### Remove noise

We use regex to remove words and symbols that carry no meaning in speech, such as:
- mentions of other users with the `@` symbol
- retweet symbol (`RT`)
- hashtags (`#`)
- website links
- any other symbol
- punctuation

In [127]:
def cleaner(txt):
    #remove '@' mentions
    txt=re.sub(r'@[A-Za-z0-9]+','',txt)
    #remove hashtag symbols (#)
    txt=re.sub(r'#','',txt)
    #Remove retweet symbols (RT); \b keeps words that end in "RT" intact
    txt=re.sub(r'\bRT[\s]+','',txt)
    #Remove website links
    txt=re.sub(r'https?:\/\/\S+','',txt)
    #Remove any other symbol
    txt=re.sub(r'[^\w]', ' ', txt)
    #Punctuation
    txt=re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', txt)
    #Remove the "amp" left over from HTML "&amp;", as a whole word only
    txt=re.sub(r'\bamp\b', '', txt)
    return txt

In [128]:
TrumpFinal['tweet'] = TrumpFinal['tweet'].apply(cleaner)
TrumpFinal = TrumpFinal[(TrumpFinal['tweet'] != "") & (TrumpFinal['tweet'] != " ")].copy()
print(TrumpFinal.shape)
BidenFinal['tweet'] = BidenFinal['tweet'].apply(cleaner)
BidenFinal = BidenFinal[(BidenFinal['tweet'] != "") & (BidenFinal['tweet'] != " ")].copy()
print(BidenFinal.shape)

(7732, 5)
(4749, 5)
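
With regex-based cleaning, it is worth checking how a pattern behaves inside longer words. The sketch below is independent of the notebook's data and compares unanchored patterns with word-boundary (`\b`) versions on two made-up strings.

```python
import re

# "amp" appears both as a leftover token and inside the word "campaign".
text_amp = "our campaign amp more"
unanchored_amp = re.sub(r'amp', '', text_amp)    # also hits "campaign"
bounded_amp = re.sub(r'\bamp\b', '', text_amp)   # whole-word matches only

# "RT " appears inside "SUPPORT " in an all-caps sentence.
text_rt = "SUPPORT THE TROOPS"
unanchored_rt = re.sub(r'RT[\s]+', '', text_rt)  # cuts into "SUPPORT"
bounded_rt = re.sub(r'\bRT[\s]+', '', text_rt)   # leaves the sentence alone

print(unanchored_amp, "|", bounded_amp)
print(unanchored_rt, "|", bounded_rt)
```

Damaged words such as "caign" would not be found in a sentiment lexicon, so anchoring patterns with `\b` is a cheap safeguard.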


### Remove small tweets

Many of Trump's tweets are too short and vague to be included in the NLP analysis. A large portion of them thank people using hashtags, whose `#` symbol has already been removed. Because of tweets like these, we will remove tweets that are 5 or fewer words long.

In [129]:
TrumpFinal[TrumpFinal.tweet.str.contains("THANK YOU")].tail()

Unnamed: 0,id,date,tweet,likes,retweets
1185,1.303835e+18,2020-09-09 23:18:46,These are my real words about our GREAT HEROES...,90033,33326
1320,1.301956e+18,2020-09-04 18:49:55,THANK YOU MAGA,113728,24547
1321,1.301937e+18,2020-09-04 17:37:20,A GREAT HONOR THANK YOU,86395,23003
1335,1.301694e+18,2020-09-04 01:31:46,THANK YOU LATROBE PENNSYLVANIA MAGA,70475,15963
1359,1.301249e+18,2020-09-02 20:03:06,THANK YOU NORTH CAROLINA,83046,19894


In [130]:
TrumpFinal = TrumpFinal[TrumpFinal.tweet.str.split().str.len().gt(5)].copy()
BidenFinal = BidenFinal[BidenFinal.tweet.str.split().str.len().gt(5)].copy()
print(TrumpFinal.shape)
print(BidenFinal.shape)

(6695, 5)
(4617, 5)
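
The word-count filter above can be checked on a few strings. This is a minimal sketch with made-up examples (the first two echo the short tweets shown earlier); note that `.gt(5)` is strict, so a tweet of exactly 5 words is dropped.

```python
import pandas as pd

tweets = pd.Series([
    "THANK YOU MAGA",                                    # 3 words
    "A GREAT HONOR THANK YOU",                           # 5 words
    "We will rebuild the economy for working families",  # 8 words
])

# Split on whitespace and count the resulting tokens per tweet.
word_counts = tweets.str.split().str.len()

# Keep only tweets with more than 5 words.
kept = tweets[word_counts.gt(5)]

print(list(word_counts), list(kept))
```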


### Remove Outliers

In [131]:
def removeOutliers(df):
    mean = np.mean(df.likes)
    sd = np.std(df.likes)
    df = df[df.likes > (mean - 2 * sd)].copy()
    df = df[df.likes < (mean + 2 * sd)].copy()
    return df

In [132]:
TrumpFinal = removeOutliers(TrumpFinal)
BidenFinal = removeOutliers(BidenFinal)
print(TrumpFinal.shape)
print(BidenFinal.shape)

(6444, 5)
(4471, 5)
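
The outlier rule above keeps tweets whose likes fall within two standard deviations of the mean. The sketch below applies the same rule to made-up like counts with one extreme value; the function name `remove_outliers_demo` is hypothetical. Note that `np.std` computes the population standard deviation by default (`ddof=0`), whereas pandas' `Series.std` uses `ddof=1`.

```python
import numpy as np
import pandas as pd

def remove_outliers_demo(df, k=2):
    """Keep rows whose `likes` lie strictly within k standard deviations of the mean."""
    mean = np.mean(df.likes)
    sd = np.std(df.likes)  # population standard deviation (ddof=0)
    within = (df.likes > mean - k * sd) & (df.likes < mean + k * sd)
    return df[within].copy()

# Toy data: nine ordinary tweets and one viral one.
demo = pd.DataFrame({"likes": [10, 12, 11, 13, 12, 11, 10, 12, 13, 1000]})
cleaned = remove_outliers_demo(demo)

print(demo.shape, cleaned.shape)
```

Because like counts are heavily right-skewed, a rule like this mostly trims the most viral tweets; the lower bound is often negative and removes nothing.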


## Adding Sentiment and Subjectivity

We will use the TextBlob library to assign sentiment and subjectivity to tweets.

#### Test TextBlob

In [133]:
a = TextBlob("Wear your mask")
b = TextBlob("Have a good day")
c = TextBlob("I hate you")
print("Sentence: " + str(a))
print("Sentiment: " + str(a.sentiment.polarity))
print("Sentence: " + str(b))
print("Sentiment: " + str(b.sentiment.polarity))
print("Sentence: " + str(c))
print("Sentiment: " + str(c.sentiment.polarity))

Sentence: Wear your mask
Sentiment: 0.0
Sentence: Have a good day
Sentiment: 0.7
Sentence: I hate you
Sentiment: -0.8


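Sentiment (polarity) is how positive or negative a sentence is, ranging from -1 (most negative) to 1 (most positive).

Subjectivity is how much of an opinion rather than a fact a sentence is (0 is objective fact, 1 is subjective opinion).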

### Add sentiment and subjectivity to each tweet

In [134]:
TrumpFinal['sentiment'] = TrumpFinal['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
TrumpFinal['subjectivity'] = TrumpFinal['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.subjectivity)
print(TrumpFinal.shape)

(6444, 7)


In [135]:
BidenFinal['sentiment'] = BidenFinal['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.polarity)
BidenFinal['subjectivity'] = BidenFinal['tweet'].apply(lambda tweet: TextBlob(tweet).sentiment.subjectivity)
print(BidenFinal.shape)

(4471, 7)


Add a -1, 0, or 1 label to each tweet (negative, neutral, or positive sentiment), along with a matching text label.

In [136]:
TrumpFinal['label'] = 1
TrumpFinal.loc[TrumpFinal['sentiment'] < 0, ['label']] = -1
TrumpFinal.loc[TrumpFinal['sentiment'] == 0, ['label']] = 0
BidenFinal['label'] = 1
BidenFinal.loc[BidenFinal['sentiment'] < 0, ['label']] = -1
BidenFinal.loc[BidenFinal['sentiment'] == 0, ['label']] = 0

TrumpFinal['label_s'] = "Positive"
TrumpFinal.loc[TrumpFinal['sentiment'] < 0, ['label_s']] = 'Negative'
TrumpFinal.loc[TrumpFinal['sentiment'] == 0, ['label_s']] = 'Neutral'
BidenFinal['label_s'] = "Positive"
BidenFinal.loc[BidenFinal['sentiment'] < 0, ['label_s']] = 'Negative'
BidenFinal.loc[BidenFinal['sentiment'] == 0, ['label_s']] = 'Neutral'
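
The labeling above can also be written more compactly. As a sketch of one alternative (not the notebook's method), `np.sign` maps each polarity score to -1, 0, or 1 in a single step, and a dictionary lookup produces the text label. The sample scores are the three TextBlob outputs shown earlier.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"sentiment": [0.7, 0.0, -0.8]})

# Sign of the polarity: 1 for positive, 0 for exactly neutral, -1 for negative.
demo["label"] = np.sign(demo["sentiment"]).astype(int)

# Map the numeric label to a readable string.
demo["label_s"] = demo["label"].map({1: "Positive", 0: "Neutral", -1: "Negative"})

print(demo)
```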

In [137]:
#TrumpFinal.to_csv("TrumpFinal.csv", index = False)

## Basic Visualization

First, I'll take a look at the average likes per week for each candidate.

In [138]:
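# numeric_only=True skips the text columns, which recent pandas versions refuse to average
trumpLikes = TrumpFinal.groupby([pd.Grouper(key='date', freq='W-MON')]).mean(numeric_only=True).reset_index()
bidenLikes = BidenFinal.groupby([pd.Grouper(key='date', freq='W-MON')]).mean(numeric_only=True).reset_index()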
fig = go.Figure()
fig.add_trace(go.Scatter(x=bidenLikes.date, y=bidenLikes.likes,
                    mode='lines',
                    name='Biden'))
fig.add_trace(go.Scatter(x=trumpLikes.date, y=trumpLikes.likes,
                    mode='lines',
                    name='Trump'))
fig.update_layout(title='Average Weekly Tweet Likes',
                   xaxis_title='Date (by week)',
                   yaxis_title='Likes')
fig.show()

As we can see, Trump has far more likes for most of the period, plausibly because he was already a much more prominent and polarizing figure than Biden. Around <b>June 2020</b>, Biden gets a huge jump in popularity that puts him on par with Trump in terms of Twitter likes.

One possible explanation is that Trump had 4 years in office to become famous, or infamous, while Biden needed a year of campaigning to bring his popularity on par with Trump's. Moving forward, we'll only consider tweets from June 2020 onwards.
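
For reference, here is a minimal sketch of how the weekly grouping with `freq='W-MON'` bins dates, using made-up like counts. Each bin ends on a Monday and is labeled with that Monday, so a tweet posted on a Tuesday is counted under the following Monday.

```python
import pandas as pd

demo = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-06-01",  # Monday
        "2020-06-02",  # Tuesday
        "2020-06-08",  # Monday
        "2020-06-09",  # Tuesday
    ]),
    "likes": [10, 20, 40, 50],
})

# Weekly bins ending on (and labeled by) Monday; average likes per bin.
weekly = demo.groupby(pd.Grouper(key="date", freq="W-MON"))["likes"].mean()

print(weekly)
```

Selecting the `likes` column before calling `mean` also avoids averaging over non-numeric columns.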

In [139]:
# Daily averages from June 2020 onwards; numeric_only=True skips the text columns
trumpSent = TrumpFinal[TrumpFinal.date > pd.to_datetime('2020-06-01',  infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean(numeric_only=True).reset_index()
bidenSent = BidenFinal[BidenFinal.date > pd.to_datetime('2020-06-01',  infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean(numeric_only=True).reset_index()
fig = go.Figure()
fig.add_trace(go.Scatter(x=bidenSent.date, y=bidenSent.label,
                    mode='lines',
                    name='Biden'))
# Plot Trump's sentiment against Trump's own dates
fig.add_trace(go.Scatter(x=trumpSent.date, y=trumpSent.label,
                    mode='lines',
                    name='Trump'))
fig.add_shape(type='line',
                x0=min(trumpSent.date),
                y0=0,
                x1=max(trumpSent.date),
                y1=0,
                line=dict(color='#00CC96',),
                xref='x',
                yref='y'
)
fig.update_layout(title='Daily Sentiment Score',
                   xaxis_title='Date',
                   yaxis_title='Sentiment')
fig.show()

As we can see from the chart above, both candidates' tweets are similar in their sentiment between June 2020 and November 2, 2020.

However, Trump is noticeably more extreme in his positive and negative sentiment overall while Biden is slightly more neutral.

In [153]:
# Daily averages for Biden from June 2020 onwards; numeric_only=True skips the text columns
bidenLikes = BidenFinal[BidenFinal.date > pd.to_datetime('2020-06-01',  infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean(numeric_only=True).reset_index()
bidenSent = bidenLikes

# Daily likes relative to the median day (days without tweets are NaN and are ignored)
bidenLikes['vars'] = bidenLikes.likes.values - np.median(bidenLikes.likes.dropna())
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(go.Scatter(x=bidenSent.date, y=bidenSent.label,
                    mode='lines',
                    name='Biden sentiment'),
             secondary_y=True)

fig.add_trace(
        go.Bar(
            x=bidenLikes.date,
            y=bidenLikes.vars,
            name='Biden likes vs. median'),
        secondary_y=False)

fig.update_layout(title='Biden: Daily Tweet Likes and Sentiment',
                   xaxis_title='Date (by day)')
fig.update_yaxes(title_text="Likes (difference from median)", secondary_y=False)
fig.update_yaxes(title_text="Average sentiment label", secondary_y=True)
fig.show()

In [152]:
# Daily averages for Trump from June 2020 onwards; numeric_only=True skips the text columns
trumpLikes = TrumpFinal[TrumpFinal.date > pd.to_datetime('2020-06-01',  infer_datetime_format=True)].groupby([pd.Grouper(key='date', freq='D')]).mean(numeric_only=True).reset_index()
trumpSent = trumpLikes

# Daily likes relative to the median day (days without tweets are NaN and are ignored)
trumpLikes['vars'] = trumpLikes.likes.values - np.median(trumpLikes.likes.dropna())
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(go.Scatter(x=trumpSent.date, y=trumpSent.label,
                    mode='lines',
                    name='Trump sentiment'),
             secondary_y=True)

fig.add_trace(
        go.Bar(
            x=trumpLikes.date,
            y=trumpLikes.vars,
            name='Trump likes vs. median'),
        secondary_y=False)

fig.update_layout(title='Trump: Daily Tweet Likes and Sentiment',
                   xaxis_title='Date (by day)')
fig.update_yaxes(title_text="Likes (difference from median)", secondary_y=False)
fig.update_yaxes(title_text="Average sentiment label", secondary_y=True)
fig.show()

In [100]:
fig = px.scatter(TrumpFinal, x='sentiment', y='subjectivity', color = 'label_s')
fig.show()

From this plot, we can see that the more subjective a tweet is, the stronger (more extreme, in either direction) its sentiment tends to be.
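
One way to put a number on this visual impression is to correlate subjectivity with the absolute value of sentiment, since strong sentiment can be positive or negative. The sketch below uses made-up scores chosen so the pattern is exact; real tweets would give a weaker relationship, and Pearson correlation is only one possible measure.

```python
import pandas as pd

# Toy scores: sentiment strength grows with subjectivity in both directions.
demo = pd.DataFrame({
    "sentiment":    [0.0, -0.5, 0.5, -1.0, 1.0],
    "subjectivity": [0.0,  0.5, 0.5,  1.0, 1.0],
})

# Correlating signed sentiment hides the pattern: positives and negatives cancel.
r_signed = demo["sentiment"].corr(demo["subjectivity"])

# Correlating sentiment strength (absolute value) reveals it.
r_strength = demo["sentiment"].abs().corr(demo["subjectivity"])

print(round(r_signed, 3), round(r_strength, 3))
```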

# NLP