# Trump's Twitter History
Readers will already be familiar with the political career of Donald Trump, and will no doubt understand how much influence his Twitter account, along with his television interviews and meetings with the press, had in communicating his political message. His tweets were often headline news in publications around the globe, and they frequently captured the attention of the world.
  
Why take the time to perform an overview analysis of Trump's Twitter career? Most influential political leaders of the past are nearly completely lost to history. How many speeches of Roman emperors do you know? How many of their speeches even remain?  
This dataset is therefore, in a way, a historical artifact. We, as data scientists and data analysts, have a chance to observe and study this political artifact before it is lost to the memory of humankind. In addition, analysing this dataset allows us to imagine how political leaders of the future will use their growing technological powers to govern the populace.  
  
Having said that, this series of notebooks serves primarily as an investigation of modern natural language processing and data analysis techniques, both experimental and fundamental, to see what insights we can extract from a Twitter account alone.   

## Table of Contents
Section 1: Data Retrieval and Package Download  
Section 2.1: Extracting Quote Tweets  
Section 2.2: Extracting Hashtags and Mentions  
Section 2.3: Extracting Datetime Information  
Section 2.4: Sentiment Analysis  

In [1]:
# Import the relevant packages
import pandas as pd
import numpy as np
import os
import re
import math
import statistics as sts
import datetime as dt
import time
import seaborn as sns
from matplotlib import pyplot as plt
import spacy
import random as rd
from textblob import TextBlob
from wordcloud import WordCloud

In [2]:
tweets = pd.read_csv('tweets.csv')
tweets.head(5)

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
0,98454970654916608,Republicans and Democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02 18:07:48,f
1,1234653427789070336,I was thrilled to be back in the Great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03 01:34:50,f
2,1218010753434820614,RT @CBS_Herridge: READ: Letter to surveillance...,t,f,Twitter for iPhone,0,7396,2020-01-17 03:22:47,f
3,1304875170860015617,The Unsolicited Mail In Ballot Scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12 20:10:58,f
4,1218159531554897920,RT @MZHemingway: Very friendly telling of even...,t,f,Twitter for iPhone,0,9081,2020-01-17 13:13:59,f


In [3]:
tweets.shape

(56571, 9)

In [4]:
tweets.isna().sum()

id           0
text         0
isRetweet    0
isDeleted    0
device       0
favorites    0
retweets     0
date         0
isFlagged    0
dtype: int64

### Section 2.1: Extracting Quote Tweets
### Types of Tweets
The first thing to note when examining the tweets is that there appear to be three types of tweets.  
1) Regular Trump tweets.  
2) Tweets that have been retweeted by Trump. These tweets are distinguished in two ways: they all begin with RT in the text, and isRetweet is valued at 't'.   
3) Quote tweets; similar to a retweet, but not flagged by isRetweet. These tweets can be detected by the `"""@user` that begins the text string. To handle them, we will extract the username, the quoted message, and any additional message added by Trump. 
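
The three categories above can be sketched as a small classifier. This is a minimal illustration and not part of the original notebook: `tweet_type` is a hypothetical helper, and the sample strings are shortened versions of tweets shown in this section.

```python
def tweet_type(text: str, is_retweet: str) -> str:
    """Classify a tweet as 'retweet', 'quote', or 'regular'."""
    # Retweets are flagged in the isRetweet column and their text starts with "RT".
    if is_retweet == 't' or text.startswith('RT @'):
        return 'retweet'
    # Quote tweets are not flagged, but open with three double quotes and a handle.
    if text.startswith('"""@'):
        return 'quote'
    return 'regular'


# Shortened sample texts paired with their isRetweet flag.
examples = [
    ('RT @GOP: Operation Warp Speed is unequaled', 't'),
    ('"""@CpacDean: @realDonaldTrump you\'ve raised children"""', 'f'),
    ('Just signed an order to support the workers', 'f'),
]
labels = [tweet_type(text, flag) for text, flag in examples]
print(labels)  # ['retweet', 'quote', 'regular']
```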

In [5]:
# 1) An example of a standard tweet.
print(tweets.loc[54])
print(tweets.loc[(54,'text')])

id                                         1346120645613150208
text         The “Surrender Caucus” within the Republican P...
isRetweet                                                    f
isDeleted                                                    f
device                                      Twitter for iPhone
favorites                                               235516
retweets                                                 60242
date                                       2021-01-04 15:45:46
isFlagged                                                    f
Name: 54, dtype: object
The “Surrender Caucus” within the Republican Party will go down in infamy as weak and ineffective “guardians” of our Nation, who were willing to accept the certification of fraudulent presidential numbers!


In [6]:
# 2) An example of a retweet.
print(tweets.loc[942])
print(tweets.loc[(942,'text')])

id                                         1327471652268101633
text         RT @GOP: “Operation Warp Speed is unequaled an...
isRetweet                                                    t
isDeleted                                                    f
device                                      Twitter for iPhone
favorites                                                    0
retweets                                                  9926
date                                       2020-11-14 04:41:19
isFlagged                                                    f
Name: 942, dtype: object
RT @GOP: “Operation Warp Speed is unequaled and unrivaled anywhere in the world.”—@realDonaldTrump https://t.co/LHOGKznr0R


In [7]:
# 3) An example of a quote tweet.
print(tweets.loc[23186]) 
print(tweets.loc[(23186,'text')])

id                                          321909640742961152
text         """@CpacDean: @realDonaldTrump you've raised c...
isRetweet                                                    f
isDeleted                                                    f
device                                     Twitter for Android
favorites                                                   30
retweets                                                    38
date                                       2013-04-10 08:56:53
isFlagged                                                    f
Name: 23186, dtype: object
"""@CpacDean: @realDonaldTrump you've raised children that young people can look up too. thats probably your greatest achievement in my mind"""


Our next goal is to separate out the information encoded in quote tweets.  
This section covers some of the nitty-gritty found in real-world data problems, because the formatting used is inconsistent throughout the dataset.  
  
Here, pattern1 identifies quote tweets whose tweet['text'] has the form: """@user: User_message"" Trump_message".

pattern2 identifies tweets whose text has the form: """@user User_message".  
  
The code extracts and records each of the above sections, and also records an indicator of whether the tweet is of type pattern1 ('t1'), pattern2 ('t2'), or not a quote tweet ('f').

In [8]:
# Extract info from quote tweets.
isQuoteTweet = []
quotedFrom = []
quoteText = []
trumpResp = []

pattern1 = re.compile(r'"""(@.+):(.+"")(.*)')
pattern2 = re.compile(r'"""(@.+) (.+")')

for text in tweets['text']:
    # search() returns at most one match, so every tweet contributes
    # exactly one entry to each list.
    match1 = pattern1.search(text)
    match2 = pattern2.search(text) if match1 is None else None
    if match1:
        isQuoteTweet.append('t1')
        quotedFrom.append(match1.group(1))
        quoteText.append(match1.group(2))
        trumpResp.append(match1.group(3))
    elif match2:
        isQuoteTweet.append('t2')
        quotedFrom.append(match2.group(1))
        quoteText.append(match2.group(2))
        trumpResp.append('n/a')
    else:
        # Not a quote tweet.
        isQuoteTweet.append('f')
        quotedFrom.append('n/a')
        quoteText.append('n/a')
        trumpResp.append('n/a')

tweets['isQuoteTweet'] = isQuoteTweet
tweets['quotedFrom'] = quotedFrom
tweets['quoteText'] = quoteText
tweets['trumpResp'] = trumpResp
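
One caveat with the patterns above: `.+` is greedy, so the username group can absorb part of the message when the text contains more than one space or colon. The sketch below shows that effect on an invented sample string and offers a stricter alternative. `strict1`, `strict2`, and `parse_quote_tweet` are hypothetical names, and restricting handles to `\w+` is an assumption about how usernames look, not something taken from the dataset.

```python
import re

# pattern2 as defined in the cell above.
pattern2 = re.compile(r'"""(@.+) (.+")')
print(pattern2.search('"""@fan Great speech!"').group(1))  # '@fan Great'

# Assumed stricter variants: the handle is limited to word characters,
# so it cannot swallow part of the quoted message.
strict1 = re.compile(r'"""(@\w+):\s*(.+?)""(.*)')
strict2 = re.compile(r'"""(@\w+)\s+(.+)"')


def parse_quote_tweet(text):
    """Return (type, user, quoted text, Trump's reply) for one tweet."""
    m = strict1.match(text)
    if m:
        user, quote, reply = m.groups()
        # Strip leftover quote marks and spaces around the reply.
        return 't1', user, quote.strip(), reply.strip(' "')
    m = strict2.match(text)
    if m:
        user, quote = m.groups()
        return 't2', user, quote.strip(' "'), 'n/a'
    return 'f', 'n/a', 'n/a', 'n/a'


with_reply = parse_quote_tweet('"""@fan: Great speech!"" Thank you!"')
no_colon = parse_quote_tweet('"""@fan Great speech!"')
regular = parse_quote_tweet('Just signed an order')
print(with_reply)
print(no_colon)
print(regular)
```

Whether the stricter patterns classify every tweet in the dataset the same way as the originals has not been checked here, so the counts per type should be compared before swapping them in.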

In [9]:
tweets.loc[23186]

id                                             321909640742961152
text            """@CpacDean: @realDonaldTrump you've raised c...
isRetweet                                                       f
isDeleted                                                       f
device                                        Twitter for Android
favorites                                                      30
retweets                                                       38
date                                          2013-04-10 08:56:53
isFlagged                                                       f
isQuoteTweet                                                   t1
quotedFrom                                              @CpacDean
quoteText        @realDonaldTrump you've raised children that ...
trumpResp                                                        
Name: 23186, dtype: object

In [10]:
tweets.loc[942]

id                                            1327471652268101633
text            RT @GOP: “Operation Warp Speed is unequaled an...
isRetweet                                                       t
isDeleted                                                       f
device                                         Twitter for iPhone
favorites                                                       0
retweets                                                     9926
date                                          2020-11-14 04:41:19
isFlagged                                                       f
isQuoteTweet                                                    f
quotedFrom                                                    n/a
quoteText                                                     n/a
trumpResp                                                     n/a
Name: 942, dtype: object

### Section 2.2: Hashtags and Mentions  
A quick and easy way to extract all hashtags and mentions from a tweet is to use Python's regular expressions package (re).  
https://stackoverflow.com/questions/45874879/extract-hashtags-from-columns-of-a-pandas-dataframe

In [11]:
# Extract Hashtags and mentions from tweets
tweets['mentions'] = tweets['text'].str.findall(r'(?:(?<=\s)|(?<=^))@.*?(?=\s|$)')
tweets['hashtags'] = tweets['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
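
The patterns above split on whitespace, so trailing punctuation stays attached (for example `@CBS_Herridge:` keeps its colon). A possible alternative, shown here with the standard `re` module on an invented sample string, limits each match to word characters. The `MENTION` and `HASHTAG` names are hypothetical, and this variant also matches handles that follow a quote mark, which the original patterns do not.

```python
import re

# Assumed alternative patterns: '@' or '#' followed by word characters,
# and not preceded by a word character (which rules out e-mail addresses).
MENTION = re.compile(r'(?<!\w)@\w+')
HASHTAG = re.compile(r'(?<!\w)#\w+')

sample = 'RT @RandPaul: I do not know why @JoeBiden skipped #Debates2020.'
found_mentions = MENTION.findall(sample)
found_hashtags = HASHTAG.findall(sample)
print(found_mentions)  # ['@RandPaul', '@JoeBiden']
print(found_hashtags)  # ['#Debates2020']
```

The same pattern strings can be passed to `tweets['text'].str.findall(...)` in place of the ones used above.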

In [12]:
tweets

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged,isQuoteTweet,quotedFrom,quoteText,trumpResp,mentions,hashtags
0,98454970654916608,Republicans and Democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02 18:07:48,f,f,,,,[],[]
1,1234653427789070336,I was thrilled to be back in the Great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03 01:34:50,f,f,,,,[],[#KAG2020]
2,1218010753434820614,RT @CBS_Herridge: READ: Letter to surveillance...,t,f,Twitter for iPhone,0,7396,2020-01-17 03:22:47,f,f,,,,[@CBS_Herridge:],[]
3,1304875170860015617,The Unsolicited Mail In Ballot Scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12 20:10:58,f,f,,,,[],[]
4,1218159531554897920,RT @MZHemingway: Very friendly telling of even...,t,f,Twitter for iPhone,0,9081,2020-01-17 13:13:59,f,f,,,,[@MZHemingway:],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56566,1319485303363571714,RT @RandPaul: I don’t know why @JoeBiden think...,t,f,Twitter for iPhone,0,20683,2020-10-23 03:46:25,f,f,,,,"[@RandPaul:, @JoeBiden]",[]
56567,1319484210101379072,RT @EliseStefanik: President @realDonaldTrump ...,t,f,Twitter for iPhone,0,9869,2020-10-23 03:42:05,f,f,,,,"[@EliseStefanik:, @realDonaldTrump]",[]
56568,1319444420861829121,RT @TeamTrump: LIVE: Presidential Debate #Deba...,t,f,Twitter for iPhone,0,8197,2020-10-23 01:03:58,f,f,,,,[@TeamTrump:],[#Debates2020]
56569,1319384118849949702,Just signed an order to support the workers of...,f,f,Twitter for iPhone,176289,36001,2020-10-22 21:04:21,f,f,,,,[],[]


### Section 2.3: Extracting Datetime Information
We split each timestamp into its year, month, day, hour, and calendar date using the datetime package.

In [13]:
year = []
month = []
day = []
hour = []
date = []
for i in range(len(tweets)):
    datestamp = dt.datetime.strptime(tweets.loc[(i,'date')], '%Y-%m-%d %H:%M:%S')
    year.append(datestamp.date().year)
    month.append(datestamp.date().month)
    day.append(datestamp.date().day)
    hour.append(datestamp.time().hour)
    date.append(datestamp.date())

In [14]:
# Append the date information to our dataframe
tweets['year'] = year
tweets['month'] = month
tweets['day'] = day
tweets['hour'] = hour
tweets['date'] = date
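
The same fields can also be derived without an explicit loop. The sketch below is an optional alternative, assuming the `date` column holds strings in the `%Y-%m-%d %H:%M:%S` format seen in the preview. `demo` is a small stand-in DataFrame, not the real dataset.

```python
import pandas as pd

# Stand-in for the 'date' column, using two timestamps from the preview.
demo = pd.DataFrame({'date': ['2011-08-02 18:07:48', '2020-03-03 01:34:50']})

# Parse the whole column once, then read the parts via the .dt accessor.
stamps = pd.to_datetime(demo['date'], format='%Y-%m-%d %H:%M:%S')
demo['year'] = stamps.dt.year
demo['month'] = stamps.dt.month
demo['day'] = stamps.dt.day
demo['hour'] = stamps.dt.hour
demo['date'] = stamps.dt.date  # plain datetime.date objects, as in the loop

print(demo)
```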

In [15]:
tweets

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged,isQuoteTweet,quotedFrom,quoteText,trumpResp,mentions,hashtags,year,month,day,hour
0,98454970654916608,Republicans and Democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02,f,f,,,,[],[],2011,8,2,18
1,1234653427789070336,I was thrilled to be back in the Great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03,f,f,,,,[],[#KAG2020],2020,3,3,1
2,1218010753434820614,RT @CBS_Herridge: READ: Letter to surveillance...,t,f,Twitter for iPhone,0,7396,2020-01-17,f,f,,,,[@CBS_Herridge:],[],2020,1,17,3
3,1304875170860015617,The Unsolicited Mail In Ballot Scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12,f,f,,,,[],[],2020,9,12,20
4,1218159531554897920,RT @MZHemingway: Very friendly telling of even...,t,f,Twitter for iPhone,0,9081,2020-01-17,f,f,,,,[@MZHemingway:],[],2020,1,17,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56566,1319485303363571714,RT @RandPaul: I don’t know why @JoeBiden think...,t,f,Twitter for iPhone,0,20683,2020-10-23,f,f,,,,"[@RandPaul:, @JoeBiden]",[],2020,10,23,3
56567,1319484210101379072,RT @EliseStefanik: President @realDonaldTrump ...,t,f,Twitter for iPhone,0,9869,2020-10-23,f,f,,,,"[@EliseStefanik:, @realDonaldTrump]",[],2020,10,23,3
56568,1319444420861829121,RT @TeamTrump: LIVE: Presidential Debate #Deba...,t,f,Twitter for iPhone,0,8197,2020-10-23,f,f,,,,[@TeamTrump:],[#Debates2020],2020,10,23,1
56569,1319384118849949702,Just signed an order to support the workers of...,f,f,Twitter for iPhone,176289,36001,2020-10-22,f,f,,,,[],[],2020,10,22,21


### Section 2.4: Sentiment Analysis  
There are a few steps involved in this sentiment analysis:  
1) Clean the text and lemmatize it.  
2) Extract 'Subjectivity' and 'Polarity' scores.  
3) Simplify the polarity into 'Positive', 'Neutral', and 'Negative' buckets. 

In [16]:
# Cleaning features from the tweets
processed_features = []
for sentence in tweets['text']:
    # Remove all http/https URLs (raw string avoids an invalid-escape warning)
    processed_feature = re.sub(r'(https?://\S+)', '', str(sentence))
    
    # Replace every non-word character with a space
    processed_feature = re.sub(r'\W', ' ', processed_feature)
 
    # Convert to lowercase
    processed_feature = processed_feature.lower()
    
    processed_features.append(processed_feature)
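
To see what these steps do to a single tweet, here is a small sketch on invented sample strings. `clean_tweet` is a hypothetical helper that applies the same three steps, plus one extra step that is an assumption and not in the cell above: collapsing the runs of spaces left behind by the substitutions.

```python
import re


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps above, then collapse repeated spaces."""
    cleaned = re.sub(r'https?://\S+', '', text)  # drop URLs
    cleaned = re.sub(r'\W', ' ', cleaned)        # non-word characters -> space
    cleaned = cleaned.lower()
    # Extra step (assumption): squeeze whitespace runs and trim the ends.
    return ' '.join(cleaned.split())


first = clean_tweet('Great rally in Ohio! #KAG2020 https://t.co/abc123')
second = clean_tweet("I don't know")
print(first)   # great rally in ohio kag2020
print(second)  # i don t know
```

The second example shows a side effect worth keeping in mind: apostrophes are treated as special characters, so contractions are split into separate tokens.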

In [17]:
# Create a function to get the subjectivity
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def getPolarity(text):
    return  TextBlob(text).sentiment.polarity


# Create two new columns 'Subjectivity' & 'Polarity'
tweets['subjectivity'] = pd.Series(processed_features).apply(getSubjectivity)
tweets['polarity'] = pd.Series(processed_features).apply(getPolarity)

In [18]:
tweets

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged,isQuoteTweet,...,quoteText,trumpResp,mentions,hashtags,year,month,day,hour,subjectivity,polarity
0,98454970654916608,Republicans and Democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02,f,f,...,,,[],[],2011,8,2,18,0.200000,0.200000
1,1234653427789070336,I was thrilled to be back in the Great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03,f,f,...,,,[],[#KAG2020],2020,3,3,1,0.483333,0.450000
2,1218010753434820614,RT @CBS_Herridge: READ: Letter to surveillance...,t,f,Twitter for iPhone,0,7396,2020-01-17,f,f,...,,,[@CBS_Herridge:],[],2020,1,17,3,0.300000,0.050000
3,1304875170860015617,The Unsolicited Mail In Ballot Scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12,f,f,...,,,[],[],2020,9,12,20,0.454762,0.029464
4,1218159531554897920,RT @MZHemingway: Very friendly telling of even...,t,f,Twitter for iPhone,0,9081,2020-01-17,f,f,...,,,[@MZHemingway:],[],2020,1,17,13,0.500000,0.268750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56566,1319485303363571714,RT @RandPaul: I don’t know why @JoeBiden think...,t,f,Twitter for iPhone,0,20683,2020-10-23,f,f,...,,,"[@RandPaul:, @JoeBiden]",[],2020,10,23,3,0.100000,0.200000
56567,1319484210101379072,RT @EliseStefanik: President @realDonaldTrump ...,t,f,Twitter for iPhone,0,9869,2020-10-23,f,f,...,,,"[@EliseStefanik:, @realDonaldTrump]",[],2020,10,23,3,0.200000,0.050000
56568,1319444420861829121,RT @TeamTrump: LIVE: Presidential Debate #Deba...,t,f,Twitter for iPhone,0,8197,2020-10-23,f,f,...,,,[@TeamTrump:],[#Debates2020],2020,10,23,1,0.500000,0.136364
56569,1319384118849949702,Just signed an order to support the workers of...,f,f,Twitter for iPhone,176289,36001,2020-10-22,f,f,...,,,[],[],2020,10,22,21,0.260317,-0.035714


In [19]:
def getAnalysis(score):
    if score < 0:
        return 'negative'
    elif score == 0:
        return 'neutral'
    else:
        return 'positive'
tweets['analysis'] = tweets['polarity'].apply(getAnalysis)
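
With the rule above, only a polarity of exactly 0 counts as neutral, so a score such as 0.029 is labelled positive. One possible variation is to treat a small band around zero as neutral. The sketch below assumes a band of 0.05, which is an arbitrary illustrative value and would need tuning. `bucket_polarity` is a hypothetical helper.

```python
def bucket_polarity(score: float, band: float = 0.05) -> str:
    """Label a polarity score, treating |score| <= band as neutral."""
    if score < -band:
        return 'negative'
    if score > band:
        return 'positive'
    return 'neutral'


# Three of these scores appear in the preview above; -0.5 is invented.
scores = [0.45, 0.029464, -0.035714, -0.5]
labels = [bucket_polarity(s) for s in scores]
print(labels)  # ['positive', 'neutral', 'neutral', 'negative']
```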

In [20]:
tweets

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged,isQuoteTweet,...,trumpResp,mentions,hashtags,year,month,day,hour,subjectivity,polarity,analysis
0,98454970654916608,Republicans and Democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02,f,f,...,,[],[],2011,8,2,18,0.200000,0.200000,positive
1,1234653427789070336,I was thrilled to be back in the Great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03,f,f,...,,[],[#KAG2020],2020,3,3,1,0.483333,0.450000,positive
2,1218010753434820614,RT @CBS_Herridge: READ: Letter to surveillance...,t,f,Twitter for iPhone,0,7396,2020-01-17,f,f,...,,[@CBS_Herridge:],[],2020,1,17,3,0.300000,0.050000,positive
3,1304875170860015617,The Unsolicited Mail In Ballot Scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12,f,f,...,,[],[],2020,9,12,20,0.454762,0.029464,positive
4,1218159531554897920,RT @MZHemingway: Very friendly telling of even...,t,f,Twitter for iPhone,0,9081,2020-01-17,f,f,...,,[@MZHemingway:],[],2020,1,17,13,0.500000,0.268750,positive
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56566,1319485303363571714,RT @RandPaul: I don’t know why @JoeBiden think...,t,f,Twitter for iPhone,0,20683,2020-10-23,f,f,...,,"[@RandPaul:, @JoeBiden]",[],2020,10,23,3,0.100000,0.200000,positive
56567,1319484210101379072,RT @EliseStefanik: President @realDonaldTrump ...,t,f,Twitter for iPhone,0,9869,2020-10-23,f,f,...,,"[@EliseStefanik:, @realDonaldTrump]",[],2020,10,23,3,0.200000,0.050000,positive
56568,1319444420861829121,RT @TeamTrump: LIVE: Presidential Debate #Deba...,t,f,Twitter for iPhone,0,8197,2020-10-23,f,f,...,,[@TeamTrump:],[#Debates2020],2020,10,23,1,0.500000,0.136364,positive
56569,1319384118849949702,Just signed an order to support the workers of...,f,f,Twitter for iPhone,176289,36001,2020-10-22,f,f,...,,[],[],2020,10,22,21,0.260317,-0.035714,negative


In [21]:
# !python -m spacy download en_core_web_sm

In [22]:
spacy_model = spacy.load('en_core_web_sm')

In [24]:
cleanTweet = []
for tweet in processed_features:
    tweet = spacy_model(tweet)
    tokenTweet = []
    for token in tweet:
        if not token.is_punct and not token.is_stop and not token.like_num and token.lemma_ != '-PRON-':
                tokenTweet.append(token.lemma_)
    cleanTweet.append(tokenTweet)
    
import pickle
file_path = 'cleanTweet.bin'
# Open the file in binary mode
with open(file_path, 'wb') as file:
    pickle.dump(cleanTweet, file)

KeyboardInterrupt: 

In [None]:
# Load the pickle file
with open(file_path, 'rb') as f:
    cleanTweet = pickle.load(f)

In [None]:
tweets['cleanText']=cleanTweet

In [None]:
tweets.head(5)

In [None]:
# Creates a list of all tokens (words)
wordList = []
for tweet in tweets['cleanText']:
    for word in tweet:
        if not word.isspace():
            wordList.append(word)

In [None]:
from collections import Counter
word_counts = Counter([word.lower() for word in wordList])
# Get the most common keywords
most_common_keywords = word_counts.most_common(50)

In [None]:
keyword, count = most_common_keywords[0]
print(f'{keyword}: {count}')

In [None]:
hashtags = {}
for index, row in tweets.iterrows():
    hashtag = row['hashtags']
    if isinstance(hashtag, float):  # skip missing (NaN) entries
        continue
    hashtag_list = hashtag
    for key in hashtag_list:
        if key in hashtags:
            hashtags[key] += 1
        else:
            hashtags[key] = 1

sorted_hashtags = dict(sorted(hashtags.items(), key=lambda x: x[1], reverse=True))

hashtags_items = list(sorted_hashtags.items())
# Slice the list to get the first 50 key-value pairs
first_50_hashtags_items = hashtags_items[:50]
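
The counting and sorting above can also be written with `collections.Counter`, which this notebook already uses for keywords. The sketch below runs on a small invented list standing in for the `hashtags` column. `hashtag_lists` and `top_hashtags` are hypothetical names.

```python
from collections import Counter

# Stand-in for tweets['hashtags']: one list per tweet, plus a NaN entry.
hashtag_lists = [['#KAG2020'], [], ['#Debates2020', '#KAG2020'], float('nan')]

# Count every tag, skipping entries that are not lists (such as NaN).
counts = Counter(
    tag
    for tags in hashtag_lists
    if isinstance(tags, list)
    for tag in tags
)

# most_common(50) returns up to 50 (tag, count) pairs, most frequent first.
top_hashtags = counts.most_common(50)
print(top_hashtags)  # [('#KAG2020', 2), ('#Debates2020', 1)]
```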

In [None]:
print(f'Number of Hashtag types: {(len(sorted_hashtags))}')
# Use a separate name so the hashtags dictionary is not overwritten
top_hashtag, count = first_50_hashtags_items[0]
print(f'{top_hashtag}: {count}')

In [None]:
from wordcloud import WordCloud

# Convert the wordList into a single string
positive_words_str = ' '.join(wordList)
# Create the WordCloud object, rendering with a local Arial font file
wordcloud = WordCloud(width=800, height=500, max_font_size=100,
                      font_path="./arial.ttf").generate(positive_words_str)

%matplotlib inline
# Plot the graph
plt.figure(figsize=(15, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
# Returns the text of every tweet whose cleaned tokens contain 'term'.
def searchTerm(term):
    catcher = []
    for i in range(len(tweets)):
        if term in tweets.loc[(i,'cleanText')]:
             catcher.append(tweets.loc[(i,'text')])
    return pd.Series(catcher)

In [None]:
# Example of search term
sample = searchTerm('election')
sample[0]