# To-do list

Look up the LIWC Pennebaker technique and validity

# An analysis of the world constructed by President Trump's tweets

This project examines President Trump's tweets, using NLP to explore who and what the President tweets about, how those tweets characterize their subjects, and how this can be used to reconstruct the worldview that the tweets express.


### Load and clean
The TrumpTwitterArchive is missing 2019-2020 data, and retrieving it would require a $100 subscription to the Twitter API, so this analysis uses the older data. Still to decide: whether to limit the analysis to Android tweets only. We also know that the descriptors/NEs might track the hour of the day (just check that).

In [13]:
# Necessary imports
import zipfile
import io
import urllib.request  # urlopen lives in the urllib.request submodule
import json
import pandas as pd
import re

In [20]:
# Importing the data from the github (little trick: the linked file has a different URL than the Github page. *raw*)
url = 'https://github.com/bpb27/trump_tweet_data_archive/raw/master/condensed_2018.json.zip'
access_url = urllib.request.urlopen(url)

z = zipfile.ZipFile(io.BytesIO(access_url.read())) # zip work
data = json.loads(z.read(z.infolist()[0]).decode()) # takes first item from zip file

# Now make a dataframe
tweets = pd.DataFrame.from_records(data)

tweets.head(5)


Unnamed: 0,created_at,favorite_count,id_str,in_reply_to_user_id_str,is_retweet,retweet_count,source,text,hashtag
0,Mon Dec 31 23:53:06 +0000 2018,136012,1079888205351145472,,False,33548,Twitter for iPhone,HAPPY NEW YEAR! https://t.co/bHoPDPQ7G6,[]
1,Mon Dec 31 20:02:52 +0000 2018,65069,1079830268708556800,25073877.0,False,17456,Twitter for iPhone,"....Senator Schumer, more than a year longer t...",[]
2,Mon Dec 31 20:02:52 +0000 2018,76721,1079830267274108930,,False,21030,Twitter for iPhone,Heads of countries are calling wanting to know...,[]
3,Mon Dec 31 15:39:15 +0000 2018,127485,1079763923845419009,,False,29610,Twitter for iPhone,It’s incredible how Democrats can all use thei...,[]
4,Mon Dec 31 15:37:14 +0000 2018,132439,1079763419908243456,,False,30957,Twitter for iPhone,"I’m in the Oval Office. Democrats, come back f...",[]


In [34]:
# Extract mentions by applying a regex
tweets['mentions'] = tweets['text'].apply(lambda x: re.findall(r"@(\w+)", x))

# Extract hashtags via regex
tweets['hashtag'] = tweets['text'].apply(lambda x: re.findall(r"#(\w+)", x))

# Remove URLs, hashtags and mentions from every tweet
import preprocessor as p
tweets['text'] = tweets['text'].apply(p.clean)

tweets.head()

Unnamed: 0,created_at,favorite_count,id_str,in_reply_to_user_id_str,is_retweet,retweet_count,source,text,hashtag,nametag,mentions
0,Mon Dec 31 23:53:06 +0000 2018,136012,1079888205351145472,,False,33548,Twitter for iPhone,HAPPY NEW YEAR!,[],[],[]
1,Mon Dec 31 20:02:52 +0000 2018,65069,1079830268708556800,25073877.0,False,17456,Twitter for iPhone,"....Senator Schumer, more than a year longer t...",[],[],[]
2,Mon Dec 31 20:02:52 +0000 2018,76721,1079830267274108930,,False,21030,Twitter for iPhone,Heads of countries are calling wanting to know...,[],[],[]
3,Mon Dec 31 15:39:15 +0000 2018,127485,1079763923845419009,,False,29610,Twitter for iPhone,Its incredible how Democrats can all use their...,[],[],[]
4,Mon Dec 31 15:37:14 +0000 2018,132439,1079763419908243456,,False,30957,Twitter for iPhone,"Im in the Oval Office. Democrats, come back fr...",[],[],[]


### Basic EDA

Most of the tweets (87%, 3046 out of 3510) are not retweets. We should remove the retweets.

In [3]:
print('Retweets (f)')
print(tweets.is_retweet.value_counts())
print('Retweets (%)')
print(tweets.is_retweet.value_counts() / len(tweets.is_retweet))


Retweets (f)
False    3046
True      464
Name: is_retweet, dtype: int64
Retweets (%)
False    0.867806
True     0.132194
Name: is_retweet, dtype: float64
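
Dropping the retweets could be done with a boolean mask on `is_retweet`. The following is a minimal sketch on a toy frame; the `demo` rows are made up for illustration, and in the notebook the same mask would be applied to `tweets`.

```python
import pandas as pd

# Toy stand-in for the `tweets` frame; these rows are illustrative, not real data.
demo = pd.DataFrame({
    "text": ["original one", "RT someone else", "original two"],
    "is_retweet": [False, True, False],
})

# Keep only rows where is_retweet is False, then renumber the index from 0.
originals = demo[~demo["is_retweet"].astype(bool)].reset_index(drop=True)
print(len(originals), "of", len(demo), "rows kept")
```

Note that resetting the index changes the row labels, so label-based lookups such as `tweets.text[3502]` would point at different rows afterwards.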


The tweet sources are shown below. I haven't yet checked how this compares to the Android vs iPhone debate listed in the references.

In [4]:
tweets.source.value_counts()

Twitter for iPhone      3438
Media Studio              39
Twitter for iPad          17
Twitter Media Studio      12
Twitter Web Client         4
Name: source, dtype: int64
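
To check whether source (and later the descriptors/NEs) varies with the hour of the day, the `created_at` strings first need parsing. A possible sketch is below: the timestamps are copied from the table above, but the `source` values in `demo` are illustrative, and the column name `hour_utc` is an assumption introduced here.

```python
import pandas as pd

# Timestamps in the archive's created_at format; the source values are illustrative only.
demo = pd.DataFrame({
    "created_at": [
        "Mon Dec 31 23:53:06 +0000 2018",
        "Mon Dec 31 20:02:52 +0000 2018",
        "Mon Dec 31 15:39:15 +0000 2018",
    ],
    "source": ["Twitter for iPhone", "Twitter for iPhone", "Media Studio"],
})

# Parse the Twitter timestamp format; %z reads the +0000 offset, so hours are in UTC.
parsed = pd.to_datetime(demo["created_at"], format="%a %b %d %H:%M:%S %z %Y")
demo["hour_utc"] = parsed.dt.hour

# Rows: source, columns: hour of day, cells: number of tweets.
by_hour = pd.crosstab(demo["source"], demo["hour_utc"])
print(by_hour)
```

The hours are UTC, so they would need shifting to a local time zone before being read as time of day.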

#### Cleaning up the tweets for easier analysis

In [5]:
# Making a neater version of the tweets

from nltk.corpus import stopwords

tweets['text_cleaned'] = tweets.text.str.lower()
# Use a separate name so the nltk `stopwords` module is not overwritten
stop_words = set(stopwords.words("english"))
tweets.text_cleaned = tweets.text_cleaned.apply(lambda x: " ".join(word for word in x.split() if word not in stop_words))

One difficulty is that Trump uses sarcasm and disbelief, or describes the magnitude of bad things (e.g. 'single greatest Witch Hunt').
For example, the fourth tweet links Democrats with 'incredible', although it also uses 'ridiculous sound bite'.
It then implies that the Dems are immoral.


In [20]:
print(tweets.text[3])
print(tweets.text[3502])


It’s incredible how Democrats can all use their ridiculous sound bite and say that a Wall doesn’t work. It does, and properly built, almost 100%! They say it’s old technology - but so is the wheel. They now say it is immoral- but it is far more immoral for people to be dying!
The single greatest Witch Hunt in American history continues. There was no collusion, everybody including the Dems knows there was no collusion, &amp; yet on and on it goes. Russia &amp; the world is laughing at the stupidity they are witnessing. Republicans should finally take control!
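
The second tweet printed above still contains the HTML entity `&amp;`. One way to decode such entities before further processing is the standard library's `html.unescape`; a small sketch, using an abridged version of that tweet as input:

```python
import html

# Abridged from the tweet printed above.
raw = "There was no collusion, &amp; yet on and on it goes."

# Convert HTML entities such as &amp; back into plain characters.
decoded = html.unescape(raw)
print(decoded)
```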


In [7]:
tweets.text_cleaned

0                 happy new year! https://t.co/bhopdpq7g6
1       ....senator schumer, year longer administratio...
2       heads countries calling wanting know senator s...
3       it’s incredible democrats use ridiculous sound...
4       i’m oval office. democrats, come back vacation...
5       i’m oval office. democrats, come back vacation...
6       person america could say that, “i’m bringing g...
7       campaigned border security, cannot without str...
8       .....except results far better ever said going...
9       ...i campaigned getting syria places. start ge...
10      anybody donald trump syria, isis loaded mess b...
11      concrete wall never abandoned, reported media....
12      president mrs. obama built/has ten foot wall a...
13      great work administration holidays save coast ...
14      veterans president trump’s handling border sec...
15      “it turns true now, department justice fbi, pr...
16      “absolutely nothing” (on russian collusion). k...
17      2018 c

In [6]:
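# NOTE: `nlp` is created in the spaCy cell further down; run that cell first, otherwise this raises the NameError shown below
corpus = nlp.pipe(tweets.text_cleaned)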


NameError: name 'nlp' is not defined

In [33]:
import numpy as np
tweets_vector = np.array([tweet.vector for tweet in corpus])
print(tweets_vector)

[[ 0.61486125 -1.3177645  -0.8297453  ...  1.0479081   0.19449258
  -0.904479  ]
 [ 0.8187578   0.31961933  0.02575997 ...  0.6472383   0.19267063
  -0.3910993 ]
 [ 1.2160323   0.3199613  -0.11548385 ...  0.475515    0.5724548
  -1.0697542 ]
 ...
 [ 0.8442497   0.3846872   0.03981806 ...  0.9637509   0.64118385
  -0.38999093]
 [ 0.99298507  1.7069277   0.19047529 ... -0.3737102   0.436647
   0.42740962]
 [ 1.0061202   0.6872452   0.5117505  ... -0.01756178  0.3869358
   0.24240337]]


In [1]:
import spacy
from spacy.lang.en import English

nlp = spacy.load("en_core_web_sm")

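doc = nlp(tweets.text[3])  # assumption: parse one example tweet; the original assignment was left blank

# Adjectives can be selected with token.pos_ == "ADJ"

# Named entities are stored in doc.ents; each has ent.text (the text) and ent.label_ (the type)

# Noun + adjective patterns can be searched for with a spaCy Matcher object
# If the two are separated by other tokens, add a wildcard token with "OP": "*"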



How many named entities are there in the tweets? How often is each named? What types are there?

How many adjectives/descriptors are there?
What are the most common and least common?
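
These counting questions could be answered with simple tallies once the entities have been extracted. The sketch below assumes the extraction has already happened and works on hypothetical `(text, label)` pairs, such as spaCy would supply via `ent.text` and `ent.label_`; the sample values are invented for illustration. The same tallying would apply to adjectives.

```python
from collections import Counter

# Hypothetical per-tweet extraction results: (entity text, entity label) pairs.
# With spaCy these would come from [(ent.text, ent.label_) for ent in doc.ents].
entities_per_tweet = [
    [("Democrats", "NORP")],
    [("Schumer", "PERSON"), ("Democrats", "NORP")],
    [],
    [("Russia", "GPE"), ("Dems", "NORP"), ("Republicans", "NORP")],
]

# How often each entity is named, and how often each entity type occurs.
entity_counts = Counter(text for tweet in entities_per_tweet for text, _ in tweet)
label_counts = Counter(label for tweet in entities_per_tweet for _, label in tweet)

# How many tweets contain 0, 1, 2, ... entities.
entities_per_tweet_dist = Counter(len(tweet) for tweet in entities_per_tweet)

print(entity_counts.most_common(3))
print(label_counts)
print(sorted(entities_per_tweet_dist.items()))
```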

In [None]:
from spacy.matcher import Matcher
from spacy.mo



What is the distribution of named entities and descriptors per tweet?
Do any Tweets have more than one named entity?


How are the named entities and descriptors positioned relative to each other within a tweet?
Bigrams, trigrams, same sentence?
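
One way to look at bigram and trigram proximity is to slide a window over the tokens and count how often an entity and a descriptor fall in the same window. This is only a sketch: the `entities` and `descriptors` sets are hand-picked here, whereas in the notebook they would come from NER and POS tagging, and the tokens are taken from the cleaned fourth tweet.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens from the cleaned fourth tweet (whitespace split only).
tokens = "incredible democrats use ridiculous sound".split()

# Hypothetical hand-picked sets, for illustration only.
entities = {"democrats"}
descriptors = {"incredible", "ridiculous"}

# Count windows of size 2 and 3 that contain both an entity and a descriptor.
pair_counts = Counter()
for n in (2, 3):
    for gram in ngrams(tokens, n):
        for ent in entities.intersection(gram):
            for desc in descriptors.intersection(gram):
                pair_counts[(ent, desc, n)] += 1

print(pair_counts)
```

The counts are per window, so a pair that sits in several overlapping trigrams is counted once for each of them.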

What antonym pairs are found in the tweets? Does this match a general understanding of opposites?

### World extraction
Do a data reduction

In [3]:
# Basically import my previous analysis but then consider using NMF instead of FA.
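
For reference, NMF factors a non-negative matrix V into two non-negative matrices W and H with V ≈ WH. The sketch below uses the Lee-Seung multiplicative updates in NumPy rather than any particular library routine; the entity-by-descriptor matrix `V` is hypothetical, and the function name `nmf` and its parameters are assumptions made for this illustration.

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (m x n) into W (m x k) @ H (k x n) using
    Lee-Seung multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical entity-by-descriptor count matrix (rows: entities, columns: descriptors).
V = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 4.0],
    [0.0, 1.0, 4.0, 3.0],
])

W, H = nmf(V, k=2)
error = np.linalg.norm(V - W @ H)
print(np.round(W @ H, 2))
print("reconstruction error:", round(float(error), 3))
```

Unlike factor analysis, the loadings here cannot be negative, which may make the components easier to read as additive groupings of descriptors.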

#### Wilder ideas
Here are some other ideas:
1. Can we predict Trump's attitudes to untweeted entities?
2. Can we predict Trump's tweets using known tweets?
3. Can we make fake Trump tweets?

In [2]:
# 1 - predicting attitudes to unknowns
# Do this via similarity of an entity to known entities in KDWD. Could also use Word2Vec or
# something like GPT-2 or GPT-3
# Vs KDWD on entity and known reduced structure
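
One simple reading of the similarity idea is a similarity-weighted average: score an unseen entity by the attitude scores of known entities, weighted by how close their embeddings are. The sketch below is an assumption about how that might look, not a settled method; the 2-d vectors, the attitude scores, and the names `cosine` and `predict_attitude` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_attitude(target_vec, known):
    """Similarity-weighted mean of known attitude scores.
    `known` maps entity name -> (embedding vector, attitude score).
    Only positively similar entities contribute."""
    weights = {name: max(cosine(target_vec, vec), 0.0) for name, (vec, _) in known.items()}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(weights[name] * score for name, (_, score) in known.items()) / total

# Hypothetical 2-d embeddings and attitude scores in [-1, 1], for illustration only.
known = {
    "entity_a": ([1.0, 0.0], 1.0),
    "entity_b": ([0.0, 1.0], -1.0),
}
print(predict_attitude([1.0, 0.0], known))  # same direction as entity_a
print(predict_attitude([1.0, 1.0], known))  # equally similar to both
```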

In [None]:
# 2 - accuracy of prediction
# Can we predict a holdout set? (Does time matter?)
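
If time does matter, a random holdout would leak later tweets into training, so one option is to hold out the most recent tweets instead. A minimal sketch, assuming records shaped like the archive's JSON (a `created_at` string per tweet); the function name `time_split` and the 80/20 fraction are assumptions, and the timestamps are the five from the table earlier in the notebook.

```python
from datetime import datetime

def time_split(records, train_fraction=0.8):
    """Sort records by created_at and hold out the most recent ones."""
    fmt = "%a %b %d %H:%M:%S %z %Y"
    ordered = sorted(records, key=lambda r: datetime.strptime(r["created_at"], fmt))
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Row numbers and timestamps as in the head() output earlier in the notebook.
records = [
    {"row": 0, "created_at": "Mon Dec 31 23:53:06 +0000 2018"},
    {"row": 1, "created_at": "Mon Dec 31 20:02:52 +0000 2018"},
    {"row": 2, "created_at": "Mon Dec 31 20:02:52 +0000 2018"},
    {"row": 3, "created_at": "Mon Dec 31 15:39:15 +0000 2018"},
    {"row": 4, "created_at": "Mon Dec 31 15:37:14 +0000 2018"},
]

train, test = time_split(records)
print([r["row"] for r in train], [r["row"] for r in test])
```

Comparing accuracy on this split against a random split would give a first indication of whether time matters.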


In [3]:
# 3 - feed tweets into GPT2/3, get new tweets