# Trump tweets

_Data Structures and Algorithms_

_Imperial College Business School_


---
In this optional section, we will practice dealing with JSON data from Twitter's API.

---


## Submission

This part of the session is optional and meant to be open-ended. You don't need to submit anything. 

## Trump tweets, JSON edition

Earlier in the course, we looked at some of Donald Trump's tweets that had been conveniently packaged into a CSV file that we could open as a spreadsheet. The dataset we used ended in summer 2016. What if we wanted to analyse more recent tweets? We could do this by registering for access to Twitter's API (application programming interface; Twitter's interface for anyone to access its data) and downloading the data directly using one of the libraries that Python users have developed for accessing the API. 

We won't register for the API now. Instead, we'll use data maintained by GitHub user bpb27.

The data that we'll use has been cleaned and condensed, but it resembles the output from the Twitter API. Instead of CSV, the API provides data in JSON format. We've included the file `condensed_2016.json`, downloaded from GitHub, in the zip file.

What is JSON data, and how can we load it into Python?

### JSON

JSON (JavaScript Object Notation, pronounced "Jason") is a common format for semi-structured data on the web. Many APIs provide data as JSON. We can open a JSON file in any text editor. It will look something like this:

```json
{"countries":
  [
  {"Germany": "Berlin"},
  {"France": "Paris"},
  {"Italy": "Rome"}
  ]
}
```

This looks very much like a Python data structure with nested dictionaries and lists. A tweet from Twitter's API will look similar (only some fields are shown here):

```json
{"source": "Twitter for iPhone",
"id_str": "815271067749060609",
"text": "RT @realDonaldTrump: Happy Birthday @DonaldJTrumpJr!\nhttps://t.co/uRxyCD3hBz",
"created_at": "Sat Dec 31 18:59:04 +0000 2016",
"retweet_count": 9529,
"in_reply_to_user_id_str": null,
"favorite_count": 0,
"is_retweet": true}
```
JSON is more flexible than CSV. For example, some tweets might lack certain data fields. In a CSV file, the column would still be present, with an empty cell for those tweets. In JSON, the key-value pair is simply absent.
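To make the correspondence with Python data structures concrete, here is a minimal sketch that parses two small JSON strings. The records and the `capital` key are made up for illustration; only `json.loads` and `dict.get` are standard Python.

```python
import json

# Two made-up records; the second one lacks the "capital" key
raw = '[{"country": "Germany", "capital": "Berlin"}, {"country": "Atlantis"}]'

# json.loads parses a string (json.load reads from an open file instead)
records = json.loads(raw)

for record in records:
    # dict.get returns a default instead of raising KeyError for a missing key
    print(record["country"], record.get("capital", "unknown"))

# JSON true/false/null become Python True/False/None
flags = json.loads('{"is_retweet": true, "in_reply_to_user_id_str": null}')
print(flags)
```

Using `record.get(...)` rather than `record[...]` is one way to cope with fields that are absent from some records.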

We can read the JSON file into Python as we would a text file, or open it in Notepad or another text editor.


In [None]:
json_file_name = 'condensed_2016.json'
with open(json_file_name, encoding='utf8') as f:
    text = f.read()

# Print first characters of resulting string 
print(text[0:500]) 

Having read the file, we could search through the string for the different fields of each tweet. But it's much more convenient to use a library that directly exploits the structure of JSON: Python's built-in `json` module. We import it with an `import` statement and use its functions to load the data into Python's data structures.

In [None]:
import json
json_file_name = 'condensed_2016.json'
with open(json_file_name, encoding='utf8') as f:
    tweet_data = json.load(f)

The result is a Python list of dictionaries containing the tweets.

In [None]:
tweet_data[0:2]

We're now ready to study patterns in the tweets.

### Exercise: Up all night?

Let's start by analysing President Trump's sleep patterns. We will create a count of the number of tweets by hour of the day. 

Here's how you can get the hour from a tweet timestamp using the `datetime` module.

In [None]:
from datetime import datetime

# Get first tweet
tw = tweet_data[0]
print(tw)
# We see the timestamp is at the field 'created_at'

# Get timestamp of the tweet
date_str = tw['created_at']
print(date_str)

# Make into datetime object, get the attributes of the result
dt = datetime.strptime(date_str,'%a %b %d %H:%M:%S +0000 %Y') # specify format of time string
print(dt.year, dt.month, dt.day, dt.hour)
type(dt.hour)

Your task is to count all tweets by hour. One way to do this is to create a dictionary with hours as keys and counts as values.

#### Sidebar: dictionary comprehension

Recall that Python offers _comprehensions_, a compact way of writing loops that build a collection. With a dictionary comprehension, we can create a dictionary in a single line as follows. Lists can be created in the same way.

In [None]:
# Initialize dictionary of zero hourly counts using dictionary comprehension
# Dictionary specified as key->hour, value->zero for each hour value in the range
hourly_counts = {hour:0 for hour in range(24)}
hourly_counts
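
For comparison, here is a small sketch of the list version of a comprehension. The names `hour_labels` and `night_hours` are made up for illustration and are not needed for the exercise.

```python
# List comprehension: one label per hour, e.g. '00:00', '01:00', ...
hour_labels = [f"{hour:02d}:00" for hour in range(24)]
print(hour_labels[:3])  # ['00:00', '01:00', '02:00']

# An optional condition at the end filters the items
night_hours = [hour for hour in range(24) if hour < 6]
print(night_hours)  # [0, 1, 2, 3, 4, 5]
```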

You may find this initialization useful in calculating the counts.

In [None]:
# Your code here. 

# Initialize hourly_counts as above

# loop through tweets in tweet_data
# within loop: get timestamp of tweet as above
# within loop: get hour as above
# within loop: add one to dictionary value for relevant hour


What is the most common hour for tweeting? What can you say about the President's sleeping patterns? What additional analysis would you do?
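When interpreting the hours, keep in mind that the `+0000` in each timestamp indicates UTC, not local time. The sketch below shows one possible way to convert a timestamp, assuming that US Eastern time is the relevant local time zone (an assumption, not something the data tells us). It uses the standard `zoneinfo` module, which requires Python 3.9 or later and time zone data on your system (on Windows you may need to install the `tzdata` package).

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library from Python 3.9

# Timestamp of the example tweet shown earlier
date_str = "Sat Dec 31 18:59:04 +0000 2016"

# %z parses the '+0000' offset, giving a timezone-aware datetime in UTC
dt_utc = datetime.strptime(date_str, "%a %b %d %H:%M:%S %z %Y")

# Assumption: US Eastern time is the local time zone of interest
dt_local = dt_utc.astimezone(ZoneInfo("America/New_York"))

print(dt_utc.hour, dt_local.hour)  # 18 13
```

Because `zoneinfo` knows about daylight saving time, the offset is five hours in winter and four in summer, which a fixed offset would miss.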

You can use `matplotlib` to plot the result. You can do a line chart following [the first example here](https://matplotlib.org/users/pyplot_tutorial.html), or a bar chart following [this example](https://pythonspot.com/en/matplotlib-bar-chart/).

You could also check how this pattern changes over different months.

### Exercise: Who's tweeting?

The `source` field of the data shows that the tweets were sent from several different sources.

Create an hourly count of tweets for each source. What do the results suggest about Trump's personal phone and the one his office uses for tweeting?

Let's first find all the sources that are in the data. 

In [None]:
sources = set()
for tweet in tweet_data:
    # A set ignores duplicates, so no membership check is needed
    sources.add(tweet['source'])
print(sources)

Which are the most common sources and what do their timings suggest about usage?
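As a hint for ranking the sources, `collections.Counter` from the standard library can tally items and sort them by frequency. The sketch below runs on a handful of made-up records (`sample_tweets` is a hypothetical stand-in for `tweet_data`), so the counts shown say nothing about the real data.

```python
from collections import Counter

# Made-up records standing in for tweet_data
sample_tweets = [
    {"source": "Twitter for Android"},
    {"source": "Twitter for iPhone"},
    {"source": "Twitter for Android"},
    {"source": "Twitter Web Client"},
    {"source": "Twitter for Android"},
    {"source": "Twitter for iPhone"},
]

# Counter tallies how often each source value occurs
source_counts = Counter(tweet["source"] for tweet in sample_tweets)

# most_common(n) returns the n most frequent (item, count) pairs
print(source_counts.most_common(2))
# [('Twitter for Android', 3), ('Twitter for iPhone', 2)]
```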

In [None]:
# Your code here
# Initialize dictionary (or multiple) like above
# Loop through tweets and add to counts like above

### Exercise: Who's tweeting what?

How do the contents of Mr Trump's tweets change depending on the source? We could do a sophisticated study here using [sentiment analysis](http://text-processing.com/demo/sentiment/) of the tweet texts. For the purposes of this exercise, do the following calculations by tweet source:

1. Find the fraction of tweets containing the word 'dumb' in either upper or lower case.

2. Repeat for other words of your choice, for example the ones suggested below.

You can also repeat the analysis by source and hour, or look at different words or mentions of different Twitter users. 

Note that you will probably want to count upper- and lower-case occurrences together, for example by converting each tweet's text to lower case before searching.
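One possible way to match words regardless of case is sketched below. The helper `contains_word` and the sample texts are made up for illustration. Note that this is a substring match, so 'dumb' also matches 'Dumbest'; whether that is what you want is a judgment call.

```python
def contains_word(text, word):
    """Return True if word occurs in text, ignoring case (substring match)."""
    return word.lower() in text.lower()

# Made-up texts standing in for the 'text' field of real tweets
sample_texts = ["So DUMB!", "Great rally tonight", "Dumbest idea ever"]

flags = [contains_word(text, "dumb") for text in sample_texts]
print(flags)  # [True, False, True]

# True counts as 1 and False as 0, so the sum is the number of matches
fraction = sum(flags) / len(flags)
print(fraction)
```

A substring match also works for hashtags such as '#crookedhillary', which a split on whitespace and punctuation might break apart.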


In [None]:
# Your code here
words = ['dumb', 'brexit', '#makeamericagreatagain', 'guns', 'dead', '#crookedhillary']
