## Download Clinton's and Trump's Tweets

In this notebook, we download each candidate's tweets along with their metadata. The data is saved in a CSV file that can later be loaded with `pandas`. Not all the information in the CSV file will be useful. However, we err on the side of caution and save more information than we (immediately, at least) need, because the Twitter API makes missing information difficult to add later on.

The API only gives access to the 3200 most recent tweets of an account. To accumulate more data, the code needs to be run every few months to get the newest tweets. When that is done, it's important to check whether a tweet that is already in the CSV file should be updated (e.g., because it has received more favorites since we previously downloaded it).
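
The update step described above can be tried out offline. The snippet below is a minimal sketch, not part of the download code: `update_counts` is a hypothetical helper, the two saved rows are made-up data, and it assumes that tweets are matched by their `id`.

```python
import pandas as pd


def update_counts(df, status):
    """Refresh the counts of a tweet that is already stored in df.

    Returns True if the tweet was found (and refreshed), False otherwise.
    `status` is a plain dict shaped like the JSON of a tweet.
    """
    match = df.index[df['id'] == status['id']]
    if len(match) == 0:
        return False
    i = match[0]
    df.at[i, 'favorite_count'] = status['favorite_count']
    df.at[i, 'retweet_count'] = status['retweet_count']
    return True


# Made-up rows standing in for a previously saved CSV file.
saved = pd.DataFrame({'id': [1, 2],
                      'favorite_count': [10, 20],
                      'retweet_count': [1, 2]})

# A fresh download of tweet 2, which has gained favorites and retweets.
fresh = {'id': 2, 'favorite_count': 25, 'retweet_count': 3}
found = update_counts(saved, fresh)
print(found)
print(saved)
```

A tweet that is not found would instead be appended as a new row.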

When a new tweet is added to the CSV file, it is appended at the end of the file. This means that the tweets saved will end up not being in chronological order. We can sort things later on if necessary.
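
Sorting later on could look like the sketch below. It assumes that `created_at` keeps the format returned by the API (for example `Wed Oct 19 02:10:00 +0000 2016`); the three rows are made-up data and `TWITTER_TIME_FORMAT` is a name introduced only for this example.

```python
import pandas as pd

# Format of the created_at strings, e.g. "Wed Oct 19 02:10:00 +0000 2016".
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'

# Made-up rows, deliberately out of chronological order.
tweets = pd.DataFrame({
    'id': [3, 1, 2],
    'created_at': ['Wed Oct 19 02:10:00 +0000 2016',
                   'Mon Sep 26 01:00:00 +0000 2016',
                   'Sun Oct 09 01:30:00 +0000 2016'],
})

# Parse the strings into timestamps, then sort from oldest to newest.
tweets['created_at'] = pd.to_datetime(tweets['created_at'],
                                      format=TWITTER_TIME_FORMAT)
tweets = tweets.sort_values('created_at').reset_index(drop=True)
print(tweets)
```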

Below is a description of the fields in the CSV file:

Column | Description
--- | ---
id | tweet id
created_at | day, date, and time (UTC) when the tweet was posted
source | device from which the tweet was posted
text | text of tweet 
lang | language; for simplicity, only tweets in English are selected 
favorite_count | number of favorites the tweet received
retweet_count | number of retweets the tweet received
original_author | username under which the tweet was posted
possibly_sensitive | boolean, False for every tweet collected here; empty when the API does not return the field
hashtags | hashtags in the tweet
user_mentions | Twitter users mentioned in the tweet
place | full name of the place tagged in the tweet (empty if there is none)
place_coord_boundaries | corner coordinates of the bounding box defining the location where the tweet was made 
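
Because `hashtags`, `user_mentions`, and `place_coord_boundaries` hold several values per tweet, they come back from the CSV file as plain strings (or as missing values when empty). The sketch below shows one way to turn them into Python lists again. It assumes hashtags are joined with `", "` and that the bounding box is written as the text of a Python list; `split_tags` and `parse_box` are hypothetical helpers and the two rows are made-up data.

```python
import ast
import io

import pandas as pd

# Made-up CSV content with one filled-in tweet and one with empty fields.
csv_text = (
    'id,hashtags,place_coord_boundaries\n'
    '1,"MAGA, CrookedHillary","[[-74.03, 40.68], [-73.91, 40.88]]"\n'
    '2,,\n'
)
df = pd.read_csv(io.StringIO(csv_text))


def split_tags(value):
    """Turn "a, b" into ['a', 'b']; missing values become an empty list."""
    return [] if pd.isna(value) else value.split(', ')


def parse_box(value):
    """Turn the text of a list back into a list; missing values become []."""
    return [] if pd.isna(value) else ast.literal_eval(value)


df['hashtags'] = df['hashtags'].apply(split_tags)
df['place_coord_boundaries'] = df['place_coord_boundaries'].apply(parse_box)
print(df)
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for text read from a file.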

---

Using the Twitter API to download data requires a Twitter account and a mobile phone. The mobile phone number should be added to your Twitter account. If you have these and are willing to use them for the tutorial, then the next step is to create an application and get the keys needed to authorize requests to the Twitter API. 

Go to https://apps.twitter.com/ and click on `Create New App`.

<img src="../img/create_new_app.jpg">

Enter all the information required in the fields marked with \*, agree to the Twitter Development Agreement, and then click `Create your Twitter application`.

<img src="../img/enter_app_info.jpg">

A new window will open to allow you to edit your application settings. Click on the `Keys and Access Tokens` tab at the top of the page.

<img src="../img/keys_and_access_tokens.jpg">

At the bottom of the page, click on `Create my access token`. 

<img src="../img/create_access_token.jpg">

Now you should have all the tokens needed to use the API. You need the Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret. My advice is to save these in a YAML file rather than hard-code them. The YAML file will look something like this:

<img src="../img/yaml_example.jpg">
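
In text form, a credentials file with this layout could be parsed as in the sketch below. The four key names are the ones the authentication code in this notebook looks up; the values are placeholders, and reading the YAML from a string instead of a file is only done to keep the example self-contained.

```python
import yaml

# Placeholder values; replace them with the keys of your own application.
example_cred_text = """
consumer_token: YOUR_CONSUMER_KEY
consumer_secret: YOUR_CONSUMER_SECRET
access_token_key: YOUR_ACCESS_TOKEN
access_token_secret: YOUR_ACCESS_TOKEN_SECRET
"""

# safe_load only builds plain Python objects (here, a dict of strings).
example_cred = yaml.safe_load(example_cred_text)
print(sorted(example_cred))
```

Keeping this file out of version control avoids publishing the keys by accident.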

Next, we can load the credentials from the YAML file into Python and authenticate with the API.

In [1]:
import os
import sys
import yaml

import numpy as np
import pandas as pd
import tweepy

In [2]:
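# Change this to match your own credentials filename.
root_dir = os.path.dirname(os.getcwd()) + "/"
# safe_load only builds plain Python objects; the with-block closes the file.
with open(root_dir + 'credentials/twitter.cred') as cred_file:
    twitter_cred = yaml.safe_load(cred_file)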

auth = tweepy.OAuthHandler(twitter_cred['consumer_token'], 
                           twitter_cred['consumer_secret'])
auth.set_access_token(twitter_cred['access_token_key'], 
                      twitter_cred['access_token_secret'])

api = tweepy.API(auth)

I save the data in two separate CSV files: one for Trump's tweets and one for Clinton's tweets.

In [3]:
trumps_tweets_file = "data/trumps_tweets.csv"
clintons_tweets_file = "data/clintons_tweets.csv"

In [4]:
# Column names of the CSV file.
COLS = ['id', 'created_at', 'source', 'text', 'lang', 
        'favorite_count', 'retweet_count', 'original_author', 'possibly_sensitive', 'hashtags', 
        'user_mentions', 'place', 'place_coord_boundaries']

In [5]:
def write_tweets(username, file):
    # If the file exists, then read the existing data from the CSV file.
    if os.path.exists(file):
        df = pd.read_csv(file, header=0)
    else:
        df = pd.DataFrame(columns=COLS)
    
    # The Twitter API returns at most 200 statuses per 'page'. Since we can
    # download at most 3200 tweets (16 full pages), looping through 17 pages
    # is enough to get all the available data.
    for page in tweepy.Cursor(api.user_timeline, screen_name=username, 
                              count=200, include_rts=False).pages(17):
        # Print some of the variables below to understand what's happening.
        for status in page:
            new_entry = []
            status = status._json
            
            # If the tweet was not written in English (either because it's actually 
            # in a different language or because it only contains links, hashtags, 
            # and/or user mentions), move on to the next tweet.
            if status['lang'] != 'en':
                continue

            # Update a tweet's info in the CSV file if it was saved by a previous
            # download. Tweets are matched by id, which is unique (several tweets
            # can share the same created_at timestamp).
            if status['id'] in df['id'].values:
                i = df.index[df['id'] == status['id']][0]
                df.at[i, 'favorite_count'] = status['favorite_count']
                df.at[i, 'retweet_count'] = status['retweet_count']
                continue
            
            # If the tweet is written in English and is not already in the CSV file, 
            # add some of its info (the attributes that don't need to be processed) 
            # to the new_entry list.
            new_entry += [status['id'], status['created_at'], 
                          status['source'], status['text'], status['lang'], 
                          status['favorite_count'], status['retweet_count']]
            
            # In this case, the original_author column will either have only
            # realDonaldTrump or HillaryClinton, depending on the file. However, 
            # you can try to modify this function to save all the data in a single 
            # file. In that case, it's important to save the username.
            new_entry.append(status['user']['screen_name'])
            
            # The API omits possibly_sensitive for some tweets; store None then.
            is_sensitive = status.get('possibly_sensitive')
            new_entry.append(is_sensitive)
            
            # Save the hashtags and user mentions as comma-separated strings, e.g.,
            #   "MAGA, CrookedHillary"
            # for a Trump tweet.
            hashtags = ", ".join([hashtag_item['text'] for hashtag_item in status['entities']['hashtags']])
            new_entry.append(hashtags)
            mentions = ", ".join([mention['screen_name'] for mention in status['entities']['user_mentions']])
            new_entry.append(mentions)

            try:
                location = status['place']['full_name']    
            except TypeError:
                location = ''
            new_entry.append(location)

            try:
                coordinates = [coord for loc in status['place']['bounding_box']['coordinates'] for coord in loc]
            except TypeError:
                coordinates = None
            new_entry.append(coordinates)

            single_tweet_df = pd.DataFrame([new_entry], columns=COLS)
            df = pd.concat([df, single_tweet_df], ignore_index=True)
    df.to_csv(file, columns=COLS, index=False, encoding="utf-8")

Finally, we can run the function above to download Clinton's and Trump's most recent tweets. Their usernames are realDonaldTrump and HillaryClinton.

In [6]:
write_tweets('realDonaldTrump', root_dir + trumps_tweets_file)
write_tweets('HillaryClinton', root_dir + clintons_tweets_file)