## Download Hillary Clinton's Tweets

This notebook only downloads the tweets and their metadata into a csv file that can later be loaded with Pandas. Not all columns in the csv file are useful, and some need careful preprocessing. The fields in the csv file are described below.

Column | Description
--- | ---
id | tweet id
source | iPhone or Android (needs preprocessing)
text | text of HC's tweets
favorite_count | number of favorites the tweet received (set to 0 for retweets, so affected by the quoting issue noted under is_retweet)
retweet_count | number of retweets the tweet received (set to 0 for retweets, so affected by the quoting issue noted under is_retweet)
is_retweet | boolean (affected by the way HC sometimes quotes tweets)
original_author | original author of the tweet (biased towards HC because of the quoting issue noted under is_retweet)
possibly_sensitive | boolean (useless since either no info in the case of Android posts or False for iPhone posts)
created_at | day, date, and time when the tweet was posted on HC's page
hashtags | hashtags in the tweet
user_mentions | Twitter users mentioned in the tweet (only listed if mentioned the typical "Twitter way" using @)
lang | language (usually English, sometimes undefined)
place | place where the tweet was made based on the coordinates
place_coord_boundaries | corner coordinates of the bounding box defining the location where the tweet was made 
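
As a rough illustration of the preprocessing mentioned above, the sketch below cleans two columns on a made-up sample. It assumes that `source` holds an HTML anchor string and that `created_at` uses the `Mon Sep 26 21:05:00 +0000 2016` layout; the sample rows and the helper name `clean_source` are hypothetical, not taken from the downloaded file.

```python
import re

import pandas as pd

# Made-up rows standing in for the csv contents (not real tweets).
sample = pd.DataFrame({
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    ],
    "created_at": [
        "Mon Sep 26 21:05:00 +0000 2016",
        "Tue Sep 27 01:30:00 +0000 2016",
    ],
})


def clean_source(raw):
    """Strip HTML tags, keeping only the visible client name (hypothetical helper)."""
    return re.sub(r"<[^>]+>", "", raw).strip()


sample["source"] = sample["source"].map(clean_source)
# Assumed timestamp layout; adjust the format string if the file differs.
sample["created_at"] = pd.to_datetime(
    sample["created_at"], format="%a %b %d %H:%M:%S %z %Y"
)
```

Parsing `created_at` into real timestamps also makes it possible to sort or resample the tweets later on.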

---

In [1]:
import os
import sys

import tweepy
import pandas as pd
import numpy as np

In [2]:
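import yaml

root_dir = "/Users/gogrean/Documents/kaggle/off_kaggle_ds/elect2016/"
# safe_load parses plain YAML only and does not need an explicit Loader argument.
with open(root_dir + 'credentials/twitter.cred') as cred_file:
    twitter_cred = yaml.safe_load(cred_file)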

auth = tweepy.OAuthHandler(twitter_cred['consumer_token'], 
                           twitter_cred['consumer_secret'])
auth.set_access_token(twitter_cred['access_token_key'], 
                      twitter_cred['access_token_secret'])

api = tweepy.API(auth)

The Twitter API only gives access to a user's 3200 most recent tweets. I therefore start with the last 3200 tweets that HC posted, and will rerun the script every few weeks to download new data.
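
To see how many requests that limit translates into, here is a small back-of-the-envelope calculation. It assumes 200 tweets per page, which is the `count` value passed to `user_timeline` in the next cell; the constant names are illustrative only.

```python
import math

MAX_TWEETS = 3200      # timeline limit quoted above
TWEETS_PER_PAGE = 200  # the `count` value used in the request below

# Smallest number of pages that covers the whole accessible timeline.
pages_needed = math.ceil(MAX_TWEETS / TWEETS_PER_PAGE)
print(pages_needed)
```

The next cell requests 17 pages, one more than this estimate.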

In [3]:
# Fetch up to 17 pages of 200 tweets each, retweets included.
page_list = []
timeline = tweepy.Cursor(api.user_timeline, screen_name='HillaryClinton',
                         count=200, include_rts=True)
for page in timeline.pages(17):
    page_list.append(page)

If a file with the tweets already exists, only new tweets are added to it. New tweets are appended at the end of the file, so the file is not in chronological order. A tweet counts as new if its creation date and time are not already in the file; otherwise it is an old tweet and is skipped.
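
The same rule can be sketched in vectorized form on made-up data. This is only an illustration of the idea, not the loop used in the next cell: the two small frames and their values are hypothetical, and the final sort is an optional extra step that assumes the `Mon Sep 26 21:05:00 +0000 2016` timestamp layout.

```python
import pandas as pd

# Made-up stand-ins for the existing file and a fresh download.
existing = pd.DataFrame({
    "id": [1, 2],
    "created_at": ["Mon Sep 26 21:05:00 +0000 2016",
                   "Tue Sep 27 01:30:00 +0000 2016"],
})
downloaded = pd.DataFrame({
    "id": [2, 3],
    "created_at": ["Tue Sep 27 01:30:00 +0000 2016",
                   "Wed Sep 28 14:00:00 +0000 2016"],
})

# Keep only downloaded rows whose timestamp is not already on file.
is_new = ~downloaded["created_at"].isin(existing["created_at"])
combined = pd.concat([existing, downloaded[is_new]], ignore_index=True)

# Optional: restore chronological order for later analysis.
combined["parsed"] = pd.to_datetime(
    combined["created_at"], format="%a %b %d %H:%M:%S %z %Y"
)
combined = combined.sort_values("parsed").drop(columns="parsed")
combined = combined.reset_index(drop=True)
```

One caveat of keying on the timestamp is that two distinct tweets posted in the same second would be treated as duplicates; the `id` column could serve as a stricter key.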

In [4]:
root_dir = "/Users/gogrean/Documents/kaggle/off_kaggle_ds/elect2016/data/"
clintons_tweets_file = "clintons_tweets.csv"

if os.path.exists(root_dir + clintons_tweets_file):
    df = pd.read_csv(root_dir + clintons_tweets_file, header=0)
    cols = df.columns.values
else:
    cols = ['id', 'source', 'text', 'favorite_count', 'retweet_count',  
            'is_retweet', 'original_author', 'possibly_sensitive', 
            'created_at', 'hashtags', 'user_mentions', 'lang', 
            'place', 'place_coord_boundaries']
    df = pd.DataFrame(columns=cols)

for page in page_list:
    for status in page:
        status = status._json
        if status['created_at'] in df['created_at'].values:
            continue
        hashtags = ", ".join([hashtag_item['text'] for hashtag_item in status['entities']['hashtags']])
        mentions = ", ".join([mention['screen_name'] for mention in status['entities']['user_mentions']])
        # A retweet carries the original tweet under 'retweeted_status'.
        is_retweet = 'retweeted_status' in status

        if is_retweet:
            n_favorites = 0
            n_retweets = 0
            original_author = status['retweeted_status']['user']['screen_name']
        else:
            n_favorites = status['favorite_count']
            n_retweets = status['retweet_count']
            original_author = status['user']['screen_name']
        
        # status['place'] is None for tweets without location data,
        # so indexing it raises TypeError.
        try:
            location = status['place']['full_name']
        except TypeError:
            location = ''

        try:
            coordinates = [coord for loc in status['place']['bounding_box']['coordinates'] for coord in loc]
        except TypeError:
            coordinates = None
        
        try:
            is_sensitive = status['possibly_sensitive']
        except KeyError:
            is_sensitive = None
        
        single_tweet_df = pd.DataFrame([[
                                        status['id'],
                                        status['source'],
                                        status['text'],
                                        n_favorites,
                                        n_retweets,
                                        is_retweet, original_author, 
                                        is_sensitive,
                                        status['created_at'],
                                        hashtags, mentions, 
                                        status['lang'],
                                        location,
                                        coordinates
                                        ]], columns=cols)
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
        df = pd.concat([df, single_tweet_df], ignore_index=True)

In [5]:
df.to_csv(root_dir + clintons_tweets_file, columns=cols, index=False)