## Notebook 1 - Get Tweets using GetOldTweets
The purpose of this notebook is to gather the tweets. Scott Tarlow, a contact shared with me by my advisor Joshua Cook, directed me to the GetOldTweets-python repo (https://github.com/Jefferson-Henrique/GetOldTweets-python) to circumvent the limits of the Twitter API. Here are some of the tweaks I implemented in GetOldTweets to tailor it to my needs:  
    
**Problem:**  
The library was written for Python 2 and I'm using Python 3.  

**Fix:**  
Replaced `urllib2` references with `urllib` in the `getJsonReponse` function within `TweetManager.py`.  
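
As a rough illustration of what this kind of port involves, here is a minimal sketch of the usual Python 2 to Python 3 mappings for URL handling. The exact calls inside `getJsonReponse` are not reproduced here; the query string and header below are made-up examples, and nothing in this block touches the network.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Python 2: urllib.quote(...)  ->  Python 3: urllib.parse.quote(...)
# The query string is a hypothetical example, not taken from TweetManager.py.
query = urllib.parse.quote('from:sfspca since:2008-01-01 until:2008-01-31')

# Python 2: cookielib.CookieJar()  ->  Python 3: http.cookiejar.CookieJar()
cookie_jar = http.cookiejar.CookieJar()

# Python 2: urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
# Python 3: the same names live under urllib.request
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]  # illustrative header only

print(query)
```

In Python 3 the old `urllib2` functionality is split across `urllib.request`, `urllib.parse`, and `urllib.error`, so the replacement is usually a matter of changing the module prefix.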
    
**Problem:**  
I also discovered that retrieval time seemed to grow exponentially with the number of tweets requested.  

**Fix:**  
Added functionality to pull tweets one month at a time. This takes about 1 minute per 500 tweets, compared to about 100 minutes per 500 previously.  

**Problem:**  
Originally, I wasn't getting tweets from the last day of each month because the end date is not inclusive.  

**Fix:**  
With the exception of the very first month, each month's start date is the previous month's end date, so the day that was excluded from one request is picked up by the next.  
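
The same idea can be expressed with half-open date windows. The sketch below is a variant, not the code used in this notebook: instead of chaining end dates, it uses the first day of the following month as the exclusive end date, which also covers December 31 of the final year. The helper name `month_windows` is hypothetical.

```python
from datetime import date


def month_windows(start_year, end_year):
    """Yield (since, until) ISO date strings, one half-open window per month."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            since = date(year, month, 1)
            # The end date is exclusive, so using the first day of the next
            # month still includes the last day of the current month.
            if month == 12:
                until = date(year + 1, 1, 1)
            else:
                until = date(year, month + 1, 1)
            yield since.isoformat(), until.isoformat()


windows = list(month_windows(2008, 2017))
print(windows[0])
print(windows[-1])
```

Each `(since, until)` pair could be passed to `setSince` and `setUntil` in the same way as the strings built in `get_tweets`. Because each window ends where the next begins, no day is skipped or fetched twice, and leap years need no special handling.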
  
**Problem:**  
Tweets sent as replies to other tweets were being included. I only want original tweets.  
  
**Fix:**  
I added a check to the `getTweets` function within `TweetManager.py` that skips replies.  
  
**Problem:**  
The original code references a section of the Twitter API response that no longer exists, which led to an empty username column in the data frame.  
  
**Fix:**  
I found where the username now lives in the response and updated the code accordingly.

In [None]:
cd /home/jovyan/capstone/GetOldTweets-python/

In [None]:
import pandas as pd

In [None]:
import got3 as got

# def get_tweets(username, since=None, until=None):
def get_tweets(username):
    from collections import defaultdict
    d = defaultdict(list)
    
    attributes = ['date', 'favorites', 'retweets', 'hashtags', 'id', 'mentions',
                  'text', 'urls', 'username', 'permalink', 'author_id']

    # created to efficiently pull tweets one month at a time    
#     years = ['2017', '2016', '2015', '2014', '2013', '2012', '2011', '2010', '2009', '2008']
    years = ['2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017']
#     months = [('12', '31'), ('11', '30'), ('10', '31'), ('09', '30'), ('08', '31'), ('07', '31'),
#               ('06', '30'), ('05', '31'), ('04', '30'), ('03', '31'), ('02', '28'), ('01', '31')]
    months = [('01', '31'), ('02', '28'), ('03', '31'), ('04', '30'), ('05', '31'), ('06', '30'),
              ('07', '31'), ('08', '31'), ('09', '30'), ('10', '31'), ('11', '30'), ('12', '31')]
    total = 0
    for year in years:
        for month in months:
            # account for non-inclusive end of month by starting where previous month left off
            if year == '2008' and month[0] == '01':
                since = year + '-' + month[0] + '-01'
            else:
                since = until
            until = year + '-' + month[0] + '-' + month[1]
            tweetCriteria = got.manager.TweetCriteria().setUsername(username).setSince(since).setUntil(until)
            tweet_list = got.manager.TweetManager.getTweets(tweetCriteria)
            print('{} tweets from {} to {}.'.format(len(tweet_list), since, until))
            total += len(tweet_list)
            for tweet in tweet_list:
                for att in attributes:
                    # getattr looks up the attribute by name without eval
                    d[att].append(getattr(tweet, att))
#         print('Year: {}, Running Total: {}'.format(year, total))
    print('{} total tweets for {}.'.format(total, username))
    return pd.DataFrame(d)

In [None]:
def get_and_export(filename, handle):
    data = get_tweets(str(handle))
    data.to_pickle('../data/' + str(filename) + '_tweets.p')

In [None]:
# list of tuples containing filename for pickle and twitter handle
all_spcas = [('df_sfspca', 'sfspca'), ('df_pspca', 'PSPCA'), ('houston', 'HoustonSPCA'), ('texas', 'spcaoftexas'),
            ('tulsa', 'Tulsa_SPCA'), ('richmond', 'RichmondSPCA'), ('ontario', 'OntarioSPCA'), 
            ('alberta', 'FMSPCA'), ('bc', 'BC_SPCA')]

In [None]:
# enumerate supplies the running count, starting at 1
for count, (filename, handle) in enumerate(all_spcas, start=1):
    print('Pulling {} of {}: {}'.format(count, len(all_spcas), handle))
    get_and_export(filename, handle)

In [None]:
df_sfspca = get_tweets('sfspca')

In [None]:
df_sfspca.to_pickle('../data/sfspca_tweets.p')

In [None]:
df_pspca = get_tweets('PSPCA')

In [None]:
df_pspca.to_pickle('../data/pspca_tweets.p')

In [None]:
df_houston = get_tweets('HoustonSPCA')

In [None]:
df_houston.to_pickle('../data/houston_tweets.p')

In [None]:
df_texas = get_tweets('spcaoftexas')

In [None]:
df_texas.to_pickle('../data/texas_tweets.p')

In [None]:
df_tulsa = get_tweets('Tulsa_SPCA')

In [None]:
df_tulsa.to_pickle('../data/tulsa_tweets.p')

In [None]:
df_richmond = get_tweets('RichmondSPCA')

In [None]:
df_richmond.to_pickle('../data/richmond_tweets.p')

In [None]:
df_ontario = get_tweets('OntarioSPCA')

In [None]:
# write to the Ontario file rather than overwriting sfspca_tweets.p
df_ontario.to_pickle('../data/ontario_tweets.p')

In [None]:
df_alberta = get_tweets('FMSPCA')

In [None]:
df_alberta.to_pickle('../data/alberta_tweets.p')

In [None]:
df_bc = get_tweets('BC_SPCA')

In [None]:
df_bc.to_pickle('../data/bc_tweets.p')