# ECON 6760 Final

---

# Tweet Scraping

The following is the platform we used to retrieve tweets that mention "@Wendys," the corporate Twitter account of Wendy's.

Our goal with this project is to investigate a vector autoregression (VAR) model that predicts the stock price of Wendy's (NASDAQ: WEN) by integrating data from Twitter. The fields retrieved are: tweet datetime, username, tweet text, number of times the tweet was favorited, and number of times the tweet was retweeted. The last two fields can be thought of as measures of exposure: retweets lead to more reads, and favorites lead to the tweet being "bumped up" in the Twitter feeds of relevant readers.

In [1]:
import pandas as pd  # holding tweets in a dataframe
import GetOldTweets3 as got  # API to scrape tweets
import time  # time.sleep 

In [4]:
def tweet_to_df(target, start, stop):
    '''
    Function to scrape tweets from the GetOldTweets3 API into a dataframe.
    
    Refer to [https://pypi.org/project/GetOldTweets3/] for additional info and tweet fields.
    
    Parameters
    ----------
    target: the search query, e.g. the username whose mentions are wanted - could be changed easily to text scrape
    start: inclusive date to start scraping
    stop: exclusive date to stop scraping
    
    Example
    -------
    >>> tweet_to_df('@Wendys', '2018-01-01', '2018-02-01')
    returns a dataframe with columns (datetime, username, text, favorites, retweets),
    one row per tweet mentioning '@Wendys' during Jan of 2018
    '''
    tweetCriteria = got.manager.TweetCriteria()\
                                .setQuerySearch(target)\
                                .setSince(start)\
                                .setUntil(stop)
    
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    text_tweets = [[tweet.date, 
                    tweet.username,
                    tweet.text,
                    tweet.favorites, 
                    tweet.retweets]
                   for tweet in tweets]
    
    return pd.DataFrame(text_tweets,
                        columns=['datetime', 
                                 'username',
                                 'text',
                                 'favorites', 
                                 'retweets'])

In [8]:
%%time

# scrape boundaries - NOTE: limit of 10k / 15 mins
start = '2018-01-01'
stop = '2018-01-02'

# scrape-and-store
raw_data = tweet_to_df('@Wendys', start, stop)    
raw_data.shape

CPU times: user 5.92 s, sys: 261 ms, total: 6.18 s
Wall time: 38.8 s


(867, 5)

In [9]:
raw_data.head()  # tweets are ordered with most recent on top

| | datetime | username | text | favorites | retweets |
|---|---|---|---|---|---|
| 0 | 2018-01-01 23:59:56+00:00 | kappi464 | @Wendys Only 2 chains raps are more fresh then... | 0 | 0 |
| 1 | 2018-01-01 23:59:34+00:00 | MyRealityIsMe | When I go to @Wendys this week in the DMV area... | 0 | 0 |
| 2 | 2018-01-01 23:57:56+00:00 | alekzandr904 | my brothers titanium legs are as cold as @McDo... | 2 | 2 |
| 3 | 2018-01-01 23:55:13+00:00 | HouseOfDamaged | check out our clothing line bringing awareness... | 1 | 0 |
| 4 | 2018-01-01 23:52:58+00:00 | SoundDiel | Never going to another place for nuggets. Got ... | 2 | 1 |


---

# Iterating Scrape

As of this writing, Twitter limits API hits to 10k requests per 15 minutes.

How you work around that is up to you; here are a couple of ways that I found.

In [None]:
# manually adjust the dates to different months and wait
start = '2018-02-01'
stop = '2018-03-01'

# after each new scrape append the new data to the old
more_data = tweet_to_df('@Wendys', start, stop)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
raw_data = pd.concat([raw_data, more_data], ignore_index=True)
raw_data.shape

In [18]:
from datetime import date, timedelta

sdate = date(2018, 1, 4)   # start date
edate = date(2018, 12, 31)   # end date

delta = edate - sdate       # as timedelta
# one date per day, from sdate through edate inclusive
dates = [sdate + timedelta(days=i) for i in range(delta.days + 1)]
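Since the scraper takes an inclusive start and an exclusive stop as date strings, the same day list can also be produced directly as (start, stop) pairs. The sketch below is one way to do that; `day_windows` is a hypothetical helper name, not part of GetOldTweets3.

```python
from datetime import date, timedelta

def day_windows(first, last):
    """Yield (start, stop) ISO date strings, one pair per day from first to last inclusive.

    Hypothetical helper: start is inclusive and stop is exclusive,
    matching the convention used by tweet_to_df.
    """
    day = first
    while day <= last:
        yield day.isoformat(), (day + timedelta(days=1)).isoformat()
        day += timedelta(days=1)

windows = list(day_windows(date(2018, 1, 4), date(2018, 1, 6)))
print(windows)
```

Each pair can be passed straight through as `tweet_to_df('@Wendys', start, stop)`.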

In [21]:
# baked-in waiting
for d in dates:
    start = str(d)
    stop = str(d + timedelta(days=1))
    # and scrape
    more_data = tweet_to_df('@Wendys', start, stop)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    raw_data = pd.concat([raw_data, more_data], ignore_index=True)
    print(f'working... {raw_data.shape[0]} rows now')  # for sanity
    time.sleep(60*5)

working... 25741 rows now
working... 30338 rows now
working... 33523 rows now
working... 35776 rows now
working... 37676 rows now
working... 39453 rows now
working... 41401 rows now
working... 43065 rows now
working... 44608 rows now
working... 45783 rows now
working... 47060 rows now
working... 48685 rows now
working... 50398 rows now
working... 51874 rows now
working... 53217 rows now
working... 54744 rows now
working... 55853 rows now
working... 56916 rows now
working... 58123 rows now
working... 59323 rows now
working... 70028 rows now
working... 74149 rows now
working... 76782 rows now
working... 78758 rows now
working... 80057 rows now
working... 81110 rows now
working... 82275 rows now
working... 84985 rows now
working... 87025 rows now
working... 88868 rows now
working... 90555 rows now
working... 93180 rows now
working... 100758 rows now
working... 103177 rows now
working... 105476 rows now
working... 107550 rows now
working... 109603 rows now
working... 112249 rows now
workin

SystemExit: 
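The run above stopped partway through with a `SystemExit`, so everything scraped so far lived only in memory. A possible way to make a long scrape more forgiving is to retry a failed day a few times and write a checkpoint CSV after every successful day. The following is only a sketch under assumptions: `scrape_with_checkpoints`, `fetch`, `pause`, and `max_tries` are hypothetical names, and it assumes a failed scrape surfaces as an ordinary exception or a `SystemExit`, as the output above suggests.

```python
import time
import pandas as pd

def scrape_with_checkpoints(fetch, windows, out_path, pause=0, max_tries=3):
    """Call fetch(start, stop) for each window, saving progress after each success.

    fetch     -- hypothetical callable returning a DataFrame for one window
    windows   -- iterable of (start, stop) date strings
    out_path  -- CSV file rewritten after every successful window
    pause     -- seconds to sleep between windows (scaled up between retries)
    """
    frames = []
    for start, stop in windows:
        for attempt in range(1, max_tries + 1):
            try:
                frames.append(fetch(start, stop))
                break
            except (Exception, SystemExit) as err:
                # assumption: a failed scrape raises instead of returning partial data
                print(f'{start}: attempt {attempt} failed ({err!r})')
                time.sleep(pause * attempt)
        else:
            # every attempt failed; skip this day rather than abort the whole run
            print(f'{start}: giving up, moving on')
            continue
        # checkpoint: everything collected so far is on disk
        pd.concat(frames, ignore_index=True).to_csv(out_path, index=False)
        time.sleep(pause)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# possible usage with the function defined earlier (not run here):
# windows = [(str(d), str(d + timedelta(days=1))) for d in dates]
# raw_data = scrape_with_checkpoints(
#     lambda start, stop: tweet_to_df('@Wendys', start, stop),
#     windows, 'checkpoint.csv', pause=60 * 5)
```

Collecting the daily frames in a list and concatenating them also avoids growing one dataframe row block by row block, at the cost of rewriting the checkpoint file on every day.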

In [23]:
raw_data = raw_data.drop_duplicates().sort_values('datetime').reset_index(drop=True)

raw_data.tail()

| | datetime | username | text | favorites | retweets |
|---|---|---|---|---|---|
| 177206 | 2018-02-27 23:59:22+00:00 | GloBowler | I just voted for Devonte' Graham for the @Wend... | 0 | 0 |
| 177207 | 2018-02-27 23:59:24+00:00 | itsyaboynate14 | @Wendys can you deliver to mi casa? | 0 | 0 |
| 177208 | 2018-02-27 23:59:36+00:00 | JmfSavage | I just had the best smoky mushroom burger, mad... | 0 | 0 |
| 177209 | 2018-02-27 23:59:49+00:00 | storybrooke28 | I just voted for Marvin Bagley III for the @We... | 0 | 0 |
| 177210 | 2018-02-27 23:59:53+00:00 | Coachbird24 | I just voted for Devonte' Graham for the @Wend... | 1 | 0 |
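One caveat with `drop_duplicates()` on all columns: if the same tweet is scraped twice and its favorite or retweet count changed in between, the two rows are not identical and both survive. The toy example below (made-up rows, not scraped data) illustrates the difference when deduplicating on the identifying columns only. Treating `datetime`, `username`, and `text` as the key is an assumption, not something the scraper guarantees.

```python
import pandas as pd

# made-up rows: the first two are the same tweet scraped at different times
toy = pd.DataFrame({
    'datetime': ['2018-01-04 10:00:00+00:00',
                 '2018-01-04 10:00:00+00:00',
                 '2018-01-04 11:00:00+00:00'],
    'username': ['user_a', 'user_a', 'user_b'],
    'text': ['hi @Wendys', 'hi @Wendys', 'yo @Wendys'],
    'favorites': [0, 2, 1],
    'retweets': [0, 0, 0],
})

full_row = toy.drop_duplicates()  # compares every column
by_key = toy.drop_duplicates(subset=['datetime', 'username', 'text'],
                             keep='last')  # keeps the later scrape of each tweet

print(len(full_row), len(by_key))  # 3 2
```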


In [24]:
# drop any duplicates and save to csv

raw_data.drop_duplicates().to_csv('all_mentions.csv', index=False)
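Because stock prices are typically observed daily, the tweet-level rows would probably need to be rolled up to one observation per day before they can enter a VAR alongside the price series. The sketch below shows one way this might look on a toy frame standing in for the saved mentions; the column name `mentions` and the choice of daily sums are assumptions, not the project's settled method.

```python
import pandas as pd

# toy stand-in for the saved mentions (made-up values)
toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-01-04 09:00:00+00:00',
                                '2018-01-04 17:30:00+00:00',
                                '2018-01-05 12:00:00+00:00']),
    'favorites': [2, 0, 1],
    'retweets': [1, 0, 3],
})

by_day = toy.set_index('datetime').resample('D')

# one row per calendar day: total favorites, total retweets, tweet count
daily = by_day.agg({'favorites': 'sum', 'retweets': 'sum'})
daily['mentions'] = by_day.size()

print(daily)
```

Days with no tweets show up as zero rows, which keeps the daily index regular; trading days would still have to be matched against the price series separately.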