# Developing a workaround for Tweepy, to obtain tweets from specific blocks of time

#### Michele Waters

* Problem: I would like to search for tweets from individual users by date range. However, Tweepy's user timeline method bounds a search by tweet id ('since_id' and 'max_id'), not by date

* Solution: I want to get tweet ids that correspond to specific times. We can do this using 'big_ben_clock', a Twitter bot that has tweeted relatively consistently every hour since October 30, 2009.

* Goal: Collect the id of every 'big_ben_clock' tweet posted at midnight (Twitter time, i.e. UTC; 1am London time during British Summer Time) using snscrape, so that these ids can be used to collect tweets on specific dates with Tweepy

### Scrape 'big_ben_clock' tweet urls from 2016-2020

* Use snscrape to get tweet urls from tweets associated with 'big_ben_clock'. These can be obtained all at once, but I found it easier to put urls from each year into their own file, so that I can later use Tweepy in batches 

In [208]:
#!pip install snscrape
import pandas as pd

In [92]:
#To grab all big_ben_clock tweets at once:
#!snscrape twitter-search "@big_ben_clock since:2016-10-30 until:2020-09-27"> big_ben_clock_tweets.txt

In [237]:
#Function to build snscrape syntax
def build_snscrape_criteria(user_name='big_ben_clock', since_date='2019-01-01', until_date='2020-01-01', file_name='big_ben_clock_2019_tweets.txt'):
    return f'!snscrape twitter-search "@{user_name} since:{since_date} until:{until_date}"> {file_name}'

In [236]:
#Grab big_ben_clock 2020 tweets
build_snscrape_criteria(user_name='big_ben_clock', since_date='2020-01-01', until_date='2020-09-30', file_name='big_ben_clock_2020_tweets.txt')

'!snscrape twitter-search "@big_ben_clock since:2020-01-01 until:2020-09-30"> big_ben_clock_2020_tweets.txt'

In [238]:
!snscrape twitter-search "@big_ben_clock since:2020-01-01 until:2020-09-30"> big_ben_clock_2020_tweets.txt

In [239]:
#Grab big_ben_clock 2019 tweets
build_snscrape_criteria(user_name='big_ben_clock', since_date='2019-01-01', until_date='2020-01-01', file_name='big_ben_clock_2019_tweets.txt')

'!snscrape twitter-search "@big_ben_clock since:2019-01-01 until:2020-01-01"> big_ben_clock_2019_tweets.txt'

In [240]:
!snscrape twitter-search "@big_ben_clock since:2019-01-01 until:2020-01-01"> big_ben_clock_2019_tweets.txt

In [241]:
#Grab big_ben_clock 2018 tweets
build_snscrape_criteria(user_name='big_ben_clock', since_date='2018-01-01', until_date='2019-01-01', file_name='big_ben_clock_2018_tweets.txt')

'!snscrape twitter-search "@big_ben_clock since:2018-01-01 until:2019-01-01"> big_ben_clock_2018_tweets.txt'

In [242]:
!snscrape twitter-search "@big_ben_clock since:2018-01-01 until:2019-01-01"> big_ben_clock_2018_tweets.txt

In [243]:
#Grab big_ben_clock 2017 tweets
build_snscrape_criteria(user_name='big_ben_clock', since_date='2017-01-01', until_date='2018-01-01', file_name='big_ben_clock_2017_tweets.txt')

'!snscrape twitter-search "@big_ben_clock since:2017-01-01 until:2018-01-01"> big_ben_clock_2017_tweets.txt'

In [244]:
!snscrape twitter-search "@big_ben_clock since:2017-01-01 until:2018-01-01"> big_ben_clock_2017_tweets.txt

In [245]:
#Grab big_ben_clock 2016 tweets
build_snscrape_criteria(user_name='big_ben_clock', since_date='2016-01-01', until_date='2017-01-01', file_name='big_ben_clock_2016_tweets.txt')

'!snscrape twitter-search "@big_ben_clock since:2016-01-01 until:2017-01-01"> big_ben_clock_2016_tweets.txt'

In [246]:
!snscrape twitter-search "@big_ben_clock since:2016-01-01 until:2017-01-01"> big_ben_clock_2016_tweets.txt
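* The per-year cells above repeat one pattern, so the command strings could also be generated in a loop. This is a minimal sketch that assumes the same file-naming scheme; 'build_snscrape_command', 'build_yearly_commands' and 'last_until' are hypothetical names for illustration, not part of snscrape. It only builds the strings and does not run them.

```python
def build_snscrape_command(user_name, since_date, until_date, file_name):
    # Same search syntax as build_snscrape_criteria above, minus the notebook "!" prefix
    return f'snscrape twitter-search "@{user_name} since:{since_date} until:{until_date}" > {file_name}'

def build_yearly_commands(user_name, first_year, last_year, last_until=None):
    # One command per calendar year; last_until optionally cuts the final year short
    commands = {}
    for year in range(first_year, last_year + 1):
        until_date = f'{year + 1}-01-01'
        if year == last_year and last_until is not None:
            until_date = last_until
        commands[str(year)] = build_snscrape_command(
            user_name, f'{year}-01-01', until_date, f'{user_name}_{year}_tweets.txt')
    return commands

commands = build_yearly_commands('big_ben_clock', 2016, 2020, last_until='2020-09-30')
for year, command in commands.items():
    print(year, command)
```

* Each string can then be pasted into a cell with a leading '!' as in the cells above.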

* Create a dictionary containing tweet urls, with each year as dictionary key

In [247]:
def get_tweet_urls_dict(user_name='big_ben_clock', dates=['2016', '2017', '2018', '2019', '2020']):
    tweet_dict={}
    for year in dates:
        tweet_dict[year]=pd.read_csv(f'{user_name}_{year}_tweets.txt', header=None)
    return tweet_dict

In [248]:
tweet_dict=get_tweet_urls_dict()

In [253]:
tweet_dict['2020'].head()

Unnamed: 0,0
0,https://twitter.com/BigBenTracker/status/13110...
1,https://twitter.com/tableau/status/13077262712...
2,https://twitter.com/big_ben_clock/status/13110...
3,https://twitter.com/atzcnt/status/131101827391...
4,https://twitter.com/BigBenTracker/status/13110...


* The urls scraped above are a mixture of tweets posted by 'big_ben_clock' and tweets from accounts that replied to or mentioned 'big_ben_clock'. Let's get the tweet ids from just the tweets posted by big_ben_clock

In [257]:
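def get_tweet_ids(user_name, tweet_urls):
    #Keep only urls posted by user_name itself; urls from other accounts have a different prefix
    prefix=f'https://twitter.com/{user_name}/status/'
    tweet_ids=[tweet.split(prefix)[1] for tweet in tweet_urls[0] if tweet.startswith(prefix)]
    return tweet_ids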

In [258]:
tweet_id_dict={year: get_tweet_ids('big_ben_clock', tweet_dict[year]) for year in tweet_dict.keys()}

In [261]:
tweet_id_dict['2020'][:5]

['1311034031853756416',
 '1311017917274828801',
 '1311003070843805696',
 '1310987968325332993',
 '1310972617210433537']
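* The url filtering can also be written with a regular expression that captures the account name and the id separately. This is a minimal sketch, not the notebook's own method; 'extract_tweet_ids' is a hypothetical helper, and the two urls marked as made up are placeholders for illustration.

```python
import re

# Matches a status url and captures the account name and the numeric tweet id
STATUS_URL = re.compile(r'^https://twitter\.com/([^/]+)/status/(\d+)')

def extract_tweet_ids(user_name, urls):
    ids = []
    for url in urls:
        match = STATUS_URL.match(url.strip())
        # Compare the whole account name so that similarly named accounts are skipped
        if match and match.group(1).lower() == user_name.lower():
            ids.append(match.group(2))
    return ids

sample_urls = [
    'https://twitter.com/big_ben_clock/status/1311034031853756416',
    'https://twitter.com/BigBenTracker/status/1',      # made-up id
    'https://twitter.com/not_big_ben_clock/status/2',  # made-up account and id
    'https://twitter.com/big_ben_clock/status/1311017917274828801',
]
print(extract_tweet_ids('big_ben_clock', sample_urls))
```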

In [262]:
tweet_id_dict['2016'][:5]

['815345942887092224',
 '815332214871363585',
 '815317116274491392',
 '815302777312342016',
 '815286662502772736']
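* The ids listed above can be sanity-checked without calling the API. Tweet ids from this period are Snowflake ids, whose bits above the lowest 22 hold the creation time in milliseconds since a fixed Twitter epoch. The sketch below decodes that timestamp as a cross-check on the scraped ids; 'tweet_id_to_datetime' is a hypothetical helper name.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch used by Twitter, in milliseconds

def tweet_id_to_datetime(tweet_id):
    # The bits above the lowest 22 hold milliseconds elapsed since the Twitter epoch
    timestamp_ms = (int(tweet_id) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)

# First 2020 id from the list above
print(tweet_id_to_datetime('1311034031853756416'))
```

* The decoded time should land on or just after the hour for a clock tweet, which is a quick way to spot ids that do not belong to the hourly series.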

### Get Tweets from tweet ids using Tweepy

* Now that we have tweet ids for every big_ben_clock tweet from 2016-2020, let's get the actual tweets using Tweepy (Note: requires a Twitter Developer Account. Apply for one here: https://developer.twitter.com/en/apply-for-access)

In [265]:
import tweepy

In [266]:
#get api keys and access token keys 
tokens_df= pd.read_csv('keys_tokens.csv')

In [267]:
#Function to authorize api with keys and tokens
#Note: wait_on_rate_limit=True makes Tweepy pause until the rate limit resets instead of raising an error
def auth_api(api_key, api_secret_key, access_token, access_token_secret):
    auth = tweepy.OAuthHandler(api_key, api_secret_key)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True) 
    return api

In [268]:
#Authorize api using Twitter developer tokens
api=auth_api(api_key=tokens_df['api_key'][0],api_secret_key=tokens_df['api_secret_key'][0],access_token=tokens_df['access_token'][0],access_token_secret=tokens_df['access_token_secret'][0])

In [399]:
#Creation of Tweepy class
class Tweepy():
    def __init__(self, api, user_name):
        self.api=api
        self.user=api.get_user(user_name)
        self.id=self.user.id
        self.follower_num=self.user.followers_count
        self.friend_num=self.user.friends_count
    #Gets user account information, returning select attributes
    def get_user_info_df(self, result='dictionary'):
        select_attributes=['screen_name','name','id', 'profile_location', 'description',\
                           'protected','followers_count', 'friends_count', 'listed_count', \
                           'created_at','profile_image_url','favourites_count','verified', \
                           'statuses_count','lang']
        select_info_dict={k:v for k, v in self.user._json.items() if k in select_attributes}
        if result=='df':
            select_info_dict=pd.DataFrame(select_info_dict, index=[0])
        return select_info_dict
    #Gets last ~3200 tweets from user
    def get_user_tweets(self, since_id=None, max_id=None, num_tweets=0):
        return pd.DataFrame([status._json for status in tweepy.Cursor(self.api.user_timeline, id=self.id, since_id=since_id, max_id=max_id, tweet_mode='extended').items(num_tweets)])
    #Gets tweets from a list of tweet ids (one request per id)
    def get_tweets_from_tweet_id_list(self, id_list=[1308815533052133382]):
        return pd.DataFrame([self.api.get_status(id=tweet_id)._json for tweet_id in id_list])

In [272]:
#Initialize Tweepy class
big_ben_tweepy=Tweepy(api, 'big_ben_clock')

In [273]:
tweet_id_dict.keys()

dict_keys(['2016', '2017', '2018', '2019', '2020'])

In [275]:
tweet_id_dict['2016'][:5]

['815345942887092224',
 '815332214871363585',
 '815317116274491392',
 '815302777312342016',
 '815286662502772736']

In [290]:
#To save time, only the 2020 tweets are retrieved
#Note: there appear to be gaps in 2019 and early 2020, plus other discrepancies in tweet activity, so it is worth double-checking consistency for other date ranges
big_ben_tweet_dict={year: big_ben_tweepy.get_tweets_from_tweet_id_list(tweet_id_dict[year]) for year in ['2020']}

In [344]:
big_ben_tweet_dict['2020'].head()

Unnamed: 0,created_at,id,id_str,text,truncated,entities,source,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,...,place,contributors,is_quote_status,retweet_count,favorite_count,favorited,retweeted,lang,date,hour
0,2020-09-29 20:04:05+00:00,1311034031853756416,1311034031853756416,BONG BONG BONG BONG BONG BONG BONG BONG BONG,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://www.floodgap.com/software/ttyt...",,,,...,,,False,10,72,False,False,tl,9/29/2020,20
1,2020-09-29 19:00:03+00:00,1311017917274828801,1311017917274828801,BONG BONG BONG BONG BONG BONG BONG BONG,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://www.floodgap.com/software/ttyt...",,,,...,,,False,14,87,False,False,tl,9/29/2020,19
2,2020-09-29 18:01:04+00:00,1311003070843805696,1311003070843805696,BONG BONG BONG BONG BONG BONG BONG,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://www.floodgap.com/software/ttyt...",,,,...,,,False,23,83,False,False,tl,9/29/2020,18
3,2020-09-29 17:01:03+00:00,1310987968325332993,1310987968325332993,BONG BONG BONG BONG BONG BONG,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://www.floodgap.com/software/ttyt...",,,,...,,,False,18,90,False,False,tl,9/29/2020,17
4,2020-09-29 16:00:03+00:00,1310972617210433537,1310972617210433537,BONG BONG BONG BONG BONG,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://www.floodgap.com/software/ttyt...",,,,...,,,False,14,83,False,False,tl,9/29/2020,16
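* Fetching one status per request is slow for a year of hourly tweets. The Twitter API also has a batch status lookup that accepts up to 100 ids per call (exposed in Tweepy 3.x as 'statuses_lookup', to the best of my knowledge), so the id list can be split into chunks first. This sketch only shows the chunking; 'chunk_ids', 'fetch_in_batches' and 'fake_lookup' are hypothetical names, and the stand-in lookup never contacts Twitter.

```python
def chunk_ids(id_list, chunk_size=100):
    # Yield successive slices of at most chunk_size ids
    for start in range(0, len(id_list), chunk_size):
        yield id_list[start:start + chunk_size]

def fetch_in_batches(id_list, lookup, chunk_size=100):
    # lookup is any callable that takes a list of ids and returns a list of results,
    # e.g. a batch status-lookup method of an authorized API object (assumption)
    results = []
    for batch in chunk_ids(id_list, chunk_size):
        results.extend(lookup(batch))
    return results

# Stand-in lookup for illustration only: echoes the ids instead of calling Twitter
def fake_lookup(batch):
    return [{'id': tweet_id} for tweet_id in batch]

ids = [str(n) for n in range(250)]  # placeholder ids
print([len(batch) for batch in chunk_ids(ids)])
print(len(fetch_in_batches(ids, fake_lookup)))
```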


* Save tweets to csv files

In [294]:
import os
#Save dataframes to one csv
def save_big_ben_to_csv(big_ben_dict, file_name='big_ben_clock_tweets_2020.csv'):
    for year in big_ben_dict.keys():
        # if the target file does not exist yet, write it with a header
        if not os.path.isfile(file_name):
            big_ben_dict[year].to_csv(file_name, header=True)
        # else append dataframe to csv without header
        else: 
            big_ben_dict[year].to_csv(file_name, mode='a', header=False)
    return 'Saved!'

In [295]:
save_big_ben_to_csv(big_ben_tweet_dict)

'Saved!'
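* The save step follows a "write the header once, then append" pattern, which only works if the existence check looks at the same path that is being written. Here is a minimal standard-library sketch of that pattern. 'append_rows' is a hypothetical helper, the file is created in a temporary directory, and the 'text' values are shortened placeholders.

```python
import csv
import os
import tempfile

def append_rows(file_name, fieldnames, rows):
    # Write the header only when the target file does not exist yet
    write_header = not os.path.isfile(file_name)
    with open(file_name, 'a', newline='') as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

demo_path = os.path.join(tempfile.mkdtemp(), 'demo_tweets.csv')
append_rows(demo_path, ['id', 'text'], [{'id': '1311034031853756416', 'text': 'BONG'}])
append_rows(demo_path, ['id', 'text'], [{'id': '1311017917274828801', 'text': 'BONG BONG'}])

with open(demo_path, newline='') as handle:
    saved = list(csv.reader(handle))
print(saved)
```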

* Now that we have a dictionary containing dataframes for big_ben_clock's 2020 tweets, let's create, for each year, a list of records that pair a specific date with the id of the tweet posted at the chosen hour

In [340]:
#Function to create a dictionary of tweet ids for a specific time each day
#default search_clock_time is set to 0, which corresponds to midnight, twitter time/1am London time/8pm ET
def create_time_id(clock_df, search_clock_time=0):
    clock_df['created_at']=pd.to_datetime(clock_df['created_at'])
    clock_df['date']=clock_df['created_at'].apply(lambda x: f'{x.month}/{x.day}/{x.year}')
    clock_df['hour']=clock_df['created_at'].apply(lambda x: x.hour)
    time_dict_list=[]
    for record in clock_df.to_dict('records'):
        time_dict={}
        if record['hour']==search_clock_time:
            time_dict['date']=record['date']
            time_dict['id']=record['id']
            time_dict_list.append(time_dict)   
    return time_dict_list    

In [341]:
big_ben_time_dicts={year: create_time_id(clock_df=big_ben_tweet_dict[year]) for year in big_ben_tweet_dict.keys()}

In [473]:
big_ben_time_dicts['2020'][:5]

[{'date': '9/29/2020', 'id': 1310731279176998913},
 {'date': '9/28/2020', 'id': 1310368636897554433},
 {'date': '9/27/2020', 'id': 1310007009396371456},
 {'date': '9/26/2020', 'id': 1309644369943764993},
 {'date': '9/25/2020', 'id': 1309281980207562756}]
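* The hour filter can be illustrated on a few rows without pandas. This sketch uses three of the 2020 rows shown earlier and selects the 20:00 tweet, since the raw timestamp of a midnight tweet is not shown in this notebook. 'ids_at_hour' is a hypothetical name.

```python
from datetime import datetime

def ids_at_hour(records, search_clock_time=0):
    # Keep one {'date', 'id'} entry for every tweet created during the requested hour
    selected = []
    for record in records:
        created = datetime.fromisoformat(record['created_at'])
        if created.hour == search_clock_time:
            selected.append({'date': f'{created.month}/{created.day}/{created.year}',
                             'id': record['id']})
    return selected

records = [
    {'created_at': '2020-09-29 20:04:05+00:00', 'id': 1311034031853756416},
    {'created_at': '2020-09-29 19:00:03+00:00', 'id': 1311017917274828801},
    {'created_at': '2020-09-29 18:01:04+00:00', 'id': 1311003070843805696},
]
print(ids_at_hour(records, search_clock_time=20))
```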

In [345]:
time_ids_df=pd.DataFrame(big_ben_time_dicts['2020'])

In [346]:
len(time_ids_df)

231

In [475]:
time_ids_df

Unnamed: 0,date,id
0,9/29/2020,1310731279176998913
1,9/28/2020,1310368636897554433
2,9/27/2020,1310007009396371456
3,9/26/2020,1309644369943764993
4,9/25/2020,1309281980207562756
...,...,...
226,2/4/2020,1224483473341198336
227,2/3/2020,1224120845557280768
228,2/2/2020,1223758208369602564
229,2/1/2020,1223395561912573952


In [348]:
time_ids_df.to_csv('time_ids.csv')

* There appears to be a large gap in tweet activity in January. If every calendar day from January 1 to September 29, 2020 were represented by 'big_ben_clock' activity, there would be 273 rows; we only have 231. This is far from a perfect solution, but it still gives the user somewhat more control over time periods in Tweepy
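* The expected row count and the missing days can be computed directly. This sketch assumes dates in the 'm/d/yyyy' form used in 'time_ids.csv'; 'parse_mdy' and 'missing_dates' are hypothetical helpers. The 'present' list is a small illustrative subset, so whether those particular days are really absent depends on the full file.

```python
from datetime import date, timedelta

def parse_mdy(text):
    # Dates in time_ids.csv look like '2/1/2020' (month/day/year, no zero padding)
    month, day, year = (int(part) for part in text.split('/'))
    return date(year, month, day)

def missing_dates(present, start, end):
    # Every calendar day in [start, end] that has no entry in present
    have = {parse_mdy(text) for text in present}
    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=offset) for offset in range(total_days))
    return [day for day in all_days if day not in have]

# Calendar days from January 1 through September 29, 2020 (a leap year)
expected_rows = (date(2020, 9, 29) - date(2020, 1, 1)).days + 1
print(expected_rows)

# Illustrative subset only, not the full table
present = ['2/1/2020', '2/2/2020', '2/3/2020', '2/4/2020']
gaps = missing_dates(present, date(2020, 1, 30), date(2020, 2, 4))
print(gaps)
```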

* Let's try it out:

### Scrape tweets from other users using defined time/tweet ids from big_ben_clock

* Step 1: Import time ids from csv

In [354]:
loaded_time_df=pd.read_csv('time_ids.csv', index_col=0)

* Step 2: Write function to return tweet id from date 

In [517]:
#Function to return tweet id from given date
def get_tweepy_date_id(time_id_file='time_ids.csv', date='2/1/2020'):
    loaded_time_df=pd.read_csv(time_id_file, index_col=0)
    time_dict={record['date']:record['id'] for record in loaded_time_df.to_dict('records')}
    if date in time_dict:
        return time_dict[date]
    else:
        return 'Date not present; choose a different date'

In [518]:
get_tweepy_date_id(date='1/1/2020')

1212161809018437633

In [519]:
type(get_tweepy_date_id(date='6/1/2020')) is int and type(get_tweepy_date_id(date='1/2/2020')) is int

False

In [520]:
get_tweepy_date_id(date='1/2/2020')

'Date not present; choose a different date'
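* Because some dates are missing, one option is to fall back to the closest earlier date that does have an id. This is a sketch of that idea, not something the notebook itself implements. 'nearest_earlier_id' is a hypothetical helper, and the two entries are ids shown elsewhere in this notebook.

```python
from datetime import date

def parse_mdy(text):
    # 'm/d/yyyy' string to a date object
    month, day, year = (int(part) for part in text.split('/'))
    return date(year, month, day)

def nearest_earlier_id(time_ids, wanted):
    # time_ids maps 'm/d/yyyy' strings to tweet ids; wanted is an 'm/d/yyyy' string
    target = parse_mdy(wanted)
    candidates = [(parse_mdy(day), tweet_id) for day, tweet_id in time_ids.items()
                  if parse_mdy(day) <= target]
    if not candidates:
        return None  # nothing on or before the requested date
    found_day, tweet_id = max(candidates)
    return f'{found_day.month}/{found_day.day}/{found_day.year}', tweet_id

time_ids = {'1/1/2020': 1212161809018437633, '2/1/2020': 1223395561912573952}
print(nearest_earlier_id(time_ids, '1/2/2020'))
```

* Returning the matched date alongside the id makes it visible when the requested date was substituted.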

In [525]:
#Function to prompt for a date range until both dates map to a tweet id
def get_date_ranges(time_id_file='time_ids.csv', since_date='6/1/2020', until_date='9/29/2020'):
    not_valid=True
    while not_valid:
        since_date=input("Enter start date for scraping (e.g. '6/1/2020'): ")
        since_id= get_tweepy_date_id(time_id_file=time_id_file, date=since_date)
        until_date=input("Enter end date for scraping (e.g. '9/29/2020'): ")
        until_id= get_tweepy_date_id(time_id_file=time_id_file, date=until_date)
        #get_tweepy_date_id returns a message string when a date is missing
        if not isinstance(since_id, str) and not isinstance(until_id, str):
            not_valid=False
    #Return only after both ids are valid
    return [since_id, until_id]

In [536]:
get_tweepy_date_id(time_id_file='time_ids.csv', date='9/19/2020')

1307107654427463682

In [531]:
get_date_ranges(time_id_file='time_ids.csv')

Enter start date for scraping (e.g. '6/1/2020'): 6/1/2020
Enter end date for scraping (e.g. '9/29/2020'): 9/29/2020


[1267244474989793280, 1310731279176998913]

* Step 3: Retrieve tweets from Tweepy for a different user. Let's get Barack Obama's tweets from Feb 28-May 24 2020. 
* First, double check dates are available:

In [481]:
get_tweepy_date_id(date='2/28/2020')

1233180394251575297

In [482]:
get_tweepy_date_id(date='5/24/2020')

1264345877252096006

* Initialize Tweepy class

In [435]:
barack_tweets=Tweepy(api, 'barackobama')

* Use the 'get_user_tweets' method from the Tweepy class and the 'get_tweepy_date_id' function to set the date range from which to collect tweets.

In [466]:
#'since_id' is the tweet id marking the start date (exclusive); 'max_id' is the tweet id marking the end date (inclusive); num_tweets is the number of tweets you want to collect
#num_tweets=0 default returns all tweets (~last 3200)
bo_df=barack_tweets.get_user_tweets(since_id=get_tweepy_date_id(date='2/28/2020'), max_id=get_tweepy_date_id(date='5/24/2020'))
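* The two bounds are not symmetric: the timeline returns ids greater than 'since_id' and less than or equal to 'max_id'. The sketch below mirrors that rule locally on a handful of ids taken from this notebook; 'within_id_window' is a hypothetical helper, and no API call is made.

```python
def within_id_window(tweet_ids, since_id=None, max_id=None):
    # Mirrors the timeline bounds: ids must be greater than since_id and at most max_id
    kept = []
    for tweet_id in tweet_ids:
        if since_id is not None and tweet_id <= since_id:
            continue
        if max_id is not None and tweet_id > max_id:
            continue
        kept.append(tweet_id)
    return kept

since_id = 1233180394251575297  # big_ben_clock midnight tweet, 2/28/2020
max_id = 1264345877252096006    # big_ben_clock midnight tweet, 5/24/2020
candidates = [
    1233180394251575297,  # equal to since_id
    1263221534094774272,  # inside the window
    1264314146717384712,  # inside the window
    1264345877252096006,  # equal to max_id
    1267244474989793280,  # big_ben_clock midnight tweet, 6/1/2020
]
window = within_id_window(candidates, since_id, max_id)
print(window)
```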

In [499]:
bo_df.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,source,in_reply_to_status_id,in_reply_to_status_id_str,...,favorited,retweeted,possibly_sensitive,lang,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status,extended_entities,retweeted_status
0,Sat May 23 21:55:57 +0000 2020,1264314146717384712,1264314146717384712,And here’s more on the approach Sweden has tak...,False,"[0, 117]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""https://mobile.twitter.com"" rel=""nofo...",1.264314e+18,1.2643141458995036e+18,...,False,False,False,en,,,,,,
1,Sat May 23 21:55:56 +0000 2020,1264314145899503616,1264314145899503616,South Korea has focused on testing to guard ag...,False,"[0, 87]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""https://mobile.twitter.com"" rel=""nofo...",1.264314e+18,1.2643141438694605e+18,...,False,False,False,en,,,,,,
2,Sat May 23 21:55:56 +0000 2020,1264314143869460484,1264314143869460484,As all 50 states begin the process of reopenin...,False,"[0, 251]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,False,False,False,en,,,,,,
3,Sat May 23 18:03:41 +0000 2020,1264255694729084930,1264255694729084930,The Class of 2020 is full of the leaders we ne...,False,"[0, 211]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://twitter.com/download/iphone"" r...",,,...,False,False,False,en,1.26424e+18,1.264239670184968e+18,"{'url': 'https://t.co/Qlr5ivBvpJ', 'expanded':...",{'created_at': 'Sat May 23 17:00:00 +0000 2020...,,
4,Wed May 20 21:34:17 +0000 2020,1263221534094774272,1263221534094774272,"As Chicago navigates the health crisis, its re...",False,"[0, 237]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://twitter.com/download/iphone"" r...",,,...,False,False,False,en,1.263166e+18,1.2631661509234688e+18,"{'url': 'https://t.co/ALo7kusc8K', 'expanded':...",{'created_at': 'Wed May 20 17:54:13 +0000 2020...,,


In [498]:
bo_df.iloc[0]

created_at                                      Sat May 23 21:55:57 +0000 2020
id                                                         1264314146717384712
id_str                                                     1264314146717384712
full_text                    And here’s more on the approach Sweden has tak...
truncated                                                                False
display_text_range                                                    [0, 117]
entities                     {'hashtags': [], 'symbols': [], 'user_mentions...
source                       <a href="https://mobile.twitter.com" rel="nofo...
in_reply_to_status_id                                              1.26431e+18
in_reply_to_status_id_str                                  1264314145899503616
in_reply_to_user_id                                                     813286
in_reply_to_user_id_str                                                 813286
in_reply_to_screen_name                             

In [490]:
bo_df.columns

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'source', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo',
       'coordinates', 'place', 'contributors', 'is_quote_status',
       'retweet_count', 'favorite_count', 'favorited', 'retweeted',
       'possibly_sensitive', 'lang', 'quoted_status_id',
       'quoted_status_id_str', 'quoted_status_permalink', 'quoted_status',
       'extended_entities', 'retweeted_status'],
      dtype='object')

In [544]:
bo_df.to_csv('tweet_csv/barack_tweets.csv', index=None)

In [542]:
new_df=pd.read_csv('tweet_csv/barack_tweets.csv')

In [543]:
new_df.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,source,in_reply_to_status_id,in_reply_to_status_id_str,...,favorited,retweeted,possibly_sensitive,lang,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status,extended_entities,retweeted_status
0,Sat May 23 21:55:57 +0000 2020,1264314146717384712,1264314146717384712,And here’s more on the approach Sweden has tak...,False,"[0, 117]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""https://mobile.twitter.com"" rel=""nofo...",1.264314e+18,1.264314e+18,...,False,False,False,en,,,,,,
1,Sat May 23 21:55:56 +0000 2020,1264314145899503616,1264314145899503616,South Korea has focused on testing to guard ag...,False,"[0, 87]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""https://mobile.twitter.com"" rel=""nofo...",1.264314e+18,1.264314e+18,...,False,False,False,en,,,,,,
2,Sat May 23 21:55:56 +0000 2020,1264314143869460484,1264314143869460484,As all 50 states begin the process of reopenin...,False,"[0, 251]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""https://mobile.twitter.com"" rel=""nofo...",,,...,False,False,False,en,,,,,,
3,Sat May 23 18:03:41 +0000 2020,1264255694729084930,1264255694729084930,The Class of 2020 is full of the leaders we ne...,False,"[0, 211]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://twitter.com/download/iphone"" r...",,,...,False,False,False,en,1.26424e+18,1.26424e+18,"{'url': 'https://t.co/Qlr5ivBvpJ', 'expanded':...",{'created_at': 'Sat May 23 17:00:00 +0000 2020...,,
4,Wed May 20 21:34:17 +0000 2020,1263221534094774272,1263221534094774272,"As Chicago navigates the health crisis, its re...",False,"[0, 237]","{'hashtags': [], 'symbols': [], 'user_mentions...","<a href=""http://twitter.com/download/iphone"" r...",,,...,False,False,False,en,1.263166e+18,1.263166e+18,"{'url': 'https://t.co/ALo7kusc8K', 'expanded':...",{'created_at': 'Wed May 20 17:54:13 +0000 2020...,,
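* Note that some id columns come back as floats in the tables above (for example '1.264314e+18'). A 64-bit float cannot hold every integer of that size, so ids stored this way can lose their last digits. This sketch shows the effect with the standard library, using one id from the table; keeping ids as text avoids it. With pandas, passing a string 'dtype' for the id columns to 'read_csv' is one way to do the same.

```python
import csv
import io

tweet_id = 1264314143869460484

# Converting to float and back changes the trailing digits of this id
as_float = int(float(tweet_id))
print(as_float)
print(as_float == tweet_id)

# Keeping the id as text through a csv round trip preserves every digit
buffer = io.StringIO()
csv.writer(buffer).writerows([['id_str'], [str(tweet_id)]])
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(int(rows[0]['id_str']) == tweet_id)
```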


* While not perfect, this method allows the user to target specific dates to collect tweets from, instead of attempting to scrape all of a user's tweets before limiting the date range.