## Pulling Data with Tweepy

**By:** _Jordan McNea_ 

In [3]:
import datetime
import tweepy

# Using my own API Keys
from MD_API_Keys import api_key, api_key_secret, access_token, access_token_secret

In [4]:
# Authenticate the Tweepy API
auth = tweepy.OAuthHandler(api_key,api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)

The call below, left commented out here, checks that I am authorized to pull Twitter data.

In [3]:
#api.verify_credentials()

## Grab follower IDs

I had the WNBA Finals on in the background while creating this notebook, so I will be collecting followers of the Seattle Storm and Las Vegas Aces, the two finalists. Twitter rate-limits follower-ID requests in 15-minute (900-second) windows: the standard API allows 15 requests per window, each returning up to 5,000 IDs. Tweepy makes its requests as fast as it can and then sleeps until the window resets, rather than spreading them evenly over the 15 minutes. Before we start grabbing follower IDs, let's first check how long it will take. To do this we'll read the followers_count attribute of the user object that Tweepy returns.

In [4]:
# I'm putting the handles in a list to iterate through below
team_handles = ['seattlestorm', 'LVAces']


# This will iterate through each Twitter handle that we're collecting from
for screen_name in team_handles:
    
    # Tells Tweepy we want information on the handle we're collecting from
    # The next line specifies which information we want, which in this case is the number of followers 
    user = api.get_user(screen_name) 
    followers_count = user.followers_count

    # Let's see roughly how long it will take to grab all the follower IDs. 
    # At about 5,000 IDs per minute, split the total minutes into whole hours and leftover minutes
    hours, minutes = divmod(followers_count / 5000, 60)
    print(f'''
    @{screen_name} has {followers_count} followers. 
    That will take roughly {hours:.0f} hours and {minutes:.2f} minutes
    ''')
    


    @seattlestorm has 72022 followers. 
    That will take roughly 0 hours and 14.40 minutes
    

    @LVAces has 42665 followers. 
    That will take roughly 0 hours and 8.53 minutes
    


It looks like there should only be one fifteen-minute break. Tweepy will grab the Storm's followers first, hit the rate limit around the time it moves on to the Aces, and then sleep until the window resets. Let's run it and see how long it actually takes.
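
The estimate can be made a little more precise by counting requests instead of minutes. The following is a minimal sketch, assuming 5,000 IDs per request and 15 requests per 15-minute window (both numbers are assumptions about the standard follower-ID endpoint and may change). The function name `estimate_waits` is hypothetical and is not part of Tweepy.

```python
import math

IDS_PER_REQUEST = 5000      # assumed page size for follower IDs
REQUESTS_PER_WINDOW = 15    # assumed number of requests allowed per window
WINDOW_MINUTES = 15         # assumed length of one rate-limit window


def estimate_waits(follower_counts):
    """Return (total_requests, number_of_sleeps) for a list of follower counts."""
    # Every account needs at least one request, even if it has no followers
    total_requests = sum(max(1, math.ceil(count / IDS_PER_REQUEST))
                         for count in follower_counts)
    # Each full window of requests after the first one costs one sleep
    windows = math.ceil(total_requests / REQUESTS_PER_WINDOW)
    return total_requests, max(windows - 1, 0)


requests_needed, sleeps = estimate_waits([72022, 42665])
print(f'{requests_needed} requests, about {sleeps * WINDOW_MINUTES} minutes of waiting')
```

With the two follower counts printed above, this gives 24 requests and a single 15-minute wait, which matches the expectation of one break.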

In [5]:
# This creates a dictionary containing a list for each Twitter handle we'll be grabbing follower IDs from
id_dict = {'seattlestorm' : [],
           'LVAces' : []}

# Grabs the time when we start making requests to the API
start_time = datetime.datetime.now()

# .keys() allows us to iterate through each key in the dictionary
for handle in id_dict.keys():
    
    # Each page contains up to 5,000 records. Since we know both the Storm and the Aces have
    # many more than 5,000 followers, we must iterate through the pages to get all follower IDs
    # To grab the follower IDs, we will be using followers_ids
    for page in tweepy.Cursor(api.followers_ids,
                              # This is how we will get around the issue of not being able to grab all ids at once
                              # Once the rate limit is hit, we will be notified that we must wait 15 mins (900 secs)
                              wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True,
                              screen_name=handle).pages():

        # The page variable comes back as a list, so we have to use .extend rather than .append
        id_dict[handle].extend(page)
        

# Let's see how long it took to grab all follower IDs
end_time = datetime.datetime.now()
elapsed_time = end_time - start_time
print(elapsed_time)

Rate limit reached. Sleeping for: 896


0:15:09.290365


The elapsed time the first time I ran the above code was 0:15:09.

Let's look at some ids we gathered.

In [6]:
id_dict['seattlestorm'][:10]

[139873846,
 234047917,
 324498854,
 616338216,
 169716178,
 1687658880,
 466716940,
 1319392911751041026,
 1319391642592464896,
 1319388857863790592]

You'll notice they are all numbers. This is because IDs are different from screen names. To see the Twitter handles we gathered, we'll have to look up each ID and read its screen_name attribute.

In [7]:
users = id_dict['seattlestorm'][:10]

for name in users:
    
    user = api.get_user(name)
    print(user.screen_name)

TheRealOrejuela
AlphaBush68
darthsadie
jay_dee_aye
SpliffsofWizdom
Ripcat1
Momof2Mom
Okikiol70820886
ismaiil6699
Patrice01852956
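
Looking up one ID at a time costs one request per account. Twitter's users/lookup endpoint accepts up to 100 IDs per request, and Tweepy 3.x exposes it as `api.lookup_users`. The keyword argument name differs between Tweepy versions, so treat the commented call below as an assumption. Here is a minimal sketch of the batching step, where `chunked` is a hypothetical helper and the IDs are stand-ins:

```python
def chunked(ids, size=100):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


sample_ids = list(range(1, 251))   # stand-in IDs for illustration only
batch_sizes = [len(batch) for batch in chunked(sample_ids)]
print(batch_sizes)

# With an authenticated Tweepy 3.x `api` object, each batch could then be
# resolved in a single request (not run here because it needs credentials):
# for batch in chunked(id_dict['seattlestorm']):
#     for user in api.lookup_users(user_ids=batch):
#         print(user.screen_name)
```

For 250 IDs this produces three batches, so three requests instead of 250.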


## Grab descriptions based on the followers IDs

That looks much better. We can get all sorts of information from the ID. We don't just want screen names, though, since they don't tell us much on their own. Let's grab each account's screen name and description and write them to a text file for each team account.

In [8]:
headers = ['screen_name','description']

for team in id_dict.keys():
    
    # Descriptions with emoji or non-Roman letters can cause trouble. Encoding your .txt file in utf-8 will help
    with open(f'{team}_followers.txt','w', encoding='utf-8') as out_file:
        out_file.write('\t'.join(headers) + '\n')

        for idx, ids in enumerate(id_dict[team]):
            
            # For accounts set to private, we won't be able to get the description unless we follow them
            # Putting in a try/except statement, we can get around this issue.
            try:
                user = api.get_user(ids)
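                # Strip tabs and newlines so each user stays on one tab-separated line
                description = str(user.description).replace('\t',' ').replace('\n',' ')
                outline = [user.screen_name, description]
                
                out_file.write('\t'.join([str(item) for item in outline]) + '\n')
                
            # Catch only Tweepy's own errors so other failures (or Ctrl+C) still surface
            except tweepy.TweepError:
                continue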
                
            if idx == 100:
                break
            

I can see that text files have been created for both teams' followers and their descriptions. Each line holds a follower's screen name and description separated by a tab. I was surprised to see that the emojis all came through properly.
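
Tabs and line breaks inside a description matter because the file uses a tab between columns and a newline between rows. The sketch below shows one way to sanitize a field before writing it. `clean_field` is a hypothetical helper, the row is made-up data, and it collapses every run of whitespace to one space, which is slightly more aggressive than replacing only tabs and newlines.

```python
import io


def clean_field(value):
    """Collapse tabs, line breaks and repeated spaces into single spaces."""
    return ' '.join(str(value).split())


# Made-up example row; real rows would come from the Tweepy user objects
rows = [('example_user', 'Storm fan\tsince 2000\nSeattle, WA')]

buffer = io.StringIO()   # stands in for the output file
buffer.write('\t'.join(['screen_name', 'description']) + '\n')
for screen_name, description in rows:
    buffer.write('\t'.join([screen_name, clean_field(description)]) + '\n')

lines = buffer.getvalue().splitlines()
print(lines)
```

After cleaning, every record occupies exactly one line with exactly one tab, so the file can be read back column by column.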

## Grabbing Tweets by search terms

Tweepy also lets users grab tweets based on search terms. October 10th was World Mental Health Day, so let's look at tweets containing its official hashtag. Twitter search allows standard search operators (<a href="https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/overview/standard-operators">read more here</a>). The since and until operators restrict the date range, and -filter:retweets excludes retweets. Because the search API only goes back 7 days, the code below searches the last two days rather than October 10th itself.
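
To make the operators concrete, here is a minimal sketch of how such a query string can be assembled for a single day. `build_query` is a hypothetical helper, and the sketch assumes that `until` is exclusive (tweets sent before that date), so covering one day means setting `until` to the following day. A window this old would only return results if the search ran within the API's 7-day limit.

```python
import datetime


def build_query(term, since, until, exclude_retweets=True):
    """Combine a search term with since/until dates and an optional retweet filter."""
    parts = [term, f'since:{since.isoformat()}', f'until:{until.isoformat()}']
    if exclude_retweets:
        parts.append('-filter:retweets')
    return ' '.join(parts)


day = datetime.date(2020, 10, 10)
query = build_query('#WorldMentalHealthDay', day, day + datetime.timedelta(days=1))
print(query)
```

The resulting string can be passed as the `q` argument of a search call.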


In [9]:
# Note: the search API only goes back 7 days
date_until = datetime.date.today()
date_since = date_until - datetime.timedelta(days=2)

search_words = f'#WorldMentalHealthDay since:{date_since} until:{date_until} -filter:retweets'

# Notice the differences between searching tweets and users. 
for idx, item in enumerate(tweepy.Cursor(api.search,
                   # By default tweets are truncated to 140 characters; 'extended' returns the whole text in full_text
                   tweet_mode='extended',
                   q=search_words,
                   lang='en').items()):
    
    # There's all sort of information you can get from Tweets
    # Find more tweet objects here: https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/tweet-object
    print(item.user.screen_name)
    print(item.created_at)
    print(item.full_text)
    print('-'*40)
    
    if idx == 50:
        break
    

raccoonfly
2020-10-21 23:00:37
"Laughter and joy is the best medicine!" Discover more secrets to good health with Mu Sen Shen, a doctor of traditional Chinese medicine! #WorldMentalHealthDay
#laugh

https://t.co/IaRVrr0XOy https://t.co/MtsrH8F7Ho
----------------------------------------
donna_sibanda
2020-10-21 22:57:38
Please take a look and also share.

https://t.co/S0J4wlTsE7
.
.
#thanks #mini #help #classiccars #miniclub #health #WorldMentalHealthDay #gofundme #wishlist #topgear #donate #happy #mechanic #austinmini #love #ThisMorning #carsos #KindnessMatters #LoveYourself
----------------------------------------
chiiinnnyyy_
2020-10-21 22:54:09
If u need someone to talk to rn dm me ur mental health matters😩😩😩😩😩😩 #WorldMentalHealthDay #EndPoliceBrutalityinNigeraNOW
----------------------------------------
AlexandriaDrain
2020-10-21 22:50:02
The truth is, I don't know who I am beyond depression, because I've dealt with it my entire life. I've been on and of meds since I was in elemen

TamesideCouncil
2020-10-21 19:46:02
Mask your face, not your feelings. 

If you are struggling to cope, @GiveUsAShout offers free, 24/7, confidential mental health support. Text SHOUT to 85258 to start a conversation with a trained volunteer.
 
This #WorldMentalHealthDay, remember you're not alone 💙 @GiveUsAShout https://t.co/X6D8UFjK7s
----------------------------------------
CassieTheBilli1
2020-10-21 19:45:01
May God heal this land.I am sincerely tired.
#peace #WorldMentalHealthDay #FAITH #5for5
----------------------------------------
sirtheojay
2020-10-21 19:43:23
Protect your mental health, if you need to stay off social media for now, please do.

Nigeria will be great again and it will happen  in our lifetime.
#Endsars 
#WorldMentalHealthDay
----------------------------------------
saveroftheworld
2020-10-21 19:42:01
@JoeBiden it´s time for #COVID19 memorial, national/international monument, one/several?, unique/different?, non-profit architectural/arts concurrence by students/a

For each tweet, the above code prints the user's screen name, the date and time posted, the full text of the tweet, and a line of 40 dashes to separate it from the next tweet.

It's also possible to use this search feature to grab the mentions of a Twitter account. Mentions are any tweet where another user's handle is included (i.e. they are mentioned in the tweet).

In [10]:
search_words = '@GovernorBullock -filter:retweets'


tweets_all = tweepy.Cursor(api.search,
                   tweet_mode='extended',
                   q=search_words,
                   lang='en').items()

# Collect the attributes we want from each Tweet into a tuple, and put all those tuples into a list
tweets = [(tweet.full_text,tweet.created_at,tweet.user.screen_name) for tweet in tweets_all]


In [11]:
tweets[:10]

[('@DainesforMT @GovernorBullock Here’s the environmental voting record of this climate change denialist.\nhttps://t.co/JrhGzV5l2A',
  datetime.datetime(2020, 10, 22, 23, 23, 38),
  'HeidiHugh8'),
 ('@DainesforMT @GovernorBullock Vote for @stevebullockmt \nhttps://t.co/ELTmrbIMwW',
  datetime.datetime(2020, 10, 22, 23, 22, 48),
  'Woodysapsucker'),
 ('Yas! Come through blue wave. Let’s keep running up the votes my fellow #Americans! @JoeBiden &amp; @KamalaHarris need an undeniable LANDSLIDE ushering in new US Senators: @HarrisonJaime @AmyMcGrathKY @CaptMarkKelly @Hickenlooper @GreenfieldIowa @GaryPeters @GovernorBullock et al. —TW https://t.co/rlvUU98kL8',
  datetime.datetime(2020, 10, 22, 23, 22, 2),
  'WeHoTim'),
 ("@GovernorBullock \nIllegally declaring a state of emergency when an emergency doesn't exist. You are stealing money for this pretend emergency. Do you actually know what the definition of emergency is? I do. And this is not it!",
  datetime.datetime(2020, 10, 22, 23, 3, 1

The output above shows the first ten 'mention' tweets, in which @GovernorBullock was tagged at least once. Often a tweet mentions several accounts at once.
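
To see which accounts are tagged in a given tweet, the handles can be pulled out of the text. This is a minimal sketch using a deliberately simple regular expression. `extract_mentions` is a hypothetical helper, and the pattern would also match things like the domain part of an email address, so it is only an approximation. The sample text is adapted from one of the tweets above.

```python
import re

# Simplified pattern: an @ sign followed by letters, digits or underscores
MENTION_PATTERN = re.compile(r'@(\w+)')


def extract_mentions(text):
    """Return the handles mentioned in a tweet's text, in order of appearance."""
    return MENTION_PATTERN.findall(text)


sample = '@DainesforMT @GovernorBullock Vote for @stevebullockmt'
mentions = extract_mentions(sample)
print(mentions)
```

Tweet objects from the v1.1 API also carry an `entities` field that lists user mentions, which is likely more reliable than parsing the text when it is available.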

For the assignment, I will be pulling Twitter data from the Badgers and the Gophers (Go Badgers!).

In [5]:
# Create a dictionary that contains a list for each team's Twitter handle
id_dict = {'BadgersFootball' : [],
           'GopherFootball' : []}

In [None]:
for handle in id_dict.keys():
    
    # Each page contains up to 5,000 records. Since we know both the Badgers and the Gophers have
    # many more than 5,000 followers, we must iterate through the pages to get all follower IDs
    # To grab the follower IDs, we will be using followers_ids
    for page in tweepy.Cursor(api.followers_ids,
                              # This is how we will get around the issue of not being able to grab all ids at once
                              # Once the rate limit is hit, we will be notified that we must wait 15 mins (900 secs)
                              wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True,
                              screen_name=handle).pages():

        # The page variable comes back as a list, so we have to use .extend rather than .append
        id_dict[handle].extend(page)

In [None]:
headers = ['screen_name','name', 'id', 'location', 'followers_count', 'friends_count', 'description']

for team in id_dict.keys():
    
    # Descriptions with emoji or non-Roman letters can cause trouble. Encoding your .txt file in utf-8 will help
    with open(f'{team}_followers.txt','w', encoding='utf-8') as out_file:
        out_file.write('\t'.join(headers) + '\n')

        for idx, ids in enumerate(id_dict[team]):
            
            # For accounts set to private, we won't be able to get the description unless we follow them
            # Putting in a try/except statement, we can get around this issue.
            try:
                user = api.get_user(ids)
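                # Strip tabs and newlines so each user stays on one tab-separated line
                description = str(user.description).replace('\t',' ').replace('\n',' ')
                outline = [user.screen_name, user.name, user.id, user.location, user.followers_count, user.friends_count, description]
                
                out_file.write('\t'.join([str(item) for item in outline]) + '\n')
                
            # Catch only Tweepy's own errors so other failures (or Ctrl+C) still surface
            except tweepy.TweepError:
                continue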
                
            if idx == 100000:
                break
            