To run this notebook, you will need to use your own Twitter API authentication keys in the necessary cells.

# **Exploring Topic Clusters in the #NBA Twittersphere**

# **Tweet Scraping**

Twitter standard search API - `twitter.api.Api.GetSearch()` [(Documentation)](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets)

- Probably best for analyzing trends, opinions, or popular subtopics of a certain search term
- **"Please note that Twitter's search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface."**
    - We should thus try to collect as many tweets as reasonably possible!
- The maximum number of tweets retrievable from a single API request is 100, but this must be specified through the `.GetSearch()` parameter `count` (the default is 15)
- Rate-limited to 180 *requests* per 15 minutes
- The above two limits imply a max retrieval rate of **20 tweets every second**!
    - This is a *maximum* rate!  We might not want to request tweets at high rates to be on the conservative side of API usage
    - Another downside of requesting tweets this frequently is that **we may not get 100 *different* tweets in a 5-second window** (900 seconds / 180 requests)!  Based on my own test requests and checks for duplicate rows, it may be better to request 100 tweets only every 10-15 minutes so that each request returns a mostly unique set of tweets.  Even if the data does end up with duplicate rows, they are easy to store and to delete later
- Limited to tweets from **7 days prior to time of query**
    - Try to do frequent queries to capture a certain time interval as best as possible
    - Ultimately we are "time-limited" or "time-sensitive" - potential cluster topics are based on certain events in the NBA (good games, injuries, etc)
- Think carefully about the search term - if you try to search something like "jazz" (e.g. the NBA team Utah Jazz), you will inevitably get some music-related content. Similarly, it's possible that "NBA" returns non-basketball related content - the acronym could stand for many things other than "National Basketball Association".  Might be safer to search for hashtags ("#raptors", "#nba") instead of just text
    - With that said, we could search the [team-specific hashtags](https://news.sportslogos.net/2019/10/25/nba-twitter-team-emoji-hashtags-2019-2020/) (e.g. #WeTheNorth) to analyze opinions about teams or their players, but tweet frequency/volume for any of these is likely low
- Treat RTs as single tweets?
    - Drop the "RT" or include as stopword
    - **For RTs, the tweet text is truncated, and the full RT'd tweet is included a bit deeper in the JSON object (under the `retweeted_status` key)**

Ultimately, gathering a large amount of data is no problem at all!
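
The rate arithmetic in the list above can be checked with a few lines. This is only a sketch: the variable names are made up for illustration, and the limits are simply the ones quoted above.

```python
# Limits quoted above for the standard search API
max_tweets_per_request = 100   # upper bound of the `count` parameter
requests_per_window = 180      # rate limit per window
window_seconds = 15 * 60       # 15-minute window, in seconds

# Minimum spacing between requests that still respects the rate limit
seconds_per_request = window_seconds / requests_per_window

# Theoretical maximum retrieval rate
tweets_per_second = max_tweets_per_request * requests_per_window / window_seconds

print(f'{seconds_per_request:.0f} s between requests, '
      f'up to {tweets_per_second:.0f} tweets/s')
```

In practice the useful rate is lower, since back-to-back requests tend to return overlapping tweets.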

---

In [1]:
# import necessary packages
import pandas as pd
import twitter # to interact with the Twitter API
from datetime import datetime # for creating datetime strings for naming file
from time import sleep # to add a timer in between scrapes
import glob # for file path names

Using the Python `twitter` package, let's initiate an instance of the `twitter.api.Api` object from which we can make calls to the Twitter API.

**Please use your own consumer key, secret consumer key, access token key, and secret access token key to initialize the API instance!**

In [None]:
# initialize Twitter API instance with your own credentials
twitter_api = twitter.Api(
    consumer_key = '',
    consumer_secret = '',
    access_token_key = '',
    access_token_secret = '',
    tweet_mode = 'extended' # return full tweets instead of the truncated (140 character) version
) 

# verify credentials
twitter_api.VerifyCredentials()

## **Exploring the Twitter Standard Search API**

In [3]:
# make a .GetSearch() call to the Twitter API 
test_fetch = twitter_api.GetSearch(
    term = '#nba', # search term
    lang = 'en', # English only
    result_type = 'recent', # don't want only the popular tweets
    include_entities = True, # extra information we might need?
    return_json = True, # return the request as a JSON object
    count = 2 # number of tweets to return
)

In [4]:
type(test_fetch) # check type

dict

In [5]:
test_fetch # preview

{'statuses': [{'created_at': 'Thu Apr 16 20:59:37 +0000 2020',
   'id': 1250891622264561665,
   'id_str': '1250891622264561665',
   'full_text': 'RT @OuttaHiura: #NBATwitter\xa0\xa0 GAIN TIME!\n\nIf you are a fan of the #NBA\xa0\xa0\xa0and you follow back:\n\n• LIKE THIS TWEET\n• FOLLOW EVERYONE WHO…',
   'truncated': False,
   'display_text_range': [0, 135],
   'entities': {'hashtags': [{'text': 'NBATwitter', 'indices': [16, 27]},
     {'text': 'NBA', 'indices': [66, 70]}],
    'symbols': [],
    'user_mentions': [{'screen_name': 'OuttaHiura',
      'name': 'Zacv2',
      'id': 1228502629233328129,
      'id_str': '1228502629233328129',
      'indices': [3, 14]}],
    'urls': []},
   'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
   'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
   'in_reply_to_status_id': None,
   'in_reply_to_status_id_str': None,
   'in_reply_to_user_id': None,
   'in_reply_to_user_id_str': None,
 

Because we passed `return_json = True`, the `.GetSearch()` method on the `twitter.api.Api` object returns the tweets and their metadata as a nested `dict` (i.e. the parsed JSON response).

In [6]:
test_fetch.keys()

dict_keys(['statuses', 'search_metadata'])

There are just two dictionary keys in the `.GetSearch()` return: `statuses` and `search_metadata`.

In [7]:
type(test_fetch['statuses'])

list

In [8]:
test_fetch['statuses']

[{'created_at': 'Thu Apr 16 20:59:37 +0000 2020',
  'id': 1250891622264561665,
  'id_str': '1250891622264561665',
  'full_text': 'RT @OuttaHiura: #NBATwitter\xa0\xa0 GAIN TIME!\n\nIf you are a fan of the #NBA\xa0\xa0\xa0and you follow back:\n\n• LIKE THIS TWEET\n• FOLLOW EVERYONE WHO…',
  'truncated': False,
  'display_text_range': [0, 135],
  'entities': {'hashtags': [{'text': 'NBATwitter', 'indices': [16, 27]},
    {'text': 'NBA', 'indices': [66, 70]}],
   'symbols': [],
   'user_mentions': [{'screen_name': 'OuttaHiura',
     'name': 'Zacv2',
     'id': 1228502629233328129,
     'id_str': '1228502629233328129',
     'indices': [3, 14]}],
   'urls': []},
  'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
  'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'in_reply_to_screen_name': None,

Under the 'statuses' key is a `list` of all the tweets and info returned.

In [9]:
type(test_fetch['statuses'][0])

dict

Each of these `statuses` list items is a `dict` that contains info for a single tweet.

Let's index one of these and see what information lies within.

In [10]:
test_fetch['statuses'][0]

{'created_at': 'Thu Apr 16 20:59:37 +0000 2020',
 'id': 1250891622264561665,
 'id_str': '1250891622264561665',
 'full_text': 'RT @OuttaHiura: #NBATwitter\xa0\xa0 GAIN TIME!\n\nIf you are a fan of the #NBA\xa0\xa0\xa0and you follow back:\n\n• LIKE THIS TWEET\n• FOLLOW EVERYONE WHO…',
 'truncated': False,
 'display_text_range': [0, 135],
 'entities': {'hashtags': [{'text': 'NBATwitter', 'indices': [16, 27]},
   {'text': 'NBA', 'indices': [66, 70]}],
  'symbols': [],
  'user_mentions': [{'screen_name': 'OuttaHiura',
    'name': 'Zacv2',
    'id': 1228502629233328129,
    'id_str': '1228502629233328129',
    'indices': [3, 14]}],
  'urls': []},
 'metadata': {'iso_language_code': 'en', 'result_type': 'recent'},
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 12270

In [25]:
test_fetch['statuses'][0].keys()

dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'metadata', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang'])

Above are all the keys under each `statuses` `dict`.  Potentially useful fields to retrieve information from are:

- `created_at` - timestamp
- `id` - ID of tweet
- `full_text`/`retweeted_status` - the tweet or retweeted tweet text
- `entities` - extra info?
- `metadata` - extra info?
- `user`

Let's look into each of these.

In [26]:
test_fetch['statuses'][0]['created_at']

'Thu Apr 16 20:59:37 +0000 2020'

In [27]:
test_fetch['statuses'][0]['id']

1250891622264561665

The above two fields are indeed useful - `created_at` could help us identify time-cyclical trends in our data, and `id` gives us a unique identifier for each tweet.
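
To use `created_at` for time-based analysis, the string first has to become a real datetime. A minimal sketch using only the standard library is below; the format string is my reading of the timestamp layout shown above, and `TWITTER_TIME_FORMAT` and `created_dt` are names chosen for illustration. Note that `%a` and `%b` depend on the locale, so this assumes English day and month names.

```python
from datetime import datetime

created_at = 'Thu Apr 16 20:59:37 +0000 2020'  # value shown above

# Layout: weekday, month, day, time, UTC offset, year
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'

# Parse into a timezone-aware datetime
created_dt = datetime.strptime(created_at, TWITTER_TIME_FORMAT)

print(created_dt.isoformat())
print(created_dt.hour, created_dt.strftime('%A'))
```

With a parsed datetime, grouping tweets by hour of day or day of week becomes straightforward.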

In [29]:
test_fetch['statuses'][0]['full_text']

'RT @OuttaHiura: #NBATwitter\xa0\xa0 GAIN TIME!\n\nIf you are a fan of the #NBA\xa0\xa0\xa0and you follow back:\n\n• LIKE THIS TWEET\n• FOLLOW EVERYONE WHO…'

Note that the `full_text` of a retweet is itself cut off (it ends in "…" above) when the `RT @user:` prefix plus the original tweet runs past 140 characters.  We have to look further into the JSON, under `retweeted_status`, for the full retweeted tweet text.

In [30]:
test_fetch['statuses'][0]['retweeted_status']['full_text']

'#NBATwitter\xa0\xa0 GAIN TIME!\n\nIf you are a fan of the #NBA\xa0\xa0\xa0and you follow back:\n\n• LIKE THIS TWEET\n• FOLLOW EVERYONE WHO        LIKES THIS TWEET\n• RT FOR MORE AUDIENCES\n• FOLLOW ME @OuttaHiura 🔥\n• REPLY WITH “IFB” \n• FOLLOW EVERYBODY WHO DOES \n\n#NBAFollowTrains #NBATwitter'
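
The retweet handling can be wrapped in a small helper. This is a sketch, not part of the `twitter` package: `extract_full_text` is a hypothetical name, and the two tweet dicts are heavily shortened stand-ins for real API results.

```python
def extract_full_text(tweet):
    """Return (text, is_retweet) for one tweet dict from the search results."""
    original = tweet.get('retweeted_status')
    if original is not None:
        # Retweet: the untruncated text lives on the original tweet
        return original['full_text'], True
    return tweet['full_text'], False


# Hypothetical, shortened tweet dicts for illustration only
plain = {'full_text': 'Game night! #NBA'}
retweet = {
    'full_text': 'RT @someone: Game night! #N…',
    'retweeted_status': {'full_text': 'Game night! #NBA'},
}

print(extract_full_text(plain))
print(extract_full_text(retweet))
```

Using `.get()` keeps the check explicit and avoids catching unrelated errors.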

Let's check `entities` now.

In [31]:
test_fetch['statuses'][0]['entities']

{'hashtags': [{'text': 'NBATwitter', 'indices': [16, 27]},
  {'text': 'NBA', 'indices': [66, 70]}],
 'symbols': [],
 'user_mentions': [{'screen_name': 'OuttaHiura',
   'name': 'Zacv2',
   'id': 1228502629233328129,
   'id_str': '1228502629233328129',
   'indices': [3, 14]}],
 'urls': []}

From the `entities`, we can get the `hashtags`, `symbols` (cashtags such as `$TWTR`), and `user_mentions`.
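
Each entity is a small `dict`, so in practice we often only want the text itself. A short sketch, using the `entities` value shown above (the `hashtags` and `mentions` variable names are just for illustration):

```python
# The `entities` dict from the output above
entities = {
    'hashtags': [{'text': 'NBATwitter', 'indices': [16, 27]},
                 {'text': 'NBA', 'indices': [66, 70]}],
    'symbols': [],
    'user_mentions': [{'screen_name': 'OuttaHiura',
                       'name': 'Zacv2',
                       'id': 1228502629233328129,
                       'id_str': '1228502629233328129',
                       'indices': [3, 14]}],
    'urls': [],
}

# Keep only the hashtag text and the mentioned handles
hashtags = [tag['text'] for tag in entities['hashtags']]
mentions = [user['screen_name'] for user in entities['user_mentions']]

print(hashtags)
print(mentions)
```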

In [32]:
test_fetch['statuses'][0]['metadata']

{'iso_language_code': 'en', 'result_type': 'recent'}

We will not need anything from `metadata`.

In [34]:
test_fetch['statuses'][0]['user']

{'id': 1227095906111180800,
 'id_str': '1227095906111180800',
 'name': '𝔹𝕒𝕤𝕜𝕖𝕥𝕓𝕒𝕝𝕝 𝕋𝕒𝕝𝕜™ 🚫🦠',
 'screen_name': 'LetsTalkBucketz',
 'location': '',
 'description': '🚨 TALK TO US • SPEAK YOUR MIND • Any and All Things #NBA Basketball 🏀 ▂▃▅▆▇ 📢 #NBATwitter #DUBNation #FearTheDeer #WeGoHard •••• I FOLLOW BACK 💪🏼',
 'url': None,
 'entities': {'description': {'urls': []}},
 'protected': False,
 'followers_count': 1082,
 'friends_count': 1691,
 'listed_count': 0,
 'created_at': 'Tue Feb 11 05:04:15 +0000 2020',
 'favourites_count': 4545,
 'utc_offset': None,
 'time_zone': None,
 'geo_enabled': False,
 'verified': False,
 'statuses_count': 4661,
 'lang': None,
 'contributors_enabled': False,
 'is_translator': False,
 'is_translation_enabled': False,
 'profile_background_color': 'F5F8FA',
 'profile_background_image_url': None,
 'profile_background_image_url_https': None,
 'profile_background_tile': False,
 'profile_image_url': 'http://pbs.twimg.com/profile_images/1248099801922666496/kF5LDWhS_nor

There's a lot of information under `user`.  Let's grab the `screen_name`, `followers_count`, and `friends_count` from here.

We also saw that the second dictionary key in the returned object from `.GetSearch()` was `search_metadata`.  Let's have a look at what that is.

In [35]:
type(test_fetch['search_metadata'])

dict

In [36]:
test_fetch['search_metadata']

{'completed_in': 0.03,
 'max_id': 1250891622264561665,
 'max_id_str': '1250891622264561665',
 'next_results': '?max_id=1250891580246044677&q=%23nba&lang=en&count=2&include_entities=1&result_type=recent',
 'query': '%23nba',
 'refresh_url': '?since_id=1250891622264561665&q=%23nba&lang=en&result_type=recent&include_entities=1',
 'count': 2,
 'since_id': 0,
 'since_id_str': '0'}

It looks like this contains information about the search we carried out.  Because none of this information is specific to a single one of the tweets we retrieved, we won't use anything from here.
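
Although this notebook does not use it, the `next_results` string is an ordinary URL query string and can be picked apart with the standard library. The sketch below only parses the value shown above; passing the resulting ID as the `max_id` argument of `.GetSearch()` to fetch older tweets is an assumption about how one might page through results, not something done here.

```python
from urllib.parse import parse_qs

# `next_results` value from the output above
next_results = ('?max_id=1250891580246044677&q=%23nba&lang=en&count=2'
                '&include_entities=1&result_type=recent')

# Drop the leading "?" and decode the query string into a dict of lists
params = parse_qs(next_results.lstrip('?'))

next_max_id = int(params['max_id'][0])  # ID to continue from
search_term = params['q'][0]            # percent-decoded search term

print(next_max_id, search_term)
```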

---

## **Scraping Tweets**

Below is a function that makes a single `.GetSearch()` request to the Twitter API for 100 recent tweets containing "#NBA".  The tweets are organized into a pandas DataFrame, which is what the function returns.

**Please use your own consumer key, secret consumer key, access token key, and secret access token key to initialize the API instance!**

In [21]:
def get_nba_tweets():
    '''
    Uses the .GetSearch() request method on the python-twitter Api object to gather 100 recent tweets
    that contain "#nba".  Tweets are then formatted into a pandas DataFrame.
    
    INPUTS
    n/a
    
    OUTPUTS
    df: The DataFrame which contains all tweets with basic tweet information.  Each row is a tweet.
    '''
    
    import pandas as pd # for DataFrame
    import twitter # to interact with the Twitter API
    from datetime import datetime # for timestamps for file names
    
    # initiate Twitter API instance with authentication
    twitter_api = twitter.Api(
        consumer_key = '',
        consumer_secret = '',
        access_token_key = '',
        access_token_secret = '',
        tweet_mode = 'extended' # ensures we get the full tweet text if it has 140-280 characters
    )
    
    # verify authentication - will raise TwitterError if unsuccessful
    twitter_api.VerifyCredentials()
        
    rows = [] # one dict of tweet information per tweet

    # create search API request 
    r = twitter_api.GetSearch(
        term = '#nba', # search term
        lang = 'en', # english language tweets only
        result_type = 'recent', # covers more of the general public
        include_entities = True, # extra info we may need
        return_json = True, # returns full json object with all info
        count = 100 # tweets retrieved per request (max 100)
    )

    # loop through each tweet in the request
    for tweet in r['statuses']:
        
        # retweeted tweets get truncated to 140 characters
        # extract the full retweeted status if a tweet is a retweet
        try:
            tweet_text = tweet['retweeted_status']['full_text']
            retweet = 1
        except KeyError: # no 'retweeted_status' key, so this is not a retweet
            tweet_text = tweet['full_text']
            retweet = 0

        # collect tweet information as one future row of the dataframe
        rows.append(
            {
            'tweet_id': tweet['id'],
            'created_at': tweet['created_at'],
            'handle': tweet['user']['screen_name'],
            'tweet': tweet_text,
            'retweet': retweet,
            'hashtags': tweet['entities']['hashtags'],
            'symbols': tweet['entities']['symbols'],
            'user_mentions': tweet['entities']['user_mentions'],
            'followers_count': tweet['user']['followers_count'],
            'friends_count': tweet['user']['friends_count']
            })
        
    # build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
    # columns follow the key order of the dicts above
    df = pd.DataFrame(rows)
    
    return df

Below is a script that makes any specified number of Twitter API `.GetSearch()` requests at any specified frequency - i.e. the above function is used in a loop.  Each single request is saved as a .csv file.  There is a timer in place between requests in order to make the scraping appear more "human".  The notebook kernel can be interrupted to stop the collecting of tweets.

In [22]:
# uncomment to use additional timer in between most recent API call and next API call
# sleep(300)

# number of requests to make, and how many minutes to wait in between each request
n_requests = 100
wait_time_bw_requests_mins = 15

# *** the total runtime will be (n_requests - 1) * wait_time_bw_requests_mins minutes ***

for n in range(1, n_requests+1):
    
    # use custom made function to return 100 recent tweets containing "#nba" as a dataframe
    request_df = get_nba_tweets()
    
    # note that the authentication is verified every single time the above line is run
    # this is a kind of code inefficiency that should be addressed later
    # however, the code works perfectly fine
    
    # create unique file name for the request using a datetime code
    datetime_string = datetime.now().strftime('%Y%m%d-%H%M%S')
    file_name = f'data/api_request_full_rt_{datetime_string}.csv'

    # write this df to a csv
    request_df.to_csv(file_name, index=False)

    # print a message to confirm a successful request
    print(f'{datetime.now().strftime("%H:%M:%S")} - Request {n} successfully made & saved at {file_name}.') 
    
    # wait timer between requests, with print statement
    if n < n_requests: # ensures that we don't wait after the last request is completed
        
        countdown_seconds = wait_time_bw_requests_mins*60
        
        # countdown message
        while countdown_seconds > 0:
            print(f'{countdown_seconds} seconds remaining...   ', end='\r')
            countdown_seconds -= 1
            sleep(1)
          
# can interrupt kernel to stop query

15:30:11 - Request 1 successfully made & saved at data/api_request_full_rt_20200329-153011.csv.
892 seconds remaining...   

KeyboardInterrupt: 

Finally, below is a script to create a master dataframe by appending all of the individual API request dataframes.  The master dataframe is then saved as a .csv file.

In [23]:
# list of all API request .csv files in the data/ directory
files = glob.glob('data/api_request_full_rt_*.csv')

# read every .csv into its own dataframe
request_dfs = [pd.read_csv(file) for file in files]

# combine all requests into the master dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
master = pd.concat(request_dfs, ignore_index=True)

# counts
print(f'{len(files)} Twitter API requests combined.') # files combined
print(f'{len(master)} total tweets collected.') # total tweets

# write to csv
master.to_csv('data/master_test.csv', index=False)

1 Twitter API requests combined.
100 total tweets collected.
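
As noted at the top, overlapping requests can return the same tweet more than once, and such rows are easy to remove. A minimal sketch, assuming the `tweet_id` column produced by `get_nba_tweets()`; the small `combined` frame is a made-up stand-in for the master dataframe.

```python
import pandas as pd

# Hypothetical stand-in for the master dataframe
combined = pd.DataFrame({
    'tweet_id': [1, 2, 2, 3],
    'tweet': ['a', 'b', 'b', 'c'],
})

# Keep the first occurrence of every tweet ID
deduped = combined.drop_duplicates(subset='tweet_id').reset_index(drop=True)

print(f'{len(combined) - len(deduped)} duplicate rows removed, '
      f'{len(deduped)} unique tweets remain.')
```

Deduplicating on `tweet_id` keeps separate retweets of the same tweet, since each retweet has its own ID; deduplicating on the `tweet` text column instead would collapse them.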
