# Search Tweets using Twitter's Premium API - Full Archive

<b>Created by [Kate Schneider](https://github.com/kteschneider) <br> Last modified April 16, 2020</b>

This code uses TwitterAPI, a 3rd party library approved by the Twitter Developer team, with Twitter's Premium Search Tweets (Full Archive) API.

For TwitterAPI's documentation, see https://readthedocs.org/projects/twitterapi/downloads/pdf/latest/. There are many libraries that are compatible with the Standard APIs but TwitterAPI is one of the very few compatible with Premium APIs.

### About the Premium Search Tweets API endpoints

There are two premium search API endpoints: 

1. Search Tweets: 30-day endpoint → provides Tweets posted within the last 30 days.
2. Search Tweets: Full-archive endpoint → provides Tweets from as early as 2006, starting with the first Tweet posted in March 2006.

Premium search provides data and counts endpoints. Data endpoints retrieve Tweets matching the specified query. Count endpoints retrieve the number of Tweets matching the specified query. For this project, I will be using the data endpoints. 

The search URL for the data endpoint is:
https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json
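
The last two parts of this URL are the product (`fullarchive`) and the name of the dev environment (`dev`, see the LABEL parameter in section B). A minimal sketch of how the URL is put together, for illustration only; the helper `search_url` is a hypothetical name and not part of TwitterAPI, which builds the URL for you:

```python
BASE_URL = 'https://api.twitter.com/1.1/tweets/search'

def search_url(product, label):
    """Return the data endpoint URL for a product and dev environment label."""
    return '%s/%s/%s.json' % (BASE_URL, product, label)

# 'dev' is the example environment name used throughout this notebook
print(search_url('fullarchive', 'dev'))
```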

### Rate limits
Requests are rate limited at both minute and second granularity. The per-minute limit is 60 requests (30 in a Sandbox environment), and requests are also limited to 10 per second. Requests are aggregated across both the data and counts endpoints. Monthly request limits also apply: Sandbox environments are limited to 250 requests per month, and paid access can range between 500 and 10,000 requests. Note that the full-archive Sandbox used in this notebook permits 50 requests per month (see A.3).
(from https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search#SearchRequests)
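
The per-minute limit translates directly into a minimum pause between requests. The sketch below shows the arithmetic; the function name `pause_seconds` and the 0.1-second safety buffer are assumptions made for illustration, not values prescribed by Twitter:

```python
def pause_seconds(requests_per_minute, buffer=0.1):
    """Seconds to wait between requests to stay under a per-minute limit."""
    return 60 / requests_per_minute + buffer

print("Sandbox pause:", pause_seconds(30))
print("Premium pause:", pause_seconds(60))
```

Pausing slightly longer than the exact interval leaves room for timing jitter, at the cost of making a few fewer requests per minute than the limit allows.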

### Differences with the Standard Search API
For information on how the Search Tweets API (Premium) and the Twitter Search API (Standard) differ, see https://developer.twitter.com/en/docs/tweets/search/guides/integrating-premium ("Migrating from standard search").

### Extracting the JSON-encoded tweet objects
For information on the Tweet JSON returned by the Search Tweets API, see https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.
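
As a rough illustration of working with one Tweet object, the sketch below pulls a few fields out of a Tweet dictionary. The sample Tweet is invented for this example, and the helper names `tweet_text` and `summarize` are hypothetical. It assumes that long Tweets carry their complete text under `extended_tweet` > `full_text`, with a shortened version in `text`; check the data dictionary linked above for the fields you actually receive.

```python
def tweet_text(tweet):
    """Return the full text if an 'extended_tweet' is present, else 'text'."""
    if 'extended_tweet' in tweet:
        return tweet['extended_tweet']['full_text']
    return tweet['text']

def summarize(tweet):
    """Pick out the timestamp, author and text of one Tweet dictionary."""
    return {'created_at': tweet['created_at'],
            'screen_name': tweet['user']['screen_name'],
            'text': tweet_text(tweet)}

# Invented sample, not real data
sample_tweet = {
    'created_at': 'Fri Oct 18 15:30:00 +0000 2019',
    'id_str': '0000000000000000000',
    'text': 'A long tweet that has been cut off...',
    'truncated': True,
    'user': {'screen_name': 'example_user'},
    'extended_tweet': {'full_text': 'A long tweet that has been cut off in the text field but not here. #cdnpoli'},
}

print(summarize(sample_tweet))
```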

## A. Setting up 'TwitterAPI' for Search Tweets

### 1. Setting counter to 0

This helps keep track of how many requests you've submitted overall.

In [None]:
global_counter = 0

### 2. Importing libraries

In [None]:
from TwitterAPI import TwitterAPI
import json
import time

### 3. Running on the free Sandbox testing version or the paid Premium version? - *requires input*

There are two versions of the Premium Search Tweets API: Sandbox and Premium (paid). The Sandbox version is free and is already set up when you create your dev environment. This allows for testing and permits 50 requests a month of 100 tweets each (5k tweets overall). The Premium version must be paid for each month and varies by price depending on how many requests you purchase. With the Premium version, each request gives you 500 tweets. Therefore, at the lowest purchase level of 100 requests, you would get 50k tweets overall.

See here for the pricing levels for Premium Search Tweets (Full Archive): https://developer.twitter.com/en/pricing/search-fullarchive 

In [None]:
SANDBOX = True
# set equal to True if running the Sandbox version of the Search Tweets API
# set equal to False if running the Premium (paid) version of the Search Tweets API
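
The monthly totals quoted above are the number of requests multiplied by the tweets returned per request. A small sketch of that arithmetic (the function name `max_tweets` is hypothetical):

```python
def max_tweets(requests_per_month, tweets_per_request):
    """Upper bound on tweets collected in a month."""
    return requests_per_month * tweets_per_request

sandbox_total = max_tweets(50, 100)    # Sandbox: 50 requests of 100 tweets
premium_total = max_tweets(100, 500)   # lowest paid level: 100 requests of 500 tweets
print("Sandbox:", sandbox_total, "tweets; Premium (lowest level):", premium_total, "tweets")
```

These are upper bounds: a request that matches fewer tweets than the per-request maximum still counts as one request.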

### 4. Gaining access and creating the interface - *requires input*

Below you would enter your access tokens and consumer keys found on your Twitter Developer 'Apps' > 'Keys and tokens' page.

The order they must be listed in is TwitterAPI('CONSUMER KEY', 'SECRET CONSUMER KEY', 'ACCESS TOKEN', 'SECRET ACCESS TOKEN')

In [None]:
api = TwitterAPI('CONSUMER KEY', 
                 'SECRET CONSUMER KEY', 
                 'ACCESS TOKEN', 
                 'SECRET ACCESS TOKEN')

## B. Creating and Completing New Requests
⬇︎ FOR NEW REQUESTS RUN FROM HERE AND BELOW ⬇︎

### 1. Setting search parameters - *requires input*

Below, you set your search parameters.

FROM_DATE & TO_DATE 
* take the format YYYYMMDDhhmm and are in UTC.

filename
* must have the extension '.json'

SEARCH_TERM
* determines which tweets you will collect (e.g. can set as a hashtag).

sandbox_requests
* the maximum number of requests made in one run of this program in Sandbox mode. The monthly limit is 50, but you may wish to set a smaller number so that you can run several searches with different search parameters.

premium_requests
* the maximum number of requests made in one run of this program in Premium mode. The monthly limit is whatever you purchased (e.g. the lowest purchase level gives you 100), but you may wish to set a smaller number so that you can run several searches with different search parameters.

LABEL
* the name of your dev environment (e.g. my dev environment is called "dev")

In [None]:
# Can adjust the following parameters each time
FROM_DATE = '201910181500'
TO_DATE = '201910181600'
filename = 'cdnpoli_tweets.json'
SEARCH_TERM = '#cdnpoli'
sandbox_requests = 1
premium_requests = 2

# Set this parameter the first time
LABEL = 'dev'
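
If you would rather not type the YYYYMMDDhhmm strings by hand, they can be generated from Python datetime objects. A minimal sketch, assuming you pass timezone-aware datetimes; the helper name `to_twitter_timestamp` is hypothetical, and the UTC-4 offset is only an example:

```python
from datetime import datetime, timedelta, timezone

def to_twitter_timestamp(dt):
    """Convert a timezone-aware datetime to UTC and format it as YYYYMMDDhhmm."""
    return dt.astimezone(timezone.utc).strftime('%Y%m%d%H%M')

# A time already in UTC
utc_start = datetime(2019, 10, 18, 15, 0, tzinfo=timezone.utc)
print(to_twitter_timestamp(utc_start))

# The same moment expressed in a UTC-4 timezone
eastern = timezone(timedelta(hours=-4))
local_start = datetime(2019, 10, 18, 11, 0, tzinfo=eastern)
print(to_twitter_timestamp(local_start))
```

Both calls produce the same string, since they describe the same moment. Avoid passing naive datetimes (without tzinfo): `astimezone` would interpret them as your computer's local time.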

#### Do not adjust the parameters below.

PRODUCT specifies that you are using the Full Archive version of the Premium Search Tweets API and not the 30-day window version.

In [None]:
PRODUCT = 'fullarchive'

Below, I set the limits for either the Sandbox or Premium version of the API to prevent you from exceeding the rate limit. The cell prints which version you are running, along with the number of tweets per request and the number of requests.

In [None]:
if SANDBOX == True:
    SYSTEM_LIMIT = 100 
    # Sets maximum number of search results to be returned by a request
    # Accepts any number between 10 and the system limit
    REQUEST_MAX = sandbox_requests
    # Sets maximum number of requests to be made
    # Monthly limit for Sandbox is 50
    TIME_PAUSE = 2.1
    # Sets length of time (in seconds) the program pauses between requests
    # Keeps you under 30 requests per minute (2 seconds plus a 0.1-second buffer)
    print("Running Sandbox with", SYSTEM_LIMIT, "tweets per request and", REQUEST_MAX, "requests in total.")
else: # If running Premium
    SYSTEM_LIMIT = 500
    REQUEST_MAX = premium_requests
    # Adjust based on how many requests were purchased
    # Purchased amount is your maximum (e.g. at lowest purchase level, max. would be 100)
    TIME_PAUSE = 1.1
    # Keeps you under 60 requests per minute (1 second plus a 0.1-second buffer)
    print("Running Premium with", SYSTEM_LIMIT, "tweets per request and", REQUEST_MAX, "requests in total.")

### 2. Initializing loop variables

Do not adjust the variables below.

In [None]:
NEXT_TERM = None # Ensure == None for first request
web_request_count = 0
finished = False
nested_dict = {}
num_dict = 1 # Ensure == 1 for first request

### 3. Running loop to collect tweets

Each pass through the loop is a request.

1. Starts out by collecting the tweets using your specified search parameters. 
2. Then, it checks for error codes. The most serious error code (exceeding the rate limit) is built in. To check other error codes, see https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search#HTTPCodes. 
3. Next, adds 1 to both the global counter (total requests overall) and the web request counter (number of requests made in this run of the program). Prints out which web request number you've just completed.
4. Checks whether there is a 'next' token in the JSON downloaded from Twitter. If there is no 'next' token, you've reached the last page of the search results for your search parameters, and the finished variable is set to `True` so that the loop does not run again. The finished variable is also set to `True` once you reach your maximum number of requests. Otherwise, the loop continues with another request.
5. Adds the tweets you just collected to a nested dictionary, where each individual tweet is stored as a dictionary within a larger dictionary of all tweets.
6. If running through the loop again, pauses the program temporarily to prevent you from exceeding the rate limits. Sets the 'next' token for the subsequent request to the 'next' token you were given in order to avoid pulling the same tweets in your next request.

In [None]:
while finished == False:
    r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), 
                    {'query':SEARCH_TERM,
                     'fromDate':FROM_DATE,
                     'toDate':TO_DATE,
                     'maxResults':SYSTEM_LIMIT,
                     'next':NEXT_TERM})
    
    # checking for error codes
    if r.status_code != 200:
        print('Error: r.status_code: ', r.status_code)
        if r.status_code == 429:
            print("Exceeded rate limit")
        break
    
    # counting the requests and exiting if reached the max.
    web_request_count += 1
    global_counter += 1
    print('Completed web request: ', web_request_count)

    if 'next' not in r.json():
        finished = True
        NEXT_TERM = None # no further pages, so there is no token to carry over
        print("Reached end of results.")
        print("Program complete. Number of requests collected is", web_request_count)
        print("Total number of requests completed is", global_counter)
    elif (web_request_count == REQUEST_MAX):
        finished = True
        NEXT_TERM = r.json()['next']
        print("Finished up to number of maximum requests.")
        print("Program complete. Number of requests collected is", web_request_count)
        print("'next' token for subsequent requests is", NEXT_TERM)
        print("Total number of requests completed is", global_counter)
    
    # adding to the dictionary containing the results
    results = r.json()['results']
    for tweet in results:
        nested_dict[num_dict] = tweet
        num_dict += 1
    
    # Pausing program to prevent exceeding the rate limits.
    # If this is the last loop (i.e. finished == True), this step is omitted.
    if finished == False:
        print("Pausing for", TIME_PAUSE, "seconds before next request.")
        time.sleep(TIME_PAUSE) # pauses program to stay under the per-minute rate limit
        print("Finished pausing. Moving on to next request.")
        NEXT_TERM = r.json()['next']
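
The 'next' token logic can be tried offline without spending any of your monthly requests. The sketch below is a simplified version of the loop above, not the loop itself: the hypothetical `fake_request` stands in for `api.request(...).json()` and serves three invented pages of results, and there is no error handling or pausing.

```python
# Three invented pages of results; only the last page has no 'next' token.
FAKE_PAGES = {
    None: {'results': [{'id': 1}, {'id': 2}], 'next': 'token_a'},
    'token_a': {'results': [{'id': 3}, {'id': 4}], 'next': 'token_b'},
    'token_b': {'results': [{'id': 5}]},
}

def fake_request(next_token):
    """Hypothetical stand-in for api.request(...).json()."""
    return FAKE_PAGES[next_token]

def collect(request_max):
    """Collect tweets until the results run out or request_max is reached."""
    tweets = {}
    next_token = None  # None on the first request
    requests_made = 0
    while requests_made < request_max:
        page = fake_request(next_token)
        requests_made += 1
        for tweet in page['results']:
            tweets[len(tweets) + 1] = tweet  # keys count up from 1
        next_token = page.get('next')
        if next_token is None:  # last page reached
            break
    return tweets, next_token, requests_made

tweets, next_token, requests_made = collect(request_max=2)
print(len(tweets), "tweets in", requests_made, "requests; next token:", next_token)
```

Stopping at the request maximum leaves a token that could be used to resume the same search later, while running to the end of the results leaves no token.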

### 4. Storing dictionary of tweets

In [None]:
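# storing the nested dictionary of tweets
# 'w' overwrites an existing file; appending a second dictionary to the same
# file would make it unreadable by json.load, so use a new filename per run
with open(filename, 'w', encoding='utf8') as f:
    json.dump(nested_dict, f)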

#### *If you want to complete another request with different search parameters, return to the top of B. above, adjust your parameters, and run all of B. again. Do NOT rerun A.*

## C. Loading your Saved Tweets from a JSON File

### 1. Loading dictionary of tweets

In [None]:
with open(filename, 'r', encoding='utf8') as g:
    datastore = json.load(g)

### 2. Printing out your results

In [None]:
# printing a few timestamps to check that the results were read properly
# (keys are strings because JSON object keys are always strings)
# keys beyond the number of tweets collected are skipped instead of raising a KeyError
for key in ['1', '100', '200', '300', '400', '500']:
    if key in datastore:
        print(key, datastore[key]['created_at'])
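
The `created_at` values are plain strings. To sort or filter tweets by time, they can be parsed into datetime objects. A minimal sketch, assuming the timestamps look like the invented example below (weekday, month, day, time, UTC offset, year); the names `TWITTER_TIME_FORMAT` and `parse_created_at` are hypothetical:

```python
from datetime import datetime

# Assumed layout of 'created_at', e.g. 'Fri Oct 18 15:30:00 +0000 2019'
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'

def parse_created_at(value):
    """Parse a 'created_at' string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)

example = parse_created_at('Fri Oct 18 15:30:00 +0000 2019')
print(example.isoformat())
```

The `%a` and `%b` codes match English day and month abbreviations only when Python runs under an English or default locale.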