### Purpose of script

This script retrieves old tweets through the Twitter Premium Search API (full-archive endpoint), using the lab's Premium account. Quick-start guide: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/quick-start/premium-full-archive

Rate limits to keep in mind: https://developer.twitter.com/en/docs/twitter-api/premium/rate-limits

In particular, we can make up to 60 requests per minute to the full-archive endpoint.
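To stay under that limit when sending many requests in a loop, one option is to space the calls evenly. The sketch below is a minimal illustration, not part of the original script: `RequestThrottle` is a hypothetical helper name, and it assumes that one request per second is an acceptable way to respect the 60 requests/minute figure.

```python
import time


class RequestThrottle:
    """Spaces out calls so that at most `max_per_minute` are made per minute."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        # Minimum number of seconds that must separate two requests.
        self.min_interval = 60.0 / max_per_minute
        # clock and sleep are injectable so the class can be tested without waiting.
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `throttle.wait()` immediately before each `requests.post(...)` keeps consecutive requests at least one second apart.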

According to https://developer.twitter.com/en/docs/twitter-api/premium/search-api/overview, each request appears to return up to 500 tweets. The actual number is capped by the `maxResults` parameter, which the examples below set to 100.
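Because a single request returns only one page of results, collecting more tweets means following the `next` token that the Premium search response includes when further pages exist. The sketch below assumes that behaviour; `fetch_all_pages` and `send_request` are hypothetical names, and `send_request` stands for any callable that takes the request payload as a dict and returns the decoded JSON response.

```python
def fetch_all_pages(send_request, payload, max_pages=5):
    """Collect tweets across pages by following the `next` token.

    `send_request` is a hypothetical callable: payload dict in, response dict out.
    `max_pages` caps the number of requests so a broad query cannot run away.
    """
    tweets = []
    payload = dict(payload)  # copy so the caller's dict is not modified
    for _ in range(max_pages):
        response = send_request(payload)
        tweets.extend(response.get("results", []))
        next_token = response.get("next")
        if not next_token:
            break  # no further pages
        payload["next"] = next_token  # ask for the following page
    return tweets
```

With `requests`, `send_request` could be something like `lambda p: requests.post(endpoint, json=p, headers=headers).json()`, where `endpoint` and `headers` are set up as in the example code. Each page counts as one request against the rate limit.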

Reference page for Premium search API: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/api-reference/premium-search

In [101]:
import requests
import json

We'll use the Twitter `Full Archive` endpoint, coupled with the lab's Premium Twitter account. 

https://developer.twitter.com/en/docs/twitter-api/premium/search-api/guides/integrating-premium
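Since the account's bearer token is a credential, it may be safer to keep it out of the notebook itself. A minimal sketch, assuming the token is stored in an environment variable; the variable name `TWITTER_BEARER_TOKEN` and the helper `build_headers` are hypothetical choices, not something the API requires.

```python
import os


def build_headers(env_var="TWITTER_BEARER_TOKEN", environ=os.environ):
    """Build request headers from a bearer token stored in an environment variable."""
    token = environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
```

The returned dict has the same shape as the `headers` used in the requests below.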

Examples of how to set up args for queries: 

https://developer.twitter.com/en/docs/twitter-api/premium/rules-and-filtering/using-premium-operators
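As a rough illustration of assembling such a query, the sketch below joins search terms with `OR` and wraps multi-word terms in double quotes so they are matched as exact phrases. `build_query` and `build_payload` are hypothetical helper names; the request body is produced with `json.dumps` so that the quotes are escaped correctly.

```python
import json


def build_query(terms):
    """Join search terms with OR, quoting multi-word terms as exact phrases."""
    parts = []
    for term in terms:
        term = term.strip()
        if " " in term:
            term = f'"{term}"'  # exact-phrase match
        parts.append(term)
    return "(" + " OR ".join(parts) + ")"


def build_payload(terms, from_date, to_date, max_results=100):
    """Return the JSON request body; dates are strings in YYYYMMDDHHMM format."""
    return json.dumps({
        "query": build_query(terms),
        "fromDate": from_date,
        "toDate": to_date,
        "maxResults": max_results,
    })
```

For example, `build_query(["snow", "freezing rain"])` gives `(snow OR "freezing rain")`. Without the quotes, `freezing rain` would match tweets containing both words anywhere in the text.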

Overview of what the different fields mean:

https://developer.twitter.com/en/docs/twitter-api/premium/data-dictionary/overview
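To make the field layout concrete, here is a small sketch that pulls a few commonly used fields out of one tweet object. `summarize_tweet` is a hypothetical helper, and it assumes that long tweets carry their untruncated text under `extended_tweet.full_text`; when that key is absent it falls back to `text`.

```python
from datetime import datetime


def summarize_tweet(tweet):
    """Pull a few commonly used fields out of one tweet object (a dict)."""
    # Assumption: long tweets store the full text under extended_tweet.full_text.
    extended = tweet.get("extended_tweet") or {}
    text = extended.get("full_text", tweet.get("text", ""))
    # created_at looks like "Fri Feb 23 23:59:59 +0000 2018".
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "id": tweet["id_str"],
        "created_at": created,
        "screen_name": tweet.get("user", {}).get("screen_name"),
        "text": text,
    }
```

Using `id_str` rather than `id` avoids precision problems if the data is later loaded by tools that store large integers as floats.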



### Example Code

The example below won't run as written: the placeholder bearer token must be replaced with a valid one.

In [68]:
endpoint = "https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json" 
headers = {"Authorization":"Bearer VERYLONGBEARERTOKENNNN(REPLACE)", "Content-Type": "application/json"}
data = '{"query":"(snow OR sleet OR hail OR freezing rain)", "fromDate": "201802020000", "toDate": "201802240000", "maxResults":"100"}'

In [69]:
response = requests.post(endpoint,data=data,headers=headers).json()

In [70]:
print(json.dumps(response))



In [63]:
len(response["results"])

100

In [64]:
print(json.dumps(response, indent = 2))

{
  "results": [
    {
      "created_at": "Fri Feb 23 23:59:59 +0000 2018",
      "id": 967187313284534272,
      "id_str": "967187313284534272",
      "text": "@stephaniemain2 all hail the overly domesticated mice eating cats of the modern age!!! \n\n all rise!!!",
      "display_text_range": [
        16,
        101
      ],
      "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
      "truncated": false,
      "in_reply_to_status_id": 967187022447292416,
      "in_reply_to_status_id_str": "967187022447292416",
      "in_reply_to_user_id": 609991829,
      "in_reply_to_user_id_str": "609991829",
      "in_reply_to_screen_name": "stephaniemain2",
      "user": {
        "id": 968640829,
        "id_str": "968640829",
        "name": "M \ud83d\udc7d",
        "screen_name": "Who_IsM",
        "location": "Earth  ",
        "url": null,
        "description": "All you have to do is scroll half way through the page to get to the year 1999, currently in

### Package into function

In [94]:
def get_past_tweets(bearer_token, search_terms=["genomics", "variants"], start_date="20200101", end_date="20200103", max_results=100):
    
    """
        Uses the Twitter Premium API (Historical Search, Full Archive endpoint) to get past tweets.
        
        Args:
            • bearer_token: long string containing bearer token creds for application (type:str)
            • search_terms: list of terms to look up; the terms are combined with OR (type:list)
            • start_date: start date, in the format YYYYMMDD (type:str)
            • end_date: end date, in the format YYYYMMDD (type:str)
            • max_results: maximum number of results to return (default: 100) (type:int)
            
        Returns a list of tweet objects. The text of each tweet can be accessed 
        using the "text" key. 
        
        e.g., 
            for tweet in tweet_list:
                print(tweet["text"])
    """
    
    if not isinstance(start_date, str) or not isinstance(end_date, str):
        raise TypeError("The start and end dates need to be strings in the format YYYYMMDD")
    
    if len(start_date) != 8 or len(end_date) != 8:
        raise ValueError("The start and end dates have the wrong length. They must be in the format YYYYMMDD")
        
    if len(search_terms) < 1:
        raise ValueError("Need at least 1 term to look up")
        
    # set up params for request
    endpoint = "https://api.twitter.com/1.1/tweets/search/fullarchive/dev.json"
    headers = {"Authorization": f"Bearer {bearer_token}", "Content-Type": "application/json"}
    
    # combine the search terms with OR, e.g. "(genomics OR variants)"
    query_str = "(" + " OR ".join(search_terms) + ")"
    
    # the API expects timestamps as YYYYMMDDHHMM, so append midnight (0000)
    payload = {
        "query": query_str,
        "fromDate": start_date + "0000",
        "toDate": end_date + "0000",
        "maxResults": str(max_results),
    }

    # send request; json.dumps escapes any quotes inside the search terms
    response = requests.post(endpoint, data=json.dumps(payload), headers=headers).json()
    
    # print results, for ease of viewing
    print(json.dumps(response, indent = 2))
    
    # an error response (e.g. bad token, rate limit) has no "results" key
    if "results" not in response:
        raise RuntimeError(f"Request did not return any results: {response}")
    
    # return results, as list of tweets
    tweet_list = response["results"]
    return tweet_list
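The length check in `get_past_tweets` accepts any 8-character string, so a value such as `"2020-1-1"` or `"20201301"` would only fail once the request reaches the API. A stricter check could look like the sketch below; `to_api_timestamp` is a hypothetical helper that validates the date with `datetime.strptime` and returns it in the `YYYYMMDDHHMM` form used for `fromDate` and `toDate`.

```python
from datetime import datetime


def to_api_timestamp(date_str):
    """Validate a YYYYMMDD string and return it as YYYYMMDDHHMM (midnight)."""
    if not isinstance(date_str, str):
        raise TypeError("The date needs to be a string in the format YYYYMMDD")
    if len(date_str) != 8 or not date_str.isdigit():
        raise ValueError(f"{date_str!r} is not in the format YYYYMMDD")
    try:
        # strptime rejects impossible dates such as month 13 or Feb 30
        parsed = datetime.strptime(date_str, "%Y%m%d")
    except ValueError:
        raise ValueError(f"{date_str!r} is not a valid calendar date") from None
    return parsed.strftime("%Y%m%d") + "0000"
```

For example, `to_api_timestamp("20200101")` returns `"202001010000"`.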

In [96]:
new_tweets = get_past_tweets(bearer_token)

{
  "results": [
    {
      "created_at": "Thu Jan 02 23:59:57 +0000 2020",
      "id": 1212886277420462082,
      "id_str": "1212886277420462082",
      "text": "RT @AlexDrawsNSFW: Alright first Commission of the new year. With this comes new Prices Decided I don't like black and white or sketches so\u2026",
      "source": "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
      "truncated": false,
      "in_reply_to_status_id": null,
      "in_reply_to_status_id_str": null,
      "in_reply_to_user_id": null,
      "in_reply_to_user_id_str": null,
      "in_reply_to_screen_name": null,
      "user": {
        "id": 3094051231,
        "id_str": "3094051231",
        "name": "Dawn Malf",
        "screen_name": "Dawn_Before_Day",
        "location": "Corpus Christi, TX",
        "url": null,
        "description": "I am 22. Likes butts. Dislikes Netflix adaptations. \nTrial artist.alpha",
        "translator_type": "none",
        "derived": {


In [100]:
for tweet in new_tweets[0:10]:
    print("================")
    print(tweet["text"])
    print("================")

RT @AlexDrawsNSFW: Alright first Commission of the new year. With this comes new Prices Decided I don't like black and white or sketches so…
THE MEETINGS
Josef Hoffman
Dye transfer drawings, 6 supplementary variants on xerograph from a dry plate backed cut-out aluminum
RT @Nayeliefox: LUCKY GRAB PACKS!!
4 VARIATIONS ALL WITH 4 HIGH QUALITY VINYL STICKERS AND AN ENAMEL PIN! ($25.50 value)
Each variation is…
DF SPECIAL PREMIERE GOLD #Marvel &amp; #dccomics! #retailers VARIANTS! CGC #comics! #signed #comicbook! #LimitedEdition… https://t.co/AaXqMdoNJQ
RT @thanukiart: Happy new years everybody! Thank you all for supporting me through all of 2019 and thank you for 6000 followers! I wasn't a…
The more I play these old RPG's the more I yearn for the world building and the sense of place that seems missing f… https://t.co/miLa2pn18H
#linux #devicetree [PATCH v3 1/3] clk: composite: add _register_composite_pdata() variants https://t.co/yJlLZmDWhW
JCI - Human C-terminal CUBN variants associate w

In [99]:
new_tweets[0]["text"]

"RT @AlexDrawsNSFW: Alright first Commission of the new year. With this comes new Prices Decided I don't like black and white or sketches so…"