# Twitter API v2

* **Tutorial: How to analyze the sentiment of your own Tweets**   
https://developer.twitter.com/en/docs/tutorials/how-to-analyze-the-sentiment-of-your-own-tweets


* **Tutorial: Search tweets**    
https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction


* **Tutorial: Comparing Features**   
https://developer.twitter.com/en/docs/twitter-api/search-overview
https://developer.twitter.com/en/docs/twitter-api/tweets/search/migrate 



* **Twitter API v2**   
https://github.com/twitterdev/Twitter-API-v2-sample-code  
https://github.com/twitterdev/search-tweets-python/tree/v2

**Search tweets, using Twitter API**

1. **Recent search** - last 7 days (Standard and Academic Research product tracks)

> This endpoint can deliver up to 100 Tweets per request in reverse-chronological order, and pagination tokens are provided for paging through large sets of matching Tweets. 


2. **Full-archive search** (Academic Research product track only)

> At this time, the v2 full-archive search endpoint is only available via the Academic Research product track. The endpoint allows you to programmatically access public Tweets from the complete archive dating back to the first Tweet in March 2006, based on your search query. This endpoint can deliver up to 500 Tweets per request in reverse-chronological order, and pagination tokens are provided for paging through large sets of matching Tweets. 
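
The two endpoints differ mainly in their URL and in the per-request ceiling quoted above. The following is a minimal sketch, not part of the original notebook: `clamp_max_results` is a hypothetical helper, the recent-search URL is not stated in this document, and the lower bound of 10 results per request is an assumption that should be checked against the current API reference.

```python
# Per-request ceilings as quoted above; URLs and the lower bound are assumptions.
SEARCH_ENDPOINTS = {
    "recent": {"url": "https://api.twitter.com/2/tweets/search/recent", "max": 100},
    "all": {"url": "https://api.twitter.com/2/tweets/search/all", "max": 500},
}
MIN_RESULTS = 10  # assumed minimum page size


def clamp_max_results(endpoint, requested):
    """Return `requested` limited to the allowed page size of `endpoint`."""
    ceiling = SEARCH_ENDPOINTS[endpoint]["max"]
    return max(MIN_RESULTS, min(requested, ceiling))


print(clamp_max_results("recent", 300))  # limited by the recent-search ceiling
print(clamp_max_results("all", 300))     # already within the full-archive ceiling
```

Clamping before the request is sent avoids an error response when a page size chosen for one endpoint is reused with the other.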

## Define functions

**Tutorial:** https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a


**Steps**

1. Introduction
2. Prerequisites to Start
3. Bearer Token
4. Create Headers
5. Create URL
	- search url
	- query params
6. Connect to Endpoint
7. Call the API endpoint
	- data
	- meta
8. Save Results to CSV
9. Looping Through Requests

> If we just send a request to collect tweets between the 1st of January 2020 and the 31st of December 2020, we will hit our cap very quickly without having a good distribution from all 12 months.
So what we can do is, we can set a limit for tweets we want to collect per month, so that if we reach the specific cap at one month, we move on to the next one.

* A For-loop that goes over the months/weeks/days we want to cover (Depending on how it is set)

* A While-loop that controls the maximum number of tweets we want to collect per time period.

* Notice that a time.sleep() is added between calls to ensure you are not just spamming the API with requests.
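
The three points above can be sketched as a nested loop. This is an illustrative outline under stated assumptions, not the notebook's actual collection cell: `collect_per_period` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real API call by returning dictionaries shaped like the `data` and `meta` parts of a response.

```python
import time


def collect_per_period(periods, fetch_page, max_per_period, pause=0.0):
    """Collect tweets period by period, capping each period at max_per_period.

    fetch_page(start, end, next_token) is a hypothetical callable that returns
    a dict shaped like the API response: {"data": [...], "meta": {...}}.
    """
    collected = []
    for start, end in periods:              # for-loop over the time periods
        count = 0
        next_token = None
        while count < max_per_period:       # while-loop enforcing the cap
            page = fetch_page(start, end, next_token)
            tweets = page.get("data", [])
            collected.extend(tweets)
            count += len(tweets)
            next_token = page["meta"].get("next_token")
            if next_token is None:          # last page for this period
                break
            time.sleep(pause)               # wait between consecutive calls
    return collected


# Stand-in for the real API call: two pages of two tweets per period.
def fake_fetch(start, end, next_token):
    if next_token is None:
        return {"data": [f"{start}-a", f"{start}-b"], "meta": {"next_token": "t1"}}
    return {"data": [f"{start}-c", f"{start}-d"], "meta": {}}


periods = [("2021-05-01", "2021-06-01"), ("2021-06-01", "2021-07-01")]
tweets = collect_per_period(periods, fake_fetch, max_per_period=3)
print(len(tweets))
```

Because the cap is checked before each request rather than after, a period can exceed `max_per_period` by up to one page of results.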

In [1]:
# 2. Prerequisites to Start

# For sending GET requests to the API
import requests
# For saving access tokens and for file management when creating and adding to the dataset
import os
# For dealing with json responses we receive from the API
import json
# For displaying the data after
import pandas as pd
# For saving the response data in CSV format
import csv
# For parsing the dates received from twitter in readable formats
import datetime
import dateutil.parser
import unicodedata
#To add wait time between requests
import time

import yaml

In [2]:
# 3. Bearer token
def auth():
    with open("../config.yaml") as file:
        passwords = yaml.safe_load(file)
    return passwords["search_tweets_api"]["bearer_token"]

In [3]:
# 4. Create Headers
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

In [4]:
# 5. Create url
def create_url(keyword, start_date, end_date, max_results = 10):
    
    search_url = "https://api.twitter.com/2/tweets/search/all" #Change to the endpoint you want to collect data from

    #change params based on the endpoint you are using
    query_params = {'query': keyword,
                    'start_time': start_date,
                    'end_time': end_date,
                    'max_results': max_results,
                    'tweet.fields': 'id,text,author_id,created_at,public_metrics',
                    # 'expansions': 'author_id,in_reply_to_user_id,geo.place_id',
                    # 'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
                    # 'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
                    # 'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
                    'next_token': {}}
    return (search_url, query_params)

In [5]:
# 6. Connect to endpoint
def connect_to_endpoint(url, headers, params, next_token = None):
    params['next_token'] = next_token   #params object received from create_url function
    response = requests.request("GET", url, headers = headers, params = params)
    print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()

In [12]:
## // Inputs for the request //
bearer_token = auth()
headers = create_headers(bearer_token)

# keyword = "ethereum -is:retweet lang:en"
# keyword = "(ethereum OR ether OR eth) lang:en -is:retweet -is:reply"
keyword = "(audiusproject OR audius OR $audio OR audiocoin) lang:en -is:retweet -is:reply"

start_time = "2021-03-01T00:00:00.000Z"
end_time = "2021-03-31T00:00:00.000Z"
max_results = 10

In [13]:
## // Call the API //
url = create_url(keyword, start_time,end_time, max_results)
json_response = connect_to_endpoint(url[0], headers, url[1])
print(json.dumps(json_response, indent=4, sort_keys=True))

Endpoint Response Code: 200
{
    "data": [
        {
            "author_id": "19953652",
            "created_at": "2021-03-30T23:57:36.000Z",
            "id": "1377047388590899203",
            "public_metrics": {
                "like_count": 1,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "text": "I just hit over 100 followers on #Audius! https://t.co/WLjNXmRCLz #NewMusic"
        },
        {
            "author_id": "1117417373940711426",
            "created_at": "2021-03-30T23:57:11.000Z",
            "id": "1377047282785353737",
            "public_metrics": {
                "like_count": 0,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "text": "https://t.co/1uPc7jThCw hmmm what  is this"
        },
        {
            "author_id": "904521983777492993",
            "created_at": "2021-03-30T23:55:26.000Z",

In [14]:
## Save Results
def append_to_csv(json_response, fileName):
    """Append one row per tweet in the response to the CSV file."""

    # Number of rows written from this response
    counter = 0

    # Open (or create) the target CSV file; the with-block closes it even if an error occurs
    with open(fileName, "a", newline="", encoding='utf-8') as csvFile:
        csvWriter = csv.writer(csvFile)

        # 'data' is absent when the response contains no matching tweets
        for tweet in json_response.get('data', []):

            # All fields read below were requested via 'tweet.fields' in create_url
            author_id = tweet['author_id']
            created_at = dateutil.parser.parse(tweet['created_at'])
            tweet_id = tweet['id']
            text = tweet['text']

            # Tweet metrics
            retweet_count = tweet['public_metrics']['retweet_count']
            reply_count = tweet['public_metrics']['reply_count']
            like_count = tweet['public_metrics']['like_count']
            quote_count = tweet['public_metrics']['quote_count']

            # Assemble all data in a list, in the same order as the CSV header:
            # 'author_id', 'created_at', 'tweet_id', 'text', 'like_count', 'quote_count', 'reply_count', 'retweet_count'
            res = [author_id, created_at, tweet_id, text, like_count, quote_count, reply_count, retweet_count]

            # Append the result to the CSV file
            csvWriter.writerow(res)
            counter += 1

    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter)

In [13]:
# Create file
csvFile = open("data.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)

# Create headers for the data you want to save; in this example, we only want to save these columns in our dataset
csvWriter.writerow([ 'author_id', 'created_at', 'tweet_id', 'text', 'like_count', 'quote_count', 'reply_count', 'retweet_count'])
csvFile.close()

append_to_csv(json_response=json_response, fileName="data.csv")

# of Tweets added from this response:  10


In [14]:
## // Store collected tweets in a pickle file //

# import pandas as pd
# from pandas import json_normalize 
# df = json_normalize(json_response, 'data')
# display(df)
# print(df.columns)
# df.to_pickle('data/collected_tweets.pkl')
# df = pd.read_pickle('data/collected_tweets.pkl')

In [15]:
## // Store the data in Human readable format (collected_tweets.txt) // 
# but also store the data in a pickle dataframe ready to be processed using pandas.

# keys = json_response["data"][0].keys()
# print(keys)

# with open('../data/collected_tweets.txt', 'a') as outfile:
#     for i in range(len(json_response["data"])):
#         json.dump(json_response["data"][i], outfile)
#         outfile.write('\n')

# Pagination (Loops)

**Query**

* Contains these keywords: (ethereum OR ether OR eth)
* Is written in English (lang:en)
* Is not a retweet (-is:retweet)
* Is not a reply (-is:reply)
* Is not an advertisement (-is:nullcast)
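
The conditions above map directly onto query operators, so the query string can be assembled from its parts. The sketch below is illustrative only: `build_query` is a hypothetical helper that is not used elsewhere in this notebook. Note that the query strings in the request cells include the first four conditions but not `-is:nullcast`.

```python
def build_query(keywords, lang="en", exclude=("retweet", "reply", "nullcast")):
    """Join keywords with OR and append the language and exclusion operators."""
    keyword_group = "(" + " OR ".join(keywords) + ")"
    exclusions = " ".join(f"-is:{flag}" for flag in exclude)
    return f"{keyword_group} lang:{lang} {exclusions}"


query = build_query(["ethereum", "ether", "eth"])
print(query)
```

Building the string from a list makes it easier to swap keyword sets, for example when switching between the Ethereum and Audius queries, without retyping the operators.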

**How many tweets?**

* 24 hours × 365 days = 8,760 hourly windows per year
* Estimated 100-200 tweets per hour

* Upper bound: 200 * 24 * 365 = 1_752_000 tweets
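
The estimate can be reproduced with a few lines of arithmetic. This is a small sketch; the 100-200 tweets per hour range is the assumption stated above, not a measured value.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
TWEETS_PER_HOUR = (100, 200)  # assumed range from the estimate above

# One collection window per hour of the year
windows = HOURS_PER_DAY * DAYS_PER_YEAR

# Lower and upper bound on the yearly tweet volume
low, high = (rate * windows for rate in TWEETS_PER_HOUR)

print(f"hourly windows: {windows}")
print(f"expected tweets: {low:_} to {high:_}")
```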

In [15]:
# Build one timestamp per hour from 2021-05-01 00:00 up to 2021-08-31 23:00
date_list = []

current = datetime.datetime(2021, 5, 1)
end = datetime.datetime(2021, 9, 1)  # exclusive upper bound

while current < end:
    # datetime arithmetic handles month lengths, so no per-month day counts are needed
    date_list.append(current.strftime("%Y-%m-%dT%H:00:00.000Z"))
    current += datetime.timedelta(hours=1)

print(f"Total timestamps: {len(date_list)}")

Total timestamps: 2952


In [16]:
# Create start_list and end_list
start_list = date_list
end_list = date_list[1:]

# Sanity check
pd.DataFrame(data = {"start":start_list[:25],
                    "end":end_list[:25]})

| | start | end |
|---|---|---|
| 0 | 2021-05-01T00:00:00.000Z | 2021-05-01T01:00:00.000Z |
| 1 | 2021-05-01T01:00:00.000Z | 2021-05-01T02:00:00.000Z |
| 2 | 2021-05-01T02:00:00.000Z | 2021-05-01T03:00:00.000Z |
| 3 | 2021-05-01T03:00:00.000Z | 2021-05-01T04:00:00.000Z |
| 4 | 2021-05-01T04:00:00.000Z | 2021-05-01T05:00:00.000Z |
| 5 | 2021-05-01T05:00:00.000Z | 2021-05-01T06:00:00.000Z |
| 6 | 2021-05-01T06:00:00.000Z | 2021-05-01T07:00:00.000Z |
| 7 | 2021-05-01T07:00:00.000Z | 2021-05-01T08:00:00.000Z |
| 8 | 2021-05-01T08:00:00.000Z | 2021-05-01T09:00:00.000Z |
| 9 | 2021-05-01T09:00:00.000Z | 2021-05-01T10:00:00.000Z |


In [17]:
print("starting timestamp:", start_list[0])
print("ending timestamp:", start_list[-1])

starting timestamp: 2021-05-01T00:00:00.000Z
ending timestamp: 2021-08-31T23:00:00.000Z


In [43]:
# start_list =    ['2021-01-01T00:00:00.000Z',
#                  '2021-02-01T00:00:00.000Z',
#                  '2021-03-01T00:00:00.000Z',
#                  '2021-04-01T00:00:00.000Z',
#                  '2021-05-01T00:00:00.000Z',
#                  '2021-06-01T00:00:00.000Z',
#                  '2021-07-01T00:00:00.000Z',
#                  '2021-08-01T00:00:00.000Z',
#                  '2021-09-01T00:00:00.000Z',
#                  '2021-10-01T00:00:00.000Z',
#                  '2021-11-01T00:00:00.000Z',
#                  '2021-12-01T00:00:00.000Z',
#                 ]

# end_list =      ['2021-01-31T00:00:00.000Z',
#                  '2021-02-28T00:00:00.000Z',
#                  '2021-03-31T00:00:00.000Z',
#                  '2021-04-30T00:00:00.000Z',
#                  '2021-05-31T00:00:00.000Z',
#                  '2021-06-30T00:00:00.000Z',
#                  '2021-07-31T00:00:00.000Z',
#                  '2021-08-31T00:00:00.000Z',
#                  '2021-09-30T00:00:00.000Z',
#                  '2021-10-31T00:00:00.000Z',
#                  '2021-11-30T00:00:00.000Z',
#                  '2021-12-31T00:00:00.000Z']


# start_list = list(pd.date_range(start="2021-01-01T00:00:00.000Z", end="2021-12-31T00:00:00.000Z", freq='1H'))

In [19]:
#Inputs for tweets
bearer_token = auth()
headers = create_headers(bearer_token)

# keyword = "(ethereum OR ether OR eth) lang:en -is:retweet -is:reply"
keyword = "(audiusproject OR audius OR $audio OR audiocoin) lang:en -is:retweet -is:reply"
datafile = "../data/audio_twitter_data_hourly_full.csv"

max_results = 300 # total results per API call


#Total number of tweets we collected from the loop
total_tweets = 0

# Create file
csvFile = open(datafile, "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)

# (Only the first time)
# Create headers for the data you want to save; in this example, we only want to save these columns in our dataset
# csvWriter.writerow(['author_id', 'created_at', 'tweet_id', 'text',
#                     'like_count', 'quote_count', 'reply_count', 'retweet_count'])

csvFile.close()

# end_list is one element shorter than start_list, so iterate only over complete (start, end) pairs
for i in range(len(end_list)):

    # Inputs
    count = 0 # Counting tweets per time period
    max_count = 10_000 # Max tweets per time period
    flag = True
    next_token = None
    
    # Check if flag is true
    while flag:
        
        # Check if max_count reached
        if count >= max_count:
            break
        print("-------------------")
        print("Token: ", next_token)
        url = create_url(keyword, start_list[i],end_list[i], max_results)
        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
        result_count = json_response['meta']['result_count']

        if 'next_token' in json_response['meta']:
            # Save the token to use for next call
            next_token = json_response['meta']['next_token']
            print("Next Token: ", next_token)
            if result_count is not None and result_count > 0 and next_token is not None:
                print("Start Date: ", start_list[i])
                append_to_csv(json_response, datafile)
                count += result_count
                total_tweets += result_count
                print("Total # of Tweets added: ", total_tweets)
                print("-------------------")
                time.sleep(5)                
        
        # If no next token exists
        else:
            if result_count is not None and result_count > 0:
                print("-------------------")
                print("Start Date: ", start_list[i])
                append_to_csv(json_response, datafile)
                count += result_count
                total_tweets += result_count
                print("Total # of Tweets added: ", total_tweets)
                print("-------------------")
                time.sleep(5)
            
            #Since this is the final request, turn flag to false to move to the next time period.
            flag = False
            next_token = None
        
        time.sleep(2)
        
print("Total number of results: ", total_tweets)

-------------------
Token:  None
Endpoint Response Code: 200
-------------------
Start Date:  2021-05-01T00:00:00.000Z
# of Tweets added from this response:  7
Total # of Tweets added:  7
-------------------
-------------------
Token:  None
Endpoint Response Code: 200
-------------------
Start Date:  2021-05-01T01:00:00.000Z
# of Tweets added from this response:  11
Total # of Tweets added:  18
-------------------


KeyboardInterrupt: 