### Gathering Data

To analyze political Twitter for bot activity, I gathered tweets surrounding the 2022 race for the IN-01 congressional district. I chose such a narrow scope because I expected a local race to receive less automated monitoring than something like a presidential election. For three weeks, I compiled a dataset of tweets mentioning four distinct topics: Frank Mrvan (D), Jennifer-Ruth Green (R), IN-01, and the NWI Times (the leading news source for Northwest Indiana). This notebook covers step 1 of the overall process: gathering tweets with the Twitter API.

In [1]:
import json
import os
import requests
import time

As a preliminary step, I signed up for a Twitter Developer account and received a bearer token, which is required for every Twitter API query.

In [2]:
bearer_token = ...

The next step was to write a general-purpose function that returns the tweets matching a specific query. Each call pulls one batch of 10 tweets (the endpoint's default page size, since no `max_results` is set) and returns it as parsed JSON.

In [None]:
def searchTwitter(query, tweet_fields, user_fields, next_token, bearer_token):

    headers = {"Authorization": f"Bearer {bearer_token}"}

    # Removes spaces and newlines so multi-line field strings form a valid URL
    tweet_fields = "".join(tweet_fields.split())
    user_fields = "".join(user_fields.split())

    url = f"https://api.twitter.com/2/tweets/search/recent?query={query}&{tweet_fields}&{user_fields}"

    # Appends the pagination token when another batch of tweets can be pulled
    if next_token is not None:
        url += f"&next_token={next_token}"

    response = requests.get(url, headers=headers)

    # On hitting the request limit (HTTP 429), pauses for 30 minutes and retries once
    if response.status_code == 429:
        print("Request limit reached.")
        time.sleep(1800)  # time.sleep() does not accept a secs= keyword argument
        response = requests.get(url, headers=headers)

    # Any other non-success status is raised as an error
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)

    # Returns the parsed JSON response as a dict
    return response.json()


Working from a student Twitter Developer account made this process messy. I could pull only about 4,000 tweets every 30 minutes, and my account was limited to tweets from the past seven days. These hindrances made for a less-than-optimal data-gathering process, as I had to run my queries at around the same time on Sunday nights for three weeks straight.
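
The fixed 30-minute pause could in principle be replaced by waiting only as long as the API asks. The following is a minimal sketch, not the approach used in this notebook. It assumes the rate-limited response carries an `x-rate-limit-reset` header holding a Unix timestamp, and `seconds_until_reset` is a hypothetical helper name introduced here for illustration.

```python
import time


def seconds_until_reset(headers, now=None, default=1800):
    """Return how many seconds to sleep before retrying a rate-limited request.

    `headers` is a mapping of response headers. If the (assumed)
    `x-rate-limit-reset` header is missing or malformed, fall back to
    the notebook's fixed 30-minute pause.
    """
    now = time.time() if now is None else now
    try:
        reset_at = int(headers["x-rate-limit-reset"])
    except (KeyError, TypeError, ValueError):
        return default
    # Never return a negative wait; add one second of slack.
    return max(0, reset_at - int(now)) + 1
```

Inside `searchTwitter`, the pause could then read `time.sleep(seconds_until_reset(response.headers))`, which falls back to 1800 seconds whenever the header is unavailable.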

Next, I had to determine the queries that I would use for each of the four topics.

In [None]:
mrvan_query = '("Frank Mrvan" OR Mrvan OR @RepMrvan) -is:retweet'
jrg_query = '("Jennifer-Ruth Green" OR JRG OR @JenRuthGreen) -is:retweet'
in_query = '(IN01 OR "IN-01") -is:retweet'
nwi_query = '(@nwi) -is:retweet'
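
All four queries share one shape: a parenthesized group of terms joined by `OR`, followed by `-is:retweet` to exclude retweets. As an illustration only, the sketch below rebuilds them from term lists. `build_query` is a hypothetical helper, not part of the original notebook, and its rule of quoting any term that contains a space or hyphen is an assumption inferred from the strings above.

```python
def build_query(terms, exclude_retweets=True):
    """Join search terms with OR inside parentheses.

    Terms containing a space or a hyphen are wrapped in double quotes
    so they are matched as exact phrases (an assumed rule that happens
    to reproduce the four hand-written queries).
    """
    quoted = [f'"{t}"' if (" " in t or "-" in t) else t for t in terms]
    query = "(" + " OR ".join(quoted) + ")"
    if exclude_retweets:
        query += " -is:retweet"
    return query


mrvan_query = build_query(["Frank Mrvan", "Mrvan", "@RepMrvan"])
in_query = build_query(["IN01", "IN-01"])
print(mrvan_query)
print(in_query)
```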

Each request asks for the tweet and user fields below.

In [None]:
tweet_fields = """
    tweet.fields=
    id,text,edit_history_tweet_ids,attachments,author_id,
    created_at,entities,in_reply_to_user_id,
    possibly_sensitive,public_metrics,source
"""
user_fields = """
    user.fields=
    name,username,created_at,description,entities,
    location,pinned_tweet_id,protected,public_metrics,verified
"""
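
Because these strings end up inside a URL, stray spaces and newlines matter. One alternative, sketched below under stated assumptions, is to keep the field names in Python lists and let the standard library do the encoding. The field lists are shortened for illustration. The `expansions` parameter is not in the original notebook; it is included on the assumption (my understanding of the v2 API) that user fields are only returned when the author is requested as an expansion.

```python
from urllib.parse import parse_qs, urlencode

# Shortened subsets of the fields listed above, for illustration only.
tweet_field_names = ["id", "text", "author_id", "created_at", "public_metrics"]
user_field_names = ["name", "username", "created_at", "verified"]

params = {
    "query": '(IN01 OR "IN-01") -is:retweet',
    "tweet.fields": ",".join(tweet_field_names),
    "user.fields": ",".join(user_field_names),
    # Assumption: not in the original notebook (see the note above).
    "expansions": "author_id",
}

# urlencode percent-encodes quotes, parentheses, and commas, and turns spaces into "+".
query_string = urlencode(params)
url = f"https://api.twitter.com/2/tweets/search/recent?{query_string}"
print(url)

# Decoding the query string recovers the original values unchanged.
decoded = {key: values[0] for key, values in parse_qs(query_string).items()}
```

With `requests`, the same dictionary can be passed directly as `requests.get(base_url, params=params, headers=headers)`, which performs equivalent encoding.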

For each search, requests continue until either (a) the query is exhausted and no more tweets remain to be pulled or (b) the soft limit on requests is reached. If the soft limit is reached, the `searchTwitter` function pauses for 30 minutes to let the limit expire and then retries the request.

In [None]:
query = mrvan_query  # example; swap in jrg_query, in_query, or nwi_query

results = []
next_token = None

while True:
    json_results = searchTwitter(
        query,
        tweet_fields,
        user_fields,
        next_token,
        bearer_token
    )
    results.append(json_results)

    # The last page has no "next_token" key, so .get() returns None and the loop ends
    next_token = json_results["meta"].get("next_token")
    if not next_token:
        break
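
The pagination logic can be checked without touching the API. The sketch below is an illustration, not part of the original notebook: `iter_pages` is a hypothetical generator that takes any page-fetching callable, and `fake_pages` is made-up data standing in for real responses. It shows the loop ending once a page arrives without a `next_token`.

```python
def iter_pages(fetch_page):
    """Yield result pages until one arrives without a next_token.

    `fetch_page` is any callable that takes a pagination token (None
    for the first page) and returns a decoded JSON page as a dict.
    """
    next_token = None
    while True:
        page = fetch_page(next_token)
        yield page
        next_token = page.get("meta", {}).get("next_token")
        if not next_token:
            break


# Stand-in for the API: fake pages keyed by the token that requests them.
fake_pages = {
    None: {"data": [{"id": "1"}], "meta": {"next_token": "a"}},
    "a": {"data": [{"id": "2"}], "meta": {"next_token": "b"}},
    "b": {"data": [{"id": "3"}], "meta": {}},
}

pages = list(iter_pages(fake_pages.get))
ids = [tweet["id"] for page in pages for tweet in page.get("data", [])]
print(ids)
```

Against the real endpoint, the callable could be `lambda token: searchTwitter(query, tweet_fields, user_fields, token, bearer_token)`.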

At this point, the tweet information is stored as a list of paginated results, which needs to be flattened into a single list of tweets. That list is then written to a JSON file for export.

In [4]:
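tweets = []
for page in results:
    # A page with no matching tweets has no "data" key, so fall back to an empty list
    for item in page.get("data", []):
        tweets.append(item)

# Writes the flattened list of tweets to disk as JSON
with open('query_type.json', 'w') as f:
    json.dump(tweets, f)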

This process was repeated on 3 different occasions for each of the 4 queries. The resulting JSON files can be found in the `data` subdirectory.
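
If the weekly pulls for a query are later combined, the same tweet could appear in more than one file whenever the seven-day windows overlap. That overlap is an assumption on my part rather than something observed in the data. A minimal sketch of merging by tweet ID follows; `merge_tweets` is a hypothetical helper, and the sample tweets are made up.

```python
def merge_tweets(*batches):
    """Merge lists of tweet dicts, keeping the first copy of each tweet ID."""
    seen = set()
    merged = []
    for batch in batches:
        for tweet in batch:
            if tweet["id"] not in seen:
                seen.add(tweet["id"])
                merged.append(tweet)
    return merged


# Made-up sample data standing in for two weekly pulls of one query.
week1 = [{"id": "1", "text": "a"}, {"id": "2", "text": "b"}]
week2 = [{"id": "2", "text": "b"}, {"id": "3", "text": "c"}]

merged = merge_tweets(week1, week2)
print([tweet["id"] for tweet in merged])
```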