This notebook contains the code used to collect tweet data on politicians in the U.S. Senate and House of Representatives. I used Tweepy to access the Twitter API through Python. Note that you will not be able to run this notebook without your own Twitter API credentials. You can apply for a Twitter Developer account at https://developer.twitter.com.

In [1]:
import tweepy
import time

Using Tweepy requires authentication from Twitter. I applied for a Twitter Developer account and, once it was approved, obtained the key and secret needed to access the API. For privacy, I have removed the actual key and secret from the cell below.

To learn how to use Tweepy, I used the resource here: https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1

In [2]:
consumer_key = "your key here"
consumer_secret = "your secret here"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)
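
Since the key and secret should stay out of the notebook, one option is to read them from environment variables instead of pasting them into a cell. The sketch below is an assumption on my part, not what this notebook actually does; the function `load_twitter_credentials` and the variable names `TWITTER_CONSUMER_KEY` and `TWITTER_CONSUMER_SECRET` are hypothetical.

```python
import os


def load_twitter_credentials(env=os.environ):
    """Return (consumer_key, consumer_secret) read from a mapping of environment variables."""
    # The two variable names below are hypothetical; use whatever names you set yourself.
    try:
        return env["TWITTER_CONSUMER_KEY"], env["TWITTER_CONSUMER_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending an empty key to the API.
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

With the variables set, `consumer_key, consumer_secret = load_twitter_credentials()` could replace the two placeholder strings in the cell above.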

To find Twitter handles for all politicians in Congress, I used the following source: https://triagecancer.org/congressional-social-media

Some of the handles were out of date or had been deleted, so I copied the data from the site above into a CSV file and updated it as needed. This file is included in the GitHub repository for this project.

The following block of code loads this CSV file from the project's GitHub repository into a pandas dataframe.

We take a peek at the first few rows of this dataframe, seeing that we have the state and/or district, branch of government, last name, first name, party, and Twitter handle for each politician.

In [3]:
import pandas as pd
import numpy as np

link = "https://raw.githubusercontent.com/lynn0032/CourseProject/main/data_files/twitter_handles.csv"

politicians_df = pd.read_csv(link)

politicians_df.head()

Unnamed: 0,State,Branch,Last Name,First Name,Party,Twitter Handle
0,Alabama,U.S. Senator,Shelby,Richard,R,SenShelby
1,Alabama,U.S. Senator,Tuberville,Tommy,R,Ttuberville
2,Alabama 1st District,U.S. Representative,Carl,Jerry,R,RepJerryCarl
3,Alabama 2nd District,U.S. Representative,Moore,Barry,R,RepBarryMoore
4,Alabama 3rd District,U.S. Representative,Rogers,Mike,R,RepMikeRogers


The following block of code creates a dataframe to store the information for all tweets. For each tweet, we store the handle, timestamp, ID, and text of the tweet.

The count, here 18, determines how many tweets are retrieved for each account. This number is limited, because the Twitter API only allows accessing about 10,000 tweets a day. 
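
The choice of 18 can be checked with a little arithmetic: divide the daily tweet budget by the number of accounts and round down. The sketch below assumes 535 accounts (100 senators plus 435 representatives); the actual number of rows in the CSV may differ, and in the notebook `len(politicians_df)` would be the natural value to use. The function name `tweets_per_account` is hypothetical.

```python
def tweets_per_account(daily_budget: int, n_accounts: int) -> int:
    """Largest per-account count that keeps the total within the daily budget."""
    if n_accounts <= 0:
        raise ValueError("n_accounts must be positive")
    # Integer division rounds down, so count * n_accounts never exceeds the budget.
    return daily_budget // n_accounts


# Assumption: roughly 10,000 tweets a day and 535 accounts.
print(tweets_per_account(10_000, 535))  # 18, since 18 * 535 = 9,630 and 19 * 535 = 10,165
```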

For each handle, I create a dataframe with the most recent tweets from that user. These are temporarily stored in a dataframe user_tweets_df, and exported to a csv file in my drive, as a backup. The dataframe user_tweets_df is also concatenated onto tweets_df, which collects the tweets from all users.

Once all of the tweets are collected, the dataframe tweets_df is merged with the dataframe politicians_df created above. Each row of the merged dataframe, tweets_dataset, represents a single tweet and also includes information on the politician who wrote it (including their name and party). Because the merge is a right join on politicians_df, a politician whose tweets could not be retrieved still appears as one row with empty tweet fields. The merged dataframe tweets_dataset is exported to a CSV file, to be stored and used later.

In [4]:
# Mount google drive
from google.colab import drive 
drive.mount('/content/gdrive')

# This can be replaced with a location on your own drive
drive_location = 'gdrive/My Drive/CS 410 Text Information Systems/Final Project/Tweets/'

column_names = ["Twitter Handle", "Created At", "ID", "Tweet Text"]
tweets_df = pd.DataFrame(columns = column_names)

count = 18

for username in politicians_df["Twitter Handle"]:
  try:
    # Pass the handle as screen_name (newer Tweepy releases no longer accept the id keyword)
    tweets = tweepy.Cursor(api.user_timeline, screen_name=username).items(count)
    tweets_list = [[tweet.created_at, tweet.id, tweet.text] for tweet in tweets]
    tweets_list = [[username] + tweet for tweet in tweets_list]
    user_tweets_df = pd.DataFrame(tweets_list, columns = column_names)
    tweets_df = pd.concat([tweets_df, user_tweets_df], axis = 0)
    user_tweets_df.to_csv(drive_location + str(username) + ".csv")
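  except Exception as e:
    # Report the failure, pause briefly, and move on to the next account.
    # Catching Exception (not BaseException) lets Ctrl-C still interrupt the loop.
    print("Failed to collect tweets:", str(e), "for account", username)
    time.sleep(3)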

tweets_dataset = pd.merge(tweets_df, politicians_df, how = "right", on = ["Twitter Handle"])
tweets_dataset.to_csv(drive_location + "all_tweets.csv")

Mounted at /content/gdrive
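
To see what the right join in the cell above does, here is a minimal sketch on toy data. The handles `handle_a` and `handle_b` and their tweets are made up for illustration; only the column names and the `pd.merge` call mirror the notebook.

```python
import pandas as pd

# Toy stand-ins for politicians_df and tweets_df (hypothetical handles and tweets).
politicians = pd.DataFrame({
    "Twitter Handle": ["handle_a", "handle_b"],
    "Party": ["D", "R"],
})
tweets = pd.DataFrame({
    "Twitter Handle": ["handle_a", "handle_a"],
    "ID": [1, 2],
    "Tweet Text": ["first tweet", "second tweet"],
})

# Same merge as in the notebook: keep every politician, matched to their tweets.
merged = pd.merge(tweets, politicians, how="right", on=["Twitter Handle"])

# handle_b has no tweets, so it shows up once with missing ID and Tweet Text.
no_tweets = merged[merged["ID"].isna()]

# Dropping those rows leaves exactly one row per collected tweet.
tweets_only = merged.dropna(subset=["ID"])
print(merged)
```

Whether to keep the empty rows is a judgment call: they make it easy to spot accounts that failed, but they should be dropped before counting tweets.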


To create a dataset of tweets, I ran the code above every evening from 11/15/21 through 11/26/21. In a subsequent notebook, we'll see that, after removing duplicate tweets, this resulted in 84,948 tweets collected.
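
Duplicates are expected here, since each run retrieves the most recent tweets per account and an account that tweets rarely will return some of the same tweets on consecutive evenings. The actual de-duplication lives in the subsequent notebook, which is not shown here; the sketch below is only one plausible approach, assuming the daily files are combined and de-duplicated on the tweet `ID` column. The toy frames `day_one` and `day_two` are made up.

```python
import pandas as pd

# Two hypothetical daily collections that overlap on tweet 102.
day_one = pd.DataFrame({"ID": [101, 102], "Tweet Text": ["a", "b"]})
day_two = pd.DataFrame({"ID": [102, 103], "Tweet Text": ["b", "c"]})

# Stack the daily collections, then keep the first occurrence of each tweet ID.
combined = pd.concat([day_one, day_two], ignore_index=True)
unique_tweets = combined.drop_duplicates(subset="ID").reset_index(drop=True)
print(len(combined), len(unique_tweets))
```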