Analysing the tweets of U.S. congress members (2017-2023) in relation to political affiliation and current issues, by considering the sentiment, frequency, and trends of related statements to understand the priorities and features of the two parties.
Link to Analysis: https://html-preview.github.io/?url=https://github.com/kennethkn/congresstweets-analysis/blob/main/analysis.html
This project aims to foster a broader understanding of bipartisan U.S. politics by analyzing the tweets of U.S. Congress members, including Democratic and Republican senators and representatives. Its importance lies in the potential to reveal patterns and trends in the political discourse of recent years. Understanding these patterns can provide insights into the priorities and strategies of the two parties. Additionally, analyzing the sentiment of tweets can reveal each party's stance on current issues.
The dataset available for this project is from the GitHub repository congresstweets. It contains a comprehensive collection of tweets from U.S. congress members since 2017, making it a rich resource for a diverse analysis.
- What are the trends in the most common words and hashtags used by Democratic and Republican members of Congress in their tweets?
- What sentiments do tweets by Democratic and Republican members of Congress express on significant issues such as COVID-19, climate change, abortion, and gun control?
Given the enormous size of the dataset (~4M entries), I have chosen a database approach to store and query the data. The database is hosted locally on my computer via PostgreSQL, but you can reproduce it by executing the Python scripts in the `scripts` folder, which holds scripts for database construction as well as text mining.
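To illustrate why a database suits this workload, here is a minimal sketch of a grouped query over a two-table layout. It makes several assumptions: an in-memory SQLite database stands in for PostgreSQL so the example runs without a server, the column names (`party`, `chamber`, `member_id`, `posted_at`) are hypothetical, and the rows are made up. The project's actual schema is whatever `models.py` defines.

```python
import sqlite3

# In-memory SQLite is only a stand-in for the project's PostgreSQL database.
conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema; the real tables are defined in models.py.
conn.executescript("""
CREATE TABLE members (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    party TEXT NOT NULL,
    chamber TEXT NOT NULL
);
CREATE TABLE tweets (
    id INTEGER PRIMARY KEY,
    member_id INTEGER NOT NULL REFERENCES members(id),
    posted_at TEXT NOT NULL,
    text TEXT NOT NULL
);
""")

# Made-up rows, purely for illustration.
conn.executemany("INSERT INTO members VALUES (?, ?, ?, ?)", [
    (1, "Member A", "D", "senate"),
    (2, "Member B", "R", "house"),
])
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    (10, 1, "2020-03-24T12:00:00", "Made-up tweet one"),
    (11, 1, "2021-01-05T09:30:00", "Made-up tweet two"),
    (12, 2, "2020-07-01T18:45:00", "Made-up tweet three"),
    (13, 2, "2020-11-03T08:15:00", "Made-up tweet four"),
])


def tweet_count_by_party_and_year(connection):
    """Return (party, year, count) rows, aggregated inside the database."""
    query = """
        SELECT m.party, substr(t.posted_at, 1, 4) AS year, COUNT(*) AS n
        FROM tweets AS t
        JOIN members AS m ON m.id = t.member_id
        GROUP BY m.party, year
        ORDER BY m.party, year
    """
    return connection.execute(query).fetchall()


counts = tweet_count_by_party_and_year(conn)
print(counts)
```

The aggregation happens inside the database engine, so only the small summary table has to be loaded into the analysis environment rather than millions of raw rows.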
- Ready PostgreSQL server (`brew install postgresql && brew services start postgresql` if you are using macOS)
- Create a database named `congresstweets` (`createdb congresstweets`)
- Clone the repository
- Notice the empty `data/tweets/` folder. You need to download the tweets data from the congresstweets repo, as well as here for older 2017 data. Place the downloaded JSON files (e.g. `2020-03-24.json`) in the `data/tweets/` folder.
- Set up a venv and activate it (`python -m venv venv && source venv/bin/activate`)
- Install the required packages (`pip install -r requirements.txt`)
- Open `.env` and replace `YOUR_USERNAME` with your PostgreSQL username. (`DATABASE_URL=postgresql://YOUR_USERNAME@localhost:5432/congresstweets`)
- Run `models.py` to create the tables.
- Run `db_insert_members.py` to populate the `members` table in the database.
- Run `db_insert_tweets.py` to populate the `tweets` table in the database.
- Run `text_mining.py` to populate columns pertaining to text mining results in the `tweets` table.
- Open `analysis.rmd` in RStudio and knit the file to generate the analysis.
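Before running the insert scripts, it can help to sanity-check the downloaded files and the connection string. The sketch below is not taken from the repository's scripts; the helper names (`find_daily_files`, `load_tweets`, `build_database_url`) are hypothetical, and it assumes each daily file is named `YYYY-MM-DD.json` and holds a JSON array of tweet objects.

```python
import json
import re
from pathlib import Path

# Assumed naming scheme, based on the example file name 2020-03-24.json.
DATE_FILE = re.compile(r"^\d{4}-\d{2}-\d{2}\.json$")


def find_daily_files(folder):
    """Return the date-named JSON files in `folder`, sorted chronologically."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and DATE_FILE.match(p.name)
    )


def load_tweets(path):
    """Parse one daily file, assumed to hold a JSON array of tweet objects."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return data


def build_database_url(username, host="localhost", port=5432, name="congresstweets"):
    """Build a connection string in the same shape as the one in .env."""
    return f"postgresql://{username}@{host}:{port}/{name}"


print(build_database_url("alice"))
# prints: postgresql://alice@localhost:5432/congresstweets
```

Calling `find_daily_files("data/tweets")` after the download step lists the files the insert script would be expected to see. Because ISO dates sort lexicographically, a plain `sorted` call is enough to put them in chronological order.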
- Tweet Count by Party and Year
- Tweet Count by Chamber and Year
- Top Tweeters by Year
- Top Hashtags
- Top Hashtags by Party
- Top Hashtags by Party and Year
- Top Hashtags by Chamber
- Top Words
- Top Words by Party
- Top Words by Party and Year
- Top Words by Chamber
- Sentiment Analysis by Party and Year
- Sentiment Analysis by Chamber and Year
- Sentiment Analysis by Topic and Party
- Top Accounts Retweeted
- Top Accounts Retweeted by Party
- Top Accounts Quoted
- Top Accounts Quoted by Party
- Top Accounts Mentioned
- Top Accounts Mentioned by Party
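As an illustration of how a category such as "Top Hashtags by Party" could be computed, here is a minimal sketch using only the standard library. It is not the project's actual method, which lives in `text_mining.py` and `analysis.rmd`. The function name, the `(party, text)` row shape, and the sample tweets are all made up for the example.

```python
import re
from collections import Counter, defaultdict

# A hashtag is taken to be '#' followed by word characters (a simplification).
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags_by_party(rows, n=3):
    """Count hashtags per party.

    rows: iterable of (party, tweet_text) pairs.
    Returns {party: [(hashtag, count), ...]} with the n most common tags.
    """
    counters = defaultdict(Counter)
    for party, text in rows:
        # Lowercase so that #ClimateAction and #climateaction count together.
        counters[party].update(tag.lower() for tag in HASHTAG.findall(text))
    return {party: counter.most_common(n) for party, counter in counters.items()}


# Invented sample rows, not real tweets from the dataset.
sample = [
    ("D", "Protect our planet #ClimateAction #ActOnClimate"),
    ("D", "Time to move forward on #climateaction"),
    ("R", "Standing up for #2A rights"),
    ("R", "Lower taxes now #TaxReform #2A"),
]

result = top_hashtags_by_party(sample, n=1)
print(result)
# prints: {'D': [('climateaction', 2)], 'R': [('2a', 2)]}
```

Adding a year to the grouping key, for example `(party, year)`, would extend the same idea to the "by Party and Year" variants.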
- Use of BERT or GPT to infer topics from tweets.
- Even more categories, such as top words by chamber and year, sentiment of tweets by topic and chamber, etc.
Major credits to Alex Litel for providing the dataset. https://github.com/alexlitel/congresstweets