# Political Polarization in the US


### By Toke Bøgelund Andersen (s164202), Mikkel Grønning (s144968) and Ida Riis Jensen (s161777)

#### Course 02805 Social Graphs and Interactions 

<img src="https://img.youtube.com/vi/KEkrWRHCDQU/0.jpg" alt="image info" />

## Setup

In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import urllib.request
import camelot
import tweepy
import tqdm
import networkx as nx 
import pickle
import itertools
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from svgpathtools import svg2paths
from svgpath2mpl import parse_path
import matplotlib as mpl
from operator import itemgetter
import seaborn as sns
sns.set()
from fa2 import ForceAtlas2
from community import community_louvain
import plotly.express as px

## Motivation

### What is your dataset?
The idea is to analyze `Twitter` data, focusing on tweets from the American congress in the period 2017-2020, to get an understanding of the political polarization in the US. The data used in this project consists of tweets from 1072 members of the 115th and 116th congress and from the president of the United States, Donald J. Trump. The data comes from Harvard Dataverse (see References) and the Trump Twitter Archive (see References) and contains the following information about each account:

* the state they are from,
* whether they are a representative, a senator or POTUS,
* their full name,
* which party they are a member of,
* and their Twitter handle.

The Twitter handles have been used to download tweets from all members in the given period using the Twitter API (see References). In addition, we have added 16 of the largest media outlets in the US with the same attributes but without tweets.

You can read all about how the tweets were extracted in [this] notebook. Note that extracting the tweets requires a Twitter developer account in order to access the Twitter API. Extracting all the raw data takes approx. 24 hours, whereas extracting the cleaned and processed data takes approx. 6-8 hours, since the raw data contains many duplicates and other unnecessary data. In this notebook we only consider the cleaned data; we refer to the other notebook for a full account of how the data was extracted and cleaned.

Information about followers and retweets has been extracted for all users (both persons and media outlets) in order to create networks that might reveal some interesting information about the polarization.

<img src="../figures/congress.png" width=350 height=250 /> <img src="../figures/trumpeten.png" width=350 height=250 /> <img src="../figures/medias.jpg" width=350 height=250 />

Per Twitter's Developer Policy, tweet ids may be publicly shared for academic purposes; tweets may not (insert ref.). Thus, the data available to our readers will not contain the tweets.

### Why did you choose these particular datasets?
These particular datasets have been chosen because we want to investigate whether the political polarization in the US appears in the congress members' tweets. One could suspect that the polarization was especially pronounced during Donald Trump's presidency, which makes that period interesting to look at. It also guarantees us a large network which can be analysed through followers, retweets and a bipartite graph showing the polarization.

In relation to text analysis, the language used on Twitter is allegedly not as neutral as the language used on Wikipedia, which should result in a more interesting sentiment analysis.

### What was your goal for the end user's experience?

The goal is to provide an analysis of whether the polarization of the political fronts is expressed in the tweets, and of whether there is a pattern in who follows and retweets whom within the congress. Additionally, the aim is to provide insight into how the media influence this polarization.

## Basic stats

We are working with large datasets, which demands a lot of cleaning and preprocessing. This section presents how the data for the network part and the text analysis part of the assignment has been cleaned and processed. The steps are explained in this notebook, and the corresponding notebooks and functions can be found on `GitHub` (insert ref.).

<img src="../figures/data_processing.png" width=660 height=150 />

### Data cleaning and preprocessing

#### 116<sup>th</sup> congress

First, the Twitter handles for the 116<sup>th</sup> congress are extracted using [this](https://triagecancer.org/congressional-social-media) source. This source was chosen because it provides both the Twitter handle and the party of each member.

`BeautifulSoup` is used to extract the HTML table from the webpage (that has been downloaded to allow for offline work).

In [None]:
# Open data
with open('../Data/Raw/116_congress_twitter.html') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Find table
table = soup.find('table', attrs={'id':"footable_16836"})

# Extract data row wise from table (the header row has no <td> cells and yields an empty list)
rows = []
for tr in table.find_all('tr'):
    cells = tr.find_all('td')
    rows.append([cell.text for cell in cells])

# Make the data into a Pandas data frame (skipping the empty header row) and drop irrelevant columns
headers = [header.get_text() for header in table.find_all('th')]
Data116 = pd.DataFrame(rows[1:], columns = headers).drop(columns = ['Name Links', 'Twitter Links', 'Instagram', 'Facebook Page', 'Facebook'])

# Ensure that the type of politician is aligned
rename_chamber = {'U.S. Representative': 'Representative', 'U.S. Senator': 'Senator'}
Data116 = Data116.replace(rename_chamber).rename(columns = {'Chamber of Congress': 'Type'})

In this data set the state is given together with the congressional district. This is fixed with the regular expressions shown below, which map each entry to its two-letter state abbreviation. Moreover, the "@" is removed from the Twitter handles, as the Twitter API does not need it. The vacant positions in Congress are also disregarded.

In [None]:
# All states abbreviations
us_state_abbrev = {
    r'Alabama.*': 'AL',
    r'Alaska.*': 'AK',
    r'American Samoa.*': 'AS',
    r'Arizona.*': 'AZ',
    r'Arkansas.*': 'AR',
    r'California.*': 'CA',
    r'Colorado.*': 'CO',
    r'Connecticut.*': 'CT',
    r'Delaware.*': 'DE',
    r'District of Columbia.*': 'DC',
    r'Florida.*': 'FL',
    r'Georgia.*': 'GA',
    r'Guam.*': 'GU',
    r'Hawaii.*': 'HI',
    r'Idaho.*': 'ID',
    r'Illinois.*': 'IL',
    r'Indiana.*': 'IN',
    r'Iowa.*': 'IA',
    r'Kansas.*': 'KS',
    r'Kentucky.*': 'KY',
    r'Louisiana.*': 'LA',
    r'Maine.*': 'ME',
    r'Maryland.*': 'MD',
    r'Massachusetts.*': 'MA',
    r'Michigan.*': 'MI',
    r'Minnesota.*': 'MN',
    r'Mississippi.*': 'MS',
    r'Missouri.*': 'MO',
    r'Montana.*': 'MT',
    r'Nebraska.*': 'NE',
    r'Nevada.*': 'NV',
    r'New Hampshire.*': 'NH',
    r'New Jersey.*': 'NJ',
    r'New Mexico.*': 'NM',
    r'New York.*': 'NY',
    r'North Carolina.*': 'NC',
    r'North Dakota.*': 'ND',
    r'Northern Mariana Islands.*':'MP',
    r'Ohio.*': 'OH',
    r'Oklahoma.*': 'OK',
    r'Oregon.*': 'OR',
    r'Pennsylvania.*': 'PA',
    r'Puerto Rico.*': 'PR',
    r'Rhode Island.*': 'RI',
    r'South Carolina.*': 'SC',
    r'South Dakota.*': 'SD',
    r'Tennessee.*': 'TN',
    r'Texas.*': 'TX',
    r'Utah.*': 'UT',
    r'Vermont.*': 'VT',
    r'Virgin Islands.*': 'VI',
    r'Virginia.*': 'VA',
    r'Washington.*': 'WA',
    r'West V.*': 'WV', # Written in different ways
    r'Wisconsin.*': 'WI',
    r'Wyoming.*': 'WY'
}

# Convert states to two letter abbreviations
Data116['State'] = Data116['State'].replace(regex = us_state_abbrev)

# Remove @
Data116 = Data116.replace(regex = {r'^@': ''})

# Remove vacant positions
Data116 = Data116[Data116.Name != "Vacant"]

# Look at the data
Data116
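
The patterns in the mapping above are unanchored, so a pattern such as `Virginia.*` can also match inside "West Virginia". As a minimal sketch of a more defensive variant, the patterns can be anchored at the start of the string. The trimmed dictionary and the raw values below are made up for illustration and are not taken from our data.

```python
import pandas as pd

# Trimmed, hypothetical subset of the state mapping with anchored patterns.
anchored_abbrev = {
    r'^Arkansas.*': 'AR',
    r'^Kansas.*': 'KS',
    r'^Virginia.*': 'VA',
    r'^West Virginia.*': 'WV',
}

# Made-up raw values in a "state plus district" style.
raw_states = pd.Series(['Virginia 3', 'West Virginia 1', 'Arkansas 2', 'Kansas 4'])

# Each value is matched only by the pattern that fits it from the first character.
abbreviated = raw_states.replace(regex = anchored_abbrev)
print(abbreviated.tolist())  # ['VA', 'WV', 'AR', 'KS']
```

With anchoring, the result no longer depends on one pattern cleaning up after another.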

It is also seen that there is an inconsistency in the way the names are written. This is changed so that all names are written with the first name first:

In [None]:
Data116['Name'] = [name[1][1:]+ " " +name[0] if len(name) == 2 else name[0] for name in [name.replace(u'\xa0', u'').split(',') for name in Data116.Name]]
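
The one-liner above can be hard to read, so here is a minimal sketch of the same idea as a small helper. It is not the code we ran: the function name `to_first_last` and the sample strings are hypothetical. Instead of dropping the first character after the comma, it replaces non-breaking spaces with ordinary spaces and strips whitespace, which avoids cutting off a letter when there is no space after the comma.

```python
def to_first_last(raw_name: str) -> str:
    """Turn 'Last, First' into 'First Last'; leave other names unchanged."""
    # Treat non-breaking spaces as ordinary spaces before splitting.
    cleaned = raw_name.replace("\xa0", " ").strip()
    parts = [part.strip() for part in cleaned.split(",")]
    if len(parts) == 2:
        last, first = parts
        return f"{first} {last}"
    return cleaned


# Made-up inputs covering the formats handled above.
examples = ["Graham, Lindsey", "LaHood,\xa0Darin", "Bobby Scott"]
normalized = [to_first_last(name) for name in examples]
print(normalized)  # ['Lindsey Graham', 'Darin LaHood', 'Bobby Scott']
```

Unlike the one-liner, this sketch returns the whole cleaned string when a name contains no comma or more than one comma.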

#### 115<sup>th</sup> congress

Now we move on to the 115th congress. This data is stored as a table in a PDF, so the `camelot` library is used to extract it.

In [None]:
# Get data
file115 = '../Data/Raw/115_congress_twitter.pdf'

# Read table across all pages
tables = camelot.read_pdf(file115, pages = 'all')

# Convert data to pandas data frame
Data115 = pd.DataFrame(np.concatenate([d.df.drop(0).values for d in tables]), columns=tables[0].df.iloc[0]).drop(columns = "District")

# Align chamber name with the 116 data
rename_chamber = {'Rep.': 'Representative', 'Sen.': 'Senator'}
Data115 = Data115.replace(rename_chamber)

# Align name with the 116 data and store it in one column
Data115["Name"] = Data115["First Name"] + " " + Data115["Last Name"]
Data115 = Data115.drop(columns = ["First Name", "Last Name"])

# Align columns name with the 116 data
Data115 = Data115.rename(columns = {'Title': 'Type', "Twitter Handle": "Twitter"})

#### Merge data

Now the two datasets are merged. Here we need to take duplicate accounts into account, which arise when members are reelected.

In [None]:
# Merge data sets (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
Data_Full = pd.concat([Data115, Data116], ignore_index = True)

# Get shape
Data_Full.shape

In [None]:
# Extra duplicate from AS
Data_Full = Data_Full[Data_Full.Twitter != 'RepTomPrice']

# Drop closed users
Data_Full = Data_Full[~Data_Full.Name.isin(['Aumua Radewages', 'Madeleine Bordallo', 'Elizabeth Esty'])]

# Fix Erik Paulsen's Twitter handle
Data_Full.loc[Data_Full.Name == "Erik Paulsen", "Twitter"] = "ErikPaulsen"

# Fix Bobby Scott's Twitter handle
Data_Full.loc[Data_Full.Name == "Bobby Scott", "Twitter"] = "BobbyScott"

# Fix Dave Reichert's Twitter handle
Data_Full.loc[Data_Full.Name == "Dave Reichert", "Twitter"] = "TeamReichert"

# Fix Lindsey Graham's Twitter handle
Data_Full.loc[Data_Full.Name == "Lindsey Graham", "Twitter"] = "LindseyGrahamSC"

# Fix Darin LaHood's truncated name
Data_Full.loc[Data_Full.Name == "arin LaHood", "Name"] = "Darin LaHood"

In [None]:
Data_Full = Data_Full.drop_duplicates(subset = ["Twitter"], keep = 'last')

In [None]:
Data_Full = Data_Full.drop_duplicates(subset = ["Name"], keep = 'last')
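
To illustrate why `keep = 'last'` is used in the two steps above, here is a minimal sketch on a toy data frame. The names, handles and the `Congress` column are hypothetical and only serve the example. Because the 115th congress rows come first in the merged data, keeping the last duplicate retains the most recent entry for each member.

```python
import pandas as pd

# Hypothetical members: the 115th congress rows come first, the 116th last,
# mirroring the order in which the two data sets are merged.
toy = pd.DataFrame(
    {
        "Name": ["Jane Doe", "John Roe", "John Roe", "Jane Doe"],
        "Twitter": ["RepJaneDoe", "RepJohnRoe", "RepJohnRoe", "SenJaneDoe"],
        "Congress": [115, 115, 116, 116],
    }
)

# Reelected with the same handle: keep only the most recent row.
deduplicated = toy.drop_duplicates(subset = ["Twitter"], keep = "last")

# Same person under a new handle: again keep the most recent row.
deduplicated = deduplicated.drop_duplicates(subset = ["Name"], keep = "last")

print(deduplicated)
```

Both remaining rows in this toy example come from the 116th congress.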

#### Adding the President

In [None]:
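# Add the President as a single-row data frame (DataFrame.append was removed in pandas 2.0)
potus = pd.DataFrame([{'State': None, 'Party': 'R', 'Type': 'POTUS', 'Twitter': 'realDonaldTrump', 'Name': 'Donald J. Trump', 'twitter_display_name': 'Donald J. Trump'}])
Data_Full = pd.concat([Data_Full, potus], ignore_index=True)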

#### Merging data including media and correct period (1: 27.01.2017-02.01.2019, 2: 27.01.2019-07.05.2020)

In [None]:
congress = pd.read_pickle('../Data/Interim/congress.pkl')
trump = pd.read_pickle('../Data/Interim/trump.pkl')

In [None]:
# Concatenating the congress with Trump
congress_tweets = pd.concat([congress, trump])

In [None]:
# Removing duplicates
congress_tweets.drop_duplicates(keep='first', inplace=True)
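
The heading of this subsection lists two collection periods. As a minimal sketch of how tweets could be assigned to them, the example below labels timestamps with period 1 or 2. It makes several assumptions: the column name `created_at` is hypothetical, the timestamps are timezone-naive, and the end dates are treated as inclusive.

```python
import pandas as pd

# Collection periods from the heading above (written there as day.month.year).
PERIODS = {
    1: (pd.Timestamp("2017-01-27"), pd.Timestamp("2019-01-02")),
    2: (pd.Timestamp("2019-01-27"), pd.Timestamp("2020-05-07")),
}


def assign_period(timestamps: pd.Series) -> pd.Series:
    """Return 1 or 2 for timestamps inside a period and <NA> otherwise."""
    period = pd.Series(pd.NA, index=timestamps.index, dtype="Int64")
    for label, (start, end) in PERIODS.items():
        # Assumption: the end date is inclusive, so the whole last day counts.
        inside = (timestamps >= start) & (timestamps < end + pd.Timedelta(days=1))
        period.loc[inside] = label
    return period


# Made-up timestamps; `created_at` is a hypothetical column name.
toy_tweets = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            ["2017-03-01 12:00", "2019-01-02 18:30", "2019-01-15 09:00", "2020-01-01 08:00"]
        )
    }
)
toy_tweets["period"] = assign_period(toy_tweets["created_at"])
print(toy_tweets)
```

Tweets that fall between the two periods get a missing value and can be dropped afterwards.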

In [None]:
# Getting handles
medias = pd.read_table('../Data/Raw/LargestMedia.csv', sep=';')
twitter_handles = pd.read_table('../Data/Processed/Twitter_Handles_updated.csv', sep = ',')

In [None]:
# Adding relevant columns to media dataframe
medias['State'] = None
medias['Party'] = None
medias['Type'] = 'Media'
medias.rename(columns={'Twitter name': 'Twitter', 'Media': 'Name'}, inplace=True)

Data_Full = pd.concat([twitter_handles, medias])

How have we cleaned and preprocessed the tweets...

### Discussion of dataset stats

#### Network size

#### Degree distribution

#### Average distance

#### Clustering

## Tools, theory and analysis

In this section, we go through how we have worked with the text and which network science tools and data analysis strategies we have used to investigate how the political polarization is expressed on Twitter.

<img src="../figures/tools.png" width=360 height=250/>


### Idea
The overall idea is to use the tools and methods learned in this course to find interesting results about the political polarization. We will explore the coherence of the congress by considering three graphs, each generated from a different view of the Twitter data. One graph investigates who follows whom in the congress. Do senators follow senators? Do republicans follow republicans? And so on. Another graph examines whether the political polarization is expressed in the retweets. In this graph the media are also taken into account in order to explore possible patterns of retweets from other sources. By comparing the 'Follow'-graph with the 'Retweet'-graph, we may gain insight into whether people merely lurk on their opponents, i.e. follow them without retweeting them. Furthermore, a graph visualizing the tags in each tweet is constructed. Do republicans tag democrats? And is the sentiment around the tag positive or negative? This graph brings us to the text processing in the form of sentiment analysis. We want to examine whether there are patterns in the language that depend on which part of the congress and which party people belong to. Using TF-TR, word clouds can be created, and hopefully they will show some interesting points with regard to the political polarization.

The points of interest for each analysis will be described in more detail in the following subsections. 

### Analysis step 1 - The 'Who Follows Whom'-graph

follow graph
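
As a minimal sketch of the kind of question this graph is meant to answer (do members mostly follow their own party?), the toy example below builds a small directed graph with `networkx` and computes the share of follow edges that stay within a party. The handles, parties and edges are made up for illustration and are not taken from our data.

```python
import networkx as nx

# Hypothetical toy data: handle -> party, and directed "follows" edges.
party = {"dem_a": "D", "dem_b": "D", "rep_a": "R", "rep_b": "R"}
follows = [
    ("dem_a", "dem_b"),
    ("dem_b", "dem_a"),
    ("rep_a", "rep_b"),
    ("rep_b", "rep_a"),
    ("dem_a", "rep_a"),  # the only cross-party follow
]

# Directed graph: an edge u -> v means that u follows v.
G = nx.DiGraph()
G.add_nodes_from((handle, {"party": p}) for handle, p in party.items())
G.add_edges_from(follows)

# Count the edges whose two endpoints belong to the same party.
same_party = sum(
    1 for u, v in G.edges if G.nodes[u]["party"] == G.nodes[v]["party"]
)
within_party_share = same_party / G.number_of_edges()
print(within_party_share)  # 0.8
```

A share close to 1 would indicate that members mainly follow their own party, while a share near the value expected under random mixing would indicate little polarization in the follow graph.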

### Analysis step 2 - The 'Who Retweets Whom'-graph

retweet graph

### Analysis step 3 - Natural Language Processing
tag graph

sentiment analysis

text analysis

TF-TR
Republicans vs. Democrats -> word clouds
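
As a minimal sketch of how TF-TR scores could feed the word clouds, the example below assumes that the TF-TR score of a term for one party is its term frequency (TF) multiplied by a term ratio (TR), where TR is the term's frequency in that party's tweets divided by its frequency in the other party's tweets plus a smoothing constant `c`. Both this definition and the value of `c` are assumptions for illustration, and the token lists are made up.

```python
from collections import Counter


def tf_tr(own_tokens, other_tokens, c=1.0):
    """Score each term in own_tokens by TF * TR.

    TF is the raw count in own_tokens. TR is that count divided by
    (count in other_tokens + c). This definition is an assumption.
    """
    own_tf = Counter(own_tokens)
    other_tf = Counter(other_tokens)
    return {
        term: count * (count / (other_tf[term] + c))
        for term, count in own_tf.items()
    }


# Made-up token lists, one per party.
republican_tokens = ["border", "border", "taxes", "jobs"]
democrat_tokens = ["healthcare", "healthcare", "jobs", "taxes"]

scores = tf_tr(republican_tokens, democrat_tokens)
print(scores)  # {'border': 4.0, 'taxes': 0.5, 'jobs': 0.5}
```

Terms that one party uses often and the other rarely get the highest scores, which is what should make them stand out in a word cloud.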

## Discussion


## Contributions

## References

### Links
* https://developer.twitter.com/en/docs/twitter-api
* http://www.trumptwitterarchive.com
* https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UIVHQR
* https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/MBOJNS

### Papers/books
* Albert-László Barabási. (2015). Network Science. Cambridge: Cambridge University Press. http://networksciencebook.com