# Cleaning and Exploratory Data Analysis

GitHub rejects files larger than 100 MB, so this notebook splits the scraped tweet data into smaller files that fit under that limit. This lets future users of the project replicate its findings.

**Data Sources**

Tweets scraped from archive found at: 
- https://github.com/alexlitel/congresstweets

Twitter handles of elected Senators and Representatives found at: 
- https://www.sbh4all.org/wp-content/uploads/2019/04/116th-Congress-Twitter-Handles.pdf
- https://sharedhope.org/wp-content/uploads/2018/02/US-Senate-Twitter-Handles-115th-Congress.pdf
- https://www.chn.org/wp-content/uploads/2017/01/house-member-twitter-handles-Jan-2017.pdf

Political party affiliations:
- https://en.wikipedia.org/wiki/115th_United_States_Congress
- https://en.wikipedia.org/wiki/116th_United_States_Congress

## Read in Libraries

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import datetime

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Preliminary Cleaning, Creation of Dataframe

### If the dataframe has already been created, skip to the "Read In Cleaned Data and Further EDA" section

In [4]:
# Tweets of the 115th Congress (Jan 3, 2017 - Jan 3, 2019)
# Data available only for 2018, 2019
df115_1 = pd.read_csv('../data/scrape/tweet_df_2018_1.csv')
df115_2 = pd.read_csv('../data/scrape/tweet_df_2018_2.csv')

# Tweets of the 116th Congress (Jan 3, 2019 - Jan 3, 2021)
df116_1 = pd.read_csv('../data/scrape/tweet_df_2019_1.csv')
df116_2 = pd.read_csv('../data/scrape/tweet_df_2019_2.csv')
df116_3 = pd.read_csv('../data/scrape/tweet_df_2020_1.csv')
df116_4 = pd.read_csv('../data/scrape/tweet_df_2020_2.csv')

# keeping only certain columns
columns_keep = ['id', 'screen_name', 'user_id', 'time', 'link', 'text', 'source']

In [5]:
# 115th Congress
df115_1 = df115_1[columns_keep]
df115_2 = df115_2[columns_keep]

# 116th Congress
df116_1 = df116_1[columns_keep]
df116_2 = df116_2[columns_keep]
df116_3 = df116_3[columns_keep]
df116_4 = df116_4[columns_keep]

In [6]:
# combining the 115th Congress dfs and removing duplicates
df115 = pd.concat([df115_1, df115_2], axis=0, sort=False)
df115 = df115.drop_duplicates()

# combining the 116th Congress dfs and removing duplicates
df116 = pd.concat([df116_1, df116_2, df116_3, df116_4], axis=0, sort=False)
df116 = df116.drop_duplicates()

In [7]:
# Find the shape of 115th congress tweet files for subdivision
df115.shape

(740479, 7)

In [8]:
740479/4

185119.75

In [9]:
185119+185119+185119

555357

In [10]:
# dividing the 115th congress dataframe into four files
# iloc slices exclude the stop index, so each slice must start where the previous one stopped
df115.iloc[0:185119,:].to_csv('../data/scrape/tweet_df_115_1.csv', index=False)
df115.iloc[185119:370238,:].to_csv('../data/scrape/tweet_df_115_2.csv', index=False)
df115.iloc[370238:555357,:].to_csv('../data/scrape/tweet_df_115_3.csv', index=False)
df115.iloc[555357:740479,:].to_csv('../data/scrape/tweet_df_115_4.csv', index=False)
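
The slice boundaries are multiples of the whole-number row count per file (185119 for four files), with the last file absorbing the remainder. As a minimal sketch, the same boundaries could be computed rather than typed by hand. The helper name `chunk_bounds` is hypothetical and not part of the original notebook; it returns half-open `(start, stop)` pairs that match how `iloc` slicing works.

```python
def chunk_bounds(n_rows, n_chunks):
    """Return half-open (start, stop) row ranges covering 0..n_rows.

    Hypothetical helper: every chunk holds n_rows // n_chunks rows,
    except the last one, which also takes the leftover rows.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size = n_rows // n_chunks
    bounds = []
    for i in range(n_chunks):
        start = i * size
        # The final chunk runs to the end so that no rows are left out.
        stop = n_rows if i == n_chunks - 1 else start + size
        bounds.append((start, stop))
    return bounds


# Row count taken from the df115.shape output above.
bounds_115 = chunk_bounds(740479, 4)
```

Each pair could then drive a loop such as `df115.iloc[start:stop].to_csv(...)`, numbering the output files with `enumerate(bounds_115, start=1)`.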

In [11]:
# Find the shape of 116th congress tweet files for subdivision
df116.shape

(1898432, 7)

In [12]:
1898432/10

189843.2

In [13]:
189843+189843+189843+189843+189843+189843+189843+189843+189843+189843

1898430

In [14]:
# dividing the 116th congress dataframe into ten files
# iloc slices exclude the stop index, so each slice must start where the previous one stopped
df116.iloc[0:189843,:].to_csv('../data/scrape/tweet_df_116_1.csv', index=False)
df116.iloc[189843:379686,:].to_csv('../data/scrape/tweet_df_116_2.csv', index=False)
df116.iloc[379686:569529,:].to_csv('../data/scrape/tweet_df_116_3.csv', index=False)
df116.iloc[569529:759372,:].to_csv('../data/scrape/tweet_df_116_4.csv', index=False)
df116.iloc[759372:949215,:].to_csv('../data/scrape/tweet_df_116_5.csv', index=False)
df116.iloc[949215:1139058,:].to_csv('../data/scrape/tweet_df_116_6.csv', index=False)
df116.iloc[1139058:1328901,:].to_csv('../data/scrape/tweet_df_116_7.csv', index=False)
df116.iloc[1328901:1518744,:].to_csv('../data/scrape/tweet_df_116_8.csv', index=False)
df116.iloc[1518744:1708587,:].to_csv('../data/scrape/tweet_df_116_9.csv', index=False)
df116.iloc[1708587:1898432,:].to_csv('../data/scrape/tweet_df_116_10.csv', index=False)
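
With this many hand-typed boundaries, an off-by-one slip can silently drop or duplicate rows. One way to guard against that is to check that the half-open ranges tile the full row range before writing any files. The function below is a sketch under that assumption; the name `check_tiling` is hypothetical and not part of the original notebook.

```python
def check_tiling(bounds, n_rows):
    """Describe gaps or overlaps in ordered half-open (start, stop) row ranges.

    Hypothetical helper: returns a list of messages, which is empty
    when the ranges cover rows 0..n_rows - 1 exactly once.
    """
    problems = []
    expected_start = 0  # first row not yet covered by an earlier range
    for start, stop in bounds:
        if start > expected_start:
            problems.append(f"rows {expected_start}..{start - 1} are skipped")
        elif start < expected_start:
            overlap_end = min(stop, expected_start)
            problems.append(f"rows {start}..{overlap_end - 1} are written twice")
        expected_start = max(expected_start, stop)
    if expected_start < n_rows:
        problems.append(f"rows {expected_start}..{n_rows - 1} are skipped")
    return problems
```

An empty result means every row lands in exactly one file. Otherwise each message names the affected row positions, using the same 0-based positions as `iloc`.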