# Basic Statistics
In this part of the notebook, basic statistics are given for two datasets: `Twitter_Handles.csv` and `congress_cleaned_processed.pkl`. The dataset `Twitter_Handles.csv` is publicly available in the 'data/processed' folder of the GitHub repository, whereas `congress_cleaned_processed.pkl` cannot be shared under Twitter’s Developer Policy. However, the notebook `Exstract_And_Clean_data.ipynb` shows how the dataset was created; it can be accessed quickly via the following link:
https://nbviewer.jupyter.org/github/MikkelGroenning/social_graph/blob/main/Notebooks/Exstract_And_Clean_data.ipynb
The notebook `Exstract_And_Clean_data.ipynb` illustrates in great detail the process of extracting the tweets so that the data can be re-created.

The Basic Statistics section of this notebook is structured in two subparts:
* **Handle Data** 
* **Congress Tweets Data**

The first part is devoted to describing the `Twitter_Handles.csv` dataset, whereas the second part describes the `congress_cleaned_processed.pkl` dataset.

## Package import
All needed packages are imported below

In [27]:
import json  # needed later to read the state-abbreviation mapping
import pandas as pd
import numpy as np
import seaborn as sns


import plotly.express as px
import cufflinks as cf
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
%matplotlib inline
init_notebook_mode(connected=True)
cf.go_offline()

## Handle Data
The Twitter handle data is loaded in the cell below. For every politician in the dataset it gives the State, Party, Type (Senator, Representative, or President (POTUS)), Twitter handle, name, and the name displayed on Twitter.

In [54]:
Handle_data = pd.read_csv('../data/processed/Twitter_Handles.csv')

color_dict_party = {
    "D" : "Democrat", 
    "R" : "Republican",
    "I" : "Independent",
    'L' : 'Libertarian'
}
Handle_data['Party'] = [color_dict_party[p] for p in Handle_data['Party'] ]
Handle_data

Unnamed: 0.1,Unnamed: 0,State,Party,Type,Twitter,Name,twitter_display_name
0,0,AZ,Republican,Senator,JeffFlake,Jeff Flake,Jeff Flake
1,1,AZ,Republican,Senator,SenJonKyl,Jon Kyl,U.S. Senator Jon Kyl
2,2,CA,Democrat,Representative,reppeteaguilar,Peter Aguilar,Rep. Pete Aguilar
3,3,CA,Democrat,Representative,repcardenas,Tony Cardenas,Rep. Tony Cárdenas
4,4,CA,Republican,Representative,DarrellIssa,Darrell Issa,Darrell Issa
...,...,...,...,...,...,...,...
613,613,WI,Republican,Representative,MikeforWI,Mike Gallagher,Mike Gallagher
614,614,WY,Republican,Senator,SenatorEnzi,Mike Enzi,Mike Enzi
615,615,WY,Republican,Senator,SenJohnBarrasso,John Barrasso,Sen. John Barrasso
616,616,WY,Republican,Representative,Liz_Cheney,Liz Cheney,Liz Cheney
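
The list comprehension used for the party labels raises a `KeyError` if a party code is missing from the dictionary. As a minimal sketch on made-up rows (the `codes` series and the `party_names` dictionary below are illustrative, not the project's data), `Series.map` performs the same relabelling and leaves unknown codes as `NaN`, which makes them easy to list:

```python
import pandas as pd

# Hypothetical party codes; "X" is deliberately not in the lookup table.
codes = pd.Series(["D", "R", "I", "L", "X"])

party_names = {
    "D": "Democrat",
    "R": "Republican",
    "I": "Independent",
    "L": "Libertarian",
}

# map() returns NaN for codes without an entry instead of raising KeyError.
labels = codes.map(party_names)

# Collect the codes that could not be translated so they can be inspected.
unmapped = sorted(codes[labels.isna()].unique())

print(labels.tolist())
print(unmapped)
```

Whether a silent `NaN` or a loud `KeyError` is preferable depends on how much the input is trusted; for a hand-curated handle list, either is defensible.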


### Party distribution

A natural place to start is the distribution of parties among the 618 politicians. The counts are calculated below and shown as a bar plot. The usual party colors are used, with gray for Independents.

In [29]:
party_distribution = Handle_data.groupby('Party').agg('count')[['Name']].reset_index()
party_distribution.rename(columns={'Name':'Count'},inplace=True)
party_distribution

Unnamed: 0,Party,Count
0,Democrat,310
1,Independent,2
2,Libertarian,1
3,Republican,305
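
Note that `agg('count')` counts the non-missing values in each column, so the numbers above are strictly the number of non-missing `Name` entries per party. A small sketch, assuming a toy frame with the same `Party` and `Name` columns, shows how `groupby(...).size()` counts rows instead and where the two can differ:

```python
import pandas as pd

# Toy stand-in for the handle table; one Name is missing on purpose.
toy = pd.DataFrame({
    "Party": ["Democrat", "Democrat", "Republican", "Independent"],
    "Name": ["A", None, "C", "D"],
})

# count() ignores rows where Name is missing ...
by_count = toy.groupby("Party")["Name"].count()

# ... while size() counts every row in the group.
by_size = toy.groupby("Party").size()

print(by_count.to_dict())
print(by_size.to_dict())
```

If every politician has a name, both approaches give the same table.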


In [49]:
fig = px.bar(
    party_distribution,
    x="Party",
    y="Count",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    title = 'Distribution of Party sizes',
    text = 'Count'
)
fig.update_traces(texttemplate='%{text}', textposition='outside')
fig.update_layout(
    xaxis=dict(
        tickangle=-45,
    ),
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_barplot_parties.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig
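
The charts in this section pass colors as a positional list, so the color a party receives depends on the order in which the parties appear in the data. One possible way to make the pairing explicit is a single party-to-color dictionary. The sketch below is an assumption about how that could look: the pairing of hex codes and parties is inferred from the alphabetical order of the grouped table, and `PARTY_COLORS` and `colors_in_order` are hypothetical names. Such a dictionary can also be handed to Plotly Express through its `color_discrete_map` argument.

```python
# Hex codes taken from the plots in this notebook; the party pairing is inferred.
PARTY_COLORS = {
    "Democrat": "#0015BC",
    "Independent": "#7f7f7f",
    "Libertarian": "#FED105",
    "Republican": "#DE0100",
}


def colors_in_order(parties):
    """Return one color per distinct party, in order of first appearance."""
    seen = dict.fromkeys(parties)  # keeps insertion order, drops duplicates
    return [PARTY_COLORS[party] for party in seen]


# Grouped tables list the parties alphabetically ...
alphabetical = colors_in_order(["Democrat", "Independent", "Libertarian", "Republican"])
# ... while a table sorted by politician name may reach a Republican earlier.
by_name = colors_in_order(["Democrat", "Democrat", "Republican", "Independent", "Libertarian"])

print(alphabetical)
print(by_name)
```

The two printed lists differ only in order, which is exactly the situation where a positional color list has to be rearranged by hand.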

In [31]:
Handle_data[~Handle_data.Party.isin(['Republican', 'Democrat'])]

Unnamed: 0.1,Unnamed: 0,State,Party,Type,Twitter,Name,twitter_display_name
300,300,ME,Independent,Senator,SenAngusKing,Angus King,Senator Angus King
329,329,MI,Libertarian,Representative,justinamash,Justin Amash,Justin Amash
571,571,VT,Independent,Senator,SenSanders,Bernie Sanders,Bernie Sanders


An interesting name to notice is Bernie Sanders, who has run in the Democratic presidential primaries even though he is officially an Independent senator for Vermont.

### Type distribution

Another interesting aggregation is based on the type (i.e. Representative, Senator, or President (POTUS)). Below, the counts are aggregated by party and type so that the distribution can be shown.

In [32]:
party_type_distribution = Handle_data.groupby(['Party','Type']).agg('count')[['Name']].reset_index()
party_type_distribution.rename(columns={'Name':'Count'},inplace=True)

In [50]:
fig = px.bar(
    party_type_distribution,
    x="Type",
    y="Count",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    title = 'Distribution per party',
    text = 'Count'
)
fig.update_layout(
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_parties_2_barplot.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig

It immediately becomes clear that Representatives dominate the dataset: the two major parties alone account for 503 Representatives (244 Republicans and 259 Democrats).
By comparison, there are only 111 Republican and Democratic Senators, and naturally only one President, Donald J. Trump.
Between the 115th and 116th Congress, all 435 seats in the House of Representatives were up for election, while only around a third of the 100 seats in the Senate were. That means some members were not re-elected but still appear in the dataset.
Moreover, United States overseas territories also get seats in the House of Representatives (though without voting power). For these reasons it makes sense that Representatives are so dominant.
It also explains why the numbers of Representatives and Senators exceed 435 and 100, respectively, which are the numbers of seats in the House of Representatives and the Senate.
The stacked bars also clearly illustrate the two-party system: Democrats and Republicans dominate the political landscape but are fairly equal in size.
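
The group sizes quoted here can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers that appear in the text and tables of this section; that the quoted Representative and Senator counts cover Republicans and Democrats only is an inference from the party totals rather than something stated directly.

```python
# Counts quoted in the text (assumed to cover the two major parties only).
republican_reps, democratic_reps = 244, 259
two_party_senators = 111
presidents = 1

# Members outside the two major parties, from the table of non-major-party rows:
# two Independent Senators and one Libertarian Representative.
other_senators, other_reps = 2, 1

representatives = republican_reps + democratic_reps + other_reps
senators = two_party_senators + other_senators
total = representatives + senators + presidents

# Party totals from the party distribution table.
party_totals = {"Democrat": 310, "Independent": 2, "Libertarian": 1, "Republican": 305}

print(representatives, senators, total)
print(total == sum(party_totals.values()))
# Surplus of profiles over seats in the House (435) and the Senate (100).
print(representatives - 435, senators - 100)
```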

### Distribution of States
Another interesting feature is the state representation. Let's first look at the number of different states in the dataset.

In [55]:
print(f'There are {Handle_data.State.nunique()} unique states in the data.')

There are 56 unique states in the data.


So there are 56. This is more than the 50 *usual* states that we hear about. This is because jurisdictions such as American Samoa, the District of Columbia, the Northern Mariana Islands, Puerto Rico and the Virgin Islands also get seats in the House of Representatives. Their delegates do not have voting power, but they can participate in debates. Let's see the number of Congress members in each state in the dataset (please note that the President does not have a state associated with him).

In [62]:
# Load the mapping between state names and abbreviations
with open('../data/processed/us_state_abbrev.json') as json_file:
    us_state_abbrev = json.load(json_file)
# flip state dict
us_state_abbrev = {value:key for key, value in us_state_abbrev.items()}

party_state_distribution = Handle_data.groupby(['Party', 'State']).agg('count')[['Name']].reset_index()
party_state_distribution.rename(columns={'Name':'Count'},inplace=True)
party_state_distribution['State'] = [us_state_abbrev[state_abrreviation] for state_abrreviation in party_state_distribution['State']]
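
The dictionary comprehension in the cell above inverts the mapping so that abbreviations become keys. A minimal sketch with a three-entry toy mapping (the real JSON file is assumed to map full names to abbreviations, which would explain why it has to be flipped) also shows a guard worth considering, since inverting silently drops entries when two names share an abbreviation:

```python
# Toy stand-in for us_state_abbrev.json (assumed layout: name -> abbreviation).
name_to_abbrev = {"California": "CA", "Texas": "TX", "Puerto Rico": "PR"}

# Flip keys and values so rows can be looked up by abbreviation.
abbrev_to_name = {abbrev: name for name, abbrev in name_to_abbrev.items()}

# If two names shared an abbreviation, the flipped dict would be shorter.
assert len(abbrev_to_name) == len(name_to_abbrev), "duplicate abbreviations"

# .get() keeps an unknown abbreviation as-is instead of raising KeyError.
states = ["CA", "PR", "ZZ"]
full_names = [abbrev_to_name.get(abbrev, abbrev) for abbrev in states]
print(full_names)
```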

In [65]:
fig = px.bar(
    party_state_distribution,
    x="State",
    y="Count",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    title = 'Distribution per party',
)
fig.update_layout(
    xaxis=dict(
        tickangle=90,
        dtick = 1,
        showticklabels = True,
    ),
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
fig.write_html(
    file = "../web_app/plotly_files/tweet_parties_3_barplot.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig

The distribution is shown above. Remember that the data covers two Congresses, so re-elected politicians count only once; thus the plot does not fully illustrate the political landscape. It does, however, give the exact distribution of the dataset used in the project. It also becomes very apparent how the sizes (based on members of Congress) vary between states. California is the largest by some distance, followed by Texas. Then there is a large jump down to Florida and New York with 32 each, and another large jump down to Illinois and Pennsylvania with 23. Many states fall in the band of 11-14 members. The full distribution is shown as a histogram below:
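
The ranking of states comes from summing the per-party counts within each state. A small sketch, assuming a toy table with the same `Party`, `State` and `Count` columns and made-up state names and counts, shows that step:

```python
import pandas as pd

# Made-up per-party counts for three hypothetical states.
toy = pd.DataFrame({
    "Party": ["Democrat", "Republican", "Democrat", "Republican", "Republican"],
    "State": ["Alpha", "Alpha", "Beta", "Beta", "Gamma"],
    "Count": [20, 10, 5, 7, 3],
})

# Sum over parties, then order states from largest to smallest delegation.
state_sizes = (
    toy.groupby("State", as_index=False)["Count"]
    .sum()
    .sort_values("Count", ascending=False, ignore_index=True)
)
print(state_sizes)
```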

In [53]:
fig = px.histogram(
    # Sum only the numeric Count column so the Party strings are not aggregated
    party_state_distribution.groupby('State', as_index=False)['Count'].sum(),
    x="Count", 
    nbins=15,
    color_discrete_sequence=['#ff7f0e'],
    title='Histogram of state sizes based on number of congress members'
)
fig.update_layout(
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
fig.write_html(
    file = "../web_app/plotly_files/tweet_parties_histogram.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig

From this histogram it really becomes apparent that most state sizes are below 15, while California and Texas are more like *edge cases*. These sizes are relevant to keep in mind if community detection is done based on states.
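
The statement that most delegations are small can be quantified directly. The sketch below uses made-up state sizes (the names in `sizes` are placeholders, and the threshold of 15 simply mirrors the text) to compute the share of states below the threshold and to list the outliers:

```python
# Hypothetical number of members per state; not the project's data.
sizes = {"Alpha": 60, "Beta": 45, "Gamma": 14, "Delta": 9, "Epsilon": 4, "Zeta": 2}

threshold = 15

# States with fewer members than the threshold.
small = [state for state, n in sizes.items() if n < threshold]
share_small = len(small) / len(sizes)

# States at or above the threshold, largest first.
large = sorted(
    (state for state, n in sizes.items() if n >= threshold),
    key=sizes.get,
    reverse=True,
)

print(f"{share_small:.0%} of states have fewer than {threshold} members")
print(large)
```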

## Congress Tweets Data
In this part the basic statistics of the tweets extracted from the politicians are presented. In the cell below, `congress_cleaned_processed.pkl` is loaded and merged with the Twitter handles so that the statistics can be computed at party level.

In [43]:
df_congress = pd.read_pickle('../data/processed/congress_cleaned_processed.pkl')
df_congress = pd.merge(df_congress, Handle_data, how='left',left_on='user_name', right_on='twitter_display_name')
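
Because the join key is the display name, a tweet whose `user_name` has no exact match in the handle table keeps its row but gets missing party information. A hedged sketch on toy frames (the rows are invented; only the column names follow the notebook) shows how `indicator=True` exposes such rows and how `validate` guards against duplicated display names:

```python
import pandas as pd

# Invented tweets; "Rep. Unknown" has no counterpart in the handle table.
tweets = pd.DataFrame({
    "user_name": ["Rep. A", "Rep. A", "Sen. B", "Rep. Unknown"],
    "text": ["t1", "t2", "t3", "t4"],
})
handles = pd.DataFrame({
    "twitter_display_name": ["Rep. A", "Sen. B"],
    "Party": ["Democrat", "Republican"],
})

merged = pd.merge(
    tweets,
    handles,
    how="left",
    left_on="user_name",
    right_on="twitter_display_name",
    indicator=True,          # adds a _merge column: "both" or "left_only"
    validate="many_to_one",  # raises if a display name occurs twice in handles
)

# Tweets that found no politician in the handle table.
unmatched = merged.loc[merged["_merge"] == "left_only", "user_name"].unique().tolist()
print(unmatched)
```

Rows with missing `Party` are dropped by the later `groupby`, so checking for them before aggregating avoids losing tweets unnoticed.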

The cell below extracts the total number of tweets collected for every politician.

In [44]:
tweet_counts = df_congress.groupby(['Name', 'Party', 'Type']).agg('count')[['text']].reset_index()
tweet_counts.rename(columns={"text": "Total Tweets"}, inplace=True)
tweet_counts.head()

Unnamed: 0,Name,Party,Type,Total Tweets
0,Abby Finkenauer,Democrat,Representative,2193
1,Abigail Spanberger,Democrat,Representative,2229
2,Adam Kinzinger,Republican,Representative,3182
3,Adam Schiff,Democrat,Representative,4103
4,Adam Smith,Democrat,Representative,2638


### Total number of tweets per party
Let's first look at the number of tweets by party and type in the dataset.

In [45]:
# Sum only the tweet counts so the Name strings are not aggregated
tweet_count_party = tweet_counts.groupby(['Party', 'Type'], as_index=False)['Total Tweets'].sum()
tweet_count_party['Party and Type'] = tweet_count_party['Party'] + ' ' + tweet_count_party['Type']

In [46]:
fig = px.bar(
    tweet_count_party,
    x="Party and Type",
    y="Total Tweets",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    text='Total Tweets',
    title = 'Total number of tweets per group'
)
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
# Rotate labels 45 degrees
fig.update_layout(
    xaxis=dict(
        tickangle=-45,
    ),
    yaxis=dict(
        range=[0, 8.5e5]
    ),
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_barplot.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig.show()

From the bar plot, it can be seen that Democrats tweet much more than their Republican colleagues, even though the two parties have close to equal numbers of profiles. It is also interesting that Senators tweet disproportionately more than their colleagues in the House of Representatives, particularly among the Republicans. Recall that there is data for
* 244 Republican Representatives
* 259 Democratic Representatives
* 60 Republican Senators
* 51 Democratic Senators
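
Whether a group tweets disproportionately much is easiest to judge per member. The sketch below divides each group's total by its size; the group sizes are the ones listed above, while the tweet totals are placeholders chosen only to make the arithmetic visible and are not the project's results.

```python
# Group sizes as listed above.
members = {
    "Republican Representative": 244,
    "Democrat Representative": 259,
    "Republican Senator": 60,
    "Democrat Senator": 51,
}

# Placeholder totals, NOT taken from the dataset.
total_tweets = {
    "Republican Representative": 244_000,
    "Democrat Representative": 518_000,
    "Republican Senator": 120_000,
    "Democrat Senator": 153_000,
}

# Average number of tweets per member in each group.
tweets_per_member = {
    group: total_tweets[group] / members[group] for group in members
}

for group, average in tweets_per_member.items():
    print(f"{group}: {average:.0f} tweets per member")
```

With the real totals from `tweet_count_party` in place of the placeholders, the same division would show how much of the gap between groups is explained by group size alone.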

### Distribution of tweets
Let's explore the distribution of the number of tweets posted per politician.

In [48]:
fig = px.histogram(
    tweet_counts, 
    x="Total Tweets", 
    title='Stacked histogram of number of total tweets posted per politician',
    color='Party',
    color_discrete_sequence=["#0015BC", "#DE0100", "#7f7f7f", "#FED105"]
)
fig.update_layout(
    margin=dict(b=5,l=5,r=5,t=40),
    titlefont_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_distribution.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig

The stacked histogram shows a right-skewed distribution. Most politicians have tweeted fewer than 5000 times, but a few are extremely active on the platform.
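
A right-skewed distribution shows up numerically as a mean above the median. A short sketch with invented tweet counts (the values in `tweet_totals` are illustrative, and the cut-off of 5000 mirrors the text) computes both, together with the share of accounts below the cut-off:

```python
from statistics import mean, median

# Invented tweet counts per politician; two accounts are far more active.
tweet_totals = [800, 1500, 2200, 2600, 3100, 3300, 4100, 4800, 12000, 25600]

cutoff = 5000
share_below = sum(n < cutoff for n in tweet_totals) / len(tweet_totals)

avg = mean(tweet_totals)
mid = median(tweet_totals)

# For a right-skewed sample the mean is pulled above the median.
print(f"mean={avg:.0f}, median={mid:.0f}, share below {cutoff}: {share_below:.0%}")
```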