# Basic Stats
This part of the notebook gives basic statistics for two datasets, `Twitter_Handles.csv` and `congress_cleaned_processed.pkl`. The dataset `Twitter_Handles.csv` is publicly available in the 'data/processed' folder of the GitHub repository.


`Exstract_And_Clean_data.ipynb`

https://nbviewer.jupyter.org/github/MikkelGroenning/social_graph/blob/main/Notebooks/Exstract_And_Clean_data.ipynb

This part of the notebook presents the basic statistics of the data. It is structured in two subparts, namely
* **Handle Data**
* **Congress Tweets Data**

The first part is devoted to describing the handle data.

## Package import
All needed packages are imported below

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns


import plotly.express as px
import cufflinks as cf
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
%matplotlib inline
init_notebook_mode(connected=True)
cf.go_offline()

## Handle Data

In [95]:
Handle_data = pd.read_csv('../data/processed/Twitter_Handles.csv')

color_dict_party = {
    "D" : "Democrat", 
    "R" : "Republican",
    "I" : "Independent",
    'L' : 'Libertarian'
}
Handle_data['Party'] = [color_dict_party[p] for p in Handle_data['Party'] ]


### Party distribution

A natural place to start is the distribution of parties across the 610 nodes. The counts are calculated below and shown as a bar plot. The party colors are used, with gray for independents.

In [51]:
party_distribution = Handle_data.groupby('Party').agg('count')[['Name']].reset_index()
party_distribution.rename(columns={'Name':'Count'},inplace=True)
party_distribution

| | Party | Count |
|---|---|---|
| 0 | Democrat | 310 |
| 1 | Independent | 2 |
| 2 | Libertarian | 1 |
| 3 | Republican | 305 |
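
The counting step above can also be written with `groupby(...).size()`, which counts rows per group directly instead of counting non-missing `Name` values. The following is a minimal sketch on a toy frame; the rows are made up for illustration and are not taken from `Twitter_Handles.csv`.

```python
import pandas as pd

# Toy stand-in for Handle_data (illustrative rows only, not the real data).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Party": ["Democrat", "Republican", "Democrat", "Independent"],
})

# size() counts the rows in each group; reset_index(name=...) names the count column.
counts = toy.groupby("Party").size().reset_index(name="Count")
print(counts)
```

Unlike `agg('count')`, `size()` also counts rows where `Name` is missing, so the two only agree when every row has a name.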


In [52]:
fig = px.bar(
    party_distribution,
    x="Party",
    y="Count",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    title = 'Distribution of Party sizes',
    text = 'Count'
)
fig.update_traces(texttemplate='%{text}', textposition='outside')
# Rotate labels 45 degrees
fig.update_layout(
    xaxis=dict(
        tickangle=-45,
    ),
    margin=dict(b=5,l=5,r=5,t=40),
    title_font_size=16,
)
fig
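
The colors in the plot above are assigned by position in `color_discrete_sequence`, so they depend on the order in which the parties appear in the data. One possible alternative, sketched here under the assumption that the party labels are exactly the four names shown in the table, is an explicit label-to-color dictionary. The name `PARTY_COLORS` and the helper `colors_for` are hypothetical; the hex codes are the ones used in the plot.

```python
# Hypothetical explicit party -> color mapping (hex codes copied from the plot above).
PARTY_COLORS = {
    "Democrat": "#0015BC",
    "Independent": "#7f7f7f",
    "Libertarian": "#FED105",
    "Republican": "#DE0100",
}

def colors_for(parties):
    """Look up the color of each party label, independent of row order."""
    return [PARTY_COLORS[p] for p in parties]

# The same party always gets the same color, whatever order the labels arrive in.
print(colors_for(["Republican", "Democrat"]))
```

Such a dictionary could be passed to `px.bar` through its `color_discrete_map` argument instead of `color_discrete_sequence`.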

In [None]:
Handle_data[~Handle_data.Party.isin(['Republican', 'Democrat'])]

An interesting name to notice is Bernie Sanders, who has run in the Democratic presidential primaries even though he is officially an independent senator for Vermont.

### Type distribution

Another interesting aggregation is based on the type (i.e. representative, senator or president (POTUS)). Below, the numbers are aggregated together with the party so the distribution can be shown:

In [93]:
party_type_distribution = Handle_data.groupby(['Party','Type']).agg('count')[['Name']].reset_index()
party_type_distribution.rename(columns={'Name':'Count'},inplace=True)

In [78]:
fig = px.bar(
    party_type_distribution,
    x="Type",
    y="Count",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    title = 'Distribution per party',
    text = 'Count'
)
fig.update_layout(
    margin=dict(b=5,l=5,r=5,t=40),
    title_font_size=16,
)
fig

It immediately becomes clear that representatives dominate the dataset with 505 members. There are only 111 senators, and naturally only one president, Donald J. Trump. Between the 115$^{\text{th}}$ and 116$^{\text{th}}$ Congress, all 435 seats in the House of Representatives were up for election, while only around a third of the 100 seats in the Senate were. Moreover, United States overseas territories also get seats in the House of Representatives (though without voting power). For these reasons it makes sense that representatives are so dominant.

The stacked bars in the plot also clearly illustrate that the Democrats and Republicans are by far the largest parties, and that they are fairly equal in size.

### Distribution of States
The last attribute to look at is the state. Let's first look at the number of different states in the dataset.


In [94]:
print(f'There are {Handle_data.State.nunique()} unique states in the data.')

There are 56 unique states in the data.


So there are 56. This is more than the 50 *usual* states that we hear about. This is because the District of Columbia and overseas territories such as American Samoa, the Northern Mariana Islands, Puerto Rico and the Virgin Islands also get seats in the House of Representatives. They do not have voting power, but they can participate in debates. Let's see the number of congress members in each state in the dataset (please note that the president does not have a state associated with him).
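
The president's missing state does not add to the count reported by `nunique()`, since pandas ignores missing values there by default. A small sketch with made-up rows illustrates this; the `Type` label `POTUS` and the two-letter state codes are assumptions about how the real file is encoded.

```python
import pandas as pd

# Toy stand-in for Handle_data (illustrative values only, not the real data).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Type": ["Senator", "Representative", "Representative", "POTUS"],
    "State": ["VT", "PR", "VT", None],  # the president has no state
})

n_states = toy["State"].nunique()                     # missing values are ignored
n_with_missing = toy["State"].nunique(dropna=False)   # missing counts as its own value
missing_rows = toy[toy["State"].isna()]               # rows without a state

print(n_states, n_with_missing, len(missing_rows))
```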

In [92]:
# Count the number of members per party and state
party_state_distribution = Handle_data.groupby(['Party', 'State']).agg('count')[['Name']].reset_index()
party_state_distribution.rename(columns={'Name':'Count'},inplace=True)

In [67]:
fig = px.bar(
    party_state_distribution,
    x="State",
    y="Count",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    title = 'Distribution per state and party',
)
fig.update_layout(
    xaxis=dict(
        tickangle=90,
        dtick = 1,
        showticklabels = True,
    ),
    margin=dict(b=5,l=5,r=5,t=40),
    title_font_size=16,
)
fig

The distribution is shown above. Remember that the data covers two congresses, so reelected politicians only count once; the plot therefore does not fully illustrate the political landscape. It does, however, give the exact distribution of the dataset used in the project. It also becomes very apparent how the sizes (based on members of Congress) vary between states. California is the largest by some distance, followed by Texas. Then there is another large jump down to Florida and New York with 32 each, and yet another down to Illinois and Pennsylvania with 23. Many states are found in the band with 11-14 members. The full distribution is shown as a histogram below:

In [91]:
fig = px.histogram(
    party_state_distribution.groupby('State').agg('sum').reset_index(),
    x="Count", 
    nbins=15,
    color_discrete_sequence=['#ff7f0e'],
    title='Histogram of state sizes based on number of congress members'
)
fig.update_layout(
    margin=dict(b=5,l=5,r=5,t=40),
    title_font_size=16,
)

This histogram makes it apparent that most state sizes are below 15, while California and Texas are more *edge cases*. These sizes are relevant to keep in mind if community detection is done based on states.
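
The per-state totals behind the histogram, and the share of states below a given size, could be computed as in the sketch below. The frame is a toy stand-in with invented counts, and the threshold of 15 is simply the value mentioned in the text.

```python
import pandas as pd

# Toy stand-in for party_state_distribution (invented counts, not the real data).
toy = pd.DataFrame({
    "Party": ["Democrat", "Republican", "Democrat", "Republican", "Democrat"],
    "State": ["CA", "CA", "TX", "TX", "VT"],
    "Count": [50, 10, 15, 30, 2],
})

# Sum the party counts per state and sort from largest to smallest.
state_sizes = toy.groupby("State")["Count"].sum().sort_values(ascending=False)

# Fraction of states with fewer than 15 members.
small_share = (state_sizes < 15).mean()

print(state_sizes)
print(small_share)
```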

## Congress Tweets Data

In [96]:
df_congress = pd.read_pickle('../data/processed/congress_cleaned_processed.pkl')
df_congress = pd.merge(df_congress, Handle_data, how='left',left_on='user_name', right_on='twitter_display_name')
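
A left merge keeps tweets whose `user_name` has no match in the handle data, leaving `Name`, `Party` and `Type` missing for those rows; a later `groupby` on those columns drops such rows by default. One way the match rate could be checked is the `indicator` option of `pd.merge`, sketched here on made-up rows (the frames `tweets` and `handles` are hypothetical stand-ins for the real data).

```python
import pandas as pd

# Toy stand-ins for the tweet data and the handle data (illustrative rows only).
tweets = pd.DataFrame({
    "user_name": ["alice", "bob", "carol"],
    "text": ["t1", "t2", "t3"],
})
handles = pd.DataFrame({
    "twitter_display_name": ["alice", "bob"],
    "Name": ["Alice A", "Bob B"],
    "Party": ["D", "R"],
})

# indicator=True adds a '_merge' column saying where each row came from.
merged = pd.merge(
    tweets, handles,
    how="left",
    left_on="user_name",
    right_on="twitter_display_name",
    indicator=True,
)

# Rows marked 'left_only' are tweets without a matching handle.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), unmatched["user_name"].tolist())
```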

In [97]:
tweet_counts = df_congress.groupby(['Name', 'Party', 'Type']).agg('count')[['text']].reset_index()
tweet_counts.rename(columns={"text": "Total Tweets"}, inplace=True)
tweet_counts.head()

| | Name | Party | Type | Total Tweets |
|---|---|---|---|---|
| 0 | Abby Finkenauer | D | Representative | 2193 |
| 1 | Abigail Spanberger | D | Representative | 2229 |
| 2 | Adam Kinzinger | R | Representative | 3182 |
| 3 | Adam Schiff | D | Representative | 4103 |
| 4 | Adam Smith | D | Representative | 2638 |


### Total number of tweets per party

In [98]:
# Sum only the tweet counts; the Name column is not numeric
tweet_count_party = tweet_counts.groupby(['Party', 'Type'])[['Total Tweets']].sum().reset_index()
tweet_count_party['Party and Type'] = tweet_count_party['Party'] + ' ' + tweet_count_party['Type']

In [99]:
tweet_count_party

| | Party | Type | Total Tweets | Party and Type |
|---|---|---|---|---|
| 0 | D | Representative | 754504 | D Representative |
| 1 | D | Senator | 281491 | D Senator |
| 2 | I | Senator | 17275 | I Senator |
| 3 | L | Representative | 4926 | L Representative |
| 4 | R | Representative | 402410 | R Representative |
| 5 | R | Senator | 196327 | R Senator |


In [100]:
fig = px.bar(
    tweet_count_party,
    x="Party and Type",
    y="Total Tweets",
    color='Party',
    color_discrete_sequence=["#0015BC", "#7f7f7f", "#FED105", "#DE0100"],
    text='Total Tweets',
    title = 'Total number of tweets per group'
)
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
# Rotate labels 45 degrees
fig.update_layout(
    xaxis=dict(
        tickangle=-45,
    ),
    yaxis=dict(
        range=[0, 8.5e5]
    ),
    margin=dict(b=5,l=5,r=5,t=40),
    title_font_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_barplot.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig.show()

### Distribution of tweets

In [101]:
df_congress = pd.read_pickle('../data/processed/congress_cleaned_processed.pkl')

In [103]:
fig = px.histogram(
    tweet_counts, 
    x="Total Tweets", 
    title='(Stacked) Distribution of the total number of tweets posted per politician',
    color='Party',
    color_discrete_sequence=["#0015BC", "#DE0100", "#7f7f7f", "#FED105"]
)
fig.update_layout(
    margin=dict(b=5,l=5,r=5,t=40),
    title_font_size=16,
)
# save for website
fig.write_html(
    file = "../web_app/plotly_files/tweet_distribution.html", 
    full_html = False,
    include_plotlyjs='cdn'
)
fig