# Network Analysis in the Flint Dataset

## Flint FOIA Dataset
The data for this notebook is taken from a collection of scanned email documents relating to the Flint Water Crisis. They were released to the public in July 2016 and can be found [here](https://archive.org/details/snyder_flint_emails/DHHS1/). They are an invaluable resource for researchers but remain difficult to use because of the format in which they were released. For this example, we'll be looking at just a subset of the data. This network has 1717 edges, while the whole dataset has *467224* edges!

## Named Entity Recognition
As part of the larger project surrounding these emails, we needed the metadata of each email. This includes the:
* From line
* Date (called Sent) line
* To line
* Cc line and
* Subject line

We trained a language model to extract these fields, and once we had them, we could begin some network analysis to understand the relative importance of actors across the dataset.
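To make the target of the extraction concrete, here is a minimal sketch of what the five metadata fields look like once pulled out of an email. It uses a plain regular expression rather than the trained language model described above, so it is only a stand-in that would likely struggle on noisy scans. The sample email, the names in it, and the helpers `parse_headers` and `split_recipients` are all hypothetical.

```python
import re

# Hypothetical email text; the people and content are made up for illustration.
SAMPLE_EMAIL = """From: Doe, Jane (GOV)
Sent: Tuesday, January 5, 2016 9:14 AM
To: Roe, Richard (DEQ); Poe, Pat (DHHS)
Cc: Moe, Max (GOV)
Subject: Water sampling update

Please see the attached results.
"""

# One alternative per header field listed above, matched at the start of a line.
HEADER_RE = re.compile(r"^(From|Sent|To|Cc|Subject):[ \t]*(.*)$", re.MULTILINE)


def parse_headers(text):
    """Return a dict mapping lowercase header names to their raw values."""
    headers = {}
    for name, value in HEADER_RE.findall(text):
        # Keep the first occurrence so quoted replies do not overwrite it.
        headers.setdefault(name.lower(), value.strip())
    return headers


def split_recipients(value):
    """Split a semicolon-separated recipient line into individual names."""
    return [part.strip() for part in value.split(";") if part.strip()]


headers = parse_headers(SAMPLE_EMAIL)
recipients = split_recipients(headers["to"])
print(headers["from"], "->", recipients)
```

Splitting on semicolons is an assumption about how recipients are separated; names in this style contain commas, so a comma split would break them apart.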

## Network Analysis
For both data and social scientists, a network is an arrangement of relationships into nodes and edges. In the Flint dataset, we decided that each node would represent a person or actor in the emails and an edge would represent whether a certain actor emailed another actor. We could then use the software package `networkx` to collect statistics on these actors and their relationships.
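As a rough sketch of how sender and recipient metadata could be turned into such a network, the example below builds a small graph with `networkx`. The `emails` records and the names in them are hypothetical. The graph is undirected on the assumption that the real graph is too, since the notebook later calls `nx.connected_components`, which is only defined for undirected graphs.

```python
import networkx as nx

# Hypothetical (sender, recipients) records; the real notebook loads a pickled graph.
emails = [
    ("Doe, Jane", ["Roe, Richard", "Poe, Pat"]),
    ("Roe, Richard", ["Doe, Jane"]),
    ("Poe, Pat", ["Moe, Max"]),
]

toy_graph = nx.Graph()
for sender, recipients in emails:
    for recipient in recipients:
        # add_edge creates missing nodes and ignores edges that already exist.
        toy_graph.add_edge(sender, recipient)

print(toy_graph.number_of_nodes(), "nodes,", toy_graph.number_of_edges(), "edges")
```

Because the graph is undirected, the reply from Roe to Doe does not add a second edge: the edge only records that the two actors corresponded.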

### Centrality
When analyzing a network, researchers look at a variety of metrics. We would like to determine which actors are the most important for the functioning of the network. **Centrality** is one measure of this node importance that we'll explore more deeply in this notebook. There are several ways of calculating centrality, but all of the methods we'll look at attempt to understand how much the graph would be changed or impacted if a certain actor were removed. If the impact is low, then we can infer that the actor's importance is low, but if the impact is high, then we can infer that the actor's importance is high.
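The removal intuition can be illustrated directly on a toy graph. The sketch below deletes each node in turn and counts how many disconnected pieces remain. This is only an illustration of the idea, not how `networkx` computes any centrality score, and the graph and the helper `components_after_removal` are hypothetical.

```python
import networkx as nx

# Hypothetical toy network: A is a hub, D bridges A and E.
toy = nx.Graph()
toy.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])


def components_after_removal(graph, node):
    """Count connected components left after deleting one node."""
    remaining = graph.copy()  # work on a copy so the original stays intact
    remaining.remove_node(node)
    return nx.number_connected_components(remaining)


impact = {node: components_after_removal(toy, node) for node in toy.nodes}
for node, pieces in sorted(impact.items(), key=lambda item: item[1], reverse=True):
    print(node, pieces)
```

Removing the hub `A` splits the network into three pieces and removing the bridge `D` splits it into two, while removing any of the others leaves it connected.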

We'll look at three different, but similar, measures:
* Degree centrality: The number of edges attached to a node, scaled by the maximum possible degree (the number of other nodes in the network).
* Betweenness centrality: The fraction of shortest paths between other pairs of nodes that pass through a node. Nodes with high betweenness are the ones other nodes would need to pass through to reach the other side of the network.
* Pagerank: How often a random walk through the network would visit a node.
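The three measures above can be compared on a small example. The sketch below uses the same `networkx` functions this notebook calls later, applied to a hypothetical five-node graph. Note that recent NetworkX releases rely on SciPy for `nx.pagerank`, so it is assumed to be installed.

```python
import networkx as nx

# Hypothetical toy network: A is a hub, D bridges A and E.
toy = nx.Graph()
toy.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")])

# Degree of each node divided by the number of other nodes (n - 1).
degree = nx.degree_centrality(toy)

# Share of shortest paths between other node pairs that pass through each node.
betweenness = nx.betweenness_centrality(toy)

# Long-run visit frequency of a random walk (with damping); scores sum to 1.
pagerank = nx.pagerank(toy)

for name, scores in [("degree", degree), ("betweenness", betweenness), ("pagerank", pagerank)]:
    top = max(scores, key=scores.get)
    print(f"{name}: top node is {top} with score {scores[top]:.3f}")
```

On this graph all three measures agree that `A` is the most central node, but they separate the rest differently: `B`, `C`, and `E` have zero betweenness because no shortest path runs through them, yet they still receive nonzero degree and Pagerank scores.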

## Using this notebook

Below, you can explore the interactive visualization of the graph. Try to guess the central nodes qualitatively, then compare your guesses to the results of the three different measures above. Finally, you can select a person of interest from any of the measures and search their correspondence.

In [1]:
!pip install plotly whoosh -q
!wget https://tufts.box.com/shared/static/31euwdc8un37w7fxjc39aarktulv2ih1.pkl -O flint_graph.pkl

In [43]:
import plotly.graph_objects as go
from whoosh.qparser import QueryParser
from IPython.display import display, HTML, clear_output
import re
from whoosh import index
import os
from whoosh.fields import Schema
import pickle
import networkx as nx
import ipywidgets as widgets

def plot_graph(G):
    edge_x = []
    edge_y = []
    for edge in G.edges():
        x0, y0 = G.nodes[edge[0]]['pos']
        x1, y1 = G.nodes[edge[1]]['pos']
        edge_x.append(x0)
        edge_x.append(x1)
        edge_x.append(None)
        edge_y.append(y0)
        edge_y.append(y1)
        edge_y.append(None)

    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=0.5, color='#888'),
        hoverinfo='none',
        mode='lines')

    node_x = []
    node_y = []
    for node in G.nodes():
        x, y = G.nodes[node]['pos']
        node_x.append(x)
        node_y.append(y)

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers',
        hoverinfo='text',
        marker=dict(
            showscale=True,
            colorscale='YlGnBu',
            reversescale=True,
            color=[],
            size=10,
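            colorbar=dict(
                thickness=15,
                # `titleside` is deprecated and removed in newer Plotly releases;
                # the side is set inside the nested `title` dict instead.
                title=dict(text='Node Connections', side='right'),
                xanchor='left'
            ),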
            line_width=2))
    
    node_adjacencies = []
    node_text = []
    # G.adjacency() yields (node, neighbor_dict) pairs in the same order as G.nodes().
    for node, neighbors in G.adjacency():
        node_adjacencies.append(len(neighbors))
        node_text.append(f'<i>{node}</i>: # of connections: {len(neighbors)}')

    node_trace.marker.color = node_adjacencies
    node_trace.text = node_text
    
    fig = go.Figure(data=[edge_trace, node_trace],
                 layout=go.Layout(
                    # `titlefont_size` is deprecated; set the font inside the `title` dict.
                    title=dict(text='<br>Flint Network Visualization', font=dict(size=16)),
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20,l=5,r=5,t=40),
                    annotations=[ dict(
                        text="",
                        showarrow=False,
                        xref="paper", yref="paper",
                        x=0.005, y=-0.002 ) ],
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                    )
    fig.show()
    

def sort_dict(_dict, k=10):
    return sorted(_dict.items(), key=lambda x: x[1], reverse=True)[:k]

def on_button_click(b):
    query_str = search_bar.value
    poi = poi_select.value
    
    html_template = """
    <p>{hit}</p>
    <hr/>
    """.strip()

    qp = QueryParser("body", schema=ix.schema)
    q = qp.parse(query_str)

    with output:
        clear_output()
    
    with output:
        with ix.searcher() as searcher:
            results = searcher.search(q, limit=None)
            results.fragmenter.maxchars = 1500
            results.fragmenter.surround = 350
            if len(results) == 0:
                display(HTML("<p>No results found.</p>"))
            for i, hit in enumerate(results):
                if poi == hit['from_name']:
                    # Strip square brackets from the subject line.
                    subject = re.sub(r'[\[\]]', '', hit['subject'])
                    display(HTML(f"<h4>{subject}</h4>"))
                    # Raw string avoids invalid-escape warnings; split the body at inline ellipses.
                    r = re.split(r'\w\.\.\.\w', hit["body"].replace("\n\n", ""))
                    for h in r:
                        display(HTML(html_template.format(hit=h)))


ix = index.open_dir("staff14_index")

with open('flint_graph.pkl', 'rb') as f:
    G = pickle.load(f)

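# Keep a single connected component so every node in the plot is reachable.
# Note: this takes the first component networkx returns, which is not
# guaranteed to be the largest one.
connected_components = list(nx.connected_components(G))
connected_subgraphs = [G.subgraph(c) for c in connected_components]
G = connected_subgraphs[0]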

plot_graph(G)

cent_measures = {"Degree Centrality":nx.degree_centrality, "Betweenness Centrality":nx.betweenness_centrality, "Pagerank":nx.pagerank}
display(HTML("<h2>Centrality Measures</h2>"))

pois = []
for m in cent_measures:
    display(HTML(f"<h3>{m}</h3>"))
    res = sort_dict(cent_measures[m](G))
    for i, tup in enumerate(res):
        # The same person can rank highly under several measures; list them once.
        if tup[0] not in pois:
            pois.append(tup[0])
        display(HTML(f"<p>{i+1}: {tup[0]}, importance score: {tup[1]}</p>"))

In [44]:
search_bar = widgets.Text(
    value='',
    placeholder='Search anything',
    disabled=False   
)

poi_select = widgets.Select(
    options=pois,
    description="People of Interest"
)

button = widgets.Button(description="Search")
button.on_click(on_button_click)
output = widgets.Output()

In [45]:
display(poi_select)
display(search_bar)
display(button)
display(output)

Select(description='People of Interest', options=('Baird, Richard (GOV)', 'Muchmore, Dennis (GOV)', 'Clement, …

Text(value='', placeholder='Search anything')

Button(description='Search', style=ButtonStyle())

Output()

In [42]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')