# Graphs

In this notebook, you will practice using networkx to interact with a graph in Python.

In [None]:
import networkx as nx
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

#### Load the IMDB actors graph

Each actor in the IMDB database is a node. An edge connects two nodes if the corresponding actors have appeared in a movie together.

In [None]:
G = nx.read_edgelist('data/actor_edges.tsv', delimiter='\t')

In [None]:
!head data/actor_edges.tsv

Total number of nodes (actors)

Total number of edges
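Both counts are available as methods on the graph object. Here is a minimal sketch on a toy graph (the actor names are made up for illustration); the same two calls should work on `G`.

```python
import networkx as nx

# Toy stand-in for the actor graph; the names are hypothetical.
toy = nx.Graph()
toy.add_edges_from([
    ("Actor A", "Actor B"),
    ("Actor B", "Actor C"),
    ("Actor A", "Actor C"),
    ("Actor C", "Actor D"),
])

n_nodes = toy.number_of_nodes()  # len(toy) gives the same number
n_edges = toy.number_of_edges()
print(n_nodes, n_edges)
```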

Take a look at the nodes and edges
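Printing every node of a large graph is rarely useful, so one option is to peek at the first few. The sketch below uses a toy graph with hypothetical names; `islice` is just one way to take a small sample.

```python
from itertools import islice

import networkx as nx

# Toy stand-in for the actor graph; the names are hypothetical.
toy = nx.Graph([
    ("Actor A", "Actor B"),
    ("Actor B", "Actor C"),
    ("Actor C", "Actor D"),
])

# toy.nodes and toy.edges are iterable views; islice avoids building
# a full list when the graph is large.
first_nodes = list(islice(toy.nodes, 2))
first_edges = list(islice(toy.edges, 2))
print(first_nodes)
print(first_edges)
```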

Neighborhood of John Goodman
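A node's neighborhood is the set of nodes it shares an edge with. In the sketch below, the toy graph and the label `"Actor A"` are assumptions; on the real graph the label has to match the way the name is written in the TSV file.

```python
import networkx as nx

# Toy stand-in for the actor graph; the names are hypothetical.
toy = nx.Graph([
    ("Actor A", "Actor B"),
    ("Actor A", "Actor C"),
    ("Actor C", "Actor D"),
])

name = "Actor A"  # hypothetical label; real labels depend on the data file

# Guard against a label that is not in the graph.
if name in toy:
    neighbors = sorted(toy.neighbors(name))
else:
    neighbors = []
print(neighbors)
```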

Shortest path between two actors
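`nx.shortest_path` returns one shortest path as a list of nodes, and `nx.shortest_path_length` returns the number of edges on it. Here is a sketch on a toy graph with hypothetical names.

```python
import networkx as nx

# Toy graph: two routes from Actor A to Actor D, one shorter than the other.
toy = nx.Graph([
    ("Actor A", "Actor B"),
    ("Actor B", "Actor C"),
    ("Actor C", "Actor D"),
    ("Actor A", "Actor E"),
    ("Actor E", "Actor D"),
])

path = nx.shortest_path(toy, source="Actor A", target="Actor D")
hops = nx.shortest_path_length(toy, source="Actor A", target="Actor D")
print(path, hops)
```

If the two nodes sit in different components, both calls raise `nx.NetworkXNoPath`.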

All of the shortest paths
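When several paths tie for shortest, `nx.all_shortest_paths` yields each of them. It returns a generator, so wrap it in `list` or `sorted` to see the results. Here is a sketch on a toy graph with hypothetical names.

```python
import networkx as nx

# Toy graph shaped like a square: two equally short routes from A to D.
toy = nx.Graph([
    ("Actor A", "Actor B"),
    ("Actor B", "Actor D"),
    ("Actor A", "Actor C"),
    ("Actor C", "Actor D"),
])

paths = sorted(nx.all_shortest_paths(toy, source="Actor A", target="Actor D"))
for p in paths:
    print(p)
```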

#### Exercise

Make a histogram of the degree distribution of this graph. Does it match any of the distributions discussed in the async?
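One possible starting point is to collect the degrees first and plot them afterwards. The sketch below uses a small star graph rather than the actor graph, so it shows the calls without giving away the answer.

```python
import networkx as nx

toy = nx.star_graph(4)  # node 0 is the hub, nodes 1-4 are leaves

# counts[d] is the number of nodes whose degree is d.
counts = nx.degree_histogram(toy)

# The raw degree of every node, handy for plt.hist.
degrees = [d for _, d in toy.degree()]
print(counts)
print(sorted(degrees))
```

Either `plt.hist(degrees)` or `plt.bar(range(len(counts)), counts)` should turn these into a plot. If the distribution is heavily skewed, a logarithmic axis may make it easier to read.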

#### Exercise

Which actor has the highest degree centrality?

#### Exercise

How many connected components are there?
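As a hint, NetworkX has helpers for both counting components and iterating over them. Here is a minimal sketch on a toy graph.

```python
import networkx as nx

toy = nx.Graph()
toy.add_edges_from([(1, 2), (2, 3), (4, 5)])
toy.add_node(6)  # an isolated node counts as its own component

n_components = nx.number_connected_components(toy)

# connected_components yields one set of nodes per component.
largest = max(nx.connected_components(toy), key=len)
print(n_components, sorted(largest))
```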

#### Exercise

Make a histogram of the sizes of the connected components. (Try removing the first component for better results.)
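A sketch of collecting the component sizes, on a toy graph. It assumes that "the first component" means the largest one after sorting by size, which is an interpretation rather than something the exercise states.

```python
import networkx as nx

toy = nx.Graph()
toy.add_edges_from([(1, 2), (2, 3), (3, 4), (5, 6), (7, 8)])
toy.add_node(9)

# One size per component, largest first.
sizes = sorted((len(c) for c in nx.connected_components(toy)), reverse=True)

# Assumption: drop the largest component so the small ones are visible.
sizes_without_largest = sizes[1:]
print(sizes, sizes_without_largest)
```

`sizes_without_largest` can then be passed to `plt.hist`.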

For the remainder of the problems, work with a subset of the original graph.

In [None]:
smallG = nx.read_edgelist('data/small_actor_edges.tsv', delimiter='\t')

In [None]:
smallG.number_of_edges()

In [None]:
smallG.number_of_nodes()

#### Exercise

How many connected components are there in this graph?

#### Exercise

Find the most important actor in this subgroup according to each of the following centrality measures:
- Degree
- Betweenness
- Eigenvector
- Pagerank
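Each NetworkX centrality function returns a dict mapping node to score, so the top node is the key with the largest value. The sketch below shows the pattern for two of the measures on a toy graph where they disagree. `nx.eigenvector_centrality` and `nx.pagerank` return the same dict shape, so they could be added to the `measures` dict in the same way.

```python
import networkx as nx

# Two 4-cliques (nodes 0-3 and 5-8) joined through a single bridge node 4.
toy = nx.barbell_graph(4, 1)

measures = {
    "degree": nx.degree_centrality,
    "betweenness": nx.betweenness_centrality,
}

top = {}
for label, func in measures.items():
    scores = func(toy)  # dict: node -> score
    top[label] = max(scores, key=scores.get)
print(top)
```

In this toy graph the bridge node has the highest betweenness even though its degree is the lowest, which is why it helps to compare several measures.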

#### Exercise

Compare the ranks of some of the top actors. That is, make a table with one row for each actor and one column for each centrality measure. The values should be the relative rank of each actor according to that centrality measure. Hint: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rank.html
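One way to build such a table, sketched on a toy graph: put the centrality dicts into a `DataFrame` (one column per measure) and rank each column. `DataFrame.rank` takes the same arguments as the `Series.rank` method in the hint. The choice of `method="min"` for ties is an assumption, not a requirement of the exercise.

```python
import networkx as nx
import pandas as pd

# Two 4-cliques (nodes 0-3 and 5-8) joined through a single bridge node 4.
toy = nx.barbell_graph(4, 1)

# Dict of dicts: outer keys become columns, node labels become the index.
scores = pd.DataFrame({
    "degree": nx.degree_centrality(toy),
    "betweenness": nx.betweenness_centrality(toy),
})

# Rank 1 is the most central node; tied nodes share the lowest rank of their group.
ranks = scores.rank(ascending=False, method="min").astype(int)
print(ranks.sort_values("betweenness").head())
```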

#### Exercise

Community detection is like clustering for graphs. There are many approaches, and we won't go into the details here.

One good approach seems to be the [Louvain Method](https://www.quora.com/Is-there-a-simple-explanation-of-the-Louvain-Method-of-community-detection). It is implemented in the [community](https://github.com/taynaud/python-louvain) package in Python.

Once you get it installed, try using the `best_partition` method to apply the Louvain Method for community detection.
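With the `community` package, `best_partition(G)` returns a dict that maps each node to a community id. If installing it proves difficult, recent NetworkX releases (2.8 and later, as far as I know) include their own Louvain implementation. The sketch below uses that built-in function as a swapped-in alternative, not the package named above, and converts its output to the same dict shape.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Two 5-cliques joined by a single edge: an easy case for community detection.
toy = nx.barbell_graph(5, 0)

# Swapped-in alternative to best_partition from the community package.
# Returns a list of node sets; seed fixes the random node ordering.
communities = louvain_communities(toy, seed=0)

# Convert to the node -> community id dict shape that best_partition returns.
partition = {node: i for i, nodes in enumerate(communities) for node in nodes}
print(len(communities))
print(partition)
```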


#### Exercise

See if you can find the most important actors in each community. Do you recognize any of the communities?

Hint: Use `nx.subgraph` to create a subgraph from each community.
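A sketch of the hint, assuming the partition is a dict from node to community id (the shape `best_partition` returns). The toy graph, the labels, and the partition are all hypothetical, and degree centrality is only one possible choice of importance measure.

```python
from collections import defaultdict

import networkx as nx

# Two small star-shaped groups joined by one edge; labels are hypothetical.
toy = nx.Graph([
    ("Hub A", "A1"), ("Hub A", "A2"), ("Hub A", "A3"),
    ("Hub B", "B1"), ("Hub B", "B2"),
    ("A1", "B1"),
])

# Hypothetical partition: node -> community id.
partition = {
    "Hub A": 0, "A1": 0, "A2": 0, "A3": 0,
    "Hub B": 1, "B1": 1, "B2": 1,
}

# Group the nodes by community id.
members = defaultdict(list)
for node, comm in partition.items():
    members[comm].append(node)

# Score nodes inside each induced subgraph and keep the top one.
top_by_community = {}
for comm, nodes in members.items():
    sub = nx.subgraph(toy, nodes)
    scores = nx.degree_centrality(sub)
    top_by_community[comm] = max(scores, key=scores.get)
print(top_by_community)
```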

#### Exercise

Pick one of the communities you found above, assign its subgraph to `subG`, and plot the network using the code below.

In [None]:
plt.figure(figsize=(12,12))

# Pick a node layout
# node_position = nx.fruchterman_reingold_layout(subG)
# node_position = nx.spring_layout(subG)
node_position = nx.random_layout(subG)
# node_position = nx.circular_layout(subG)
# node_position = nx.spectral_layout(subG)
# node_position = nx.shell_layout(subG)

nx.draw_networkx_nodes(subG, node_position, node_size=100, alpha=.8)
nx.draw_networkx_edges(subG, node_position, alpha=.1)
nx.draw_networkx_labels(subG, node_position)

plt.show()

You can experiment with the node layouts, but none of them is really satisfactory. A better approach is to use a graph visualization tool like [Gephi](https://gephi.org/).

In [None]:
nx.write_gml(subG, 'my_community.gml')

Here are a few steps to get started:

1. Open the GML file in Gephi.
2. Use the Layout tab and choose Force Atlas as the layout. Adjust the repulsion strength until the nodes are sufficiently spread apart. More densely connected nodes should be near the center.
3. Add the labels by clicking the T in the lower left-hand corner.
4. Use the statistics tab to calculate eigenvector centrality, pagerank, and anything else you're interested in.
5. Use the appearance tab to scale the text color/size according to various centrality measures.
6. Keep experimenting!

#### Extra Credit

How many actors are connected to Kevin Bacon, but by more than six degrees?
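One possible approach is to compute the distance from a single source to every reachable node and count the large ones. In the sketch below, a 10-node chain stands in for the actor graph and node `0` stands in for the Kevin Bacon node; both are assumptions for illustration.

```python
import networkx as nx

toy = nx.path_graph(10)   # chain 0-1-2-...-9
toy.add_node("isolated")  # unreachable from the source, so never counted

source = 0  # stands in for the Kevin Bacon node

# Dict of node -> number of hops, for reachable nodes only.
dist = nx.single_source_shortest_path_length(toy, source)

far = [node for node, d in dist.items() if d > 6]
print(len(far))
```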

In [None]:
G = nx.read_edgelist('data/actor_edges.tsv', delimiter='\t')