## Assignment Week 3
The edge data for this project comes from the Stanford Large Network Dataset Collection. It is the wiki-vote dataset, a record of votes cast by Wikipedia users for other users to be elected to the position of administrator. Each node represents one user (identified by an id number), and each directed edge points from a voter to the user they voted for.

This assignment examines the diameter, degree metrics, and PageRank of a small neighborhood around an arbitrarily chosen user (node 3).


In [1]:
import graphlab as gl


In [2]:
edge_data = gl.SFrame.read_csv("wiki-vote-edges-clean.txt",delimiter='\t')


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
g = gl.SGraph(edges=edge_data, src_field='FromNodeId', dst_field='ToNodeId')

In [4]:
## load a small subset of the graph: the radius-1 neighborhood of node 3

subg = g.get_neighborhood(ids=[3],radius=1)
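
To make the neighborhood step concrete, here is a minimal pure-Python sketch of a radius-limited neighborhood. It assumes that hops are counted without regard to edge direction and that every original edge between two selected vertices is kept, which is my understanding of how `get_neighborhood` behaves by default. The toy edge list and the `neighborhood` helper are hypothetical and are not part of GraphLab or the wiki-vote data.

```python
# Hypothetical toy edge list (voter, candidate); not taken from wiki-vote.
edges = [(1, 3), (2, 3), (3, 4), (4, 5), (2, 4), (5, 6)]


def neighborhood(edges, center, radius=1):
    """Return (vertices, edges) within `radius` hops of `center`.

    Hops are measured on the undirected version of the graph; the
    returned edges keep their original direction.
    """
    # Undirected adjacency map used only for measuring hop distance.
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)

    seen = {center}
    frontier = {center}
    for _ in range(radius):
        # Expand one hop and drop vertices that were already collected.
        frontier = {n for v in frontier for n in adj.get(v, ())} - seen
        seen |= frontier

    # Keep every directed edge whose two endpoints both survived.
    kept = [(s, d) for s, d in edges if s in seen and d in seen]
    return seen, kept


verts, sub_edges = neighborhood(edges, center=3, radius=1)
print(sorted(verts))
print(sub_edges)
```

Because the selection ignores direction, the result contains both users who voted for the center node and users the center node voted for.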

In [5]:
gl.canvas.set_target('ipynb')
subg.show(arrows=True, highlight=[3], vlabel='__id')

## Metric 1 - Diameter

In [20]:
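## Diameter = the maximum shortest-path length over all pairs of vertices

# Edges are directed, so vertices with no directed path from node 3 are unreachable from it.
# The subgraph is therefore not strongly connected, and its directed diameter is infinite.
sp = gl.shortest_path.create(subg, 3)  # single-source shortest paths from node 3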

In [21]:
sp['max_distance']  # 1e+30 means inf

1e+30
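
The infinite value can be reproduced on a toy graph with a plain breadth-first search. The sketch below is only an illustration under stated assumptions: the edge list is hypothetical, `bfs_distances` and `diameter` are helper names invented here, and `float("inf")` stands in for the `1e+30` placeholder shown above. It contrasts directed distances from one source with the diameter of the same graph when direction is ignored.

```python
from collections import deque

# Hypothetical toy graph: users 1 and 2 vote for 3, user 3 votes for 4.
edges = [(1, 3), (2, 3), (3, 4), (2, 4)]

INF = float("inf")


def bfs_distances(edges, source, directed=True):
    """Hop distance from `source` to every vertex (INF if unreachable)."""
    adj = {}
    for s, d in edges:
        adj.setdefault(s, set()).add(d)
        adj.setdefault(d, set())
        if not directed:
            adj[d].add(s)  # also allow travel against the edge direction

    dist = {v: INF for v in adj}
    dist[source] = 0
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for n in adj[v]:
            if dist[n] == INF:
                dist[n] = dist[v] + 1
                queue.append(n)
    return dist


def diameter(edges, directed=True):
    """Largest shortest-path length over all ordered pairs of vertices."""
    vertices = {v for e in edges for v in e}
    return max(
        max(bfs_distances(edges, v, directed).values()) for v in vertices
    )


from_3 = bfs_distances(edges, 3, directed=True)
print(from_3)                           # voters 1 and 2 are unreachable from 3
print(diameter(edges, directed=True))   # infinite
print(diameter(edges, directed=False))  # finite once direction is ignored
```

A single-source run gives the eccentricity of that one vertex; the diameter needs the maximum over every source, which is why the helper loops over all vertices.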

## Metric 2 - Degree Count and Aggregation

In [25]:
# max out_degree and in_degree, average total degree of graph

deg_model = gl.degree_counting.create(subg)
deg_graph = deg_model['graph']


In [22]:
deg_frame = deg_graph.vertices[['__id', 'in_degree', 'out_degree','total_degree']]
deg_frame

__id,in_degree,out_degree,total_degree
178,7,5,12
73,0,4,4
47,0,15,15
29,6,14,20
10,5,11,16
34,6,4,10
30,9,0,9
604,5,0,5
586,3,0,3
300,10,0,10


In [9]:
# max and min of in- and out-degree: which nodes in the neighborhood received the most votes,
# and which received the fewest
# plus the average total degree across the neighborhood graph

max_indeg_indx = deg_frame[deg_frame['in_degree'].argmax()]
max_outdeg_indx = deg_frame[deg_frame['out_degree'].argmax()]

min_indeg_indx = deg_frame[deg_frame['in_degree'].argmin()]
min_outdeg_indx = deg_frame[deg_frame['out_degree'].argmin()]

avg_deg = deg_frame['total_degree'].mean()

In [10]:
max_indeg_indx

{'__id': 3, 'in_degree': 31, 'out_degree': 23, 'total_degree': 54}

In [11]:
max_outdeg_indx

{'__id': 6, 'in_degree': 8, 'out_degree': 27, 'total_degree': 35}

In [12]:
min_indeg_indx

{'__id': 25, 'in_degree': 0, 'out_degree': 22, 'total_degree': 22}

In [13]:
min_outdeg_indx

{'__id': 611, 'in_degree': 9, 'out_degree': 0, 'total_degree': 9}

In [14]:
avg_deg

13.423076923076925
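
For readers without GraphLab, the same degree bookkeeping can be sketched with the standard library. This is an illustrative sketch only: the edge list is a hypothetical toy graph, and the row dictionaries merely imitate the `in_degree`, `out_degree`, and `total_degree` columns shown above.

```python
from collections import Counter

# Hypothetical toy edge list (voter, candidate); not taken from wiki-vote.
edges = [(1, 3), (2, 3), (3, 4), (2, 4)]

out_deg = Counter(src for src, _ in edges)  # votes cast by each user
in_deg = Counter(dst for _, dst in edges)   # votes received by each user
vertices = sorted({v for e in edges for v in e})

rows = [
    {
        "__id": v,
        "in_degree": in_deg[v],
        "out_degree": out_deg[v],
        "total_degree": in_deg[v] + out_deg[v],
    }
    for v in vertices
]

# On ties, max() returns the first matching row in vertex-id order.
most_voted = max(rows, key=lambda r: r["in_degree"])
most_active = max(rows, key=lambda r: r["out_degree"])
avg_total = sum(r["total_degree"] for r in rows) / len(rows)

print(most_voted)
print(most_active)
print(avg_total)
```

Selecting the whole row rather than only the position mirrors the notebook cell above, where each `*_indx` variable holds a full row of the degree table.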

## Metric 3 - PageRank centrality measure

In [15]:
# calculate the PageRank of each node in the neighborhood and find the top 5

pr_model = gl.pagerank.create(subg)
pr_graph = pr_model['graph']
pr_frame = pr_graph.vertices[['__id', 'pagerank']]

In [23]:
pagerank_top_nodes = pr_frame.sort('pagerank',ascending=False)[0:5]
pagerank_top_nodes

__id,pagerank
271,1.49255977897
3,1.09222054154
590,0.950949870307
28,0.799160571616
214,0.697230547633
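
The scores above can exceed 1, which fits an unnormalized PageRank in which every vertex starts from a fixed reset value. The sketch below implements that variant by simple iteration. It assumes a reset probability of 0.15 and no redistribution of rank from vertices without outgoing edges; I believe this resembles GraphLab's defaults, but that is an assumption, and the `pagerank` function and toy edge list here are hypothetical.

```python
# Hypothetical toy edge list (voter, candidate); not taken from wiki-vote.
edges = [(1, 3), (2, 3), (3, 4), (2, 4)]


def pagerank(edges, reset=0.15, iterations=50):
    """Unnormalized PageRank: rank(v) = reset + (1 - reset) * sum of
    rank(u) / out_degree(u) over all edges u -> v."""
    vertices = sorted({v for e in edges for v in e})
    out_deg = {v: 0 for v in vertices}
    for src, _ in edges:
        out_deg[src] += 1

    rank = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        new_rank = {v: reset for v in vertices}
        for src, dst in edges:
            # Each voter splits its current rank evenly among its votes.
            new_rank[dst] += (1 - reset) * rank[src] / out_deg[src]
        rank = new_rank
    return rank


ranks = pagerank(edges)
top = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)
for vertex, score in top:
    print(vertex, round(score, 5))
```

In this variant a user nobody voted for keeps exactly the reset value, while a user endorsed by well-ranked voters accumulates their shares.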


## File Export for graph database

In [17]:
# export the subgraph, with PageRank and degree info added, to view in Neo4j
vertices = deg_graph.vertices.join(pr_graph.vertices[['__id', 'pagerank']],'__id')
edges = subg.edges


In [18]:
#vertices.export_csv("vertices.csv")

In [19]:
#edges.export_csv("edges.csv")
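
As a rough picture of what the two commented-out exports would produce, the sketch below writes vertex and edge tables as CSV text with a header row, using only the standard library. The rows are hypothetical toy values, the `to_csv` helper is invented for this example, and the `__src_id`/`__dst_id` column names are an assumption about how the edge table labels its endpoints.

```python
import csv
import io

# Hypothetical toy rows standing in for the joined vertex table and edge table.
vertices = [
    {"__id": 3, "in_degree": 2, "out_degree": 1, "pagerank": 0.34125},
    {"__id": 4, "in_degree": 2, "out_degree": 0, "pagerank": 0.5038125},
]
edges = [
    {"__src_id": 3, "__dst_id": 4},
]


def to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


vertices_csv = to_csv(vertices, ["__id", "in_degree", "out_degree", "pagerank"])
edges_csv = to_csv(edges, ["__src_id", "__dst_id"])
print(vertices_csv)
print(edges_csv)
```

Keeping the header row lets a graph database importer map each column to a node or relationship property by name.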

## Link to NB Viewer
<a href="https://github.com/bsnacks000/IS620_Web_Analytics/blob/master/AssignmentWeek3/Assignment%20Week%203.ipynb"> Assignment Week 3 Link </a>