# Graphs

Graphs are one of the most common data structures in computer science;
graph-based modeling of problems is at the heart of many systems we use
every day, such as routing in Google Maps or recommending friends on Facebook.

[Social Network Analysis](https://en.wikipedia.org/wiki/Social_network_analysis)
(SNA) is a branch of sociology that explores social structures through the use
of analytical tools, such as graphs. In this assignment, you will implement
basic social network analysis algorithms on a graph extracted from GitHub.

You can find the graph data [at this link](github.graph). The data looks like
this (CSV format):

```
follower,followed
1570,9236
9236,1570
13256,9236
9236,13256
13256,1570
1570,13256
```

Each row after the header is one directed edge. The first data row, for example, means that user `1570` follows user `9236`.
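To make the format concrete, here is a minimal sketch that parses the sample rows above into an adjacency list held in a dictionary. The `sample` string and the `adjacency` name are illustrative assumptions, not part of the assignment's required interface.

```python
import csv
import io

# The sample rows shown above, inlined so the sketch runs without the data file.
sample = """follower,followed
1570,9236
9236,1570
13256,9236
9236,13256
13256,1570
1570,13256
"""

adjacency = {}  # follower id -> list of ids that the follower follows
for row in csv.DictReader(io.StringIO(sample)):
    adjacency.setdefault(int(row["follower"]), []).append(int(row["followed"]))

for user, followed in sorted(adjacency.items()):
    print(user, "follows", followed)
```

A dictionary keeps each lookup of a user's neighbours cheap; a list of `(node, neighbours)` tuples carries the same information.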

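In [34]:
class Graph:
    """Skeleton grouping the functions requested in the tasks below."""

    @staticmethod
    def load(graph_file='github.graph'):
        pass

    @staticmethod
    def most_connected(graph, n=10):
        pass

    @staticmethod
    def shortest_paths(graph):
        pass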

## Loading the graph

**T (10 points):** Write a function that takes as input a file name and
loads the data into an adjacency list representation.

```python
def load(graph_file='github.graph'):
  """
    Loads the data from the file in the provided argument into an in-memory
    graph (as an adjacency list)
  """
  pass
```

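In [66]:
def load(graph_file='github.graph'):
    """Load follower,followed pairs into an adjacency list sorted by node id."""
    adjacency = {}  # follower id -> list of followed ids

    # The with-block closes the file even if parsing fails.
    with open(graph_file) as graph_data:
        for line in graph_data:
            follower, _, followed = line.partition(',')
            follower, followed = follower.strip(), followed.strip()
            # Skip the "follower,followed" header and any blank or malformed line.
            if not (follower.isdigit() and followed.isdigit()):
                continue
            adjacency.setdefault(int(follower), []).append(int(followed))

    # Same shape as before: a list of (node, neighbours) tuples ordered by node id.
    return sorted(adjacency.items())


graph = load()
for node in graph:
    print(node)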

(92, [59570, 4140, 10967])
(150, [143808, 37207, 36671, 92, 29166, 197069, 75784, 103533, 111065, 198885, 152772, 233097, 141830, 1567, 11726, 2576, 11149, 6985, 2556, 17407, 13256, 6891, 14006, 47281, 563808, 13823, 4620, 5274, 67159, 198581, 38066, 30453, 22596, 160493])
(306, [8267, 41223, 9236, 33018, 26395, 1063, 1934, 167231, 11154, 44990])
(344, [61272, 35469, 6569])
(346, [7547, 59216, 1570, 3892, 5995, 96349, 1563, 1063, 52402, 3063, 1934, 2146, 2556, 8637, 2159, 16499, 36872, 64971, 78115])
(350, [6072, 176616, 33018, 3914, 134106, 82381, 134197, 132738, 278406, 4189, 11154, 2584])
(352, [14922, 1570, 2556, 3892, 1564, 9236, 7547, 1567, 24531, 10073726, 46887])
(391, [7912, 1063, 14922, 75784, 48903, 20631, 60568, 4328])
(416, [24452, 2653])
(418, [7182, 2983, 28117, 28328, 13803, 17800])
(433, [1570, 14613, 9266, 10942, 13395, 1565, 185563, 103222, 61024, 32206, 24452, 17642, 16499, 316735, 46795, 82280])
(455, [457, 32872, 107385])
(457, [13395, 4567, 1570, 1563, 107385, 10

In [2]:
print(load() + load())

202


From now on, you must use the graph returned by `load` in all the tasks
below.

## Basic graph metrics

**T (10 points):** Who are the 10 most connected users?

```python
def most_connected(graph, n = 10):
  """
    Returns the ids of the top-n most connected users
  """
  pass
```

_Hint_: it helps if you first define a function called `in_degree` that calculates
the number of incoming edges into a node.
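A minimal sketch of the hint, assuming the graph is a list of `(node, neighbours)` tuples like the output printed earlier. The `toy_graph` data is made up for illustration, and treating "most connected" as "highest in-degree" is one possible reading of the task.

```python
from collections import Counter


def in_degree(graph):
    """Count incoming edges per node for a list of (node, neighbours) tuples."""
    counts = Counter()
    for _, neighbours in graph:
        counts.update(neighbours)  # each listed neighbour gains one incoming edge
    return counts


# Hypothetical toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
toy_graph = [(1, [2, 3]), (2, [3]), (3, [1])]
degrees = in_degree(toy_graph)
print(degrees.most_common(1))
```

`Counter.most_common(n)` returns the `n` highest counts, which is one convenient way to go from in-degrees to a top-n list.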

**T (10 points):** What are the mean and the median number of connections per user?
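As a small illustration, Python's standard `statistics` module computes both values. The `connection_counts` list is hypothetical and not taken from the GitHub data; in the assignment it would hold one count per user.

```python
from statistics import mean, median

# Hypothetical connection counts for five users (made-up values).
connection_counts = [1, 2, 2, 3, 10]

print("mean:", mean(connection_counts))
print("median:", median(connection_counts))
```

The single large value pulls the mean well above the median, which is typical of social graphs where a few users have very many connections.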

## Computing shortest paths

Shortest paths are the basis for many network measures. You will need to implement a shortest-path algorithm, such as Dijkstra's algorithm.

**T (20 points):** Write a function that computes the shortest paths
between all node pairs in the graph. 

```python
def shortest_paths(graph):
  """
  Computes the shortest paths between any pair of nodes in the graph
  
  @return A dictionary whose keys are node pairs and values are sequences
  indicating the shortest path between the node pair.
  """
  pass
```

_Hint_: Choose the appropriate shortest-path algorithm for an undirected graph
with no edge weights. A pair of nodes $(n_1, n_2)$ is, for our purposes,
equivalent to the pair $(n_2, n_1)$.
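For an unweighted graph, breadth-first search (BFS) already finds shortest paths, so it is one reasonable choice here. The sketch below assumes a hypothetical `neighbours` dictionary that stores every undirected edge in both directions; the function name `bfs_path` and the toy data are illustrative only.

```python
from collections import deque


def bfs_path(neighbours, source, target):
    """Return one shortest path from source to target, or None if unreachable.

    `neighbours` is assumed to map each node to an iterable of adjacent nodes,
    with every undirected edge stored in both directions.
    """
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in neighbours.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical undirected toy graph: 1 - 2 - 3 - 4, plus a shortcut 1 - 3.
toy = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_path(toy, 1, 4))  # prints [1, 3, 4]
```

Because BFS visits nodes in order of increasing distance, the first time the target is dequeued its recorded parent chain is a shortest path.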

_Hint_: How do you find all unique node pairs? Once you have a duplicate-free
list of all your nodes, you can use Python's `itertools.combinations` function
like so:

```{python}
from itertools import combinations

a = [1, 2, 3, 4, 5]
# Every unordered pair of distinct elements, each pair listed once.
pairs = list(combinations(a, 2))
print(pairs)
```

## Ranking important users

One of the primary uses of SNA is to identify important or influential nodes.
A typical metric we use to quantify the importance of a node is centrality.
Several [centrality measures](https://en.wikipedia.org/wiki/Centrality)
exist; for our purposes it is enough to calculate the **Betweenness Centrality**
of each node. The pseudocode to calculate it is given below.

To compute the betweenness of a node $n$:

1. For each pair of nodes $(v_1, v_2)$, compute the shortest paths between them.
2. For each pair of nodes $(v_1, v_2)$, determine the fraction of shortest paths
that include $n$.
3. Sum this fraction over all pairs of nodes $(v_1, v_2)$.
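The three steps above can be sketched directly, if inefficiently, for a single node. This is an illustrative sketch under stated assumptions: the graph is a hypothetical `neighbours` dictionary with undirected edges stored in both directions, pairs that contain $n$ itself are skipped (the usual convention, which the steps above do not spell out), and the helper names are made up.

```python
from collections import deque
from itertools import combinations


def all_shortest_paths(neighbours, source, target):
    """Return every shortest path from source to target in an unweighted graph."""
    dist = {source: 0}
    preds = {source: []}  # node -> predecessors that lie on some shortest path
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                preds[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                preds[nxt].append(node)  # another equally short way to reach nxt
    if target not in dist:
        return []

    def unwind(node):
        # Rebuild paths by following predecessor links back to the source.
        if node == source:
            return [[source]]
        return [path + [node] for p in preds[node] for path in unwind(p)]

    return unwind(target)


def betweenness_of(neighbours, n):
    """Sum, over all pairs not containing n, the fraction of shortest paths via n."""
    total = 0.0
    others = [v for v in neighbours if v != n]
    for v1, v2 in combinations(others, 2):
        paths = all_shortest_paths(neighbours, v1, v2)  # step 1
        if paths:
            total += sum(n in path for path in paths) / len(paths)  # steps 2 and 3
    return total


# Hypothetical undirected square 1 - 2 - 3 - 4 - 1.
square = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [3, 1]}
print(betweenness_of(square, 2))  # prints 0.5
```

In the square, only the pair $(1, 3)$ has shortest paths through node 2, and one of its two shortest paths does, giving a score of 0.5. Running a separate search per pair is slow on a large graph, so a real solution would likely reuse one search per source node.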

**T (30 points):** Write a function that computes the betweenness centrality for
all nodes in the provided network.

```python
def betweenness(graph):
  pass
```

**T (10 points):** Use the function above to rank the nodes (users) in
terms of importance.
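Ranking then amounts to sorting nodes by score. A minimal sketch, assuming the scores are held in a dictionary keyed by user id; the `scores` values below are made up and are not results from the GitHub graph.

```python
# Hypothetical betweenness scores keyed by user id (made-up values).
scores = {1570: 12.5, 9236: 40.0, 13256: 3.0}

# Sort user ids from highest to lowest score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # prints [9236, 1570, 13256]
```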

```{r child="footer.Rmd", include=FALSE}
```