# SD212: Graph mining
## Exam

First complete the following text cell with your name:

**Name:**

The duration of the exam is **3h**.

You must upload your notebook on the pedagogical site **before 4:30pm**.<br>
After **4:35pm**, there will be a penalty of **1 point per minute**.

There are 3 parts:
1. **Graph sampling** (5 points)
2. **Graph pruning** (5 points)
3. **Clustering by PageRank** (10 points)

Total = 20 points

The answer to each question must consist of:
* a text cell with your answer written either in **French** or in **English**,
* a code cell showing the **code** used to get the answer; this code must run without errors.

Useless code **must** be deleted.

Access to documents, slides and notebooks of the course is allowed.

Access to the Internet is **not** allowed (except for the pedagogical site).<br>

**Any** form of communication between students is strictly forbidden.

## Import

In [None]:
import networkx as nx

In [None]:
import numpy as np

In [None]:
import matplotlib.pyplot as plt

In [None]:
%matplotlib notebook

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set colors
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']

**Hint:** To get the documentation on a `function` you can type `function?`

In [None]:
nx.pagerank?

## Data

You will need the following datasets (the same as in the labs, no need to download them again):
* **Les Misérables**<br>
Graph connecting the characters of Victor Hugo's novel when they appear in the same chapter. The graph is undirected and weighted. Weights correspond to the number of chapters in which characters appear together.
* **OpenFlights**<br>
Graph of the main international flights. Nodes are airports. The graph is undirected (all flights are bidirectional). Weights correspond to the number of daily flights between airports.

If you don't have these datasets in your working directory, you can download them from the pedagogical site.

In [None]:
miserables = nx.read_graphml("miserables.graphml", node_type = int)

In [None]:
openflights = nx.read_graphml("openflights.graphml", node_type = int)

To get the index of a node from its name, you may use the following:

In [None]:
graph = miserables.copy()
name = nx.get_node_attributes(miserables, 'name')

In [None]:
word = 'jean'
selected_nodes = {i: name[i] for i in name if name[i].find(word) >= 0}

In [None]:
selected_nodes

## 1. Graph sampling (5 points)

We first consider the graph Les Misérables.

In [None]:
graph = miserables.copy()

### Question 1.a

What are the 5 nodes of highest degree?<br>
Give the names of the characters.

**Your answer:**



In [None]:
# Your code

### Question 1.b

How many nodes have degree 2?<br>
Give the names of the characters.

**Your answer:**



In [None]:
# Your code

### Question 1.c

Give the exact probabilities of sampling Cosette under:
* uniform node sampling
* uniform edge sampling (i.e., a random end of an edge, the edge being chosen uniformly at random)
* weighted edge sampling (i.e., a random end of an edge, the edge being chosen in proportion to the weights)

Interpret the results.
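
As an illustration of the three sampling schemes above, here is a minimal sketch on a small toy graph (not the exam data). It assumes an undirected graph without self-loops whose edge weights are stored under the `weight` attribute; the helper name `sampling_probabilities` and the toy graph are hypothetical, introduced only for this example.

```python
import networkx as nx

def sampling_probabilities(graph, node, weight="weight"):
    """Probability of sampling `node` under the three schemes (illustrative helper)."""
    n = graph.number_of_nodes()
    m = graph.number_of_edges()
    total_weight = graph.size(weight=weight)  # sum of all edge weights
    # Uniform node sampling: every node is equally likely.
    p_node = 1 / n
    # Uniform edge sampling, then a random end: proportional to the degree.
    p_edge = graph.degree(node) / (2 * m)
    # Weighted edge sampling, then a random end: proportional to the weighted degree.
    p_weighted = graph.degree(node, weight=weight) / (2 * total_weight)
    return p_node, p_edge, p_weighted

# Toy graph: 4 nodes, 4 edges, total weight 6.
toy = nx.Graph()
toy.add_weighted_edges_from([(0, 1, 3), (0, 2, 1), (1, 2, 1), (2, 3, 1)])
print(sampling_probabilities(toy, 0))
```

On this toy graph, node 0 has degree 2 and weighted degree 4, so the three probabilities are 1/4, 2/8 and 4/12.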

**Your answer:**



In [None]:
# Your code

## 2. Graph pruning (5 points)

We still consider the graph of Les Misérables.

### Question 2.a

Remove nodes of degree 1.<br>
How many edges remain?

You may use the method `remove_nodes_from` of a `networkx` graph.

**Your answer:**



In [None]:
# Your code

In [None]:
# graph.remove_nodes_from(nodes) 

### Question 2.b

Recursively remove nodes of degree 1 until no node of degree 1 remains.<br>
How many edges remain?
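
The recursive pruning described above can be sketched on a toy graph (not the exam data) as follows. The helper name `prune_leaves` and the toy graph are hypothetical; the sketch works on a copy so that the input graph is left unchanged.

```python
import networkx as nx

def prune_leaves(graph):
    """Repeatedly remove nodes of degree 1 (illustrative helper)."""
    pruned = graph.copy()
    leaves = [node for node, degree in pruned.degree() if degree == 1]
    while leaves:
        pruned.remove_nodes_from(leaves)
        # Removing leaves may create new nodes of degree 1, so recompute.
        leaves = [node for node, degree in pruned.degree() if degree == 1]
    return pruned

# Toy graph: a triangle (0, 1, 2) with a tail 2 - 3 - 4.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])
print(prune_leaves(toy).number_of_edges())
```

Here a single pass only removes node 4; the second pass removes node 3, which became a leaf, and only the triangle remains.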

**Your answer:**



In [None]:
# Your code

### Question 2.c

Compare the top-3 nodes for PageRank before and after pruning (i.e., without nodes of degree 1).<br>
Comment on the results (you may need to visualize the graph).

You may use the `pagerank` function of `networkx`.

**Your answer:**



In [None]:
# Your code

In [None]:
# nx.pagerank(graph)

## 3. Clustering by PageRank (10 points)

We now consider a clustering algorithm based on PageRank. 

The proposed algorithm consists of two steps:
1. Expand some seed set $S\subset V$ by successively adding the node furthest from the set $S$ in terms of Personalized PageRank. The initial seed set consists of a single node. This initial node is removed from the final seed set $S$.
2. Cluster the nodes according to their Personalized PageRank scores, personalized by each node of $S$, i.e., the cluster of each node $i$ is given by:
$$
 C(i) = \arg\max_{s\in S} \text{PPR}_s(i)
$$
where $\text{PPR}_s(i)$ is the Personalized PageRank of node $i$, when personalized by node $s$.
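
To make the assignment rule of the second step concrete, here is a minimal sketch on a toy graph (not the exam data). It relies on the `pagerank` function of `networkx` with a personalization vector concentrated on one seed; the helper name `assign_by_ppr`, the damping factor 0.85 and the choice of labeling each cluster by its seed node are assumptions made for this illustration.

```python
import networkx as nx

def assign_by_ppr(graph, seed_set, alpha=0.85):
    """Assign each node to the seed with the highest PPR score (illustrative helper)."""
    # One Personalized PageRank vector per seed: all restart mass on that seed.
    ppr = {}
    for s in seed_set:
        personalization = {i: 1.0 if i == s else 0.0 for i in graph}
        ppr[s] = nx.pagerank(graph, alpha=alpha, personalization=personalization)
    # C(i) = argmax over seeds s of PPR_s(i); clusters are labeled by their seed.
    return {i: max(seed_set, key=lambda s: ppr[s][i]) for i in graph}

# Toy graph: two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2 - 3.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)])
clusters = assign_by_ppr(toy, {0, 5})
print(clusters)
```

With seeds 0 and 5, each triangle ends up in the cluster of the seed it contains.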

### Question 3.a

1. Complete the function `get_furthest_node` below, that returns the furthest node from some seed set in terms of Personalized PageRank.
2. What is the furthest node of the set {Cosette, Marius} in Les Misérables? Give the name of the character.

You may use the `pagerank` function of `networkx`.

**Your answer:**



In [None]:
# Your code

In [None]:
# nx.pagerank(graph, personalization)

In [None]:
def get_furthest_node(graph, seed_set):
    '''
    graph: networkx graph
        undirected graph 
    seed_set: set
        set of nodes
        
    Returns: int
        node
    '''

    node = 0
    # to be completed
    return node

### Question 3.b

1. Complete the function `get_seed_set` below, where `size` is the final size of the seed set (after removing the initial seed node).
2. Give the seed sets of size 5 in Les Misérables, starting from Cosette and Marius, respectively.<br> What do you observe? Interpret the results (you may take a look at the degrees of the nodes in the seed set).

**Your answer:**



In [None]:
# Your code

In [None]:
def get_seed_set(graph, seed_node, size):
    '''
    graph: networkx graph
        undirected graph 
    seed_node: int
        seed node
    size: int
        size of the seed set
        
    Returns: set
        set of nodes
    '''
    seed_set = {seed_node}
    # to be completed
    return seed_set

### Question 3.c

1. Complete the function `pagerank_clustering` below.
2. Use this to cluster Les Misérables with 5 clusters, starting from Cosette.<br> What is the node of highest degree in each cluster? Give the name of each character.<br>
What is the strongest cluster?

**Your answer:**



In [None]:
# Your code

In [None]:
def pagerank_clustering(graph, seed_set):
    '''
    graph: networkx graph
        undirected graph 
    seed_set: set
        set of nodes
        
    Returns: dictionary
        cluster index of each node 
    '''
    
    C = {} 
    # to be completed
    return C

### Question 3.d

1. Apply the above algorithm to OpenFlights with 20 clusters, starting from the airport Paris Charles-de-Gaulle, and visualize the clustering.
2. Compute the modularity of this clustering.
3. Describe another graph clustering algorithm in which the number of clusters can be specified; apply this algorithm to OpenFlights with 20 clusters and compute the new modularity.

**Your answer:**



In [None]:
# Your code

In [None]:
openflights = nx.read_graphml("openflights.graphml", node_type = int)

In [None]:
# Get positions
pos_x = nx.get_node_attributes(openflights,'pos_x')
pos_y = nx.get_node_attributes(openflights,'pos_y')
pos = {i: (pos_x[i], pos_y[i]) for i in openflights}

In [None]:
plt.figure(figsize=(8,4))
plt.axis('off')
show_nodes = nx.draw_networkx_nodes(openflights, pos, node_size = 20)
plt.show()