# NTDS 2019 : Assignment 1


[Eda Bayram,](https://lts4.epfl.ch/bayram) [EPFL](http://epfl.ch) [LTS4,](http://lts4.epfl.ch)
Nikolaos Karalias, [EPFL](http://epfl.ch) [LTS2](http://lts2.epfl.ch)

## Students
Team: `<your team number>`

## Rules

* The first deadline is for the individual submission, the second one is for the team submission. No collaboration between teams is allowed.
* All team members receive the same grade for the assignment, based on the solution the team submits by the second deadline.
* However, a team may ask for individual grading, which is then based on the solutions submitted by the first deadline.
* Textual answers shall be short. Typically one to two sentences.
* Code has to be clean.
* In the first and second sections, the libraries to be used are given; you cannot import any other library. You cannot use NetworkX in the first section, except where explicitly allowed.
* Submit the notebook in an executed state, with its results stored: when opened again, it should show numerical results and plots. We won't be able to execute your notebooks.
* Re-execute the notebook from a blank state before submission to make sure it is reproducible. You can click "Kernel" then "Restart & Run All" in Jupyter.

# Objective 

The purpose of this milestone is to explore a given dataset and to represent it as a network by constructing different graphs. In the first section, you will analyze the network properties. In the second section, you will explore various network models and find out which model best fits the graphs you constructed from the dataset.

# Dataset : Cora Dataset

The [Cora dataset](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz) consists of scientific publications, each classified into one of seven research fields.

* **Citation graph** The citation network can be constructed from the connections given in the `cora.cites` file.
* **Feature graph** Each publication in the dataset is described by its research field and by a 0/1-valued word vector indicating the absence/presence of the corresponding dictionary word; both are given in the `cora.content` file. The dictionary consists of 1433 unique words. A feature graph can be constructed from the Euclidean distances between the feature vectors of the publications.

The `README` file in the dataset provides the details about the content of the files. 

# Section 1 : Network Properties

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

### Question  1: Construct a Citation Graph and a Feature Graph

Read the `cora.content` file into a `Pandas` data frame, setting a header for the column names. Check the `README` file for the column layout.

In [None]:
column_list = # Your code here.
pd_content = pd.read_csv('cora/cora.content',delimiter='\t', names=column_list) 
pd_content.head()

Print out the number of papers contained in each of the research fields.

**Hint:** You can use `value_counts()` function from `Pandas`.
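
As a minimal sketch of the hint, the snippet below counts labels in a small made-up table. The data frame `toy` and its column names `paper_id` and `field` are hypothetical and are not taken from the Cora files.

```python
import pandas as pd

# Toy table with a hypothetical 'field' column (not the real Cora data).
toy = pd.DataFrame({
    "paper_id": [10, 11, 12, 13, 14],
    "field": ["Theory", "Neural_Networks", "Theory", "Theory", "Neural_Networks"],
})

# value_counts() returns a Series indexed by label, sorted by decreasing count.
counts = toy["field"].value_counts()
print(counts)
```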

In [None]:
# Your code here.

Select all papers from a field of your choice and store their IDs. Then extract their feature vectors into a `numpy` array and check its shape.

In [None]:
my_field = # Your code here.
features = # Your code here.
features.shape

Construct a distance matrix $D$ as a `Numpy` array whose $(i,j)$ element is denoted by $d(i,j)$ which signifies the Euclidean distance between feature vectors of papers $i$ and $j$.
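
One possible way to compute all pairwise Euclidean distances without explicit loops is sketched below on a tiny made-up binary matrix. The names `X` and `toy_distance` are illustrative only; the sketch relies on the identity $\|x_i - x_j\|^2 = \|x_i\|^2 + \|x_j\|^2 - 2\,x_i^\top x_j$.

```python
import numpy as np

# Toy feature matrix: 3 "papers", 3 "words" (made-up values).
X = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)

# Squared norms of each row, shape (3,).
sq_norms = (X ** 2).sum(axis=1)

# Squared distances via broadcasting: ||xi||^2 + ||xj||^2 - 2 xi.xj
sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T

# Clip tiny negative values caused by round-off before taking the root.
toy_distance = np.sqrt(np.maximum(sq_dist, 0))
print(toy_distance)
```

The result is symmetric with a zero diagonal, which is a quick sanity check worth running on the real distance matrix as well.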

In [None]:
distance = # Your code here.
distance.shape

Check the mean pairwise distance $\mathbb{E}[D]$.

In [None]:
mean_distance = distance.mean()
mean_distance

Plot the histogram of the Euclidean distances.

In [None]:
plt.figure(1, figsize=(8,4))
plt.title("Histogram of Euclidean distances between papers")
plt.hist(distance.reshape(-1));

Now create an adjacency matrix for the papers by thresholding the Euclidean distance matrix.
The resulting (unweighted) adjacency matrix should have entries: $ A_{ij} = \begin{cases} 1, \; \text{if} \; d(i,j)< \mathbb{E}[D], \; i \neq j \\ 0, \; \text{otherwise.} \end{cases}$

First, let us choose the mean distance as the threshold.
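
The thresholding rule above can be sketched on a small made-up distance matrix as follows. The names `toy_distance`, `toy_thresh` and `toy_A` are hypothetical.

```python
import numpy as np

# Made-up symmetric distance matrix for 3 nodes.
toy_distance = np.array([[0.0, 1.0, 3.0],
                         [1.0, 0.0, 2.0],
                         [3.0, 2.0, 0.0]])

# Mean over all entries, including the zero diagonal: 12 / 9.
toy_thresh = toy_distance.mean()

# Connect pairs that are closer than the threshold...
toy_A = (toy_distance < toy_thresh).astype(int)

# ...and enforce the i != j condition by clearing self-loops.
np.fill_diagonal(toy_A, 0)
print(toy_A)
```

Only nodes 0 and 1 end up connected here, since their distance is the only off-diagonal entry below the mean.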

In [None]:
thresh = mean_distance
A_feature = # Your code here.

Now read the `cora.cites` file and construct the citation graph by converting the given citation connections into an adjacency matrix. 
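
A minimal sketch of turning an edge list of paper IDs into an adjacency matrix is given below. The IDs and the node ordering `toy_ids` are made up, and the column order (cited paper first, citing paper second) is an assumption to be checked against the `README`.

```python
import numpy as np

# Made-up edge list: each row is assumed to be (cited ID, citing ID).
toy_cites = np.array([[35, 1033],
                      [35, 103482],
                      [1033, 103482]])

# Hypothetical node ordering; paper 40 has no citations at all.
toy_ids = np.array([35, 1033, 103482, 40])

# Map each paper ID to its row/column index.
id_to_idx = {pid: i for i, pid in enumerate(toy_ids)}

n_nodes = len(toy_ids)
toy_A = np.zeros((n_nodes, n_nodes), dtype=int)
for cited, citing in toy_cites:
    # Convention chosen here: directed edge from the citing to the cited paper.
    toy_A[id_to_idx[citing], id_to_idx[cited]] = 1
print(toy_A)
```

With this convention the matrix is directed, so it is not symmetric in general.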

In [None]:
cora_cites = np.genfromtxt('cora/cora.cites', delimiter='\t')

A_citation = # Your code here.
A_citation.shape

Get the adjacency matrix of the citation graph for the field that you chose, by reducing the full citation adjacency matrix to the rows and columns of the papers in that field.

In [None]:
# Your code here.

Check if your adjacency matrix is symmetric. Symmetrize your final adjacency matrix if it's not already symmetric.
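
One way to test for symmetry and to symmetrize an unweighted adjacency matrix is sketched below on a made-up directed matrix. Taking the element-wise maximum with the transpose is one possible choice: it keeps an undirected edge whenever a link exists in either direction.

```python
import numpy as np

# Made-up directed adjacency matrix: edges 0 -> 1 and 2 -> 0.
toy_A = np.array([[0, 1, 0],
                  [0, 0, 0],
                  [1, 0, 0]])

# A matrix is symmetric when it equals its transpose.
is_symmetric = np.array_equal(toy_A, toy_A.T)

# Keep an edge if it exists in at least one direction.
toy_A_sym = np.maximum(toy_A, toy_A.T)
print(is_symmetric)
print(toy_A_sym)
```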

In [None]:
np.nonzero(A_citation-A_citation.transpose())

Check the shape of your adjacency matrix again.

In [None]:
A_citation.shape

### Question 2: Degree Distribution and Moments


What is the total number of edges in each graph?

In [None]:
num_edges_feature = # Your code here.
num_edges_citation = # Your code here.
print("Number of edges in the feature graph: ", num_edges_feature)
print("Number of edges in the citation graph: ", num_edges_citation)

Plot the degree distribution histogram for each of the graphs.

In [None]:
degrees_citation = # Your code here.
degrees_feature = # Your code here.

deg_hist_normalization = np.ones(degrees_citation.shape[0])/degrees_citation.shape[0]

plt.figure(figsize=(16,4))
plt.subplot(131)
plt.title('Citation graph degree distribution')
plt.hist(degrees_citation, weights = deg_hist_normalization);
plt.subplot(132)
plt.title('Feature graph degree distribution')
plt.hist(degrees_feature, weights = deg_hist_normalization);

Calculate the first and second moments of the degree distribution of each graph.
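
As an illustration on a made-up star graph, the degrees, the edge count and the first two raw moments of the degree distribution can be obtained from a symmetric adjacency matrix as sketched below. All `toy_` names are hypothetical.

```python
import numpy as np

# Made-up undirected star graph: node 0 is connected to nodes 1, 2 and 3.
toy_A = np.array([[0, 1, 1, 1],
                  [1, 0, 0, 0],
                  [1, 0, 0, 0],
                  [1, 0, 0, 0]])

# Degree of each node is the row sum of the adjacency matrix.
toy_degrees = toy_A.sum(axis=1)

# Each undirected edge is counted twice in the degree sum.
toy_num_edges = toy_degrees.sum() // 2

# Raw moments: mean degree and mean squared degree.
toy_moment_1 = toy_degrees.mean()
toy_moment_2 = (toy_degrees ** 2).mean()
print(toy_num_edges, toy_moment_1, toy_moment_2)
```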

In [None]:
cit_moment_1 = # Your code here.
cit_moment_2 = # Your code here.

feat_moment_1 = # Your code here.
feat_moment_2 = # Your code here.

print("1st moment of citation graph: ", (cit_moment_1))
print("2nd moment of citation graph: ", (cit_moment_2))
print("1st moment of feature graph: ", (feat_moment_1))
print("2nd moment of feature graph: ", (feat_moment_2))

What information do the moments provide you about the graphs?
Explain the differences in moments between graphs by comparing their degree distributions.

**Your answer here:**

Select the 20 largest hubs for each of the graphs and remove them. Observe the sparsity pattern of the adjacency matrices of the citation and feature graphs before and after such a reduction.
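
A possible way to drop the highest-degree nodes is sketched below. The helper `remove_top_hubs` is a hypothetical name, and ties between equal degrees are broken arbitrarily by `np.argsort`.

```python
import numpy as np

def remove_top_hubs(A, k):
    """Return A without the rows and columns of its k highest-degree nodes."""
    degrees = A.sum(axis=1)
    # Indices of the k largest degrees (argsort is ascending).
    hubs = np.argsort(degrees)[-k:]
    # Indices of all remaining nodes, in their original order.
    keep = np.setdiff1d(np.arange(A.shape[0]), hubs)
    return A[np.ix_(keep, keep)]

# Made-up star graph: node 0 is the only hub.
toy_A = np.array([[0, 1, 1, 1],
                  [1, 0, 0, 0],
                  [1, 0, 0, 0],
                  [1, 0, 0, 0]])

toy_reduced = remove_top_hubs(toy_A, 1)
print(toy_reduced)
```

Removing the single hub of the star leaves three isolated nodes, so the reduced matrix is all zeros.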

In [None]:
reduced_A_feature = # Your code here
reduced_A_citation = # Your code here


plt.figure(1, figsize=(16,8))
a=plt.subplot(121)
plt.title('Feature graph: adjacency matrix sparsity pattern')
plt.spy(A_feature);
b=plt.subplot(122)
plt.title('Feature graph without top 20 hubs: adjacency matrix sparsity pattern')
plt.spy(reduced_A_feature);


plt.figure(2, figsize=(16,8))
a=plt.subplot(121)
plt.title('Citation graph: Adjacency matrix sparsity pattern')
plt.spy(A_citation);
b=plt.subplot(122)
plt.title('Citation graph without top 20 hubs: Adjacency matrix sparsity pattern')
plt.spy(reduced_A_citation);

Plot the new degree distribution histograms.

In [None]:
reduced_degrees_feat = # Your code here.
reduced_degrees_cit = # Your code here.

deg_hist_normalization = np.ones(reduced_degrees_feat.shape[0])/reduced_degrees_feat.shape[0]

plt.figure(3,figsize=(16,4))
plt.subplot(121)
plt.title('Feature graph degree distribution')
plt.hist(reduced_degrees_feat, weights = deg_hist_normalization);
plt.subplot(122)
plt.title('Citation graph degree distribution')
plt.hist(reduced_degrees_cit, weights = deg_hist_normalization);

Compute the first and second moments for the new graphs.

In [None]:
reduced_cit_moment_1 = # Your code here.
reduced_cit_moment_2 = # Your code here.

reduced_feat_moment_1 = # Your code here.
reduced_feat_moment_2 = # Your code here.


print("Citation graph first moment:", reduced_cit_moment_1)
print("Citation graph second moment:", reduced_cit_moment_2)
print("Feature graph first moment: ", reduced_feat_moment_1)
print("Feature graph second moment: ", reduced_feat_moment_2)

Print the number of edges in the reduced graphs.

In [None]:
# Your code here

Is the effect of removing the hubs the same for both networks? Look at the percentage changes for each moment. Which of the moments is affected the most and in which graph? Explain why.  

**Hint:** Examine the degree distributions.

**Your answer here:**

### Question 3: Pruning, sparsity,  paths

By adjusting the threshold applied to the Euclidean distance matrix, prune the feature graph so that its number of edges is close (within a hundred edges) to the number of edges in the citation graph.

In [None]:
threshold = # Your code here.

A_feature_pruned = # Your code here
num_edges_feature_pruned = # Your code here.

print("Number of edges in the feature graph: ", (num_edges_feature))
print("Number of edges in the feature graph after pruning: ", (num_edges_feature_pruned))
print("Number of edges in the citation graph: ", num_edges_citation)

Check your results by comparing the sparsity patterns and total number of edges between the graphs.

In [None]:
plt.figure(figsize=(12,6))
plt.subplot(121)
plt.title('Citation graph sparsity')
plt.spy(A_citation);
plt.subplot(122)
plt.title('Feature graph sparsity')
plt.spy(A_feature_pruned);

Let $C_{k}(i,j)$ denote the number of paths of length $k$ from node $i$ to node $j$. 

We define the path matrix $P$, with entries:
$ P_{ij} = \displaystyle\sum_{k=0}^{N}C_{k}(i,j) $

Calculate the path matrices for both the citation and the unpruned feature graphs for $N = 10$.  
**Hint:** Use [the powers of adjacency matrix](https://en.wikipedia.org/wiki/Adjacency_matrix#Matrix_powers).
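
The sum of adjacency-matrix powers can be accumulated iteratively, as in the sketch below on a made-up three-node path graph. The function name `path_matrix` is hypothetical. Strictly speaking, the entries of $A^k$ count walks of length $k$, in which nodes may repeat. Floats are used because the counts grow quickly and could overflow integer types on larger graphs.

```python
import numpy as np

def path_matrix(A, N):
    """Return sum_{k=0}^{N} A^k for a square adjacency matrix A."""
    A = A.astype(float)
    n_nodes = A.shape[0]
    P = np.eye(n_nodes)    # k = 0 term: the identity matrix
    A_k = np.eye(n_nodes)  # running power A^k
    for _ in range(N):
        A_k = A_k @ A
        P += A_k
    return P

# Made-up path graph 0 - 1 - 2.
toy_A = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])

print(path_matrix(toy_A, 1))
print(path_matrix(toy_A, 2))
```

For this toy graph the entry for the node pair (0, 2) is zero when $N = 1$ and becomes non-zero when $N = 2$.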

In [None]:
path_matrix_citation = # Your code here.
path_matrix_feature = # Your code here.

Check the sparsity pattern of both path matrices.

In [None]:
plt.figure(figsize=(16,9))
plt.subplot(121)
plt.title('Citation Path matrix sparsity')
plt.spy(path_matrix_citation);
plt.subplot(122)
plt.title('Feature Path matrix sparsity')
plt.spy(path_matrix_feature);

Now calculate the path matrix of the pruned feature graph for $N=10$. Plot the corresponding sparsity pattern. Is there any difference?

In [None]:
path_matrix_pruned = # Your code here.

plt.figure(figsize=(12,6))
plt.title('Feature Path matrix sparsity')
plt.spy(path_matrix_pruned);

**Your answer here:**

Describe how you can use the above process of counting paths to determine whether a graph is connected or not. Is the original (unpruned) feature graph connected?

**Your answer here:** 

If the graph is connected, how can you guess its diameter using the path matrix?

**Your answer here:**

If any of your graphs is connected, calculate the diameter using that process.

In [None]:
diameter = # Your code here.
print("The diameter is: ", diameter)

Check if your guess was correct using [NetworkX](https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.distance_measures.diameter.html).

Note: Usage of NetworkX is allowed only in this part of Section 1.

In [None]:
import networkx as nx
# from_numpy_array replaces from_numpy_matrix, which was removed in NetworkX 3.0.
feature_graph = nx.from_numpy_array(A_feature)
print("Diameter according to networkx: ", nx.diameter(feature_graph))

# Section 2 : Network Models

In this section, you will analyze the feature graph and the citation graph you constructed in the previous section in terms of network models. For this purpose, you can use the NetworkX library imported below.

In [None]:
import networkx as nx
import warnings
warnings.simplefilter("ignore")

Let us create NetworkX graph objects from the adjacency matrices computed in the previous section.

In [None]:
# from_numpy_array replaces from_numpy_matrix, which was removed in NetworkX 3.0.
G_citation = nx.from_numpy_array(A_citation)
print('Number of nodes: {}, Number of edges: {}'.format(G_citation.number_of_nodes(), G_citation.number_of_edges()))
print('Number of self-loops: {}, Number of connected components: {}'.format(nx.number_of_selfloops(G_citation), nx.number_connected_components(G_citation)))

In the rest of this assignment, we will consider the pruned feature graph as the feature network.

In [None]:
# from_numpy_array replaces from_numpy_matrix, which was removed in NetworkX 3.0.
G_feature = nx.from_numpy_array(A_feature_pruned)
print('Number of nodes: {}, Number of edges: {}'.format(G_feature.number_of_nodes(), G_feature.number_of_edges()))
print('Number of self-loops: {}, Number of connected components: {}'.format(nx.number_of_selfloops(G_feature), nx.number_connected_components(G_feature)))

### Question 4: Simulation with Erdős–Rényi and Barabási–Albert model

Create an Erdős–Rényi and a Barabási–Albert graph using NetworkX to simulate the citation graph and the feature graph you have. When choosing parameters for the networks, take into account the number of vertices and edges of the original networks.

The number of nodes should exactly match the number of nodes in the original citation/feature graph.
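
As a rough sketch of how the two generators behave, the snippet below builds both models with made-up parameters and compares the resulting edge counts with what the constructions imply. The values `n_toy`, `p_toy` and `q_toy` are hypothetical and are not tuned to the Cora graphs.

```python
import networkx as nx

n_toy, p_toy, q_toy = 200, 0.05, 3  # hypothetical values, not tuned to Cora

# Erdos-Renyi: each of the n(n-1)/2 possible edges appears with probability p.
G_er_toy = nx.erdos_renyi_graph(n_toy, p_toy, seed=0)
expected_er_edges = p_toy * n_toy * (n_toy - 1) / 2

# Barabasi-Albert: every newly added node attaches to q existing nodes.
# With the default initial graph this gives q * (n - q) edges in total.
G_ba_toy = nx.barabasi_albert_graph(n_toy, q_toy, seed=0)

print(G_er_toy.number_of_edges(), expected_er_edges)
print(G_ba_toy.number_of_edges(), q_toy * (n_toy - q_toy))
```

The Erdős–Rényi edge count is random and only matches its expectation on average, while the Barabási–Albert edge count is fixed by its parameters.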

In [None]:
n = len(G_citation.nodes())
n

The number of edges should match the average of the number of edges in the citation graph and in the feature graph.

In [None]:
m = np.round((G_citation.size() + G_feature.size())/2)
m

How do you determine the probability parameter for the Erdős–Rényi graph?

**Your answer here:**

In [None]:
p = # Your code here.
G_er = nx.erdos_renyi_graph(n,p)

Check the number of edges in the Erdős–Rényi graph.

In [None]:
print('My Erdos-Rényi network to simulate citation graph has {} edges.'.format(G_er.size()))

How do you determine the preferential attachment parameter for Barabási–Albert graphs?

**Your answer here:**

In [None]:
q = # Your code here.
G_ba = nx.barabasi_albert_graph(n,q)

Check the number of edges in the Barabási–Albert graph.

In [None]:
print('My Barabási-Albert network to simulate citation graph has {} edges.'.format(G_ba.size()))

### Question 5:  Giant component

Check the size of the largest connected component in the citation and feature graph.
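
One common way to extract the largest connected component with NetworkX is sketched below on a small made-up graph. The names `G_toy` and `giant_toy` are hypothetical.

```python
import networkx as nx

# Made-up graph: a triangle, a separate edge, and an isolated node.
G_toy = nx.Graph()
G_toy.add_edges_from([(0, 1), (1, 2), (2, 0), (3, 4)])
G_toy.add_node(5)

# connected_components yields one set of nodes per component.
largest_nodes = max(nx.connected_components(G_toy), key=len)

# subgraph() returns a view; copy() makes it an independent graph.
giant_toy = G_toy.subgraph(largest_nodes).copy()
print(giant_toy.number_of_nodes(), giant_toy.number_of_edges())
```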

In [None]:
giant_citation = # Your code here.
print('The giant component of the citation graph has {} nodes and {} edges.'.format(giant_citation.number_of_nodes(), giant_citation.size()))

In [None]:
giant_feature = # Your code here.
print('The giant component of the feature graph has {} nodes and {} edges.'.format(giant_feature.number_of_nodes(), giant_feature.size()))

Check the size of the giant component in the generated Erdős–Rényi graph.

In [None]:
giant_er = # Your code here.
print('The giant component of the Erdos-Rényi network has {} nodes and {} edges.'.format(giant_er.number_of_nodes(), giant_er.size()))

Let us match the number of nodes in the giant component of the feature graph by simulating a new Erdős–Rényi network.
How do you choose the probability parameter this time? 

Hint: Recall the expected giant component size from the lectures.

**Your answer here:**

In [None]:
p_new = # Your code here.
G_er_new = nx.erdos_renyi_graph(n,p_new)

Check the size of the new Erdős–Rényi network and its giant component.

In [None]:
print('My new Erdos Renyi network to simulate citation graph has {} edges.'.format(G_er_new.size()))
giant_er_new = # Your code here.
print('The giant component of the new Erdos-Rényi network has {} nodes and {} edges.'.format(giant_er_new.number_of_nodes(), giant_er_new.size()))

### Question 6: Degree Distributions

Recall the degree distribution of the citation and the feature graph.

In [None]:
plt.figure(figsize=(15,6))
plt.subplot(121)
plt.title('Citation graph')
citation_degrees = # Your code here.
plt.hist(citation_degrees);
plt.subplot(122)
plt.title('Feature graph')
feature_degrees = # Your code here.
plt.hist(feature_degrees);

Now, plot the degree distribution histograms for the simulated networks.

In [None]:
plt.figure(figsize=(20,6))
plt.subplot(131)
plt.title('Erdos-Rényi network')
er_degrees = # Your code here.
plt.hist(er_degrees);
plt.subplot(132)
plt.title('Barabási-Albert network')
ba_degrees = # Your code here.
plt.hist(ba_degrees);
plt.subplot(133)
plt.title('new Erdos-Rényi network')
er_new_degrees = # Your code here.
plt.hist(er_new_degrees);

In terms of the degree distribution, is there a good match between the citation and feature graphs and the simulated networks? For the citation graph, choose the simulated network above that best matches its degree distribution. Indicate your preference below:

**Your answer here:** 

You can also simulate a network using the configuration model to match its degree distribution exactly. Refer to [Configuration model](https://networkx.github.io/documentation/stable/reference/generated/networkx.generators.degree_seq.configuration_model.html#networkx.generators.degree_seq.configuration_model).

Let us create another network to match the degree distribution of the feature graph. 

In [None]:
G_config = nx.configuration_model(feature_degrees) 
print('Configuration model has {} nodes and {} edges.'.format(G_config.number_of_nodes(), G_config.size()))

Does this mean that the configuration model produces the same graph as the feature graph? If not, how can you tell that they are not the same?

**Your answer here:** 

### Question 7: Clustering Coefficient

Let us check the average clustering coefficient of the original citation and feature graphs. 

In [None]:
nx.average_clustering(G_citation)

In [None]:
nx.average_clustering(G_feature)

What does the clustering coefficient tell us about a network? Comment on the values you obtain for the citation and feature graph.

**Your answer here:**

Now, let us check the average clustering coefficient for the simulated networks.

In [None]:
nx.average_clustering(G_er)

In [None]:
nx.average_clustering(G_ba)

In [None]:
nx.average_clustering(nx.Graph(G_config))

Comment on the values you obtain for the simulated networks. Is there any good match to the citation or feature graph in terms of clustering coefficient?

**Your answer here:**

Check the other [network model generators](https://networkx.github.io/documentation/networkx-1.10/reference/generators.html) provided by NetworkX. Which one do you predict to have a better match to the citation graph and the feature graph in terms of degree distribution and clustering coefficient at the same time? Justify your answer.

**Your answer here:**

If you find another network model you predict to achieve a good match either to the citation or to the feature graph, create a graph object below for that network model. Print the number of edges and the average clustering coefficient. Plot the histogram of the degree distribution.

In [None]:
# Your code here.

Comment on the similarities of your match.

**Your answer here:**