# NTDS 2019 : Assignment 1


[Eda Bayram](https://lts4.epfl.ch/bayram), [EPFL](http://epfl.ch) [LTS4](http://lts4.epfl.ch),
Nikolaos Karalias, [EPFL](http://epfl.ch) [LTS2](http://lts2.epfl.ch)

## Students
Team: `<your team number>`

## Rules

* The first deadline is for individual submissions, the second one is for the team submission. No collaboration between teams is allowed.
* All team members receive the same grade for the assignment, based on the solution the team submits by the second deadline.
* However, a team may ask for individual grading, which is based on the solutions submitted by the first deadline.
* Textual answers shall be short. Typically one to two sentences.
* Code has to be clean.
* In the first and second sections, the libraries to be used are given and you cannot import any others. You cannot use NetworkX in the first section.
* Submit the notebook already executed, with its results stored. I.e., if you open the notebook again it should show numerical results and plots. We won't be able to execute your notebooks.
* Re-execute the notebook from a blank state before submission to make sure it is reproducible. You can click "Kernel" then "Restart & Run All" in Jupyter.

# Objective 

The purpose of this milestone is to explore a given dataset and represent it as a network by constructing different graphs. In the first section, you will analyze the network properties. In the second section, you will explore various network models and find the model that best fits the graphs you construct from the dataset.

# Dataset : Cora Dataset

The [Cora dataset](https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz) consists of scientific publications classified into one of seven classes. 

* **Citation graph** The citation network is constructed from the connections given in the `cora.cites` file. 
* **Feature graph** Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary, given in the `cora.content` file. The dictionary consists of 1433 unique words. A feature graph is constructed from the Euclidean distances between the feature vectors of the publications.

The `README` file in the dataset provides the details about the content of the files. 
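As a rough illustration of the feature graph idea, the sketch below computes pairwise Euclidean distances between binary feature vectors with NumPy and connects pairs that are closer than a threshold. The function name `feature_adjacency`, the toy vectors, and the thresholding rule itself are assumptions made for this example. The assignment does not prescribe how distances are turned into edges.

```python
import numpy as np

def feature_adjacency(features, threshold):
    """Hypothetical helper: connect two rows if their Euclidean distance is below `threshold`."""
    features = np.asarray(features, dtype=float)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, computed for all pairs at once.
    sq_norms = (features ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    # Clip tiny negative values caused by floating-point rounding.
    dists = np.sqrt(np.maximum(sq_dists, 0.0))
    adjacency = (dists < threshold).astype(int)
    np.fill_diagonal(adjacency, 0)  # no self-loops
    return adjacency

# Toy 0/1 word vectors (not Cora data).
toy_features = np.array([[1, 0, 1, 0],
                         [1, 0, 1, 1],
                         [0, 1, 0, 0]])
A_toy = feature_adjacency(toy_features, threshold=1.5)
print(A_toy)
```

Only the first two toy vectors differ in a single word, so they are the only pair linked at this threshold. On real data the threshold would presumably be tuned, for instance to reach a target number of edges.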

# Section 1 : Network Properties

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

### Question 1 : Construct a Citation Graph and a Feature Graph

# Section 2 : Network Models

In this section, you will analyze the feature graph and the citation graph you constructed in the previous section in terms of network model types. For this purpose, you can use the NetworkX library imported below.

In [None]:
import networkx as nx
import warnings
warnings.simplefilter("ignore")

Let us create NetworkX graph objects from the adjacency matrices computed in the previous section.

In [None]:
# from_numpy_array replaces from_numpy_matrix, which was removed in NetworkX 3.0.
G_citation = nx.from_numpy_array(A_citation)
print('Number of nodes: {}, Number of edges: {}'.format(G_citation.number_of_nodes(), G_citation.number_of_edges()))
print('Number of self-loops: {}, Number of connected components: {}'.format(nx.number_of_selfloops(G_citation), nx.number_connected_components(G_citation)))

In [None]:
# from_numpy_array replaces from_numpy_matrix, which was removed in NetworkX 3.0.
G_feature = nx.from_numpy_array(A_feature)
print('Number of nodes: {}, Number of edges: {}'.format(G_feature.number_of_nodes(), G_feature.number_of_edges()))
print('Number of self-loops: {}, Number of connected components: {}'.format(nx.number_of_selfloops(G_feature), nx.number_connected_components(G_feature)))

### Question : Simulation with Erdős–Rényi and Barabási–Albert models

Create an Erdős–Rényi graph and a Barabási–Albert graph using NetworkX to simulate the citation graph and the feature graph you have. When choosing parameters for the networks, take into account the number of vertices and edges of the original networks.

The number of nodes should exactly match the number of nodes in the original citation/feature graph.

In [None]:
n = len(G_citation.nodes())
n

The number of edges should match the average of the number of edges in the citation and feature graphs.

In [None]:
m = (G_citation.size() + G_feature.size())/2
m

How do you determine the probability parameter for the Erdős–Rényi graph?

**Your answer here:**

In [None]:
p = # Your code here.
G_er = nx.erdos_renyi_graph(n,p)

Check the number of edges in the Erdős–Rényi graph.

In [None]:
print('My Erdos-Rényi network to simulate citation graph has {} edges.'.format(G_er.size()))
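One common way to tie the probability parameter to a target edge count is through the expected number of edges of a G(n, p) graph, which is p·n(n−1)/2. The sketch below applies that relation to toy sizes. `n_toy`, `m_toy`, and the seed are assumptions for illustration and are not the Cora values.

```python
import networkx as nx

# Toy sizes (assumptions for illustration, not the Cora values).
n_toy = 200
m_toy = 600

# G(n, p) has n*(n-1)/2 possible edges, each present with probability p,
# so the expected edge count is p * n * (n - 1) / 2. Solve for p.
p_toy = 2 * m_toy / (n_toy * (n_toy - 1))

G_er_toy = nx.erdos_renyi_graph(n_toy, p_toy, seed=0)
print(p_toy, G_er_toy.number_of_edges())
```

The realized edge count fluctuates around the target because every edge is drawn independently.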

How do you determine the preferential attachment parameter for Barabási–Albert graphs?

**Your answer here:**

In [None]:
q = # Your code here.
G_ba = nx.barabasi_albert_graph(n,q)

Check the number of edges in the Barabási–Albert graph.

In [None]:
print('My Barabási-Albert network to simulate citation graph has {} edges.'.format(G_ba.size()))
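For the Barabási–Albert model, each newly added node attaches with a fixed number of edges, so the total edge count is roughly that number times the number of nodes. A minimal sketch on toy sizes follows. `n_toy`, `m_toy`, and the rounding rule are assumptions for illustration.

```python
import networkx as nx

# Toy sizes (assumptions for illustration, not the Cora values).
n_toy = 200
m_toy = 600

# Almost every node arrives with q new edges, giving about q * n edges in total.
# q must be an integer >= 1, so the target can only be matched approximately.
q_toy = max(1, round(m_toy / n_toy))

G_ba_toy = nx.barabasi_albert_graph(n_toy, q_toy, seed=0)
print(q_toy, G_ba_toy.number_of_edges())
```

Because the attachment parameter is an integer, the achievable edge counts are coarse. The result lands near the target rather than exactly on it.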

### Question :  Giant component

Check the size of the largest connected component in the citation and feature graphs.

In [None]:
giant_citation = # Your code here.
print('The giant component of the citation graph has {} nodes and {} edges.'.format(giant_citation.number_of_nodes(), giant_citation.size()))

In [None]:
giant_feature = # Your code here.
print('The giant component of the feature graph has {} nodes and {} edges.'.format(giant_feature.number_of_nodes(), giant_feature.size()))
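One possible way to extract the largest connected component with NetworkX is to pick the largest node set returned by `nx.connected_components` and take the induced subgraph. The toy graph below is an assumption for illustration.

```python
import networkx as nx

# Toy graph with two components (not Cora data).
G_toy = nx.Graph()
G_toy.add_edges_from([(0, 1), (1, 2), (2, 3)])  # component with 4 nodes
G_toy.add_edge(4, 5)                             # component with 2 nodes

# connected_components yields one set of nodes per component.
largest_nodes = max(nx.connected_components(G_toy), key=len)
# .copy() turns the read-only subgraph view into an independent graph.
giant_toy = G_toy.subgraph(largest_nodes).copy()
print(giant_toy.number_of_nodes(), giant_toy.size())
```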

Check the size of the giant component in the generated Erdős–Rényi graph.

In [None]:
giant_er = # Your code here.
print('The giant component of the Erdos-Rényi network has {} nodes and {} edges.'.format(giant_er.number_of_nodes(), giant_er.size()))

Let us match the number of nodes in the giant component of the feature graph by simulating a new Erdős–Rényi network.
How do you choose the probability parameter this time? 

Hint: Recall the expected giant component size from the lectures.
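Assuming the hint refers to the standard mean-field relation S = 1 − exp(−⟨k⟩S), where S is the fraction of nodes in the giant component and ⟨k⟩ = p(n−1) is the mean degree, the relation can be inverted for p given a target S. The sketch below does this on toy numbers. The function name and the toy values are assumptions for illustration.

```python
import math

def er_probability_for_giant_fraction(n, fraction):
    """Hypothetical helper: p such that S = 1 - exp(-<k> S) holds for S = fraction."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Solve S = 1 - exp(-<k> S) for the mean degree <k>.
    mean_degree = -math.log(1.0 - fraction) / fraction
    # In G(n, p) the mean degree is p * (n - 1).
    return mean_degree / (n - 1)

# Toy values (assumptions): half of 1000 nodes in the giant component.
n_toy = 1000
target_fraction = 0.5
p_toy = er_probability_for_giant_fraction(n_toy, target_fraction)

# Self-consistency check: plug the result back into the relation.
mean_degree_toy = p_toy * (n_toy - 1)
residual = target_fraction - (1.0 - math.exp(-mean_degree_toy * target_fraction))
print(p_toy, mean_degree_toy, residual)
```

The relation describes the expected size in the large-n limit, so a simulated graph will only match the target approximately.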

In [None]:
p_new = # Your code here.
G_er_new = nx.erdos_renyi_graph(n,p_new)

Check the size of the new Erdős–Rényi network and its giant component.

In [None]:
print('My new Erdos Renyi network to simulate citation graph has {} edges.'.format(G_er_new.size()))
giant_er_new = # Your code here.
print('The giant component of the new Erdos-Rényi network has {} nodes and {} edges.'.format(giant_er_new.number_of_nodes(), giant_er_new.size()))

### Question : Degree Distributions

You already plotted the degree distributions of the citation and feature graphs in the first section. Now, plot the degree distribution histograms for the simulated networks.

In [None]:
plt.figure(figsize=(20,6))
plt.subplot(131)
plt.title('Erdos-Rényi network')
er_degrees = # Your code here.
plt.hist(er_degrees);
plt.subplot(132)
plt.title('Barabási-Albert network')
ba_degrees = # Your code here.
plt.hist(ba_degrees);
plt.subplot(133)
plt.title('new Erdos-Rényi network')
er_new_degrees = # Your code here.
plt.hist(er_new_degrees);
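A degree sequence can be read from a NetworkX graph through its degree view, which yields `(node, degree)` pairs. The sketch below shows this on a toy star graph and tabulates the counts with NumPy instead of plotting. The toy graph is an assumption for illustration.

```python
import networkx as nx
import numpy as np

# Toy graph (not Cora data): node 0 is connected to nodes 1, 2 and 3.
G_toy = nx.star_graph(3)

# G.degree() iterates over (node, degree) pairs; keep only the degrees.
degrees_toy = [degree for _, degree in G_toy.degree()]

# counts[k] is the number of nodes with degree k.
counts = np.bincount(degrees_toy)
print(degrees_toy, counts)
```

The same list can be passed directly to `plt.hist` to draw the histogram.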

In terms of the degree distribution, is there a good match between the original citation and feature graphs and the simulated networks? For the citation graph, choose the simulated network above that best matches its degree distribution. Indicate your preference below:

**Your answer here:**

You can also simulate a network using the configuration model to match its degree distribution exactly. Refer to [Configuration model](https://networkx.github.io/documentation/stable/reference/generated/networkx.generators.degree_seq.configuration_model.html#networkx.generators.degree_seq.configuration_model).

Let us create another network to match the degree distribution of the feature graph. 

In [None]:
feature_degrees = # Your code here.
G_config = nx.configuration_model(feature_degrees) 
print('Configuration model has {} nodes and {} edges.'.format(G_config.number_of_nodes(), G_config.size()))
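To see what the configuration model preserves and what it does not, the sketch below builds one from the degree sequence of a toy graph and compares the two. The stand-in graph, the seeds, and the variable names are assumptions for illustration.

```python
import networkx as nx

# Stand-in for a real graph (assumption): a small Barabási–Albert graph.
G_toy = nx.barabasi_albert_graph(50, 2, seed=1)
degrees_toy = [degree for _, degree in G_toy.degree()]

# configuration_model returns a MultiGraph that may contain
# self-loops and parallel edges.
G_config_toy = nx.configuration_model(degrees_toy, seed=1)

# The degree sequence is reproduced node by node ...
same_degrees = [d for _, d in G_config_toy.degree()] == degrees_toy
# ... but the edges are rewired at random.
n_selfloops = nx.number_of_selfloops(G_config_toy)
original_edges = {frozenset(e) for e in G_toy.edges()}
config_edges = {frozenset(e) for e in G_config_toy.edges()}
n_shared = len(original_edges & config_edges)
print(same_degrees, G_config_toy.is_multigraph(), n_selfloops, n_shared, len(original_edges))
```

Comparing edge sets, or counting self-loops and parallel edges, is one way to check how far the generated graph is from the original.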

Does this mean that the configuration model reproduces the feature graph itself? If not, how can you tell that the two graphs are not the same?

**Your answer here:**

### Question :  Clustering Coefficient

Let us check the average clustering coefficient of the original citation and feature graphs. 

In [None]:
nx.average_clustering(G_citation)

In [None]:
nx.average_clustering(G_feature)
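As a reminder of what `nx.average_clustering` measures, the sketch below computes the local clustering coefficient by hand: the fraction of pairs of a node's neighbors that are themselves connected. It is then averaged over all nodes of an unweighted, undirected toy graph. The helper name and the toy graph are assumptions for illustration.

```python
import itertools
import networkx as nx

def local_clustering(G, node):
    """Hypothetical helper: fraction of neighbor pairs of `node` that are linked."""
    neighbors = list(G.neighbors(node))
    if len(neighbors) < 2:
        return 0.0  # same convention as NetworkX for degree < 2
    pairs = list(itertools.combinations(neighbors, 2))
    linked = sum(1 for u, v in pairs if G.has_edge(u, v))
    return linked / len(pairs)

# Toy graph (not Cora data): a triangle 0-1-2 with a pendant node 3 attached to 2.
G_toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
manual_average = sum(local_clustering(G_toy, v) for v in G_toy) / G_toy.number_of_nodes()
print(manual_average, nx.average_clustering(G_toy))
```

Nodes 0 and 1 have coefficient 1, node 2 has 1/3, and node 3 has 0, so the average is 7/12.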

What does the clustering coefficient tell us about a network? Comment on the values you obtain for the citation and feature graphs.

**Your answer here:**

Now, let us check the average clustering coefficient for the simulated networks.

In [None]:
nx.average_clustering(G_er)

In [None]:
nx.average_clustering(G_ba)

In [None]:
nx.average_clustering(nx.Graph(G_config))

Comment on the values you obtain for the simulated networks. Is there any good match to the citation or feature graph in terms of clustering coefficient?

**Your answer here:**

Check the other [network model generators](https://networkx.github.io/documentation/networkx-1.10/reference/generators.html) provided by NetworkX. Which one do you expect to better match the citation graph and the feature graph in terms of degree distribution and clustering coefficient at the same time? Justify your answer.

**Your answer here:**

If you find other network models, create at most one graph object for the citation graph and one for the feature graph below. Print the number of edges and the average clustering coefficient. Plot the histogram of the degree distributions. Comment on the similarities. 

In [None]:
# Your code here.

**Your answer here:**