# Assignment 1

### This notebook consists of 2 parts: 
    - Power Law, 
    - Random graphs and generative network models. 

### You can receive a maximum of 20 points for both parts combined.

## Complete [this form](https://forms.gle/J9uArMMmG9L6tKGu9) with your name, email and .ipynb file by 15.02 23:59 MSK

# Power law

## 1. Guess graph by degree distribution (0 points)

A graph is described by its degree histogram [0, 2, 10]: 0 nodes with degree 0, 2 nodes with degree 1, and 10 nodes with degree 2. The goal is to implement a function that guesses a graph structure consistent with this histogram.
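
Before guessing a structure, it can help to check that a histogram can be realized by a simple graph at all. The sketch below is only an illustration, not the expected solution: the helper names `histogram_to_degrees` and `is_graphical` are hypothetical, and the check uses the Havel-Hakimi procedure, which is one of several possible approaches.

```python
from typing import List

def histogram_to_degrees(dens: List[int]) -> List[int]:
    """Expand a degree histogram into a degree sequence.

    dens[k] is the number of nodes with degree k.
    """
    degrees = []
    for degree, count in enumerate(dens):
        degrees.extend([degree] * count)
    return degrees

def is_graphical(degrees: List[int]) -> bool:
    """Havel-Hakimi test: can a simple graph realize this degree sequence?"""
    remaining = sorted(degrees, reverse=True)
    while remaining and remaining[0] > 0:
        d = remaining.pop(0)          # take the largest remaining degree
        if d > len(remaining):
            return False              # not enough other nodes to connect to
        for i in range(d):            # connect it to the d next-largest nodes
            remaining[i] -= 1
            if remaining[i] < 0:
                return False
        remaining.sort(reverse=True)
    return True

degrees_a = histogram_to_degrees([0, 2, 10])
print(len(degrees_a), sum(degrees_a), is_graphical(degrees_a))
```

The same procedure, with the chosen connections recorded as edges, is one way to construct an actual graph from the sequence.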

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Tuple

# Degree histogram: dens_a[k] is the number of nodes with degree k
dens_a = [0, 2, 10]
plt.bar(range(len(dens_a)), dens_a)

In [None]:
def generate_graph(nodes : np.ndarray, dens : List[int]) -> List[np.ndarray]:
    # Your code here
    raise NotImplementedError()

In [None]:
# More examples
dens_b = [0, 0, 10]
dens_c = [0, 0, 0, 0, 5]
dens_d = [0, 5, 0, 0, 0, 1]

## 2. Giant connected component (1 point)

Two parameters are expected in the function input: a vertices array and an edges array. The goal is to implement a function that finds the giant (largest) connected component.
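
As a rough illustration of the idea, the sketch below finds the largest connected component with a breadth-first search. It uses plain Python lists and tuples rather than the `np.ndarray` inputs of the task, and the name `largest_component_sketch` is hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

def largest_component_sketch(nodes: List[int], edges: List[Tuple[int, int]]) -> Set[int]:
    """Return the node set of the largest connected component (BFS sketch)."""
    neighbors: Dict[int, Set[int]] = {v: set() for v in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    seen: Set[int] = set()
    best: Set[int] = set()
    for start in nodes:
        if start in seen:
            continue
        # Explore everything reachable from this unvisited node
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in neighbors[u]:
                if w not in component:
                    component.add(w)
                    queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

toy_nodes = [0, 1, 2, 3, 4, 5]
toy_edges = [(0, 1), (1, 2), (3, 4)]
giant = largest_component_sketch(toy_nodes, toy_edges)
# Keep only the edges whose endpoints both lie in the component
giant_edges = [(u, v) for u, v in toy_edges if u in giant and v in giant]
```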

In [None]:
def select_gygantic_component(nodes : np.ndarray, edges : List[np.ndarray]) -> Tuple[np.ndarray, List[np.ndarray]]:
    # Your code here
    raise NotImplementedError()

## 3. Power law CDF (1 point)

Let us generate observations from a power law random variable. The first step is to derive the CDF of the power law: $F(x) = P(X \le x)$

$$F(x) = 1 - \int_{x}^\infty p(t) dt.$$

The goal is to implement a function with input parameters x, $\alpha$ and $x_{min}$ that calculates the power law CDF. You should take the integral and derive the CDF analytically. 
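
As a worked sketch of that integral, assume $\alpha > 1$, $x \ge x_{\min}$, and the normalization constant $C = (\alpha - 1)\, x_{\min}^{\alpha - 1}$ used by `power_law_pdf`. Then

$$F(x) = 1 - \int_{x}^\infty C t^{-\alpha} dt = 1 - \frac{C}{\alpha - 1} x^{1 - \alpha} = 1 - \left(\frac{x}{x_{\min}}\right)^{1 - \alpha}.$$

A quick sanity check: $F(x_{\min}) = 0$ and $F(x) \to 1$ as $x \to \infty$.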

In [None]:
def power_law_pdf(x : float, alpha : float = 3.5, x_min : int = 1) -> float:
    C = (alpha - 1) / x_min ** (1 - alpha)
    return C * x ** (-alpha)

In [None]:
def power_law_cdf(x : float, alpha : float = 3.5, x_min : int = 1) -> float:
    # Your code here
    raise NotImplementedError()

## 4. Power law PPF (1 point)

Let $X \sim \text{Power law}$. Next, define a random variable $R$ such that $R = F(X)$; then $R$ is uniformly distributed on the interval [0, 1] ([proof](https://en.wikipedia.org/wiki/Probability_integral_transform#Proof)). The good thing here is that we can easily generate uniformly distributed pseudorandom numbers and then transform them into power law observations. Let us find an expression for $x = F^{-1}(r)$, where $r$ is an observation from the uniform distribution on the interval [0, 1]. 

Find an analytical form of $F^{-1}(r)$ and implement a function `power_law_ppf` (percent point function, also known as a quantile) with parameters `r`, `alpha` and `x_min`. Here `r` is a list of observations.
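
The sketch below illustrates inverse transform sampling under one assumption: that the CDF has the closed form $F(x) = 1 - (x / x_{\min})^{1 - \alpha}$ obtained by integrating the PDF. The names `cdf_sketch` and `ppf_sketch` are hypothetical and are not the graded functions.

```python
import numpy as np

def cdf_sketch(x, alpha=3.5, x_min=1.0):
    # Assumed closed-form power law CDF for x >= x_min
    return 1.0 - (np.asarray(x, dtype=float) / x_min) ** (1.0 - alpha)

def ppf_sketch(r, alpha=3.5, x_min=1.0):
    # Inverse of cdf_sketch: solve r = F(x) for x
    return x_min * (1.0 - np.asarray(r, dtype=float)) ** (1.0 / (1.0 - alpha))

rng = np.random.default_rng(0)
r = rng.uniform(0.0, 0.999, size=5)   # uniform observations
x = ppf_sketch(r)                     # transformed into power law observations
round_trip = cdf_sketch(x)            # applying F again should recover r
```

Checking that `cdf_sketch(ppf_sketch(r))` returns `r` is a cheap way to validate a derived inverse.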

In [None]:
def power_law_ppf(r : List[float], alpha : float = 3.5, x_min : int = 1) -> List[float]:
    # Your code here
    raise NotImplementedError()

In [None]:
def power_law_generate(n : int, alpha : float = 3.5, x_min : int = 1, random_seed : int = 1) -> List[float]:
    np.random.seed(random_seed)
    uni_sample = np.random.uniform(0, 0.999, n)
    return power_law_ppf(uni_sample, alpha, x_min)

## 5. Estimation of alpha with linear binning (2 points)

Given observations from the power law distribution, try to estimate $\alpha$. The easiest way is to draw an empirical PDF with linear binning on a log-log scale and apply linear regression. By linear binning we mean that the bin width is kept fixed.

The goal is to implement a function alpha_lin_bins that takes a train set, number of linear bins and returns an estimated $\alpha$.

You can use the following hints:

* Take the log of both sides of $p(x) = C x^{-\alpha}$
* To calculate an empirical PDF, use `np.histogram(x_train, bins=bins, density=True)`
* To calculate a pseudoinverse matrix, use `np.linalg.pinv`
* You can also fit `sklearn.linear_model.LinearRegression`
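
The hints above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the reference solution: it uses `np.polyfit` instead of the pseudoinverse or scikit-learn, takes bin centers as the x coordinates, and drops empty bins, all of which are choices made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
true_alpha, x_min = 3.5, 1.0
# Synthetic power law sample via inverse transform sampling
uniform = rng.uniform(0.0, 0.999, size=10_000)
sample = x_min * (1.0 - uniform) ** (1.0 / (1.0 - true_alpha))

# Empirical PDF with 100 equal-width bins
density, edges = np.histogram(sample, bins=100, density=True)
centers = (edges[:-1] + edges[1:]) / 2
mask = density > 0   # empty bins would give log(0), so drop them

# log p(x) = log C - alpha * log x, so the fitted slope is -alpha
slope, intercept = np.polyfit(np.log(centers[mask]), np.log(density[mask]), deg=1)
alpha_hat = -slope
```

With linear bins the sparse tail bins are noisy, so the estimate is typically only a rough approximation of the true exponent.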

In [None]:
def alpha_lin_bins(x_train : List[float], bins : int) -> float:
    # Your code here
    raise NotImplementedError()

## 6. Estimation of alpha with logarithmic binning (2 points)

The goal is to implement a function alpha_log_bins that takes a train set, number of log bins and returns an estimated $\alpha$.
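
For logarithmic binning, the bin edges grow geometrically, so each bin is a constant factor wider than the previous one. The short sketch below shows how such edges could be built with `np.logspace`; the range and sample values are made up for illustration.

```python
import numpy as np

x_min, x_max, bins = 1.0, 100.0, 4
# bins + 1 edges, equally spaced in log10(x)
log_edges = np.logspace(np.log10(x_min), np.log10(x_max), bins + 1)
widths = np.diff(log_edges)
ratios = log_edges[1:] / log_edges[:-1]   # constant ratio between neighboring edges

sample = np.array([1.5, 2.0, 4.0, 20.0, 50.0])
# density=True divides each count by the bin width, which matters for unequal bins
density, _ = np.histogram(sample, bins=log_edges, density=True)
```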

In [None]:
def alpha_log_bins(x_train : List[float], bins : int) -> float:
    # Your code here
    raise NotImplementedError()

## 7. Estimation of power law parameters by MLE (3 points)

### 7.1. Graph test data

Data for verification.
You need to take the graphs from the social network at the `url_1` link to verify your programs. Each graph is defined in an edges file, where each line contains a pair of vertices connected by an edge. It is assumed that there are no vertices with degree 0 in these graphs.

The `url_2` link contains a description of all the data in the archive at `url_1`.

Use this dataset to test the function in 7.2.

In [None]:
url_1 = 'https://snap.stanford.edu/data/twitter.tar.gz'
url_2 = 'https://snap.stanford.edu/data/readme-Ego.txt'

In [None]:
import urllib.request
import tarfile

# Download the dataset description and the archive, then unpack the archive
urllib.request.urlretrieve(url_2, "readme.txt")
urllib.request.urlretrieve(url_1, "test.tar.gz")
with tarfile.open("test.tar.gz") as archive:
    archive.extractall('./test_folder')

In [None]:
def totuple(a):
    try:
        return tuple(totuple(i) for i in a)
    except TypeError:
        return a

In [None]:
# Example: how to read edges
edges = []
with open('./test_folder/twitter/12831.edges', "r") as edges_file:
    for line in edges_file:
        values = line.split()
        if not values:
            continue  # skip empty lines
        edges.append(totuple(np.array([values[0], values[1]], dtype = int)))

### 7.2. Estimation of power law parameters by MLE

The MLE consists of:
1. Fix $x_\min$ as the minimal node degree (drop node degrees that are less than $x_\min$)
2. Calculate $\alpha$ via maximum likelihood estimation using the fixed $x_\min$
$$\alpha = 1 + n \left[\sum_i \log \frac{x_i}{x_\min} \right]^{-1}$$
3. Calculate the Kolmogorov-Smirnov test statistic
4. Fix $x_\min$ as the next node degree
5. Repeat steps 2-4, scanning all possible $x_\min$, and find the best $\alpha$ and $x_\min$ with respect to the Kolmogorov-Smirnov test

The goal is to implement a function `mle_power_law_params` that takes a node degree sequence `degree_sequence` and returns a tuple of two values: the best $\alpha$ and $x_\min$.

_Hint: use `scipy.stats.kstest` where the theoretical CDF is the `power_law_cdf` function and `args=(alpha, x_min)`_

Use the data provided in 7.1 to test your function.
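
Step 2 of the procedure can be sketched on its own as follows. This is only an illustration of the estimator formula for a single fixed $x_\min$, checked on synthetic continuous data; the name `alpha_mle_sketch` is hypothetical, and the scan over $x_\min$ with the Kolmogorov-Smirnov test is deliberately left out.

```python
import numpy as np

def alpha_mle_sketch(x, x_min):
    """alpha = 1 + n / sum(log(x_i / x_min)) over observations x_i >= x_min."""
    tail = np.asarray(x, dtype=float)
    tail = tail[tail >= x_min]   # drop observations below x_min
    return 1.0 + len(tail) / np.sum(np.log(tail / x_min))

rng = np.random.default_rng(0)
true_alpha, x_min = 3.5, 1.0
# Synthetic power law sample via inverse transform sampling
sample = x_min * (1.0 - rng.uniform(size=10_000)) ** (1.0 / (1.0 - true_alpha))
alpha_hat = alpha_mle_sketch(sample, x_min)
```

On a sample of this size the estimate should land close to the true exponent, which makes synthetic data a convenient check before moving to the graph degree sequences.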

In [None]:
def mle_power_law_params(degree_sequence : List[int]) -> Tuple[float, int]:
    # Your code here
    raise NotImplementedError()

# Random graphs and generative network models

## 1. Erdos-Renyi model (0.5 point)

Two parameters are expected in the function input: a vertices array and the parameter p. The goal is to implement the Erdos-Renyi model (random graph): each pair of the $n$ nodes is connected independently with a fixed probability $p$.
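
A minimal sketch of the model, assuming an undirected graph without self-loops: visit every unordered pair of nodes once and keep it as an edge with probability $p$. The name `er_edges_sketch` and the `seed` argument are hypothetical, and the edges are returned as plain tuples rather than arrays.

```python
from itertools import combinations
from typing import List, Tuple
import numpy as np

def er_edges_sketch(nodes: np.ndarray, p: float, seed: int = 0) -> List[Tuple[int, int]]:
    rng = np.random.default_rng(seed)
    edges = []
    # Each unordered pair (u, v) is considered exactly once
    for u, v in combinations(nodes.tolist(), 2):
        if rng.random() < p:
            edges.append((u, v))
    return edges

toy_nodes = np.arange(5)
no_edges = er_edges_sketch(toy_nodes, 0.0)
all_edges = er_edges_sketch(toy_nodes, 1.0)
```

The two extreme values of $p$ give the empty graph and the complete graph, which is a simple check for any implementation.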

In [None]:
import numpy as np
from typing import List, Tuple

def erdos_renyi_graph(nodes : np.ndarray, p : float) -> List[np.ndarray]:
    # Your code here
    raise NotImplementedError()

### 1.1. Graph visualization and test data

Data for verification.
You need to take the graphs from the social network at the `url_1` link to verify your programs. Each graph is defined in an edges file, where each line contains a pair of vertices connected by an edge. It is assumed that there are no vertices with degree 0 in these graphs.

The `url_2` link contains a description of all the data in the archive at `url_1`.

Use this code to visualize test graphs from the URL provided below. This dataset is required to test the functions you will implement in the next tasks of this assignment.

In [None]:
url_1 = 'https://snap.stanford.edu/data/facebook.tar.gz'
url_2 = 'https://snap.stanford.edu/data/readme-Ego.txt'

In [None]:
import urllib.request
import tarfile

# Download the dataset description and the archive, then unpack the archive
urllib.request.urlretrieve(url_2, "readme.txt")
urllib.request.urlretrieve(url_1, "test.tar.gz")
with tarfile.open("test.tar.gz") as archive:
    archive.extractall('./test_folder')

In [None]:
def totuple(a):
    try:
        return tuple(totuple(i) for i in a)
    except TypeError:
        return a

In [None]:
# Example: how to read edges
edges = []
with open('./test_folder/facebook/107.edges', "r") as edges_file:
    for line in edges_file:
        values = line.split()
        if not values:
            continue  # skip empty lines
        edges.append(totuple(np.array([values[0], values[1]], dtype = int)))

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

G1 = nx.Graph()
G1.add_edges_from(edges)

In [None]:
plt.figure(figsize=(12,12))
nx.draw(G1, node_size=3)
plt.show()

## 2. Vertex degree distribution (0.5 point)

Two parameters are expected in the function input: a vertices array and an edges array. The goal is to implement a function that estimates the parameters n and p of the binomial distribution of the vertex degrees of the graph.

Test this function on the graphs provided via the URL in task 1.1.
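
One possible reading of this task, stated here as an assumption: in an Erdos-Renyi graph each node has $n - 1$ potential neighbors, so its degree follows a binomial distribution with $n - 1$ trials, and $p$ can be estimated as the mean degree divided by $n - 1$. The sketch below follows that reading; the name `binomial_params_sketch` is hypothetical.

```python
def binomial_params_sketch(nodes, edges):
    n_nodes = len(nodes)
    trials = n_nodes - 1                      # each node has n - 1 potential neighbors
    mean_degree = 2 * len(edges) / n_nodes    # every edge adds 1 to two degrees
    return trials, mean_degree / trials

# A 4-cycle: every node has degree 2 out of 3 possible neighbors
trials, p_hat = binomial_params_sketch([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3), (3, 0)])
```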

In [None]:
def estimate_binomial(nodes : np.ndarray, edges : List[np.ndarray]) -> Tuple[int, float]:
    # Your code here
    raise NotImplementedError()

## 3. Poisson distribution to evaluate a vertex degree distribution (0.5 point)

Two parameters are expected in the function input: a vertices array and an edges array. The goal is to implement a function that estimates the parameter $\lambda$ of the Poisson distribution of the vertex degrees of the graph.

Test this function on the graphs provided via the URL in task 1.1.
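
As a reminder, the maximum likelihood estimate of the Poisson parameter is the sample mean. For a graph with $n$ vertices, vertex degrees $k_i$ and edge set $E$ (notation introduced here for illustration), this gives

$$\hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} k_i = \frac{2 |E|}{n},$$

since every edge contributes to the degrees of exactly two vertices.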

In [None]:
def estimate_poisson(nodes : np.ndarray, edges : List[np.ndarray]) -> float:
    # Your code here
    raise NotImplementedError()

## 4. Graph's component average size (1 point)

Two parameters are expected in the function input: the number of graph vertices and an array of probabilities p for the Erdos-Renyi model. The goal is to implement a function that generates a random graph for each probability from the probabilities array. It is required to evaluate the average size of the small graph components for each generated graph. The function returns an array of average sizes.

In [None]:
def largest_connection_component(nodes : int, probabilities : np.ndarray) -> np.ndarray:
    # Your code here
    raise NotImplementedError()

## 5. Graph's shortest path average length (1 point)

Two parameters are expected in the function input: the first array contains the average degrees of the graphs, the second one contains the numbers of graph vertices. The goal is to generate random graphs with the corresponding average degree and number of vertices. The function is expected to return an array of the average shortest path lengths of the graphs.

In [None]:
def average_shortest_path_length(average_degree : np.ndarray, nodes_number : np.ndarray) -> np.ndarray:
    # Your code here
    raise NotImplementedError()

## 6. Clustering coefficient (0.5 point)

Two parameters are expected in the function input: vertices array and p parameter for Erdos-Renyi model. The goal is to generate a random graph and calculate all vertices' degrees and clustering coefficients.
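
For reference, the local clustering coefficient of node $i$ with degree $k_i$ is the number of links among its neighbors divided by $k_i (k_i - 1) / 2$. The sketch below computes it for a small hand-made graph stored as a dictionary of neighbor sets. That representation, the name `local_clustering_sketch`, and the convention of returning 0 for nodes with fewer than two neighbors are assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, Set

def local_clustering_sketch(neighbors: Dict[int, Set[int]]) -> Dict[int, float]:
    """C_i = (links among neighbors of i) / (k_i * (k_i - 1) / 2); 0 when k_i < 2."""
    result = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            result[node] = 0.0
            continue
        # Count pairs of neighbors that are themselves connected
        links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
        result[node] = 2.0 * links / (k * (k - 1))
    return result

# Triangle 0-1-2 with a pendant node 3 attached to node 0
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
coefficients = local_clustering_sketch(toy)
```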

In [None]:
def clustering_coefficient(nodes : np.ndarray, p : float) -> Tuple[np.ndarray, np.ndarray]:
    # Your code here
    raise NotImplementedError()

## 7. Watts-Strogatz model (2 points)

The goal is to implement the Watts-Strogatz model (small-world model): rewire an edge with probability $p$ in a ring lattice with $n$ nodes and node degree $k$.

This function should be split into smaller functions described below.

In [None]:
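from tqdm import tqdm

def watts_strogatz_graph(nodes : np.ndarray, k : int, p : float) -> List[np.ndarray]:
    # Start from a regular ring lattice, then rewire each node's right-hand edges
    AdjList = ring_lattice(nodes, k)
    for node in tqdm(nodes):
        rewire(AdjList, node, k, p)
    return AdjList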

A `ring_lattice` function generates a regular ring lattice with $n$ nodes $(0, 1, 2, ..., n-1)$ and node degree $k$. If the node degree is odd, it is rounded down to the nearest smaller even number.
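
A rough sketch of the construction, under the assumption that the adjacency list stores, at index `v`, an array with the neighbors of node `v`: each node is linked to its `k // 2` nearest neighbors on each side, with indices wrapping around the ring. The name `ring_lattice_sketch` is hypothetical, and it takes the number of nodes directly instead of a nodes array.

```python
import numpy as np
from typing import List

def ring_lattice_sketch(n: int, k: int) -> List[np.ndarray]:
    half = k // 2   # an odd k is rounded down to k - 1
    adjacency = []
    for node in range(n):
        right = [(node + step) % n for step in range(1, half + 1)]
        left = [(node - step) % n for step in range(1, half + 1)]
        adjacency.append(np.array(sorted(set(left + right))))
    return adjacency

lattice = ring_lattice_sketch(6, 4)
```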

In [None]:
def ring_lattice(nodes : np.ndarray, k : int) -> List[np.ndarray]:
    # Your code here
    raise NotImplementedError()

A function `rewire` takes as input a ring lattice defined by the adjacency list `AdjList`. The other input parameters are a `node`, the model parameter `k` and the probability `p`. For every right-hand-side neighbor $i$, the function rewires the edge (`node`, $i$) into a random edge (`node`, $j$) with probability `p`, where $j \neq i$ and $j \neq$ `node`.

*Hints:*
* *Why do we only rewire right-hand-side edges? We want to guarantee that only edges untouched in previous iterations will be rewired. Look at the picture: we could not have moved the red edges in previous iterations.*

![](https://raw.githubusercontent.com/netspractice/network-science/main/images/watts_strogatz_how_to_rewire.png)

* *To speed up the generation, do not filter nodes before random selection. If a selected node produces an existing edge or a loop, just skip it.*

The function should return the rewired adjacency list.

In [None]:
def rewire(AdjList : List[np.ndarray], node : int, k : int, p : float) -> List[np.ndarray]:
    # Your code here
    raise NotImplementedError()

## 8. Barabasi-Albert model (2 points)

The goal is to implement the Barabasi-Albert model (preferential attachment model): a growth process where each new node connects to `m` existing nodes. The higher the node degree, the higher the probability of a connection. The final number of nodes is `n`.

You start from a star graph with `m + 1` nodes. In each step you create `m` edges between a new node and existing nodes. The probability of connecting to node $i$ is 
$$p(i) = \frac{k_i}{\sum k}$$

The function requires another function described below.

In [None]:
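from tqdm import trange

def barabasi_albert_graph(nodes : np.ndarray, m : int) -> List[np.ndarray]:
    n = len(nodes)  # final number of nodes
    # Star graph with m + 1 nodes; assumption: AdjList[v] holds the neighbors of node v
    star = nx.star_graph(m)
    AdjList = [np.array(list(star.neighbors(v))) for v in range(m + 1)]
    for i in trange(1, n - m):
        attach(m + i, AdjList, m)
    return AdjList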

A function `attach` attaches a `node` to the graph stored in `AdjList` through `m` edges. It should not return anything; it only modifies the adjacency list passed as input.

*Hint: Create a list with repeated nodes from a list of edges. For example, $[(1, 2), (2, 3), (2, 4)] \to [1, 2, 2, 3, 2, 4]$. Uniformly select nodes one-by-one. Apply `random.choice` instead of `np.random.choice` to speed up the generation.*
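
The hint can be illustrated on its own example. In the flattened list each node appears once per incident edge, so a uniform draw from it selects a node with probability proportional to its degree. The variable names below are made up for this illustration.

```python
import random
from collections import Counter
from itertools import chain

toy_edges = [(1, 2), (2, 3), (2, 4)]
# Flatten the edge list: every node is repeated once per incident edge
repeated_nodes = list(chain.from_iterable(toy_edges))

random.seed(0)
# Node 2 has degree 3 out of a total degree of 6, so it should win about half the draws
draws = Counter(random.choice(repeated_nodes) for _ in range(6000))
```

Because the list can contain the same node several times, an implementation that needs `m` distinct targets still has to handle repeated picks.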

In [None]:
def attach(node : int, AdjList : List[np.ndarray], m : int) -> None:
    # Your code here
    raise NotImplementedError()

## 9. Degree dynamics in Barabasi-Albert model (2 points)

Measure the degree dynamics in the Barabasi-Albert model for one of the initial nodes and for nodes added to the network at intermediate time moments (steps of the algorithm).

The goal is to implement a function `generate_degree_dynamics` that takes an np.array of considered nodes, generates a Barabasi-Albert graph ($n = 3000$, $m = 6$) and returns an np.array of shape (30, len(cons_nodes)) holding the degrees of these nodes at the time moments when nodes 99, 199, 299, ..., 2999 appear. If a node does not exist yet, use np.nan.

Hint: use the `barabasi_albert_graph` function as a template.

In [None]:
def generate_degree_dynamics(cons_nodes):
    # YOUR CODE HERE
    raise NotImplementedError()