# Class 1

## Degree Centrality and Closeness Centrality

### Node Importance

In this lesson we're going to talk about **How to Find Important Nodes in a Network**.

Recall this friendship network among 34 people in a karate club. Based on the structure of the network, which would you say are the five most important nodes?

Examples of different ways of thinking about *"importance"*:

- *Example 1* $\rightarrow$ **Degree**: Nodes with a very high degree, that is, nodes that have lots of friends, are important.
    - Top 5 most important nodes in the karate club: nodes 34, 1, 33, 3 and 2.


- *Example 2* $\rightarrow$ **Average Proximity to Other Nodes**: Important nodes are those that are very close to the other nodes in the network.
    - Top 5 most important nodes in the karate club: nodes 1, 3, 34, 32 and 9.


- *Example 3* $\rightarrow$ **Fraction of Shortest Paths that Pass Through a Node**: Important nodes are those that tend to connect other nodes in the network, so we can measure importance by the fraction of shortest paths that pass through a particular node.
    - Top 5 most important nodes in the karate club: nodes 1, 34, 33, 3 and 32.


At this point, these three definitions of importance are very informal. The goal of this week's lectures is to set out more precise definitions of how to measure importance in a network.

### Network Centrality

This general topic is called **Network Centrality**, and **centrality measures are what allow us to find the most important nodes in a particular network**.

Possible use cases for **Network Centrality**:
- Finding the influential nodes in a social network
- Finding the nodes that disseminate information to many other nodes in the network
- Finding the nodes that are good at preventing bad behaviors or epidemics from spreading on a social network
- Finding hubs in a transportation network
- Finding the important pages on the Web
- Finding the nodes that prevent the network from breaking up, in other words, the nodes whose removal would split the network into separate components

### Centrality Measures List

Here's a list of the Centrality Measures that are commonly used:

- **Degree Centrality**
- **Closeness Centrality**
- Betweenness Centrality
- Load Centrality
- Page Rank
- Katz Centrality
- Percolation Centrality

### Degree Centrality

**Assumption**: Important Nodes have many Connections

Degree centrality is the most basic way to define importance or centrality: **nodes that have lots of neighbors (lots of friends) are very important**.

Which degree to use depends on the type of network:
- **Undirected Networks**: Use the **nodes degrees**
- **Directed Networks**: You have to choose between using **in-degree, or out-degree, or a combination of both**

#### Degree Centrality - Undirected Network

This centrality measure takes values between 0 and 1.

$$\textbf{Degree Centrality} \Rightarrow C_{deg}(v) = \frac{d_{v}}{|N| - 1} \textit{ , where $d_{v}$ is the degree of node $v$ and $N$ is the set of nodes in the network}$$

$$C_{deg}(v) \in [0, 1]$$

$$\textit{$C_{deg}(v) = 1$ $\rightarrow$ If the Node $v$ is connected to EVERY NODE}$$

$$\textit{$C_{deg}(v) = 0$ $\rightarrow$ If the Node $v$ is NOT connected to ANY NODE}$$

```python
import networkx as nx

# Load the karate club graph (NetworkX labels its nodes 0..33)
G = nx.karate_club_graph()

# Relabel the nodes 1..34 to match the lecture; the function returns a new graph
G = nx.convert_node_labels_to_integers(G, first_label=1)

# degree_centrality() returns a dict {node: centrality} covering every node
deg_centrality = nx.degree_centrality(G)

# Node 34 has 17 connections in a graph of 34 nodes: 17/33 = 0.515
deg_centrality[34]   # Out: 0.515

# Node 33 has 12 connections in a graph of 34 nodes: 12/33 = 0.364
deg_centrality[33]   # Out: 0.364
```
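To see that nothing more than the formula is involved, here is a minimal sketch that computes degree centrality by hand. The five-node adjacency dict and the helper name `degree_centrality_by_hand` are made up for illustration; they are not the karate club data.

```python
# Hypothetical undirected friendship network, stored as {node: set of neighbors}.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": set(),  # isolated node: connected to nobody
}


def degree_centrality_by_hand(adj):
    """Return d_v / (|N| - 1) for every node v."""
    n = len(adj)
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}


centrality = degree_centrality_by_hand(adjacency)
for node, value in sorted(centrality.items()):
    print(node, value)
```

Node A is linked to three of the four other nodes, so it scores 3/4, while the isolated node E scores 0, the lower end of the range.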


#### Degree Centrality - Directed Network

Now, in directed networks, we have the choice of using the *in-degree centrality* or the *out-degree centrality* of a node and everything else is defined in the same way.

##### I - In-Degree Centrality

The in-degree centrality of a node $v$ is its in-degree divided by the number of nodes in the graph minus one. We can use the function `in_degree_centrality()` from `NetworkX` to find the in-degree centrality of all the nodes in a directed network.

$$\textbf{In-Degree Centrality} \Rightarrow C_{in-deg}(v) = \frac{d^{in}_{v}}{|N| - 1}$$

$$\textit{where $d^{in}_{v}$ is the in-degree of node $v$, and $N$ is the set of nodes in the network}$$

```python
import networkx as nx

# G is assumed to be the lecture's 15-node directed example network (an nx.DiGraph
# with nodes 'A', 'B', ...); its edge list is not reproduced in these notes.
# The undirected karate club graph cannot be used here, because
# in_degree_centrality() is only defined for directed graphs.

# in_degree_centrality() returns a dict {node: in-degree centrality}
in_deg_centrality = nx.in_degree_centrality(G)

# Node A has 2 incoming edges in a graph of 15 nodes: 2/14 = 0.143
in_deg_centrality['A']   # Out: 0.143

# Node L has 3 incoming edges in a graph of 15 nodes: 3/14 = 0.214
in_deg_centrality['L']   # Out: 0.214
```

##### II - Out-Degree Centrality

In the same way, we can define the out-degree centrality by using the out-degree instead of the in-degree. `NetworkX` provides the function `out_degree_centrality()` for it.

$$\textbf{Out-Degree Centrality} \Rightarrow C_{out-deg}(v) = \frac{d^{out}_{v}}{|N| - 1}$$

$$\textit{where $d^{out}_{v}$ is the out-degree of node $v$, and $N$ is the set of nodes in the network}$$

```python
import networkx as nx

# G is assumed to be the lecture's 15-node directed example network (an nx.DiGraph
# with nodes 'A', 'B', ...); its edge list is not reproduced in these notes.

# out_degree_centrality() returns a dict {node: out-degree centrality}
out_deg_centrality = nx.out_degree_centrality(G)

# Node A has 3 outgoing edges in a graph of 15 nodes: 3/14 = 0.214
out_deg_centrality['A']   # Out: 0.214

# Node L has 1 outgoing edge in a graph of 15 nodes: 1/14 = 0.071
out_deg_centrality['L']   # Out: 0.071
```
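Since the lecture's directed network is not reproduced in these notes, the following sketch uses a small made-up directed graph (four nodes, four edges, an assumption for illustration only) to show both functions next to the formula they implement.

```python
import networkx as nx

# Hypothetical 4-node directed network (not the lecture's 15-node example).
G = nx.DiGraph([("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")])

in_deg = nx.in_degree_centrality(G)    # in-degree / (|N| - 1)
out_deg = nx.out_degree_centrality(G)  # out-degree / (|N| - 1)

# The same values, computed directly from the formula.
n = G.number_of_nodes()
in_by_hand = {v: G.in_degree(v) / (n - 1) for v in G}
out_by_hand = {v: G.out_degree(v) / (n - 1) for v in G}

for v in sorted(G):
    print(v, round(in_deg[v], 3), round(out_deg[v], 3))
```

Node C receives an edge from every other node, so its in-degree centrality is 1, while its out-degree centrality is 0 because it points to nobody. Which of the two is the better notion of importance depends on what an edge means in the network at hand.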

### Closeness Centrality

**Assumption**: Important Nodes are Close to Other Nodes

Another way of measuring centrality is what we call closeness centrality. And the assumption here is that nodes that are important are going to be a short distance away from all other nodes in the network. Recall that we measure distance between two nodes by looking at the length of the shortest path between those nodes.

We can use the function `closeness_centrality()`, which returns a dictionary with the closeness centrality of every node.
$$$$
#### I - Closeness Centrality - Connected Nodes

**Assumption**: All the nodes can actually reach all the other nodes

$$\textbf{Closeness Centrality} \Rightarrow C_{close}(v) = \frac{|N| - 1}{\sum_{u \in N \setminus \{v\}}{d(v,u)}}$$

$$ \textit{where $d(v,u)$ is the length of the shortest path between nodes $v$ and $u$, and $N$ is the set of nodes in the network}$$

```python
import networkx as nx

# Load the karate club graph and relabel its nodes 1..34 to match the lecture
G = nx.convert_node_labels_to_integers(nx.karate_club_graph(), first_label=1)

# Closeness centrality of node 32
close_centrality = nx.closeness_centrality(G)
close_centrality[32]   # Out: 0.541

# Apply the closeness centrality formula by hand to see where 0.541 comes from:
# first the sum of the shortest-path lengths from node 32 to every other node
sum(nx.shortest_path_length(G, 32).values())   # Out: 61
(len(G.nodes()) - 1) / 61                      # Out: 0.541

# Whole formula in one line of code
(len(G.nodes()) - 1) / sum(nx.shortest_path_length(G, 32).values())   # Out: 0.541
```

This formula makes the **implicit assumption that every node can actually reach every other node**, which is not always the case.
$$$$
#### II - Closeness Centrality - Disconnected Nodes

**Assumption**: Not all the nodes can actually reach all the other nodes

How do we measure closeness centrality when a node cannot actually reach all the other nodes?

**What is the closeness centrality of Node L?**
$$$$
- Option Nº1 - **Don't Normalize the Closeness Centrality**

Consider only the nodes that L can reach

$$C_{close}(L) = \frac{|R(L)|}{\sum_{u ∈ R(L)}{d(L,u)}} \text{ ,where R(L) is the set of nodes L can reach}$$

In this example, node L can only reach node M, through a shortest path of length 1

$$\Rightarrow C_{close}(L) = \frac{1}{1} = 1$$
$$\therefore \texttt{ The closeness centrality takes its maximum value of 1}$$

$$\Rightarrow \textbf{Problem: } \textit{A centrality of 1 is too high for a node that can reach only 1 node, so we use the 2nd option to solve this issue}$$
$$$$
- Option Nº2 - **Normalize the Closeness Centrality**

Consider only the nodes that L can reach and normalize it by the fraction of nodes L can reach:

$$C_{close}(L) = \frac{|R(L)|}{|N| - 1} \cdot \frac{|R(L)|}{\sum_{u \in R(L)}{d(L,u)}} \text{ , where R(L) is the set of nodes L can reach}$$

$$\Rightarrow C_{close}(L) = \frac{1}{14} \cdot \frac{1}{1} = 0.071$$

$$\textit{Note that this definition matches our definition of closeness centrality when the graph is connected, since then } |R(L)| = |N| - 1$$
$$$$
$$$$
In `NetworkX` we find the closeness centrality with the function `closeness_centrality()`, which gives you the option of normalizing or not normalizing the closeness centrality value.

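```python
import networkx as nx

# G is assumed to be the lecture's 15-node directed example network (an nx.DiGraph
# containing nodes 'L' and 'M'); its edge list is not reproduced in these notes.
# The `normalized` argument belongs to NetworkX 1.x; NetworkX 2.0 and later
# expose the same choice under the name `wf_improved`.

# Option 1 - closeness centrality of node L, not normalized (returns a dict)
close_centrality_not_norm = nx.closeness_centrality(G, normalized=False)
close_centrality_not_norm['L']   # Out: 1

# Option 2 - closeness centrality of node L, normalized (returns a dict)
close_centrality_norm = nx.closeness_centrality(G, normalized=True)
close_centrality_norm['L']       # Out: 0.071
```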
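The two options can also be written out directly from their formulas. The sketch below is a plain-Python illustration on a made-up five-node directed network in which L can reach only M; the graph, the helper names `distances_from` and `closeness`, and the choice to return 0 for a node that reaches nobody are all assumptions for the example. It follows outgoing edges, as in the lecture's definition of $R(L)$.

```python
from collections import deque

# Hypothetical directed network: node -> nodes it points to. L can reach only M.
successors = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "L": ["M"],
    "M": [],
}


def distances_from(source, succ):
    """Breadth-first search: shortest-path length to every node reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in succ[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[source]  # keep only the other nodes, i.e. R(source)
    return dist


def closeness(v, succ, normalized=True):
    reach = distances_from(v, succ)
    if not reach:  # v reaches nobody: treat its closeness as 0
        return 0.0
    value = len(reach) / sum(reach.values())   # Option 1: |R(v)| / sum of distances
    if normalized:
        value *= len(reach) / (len(succ) - 1)  # Option 2: times |R(v)| / (|N| - 1)
    return value


print(closeness("L", successors, normalized=False))  # Option 1
print(closeness("L", successors, normalized=True))   # Option 2
```

With Option 1, L gets the maximum score of 1 although it reaches a single node; with Option 2 that score is scaled down by the fraction of the network L can reach (here 1 out of 4 other nodes).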

### Summary 

In summary, we talked about **centrality measures and how they aim to find the most important nodes in the network**. There are many different ways of defining centrality, and today we talked about two specific ones.

The first is **degree centrality, a very basic definition which assumes that important nodes are those that have many connections**. The centrality of a node is simply the ratio between the degree of the node and the number of nodes in the graph minus one. **Depending on whether we have a directed or an undirected graph, we use the in-degree or the out-degree, or the degree of the node**, and `NetworkX` provides a function for each case.

$$$$

$$\textbf{Degree Centrality}$$
$$\textbf{Assumption: }\texttt{Important nodes have many connections}$$

$$C_{deg}(v) = \frac{d_{v}}{|N| - 1}$$

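```python
import networkx as nx

# Undirected graph
G = nx.karate_club_graph()
nx.degree_centrality(G)       # degree centrality

# Directed graph: the in/out variants are only defined for directed graphs,
# so D stands in for any nx.DiGraph (this two-edge graph is just a placeholder)
D = nx.DiGraph([('A', 'B'), ('B', 'C')])
nx.in_degree_centrality(D)    # in-degree centrality
nx.out_degree_centrality(D)   # out-degree centrality
```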

$$$$

The second centrality measure that we looked at was **closeness centrality**, which **assumes that important nodes are close to other nodes**. Here we can **choose to normalize or not to normalize**, as we discussed, and the function that computes it in `NetworkX` is `closeness_centrality()`.

$$\textbf{Closeness Centrality}$$
$$\textbf{Assumption: }\texttt{Important nodes are close to other nodes}$$

$$C_{close}(L) = \frac{|R(L)|}{|N| - 1} \cdot \frac{|R(L)|}{\sum_{u \in R(L)}{d(L,u)}}$$

```python
import networkx as nx

G = nx.karate_club_graph()

# `normalized` is the NetworkX 1.x argument; NetworkX 2.0 and later call it `wf_improved`

# Not normalized closeness centrality
nx.closeness_centrality(G, normalized=False)

# Normalized closeness centrality
nx.closeness_centrality(G, normalized=True)
```

---

# Class 2

## Betweenness Centrality

### Centrality Measures List

Here's a list of the Centrality Measures that are commonly used:

- Degree Centrality
- Closeness Centrality
- **Betweenness Centrality**
- Load Centrality
- Page Rank
- Katz Centrality
- Percolation Centrality

### Betweenness Centrality

**Assumption:** Important nodes are those that connect other nodes

$$C_{btw}(v) = \sum_{s, t \in N}{\frac{\sigma_{s,t}(v)}{\sigma_{s,t}}}$$

$$\textit{$\sigma_{s,t}$ = the number of shortest paths between nodes $s$ and $t$,}$$
$$\textit{$\sigma_{s,t}(v)$ = the number of shortest paths between nodes $s$ and $t$ that pass through node $v$}$$

The betweenness centrality of node $v$ is the sum of these ratios over all possible pairs of nodes $s$ and $t$. As we will see, there are different ways to pick the specific nodes $s$ and $t$ that we use to compute the centrality of node $v$.

The basic idea here is that a **node $v$ has high betweenness centrality if it shows up in many of the shortest paths of nodes $s$ and $t$**.

- **Endpoints:** We can either include or exclude node $v$ as node $s$ or $t$ in the computation of the betweenness centrality $C_{btw}(v)$

We'll find that when we use `NetworkX` to compute this, we'll have the option of either including or excluding the node as one of the endpoints in the pair of nodes.
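The definition can be applied literally on a small graph by enumerating every shortest path. The sketch below does that for a made-up five-node undirected graph (a square with one extra tail node, an assumption for illustration), excluding endpoints and counting each unordered pair once, and then compares the result with `betweenness_centrality(..., normalized=False)`. It assumes a connected graph, since `all_shortest_paths()` raises an error when no path exists.

```python
from itertools import combinations

import networkx as nx

# Hypothetical undirected graph: a square A-B-C-D plus a tail D-E.
G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("D", "E")])


def betweenness_by_definition(G):
    """Sum, over all pairs s, t, of the fraction of shortest s-t paths through v."""
    score = {v: 0.0 for v in G}
    for s, t in combinations(G.nodes(), 2):        # each unordered pair once
        paths = list(nx.all_shortest_paths(G, s, t))
        for v in G:
            if v in (s, t):                         # endpoints are excluded
                continue
            through_v = sum(1 for path in paths if v in path)
            score[v] += through_v / len(paths)
    return score


by_definition = betweenness_by_definition(G)
from_networkx = nx.betweenness_centrality(G, normalized=False, endpoints=False)

for v in sorted(G):
    print(v, by_definition[v], from_networkx[v])
```

Node D scores highest because every shortest path to the tail node E has to go through it. Enumerating all paths like this is only practical for tiny graphs; the library uses a much faster algorithm to arrive at the same numbers.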

#### Betweenness Centrality - Directed Graph

What happens when we have a directed graph in which not every node is reachable from every other node? For a pair with no path between them, $\sigma_{s,t} = 0$ and the ratio is undefined. **To solve this problem, when computing betweenness centrality we only consider pairs of nodes $s, t$ that have at least one path between them, which prevents a denominator of $\sigma_{s,t} = 0$**

### Betweenness Centrality - Normalization

So far we haven't talked about normalizing the betweenness centrality in any way. The problem with this is that **nodes in graphs with a larger number of nodes will tend to have higher centrality than nodes in smaller graphs**. That's simply because in **large graphs there are more nodes $s$ and $t$ to choose from when computing the centrality of a node**. For example, the nodes in the friendship network of the 34-person karate club are going to have lower centrality than the nodes in a larger network of 2200 people. So, **if we want to compare betweenness centrality across networks, it's useful to normalize**.

- **Normalization:** Betweenness centrality values will be larger in graphs with many nodes. To control for this, we divide the centrality values by the number of pairs of nodes in the graph, excluding node $v$. The divisor depends on whether the graph is directed or undirected

$$\textbf{Undirected Graph: }\textit{Divide Betweenness Centrality by } \rightarrow \frac{1}{2} \cdot (|N| - 1) \cdot (|N| - 2)$$

$$\textbf{Directed Graph: }\textit{Divide Betweenness Centrality by } \rightarrow (|N| - 1) \cdot (|N| - 2)$$

In `NetworkX` you can use the function `betweenness_centrality()` to find the centrality of every node in the network, with the options that we've discussed: you can choose whether to normalize, and you can choose how to treat endpoints, that is, whether the node whose centrality you're computing may also count as one of the endpoints.

```python
import networkx as nx
import operator

# Load the karate club graph and relabel its nodes 1..34 to match the lecture
G = nx.convert_node_labels_to_integers(nx.karate_club_graph(), first_label=1)

# Betweenness centrality with normalization, excluding endpoints
betweenness = nx.betweenness_centrality(G, normalized=True, endpoints=False)
sorted(betweenness.items(), key=operator.itemgetter(1), reverse=True)[:5]

# Out: top 5 nodes by betweenness centrality, as (node number, value)
# [(1, 0.437635),
#  (34, 0.30407),
#  (33, 0.14524),
#  (3, 0.14365),
#  (32, 0.13827)]
```
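As a quick check of the undirected normalization, the following sketch divides the unnormalized values by $\frac{1}{2}(|N| - 1)(|N| - 2)$ by hand and compares them with what `normalized=True` returns. The variable names are illustrative.

```python
import networkx as nx

G = nx.karate_club_graph()
n = G.number_of_nodes()

raw = nx.betweenness_centrality(G, normalized=False)
scaled = nx.betweenness_centrality(G, normalized=True)

# Number of node pairs that exclude v, for an undirected graph
pairs_excluding_v = (n - 1) * (n - 2) / 2

by_hand = {v: raw[v] / pairs_excluding_v for v in G}
max_gap = max(abs(by_hand[v] - scaled[v]) for v in G)
print(n, pairs_excluding_v, max_gap)
```

For the 34-node karate club the divisor is 528, and the largest difference between the hand-scaled and library values should be at the level of floating-point rounding.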

### Betweenness Centrality - Complexity

One of the **issues with betweenness centrality is that it can be very computationally expensive**. Depending on the specific algorithm you're using, the computation can take time on the order of the number of nodes cubed.

$$\textit{Time Complexity} = O(|N|^3)$$

So **rather than computing betweenness centrality from all the possible nodes $s, t$ in the network, you can approximate it by looking at just a sample of nodes**. In `NetworkX` you can do this with the parameter `k`, which says how many nodes to sample when computing the betweenness centrality.

- **Approximation:** Rather than computing betweenness centrality based on all pairs of nodes in the network, we can approximate it based on a sample of nodes

```python
import networkx as nx
import operator

# Load the karate club graph and relabel its nodes 1..34 to match the lecture
G = nx.convert_node_labels_to_integers(nx.karate_club_graph(), first_label=1)

# Betweenness centrality with normalization, excluding endpoints,
# sampling only 10 of the 34 nodes
betweenness = nx.betweenness_centrality(G, normalized=True, endpoints=False, k=10)
sorted(betweenness.items(), key=operator.itemgetter(1), reverse=True)[:5]

# Out: top 5 nodes by approximate betweenness centrality, as (node number, value).
# The sample is random, so these values change from run to run.
# [(1, 0.48269),
#  (34, 0.27564),
#  (32, 0.20863),
#  (3, 0.16975),
#  (2, 0.13194)]
```
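Because the sample is random, two runs with the same `k` generally disagree. A small sketch of how the estimate can be made repeatable and compared with the exact ranking, using the `seed` argument of `betweenness_centrality()` (the seed value 0 is an arbitrary choice):

```python
import networkx as nx

G = nx.karate_club_graph()

exact = nx.betweenness_centrality(G, normalized=True)

# Same sample size and same seed give the same sample, hence the same estimate
approx = nx.betweenness_centrality(G, normalized=True, k=10, seed=0)
repeat = nx.betweenness_centrality(G, normalized=True, k=10, seed=0)

top_exact = sorted(exact, key=exact.get, reverse=True)[:5]
top_approx = sorted(approx, key=approx.get, reverse=True)[:5]

print("exact top 5: ", top_exact)
print("approx top 5:", top_approx)
print("repeatable:  ", approx == repeat)
```

The node labels printed here are NetworkX's default 0-based labels. Comparing the two top-5 lists for a few different seeds gives a feel for how much accuracy is traded away for the shorter running time.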

### Betweenness Centrality - Subsets

Sometimes you would rather compute the betweenness centrality based on two subgroups in the network instead of looking at all potential pairs of nodes. Maybe you really care about two groups communicating with each other, so you want to find the most important nodes in the network in the sense that they tend to show up in the shortest paths between a group of source nodes and a group of target nodes.

To do this in `NetworkX` you can use the function `betweenness_centrality_subset()`, to which you pass the graph, the set of source nodes and the set of target nodes, and you can choose whether to normalize.

```python
import networkx as nx
import operator

# Load the karate club graph and relabel its nodes 1..34 to match the lecture
G = nx.convert_node_labels_to_integers(nx.karate_club_graph(), first_label=1)

# Betweenness centrality restricted to shortest paths that go from
# a source node to a target node, with normalization
betweenness_subset = nx.betweenness_centrality_subset(
    G,
    sources=[34, 33, 21, 30, 16, 27, 15, 23, 10],  # source nodes
    targets=[1, 4, 13, 11, 6, 12, 17, 7],          # target nodes
    normalized=True,
)
sorted(betweenness_subset.items(), key=operator.itemgetter(1), reverse=True)[:5]

# Out: top 5 nodes by subset betweenness centrality, as (node number, value)
# [(1, 0.04899),
#  (34, 0.02881),
#  (3, 0.01836),
#  (33, 0.01664),
#  (9, 0.01415)]
```

### Betweenness Centrality - Edges

You can also define the betweenness centrality of an edge, in much the same way as for a node. Again you look at pairs of nodes $s$ and $t$, and you **take the number of shortest paths from $s$ to $t$ that involve the edge $e$, divided by the number of all shortest paths between $s$ and $t$**. It is the exact same definition, **but rather than asking whether a particular node shows up in the shortest paths between $s$ and $t$, we ask whether a particular edge shows up in them**.

$$C_{btw}(e) = \sum_{s, t \in N}{\frac{\sigma_{s,t}(e)}{\sigma_{s,t}}}$$

$$\textit{$\sigma_{s,t}$ = the number of shortest paths between nodes $s$ and $t$,}$$
$$\textit{$\sigma_{s,t}(e)$ = the number of shortest paths between nodes $s$ and $t$ that pass through edge $e$}$$

In `NetworkX` you can use the function `edge_betweenness_centrality()` to find the betweenness centrality of all the edges in the network.

```python
import networkx as nx
import operator

# Load the karate club graph and relabel its nodes 1..34 to match the lecture
G = nx.convert_node_labels_to_integers(nx.karate_club_graph(), first_label=1)

# Edge betweenness centrality with normalization
edge_betweenness = nx.edge_betweenness_centrality(G, normalized=True)
sorted(edge_betweenness.items(), key=operator.itemgetter(1), reverse=True)[:5]

# Out: top 5 edges by edge betweenness centrality, as (edge, value)
# [((1, 32), 0.12725),
#  ((1, 7), 0.078134),
#  ((1, 6), 0.078134),
#  ((1, 3), 0.077787),
#  ((1, 9), 0.074239)]
```
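A graph small enough to check by hand makes the edge version concrete. In the sketch below (a four-node path, chosen purely for illustration) the middle edge lies on the shortest path of 4 of the 6 node pairs, and each outer edge on 3 of them.

```python
import networkx as nx

# Path graph 0 - 1 - 2 - 3: every pair of nodes has exactly one shortest path
G = nx.path_graph(4)

raw = nx.edge_betweenness_centrality(G, normalized=False)
norm = nx.edge_betweenness_centrality(G, normalized=True)

busiest_edge = max(raw, key=raw.get)
for edge in G.edges():
    print(edge, raw[edge], round(norm[edge], 3))
```

The normalized values here are the raw counts divided by the total number of node pairs, which is 6 for four nodes, so the middle edge scores 4/6. Note that this divisor differs from the one used for nodes, since no node has to be excluded from the pairs.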

#### Betweenness Centrality - Edges Subset

In the same way that you could define a specific set of source nodes and a specific set of target nodes, you can do the same thing when you compute the edge betweenness centrality rather than node betweenness centrality. For this, you can use the function `edge_betweenness_centrality_subset()`, and you pass again the graph and the source nodes and the target nodes.

```python
import networkx as nx
import operator

# Load the karate club graph and relabel its nodes 1..34 to match the lecture
G = nx.convert_node_labels_to_integers(nx.karate_club_graph(), first_label=1)

# Edge betweenness centrality restricted to shortest paths that go from
# a source node to a target node, with normalization
edge_betweenness_subset = nx.edge_betweenness_centrality_subset(
    G,
    sources=[34, 33, 21, 30, 16, 27, 15, 23, 10],  # source nodes
    targets=[1, 4, 13, 11, 6, 12, 17, 7],          # target nodes
    normalized=True,
)
sorted(edge_betweenness_subset.items(), key=operator.itemgetter(1), reverse=True)[:5]

# Out: top 5 edges by subset edge betweenness centrality, as (edge, value)
# [((1, 32), 0.013665),
#  ((1, 9), 0.013665),
#  ((14, 34), 0.01221),
#  ((1, 3), 0.012113),
#  ((1, 7), 0.012032)]
```

### Summary

In summary, betweenness centrality makes the assumption that **important nodes tend to connect the other nodes**. For a node $v$, it is the sum, over pairs of nodes $s$ and $t$, of the number of shortest paths between $s$ and $t$ that involve $v$ divided by the number of all shortest paths between $s$ and $t$.

We also talked about **normalizing this, especially when comparing betweenness centrality across networks of different sizes**, by dividing by the number of pairs of nodes. We talked about approximating it as well, because computing it exactly can be too computationally expensive: we can use a sample of nodes rather than all the nodes.

We also talked about **choosing specific sets of source nodes and specific sets of target nodes rather than using all possible pairs**. That's useful when you have two particular sets of nodes that you care about and you want to know which nodes are important for connecting these two sets.

Finally, we talked about how to generalize this a bit more to the **betweenness centrality of not only the nodes but also the edges**. We can define it for edges in much the same way that we define it for nodes.

$$$$
$$\textbf{Betweenness Centrality}$$
$$\textbf{Assumption: }\texttt{Important nodes connect other nodes}$$

$$C_{btw}(v) = \sum_{s, t \in N}{\frac{\sigma_{s,t}(v)}{\sigma_{s,t}}}$$



- **Normalization:** Divide by the number of pairs of nodes


- **Approximation:** Computing betweenness centrality can be computationally expensive (high time complexity), so we can approximate it by using a sample of nodes from the whole network


- **Subsets:** We can define subsets of source and target nodes to compute betweenness centrality


- **Edge Betweenness Centrality:** We can apply the same framework to find important edges in the network rather than nodes; the procedure to compute the edge betweenness centrality is the same


---

# Class 3

## Basic PageRank

### Centrality Measures List

Here's a list of the Centrality Measures that are commonly used:

- Degree Centrality
- Closeness Centrality
- Betweenness Centrality
- **Page Rank**
- Load Centrality
- Katz Centrality
- Percolation Centrality

We've been talking about how to measure the importance or the centrality of a node in a network, and now we're going to see another way of doing this. It's called PageRank and it was developed by the Google founders when they were thinking about how to measure the importance of webpages using the hyperlink network structure of the web.

The basic idea is that **PageRank will assign a score of importance to every single node, and the assumption that it makes is that important nodes are those that have many in-links from important pages or important other nodes**.

PageRank can be used for **any type of network**, but is **mainly useful for directed networks**.

The definition says that important pages are those with many in-links from other important pages, which is somewhat circular: to measure the PageRank of a particular node, say the first node you look at, you need the PageRank of the nodes that point to it, and you don't have those yet. In this lecture, we're going to walk through how you can actually compute it.

The setup is a network with *n* nodes, on which we compute PageRank step by step. The sum of the PageRank of all the nodes is always constant: it is always 1. At first, every node starts with a PageRank value of $\frac{1}{n}$. Then every node gives all of its PageRank away, split among the nodes that it points to, and we do this over and over again.

More precisely, we perform the Basic PageRank Update Rule k times. Under this rule, every node gives an equal share of its current PageRank to each of the nodes that it links to. The new PageRank value of every node is then the sum of all the PageRank it received from the nodes that point to it.


$$n = \textit{ Number of Nodes in the Network}$$
$$k = \textit{ Number of Times the Basic PageRank Update Rule is Repeated}$$

$$\textbf{Algorithm Steps}$$
$$\hookrightarrow \text{Step Nº1: } \textit{ Assign all nodes of the network a value of } \rightarrow \frac{1}{n}$$
$$\hookrightarrow \text{Step Nº2: }\textit{ Perform the Basic PageRank Update Rule k times}$$


- **Basic PageRank Update Rule :** Each node gives an equal share of its current PageRank to all the nodes it links to
$$$$

**The new PageRank of each node is the sum of all the PageRank it received from the nodes that point to it**.

$$$$

- How do I know when to stop, and how do I know which values I should treat as the final ones?
- What happens if we continue this process for more and more values of *k = 1, 2, 3, 4, 5, ...*?

For this particular network (a small directed network of 5 nodes), if you repeat the update for many, many steps, it turns out that **eventually the values change very little from one step to the next, so each of them is converging to a unique value**. In this case, those values are:

$$\textit{Basic PageRank Converging Values for Nodes (Example: Directed Network of 5 nodes)}$$
$$\textit{Node }B = 0.38$$
$$\textit{Node }C = 0.25$$
$$\textit{Node }D = 0.19$$
$$\textit{Node }A = 0.12$$
$$\textit{Node }E = 0.06$$

Actually, it turns out that **for most networks, these PageRank values will actually converge and that's the value that we think of as the PageRank of the nodes**. So, the PageRank of the node is the value that you get after you do this process many, many, many times.
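The update rule is short enough to write out in full. The sketch below is a plain-Python illustration on a made-up three-node directed network (not the lecture's five-node example); the dictionary layout and the function name `basic_pagerank` are assumptions for the example, and it presumes every node has at least one outgoing link.

```python
# Hypothetical directed network: node -> nodes it links to (every node has an out-link).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def basic_pagerank(links, k):
    """Apply the Basic PageRank Update Rule k times, starting from 1/n everywhere."""
    n = len(links)
    rank = {node: 1 / n for node in links}             # Step 1
    for _ in range(k):                                 # Step 2
        new_rank = {node: 0.0 for node in links}
        for node, targets in links.items():
            share = rank[node] / len(targets)          # equal share per out-link
            for target in targets:
                new_rank[target] += share              # receiver sums what it gets
        rank = new_rank
    return rank


rank_after_1 = basic_pagerank(links, 1)
rank_after_100 = basic_pagerank(links, 100)
print(rank_after_1)
print({node: round(value, 4) for node, value in rank_after_100.items()})
```

After one step C already holds half of the total, because it receives from both A and B. After many steps the values settle at A = 0.4, B = 0.2, C = 0.4, and the total stays at 1 throughout.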


### Summary

In summary, the steps of Basic PageRank are the following. First, every node starts with a PageRank value of 1/n. Then you perform the Basic PageRank Update Rule k times. The Update Rule says that every node gives an equal share of its current PageRank to all the nodes that it links to, so the new PageRank value of every node is simply the sum of all the PageRank that it gets from the nodes that point to it. For most networks, these PageRank values converge to a unique value as k gets larger ($k \rightarrow \infty$). That's how you compute a basic PageRank.

$$$$

$$\texttt{Steps of Basic PageRank Algorithm}$$

$$\hookrightarrow \textbf{Step Nº1: } \textit{ All Nodes in the Network start with a PageRank Score of }\rightarrow \frac{1}{n}$$

$$\hookrightarrow \textbf{Step Nº2: }\textit{ Perform the Basic PageRank Update Rule k times}$$


- **Basic PageRank Update Rule :** Each node gives an equal share of its current PageRank to all the nodes it links to


- **New PageRank Score :** The new PageRank score for each node is the sum of all the PageRank it received from the nodes that point to it

$$$$
**For Most Networks, the PageRank Score values Converge as k gets larger** $\Rightarrow $ ($k \rightarrow \infty$)

## Scaled PageRank 

First, we talked about PageRank and how to compute it on a network. And now, the **next step is to know how to interpret it and identify a potential problem that it has and also a solution**.


### Interpreting PageRank


- The PageRank of a node at step *k* can be interpreted as the **probability that a random walker lands on that node after taking k random steps**


- **Random walk of k steps :** Start on a random node, choose a random outgoing edge from that node and follow it to the next node, then repeat this process *k times*


Under this interpretation, the PageRank value of each node is the probability that you would land on that node after k steps. We computed the PageRank values for the example network, and if you repeat the update for a very large number of steps (k approaching infinity), these are the values that the process converges to. There, B had the highest PageRank value, 0.38, and you can interpret this 0.38 as the probability that a random walk lands on node B after taking many, many steps.

### PageRank Problem

So you should have figured out that for large enough k, F and G are going to have a PageRank value of about one half. And all the other nodes are going to have a PageRank value of 0. So, why is that? 

Well, imagine a random walk on this network. Whenever the random walk lands on F or G, which will happen eventually if you walk long enough on this network, then they're going to stock on F and G because there are no edges to go to, right? So if you're in G, the only place you have to go is F. And if you are in F, the only place you have to go to is G. So, there's no way to get back from G and F to any of the other notes. And so all the other nodes, a probability that you land on one of them after taking a very long random walk is going to be 0, and the probability of landing on either F or G after a very long random walk is going to be about half for each. And so this seems like a problem, because while it may be true that F and G are very important for this reason, it's not reasonable to think that all the other nodes have zero importance. We need to figure out a way of how to fix this problem. 

The way we fix this problem is by introducing a new parameter to the PageRank computation called the *"damping parameter" alpha ($\alpha$)*. And so what we're going to do is change the way we do our random walk: we're going to take a random walk with the *damping parameter* $\alpha$.

The way it works is that we again start at a random node. And then with probability $\alpha$, we're going to follow the outgoing edges at random, just like we did before, but this is only going to happen with probability $\alpha$. Then, with probability $(1 - \alpha)$, we're actually going to choose a node completely at random and jump to it. So again, at every step, what we used to do before was to always follow the edges. **What we're going to do now is that at every step, we're either going to follow the edges with probability $(\alpha)$, or we are going to forget about the edges, and choose a random node, and go to it with probability $(1 - \alpha)$, and we're going to repeat this k times**. 

And so what happens now, if you think about the random walk on this particular network as an example, is that we're no longer stuck on nodes F and G, right? Because even if we were to be on node F or G, we have some probability $(1 - \alpha)$ of choosing a random node, and we get unstuck whenever we actually choose a random node.

And so the **Scaled PageRank of k steps with damping parameter alpha of a node n is going to be the probability that this new random walk with damping parameter alpha lands on node n after k steps**.


- **Random Walk of K Steps and Damping Parameter of $\alpha$ :** Start the random walk and then,
    - With Probability $\alpha$ : Choose an outgoing edge at random and follow it to the next node
    - With Probability $(1 - \alpha)$ : Choose a node at random and go to it
    - *Repeat k times*
    
    
- **Scaled PageRank :** The Scaled PageRank of *k* Steps and damping parameter of $\alpha$ of a node *n* is the probability that a random walk with damping parameter of $\alpha$ lands on a node *n* after *k* steps
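The damped random walk described above can be sketched as an exact probability update: each step, a fraction $(1 - \alpha)$ of the probability is spread evenly over all nodes, and the remaining fraction $\alpha$ follows the outgoing edges. The function `scaled_pagerank` and the graph `trap` are hypothetical names used only for illustration, and the sketch assumes every node has at least one outgoing edge.

```python
def scaled_pagerank(out_edges, alpha, k):
    """Probability of being on each node after k steps of the damped random walk."""
    nodes = list(out_edges)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # uniform random start
    for _ in range(k):
        # With probability (1 - alpha): jump to a node chosen uniformly at random
        new_rank = {node: (1.0 - alpha) / n for node in nodes}
        # With probability alpha: follow one outgoing edge chosen at random
        for node, targets in out_edges.items():
            share = alpha * rank[node] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical graph in which F and G only point to each other
trap = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "F"],
    "F": ["G"],
    "G": ["F"],
}

scaled_08 = scaled_pagerank(trap, alpha=0.8, k=100)
scaled_09 = scaled_pagerank(trap, alpha=0.9, k=100)
print(scaled_08)
print(scaled_09)
```

With damping, every node keeps a non-zero score, and F and G no longer absorb all of the probability. Comparing the two results also shows that the scores change when $\alpha$ changes.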

$$$$

And so, just like with the Basic PageRank, for most networks as *k* gets larger $(k \rightarrow \infty)$, the **Scaled PageRank converges to a unique value, but now that unique value will depend on the particular value of alpha** $(\alpha)$ that you choose. 

**In practice, what we do is we choose our parameter alpha between 0.8 and 0.9**. So most of the time, we're going to be following the edges. But sometimes, maybe 10% or 20% of the time, we're going to be jumping randomly, that way we're not stuck anywhere in the network. This **damping parameter works better for large networks** like the web or very large social networks, and **for small networks sometimes, it doesn't work very well**.

When using `NetworkX`, you can use the function `pagerank(G, alpha=0.8)` with input G, which is a graph. And then you have to tell what the alpha parameter is to compute this Scaled PageRank of the network G with a damping parameter alpha.

```python
import networkx as nx

# Scaled PageRank of an existing graph G with damping parameter alpha of 0.8
nx.pagerank(G, alpha=0.8)
```
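As a usage sketch, the call can be tried on a small directed graph with two different values of $\alpha$. The graph below is hypothetical (F and G only point to each other) and is not the network from the lecture.

```python
import networkx as nx

# Hypothetical directed graph with a two-node trap (F <-> G)
G = nx.DiGraph()
G.add_edges_from([
    ("A", "B"), ("A", "C"),
    ("B", "C"),
    ("C", "A"), ("C", "F"),
    ("F", "G"), ("G", "F"),
])

# pagerank returns a dictionary keyed by node
pr_08 = nx.pagerank(G, alpha=0.8)
pr_09 = nx.pagerank(G, alpha=0.9)

for node in sorted(G):
    print(node, round(pr_08[node], 3), round(pr_09[node], 3))
```

Both runs give every node a positive score, and the scores differ between the two runs, which illustrates that the result depends on the chosen $\alpha$.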


### Summary

In summary, what we find is that the Basic PageRank of a node can be interpreted as the probability that a random walk lands on that node after k steps. And this is a useful interpretation, because well, it allows us to see this problem that Basic PageRank has. 

Sometimes, for some networks, a few nodes can sort of suck up all the PageRank from all the other nodes in the network. And so to fix this problem, there is this other version of PageRank, called Scaled PageRank, that introduces the parameter alpha. The random walker chooses a random node to jump to with probability $(1 - \alpha)$. What that does is allow the walker to not be stuck anywhere, because sometimes it jumps randomly. Typically, we use a parameter alpha between 0.8 and 0.9. But we do have to keep in mind that the PageRank value that we get will depend on the particular choice of alpha. 

To compute this or to use this in `NetworkX`, you can use the function `pagerank()` with input parameters G, the network, and then the alpha that you want, to compute the Scaled PageRank with the network G with damping parameter alpha.

$$$$

- The Basic PageRank of a node can be **interpreted as the probability that a random walk lands on the node after *k* random steps**


- Basic PageRank has the **problem that in some networks a few nodes can *"suck up"* all the PageRank from the network** (the random walk gets stuck among these few nodes and cannot escape the cycle)


- **How to Fix this Problem**: Scaled PageRank introduces a parameter $\alpha$ such that the random walk chooses a random node to jump to with a probability of $(1 - \alpha)$


- Typically Alpha ($\alpha$) Values Used: $\alpha = 0.8$ (80%) or $\alpha = 0.9$ (90%)


- The **damping parameter works better for large networks** like the web or very large social networks, and **for small networks sometimes, it doesn't work very well**


- `NetworkX` function `pagerank(G, alpha=0.8)` computes the Scaled PageRank of network *G* with damping parameter $\alpha = 0.8$

---

# Class 4

## Hubs and Authorities

In this lecture we're going to talk about another way to find central nodes in the network. Just like PageRank, this way was also developed in the context of how a search engine might go about finding important web pages given a query using the hyperlink structure of the web. So the first step will be to find a set of relevant webpages. 

So, for example, web pages that contain the query string in the text of the web page, or pages that for some reason the search engine thinks might be important to look at. So **these are potential authorities, potential pages that are important given the query that the user submitted. This will be called the root set**. And so let's say in this example that nodes A, B, and C are these potential authorities, this is the root set. And **the next step will be to find all the web pages that link to any page in the root set, and these pages will be potential hubs**. 

So **hubs are pages that are not themselves necessarily relevant to the query that the user submitted, but they link to pages that are relevant**. So they're pages that are good at pointing at things that may be relevant.

Given a query to a search engine:

- **Root Set :** Set of highly relevant web pages (for example pages that contain the query string) - *potential authorities*


- Find all pages that link to a page in the root set - *potential hubs*


- **Base Set :** Root nodes and any node that links to a node in the root set


- Consider all edges connecting nodes in the base set
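The construction of the base set listed above can be sketched in a few lines. The edge list and the helper `build_base_set` are hypothetical; the sketch follows the definition in these notes, where the base set is the root set plus every page that links into it.

```python
def build_base_set(edges, root_set):
    """Return the base set and the edges connecting nodes inside it."""
    root = set(root_set)
    # Potential hubs: every page with at least one link into the root set
    potential_hubs = {src for src, dst in edges if dst in root}
    base = root | potential_hubs
    # Keep only the edges whose two endpoints are both in the base set
    base_edges = [(src, dst) for src, dst in edges if src in base and dst in base]
    return base, base_edges

# Hypothetical web graph as (source, target) hyperlinks
web_edges = [
    ("D", "A"), ("E", "A"), ("E", "B"), ("F", "C"),
    ("D", "E"),   # link between two potential hubs, kept
    ("G", "D"),   # G does not link into the root set, so it is left out
    ("A", "H"),   # H is only linked from the root set, so it is left out
]

base, base_edges = build_base_set(web_edges, root_set={"A", "B", "C"})
print(sorted(base))
print(base_edges)
```

The resulting edge list is the subnetwork on which the hub and authority scores are then computed.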


So the difference between **Hubs and Authorities** and **PageRank** is that **rather than taking the full network, we're starting with a subset of the network**. Again, looking at first just the root set, the web pages that may be relevant, and then any page that links to it. And so these will be just the subset of the full network of the web.

### HITS Algorithm

Now we're going to run the HITS algorithm on this network. The **HITS algorithm just like PageRank works by computing k iterations and keeping track of the score for every node**.

Now, the **difference between HITS and PageRank** is that the **HITS algorithm is going to keep track of two kinds of scores for every node, the Authority score and the Hub score**. 

The first step is to give every node an authority and a hub score of 1, and then we're going to apply two different rules. These rules are going to be similar to the rules that we used when we computed PageRank, but again now we're going to have to keep track of two different scores. So the first rule is going to be the Authority Update Rule, which says that each node's authority score is going to be the sum of the hub scores of each node that points to that node.

Computing *k* iterations of the HITS algorithm to assign an *Authority score* and a *Hub score* to each node in the network:
1. Assign each node an Authority and a Hub Score of 1
2. Apply the **Authority Update Rule :** Each node's **<font color='red'>authority score</font>**  is the sum of the **<font color='red'>hub scores</font>** of each node that **<font color='red'>points to it</font>**
3. Apply the **Hub Update Rule :** Each node's **<font color='red'>hub score</font>**  is the sum of the **<font color='red'>authority scores</font>** of each node that **<font color='red'>it points to</font>**
4. **Normalize** the Authority Score and the Hub Score:
$$authority(j) = \frac{authority(j)}{\sum_{i \in N}{authority(i)}} \qquad hub(j) = \frac{hub(j)}{\sum_{i \in N}{hub(i)}}$$
5. Repeat the process *k* times
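The steps above can be sketched in plain Python. The edge list and the function `hits_iterations` are hypothetical, and the sketch assumes that both update rules read the scores from the previous iteration before anything is overwritten.

```python
def hits_iterations(edges, k):
    """Run k iterations of the HITS updates on a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}   # step 1: every score starts at 1
    hub = {n: 1.0 for n in nodes}
    for _ in range(k):
        new_auth = {n: 0.0 for n in nodes}
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]   # authority: sum of hub scores pointing in
            new_hub[src] += auth[dst]   # hub: sum of authority scores pointed to
        # Normalize so that each kind of score sums to 1
        auth_total = sum(new_auth.values())
        hub_total = sum(new_hub.values())
        auth = {n: v / auth_total for n, v in new_auth.items()}
        hub = {n: v / hub_total for n, v in new_hub.items()}
    return auth, hub

# Hypothetical base-set network: D, E, F link to the pages A, B, C
example_edges = [
    ("D", "A"), ("D", "B"),
    ("E", "A"), ("E", "B"), ("E", "C"),
    ("F", "B"),
]

auth1, hub1 = hits_iterations(example_edges, k=1)
auth, hub = hits_iterations(example_edges, k=50)
print(auth1, hub1)
print(auth, hub)
```

After the first iteration, the unnormalized authority score of a node equals its in-degree and the unnormalized hub score equals its out-degree, because every score starts at 1. In this example B has in-degree 3 out of 6 edges, so its authority score after one iteration is 0.5.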

$$$$

So you can see at this point that what we're really doing when we get this **new authority score is looking at the in-degree of each one of the nodes**, since every hub score starts at 1 (the in-degree of a node in a directed graph is the number of edges pointing from other nodes towards that node).

Let's move to the **new hub scores. It's going to be very similar, but now instead of looking at the in-degree of every node, we're going to look at the out-degree** (the out-degree of a node in a directed graph is the number of edges pointing from that node outwards to other nodes).

Next, we have to **normalize, and so, to normalize we have to add up the authority scores and add up the hub scores**.

And so if we do that, we normalize the scores. Now the new authority and hub scores become our old scores, and that's the end of the first iteration $(k = 1)$. The new scores are then used to repeat the same process and begin the 2nd iteration. 

It's important to note that in the **first iteration, every authority and hub score for every node in the network had the value of 1**, but now, **after the first iteration and the normalization, every node may have different scores from the rest of the nodes**, so it's important to be careful from this point forward, calculation wise.


### HITS Algorithm - Convergence

- What happens to the scores if we continue iterating the algorithm over and over and over again?
- Are they going to converge to a unique value?

As it turns out, for most networks, as k gets larger $(k \rightarrow \infty)$, the **Authority and the Hub scores actually do converge to a unique value**.


### HITS Algorithm - NetworkX

Now, to compute the Hub scores and the Authority scores of network using `NetworkX`, you can use the function `hits()` and give it the graph that you're analyzing as input. The hits function will output **two dictionaries, keyed by node, that contain the hub and authority scores of all the nodes in that network**.
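A minimal usage sketch, assuming a small hypothetical directed graph in which D, E and F link to the pages A, B and C:

```python
import networkx as nx

# Hypothetical base-set network (not the one from the lecture)
G = nx.DiGraph()
G.add_edges_from([
    ("D", "A"), ("D", "B"),
    ("E", "A"), ("E", "B"), ("E", "C"),
    ("F", "B"),
])

# hits returns a pair of dictionaries keyed by node: (hubs, authorities)
hubs, authorities = nx.hits(G)

top_hub = max(hubs, key=hubs.get)
top_authority = max(authorities, key=authorities.get)
print(top_hub, top_authority)
```

Note the order of the returned pair: the hub scores come first and the authority scores second.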


### Summary

In summary, we find that the HITS algorithm starts by constructing a root set of relevant web pages and then expands it to a base set using the network structure.

And then HITS will assign an authority score and a hub score to every node in the network.

And here, nodes that have incoming edges from good hubs are thought to be good authorities, and then nodes that have outgoing edges to good authorities are thought to be good hubs. Authority scores and hub scores for most networks will converge to a unique value. And you can use NetworkX to find the scores by using the function `hits()` on any network that you want to.

$$$$

- The HITS algorithm starts by constructing a *root set* of relevant web pages and expanding it to a *base set*


- HITS algorithm then assigns an **Authority Score** and a **Hub Score** to each of the nodes of the Network


- Nodes that have **incoming edges from good Hubs** $\rightarrow$ **are good Authorities**


- Nodes that have **outgoing edges to good Authorities** $\rightarrow$ **are good Hubs**


- Authority Scores and Hub Scores in Most Networks **Converge to a Unique Value**


- Apply `NetworkX` to find the Authority and Hub Scores by using the function `hits()` on any network; remember that this function returns two dictionaries (hub scores and authority scores)

---

### Final Summary

So, if we summarize the whole document, we find that no pair of centrality measures produces the exact same ranking, but there are some commonalities, so you are able to pick out some of the nodes that are very central. Of course, the centrality measures make different assumptions about what it means to be a central node, and that's why they produce different rankings. **Which centrality measure is best really depends on the context of the network** that you're analyzing. And usually, the **best thing to do to identify central nodes is to use multiple centrality measures and figure out which nodes come out central in many of them, rather than relying on a single one**.
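One way to act on this advice is to compute several measures and count how often each node appears near the top. The sketch below does this on the karate club network that ships with `NetworkX`; the choice of four measures, the top-5 cutoff and the helper `top_n` are assumptions made for illustration. Note that `NetworkX` labels the nodes 0 to 33, so node 0 here corresponds to node 1 in these notes.

```python
import networkx as nx
from collections import Counter

G = nx.karate_club_graph()  # nodes are labeled 0..33

# Each function returns a dictionary keyed by node
measures = {
    "degree": nx.degree_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "pagerank": nx.pagerank(G, alpha=0.85, weight=None),  # ignore edge weights
}

def top_n(scores, n=5):
    """Return the n nodes with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Count in how many top-5 lists each node appears
votes = Counter()
for scores in measures.values():
    votes.update(top_n(scores))

# Nodes that are in the top 5 of every measure
consensus = [node for node, count in votes.most_common() if count == len(measures)]
print(votes.most_common())
print(consensus)
```

Nodes that show up in the top 5 of every measure are strong candidates for being central regardless of which definition of importance is used.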

