# Lab: Link Analysis
Data Mining 2019/2020 <br> 
Authors: Data Mining Teaching Team

**WHAT** This *optional* lab consists of several programming and insight exercises/questions. 
These exercises are meant to let you practice the theory covered in [Chapter 5][2] of "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, and J. D. Ullman. <br>

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam. <br>

**HOW** Follow the exercises in this notebook either on your own or with a friend. Use Mattermost
to discuss questions with your peers. For additional questions and feedback, please consult the TAs at the assigned lab session. The answers to these exercises will not be provided.


[2]: http://infolab.stanford.edu/~ullman/mmds/ch5.pdf

In this exercise you will implement the PageRank algorithm, named after Larry Page, co-founder of
Google. PageRank was designed to combat the growing number of term spammers. We will look at
PageRank and some of its adaptations, and finally use PageRank to determine
which airports in the US are most important.

## Exercise 1: PageRank

We will start this exercise with a small network, simulating the entire internet with a few sites. Then we will simulate what a random surfer would do on this network and where the random surfer is most likely to end up.

### Step 1

Investigate the data of transitions from one vertex to the other in the example below. The data is of the form:

### <center> source|destination|weight </center>

In this case, all weights are set to 1, meaning that all outgoing transitions from a vertex are equally likely.

**Example Data:**

<center> A|C|1 </center>
<center> A|D|1 </center> 
<center> B|A|1 </center> 
<center> B|D|1 </center> 
<center> C|A|1 </center> 
<center> D|B|1 </center> 
<center> D|C|1 </center> 
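As a minimal sketch of how a line in this format can be split into its fields, the snippet below parses a few lines held in memory. The three-vertex edge list and the helper name `parse_edge` are hypothetical; they are not part of the lab data or of the expected solution.

```python
def parse_edge(line):
    """Split one 'source|destination|weight' line into its three fields."""
    source, destination, weight = line.strip().split("|")
    return source, destination, int(weight)

# Hypothetical toy edge list, not the lab's example file.
toy_lines = ["X|Y|1", "X|Z|1", "Y|Z|1", "Z|X|1"]
toy_edges = [parse_edge(line) for line in toy_lines]
print(toy_edges)
```

Splitting on `|` rather than on commas or whitespace keeps vertex names that contain commas or spaces intact.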

### Question 1.1

Draw the directed graph based on this data.

### Question 1.2

Write out the transition matrix for this network. Verify that all columns sum up to 1.
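Once a matrix has been written out by hand, the column sums can be checked numerically. The sketch below does this for a hypothetical three-vertex graph (X, Y, Z), not for the example network; the variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy graph: X -> Y, X -> Z, Y -> Z, Z -> X.
# Column j describes where a surfer at vertex j goes next (order X, Y, Z).
M = np.array([
    [0.0, 0.0, 1.0],  # probability of arriving at X
    [0.5, 0.0, 0.0],  # probability of arriving at Y
    [0.5, 1.0, 0.0],  # probability of arriving at Z
])

column_sums = M.sum(axis=0)  # one total per source vertex
is_stochastic = np.allclose(column_sums, 1.0)
print(column_sums, is_stochastic)
```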

### Question 1.3

If we initialize a random surfer at a random location, what is the probability that the surfer is at a certain location after one iteration? Calculate these probabilities by hand.

### Step 2

Complete the `importData` function and import the data from the given example. Print the resulting data object to see how the data is stored.

In [None]:
from collections import OrderedDict

# This path might be different on your local machine.
example = 'data/example.txt'

def importData(example): 
    """
    This function loads the given datasets in an OrderedDictionary Object and
    can be used for the consecutive steps in this assignment.
    :param example: The input file containing the (example) data
    :return: An OrderedDictionary containing an OrderedDictionary for each data point
    """
    
    # extract data
    with open(example) as f:
        lines = [line.rstrip('\n') for line in f]
    
    # init data structure
    data = OrderedDict()
    for l in lines:
        line = l.split("|")
        data[line[0]] = OrderedDict()

    # Set all possible connections to 0
    # e.g. the OrderedDict of A should look like OrderedDict([('A', 0), ('B', 0), ...])
    
    # Insert code here!
    
    # Update connection with values given from example
    # ex) The first element of the OrderedDict of B should be ('A', 1)
    
    # Insert code here!
    
    return data
 

data = importData(example)
data

### Step 3

Next, a transition matrix has to be constructed, by creating the function: `constructTransitionMatrix`.

In [None]:
import numpy as np

def constructTransitionMatrix(data):
    """
    This function returns a transitionMatrix based on the data given by importData.
    Note: You can convert ODict object to lists by list(ODict_Object).
    :param data: The OrderedDictionary containing the input data
    :return: An array representing the transition matrix
    """
    matrix = None

    # Insert code here!
    
    return matrix

transMatrix = constructTransitionMatrix(data)
transMatrix

### Question 1.4

Is the output matrix from the function `constructTransitionMatrix` the same as the matrix you calculated in question 1.2?

### Step 4

Finish the `getRandomSurfer` function, which should create a vector of length equal to the number of vertices in the data. All elements should be equal and together sum to one. In other words, it should construct the following vector:

<center>$v = \begin{bmatrix}\dfrac{1}{n} \\ \dfrac{1}{n} \\ . \\ . \\ . \\ \dfrac{1}{n}\end{bmatrix}$</center>  
  
Where $n$ is the number of vertices in the data.

In [None]:

def getRandomSurfer(data):
    """
    This function returns a vector of length equal to the number of vertices in the given data
    :param data: The OrderedDictionary containing the input data
    :return: An array where each value has the same probability summing up to 1
    """
    result = None
    
    # Insert code here!
    
    return result

getRandomSurfer(data)

### Step 5

Now complete the `calculatePageRank` function. This function should take the transition matrix, get a random surfer vector, and repeatedly multiply the two for a given number of iterations. The iterative step is:  

<center>$v' = Mv$</center>  

Where $M$ is the transition matrix.
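The iterative step can be sketched on a small hand-written matrix. The three-vertex graph below is hypothetical and differs from the example network, and the choice of 50 iterations is arbitrary; this is an illustration of the repeated multiplication, not the reference implementation.

```python
import numpy as np

# Hypothetical column-stochastic matrix for the toy graph
# X -> Y, X -> Z, Y -> Z, Z -> X (rows and columns ordered X, Y, Z).
M = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

n = M.shape[0]
v = np.full(n, 1.0 / n)  # uniform starting distribution

for _ in range(50):
    v = M @ v  # one application of v' = Mv

print(v)
```

Because every column of `M` sums to one, the entries of `v` keep summing to one after each step.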

Run the `calculatePageRank` function on the example dataset with 10 iterations. Verify that the result is approximately as follows:  

<center>$v_{10} = \begin{bmatrix}A \\ B \\ C \\ D\end{bmatrix} = \begin{bmatrix}0.354 \\ 0.119 \\ 0.294 \\ 0.233\end{bmatrix}$</center>

In [None]:
def calculatePageRank(data, transMatrix, iterations):
    """
    This function calculates the page rank based on the initial data of importData,
    a given transitionMatrix (transMatrix) and a given amount of iterations.
    :param data: The OrderedDictionary containing the input data
    :param transMatrix: The transition matrix
    :param iterations: The number of iterations
    :return: A dictionary mapping each data item to its PageRank.
    """
    
    # Init result
    result = dict()
    
    # Take randomSurfer
    
    # Insert code here!
    
    # Multiply transMatrix with the randomSurfer vector (repeat `iterations` times)
    # Set pagerank for each key of the given data
    
    # Insert code here!

    return result

calculatePageRank(data, transMatrix, 10)

### Step 6

Now run the `calculatePageRank` function on the `data/example2.txt` dataset with at least 10 iterations.   
  
**example2 Data:**  
<center>A|C|1</center>
<center>A|D|1</center>
<center>B|A|1</center>
<center>B|D|1</center>
<center>C|C|1</center>
<center>D|B|1</center>
<center>D|C|1</center>

As you can see this dataset is slightly different. The edge from C to A is replaced by an edge from C to C itself.

In [None]:
# This path might be different on your local machine.
example2 = 'data/example2.txt'

# Replace this with your implementation!
data2 = None
transMatrix2 = None
calculatePageRank(data2, transMatrix2, 10)

### Question 1.5

Explain the results you now get from the PageRank algorithm.

### Step 7

To make sure that nodes like C in example2 do not corrupt our results, we can use taxation, which allows the random surfer to jump from one page to a randomly chosen other page. This changes our iterative step to:

<center>$v' = \beta Mv + \dfrac{(1 - \beta)e}{n}$</center>  

Where $e$ is a vector of all ones, $n$ is the number of vertices in the data and $\beta$ is a constant.  
Implement the function `taxationPageRank` which calculates this modified PageRank value using the iterative step. You may set $\beta$ to 0.8.
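A single taxed step can be sketched as follows and compared with the plain step. The toy graph, the helper name `taxed_step`, and the 50 iterations are assumptions made for illustration; the graph is not one of the lab datasets.

```python
import numpy as np

# Hypothetical toy graph with a self-loop: X -> Y, X -> Z, Y -> Z, Z -> Z.
# Rows and columns are ordered X, Y, Z; every column sums to one.
M = np.array([
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 1.0],
])

def taxed_step(M, v, beta):
    """Apply one step of v' = beta * M v + (1 - beta) * e / n."""
    n = len(v)
    return beta * (M @ v) + (1.0 - beta) * np.ones(n) / n

n = M.shape[0]
plain = np.full(n, 1.0 / n)
taxed = np.full(n, 1.0 / n)
for _ in range(50):
    plain = M @ plain                       # iteration without taxation
    taxed = taxed_step(M, taxed, beta=0.8)  # iteration with taxation

print(plain)
print(taxed)
```

Both vectors remain probability distributions, since the taxed step redistributes a fraction $1 - \beta$ of the mass evenly over all vertices.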

In [None]:
def taxationPageRank(data, transMatrix, beta, iterations):
    """
    This function calculates the page rank using taxation based on the initial data 
    of importData, a given transitionMatrix (transMatrix), a given beta for the 
    taxation and a given amount of iterations
    :param data: The OrderedDictionary containing the input data
    :param transMatrix: The transition matrix
    :param beta: The probability of following a link (1 - beta is the probability of a random jump)
    :param iterations: The number of iterations
    :return: A dictionary mapping each data item to its PageRank.
    """
    
    # Init result
    result = dict()
    
    # calc v' iteratively
    # Set pagerank for each key of the given data
    
    # Insert code here!
    
    return result

taxationPageRank(data2, transMatrix2, 0.8, 10)

### Question 1.6

Are the results better using the `taxationPageRank` function? What happens if we lower $\beta$? What happens if we increase it?

### Step 8

Check out the `data/flight_data.txt` file.  
**flight_data (first 10 rows):**  
<center>Cincinnati, OH|Omaha, NE|1</center>
<center>Cincinnati, OH|Los Angeles, CA|56</center>
<center>Cincinnati, OH|Milwaukee, WI|26</center>
<center>Cincinnati, OH|Charlotte, NC|123</center>
<center>Cincinnati, OH|Raleigh/Durham, NC|50</center>
<center>Cincinnati, OH|Nashville, TN|50</center>
<center>Cincinnati, OH|Chicago, IL|353</center>
<center>Cincinnati, OH|Fort Myers, FL|34</center>
<center>Cincinnati, OH|Orlando, FL|87</center>
<center>Cincinnati, OH|San Francisco, CA|25</center>


This file contains information regarding airports in the US and the flights between them. Each line represents a connection from one airport to another, with the weight equal to the number of flights in January 2013. Run the algorithm on this dataset.
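With weights other than 1, the transitions out of a vertex are no longer equally likely. One possible reading, sketched below, is to divide each weight by the total outgoing weight of its source. The vertices P, Q, R and their counts are made up for illustration and are not taken from `flight_data.txt`; the sketch also assumes every vertex has at least one outgoing edge.

```python
import numpy as np

# Hypothetical weighted edges (made-up counts, not real flight data).
vertices = ["P", "Q", "R"]
edges = [("P", "Q", 30), ("P", "R", 10), ("Q", "P", 5),
         ("R", "P", 20), ("R", "Q", 20)]
index = {name: i for i, name in enumerate(vertices)}

# W[i, j] holds the raw weight of the edge from vertex j to vertex i.
W = np.zeros((len(vertices), len(vertices)))
for source, destination, weight in edges:
    W[index[destination], index[source]] = weight

# Divide each column by its total so that it becomes a probability distribution.
M = W / W.sum(axis=0)
print(M)
```

Here a surfer at P moves to Q with probability 30 / 40 and to R with probability 10 / 40.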

In [None]:
# This path might be different on your local machine.
example3 = 'data/flight_data.txt'

# Replace this with your implementation!
data3 = None
transMatrix3 = None
flightsPageRank = taxationPageRank(data3, transMatrix3, 0.8, 10)
flightsPageRank

### Question 1.7

What is the most important airport according to the results?
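One way to inspect more than a single entry of a rank dictionary is to sort its items by value. The sketch below assumes a dictionary that maps vertex names to scores; the names and scores are made up.

```python
# Hypothetical rank dictionary with made-up scores.
toy_ranks = {"P": 0.5, "Q": 0.2, "R": 0.3}

# Sort (name, score) pairs by score, highest first, and keep the top two.
top_two = sorted(toy_ranks.items(), key=lambda item: item[1], reverse=True)[:2]
print(top_two)
```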

In [None]:
import operator

max(flightsPageRank.items(), key=operator.itemgetter(1))