# CSPB 3104 Assignment 8: Problem Set
## Instructions

> This assignment is to be completed and uploaded to 
moodle as a python3 notebook. 

> Submission deadlines are posted on moodle. 

> The questions  provided  below will ask you to either write code or 
write answers in the form of markdown.

> Markdown syntax guide is here: [click here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

> Using markdown you can typeset formulae using latex.

> This way you can write nice readable answers with formulae like thus:

>> The algorithm runs in time $\Theta\left(n^{2.1\log_2(\log_2( n \log^*(n)))}\right)$, 
wherein $\log^*(n)$ is the iterated logarithm.

__Double click anywhere on this box to find out how your instructor typeset it. Press Shift+Enter to go back.__


----

## Question 1: Shortest Cycle Involving a Given Node.

You are given a directed graph $G: (V, E)$ using an adjacency list representation and a vertex (node) $u$ of the graph.
Write an algorithm to perform the following tasks:

__1(A)__ Write an algorithm that decides (true/false) whether the vertex $u$ belongs to a cycle.

What is the complexity for your algorithm in terms of the number of vertices $|V|$ and the number of edges $|E|$?

Note: Throughout this assignment you may describe your algorithms using words and definitely use algorithms that you have already learned in class. A brief description will do.


A vertex $u$ belongs to a cycle iff a DFS rooted at $u$ finds a back edge into $u$.  
That is, there exists an edge $(v, u)$ from some vertex $v$ reachable from $u$ (a descendant of $u$ in the DFS tree) back to $u$.  

Writing d for discovery time and f for finish time:

1. Run DFS-Visit starting from $u$ only, recording the discovery and finish times of each vertex reached.

2. While scanning the adjacency list of each discovered vertex $v$, check whether $u$ appears in it.
-Because $u$ is the root of the search, it stays on the recursion stack until the search ends, so any such edge $(v,u)$ is a back edge: u.d<=v.d<v.f<=u.f  

For directed graphs, it is important to track the vertices currently on the recursion stack, not just the visited ones.  
An edge to a vertex that is visited but already finished does not close a cycle; an edge to a vertex still on the recursion stack does.  
Here the only stack vertex of interest is $u$, which is on the stack for the whole search.  

If an edge $(v,u)$ is found, then vertex $u$ belongs to a cycle. Return True.  
If the search finishes without finding one, return False.  

The complexity of this algorithm follows from DFS. Only the vertices and edges reachable from $u$ are explored, so it is $O(|V|+|E|)$, and $\Theta(|V|+|E|)$ in the worst case, where |V| is the number of vertices and |E| is the number of edges. 
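
The check for 1(A) can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function name `on_cycle` and the representation of the graph as a dict (or list) mapping each vertex to its list of out-neighbors are assumptions. It uses an explicit stack instead of recursion and skips the discovery and finish times, since the true/false answer only needs to know whether some vertex reachable from `u` has an edge back to `u`.

```python
def on_cycle(adj, u):
    """Return True iff vertex u lies on a directed cycle.

    adj maps each vertex to a list of its out-neighbors (assumed format).
    """
    visited = {u}          # vertices already discovered
    stack = [u]            # explicit stack replaces the recursion
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w == u:
                # Edge (v, u) from a vertex reachable from u: cycle through u.
                return True
            if w not in visited:
                visited.add(w)
                stack.append(w)
    return False
```

Each reachable vertex is pushed once and each reachable edge is scanned once, which matches the $O(|V|+|E|)$ bound.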

__1(B)__ Write an algorithm which prints the smallest length cycle involving the vertex $u$.

What is the complexity for your algorithm in terms of the number of vertices $|V|$ and the number of edges $|E|$?


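We can use the Floyd-Warshall algorithm, which finds all-pairs shortest distances, although this is expensive.  
We modify the CLRS algorithm in this way:  
Initialize $dist[i][j] = 1$ if the edge $(i, j)$ exists and $dist[i][j] = \infty$ otherwise, so that the length of a cycle is its number of edges.  
In particular, we set $dist[u][u] = \infty$ instead of 0 (and likewise for every other diagonal entry),  
which makes $dist[u][u]$ the length of the shortest nonempty path from u back to u (a cycle).  
To print the cycle and not just its length, also keep a successor matrix: $next[i][j]$ is the first vertex after $i$ on the current shortest path from $i$ to $j$, and is updated together with $dist[i][j]$.  

After the algorithm runs, $dist[u][u]$ is the length of the shortest cycle involving vertex u, or $\infty$ if u is on no cycle.  
The cycle itself is printed by following $next[\cdot][u]$ from u until u is reached again.  
The running time of Floyd-Warshall is $\Theta(|V|^3)$. Since every edge has length 1, a BFS from u that stops at the first edge leading back to u would find a shortest cycle in $O(|V|+|E|)$. 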
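
A minimal Python sketch of the Floyd-Warshall variant for 1(B) is given here. The function name `shortest_cycle_through`, the successor matrix `nxt`, and the assumption that vertices are numbered `0..n-1` with `adj[i]` listing the out-neighbors of `i` are illustrative choices, not part of the assignment. The diagonal starts at infinity so that `dist[u][u]` ends up as the length of the shortest cycle through `u`.

```python
import math


def shortest_cycle_through(adj, u):
    """Return the shortest cycle through u as a vertex list [u, ..., u],
    or None if u is on no cycle. adj[i] lists the out-neighbors of i."""
    n = len(adj)
    # Diagonal is infinity (not 0) so dist[i][i] can record a cycle length.
    dist = [[math.inf] * n for _ in range(n)]
    nxt = [[None] * n for _ in range(n)]   # nxt[i][j]: first hop from i toward j
    for i in range(n):
        for j in adj[i]:
            dist[i][j] = 1                 # every edge has length 1
            nxt[i][j] = j
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    if dist[u][u] == math.inf:
        return None
    # Follow successors from u until we arrive back at u.
    cycle = [u]
    cur = nxt[u][u]
    while cur != u:
        cycle.append(cur)
        cur = nxt[cur][u]
    cycle.append(u)
    return cycle
```

For the graph with edges 0→1, 1→2, 1→0, 2→0, 2→3, calling `shortest_cycle_through([[1], [2, 0], [0, 3], []], 0)` returns `[0, 1, 0]`.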

----

## Question 2: Tracing an Epidemic

An email with a malicious attachment has evaded the antivirus software of company X.
We know that the CEO's computer was infected during a business trip last month. Since then, investigators have 
been trying to determine whose mailboxes could be infected. For an employee's mailbox to be infected, he or she must have received
and read  an email sent by an already affected employee. 

Starting from the time $0$ denoting when the CEO's mailbox was first infected, investigators have "metadata" for all
the emails from all employees in the form

$(P_i, P_j, t_k, t_l)$ meaning that employee $P_i$ sent an email at time $t_k$ to employee $P_j$, and $P_j$ opened the email at
time $t_l > t_k$.  We assume that $P_j$'s mailbox is infected instantaneously at time $t_l$ if $P_i$'s mailbox was infected before time $t_k$. 

You are given a collection of email records in the form given above, and  you know that person $P_0$ is the CEO who was infected at time $t = 0$.

We ask if a given person of interest $P_j$ could have been infected at a given time of interest $t = T$.

__2(A)__ Write an algorithm that, given a person $P_j$ and time $T$, determines if $P_j$'s mailbox was infected before or at time $T$. What is the worst-case complexity of your algorithm in terms of the number of persons $|P|$ and the number of emails sent $|E|$?

**Hint** You need to first make a graph that represents the possible flow of the "infection" through emails. It is easier to make a complicated graph (in this case, one where each vertex represents more than just a person) and then run a simple graph algorithm (one of the vanilla algorithms we learned this week, i.e. BFS/DFS/Topological sort) rather than making a simple graph and running a complicated ad-hoc algorithm on it (if your algorithm requires table lookups or passing on metadata specific to the problem at hand, it's probably too complicated).  

Each node represents an employee, $P_i$. Each node has attributes,  
$P_i$=employee designation (to differentiate nodes).     
parent=parent of current node (the sender who infected it).  
visited=false  
infectionTime=inf  
A=adjacency list of node $P_i$, which may include $P_j$ and other nodes.   
-This list holds one entry per email sent by $P_i$: the receiver ($P_j$), the sent time ($t_k$), and the opened time ($t_l$).      

First, populate the adjacency list of each node from the metadata records $(P_i, P_j, t_k, t_l)$.  
The adjacency list for a vertex then contains every email (infected or not) that the vertex sent, with its sent time ($t_k$),     
receiver ($P_j$), and the time ($t_l$) at which the receiver opened it.  
Note that every record satisfies $t_l>t_k$, since employee $P_j$ opens the email after $P_i$ sends it.   
Also, $P_j$ is reachable from $P_i$ but not vice versa, so $P_i$ is not placed in the adjacency list of $P_j$ (making a directed graph).  
These adjacency lists define a directed graph that follows the flow of emails through employees.  
An arrow from person one to person two means the first person sent a possibly infected email to the second, who then opened it.   

Next, run a BFS-style search starting from the first infected person, which in this example is the CEO, $P_0$. 
Set $P_0$.parent=NIL and $P_0$.infectionTime=0, and enqueue $P_0$.  
Dequeue nodes in order of increasing infectionTime (a min-priority queue rather than a plain FIFO queue), because the first email found for a person in BFS order is not necessarily the one that person opened earliest.  
When a node $P_i$ is dequeued for the first time, set visited to true. Its infectionTime is now final, since any email processed later is opened at a later time.  
Then scan its adjacency list. An email $(P_i, P_j, t_k, t_l)$ passes on the infection only if $P_i.infectionTime < t_k$, that is, $P_i$ was already infected when the email was sent.  
For each such email with $t_l < P_j.infectionTime$, set $P_j.infectionTime = t_l$ and $P_j.parent = P_i$, and enqueue $P_j$.  
Emails sent before the sender was infected are skipped.    


To determine if a node, $P_j$, was infected before or at time T, access the attributes of $P_j$ after the search has completed,  
and compare $P_j.infectionTime$ to T.  

If $P_j.infectionTime \le T$, then $P_j$ was reachable via the infection graph and was infected before or at time T. 

Else, if node $P_j$ was not reachable via the infection path, infectionTime will still be infinity,  
or if the node was reachable, but $P_j.infectionTime > T$, either of these two conditions means that $P_j$ was not yet infected at time T.      

Each person is finalized once and each email is examined once, so with a binary heap as the priority queue the overall complexity is $O((|P|+|E|)\log|P|)$.  
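
The following Python sketch computes earliest infection times for 2(A). It is a sketch under stated assumptions, not the required solution: the names `infection_times` and `infected_by`, the tuple layout `(sender, receiver, t_sent, t_opened)`, and the use of `heapq` as a min-priority queue (so that each person is finalized at their earliest infection time, rather than at the first email a FIFO queue happens to reach) are all choices made for illustration. An email transmits the infection only if the sender was infected strictly before it was sent.

```python
import heapq
import math
from collections import defaultdict


def infection_times(emails, source=0):
    """Map each infected person to their earliest infection time.

    emails: iterable of (sender, receiver, t_sent, t_opened) records.
    """
    outbox = defaultdict(list)             # adjacency list: sender -> emails
    for sender, receiver, t_sent, t_opened in emails:
        outbox[sender].append((receiver, t_sent, t_opened))

    infected_at = {source: 0}
    heap = [(0, source)]                   # (infection time, person)
    done = set()                           # persons whose time is final
    while heap:
        t, p = heapq.heappop(heap)
        if p in done:
            continue                       # stale entry with a later time
        done.add(p)
        for receiver, t_sent, t_opened in outbox[p]:
            # Sender must already be infected when the email is sent.
            if t < t_sent and t_opened < infected_at.get(receiver, math.inf):
                infected_at[receiver] = t_opened
                heapq.heappush(heap, (t_opened, receiver))
    return infected_at


def infected_by(emails, person, T, source=0):
    """True iff person was infected before or at time T."""
    return infection_times(emails, source).get(person, math.inf) <= T
```

Because stale heap entries are skipped rather than removed, the heap can hold up to one entry per email, so this particular sketch runs in $O((|P|+|E|)\log|E|)$.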




__2(B)__ Write an algorithm that prints out each person who is infected in increasing order of the times in which they
first got infected.


Each node represents an employee, $P_i$, with the same attributes as in 2(A):  
$P_i$=employee designation (to differentiate nodes).        
parent=parent of current node (the sender who infected it).  
visited=false  
infectionTime=inf  
A=adjacency list of node $P_i$, which may include $P_j$ and other nodes.   
-This list holds one entry per email sent by $P_i$: the receiver ($P_j$), the sent time ($t_k$), and the opened time ($t_l$).  

Create a directed graph of infection, using the same graph as was created in 2(A).    
Next, run the search from 2(A) starting from the first infected person, which in this example is the CEO, $P_0$, so that every node has its infectionTime filled in.  
DFS finish times cannot be used for this ordering: the email graph may contain cycles, and the order in which DFS explores neighbors is unrelated to when the emails were opened.  
For example, if $P_0$ infects $P_1$ at time 10 and $P_2$ at time 5, a DFS may finish $P_1$ and $P_2$ in either order.  

Instead, collect every node whose infectionTime is finite. These are exactly the infected people.  
Sort this list by infectionTime in ascending order.  
Nodes whose infectionTime is still infinity were never infected and are left out.  
Simply print the list in order to output each person who is infected in increasing order of the times in which they first got infected.  

Sorting adds $O(|P|\log|P|)$ to the cost of the search in 2(A).  
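
As a small illustration for 2(B), the sketch below orders people by first infection time once those times are known. The function name `infection_order` and the input format (a dict from person to first infection time, with `math.inf` or a missing key for people never infected) are assumptions for this example. It sorts directly on the recorded times and does not use DFS finish times.

```python
import math


def infection_order(infected_at):
    """Return infected persons sorted by first infection time (ascending).

    infected_at: dict mapping person -> first infection time (assumed format);
    a value of math.inf means the person was never infected.
    """
    reached = [(t, p) for p, t in infected_at.items() if t != math.inf]
    return [p for t, p in sorted(reached)]


# Example: person 3 was never infected and is left out.
for person in infection_order({0: 0, 1: 6, 2: 3, 3: math.inf}):
    print(person)
```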

----

## Question 3: Testing Moth Age Expert

A person claims to have spent his life studying the emperor gum moth  *Opodiphthera eucalypti*. 
Given two moth samples, he claims to tell us which one is the older. Of course, 
we ourselves are no experts and they all in fact look the same to us.


We test the person as follows: (a) collect a large number $n$ of moth specimens; (b) randomly
select $m$ different pairs from our collection and have the person tell us which one is older; 
(c) record their answers and analyze them to see if they are _consistent_

Write an algorithm to detect if the "expert" opinions are _consistent_. 


**Hint:** We have refrained from discussing what consistency means in this case, but we can provide an example as a hint.

__Example__ 

Suppose $n= 4$ and the expert says that

Specimen \# $1$ is older than $2$, $3$ is older than $4$, $4$ is older than $2$ and $2$ is older
than $3$.

The expert's opinion is clearly *inconsistent*.

Suppose $n=4$ and the expert says that

Specimen \# $1$ is older than $2$, $3$ is older than $4$ and $4$ is older than $1$. The
expert's answer is *consistent*.



An opinion is consistent if the set of inequalities are each consistent with one another.  

The first example is not consistent. If 1>2, 3>4, 4>2, and 2>3, 
then by the transitive property of inequalities, 1>2>3>4, but this is inconsistent because it was given that 2<4.  

In the second example, 1>2, 3>4, and 4>1.  This means by the transitive property, that  
2<1<4<3. This is consistent with the m comparisons given.  

In graph terms, draw an edge $i \to j$ whenever the expert says $i$ is older than $j$. The opinions are consistent iff this directed graph has no cycle.  

First, create an $n \times n$ matrix, A, initialized to 0. Iterate over the m comparisons, and store $A[i][j]=1$ if the expert says $i>j$.   
Next, extend the table by the transitive property: whenever $A[i][k]=1$ and $A[k][j]=1$, set $A[i][j]=1$, and repeat until nothing changes.  
This is the transitive closure, which takes $\Theta(n^3)$ time with Warshall's algorithm.  
If this ever produces $A[i][i]=1$, or equivalently both $A[i][j]=1$ and $A[j][i]=1$, then some specimen is older than itself and the opinions are inconsistent.  
A DFS that looks for a back edge answers the same question in $\Theta(n+m)$.  

When the comparisons are consistent and determine a complete ordering, the row sums give a further check.  
One row should sum to $n-1$, and this moth number, L, should be older than all the other moths.  
If the sum of each row is taken, placed in a list and sorted,  
then the list should be continuous, having each value from 0 to $n-1$ exactly once.  

Repeated row sums alone do not prove inconsistency, 
since specimens that were never compared, even indirectly, can have equal row sums. The diagonal test is the decisive one. 

In the consistent example (above), the sums of the rows are 1, 0, 3, and 2, which when sorted form a continuous list from 0 to $n-1$, (0,1,2,3).   
This means that the L moth was specimen 3, which was older than the rest.  

In the inconsistent example, specimens 2, 3, and 4 form a cycle, so $A[2][2]$, $A[3][3]$, and $A[4][4]$ all become 1, and every row sums to 3. 
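
The consistency check for Question 3 can be sketched in Python as follows. This is one possible sketch, assuming specimens are numbered `1..n` and each comparison is a pair `(a, b)` meaning "a is older than b"; the name `is_consistent` is hypothetical. It builds the "older than" matrix, takes its transitive closure with Warshall's algorithm, and reports inconsistency when any specimen ends up older than itself.

```python
def is_consistent(n, comparisons):
    """Return True iff the 'older than' comparisons contain no cycle.

    comparisons: iterable of (a, b) pairs meaning specimen a is older than b,
    with specimens numbered 1..n (assumed format).
    """
    # Row and column 0 are unused so indices match specimen numbers.
    older = [[False] * (n + 1) for _ in range(n + 1)]
    for a, b in comparisons:
        older[a][b] = True
    # Warshall's transitive closure: i > k and k > j implies i > j.
    for k in range(1, n + 1):
        for i in range(1, n + 1):
            if older[i][k]:
                for j in range(1, n + 1):
                    if older[k][j]:
                        older[i][j] = True
    # A specimen older than itself means the comparisons form a cycle.
    return not any(older[i][i] for i in range(1, n + 1))
```

On the two examples from the problem statement, `is_consistent(4, [(1, 2), (3, 4), (4, 2), (2, 3)])` returns `False` and `is_consistent(4, [(1, 2), (3, 4), (4, 1)])` returns `True`.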


----

## Question 4: Testing if an undirected graph is acyclic

You are given a connected, undirected graph $G$ with $n$ vertices as an adjacency list. Write an algorithm to check if $G$ has a cycle that runs in time $\Theta(n)$.

*Hint* A connected, undirected acyclic graph is a tree. Since you are already given that $G$ is connected, you are just checking if $G$ is a tree. How many edges would a tree have?


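Connected acyclic undirected graphs are trees. A connected graph G: (V,E) is a tree iff |E|=|V|-1. 
That is, numEdges=numVertices-1.  
Here, n is known beforehand as the number of vertices (|V|).  
The algorithm is to iterate over each vertex in the adjacency list, keeping a counter, c, of the adjacency-list entries seen so far.    
Each undirected edge appears in exactly two adjacency lists (assuming no self-loops or parallel edges), so after a full scan c=2|E|, and no lookup table is needed to avoid double counting.   
Stop as soon as c exceeds 2(n-1): at that point |E|>n-1, so G is not a tree and has a cycle.  
Because of the early stop, at most 2(n-1)+1 entries are examined, so the running time is $O(n)$ regardless of |E|, and $\Theta(n)$ when G is a tree.  
If the scan finishes with c=2(n-1), then |E|=|V|-1, so G is a tree and has no cycle.  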
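
A minimal Python sketch of the edge-counting test for Question 4 follows. It assumes the graph is connected and simple (no self-loops or parallel edges) and is given as a list where `adj[v]` is the neighbor list of vertex `v`; the function name `has_cycle` is hypothetical. It adds up adjacency-list lengths and stops once the total exceeds $2(n-1)$.

```python
def has_cycle(adj):
    """Return True iff a connected, simple, undirected graph has a cycle.

    adj[v] is the list of neighbors of vertex v (assumed format).
    """
    n = len(adj)
    limit = 2 * (n - 1)        # a tree has n-1 edges, each listed twice
    total = 0
    for neighbors in adj:
        total += len(neighbors)
        if total > limit:
            return True        # more than n-1 edges: not a tree
    return False
```

In Python `len` on a list takes constant time, so the loop does $O(n)$ work even without the early exit; the early exit matters when entries have to be counted one at a time.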