# INFO 4271 - Exercise 6 - Link Analysis

Issued: May 28, 2024

Due: June 3rd, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Co-Linking Similarity 
The directed graph of resource pointers (e.g., hyperlinks on the Internet, or citations in academic publishing) implicitly encodes topic information but can be much cheaper to process than the content words of the individual documents.

a) Implement a document similarity measure based only on graph topology, assuming that documents are similar if they link to similar documents.

In [24]:
#An example graph topology. Each entry represents a document alongside the outgoing links found in its content. 
graph = {'D1' : ['D14', 'D16'],
		 'D2' : ['D5', 'D6', 'D7'],
		 'D3' : ['D4', 'D14', 'D15', 'D18', 'D19'],
		 'D4' : ['D2', 'D9', 'D14'],
		 'D5' : ['D2', 'D8', 'D17'],
		 'D6' : ['D3', 'D8', 'D12', 'D15'],
		 'D7' : ['D3', 'D19'],
		 'D8' : ['D1', 'D2', 'D3', 'D5', 'D9', 'D10', 'D11', 'D13', 'D14', 'D15', 'D17', 'D19'],
		 'D9' : [],
		 'D10' : ['D1', 'D14', 'D19'],
		 'D11' : ['D6'],
		 'D12' : ['D9', 'D11', 'D13', 'D16', 'D18'],
		 'D13' : ['D2', 'D4', 'D18'],
		 'D14' : ['D2', 'D14'],
		 'D15' : ['D7'],
		 'D16' : ['D2', 'D10', 'D16'],
		 'D17' : ['D1', 'D4', 'D6', 'D7', 'D11', 'D12'],
		 'D18' : ['D2', 'D9', 'D14'],
		 'D19' : [],
		 'D20' : ['D12']
		}

#Measure the similarity between two documents x and y in a graph based on their outgoing links. 
def sim_out(x, y, graph):
	links_x = set(graph[x])
	links_y = set(graph[y])

	intersection_links = links_x.intersection(links_y)
	union_links = links_x.union(links_y)

	if len(links_x) == 0 or len(links_y) == 0:
		return 0.0

	return round(len(intersection_links) / len(union_links), 3)

#Print a document similarity matrix 
l = '\t'
for doc in graph:
	l += doc+'\t'
print(l)
for doc in graph:
	l = doc+'\t'
	for d in graph:
		l += str(sim_out(doc, d, graph))+'\t'
	print(l)

	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14	D15	D16	D17	D18	D19	D20	
D1	1.0	0.0	0.167	0.25	0.0	0.0	0.0	0.077	0.0	0.25	0.0	0.167	0.0	0.333	0.0	0.25	0.0	0.25	0.0	0.0	
D2	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.071	0.0	0.0	0.333	0.0	0.0	0.0	0.333	0.0	0.286	0.0	0.0	0.0	
D3	0.167	0.0	1.0	0.143	0.0	0.125	0.167	0.214	0.0	0.333	0.0	0.111	0.333	0.167	0.0	0.0	0.1	0.143	0.0	0.0	
D4	0.25	0.0	0.143	1.0	0.2	0.0	0.0	0.25	0.0	0.2	0.0	0.143	0.2	0.667	0.0	0.2	0.0	1.0	0.0	0.0	
D5	0.0	0.0	0.0	0.2	1.0	0.167	0.0	0.154	0.0	0.0	0.0	0.0	0.2	0.25	0.0	0.2	0.0	0.2	0.0	0.0	
D6	0.0	0.0	0.125	0.0	0.167	1.0	0.2	0.143	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.111	0.0	0.0	0.25	
D7	0.0	0.0	0.167	0.0	0.0	0.2	1.0	0.167	0.0	0.25	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
D8	0.077	0.071	0.214	0.25	0.154	0.143	0.167	1.0	0.0	0.25	0.0	0.214	0.071	0.167	0.0	0.154	0.125	0.25	0.0	0.0	
D9	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
D10	0.25	0.0	0.333	0.2	0.0	0.0	0.25	0.25	0.0	1.0	0.0	0.0	0.0	0.25	0.0	0.0	0.125
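
Individual entries of the matrix can be checked by hand. The following is a minimal sketch, assuming the measure is the Jaccard coefficient of the two out-link sets, as in `sim_out` above. The names `graph_excerpt` and `jaccard_out` are introduced here for illustration only; the excerpt copies four entries of the example graph.

```python
# Excerpt of the example graph above (only the documents used in this check).
graph_excerpt = {
    'D1': ['D14', 'D16'],
    'D4': ['D2', 'D9', 'D14'],
    'D14': ['D2', 'D14'],
    'D18': ['D2', 'D9', 'D14'],
}

def jaccard_out(x, y, graph):
    """Jaccard coefficient of the out-link sets of x and y (0.0 if both are empty)."""
    links_x, links_y = set(graph[x]), set(graph[y])
    union = links_x | links_y
    if not union:
        return 0.0
    return round(len(links_x & links_y) / len(union), 3)

# D4 and D18 link to exactly the same documents.
print(jaccard_out('D4', 'D18', graph_excerpt))  # 1.0
# D1 and D14 share only D14 out of the union {D2, D14, D16}.
print(jaccard_out('D1', 'D14', graph_excerpt))  # 0.333
```

Both values agree with the corresponding cells (row D4, column D18 and row D1, column D14) of the matrix printed above.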

b) Now let us modify the above scheme to also use the documents' incoming links in the calculation of the similarity score.

In [37]:
#Measure the similarity between two documents x and y in a graph based on their incoming and outgoing links. 
def sim_inout(x, y, graph):
	#Jaccard coefficient of two sets; defined as 0.0 if both sets are empty.
	def jaccard(a, b):
		union = a.union(b)
		if len(union) == 0:
			return 0.0
		return len(a.intersection(b)) / len(union)

	links_out_x = set(graph[x])
	links_out_y = set(graph[y])

	#Collect the incoming links: all documents that link to x or to y, respectively.
	links_in_x = set()
	links_in_y = set()
	for document in graph:
		for link in graph[document]:
			if link == x:
				links_in_x.add(document)
			if link == y:
				links_in_y.add(document)

	#One possible combination (an assumption, other weightings are equally valid):
	#the unweighted mean of the out-link and in-link Jaccard scores.
	score_out = jaccard(links_out_x, links_out_y)
	score_in = jaccard(links_in_x, links_in_y)
	return round((score_out + score_in) / 2, 3)

#Print a document similarity matrix 
l = '\t'
for doc in graph:
	l += doc+'\t'
print(l)
for doc in graph:
	l = doc+'\t'
	for d in graph:
		l += str(sim_inout(doc, d, graph))+'\t'
	print(l)

c) Discuss the differences between these two similarity score variants. What are the salient advantages and disadvantages of each?
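
One difference that may be worth discussing is how the two variants treat documents without outgoing links. The sketch below uses a hypothetical toy graph (`toy_graph`, not the example graph above) and two illustrative helpers, `invert` and `jaccard`, to show that two such sink documents score 0.0 on out-links alone but can still be compared through the documents that link to them.

```python
# Hypothetical toy graph: C and D have no outgoing links at all.
toy_graph = {
    'A': ['C', 'D'],
    'B': ['C', 'D'],
    'C': [],
    'D': [],
    'E': ['C'],
}

def invert(graph):
    """Build the reverse index: for every document, the set of documents linking to it."""
    incoming = {doc: set() for doc in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, set()).add(source)
    return incoming

def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

incoming = invert(toy_graph)
out_score = jaccard(set(toy_graph['C']), set(toy_graph['D']))
in_score = jaccard(incoming['C'], incoming['D'])
print(out_score, round(in_score, 3))  # 0.0 0.667
```

The reverse index is built once in a single pass over the graph, which avoids rescanning every document for each pair. On the other hand, a document's incoming links are not under its author's control and are only known after the whole collection has been processed, whereas outgoing links can be read from the document itself.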

# 2. PageRank

The PageRank algorithm models page authoritativeness. Is it robust to tampering? Can you think of ways to game the PageRank scheme and give your website an artificially high score? What are ways to defend against such attacks?
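
To make the question concrete, the following is a minimal sketch of PageRank by power iteration, applied to a hypothetical four-page web before and after a "link farm" is attached to one page. The damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from pages without out-links, and all page names are assumptions made for this illustration, not part of the exercise.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Simplified PageRank via power iteration on a dict of out-link lists.

    Assumes every link target is also a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread uniformly over all pages.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Every page passes an equal share of its rank along each out-link.
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank

# Hypothetical web: 'spam' links out but receives no links.
honest = {
    'home': ['news', 'shop'],
    'news': ['home'],
    'shop': ['home'],
    'spam': ['home'],
}

# Link farm: five artificial pages whose only purpose is to link to 'spam'.
farmed = dict(honest)
for i in range(5):
    farmed[f'farm{i}'] = ['spam']

before = pagerank(honest)['spam']
after = pagerank(farmed)['spam']
print(round(before, 4), round(after, 4))
```

In this toy setup the score of `spam` rises once the farm pages point to it, even though the total rank is now shared among more pages. Each farm page contributes only its small teleportation share, so the gain scales with the number of farm pages, which suggests why defenses often target the detection of densely interlinked, otherwise unreferenced page clusters.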