# INFO 4271 - Exercise 6 - Link Analysis

Issued: May 28, 2024

Due: June 3rd, 2024

Please submit this filled sheet via Ilias by the due date.

---

# 1. Co-Linking Similarity 
The directed graph of resource pointers (e.g., hyperlinks on the Internet, or citations in academic publishing) implicitly encodes topic information but can be much cheaper to process than the content words of the individual documents.

a) Implement a document similarity measure based only on graph topology, assuming that documents are similar if they link to similar documents.

In [1]:
#An example graph topology. Each entry represents a document alongside the outgoing links found in its content. 
graph = {'D1' : ['D14', 'D16'],
		 'D2' : ['D5', 'D6', 'D7'],
		 'D3' : ['D4', 'D14', 'D15', 'D18', 'D19'],
		 'D4' : ['D2', 'D9', 'D14'],
		 'D5' : ['D2', 'D8', 'D17'],
		 'D6' : ['D3', 'D8', 'D12', 'D15'],
		 'D7' : ['D3', 'D19'],
		 'D8' : ['D1', 'D2', 'D3', 'D5', 'D9', 'D10', 'D11', 'D13', 'D14', 'D15', 'D17', 'D19'],
		 'D9' : [],
		 'D10' : ['D1', 'D14', 'D19'],
		 'D11' : ['D6'],
		 'D12' : ['D9', 'D11', 'D13', 'D16', 'D18'],
		 'D13' : ['D2', 'D4', 'D18'],
		 'D14' : ['D2', 'D14'],
		 'D15' : ['D7'],
		 'D16' : ['D2', 'D10', 'D16'],
		 'D17' : ['D1', 'D4', 'D6', 'D7', 'D11', 'D12'],
		 'D18' : ['D2', 'D9', 'D14'],
		 'D19' : [],
		 'D20' : ['D12']
		}

#Measure the similarity between two documents x and y in a graph based on their outgoing links.
#The score is the Jaccard coefficient of the two sets of link targets.
def sim_out(x, y, graph):
	links_x = set(graph[x])
	links_y = set(graph[y])

	#Two documents without any outgoing links are treated as identical.
	if len(links_x) == 0 and len(links_y) == 0:
		return 1.0

	intersection_links = links_x.intersection(links_y)
	union_links = links_x.union(links_y)

	#If only one of the sets is empty, the intersection is empty and the score is 0.0.
	return round(len(intersection_links) / len(union_links), 3)

#Print a document similarity matrix
l = '\t'
for doc in graph:
	l += doc+'\t'
print(l)
for doc in graph:
	l = doc+'\t'
	for d in graph:
		l += str(sim_out(doc, d, graph))+'\t'
	print(l)

	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14	D15	D16	D17	D18	D19	D20	
D1	1.0	0.0	0.167	0.25	0.0	0.0	0.0	0.077	0.0	0.25	0.0	0.167	0.0	0.333	0.0	0.25	0.0	0.25	0.0	0.0	
D2	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.071	0.0	0.0	0.333	0.0	0.0	0.0	0.333	0.0	0.286	0.0	0.0	0.0	
D3	0.167	0.0	1.0	0.143	0.0	0.125	0.167	0.214	0.0	0.333	0.0	0.111	0.333	0.167	0.0	0.0	0.1	0.143	0.0	0.0	
D4	0.25	0.0	0.143	1.0	0.2	0.0	0.0	0.25	0.0	0.2	0.0	0.143	0.2	0.667	0.0	0.2	0.0	1.0	0.0	0.0	
D5	0.0	0.0	0.0	0.2	1.0	0.167	0.0	0.154	0.0	0.0	0.0	0.0	0.2	0.25	0.0	0.2	0.0	0.2	0.0	0.0	
D6	0.0	0.0	0.125	0.0	0.167	1.0	0.2	0.143	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.111	0.0	0.0	0.25	
D7	0.0	0.0	0.167	0.0	0.0	0.2	1.0	0.167	0.0	0.25	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
D8	0.077	0.071	0.214	0.25	0.154	0.143	0.167	1.0	0.0	0.25	0.0	0.214	0.071	0.167	0.0	0.154	0.125	0.25	0.0	0.0	
D9	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	
D10	0.25	0.0	0.333	0.2	0.0	0.0	0.25	0.25	0.0	1.0	0.0	0.0	0.0	0.25	0.0	0.0	0.125

b) Now let us modify the above scheme to also use the documents' incoming links in the calculation of the similarity score.

In [23]:
#Measure the similarity between two documents x and y in a graph based on their incoming and outgoing links. 
def sim_inout(x, y, graph):
	links_out_x = set(graph[x])
	links_out_y = set(graph[y])

	intersection_links_out = links_out_x.intersection(links_out_y)
	union_links_out = links_out_x.union(links_out_y)

	similarity_out = 0.0
	if len(links_out_x) == 0 or len(links_out_y) == 0:
		if len(links_out_x) == 0 and len(links_out_y) == 0:
			similarity_out = 1.0
	else:
		similarity_out = len(intersection_links_out) / len(union_links_out)

	
	#Collect the incoming links of x and y by scanning the outgoing links of every document.
	links_in_x = set()
	links_in_y = set()

	for document in graph:
		for link in graph[document]:
			if link == x:
				links_in_x.add(document)
			if link == y:
				links_in_y.add(document)

	intersection_links_in = links_in_x.intersection(links_in_y)
	union_links_in = links_in_x.union(links_in_y)

	similarity_in = 0.0
	if len(links_in_x) == 0 or len(links_in_y) == 0:
		if len(links_in_x) == 0 and len(links_in_y) == 0:
			similarity_in = 1.0
	else:
		similarity_in = len(intersection_links_in) / len(union_links_in)


	return round((similarity_out + similarity_in) / 2, 3)

#Print a document similarity matrix
l = '\t'
for doc in graph:
	l += doc+'\t'
print(l)
for doc in graph:
	l = doc+'\t'
	for d in graph:
		l += str(sim_inout(doc, d, graph))+'\t'
	print(l)

	D1	D2	D3	D4	D5	D6	D7	D8	D9	D10	D11	D12	D13	D14	D15	D16	D17	D18	D19	D20	
D1	1.0	0.056	0.183	0.225	0.125	0.1	0.1	0.038	0.083	0.25	0.25	0.183	0.125	0.292	0.1	0.125	0.125	0.125	0.2	0.0	
D2	0.056	1.0	0.056	0.056	0.062	0.0	0.0	0.098	0.188	0.143	0.222	0.0	0.062	0.2	0.222	0.056	0.286	0.056	0.05	0.0	
D3	0.183	0.056	1.0	0.071	0.125	0.062	0.083	0.232	0.083	0.292	0.1	0.156	0.292	0.139	0.25	0.0	0.175	0.071	0.2	0.0	
D4	0.225	0.056	0.071	1.0	0.1	0.1	0.1	0.125	0.0	0.1	0.1	0.171	0.1	0.389	0.1	0.1	0.0	0.75	0.083	0.0	
D5	0.125	0.062	0.125	0.1	1.0	0.208	0.125	0.077	0.1	0.167	0.125	0.0	0.267	0.188	0.125	0.1	0.167	0.1	0.1	0.0	
D6	0.1	0.0	0.062	0.1	0.208	1.0	0.35	0.071	0.0	0.0	0.1	0.1	0.0	0.0	0.0	0.0	0.056	0.0	0.0	0.125	
D7	0.1	0.0	0.083	0.1	0.125	0.35	1.0	0.083	0.0	0.125	0.1	0.1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	
D8	0.038	0.098	0.232	0.125	0.077	0.071	0.083	1.0	0.0	0.125	0.0	0.232	0.036	0.083	0.125	0.077	0.229	0.125	0.0	0.0	
D9	0.083	0.188	0.083	0.0	0.1	0.0	0.0	0.0	1.0	0.1	0.2	0.0	0.25	0.188	0.083	0.083	0.1	

c) Discuss the differences between these two similarity score variants. What are the salient advantages and disadvantages they offer?

Similarity based on outgoing links:
If we only use the outgoing links to measure similarity, we assume that documents linking to the same documents have similar content or topics.
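
For reference, the score computed by `sim_out` in a) is the Jaccard coefficient of the two sets of link targets. Writing $\mathrm{Out}(d)$ for the set of documents that $d$ links to (a notation introduced here, not taken from the exercise sheet), it can be stated as:

$$
\mathrm{sim}_{\mathrm{out}}(x, y) =
\begin{cases}
1 & \text{if } \mathrm{Out}(x) = \mathrm{Out}(y) = \emptyset \\[4pt]
\dfrac{|\mathrm{Out}(x) \cap \mathrm{Out}(y)|}{|\mathrm{Out}(x) \cup \mathrm{Out}(y)|} & \text{otherwise}
\end{cases}
$$

For example, D1 and D3 share only the target D14 out of six distinct targets, which gives $1/6 \approx 0.167$, the value shown in the matrix above.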

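Advantages:
- Easier to compute, since the outgoing links of a document can be read directly from its content
- A link is a direct reference to another document, so shared link targets indicate topic-specific relatedness

Disadvantages:
- Doesn't take the full context into account because the incoming links are left out
- Can be biased towards documents that have many outgoing links (e.g., D8 shares at least one target with most other documents and therefore gets a nonzero score with most of them)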

Similarity based on in- and outgoing links:
If we use both incoming and outgoing links, we get a fuller picture of the context each document has within the graph.

Advantages:
- Provides a more balanced similarity measurement because it is based on both incoming and outgoing links
- Reduces the bias towards documents with many outgoing links that arises when only outgoing links are used
- Also captures the context, in particular how documents are referenced by others

Disadvantages:
- More complex to compute, because the incoming links have to be collected by scanning the whole graph
- We assume that incoming and outgoing links are of equal importance, which may not always be true (we could adjust the weight with which they influence the similarity score to solve this)
- We could misjudge similarity if, for example, the outgoing links of two documents are the same but the incoming links have no overlap
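
The weighting idea from the disadvantages above could be sketched as follows. This is a minimal illustration, not part of the submitted solution: the names `jaccard`, `build_incoming`, `sim_weighted`, the parameter `alpha` and the toy graph are all assumptions made for this example. With `alpha=0.5` the score is the same equal average that `sim_inout` in b) computes, and building the incoming-link index once avoids rescanning the graph for every pair.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_incoming(graph):
    """Invert the adjacency lists once: target -> set of documents linking to it."""
    incoming = {doc: set() for doc in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, set()).add(source)
    return incoming


def sim_weighted(x, y, graph, incoming, alpha=0.5):
    """Blend out-link and in-link similarity; alpha is the weight of the out-links."""
    similarity_out = jaccard(set(graph[x]), set(graph[y]))
    similarity_in = jaccard(incoming[x], incoming[y])
    return round(alpha * similarity_out + (1 - alpha) * similarity_in, 3)


# Tiny hypothetical graph (not the one from the exercise): A and B have
# identical outgoing links, but their incoming links do not overlap.
toy_graph = {
    'A': ['C', 'D'],
    'B': ['C', 'D'],
    'C': ['A'],
    'D': [],
}
toy_incoming = build_incoming(toy_graph)

print(sim_weighted('A', 'B', toy_graph, toy_incoming, alpha=0.5))
print(sim_weighted('A', 'B', toy_graph, toy_incoming, alpha=0.9))
```

In this toy case the equal-weight score is pulled down by the missing overlap of incoming links, while a larger `alpha` lets the identical outgoing links dominate. Which weight is appropriate would have to be decided for the data at hand.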


# 2. PageRank

The PageRank algorithm models page authoritativeness. Is it robust to tampering? Can you think of ways to game the PageRank scheme and give your website an artificially high score? What are ways to defend against such attacks?

PageRank is fairly robust but not completely protected against tampering. If we can artificially increase the number of pages that link to our page, especially pages with high trust/authority, we can increase our ranking significantly. We could do this by
- buying links from pages with high authority to transfer some of their authority to our page.
- organizing in groups that want to increase their authority and creating a network of websites that link to each other. We can recruit more participants by advertising it as a "link for link" program.
- manipulating link anchor texts. PageRank itself only looks at the link structure, but search engines combine it with anchor text, so manipulated anchor texts can make our page appear in the results for more search terms.
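
The effect of artificially added incoming links can be illustrated with a small power-iteration PageRank. This is a simplified sketch under stated assumptions: the damping factor of 0.85 is the conventional default, spreading the rank of dangling pages evenly is one common convention, and the page names and both toy graphs are hypothetical.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Basic power-iteration PageRank on a dict of adjacency lists."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                # Pass the damped rank on in equal parts along the out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical honest web: three pages linking in a cycle, plus our page,
# which nobody links to.
honest = {
    'P1': ['P2'],
    'P2': ['P3'],
    'P3': ['P1'],
    'target': ['P1'],
}

# Same web plus a link farm: five throwaway pages that only link to 'target'.
farmed = dict(honest)
for i in range(5):
    farmed[f'farm{i}'] = ['target']

before = pagerank(honest)['target']
after = pagerank(farmed)['target']
print(f"score of 'target' before: {before:.4f}, after: {after:.4f}")
```

Although the total rank of 1 is now shared among more pages, the score of `target` rises, because each farm page passes its whole damped rank on to it.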


Search engines like Google can defend against these types of manipulation by
- analysing network behavior and flagging suspicious link patterns
- punishing manipulation attempts with harsh penalties on the authority score of the pages involved
- analysing user behavior by measuring whether links are really used by organic users and analysing bounce rates (whether users leave the site immediately after clicking the link)
- giving more weight to links from pages with high authority, which makes it harder to manipulate the score from within a low-authority network
- giving users tools to report websites with suspicious linking
- having employees review suspicious pages manually
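
As a toy illustration of flagging suspicious link patterns, the following sketch marks pages whose outgoing links are almost all reciprocated, as would be typical for a "link for link" ring. It is only a heuristic invented for this example and not a description of how any real search engine works. The function names, the thresholds `threshold` and `min_links`, and the example graph are assumptions.

```python
def reciprocity(graph, page):
    """Fraction of a page's distinct out-links that link straight back to it."""
    targets = set(graph.get(page, [])) - {page}
    if not targets:
        return 0.0
    returned = {t for t in targets if page in graph.get(t, [])}
    return len(returned) / len(targets)


def flag_suspicious(graph, threshold=0.8, min_links=3):
    """Flag pages with several out-links, nearly all of them reciprocated."""
    return sorted(
        page for page in graph
        if len(set(graph[page]) - {page}) >= min_links
        and reciprocity(graph, page) >= threshold
    )


# Hypothetical web: S1..S4 form a "link for link" ring, N1..N3 link normally.
web = {
    'S1': ['S2', 'S3', 'S4'],
    'S2': ['S1', 'S3', 'S4'],
    'S3': ['S1', 'S2', 'S4'],
    'S4': ['S1', 'S2', 'S3'],
    'N1': ['N2', 'N3', 'S1'],
    'N2': ['N3'],
    'N3': ['N1'],
}

print(flag_suspicious(web))  # ['S1', 'S2', 'S3', 'S4']
```

A signal like this would produce false positives on its own (legitimate sites also link to each other), which is why it would be combined with the other measures listed above, such as manual review.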