In [1]:
import math
from collections import Counter

Exercise 10.1: Retrieval using TF-IDF and unigram language models

In [2]:

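def cosine_similarity(a, b):
    """Return the cosine similarity of two sparse vectors stored as dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot_product = sum(a[word] * b[word] for word in set(a) & set(b))

    magnitude = math.sqrt(sum(value**2 for value in a.values())) * math.sqrt(
        sum(value**2 for value in b.values())
    )
    # Guard against all-zero vectors (e.g. when every term has an IDF of 0).
    if magnitude == 0:
        return 0.0
    return dot_product / magnitude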

In [3]:

# Define the documents and the query
documents = {
    'd1': "the robot is insightful but you are strong and i may answer and the wise fox is insightful and you are insightful and i am insightful but i will explain the insightful bird",
    'd2': "the bird is insightful",
    'd3': "when will they explain the friendly insightful strong insightful bird and is the bird strong and is a strong robot insightful",
    'd4': "a cat is strong but you are cautious and i may help but a fox is insightful but are they strong and when may you answer"
}
query = "insightful bird"

# Split the documents and the query into words
documents = {doc: text.split() for doc, text in documents.items()}
query = query.split()

In [4]:
# Compute the term frequencies
tfs = {doc: Counter(words) for doc, words in documents.items()}
query_tf = Counter(query)

In [5]:

# Compute the document frequencies
dfs = Counter(word for words in documents.values() for word in set(words))

# Compute the inverse document frequencies
idfs = {word: math.log(len(documents) / df) for word, df in dfs.items()}


In [6]:
# Compute the TF-IDF vectors for the documents and the query
tfidfs = {
    doc: {word: tf * idfs[word] for word, tf in tfs[doc].items()} for doc in documents
}
query_tfidf = {word: tf * idfs.get(word, 0) for word, tf in query_tf.items()}

In [7]:



# Compute the cosine similarity between each document and the query
cosine_similarities = {
    doc: cosine_similarity(tfidf, query_tfidf) for doc, tfidf in tfidfs.items()
}

print("Cosine similarities between the query and the documents:")
# Use a loop variable that does not shadow the cosine_similarity function
for doc, similarity in cosine_similarities.items():
    print(f"{doc}: {similarity}")

# Compute the unigram probabilities for the query given by each document
unigram_probabilities = {
    doc: math.prod(tfs[doc].get(word, 0) / len(words) for word in query)
    for doc, words in documents.items()
}

print("\nUnigram probabilities for the query given by each document:")
for doc, unigram_probability in unigram_probabilities.items():
    print(f"{doc}: {unigram_probability}")

Cosine similarities between the query and the documents:
d1: 0.06591142916966435
d2: 0.7071067811865475
d3: 0.2249420126105055
d4: 0.0

Unigram probabilities for the query given by each document:
d1: 0.004591368227731864
d2: 0.0625
d3: 0.013605442176870746
d4: 0.0
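
The d2 values above can be checked by hand. In d2, only `the` and `bird` have a nonzero IDF (each occurs in three of the four documents), while `is` and `insightful` occur in every document and get an IDF of log(4/4) = 0. The query's only nonzero term is therefore `bird`, so the cosine reduces to 1/sqrt(2), and the unigram probability is (1/4) * (1/4). A minimal sketch of that check, with variable names chosen here purely for illustration:

```python
import math

# IDF of a term that occurs in 3 of the 4 documents ("the" and "bird").
idf = math.log(4 / 3)

# d2 = "the bird is insightful"; the zero-IDF terms drop out of the vector.
d2_vector = {"the": idf, "bird": idf}
query_vector = {"bird": idf}  # "insightful" has weight 0 in the query as well

dot = sum(d2_vector.get(word, 0) * value for word, value in query_vector.items())
norm = math.sqrt(sum(v * v for v in d2_vector.values())) * math.sqrt(
    sum(v * v for v in query_vector.values())
)
d2_cosine = dot / norm

# Unigram model: each query word occurs once among d2's four tokens.
d2_unigram = (1 / 4) * (1 / 4)

print(d2_cosine, d2_unigram)
```

The same reasoning explains the zeros for d4: it does not contain `bird`, so both its dot product with the query and its unsmoothed unigram probability are 0.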


Exercise 10.2: PageRank.

In [8]:
links = {
    'd1': ['d2', 'd3'],
    'd2': ['d4'],
    'd3': ['d2', 'd4'],
    'd4': ['d3']
}
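# Start every page with rank 1 (no damping factor is used here)
pageranks = {doc: 1 for doc in links}
print(pageranks)
while True:
    new_pageranks = {}

    # Each page receives an equal share of the rank of every page linking to it
    for doc in links:
        new_pageranks[doc] = sum(
            pageranks[incoming_doc] / len(links[incoming_doc])
            for incoming_doc in links
            if doc in links[incoming_doc]
        )
    # Stop once every rank changes by less than 0.001. The break happens before
    # the assignment below, so the printed ranks are from the previous iteration.
    if all(abs(new_pageranks[doc] - pageranks[doc]) < 0.001 for doc in links):
        break
    pageranks = new_pageranks

for doc, pagerank in pageranks.items():
    print(f"{doc}: {pagerank}")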


{'d1': 1, 'd2': 1, 'd3': 1, 'd4': 1}
d1: 0
d2: 0.80078125
d3: 1.599609375
d4: 1.599609375
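
The printed values sit close to d2 = 0.8 and d3 = d4 = 1.6. Because this graph has no dangling pages and the loop uses no damping, the total rank of 4 is conserved at every step. As a sanity check, the sketch below uses exact fractions to confirm that (0, 4/5, 8/5, 8/5) is a fixed point of the same update rule; `update` and `candidate` are names introduced here for illustration only.

```python
from fractions import Fraction

links = {
    "d1": ["d2", "d3"],
    "d2": ["d4"],
    "d3": ["d2", "d4"],
    "d4": ["d3"],
}


def update(ranks):
    """Apply one undamped step: each page splits its rank equally among its out-links."""
    return {
        doc: sum(
            (ranks[src] / len(targets) for src, targets in links.items() if doc in targets),
            Fraction(0),
        )
        for doc in links
    }


candidate = {
    "d1": Fraction(0),
    "d2": Fraction(4, 5),
    "d3": Fraction(8, 5),
    "d4": Fraction(8, 5),
}

is_fixed_point = update(candidate) == candidate
total_rank = sum(candidate.values())
print(is_fixed_point, total_rank)
```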


Exercise 10.3: Limitations and improvements of PageRank.

Query:
1. What are the limitations of the PageRank algorithm?
2. In what ways can the algorithm be improved?

CHATGPT RESPONSE:

# 1. Limitations of the PageRank Algorithm
The PageRank algorithm, despite its effectiveness in ranking web pages, has some limitations:

Static Nature: PageRank was initially designed for static web pages, and its effectiveness decreases when applied to dynamic or rapidly changing content on the internet.

*-> This is because PageRank scores are not calculated at the time of search but are determined at the time of indexing. [1]* 

Link Spamming: It is susceptible to manipulation by link spammers who create numerous links to artificially inflate the ranking of a page.

Topic Ignorance: PageRank doesn't consider the topical relevance of content, leading to potential inaccuracies in ranking when content is diverse. 
*-> It also struggles with natural-language queries and with information that lies outside the keywords. [1]*

# 2. Ways to Improve the PageRank Algorithm

To enhance the PageRank algorithm, several improvements have been proposed:

BlockRank Algorithm: An improved approach involves the BlockRank algorithm, which reduces the dimension of the linear system, making calculations more efficient for general nodes.

Adaptation to Evolving Networks: Recognizing the limitations in growing networks, modifications like considering the decay of relevance can address the failure of PageRank in ranking nodes effectively.

*-> The ideas above are correct, and I have little to add to them. Some further information I found:*

*Another method: [2]* 

*Eliminate dangling pages: Dangling pages are pages that have no outgoing links. They are not considered in the PageRank algorithm.*

*Aggregate outgoing links on pages with many site links.*

*Split long pages into several pages: This is a good way to improve the PageRank score. Because the PageRank algorithm is based on links, more links tend to mean a higher PageRank score. Splitting a long page into several pages therefore increases the number of links and can raise the PageRank score.*
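
Instead of leaving dangling pages out, a common variant of the algorithm treats a dangling page as if it linked to every page and adds a damping factor. The sketch below illustrates that idea under stated assumptions; it is not the method described in [2], and the function name `damped_pagerank`, the damping value 0.85, and the three-page example graph are all hypothetical choices.

```python
def damped_pagerank(links, damping=0.85, tol=1e-9, max_iter=1000):
    """Power iteration with damping; dangling pages spread their rank uniformly."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(max_iter):
        # Rank held by pages without out-links is shared equally among all pages.
        dangling_mass = sum(ranks[page] for page in pages if not links[page])
        new_ranks = {}
        for page in pages:
            incoming = sum(
                ranks[src] / len(links[src]) for src in pages if page in links[src]
            )
            new_ranks[page] = (1 - damping) / n + damping * (incoming + dangling_mass / n)
        converged = max(abs(new_ranks[page] - ranks[page]) for page in pages) < tol
        ranks = new_ranks
        if converged:
            break
    return ranks


# Hypothetical graph: "c" has no outgoing links, so it is a dangling page.
example_links = {"a": ["b", "c"], "b": ["c"], "c": []}
ranks = damped_pagerank(example_links)
print(ranks)
```

With this formulation the ranks always sum to 1, so the dangling page no longer drains rank from the rest of the graph.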

[1]: https://blogs.cornell.edu/info2040/2015/10/28/limitations-of-pagerank/
[2]: https://www.zdnet.com/article/3-simple-ways-to-optimize-for-google-pagerank/
