### Part 1: Simple Problem

**1. An IR system returns 10 relevant documents and 8 non-relevant documents. There are a total of 25 relevant documents in the collection. What is the precision of the system on this search, and what is its recall?**

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)

In [2]:
10 / (10 + 8) # precision

0.5555555555555556

In [3]:
10 / (10 + 15) # recall

0.4
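The same arithmetic can be wrapped in a small helper. This is only a sketch; the function name `precision_recall` and the variable names are assumptions made for illustration. The number of false negatives is derived as the relevant documents that were not returned (25 - 10 = 15).

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


retrieved_relevant = 10      # true positives
retrieved_non_relevant = 8   # false positives
total_relevant = 25          # relevant documents in the whole collection

# relevant documents the system failed to return
false_negatives = total_relevant - retrieved_relevant

precision, recall = precision_recall(retrieved_relevant, retrieved_non_relevant, false_negatives)
print(precision, recall)
```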

**2. Draw the inverted index that would be built for the following document collection.** 

Doc 1: new home sales top forecasts

Doc 2: home sales rise in july

Doc 3: increase in home sales in july

Doc 4: july new home sales rise

A simple inverted index can be built by collecting the documents (above), tokenizing each document, normalizing the tokens into indexing terms, and finally indexing the documents in an inverted index, consisting of a dictionary and postings. Let's just build it before we draw it. According to Manning et al.'s Introduction to Information Retrieval, inverted index structures are essentially without rivals as the most efficient structure for supporting ad hoc text search.

In [4]:
from nltk.tokenize import word_tokenize

docs = ["new home sales, top forecasts#", # inserted some non-alphanumeric characters to demonstrate normalization
        "home sales rise, in #july",
        "increase in home, sales in july",
        "july new home#:)/--'' sales, rise"] # list of docs

tokenized_docs = [word_tokenize(s) for s in docs]
normalized_docs = [[word.lower() for word in s if word.isalpha()] for s in tokenized_docs] # keep alphabetic tokens only; lower() is redundant here since the docs are already lowercase


In [5]:
from collections import defaultdict

inv_indx = defaultdict(list) # a defaultdict supplies a default value (an empty list) for a missing key, which avoids KeyErrors
for idx, text in enumerate(normalized_docs): # enumerating over the list of normalized docs and their indexes
    for word in text: 
        inv_indx[word].append(idx) # appending the indexes to which every word belongs. 

In [6]:
inv_indx # I hope that the output below counts as a drawing. 

defaultdict(list,
            {'new': [0, 3],
             'home': [0, 1, 2, 3],
             'sales': [0, 1, 2, 3],
             'top': [0],
             'forecasts': [0],
             'rise': [1, 3],
             'in': [1, 2, 2],
             'july': [1, 2, 3],
             'increase': [2]})
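Two details of the output above are worth noting: the document IDs are zero-based (so Doc 1 appears as 0), and the postings list for 'in' contains 2 twice, because "in" occurs twice in Doc 3. A postings list normally records each document once. The sketch below is one possible variant, not the method used above: it uses sets to deduplicate, one-based IDs to match the document names, and plain `str.split` on the clean documents instead of NLTK.

```python
from collections import defaultdict

# the four documents from the question, keyed by their one-based document number
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

# collect postings in sets so that a repeated term adds a document only once
postings = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        postings[term].add(doc_id)

# sort the dictionary terms and each postings list, as in a textbook drawing
index = {term: sorted(doc_ids) for term, doc_ids in sorted(postings.items())}

for term, doc_ids in index.items():
    print(f"{term:<10} -> {doc_ids}")
```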

**3. Consider two documents A and B whose Euclidean distance is d and cosine similarity is c (using no normalization other than raw term frequencies). If we create a new document A' by appending A to itself and another document B' by appending B to itself, then:**


**a) What is the Euclidean distance between A' and B' (using raw term frequency)?**



**Short answer:**
The Euclidean distance between A' and B' is $2d$, e.g., for the example below, 3.1622776601683795 * 2 = 6.324555320336759

**Long answer:** 
See below, up to subquestion b).

The equation for the Euclidean distance between 2 data objects is

$d\left( p,q\right)   = \sqrt {\sum _{i=1}^{n}  \left( q_{i}-p_{i}\right)^2 } $


where $n$ is the number of dimensions (attributes), and $p_i$ and $q_i$ are the $i$-th attributes of data objects $p$ and $q$, respectively. This means that we compute the difference for each individual attribute, square it, sum the squares, and finally take the square root of the sum. Let's run through an example, using raw term frequencies (as opposed to, say, TF-IDF) and 2 sentences of equal length, len = 10. 

In [7]:
doc1 = "the quicker brown dogs easily jumps over the lazy dogs" 
doc2 = "the quicker dogs pose a serious problem for lazy dogs"
corpus = [doc1, doc2]
len(doc1.split()), len(doc2.split())

(10, 10)

First we calculate the raw term frequencies for each doc, using the equation: tf(t,d) = count of t in d. Since the question asks for raw term frequencies, we do not divide by the number of words in d.

In [8]:
from collections import Counter

def calc_term_frequency(doc: str) -> dict:
    # raw term frequency: how often each whitespace-separated token occurs in the doc
    return dict(Counter(doc.split()))

tfs_doc1 = calc_term_frequency(doc1)
tfs_doc2 = calc_term_frequency(doc2)
print(tfs_doc1)

{'the': 2, 'quicker': 1, 'brown': 1, 'dogs': 2, 'easily': 1, 'jumps': 1, 'over': 1, 'lazy': 1}


Now we can calculate the pairwise Euclidean distance between the two docs. We can use scikit-learn to check that we got the right answer.

In [9]:
import math
# sanity check: the distance from doc1 to itself is 0
math.sqrt(sum((tfs_doc1.get(k, 0) - tfs_doc1.get(k, 0))**2 for k in set(tfs_doc1.keys()).union(set(tfs_doc1.keys())))) # output: 0.0
# distance between doc1 and doc2, summed over the union of their vocabularies
math.sqrt(sum((tfs_doc1.get(k, 0) - tfs_doc2.get(k, 0))**2 for k in set(tfs_doc1.keys()).union(set(tfs_doc2.keys())))) # output: 3.1622776601683795

3.1622776601683795

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

corpus_vect = CountVectorizer().fit_transform(corpus) # keep the sparse matrix; each row slice stays 2-D, so no .todense() (np.matrix) is needed

print(euclidean_distances(corpus_vect[0], corpus_vect[0])) # output: 0
print(euclidean_distances(corpus_vect[0], corpus_vect[1] )) # output: 3.

[[0.]]
[[3.]]


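The sklearn distance (3.0) is slightly smaller than the one we computed manually (3.1622776601683795). The difference is not a matter of rounding: CountVectorizer's default token pattern only keeps tokens of two or more characters, so it drops the word "a" from doc2. That leaves 9 differing counts instead of 10, i.e. $\sqrt{9} = 3$ versus $\sqrt{10} \approx 3.162$. With whitespace tokenization, our base Euclidean distance is d = 3.1622776601683795.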
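The gap between the two distances can be reproduced without scikit-learn. The sketch below assumes that CountVectorizer's default `token_pattern` is `(?u)\b\w\w+\b` (tokens of two or more word characters, as its documentation states); the helper `euclidean` is a hypothetical name introduced here.

```python
import math
import re
from collections import Counter

doc1 = "the quicker brown dogs easily jumps over the lazy dogs"
doc2 = "the quicker dogs pose a serious problem for lazy dogs"

# Assumed to mirror CountVectorizer's documented default token_pattern:
# only tokens with at least two word characters survive, so "a" is dropped.
SKLEARN_DEFAULT_PATTERN = r"(?u)\b\w\w+\b"


def euclidean(counts_a, counts_b):
    """Euclidean distance between two raw term-frequency dicts."""
    vocab = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a.get(t, 0) - counts_b.get(t, 0)) ** 2 for t in vocab))


# whitespace tokenization, as in the manual computation
split_dist = euclidean(Counter(doc1.split()), Counter(doc2.split()))

# regex tokenization with the assumed default pattern
regex_dist = euclidean(
    Counter(re.findall(SKLEARN_DEFAULT_PATTERN, doc1)),
    Counter(re.findall(SKLEARN_DEFAULT_PATTERN, doc2)),
)

print(split_dist)  # square root of 10
print(regex_dist)  # square root of 9
```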

Now, let's answer the question: what happens if we append each doc to itself and then run the whole thing again?

In [11]:
doc1_double = "the quicker brown dogs easily jumps over the lazy dogs the quicker brown dogs easily jumps over the lazy dogs" 
doc2_double = "the quicker dogs pose a serious problem for lazy dogs the quicker dogs pose a serious problem for lazy dogs"
corpus = [doc1_double, doc2_double]

tfs_doc1_double = calc_term_frequency(doc1_double)
tfs_doc2_double = calc_term_frequency(doc2_double)

print(math.sqrt(sum((tfs_doc1_double.get(k, 0) - tfs_doc1_double.get(k, 0))**2 for k in set(tfs_doc1_double.keys()).union(set(tfs_doc1_double.keys())))))
print(math.sqrt(sum((tfs_doc1_double.get(k, 0) - tfs_doc2_double.get(k, 0))**2 for k in set(tfs_doc1_double.keys()).union(set(tfs_doc2_double.keys())))))

corpus_vect = CountVectorizer().fit_transform(corpus) # keep the sparse matrix; each row slice stays 2-D, so no .todense() (np.matrix) is needed

print(euclidean_distances(corpus_vect[0], corpus_vect[0])) # output: 0
print(euclidean_distances(corpus_vect[0], corpus_vect[1])) # output: 6.

0.0
6.324555320336759
[[0.]]
[[6.]]


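Surprise, surprise! The Euclidean distance doubles: from 3.1622776601683795 to 6.324555320336759 in the manual computation, and from 3 to 6 with sklearn.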
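This is not specific to the example. Appending a document to itself doubles every raw term count, so the vectors $p$ and $q$ become $2p$ and $2q$. A short derivation, using the distance formula from above and assuming the tokenization is the same before and after doubling:

$d\left(2p, 2q\right) = \sqrt{\sum_{i=1}^{n} \left(2q_{i} - 2p_{i}\right)^2} = \sqrt{4 \sum_{i=1}^{n} \left(q_{i} - p_{i}\right)^2} = 2\, d\left(p, q\right)$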

**b) What is the cosine similarity between A' and B' (using raw term frequency)?**

**Short answer:** 
It does not change: unlike the Euclidean distance, the cosine similarity between A' and B' is still $c$. 

**Long answer:**
See below, up to subquestion c).

The equation for cosine similarity between 2 data objects is

$\cos ({\bf t},{\bf e})= {{\bf t} {\bf e} \over \|{\bf t}\| \|{\bf e}\|} = \frac{ \sum_{i=1}^{n}{{\bf t}_i{\bf e}_i} }{ \sqrt{\sum_{i=1}^{n}{({\bf t}_i)^2}} \sqrt{\sum_{i=1}^{n}{({\bf e}_i)^2}} }$

where $||t||$ is the Euclidean norm of the vector $t = (t_1, t_2, ..., t_n)$, defined as $\sqrt{t_1^2 + t_2^2 + ... + t_n^2}$. Practically speaking, this is the length (magnitude) of the vector $t$, not the number of its elements. Likewise, $||e||$ is the Euclidean norm of the vector $e$. So, the formula is the dot product of $t$ and $e$ over the product of their Euclidean norms. A cosine value of 0 means that the two vectors are at 90 degrees to each other (orthogonal) and have no match. The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors. Let's run through an example, using the same docs as before.

In [12]:
doc1 = "the quicker brown dogs easily jumps over the lazy dogs" 
doc2 = "the quicker dogs pose a serious problem for lazy dogs"

We can use sklearn.metrics.pairwise.cosine_similarity to double check that we get the right result, but let's compute it manually first, using the term frequencies from above.

An alternative to what we did above, where we took the union of the 2 bags of words, is to simply use their intersection. The intuition here is that a term missing from either doc contributes 0 to the dot product, so if there is no intersection, there is no similarity. Therefore, summing over the intersection, $t \cap e$, gives the same numerator.

In [13]:
def get_cosine_similarity(vect_doc1, vect_doc2):
    
    # terms that occur in both docs; every other term contributes 0 to the dot product
    shared_terms = set(vect_doc1.keys()) & set(vect_doc2.keys())
    
    numerator = sum([vect_doc1[x] * vect_doc2[x] for x in shared_terms]) # dot product of the two term-frequency vectors

    sum1 = sum([vect_doc1[x] ** 2 for x in list(vect_doc1.keys())]) # sum of squared term frequencies of doc 1
    sum2 = sum([vect_doc2[x] ** 2 for x in list(vect_doc2.keys())]) # sum of squared term frequencies of doc 2
    
    denominator = math.sqrt(sum1) * math.sqrt(sum2) # product of the two Euclidean norms


    return float(numerator) / denominator


cosine = get_cosine_similarity(tfs_doc1, tfs_doc2)

cosine

0.6172133998483678

Now, if we compare that to using sklearn's CountVectorizer and cosine_similarity...

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create the Document Term Matrix
sparse_matrix = CountVectorizer().fit_transform([doc1, doc2])

# Compute Cosine Similarity
cosine_similarity(sparse_matrix, sparse_matrix)

array([[1.        , 0.64465837],
       [0.64465837, 1.        ]])

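We see that our manually computed cosine similarity is a little off. This is not an arithmetic mistake: as with the Euclidean distance above (3 versus 3.162), CountVectorizer's default token pattern drops the single-character token "a" from doc2. That shrinks the norm of doc2 from $\sqrt{12}$ to $\sqrt{11}$ and gives $8 / \sqrt{14 \cdot 11} \approx 0.6447$ instead of $8 / \sqrt{14 \cdot 12} \approx 0.6172$. For the purposes of this subquestion it doesn't really matter, because what we are interested in is how the cosine similarity changes when the docs are doubled. 

Now, let's do the same thing, but with each doc appended to itself. 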

In [15]:
data = [
    "the quicker brown dogs easily jumps over the lazy dogs the quicker brown dogs easily jumps over the lazy dogs",
    "the quicker dogs pose a serious problem for lazy dogs the quicker dogs pose a serious problem for lazy dogs"
]

# Vectorise the data
vec = CountVectorizer()
X = vec.fit_transform(data) # `X` will now be a TF representation of the data, the first row of `X` corresponds to the first sentence in `data`

# Calculate the pairwise cosine similarities (depending on the amount of data that you are going to have this could take a while)
S = cosine_similarity(X)
S

array([[1.        , 0.64465837],
       [0.64465837, 1.        ]])
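The unchanged matrix follows directly from the formula above. Doubling both vectors scales the numerator by 4 and the denominator by $2 \cdot 2 = 4$, so the factor cancels:

$\cos (2{\bf t}, 2{\bf e}) = {(2{\bf t}) (2{\bf e}) \over \|2{\bf t}\| \|2{\bf e}\|} = {4\, {\bf t} {\bf e} \over 4\, \|{\bf t}\| \|{\bf e}\|} = \cos ({\bf t},{\bf e})$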

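**c) What does this say about using cosine similarity as opposed to Euclidean distance in information retrieval?**

This says that cosine similarity ignores magnitude and measures only the angle between the term-frequency vectors. Appending a document to itself doubles every count but leaves the ratios between the counts unchanged, so the cosine similarity stays at c while the Euclidean distance doubles. For information retrieval this makes cosine similarity the more suitable measure: a long and a short document about the same topic score as similar, whereas Euclidean distance penalizes the difference in length. 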
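A small illustration of what this means for ranking, as a sketch under assumed data: the query and the two documents below are made up for this example, and the helpers `tf`, `euclidean` and `cosine` are hypothetical names. The long document is simply the short one repeated five times.

```python
import math
from collections import Counter


def tf(text):
    """Raw term frequencies of a whitespace-tokenized text."""
    return Counter(text.split())


def euclidean(a, b):
    vocab = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in vocab))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


query = tf("home sales")
short_doc = tf("home sales rise in july")
long_doc = tf(" ".join(["home sales rise in july"] * 5))  # same content, five times as long

# Euclidean distance grows with document length ...
print(euclidean(query, short_doc), euclidean(query, long_doc))
# ... while cosine similarity treats both documents as equally good matches
print(cosine(query, short_doc), cosine(query, long_doc))
```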

**4. Suppose we run the SNOWBALL algorithm on the text below to attempt to extract the FOUNDER-OF relation. Which of the patterns below will extract at least one correct example of that relation without extracting any incorrect ones (select all that apply)?**

a) ORG, founded by PERSON

b) ORG, PERSON

c) founders of ORG, PERSON

d) PERSON of ORG

ORG entities are bold, and PERSON entities are underlined. You can assume that all of the patterns are well formed SNOWBALL patterns. However, let's try and establish the entities ourselves first.

Correct examples: (Microsoft, Bill Gates) , (Facebook, Mark Zuckerberg) , (Google, Larry Page) , (Google, Sergey Brin)

**Short answer:**
Options a) and c) each extract at least one correct example and no incorrect ones. 


**Long answer:** 
See below, up to question 5.

In [16]:
src_text = ("Microsoft, founded by Bill Gates, produces both computer software and personal computers. " +
"The founders of Google, Larry Page and Sergei Brin, developed an advanced search experience. " + 
"And Mark Zuckerberg, founder of Facebook, crafted a new communication platform. " + 
"And, usage exists between them: indeed, Bill Gates is a user of Google search, and Larry Page of Microsoft products " +
"such as Word. Bill Gates of Microsoft, Larry Page and Sergei Brin of Google, and Mark Zuckerberg of Facebook were " +
"all pioneers of today’s technology.")

The task here is to extract FOUNDER-OF relations from the source text using the SNOWBALL algorithm. The algorithm itself is as follows:

1. Start with a set of seed tuples (or extract a seed set from the unlabeled text with a few hand-crafted rules) - We have this already

2. Extract occurrences from the unlabeled text that matches the tuples and tag them with a NER (named entity recognizer). - We can use NLTK or SpaCy

3. Create patterns for these occurrences, e.g. “ORG is based in LOC”. - These are given.

4. Generate new tuples from the text, e.g. (ORG:Intel, LOC: Santa Clara), and add to the seed set - No new sets exist.

5. Go to step 2, or terminate and use the patterns that were created for further extraction

Note that this is not the Snowball string processing language used for creating stemming algorithms in Information Retrieval. The SNOWBALL relation extraction algorithm comes from the paper "Snowball: Extracting Relations from Large Plain-Text Collections" (Agichtein & Gravano, 2000).

Let's try out the different patterns. First we need to preprocess the text, meaning that we need to do tokenization, POS tagging and named entity recognition. For this, we could use SpaCy, but since that library doesn't have a built-in relation extraction function, let's go with NLTK instead. 

In [66]:
import re

import nltk
from nltk.tokenize import word_tokenize # already imported in an earlier cell; repeated so this cell runs on its own

pos_tags = nltk.pos_tag(word_tokenize(src_text))  # POS tagging of the sentence
ne = nltk.ne_chunk(pos_tags)  # Named Entity Recognition

# ne.draw()
print(ne) 

(S
  (GPE Microsoft/NNP)
  ,/,
  founded/VBN
  by/IN
  (PERSON Bill/NNP Gates/NNP)
  ,/,
  produces/VBZ
  both/DT
  computer/NN
  software/NN
  and/CC
  personal/JJ
  computers/NNS
  ./.
  The/DT
  founders/NNS
  of/IN
  (GPE Google/NNP)
  ,/,
  (PERSON Larry/NNP Page/NNP)
  and/CC
  (PERSON Sergei/NNP Brin/NNP)
  ,/,
  developed/VBD
  an/DT
  advanced/JJ
  search/NN
  experience/NN
  ./.
  And/CC
  (PERSON Mark/NNP Zuckerberg/NNP)
  ,/,
  founder/NN
  of/IN
  (GPE Facebook/NNP)
  ,/,
  crafted/VBD
  a/DT
  new/JJ
  communication/NN
  platform/NN
  ./.
  And/CC
  ,/,
  usage/JJ
  exists/NNS
  between/IN
  them/PRP
  :/:
  indeed/RB
  ,/,
  (PERSON Bill/NNP Gates/NNP)
  is/VBZ
  a/DT
  user/NN
  of/IN
  (GPE Google/NNP)
  search/NN
  ,/,
  and/CC
  (PERSON Larry/NNP Page/NNP)
  of/IN
  (ORGANIZATION Microsoft/NNP)
  products/NNS
  such/JJ
  as/IN
  (ORGANIZATION Word/NNP)
  ./.
  (PERSON Bill/NNP Gates/NNP)
  of/IN
  (ORGANIZATION Microsoft/NNP)
  ,/,
  (PERSON Larry/NNP Page/NNP)
  and

An issue arises when using the nltk.ne_chunk() function. In our text, some of the companies are classified as type "GPE" (geo-political entity) rather than "ORGANIZATION"! This should be changeable manually, but that goes beyond the scope of this assignment. Hours were wasted. Below are the patterns used to return the relation tuples, answering question 4.

In [100]:

# Pattern 3: r'????'
# (c): correctly extracts (Google, Larry Page)

# Pattern 4: r'\bof\b' needs the entity positions reversed, i.e. (PERSON, ORG), and produces
# ('larry_page', 'microsoft'); ('bill_gates', 'microsoft'); ('sergei_brin', 'google'), which is wrong
# (d): correctly extracts (Microsoft, Bill Gates), (Sergei Brin, Google), (Mark Zuckerberg, Facebook),
# but also incorrectly extracts (Microsoft, Larry Page)


def test_patterns(ne, subj, obj, pattern, relsym):
    for rel in nltk.sem.extract_rels(subj, obj, ne, corpus='ace', pattern=pattern):
        # print(nltk.sem.rtuple(rel))
        print(nltk.sem.clause(rel, relsym=relsym))

# Pattern A, "ORG, founded by PERSON", returns ('microsoft', 'bill_gates'), which is correct
#test_patterns(ne, "GPE", "PERSON", re.compile(r'.*,.*founded.+by.+'), "Pattern 1: ")

# Pattern B, "ORG, PERSON", returns ('google', 'larry_page'); ('microsoft', 'larry_page') which is incorrect
#test_patterns(ne, "GPE", "PERSON", re.compile(r','), "Pattern 2: ")
#test_patterns(ne, "ORGANIZATION", "PERSON", re.compile(r','), "Pattern 2: ")

# Pattern C, "founders of ORG, PERSON", returns ('Google, Larry Page'), which is correct
#test_patterns(ne, "GPE", "PERSON", re.compile(r'[^,]+'), "Pattern 3: ")

# Pattern D, "PERSON of ORG", returns ("Sergei Brin", "Google"); ("Larry Page", "Microsoft"); ("Bill Gates", "Microsoft"), which is incorrect because Larry Page did not found Microsoft.
test_patterns(ne, "PERSON", "GPE", re.compile(r'\bof\b'), "Pattern 4: ")
test_patterns(ne, "PERSON", "ORGANIZATION", re.compile(r'\bof\b'), "Pattern 4: ")



Pattern 4: ('sergei_brin', 'google')
Pattern 4: ('larry_page', 'microsoft')
Pattern 4: ('bill_gates', 'microsoft')
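As a cross-check that does not depend on the NER tags, the four patterns can also be tried as plain regular expressions over the source text. This is a rough sketch and not the SNOWBALL algorithm itself: real SNOWBALL patterns are weighted term vectors matched with a similarity threshold. The entity lists `ORGS` and `PERSONS` and the helper `extract` are hypothetical stand-ins for the NER step.

```python
import re

src_text = (
    "Microsoft, founded by Bill Gates, produces both computer software and personal computers. "
    "The founders of Google, Larry Page and Sergei Brin, developed an advanced search experience. "
    "And Mark Zuckerberg, founder of Facebook, crafted a new communication platform. "
    "And, usage exists between them: indeed, Bill Gates is a user of Google search, and Larry Page of Microsoft products "
    "such as Word. Bill Gates of Microsoft, Larry Page and Sergei Brin of Google, and Mark Zuckerberg of Facebook were "
    "all pioneers of today’s technology."
)

# hand-written entity lists standing in for a named entity recognizer (assumption)
ORGS = ["Microsoft", "Google", "Facebook"]
PERSONS = ["Bill Gates", "Larry Page", "Sergei Brin", "Mark Zuckerberg"]

# the correct FOUNDER-OF tuples, spelled as in the source text
CORRECT = {
    ("Microsoft", "Bill Gates"),
    ("Facebook", "Mark Zuckerberg"),
    ("Google", "Larry Page"),
    ("Google", "Sergei Brin"),
}

ORG = "(?P<org>" + "|".join(ORGS) + ")"
PERSON = "(?P<person>" + "|".join(PERSONS) + ")"

PATTERNS = {
    "a": ORG + ", founded by " + PERSON,
    "b": ORG + ", " + PERSON,
    "c": "founders of " + ORG + ", " + PERSON,
    "d": PERSON + " of " + ORG,
}


def extract(pattern, text):
    """Return the set of (ORG, PERSON) tuples that a pattern matches in the text."""
    return {(m.group("org"), m.group("person")) for m in re.finditer(pattern, text)}


valid = []
for name, pattern in PATTERNS.items():
    found = extract(pattern, src_text)
    # keep a pattern only if it finds something and everything it finds is correct
    if found and found <= CORRECT:
        valid.append(name)
    print(name, sorted(found))

print(valid)
```

Under these assumptions, only patterns a) and c) survive the filter, which agrees with the short answer.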


**5. Suppose that we run 3 queries in a QA system for its evaluation. For the 1st query, the system returns 10 candidate answers and the 4th is the first correct answer. For the 2nd query, the system returns 5 candidate answers and the 2nd is the first correct answer. For the 3rd query, the system returns 3 candidate answers and none of them is the correct answer. What is the mean reciprocal rank (MRR)?**

The mean reciprocal rank is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness. 

The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1/2 for second place, 1/3 for third place and so on. If no correct answer is returned, the reciprocal rank is taken to be 0. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q:

$\mathrm{MRR} = \frac 1 {|Q|} \sum_{i=1}^{|Q|} \frac 1 {Rank_i}$

The quick answer is that Q1 gives a reciprocal rank of $1/4 = 0.25$, Q2 gives $1/2 = 0.5$, and Q3 gives 0, since it has no correct answer. Hence, the MRR is $1/3 \cdot (1/4 + 1/2 + 0) = 0.25$

In [5]:
1/3 * (1/4+1/2+0)

0.25
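The same computation can be written as a small reusable function. This is a minimal sketch; the name `mean_reciprocal_rank` and the convention of passing `None` for a query without a correct answer are assumptions made for illustration.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank over all queries; a rank of None (no correct answer) counts as 0."""
    reciprocal_ranks = [0.0 if rank is None else 1.0 / rank for rank in first_correct_ranks]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# rank of the first correct answer for each of the 3 queries
mrr = mean_reciprocal_rank([4, 2, None])
print(mrr)
```

Note that the number of candidate answers returned per query (10, 5 and 3) does not enter the calculation; only the rank of the first correct answer does.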