# Week 2, Lesson 4, Activity 9: End-to-end IR application

&copy;2021, Ekaterina Kochmar \
(edited: Nadejda Roubtsova, June 2022)

Your task in this activity is to implement all the steps discussed in the lecture and apply your IR algorithm to the collection of documents and the set of queries provided with this notebook.

Note that we continue working on the same data, so we will rely on the implementation of the previous steps (Steps 1–3) from Lesson 2.3. In this notebook, you need to start with Step 4.

## Step 1: Read in the data (same as before)

There are three components to this data to be read in:
- documents with their ids and content – there are $1460$ of those to be precise;
- questions / queries with their ids and content – there are $112$ of those;
- mapping between the queries and relevant documents.


In [1]:
def read_documents():
    f = open("cisi/CISI.ALL")
    merged = ""
    
    for a_line in f.readlines():
        if a_line.startswith("."):
            merged += "\n" + a_line.strip()
        else:
            merged += " " + a_line.strip()
    
    documents = {}

    content = ""
    doc_id = ""

    for a_line in merged.split("\n"):
        if a_line.startswith(".I"):
            doc_id = a_line.split(" ")[1].strip()
        elif a_line.startswith(".X"):
            documents[doc_id] = content
            content = ""
            doc_id = ""
        else:
            content += a_line.strip()[3:] + " "
    f.close()
    return documents

documents = read_documents()
print(f"{len(documents)} documents in total")
print("Document with id 1:")
print(documents.get("1"))

1460 documents in total
Document with id 1:
 18 Editions of the Dewey Decimal Classifications Comaromi, J.P. The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad. 


Second, let's read in queries from the `CISI.QRY` file and store the result in the `queries` data structure where query contents are stored under corresponding query ids:

In [2]:
def read_queries():
    f = open("cisi/CISI.QRY")
    merged = ""
    
    for a_line in f.readlines():
        if a_line.startswith("."):
            merged += "\n" + a_line.strip()
        else:
            merged += " " + a_line.strip()
    
    queries = {}

    content = ""
    qry_id = ""

    for a_line in merged.split("\n"):
        if a_line.startswith(".I"):
            if not content=="":
                queries[qry_id] = content
                content = ""
                qry_id = ""
            qry_id = a_line.split(" ")[1].strip()
        elif a_line.startswith(".W") or a_line.startswith(".T"):
            content += a_line.strip()[3:] + " "
    queries[qry_id] = content
    f.close()
    return queries

queries = read_queries()
print(f"{len(queries)} queries in total")
print("Query with id 1:")
print(queries.get("1"))

112 queries in total
Query with id 1:
What problems and concerns are there in making up descriptive titles? What difficulties are involved in automatically retrieving articles from approximate titles? What is the usual relevance of the content of articles to their titles? 


Finally, let's read in the mapping between the queries and the documents. We'll keep these in the `mappings` data structure where each query index (key) corresponds to the list of one or more document indices (value).

In [3]:
def read_mappings():
    f = open("cisi/CISI.REL")
    
    mappings = {}

    for a_line in f.readlines():
        voc = a_line.strip().split()
        key = voc[0].strip()
        current_value = voc[1].strip()
        value = []
        if key in mappings.keys():
            value = mappings.get(key)
        value.append(current_value)
        mappings[key] = value

    f.close()
    return mappings

mappings = read_mappings()
print(f"{len(mappings)} mappings in total")
print(mappings.keys())
print("Mapping for query with id 1:")
print(mappings.get("1"))

76 mappings in total
dict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '37', '39', '41', '42', '43', '44', '45', '46', '49', '50', '52', '54', '55', '56', '57', '58', '61', '62', '65', '66', '67', '69', '71', '76', '79', '81', '82', '84', '90', '92', '95', '96', '97', '98', '99', '100', '101', '102', '104', '109', '111'])
Mapping for query with id 1:
['28', '35', '38', '42', '43', '52', '65', '76', '86', '150', '189', '192', '193', '195', '215', '269', '291', '320', '429', '465', '466', '482', '483', '510', '524', '541', '576', '582', '589', '603', '650', '680', '711', '722', '726', '783', '813', '820', '868', '869', '894', '1162', '1164', '1195', '1196', '1281']
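
The grouping logic in `read_mappings` can also be written more compactly. The following is a minimal sketch, not the notebook's own implementation: `parse_mappings` is a hypothetical helper, and the sample lines are made up in the two-column shape (query id, then document id) that the function above relies on.

```python
from collections import defaultdict

def parse_mappings(lines):
    """Group document ids by query id; each line starts with 'query_id doc_id'."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        grouped[fields[0]].append(fields[1])
    return dict(grouped)

# Made-up sample lines for illustration only (not read from CISI.REL)
sample_lines = ["1 28", "1 35", "2 29", ""]
sample_mappings = parse_mappings(sample_lines)
print(sample_mappings)
```

Using `defaultdict(list)` removes the need to check whether a key already exists before appending to its list.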


## Step 2: Preprocess the data (same as before)

Practise applying the following steps:
- tokenize the texts
- convert everything to lowercase
- remove stopwords
- apply stemming

Implement and apply these steps to a sample text:

In [4]:
import nltk
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer

def process(text): 
    stoplist = set(stopwords.words('english'))
    st = LancasterStemmer()
    word_list = [st.stem(word) for word in word_tokenize(text.lower())
                 if not word in stoplist and not word in string.punctuation]
    return word_list
  
word_list = process(documents.get("27"))
print(word_list)

['cost', 'analys', 'sim', 'proc', 'evalu', 'larg', 'inform', 'system', 'bourn', 'c.p', 'ford', 'd.f', 'comput', 'program', 'writ', 'us', 'sim', 'several-year', 'op', 'inform', 'system', 'comput', 'estim', 'expect', 'op', 'cost', 'wel', 'amount', 'equip', 'personnel', 'requir', 'tim', 'period', 'program', 'us', 'analys', 'sev', 'larg', 'system', 'prov', 'us', 'research', 'tool', 'study', 'system', 'many', 'compon', 'interrel', 'op', 'equ', 'man', 'analys', 'would', 'extrem', 'cumbersom', 'tim', 'consum', 'perhap', 'ev', 'impract', 'pap', 'describ', 'program', 'show', 'exampl', 'result', 'sim', 'two', 'sev', 'suggest', 'design', 'spec', 'inform', 'system']


## Step 3: Term weighting (same as before)

In [5]:
def get_terms(text): 
    stoplist = set(stopwords.words('english'))
    terms = {}
    st = LancasterStemmer()
    word_list = [st.stem(word) for word in word_tokenize(text.lower())
                 if not word in stoplist and not word in string.punctuation]
    for word in word_list:
        terms[word] = terms.get(word, 0) + 1
    return terms

doc_terms = {}
qry_terms = {}
for doc_id in documents.keys():
    doc_terms[doc_id] = get_terms(documents.get(doc_id))
for qry_id in queries.keys():
    qry_terms[qry_id] = get_terms(queries.get(qry_id))


print(f"{len(doc_terms)} documents in total")
d1_terms = doc_terms.get("1")
print("Terms and frequencies for document with id 1:")
print(d1_terms)
print(f"{len(d1_terms)} terms in this document")
print()
print(f"{len(qry_terms)} queries in total")
q1_terms = qry_terms.get("1")
print("Terms and frequencies for query with id 1:")
print(q1_terms)
print(f"{len(q1_terms)} terms in this query")

1460 documents in total
Terms and frequencies for document with id 1:
{'18': 1, 'edit': 4, 'dewey': 3, 'decim': 2, 'class': 2, 'comarom': 1, 'j.p.': 1, 'pres': 1, 'study': 1, 'hist': 2, 'first': 2, 'ddc': 2, 'publ': 1, '1876': 1, 'eighteen': 1, '1971': 1, 'fut': 1, 'continu': 1, 'appear': 1, 'nee': 1, 'spit': 1, "'s": 1, 'long': 1, 'healthy': 1, 'lif': 1, 'howev': 1, 'ful': 1, 'story': 1, 'nev': 1, 'told': 1, 'biograph': 1, 'brief': 1, 'describ': 1, 'system': 1, 'attempt': 1, 'provid': 1, 'detail': 1, 'work': 1, 'spur': 1, 'grow': 1, 'libr': 1, 'country': 1, 'abroad': 1}
43 terms in this document

112 queries in total
Terms and frequencies for query with id 1:
{'problem': 1, 'concern': 1, 'mak': 1, 'describ': 1, 'titl': 3, 'difficul': 1, 'involv': 1, 'autom': 1, 'retriev': 1, 'artic': 2, 'approxim': 1, 'us': 1, 'relev': 1, 'cont': 1}
14 terms in this query


Second, collect shared vocabulary from all documents and queries:

In [6]:
def collect_vocabulary():
    all_terms = []
    for doc_id in doc_terms.keys():
        for term in doc_terms.get(doc_id).keys():            
            all_terms.append(term)
    for qry_id in qry_terms.keys():
        for term in qry_terms.get(qry_id).keys():
            all_terms.append(term)
    return sorted(set(all_terms))

all_terms = collect_vocabulary()
print(f"{len(all_terms)} terms in the shared vocabulary")
print("First 10:")
print(all_terms[:10])

7775 terms in the shared vocabulary
First 10:
["''", "'60", "'70", "'anyhow", "'apparent", "'basic", "'better", "'bibliograph", "'bibliometrics", "'building"]


Represent each document and query as vectors containing word counts in the shared space:

In [7]:
def vectorize(input_terms, shared_vocabulary):
    output = {}
    for item_id in input_terms.keys():
        terms = input_terms.get(item_id)
        output_vector = []
        for word in shared_vocabulary:
            if word in terms.keys():
                output_vector.append(int(terms.get(word)))
            else:
                output_vector.append(0)
        output[item_id] = output_vector
    return output

doc_vectors = vectorize(doc_terms, all_terms)
qry_vectors = vectorize(qry_terms, all_terms)

print(f"{len(doc_vectors)} document vectors")
d1460_vector = doc_vectors.get("1460")
print(f"{len(d1460_vector)} terms in this document")
print(f"{len(qry_vectors)} query vectors")
q112_vector = qry_vectors.get("112")
print(f"{len(q112_vector)} terms in this query")

1460 document vectors
7775 terms in this document
112 query vectors
7775 terms in this query
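
Because the real vectors have 7775 dimensions, it can help to see the same idea on a tiny collection. The sketch below is illustrative only: `vectorize_counts` and the two toy documents are hypothetical and stand in for `vectorize`, `doc_terms` and `all_terms` above.

```python
def vectorize_counts(term_counts, vocabulary):
    """Map a {term: count} dict onto a fixed vocabulary order (0 for absent terms)."""
    return [term_counts.get(term, 0) for term in vocabulary]

# Hypothetical term counts for two toy documents
toy_docs = {
    "d1": {"inform": 2, "system": 1},
    "d2": {"libr": 1, "system": 3},
}

# Shared vocabulary: sorted set of all terms, as in collect_vocabulary()
toy_vocabulary = sorted({term for counts in toy_docs.values() for term in counts})
toy_vectors = {doc_id: vectorize_counts(counts, toy_vocabulary)
               for doc_id, counts in toy_docs.items()}

print(toy_vocabulary)
print(toy_vectors)
```

Every vector has the same length as the vocabulary, and position `i` always refers to the same term, which is what makes the vectors comparable in Step 4.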


## Step 4: Retrieval of the most similar documents

Use cosine similarity on a toy example:

In [8]:
import math

query = [1, 1]
document = [3, 5]

def length(vector):
    sq_length = 0
    for index in range(0, len(vector)):
        sq_length += math.pow(vector[index], 2)
    return math.sqrt(sq_length)
    
def dot_product(vector1, vector2):
    # Fail loudly on mismatched vectors instead of returning a string,
    # which would only surface later as a TypeError in the division.
    if len(vector1) != len(vector2):
        raise ValueError("Unmatching dimensionality")
    dot_prod = 0
    for index in range(0, len(vector1)):
        dot_prod += vector1[index] * vector2[index]
    return dot_prod

def calculate_cosine(query, document):
    cosine =  dot_product(query, document) / (length(query) * length(document)) 
    return cosine

cosine = calculate_cosine(query, document)
print(cosine)

0.9701425001453319
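
The toy result can be checked by hand. With $q = (1, 1)$ and $d = (3, 5)$, the computation performed by `calculate_cosine` is:

$$
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
= \frac{1 \cdot 3 + 1 \cdot 5}{\sqrt{1^2 + 1^2} \, \sqrt{3^2 + 5^2}}
= \frac{8}{\sqrt{2} \, \sqrt{34}}
= \frac{8}{\sqrt{68}} \approx 0.9701
$$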


Compute the cosine similarity between one particular query and one particular document from the collection:

In [9]:
document = doc_vectors.get("60")
query = qry_vectors.get("3")

cosine =  dot_product(query, document) / (length(query) * length(document))     
print(cosine)

0.4830458915396479


**Document and query weighting schemes:**
In the implementation of Lesson 2.3 (if you copied all the cells from Step 3), you may have noticed that the documents are not weighted in the same way as the queries. This is not uncommon. More information on this practice and on different weighting schemes can be found in: 
Christopher D. Manning *et al.*, *Introduction to Information Retrieval*, Cambridge University Press, 2008. \
https://nlp.stanford.edu/IR-book/pdf/06vect.pdf (Section 6.4.3) \
Please feel free to experiment with alternative weighting schemes for query and document vectors as part of these exercises.
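
As one possible starting point for such an experiment, the sketch below weights raw term counts by inverse document frequency. It is an assumption-laden illustration rather than the scheme used in Lesson 2.3: the helper names and toy data are hypothetical, and $\mathrm{idf} = \log_{10}(N / \mathrm{df})$ is only one of several variants.

```python
import math

def document_frequencies(term_counts_by_doc):
    """Count in how many documents each term occurs."""
    df = {}
    for counts in term_counts_by_doc.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return df

def tfidf_weights(term_counts_by_doc):
    """Return {doc_id: {term: tf * idf}} with idf = log10(N / df) (assumed variant)."""
    n_docs = len(term_counts_by_doc)
    df = document_frequencies(term_counts_by_doc)
    return {
        doc_id: {term: tf * math.log10(n_docs / df[term])
                 for term, tf in counts.items()}
        for doc_id, counts in term_counts_by_doc.items()
    }

# Hypothetical term counts for two toy documents
toy_doc_terms = {
    "d1": {"inform": 2, "system": 1},
    "d2": {"libr": 1, "system": 3},
}
toy_weights = tfidf_weights(toy_doc_terms)
print(toy_weights)
```

A term that occurs in every document (here `system`) receives weight 0, so it no longer contributes to the similarity. Note that `vectorize` above casts values with `int(...)`, which would truncate fractional weights, so that cast would have to be removed before feeding weighted terms into it.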

Apply the search algorithm to find relevant documents for a particular query (here the `query` variable still holds the vector for the query with id 3 from the previous cell):

In [10]:
from operator import itemgetter

results = {}

for doc_id in doc_vectors.keys():
    document = doc_vectors.get(doc_id)
    cosine = calculate_cosine(query, document)    
    results[doc_id] = cosine

# sort the results dictionary by cosine similarity in descending order and get the top 10
sorted_results = sorted(results.items(), key=itemgetter(1), reverse=True)[:10]

for items in sorted_results:  # print out the top 10 most similar documents according to cosine similarity
    print(f"Doc {items[0]} with similarity {round(items[1], 4)}")

Doc 469 with similarity 0.6107
Doc 1179 with similarity 0.5643
Doc 372 with similarity 0.4899
Doc 1116 with similarity 0.4862
Doc 1190 with similarity 0.4845
Doc 60 with similarity 0.483
Doc 803 with similarity 0.4781
Doc 1181 with similarity 0.4603
Doc 599 with similarity 0.4554
Doc 241 with similarity 0.4526
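
To run the same search for any query, the ranking loop can be wrapped in a function. This is a minimal sketch under stated assumptions: `cosine_similarity` and `retrieve` are hypothetical names, the toy vectors are made up, and returning 0.0 for an all-zero vector is a convention chosen here to avoid division by zero.

```python
import math

def cosine_similarity(vector1, vector2):
    """Cosine of the angle between two equal-length vectors."""
    if len(vector1) != len(vector2):
        raise ValueError("Vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(vector1, vector2))
    norm1 = math.sqrt(sum(a * a for a in vector1))
    norm2 = math.sqrt(sum(b * b for b in vector2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm1 * norm2)

def retrieve(query_vector, vectors_by_doc, top_k=10):
    """Return the top_k (doc_id, similarity) pairs, most similar first."""
    scores = {doc_id: cosine_similarity(query_vector, vector)
              for doc_id, vector in vectors_by_doc.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Made-up toy vectors; d3 is empty to exercise the zero-norm guard
toy_doc_vectors = {"d1": [3, 5], "d2": [1, 0], "d3": [0, 0]}
toy_ranking = retrieve([1, 1], toy_doc_vectors, top_k=2)
print(toy_ranking)
```

In the notebook, the equivalent call would be something like `retrieve(qry_vectors.get("3"), doc_vectors)`.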


## Step 5: Evaluation

**Optional**: Implement evaluation metrics from Lesson 5 (see Activity 11).
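
As a hedged starting point, the sketch below computes precision and recall over the top `k` results. Which metrics Lesson 5 actually covers is not stated here, so these two are an assumption; the function names and toy ids are hypothetical and are not real CISI relevance judgements.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_doc_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_doc_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)

def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

# Made-up ranking and relevance list for illustration only
toy_ranked = ["d4", "d1", "d7", "d2"]
toy_relevant = ["d1", "d2", "d9"]
print(precision_at_k(toy_ranked, toy_relevant, k=4))
print(recall_at_k(toy_ranked, toy_relevant, k=4))
```

Since only 76 of the 112 queries have an entry in `mappings`, an evaluation loop should iterate over `mappings.keys()` rather than over all query ids.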