# A. Dataset:

After reviewing several research papers (listed in the references), we chose the CISI collection from the University of Glasgow. It is a benchmark dataset for information retrieval problems, which makes it well suited to this assignment.


## Preprocessing:

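1. Each record in the dataset has an Id, a title, an author name, references to other documents, and the abstract of the document.
2. We mapped each abstract to its document Id and dropped the title and the references. Note that the mapping code below skips only the title line, so the author name remains at the start of each stored document (as the sample output for document 10 shows).
3. Converted the text to lowercase and removed English stopwords, since the abstracts are in English only.
4. Removed symbols, if any, and single-letter words.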

In [None]:
!pip install nltk
print('Installing necessary libraries')

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
import re
from nltk.corpus import stopwords

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

print('Imported Necessary Libraries')

Imported Necessary Libraries


In [21]:
# Reading lines from the dataset paragraphs

with open('archive/CISI.ALL') as f:
    lines = ""
    for l in f.readlines():
        lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
    lines = lines.lstrip("\n").split("\n")

print('Some sample lines from the above step: ')
for line in lines[:25]:
    print(line)

Some sample lines from the above step: 
.I 1
.T 18 Editions of the Dewey Decimal Classifications
.A Comaromi, J.P.
.W The present study is a history of the DEWEY Decimal Classification.  The first edition of the DDC was published in 1876, the eighteenth edition in 1971, and future editions will continue to appear as needed.  In spite of the DDC's long and healthy life, however, its full story has never been told.  There have been biographies of Dewey that briefly describe his system, but this is the first attempt to provide a detailed history of the work that more than any other has spurred the growth of librarianship in this country and abroad.
.X 1	5	1 92	1	1 262	1	1 556	1	1 1004	1	1 1024	1	1 1024	1	1


In [50]:
# Mapping the abstract with its Document Id

documents_dict = {}
current_document = ""
current_document_id = ""

for line in lines:
    if line.startswith('.I'):
        current_document_id = line.split(" ")[1].strip()
    elif line.startswith('.X'):
        documents_dict[current_document_id] = current_document.lstrip(" ")
        current_document = ""
    elif line.startswith('.T'):
        continue
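    else:
        # Every remaining field (e.g. the .A author and .W abstract lines) lands here;
        # [3:] drops the 3-character field tag such as ".W ".
        current_document += line.strip()[3:]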

print('A sample document with document id - 10: ')
print(documents_dict["10"])

A sample document with document id - 10: 
Palmour, V.E.The purpose of this study was to develop, evaluate, and recommend a national plan  for improving access to periodical resources.  About 48 percent of all academic interlibrary loans are for periodical materials, with the bulk of the loans being satisfied in the form of photocopies.  A major consideration in the long-range improvement of the interlibrary loan system is the possible augmentation with a national system for acquiring, storing, and satisfying loan requests for periodical materials. This study focused on the physical access to the periodical literature.  Based on the needs of the library community, design features were developed, and included the following: Service should be made available to all users without any restriction other than access through a library. Initially, the service should be confined primarily to rapid, dependable delivery of photocopies of journal articles. The collection of a center should be compre

In [None]:
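# Remove stop words, symbols, and single-letter words from each document.
# The text is also lowercased so that matching is case-insensitive.

nltk.download('stopwords')
nltk.download('punkt')  # tokenizer data used by nltk.word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead
english_stop_words = set(stopwords.words('english'))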

def clean_paragraph(paragraph):
    clean_text = re.sub(r'[^A-Za-z0-9\s]', ' ', paragraph)
    clean_text = clean_text.lower()
    tokens = nltk.word_tokenize(clean_text)
    tokens = [word for word in tokens if word not in english_stop_words and len(word) != 1]
    cleaned_paragraph = ' '.join(tokens)
    return cleaned_paragraph

In [52]:
for document_id in documents_dict:
    documents_dict[document_id] = clean_paragraph(documents_dict[document_id])

In [53]:
print(documents_dict["10"]) # Processed document with document Id "10"

palmour purpose study develop evaluate recommend national plan improving access periodical resources 48 percent academic interlibrary loans periodical materials bulk loans satisfied form photocopies major consideration long range improvement interlibrary loan system possible augmentation national system acquiring storing satisfying loan requests periodical materials study focused physical access periodical literature based needs library community design features developed included following service made available users without restriction access library initially service confined primarily rapid dependable delivery photocopies journal articles collection center comprehensive subject coverage excluding medicine worthwhile journals collected irrespective language


# B

a) We created a Python function that uses the nltk library and its Porter stemmer for stemming.

b)
Stemming:

Instead of treating different forms of a word separately (like "run" and "running"), we reduce them to a common base form (just "run"). A search for one form then also matches documents that contain the others.

Positional Inverted Index:

Imagine we organize our information like a book index, but with extra detail. For each word we record not only which documents contain it but also the exact positions at which it appears. When searching, we can then be more specific about where the words occur, which makes our searches more accurate.

By combining the positional inverted index with stemming, we attach surrounding context (the position of the word) to the root word (produced by stemming). Stemming makes the vocabulary simpler, and the positional inverted index lets us locate words more precisely in our collection. Together they make searching both simpler and more accurate with respect to information and context.

In [54]:
def stem_and_add_to_index(documents_dict):
    """Build {stemmed_word: {document_id: [1-based positions]}}."""
    stemmer = PorterStemmer()
    positional_inverted_index = {}
    for document_id in documents_dict:
        words = word_tokenize(documents_dict[document_id])
        # Positions are 1-based and count tokens of the cleaned document.
        for pos, word in enumerate(words, start=1):
            stemmed_word = stemmer.stem(word)
            if stemmed_word not in positional_inverted_index:
                # First time this stem is seen in any document
                positional_inverted_index[stemmed_word] = {document_id: [pos]}
            elif document_id not in positional_inverted_index[stemmed_word]:
                # Known stem, first occurrence in this document
                positional_inverted_index[stemmed_word][document_id] = [pos]
            else:
                positional_inverted_index[stemmed_word][document_id].append(pos)
    return positional_inverted_index


In [55]:
positional_inverted_index = stem_and_add_to_index(documents_dict)

In [60]:
positional_inverted_index

{'comaromi': {'1': [1]},
 'present': {'1': [2],
  '7': [4, 35],
  '14': [58],
  '17': [35, 92, 147],
  '18': [3],
  '26': [47],
  '47': [82],
  '69': [12],
  '74': [68],
  '75': [17],
  '81': [32],
  '89': [137],
  '96': [79],
  '97': [54],
  '98': [12],
  '100': [7, 60],
  '111': [3],
  '112': [15],
  '114': [17],
  '118': [3],
  '122': [10],
  '131': [64],
  '134': [5],
  '142': [36],
  '143': [8, 32],
  '146': [20],
  '151': [39],
  '152': [16],
  '153': [49, 96],
  '156': [8],
  '161': [34],
  '164': [3],
  '185': [60],
  '186': [91],
  '191': [22, 40],
  '193': [3],
  '202': [47],
  '204': [68],
  '206': [55],
  '209': [4],
  '217': [44, 147],
  '228': [27],
  '229': [109],
  '239': [10, 38],
  '240': [33],
  '246': [2],
  '248': [7],
  '250': [77],
  '264': [94],
  '282': [67],
  '306': [20],
  '309': [73],
  '311': [70],
  '312': [3],
  '316': [16],
  '318': [38],
  '320': [78],
  '321': [32],
  '324': [33],
  '327': [8],
  '331': [36],
  '335': [9, 38],
  '339': [92, 113, 119],

In [69]:
print(positional_inverted_index['present']['1'])
print(positional_inverted_index['present']['100'])
print(positional_inverted_index['studi']['1'])
print(positional_inverted_index['studi']['9'])

[2]
[7, 60]
[3]
[2, 6, 56]


In [70]:
documents_dict['1']

'comaromi present study history dewey decimal classification first edition ddc published 1876 eighteenth edition 1971 future editions continue appear needed spite ddc long healthy life however full story never told biographies dewey briefly describe system first attempt provide detailed history work spurred growth librarianship country abroad'

## b) Illustration of the construction process of the combined inverted index using the results of the cells above

1. Our idea is to map each stemmed word to a **dictionary** of {document Id: positions of the word}. The word is the key, and the value is a dictionary whose keys are document Ids and whose values are the lists of positions of that word in each document.

ex: 

    
    {
        'present' : { 
                    '1' : [2], 
                    '100' : [7, 60]
                },
        'studi': {
                    '1' : [3], 
                    '9' : [2, 6, 56] 
                }
    }
    
    
    This dict means that the key 'present' occurs at position 2 in doc '1' and at positions 7 and 60 in doc '100' (the inner dict has the doc Id as key and the list of positions as value). Similarly, 'studi' occurs at position 3 in doc '1' and at positions 2, 6, and 56 in doc '9'. So when we want a word's positions in the doc with Id `doc_Id`, we can get them with `index[word][doc_Id]`.

    
2. We created an instance of the `PorterStemmer` class from the nltk library and initialized an empty `positional_inverted_index` dictionary to store the index.

   
3. We then took every document present in the documents dictionary (`documents_dict`) and obtained its individual words using `word_tokenize` from the nltk library. Once the paragraph was broken down into words, we stemmed each word with the stemmer. For example, 'study' is stemmed to 'studi', and 'studies' is likewise reduced to 'studi' by the Porter stemmer.

   
4. This root word ('studi') is then added as a key to our `positional_inverted_index` dictionary. Its value is another dictionary that maps a document Id ('9') to the list of positions of the word in that document ([2, 6, 56]), i.e. {word: {document_id: [list of positions of this word in this document]}}.
   
5. If the root word is already present in the index dictionary, we add the document Id (if it is not present yet) with a new list containing the position of this word. If the document Id is already present, we access its list and append the position to it.
   
6. Finally, we return the `positional_inverted_index` dictionary.
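
The construction steps above can be traced on a toy input. This is only a sketch: `TOY_STEMS`, `toy_stem`, and `build_toy_index` are hypothetical names, and the tiny lookup table stands in for the Porter stemmer so that the example runs without nltk. It assumes the documents are already cleaned and whitespace-separated.

```python
# Toy walk-through of the construction steps. TOY_STEMS is a hypothetical
# stand-in for the Porter stemmer, not the stemmer used in the notebook.
TOY_STEMS = {"study": "studi", "studies": "studi"}


def toy_stem(word):
    """Return the toy stem if one is listed, otherwise the word unchanged."""
    return TOY_STEMS.get(word, word)


def build_toy_index(docs):
    """Build {stem: {doc_id: [1-based positions]}} from already-cleaned text."""
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split(), start=1):
            stem = toy_stem(word)
            # New stem -> new inner dict; new doc for a known stem -> new list;
            # otherwise the position is appended to the existing list.
            index.setdefault(stem, {}).setdefault(doc_id, []).append(pos)
    return index


toy_docs = {"a": "present study", "b": "studies present studies"}
toy_index = build_toy_index(toy_docs)
print(toy_index)
# {'present': {'a': [1], 'b': [2]}, 'studi': {'a': [2], 'b': [1, 3]}}
```

Here `setdefault` folds the three cases of step 5 into a single line; the notebook function spells them out as separate branches, and the resulting structure is the same.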


## Impact on retrieval:

Since the index is a dictionary of dictionaries, retrieval is efficient: getting the positions of a word in a given document takes two hash lookups, which is O(1) (constant time) on average. The example above shows how we retrieve the positions of the word `studi` from documents `1` and `9`.
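
As a rough illustration of how the positions can be used at query time, the sketch below looks up one stem in one document and then checks whether two stems occur next to each other. The helper names `positions_of` and `adjacent_docs` are hypothetical, and `sample_index` is a small slice copied by hand from the index entries shown earlier, not the full index.

```python
# Hand-copied slice of the index shown earlier; helper names are hypothetical.
sample_index = {
    "present": {"1": [2], "100": [7, 60]},
    "studi": {"1": [3], "9": [2, 6, 56]},
}


def positions_of(index, stem, doc_id):
    """Two dictionary lookups; returns [] when the stem or document is absent."""
    return index.get(stem, {}).get(doc_id, [])


def adjacent_docs(index, first_stem, second_stem):
    """Return doc ids where second_stem occurs directly after first_stem."""
    first_postings = index.get(first_stem, {})
    second_postings = index.get(second_stem, {})
    matches = []
    for doc_id, first_positions in first_postings.items():
        following = set(second_postings.get(doc_id, []))
        if any(pos + 1 in following for pos in first_positions):
            matches.append(doc_id)
    return matches


print(positions_of(sample_index, "studi", "9"))    # [2, 6, 56]
print(positions_of(sample_index, "studi", "100"))  # [] (absent in this slice)
print(adjacent_docs(sample_index, "present", "studi"))  # ['1']
```

The single lookup stays constant time on average, while the adjacency check grows with the number of documents and positions listed for the first stem. Query words would need the same cleaning and stemming as the documents before being looked up.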

## References:

Dataset: 

  