# Information Retrieval Task

### Install required libraries

In [1]:
pip install nltk inflect pandas

Note: you may need to restart the kernel to use updated packages.


## Part 1: Text Preprocessing

from: https://www.geeksforgeeks.org/text-preprocessing-in-python-set-1/

### Create a file called `example.txt` and write text into it:

In [2]:
!echo "Text preprocessing is an important step in Natural Language Processing (NLP) tasks. \
It involves cleaning and preparing text data for analysis or machine learning models. \
\
Some of the key steps in text preprocessing include: \
1 -  Converting text to lowercase \
2 -  Removing punctuations, numbers, and special characters \
3 -  Tokenization: Splitting text into individual words \
4 -  Removing stopwords like 'the', 'is', 'in', etc. \
5 -  Stemming and Lemmatization: Reducing words to their root forms \
\
These steps help standardize text data, making it easier for models to interpret and analyze." > example.txt

#### Confirm the content was added

In [3]:
!cat example.txt

Text preprocessing is an important step in Natural Language Processing (NLP) tasks.  It involves cleaning and preparing text data for analysis or machine learning models.   Some of the key steps in text preprocessing include:  1 -  Converting text to lowercase  2 -  Removing punctuations, numbers, and special characters  3 -  Tokenization: Splitting text into individual words  4 -  Removing stopwords like 'the', 'is', 'in', etc.  5 -  Stemming and Lemmatization: Reducing words to their root forms   These steps help standardize text data, making it easier for models to interpret and analyze.


#### Read the file

In [4]:
with open('example.txt', 'r') as file:
    document = file.read()

document

"Text preprocessing is an important step in Natural Language Processing (NLP) tasks.  It involves cleaning and preparing text data for analysis or machine learning models.   Some of the key steps in text preprocessing include:  1 -  Converting text to lowercase  2 -  Removing punctuations, numbers, and special characters  3 -  Tokenization: Splitting text into individual words  4 -  Removing stopwords like 'the', 'is', 'in', etc.  5 -  Stemming and Lemmatization: Reducing words to their root forms   These steps help standardize text data, making it easier for models to interpret and analyze.\n"

#### Import the necessary libraries

In [5]:
import nltk
import inflect
import string
import re
import pandas

## Text Lowercase

Lowercasing the text shrinks the vocabulary, because variants such as `Text` and `text` collapse into a single token.

In [6]:
def text_lowercase(text):
    return text.lower()

text_lowercase(document)

"text preprocessing is an important step in natural language processing (nlp) tasks.  it involves cleaning and preparing text data for analysis or machine learning models.   some of the key steps in text preprocessing include:  1 -  converting text to lowercase  2 -  removing punctuations, numbers, and special characters  3 -  tokenization: splitting text into individual words  4 -  removing stopwords like 'the', 'is', 'in', etc.  5 -  stemming and lemmatization: reducing words to their root forms   these steps help standardize text data, making it easier for models to interpret and analyze.\n"
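As a rough illustration of the vocabulary effect, the sketch below counts unique tokens before and after lowercasing. The sample sentence and variable names are made up for this example.

```python
# Made-up sample text with the same words in different casings.
sample = "Text preprocessing cleans text. Preprocessing TEXT helps."

# Naive tokenization: drop periods, then split on whitespace.
tokens = sample.replace(".", "").split()

vocab_original = set(tokens)
vocab_lowered = {token.lower() for token in tokens}

print(len(vocab_original), "unique tokens before lowercasing")
print(len(vocab_lowered), "unique tokens after lowercasing")
```

Here `Text`, `text`, and `TEXT` count as three entries in the original vocabulary and as one after lowercasing.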

## Remove numbers

We can either remove numbers or convert the numbers into their textual representations. 

We can use regular expressions to remove the numbers. 

In [7]:
def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result

remove_numbers(document)

"Text preprocessing is an important step in Natural Language Processing (NLP) tasks.  It involves cleaning and preparing text data for analysis or machine learning models.   Some of the key steps in text preprocessing include:   -  Converting text to lowercase   -  Removing punctuations, numbers, and special characters   -  Tokenization: Splitting text into individual words   -  Removing stopwords like 'the', 'is', 'in', etc.   -  Stemming and Lemmatization: Reducing words to their root forms   These steps help standardize text data, making it easier for models to interpret and analyze.\n"
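One thing to watch with `\d+` is decimal numbers: the digits are removed but the decimal point stays behind. The sketch below shows a possible variant that also consumes a fractional part. The function name `remove_numbers_with_decimals` and its pattern are assumptions for illustration, not part of the original tutorial.

```python
import re

def remove_numbers(text):
    # Same pattern as in the cell above: strip runs of digits.
    return re.sub(r'\d+', '', text)

def remove_numbers_with_decimals(text):
    # Hypothetical variant: also strip an optional fractional part like ".14".
    return re.sub(r'\d+(?:\.\d+)?', '', text)

print(repr(remove_numbers("pi is 3.14")))
print(repr(remove_numbers_with_decimals("pi is 3.14")))
```

The first call leaves a stray `.` in the text, while the second removes the whole number. A sentence-final period after an integer, as in `version 2.`, is kept by both.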

We can also convert the numbers into words. This can be done by using the `inflect` library.

In [8]:
# create the inflect engine (inflect itself is imported above)

p = inflect.engine()

# convert number into words
def convert_number(text):
    # split string into list of words
    temp_str = text.split()
    # initialise empty list
    new_string = []

    for word in temp_str:
        # if the word consists only of digits, convert it
        # to words and append it to the new_string list
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)

        # append the word as it is
        else:
            new_string.append(word)

    # join the words of new_string to form a string
    temp_str = ' '.join(new_string)
    return temp_str

convert_number(document)

"Text preprocessing is an important step in Natural Language Processing (NLP) tasks. It involves cleaning and preparing text data for analysis or machine learning models. Some of the key steps in text preprocessing include: one - Converting text to lowercase two - Removing punctuations, numbers, and special characters three - Tokenization: Splitting text into individual words four - Removing stopwords like 'the', 'is', 'in', etc. five - Stemming and Lemmatization: Reducing words to their root forms These steps help standardize text data, making it easier for models to interpret and analyze."
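`convert_number` only converts tokens that consist solely of digits, so a number followed directly by punctuation, such as `3,`, is left as is. A possible workaround is to let `re.sub` find the numbers and call a function for each match. In this sketch the `DIGIT_WORDS` table is a tiny stand-in for `inflect` that covers 0 to 9 only, and all names are hypothetical.

```python
import re

# Stand-in for inflect's number_to_words, covering 0-9 only (assumption).
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spell_out(match):
    # Called by re.sub once per matched number.
    value = int(match.group())
    if value < len(DIGIT_WORDS):
        return DIGIT_WORDS[value]
    # Numbers outside the table are left unchanged.
    return match.group()

def convert_small_numbers(text):
    # \b keeps digits inside words such as "3rd" from matching.
    return re.sub(r'\b\d+\b', spell_out, text)

print(convert_small_numbers("Step 3, then step 4."))
```

Because the replacement works on matches rather than on whitespace-separated tokens, the original spacing of the text is preserved as well.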

## Remove punctuation

We remove punctuation so that the same word does not show up in several forms. Without this step, `been.`, `been,` and `been!` would be treated as three different tokens.

In [9]:
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

remove_punctuation(document)

'Text preprocessing is an important step in Natural Language Processing NLP tasks  It involves cleaning and preparing text data for analysis or machine learning models   Some of the key steps in text preprocessing include  1   Converting text to lowercase  2   Removing punctuations numbers and special characters  3   Tokenization Splitting text into individual words  4   Removing stopwords like the is in etc  5   Stemming and Lemmatization Reducing words to their root forms   These steps help standardize text data making it easier for models to interpret and analyze\n'
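A quick check of the idea, using the same `str.maketrans` approach on the three variants of `been`. It also shows one limitation: `string.punctuation` contains ASCII punctuation only, so a typographic apostrophe is not removed.

```python
import string

def remove_punctuation(text):
    # Delete every character listed in string.punctuation (ASCII only).
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

variants = ["been.", "been,", "been!"]
normalized = {remove_punctuation(word) for word in variants}
print(normalized)

# The curly apostrophe (U+2019) is not in string.punctuation, so it stays.
print(remove_punctuation("don’t"))
```

If the input contains typographic quotes or dashes, they would need to be added to the translation table explicitly.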

## Remove whitespace

We can use the `split` and `join` methods to collapse every run of whitespace into a single space and to strip leading and trailing whitespace.

In [10]:
def remove_whitespace(text):
    return " ".join(text.split())

remove_whitespace(document)

"Text preprocessing is an important step in Natural Language Processing (NLP) tasks. It involves cleaning and preparing text data for analysis or machine learning models. Some of the key steps in text preprocessing include: 1 - Converting text to lowercase 2 - Removing punctuations, numbers, and special characters 3 - Tokenization: Splitting text into individual words 4 - Removing stopwords like 'the', 'is', 'in', etc. 5 - Stemming and Lemmatization: Reducing words to their root forms These steps help standardize text data, making it easier for models to interpret and analyze."
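The steps covered so far can be chained into a single function. The sketch below is one possible ordering, not the only valid one; the name `preprocess` is an assumption. Whitespace is normalized last because removing digits and punctuation leaves extra spaces behind.

```python
import re
import string

def preprocess(text):
    # 1. lowercase
    text = text.lower()
    # 2. remove digits
    text = re.sub(r'\d+', '', text)
    # 3. remove ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # 4. collapse whitespace (done last to clean up gaps left by steps 2-3)
    return " ".join(text.split())

cleaned = preprocess("1 -  Converting text to lowercase  2 -  Removing punctuations, numbers!")
print(cleaned)
```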

### Download required resources

In [11]:
nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /Users/mehdi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/mehdi/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mehdi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/mehdi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Remove default stopwords

Stopwords are very common words, such as 'the', 'is', and 'in', that carry little meaning on their own. Removing them usually leaves the sense of a sentence intact while shrinking the data. NLTK ships a list of English stopwords; the function below tokenizes the text and returns the tokens that are not on that list.

In [12]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text

remove_stopwords(document)

['Text',
 'preprocessing',
 'important',
 'step',
 'Natural',
 'Language',
 'Processing',
 '(',
 'NLP',
 ')',
 'tasks',
 '.',
 'It',
 'involves',
 'cleaning',
 'preparing',
 'text',
 'data',
 'analysis',
 'machine',
 'learning',
 'models',
 '.',
 'Some',
 'key',
 'steps',
 'text',
 'preprocessing',
 'include',
 ':',
 '1',
 '-',
 'Converting',
 'text',
 'lowercase',
 '2',
 '-',
 'Removing',
 'punctuations',
 ',',
 'numbers',
 ',',
 'special',
 'characters',
 '3',
 '-',
 'Tokenization',
 ':',
 'Splitting',
 'text',
 'individual',
 'words',
 '4',
 '-',
 'Removing',
 'stopwords',
 'like',
 "'the",
 "'",
 ',',
 "'is",
 "'",
 ',',
 "'in",
 "'",
 ',',
 'etc',
 '.',
 '5',
 '-',
 'Stemming',
 'Lemmatization',
 ':',
 'Reducing',
 'words',
 'root',
 'forms',
 'These',
 'steps',
 'help',
 'standardize',
 'text',
 'data',
 ',',
 'making',
 'easier',
 'models',
 'interpret',
 'analyze',
 '.']
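In the output above, `It`, `Some`, and `These` survive because the comparison against the lowercase stopword list is case-sensitive. A possible fix is to lowercase the text before filtering. The sketch below uses a small hand-picked `STOP_WORDS` set as a stand-in for NLTK's list and a simple regular-expression tokenizer instead of `word_tokenize`; both are assumptions made to keep the example self-contained.

```python
import re

# Hand-picked stand-in for NLTK's English stopword list (assumption).
STOP_WORDS = {"the", "is", "in", "it", "some", "of", "these",
              "and", "to", "for", "an"}

def remove_stopwords_lowercased(text):
    # Lowercase first so "It" and "it" are treated alike,
    # then keep alphabetic tokens only.
    tokens = re.findall(r'[a-z]+', text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

filtered = remove_stopwords_lowercased("It involves cleaning. These steps help.")
print(filtered)
```

With NLTK, the same effect can be had by passing `text.lower()` to the `remove_stopwords` function defined above.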

## Stemming

Stemming reduces a word to its stem, the part to which affixes such as -ed, -ize, de-, and -s are attached. A stemmer gets there by stripping affixes with a fixed set of rules; the Porter stemmer used below strips suffixes. Because the rules know nothing about the vocabulary, the result is not always an actual word (for example, `natur` for `Natural`).

In [13]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def stem_words(text):
    word_tokens = word_tokenize(text)
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems

stem_words(document)

['text',
 'preprocess',
 'is',
 'an',
 'import',
 'step',
 'in',
 'natur',
 'languag',
 'process',
 '(',
 'nlp',
 ')',
 'task',
 '.',
 'it',
 'involv',
 'clean',
 'and',
 'prepar',
 'text',
 'data',
 'for',
 'analysi',
 'or',
 'machin',
 'learn',
 'model',
 '.',
 'some',
 'of',
 'the',
 'key',
 'step',
 'in',
 'text',
 'preprocess',
 'includ',
 ':',
 '1',
 '-',
 'convert',
 'text',
 'to',
 'lowercas',
 '2',
 '-',
 'remov',
 'punctuat',
 ',',
 'number',
 ',',
 'and',
 'special',
 'charact',
 '3',
 '-',
 'token',
 ':',
 'split',
 'text',
 'into',
 'individu',
 'word',
 '4',
 '-',
 'remov',
 'stopword',
 'like',
 "'the",
 "'",
 ',',
 "'i",
 "'",
 ',',
 "'in",
 "'",
 ',',
 'etc',
 '.',
 '5',
 '-',
 'stem',
 'and',
 'lemmat',
 ':',
 'reduc',
 'word',
 'to',
 'their',
 'root',
 'form',
 'these',
 'step',
 'help',
 'standard',
 'text',
 'data',
 ',',
 'make',
 'it',
 'easier',
 'for',
 'model',
 'to',
 'interpret',
 'and',
 'analyz',
 '.']

## Lemmatization

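Lemmatization is a natural language processing (NLP) technique that reduces a word to its dictionary form, the lemma. Unlike a stem, a lemma is always an actual word. This helps with tasks such as text analysis and search, because related word forms can be compared with each other. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, which is why verb forms such as `involves` and `cleaning` are unchanged in the output below.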
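To make the difference between stemming and lemmatization concrete, here is a toy comparison. It is only a sketch: `toy_stem` applies a few suffix rules and is not the Porter algorithm, and `toy_lemmatize` looks words up in a tiny hypothetical table, whereas a real lemmatizer consults a full lexicon such as WordNet.

```python
# Toy suffix rules, checked in order. NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical lemma table standing in for a real lexicon.
LEMMAS = {"studies": "study", "mice": "mouse"}

def toy_lemmatize(word):
    word = word.lower()
    # Unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for word in ["studies", "mice"]:
    print(word, "->", toy_stem(word), "/", toy_lemmatize(word))
```

The rule-based stem of `studies` is the non-word `studi`, and the rules cannot handle the irregular plural `mice` at all, while the lookup returns `study` and `mouse`.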

In [14]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

def lemma_words(text):
    word_tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(word) for word in word_tokens]
    return lemmas

lemma_words(document)

['Text',
 'preprocessing',
 'is',
 'an',
 'important',
 'step',
 'in',
 'Natural',
 'Language',
 'Processing',
 '(',
 'NLP',
 ')',
 'task',
 '.',
 'It',
 'involves',
 'cleaning',
 'and',
 'preparing',
 'text',
 'data',
 'for',
 'analysis',
 'or',
 'machine',
 'learning',
 'model',
 '.',
 'Some',
 'of',
 'the',
 'key',
 'step',
 'in',
 'text',
 'preprocessing',
 'include',
 ':',
 '1',
 '-',
 'Converting',
 'text',
 'to',
 'lowercase',
 '2',
 '-',
 'Removing',
 'punctuation',
 ',',
 'number',
 ',',
 'and',
 'special',
 'character',
 '3',
 '-',
 'Tokenization',
 ':',
 'Splitting',
 'text',
 'into',
 'individual',
 'word',
 '4',
 '-',
 'Removing',
 'stopwords',
 'like',
 "'the",
 "'",
 ',',
 "'is",
 "'",
 ',',
 "'in",
 "'",
 ',',
 'etc',
 '.',
 '5',
 '-',
 'Stemming',
 'and',
 'Lemmatization',
 ':',
 'Reducing',
 'word',
 'to',
 'their',
 'root',
 'form',
 'These',
 'step',
 'help',
 'standardize',
 'text',
 'data',
 ',',
 'making',
 'it',
 'easier',
 'for',
 'model',
 'to',
 'interpret',


## Part 2: Inverted Index

from: https://www.geeksforgeeks.org/inverted-index/

Define the documents

In [15]:
document1 = "The quick brown fox jumped over the lazy dog."
document2 = "The lazy dog slept in the sun."

### Tokenize the documents

Convert each document to lowercase and split it into words

In [16]:
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()

# Combine the tokens into a list of unique terms
terms = list(set(tokens1 + tokens2))

### Build the inverted index

In [17]:
# Create an empty dictionary to store the inverted index
inverted_index = {}

# For each term, find the documents that contain it
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents

for term, documents in inverted_index.items():
    print(term, "->", ", ".join(documents))


fox -> Document 1
the -> Document 1, Document 2
dog -> Document 2
slept -> Document 2
dog. -> Document 1
quick -> Document 1
over -> Document 1
sun. -> Document 2
jumped -> Document 1
lazy -> Document 1, Document 2
in -> Document 2
brown -> Document 1
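Note that `dog` and `dog.` end up as separate terms above, because splitting on whitespace keeps the trailing period. The sketch below strips punctuation before indexing and stores posting lists as sets, which also makes an AND query a simple intersection. The `tokenize` helper and the `docs` dictionary are names chosen for this example.

```python
import string
from collections import defaultdict

docs = {
    "Document 1": "The quick brown fox jumped over the lazy dog.",
    "Document 2": "The lazy dog slept in the sun.",
}

STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def tokenize(text):
    # Lowercase, drop ASCII punctuation, split on whitespace.
    return text.lower().translate(STRIP_PUNCT).split()

# term -> set of document names containing the term
index = defaultdict(set)
for name, text in docs.items():
    for term in tokenize(text):
        index[term].add(name)

for term in sorted(index):
    print(term, "->", ", ".join(sorted(index[term])))

# AND query: documents containing both "lazy" and "dog"
both = index["lazy"] & index["dog"]
print(sorted(both))
```

With this tokenization `dog` maps to both documents, and the term `dog.` no longer appears in the index.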


## Part 3: Document Retrieval using Boolean Model

from: https://www.geeksforgeeks.org/document-retrieval-using-boolean-model-and-vector-space-model/

Create documents CSV file

In [18]:
%%sh
echo "documents,term1,term2,term3
document1,\"ice cream\",mango,litchi
document2,hockey,cricket,sport
document3,litchi,mango,chocolate
document4,nice,good,cute" > documents.csv

In [19]:
def filter(documents, rows, cols):
    '''Read the data frame and fill three globals: `keys` with the document
    names, `terms` with the unique terms, and `dicti`, which maps each
    document name to the list of terms it contains.'''

    for i in range(rows):
        for j in range(cols):
            # traversal through the data frame

            if(j == 0):
                # first column has the name of the document in the csv file
                keys.append(documents.loc[i].iat[j])
            else:
                dummy_List.append(documents.loc[i].iat[j])
                # dummy list to update the terms in the dictionary

                if documents.loc[i].iat[j] not in terms:
                    # add the terms to the list if it is not present else continue
                    terms.append(documents.loc[i].iat[j])

        copy = dummy_List.copy()
        # copying the dummy list to a different list

        dicti.update({documents.loc[i].iat[0]: copy})
        # adding the key value pair to a dictionary

        dummy_List.clear()
        # clearing the dummy list

In [20]:
def bool_Representation(dicti, rows, cols):
    '''Build the boolean representation of each document: `vec_Dic` maps a
    document name to a list of 0/1 values, one per entry in `terms`, marking
    which terms the document contains.'''

    terms.sort()
    # sort the terms alphabetically for convenience; the order itself does not
    # matter as long as documents and queries use the same one

    for i in (dicti):
        # for every document in the dictionary we check for each string present in
        # the list

        for j in terms:
            # if the string is present in the list we append 1 else we append 0

            if j in dicti[i]:
                dummy_List.append(1)
            else:
                dummy_List.append(0)
            # appending 1 or 0 for obtaining the boolean representation

        copy = dummy_List.copy()
        # copying the dummy list to a different list

        vec_Dic.update({i: copy})
        # adding the key value pair to a dictionary

        dummy_List.clear()
        # clearing the dummy list

In [21]:
def query_Vector(query):
    '''In this function we represent the query in the form of boolean vector'''

    qvect = []
    # query vector which is returned at the end of the function

    for i in terms:
        # if the word present in the list of terms is also present in the query
        # then append 1 else append 0

        if i in query:
            qvect.append(1)
        else:
            qvect.append(0)

    return qvect
    # return the query vector which is obtained in the boolean form

In [22]:
def prediction(q_Vect):
    '''In this function we make the prediction regarding which document is related
    to the given query by performing the boolean operations'''

    dictionary = {}
    listi = []
    count = 0
    # initialisation of the dictionary , list and a variable which is further
    # required for performing the computation

    term_Len = len(terms)
    # number of terms present in the term list

    for i in vec_Dic:
        # for every document and its boolean term vector

        for t in range(term_Len):
            if(q_Vect[t] == vec_Dic[i][t]):
                # the term is either present in both the query and the
                # document, or absent from both

                count += 1
                # count the agreement; terms present in only one of the
                # query and the document do not add to the count

        dictionary.update({i: count})
        # map the document name to its agreement count

        count = 0
        # reset the count for the next document

    for i in dictionary:
        listi.append(dictionary[i])
        # here we append the count value to list

    listi = sorted(listi, reverse=True)
    # we sort the list in the descending order which is needed to rank the 
    #documents according to the relevance

    ans = ' '
    # variable to store the name of the document which is most relevant
    
    with open('output.txt', 'w') as f:
        
        with redirect_stdout(f):
            # to redirect the output to a text file
            print("ranking of the documents")

            for count, i in enumerate(listi):
                key = check(dictionary, i)
                # Function call to get the key when the value is known
                if count == 0:
                    ans = key
                    # to store the name of the document which is most relevant

                print(key, "rank is", count+1)
                # print the name of the document along with its rank

                dictionary.pop(key)
                # remove the key from the dictionary after printing

            print(ans, "is the most relevant document for the given query")
            # to print the name of the document which is most relevant

In [23]:
def check(dictionary, val):
    '''Function to return the key when the value is known'''

    for key, value in dictionary.items():
        if(val == value):
            # if the given value is same as the value present in the dictionary
            # return the key

            return key

In [24]:
# module to redirect the output to a text file
from contextlib import redirect_stdout

# list to store the terms present in the documents
terms = []

# list to store the names of the documents
keys = []

# dictionary to store the name of the document and the boolean vector as list
vec_Dic = {}

# dictionary to store the name of the document and the terms present in it as a list
dicti = {}

# list for performing some operations and clearing them
dummy_List = []

documents = pandas.read_csv(r'documents.csv')
# to read the data from the csv file as a dataframe
rows = len(documents)
# to get the number of rows
cols = len(documents.columns)
# to get the number of columns
filter(documents, rows, cols)
# fill `keys`, `terms`, and the document -> terms dictionary `dicti`
bool_Representation(dicti, rows, cols)
# fill `vec_Dic` with one boolean vector per document
print("Enter query:")
query = input()

print(query)

# the query is read from the user; the input below reproduces the
# contents of output.txt shown further down:
# hockey is a national sport
query = query.split(' ')
# splitting the query as a list of strings
q_Vect = query_Vector(query)
# function call to represent the query in the form of boolean vector
prediction(q_Vect)
# Function call to make the prediction regarding which document is related to
# the given query by performing the boolean operations

Enter query:


hockey is a national sport


In [25]:
!cat output.txt

ranking of the documents
document2 rank is 1
document1 rank is 2
document3 rank is 3
document4 rank is 4
document2 is the most relevant document for the given query
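The scoring used above counts, for each document, the vocabulary terms on which the query vector and the document vector agree (both 1 or both 0). The compact sketch below reproduces that scoring without global state or file output. The data is typed in directly instead of being read from `documents.csv`, and the function names are assumptions for this example.

```python
# Same data as documents.csv, typed in directly.
docs = {
    "document1": {"ice cream", "mango", "litchi"},
    "document2": {"hockey", "cricket", "sport"},
    "document3": {"litchi", "mango", "chocolate"},
    "document4": {"nice", "good", "cute"},
}

# Sorted list of all unique terms across the documents.
vocabulary = sorted(set().union(*docs.values()))

def to_vector(present):
    # 1 if the vocabulary term is in `present`, else 0.
    return [1 if term in present else 0 for term in vocabulary]

def agreement(query_terms, doc_terms):
    # Number of positions where the two boolean vectors are equal.
    q, d = to_vector(query_terms), to_vector(doc_terms)
    return sum(1 for a, b in zip(q, d) if a == b)

query = set("hockey is a national sport".split())
scores = {name: agreement(query, terms) for name, terms in docs.items()}

# sorted() is stable, so tied documents keep their original order.
ranking = sorted(scores, key=scores.get, reverse=True)
for rank, name in enumerate(ranking, start=1):
    print(name, "rank is", rank, "(score", str(scores[name]) + ")")
```

Only `document2` shares terms with the query; the other three documents tie on score, so their relative rank reflects input order rather than relevance. Because the query is split on spaces, a multi-word term such as `ice cream` can never match.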
