# Information Retrieval System

Information retrieval is the process of obtaining, from a collection of information system resources, those that are relevant to an information need. The purpose of this assignment is to give you a feel for how an information retrieval system (IRS) works. Follow the steps listed below and, by the end, you will have built a small IRS of your own. So, let's start.

In [1]:
# required imports
import numpy as np
import fnmatch
import os


Suppose we have three files containing the following data:

### File Contents

!["This is my book" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f1.png?raw=true)
!["This is my pen" - File 2](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f2.png?raw=true)
!["This is book is interesting" - File 3](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f3.png?raw=true)

# Step 1 Create Files with Dummy data

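Create a few files with dummy data of your own choice, as shown above. Name them so that they match the pattern `f*.txt` (for example `f1.txt`, `f2.txt`, `f3.txt`), because the next step searches for that pattern.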
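
One way to create the files from Python is sketched below. The text of the third file and the use of a temporary directory are assumptions made for illustration; any short sentences of your own work equally well, and for the assignment itself you would write into the notebook's folder.

```python
import os
import tempfile

# Illustrative contents only; replace them with sentences of your own choice.
dummy_files = {
    "f1.txt": "This is my book",
    "f2.txt": "This is my pen",
    "f3.txt": "My book is interesting",  # assumed text, chosen for this sketch
}

# Write into a throwaway directory so nothing in the working folder is overwritten.
demo_dir = tempfile.mkdtemp(prefix="irs_demo_")

for name, text in dummy_files.items():
    with open(os.path.join(demo_dir, name), "w") as f:
        f.write(text)

print(sorted(os.listdir(demo_dir)))  # ['f1.txt', 'f2.txt', 'f3.txt']
```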

# Step 2 Traverse Directories

Now traverse the current directory and store every matching file name in a `dict` variable (`files_dict`), mapping each name to its index.

In [2]:
# Initialize the variables used below; you can add more if required.

file_count = 0             # number of matching files found
files_dict = {}            # maps each file name to its index (its row in the matrix)
unique_word_set = set()    # all the unique words across the files


In [3]:

# Use the os and fnmatch modules to find every file in the current directory
# whose name starts with "f" and ends with ".txt".
for file in sorted(os.listdir('.')):  # sorted so that the indices are reproducible
    if fnmatch.fnmatch(file, 'f*.txt'):
        # Map the file name to the next index, then count the file
        files_dict[file] = file_count
        file_count += 1

Displaying the count of files.

In [4]:
# Print the total number of files
print("\nTotal number of files\n", file_count)


Total number of files
 3


Displaying Dictionary containing all files.

In [5]:
# Print the dictionary containing the files
print("\nDictionary containing files\n", files_dict)


Dictionary containing files
 {'f1.txt': 0, 'f2.txt': 1, 'f3.txt': 2}
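
`os.listdir` returns entries in arbitrary order, so the index assigned to each file can differ from one machine to another. A minimal sketch of one way to keep the mapping reproducible is to sort the matching names before numbering them. The directory `data_dir` and the file `notes.md` are hypothetical stand-ins created only for this example.

```python
import fnmatch
import os
import tempfile

# Hypothetical stand-in for the assignment folder: a temporary directory
# holding three matching files and one file that should be ignored.
data_dir = tempfile.mkdtemp()
for name in ["f3.txt", "f1.txt", "notes.md", "f2.txt"]:
    with open(os.path.join(data_dir, name), "w") as f:
        f.write("placeholder")

# Keep only the names matching f*.txt, sort them, then number them from 0.
matching = sorted(n for n in os.listdir(data_dir) if fnmatch.fnmatch(n, "f*.txt"))
files_index = {name: i for i, name in enumerate(matching)}

print(files_index)  # {'f1.txt': 0, 'f2.txt': 1, 'f3.txt': 2}
```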


# Step 3 Extract Unique Vocabulary

In [6]:
# write code to print all the unique words in every file and store them in a set

In [7]:

# Open every file listed in files_dict
for file_name in files_dict:
    # Open the file in read mode
    with open(file_name, 'r') as f:
        # Read the contents of the file, convert to lowercase, and split into words
        words = f.read().lower().split()
        # Add the words to the set (duplicates are dropped automatically)
        unique_word_set.update(words)

# Print the set of unique words
print(unique_word_set)

# Print the number of files
print("\nCount of files: ", file_count)



{'interesting', 'my', 'book', 'this', 'is', 'pen'}

Count of files:  3


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o1.png?raw=true)
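
A Python `set` has no defined order, so the words may print in a different order from the expected output; only the membership matters. The sketch below uses in-memory strings in place of the files (the third sentence is an assumed example) and prints the vocabulary sorted so that runs are easy to compare.

```python
# In-memory stand-ins for the file contents (assumed for illustration).
documents = ["This is my book", "This is my pen", "My book is interesting"]

vocabulary = set()
for text in documents:
    # Same normalization as above: lowercase, then split on whitespace.
    vocabulary.update(text.lower().split())

# Sorting gives a stable, comparable listing of the unordered set.
print(sorted(vocabulary))
# ['book', 'interesting', 'is', 'my', 'pen', 'this']
```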


# Step 4 Create Term Document Matrix

Create a term-document matrix using the bag-of-words approach, and display its contents both initially (all zeros) and after it has been filled.
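
In a bag-of-words representation, a document is reduced to the words it contains, ignoring their order. The following sketch contrasts the count form with the binary (present or absent) form that Step 5 stores in the matrix. The sample text is made up for the example, and the vocabulary order is copied from the word dictionary printed later in this step.

```python
from collections import Counter

# Vocabulary in the column order used by the notebook's word dictionary.
vocabulary = ["this", "is", "my", "book", "pen", "interesting"]

# Made-up text for illustration.
text = "My book my pen"
counts = Counter(text.lower().split())

# Count form: how many times each vocabulary word occurs (0 if absent).
count_row = [counts[word] for word in vocabulary]

# Binary form: 1 if the word occurs at all, otherwise 0.
binary_row = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(count_row)   # [0, 0, 2, 1, 1, 0]
print(binary_row)  # [0, 0, 1, 1, 1, 0]
```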

In [8]:
# Create the term-document matrix such that the columns are the unique words and the rows are the files
# Write code to go through the unique words appearing in all the files and store them in a dictionary of words

In [9]:


# Create the term-document matrix: one row per file, one column per unique word,
# with every entry initialized to zero
term_doc_matrix = [[0] * len(unique_word_set) for _ in range(file_count)]
print(term_doc_matrix)

# Dictionary mapping each word to its column index in the matrix
words_dictionary = {}

# Next unused column index
count = 0

# Loop over all the files
for file_name in files_dict:
    # Open the file in read mode
    with open(file_name, 'r') as f:
        # Read the file's contents, convert to lowercase, and split into words
        words = f.read().lower().split()

        for word in words:
            # A word seen for the first time gets the next column index
            if word not in words_dictionary:
                words_dictionary[word] = count
                count += 1

# Print the dictionary
print(words_dictionary)



[[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
{'this': 0, 'is': 1, 'my': 2, 'book': 3, 'pen': 4, 'interesting': 5}


### Expected Output

!["Expected Output of the initial term-document matrix and word dictionary"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o2.png?raw=true)

# Step 5 Fill Term Document Matrix

In [10]:
# Fill the term-document matrix by checking whether each unique word exists in a file
# If it exists, put a 1 in term_doc_matrix (e.g. term_doc_matrix[file][word] = 1)
# Do the same for all the files present in the directory

In [11]:


# Loop over all the files in the dictionary
for file_name in files_dict:
    # Open the file in read mode
    with open(file_name, 'r') as f:
        # Read the contents of the file, convert to lowercase, and split into words
        words = f.read().lower().split()
        for word in words:
            if word in words_dictionary:
                # files_dict[file_name] is the row index of the file;
                # words_dictionary[word] is the column index of the word
                term_doc_matrix[files_dict[file_name]][words_dictionary[word]] = 1

# Print the term-document matrix
print("\nTerm-document matrix:\n", term_doc_matrix)




Term-document matrix:
 [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0], [0, 1, 1, 1, 0, 1]]


### Expected Output

!["Expected Output of the filled term-document matrix"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o4.png?raw=true)


# Step 6 Ask for a user Query

In [12]:
# For the user query, make a column vector whose length equals the number of unique words in the set

In [13]:


# Create a column vector of zeros of length equal to the number of unique words
query_vector = np.zeros((len(unique_word_set), 1))

# print query_vector
print("Query Vector:\n", query_vector)



Query Vector:
 [[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]]


### Expected Output

!["Expected Output of the initial query vector"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o5.png?raw=true)


In [14]:
# query = input("\nWrite something for searching  ")
# Check whether each word of the query exists in the set of unique words
# If it exists, increment the count of that word in the query vector


In [15]:

# Take the query as input
query = input("\nWrite something for searching  ")

# Convert the query to lowercase and split it into words
query_words = query.lower().split()

for word in query_words:
    # Words that are not in the set of unique words are ignored
    if word in unique_word_set:
        # Increment the entry at the word's column index
        query_vector[words_dictionary[word]] += 1

# Print the query vector
print("\nQuery vector:\n", query_vector)




Write something for searching  my pen

Query vector:
 [[0.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]]


### Expected Output

!["Expected Output of the query vector"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o6.png?raw=true)
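
To see how the query vector behaves for repeated and unknown words, the logic can be wrapped in a small helper. This is a sketch that uses a plain list rather than a NumPy column vector; `build_query_vector` is a hypothetical name, and the word index is copied from the output of Step 4.

```python
# Word-to-column index as printed in Step 4.
words_dictionary = {"this": 0, "is": 1, "my": 2, "book": 3, "pen": 4, "interesting": 5}


def build_query_vector(query, word_index):
    """Return a list of word counts for the query, one entry per vocabulary word."""
    vector = [0] * len(word_index)
    for word in query.lower().split():
        # Words outside the vocabulary are skipped.
        if word in word_index:
            vector[word_index[word]] += 1
    return vector


print(build_query_vector("my pen", words_dictionary))        # [0, 0, 1, 0, 1, 0]
print(build_query_vector("My my laptop", words_dictionary))  # [0, 0, 2, 0, 0, 0]
```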


# Step 7 Display Resultant Vector

Display:

1. The resultant vector.
2. The maximum value in the resultant vector.
3. The index of the maximum value in the resultant vector.


In [16]:


# Multiply term_doc_matrix by query_vector using the dot product
resultant = np.dot(term_doc_matrix, query_vector)

# Find the maximum value in the resultant vector
# (the built-in max works here because each row holds a single element)
maxValue = max(resultant)

# Find the index of the maximum value in the resultant vector
max_index = np.argmax(resultant)

# Print the resultant vector
print("Result:\n", resultant)

# Print the index of the maximum value
print("Max_index:\n", max_index)

# Print the maximum value
print("Max:\n", maxValue)


Result:
 [[1.]
 [2.]
 [1.]]
Max_index:
 1
Max:
 [2.]


### Expected Output

!["Expected Output of the resultant vector, max index, and max value"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o7.png?raw=true)
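
The scores can be checked by hand. Each entry of the resultant vector is the dot product of one row of the term-document matrix with the query vector, so it counts how many of the query's words appear in that file. A sketch in plain Python, using the matrix printed in Step 5 and the query vector for `my pen` (the names `scores`, `best_score`, and `best_index` are chosen for this example):

```python
# Values copied from the outputs of Step 5 and Step 6.
term_doc_matrix = [
    [1, 1, 1, 1, 0, 0],  # f1.txt
    [1, 1, 1, 0, 1, 0],  # f2.txt
    [0, 1, 1, 1, 0, 1],  # f3.txt
]
query_vector = [0, 0, 1, 0, 1, 0]  # "my pen"

# Dot product of each row with the query vector.
scores = [sum(t * q for t, q in zip(row, query_vector)) for row in term_doc_matrix]

best_score = max(scores)
best_index = scores.index(best_score)  # first index holding the maximum

print(scores)      # [1, 2, 1]
print(best_score)  # 2
print(best_index)  # 1
```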


# Step 8 Display the contents of file


In [17]:
#Write the code to identify the file_name having maximum value in the resultant vector and display its contents.

In [None]:

# Find the file whose index matches the index of the maximum value in the resultant vector
for key, value in files_dict.items():
    if value == max_index:
        # Print the name of the best-matching file
        print("\nContents of the file with maximum value in the resultant vector:", key)

        # Open that file in read mode and print its contents
        with open(key, 'r') as f:
            print(f.read())
        break
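
Putting the steps together, the sketch below runs the same pipeline on in-memory strings instead of files, so it can be tried without touching the disk. The function name `search`, the text of the third document, and the choice to return every document tied for the top score are assumptions made for illustration, not part of the assignment.

```python
def search(documents, query):
    """Return (names of the best-matching documents, their score).

    documents maps a name to its text; matching is on lowercase,
    whitespace-separated words, as in the steps above.
    """
    names = list(documents)

    # Steps 3 and 4: give each distinct word a column index, in order of first appearance.
    word_index = {}
    for name in names:
        for word in documents[name].lower().split():
            word_index.setdefault(word, len(word_index))

    # Step 5: binary term-document matrix, one row per document.
    matrix = [[0] * len(word_index) for _ in names]
    for row, name in enumerate(names):
        for word in documents[name].lower().split():
            matrix[row][word_index[word]] = 1

    # Step 6: query vector of word counts; unknown words are ignored.
    query_vector = [0] * len(word_index)
    for word in query.lower().split():
        if word in word_index:
            query_vector[word_index[word]] += 1

    # Step 7: one score per document (row dot query).
    scores = [sum(t * q for t, q in zip(r, query_vector)) for r in matrix]

    # Step 8: report every document that reaches the top score.
    best = max(scores)
    return [n for n, s in zip(names, scores) if s == best], best


docs = {
    "f1.txt": "This is my book",
    "f2.txt": "This is my pen",
    "f3.txt": "My book is interesting",  # assumed text
}
print(search(docs, "my pen"))  # (['f2.txt'], 2)
print(search(docs, "book"))    # (['f1.txt', 'f3.txt'], 1)
```

`np.argmax` in Step 7 reports only the first index that holds the maximum, so for a query such as `book` the notebook code would display just one of the tied files.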

