# Assignment 1: Information Retrieval System (IRS)

Information retrieval is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. This assignment introduces the basics of an information retrieval system (IRS). Follow the steps below and, by the end, you will have built a small IRS of your own. Let's start.

In [56]:
# required imports
import numpy as np
import fnmatch
import os


Suppose we have three files containing the following data:

### File Contents

!["This is my book" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f1.png?raw=true)
!["This is my pen" - File 2](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f2.png?raw=true)
!["This is book is intersting" - File 3](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/f3.png?raw=true)

# Step 1 Create Files with Dummy data

Create a few files with dummy data of your own choice, as shown above.
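
One way to create the files programmatically is sketched below. The helper name `create_dummy_files` and its `directory` parameter are hypothetical and not part of the assignment template; the file names and contents are taken from the outputs shown later in this notebook. The demonstration writes into a temporary directory so nothing is overwritten by accident.

```python
import os
import tempfile

# Hypothetical helper data: names and contents mirror the sample outputs
# that appear later in this notebook.
DUMMY_DATA = {
    "f1.txt": "This is my book",
    "f2.txt": "This is my pen",
    "f3.txt": "My book is intersting",
}

def create_dummy_files(directory="."):
    """Write each dummy document into `directory` and return the sorted file names."""
    os.makedirs(directory, exist_ok=True)
    for name, text in DUMMY_DATA.items():
        with open(os.path.join(directory, name), "w", encoding="utf-8") as f:
            f.write(text)
    return sorted(DUMMY_DATA)

demo_dir = tempfile.mkdtemp()            # throwaway directory for the demonstration
created = create_dummy_files(demo_dir)
print(created)
```

Calling `create_dummy_files(".")` instead would place the files next to the notebook, which is where the traversal in Step 2 looks for them.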

# Step 2 Traverse Directories

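Now traverse the directories and store the contents of all the files in a dict variable (`files_dict`). In the solution below, `files_dict` maps each word to the files that contain it and to the number of times it occurs in each file.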

In [57]:
# Some variables are initialized here; add more if required.

file_count = 0             # number of files found
files_dict = {}            # maps each word to a {filename: count} dictionary
unique_word_set = set()    # all the unique words across the files
file_list = []             # names of the files found


In [58]:
#Your code starts here
# Traverse directories and record per-file word counts in files_dict
for root, dirs, files in os.walk("."):          # current directory and all its subdirectories
    for filename in files:
        if filename.endswith(".txt"):           # only consider text files
            file_count += 1
            filepath = os.path.join(root, filename)
            file_list.append(filename)
            with open(filepath, "r") as f:
                words = f.read().lower().split()    # lowercase, then split on whitespace
            for word in words:
                # files_dict[word] is a {filename: count} dictionary
                counts = files_dict.setdefault(word, {})
                counts[filename] = counts.get(filename, 0) + 1
#Your code ends here

Displaying the count of files.

In [59]:
print("\nTotal Number  of files\n", file_count)


Total Number  of files
 3


Displaying the dictionary built from all files.

In [60]:
print("\nDictionary containing  files\n", files_dict)


Dictionary containing  files
 {'this': {'f1.txt': 1, 'f2.txt': 1}, 'is': {'f1.txt': 1, 'f2.txt': 1, 'f3.txt': 1}, 'my': {'f1.txt': 1, 'f2.txt': 1, 'f3.txt': 1}, 'book': {'f1.txt': 1, 'f3.txt': 1}, 'pen': {'f2.txt': 1}, 'intersting': {'f3.txt': 1}}
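
The nested dictionary printed above maps each word to the files that contain it, with a per-file count. A compact way to build the same structure, assuming the documents are already in memory, is sketched below with `collections.defaultdict`. The names `build_index` and `docs` are hypothetical and used only for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to a {filename: occurrence count} dictionary."""
    index = defaultdict(lambda: defaultdict(int))
    for filename, text in docs.items():
        for word in text.lower().split():
            index[word][filename] += 1
    # Convert to plain dicts so the result prints like files_dict
    return {word: dict(counts) for word, counts in index.items()}

# In-memory stand-ins for the three text files
docs = {
    "f1.txt": "This is my book",
    "f2.txt": "This is my pen",
    "f3.txt": "My book is intersting",
}
index = build_index(docs)
print(index["book"])   # {'f1.txt': 1, 'f3.txt': 1}
```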


# Step 3 Extract Unique Vocabulary

In [61]:
# Write code to collect the unique words across all files in a set and print them


In [63]:
#Your code starts here    
unique_word_set = set(files_dict.keys())   # Get unique words from files_dict


print("\nUnique words in every file\n", unique_word_set)    # Print set of unique words
print("\ncount of files    ", file_count)  # Print files count
#Your code ends here


Unique words in every file
 {'book', 'pen', 'is', 'intersting', 'my', 'this'}

count of files     3


### Expected Output

!["Expected Output of unique words" - File 1](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o1.png?raw=true)


# Step 4 Create Term Document Matrix

Create the term-document matrix using the bag-of-words approach, and display its contents both initially (all zeros) and after it has been filled.

In [64]:
# Create the term-document matrix so that the columns are the unique words and the rows are the files
# Write code to count the appearances of every unique word across all files and store the counts in a dictionary keyed by word

In [65]:
#Your code starts here
# Create Term Document Matrix
term_doc_matrix = []
word_index_dict = {}
file_index_dict = {}

# Loop over each unique word in the vocabulary and assign an index to each word
for i, word in enumerate(sorted(unique_word_set)):
    word_index_dict[word] = i

# Loop over each file in the corpus and assign an index to each file
for i, filename in enumerate(sorted(file_list)):
    file_index_dict[filename] = i

# Loop over each file in the corpus and create a row for each file in the term document matrix
for i in range(len(file_list)):
    row = [0] * len(unique_word_set)
    term_doc_matrix.append(row)

# Print the Term Document Matrix, Dictionary of Unique Words, and Dictionary of Files
print("\nTerm Document Matrix\n")
for sublist in term_doc_matrix:
    print(sublist)
print("\nDictionary of Unique Words:\n", word_index_dict)
print("\nDictionary of Files:\n", file_index_dict)
#Your code ends here


Term Document Matrix

[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]

Dictionary of Unique Words:
 {'book': 0, 'intersting': 1, 'is': 2, 'my': 3, 'pen': 4, 'this': 5}

Dictionary of Files:
 {'f1.txt': 0, 'f2.txt': 1, 'f3.txt': 2}


### Expected Output

!["Expected output of the initial term-document matrix"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o2.png?raw=true)

# Step 5 Fill Term Document Matrix

In [66]:
# Fill the term doc matrix by checking if the unique word exists in a file or not
# If it exists, set the matching cell of term_doc_matrix to 1 (e.g. term_doc_matrix[file][word] = 1)
# Do the same for all the files present in the directory

In [67]:
#Your code starts here 
# Fill the term document matrix
for i, filename in enumerate(sorted(file_list)):  # row i corresponds to the i-th file in sorted order
    # Note: file_list holds bare file names, so this assumes the files are in the current directory
    with open(filename, 'r') as file:
        file_words = file.read().lower().split()  # lowercase the contents and split them into words
    for word in file_words:
        if word in unique_word_set:
            term_doc_matrix[i][word_index_dict[word]] = 1  # mark the word as present in this file

# Print the unique-word dictionary and the filled term document matrix
print("\nDictionary of Unique Words:\n", word_index_dict)
print("\nTerm Document Matrix")
for sublist in term_doc_matrix:
    print(sublist)
#Your code ends here


Dictionary of Unique Words:
 {'book': 0, 'intersting': 1, 'is': 2, 'my': 3, 'pen': 4, 'this': 5}

Term Document Matrix
[1, 0, 1, 1, 0, 1]
[0, 0, 1, 1, 1, 1]
[1, 1, 1, 1, 0, 0]


### Expected Output

!["Expected output of the filled term-document matrix"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o4.png?raw=true)
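
The matrix filled in this step is binary: a cell is 1 if the word occurs in the file at all. A bag-of-words matrix can also store how often each word occurs. The sketch below shows that count-based variant on in-memory text. The function `count_matrix` and the sample `docs` (which repeat a word on purpose) are hypothetical and only illustrate the idea.

```python
import numpy as np

def count_matrix(docs):
    """Return (matrix, vocabulary, filenames); rows are files, columns are words."""
    filenames = sorted(docs)
    vocabulary = sorted({w for text in docs.values() for w in text.lower().split()})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = np.zeros((len(filenames), len(vocabulary)), dtype=int)
    for i, name in enumerate(filenames):
        for word in docs[name].lower().split():
            matrix[i, column[word]] += 1   # count occurrences instead of setting 1
    return matrix, vocabulary, filenames

# Illustrative documents; "my" appears twice in a.txt
docs = {"a.txt": "my book my pen", "b.txt": "this is my book"}
matrix, vocabulary, filenames = count_matrix(docs)
print(vocabulary)   # ['book', 'is', 'my', 'pen', 'this']
print(matrix)
```

With counts, a file that mentions a query word several times scores higher than one that mentions it once, which the binary matrix cannot express.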


# Step 6 Ask for a user Query

In [68]:
# For the user query, make a column vector whose length equals the number of unique words in the set

In [69]:
#Your code starts here    

colVector = [[0] for i in range(len(unique_word_set))]  # create a column vector of zeros with the same length as unique_word_set

print("colVector initially")  # print a message to indicate the initial state of colVector
for sublist in colVector: # iterate over each sublist in colVector and print it
    print(sublist)

#Your code ends here

colVector initially
[0]
[0]
[0]
[0]
[0]
[0]


### Expected Output

!["Expected output of the initial column vector"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o5.png?raw=true)


In [70]:
query = input("\nWrite something for searching  ").lower()
# Check whether each word of the query exists in the set of unique words
# If it exists, increment the count of that word in the column vector



Write something for searching  very intersting 


In [71]:
#Your code starts here    

for word in query.split(): # Split the query into words and loop over each word
    if word in unique_word_set:  # Check if the word is in the set of unique words
        colVector[word_index_dict[word]][0] += 1      # If the word exists, increment the count of that word in the column vector

print("\ncolVector after query\n") # Print the column vector after query
for sublist in colVector: # iterate over each sublist in colVector and print it
    print(sublist)

#Your code ends here


colVector after query

[0]
[1]
[0]
[0]
[0]
[0]


### Expected Output

!["Expected output of the column vector after the query"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o6.png?raw=true)
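
The query handling in this step can also be wrapped in a small function. The sketch below returns a flat 1-D NumPy vector rather than a list of one-element lists; `vectorize_query` is a hypothetical name, and `word_index` copies the word-to-column mapping printed in Step 4.

```python
import numpy as np

# Word-to-column mapping copied from the Step 4 output
word_index = {"book": 0, "intersting": 1, "is": 2, "my": 3, "pen": 4, "this": 5}

def vectorize_query(query, word_index):
    """Return a 1-D count vector for `query`; words outside the vocabulary are skipped."""
    vector = np.zeros(len(word_index), dtype=int)
    for word in query.lower().split():
        if word in word_index:
            vector[word_index[word]] += 1
    return vector

q = vectorize_query("very intersting", word_index)
print(q)   # [0 1 0 0 0 0]
```

The word "very" is not in the vocabulary, so it contributes nothing to the vector.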


# Step 7 Display Resultant Vector

Display:
1. Resultant vector.
2. Max value in resultant vector.
3. Index of max value in resultant vector.


In [72]:
#Your code starts here  
resultant_vector = np.dot(term_doc_matrix, colVector)  # Calculate the dot product of term_doc_matrix and colVector
print("Resultant vector:\n", resultant_vector)  # Print the resultant vector
max_index = np.argmax(resultant_vector)  # Get the index of the maximum value in the resultant vector
print("Max value in resultant vector:", resultant_vector[max_index])  # Print the maximum value in the resultant vector
print("Index of max value in resultant vector:", max_index)  # Print the index of the maximum value in the resultant vector

#Your code ends here

Resultant vector:
 [[0]
 [0]
 [1]]
Max value in resultant vector: [1]
Index of max value in resultant vector: 2


### Expected Output

!["Expected output of the resultant vector"](https://github.com/ahmad-14a/CS-F20-ML/blob/main/IRS-Assignment%201/o7.png?raw=true)
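
Each entry of the resultant vector is the dot product of one file's row with the query vector, so with a binary matrix it counts how many query-word occurrences that file matches. The sketch below reproduces the computation with the matrix printed in Step 5 and a flat query vector (an assumption made here so that the maximum comes out as a scalar instead of a one-element array).

```python
import numpy as np

# Rows: f1.txt, f2.txt, f3.txt; columns: book, intersting, is, my, pen, this
term_doc = np.array([
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
])
query_vec = np.array([0, 1, 0, 0, 0, 0])   # "very intersting": only "intersting" is known

scores = term_doc @ query_vec              # one score per file
best = int(np.argmax(scores))              # row index of the highest score
print(scores, scores.max(), best)          # [0 0 1] 1 2
```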


# Step 8 Display the contents of file


In [73]:
#Write the code to identify the file_name having maximum value in the resultant vector and display its contents.

In [74]:
#Your code starts here    

max_index = np.argmax(resultant_vector) # Find the row index with the maximum value in the resultant vector
max_file_name = sorted(file_list)[max_index] # Matrix rows follow sorted(file_list), so look the name up in that same order
print(f"\nFile with maximum value in resultant vector: {max_file_name}\n") # Print the filename for the file with the maximum value
with open(max_file_name, 'r') as file: # Open and read the file with the maximum value
    print(file.read()) # Print the contents of the file

#Your code ends here


File with maximum value in resultant vector: f3.txt

My book is intersting
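
Note that `np.argmax` returns index 0 even when every score is zero, so a query containing no known words would still display the first file. A possible refinement, sketched below under the assumption that `filenames` is in the same order as the matrix rows, ranks all files and returns an empty list when nothing matches. The name `rank_files` is hypothetical.

```python
import numpy as np

def rank_files(scores, filenames):
    """Return (filename, score) pairs with score > 0, best first."""
    scores = np.asarray(scores).ravel()          # accept (n,) or (n, 1) shapes
    order = np.argsort(-scores, kind="stable")   # descending; ties keep file order
    return [(filenames[i], int(scores[i])) for i in order if scores[i] > 0]

names = ["f1.txt", "f2.txt", "f3.txt"]
print(rank_files([[0], [0], [1]], names))   # [('f3.txt', 1)]
print(rank_files([[2], [1], [2]], names))   # [('f1.txt', 2), ('f3.txt', 2), ('f2.txt', 1)]
print(rank_files([[0], [0], [0]], names))   # []
```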


Congratulations! You are now able to build your own small IRS.