Information Retrieval Assignment 1

Team: Sandeep, Sajeel, Shekhar, Nischal, Devender

Directory Structure

File	Description
stories	Contains all files that need to be indexed
main.ipynb	Inverted index creation code
Comparators.ipynb	Runs queries against the created inverted index
mapping.json	Mapping of document IDs to document locations
output.json	Inverted index (without stemming), i.e. terms with the list of DocIDs containing them
outputStemmed.json	Inverted index (with stemming), i.e. terms with the list of DocIDs containing them
requirements.txt	Libraries required for running this project

Methodology

The list of all file names is stored in a file.
Each file's data is then preprocessed.
- Most of the files were decoded using the "utf-8" codec, while "unicode_escape" was used for some.
Finally, the inverted index was generated from all the words and cached in a file.
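
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names (`read_text`, `build_index`, `save_index`) and the shape of `doc_tokens` are assumptions, and falling back to "unicode_escape" only when "utf-8" fails is one plausible reading of how the two codecs were combined.

```python
import json
from pathlib import Path


def read_text(path):
    """Decode a file as UTF-8, falling back to unicode_escape on failure."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("unicode_escape")


def build_index(doc_tokens):
    """Map each term to the sorted list of document IDs containing it.

    doc_tokens is a hypothetical dict of doc_id -> iterable of preprocessed terms.
    """
    index = {}
    for doc_id, tokens in doc_tokens.items():
        for term in set(tokens):  # count each term once per document
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def save_index(index, path):
    """Cache the index as JSON so that queries need not rebuild it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)
```

Keeping each posting list sorted is what later allows two lists to be traversed in step when counting comparisons.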

Query

The logical operations on query keywords were performed as follows:

OR -> set union
AND -> set intersection
NOT -> set difference
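
As a small illustration of this mapping, the sketch below applies Python's built-in set operators to two example posting lists. The function names and the DocIDs are made up for the example, and treating NOT as "documents in the left list but not in the right one" is an assumption about how the set difference is applied.

```python
def op_or(left, right):
    """OR: documents that appear in either posting list."""
    return sorted(set(left) | set(right))


def op_and(left, right):
    """AND: documents that appear in both posting lists."""
    return sorted(set(left) & set(right))


def op_not(left, right):
    """NOT: documents in the left list that are absent from the right list."""
    return sorted(set(left) - set(right))


# Hypothetical posting lists for two keywords.
lion = [1, 2, 5]
tiger = [2, 3, 5, 8]

print(op_or(lion, tiger))
print(op_and(lion, tiger))
print(op_not(lion, tiger))
```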

Number of Comparisons

We take the posting lists of the two keywords and count the comparisons made while traversing them, stopping when the end of either list is reached.
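
A minimal sketch of such a traversal is shown below for the AND case. It assumes both posting lists are sorted and counts one comparison per loop step; the function name and this counting convention are assumptions, and the notebook may count differently.

```python
def intersect_with_count(p1, p2):
    """Merge-style intersection of two sorted posting lists.

    Returns the common DocIDs and the number of comparisons made.
    The walk stops as soon as either list is exhausted.
    """
    i = j = comparisons = 0
    result = []
    while i < len(p1) and j < len(p2):
        comparisons += 1  # one comparison of p1[i] against p2[j]
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result, comparisons
```

For example, `intersect_with_count([1, 3, 5, 7], [3, 4, 7, 9])` returns `([3, 7], 5)`.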

For each input query, we first perform preprocessing and then store the extracted keywords in a list. Working from left to right, we apply the operation to two keywords at a time, save the result, and use it in the subsequent computations.
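
The left-to-right evaluation can be sketched as a simple fold over the keywords. The function name `evaluate`, the separate `operators` list, and the in-memory `index` dict are assumptions made for illustration; comparison counting is left out to keep the example short.

```python
def evaluate(index, keywords, operators):
    """Evaluate a Boolean query strictly from left to right.

    index:     dict mapping term -> list of DocIDs (hypothetical shape)
    keywords:  preprocessed query terms, e.g. ["lion", "tiger", "bear"]
    operators: one of "OR", "AND", "NOT" between consecutive keywords
    """
    if len(operators) != len(keywords) - 1:
        raise ValueError("need exactly one operator between each pair of keywords")

    # Start with the postings of the first keyword, then fold in the rest.
    result = set(index.get(keywords[0], []))
    for op, term in zip(operators, keywords[1:]):
        postings = set(index.get(term, []))
        if op == "OR":
            result |= postings
        elif op == "AND":
            result &= postings
        elif op == "NOT":
            result -= postings
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)
```

Because the intermediate result is reused at each step, a query such as `lion OR tiger NOT bear` is treated as `(lion OR tiger) NOT bear`.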

Result

For each input query, we retrieve the total number of relevant documents and the minimum number of comparisons. Both the document name and the associated ID are retrieved.

Preprocessing Steps

Text is converted to lowercase.
Tokenization using nltk.
Removal of stop words using nltk.
All non-alphanumeric (special) characters are removed.
All single-character tokens are removed.
Finally, a set of all the words is created.
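
The pipeline above can be approximated by the sketch below. To keep it runnable without downloading nltk data, it swaps in a regular-expression tokenizer and a tiny hard-coded stop-word list in place of the nltk tokenizer and nltk stop words used by the project. The name `preprocess` and the ASCII-only character class are assumptions.

```python
import re

# Tiny illustrative stop-word list; the project uses nltk's stop words instead.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on"}


def preprocess(text):
    """Return the set of cleaned terms found in a piece of text."""
    text = text.lower()
    # Treat every run of non-alphanumeric characters as a separator.
    # This merges tokenization and special-character removal into one step.
    tokens = re.sub(r"[^a-z0-9]+", " ", text).split()
    # Drop stop words, then drop single-character tokens.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t for t in tokens if len(t) > 1]
    return set(tokens)
```

Applying the same function to both documents and queries keeps the two in the same normalized form, which is what makes query matching case-insensitive.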

Assumptions

The input query is case-insensitive.
We retrieve the results using the unstemmed query keywords. At the demo, results can also be presented for stemmed keywords.

itissandeep98/IR2021_A1_34

Folders and files

Latest commit

History

Repository files navigation

Information Retrieval Assignment 1

Team: Sandeep, Sajeel, Shekhar, Nischal, Devender

Directory Structure

Methodology

Query

Number of Comparisons

Result

Preprocessing Steps

Assumptions

About

Topics

Resources

Stars

Watchers

Forks

Contributors 3

Languages