Create a fully functional Information Retrieval System, including Indexing, Interactive Query Mode, Ranking and Evaluation.
Start the system by typing
python airs-1.0.0.py [OPTIONS]
If you don't specify any options, the system performs an empty run without indexing any documents. Use -rand or -rank to specify the number of documents you want to index and evaluate on.
You may specify several flags to customize indexing and evaluation:
python airs-1.0.0.py [-c PATH] [-rand N] [-rank N] [-s] [-p] [-e SYSTEM] [-i] [--version] [-h]
-c PATH
Specify the path to the corpus files. The default is ./amazon_reviews.
-rand N
Specify the number of randomly selected documents used to build the inverted index.
-rank N
Specify the number of documents to be ranked.
The selection starts at the document with index 0, and the specified number is the exclusive upper bound. For example, to index and rank the first 2000 documents, pass -rank 2000.
-s
Set this flag to store the inverted index in a pickle file.
-p
Set this flag to read the inverted index from a stored pickle file. This only works if an inverted index pickle file already exists.
-e SYSTEM
Specify which system to evaluate: -e bool evaluates the boolean system, -e tfidf the TF-IDF system, and -e prox the proximity system. The default is -e tfidf.
-i
Set this flag to start an interactive session. It can be combined with all other flags.
In an interactive session, the -e flag does not trigger an evaluation; instead it selects the system used to rank the query results. The default is tfidf. -e bool returns all matching documents, while -e tfidf and -e prox return the 10 highest-ranked documents.
You are asked to enter a query, and the result is displayed in the terminal.
--version
Show the version number.
-h
Show information about the flags and their usage.
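As an illustration of how the usage line above could be parsed, here is a minimal argparse sketch. It is not taken from airs-1.0.0.py: the dest names (corpus, store, load, system, interactive) and the version string are assumptions, and the mapping of -s to storing and -p to loading assumes the flags are described in the same order as they appear in the usage line.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented usage line (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="airs-1.0.0.py")
    parser.add_argument("-c", metavar="PATH", dest="corpus",
                        default="./amazon_reviews",
                        help="path to the corpus files")
    parser.add_argument("-rand", metavar="N", dest="rand", type=int, default=None,
                        help="number of randomly selected documents to index")
    parser.add_argument("-rank", metavar="N", dest="rank", type=int, default=None,
                        help="rank documents 0 .. N-1 (N is exclusive)")
    parser.add_argument("-s", dest="store", action="store_true",
                        help="store the inverted index in a pickle file")
    parser.add_argument("-p", dest="load", action="store_true",
                        help="read the inverted index from a pickle file")
    parser.add_argument("-e", metavar="SYSTEM", dest="system",
                        choices=["bool", "tfidf", "prox"], default="tfidf",
                        help="system to evaluate or to rank with")
    parser.add_argument("-i", dest="interactive", action="store_true",
                        help="start an interactive query session")
    # The version string is assumed from the file name.
    parser.add_argument("--version", action="version", version="%(prog)s 1.0.0")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this sketch, running without options leaves both rand and rank as None, which corresponds to the empty run described above.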
The database used is a collection of 10,000 Amazon reviews, stored in /amazon_reviews.
Main class, which runs the system and combines the different modules of the code.
Reads in the source files and provides their content as a string.
A simple tokenizer that gets a document as input and tokenizes it.
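A simple tokenizer of this kind might look like the following sketch. The function name tokenize, the lowercasing, and the regular expression are assumptions chosen for illustration; the project's tokenizer may split text differently.

```python
import re

# Assumed token pattern: runs of letters/digits, optionally followed by
# an apostrophe suffix such as "'t" or "'s".
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the document text and return its tokens in order."""
    return TOKEN_PATTERN.findall(text.lower())
```

For example, tokenize("Great product, didn't break!") yields the tokens great, product, didn't and break.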
Class to represent a document, storing tokens, document length etc.
Class for representing a token, containing string, string length etc.
Class for representing a term, including pointers to the postings lists, term normalization, and efficient on-disk storage of terms using folder structures.
Class for storing the postings lists for each term.
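To make the idea of postings lists concrete, the sketch below builds a positional inverted index from tokenized documents. The function name build_index and the nested-dictionary layout are assumptions, not the project's data structure; positions are kept because a proximity ranking would plausibly need them.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]} for the given tokenized docs.

    docs is a dict of doc_id -> list of tokens (hypothetical input shape).
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens):
            # Append the position to this document's posting for the term.
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)
```

The document frequency of a term is then simply the number of keys in its posting, for example len(index["phone"]).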
Class for performing user queries, including query parsing and processing.
Class for ranking based on cosine similarity or proximity. The class provides a sorted list of output documents for a query, based on the cosine ranking or the proximity ranking.
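The cosine-based ranking can be sketched as follows. This is one common TF-IDF variant (logarithmic term frequency, idf = log10(N / df)) and not necessarily the weighting used by the project; the function names and the top_k parameter are assumptions, with the default of 10 taken from the interactive mode described above.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Return {term: weight} using log-scaled tf times idf."""
    counts = Counter(tokens)
    return {t: (1 + math.log10(c)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, docs, top_k=10):
    """Return up to top_k doc ids sorted by descending cosine similarity."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {t: math.log10(n / d) for t, d in df.items()}
    query_vec = tfidf_vector(query_tokens, idf)
    scored = [(cosine(query_vec, tfidf_vector(tokens, idf)), doc_id)
              for doc_id, tokens in docs.items()]
    # Drop documents that share no weighted term with the query.
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]
```

A real implementation would read the weights from the inverted index instead of recomputing document vectors for every query; the sketch recomputes them only to stay self-contained.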
Class containing functions for the evaluation process.
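The README does not state which measures the evaluation computes. As a hedged illustration, the sketch below compares a retrieved result list with a set of gold-standard relevant documents using precision, recall and F1; the function name and the choice of measures are assumptions.

```python
def precision_recall_f1(retrieved, relevant):
    """Compare retrieved doc ids with gold-standard relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For instance, retrieving documents 1, 2, 3 and 4 when documents 2, 4 and 5 are relevant gives a precision of 0.5 and a recall of 2/3.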
A free Porter stemmer, taken from https://bitbucket.org/mchaput/stemming/src/5c242aa592a6d4f0e9a0b2e1afdca4fd757b8e8a/stemming/porter.py?at=default&fileviewer=file-view-default
Classes not listed here are currently unused or reserved for future applications.
The gold data is stored in the file golddata.txt.
The output for our index and a sample query is stored in output.txt.