MAZE Search Engine

The MAZE search engine is a search tool built on a standard dump of Wikipedia articles. The corpus is approximately 40 GB in size. The search engine supports two kinds of queries: normal queries and field queries.

Setup and Execution

  1. Clone the repository
git clone https://github.com/mallika2011/Maze-Search-Engine.git
  2. Create a virtual environment and install the requirements
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cd phase2
  3. Run the indexer code on the wiki dump. It takes three arguments.
    The first argument is the path to the folder containing the wiki dump XML, the second is the path to the folder where the inverted index will be created, and the third is a text file that will store statistics about the dump before and after indexing.
./index.sh <path to wiki dump XML folder> <path to inverted index output folder> <path to statistics .txt file>
  4. Run the splitting code.
    This splits the index into multiple smaller index files, for better modularity.
python3 split.py
  5. Run the search script.
python3 search.py <path to the queries.txt file>

Indexing Technique

In order to index the 40 GB corpus of data, the following steps are carried out:

  • Parsing and Chunkwise blocking:
    An XML parser is used to parse the XML files containing the wiki dump. For every K pages read by the parser (pages here are wiki articles), the text, title, and ID of each page are stored in memory. Once K pages have been read, the data in memory is processed and written out as a mini index file (see the first sketch after this list).

  • Splitting Components
    To support field queries, the different components of a wiki article are split out using regex and text-matching techniques (see the second sketch after this list).

  • Text Processing
    All text is processed by case folding (lowercasing), tokenisation, stop-word removal, and stemming. To reduce stemming time, a JSON file keeps track of the stems encountered so far, so that repeated words are stemmed with an O(1) lookup (see the third sketch after this list).

  • Merging mini index files
    Once the entire dump has been parsed, a number of mini index files will have been created (999 in this case). These are then merged into a single index0.txt file using a conventional merge-sort-style k-way merge (see the fourth sketch after this list).

  • Splitting merged index alphabetically
    The merged index is finally broken down into smaller index files again, this time with one file per letter of the alphabet and one for all other characters, giving a total of 27 index files.

  • Convention followed in the index
    Each posting list records the documents containing the term and the associated frequencies for the different fields of the wiki article. An example posting list looks like: schoolbag:d40187f2b1r1d55615f1b1d13913f1b1d16027f1b1d22648f2b2 (the last sketch after this list shows how such a line can be decoded).

Searching Mechanism

The search script produces results for the top K documents. Every query has the following format: "K, <query_text>", where the query text can contain standard or field queries.
Example of a field query: "5, t:sachin"

The query text is first put through the same text processing used during index creation. The tokens are then segregated to split out the field queries (if any). The posting lists of the query words are then retrieved from the index files of the corresponding letters.
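A minimal sketch of splitting a query line into K, plain terms, and field terms; only the t: (title) prefix appears in the example above, so other single-letter prefixes are handled generically here, and each field term is assumed to be a single token:

import re

def parse_query(line):
    k_part, query_text = line.split(",", 1)
    k = int(k_part.strip())
    plain_terms, field_terms = [], []
    for token in query_text.strip().split():
        match = re.match(r"^([a-z]):(.+)$", token)
        if match:
            field_terms.append((match.group(1), match.group(2)))   # e.g. ('t', 'sachin')
        else:
            plain_terms.append(token)
    return k, plain_terms, field_terms

print(parse_query("5, t:sachin"))     # (5, [], [('t', 'sachin')])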

Once the posting lists have been retrieved, each document in a posting list is assigned a score. The score is computed using the TF-IDF principle. The weight of a word i in a document j, i.e. w_{i,j}, is:

w_{i,j} = tf_{i,j} · log10(N / d_i)

where N is the total number of documents and d_i is the number of documents that contain word i.

Further, to improve the relevance of the search, instead of using tf_{i,j} directly as the term frequency, 1 + log10(tf_{i,j}) is used.

This dampening avoids rewarding any kind of term spamming and also keeps the weights uniform, since a second occurrence of a word does not weigh more than the first.

Furthermore, to give field-query results higher priority, an additional weight of k = 100 is assigned to the scores of such documents.
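A minimal sketch of the scoring step as described: log-dampened term frequency times inverse document frequency, with the field-query boost applied on top. The shape of the postings dict, the value of N, and the choice to apply the k = 100 boost additively are illustrative assumptions:

from math import log10

FIELD_BOOST = 100                     # extra weight for documents matched through a field query

def score_documents(query_terms, postings, N, field_matched=frozenset()):
    # postings: {term: {doc_id: term_frequency}}; returns {doc_id: score}.
    scores = {}
    for term in query_terms:
        docs = postings.get(term, {})
        if not docs:
            continue
        idf = log10(N / len(docs))                 # d_i = number of documents containing the term
        for doc_id, tf in docs.items():
            weight = (1 + log10(tf)) * idf         # dampened tf keeps repeats from dominating
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    for doc_id in field_matched:                   # prioritise field-query matches
        if doc_id in scores:
            scores[doc_id] += FIELD_BOOST
    return scores

# Ranking the top K documents:
# top_k = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:K]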
