## Developing an Information Retrieval System with Spelling Correction and Wildcard Queries

This project extends the Information Retrieval (IR) system developed in the first assignment with
spelling correction and wildcard queries. The assignment can be completed independently of Project 1. You can
find the data here.


### 1. Document Preprocessing

Your project will begin by reading and preprocessing a collection of text documents. For this assignment, you only need to treat the dataset as a word list.


In [45]:
import os

import re  # Only for preprocessing

import numpy as np

In [5]:
docs_filenames = os.listdir(path="docs")

In [11]:
def clean_text(text):
    # Lowercasing and removing extra characters.

    text = text.lower()  # Lowercasing

    text = re.sub(r"[^a-z0-9\s\-]", "", text)  # Removing punctuation (raw string avoids invalid-escape warnings)

    text = text.replace("-", " ")  # Replacing dashes with spaces

    return text
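
As a quick illustration of what this cleaning step does to a query, the sketch below repeats the same three steps on two short strings. The input strings are made up for the example. Note that apostrophes are deleted rather than replaced, so "You're" becomes a single token.

```python
import re


def clean_text(text):
    # Same steps as the cell above: lowercase, drop punctuation, split on dashes.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s\-]", "", text)
    return text.replace("-", " ")


# Illustrative inputs (not taken from the dataset).
tokens = clean_text("Hello World. You're wild!").split()
hyphenated = clean_text("A Murder-Suicide").split()

print(tokens)      # ['hello', 'world', 'youre', 'wild']
print(hyphenated)  # ['a', 'murder', 'suicide']
```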

In [12]:
words_set = set()

for filename in docs_filenames:
    with open(f"docs/{filename}") as f:
        text = f.readline()

        text = clean_text(text)

        words_set.update(text.split())

print(len(words_set))

words_set

1348


{'batteries',
 'course',
 'living',
 'death',
 'sports',
 'hed',
 'life',
 'hawaiian',
 'overpaying',
 'saxophone',
 'stored',
 '12',
 '25000',
 'bizarre',
 'might',
 'reconsider',
 'range',
 'removed',
 'making',
 'deals',
 'abuse',
 'nancy',
 'road',
 'caliber',
 'guides',
 'fill',
 'gets',
 'oil',
 'surrounded',
 '11',
 'consideration',
 'who',
 'each',
 'visiting',
 'san',
 'getting',
 'better',
 'furniture',
 'exhibitors',
 'balls',
 'great',
 'car',
 'gun',
 'studio',
 'rest',
 'embrace',
 'convertible',
 'problems',
 'nitrogen',
 'money',
 'reduce',
 'work',
 'silverware',
 'complications',
 'married',
 '10',
 'moved',
 'arcadia',
 'sara',
 'like',
 'nightit',
 'variables',
 'hot',
 'audience',
 'neighbor',
 'taxes',
 'less',
 'causing',
 'knew',
 'this',
 'brown',
 'into',
 'believes',
 'headaches',
 'triple',
 'officer',
 'bang',
 'reason',
 'newest',
 'donna',
 'medical',
 'damage',
 'fire',
 'movie',
 'prisoners',
 'continued',
 'channels',
 'widow',
 'wife',
 'table',
 'spi

### 2. Spelling Correction

You will implement a function for isolated spelling correction. The function corrects each word of an
input query by replacing it with the word in the list that has the smallest Levenshtein distance to it. Because the
word list derived from the data is incomplete, the function will not produce the intended correction for every input query.


In [75]:
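def levenshtein_distance(word1, word2):
    # m[i, j] holds the edit distance between word1[:i] and word2[:j].
    m = np.zeros((len(word1) + 1, len(word2) + 1))

    # Turning a prefix into the empty string takes one deletion per character.
    for i in range(len(word1) + 1):
        m[i, 0] = i

    for j in range(len(word2) + 1):
        m[0, j] = j

    for i in range(1, len(word1) + 1):
        for j in range(1, len(word2) + 1):
            m[i, j] = min(
                m[i - 1, j] + 1,  # deletion
                m[i, j - 1] + 1,  # insertion
                m[i - 1, j - 1] + (0 if word1[i-1] == word2[j-1] else 1),  # match or substitution
            )

    # The matrix uses NumPy's default float dtype, so the result is a float.
    return m[len(word1), len(word2)]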

In [76]:
levenshtein_distance("oslo", "snow")

3.0
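
The same recurrence only ever reads the previous row of the matrix, so it can also be written without NumPy. The sketch below is one possible variant, not the notebook's implementation. The name `levenshtein_two_rows` is hypothetical. It keeps two rows of plain integers and returns an `int`.

```python
def levenshtein_two_rows(word1, word2):
    # previous[j] is the distance between the current prefix of word1 and word2[:j].
    previous = list(range(len(word2) + 1))

    for i, ch1 in enumerate(word1, start=1):
        current = [i]  # distance from word1[:i] to the empty string
        for j, ch2 in enumerate(word2, start=1):
            current.append(
                min(
                    previous[j] + 1,               # deletion
                    current[j - 1] + 1,            # insertion
                    previous[j - 1] + (ch1 != ch2),  # match or substitution
                )
            )
        previous = current

    return previous[-1]


print(levenshtein_two_rows("oslo", "snow"))  # 3
```

Memory use drops from a full matrix to two rows, which matters little for single words but keeps the function free of external dependencies.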

In [77]:
def find_nearest_word(word, words_set):
    # Find nearest word using Levenshtein distance method

    min_distance = float("inf")
    nearest_word = None

    for w in words_set:
        distance = levenshtein_distance(word, w)

        if distance < min_distance:
            min_distance = distance
            nearest_word = w

        if distance == 0:
            break

    return nearest_word

In [78]:
def spell_checking(query, words_set):
    # Correct the query using words_set and Levenshtein distance method

    query = clean_text(query)

    query_words = query.split()

    corrected_words = []

    for word in query_words:
        corrected_word = find_nearest_word(word, words_set)

        corrected_words.append(corrected_word)

    corrected_query = " ".join(corrected_words)

    return corrected_query

In [79]:
spell_checking("Hello World. You're wild!", words_set)

'yellow world your will'

In [80]:
spell_checking("aaa bbb cccc dd", words_set)

'san bob each id'

In [87]:
spell_checking("probably this wont change", words_set)

'probably this wont change'
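
When several words are equally close, `find_nearest_word` keeps whichever one the set iteration yields first, so the correction can differ between runs. The sketch below is one way to make ties visible, under the assumption that returning all candidates is acceptable. The names `edit_distance`, `nearest_words` and `toy_vocabulary` are hypothetical and the vocabulary is made up. It also checks set membership first, so a correctly spelled word is returned without scanning the list.

```python
def edit_distance(a, b):
    # Two-row Levenshtein distance on plain integers.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + (ch_a != ch_b))
            )
        previous = current
    return previous[-1]


def nearest_words(word, vocabulary):
    # Known words need no correction; this is an O(1) set lookup.
    if word in vocabulary:
        return [word]
    if not vocabulary:
        return []

    distances = {w: edit_distance(word, w) for w in vocabulary}
    best = min(distances.values())
    # Sorting makes the result independent of set iteration order.
    return sorted(w for w, d in distances.items() if d == best)


toy_vocabulary = {"yellow", "hello", "world", "wild", "will"}  # made-up word list

print(nearest_words("helo", toy_vocabulary))  # ['hello']
print(nearest_words("wilx", toy_vocabulary))  # ['wild', 'will']
```

A caller could then pick the first candidate, or rank the ties by another signal such as word frequency.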

### 3. Standard Boolean Queries

Users can perform standard Boolean queries using the operators AND, OR,
and NOT to retrieve relevant documents. For this project, queries are limited to <span style="color:orange;">two terms</span>.
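
The cells in this section rely on `inverted_index_dict`, `docs_dict` and `simple_query`, which are defined outside this section. As a rough sketch of the idea only, and assuming an inverted index that maps each term to the set of documents containing it, the three operators reduce to set operations. The documents and the names `toy_docs` and `inverted_index` below are invented for the example.

```python
# Made-up documents; the real notebook reads its documents from disk.
toy_docs = {
    "doc1.txt": "many different books",
    "doc2.txt": "many happy renters",
    "doc3.txt": "a different chase",
}

# Map each term to the set of documents that contain it.
inverted_index = {}
for name, text in toy_docs.items():
    for term in text.split():
        inverted_index.setdefault(term, set()).add(name)


def simple_query(term):
    # Unknown terms match no documents.
    return inverted_index.get(term, set())


both = simple_query("many") & simple_query("different")    # AND
either = simple_query("many") | simple_query("different")  # OR
not_many = set(toy_docs) - simple_query("many")            # NOT

print(both)      # {'doc1.txt'}
print(not_many)  # {'doc3.txt'}
```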


In [6]:
def boolean_or_query(term1, term2):
    result1 = inverted_index_dict.get(term1) or set()
    result2 = inverted_index_dict.get(term2) or set()

    return result1.union(result2)


def boolean_and_query(term1, term2):
    result1 = inverted_index_dict.get(term1) or set()
    result2 = inverted_index_dict.get(term2) or set()

    return result1.intersection(result2)


def boolean_not_query(term):
    all_docs = set(docs_dict.keys())
    inverse_result = inverted_index_dict.get(term) or set()

    return all_docs.difference(inverse_result)

In [8]:
print(simple_query("different"))
print(simple_query("many"))
print(boolean_or_query("different", "many"))
print(boolean_and_query("different", "many"))
print(boolean_not_query("different"))

{'Freeway Chase Ends at Newsstand.txt', 'Happy and Unhappy Renters.txt'}
{'A Festival of Books.txt', 'Happy and Unhappy Renters.txt'}
{'A Festival of Books.txt', 'Freeway Chase Ends at Newsstand.txt', 'Happy and Unhappy Renters.txt'}
{'Happy and Unhappy Renters.txt'}
{'A Murder-Suicide.txt', 'A Festival of Books.txt', 'Rentals at the Oceanside Community.txt', 'Trees Are a Threat.txt', 'Jerry Decided To Buy a Gun.txt', 'Cloning Pets.txt', 'Sara Went Shopping.txt', 'Better To Be Unlucky.txt', 'Man Injured at Fast Food Place.txt', 'Pulling Out Nine Tons of Trash.txt', 'Food Fight Erupted in Prison.txt', 'Crazy Housing Prices.txt', 'Gasoline Prices Hit Record High.txt'}


In [9]:
print(simple_query("a_term_that_doesn't_exist"))
print(boolean_or_query("a_term_that_doesn't_exist", "many"))
print(boolean_and_query("a_term_that_doesn't_exist", "many"))
print(boolean_not_query("a_term_that_doesn't_exist"))
print(boolean_not_query("the"))

set()
{'A Festival of Books.txt', 'Happy and Unhappy Renters.txt'}
set()
{'A Murder-Suicide.txt', 'A Festival of Books.txt', 'Rentals at the Oceanside Community.txt', 'Happy and Unhappy Renters.txt', 'Trees Are a Threat.txt', 'Jerry Decided To Buy a Gun.txt', 'Cloning Pets.txt', 'Sara Went Shopping.txt', 'Better To Be Unlucky.txt', 'Freeway Chase Ends at Newsstand.txt', 'Man Injured at Fast Food Place.txt', 'Pulling Out Nine Tons of Trash.txt', 'Food Fight Erupted in Prison.txt', 'Crazy Housing Prices.txt', 'Gasoline Prices Hit Record High.txt'}
set()


### 4. Proximity Queries

Users can also perform proximity queries, specifying a maximum distance between
two terms in the documents they want to retrieve.


#### <span style="color:blueviolet">Positional Index</span>


In [10]:
positional_index_dict = {}

for doc in docs_dict:
    doc_terms = docs_dict[doc].split(" ")

    for i in range(len(doc_terms)):
        if not positional_index_dict.get(doc_terms[i]):
            positional_index_dict[doc_terms[i]] = {}

        if not positional_index_dict[doc_terms[i]].get(doc):
            positional_index_dict[doc_terms[i]][doc] = []

        positional_index_dict[doc_terms[i]][doc].append(i)

positional_index_dict

{'people': {'A Festival of Books.txt': [0, 65, 165],
  'A Murder-Suicide.txt': [351],
  'Gasoline Prices Hit Record High.txt': [121],
  'Happy and Unhappy Renters.txt': [252],
  'Rentals at the Oceanside Community.txt': [146, 262],
  'Trees Are a Threat.txt': [302]},
 'joke': {'A Festival of Books.txt': [1]},
 'that': {'A Festival of Books.txt': [2, 41],
  'A Murder-Suicide.txt': [240],
  'Better To Be Unlucky.txt': [114, 139, 162],
  'Cloning Pets.txt': [6, 93, 118, 211],
  'Crazy Housing Prices.txt': [72, 105, 156, 191, 271],
  'Food Fight Erupted in Prison.txt': [50],
  'Freeway Chase Ends at Newsstand.txt': [56, 89, 97, 129, 187, 241],
  'Gasoline Prices Hit Record High.txt': [68, 124, 214, 256],
  'Happy and Unhappy Renters.txt': [11, 130, 271],
  'Jerry Decided To Buy a Gun.txt': [135, 158],
  'Man Injured at Fast Food Place.txt': [123, 151],
  'Pulling Out Nine Tons of Trash.txt': [56, 204],
  'Rentals at the Oceanside Community.txt': [227, 263, 306, 329],
  'Trees Are a Threat.

In [11]:
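def find_match_with_distance(list1, list2, distance):
    # True if some position in list1 and some position in list2 have at most
    # `distance` words between them (adjacent terms differ by 1 in position).
    if not list1 or not list2:
        return False

    i, j = 0, 0

    while i < len(list1) and j < len(list2):
        if abs(list1[i] - list2[j]) <= distance + 1:
            return True

        # Both lists are sorted, so advance the pointer at the smaller position.
        if list1[i] < list2[j]:
            i += 1
        else:
            j += 1

    return False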


def proximity_query(term1, term2, distance):
    # Fall back to an empty dict so that .get(doc) below works for unknown terms.
    result1 = positional_index_dict.get(term1) or {}
    result2 = positional_index_dict.get(term2) or {}

    match_list = []

    for doc in result1:
        list1 = result1.get(doc)
        list2 = result2.get(doc)

        is_match = find_match_with_distance(list1, list2, distance)

        if is_match:
            match_list.append(doc)

    return match_list

In [12]:
proximity_query("to", "you", 0)

['Jerry Decided To Buy a Gun.txt']

In [13]:
proximity_query("to", "you", 3)

['Cloning Pets.txt',
 'Crazy Housing Prices.txt',
 'Jerry Decided To Buy a Gun.txt']

In [14]:
proximity_query("to", "you", 300000000)

['A Festival of Books.txt',
 'A Murder-Suicide.txt',
 'Better To Be Unlucky.txt',
 'Cloning Pets.txt',
 'Crazy Housing Prices.txt',
 'Jerry Decided To Buy a Gun.txt']

In [15]:
proximity_query("a_term_that_doesn't_exist", "you", 3)

[]
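
To make the distance convention concrete, the sketch below runs a proximity search on two made-up documents. Here a distance of 0 means the two terms are adjacent, and a distance of `k` allows up to `k` words between them. This is a self-contained illustration, not the notebook's code. The names `toy_docs`, `positional_index`, `within_distance` and `proximity` are hypothetical, and the merge step advances whichever pointer sits at the smaller position.

```python
from collections import defaultdict

# Made-up documents for illustration.
toy_docs = {
    "doc1.txt": "i want to see you today",
    "doc2.txt": "you said it is hard to leave",
}

# term -> document -> sorted list of word positions
positional_index = defaultdict(lambda: defaultdict(list))
for name, text in toy_docs.items():
    for position, term in enumerate(text.split()):
        positional_index[term][name].append(position)


def within_distance(positions1, positions2, distance):
    # Merge-style scan over two sorted position lists.
    i = j = 0
    while i < len(positions1) and j < len(positions2):
        if abs(positions1[i] - positions2[j]) <= distance + 1:
            return True
        if positions1[i] < positions2[j]:
            i += 1
        else:
            j += 1
    return False


def proximity(term1, term2, distance):
    postings1 = positional_index.get(term1, {})
    postings2 = positional_index.get(term2, {})
    shared = sorted(set(postings1) & set(postings2))
    return [d for d in shared if within_distance(postings1[d], postings2[d], distance)]


print(proximity("to", "you", 0))  # []
print(proximity("to", "you", 1))  # ['doc1.txt']
print(proximity("to", "you", 4))  # ['doc1.txt', 'doc2.txt']
```

In `doc1.txt` one word ("see") separates "to" and "you", so the document first matches at distance 1. In `doc2.txt` four words separate them.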