## BOOLEAN MODEL 

This code provides a simple implementation of a Boolean model for information retrieval.


It answers queries written as Boolean expressions with the operators AND and OR. A query term may contain one wildcard (*) in any position.
All document terms are normalized and stemmed using PorterStemmer from the nltk library. Query terms are normalized in answer_query, so an unstemmed query term that is missing from the index falls back to spelling correction.
I implemented a k-gram index, which is used for spelling correction.





Importing libraries

In [3]:
import numpy as np
from functools import total_ordering, reduce
import csv
import pandas as pd
import re
import json
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
#stopwords = set(stopwords.words('english'))

Function to save the inverted index as a JSON file in the same directory as the code.

In [4]:
def save_inverted_index(inverted_index, file_path): # save the inverted index in a json file
    with open(file_path, 'w') as file:
        json.dump(inverted_index, file)

### INVERTED INDEX

Here is the code for the inverted index. It is a dictionary where the keys are the terms and the values are the list of books in which the term appears. The documents are represented by their id.

I tried removing stopwords to reduce the time and space costs, but the query results got worse, so I decided to keep them.
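
As a minimal sketch of the data structure described above, the snippet below builds an inverted index over a tiny hypothetical corpus. The names `toy_docs` and `build_toy_index` are illustrative assumptions, not part of the notebook, and stemming is left out to keep the example short.

```python
# Toy corpus: the position in the list is the document id (hypothetical data)
toy_docs = [
    "the hobbit finds a ring",
    "a wizard visits the hobbit",
    "the ring is destroyed",
]

def build_toy_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = {}
    for doc_id, text in enumerate(docs):
        # set() keeps one entry per term and document, like stemm() in the notebook
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

toy_index = build_toy_index(toy_docs)
print(toy_index["hobbit"])  # [0, 1]
print(toy_index["ring"])    # [0, 2]
```

Because documents are visited in increasing id order, every posting list comes out sorted without an extra sorting step.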

In [5]:
## INVERTED INDEX
def normalize(text):
    no_punctuation = re.sub(r'[^\w\s*-]','',text) # remove punctuation, keeping '*' (wildcard) and '-'
    downcase = no_punctuation.lower() # lowercase
    return downcase

def tokenize(book):
    text = normalize(book.description) # normalize books description
    return list(text.split()) # return a list of tokens

def stemm(book):
    ps = PorterStemmer() # stemmer
    text = tokenize(book) # tokenize
    return list(set([ps.stem(word) for word in text])) 

    # I tried to remove stopwords but it didn't work well, because it removed too many words and the results were not good, this was the code:
    #return list(set([ps.stem(word) for word in text if not word in stopwords])) # return a list of stemmed tokens that are not stopwords

# build an inverted index for the given documents with tokenization, normalization and stemming (stopwords are kept)
def build_inverted_index(documents):
    inverted_index = {}
    print("Building inverted index...")
    for doc_id, doc in enumerate(documents): # for each document
        for token in stemm(doc): # for each token in the document
            if token not in inverted_index: # if the token is not in the inverted index we add it
                inverted_index[token] = []
            if doc_id not in inverted_index[token]: # if the document is not in the inverted index we add it to the list of documents for that token
                inverted_index[token].append(doc_id)    
        if doc_id % 1000 == 0: # progress report every 1000 documents
            print("ID: " + str(doc_id))
    return inverted_index



### BOOKS DESCRIPTION

Next, I define a class that represents a book, with its title and description as attributes.

I then build the corpus: one BooksDescription object per book in the dataset, holding its title and description. The corpus is later used to build the inverted index.


In [6]:
## BOOK CLASS
class BooksDescription:
    def __init__(self, title, description):
        self.title = title
        self.description = description
        
    def __repr__(self):
        return self.title
    
def books_title():
    filename = 'booksummaries.txt' # open file
    #with open(filename, 'r') as csv_file: 
    #    books_name = csv.reader(csv_file, delimiter='\t')
    #    names_table = {}
    #    for name in books_name:
    #        names_table[name[0]] = name[2] # create a dictionary with book id as key and book title as value
           
    with open(filename, 'r') as csv_file:
        descriptions = csv.reader(csv_file, delimiter='\t') 
        corpus = []
        # num_lines = sum(1 for _ in descriptions) # count number of lines in the file
        csv_file.seek(0)  # reset the reader
        for i,desc in enumerate(descriptions):
            try: 
                book = BooksDescription(desc[2], desc[6]) # create a BooksDescription object with title and description
                corpus.append(book) # append the object to the corpus
            except IndexError: # the row has fewer than 7 columns (a list raises IndexError, not KeyError)
                pass
            
            #if i >= num_lines // 2: # uncomment to use only half of the corpus (8000 books instead of 16000)
            #   break
        return corpus

#### CORPUS and INVERTED INDEX

Here I create the inverted index. It is built only if it does not already exist: the code first looks for the JSON file in the directory and, if the file is missing, builds the inverted index and saves it as a JSON file.

In [7]:
corpus = books_title() # create the corpus
try:
    with open('inverted_index.json') as file: # load the inverted index from the json file
        inv_index = json.load(file)
except FileNotFoundError: # if the file is not found
    inv_index = build_inverted_index(corpus) # create the inverted index
    save_inverted_index(inv_index, 'inverted_index.json') # save the inverted index in a json file


Here is an example of the usage of the inverted index. 

I print the ids of the books that contain the word 'hobbit', and then their titles, looked up in the corpus.


In [9]:
print(inv_index['hobbit'])

for i in inv_index['hobbit']:
    print(corpus[i])

[78, 85, 248, 249, 3130, 6672, 9586, 9933, 15275]
The Lord of the Rings
The Hobbit
The Two Towers
The Return of the King
The Woods Out Back
The Keeper of the Isis Light
The Armageddon Rag
The Lord of the Rings
The Fellowship of the Ring


### N-gram inverted index

This function creates the n-gram inverted index. It is a dictionary where the keys are the n-grams and the values are the list of terms that contain that n-gram.

The n-gram inverted index will be used for spelling correction.
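
To make the structure concrete, here is a small sketch of how a term is split into padded 3-grams and how those 3-grams map back to terms. The helper `padded_ngrams` and the three sample terms are assumptions made for illustration; the padding symbol `$` is the one used in this notebook.

```python
def padded_ngrams(term, n=3):
    """Return the n-grams of a term after adding '$' at both ends."""
    padded = "$" + term + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(padded_ngrams("hobbit"))
# ['$ho', 'hob', 'obb', 'bbi', 'bit', 'it$']

# n-gram -> list of terms containing it (hypothetical three-term vocabulary)
toy_ngram_index = {}
for term in ["hobbit", "habit", "rabbit"]:
    for gram in padded_ngrams(term):
        terms = toy_ngram_index.setdefault(gram, [])
        if term not in terms:
            terms.append(term)

print(toy_ngram_index["bit"])  # ['hobbit', 'habit', 'rabbit']
print(toy_ngram_index["bbi"])  # ['hobbit', 'rabbit']
```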


In [10]:
def build_ngram_inverted_index(documents, n): # takes the documents and the n-gram size as input
    inverted_index = {}
    print("Building ngram inverted index...")
    for doc_id, doc in enumerate(documents): # for each document
        for token in stemm(doc): # for each token in the document
            wild_token = "$" + token + "$" # add initial and final symbol
            for i in range(len(wild_token) - n + 1): # for each ngram in the token
                ngram = wild_token[i:i+n]  # extract the n-gram
                if ngram not in inverted_index:
                    inverted_index[ngram] = [] # if the ngram is not in the inverted index we add it
                if token not in inverted_index[ngram]:
                    inverted_index[ngram].append(token) # if the token is not in the inverted index we add it to the list of tokens for that ngram    
        if doc_id % 1000 == 0: # progress report every 1000 documents
            print("ID: " + str(doc_id))
    return inverted_index


In [11]:
corpus = books_title()
try:
    with open('ngram_inverted_index.json') as file: # load the inverted index from the json file
        new_inv_index = json.load(file)
except FileNotFoundError:
    new_inv_index = build_ngram_inverted_index(corpus, 3) # create the inverted index with 3-grams
    save_inverted_index(new_inv_index, 'ngram_inverted_index.json') # save the inverted index in a json file


### IR MODEL CLASS

Here is the class that represents the IR model. It has the corpus and the two inverted indexes as attributes.

It provides the methods that answer queries, with or without a wildcard, and that correct query terms containing a spelling error.
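
The wildcard handling can be pictured with the following sketch, which filters a vocabulary by the prefix and suffix around a single `*`. The function `match_wildcard` and the toy vocabulary are hypothetical. Unlike the class in this notebook, the sketch also checks that the term is long enough for the prefix and suffix not to overlap, which is an added assumption.

```python
def match_wildcard(pattern, vocabulary):
    """Return the sorted terms that match a pattern containing one '*'."""
    prefix, _, suffix = pattern.partition("*")
    return sorted(
        term for term in vocabulary
        if len(term) >= len(prefix) + len(suffix)  # prefix and suffix must not overlap
        and term.startswith(prefix)
        and term.endswith(suffix)
    )

toy_vocabulary = {"hobbit", "hobbi", "habit", "hobo", "rabbit"}
print(match_wildcard("hob*t", toy_vocabulary))  # ['hobbit']
print(match_wildcard("hob*", toy_vocabulary))   # ['hobbi', 'hobbit', 'hobo']
print(match_wildcard("*bit", toy_vocabulary))   # ['habit', 'hobbit', 'rabbit']
```

The books returned for a wildcard query are then those whose postings belong to any of the matching terms.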

It also checks whether the query contains the AND or OR operator, returning the intersection of the postings in the first case and their union in the second.
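
The two operators reduce to set operations on posting lists, as in the sketch below. The posting lists in `toy_postings` are made-up document ids, and the helpers `boolean_and` and `boolean_or` are illustrative; the class in this notebook applies `np.intersect1d` and `np.union1d` to lists of titles instead.

```python
# Hypothetical posting lists: term -> ids of the documents containing it
toy_postings = {
    "potter": [3, 7, 9],
    "voldemort": [7, 9, 12],
}

def boolean_and(postings_a, postings_b):
    """Documents that contain both terms."""
    return sorted(set(postings_a) & set(postings_b))

def boolean_or(postings_a, postings_b):
    """Documents that contain at least one of the terms."""
    return sorted(set(postings_a) | set(postings_b))

print(boolean_and(toy_postings["potter"], toy_postings["voldemort"]))  # [7, 9]
print(boolean_or(toy_postings["potter"], toy_postings["voldemort"]))   # [3, 7, 9, 12]
```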

In [13]:

# IR MODEL
class IR_model:
    def __init__(self,corpus, index, ngram_index):
        self.corpus = corpus
        self._index = index
        self._ngram_index = ngram_index
        
    def answer_query(self, query):
        words = [normalize(word) for word in query.split()] # normalize (lowercase, strip punctuation) every query word

        # the operator, if present, is the second word; any query without 'or' is treated as a conjunction
        use_or = len(words) > 1 and words[1] == 'or'
        if len(words) > 1 and words[1] in ('and', 'or'):
            del words[1] # remove the operator from the list of words

        postings = []
        for word in words:
            try:
                res = [self.corpus[i].title for i in self._index[word]] # get the list of books for the word
            except KeyError:
                if '*' in word: # if the word contains a wildcard
                    res = self.wilcard_query(word) # call the wilcard_query function
                else:
                    sub = find_nearest(word, self._ngram_index) # find the nearest word using jaccard similarity
                    print("{} not found. Did you mean {}?".format(word, sub))
                    res = [self.corpus[i].title for i in self._index[sub]] # get the list of books for the nearest word

            postings.append(res) # append the list of books to the posting list

        # union for 'or', intersection for 'and' and for queries with a single word
        combine = np.union1d if use_or else np.intersect1d
        results = set(reduce(combine, postings))
        if len(results) == 0: # if the combined postings are empty
            print('No results found.')
        else:
            return results
        
    def wilcard_query(self, word):
        sub1 = []
        sub2 = []
        words = word.split('*') # split the word in two parts
        
        first_word = words[0]
        last_word = words[-1]

        if first_word != '': # if the wildcard is not at the beginning of the word
            for word in self._index:
                if word.startswith(first_word):
                    sub1.append(word) # append to sub1 all words that start with the first word

        elif first_word == '': # if the wildcard is at the beginning of the word
            res = []
            for word in self._index:
                if word.endswith(last_word): # if the word ends with the last word
                    res = res + [self.corpus[i].title for i in self._index[word]] # add the titles of the books containing a term that ends with last_word
            
        if last_word != '': # if the wildcard is not at the end of the word
            for word in self._index:
                if word.endswith(last_word):
                    sub2.append(word) # append to sub 2 all words that end with the last word

        elif last_word == '': # if the wildcard is at the end of the word
            res = []
            for word in self._index:
                if word.startswith(first_word):
                    res = res + [self.corpus[i].title for i in self._index[word]]

        if first_word != '' and last_word != '': # if the wildcard is in the middle of the word
            sub = [sub1, sub2] # pair the prefix matches with the suffix matches
            ss = set(reduce(np.intersect1d, sub)) # keep the terms that match both the prefix and the suffix
            res = []
            for s in ss:
                res = res + [self.corpus[i].title for i in self._index[s]] # append to res the list of books for each word in the intersection
        return res

### SPELLING CORRECTION

Here is the code for the spelling correction. It takes as input a word and the k-gram inverted index, and first collects all the words that contain at least one of the n-grams of the input word.

Then, for each of these words, it computes the Jaccard similarity with the input word: the number of n-grams the two words share, divided by the number of distinct n-grams found in either word (intersection over union).

Then it returns the word with the highest Jaccard similarity.
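
As a worked example of the score, the sketch below compares the misspelled term "obbit" with two candidates. The helpers `padded_trigrams` and `jaccard` are illustrative names, not the functions defined in this notebook, and the candidates are chosen by hand.

```python
def padded_trigrams(term):
    """Set of 3-grams of a term padded with '$' at both ends."""
    padded = "$" + term + "$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

query = padded_trigrams("obbit")  # {'$ob', 'obb', 'bbi', 'bit', 'it$'}

# 'hobbit' shares 4 trigrams with the query out of 7 distinct ones: 4/7
print(jaccard(query, padded_trigrams("hobbit")))
# 'habit' shares only 'bit' and 'it$' out of 8 distinct ones: 2/8
print(jaccard(query, padded_trigrams("habit")))  # 0.25
```

With these two candidates, "hobbit" has the higher score, so it would be the suggested correction.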

In [14]:
# SPELLING CORRECTION using Jaccard similarity
def ngrams(word, n):
    return [word[i:i+n] for i in range(len(word)-n+1)]

def find_nearest(word, index):    
    # Get the list of k-grams for the input word
    word_ngrams = ngrams("$" + word + "$", 3)
    # Build a set of all words that have any of these k-grams
    words_with_kgrams = set()
    for ngram in word_ngrams:
        try: # check if ngram is in inverted index, if not, pass
            words_with_kgrams.update(index[ngram]) 
        except KeyError:
            pass
        
    # Compute the Jaccard similarity coefficient for each candidate word,
    # and take the one that maximizes it
    word_set = set(word_ngrams) # k-grams of the input word, computed once
    scores = []
    for w in words_with_kgrams: # for each word in the set of words with k-grams
        w_set = set(ngrams("$" + w + "$", 3)) # get the set of k-grams for the word
        scores.append((w, len(word_set & w_set) / len(word_set | w_set))) # Jaccard similarity coefficient
    return max(scores, key=lambda x: x[1])[0] # return the word with the highest Jaccard similarity coefficient


In [38]:
ir = IR_model(corpus, inv_index, new_inv_index)
ir.answer_query("obbit gollum")

obbit not found. Did you mean hobbit?


{'The Fellowship of the Ring',
 'The Hobbit',
 'The Lord of the Rings',
 'The Return of the King',
 'The Two Towers'}

In [26]:
ir = IR_model(corpus, inv_index, new_inv_index)
print('First query: ')
ir.answer_query("potter and voldemort")

First query: 


{'Harry Potter and the Chamber of Secrets',
 'Harry Potter and the Goblet of Fire',
 'Harry Potter and the Half-Blood Prince',
 'Harry Potter and the Order of the Phoenix',
 'Harry Potter and the Prisoner of Azkaban'}

In [37]:
print('Second query: ')
ir.answer_query("azkaban")

Second query: 


{'Fantastic Beasts and Where to Find Them',
 'Harry Potter and the Chamber of Secrets'}

In [33]:
print('Third query: ')
ir.answer_query("harry not potter")

Third query: 
harry not found. Did you mean harri?


{'Fantastic Beasts and Where to Find Them',
 'Harry Potter and the Chamber of Secrets',
 'Harry Potter and the Goblet of Fire',
 'Harry Potter and the Half-Blood Prince',
 'Harry Potter and the Order of the Phoenix',
 'Harry Potter and the Prisoner of Azkaban',
 'Operation Chaos',
 'Point Blanc',
 'Quidditch Through the Ages',
 'Wolves of the Calla'}

In [34]:
print('Fourth query:')
ir.answer_query("hobb*t")

Fourth query: 


{'Gradisil',
 'Permutation City',
 'The Armageddon Rag',
 'The Fellowship of the Ring',
 'The Hobbit',
 'The Keeper of the Isis Light',
 'The Lord of the Rings',
 'The Return of the King',
 'The Two Towers',
 'The Woods Out Back'}

In [36]:
print('Fifth query: ')
ir.answer_query("hobb*")

Fifth query: 


{'99 Coffins',
 'A Beautiful Blue Death',
 'A Kestrel for a Knave',
 'American Born Chinese',
 'An Enquiry Concerning the Principles of Morals',
 'Arabian Jazz',
 "Assassin's Quest",
 'Avatar',
 'Beatles',
 'Billy the Kid',
 'Boy Meets Boy',
 'China Marine',
 "Darwin's Dangerous Idea",
 'Dexter by Design',
 'Dexter is Delicious',
 'Dragon Age: The Stolen Throne',
 'Duma Key',
 'Farewell to Manzanar',
 "Friday's Child",
 'Fungus the Bogeyman',
 'Go',
 'Goliath',
 'Gradisil',
 'Hallam Foe',
 'Harlequin',
 'Hatyapuri',
 'House Rules',
 'I Am David',
 'In a Dry Season',
 'Incidents in the Life of a Slave Girl',
 'Isle of the Dead',
 "King Solomon's Carpet",
 'Little Lord Fauntleroy',
 'Living Dead Girl',
 'London Fields',
 'Michael Vey: The Prisoner of Cell 25',
 'Mystery of the Whale Tattoo',
 'No Highway',
 'Our Southern Highlanders',
 'Pacific Vortex!',
 'Palace Walk',
 'Pattern Recognition',
 'Permutation City',
 "Pudd'nhead Wilson",
 'Revenge in the Silent Tomb',
 'Sasquatch',
 'Six R