### Boolean Retrieval System

In [1]:
import pickle
from pathlib import Path

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


#### Initializing PorterStemmer Object for normalization of query terms

In [2]:
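# Porter stemmer used to normalize query terms
stemmer = PorterStemmer()
# English stop words that are dropped from queries
stop_words = set(stopwords.words('english'))
# Project root, relative to the current working directory
root = Path("../")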
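To make the normalization order concrete, the sketch below applies the same steps to a sample query: lower-case each token, drop stop words and purely numeric tokens, then stem. It is only an illustration. `TOY_STOP_WORDS`, `toy_stem`, and `normalize_query` are hypothetical stand-ins so the example runs without NLTK data; the notebook itself relies on `stopwords.words('english')` and `PorterStemmer`.

```python
# Hypothetical stand-ins for the NLTK stop-word list and the Porter stemmer.
TOY_STOP_WORDS = {"the", "of", "and", "a", "in"}


def toy_stem(word):
    """Crude suffix stripper; NOT the Porter algorithm, illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize_query(query, stop_words=TOY_STOP_WORDS, stem=toy_stem):
    """Lower-case, filter stop words and numbers, then stem each token."""
    terms = []
    for token in query.split():  # whitespace split stands in for word_tokenize
        token = token.lower()
        if token in stop_words or token.isnumeric():
            continue
        terms.append(stem(token))
    return terms


print(normalize_query("The Compensation of 2016 Claims"))
```

Passing the stop-word set and the stemmer as parameters keeps the filtering logic testable on its own; swapping in the real NLTK objects should not require changing the loop.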

#### Loading the inverted_index, transformed documents and file name binary files

In [3]:
# Inverted index: normalized term -> posting list
with open(root / "Pickled_files" / "Inverted_index", 'rb') as dbfile:
    inverted_idx = pickle.load(dbfile)

# Transformed documents
with open(root / "Pickled_files" / "Documents", 'rb') as dbfile:
    documents = pickle.load(dbfile)

# Paths of the original files
with open(root / "Pickled_files" / "Files", 'rb') as dbfile:
    files_list = pickle.load(dbfile)
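The code that built these pickled files is not shown here. Judging only from how `boolean_retrieval` indexes each posting, the inverted index appears to map a stemmed term to a list of tuples whose first three fields are a document id, a file index, and a term frequency. Under that assumption, a minimal round trip could look like the sketch below; `toy_index` and the temporary path are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Assumed layout: stemmed term -> list of (doc_id, file_index, term_frequency).
toy_index = {
    "polici": [(0, 0, 3), (1, 0, 1)],
    "insur": [(1, 0, 2)],
}

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "Inverted_index"

    # Write the toy index to disk in binary mode.
    with open(index_path, "wb") as fh:
        pickle.dump(toy_index, fh)

    # Read it back; the context manager closes the file even if loading fails.
    with open(index_path, "rb") as fh:
        loaded_index = pickle.load(fh)

print(loaded_index["polici"])
```

Note that `pickle.load` can execute arbitrary code while unpickling, so it should only be used on files from a trusted source.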

#### Query
For Boolean retrieval we have created a function **boolean_retrieval** that takes two arguments: the **query** string and the **type** of retrieval. If **type** is True, the query is treated as an **AND** query; if False, it is treated as an **OR** query. **query** can be a multiword string. First we tokenize the **query** and convert the tokens to lower case. Stop words and purely numeric tokens are discarded, and the remaining tokens are normalized with Porter stemming.

The posting list for each normalized term is retrieved from the inverted index, and the resulting document sets are combined by intersection (**AND**) or union (**OR**) according to the **type** of the query. While processing the posting lists we also accumulate, per document, the total frequency of all query terms. Once the final set of documents is known, we sort it by these frequencies in descending order and return the top 3 results, each paired with the original file the document came from.
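The set operations and the frequency-based ranking can be tried out on a hand-built index. The sketch below is a simplified illustration, not the notebook's function: `toy_index`, `toy_boolean_retrieval`, and the `is_and` and `top_k` parameters are hypothetical, and each posting is reduced to a `(doc_id, term_frequency)` pair.

```python
from collections import Counter

# Toy index with a simplified, assumed layout: term -> [(doc_id, term_frequency)].
toy_index = {
    "drug": [(0, 2), (1, 1)],
    "compens": [(1, 4), (2, 1)],
}


def toy_boolean_retrieval(terms, is_and, index=toy_index, top_k=3):
    """Return up to top_k doc ids ranked by summed term frequency."""
    scores = Counter()   # doc_id -> cumulative frequency over all query terms
    postings = []        # one set of doc ids per query term
    for term in terms:
        entries = index.get(term, [])
        postings.append({doc_id for doc_id, _ in entries})
        for doc_id, freq in entries:
            scores[doc_id] += freq
    if not postings:
        return []
    combine = set.intersection if is_and else set.union
    matches = combine(*postings)
    # Highest cumulative frequency first; ties broken by doc id for stability.
    return sorted(matches, key=lambda d: (-scores[d], d))[:top_k]


print(toy_boolean_retrieval(["drug", "compens"], is_and=True))   # [1]
print(toy_boolean_retrieval(["drug", "compens"], is_and=False))  # [1, 0, 2]
```

Only document 1 contains both terms, so it is the sole AND result; in the OR case it still ranks first because its cumulative frequency (1 + 4) is the highest.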

In [4]:
def boolean_retrieval(query, type):
    """Return up to 3 (document, file path) pairs; type=True is AND, False is OR."""
    # Tokenize, lower-case, and drop stop words and purely numeric tokens.
    tokens = []
    for token in nltk.tokenize.word_tokenize(str(query)):
        token = token.lower()
        if token in stop_words or token.isnumeric():
            continue
        tokens.append(token)

    rank, file_id = dict(), dict()   # doc id -> cumulative frequency / file index
    set_id = []                      # one set of doc ids per query term
    for word in tokens:
        normalized_word = stemmer.stem(word)
        postings = inverted_idx.get(normalized_word)
        if postings is None:
            if type:
                return []            # AND: a missing term means no document matches
            continue                 # OR: a missing term contributes nothing
        ids = set()
        for posting in postings:     # posting = (doc id, file index, frequency)
            ids.add(posting[0])
            file_id[posting[0]] = posting[1]
            rank[posting[0]] = rank.get(posting[0], 0) + posting[2]
        set_id.append(ids)

    if not set_id:                   # no usable query terms were left
        return []
    if type:
        merged_id = set.intersection(*set_id)
    else:
        merged_id = set.union(*set_id)
    # Rank by cumulative term frequency, highest first, and keep the top 3.
    merged_id = sorted(merged_id, key=lambda doc_id: -rank[doc_id])
    return [(documents[doc_id], files_list[file_id[doc_id]]) for doc_id in merged_id[:3]]

#### Extracting the original file name from its path

In [5]:
def extract_file(path):
    # Keep everything after the last backslash (Windows-style separator).
    return path.rsplit('\\', 1)[-1]
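`extract_file` only recognizes backslashes, so it assumes the stored paths are Windows-style. If the paths might also use forward slashes, one possible alternative is `pathlib.PureWindowsPath`, which accepts both separators. The helper name `file_name` and the sample paths below are hypothetical.

```python
from pathlib import PureWindowsPath


def file_name(path):
    """Return the last path component of a Windows- or POSIX-style path."""
    # PureWindowsPath treats both '\\' and '/' as separators and never
    # touches the file system, so it behaves the same on any platform.
    return PureWindowsPath(path).name


print(file_name(r"C:\data\BRIT-PO-Policy-Wording-May-2016-1.docx"))
print(file_name("data/BRIT-PO-Policy-Wording-May-2016-1.docx"))
```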

#### AND-Query


In [6]:
and_query = boolean_retrieval("Illegal Drugs ComPensation", True)
for i, (doc, file_path) in enumerate(and_query, start=1):
    print(str(i) + ") ", extract_file(file_path))
    print(doc[0])
    print("-----------------------------------")

1)  BRIT-PO-Policy-Wording-May-2016-1.docx
any punitive or exemplary damages , compensations , fines or any penalties of whatsoever nature which the insured is ordered to pay by a forum , authority or body of competent jurisdiction ; in respect of coverage clause 1 eviction of squatters only : - a legal expenses incurred in relation to any dispute where the cause of action involves the insured ’ s legitimate tenant ; b any claim resulting from the occupation of the insured premises or part thereof by squatters prior to the inception of this policy ; c any action consciously taken by the insured that hinders the insurer or appointed representative or adversely affects the course of the legal proceedings initiated for the eviction of squatters ; in respect of coverage clause 5 criminal proceedings only , arising out of any criminal proceedings or allegations in respect of : the ownership , possession of or use of any vehicle ; or any investigation by hmrc or the department for work and p

#### OR-Query

In [7]:
or_query = boolean_retrieval("Policy insured", False)
for i, (doc, file_path) in enumerate(or_query, start=1):
    print(str(i) + ") ", extract_file(file_path))
    print(doc[0])
    print("-----------------------------------")

1)  Residential-Property-Owners-Policy-Wording-1910.docx
these documents are the policy statement of fact and/or schedule endorsements notice to policyholders condition precedent any term expressed condition precedent is extremely important if you are in breach of any of these obligations at the time of a loss we will have no obligation to indemnify you in relation to any claim for that loss however if a condition precedent is intended to reduce the risk of a loss of a particular kind at a particular location or at a particular time we will not rely on the breach of that condition precedent to exclude limit or discharge our liability if the breach could not have increased the risk of the loss which actually occurred in the circumstances in which it occurred property material property schedule the schedule for the time being in force detailing the cover provided statement of fact this is a record of the information that you provided to your insurance agent about you and your business up