## Wildcard Queries

In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import pickle
from pathlib import Path

#### Initializing the PorterStemmer object for query-term normalization and loading the stop-word corpus

In [9]:
# stemming porter object
stemmer = PorterStemmer()
root = Path("../")

In [10]:
stop_words = set(stopwords.words('english'))

#### Loading the pickled inverted index, transformed documents, and file names

In [11]:
# Each pickle is read inside a `with` block so the file is closed even on error
with open(root / "Pickled_files" / "Inverted_index", 'rb') as dbfile:
    inverted_idx = pickle.load(dbfile)

with open(root / "Pickled_files" / "Documents", 'rb') as dbfile:
    docs = pickle.load(dbfile)

with open(root / "Pickled_files" / "Files", 'rb') as dbfile:
    files = pickle.load(dbfile)

In [12]:
def isnumber(token):  
    for char in token:
        if not (char >= '0' and char <= '9'):
            return False
    return True

## Trie
We handle two types of wildcard queries: trailing (for example `ill*`) and leading (for example `*ill`). A prefix trie serves the former and a suffix trie the latter. Each word is skipped if it is a stop word, is at most two characters long, or passes the **isnumber** check; otherwise the word is inserted into the prefix trie and its reverse is inserted into the suffix trie. After processing, both trie objects are pickled to binary files for future use.

In [13]:
def insert(word , root) : 
    cur = 0
    for char in word : 
        if root[cur].get(char) == None : 
            root[cur][char] = len(root)
            root.append(dict())
        cur = root[cur][char]
    root[cur]["end"] = True
    if root[cur].get("cnt") == None : 
        root[cur]["cnt"] = 0
    root[cur]["cnt"] += 1
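
To make the layout concrete, here is a small sketch of what the list-of-dicts trie looks like after a few insertions. It repeats the `insert` logic from the cell above so that it runs on its own; `toy_trie` and the sample words are made up for illustration.

```python
def insert(word, root):
    cur = 0
    for char in word:
        if root[cur].get(char) is None:
            root[cur][char] = len(root)   # index of the node about to be appended
            root.append(dict())
        cur = root[cur][char]
    root[cur]["end"] = True
    root[cur]["cnt"] = root[cur].get("cnt", 0) + 1

toy_trie = [dict()]                       # node 0 is the root
for w in ["ill", "ill", "in"]:
    insert(w, toy_trie)

for node_id, node in enumerate(toy_trie):
    print(node_id, node)
# 0 {'i': 1}
# 1 {'l': 2, 'n': 4}
# 2 {'l': 3}
# 3 {'end': True, 'cnt': 2}
# 4 {'end': True, 'cnt': 1}
```

Single-character keys are edges to child nodes (stored as list indices), while the multi-character keys `"end"` and `"cnt"` are bookkeeping on the node itself.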

### Retrieving Words from Trie
Every wildcard query carries a flag, **is_suf**, which is true when the lookup runs against the suffix trie (a leading wildcard query) and false when it runs against the prefix trie (a trailing wildcard query). The query and the matching trie are passed to the **wildcard** function. We walk the trie along the characters of the query; from the node where the query ends, a DFS collects every word in that subtree, which gives the set of matches for the wildcard query. Each end-of-word node stores a **cnt** attribute recording how many times that word was inserted, so the matches can be ranked in decreasing order of **cnt** and the top 3 returned.

In [14]:
def wildcard(word, root , is_suf) : 
    ans = list()
    cur = 0
    for char in word : 
        if root[cur].get(char) == None : 
            return ans
        cur = root[cur][char]
    vis = dict()
    def dfs(token , id) : 
        if vis.get(id) != None : 
            return
        vis[id] = 1
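        if root[id].get("cnt") != None:
            # Reverse only the reported word; `token` itself must stay in trie
            # order because it is extended for the child nodes below.
            found = token[::-1] if is_suf else token
            ans.append((found , root[id]["cnt"]))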
        for labels in root[id].keys() :
            if len(labels) > 1 :
                continue
            new_str = token + labels
            dfs(new_str , root[id][labels])
    dfs(word , cur)
    ans = sorted(ans , key = lambda y : -y[1])
    ret_list = list()
    for i in range(min(3 , len(ans))) :
        ret_list.append(ans[i])
    return ret_list
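
The lookup can be tried on a toy vocabulary. The following is a compact, self-contained sketch rather than the notebook's exact code: `build` and `lookup` are hypothetical names, the traversal uses an explicit stack instead of the recursive `dfs`, and the sample words and counts are invented. It also assumes that a leading wildcard fragment is reversed before it is walked through the suffix trie.

```python
def build(words):
    """Build a list-of-dicts trie; each end-of-word node stores an insertion count."""
    trie = [dict()]
    for word in words:
        cur = 0
        for char in word:
            if char not in trie[cur]:
                trie[cur][char] = len(trie)
                trie.append(dict())
            cur = trie[cur][char]
        trie[cur]["cnt"] = trie[cur].get("cnt", 0) + 1
    return trie

def lookup(fragment, trie, top=3):
    """Return up to `top` (word, cnt) pairs whose trie path starts with `fragment`."""
    cur = 0
    for char in fragment:
        if char not in trie[cur]:
            return []
        cur = trie[cur][char]
    found = []
    stack = [(fragment, cur)]
    while stack:
        token, node_id = stack.pop()
        node = trie[node_id]
        if "cnt" in node:
            found.append((token, node["cnt"]))
        for label, child in node.items():
            if len(label) == 1:           # skip the bookkeeping key "cnt"
                stack.append((token + label, child))
    found.sort(key=lambda pair: -pair[1])
    return found[:top]

words = ["illness", "illness", "illness", "illegal", "illegal", "ill", "hill"]
prefix_trie = build(words)
suffix_trie = build(w[::-1] for w in words)

# Trailing wildcard "ill*": walk the prefix trie directly.
print(lookup("ill", prefix_trie))
# [('illness', 3), ('illegal', 2), ('ill', 1)]

# Leading wildcard "*ill": walk the suffix trie with the reversed fragment,
# then reverse each hit back into reading order.
print([(w[::-1], c) for w, c in lookup("ill"[::-1], suffix_trie)])
```

Keeping the token in trie order during the traversal and reversing it only when reporting avoids corrupting the words found deeper in the subtree.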

In [15]:
Prefix = list()
Prefix.append(dict())
Suffix = list()
Suffix.append(dict())

### Insertion into the Trie

In [16]:
for sentence in docs: 
    sentence = sentence[0]
    tokens = nltk.tokenize.word_tokenize(str(sentence))
    new_token = list()
    for i in tokens:
        new_token.append(i.lower())
    tokens = new_token
    for words in new_token: 
        if words in stop_words or len(words) <= 2 or isnumber(words): 
            continue
        insert(words , Prefix)
        insert(words[::-1] , Suffix)

In [17]:
# Write both tries; the `with` blocks close each file even on error
with open(root / "Pickled_files" / "Prefix_wildcard", 'wb') as dbfile:
    pickle.dump(Prefix, dbfile)

with open(root / "Pickled_files" / "Suffix_wildcard", 'wb') as dbfile:
    pickle.dump(Suffix, dbfile)

In [18]:
# Reload both tries from disk
with open(root / "Pickled_files" / "Prefix_wildcard", 'rb') as dbfile:
    Prefix = pickle.load(dbfile)

with open(root / "Pickled_files" / "Suffix_wildcard", 'rb') as dbfile:
    Suffix = pickle.load(dbfile)

For each word retrieved from the trie, we normalize the term with the Porter stemmer and look up its posting list in the inverted index. At most the first two postings per word are kept.

In [19]:
def query(word , type) : 
    ans = set()
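    if type : 
        # Assumption: `word` arrives in reading order (e.g. "ing" for *ing), so it is
        # reversed here to match the reversed words stored in the Suffix trie.
        word_list = wildcard(word[::-1] , Suffix , True)
    else :
        word_list = wildcard(word , Prefix , False)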
    for words in word_list :
        normalized_word = stemmer.stem(words[0])
        if inverted_idx.get(normalized_word) != None : 
            for i in range(min(len(inverted_idx[normalized_word]), 2)):
                ans.add((docs[inverted_idx[normalized_word][i][0]][0], files[inverted_idx[normalized_word][i][1]]))
    return ans
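
The stem-then-look-up step can be illustrated in isolation. This is a minimal sketch on invented data, assuming each posting is a pair of (document index, file index), which is how `query` indexes them. The names `postings_for` and `toy_stem` are hypothetical, and `toy_stem` is only a stand-in for `stemmer.stem` so that the example runs without NLTK.

```python
def postings_for(words, stem, inverted_idx, docs, files, per_word=2):
    """Map (word, cnt) pairs to a set of (document text, file name) results."""
    results = set()
    for word, _cnt in words:
        term = stem(word)
        # Keep at most `per_word` postings for each normalized term.
        for posting in inverted_idx.get(term, [])[:per_word]:
            doc_id, file_id = posting[0], posting[1]
            results.add((docs[doc_id][0], files[file_id]))
    return results

def toy_stem(word):
    # Stand-in for the Porter stemmer: strips a trailing "es" and nothing else.
    return word[:-2] if word.endswith("es") else word

toy_docs = [["illegal use"], ["specified illnesses"]]
toy_files = ["a.docx", "b.docx"]
toy_index = {"illegal": [(0, 0)], "illness": [(1, 1)]}

matches = [("illnesses", 3), ("illegal", 2), ("illusion", 1)]
print(postings_for(matches, toy_stem, toy_index, toy_docs, toy_files))
```

Words whose stem is missing from the index (here `illusion`) contribute nothing, which mirrors the `inverted_idx.get(...) != None` guard in `query`.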

In [20]:
def extract_file(path) : 
    ans = ""
    j = len(path) - 1
    while j >= 0 and path[j] != '\\' :
        ans += path[j]
        j -= 1
    return ans[::-1]
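
As a possible alternative to scanning for the last backslash by hand, the standard library's `pathlib.PureWindowsPath` exposes the final path component as `.name`. The sketch below uses a hypothetical helper, `file_name`, and an invented path. Unlike `extract_file`, it also treats `/` as a separator.

```python
from pathlib import PureWindowsPath

def file_name(path):
    """Return the last component of a Windows-style path string."""
    return PureWindowsPath(path).name

print(file_name("C:\\corpus\\policies\\1215E.2.docx"))
# 1215E.2.docx
```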

### Wildcard Examples

In [23]:
results = query("ill" , False)
cnt = 0
for i in results : 
    print(str(cnt + 1) + ") " , extract_file(i[1]))
    cnt += 1
    print(i[0])
    print("-----------------------------------")

1)  Property-Owner-Policy-Wording.docx
specified illnesses contingencies any occurrence of a specified illness at the premises , except where the premises is a private dwelling any discovery of an organism at the premises likely to result in the occurrence of a specified illness , except where the premises is a private dwelling any occurrence of legionellosis at the premises d the discovery of vermin or pests at the premises e any accident causing defects in the drains or other sanitary arrangements at the premises which causes restrictions on the use of the premises on the order or advice of the competent local authority . 
-----------------------------------
2)  1215E.2.docx
illegal use we wo n't pay for loss or damage caused in an incident : if you are unable to maintain proper control of the automobile because you are driving or operating the automobile while under the influence of intoxicating substances ; if you are convicted of one of the following offences under the criminal co