**This is part 2 of TA MA #3**

**Task:** In this exercise you are going to build up a simple Boolean retrieval system by applying an inverted index. The dataset consists of 1,000 wine reviews with two columns, doc_id and doc_content. All the texts are in English. The dataset is provided as a csv file (wine-reviews.csv).

Let's get an overview of the data first.

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None) # set full column width
data = pd.read_csv('wine-reviews.csv', dtype={'doc_content':'string'}) # changing doc_content dtype to str
print(data.info(), data.shape, data.dtypes, data["doc_id"].value_counts()) # basic overview
print(data["doc_content"].apply(len).sort_values()) # getting range of doc_content length, just to see
data.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   doc_id       1000 non-null   int64 
 1   doc_content  1000 non-null   string
dtypes: int64(1), string(1)
memory usage: 15.8 KB
None (1000, 2) doc_id          int64
doc_content    string
dtype: object 1000    1
329     1
342     1
341     1
340     1
       ..
662     1
661     1
660     1
659     1
1       1
Name: doc_id, Length: 1000, dtype: int64
42      83
30      98
339    100
47     109
250    110
      ... 
293    419
475    426
647    442
989    457
637    474
Name: doc_content, Length: 1000, dtype: int64


| | doc_id | doc_content |
|---|---|---|
| 0 | 1 | Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity. |
| 1 | 2 | This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016. |
| 2 | 3 | Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity underscoring the flavors. The wine was all stainless-steel fermented. |
| 3 | 4 | Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of honey-drizzled guava and mango giving way to a slightly astringent, semidry finish. |
| 4 | 5 | Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal characteristics. Nonetheless, if you think of it as a pleasantly unfussy country wine, it's a good companion to a hearty winter stew. |
| 5 | 6 | Blackberry and raspberry aromas show a typical Navarran whiff of green herbs and, in this case, horseradish. In the mouth, this is fairly full bodied, with tomatoey acidity. Spicy, herbal flavors complement dark plum fruit, while the finish is fresh but grabby. |
| 6 | 7 | Here's a bright, informal red that opens with aromas of candied berry, white pepper and savory herb that carry over to the palate. It's balanced with fresh acidity and soft tannins. |
| 7 | 8 | This dry and restrained wine offers spice in profusion. Balanced with acidity and a firm texture, it's very much for food. |
| 8 | 9 | Savory dried thyme notes accent sunnier flavors of preserved peach in this brisk, off-dry wine. It's fruity and fresh, with an elegant, sprightly footprint. |
| 9 | 10 | This has great depth of flavor with its fresh apple and pear fruits and touch of spice. It's off dry while balanced with acidity and a crisp texture. Drink now. |


**Task 1)** Build an inverted index for the corpus, that is, the 1,000 wine reviews. Before this step, you need to preprocess your text by tokenizing, lowercasing, etc. Your terms should be alphabetically ordered. You may consider using the sorted function on the dictionary.

First things first. Below we use nltk for preprocessing, namely the tokenize and stem.porter packages to tokenize and stem the reviews in doc_content. 

In [3]:
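from nltk.tokenize import word_tokenize # needs NLTK's tokenizer data to be downloaded
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords # needs NLTK's stopwords corpus to be downloaded

stemmer = PorterStemmer()
stopWords = set(stopwords.words('english'))

tokenized_docs = [word_tokenize(s) for s in data["doc_content"]] # one list of tokens per review
#print(tokenized_docs[0:3])
# keep a token only if it is not a stopword, has no uppercase characters, is alphanumeric and is not a pure number; then stem it
stemmed_docs = [[stemmer.stem(s) for s in sent if s not in stopWords and s.lower() in s and s.isalnum() and not s.isnumeric()] for sent in tokenized_docs]
#print(stemmed_docs[0:3])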

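Above we use list comprehensions to tokenize each review with word_tokenize and then stem the resulting tokens IF they meet the following criteria:
* They are not in NLTK's English stopword corpus
* They contain no uppercase characters (s.lower() in s only holds for tokens that are already lowercase). This looks odd, since everything should be lowercase after stemming, but without the check a handful of capitalized tokens got through: A, An, As, At, Aÿ, Dr, El, If, In, It, Le, M, Of, On, Ox, PG, SB, So, To, Up. All of these are one or two characters long, which suggests the stemmer leaves very short tokens unchanged. Note that the check discards capitalized tokens (for example sentence-initial words) rather than lowercasing them.
* They are alphanumeric
* They are not numeric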
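To make the four criteria concrete, here is a minimal sketch that applies the same conditions to a handful of hand-picked tokens. The token list and the tiny stopword set are assumptions for illustration only (they stand in for the NLTK tokenizer output and stopword corpus), and no stemming is applied.

```python
# Toy tokens and a tiny stand-in stopword set (both hypothetical, not NLTK's data)
tokens = ["Aromas", "include", "tropical", "fruit", ",", "the", "2016", "19th", "It"]
toy_stopwords = {"the", "it"}

kept = [
    t for t in tokens
    if t not in toy_stopwords   # not a stopword (the check is case-sensitive)
    and t.lower() in t          # only true when t has no uppercase characters
    and t.isalnum()             # drops punctuation such as ","
    and not t.isnumeric()       # drops pure numbers such as "2016"
]
print(kept)  # ['include', 'tropical', 'fruit', '19th']
```

Note that "Aromas" and "It" are dropped by the lowercase check, not lowercased. If capitalized words should stay in the index, lowercasing each token before the stopword check would be one option.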

We then use the function we built in TA MA# part 1, to create the dictionary.

In [4]:
from collections import defaultdict

inv_indx = defaultdict(list) # a defaultdict supplies a default value (an empty list) for a nonexistent key, avoiding KeyErrors
for idx, text in enumerate(stemmed_docs): # enumerating over the list of normalized docs and their indexes
    for word in text: 
        inv_indx[word].append(idx) # appending the index of each doc the word occurs in (0-based row index, not doc_id)

# print(len(inv_indx)) # returns 2430
for key in sorted(inv_indx.keys()):
    print("%s: %s" % (key, inv_indx[key]))

19th: [163, 887]
abbess: [395]
abbrevi: [507]
abil: [799]
abli: [893]
abound: [15, 789]
abrupt: [13]
abruptli: [725]
absolut: [276, 279, 457]
abund: [659, 951]
abv: [785, 838]
acacia: [88, 201, 458, 845]
accent: [8, 12, 16, 54, 60, 100, 154, 199, 204, 218, 247, 264, 273, 282, 299, 301, 313, 425, 434, 464, 515, 519, 537, 545, 549, 588, 603, 630, 711, 712, 774, 782, 811, 853, 954]
accentu: [102]
accept: [344]
access: [10, 96, 598, 662, 819, 915]
accompani: [778, 949]
accord: [320]
account: [124]
acet: [233]
acid: [0, 1, 2, 5, 6, 7, 9, 16, 22, 23, 26, 34, 42, 47, 52, 53, 63, 65, 69, 82, 85, 88, 95, 96, 101, 102, 103, 106, 109, 110, 115, 126, 127, 128, 131, 137, 139, 142, 143, 152, 153, 154, 156, 162, 163, 165, 166, 173, 179, 180, 181, 184, 185, 186, 189, 193, 197, 201, 203, 208, 210, 212, 214, 215, 216, 217, 219, 220, 221, 223, 225, 226, 228, 231, 233, 240, 241, 269, 285, 286, 290, 291, 292, 304, 308, 315, 323, 325, 330, 334, 347, 349, 351, 357, 358, 359, 367, 372, 373, 374, 382, 384, 386

peppermint: [335]
percent: [647]
percentag: [477, 871]
percept: [91, 743]
perfect: [82, 131, 173, 274, 308, 332, 400, 416, 465, 475, 589, 598, 755, 805, 897, 995]
perfectli: [83, 119, 163, 351, 495, 776, 899, 906]
perform: [235, 503, 601]
perfum: [191, 209, 259, 316, 398, 419, 427, 464, 477, 502, 549, 578, 632, 637, 648, 707, 719, 733, 796, 811, 848, 849, 850, 874, 921, 994]
perfumey: [255]
perhap: [575, 768, 778]
period: [130]
perlag: [400, 682, 781]
persimmon: [482, 612, 891]
persist: [170, 200, 442, 449, 458, 542, 568, 568, 574, 672, 676]
person: [323, 405, 433, 433, 620, 756, 812, 871]
pervad: [19]
petal: [114, 117, 168, 361, 630, 631, 637, 648, 712]
peter: [133]
petrol: [116, 119, 128, 142, 162, 724, 922]
phenol: [852]
pick: [564]
pictur: [10, 854]
pie: [121, 197, 227, 235, 397, 634, 750, 774, 889]
pierc: [233, 347, 675, 944]
pile: [212]
pillowi: [629, 681]
pinch: [275, 557, 817]
pine: [232, 496, 541, 544, 563, 982]
pineappl: [2, 14, 77, 77, 82, 105, 116, 121, 180, 189, 219, 220, 

This produces a dictionary of len = 2430, meaning that there are 2,430 distinct terms in the inverted index. 
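Some postings above contain the same index more than once (for example 568 under persist and 77 under pineappl), because a term that occurs twice in a review is appended twice. The following is a minimal sketch of one way to store each document only once per term. The toy corpus and the name toy_index are assumptions for illustration, not part of the assignment data.

```python
from collections import defaultdict

# Hypothetical, already-normalized toy corpus (not taken from the wine reviews)
toy_docs = [
    ["pineappl", "acid", "pineappl"],  # "pineappl" occurs twice in document 0
    ["acid", "fruit"],
    ["fruit", "pineappl"],
]

toy_index = defaultdict(list)
for idx, terms in enumerate(toy_docs):
    for term in terms:
        postings = toy_index[term]
        # documents are visited in increasing order, so comparing with the
        # last entry is enough to skip an index that was already recorded
        if not postings or postings[-1] != idx:
            postings.append(idx)

for term in sorted(toy_index):
    print(f"{term}: {toy_index[term]}")
# acid: [0, 1]
# fruit: [1, 2]
# pineappl: [0, 2]
```

Each postings list stays sorted and free of duplicates, which is what the merge-style intersection used for AND queries expects.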

**Task 2)** Write a query function to handle simple AND queries. Please keep in mind to handle the situation where no results are returned.

So, this is essentially a Boolean query of the inverted index. For a single conjunctive query of two terms, we have 5 steps:
1. Locate term 1 in the index
2. Retrieve its postings
3. Locate term 2 in the index
4. Retrieve its postings
5. Find the intersection of the 2 postings. 

This is shown in the algorithm below, credit of Cambridge University Press 2008.



!["conjunction-algo"](intersect-invindx.png "Intersection")
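As a plain-text companion to the figure, here is a minimal standalone sketch of the merge-style intersection of two sorted postings lists. The function name intersect and the example postings are assumptions for illustration.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:      # same document in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:     # advance the pointer at the smaller index
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 14, 77, 82], [0, 2, 5, 77, 90]))  # [2, 77]
print(intersect([1, 2], []))                          # []
```

Because both pointers only move forward, the work is proportional to the combined length of the two lists, and an empty postings list simply yields an empty result.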

Below, I've created a class, aptly named info_retr_3000. It should be able to take any number of tokens and return the intersection of their postings. First it calls the retrieve_posting() method for all args, then finds the intersection between them. It does this by recursively applying the algorithm above, implemented in the find_intersection method. This means that for a query with 3 words, let's say "aroma", "fruit" and "wine", it will find aroma ∩ fruit, then fruit ∩ wine and finally (aroma ∩ fruit) ∩ (fruit ∩ wine), which yields the intersection of all 3.

In [8]:
class info_retr_3000:

    def __init__(self, inverted_index, corpus):
        self.inverted_index = inverted_index # the inverted index of the corpus
        self.corpus = corpus # the actual corpus
    
    def and_retrieval(self, *args: str): # uses *args to accept any number of query terms of type str

        """
        Stems each query term, retrieves its
        postings, and displays the documents in
        the intersection of all postings.
        """

        if not args: # no query terms at all: nothing to retrieve
            return self.display_docs([])

        # retrieves postings for each stemmed token / arg
        lists = [self.retrieve_posting(stemmer.stem(post_n)) for post_n in args]

        if len(lists) == 1: # a single term needs no intersection
            return self.display_docs(lists[0])

        return self.display_docs(self.find_intersection(lists)) # sends lists to intersection and displays the result
                        
    def find_intersection(self, lists):
        
        result = list()
        
        if len(lists) < 3: # queries with a single term are handled in and_retrieval, so this branch handles exactly 2 lists
            
            p1, p2 = 0, 0
            
            while p1 < len(lists[0]) and p2 < len(lists[1]): # applies the intersect algo
                if lists[0][p1] == lists[1][p2]:
                    result.append(lists[0][p1])
                    p1 += 1
                    p2 += 1
                elif lists[0][p1] > lists[1][p2]:
                    p2 += 1
                else:
                    p1 += 1
            
        else: # if more than 2 elements in list

            # This section intersects every pair of adjacent postings and then calls
            # find_intersection recursively on those pairwise results, until only
            # two lists remain and the branch above applies.

            inner_list = []

            for index in range(len(lists)):

                if index + 1 >= len(lists): # no right-hand neighbour left...
                    return self.find_intersection(inner_list) # ...so we call find_intersection again with inner_list

                else:

                    # if index + 1 is within bounds, we find the intersection of the 2 adjacent sets of postings
                    inner_list.append(self.find_intersection([lists[index], lists[index + 1]]))

        return result 
    
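    def retrieve_posting(self, posting): # retrieves the postings for a term
        # .get() returns [] on no match and, unlike indexing a defaultdict, does not insert the missing key
        return self.inverted_index.get(posting, [])
    
    def display_docs(self, lst): # prints the matching indexes and the corresponding reviews
        if not lst: # handles queries with no results
            print("No matching documents found.")
            return
        print(set(lst)) # cast to set to avoid duplicates
        for doc_idx in sorted(set(lst)): # print each matching review once
            print(f"\n{self.corpus[doc_idx]}")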
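The pairwise recursion in find_intersection intersects the middle lists more than once. As an alternative, here is a minimal sketch that folds the postings lists left to right, starting with the shortest one. The function name intersect_all and the toy postings are assumptions for illustration, and Python's set intersection stands in for the merge algorithm here.

```python
from functools import reduce

def intersect_all(postings_lists):
    """Intersect any number of postings lists by folding them left to right."""
    if not postings_lists:
        return []
    # starting with the shortest lists keeps the intermediate results small
    ordered = sorted(postings_lists, key=len)
    # set intersection replaces the two-pointer merge; sorted() restores index order
    return reduce(
        lambda acc, postings: sorted(set(acc) & set(postings)),
        ordered[1:],
        sorted(set(ordered[0])),
    )

# hypothetical postings for three query terms
aroma = [1, 4, 7, 9, 12]
fruit = [1, 2, 4, 9, 10, 12]
wine = [0, 4, 9, 11]

print(intersect_all([aroma, fruit, wine]))  # [4, 9]
```

For k query terms this performs k - 1 intersections, and it gives the same result as (aroma ∩ fruit) ∩ (fruit ∩ wine) on this toy data.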


In [6]:
info_bot = info_retr_3000(inv_indx, data["doc_content"])

In [7]:
#info_bot.and_retrieval("aroma", "rose")
#info_bot.and_retrieval("aroma", "acidity", "flavor")
#info_bot.and_retrieval("aroma", "acidity", "flavor", "finish", "fruit")
#info_bot.and_retrieval("hotel")
#info_bot.and_retrieval("blackberry")
#print(info_bot.and_retrieval("fruit", "palate"))

{18, 25, 544, 38, 551, 553, 556, 561, 562, 567, 68, 599, 108, 122, 124, 134, 141, 655, 657, 146, 659, 660, 149, 677, 167, 683, 689, 178, 185, 188, 700, 192, 715, 204, 215, 229, 236, 239, 244, 247, 760, 250, 764, 765, 254, 769, 258, 783, 277, 284, 796, 798, 287, 799, 289, 291, 294, 295, 297, 314, 330, 356, 871, 878, 880, 881, 882, 885, 380, 896, 386, 900, 402, 404, 410, 429, 951, 442, 446, 970, 475, 476, 989, 481, 486, 505}

Desiccated blackberry, leather, charred wood and mint aromas carry the nose on this full-bodied, tannic, heavily oaked Tinto Fino. Flavors of clove and woodspice sit on top of blackberry fruit, then hickory and other forceful oak-based aromas rise up and dominate the finish.

Oak and earth inte