# Inverted indexing

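An inverted index maps each term to the list of documents that contain it (its postings list). It is used for information retrieval: finding unstructured material (for example, a book that contains a particular piece of text) within a large collection.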

### Criteria for judging retrieved docs
- Precision: fraction of retrieved docs that are relevant to the user's needs.
- Recall: fraction of relevant docs in the collection that are retrieved.
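
As a small illustration of the two measures, the sketch below computes them from two collections of doc IDs. The function name `precision_recall` and the example IDs are made up for this note.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 docs retrieved, 3 docs truly relevant, 2 in common
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 7])
print(precision)  # 0.5 (2 of the 4 retrieved docs are relevant)
print(recall)     # 2 of the 3 relevant docs were retrieved
```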


### Bounds
- For an AND query, the size of the result (and of every intermediate result) is bounded by the shortest postings list.
- To reduce the time spent iterating through the postings lists, sort them by length and start merging from the list with the **smallest size**.
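
A minimal sketch of the smallest-first idea for an AND over several terms. The helper names `intersect` and `intersect_many` and the sample lists are assumptions made for illustration.

```python
def intersect(a, b):
    """Merge-intersect two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_many(postings_lists):
    """AND together any number of sorted postings lists, shortest first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)  # shortest list first
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:  # nothing left to match, stop early
            break
        result = intersect(result, postings)
    return result


print(intersect_many([[1, 2, 3, 4, 5, 6], [2, 4], [2, 3, 4, 8]]))  # [2, 4]
```

Starting from the shortest list keeps every intermediate result at most that short, so later merges have less to walk through.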

### Precedence of query operators (1 is highest)
1. ( brackets )
2. AND NOT (maybe)
3. NOT
4. AND
5. XOR
6. OR
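
Python's set operators happen to follow the same relative order for AND (`&`), XOR (`^`) and OR (`|`), and set difference (`-`) binds tighter than all three, so the grouping can be tried out directly. This is only an analogy using made-up sets, not how a query parser would be written; NOT is modelled here as `universe - a`.

```python
universe = {0, 1, 2, 3, 4}  # hypothetical set of all doc IDs
a = {0, 1, 2}
b = {1, 2, 3}
c = {2, 4}

# a AND b XOR c OR NOT a, written without brackets
implicit = a & b ^ c | universe - a

# The same query with the precedence spelled out
explicit = ((a & b) ^ c) | (universe - a)

# A different grouping gives a different answer
regrouped = a & (b ^ (c | (universe - a)))

print(implicit == explicit)  # True
print(sorted(implicit))      # [1, 3, 4]
print(sorted(regrouped))     # [1]
```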

### Advantages and disadvantages of inverted index

#### Advantages

- An inverted index allows fast full-text searches, at the cost of extra processing whenever a document is added to the database.
- It is easy to develop.
- It is the most popular data structure in document retrieval systems and is used at large scale, for example in search engines.

#### Disadvantages
- Large storage overhead and high maintenance costs on update, delete and insert.
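
To make the maintenance cost concrete, here is a rough sketch of inserting and deleting a document. The function names are hypothetical, and it assumes the index keeps no per-document term list, which is why a delete has to look at every term.

```python
import bisect
from collections import defaultdict


def add_document(index, doc_id, tokens):
    """Insert doc_id into the sorted postings list of every distinct token."""
    for term in set(tokens):
        postings = index[term]
        pos = bisect.bisect_left(postings, doc_id)
        if pos == len(postings) or postings[pos] != doc_id:
            postings.insert(pos, doc_id)  # keeps the list sorted


def remove_document(index, doc_id):
    """Delete doc_id everywhere; every term has to be checked."""
    for term in list(index):
        postings = index[term]
        pos = bisect.bisect_left(postings, doc_id)
        if pos < len(postings) and postings[pos] == doc_id:
            postings.pop(pos)
        if not postings:
            del index[term]  # drop terms that no longer occur anywhere


index = defaultdict(list)
add_document(index, 2, ["hello", "penyet"])
add_document(index, 0, ["penyet", "penyet"])
add_document(index, 1, ["test", "hello"])
remove_document(index, 1)
print(dict(index))
```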



In [2]:
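from nltk.stem import PorterStemmer
from nltk.probability import FreqDist
import nltk

# nltk.word_tokenize (used below) needs NLTK's tokenizer data,
# which may have to be fetched once with nltk.download('punkt')

# The Porter stemmer strips suffixes such as -ed and -s
stemmer = PorterStemmer()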

### The Texts

In [3]:
text = "penyetted penyet"
text1 = "penyet test helloed"
text2 = "penyetted hello"

texts = [text, text1, text2]

### The dictionary

In [4]:
dictionary = {}

Map each stemmed word to the indexes of the texts that contain it

In [5]:
# Strictly speaking, the doc count for each word should be stored as well:
# {(term, docCount): [docID1,...]}
for i, doc in enumerate(texts):
    for word in nltk.word_tokenize(doc):
        word = stemmer.stem(word)  # Stem it first
        if word not in dictionary:
            dictionary[word] = [i]
        elif i not in dictionary[word]:
            dictionary[word].append(i)

In [6]:
for term, postings in dictionary.items():
    print(term, postings)

# Each postings list holds the text indexes in increasing order

penyet [0, 1, 2]
test [1]
hello [1, 2]
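
The doc count mentioned in the comment above could be stored next to each postings list. The sketch below is one possible layout, `{term: (docCount, [docIDs])}`; it assumes the tokens are already stemmed so that it runs without NLTK, and the variable names are made up.

```python
# Already-stemmed tokens for the three texts (assumed, to avoid the NLTK dependency)
stemmed_docs = [
    ["penyet", "penyet"],
    ["penyet", "test", "hello"],
    ["penyet", "hello"],
]

postings_by_term = {}
for doc_id, tokens in enumerate(stemmed_docs):
    for term in tokens:
        ids = postings_by_term.setdefault(term, [])
        # doc IDs arrive in increasing order, so checking the tail avoids duplicates
        if not ids or ids[-1] != doc_id:
            ids.append(doc_id)

# Attach the document count (length of the postings list) to every term
index_with_counts = {term: (len(ids), ids) for term, ids in postings_by_term.items()}

for term, (doc_count, ids) in index_with_counts.items():
    print(term, doc_count, ids)
```

With the count stored up front, the query step can sort terms by size without touching the postings lists themselves.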


### The queries

In [7]:
query1 = "penyet"
query2 = "hello"

query1 = stemmer.stem(query1)
query2 = stemmer.stem(query2)

queries = [[len(dictionary[query1]), query1], [len(dictionary[query2]), query2]]
queries.sort() # Sort the queries so we tackle the smallest one first

### Querying AND operation
Time complexity: O(x + y), where x and y are the lengths of the two postings lists
```
==== Worst case ====
'a' -> 1,2,3,4,10
'b' -> 5,6,7,8,9,10
```
The merge has to step through 1 -> 4 in `a` first, then 5 -> 9 in `b`, before it reaches the only match, 10.
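
The worst case above can be checked by counting loop iterations. The helper `intersect_with_steps` is a hypothetical name used only for this check.

```python
def intersect_with_steps(a, b):
    """Merge-intersect two sorted lists and count the loop iterations."""
    i = j = steps = 0
    out = []
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out, steps


a = [1, 2, 3, 4, 10]
b = [5, 6, 7, 8, 9, 10]
matches, steps = intersect_with_steps(a, b)
print(matches)  # [10]
print(steps)    # 10, which stays within len(a) + len(b) = 11
```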

In [8]:
# We want to find a text which contains both penyet and hello
p1 = 0
p2 = 0
foundTexts = {}
# We can check both of them at the same time as their arrays are sorted
while (p1 < len(dictionary[queries[0][1]]) and p2 < len(dictionary[queries[1][1]])):
    index1 = dictionary[queries[0][1]][p1]
    index2 = dictionary[queries[1][1]][p2]
    
    if (index1 < index2):
        p1 += 1 # If index1 < index2 then we move p1 up
    elif (index1 > index2):
        p2 += 1 # if index2 < index1 then we move p2 up
    elif (index1 == index2): 
        foundTexts[index1] = True
        p1 += 1
        p2 += 1

print (foundTexts.keys())

dict_keys([1, 2])


### Querying with XOR condition

In [9]:
# We want the texts that contain penyet XOR hello: exactly one of the two terms, not both
foundTexts = {}
p1 = 0
p2 = 0

# The loop condition uses `and`: the pairwise comparison only works while both lists
# still have postings left; whatever remains in either list is handled afterwards.
while (p1 < len(dictionary["penyet"]) and p2 < len(dictionary["hello"])):
    index1 = dictionary["penyet"][p1]
    index2 = dictionary["hello"][p2]
    if (index1 == index2):
        p1 += 1
        p2 += 1
    elif (index1 < index2):
        foundTexts[index1] = True
        p1 += 1
    else:
        foundTexts[index2] = True
        p2 += 1

        
# Remember these two while loops: leftover postings in either list belong to the XOR result too
while (p1 < len(dictionary["penyet"])):
    index1 = dictionary["penyet"][p1]

    foundTexts[index1] = True
    p1 += 1
    
while (p2 < len(dictionary["hello"])):
    index2 = dictionary["hello"][p2]
    foundTexts[index2] = True
    p2 += 1

print (foundTexts.keys())

dict_keys([0])


### Querying with OR operation
Time complexity: O(x + y), since every posting in both lists is visited once

In [10]:
foundTexts = {}
p1 = 0
p2 = 0
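# The loop condition is still `and`; the OR comes from keeping every doc ID we see.
# Always advance past the smaller doc ID so the result stays in sorted order.
while (p1 < len(dictionary["penyet"]) and p2 < len(dictionary["hello"])):
    index1 = dictionary["penyet"][p1]
    index2 = dictionary["hello"][p2]

    if (index1 == index2):
        foundTexts[index1] = True
        p1 += 1
        p2 += 1
    elif (index1 < index2):
        foundTexts[index1] = True
        p1 += 1
    else:
        foundTexts[index2] = True
        p2 += 1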

# Remember these two while loops: they add whatever is left in either list
while (p1 < len(dictionary["penyet"])):
    index1 = dictionary["penyet"][p1]

    foundTexts[index1] = True
    p1 += 1
    
while (p2 < len(dictionary["hello"])):
    index2 = dictionary["hello"][p2]
    foundTexts[index2] = True
    p2 += 1
    
    
print (foundTexts.keys())

dict_keys([0, 1, 2])


### Querying with the NOT operation
`NOT B`  
To evaluate this we need the universal set (every doc ID in the collection) and keep every doc ID that does not appear in the postings list of `B`.
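
If doc IDs are assumed to run from 0 to N-1, the universal set does not have to be materialised as a list. The sketch below, with the hypothetical helper `complement`, walks the ID range and the postings list together.

```python
def complement(postings, num_docs):
    """Doc IDs in range(num_docs) that are absent from a sorted postings list.

    Assumes doc IDs are exactly 0 .. num_docs - 1.
    """
    out = []
    p = 0
    for doc_id in range(num_docs):
        if p < len(postings) and postings[p] == doc_id:
            p += 1  # doc_id is in B, so it is excluded from NOT B
        else:
            out.append(doc_id)
    return out


# With the three texts from earlier: hello -> [1, 2], so NOT hello -> [0]
print(complement([1, 2], num_docs=3))  # [0]
```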

In [2]:
# Terms - posting list
Brutus = [1,2,4,11,31,45,173,174]
Calpurina = [2,31,54,101]

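# Stand-in for the universal set: here it is just the union of the two lists above
# (in a real index it would be every doc ID in the collection)
Universal = sorted(set(Brutus) | set(Calpurina))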

p1 = 0
p2 = 0

answer = {}

# We want to find NOT BRUTUS
while (p1 < len(Universal) and p2 < len(Brutus)):
    index1 = Universal[p1]
    index2 = Brutus[p2]
    
    if (index1 == index2):
        p1 += 1
        p2 += 1
        
    elif (index1 < index2):
        # Skipping cannot be used here as we will miss out on index1 to add to answer
        answer[index1] = True
        p1 += 1
    else:
        # Skipping can be used here
        p2 += 1
        
        
# Assuming universal set will always be larger than Brutus set
# we only have to add on the remaining stuff from Universal
while (p1 < len(Universal)):
    index1 = Universal[p1]
    answer[index1] = True
    p1 += 1
    
print (list(answer))

[54, 101]


### Querying A AND NOT B

In [1]:
# Terms - posting list
Brutus = [1,2,4,11,31,45,173,174]
Calpurina = [2,31,54,101]

p1 = 0
p2 = 0

answer = {}

# We want to find Calpurina AND NOT BRUTUS
while (p1 < len(Calpurina) and p2 < len(Brutus)):
    index1 = Calpurina[p1] # Put index1 as the first variable
    index2 = Brutus[p2] # index2 as the second variable with NOT operation
    
    if (index1 == index2):
        p1 += 1
        p2 += 1
    elif (index1 < index2):
        # Means it does not contain index1 in Brutus
        # Skipping cannot be used here as we have to add every single index1 which is < index2
        answer[index1] = True
        p1 += 1
    else:
        # Skip lists could be implemented here to try to match p2 pointer to p1.
        p2 += 1

        
# To deal with the remaining Calpurina postings (p1 is the Calpurina pointer, not p2)
while (p1 < len(Calpurina)):
    index1 = Calpurina[p1]
    answer[index1] = True
    p1 += 1

    
print (list(answer))

[54, 101]
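
The skipping idea mentioned in the comments above can be sketched for `A AND NOT B` as follows. This is only one possible arrangement: it assumes skip pointers sit at every `skip`-th position of `B`'s list, with `skip` set to roughly the square root of the list length, and the function name is made up.

```python
import math


def and_not_with_skips(a, b):
    """Doc IDs in sorted list a that do not occur in sorted list b."""
    skip = max(1, math.isqrt(len(b)))  # assumed skip span
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] cannot be in b; no skipping here, every such a[i] is kept
            out.append(a[i])
            i += 1
        elif j % skip == 0 and j + skip < len(b) and b[j + skip] <= a[i]:
            j += skip  # the skip target does not overshoot a[i], so jump
        else:
            j += 1
    out.extend(a[i:])  # the rest of a has nothing left in b to cancel it
    return out


Brutus = [1, 2, 4, 11, 31, 45, 173, 174]
Calpurina = [2, 31, 54, 101]
print(and_not_with_skips(Calpurina, Brutus))  # [54, 101]
print(and_not_with_skips(Brutus, Calpurina))  # [1, 4, 11, 45, 173, 174]
```

A jump is only taken when the skip target is still less than or equal to the current doc ID of `A`, so no posting of `B` that could match is ever passed over.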


In [1]:
# Terms - posting list
Brutus = [1,2,4,11,31,45,173,174]
Calpurina = [2,31,54,101]

p1 = 0
p2 = 0

answer = {}

# We want to find BRUTUS AND NOT CALPURINA
while (p1 < len(Brutus) and p2 < len(Calpurina)):
    index1 = Brutus[p1]
    index2 = Calpurina[p2]
    
    if (index1 == index2):
        p1 += 1
        p2 += 1
        
    elif (index1 < index2):
        answer[index1] = True
        p1 += 1
    else:
        p2 += 1

# To deal with the remaining Brutus
while (p1 < len(Brutus)):
    index1 = Brutus[p1]
    answer[index1] = True
    p1 += 1
        
        
print (list(answer))

[1, 4, 11, 45, 173, 174]
