# Agenda
Information retrieval involves searching for words or text in one or more documents. Indexing, and inverted indexing in particular, makes these searches more efficient. This exercise walks through the main steps of constructing an **Inverted Index** for a sample document and illustrates a sample query that uses the index to retrieve a **vocabulary term**. The remainder of this guide is organized as follows.

## Table of Contents
1. [Python Libraries](#libraries)  
    1.1. [Installing Python Libraries](#lib-install)  
    1.2. [Importing Python Libraries](#lib-import)  
2. [Reading (dummy) Text File](#dataset)
3. [Information Retrieval](#retrieval)  
    3.1. [Remove Punctuation](#punctuation)  
    3.2. [Tokenization](#tokenize)  
    3.3. [Remove Stop-Words](#stop-words)  
    3.4. [Construct Inverted-Index](#indexing)  
    3.5. [Pose Query](#query)  
4. [Exercise: Construct an Inverted Index for UCI Data set](#exercise)
  
    
# 1. Python Libraries <a name="libraries"></a>
## 1.1. Installing Python Libraries <a name="lib-install"></a>
This exercise requires the following **Python** library:

<ul>
    <li><strong>nltk:</strong> package for natural language processing.</li>
</ul>

In [1]:
# Installing Libraries (if not installed)
#!pip3 install nltk

## 1.2. Importing Python Libraries <a name="lib-import"></a>
We import the **nltk** *word_tokenize()* function for the tokenization stage. We also download the *stop words* list and the *punkt* tokenizer data from the **nltk** library.

In [2]:
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/owuorjnr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/owuorjnr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# 2. Reading (dummy) Text File <a name="dataset"></a>
In this section, we read dummy data from a sample text file and store it in a variable **text**.


In [3]:
# Open the sample file, read its contents, and close it automatically
with open('sample_3.txt', encoding='utf8') as file:
    text = file.read()
text


'the quick brown fox jumped over the lazy dog.\nhello! how are you? I will be quick to jump over this.\nthis is a Masters unit named \'data mining, storage and retrieval\'. the Masters course is named "data science". there are about 50 quick students in this class.'

# 3. Information Retrieval <a name="retrieval"></a>
This exercise is adapted from:

1. [Create Inverted Index for File using Python](https://www.geeksforgeeks.org/create-inverted-index-for-file-using-python/)
2. [Python: Inverted Index for dummies](http://mocilas.github.io/2015/11/18/Python-Inverted-Index-for-dummies/)


## 3.1. Remove Punctuation <a name="punctuation"></a>
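In this section, we replace each punctuation mark with a space and convert the text to lowercase, so that tokens such as "Masters" and "masters" are treated as the same term.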

In [4]:
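# Punctuation characters to replace with a space
# (raw string, so the backslash is taken literally)
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
for ch in punc:
    text = text.replace(ch, " ")

# Lowercase the text to maintain uniformity
text = text.lower()
text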

'the quick brown fox jumped over the lazy dog \nhello  how are you  i will be quick to jump over this \nthis is a masters unit named  data mining  storage and retrieval   the masters course is named  data science   there are about 50 quick students in this class '
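
The same cleanup can also be expressed with `str.translate`. The following is a minimal sketch rather than the notebook's own method: it assumes that the standard-library constant `string.punctuation` (ASCII punctuation only) is an acceptable punctuation list, and `sample` is a hypothetical stand-in for the `text` variable.

```python
import string

# Hypothetical sample line; in the notebook this would be the `text` variable.
sample = 'Hello! How are you? The course is named "data science".'

# Translation table that maps every ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Replace punctuation in a single pass, then lowercase for uniformity.
cleaned = sample.translate(table).lower()
print(cleaned)
```

Because each punctuation character becomes exactly one space, character offsets in `cleaned` still line up with the offsets in the original string.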

## 3.2. Tokenization <a name="tokenize"></a>
In this section, we tokenize (split) the text read from the sample file. We produce two kinds of tokens:

1. Tokens with positional postings.
2. Tokens with document (standard) postings.


In [5]:
# 1. Positional postings

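def word_split(text):
    """Return a list of (start_offset, word) pairs, one per alphanumeric run."""
    pos_tokens = []
    wcurrent = []   # characters of the word currently being built
    windex = None   # offset of the last character added to wcurrent
    for i, c in enumerate(text):
        if c.isalnum():
            wcurrent.append(c)
            windex = i
        elif wcurrent:
            # A non-alphanumeric character ends the current word
            word = ''.join(wcurrent)
            pos_tokens.append((windex - len(word) + 1, word))
            wcurrent = []
    # Flush a word that runs to the end of the text
    if wcurrent:
        word = ''.join(wcurrent)
        pos_tokens.append((windex - len(word) + 1, word))
    return pos_tokens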

pos_tokens = word_split(text)
pos_tokens

[(0, 'the'),
 (4, 'quick'),
 (10, 'brown'),
 (16, 'fox'),
 (20, 'jumped'),
 (27, 'over'),
 (32, 'the'),
 (36, 'lazy'),
 (41, 'dog'),
 (46, 'hello'),
 (53, 'how'),
 (57, 'are'),
 (61, 'you'),
 (66, 'i'),
 (68, 'will'),
 (73, 'be'),
 (76, 'quick'),
 (82, 'to'),
 (85, 'jump'),
 (90, 'over'),
 (95, 'this'),
 (101, 'this'),
 (106, 'is'),
 (109, 'a'),
 (111, 'masters'),
 (119, 'unit'),
 (124, 'named'),
 (131, 'data'),
 (136, 'mining'),
 (144, 'storage'),
 (152, 'and'),
 (156, 'retrieval'),
 (168, 'the'),
 (172, 'masters'),
 (180, 'course'),
 (187, 'is'),
 (190, 'named'),
 (197, 'data'),
 (202, 'science'),
 (212, 'there'),
 (218, 'are'),
 (222, 'about'),
 (228, '50'),
 (231, 'quick'),
 (237, 'students'),
 (246, 'in'),
 (249, 'this'),
 (254, 'class')]

In [6]:
# 2. Standard Postings

# Split the cleaned text into word tokens
std_tokens = word_tokenize(text)
std_tokens

['the',
 'quick',
 'brown',
 'fox',
 'jumped',
 'over',
 'the',
 'lazy',
 'dog',
 'hello',
 'how',
 'are',
 'you',
 'i',
 'will',
 'be',
 'quick',
 'to',
 'jump',
 'over',
 'this',
 'this',
 'is',
 'a',
 'masters',
 'unit',
 'named',
 'data',
 'mining',
 'storage',
 'and',
 'retrieval',
 'the',
 'masters',
 'course',
 'is',
 'named',
 'data',
 'science',
 'there',
 'are',
 'about',
 '50',
 'quick',
 'students',
 'in',
 'this',
 'class']
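
The positional tokens can also be produced with a regular expression. This is a hedged sketch, not the notebook's own tokenizer: `word_positions` is a hypothetical helper name, and the pattern `[^\W_]+` is assumed to match the same runs of letters and digits that `str.isalnum()` accepts for ordinary text.

```python
import re

def word_positions(text):
    """Return (start_offset, word) pairs for each run of letters and digits."""
    # [^\W_]+ matches word characters except the underscore
    return [(m.start(), m.group()) for m in re.finditer(r"[^\W_]+", text)]

# First line of the cleaned sample text
sample = "the quick brown fox jumped over the lazy dog "
positions = word_positions(sample)
print(positions)
```

Each match object already carries its start offset, so no manual bookkeeping of the current word is needed.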

## 3.3. Remove Stop-Words <a name="stop-words"></a>
In this section, we use a *Stop List* provided by the **nltk** library to remove common (stop) words from our tokenized text.


In [7]:
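# Build the stop list once; calling stopwords.words() for every token is slow.
# With no language argument, nltk returns the stop words of every language in
# the corpus, so tokens that match a non-English stop word are removed as well
# (note that 'dog' is absent from the output).
stop_words = set(stopwords.words())

# 1. Standard Tokens
valid_std_tokens = [word for word in std_tokens if word not in stop_words]

# 2. Position Tokens
valid_pos_tokens = [
    (index, word) for index, word in pos_tokens if word not in stop_words]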
  
print(valid_std_tokens)
print()
print(valid_pos_tokens)

['quick', 'brown', 'fox', 'jumped', 'lazy', 'hello', 'quick', 'jump', 'masters', 'unit', 'named', 'data', 'mining', 'storage', 'retrieval', 'masters', 'course', 'named', 'data', 'science', '50', 'quick', 'students', 'class']

[(4, 'quick'), (10, 'brown'), (16, 'fox'), (20, 'jumped'), (36, 'lazy'), (46, 'hello'), (76, 'quick'), (85, 'jump'), (111, 'masters'), (119, 'unit'), (124, 'named'), (131, 'data'), (136, 'mining'), (144, 'storage'), (156, 'retrieval'), (172, 'masters'), (180, 'course'), (190, 'named'), (197, 'data'), (202, 'science'), (228, '50'), (231, 'quick'), (237, 'students'), (254, 'class')]


## 3.4. Construct Inverted-Index <a name="indexing"></a>
In this section, we construct an **Inverted Index** with positional postings: a dictionary that maps each term to the list of character offsets at which it occurs in the text.

* As an assignment, construct a standard **Inverted Index** and pose a sample query.

In [8]:
# Positional Postings
inverted = {}
for index, word in valid_pos_tokens:
    locations = inverted.setdefault(word, [])
    locations.append(index)
    
inverted

{'quick': [4, 76, 231],
 'brown': [10],
 'fox': [16],
 'jumped': [20],
 'lazy': [36],
 'hello': [46],
 'jump': [85],
 'masters': [111, 172],
 'unit': [119],
 'named': [124, 190],
 'data': [131, 197],
 'mining': [136],
 'storage': [144],
 'retrieval': [156],
 'course': [180],
 'science': [202],
 '50': [228],
 'students': [237],
 'class': [254]}

## 3.5. Pose Query <a name="query"></a>
In this section, we pose a sample query that uses the **Inverted Index** to return the positions at which the query term occurs. Because the sample consists of a single document, the result is a set of character offsets rather than a list of documents.

In [9]:
# functools provides reduce(), used to intersect the posting sets
import functools

query = 'quick'

# Keep only the query terms that are present in the index
words = [word for _, word in word_split(query.lower()) if word in inverted]
# One set of positions per query term
results = [set(inverted[word]) for word in words]
# Intersect the sets; if no query term is in the index, return an empty set
answer = functools.reduce(lambda x, y: x & y, results) if results else set()
answer

{4, 76, 231}
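
With more than one document, each posting would also need to record which document it belongs to. The following is a minimal sketch under stated assumptions: the three documents in `docs`, their ids, and the names `doc_index` and `search` are all hypothetical, and stop-word removal is skipped to keep the example short.

```python
import re
from functools import reduce

# Hypothetical documents, keyed by an illustrative document id.
docs = {
    "doc1": "the quick brown fox jumped over the lazy dog",
    "doc2": "there are about 50 quick students in this class",
    "doc3": "the masters course is named data science",
}

# Index layout: term -> {doc_id: [character offsets]}
doc_index = {}
for doc_id, body in docs.items():
    for match in re.finditer(r"[^\W_]+", body.lower()):
        postings = doc_index.setdefault(match.group(), {})
        postings.setdefault(doc_id, []).append(match.start())

def search(query):
    """Return the ids of documents that contain every query term (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One set of document ids per term; unknown terms give an empty set
    doc_sets = [set(doc_index.get(term, {})) for term in terms]
    return reduce(lambda a, b: a & b, doc_sets)

hits = search("quick students")
# Show where 'quick' occurs in each matching document
print({doc_id: doc_index["quick"][doc_id] for doc_id in hits})
```

Intersecting sets of document ids gives AND semantics across terms, while the per-document offset lists remain available for reporting where each term occurs.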

# 4. Exercise: Construct an Inverted Index for UCI Data set <a name="exercise"></a>
The [Eco_dataset](Eco_dataset.csv) is retrieved from the [UCI Data Repository](https://archive.ics.uci.edu/ml/datasets/Eco-hotel). The **Eco-hotel Data Set** includes textual reviews from both online (e.g., TripAdvisor) and offline (e.g., guests' book) sources for the Areias do Seixo Eco-Resort. Use this data set to:

1. Construct an Inverted Index.
2. Pose a query on the index created in part 1.
