<a href="https://colab.research.google.com/github/gvogiatzis/CS3320/blob/main/CS3320_Lab_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS3320 Lab 1. Boolean Retrieval

## Introduction
In this lab we will explore some of the fundamental information retrieval concepts we saw in the lectures including indexing, ranked and boolean retrieval. You are given a small dataset of 737 news stories scraped from the **BBC Sports website** between 2004 and 2005 (full dataset [here](http://mlg.ucd.ie/datasets/bbc.html)). Your task is to write a basic search engine in Python, using the Boolean Retrieval model we talked about in lectures. 
Your engine should consist of an indexer, which processes all documents and creates an index, as well as a query engine that takes a query string and returns a list of the documents that contain all the query terms.

In [None]:
!wget https://github.com/gvogiatzis/CS3320/raw/main/data/bbc_sport_docs.zip

In [None]:
!unzip bbc_sport_docs.zip -d docs

## The dataset
Before we do anything else, let's have a look at the dataset. If you ran the two cells above, the documents are already unpacked in the `docs` directory. Otherwise, I will assume you have downloaded the dataset *bbcsport.tar.gz* from Blackboard and unpacked it (`gunzip bbcsport.tar.gz` followed by `tar -xf bbcsport.tar`) in a convenient location. Go into that directory and have a look at some of the files using the `cat` command, e.g.

In [None]:
!cat docs/000.txt

should produce something like

    McCall earns Tannadice reprieve

    Dundee United manager Ian McCall has won a reprieve from the sack, with chairman Eddie Thompson calling for an end to speculation over his future...etc

Now let's try to open and print the file in Python. First, you need to import these modules, which will be used throughout:

In [None]:
import re
import os
from glob import glob

We can open a file using the built-in `open` command as follows:

In [None]:
f = open('docs/000.txt','r', encoding='latin-1')

The method `f.read()` then reads the file into a string as follows:

In [None]:
s = f.read()

Try printing the first 200 characters to see how the output looks. 

In [None]:
print(s[0:200])

We can use the `glob` function to obtain a list of all the filenames matching a pattern such as `docs/*.txt`, sort that list, and then open a specific file using an index into it. This index will become the document ID that our search engine will use. All that can be neatly placed inside a function for reading files as follows:

In [None]:
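def readfile(path, docid):
    # Sort the matches so that a given docid always refers to the same file
    files = sorted(glob(path))
    # The with-block closes the file even if reading fails
    with open(files[docid], 'r', encoding='latin-1') as f:
        return f.read()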

Try using this function with the `docs/*.txt` pattern for various docid's. So we now know how to open a file and turn its contents into a string. The next step is chopping that string up into tokens.


## Tokenization
Fortunately tokenization is a very simple task in Python because we can use the built-in `split` method for strings. Let's try reading a file into a string, split it and then print the first 20 tokens.

In [None]:
s = readfile('docs/*.txt', 0)
tokens = s.split()
print(tokens[0:20])

We see that the simple `split` method in strings only splits on whitespace characters, leaving punctuation marks (as well as numbers, hyphens etc.) attached, which is why it returned "`sack,`" as a token. A little digging points us in the direction of the Regular Expression module (`re`) and the split method contained therein, which can accept a whole list of delimiter characters in the form of a delimiter string. Let's try this:

In [None]:
# Raw string, so the backslashes reach the regex engine untouched
DELIM = r'[ \n\t0123456789;:.,/()"\'-]+'
tokens = re.split(DELIM, s)
print(tokens[0:20])

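That looks better. Now all we need to do is turn these into lowercase. It's better to do this on the entire string before splitting it. (The delimiter string below also adds `\r`, so that carriage returns are treated as delimiters too.)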

In [None]:
DELIM = r'[ \r\n\t0123456789;:.,/()"\'-]+'
s_lower = s.lower()
tokens = re.split(DELIM, s_lower)
print(tokens[0:20])

All this can be compacted in a single line of Python code and wrapped inside a function:

In [None]:
def tokenize(text):
    return re.split(DELIM, text.lower())
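
As a quick sanity check, here is a minimal, self-contained sketch that runs the same tokenizer on a made-up sentence (the sentence and the names `sample` and `sample_tokens` are only for illustration). Note the empty string at the end of the result, which appears because the sentence ends with a delimiter character.

```python
import re

# Same delimiter class as above, written as a raw string
DELIM = r'[ \r\n\t0123456789;:.,/()"\'-]+'

def tokenize(text):
    return re.split(DELIM, text.lower())

# Made-up sentence, not taken from the dataset
sample = "Dundee United's manager won 2-1, again."
sample_tokens = tokenize(sample)
print(sample_tokens)
# ['dundee', 'united', 's', 'manager', 'won', 'again', '']
```

The apostrophe, the digits and the hyphen are all delimiters, so "United's" becomes two tokens and the score "2-1" disappears altogether.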

## Boolean retrieval - Indexing
In this section we will look at Boolean retrieval, as a warm-up exercise before you tackle ranked retrieval on your own! The first step in a boolean search engine is to build the indexer. This piece of code is responsible for reading all the documents and producing the postings lists which, as we saw in the lectures, are an efficient way of storing the term-document incidence matrix.

A nice data structure for storing postings lists is a Python dictionary, which as we have seen can be indexed by strings. So we want to produce a dictionary whose keys are words found in the documents and whose values are *sets* of docid's of documents that contain those words. So a postings dictionary such as
    
    {'cricket': {2, 3, 5, 7}, 'football': {0, 2, 4}, 'rugby': {1, 2, 6}}
    
would denote that the word *cricket* can be found in documents with id's 2, 3, 5 and 7, and so on. So how do we construct this postings dictionary? Well, let us read and tokenize the file with id 0 once again.

In [None]:
s = readfile('docs/*.txt', 0)
words = tokenize(s)

A side effect of the `re.split` function is that it can return an empty string as the first or last token, if the string starts or ends with one of the delimiter characters. To avoid that we can just remove any empty strings from the returned tokens:

In [None]:
# Keep every token except empty strings (list.remove would drop only the first one)
words = [w for w in words if w != '']

Now starting with an empty dictionary we will add all the words in `words`. If the word is not contained in the dictionary we create a singleton set with the doc-id 0. If the word is already contained in the dictionary we only add 0 to the corresponding set of docid's. This looks as follows: 

In [None]:
postings = {}
for w in words:
    postings.setdefault(w, set()).add(0)
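
The `setdefault` call packs the two cases described above into a single line. The sketch below spells them out with an explicit `if`/`else` and checks that both versions build the same dictionary. The token list `sample_words` is made up for illustration.

```python
# Made-up token list with one repeated word
sample_words = ['mccall', 'earns', 'reprieve', 'mccall']
doc_id = 0

# Explicit version: handle "new word" and "seen word" separately
explicit = {}
for w in sample_words:
    if w not in explicit:
        explicit[w] = {doc_id}      # first occurrence: create a singleton set
    else:
        explicit[w].add(doc_id)     # later occurrence: add to the existing set

# Compact version: setdefault returns the existing set, or inserts an empty one
compact = {}
for w in sample_words:
    compact.setdefault(w, set()).add(doc_id)

print(explicit == compact)  # True
```

Because the values are sets, adding the same docid twice has no effect, so repeated words within a document are harmless.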

Print out the `postings` dictionary to see what it contains. You should get a dictionary with keys equal to all the words contained in our doc and values {0}. 

We can now do this for the whole collection of documents, and encapsulate the whole indexing engine in a function as follows:

In [None]:
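def indextextfiles_BR(path):
    N = len(glob(path))                 # number of documents matching the pattern
    postings = {}
    for docID in range(N):
        s = readfile(path, docID)       # docID indexes the sorted file list
        words = tokenize(s)
        for w in words:
            if w != '':                 # skip empty tokens left by re.split
                postings.setdefault(w, set()).add(docID)
    return postings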
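
To see the shape of the result without touching the file system, here is a small sketch that indexes three made-up strings held in memory. The function `index_texts` and the list `toy_docs` are hypothetical helpers for illustration, not part of the lab code. The position of each string in the list plays the role of the docid.

```python
import re

DELIM = r'[ \r\n\t0123456789;:.,/()"\'-]+'

def index_texts(texts):
    """Hypothetical in-memory variant of the indexer: one string per document."""
    postings = {}
    for doc_id, text in enumerate(texts):
        for w in re.split(DELIM, text.lower()):
            if w != '':                                  # skip empty tokens
                postings.setdefault(w, set()).add(doc_id)
    return postings

# Made-up documents, not taken from the dataset
toy_docs = [
    "England win the cricket.",
    "Football: England suffer defeat",
    "Rugby and cricket news",
]
toy_postings = index_texts(toy_docs)
print(toy_postings['england'])  # {0, 1}
print(toy_postings['cricket'])  # {0, 2}
```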

So to process the entire directory and generate the complete postings dictionary we can execute:

In [None]:
postings = indextextfiles_BR('docs/*.txt')

Let's now use this data structure to find out which documents contain the word '`devastating`'. We just need to execute:

In [None]:
print(postings['devastating'])

which tells us that the word 'devastating' is contained in docs with id's 195, 310 and 55. Neat!
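
One thing to keep in mind is that indexing a dictionary with a key it does not contain raises a `KeyError`, which is what happens for a word that never occurs in the collection. A possible workaround, sketched here on a toy dictionary (`example_postings` is made up), is the dictionary's `get` method with an empty set as the default.

```python
# Toy index, made up for illustration
example_postings = {'cricket': {2, 3, 5, 7}, 'football': {0, 2, 4}}

# A known word returns its postings set
known = example_postings.get('cricket', set())
# An unknown word returns the default (an empty set) instead of raising KeyError
unknown = example_postings.get('snooker', set())

print(known)    # {2, 3, 5, 7}
print(unknown)  # set()
```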

## Processing boolean queries
We are now ready to process a boolean query. We will assume that the user gives a string containing all the query terms separated by spaces and expects all documents that contain all those query terms. Let's begin by assuming the query text '`england football defeat`':

In [None]:
qtext = 'england football defeat'

Let's begin by tokenizing it:

In [None]:
words = tokenize(qtext)
print(words)

To perform boolean retrieval we need to get the postings of each of the three terms and then take their intersection (`&` operator in Python) as follows:

In [None]:
p1 = postings['england']
p2 = postings['football']
p3 = postings['defeat']
print(p1&p2&p3)
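
The same operation can be traced by hand on the small example dictionary from the indexing section, where the answer is easy to verify. This is only an illustration on toy data (the name `example` is made up).

```python
# The example postings dictionary from the indexing section
example = {'cricket': {2, 3, 5, 7}, 'football': {0, 2, 4}, 'rugby': {1, 2, 6}}

# Documents containing both 'cricket' AND 'football'
two_terms = example['cricket'] & example['football']
# Documents containing all three terms
three_terms = example['cricket'] & example['football'] & example['rugby']

print(two_terms)    # {2}
print(three_terms)  # {2}
```

Document 2 is the only id that appears in every set, so it is the only result of the AND query.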

More generally, we can iterate through the list of terms in the query, grab the postings sets for each of them and take their intersection. This can be done inside a function as follows. 

In [None]:
def query_BR(postings, qtext):
    words = tokenize(qtext)
    res = None
    for w in words:
        res = postings[w] if res is None else res & postings[w]
    return res

Notice the check inside the for-loop for the very first time through the loop, where `res` has not been set yet. If you want, you can try a more elegant (or Pythonic, as people call it) way of doing the same thing, using `set.intersection` and the `*` operator, which unpacks a list into the arguments of a function.

In [None]:
def query_BR(postings, qtext):
    words = tokenize(qtext)
    allpostings = [postings[w] for w in words]
    res = set.intersection(*allpostings)
    return res

Let's see if we get the same results:

In [None]:
query_BR(postings,'england football defeat')

Nice!
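
Both versions of `query_BR` will fail with a `KeyError` if a query term is missing from the index, and the `set.intersection(*...)` form also needs at least one term. Below is one possible way to make the query more forgiving, shown on toy data. The name `query_BR_safe` is hypothetical, and returning an empty set for an empty query is a design choice rather than the only sensible behaviour.

```python
import re

DELIM = r'[ \r\n\t0123456789;:.,/()"\'-]+'

def query_BR_safe(postings, qtext):
    """Hypothetical variant of query_BR that tolerates unknown terms."""
    # Tokenize the query and drop empty tokens left by re.split
    words = [w for w in re.split(DELIM, qtext.lower()) if w != '']
    if not words:
        return set()    # empty query: return no documents (a design choice)
    # A term missing from the index contributes an empty set,
    # which makes the whole intersection empty
    return set.intersection(*(postings.get(w, set()) for w in words))

# Toy index, made up for illustration
example = {'cricket': {2, 3, 5, 7}, 'football': {0, 2, 4}, 'rugby': {1, 2, 6}}
print(query_BR_safe(example, 'cricket football'))  # {2}
print(query_BR_safe(example, 'cricket snooker'))   # set()
print(query_BR_safe(example, 'Cricket, rugby.'))   # {2}
```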