# Joogle - CSI4107 Search Engine Project
#### By: Jacob Danovitch

## System Architecture

## Modules - Mandatory

### Corpus Pre-Processing

#### File(s) 

* `uottawa_scrape.py`

#### Description

This module scrapes and parses the uOttawa course catalogue using `requests` and `beautifulsoup4`, saving the data to [`data/catalogue-uottawa-ca.json`](data/catalogue-uottawa-ca.json).

#### Demonstration

First, the HTML is acquired using the `requests` library, and a `BeautifulSoup` object is created.

In [28]:
from uottawa_scrape import *

html = scrape(to_file=False)
print(html.find("div", {"class": "courseblock"}).text)


CSI 1306 Computing Concepts for Business (3 units)

Introduction to computer-based problem solving from the perspective of the business world. Design of algorithms for solving business problems. Basics of computer programming in a modern programming language. Solving business problems using application packages including spreadsheets and databases. Basics of web design. Collaborative tools. Using open source software.Course Component: Laboratory, LectureThe courses ITI 1120, GNG 1106, CSI 1306, CSI 1308, CSI 1390 cannot be combined for credits.


The HTML is then parsed using `beautifulsoup4`, such that it is stored in the following format.

In [3]:
output = to_json(html, filename=None)
print(output[0])

{'id': 0, 'title': 'CSI 1306 Computing Concepts for Business (3 units)', 'body': 'Introduction to computer-based problem solving from the perspective of the business world. Design of algorithms for solving business problems. Basics of computer programming in a modern programming language. Solving business problems using application packages including spreadsheets and databases. Basics of web design. Collaborative tools. Using open source software.'}


This format makes the data easy to manipulate and render in the other modules.

<hr/>

### User Interface

#### File(s) 

* `app.py`
* `templates/`
    * `static/`
        * `css/`
        * `js/`
        * `img/`
    * `index.html`
    * `layouts.html`
    * `about.html`
    * `404.html`


#### Description

This module is implemented as a small Flask app. The UI is a parody of Google (see acknowledgements section below), allowing users to search the catalogue using either retrieval module.

#### Demonstration

See [Demo](#Demo) below for a full demonstration.

<hr/>

### Dictionary building

#### File(s) 

* `build_dictionary.py`
* `construct_index.py`
* `retrieval_model.py`

#### Description

This module contains utilities used to build the dictionary for the retrieval models. The actual construction of the dictionary takes place in `retrieval_model.py`.

#### Demonstration

Note that the optional [phrase query indexing](#Phrase-Query-Indexing) module also takes part in the dictionary building process. It is covered in more detail in its own section below.

First, we load the data created by the scraping module seen above. We narrow the data to the descriptions only (the titles were **not** made part of the dictionary).

In [4]:
import json

data = json.load(open("data/catalogue-uottawa-ca.json"))
descriptions = [row['body'] for row in data]
descriptions[0]

'Introduction to computer-based problem solving from the perspective of the business world. Design of algorithms for solving business problems. Basics of computer programming in a modern programming language. Solving business problems using application packages including spreadsheets and databases. Basics of web design. Collaborative tools. Using open source software.'

These descriptions are then passed to the `clean` function from `build_dictionary.py`, which performs case folding, normalization, lemmatization, stopword removal, and tokenization.

In [5]:
from build_dictionary import clean
print(clean(descriptions[0]))

{'spreadsheet', 'web', 'database', 'package', 'using', 'language', 'source', 'problem', 'based', 'application', 'algorithm', 'business', 'basic', 'modern', 'programming', 'perspective', 'design', 'including', 'tool', 'computer', 'collaborative', 'software', 'solving', 'world', 'open', 'introduction'}

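As a rough illustration of this kind of pipeline, the sketch below lowercases the text, strips punctuation, tokenizes, and removes stopwords using only the standard library. The function `simple_clean` and its tiny stopword list are hypothetical stand-ins, not the contents of `build_dictionary.py`, and lemmatization is deliberately left out, so plural forms survive here.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption for illustration only).
STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "the", "to"}

def simple_clean(text):
    """Lowercase, strip punctuation, tokenize, and drop stopwords."""
    # Lowercasing plus an alphanumeric regex handles casing, punctuation, and tokenization.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Returning a set mirrors the set-valued output shown above.
    return {t for t in tokens if t not in STOPWORDS}

print(simple_clean("Design of algorithms for solving business problems."))
```

A real lemmatizer would additionally map `algorithms` to `algorithm` and `problems` to `problem`, as in the output above.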

<hr/>

### Inverted Index Construction

#### File(s) 

* `construct_index.py`

#### Description

#### Demonstration

The descriptions above are passed to the `build_postings` function, which returns both the index and the term dictionary. The index is constructed first, as follows:

In [6]:
from construct_index import build_postings

index, term_dict = build_postings(descriptions)
print(index[0])

(0, ['algorithm', 'modern', 'package', 'programming', 'web', 'perspective', 'design', 'including', 'basic', 'using', 'tool', 'computer', 'language', 'collaborative', 'software', 'source', 'solving', 'problem', 'based', 'spreadsheet', 'application', 'database', 'world', 'open', 'introduction', 'business', 'problem'])


Then, the term dictionary is constructed from the index. Because it is a hash-based dictionary, looking up a term is completed in **constant** $\mathcal{O}(1)$ time on average.

Its construction is completed in $\mathcal{O}(n \cdot (m+1))$ time, where $n$ is the number of documents and $m$ is the number of words in the longest document in the corpus. The $n$ documents contained in the `.json` data are iterated over once to construct the index (as seen above) containing $n$ rows. Then, each of the $n$ rows of the index is iterated over to construct the term dictionary. At each row, at most $m$ words are iterated over, incrementally updating the count associated with each word. Pseudocode is presented below.

```python
for doc_id, words in index:
    for word in words:
        # Keyed by term first, then document: term_dict[word] maps doc_id -> count
        term_dict[word][doc_id] += 1
```

The outer loop performs $n$ iterations of at most $m$ steps each, for $\mathcal{O}(n\cdot m)$ time. Combined with the $n$ iterations of the index construction, we have $\mathcal{O}(n+n\cdot m) = \mathcal{O}(n\cdot(m+1))$ time, which simplifies to $\mathcal{O}(n\cdot m)$.

Finally, both are returned.

**Note**: As an implementation decision, the weights are calculated later, in the `VSM` model itself. This simply made for a cleaner and more organized project structure. The weights could have been included here otherwise.

In [7]:
print(term_dict['algorithm'])

{0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 9: 2, 17: 1, 21: 1, 31: 2, 34: 1, 36: 1, 38: 1, 48: 1, 50: 1, 51: 1, 55: 1, 66: 1, 71: 1, 73: 1, 74: 1, 75: 1, 76: 1, 78: 1, 84: 1, 86: 1, 87: 1, 92: 1}

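The pseudocode above can be turned into a small runnable sketch. The helper name `build_term_dict` and the toy index are assumptions made for illustration; the real logic lives in `construct_index.py` and may differ in detail.

```python
from collections import defaultdict

def build_term_dict(index):
    """Map each term to a {doc_id: count} dictionary."""
    # Nested defaultdicts let us increment counts without checking for missing keys.
    term_dict = defaultdict(lambda: defaultdict(int))
    for doc_id, words in index:      # n rows
        for word in words:           # at most m words per row
            term_dict[word][doc_id] += 1
    return term_dict

# Toy index in the same (doc_id, words) shape as the output of build_postings.
toy_index = [
    (0, ["algorithm", "design", "algorithm"]),
    (1, ["design", "database"]),
]
toy_terms = build_term_dict(toy_index)
print(dict(toy_terms["algorithm"]))
print(dict(toy_terms["design"]))
```

Keying by term first is what makes a lookup such as `term_dict['algorithm']` return the per-document counts directly.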

<hr/>

### Corpus Access

#### File(s) 

* same as in [User Interface](#User-Interface)

#### Description

This module is implemented within the user interface, seen within the search results.

#### Demonstration

See [Demo](#Demo) below for a full demonstration.

<hr/>

### Boolean model of information retrieval

#### File(s) 

* `brm.py`

#### Description

The `BRM` (Boolean Retrieval Model) class operates as a model to query the corpus using boolean searches.

#### Demonstration

The `boolean.py` library is used for symbolic manipulation. The query is parsed as a logical expression, and then each literal word of the query is cleaned using the dictionary building module described above.

In [8]:
from brm import *

b = BRM()
q = "(computer OR systems) AND (data)"

b.preprocess_query(q)

OR(Symbol('computer'), AND(Symbol('systems'), Symbol('data')))

For each document, the truth value of each symbol is evaluated as the presence of the word in the document, which resolves the expression to true or false. Each of the $n$ documents is evaluated once, for a query complexity of $\mathcal{O}(n)$.

As the `BRM` is unranked, the `top_n` argument will simply return the first `n` documents found and is only useful for demonstration purposes (as in this instance) or for the UI to paginate the results (not yet implemented).

In [9]:
b.query(q, top_n=5)

Unnamed: 0,title,body
0,CSI 1306 Computing Concepts for Business (3 un...,Introduction to computer-based problem solving...
1,CSI 1308 Introduction to Computing Concepts (3...,Introduction to computer based problem solving...
2,CSI 1390 Introduction to Computers (3 units),Computing and computers. Problem solving and a...
3,CSI 2101 Discrete Structures (3 units),Discrete structures as they apply to computer ...
21,CSI 4109 Introduction to Distributed Computing...,Computational models. Communication complexity...

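The per-document evaluation described above can be sketched without any external library. Here the parsed query is represented as nested tuples rather than `boolean.py` expression objects; that representation, the function `evaluate`, and the toy documents are all assumptions made for illustration.

```python
def evaluate(expr, tokens):
    """Evaluate a nested-tuple boolean expression against a document's token set."""
    if isinstance(expr, str):
        # A bare word is true exactly when it occurs in the document.
        return expr in tokens
    op, *args = expr
    if op == "AND":
        return all(evaluate(a, tokens) for a in args)
    if op == "OR":
        return any(evaluate(a, tokens) for a in args)
    if op == "NOT":
        return not evaluate(args[0], tokens)
    raise ValueError(f"unknown operator: {op}")

# Toy corpus: doc_id -> set of cleaned tokens (hypothetical data).
toy_docs = {0: {"computer", "data"}, 1: {"system"}, 2: {"system", "data"}}
toy_query = ("AND", ("OR", "computer", "system"), "data")

# One pass over the n documents, as in the O(n) argument above.
matches = [doc_id for doc_id, tokens in toy_docs.items() if evaluate(toy_query, tokens)]
print(matches)
```

Since the result is unranked, the documents simply come back in corpus order.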

<hr/>

### Vector Space Model of information retrieval

#### File(s) 

* `vsm.py`

#### Description

The `VSM` (Vector Space Model) class operates as a model to query the corpus using $\mathtt{tf-idf}$ retrieval. 

#### Demonstration

The $\mathtt{tf-idf}$ scores are computed for each term-document pair $(w_i, d_j)$ upon construction of the model.

In [10]:
from vsm import *

v = VSM()
v.d_w.T.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,100,101
algorithm,0.577236,0.577236,0.577236,0.577236,0.577236,0.0,0.0,0.0,0.0,1.154473,...,0.577236,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
modern,1.70757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
package,1.70757,1.70757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
programming,0.729847,0.729847,0.729847,0.0,0.0,0.729847,0.0,0.729847,0.0,0.729847,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
web,1.054358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.054358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

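The weights in the table above are consistent with raw term frequency multiplied by $\log_{10}(N/\mathit{df})$, where $N$ is the number of documents and $\mathit{df}$ is the number of documents containing the term. This formula is inferred from the numbers shown, not taken from `vsm.py`, so treat the sketch below as an assumption; the function name `tf_idf` is hypothetical.

```python
import math

def tf_idf(tf, df, n_docs):
    """Raw term frequency times log10 inverse document frequency (assumed weighting)."""
    return tf * math.log10(n_docs / df)

# 'algorithm' occurs in 27 of the 102 documents (see the term dictionary entry above).
print(round(tf_idf(1, 27, 102), 6))  # a document where it occurs once, e.g. document 0
print(round(tf_idf(2, 27, 102), 6))  # a document where it occurs twice, e.g. document 9
```

Under this weighting, a term that appears in every document gets a weight of zero, so very common words contribute nothing to the ranking.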

At runtime, the query is pre-processed in the same fashion as the documents were processed originally.

In [11]:
q = "operating systems"
q = clean(q)
q

{'operating', 'system'}

Only the columns for the query terms are selected from the weight matrix. As the matrix is a $2$-d array, this access takes place in $\mathcal{O}(w)$ time, where $w$ is the number of words in the processed query.

This is mathematically equivalent to multiplying the weight matrix by a one-hot vector for each query word, but it is clearly more space-efficient to simply slice out the columns in question.

In [12]:
v.d_w.loc[:, q].T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,92,93,94,95,96,97,98,99,100,101
operating,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
system,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The transposed slice is then summed along each column (each column corresponding to a document in the corpus), and the sums are sorted in descending order. The resulting ranking is then matched back to our original data set and used to return the relevant documents to the user.

In [13]:
v.d_w.loc[:, q].T.apply(sum).sort_values(ascending=False)[:5]

30    2.593794
12    2.593794
88    2.593794
82    2.000167
89    1.187254
dtype: float64

The "confidence" score is obtained by passing the summed $\mathtt{tf-idf}$ score through the sigmoid function (the s-curve), which squashes all values into the interval $(0,1)$, saturating towards the ends for extreme scores. As of writing, this is not currently displayed to the user, but could be in the future.

In [14]:
v.query("operating systems", top_n=10)

Unnamed: 0_level_0,title,body,confidence
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
30,CSI 4139 Design of Secure Computer Systems (3 ...,Security policies. Security mechanisms. Physic...,0.930461
12,CSI 3131 Operating Systems (3 units),Principles of operating systems. Operating sys...,0.930461
88,CSI 5308 Principles of Distributed Computing (...,Design issues of advanced multiprocessor distr...,0.930461
82,CSI 5174 Validation Methods for Distributed Sy...,Wireless networks support for m-commerce; m-co...,0.880815
89,CSI 5311 Distributed Databases and Transaction...,Issues in modeling and verifying quality and v...,0.76625
55,CSI 5131 Parallel Algorithms and Applications ...,Hardware and software techniques for fault tol...,0.76625
32,CSI 4141 Real Time Systems Design (3 units),Definition of real-time systems; examples. C...,0.76625
87,CSI 5200 Projects on Selected Topics (3 units),Principles involved in the design and implemen...,0.76625
24,CSI 4124 Foundation of Modelling and Simulatio...,The modelling and simulation process from a pr...,0.76625
61,CSI 5139 Selected Topics in Computer Applicati...,Selected topics in Computer Systems (Category ...,0.644197

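The ranking and confidence steps can be sketched with plain dictionaries instead of a pandas weight matrix. The function names `sigmoid` and `rank` and the toy weights are assumptions for illustration; only the sigmoid check at the end uses a score taken from the output above.

```python
import math

def sigmoid(x):
    """Standard logistic function, mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def rank(weights, query_terms, top_n=5):
    """weights: {term: {doc_id: tf-idf weight}}. Returns (doc_id, score, confidence)."""
    scores = {}
    for term in query_terms:
        # Sum the weights of every query term per document.
        for doc_id, w in weights.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + w
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return [(doc_id, score, sigmoid(score)) for doc_id, score in ranked]

# Hypothetical weights, not the real corpus values.
toy_weights = {"operating": {12: 1.5, 30: 0.5}, "system": {12: 1.0, 88: 0.75}}
top = rank(toy_weights, ["operating", "system"], top_n=2)
print(top)

# A summed score of 2.593794 corresponds to the confidence 0.930461 shown above.
print(round(sigmoid(2.593794), 6))
```

Because the sigmoid is monotonic, sorting by confidence gives the same order as sorting by the raw summed score.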

<hr/>

## Modules - Optional

### Phrase Query Indexing

#### File(s) 

* `phrase_indexing.py`


#### Description

This module uses the Jaccard coefficient to identify phrases from a candidate set (identified as hyphenated words), as seen in class.

#### Demonstration

First, the candidate set is assembled as all hyphenated words in the corpus. 

In [15]:
from phrase_indexing import *

candidates = identify_candidates(descriptions)
candidates[:5]

[('computer', 'based'),
 ('object', 'oriented'),
 ('binary', 'trees'),
 ('object', 'oriented'),
 ('constraint', 'based')]

Then, bigrams of the corpus text are constructed, for calculation of Jaccard coefficients.

In [16]:
corpus = [remove_punc(t.lower(), rm_hyphens=True) for t in descriptions]
bigrams = make_bigrams(corpus)

print(find_phrases(candidates, bigrams, threshold=0.7))

{'entity-relationship': 1.0, 'trap-door': 1.0, 'grey-box': 1.0, 'well-separated': 1.0, 'fail-safe': 1.0, 'locality-sensitive': 1.0, 'large-scale': 1.0}


Of course, the threshold can be tuned as well.

In [17]:
print(find_phrases(candidates, bigrams, threshold=0.5))

{'entity-relationship': 1.0, 'trap-door': 1.0, 'lambda-calculus': 0.6666666666666666, 'grey-box': 1.0, 'well-separated': 1.0, 'fail-safe': 1.0, 'locality-sensitive': 1.0, 'large-scale': 1.0, 'fault-tolerance': 0.5555555555555556}

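One common formulation of the Jaccard coefficient for a candidate pair $(a, b)$ is the number of times the bigram "a b" occurs, divided by the number of occurrences of $a$ or $b$ overall. The sketch below implements that formulation on a toy corpus; the function `jaccard_phrases` and the sample data are hypothetical and may not match `phrase_indexing.py` exactly.

```python
from collections import Counter

def jaccard_phrases(docs, candidates, threshold=0.7):
    """Score candidate word pairs by bigram count over union of unigram counts."""
    unigrams, bigrams = Counter(), Counter()
    for doc in docs:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    phrases = {}
    for a, b in set(candidates):  # de-duplicate repeated candidates
        both = bigrams[(a, b)]
        union = unigrams[a] + unigrams[b] - both
        score = both / union if union else 0.0
        if score >= threshold:
            phrases[f"{a}-{b}"] = score
    return phrases

# Toy corpus with hyphens already replaced by spaces (assumed preprocessing).
toy_corpus = ["fail safe systems", "fail safe design", "object oriented design", "object models"]
toy_candidates = [("fail", "safe"), ("object", "oriented"), ("fail", "safe")]
print(jaccard_phrases(toy_corpus, toy_candidates))
```

Here "fail" and "safe" only ever appear together and score $1.0$, while "object" also appears on its own, so "object-oriented" scores $0.5$ and is only kept if the threshold is lowered.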

These phrases are then stored for later use in preprocessing the corpus and user queries.

<hr/>

### Spelling Correction

#### File(s) 

* `spelling.py`

#### Description

This module performs spelling corrections on user queries. However, as this is an extra module (working alone, I only have to implement one), I took some liberties with my implementation and elected to use a character-gram model instead of minimum edit distance (mostly out of curiosity).

#### Demonstration

First, as usual, the query is pre-processed (note: this happens slightly differently for the `BRM`).

In [18]:
from spelling import *

q = "data maagement"
q = clean(q)
q

{'data', 'maagement'}

Then, for each word in the query, if the word is not in the vocabulary, the spell check is applied. This is performed by computing the ratio of shared character-bigrams between the words. If the word is in the vocabulary, it is simply returned with $100\%$ confidence.

Like minimum edit distance, this approach is not context-aware: it cannot correct a misspelling that happens to be another valid word in the vocabulary.

In [19]:
for w in q:
    top_candidate, conf = spell_check(w, b.build_vocab())[0]
    print(f"{w} => {top_candidate} ({conf*100}%)")

data => data (100%)
maagement => management (87.5%)


The threshold above which candidates are returned can also be adjusted from its default value of $0.75$.

In [20]:
print(spell_check("maagement", b.build_vocab(), threshold=0.5))

[('management', 0.875), ('rearrangement', 0.625)]

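The scores above are consistent with dividing the number of character bigrams shared by the two words by the number of bigrams in the query word. That reading is inferred from the output rather than from `spelling.py`, so the sketch below is an assumption; `char_bigrams`, `bigram_overlap`, and `spell_check_sketch` are hypothetical names.

```python
def char_bigrams(word):
    """Set of adjacent character pairs, e.g. 'data' -> {'da', 'at', 'ta'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bigram_overlap(word, candidate):
    """Fraction of the query word's bigrams that also occur in the candidate."""
    grams = char_bigrams(word)
    if not grams:
        return 0.0
    return len(grams & char_bigrams(candidate)) / len(grams)

def spell_check_sketch(word, vocab, threshold=0.75):
    # Words already in the vocabulary are returned unchanged with full confidence.
    if word in vocab:
        return [(word, 1.0)]
    scored = [(c, bigram_overlap(word, c)) for c in vocab]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

toy_vocab = ["data", "management", "rearrangement"]
print(spell_check_sketch("maagement", toy_vocab, threshold=0.5))
```

With this toy vocabulary, "maagement" shares 7 of its 8 bigrams with "management" and 5 of 8 with "rearrangement", matching the $0.875$ and $0.625$ reported above.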

<hr/>

## Demo

## Acknowledgements