Skip to content

pikulet/intellex

Repository files navigation

This is the README file for the submission of:
SOH JASON
CHOW KENGJI
JOYCE YEO SHUHUI
TAN YAN ZHEN, NATALIE

== Python Version ==

We're using Python Version <3.6> for this assignment.

== General Notes about this assignment ==

# Analysis of the Corpus

A short corpus analysis is done to understand the data we are working with:
https://notebooks.azure.com/jason-soh/projects/homework4/html/index.ipynb
There are a total of 17137 documents in the corpus, and 31 types of courts in the court field.
We found that all the fields were populated.

We identified some cases of duplicate document IDs, which does not make sense in the context of document
retrieval (see duplicate_docs file). We analysed these duplicate entries and found that these entries would have the
title and content being repeated, and the only field changed would be the metadata on the court.

Such occurrences make sense in real life, since legal cases can be transferred between courts depending on their
severity. To treat the case of a duplicate document, we simply update the court and keep the case with
the highest priority. While this may compromise on accuracy, it is almost impossible to determine if the
case went from a high-priority court to a low-priority court or vice versa. We believe that using the
highest priority reflects the importance of a case better. We also update the total number of documents
(subtract 1), which is used in the calculation on idf on query terms.

We also found some Chinese documents from HK High Court and HK Court of First Instance (duplicated
data). More specifically, this was in document 2044863. We did not add any special processing methods
because the case is isolated and it is unlikely that the user wants to look something up from the document.
There are also other unicode characters that are not recognised. These can be easily resolved with utf-8 encoding.

# System Architecture

We divided our system into three main components: Indexing, Helper Modules, Searching.

The indexing modules are used for the indexing algorithm.
The helper modules are used for BOTH indexing and searching. (File reading and writing,
shared term normalisation methods, shared constants)
The searching modules work with document ranking and query expansion.

The architecture can be visualised in diagram.png

# Indexing

We first tokenise the documents using the nltk tokenisers with stemming and case folding as in previous homeworks.
Our aim is to model the baseline tf-idf method as closely as possible. The increased features would mostly come
from the searching stage, where we perform query expansion.

## Parallelising NLTK tokenisation
When running our initial indexing on the Tembusu cluster, we found that the indexing time was infeasible
to work with. Given a lot of changes we wanted to make in terms of storing biwords, which document
properties, we wanted to speed up the indexing time. We timed the different parts of the indexing and
found the NLTK tokenisation to take a significant portion of the time.

In HW3, the program would perform nltk(doc1), then nltk(doc2), then nltk(doc3). Given that the documents
are fairly independent, we do not actually have to wait for nltk(doc1) to complete before starting on
nltk(doc2). Instead of doing the tokenisation sequentially, we realised we could read in all the data,
pass all the documents to be tokenised in parallel, then process the tokenised data sequentially.

(done in parallel): doc1, doc2, doc3... --> nltk(doc1), nltk(doc2), nltk(doc3)...
(sequential): process(tokenised_doc1), process(tokenised_doc2), process(tokenised_doc3)...

We still have to process the documents sequentially because we wanted the posting lists to be sorted by
docID for quicker merging for AND queries. If we processed the docIDs non-sequentially, we would have to
do a sort operation on all the final posting lists, which is time-consuming.

## Storing entire document vectors

We processed the document content body and title body separately, so we have two sets of a dictionary and
posting list, one for the content+title body and one for the title body. We explain our choice of including
the title metadata in ### METADATA - TITLE below.

In the processing, we store the dictionary and posting list data as in HW3. However, we also store the actual document
uniword vector. This vector maps the count of every term in the document.

For example, doc1 { "egg" : 1, "cat" : 3, "the" : 4 }

We subsequently apply log-tf on the counts. Previously in HW3, we would just store the length of this vector as the
document length. Now,  we also store the entire vector because we need the vector for relevance feedback
(see section on Rocchio Algorithm).

While storing the entire vector takes up a lot of space (370MB), we chose to make this tradeoff. The
alternative was to run through the entire posting list to reconstruct the document vector during searching
time. To reconstruct doc1 above, we run through the posting lists of ALL terms, and record which terms contain
doc1. Considering that this process has to be repeated for all the documents involved in the relevance
feedback, it would take up a lot of time in searching, compromising on the search efficiency.Of course, the next
alternative is to forego Rocchio expansion altogether, but we still kept it for experimentation purposes.

## Processing document metadata

DOCUMENT METADATA - TITLE
When considering how to process the document metadata, we saw that the title was mostly a repetition of
the first line of the content. That is, there is no additional data in the title. This means that the
title was meant to separate parts of the content that were deemed more important. We thus experimented with
giving a higher weight to the title field later on.

However, it is likely that the title is too short and contains too many numbers and case references. When
we experimented with weighing the title and content as two separate zones of the document, we did not get
good results, often performing below the baseline. Here are the results when we assigned equal weightage
to both TITLE and CONTENT separately:

# result.1
Average F: 0.0108201093105916
# result.2
Average F: 0.217724508769285
# result.3
Average F: 0.00957123821040243
# result.4
Average F: 0.496296296296296
# result.5.txt content:
Average F: 0.0824991658324992
# result.6.txt content:
Average F: 0.21474358974359

Note that the baseline tf-idf assigns weight 0 to TITLE and weight 1 to CONTENT.

As such, we decided to approach the baseline as closely as possible. We hypothesised that the baseline model
merged both the title and content together, so we did just that. We had one set of processed data for TITLE
only, one set of processed data for TITLE + CONTENT. We assigned 0 weight to TITLE and weight 1 to TITLE + CONTENT.

DOCUMENT METADATA - COURT
Without good prior knowledge on court orders and their nuances, we assigned courts a priority between 1 to 3,
with 1 being the high priority courts. We used the court ordering provided.

DOCUMENT METADATA - DATE POSTED
When processing the date posted, we interpreted that data as being a metadata of the document. That is,
the date posted has no actual content that the user will look up. Users are only likely to search for the
hearing or incident dates. We used the date posted then, as an indicator of how recent the case was. More
specifically, we recorded the seconds from indexing time as a proxy. Nevertheless, the date posted is
unlikely to be an important distinguishing factor. After some experiments, we omitted this result from the
document ranking, though we still kept the data.

## Storing biword and triword information

Later on in the searching, we want to add phrasal biwords and triwords (marked in quotation marks " ")
to the VSM as an additional dimension. These dimensions can be weighted differently, since having a
phrase would be more meaningful than just containing a uniword.

When iterating through the documents, we also collect information on the biword and triword counts, to
produce the biword and triword vectors. Unfortunately, it would be unfeasible to store all the data for
searching. Given a corpus size of 700MB, the uniword posting list and dictionary already takes up about
641MB of space. Storing the biword and triword information as separate indices would not be good in terms
of space usage. As such, we took to using positional indexing to store data on biwords and triwords. That
is, to find the term frequency of "fertility treatment" in document 1, we run a positional merge on the
posting list of "fertiltiy" and "treatment" for document 1.

However, we need information on vector length for the VSM scores. It is not efficient to calculate the
document length at search time for the same reason why we store the entire document vector for the Rocchio
expansion. As such, we collect information on the biword and triword vectors of a document at indexing
time. Every document has a count for all its biword and triword terms. We use the log-tf on the terms,
similar to the uniword vector, then saved the length of the biword and triword vectors.

At searching time, we retrieve term frequencies of phrases using positional merge, and lookup on the
document lengths for the biword and triword vectors. For the queries, we count the document frequencies of
phrases by compiling the results of the positional merge (combining posting list of "fertility" and
"treatment"). The idf is then calculated using this data.

We found this method to be the best tradeoff between space and time. The operations which are inefficient
and take up little space (document length for biword and triword vectors) are done at indexing time and
stored. The operations which can be done quickly, and take up too much space when stored (term frequencies
and document frequencies) are done at searching time.

## Summary of index content

As mentioned above, we have two set of index contents, one for the TITLE ONLY and one for TITLE + CONTENT.
We re-specify this in the "files included" section. In the dictionary, each entry contains a term, the document
frequency and a pointer to the postings list in the postings file. In the postings file, each posting list is a
dictionary mapping docIDs to a list [term frequency, list_of_positions].

dictionary_title.txt - dictionary of terms that appear in all the titles
postings_title.txt - posting lists for terms that appear in all the titles

dictionary.txt - dictionary of terms that appear in all the title + content combined
postings.txt - posting lists for terms that appear in all the title + content combined

postings_vector.txt - storing all the actual document vectors.

We also keep track of document properties and metadata, such as the document length and court.

properties.txt - dictionary mapping docID to document properties

The byte offset needed for unpickling is stored as a document property.

A summary of the document properties we kept track of:
- CONTENT_LENGTH: vector length for document vector from TITLE + CONTENT
- TITLE_LENGTH: vector length for document vector from TITLE only
- COURT_PRIORITY: priority of document court from 1 (highest priority) to 3 (lowest)
- DATE_POSTED: an indicator of how recent the document was posted (in seconds from indexing time)
- VECTOR_OFFSET: the byte offset for pickle to load the document content + title vector from postings_vector.txt
- BIGRAM_CONTENT_LENGTH: length of document TITLE + CONTENT biword index, for normalisation of triword terms
- TRIGRAM_CONTENT_LENGTH: length of document TITLE + CONTENT triword index, for normalisation of triword terms
- BIGRAM_TITLE_LENGTH: length of document TITLE only biword index, for normalisation of triword terms
- TRIGRAM_TITLE_LENGTH: length of document TITLE only triword index, for normalisation of triword terms

# Helper Modules

We have three helper modules that are consistently used across indexing and searching.

## data_helper.py

The data helper contains shared methods, and direct file reading and writing.
Shared methods includes NLTK tokenisation methods, since we want to apply consistent tokenisation for both the
document and query.

The direct file reading and writing is handled by pickle, but we abstract everything here so we could experiment
with different serialisation modules (JSON, XML). We wanted to experiment with this aspect because our initial
searching was very slow and inefficient. We did online research (see references) on serialisation methods, and
tested them out. In the end, our results still directed us back to pickle.

## properties_helper.py

The properties helper module manages document properties. We store the constants needed to access the specific
document properties here. Having this module was useful in our experimentation, because we could add more document
metadata easily. Document properties have no actual relevance to document retrieval, and they mostly help with the
elative ranking of the retrieved documents, and for relevance feedback.

For example:
(1) lengths are used in normalisation of scores
(2) court and data metadata can be weighed in the scores
(3) document vectors are used for relevance feedback

## constants.py

(1) Test files
This file contains the test files we were working with. For example, we only index the first 100 entries for most
of the testing phase for indexing. We can run this indexing locally, and the indexing of the full 17 000 entries on
Tembusu.

(2) Intermediate files
As shown above, we have a lot of intermediate files (that are not dictionary.txt and postings.txt).
These file names are stored here to be written to at indexing time and read from at searching time.

(3) Searching Parameters
To facilitate our experimentation, we tried a lot of settings for the search. These settings are encapsulated here,
for example we can set weights and ranking orders.

# Searching

## Overall approach

At the top level of searching, query expansion is first done to the original query string to produce
multiple versions of the same query (see section on query expansion). Every query can one of the following four types:

1. +phrase, + boolean: e.g. "fertility treatment" AND damages
2. +phrase, -boolean: e.g. "fertility treatment" damages
3. -phrase, +boolean: e.g. fertility AND treatment AND damages
4. -phrase, -boolean: e.g. fertility treatment damages (basic free text query without phrases or boolean operator)

For an original query string with both phrases and the "AND" boolean operator, query expansion can allow
us to relax these restrictions in order to produce the other 3 combinations. When the original query
either does not have phrases or the boolean operator, it can still be relaxed to the free text query.
For a maximally complex query of type 1 (including boolean operator and phrases), the orders of search between
these four types of queries can be permuted and experimented with to determine the importance of preserving
the additional information of phrases and boolean operators.

Limited experimentation did not show any significant benefit from these permutations.

Q1 Average F2: 0.0318471337579618
Q2 Average F2: 0.276839007986549
Q3 Average F2: 0.00854730742939909
Q4 Average F2: 0.516624579124579
Q5 Average F2: 0.103200491131526
Q6 Average F2: 0.172659817351598
Mean Average F2: 0.184953056130269

While Q1 and Q4 performed better than the baseline, the average results decreased. This suggests that the baseline
search would give us better results, given our limited understanding of legal corpora.
 Nevertheless, we note that taking into account additional restrictions has positive effect on
one of the sample queries for which information is available: "fertility treatment" AND damages.

The final submission also includes results from WordNet expansion.
This is because the sample query q1 (quiet phone call) gives evidence that it is important to retrieve synonyms for the
query terms (e.g. 'silent', 'telephone call'). Hence, a free text query string is searched, followed by
appending additional documents from WordNet expansion.

However, when the WordNet expansion does not return enough terms, then the term is rare and we will perform a
Rocchio expansion to retrieve more documents and improve recall. Note that recall is favoured in an F2 system, so
we decided to do Rocchio expansion. We only do it in the case of a weak WordNet expansion so as not to compromise
on our system's precision.

The final order of documents we return is:
1. POSITIVE LIST** (The sample queries on the competition framework did not seem to have a positive list)
2. BASELINE TF-IDF SEARCH
3. WORDNET EXPANSION WITHOUT BOOLEAN SEARCH
4. if the wordnet expansion returns fewer than 2 terms (the term is rare), then we perform Rocchio expansion

Within each category, documents are ranked by their VSM scores. Documents are ranked in the highest category they fulfil.
A more detailed explanation of our experimentation is in BONUS.docx.

The remaining of this document describes the implementation of the complete system, which was not used in our final model
after extensive testing did not yield good results.

## Search system

Before any query is processed by the Vector Space Model (VSM) evaluation class Eval, it is parsed into a
list where each item is either a single term or a list of terms, which represents a phrase. The terms are
also normalised with case folding and stemming. The dictionary which includes the document frequencies and
pointers to the postings lists of each term in the postings file is read into memory. The document
properties dictionary which maps docIDs to the document vector lengths for normalisation is also read into
memory.

The simplest search is done on free text queries without phrases or boolean operators (-phrase, -boolean).
For such a query, no intersection of posting lists is required. The postings lists for each term in the
query is retrieved and passed to the Eval class for evaluation according to the VSM. For queries with
biword or triword phrases, the postings list and term frequency of the phrase have to be retrieved using
the algorithm described in PostionalMerge. This merge returns all the documents in which the phrase occurs
by merging the positional postings lists of the individual terms that make up the phrase. The term
frequency of the phrase in each document is found, and the length of the postings list also represents the
document frequency. The term and its document frequency is thus added to the dictionary.

Where there are phrase queries, single words, biword, and triword phrases are evaluated as separate query
vectors. This is done because the assumption of independence between terms within the same vector might
not hold if phrasal and single terms are included within the same query vector. Furthermore, evaluating
them as separate vectors allows us to avoid recomputing a normalisation factor i.e. the document vector
length whenever there is a phrase query. Instead, the document properties files maps each document to the
vector length for single words, biwords and triwords, which can be used to normalise each vector.
Therefore, each of the three query vectors (where there are single word, biwords and triwords) are
evaluated separated using the VSM, and the final cosine scores for each document are then combined into a
combined dictionary mapping documents to cosine scores before retrieving the most highly weighted
documents.

In other words, for any document, the final cosine score is calculated as:
a*(score from single term query vector) + b*(score from biword query vector) + c*(score from triword query
vector). The weights a, b and c can in principle be tuned through experimental findings. However, due to
lack of data we have decided to assign an equal weight for a, b and c.

If the query is a boolean query with the operator "AND", the posting lists for each term are reduced such
that they contain only documents that are shared between all postings lists. To do this, the merge
algorithm in BooleanMerge is used (see below). After the reduced postings lists are found, evaluation
proceeds as in a non-boolean query using the VSM evaluation. This is ensure that even though a strict
intersection of all terms is enforced, the documents can still be ranked. The merging essentially retrieves
a list of documents we want to consider. The VSM then tells us the relative scores of these documents
that have been retrieved.

In addition, given that title and content were indexed separately, it is possible to run the same query twice
to derive the cosine scores from searching the title and content fields, which are then combined using a linearly
weighted function of the form:
a*(score from title search) + b*(score from content search)
This functionality was implemented, but due to lack of training data, it was not possible to learn the appropriate
weights that should be assigned to each field. One experiment was done on assigning an equal weight, which performed
worse than a simple tf-idf baseline and was hence omitted.

## Vector Space Model evaluation

The VSM evaluation follows the lnc.ltc ranking scheme, such that we compute tf-idf for the query, but only
log(tf) for the documents. To evaluate each query, a list of (term, term frequency) tuples is created from
the query. Terms which are not found in the dictionary are ignored. This list of tuples is then converted
into a truncated query vector. Each component of the vector is multiplied by the idf of the term.

As in HW3, terms which do not appear in the query vector can be ignored since their tf-idf is 0 and have
no effect on the final cosine score, and only the documents that appear in at least one of the postings
lists of the query terms have to be considered. Thus, we iterate through each postings list of the query
terms to obtain the relevant components of each document vector. The term frequency in each posting is
used to compute (1+log(tf)) directly, which is multiplied with the already computed tf-idf score in the
query vector and added directly to the cosine score for that document.

The cosine scores are stored in dictionary mapping documents to cosine scores, which are then normalised
using the precomputed document vector lengths stored in the normalisation file. The query vector is not
normalised since normalisation has the same effect on the final cosine score of when comparing the query
against each document. Finally the scores are negated and a minheap is used to find the top documents.

## Merging Algorithms
### BooleanMerge

Since only "AND" operators were allowed, only the intersection algorithm involved in Boolean search was
relavant. The intersection of n postings lists was optimised by first sorting the lists based on the size
of the list such that the longest lists were merged first. Dynamically generated skip pointers were also
used to speed up the intersection.

### PositionalMerge

For identifying phrases, as positional indexing was done, each posting in the posting list was in the form
[docID, term frequency, position list]. In other to identify the documents with a biword phrase, the
postings lists of each term was merged by docID as in a conventional intersection algorithm used for
merging postings lists in Boolean search. Within the merge algorithm, when a docID which occurs in both
postings lists is identified, an inner merge algorithm is called on the position list for those two
postings. Given two position lists such as [1,3,4,5] and [4,5,6,7], the starting position of the phrase
will be identified from the merge, such that [3,4,5] is returned. The merging algorithm was optimised
using dynamically generated skip pointers based on the size of the lists. A similar approach is used to
identify documents with triword phrases of the form "A B C". The posting list for the phrase "A B" is
first found, followed by "B C", and the two postings lists are then merged together.

## Query expansion

We experimented with 3 types of query expansion techniques, as well as in what order to apply them. More information
is in BONUS.docx.

### Relaxing boolean and phrasal queries

The first stage of query expansion, as explained above, involves relaxing the restrictions placed on the query from
phrases and boolean operators. Since the terms in the user queries may not be the exact terms desired, we need to
relax the AND portion of the query, so that even if the term given is not correct, the results for other parts of the
query can still be returned. To achieve a baseline tf-idf framework, all boolean operators and phrase markers were
stripped from the query string.

### WordNet/Thesaurus Query Expansion

The NLTK WordNet was used as a thesaurus to implement query expansion. In particular, the synonyms of the terms
were found using the synset feature of WordNet. Additionally, we also experimented with using hypernyms and
hyponyms to return related words, however, because too many of such words were returned, we decided to stick with
synonyms. The additional synonyms retrieved were appended onto the free text version of the original query to create
a longer free text query. Due to time constraints we were unable to implement a potential improvement, which involves
ensuring for a boolean query that the synonyms were intersected. For example, a query 'quiet AND phone call' could be
expanded to '(quiet OR silent) AND (phone call OR telephone call)'.

### Relevance Feedback and Rocchio Algorithm

For relevance feedback based on the Rocchio Algorithm, our system makes use of the top 1000 returned documents
from the basic search which are assumed to be relevant, on top of the list of documents identified as relevant
in the original query file. The document IDs are then used to retrieve precomputed and stored document vectors in
the postings vectors file, which are then combined to give the centroid vector of the relevant documents. This is
done such that there is no need to traverse the postings file to build the document vector for each relevant document,
which would be extremely expensive.

An additional optimisation involves storing the vectors as sparse vectors using a dictionary mapping terms to the tf
values. This is necessary since the vectors would include many 0 terms if the dimension of the vector was the size of
the entire dictionary. Furthermore, even after computing the centroid vector, there will still remain many non-zero
dimensions in the centroid. In order to improve on efficiency, there is a need to remove some of the non-zero terms.
To do this, each component was multiplied with idf in order to reduce the value of more common and hence less
useful terms. The top 50 terms were chosen for the final centroid vector.

The new centroid vector can then be added to the original query vector to derive a new query vector used for VSM
retrieval. For simplicity, the original query vector is made to be a free text query such that boolean operators
are removed and phrases are converted to single word terms. The additional documents found from relevance feedback
are appended after the already returned documents.

== Files included with this submission ==

# diagram.png - visualisation of our entire system architecture
# constants.py - constants file with global variables including file names and tuning settings
# BONUS.docx - our query refinement techniques and experiments

Indexing files:
# data_helper.py - manages the direct file reading and writing
# properties_helper.py - helper file for managing the document properties file
# index.py - driver file for indexing
# Dictionary.py - Dictionary class
# PostingList.py - PostingList class

Search files:
# search.py - driver file for search and query processing.
# search_helper.py - helper file for search, query parsing and evaluation.
# PositionalMerge.py - helper file for merging of posting and postional lists for identifying phrase queries.
# BooleanMerge.py - helper file for merging of postings lists in Boolean queries.
# Eval.py - Evaluation class for computing cosine scores based on Vector Space Model (VSM).
# QueryExpansion.py - code for query expansion, WordNet/thesaurus expansion, and relevance feedback.

Data files:
# dictionary.txt - dictionary file with title and content combined
# postings.txt - postings file with title and content combined
# dictionary_title.txt - dictionary file for document titles only
# postings_title.txt - postings file for document titles only
# vector.txt - file storing document vectors
# properties.txt - document properties file storing normalisation constants and other properties of each document.

Additional files:
# duplicate_docs - duplicate document IDs from corpus analysis
# lookdoc.py - script to view the document using the document ID
# lookvector.py - script to view the document vector using the document ID

== Statement of individual work ==

[X] I, SOH JASON, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.

[X] I, CHOW KENGJI, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.

[X] I, TAN YAN ZHEN, NATALIE, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.

[X] I, JOYCE YEO SHUHUI, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.

== References ==

Original source for data storage comparison, but it was overturned by our own investigations.
https://www.benfrederickson.com/dont-pickle-your-data/

Parallel Processing of Panda DataFrames
https://www.tjansson.dk/2018/04/parallel-processing-pandas-dataframes/

Date Processing
https://stackabuse.com/converting-strings-to-datetime-in-python/