# Assignment 1

By: Jordan Ponn (996765781)
Course: MIE1513

## Create Index

In [1]:
# Import Libraries
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os, os.path
import shutil

In [2]:
# Create dataset paths
OUTPUT_FILE_PATH = "/resources/data/DSS_Fall2016_Assign1/government"

DOCUMENTS_DIR = OUTPUT_FILE_PATH + "/documents"
INDEX_DIR = OUTPUT_FILE_PATH + "/index1"
QUER_FILE = OUTPUT_FILE_PATH + "/topics/gov.topics"
QRELS_FILE = OUTPUT_FILE_PATH + "/qrels/gov.qrels"
OUTPUT_FILE = OUTPUT_FILE_PATH + "/myres"
TREC_EVAL = "/resources/data/DSS_Fall2016_Assign1/trec_eval.8.1/trec_eval"

IMPROVED_INDEX_DIR = OUTPUT_FILE_PATH + "/index2"
LANCASTER_INDEX_DIR = IMPROVED_INDEX_DIR +"/lancaster"
PORTER_INDEX_DIR = IMPROVED_INDEX_DIR +"/porter"
SNOWBALL_INDEX_DIR = IMPROVED_INDEX_DIR +"/snowball"
LEMMA_INDEX_DIR = IMPROVED_INDEX_DIR +"/lemma"

LANCASTER_OUTPUT_FILE = OUTPUT_FILE_PATH + "/lanRes"
PORTER_OUTPUT_FILE = OUTPUT_FILE_PATH + "/porRes"
SNOWBALL_OUTPUT_FILE = OUTPUT_FILE_PATH + "/SnoRes"
LEMMA_OUTPUT_FILE = OUTPUT_FILE_PATH + "/LemRes"

### Building the Index

In [3]:
# first, define a Schema for the index
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

In [4]:
# if index exists - remove it
if os.path.isdir(INDEX_DIR):
    shutil.rmtree(INDEX_DIR)

# create the directory for the index
os.makedirs(INDEX_DIR)

# create index
myIndex = index.create_in(INDEX_DIR, mySchema)

### Indexing the Files

In [5]:
# first we build a list of all the full paths of the files in DOCUMENTS_DIR
filesToIndex = []
for root, dirs, files in os.walk(DOCUMENTS_DIR):
    filePaths = [os.path.join(root, fileName) for fileName in files if not fileName.startswith('.')]
    filesToIndex.extend(filePaths)

In [6]:
# open writer
myWriter = writing.BufferedWriter(myIndex, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r") as f:
            fileContent = f.read()
            myWriter.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


## Question 1 - Appropriate TREC measures

a.) Of the measures available in trec_eval, the most appropriate for measuring search system performance on government web sites is Mean Average Precision (MAP).

b.) This measure is the most appropriate because MAP takes document order into account: a result list scores higher when its relevant documents appear nearer the top.  This makes it easier for users to find the correct information.
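
The effect of document order on the score can be illustrated with a minimal sketch of average precision. The function names and the toy document IDs below are assumptions made for this example, not part of trec_eval; the sketch divides by the total number of relevant documents, which is the usual definition.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for a single topic.

    Sums precision@k at every rank k where a relevant document appears,
    then divides by the total number of relevant documents.
    """
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs):
    """MAP over topics; runs is a list of (ranked_doc_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Same four documents, same two relevant ones, different order (toy data).
relevant = {"d1", "d2"}
early = average_precision(["d1", "d2", "d3", "d4"], relevant)
late = average_precision(["d3", "d4", "d1", "d2"], relevant)
print(early, late)
```

Both rankings retrieve the same documents, but the one that places the relevant documents first scores higher.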

## Querying using TREC_EVAL

In [7]:
# define a query parser for the field "file_content" in the index
myQueryParser = QueryParser("file_content", schema=myIndex.schema)
mySearcher = myIndex.searcher()

In [8]:
# Load topic file - a list of topics (search phrases) used for evaluation
with open(QUER_FILE, "r") as topicsFile:
    topics = topicsFile.read().splitlines()

In [9]:
# create an output file to which we'll write our results
outputTRECFile = open(OUTPUT_FILE, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = myQueryParser.parse(topic_phrase)
    topicResults = mySearcher.search(topicQuery, limit=None)
        
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile.close()
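
The six-column line written in the loop above can be isolated in a small helper, which makes the format easier to check by eye. This is only a sketch: `format_trec_line` and the document name used in the example are hypothetical and do not come from the assignment data.

```python
def format_trec_line(topic_id, doc_name, rank, score, run_tag="test"):
    """Build one result line: topic, the literal Q0, document, rank, score, run tag."""
    return "%s Q0 %s %d %f %s\n" % (topic_id, doc_name, rank, score, run_tag)


# Hypothetical document name, used only to show the layout of a line.
line = format_trec_line("1", "G00-00-0000000", 0, 12.3456789)
print(line, end="")
```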

## Question 2 - Indexing and Querying

In [10]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE

num_ret        	1	1
num_rel        	1	5
num_rel_ret    	1	0
map            	1	0.0000
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0000
ircl_prn.0.00  	1	0.0000
ircl_prn.0.10  	1	0.0000
ircl_prn.0.20  	1	0.0000
ircl_prn.0.30  	1	0.0000
ircl_prn.0.40  	1	0.0000
ircl_prn.0.50  	1	0.0000
ircl_prn.0.60  	1	0.0000
ircl_prn.0.70  	1	0.0000
ircl_prn.0.80  	1	0.0000
ircl_prn.0.90  	1	0.0000
ircl_prn.1.00  	1	0.0000
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0000
P100           	1	0.0000
P200           	1	0.0000
P500           	1	0.0000
P1000          	1	0.0000
num_ret        	2	6
num_rel        	2	2
num_rel_ret    	2	0
map            	2	0.0000
R-prec         	2	0.0000
bpref          	2	0.0000
recip_rank     	2	0.0000
ircl_prn.0.00  	2	0.0000
ircl_prn.0.10  	2	0.0000
ircl_prn.0.20  	2	0.0000
ircl_prn.0.30  	2	0.0000
ircl_prn.0.40  	2	0.0000
ircl_prn.0.50  	

a.) From the trec_eval output above, the baseline Whoosh system had an overall MAP performance of 0.1971 across all query topics.

b.) Of all topics queried, topics 14 (Agricultural biotechnology) and 22 (Veteran's Benefits) performed the best, with MAP scores of 0.25 and 0.2 respectively.  In contrast, topic 19 (Cybercrime, internet fraud, and cyber fraud) performed the worst, as no score was returned for it at all.  Topics 1, 2, 6, 7, 9, 16 and 28 also performed poorly, each with a MAP score of 0.
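
Reading per-topic scores off the long `trec_eval -q` listing by hand is error-prone, so a small parser can help. The sketch below assumes the three-column layout shown in the output above (measure, topic, value) and a summary row labelled `all`; `parse_measure` is a hypothetical helper, and the sample text is a short excerpt assembled from values reported in this section rather than the full output.

```python
def parse_measure(trec_eval_text, measure="map"):
    """Return {topic_id: value} for one measure from trec_eval -q style output."""
    scores = {}
    for line in trec_eval_text.splitlines():
        parts = line.split()
        # Keep only complete rows for the requested measure.
        if len(parts) == 3 and parts[0] == measure:
            scores[parts[1]] = float(parts[2])
    return scores


# Short excerpt built from values quoted above (not the complete output).
sample_output = (
    "num_ret        \t1\t1\n"
    "map            \t1\t0.0000\n"
    "num_ret        \t2\t6\n"
    "map            \t2\t0.0000\n"
    "map            \t14\t0.2500\n"
    "map            \tall\t0.1971\n"
)

scores = parse_measure(sample_output)
per_topic = {t: s for t, s in scores.items() if t != "all"}
best_topic = max(per_topic, key=per_topic.get)
zero_topics = sorted(t for t, s in per_topic.items() if s == 0.0)
print(best_topic, zero_topics, scores["all"])
```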

## Question 3 - Improving Performance

a.) In the original search system, a basic regular expression tokenizer was used to split the documents into words, looking only at the spaces between them.  As shown below, several things can be done to tokenize the words more intelligently and improve Whoosh's performance on the government collection.

In [11]:
# define a reader object on the index
myReader = myIndex.reader()

In [12]:
# Topic 1 - mining gold silver coal
print("# docs with 'gold'", myReader.doc_frequency("file_content", "gold"))
print("# docs with 'Gold'", myReader.doc_frequency("file_content", "Gold"))
print("# docs with 'silver'", myReader.doc_frequency("file_content", "silver"))
print("# docs with 'Silver'", myReader.doc_frequency("file_content", "Silver"))
print("# docs with 'coal'", myReader.doc_frequency("file_content", "coal"))
print("# docs with 'Coal'", myReader.doc_frequency("file_content", "Coal"))

# docs with 'gold' 28
# docs with 'Gold' 33
# docs with 'silver' 12
# docs with 'Silver' 45
# docs with 'coal' 33
# docs with 'Coal' 48


First, setting all terms to lowercase will help to merge similar terms together.  As seen above, capitalization will cause the tokenizer to treat the same word as two different words.
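
The merging effect of lowercasing can be reproduced on a toy collection. This is a sketch with made-up documents and a hypothetical `doc_frequency` helper; it counts documents per token with plain Python rather than querying the Whoosh index.

```python
from collections import Counter

# Toy documents, invented for illustration only.
docs = {
    "doc1": "Gold and Silver prices rose",
    "doc2": "gold mining permits",
    "doc3": "Coal and gold exports",
}


def doc_frequency(docs, lowercase):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for text in docs.values():
        tokens = text.split()
        if lowercase:
            tokens = [t.lower() for t in tokens]
        df.update(set(tokens))  # set(): count each document once per token
    return df


case_sensitive = doc_frequency(docs, lowercase=False)
case_folded = doc_frequency(docs, lowercase=True)
print(case_sensitive["gold"], case_sensitive["Gold"], case_folded["gold"])
```

Without case folding, a query for 'gold' misses the document that only contains 'Gold'; after folding, all three documents share one term.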

In [13]:
# Topic 16 - Emergency and disaster preparedness assistance
print("# docs with 'Emergency'", myReader.doc_frequency("file_content", "Emergency"))
print("# docs with 'emergency'", myReader.doc_frequency("file_content", "emergency"))
print("# docs with 'and'", myReader.doc_frequency("file_content", "and"))
print("# docs with 'disaster'", myReader.doc_frequency("file_content", "disaster"))
print("# docs with 'preparedness'", myReader.doc_frequency("file_content", "preparedness"))
print("# docs with 'prepared'", myReader.doc_frequency("file_content", "prepared"))
print("# docs with 'preparing'", myReader.doc_frequency("file_content", "preparing"))
print("# docs with 'prepare'", myReader.doc_frequency("file_content", "prepare"))
print("# docs with 'assistance'", myReader.doc_frequency("file_content", "assistance"))

# docs with 'Emergency' 204
# docs with 'emergency' 119
# docs with 'and' 3348
# docs with 'disaster' 62
# docs with 'preparedness' 26
# docs with 'prepared' 92
# docs with 'preparing' 56
# docs with 'prepare' 73
# docs with 'assistance' 278


Secondly, stemming words could help merge tokens that have similar meaning but differ in spelling based on tense, plurality, etc.  In the above sample, the word 'preparedness' could be simplified to its root word 'prepare'.  Looking at other variations of the word shows that there are many additional documents that may be relevant.

Also, removing commonly used words (i.e. stop words) could help put more emphasis on the more important terms.  As seen above, the word 'and' is used in a significantly larger number of documents than the other terms in the query.  As this word does not add any value to the query, it should be ignored.
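
Both ideas can be sketched in a few lines. The suffix list and stop-word set below are deliberately crude assumptions chosen so that the 'prepare' variants from the sample collapse together; `crude_stem` is a stand-in for illustration and is not one of the NLTK stemmers.

```python
STOP_WORDS = {"and", "the", "of", "a", "to"}          # tiny illustrative list
SUFFIXES = ("edness", "ness", "ing", "ed", "e")      # checked in this order


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(query):
    """Lowercase, drop stop words, then stem what is left."""
    tokens = query.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


variants = ["preparedness", "prepared", "preparing", "prepare"]
stems = {crude_stem(v) for v in variants}
query_terms = normalize("Emergency and disaster preparedness assistance")
print(stems, query_terms)
```

All four variants reduce to the same stem, and 'and' no longer takes part in the query.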

In [14]:
# Topic 19 - Cybercrime, internet fraud, and cyber fraud
print("# docs with 'Cybercrime,'", myReader.doc_frequency("file_content", "Cybercrime,"))
print("# docs with 'Cybercrime'", myReader.doc_frequency("file_content", "Cybercrime"))

# docs with 'Cybercrime,' 0
# docs with 'Cybercrime' 12


Finally, removing punctuation will help better match search terms.  Since the original tokenizer only separates based on space, any remaining punctuation will cause the system to treat nearby words as unique words.  As seen in the example above, removing the comma from the term 'cybercrime' shows that there were many results that were initially missed.
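
The difference between splitting on spaces and extracting word characters can be seen directly on the topic 19 phrase. This sketch uses only Python's `re` module and is meant as an illustration of the idea, not of how Whoosh tokenizes internally.

```python
import re

topic = "Cybercrime, internet fraud, and cyber fraud"

# Splitting on whitespace keeps the commas attached to the words.
whitespace_tokens = topic.split()

# Matching runs of word characters drops the punctuation.
word_tokens = re.findall(r"\w+", topic)

print(whitespace_tokens)
print(word_tokens)
```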

## Improve Tokenizer

In [15]:
import nltk
from nltk.stem import *

In [16]:
# download required resources
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /home/notebook/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [17]:
# This filter will run for both the index and the query
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        # two filters are considered equal if they are of the same class
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

In [18]:
# Define Tokenizers
lancasterTokenizer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | CustomFilter(LancasterStemmer().stem)
porterTokenizer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | CustomFilter(PorterStemmer().stem)
snowballTokenizer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | CustomFilter(snowball.EnglishStemmer().stem)
lemmaTokenizer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | CustomFilter(WordNetLemmatizer().lemmatize)

b.) Several filters were added to the tokenizer to improve performance.  First, all words were changed to lowercase so that words are not missed due to case sensitivity.  Next, an intra-word filter was used to remove punctuation attached to tokens.  Then, all stop words were removed.  Finally, a stemmer was used to merge similar terms by removing suffixes from words.  The stemmer was applied last in order to avoid accidentally removing tokens that, once stemmed, look like stop words.  A lemmatizer was also tried.  In this case, instead of stripping suffixes from all words, inflected forms are mapped to their dictionary base form (lemma).

From the NLTK package, the Lancaster, Porter and Snowball stemmers were tested, as well as the WordNet Lemmatizer.

See answer c below for performance improvements.
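
Why the order of the filters matters can be shown with a small generator pipeline. Everything here is an assumption made for illustration: the filters are plain Python generators rather than Whoosh filters, the stop-word list has three entries, and `toy_stem` only strips a trailing "s" as a stand-in for a real stemmer.

```python
import re

STOP_WORDS = {"and", "can", "the"}  # tiny illustrative list


def tokenize(text):
    for match in re.finditer(r"\w+", text):
        yield match.group()


def lowercase(tokens):
    for t in tokens:
        yield t.lower()


def remove_stop_words(tokens):
    for t in tokens:
        if t not in STOP_WORDS:
            yield t


def toy_stem(tokens):
    # Stand-in for a real stemmer: strip one trailing "s" from longer words.
    for t in tokens:
        yield t[:-1] if t.endswith("s") and len(t) > 3 else t


def run_pipeline(text, *filters):
    """Chain the filters left to right over the token stream."""
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return list(tokens)


text = "Recycling cans and bottles"
stem_last = run_pipeline(text, lowercase, remove_stop_words, toy_stem)
stem_first = run_pipeline(text, lowercase, toy_stem, remove_stop_words)
print(stem_last)
print(stem_first)
```

When stemming runs before stop-word removal, 'cans' is reduced to 'can' and then discarded as a stop word; with the stemmer last, the term survives.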


## Rebuild Index with New Tokenizer

In [19]:
# Define schemas for each tokenizer
lancasterSchema = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = lancasterTokenizer))

porterSchema = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = porterTokenizer))

snowballSchema = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = snowballTokenizer))

lemmaSchema = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = lemmaTokenizer))

In [20]:
# if index exists - remove it
if os.path.isdir(IMPROVED_INDEX_DIR):
    shutil.rmtree(IMPROVED_INDEX_DIR)

# create the directories for the indices
os.makedirs(IMPROVED_INDEX_DIR)
os.makedirs(LANCASTER_INDEX_DIR)
os.makedirs(PORTER_INDEX_DIR)
os.makedirs(SNOWBALL_INDEX_DIR)
os.makedirs(LEMMA_INDEX_DIR)

# create a fresh index in each directory
lancasterIndex = index.create_in(LANCASTER_INDEX_DIR, lancasterSchema)
porterIndex = index.create_in(PORTER_INDEX_DIR, porterSchema)
snowballIndex = index.create_in(SNOWBALL_INDEX_DIR, snowballSchema)
lemmaIndex = index.create_in(LEMMA_INDEX_DIR, lemmaSchema)

# List to hold all indices
indexList = [lancasterIndex, porterIndex, snowballIndex, lemmaIndex]

In [21]:
# Prepare index for each schema
# (loop variable is named idx so it does not shadow the whoosh `index` module)
for indexNum, idx in enumerate(indexList):
    # open writer
    myWriter2 = writing.BufferedWriter(idx, period=20, limit=1000)
    print("Preparing index ", indexNum)
    try:
        # write each file to index
        for docNum, filePath in enumerate(filesToIndex):
            with open(filePath, "r") as f:
                fileContent = f.read()
                myWriter2.add_document(file_path = filePath,
                                      file_content = fileContent)

                if (docNum % 1000 == 0):
                    print("already indexed:", docNum+1)
        print("done indexing.\n")

    finally:
        # save the index
        myWriter2.close()

Preparing index  0
already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.

Preparing index  1
already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.

Preparing index  2
already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.

Preparing index  3
already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.



In [22]:
# define a query parser for the field "file_content" in the index
parserList = []
searcherList = []

for idx in indexList:
    parserList.append(QueryParser("file_content", schema=idx.schema))
    searcherList.append(idx.searcher())
    
outputFileList = [LANCASTER_OUTPUT_FILE, PORTER_OUTPUT_FILE, SNOWBALL_OUTPUT_FILE, LEMMA_OUTPUT_FILE]

In [23]:
# Run trec_eval for each tokenizer being tested
for outputNum, outputFile in enumerate(outputFileList):
    # create an output file to which we'll write our results
    outputTRECFile = open(outputFile, "w")

    # for each evaluated topic:
    # build a query and record the results in the file in TREC_EVAL format
    for topic in topics:
        topic_id, topic_phrase = tuple(topic.split(" ", 1))
        topicQuery = parserList[outputNum].parse(topic_phrase)
        topicResults = searcherList[outputNum].search(topicQuery, limit=None)
        for (docnum, result) in enumerate(topicResults):
            score = topicResults.score(docnum)
            outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

    # close the topic and results file
    outputTRECFile.close()

In [24]:
!$TREC_EVAL -q $QRELS_FILE $LANCASTER_OUTPUT_FILE

num_ret        	1	3
num_rel        	1	5
num_rel_ret    	1	0
map            	1	0.0000
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0000
ircl_prn.0.00  	1	0.0000
ircl_prn.0.10  	1	0.0000
ircl_prn.0.20  	1	0.0000
ircl_prn.0.30  	1	0.0000
ircl_prn.0.40  	1	0.0000
ircl_prn.0.50  	1	0.0000
ircl_prn.0.60  	1	0.0000
ircl_prn.0.70  	1	0.0000
ircl_prn.0.80  	1	0.0000
ircl_prn.0.90  	1	0.0000
ircl_prn.1.00  	1	0.0000
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0000
P100           	1	0.0000
P200           	1	0.0000
P500           	1	0.0000
P1000          	1	0.0000
num_ret        	2	13
num_rel        	2	2
num_rel_ret    	2	1
map            	2	0.5000
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50  

In [25]:
!$TREC_EVAL -q $QRELS_FILE $PORTER_OUTPUT_FILE

num_ret        	1	3
num_rel        	1	5
num_rel_ret    	1	0
map            	1	0.0000
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0000
ircl_prn.0.00  	1	0.0000
ircl_prn.0.10  	1	0.0000
ircl_prn.0.20  	1	0.0000
ircl_prn.0.30  	1	0.0000
ircl_prn.0.40  	1	0.0000
ircl_prn.0.50  	1	0.0000
ircl_prn.0.60  	1	0.0000
ircl_prn.0.70  	1	0.0000
ircl_prn.0.80  	1	0.0000
ircl_prn.0.90  	1	0.0000
ircl_prn.1.00  	1	0.0000
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0000
P100           	1	0.0000
P200           	1	0.0000
P500           	1	0.0000
P1000          	1	0.0000
num_ret        	2	13
num_rel        	2	2
num_rel_ret    	2	1
map            	2	0.5000
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50  

In [26]:
!$TREC_EVAL -q $QRELS_FILE $SNOWBALL_OUTPUT_FILE

num_ret        	1	3
num_rel        	1	5
num_rel_ret    	1	0
map            	1	0.0000
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0000
ircl_prn.0.00  	1	0.0000
ircl_prn.0.10  	1	0.0000
ircl_prn.0.20  	1	0.0000
ircl_prn.0.30  	1	0.0000
ircl_prn.0.40  	1	0.0000
ircl_prn.0.50  	1	0.0000
ircl_prn.0.60  	1	0.0000
ircl_prn.0.70  	1	0.0000
ircl_prn.0.80  	1	0.0000
ircl_prn.0.90  	1	0.0000
ircl_prn.1.00  	1	0.0000
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0000
P100           	1	0.0000
P200           	1	0.0000
P500           	1	0.0000
P1000          	1	0.0000
num_ret        	2	13
num_rel        	2	2
num_rel_ret    	2	1
map            	2	0.5000
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50  

In [27]:
!$TREC_EVAL -q $QRELS_FILE $LEMMA_OUTPUT_FILE

num_ret        	1	3
num_rel        	1	5
num_rel_ret    	1	0
map            	1	0.0000
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0000
ircl_prn.0.00  	1	0.0000
ircl_prn.0.10  	1	0.0000
ircl_prn.0.20  	1	0.0000
ircl_prn.0.30  	1	0.0000
ircl_prn.0.40  	1	0.0000
ircl_prn.0.50  	1	0.0000
ircl_prn.0.60  	1	0.0000
ircl_prn.0.70  	1	0.0000
ircl_prn.0.80  	1	0.0000
ircl_prn.0.90  	1	0.0000
ircl_prn.1.00  	1	0.0000
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0000
P100           	1	0.0000
P200           	1	0.0000
P500           	1	0.0000
P1000          	1	0.0000
num_ret        	2	13
num_rel        	2	2
num_rel_ret    	2	1
map            	2	0.5000
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50  

c.) Overall, the search was improved from an original MAP score of 0.1971 to 0.3456, an improvement of 75.34%, using the Lancaster stemmer.  The Porter and Snowball stemmers both gave MAP scores of 0.3366, while the lemmatizer gave a MAP score of 0.3402.
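
The relative improvement can be checked with a short calculation, taking the baseline MAP of 0.1971 reported in Question 2 and the scores quoted in this answer. `relative_improvement` is a hypothetical helper written for this check.

```python
def relative_improvement(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


baseline_map = 0.1971
analyzer_maps = {
    "lancaster": 0.3456,
    "porter": 0.3366,
    "snowball": 0.3366,
    "lemma": 0.3402,
}

improvements = {
    name: relative_improvement(baseline_map, score)
    for name, score in analyzer_maps.items()
}
best_analyzer = max(analyzer_maps, key=analyzer_maps.get)

for name, pct in improvements.items():
    print("%-10s %+.2f%%" % (name, pct))
```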

The remainder of the questions are answered for the Lancaster stemmer.

d.) Some queries improved, while others got worse.  Topics 2, 9, 19, and 28 improved from a MAP score of 0 (or no score at all) to scores between 0.0385 and 0.5000.  Topics 4, 10, and 14 also saw improvements in MAP score.

Meanwhile, topics 22 (Veteran's Benefits) and 26 (Nuclear power plants) saw reductions in MAP score.

e.) Given the significant overall improvement in MAP score, the implemented ideas were a step in the right direction in terms of overall system performance.  Some topics, such as 19 (Cybercrime, internet fraud, and cyber fraud), now return results where the original system returned none.  However, there are still other changes that can be made to the system to improve the results.

## Question 4 - Additional Iteration

a.) Below is the token analysis of the topics with MAP scores of 0.

In [28]:
# define a reader object on the index
myReader2 = lancasterIndex.reader()

In [29]:
# Topic 1 - mining gold silver coal
print([token.text for token in lancasterTokenizer("mining gold silver coal")])

['min', 'gold', 'silv', 'coal']


In [30]:
print("# docs with 'min'", myReader2.doc_frequency("file_content", "min"))
print("# docs with 'gold'", myReader2.doc_frequency("file_content", "gold"))
print("# docs with 'silv'", myReader2.doc_frequency("file_content", "silv"))
print("# docs with 'coal'", myReader2.doc_frequency("file_content", "coal"))

# docs with 'min' 338
# docs with 'gold' 111
# docs with 'silv' 63
# docs with 'coal' 62


Even though every query term occurs in the index, no relevant documents are being found.  Looking at the source files being searched, a number of documents only mention 'Mining' by itself, without the other terms present.

In [31]:
# Topic 6 - physical therapists
print([token.text for token in lancasterTokenizer("physical therapists")])

['phys', 'therap']


In [32]:
print("# docs with 'phys'", myReader2.doc_frequency("file_content", "phys"))
print("# docs with 'therap'", myReader2.doc_frequency("file_content", "therap"))

# docs with 'phys' 448
# docs with 'therap' 17


In [33]:
# Topic 7 - cotton industry
print([token.text for token in lancasterTokenizer("cotton industry")])

['cotton', 'industry']


In [34]:
print("# docs with 'cotton'", myReader2.doc_frequency("file_content", "cotton"))
print("# docs with 'industry'", myReader2.doc_frequency("file_content", "industry"))

# docs with 'cotton' 35
# docs with 'industry' 395


In [35]:
# Topic 16 - Emergency and disaster preparedness assistance
print([token.text for token in lancasterTokenizer("Emergency and disaster preparedness assistance")])

['emerg', 'disast', 'prep', 'assist']


In [36]:
print("# docs with 'emerg'", myReader2.doc_frequency("file_content", "emerg"))
print("# docs with 'disast'", myReader2.doc_frequency("file_content", "disast"))
print("# docs with 'prep'", myReader2.doc_frequency("file_content", "prep"))
print("# docs with 'assist'", myReader2.doc_frequency("file_content", "assist"))

# docs with 'emerg' 358
# docs with 'disast' 153
# docs with 'prep' 368
# docs with 'assist' 768


As seen above, the improved tokenizer successfully ran on all documents, but some queries are still not returning any relevant results.  Looking at some of the documents labeled as relevant in the gov.qrels file, it appears that many of them contain only one or two of the key terms used in the query.

To improve performance, giving a stronger weighting to rarer, more distinctive terms can help compensate for query terms that are missing from a document.  Adjusting the query to use 'OR' instead of 'AND' may help, as well as adjusting the document scoring system.
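
The difference between the two ways of joining terms can be sketched with set operations over toy posting lists. The document IDs and postings below are invented for illustration; only the stemmed terms come from the topic 1 analysis above.

```python
# Toy posting lists: term -> set of documents containing it (made-up data).
postings = {
    "min": {"docA", "docB", "docC"},
    "gold": {"docB", "docD"},
    "silv": {"docD"},
    "coal": {"docC", "docE"},
}
query_terms = ["min", "gold", "silv", "coal"]

# AND: a document must contain every query term.
and_matches = set.intersection(*(postings[t] for t in query_terms))

# OR: a document needs at least one query term.
or_matches = set.union(*(postings[t] for t in query_terms))

# A simple way to order OR matches: count how many query terms each contains.
match_counts = {
    doc: sum(doc in postings[t] for t in query_terms) for doc in or_matches
}

print(sorted(and_matches))
print(sorted(match_counts.items(), key=lambda kv: (-kv[1], kv[0])))
```

No toy document contains all four terms, so the 'AND' query returns nothing, while the 'OR' query returns every document that mentions at least one of them.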

## Adding Scoring System to Search

In [37]:
from whoosh import scoring, qparser

In [38]:
queryParserList = []
bm25SearcherBList = []
bm25SearcherK1List = []

# Create parsers and searchers with varying hyper parameter settings
for i in range(0,4):
    # Varying factory weighting
    queryParserList.append(QueryParser("file_content", schema=lancasterIndex.schema, group=qparser.OrGroup.factory(i*0.25)))
    
    # Varying hyperparameters in BM25 scoring system
    bm25SearcherBList.append(lancasterIndex.searcher(weighting=scoring.BM25F(B=i*0.25, K1=1.2)))
    bm25SearcherK1List.append(lancasterIndex.searcher(weighting=scoring.BM25F(B=0.75, K1=1.2+i*0.1)))

# Varying scoring system
tfidfSearcher = lancasterIndex.searcher(weighting=scoring.TF_IDF())
bm25Searcher = lancasterIndex.searcher(weighting=scoring.BM25F())

b.) The query parser was adjusted to join multi-term queries with "OR" instead of "AND".  Using the factory() function, an extra weighting was also given to documents that match more of the query terms.

TF-IDF was also tried as a scoring method in place of BM25.  The B and K1 parameters of the BM25 scoring method were also varied to see how they affected search results.

See answer c below for result of the changes.
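
The role of B and K1 can be sketched with the textbook BM25 term score. This is a minimal illustration, not Whoosh's `BM25F` implementation, whose idf formula and field handling may differ; the collection size, document lengths, and term frequency are hypothetical values chosen for the example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, B=0.75, K1=1.2):
    """Textbook BM25 contribution of one term to one document's score.

    B controls how strongly document length is normalized (0 = not at all);
    K1 controls how quickly repeated occurrences of the term saturate.
    """
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - B + B * (doc_len / avg_doc_len)
    return idf * (tf * (K1 + 1)) / (tf + K1 * length_norm)


# Hypothetical collection statistics, for illustration only.
common = dict(tf=3, avg_doc_len=200, doc_freq=63, num_docs=4000)

short_b0 = bm25_term_score(doc_len=100, B=0.0, **common)
long_b0 = bm25_term_score(doc_len=400, B=0.0, **common)
short_b75 = bm25_term_score(doc_len=100, B=0.75, **common)
long_b75 = bm25_term_score(doc_len=400, B=0.75, **common)

print(short_b0, long_b0)
print(short_b75, long_b75)
```

With B set to 0, document length has no effect on the score; with B at 0.75, the same term frequency counts for less in the longer document.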

### Joining Query Terms With 'OR' Instead of 'AND'

In [39]:
# Test Effect of 'OR' queries
orOutputFileList = []
for parserNum, parser in enumerate(queryParserList):
    # create an output file to which we'll write our results
    filepath = OUTPUT_FILE_PATH + "/orRes" + str(parserNum)
    outputFile = open(filepath, "w")
    orOutputFileList.append(filepath)
    
    # for each evaluated topic:
    # build a query and record the results in the file in TREC_EVAL format
    for topic in topics:
        topic_id, topic_phrase = tuple(topic.split(" ", 1))
        topicQuery = parser.parse(topic_phrase)
        topicResults = bm25Searcher.search(topicQuery, limit=None)
        for (docnum, result) in enumerate(topicResults):
            score = topicResults.score(docnum)
            outputFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

    # close the topic and results file
    outputFile.close()

In [40]:
# factory = 0
orOutputFile = orOutputFileList[0]
!$TREC_EVAL -q $QRELS_FILE $orOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0618
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0500
ircl_prn.0.00  	1	0.0968
ircl_prn.0.10  	1	0.0968
ircl_prn.0.20  	1	0.0968
ircl_prn.0.30  	1	0.0968
ircl_prn.0.40  	1	0.0968
ircl_prn.0.50  	1	0.0968
ircl_prn.0.60  	1	0.0968
ircl_prn.0.70  	1	0.0412
ircl_prn.0.80  	1	0.0412
ircl_prn.0.90  	1	0.0410
ircl_prn.1.00  	1	0.0410
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0500
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

In [41]:
# factory = 0.25
orOutputFile = orOutputFileList[1]
!$TREC_EVAL -q $QRELS_FILE $orOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0557
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0417
ircl_prn.0.00  	1	0.0769
ircl_prn.0.10  	1	0.0769
ircl_prn.0.20  	1	0.0769
ircl_prn.0.30  	1	0.0769
ircl_prn.0.40  	1	0.0769
ircl_prn.0.50  	1	0.0769
ircl_prn.0.60  	1	0.0769
ircl_prn.0.70  	1	0.0439
ircl_prn.0.80  	1	0.0439
ircl_prn.0.90  	1	0.0439
ircl_prn.1.00  	1	0.0439
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

In [42]:
# factory = 0.50
orOutputFile = orOutputFileList[2]
!$TREC_EVAL -q $QRELS_FILE $orOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0557
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0417
ircl_prn.0.00  	1	0.0769
ircl_prn.0.10  	1	0.0769
ircl_prn.0.20  	1	0.0769
ircl_prn.0.30  	1	0.0769
ircl_prn.0.40  	1	0.0769
ircl_prn.0.50  	1	0.0769
ircl_prn.0.60  	1	0.0769
ircl_prn.0.70  	1	0.0439
ircl_prn.0.80  	1	0.0439
ircl_prn.0.90  	1	0.0439
ircl_prn.1.00  	1	0.0439
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

In [43]:
# factory = 0.75
orOutputFile = orOutputFileList[3]
!$TREC_EVAL -q $QRELS_FILE $orOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0557
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0417
ircl_prn.0.00  	1	0.0769
ircl_prn.0.10  	1	0.0769
ircl_prn.0.20  	1	0.0769
ircl_prn.0.30  	1	0.0769
ircl_prn.0.40  	1	0.0769
ircl_prn.0.50  	1	0.0769
ircl_prn.0.60  	1	0.0769
ircl_prn.0.70  	1	0.0439
ircl_prn.0.80  	1	0.0439
ircl_prn.0.90  	1	0.0439
ircl_prn.1.00  	1	0.0439
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

Overall, joining multiple query terms with 'OR' instead of 'AND' improved MAP scores.  Of the factory parameter values tested, 0 performed best, with a MAP score of 0.3797.

### Comparing TF-IDF to BM25

In [44]:
# Test scoring system
scoringSearcherList = [tfidfSearcher, bm25Searcher]
scoringOutputFileList = [OUTPUT_FILE_PATH + "/tfidfRes", OUTPUT_FILE_PATH + "/bm25Res"]

for systemNum, searcher in enumerate(scoringSearcherList):
    # create an output file to which we'll write our results
    outputFile = open(scoringOutputFileList[systemNum], "w")

    # for each evaluated topic:
    # build a query and record the results in the file in TREC_EVAL format
    for topic in topics:
        topic_id, topic_phrase = tuple(topic.split(" ", 1))
        topicQuery = queryParserList[0].parse(topic_phrase)
        topicResults = searcher.search(topicQuery, limit=None)
        for (docnum, result) in enumerate(topicResults):
            score = topicResults.score(docnum)
            outputFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

    # close the topic and results file
    outputFile.close()

In [45]:
tfidfOutputFile = scoringOutputFileList[0]
!$TREC_EVAL -q $QRELS_FILE $tfidfOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0491
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0476
ircl_prn.0.00  	1	0.0870
ircl_prn.0.10  	1	0.0870
ircl_prn.0.20  	1	0.0870
ircl_prn.0.30  	1	0.0870
ircl_prn.0.40  	1	0.0870
ircl_prn.0.50  	1	0.0494
ircl_prn.0.60  	1	0.0494
ircl_prn.0.70  	1	0.0494
ircl_prn.0.80  	1	0.0494
ircl_prn.0.90  	1	0.0216
ircl_prn.1.00  	1	0.0216
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0200
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.1884
R-prec         	2	0.0000
bpref          	2	0.0000
recip_rank     	2	0.3333
ircl_prn.0.00  	2	0.3333
ircl_prn.0.10  	2	0.3333
ircl_prn.0.20  	2	0.3333
ircl_prn.0.30  	2	0.3333
ircl_prn.0.40  	2	0.3333
ircl_prn.0.50

In [46]:
bm25OutputFile = scoringOutputFileList[1]
!$TREC_EVAL -q $QRELS_FILE $bm25OutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0618
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0500
ircl_prn.0.00  	1	0.0968
ircl_prn.0.10  	1	0.0968
ircl_prn.0.20  	1	0.0968
ircl_prn.0.30  	1	0.0968
ircl_prn.0.40  	1	0.0968
ircl_prn.0.50  	1	0.0968
ircl_prn.0.60  	1	0.0968
ircl_prn.0.70  	1	0.0412
ircl_prn.0.80  	1	0.0412
ircl_prn.0.90  	1	0.0410
ircl_prn.1.00  	1	0.0410
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0500
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

Overall, using the Lancaster stemmer from Part 3 and the "OR" parser from the previous section, the BM25 scoring system has a higher MAP score (0.3797) than TF-IDF (0.1302).  With the default "AND" parser, TF-IDF achieved a MAP score of 0.1641.

### Varying B parameter in BM25

In [47]:
# Test Effect of B parameter in BM25
bm25BOutputFileList = []
for searcherNum, searcher in enumerate(bm25SearcherBList):
    # build the path of the output file for this B setting
    filepath = OUTPUT_FILE_PATH + "/bm25BRes" + str(searcherNum)
    bm25BOutputFileList.append(filepath)

    # the with-block closes the results file once all topics are written
    with open(filepath, "w") as outputFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = topic.split(" ", 1)
            topicQuery = queryParserList[0].parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            # rank is the 0-based position of the hit in the result list
            for rank, result in enumerate(topicResults):
                score = topicResults.score(rank)
                outputFile.write("%s Q0 %s %d %f test\n" % (topic_id, os.path.basename(result["file_path"]), rank, score))

In [48]:
# B = 0.0, K1 = 1.2
bm25BOutputFile = bm25BOutputFileList[0]
!$TREC_EVAL -q $QRELS_FILE $bm25BOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0412
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0417
ircl_prn.0.00  	1	0.0652
ircl_prn.0.10  	1	0.0652
ircl_prn.0.20  	1	0.0652
ircl_prn.0.30  	1	0.0652
ircl_prn.0.40  	1	0.0652
ircl_prn.0.50  	1	0.0652
ircl_prn.0.60  	1	0.0652
ircl_prn.0.70  	1	0.0278
ircl_prn.0.80  	1	0.0278
ircl_prn.0.90  	1	0.0186
ircl_prn.1.00  	1	0.0186
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0333
P100           	1	0.0300
P200           	1	0.0200
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5217
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

In [49]:
# B = 0.25, K1 = 1.2
bm25BOutputFile = bm25BOutputFileList[1]
!$TREC_EVAL -q $QRELS_FILE $bm25BOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0551
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0476
ircl_prn.0.00  	1	0.0938
ircl_prn.0.10  	1	0.0938
ircl_prn.0.20  	1	0.0938
ircl_prn.0.30  	1	0.0938
ircl_prn.0.40  	1	0.0938
ircl_prn.0.50  	1	0.0938
ircl_prn.0.60  	1	0.0938
ircl_prn.0.70  	1	0.0351
ircl_prn.0.80  	1	0.0351
ircl_prn.0.90  	1	0.0275
ircl_prn.1.00  	1	0.0275
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0000
P30            	1	0.0667
P100           	1	0.0300
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5263
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

In [50]:
# B = 0.50, K1 = 1.2
bm25BOutputFile = bm25BOutputFileList[2]
!$TREC_EVAL -q $QRELS_FILE $bm25BOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0602
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0526
ircl_prn.0.00  	1	0.1034
ircl_prn.0.10  	1	0.1034
ircl_prn.0.20  	1	0.1034
ircl_prn.0.30  	1	0.1034
ircl_prn.0.40  	1	0.1034
ircl_prn.0.50  	1	0.1034
ircl_prn.0.60  	1	0.1034
ircl_prn.0.70  	1	0.0396
ircl_prn.0.80  	1	0.0396
ircl_prn.0.90  	1	0.0340
ircl_prn.1.00  	1	0.0340
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0500
P30            	1	0.1000
P100           	1	0.0300
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5294
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

In [51]:
# B = 0.75, K1 = 1.2
bm25BOutputFile = bm25BOutputFileList[3]
!$TREC_EVAL -q $QRELS_FILE $bm25BOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0618
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0500
ircl_prn.0.00  	1	0.0968
ircl_prn.0.10  	1	0.0968
ircl_prn.0.20  	1	0.0968
ircl_prn.0.30  	1	0.0968
ircl_prn.0.40  	1	0.0968
ircl_prn.0.50  	1	0.0968
ircl_prn.0.60  	1	0.0968
ircl_prn.0.70  	1	0.0412
ircl_prn.0.80  	1	0.0412
ircl_prn.0.90  	1	0.0410
ircl_prn.1.00  	1	0.0410
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0500
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

Overall, using the Lancaster stemmer from Part 3 and the "OR" parser from the previous section, the BM25 scoring system has its highest MAP score of 0.3797 when B = 0.75, holding K1 constant at 1.2.
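
To make the role of B more concrete, the sketch below scores a single query term with the textbook BM25 formula. It is only an illustration and not the scoring code used for the runs above: the function name `bm25_term_score` and all of the numbers (term frequency, document lengths, idf) are hypothetical. With B = 0 document length is ignored, so a short and a long document with the same term frequency score identically. As B grows, the longer document is penalised more.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, B=0.75, K1=1.2):
    """Score one query term in one document with the textbook BM25 formula."""
    # Length normalisation: B=0 ignores document length, B=1 normalises fully.
    length_norm = (1.0 - B) + B * (doc_len / avg_doc_len)
    # K1 controls how quickly repeated occurrences of the term saturate.
    return idf * (tf * (K1 + 1.0)) / (tf + K1 * length_norm)


# Hypothetical numbers: same term frequency in a short and a long document.
for B in (0.0, 0.25, 0.5, 0.75):
    short_doc = bm25_term_score(tf=3, doc_len=100, avg_doc_len=200, idf=2.0, B=B)
    long_doc = bm25_term_score(tf=3, doc_len=400, avg_doc_len=200, idf=2.0, B=B)
    print("B=%.2f short=%.4f long=%.4f" % (B, short_doc, long_doc))
```

This may explain why B = 0 did worst here: without length normalisation, long documents that repeat a query term can outrank shorter, more focused ones.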

### Varying K1 parameter in BM25

In [52]:
# Test Effect of K1 parameter in BM25
bm25KOutputFileList = []
for searcherNum, searcher in enumerate(bm25SearcherK1List):
    # build the path of the output file for this K1 setting
    filepath = OUTPUT_FILE_PATH + "/bm25KRes" + str(searcherNum)
    bm25KOutputFileList.append(filepath)

    # the with-block closes the results file once all topics are written
    with open(filepath, "w") as outputFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = topic.split(" ", 1)
            topicQuery = queryParserList[0].parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            # rank is the 0-based position of the hit in the result list
            for rank, result in enumerate(topicResults):
                score = topicResults.score(rank)
                outputFile.write("%s Q0 %s %d %f test\n" % (topic_id, os.path.basename(result["file_path"]), rank, score))

In [53]:
# B = 0.75, K1 = 1.2
bm25KOutputFile = bm25KOutputFileList[0]
!$TREC_EVAL -q $QRELS_FILE $bm25KOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0618
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0500
ircl_prn.0.00  	1	0.0968
ircl_prn.0.10  	1	0.0968
ircl_prn.0.20  	1	0.0968
ircl_prn.0.30  	1	0.0968
ircl_prn.0.40  	1	0.0968
ircl_prn.0.50  	1	0.0968
ircl_prn.0.60  	1	0.0968
ircl_prn.0.70  	1	0.0412
ircl_prn.0.80  	1	0.0412
ircl_prn.0.90  	1	0.0410
ircl_prn.1.00  	1	0.0410
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0500
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

In [54]:
# B = 0.75, K1 = 1.3
bm25KOutputFile = bm25KOutputFileList[1]
!$TREC_EVAL -q $QRELS_FILE $bm25KOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0625
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0500
ircl_prn.0.00  	1	0.0968
ircl_prn.0.10  	1	0.0968
ircl_prn.0.20  	1	0.0968
ircl_prn.0.30  	1	0.0968
ircl_prn.0.40  	1	0.0968
ircl_prn.0.50  	1	0.0968
ircl_prn.0.60  	1	0.0968
ircl_prn.0.70  	1	0.0431
ircl_prn.0.80  	1	0.0431
ircl_prn.0.90  	1	0.0431
ircl_prn.1.00  	1	0.0431
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0500
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

In [55]:
# B = 0.75, K1 = 1.4
bm25KOutputFile = bm25KOutputFileList[2]
!$TREC_EVAL -q $QRELS_FILE $bm25KOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0631
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0500
ircl_prn.0.00  	1	0.0968
ircl_prn.0.10  	1	0.0968
ircl_prn.0.20  	1	0.0968
ircl_prn.0.30  	1	0.0968
ircl_prn.0.40  	1	0.0968
ircl_prn.0.50  	1	0.0968
ircl_prn.0.60  	1	0.0968
ircl_prn.0.70  	1	0.0446
ircl_prn.0.80  	1	0.0446
ircl_prn.0.90  	1	0.0446
ircl_prn.1.00  	1	0.0446
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0500
P30            	1	0.0667
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

In [56]:
# B = 0.75, K1 = 1.5
bm25KOutputFile = bm25KOutputFileList[3]
!$TREC_EVAL -q $QRELS_FILE $bm25KOutputFile

num_ret        	1	481
num_rel        	1	5
num_rel_ret    	1	5
map            	1	0.0651
R-prec         	1	0.0000
bpref          	1	0.0000
recip_rank     	1	0.0500
ircl_prn.0.00  	1	0.1034
ircl_prn.0.10  	1	0.1034
ircl_prn.0.20  	1	0.1034
ircl_prn.0.30  	1	0.1034
ircl_prn.0.40  	1	0.1034
ircl_prn.0.50  	1	0.1034
ircl_prn.0.60  	1	0.1034
ircl_prn.0.70  	1	0.0460
ircl_prn.0.80  	1	0.0460
ircl_prn.0.90  	1	0.0459
ircl_prn.1.00  	1	0.0459
P5             	1	0.0000
P10            	1	0.0000
P15            	1	0.0000
P20            	1	0.0500
P30            	1	0.1000
P100           	1	0.0400
P200           	1	0.0250
P500           	1	0.0100
P1000          	1	0.0050
num_ret        	2	59
num_rel        	2	2
num_rel_ret    	2	2
map            	2	0.5357
R-prec         	2	0.5000
bpref          	2	0.5000
recip_rank     	2	1.0000
ircl_prn.0.00  	2	1.0000
ircl_prn.0.10  	2	1.0000
ircl_prn.0.20  	2	1.0000
ircl_prn.0.30  	2	1.0000
ircl_prn.0.40  	2	1.0000
ircl_prn.0.50

Overall, using the Lancaster stemmer from Part 3 and the "OR" parser from the previous section, the BM25 scoring system has its highest MAP score of 0.3810 when K1 = 1.5, holding B constant at 0.75.

c.) Overall, the search was improved from a MAP score of 0.3456 to 0.3810, an improvement of 10.24%.  This was achieved using a Lancaster stemmer from the NLTK package, an 'OR' query parser, and the BM25 scoring system with B = 0.75 and K1 = 1.5.
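
As a quick check of the arithmetic, the relative improvement can be recomputed from the two MAP scores quoted above. The helper name `relative_improvement` is just for illustration.

```python
def relative_improvement(baseline, improved):
    """Return the percentage change from baseline to improved."""
    return (improved - baseline) / baseline * 100.0


# MAP scores reported above: baseline 0.3456, best system 0.3810
print("%.2f%%" % relative_improvement(0.3456, 0.3810))  # prints 10.24%
```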

d.) Every query either improved or maintained the same performance. Topics 1, 6, 7 and 16 improved from a MAP score of 0 to scores between 0.0651 and 0.1778. Topics 2, 4, 9 and 26 also saw improvements in MAP score.
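
A per-topic comparison like this one can be automated rather than read off the printed tables. The sketch below assumes the `trec_eval -q` output has been captured as text (for example through `subprocess`) and that each line has the three whitespace-separated fields seen in the cells above. The function names are hypothetical, and the sample strings reuse the topic 1 and topic 2 MAP values from the TF-IDF and BM25 runs shown earlier.

```python
def parse_map_scores(trec_eval_output):
    """Map each topic id to its MAP value from `trec_eval -q` text output."""
    scores = {}
    for line in trec_eval_output.splitlines():
        fields = line.split()
        # Lines of interest look like: map <topic_id> <value>
        if len(fields) == 3 and fields[0] == "map":
            scores[fields[1]] = float(fields[2])
    return scores


def compare_runs(before, after):
    """Return {topic: (before, after, delta)} for topics present in both runs."""
    return {topic: (before[topic], after[topic], after[topic] - before[topic])
            for topic in sorted(before) if topic in after}


# Sample lines copied from the TF-IDF and BM25 outputs above (topics 1 and 2).
tfidf_text = "map            \t1\t0.0491\nmap            \t2\t0.1884\n"
bm25_text = "map            \t1\t0.0618\nmap            \t2\t0.5357\n"

comparison = compare_runs(parse_map_scores(tfidf_text), parse_map_scores(bm25_text))
for topic, (old, new, delta) in comparison.items():
    print("topic %s: %.4f -> %.4f (%+.4f)" % (topic, old, new, delta))
```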

e.) Given the overall improvement in MAP score, the implemented ideas improved the system. All test queries now return results, and the system can better handle multi-term queries. However, there are other things that can be done to achieve a better result.

For one, the B and K1 parameters could be chosen more carefully.  In the tests above, only a small subset of combinations was evaluated, and each parameter was varied while the other was held fixed.  If a larger set of (B, K1) combinations were tested, a better setting might be found.  Also, an 'OR' parser was used in place of an 'AND' parser for the query. There may be cases where the user wants to find documents containing both of two keywords.  Unfortunately, this case is not handled, as all 'and' phrases are removed from the query while it is being tokenized. Adjusting for this should help improve the precision of the results.
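
A joint search over B and K1 could be organised as a simple grid search. The sketch below is only an outline: `grid_search` takes any callable that returns a score for a given (B, K1) pair. In this notebook that callable would have to build a BM25 searcher with those parameters, write the run file and read the overall MAP from trec_eval, and that wiring is not shown. The `toy_evaluate` function is a made-up stand-in so the example runs on its own. It does not produce real MAP values.

```python
from itertools import product


def grid_search(evaluate, b_values, k1_values):
    """Evaluate every (B, K1) pair and return (best_score, best_B, best_K1)."""
    best = None
    for B, K1 in product(b_values, k1_values):
        score = evaluate(B, K1)
        if best is None or score > best[0]:
            best = (score, B, K1)
    return best


def toy_evaluate(B, K1):
    # Stand-in objective, NOT real MAP: peaks at B=0.5, K1=1.6
    return 1.0 - (B - 0.5) ** 2 - (K1 - 1.6) ** 2


b_grid = [0.0, 0.25, 0.5, 0.75, 1.0]
k1_grid = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
best_score, best_B, best_K1 = grid_search(toy_evaluate, b_grid, k1_grid)
print("best B=%.2f, K1=%.1f" % (best_B, best_K1))
```

Unlike the one-parameter-at-a-time sweeps above, this evaluates every combination, so it can find settings where B and K1 interact.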

Finally, a separate cross-validation set of queries would help ensure that the system is not being overfitted to the test queries.
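
One simple way to set this up is to hold out part of the topics before any tuning. The sketch below assumes the topics are a list of strings in the same "id phrase" form used in the loops above. The function name `split_topics`, the 30% hold-out fraction and the example topic strings are all hypothetical.

```python
import random


def split_topics(topics, holdout_fraction=0.3, seed=0):
    """Shuffle topics reproducibly and split into (tuning, holdout) lists."""
    shuffled = list(topics)
    # A fixed seed keeps the split identical between notebook runs.
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(round(len(shuffled) * holdout_fraction)))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical topics in "id phrase" form
example_topics = ["%d hypothetical query text %d" % (i, i) for i in range(1, 11)]
tuning, holdout = split_topics(example_topics)
print(len(tuning), len(holdout))
```

The parameters would then be tuned on the `tuning` topics only, with MAP on the `holdout` topics reported once at the end.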