# Assignment 2: IR

## Preparations
* Put all your imports and path constants in the next cells
* Make sure all your path constants are **relative to** ***DATA_DIR*** and **NOT hard-coded** in your code.

In [63]:
# imports
# Put all your imports here
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os.path
from pathlib import Path
import tempfile
import subprocess

In [64]:
DATA_DIR = "government"
#
# Put other path constants here
#
DOCUMENTS_DIR = os.path.join(DATA_DIR, "documents")
TOPIC_FILE = os.path.join(DATA_DIR, "gov.topics")
QRELS_FILE = os.path.join(DATA_DIR, "gov.qrels")

# For windows:
TREC_EVAL = os.path.join("trec_eval", "trec_eval.exe")

## Question 1
Provide your text answers in the following two markdown cells

### Q1 (a): Provide answer to Q1 (a) here [markdown cell]

Mean Average Precision (MAP)

### Q1 (b): Provide answer to Q1 (b) here [markdown cell]

A government website database is large and holds a great many documents, so a single document, or even a handful, is unlikely to cover all of the user’s information needs. This rules out measures that focus on the first relevant result or on the top few ranks, such as recip_rank, P@5, and P@10. A user researching a particular case or subject will likely want to read through all documents relevant to the query.

Thus, the best measure should reflect not only whether the system retrieves as many relevant documents as possible, but also how those documents are ranked, so that relevant documents come to the user’s attention first. This makes MAP the best measure for this purpose.
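
To make the argument concrete, here is a minimal sketch of how average precision (and its mean over queries, MAP) rewards both retrieving the relevant documents and ranking them early. The function names and document IDs are made up for illustration, and this is the textbook definition rather than trec_eval's actual implementation.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision for one query.

    Sums precision@k at every rank k where a relevant document appears,
    then divides by the total number of relevant documents, so relevant
    documents that are never retrieved contribute zero.
    """
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical toy data: the same two relevant documents, ranked early vs. late.
early = average_precision(["d1", "d2", "d3", "d4"], ["d1", "d2"])
late = average_precision(["d3", "d4", "d1", "d2"], ["d1", "d2"])
print(early, late)  # the early ranking scores higher than the late one
```

Both rankings retrieve every relevant document, so a pure recall measure could not tell them apart; average precision can.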

## Question 2

### Q2 (a): Write your code below

In [65]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q2, your query parser in QP_Q2, and your searcher in SEARCHER_Q2

def createIndex(schema):
    # Generate a temporary directory for the index
    indexDir = tempfile.mkdtemp()

    # create and return the index
    return index.create_in(indexDir, schema)

def addFilesToIndex(indexObj, fileList):
    # open writer
    writer = writing.BufferedWriter(indexObj, period=None, limit=1000)

    try:
        # write each file to index
        for docNum, filePath in enumerate(fileList):
            with open(filePath, "r", encoding="utf-8") as f:
                fileContent = f.read()
                writer.add_document(file_path = filePath,
                                    file_content = fileContent)

                # print status every 1000 documents
                if (docNum + 1) % 1000 == 0:
                    print("already indexed:", docNum+1)
        print("done indexing.")

    finally:
        # close the index
        writer.close()

# first, define a Schema for the index
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

# now, create the index at the path INDEX_DIR based on the new schema
myIndex = createIndex(mySchema)
        
# Build a list of files to index
filesToIndex = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]

addFilesToIndex(myIndex, filesToIndex)

done indexing.


In [66]:
# count files to index
print("number of files:", len(filesToIndex))

number of files: 4078


In [67]:
# define a query parser for the field "file_content" in the index
myQueryParser = QueryParser("file_content", schema=myIndex.schema)
mySearcher = myIndex.searcher()

sampleQuery = myQueryParser.parse("topics")
sampleQueryResults = mySearcher.search(sampleQuery, limit=None)

# inspect the result:
# for each document print the rank and the score
for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

G00-38-4007589 0 7.587337532494129
G00-15-1980890 1 7.303347484832998
G00-52-2991521 2 7.273467259978599
G00-28-3095840 3 6.990666413736417
G00-94-3097702 4 6.838291228538814
G00-70-0992046 5 6.8033986825647945
G00-67-2771108 6 6.658953296962102
G00-93-3702508 7 6.658953296962102
G00-87-2843876 8 6.625902343114017
G00-79-4144643 9 6.363019442848441
G00-95-2757938 10 6.3460209424648175
G00-03-2042174 11 6.2129536639850995
G00-91-1609512 12 6.202081251266743
G00-18-4042118 13 6.173469427536226
G00-49-2029636 14 6.166262720519868
G00-07-2819623 15 6.109819292317214
G00-22-2864155 16 6.109819292317214
G00-68-0320800 17 6.023497065426611
G00-05-2948873 18 6.020163336230321
G00-05-3703958 19 6.013288024988558
G00-07-3972872 20 5.9863769693119
G00-86-3214229 21 5.926443826929672
G00-39-2879677 22 5.887255583637417
G00-48-2946724 23 5.848520073541328
G00-64-1861217 24 5.82291676450907
G00-25-0870759 25 5.698544569625222
G00-02-4047734 26 5.6204754094266605
G00-10-2392581 27 5.6204754094266605


In [68]:
INDEX_Q2 = myIndex # Replace None with your index for Q2
QP_Q2 = myQueryParser # Replace None with your query parser for Q2
SEARCHER_Q2 = mySearcher # Replace None with your searcher for Q2

In [69]:
sampleQueryResults

<Top 181 Results for Term('file_content', 'topics') runtime=0.005630000000564905>

In [70]:
# print the topic file
with open(TOPIC_FILE, "r") as f:
    print(f.read())

1 mining gold silver coal
2 juvenile delinquency
4 wireless communications
6 physical therapists
7 cotton industry
9 genealogy searches
10 Physical Fitness
14 Agricultural biotechnology
16 Emergency and disaster preparedness assistance
18 Shipwrecks
19 Cybercrime, internet fraud, and cyber fraud
22 Veteran's Benefits
24 Air Bag Safety
26 Nuclear power plants
28 Early Childhood Education



In [71]:
# print the first 10 lines in the qrels file
with open(QRELS_FILE, "r") as f:
    qrels10 = f.readlines()[:10]
    print("".join(qrels10))

1 0 G00-00-0681214 0
1 0 G00-00-0945765 0
1 0 G00-00-1006224 1
1 0 G00-00-1591495 0
1 0 G00-00-2764912 0
1 0 G00-00-3253540 0
1 0 G00-00-3717374 0
1 0 G00-01-0270065 0
1 0 G00-01-0400712 0
1 0 G00-01-0682299 0



In [72]:
def trecEval(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results;
    # close the OS-level handle returned by mkstemp so it isn't leaked
    fd, tempOutputFile = tempfile.mkstemp()
    os.close(fd)
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    
    result = subprocess.run([TREC_EVAL, '-q', qrelsFile, tempOutputFile], stdout=subprocess.PIPE)
    print(result.stdout.decode())

In [73]:
trecEval(TOPIC_FILE, QRELS_FILE, myQueryParser, mySearcher) 

num_ret               	1	1
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	16
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.1667
Rprec                 	10	0.0000


### Q2 (b): Provide answer to Q2 (b) here [markdown cell]

MAP for all query results = 0.1971

### Q2 (c): Provide answer to Q2(c) here [markdown cell]

* Performed well: queries 18 and 24 (score of 1.0)
* Performed poorly: queries 1, 2, 6, 7, 9, 16, and 28 (score of 0)
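
One way to pick out the best and worst queries without reading the full trec_eval listing is to parse the per-query rows. The sketch below assumes the three-column `measure / query id / value` layout printed by `trec_eval -q` above, and assumes the summary row uses `all` as its query id. `per_query_measure` is a hypothetical helper, and the sample text only reuses values already reported in this notebook.

```python
def per_query_measure(trec_eval_output, measure="map"):
    """Collect {query_id: value} for one measure from `trec_eval -q` text output.

    The summary row (query id "all", assumed) is skipped so that only
    per-query values are returned.
    """
    scores = {}
    for line in trec_eval_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == measure and parts[1] != "all":
            scores[parts[1]] = float(parts[2])
    return scores


# Sample text built from rows shown in this notebook (not a full run).
sample = (
    "num_ret               \t1\t1\n"
    "map                   \t1\t0.0000\n"
    "map                   \t10\t0.1667\n"
    "map                   \tall\t0.1971\n"
)

scores = per_query_measure(sample)
# Sort queries from worst to best by the chosen measure.
ranking = sorted(scores.items(), key=lambda item: item[1])
print(ranking)
```

To use this with `trecEval`, the function would need to return `result.stdout.decode()` instead of only printing it.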

## Question 3

In [74]:
sampleQuery = QP_Q2.parse("wireless communications")
sampleQueryResults = mySearcher.search(sampleQuery, limit=None)

print("Retrieved Documents:")

for (docnum, result) in enumerate(sampleQueryResults):
    score = sampleQueryResults.score(docnum)
    fileName = os.path.basename(result["file_path"])
    print(fileName, docnum, score)

Retrieved Documents:
G00-99-2247765 0 16.44915453547273
G00-85-1525415 1 13.364613279303013
G00-05-1218739 2 12.956313628711154
G00-09-0774298 3 11.781349226871903
G00-56-4151981 4 11.367247611926537
G00-21-2229498 5 10.743957712158082
G00-98-4068688 6 10.46486548752591
G00-47-2117970 7 10.213356414484583
G00-67-0152545 8 8.392871246133646
G00-06-1757034 9 6.4315561377014046
G00-78-2551063 10 3.955775319427501
G00-84-0274223 11 2.0684375138105002


In [75]:
with open(QRELS_FILE, "r") as f:
    qrelsAll = f.readlines()

print("Relevant Documents:")
for line in qrelsAll:
    if line.startswith('4') and line.endswith('1\n'):
        print(line)

print("Only Rank 8: G00-47-2117970 is actually relevant")

Relevant Documents:
4 0 G00-03-2855342 1

4 0 G00-36-1275993 1

4 0 G00-47-2117970 1

4 0 G00-65-0162935 1

Only Rank 8: G00-47-2117970 is actually relevant


### Q3 (a): Provide answer to Q3 (a) here [markdown cell]

Query 4:

False positive: G00-84-0274223 (not relevant but retrieved).
The word "wireless" appears once and "communications" appears several times, but both are infrequent, especially relative to the length of the document.

False negative: G00-36-1275993 (relevant but not retrieved).
The words "wireless" and "communications" are often capitalized in this document (as opposed to lowercase in the query). The word "telecommunications", which is similar in meaning to "communications" but is not the same term, is also used very frequently.

Thus, making the index and queries case-insensitive, as well as applying stemming/lemmatization, may help.
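
The effect of case on exact term matching can be illustrated with a small pure-Python sketch. It does not use Whoosh: the `\w+` pattern only approximates what a regex tokenizer does, the helper names are hypothetical, and the document text is invented for illustration.

```python
import re


def tokenize(text, lowercase=False):
    """Split text into word tokens, optionally folding them to lowercase."""
    tokens = re.findall(r"\w+", text)
    return [t.lower() for t in tokens] if lowercase else tokens


def matching_terms(query, document, lowercase=False):
    """Return the query terms that occur as exact tokens in the document."""
    doc_terms = set(tokenize(document, lowercase))
    return [t for t in tokenize(query, lowercase) if t in doc_terms]


# Hypothetical document text with capitalized terms.
doc = "Wireless Communications and Telecommunications policy"
query = "wireless communications"

print(matching_terms(query, doc))                  # []
print(matching_terms(query, doc, lowercase=True))  # ['wireless', 'communications']
```

Even after lowercasing, "telecommunications" still does not match "communications", because matching is done on whole terms; closing that gap would need some further normalization.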

### Q3 (b): Write your code below

In [86]:
import nltk
from nltk.stem import *

# download required resources
nltk.download("wordnet")

lrStem = LancasterStemmer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nicolwon\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [84]:
# Dont change this! Use it as-is in your code
# This filter will run for both the index and the query
from whoosh.analysis import Filter
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            if t.mode == 'query': # if called by query parser
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t
            else: # == 'index' if called by indexer
                t.text = self.customFilter(t.text, *self.args, **self.kwargs)
                yield t

In [95]:
# Example1: Whoosh filter for NLTK's LancasterStemmer
myFilter1 = RegexTokenizer() | CustomFilter(LancasterStemmer().stem) | LowercaseFilter()

In [102]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q3, your query parser in QP_Q3, and your searcher in SEARCHER_Q3

# define a Schema with the new analyzer
mySchema2 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = myFilter1))

# create the index based on the new schema
myIndex2 = createIndex(mySchema2)

addFilesToIndex(myIndex2, filesToIndex)

done indexing.


In [103]:
INDEX_Q3 = myIndex2 # Replace None with your index for Q3
QP_Q3 = QueryParser("file_content", schema=INDEX_Q3.schema) # Replace None with your query parser for Q3
SEARCHER_Q3 = INDEX_Q3.searcher() # Replace None with your searcher for Q3

In [104]:
trecEval(TOPIC_FILE, QRELS_FILE, QP_Q3, SEARCHER_Q3) 

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	54
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.2500
Rprec                 	10	0.0000


### Q3 (c): Provide answer to Q3 (c) here [markdown cell]

Modifications made: stemming and lowercasing

Overall, results improved: more documents were retrieved, and the number of relevant documents retrieved doubled. The improvement shows up across measures, including MAP, Rprec, recip_rank, and interpolated precision at the recall levels, although not for every individual query (in the output above, query 1 still has a MAP of 0).

For query 4, the number of false negatives decreased: only one relevant document was not retrieved. MAP increased from 0.03 to over 0.5, and Rprec increased from 0 to 0.5. However, the number of retrieved documents rose sharply, from 12 to 40, so the number of false positives rose sharply as well.
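
The query 4 figures quoted above can be turned into set-based precision and recall with a short calculation. This is only a sketch: `precision_recall` is a hypothetical helper, and the stemmed run is assumed to have retrieved 3 of the 4 relevant documents, which is how "only one relevant document was not retrieved" is read here.

```python
def precision_recall(num_retrieved, num_relevant_retrieved, num_relevant):
    """Set-based precision and recall from simple counts."""
    precision = num_relevant_retrieved / num_retrieved if num_retrieved else 0.0
    recall = num_relevant_retrieved / num_relevant if num_relevant else 0.0
    return precision, recall


# Baseline (Q2 index): 12 retrieved, 1 of the 4 relevant documents among them.
baseline = precision_recall(12, 1, 4)

# Stemming + lowercasing (Q3 index): 40 retrieved, assumed 3 of 4 relevant.
stemmed = precision_recall(40, 3, 4)

print("baseline precision/recall:", baseline)
print("stemmed  precision/recall:", stemmed)
```

These set-based numbers ignore ranking, so they complement rather than replace the rank-aware measures such as MAP.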

### Q3 (d): Provide answer to Q3 (d) here [markdown cell]

Yes

### Q3 (e): Provide answer to Q3 (e) here [markdown cell]

Yes

### Q3 (f): Provide answer to Q3 (f) here [markdown cell]

Although my changes improved measures such as MAP and Rprec, the number of retrieved but irrelevant documents (false positives) rose sharply. Thus, I don't think this was a particularly useful improvement.

## Question 4 (Graduate Students)

In [None]:
GRAD_STUDENT = False # change to True if you are a grad student

### Q4 (a): Provide answer to Q4 (a) here [markdown cell]

### Q4 (b): Write your code below

In [None]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q4, your query parser in QP_Q4, and your searcher in SEARCHER_Q4

In [None]:
INDEX_Q4 = None # Replace None with your index for Q4
QP_Q4 = None # Replace None with your query parser for Q4
SEARCHER_Q4 = None # Replace None with your searcher for Q4

### Q4 (c): Provide answer to Q4 (c) here [markdown cell]

### Q4 (d): Provide answer to Q4 (d) here [markdown cell]

### Q4 (e): Provide answer to Q4 (e) here [markdown cell]

### Q4 (f): Provide answer to Q4 (f) here [markdown cell]

## Validation

In [None]:
# Run the following cells to make sure your code returns the correct value types

In [None]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Q2 Validation

In [107]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [106]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [None]:
assert((not GRAD_STUDENT) or isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert((not GRAD_STUDENT) or isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert((not GRAD_STUDENT) or isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")