<a href="https://colab.research.google.com/github/groveratul/Information_Retrieval/blob/master/Information_retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Information retrieval using Whoosh

In [0]:
!pip install whoosh
!pip install pytrec_eval
!pip install wget



In [0]:
import wget
wget.download("https://github.com/MIE451-1513-2019/course-datasets/raw/master/government.zip", "government.zip")

'government (1).zip'

In [0]:
!unzip government.zip

Archive:  government.zip
replace government/topics-with-full-descriptions.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [0]:
# imports

from whoosh import index, writing,scoring
from whoosh.scoring import Weighting
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.analysis import Filter
from whoosh.qparser import QueryParser
from whoosh import qparser
import os.path
from pathlib import Path
import tempfile
import subprocess
import pytrec_eval
import wget
import nltk
from nltk.stem import *

This notebook uses the government document dataset.

In [0]:
DATA_DIR = "government"
DOCUMENTS_DIR = os.path.join(DATA_DIR, "documents")
TOPIC_FILE = os.path.join(DATA_DIR, "gov.topics")
QRELS_FILE = os.path.join(DATA_DIR, "gov.qrels")


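We evaluate with MAP (mean average precision): for each query, average precision averages the precision values at the ranks where relevant documents are retrieved, and MAP is the mean of these scores over all queries. It therefore rewards rankings that place relevant documents near the top of the returned results.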
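As a rough illustration of the metric described above, here is a minimal pure-Python sketch of average precision and MAP. The function names and the toy document ids are assumptions made for this example; the notebook itself relies on pytrec_eval for the actual numbers.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision for one query.

    Sums precision@k at every rank k that holds a relevant document,
    then divides by the total number of relevant documents.
    """
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """runs: {query_id: ranked doc ids}, qrels: {query_id: relevant doc ids}."""
    scores = [average_precision(runs.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: hypothetical ids, not taken from the government dataset
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}}
print(mean_average_precision(runs, qrels))
```

A relevant document that appears early contributes a high precision value, so pushing relevant documents up the ranking raises the score.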




In [0]:
# we start with basic tokenizer
tokenizer = RegexTokenizer()

# we might want use stemming:
stmAnalyzer = RegexTokenizer() | StemFilter()

# We probably want to lower-case it
# so we add LowercaseFilter
stmLwrAnalyzer = RegexTokenizer() | LowercaseFilter() | StemFilter()

# we probably want to ignore words like "we", "are", "with" when we index files
# so we add StopFilter to filter stop words
stmLwrStpAnalyzer = RegexTokenizer() | LowercaseFilter() | StopFilter() | StemFilter()

# we also probably want to break phrases like "whoosh.analysis" into "whoosh" and "analysis"
# so we add IntraWordFilter
stmLwrStpIntraAnalyzer = RegexTokenizer() | LowercaseFilter() | IntraWordFilter() | StopFilter() | StemFilter()


In [0]:
class CustomFilter(Filter):
    is_morph = True
    def __init__(self, filterFunc, *args, **kwargs):
        self.customFilter = filterFunc
        self.args = args
        self.kwargs = kwargs
    def __eq__(self, other):
        # two filters compare equal when they are instances of the same class
        return (other
                and self.__class__ is other.__class__)
    def __call__(self, tokens):
        for t in tokens:
            # the same transformation is applied in 'query' mode (query parser)
            # and in 'index' mode (indexer)
            t.text = self.customFilter(t.text, *self.args, **self.kwargs)
            yield t

In [0]:
nltk.download("wordnet")
lrStem = LancasterStemmer()
sbStem = SnowballStemmer("english")
wnLemm = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
def pyTrecEval(topicFile, qrelsFile, queryParser, searcher):
    # Load topic file - a list of topics (search phrases) used for evaluation
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    # create an output file to which we'll write our results
    tempOutputFile = tempfile.mkstemp()[1]
    with open(tempOutputFile, "w") as outputTRECFile:
        # for each evaluated topic:
        # build a query and record the results in the file in TREC_EVAL format
        for topic in topics:
            topic_id, topic_phrase = tuple(topic.split(" ", 1))
            #print(topic_id, topic_phrase)
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                #print("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
                outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
    with open(qrelsFile, 'r') as f_qrel:
        qrel = pytrec_eval.parse_qrel(f_qrel)

    with open(tempOutputFile, 'r') as f_run:
        run = pytrec_eval.parse_run(f_run)

    evaluator = pytrec_eval.RelevanceEvaluator(
        qrel, pytrec_eval.supported_measures)

    results = evaluator.evaluate(run)
    def print_line(measure, scope, value):
        print('{:25s}{:8s}{:.4f}'.format(measure, scope, value))

    for query_id, query_measures in results.items():
        for measure, value in query_measures.items():
            if measure == "runid":
                continue
            print_line(measure, query_id, value)
    # query_measures still refers to the last query here; only its keys are used
    for measure in query_measures.keys():
        if measure == "runid":
            continue
        print_line(
            measure,
            'all',
            pytrec_eval.compute_aggregated_measure(
                measure,
                [query_measures[measure]
                 for query_measures in results.values()]))

In [0]:
def printRelName(topicFile, qrelsFile, queryParser, searcher, id):
    # load the topics, one "<topic_id> <topic_phrase>" pair per line
    with open(topicFile, "r") as tf:
        topics = tf.read().splitlines()

    for topic in topics:
        topic_id, topic_phrase = tuple(topic.split(" ", 1))
        if topic_id == id:
            print("---------------------------Topic_id and Topic_phrase----------------------------------")
            print(topic_id, topic_phrase)
            topicQuery = queryParser.parse(topic_phrase)
            topicResults = searcher.search(topicQuery, limit=None)
            print("---------------------------Return documents----------------------------------")
            for (docnum, result) in enumerate(topicResults):
                score = topicResults.score(docnum)
                print("%s Q0 %s %d %lf test" % (topic_id, os.path.basename(result["file_path"]), docnum, score))
            print("---------------------------Relevant documents----------------------------------")
            # print the qrels lines that mark a document as relevant for this topic
            with open(qrelsFile, 'r') as f_qrel:
                qrels = f_qrel.readlines()
                for i in qrels:
                    qid, _, doc, rel = i.rstrip().split(" ")
                    if qid == id and rel == "1":
                        print(i.rstrip())

In [0]:
def createIndex(schema):
    # Generate a temporary directory for the index
    indexDir = tempfile.mkdtemp()

    # create and return the index
    return index.create_in(indexDir, schema)

In [0]:
def addFilesToIndex(indexObj, fileList):
    # open writer
    writer = writing.BufferedWriter(indexObj, period=None, limit=1000)

    try:
        # write each file to index
        for docNum, filePath in enumerate(fileList):
            with open(filePath, "r", encoding="utf-8") as f:
                fileContent = f.read()
                writer.add_document(file_path = filePath,
                                    file_content = fileContent)

                # print status every 1000 documents
                if (docNum + 1) % 1000 == 0:
                    print("already indexed:", docNum+1)
        print("done indexing.")

    finally:
        # close the writer
        writer.close()

In [0]:

mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))

myIndex = createIndex(mySchema)

In [0]:
filesToIndex = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]
addFilesToIndex(myIndex, filesToIndex)

done indexing.


In [0]:
myQueryParser = QueryParser("file_content", schema=myIndex.schema)
mySearcher = myIndex.searcher()

In [0]:
pyTrecEval(TOPIC_FILE, QRELS_FILE, myQueryParser, mySearcher) 

num_q                    1       1.0000
num_ret                  1       1.0000
num_rel                  1       5.0000
num_rel_ret              1       0.0000
map                      1       0.0000
gm_map                   1       -11.5129
Rprec                    1       0.0000
bpref                    1       0.0000
recip_rank               1       0.0000
iprec_at_recall_0.00     1       0.0000
iprec_at_recall_0.10     1       0.0000
iprec_at_recall_0.20     1       0.0000
iprec_at_recall_0.30     1       0.0000
iprec_at_recall_0.40     1       0.0000
iprec_at_recall_0.50     1       0.0000
iprec_at_recall_0.60     1       0.0000
iprec_at_recall_0.70     1       0.0000
iprec_at_recall_0.80     1       0.0000
iprec_at_recall_0.90     1       0.0000
iprec_at_recall_1.00     1       0.0000
P_5                      1       0.0000
P_10                     1       0.0000
P_15                     1       0.0000
P_20                     1       0.0000
P_30                     1       0.000

In [0]:
INDEX_Q2 = myIndex 
QP_Q2 = myQueryParser 
SEARCHER_Q2 = mySearcher 

Our previous approach has some shortcomings.

For instance,


Query: Veteran's Benefits \\
False Positive: For this query, G00-20-0665105 is ranked highest even though it is not relevant. Inspecting the document shows that "veteran" appears many times while "benefits" does not appear at all.

False Negative: For this query, G00-59-0641167 is ranked only at position 5 (counting from 0), although it is highly relevant because it contains precise information about Veteran's Benefits. Inspecting the document shows that the terms "Veteran's" and "benefits" occur together only a few times, so the low term count led to a lower rank.


Improvements: 


1.   Decrease the weight of words with a high document frequency.
2.  Use a stop filter, stemming, and lowercasing so that case variants and related word forms also match, while commonly used words are dropped.
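The first improvement can be illustrated with a small inverse document frequency (IDF) sketch. This is one common smoothed variant, not necessarily the formula Whoosh uses internally, and the function name and toy documents are assumptions made for illustration.

```python
import math


def smoothed_idf(term, tokenized_docs):
    """Smoothed IDF: terms found in many documents get a lower weight."""
    n_docs = len(tokenized_docs)
    doc_freq = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0


# Toy corpus (hypothetical): "veteran" is common, "benefits" is rare
docs = [
    ["veteran", "veteran", "veteran", "news"],
    ["veteran", "benefits", "claim"],
    ["veteran", "parade"],
    ["tax", "report"],
]
for term in ("veteran", "benefits"):
    print(term, round(smoothed_idf(term, docs), 4))
```

With such a weighting, a match on the rarer term counts for more than a match on the frequent one.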















In [0]:
printRelName(TOPIC_FILE, QRELS_FILE, myQueryParser, mySearcher, "22")

---------------------------Topic_id and Topic_phrase----------------------------------
22 Veteran's Benefits
---------------------------Return documents----------------------------------
22 Q0 G00-20-0665105 0 17.409278 test
22 Q0 G00-26-1904362 1 17.290035 test
22 Q0 G00-91-3056625 2 16.675452 test
22 Q0 G00-05-2034933 3 16.591891 test
22 Q0 G00-08-2045138 4 16.518430 test
22 Q0 G00-59-0641167 5 16.229949 test
22 Q0 G00-02-0832578 6 15.989090 test
22 Q0 G00-12-3819195 7 15.621370 test
22 Q0 G00-62-3414482 8 15.314785 test
22 Q0 G00-07-1233793 9 15.311532 test
22 Q0 G00-84-1325129 10 14.974365 test
22 Q0 G00-16-1355187 11 14.973564 test
22 Q0 G00-96-2877140 12 14.900865 test
22 Q0 G00-81-2370175 13 13.582083 test
22 Q0 G00-27-0576256 14 13.293954 test
22 Q0 G00-89-0106321 15 7.640470 test
22 Q0 G00-21-0649032 16 6.675688 test
---------------------------Relevant documents----------------------------------
22 0 G00-08-2045138 1


In [0]:

myFilter2 = stmLwrStpIntraAnalyzer 

mySchema2 = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = myFilter2))
myIndex2 = createIndex(mySchema2)
filesToIndex2 = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]
addFilesToIndex(myIndex2, filesToIndex2)
myQueryParser2 = QueryParser("file_content", schema=myIndex2.schema)
mySearcher2 = myIndex2.searcher()
pyTrecEval(TOPIC_FILE, QRELS_FILE, myQueryParser2, mySearcher2) 

done indexing.
num_q                    1       1.0000
num_ret                  1       3.0000
num_rel                  1       5.0000
num_rel_ret              1       0.0000
map                      1       0.0000
gm_map                   1       -11.5129
Rprec                    1       0.0000
bpref                    1       0.0000
recip_rank               1       0.0000
iprec_at_recall_0.00     1       0.0000
iprec_at_recall_0.10     1       0.0000
iprec_at_recall_0.20     1       0.0000
iprec_at_recall_0.30     1       0.0000
iprec_at_recall_0.40     1       0.0000
iprec_at_recall_0.50     1       0.0000
iprec_at_recall_0.60     1       0.0000
iprec_at_recall_0.70     1       0.0000
iprec_at_recall_0.80     1       0.0000
iprec_at_recall_0.90     1       0.0000
iprec_at_recall_1.00     1       0.0000
P_5                      1       0.0000
P_10                     1       0.0000
P_15                     1       0.0000
P_20                     1       0.0000
P_30                   

In [0]:
printRelName(TOPIC_FILE, QRELS_FILE, myQueryParser2, mySearcher2, "22")

---------------------------Topic_id and Topic_phrase----------------------------------
22 Veteran's Benefits
---------------------------Return documents----------------------------------
22 Q0 G00-04-2203486 0 14.822046 test
22 Q0 G00-68-0000000 1 14.713557 test
22 Q0 G00-59-0641167 2 14.681278 test
22 Q0 G00-91-3056625 3 14.680589 test
22 Q0 G00-55-3685771 4 14.629713 test
22 Q0 G00-31-2417708 5 14.599217 test
22 Q0 G00-78-4117174 6 14.554000 test
22 Q0 G00-38-1736026 7 14.520404 test
22 Q0 G00-84-1325129 8 14.480537 test
22 Q0 G00-11-1809773 9 14.475054 test
22 Q0 G00-84-2308958 10 14.403039 test
22 Q0 G00-20-0665105 11 14.318010 test
22 Q0 G00-16-0165956 12 14.183932 test
22 Q0 G00-88-2083848 13 14.178139 test
22 Q0 G00-16-1355187 14 14.120383 test
22 Q0 G00-19-2784490 15 14.046526 test
22 Q0 G00-69-0982341 16 14.037327 test
22 Q0 G00-89-0106321 17 13.962517 test
22 Q0 G00-76-2498376 18 13.946721 test
22 Q0 G00-69-2544761 19 13.925570 test
22 Q0 G00-40-2173588 20 13.910198 test
22 Q

In [0]:
INDEX_Q3 = myIndex2 
QP_Q3 = myQueryParser2
SEARCHER_Q3 = mySearcher2


We replaced the RegexTokenizer with stmLwrStpIntraAnalyzer, which makes the search case-insensitive, splits compound tokens such as "whoosh.analysis" into their parts, ignores stop words, and stems the remaining words.

The MAP improved from 0.1971 to 0.3366, which is a considerable improvement.


The false positive from the earlier answer dropped from the top of the ranking (position 0) to position 11 in the zero-based listing above. The false negative, meanwhile, rose from position 5 to position 2.
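To make the effect of the analyzer chain more concrete, the following pure-Python sketch mimics the lowercase, stop-word, and stemming stages with generators. It is only an analogy: the stop-word list is a tiny illustrative sample, `naive_stem` is a stand-in that strips a trailing "s" rather than a real stemmer, and none of these helper names come from Whoosh.

```python
import re

# Tiny illustrative stop-word list (an assumption, far shorter than a real one)
STOP_WORDS = {"the", "for", "of", "and", "are", "we", "with"}


def tokenize(text):
    """Yield word-like tokens (simplified regex tokenizer)."""
    for match in re.finditer(r"\w+", text):
        yield match.group(0)


def lowercase(tokens):
    for tok in tokens:
        yield tok.lower()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    for tok in tokens:
        if tok not in stop_words:
            yield tok


def naive_stem(tokens):
    """Stand-in for a real stemmer: strips a single trailing 's'."""
    for tok in tokens:
        yield tok[:-1] if len(tok) > 3 and tok.endswith("s") else tok


def analyze(text):
    """Chain the stages in the same order as tokenizer | lowercase | stop | stem."""
    return list(naive_stem(remove_stop_words(lowercase(tokenize(text)))))


print(analyze("Benefits for the Veterans"))
```

Because the same chain is applied to documents and queries, "Benefits" in a document and "benefit" in a query end up as the same index term.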

In [0]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q4, your query parser in QP_Q4, and your searcher in SEARCHER_Q4
myFilter3 = stmLwrStpIntraAnalyzer | CustomFilter(LancasterStemmer().stem) 

mySchema3 = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = myFilter3))
myIndex3 = createIndex(mySchema3)
filesToIndex3 = [str(filePath) for filePath in Path(DOCUMENTS_DIR).glob("**/*") if filePath.is_file()]
addFilesToIndex(myIndex3, filesToIndex3)
myQueryParser3 = QueryParser("file_content", schema=myIndex3.schema,group=qparser.OrGroup)
mySearcher3 = myIndex3.searcher(weighting = scoring.BM25F(B=0.55, K1=1.1))
pyTrecEval(TOPIC_FILE, QRELS_FILE, myQueryParser3, mySearcher3) 

done indexing.
num_q                    1       1.0000
num_ret                  1       469.0000
num_rel                  1       5.0000
num_rel_ret              1       5.0000
map                      1       0.0603
gm_map                   1       -2.8083
Rprec                    1       0.0000
bpref                    1       0.0000
recip_rank               1       0.0526
iprec_at_recall_0.00     1       0.1000
iprec_at_recall_0.10     1       0.1000
iprec_at_recall_0.20     1       0.1000
iprec_at_recall_0.30     1       0.1000
iprec_at_recall_0.40     1       0.1000
iprec_at_recall_0.50     1       0.1000
iprec_at_recall_0.60     1       0.1000
iprec_at_recall_0.70     1       0.0412
iprec_at_recall_0.80     1       0.0412
iprec_at_recall_0.90     1       0.0362
iprec_at_recall_1.00     1       0.0362
P_5                      1       0.0000
P_10                     1       0.0000
P_15                     1       0.0000
P_20                     1       0.0500
P_30                  


 A clear list of all final modifications made.  \
The k1 and b values of the BM25 scoring function were tuned. Quoting from https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables: "If b is bigger, the effects of the length of the document compared to the average length are more amplified." and "k1 is a variable which helps determine term frequency saturation characteristics. That is, it limits how much a single query term can affect the score of a given document. It does this through approaching an asymptote". These descriptions guided the hyperparameter tuning. In addition, we used stmLwrStpIntraAnalyzer followed by a custom Lancaster stemmer filter, and changed the query parser grouping to OrGroup.\
Why each modification was made – how did it help? \
As explained above, we tuned k1 and b to suit our documents (B=0.55, K1=1.1) and used stmLwrStpIntraAnalyzer to remove stop words and case sensitivity. The Lancaster stemmer was used for stemming, and the query parser group was changed from the default AND to OR because some queries, such as "mining gold", gave poor results when every query term was required to match. \
The final MAP performance that these modifications attained.\
0.3965
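The roles of k1 and b quoted above can be seen in a small sketch of the textbook BM25 term score. This is an assumption-laden illustration: the function name is hypothetical, the idf is fixed at 1 for simplicity, and Whoosh's BM25F implementation may differ in its details.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf=1.0, k1=1.1, b=0.55):
    """Contribution of one query term to one document's score (textbook BM25)."""
    # b controls how strongly documents longer than average are penalized
    length_norm = (1.0 - b) + b * (doc_len / avg_doc_len)
    # k1 controls how quickly extra occurrences of the term stop adding score
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Saturation: with idf=1 the score approaches k1 + 1 as tf grows
for tf in (1, 2, 5, 50):
    print(tf, round(bm25_term_score(tf, doc_len=100, avg_doc_len=100), 4))

# Length normalization: same tf, document twice the average length,
# compared for the tuned b and for a larger b
print(bm25_term_score(3, doc_len=200, avg_doc_len=100, b=0.55))
print(bm25_term_score(3, doc_len=200, avg_doc_len=100, b=0.75))
```

A smaller b, as chosen here, softens the penalty on long documents, while k1 bounds how much a single heavily repeated term (such as "veteran" in the false positive) can dominate the score.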

In [0]:
INDEX_Q4 = myIndex3 
QP_Q4 = myQueryParser3 
SEARCHER_Q4 = mySearcher3 