# Information Retrieval
    By Oghosa Igbinakenzua
    MIE 451 | Decision Support Systems
    October 7, 2017

## Preparations
* Put all your imports, and path constants in the next cells
* Make sure ***MATERIALS_DIR*** points to the directory where you extracted the Zip file.
* Make sure all your paths are **relative to** ***MATERIALS_DIR*** and **NOT hard-coded** in your code.

In [1]:
# imports
# Put all your imports here
from whoosh import index, writing
from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import *
from whoosh.qparser import QueryParser
import os, os.path
import shutil

In [2]:
MATERIALS_DIR = r"C:\DSS_Fall2017_Assign2"
#
DOCUMENTS_DIR = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\documents")
INDEX_DIR = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\index1")
QUER_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\topics\gov.topics")
QRELS_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\qrels\gov.qrels")
OUTPUT_FILE = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres")
TREC_EVAL = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\trec_eval\trec_eval.exe")
INDEX_DIR2 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\index2")
OUTPUT_FILE2 = os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\myres2")

## Question 1: Appropriate TREC measures
trec_eval can report dozens of measures: for example “p5” (precision in the first five documents returned), “num_rel_ret” (the number of relevant documents retrieved over all queries), and “recip_rank” (the reciprocal rank of the top relevant document: e.g., 0.25 if the first relevant document is the fourth in the ranking). You can get a list by running trec_eval -h or in trec_eval’s README file.

    (a)  Which of trec_eval’s measures might be appropriate for measuring search system performance for government web sites?  [List the measure]
    (b)  Why do you think this measure is appropriate?  [1 sentence]

### Q1 (a): 
P_5

### Q1 (b): 
It shows the relevance of the first k (k = 5 in this case) documents returned, which is useful here because the end user of the output is human and could reasonably be satisfied by strong precision in the first 5 documents returned.
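
As a rough illustration of what P_5 measures, here is a minimal sketch of precision at k. The function name `precision_at_k`, the document IDs, and the relevance set are hypothetical and chosen only for this example; the sketch divides by k regardless of how many documents were returned, which is an assumption about the convention rather than a statement about trec_eval's internals.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=5):
    """Fraction of the top-k ranked documents that are relevant.

    Assumption: the denominator is always k, even if fewer than k
    documents were returned.
    """
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    return hits / k


# Hypothetical ranking and relevance judgments, for illustration only.
ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d2"}

p5 = precision_at_k(ranking, relevant, k=5)
print(p5)
```

Only one of the two relevant documents appears in the top five here, so the second relevant document at rank six does not contribute to the score.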

## Question 2: Indexing & Querying
Index the government documents, run the queries (“topics”) through (vanilla) Whoosh as a baseline system, and run trec_eval to compare Whoosh’s results with human judgements.

    (a)  Save your index (after indexing all the documents) in the provided variable INDEX_Q2, your query  parser  in  the provided  variable QP_Q2,  and  your  searcher  in  the  provided  variable SEARCHER_Q2.
    (b)  How well did the baseline Whoosh system do on your chosen measure?  [Provide the number.]
    (c)  Are there any particular topics where it did very well, or very badly?  [If so, list a few topic IDs for each]
    
    (Note: trec_eval -q will report measures for each query/topic separately as well as the averages. This will help you pinpoint good or bad cases.)
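
The per-topic lines that `trec_eval -q` prints can be collected into a dictionary to make good and bad topics easier to spot. The following is a minimal sketch, assuming each line has three whitespace-separated fields (measure, topic ID, value) as in the output shown in this notebook. The helper name `parse_trec_eval` is hypothetical, and the sample text copies a few lines from that output.

```python
def parse_trec_eval(output_text, measure="P_5"):
    """Map each topic ID to the value reported for one measure."""
    per_topic = {}
    for line in output_text.splitlines():
        fields = line.split()
        # Keep only well-formed lines for the requested measure.
        if len(fields) == 3 and fields[0] == measure:
            per_topic[fields[1]] = float(fields[2])
    return per_topic


# A few lines in the same layout as the trec_eval -q output in this notebook.
sample_output = """\
num_rel_ret           \t1\t0
P_5                   \t1\t0.0000
P_10                  \t1\t0.0000
map                   \t10\t0.1667
"""

p5_by_topic = parse_trec_eval(sample_output)
map_by_topic = parse_trec_eval(sample_output, measure="map")
print(p5_by_topic)
print(map_by_topic)
```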

### Q2 (a): Write your code below

In [3]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q2, your query parser in QP_Q2, and your searcher in SEARCHER_Q2

######   BUILDING THE INDEX   ######
# first, define a Schema for the index
mySchema = Schema(file_path = ID(stored=True),
                  file_content = TEXT(analyzer = RegexTokenizer()))
# the analyzer must be a Whoosh analyzer or tokenizer object;
# RegexTokenizer is the simplest: it splits the text into word tokens and applies no filters (e.g. no lowercasing)
# if you have more structured data, you might want more fields in the schema (title, contents)

# if the index already exists, delete it so a new one can be created
# when rerunning the code without causing problems
if os.path.isdir(INDEX_DIR):
    shutil.rmtree(INDEX_DIR)

# create the directory for the index
os.makedirs(INDEX_DIR)

# create index
myIndex = index.create_in(INDEX_DIR, mySchema)

# first we build a list of all the full paths of the files in DOCUMENTS_DIR
filesToIndex = []
for root, dirs, files in os.walk(DOCUMENTS_DIR):
    filePaths = [os.path.join(root, fileName) for fileName in files if not fileName.startswith('.')]
    filesToIndex.extend(filePaths)
    
# count files to index
print("number of files:", len(filesToIndex))

number of files: 4078


In [4]:
# open writer, helps us write to the index
# buffer accumulates data, then writes it all at the same time, works better when there's a lot of files
myWriter = writing.BufferedWriter(myIndex, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r",  encoding="utf-8") as f:
            fileContent = f.read()
            myWriter.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    # if there is an exception in the try, close the writer to prevent locks
    myWriter.close()

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [5]:
# define a query parser for the field "file_content" in the index
myQueryParser = QueryParser("file_content", schema=myIndex.schema)
mySearcher = myIndex.searcher()

In [6]:
# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile = open(OUTPUT_FILE, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = myQueryParser.parse(topic_phrase)
    topicResults = mySearcher.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile.close()
topicsFile.close()

In [7]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE

num_ret               	1	1
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	16
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.1667
Rprec                 	10	0.0000


In [8]:
INDEX_Q2 = myIndex # Replace None with your index for Q2
QP_Q2 = myQueryParser # Replace None with your query parser for Q2
SEARCHER_Q2 = mySearcher # Replace None with your searcher for Q2

In [9]:
# print the topic file, queries used to evaluate the searcher
!cat $QUER_FILE

1 mining gold silver coal
2 juvenile delinquency
4 wireless communications
6 physical therapists
7 cotton industry
9 genealogy searches
10 Physical Fitness
14 Agricultural biotechnology
16 Emergency and disaster preparedness assistance
18 Shipwrecks
19 Cybercrime, internet fraud, and cyber fraud
22 Veteran's Benefits
24 Air Bag Safety
26 Nuclear power plants
28 Early Childhood Education


In [10]:
# print the first 5000 lines in the qrels file
!head -n 5000 $QRELS_FILE

1 0 G00-00-0681214 0
1 0 G00-00-0945765 0
1 0 G00-00-1006224 1
1 0 G00-00-1591495 0
1 0 G00-00-2764912 0
1 0 G00-00-3253540 0
1 0 G00-00-3717374 0
1 0 G00-01-0270065 0
1 0 G00-01-0400712 0
1 0 G00-01-0682299 0
1 0 G00-01-2154945 0
1 0 G00-01-2689026 0
1 0 G00-01-2898660 0
1 0 G00-02-0146077 0
1 0 G00-02-0351712 0
1 0 G00-02-0510219 0
1 0 G00-02-0555602 0
1 0 G00-02-0901987 1
1 0 G00-02-1239993 0
1 0 G00-02-3981961 0
1 0 G00-02-4057099 0
1 0 G00-03-0366425 0
1 0 G00-03-0697220 0
1 0 G00-03-0931579 0
1 0 G00-03-1585062 0
1 0 G00-03-1898526 1
1 0 G00-04-0296440 0
1 0 G00-04-0767639 0
1 0 G00-04-0971232 0
1 0 G00-04-1046533 0
1 0 G00-04-1562575 0
1 0 G00-04-1864045 0
1 0 G00-04-3558844 0
1 0 G00-04-4166204 0
1 0 G00-05-0619345 0
1 0 G00-05-0623935 0
1 0 G00-05-1315394 0
1 0 G00-05-2231767 0
1 0 G00-05-2357004 0
1 0 G00-05-2948873 0
1 0 G00-05-2988698 0
1 0 G00-05-3051665 0
1 0 G00-05-3539237 0
1 0 G00-06-0872838 0
1 0 G00-06-2562505 0
1 0 G00-06-3965004 0
1 0 G00-06-4038523 0
1 0 G00-07-11

7 0 G00-05-3051665 0
7 0 G00-05-3437619 0
7 0 G00-05-3814025 0
7 0 G00-05-3832940 0
7 0 G00-06-0014141 0
7 0 G00-06-0690672 0
7 0 G00-06-1472959 0
7 0 G00-06-1727652 0
7 0 G00-06-1913581 0
7 0 G00-06-2562505 0
7 0 G00-06-2661322 0
7 0 G00-06-2965032 0
7 0 G00-06-3054024 0
7 0 G00-06-3652121 0
7 0 G00-07-0261374 0
7 0 G00-07-0393523 0
7 0 G00-07-1561076 0
7 0 G00-07-1993864 0
7 0 G00-07-2561555 0
7 0 G00-07-2654723 0
7 0 G00-07-2687847 0
7 0 G00-07-3318478 0
7 0 G00-07-3659195 0
7 0 G00-07-3853050 0
7 0 G00-07-3972872 0
7 0 G00-07-4009621 1
7 0 G00-07-4080218 0
7 0 G00-08-0085414 0
7 0 G00-08-0317811 0
7 0 G00-08-1048959 0
7 0 G00-08-1084598 0
7 0 G00-08-1594396 0
7 0 G00-08-1609604 0
7 0 G00-08-1667883 0
7 0 G00-08-2258958 0
7 0 G00-08-2472523 0
7 0 G00-08-2666680 0
7 0 G00-08-3234501 0
7 0 G00-08-3809731 0
7 0 G00-08-3960707 0
7 0 G00-09-0245064 0
7 0 G00-09-1193469 0
7 0 G00-09-1560577 0
7 0 G00-09-3195749 0
7 0 G00-09-3199095 0
7 0 G00-09-3375933 0
7 0 G00-10-0281044 0
7 0 G00-10-08

19 0 G00-24-3625720 0
19 0 G00-26-2502814 0
19 0 G00-27-1492903 0
19 0 G00-27-3165010 0
19 0 G00-27-4066921 0
19 0 G00-28-2516057 0
19 0 G00-29-1316899 0
19 0 G00-29-4124951 0
19 0 G00-30-3863788 0
19 0 G00-31-0699333 0
19 0 G00-31-1069354 0
19 0 G00-34-2047694 0
19 0 G00-38-2378615 0
19 0 G00-38-2908882 0
19 0 G00-42-1042518 0
19 0 G00-43-3414854 0
19 0 G00-44-2135933 0
19 0 G00-45-0568572 0
19 0 G00-46-3492739 0
19 0 G00-49-3068476 0
19 0 G00-50-0181286 0
19 0 G00-50-0979736 0
19 0 G00-52-1354748 0
19 0 G00-54-0590753 0
19 0 G00-54-2958371 0
19 0 G00-54-3242132 0
19 0 G00-54-3996340 0
19 0 G00-55-0330602 0
19 0 G00-56-1706812 0
19 0 G00-58-3937064 0
19 0 G00-60-2428516 0
19 0 G00-63-0465539 0
19 0 G00-66-3193359 0
19 0 G00-67-0564291 0
19 0 G00-67-3576955 0
19 0 G00-68-2200781 0
19 0 G00-69-1939272 0
19 0 G00-70-3118212 0
19 0 G00-72-1267721 0
19 0 G00-73-0028862 0
19 0 G00-75-1809770 0
19 0 G00-75-2759811 0
19 0 G00-75-3743073 0
19 0 G00-80-0569048 0
19 0 G00-82-2065688 0
19 0 G00-8

24 0 G00-10-0136170 0
24 0 G00-10-0606850 0
24 0 G00-10-0664217 0
24 0 G00-10-0844304 0
24 0 G00-10-1001660 0
24 0 G00-10-1731941 0
24 0 G00-10-2477398 0
24 0 G00-10-3149786 0
24 0 G00-10-3331481 0
24 0 G00-10-3730888 0
24 0 G00-10-3849661 0
24 0 G00-10-3892683 0
24 0 G00-11-0450519 0
24 0 G00-11-0909095 0
24 0 G00-13-3394460 0
24 0 G00-14-1850196 0
24 0 G00-15-1586300 0
24 0 G00-15-4189524 0
24 0 G00-17-3027297 0
24 0 G00-18-2415462 0
24 0 G00-19-2401614 0
24 0 G00-21-2495856 0
24 0 G00-21-2773039 0
24 0 G00-21-3464887 0
24 0 G00-21-3669563 0
24 0 G00-21-4103271 0
24 0 G00-22-1192355 0
24 0 G00-23-1800567 0
24 0 G00-23-3119665 0
24 0 G00-25-3064175 0
24 0 G00-27-0369252 0
24 0 G00-30-1962408 0
24 0 G00-31-0699333 0
24 0 G00-34-0984356 0
24 0 G00-35-3406418 1
24 0 G00-38-2869018 0
24 0 G00-39-0996057 0
24 0 G00-43-0410361 0
24 0 G00-43-1317779 0
24 0 G00-43-2871497 0
24 0 G00-43-3227387 0
24 0 G00-44-0168977 0
24 0 G00-45-0332043 0
24 0 G00-45-2538832 0
24 0 G00-46-0840102 0
24 0 G00-4

### Q2 (b): 
P_5 = 0.0714

### Q2 (c): 

Good topics

Topic ID | P_5
---------|---------
26 | 0.2
24 | 0.2
22 | 0.2
18 | 0.2
14 | 0.2

Poor topics

Topic ID | P_5
---------|---------
1 | 0
10 | 0
16 | 0
2 | 0
28 | 0
4 | 0
6 | 0
7 | 0
9 | 0
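
As a quick consistency check, the per-topic values in the two tables above can be averaged. This is a minimal sketch that assumes the reported overall P_5 is the plain mean over the 14 topics listed in the tables (topic 19 does not appear in them); the dictionary simply copies the table values.

```python
# Per-topic P_5 values copied from the two tables above.
p5_by_topic = {
    "26": 0.2, "24": 0.2, "22": 0.2, "18": 0.2, "14": 0.2,
    "1": 0.0, "10": 0.0, "16": 0.0, "2": 0.0, "28": 0.0,
    "4": 0.0, "6": 0.0, "7": 0.0, "9": 0.0,
}

# Plain (unweighted) mean over the listed topics.
mean_p5 = sum(p5_by_topic.values()) / len(p5_by_topic)
print(round(mean_p5, 4))
```

Five topics at 0.2 and nine at 0 give 1.0 / 14, which rounds to the 0.0714 reported in Q2 (b).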


## Question 3 : Improving Performance
Look at where the baseline Whoosh system did well, or badly.

    (a)  What do you think would improve Whoosh’s performance on this test collection, and why?
    *For the system you aim to improve, you need to (1) understand what documents were highly ranked, (2) what documents should have been highly ranked, and (3) explain false positives (irrelevant documents ranked highly) and false negatives (relevant documents not  ranked  highly)  in  order  to  directly  inform  your  suggested  improvements.  Hence, please  find  one  query  and  explain  one  false  positive  and  one  false  negative  case  and explain each error and how this motivates your suggested modification. (Please note:  it is highly unlikely for two students to choose the same query and same two false positive and false negative examples...  similarity in responses will be reported to the department for investigation as a possible plagiarism case.)
    (b)  Based on your analysis, make any changes you think can improve your baseline.  Run your modified version of Whoosh, and look again at the evaluation measure you chose.  Save your new  index  in  the  provided  variable INDEX_Q3,  your  query  parser  in  the  provided  variable QP_Q3 , and your searcher in the provided variable SEARCHER_Q3.
    (c)  What modifications did you make and what were the improvements?  explain whether there were  overall  improvements  (over  some/all  queries)  in  performance  and  also  whether  either the false negative or false positive cases from part (a) improved.  [1-3 sentences, any single improvement over the baseline is sufficient for full credit, but nonetheless, you are encouraged to explore]
    (d)  Did your changes improve things overall?  [yes/no]
    (e)  Did some queries get better while others got worse?  [yes/no] 
    (f)  What do you think this means for your idea:  was it good?  Why or why not?  [1-3 sentences]

### Q3 (a): 
I picked **Query 28 "Early Childhood Education"**. Of the 22 documents retrieved, the top 3 ranked documents were:
* **G00-75-2371200** 0 24.560041 test
* **G00-93-3702508** 1 21.224782 test
* **G00-99-2279811** 2 20.544861 test

However, 0 of the 2 relevant documents were retrieved. According to the judgment file, the two relevant documents were **G00-54-2576117** and **G00-02-0541868**, neither of which was retrieved. This means both documents are false negatives and all 22 retrieved documents are false positives.


**_False Positive:_**
The document G00-75-2371200 is a false positive because it contains two instances of the phrase "Early Childhood Education", one instance of the phrase "Early Childhood Educators", and two other instances of the phrase "Early Childhood". It also contains 5 instances of the token "Early", 4 instances of the token "Childhood", and 6 instances of the token "Education". These exact-case matches give it a high score even though it was judged not relevant.

**_False Negative:_**
The document G00-54-2576117 is a false negative because it does not contain any instance of the tokens "Early" and "Childhood". It contains only the lowercase forms "early" and "childhood", which the case-sensitive index treats as different tokens.

**_Suggested Modification:_**
I intend to add a lowercase filter to the tokenizer to solve this issue. I expect the relevant documents (at least G00-54-2576117) to be retrieved, which would be an improvement in performance.
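
The case mismatch described above can be reproduced without Whoosh. The following is a minimal sketch using only the standard library: `tokenize` and `matching_terms` are hypothetical helpers, the regular expression `\w+` stands in for the tokenizer, and the document text is made up for illustration rather than taken from G00-54-2576117.

```python
import re


def tokenize(text, lowercase=False):
    """Split text into word tokens, optionally lowercasing each one."""
    tokens = re.findall(r"\w+", text)
    return [t.lower() for t in tokens] if lowercase else tokens


def matching_terms(query, document, lowercase):
    """Return the query tokens that also occur in the document."""
    doc_terms = set(tokenize(document, lowercase))
    return [t for t in tokenize(query, lowercase) if t in doc_terms]


query = "Early Childhood Education"
# Hypothetical document text, for illustration only.
document = "Grants for early childhood Education programs"

case_sensitive = matching_terms(query, document, lowercase=False)
lowercased = matching_terms(query, document, lowercase=True)
print(case_sensitive)
print(lowercased)
```

Without lowercasing only one of the three query terms matches, so a search that requires every term to match would miss this document; with lowercasing applied to both the query and the document, all three terms match.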

### Q3 (b): Write your code below

In [11]:
# lowercase every token so that, e.g., "Early" and "early" are indexed as the same term
# (only LowercaseFilter is added here; no stemming, stop-word, or intra-word filters are used)
stmLwrStpIntraAnalyzer = RegexTokenizer() | LowercaseFilter()

# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q3, your query parser in QP_Q3, and your searcher in SEARCHER_Q3

mySchema2 = Schema(file_path = ID(stored=True),
                   file_content = TEXT(analyzer = stmLwrStpIntraAnalyzer))

# if index exists - remove it
if os.path.isdir(INDEX_DIR2):
    shutil.rmtree(INDEX_DIR2)

# create the directory for the index
os.makedirs(INDEX_DIR2)

# create the index (any existing index directory was removed above)
myIndex2 = index.create_in(INDEX_DIR2, mySchema2)

# open writer
myWriter2 = writing.BufferedWriter(myIndex2, period=20, limit=1000)

try:
    # write each file to index
    for docNum, filePath in enumerate(filesToIndex):
        with open(filePath, "r",  encoding="utf-8") as f:
            fileContent = f.read()
            myWriter2.add_document(file_path = filePath,
                                  file_content = fileContent)
            
            if (docNum % 1000 == 0):
                print("already indexed:", docNum+1)
    print("done indexing.")

finally:
    # save the index
    myWriter2.close()
    

already indexed: 1
already indexed: 1001
already indexed: 2001
already indexed: 3001
already indexed: 4001
done indexing.


In [12]:
# define a query parser for the field "file_content" in the index
myQueryParser2 = QueryParser("file_content", schema=myIndex2.schema)
mySearcher2 = myIndex2.searcher()

# Load topic file - a list of topics (search phrases) used for evaluation
topicsFile = open(QUER_FILE,"r")
topics = topicsFile.read().splitlines()

# create an output file to which we'll write our results
outputTRECFile2 = open(OUTPUT_FILE2, "w")

# for each evaluated topic:
# build a query and record the results in the file in TREC_EVAL format
for topic in topics:
    topic_id, topic_phrase = tuple(topic.split(" ", 1))
    topicQuery = myQueryParser2.parse(topic_phrase)
    topicResults = mySearcher2.search(topicQuery, limit=None)
    for (docnum, result) in enumerate(topicResults):
        score = topicResults.score(docnum)
        outputTRECFile2.write("%s Q0 %s %d %lf test\n" % (topic_id, os.path.basename(result["file_path"]), docnum, score))

# close the topic and results file
outputTRECFile2.close()
topicsFile.close()

In [13]:
!$TREC_EVAL -q $QRELS_FILE $OUTPUT_FILE2

num_ret               	1	3
num_rel               	1	5
num_rel_ret           	1	0
map                   	1	0.0000
Rprec                 	1	0.0000
bpref                 	1	0.0000
recip_rank            	1	0.0000
iprec_at_recall_0.00  	1	0.0000
iprec_at_recall_0.10  	1	0.0000
iprec_at_recall_0.20  	1	0.0000
iprec_at_recall_0.30  	1	0.0000
iprec_at_recall_0.40  	1	0.0000
iprec_at_recall_0.50  	1	0.0000
iprec_at_recall_0.60  	1	0.0000
iprec_at_recall_0.70  	1	0.0000
iprec_at_recall_0.80  	1	0.0000
iprec_at_recall_0.90  	1	0.0000
iprec_at_recall_1.00  	1	0.0000
P_5                   	1	0.0000
P_10                  	1	0.0000
P_15                  	1	0.0000
P_20                  	1	0.0000
P_30                  	1	0.0000
P_100                 	1	0.0000
P_200                 	1	0.0000
P_500                 	1	0.0000
P_1000                	1	0.0000
num_ret               	10	24
num_rel               	10	1
num_rel_ret           	10	1
map                   	10	0.3333
Rprec                 	10	0.0000


In [14]:
INDEX_Q3 = myIndex2 # Replace None with your index for Q3
QP_Q3 = myQueryParser2 # Replace None with your query parser for Q3
SEARCHER_Q3 = mySearcher2 # Replace None with your searcher for Q3

### Q3 (c): 
**_Modification made:_** Added a lowercase filter after the tokenizer.

**_Query 28 Improvements:_** 21 of the 26 metrics improved for query 28. More specifically, 14 more documents were retrieved (from 22 to 36). Precision at 20, 30, ..., 100 increased from 0; for example, P_20 rose by 5 percentage points (from 0.0 to 0.05). Both relevant documents were now retrieved (from 0/2 to 2/2), which means both false negatives became true positives.

**_Overall Improvements:_** Overall, performance improved. For example, P_5 increased by 3.53 percentage points (from 0.0714 to 0.1067), and the share of relevant documents retrieved increased by about 13 percentage points (from 7/33 to 12/35).
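
The changes quoted above are differences between two proportions, so they are best read as percentage points rather than relative percentages. A minimal sketch of both calculations follows; the helper name `change` is hypothetical, and the inputs are the figures reported in this answer.

```python
def change(before, after):
    """Return (difference in percentage points, relative change in percent)."""
    absolute_points = (after - before) * 100
    relative_percent = (after - before) / before * 100
    return absolute_points, relative_percent


# P_5 before and after adding the lowercase filter.
p5_points, p5_relative = change(0.0714, 0.1067)

# Share of relevant documents retrieved before and after.
recall_points, recall_relative = change(7 / 33, 12 / 35)

print(round(p5_points, 2), round(p5_relative, 1))
print(round(recall_points, 1), round(recall_relative, 1))
```

The relative figures are much larger than the percentage-point differences because both baselines were low.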

### Q3 (d): 
Yes. Overall improvements are stated above.

### Q3 (e): 
No. The change only made positive improvements to the query results.

### Q3 (f): 
I think this means my idea was good but not great. While it did show positive improvements for the query of interest (query 28) and for the overall result, the improvements could possibly have been better. For example, further refinements such as stemming could have increased the precision at 5, 10, or 15.

## Question 4 (Graduate Students)

In [None]:
GRAD_STUDENT = False # change to True if you are a grad student

### Q4 (a): Provide answer to Q4 (a) here [markdown cell]

### Q4 (b): Write your code below

In [None]:
# Put your code for creating the index here (you can add more cells).
# Make sure you save the final index in the variable INDEX_Q4, your query parser in QP_Q4, and your searcher in SEARCHER_Q4

In [None]:
INDEX_Q4 = None # Replace None with your index for Q4
QP_Q4 = None # Replace None with your query parser for Q4
SEARCHER_Q4 = None # Replace None with your searcher for Q4

### Q4 (c): Provide answer to Q4 (c) here [markdown cell]

### Q4 (d): Provide answer to Q4 (d) here [markdown cell]

### Q4 (e): Provide answer to Q4 (e) here [markdown cell]

### Q4 (f): Provide answer to Q4 (f) here [markdown cell]

## Validation

In [None]:
# Run the following cells to make sure your code returns the correct value types

In [15]:
from whoosh.index import FileIndex
from whoosh.qparser import QueryParser
from whoosh.searching import Searcher
import os.path

### Path Validation

In [19]:
assert "MATERIALS_DIR" in globals(), "variable MATERIALS_DIR does not exist"
assert(os.path.isdir(os.path.join(MATERIALS_DIR))), "MATERIALS_DIR folder does not exist"
assert(os.path.isdir(os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2"))), "invalid folder structure"
assert(os.path.isdir(os.path.join(MATERIALS_DIR, r"DSS_Fall2017_Assign2\government\documents"))), "invalid folder structure"
print("Paths validated")

Paths validated


### Q2 Validation

In [20]:
assert(isinstance(INDEX_Q2, FileIndex)), "Index Type"
assert(isinstance(QP_Q2, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q2, Searcher)), "Searcher Type"
print("Q2 Types Validated")

Q2 Types Validated


### Q3 Validation

In [21]:
assert(isinstance(INDEX_Q3, FileIndex)), "Index Type"
assert(isinstance(QP_Q3, QueryParser)), "Query Parser Type"
assert(isinstance(SEARCHER_Q3, Searcher)), "Searcher Type"
print("Q3 Types Validated")

Q3 Types Validated


### Q4 Validation (Graduate Students)

In [None]:
assert((not GRAD_STUDENT) or isinstance(INDEX_Q4, FileIndex)), "Index Type"
assert((not GRAD_STUDENT) or isinstance(QP_Q4, QueryParser)), "Query Parser Type"
assert((not GRAD_STUDENT) or isinstance(SEARCHER_Q4, Searcher)), "Searcher Type"
print("Q4 Types Validated")

## References:
Code from the provided IR_Lab.ipyb file was used as a primary source for this assignment.