# DSCI 563 Lab Assignment 4: Information Retrieval (Cheat sheet)

## Assignment Objectives

In this assignment you will

- Set up a Whoosh index using newsgroup posts 
- Build some basic information retrieval systems and compare them to Whoosh output


## What does this have to do with what we are doing in class?

In this lab you will be using Whoosh, just like we saw in class. You'll then build an alternative search engine that does the stuff that Whoosh is doing for you.

Some things to keep in mind:

* By default, the Whoosh parsers and analyzers will combine all search terms with "AND".  If you want to have an OR or a NOT, you will need to include them (all uppercase) as well as appropriate parentheses, in the query.
* Debug thoroughly!  Make sure that when you run a query through a parser, it's producing what you would expect!

In [39]:
#provided code
import sys
!{sys.executable} -m pip install whoosh



Run the code below to access relevant modules (you can add to this as needed)

In [40]:
#provided code
import numpy as np
from math import log
from scipy.spatial.distance import pdist,squareform
from collections import defaultdict,Counter
from sklearn.datasets import fetch_20newsgroups
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction import DictVectorizer
from whoosh.qparser import *
from whoosh.fields import Schema, TEXT, KEYWORD, NUMERIC
from whoosh.analysis import RegexTokenizer, LowercaseFilter, StopFilter, StemFilter ### These components tokenize, filter, and stem the data
from whoosh import index ### We'll create an index

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:

- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)
- This is a *paired* lab.  Please pair up with another student - I recommend working with someone you haven't submitted a lab with, yet.  Always working with the same partner means that you come to rely on their skills, and don't get exposed to any other skill sets.  Everyone in the cohort has similar goals, but different strengths and weaknesses - there are no "weak" partners.
- Create a repo and grant access to all instructors.  You only need to submit a single lab for your team.



### Exercise 1: Whoosh index

In this lab, we will be using a classic IR corpus, the "20 newsgroups" dataset which consists of over 11k posts from 20 newsgroups, [accessible through sklearn](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). The provided code below extracts the information from the original plaintext format into a list of dictionaries `newsgroup_info_dicts`, where the dictionary for each post includes not just the `Text` body, but also the `Newsgroup`, the original poster (`From`), their organization (`Organization`) and the `Subject` of the post.  Before moving on to the later parts of the lab, take a look at what the post looks like - you'll be indexing specific information in the post, so it is important that you understand how the data looks.

In [41]:
#provided code
def get_post_info_dict(full_text):
    '''given a newsgroup post in typical format, extracts the header as a 
    dict, including the body of the text in the "Text" field'''
    info_dict = {}
    header_boundry = full_text.find("\n\n")  ### Header is separated by 2 newlines
    for line in full_text[:header_boundry].split("\n"): ### Split the header by lines
        first_colon = line.find(": ") ### Everything before the colon is a topic (ie, author, subject, etc.); everything after is more information
        info_dict[line[:first_colon]] = line[first_colon +2:]
    info_dict["Text"] = full_text[header_boundry + 2:].strip()
    return info_dict

newsgroups = fetch_20newsgroups(remove=('footers', 'quotes'))
newsgroup_info_dicts = []
print("LENGTH: ", len(newsgroups.data))
for i in range(len(newsgroups.data)):
    info_dict =  get_post_info_dict(newsgroups.data[i])
    info_dict["Newsgroup"] = newsgroups.target_names[newsgroups.target[i]]
    if 20 < len(info_dict["Text"]) < 50000: # get rid of very big or very small texts
        newsgroup_info_dicts.append(info_dict)
newsgroup_info_dicts[0]

LENGTH:  11314


{'From': "lerxst@wam.umd.edu (where's my thing)",
 'Subject': 'WHAT car is this!?',
 'Nntp-Posting-Host': 'rac3.wam.umd.edu',
 'Organization': 'University of Maryland, College Park',
 'Lines': '15',
 'Text': 'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
 'Newsgroup': 'rec.autos'}

Each post provides the following fields for your index:

* `text` for the main text, 
* `newsgroup` for the name of the newsgroup,
* `subject` for the title of the post, 
* `author` for the author, and 
* `organization` which contains the name of the poster's organization. 

#### Exercise 1.1 
rubric={accuracy:2}

First, you're going to define two different analyzers:

1. An analyzer for the text and subject fields (`text_analyzer`) which tokenizes, lowercases, removes stopwords, and stems, in that order (if you put the stemmer first, it will block some stopword removal!  Think a bit about why this happens.).
2. An analyzer for the newsgroup field (`group_analyzer`) which tokenizes on ".". This will tokenize newsgroups such as "comp.os.ms-windows.misc" into "comp", "os", "ms-windows", "misc". No filters are needed for this analyzer.  Hint: The regex tokenizer is looking for a pattern to match in each of its words, not the pattern to split on.  How will you write a regex that captures everything *except* the period?
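
The difference between a pattern that *matches tokens* and a pattern that *splits* can be seen with plain Python `re`, independent of Whoosh. This is only a sketch of the regex idea; passing such a pattern to Whoosh's regex tokenizer is left to you.

```python
import re

newsgroup = "comp.os.ms-windows.misc"

# [^.]+ matches one or more characters that are NOT a period,
# so each match is one piece of the newsgroup name.
pieces = re.findall(r"[^.]+", newsgroup)
print(pieces)  # ['comp', 'os', 'ms-windows', 'misc']

# For comparison, splitting on the period gives the same pieces here,
# but the pattern describes the separator rather than the token.
split_pieces = re.split(r"\.", newsgroup)
print(split_pieces == pieces)  # True
```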

You will then create a schema with text, subject, group, author, and organization fields, using the analyzers above for the first three.

You will not be tokenizing the author or organization fields, so you can use the `KEYWORD` field type (this is very similar to how we stored the "genre" type in class: by specifying `KEYWORD`, you're telling the index not to do any special parsing).  

I recommend that you create a `QueryParser` for the text, group, author, and organization, and test that they are parsing correctly. 
For example, `TextParser=QueryParser("text", schema=schema)` where `schema` is `Schema(text=TEXT(analyzer= ...))`

"We are testing the parser" in the text should return (text:test AND text:parser), while "comp.os.ms-windows.misc" in the group should return (group:comp AND group:os AND group:ms-windows AND group:misc).  The author and organization should only split by word.

If you're curious, test queries that have "OR", and "NOT", as well.

In [42]:
text_analyzer = 
group_analyzer = 
schema = Schema(text=TEXT(analyzer=text_analyzer, stored=True), ... )

In [43]:
TextParser = 
GroupParser = 
AuthorParser = 
OrganizationParser = 
SubjectParser = 

print(TextParser.parse("We are testing the parser"))
print(GroupParser.parse("comp.os.ms-windows.misc"))
print(AuthorParser.parse("Charles Dickens"))
print(OrganizationParser.parse("Microsoft"))
print(SubjectParser.parse("FloPPY Disk"))

(text:test AND text:parser)
(group:comp AND group:os AND group:ms-windows AND group:misc)
(author:Charles AND author:Dickens)
organization:Microsoft
(subject:floppi AND subject:disk)


#### Exercise 1.2 
rubric={accuracy:2}

Now, create the index for your posts and iterate over `newsgroup_info_dicts` to add each post to your index. You should give each post an `id` equal to its position in `newsgroup_info_dicts`.  `create_in` creates the index in the specified directory.  Make sure to call `writer.commit()` after adding the documents to the index!  If you don't, and the index ends up locked, you can usually reset things by restarting your kernel.  Keep in mind that every time you call `create_in`, it will overwrite your old index. `newsgroup_info_dicts` has ~11K entries, so you might want to print a progress tracker.

In [44]:
import os.path
from whoosh.index import create_in


# mkdir, then create_in; 

# open_dir

# create a writer object to add documents to the index


In [45]:
for id, info_dict in enumerate(newsgroup_info_dicts):
    if id%1000 == 0: 
        print(id)

    # get each field from the dictionary
        
    # then, add_document

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000


In [46]:
# finally, you should `commit`


#### Exercise 1.3 
rubric={accuracy:1}

Test your index by printing out all the authors of posts in one of the "comp" newsgroups (there is more than one) whose subject contains the word *floppy*. There should be 10 of them, and one of them is named *Stig*.

In [47]:
parser = QueryParser("text",schema=schema)

with ix.searcher() as searcher:
    # TextParse: "comp" newsgroups & the word *floppy*
    
    # search, where the results object acts like a list of the matched documents.
    results = 
    print (results)
    
    for r in results:
        # print author; 

# print length; 

<Top 10 Results for And([Term('group', 'comp'), Term('subject', 'floppi')]) runtime=0.00213558399991598>
limagen@hpwala.wal.hp.com
jdresser@altair.tymnet.com (Jay Dresser)
towwang@statler.engin.umich.edu (Tow Wang Hui)
venaas@flipper.pvv.unit.no (Stig Venaas)
balog@eniac.seas.upenn.edu (Eric J Balog)
tcking@uswnvg.com (Tim King)
jtrascap@nyx.cs.du.edu (Jim Trascapoulos)
bagels@gotham.East.Sun.COM (Alex Beigelman - NYC SE)
flyboy@spf.trw.com (Jeff Wright)
cctr132@csc.canterbury.ac.nz (Nick FitzGerald, PC Software Consultant, CSC, UoC, NZ)
10


### Exercise 2: Boolean IR

In this exercise, you will create your own boolean IR system and make sure it works the same way as Whoosh.

#### 2.1
rubric={accuracy:2}

First, create an inverted index (`inverted_index`) for the 20 newsgroup corpus using a Python `dict` (or `defaultdict`). As you did above, use each post's index in `newsgroup_info_dicts` as its document id. To keep the preprocessing consistent with Whoosh, run each text through the `text_analyzer` you built in 1.1 (a Whoosh analyzer is callable, so `text_analyzer(text)` yields token objects; read the `.text` attribute of each token as you iterate) and deal with the somewhat messy output.
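
The structure of an inverted index is easiest to see on a toy corpus. The sketch below assumes the documents have already been tokenized (the token lists are made up for illustration) and maps each token to the set of document ids that contain it.

```python
from collections import defaultdict

# Hypothetical pre-tokenized documents; the list position is the document id.
toy_docs = [
    ["dog", "bark"],
    ["cat", "sleep"],
    ["dog", "cat", "plai"],
]

toy_inverted = defaultdict(set)
for doc_id, tokens in enumerate(toy_docs):
    for token in tokens:
        # A set ignores repeats, so each document is recorded once per token.
        toy_inverted[token].add(doc_id)

print(sorted(toy_inverted["dog"]))  # [0, 2]
print(sorted(toy_inverted["cat"]))  # [1, 2]
```

One thing to watch for: indexing a `defaultdict` with a missing key creates that key, so use `in` or `.get` when you only want to check whether a token is present.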

In [48]:
# count = 0
# # your code here
# def get_tokens(text, analyzer):
#     '''applies the Whoosh analyzer on the Text field of the info_dict
#     and returns a list of tokens'''
#     ...
#     tokens = []
   
inverted_index = defaultdict(set)

 
# iterate newsgroup_info_dicts
#   iterate your tokens (`get_tokens`)
#        then, add document to inverted_index[token]; 

In [49]:
print(inverted_index["dog"])
print(len(inverted_index["dog"]))

{3969, 1026, 9091, 6660, 10372, 904, 1802, 3339, 4492, 10122, 5906, 7955, 7572, 9086, 2838, 2456, 2201, 10779, 9628, 1053, 4773, 8613, 5674, 4526, 7471, 7726, 3889, 8758, 10039, 10166, 2877, 10814, 6719, 8255, 9284, 5319, 9543, 4429, 6093, 4815, 9294, 2513, 5973, 5975, 5848, 6615, 1626, 218, 7644, 94, 3835, 9823, 8673, 6754, 8546, 484, 101, 9952, 5738, 9580, 4079, 112, 4720, 6768, 758, 6518, 10486, 2811, 9084, 126}
70


In [50]:
assert "the" not in inverted_index
assert len(inverted_index["dog"]) == 70 
assert 94 in inverted_index["dog"]
print("Success!")

Success!


#### 2.2
rubric={accuracy:3,efficiency:1}

Now you're going to create a boolean IR engine. You should complete the recursive `get_docs` function, which takes your inverted index and a search "expression" and returns the set of all document ids which satisfy the expression. An expression is either a string (a single word) or a 2-tuple where the first element is a boolean operator ("or", "and", or "not") and the second is either a tuple of expressions (if the operator is "or" or "and") or a single expression (if the operator is "not"). HINT: for "not", a set consisting of all the document ids will be useful, and is provided below. For example, the following call to `get_docs`:

`get_docs(inverted_index, ("and", ("hit", ("not", ("or", ("base", "ball", "run"))))))`

means that you want documents that contain the word *hit* and not any of the words *base*, *ball*, or *run*.  

The base case which gets the relevant documents for a single word using your inverted index is provided for you. You'll want to use set operators extensively here.
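
As a rough illustration of how such nested expressions map onto set operations, here is a sketch on a hand-made index. The names `toy_index`, `toy_all_docs`, and `eval_expr` are hypothetical, and this is one possible shape for the recursion rather than the required solution.

```python
# Hypothetical inverted index over five documents (ids 0..4).
toy_index = {"hit": {0, 1, 2, 3}, "base": {1}, "ball": {1, 2}, "run": {4}}
toy_all_docs = set(range(5))


def eval_expr(index, expression, universe):
    """Evaluate a nested boolean expression and return a set of doc ids."""
    if isinstance(expression, str):
        # Base case: a single word; unknown words match nothing.
        return index.get(expression, set())
    operator, operands = expression
    if operator == "not":
        # "not" takes a single expression: complement within the universe.
        return universe - eval_expr(index, operands, universe)
    # "and" / "or" take a tuple of expressions: evaluate each one first.
    operand_sets = [eval_expr(index, operand, universe) for operand in operands]
    if operator == "and":
        return set.intersection(*operand_sets)
    if operator == "or":
        return set.union(*operand_sets)
    raise ValueError(f"unknown operator: {operator}")


expr = ("and", ("hit", ("not", ("or", ("base", "ball", "run")))))
print(sorted(eval_expr(toy_index, expr, toy_all_docs)))  # [0, 3]
```

Here the inner "or" gives {1, 2, 4}, the "not" flips that to {0, 3}, and intersecting with the documents for *hit* leaves {0, 3}.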

In [51]:
all_docs = set(range(len(newsgroup_info_dicts)))

def get_docs(inverted_index,expression):
    '''given an inverted index which provides a mapping from words to documents 
    in a collection, evaluates expression according to boolean logic and 
    returns the set of document ids for which the expression holds
    '''
    if isinstance(expression, str):
        return inverted_index.get(expression, set())
    else:
        operator_type,operands = expression
        # your code here
        if operator_type == "not":
            # results = all_docs - (result of get_docs on the operand)
            return all_docs - ...
            
        elif operator_type == "and":
            # results = ...
            return results

        elif operator_type == "or":
            # results = ...
            return results
        # your code here    



#### 2.3
rubric={accuracy:1,quality:1}

Here, you will create tests which assert that the output of your boolean IR system is the same as that of Whoosh. You should have at least three tests:

1. Documents that contain the word *hit*
2. Documents that contain the words *hit*, *home*, and *run*
3. Documents that contain the word *hit* and not any of the words *base*, *ball*, or *run*. 

You should write a function that takes a Whoosh search and returns a set of document ids, the same output as `get_docs` (so it is easy to compare). Remember that you'll need to increase the default number of returned hits for Whoosh.

In [52]:
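def get_hit_ids(searcher, query, limit=None):
    '''get a set of ids returned from a Whoosh search'''
    # third argument is assumed to be the hit limit (None = no limit)
    ids = set()

    # parse the query, run the search, and add the id of each hit to ids
    return ids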


In [53]:
BoolIR = get_docs(inverted_index, ("dog"))

In [54]:
BoolIR

{94,
 101,
 112,
 126,
 218,
 484,
 758,
 904,
 1026,
 1053,
 1626,
 1802,
 2201,
 2456,
 2513,
 2811,
 2838,
 2877,
 3339,
 3835,
 3889,
 3969,
 4079,
 4429,
 4492,
 4526,
 4720,
 4773,
 4815,
 5319,
 5674,
 5738,
 5848,
 5906,
 5973,
 5975,
 6093,
 6518,
 6615,
 6660,
 6719,
 6754,
 6768,
 7471,
 7572,
 7644,
 7726,
 7955,
 8255,
 8546,
 8613,
 8673,
 8758,
 9084,
 9086,
 9091,
 9284,
 9294,
 9543,
 9580,
 9628,
 9823,
 9952,
 10039,
 10122,
 10166,
 10372,
 10486,
 10779,
 10814}

In [55]:
WhooshIR = get_hit_ids(ix.searcher(), "dog", None)

In [56]:
WhooshIR

{94,
 101,
 112,
 126,
 218,
 484,
 758,
 904,
 1026,
 1053,
 1626,
 1802,
 2201,
 2456,
 2513,
 2811,
 2838,
 2877,
 3339,
 3835,
 3889,
 3969,
 4079,
 4429,
 4492,
 4526,
 4720,
 4773,
 4815,
 5319,
 5674,
 5738,
 5848,
 5906,
 5973,
 5975,
 6093,
 6518,
 6615,
 6660,
 6719,
 6754,
 6768,
 7471,
 7572,
 7644,
 7726,
 7955,
 8255,
 8546,
 8613,
 8673,
 8758,
 9084,
 9086,
 9091,
 9284,
 9294,
 9543,
 9580,
 9628,
 9823,
 9952,
 10039,
 10122,
 10166,
 10372,
 10486,
 10779,
 10814}

In [57]:
BoolIR == WhooshIR

True

In [58]:
#tests here
#your code here
assert get_docs(inverted_index,("hit")) ==  get_hit_ids(ix.searcher(), "hit", None)
print("Success!")
# your code here

Success!


### Exercise 3: Document relevance with Okapi BM25

In this exercise, you are again going to mimic the output of Whoosh by implementing [Okapi BM25 document relevance](https://en.wikipedia.org/wiki/Okapi_BM25), which is a special version of tf-idf (not the same as the one used in sklearn!), as well as a simplified version of the vector-space model.

#### 3.1
rubric={accuracy:2,quality:1}

First you are going to calculate the *idf* part of the equation, as well as the average length of the texts (after preprocessing), which will be used in 3.2. First iterate over the corpus (`newsgroup_info_dicts`) and, after again applying your Whoosh text analyzer, calculate the document frequency for each term, as well as the average length. Then, iterate over your df dict and create a corresponding idf dict. For Okapi BM25, idf is calculated as follows:

$$\text{IDF}(q_i) = \ln (\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}+1)$$

where $q_i$ is the term (word type), $n(q_i)$ its document frequency, and $N$ the total number of documents. 
For quality, it's a good idea to have a function which calculates this idf. 
**Therefore, implement `calculate_idf(n, N)`, where `n` is the document frequency $n(q_i)$ and `N` is the total number of documents.** 
You can use the Elasticsearch `explain` method shown in lecture to get some test cases.
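
To see the formula in action, here is a direct transcription into Python. The function name `bm25_idf` is hypothetical (chosen so it does not clash with your own `calculate_idf`), and the second pair of calls uses made-up counts just to show the direction of the effect.

```python
from math import log


def bm25_idf(n, N):
    """Okapi BM25 idf for a term with document frequency n in N documents."""
    return log((N - n + 0.5) / (n + 0.5) + 1)


# With n = N = 2 the fraction is 0.5 / 2.5 = 0.2, so the idf is ln(1.2).
edge_case = bm25_idf(2, 2)
print(edge_case)

# A rare term should score higher than a very common one.
rare = bm25_idf(1, 1000)
common = bm25_idf(900, 1000)
print(rare > common)  # True
```

The `+ 1` inside the logarithm keeps the idf positive even for a term that appears in every document, which is why the `n = N` case above does not drop to zero or below.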

In [59]:
def calculate_idf(n, N):
    '''calculates Okapi BM25 idf based on the df n and the total number of
    documents N'''
    # your code here
    idf = log( ... )
    return idf 

print(calculate_idf(2, 2))  ### Should be 0.1823215567939546

0.1823215567939546


The quantities to compute are:

- `total_length` = total number of tokens over all texts in `newsgroup_info_dicts` (tokenize with `text_analyzer`, as in 2.1)
- `df` = document frequency of each token (hint: use `Counter()`); a term's df is the `n` passed to `calculate_idf`
- `N` = length of `newsgroup_info_dicts`
- `avg_length = total_length/N`
- `idf_dict` = `calculate_idf(n, N)` for each term
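
The bookkeeping above can be sketched on a toy corpus. The token lists and the `toy_` names are made up for illustration; the point to notice is that document frequency counts each document at most once per term, while the length total counts every token.

```python
from collections import Counter

# Hypothetical pre-tokenized documents.
toy_docs = [
    ["dog", "dog", "bark"],
    ["cat"],
    ["dog", "cat"],
]

toy_dfs = Counter()
toy_total_length = 0
for tokens in toy_docs:
    # set(tokens) removes repeats, so "dog" adds 1 for the first document, not 2.
    toy_dfs.update(set(tokens))
    # Document length counts every token, repeats included.
    toy_total_length += len(tokens)

toy_N = len(toy_docs)
toy_avg_length = toy_total_length / toy_N

print(toy_dfs["dog"], toy_dfs["cat"], toy_dfs["bark"])  # 2 2 1
print(toy_avg_length)  # 2.0
```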

In [60]:
# your code here

dfs = Counter()                         # document frequency of each token in info_dict["Text"]
total_length = 0                        # running total of tokens over all documents
N = len(newsgroup_info_dicts)

# iterate newsgroup_info_dicts to get_tokens and update dfs (and total_length); 

avg_length = total_length/N

idf_dict = {}

# then, iterate dfs for idf using `calculate_idf(n, N)`;
# N is given
# n is an item in dfs

#### 3.2
rubric={accuracy:2,quality:1}

Now for the *tf* part, which will need to be calculated for each document. Iterate through the corpus again, and, for each document, first build a tf dictionary (a count of terms in the document), and then use that and the document frequencies you calculated in 3.1 to assign a tf-idf value for each term in each document. You are not using the raw term frequency, but a special term frequency calculated as follows:

$$\text{BM25-tf}(D,Q) = \frac{f(q_i, D)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}$$

where $q_i$ is the current term, $f(q_i, D)$ is its frequency in the current document (the term frequency you have calculated), $k_1$ is the term saturation parameter with a (default) value of 1.2, $b$ is the length normalization parameter with a (default) value of 0.75 (you can just use these values directly, you don't need to make them parameters here), $|D|$ is the token length of the current document, and $\text{avgdl}$ is the average document length you calculated in 3.1. 
Again, you should have a separate function which calculates this special BM25-tf, again with some test cases which should ideally come from Elasticsearch `explain`. 
**Therefore, you implement `BM25_tf(tf, doc_length, avg_length)` where `tf` is $f(q_i, D)$, `doc_length` is $|D|$, and `avg_length` is avgdl**
Then you should multiply it by the idf for that term you calculated in **3.1** to get a tf-idf score for that term in that document. When you have created a dictionary of all tf-idf scores in a document, append that dictionary to a list.
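
The effect of the length normalization is easier to see with numbers. The sketch below transcribes the formula with the default values hard-coded; the name `bm25_tf_sketch` and the constants `K1` and `B` are hypothetical names for this illustration only.

```python
K1 = 1.2   # term saturation parameter (default value from the text)
B = 0.75   # length normalization parameter (default value from the text)


def bm25_tf_sketch(tf, doc_length, avg_length):
    """BM25 term-frequency component for one term in one document."""
    length_norm = 1 - B + B * doc_length / avg_length
    return tf / (tf + K1 * length_norm)


# Document of exactly average length: length_norm is 1, so 2 / (2 + 1.2).
average_doc = bm25_tf_sketch(2, 9, 9)
print(average_doc)

# Same raw count in a document twice as long gets a lower value.
long_doc = bm25_tf_sketch(2, 18, 9)
print(long_doc < average_doc)  # True

# The value saturates: it grows with tf but stays below 1.
print(bm25_tf_sketch(100, 9, 9) < 1)  # True
```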

In [61]:
# k_1 = 1.2
# b = 0.75

# def BM25_tf(tf, doc_length, avg_length):
#     # your code here
#     tf_value = tf / (tf + ... )
#     return tf_value

BM25_tf(2,9,9) ### Should be 0.625

0.625

For each item in `newsgroup_info_dicts`:
- apply `text_analyzer` to the text to get all tokens
- `doc_length` = the number of tokens
- `tf_dict` = the count of each term in the document
- `tfidf_dict[term] = BM25_tf(...) * idf_dict[term]` for each term (type)
- finally, `tfidfed_corpus.append(tfidf_dict)` 

In [62]:
# your code here
tfidfed_corpus = []

for info_dict in newsgroup_info_dicts:
    tfidf_dict = {}
    ...
    doc_length = ...
    for term,value in tf_dict.items():
        tfidf_dict[term] = BM25_tf(value, doc_length, avg_length)*idf_dict[term]            # avg_length (from 3.1)
    
    tfidfed_corpus.append(tfidf_dict)

In [63]:
tfidfed_corpus[0]

{'wa': 1.0479337681490617,
 'wonder': 1.8735039967194669,
 'anyone': 1.5475372870680149,
 'out': 0.8636541627604866,
 'there': 0.7142514227210075,
 'could': 1.1068030003948344,
 'enlighten': 3.484161348259486,
 'me': 0.8433127347025031,
 'car': 2.666514410203664,
 'saw': 2.1538572662151334,
 'other': 0.8976704330308405,
 'dai': 1.4743743759915375,
 'door': 3.1764122676129545,
 'sport': 2.6546920772093427,
 'look': 1.4531881704497596,
 'late': 2.304552087215622,
 '60': 2.3159553446871413,
 'earli': 2.1692814584620717,
 '70': 2.555079292665207,
 'call': 1.3033830350270366,
 'bricklin': 4.652778513628567,
 'were': 1.0990566413761538,
 'realli': 1.3659512811809624,
 'small': 1.8748433735110683,
 'addit': 2.128338364879106,
 'front': 2.230475356162961,
 'bumper': 3.529460747022825,
 'separ': 2.3798240685565886,
 'rest': 2.0380158654396667,
 'bodi': 2.1262618477543906,
 'all': 0.7306445111637885,
 'know': 0.8876413177849213,
 'tellm': 5.14189911389988,
 'model': 2.0506048315224845,
 'name': 

#### 3.3
rubric={accuracy:2,quality:1}

Now you are going to build a relevance ranking search engine. Create a function `BM25IR` which takes a raw English query as input. You should:
- again use your text analyzer to convert the query to tokens compatible with your corpus,
- iterate over the corpus (`tfidfed_corpus`),
- **for each term in the query, sum the corresponding tf-idf values** from the document (if a term does not appear, its tf-idf defaults to 0) to get a relevance score for each document, and
- rank all the documents in your corpus and **return an ordered list of the top 10 ids**, highest ranked first.  
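
The scoring and ranking steps can be sketched on a tiny hand-made corpus of per-document score dictionaries. The scores, the `toy_scored_corpus` name, and the `rank_docs` helper are all hypothetical; the query is assumed to be tokenized already.

```python
# Hypothetical per-document tf-idf dictionaries; list position is the doc id.
toy_scored_corpus = [
    {"car": 2.0, "door": 1.0},
    {"car": 0.5},
    {"engin": 3.0, "door": 0.25},
]


def rank_docs(query_terms, scored_corpus, top_k=10):
    """Return doc ids ordered by summed term scores, highest first."""
    scored = []
    for doc_id, term_scores in enumerate(scored_corpus):
        # Terms missing from the document contribute 0.
        relevance = sum(term_scores.get(term, 0.0) for term in query_terms)
        scored.append((relevance, doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


ranking = rank_docs(["car", "door"], toy_scored_corpus, top_k=2)
print(ranking)  # [0, 1]
```

For the query terms *car* and *door*, the three documents score 3.0, 0.5, and 0.25, so document 0 ranks first and document 2 falls outside the top 2.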

In [64]:
# def BM25IR(query, analyzer, tfidfed_corpus):
#     # your code here

def BM25IR(query, analyzer, tfidfed_corpus):
    rankings = []
    query_terms = get_tokens(query, analyzer)
    # iterate tfidfed_corpus .. 

    return # top 10 ids; 

#### 3.4
rubric={accuracy:1,quality:1}

Create at least 3 example queries that are relevant to topic matters in the corpus and have at least 3 interesting (not stop) words. Print out the top document for each query as determined by your system, and show that your output is *almost* the same as the results of Whoosh using your index from Exercise 1 by comparing the two using P@10 (i.e., consider documents in the top 10 ranked results of Whoosh to be the "relevant" documents for this calculation); if you have done the rest of this problem correctly, you should get 1 or 0.9 for nearly all queries.

Be careful!  Your search engine should be using the text_analyzer you created above, but the Whoosh engine should take the result of the TextParser.  It's a fine distinction, but the Parser produces a generator, while the analyzer produces a list.

FYI, your results may not be exactly the same as Whoosh's because Whoosh stores only approximations for some values. That is, your implementation of Okapi BM25 is in fact more accurate! You will note that it is also quite fast (and could be made even faster with the inverted index from Exercise 2); the main efficiency issue here is that you're using quite a lot of memory with all those Python dicts. Generally, this approach would not scale well to millions of documents.
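
P@10 as described above reduces to a set intersection. The sketch below uses made-up document ids and a hypothetical helper name, `precision_at_k`; the second argument stands in for the ids in Whoosh's top 10.

```python
def precision_at_k(predicted_ids, relevant_ids, k=10):
    """Fraction of the top-k predicted ids that appear in relevant_ids."""
    top_k = predicted_ids[:k]
    return len(set(top_k) & set(relevant_ids)) / k


# Hypothetical rankings: nine ids in common, one differing id on each side.
my_top10 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
reference_top10 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 42}

p_at_10 = precision_at_k(my_top10, reference_top10)
print(p_at_10)  # 0.9
```

Because only membership in the reference top 10 matters, the order of the two rankings does not affect the score.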

In [65]:
def get_top10_hits(index, analyzer, query):
    '''get the set of ids of the top 10 hits returned from a Whoosh search'''

    query = # how to make the query? 
    print("QUERY: ", TextParser.parse(query))
    results = 
    ids = set()
    # iterate results to add at ids; 

    return ids

def check_query(query,analyzer,tfidfed_corpus):
    '''print out the top document in the tfidfed_corpus given the query,
    and calculate P@10 compared with Whoosh output'''
    output = BM25IR(...)
    print("OUTPUT: ", output)
    
    print("-----------------------------")
    print("Whoosh P@10:")
    # print the ratio of (get_top10_hits INTERSECTION your `output`) / 10

check_query("home runs and extra innings", text_analyzer, tfidfed_corpus)

OUTPUT:  [8458, 3290, 9289, 10586, 5580, 6188, 655, 9948, 10566, 5217]
-----------------------------
Whoosh P@10:
QUERY:  (text:home OR text:run OR text:extra OR text:inn)
0.9


### Exercise 4: Language models for IR (Optional)
rubric={accuracy:1,efficiency:1}