In [1]:
!pip install python-terrier



In [2]:
# Check that python-terrier is installed (the right one and not pyterrier)
!pip show python-terrier

Name: python-terrier
Version: 0.10.0
Summary: Terrier IR platform Python API
Home-page: https://github.com/terrier-org/pyterrier
Author: Craig Macdonald
Author-email: craigm@dcs.gla.ac.uk
License: 
Location: c:\users\enric\anaconda3\envs\new-environment\lib\site-packages
Requires: chest, deprecated, dill, ir-datasets, ir-measures, jinja2, joblib, matchpy, more-itertools, nptyping, numpy, pandas, pyjnius, pytrec-eval-terrier, requests, scikit-learn, scipy, statsmodels, tqdm, wget
Required-by: 


In [3]:
# imports
import pyterrier as pt
if not pt.started():
    pt.init()
import pandas as pd
import json
import os

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8



All the data is stored in a .jsonl file with one JSON object per line, so we read the file line by line and parse each line into a list of objects.

In [5]:
with open('../data/data.jsonl', 'r') as jsonl_file:
    # Parse each line of the file as one JSON object
    json_objects = [json.loads(line) for line in jsonl_file]
print(json_objects[0])

{'name': 'AleSmith Speedway Stout - Barrel-Aged Kopi Luwak', 'description': 'Barrel Aged Kopi Luwak Speedway Stout is our award winning year round Speedway Stout with Kopi Luwak Coffee that has been matured in premium bourbon barrels. Considered one of the world’s most expensive and rare coffee varieties, the beans are specifically selected for their ripeness and quality by the civet, a small jungle marsupial that feeds of the fruit of the coffee cherry. Good to the last dropping!', 'image_url': 'https://res.cloudinary.com/ratebeer/image/upload/w_400,c_limit,d_Default_Beer_qqrv7k.png,f_auto/beer_202442', 'price': None, 'style': None, 'critic_score': {'max': 100, 'actual': 100.00000002692344}, 'brewer': {'name': 'AleSmith Brewing Company', 'city': 'San Diego', 'country': {'code': 'US', 'name': 'United States'}, 'state': {'name': 'California'}}, 'alcohol_bv': 12.0, 'tasting_notes': None, 'closure': None, 'packaging': None}


Next, we flatten each beer record into a single string containing all of its field values.

We do this because the indexer needs one piece of text per beer to build the index.

In [6]:
documents = []
for i, value in enumerate(json_objects, start=1):
    # Join all field values of the record with single spaces
    text = ' '.join(str(field_value) for field_value in value.values())
    documents.append({'docno': f"d{i}", 'text': text})

df = pd.DataFrame(documents)

print(df.head())

  docno                                               text
0    d1  AleSmith Speedway Stout - Barrel-Aged Kopi Luw...
1    d2  AleSmith Speedway Stout - Barrel-Aged Vietname...
2    d3  AleSmith Speedway Stout - Bourbon Barrel Aged ...
3    d4  Bell's Expedition Stout - Bourbon Barrel-Aged ...
4    d5  Cigar City Caffè Americano Double Stout (Rum B...
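
One thing to keep in mind with the flattening above: nested fields such as `brewer` or `critic_score` are converted with their Python representation (braces, quotes and key names included), and empty fields end up as the literal string `None`. The following is a minimal sketch of an alternative, not what this notebook uses: a hypothetical helper `flatten_values` that walks nested dictionaries and lists and keeps only the non-empty leaf values. The sample record is a trimmed copy of the first beer printed earlier.

```python
def flatten_values(value):
    """Yield the leaf values of a nested dict/list as strings, skipping None."""
    if value is None:
        return
    if isinstance(value, dict):
        for nested in value.values():
            yield from flatten_values(nested)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten_values(item)
    else:
        yield str(value)


# Trimmed-down record with the same shape as the first beer in data.jsonl
beer = {
    "name": "AleSmith Speedway Stout - Barrel-Aged Kopi Luwak",
    "price": None,
    "critic_score": {"max": 100, "actual": 100.00000002692344},
    "brewer": {
        "name": "AleSmith Brewing Company",
        "city": "San Diego",
        "country": {"code": "US", "name": "United States"},
    },
    "alcohol_bv": 12.0,
}

text = " ".join(flatten_values(beer))
print(text)
```

With this variant the indexed text contains only actual values, so the word `None` and dictionary keys such as `name` or `code` are no longer indexed for every beer. Whether that matters depends on the queries we expect.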


### Index creation

Now we can proceed to create an index (and its indexer)

In [7]:
current_directory = os.getcwd()
# Build the index path portably instead of hard-coding a backslash separator
indexer = pt.DFIndexer(os.path.join(current_directory, "index_3docs"), overwrite=True)
index_ref = indexer.index(df["text"], df["docno"])
index_ref.toString()

'c:\\Users\\enric\\OneDrive\\Desktop\\Stuff\\Information Retrieval\\Project\\beer-search-engine\\backend\\index_3docs/data.properties'

In [8]:
index = pt.IndexFactory.of(index_ref)

Let's check the statistics of the index.

In [9]:
print(index.getCollectionStatistics().toString())

Number of documents: 22123
Number of terms: 30253
Number of postings: 714126
Number of fields: 0
Number of tokens: 848329
Field names: []
Positions:   false



The statistics show roughly 22 k documents and more than 30 k terms.

In [10]:
for kv in index.getLexicon():
  print("%s  -> %s " % (kv.getKey(), kv.getValue().toString()  ))

0  -> term49 Nt=4286 TF=4563 maxTF=7 @{0 0 0} 
00  -> term27909 Nt=1 TF=1 maxTF=1 @{0 1961 3} 
000  -> term360 Nt=31 TF=39 maxTF=3 @{0 1964 7} 
000anniversari  -> term12640 Nt=1 TF=1 maxTF=1 @{0 2026 3} 
000th  -> term1769 Nt=6 TF=7 maxTF=2 @{0 2029 3} 
001  -> term1162 Nt=7 TF=7 maxTF=1 @{0 2043 2} 
0015  -> term22573 Nt=1 TF=1 maxTF=1 @{0 2060 4} 
0019b  -> term16712 Nt=1 TF=1 maxTF=1 @{0 2063 6} 
002  -> term939 Nt=7 TF=7 maxTF=1 @{0 2067 0} 
0024  -> term23693 Nt=1 TF=1 maxTF=1 @{0 2084 0} 
0028  -> term25954 Nt=1 TF=1 maxTF=1 @{0 2087 2} 
003  -> term683 Nt=3 TF=3 maxTF=1 @{0 2090 6} 
0038  -> term27250 Nt=1 TF=1 maxTF=1 @{0 2096 6} 
004  -> term9327 Nt=4 TF=4 maxTF=1 @{0 2100 2} 
0042b  -> term14577 Nt=1 TF=1 maxTF=1 @{0 2111 4} 
005  -> term10435 Nt=2 TF=2 maxTF=1 @{0 2114 4} 
006  -> term5697 Nt=7 TF=7 maxTF=1 @{0 2120 4} 
007  -> term15855 Nt=1 TF=1 maxTF=1 @{0 2138 2} 
008  -> term16713 Nt=1 TF=1 maxTF=1 @{0 2141 2} 
009  -> term17607 Nt=2 TF=2 maxTF=1 @{0 2144 4} 
01  -> ter

### Batch Retrieve
Let's now run BatchRetrieve with the Tf, TF_IDF and BM25 weighting models and compare the results for the same query.

In [None]:
br = pt.BatchRetrieve(index, wmodel="Tf")
br.search("IPA beer 8% alcohol")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,2915,d2916,0,22.0,IPA beer 8% alcohol
1,1,4233,d4234,1,20.0,IPA beer 8% alcohol
2,1,9455,d9456,2,19.0,IPA beer 8% alcohol
3,1,1712,d1713,3,18.0,IPA beer 8% alcohol
4,1,1286,d1287,4,16.0,IPA beer 8% alcohol
...,...,...,...,...,...,...
995,1,402,d403,995,4.0,IPA beer 8% alcohol
996,1,411,d412,996,4.0,IPA beer 8% alcohol
997,1,417,d418,997,4.0,IPA beer 8% alcohol
998,1,438,d439,998,4.0,IPA beer 8% alcohol


In [None]:
br = pt.BatchRetrieve(index, wmodel="TF_IDF")
br.search("IPA beer 8% alcohol")

Unnamed: 0,qid,docid,docno,rank,score,query
0,1,8518,d8519,0,6.917960,IPA beer 8% alcohol
1,1,6782,d6783,1,6.413824,IPA beer 8% alcohol
2,1,8716,d8717,2,6.204563,IPA beer 8% alcohol
3,1,3225,d3226,3,6.026825,IPA beer 8% alcohol
4,1,3428,d3429,4,5.835148,IPA beer 8% alcohol
...,...,...,...,...,...,...
995,1,3349,d3350,995,2.729804,IPA beer 8% alcohol
996,1,4292,d4293,996,2.729804,IPA beer 8% alcohol
997,1,5091,d5092,997,2.729804,IPA beer 8% alcohol
998,1,5204,d5205,998,2.729804,IPA beer 8% alcohol


In [12]:
br = pt.BatchRetrieve(index, wmodel="BM25")
results = br.search("IPA beer 8% alcohol")

In [27]:
for i in results.iterrows():
    print(i[1])

qid                        1
docid                   8518
docno                  d8519
rank                       0
score              10.105519
query    IPA beer 8% alcohol
Name: 0, dtype: object
qid                        1
docid                   6782
docno                  d6783
rank                       1
score               9.875814
query    IPA beer 8% alcohol
Name: 1, dtype: object
qid                        1
docid                  10398
docno                 d10399
rank                       2
score               9.536089
query    IPA beer 8% alcohol
Name: 2, dtype: object
qid                        1
docid                   8716
docno                  d8717
rank                       3
score               9.315281
query    IPA beer 8% alcohol
Name: 3, dtype: object
qid                        1
docid                   3225
docno                  d3226
rank                       4
score               8.748891
query    IPA beer 8% alcohol
Name: 4, dtype: object
qid            

In [29]:
# for i, v in results.items():
#     print(i,v)

def concatenate_fields(row):
    # Join every column value of the row into one space-separated string
    return ' '.join(str(field) for field in row)
results['Concatenated'] = results.apply(concatenate_fields, axis=1)
print(results['Concatenated'])


0      1 8518 d8519 0 10.105518657666739 IPA beer 8% ...
1      1 6782 d6783 1 9.875814450349683 IPA beer 8% a...
2      1 10398 d10399 2 9.53608945539947 IPA beer 8% ...
3      1 8716 d8717 3 9.315280789802106 IPA beer 8% a...
4      1 3225 d3226 4 8.748890706750503 IPA beer 8% a...
                             ...                        
995    1 4292 d4293 995 3.248008667558798 IPA beer 8%...
996    1 5091 d5092 996 3.248008667558798 IPA beer 8%...
997    1 5204 d5205 997 3.248008667558798 IPA beer 8%...
998    1 5414 d5415 998 3.248008667558798 IPA beer 8%...
999    1 5949 d5950 999 3.248008667558798 IPA beer 8%...
Name: Concatenated, Length: 1000, dtype: object


As we can see, the rankings differ, most visibly between *TF* and the other two models. The index is the same in all three runs; what changes is the weighting model used to score documents against the query.
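
To put a number on how different the rankings are, here is a small sketch that compares the top five `docno` values of each model using the Jaccard overlap (size of the intersection divided by size of the union). The lists are copied by hand from the outputs above; the helper `jaccard` and the `top5` dictionary are names introduced only for this illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Top-5 docnos per weighting model, copied from the result tables above
top5 = {
    "Tf": ["d2916", "d4234", "d9456", "d1713", "d1287"],
    "TF_IDF": ["d8519", "d6783", "d8717", "d3226", "d3429"],
    "BM25": ["d8519", "d6783", "d10399", "d8717", "d3226"],
}

overlaps = {}
for left, right in combinations(top5, 2):
    overlaps[(left, right)] = jaccard(top5[left], top5[right])
    print(f"{left} vs {right}: {overlaps[(left, right)]:.2f}")
```

On these lists, Tf shares no document with either of the other models (overlap 0.00), while TF_IDF and BM25 share four of their top five (overlap 0.67). Only five documents per model are compared here, so this is an indication rather than a full comparison of the 1000-document rankings.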

The *TF* model is the simplest: it only counts how often the query terms occur in the document. We can therefore expect lower precision from it, since it does not discriminate between terms that occur very often in the collection and ones that don't.

*TF_IDF* adds that discrimination by down-weighting terms that occur in many documents, so it is a more complete model. Still, we can do better: in its basic form it does not account for document length, and longer documents are naturally more likely to contain the words we're looking for.

For this reason, *BM25* is the most complete of the models we tried, as it addresses all the problems mentioned above. Thus, we consider it the most precise of the three BatchRetrieve configurations.
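
The reasoning above can be illustrated on a toy collection. The sketch below uses textbook forms of the three scoring functions (raw term frequency, tf × log(N/df), and BM25 with the common defaults k1 = 1.2 and b = 0.75). These are assumptions made for illustration: Terrier's own `Tf`, `TF_IDF` and `BM25` implementations differ in details, so the numbers are not comparable with the scores printed earlier. The documents, the query and all function names are made up.

```python
import math
from collections import Counter

# Toy collection: one long document stuffed with a common term,
# one short document that also contains the rare query term
docs = {
    "long": ["beer"] * 10 + ["malt"] * 14,
    "short": ["ipa", "beer", "hoppy"],
    "stout": ["beer", "stout", "dark", "roast"],
    "lager": ["lager", "crisp", "beer"],
    "wine": ["red", "wine"],
}
query = ["ipa", "beer"]

N = len(docs)
avgdl = sum(len(tokens) for tokens in docs.values()) / N
# Document frequency: in how many documents each term occurs
df = Counter(term for tokens in docs.values() for term in set(tokens))


def tf_score(tokens):
    counts = Counter(tokens)
    return sum(counts[t] for t in query)


def tfidf_score(tokens):
    counts = Counter(tokens)
    return sum(counts[t] * math.log(N / df[t]) for t in query if df[t])


def bm25_score(tokens, k1=1.2, b=0.75):
    counts = Counter(tokens)
    # Longer-than-average documents get a larger normalisation term
    length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
    score = 0.0
    for t in query:
        tf = counts[t]
        if tf == 0:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score


scorers = {"TF": tf_score, "TF-IDF": tfidf_score, "BM25": bm25_score}
best = {}
for name, scorer in scorers.items():
    ranking = sorted(docs, key=lambda d: scorer(docs[d]), reverse=True)
    best[name] = ranking[0]
    print(name, ranking)
```

In this toy setup, TF ranks the long document first simply because it repeats "beer" ten times. TF-IDF narrows the gap by giving the rare term "ipa" more weight, but the long document still comes out on top. BM25 ranks the short document first, because repeated occurrences saturate and the long document is penalised for its length.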