# inforet 2022 2

In [1]:
# no time to lose:
# !wget https://gerdes.fr/saclay/informationRetrieval/our_msmarco.zip
# !unzip our_msmarco.zip
# this will be big: 1.2gb!
# you will get three files

In [2]:
# this turns on the autotimer, so that every cell has a timing information below
try:
    %load_ext autotime
except:
    !pip install ipython-autotime
    %load_ext autotime
# stop using:
# %unload_ext autotime

time: 172 µs (started: 2023-03-23 20:25:55 +01:00)


In [3]:
# !pip install dask
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from collections import Counter
from tqdm.notebook import tqdm
from sklearn.feature_extraction.text import CountVectorizer
import dask.dataframe as dd

time: 3.74 s (started: 2023-03-23 20:25:55 +01:00)


## our dataset

- "TREC stands for the Text Retrieval Conference. Started in 1992 it is a series of workshops that focus on supporting research within the information retrieval community. It provides the infrastructure necessary for large-scale evaluation of text retrieval methodologies. Every year these workshops are organized, which are centered around a set of tracks. These tracks encourage new researches in the area of information retrieval."
- TREC 2019 Deep Learning Track https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019
- data from MS-Marco https://microsoft.github.io/msmarco/
- The dataset contains 367k queries and a corpus of 3.2 million documents.
___
- if you want to reproduce my selection or get a bigger set, uncomment and execute


In [4]:
#!wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.tsv.gz
#!wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-doctrain-queries.tsv.gz
#!wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-doctrain-top100.gz
	
#!gzip -d msmarco-docs.tsv.gz
#!gzip -d msmarco-doctrain-queries.tsv.gz
#!gzip -d msmarco-doctrain-top100.gz


time: 265 µs (started: 2023-03-23 20:25:58 +01:00)


- we have three datasets:
    
    1. the queries: msmarco-doctrain-queries.tsv
    2. the gold: msmarco-doctrain-top100 is a space-separated table containing query_id, Q0, doc_id, rank, score and run name
    3. the actual documents: msmarco-docs.tsv 21GB of text! doc_id, url, title, text

In [5]:
all_queries=pd.read_table('./msmarco/doctrain-queries.tsv',header=None)
all_queries.columns=['qid','query']
print('Shape=>',all_queries.shape)
all_queries.head()

Shape=> (367013, 2)


Unnamed: 0,qid,query
0,1185869,)what was the immediate impact of the success ...
1,1185868,_________ justice is designed to repair the ha...
2,1183785,elegxo meaning
3,645590,what does physical medicine do
4,186154,feeding rice cereal how many times per day


time: 247 ms (started: 2023-03-23 20:25:58 +01:00)


#### reducing the dataset
- here we take 1000 queries. 
- if this is too big for your computer, use this code to build a smaller version, starting from the already reduced 1000-query set that we downloaded before


In [6]:
our_queries=all_queries.sample(n=1000,random_state=42).reset_index(drop=True)
print('Shape=>',our_queries.shape)
our_queries.head()

Shape=> (1000, 2)


Unnamed: 0,qid,query
0,687888,what is a jpe
1,480210,price for asphalt driveway
2,591004,what causes pressure skin bruising
3,260536,how long drive from flagstaff to grand canyon
4,39422,average number of bowel movements per day for ...


time: 10.9 ms (started: 2023-03-23 20:25:59 +01:00)


In [7]:
our_queries.to_csv('./our_msmarco/our.msmarco.queries.tsv',sep='\t')

time: 4.89 ms (started: 2023-03-23 20:25:59 +01:00)


#### the gold file
- 36m lines!

In [8]:
gold_top100=pd.read_table('./msmarco/doctrain-top100',delimiter=' ',header=None)
gold_top100.columns=['qid','Q0','docid','rank','score','runstring']
print('Shape=>',gold_top100.shape)
display(gold_top100.head())

# Reducing gold_top100 for training
our_gold_top100=gold_top100[gold_top100['qid'].isin(our_queries['qid'].unique())].reset_index(drop=True)
print('Shape=>',our_gold_top100.shape)
our_gold_top100.head()

Shape=> (36701116, 6)


Unnamed: 0,qid,Q0,docid,rank,score,runstring
0,1185869,Q0,D59221,1,-4.80433,IndriQueryLikelihood
1,1185869,Q0,D59220,2,-4.92127,IndriQueryLikelihood
2,1185869,Q0,D2192591,3,-5.05215,IndriQueryLikelihood
3,1185869,Q0,D2777518,4,-5.05486,IndriQueryLikelihood
4,1185869,Q0,D2371978,5,-5.07048,IndriQueryLikelihood


Shape=> (100000, 6)


Unnamed: 0,qid,Q0,docid,rank,score,runstring
0,310290,Q0,D579750,1,-5.11498,IndriQueryLikelihood
1,310290,Q0,D579754,2,-5.57703,IndriQueryLikelihood
2,310290,Q0,D2380815,3,-5.84852,IndriQueryLikelihood
3,310290,Q0,D822566,4,-5.95002,IndriQueryLikelihood
4,310290,Q0,D2249695,5,-6.08326,IndriQueryLikelihood


time: 32.4 s (started: 2023-03-23 20:25:59 +01:00)


In [9]:
our_gold_top100.to_csv('./our_msmarco/our.msmarco.gold.tsv',sep='\t')

time: 263 ms (started: 2023-03-23 20:26:31 +01:00)


#### the data file

- it's so big that it's smarter to use dask: https://docs.dask.org/en/stable/

In [10]:
df=dd.read_table('./msmarco/msmarco-docs.tsv',blocksize=100e6,header=None) #  partitions of 100MB
df.columns=['docid','url','title','body']
df.head()

Unnamed: 0,docid,url,title,body
0,D1555982,https://answers.yahoo.com/question/index?qid=2...,The hot glowing surfaces of stars emit energy ...,Science & Mathematics Physics The hot glowing ...
1,D301595,http://childparenting.about.com/od/physicalemo...,Developmental Milestones and Your 8-Year-Old C...,School-Age Kids Growth & Development Developme...
2,D1359209,http://visihow.com/Check_for_Lice_Nits,Check for Lice Nits,Check for Lice Nits Edited by Mian Sheilette O...
3,D2147834,http://www.nytimes.com/2010/01/05/business/glo...,Dubai Opens a Tower to Beat All,Global Business Dubai Opens a Tower to Beat Al...
4,D1568809,http://www.realtor.com/realestateandhomes-sear...,"Coulterville, CA Real Estate & Homes for Sale","Coulterville, CA Real Estate & Homes for Sale4..."


time: 1.3 s (started: 2023-03-23 20:26:31 +01:00)


In [11]:
# can't get the number of rows quickly :s
# very slow:
# len(df.index)

# faster:
!wc -l ./msmarco/msmarco-docs.tsv

 3213835 ./msmarco/msmarco-docs.tsv
time: 14.7 s (started: 2023-03-23 20:26:33 +01:00)


- big dataset with 3m rows!
- we want the top 100 for our queries
- this takes some time!

In [12]:
def create_corpus(result):
  unique_docid=result['docid'].unique()
  condition=df['docid'].isin(unique_docid)
  corpus=df[condition].reset_index(drop=True)
  corpus=corpus.drop(columns='url')
  print('Number of Rows=>',len(corpus))
  return corpus

our_docs=create_corpus(our_gold_top100)
our_docs.head()

Number of Rows=> 92565


Unnamed: 0,docid,title,body
0,D2981241,What do you call a group of lions?,Lions Vocabulary of the English Language Word ...
1,D687756,.,"The A Priori Argument ( also, Rationalization;..."
2,D913099,Everything You Need To Learn How To Cook Veget...,Home > How To Cook Vegetables Everything You N...
3,D328017,"What is the difference between latitude, longi...",Longitude Latitude Geographic Coordinate Syste...
4,D1636347,When was the pulley invented?,Answers.com ® Wiki Answers ® Categories Techno...


time: 3min (started: 2023-03-23 20:26:47 +01:00)


In [13]:
our_docs.to_csv('./our_msmarco/our.msmarco.docs.tsv',sep='\t', single_file=True)

['/Users/zhehuang/M1AI-UPSaclay/T4/Information Retrieval/Class2/our_msmarco/our.msmarco.docs.tsv']

time: 3min 11s (started: 2023-03-23 20:29:49 +01:00)


- this is still a big file: 92k documents

# reading in our smaller files
here we use the files from the archive downloaded at the top of the notebook:

- !wget https://gerdes.fr/saclay/informationRetrieval/our_msmarco.zip
- !unzip our_msmarco.zip

In [14]:
queries = pd.read_csv('our.msmarco.queries.tsv',sep='\t',usecols=[1,2])
queries

Unnamed: 0,qid,query
0,687888,what is a jpe
1,480210,price for asphalt driveway
2,591004,what causes pressure skin bruising
3,260536,how long drive from flagstaff to grand canyon
4,39422,average number of bowel movements per day for ...
...,...,...
995,89597,cell voltage mv meaning
996,1167043,what an ip address
997,737304,what is daily max citizens atm
998,156934,do i need a florida commercial driver license


time: 12.9 ms (started: 2023-03-23 20:33:01 +01:00)


In [15]:
gold = pd.read_csv('our.msmarco.gold.tsv',sep='\t',usecols=[1,3,4,5])
gold

Unnamed: 0,qid,docid,rank,score
0,310290,D579750,1,-5.11498
1,310290,D579754,2,-5.57703
2,310290,D2380815,3,-5.84852
3,310290,D822566,4,-5.95002
4,310290,D2249695,5,-6.08326
...,...,...,...,...
99995,257942,D253854,96,-6.32693
99996,257942,D3056621,97,-6.32837
99997,257942,D1323491,98,-6.32871
99998,257942,D2722485,99,-6.33100


time: 70 ms (started: 2023-03-23 20:33:01 +01:00)


In [16]:
docs = pd.read_csv('our.msmarco.docs.tsv',sep='\t',usecols=[1,2,3])
docs

Unnamed: 0,docid,title,body
0,D2981241,What do you call a group of lions?,Lions Vocabulary of the English Language Word ...
1,D687756,.,"The A Priori Argument ( also, Rationalization;..."
2,D913099,Everything You Need To Learn How To Cook Veget...,Home > How To Cook Vegetables Everything You N...
3,D328017,"What is the difference between latitude, longi...",Longitude Latitude Geographic Coordinate Syste...
4,D1636347,When was the pulley invented?,Answers.com ® Wiki Answers ® Categories Techno...
...,...,...,...
92560,D3379210,Top 39 Doctor insights on: Can An Iud Cause Ha...,Top 39 Doctor insights on: Can An Iud Cause Ha...
92561,D3068739,How to get back your DirecTV cancellation fees,How to get back your Direc TV cancellation fee...
92562,D1590402,Certification FAQs,Fingerprinting 1. Where can I get fingerprinte...
92563,D2175490,Greenhouse gas emissions by Canadian economic ...,"Access PDF (682 KB)In 2015, Canada's total gre..."


time: 12.1 s (started: 2023-03-23 20:33:01 +01:00)


In [17]:
# Creating Training Set of Queries
training_queries=queries.iloc[:500]
print('Shape=>',training_queries.shape)
display(training_queries.head())
# Creating Testing Set of Queries
testing_queries=queries.iloc[500:]
print('Shape=>',testing_queries.shape)
testing_queries.head()

Shape=> (500, 2)


Unnamed: 0,qid,query
0,687888,what is a jpe
1,480210,price for asphalt driveway
2,591004,what causes pressure skin bruising
3,260536,how long drive from flagstaff to grand canyon
4,39422,average number of bowel movements per day for ...


Shape=> (500, 2)


Unnamed: 0,qid,query
500,116364,decree verb definition
501,638813,what does hemiballistic mean
502,401631,is advil considered aspirin?
503,1050265,who sang the song midnight train to georgia
504,632336,what does aq mean in chemistry


time: 6.84 ms (started: 2023-03-23 20:33:13 +01:00)


## exploring the data

### 🚧 todo: check whether there are NAN and take care of them

In [18]:
print(np.sum(queries['query'].isna()), "queries are NAN")
print(np.sum(gold['score'].isna()), "golds are NAN")
print(np.sum(docs['body'].isna()), "docs are NAN")

0 queries are NAN
0 golds are NAN
3 docs are NAN
time: 18.4 ms (started: 2023-03-23 20:33:13 +01:00)


In [136]:
docs.fillna('NAN', inplace=True)

time: 101 ms (started: 2023-03-24 00:32:36 +01:00)


### let's have a look at some random query:

In [137]:
queries.loc[111]

qid                                      251898
query    how long does getting a doctorate take
Name: 111, dtype: object

time: 5.76 ms (started: 2023-03-24 00:32:42 +01:00)


In [138]:
gold[gold.qid==251898]

Unnamed: 0,qid,docid,rank,score
36200,251898,D2865964,1,-4.74293
36201,251898,D3557816,2,-4.90695
36202,251898,D2723985,3,-4.95911
36203,251898,D1951655,4,-4.97272
36204,251898,D1709749,5,-5.02176
...,...,...,...,...
36295,251898,D2531901,96,-5.56896
36296,251898,D2956542,97,-5.57138
36297,251898,D301873,98,-5.57262
36298,251898,D2952336,99,-5.57504


time: 24 ms (started: 2023-03-24 00:32:42 +01:00)


### 🚧 todo: let's look at the top-ranked document for that query
- title
- body

In [139]:
# todo: .values[0] can help
gold[gold.qid==251898].values[0][1]

'D2865964'

time: 2.78 ms (started: 2023-03-24 00:32:43 +01:00)


In [140]:
docs[docs.docid == gold[gold.qid==251898].values[0][1]]

Unnamed: 0,docid,title,body,cleaned
56982,D2865964,How long does it take to get a post-doctoral d...,Answers.com ® Wiki Answers ® Categories Jobs &...,NAN


time: 25.1 ms (started: 2023-03-24 00:32:44 +01:00)


In [141]:
# title
docs[docs.docid == gold[gold.qid==251898].values[0][1]].values[0][1]

'How long does it take to get a post-doctoral degree?'

time: 13.4 ms (started: 2023-03-24 00:32:44 +01:00)


In [142]:
# body
docs[docs.docid == gold[gold.qid==251898].values[0][1]].values[0][2]

"Answers.com ® Wiki Answers ® Categories Jobs & Education Education College Degrees Graduate Degrees How long does it take to get a post-doctoral degree? Flag How long does it take to get a post-doctoral degree? Answer by Joe Ragusa Confidence votes 98.6KI am not aware of any such thing as a postdoctoral degree. You may be referring to postdoctoral research which may be required for obtaining a tenure-track faculty position, especially at research oriented institutions. Some have suggested that postdoctoral appointments - that were traditionally optional - have become mandatory as demand for tenure-track positions in academia has drastically increased over previous decades (Wikipedia). The length of time in this case depends on the type, content, scope, and depth of the research.2 people found this useful Was this answer useful? Yes Somewhat No Tyler Durden9988 1,326 Contributions How long does it take to get a doctoral degree? It totally depends on the actual degree and school but you

time: 8.56 ms (started: 2023-03-24 00:32:44 +01:00)


### 🚧 todo: let's look at the second document
- let's make a function to make that easier

In [143]:
def titleAndBody(qid,nr):
    # docid of the document at position nr (0-based) in the gold list of this query
    docid = gold[gold.qid==qid].values[nr][1]
    doc = docs[docs.docid == docid].values[0]
    display(doc[1])  # title
    display(doc[2])  # body
titleAndBody(251898,1)

'How long does it take to get a degree in psychology?'

"Answers.com ® Wiki Answers ® Categories Jobs & Education Education College Degrees Bachelors Degrees How long does it take to get a degree in psychology? Flag How long does it take to get a degree in psychology? Answer by Colette Fisher Stone Confidence votes 16To complete a bachelors degree, it can take anywhere from 124 -128 credits. This could take approximately four years of study, provided the student takes the program as prescribed by the institution. Viper1 I just finished my Bachelor's in Psychology. It actually takes anywhere from 180-195 credits, and does take four years if you are taking 2 classes a semester. If you take on a heavier class load, it is possible to graduate in 3 years.15 people found this useful Was this answer useful? Yes Somewhat No Joe Ragusa How long does it take to earn a bachelor's degree in forensic psychology? It usually takes about 4 years to receive a Bachelor's Degree Answer I might also add that it depends on entrance testing of basic skills, cred

time: 14.3 ms (started: 2023-03-24 00:32:45 +01:00)


#### let's look at the 100th document

In [144]:
titleAndBody(251898,99)

'.'

'Q How long to receive va dic benefits? I receive a widows benefit from railroad retirement can i collect ssame time Topic: Asked by: Lino In Education & Reference > Financial Aid > Receive>A Top Solutions Janice sibug of 1631 el camino real #7 tustin ca 92780 was born to an illegal alien name l ... read more If direct deposit keep checking your bank account. I received my award letter (explaning b ... read more Add your answer Post to Facebook Post to Twitter Subscribe me Suggested Solutions (10) What\'s this? Anonymous0 2 Janice sibug of 1631 el camino real #7 tustin ca 92780 was born to an illegal alien name ligaya fabian who jumpshipped her flght at lax from germany to canada.source: How much do anchor babies receive in benefits? Was this answer helpful? Yes | No Comment Reply Reportpgfua Level 2 (Sophomore)4 Answers"If direct deposit keep checking your bank account..."5 0 If direct deposit keep checking your bank account. I received my award letter (explaning benefits) after I rec

time: 15.1 ms (started: 2023-03-24 00:32:45 +01:00)


### 🚧 todo: try this with different queries to get a feel for the quality of the gold

# doing our first baseline retrieval function

- todo: 
    - build and fit a binary CountVectorizer on the **titles**
    - play with and understand build_analyzer, build_tokenizer, and transform
    - transform our query 111
        - understand what happens with yet unseen words in the transform process
    - find the docs with the most words in common
    - write an evaluation function computing the top 10 precision p@10
    - apply to our 500 queries


In [145]:
vectorizer = CountVectorizer(binary=True)
# understand the options: 
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
X = vectorizer.fit_transform(docs.title.tolist())
print('we got',len(vectorizer.get_feature_names_out()),'features, for example',vectorizer.get_feature_names_out()[33333:33339])

we got 39559 features, for example ['sputtering' 'sputters' 'sputum' 'spy' 'spyware' 'sq']
time: 651 ms (started: 2023-03-24 00:32:46 +01:00)


In [146]:
queries.loc[111].query

'how long does getting a doctorate take'

time: 1.8 ms (started: 2023-03-24 00:32:46 +01:00)


In [147]:
vectorizer.build_analyzer()(queries.loc[111].query)

['how', 'long', 'does', 'getting', 'doctorate', 'take']

time: 1.88 ms (started: 2023-03-24 00:32:46 +01:00)


In [148]:
vectorizer.build_tokenizer()(queries.loc[111].query)

['how', 'long', 'does', 'getting', 'doctorate', 'take']

time: 1.91 ms (started: 2023-03-24 00:32:46 +01:00)


In [149]:
qv = vectorizer.transform([queries.loc[111].query])
qv

<1x39559 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

time: 2.45 ms (started: 2023-03-24 00:32:46 +01:00)


### 🚧 todo:
- understand what happens with yet unseen words in the transform process


- think of the shape of X, what are the rows, what are the columns?
- how to select the titles that have the words of our query?
       - think of matrix multiplication and transposition

<font color=green>
When the transform step meets a word that was not seen during fitting, CountVectorizer simply ignores it: the word has no column in the vocabulary, so it adds nothing to the resulting vector. Here all 6 query tokens were already in the vocabulary, hence the 6 stored elements.
</font>
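A minimal sketch of this behaviour, using a hand-rolled stand-in for a binary vectorizer rather than scikit-learn itself. The helpers `tokenize`, `fit_vocabulary` and `transform_binary` are hypothetical, and the two toy titles are made up for illustration:

```python
import re

def tokenize(text):
    # same idea as CountVectorizer's default: lowercase, keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(texts):
    # map every token seen during fitting to a column index (alphabetical order)
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform_binary(text, vocabulary):
    # 1 if the token occurs in the text, 0 otherwise;
    # tokens outside the vocabulary have no column and are silently dropped
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            vector[vocabulary[tok]] = 1
    return vector

titles = ["How long does it take to get a doctorate", "Price for an asphalt driveway"]
vocabulary = fit_vocabulary(titles)

# "zebra" was never seen during fitting, so it leaves no trace in the vector
query_vector = transform_binary("how long does a zebra doctorate take", vocabulary)
print(len(vocabulary), sum(query_vector), "zebra" in vocabulary)
```

The query vector always has the width of the fitted vocabulary, which is what makes the later product with `X` possible.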

In [150]:
xqv = qv.toarray()
xqv

array([[0, 0, 0, ..., 0, 0, 0]])

time: 1.87 ms (started: 2023-03-24 00:32:47 +01:00)


In [151]:
xqv.shape

(1, 39559)

time: 1.32 ms (started: 2023-03-24 00:32:47 +01:00)


In [152]:
X.shape

(92565, 39559)

time: 1.19 ms (started: 2023-03-24 00:32:47 +01:00)


<font color=green>
The rows of X correspond to the input titles and the columns to the unique words (features) of the fitted vocabulary.
</font>
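To see why multiplying by the transposed query vector selects the matching titles, here is a small plain-Python sketch. The 3x6 matrix and the vocabulary are invented toy values, not taken from the real data:

```python
# rows = titles, columns = vocabulary words (binary document-term matrix)
vocabulary = ["doctorate", "does", "driveway", "how", "long", "price"]
X_toy = [
    [1, 1, 0, 1, 1, 0],  # a title about how long a doctorate takes
    [0, 0, 1, 0, 0, 1],  # a title about driveway prices
    [0, 1, 0, 1, 1, 0],  # a title about how long something else takes
]
query_toy = [1, 0, 0, 1, 1, 0]  # query containing "doctorate", "how", "long"

# X @ query^T: one dot product per row, i.e. the number of query words found in that title
overlap = [sum(x * q for x, q in zip(row, query_toy)) for row in X_toy]
best_row = max(range(len(X_toy)), key=lambda i: overlap[i])
print(overlap, best_row)
```

Because both sides are binary, each product is 1 only when a word is in the title and in the query, so the row sums are shared-word counts.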

In [153]:
# Perform matrix multiplication between the matrix X and the transposed query matrix xqv.
result = X @ xqv.transpose()
print(result.shape)
print(result)

# Select the titles that have at least one of the words in the query
selected_titles = [title for i, title in enumerate(docs.title.tolist()) if result[i] > 0]
# print(selected_titles)

(92565, 1)
[[0]
 [0]
 [1]
 ...
 [0]
 [0]
 [0]]
time: 131 ms (started: 2023-03-24 00:32:48 +01:00)


### 🚧 todo: 
  - look at argmax and max, 
  - check the numpy 
      - flatnonzero function to find the best match
      - the .A and the .flat functions
  - show the best matching doc

In [154]:
# Find the index of the document with the most common words
best_match_idx = np.argmax(result)
print('The document with the most common words is of index', best_match_idx)

# Get the number of common words of the best matching document
best_match_number = result[best_match_idx][0]
print('The number of common words with the query is', best_match_number)

The document with the most common words is of index 11957
The number of common words with the query is 6
time: 1.57 ms (started: 2023-03-24 00:32:48 +01:00)


In [155]:
# Get the corresponding document
# best_match_doc = docs.loc[best_match_idx]['body']
docs.loc[best_match_idx].body

'Education & Reference Higher Education (University +)After getting my bachelors how long does it take to get a doctorate? In my last year of my bachelors degree in psychology and I want to get a doctorate. But I have no idea how long it would take. And is it better to get a masters a doctorate? Follow 6 answers Answers Relevance Rating Newest Oldest Best Answer: It generally takes 2 years to complete a master\'s and after that I would say at least 3-4 to complete a Ph. D. although many doctorates take much longer than that: sometimes up to 6-8 years. Unless you\'re trying to get into a unique medical field, such as virology or epidemiology etc., or to work/teach at the college level, then a master\'s degree will suffice for the most part. In fact, some colleges employ professors who hold a Masters degree and don\'t have a doctorate. also fyi, for every 100 Ph. D.\'s rewarded to a person, there are fewer than 10 collegiate job openings :). Hope that helped :)dvd_clapp · 10 years ago0 1

time: 1.82 ms (started: 2023-03-24 00:32:48 +01:00)


### 🚧 todo: use argpartition to get the 10 best answers

In [156]:
pred10i = np.argpartition(result.ravel(), -10)[-10:]
pred10i

array([16630, 91275, 44359, 71292,  1477, 43190, 66313,  3686, 83787,
       11957])

time: 1.98 ms (started: 2023-03-24 00:32:48 +01:00)
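Note that `np.argpartition` only guarantees that the last 10 positions hold the 10 largest scores; it does not order them. A small sketch on made-up scores, showing how the selected indices could be sorted afterwards if the ranking itself matters:

```python
import numpy as np

scores = np.array([3, 0, 5, 1, 4, 2])  # toy scores, one per document
k = 3

# indices of the k largest scores, in no particular order
top_k = np.argpartition(scores, -k)[-k:]

# sort just those k indices by descending score
top_k_sorted = top_k[np.argsort(scores[top_k])[::-1]]
print(sorted(top_k.tolist()), top_k_sorted.tolist())
```

For p@10 the order inside the top 10 does not change the result, so the unsorted version is enough there.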


In [157]:
docs.loc[pred10i]

Unnamed: 0,docid,title,body,cleaned
16630,D3070882,How long does it take a tattoo to heal?,Answers.com ® Wiki Answers ® Categories Health...,NAN
91275,D3090762,How Long Does It Take to Get a Doctorate in En...,Home » College How Long Does It Take to Get a ...,NAN
44359,D638869,How long does it take for the femur bone to heal,Home > Surgery >How long does it take for the ...,NAN
71292,D471734,How long does it take to earn an marketing deg...,Expert answer by Joe Ragusa Confidence votes 9...,NAN
1477,D2813027,How long does Seroquel stay in your system if ...,Answers.com ® Wiki Answers ® Categories Health...,NAN
43190,D700799,Holidays By Country Â»Holidays By Religion Â»O...,Home > Important Days > Daylight Savings > Day...,NAN
66313,D2956542,How long does it take to finish a doctorate?,Education & Reference Higher Education (Univer...,NAN
3686,D2111066,After getting a flu shot how long does it take...,Celticballa568 19 Contributions After getting ...,NAN
83787,D1951655,How long does it take to get a Doctorate of Nu...,Answers.com ® Wiki Answers ® Categories Jobs &...,NAN
11957,D2408424,After getting my bachelors how long does it ta...,Education & Reference Higher Education (Univer...,NAN


time: 13.7 ms (started: 2023-03-24 00:32:49 +01:00)


In [158]:
docs.loc[pred10i].docid

16630    D3070882
91275    D3090762
44359     D638869
71292     D471734
1477     D2813027
43190     D700799
66313    D2956542
3686     D2111066
83787    D1951655
11957    D2408424
Name: docid, dtype: object

time: 2.58 ms (started: 2023-03-24 00:32:49 +01:00)


In [159]:
gold[gold.qid==251898].docid

36200    D2865964
36201    D3557816
36202    D2723985
36203    D1951655
36204    D1709749
           ...   
36295    D2531901
36296    D2956542
36297     D301873
36298    D2952336
36299    D1805809
Name: docid, Length: 100, dtype: object

time: 3.43 ms (started: 2023-03-24 00:32:49 +01:00)


### 🚧 todo:
- find the relevant documents that are in our top 10
- use intersect1d
- compute the precision p@10

In [160]:
relevant_docs = gold[gold.qid==251898].docid.tolist()
# print(relevant_docs)
top10 = docs.loc[pred10i].docid.tolist()
# print(top10)
intersection = np.intersect1d(top10, relevant_docs)
intersection

array(['D1951655', 'D2408424', 'D2956542', 'D3090762', 'D471734'],
      dtype='<U8')

time: 3.51 ms (started: 2023-03-24 00:32:49 +01:00)


In [161]:
precision = len(intersection)/10
precision

0.5

time: 2.8 ms (started: 2023-03-24 00:32:50 +01:00)


In [162]:
queries[queries.qid==251898]['query'].tolist()

['how long does getting a doctorate take']

time: 2.14 ms (started: 2023-03-24 00:32:50 +01:00)


In [163]:
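# 🚧 todo: build a function p@10 that gives the precision at 10
def pAt10(qid):
    # relies on the global `vectorizer` and document-term matrix `X` fitted above
    query_vector = vectorizer.transform(queries[queries.qid==qid]['query'].tolist()).toarray()
    # one score per document: dot product between its row and the query vector
    result = X @ query_vector.transpose()

    # indices of the 10 highest scores (unordered)
    pred10i = np.argpartition(result.ravel(), -10)[-10:]
    top10 = docs.loc[pred10i].docid.tolist()
    relevant_docs = gold[gold.qid==qid].docid.tolist()
    # share of our top 10 that also appears in the gold list
    intersection = np.intersect1d(top10, relevant_docs)
    precision = len(intersection)/10
    return precision

pAt10(251898)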


0.5

time: 9.77 ms (started: 2023-03-24 00:32:50 +01:00)


### 🚧 todo:
- take our 500 training queries qid
- apply our function
- compute the average

In [164]:
training_queries.qid

0      687888
1      480210
2      591004
3      260536
4       39422
        ...  
495    133970
496     79788
497    791583
498    732078
499    197098
Name: qid, Length: 500, dtype: int64

time: 2.33 ms (started: 2023-03-24 00:32:51 +01:00)


In [165]:
training_queries.qid.apply(pAt10)

0      0.0
1      0.5
2      0.3
3      0.5
4      0.4
      ... 
495    0.6
496    0.1
497    0.0
498    0.1
499    0.3
Name: qid, Length: 500, dtype: float64

time: 2.87 s (started: 2023-03-24 00:32:51 +01:00)


In [166]:
training_queries.qid.apply(pAt10).mean()

0.32

time: 2.82 s (started: 2023-03-24 00:32:54 +01:00)


- that looks like a baseline we can beat :)
- what's the query we are doing best in?
    - max?

In [167]:
training_queries.qid.apply(pAt10).max()

1.0

time: 2.78 s (started: 2023-03-24 00:32:57 +01:00)


- oh, we have just been lucky before...

## 🚧 todo:

- redo the vectorization and evaluation on the whole text, not only the titles
- try the non-binary CountVectorizer
- go for tf-idf
    - play with at least two options and re-evaluate
- find other improvements. these may include:
    - cleaning the text
    - heuristically combining title and body matches
    - looking at bigrams
    - looking at terms (by means of a clean multi-word term list from wikipedia, see notebook 1)
    - by removing stopwords (look at nltk or spacy to do that)
    - trying an implementation of bm25
  
- do a grid search with a few promising parameters
    - maybe get inspired by GridSearchCV and pipelines in https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py
        - you can also check the well-written section "Pipelines" in this book: https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html
    - make a nice visualization of the results
    
- interpret the complete results in 3 to 5 sentences.
    - what strategy would do best if we switch our evaluation to p@100?

- give some ideas for improving the results
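For the bm25 item in the list above, here is a minimal plain-Python sketch of the Okapi BM25 scoring formula. The function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75` (common defaults, assumed here) and the three toy documents are illustrative choices, not part of the assignment:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # smoothed idf, always positive
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # term frequency saturates with k1 and is normalized by document length with b
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = [
    "how long does it take to get a doctorate".split(),
    "how long does it take a tattoo to heal".split(),
    "price for asphalt driveway".split(),
]
scores = bm25_scores("doctorate take".split(), corpus)
print(scores)
```

On the real data the tokenization should match the one used for the other vectorizers, and `k1` and `b` are natural candidates for the grid search.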





### Redo the vectorization and evaluation on the whole text

In [168]:
# vectorization
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs.body.tolist())

# evaluation
training_queries.qid.apply(pAt10).mean()

0.48

time: 2min 26s (started: 2023-03-24 00:33:00 +01:00)


### Try the non-binary CountVectorizer

In [169]:
# vectorization
vectorizer = CountVectorizer(binary=False)
X = vectorizer.fit_transform(docs.body.tolist())

# evaluation
training_queries.qid.apply(pAt10).mean()

0.046200000000000005

time: 2min 32s (started: 2023-03-24 00:35:26 +01:00)
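One plausible reason for this drop (an interpretation, not something the run above proves): with raw counts and no length normalization, the dot product rewards long documents that repeat frequent query words. A toy sketch with made-up texts and a hypothetical `count_score` helper that splits on whitespace (so, unlike CountVectorizer, it keeps the one-letter word "a"):

```python
from collections import Counter

def count_score(query, doc, binary):
    # dot product between the query and document term vectors
    q = Counter(query.split())
    d = Counter(doc.split())
    if binary:
        return sum(1 for term in q if term in d)
    return sum(q[term] * d[term] for term in q)

query = "how long does a doctorate take"
short_doc = "how long does a doctorate take to finish"
long_doc = "how how how does does does a a a a tattoo heal"

# (score of short_doc, score of long_doc) under each weighting
binary_scores = (count_score(query, short_doc, True), count_score(query, long_doc, True))
raw_scores = (count_score(query, short_doc, False), count_score(query, long_doc, False))
print(binary_scores, raw_scores)
```

With binary weights the on-topic short document wins; with raw counts the repetitive off-topic one overtakes it.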


### Play with at least two options and re-evaluate on tf-idf

In [170]:
from sklearn.feature_extraction.text import TfidfVectorizer

# vectorization
vectorizer = TfidfVectorizer(norm='l1')
X = vectorizer.fit_transform(docs.body.tolist())

# evaluation
training_queries.qid.apply(pAt10).mean()

0.7766000000000001

time: 2min 41s (started: 2023-03-24 00:37:58 +01:00)


In [171]:
# vectorization
vectorizer = TfidfVectorizer(norm='l2')
X = vectorizer.fit_transform(docs.body.tolist())

# evaluation
training_queries.qid.apply(pAt10).mean()

0.8972

time: 2min 36s (started: 2023-03-24 00:40:40 +01:00)


In [172]:
# vectorization
vectorizer = TfidfVectorizer(sublinear_tf=True)
X = vectorizer.fit_transform(docs.body.tolist())

# evaluation
training_queries.qid.apply(pAt10).mean()

0.935

time: 2min 41s (started: 2023-03-24 00:43:16 +01:00)


In [173]:
# vectorization
vectorizer = TfidfVectorizer(sublinear_tf=False)
X = vectorizer.fit_transform(docs.body.tolist())

# evaluation
training_queries.qid.apply(pAt10).mean()

0.8972

time: 2min 37s (started: 2023-03-24 00:45:58 +01:00)
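`sublinear_tf=True` replaces the raw term frequency tf by 1 + log(tf), so repeating a word many times gains less and less weight; this is the only difference between the 0.935 and the 0.8972 runs above. A small sketch of the scaling alone (idf weighting and normalization are left out, and `sublinear` is a hypothetical helper):

```python
import math

def sublinear(tf):
    # 1 + ln(tf) for positive counts; a count of 0 stays 0
    return 1 + math.log(tf) if tf > 0 else 0.0

raw_counts = [1, 2, 10, 100]
scaled = [round(sublinear(tf), 2) for tf in raw_counts]
print(list(zip(raw_counts, scaled)))
```

A word occurring 100 times ends up weighted only a few times more than a word occurring once, instead of 100 times more.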


### Other improvements

#### Clean text

In [192]:
def clean_text(text):
    # remove words with numbers inside
    text = re.sub(r'\w*\d+\w*', '', text)
    # replace new lines by space
    text = text.replace('\n', ' ')
    # remove urls
    text = re.sub(r'http\S+', '', text)
    # drop non-ascii characters
    text = text.encode("ascii", errors="ignore").decode()
    return text

# Cleaning corpus using RegEx
docs['cleaned'] = docs['body'].apply(clean_text)

time: 3min 12s (started: 2023-03-24 01:01:13 +01:00)


In [193]:
docs

Unnamed: 0,docid,title,body,cleaned
0,D2981241,What do you call a group of lions?,Lions Vocabulary of the English Language Word ...,Lions Vocabulary of the English Language Word ...
1,D687756,.,"The A Priori Argument ( also, Rationalization;...","The A Priori Argument ( also, Rationalization;..."
2,D913099,Everything You Need To Learn How To Cook Veget...,Home > How To Cook Vegetables Everything You N...,Home > How To Cook Vegetables Everything You N...
3,D328017,"What is the difference between latitude, longi...",Longitude Latitude Geographic Coordinate Syste...,Longitude Latitude Geographic Coordinate Syste...
4,D1636347,When was the pulley invented?,Answers.com ® Wiki Answers ® Categories Techno...,Answers.com Wiki Answers Categories Technolo...
...,...,...,...,...
92560,D3379210,Top 39 Doctor insights on: Can An Iud Cause Ha...,Top 39 Doctor insights on: Can An Iud Cause Ha...,Top Doctor insights on: Can An Iud Cause Hair...
92561,D3068739,How to get back your DirecTV cancellation fees,How to get back your Direc TV cancellation fee...,How to get back your Direc TV cancellation fee...
92562,D1590402,Certification FAQs,Fingerprinting 1. Where can I get fingerprinte...,Fingerprinting . Where can I get fingerprinted...
92563,D2175490,Greenhouse gas emissions by Canadian economic ...,"Access PDF (682 KB)In 2015, Canada's total gre...","Access PDF ( KB)In , Canada's total greenhouse..."


time: 13.8 ms (started: 2023-03-24 01:05:22 +01:00)


In [194]:
# vectorization
vectorizer = TfidfVectorizer(norm='l2')
X = vectorizer.fit_transform(docs.cleaned.tolist())

# evaluation
training_queries.qid.apply(pAt10).mean()

0.8932

time: 2min 14s (started: 2023-03-24 01:05:38 +01:00)


#### Remove Stopwords

In [197]:
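# Remove Stopwords
# (may need a one-time nltk.download('stopwords'))
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_Stopwords(text):
    # compare in lowercase: the NLTK list is lowercase, the documents are not
    text = ' '.join(word for word in text.split() if word.lower() not in stop_words)
    return text

# note: the 0.8932 recorded in the next cell came from a run that applied clean_text here,
# so it does not yet reflect stop-word removal
docs['removed'] = docs['cleaned'].apply(remove_Stopwords)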

time: 2min 7s (started: 2023-03-24 01:17:43 +01:00)


In [198]:
# vectorization
vectorizer = TfidfVectorizer(norm='l2')
X = vectorizer.fit_transform(docs.removed.tolist())

# evaluation
training_queries.qid.apply(pAt10).mean()

0.8932

time: 2min 13s (started: 2023-03-24 01:19:50 +01:00)


#### looking at bigrams

In [199]:
#### too much time to finish running ####

# # vectorization
# vectorizer = TfidfVectorizer(norm='l2', stop_words=list(stop_words), ngram_range=(2,2))
# X = vectorizer.fit_transform(docs.cleaned.tolist())

# # evaluation
# training_queries.qid.apply(pAt10).mean()

time: 399 µs (started: 2023-03-24 01:22:04 +01:00)


### Do a grid search with a few promising parameters

In [210]:
from sklearn.model_selection import ParameterGrid

# GridSearchCV expects an estimator with a target y and a scorer(y_true, y_pred);
# our evaluation has neither, and pAt10 reads the global `vectorizer` and `X`.
# So we loop over the parameter grid by hand instead.
param_grid = {
    'stop_words': [None, 'english'],
    'sublinear_tf': [True, False]
}

grid_results = []
for params in ParameterGrid(param_grid):
    # refit the globals used by pAt10 (each configuration takes a few minutes)
    vectorizer = TfidfVectorizer(**params)
    X = vectorizer.fit_transform(docs.cleaned.tolist())
    grid_results.append({
        'stop_words': str(params['stop_words']),
        'sublinear_tf': params['sublinear_tf'],
        'p@10': training_queries.qid.apply(pAt10).mean()
    })

# Create a dataframe of the grid search results
results_df = pd.DataFrame(grid_results)

# Print the best parameters and best score
best = results_df.loc[results_df['p@10'].idxmax()]
print("Best Parameters: ", best[['stop_words', 'sublinear_tf']].to_dict())
print("Best Score: ", best['p@10'])


In [211]:
# Pivot the dataframe to create a heatmap
pivot_df = results_df.pivot(index='stop_words', columns='sublinear_tf', values='p@10')

# Create the heatmap
sns.heatmap(pivot_df, annot=True, fmt=".4g")


### interpret the complete results in 3 to 5 sentences.
    - what strategy would do best if we switch our evaluation to p@100?

Among the runs above, the best strategy for this dataset is the TfidfVectorizer on the document bodies with the l2 norm and sublinear term frequency (p@10 = 0.935), ahead of tf-idf without sublinear scaling (0.8972), tf-idf with the l1 norm (0.7766), binary counts (0.48) and raw counts (0.0462). Cleaning the text slightly lowered the score (0.8932), and the stop-word run returned the same value.

If we switch our evaluation metric to p@100, a system is no longer judged only on its ten highest-scoring documents but on how many of the 100 gold documents it places anywhere in its top 100, so the best hyperparameters may differ. It may be worth experimenting with larger values of ngram_range and re-tuning the hyperparameters above for p@100.

### Give some ideas for improving the results

1. Use a more sophisticated vectorizer, such as a Word2Vec or GloVe embedding.
2. Experiment with different preprocessing techniques, such as lemmatization or stemming, to see if they improve the quality of the features.
3. Experiment with different hyperparameters for the vectorizer to see if we can find a combination that works better.