# About
1. This notebook load data from https://microsoft.github.io/msmarco/TREC-Deep-Learning.html for Information Retrival passage ranking system. It produces a small Train, Evaluation and Test sets and clean the text:

- The Train set contains 1000 queries and each query's top 10 passage and worst 10 passage
- The Evaluation set contains 1000 queries and each query's top 10 passage and worst 10 passage
- The Test set contains 1000 queries and each query's top 10 passage and worst 10 passage

2. It outputs 6 .csv files:

- Train set query info (query - and top 10 passages's ID, worst 10 passages's ID) (20000 rows)
- Evaluation set query info (query - and top 10 passages, worst 10 passages's ID)(20000 rows)
- Test set query info (query - and top 10 passages's ID, worst 10 passages's ID)(20000 rows)

- Train set passages (~20000 rows)
- Evaluation set passages (~20000 rows)
- Test set passages (~20000 rows)

3. This notebook takes around 3 hours to run and only need to run once



# Load libraries

In [6]:
import pandas as pd
import numpy as np
#import tarfile
import json
import re
import spacy
from tqdm.notebook import tqdm
tqdm.pandas()

# Download data (only need to run once)

data source: 

https://microsoft.github.io/msmarco/TREC-Deep-Learning.html

reference code: 

https://www.analyticsvidhya.com/blog/2020/08/information-retrieval-using-word2vec-based-vector-space-model/

https://github.com/ljxowen/TREC-Information-Retrieval/blob/main/Passages%20Ranking/TREC%20Project(passage%20Ranking).ipynb



https://github.com/snovaisg/Trec-DeepLearning-2020/blob/master/download_and_unzip_data


## Download queries data

In [2]:
#!wget -O data/passv2_train_queries.tsv --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_queries.tsv

--2024-03-03 09:33:08--  https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_queries.tsv
Resolving msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)... 20.150.34.1
Connecting to msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)|20.150.34.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11608838 (11M) [application/octet-stream]
Saving to: ‘data/passv2_train_queries.tsv’


2024-03-03 09:33:18 (1.16 MB/s) - ‘data/passv2_train_queries.tsv’ saved [11608838/11608838]



## Download passage corpus data (used 1 hour 50 minutes)

In [10]:
#!wget -O data/msmarco_v2_passage.tar --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_passage.tar

--2024-03-03 10:03:40--  https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_passage.tar
Resolving msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)... 20.150.34.1
Connecting to msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)|20.150.34.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21768192000 (20G) [application/x-tar]
Saving to: ‘data/msmarco_v2_passage.tar’


2024-03-03 11:54:09 (3.13 MB/s) - ‘data/msmarco_v2_passage.tar’ saved [21768192000/21768192000]



In [24]:
# # extract the tar file
# # open file 
# file = tarfile.open("data/msmarco_v2_passage.tar") 

# # extracting file 
# file.extractall("./data/") 

In [32]:
# decompress all .gz files in this floder, it takes 21minutes 22 seconds
#!gzip -d -r ./data/msmarco_v2_passage/

## [do not use] Download qrels data

In [6]:
#!wget -O data/passv2_train_qrels.tsv --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_qrels.tsv

--2024-03-03 09:50:21--  https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_qrels.tsv
Resolving msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)... 20.150.34.1
Connecting to msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)|20.150.34.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11620946 (11M) [application/octet-stream]
Saving to: ‘data/passv2_train_qrels.tsv’


2024-03-03 09:50:25 (2.61 MB/s) - ‘data/passv2_train_qrels.tsv’ saved [11620946/11620946]



## Download passage top 100 data

In [11]:
#!wget -O data/passv2_train_top100.txt.gz --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_top100.txt.gz

--2024-03-03 11:56:18--  https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_top100.txt.gz
Resolving msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)... 20.150.34.1
Connecting to msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)|20.150.34.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 340634991 (325M) [application/x-gzip]
Saving to: ‘data/passv2_train_top100.txt.gz’


2024-03-03 11:58:07 (3.00 MB/s) - ‘data/passv2_train_top100.txt.gz’ saved [340634991/340634991]



In [None]:
## decompress the gz file
#!gzip -d ./data/passv2_train_top100.txt.gz

# Load data

## Load queries data

In [8]:
train_queries_df = pd.read_csv('./data/passv2_train_queries.tsv'
                               , delimiter = "\t" 
                               , header=None
                               , names = ['query_id','query'])

In [9]:
print(train_queries_df.shape)
display(train_queries_df.head())

(277144, 2)


Unnamed: 0,query_id,query
0,121352,define extreme
1,510633,tattoo fixers how much does it cost
2,674172,what is a bank transit number
3,570009,what are the four major groups of elements
4,54528,blood clots in urine after menopause


## [do not use] Load qrels data

In [7]:
# train_qrels_df = pd.read_csv('./data/passv2_train_qrels.tsv'
#                              , names = ['0','passage','1']
#                              , header = None,delimiter = "\t")

## Load each query's top 100 passages' info

In [10]:
train_top100_df = pd.read_csv('./data/passv2_train_top100.txt'
                              , delimiter = " "
                              , names = ['query_id','used','passage_id','rank','score','username'])
                              

# Get Train, Test sets by random sampling

## Queries - randomly choose 2000 as Train and Test queries

In [11]:
train_queries_df_sample3000 = train_queries_df.sample(n=3000,random_state=42).reset_index(drop=True)


print(train_queries_df_sample3000.shape)
display(train_queries_df_sample3000.head())

(3000, 2)


Unnamed: 0,query_id,query
0,916247,what us state bears the slogan the land enchan...
1,203324,him functions to the paper health record
2,123916,define merit-based pay
3,54169,bitcoin price increasing
4,766010,what is linguistic chauvinism


## Queries - randomly choose 1000 samples as Train set, 1000 samples as Validation set and 1000 samples Test set

In [12]:
query_train_set = train_queries_df_sample3000.sample(n=1000,random_state=42).reset_index(drop=True)


print(query_train_set.shape)
display(query_train_set.head())

(1000, 2)


Unnamed: 0,query_id,query
0,568182,what are the characteristics of wool fibres
1,36836,average grad school loan rates
2,476199,pneumonia contagious period
3,193532,gastroparesis symptoms and treatments
4,1060211,why are beets the super food for the liver


In [13]:
query_val_and_test_set = train_queries_df_sample3000[~train_queries_df_sample3000["query_id"].isin(query_train_set["query_id"])]

print(query_val_and_test_set.shape)


(2000, 2)


In [14]:
query_val_set = query_val_and_test_set.sample(n=1000,random_state=42).reset_index(drop=True)


print(query_val_set.shape)
display(query_val_set.head())

(1000, 2)


Unnamed: 0,query_id,query
0,791126,what is respiration controlled by
1,508396,symptoms of peptic ulcer disease
2,152571,diseases that cause hyperpigmentation
3,50371,benefits of betaine hcl
4,28705,at what internal temp will baled hay mold


In [15]:
query_test_set = query_val_and_test_set[~query_val_and_test_set["query_id"].isin(query_val_set["query_id"])]

print(query_test_set.shape)
display(query_test_set.head())

(1000, 2)


Unnamed: 0,query_id,query
1,203324,him functions to the paper health record
2,123916,define merit-based pay
4,766010,what is linguistic chauvinism
5,497475,side effects for fluticasone furoate
6,634331,what does chemoautotroph mean


In [16]:
# check make sure no test data is in train or val data
[id for id in query_test_set["query_id"].to_list() if id in query_train_set["query_id"].to_list() or id in query_val_set["query_id"].to_list()]

[]

In [17]:
# check make sure no val data is in train data
[id for id in query_val_set["query_id"].to_list() if id in query_train_set["query_id"].to_list()]

[]

## Passages info - For Train, Validation and Test sets, get each query's top 100 passages' info

In [18]:
train_top100_df.head()

Unnamed: 0,query_id,used,passage_id,rank,score,username
0,5,Q0,msmarco_passage_49_25899182,1,12.1278,Anserini
1,5,Q0,msmarco_passage_06_781809452,2,11.9428,Anserini
2,5,Q0,msmarco_passage_09_146319807,3,11.7703,Anserini
3,5,Q0,msmarco_passage_18_567713921,4,11.5883,Anserini
4,5,Q0,msmarco_passage_30_434058059,5,11.588299,Anserini


In [19]:
# train
query_train_set_with_top100_passage_info = query_train_set.merge(train_top100_df
                                                          , how = "left"
                                                          , on = "query_id")

print(query_train_set_with_top100_passage_info.shape)
display(query_train_set_with_top100_passage_info.head())

(100000, 7)


Unnamed: 0,query_id,query,used,passage_id,rank,score,username
0,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547459701,1,16.102699,Anserini
1,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_21_588232716,2,14.7847,Anserini
2,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_11_97323294,3,13.8658,Anserini
3,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_68_54887603,4,13.865799,Anserini
4,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547466749,5,13.7936,Anserini


In [20]:
# val
query_val_set_with_top100_passage_info = query_val_set.merge(train_top100_df
                                                          , how = "left"
                                                          , on = "query_id")

print(query_val_set_with_top100_passage_info.shape)
display(query_val_set_with_top100_passage_info.head())

(100000, 7)


Unnamed: 0,query_id,query,used,passage_id,rank,score,username
0,791126,what is respiration controlled by,Q0,msmarco_passage_48_489095876,1,9.5709,Anserini
1,791126,what is respiration controlled by,Q0,msmarco_passage_30_653845189,2,9.5522,Anserini
2,791126,what is respiration controlled by,Q0,msmarco_passage_04_540229875,3,9.4101,Anserini
3,791126,what is respiration controlled by,Q0,msmarco_passage_27_425453773,4,9.1409,Anserini
4,791126,what is respiration controlled by,Q0,msmarco_passage_59_586798337,5,8.8821,Anserini


In [21]:
# test
query_test_set_with_top100_passage_info = query_test_set.merge(train_top100_df
                                                          , how = "left"
                                                          , on = "query_id")

print(query_test_set_with_top100_passage_info.shape)
display(query_test_set_with_top100_passage_info.head())

(100000, 7)


Unnamed: 0,query_id,query,used,passage_id,rank,score,username
0,203324,him functions to the paper health record,Q0,msmarco_passage_49_115778700,1,13.2671,Anserini
1,203324,him functions to the paper health record,Q0,msmarco_passage_26_450017756,2,11.8537,Anserini
2,203324,him functions to the paper health record,Q0,msmarco_passage_19_202396335,3,11.2738,Anserini
3,203324,him functions to the paper health record,Q0,msmarco_passage_28_623516004,4,11.0361,Anserini
4,203324,him functions to the paper health record,Q0,msmarco_passage_02_47181174,5,10.9476,Anserini


## Queries - For Train, Validation and Test set, keep 20 passages for each query

## keep the top 10 passage (mark as rel = 1) and the worst 10 passage ( mark as rel = 0) 



In [22]:
rel=list(range(1,11))
nonrel=list(range(91,101))

query_train_set_with_top100_passage_info['rel'] = query_train_set_with_top100_passage_info['rank'].apply(
                                                    lambda x: 1 if x in rel else ( 0 if x in nonrel else np.nan))

query_val_set_with_top100_passage_info['rel'] = query_val_set_with_top100_passage_info['rank'].apply(
                                                    lambda x: 1 if x in rel else ( 0 if x in nonrel else np.nan))

query_test_set_with_top100_passage_info['rel'] = query_test_set_with_top100_passage_info['rank'].apply(
                                                    lambda x: 1 if x in rel else ( 0 if x in nonrel else np.nan))

In [23]:
query_train_set_with_top100_passage_info[query_train_set_with_top100_passage_info['rel']==1].head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel
0,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547459701,1,16.102699,Anserini,1.0
1,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_21_588232716,2,14.7847,Anserini,1.0
2,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_11_97323294,3,13.8658,Anserini,1.0
3,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_68_54887603,4,13.865799,Anserini,1.0
4,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547466749,5,13.7936,Anserini,1.0


In [24]:
query_train_set_with_top100_passage_info[query_train_set_with_top100_passage_info['rel']==0].head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel
90,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_60_105759450,91,11.130899,Anserini,0.0
91,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_10_713131218,92,11.1237,Anserini,0.0
92,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_19_856407904,93,11.1162,Anserini,0.0
93,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_21_607516236,94,11.116199,Anserini,0.0
94,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_45_566215262,95,11.0814,Anserini,0.0


In [25]:
query_train_set_with_top100_passage_info[query_train_set_with_top100_passage_info['rel'].isnull()].head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel
10,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_04_190143648,11,12.8934,Anserini,
11,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_50_795936449,12,12.8562,Anserini,
12,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_06_345598979,13,12.7799,Anserini,
13,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_48_498233726,14,12.5037,Anserini,
14,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_14_377280660,15,12.4557,Anserini,


In [26]:
query_train_set_with_top100_passage_info.loc[~query_train_set_with_top100_passage_info['rel'].isnull()]

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel
0,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547459701,1,16.102699,Anserini,1.0
1,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_21_588232716,2,14.784700,Anserini,1.0
2,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_11_97323294,3,13.865800,Anserini,1.0
3,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_68_54887603,4,13.865799,Anserini,1.0
4,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547466749,5,13.793600,Anserini,1.0
...,...,...,...,...,...,...,...,...
99995,949330,when is a sore no longer contagious,Q0,msmarco_passage_43_8885256,96,11.128300,Anserini,0.0
99996,949330,when is a sore no longer contagious,Q0,msmarco_passage_08_424943290,97,11.116700,Anserini,0.0
99997,949330,when is a sore no longer contagious,Q0,msmarco_passage_55_288217143,98,11.116699,Anserini,0.0
99998,949330,when is a sore no longer contagious,Q0,msmarco_passage_42_57931706,99,11.114800,Anserini,0.0


In [27]:

query_train_set_with_passage_info = query_train_set_with_top100_passage_info.loc[~query_train_set_with_top100_passage_info['rel'].isnull()]
query_train_set_with_passage_info['rel'] = query_train_set_with_passage_info['rel'].astype(int)

query_val_set_with_passage_info = query_val_set_with_top100_passage_info.loc[~query_val_set_with_top100_passage_info['rel'].isnull()]
query_val_set_with_passage_info['rel'] = query_val_set_with_passage_info['rel'].astype(int)

query_test_set_with_passage_info = query_test_set_with_top100_passage_info.loc[~query_test_set_with_top100_passage_info['rel'].isnull()]
query_test_set_with_passage_info['rel'] = query_test_set_with_passage_info['rel'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_train_set_with_passage_info['rel'] = query_train_set_with_passage_info['rel'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_val_set_with_passage_info['rel'] = query_val_set_with_passage_info['rel'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_test_set_wi

In [28]:
print(query_train_set_with_passage_info.shape)
print(query_val_set_with_passage_info.shape)
print(query_test_set_with_passage_info.shape)
display(query_train_set_with_passage_info.head())
display(query_val_set_with_passage_info.head())
display(query_test_set_with_passage_info.head())

(20000, 8)
(20000, 8)
(20000, 8)


Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel
0,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547459701,1,16.102699,Anserini,1
1,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_21_588232716,2,14.7847,Anserini,1
2,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_11_97323294,3,13.8658,Anserini,1
3,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_68_54887603,4,13.865799,Anserini,1
4,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547466749,5,13.7936,Anserini,1


Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel
0,791126,what is respiration controlled by,Q0,msmarco_passage_48_489095876,1,9.5709,Anserini,1
1,791126,what is respiration controlled by,Q0,msmarco_passage_30_653845189,2,9.5522,Anserini,1
2,791126,what is respiration controlled by,Q0,msmarco_passage_04_540229875,3,9.4101,Anserini,1
3,791126,what is respiration controlled by,Q0,msmarco_passage_27_425453773,4,9.1409,Anserini,1
4,791126,what is respiration controlled by,Q0,msmarco_passage_59_586798337,5,8.8821,Anserini,1


Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel
0,203324,him functions to the paper health record,Q0,msmarco_passage_49_115778700,1,13.2671,Anserini,1
1,203324,him functions to the paper health record,Q0,msmarco_passage_26_450017756,2,11.8537,Anserini,1
2,203324,him functions to the paper health record,Q0,msmarco_passage_19_202396335,3,11.2738,Anserini,1
3,203324,him functions to the paper health record,Q0,msmarco_passage_28_623516004,4,11.0361,Anserini,1
4,203324,him functions to the paper health record,Q0,msmarco_passage_02_47181174,5,10.9476,Anserini,1


## Passage content - for Train, Validation and Test

In [29]:
# define a function that can get passage content from corpus
def get_passage(passage_id):
    (string1, string2, bundlenum, position) = passage_id.split('_')
    assert string1 == 'msmarco' and string2 == 'passage'

    with open(f'./data/msmarco_v2_passage/msmarco_passage_{bundlenum}', 'rt', encoding='utf8') as in_fh:
        in_fh.seek(int(position))
        json_string = in_fh.readline()
        passage = json.loads(json_string)
        assert passage['pid'] == passage_id
        return passage

In [30]:
# test to extract passage corpus data - try one example with passage_id (pid) 
get_passage("msmarco_passage_05_840839268")

{'pid': 'msmarco_passage_05_840839268',
 'passage': 'New Mexico State Symbols. State Nickname: The Land of Enchantment. State Slogan: Land of Enchantment; also on its license plate. State Motto: Crescit eundo (It grows as it goes) State flower: Yucca flower. State bird: Roadrunner aka Greater Roadrunner.',
 'spans': '(610,634),(635,674),(675,735),(736,784),(785,811),(812,857)',
 'docid': 'msmarco_doc_05_1547437048'}

In [31]:
# put train, test's passage_id into lists
train_passage_id = query_train_set_with_passage_info["passage_id"].to_list()
val_passage_id = query_val_set_with_passage_info["passage_id"].to_list()
test_passage_id = query_test_set_with_passage_info["passage_id"].to_list()


In [32]:
print(len(train_passage_id))
print(len(val_passage_id))
print(len(test_passage_id))
train_passage_id[:5]

20000
20000
20000


['msmarco_passage_62_547459701',
 'msmarco_passage_21_588232716',
 'msmarco_passage_11_97323294',
 'msmarco_passage_68_54887603',
 'msmarco_passage_62_547466749']

In [33]:
# define a function that for each passage_id in the list_passage_id, get its passage content, save into a dict
def extract_passage_content_using_pid(list_passage_id):
    dict_passage_id_content = dict()

    for passage_id in list_passage_id:
        passage_dict = get_passage(passage_id)
        #print(passage_dict)
        #print(passage_dict["passage"])
        dict_passage_id_content[passage_id] = passage_dict["passage"]
    print(f'Found {len(dict_passage_id_content)} passage number.')
    dict(list(dict_passage_id_content.items())[0:5]) 
    return dict_passage_id_content

In [34]:
dict_train_passage_id_content = extract_passage_content_using_pid(train_passage_id)
dict_val_passage_id_content = extract_passage_content_using_pid(val_passage_id)
dict_test_passage_id_content = extract_passage_content_using_pid(test_passage_id)

Found 19920 passage number.
Found 19933 passage number.
Found 19854 passage number.


In [35]:
# print 5 examples
for key in list(dict_train_passage_id_content.keys())[:5]:
    print(key)
    print(dict_train_passage_id_content[key])
    print("\n")

msmarco_passage_62_547459701
Table of Contents. Growth. Harvesting. Grading of Wool Fibers. Properties of Wool Fibers. Application of Wool Fibers. Characteristics of Wool Fibers and Products. Summary of Characteristics of Wool Fibers. Of the major apparel fibres, wool is the most reusable and recyclable fibre on the planet.


msmarco_passage_21_588232716
A micron ( micrometre) is the measurement used to express the diameter of wool fibre. Fine wool fibers have a low micron value. Fibre diameter is the most important characteristic of wool in determining its value.


msmarco_passage_11_97323294
Objective measurements include diameter (micron), length, strength, position of break, vegetable matter and colour. AWEX-ID covers subjective characteristics. Diameter. Mean fibre diameter is a measurement in micrometres (microns) of the average diameter of wool fibres in a sale lot. Fibre diameter is responsible for 70-80 per cent of the greasy wool price over the long term.


msmarco_passage_68

In [36]:
# change dict to df
train_passage_id_content = pd.DataFrame(dict_train_passage_id_content.items()
                                         ,columns = ["passage_id", "passage"])

val_passage_id_content = pd.DataFrame(dict_val_passage_id_content.items()
                                         ,columns = ["passage_id", "passage"])

test_passage_id_content = pd.DataFrame(dict_test_passage_id_content.items()
                                         ,columns = ["passage_id", "passage"])

In [37]:
print(train_passage_id_content.shape)
print(val_passage_id_content.shape)
print(test_passage_id_content.shape)
display(train_passage_id_content.head())
display(val_passage_id_content.head())
display(test_passage_id_content.head())

(19920, 2)
(19933, 2)
(19854, 2)


Unnamed: 0,passage_id,passage
0,msmarco_passage_62_547459701,Table of Contents. Growth. Harvesting. Grading...
1,msmarco_passage_21_588232716,A micron ( micrometre) is the measurement used...
2,msmarco_passage_11_97323294,Objective measurements include diameter (micro...
3,msmarco_passage_68_54887603,Objective measurements include diameter (micro...
4,msmarco_passage_62_547466749,Summary of Characteristics of Wool Fibers. Woo...


Unnamed: 0,passage_id,passage
0,msmarco_passage_48_489095876,Explain what respirable crystalline silica is ...
1,msmarco_passage_30_653845189,soft and elastic. which part of the brain cont...
2,msmarco_passage_04_540229875,What Part of the Brain Controls Breathing. Let...
3,msmarco_passage_27_425453773,Learning Objectives. Describe the neural mecha...
4,msmarco_passage_59_586798337,Key Takeaways. Respirators are a last resort a...


Unnamed: 0,passage_id,passage
0,msmarco_passage_49_115778700,"This information can be either paper-based, a ..."
1,msmarco_passage_26_450017756,Hybrid Health Record. Electronic Health Record...
2,msmarco_passage_19_202396335,Health information management ( HIM) is inform...
3,msmarco_passage_28_623516004,Documentation for Health Records addresses iss...
4,msmarco_passage_02_47181174,The influence of this growing shift toward tec...


# Text cleaning - Only needed for Word2Vec, not needed for BERT

## Clean queries for Train, Validation and Test

In [38]:
# reference: https://www.analyticsvidhya.com/blog/2020/08/information-retrieval-using-word2vec-based-vector-space-model/

# Dictionary of english Contractions
contractions_dict = { "ain't": "are not","'s":" is","aren't": "are not","can't": "can not","can't've": "cannot have",
"'cause": "because","could've": "could have","couldn't": "could not","couldn't've": "could not have",
"didn't": "did not","doesn't": "does not","don't": "do not","hadn't": "had not","hadn't've": "had not have",
"hasn't": "has not","haven't": "have not","he'd": "he would","he'd've": "he would have","he'll": "he will",
"he'll've": "he will have","how'd": "how did","how'd'y": "how do you","how'll": "how will","i'd": "i would",
"i'd've": "i would have","i'll": "i will","i'll've": "i will have","i'm": "i am","i've": "i have",
"isn't": "is not","it'd": "it would","it'd've": "it would have","it'll": "it will","it'll've": "it will have",
"let's": "let us","ma'am": "madam","mayn't": "may not","might've": "might have","mightn't": "might not",
"mightn't've": "might not have","must've": "must have","mustn't": "must not","mustn't've": "must not have",
"needn't": "need not","needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
"oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
"shan't've": "shall not have","she'd": "she would","she'd've": "she would have","she'll": "she will",
"she'll've": "she will have","should've": "should have","shouldn't": "should not",
"shouldn't've": "should not have","so've": "so have","that'd": "that would","that'd've": "that would have",
"there'd": "there would","there'd've": "there would have",
"they'd": "they would","they'd've": "they would have","they'll": "they will","they'll've": "they will have",
"they're": "they are","they've": "they have","to've": "to have","wasn't": "was not","we'd": "we would",
"we'd've": "we would have","we'll": "we will","we'll've": "we will have","we're": "we are","we've": "we have",
"weren't": "were not","what'll": "what will","what'll've": "what will have","what're": "what are",
"what've": "what have","when've": "when have","where'd": "where did",
"where've": "where have","who'll": "who will","who'll've": "who will have","who've": "who have",
"why've": "why have","will've": "will have","won't": "will not","won't've": "will not have",
"would've": "would have","wouldn't": "would not","wouldn't've": "would not have","y'all": "you all",
"y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
"you'd": "you would","you'd've": "you would have","you'll": "you will","you'll've": "you will have",
"you're": "you are","you've": "you have"}

# Regular expression for finding contractions
contractions_re=re.compile('(%s)' % '|'.join(contractions_dict.keys()))

# Function for expanding contractions
def expand_contractions(text,contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, text)

In [39]:
# Function for Cleaning Text
# remove the words with digits, replace newline characters with space, remove URLs,
# and replace everything that isn’t English alphabets with space.

def clean_text(text):
    text=re.sub('\w*\d\w*','', text)
    text=re.sub('\n',' ',text)
    text=re.sub(r"http\S+", "", text)
    text=re.sub('[^a-z]',' ',text)
    return text

In [40]:
def clean_query_df(query_df, new_col, orig_col):
    # Lowercasing the text
    query_df[new_col] = query_df[orig_col].apply(lambda x:x.lower())

    # Expanding contractions
    query_df[new_col]=query_df[new_col].apply(lambda x:expand_contractions(x))


    # Cleaning queries using RegEx
    query_df[new_col]= query_df[new_col].apply(lambda x: clean_text(x))


    # Removing extra spaces
    query_df[new_col] = query_df[new_col].apply(lambda x: re.sub(' +',' ',x))

    return query_df


In [41]:
query_train_set_with_passage_info.head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel
0,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547459701,1,16.102699,Anserini,1
1,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_21_588232716,2,14.7847,Anserini,1
2,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_11_97323294,3,13.8658,Anserini,1
3,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_68_54887603,4,13.865799,Anserini,1
4,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547466749,5,13.7936,Anserini,1


In [42]:
query_test_set_with_passage_info.head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel
0,203324,him functions to the paper health record,Q0,msmarco_passage_49_115778700,1,13.2671,Anserini,1
1,203324,him functions to the paper health record,Q0,msmarco_passage_26_450017756,2,11.8537,Anserini,1
2,203324,him functions to the paper health record,Q0,msmarco_passage_19_202396335,3,11.2738,Anserini,1
3,203324,him functions to the paper health record,Q0,msmarco_passage_28_623516004,4,11.0361,Anserini,1
4,203324,him functions to the paper health record,Q0,msmarco_passage_02_47181174,5,10.9476,Anserini,1


In [58]:
query_train_set_with_passage_info_cleaned = clean_query_df(query_train_set_with_passage_info
                                                    , "query_cleaned"
                                                    , "query")

query_val_set_with_passage_info_cleaned = clean_query_df(query_val_set_with_passage_info
                                                    , "query_cleaned"
                                                    , "query")

query_test_set_with_passage_info_cleaned = clean_query_df(query_test_set_with_passage_info
                                                    , "query_cleaned"
                                                    , "query")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_df[new_col] = query_df[orig_col].apply(lambda x:x.lower())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_df[new_col]=query_df[new_col].apply(lambda x:expand_contractions(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_df[new_col]= query_df[new_col].apply(lambda x: clean_text(x

In [59]:
query_train_set_with_passage_info_cleaned.head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel,query_cleaned
0,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547459701,1,16.102699,Anserini,1,what are the characteristics of wool fibres
1,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_21_588232716,2,14.7847,Anserini,1,what are the characteristics of wool fibres
2,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_11_97323294,3,13.8658,Anserini,1,what are the characteristics of wool fibres
3,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_68_54887603,4,13.865799,Anserini,1,what are the characteristics of wool fibres
4,568182,what are the characteristics of wool fibres,Q0,msmarco_passage_62_547466749,5,13.7936,Anserini,1,what are the characteristics of wool fibres


In [60]:
query_val_set_with_passage_info_cleaned.head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel,query_cleaned
0,791126,what is respiration controlled by,Q0,msmarco_passage_48_489095876,1,9.5709,Anserini,1,what is respiration controlled by
1,791126,what is respiration controlled by,Q0,msmarco_passage_30_653845189,2,9.5522,Anserini,1,what is respiration controlled by
2,791126,what is respiration controlled by,Q0,msmarco_passage_04_540229875,3,9.4101,Anserini,1,what is respiration controlled by
3,791126,what is respiration controlled by,Q0,msmarco_passage_27_425453773,4,9.1409,Anserini,1,what is respiration controlled by
4,791126,what is respiration controlled by,Q0,msmarco_passage_59_586798337,5,8.8821,Anserini,1,what is respiration controlled by


In [45]:
query_test_set_with_passage_info_cleaned.head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,rel,query_cleaned
0,203324,him functions to the paper health record,Q0,msmarco_passage_49_115778700,1,13.2671,Anserini,1,him functions to the paper health record
1,203324,him functions to the paper health record,Q0,msmarco_passage_26_450017756,2,11.8537,Anserini,1,him functions to the paper health record
2,203324,him functions to the paper health record,Q0,msmarco_passage_19_202396335,3,11.2738,Anserini,1,him functions to the paper health record
3,203324,him functions to the paper health record,Q0,msmarco_passage_28_623516004,4,11.0361,Anserini,1,him functions to the paper health record
4,203324,him functions to the paper health record,Q0,msmarco_passage_02_47181174,5,10.9476,Anserini,1,him functions to the paper health record


## Clean passages for Train, Validation and Test

In [46]:
def clean_corpus(corpus_df, new_col, orig_col):

    # Lowercasing the text
    corpus_df[new_col] = corpus_df[orig_col].apply(lambda x:x.lower())


    # Expanding Contractions
    corpus_df[new_col] = corpus_df[new_col].apply(lambda x:expand_contractions(x))


    # Cleaning corpus using RegEx
    corpus_df[new_col] = corpus_df[new_col].apply(lambda x: clean_text(x))


    # Removing extra spaces
    corpus_df[new_col] = corpus_df[new_col].apply(lambda x: re.sub(' +',' ',x))


    # Stopwords removal & Lemmatizing tokens using SpaCy

    nlp = spacy.load('en_core_web_sm',disable=['ner','parser'])
    nlp.max_length=5000000

    # Removing Stopwords and Lemmatizing words
    corpus_df[new_col + '_lemmatized'] = corpus_df[new_col].progress_apply(lambda x: ' '.join([token.lemma_ for token in list(nlp(x)) if (token.is_stop==False)]))
    return corpus_df

In [47]:
display(train_passage_id_content.head())

Unnamed: 0,passage_id,passage
0,msmarco_passage_62_547459701,Table of Contents. Growth. Harvesting. Grading...
1,msmarco_passage_21_588232716,A micron ( micrometre) is the measurement used...
2,msmarco_passage_11_97323294,Objective measurements include diameter (micro...
3,msmarco_passage_68_54887603,Objective measurements include diameter (micro...
4,msmarco_passage_62_547466749,Summary of Characteristics of Wool Fibers. Woo...


In [48]:
train_passage_id_content_cleaned = clean_corpus(train_passage_id_content
             , "passage_cleaned", "passage")

  0%|          | 0/19920 [00:00<?, ?it/s]

In [49]:
val_passage_id_content_cleaned = clean_corpus(val_passage_id_content
             , "passage_cleaned", "passage")

  0%|          | 0/19933 [00:00<?, ?it/s]

In [50]:

test_passage_id_content_cleaned = clean_corpus(test_passage_id_content
             , "passage_cleaned", "passage")


  0%|          | 0/19854 [00:00<?, ?it/s]

In [51]:
train_passage_id_content_cleaned.head()

Unnamed: 0,passage_id,passage,passage_cleaned,passage_cleaned_lemmatized
0,msmarco_passage_62_547459701,Table of Contents. Growth. Harvesting. Grading...,table of contents growth harvesting grading of...,table content growth harvesting grade wool fib...
1,msmarco_passage_21_588232716,A micron ( micrometre) is the measurement used...,a micron micrometre is the measurement used to...,micron micrometre measurement express diameter...
2,msmarco_passage_11_97323294,Objective measurements include diameter (micro...,objective measurements include diameter micron...,objective measurement include diameter micron ...
3,msmarco_passage_68_54887603,Objective measurements include diameter (micro...,objective measurements include diameter micron...,objective measurement include diameter micron ...
4,msmarco_passage_62_547466749,Summary of Characteristics of Wool Fibers. Woo...,summary of characteristics of wool fibers wool...,summary characteristic wool fiber wool protein...


In [52]:
val_passage_id_content_cleaned.head()

Unnamed: 0,passage_id,passage,passage_cleaned,passage_cleaned_lemmatized
0,msmarco_passage_48_489095876,Explain what respirable crystalline silica is ...,explain what respirable crystalline silica is ...,explain respirable crystalline silica health h...
1,msmarco_passage_30_653845189,soft and elastic. which part of the brain cont...,soft and elastic which part of the brain contr...,soft elastic brain control involuntary action ...
2,msmarco_passage_04_540229875,What Part of the Brain Controls Breathing. Let...,what part of the brain controls breathing let ...,brain control breathing let s know breathe res...
3,msmarco_passage_27_425453773,Learning Objectives. Describe the neural mecha...,learning objectives describe the neural mechan...,learn objective describe neural mechanism resp...
4,msmarco_passage_59_586798337,Key Takeaways. Respirators are a last resort a...,key takeaways respirators are a last resort an...,key takeaway respirator resort control method ...


In [53]:
test_passage_id_content_cleaned.head()

Unnamed: 0,passage_id,passage,passage_cleaned,passage_cleaned_lemmatized
0,msmarco_passage_49_115778700,"This information can be either paper-based, a ...",this information can be either paper based a c...,information paper base combination paper digit...
1,msmarco_passage_26_450017756,Hybrid Health Record. Electronic Health Record...,hybrid health record electronic health records...,hybrid health record electronic health record ...
2,msmarco_passage_19_202396335,Health information management ( HIM) is inform...,health information management him is informati...,health information management information mana...
3,msmarco_passage_28_623516004,Documentation for Health Records addresses iss...,documentation for health records addresses iss...,documentation health record address issue rela...
4,msmarco_passage_02_47181174,The influence of this growing shift toward tec...,the influence of this growing shift toward tec...,influence grow shift technology feel industry ...


In [54]:
train_passage_id_content_cleaned[train_passage_id_content_cleaned["passage_cleaned_lemmatized"].isnull()]

Unnamed: 0,passage_id,passage,passage_cleaned,passage_cleaned_lemmatized


In [55]:
val_passage_id_content_cleaned[val_passage_id_content_cleaned["passage_cleaned_lemmatized"].isnull()]

Unnamed: 0,passage_id,passage,passage_cleaned,passage_cleaned_lemmatized


In [56]:
test_passage_id_content_cleaned[test_passage_id_content_cleaned["passage_cleaned_lemmatized"].isnull()]

Unnamed: 0,passage_id,passage,passage_cleaned,passage_cleaned_lemmatized


# Save processed data

In [61]:
# query
query_train_set_with_passage_info.to_csv("./output/query_train_set_with_passage_info.csv"
                                       , index = False)
query_val_set_with_passage_info.to_csv("./output/query_val_set_with_passage_info.csv"
                                       , index = False)
query_test_set_with_passage_info.to_csv("./output/query_test_set_with_passage_info.csv"
                                       , index = False)

# passage
train_passage_id_content_cleaned.to_csv("./output/train_passage_id_content_cleaned.csv"
                                       , index = False)
val_passage_id_content_cleaned.to_csv("./output/val_passage_id_content_cleaned.csv"
                                       , index = False)
test_passage_id_content_cleaned.to_csv("./output/test_passage_id_content_cleaned.csv"
                                       , index = False)