# About
1. This notebook loads data from https://microsoft.github.io/msmarco/TREC-Deep-Learning.html to solve a passage ranking problem, produces small random samples for the Train and Test sets, and cleans the text:

- The Train set contains 1,000 queries and each query's top 10 passages

- The Test set contains 1,000 queries and each query's top 10 passages

2. It outputs 4 .csv files:

- Train set query info (each query and its top 10 passages)

- Test set query info (each query and its top 10 passages)

- Train set passages

- Test set passages

3. This notebook takes around 3 hours to run and only needs to be run once.

# Load libraries

In [106]:
import pandas as pd
#import numpy as np
#import tarfile
import json
import re
import spacy
from tqdm.notebook import tqdm
tqdm.pandas()

# Download data (only need to run once)

data source: 
https://microsoft.github.io/msmarco/TREC-Deep-Learning.html

reference code: 
https://github.com/snovaisg/Trec-DeepLearning-2020/blob/master/download_and_unzip_data


## Download queries data

In [2]:
#!wget -O data/passv2_train_queries.tsv --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_queries.tsv

--2024-03-03 09:33:08--  https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_queries.tsv
Resolving msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)... 20.150.34.1
Connecting to msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)|20.150.34.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11608838 (11M) [application/octet-stream]
Saving to: ‘data/passv2_train_queries.tsv’


2024-03-03 09:33:18 (1.16 MB/s) - ‘data/passv2_train_queries.tsv’ saved [11608838/11608838]



## Download passage corpus data (took about 1 hour 50 minutes)

In [10]:
#!wget -O data/msmarco_v2_passage.tar --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_passage.tar

--2024-03-03 10:03:40--  https://msmarco.z22.web.core.windows.net/msmarcoranking/msmarco_v2_passage.tar
Resolving msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)... 20.150.34.1
Connecting to msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)|20.150.34.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21768192000 (20G) [application/x-tar]
Saving to: ‘data/msmarco_v2_passage.tar’


2024-03-03 11:54:09 (3.13 MB/s) - ‘data/msmarco_v2_passage.tar’ saved [21768192000/21768192000]



In [24]:
# # extract the tar file
# # open file 
# file = tarfile.open("data/msmarco_v2_passage.tar") 

# # extracting file 
# file.extractall("./data/") 

In [32]:
# decompress all .gz files in this folder; it takes about 21 minutes 22 seconds
#!gzip -d -r ./data/msmarco_v2_passage/
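
Where the `gzip` command-line tool is not available, the same recursive decompression could be sketched with the Python standard library. The helper name `gunzip_tree` is hypothetical, and the sketch assumes every `.gz` file should be replaced by its decompressed version, as `gzip -d -r` does.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_tree(root):
    """Decompress every .gz file under root and delete the compressed copy.

    Hypothetical helper; it mirrors what `gzip -d -r <root>` does.
    """
    decompressed = []
    for gz_path in sorted(Path(root).rglob("*.gz")):
        out_path = gz_path.with_suffix("")  # drop the trailing .gz
        # Stream in chunks so large bundles never have to fit in memory.
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()
        decompressed.append(out_path)
    return decompressed
```

In this notebook it would be called as `gunzip_tree("./data/msmarco_v2_passage/")`.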

## [do not use] Download qrels data

In [6]:
#!wget -O data/passv2_train_qrels.tsv --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_qrels.tsv

--2024-03-03 09:50:21--  https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_qrels.tsv
Resolving msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)... 20.150.34.1
Connecting to msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)|20.150.34.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11620946 (11M) [application/octet-stream]
Saving to: ‘data/passv2_train_qrels.tsv’


2024-03-03 09:50:25 (2.61 MB/s) - ‘data/passv2_train_qrels.tsv’ saved [11620946/11620946]



## Download passage top 100 data

In [11]:
#!wget -O data/passv2_train_top100.txt.gz --header "X-Ms-Version: 2019-12-12" https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_top100.txt.gz

--2024-03-03 11:56:18--  https://msmarco.z22.web.core.windows.net/msmarcoranking/passv2_train_top100.txt.gz
Resolving msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)... 20.150.34.1
Connecting to msmarco.z22.web.core.windows.net (msmarco.z22.web.core.windows.net)|20.150.34.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 340634991 (325M) [application/x-gzip]
Saving to: ‘data/passv2_train_top100.txt.gz’


2024-03-03 11:58:07 (3.00 MB/s) - ‘data/passv2_train_top100.txt.gz’ saved [340634991/340634991]



In [None]:
## decompress the gz file
#!gzip -d ./data/passv2_train_top100.txt.gz

# Load data

## Load queries data

In [37]:
train_queries_df = pd.read_csv('./data/passv2_train_queries.tsv'
                               , delimiter = "\t" 
                               , header=None
                               , names = ['query_id','query'])

In [38]:
print(train_queries_df.shape)
display(train_queries_df.head())

(277144, 2)


Unnamed: 0,query_id,query
0,121352,define extreme
1,510633,tattoo fixers how much does it cost
2,674172,what is a bank transit number
3,570009,what are the four major groups of elements
4,54528,blood clots in urine after menopause


## [do not use] Load qrels data

In [7]:
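# Note: each row of the file has four tab-separated fields. Because only three
# names are given, pandas uses the first field (the query ID) as the index.
train_qrels_df = pd.read_csv('./data/passv2_train_qrels.tsv'
                             , names = ['0','passage','1']
                             , header = None,delimiter = "\t")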

In [8]:
print(train_qrels_df.shape)
display(train_qrels_df.head())

(284212, 3)


Unnamed: 0,0,file,1
1185869,0,msmarco_passage_08_840101254,1
1183785,0,msmarco_passage_01_444503625,1
695572,0,msmarco_passage_20_461843390,1
852919,0,msmarco_passage_00_837399976,1
637313,0,msmarco_passage_08_12770678,1


## Load each query's top 100 passages' info

In [73]:
train_top100_df = pd.read_csv('./data/passv2_train_top100.txt'
                              , delimiter = " "
                              , names = ['query_id','used','passage_id','rank','score','username'])
                              

In [74]:
print(train_top100_df.shape)
display(train_top100_df.head())

(27713673, 6)


Unnamed: 0,query_id,used,passage_id,rank,score,username
0,5,Q0,msmarco_passage_49_25899182,1,12.1278,Anserini
1,5,Q0,msmarco_passage_06_781809452,2,11.9428,Anserini
2,5,Q0,msmarco_passage_09_146319807,3,11.7703,Anserini
3,5,Q0,msmarco_passage_18_567713921,4,11.5883,Anserini
4,5,Q0,msmarco_passage_30_434058059,5,11.588299,Anserini
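
Loading all 27.7 million rows at once is memory-hungry, and only ranks up to 10 are used later in this notebook. As a sketch of one alternative, the file could be read in chunks and filtered on the fly. The function name `load_top_k` and the chunk size are assumptions for illustration; the column names are the ones used above.

```python
import io
import pandas as pd

TOP100_COLUMNS = ["query_id", "used", "passage_id", "rank", "score", "username"]


def load_top_k(path_or_buffer, k=10, chunksize=1_000_000):
    """Read a space-delimited run file in chunks, keeping rows with rank <= k."""
    kept = []
    reader = pd.read_csv(path_or_buffer, delimiter=" ", names=TOP100_COLUMNS,
                         chunksize=chunksize)
    for chunk in reader:
        kept.append(chunk[chunk["rank"] <= k])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for passv2_train_top100.txt
sample = io.StringIO(
    "5 Q0 msmarco_passage_49_25899182 1 12.1278 Anserini\n"
    "5 Q0 msmarco_passage_06_781809452 2 11.9428 Anserini\n"
    "5 Q0 msmarco_passage_09_146319807 3 11.7703 Anserini\n"
)
top2 = load_top_k(sample, k=2, chunksize=2)
print(top2[["query_id", "passage_id", "rank"]])
```

The trade-off is that ranks 11 to 100 are discarded at load time, so this only fits if they are never needed.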


# Get Train, Test sets by random sampling

## Queries - randomly choose 2000 as Train and Test queries

In [41]:
train_queries_df_sample2000 = train_queries_df.sample(n=2000,random_state=42).reset_index(drop=True)


print(train_queries_df_sample2000.shape)
display(train_queries_df_sample2000.head())

(2000, 2)


Unnamed: 0,query_id,query
0,916247,what us state bears the slogan the land enchan...
1,203324,him functions to the paper health record
2,123916,define merit-based pay
3,54169,bitcoin price increasing
4,766010,what is linguistic chauvinism


## Queries - randomly choose 1000 samples as the Train set and use the remaining 1000 as the Test set

In [42]:
query_train_set = train_queries_df_sample2000.sample(n=1000,random_state=42).reset_index(drop=True)


print(query_train_set.shape)
display(query_train_set.head())

(1000, 2)


Unnamed: 0,query_id,query
0,560129,what are hues
1,712421,what is an aspirator
2,417213,is manic depression genetic
3,225992,how does diesel smell
4,623400,what do huntsman spiders eat


In [43]:
query_test_set = train_queries_df_sample2000[~train_queries_df_sample2000["query_id"].isin(query_train_set["query_id"])]

print(query_test_set.shape)
display(query_test_set.head())

(1000, 2)


Unnamed: 0,query_id,query
0,916247,what us state bears the slogan the land enchan...
1,203324,him functions to the paper health record
3,54169,bitcoin price increasing
4,766010,what is linguistic chauvinism
5,497475,side effects for fluticasone furoate


In [44]:
# sanity check: make sure no Test query is also in the Train set
sorted(set(query_test_set["query_id"]) & set(query_train_set["query_id"]))

[]
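
The same kind of disjoint split can also be sketched by sampling the Train rows and dropping them by index, so the Test set is the complement by construction. The toy DataFrame below is a stand-in for the 2,000 sampled queries, not the real data.

```python
import pandas as pd

# Toy stand-in for the 2,000 sampled queries
queries = pd.DataFrame({"query_id": range(100, 120),
                        "query": [f"query {i}" for i in range(20)]})

train = queries.sample(n=10, random_state=42)
test = queries.drop(train.index)  # everything not drawn for Train

train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

# The two ID sets should have nothing in common
overlap = set(train["query_id"]) & set(test["query_id"])
print(len(train), len(test), overlap)
```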

## Passages info - For Train and Test sets, get each query's top 100 passages' info

In [75]:
train_top100_df.head()

Unnamed: 0,query_id,used,passage_id,rank,score,username
0,5,Q0,msmarco_passage_49_25899182,1,12.1278,Anserini
1,5,Q0,msmarco_passage_06_781809452,2,11.9428,Anserini
2,5,Q0,msmarco_passage_09_146319807,3,11.7703,Anserini
3,5,Q0,msmarco_passage_18_567713921,4,11.5883,Anserini
4,5,Q0,msmarco_passage_30_434058059,5,11.588299,Anserini


In [76]:
query_train_set.head()

Unnamed: 0,query_id,query
0,560129,what are hues
1,712421,what is an aspirator
2,417213,is manic depression genetic
3,225992,how does diesel smell
4,623400,what do huntsman spiders eat


In [77]:
# train
query_train_set_with_top100_passage_info = query_train_set.merge(train_top100_df
                                                          , how = "left"
                                                          , on = "query_id")

print(query_train_set_with_top100_passage_info.shape)
display(query_train_set_with_top100_passage_info.head())

(100000, 7)


Unnamed: 0,query_id,query,used,passage_id,rank,score,username
0,560129,what are hues,Q0,msmarco_passage_03_496902198,1,8.4601,Anserini
1,560129,what are hues,Q0,msmarco_passage_35_561149885,2,8.460099,Anserini
2,560129,what are hues,Q0,msmarco_passage_05_224676265,3,8.1767,Anserini
3,560129,what are hues,Q0,msmarco_passage_04_168335684,4,8.1125,Anserini
4,560129,what are hues,Q0,msmarco_passage_02_769341954,5,8.0855,Anserini


In [78]:
# test
query_test_set_with_top100_passage_info = query_test_set.merge(train_top100_df
                                                          , how = "left"
                                                          , on = "query_id")

print(query_test_set_with_top100_passage_info.shape)
display(query_test_set_with_top100_passage_info.head())

(100000, 7)


Unnamed: 0,query_id,query,used,passage_id,rank,score,username
0,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_05_840839268,1,16.004101,Anserini
1,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_06_203354916,2,15.7155,Anserini
2,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_45_489369159,3,15.715499,Anserini
3,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_50_676325639,4,14.9837,Anserini
4,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_21_464076261,5,14.3472,Anserini


## Filter to keep the top 10 passages for Train and Test sets

In [79]:
query_train_set_with_passage_info = query_train_set_with_top100_passage_info[
                                            query_train_set_with_top100_passage_info["rank"] <= 10]
query_test_set_with_passage_info = query_test_set_with_top100_passage_info[
                                            query_test_set_with_top100_passage_info["rank"] <= 10]

In [80]:
print(query_train_set_with_passage_info.shape)
display(query_train_set_with_passage_info.head())

(10000, 7)


Unnamed: 0,query_id,query,used,passage_id,rank,score,username
0,560129,what are hues,Q0,msmarco_passage_03_496902198,1,8.4601,Anserini
1,560129,what are hues,Q0,msmarco_passage_35_561149885,2,8.460099,Anserini
2,560129,what are hues,Q0,msmarco_passage_05_224676265,3,8.1767,Anserini
3,560129,what are hues,Q0,msmarco_passage_04_168335684,4,8.1125,Anserini
4,560129,what are hues,Q0,msmarco_passage_02_769341954,5,8.0855,Anserini


In [81]:
print(query_test_set_with_passage_info.shape)
display(query_test_set_with_passage_info.head())

(10000, 7)


Unnamed: 0,query_id,query,used,passage_id,rank,score,username
0,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_05_840839268,1,16.004101,Anserini
1,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_06_203354916,2,15.7155,Anserini
2,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_45_489369159,3,15.715499,Anserini
3,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_50_676325639,4,14.9837,Anserini
4,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_21_464076261,5,14.3472,Anserini
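
The row count of 10,000 is consistent with each of the 1,000 queries keeping exactly 10 passages. A more direct check could count passages per query and list any query that falls short. The sketch below uses a toy table and an expected count of 3 purely for illustration.

```python
import pandas as pd

# Toy stand-in for a filtered query/passage table
df = pd.DataFrame({
    "query_id": [1, 1, 1, 2, 2],
    "passage_id": ["a", "b", "c", "d", "e"],
    "rank": [1, 2, 3, 1, 2],
})

expected = 3  # would be 10 for the tables above
per_query = df.groupby("query_id")["passage_id"].size()
short = per_query[per_query < expected]  # queries with too few passages
print(short.to_dict())
```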


## Passage content - for Train and Test

In [82]:
# define a function that can get passage content from corpus
def get_passage(passage_id):
    (string1, string2, bundlenum, position) = passage_id.split('_')
    assert string1 == 'msmarco' and string2 == 'passage'

    with open(f'./data/msmarco_v2_passage/msmarco_passage_{bundlenum}', 'rt', encoding='utf8') as in_fh:
        in_fh.seek(int(position))
        json_string = in_fh.readline()
        passage = json.loads(json_string)
        assert passage['pid'] == passage_id
        return passage

In [83]:
# test to extract passage corpus data - try one example with passage_id (pid) 
get_passage("msmarco_passage_05_840839268")

{'pid': 'msmarco_passage_05_840839268',
 'passage': 'New Mexico State Symbols. State Nickname: The Land of Enchantment. State Slogan: Land of Enchantment; also on its license plate. State Motto: Crescit eundo (It grows as it goes) State flower: Yucca flower. State bird: Roadrunner aka Greater Roadrunner.',
 'spans': '(610,634),(635,674),(675,735),(736,784),(785,811),(812,857)',
 'docid': 'msmarco_doc_05_1547437048'}
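
The lookup above relies on the trailing number of a passage ID being the byte offset of that passage's JSON record inside its bundle file, which is what the `seek` call uses. The following self-contained sketch builds a tiny bundle with the same naming scheme and reads one record back by offset. The file contents and the helper `read_at` are made up for illustration, and opening the file in binary mode is a choice made here so that offsets are plainly bytes.

```python
import json
import os
import tempfile

# Build a tiny JSON-lines "bundle" whose pids encode each record's byte offset,
# mimicking the msmarco_passage_<bundle>_<offset> naming scheme.
records = ["first passage", "second passage", "third passage"]
bundle_path = os.path.join(tempfile.mkdtemp(), "msmarco_passage_00")

offsets = []
with open(bundle_path, "wb") as fh:
    for text in records:
        offset = fh.tell()
        pid = f"msmarco_passage_00_{offset}"
        fh.write((json.dumps({"pid": pid, "passage": text}) + "\n").encode("utf8"))
        offsets.append(offset)


def read_at(path, offset):
    """Jump straight to a byte offset and parse the JSON record that starts there."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return json.loads(fh.readline().decode("utf8"))


record = read_at(bundle_path, offsets[2])
print(record["pid"], "->", record["passage"])
```

Because the offset points at the start of a line, a single `readline` returns exactly one record without scanning the rest of the file.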

In [88]:
# put train, test's passage_id into lists
train_passage_id = query_train_set_with_passage_info["passage_id"].to_list()
test_passage_id = query_test_set_with_passage_info["passage_id"].to_list()


In [89]:
print(len(train_passage_id))
print(len(test_passage_id))
train_passage_id

10000
10000


['msmarco_passage_03_496902198',
 'msmarco_passage_35_561149885',
 'msmarco_passage_05_224676265',
 'msmarco_passage_04_168335684',
 'msmarco_passage_02_769341954',
 'msmarco_passage_08_307460138',
 'msmarco_passage_53_265468378',
 'msmarco_passage_08_307476102',
 'msmarco_passage_54_724480825',
 'msmarco_passage_54_724489016',
 'msmarco_passage_24_198341005',
 'msmarco_passage_10_623634187',
 'msmarco_passage_66_618053314',
 'msmarco_passage_22_436715579',
 'msmarco_passage_53_621908725',
 'msmarco_passage_12_566610075',
 'msmarco_passage_35_131764855',
 'msmarco_passage_32_423035030',
 'msmarco_passage_41_517810876',
 'msmarco_passage_39_430043247',
 'msmarco_passage_03_851565752',
 'msmarco_passage_37_240503412',
 'msmarco_passage_57_679994116',
 'msmarco_passage_08_84524184',
 'msmarco_passage_53_588473485',
 'msmarco_passage_30_279571167',
 'msmarco_passage_27_496592111',
 'msmarco_passage_36_629107319',
 'msmarco_passage_02_813440272',
 'msmarco_passage_02_813505588',
 'msmarco_p

In [91]:
# For each passage_id in list_passage_id, look up its passage content and store it in a dict.
# Because dict keys are unique, a passage_id that appears for several queries is stored only once.
def extract_passage_content_using_pid(list_passage_id):
    dict_passage_id_content = dict()

    for passage_id in list_passage_id:
        passage_dict = get_passage(passage_id)
        dict_passage_id_content[passage_id] = passage_dict["passage"]
    print(f'Found {len(dict_passage_id_content)} passage number.')
    return dict_passage_id_content

In [92]:
dict_train_passage_id_content = extract_passage_content_using_pid(train_passage_id)
dict_test_passage_id_content = extract_passage_content_using_pid(test_passage_id)

Found 9981 passage number.
Found 9969 passage number.
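
The counts are below 10,000 because the results are stored in a dict keyed by passage ID, so an ID retrieved for more than one query is kept only once. The sketch below counts such repeats and also groups IDs by bundle, which would allow each bundle file to be opened once rather than once per passage. `group_ids_by_bundle` is a hypothetical helper, and the short ID list is illustrative.

```python
from collections import defaultdict


def group_ids_by_bundle(passage_ids):
    """Map bundle number -> sorted unique byte offsets found in passage_ids."""
    bundles = defaultdict(set)
    for passage_id in passage_ids:
        _, _, bundle, position = passage_id.split("_")
        bundles[bundle].add(int(position))
    return {bundle: sorted(positions) for bundle, positions in bundles.items()}


ids = [
    "msmarco_passage_08_307476102",
    "msmarco_passage_08_307460138",
    "msmarco_passage_53_265468378",
    "msmarco_passage_08_307460138",  # repeated ID, as happens across queries
]
grouped = group_ids_by_bundle(ids)
print(len(ids) - len(set(ids)), "duplicate ID(s)")
print(grouped)
```

With the IDs grouped this way, a loop over `grouped` could open each bundle once and visit its offsets in ascending order.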


In [93]:
dict_train_passage_id_content

{'msmarco_passage_03_496902198': 'Let’s dig a little deeper into each. Hues are colors and what hue we see is dependent on the wavelength of light being reflected or produced. I doubt I need to tell you what a color is and since color and hue are synonymous you should know what a hue is as well.',
 'msmarco_passage_35_561149885': 'Let’s dig a little deeper into each. Hues are colors and what hue we see is dependent on the wavelength of light being reflected or produced. I doubt I need to tell you what a color is and since color and hue are synonymous you should know what a hue is as well.',
 'msmarco_passage_05_224676265': 'Hue: This is what we usually mean when we ask "what color is   that?". The property of color that we are actually asking about is "hue". For example, when we talk about colors that are red, yellow, green, and blue, we are   talking about hue. Different hues are caused by different wavelengths of light.',
 'msmarco_passage_04_168335684': "hue = color or a shade of co

In [125]:
# change dict to df
train_passage_id_content = pd.DataFrame(dict_train_passage_id_content.items()
                                         ,columns = ["passage_id", "passage"])
test_passage_id_content = pd.DataFrame(dict_test_passage_id_content.items()
                                         ,columns = ["passage_id", "passage"])

In [129]:
print(train_passage_id_content.shape)
print(test_passage_id_content.shape)
display(train_passage_id_content.head())
display(test_passage_id_content.head())

(9981, 2)
(9969, 2)


Unnamed: 0,passage_id,passage
0,msmarco_passage_03_496902198,Let’s dig a little deeper into each. Hues are ...
1,msmarco_passage_35_561149885,Let’s dig a little deeper into each. Hues are ...
2,msmarco_passage_05_224676265,Hue: This is what we usually mean when we ask ...
3,msmarco_passage_04_168335684,hue = color or a shade of color\nexample sente...
4,msmarco_passage_02_769341954,Hue: Hue is what we normally think of as color...


Unnamed: 0,passage_id,passage
0,msmarco_passage_05_840839268,New Mexico State Symbols. State Nickname: The ...
1,msmarco_passage_06_203354916,Land of Enchantment. Before Land of Enchantmen...
2,msmarco_passage_45_489369159,Land of Enchantment. Before Land of Enchantmen...
3,msmarco_passage_50_676325639,New Mexico State Slogans. Whereas the New Mexi...
4,msmarco_passage_21_464076261,"1 to approximately 99-000. First use of the ""L..."


# Text cleaning

## Clean queries for Train and Test

In [98]:
# reference: https://www.analyticsvidhya.com/blog/2020/08/information-retrieval-using-word2vec-based-vector-space-model/

# Dictionary of english Contractions
contractions_dict = { "ain't": "are not","'s":" is","aren't": "are not","can't": "can not","can't've": "cannot have",
"'cause": "because","could've": "could have","couldn't": "could not","couldn't've": "could not have",
"didn't": "did not","doesn't": "does not","don't": "do not","hadn't": "had not","hadn't've": "had not have",
"hasn't": "has not","haven't": "have not","he'd": "he would","he'd've": "he would have","he'll": "he will",
"he'll've": "he will have","how'd": "how did","how'd'y": "how do you","how'll": "how will","i'd": "i would",
"i'd've": "i would have","i'll": "i will","i'll've": "i will have","i'm": "i am","i've": "i have",
"isn't": "is not","it'd": "it would","it'd've": "it would have","it'll": "it will","it'll've": "it will have",
"let's": "let us","ma'am": "madam","mayn't": "may not","might've": "might have","mightn't": "might not",
"mightn't've": "might not have","must've": "must have","mustn't": "must not","mustn't've": "must not have",
"needn't": "need not","needn't've": "need not have","o'clock": "of the clock","oughtn't": "ought not",
"oughtn't've": "ought not have","shan't": "shall not","sha'n't": "shall not",
"shan't've": "shall not have","she'd": "she would","she'd've": "she would have","she'll": "she will",
"she'll've": "she will have","should've": "should have","shouldn't": "should not",
"shouldn't've": "should not have","so've": "so have","that'd": "that would","that'd've": "that would have",
"there'd": "there would","there'd've": "there would have",
"they'd": "they would","they'd've": "they would have","they'll": "they will","they'll've": "they will have",
"they're": "they are","they've": "they have","to've": "to have","wasn't": "was not","we'd": "we would",
"we'd've": "we would have","we'll": "we will","we'll've": "we will have","we're": "we are","we've": "we have",
"weren't": "were not","what'll": "what will","what'll've": "what will have","what're": "what are",
"what've": "what have","when've": "when have","where'd": "where did",
"where've": "where have","who'll": "who will","who'll've": "who will have","who've": "who have",
"why've": "why have","will've": "will have","won't": "will not","won't've": "will not have",
"would've": "would have","wouldn't": "would not","wouldn't've": "would not have","y'all": "you all",
"y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
"you'd": "you would","you'd've": "you would have","you'll": "you will","you'll've": "you will have",
"you're": "you are","you've": "you have"}

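# Regular expression for finding contractions.
# Longer keys come first so that e.g. "can't've" is matched before its prefix "can't".
contractions_re = re.compile('(%s)' % '|'.join(
    re.escape(key) for key in sorted(contractions_dict, key=len, reverse=True)))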

# Function for expanding contractions
def expand_contractions(text,contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, text)
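
One thing to watch for: the dictionary keys use the straight apostrophe (`'`), while passages such as "Let’s dig a little deeper" use the typographic one (`’`), so those contractions pass through unexpanded. A minimal sketch of normalizing apostrophes before expansion follows. It uses a small stand-in dictionary, and `demo_contractions`, `normalize_apostrophes`, and `expand_demo` are hypothetical names.

```python
import re

# Small stand-in for contractions_dict
demo_contractions = {"let's": "let us", "can't": "can not", "can't've": "cannot have"}

# Longest keys first so "can't've" is tried before its prefix "can't"
demo_re = re.compile(
    "|".join(re.escape(key) for key in sorted(demo_contractions, key=len, reverse=True))
)


def normalize_apostrophes(text):
    """Replace typographic apostrophes (U+2019, U+2018) with the ASCII one."""
    return text.replace("\u2019", "'").replace("\u2018", "'")


def expand_demo(text):
    text = normalize_apostrophes(text.lower())
    return demo_re.sub(lambda match: demo_contractions[match.group(0)], text)


print(expand_demo("Let\u2019s dig deeper; we can't've missed it."))
# prints: let us dig deeper; we cannot have missed it.
```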

In [94]:
# Function for cleaning text:
# remove words containing digits, replace newline characters with a space, remove URLs,
# and replace every character that is not a lowercase English letter with a space.
# Expects text that has already been lowercased.

def clean_text(text):
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^a-z]', ' ', text)
    return text
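
To make the four substitutions concrete, here is a small worked example on a made-up string. The function is repeated so the block runs on its own, and the final space-collapsing step mirrors the one applied later in the cleaning functions. Note that the last pattern keeps only `a` to `z`, so the input is assumed to be lowercased already.

```python
import re


def clean_text(text):
    text = re.sub(r'\w*\d\w*', '', text)   # drop tokens that contain a digit
    text = re.sub(r'\n', ' ', text)        # newlines become spaces
    text = re.sub(r'http\S+', '', text)    # drop URLs
    text = re.sub(r'[^a-z]', ' ', text)    # anything but a-z becomes a space
    return text


raw = "visit http://example.com now, call 555 or 3d models\nplease"
cleaned = re.sub(' +', ' ', clean_text(raw)).strip()
print(cleaned)
# prints: visit now call or models please
```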

In [99]:
def clean_query_df(query_df, new_col, orig_col):
    # Lowercasing the text
    query_df[new_col] = query_df[orig_col].apply(lambda x:x.lower())

    # Expanding contractions
    query_df[new_col]=query_df[new_col].apply(lambda x:expand_contractions(x))


    # Cleaning queries using RegEx
    query_df[new_col]= query_df[new_col].apply(lambda x: clean_text(x))


    # Removing extra spaces
    query_df[new_col] = query_df[new_col].apply(lambda x: re.sub(' +',' ',x))

    return query_df


In [100]:
query_train_set_with_passage_info.head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username
0,560129,what are hues,Q0,msmarco_passage_03_496902198,1,8.4601,Anserini
1,560129,what are hues,Q0,msmarco_passage_35_561149885,2,8.460099,Anserini
2,560129,what are hues,Q0,msmarco_passage_05_224676265,3,8.1767,Anserini
3,560129,what are hues,Q0,msmarco_passage_04_168335684,4,8.1125,Anserini
4,560129,what are hues,Q0,msmarco_passage_02_769341954,5,8.0855,Anserini


In [104]:
query_test_set_with_passage_info.head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,query_cleaned
0,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_05_840839268,1,16.004101,Anserini,what us state bears the slogan the land enchan...
1,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_06_203354916,2,15.7155,Anserini,what us state bears the slogan the land enchan...
2,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_45_489369159,3,15.715499,Anserini,what us state bears the slogan the land enchan...
3,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_50_676325639,4,14.9837,Anserini,what us state bears the slogan the land enchan...
4,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_21_464076261,5,14.3472,Anserini,what us state bears the slogan the land enchan...


In [101]:
query_train_set_with_passage_info_cleaned = clean_query_df(query_train_set_with_passage_info
                                                    , "query_cleaned"
                                                    , "query")

query_test_set_with_passage_info_cleaned = clean_query_df(query_test_set_with_passage_info
                                                    , "query_cleaned"
                                                    , "query")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_df[new_col] = query_df[orig_col].apply(lambda x:x.lower())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_df[new_col]=query_df[new_col].apply(lambda x:expand_contractions(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  query_df[new_col]= query_df[new_col].apply(lambda x: clean_text(x
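
The warning above most likely appears because the frames passed to `clean_query_df` were produced by boolean filtering of a larger DataFrame, so pandas cannot tell whether the new column is meant for the original or for the filtered result. One common way to avoid it is to take an explicit `.copy()` right after filtering. The sketch below shows the pattern on a toy frame whose names are illustrative.

```python
import pandas as pd

scores = pd.DataFrame({"query": ["What Are Hues", "Define Extreme", "Bitcoin Price"],
                       "rank": [1, 11, 2]})

# .copy() makes the filtered frame independent of `scores`,
# so adding a column to it is unambiguous.
top = scores[scores["rank"] <= 10].copy()
top["query_cleaned"] = top["query"].str.lower()

print(top)
print("query_cleaned" in scores.columns)
```

The original frame is left untouched, and the new column exists only on the copy.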

In [102]:
query_train_set_with_passage_info_cleaned.head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,query_cleaned
0,560129,what are hues,Q0,msmarco_passage_03_496902198,1,8.4601,Anserini,what are hues
1,560129,what are hues,Q0,msmarco_passage_35_561149885,2,8.460099,Anserini,what are hues
2,560129,what are hues,Q0,msmarco_passage_05_224676265,3,8.1767,Anserini,what are hues
3,560129,what are hues,Q0,msmarco_passage_04_168335684,4,8.1125,Anserini,what are hues
4,560129,what are hues,Q0,msmarco_passage_02_769341954,5,8.0855,Anserini,what are hues


In [103]:
query_test_set_with_passage_info_cleaned.head()

Unnamed: 0,query_id,query,used,passage_id,rank,score,username,query_cleaned
0,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_05_840839268,1,16.004101,Anserini,what us state bears the slogan the land enchan...
1,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_06_203354916,2,15.7155,Anserini,what us state bears the slogan the land enchan...
2,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_45_489369159,3,15.715499,Anserini,what us state bears the slogan the land enchan...
3,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_50_676325639,4,14.9837,Anserini,what us state bears the slogan the land enchan...
4,916247,what us state bears the slogan the land enchan...,Q0,msmarco_passage_21_464076261,5,14.3472,Anserini,what us state bears the slogan the land enchan...


## Clean passages for Train and Test

In [107]:
def clean_corpus(corpus_df, new_col, orig_col):

    # Lowercasing the text
    corpus_df[new_col] = corpus_df[orig_col].apply(lambda x:x.lower())


    # Expanding Contractions
    corpus_df[new_col] = corpus_df[new_col].apply(lambda x:expand_contractions(x))


    # Cleaning corpus using RegEx
    corpus_df[new_col] = corpus_df[new_col].apply(lambda x: clean_text(x))


    # Removing extra spaces
    corpus_df[new_col] = corpus_df[new_col].apply(lambda x: re.sub(' +',' ',x))


    # Stopwords removal & Lemmatizing tokens using SpaCy

    nlp = spacy.load('en_core_web_sm',disable=['ner','parser'])
    nlp.max_length=5000000

    # Removing stopwords and lemmatizing words
    corpus_df[new_col + '_lemmatized'] = corpus_df[new_col].progress_apply(
        lambda x: ' '.join(token.lemma_ for token in nlp(x) if not token.is_stop))
    return corpus_df

In [130]:
display(train_passage_id_content.head())

Unnamed: 0,passage_id,passage
0,msmarco_passage_03_496902198,Let’s dig a little deeper into each. Hues are ...
1,msmarco_passage_35_561149885,Let’s dig a little deeper into each. Hues are ...
2,msmarco_passage_05_224676265,Hue: This is what we usually mean when we ask ...
3,msmarco_passage_04_168335684,hue = color or a shade of color\nexample sente...
4,msmarco_passage_02_769341954,Hue: Hue is what we normally think of as color...


In [135]:
train_passage_id_content_cleaned = clean_corpus(train_passage_id_content
             , "passage_cleaned", "passage")

  0%|          | 0/9981 [00:00<?, ?it/s]

In [136]:

test_passage_id_content_cleaned = clean_corpus(test_passage_id_content
             , "passage_cleaned", "passage")


  0%|          | 0/9969 [00:00<?, ?it/s]

In [137]:
train_passage_id_content_cleaned.head()

Unnamed: 0,passage_id,passage,passage_cleaned,passage_cleaned_lemmatized
0,msmarco_passage_03_496902198,Let’s dig a little deeper into each. Hues are ...,let s dig a little deeper into each hues are c...,let s dig little deeply hue color hue dependen...
1,msmarco_passage_35_561149885,Let’s dig a little deeper into each. Hues are ...,let s dig a little deeper into each hues are c...,let s dig little deeply hue color hue dependen...
2,msmarco_passage_05_224676265,Hue: This is what we usually mean when we ask ...,hue this is what we usually mean when we ask w...,hue usually mean ask color property color actu...
3,msmarco_passage_04_168335684,hue = color or a shade of color\nexample sente...,hue color or a shade of color example sentence...,hue color shade color example sentence baby sk...
4,msmarco_passage_02_769341954,Hue: Hue is what we normally think of as color...,hue hue is what we normally think of as color ...,hue hue normally think color technically hue d...


In [138]:
test_passage_id_content_cleaned.head()

Unnamed: 0,passage_id,passage,passage_cleaned,passage_cleaned_lemmatized
0,msmarco_passage_05_840839268,New Mexico State Symbols. State Nickname: The ...,new mexico state symbols state nickname the la...,new mexico state symbols state nickname land e...
1,msmarco_passage_06_203354916,Land of Enchantment. Before Land of Enchantmen...,land of enchantment before land of enchantment...,land enchantment land enchantment state slogan...
2,msmarco_passage_45_489369159,Land of Enchantment. Before Land of Enchantmen...,land of enchantment before land of enchantment...,land enchantment land enchantment state slogan...
3,msmarco_passage_50_676325639,New Mexico State Slogans. Whereas the New Mexi...,new mexico state slogans whereas the new mexic...,new mexico state slogan new mexico state motto...
4,msmarco_passage_21_464076261,"1 to approximately 99-000. First use of the ""L...",to approximately first use of the land of enc...,approximately use land enchantment slogan em...


# Save processed data

In [141]:
# query
query_train_set_with_passage_info.to_csv("./output/query_train_set_with_passage_info.csv"
                                       , index = False)
query_test_set_with_passage_info.to_csv("./output/query_test_set_with_passage_info.csv"
                                       , index = False)

# passage
train_passage_id_content_cleaned.to_csv("./output/train_passage_id_content_cleaned.csv"
                                       , index = False)
test_passage_id_content_cleaned.to_csv("./output/test_passage_id_content_cleaned.csv"
                                       , index = False)
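
As a quick check that the saved files fit together, a downstream notebook could reload a query file and a passage file and join them on `passage_id`. The sketch below does this round trip with toy data in a temporary directory; the file names and contents are stand-ins, not the real outputs.

```python
import os
import tempfile
import pandas as pd

out_dir = tempfile.mkdtemp()  # stand-in for ./output

query_info = pd.DataFrame({
    "query_id": [560129, 560129],
    "query": ["what are hues", "what are hues"],
    "passage_id": ["msmarco_passage_03_496902198", "msmarco_passage_35_561149885"],
    "rank": [1, 2],
})
passages = pd.DataFrame({
    "passage_id": ["msmarco_passage_03_496902198", "msmarco_passage_35_561149885"],
    "passage_cleaned": ["hues are colors", "hues are colors"],
})

query_path = os.path.join(out_dir, "query_info.csv")
passage_path = os.path.join(out_dir, "passages.csv")
query_info.to_csv(query_path, index=False)
passages.to_csv(passage_path, index=False)

# Reload and attach passage text to each (query, passage) pair
reloaded = pd.read_csv(query_path).merge(pd.read_csv(passage_path),
                                         how="left", on="passage_id")
print(reloaded.shape)
print(reloaded["passage_cleaned"].isna().sum(), "rows without passage text")
```

Note that `to_csv` does not create missing directories, so `./output` has to exist before the save cell runs, for example via `os.makedirs("./output", exist_ok=True)`.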