# Data Preprocessing

In this notebook we preprocess and prepare all the datasets we are going to need for the next steps.

In particular, we want to drop exact duplicate rows and trim leading and trailing whitespace from both the questions and the contexts of the full dataset.
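As a minimal illustration of what trimming does, the sketch below uses plain Python `str.strip`, which is what pandas' `.str.strip()` applies to each value when called without arguments. It removes leading and trailing whitespace only and leaves internal spacing untouched. The sample strings are made up for this example and are not taken from the dataset.

```python
# Hypothetical sample values; the notebook applies the same operation
# column-wise with pandas' `.str.strip()`.
raw_questions = [
    "  What is the capital of France?\n",
    "Who wrote  Hamlet? ",
]

# Strip leading and trailing whitespace from every value.
trimmed = [question.strip() for question in raw_questions]

for before, after in zip(raw_questions, trimmed):
    print(repr(before), "->", repr(after))
```

Note that the double space inside the second string survives, since only the ends of each string are trimmed.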

Additionally, we want to save all unique documents (each with a unique Id) in a separate file to use later for retrieval.

Finally, we want to generate a test set that we will use to benchmark the different retrieval algorithms. As we saw in our data analysis, 5 questions are duplicated (each occurrence with a different context), and some contexts are shared by several questions. This matters for the sampling mechanism: to prevent sampling bias, we split on unique contexts rather than on individual rows, so that no context appears in both the training and the test set.
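One way to account for shared contexts when sampling is to split on unique context Ids instead of on rows. The sketch below shows the idea with the standard library only. It is not the notebook's implementation (which relies on scikit-learn's `train_test_split`), and the function name `split_by_context` and the toy rows are hypothetical.

```python
import random


def split_by_context(rows, test_fraction=0.2, seed=0):
    """Split rows so that every context_id ends up in exactly one set."""
    # Deduplicate the context ids; sort first so the shuffle is reproducible.
    context_ids = sorted({row["context_id"] for row in rows})
    random.Random(seed).shuffle(context_ids)

    # Reserve a fraction of the unique contexts (at least one) for testing.
    n_test = max(1, round(len(context_ids) * test_fraction))
    test_ids = set(context_ids[:n_test])

    train = [row for row in rows if row["context_id"] not in test_ids]
    test = [row for row in rows if row["context_id"] in test_ids]
    return train, test


# Toy data: contexts "c1" and "c2" are each shared by two questions.
rows = [
    {"question_id": "q1", "context_id": "c1"},
    {"question_id": "q2", "context_id": "c1"},
    {"question_id": "q3", "context_id": "c2"},
    {"question_id": "q4", "context_id": "c2"},
    {"question_id": "q5", "context_id": "c3"},
    {"question_id": "q6", "context_id": "c4"},
    {"question_id": "q7", "context_id": "c5"},
]

train, test = split_by_context(rows, test_fraction=0.4, seed=0)
train_contexts = {row["context_id"] for row in train}
test_contexts = {row["context_id"] for row in test}
print(len(train_contexts), len(test_contexts), train_contexts & test_contexts)
```

Because whole contexts are assigned to one side, all questions that share a context travel together, and the row-level split ratio can differ slightly from the context-level ratio.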

By the end of the notebook, we should have generated four files:
- `query_context_pairs.csv` (dataset after applying text preprocessing)
- `documents.csv` (dataset with unique documents)
- `train_query_context_pairs.csv` (dataset for training/val)
- `test_query_context_pairs.csv` (dataset for testing)
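
A small sanity check for the files listed above could look like the following sketch. The helper name `missing_outputs` is hypothetical, and the check only assumes that all files are written to a single directory.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "query_context_pairs.csv",
    "documents.csv",
    "train_query_context_pairs.csv",
    "test_query_context_pairs.csv",
]


def missing_outputs(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]
```

Calling `missing_outputs(data_path)` at the end of the notebook should return an empty list once every cell has run.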

In [1]:
import hashlib
import pandas as pd

from pathlib import Path
from sklearn.model_selection import train_test_split

In [2]:
DOCUMENTS_FILENAME = "documents.csv"
DATASET_PATH = "query_context_pairs.csv"
TRAIN_DATASET_PATH = "train_query_context_pairs.csv"
TEST_DATASET_PATH = "test_query_context_pairs.csv"

## Load the dataset

In [3]:
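# The "data" folder is a sibling of the current working directory.
data_path = Path.cwd().resolve().parent / "data"
dataset_path = data_path / "ds_nlp_challenge.csv"
# The first CSV column holds the row index.
dataset = pd.read_csv(dataset_path, index_col=0)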
print("Length of the dataset: ", len(dataset))

dataset = dataset.drop_duplicates()
print("Length of the dataset [after removing duplicates]: ", len(dataset))

dataset["question"] = dataset["question"].str.strip()
dataset["context"] = dataset["context"].str.strip()

dataset.head()

Length of the dataset:  20000
Length of the dataset [after removing duplicates]:  19988


Unnamed: 0,question,context
0,Do European Leagues sell their television righ...,The Premier League sells its television rights...
1,"What does the Catholic church considered ""mixe...",Between the third and fourth sessions the pope...
2,What are some of the practices Gautama underwe...,Gautama first went to study with famous religi...
3,How many band members wrote Queen's One Vision?,"The band, now revitalised by the response to L..."
4,When did the federation have to be implemented...,"After Nasser died in November 1970, his succes..."


## Generate document and question Ids

In [4]:
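def generate_document_id(content):
    """Return the MD5 hex digest of the text, used as a deterministic content-based Id."""
    return hashlib.md5(content.encode('utf-8')).hexdigest()

# Identical texts get identical Ids, so duplicated questions or contexts share one Id.
dataset["question_id"] = dataset["question"].apply(generate_document_id)
dataset["context_id"] = dataset["context"].apply(generate_document_id)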
dataset.head()

Unnamed: 0,question,context,question_id,context_id
0,Do European Leagues sell their television righ...,The Premier League sells its television rights...,c3d337ab68dfd285f559ebc0daf65125,98d6e3c8d58561cff931f63fb4e64c1c
1,"What does the Catholic church considered ""mixe...",Between the third and fourth sessions the pope...,b03fc4a34dda7d1cfb4e9640aec30d39,4bcb9c7951bfad7dc475dc0a8364b86d
2,What are some of the practices Gautama underwe...,Gautama first went to study with famous religi...,12397967175462937d614d633fb6a8b0,70e382792af20b3772cce2520b45da5e
3,How many band members wrote Queen's One Vision?,"The band, now revitalised by the response to L...",77307b73836b34d59721596af121cd2f,5e450e68f649328e03bace871b873fee
4,When did the federation have to be implemented...,"After Nasser died in November 1970, his succes...",1a6850f01a96afabb91c24cdb3edcdee,0409ff54cef43157e6e7c88803e8590a
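
The following sketch illustrates two properties of these content-based Ids that the rest of the notebook relies on: identical text always maps to the same Id, and any difference in the text (including surrounding whitespace) produces a different Id, which is why trimming happens before hashing. The function body mirrors the cell above, and the padded string is a made-up variation for illustration.

```python
import hashlib


def generate_document_id(content):
    # Same hashing scheme as the notebook: MD5 of the UTF-8 encoded text.
    return hashlib.md5(content.encode("utf-8")).hexdigest()


same_a = generate_document_id("Does God have a gender?")
same_b = generate_document_id("Does God have a gender?")
padded = generate_document_id(" Does God have a gender? ")

print(same_a == same_b)  # identical text gives an identical Id
print(same_a == padded)  # surrounding whitespace changes the Id
print(len(same_a))       # length of the hex digest
```

MD5 is used here only as a fingerprint for deduplication, not for any security purpose.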


## Save processed dataset

In [5]:
new_dataset_path = data_path / DATASET_PATH
dataset.to_csv(new_dataset_path, index=False)

## Save dataset with unique documents

In [6]:
documents = (
    dataset
    .loc[:, ["context_id", "context"]]
    .drop_duplicates()
    .rename(columns={"context_id": "document_id", "context": "document_content"})
)
print("Total number of unique documents: ", len(documents))
documents.head()

Total number of unique documents:  12761


Unnamed: 0,document_id,document_content
0,98d6e3c8d58561cff931f63fb4e64c1c,The Premier League sells its television rights...
1,4bcb9c7951bfad7dc475dc0a8364b86d,Between the third and fourth sessions the pope...
2,70e382792af20b3772cce2520b45da5e,Gautama first went to study with famous religi...
3,5e450e68f649328e03bace871b873fee,"The band, now revitalised by the response to L..."
4,0409ff54cef43157e6e7c88803e8590a,"After Nasser died in November 1970, his succes..."


In [7]:
documents_path = data_path / DOCUMENTS_FILENAME
documents.to_csv(documents_path, index=False)

## Split into training and test set

In [8]:
unique_context_ids = dataset.context_id.unique()
unique_context_ids.shape

(12761,)

In [9]:
train_ids, test_ids = train_test_split(unique_context_ids, random_state=0, test_size=0.2)
print(train_ids.shape, test_ids.shape)

train_data = dataset[dataset.context_id.isin(train_ids)]
test_data = dataset[dataset.context_id.isin(test_ids)]

print(train_data.shape, test_data.shape)

(10208,) (2553,)
(16035, 4) (3953, 4)


In [10]:
# The train and test sets must not share any context, so this should be 0.
len(set(train_data.context_id).intersection(test_data.context_id))

0
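
Since a few questions are duplicated with different contexts, a context-level split does not by itself guarantee that the same question text is absent from one of the two sets. A hedged sketch of a more general overlap check, using toy rows and a hypothetical helper `shared_ids`:

```python
def shared_ids(train_rows, test_rows, key):
    """Return the values of `key` that occur in both row collections."""
    train_values = {row[key] for row in train_rows}
    test_values = {row[key] for row in test_rows}
    return train_values & test_values


# Toy rows: question "q1" appears in both sets, each time with a different context.
toy_train = [
    {"question_id": "q1", "context_id": "c1"},
    {"question_id": "q2", "context_id": "c2"},
]
toy_test = [
    {"question_id": "q1", "context_id": "c3"},
    {"question_id": "q3", "context_id": "c4"},
]

print(shared_ids(toy_train, toy_test, "context_id"))
print(shared_ids(toy_train, toy_test, "question_id"))
```

With the notebook's DataFrames, the equivalent check would be `set(train_data.question_id) & set(test_data.question_id)`.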

In [11]:
train_data.head()

Unnamed: 0,question,context,question_id,context_id
0,Do European Leagues sell their television righ...,The Premier League sells its television rights...,c3d337ab68dfd285f559ebc0daf65125,98d6e3c8d58561cff931f63fb4e64c1c
1,"What does the Catholic church considered ""mixe...",Between the third and fourth sessions the pope...,b03fc4a34dda7d1cfb4e9640aec30d39,4bcb9c7951bfad7dc475dc0a8364b86d
2,What are some of the practices Gautama underwe...,Gautama first went to study with famous religi...,12397967175462937d614d633fb6a8b0,70e382792af20b3772cce2520b45da5e
3,How many band members wrote Queen's One Vision?,"The band, now revitalised by the response to L...",77307b73836b34d59721596af121cd2f,5e450e68f649328e03bace871b873fee
4,When did the federation have to be implemented...,"After Nasser died in November 1970, his succes...",1a6850f01a96afabb91c24cdb3edcdee,0409ff54cef43157e6e7c88803e8590a


In [12]:
test_data.head()

Unnamed: 0,question,context,question_id,context_id
9,Which Japanese carrier survived the first wave...,With the Japanese CAP out of position and the ...,00d9f4e0b8a0c4654bc202f7be8cb243,05bb76ec5fcf0a6b793b5bf3607fdbb9
14,What are anarchists not against?,Anarchists are against the State but are not a...,9ec4aab7fcf6a247539c9f1e3094f37b,15f04e0814d60718a465c034077269dd
16,How did troops react to the missile?,"On 1 May, The Sun claimed to have 'sponsored' ...",d1d6e5c2409d498f00c690e14669e5d0,23d120965eaf2a679c23d7d88eb3fe1a
18,Who primarily occupies the complexes surroundi...,Ann Arbor's residential neighborhoods contain ...,6e99378749ee466ce63713c931d3f740,eb70588ed7264bcd249d6705e1b90547
33,Does God have a gender?,"In monotheism and henotheism, God is conceived...",e6908fed63cb1c828f422793452a5806,1d644948160ff5feb5f6bfae66d8525e


## Save training and test sets

In [13]:
train_set_path = data_path / TRAIN_DATASET_PATH
train_data.to_csv(train_set_path, index=False)

In [14]:
test_set_path = data_path / TEST_DATASET_PATH
test_data.to_csv(test_set_path, index=False)