# Introduction

This notebook is dedicated to preparing and organizing the RAG-12000 dataset for downstream tasks such as training, hyperparameter tuning, and evaluation.

We begin by loading and cleaning the dataset, removing null entries and duplicates, and filtering meta-questions. Then we split the dataset into three main parts:

- **Fine-tuning Set** (70%)
- **Hyperparameter Tuning Set** (15%)
- **Test Set** (15%)

#### 1. Dataset Loading and checking for missing entries/duplicates

In [1]:
!git clone https://huggingface.co/datasets/neural-bridge/rag-dataset-12000

Cloning into 'rag-dataset-12000'...
remote: Enumerating objects: 50, done.
remote: Total 50 (delta 0), reused 0 (delta 0), pack-reused 50 (from 1)
Unpacking objects: 100% (50/50), 15.55 KiB | 884.00 KiB/s, done.


In [2]:
import pandas as pd

test_data=pd.read_parquet("/kaggle/working/rag-dataset-12000/data/test-00000-of-00001-af2a9f454ad1b8a3.parquet")
train_data=pd.read_parquet("/kaggle/working/rag-dataset-12000/data/train-00000-of-00001-9df3a936e1f63191.parquet")
all_data=pd.concat([train_data,test_data])

In [3]:
# Count missing (None/NaN) values in each column
print(f" Context has {all_data['context'].isna().sum()}  none values\n Question has {all_data['question'].isna().sum()} none values\n Answer has {all_data['answer'].isna().sum()} none values")

 Context has 0  none values
 Question has 3 none values
 Answer has 3 none values


In [4]:
cleaned_data=all_data.dropna()

In [5]:
# Count repeated values in each column (total values minus unique values)
print(f" Context has {len(cleaned_data['context'].values)-len(cleaned_data['context'].unique())}  duplicates \n Question has {len(cleaned_data['question'].values)-len(cleaned_data['question'].unique())} duplicates\n Answer has {len(cleaned_data['answer'].values)-len(cleaned_data['answer'].unique())} duplicates")

 Context has 0  duplicates 
 Question has 14 duplicates
 Answer has 0 duplicates
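
The same counts can also be read off with `Series.duplicated()`, which flags every repeat after the first occurrence. A minimal sketch on a toy frame (the rows below are hypothetical stand-ins for `cleaned_data`, not values from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for cleaned_data.
toy = pd.DataFrame({
    "context": ["c1", "c2", "c3", "c4"],
    "question": ["What is the context about?", "Who won?",
                 "What is the context about?", "Who won?"],
    "answer": ["a1", "a2", "a3", "a4"],
})

# duplicated() marks each repeat after the first occurrence, so its sum
# equals len(values) - len(unique()) when the column contains no nulls.
duplicate_counts = {col: int(toy[col].duplicated().sum()) for col in toy.columns}
print(duplicate_counts)
```

Because nulls were already dropped in cell 4, both ways of counting should agree on the real data.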


In [6]:
question_count=cleaned_data.groupby("question").size()
question_count[question_count>1]

question
What are some of the features that set the Nectar mattress apart from other foam mattresses?    2
What are some of the job roles mentioned in the context?                                        2
What are the ingredients needed for the recipe mentioned in the context?                        2
What is the context about?                                                                      6
What is the context discussing about?                                                           2
What is the date mentioned in the context?                                                      2
What is the main topic discussed in the context?                                                4
What is the price range of the items listed in the context?                                     2
dtype: int64

In [7]:
# Collect questions that mention "context" (case-insensitive) and preview the first 100
context_queries=cleaned_data[cleaned_data["question"].str.contains('context', case=False)]["question"]
context_queries.shape  # not displayed: a notebook cell only shows its last expression
context_queries.head(100).values

array(['What is the lunch that Kim and Tony plan to have on Monday according to the context?',
       'What are the five love and relationship podcasts mentioned in the context?',
       "What does the word 'quantum' mean as per the context?",
       "What are the 5 ways to make your Mum's day special according to the context?",
       'What is a "foster failure" in the context of animal rescue?',
       'What are the four phenomenological constants described by Schön that could form a matrix for reflective practice within interdisciplinary research contexts?',
       'Who is the most followed person on Instagram according to the context?',
       'What is the main concern of the individual in the context about their job title?',
       'What is the two-step secret to making money on Ebay according to the context?',
       'What are some of the functions of barstools counter stools on sale ivory swivel mentioned in the context?',
       'What does the word "Shamo" refer to in the conte

#### Removing meta-questions: around 8% of the dataset is dropped because the question mentions "context", which is not representative of natural user input.


In [8]:
final_data=cleaned_data[~cleaned_data["question"].str.contains("context",case=False)]
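
The substring match in cell 8 also drops questions that use "context" in its ordinary sense, such as 'What is a "foster failure" in the context of animal rescue?' from the preview above. One possible refinement, sketched here as an assumption rather than the notebook's method, is a narrower pattern that targets phrasings like "according to the context" or "mentioned in the context". `META_PATTERN` and `is_meta_question` are hypothetical names, and the phrase list is illustrative, not exhaustive:

```python
import re

# Hypothetical pattern: matches questions that refer to "the context" as a
# document, not questions that merely use the word "context".
META_PATTERN = re.compile(
    r"\b(?:in|per|to)\s+the\s+context\s*\??\s*$"        # "... according to the context?"
    r"|\bthe\s+context\s+(?:about|discussing)\b"         # "What is the context about?"
    r"|\b(?:mentioned|discussed|listed|described)\s+in\s+the\s+context\b",
    re.IGNORECASE,
)

def is_meta_question(question: str) -> bool:
    """Return True when the question refers to 'the context' itself."""
    return META_PATTERN.search(question) is not None

# Sample questions taken from the notebook outputs above.
samples = [
    "What is the context about?",
    "What does the word 'quantum' mean as per the context?",
    'What is a "foster failure" in the context of animal rescue?',
]
for q in samples:
    print(is_meta_question(q), q)
```

Applied to the notebook's frame, the mask would be `cleaned_data["question"].apply(is_meta_question)`. Since every alternative still contains the word "context", this removes at most as many rows as the substring filter, so the 8% figure would need to be recomputed.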

In [10]:
final_data_no_duplicates=final_data.drop_duplicates("question")
final_data_no_duplicates

| | context | question | answer |
|---|---|---|---|
| 0 | Caption: Tasmanian berry grower Nic Hansen sho... | What is the Berry Export Summary 2028 and what... | The Berry Export Summary 2028 is a dedicated e... |
| 1 | RWSN Collaborations\nSouthern Africa Self-supp... | What are some of the benefits reported from ha... | Benefits reported from having access to Self-s... |
| 2 | All Android applications categories\nDescripti... | What are the unique features of the Coolands f... | The unique features of the Coolands for Twitte... |
| 3 | How unequal is India? The question is simple, ... | What is the main difference between the Nation... | The main difference between the NSS and the IH... |
| 4 | Gunnar Nelson took his time on the feet agains... | How did Gunnar Nelson win the fight against Za... | Gunnar Nelson won the fight against Zak Cummin... |
| ... | ... | ... | ... |
| 2395 | Fuzzy's Ultra Premium Vodka\nThe Myth, The Man... | What are some of the achievements of Fuzzy Zoe... | Fuzzy Zoeller is known for his golfing success... |
| 2396 | Swedish Grand Prix rider Malin Nilsson got mar... | Who did Malin Nilsson marry on 2 June 2018? | Malin Nilsson got married to her partner, Germ... |
| 2397 | The Cracchiolo Law Library of the James E. Rog... | What is the Fellowship in Law Librarianship of... | The Fellowship in Law Librarianship is a progr... |
| 2398 | 2nd physical eMAG store opens in Mammut\nOnlin... | Where has the second physical eMAG store been ... | The second physical eMAG store has been opened... |


#### General data cleaning

In [11]:
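import re
from bs4 import BeautifulSoup

def clean_text(text):
    """Lowercase text, strip HTML markup, and collapse whitespace."""
    # Non-string values (e.g. None/NaN) become empty strings
    if not isinstance(text, str):
        return ""
    text = text.lower()
    # Drop any HTML tags and keep only the visible text
    text = BeautifulSoup(text, "html.parser").get_text()
    # Collapse runs of whitespace (including newlines) into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Only the context column is cleaned; questions and answers are left as-is
final_data_no_duplicates["context"]=final_data_no_duplicates["context"].apply(lambda x : clean_text(x))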


Assuming this really is an XML document, what you're doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package 'lxml' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  text = BeautifulSoup(text, "html.parser").get_text()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_data_no_duplicates["context"]=final_data_no_duplicates["context"].apply(lambda x : clean_text(x))
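
The second warning appears because `final_data_no_duplicates` is derived from a filtered frame, so pandas cannot tell whether the assignment targets the original data or a copy. One common way to make the intent explicit is to take a `.copy()` after filtering and assign through `.loc`. The sketch below uses a hypothetical toy frame and only the non-HTML steps of `clean_text`, so it runs without Beautiful Soup:

```python
import pandas as pd

# Hypothetical toy frame standing in for cleaned_data.
toy = pd.DataFrame({
    "context": ["  Hello   WORLD ", "Foo\nBar", "meta"],
    "question": ["q1", "q2", "What is the context about?"],
})

# Filter, then take an explicit copy so later writes clearly target a new frame.
filtered = toy[~toy["question"].str.contains("context", case=False)].copy()

# Assign through .loc: lowercase the text and collapse whitespace,
# mirroring the lowercasing and regex steps of clean_text.
filtered.loc[:, "context"] = (
    filtered["context"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
print(filtered["context"].tolist())
```

The parser warning is a separate matter. Its message suggests parsing with `features="xml"` if the input really is XML; if the HTML parser is intended, the warning category `XMLParsedAsHTMLWarning` can be imported from `bs4` and silenced with `warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)`.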


#### Data splitting and saving

In [13]:
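from sklearn.model_selection import train_test_split

# First split: 70% fine-tuning, 30% held out
finetuning_data, tmp_data = train_test_split(final_data_no_duplicates, test_size=0.3, random_state=42)
# Second split: halve the held-out 30% into 15% hyperparameter tuning and 15% test
# (this rebinds test_data, which earlier held the original test parquet)
hyperparameter_tuning_data, test_data=train_test_split(tmp_data, test_size=0.5, random_state=42)

# Save the three splits as parquet files
finetuning_data.to_parquet("/kaggle/working/finetuning_data.parquet")
hyperparameter_tuning_data.to_parquet("/kaggle/working/hyperparameter_tuning_data.parquet")
test_data.to_parquet("/kaggle/working/test_data.parquet")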
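
The two-stage split gives 70/15/15 because the second call halves the 30% held out by the first. The arithmetic can be sketched in plain Python. `split_sizes` is a hypothetical helper, and rounding the held-out portion up is an assumption about how `train_test_split` resolves fractional sizes, so real counts may differ by a row:

```python
import math

def split_sizes(n_rows: int, holdout_frac: float = 0.3,
                test_frac_of_holdout: float = 0.5) -> tuple[int, int, int]:
    """Return (fine-tuning, hyperparameter-tuning, test) row counts."""
    # Stage 1: hold out a fraction of all rows; the rest is for fine-tuning.
    n_holdout = math.ceil(holdout_frac * n_rows)
    n_finetune = n_rows - n_holdout
    # Stage 2: split the held-out rows into test and hyperparameter tuning.
    n_test = math.ceil(test_frac_of_holdout * n_holdout)
    n_tuning = n_holdout - n_test
    return n_finetune, n_tuning, n_test

sizes = split_sizes(1000)
print(sizes)  # (700, 150, 150)
```

On the real frames, comparing `len(finetuning_data)`, `len(hyperparameter_tuning_data)`, and `len(test_data)` against `len(final_data_no_duplicates)` is a quick way to confirm the proportions before training.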