<a href="https://colab.research.google.com/github/mr-cri-spy/LLM-Playground/blob/main/Task_Focused_Training_Aim_for_Better_Learning_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task-Focused Training: Aim for Better Learning - Dataset 📊

## Learning Objectives 🎯
- Install and configure necessary libraries to manage datasets.
- Understand how to load and process datasets for specific tasks in machine learning.
- Convert datasets to a format suitable for training machine learning models.
- Prepare and store datasets efficiently for machine learning applications.

## Library Installation 🛠️
Install the Hugging Face `datasets` library, pinned to version 2.19.1 so the notebook behaves the same on every run. pip also pulls in its dependencies (for example `dill`, `xxhash`, `multiprocess`, and `fsspec`), which support the data loading and preprocessing steps below.

In [None]:
!pip install datasets==2.19.1

Collecting datasets==2.19.1
  Downloading datasets-2.19.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets==2.19.1)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets==2.19.1)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets==2.19.1)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.3.1,>=2023.1.0 (from fsspec[http]<=2024.3.1,>=2023.1.0->datasets==2.19.1)
  Downloading fsspec-2024.3.1-py3-none-any.whl.metadata (6.8 kB)
Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)

## Loading Dataset 📚
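Load the SQuAD question-answering dataset (`rajpurkar/squad`) from the Hugging Face Hub with `load_dataset`, then convert its validation split to a pandas DataFrame so its structure can be inspected and reshaped for the training task. The dataset is public, so the `HF_TOKEN` warning in the output can be ignored.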

In [None]:
from datasets import load_dataset


In [None]:
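# Download (or load from the local cache) every SQuAD split from the Hugging Face Hub.
data = load_dataset("rajpurkar/squad")

# Work with the validation split as a pandas DataFrame.
df = data['validation'].to_pandas()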

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme: 7.62k
Downloading data: 14.5M
Downloading data: 1.82M
Generating train split: 87599 examples
Generating validation split: 10570 examples
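
One way to examine the structure is to check that each answer really appears in its context at the stated character offset. The sketch below assumes the usual SQuAD row layout (`id`, `title`, `context`, `question`, and an `answers` dict holding parallel `text` and `answer_start` lists). The record and the `answers_match_context` helper are hypothetical and exist only for illustration.

```python
# Hypothetical record laid out like a SQuAD row; the values are made up.
CONTEXT = "The library stores tables in Apache Arrow format."
record = {
    "id": "example-0",
    "title": "Example",
    "context": CONTEXT,
    "question": "Which format does the library use to store tables?",
    "answers": {
        "text": ["Apache Arrow"],
        # Character offset of the answer inside the context.
        "answer_start": [CONTEXT.index("Apache Arrow")],
    },
}


def answers_match_context(rec):
    """Return True if every answer text sits in the context at its stated offset."""
    context = rec["context"]
    answers = rec["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


is_consistent = answers_match_context(record)
print(is_consistent)
```

The same helper could be applied to real rows, for example by iterating over `data['validation']`, to spot records whose offsets do not line up.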

## Data Processing and Transformation 🔧
Reduce the DataFrame to what the training task needs. Each row's `answers` field holds a list of reference answer texts, so take the first one as the target `output` column and then drop the nested `answers` column. The result is a flat table that is easy to feed to a training pipeline.

In [None]:
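# Use the text of the first reference answer as the target output.
df['output'] = df['answers'].map(lambda x: x['text'][0])
# The nested answers column is no longer needed.
df = df.drop(columns=['answers'])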
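
The lambda above raises an `IndexError` if a row has no answer text. That is not expected for this dataset, but it can happen with datasets that include unanswerable questions. A minimal, more defensive sketch follows; `first_answer` and its `default` parameter are hypothetical names, not part of the notebook or of any library.

```python
def first_answer(answers, default=""):
    """Return the first answer text, or `default` when the list is empty.

    `len(...)` is used instead of a truthiness test because, after
    `to_pandas()`, the texts may arrive as a NumPy array rather than a list.
    """
    texts = answers["text"]
    return texts[0] if len(texts) > 0 else default


# Hypothetical answers dicts shaped like the SQuAD `answers` field.
with_answers = {"text": ["first", "second"], "answer_start": [0, 10]}
without_answers = {"text": [], "answer_start": []}

print(first_answer(with_answers))
print(repr(first_answer(without_answers)))
```

Under these assumptions it could replace the lambda as `df['answers'].map(first_answer)`.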

## Data Storage and Preparation 🗃️
Save the processed DataFrame as a Parquet file. Parquet is a compressed, columnar format that preserves column types and loads quickly, which makes it a good fit for reusing the same prepared data across training runs.

In [None]:
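import os

# Create the output directory if it does not already exist.
os.makedirs('data', exist_ok=True)

# Write the processed DataFrame to a Parquet file.
df.to_parquet("data/squad_for_llms.parquet")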
