# Description
This notebook builds the dataset in the required format: a JSONL file in which each record has the following structure:

{"instruction": "...", "input": "...", "output": "..."}

The data is then split into a training/validation file and a test file, used for training and for held-out testing respectively.
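As a minimal sketch of the target format, the snippet below serializes one record to a single JSONL line and parses it back using only the standard library. The helper name `to_jsonl_line` is an illustrative assumption, not part of this notebook; the sample values are taken from one row of the dataset.

```python
import json

def to_jsonl_line(instruction, output, input_text=""):
    """Serialize one record as a single JSON line (hypothetical helper)."""
    record = {"instruction": instruction, "input": input_text, "output": output}
    # ensure_ascii=False keeps characters such as curly quotes human-readable.
    return json.dumps(record, ensure_ascii=False)

line = to_jsonl_line(
    "“So many books, so little time.”",
    "Frank Zappa: ['books' 'humor']",
)
parsed = json.loads(line)  # each line of a JSONL file parses on its own
print(parsed["output"])
```

Each line of the final file is an independent JSON object, so the file can be streamed record by record.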

In [None]:
import pandas as pd

In [4]:
!pip install datasets transformers[sentencepiece]
!pip install jsonlines


In [5]:

from datasets import load_dataset

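# Download the quotes dataset from the Hugging Face Hub
raw_datasets = load_dataset("Abirate/english_quotes")
# Inspect where the Arrow cache files live (not displayed, since it is not the cell's last expression)
raw_datasets.cache_files
# Save a local copy in Arrow format
raw_datasets.save_to_disk("Abirate_english_quotes")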

Saving the dataset (0/1 shards):   0%|          | 0/2508 [00:00<?, ? examples/s]

In [6]:
from datasets import load_from_disk

arrow_datasets_reloaded = load_from_disk("Abirate_english_quotes")
arrow_datasets_reloaded

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags'],
        num_rows: 2508
    })
})

In [7]:
for split, dataset in raw_datasets.items():
    dataset.to_csv(f"my-dataset-{split}.csv", index=None)

Creating CSV from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

In [11]:
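# Load the exported CSV into a pandas DataFrame
df = pd.read_csv("my-dataset-train.csv")

# Target text: "<author>: <tags>", where tags is the stringified list read back from the CSV
df["output"] = df.apply(lambda row: row["author"] + ": " + row["tags"], axis=1)
# The quote itself serves as the instruction
df["instruction"] = df["quote"]
df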



Unnamed: 0,quote,author,tags,output,instruction
0,“Be yourself; everyone else is already taken.”,Oscar Wilde,['be-yourself' 'gilbert-perreira' 'honesty' 'i...,Oscar Wilde: ['be-yourself' 'gilbert-perreira'...,“Be yourself; everyone else is already taken.”
1,"“I'm selfish, impatient and a little insecure....",Marilyn Monroe,['best' 'life' 'love' 'mistakes' 'out-of-contr...,Marilyn Monroe: ['best' 'life' 'love' 'mistake...,"“I'm selfish, impatient and a little insecure...."
2,“Two things are infinite: the universe and hum...,Albert Einstein,['human-nature' 'humor' 'infinity' 'philosophy...,Albert Einstein: ['human-nature' 'humor' 'infi...,“Two things are infinite: the universe and hum...
3,"“So many books, so little time.”",Frank Zappa,['books' 'humor'],Frank Zappa: ['books' 'humor'],"“So many books, so little time.”"
4,“A room without books is like a body without a...,Marcus Tullius Cicero,['books' 'simile' 'soul'],Marcus Tullius Cicero: ['books' 'simile' 'soul'],“A room without books is like a body without a...
...,...,...,...,...,...
2503,“Morality is simply the attitude we adopt towa...,"Oscar Wilde,",['morality' 'philosophy'],"Oscar Wilde,: ['morality' 'philosophy']",“Morality is simply the attitude we adopt towa...
2504,“Don't aim at success. The more you aim at it ...,"Viktor E. Frankl,",['happiness' 'success'],"Viktor E. Frankl,: ['happiness' 'success']",“Don't aim at success. The more you aim at it ...
2505,"“In life, finding a voice is speaking and livi...",John Grisham,['inspirational-life'],John Grisham: ['inspirational-life'],"“In life, finding a voice is speaking and livi..."
2506,"“Winter is the time for comfort, for good food...",Edith Sitwell,['comfort' 'home' 'winter'],Edith Sitwell: ['comfort' 'home' 'winter'],"“Winter is the time for comfort, for good food..."
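The output above shows that `tags` comes back from the CSV as a string such as `['books' 'humor']`, and that some author names carry a trailing comma (for example `Oscar Wilde,`). The notebook keeps both as they are. If cleaner targets were wanted, one possible approach is sketched below; `parse_tags` and `format_output` are hypothetical helpers, and the sketch assumes every tag is wrapped in single quotes and contains no apostrophe.

```python
import re

def parse_tags(raw):
    """Extract tag names from a string such as "['books' 'humor']" (hypothetical helper)."""
    # Assumption: each tag is single-quoted and contains no apostrophe.
    return re.findall(r"'([^']*)'", raw)

def format_output(author, raw_tags):
    """Build a cleaned target string of the form '<author>: tag1, tag2'."""
    # Strip surrounding whitespace and any trailing comma from the author name.
    clean_author = author.strip().rstrip(",")
    return f"{clean_author}: {', '.join(parse_tags(raw_tags))}"

print(format_output("Oscar Wilde,", "['morality' 'philosophy']"))
```

Applied row by row, this would replace the raw concatenation used for `output`; whether that helps depends on what the fine-tuned model is expected to produce.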


In [12]:

def get_json_per_row_chat_format(row):
    """Build the instruction/input/output record for one DataFrame row."""
    return {"instruction": row["instruction"], "input": "", "output": row["output"]}

df["json_i"] = df.apply(get_json_per_row_chat_format, axis=1)
df.head()





Unnamed: 0,quote,author,tags,output,instruction,json_i
0,“Be yourself; everyone else is already taken.”,Oscar Wilde,['be-yourself' 'gilbert-perreira' 'honesty' 'i...,Oscar Wilde: ['be-yourself' 'gilbert-perreira'...,“Be yourself; everyone else is already taken.”,{'instruction': '“Be yourself; everyone else i...
1,"“I'm selfish, impatient and a little insecure....",Marilyn Monroe,['best' 'life' 'love' 'mistakes' 'out-of-contr...,Marilyn Monroe: ['best' 'life' 'love' 'mistake...,"“I'm selfish, impatient and a little insecure....","{'instruction': '“I'm selfish, impatient and a..."
2,“Two things are infinite: the universe and hum...,Albert Einstein,['human-nature' 'humor' 'infinity' 'philosophy...,Albert Einstein: ['human-nature' 'humor' 'infi...,“Two things are infinite: the universe and hum...,{'instruction': '“Two things are infinite: the...
3,"“So many books, so little time.”",Frank Zappa,['books' 'humor'],Frank Zappa: ['books' 'humor'],"“So many books, so little time.”","{'instruction': '“So many books, so little tim..."
4,“A room without books is like a body without a...,Marcus Tullius Cicero,['books' 'simile' 'soul'],Marcus Tullius Cicero: ['books' 'simile' 'soul'],“A room without books is like a body without a...,{'instruction': '“A room without books is like...


In [13]:
import pandas as pd

# Shuffle the rows reproducibly before splitting
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Specify the ratio for splitting (e.g., 80-20)
train_ratio = 0.8
test_ratio = 1 - train_ratio

# Calculate the split index
split_index = int(len(df) * train_ratio)

# Split the DataFrame
df_train = df.iloc[:split_index]
df_test = df.iloc[split_index:]

print(df_train.shape)
print(df_test.shape)


(2006, 6)
(502, 6)
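The shapes above follow directly from the split arithmetic: the split index is the row count times the training ratio, truncated to an integer, and the test set receives the remainder. A small sketch, where `split_sizes` is a hypothetical helper name:

```python
def split_sizes(n_rows, train_ratio=0.8):
    """Return (train_rows, test_rows) for a truncating index split."""
    split_index = int(n_rows * train_ratio)  # int() truncates toward zero
    return split_index, n_rows - split_index

print(split_sizes(2508))  # (2006, 502)
```

Because of the truncation, any leftover row goes to the test set rather than the training set.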


In [19]:
import os
os.makedirs("data", exist_ok=True)  # create the output directory if it does not exist
df_train.to_csv("data/df_train.csv")
df_test.to_csv("data/df_test.csv")


In [17]:
import jsonlines

df_json = df_train
data = list(df_json["json_i"].values)
print(len(data))

file_path = 'data/train.jsonl'

# Write the list of dictionaries to the JSONL file
with jsonlines.open(file_path, mode='w') as jsonl_file:
    jsonl_file.write_all(data)

print(f'The data has been saved to {file_path}')

2006
The data has been saved to data/train.jsonl


In [20]:
import jsonlines

df_json = df_test
data = list(df_json["json_i"].values)
print(len(data))

file_path = 'data/df_test.jsonl'

# Write the list of dictionaries to the JSONL file
with jsonlines.open(file_path, mode='w') as jsonl_file:
    jsonl_file.write_all(data)

print(f'The data has been saved to {file_path}')


502
The data has been saved to data/df_test.jsonl
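Before using the files for training, it can be worth reading them back to confirm that every line parses and carries the three expected keys. The sketch below uses only the standard library; `validate_jsonl` and `REQUIRED_KEYS` are hypothetical names, not part of this notebook.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Return the number of valid records; raise ValueError on a bad line."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # JSONDecodeError is a ValueError subclass
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number}: expected a JSON object")
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
            count += 1
    return count
```

Given the run shown above, `validate_jsonl("data/train.jsonl")` should return 2006 and `validate_jsonl("data/df_test.jsonl")` should return 502, assuming the files were written without errors.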
