# Create the Training Data

The goal is to create sample data consisting of common English phrases paired with their corresponding translations in Morse Code. This data will later be used as training and testing material when fine-tuning the LLM.  
This notebook downloads a [dataset](https://huggingface.co/datasets/skeskinen/books3_basic_sentenses_paraphrased) from Hugging Face.

In [16]:
from datasets import load_dataset
ds = load_dataset("skeskinen/books3_basic_sentenses_paraphrased", split="train")

This dataset contains about 670K rows. For this purpose we are only interested in the `paraphrase` column.

In [17]:
ds.to_pandas()

Unnamed: 0,text,book,pos,smog_index,paraphrase
0,This title was originally cataloged by the Lib...,Are You My Mother_ - P.D. Eastman,0.014706,4.8,The title was cataloged by the Library of Cong...
1,The egg jumped.,Are You My Mother_ - P.D. Eastman,0.113971,4.8,The egg flew into the air.
2,So away she went.,Are You My Mother_ - P.D. Eastman,0.158088,4.8,She went away.
3,"""Where is my mother?""",Are You My Mother_ - P.D. Eastman,0.191176,4.8,Where is my mother?
4,He looked for her.,Are You My Mother_ - P.D. Eastman,0.202206,4.8,He was looking for her.
...,...,...,...,...,...
670199,He seemed to be in his early twenties.,A Gathering of Old Men,0.280271,5.6,He was in his twenties.
670200,"He was about five eight, and weighed round a h...",A Gathering of Old Men,0.280271,5.6,He was around 100 and forty pounds.
670201,Even from this distance you could see he was s...,A Gathering of Old Men,0.280271,5.6,He was scared even from this distance.
670202,"He was unarmed, and he reached back into the c...",A Gathering of Old Men,0.280271,5.6,He reached into the car for a gun and was not ...


Create a function to remove diacritics from characters, since most diacritic marks are not represented in Morse Code.

In [18]:
import unicodedata

def normalize_text(text):
    # Normalize to NFKD form which separates characters from their diacritics
    normalized = unicodedata.normalize('NFKD', text)
    # Filter out combining characters (accents, etc.)
    ascii_text = ''.join([c for c in normalized if not unicodedata.combining(c)])
    return ascii_text
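
As a quick check of what this normalization does and does not cover, the sketch below runs the same logic on a few sample strings. The sample strings are illustrative and are not taken from the dataset. Characters such as `Æ` and `ø` have no Unicode decomposition, so they pass through unchanged; this is presumably why the encoding step later uses `skip_unknown=True`.

```python
import unicodedata


def normalize_text(text):
    # Same logic as the cell above, repeated so this snippet runs on its own
    normalized = unicodedata.normalize("NFKD", text)
    return "".join(c for c in normalized if not unicodedata.combining(c))


# Illustrative inputs (not from the dataset)
samples = ["caf\u00e9", "na\u00efve", "\ufb01anc\u00e9", "\u00c6r\u00f8"]

for sample in samples:
    print(f"{sample!r} -> {normalize_text(sample)!r}")

# 'café' -> 'cafe'      accent stripped
# 'naïve' -> 'naive'    diaeresis stripped
# 'ﬁancé' -> 'fiance'   the "fi" ligature is split by the compatibility (K) decomposition
# 'Ærø' -> 'Ærø'        no decomposition exists, so nothing changes
```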

Encode each `paraphrase` text into Morse Code. The result should include both the original text and its corresponding Morse Code translation.

In [19]:
from encode import encode_to_morse

def apply_encoding(batch):
    lines = []
    morse = []

    for text in batch["paraphrase"]:
        lines.append(text)
        morse.append(encode_to_morse(normalize_text(text), skip_unknown=True))

    return {
        "line": lines,
        "morse": morse
    }
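
The `encode` module is local to this project and is not shown in the notebook. The sketch below is a hypothetical stand-in, not the actual implementation: the name `encode_to_morse_sketch`, the table `MORSE_TABLE`, and the exact handling of `skip_unknown` are assumptions. It follows the layout visible in the output further down, with one space between letters and ` / ` between words.

```python
# Hypothetical stand-in for encode.encode_to_morse; the real helper is not shown.
MORSE_TABLE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
    ".": ".-.-.-", ",": "--..--", "?": "..--..",
}


def encode_to_morse_sketch(text, skip_unknown=False):
    """Encode text as Morse: letters separated by spaces, words by ' / '."""
    encoded_words = []
    for word in text.upper().split():
        codes = []
        for char in word:
            if char in MORSE_TABLE:
                codes.append(MORSE_TABLE[char])
            elif not skip_unknown:
                # Assumed behavior: fail loudly unless the caller opts out
                raise ValueError(f"No Morse code for character {char!r}")
        if codes:  # drop words that consisted only of skipped characters
            encoded_words.append(" ".join(codes))
    return " / ".join(encoded_words)


print(encode_to_morse_sketch("She went away."))
# ... .... . / .-- . -. - / .- .-- .- -.-- .-.-.-
```

Upper-casing the input means that lines differing only in case produce the same Morse string, which is worth keeping in mind when deduplicating.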


In [20]:
new_ds = ds.map(apply_encoding, batched=True, remove_columns=["paraphrase", "text", "book", "pos", "smog_index"], num_proc=8)
new_ds.to_pandas()

Unnamed: 0,line,morse
0,The title was cataloged by the Library of Cong...,- .... . / - .. - .-.. . / .-- .- ... / -.-. ....
1,The egg flew into the air.,- .... . / . --. --. / ..-. .-.. . .-- / .. -....
2,She went away.,... .... . / .-- . -. - / .- .-- .- -.-- .-.-.-
3,Where is my mother?,.-- .... . .-. . / .. ... / -- -.-- / -- --- -...
4,He was looking for her.,.... . / .-- .- ... / .-.. --- --- -.- .. -. -...
...,...,...
670199,He was in his twenties.,.... . / .-- .- ... / .. -. / .... .. ... / - ...
670200,He was around 100 and forty pounds.,.... . / .-- .- ... / .- .-. --- ..- -. -.. / ...
670201,He was scared even from this distance.,.... . / .-- .- ... / ... -.-. .- .-. . -.. / ...
670202,He reached into the car for a gun and was not ...,.... . / .-. . .- -.-. .... . -.. / .. -. - --...


Remove duplicate entries so the LLM is not trained on repeated examples. Duplicates would not cause harm, but there is no need to keep them at this stage of data preparation.

In [27]:
import pandas as pd
from datasets import Dataset


df = new_ds.to_pandas()

# Drop rows that are identical across all columns (pass subset=[...] to compare specific columns only)
df = df.drop_duplicates()
# Convert back to Hugging Face Dataset
deduped_dataset = Dataset.from_pandas(df, preserve_index=False)
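
To make the effect of the column choice concrete, here is a small pure-Python sketch of first-occurrence deduplication. It is an illustration of the idea rather than the pandas implementation; the helper name `dedupe` and the sample rows are made up, and the rows assume the encoder ignores letter case.

```python
def dedupe(rows, key_columns):
    """Keep the first row for each distinct combination of key_columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Made-up rows: "Hi" and "hi" differ as text but share one Morse string
rows = [
    {"line": "Hi", "morse": ".... .."},
    {"line": "Hi", "morse": ".... .."},
    {"line": "hi", "morse": ".... .."},
    {"line": "No", "morse": "-. ---"},
]

print(len(dedupe(rows, ["line", "morse"])))  # 3: only the exact repeat is dropped
print(len(dedupe(rows, ["morse"])))          # 2: rows sharing a Morse string collapse
```

Comparing all columns, as the cell above does, keeps both `Hi` and `hi`. In pandas, passing `subset=["morse"]` to `drop_duplicates` would collapse them instead.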

Print the number of rows in the original and deduplicated datasets.

In [32]:
print(f"Original dataset size    : {len(new_ds)}")
print(f"Deduplicated dataset size: {len(deduped_dataset)}")

Original dataset size    : 670204
Deduplicated dataset size: 531976
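
The share of rows removed follows directly from the two sizes printed above. The variable names in this short calculation are illustrative.

```python
original_rows = 670204
deduplicated_rows = 531976

removed_rows = original_rows - deduplicated_rows
removed_share = removed_rows / original_rows

print(f"Removed rows : {removed_rows}")       # Removed rows : 138228
print(f"Removed share: {removed_share:.1%}")  # Removed share: 20.6%
```

Roughly one row in five was an exact duplicate of an earlier row.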


## Upload the prepared dataset to your repository on Hugging Face

If you have not done so before, log in to Hugging Face with their CLI tool.

```bash
huggingface-cli login
```

You can now upload your dataset to Hugging Face. Make sure to change the repository name to your own.

In [35]:
deduped_dataset.push_to_hub("philipfourie/books3_basic_sentenses_paraphrased-Morse", private=False)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/philipfourie/books3_basic_sentenses_paraphrased-Morse/commit/b7b7719e52db13b28540a48e6d8b80a299c8c01f', commit_message='Upload dataset', commit_description='', oid='b7b7719e52db13b28540a48e6d8b80a299c8c01f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/philipfourie/books3_basic_sentenses_paraphrased-Morse', endpoint='https://huggingface.co', repo_type='dataset', repo_id='philipfourie/books3_basic_sentenses_paraphrased-Morse'), pr_revision=None, pr_num=None)

This completes the preparation of your training data that can be used for both training and validation.
These files are available in [parquet](https://huggingface.co/datasets/philipfourie/books3_basic_sentenses_paraphrased-Morse/tree/main/data) format in your Hugging Face Datasets repository.

https://huggingface.co/datasets/philipfourie/books3_basic_sentenses_paraphrased-Morse