## Datasets

### Load from Hugging Face

In [1]:
from datasets import load_dataset

dataset = load_dataset("fka/awesome-chatgpt-prompts")
dataset

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt'],
        num_rows: 203
    })
})

> The dataset above has only a `train` split. Let's look at another one that has `train`, `validation`, and `test` splits.

In [3]:
dataset = load_dataset("knkarthick/samsum")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
})

### Preprocessing Methods

In [6]:
# Reload the original dataset
dataset = load_dataset("fka/awesome-chatgpt-prompts")
dataset["train"][0]

{'act': 'An Ethereum Developer',
 'prompt': 'Imagine you are an experienced Ethereum developer tasked with creating a smart contract for a blockchain messenger. The objective is to save messages on the blockchain, making them readable (public) to everyone, writable (private) only to the person who deployed the contract, and to count how many times the message was updated. Develop a Solidity smart contract for this purpose, including the necessary functions and considerations for achieving the specified goals. Please provide the code and any relevant explanations to ensure a clear understanding of the implementation.'}

In [7]:
dataset = dataset["train"].shuffle(seed=43).select(range(100))
dataset

Dataset({
    features: ['act', 'prompt'],
    num_rows: 100
})

In [8]:
# Split into train (80%) and test (20%) sets
dataset = dataset.train_test_split(train_size=0.8, seed=49)
dataset

DatasetDict({
    train: Dataset({
        features: ['act', 'prompt'],
        num_rows: 80
    })
    test: Dataset({
        features: ['act', 'prompt'],
        num_rows: 20
    })
})

**Let's make our own dataset** from the `reuters21578/*.sgm` files, which were extracted from the archive at https://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/reuters21578.tar.gz

In [32]:
## Get the title and body of all the articles
import glob
import os

from bs4 import BeautifulSoup

dir_path = "./reuters21578/"
files = os.path.join(dir_path, "*.sgm")
articles = []

for filepath in glob.glob(files):
    with open(filepath, "r", encoding="latin-1") as file:
        soup = BeautifulSoup(file, "html.parser")

    for r in soup.find_all("reuters"):
        title = r.title.string if r.title else ""
        body = r.body.string if r.body else ""

        ## Skip records that have neither a title nor a body
        if title == "" and body == "":
            continue

        articles.append({
            "title": title,
            "body": body
        })

print(f"Articles: {len(articles):,}")
articles[0]

Articles: 20,841


{'title': 'FAIRFAX SAYS HIGHER TAX HITS FIRST HALF EARNINGS',
 'body': 'Media group John Fairfax Ltd <FFXA.S>\nsaid that its flat first half net profit partly reflected the\nimpact of changes in the Australian tax system.\n    Fairfax earlier reported net earnings edged up 2.3 pct to\n25.94 mln dlrs in the 26 weeks ended December 28 from 25.35 mln\na year earlier although pre-tax profit rose 9.1 pct to 48.30\nmln from 44.29 mln.\n    Net would have risen 10.1 pct but for the increase in\ncompany tax to 49 pct from 46 and the imposition of the tax on\nfringe benefits, paid by employers and not the recipients, the\ncompany said in a statement.\n    Fairfax also pointed to the cyclical downturn in revenue\ngrowth in the television industry as another reason for the\nflat first half earnings.\n    It said it considered the result satisfactory in view of\nthese factors.\n    Fairfax said its flagship dailies, The Sydney Morning\nHerald and the Melbourne Age, boosted advertising volume, as\n

In [40]:
## Now let's make our own dataset from these articles
import json

TRAIN_PCT = 0.8
VALID_PCT = 0.1

# Slice boundaries: train is [0, TRAIN_NUM), validation is [TRAIN_NUM, VALID_NUM)
TRAIN_NUM = int(len(articles) * TRAIN_PCT)
VALID_NUM = int(len(articles) * (TRAIN_PCT + VALID_PCT))

# Split the data in the order the articles were read (no shuffling)
train_articles = articles[:TRAIN_NUM]
print(f"Training dataset: {len(train_articles):,}")

valid_articles = articles[TRAIN_NUM:VALID_NUM]
print(f"Validation dataset: {len(valid_articles):,}")

test_articles = articles[VALID_NUM:]
print(f"Test dataset: {len(test_articles):,}")

def save_as_jsonl(data, filename):
    with open(filename, "w") as file:
        for article in data:
            file.write(json.dumps(article) + "\n")
    print(f"Wrote {filename}")

save_as_jsonl(train_articles, "train.jsonl")
save_as_jsonl(valid_articles, "valid.jsonl")
save_as_jsonl(test_articles, "test.jsonl")

Training dataset: 16,672
Validation dataset: 2,084
Test dataset: 2,085
Wrote train.jsonl
Wrote valid.jsonl
Wrote test.jsonl
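The split above slices the articles in the order the files were read, so each split reflects whatever order `glob` returned. If a random but reproducible split is preferred, the indices can be shuffled with a fixed seed first. This is a minimal standard-library sketch; the helper name `split_indices` and the seed value are hypothetical, and the boundary arithmetic is the same as in the cell above.

```python
import random


def split_indices(n_items, train_pct=0.8, valid_pct=0.1, seed=42):
    """Return shuffled (train, valid, test) index lists that together cover range(n_items)."""
    indices = list(range(n_items))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)

    train_end = int(n_items * train_pct)
    valid_end = int(n_items * (train_pct + valid_pct))
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train_idx, valid_idx, test_idx = split_indices(20_841)
print(len(train_idx), len(valid_idx), len(test_idx))  # 16672 2084 2085
```

With the article list from the notebook, the splits would then be built as `[articles[i] for i in train_idx]` and so on.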


In [41]:
## Load them as a dataset
data_files = {
    "train": "train.jsonl",
    "validation": "valid.jsonl",
    "test": "test.jsonl"
}
dataset = load_dataset("json", data_files=data_files)
dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'body'],
        num_rows: 16672
    })
    validation: Dataset({
        features: ['title', 'body'],
        num_rows: 2084
    })
    test: Dataset({
        features: ['title', 'body'],
        num_rows: 2085
    })
})
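The row counts reported by `load_dataset` can be cross-checked against the files themselves, since JSON Lines is simply one JSON object per line. The sketch below round-trips a couple of made-up records through a temporary file using only the standard library; `write_jsonl` and `read_jsonl` are hypothetical helper names, with the writer mirroring `save_as_jsonl` above.

```python
import json
import os
import tempfile

# Made-up records with the same keys as the articles above.
records = [
    {"title": "FIRST HEADLINE", "body": "Line one.\nLine two."},
    {"title": "SECOND HEADLINE", "body": "Caf\u00e9 prices rose 2.3 pct."},
]


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as file:
        for row in rows:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            file.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")
    write_jsonl(records, path)
    loaded = read_jsonl(path)

print(len(loaded), loaded == records)
```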

In [45]:
## Log in to Hugging Face
from huggingface_hub import notebook_login
notebook_login()

In [46]:
## Upload dataset to Hugging Face
dataset.push_to_hub("reuters-articles")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/datasets/shayharding/reuters-articles/commit/5cf5cd354c19888e5adf77475053eeecfd0d99c7', commit_message='Upload dataset', commit_description='', oid='5cf5cd354c19888e5adf77475053eeecfd0d99c7', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/shayharding/reuters-articles', endpoint='https://huggingface.co', repo_type='dataset', repo_id='shayharding/reuters-articles'), pr_revision=None, pr_num=None)