## Step 1. Understanding the Builder class

In [19]:
import datasets

In [20]:
dl_manager = datasets.DownloadManager()

Example: rotten_tomatoes reviews

In [21]:
_DOWNLOAD_URL = "https://storage.googleapis.com/seldon-datasets/sentence_polarity_v1/rt-polaritydata.tar.gz"

In [23]:
archive = dl_manager.download(_DOWNLOAD_URL)

The downloaded file is stored in the hidden `.cache` directory. `dl_manager.download` returns the local path of the file, so you can look up its exact location as below.

In [27]:
archive

'/Users/gradcheckout/.cache/huggingface/datasets/downloads/8d5816082536d6d235d1e6d1e53cc9173be5de48a03ec38b43e51789052a6c34.6d7967df3967b0b06b2972453d333c6ff314dde9b2683be1cafcd0d4f86066aa'

If you download the archive manually and extract it, you can see how the data files are organized.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

Within the parent folder, the first file you see is `rt-polaritydata.README.1.0.txt`, and there is also a folder named `rt-polaritydata`. Inside the `rt-polaritydata` folder there are two files, `rt-polarity.neg` and `rt-polarity.pos`. These two files are all we need. `dl_manager.iter_archive(archive)` gives us an iterable that yields a `(path, file object)` pair for each file inside the tar archive, without extracting the archive to disk.
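To get a feel for what iterating over an archive means, here is a minimal sketch that uses only the standard library `tarfile` module. It is not how `iter_archive` is implemented; `iter_tar_members` and `make_toy_archive` are hypothetical helpers, and the toy archive only imitates the layout described above.

```python
import io
import tarfile


def iter_tar_members(fileobj):
    """Yield (path, file_object) pairs for the regular files in a gzipped tar stream."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if member.isfile():
                # The file object is only readable while the archive is open,
                # so consume it before asking for the next pair.
                yield member.name, tar.extractfile(member)


def make_toy_archive():
    """Build a tiny in-memory .tar.gz (hypothetical content, similar layout)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, text in [
            ("rt-polaritydata/rt-polarity.neg", b"a dull film\n"),
            ("rt-polaritydata/rt-polarity.pos", b"a lovely film\n"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(text)
            tar.addfile(info, io.BytesIO(text))
    buffer.seek(0)
    return buffer


toy_members = [(path, f.read()) for path, f in iter_tar_members(make_toy_archive())]
print(toy_members)
```

Note that the file objects are read inside the loop. Reading them after the loop has finished would fail in this sketch because the archive is closed by then.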

In [26]:
for path, f in dl_manager.iter_archive(archive):
    print(path, f)

rt-polaritydata.README.1.0.txt <ExFileObject name=''>
rt-polaritydata/rt-polarity.neg <ExFileObject name=''>
rt-polaritydata/rt-polarity.pos <ExFileObject name=''>


## Step 2. Write the *`_split_generators`* function

The function `_split_generators` returns a list of `datasets.SplitGenerator` objects, each of which takes the following parameters: <br>

![image.png](attachment:image.png)

Make sure that the keys in `gen_kwargs` match the parameter names of the `_generate_examples` function introduced in a later section, because `gen_kwargs` is passed to it as keyword arguments.
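One way to catch a mismatch early is to compare the keys against the function signature. The sketch below uses only the standard library; `missing_or_extra_kwargs` is a hypothetical helper, and the `_generate_examples` here is just a stand-in with the same parameter names as the one written later.

```python
import inspect


def _generate_examples(split_key, files):
    """Stand-in with the same parameter names as the function written later."""
    yield from ()


def missing_or_extra_kwargs(func, gen_kwargs):
    """Return (missing, extra) names for a call of the form func(**gen_kwargs)."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    # Parameters without a default value must be supplied by gen_kwargs.
    required = {
        name
        for name, p in params.items()
        if p.default is inspect.Parameter.empty and p.kind in keyword_kinds
    }
    missing = sorted(required - set(gen_kwargs))
    extra = sorted(set(gen_kwargs) - set(params))
    return missing, extra


good = missing_or_extra_kwargs(_generate_examples, {"split_key": "train", "files": []})
bad = missing_or_extra_kwargs(_generate_examples, {"split": "train", "files": []})
print(good)
print(bad)
```

An empty `missing` list and an empty `extra` list mean the call `func(**gen_kwargs)` will at least bind its arguments correctly.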

In [31]:
def _split_generators(dl_manager):
    """Downloads Rotten Tomatoes sentences."""
    archive = dl_manager.download(_DOWNLOAD_URL)
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"split_key": "train", "files": dl_manager.iter_archive(archive)},
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION,
            gen_kwargs={"split_key": "validation", "files": dl_manager.iter_archive(archive)},
        ),
        datasets.SplitGenerator(
            name=datasets.Split.TEST,
            gen_kwargs={"split_key": "test", "files": dl_manager.iter_archive(archive)},
        ),
    ]


In [32]:
_split_generators(dl_manager)

[SplitGenerator(name='train', gen_kwargs={'split_key': 'train', 'files': <datasets.download.download_manager.ArchiveIterable object at 0x7f9fe1b01130>}, split_info=SplitInfo(name='train', num_bytes=0, num_examples=0, shard_lengths=None, dataset_name=None)),
 SplitGenerator(name='validation', gen_kwargs={'split_key': 'validation', 'files': <datasets.download.download_manager.ArchiveIterable object at 0x7fa0032ae6d0>}, split_info=SplitInfo(name='validation', num_bytes=0, num_examples=0, shard_lengths=None, dataset_name=None)),
 SplitGenerator(name='test', gen_kwargs={'split_key': 'test', 'files': <datasets.download.download_manager.ArchiveIterable object at 0x7fa0032ae910>}, split_info=SplitInfo(name='test', num_bytes=0, num_examples=0, shard_lengths=None, dataset_name=None))]

## Step 3. `_generate_examples`

### Step 3-0. `_get_examples_from_split` : generate examples for each split depending on their split_key

Since we wish to split the dataset into train, validation, and test sets, we need a function that returns a different set of texts depending on its `split_key` argument.

In [33]:
def _get_examples_from_split(split_key, files):
    pass  # placeholder; the body is filled in step by step below

The first thing to do is read the data files and check their contents. Here we collect the positive and negative sentences into two lists.

In [50]:
data_dir = "rt-polaritydata/"
pos_samples, neg_samples = None, None
for path, f in dl_manager.iter_archive(archive):
    if path == data_dir + "rt-polarity.pos":
        pos_samples = [line.decode('latin_1').strip() for line in f]
    elif path == data_dir + "rt-polarity.neg":
        neg_samples = [line.decode('latin_1').strip() for line in f]
    if pos_samples is not None and neg_samples is not None:
        break
        

With all the positive and negative samples loaded, we then split them into train, validation, and test sets.

In [60]:
# 80/10/10 split
assert len(pos_samples) == len(neg_samples), "length of pos_samples and neg_samples are not equal" # make sure the length of both samples are the same
i1 = int(len(pos_samples) * 0.8 + 0.5) # 4265.3 -> 4265
i2 = int(len(pos_samples) * 0.9 + 0.5) # 4798.4 -> 4798
train_samples = pos_samples[:i1] + neg_samples[:i1]
train_labels = (["pos"] * i1) + (["neg"] * i1)
assert len(train_samples) == len(train_labels)
validation_samples = pos_samples[i1:i2] + neg_samples[i1:i2]
validation_labels = (["pos"] * (i2 - i1)) + (["neg"] * (i2 - i1))
assert len(validation_samples) == len(validation_labels)
test_samples = pos_samples[i2:] + neg_samples[i2:]
test_labels = (["pos"] * (len(pos_samples) - i2)) + (["neg"] * (len(pos_samples) - i2))
assert len(test_samples) == len(test_labels)
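The index arithmetic above can be checked on a toy example. This is a sketch only: `split_indices` is a hypothetical helper that repeats the same rounding, and 5331 is the per-class sample count implied by the `4265.3` and `4798.4` comments.

```python
def split_indices(n, train_end=0.8, validation_end=0.9):
    """Return the two cut points, rounded to the nearest integer."""
    i1 = int(n * train_end + 0.5)
    i2 = int(n * validation_end + 0.5)
    return i1, i2


# Toy data: 10 positive and 10 negative samples.
toy_pos = [f"p{i}" for i in range(10)]
toy_neg = [f"n{i}" for i in range(10)]
i1, i2 = split_indices(len(toy_pos))

toy_train = toy_pos[:i1] + toy_neg[:i1]
toy_validation = toy_pos[i1:i2] + toy_neg[i1:i2]
toy_test = toy_pos[i2:] + toy_neg[i2:]

print(i1, i2)
print(len(toy_train), len(toy_validation), len(toy_test))
print(split_indices(5331))
```

Because each split takes the same slice from both classes, every split stays balanced between `pos` and `neg`.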

After creating all three splits, return the one that corresponds to the `split_key` argument. <br>
Now, let's put the pieces together in the `_get_examples_from_split` function.

In [81]:
def _get_examples_from_split(split_key, files):
    # part I
    data_dir = "rt-polaritydata/"
    pos_samples, neg_samples = None, None
    for path, f in files:  # iterate over the archive passed in through gen_kwargs
        if path == data_dir + "rt-polarity.pos":
            pos_samples = [line.decode('latin_1').strip() for line in f]
        elif path == data_dir + "rt-polarity.neg":
            neg_samples = [line.decode('latin_1').strip() for line in f]
        if pos_samples is not None and neg_samples is not None:
            break
    # part II
    assert len(pos_samples) == len(neg_samples), "length of pos_samples and neg_samples are not equal" # make sure the length of both samples are the same
    i1 = int(len(pos_samples) * 0.8 + 0.5) # 4265.3 -> 4265
    i2 = int(len(pos_samples) * 0.9 + 0.5) # 4798.4 -> 4798
    train_samples = pos_samples[:i1] + neg_samples[:i1]
    train_labels = (["pos"] * i1) + (["neg"] * i1)
    assert len(train_samples) == len(train_labels)
    validation_samples = pos_samples[i1:i2] + neg_samples[i1:i2]
    validation_labels = (["pos"] * (i2 - i1)) + (["neg"] * (i2 - i1))
    assert len(validation_samples) == len(validation_labels)
    test_samples = pos_samples[i2:] + neg_samples[i2:]
    test_labels = (["pos"] * (len(pos_samples) - i2)) + (["neg"] * (len(pos_samples) - i2))
    assert len(test_samples) == len(test_labels)
    # part III
    if split_key == "train":
        return (train_samples, train_labels)
    elif split_key == "validation":
        return (validation_samples, validation_labels)
    elif split_key == "test":
        return (test_samples, test_labels)
    else:
        raise ValueError(f"Invalid split key {split_key}")

With `_get_examples_from_split` in place, the `_generate_examples` function only needs to **yield** the data, one `(key, example)` pair at a time.

In [85]:
def _generate_examples(split_key, files):
    """Yields examples for a given split of MR."""
    split_text, split_labels = _get_examples_from_split(split_key, files)
    for idx, text, label in zip(range(len(split_text)), split_text, split_labels):
        data_key = split_key + "_" + str(idx)
        feature_dict = {"text": text, "label": label}
        yield data_key, feature_dict


In [86]:
generator = _generate_examples(**_split_generators(dl_manager)[0].gen_kwargs)

In [87]:
next(generator)

('train_0',
 {'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'label': 'pos'})
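Before moving on, it can be useful to run through a generator and check that the keys are unique and the labels look balanced. The sketch below is self-contained: `toy_examples` is a hypothetical stand-in for `_generate_examples`, and `summarize` is a hypothetical helper. As far as I know the library expects keys to be unique within a split, so treat the duplicate check as a sanity test rather than a statement about the library's internals.

```python
from collections import Counter
from itertools import islice


def toy_examples():
    """Hypothetical stand-in for _generate_examples; yields (key, example) pairs."""
    rows = [("good fun", "pos"), ("great cast", "pos"), ("too long", "neg")]
    for idx, (text, label) in enumerate(rows):
        yield f"train_{idx}", {"text": text, "label": label}


def summarize(examples, limit=None):
    """Consume up to `limit` examples; return (number of examples, label counts)."""
    seen_keys = set()
    label_counts = Counter()
    for key, example in islice(examples, limit):
        if key in seen_keys:
            raise ValueError(f"duplicate key: {key}")
        seen_keys.add(key)
        label_counts[example["label"]] += 1
    return len(seen_keys), label_counts


n_examples, label_counts = summarize(toy_examples())
print(n_examples, dict(label_counts))
```

The same `summarize` call should work on the real generator, for example with `limit=100` to avoid reading the whole split.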

## Step 4. Organize the custom data loading script 

For reference, see the Python loading script for Rotten Tomatoes: https://huggingface.co/datasets/rotten_tomatoes/blob/main/rotten_tomatoes.py

In [1]:
import datasets
from datasets import DownloadManager
import csv
dl_manager = DownloadManager()

In [2]:
import numpy as np
import pandas as pd
import re

In [3]:
ds = datasets.load_dataset('sample_data')

No config specified, defaulting to: sample_data/sample_data
Found cached dataset sample_data (/Users/gradcheckout/.cache/huggingface/datasets/sample_data/sample_data/1.0.0/900aa8a1e70892a2dfc2b6fb4d01e27052c9d75965051202dd2e68522a3646af)


  0%|          | 0/1 [00:00<?, ?it/s]

In [4]:
ds['train']['second']

[0.3389665484428406,
 0.2607637047767639,
 0.2737963795661926,
 0.42167502641677856,
 0.6151910424232483,
 0.2477778196334839,
 0.8248767852783203,
 0.03461909294128418,
 0.8353712558746338,
 0.7601965665817261]

In [8]:
def _split_generators(dl_manager):
    downloaded_files = dl_manager.download(url_or_urls={'dataset':'sample_data/sample.csv'})    
    return [datasets.SplitGenerator(  # pylint:disable=g-complex-comprehension
        name=datasets.Split.TRAIN, gen_kwargs={"split": "train", "data_file": downloaded_files["dataset"]}
        )]


In [9]:
dl_manager.download('sample_data/sample.csv')

'/Users/gradcheckout/Lee/tutorial/sample_data/sample.csv'

In [10]:
sgl = _split_generators(dl_manager)

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

In [11]:
sg = sgl[0]

In [18]:
sg.gen_kwargs['data_file']

'/Users/gradcheckout/Lee/tutorial/sample_data/sample.csv'

In [9]:
def _generate_examples(data_file, split):
    with open(data_file) as f:
        reader = csv.DictReader(f, quoting = csv.QUOTE_NONE)
        data_features = {'first':'first',
                        'second':'second'}
        for n, row in enumerate(reader):
            example = {feat: float(row[col]) for feat, col in data_features.items()}
            example["idx"] = n
            yield example["idx"], example


In [15]:
with open(sg.gen_kwargs['data_file']) as f:
    csv_reader = csv.reader(f)
    for row in csv_reader:
        print(row)

['0.6455371975898743', '0.3389665484428406']
['0.6397613286972046', '0.2607637047767639']
['0.33012986183166504', '0.2737963795661926']
['0.8163353800773621', '0.42167502641677856']
['0.6774297952651978', '0.6151910424232483']
['0.6102346777915955', '0.2477778196334839']
['0.8693925142288208', '0.8248767852783203']
['0.9056412577629089', '0.03461909294128418']
['0.9850180149078369', '0.8353712558746338']
['0.9222685098648071', '0.7601965665817261']


In [43]:
ex = _generate_examples(**sg.gen_kwargs)

In [44]:
next(ex)

(0, {'first': 0.6455371975898743, 'second': 0.3389665484428406, 'idx': 0})
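The same row-to-example conversion can be tried without any file on disk. This is a minimal sketch that assumes the CSV has a header row with the column names `first` and `second`, which is what `csv.DictReader` uses as dictionary keys by default. `rows_to_examples` and the toy values are hypothetical.

```python
import csv
import io


def rows_to_examples(lines, columns=("first", "second")):
    """Yield (key, example) pairs, converting the named columns to float."""
    reader = csv.DictReader(lines)  # the first row is read as the header
    for n, row in enumerate(reader):
        example = {col: float(row[col]) for col in columns}
        example["idx"] = n
        yield n, example


# Toy CSV content with a header row (assumed layout).
toy_csv = io.StringIO("first,second\n0.5,0.25\n0.75,1.0\n")
toy_rows = list(rows_to_examples(toy_csv))
print(toy_rows)
```

If a file has no header row, `csv.DictReader` accepts a `fieldnames` argument so the column names can be supplied explicitly.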

![image.png](attachment:image.png)

In [6]:
from datasets import DownloadManager

dl_manager = DownloadManager()

In [7]:
_DOWNLOAD_URL = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

In [8]:
# downloaded_files = dl_manager.download(url_or_urls={'dataset':'sample_data/sample.csv'})    
archive = dl_manager.download(_DOWNLOAD_URL)


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

In [10]:
sg = datasets.SplitGenerator(
    name=datasets.Split.TRAIN, gen_kwargs={"files": dl_manager.iter_archive(archive), "split": "train"}
)


In [13]:
label_mapping = {"pos": 1, "neg": 0}
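A mapping like this is typically applied to the folder name inside each archive path. The sketch below assumes paths of the form `aclImdb/<split>/<pos|neg>/<file>.txt`. That layout is an assumption about the archive, not something shown in this notebook, and `label_from_path` and the example paths are hypothetical.

```python
label_mapping = {"pos": 1, "neg": 0}


def label_from_path(path, split):
    """Return the label for 'aclImdb/<split>/<pos|neg>/<file>', else None."""
    parts = path.split("/")
    # Skip anything that is not a file directly inside a label folder of this split.
    if len(parts) != 4 or parts[0] != "aclImdb" or parts[1] != split:
        return None
    # Folders other than "pos" and "neg" have no entry in the mapping.
    return label_mapping.get(parts[2])


toy_paths = [
    "aclImdb/train/pos/0_9.txt",
    "aclImdb/train/neg/0_3.txt",
    "aclImdb/test/pos/0_10.txt",
    "aclImdb/README",
]
toy_labels = [label_from_path(p, "train") for p in toy_paths]
print(toy_labels)
```

Files that return `None` can simply be skipped when generating examples for a split.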

In [28]:
for path, file in sg.gen_kwargs['files']:
    print(file.read().decode('utf-8'))

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



The title suggests this was supposed to be in the same spirit as The Naked Gun, but the only similarity is Leslie Nielson. Drastically unfunny, even when it is technically bad. It drags like a rainy Boxing day. The sets are straight out of a 60's TV Sci-fi soap, the songs excruciatingly dull (and badly sung) and the the only suspense is how they got the money to make this low-point of Nielson's career.<br /><br />Go and see a good film, like Plan 9, or Robot Monster! If you want to see a good film in the same genre (and budget) try Dark Star. Its beach ball monster shames the rubber beast in this film right back to Toyo studios.
I have seen many movies. More than a lot of people i like to think. And with this vast knowledge of cinema, I try to appreciate every movie i see, and even if it is "bad", and try to find some redeemable quality in it. Upon saying this, I believe, and this is quite a statement if you knew me, that the motion picture "Spaceship" is the worst film ever made in th

In [8]:
downloaded_files['dataset']

'/Users/gradcheckout/Lee/tutorial/sample_data/sample.csv'

In [25]:
def _generate_examples(data_file):
    with open(data_file) as f: #, encoding="utf8"
        reader = csv.DictReader(f, quoting = csv.QUOTE_NONE)
        data_features = {'first':'first',
                        'second':'second'}
        for n, row in enumerate(reader):
            example = {feat: float(row[col]) for feat, col in data_features.items()}
            example["idx"] = n
            yield example["idx"], example

In [29]:
d = _generate_examples(downloaded_files['dataset'])  # path shown in the cell above
next(d)

(0, {'first': 0.6455371975898743, 'second': 0.3389665484428406, 'idx': 0})