### Getting to know datasets
There are two types of dataset object: a regular `Dataset` and an `IterableDataset`.

A `Dataset` provides fast random access to its rows and is memory-mapped, so loading even a large dataset uses only a small amount of device memory. If a dataset is too big to fit on disk or in memory, an `IterableDataset` lets us access and use it without waiting for the whole download to complete.

In [3]:
# !pip install datasets

#### dataset

In [4]:
# Dataset:

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")

In [5]:
# indexing:

dataset[0]

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

In [6]:
dataset[-1]["text"]

'things really get weird , though not particularly scary : the movie is all portent and no content .'

In [7]:
dataset["label"][:10]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

#### indexing

In [8]:
# indexing by column vs row

import time

start_time = time.time()
text = dataset[0]["text"]
end_time = time.time()

print(f"Row first and then Column: {end_time - start_time}")

Row first and then Column: 0.0010323524475097656


In [9]:
import time

start_time = time.time()
text = dataset["text"][0]
end_time = time.time()

print(f"Column first and then Row: {end_time - start_time}")

Column first and then Row: 0.01571941375732422


#### slicing

In [10]:
dataset[:3]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic'],
 'label': [1, 1, 1]}

In [14]:
dataset[:10:2]  # first 10 rows with a stride of 2

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'effective but too-tepid biopic',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
  'offers that rare combination of entertainment and education .',
  "steers turns in a snappy screenplay that curls at the edges ; it's so clever you want to hate it . but he somehow pulls it off ."],
 'label': [1, 1, 1, 1, 1]}

#### iterable dataset

In [15]:
from datasets import load_dataset

iterable_dataset = load_dataset(
    "food101", split="train", streaming=True
)

# Print the first 10 streamed examples, then stop.
for count, example in enumerate(iterable_dataset, start=1):
    if count > 10:
        break
    print(example)
    print(f"Dataset: {count}\n")

{'image': <PIL.Image.Image image mode=RGB size=384x512 at 0x799E8D0A3D60>, 'label': 6}
Dataset: 1

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x799E8D0A2D40>, 'label': 6}
Dataset: 2

{'image': <PIL.Image.Image image mode=RGB size=512x383 at 0x799E8D0A2E60>, 'label': 6}
Dataset: 3

{'image': <PIL.Image.Image image mode=RGB size=512x348 at 0x799E8D0A2D70>, 'label': 6}
Dataset: 4

{'image': <PIL.Image.Image image mode=RGB size=512x512 at 0x799E8D0A3E80>, 'label': 6}
Dataset: 5

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x799E8D0A2E60>, 'label': 6}
Dataset: 6

{'image': <PIL.Image.Image image mode=RGB size=512x512 at 0x799E8D0A23E0>, 'label': 6}
Dataset: 7

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x799E8D0A2FE0>, 'label': 6}
Dataset: 8

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x799E8D0A2020>, 'label': 6}
Dataset: 9

{'image': <PIL.Image.Image image mode

In [16]:
# iterable from existing dataset
from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
iterable_dataset = dataset.to_iterable_dataset()

In [17]:
iterable_dataset

IterableDataset({
    features: ['text', 'label'],
    num_shards: 1
})

In [19]:
next(iter(iterable_dataset))

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

In [20]:
list(iterable_dataset.take(10))

[{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'label': 1},
 {'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'label': 1},
 {'text': 'effective but too-tepid biopic', 'label': 1},
 {'text': 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  'label': 1},
 {'text': "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .",
  'label': 1},
 {'text': 'the film provides some great insight into the neurotic mindset of all comics -- even those who have reached the absolute top of the game .',
  'label': 1},
 {'text': 'offers that rare combination of entertainm