## Requirements

In [2]:
from datasets import Dataset
import pandas as pd
from pathlib import Path

## Create dataset

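Create a sorted list of the names of all files to store in the dataset.  Each of those files contains a text fragment.  Sorting makes the order of the examples reproducible, since `glob` returns paths in arbitrary order.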

In [19]:
filenames = sorted([str(path) for path in Path('tmp_seq/').glob('*.txt')])

The dataset can now be constructed from these file names.

In [20]:
dataset = Dataset.from_text(filenames)

Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 101778.79files/s]
Generating train split: 500 examples [00:00, 4086.43 examples/s]


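By default, `from_text` creates one example per line of text.  Each file here holds a single line, so the dataset consists of as many examples as there are files, and the content of the files is stored in the `text` column.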

In [24]:
len(dataset)

500

In [21]:
dataset[0]['text']

'ATGACCATGAAACAATTAAACGGTTTCAAACCGGGGCATACGAAAGCGGCTTTTAACTGTACGGCTGTCGACAGTATTGTTCGGCTACCTCGCGCTTGAGCCAAAAGTGCATTCACGCTGGGCTTGGCATTTTTTGTAAAGCCGTTGCGGCTCTGGCGCTCCGTTCCGGACTGAGTACACTACTTTCTCAATCTACTAATATGGCACCTACTTCGCGCGGCGATCAGACAGTCCTATTTGCGATCAGAACTACGCCTCAGGCACCGTGGCAAAGGTCACTGGCTCGGCTTACTATCGTGCTACGAGAAGATGGCCACTACCAGGAGAGAGTCATCCTAAATTAGACGCTTCGCGGGGCTATGTCACGGGCTAAAGACTTCGCCACTCCTGTACCCAATTTTCTAGCAGGATTTAAACCGCCCGTATGCGACCTAAAAGAGATGACTTGCGGCCCGAGCACAAGGTATACTCACAATCCCATTCCCACCCTCCGCTCCAGCTCGTACAAGATGGCTAGATGGA'

Verify that the content of the example is that of the corresponding file.

In [23]:
!cat {filenames[0]}

ATGACCATGAAACAATTAAACGGTTTCAAACCGGGGCATACGAAAGCGGCTTTTAACTGTACGGCTGTCGACAGTATTGTTCGGCTACCTCGCGCTTGAGCCAAAAGTGCATTCACGCTGGGCTTGGCATTTTTTGTAAAGCCGTTGCGGCTCTGGCGCTCCGTTCCGGACTGAGTACACTACTTTCTCAATCTACTAATATGGCACCTACTTCGCGCGGCGATCAGACAGTCCTATTTGCGATCAGAACTACGCCTCAGGCACCGTGGCAAAGGTCACTGGCTCGGCTTACTATCGTGCTACGAGAAGATGGCCACTACCAGGAGAGAGTCATCCTAAATTAGACGCTTCGCGGGGCTATGTCACGGGCTAAAGACTTCGCCACTCCTGTACCCAATTTTCTAGCAGGATTTAAACCGCCCGTATGCGACCTAAAAGAGATGACTTGCGGCCCGAGCACAAGGTATACTCACAATCCCATTCCCACCCTCCGCTCCAGCTCGTACAAGATGGCTAGATGGA
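
Checking a single file by eye does not scale to 500 files.  The following is a minimal sketch of a programmatic check, assuming that each file holds exactly one line and that the examples are in the same order as `filenames`.  The helper name `find_mismatches` is hypothetical; it is not part of `datasets`.

```python
from pathlib import Path


def find_mismatches(texts, filenames):
    """Return the file names whose content differs from the matching text.

    Assumes texts[i] was read from filenames[i] and that each file holds
    a single line; a trailing newline in the file is ignored.
    """
    if len(texts) != len(filenames):
        raise ValueError('texts and filenames differ in length')
    return [
        filename
        for text, filename in zip(texts, filenames)
        if Path(filename).read_text().rstrip('\n') != text
    ]
```

With the objects defined above, `find_mismatches(dataset['text'], filenames)` would be expected to return an empty list.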


An example can have multiple features.  The `features` attribute is a dictionary that maps each feature name to its type; so far, `text` is the only feature.

In [25]:
dataset.features

{'text': Value(dtype='string', id=None)}

## Adding features

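Features can be added to the dataset as extra columns.  Here, the features are stored in a CSV file with a `filename` column that identifies the file each row belongs to.  The rows are sorted by `filename` so that their order matches that of the sorted `filenames` list used to create the dataset.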

In [27]:
labels = pd.read_csv('tmp_seq_labels.csv')

In [35]:
class_0 = labels[['filename', 'class_0']].sort_values('filename')['class_0'].to_list()

In [37]:
class_1 = labels[['filename', 'class_1']].sort_values('filename')['class_1'].to_list()

In [38]:
dataset = dataset.add_column('class_0', class_0).add_column('class_1', class_1)
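
Sorting both sides works only as long as the CSV file and the directory list exactly the same files.  The following is a minimal sketch of a more defensive alignment that looks up each label by file name instead, written with the standard library only.  The function name `align_labels`, the variable `example_filenames` and the in-memory CSV data are hypothetical.

```python
import csv
import io


def align_labels(filenames, csv_file, column):
    """Return the values of `column` in the order given by `filenames`.

    Raises KeyError if a file name has no row in the CSV data, so a
    mismatch between files and labels is detected instead of silently
    shifting the labels.
    """
    label_by_file = {
        row['filename']: row[column] for row in csv.DictReader(csv_file)
    }
    return [label_by_file[filename] for filename in filenames]


# Hypothetical CSV content; the rows are deliberately out of order.
csv_text = """filename,class_0,class_1
tmp_seq/seq000002.txt,1,J
tmp_seq/seq000001.txt,0,H
"""
example_filenames = ['tmp_seq/seq000001.txt', 'tmp_seq/seq000002.txt']
print(align_labels(example_filenames, io.StringIO(csv_text), 'class_1'))
# prints ['H', 'J']
```

Note that `csv.DictReader` returns every value as a string, so a numeric column such as `class_0` would still need to be converted with `int` before it is added to the dataset.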

In [39]:
dataset.features

{'text': Value(dtype='string', id=None),
 'class_0': Value(dtype='int64', id=None),
 'class_1': Value(dtype='string', id=None)}

An example now contains both the data and the features.

In [40]:
dataset[0]

{'text': 'ATGACCATGAAACAATTAAACGGTTTCAAACCGGGGCATACGAAAGCGGCTTTTAACTGTACGGCTGTCGACAGTATTGTTCGGCTACCTCGCGCTTGAGCCAAAAGTGCATTCACGCTGGGCTTGGCATTTTTTGTAAAGCCGTTGCGGCTCTGGCGCTCCGTTCCGGACTGAGTACACTACTTTCTCAATCTACTAATATGGCACCTACTTCGCGCGGCGATCAGACAGTCCTATTTGCGATCAGAACTACGCCTCAGGCACCGTGGCAAAGGTCACTGGCTCGGCTTACTATCGTGCTACGAGAAGATGGCCACTACCAGGAGAGAGTCATCCTAAATTAGACGCTTCGCGGGGCTATGTCACGGGCTAAAGACTTCGCCACTCCTGTACCCAATTTTCTAGCAGGATTTAAACCGCCCGTATGCGACCTAAAAGAGATGACTTGCGGCCCGAGCACAAGGTATACTCACAATCCCATTCCCACCCTCCGCTCCAGCTCGTACAAGATGGCTAGATGGA',
 'class_0': 0,
 'class_1': 'H'}

Verify that the features are correctly assigned: the first row of the sorted labels should match `dataset[0]`.

In [41]:
labels.sort_values('filename').head()

| Unnamed: 0 | filename | class_0 | class_1 |
|---|---|---|---|
| 305 | tmp_seq/seq000001.txt | 0 | H |
| 333 | tmp_seq/seq000002.txt | 1 | J |
| 412 | tmp_seq/seq000003.txt | 0 | H |
| 59 | tmp_seq/seq000004.txt | 0 | C |
| 481 | tmp_seq/seq000005.txt | 0 | C |


A dataset can be saved to disk in Arrow format using `save_to_disk`.  The result is a directory that contains an Arrow file with the data and a number of files with metadata.

In [42]:
dataset.save_to_disk('tmp_seq.arrow')

Saving the dataset (1/1 shards): 100%|██████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 32461.64 examples/s]
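
To see what was written, the directory can be listed.  The following is a minimal sketch using only `pathlib`; the helper name `list_files` is hypothetical, and the exact file names in the directory depend on the version of `datasets`.

```python
from pathlib import Path


def list_files(directory):
    """Return a sorted list of (relative path, size in bytes) tuples,
    one for each file below `directory`."""
    root = Path(directory)
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob('*')
        if path.is_file()
    )
```

Calling `list_files('tmp_seq.arrow')` after the cell above should show the Arrow file with the data next to the smaller metadata files.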


## Loading Arrow files

The data can be reloaded using the `Dataset.load_from_disk` method.

In [2]:
d = Dataset.load_from_disk('tmp_seq.arrow')

Verify that the examples contain the expected data and metadata.

In [3]:
d.features

{'text': Value(dtype='string', id=None),
 'class_0': Value(dtype='int64', id=None),
 'class_1': Value(dtype='string', id=None)}

In [4]:
len(d)

500

In [5]:
d[0]

{'text': 'ATGACCATGAAACAATTAAACGGTTTCAAACCGGGGCATACGAAAGCGGCTTTTAACTGTACGGCTGTCGACAGTATTGTTCGGCTACCTCGCGCTTGAGCCAAAAGTGCATTCACGCTGGGCTTGGCATTTTTTGTAAAGCCGTTGCGGCTCTGGCGCTCCGTTCCGGACTGAGTACACTACTTTCTCAATCTACTAATATGGCACCTACTTCGCGCGGCGATCAGACAGTCCTATTTGCGATCAGAACTACGCCTCAGGCACCGTGGCAAAGGTCACTGGCTCGGCTTACTATCGTGCTACGAGAAGATGGCCACTACCAGGAGAGAGTCATCCTAAATTAGACGCTTCGCGGGGCTATGTCACGGGCTAAAGACTTCGCCACTCCTGTACCCAATTTTCTAGCAGGATTTAAACCGCCCGTATGCGACCTAAAAGAGATGACTTGCGGCCCGAGCACAAGGTATACTCACAATCCCATTCCCACCCTCCGCTCCAGCTCGTACAAGATGGCTAGATGGA',
 'class_0': 0,
 'class_1': 'H'}

In [6]:
d[-1]

{'text': 'CGCCTGTTACAGGGTGACTTCACCGCAATTTTAGTGCGACTCCCGACGCTCCCACGAGCCAAAAACCGGGTAGCGAAGTGATTAGTAAACGTGTACCGATTGTCAGCAGCATAATGGGTTTAGTTGATCTGAGGTTGGTAGCCATGCGATCGGTCACAACTAGCCTTATAGTGGAATCGGATGCGAGCAGAACGGAAGGAGATGGTGATCCGCCCGAGGCCGCCAACATAAATTTACTACAATACCTTGATTAGACATTGCGTCACGGCTGCCCGGCATTGGACTGAACGTCGGTACGATCTATTTCAAATTACCTCAGCCGATGGAGTAGGCTCGGCATCCCAACGCAAGCAATCGATCGACCACTCGAGTGCAGTTAGAGCACCCAGTTCGCGAGGCCTTTGATCACACCTTGTTATATAGCATCTCAATGTATGTGCCTTCCTCGCGGTGAGATTTCGGACAATAACGCTTGTTGAGTTTTACATAGGACCCGTGGTCCTTGATAACGTTCGTGGGACGTCGACCCTAGGTAAGTCTAAGAAACATTGGACAATGATGTCCTAATGATTACACACTTTTGATCAGAGTGGTCTTAGGCCGTATGCGATGCAGAGCTACAGGTCCTATCTAGAGGGCCACCTTGCCCCCAGCTGCTCTCCTTGCTGTCCACAAAATCTTCGCTTTACGGATTATGCAGGAGCTCATCTGCATCAGAGCAGTGTGACTCATTTCATCGAGGGTCCAAGGTCTTTTCCATGGGGTAAGGTGATCCGCACGGTAAAACGCAATAATAATTGATTGGGTAACGCTTGCAAACGCATTTGCCCGCGGTGACTACGTATT',
 'class_0': 1,
 'class_1': 'F'}