### Machine Learning Datasets with Hugging Face 🤗 Datasets
+ Datasets and evaluation metrics for natural language processing
+ Compatible with NumPy, Pandas, PyTorch and TensorFlow
+ 🤗 Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP).
+ 🤗 Datasets currently provides access to ~1,000 datasets and ~30 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics. You can browse the full set of datasets with the live 🤗 Datasets viewer.


#### Features
+ Built-in interoperability with NumPy, Pandas, PyTorch and TensorFlow 2
+ Lightweight and fast with a transparent and pythonic API
+ Thrive on large datasets: 🤗 Datasets frees the user from RAM limitations, since all datasets are memory-mapped on disk by default.
+ Smart caching: never wait for your data to process several times
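
The memory-mapping point above can be illustrated with nothing but the standard library. The sketch below is not how 🤗 Datasets stores data (it uses Arrow files); the file name, record layout and `read_record` helper are assumptions made up for this example. It only shows the general idea: a file on disk is mapped into the address space and individual records are read by slicing, without loading the whole file into RAM.

```python
import mmap
import os
import tempfile

# Write a small file to disk to stand in for a dataset file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "rows.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(f"row-{i:04d}\n".encode("ascii"))  # every record is exactly 9 bytes

RECORD_SIZE = 9


def read_record(mm, index):
    """Return record `index` by slicing the mapping instead of reading the whole file."""
    start = index * RECORD_SIZE
    return mm[start:start + RECORD_SIZE].decode("ascii").rstrip("\n")


with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = read_record(mm, 0)
        last = read_record(mm, 999)

print(first, last)
```

Fixed-size records keep the offset arithmetic trivial here; real columnar formats track offsets differently, but the access pattern is the same.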

#### Installation
+ pip install datasets

In [10]:
# Load packages
import datasets as ds

In [11]:
# List the module's methods and attributes
dir(ds)

['Array2D',
 'Array3D',
 'Array4D',
 'Array5D',
 'ArrowBasedBuilder',
 'ArrowReader',
 'ArrowWriter',
 'BeamBasedBuilder',
 'BuilderConfig',
 'ClassLabel',
 'Dataset',
 'DatasetBuilder',
 'DatasetDict',
 'DatasetInfo',
 'DownloadConfig',
 'DownloadManager',
 'Features',
 'GenerateMode',
 'GeneratorBasedBuilder',
 'IterableDataset',
 'IterableDatasetDict',
 'KeyHasher',
 'Metric',
 'MetricInfo',
 'MockDownloadManager',
 'NamedSplit',
 'NamedSplitAll',
 'NonMutableDict',
 'ReadInstruction',
 'SCRIPTS_VERSION',
 'Sequence',
 'Split',
 'SplitBase',
 'SplitDict',
 'SplitGenerator',
 'SplitInfo',
 'SubSplitInfo',
 'Translation',
 'TranslationVariableLanguages',
 'Value',
 'Version',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'arrow_dataset',
 'arrow_reader',
 'arrow_writer',
 'async_tqdm',
 'builder',
 'cached_path',
 'classproperty',
 'combine',
 'concatenate_datasets',
 'config',
 'copyfunc

In [12]:
# List of Dataset
print(ds.list_datasets())

['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus', 'ag_news', 'ai2_arc', 'air_dialogue', 'ajgt_twitter_ar', 'allegro_reviews', 'allocine', 'alt', 'amazon_polarity', 'amazon_reviews_multi', 'amazon_us_reviews', 'ambig_qa', 'amttl', 'anli', 'app_reviews', 'aqua_rat', 'aquamuse', 'ar_cov19', 'ar_res_reviews', 'ar_sarcasm', 'arabic_billion_words', 'arabic_pos_dialect', 'arabic_speech_corpus', 'arcd', 'arsentd_lev', 'art', 'arxiv_dataset', 'ascent_kb', 'aslg_pc12', 'asnq', 'asset', 'assin', 'assin2', 'atomic', 'autshumato', 'babi_qa', 'banking77', 'bbaw_egyptian', 'bbc_hindi_nli', 'bc2gm_corpus', 'beans', 'best2009', 'bianet', 'bible_para', 'big_patent', 'billsum', 'bing_coronavirus_query_set', 'biomrc', 'blended_skill_talk', 'blimp', 'blog_authorship_corpus', 'bn_hate_speech', 'bookcorpus', 'bookcorpusopen', 'boolq', 'bprec', 'break_data', 'brwac', 'bsd_ja_en', 'bswac', 'c3', 'c4', 'cail2018', 'caner', 'capes', 'catalonia_independence', 'cats_vs_

In [13]:
# Number of Datasets
len(ds.list_datasets())

1308

In [14]:
ds.load_dataset??

Signature:
ds.load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Union[Dict, List] = None,
    split: Union[str, datasets.splits.Split, NoneType] = None,
    cache_dir: Optional[str] = None

In [15]:
# Load a dataset that is already in the local cache
cola_dataset = ds.load_dataset('glue','cola',split='train')

Reusing dataset glue (/home/rooot/.cache/huggingface/datasets/glue/cola/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


In [17]:
# Load a dataset that is not yet downloaded
qqp_dataset = ds.load_dataset('glue','qqp',split='train')

Downloading and preparing dataset glue/qqp (download: 39.76 MiB, generated: 106.55 MiB, post-processed: Unknown size, total: 146.32 MiB) to /home/rooot/.cache/huggingface/datasets/glue/qqp/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /home/rooot/.cache/huggingface/datasets/glue/qqp/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.
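
The two log messages above ("Reusing dataset ..." and "Subsequent calls will reuse this data.") describe a prepare-once, reuse-later pattern. A toy version of that pattern is sketched below using only the standard library. It is not the library's implementation: `cache_key`, `load_or_prepare`, the JSON file format and the temporary cache directory are all hypothetical names and choices made for this example.

```python
import hashlib
import json
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # hypothetical cache location for this sketch


def cache_key(name, config):
    """Derive a stable file name from the request parameters."""
    payload = json.dumps({"name": name, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_or_prepare(name, config, prepare):
    """Return (data, reused): build the data once, then reuse the file on later calls."""
    path = os.path.join(CACHE_DIR, cache_key(name, config) + ".json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f), True
    data = prepare()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data, False


calls = []


def prepare_rows():
    calls.append(1)  # count how often the expensive step actually runs
    return [{"idx": 0}, {"idx": 1}]


first, reused_first = load_or_prepare("glue", "qqp", prepare_rows)
second, reused_second = load_or_prepare("glue", "qqp", prepare_rows)
print(reused_first, reused_second, len(calls))
```

The second call finds the file written by the first one, so the expensive `prepare_rows` step runs only once.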


In [18]:
# Check where the dataset is stored in the cache
qqp_dataset.cache_files

[{'filename': '/home/rooot/.cache/huggingface/datasets/glue/qqp/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/glue-train.arrow'}]

In [19]:
qqp_dataset

Dataset({
    features: ['question1', 'question2', 'label', 'idx'],
    num_rows: 363846
})

In [20]:
# Get Feature Info
qqp_dataset.features

{'question1': Value(dtype='string', id=None),
 'question2': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=2, names=['not_duplicate', 'duplicate'], names_file=None, id=None),
 'idx': Value(dtype='int32', id=None)}
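
The `ClassLabel` feature above stores labels as integers and keeps the class names alongside them. The library's `ClassLabel` objects expose `int2str` and `str2int` for converting between the two; the sketch below mimics that mapping in plain Python so it runs without downloading anything. The helper names `int_to_name` and `name_to_int` are made up for this example, and the class names are copied from the output above.

```python
# Class names copied from the ClassLabel shown in the features output.
LABEL_NAMES = ["not_duplicate", "duplicate"]
NAME_TO_ID = {name: i for i, name in enumerate(LABEL_NAMES)}


def int_to_name(label_id):
    """Map an integer label to its class name (position in LABEL_NAMES)."""
    return LABEL_NAMES[label_id]


def name_to_int(name):
    """Map a class name back to its integer label."""
    return NAME_TO_ID[name]


labels = [0, 1, 0, 1, 0]  # example integer labels
decoded = [int_to_name(label) for label in labels]
print(decoded)

# With the dataset loaded, the equivalent call would be along the lines of:
# qqp_dataset.features['label'].int2str(1)
```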

In [21]:
# Get Number of rows
qqp_dataset.num_rows

363846

In [22]:
# Get number of columns
qqp_dataset.num_columns

4

In [23]:
# Get the shape
qqp_dataset.shape

(363846, 4)

In [24]:
# Type 
type(qqp_dataset)

datasets.arrow_dataset.Dataset

In [25]:
# Get the dataset's metadata
qqp_dataset.info

DatasetInfo(description='GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n', citation='@online{WinNT,\n  author = {Iyer, Shankar and Dandekar, Nikhil and Csernai, Kornel},\n  title = {First Quora Dataset Release: Question Pairs},\n  year = {2017},\n  url = {https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs},\n  urldate = {2019-04-03}\n}\n@inproceedings{wang2019glue,\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n  note={In the Proceedings of ICLR.},\n  year={2019}\n}\n', homepage='https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs', license='', features={'question1': Value(dtype='string', id=None), 'question2': Value(dtype='string', 

In [26]:
# Get Description of Dataset
qqp_dataset.description

'GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n'

In [27]:
# Citation
qqp_dataset.citation

'@online{WinNT,\n  author = {Iyer, Shankar and Dandekar, Nikhil and Csernai, Kornel},\n  title = {First Quora Dataset Release: Question Pairs},\n  year = {2017},\n  url = {https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs},\n  urldate = {2019-04-03}\n}\n@inproceedings{wang2019glue,\n  title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n  author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n  note={In the Proceedings of ICLR.},\n  year={2019}\n}\n'

In [28]:
# License
qqp_dataset.license

''

In [29]:
# Where the original dataset is from
qqp_dataset.homepage

'https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs'

In [43]:
# Get the size of the dataset in bytes
qqp_dataset.size_in_bytes

153422425

In [30]:
# View data types
qqp_dataset.data

MemoryMappedTable
question1: string
question2: string
label: int64
idx: int32

In [31]:
# Column Names
qqp_dataset.column_names

['question1', 'question2', 'label', 'idx']

In [35]:
qqp_dataset.features['question1']

Value(dtype='string', id=None)

In [36]:
# View The Actual Dataset
# Single row
qqp_dataset[0]

{'question1': 'How is the life of a math student? Could you describe your own experiences?',
 'question2': 'Which level of prepration is enough for the exam jlpt5?',
 'label': 0,
 'idx': 0}

In [None]:
# pandas equivalent: df.iloc[0]

In [37]:
# Multiple-row selection with a slice
qqp_dataset[0:10]

{'question1': ['How is the life of a math student? Could you describe your own experiences?',
  'How do I control my horny emotions?',
  'What causes stool color to change to yellow?',
  'What can one do after MBBS?',
  'Where can I find a power outlet for my laptop at Melbourne Airport?',
  "How not to feel guilty since I am Muslim and I'm conscious we won't have sex together?",
  'How is air traffic controlled?',
  'What is the best self help book you have read? Why? How did it change your life?',
  "Can I enter University of Melbourne if I couldn't achieve the guaranteed marks in Trinity College Foundation?",
  'Do you need a passport to go to Jamaica from the United States?'],
 'question2': ['Which level of prepration is enough for the exam jlpt5?',
  'How do you control your horniness?',
  'What can cause stool to come out as little balls?',
  'What do i do after my MBBS ?',
  'Would a second airport in Sydney, Australia be needed if a high-speed rail link was created between Melb
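
Note the shape of the two results above: a single index returns one dict of scalar values, while a slice returns one dict whose values are lists, one list per column. When row-wise records are easier to work with, a column-oriented batch can be pivoted with `zip`. The helpers `batch_to_rows` and `rows_to_batch` below are hypothetical names for this sketch, and the two-row batch reuses values that appear in the outputs of this notebook.

```python
# A two-row batch shaped like the slice output (rows with idx 2 and 3).
batch = {
    "question1": [
        "What causes stool color to change to yellow?",
        "What can one do after MBBS?",
    ],
    "question2": [
        "What can cause stool to come out as little balls?",
        "What do i do after my MBBS ?",
    ],
    "label": [0, 1],
    "idx": [2, 3],
}


def batch_to_rows(batch):
    """Turn {'col': [v0, v1, ...]} into [{'col': v0, ...}, {'col': v1, ...}]."""
    columns = list(batch)
    return [dict(zip(columns, values)) for values in zip(*batch.values())]


def rows_to_batch(rows):
    """Inverse of batch_to_rows; assumes every row has the same keys."""
    return {col: [row[col] for row in rows] for col in rows[0]}


rows = batch_to_rows(batch)
print(rows[1]["question1"], rows[1]["label"])
```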

In [39]:
# Multiple-row selection by index using .select
qqp_dataset.select([1,3,5])['question1']

['How do I control my horny emotions?',
 'What can one do after MBBS?',
 "How not to feel guilty since I am Muslim and I'm conscious we won't have sex together?"]

In [None]:
# Pandas
# df.iloc[[1,3,5]]['question1']

In [41]:
# View it as a dataframe
df = qqp_dataset.to_pandas()

In [42]:
df.head()

| | question1 | question2 | label | idx |
|---|---|---|---|---|
| 0 | How is the life of a math student? Could you d... | Which level of prepration is enough for the ex... | 0 | 0 |
| 1 | How do I control my horny emotions? | How do you control your horniness? | 1 | 1 |
| 2 | What causes stool color to change to yellow? | What can cause stool to come out as little balls? | 0 | 2 |
| 3 | What can one do after MBBS? | What do i do after my MBBS ? | 1 | 3 |
| 4 | Where can I find a power outlet for my laptop ... | Would a second airport in Sydney, Australia be... | 0 | 4 |


In [44]:
# Filtering: keep only rows whose question1 starts with 'H'
start_with_h = qqp_dataset.filter(lambda x: x['question1'].startswith('H'))


In [46]:
start_with_h['question1']

['How is the life of a math student? Could you describe your own experiences?',
 'How do I control my horny emotions?',
 "How not to feel guilty since I am Muslim and I'm conscious we won't have sex together?",
 'How is air traffic controlled?',
 'How is being gay or lesbian less moral than divorce?',
 'How do you thank a Disneyland cast member?',
 'How do I lose weight fast?',
 'How can I make me believe that everything is going good in life and get satisfaction when nothing is going right?',
 'How does an IQ test work and what is determined from an IQ test?',
 'HTTP sites are not working while HTTPS sites are working in Google Chrome? What are some solutions?',
 'How do people join ISIS?',
 'Has Ancient Sumer been scientifically tested?',
 'How do I draw bending moment and shear force diagram for beams?',
 'How should I make myself brave?',
 'How do obtain telegram groups link?',
 'How would you spend the last 10 days of your life?',
 'How many Parrotheads are there in USA and Canada

In [47]:
### Thanks For Watching
### Jesus Saves @JCharisTech
### Jesse E.Agbe(JCharis)