# How to load datasets from Hugging Face Datasets

- toc: true 
- badges: true
- categories: [Hugging Face, Data Processing]
- permalink: /how-to-load-datasets-from-hugging-face-datasets/


<br><br>
The [Hugging Face Datasets](https://github.com/huggingface/datasets) library makes thousands of datasets available, all of which can be browsed on the [Hub](https://huggingface.co/datasets). Check whether there's a dataset you would like to try out!
<br><br>
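In this tutorial, we will load the [ag_news](https://huggingface.co/datasets/ag_news#data-fields) dataset. It is drawn from AG's corpus of more than 1 million news articles and covers four categories: World, Sports, Business, and Sci/Tech. The version on the Hub contains 120,000 training examples and 7,600 test examples.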


## 1. Install the datasets package
- See the [installation guide](https://huggingface.co/docs/datasets/installation) for more information.

In [1]:
#collapse-output
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)

## 2. Loading the dataset
The ag_news dataset has two data fields:
- text: a string feature containing the news article text.
- label: a classification label with four possible values: World (0), Sports (1), Business (2), Sci/Tech (3).

In [2]:
#collapse-output
from datasets import load_dataset
agnews = load_dataset('ag_news')
agnews

Downloading builder script:   0%|          | 0.00/1.83k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.28k [00:00<?, ?B/s]

Using custom data configuration default


Downloading and preparing dataset ag_news/default (download: 29.88 MiB, generated: 30.23 MiB, post-processed: Unknown size, total: 60.10 MiB) to /root/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548...


Downloading data:   0%|          | 0.00/11.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/751k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/120000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/7600 [00:00<?, ? examples/s]

Dataset ag_news downloaded and prepared to /root/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

We can check the class labels by inspecting the features of a split:

In [4]:
agnews['train'].features

{'label': ClassLabel(num_classes=4, names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None),
 'text': Value(dtype='string', id=None)}

## 3. (Optional) Convert a Dataset object to a Pandas DataFrame 
- When we're dealing with data, it's often more convenient to use a DataFrame.

In [3]:
# Switch the output format so that indexing returns pandas DataFrames
agnews.set_format(type="pandas")
# Slice the whole training split to get it as a single DataFrame
train_df = agnews['train'][:]
train_df.head()

| | text | label |
|---|---|---|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 |
| 4 | Oil prices soar to all-time record, posing new... | 2 |


<br><br>
That's it! Now you have the dataset at your disposal.
Check out the official [documentation](https://huggingface.co/docs/datasets/v1.2.1/loading_datasets.html) for more information!