<a href="https://colab.research.google.com/github/jonbaer/googlecolab/blob/master/huggingface_datasets_converter_kaggle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Hugging Face Datasets Converter (Kaggle)

This notebook allows you to convert a Kaggle dataset to a Hugging Face dataset. 

Follow the 4 simple steps below to take an existing dataset on Kaggle and convert it to a Hugging Face dataset, which can then be loaded with the `datasets` library.

# Step 1 - Setup

Run the cell below to install required dependencies.

In [None]:
%%capture
! git clone https://github.com/nateraw/huggingface-datasets-converter.git
%cd /content/huggingface-datasets-converter
! pip install -r requirements.txt
! git config --global credential.helper store

# Step 2 - Authenticate with Kaggle

Navigate to https://www.kaggle.com. Then go to the [Account tab of your user profile](https://www.kaggle.com/me/account) and select Create API Token. This will trigger the download of `kaggle.json`, a file containing your API credentials.

Then run the cell below to upload `kaggle.json` to your Colab runtime.

⚠️ It should be named exactly `kaggle.json`.

In [None]:
from google.colab import files

uploaded = files.upload()

for name in uploaded.keys():
  print(f'User uploaded file "{name}" with length {len(uploaded[name])} bytes')

# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Alternatively, you can upload `kaggle.json` to your working directory manually (probably the "content" folder) or copy it from your Google Drive account ([see this](https://colab.research.google.com/notebooks/io.ipynb) for examples). Then run:

```
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
```
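
If the converter later complains about authentication, it can help to check the credentials file first. The sketch below is a minimal check, not part of the converter: `check_kaggle_credentials` is a hypothetical helper, and it assumes the token file is a JSON object with `username` and `key` fields.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path=Path.home() / ".kaggle" / "kaggle.json"):
    """Return a list of problems found with the credentials file (empty if none)."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []
    # chmod 600 means readable and writable by the owner only,
    # so any group/other permission bit is flagged.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")

    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    if not isinstance(creds, dict):
        return problems + ["file does not contain a JSON object"]

    # Assumption: the token file holds "username" and "key" fields.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing field: {field}")
    return problems
```

Calling `check_kaggle_credentials()` with no arguments inspects `~/.kaggle/kaggle.json`; an empty list means nothing obviously wrong was found.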

# Step 3 - Authenticate with Hugging Face 🤗

You'll need to authenticate with your Hugging Face account, so make sure to [sign up](https://huggingface.co/join) if you haven't already. 

Then run the cell below and provide a token that has ***write access***.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

# Step 4 - Convert From Kaggle

Below, enter:

- The Kaggle ID of the dataset you'd like to convert (e.g. `kaggleuser/dataset-name`).
- The repo ID of the Hugging Face dataset repo you'd like to upload to (e.g. `huggingface-user/dataset-name`).

💡 You can find the Kaggle ID in the dataset URL.

For example: if you want to convert [this dataset](https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries), the ID is `ruchi798/data-science-job-salaries`.
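
As a rough illustration of where the ID sits in the URL, here is a small sketch that extracts it with the standard library. `kaggle_id_from_url` is a hypothetical helper, not something the converter provides, and it assumes the URL path has the shape `/datasets/<owner>/<dataset-name>`.

```python
from urllib.parse import urlparse


def kaggle_id_from_url(url):
    """Extract 'owner/dataset-name' from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<owner>/<dataset-name>[/...]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


url = "https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries"
print(kaggle_id_from_url(url))  # ruchi798/data-science-job-salaries
```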

> 📝 **Note** - The converter works best on csv and json datasets, so we suggest you use one of those :)

In [None]:
from huggingface_datasets_converter.convert import notebook_converter_kaggle

notebook_converter_kaggle()

# Loading Converted Datasets

Now let's see how to load datasets we've converted!

### CSV

For reference on loading CSV files using `datasets`, check [the docs](https://huggingface.co/docs/datasets/loading#csv).


#### Example 1

If your dataset contains just one CSV file and no additional json/zip/csv files, you can load it by providing the repo ID to `datasets.load_dataset`.

- [Example Kaggle Dataset](https://www.kaggle.com/datasets/neuromusic/avocado-prices)
- [Example Hugging Face Repo](https://huggingface.co/datasets/nateraw/avocado-prices)

In [None]:
from datasets import load_dataset

ds = load_dataset('nateraw/avocado-prices')
ds

In [None]:
# get an example
ds['train'][0]

#### Example 2

- [Example Kaggle Dataset](https://www.kaggle.com/datasets/unsdsn/world-happiness)
- [Example Hugging Face Repo](https://huggingface.co/datasets/nateraw/world-happiness)

This example has [multiple CSV files](https://huggingface.co/datasets/nateraw/world-happiness/tree/main) in it. In cases like these, we can refer directly to the file we need like this:

In [None]:
ds = load_dataset('nateraw/world-happiness', data_files='2019.csv')
ds['train'][0]
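
When a repo holds several files, it may be easier to pick the name for `data_files` from a listing than to type it by hand. The sketch below filters a list of filenames down to the CSV files; `csv_data_files` is a hypothetical helper, and the listing is a made-up stand-in for what `huggingface_hub.list_repo_files(repo_id, repo_type="dataset")` could return.

```python
from pathlib import PurePosixPath


def csv_data_files(filenames):
    """Map each CSV file's stem to its path, e.g. {'2019': '2019.csv'}."""
    return {
        PurePosixPath(name).stem: name
        for name in sorted(filenames)
        if name.lower().endswith(".csv")
    }


# Hypothetical listing; the real repo contents may differ.
files = [".gitattributes", "README.md", "2018.csv", "2019.csv"]
print(csv_data_files(files))  # {'2018': '2018.csv', '2019': '2019.csv'}
```

A lookup such as `csv_data_files(files)['2019']` then gives the string to pass as `data_files`. Note that two files with the same stem in different folders would collide in this mapping.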

### JSON

For JSON, the files are expected to be in JSONL (JSON Lines) format, with one JSON value per line. Even correctly formatted JSONL files can sometimes fail to load, so here we'll show how to read them directly with Python's built-in `json` package.

For reference on loading JSON files using `datasets`, check [the docs](https://huggingface.co/docs/datasets/loading#json).

#### Example 1

- [Example Kaggle Dataset](https://www.kaggle.com/datasets/roamresearch/prescriptionbasedprediction)
- [Example Hugging Face Repo](https://huggingface.co/datasets/nateraw/prescriptionbasedprediction)

In [None]:
import json
from huggingface_hub import hf_hub_download
from pathlib import Path

json_file = hf_hub_download('nateraw/prescriptionbasedprediction', 'roam_prescription_based_prediction.jsonl', repo_type='dataset')

data = [json.loads(x) for x in Path(json_file).read_text().splitlines()]
len(data)
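
The one-liner above assumes every line is valid JSON. If a file has blank or malformed lines, a more tolerant reader can help locate them. This is a minimal sketch under that assumption; `parse_jsonl` is a hypothetical helper, shown here on an inline sample rather than the downloaded file.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text, skipping blank lines and collecting bad ones."""
    records, errors = [], []
    # Split on "\n" only: str.splitlines() also breaks on characters such as
    # U+2028, which may legally appear inside JSON strings.
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            errors.append((lineno, str(exc)))
    return records, errors


sample = '{"a": 1}\n\n{"a": 2}\n{broken\n'
records, errors = parse_jsonl(sample)
print(records)                           # [{'a': 1}, {'a': 2}]
print([lineno for lineno, _ in errors])  # [4]
```

With the file from this example, the call would be `parse_jsonl(Path(json_file).read_text())`.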

#### Example 2

- [Example Kaggle Dataset](https://kaggle.com/datasets/succinctlyai/midjourney-texttoimage)
- [Example Hugging Face Repo](https://huggingface.co/datasets/nateraw/midjourney-texttoimage/)

In [None]:
import json
from pathlib import Path

from huggingface_hub import hf_hub_download

json_file = hf_hub_download('nateraw/midjourney-texttoimage', 'general-01_2022_06_20.json', repo_type='dataset')
data = json.loads(Path(json_file).read_text())

print(f"Keys: {data.keys()}\n\nNumber of messages: {len(data['messages'])}")
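
Once a nested JSON file like this is loaded, a quick way to get a feel for its records is to count which keys appear in them. The sketch below is illustrative only: `summarize_keys` is a hypothetical helper, and the sample list is a made-up stand-in for `data['messages']`, whose real field names may differ.

```python
from collections import Counter


def summarize_keys(records):
    """Count how often each top-level key appears across a list of dicts."""
    counts = Counter()
    for record in records:
        if isinstance(record, dict):  # ignore non-dict entries
            counts.update(record.keys())
    return counts


# Hypothetical stand-in for data['messages'].
messages = [{"id": 1, "content": "a"}, {"id": 2}]
print(summarize_keys(messages).most_common())  # [('id', 2), ('content', 1)]
```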