<a href="https://colab.research.google.com/github/ilsilfverskiold/smaller-models-docs/blob/main/computer-vision/cook/image-classification/dataset/Image_dataset_push_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Push Custom Image Dataset to HuggingFace**

---

This notebook assumes your dataset is stored in Google Drive, with the images separated into folders that will be interpreted as labels. For example, with a dataset of traffic images, put the high-traffic images into a folder called high-traffic, the low-traffic images into a folder called low-traffic, and so on. The folder names become the labels when you push the dataset to the Hugging Face Hub.
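
Before loading anything, it can help to confirm that the folder layout matches what you expect. The sketch below is not part of the original notebook: `count_images_per_label` and `IMAGE_EXTENSIONS` are hypothetical names, and the extension list is an assumption you may need to adjust. It counts image files in each top-level folder using only the standard library.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}

def count_images_per_label(root):
    """Return {folder_name: image_file_count} for each top-level subfolder of root."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # loose files at the top level belong to no label folder
        counts[class_dir.name] = sum(
            1
            for path in class_dir.rglob("*")
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `count_images_per_label('/content/drive/MyDrive/your-image-folder')` should return one entry per label folder, which makes empty or misnamed folders easy to spot.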

The notebook also checks for corrupt or non-image files (this step is optional) so that you don't run into issues later when fine-tuning with the dataset. Note that the check deletes any file it cannot verify.

You'll need a Hugging Face account and an access token. Find your token under Settings, and make sure it has write access.

In [None]:
!pip install datasets numpy huggingface_hub --upgrade

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Check for corrupt images in the folder (this is optional)

In [None]:
from PIL import Image
import os

# Remember to change this to the folder where your images are located.
dataset_path = '/content/drive/MyDrive/your-image-folder'

def verify_images(folder_path):
    """Walk folder_path and delete every file that Pillow cannot verify as an image."""
    for subdir, dirs, files in os.walk(folder_path):
        for file in files:
            filepath = os.path.join(subdir, file)
            try:
                # verify() checks file integrity without decoding the full image.
                with Image.open(filepath) as img:
                    img.verify()
            except (IOError, SyntaxError) as e:
                print(f'Corrupt image: {filepath} | Error: {e}')
                # Warning: this permanently removes the file from Drive.
                os.remove(filepath)

verify_images(dataset_path)
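
Because the cell above deletes files, you may want a dry run first. The following is a minimal sketch, not the notebook's own method: `find_suspect_files` and `has_image_extension` are hypothetical helpers, and the extension check is only a cheap stand-in for Pillow's `verify()`. It lists the files that fail a check without removing anything.

```python
import os

# Assumed set of image extensions; adjust to match your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp"}

def has_image_extension(path):
    """Cheap check: True if the file name ends in a known image extension."""
    return os.path.splitext(path)[1].lower() in IMAGE_EXTENSIONS

def find_suspect_files(folder_path, is_valid=has_image_extension):
    """Return a sorted list of files for which is_valid(path) is False. Nothing is deleted."""
    suspects = []
    for subdir, _dirs, files in os.walk(folder_path):
        for name in files:
            path = os.path.join(subdir, name)
            if not is_valid(path):
                suspects.append(path)
    return sorted(suspects)
```

To reuse the stricter Pillow check, you could pass your own `is_valid` function that wraps `Image.open(path).verify()` in a `try`/`except` and returns `False` on failure.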

In [None]:
drive.mount("/content/drive", force_remount=True)

Now we load the checked dataset and inspect its features, including the labels that were inferred from the folder structure.

In [None]:
from datasets import load_dataset

dataset = load_dataset('imagefolder', data_dir=dataset_path)

print(dataset)

In [None]:
print(dataset['train'].features)
print(dataset['train'].features['label'].names)
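
If you plan to fine-tune with this dataset later, you will often want the label names as lookup tables in both directions. The sketch below assumes `names` is the list printed above (`dataset['train'].features['label'].names`); `build_label_maps` is a hypothetical helper, not part of the `datasets` library.

```python
def build_label_maps(names):
    """Build name-to-index and index-to-name dictionaries, keeping the given order."""
    label2id = {name: index for index, name in enumerate(names)}
    id2label = {index: name for index, name in enumerate(names)}
    return label2id, id2label

# Hypothetical label names, matching the traffic example in the introduction.
label2id, id2label = build_label_maps(["high-traffic", "low-traffic"])
```

The indices follow the order of the list you pass in, so using the names list from the loaded dataset keeps the mapping consistent with the integer labels stored in it.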

Split the dataset into a train and validation set. This is also optional.

In [None]:
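from datasets import DatasetDict

# imagefolder puts everything into a single 'train' split by default.
train_dataset = dataset["train"]

# Hold out 10% for validation; stratifying keeps the label proportions similar in both splits.
split_datasets = train_dataset.train_test_split(test_size=0.1, seed=42, stratify_by_column='label')

train_dataset = split_datasets['train']
val_dataset = split_datasets['test']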

dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset
})
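
To confirm that the stratified split kept the class balance, you can compare the label proportions of the two splits. This is a small sketch using only the standard library; `label_fractions` is a hypothetical helper that expects an iterable of integer labels such as `dataset_dict['train']['label']`.

```python
from collections import Counter

def label_fractions(labels):
    """Return {label: share_of_examples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}
```

Running `label_fractions(dataset_dict['train']['label'])` and `label_fractions(dataset_dict['validation']['label'])` should give similar values for each label if the stratification worked as intended.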

Log in to Hugging Face to upload the dataset. Remember, you'll need your token here.

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) 
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

In [None]:
repo_name = "ilsilfverskiold/traffic-camera-norway-images" # remember to change this
dataset_dict.push_to_hub(repo_name)
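
If you run this notebook non-interactively, typing the token into the login prompt is not practical. One option, sketched here under the assumption that you have stored the token in an environment variable named `HF_TOKEN`, is to read it in code. `read_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os

def read_token(env_var="HF_TOKEN"):
    """Return the access token stored in env_var, or raise if it is missing or blank."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token.")
    return token

# Hypothetical usage (requires huggingface_hub and a real token):
# from huggingface_hub import login
# login(token=read_token())
```

Keeping the token out of the notebook itself also avoids committing it by accident when you share the file.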