# Pushing Datasets to Hugging Face Hub

Learn how to share your datasets with the community by uploading them to Hugging Face.

**Why push to Hugging Face?**
- Easy sharing and collaboration
- Version control for datasets
- Free hosting
- Direct integration with training scripts


In [None]:
# !pip install datasets huggingface_hub python-dotenv


In [None]:
import os
from datasets import Dataset, DatasetDict
from huggingface_hub import login
from dotenv import load_dotenv

load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN")

## Step 1: Authenticate with Hugging Face

First, you need to log in. You'll need a Hugging Face account and an access token with write permission, since pushing a dataset creates and modifies a repository.

**Get your token**: https://huggingface.co/settings/tokens


In [None]:
# Log in to Hugging Face with the token loaded from .env
# If HF_TOKEN is None, login() falls back to prompting you for a token
login(token=HF_TOKEN)

# Alternative: set the token as an environment variable in your shell
# export HF_TOKEN="your_token_here"

print("⚠️  Make sure you're logged in before proceeding!")
print("If no token was found, run login() and enter your token when prompted")
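
A minimal sketch of checking for the token up front, so that a missing `.env` entry fails with a clear message rather than later during the push. The helper name `require_token` is hypothetical and not part of `huggingface_hub`; it only reads the environment.

```python
import os


def require_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise if it is missing or blank."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set. Add it to your .env file or export it in your shell."
        )
    return token
```

With this in place, `login(token=require_token())` would stop early when no token is available instead of opening the interactive prompt.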


## Step 2: Create a Sample Dataset

Let's create a simple instruction-following dataset as an example:


In [None]:
print("=" * 70)
print("Creating Sample Dataset")
print("=" * 70)
print()

# Sample data - instruction-response pairs
data = {
    "instruction": [
        "What is the capital of France?",
        "Write a haiku about coding",
        "Explain photosynthesis in simple terms",
        "What is 25 + 37?",
        "Name three programming languages"
    ],
    "input": [
        "",
        "",
        "",
        "",
        ""
    ],
    "output": [
        "The capital of France is Paris.",
        "Code flows like water\nBugs hide in silent shadows\nDebug brings the light",
        "Photosynthesis is how plants make food using sunlight, water, and carbon dioxide.",
        "25 + 37 = 62",
        "Three programming languages are Python, JavaScript, and Java."
    ]
}

# Create Dataset object
dataset = Dataset.from_dict(data)

print(f"Created dataset with {len(dataset)} examples")
print()
print("First example:")
print(dataset[0])
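
`Dataset.from_dict` expects one list per column, all of the same length. If your examples start out as a list of records instead, a small helper can pivot them into that shape first. This is only a sketch: `records_to_columns` is a hypothetical helper, not part of `datasets`, and it assumes a missing field should become an empty string.

```python
def records_to_columns(records, fields=("instruction", "input", "output")):
    """Pivot a list of row dicts into a dict of equal-length column lists."""
    columns = {field: [] for field in fields}
    for row in records:
        for field in fields:
            # Missing fields become empty strings, like the empty "input" column above
            columns[field].append(row.get(field, ""))
    return columns


records = [
    {"instruction": "What is the capital of France?", "output": "The capital of France is Paris."},
    {"instruction": "What is 25 + 37?", "output": "25 + 37 = 62"},
]
data = records_to_columns(records)
```

The resulting `data` dict has the same layout as the one built by hand above, so it can be passed to `Dataset.from_dict(data)`.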


## Step 3: Create Train/Validation Split (Optional)

It's good practice to include train and validation splits in your dataset:


In [None]:
print("=" * 70)
print("Creating Train/Validation Split")
print("=" * 70)
print()

# Split dataset (80% train, 20% validation)
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)

# Create DatasetDict with proper naming
dataset_dict = DatasetDict({
    "train": split_dataset["train"],
    "validation": split_dataset["test"]
})

print(dataset_dict)
print()
print(f"Training examples: {len(dataset_dict['train'])}")
print(f"Validation examples: {len(dataset_dict['validation'])}")
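
If you also want a held-out test split, one possible approach is to call `train_test_split` twice. The sketch below assumes only that the object passed in has a `train_test_split(test_size=..., seed=...)` method that returns `"train"` and `"test"` parts, as `datasets.Dataset` does; `three_way_split` and its default fractions are hypothetical choices.

```python
def three_way_split(dataset, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Split into train/validation/test by calling train_test_split twice."""
    holdout = val_fraction + test_fraction
    # First cut: separate the held-out portion from the training data
    first = dataset.train_test_split(test_size=holdout, seed=seed)
    # Second cut: divide the held-out portion between validation and test
    second = first["test"].train_test_split(
        test_size=test_fraction / holdout, seed=seed
    )
    return {
        "train": first["train"],
        "validation": second["train"],
        "test": second["test"],
    }
```

The returned dict can be wrapped as `DatasetDict(three_way_split(dataset))`. Note that the dataset needs enough examples for every split to be non-empty, so the five-example sample above is too small for this.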


## Step 4: Push to Hugging Face Hub

Now we'll push the dataset to Hugging Face. Choose a unique name for your dataset!


In [None]:
print("=" * 70)
print("Pushing Dataset to Hugging Face Hub")
print("=" * 70)
print()

# Change this to your Hugging Face username!
username = "your-username"

# Full repository ID in the format "username/dataset-name"
dataset_name = f"{username}/my-instruction-dataset"

dataset_dict.push_to_hub(dataset_name, private=False)

print("✓ Your dataset is now available at:")
print(f"  https://huggingface.co/datasets/{dataset_name}")
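
Because the repository ID must follow the `username/dataset-name` format, it can help to build it through a small function that catches obvious mistakes before anything is uploaded. This is a loose sanity check under my own assumptions, not the Hub's full naming rules, and `make_repo_id` is a hypothetical helper.

```python
def make_repo_id(username: str, name: str) -> str:
    """Join a username and dataset name into a "username/dataset-name" repo ID."""
    username, name = username.strip(), name.strip()
    if not username or not name:
        raise ValueError("Both username and dataset name must be non-empty")
    if "/" in username or "/" in name:
        raise ValueError("Neither part may contain '/'")
    if username == "your-username":
        # Catch the tutorial placeholder before it reaches the Hub
        raise ValueError("Replace the placeholder with your own username")
    return f"{username}/{name}"
```

For example, `make_repo_id("alice", "my-instruction-dataset")` returns `"alice/my-instruction-dataset"`, while leaving the placeholder username in place raises a `ValueError`.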


## Step 5: Alternative - Push from Local Files

If you have data in JSON/JSONL/CSV files, you can load and push them directly:


In [None]:
from datasets import load_dataset

print("=" * 70)
print("Loading from Local Files")
print("=" * 70)
print()

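# Example: load from a JSONL file (the file must already exist)
# load_dataset returns a DatasetDict with a single "train" split here
dataset = load_dataset("json", data_files="./example_dataset.jsonl")

# Push the loaded dataset to its own repository
dataset.push_to_hub(f"{username}/jsonl-dataset")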
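
The cell above assumes that `./example_dataset.jsonl` already exists. If you don't have one yet, a minimal sketch for creating it with the standard library is shown below; the two rows are illustrative sample data reusing the fields from Step 2.

```python
import json
from pathlib import Path

rows = [
    {"instruction": "What is the capital of France?", "input": "", "output": "The capital of France is Paris."},
    {"instruction": "What is 25 + 37?", "input": "", "output": "25 + 37 = 62"},
]

path = Path("example_dataset.jsonl")
with path.open("w", encoding="utf-8") as f:
    for row in rows:
        # JSONL means one JSON object per line; ensure_ascii=False keeps non-ASCII text readable
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Run this before the loading cell so that `load_dataset("json", ...)` has a file to read.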


## Step 6: Make It Private (Optional)

You can make your dataset private if you don't want it publicly visible, by passing `private=True` to `push_to_hub`.
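
A minimal sketch of a private push, wrapped in a function so the intent is explicit. `push_private` is a hypothetical helper; it assumes only that the object passed in exposes `push_to_hub(repo_id, private=...)`, as `Dataset` and `DatasetDict` do.

```python
def push_private(dataset_dict, repo_id: str) -> str:
    """Push to the Hub as a private repository and return its URL."""
    # private=True keeps the repository hidden from the public
    dataset_dict.push_to_hub(repo_id, private=True)
    return f"https://huggingface.co/datasets/{repo_id}"
```

You would call it as `push_private(dataset_dict, dataset_name)`. In recent versions of `datasets`, the `private` flag appears to apply only when the repository is first created; for an existing repository, visibility can be changed on its settings page. Loading a private dataset later requires being logged in.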


## Step 7: Add a Dataset Card (README)

After pushing, you should add a README to document your dataset. Here's a template:


In [None]:
readme_template = """
---
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- n<1K
---

# Dataset Name

## Dataset Description

Brief description of your dataset.

## Dataset Structure

### Fields

- `instruction`: The task or question
- `input`: Optional context or input
- `output`: The expected response

### Data Splits

- Training: X examples
- Validation: Y examples

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("your-username/dataset-name")
```

## Citation

If you use this dataset, please cite:
[Your citation here]
"""

print("=" * 70)
print("Dataset Card Template")
print("=" * 70)
print(readme_template)
print()
print("📝 Add this to your dataset's README.md on Hugging Face")
print("   Go to: https://huggingface.co/datasets/your-username/dataset-name")
print("   Click 'Create README.md' and paste the template")
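
Instead of pasting the template into the web editor, the README can also be uploaded from code. The sketch below assumes `api` is a `huggingface_hub.HfApi` instance and uses its `upload_file` method; `upload_dataset_card` itself is a hypothetical helper.

```python
def upload_dataset_card(api, repo_id: str, card_text: str) -> None:
    """Upload `card_text` as README.md to a dataset repository.

    `api` is expected to be a `huggingface_hub.HfApi` instance.
    """
    # Strip surrounding blank lines so the file opens with the "---" metadata block
    content = card_text.strip() + "\n"
    api.upload_file(
        path_or_fileobj=content.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="dataset",
    )
```

After `from huggingface_hub import HfApi`, a call such as `upload_dataset_card(HfApi(), dataset_name, readme_template)` would write the card to the repository pushed in Step 4. Fill in the placeholders in the template first.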


## Step 8: Load Your Dataset Back

Once pushed, anyone (including you!) can load it easily:


In [None]:
print("=" * 70)
print("Loading Your Dataset")
print("=" * 70)
print()

# Load the dataset you just pushed
my_dataset = load_dataset(dataset_name)
print(my_dataset)




---

## Quick Reference: Complete Workflow

```python
# 1. Login
from huggingface_hub import login
login()

# 2. Create/Load dataset
from datasets import Dataset, DatasetDict
data = {"text": ["example1", "example2"]}
dataset = Dataset.from_dict(data)

# 3. Split (optional)
split = dataset.train_test_split(test_size=0.2)
dataset_dict = DatasetDict({
    "train": split["train"],
    "validation": split["test"]
})

# 4. Push
dataset_dict.push_to_hub("username/dataset-name", private=False)
```

---

## Key Takeaways

1. **Authentication**: Always log in first with your HF token
2. **DatasetDict**: Use splits (train/validation) for better organization
3. **Naming**: Use format `username/dataset-name`
4. **Privacy**: Set `private=True` for private datasets
5. **Documentation**: Always add a README with dataset description
6. **Formats**: Can load from JSON, JSONL, CSV, Parquet, or Python dicts
7. **Sharing**: Once pushed, anyone can load a public dataset with `load_dataset()`

**Best Practices**:
- Include train/validation splits
- Write clear field descriptions in README
- Add license information
- Include usage examples
- Specify language and task categories
