# Loading GadziJęzyk Dataset

This notebook demonstrates how to load the **GadziJęzyk** dataset using Amber's dataset abstractions.

## ⚠️ Dataset Information Needed

**Note:** The exact source and structure of the GadziJęzyk dataset could not be automatically determined. Please update this notebook with the correct:

1. **Dataset source:** HuggingFace repo ID, local path, CSV file, or other source
2. **Dataset structure:** Field names, splits, and data format
3. **Dataset type:** Text dataset, classification dataset, or other

Once you have this information, update the configuration section below and adjust the loading code accordingly.


## Setup and Imports


In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from amber.adapters import TextDataset, ClassificationDataset
from amber.adapters.loading_strategy import LoadingStrategy
from amber.store.local_store import LocalStore

print("✅ Imports completed")




✅ Imports completed


## Configuration

**⚠️ Update these values based on the actual dataset source and structure:**


In [2]:
# Dataset configuration
# Option 1: Local CSV file
CSV_PATH = None  # e.g., Path("../data/gadzijezyk.csv")

# Option 2: HuggingFace Hub (if available)
DATASET_REPO_ID = None  # e.g., "username/gadzijezyk" or "organization/gadzijezyk"
SPLIT = "train"  # Usually "train", "test", or "validation"

# Option 3: Local directory (for .txt files)
LOCAL_DIR = None  # e.g., Path("../data/gadzijezyk")

# Option 4: Local JSON/JSONL file
JSON_PATH = None  # e.g., Path("../data/gadzijezyk.json")

# Loading strategy
LOADING_STRATEGY = LoadingStrategy.MEMORY  # Use STREAM for large datasets

# Field names (adjust based on actual dataset structure)
TEXT_FIELD = "text"  # Column name containing text content
CATEGORY_FIELD = "label"  # Column name containing category/label (if classification dataset)

# Storage configuration
STORE_DIR = Path("../store")  # Relative to examples directory
STORE_DIR.mkdir(parents=True, exist_ok=True)

# Optional: Limit number of samples for quick testing
LIMIT = None  # Set to a number (e.g., 100) to limit samples

print("⚠️  Please update the configuration above with the correct dataset source!")


⚠️  Please update the configuration above with the correct dataset source!
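
Because every source option defaults to `None`, it is easy to run the notebook with no source set, or with several set at once. The following is a minimal sketch of a sanity check, not part of Amber: `configured_sources` is a hypothetical helper, and the values passed to it only mirror the example paths from the configuration comments.

```python
from pathlib import Path


def configured_sources(**sources):
    """Return the names of the source options that are set (not None)."""
    return [name for name, value in sources.items() if value is not None]


# Hypothetical values mirroring the configuration cell; only CSV_PATH is set here.
selected = configured_sources(
    CSV_PATH=Path("../data/gadzijezyk.csv"),
    DATASET_REPO_ID=None,
    LOCAL_DIR=None,
    JSON_PATH=None,
)

if len(selected) != 1:
    raise ValueError(f"Set exactly one dataset source, found: {selected or 'none'}")
print(f"Using source: {selected[0]}")  # Using source: CSV_PATH
```

Running a check like this right after the configuration cell makes a missing or ambiguous source fail early instead of silently skipping every loading cell.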


## Option 1: Load from CSV File

If the dataset is stored as a CSV file:


In [3]:
if CSV_PATH:
    # Create store instance
    store = LocalStore(STORE_DIR)
    
    print(f"📥 Loading from CSV: {CSV_PATH}...")
    
    # If it's a text-only dataset:
    # dataset = TextDataset.from_csv(
    #     source=CSV_PATH,
    #     store=store,
    #     loading_strategy=LOADING_STRATEGY,
    #     text_field=TEXT_FIELD,
    # )
    
    # Or, if it's a classification dataset:
    dataset = ClassificationDataset.from_csv(
        source=CSV_PATH,
        store=store,
        loading_strategy=LOADING_STRATEGY,
        text_field=TEXT_FIELD,
        category_field=CATEGORY_FIELD,
    )
    
    print("✅ Dataset loaded successfully!")
    print(f"📊 Number of samples: {len(dataset)}")
    if hasattr(dataset, 'get_categories'):
        print(f"🏷️  Categories: {dataset.get_categories()}")
else:
    print("⚠️  CSV_PATH not set. Skipping CSV load.")


⚠️  CSV_PATH not set. Skipping CSV load.
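
Before pointing `TEXT_FIELD` and `CATEGORY_FIELD` at a CSV file, it can help to confirm that those columns actually exist. The sketch below uses only the standard library; `missing_csv_columns` is a hypothetical helper, and the sample file is generated on the fly for illustration, so it says nothing about the real dataset's columns.

```python
import csv
import tempfile
from pathlib import Path


def missing_csv_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]


# Illustrative data only: a tiny CSV with "text" and "label" columns.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("text,label\nDzień dobry,greeting\n", encoding="utf-8")

    print(missing_csv_columns(sample, ["text", "label"]))     # []
    print(missing_csv_columns(sample, ["text", "category"]))  # ['category']
```

An empty result means the configured field names match the header; any names returned should be corrected in the configuration cell before loading.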


## Option 2: Load from HuggingFace Hub

If the dataset is available on HuggingFace Hub:


In [4]:
if DATASET_REPO_ID:
    # Create store instance
    store = LocalStore(STORE_DIR)
    
    # Determine if it's a classification or text dataset
    # If you have labels/categories, use ClassificationDataset
    # Otherwise, use TextDataset
    
    # Example: Load as TextDataset (if no labels)
    print(f"📥 Loading {DATASET_REPO_ID} from HuggingFace...")
    # dataset = TextDataset.from_huggingface(
    #     repo_id=DATASET_REPO_ID,
    #     store=store,
    #     split=SPLIT,
    #     loading_strategy=LOADING_STRATEGY,
    #     text_field=TEXT_FIELD,
    #     limit=LIMIT,
    # )
    
    # Or, if it's a classification dataset:
    # dataset = ClassificationDataset.from_huggingface(
    #     repo_id=DATASET_REPO_ID,
    #     store=store,
    #     split=SPLIT,
    #     loading_strategy=LOADING_STRATEGY,
    #     text_field=TEXT_FIELD,
    #     category_field=CATEGORY_FIELD,
    #     limit=LIMIT,
    # )
    
    # print("✅ Dataset loaded successfully!")
    # print(f"📊 Number of samples: {len(dataset)}")
    print("⚠️  Uncomment and adjust the code above based on your dataset type.")
else:
    print("⚠️  DATASET_REPO_ID not set. Skipping HuggingFace load.")


⚠️  DATASET_REPO_ID not set. Skipping HuggingFace load.


## Option 3: Load from Local Directory

If the dataset is stored locally as a directory of text files:


In [5]:
if LOCAL_DIR:
    # Create store instance
    store = LocalStore(STORE_DIR)
    
    print(f"📥 Loading from local directory: {LOCAL_DIR}...")
    dataset = TextDataset.from_local(
        source=LOCAL_DIR,
        store=store,
        loading_strategy=LOADING_STRATEGY,
        text_field=TEXT_FIELD,
    )
    
    print("✅ Dataset loaded successfully!")
    print(f"📊 Number of samples: {len(dataset)}")
else:
    print("⚠️  LOCAL_DIR not set. Skipping local directory load.")


⚠️  LOCAL_DIR not set. Skipping local directory load.


## Option 4: Load from JSON/JSONL File

If the dataset is stored as a JSON or JSONL file:


In [6]:
if JSON_PATH:
    # Create store instance
    store = LocalStore(STORE_DIR)
    
    print(f"📥 Loading from JSON: {JSON_PATH}...")
    
    # If it's a text-only dataset:
    # dataset = TextDataset.from_json(
    #     source=JSON_PATH,
    #     store=store,
    #     loading_strategy=LOADING_STRATEGY,
    #     text_field=TEXT_FIELD,
    # )
    
    # Or, if it's a classification dataset:
    # dataset = ClassificationDataset.from_json(
    #     source=JSON_PATH,
    #     store=store,
    #     loading_strategy=LOADING_STRATEGY,
    #     text_field=TEXT_FIELD,
    #     category_field=CATEGORY_FIELD,
    # )
    
    # print("✅ Dataset loaded successfully!")
    # print(f"📊 Number of samples: {len(dataset)}")
    print("⚠️  Uncomment and adjust the code above based on your dataset type.")
else:
    print("⚠️  JSON_PATH not set. Skipping JSON load.")


⚠️  JSON_PATH not set. Skipping JSON load.
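
When the field names in a JSON or JSONL file are unknown, peeking at the first record is usually enough to choose `TEXT_FIELD` and `CATEGORY_FIELD`. This is a rough sketch using only the standard library; `first_record_keys` is a hypothetical helper, and it assumes a `.jsonl` file holds one object per line while a `.json` file holds either a list of records or a single record.

```python
import json
import tempfile
from pathlib import Path


def first_record_keys(path):
    """Return the sorted field names of the first record in a JSON or JSONL file."""
    path = Path(path)
    with path.open(encoding="utf-8") as handle:
        if path.suffix == ".jsonl":
            # JSONL: one JSON object per line; parse only the first non-empty line.
            for line in handle:
                if line.strip():
                    return sorted(json.loads(line))
            return []
        data = json.load(handle)
    # Assumption: a .json file is a list of records or a single record.
    record = data[0] if isinstance(data, list) and data else data
    return sorted(record) if isinstance(record, dict) else []


# Illustrative data only; the real dataset's fields may differ.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text('{"text": "Dzień dobry", "label": "greeting"}\n', encoding="utf-8")
    print(first_record_keys(sample))  # ['label', 'text']
```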


## Inspect Dataset Structure (if loading from HuggingFace)

If you're loading from HuggingFace and unsure about the structure, inspect it first:


In [7]:
if DATASET_REPO_ID:
    from datasets import load_dataset
    
    print("🔍 Inspecting dataset structure...")
    raw_dataset = load_dataset(
        DATASET_REPO_ID,
        split=SPLIT,  # pass by keyword; the second positional argument is the config name
        streaming=False,
    )
    
    # Get first example
    if hasattr(raw_dataset, '__getitem__'):
        first_example = raw_dataset[0]
    else:
        first_example = next(iter(raw_dataset))
    
    print("\n📋 Dataset columns:")
    print(raw_dataset.column_names if hasattr(raw_dataset, 'column_names') else list(first_example.keys()))

    print("\n📝 First example:")
    for key, value in first_example.items():
        if isinstance(value, str) and len(value) > 200:
            print(f"  {key}: {value[:200]}...")
        else:
            print(f"  {key}: {value}")
else:
    print("⚠️  DATASET_REPO_ID not set. Skipping inspection.")


⚠️  DATASET_REPO_ID not set. Skipping inspection.


## Notes

- **Update Configuration:** Make sure to update the configuration section with the correct dataset source and field names
- **Dataset Type:** Choose the appropriate dataset class (`TextDataset` or `ClassificationDataset`) based on your data
- **Field Names:** Adjust `TEXT_FIELD` and `CATEGORY_FIELD` to match your dataset's actual column names
- **Caching:** The dataset is cached locally in the store directory for faster subsequent loads
- **Streaming:** Use `LoadingStrategy.STREAM` for very large datasets to avoid loading everything into memory
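
To make the memory versus streaming trade-off concrete, here is a plain-Python illustration of the two access patterns. It is only a sketch of the general idea and does not show how Amber's `LoadingStrategy` is implemented; `load_in_memory` and `stream_lines` are hypothetical names, and the sample file is generated for the demonstration.

```python
import tempfile
from itertools import islice
from pathlib import Path
from typing import Iterator, List


def load_in_memory(path: Path) -> List[str]:
    """Read every line up front; simple, but memory use grows with file size."""
    return path.read_text(encoding="utf-8").splitlines()


def stream_lines(path: Path) -> Iterator[str]:
    """Yield one line at a time, so only the current line is held in memory."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("pierwszy\ndrugi\ntrzeci\n", encoding="utf-8")

    print(len(load_in_memory(sample)))            # 3
    print(list(islice(stream_lines(sample), 2)))  # ['pierwszy', 'drugi']
```

The in-memory version supports `len()` and random access, while the streaming version only supports iteration; that difference is the usual reason to prefer in-memory loading for small datasets and streaming for large ones.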
