# Loading the BAN-PL Dataset

This notebook demonstrates how to load the **BAN-PL** dataset using Amber's dataset abstractions.

## About BAN-PL

BAN-PL is a Polish-language dataset of harmful and offensive content collected from Wykop.pl. The anonymized subset consists of 24,000 examples, evenly split between the "harmful" and "neutral" classes.

**Source:** Available as a CSV file from the GitHub repository

**GitHub:** [NASK-PIB/BAN-PL](https://github.com/NASK-PIB/BAN-PL)

**Paper:** [BAN-PL: A Polish Dataset of Banned Harmful and Offensive Content](https://aclanthology.org/2024.lrec-main.190/)

## Dataset Structure

This is a classification dataset with:
- Text content (potentially harmful or neutral)
- Labels: "harmful" or "neutral"

**Note:** The dataset contains potentially offensive content. Use responsibly.

**Important:** BAN-PL is not available on the Hugging Face Hub. Download the CSV file from the GitHub repository before running the cells below.


## Setup and Imports


In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
from amber.adapters import ClassificationDataset
from amber.adapters.loading_strategy import LoadingStrategy
from amber.store.local_store import LocalStore

print("✅ Imports completed")


✅ Imports completed


## Configuration


In [None]:
# Dataset configuration
# Option 1: Local CSV file path (download from GitHub first)
CSV_PATH = Path("../data/BAN-PL.csv")  # Update this path to your CSV file location

# Option 2: Download from GitHub (uncomment to use)
# CSV_URL = "https://raw.githubusercontent.com/NASK-PIB/BAN-PL/main/data/BAN-PL.csv"

LOADING_STRATEGY = LoadingStrategy.MEMORY  # Use STREAM for large datasets

# Column names (adjust if your copy of the CSV uses different ones,
# e.g. "harmful" instead of "label" for the class column)
TEXT_FIELD = "text"  # Column containing the text content
CATEGORY_FIELD = "label"  # Column containing the class label

# Storage configuration
STORE_DIR = Path("../store")  # Relative to examples directory
STORE_DIR.mkdir(parents=True, exist_ok=True)

# Optional: Limit number of samples for quick testing
LIMIT = None  # Set to a number (e.g., 100) to limit samples

print("📊 Dataset: BAN-PL (CSV)")
print(f"📁 CSV path: {CSV_PATH}")
print(f"📁 Store directory: {STORE_DIR}")
print(f"🔧 Loading strategy: {LOADING_STRATEGY}")


📊 Dataset: NASK-PIB/BAN-PL
📁 Store directory: ../store
🔧 Loading strategy: LoadingStrategy.MEMORY


## Download Dataset (if needed)

If you haven't downloaded the CSV file yet, you can download it from GitHub:


In [None]:
# Download CSV from GitHub if not already present
import urllib.request

if not CSV_PATH.exists():
    print("📥 Downloading BAN-PL dataset from GitHub...")
    CSV_PATH.parent.mkdir(parents=True, exist_ok=True)

    # Try to download from GitHub
    try:
        url = "https://raw.githubusercontent.com/NASK-PIB/BAN-PL/main/data/BAN-PL.csv"
        urllib.request.urlretrieve(url, CSV_PATH)
        print(f"✅ Downloaded to: {CSV_PATH}")
    except Exception as e:
        # Remove any partial file so the next run does not mistake it for a complete download
        CSV_PATH.unlink(missing_ok=True)
        print(f"❌ Error downloading: {e}")
        print("💡 Please download the CSV file manually from:")
        print("   https://github.com/NASK-PIB/BAN-PL")
        print(f"   And place it at: {CSV_PATH}")
else:
    print(f"✅ CSV file found at: {CSV_PATH}")


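
The download cell above relies on `urllib.request.urlretrieve`, which takes no timeout and writes straight to the target path. As a hedged alternative, the sketch below streams the response into a temporary file and renames it into place only after the transfer completes, so an interrupted download never leaves a truncated CSV at `CSV_PATH`. The helper name `download_file` and the 30-second default timeout are illustrative assumptions, not part of Amber.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(url, dest, timeout=30.0):
    """Download `url` to `dest`, replacing `dest` only after a full transfer.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the temporary file next to the destination so the final rename
    # stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "wb") as tmp, \
                urllib.request.urlopen(url, timeout=timeout) as response:
            shutil.copyfileobj(response, tmp)
        tmp_path.replace(dest)
    except BaseException:
        # Clean up the partial file, then let the caller see the error.
        tmp_path.unlink(missing_ok=True)
        raise
    return dest
```

With the configuration from this notebook, calling `download_file(url, CSV_PATH)` inside the `try` block could stand in for the `urlretrieve` call; network and HTTP failures surface as `OSError` subclasses (`urllib.error.URLError`, `urllib.error.HTTPError`).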

## Load Dataset from CSV


In [None]:
# Create store instance
store = LocalStore(STORE_DIR)

# Load dataset from CSV
print(f"📥 Loading BAN-PL from CSV: {CSV_PATH}...")

if not CSV_PATH.exists():
    raise FileNotFoundError(
        f"CSV file not found at {CSV_PATH}. "
        f"Please download it from https://github.com/NASK-PIB/BAN-PL "
        f"or update CSV_PATH in the configuration."
    )

# Load the dataset
dataset = ClassificationDataset.from_csv(
    source=CSV_PATH,
    store=store,
    loading_strategy=LOADING_STRATEGY,
    text_field=TEXT_FIELD,
    category_field=CATEGORY_FIELD,
)

print("✅ Dataset loaded successfully!")
print(f"📊 Number of samples: {len(dataset)}")
print(f"🏷️  Categories: {dataset.get_categories()}")

# Apply limit if specified. from_csv loads the full file, so the limit is
# applied afterwards by building a smaller in-memory dataset.
if LIMIT is not None and not dataset.is_streaming:
    print(f"📉 Limiting to first {LIMIT} samples...")
    from datasets import Dataset
    limited_texts = dataset.get_texts()[:LIMIT]
    limited_categories = dataset.get_categories_for_texts(limited_texts)
    limited_ds = Dataset.from_dict({
        TEXT_FIELD: limited_texts,
        CATEGORY_FIELD: limited_categories,
    })
    dataset = ClassificationDataset(
        limited_ds,
        store=store,
        loading_strategy=LOADING_STRATEGY,
        text_field=TEXT_FIELD,
        category_field=CATEGORY_FIELD,
    )
    print(f"📊 Limited dataset size: {len(dataset)}")
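
The limit in the cell above is applied only after the whole CSV has been loaded. If loading the full file is itself too slow for a quick test, one option is to write the first records to a smaller CSV and point `CSV_PATH` at that file instead. The sketch below uses only the standard `csv` module; `write_csv_head` is a hypothetical helper rather than an Amber function, and UTF-8 encoding is assumed.

```python
import csv
from itertools import islice


def write_csv_head(src, dst, limit):
    """Copy the header and the first `limit` records of `src` into `dst`.

    Returns the number of data records written. Hypothetical helper;
    assumes both files are UTF-8 encoded.
    """
    with open(src, newline="", encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:
            return 0  # empty source file: nothing to copy
        writer.writerow(header)
        written = 0
        # csv.reader yields whole records, so quoted multi-line texts
        # are never cut in half.
        for row in islice(reader, limit):
            writer.writerow(row)
            written += 1
        return written
```

Taking the first records is only representative if the file is not sorted by class; if it is, the small file may contain a single label.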


## Alternative: Inspect CSV Structure First

If you're unsure about the column names, inspect the CSV file first:


In [None]:
# Inspect CSV structure (optional)
import pandas as pd

if CSV_PATH.exists():
    print("🔍 Inspecting CSV structure...")
    df_sample = pd.read_csv(CSV_PATH, nrows=5)
    print("\n📋 Column names:")
    print(df_sample.columns.tolist())
    print("\n📝 First few rows:")
    print(df_sample.head())
    print("\n💡 Update TEXT_FIELD and CATEGORY_FIELD if needed based on the column names above.")
else:
    print("⚠️  CSV file not found. Please download it first.")
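
The same check can be automated so that a wrong column name fails early with a clear message instead of surfacing somewhere inside the loader. This is a minimal sketch using the standard `csv` module; `missing_columns` is a hypothetical helper, and the file is assumed to be UTF-8 (with or without a byte-order mark).

```python
import csv


def missing_columns(csv_path, required):
    """Return the names in `required` that are absent from the CSV header.

    Hypothetical helper; only the first record of the file is read.
    """
    # "utf-8-sig" strips a leading byte-order mark if one is present.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])
    present = set(header)
    return [name for name in required if name not in present]
```

Calling `missing_columns(CSV_PATH, [TEXT_FIELD, CATEGORY_FIELD])` before loading returns an empty list when both configured columns exist, and otherwise names the ones to fix in the configuration cell.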


## Explore Dataset


In [None]:
# Get a sample item
sample = dataset[0]
print("📝 Sample item:")
# Truncate long texts to 200 characters for display
text = sample['text']
print(f"Text: {text[:200]}..." if len(text) > 200 else f"Text: {text}")
print(f"Category: {sample['category']}")


In [None]:
# Get multiple samples
samples = dataset[:5]
print(f"📦 Retrieved {len(samples)} samples:")
for i, item in enumerate(samples):
    print(f"\n{i+1}. Category: {item['category']}")
    print(f"   Text preview: {item['text'][:100]}...")


## Iterate Over Dataset


In [None]:
# Iterate over items one by one
print("🔄 Iterating over first 3 items:")
for i, item in enumerate(dataset.iter_items()):
    if i >= 3:
        break
    print(f"\nItem {i+1}:")
    print(f"  Category: {item['category']}")
    print(f"  Text: {item['text'][:80]}...")


In [None]:
# Iterate in batches
print("📦 Iterating in batches of 10:")
batch_count = 0
for batch in dataset.iter_batches(batch_size=10):
    batch_count += 1
    print(f"\nBatch {batch_count} ({len(batch)} items):")
    categories = [item['category'] for item in batch]
    print(f"  Categories: {categories}")
    if batch_count >= 2:  # Show only first 2 batches
        break
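
How `iter_batches` is implemented inside Amber is not shown in this notebook. For intuition only, batching over any iterable can be sketched with `itertools.islice`; the `batched` helper below is a generic illustration and not Amber's code. It never asks for the length of its input, so the same pattern applies to in-memory lists and to lazily streamed items.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of up to `batch_size` consecutive items from `items`.

    Generic illustration; the last batch may be shorter than the others.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice consumes at most batch_size items from the shared iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

For example, `list(batched(range(7), 3))` gives `[[0, 1, 2], [3, 4, 5], [6]]`.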


## Dataset Statistics


In [None]:
# Get category distribution
categories = dataset.get_categories()
print(f"🏷️  Available categories: {categories}")

# Count items per category (for non-streaming datasets)
if not dataset.is_streaming:
    category_counts = {}
    for item in dataset.iter_items():
        cat = item['category']
        category_counts[cat] = category_counts.get(cat, 0) + 1
    
    print("\n📊 Category distribution:")
    for cat, count in sorted(category_counts.items()):
        print(f"  {cat}: {count}")
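
The counting loop above can also be written with `collections.Counter`, which makes it easy to add class shares and compare them with the even split described at the top of this notebook. The sketch assumes each item is a dict with a `'category'` key, as in the cells above; `category_distribution` is a hypothetical helper.

```python
from collections import Counter


def category_distribution(items):
    """Return {category: (count, share)} for an iterable of item dicts.

    Hypothetical helper; assumes every item has a 'category' key.
    """
    counts = Counter(item["category"] for item in items)
    total = sum(counts.values())
    # An empty input produces an empty dict, so no division by zero occurs.
    return {cat: (n, n / total) for cat, n in sorted(counts.items())}
```

Passing `dataset.iter_items()` to this function on a non-streaming dataset should give shares close to 0.5 for each class if the loaded file matches the balanced subset described earlier.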


## Notes

- The dataset is cached locally in the store directory, which speeds up subsequent loads.
- Use `LoadingStrategy.STREAM` for very large datasets to avoid loading everything into memory.
- Adjust `TEXT_FIELD` and `CATEGORY_FIELD` if the CSV uses different column names.
- The dataset contains potentially harmful content. Use it responsibly and in accordance with ethical guidelines.
