# Working with Zstandard-compressed Datasets from Hugging Face

This notebook demonstrates how to work with zstd-compressed files from Hugging Face datasets, focusing on the "agentlans/bluesky" dataset that uses Zstandard (zstd) compression.

The standard `datasets.load_dataset()` method may fail on zstd-compressed files, depending on the installed `datasets` version and whether the `zstandard` package is available. Instead, we'll use a direct approach:
1. Download files using `huggingface_hub`
2. Manually decompress with the `zstandard` library
3. Parse the JSON data

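The examples below require the `zstandard` and `huggingface_hub` packages; the install cell is at the end of the notebook. We begin with a small language-specific file, which can be decompressed entirely in memory: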

In [ ]:
import json

import zstandard as zstd
from huggingface_hub import hf_hub_download

# Download a specific language file (smaller file for demonstration)
file_path = hf_hub_download(
    repo_id="agentlans/bluesky",
    filename="en.jsonl.zst",
    repo_type="dataset"
)

print(f"Downloaded file to: {file_path}")

# Decompress and read the file
with open(file_path, 'rb') as f:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(f) as reader:
        # Read the entire file (since it's small)
        data = reader.read().decode('utf-8')
        
        # Parse JSON lines
        lines = data.strip().split('\n')
        
        # Show the first few samples
        for i, line in enumerate(lines[:5]):
            sample = json.loads(line)
            print(f"\nSample {i+1}:")
            print(f"Text: {sample['text']}")
            print(f"Language: {sample['language']}")
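The line-splitting step above can be factored into a small helper. The following is a minimal sketch, not part of the original notebook: the name `parse_jsonl` is hypothetical and the demo records are made up. It takes text that has already been decompressed and decoded, so it can be tried without downloading anything.

```python
import json
from typing import Any, Dict, List


def parse_jsonl(text: str) -> List[Dict[str, Any]]:
    """Parse JSON Lines text into a list of records, skipping blank lines."""
    records = []
    # Split on "\n" only: str.splitlines() would also split on characters such
    # as U+2028, which can appear unescaped inside a JSON string.
    for line in text.split("\n"):
        if line.strip():
            records.append(json.loads(line))
    return records


# Tiny in-memory example (made-up data, not taken from the dataset)
demo = '{"text": "hello", "language": "en"}\n\n{"text": "bonjour", "language": "fr"}\n'
for record in parse_jsonl(demo):
    print(record["language"], record["text"])
```

In the cell above, the equivalent call would be `parse_jsonl(data)` once `data` has been decoded.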

## Working with Large Files

For larger files such as `all.jsonl.zst`, decompress the data as a stream and handle one line at a time instead of reading the whole file into memory:

In [ ]:
import zstandard as zstd
from huggingface_hub import hf_hub_download
import json
import io

# Download the large file
file_path = hf_hub_download(
    repo_id="agentlans/bluesky",
    filename="all.jsonl.zst",
    repo_type="dataset"
)

print(f"Downloaded large file to: {file_path}")

def process_chunks(file_path, chunk_size=1024*1024, samples_to_show=5):
    """Stream a large zstd-compressed JSONL file and print the first few samples.

    chunk_size is the number of compressed bytes read from disk at a time.
    """
    samples_processed = 0

    with open(file_path, 'rb') as f:
        dctx = zstd.ZstdDecompressor()
        reader = dctx.stream_reader(f, read_size=chunk_size)
        # Closing the text wrapper also closes the underlying zstd reader
        with io.TextIOWrapper(reader, encoding='utf-8') as text_stream:
            # Decompress and decode lazily, one line at a time
            for line_number, line in enumerate(text_stream, start=1):
                if samples_processed >= samples_to_show:
                    break
                try:
                    sample = json.loads(line)
                except json.JSONDecodeError:
                    print(f"Error decoding JSON on line {line_number}")
                    continue
                text = sample['text']
                preview = f"{text[:100]}..." if len(text) > 100 else text
                print(f"\nSample {samples_processed+1} from large file:")
                print(f"Text: {preview}")
                print(f"Language: {sample['language']}")
                samples_processed += 1

    # Print summary
    print(f"\nProcessed {samples_processed} samples from the large file")
    print("Note: The full file contains many more samples")

# Process the large file
process_chunks(file_path)
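Streaming is also useful for aggregate statistics that never need the whole file in memory. The sketch below tallies the `language` field over any iterable of JSONL lines. The function name `count_languages`, the `limit` parameter, and the `"unknown"` fallback label are assumptions made for illustration, and the demo lines are made up.

```python
import json
from collections import Counter
from itertools import islice
from typing import Iterable, Optional


def count_languages(lines: Iterable[str], limit: Optional[int] = None) -> Counter:
    """Tally the 'language' field over an iterable of JSONL lines."""
    counts: Counter = Counter()
    # islice stops after `limit` lines; limit=None consumes the whole iterable
    for line in islice(lines, limit):
        try:
            sample = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines
        if isinstance(sample, dict):
            counts[sample.get("language", "unknown")] += 1
    return counts


# Made-up lines standing in for a decompressed text stream
demo_lines = [
    '{"text": "hi", "language": "en"}',
    '{"text": "salut", "language": "fr"}',
    '{"text": "hey", "language": "en"}',
]
print(count_languages(demo_lines).most_common())
```

With a real file, the `io.TextIOWrapper` text stream from the cell above could be passed as `lines`, ideally with a `limit` while experimenting.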

## Creating a Dataset Iterator

For more advanced processing, we can create an iterator that yields samples one at a time:

In [ ]:
import zstandard as zstd
from huggingface_hub import hf_hub_download
import json
import io
from typing import Iterator, Dict, Any

def bluesky_dataset_iterator(repo_id: str, filename: str) -> Iterator[Dict[str, Any]]:
    """
    Create an iterator for a zstd-compressed JSONL dataset from Hugging Face.
    
    Args:
        repo_id: The Hugging Face dataset repository ID
        filename: The specific file to download from the repository
        
    Yields:
        One parsed sample (a dict) at a time
    """
    # Download the file
    file_path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset"
    )

    # The with blocks release the file handles when the generator finishes
    # or is closed, even if the consumer stops iterating early
    with open(file_path, 'rb') as f:
        stream_reader = zstd.ZstdDecompressor().stream_reader(f)
        with io.TextIOWrapper(stream_reader, encoding='utf-8') as text_stream:
            # Yield samples one by one, skipping lines that are not valid JSON
            for line in text_stream:
                try:
                    sample = json.loads(line)
                except json.JSONDecodeError:
                    continue
                yield sample

# Demonstrate the iterator
dataset = bluesky_dataset_iterator("agentlans/bluesky", "en.jsonl.zst")

# Print the first 5 samples
for i, sample in enumerate(dataset):
    if i < 5:
        print(f"\nSample {i+1} (using iterator):")
        print(f"Text: {sample['text'][:100]}..." if len(sample['text']) > 100 else f"Text: {sample['text']}")
        print(f"Language: {sample['language']}")
    else:
        break
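When a consumer stops early, as the loop above does with `break`, any code placed after the loop inside a generator does not run. Cleanup is only guaranteed if it sits in a `finally` clause or a `with` block, and it runs when the generator is closed. The sketch below illustrates this with a stand-in generator (`numbered_samples` is hypothetical and does not touch the dataset), and uses `itertools.islice` to take the first few items without a manual counter.

```python
from itertools import islice
from typing import Iterator, List

cleanup_log: List[str] = []


def numbered_samples() -> Iterator[int]:
    """Stand-in generator (hypothetical) that records when its cleanup runs."""
    try:
        for i in range(1000):
            yield i
    finally:
        # Runs on exhaustion and also when the generator is closed early
        cleanup_log.append("closed")


samples = numbered_samples()
first_five = list(islice(samples, 5))  # take five items without a manual counter
samples.close()                        # trigger the finally block right away
print(first_five, cleanup_log)
```

The same pattern should apply to the dataset iterator: `islice(dataset, 5)` followed by `dataset.close()` once the samples are no longer needed.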

## Comparison with Standard Dataset Loading

For comparison, here's what happens when we try the standard `datasets.load_dataset()` approach. This will likely fail due to the zstd compatibility issues:

In [ ]:
try:
    from datasets import load_dataset
    
    # Attempt to load the dataset using the standard approach
    print("Attempting to load dataset using datasets.load_dataset()...")
    dataset = load_dataset("agentlans/bluesky", split="train")
    
    # If it succeeds, show a sample
    print("Success! Showing first sample:")
    print(dataset[0])
    
except Exception as e:
    print(f"Error loading dataset with datasets.load_dataset(): {e}")
    print("\nThis is why we need the manual approach demonstrated in this notebook.")

## Summary

In this notebook, we've demonstrated how to work with zstd-compressed datasets from Hugging Face:

1. **Direct Approach for Small Files**:
   - Download specific files using `huggingface_hub.hf_hub_download()`
   - Decompress with zstandard library
   - Parse and process the data

2. **Streaming Approach for Large Files**:
   - Process large files in chunks to avoid memory issues
   - Use streaming IO with zstandard
   - Handle each line individually

3. **Dataset Iterator**:
   - Create a reusable iterator function
   - Process samples one at a time
   - Properly handle resources with cleanup

This approach works reliably even when the standard `datasets.load_dataset()` method fails due to compatibility issues with zstd compression.

In [ ]:
# Install required packages
!pip install zstandard huggingface_hub

