# Kaggle dataset-to-JSONL export template (Image)
Use this notebook as a guided recipe for going from a Kaggle competition dataset to a tidy `.jsonl` asset ready for downstream ML tooling. It double-checks Python dependencies, validates your Kaggle API credentials, optionally searches for the dataset slug, downloads and extracts the archive, and finally base64-encodes every image with metadata so you can store the samples alongside labels in a single JSON Lines file under `data/`. Run the cells in order whenever you need to refresh the export or adapt it to a different Kaggle dataset.

The dataset(s) in this example are used for demonstration purposes only. Microsoft does not endorse them specifically.

## Workflow Overview
1. Check and install the Python dependencies we need (Kaggle API, Pillow, tqdm).
2. Configure working directories and reusable constants for this run.
3. Validate the Kaggle API credentials so authenticated downloads work.
4. Discover the dataset reference on Kaggle (or use a manual override, if supplied).
5. Download the compressed archive with all images.
6. Extract the images into a clean workspace.
7. Encode each image as base64, gather metadata, and write a consolidated JSONL file.
8. Inspect the JSONL output to confirm counts and schema.

## 1. Install / verify dependencies
This notebook requires the `kaggle` package (which provides the Kaggle API client and CLI), Pillow (for reading image metadata), and tqdm (for progress bars).

In [None]:
import sys
import subprocess

required_packages = ["kaggle", "Pillow", "tqdm"]
print(f'Ensuring required packages are installed: {required_packages}')
subprocess.run([sys.executable, '-m', 'pip', 'install', *required_packages], check=True)
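
The cell above always calls `pip install`, which is harmless but can be slow. As an optional sketch, the snippet below checks which packages are already importable before installing anything. The names `PIP_TO_MODULE` and `find_missing_packages` are hypothetical helpers introduced for illustration; the mapping is needed because Pillow is imported as `PIL`.

```python
import importlib.util

# Hypothetical mapping from pip distribution names to the module names they provide.
# Pillow is installed as "Pillow" but imported as "PIL".
PIP_TO_MODULE = {"kaggle": "kaggle", "Pillow": "PIL", "tqdm": "tqdm"}


def find_missing_packages(pip_to_module: dict[str, str]) -> list[str]:
    """Return the pip names whose top-level module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in pip_to_module.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = find_missing_packages(PIP_TO_MODULE)
print(f"Packages that still need installing: {missing or 'none'}")
```

If the list is empty, the `pip install` call could be skipped; otherwise pass only the missing names to pip.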

## 2. Configure paths and constants
Adjust `DATASET_REF_OVERRIDE` if you already know the exact Kaggle dataset slug (e.g., `owner/dataset-name`). Otherwise the next step will search for it using the provided text query.

In [None]:
from pathlib import Path

DATA_DIR = Path('data/blood_cell_cancer_detection')
DATA_DIR.mkdir(parents=True, exist_ok=True)
RAW_DIR = DATA_DIR / 'raw'
OUTPUT_JSONL = DATA_DIR / 'blood_cell_images.jsonl'

DATASET_SEARCH_TERM = 'Blood Cell images for Cancer detection'
DATASET_REF_OVERRIDE = None  # Set to the explicit Kaggle dataset ref if you already know it

print(f'Work directory: {DATA_DIR.resolve()}')
print(f'Output JSONL will be written to: {OUTPUT_JSONL.resolve()}')

## 3. Validate Kaggle credentials
Either place the `kaggle.json` file under `~/.kaggle/` or set the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables before running this cell.
On Windows, `~/.kaggle/` typically resolves to `C:\Users\<your Windows user name>\.kaggle\`; on Linux and macOS it is the `.kaggle` directory inside your home directory.

The content of the `kaggle.json` file looks similar to this:
`{"username":"<your kaggle username>","key":"<your API key from kaggle>"}`

If you don't have a Kaggle API key:
- Log in to [Kaggle](https://www.kaggle.com/) with your credentials (register if necessary).
- Click your user name (the icon in the upper-right corner, next to the search bar) and select "Settings" from the menu (https://www.kaggle.com/settings).
- Scroll down to the API section and create a key.

In [None]:
import json
import os

kaggle_dir = Path.home() / '.kaggle'
kaggle_dir.mkdir(parents=True, exist_ok=True)
kaggle_credentials_path = kaggle_dir / 'kaggle.json'

if kaggle_credentials_path.exists():
    with kaggle_credentials_path.open('r', encoding='utf-8') as cred_file:
        credentials = json.load(cred_file)
    os.environ.setdefault('KAGGLE_USERNAME', credentials.get('username', ''))
    os.environ.setdefault('KAGGLE_KEY', credentials.get('key', ''))
    kaggle_credentials_path.chmod(0o600)
    print(f'Loaded Kaggle credentials from {kaggle_credentials_path}')
elif not (os.environ.get('KAGGLE_USERNAME') and os.environ.get('KAGGLE_KEY')):
    raise RuntimeError('Kaggle credentials not found. Upload kaggle.json to ~/.kaggle or set KAGGLE_USERNAME/KAGGLE_KEY.')
else:
    print('Using Kaggle credentials from environment variables.')
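
The cell above falls back to empty strings when `username` or `key` is missing from `kaggle.json`, so a malformed file only surfaces later as an authentication error. A minimal sketch of a stricter check is shown below; `read_kaggle_credentials` is a hypothetical helper, not part of the Kaggle package.

```python
import json
from pathlib import Path


def read_kaggle_credentials(path: Path) -> tuple[str, str]:
    """Return (username, key) from a kaggle.json file, or raise ValueError."""
    with path.open("r", encoding="utf-8") as cred_file:
        data = json.load(cred_file)
    if not isinstance(data, dict):
        raise ValueError(f"{path} must contain a JSON object.")
    username = str(data.get("username", "")).strip()
    key = str(data.get("key", "")).strip()
    if not username or not key:
        raise ValueError(f"{path} must contain non-empty 'username' and 'key' fields.")
    return username, key
```

Calling `read_kaggle_credentials(Path.home() / '.kaggle' / 'kaggle.json')` before authenticating would fail fast with a readable message if either field is empty.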

## 4. Locate the Kaggle dataset
Authenticate with the Kaggle API, search by text, and lock in the dataset reference (`DATASET_REF`) the rest of the notebook will use.

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

search_results = []
DATASET_REF = None

if DATASET_REF_OVERRIDE:
    DATASET_REF = DATASET_REF_OVERRIDE
    print(f"Using manually specified dataset ref: {DATASET_REF}")
else:
    search_results = api.dataset_list(search=DATASET_SEARCH_TERM)
    if not search_results:
        raise ValueError(f"No Kaggle datasets matched the search term: {DATASET_SEARCH_TERM}")
    print("Candidate datasets:")
    for ds in search_results:
        total_bytes = getattr(ds, "totalBytes", None)
        if total_bytes is None:
            total_bytes = getattr(ds, "total_bytes", None)
        size_mb = (total_bytes or 0) / (1024 ** 2)
        print(f"  - {ds.ref} | {ds.title} | approx {size_mb:.2f} MB")
    DATASET_REF = search_results[0].ref
    print(f"Defaulting to the first match: {DATASET_REF}")

if not DATASET_REF:
    raise ValueError("No dataset reference available; set DATASET_REF_OVERRIDE or adjust the search term.")

DATASET_SLUG = DATASET_REF.split('/')[-1]
ZIP_PATH = DATA_DIR / f"{DATASET_SLUG}.zip"
print(f"Archive will be stored at: {ZIP_PATH}")
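
When `DATASET_REF_OVERRIDE` is set by hand, a typo such as a missing owner goes unnoticed until the download fails. The sketch below validates the `owner/dataset-name` shape up front; `split_dataset_ref` is a hypothetical helper, and the check is purely syntactic (it does not confirm that the dataset exists on Kaggle).

```python
def split_dataset_ref(ref: str) -> tuple[str, str]:
    """Split 'owner/dataset-name' into (owner, slug); raise ValueError otherwise."""
    parts = ref.strip().split("/")
    # Exactly two non-empty parts are expected: the owner and the dataset slug.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"Expected a ref of the form 'owner/dataset-name', got {ref!r}")
    return parts[0], parts[1]


owner, slug = split_dataset_ref("owner/dataset-name")
print(owner, slug)
```

The second element matches what the cell above stores in `DATASET_SLUG`.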

## 5. Download the dataset archive
Files are saved under the working directory; rerunning this cell is safe because it will skip the download if the zip already exists.

In [None]:
if ZIP_PATH.exists():
    print(f'Skipping download because {ZIP_PATH.name} already exists.')
else:
    api.dataset_download_files(DATASET_REF, path=str(DATA_DIR), force=False, quiet=False, unzip=False)
    if not ZIP_PATH.exists():
        raise FileNotFoundError(f'Expected archive {ZIP_PATH} was not created; check Kaggle output above.')
    print(f'Download complete: {ZIP_PATH}')
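
Because the download is skipped whenever the zip file already exists, a truncated archive left behind by an interrupted run would be reused as-is. One possible safeguard, sketched here with the standard `zipfile` module, is to verify the archive before extracting it. `check_archive` is a hypothetical helper; note that `testzip()` reads every member, so it can take a while on large archives.

```python
import zipfile
from pathlib import Path


def check_archive(zip_path: Path) -> int:
    """Return the number of members in a readable zip archive, or raise ValueError."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive.")
    with zipfile.ZipFile(zip_path, "r") as archive:
        # testzip() returns the name of the first corrupt member, or None if all are fine.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"Corrupt member in {zip_path}: {bad_member}")
        return len(archive.namelist())
```

If `check_archive(ZIP_PATH)` raises, deleting the zip file and rerunning the download cell is the simplest recovery.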

## 6. Extract the raw images
Start fresh each run by clearing the extraction directory before unzipping.

In [None]:
import shutil
import zipfile

if RAW_DIR.exists():
    try:
        shutil.rmtree(RAW_DIR)
    except PermissionError as exc:
        print(f"Skipping RAW_DIR cleanup because it is locked ({exc}). Existing files may be reused.")
RAW_DIR.mkdir(parents=True, exist_ok=True)

with zipfile.ZipFile(ZIP_PATH, 'r') as archive:
    archive.extractall(RAW_DIR)

file_count = sum(1 for path in RAW_DIR.rglob('*') if path.is_file())
print(f'Extracted {file_count} files into {RAW_DIR}')
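
The conversion step only picks up a fixed set of image extensions, so files with any other suffix are silently left out of the JSONL. As a quick sanity check, the sketch below tallies extracted files by extension; `count_suffixes` is a hypothetical helper introduced for illustration.

```python
from collections import Counter
from pathlib import Path


def count_suffixes(root: Path) -> Counter:
    """Count files under root by lower-cased file extension."""
    return Counter(p.suffix.lower() for p in root.rglob("*") if p.is_file())
```

Running `count_suffixes(RAW_DIR).most_common()` after extraction shows at a glance whether the dataset contains formats that the extension filter would skip.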

## 7. Convert images to a JSONL corpus
Each record will include: a UUID, Kaggle dataset ref, relative path, filename, inferred label (parent directory name), image dimensions/mode/format, and a base64-encoded payload of the exact bytes.

In [None]:
import base64
import json
from uuid import uuid4
from PIL import Image
from tqdm import tqdm

valid_extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.tif', '.tiff'}
format_to_mime = {
    'JPEG': 'image/jpeg',
    'JPG': 'image/jpeg',
    'PNG': 'image/png',
    'BMP': 'image/bmp',
    'TIFF': 'image/tiff',
    'TIF': 'image/tiff',
}
extension_to_mime = {
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.png': 'image/png',
    '.bmp': 'image/bmp',
    '.tif': 'image/tiff',
    '.tiff': 'image/tiff',
}

def infer_mime(image_format: str | None, suffix: str) -> str:
    if image_format:
        mime = format_to_mime.get(image_format.upper())
        if mime:
            return mime
    return extension_to_mime.get(suffix.lower(), 'application/octet-stream')

image_paths = sorted(p for p in RAW_DIR.rglob('*') if p.suffix.lower() in valid_extensions)

if not image_paths:
    raise RuntimeError(f'No images found under {RAW_DIR}. Please inspect the extraction output and update the parsing logic if needed.')

with OUTPUT_JSONL.open('w', encoding='utf-8') as writer:
    for image_path in tqdm(image_paths, desc='Encoding images'):
        relative_path = image_path.relative_to(RAW_DIR)
        label = image_path.parent.name
        try:
            with Image.open(image_path) as img:
                width, height = img.size
                color_mode = img.mode
                image_format = img.format
        except Exception as exc:
            width = height = None
            color_mode = image_format = None
            print(f'Warning: failed to read metadata for {image_path}: {exc}')
        with image_path.open('rb') as img_file:
            payload_b64 = base64.b64encode(img_file.read()).decode('ascii')
        mime_type = infer_mime(image_format, image_path.suffix)
        data_uri = f"data:{mime_type};base64,{payload_b64}"
        record = {
            'id': str(uuid4()),
            'dataset_ref': DATASET_REF,
            'relative_path': relative_path.as_posix(),
            'filename': image_path.name,
            'label': label,
            'width': width,
            'height': height,
            'color_mode': color_mode,
            'image_format': image_format,
            'label_source': 'parent_directory_name',
            'images': [data_uri],
        }
        writer.write(json.dumps(record, ensure_ascii=False) + '\n')

print(f'Wrote {len(image_paths)} image records to {OUTPUT_JSONL}')
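
Downstream consumers need to turn each `images` entry back into raw bytes. The sketch below shows one way to reverse the `data:<mime>;base64,<payload>` format written above; `decode_data_uri` is a hypothetical helper and only handles base64 data URIs of exactly this shape.

```python
import base64


def decode_data_uri(data_uri: str) -> tuple[str, bytes]:
    """Split a 'data:<mime>;base64,<payload>' URI into (mime_type, raw_bytes)."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("Not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Round-trip check on a tiny made-up payload (not real image data).
example_uri = "data:image/png;base64," + base64.b64encode(b"demo-bytes").decode("ascii")
mime_type, raw = decode_data_uri(example_uri)
print(mime_type, raw)
```

The returned bytes can be wrapped in `io.BytesIO` and passed to `PIL.Image.open` to reconstruct the image.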

## 8. Inspect the JSONL output
Confirm the number of records, label distribution, and preview a sample entry (with the base64 payload truncated for readability).

In [None]:
from collections import Counter

total_records = 0
label_counts = Counter()
sample_record = None

with OUTPUT_JSONL.open('r', encoding='utf-8') as reader:
    for line in reader:
        total_records += 1
        record = json.loads(line)
        label_counts[record['label']] += 1
        if sample_record is None:
            sample_record = record

if sample_record:
    # dict() is a shallow copy, so build a new 'images' list instead of mutating the original.
    sample_record = dict(sample_record)
    sample_record['images'] = [uri[:80] + '...' for uri in sample_record.get('images', [])]

print(f'Total records: {total_records}')
print('Label distribution:')
for label, count in label_counts.most_common():
    print(f'  {label}: {count}')
print('\nSample record:')
print(sample_record)
print(f'JSONL location: {OUTPUT_JSONL.resolve()}')
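
Beyond counts and a sample, it can help to confirm that every line carries the fields written in step 7. The following is a minimal sketch of such a check; `EXPECTED_KEYS` mirrors the record built above, and `find_schema_problems` is a hypothetical helper.

```python
import json

# Field names copied from the record dictionary written in step 7.
EXPECTED_KEYS = {
    "id", "dataset_ref", "relative_path", "filename", "label", "width",
    "height", "color_mode", "image_format", "label_source", "images",
}


def find_schema_problems(lines):
    """Yield (line_number, message) for records that are missing expected keys."""
    for line_number, line in enumerate(lines, start=1):
        record = json.loads(line)
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            yield line_number, f"missing keys: {sorted(missing)}"
```

Passing the open file handle, for example `list(find_schema_problems(reader))`, returns an empty list when every record matches the expected schema.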

✅ All steps completed. You now have a single JSON Lines corpus with base64-embedded blood cell images plus metadata and labels derived from the folder structure.
The JSONL file is written to `data/blood_cell_cancer_detection/blood_cell_images.jsonl`, relative to the notebook's working directory.

DOI (Digital Object Identifier)
https://doi.org/10.34740/kaggle/dsv/10500753

@misc{sumith_singh_kothwal_2025,
	title={Blood Cell images for Cancer detection},
	url={https://www.kaggle.com/dsv/10500753},
	DOI={10.34740/KAGGLE/DSV/10500753},
	publisher={Kaggle},
	author={Sumith Singh Kothwal},
	year={2025}
}


Next, create a small subset of the full dataset for testing purposes. The cell below copies every 237th line (configurable via `SUBSET_STRIDE`) of the JSONL file into a second JSONL file with a `_subset` suffix.
This creates `data/blood_cell_cancer_detection/blood_cell_images_subset.jsonl`.

In [None]:
from pathlib import Path

SUBSET_STRIDE = 237  # Sampling stride. With 5000 lines in the full file, 237 yields 22 lines (indices 0, 237, ..., 4977).
subset_path = OUTPUT_JSONL.with_stem(OUTPUT_JSONL.stem + '_subset')  # Keeps the .jsonl suffix; with_stem needs Python 3.9+

selected = 0
total = 0

with OUTPUT_JSONL.open('r', encoding='utf-8') as source, subset_path.open('w', encoding='utf-8') as target:
    for idx, line in enumerate(source):
        total += 1
        if idx % SUBSET_STRIDE == 0:
            target.write(line)
            selected += 1

print(f'Saved every {SUBSET_STRIDE}th record to {subset_path}')
print(f'Total source records: {total}')
print(f'Total subset records: {selected}')
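
Because the loop keeps every index that is a multiple of the stride, starting at index 0, the subset size is the ceiling of the record count divided by the stride. The sketch below works through that arithmetic; `subset_size` is a hypothetical helper for choosing a stride.

```python
import math


def subset_size(total_records: int, stride: int) -> int:
    """Number of indices in range(total_records) that are multiples of stride."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return math.ceil(total_records / stride)


print(subset_size(5000, 237))  # 22
```

For 5000 records and a stride of 237, the selected indices run from 0 to 4977, which gives 22 records.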