### Lance Converter Script for any Image Dataset

This notebook converts any image dataset into the Lance format. It expects the images to be organized into training, testing, and validation folders, with one subfolder per category, and writes them into a single Lance dataset.

Pre-converted Lance versions of the CINIC-10 and mini-ImageNet datasets are available at the following links:

CINIC-10 Dataset: https://www.kaggle.com/datasets/vipulmaheshwarii/cinic-10-lance-dataset

mini-ImageNet Dataset: https://www.kaggle.com/datasets/vipulmaheshwarii/mini-imagenet-lance-dataset

### Imports

In [1]:
import os
import pandas as pd
import pyarrow as pa
import lance
import time
from tqdm import tqdm

import warnings
warnings.simplefilter('ignore')

### Set the variable according to your Image dataset

Assign the path to your image dataset to the variable `image_dataset`. The path is joined to the current working directory, and the dataset folder should contain `train`, `test`, and `val` subfolders, each holding one subfolder per category. The images in these folders are the ones converted into Lance format.


In [2]:
image_dataset = "image_dataset"
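
Before converting, it can help to confirm that the folder really has the expected layout. The following is a minimal sketch, not part of the original notebook: the helper name `summarize_layout` and the constant `EXPECTED_SPLITS` are hypothetical, and the split names are assumed to be the `train`, `test`, and `val` folders that `process_images` iterates over.

```python
import os

# Assumption: these are the split folder names that process_images expects.
EXPECTED_SPLITS = ("train", "test", "val")


def summarize_layout(root):
    """Return {split: {category: file_count}} for the dataset rooted at `root`.

    Raises FileNotFoundError if one of the expected split folders is missing.
    """
    summary = {}
    for split in EXPECTED_SPLITS:
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            raise FileNotFoundError(f"Missing split folder: {split_dir}")

        counts = {}
        for category in sorted(os.listdir(split_dir)):
            category_dir = os.path.join(split_dir, category)
            # Skip stray files that sit next to the category folders.
            if os.path.isdir(category_dir):
                counts[category] = sum(
                    os.path.isfile(os.path.join(category_dir, name))
                    for name in os.listdir(category_dir)
                )
        summary[split] = counts
    return summary
```

Calling `summarize_layout(image_dataset)` and printing the result gives a quick per-category file count, which makes a missing split or an empty category visible before the conversion starts.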

### Processing the Images

The `process_images` function is the central component of this notebook. It is a generator that walks the training, testing, and validation folders and yields one PyArrow `RecordBatch` per image, each with four fields: `image`, `filename`, `category`, and `data_type`.

Specifically, `image` holds the raw bytes of the image file, `filename` is the name of the file, `category` is the name of the category folder the image belongs to, and `data_type` specifies whether the image comes from the training, testing, or validation set.

In [2]:
def process_images():
    # Get the current directory path
    current_dir = os.getcwd()
    images_folder = os.path.join(current_dir, image_dataset)
    print(images_folder)

    # Define schema for RecordBatch
    schema = pa.schema([('image', pa.binary()), 
                        ('filename', pa.string()), 
                        ('category', pa.string()), 
                        ('data_type', pa.string())])

    # Iterate over the data types (train, test, val)
    for data_type in ['train', 'test', 'val']:
        data_type_folder = os.path.join(images_folder, data_type)
        
        # Iterate over the categories within each data type
        for category in os.listdir(data_type_folder):
            category_folder = os.path.join(data_type_folder, category)
            
            # Iterate over the images within each category
            for filename in tqdm(os.listdir(category_folder), desc=f"Processing {data_type} - {category}"):
                # Construct the full path to the image
                image_path = os.path.join(category_folder, filename)

                # Read the raw (still encoded) bytes of the image file
                with open(image_path, 'rb') as f:
                    binary_data = f.read()

                image_array = pa.array([binary_data], type=pa.binary())
                filename_array = pa.array([filename], type=pa.string())
                category_array = pa.array([category], type=pa.string())
                data_type_array = pa.array([data_type], type=pa.string())

                # Yield RecordBatch for each image
                yield pa.RecordBatch.from_arrays(
                    [image_array, filename_array, category_array, data_type_array],
                    schema=schema
                )

### Creating a Lance Dataset

The `write_to_lance` function writes the record batches produced by `process_images` to a Lance dataset. It begins by defining the schema for the Lance dataset, with the fields `image`, `filename`, `category`, and `data_type`. Make sure this schema is the same as the one defined in `process_images`.

Once the schema is established, the function builds the output path from the current working directory and the `image_dataset` variable, so the Lance dataset is saved inside the image folder. It then wraps the `process_images` generator in a `RecordBatchReader` with that schema and passes the reader to `lance.write_dataset`.

In [None]:
# Write the record batches from process_images() to a Lance dataset
def write_to_lance():
    # Define the schema (must match the one used in process_images)
    schema = pa.schema([
        pa.field("image", pa.binary()),
        pa.field("filename", pa.string()),
        pa.field("category", pa.string()),
        pa.field("data_type", pa.string())
    ])

    # Specify the path where you want to save the Lance file
    current_dir = os.getcwd()
    images_folder = os.path.join(current_dir, image_dataset)
    lance_file_path = os.path.join(images_folder, f"{image_dataset}.lance")

    reader = pa.RecordBatchReader.from_batches(schema, process_images())
    lance.write_dataset(
        reader,
        lance_file_path,
        schema,
    )