# 2 - Creating Lance Datasets

<a target="_blank" href="https://colab.research.google.com/github/eto-ai/lance/blob/main/python/notebooks/02_creating_lance_datasets.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

Lance's Python API makes it easy to create Lance datasets from a variety of sources:

⤜ pandas dataframes <br/>
⤜ pyarrow tables <br/>
⤜ parquet datasets <br/>
⤜ [coming soon] known open-source dataset formats <br/>

In [1]:
!pip install --quiet pylance
!rm -rf /tmp/toy.lance /tmp/oxford_pet*

In [2]:
import lance
from lance import LanceFileFormat
import pyarrow as pa

## Warmup: write a toy dataframe

In [3]:
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': np.random.randn(5),
    'b': pd.Categorical.from_codes(np.random.randint(0, 5, 5),
                                   ['cat', 'dog', 'person', 'car', 'duck']),
    'c': pd.date_range('2022-01-01', freq='D', periods=5)
})
toy

Unnamed: 0,a,b,c
0,-0.052295,cat,2022-01-01
1,-0.125469,duck,2022-01-02
2,0.339519,duck,2022-01-03
3,-0.594905,cat,2022-01-04
4,-1.890206,duck,2022-01-05


<div class="alert alert-info">
    Write a pandas DataFrame to Lance format in 1 line
    </div>

In [4]:
lance.write_dataset(toy, '/tmp/toy.lance')

Read it back out

In [5]:
(lance.dataset('/tmp/toy.lance')
 .to_table()
 .to_pandas())

Unnamed: 0,a,b,c
0,-0.052295,cat,2022-01-01
1,-0.125469,duck,2022-01-02
2,0.339519,duck,2022-01-03
3,-0.594905,cat,2022-01-04
4,-1.890206,duck,2022-01-05


## Workout: create pets dataset

Let's create a custom dataset

We'll need some raw data. For convenience, let's use fastai to get the oxford_pet dataset

In [6]:
!pip install --quiet fastai

In [7]:
from fastai.vision.all import untar_data, URLs
path = untar_data(URLs.PETS)

### Now we'll create a dataset with the image and some labeling information

First we get the images

In [8]:
from pathlib import Path
images = path / 'images'

df = pd.DataFrame({'image': pd.array(images.ls(), dtype='image[uri]')})
df.head()

Unnamed: 0,image
0,/root/.fastai/data/oxford-iiit-pet/images/yorkshire_terrier_196.jpg
1,/root/.fastai/data/oxford-iiit-pet/images/saint_bernard_45.jpg
2,/root/.fastai/data/oxford-iiit-pet/images/pug_82.jpg
3,/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_176.jpg
4,/root/.fastai/data/oxford-iiit-pet/images/havanese_125.jpg


Images are defined as extension types in pandas that know how to:
- display themselves
- convert to numpy / tensor
- convert to PIL

In [9]:
df.image.values[0]

In [10]:
print(type(df.image[0].to_pil()))
print(type(df.image[0].to_numpy()))

<class 'PIL.Image.Image'>
<class 'numpy.ndarray'>


### pandas integration makes data preparation really easy

Parse the filename

In [11]:
df['filename'] = df.image.apply(lambda s: Path(s).name)

Now get the class name

In [12]:
df['class'] = df.filename.str.rsplit('_', n=1).str[0].astype('category')

Get the species name

In [13]:
is_cat = df['class'].str[0].str.isupper()
df['species'] = pd.Categorical.from_codes(codes=is_cat.astype(int), 
                                          categories=['dog', 'cat'])
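The three steps above (filename, class, species) can be sketched for a single path in plain Python. The helper name `parse_pet_label` and the example paths are hypothetical. The sketch relies on the same convention as the pandas code: cat breeds start with an uppercase letter and dog breeds with a lowercase one.

```python
from pathlib import Path


def parse_pet_label(image_path: str) -> tuple[str, str, str]:
    """Return (filename, class, species) for one image path."""
    filename = Path(image_path).name
    # Split once from the right to drop the trailing "_<number>.jpg".
    breed = filename.rsplit('_', 1)[0]
    # Uppercase first letter means cat, lowercase means dog.
    species = 'cat' if breed[0].isupper() else 'dog'
    return filename, breed, species


# Hypothetical paths, used only for illustration.
print(parse_pet_label('/data/images/yorkshire_terrier_196.jpg'))
# ('yorkshire_terrier_196.jpg', 'yorkshire_terrier', 'dog')
print(parse_pet_label('/data/images/Maine_Coon_12.jpg'))
# ('Maine_Coon_12.jpg', 'Maine_Coon', 'cat')
```

Splitting from the right matters because breed names such as `yorkshire_terrier` contain underscores themselves.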

Let's see what it looks like altogether

In [14]:
df.head()

Unnamed: 0,image,filename,class,species
0,/root/.fastai/data/oxford-iiit-pet/images/yorkshire_terrier_196.jpg,yorkshire_terrier_196.jpg,yorkshire_terrier,dog
1,/root/.fastai/data/oxford-iiit-pet/images/saint_bernard_45.jpg,saint_bernard_45.jpg,saint_bernard,dog
2,/root/.fastai/data/oxford-iiit-pet/images/pug_82.jpg,pug_82.jpg,pug,dog
3,/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_176.jpg,american_bulldog_176.jpg,american_bulldog,dog
4,/root/.fastai/data/oxford-iiit-pet/images/havanese_125.jpg,havanese_125.jpg,havanese,dog


### Let's save our work

In [15]:
lance.write_dataset(df, '/tmp/oxford_pet.lance')

In [16]:
(lance.dataset('/tmp/oxford_pet.lance')
 .to_table()
 .to_pandas()
 .head())

Unnamed: 0,image,filename,class,species
0,/root/.fastai/data/oxford-iiit-pet/images/yorkshire_terrier_196.jpg,yorkshire_terrier_196.jpg,yorkshire_terrier,dog
1,/root/.fastai/data/oxford-iiit-pet/images/saint_bernard_45.jpg,saint_bernard_45.jpg,saint_bernard,dog
2,/root/.fastai/data/oxford-iiit-pet/images/pug_82.jpg,pug_82.jpg,pug,dog
3,/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_176.jpg,american_bulldog_176.jpg,american_bulldog,dog
4,/root/.fastai/data/oxford-iiit-pet/images/havanese_125.jpg,havanese_125.jpg,havanese,dog


## Cooldown: converting from parquet

If you have an existing dataset in Parquet, it's also very easy to
convert the data to Lance

Generate a parquet dataset

In [17]:
import pyarrow.dataset as ds
tbl = pa.Table.from_pandas(df)                 
ds.write_dataset(tbl, '/tmp/oxford_pet.parquet',
                 format='parquet')

In [18]:
parquet_dataset = ds.dataset('/tmp/oxford_pet.parquet')
parquet_dataset.to_table().to_pandas().head()

Unnamed: 0,image,filename,class,species
0,/root/.fastai/data/oxford-iiit-pet/images/yorkshire_terrier_196.jpg,yorkshire_terrier_196.jpg,yorkshire_terrier,dog
1,/root/.fastai/data/oxford-iiit-pet/images/saint_bernard_45.jpg,saint_bernard_45.jpg,saint_bernard,dog
2,/root/.fastai/data/oxford-iiit-pet/images/pug_82.jpg,pug_82.jpg,pug,dog
3,/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_176.jpg,american_bulldog_176.jpg,american_bulldog,dog
4,/root/.fastai/data/oxford-iiit-pet/images/havanese_125.jpg,havanese_125.jpg,havanese,dog


<div class="alert alert-info">
    Converting Parquet to Lance is also just 1 Python statement
    </div> 

In [19]:
lance.write_dataset(parquet_dataset, '/tmp/oxford_pet_from_parquet.lance')

Again we can read it back out and see that it's the same

In [20]:
lance_dataset = lance.dataset('/tmp/oxford_pet_from_parquet.lance')
lance_dataset.to_table().to_pandas().head()

Unnamed: 0,image,filename,class,species
0,/root/.fastai/data/oxford-iiit-pet/images/yorkshire_terrier_196.jpg,yorkshire_terrier_196.jpg,yorkshire_terrier,dog
1,/root/.fastai/data/oxford-iiit-pet/images/saint_bernard_45.jpg,saint_bernard_45.jpg,saint_bernard,dog
2,/root/.fastai/data/oxford-iiit-pet/images/pug_82.jpg,pug_82.jpg,pug,dog
3,/root/.fastai/data/oxford-iiit-pet/images/american_bulldog_176.jpg,american_bulldog_176.jpg,american_bulldog,dog
4,/root/.fastai/data/oxford-iiit-pet/images/havanese_125.jpg,havanese_125.jpg,havanese,dog


## Try your own data!

Nice work! In this tutorial, we've created Lance datasets from pandas, Arrow, and Parquet using only a few lines of code.

Now you should try creating a Lance dataset with your own image data!