# 2 - Creating Lance Datasets

Lance's Python API makes it easy to create Lance datasets from a variety of sources:

⤜ pandas dataframes <br/>
⤜ pyarrow tables <br/>
⤜ parquet datasets <br/>
⤜ [coming soon] known open-source dataset formats <br/>

In [1]:
import lance
from lance import LanceFileFormat
import pyarrow as pa

## Warmup: write a toy dataframe

In [2]:
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': np.random.randn(5),
    'b': pd.Categorical.from_codes(np.random.randint(0, 5, 5),
                                   ['cat', 'dog', 'person', 'car', 'duck']),
    'c': pd.date_range('2022-01-01', freq='D', periods=5)
})
toy

Unnamed: 0,a,b,c
0,-0.046544,duck,2022-01-01
1,0.352004,car,2022-01-02
2,0.314334,person,2022-01-03
3,1.395114,cat,2022-01-04
4,-0.605658,person,2022-01-05


<div class="alert alert-info">
    Write an Arrow Table to Lance format in 1 line
    </div>

In [3]:
# Convert the pandas DataFrame to an Arrow table, then write it in Lance format
tbl = pa.Table.from_pandas(toy)
lance.write_dataset(tbl, '/tmp/toy.lance')

Read it back out:

In [4]:
(lance.dataset('/tmp/toy.lance')
 .to_table()
 .to_pandas())

Unnamed: 0,a,b,c
0,-0.046544,duck,2022-01-01
1,0.352004,car,2022-01-02
2,0.314334,person,2022-01-03
3,1.395114,cat,2022-01-04
4,-0.605658,person,2022-01-05


## Workout: create pets dataset

Let's create a custom dataset.

We'll need some raw data. For convenience, let's use fastai to download the Oxford-IIIT Pet dataset.

In [5]:
!pip install --quiet fastai

In [6]:
from fastai.vision.all import untar_data, URLs
path = untar_data(URLs.PETS)

### Now we'll create a dataset with the image and some labeling information

First we get the images

In [7]:
from pathlib import Path
images = path / 'images'

df = pd.DataFrame({'image': pd.array(images.ls(), dtype='image[uri]')})
df.head()

Unnamed: 0,image
0,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/saint_bernard_50.jpg
1,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_130.jpg
2,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_215.jpg
3,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Maine_Coon_146.jpg
4,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_55.jpg
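Note that `images.ls()` is a convenience method that fastai adds to `Path`; it is not part of the standard library. If you are preparing your own image folder without fastai, a plain `pathlib` listing could look roughly like the sketch below. The helper name `list_jpgs` and the temporary folder are made up for illustration.

```python
import tempfile
from pathlib import Path

def list_jpgs(folder):
    """Return the .jpg files directly inside `folder`, in sorted order."""
    return sorted(p for p in folder.iterdir() if p.suffix.lower() == '.jpg')

# Build a throwaway folder with two images and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ['boxer_130.jpg', 'Egyptian_Mau_215.jpg', 'notes.txt']:
        (folder / name).touch()
    found = [p.name for p in list_jpgs(folder)]
    print(found)
```

The resulting list of paths can then be passed to `pd.array(...)` in the same way as the output of `images.ls()` above.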


Images are defined as extension types in pandas that know how to:
- display themselves
- convert to numpy / tensor
- convert to PIL

In [8]:
df.image.values[0]

In [9]:
print(type(df.image[0].to_pil()))
print(type(df.image[0].to_numpy()))

<class 'PIL.Image.Image'>
<class 'numpy.ndarray'>


### pandas integration makes data preparation really easy

Parse the filename:

In [10]:
df['filename'] = df.image.apply(lambda s: Path(s).name)

Now get the class name

In [11]:
df['class'] = df.filename.str.rsplit('_', n=1).str[0].astype('category')
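The `rsplit('_', n=1)` call splits only once, starting from the right, so underscores inside a breed name are preserved and only the trailing image number is cut off. The same logic in plain Python, as a small sketch (the helper name `breed_from_filename` is just for illustration):

```python
def breed_from_filename(filename):
    """Drop the trailing '_<number>.jpg' part and keep the breed name."""
    # Split once from the right so 'saint_bernard' stays in one piece.
    return filename.rsplit('_', 1)[0]

for name in ['saint_bernard_50.jpg', 'Egyptian_Mau_215.jpg', 'boxer_130.jpg']:
    print(name, '->', breed_from_filename(name))
```

A left-to-right `split('_')` would instead break `saint_bernard_50.jpg` into three pieces, which is why the split is done from the right.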

Get the species name

In [12]:
is_cat = df['class'].str[0].str.isupper()
df['species'] = pd.Categorical.from_codes(codes=is_cat.astype(int), 
                                          categories=['dog', 'cat'])
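The cell above relies on a naming convention in this dataset: breed names that start with an uppercase letter are treated as cats, and lowercase ones as dogs. The boolean is cast to an integer and used as a code into the `['dog', 'cat']` category list, so `False` (0) maps to `dog` and `True` (1) maps to `cat`. A standalone sketch on a handful of breed names taken from the output above:

```python
import pandas as pd

breeds = pd.Series(['saint_bernard', 'boxer', 'Egyptian_Mau', 'Maine_Coon'])

# True where the first character is uppercase (treated as a cat breed here).
is_cat = breeds.str[0].str.isupper()

# Codes index into `categories`: 0 -> 'dog', 1 -> 'cat'.
species = pd.Categorical.from_codes(codes=is_cat.astype(int),
                                    categories=['dog', 'cat'])
print(list(species))
```

If your own data does not follow such a convention, you would need an explicit breed-to-species mapping instead.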

Let's see what it looks like all together:

In [13]:
df.head()

Unnamed: 0,image,filename,class,species
0,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/saint_bernard_50.jpg,saint_bernard_50.jpg,saint_bernard,dog
1,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_130.jpg,boxer_130.jpg,boxer,dog
2,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_215.jpg,Egyptian_Mau_215.jpg,Egyptian_Mau,cat
3,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Maine_Coon_146.jpg,Maine_Coon_146.jpg,Maine_Coon,cat
4,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_55.jpg,boxer_55.jpg,boxer,dog


### Let's save our work

In [14]:
tbl = pa.Table.from_pandas(df)            
lance.write_dataset(tbl, '/tmp/oxford_pet.lance')

In [15]:
(lance.dataset('/tmp/oxford_pet.lance')
 .to_table()
 .to_pandas()
 .head())

Unnamed: 0,image,filename,class,species
0,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/saint_bernard_50.jpg,saint_bernard_50.jpg,saint_bernard,dog
1,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_130.jpg,boxer_130.jpg,boxer,dog
2,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_215.jpg,Egyptian_Mau_215.jpg,Egyptian_Mau,cat
3,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Maine_Coon_146.jpg,Maine_Coon_146.jpg,Maine_Coon,cat
4,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_55.jpg,boxer_55.jpg,boxer,dog


## Cooldown: converting from parquet

If you have an existing dataset in Parquet, it's also very easy to
convert the data to Lance.

Generate a parquet dataset

In [16]:
import pyarrow.dataset as ds
tbl = pa.Table.from_pandas(df)                 
ds.write_dataset(tbl, '/tmp/oxford_pet.parquet',
                 format='parquet')

In [17]:
parquet_dataset = ds.dataset('/tmp/oxford_pet.parquet')
parquet_dataset.to_table().to_pandas().head()

Unnamed: 0,image,filename,class,species
0,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/saint_bernard_50.jpg,saint_bernard_50.jpg,saint_bernard,dog
1,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_130.jpg,boxer_130.jpg,boxer,dog
2,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_215.jpg,Egyptian_Mau_215.jpg,Egyptian_Mau,cat
3,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Maine_Coon_146.jpg,Maine_Coon_146.jpg,Maine_Coon,cat
4,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_55.jpg,boxer_55.jpg,boxer,dog


<div class="alert alert-info">
    Converting Parquet to Lance is also just one Python statement
    </div> 

In [18]:
lance.write_dataset(parquet_dataset, '/tmp/oxford_pet_from_parquet.lance')

Again, we can read it back out and see that it's the same:

In [19]:
lance_dataset = lance.dataset('/tmp/oxford_pet_from_parquet.lance')
lance_dataset.to_table().to_pandas().head()

Unnamed: 0,image,filename,class,species
0,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/saint_bernard_50.jpg,saint_bernard_50.jpg,saint_bernard,dog
1,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_130.jpg,boxer_130.jpg,boxer,dog
2,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_215.jpg,Egyptian_Mau_215.jpg,Egyptian_Mau,cat
3,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Maine_Coon_146.jpg,Maine_Coon_146.jpg,Maine_Coon,cat
4,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/boxer_55.jpg,boxer_55.jpg,boxer,dog


## Try your own data!

Nice work! In this tutorial, we've created Lance datasets from pandas, Arrow, and Parquet using only a few lines of code.

Now you should try creating a Lance dataset with your own image data!