# 2 - Creating Lance Datasets

Lance's Python API makes it easy to create Lance datasets from a variety of sources:

⤜ pandas dataframes <br/>
⤜ pyarrow tables <br/>
⤜ parquet datasets <br/>
⤜ [coming soon] known open-source dataset formats <br/>

In [1]:
import lance
from lance import LanceFileFormat
import pyarrow as pa
import pyarrow.dataset as ds

## Warmup: write a toy dataframe

In [2]:
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': np.random.randn(5),
    'b': pd.Categorical.from_codes(np.random.randint(0, 5, 5),
                                   ['cat', 'dog', 'person', 'car', 'duck']),
    'c': pd.date_range('2022-01-01', freq='D', periods=5)
})
toy

|   | a | b | c |
|---|---|---|---|
| 0 | 0.273693 | person | 2022-01-01 |
| 1 | -0.033034 | cat | 2022-01-02 |
| 2 | 2.064357 | person | 2022-01-03 |
| 3 | -1.075681 | person | 2022-01-04 |
| 4 | -0.078632 | car | 2022-01-05 |


<div class="alert alert-info">
    Use the standard pyarrow API to write a Lance dataset in one line
    </div>

In [3]:
ds.write_dataset(pa.Table.from_pandas(toy), 
                 base_dir='/tmp/oxford_pet.lance', 
                 format=LanceFileFormat(),
                 existing_data_behavior='overwrite_or_ignore')

Read it back out:

In [4]:
(lance.dataset('/tmp/oxford_pet.lance')
 .to_table()
 .to_pandas())

|   | a | b | c |
|---|---|---|---|
| 0 | 0.273693 | person | 2022-01-01 |
| 1 | -0.033034 | cat | 2022-01-02 |
| 2 | 2.064357 | person | 2022-01-03 |
| 3 | -1.075681 | person | 2022-01-04 |
| 4 | -0.078632 | car | 2022-01-05 |


## Workout: create pets dataset

Let's create a custom dataset.

We'll need some raw data. For convenience, let's use fastai to download the Oxford-IIIT Pet dataset.

In [5]:
!pip install --quiet fastai

In [6]:
from fastai.vision.all import untar_data, URLs
path = untar_data(URLs.PETS)

### Now we'll create a dataset with the image and some labeling information

First, we collect the image paths:

In [7]:
from pathlib import Path
images = path / 'images'

df = pd.DataFrame({'image': pd.array(images.ls(), dtype='image[uri]')})
df.head()

|   | image |
|---|---|
| 0 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_167.jpg |
| 1 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/pug_52.jpg |
| 2 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/basset_hound_112.jpg |
| 3 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Siamese_193.jpg |
| 4 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/shiba_inu_122.jpg |


In [8]:
df.image.values[0]

Next, parse the filename:

In [9]:
df['filename'] = df.image.apply(lambda s: Path(s).name)

Now get the class name, which is everything in the filename before the last underscore:

In [10]:
df['class'] = df.filename.str.rsplit('_', n=1).str[0].astype('category')
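
Splitting from the right matters here because breed names can themselves contain underscores. As a quick plain-Python check of the same rule, here it is applied to three filenames from the listing above (the helper name `breed_from_filename` is made up for this sketch):

```python
def breed_from_filename(filename: str) -> str:
    """Return everything before the last underscore, i.e. the breed name."""
    return filename.rsplit('_', 1)[0]


for name in ['Egyptian_Mau_167.jpg', 'pug_52.jpg', 'basset_hound_112.jpg']:
    print(name, '->', breed_from_filename(name))
```

A plain `split('_')[0]` would cut `Egyptian_Mau` down to `Egyptian`, which is why the cell above uses `rsplit` with `n=1`.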

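Get the species name. In this dataset, cat breed names start with an uppercase letter and dog breed names start with a lowercase letter: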

In [11]:
is_cat = df['class'].str[0].str.isupper()
df['species'] = pd.Categorical.from_codes(codes=is_cat.astype(int), 
                                          categories=['dog', 'cat'])
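
`pd.Categorical.from_codes` treats each code as an index into `categories`, so `False` (0) maps to `'dog'` and `True` (1) maps to `'cat'`. A small sketch on four class names from the listing above (the standalone `classes` series is a stand-in for `df['class']`, introduced here for illustration):

```python
import pandas as pd

# Stand-in for df['class'], using breed names that appear in the listing above.
classes = pd.Series(['Egyptian_Mau', 'pug', 'basset_hound', 'Siamese'])

# True where the first character is uppercase, i.e. a cat breed.
is_cat = classes.str[0].str.isupper()

# Each integer code selects the category at that position.
species = pd.Categorical.from_codes(codes=is_cat.astype(int),
                                    categories=['dog', 'cat'])
print(list(species))
```

The order of `categories` is what ties the codes to the labels, so swapping it to `['cat', 'dog']` would silently invert every label.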

Let's see what it looks like all together:

In [12]:
df.head()

|   | image | filename | class | species |
|---|---|---|---|---|
| 0 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_167.jpg | Egyptian_Mau_167.jpg | Egyptian_Mau | cat |
| 1 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/pug_52.jpg | pug_52.jpg | pug | dog |
| 2 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/basset_hound_112.jpg | basset_hound_112.jpg | basset_hound | dog |
| 3 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Siamese_193.jpg | Siamese_193.jpg | Siamese | cat |
| 4 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/shiba_inu_122.jpg | shiba_inu_122.jpg | shiba_inu | dog |


### Let's save our work again

In [13]:
ds.write_dataset(pa.Table.from_pandas(df), 
                 '/tmp/oxford_pet.lance', 
                 format=LanceFileFormat(),
                 existing_data_behavior='overwrite_or_ignore')

In [14]:
(lance.dataset('/tmp/oxford_pet.lance')
 .to_table()
 .to_pandas()
 .head())

|   | image | filename | class | species |
|---|---|---|---|---|
| 0 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_167.jpg | Egyptian_Mau_167.jpg | Egyptian_Mau | cat |
| 1 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/pug_52.jpg | pug_52.jpg | pug | dog |
| 2 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/basset_hound_112.jpg | basset_hound_112.jpg | basset_hound | dog |
| 3 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Siamese_193.jpg | Siamese_193.jpg | Siamese | cat |
| 4 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/shiba_inu_122.jpg | shiba_inu_122.jpg | shiba_inu | dog |


## Cooldown: converting from parquet

If you have an existing dataset in Parquet, it's also very easy to
convert the data to Lance.

First, generate a Parquet dataset:

In [15]:
ds.write_dataset(pa.Table.from_pandas(df), 
                 '/tmp/oxford_pet.parquet', 
                 format=ds.ParquetFileFormat(),
                 existing_data_behavior='overwrite_or_ignore')

In [16]:
parquet_dataset = ds.dataset('/tmp/oxford_pet.parquet')
parquet_dataset.to_table().to_pandas().head()

|   | image | filename | class | species |
|---|---|---|---|---|
| 0 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_167.jpg | Egyptian_Mau_167.jpg | Egyptian_Mau | cat |
| 1 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/pug_52.jpg | pug_52.jpg | pug | dog |
| 2 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/basset_hound_112.jpg | basset_hound_112.jpg | basset_hound | dog |
| 3 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Siamese_193.jpg | Siamese_193.jpg | Siamese | cat |
| 4 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/shiba_inu_122.jpg | shiba_inu_122.jpg | shiba_inu | dog |


<div class="alert alert-info">
    Converting Parquet to Lance is also just one Python statement
    </div> 

In [17]:
ds.write_dataset(parquet_dataset, 
                 '/tmp/oxford_pet.lance', 
                 format=LanceFileFormat(),
                 existing_data_behavior='overwrite_or_ignore')

Again, we can read it back out and see that the first rows are the same:

In [18]:
lance_dataset = lance.dataset('/tmp/oxford_pet.lance')
lance_dataset.to_table().to_pandas().head()

|   | image | filename | class | species |
|---|---|---|---|---|
| 0 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Egyptian_Mau_167.jpg | Egyptian_Mau_167.jpg | Egyptian_Mau | cat |
| 1 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/pug_52.jpg | pug_52.jpg | pug | dog |
| 2 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/basset_hound_112.jpg | basset_hound_112.jpg | basset_hound | dog |
| 3 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/Siamese_193.jpg | Siamese_193.jpg | Siamese | cat |
| 4 | /Users/changshe/.fastai/data/oxford-iiit-pet/images/shiba_inu_122.jpg | shiba_inu_122.jpg | shiba_inu | dog |


## Try your own data!

Nice work! In this tutorial, we've created Lance datasets from pandas, Arrow, and Parquet using only a few lines of code.

Now you should try creating a Lance dataset with your own image data!