# Format LRA Pathfinder
The Long Range Arena (LRA) benchmark includes the PathFinder dataset, which consists of images of paths drawn as dashed lines. The task is to predict whether two dots in an image are connected by one of the paths.

This dataset is stored in a custom format that requires heavy dependencies (e.g. `tensorflow`).

In this notebook, we convert the dataset into a standardized format using Meerkat dataframes. This format does not require heavy dependencies, which makes the dataset easier to load than in its existing format.

**Reference:** https://github.com/google-research/long-range-arena/blob/main/lra_benchmarks/data/pathfinder.py

In [3]:
import os
import tensorflow as tf

import meerkat as mk
import pandas as pd

pathfinder = "pathfinder32"

In [4]:
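def extract_metadata(dirpath: str, base_path: str):
    """Extract the image paths and labels from the metadata files of a subfolder.

    Example metadata line (after splitting on spaces):
        ['imgs/43', 'sample_0.png', '0', '0', '1.8', '6', '2.0', '5', '1.5', '2', '1']

    Args:
        dirpath: Path to the subfolder that contains the ``metadata`` directory.
        base_path: Prefix prepended to every image path (here, the subfolder name).

    Returns:
        A dict with keys ``path`` (image paths) and ``label`` (integer labels).
    """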
    metadata_dir = os.path.join(dirpath, "metadata")
    image_paths = []
    labels = []
    for metadata_file in os.listdir(metadata_dir):
        file_path = os.path.join(metadata_dir, metadata_file)
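        # One example per line; the trailing newline leaves an empty last entry, so drop it.
        meta_examples = tf.io.read_file(file_path).numpy().decode('utf-8').split('\n')[:-1]
        for m_example in meta_examples:
            m_example = m_example.split(' ')
            # Fields 0 and 1 are the image directory and filename; field 3 is the label.
            image_paths.append(os.path.join(base_path, m_example[0], m_example[1]))
            labels.append(int(m_example[3]))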
    return {"path": image_paths, "label": labels}
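
The function above uses `tensorflow` only to read text files. As a minimal sketch, assuming the metadata files are plain UTF-8 text with one space-separated example per line (as in the docstring example), the same parsing can be done with the standard library. The helper names `parse_metadata_line` and `read_metadata_file` and the file name `example_metadata.txt` are hypothetical and not part of the LRA code.

```python
import os
import tempfile


def parse_metadata_line(line: str, base_path: str):
    """Return (image path, label) for one space-separated metadata line."""
    fields = line.split(" ")
    # Fields 0 and 1 are the image directory and filename; field 3 is the label.
    image_path = os.path.join(base_path, fields[0], fields[1])
    label = int(fields[3])
    return image_path, label


def read_metadata_file(file_path: str, base_path: str):
    """Parse every non-empty line of a metadata file using plain file I/O."""
    with open(file_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    # Skipping empty lines replaces the `[:-1]` slice used in the original cell.
    return [parse_metadata_line(line, base_path) for line in lines if line]


# Usage: write the docstring's example line to a temporary file and parse it.
example = "imgs/43 sample_0.png 0 0 1.8 6 2.0 5 1.5 2 1"
with tempfile.TemporaryDirectory() as tmp:
    meta_file = os.path.join(tmp, "example_metadata.txt")  # hypothetical file name
    with open(meta_file, "w", encoding="utf-8") as f:
        f.write(example + "\n")
    parsed = read_metadata_file(meta_file, base_path="curv_baseline")

print(parsed)
```

Unlike the original cell, this sketch skips empty lines rather than dropping the last entry unconditionally, so it also handles a file without a trailing newline.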

In [5]:
difficulty = {
    "curv_baseline": "easy",
    "curv_contour_length_9": "medium",
    "curv_contour_length_14": "hard",
}
pathfinder_path = os.path.join(mk.config.datasets.root_dir, "lra_release", "lra_release", pathfinder)

In [6]:
# Format the metadata for each subfolder.
dfs = []
for subfolder in os.listdir(pathfinder_path):
    dirpath = os.path.join(pathfinder_path, subfolder, "")
    df = pd.DataFrame(extract_metadata(dirpath, base_path=subfolder))
    df["subfolder"] = subfolder
    df["difficulty"] = difficulty[subfolder]
    dfs.append(df)

# Concatenate the dataframes.
df = pd.concat(dfs, axis=0)
df_pd = df.reset_index(drop=True)

FileNotFoundError: [Errno 2] No such file or directory: '/home/.meerkat/datasets/lra_release/lra_release/pathfinder32'
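
The traceback above is raised by `os.listdir` when the dataset root does not exist. In addition, `difficulty[subfolder]` raises a `KeyError` for any directory entry that is not one of the three known subfolders. The following is a minimal sketch of a guard for both cases; the helper name `find_subfolders` and the wording of the error message are assumptions, not part of the notebook or of Meerkat.

```python
import os
import tempfile

# Same subfolder-to-difficulty mapping as in the notebook.
DIFFICULTY = {
    "curv_baseline": "easy",
    "curv_contour_length_9": "medium",
    "curv_contour_length_14": "hard",
}


def find_subfolders(pathfinder_path: str, difficulty: dict):
    """Return the known subfolders found under pathfinder_path, sorted by name.

    Raises FileNotFoundError with a hint if the dataset root is missing.
    """
    if not os.path.isdir(pathfinder_path):
        raise FileNotFoundError(
            f"Dataset root not found: {pathfinder_path}. "
            "Check that the dataset has been extracted to this location."
        )
    # Keep only directories that have a difficulty label; ignore stray entries.
    return sorted(
        name
        for name in os.listdir(pathfinder_path)
        if name in difficulty and os.path.isdir(os.path.join(pathfinder_path, name))
    )


# Usage on a temporary directory that mimics part of the expected layout.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "curv_baseline"))
    os.makedirs(os.path.join(root, "curv_contour_length_9"))
    os.makedirs(os.path.join(root, "scratch"))  # unknown directory, ignored
    open(os.path.join(root, ".DS_Store"), "w").close()  # stray file, ignored
    found = find_subfolders(root, DIFFICULTY)

    try:
        find_subfolders(os.path.join(root, "does_not_exist"), DIFFICULTY)
        message = None
    except FileNotFoundError as err:
        message = str(err)

print(found)
print(message)
```

The loop in the cell above could then iterate over `find_subfolders(pathfinder_path, difficulty)` instead of `os.listdir(pathfinder_path)`.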

In [97]:
df = mk.DataFrame.from_pandas(df_pd)
df = df.drop("index")
# df["image"] = mk.files(df["path"], base_dir=os.path.join(mk.config.datasets.root_dir, "lra_release", "lra_release", pathfinder))

## Save the DataFrame
We will save the dataframe and upload it to the Hugging Face Hub.

In [98]:
from huggingface_hub.repository import Repository
_PATH = os.path.abspath(os.path.expanduser("~/.meerkat/hf/pathfinder-gen"))
_HF_PATH = os.path.abspath(os.path.expanduser("~/.meerkat/hf/pathfinder"))

path = str(_HF_PATH)
repo = Repository(
    local_dir=path,
    clone_from="meerkat-ml/pathfinder",
    repo_type="dataset",
)

repo.git_pull()


/Users/arjundd/.meerkat/hf/pathfinder is already a clone of https://huggingface.co/datasets/meerkat-ml/pathfinder. Make sure you pull the latest changes with `repo.git_pull()`.


In [100]:
out = os.path.join(_PATH, f"{pathfinder}.mk")
df.write(out)

,path,label,subfolder,difficulty
0,curv_baseline/imgs/121/sample_0.png,1,curv_baseline,easy
1,curv_baseline/imgs/121/sample_1.png,0,curv_baseline,easy
2,curv_baseline/imgs/121/sample_2.png,1,curv_baseline,easy
3,curv_baseline/imgs/121/sample_3.png,1,curv_baseline,easy
4,curv_baseline/imgs/121/sample_4.png,0,curv_baseline,easy
...,...,...,...,...
599995,curv_contour_length_14/imgs/138/sample_995.png,0,curv_contour_length_14,hard
599996,curv_contour_length_14/imgs/138/sample_996.png,1,curv_contour_length_14,hard
599997,curv_contour_length_14/imgs/138/sample_997.png,1,curv_contour_length_14,hard
599998,curv_contour_length_14/imgs/138/sample_998.png,0,curv_contour_length_14,hard
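
The cell above writes an uncompressed `pathfinder32.mk` directory under `_PATH`, while the upload log further down lists files named like `pathfinder32.mk.tar.gz`. The archiving step is not shown in this notebook. One possible way to do it, sketched with the standard-library `tarfile` module, is given here; the helper name `archive_directory` and the placeholder file are hypothetical, and this is not necessarily how the uploaded archives were produced.

```python
import os
import tarfile
import tempfile


def archive_directory(src_dir: str, dest_dir: str) -> str:
    """Create <dest_dir>/<name of src_dir>.tar.gz and return the archive path."""
    src_dir = os.path.normpath(src_dir)
    name = os.path.basename(src_dir)
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, f"{name}.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the directory name inside the archive, not the full path.
        tar.add(src_dir, arcname=name)
    return archive_path


# Usage on temporary stand-ins for the generation directory and the cloned repo.
with tempfile.TemporaryDirectory() as tmp:
    gen_dir = os.path.join(tmp, "pathfinder-gen", "pathfinder32.mk")
    repo_dir = os.path.join(tmp, "pathfinder")
    os.makedirs(gen_dir)
    with open(os.path.join(gen_dir, "placeholder.txt"), "w") as f:
        f.write("stand-in for the files written by df.write")

    archive = archive_directory(gen_dir, repo_dir)
    archive_name = os.path.basename(archive)
    with tarfile.open(archive, "r:gz") as tar:
        members = tar.getnames()

print(archive_name)
print(members)
```

In the notebook's terms, `src_dir` would correspond to `out` and `dest_dir` to `_HF_PATH`, so that the archive lands inside the cloned repository before `repo.push_to_hub` is called.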


In [101]:
repo.push_to_hub(commit_message="Add pathfinder meerkat dataframes")

Upload file pathfinder32.mk.tar.gz:   1%|          | 32.0k/3.51M [00:00<?, ?B/s]

Upload file pathfinder256.mk.tar.gz:   1%|          | 32.0k/3.51M [00:00<?, ?B/s]

Upload file pathfinder128.mk.tar.gz:   1%|          | 32.0k/3.51M [00:00<?, ?B/s]

Upload file pathfinder64.mk.tar.gz:   1%|          | 32.0k/3.51M [00:00<?, ?B/s]

To https://huggingface.co/datasets/meerkat-ml/pathfinder
   acf576b..d7e9cca  main -> main



'https://huggingface.co/datasets/meerkat-ml/pathfinder/commit/d7e9ccadc4adbf135cf246a6eac83489af80233e'