# 🚀 filoma demo

Fast, multi-backend file analysis with a tiny API surface

In [1]:
import filoma

print(f"filoma version: {filoma.__version__}")

filoma version: 1.7.3


# 🔍📁 Directory Analysis

Do you want to analyze a directory of files and extract metadata, text content, and other useful information?   
`filoma` makes it super easy to do so with just a few lines of code.

In [2]:
from filoma.directories import DirectoryProfiler, DirectoryProfilerConfig

# Create a profiler using the typed config dataclass
config = DirectoryProfilerConfig(use_rust=True)
dp1 = DirectoryProfiler(config)

analysis = dp1.probe("../")
dp1.print_summary(analysis)

2025-09-13 11:11:17.520 | DEBUG    | filoma.directories.directory_profiler:__init__:343 - Interactive environment detected, disabling progress bars to avoid conflicts
2025-09-13 11:11:17.520 | INFO     | filoma.directories.directory_profiler:probe:430 - Starting directory analysis of '../' using 🦀 Rust (Parallel) implementation
2025-09-13 11:11:18.158 | SUCCESS  | filoma.directories.directory_profiler:probe:446 - Directory analysis completed in 0.64s - Found 62,801 items (58,979 files, 3,822 folders) using 🦀 Rust (Parallel)


Want to quickly see a report of your findings? filoma has you covered.

In [3]:
dp1.print_report(analysis)

## 📁 Directory to DataFrame

Now that you've seen what's up with your files, you might want to explore the data in a familiar format.   
`filoma` can convert the analysis results into a Polars (or Pandas) DataFrame in a single call.  
**NOTE**: Pandas support requires the `pd` extra which you can install by running `uv sync --extra pd` in your terminal.

In [4]:
from filoma import probe_to_df

df = probe_to_df("../", max_depth=2, enrich=True)
print(f"Found {len(df)} files")
df.head()

2025-09-13 11:11:18.176 | DEBUG    | filoma.directories.directory_profiler:__init__:343 - Interactive environment detected, disabling progress bars to avoid conflicts
2025-09-13 11:11:18.177 | INFO     | filoma.directories.directory_profiler:probe:430 - Starting directory analysis of '../' using 🐍 Python implementation
2025-09-13 11:11:18.527 | SUCCESS  | filoma.directories.directory_profiler:probe:446 - Directory analysis completed in 0.35s - Found 361 items (300 files, 61 folders) using 🐍 Python


Found 360 files


path,depth,parent,name,stem,suffix,size_bytes,modified_time,created_time,is_file,is_dir,owner,group,mode_str,inode,nlink,sha256,xattrs
str,i64,str,str,str,str,i64,str,str,bool,bool,str,str,str,i64,i64,str,str
"""../pyproject.toml""",1,"""..""","""pyproject.toml""","""pyproject""",""".toml""",1838,"""2025-09-11 18:00:08""","""2025-09-11 18:00:08""",True,False,"""kalfasy""","""kalfasy""","""-rw-rw-r--""",7579961,1,,"""{}"""
"""../scripts""",1,"""..""","""scripts""","""scripts""","""""",4096,"""2025-09-05 20:26:25""","""2025-09-05 20:26:25""",False,True,"""kalfasy""","""kalfasy""","""drwxrwxr-x""",7603122,2,,"""{}"""
"""../.pytest_cache""",1,"""..""",""".pytest_cache""",""".pytest_cache""","""""",4096,"""2025-07-05 22:28:03""","""2025-07-05 22:28:03""",False,True,"""kalfasy""","""kalfasy""","""drwxrwxr-x""",7604845,3,,"""{}"""
"""../.vscode""",1,"""..""",""".vscode""",""".vscode""","""""",4096,"""2025-07-06 11:11:18""","""2025-07-06 11:11:18""",False,True,"""kalfasy""","""kalfasy""","""drwxrwxr-x""",7591635,2,,"""{}"""
"""../Makefile""",1,"""..""","""Makefile""","""Makefile""","""""",2827,"""2025-09-07 22:29:37""","""2025-09-07 22:29:37""",True,False,"""kalfasy""","""kalfasy""","""-rw-rw-r--""",7603119,1,,"""{}"""


In [5]:
print(f"Type of df: {type(df.to_pandas())}, shape: {df.to_pandas().shape}")

Type of df: <class 'pandas.core.frame.DataFrame'>, shape: (360, 18)


#### ⚡ DataFrame enrichment

You're probably wondering "what is `enrich=True`?"  
Well, since `filoma` gathers the paths of your files in a DataFrame, why not enrich this DataFrame with additional metadata? Its own `DataFrame` class has convenience methods such as `add_path_components()`, `add_file_stats_cols()`, and `add_depth_col()`.  
  
Let's see it in action:

In [6]:
from filoma.directories import DirectoryProfiler, DirectoryProfilerConfig

cfg = DirectoryProfilerConfig(build_dataframe=True, use_rust=True, return_absolute_paths=True)
dprof = DirectoryProfiler(cfg)
res = dprof.probe("../")

default_columns = res.dataframe.columns
print(f"Columns before enrich: {default_columns}")
print(f"{res.dataframe.head(3)}\n")

df = res.dataframe.enrich()
new_columns = list(set(df.columns) - set(default_columns))
new_columns.sort()
print(f"New columns after enrich: {new_columns}")
print(df.head(3))

2025-09-13 11:11:18.615 | DEBUG    | filoma.directories.directory_profiler:__init__:343 - Interactive environment detected, disabling progress bars to avoid conflicts
2025-09-13 11:11:18.616 | INFO     | filoma.directories.directory_profiler:probe:430 - Starting directory analysis of '../' using 🦀 Rust (Parallel) implementation
2025-09-13 11:11:19.448 | SUCCESS  | filoma.directories.directory_profiler:probe:446 - Directory analysis completed in 0.83s - Found 62,801 items (58,979 files, 3,822 folders) using 🦀 Rust (Parallel)


Columns before enrich: ['path']
shape: (3, 1)
┌───────────────────┐
│ path              │
│ ---               │
│ str               │
╞═══════════════════╡
│ ../pyproject.toml │
│ ../scripts        │
│ ../.pytest_cache  │
└───────────────────┘

New columns after enrich: ['created_time', 'depth', 'group', 'inode', 'is_dir', 'is_file', 'mode_str', 'modified_time', 'name', 'nlink', 'owner', 'parent', 'sha256', 'size_bytes', 'stem', 'suffix', 'xattrs']
shape: (3, 18)
┌──────────────────┬────────┬────────────────┬───────────────┬───┬───────┬────────┬────────┬───────┐
│ path             ┆ parent ┆ name           ┆ stem          ┆ … ┆ nlink ┆ sha256 ┆ xattrs ┆ depth │
│ ---              ┆ ---    ┆ ---            ┆ ---           ┆   ┆ ---   ┆ ---    ┆ ---    ┆ ---   │
│ str              ┆ str    ┆ str            ┆ str           ┆   ┆ i64   ┆ str    ┆ str    ┆ i64   │
╞══════════════════╪════════╪════════════════╪═══════════════╪═══╪═══════╪════════╪════════╪═══════╡
│ ../pyproject.tom ┆ ..    
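
For intuition about where such columns can come from, here is a minimal standard-library sketch that derives a few of them (`parent`, `name`, `stem`, `suffix`, `size_bytes`, `mode_str`, `inode`, `nlink`) for a single path. It is not `filoma`'s implementation: the helper name `describe_path` is hypothetical, and the demo file is created in a temporary directory purely for illustration.

```python
import stat
import tempfile
from datetime import datetime
from pathlib import Path


def describe_path(path: str) -> dict:
    """Collect a few path-derived and stat-derived fields for one path.

    Hypothetical helper for illustration only, not part of filoma.
    """
    p = Path(path)
    st = p.stat()
    return {
        "path": str(p),
        "parent": str(p.parent),
        "name": p.name,
        "stem": p.stem,
        "suffix": p.suffix,
        "size_bytes": st.st_size,
        "modified_time": datetime.fromtimestamp(st.st_mtime).strftime("%Y-%m-%d %H:%M:%S"),
        "is_file": p.is_file(),
        "is_dir": p.is_dir(),
        "mode_str": stat.filemode(st.st_mode),  # e.g. "-rw-rw-r--"
        "inode": st.st_ino,
        "nlink": st.st_nlink,
    }


# Demo on a throwaway file so the example does not depend on your disk layout
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "example.toml"
    demo_file.write_bytes(b"a = 1")  # 5 bytes
    row = describe_path(str(demo_file))

for key, value in row.items():
    print(f"{key}: {value}")
```

Applying a function like this to every entry of the `path` column is conceptually what turns a one-column DataFrame into a metadata table.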

## 🤖 ML-ready splits

The next logical step for `filoma` in data science workflows is to provide easy ways to create ML-ready splits.  

In a very simple case, you have a dataframe with a `split`/`label`/... column that you want to use to create the splits like so:  
> train, val, test = df[df["split"] == "train"], df[df["split"] == "val"], df[df["split"] == "test"]  

But things are rarely that simple in practice.  

When you're given data in the "real world", either in folders & files or in a dataframe, you often need to create the splits yourself.  
  
>*Unfortunately, many practitioners disregard the importance of data splits and split their data randomly, which can lead to data leakage, overfitting, and unrealistic performance metrics*  

A minimum best practice is to split your data into 3 sets:  
- *Training set*: used to train your model
- *Validation set*: used to tune your model's hyperparameters
- *Testing set*: used to evaluate the final performance of your model  

Ideally, you'd want to use a validation set that is representative of your test set, and both should be representative of your real-world data. So, special care should be taken to ensure that the splits are done correctly.  
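
One common way to reduce leakage is to assign whole groups of related files, rather than individual files, to a split. The sketch below illustrates that idea with a deterministic hash of a group key. It uses only the standard library and is not `filoma`'s implementation; the function `assign_split`, the file names, and the "text before the first underscore is the group" convention are all assumptions made for the example.

```python
import hashlib


def assign_split(group: str, ratios=(60, 20, 20), seed: int = 42) -> str:
    """Map a group key to 'train', 'val' or 'test' deterministically.

    Hypothetical helper: every item that shares a group key lands in the same split.
    """
    digest = hashlib.sha256(f"{seed}:{group}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % sum(ratios)  # integer in [0, sum(ratios))
    if bucket < ratios[0]:
        return "train"
    if bucket < ratios[0] + ratios[1]:
        return "val"
    return "test"


# Hypothetical file names: the text before the first underscore is the group
files = [
    "p01_scan1.png",
    "p01_scan2.png",
    "p02_scan1.png",
    "p03_scan1.png",
    "p03_scan2.png",
]
splits = {name: assign_split(name.split("_")[0]) for name in files}

for name, split in splits.items():
    print(f"{name} -> {split}")
```

Because the assignment depends only on the group key and the seed, both scans of `p01` always end up in the same split. With very few groups the realized proportions can be far from the requested ratios; they only approach 60/20/20 as the number of groups grows.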

##### 📚 traditional `sklearn` way

A very popular (if not the most popular) way of splitting data is `scikit-learn`'s `train_test_split` function, which splits your data into training and testing sets. It does that rather easily, although you need to call it twice for a 3-way split:  
> from sklearn.model_selection import train_test_split  
> train, temp = train_test_split(df, test_size=0.4)  
> val, test = train_test_split(temp, test_size=0.5)  

Isn't it confusing that you just wanted a 60/20/20 split but had to specify 40% and then 50%? Imagine doing this for a 70/20/10 split...
> from sklearn.model_selection import train_test_split  
> train, temp = train_test_split(df, test_size=0.3)  
> val, test = train_test_split(temp, test_size=0.3333)  
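
The arithmetic behind those two `test_size` values can be written down once. The helper below is a small sketch (the name `two_step_test_sizes` is hypothetical) that converts a train/val/test ratio into the fraction to hold out in the first call and the fraction of that hold-out to use as the test set in the second call.

```python
def two_step_test_sizes(train: float, val: float, test: float) -> tuple[float, float]:
    """Return (first_test_size, second_test_size) for two chained 2-way splits.

    first:  share of the full data that is held out (val + test)
    second: share of the held-out data that becomes the test set
    """
    total = train + val + test
    held_out = val + test
    if total <= 0 or held_out <= 0:
        raise ValueError("ratios must be positive and leave something to hold out")
    return held_out / total, test / held_out


for ratios in [(60, 20, 20), (70, 20, 10), (70, 15, 15)]:
    first, second = two_step_test_sizes(*ratios)
    print(ratios, round(first, 4), round(second, 4))
# (60, 20, 20) 0.4 0.5
# (70, 20, 10) 0.3 0.3333
# (70, 15, 15) 0.3 0.5
```

The two returned numbers are the values passed as `test_size` in the first and second `train_test_split` calls above.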

##### 🧙‍♂️ filoma's `split` method


`filoma` takes this a step further by providing a function that can not only split your data into training, validation, and testing sets in one go, but also do so based on features found in your filenames or directories.  


In [14]:
from filoma import DataFrame

# For example, your data might have subcategories encoded in filenames and folders,
# like dog/labrador_black_001.jpg, dog/labrador_black_002.jpg, cat/siamese_sealpoint_001.jpg, etc.
# You can use discover_filename_features to extract these features into separate columns.
data = {
    "path": [
        "dog/labrador_black_001.jpg",
        "dog/labrador_black_002.jpg",
        "dog/beagle_tricolor_003.jpg",
        "cat/siamese_sealpoint_001.jpg",
        "cat/siamese_sealpoint_002.jpg",
        "cat/mainecoon_tabby_003.jpg",
        "bird/sparrow_hatchling_001.jpg",
        "bird/robin_adult_002.jpg",
        "bird/robin_adult_003.jpg",
    ]
}

df = DataFrame(data)
df.discover_filename_features(sep="_", token_names=["breed", "color", "number"], path_col="path", enrich=False, include_parent=True, inplace=True)

path,breed,color,number,parent
str,str,str,str,str
"""dog/labrador_black_001.jpg""","""labrador""","""black""","""001""","""dog"""
"""dog/labrador_black_002.jpg""","""labrador""","""black""","""002""","""dog"""
"""dog/beagle_tricolor_003.jpg""","""beagle""","""tricolor""","""003""","""dog"""
"""cat/siamese_sealpoint_001.jpg""","""siamese""","""sealpoint""","""001""","""cat"""
"""cat/siamese_sealpoint_002.jpg""","""siamese""","""sealpoint""","""002""","""cat"""
"""cat/mainecoon_tabby_003.jpg""","""mainecoon""","""tabby""","""003""","""cat"""
"""bird/sparrow_hatchling_001.jpg""","""sparrow""","""hatchling""","""001""","""bird"""
"""bird/robin_adult_002.jpg""","""robin""","""adult""","""002""","""bird"""
"""bird/robin_adult_003.jpg""","""robin""","""adult""","""003""","""bird"""
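
To make the token extraction concrete, here is a rough standard-library sketch that produces the same columns for one path: split the stem on the separator, name the tokens in order, and take the immediate folder as `parent`. The function `filename_tokens` is hypothetical and only mirrors the output above; it says nothing about how `discover_filename_features` is implemented.

```python
from pathlib import PurePosixPath


def filename_tokens(path: str, token_names: list[str], sep: str = "_") -> dict:
    """Split a file stem on `sep` and label the pieces with `token_names`.

    Hypothetical helper for illustration; missing tokens become None.
    """
    p = PurePosixPath(path)
    tokens = p.stem.split(sep)
    row = {"path": path}
    for i, name in enumerate(token_names):
        row[name] = tokens[i] if i < len(tokens) else None
    row["parent"] = p.parent.name  # immediate folder, e.g. "dog"
    return row


row = filename_tokens("dog/labrador_black_001.jpg", ["breed", "color", "number"])
print(row)
# {'path': 'dog/labrador_black_001.jpg', 'breed': 'labrador', 'color': 'black', 'number': '001', 'parent': 'dog'}
```

Columns such as `breed` or `parent` are exactly the kind of key you would group on when you want related files to stay in the same split.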


In [15]:
# Count how many rows each breed has (Polars' value_counts)
print(df["breed"].value_counts())

shape: (6, 2)
┌───────────┬───────┐
│ breed     ┆ count │
│ ---       ┆ ---   │
│ str       ┆ u32   │
╞═══════════╪═══════╡
│ beagle    ┆ 1     │
│ robin     ┆ 2     │
│ labrador  ┆ 2     │
│ mainecoon ┆ 1     │
│ siamese   ┆ 2     │
│ sparrow   ┆ 1     │
└───────────┴───────┘


In [16]:
df.auto_split(seed=42, train_val_test=(60, 20, 20))

TypeError: auto_split() got an unexpected keyword argument 'how'

In [None]:
from filoma import ml

# Split into train/val/test sets with 70% train, 15% val, 15% test
train, val, test = ml.auto_split(df, train_val_test=(70, 15, 15), seed=42, include_all_parts=True)
print(f"Split sizes: {len(train)}, {len(val)}, {len(test)}")
train.head(3)

## 📄 Single file probe

In [None]:
from filoma import probe_file

file_info = probe_file("../README.md")
print(f"Path: {file_info.path}")
print(f"Size: {file_info.size}")
print(f"Modified: {file_info.modified}")

## 🖼️ Image analysis

In [None]:
from filoma import probe_image

img = probe_image("../images/logo.png")
print(f"Type: {img.file_type}")
print(f"Shape: {img.shape}")
print(f"Data range: {img.min} - {img.max}")