## Notebook Description

This notebook extracts embeddings from audio datasets with the **birdnetlib** library and saves them in Parquet format (an NPZ alternative exists in the code but is currently commented out). It handles multiple bird species datasets and keeps processing efficient through memory management, skip logic for already-processed files, and progress tracking.

### **Main Tasks:**

1. **Load and Organize Audio Files:**
   - Recursively collects `.wav` audio files (the extension is matched case-insensitively) from the `Output_files/Extended_audios` directory for the following datasets:
     - `chiffchaff-fg`
     - `littleowl-fg`
     - `pipit-fg`
     - `rtbc-begging`
     - `littlepenguin-display_call-exhale`
     - `greatTit_song-files`
     - `KiwiTrimmed`
   - A dropdown selects the dataset to process; the notebook then prints its number of audio files, the batch size, and the output location.

2. **Set Up Output Directories:**
   - Creates the output directory `Embeddings_from_3sPadding` if it does not exist.
   - **Parquet output**:
      - Creates a dedicated subfolder per dataset containing multiple small `part_*.parquet` files for incremental appends.
      - Maintains a **registry** file in Parquet format to track processed audio files and skip them on subsequent runs.
   - **NPZ output** (optional, currently commented out in the code):
      - Creates sharded `.npz` files (e.g., `_shard000.npz`) with embeddings, plus a registry file.

3. **Extract Embeddings Using BirdNET:**
   - Initializes the BirdNET model (`Analyzer`).
   - Iterates through the audio files, skipping already processed files (tracked in the Parquet registry).
   - For each audio file:
     - Loads the file into BirdNET.
     - Extracts embeddings and metadata (e.g., start time, end time).
     - Saves results incrementally:
         - **Parquet** → Writes a new `part_*.parquet` file every `BATCH_SIZE` audio files (5000 in the current configuration), plus a final part for the remainder.
         - **NPZ** → Appends to in-memory shard until reaching `NPZ_SHARD_SIZE`, then writes to disk.

4. **Handle Errors and Memory Management:**
   - Skips files that cannot be processed and logs errors.
   - Frees each `Recording` object and runs garbage collection after every file, and clears the row buffer after each flush.

5. **Post-Processing Analysis:**
   - Compares the original list of audio files with the processed files in the Parquet registry.
   - Reports the total number of processed and unprocessed files.
   - Displays a list of unprocessed audio files for further investigation.
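
The flush-and-register cycle from steps 3 and 4 can be sketched without BirdNET or Parquet. This is a minimal illustration under stated assumptions, not the notebook's code: `process_in_batches`, `extract`, and `write_part` are hypothetical names, the "parts" are plain Python lists, and the batch counter counts successful files only (the notebook counts loop iterations).

```python
from typing import Callable, Iterable


def process_in_batches(
    files: Iterable[str],
    extract: Callable[[str], list],
    write_part: Callable[[list], None],
    registry: set,
    batch_size: int = 2,
) -> list:
    """Extract rows per file, flush every `batch_size` files, then register them.

    Returns the files that raised an error (they stay out of the registry).
    """
    rows, pending, failed = [], [], []
    for path in files:
        if path in registry:           # skip logic: already processed
            continue
        try:
            file_rows = extract(path)  # may raise for unreadable audio
        except Exception:
            failed.append(path)
            continue
        rows.extend(file_rows)
        pending.append(path)
        if len(pending) >= batch_size:
            write_part(rows)           # persist the data first ...
            registry.update(pending)   # ... then mark the files as done
            rows, pending = [], []
    if pending:                        # final partial batch
        write_part(rows)
        registry.update(pending)
    return failed


# Toy run: "a.wav" is already registered and "b.wav" fails.
parts = []
registry = {"a.wav"}


def fake_extract(path):
    if path == "b.wav":
        raise ValueError("corrupt file")
    return [(path, 0.0, 3.0)]


failed = process_in_batches(
    ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav"],
    fake_extract, parts.append, registry,
)
print(len(parts), failed, sorted(registry))
```

Writing the part before updating the registry means an interruption can at worst cause a batch to be re-extracted on the next run; it cannot mark a file as done without its rows on disk.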

### **Output:**
-  **Parquet parts** for each dataset, saved under:
  ```
  Output_files/Embeddings_from_3sPadding/<dataset_name>_parquet_parts/
  ```
  Example:
  - `Output_files/Embeddings_from_3sPadding/littleowl_parquet_parts/part_0000.parquet` (embedding part)
  - `Output_files/Embeddings_from_3sPadding/littleowl_parquet_parts/littleowl_processed_files.parquet` (registry)

- *(Optional)* **NPZ shards** for each dataset, saved under:
   `Output_files/Embeddings_from_3sPadding/<dataset_name>_npz_shards/`
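
As a rough sketch of how the `part_NNNN.parquet` numbering can continue across runs, the snippet below scans a parts directory and returns the next free name. `next_part_path` is a hypothetical helper written for this illustration (the notebook has a similar `next_parquet_part_idx`), and the files are empty placeholders in a temporary directory.

```python
import re
import tempfile
from pathlib import Path

PART_RE = re.compile(r"part_(\d{4})\.parquet$")


def next_part_path(parts_dir: Path) -> Path:
    """Return the path of the next unused part_NNNN.parquet in `parts_dir`."""
    used = [
        int(m.group(1))
        for p in parts_dir.glob("part_*.parquet")
        if (m := PART_RE.search(p.name))
    ]
    idx = max(used) + 1 if used else 0
    return parts_dir / f"part_{idx:04d}.parquet"


with tempfile.TemporaryDirectory() as tmp:
    parts_dir = Path(tmp) / "littleowl_parquet_parts"
    parts_dir.mkdir()
    first = next_part_path(parts_dir).name
    (parts_dir / "part_0000.parquet").touch()
    # The registry lives in the same folder but does not match part_*.parquet
    (parts_dir / "littleowl_processed_files.parquet").touch()
    second = next_part_path(parts_dir).name

print(first, second)
```

Because the registry file name does not start with `part_`, it never interferes with the numbering even though it shares the folder.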


---

### **Key Features:**
- **Efficient incremental storage** using Parquet (fast, compressed, and columnar) or NPZ (compact for ML training).
- **Skip logic** to avoid reprocessing already-processed audio files.
- **Batch/shard writing** to improve speed and reduce memory load.
- **Dataset registry** to track progress across runs.
- **Detailed stats** on processed vs. pending files.
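
The skip logic reduces to a set difference between the file names on disk and the names in the registry. The sketch below is illustrative only: `split_todo` and the `rec_*.wav` paths are made up for the example, and the registry is a plain set rather than a Parquet file.

```python
from pathlib import PurePosixPath


def split_todo(audio_paths, registered_names):
    """Return (todo_paths, done_names) using basenames as the registry key."""
    by_name = {PurePosixPath(p).name: p for p in audio_paths}
    todo_names = sorted(set(by_name) - set(registered_names))
    done_names = sorted(set(by_name) & set(registered_names))
    return [by_name[n] for n in todo_names], done_names


audio_paths = [
    "Extended_audios/pipit-fg/rec_01.wav",
    "Extended_audios/pipit-fg/rec_02.wav",
    "Extended_audios/pipit-fg/rec_03.wav",
]
todo, done = split_todo(audio_paths, registered_names={"rec_02.wav"})
print(todo)
print(done)
```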

In [None]:
# === Core imports ===
import os
import glob
import gc
from pathlib import Path

import numpy as np
import pandas as pd
import soundfile as sf

from birdnetlib import Recording
from birdnetlib.analyzer import Analyzer

import ipywidgets as widgets
from IPython.display import display


# =========================
# Output configuration
# =========================
OUTPUT_FORMAT = "parquet"   # or "npz"
PARQUET_COMPRESSION = "zstd"   # "snappy" also fine
PARQUET_ENGINE = "pyarrow"
BATCH_SIZE = 5000  # Number of audio files to process in each batch
#NPZ_SHARD_SIZE = 5000  # number of vocalizations per shard before flushing to disk
EMB_DTYPE = np.float32 # Numeric precision (float32 is good for embeddings)


# =========================
# Paths
# =========================
cwd = Path.cwd()
project_root = cwd.parents[1]
base_path = project_root / 'Output_files' / 'Extended_audios' #Path base
output_dir = project_root / 'Output_files' / 'Embeddings_from_3sPadding' # Output root for embeddings
output_dir.mkdir(parents=True, exist_ok=True) # Create output directory if it doesn't exist

# =========================
# Utility: find audio files
# =========================
def get_audio_files(base_path, folder_name, pattern="**/*.[Ww][Aa][Vv]"):
    return glob.glob(f"{str(base_path)}/{folder_name}/{pattern}", recursive=True)

# =========================
# Output targets per dataset
# - For PARQUET: we save many small parquet "parts" (easy to append), plus a registry of processed files.
# - For NPZ: we save shards: *_shard000.npz, *_shard001.npz, ... plus a small index of processed files.
# =========================
def parquet_targets_for(dataset_name: str):
    ds_dir = output_dir / f"{dataset_name}_parquet_parts"
    ds_dir.mkdir(parents=True, exist_ok=True)
    # Registry to quickly skip already processed files
    registry_path = ds_dir / f"{dataset_name}_processed_files.parquet"
    return ds_dir, registry_path

def npz_targets_for(dataset_name: str):
    ds_dir = output_dir / f"{dataset_name}_npz_shards"
    ds_dir.mkdir(parents=True, exist_ok=True)
    # We’ll write shards like f"{dataset_name}_shard000.npz"
    shard_prefix = ds_dir / f"{dataset_name}_shard"
    registry_path = ds_dir / f"{dataset_name}_processed_files.parquet"  # keep same registry approach
    return ds_dir, shard_prefix, registry_path

# =========================
# Dataset map
# Each entry: (list_of_audio_paths, output_descriptor)
# =========================
def out_descriptor(dataset_name: str):
    if OUTPUT_FORMAT.lower() == "parquet":
        ds_dir, registry = parquet_targets_for(dataset_name)
        return {"format": "parquet", "parts_dir": ds_dir, "registry": registry}
    elif OUTPUT_FORMAT.lower() == "npz":
        ds_dir, shard_prefix, registry = npz_targets_for(dataset_name)
        return {"format": "npz", "shard_prefix": shard_prefix, "registry": registry}
    else:
        raise ValueError("OUTPUT_FORMAT must be 'parquet' or 'npz'")

# Datasets
audios_chiffchaff     = get_audio_files(base_path, "chiffchaff-fg")
audios_littleowl      = get_audio_files(base_path, "littleowl-fg")
audios_pipit          = get_audio_files(base_path, "pipit-fg")
audios_rtbc           = get_audio_files(base_path, "rtbc-begging")
audios_littlepenguin  = get_audio_files(base_path, "littlepenguin-display_call-exhale")
audios_greatTit       = get_audio_files(base_path, "greatTit_song-files")
audios_kiwi           = get_audio_files(base_path, "KiwiTrimmed")  # Kiwi dataset, if needed

dataset_map = {
    "chiffchaff-fg": (audios_chiffchaff,    out_descriptor("chiffchaff")),
    "littleowl-fg":  (audios_littleowl,     out_descriptor("littleowl")),
    "pipit-fg":      (audios_pipit,         out_descriptor("pipit")),
    "rtbc-begging":  (audios_rtbc,          out_descriptor("rtbc")),
    "littlepenguin-display_call-exhale": (audios_littlepenguin, out_descriptor("littlepenguin")),
    "greatTit_song-files": (audios_greatTit, out_descriptor("greatTit")),
    "KiwiTrimmed":   (audios_kiwi,          out_descriptor("kiwi")),
}

# =========================
# Dropdown for dataset selection
# =========================
dataset_dropdown = widgets.Dropdown(
    options=[("Select a dataset", None)] + [(k, k) for k in dataset_map.keys()],
    description="Select dataset:",
    style={"description_width": "initial"},
    layout=widgets.Layout(width="50%"),
)

# Globals set when a dataset is picked
audios = []
output_desc = None
selected_key = None 

def on_selection_change(change):
    if change["new"] is not None:
        global audios, output_desc, selected_key
        selected_key = change["new"]
        audios, output_desc = dataset_map[selected_key]
        print(f"\nSelected: {selected_key}")
        print(f"Total audio files: {len(audios)}")
        print(f"batch size: {BATCH_SIZE} audio files per batch")
        if output_desc["format"] == "parquet":
            print(f"Output: Parquet parts directory -> {output_desc['parts_dir'].name}")
            print(f"Processed registry: {output_desc['registry'].name}")
        else:
            print(f"Output: NPZ shards prefix -> {output_desc['shard_prefix'].name}")
            print(f"Processed registry: {output_desc['registry'].name}")

dataset_dropdown.observe(on_selection_change, names="value")
display(dataset_dropdown)



Selected: KiwiTrimmed
Total audio files: 455
batch size: 5000 audio files per batch
Output: Parquet parts directory -> kiwi_parquet_parts
Processed registry: kiwi_processed_files.parquet


In [2]:
# Check selection and discover processed/unprocessed files using the registry
try:
    assert audios is not None and output_desc is not None
except Exception:
    print("❌ Please run the dataset selection cell first.")
else:
    # Build the set of original audio basenames
    print("Selected dataset:", selected_key)
    original_audios = set(Path(audio).name for audio in audios)

    # Load (or initialize) the processed-files registry (Parquet)
    reg_path = output_desc["registry"]
    if reg_path.exists():
        try:
            df_reg = pd.read_parquet(reg_path)
            if "file_name" in df_reg.columns:
                processed_files = set(df_reg["file_name"].astype(str).unique())
            else:
                processed_files = set()
        except Exception as e:
            print(f"⚠️ Could not read registry at {reg_path}: {e}")
            processed_files = set()
    else:
        processed_files = set()

    # Compute unprocessed files
    unprocessed_audios = original_audios - processed_files

    # Report
    print(f"Total audios in the original list: {len(original_audios)}")
    print(f"Total audios already processed (from registry): {len(processed_files)}")
    print(f"Total audios not yet processed: {len(unprocessed_audios)}")

    # (Optional) Preview a few unprocessed/processed files for sanity check
    show_n = 5
    if unprocessed_audios:
        ex_unproc = sorted(list(unprocessed_audios))[:show_n]
        print(f"\nExamples of unprocessed files (up to {show_n}):")
        for s in ex_unproc:
            print("  •", s)

    if processed_files:
        ex_proc = sorted(list(processed_files))[:show_n]
        print(f"\nExamples of processed files (up to {show_n}):")
        for s in ex_proc:
            print("  •", s)


    # Map from basename to full path (for this dataset)
    path_map = {Path(p).name: p for p in audios}

    # Build a list of full paths we actually need to process
    todo_list_paths = []
    missing_from_map = []
    for fname in sorted(list(unprocessed_audios)):
        if fname in path_map:
            todo_list_paths.append(path_map[fname])
        else:
            missing_from_map.append(fname)

    print(f"\nResolved {len(todo_list_paths)} files to full paths.")
    if missing_from_map:
        print(f"⚠️ {len(missing_from_map)} names were not found in path_map (skipping):")
        for s in missing_from_map[:5]:
            print("  •", s)



Selected dataset: KiwiTrimmed
Total audios in the original list: 455
Total audios already processed (from registry): 0
Total audios not yet processed: 455

Examples of unprocessed files (up to 5):
  • 10_2020_10_14_21_07_trim.wav
  • 10_2020_10_15_21_57_trim.wav
  • 11_2020_10_12_2_07_trim.wav
  • 11_2020_10_13_21_08_trim.wav
  • 11_2020_10_2_22_58_trim.wav

Resolved 455 files to full paths.
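
Both the registry and `path_map` are keyed by basename, and the file search is recursive, so two files with the same name in different subfolders would collapse into a single entry. For KiwiTrimmed the counts above agree (455 paths, 455 unique names), so no collision occurs there. For other datasets, a check along these lines may be useful; `find_basename_collisions` and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_basename_collisions(paths):
    """Map each basename that occurs more than once to all of its full paths."""
    groups = defaultdict(list)
    for p in paths:
        groups[PurePosixPath(p).name].append(p)
    return {name: ps for name, ps in groups.items() if len(ps) > 1}


paths = [
    "KiwiTrimmed/site_a/call_01.wav",
    "KiwiTrimmed/site_b/call_01.wav",  # same basename, different folder
    "KiwiTrimmed/site_a/call_02.wav",
]
collisions = find_basename_collisions(paths)
print(collisions)
```

If collisions turn up, one option is to key the registry by the path relative to the dataset folder instead of the bare file name.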


In [3]:
import gc
import os
import re
from contextlib import redirect_stdout, redirect_stderr
from pathlib import Path

import numpy as np
import pandas as pd

try:
    from tqdm import tqdm
except ImportError:
    # Fallback if tqdm isn't available: iterate without a progress bar
    def tqdm(iterable, **kwargs):
        return iterable

# =========================
# Embedding extraction + saving (Parquet or NPZ)
# =========================

# Safety checks
if "audios" not in globals() or "output_desc" not in globals():
    raise RuntimeError("❌ Please run the dataset selection cell first.")
if "unprocessed_audios" not in globals():
    raise RuntimeError("❌ Please run the registry cell to compute 'unprocessed_audios'.")

if len(unprocessed_audios) == 0:
    print("✅ All audio files have already been processed. No new files to process.")
else:
    # Initialize BirdNET once
    analyzer = Analyzer()

    # Helpers to derive next file index (part/shard)
    def next_parquet_part_idx(parts_dir: Path) -> int:
        pattern = re.compile(r"part_(\d{4})\.parquet$")
        idxs = []
        for p in parts_dir.glob("part_*.parquet"):
            m = pattern.search(p.name)
            if m: idxs.append(int(m.group(1)))
        return (max(idxs) + 1) if idxs else 0

    def next_npz_shard_idx(prefix: Path) -> int:
        # matches e.g. mydataset_shard012.npz
        pattern = re.compile(re.escape(prefix.name) + r"(\d{3})\.npz$")
        idxs = []
        for p in prefix.parent.glob(prefix.name + "*.npz"):
            m = pattern.match(p.name)
            if m: idxs.append(int(m.group(1)))
        return (max(idxs) + 1) if idxs else 0

    # Load current registry (so we can append to it after each flush)
    reg_path = output_desc["registry"]
    if reg_path.exists():
        df_reg = pd.read_parquet(reg_path)
        processed_files = set(df_reg["file_name"].astype(str).unique()) if "file_name" in df_reg.columns else set()
    else:
        df_reg = pd.DataFrame(columns=["file_name"])
        processed_files = set()

    # Iterate only over unprocessed files
    # todo_list = sorted(list(unprocessed_audios))
    if "todo_list_paths" not in globals() or not todo_list_paths:
        raise RuntimeError("❌ 'todo_list_paths' is missing. Re-run the registry cell.")

    # Files processed since the last registry update (flush interval: BATCH_SIZE)
    processed_in_this_run = []

    if output_desc["format"] == "parquet":
        parts_dir: Path = output_desc["parts_dir"]
        part_idx = next_parquet_part_idx(parts_dir)

        rows = []  # will accumulate dict rows for a Parquet part

        for k, audio_path in enumerate(tqdm(todo_list_paths, desc="Extracting embeddings", unit="file"), 1):
            try:
                file_name = os.path.basename(audio_path)

                # BirdNET extraction (silence internal prints)
                with open(os.devnull, "w") as fnull, redirect_stdout(fnull), redirect_stderr(fnull):
                    recording = Recording(analyzer, audio_path)
                    recording.extract_embeddings()

                # Build rows (one row per frame) in a per-file buffer, so a failure
                # partway through does not leave partial rows for this file
                file_rows = []
                for emb in recording.embeddings:
                    row = {
                        "file_name": file_name,
                        "start_time": float(emb["start_time"]),
                        "end_time": float(emb["end_time"]),
                    }
                    # emb["embeddings"] is a 1024-dim vector
                    # store as float32 for compactness
                    for i, v in enumerate(emb["embeddings"]):
                        row[f"dim_{i+1}"] = np.float32(v)
                    file_rows.append(row)
                rows.extend(file_rows)

                processed_in_this_run.append(file_name)

                # Flush every BATCH_SIZE files
                if (k % BATCH_SIZE == 0) and rows:
                    df_chunk = pd.DataFrame(rows)
                    float_cols = [c for c in df_chunk.columns if c.startswith("dim_")]
                    if float_cols:
                        df_chunk[float_cols] = df_chunk[float_cols].astype("float32")

                    out_part = parts_dir / f"part_{part_idx:04d}.parquet"
                    df_chunk.to_parquet(out_part, index=False, engine=PARQUET_ENGINE, compression=PARQUET_COMPRESSION)
                    rows.clear()
                    part_idx += 1

                    # Update registry
                    df_new = pd.DataFrame({"file_name": processed_in_this_run})
                    df_reg = pd.concat([df_reg, df_new], ignore_index=True).drop_duplicates(subset=["file_name"])
                    df_reg.to_parquet(reg_path, index=False, engine=PARQUET_ENGINE, compression=PARQUET_COMPRESSION)
                    processed_in_this_run.clear()

                del recording
                gc.collect()

            except Exception as e:
                print(f"Error processing {audio_path}: {e}")

        # Final flush
        if rows:
            df_chunk = pd.DataFrame(rows)
            float_cols = [c for c in df_chunk.columns if c.startswith("dim_")]
            if float_cols:
                df_chunk[float_cols] = df_chunk[float_cols].astype("float32")

            out_part = parts_dir / f"part_{part_idx:04d}.parquet"
            df_chunk.to_parquet(out_part, index=False, engine=PARQUET_ENGINE, compression=PARQUET_COMPRESSION)
            rows.clear()
            part_idx += 1

        # Update registry for any remaining processed files
        if processed_in_this_run:
            df_new = pd.DataFrame({"file_name": processed_in_this_run})
            df_reg = pd.concat([df_reg, df_new], ignore_index=True).drop_duplicates(subset=["file_name"])
            df_reg.to_parquet(reg_path, index=False, engine=PARQUET_ENGINE, compression=PARQUET_COMPRESSION)

        print(f"✅ Parquet parts written to: {parts_dir}")

    # elif output_desc["format"] == "npz":
    #     shard_prefix: Path = output_desc["shard_prefix"]
    #     shard_idx = next_npz_shard_idx(shard_prefix)

    #     # Buffers for the current shard
    #     emb_list = []    # list of (T_i, 1024) arrays
    #     lengths = []     # list of T_i
    #     fnames = []      # one per vocalization
    #     starts = []      # optional: first start_time
    #     ends = []        # optional: last end_time

    #     def flush_npz_shard():
    #         global shard_idx, df_reg  # cell-level names; the list buffers are only mutated in place

    #         if not lengths:
    #             return

    #         lengths_arr = np.asarray(lengths, dtype=np.int32)
    #         offsets = np.zeros(len(lengths_arr) + 1, dtype=np.int64)
    #         offsets[1:] = np.cumsum(lengths_arr)

    #         D = 1024
    #         X = np.empty((int(offsets[-1]), D), dtype=EMB_DTYPE)
    #         pos = 0
    #         for E in emb_list:
    #             n = E.shape[0]
    #             X[pos:pos+n] = E
    #             pos += n

    #         out_npz = shard_prefix.parent / f"{shard_prefix.name}{shard_idx:03d}.npz"
    #         np.savez_compressed(
    #             out_npz,
    #             X=X, lengths=lengths_arr, offsets=offsets,
    #             starts=np.asarray(starts, dtype=np.float32),
    #             ends=np.asarray(ends, dtype=np.float32),
    #         )
    #         # filenames as separate .npy
    #         out_names = shard_prefix.parent / f"{shard_prefix.name}{shard_idx:03d}_filenames.npy"
    #         np.save(out_names, np.array(fnames, dtype=object))

    #         # Update registry
    #         df_new = pd.DataFrame({"file_name": fnames})
    #         df_reg = pd.concat([df_reg, df_new], ignore_index=True).drop_duplicates(subset=["file_name"])
    #         df_reg.to_parquet(reg_path, index=False, engine=PARQUET_ENGINE, compression=PARQUET_COMPRESSION)

    #         # Reset buffers
    #         emb_list.clear(); lengths.clear(); fnames.clear(); starts.clear(); ends.clear()
    #         shard_idx += 1

    #     for k, audio_path in enumerate(todo_list_paths, 1):
    #         try:
    #             file_name = os.path.basename(audio_path)

    #             recording = Recording(analyzer, audio_path)
    #             recording.extract_embeddings()

    #             # Build (T, 1024) matrix and start/end
    #             E = np.vstack([np.asarray(emb["embeddings"], dtype=EMB_DTYPE)
    #                            for emb in recording.embeddings])  # (T, 1024)
    #             s0 = float(recording.embeddings[0]["start_time"])
    #             eN = float(recording.embeddings[-1]["end_time"])

    #             emb_list.append(E)
    #             lengths.append(E.shape[0])
    #             fnames.append(file_name)
    #             starts.append(s0)
    #             ends.append(eN)

    #             # Flush shard if needed
    #             if len(fnames) >= NPZ_SHARD_SIZE:
    #                 flush_npz_shard()

    #             del recording
    #             gc.collect()

    #         except Exception as e:
    #             print(f"Error processing {audio_path}: {e}")

    #     # Final shard
    #     flush_npz_shard()
    #     print(f"✅ NPZ shards written with prefix: {shard_prefix.name}")

    else:
        raise ValueError("Only OUTPUT_FORMAT='parquet' is currently enabled (the NPZ branch is commented out)")


INFO: Created TensorFlow Lite XNNPACK delegate for CPU.


Labels loaded.
load model True
Model loaded.
Labels loaded.
load_species_list_model
Meta model loaded.


Extracting embeddings: 100%|██████████| 455/455 [06:40<00:00,  1.14file/s]


✅ Parquet parts written to: /teamspace/studios/this_studio/Output_files/Embeddings_from_3sPadding/kiwi_parquet_parts
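
The extraction loop silences library prints by redirecting `stdout` and `stderr` to `os.devnull`. The pattern can be tried in isolation as below; `noisy_step` and `run_quietly` are hypothetical stand-ins, not birdnetlib functions. Note that `redirect_stdout` only replaces Python's `sys.stdout`, so messages written directly by native code are likely to bypass it.

```python
import io
import os
from contextlib import redirect_stdout, redirect_stderr


def noisy_step(x):
    """Stand-in for a library call that prints progress messages."""
    print("loading model ...")
    return x * 2


def run_quietly(func, *args, **kwargs):
    """Call `func` with Python-level stdout/stderr sent to os.devnull."""
    with open(os.devnull, "w") as sink, redirect_stdout(sink), redirect_stderr(sink):
        return func(*args, **kwargs)


# Capture the outer stdout to show that only the inner call is silenced.
captured = io.StringIO()
with redirect_stdout(captured):
    result = run_quietly(noisy_step, 21)  # its print goes to os.devnull
    print("done")                         # stdout is restored here

print(result, repr(captured.getvalue()))
```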


In [None]:
# Read the Parquet registry to retrieve the names of processed audio files
registry_path = output_desc["registry"]

if registry_path.exists():
    df_embeddings = pd.read_parquet(registry_path, engine=PARQUET_ENGINE)
    processed_audios = set(df_embeddings["file_name"].astype(str).unique())
else:
    df_embeddings = pd.DataFrame()
    processed_audios = set()

# Original audios (only filenames, not full paths)
original_audios = set(Path(audio).name for audio in audios)

# Unprocessed audios
unprocessed_audios = original_audios - processed_audios

# Display results
print(f"Total audios processed: {len(processed_audios)}")
print(f"Total audios not processed: {len(unprocessed_audios)}")

# Audios that were not processed (sorted for a stable listing)
print("Unprocessed audios:")
print("\n".join(sorted(unprocessed_audios)))

# Clean up memory
del df_embeddings


Total audios processed: 455
Total audios not processed: 0
Unprocessed audios:



In [None]:
import pandas as pd
from pathlib import Path

def load_selected_embeddings(debug: bool = False):
    """
    Load embeddings for the dataset currently selected via the dropdown.
    Uses globals: selected_key, audios, output_desc.
    Only reads real part files (excludes registry).
    """
    # Safety checks
    if 'output_desc' not in globals() or output_desc is None:
        print("❌ No dataset selected yet. Pick one in the dropdown first.")
        return None
    if 'selected_key' not in globals() or not selected_key:
        print("❌ No dataset selected yet. Pick one in the dropdown first.")
        return None

    if output_desc["format"] != "parquet":
        print("⚠ Current dataset is not using Parquet.")
        return None

    parts_dir: Path = output_desc["parts_dir"]
    if not parts_dir.exists():
        print(f"❌ Parts directory does not exist: {parts_dir}")
        return None

    # ✅ Only the embedding parts, not the registry
    part_files = sorted(parts_dir.glob("part_*.parquet"))
    if debug:
        print("All parquet files in folder:")
        for p in sorted(parts_dir.glob("*.parquet")):
            tag = " (PART)" if p.name.startswith("part_") else " (REGISTRY?)"
            print("  •", p.name, tag)

    if not part_files:
        print(f"⚠ No part_*.parquet found in {parts_dir}")
        return None

    # Read & concat
    df = pd.concat([pd.read_parquet(f, engine=PARQUET_ENGINE) for f in part_files],
                   ignore_index=True)

    # Optional sanity check
    dim_cols = [c for c in df.columns if c.startswith("dim_")]
    if dim_cols:
        nan_rate = float(df[dim_cols].isna().mean().mean())
        if debug:
            print(f"NaN rate across embedding dims: {nan_rate:.6f}")

    print(f"✅ Loaded {len(df)} rows from {len(part_files)} part files.")
    print(f"📂 Dataset: {selected_key}")
    print(f"🎵 Total audio files in folder: {len(audios)}")
    return df

df = load_selected_embeddings(debug=True)
if df is not None:
    print(df.head())



All parquet files in folder:
  • kiwi_processed_files.parquet  (REGISTRY?)
  • part_0000.parquet  (PART)
NaN rate across embedding dims: 0.000000
✅ Loaded 4421 rows from 1 part files.
📂 Dataset: KiwiTrimmed
🎵 Total audio files in folder: 455
                      file_name  start_time  end_time     dim_1  dim_2  \
0  10_2020_10_14_21_07_trim.wav         0.0       3.0  0.339824    0.0   
1  10_2020_10_14_21_07_trim.wav         3.0       6.0  1.204902    0.0   
2  10_2020_10_14_21_07_trim.wav         6.0       9.0  0.822589    0.0   
3  10_2020_10_14_21_07_trim.wav         9.0      12.0  0.482953    0.0   
4  10_2020_10_14_21_07_trim.wav        12.0      15.0  0.332097    0.0   

      dim_3     dim_4     dim_5  dim_6     dim_7  ...  dim_1015  dim_1016  \
0  1.098615  0.031068  1.136220    0.0  0.637314  ...       0.0  0.467335   
1  0.514429  0.081455  1.107298    0.0  0.393557  ...       0.0  0.259372   
2  0.137139  0.270432  0.878936    0.0  0.529867  ...       0.0  0.017661   
3  0.