# Mouse Dataset Cleaning (MERFISH - CCFv3 Isocortex)

This notebook documents the data cleaning process for the mouse spatial transcriptomics dataset, focusing on cells within isocortical areas. The dataset is aligned to the Common Coordinate Framework version 3 (CCFv3), allowing each cell to be assigned anatomical coordinates within the mouse brain.

I reference the **Allen Institute's tutorial** for MERFISH data registration to CCFv3:

📘 **CCFv3 Registration Tutorial:**  
[Allen Institute GitHub - MERFISH CCF Registration](https://github.com/AllenInstitute/abc_atlas_access/blob/main/notebooks/merfish_ccf_registration_tutorial.ipynb)

This tutorial details how to:

- Register MERFISH transcriptomic data to CCFv3
- Assign anatomical parcellations to individual cells
- Extract spatial coordinates for downstream analysis

🧠 **Dataset Access:**  
The full dataset, including each cell’s transcriptomic profile, anatomical parcellation, and spatial coordinates, can be accessed from the Allen Brain Cell Atlas S3 bucket:

📂 [MERFISH-C57BL6J-638850-CCF Dataset – 2023-06-30](https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/index.html#metadata/MERFISH-C57BL6J-638850-CCF/20230630/views/)
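As a minimal sketch of how the metadata file used below could be fetched, the snippet builds a direct URL from the bucket root and prefix in the link above, plus the CSV filename loaded later in this notebook. The object key layout (prefix followed by filename) is an assumption read off the index listing, and `metadata_url` and `download_metadata` are hypothetical helper names, not part of any Allen Institute package.

```python
import os
from urllib.request import urlretrieve

# Bucket root and prefix are taken from the index link above.
BUCKET_URL = "https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com"
PREFIX = "metadata/MERFISH-C57BL6J-638850-CCF/20230630/views"
# Filename as loaded later in this notebook.
FILE_NAME = "cell_metadata_with_parcellation_annotation.csv"


def metadata_url(file_name=FILE_NAME):
    """Build a direct URL for one file under the views/ prefix (assumed key layout)."""
    return f"{BUCKET_URL}/{PREFIX}/{file_name}"


def download_metadata(dest_dir="../src/data", file_name=FILE_NAME):
    """Download the file into dest_dir unless it is already there (hypothetical helper)."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, file_name)
    if not os.path.exists(dest):
        urlretrieve(metadata_url(file_name), dest)
    return dest


# Only print the URL here; call download_metadata() explicitly to fetch the file.
print(metadata_url())
```

The download is deliberately not triggered on import, since the metadata table is large.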

---

This cleaned dataset, including both the transcriptomic and spatial information, was used as a reference for my project’s preprocessing and integration pipeline.


In [1]:
import os
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


In [2]:
# Define the data path
data_path = '../src/data'

file_path = os.path.join(data_path, "cell_metadata_with_parcellation_annotation.csv")

# Load the CSV file
df = pd.read_csv(file_path)
print(df.columns)

# Filter rows containing 'isocortex' in any column
iso_rows = df[df.apply(lambda row: row.astype(str).str.contains('isocortex', case=False, na=False).any(), axis=1)]

# Save filtered rows to a new CSV file
output_file = os.path.join(data_path, "mouse_isocortex_rows.csv")
iso_rows.to_csv(output_file, index=False)

print(f"Filtered rows saved to: {output_file}")


Index(['cell_label', 'brain_section_label', 'cluster_alias',
       'average_correlation_score', 'matrix_label', 'donor_label',
       'low_quality_mapping', 'donor_genotype', 'donor_sex', 'x_section',
       'y_section', 'z_section', 'neurotransmitter', 'division', 'class',
       'subclass', 'supertype', 'cluster', 'neurotransmitter_color',
       'division_color', 'class_color', 'subclass_color', 'supertype_color',
       'cluster_color', 'x_reconstructed', 'y_reconstructed',
       'z_reconstructed', 'parcellation_index', 'x_ccf', 'y_ccf', 'z_ccf',
       'parcellation_organ', 'parcellation_category', 'parcellation_division',
       'parcellation_structure', 'parcellation_substructure',
       'parcellation_organ_color', 'parcellation_category_color',
       'parcellation_division_color', 'parcellation_structure_color',
       'parcellation_substructure_color'],
      dtype='object')
Filtered rows saved to: ../src/data/mouse_isocortex_rows.csv
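The any-column keyword filter above runs a Python function once per row, which can be slow on a table of this size. A possible alternative, sketched here on a small toy table with made-up values, is to test each non-numeric column as a whole and combine the results. The toy data and the variable names are illustrative assumptions; the check at the end only shows that the two masks agree on this example.

```python
import pandas as pd

# Toy stand-in for the metadata table (values are illustrative only)
toy = pd.DataFrame({
    "cell_label": ["c1", "c2", "c3", "c4"],
    "x_ccf": [1.0, 2.0, 3.0, 4.0],
    "parcellation_division": ["Isocortex", "HPF", "Isocortex", None],
    "parcellation_structure": ["VISp", "CA1", "MOp", "unassigned"],
})

# Row-wise version, as in the cell above
row_mask = toy.apply(
    lambda row: row.astype(str).str.contains("isocortex", case=False, na=False).any(),
    axis=1,
)

# Column-wise version: numeric and boolean columns cannot contain the keyword,
# so only the remaining columns are searched, one whole column at a time
text_cols = toy.select_dtypes(exclude=["number", "bool"])
col_mask = text_cols.apply(
    lambda col: col.astype(str).str.contains("isocortex", case=False, na=False)
).any(axis=1)

# Both masks select the same rows on this toy table
assert (row_mask == col_mask).all()
print(toy[col_mask])
```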


In [3]:
# Shape of the filtered isocortex rows
print("Isocortex rows shape:", iso_rows.shape)

Isocortex rows shape: (929222, 41)


In [4]:
print(iso_rows.info())

<class 'pandas.core.frame.DataFrame'>
Index: 929222 entries, 740742 to 3663794
Data columns (total 41 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   cell_label                       929222 non-null  object 
 1   brain_section_label              929222 non-null  object 
 2   cluster_alias                    929222 non-null  int64  
 3   average_correlation_score        929222 non-null  float64
 4   matrix_label                     929222 non-null  object 
 5   donor_label                      929222 non-null  object 
 6   low_quality_mapping              929222 non-null  bool   
 7   donor_genotype                   929222 non-null  object 
 8   donor_sex                        929222 non-null  object 
 9   x_section                        929222 non-null  float64
 10  y_section                        929222 non-null  float64
 11  z_section                        929222 non-null  float64
 12  n

In [5]:
import os
import pandas as pd

# Define the data path
data_path = '../src/data'

# Load the previously filtered isocortex rows
input_file = os.path.join(data_path, "mouse_isocortex_rows.csv")
df = pd.read_csv(input_file)

# Specify the columns to keep
cols_to_keep = [
    'x_ccf', 'y_ccf', 'z_ccf',
    'parcellation_division', 'parcellation_structure', 'parcellation_substructure',
    'parcellation_category', 'parcellation_index',
    'cell_label', 'brain_section_label',
    'parcellation_structure_color', 'parcellation_substructure_color'
]

# Keep grey-matter isocortex cells whose structure label does not end in "-unassigned"
mask = (
    df['parcellation_division'].str.contains('Isocortex', case=False, na=False)
    & (df['parcellation_category'] == 'grey')
    & ~df['parcellation_structure'].str.contains(r'-unassigned$', regex=True, na=False)
)
df_filtered = df.loc[mask, cols_to_keep]

# Save the final filtered DataFrame
output_path = os.path.join(data_path, "mouse_ccf_isocortex_data.csv")
df_filtered.to_csv(output_path, index=False)

print(f"Final filtered data saved to: {output_path}")



Final filtered data saved to: ../src/data/mouse_ccf_isocortex_data.csv
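To make the three filter conditions easier to check, here is a small sketch that applies the same tests to a toy table. The rows and labels are invented for illustration and are not taken from the dataset. Each condition is kept as its own mask so it is clear which one removes which row.

```python
import pandas as pd

# Toy rows (illustrative values only), one per case the filter has to handle
toy = pd.DataFrame({
    "cell_label": ["c1", "c2", "c3", "c4", "c5"],
    "parcellation_division": ["Isocortex", "Isocortex", "Isocortex", "HPF", None],
    "parcellation_category": ["grey", "grey", "fiber tracts", "grey", "grey"],
    "parcellation_structure": ["VISp", "XX-unassigned", "cc", "CA1", None],
})

# Same three tests as in the cell above, kept as separate masks
is_isocortex = toy["parcellation_division"].str.contains("Isocortex", case=False, na=False)
is_grey = toy["parcellation_category"].isin(["grey"])
is_assigned = ~toy["parcellation_structure"].str.contains(r"-unassigned$", regex=True, na=False)

kept = toy[is_isocortex & is_grey & is_assigned]
print(kept["cell_label"].tolist())  # ['c1']
```

One detail worth noting: because of `na=False`, a missing `parcellation_structure` counts as "not unassigned", so such a row is only removed if one of the other two conditions fails (as happens for `c5` here).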


In [7]:
import os
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Define the data path
data_path = '../src/data'

# Load the filtered isocortex data
input_path = os.path.join(data_path, "mouse_ccf_isocortex_data.csv")
df_filtered = pd.read_csv(input_path)

# OPTIONAL: Check how many rows you have
print("Total number of rows:", len(df_filtered))

# Sample a subset of cells, since the full pairwise matrix would not fit in memory
# Larger values (e.g. 2000 or 5000) may work if your system can handle them
sample_size = 43
df_sampled = df_filtered.sample(n=sample_size, random_state=42)

# Extract 3D CCF coordinates
ccf_coords = df_sampled[['x_ccf', 'y_ccf', 'z_ccf']]

# Compute pairwise Euclidean distances as a square (sample_size x sample_size) NumPy array
distance_array = squareform(pdist(ccf_coords, metric='euclidean'))

# Save as .npy file
# Note: despite the "no_first_row_col" suffix, the first row and column are not dropped;
# the full matrix is saved
output_filename = f"mouse_ccf_isocortex_distance_matrix_sampled_{sample_size}_no_first_row_col.npy"
distance_output_path = os.path.join(data_path, output_filename)
np.save(distance_output_path, distance_array)

print("Distance matrix saved to:", distance_output_path)


Total number of rows: 929222
Distance matrix saved to: ../src/data/mouse_ccf_isocortex_distance_matrix_sampled_43_no_first_row_col.npy
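The reason for sampling becomes concrete with a quick size estimate. The sketch below assumes 8-byte float64 distances and compares the full square matrix with the condensed form that `pdist` returns before `squareform` expands it. The helper names are hypothetical; the row count is the one printed above.

```python
def dense_matrix_bytes(n, itemsize=8):
    """Bytes needed for a full n x n matrix of float64 distances."""
    return n * n * itemsize


def condensed_matrix_bytes(n, itemsize=8):
    """Bytes needed for the condensed form: n * (n - 1) / 2 unique pairs."""
    return n * (n - 1) // 2 * itemsize


n_cells = 929222  # total rows reported above

for n in (43, 5000, n_cells):
    dense_gb = dense_matrix_bytes(n) / 1e9
    condensed_gb = condensed_matrix_bytes(n) / 1e9
    print(f"n={n}: dense {dense_gb:.6f} GB, condensed {condensed_gb:.6f} GB")
```

For all cells the square matrix would need several terabytes, while a 5000-cell sample needs about 0.2 GB, which is why a sampled subset is used here.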


In [8]:
# Shape of the filtered isocortex rows
print("Isocortex rows shape:", df_filtered.shape)

Isocortex rows shape: (929222, 12)


In [9]:
# Count unique values in parcellation_structure
structure_counts = df_filtered['parcellation_structure'].value_counts()
print(structure_counts)

parcellation_structure
MOs        90193
MOp        83302
SSs        66149
VISp       61670
SSp-m      55780
SSp-bfd    55098
RSPv       40045
SSp-ul     29616
ACAv       27491
RSPd       27019
ACAd       25267
TEa        23659
SSp-n      23295
AId        22450
ORBl       21366
SSp-ll     19377
RSPagl     18398
PL         18197
AUDp       15556
ORBvl      14968
AUDv       13150
VISa       13069
VISC       12637
ORBm       11590
GU         11215
AIp        11181
VISl       11015
ECT        10456
AIv         9309
SSp-un      9140
VISam       8660
AUDd        8228
SSp-tr      8152
VISpor      8026
VISrl       7499
ILA         7049
VISpm       6226
VISpl       5256
VISli       4429
AUDpo       4120
VISal       3771
PERI        3583
FRP         2565
Name: count, dtype: int64


In [10]:
print(df_filtered['parcellation_structure'].nunique())

43


In [11]:
unique_structures = df_filtered['parcellation_structure'].unique()
print(unique_structures)

['VISpl' 'VISpor' 'VISp' 'RSPagl' 'RSPd' 'VISl' 'PERI' 'VISli' 'ECT' 'TEa'
 'VISpm' 'RSPv' 'AUDv' 'AUDpo' 'AUDp' 'VISal' 'VISrl' 'AUDd' 'SSs' 'VISam'
 'SSp-bfd' 'VISa' 'VISC' 'AIp' 'SSp-tr' 'SSp-un' 'SSp-ul' 'SSp-n' 'SSp-ll'
 'MOs' 'MOp' 'ACAv' 'ACAd' 'GU' 'SSp-m' 'AIv' 'AId' 'ORBl' 'ILA' 'PL'
 'ORBvl' 'ORBm' 'FRP']


In [12]:
print(len(unique_structures))

43
