# Notebook 01: GSE114007 Data Ingestion
## Kairos Therapeutics ML Prototype V0.1

**Author:** Pat Ovando-Roche, PhD  
**Date:** 2025-12-26  
**Dataset:** GSE114007 - OA vs Healthy knee cartilage RNA-seq  

---

### Purpose
Download and parse the GSE114007 dataset into:
1. `metadata.csv` ‚Äî sample phenotype information
2. `raw_source_matrix.csv` ‚Äî expression values as delivered
3. `ml_matrix.csv` ‚Äî log-transformed + z-scored for ML

### Dataset Summary (from Yin et al. 2023, Aging)
- **Tissue:** Human knee articular cartilage
- **Comparison:** 18 healthy controls vs 20 OA patients
- **Platform:** Illumina RNA-seq (GPL11154, GPL18573)
- **Published validation:** AUC = 1.0 for OA classification

---

## Cell 1: Setup and Directory Creation

In [1]:
"""
CELL 1: SETUP AND DIRECTORY CREATION
=====================================
Creates folder structure and imports required libraries.
"""

import os
import sys
import warnings
import gzip
import urllib.request
from pathlib import Path
from datetime import datetime

warnings.filterwarnings('ignore')

# Core data science
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# GEO data access
import GEOparse

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# ============================================
# Define paths (relative to notebook location)
# ============================================
PROJECT_ROOT = Path.cwd().parent  # Go up from notebooks/ to project root
DATA_RAW = PROJECT_ROOT / "data" / "raw" / "GSE114007"
DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
REPORTS_FIGURES = PROJECT_ROOT / "reports" / "figures"

# Create directories
for folder in [DATA_RAW, DATA_PROCESSED, REPORTS_FIGURES]:
    folder.mkdir(parents=True, exist_ok=True)
    print(f"‚úÖ Created/verified: {folder}")

# Version info
print("\n" + "="*60)
print("KAIROS ML PROTOTYPE V0.1 - GSE114007 DATA INGESTION")
print("="*60)
print(f"\nüìÖ Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"üêç Python: {sys.version.split()[0]}")
print(f"üì¶ pandas: {pd.__version__}")
print(f"üì¶ numpy: {np.__version__}")
print(f"üìÅ Working directory: {Path.cwd()}")
print(f"üìÅ Data will be saved to: {DATA_PROCESSED}")
print("\n‚úÖ Cell 1 complete. Ready for Cell 2.")

‚úÖ Created/verified: C:\Users\povan\Kairos_Therapeutics\data\raw\GSE114007
‚úÖ Created/verified: C:\Users\povan\Kairos_Therapeutics\data\processed
‚úÖ Created/verified: C:\Users\povan\Kairos_Therapeutics\reports\figures

KAIROS ML PROTOTYPE V0.1 - GSE114007 DATA INGESTION

üìÖ Timestamp: 2025-12-26 16:55:08
üêç Python: 3.10.11
üì¶ pandas: 2.3.3
üì¶ numpy: 1.26.4
üìÅ Working directory: C:\Users\povan\Kairos_Therapeutics\notebooks
üìÅ Data will be saved to: C:\Users\povan\Kairos_Therapeutics\data\processed

‚úÖ Cell 1 complete. Ready for Cell 2.


## Cell 2: Download GSE114007 Metadata via GEOparse

In [2]:
"""
CELL 2: DOWNLOAD GSE114007 METADATA
===================================
Uses GEOparse to download the GEO Series and extract sample metadata.
Note: For RNA-seq, expression data is typically NOT in GSM.table
"""

GEO_ID = "GSE114007"

print(f"üì• Downloading {GEO_ID} from GEO...")
print("   (This may take 1-3 minutes)\n")

# Download the GEO series
gse = GEOparse.get_GEO(geo=GEO_ID, destdir=str(DATA_RAW), silent=True)

print(f"‚úÖ Downloaded {GEO_ID}")
print(f"\nüìä Dataset Summary:")
print(f"   Title: {gse.metadata.get('title', ['N/A'])[0]}")
print(f"   Type: {gse.metadata.get('type', ['N/A'])[0]}")
print(f"   Platform(s): {list(gse.gpls.keys())}")
print(f"   Number of samples (GSMs): {len(gse.gsms)}")

# List all GSM IDs
gsm_ids = list(gse.gsms.keys())
print(f"\nüìã Sample IDs: {gsm_ids[:5]}... (showing first 5)")

print("\n‚úÖ Cell 2 complete. Ready for Cell 3.")

üì• Downloading GSE114007 from GEO...
   (This may take 1-3 minutes)

‚úÖ Downloaded GSE114007

üìä Dataset Summary:
   Title: Identification of transcription factors responsible for dysregulated networks in human osteoarthritis cartilage by global gene expression analysis
   Type: Expression profiling by high throughput sequencing
   Platform(s): ['GPL11154', 'GPL18573']
   Number of samples (GSMs): 38

üìã Sample IDs: ['GSM3130531', 'GSM3130532', 'GSM3130533', 'GSM3130534', 'GSM3130535']... (showing first 5)

‚úÖ Cell 2 complete. Ready for Cell 3.


## Cell 3: Extract Sample Metadata

In [3]:
"""
CELL 3: EXTRACT SAMPLE METADATA
===============================
Parse phenotype information from each GSM sample.
"""

metadata_records = []

for gsm_id, gsm in gse.gsms.items():
    record = {
        'sample_id': gsm_id,
        'title': gsm.metadata.get('title', [''])[0],
        'source_name': gsm.metadata.get('source_name_ch1', [''])[0],
        'organism': gsm.metadata.get('organism_ch1', [''])[0],
        'platform': gsm.metadata.get('platform_id', [''])[0],
    }
    
    # Parse characteristics (contains disease status, age, sex, etc.)
    characteristics = gsm.metadata.get('characteristics_ch1', [])
    for char in characteristics:
        if ':' in char:
            key, value = char.split(':', 1)
            key = key.strip().lower().replace(' ', '_')
            value = value.strip()
            record[key] = value
    
    metadata_records.append(record)

# Create DataFrame
metadata_df = pd.DataFrame(metadata_records)

print("üìã Extracted Metadata Columns:")
print(f"   {list(metadata_df.columns)}")
print(f"\nüìä Shape: {metadata_df.shape[0]} samples √ó {metadata_df.shape[1]} fields")
print("\nüìã First 5 samples:")
display(metadata_df.head())

# Check for disease status column
print("\nüîç Looking for disease/condition columns...")
disease_cols = [col for col in metadata_df.columns if any(x in col.lower() for x in ['disease', 'condition', 'status', 'diagnosis', 'oa', 'osteoarthritis'])]
if disease_cols:
    print(f"   Found: {disease_cols}")
    for col in disease_cols:
        print(f"\n   Value counts for '{col}':")
        print(metadata_df[col].value_counts().to_string())
else:
    print("   ‚ö†Ô∏è No obvious disease column found. Checking all columns...")
    for col in metadata_df.columns:
        unique_vals = metadata_df[col].nunique()
        if unique_vals <= 5:
            print(f"\n   '{col}' ({unique_vals} unique values):")
            print(f"   {metadata_df[col].value_counts().to_dict()}")

print("\n‚úÖ Cell 3 complete. Ready for Cell 4.")

üìã Extracted Metadata Columns:
   ['sample_id', 'title', 'source_name', 'organism', 'platform', 'age', 'sex', 'oa_grade']

üìä Shape: 38 samples √ó 8 fields

üìã First 5 samples:


Unnamed: 0,sample_id,title,source_name,organism,platform,age,sex,oa_grade
0,GSM3130531,Normal_Cart_2_2,Knee articular cartilage,Homo sapiens,GPL11154,35,F,1
1,GSM3130532,Normal_Cart_3_3,Knee articular cartilage,Homo sapiens,GPL11154,57,F,1
2,GSM3130533,Normal_Cart_4_4,Knee articular cartilage,Homo sapiens,GPL11154,26,M,1
3,GSM3130534,Normal_Cart_5_5,Knee articular cartilage,Homo sapiens,GPL11154,18,M,1
4,GSM3130535,Normal_Cart_6_6,Knee articular cartilage,Homo sapiens,GPL11154,28,M,1



üîç Looking for disease/condition columns...
   Found: ['oa_grade']

   Value counts for 'oa_grade':
oa_grade
4    20
1    18

‚úÖ Cell 3 complete. Ready for Cell 4.


## Cell 4: Standardize Disease Labels

In [4]:
"""
CELL 4: STANDARDIZE DISEASE LABELS
==================================
Create a clean 'condition' column with values: 'OA' or 'Control'
"""

# Identify the disease column (may vary in naming)
# Common names: 'disease_state', 'disease', 'diagnosis', 'tissue'

def find_disease_column(df):
    """Find the column containing disease/condition information."""
    candidates = ['disease_state', 'disease', 'diagnosis', 'condition', 
                  'disease_status', 'phenotype', 'group']
    
    for col in candidates:
        if col in df.columns:
            return col
    
    # Search by content
    for col in df.columns:
        values = df[col].str.lower().unique()
        if any('oa' in str(v) or 'osteoarthritis' in str(v) or 'control' in str(v) 
               or 'normal' in str(v) or 'healthy' in str(v) for v in values):
            return col
    return None

disease_col = find_disease_column(metadata_df)

if disease_col:
    print(f"‚úÖ Found disease column: '{disease_col}'")
    print(f"   Original values: {metadata_df[disease_col].unique()}")
    
    # Standardize to 'OA' and 'Control'
    def standardize_condition(val):
        val_lower = str(val).lower()
        if any(x in val_lower for x in ['oa', 'osteoarthritis', 'disease', 'patient']):
            return 'OA'
        elif any(x in val_lower for x in ['control', 'normal', 'healthy', 'non-oa']):
            return 'Control'
        else:
            return 'Unknown'
    
    metadata_df['condition'] = metadata_df[disease_col].apply(standardize_condition)
    
    print(f"\nüìä Standardized condition counts:")
    condition_counts = metadata_df['condition'].value_counts()
    print(condition_counts.to_string())
    
    # Verify expected counts (18 Control, 20 OA per Yin 2023)
    expected = {'Control': 18, 'OA': 20}
    actual = condition_counts.to_dict()
    
    if actual.get('Control', 0) == 18 and actual.get('OA', 0) == 20:
        print("\n‚úÖ Sample counts match expected (18 Control, 20 OA)")
    else:
        print(f"\n‚ö†Ô∏è Sample counts differ from expected (18 Control, 20 OA)")
        print(f"   This is OK - we'll work with what we have.")
else:
    print("‚ùå Could not automatically identify disease column.")
    print("   Please examine metadata_df.columns and update manually.")
    print(f"\n   Available columns: {list(metadata_df.columns)}")

print("\n‚úÖ Cell 4 complete. Ready for Cell 5.")

‚úÖ Found disease column: 'title'
   Original values: ['Normal_Cart_2_2' 'Normal_Cart_3_3' 'Normal_Cart_4_4' 'Normal_Cart_5_5'
 'Normal_Cart_6_6' 'Normal_Cart_7_3' 'Normal_Cart_9_7' 'Normal_Cart_10_8'
 'OA_Cart_1_7' 'OA_Cart_2_8' 'OA_Cart_3_9' 'OA_Cart_4_10' 'OA_Cart_5_5'
 'OA_Cart_6_1' 'OA_Cart_7_2' 'OA_Cart_8_5' 'OA_Cart_9_6' 'OA_Cart_10_9'
 'normal_01' 'normal_02' 'normal_03' 'normal_04' 'normal_05' 'normal_06'
 'normal_07' 'normal_08' 'normal_09' 'normal_10' 'OA_01' 'OA_02' 'OA_03'
 'OA_04' 'OA_05' 'OA_06' 'OA_07' 'OA_08' 'OA_09' 'OA_10']

üìä Standardized condition counts:
condition
OA         20
Control    18

‚úÖ Sample counts match expected (18 Control, 20 OA)

‚úÖ Cell 4 complete. Ready for Cell 5.


## Cell 5: Enumerate Supplementary Files

In [5]:
"""
CELL 5: ENUMERATE SUPPLEMENTARY FILES
=====================================
For RNA-seq datasets, expression data is in supplementary files,
not in the GSM.table field used by microarrays.
"""

print("üìÅ Supplementary Files Available:")
print("="*60)

# Get supplementary files from the GSE
supp_files = gse.metadata.get('supplementary_file', [])

if supp_files:
    print(f"\nüì¶ Series-level supplementary files ({len(supp_files)}):")
    for i, url in enumerate(supp_files, 1):
        filename = url.split('/')[-1]
        print(f"   {i}. {filename}")
        print(f"      URL: {url}")
else:
    print("   No series-level supplementary files found.")

# Check GSM-level supplementary files (each sample may have its own)
print(f"\nüì¶ Sample-level supplementary files (checking first 3 samples):")
gsm_supps = {}
for i, (gsm_id, gsm) in enumerate(list(gse.gsms.items())[:3]):
    gsm_files = gsm.metadata.get('supplementary_file', [])
    gsm_supps[gsm_id] = gsm_files
    print(f"\n   {gsm_id}:")
    if gsm_files:
        for f in gsm_files:
            print(f"      - {f.split('/')[-1]}")
    else:
        print(f"      (none)")

# Store for later use
SUPP_FILES = supp_files

print("\n‚úÖ Cell 5 complete. Ready for Cell 6.")

üìÅ Supplementary Files Available:

üì¶ Series-level supplementary files (3):
   1. GSE114007_OA_normalized.counts.txt.gz
      URL: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE114nnn/GSE114007/suppl/GSE114007_OA_normalized.counts.txt.gz
   2. GSE114007_normal_normalized.counts.txt.gz
      URL: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE114nnn/GSE114007/suppl/GSE114007_normal_normalized.counts.txt.gz
   3. GSE114007_raw_counts.xlsx
      URL: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE114nnn/GSE114007/suppl/GSE114007_raw_counts.xlsx

üì¶ Sample-level supplementary files (checking first 3 samples):

   GSM3130531:
      (none)

   GSM3130532:
      (none)

   GSM3130533:
      (none)

‚úÖ Cell 5 complete. Ready for Cell 6.


## Cell 6: Download Expression Matrix

In [6]:
"""
CELL 6: DOWNLOAD EXPRESSION MATRIX
==================================
Download the expression matrix from supplementary files.
Look for files containing: counts, TPM, FPKM, or expression.
"""

import urllib.request
import gzip
import shutil

def download_file(url, dest_path):
    """Download a file with progress indication."""
    print(f"   üì• Downloading: {url.split('/')[-1]}")
    try:
        urllib.request.urlretrieve(url, dest_path)
        size_mb = os.path.getsize(dest_path) / (1024 * 1024)
        print(f"   ‚úÖ Downloaded: {size_mb:.2f} MB")
        return True
    except Exception as e:
        print(f"   ‚ùå Failed: {e}")
        return False

def decompress_gz(gz_path, output_path):
    """Decompress a .gz file."""
    with gzip.open(gz_path, 'rb') as f_in:
        with open(output_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    print(f"   üì¶ Decompressed to: {output_path.name}")

# Identify likely expression matrix files
expression_keywords = ['count', 'tpm', 'fpkm', 'rpkm', 'expression', 'matrix', 'abundance']
candidate_files = []

print("üîç Searching for expression matrix files...\n")

for url in SUPP_FILES:
    filename = url.split('/')[-1].lower()
    if any(kw in filename for kw in expression_keywords):
        candidate_files.append(url)
        print(f"   ‚≠ê Candidate: {url.split('/')[-1]}")

# If no obvious candidates, list all for manual selection
if not candidate_files:
    print("   ‚ö†Ô∏è No obvious expression files found by keyword.")
    print("   üìã All supplementary files:")
    for url in SUPP_FILES:
        print(f"      - {url.split('/')[-1]}")
    candidate_files = SUPP_FILES  # Try all

# Download all candidate files
downloaded_files = []

print("\nüì• Downloading candidate files...")
for url in candidate_files:
    filename = url.split('/')[-1]
    dest_path = DATA_RAW / filename
    
    if dest_path.exists():
        print(f"   ‚è≠Ô∏è Already exists: {filename}")
    else:
        download_file(url, dest_path)
    
    downloaded_files.append(dest_path)

# List downloaded files with sizes
print("\nüìÅ Downloaded files in data/raw/GSE114007:")
for f in DATA_RAW.iterdir():
    size_mb = f.stat().st_size / (1024 * 1024)
    print(f"   {f.name}: {size_mb:.2f} MB")

print("\n‚úÖ Cell 6 complete. Ready for Cell 7.")

üîç Searching for expression matrix files...

   ‚≠ê Candidate: GSE114007_OA_normalized.counts.txt.gz
   ‚≠ê Candidate: GSE114007_normal_normalized.counts.txt.gz
   ‚≠ê Candidate: GSE114007_raw_counts.xlsx

üì• Downloading candidate files...
   üì• Downloading: GSE114007_OA_normalized.counts.txt.gz
   ‚úÖ Downloaded: 1.59 MB
   üì• Downloading: GSE114007_normal_normalized.counts.txt.gz
   ‚úÖ Downloaded: 1.41 MB
   üì• Downloading: GSE114007_raw_counts.xlsx
   ‚úÖ Downloaded: 3.86 MB

üìÅ Downloaded files in data/raw/GSE114007:
   GSE114007_family.soft.gz: 0.00 MB
   GSE114007_normal_normalized.counts.txt.gz: 1.41 MB
   GSE114007_OA_normalized.counts.txt.gz: 1.59 MB
   GSE114007_raw_counts.xlsx: 3.86 MB

‚úÖ Cell 6 complete. Ready for Cell 7.


## Cell 7: Parse Expression Matrix

In [14]:
"""
CELL 7 (CORRECTED): PARSE AND MERGE EXPRESSION MATRICES
========================================================
GSE114007 stores Normal and OA samples in SEPARATE files.
This cell loads BOTH and merges them into a single matrix.
"""

import gzip
import pandas as pd
from pathlib import Path

def load_expression_file(filepath):
    """
    Load a single expression file, handling .gz compression.
    Returns DataFrame with genes as index, samples as columns.
    """
    filename = filepath.name
    print(f"   Loading: {filename}")
    
    # Determine if gzipped
    is_gzipped = filename.endswith('.gz')
    
    # Try different separators
    separators = ['\t', ',', ' ']
    
    for sep in separators:
        try:
            if is_gzipped:
                df = pd.read_csv(filepath, compression='gzip', sep=sep, index_col=0)
            else:
                df = pd.read_csv(filepath, sep=sep, index_col=0)
            
            # Check if it looks like expression data (at least 10 columns)
            if df.shape[1] >= 10:
                print(f"      ‚úÖ Shape: {df.shape} (genes √ó samples)")
                print(f"      Columns (first 3): {list(df.columns[:3])}")
                print(f"      Index name: {df.index.name}")
                return df
        except Exception as e:
            continue
    
    print(f"      ‚ùå Could not parse {filename}")
    return None


# --- STEP 1: Find all expression matrix files ---
print("üîç Scanning for expression matrix files in data/raw/GSE114007/...\n")

expression_files = []
for f in sorted(DATA_RAW.iterdir()):
    fname_lower = f.name.lower()
    # Look for normalized count files (typical GEO naming)
    if any(kw in fname_lower for kw in ['normalized', 'counts', 'expression', 'fpkm', 'tpm']):
        if fname_lower.endswith(('.txt.gz', '.csv.gz', '.tsv.gz', '.txt', '.csv', '.tsv')):
            expression_files.append(f)
            print(f"   üìÑ Found: {f.name}")

print(f"\n‚úÖ Found {len(expression_files)} potential expression files.\n")


# --- STEP 2: Load each expression file ---
print("üì• Loading expression matrices...\n")

loaded_matrices = {}
for filepath in expression_files:
    df = load_expression_file(filepath)
    if df is not None and df.shape[1] >= 5:  # At least 5 samples
        # Identify if this is Normal or OA based on filename or column names
        fname_lower = filepath.name.lower()
        if 'normal' in fname_lower or 'control' in fname_lower:
            matrix_type = 'Normal'
        elif 'oa' in fname_lower or 'osteoarthritis' in fname_lower:
            matrix_type = 'OA'
        else:
            # Try to infer from column names
            cols_str = ' '.join(df.columns[:5]).lower()
            if 'normal' in cols_str or 'control' in cols_str:
                matrix_type = 'Normal'
            elif 'oa' in cols_str:
                matrix_type = 'OA'
            else:
                matrix_type = 'Unknown'
        
        loaded_matrices[filepath.name] = {
            'df': df,
            'type': matrix_type,
            'shape': df.shape
        }
        print(f"      Type: {matrix_type}\n")

print(f"‚úÖ Successfully loaded {len(loaded_matrices)} expression matrices.\n")


# --- STEP 3: Merge matrices if multiple exist ---
print("üîó Merging expression matrices...\n")

if len(loaded_matrices) == 0:
    print("‚ùå No expression matrices loaded. Check file formats.")
    expression_df = None

elif len(loaded_matrices) == 1:
    # Single file - use as-is
    key = list(loaded_matrices.keys())[0]
    expression_df = loaded_matrices[key]['df']
    print(f"   Single matrix found: {key}")
    print(f"   Shape: {expression_df.shape}")

else:
    # Multiple files - need to merge
    print(f"   Found {len(loaded_matrices)} matrices to merge:")
    for fname, info in loaded_matrices.items():
        print(f"      - {fname}: {info['shape']} ({info['type']})")
    
    # Get list of DataFrames
    dfs_to_merge = [info['df'] for info in loaded_matrices.values()]
    
    # Check if gene indices match
    print("\n   Checking gene index alignment...")
    first_index = set(dfs_to_merge[0].index)
    all_match = True
    for i, df in enumerate(dfs_to_merge[1:], 2):
        other_index = set(df.index)
        overlap = len(first_index & other_index)
        print(f"      Matrix 1 vs {i}: {overlap} genes in common")
        if overlap < len(first_index) * 0.9:  # Less than 90% overlap
            all_match = False
    
    if all_match or True:  # Proceed anyway with inner join
        # Merge on common genes (inner join on index)
        print("\n   Merging matrices (inner join on gene index)...")
        
        # Use concat with inner join on index
        expression_df = pd.concat(dfs_to_merge, axis=1, join='inner')
        
        # Check for duplicate column names
        if expression_df.columns.duplicated().any():
            print("   ‚ö†Ô∏è Warning: Duplicate column names detected. Keeping first occurrence.")
            expression_df = expression_df.loc[:, ~expression_df.columns.duplicated()]
        
        print(f"\n   ‚úÖ Merged matrix shape: {expression_df.shape}")
        print(f"      Genes: {expression_df.shape[0]}")
        print(f"      Samples: {expression_df.shape[1]}")

print()


# --- STEP 4: Display merged matrix info ---
if expression_df is not None:
    print("=" * 60)
    print("üìä MERGED EXPRESSION MATRIX SUMMARY")
    print("=" * 60)
    print(f"\n   Shape: {expression_df.shape[0]} genes √ó {expression_df.shape[1]} samples")
    print(f"\n   Sample columns (all {expression_df.shape[1]}):")
    for i, col in enumerate(expression_df.columns):
        print(f"      {i+1}. {col}")
    
    print(f"\n   Gene index (first 5): {list(expression_df.index[:5])}")
    print(f"\n   üìã Preview (5√ó5):")
    display(expression_df.iloc[:5, :5])
    
    # Check for Normal vs OA in column names
    normal_cols = [c for c in expression_df.columns if 'normal' in c.lower()]
    oa_cols = [c for c in expression_df.columns if 'oa' in c.lower()]
    print(f"\n   Column breakdown:")
    print(f"      Normal/Control samples: {len(normal_cols)}")
    print(f"      OA samples: {len(oa_cols)}")
    print(f"      Other/Unclassified: {expression_df.shape[1] - len(normal_cols) - len(oa_cols)}")

print("\n‚úÖ Cell 7 complete. Ready for Cell 8.")

üîç Scanning for expression matrix files in data/raw/GSE114007/...

   üìÑ Found: GSE114007_normal_normalized.counts.txt.gz
   üìÑ Found: GSE114007_OA_normalized.counts.txt.gz

‚úÖ Found 2 potential expression files.

üì• Loading expression matrices...

   Loading: GSE114007_normal_normalized.counts.txt.gz
      ‚úÖ Shape: (23710, 20) (genes √ó samples)
      Columns (first 3): ['Normal_Cart_10_8', 'Normal_Cart_2_2', 'Normal_Cart_3_3']
      Index name: symbol
      Type: Normal

   Loading: GSE114007_OA_normalized.counts.txt.gz
      ‚úÖ Shape: (23710, 22) (genes √ó samples)
      Columns (first 3): ['OA_Cart_1_7', 'OA_Cart_10_9', 'OA_Cart_2_8']
      Index name: symbol
      Type: Normal

‚úÖ Successfully loaded 2 expression matrices.

üîó Merging expression matrices...

   Found 2 matrices to merge:
      - GSE114007_normal_normalized.counts.txt.gz: (23710, 20) (Normal)
      - GSE114007_OA_normalized.counts.txt.gz: (23710, 22) (Normal)

   Checking gene index alignment...
    

Unnamed: 0_level_0,Normal_Cart_10_8,Normal_Cart_2_2,Normal_Cart_3_3,Normal_Cart_4_4,Normal_Cart_5_5
symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
FN1,16.277134,15.429753,15.428266,16.305868,14.635041
COMP,15.371944,14.51526,14.813281,14.776144,14.048698
MALAT1,15.441039,14.574888,15.053004,14.793931,14.773987
CHI3L2,7.645584,5.860772,6.055734,8.496841,6.743966
CLU,15.105566,14.493329,14.849689,14.704724,15.092099



   Column breakdown:
      Normal/Control samples: 19
      OA samples: 21
      Other/Unclassified: 1

‚úÖ Cell 7 complete. Ready for Cell 8.


## Cell 8: Alternative - Try Series Matrix File

In [15]:
"""
CELL 8: ALTERNATIVE - TRY SERIES MATRIX FILE
============================================
If supplementary files didn't work, try the series matrix file.
This is more common for microarray but sometimes works for RNA-seq.
"""

# Check if we already have expression data
if expression_df is not None and expression_df.shape[1] >= 30:
    print("‚úÖ Expression matrix already loaded from supplementary files.")
    print(f"   Shape: {expression_df.shape}")
    print("   Skipping series matrix attempt.")
else:
    print("üîç Attempting to extract expression from GEOparse pivot...")
    
    try:
        # Try GEOparse's built-in pivot method
        pivoted = gse.pivot_samples('VALUE')
        
        if pivoted is not None and not pivoted.empty:
            print(f"   ‚úÖ Pivot successful! Shape: {pivoted.shape}")
            expression_df = pivoted
            parsed_file = "GEOparse pivot"
            display(expression_df.iloc[:5, :5])
        else:
            print("   ‚ö†Ô∏è Pivot returned empty (typical for RNA-seq).")
    except Exception as e:
        print(f"   ‚ö†Ô∏è Pivot failed: {e}")
    
    # Try GSM tables directly
    print("\nüîç Checking individual GSM tables...")
    sample_gsm = list(gse.gsms.values())[0]
    
    if hasattr(sample_gsm, 'table') and sample_gsm.table is not None and not sample_gsm.table.empty:
        print(f"   ‚úÖ GSM tables available. Columns: {list(sample_gsm.table.columns)}")
        print(f"   Shape: {sample_gsm.table.shape}")
        
        # Build expression matrix from GSM tables
        all_tables = {}
        for gsm_id, gsm in gse.gsms.items():
            if hasattr(gsm, 'table') and gsm.table is not None and not gsm.table.empty:
                # Assume first column is gene ID, second is expression value
                if len(gsm.table.columns) >= 2:
                    gene_col = gsm.table.columns[0]
                    val_col = gsm.table.columns[1]
                    all_tables[gsm_id] = gsm.table.set_index(gene_col)[val_col]
        
        if all_tables:
            expression_df = pd.DataFrame(all_tables)
            parsed_file = "GSM tables"
            print(f"   ‚úÖ Built matrix from GSM tables: {expression_df.shape}")
    else:
        print("   ‚ö†Ô∏è GSM tables are empty (typical for RNA-seq deposited without processed data).")

print("\n‚úÖ Cell 8 complete. Ready for Cell 9.")

‚úÖ Expression matrix already loaded from supplementary files.
   Shape: (23710, 41)
   Skipping series matrix attempt.

‚úÖ Cell 8 complete. Ready for Cell 9.


## Cell 9: Validate and Align Expression with Metadata

In [16]:
"""
CELL 9 (CORRECTED): ALIGN EXPRESSION COLUMNS WITH METADATA
===========================================================
The expression columns use names like 'Normal_Cart_10_8', 'OA_Cart_1_7'
The metadata uses GSM IDs like 'GSM3130531'.

We need to match via the metadata 'title' column or other characteristics.
"""

import re

if expression_df is None or expression_df.empty:
    print("‚ùå STOP: No expression matrix available.")
    print("   Run Cell 7 first to load and merge expression data.")
else:
    print("üîó Aligning expression columns with metadata...\n")
    
    print(f"   Expression matrix shape: {expression_df.shape}")
    print(f"   Metadata shape: {metadata_df.shape}")
    print()
    
    # --- STEP 1: Get sample IDs from both sources ---
    expr_columns = list(expression_df.columns)
    meta_sample_ids = list(metadata_df['sample_id'])
    
    print(f"   Expression columns (first 5): {expr_columns[:5]}")
    print(f"   Metadata sample_id (first 5): {meta_sample_ids[:5]}")
    
    # Check if metadata has a 'title' column we can match
    if 'title' in metadata_df.columns:
        meta_titles = list(metadata_df['title'])
        print(f"   Metadata titles (first 5): {meta_titles[:5]}")
    print()
    
    # --- STEP 2: Try different matching strategies ---
    
    # Strategy A: Direct match (expression columns ARE GSM IDs)
    direct_overlap = set(expr_columns) & set(meta_sample_ids)
    print(f"   Strategy A - Direct GSM match: {len(direct_overlap)} matches")
    
    # Strategy B: Match expression column to metadata title
    title_matches = {}
    if 'title' in metadata_df.columns:
        for expr_col in expr_columns:
            # Clean the expression column name for matching
            expr_clean = expr_col.strip()
            
            for idx, row in metadata_df.iterrows():
                title = str(row['title']).strip()
                gsm_id = row['sample_id']
                
                # Try exact match
                if expr_clean == title:
                    title_matches[expr_col] = gsm_id
                    break
                
                # Try partial match (expression name contained in title or vice versa)
                if expr_clean in title or title in expr_clean:
                    title_matches[expr_col] = gsm_id
                    break
                
                # Try matching without underscores/spaces
                expr_normalized = expr_clean.lower().replace('_', '').replace(' ', '').replace('-', '')
                title_normalized = title.lower().replace('_', '').replace(' ', '').replace('-', '')
                if expr_normalized == title_normalized:
                    title_matches[expr_col] = gsm_id
                    break
    
    print(f"   Strategy B - Title matching: {len(title_matches)} matches")
    
    # Strategy C: Pattern-based matching for 'Normal_Cart_X_Y' / 'OA_Cart_X_Y' format
    pattern_matches = {}
    if len(title_matches) < len(expr_columns) * 0.5:
        print("\n   Trying Strategy C - Pattern-based matching...")
        
        # Build lookup from metadata titles
        title_to_gsm = {}
        if 'title' in metadata_df.columns:
            for idx, row in metadata_df.iterrows():
                title = str(row['title'])
                gsm_id = row['sample_id']
                # Store multiple normalized versions
                title_to_gsm[title] = gsm_id
                title_to_gsm[title.lower()] = gsm_id
                title_to_gsm[title.replace(' ', '_')] = gsm_id
                title_to_gsm[title.replace('_', ' ')] = gsm_id
        
        for expr_col in expr_columns:
            if expr_col in pattern_matches:
                continue
            
            # Try various transformations
            variants = [
                expr_col,
                expr_col.replace('_', ' '),
                expr_col.lower(),
                expr_col.lower().replace('_', ' '),
            ]
            
            for variant in variants:
                if variant in title_to_gsm:
                    pattern_matches[expr_col] = title_to_gsm[variant]
                    break
    
    print(f"   Strategy C - Pattern matching: {len(pattern_matches)} matches")
    
    # --- STEP 3: Use best matching strategy ---
    if len(direct_overlap) >= len(expr_columns) * 0.9:
        print("\n   ‚úÖ Using Strategy A (direct GSM match)")
        col_to_gsm = {col: col for col in expr_columns if col in meta_sample_ids}
    elif len(title_matches) >= len(expr_columns) * 0.5:
        print("\n   ‚úÖ Using Strategy B (title matching)")
        col_to_gsm = title_matches
    elif len(pattern_matches) >= len(expr_columns) * 0.5:
        print("\n   ‚úÖ Using Strategy C (pattern matching)")
        col_to_gsm = pattern_matches
    else:
        # Strategy D: Positional matching as last resort
        # Match by condition (Normal vs OA) and position
        print("\n   ‚ö†Ô∏è No good ID match found. Attempting positional matching by condition...")
        
        col_to_gsm = {}
        
        # Separate expression columns by type
        normal_expr_cols = sorted([c for c in expr_columns if 'normal' in c.lower()])
        oa_expr_cols = sorted([c for c in expr_columns if 'oa' in c.lower()])
        
        # Separate metadata by condition
        if 'condition' in metadata_df.columns:
            normal_meta = metadata_df[metadata_df['condition'] == 'Control'].sort_values('sample_id')
            oa_meta = metadata_df[metadata_df['condition'] == 'OA'].sort_values('sample_id')
        else:
            # Try to infer from title
            normal_meta = metadata_df[metadata_df['title'].str.lower().str.contains('normal|control', na=False)].sort_values('sample_id')
            oa_meta = metadata_df[~metadata_df['title'].str.lower().str.contains('normal|control', na=False)].sort_values('sample_id')
        
        print(f"\n   Normal: {len(normal_expr_cols)} expr cols, {len(normal_meta)} metadata rows")
        print(f"   OA: {len(oa_expr_cols)} expr cols, {len(oa_meta)} metadata rows")
        
        # Match by position within each group
        for i, expr_col in enumerate(normal_expr_cols):
            if i < len(normal_meta):
                col_to_gsm[expr_col] = normal_meta.iloc[i]['sample_id']
        
        for i, expr_col in enumerate(oa_expr_cols):
            if i < len(oa_meta):
                col_to_gsm[expr_col] = oa_meta.iloc[i]['sample_id']
        
        print(f"\n   Strategy D - Positional matching: {len(col_to_gsm)} matches")
    
    # --- STEP 4: Apply mapping and filter ---
    print(f"\n   Total matched: {len(col_to_gsm)} / {len(expr_columns)} expression columns")
    
    if len(col_to_gsm) > 0:
        # Rename columns to GSM IDs
        expression_df_aligned = expression_df.rename(columns=col_to_gsm)
        
        # Keep only columns that were mapped
        mapped_gsm_ids = list(col_to_gsm.values())
        expression_df_aligned = expression_df_aligned[mapped_gsm_ids]
        
        # Filter metadata to matched samples
        metadata_df_aligned = metadata_df[metadata_df['sample_id'].isin(mapped_gsm_ids)].copy()
        metadata_df_aligned = metadata_df_aligned.set_index('sample_id').loc[mapped_gsm_ids].reset_index()
        
        # Update the main dataframes
        expression_df = expression_df_aligned
        metadata_df = metadata_df_aligned
        
        print(f"\n   ‚úÖ Aligned expression shape: {expression_df.shape}")
        print(f"   ‚úÖ Aligned metadata shape: {metadata_df.shape}")
        
        # Show mapping table
        print(f"\n   üìã Sample mapping (first 10):")
        mapping_preview = [(orig, gsm) for orig, gsm in list(col_to_gsm.items())[:10]]
        for orig, gsm in mapping_preview:
            # Get condition for this sample
            cond = metadata_df[metadata_df['sample_id'] == gsm]['condition'].values
            cond_str = cond[0] if len(cond) > 0 else 'Unknown'
            print(f"      {orig} ‚Üí {gsm} ({cond_str})")
        
        # Verify condition distribution
        if 'condition' in metadata_df.columns:
            print(f"\n   üìä Condition distribution after alignment:")
            for cond, count in metadata_df['condition'].value_counts().items():
                print(f"      {cond}: {count}")
    else:
        print("\n   ‚ùå ERROR: Could not match expression columns to metadata.")
        print("   Manual inspection required. See mapping preview above.")

print("\n‚úÖ Cell 9 complete. Ready for Cell 10.")

üîó Aligning expression columns with metadata...

   Expression matrix shape: (23710, 41)
   Metadata shape: (38, 9)

   Expression columns (first 5): ['Normal_Cart_10_8', 'Normal_Cart_2_2', 'Normal_Cart_3_3', 'Normal_Cart_4_4', 'Normal_Cart_5_5']
   Metadata sample_id (first 5): ['GSM3130531', 'GSM3130532', 'GSM3130533', 'GSM3130534', 'GSM3130535']
   Metadata titles (first 5): ['Normal_Cart_2_2', 'Normal_Cart_3_3', 'Normal_Cart_4_4', 'Normal_Cart_5_5', 'Normal_Cart_6_6']

   Strategy A - Direct GSM match: 0 matches
   Strategy B - Title matching: 38 matches
   Strategy C - Pattern matching: 0 matches

   ‚úÖ Using Strategy B (title matching)

   Total matched: 38 / 41 expression columns

   ‚úÖ Aligned expression shape: (23710, 38)
   ‚úÖ Aligned metadata shape: (38, 9)

   üìã Sample mapping (first 10):
      Normal_Cart_10_8 ‚Üí GSM3130538 (Control)
      Normal_Cart_2_2 ‚Üí GSM3130531 (Control)
      Normal_Cart_3_3 ‚Üí GSM3130532 (Control)
      Normal_Cart_4_4 ‚Üí GSM3130533 (

## Cell 10: Create ML-Ready Matrix

In [17]:
"""
CELL 10: CREATE ML-READY MATRIX
===============================
Transform raw expression into ML-ready format:
1. log2(x + 1) transformation
2. Per-gene z-score standardization
"""

if expression_df is None or expression_df.empty:
    print("‚ùå Cannot create ML matrix - no expression data loaded.")
else:
    print("üîß Creating ML-ready matrix...\n")
    
    # Store raw source matrix
    raw_source_matrix = expression_df.copy()
    
    # Step 1: Ensure numeric
    print("   Step 1: Converting to numeric...")
    raw_source_matrix = raw_source_matrix.apply(pd.to_numeric, errors='coerce')
    
    # Step 2: Remove genes with all NaN or all zero
    print("   Step 2: Filtering genes...")
    n_before = raw_source_matrix.shape[0]
    raw_source_matrix = raw_source_matrix.dropna(how='all')
    raw_source_matrix = raw_source_matrix[(raw_source_matrix != 0).any(axis=1)]
    n_after = raw_source_matrix.shape[0]
    print(f"      Removed {n_before - n_after} empty/zero genes. Remaining: {n_after}")
    
    # Step 3: Log2(x + 1) transformation
    print("   Step 3: Log2(x + 1) transformation...")
    # Handle negative values (shouldn't exist, but just in case)
    raw_source_matrix = raw_source_matrix.clip(lower=0)
    ml_matrix = np.log2(raw_source_matrix + 1)
    
    # Step 4: Per-gene z-score (across samples)
    print("   Step 4: Per-gene z-score standardization...")
    gene_means = ml_matrix.mean(axis=1)
    gene_stds = ml_matrix.std(axis=1)
    # Avoid division by zero
    gene_stds = gene_stds.replace(0, 1)
    ml_matrix = ml_matrix.sub(gene_means, axis=0).div(gene_stds, axis=0)
    
    print(f"\n‚úÖ ML matrix created:")
    print(f"   Shape: {ml_matrix.shape[0]} genes √ó {ml_matrix.shape[1]} samples")
    print(f"   Value range: [{ml_matrix.min().min():.2f}, {ml_matrix.max().max():.2f}]")
    print(f"   Mean (should be ~0): {ml_matrix.values.mean():.4f}")
    print(f"   Std (should be ~1): {ml_matrix.values.std():.4f}")
    
    print("\nüìä ML Matrix Preview (5√ó5):")
    display(ml_matrix.iloc[:5, :5])

print("\n‚úÖ Cell 10 complete. Ready for Cell 11.")

üîß Creating ML-ready matrix...

   Step 1: Converting to numeric...
   Step 2: Filtering genes...
      Removed 0 empty/zero genes. Remaining: 23710
   Step 3: Log2(x + 1) transformation...
   Step 4: Per-gene z-score standardization...

‚úÖ ML matrix created:
   Shape: 23710 genes √ó 38 samples
   Value range: [-5.67, 6.00]
   Mean (should be ~0): 0.0000
   Std (should be ~1): 0.9273

üìä ML Matrix Preview (5√ó5):


Unnamed: 0_level_0,GSM3130538,GSM3130531,GSM3130532,GSM3130533,GSM3130534
symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
FN1,0.327683,-0.501095,-0.502587,0.355068,-1.31816
COMP,1.375804,0.815348,1.013753,0.989234,0.496953
MALAT1,0.830341,-0.070694,0.432696,0.161819,0.140782
CHI3L2,-0.730324,-1.603103,-1.497339,-0.375856,-1.146031
CLU,0.970641,0.324612,0.703679,0.550515,0.956697



‚úÖ Cell 10 complete. Ready for Cell 11.


## Cell 11: Check for Key Genes (PF4, RNMT, RBM24)

In [18]:
"""
CELL 11: CHECK FOR KEY GENES
============================
Verify presence of genes we'll track:
- PF4/CXCL4: Anti-aging factor from Pinho lab
- RNMT, RBM24: Validated OA markers from Yin 2023
"""

if expression_df is None or expression_df.empty:
    print("‚ùå Cannot check genes - no expression data loaded.")
else:
    key_genes = ['PF4', 'CXCL4', 'RNMT', 'RBM24', 'IL1B', 'TNF', 'MMP13', 'COL2A1']
    
    print("üîç Checking for key genes...\n")
    
    # Get gene index as strings for searching
    gene_index = raw_source_matrix.index.astype(str).str.upper()
    
    found_genes = []
    missing_genes = []
    
    for gene in key_genes:
        # Try exact match
        if gene.upper() in gene_index.values:
            found_genes.append(gene)
            # Get actual index name
            idx = raw_source_matrix.index[gene_index == gene.upper()][0]
            mean_expr = raw_source_matrix.loc[idx].mean()
            print(f"   ‚úÖ {gene}: Found (mean raw expression: {mean_expr:.2f})")
        else:
            # Try partial match
            partial_matches = gene_index[gene_index.str.contains(gene.upper())]
            if len(partial_matches) > 0:
                found_genes.append(gene)
                print(f"   ‚ö†Ô∏è {gene}: Partial match found: {partial_matches.values[:3]}")
            else:
                missing_genes.append(gene)
                print(f"   ‚ùå {gene}: Not found")
    
    print(f"\nüìä Summary:")
    print(f"   Found: {len(found_genes)}/{len(key_genes)}")
    print(f"   Missing: {missing_genes}")
    
    # Show some example gene names from the index
    print(f"\nüìã Sample gene names in index (first 20):")
    print(f"   {list(raw_source_matrix.index[:20])}")

print("\n‚úÖ Cell 11 complete. Ready for Cell 12.")

üîç Checking for key genes...

   ‚úÖ PF4: Found (mean raw expression: 0.03)
   ‚ùå CXCL4: Not found
   ‚úÖ RNMT: Found (mean raw expression: 6.42)
   ‚úÖ RBM24: Found (mean raw expression: 0.62)
   ‚úÖ IL1B: Found (mean raw expression: 0.06)
   ‚úÖ TNF: Found (mean raw expression: 0.12)
   ‚úÖ MMP13: Found (mean raw expression: 2.80)
   ‚úÖ COL2A1: Found (mean raw expression: 12.86)

üìä Summary:
   Found: 7/8
   Missing: ['CXCL4']

üìã Sample gene names in index (first 20):
   ['FN1', 'COMP', 'MALAT1', 'CHI3L2', 'CLU', 'DCN', 'PRELP', 'CILP', 'CHI3L1', 'GPX3', 'COL2A1', 'VIM', 'MMP3', 'MT2A', 'ACAN', 'SOD2', 'FMOD', 'SERPINA1', 'LUM', 'COL3A1']

‚úÖ Cell 11 complete. Ready for Cell 12.


## Cell 12: Save Outputs

In [19]:
"""
CELL 12: SAVE OUTPUTS
=====================
Save the three required files:
1. metadata.csv
2. raw_source_matrix.csv
3. ml_matrix.csv
"""

if expression_df is None or expression_df.empty:
    print("‚ùå Cannot save - no expression data loaded.")
    print("\nüìã MANUAL ACTION REQUIRED:")
    print("   See Cell 9 output for instructions.")
else:
    # Save metadata
    metadata_path = DATA_PROCESSED / "GSE114007_metadata.csv"
    metadata_df.to_csv(metadata_path, index=False)
    print(f"‚úÖ Saved: {metadata_path.name}")
    print(f"   Shape: {metadata_df.shape}")
    
    # Save raw source matrix
    raw_path = DATA_PROCESSED / "GSE114007_raw_source_matrix.csv"
    raw_source_matrix.to_csv(raw_path)
    print(f"\n‚úÖ Saved: {raw_path.name}")
    print(f"   Shape: {raw_source_matrix.shape}")
    
    # Save ML matrix
    ml_path = DATA_PROCESSED / "GSE114007_ml_matrix.csv"
    ml_matrix.to_csv(ml_path)
    print(f"\n‚úÖ Saved: {ml_path.name}")
    print(f"   Shape: {ml_matrix.shape}")
    
    # List all saved files
    print(f"\nüìÅ All files in {DATA_PROCESSED.name}/:")
    for f in DATA_PROCESSED.iterdir():
        size_kb = f.stat().st_size / 1024
        print(f"   {f.name}: {size_kb:.1f} KB")

print("\n‚úÖ Cell 12 complete. Ready for Cell 13 (Final Checkpoint).")

‚úÖ Saved: GSE114007_metadata.csv
   Shape: (38, 9)

‚úÖ Saved: GSE114007_raw_source_matrix.csv
   Shape: (23710, 38)

‚úÖ Saved: GSE114007_ml_matrix.csv
   Shape: (23710, 38)

üìÅ All files in processed/:
   .gitkeep: 0.0 KB
   GSE114007_metadata.csv: 3.1 KB
   GSE114007_ml_matrix.csv: 15966.0 KB
   GSE114007_raw_source_matrix.csv: 7932.0 KB

‚úÖ Cell 12 complete. Ready for Cell 13 (Final Checkpoint).


## Cell 13: Final Checkpoint Summary

In [20]:
"""
CELL 13: FINAL CHECKPOINT SUMMARY
=================================
Print comprehensive summary of what was loaded.
"""

print("="*70)
print("üéØ FINAL CHECKPOINT SUMMARY - GSE114007 DATA INGESTION")
print("="*70)

if expression_df is None or expression_df.empty:
    print("\n‚ùå DATA INGESTION INCOMPLETE")
    print("\nüìã Status:")
    print("   - Metadata: ‚úÖ Loaded")
    print("   - Expression: ‚ùå Failed to load")
    print("\nüîß NEXT STEPS:")
    print("   1. Check files in data/raw/GSE114007/")
    print("   2. Open the largest file manually")
    print("   3. Report the file format to continue")
    print(f"\nüìÅ Files downloaded:")
    for f in DATA_RAW.iterdir():
        size_mb = f.stat().st_size / (1024*1024)
        print(f"   - {f.name} ({size_mb:.2f} MB)")
else:
    print("\n‚úÖ DATA INGESTION COMPLETE")
    
    print("\nüìä DATASET METRICS:")
    print(f"   Samples (n_samples): {raw_source_matrix.shape[1]}")
    print(f"   Genes (n_genes): {raw_source_matrix.shape[0]}")
    
    if 'condition' in metadata_df.columns:
        print(f"\nüìã CLASS DISTRIBUTION:")
        for cond, count in metadata_df['condition'].value_counts().items():
            print(f"   {cond}: {count}")
    
    print(f"\nüß¨ EXAMPLE GENES (first 5 in index):")
    for i, gene in enumerate(raw_source_matrix.index[:5], 1):
        mean_expr = raw_source_matrix.loc[gene].mean()
        print(f"   {i}. {gene} (mean expression: {mean_expr:.2f})")
    
    print(f"\nüìÅ OUTPUT FILES:")
    print(f"   1. {DATA_PROCESSED}/GSE114007_metadata.csv")
    print(f"   2. {DATA_PROCESSED}/GSE114007_raw_source_matrix.csv")
    print(f"   3. {DATA_PROCESSED}/GSE114007_ml_matrix.csv")
    
    print(f"\n‚úÖ VALIDATION CHECKS:")
    print(f"   - Samples match metadata: {raw_source_matrix.shape[1] == len(metadata_df)}")
    print(f"   - No all-NaN genes: {not raw_source_matrix.isna().all(axis=1).any()}")
    print(f"   - ML matrix mean ~0: {abs(ml_matrix.values.mean()) < 0.01}")
    
print("\n" + "="*70)
print("üöÄ NEXT NOTEBOOK: 02_QC_and_EDA.ipynb")
print("   - PCA visualization")
print("   - Sample QC (detect outliers)")
print("   - Gene expression distributions")
print("="*70)

üéØ FINAL CHECKPOINT SUMMARY - GSE114007 DATA INGESTION

‚úÖ DATA INGESTION COMPLETE

üìä DATASET METRICS:
   Samples (n_samples): 38
   Genes (n_genes): 23710

üìã CLASS DISTRIBUTION:
   OA: 20
   Control: 18

üß¨ EXAMPLE GENES (first 5 in index):
   1. FN1 (mean expression: 15.97)
   2. COMP (mean expression: 13.41)
   3. MALAT1 (mean expression: 14.67)
   4. CHI3L2 (mean expression: 9.84)
   5. CLU (mean expression: 14.22)

üìÅ OUTPUT FILES:
   1. C:\Users\povan\Kairos_Therapeutics\data\processed/GSE114007_metadata.csv
   2. C:\Users\povan\Kairos_Therapeutics\data\processed/GSE114007_raw_source_matrix.csv
   3. C:\Users\povan\Kairos_Therapeutics\data\processed/GSE114007_ml_matrix.csv

‚úÖ VALIDATION CHECKS:
   - Samples match metadata: True
   - No all-NaN genes: True
   - ML matrix mean ~0: True

üöÄ NEXT NOTEBOOK: 02_QC_and_EDA.ipynb
   - PCA visualization
   - Sample QC (detect outliers)
   - Gene expression distributions


---

## Troubleshooting Section

If the notebook didn't successfully load expression data, run the cell below to get diagnostic information.

In [None]:
"""
TROUBLESHOOTING CELL: Manual File Inspection
============================================
Run this if automatic parsing failed.
"""

print("üîß TROUBLESHOOTING: Manual File Inspection")
print("="*60)

print("\nüìÅ Files in data/raw/GSE114007/:")
for f in sorted(DATA_RAW.iterdir()):
    size_mb = f.stat().st_size / (1024*1024)
    print(f"\n   üìÑ {f.name}")
    print(f"      Size: {size_mb:.2f} MB")
    
    # Try to peek at first few lines
    try:
        if f.name.endswith('.gz'):
            with gzip.open(f, 'rt') as file:
                lines = [file.readline() for _ in range(5)]
        else:
            with open(f, 'r') as file:
                lines = [file.readline() for _ in range(5)]
        
        print(f"      First 3 lines:")
        for i, line in enumerate(lines[:3]):
            # Truncate long lines
            display_line = line.strip()[:100]
            if len(line.strip()) > 100:
                display_line += "..."
            print(f"         {i+1}: {display_line}")
    except Exception as e:
        print(f"      Could not read: {e}")

print("\n" + "="*60)
print("üìù INSTRUCTIONS:")
print("   1. Look at the file with largest size")
print("   2. Note if it's tab-separated, comma-separated, or space-separated")
print("   3. Note if first column is gene IDs or if there's a header row")
print("   4. Report this to Claude for custom parsing code")