# Spatial Transcriptomics Data Download

## Overview
This notebook downloads spatial transcriptomics datasets for integration with scRNA-seq data in the immunotherapy resistance atlas.

### Datasets
| GEO ID | Cancer Type | Platform | Notes |
|--------|-------------|----------|-------|
| GSE203612 | Gastric Cancer | 10x Visium | Primary tumors |
| Additional | NSCLC | CosMx | To be added |
| Additional | Breast | 10x Visium | To be added |

### Spatial Platforms
- **10x Visium**: Spot-based, 4,992 spots (55 µm diameter) per standard capture area, roughly 1-10 cells per spot
- **CosMx**: Single-cell resolution, up to 1000 genes
- **MERFISH**: Single-cell resolution, up to 500 genes

---

## 1. Setup

In [None]:
import os
import sys
from pathlib import Path
import yaml
import requests
import gzip
import shutil
import tarfile
from tqdm import tqdm
import pandas as pd
import numpy as np
import scanpy as sc
import squidpy as sq

# Project paths
PROJECT_ROOT = Path("../..").resolve()
DATA_RAW = PROJECT_ROOT / 'data' / 'raw' / 'spatial'
CONFIG_PATH = PROJECT_ROOT / 'config' / 'analysis_params.yaml'

print(f"Data will be saved to: {DATA_RAW}")

In [None]:
# Load configuration
with open(CONFIG_PATH, 'r') as f:
    config = yaml.safe_load(f)

# Get spatial dataset list
spatial_datasets = config['datasets']['spatial']

print(f"Spatial datasets to download: {len(spatial_datasets)}")
for ds in spatial_datasets:
    print(f"  - {ds['id']}: {ds['cancer_type']} ({ds['platform']})")
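
The loop above reads the `id`, `cancer_type`, and `platform` keys from every entry under `datasets.spatial`, so a missing key surfaces as a `KeyError` partway through the notebook. The following is a minimal sketch of a pre-flight check, assuming that config shape. The helper name `check_spatial_entries` and the example values are hypothetical and not part of the project config.

```python
REQUIRED_KEYS = ("id", "cancer_type", "platform")  # keys the notebook reads


def check_spatial_entries(config):
    """Return a list of problems found in config['datasets']['spatial']."""
    entries = config.get("datasets", {}).get("spatial")
    if not isinstance(entries, list):
        return ["config['datasets']['spatial'] is missing or not a list"]

    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected a mapping, got {type(entry).__name__}")
            continue
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                problems.append(f"entry {i}: missing '{key}'")
    return problems


# Illustrative config only (assumption); the real file may hold more fields.
example_config = {
    "datasets": {
        "spatial": [
            {"id": "GSE203612", "cancer_type": "Gastric Cancer", "platform": "10x Visium"},
            {"id": "EXAMPLE", "platform": "CosMx"},  # hypothetical incomplete entry
        ]
    }
}

for problem in check_spatial_entries(example_config):
    print(problem)
```

Returning a list of messages rather than raising on the first problem lets all config errors be reported in one pass.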

## 2. Spatial Data Formats

### 10x Visium Output Structure
```
sample/
├── filtered_feature_bc_matrix.h5
├── spatial/
│   ├── tissue_hires_image.png
│   ├── tissue_lowres_image.png
│   ├── scalefactors_json.json
│   └── tissue_positions_list.csv
└── analysis/
    └── ...
```

### Key Files
- **filtered_feature_bc_matrix.h5**: Gene expression matrix
- **tissue_positions_list.csv**: Spot coordinates (written as `tissue_positions.csv`, with a header row, by Space Ranger 2.0 and later)
- **tissue_hires_image.png**: H&E stained tissue image
- **scalefactors_json.json**: Scaling factors for coordinates
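
To illustrate how these files fit together: the positions file stores spot centres in full-resolution pixel coordinates, and `scalefactors_json.json` holds the multipliers that map them onto the downsampled images. The following is a minimal sketch, assuming the headerless six-column layout of `tissue_positions_list.csv` (barcode, in_tissue, array_row, array_col, pxl_row_in_fullres, pxl_col_in_fullres) and the `tissue_hires_scalef` key written by Space Ranger. The function name `hires_spot_coordinates` is hypothetical.

```python
import csv
import json
from pathlib import Path


def hires_spot_coordinates(spatial_dir):
    """Map in-tissue spot centres onto tissue_hires_image.png pixel space.

    Returns {barcode: (row_px, col_px)} for spots flagged in_tissue == 1.
    """
    spatial_dir = Path(spatial_dir)
    with open(spatial_dir / "scalefactors_json.json") as f:
        scale = json.load(f)["tissue_hires_scalef"]

    coords = {}
    # Assumes the older headerless file; a header row would need to be skipped.
    with open(spatial_dir / "tissue_positions_list.csv", newline="") as f:
        for row in csv.reader(f):
            barcode, in_tissue, _array_row, _array_col, px_row, px_col = row
            if in_tissue != "1":
                continue  # spot lies outside the detected tissue area
            coords[barcode] = (float(px_row) * scale, float(px_col) * scale)
    return coords
```

In practice `sc.read_visium` performs this bookkeeping itself; the sketch is only meant to show what each file contributes.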

In [None]:
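def download_file(url, output_path, chunk_size=8192, timeout=60):
    """
    Download a file to output_path with a progress bar.

    Raises requests.HTTPError on a non-2xx response so that an error page
    is never written to disk as if it were data.
    """
    output_path = Path(output_path)
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        total_size = int(response.headers.get('content-length', 0))

        with open(output_path, 'wb') as f:
            with tqdm(total=total_size, unit='B', unit_scale=True, desc=output_path.name) as pbar:
                for chunk in response.iter_content(chunk_size=chunk_size):
                    f.write(chunk)
                    pbar.update(len(chunk))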


def load_visium_data(sample_dir):
    """
    Load 10x Visium spatial data.
    
    Parameters
    ----------
    sample_dir : Path
        Directory containing Visium output
    
    Returns
    -------
    AnnData
        Spatial AnnData object with coordinates and images
    """
    sample_dir = Path(sample_dir)
    
    # Read spatial data using scanpy
    adata = sc.read_visium(
        sample_dir,
        count_file='filtered_feature_bc_matrix.h5',
        load_images=True
    )
    
    # Ensure gene names are unique
    adata.var_names_make_unique()
    
    return adata

print("Spatial data loading functions defined")

## 3. Download Spatial Datasets

### 3.1 GSE203612 - Gastric Cancer Visium

In [None]:
# GSE203612 - Gastric cancer spatial transcriptomics
geo_id = "GSE203612"
output_dir = DATA_RAW / geo_id
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Downloading {geo_id}...")
print(f"Output directory: {output_dir}")
print(f"\nGEO URL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={geo_id}")
print("\nNote: Download supplementary files manually from GEO.")
print("\nExpected files:")
print("  - Space Ranger output for each sample")
print("  - Tissue images")
print("  - Metadata with clinical information")
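
GEO also serves supplementary files over HTTPS in a predictable directory layout, which could be combined with `download_file` once the file names are known. The following sketch builds the series directory URL, assuming the usual `series/GSEnnn/` bucketing on the NCBI FTP server. The helper name `geo_suppl_url` is hypothetical, and the actual file names still have to be read from the GEO page.

```python
def geo_suppl_url(geo_id):
    """Build the supplementary-file directory URL for a GEO series accession."""
    if not (geo_id.startswith("GSE") and geo_id[3:].isdigit()):
        raise ValueError(f"Not a GEO series accession: {geo_id!r}")
    # GEO groups series into buckets by replacing the last three digits with
    # 'nnn'; accessions with three or fewer digits fall into 'GSEnnn'.
    bucket = "GSE" + geo_id[3:-3] + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{bucket}/{geo_id}/suppl/"


print(geo_suppl_url("GSE203612"))
# prints https://ftp.ncbi.nlm.nih.gov/geo/series/GSE203nnn/GSE203612/suppl/
```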

### 3.2 Manual Download Instructions for Spatial Data

Spatial transcriptomics data often includes large image files and requires careful organization:

1. Go to the GEO page
2. Download supplementary files (often as .tar.gz archives)
3. Extract and organize as:

```
data/raw/spatial/GSEXXX/
├── sample1/
│   ├── filtered_feature_bc_matrix.h5
│   └── spatial/
├── sample2/
│   ├── filtered_feature_bc_matrix.h5
│   └── spatial/
└── metadata.csv
```
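
Step 3 can be scripted once an archive is on disk. The following is a minimal sketch, assuming one `.tar.gz` archive per sample whose contents already follow the Space Ranger layout. The helper name `extract_sample_archive` is hypothetical, and the path check guards against archive members that would land outside the target directory.

```python
import tarfile
from pathlib import Path


def extract_sample_archive(archive_path, dest_dir):
    """Extract a .tar/.tar.gz archive into dest_dir, refusing unsafe member paths.

    Returns the relative paths of all files present under dest_dir afterwards.
    """
    dest_dir = Path(dest_dir).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)

    with tarfile.open(archive_path, "r:*") as tar:  # 'r:*' auto-detects compression
        members = tar.getmembers()
        for member in members:
            target = (dest_dir / member.name).resolve()
            # Reject absolute paths and '..' components that escape dest_dir
            if target != dest_dir and dest_dir not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
            if member.issym() or member.islnk():
                raise ValueError(f"Links are not extracted: {member.name}")
        tar.extractall(dest_dir, members=members)

    return sorted(p.relative_to(dest_dir) for p in dest_dir.rglob("*") if p.is_file())
```

A call such as `extract_sample_archive(archive, DATA_RAW / 'GSEXXX')` would then populate the dataset directory, after which the file listing can be compared against the layout above.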

In [None]:
# Check for existing spatial data
print("Checking for existing spatial data:\n")

for dataset in spatial_datasets:
    geo_id = dataset['id']
    data_dir = DATA_RAW / geo_id
    
    if data_dir.exists():
        # Count potential samples
        samples = [d for d in data_dir.iterdir() if d.is_dir()]
        h5_files = list(data_dir.rglob('*.h5'))
        
        print(f"{geo_id}:")
        print("  - Directory exists: Yes")
        print(f"  - Subdirectories: {len(samples)}")
        print(f"  - H5 files: {len(h5_files)}")
    else:
        print(f"{geo_id}: NOT DOWNLOADED")

## 4. Verify Spatial Data Structure

After downloading, verify that spatial data has the correct structure for loading with scanpy/squidpy.

In [None]:
def verify_visium_structure(sample_dir):
    """
    Verify that a directory contains valid Visium data.
    
    Parameters
    ----------
    sample_dir : Path
        Directory to check
    
    Returns
    -------
    dict
        Status of each required file
    """
    sample_dir = Path(sample_dir)
    
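    spatial_dir = sample_dir / 'spatial'

    # Each requirement lists the acceptable paths; any one of them is enough
    required_files = {
        'count_matrix': [sample_dir / 'filtered_feature_bc_matrix.h5'],
        # Space Ranger >= 2.0 writes tissue_positions.csv (with a header row);
        # earlier versions write tissue_positions_list.csv
        'positions': [spatial_dir / 'tissue_positions_list.csv',
                      spatial_dir / 'tissue_positions.csv'],
        'scalefactors': [spatial_dir / 'scalefactors_json.json'],
        'hires_image': [spatial_dir / 'tissue_hires_image.png'],
        'lowres_image': [spatial_dir / 'tissue_lowres_image.png'],
    }

    status = {name: any(p.exists() for p in paths)
              for name, paths in required_files.items()}

    return status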


# Verify downloaded spatial data
print("Verifying spatial data structure:\n")

for dataset in spatial_datasets:
    geo_id = dataset['id']
    data_dir = DATA_RAW / geo_id
    
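    # Only Visium datasets follow the Space Ranger layout checked here
    if data_dir.exists() and 'visium' in dataset['platform'].lower():
        # Check each subdirectory as a potential sample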
        for sample_dir in [d for d in data_dir.iterdir() if d.is_dir()]:
            status = verify_visium_structure(sample_dir)
            
            all_present = all(status.values())
            status_str = "VALID" if all_present else "INCOMPLETE"
            
            print(f"{geo_id}/{sample_dir.name}: {status_str}")
            
            if not all_present:
                for name, present in status.items():
                    if not present:
                        print(f"    Missing: {name}")

## 5. Test Loading Spatial Data

Let's test loading a spatial dataset to ensure it's properly formatted.

In [None]:
# Example: Test loading a sample
# Uncomment and modify path once data is downloaded

# sample_path = DATA_RAW / 'GSE203612' / 'sample1'
# 
# if sample_path.exists():
#     adata = load_visium_data(sample_path)
#     
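#     # adata.uns['spatial'] is keyed by library ID; images sit one level below
#     library_id = next(iter(adata.uns['spatial']))
#     
#     print("Loaded spatial data:")
#     print(f"  - Spots: {adata.n_obs}")
#     print(f"  - Genes: {adata.n_vars}")
#     print(f"  - Spatial coordinates: {adata.obsm['spatial'].shape}")
#     print(f"  - Library ID: {library_id}")
#     print(f"  - Images loaded: {list(adata.uns['spatial'][library_id]['images'].keys())}")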
# else:
#     print(f"Sample not found at: {sample_path}")

print("Test loading code ready - uncomment once data is downloaded")

## 6. Create Spatial Dataset Registry

In [None]:
# Create metadata for spatial datasets
spatial_metadata = []

for dataset in spatial_datasets:
    geo_id = dataset['id']
    data_dir = DATA_RAW / geo_id
    
    # Count samples if directory exists
    n_samples = 0
    if data_dir.exists():
        n_samples = len([d for d in data_dir.iterdir() if d.is_dir()])
    
    record = {
        'geo_id': geo_id,
        'cancer_type': dataset['cancer_type'],
        'platform': dataset['platform'],
        'data_path': str(data_dir),
        'downloaded': data_dir.exists() and n_samples > 0,
        'n_samples': n_samples
    }
    
    spatial_metadata.append(record)

# Create DataFrame
spatial_df = pd.DataFrame(spatial_metadata)

# Save metadata
metadata_path = DATA_RAW / 'spatial_dataset_metadata.csv'
spatial_df.to_csv(metadata_path, index=False)

print(f"Metadata saved to: {metadata_path}")
display(spatial_df)

## 7. Additional Spatial Data Resources

### Public Spatial Transcriptomics Repositories

1. **10x Genomics Datasets**: https://www.10xgenomics.com/resources/datasets
   - Curated Visium datasets with full Space Ranger output
   
2. **SpatialDB**: http://www.spatialomics.org/
   - Database of spatial transcriptomics experiments
   
3. **Single Cell Portal**: https://singlecell.broadinstitute.org/
   - Some studies include spatial data

### Cancer-Specific Resources

- **TISCH2**: Tumor Immune Single-cell Hub
- **CancerSEA**: Cancer Single-cell State Atlas
- **HTAN**: Human Tumor Atlas Network

## 8. Summary and Next Steps

### Completed
- Set up spatial data download infrastructure
- Created verification functions for data structure
- Created spatial dataset registry

### Next Steps
1. Download remaining spatial datasets
2. Proceed to `02_preprocessing/` for quality control
3. Spatial data will be integrated in `05_spatial_analysis/`

### Important Considerations
- Spatial data files are large (images + counts)
- Ensure consistent coordinate systems
- Match spatial samples with corresponding scRNA-seq if available

In [None]:
# Final status
print("\n" + "="*60)
print("SPATIAL DATA DOWNLOAD STATUS")
print("="*60)

downloaded = spatial_df['downloaded'].sum()
total = len(spatial_df)

print(f"\nDownloaded: {downloaded}/{total} spatial datasets")

if downloaded < total:
    print("\nPending downloads:")
    for _, row in spatial_df[~spatial_df['downloaded']].iterrows():
        print(f"  - {row['geo_id']}: {row['cancer_type']} ({row['platform']})")