# Getting metadata for scBaseCamp

In [21]:
## Autoreload extension
%load_ext autoreload
%autoreload 2


# scBaseCamp loading

## Summary

* This is a tutorial on using Python for accessing the scBaseCamp dataset hosted by the Arc Institute.
* The data can be streamed or downloaded locally.
  * For small jobs (e.g., summarizing some of the metadata), streaming is recommended.
  * For large jobs (e.g., training a model), downloading is recommended.
* See the [README](README.md#metadata) for a description of the obs metadata.
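
The streaming-versus-download choice above can be sketched as a single helper. This is a minimal illustration, not part of the scBaseCamp tooling: the function name `open_or_download` and the `local_dir` parameter are hypothetical, and `fs` is assumed to be any fsspec-style filesystem exposing `open` and `get` (such as the `gcsfs.GCSFileSystem()` created later in this notebook).

```python
import os


def open_or_download(fs, remote_path, local_dir=None):
    """Stream `remote_path` if `local_dir` is None, otherwise download it once.

    Hypothetical helper: `fs` is assumed to be an fsspec-style filesystem.
    """
    if local_dir is None:
        # Streaming: return a file-like object; bytes are fetched as they are read.
        return fs.open(remote_path, "rb")

    # Downloading: copy the object to local disk and reuse the copy on later calls.
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(remote_path))
    if not os.path.exists(local_path):
        fs.get(remote_path, local_path)
    return local_path
```

Calling it without `local_dir` suits one-off metadata summaries; passing a directory suits repeated reads, such as model training, because each file crosses the network only once.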


## Setup

#### Installation

If needed, install the necessary dependencies.

You can use the [conda environment](../conda_envs/python.yml) provided in this git repository. To do so:

In [None]:
## !which conda && conda env create -q -f ../conda_envs/python.yml

## Load packages

In [8]:
import os
import pandas as pd
import scanpy as sc
import pyarrow.dataset as ds
import gcsfs

In [3]:
## initialize GCS file system for reading data from GCS
fs = gcsfs.GCSFileSystem()

## Data location

In [4]:
## GCS bucket path
gcs_base_path = "gs://arc-ctc-scbasecamp/2025-02-25/"

In [5]:
## STARsolo feature type
feature_type = "GeneFull_Ex50pAS"

## List available files

Let's see what we have to work with!

In [6]:
## helper function to list files
def get_file_table(gcs_base_path: str, target: str = None, endswith: str = None):
    ## recursively list everything under the base path
    files = fs.glob(os.path.join(gcs_base_path, "**"))
    if target:
        ## keep files whose basename matches `target` exactly
        files = [f for f in files if os.path.basename(f) == target]
    elif endswith:
        ## keep files with the given suffix (skipped if no suffix is given)
        files = [f for f in files if f.endswith(endswith)]
    ## the parent directory name is the organism
    file_list = [[f.split("/")[-2], f] for f in files]
    return pd.DataFrame(file_list, columns=["organism", "file_path"])

### Parquet files

* Contain the obs metadata
* These can be read efficiently with [pyarrow](https://arrow.apache.org/docs/python/index.html)
  * We will read in via pyarrow and convert to pandas

In [7]:
## set the path to the metadata files
gcs_path = os.path.join(gcs_base_path, "metadata", feature_type)
gcs_path

'gs://arc-ctc-scbasecamp/2025-02-25/metadata/GeneFull_Ex50pAS'

#### List per-sample metadata files

Per-sample (SRX accession) metadata (e.g., tissue)

In [8]:
## list files
sample_pq_files = get_file_table(gcs_path, "sample_metadata.parquet")
print(sample_pq_files.shape)
sample_pq_files.head()

(21, 2)


| | organism | file_path |
|---|---|---|
| 0 | Arabidopsis_thaliana | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |
| 1 | Bos_taurus | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |
| 2 | Caenorhabditis_elegans | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |
| 3 | Callithrix_jacchus | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |
| 4 | Danio_rerio | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |


**Notes:**

* As you can see, the files are organized by `feature_type` (STARsolo output type) and `organism`
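
To make that layout concrete, here is a small sketch that splits a bucket path into its components. It assumes the layout `<bucket>/<release>/<kind>/<feature_type>/<organism>/<file>` seen in the metadata paths in this notebook; the names `BucketEntry` and `parse_bucket_path` are hypothetical and introduced only for illustration.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class BucketEntry(NamedTuple):
    data_kind: str     # e.g. "metadata" or "h5ad"
    feature_type: str  # STARsolo output type, e.g. "GeneFull_Ex50pAS"
    organism: str      # e.g. "Homo_sapiens"
    filename: str      # e.g. "sample_metadata.parquet"


def parse_bucket_path(path: str) -> BucketEntry:
    """Read the last four path components as kind/feature_type/organism/file."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"unexpected path layout: {path}")
    kind, feature_type, organism, filename = parts[-4:]
    return BucketEntry(kind, feature_type, organism, filename)


entry = parse_bucket_path(
    "arc-ctc-scbasecamp/2025-02-25/metadata/GeneFull_Ex50pAS/"
    "Homo_sapiens/sample_metadata.parquet"
)
print(entry.organism, entry.feature_type)
```

This mirrors what `get_file_table` does when it takes the parent directory name as the organism.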

#### List per-obs metadata files

Per-observation (cell) metadata

In [9]:
## list files
obs_pq_files = get_file_table(gcs_path, "obs_metadata.parquet")
print(obs_pq_files.shape)
obs_pq_files.head()

(21, 2)


| | organism | file_path |
|---|---|---|
| 0 | Arabidopsis_thaliana | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |
| 1 | Bos_taurus | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |
| 2 | Caenorhabditis_elegans | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |
| 3 | Callithrix_jacchus | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |
| 4 | Danio_rerio | arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul... |


### h5ad files 

* Contain count matrices and per-obs metadata

In [10]:
## set the path
gcs_path = os.path.join(gcs_base_path, "h5ad", feature_type)
gcs_path

'gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex50pAS'

In [11]:
## list files
h5ad_files = get_file_table(gcs_path, endswith=".h5ad")
print(h5ad_files.shape)
h5ad_files.head()

(30387, 2)


| | organism | file_path |
|---|---|---|
| 0 | Arabidopsis_thaliana | arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex... |
| 1 | Arabidopsis_thaliana | arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex... |
| 2 | Arabidopsis_thaliana | arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex... |
| 3 | Arabidopsis_thaliana | arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex... |
| 4 | Arabidopsis_thaliana | arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex... |


## Explore the per-sample metadata

#### Just human samples

In [12]:
## get the per-sample metadata file path
infile = sample_pq_files[sample_pq_files["organism"] == "Homo_sapiens"]["file_path"].values[0]
infile

'arc-ctc-scbasecamp/2025-02-25/metadata/GeneFull_Ex50pAS/Homo_sapiens/sample_metadata.parquet'

In [13]:
## load the metadata
sample_metadata = ds.dataset(infile, filesystem=fs, format="parquet").to_table().to_pandas()
print(sample_metadata.shape)
sample_metadata.head(2)

(16077, 14)


| | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | disease | perturbation | cell_line | czi_collection_id | czi_collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29110018 | ERX11148735 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 747 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | skin of body | normal | surplus skin from breast reconstruction surgery | not applicable | 73f82ac8-15cc-4fcd-87f8-5683723fce7f | Developmental cell programs are co-opted in in... |
| 1 | 29110027 | ERX11148744 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 2379 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | skin of body | normal | treated with dispase II and collagenase for ce... | keratinocyte CD49f- | 73f82ac8-15cc-4fcd-87f8-5683723fce7f | Developmental cell programs are co-opted in in... |


In [14]:
sample_metadata.tail(2)

| | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | disease | perturbation | cell_line | czi_collection_id | czi_collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16075 | 37011694 | SRX27443008 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 9024 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | blood | severe fever with thrombocytopenia syndrome | unsure | not applicable | | |
| 16076 | 37050686 | SRX27477190 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 10872 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | glioblastoma | transgenic expression of POU5F1 (Oct4) and SOX2 | GBM1A (patient-derived neurospheres) | | |


In [15]:
df = sample_metadata

In [16]:
df['srx_accession'].str.startswith('SRX').sum()

np.int64(14431)

In [17]:
df['srx_accession'].str.startswith('ERX').sum()

np.int64(1646)
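
The two prefix counts above can also be taken in a single pass. The following is a pure-Python sketch on a plain list of accessions (for example `df['srx_accession'].tolist()`); the function name `count_accession_prefixes` is hypothetical. `SRX` accessions are issued by NCBI SRA and `ERX` accessions by ENA.

```python
from collections import Counter

# Archive that issues each experiment-accession prefix seen in this notebook.
ARCHIVE_BY_PREFIX = {"SRX": "NCBI SRA", "ERX": "ENA"}


def count_accession_prefixes(accessions):
    """Count accessions by their three-letter prefix (e.g. 'SRX', 'ERX')."""
    return Counter(acc[:3] for acc in accessions)


# Small illustrative input taken from rows shown in this notebook.
counts = count_accession_prefixes(["ERX11148735", "ERX11148744", "SRX27443008"])
for prefix, n in counts.most_common():
    print(prefix, ARCHIVE_BY_PREFIX.get(prefix, "unknown"), n)
```

Any prefix outside the lookup table is reported as `unknown` rather than silently dropped, which makes unexpected accession types easy to spot.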

In [18]:
print(df[['entrez_id', 'srx_accession']].head(2))

   entrez_id srx_accession
0   29110018   ERX11148735
1   29110027   ERX11148744


In [19]:
## All human?
sample_metadata["organism"].value_counts()

organism
Homo sapiens    16077
Name: count, dtype: int64

In [20]:
## 10X library prep methods
sample_metadata["tech_10x"].value_counts()

tech_10x
3_prime_gex          10851
5_prime_gex           3746
vdj                    437
multiome               366
not_applicable         250
feature_barcoding      230
other                  168
cellplex                19
flex                     6
atac                     4
Name: count, dtype: int64

In [21]:
## cell prep method
sample_metadata["cell_prep"].value_counts()

cell_prep
single_cell       14661
single_nucleus     1393
unsure               22
not_applicable        1
Name: count, dtype: int64

# Get GEO/ArrayExpress study IDs for each sample

In [None]:
## !pip install biopython

In [22]:
import pandas as pd

result_df = pd.read_csv("result_df.csv")

In [27]:
from get_srp_ids import get_srp_for_srx_batch

In [28]:

# Example usage
if __name__ == "__main__":
    # Example with two accessions: one SRX and one ERX
    srx_ids = ["SRX27443010", "ERX11148735"]
    result = get_srp_for_srx_batch(srx_ids, "your.email@example.com", debug=True)
    print(f"Result: {result}")


Processing batch: ['SRX27443010', 'ERX11148735']
Received XML data of length: 24546
Found mapping: SRX27443010 -> SRP559437
Found mapping: ERX11148735 -> ERP149679
Result: {'SRX27443010': 'SRP559437', 'ERX11148735': 'ERP149679'}


In [30]:
result = get_srp_for_srx_batch(result_df['srx_accession'].tolist(), "your.email@example.com", debug=True)

Processing batch: ['ERX11148735', 'ERX11148744', 'ERX11148743', 'ERX11148740', 'ERX11148732', 'ERX11148739', 'ERX11148741', 'ERX11148737', 'ERX11148727', 'ERX11148730', 'ERX11148734', 'ERX11148724', 'ERX11148726', 'ERX11148723', 'ERX10299506', 'ERX11148736', 'ERX11148733', 'ERX11148729', 'ERX11148725', 'ERX10299505', 'ERX10299504', 'ERX11148728', 'ERX10299500', 'ERX10299507', 'ERX10299501', 'ERX10299508', 'ERX10299503', 'ERX11148755', 'ERX11148751', 'ERX11148749', 'ERX11148772', 'ERX11148752', 'ERX11148753', 'ERX11148763', 'ERX11148746', 'ERX11148759', 'ERX11148774', 'ERX11148748', 'ERX11148760', 'ERX11148745', 'ERX11148747', 'ERX11148754', 'ERX11148738', 'ERX11148750', 'ERX11148758', 'ERX11148757', 'ERX11148761', 'ERX11148769', 'ERX11148762', 'ERX11148768', 'ERX11148766', 'ERX11148767', 'ERX11148756', 'ERX11148789', 'ERX11148773', 'ERX11148785', 'ERX11148777', 'ERX11148781', 'ERX11148793', 'ERX11148771', 'ERX11148776', 'ERX11148783', 'ERX11148797', 'ERX11148778', 'ERX11148779', 'ERX11
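
When passing thousands of accessions to a lookup, it can help to split them into fixed-size batches and merge the per-batch results. The sketch below is a generic pattern, not the implementation of `get_srp_for_srx_batch` (which may already batch internally); the names `chunked` and `lookup_in_batches` and the default `batch_size` are assumptions, and `lookup` stands for any callable that takes a list of accessions and returns a dict.

```python
from typing import Callable, Dict, Iterator, List


def chunked(items: List, size: int) -> Iterator[List]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def lookup_in_batches(
    accessions: List[str],
    lookup: Callable[[List[str]], Dict[str, str]],
    batch_size: int = 200,
) -> Dict[str, str]:
    """Call `lookup` once per batch and merge the returned mappings."""
    unique = list(dict.fromkeys(accessions))  # drop duplicates, keep order
    merged: Dict[str, str] = {}
    for batch in chunked(unique, batch_size):
        merged.update(lookup(batch))
    return merged
```

With the function imported in this notebook, a call might look like `lookup_in_batches(ids, lambda b: get_srp_for_srx_batch(b, "your.email@example.com"))`.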

In [32]:
result

{'ERX11148735': 'ERP149679',
 'ERX11148744': 'ERP149679',
 'ERX11148743': 'ERP149679',
 'ERX11148740': 'ERP149679',
 'ERX11148732': 'ERP149679',
 'ERX11148739': 'ERP149679',
 'ERX11148741': 'ERP149679',
 'ERX11148737': 'ERP149679',
 'ERX11148727': 'ERP149679',
 'ERX11148730': 'ERP149679',
 'ERX11148734': 'ERP149679',
 'ERX11148724': 'ERP149679',
 'ERX11148726': 'ERP149679',
 'ERX11148723': 'ERP149679',
 'ERX10299506': 'ERP144781',
 'ERX11148736': 'ERP149679',
 'ERX11148733': 'ERP149679',
 'ERX11148729': 'ERP149679',
 'ERX11148725': 'ERP149679',
 'ERX10299505': 'ERP144781',
 'ERX10299504': 'ERP144781',
 'ERX11148728': 'ERP149679',
 'ERX10299500': 'ERP144781',
 'ERX10299507': 'ERP144781',
 'ERX10299501': 'ERP144781',
 'ERX10299508': 'ERP144781',
 'ERX10299503': 'ERP144781',
 'ERX11148755': 'ERP149679',
 'ERX11148751': 'ERP149679',
 'ERX11148749': 'ERP149679',
 'ERX11148772': 'ERP149679',
 'ERX11148752': 'ERP149679',
 'ERX11148753': 'ERP149679',
 'ERX11148763': 'ERP149679',
 'ERX11148746'

In [31]:
# map each srx_accession to its study_id
result_df['study_id'] = result_df['srx_accession'].map(result)


result_df.to_csv("result_df.csv", index=False)

In [34]:
result_df['study_id'].isnull().sum()

np.int64(0)

In [19]:
df_test = df.head(100).copy()

In [17]:
from get_study_id import get_study_id, process_dataframe


In [18]:

## Get study ID for a single accession
study_id = get_study_id("ERX11148735")  ## Returns "ERP149679"

In [None]:
df_test_result = process_dataframe(df_test, accession_col='srx_accession', output_col='study_id')

In [None]:
df_test_result

In [33]:
result_df['study_id'].isnull().sum()

np.int64(450)

In [39]:
rest = result_df.loc[result_df['study_id'].isnull()].copy()

In [40]:
rest_df = process_dataframe(rest, accession_col='srx_accession', output_col='study_id')

2025-03-13 22:00:59,654 - get_study_id - INFO - Getting study IDs for 450 NCBI accessions in batch mode


In [41]:
mapping = dict(zip(rest_df['srx_accession'], rest_df['study_id']))

In [42]:
## apply the mapping, but only to the nulls
result_df.loc[result_df['study_id'].isnull(), 'study_id'] = result_df['srx_accession'].map(mapping)

In [43]:
result_df['study_id'].isnull().sum()

np.int64(0)
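
The two-pass pattern used above (a first lookup, then a second lookup only for accessions that are still missing) can be sketched without pandas. This is a simplified illustration: `resolve_with_fallback` is a hypothetical name, `primary` and `fallback` stand for any callables that map a list of accessions to a dict, and missing values are represented by `None` rather than pandas nulls.

```python
from typing import Callable, Dict, List, Optional

Lookup = Callable[[List[str]], Dict[str, Optional[str]]]


def resolve_with_fallback(
    accessions: List[str], primary: Lookup, fallback: Lookup
) -> Dict[str, Optional[str]]:
    """Resolve with `primary`, then retry only the unresolved ones with `fallback`."""
    result = dict(primary(accessions))
    # Accessions that the first pass did not return, or returned as None.
    missing = [acc for acc in accessions if result.get(acc) is None]
    if missing:
        second_pass = fallback(missing)
        # Only overwrite with real values so existing results are never erased.
        result.update({k: v for k, v in second_pass.items() if v is not None})
    return result
```

Because the second pass only touches unresolved accessions, values found by the first pass are never overwritten, which matches the intent of applying the mapping only to the nulls.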

In [50]:
result_df['srx_accession'].nunique()

16077

In [48]:
result_df['study_id'].nunique()

298

In [53]:
result_df['study_id'].value_counts()

study_id
SRP510712    400
SRP439378    300
SRP362101    300
ERP136281    292
SRP423586    250
            ... 
ERP159291      1
ERP158531      1
ERP123534      1
ERP125482      1
ERP164644      1
Name: count, Length: 298, dtype: int64

In [10]:
get_study_id("SRX5126512")  ## Returns "SRP147554"

'SRP147554'

In [11]:
result_df.loc[result_df['srx_accession'] == 'SRX5126512', 'study_id']

1783    SRP510712
Name: study_id, dtype: object
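
The two cells above give different study IDs for `SRX5126512`: `get_study_id` returns `SRP147554`, while `result_df` stores `SRP510712`. A spot check like this can be generalized to compare a stored mapping against freshly queried values. The sketch below is an assumption-labeled helper (`find_disagreements` is a hypothetical name) that works on plain dicts, for example `dict(zip(result_df['srx_accession'], result_df['study_id']))`.

```python
from typing import Dict, Tuple


def find_disagreements(
    stored: Dict[str, str], fresh: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {accession: (stored_id, fresh_id)} where the two mappings differ.

    Only accessions present in both mappings are compared.
    """
    shared = stored.keys() & fresh.keys()
    return {
        acc: (stored[acc], fresh[acc])
        for acc in sorted(shared)
        if stored[acc] != fresh[acc]
    }


# Values taken from the two cells above.
stored = {"SRX5126512": "SRP510712"}
fresh = {"SRX5126512": "SRP147554"}
print(find_disagreements(stored, fresh))
```

Running such a check on a random sample of accessions before saving `result_df.csv` would show whether this mismatch is isolated or systematic.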

In [44]:
result_df.to_csv("result_df.csv", index=False)

In [5]:
result_df

| | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | disease | perturbation | cell_line | czi_collection_id | czi_collection_name | study_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29110018 | ERX11148735 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 747 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | skin of body | normal | surplus skin from breast reconstruction surgery | not applicable | 73f82ac8-15cc-4fcd-87f8-5683723fce7f | Developmental cell programs are co-opted in in... | ERP149679 |
| 1 | 29110027 | ERX11148744 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 2379 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | skin of body | normal | treated with dispase II and collagenase for ce... | keratinocyte CD49f- | 73f82ac8-15cc-4fcd-87f8-5683723fce7f | Developmental cell programs are co-opted in in... | ERP149679 |
| 2 | 29110026 | ERX11148743 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 2316 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | skin of body | normal | treated with dispase II and collagenase for ce... | epidermal myeloid cells | 73f82ac8-15cc-4fcd-87f8-5683723fce7f | Developmental cell programs are co-opted in in... | ERP149679 |
| 3 | 29110023 | ERX11148740 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 2907 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | skin of body | normal | skin collected from breast reconstruction surgery | not specified | 73f82ac8-15cc-4fcd-87f8-5683723fce7f | Developmental cell programs are co-opted in in... | ERP149679 |
| 4 | 29110015 | ERX11148732 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 4082 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | skin of body | normal | treated with dispase II and collagenase | not_applicable | 73f82ac8-15cc-4fcd-87f8-5683723fce7f | Developmental cell programs are co-opted in in... | ERP149679 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16072 | 37011696 | SRX27443010 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 5033 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | blood | severe fever with thrombocytopenia syndrome (S... | unsure | not_applicable | | | SRP559980 |
| 16073 | 11301487 | SRX8685435 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 1390 | 10x_Genomics | 3_prime_gex | single_nucleus | Homo sapiens | cortex | frontotemporal dementia | control | not_applicable | | | SRP559980 |
| 16074 | 11301472 | SRX8685420 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 2595 | 10x_Genomics | 3_prime_gex | single_nucleus | Homo sapiens | brain | frontotemporal dementia | control | other | | | SRP559980 |
| 16075 | 37011694 | SRX27443008 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 9024 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | blood | severe fever with thrombocytopenia syndrome | unsure | not applicable | | | SRP559980 |
