# Getting metadata for scBaseCamp

In [1]:
## Autoreload extension
%load_ext autoreload
%autoreload 2


In [None]:
# !pip install tiledb tiledbsoma tiledb-cloud

# scBaseCamp loading

## Summary

* This tutorial shows how to access the scBaseCamp dataset, hosted by the Arc Institute, from Python.
* The data can be streamed or downloaded locally.
  * For small jobs (e.g., summarizing some of the metadata), streaming is recommended.
  * For large jobs (e.g., training a model), downloading is recommended.
* See the [README](README.md#metadata) for a description of the obs metadata.


## Setup

#### Installation

If needed, install the necessary dependencies.

You can use the [conda environment](../conda_envs/python.yml) provided in this git repository. To do so:

In [None]:
#!which conda && conda env create -q -f ../conda_envs/python.yml

## Load packages

In [1]:
import os
import pandas as pd
import scanpy as sc
import pyarrow.dataset as ds
import gcsfs

ModuleNotFoundError: No module named 'scanpy'

In [3]:
## initialize GCS file system for reading data from GCS
fs = gcsfs.GCSFileSystem()

## Data location

In [4]:
## GCS bucket path
gcs_base_path = "gs://arc-ctc-scbasecamp/2025-02-25/"

In [5]:
## STARsolo feature type
feature_type = "GeneFull_Ex50pAS"

## List available files

Let's see what we have to work with!

In [6]:
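## helper function to list files under a GCS prefix
def get_file_table(gcs_base_path: str, target: str=None, endswith: str=None):
    ## without a filter, `f.endswith(None)` below would raise a TypeError
    if target is None and endswith is None:
        raise ValueError("Provide either `target` or `endswith`")
    files = fs.glob(os.path.join(gcs_base_path, "**"))
    if target:
        ## keep files whose name matches exactly
        files = [f for f in files if os.path.basename(f) == target]
    else:
        ## keep files with the given suffix
        files = [f for f in files if f.endswith(endswith)]
    ## the parent directory name is the organism
    file_list = [[f.split("/")[-2], f] for f in files]
    return pd.DataFrame(file_list, columns=["organism", "file_path"])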

### Parquet files

* Contain the obs metadata
* These can be read efficiently with [pyarrow](https://arrow.apache.org/docs/python/index.html)
  * We will read in via pyarrow and convert to pandas

In [7]:
## set the path to the metadata files
gcs_path = os.path.join(gcs_base_path, "metadata", feature_type)
gcs_path

'gs://arc-ctc-scbasecamp/2025-02-25/metadata/GeneFull_Ex50pAS'

#### List per-sample metadata files

Per-sample (SRX accession) metadata (e.g., tissue)

In [8]:
## list files
sample_pq_files = get_file_table(gcs_path, "sample_metadata.parquet")
print(sample_pq_files.shape)
sample_pq_files.head()

(21, 2)


Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...
1,Bos_taurus,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...
2,Caenorhabditis_elegans,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...
3,Callithrix_jacchus,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...
4,Danio_rerio,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...


**Notes:**

* As you can see, the files are organized by `feature_type` (STARsolo output type) and `organism`
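
Because the layout is `<base>/metadata/<feature_type>/<organism>/<file>`, the feature type and organism can be recovered from a file path alone. The following is a minimal sketch that assumes this layout holds for every metadata file; `parse_metadata_path` is a hypothetical helper and not part of the original notebook.

```python
def parse_metadata_path(path: str) -> dict:
    """Split a metadata file path into its layout components.

    Assumes the layout <...>/metadata/<feature_type>/<organism>/<file_name>.
    """
    parts = path.rstrip("/").split("/")
    if len(parts) < 4 or parts[-4] != "metadata":
        raise ValueError(f"Unexpected path layout: {path}")
    return {
        "feature_type": parts[-3],
        "organism": parts[-2],
        "file_name": parts[-1],
    }


# Path taken from the per-sample listing for Homo_sapiens
example = (
    "arc-ctc-scbasecamp/2025-02-25/metadata/GeneFull_Ex50pAS/"
    "Homo_sapiens/sample_metadata.parquet"
)
info = parse_metadata_path(example)
print(info)
```

Raising on an unexpected layout, rather than guessing, keeps a mislabeled organism from slipping silently into downstream tables.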

#### List per-obs metadata files

Per-observation (cell) metadata

In [9]:
## list files
obs_pq_files = get_file_table(gcs_path, "obs_metadata.parquet")
print(obs_pq_files.shape)
obs_pq_files.head()

(21, 2)


Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...
1,Bos_taurus,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...
2,Caenorhabditis_elegans,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...
3,Callithrix_jacchus,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...
4,Danio_rerio,arc-ctc-scbasecamp/2025-02-25/metadata/GeneFul...


### h5ad files 

* Contain count matrices and per-obs metadata

In [10]:
## set the path
gcs_path = os.path.join(gcs_base_path, "h5ad", feature_type)
gcs_path

'gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex50pAS'

In [11]:
## list files
h5ad_files = get_file_table(gcs_path, endswith=".h5ad")
print(h5ad_files.shape)
h5ad_files.head()

(30387, 2)


Unnamed: 0,organism,file_path
0,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex...
1,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex...
2,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex...
3,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex...
4,Arabidopsis_thaliana,arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFull_Ex...


## Explore the per-sample metadata

#### Just human samples

In [12]:
## get the per-sample metadata file path
infile = sample_pq_files[sample_pq_files["organism"] == "Homo_sapiens"]["file_path"].values[0]
infile

'arc-ctc-scbasecamp/2025-02-25/metadata/GeneFull_Ex50pAS/Homo_sapiens/sample_metadata.parquet'

In [13]:
## load the metadata
sample_metadata = ds.dataset(infile, filesystem=fs, format="parquet").to_table().to_pandas()
print(sample_metadata.shape)
sample_metadata.head(2)

(16077, 14)


Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,perturbation,cell_line,czi_collection_id,czi_collection_name
0,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...
1,29110027,ERX11148744,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu...,2379,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,treated with dispase II and collagenase for ce...,keratinocyte CD49f-,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...


In [14]:
sample_metadata.tail(2)

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,perturbation,cell_line,czi_collection_id,czi_collection_name
16075,37011694,SRX27443008,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu...,9024,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,blood,severe fever with thrombocytopenia syndrome,unsure,not applicable,,
16076,37050686,SRX27477190,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu...,10872,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,brain,glioblastoma,transgenic expression of POU5F1 (Oct4) and SOX2,GBM1A (patient-derived neurospheres),,


In [15]:
df = sample_metadata

In [16]:
df['srx_accession'].str.startswith('SRX').sum()

np.int64(14431)

In [17]:
df['srx_accession'].str.startswith('ERX').sum()

np.int64(1646)

In [18]:
print(df[['entrez_id', 'srx_accession']].head(2))

   entrez_id srx_accession
0   29110018   ERX11148735
1   29110027   ERX11148744


In [19]:
## All human?
sample_metadata["organism"].value_counts()

organism
Homo sapiens    16077
Name: count, dtype: int64

In [20]:
## 10X library prep methods
sample_metadata["tech_10x"].value_counts()

tech_10x
3_prime_gex          10851
5_prime_gex           3746
vdj                    437
multiome               366
not_applicable         250
feature_barcoding      230
other                  168
cellplex                19
flex                     6
atac                     4
Name: count, dtype: int64

In [21]:
## cell prep method
sample_metadata["cell_prep"].value_counts()

cell_prep
single_cell       14661
single_nucleus     1393
unsure               22
not_applicable        1
Name: count, dtype: int64

In [4]:
from dotenv import load_dotenv
import os

load_dotenv()

def print_green(text):
    print(f"\033[92m{text}\033[0m")

def print_red(text):
    print(f"\033[91m{text}\033[0m")

## report which of the required API keys are set in the environment
required_keys = [
    "NCBI_API_KEY",
    "OPENAI_API_KEY",
    "GOOGLE_SEARCH_API_KEY",
    "GOOGLE_SEARCH_CSE_ID",
]
for key in required_keys:
    if os.getenv(key):
        print_green(f"{key} is set")
    else:
        print_red(f"{key} is not set")


NCBI_API_KEY is set
OPENAI_API_KEY is set
GOOGLE_SEARCH_API_KEY is set
GOOGLE_SEARCH_CSE_ID is set


# Get GEO/ArrayExpress datasets for each

In [5]:
# !pip install biopython

In [3]:
import pandas as pd

result_df = pd.read_csv("result_df.csv")
sdf = pd.read_csv("study_df.csv")

In [None]:
result_df['sra_id'].nunique()

1665

In [10]:
result_df.rename(columns={'sra_id': 'srp_id'}, inplace=True)

In [11]:
study_df = result_df[['srp_id']].drop_duplicates()

In [12]:
study_df.reset_index(drop=True, inplace=True)

In [14]:
result_df['lib_prep'].unique()

array(['10x_Genomics'], dtype=object)

In [15]:
result_df['tech_10x'].unique()

array(['3_prime_gex', 'other', 'not_applicable', '5_prime_gex', 'vdj',
       'multiome', 'feature_barcoding', 'cellplex', 'flex', 'atac'],
      dtype=object)

## Get all PRJ and GSE / EMTAB IDs

In [17]:
sra_ids = study_df['srp_id'].tolist()

In [18]:
from sra_batch_processor import process_sra_ids, results_to_dataframe
import pandas as pd


results = process_sra_ids(
    sra_ids=sra_ids,
    batch_size=50,  # Process 50 IDs at a time
    max_workers=3,  # Use 3 parallel workers
    delay_between_batches=3.0,  # Wait 3 seconds between batches
    cache_file="my_large_cache.json"  # Cache results in this file
)


2025-03-16 19:49:40,539 - sra_batch_processor - INFO - Processing 1665 SRA IDs


Processing SRA IDs:   0%|          | 0/1665 [00:00<?, ?it/s]

2025-03-16 19:49:40,553 - sra_id_converter - INFO - Processing 1665 unique SRA IDs out of 1665 total
2025-03-16 19:49:40,557 - sra_id_converter - INFO - Loaded 1665 cached results from my_large_cache.json
2025-03-16 19:49:40,558 - sra_id_converter - INFO - Found 1665 IDs in cache/known mappings, 0 remaining to process
2025-03-16 19:49:40,570 - sra_id_converter - INFO - Updated cache file my_large_cache.json with 1665 results
2025-03-16 19:49:40,572 - sra_batch_processor - INFO - Processing completed in 0.03 seconds
2025-03-16 19:49:40,574 - sra_batch_processor - INFO - Summary:
2025-03-16 19:49:40,574 - sra_batch_processor - INFO -   Total SRA IDs processed: 1665
2025-03-16 19:49:40,575 - sra_batch_processor - INFO -   BioProject IDs found: 1665 (100.0%)
2025-03-16 19:49:40,575 - sra_batch_processor - INFO -   GEO/ArrayExpress IDs found: 1540 (92.5%)
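
`process_sra_ids` comes from a project-local module whose source is not shown here. As a rough illustration of what `batch_size` and `delay_between_batches` control, here is a minimal sketch of sequential batching. The names `chunked`, `process_in_batches`, and `fake_lookup` are hypothetical, and the real implementation (parallel workers, caching) may well differ.

```python
import time
from typing import Callable, Dict, Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(
    ids: List[str],
    handler: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 50,
    delay_between_batches: float = 0.0,
) -> Dict[str, dict]:
    """Run `handler` on each batch and merge the per-batch results."""
    results: Dict[str, dict] = {}
    batches = list(chunked(ids, batch_size))
    for i, batch in enumerate(batches):
        results.update(handler(batch))
        # Pause between batches (not after the last one) to respect rate limits
        if delay_between_batches and i < len(batches) - 1:
            time.sleep(delay_between_batches)
    return results


def fake_lookup(batch: List[str]) -> Dict[str, dict]:
    """Toy stand-in for a remote lookup; returns empty records."""
    return {sra_id: {"bioproject_id": None, "geo_id": None} for sra_id in batch}


# Placeholder IDs, sized to match the 1665 studies processed above
demo_ids = [f"SRP{n:06d}" for n in range(1665)]
demo_batches = list(chunked(demo_ids, 50))
demo_results = process_in_batches(demo_ids, fake_lookup, batch_size=50)
print(len(demo_batches), len(demo_batches[-1]), len(demo_results))
```

With 1665 IDs and a batch size of 50, this produces 34 batches, the last holding 15 IDs.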


In [20]:
len(results)

1665

In [21]:
results

{'ERP149679': {'bioproject_id': 'PRJEB64504', 'geo_id': 'E-MTAB-8142'},
 'ERP144781': {'bioproject_id': 'PRJEB59723', 'geo_id': 'E-MTAB-12650'},
 'ERP123138': {'bioproject_id': 'PRJEB39602', 'geo_id': ''},
 'ERP156277': {'bioproject_id': 'PRJEB71477', 'geo_id': 'E-MTAB-13085'},
 'ERP151533': {'bioproject_id': 'PRJEB66480', 'geo_id': 'E-MTAB-13382'},
 'ERP160803': {'bioproject_id': 'PRJEB76244', 'geo_id': 'E-MTAB-11528'},
 'ERP158366': {'bioproject_id': 'PRJEB73595', 'geo_id': 'E-MTAB-13874'},
 'SRP402417': {'bioproject_id': 'PRJNA890219', 'geo_id': 'GSE215403'},
 'ERP136281': {'bioproject_id': 'PRJEB51634', 'geo_id': 'E-MTAB-11536'},
 'SRP324458': {'bioproject_id': 'PRJNA738600', 'geo_id': 'GSE178360'},
 'SRP364677': {'bioproject_id': 'PRJNA816172', 'geo_id': 'GSE198623'},
 'ERP136992': {'bioproject_id': 'PRJEB52292', 'geo_id': ''},
 'SRP329496': {'bioproject_id': 'PRJNA749041', 'geo_id': 'GSE180661'},
 'SRP273096': {'bioproject_id': 'PRJNA647809', 'geo_id': 'GSE154795'},
 'SRP510712':

In [26]:
sdf = pd.DataFrame([
    {'sra_id': sra_id, 'prj_id': info['bioproject_id'], 'gse_id': info['geo_id']}
    for sra_id, info in results.items()
])

In [None]:
import numpy as np
sdf.replace({'': np.nan}, inplace=True)

In [31]:
sdf['gse_id'].isnull().sum()

np.int64(125)

In [None]:
sdf.head()

In [33]:
sdf.to_csv("study_df.csv", index=False)

## Get all GSMs from SRX

In [34]:
srx_list = result_df.loc[result_df['srx_accession'].str.startswith('SRX'), 'srx_accession'].to_list()

In [None]:
# Import the function
from srx_to_gsm_standalone import batch_srx_to_gsm


# Process in batches of 200
srx2gsm = batch_srx_to_gsm(srx_list, batch_size=200, api_key=os.getenv("NCBI_API_KEY"))


In [38]:
srx2gsm['experiment_alias'].isnull().sum()

np.int64(1744)

In [44]:
srx2gsm.rename(columns={'experiment_accession': 'srx_accession', 'experiment_alias': 'gsm_id'}, inplace=True)

In [47]:
result_df = result_df.merge(srx2gsm, on='srx_accession', how='left')

In [48]:
result_df.to_csv("result_df.csv", index=False)

In [49]:
result_df['sra_id'].nunique()

1665

In [13]:
result_df['sra_id'].unique()

array(['ERP149679', 'ERP144781', 'ERP123138', ..., 'SRP559437',
       'SRP270870', 'SRP559980'], dtype=object)

## Get SRA ID for all SRX

In [34]:
result_df['sra_id'].isnull().sum()

np.int64(0)

In [17]:
from get_sra_id import get_sra_id, process_dataframe


In [18]:

## Get study ID for a single accession
sra_id = get_sra_id("ERX11148735")  ## Returns "ERP149679"

In [None]:
df_test_result = process_dataframe(df_test, accession_col='srx_accession', output_col='sra_id')

In [None]:
df_test_result

In [33]:
result_df['sra_id'].isnull().sum()

np.int64(450)

In [39]:
rest = result_df.loc[result_df['sra_id'].isnull()].copy()

In [40]:
rest_df = process_dataframe(rest, accession_col='srx_accession', output_col='sra_id')

2025-03-13 22:00:59,654 - get_study_id - INFO - Getting study IDs for 450 NCBI accessions in batch mode


In [41]:
mapping = dict(zip(rest_df['srx_accession'], rest_df['sra_id']))

In [42]:
## apply the mapping, but only to the nulls
result_df.loc[result_df['sra_id'].isnull(), 'sra_id'] = result_df['srx_accession'].map(mapping)

In [43]:
result_df['sra_id'].isnull().sum()

np.int64(0)
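
The fill step works because pandas aligns the right-hand side on the index, so only the rows selected by the null mask receive new values. Below is a toy illustration; the accessions are placeholders, not real records.

```python
import pandas as pd

# Placeholder data: one row already has a study ID, two are missing
toy = pd.DataFrame({
    "srx_accession": ["SRX1", "SRX2", "SRX3"],
    "sra_id": ["SRP_A", None, None],
})
# The mapping also covers SRX1, but that row must not be overwritten
toy_mapping = {"SRX1": "SRP_OTHER", "SRX2": "SRP_B", "SRX3": "SRP_C"}

missing = toy["sra_id"].isnull()
toy.loc[missing, "sra_id"] = toy.loc[missing, "srx_accession"].map(toy_mapping)
print(toy["sra_id"].tolist())
```

Restricting the right-hand side to the masked rows makes the intent explicit, although index alignment would give the same result with the full column.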

In [50]:
result_df['srx_accession'].nunique()

16077

In [48]:
result_df['sra_id'].nunique()

298

In [53]:
result_df['sra_id'].value_counts()

study_id
SRP510712    400
SRP439378    300
SRP362101    300
ERP136281    292
SRP423586    250
            ... 
ERP159291      1
ERP158531      1
ERP123534      1
ERP125482      1
ERP164644      1
Name: count, Length: 298, dtype: int64

In [10]:
get_sra_id("SRX5126512")  ## Returns "SRP147554"

'SRP147554'

In [11]:
result_df.loc[result_df['srx_accession'] == 'SRX5126512', 'sra_id']

1783    SRP510712
Name: study_id, dtype: object

In [44]:
result_df.to_csv("result_df.csv", index=False)

## Get Publications for all studies

In [4]:
sdf = pd.read_csv("study_df.csv")

In [5]:
sdf

Unnamed: 0,sra_id,prj_id,gse_id
0,ERP149679,PRJEB64504,E-MTAB-8142
1,ERP144781,PRJEB59723,E-MTAB-12650
2,ERP123138,PRJEB39602,
3,ERP156277,PRJEB71477,E-MTAB-13085
4,ERP151533,PRJEB66480,E-MTAB-13382
...,...,...,...
1660,SRP557106,PRJNA1210001,
1661,SRP557851,PRJNA1210535,GSE286911
1662,SRP559437,PRJNA1214776,GSE287827
1663,SRP270870,PRJNA644744,


## Save to TileDB SQL Table

In [29]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
import tiledb

In [9]:
table_name = 'studies'
# build the table URI
table_uri = os.path.join(
    "tiledb://Cellarity-analysis/s3://tiledb-analysis",
    f"{table_name}",
)

In [None]:
sdf['gse_id'] = sdf['gse_id'].fillna('')  # assign back; in-place fillna on a selected column is deprecated in recent pandas

In [20]:
tiledb.from_pandas(table_uri, sdf)

In [32]:
results_df = pd.read_csv("result_df.csv")

In [33]:
samples_table_name = 'samples'
# build the table URI
samples_table_uri = os.path.join(
    "tiledb://Cellarity-analysis/s3://tiledb-analysis",
    f"{samples_table_name}",
)

In [34]:
results_df.fillna('', inplace=True)

In [36]:
results_df.rename(columns={'study_id': 'sra_id'}, inplace=True)

In [38]:
tiledb.from_pandas(samples_table_uri, results_df)

In [40]:
samples = tiledb.open(samples_table_uri, "r")
samples_df = samples.df[:]

In [39]:
studies = tiledb.open(table_uri, "r")
df = studies.df[:]

In [47]:
sdf['gse_id'].value_counts().head(1) # see number of NULL (i.e. '')

gse_id
    125
Name: count, dtype: int64

In [42]:
sdf

Unnamed: 0,sra_id,prj_id,gse_id
0,ERP149679,PRJEB64504,E-MTAB-8142
1,ERP144781,PRJEB59723,E-MTAB-12650
2,ERP123138,PRJEB39602,
3,ERP156277,PRJEB71477,E-MTAB-13085
4,ERP151533,PRJEB66480,E-MTAB-13382
...,...,...,...
1660,SRP557106,PRJNA1210001,
1661,SRP557851,PRJNA1210535,GSE286911
1662,SRP559437,PRJNA1214776,GSE287827
1663,SRP270870,PRJNA644744,


# Load Studies and Samples Tables from TileDB

In [1]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
import tiledb
import os


studies_table_name = 'studies'
studies_table_uri = os.path.join(
    "tiledb://Cellarity-analysis/s3://tiledb-analysis",
    f"{studies_table_name}",
)

samples_table_name = 'samples'
samples_table_uri = os.path.join(
    "tiledb://Cellarity-analysis/s3://tiledb-analysis",
    f"{samples_table_name}",
)

studies = tiledb.open(studies_table_uri, "r")
studies_df = studies.df[:]

samples = tiledb.open(samples_table_uri, "r")
samples_df = samples.df[:]

In [2]:
studies_df.head(1)

Unnamed: 0,sra_id,prj_id,gse_id
0,ERP149679,PRJEB64504,E-MTAB-8142


In [None]:
samples_df.head(1)

Unnamed: 0,entrez_id,srx_accession,file_path,obs_count,lib_prep,tech_10x,cell_prep,organism,tissue,disease,perturbation,cell_line,czi_collection_id,czi_collection_name,sra_id,gsm_id
0,29110018,ERX11148735,gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu...,747,10x_Genomics,3_prime_gex,single_cell,Homo sapiens,skin of body,normal,surplus skin from breast reconstruction surgery,not applicable,73f82ac8-15cc-4fcd-87f8-5683723fce7f,Developmental cell programs are co-opted in in...,ERP149679,


In [20]:
samples_df['cell_prep'].unique()

array(['single_cell', 'single_nucleus', 'unsure', 'not_applicable'],
      dtype=object)

In [5]:
import cynapse

client = cynapse.Client('dev')
adata = client.to_adata('tiledb://Cellarity-dev/6bb8ad78-4598-4d9d-a0d5-3571a074f3e5', x_layer_name='raw', cache=False)
adata

2025-03-18 15:26:55,087 - cynapse.core.core - INFO - Exporting tiledb://Cellarity-dev/6bb8ad78-4598-4d9d-a0d5-3571a074f3e5 as AnnData object.
2025-03-18 15:26:57,870 - cynapse.core.core - INFO - Resolved URI: tiledb://Cellarity-dev/s3://tiledb-dev/groups/ppj42_srp400429_scbasecamp
2025-03-18 15:27:11,726 - cynapse.core.core - INFO - Loaded RNA


AnnData object with n_obs × n_vars = 4302 × 36601
    obs: 'gene_count', 'umi_count', 'srx_accession', 'entrez_id', 'file_path', 'obs_count', 'lib_prep', 'tech_10x', 'cell_prep', 'organism', 'tissue', 'disease', 'perturbation', 'cell_line', 'czi_collection_id', 'czi_collection_name'
    var: 'feature_type', '_feature_types', 'gene_id_ensembl', 'gene_symbol_original', '_gene_symbols', 'gene_symbol_is_curated', 'gene_symbol_resolution', 'gene_symbol_source', 'gene_symbol_source_date'

In [12]:
adata.var['gene_symbol_is_curated'].value_counts()

gene_symbol_is_curated
True     34621
False     1980
Name: count, dtype: int64

In [16]:
adata.obs['tissue'].unique()

array(['other'], dtype=object)

In [24]:
adata.obs.index.nunique()

4302

In [25]:
adata.obs.shape

(4302, 16)

In [18]:
adata.obs['cell_prep']

SRX17761117-AAAGGATCACGCACCA    single_cell
SRX17761117-AAAGGATCAGGCATGA    single_cell
SRX17761117-AAAGGATGTCCTGTCT    single_cell
SRX17761117-AAAGTCCAGATCGGTG    single_cell
SRX17761117-AAAGTGACATAGAGGC    single_cell
                                   ...     
SRX17761120-TTTGGAGAGCACTCGC    single_cell
SRX17761120-TTTGGAGTCTCATTGT    single_cell
SRX17761120-TTTGGTTTCCCGTTGT    single_cell
SRX17761120-TTTGTTGAGCCTCGTG    single_cell
SRX17761120-TTTGTTGTCGTGCGAC    single_cell
Name: cell_prep, Length: 4302, dtype: object

In [None]:
adata.obs.drop(columns=['file_path', 'czi_collection_id', 'czi_collection_name'], inplace=True)

In [None]:
'single_nucleus'

In [None]:
adata.obs['resolution'] = adata.obs['cell_prep'].map({'single_cell': 'sc', 'single_nucleus': 'sn'})  # other possible values are 'unsure', 'not_applicable' which will be mapped to NaN
adata.obs.rename(columns={'tissue': '_tissue', 'disease': '_disease'}, inplace=True)  # we ultimately want to make 'tissue' and 'disease' compliant with cyntax
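
As the comment above notes, `Series.map` with a dict yields NaN for any value missing from the dict. A small self-contained check, using the four `cell_prep` values that appear in the samples table:

```python
import pandas as pd

cell_prep = pd.Series(
    ["single_cell", "single_nucleus", "unsure", "not_applicable"]
)
resolution = cell_prep.map({"single_cell": "sc", "single_nucleus": "sn"})

# Unmapped values ('unsure', 'not_applicable') become NaN
print(resolution.isna().tolist())
```

If NaN is not wanted downstream, an explicit `.fillna(...)` with a chosen placeholder is one option.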

In [10]:
(adata.obs == '').sum()

gene_count                0
umi_count                 0
srx_accession             0
entrez_id                 0
file_path                 0
obs_count                 0
lib_prep                  0
tech_10x                  0
cell_prep                 0
organism                  0
tissue                    0
disease                   0
perturbation              0
cell_line                 0
czi_collection_id      4302
czi_collection_name    4302
dtype: int64

In [6]:
adata

AnnData object with n_obs × n_vars = 4302 × 36601
    obs: 'gene_count', 'umi_count', 'srx_accession', 'entrez_id', 'file_path', 'obs_count', 'lib_prep', 'tech_10x', 'cell_prep', 'organism', 'tissue', 'disease', 'perturbation', 'cell_line', 'czi_collection_id', 'czi_collection_name'
    var: 'feature_type', '_feature_types', 'gene_id_ensembl', 'gene_symbol_original', '_gene_symbols', 'gene_symbol_is_curated', 'gene_symbol_resolution', 'gene_symbol_source', 'gene_symbol_source_date'

In [21]:
adata.var

Unnamed: 0_level_0,feature_type,_feature_types,gene_id_ensembl,gene_symbol_original,_gene_symbols,gene_symbol_is_curated,gene_symbol_resolution,gene_symbol_source,gene_symbol_source_date
gene_symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
MIR1302-2HG,RNA,Gene Expression,ENSG00000243485,ENSG00000243485,MIR1302-2HG,True,reference gene_id,hg38-gencode29-allgenes-cellranger,2022-05-13
FAM138A,RNA,Gene Expression,ENSG00000237613,ENSG00000237613,FAM138A,True,reference gene_id,hg38-gencode29-allgenes-cellranger,2022-05-13
OR4F5,RNA,Gene Expression,ENSG00000186092,ENSG00000186092,OR4F5,True,reference gene_id,hg38-gencode29-allgenes-cellranger,2022-05-13
AL627309.1,RNA,Gene Expression,ENSG00000238009,ENSG00000238009,AL627309.1,True,reference gene_id,hg38-gencode29-allgenes-cellranger,2022-05-13
AL627309.3,RNA,Gene Expression,ENSG00000239945,ENSG00000239945,AL627309.3,True,reference gene_id,hg38-gencode29-allgenes-cellranger,2022-05-13
...,...,...,...,...,...,...,...,...,...
ENSG00000277836,RNA,Gene Expression,,ENSG00000277836,AC141272.1,False,,,
ENSG00000278633,RNA,Gene Expression,,ENSG00000278633,AC023491.2,False,,,
ENSG00000276017,RNA,Gene Expression,,ENSG00000276017,AC007325.1,False,,,
ENSG00000278817,RNA,Gene Expression,,ENSG00000278817,AC007325.4,False,,,


# Get all Publications using SRAgent

In [5]:
df = studies_df.head(10).copy()

In [21]:
next_10 = studies_df.iloc[10:20].copy()

In [22]:
next_10

Unnamed: 0,sra_id,prj_id,gse_id
10,SRP364677,PRJNA816172,GSE198623
11,ERP136992,PRJEB52292,
12,SRP329496,PRJNA749041,GSE180661
13,SRP273096,PRJNA647809,GSE154795
14,SRP510712,PRJNA1117936,GSE268630
15,SRP308561,PRJNA705464,
16,SRP306446,PRJNA701930,GSE166766
17,SRP309720,PRJNA707445,GSE168453
18,SRP310949,PRJNA714963,GSE169047
19,SRP288163,PRJNA670674,GSE159812


In [23]:
import SRAgent
import sys
from SRAgent.workflows.publications import run_in_notebook, process_dataframe
import asyncio

In [24]:
run_in_notebook(df=next_10, output_file='next_10.csv', batch_size=10)

INFO:SRAgent.workflows.publications:Processed batch 1/1 (10/10 studies)


Results saved to next_10.csv


Unnamed: 0,pmid,pmcid,preprint_doi,message,title,source,multiple_publications,all_publications,accessions
0,36113773.0,PMC9526148,,"The publication associated with SRP364677, PRJ...",,unknown,False,[],"[SRP364677, PRJNA816172, GSE198623]"
1,,,,Found a publication titled 'A spatially resolv...,,google_search,False,[],"[ERP136992, PRJEB52292, ]"
2,36517593.0,PMC9771812,,Here are the publication details associated wi...,Ovarian cancer mutational processes drive site...,unknown,False,[],"[SRP329496, PRJNA749041, GSE180661]"
3,34836966.0,PMC8626557,,"The publication associated with SRP273096, PRJ...",,unknown,False,[],"[SRP273096, PRJNA647809, GSE154795]"
4,,PMC11310509,,Found publication 'Single cell dual-omic atlas...,,unknown,False,[],"[SRP510712, PRJNA1117936, GSE268630]"
5,,,,Found publications for SRP308561 and PRJNA7054...,,unknown,False,[],"[SRP308561, PRJNA705464, ]"
6,,,,Found a publication titled 'Single-cell longit...,,unknown,False,[],"[SRP306446, PRJNA701930, GSE166766]"
7,,PMC8601717,10.1126/scitranslmed.abh2624,"Found a publication associated with SRP309720,...",,unknown,False,[],"[SRP309720, PRJNA707445, GSE168453]"
8,34014299.0,PMC8330894,,The publication associated with the accessions...,,unknown,False,[],"[SRP310949, PRJNA714963, GSE169047]"
9,,,,The publication 'Dysregulation of brain and ch...,,unknown,False,[],"[SRP288163, PRJNA670674, GSE159812]"


In [25]:
pub_next_10 = pd.read_csv('next_10.csv')
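
After a CSV round trip, list-valued columns such as `accessions` and `all_publications` come back as strings (for example `"['SRP364677', 'PRJNA816172', 'GSE198623']"`) rather than Python lists. One way to recover the lists is `ast.literal_eval`. This is a sketch, and `parse_list_field` is a hypothetical helper.

```python
import ast


def parse_list_field(value) -> list:
    """Parse a stringified Python list; return [] for NaN or malformed input."""
    if not isinstance(value, str):
        return []  # e.g. NaN (a float) from an empty CSV cell
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


accessions = parse_list_field("['SRP364677', 'PRJNA816172', 'GSE198623']")
empty = parse_list_field("[]")
not_a_string = parse_list_field(float("nan"))
print(accessions, empty, not_a_string)
```

On a DataFrame this could be applied column-wise, e.g. `pub_next_10['accessions'].apply(parse_list_field)`.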

In [27]:
pub_next_10.to_dict(orient='records')

[{'pmid': 36113773.0,
  'pmcid': 'PMC9526148',
  'preprint_doi': nan,
  'message': "The publication associated with SRP364677, PRJNA816172, and GSE198623 is titled 'A transcriptional cross species map of pancreatic islet cells' and is published in Molecular Metabolism. The PMID is 36113773 and the PMCID is PMC9526148.",
  'title': nan,
  'source': 'unknown',
  'multiple_publications': False,
  'all_publications': '[]',
  'accessions': "['SRP364677', 'PRJNA816172', 'GSE198623']"},
 {'pmid': nan,
  'pmcid': nan,
  'preprint_doi': nan,
  'message': "Found a publication titled 'A spatially resolved atlas of the human lung characterizes a gland' in Nature Genetics associated with ERP136992 and PRJEB52292. Please check the Nature Genetics website for more details.",
  'title': nan,
  'source': 'google_search',
  'multiple_publications': False,
  'all_publications': '[]',
  'accessions': "['ERP136992', 'PRJEB52292', '']"},
 {'pmid': 36517593.0,
  'pmcid': 'PMC9771812',
  'preprint_doi': nan,


In [None]:
run_in_notebook(df=df, output_file='publications.csv', batch_size=10)

In [8]:
import pandas as pd
pub = pd.read_csv('publications.csv')


In [12]:
pub

Unnamed: 0,pmid,pmcid,preprint_doi,title,message,source,multiple_publications,all_publications,accessions
0,,,,,Error: 'NoneType' object does not support item...,error,False,[],"['ERP149679', 'PRJEB64504', 'E-MTAB-8142']"
1,,,,,Error: 'NoneType' object does not support item...,error,False,[],"['ERP144781', 'PRJEB59723', 'E-MTAB-12650']"
2,,,,,Found a publication titled 'Cells of the adult...,,,,"['ERP123138', 'PRJEB39602', '']"
3,,PMC10786309,,,Found a publication associated with E-MTAB-130...,,,,"['ERP156277', 'PRJEB71477', 'E-MTAB-13085']"
4,,,,,Found a publication associated with E-MTAB-133...,,,,"['ERP151533', 'PRJEB66480', 'E-MTAB-13382']"
5,38100545.0,PMC7615868,,,Found the publication associated with ERP16080...,,,,"['ERP160803', 'PRJEB76244', 'E-MTAB-11528']"
6,,,,,Found a publication titled 'Human skeletal mus...,,,,"['ERP158366', 'PRJEB73595', 'E-MTAB-13874']"
7,,,,,"The publication associated with SRP402417, PRJ...",,,,"['SRP402417', 'PRJNA890219', 'GSE215403']"
8,,PMC7612735,,,"The publication associated with ERP136281, PRJ...",,,,"['ERP136281', 'PRJEB51634', 'E-MTAB-11536']"
9,,,,,Error: 'NoneType' object does not support item...,error,False,[],"['SRP324458', 'PRJNA738600', 'GSE178360']"


In [18]:
pub[['accessions','message']].to_dict(orient='records')

[{'accessions': "['ERP149679', 'PRJEB64504', 'E-MTAB-8142']",
  'message': "Error: 'NoneType' object does not support item assignment"},
 {'accessions': "['ERP144781', 'PRJEB59723', 'E-MTAB-12650']",
  'message': "Error: 'NoneType' object does not support item assignment"},
 {'accessions': "['ERP123138', 'PRJEB39602', '']",
  'message': "Found a publication titled 'Cells of the adult human heart' in Nature associated with ERP123138 and PRJEB39602. Please check the Nature article for more details."},
 {'accessions': "['ERP156277', 'PRJEB71477', 'E-MTAB-13085']",
  'message': 'Found a publication associated with E-MTAB-13085, likely linked to ERP156277 and PRJEB71477. The PMCID is PMC10786309.'},
 {'accessions': "['ERP151533', 'PRJEB66480', 'E-MTAB-13382']",
  'message': "Found a publication associated with E-MTAB-13382 in Cell Stem Cell. The publication is titled 'Human uterine natural killer cells regulate differentiation of ...' and is authored by Qian Li et al. No PMID or PMCID found