# Retrieving Data from CELLxGENE

CELLxGENE collects single-cell datasets from various sources and provides a platform for exploring them.
It also lets you download each dataset as an h5ad file.
This is a quick guide to retrieving datasets from CELLxGENE.

| Dataset | URL | collection_id | dataset_id |
| --- | --- | --- | --- |
| Hrovatin 2023 Mouse pancreatic islets | https://cellxgene.cziscience.com/collections/296237e2-393d-4e31-b590-b03f74ac5070 | 296237e2-393d-4e31-b590-b03f74ac5070 | 49e4ffcc-5444-406d-bdee-577127404ba8 |
| Sikkema 2023 HLCA core | https://cellxgene.cziscience.com/collections/6f6d381a-7701-4781-935c-db10d30de293 | 6f6d381a-7701-4781-935c-db10d30de293 | 066943a2-fdac-4b29-b348-40cede398e4e |
| Yoshida 2022 PBMC | https://datasets.cellxgene.cziscience.com/926a6acd-6555-4d55-9ba5-6927c9884e96 | 03f821b4-87be-4ff4-b65a-b5fc00061da7 | 2a498ace-872a-4935-984b-1afa70fd9886 |
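
If you switch between these datasets often, it can help to keep the identifiers in one place instead of pasting them by hand. The sketch below copies the IDs from the table into a dictionary and checks that each one is a well-formed UUID. The short keys (`hrovatin2023` and so on) and the helper names `is_uuid` and `get_ids` are hypothetical and not part of `utils`.

```python
import uuid

# Identifiers copied from the table above; the short keys are hypothetical labels.
CXG_DATASETS = {
    "hrovatin2023": {
        "collection_id": "296237e2-393d-4e31-b590-b03f74ac5070",
        "dataset_id": "49e4ffcc-5444-406d-bdee-577127404ba8",
    },
    "sikkema2023": {
        "collection_id": "6f6d381a-7701-4781-935c-db10d30de293",
        "dataset_id": "066943a2-fdac-4b29-b348-40cede398e4e",
    },
    "yoshida2022": {
        "collection_id": "03f821b4-87be-4ff4-b65a-b5fc00061da7",
        "dataset_id": "2a498ace-872a-4935-984b-1afa70fd9886",
    },
}


def is_uuid(value: str) -> bool:
    """Return True if `value` is a canonically formatted UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def get_ids(name: str) -> dict:
    """Look up the ID pair for a dataset key and fail early on malformed IDs."""
    ids = CXG_DATASETS[name]
    for field, value in ids.items():
        if not is_uuid(value):
            raise ValueError(f"{field} for {name!r} is not a valid UUID: {value!r}")
    return ids
```

Since `get_cxg_url` takes `collection_id` and `dataset_id` as keyword arguments, the result could be passed on as `get_cxg_url(**get_ids("yoshida2022"))`.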

## Dependencies

Make sure you have the scanpy environment and JupyterLab installed. See `envs/README.md` for details.

## Downloading the dataset

In [1]:
from utils.cellxgene import get_cxg_url

Use the `get_cxg_url` function from `utils.cellxgene` to get the download URL of a dataset from its `collection_id` and `dataset_id`.
The following example gets the URL of the Yoshida 2022 PBMC dataset and downloads it. To download a different dataset, change the `collection_id` and `dataset_id` values and the file name.

In [2]:
# get Yoshida data
url = get_cxg_url(
    collection_id="03f821b4-87be-4ff4-b65a-b5fc00061da7",
    dataset_id="2a498ace-872a-4935-984b-1afa70fd9886"
)

Get URL for CxG collection ID "03f821b4-87be-4ff4-b65a-b5fc00061da7" and dataset ID "2a498ace-872a-4935-984b-1afa70fd9886"
URL: https://datasets.cellxgene.cziscience.com/fb338c4d-e63a-4b17-abd6-1032a66c8886.h5ad


In [3]:
file_path = 'Yoshida2022.h5ad'

In [4]:
!curl {url} > {file_path}

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4227M  100 4227M    0     0  20.0M      0  0:03:30  0:03:30 --:--:-- 18.4M
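
If `curl` is not available, the same download can be done from Python. The following is a minimal sketch using only the standard library. The function name `download_file` and the `.part` suffix are illustrative choices, not something provided by `utils`. It streams the response in chunks so the roughly 4 GB file is never held in memory, and renames the temporary file only once the transfer has finished.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, file_path: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream `url` to `file_path` and return the number of bytes written."""
    target = Path(file_path)
    # Write to a temporary name so an interrupted download never looks complete.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    partial.replace(target)  # rename only after the transfer finished
    return target.stat().st_size
```

In the notebook this would be called as `download_file(url, file_path)`. Unlike `curl`, this sketch prints no progress and cannot resume a partial download.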


In [5]:
import scanpy as sc 
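
Before loading a multi-gigabyte file, a cheap sanity check can catch the case where the download saved an error page instead of the dataset. An h5ad file is an HDF5 file, and an HDF5 file without a user block starts with a fixed 8-byte signature. The helper name `looks_like_hdf5` below is hypothetical.

```python
# First 8 bytes of an HDF5 file that has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(file_path: str) -> bool:
    """Return True if the file starts with the HDF5 format signature."""
    with open(file_path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

A call such as `looks_like_hdf5(file_path)` only inspects the header, so a truncated download would still pass. It is a quick first filter, not a full integrity check.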

In [6]:
adata = sc.read(file_path)
adata

AnnData object with n_obs × n_vars = 422220 × 33105
    obs: 'Ethnicity', 'BMI', 'annotation_broad', 'annotation_detailed', 'annotation_detailed_fullNames', 'Age_group', 'COVID_severity', 'COVID_status', 'Group', 'Smoker', 'sample_id', 'sequencing_library', 'Protein_modality_weight', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'donor_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'name', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'antibody_X', 'antibody_features', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap_aft