Script to query data from Cell Census.

SOMA = Stack Of Matrices, Annotated: https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md

CELLxGENE dataset schema: https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md

Helpful links:
https://github.com/chanzuckerberg/cell-census/blob/main/api/python/notebooks/api_demo/census_query_extract.ipynb

Future helpful link:
Interfacing pytorch models with anndata: https://anndata.readthedocs.io/en/latest/tutorials/notebooks/annloader.html

In [2]:
import cell_census
import anndata as ad

In [3]:
census = cell_census.open_soma(census_version="latest")


In [3]:
# Define a simple obs-axis query: lung cells (UBERON:0002048) from female donors (PATO:0000383),
# restricted to type II pneumocytes (CL:0002063) and stromal cells (CL:0000499).
adata = cell_census.get_anndata(
    census,
    "Homo sapiens",
    obs_value_filter="tissue_ontology_term_id=='UBERON:0002048' and sex_ontology_term_id=='PATO:0000383' and cell_type_ontology_term_id in ['CL:0002063', 'CL:0000499']",
)

display(adata)

AnnData object with n_obs × n_vars = 119269 × 60664
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

In ```obs``` there are several ontology-related columns. One of these might be our target variable, perhaps ```cell_type```?

- ```cell_type_ontology_term_id``` 
- ```development_stage_ontology_term_id``` 
- ```disease_ontology_term_id``` 
- ```self_reported_ethnicity_ontology_term_id``` 
- ```sex_ontology_term_id``` 
- ```tissue_ontology_term_id``` 
- ```tissue_general_ontology_term_id```
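The ontology columns can also be picked out programmatically rather than by eye. The sketch below works on a plain list of the column names copied from the printed `obs` listing; with a live object the same logic would presumably run over `adata.obs.columns`. The names `SUFFIX` and `label_for` are illustrative only.

```python
# Column names copied from the `obs` listing printed above.
obs_columns = [
    "soma_joinid", "dataset_id", "assay", "assay_ontology_term_id",
    "cell_type", "cell_type_ontology_term_id", "development_stage",
    "development_stage_ontology_term_id", "disease", "disease_ontology_term_id",
    "donor_id", "is_primary_data", "self_reported_ethnicity",
    "self_reported_ethnicity_ontology_term_id", "sex", "sex_ontology_term_id",
    "suspension_type", "tissue", "tissue_ontology_term_id", "tissue_general",
    "tissue_general_ontology_term_id",
]

SUFFIX = "_ontology_term_id"

# Map each ontology ID column to its human-readable label column, if one exists.
label_for = {
    col: col[: -len(SUFFIX)]
    for col in obs_columns
    if col.endswith(SUFFIX) and col[: -len(SUFFIX)] in obs_columns
}

for id_col, label_col in label_for.items():
    print(f"{id_col} -> {label_col}")
```

This also surfaces `assay_ontology_term_id`, which is not in the list above, so there are eight ID/label pairs in total.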

- ```obs``` = cell metadata
- ```var``` = feature metadata

In [4]:
type(adata)

anndata._core.anndata.AnnData

In [5]:
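# Passing an existing AnnData to the constructor returns another AnnData object,
# not its dimensions; adata.shape gives the (n_obs, n_vars) tuple.
adata_new = ad.AnnData(adata)
print(adata_new)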

AnnData object with n_obs × n_vars = 119269 × 60664
    obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'


In [6]:
adata.obs_names

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '119259', '119260', '119261', '119262', '119263', '119264', '119265',
       '119266', '119267', '119268'],
      dtype='object', length=119269)

In [7]:
adata.var_names

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
       ...
       '60654', '60655', '60656', '60657', '60658', '60659', '60660', '60661',
       '60662', '60663'],
      dtype='object', length=60664)

## Both adata.obs and adata.var are pandas DataFrames

In [8]:
adata.var

Unnamed: 0,soma_joinid,feature_id,feature_name,feature_length
0,0,ENSG00000238009,RP11-34P13.7,3726
1,1,ENSG00000279457,WASH9P,1397
2,2,ENSG00000228463,AP006222.1,8224
3,3,ENSG00000237094,RP4-669L17.4,6204
4,4,ENSG00000230021,RP11-206L10.17,5495
...,...,...,...,...
60659,60659,ENSG00000288699,RP11-182N22.10,654
60660,60660,ENSG00000288700,RP11-22E12.2,6888
60661,60661,ENSG00000288710,RP11-386G11.12,2968
60662,60662,ENSG00000288711,AP000326.5,1307


In [9]:
adata.obs

Unnamed: 0,soma_joinid,dataset_id,assay,assay_ontology_term_id,cell_type,cell_type_ontology_term_id,development_stage,development_stage_ontology_term_id,disease,disease_ontology_term_id,...,is_primary_data,self_reported_ethnicity,self_reported_ethnicity_ontology_term_id,sex,sex_ontology_term_id,suspension_type,tissue,tissue_ontology_term_id,tissue_general,tissue_general_ontology_term_id
0,1947485,97a17473-e2b1-4f31-a544-44a60773e2dd,10x 3' v3,EFO:0009922,type II pneumocyte,CL:0002063,59-year-old human stage,HsapDv:0000153,normal,PATO:0000461,...,False,European,HANCESTRO:0005,female,PATO:0000383,cell,lung,UBERON:0002048,lung,UBERON:0002048
1,1947486,97a17473-e2b1-4f31-a544-44a60773e2dd,10x 3' v3,EFO:0009922,type II pneumocyte,CL:0002063,59-year-old human stage,HsapDv:0000153,normal,PATO:0000461,...,False,European,HANCESTRO:0005,female,PATO:0000383,cell,lung,UBERON:0002048,lung,UBERON:0002048
2,1947489,97a17473-e2b1-4f31-a544-44a60773e2dd,10x 3' v3,EFO:0009922,type II pneumocyte,CL:0002063,59-year-old human stage,HsapDv:0000153,normal,PATO:0000461,...,False,European,HANCESTRO:0005,female,PATO:0000383,cell,lung,UBERON:0002048,lung,UBERON:0002048
3,1947491,97a17473-e2b1-4f31-a544-44a60773e2dd,10x 3' v3,EFO:0009922,type II pneumocyte,CL:0002063,59-year-old human stage,HsapDv:0000153,normal,PATO:0000461,...,False,European,HANCESTRO:0005,female,PATO:0000383,cell,lung,UBERON:0002048,lung,UBERON:0002048
4,1947494,97a17473-e2b1-4f31-a544-44a60773e2dd,10x 3' v3,EFO:0009922,type II pneumocyte,CL:0002063,59-year-old human stage,HsapDv:0000153,normal,PATO:0000461,...,False,European,HANCESTRO:0005,female,PATO:0000383,cell,lung,UBERON:0002048,lung,UBERON:0002048
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119264,35694911,4023a2bc-6325-47db-bfdf-9639e91042c2,10x 5' v1,EFO:0011025,type II pneumocyte,CL:0002063,15th week post-fertilization human stage,HsapDv:0000052,normal,PATO:0000461,...,False,unknown,unknown,female,PATO:0000383,cell,lung,UBERON:0002048,lung,UBERON:0002048
119265,35694942,4023a2bc-6325-47db-bfdf-9639e91042c2,10x 5' v1,EFO:0011025,type II pneumocyte,CL:0002063,20th week post-fertilization human stage,HsapDv:0000057,normal,PATO:0000461,...,False,unknown,unknown,female,PATO:0000383,cell,lung,UBERON:0002048,lung,UBERON:0002048
119266,35694943,4023a2bc-6325-47db-bfdf-9639e91042c2,10x 5' v1,EFO:0011025,type II pneumocyte,CL:0002063,20th week post-fertilization human stage,HsapDv:0000057,normal,PATO:0000461,...,False,unknown,unknown,female,PATO:0000383,cell,lung,UBERON:0002048,lung,UBERON:0002048
119267,35694955,4023a2bc-6325-47db-bfdf-9639e91042c2,10x 5' v1,EFO:0011025,type II pneumocyte,CL:0002063,20th week post-fertilization human stage,HsapDv:0000057,normal,PATO:0000461,...,False,unknown,unknown,female,PATO:0000383,cell,lung,UBERON:0002048,lung,UBERON:0002048


In [10]:
print(adata.obs_names[:10])

Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')


In [11]:
# Selecting a single obs column returns a pandas Series, not a DataFrame
adata_df = adata.obs['cell_type']

In [21]:
adata_df.value_counts()

type II pneumocyte    79151
stromal cell          40118
Name: cell_type, dtype: int64

In [13]:
adata_df.shape

(119269,)
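As a quick sanity check on the two outputs above, the per-label counts should add up to the number of cells, and the class proportions follow directly. A minimal sketch using the printed numbers rather than the live `adata_df` (the dictionary name is illustrative):

```python
# Counts copied from the value_counts() output above.
cell_type_counts = {"type II pneumocyte": 79151, "stromal cell": 40118}
n_obs = 119269  # from adata_df.shape

total = sum(cell_type_counts.values())
assert total == n_obs  # the two label counts account for every cell

# Share of each label, useful if cell_type becomes the target variable.
for label, count in cell_type_counts.items():
    print(f"{label}: {count / total:.1%}")
```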

## Find 10X 3' V3 data from human immune cells

```10x 3' v3``` is the assay.

```Homo sapiens``` selects the human portion of the census.

How do we get immune cells only?

According to the NCI Dictionary of Cancer Terms (https://www.cancer.gov/publications/dictionaries/cancer-terms/def/immune-cell), immune cells include neutrophils, eosinophils, basophils, mast cells, monocytes, macrophages, dendritic cells, natural killer cells, and lymphocytes (B cells and T cells).

All of these show up as ```cell_type``` values on the ```obs``` axis. Some appear under several labels (for example, both 'macrophage' and 'alveolar macrophage'). We could create a list of cell_type values to search for.
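One way to build that list is to match the unique `cell_type` labels against keywords from the definition above, then turn the matches into a filter string. This is only a sketch: `select_immune_labels` and `build_obs_filter` are hypothetical helpers, the sample labels are copied from outputs in this notebook, and the assumption that the value filter accepts double-quoted strings inside `in [...]` has not been checked against the census. Word boundaries matter here, because a plain substring test for "T cell" would also match "tuft cell" and "goblet cell".

```python
import re

# Keywords taken from the immune-cell definition quoted above (singular forms).
IMMUNE_KEYWORDS = [
    "neutrophil", "eosinophil", "basophil", "mast cell", "monocyte",
    "macrophage", "dendritic cell", "natural killer cell", "lymphocyte",
    "B cell", "T cell",
]

# \b word boundaries keep "T cell" from matching "tuft cell" or "goblet cell".
_IMMUNE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in IMMUNE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def select_immune_labels(labels):
    """Return the labels that contain at least one immune keyword."""
    return [label for label in labels if _IMMUNE_RE.search(label)]


def build_obs_filter(assay, cell_types):
    """Build an obs value-filter string for one assay and a list of cell types.

    Values are double-quoted because assay names such as 10x 3' v3
    contain a single quote.
    """
    for value in [assay, *cell_types]:
        if '"' in value:
            raise ValueError(f"value contains a double quote: {value!r}")
    quoted = ", ".join(f'"{ct}"' for ct in cell_types)
    return f'assay == "{assay}" and cell_type in [{quoted}]'


# A few labels copied from the outputs in this notebook.
sample_labels = [
    "T cell", "monocyte", "dendritic cell", "alveolar macrophage",
    "natural killer cell", "B cell", "mast cell", "macrophage", "plasma cell",
    "type II pneumocyte", "endothelial cell", "intestinal tuft cell",
    "colon goblet cell", "effector CD8-positive, alpha-beta T cell",
]

immune_labels = select_immune_labels(sample_labels)
print(immune_labels)
print(build_obs_filter("10x 3' v3", immune_labels))
```

Keyword matching is not complete: 'plasma cell' is an immune cell type but contains none of the keywords, so the resulting list would still need a manual review.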

Use ```cell_census.get_anndata``` to get the gene expression data.


In [15]:
# obs.read() returns only the cell metadata (obs), not the gene expression matrix.
# To get the expression data as well, use cell_census.get_anndata (see the examples below).

cell_10v3 = (
    census["census_data"]["homo_sapiens"].obs.read(value_filter='''assay == "10x 3\' v3"''').concat().to_pandas()
)

# another query method
# lung_adata = cell_census.get_anndata(
#     census,
#     organism="Homo sapiens",
#     obs_coords=(lung_cell_subsampled_ids,),
#     var_coords=(lung_gene_ids,),
# )

# adata = cell_census.get_anndata(
#     census=census,
#     organism="Homo sapiens",
#     var_value_filter="feature_id in ['ENSG00000161798', 'ENSG00000188229']",
#     obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19'",
#     column_names={"obs": ["sex"]},
# )

In [None]:
# "B cell" is included only to test the query

human_immune_data = cell_census.get_anndata(
    census,
    organism="Homo sapiens",
    obs_value_filter='''assay == "10x 3\' v3" and cell_type == "B cell"''',
)
        

The above was started at 14:08
Ended at

Much too slow. 
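One thing that might help is asking for less: a tighter obs filter, only the obs columns that are needed, and only the genes of interest. The sketch below assembles those arguments using the `organism`, `obs_value_filter`, `var_value_filter`, and `column_names` parameters that already appear in the commented-out queries earlier. `narrowed_query_kwargs` and `run_query` are hypothetical helpers, the lung restriction and gene IDs are just the illustrative values from those queries, and whether this makes the query fast enough is untested.

```python
def narrowed_query_kwargs(obs_filter, obs_columns, feature_ids=None):
    """Assemble keyword arguments for cell_census.get_anndata that limit the fetch."""
    kwargs = {
        "organism": "Homo sapiens",
        "obs_value_filter": obs_filter,
        # Only bring back the obs columns that are actually needed.
        "column_names": {"obs": list(obs_columns)},
    }
    if feature_ids:
        # Restrict the var axis to the listed Ensembl gene IDs.
        kwargs["var_value_filter"] = f"feature_id in {list(feature_ids)!r}"
    return kwargs


def run_query(census, **kwargs):
    """Run the query against an open census handle."""
    import cell_census  # imported lazily so the helper above works without it

    return cell_census.get_anndata(census, **kwargs)


kwargs = narrowed_query_kwargs(
    obs_filter='''assay == "10x 3' v3" and cell_type == "B cell" and tissue_general == "lung"''',
    obs_columns=["cell_type", "assay", "tissue_general"],
    feature_ids=["ENSG00000161798", "ENSG00000188229"],
)
print(kwargs)
```

With an open handle, `human_immune_data = run_query(census, **kwargs)` would issue the narrowed query.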

In [36]:
human_immune_data

NameError: name 'human_immune_data' is not defined

In [1]:
help(cell_census.get_anndata)

NameError: name 'cell_census' is not defined

In [16]:
cell_10v3.head()

Unnamed: 0,soma_joinid,dataset_id,assay,assay_ontology_term_id,cell_type,cell_type_ontology_term_id,development_stage,development_stage_ontology_term_id,disease,disease_ontology_term_id,...,is_primary_data,self_reported_ethnicity,self_reported_ethnicity_ontology_term_id,sex,sex_ontology_term_id,suspension_type,tissue,tissue_ontology_term_id,tissue_general,tissue_general_ontology_term_id
0,68036,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,T cell,CL:0000084,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048
1,68037,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,T cell,CL:0000084,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048
2,68038,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,monocyte,CL:0000576,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048
3,68039,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,T cell,CL:0000084,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048
4,68040,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,monocyte,CL:0000576,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048


In [28]:
cell_10v3

Unnamed: 0,soma_joinid,dataset_id,assay,assay_ontology_term_id,cell_type,cell_type_ontology_term_id,development_stage,development_stage_ontology_term_id,disease,disease_ontology_term_id,...,is_primary_data,self_reported_ethnicity,self_reported_ethnicity_ontology_term_id,sex,sex_ontology_term_id,suspension_type,tissue,tissue_ontology_term_id,tissue_general,tissue_general_ontology_term_id
0,68036,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,T cell,CL:0000084,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048
1,68037,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,T cell,CL:0000084,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048
2,68038,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,monocyte,CL:0000576,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048
3,68039,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,T cell,CL:0000084,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048
4,68040,1e5bd3b8-6a0e-4959-8d69-cafed30fe814,10x 3' v3,EFO:0009922,monocyte,CL:0000576,63-year-old human stage,HsapDv:0000157,normal,PATO:0000461,...,True,European,HANCESTRO:0005,male,PATO:0000384,cell,alveolus of lung,UBERON:0002299,lung,UBERON:0002048
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18278131,43548596,d3a83885-5198-4b04-8314-b753b66ef9a8,10x 3' v3,EFO:0009922,"effector CD8-positive, alpha-beta T cell",CL:0001050,71-year-old human stage,HsapDv:0000165,benign prostatic hyperplasia,MONDO:0010811,...,False,European,HANCESTRO:0005,male,PATO:0000384,cell,transition zone of prostate,UBERON:8410025,prostate gland,UBERON:0002367
18278132,43548597,d3a83885-5198-4b04-8314-b753b66ef9a8,10x 3' v3,EFO:0009922,"effector CD8-positive, alpha-beta T cell",CL:0001050,71-year-old human stage,HsapDv:0000165,benign prostatic hyperplasia,MONDO:0010811,...,False,European,HANCESTRO:0005,male,PATO:0000384,cell,transition zone of prostate,UBERON:8410025,prostate gland,UBERON:0002367
18278133,43548598,d3a83885-5198-4b04-8314-b753b66ef9a8,10x 3' v3,EFO:0009922,CD1c-positive myeloid dendritic cell,CL:0002399,71-year-old human stage,HsapDv:0000165,benign prostatic hyperplasia,MONDO:0010811,...,False,European,HANCESTRO:0005,male,PATO:0000384,cell,transition zone of prostate,UBERON:8410025,prostate gland,UBERON:0002367
18278134,43548599,d3a83885-5198-4b04-8314-b753b66ef9a8,10x 3' v3,EFO:0009922,"effector CD8-positive, alpha-beta T cell",CL:0001050,71-year-old human stage,HsapDv:0000165,benign prostatic hyperplasia,MONDO:0010811,...,False,European,HANCESTRO:0005,male,PATO:0000384,cell,transition zone of prostate,UBERON:8410025,prostate gland,UBERON:0002367


In [26]:
cell_types = cell_10v3['cell_type'].unique()

In [27]:
print(cell_types)

['T cell' 'monocyte' 'dendritic cell' 'alveolar macrophage'
 'natural killer cell' 'B cell' 'mast cell' 'macrophage' 'plasma cell'
 'type II pneumocyte' 'endothelial cell'
 'epithelial cell of lower respiratory tract' 'smooth muscle cell'
 'fibroblast' 'type I pneumocyte' 'endothelial cell of lymphatic vessel'
 'ciliated cell' 'pericyte' 'enterocyte of epithelium of small intestine'
 'intestinal tuft cell' 'enterocyte of epithelium of large intestine'
 'colon goblet cell' 'gut absorptive cell' 'small intestine goblet cell'
 'enteroendocrine cell of colon' 'tuft cell of colon'
 'intestinal crypt stem cell of colon'
 'intestinal crypt stem cell of small intestine'
 'epithelial cell of small intestine'
 'transit amplifying cell of small intestine'
 'transit amplifying cell of colon' 'progenitor cell'
 'enteroendocrine cell of small intestine'
 'paneth cell of epithelium of small intestine'
 'microfold cell of epithelium of small intestine'
 'luminal epithelial cell of mammary gland' 'basa