This notebook helps you find the Cell Ontology (CL) values that are present in the Cell Census and match your preferences. There are three steps:

1) Get the values from the ontology
2) Get the cell types that are included in the Cell Census
3) Compare the two sets and keep the values that overlap

This produces a list of CL values that match your requirements. You can then pass this list to `download_cell_census.py` to download the specific parts of the Census you need.

A possible improvement is to save the values to a text file that the download script can then import. That would probably make runs more consistent.
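
A minimal sketch of the text-file idea, assuming one CL ID per line. The helper names `save_id_list` and `load_id_list` and the file name `cell_type_list.txt` are hypothetical and are not part of `download_cell_census.py`.

```python
import tempfile
from pathlib import Path


def save_id_list(ids, path):
    """Write one ontology ID per line (hypothetical helper)."""
    Path(path).write_text("\n".join(ids) + "\n", encoding="utf-8")


def load_id_list(path):
    """Read the IDs back, skipping blank lines (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# A few IDs for illustration only.
example_ids = ["CL:0000542", "CL:0000738", "CL:0000084"]

# Round-trip through a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "cell_type_list.txt"
    save_id_list(example_ids, out_file)
    round_trip = load_id_list(out_file)
```

A plain text file keeps the list easy to inspect and to version alongside the notebook.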


In [1]:
import pronto 
import warnings
warnings.filterwarnings("ignore", category=pronto.warnings.ProntoWarning)

import cellxgene_census


In [2]:
cl = pronto.Ontology.from_obo_library('cl.owl')


## Get the values from the ontology

First, you need to query the ontology in order to identify which values you are interested in. There are a few options here based on your preference. 

In [42]:
# Set the top level of the ontology. Options considered:
#   CL:0000542 lymphocyte, CL:0000738 leukocyte, CL:0000988 hematopoietic cell
root_node = cl['CL:0000988']


Run the following if you want leaf values only:

In [17]:


leaf_list = []

for term in root_node.subclasses(distance=None,with_self=False).to_set():
    if term.is_leaf():
        leaf_list.append(term.id)
        
ontology_list = [x for x in leaf_list if x.startswith('CL')]


Run the following if you want any values under the top level:

In [4]:
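# Collect every descendant of the root, not only the leaves
descendant_list = []

for term in root_node.subclasses(distance=None,with_self=False).to_set():
    descendant_list.append(term.id)

# Keep only Cell Ontology terms
ontology_list = [x for x in descendant_list if x.startswith('CL')]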


Run the following if you want any values under the top level that are within a certain distance of the leaf level. Note that most values in the ontology are within 2-3 levels of a leaf. 

In [45]:
leaf_list = []

for term in root_node.subclasses(distance=None,with_self=False).to_set():
    leaf_list.append(term.id)
    
cl_leaf_list_full = [x for x in leaf_list if x.startswith('CL')]

# set the max distance from the bottom level of the ontology 
# that is allowed
max_distance = 5

# set with_self = True because we want to include leaves that have no descendants

ontology_list = []
for node in cl_leaf_list_full:
    # keep the node if any term within max_distance of it (itself included) is a leaf
    nearby_terms = cl[node].subclasses(distance=max_distance,with_self=True).to_set()
    if any(term.is_leaf() for term in nearby_terms):
        ontology_list.append(node)
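
To make the distance filter concrete, here is a small sketch on a toy hierarchy that does not use pronto. The node names and the `children` mapping are made up for illustration, and the sketch assumes that distance counts the number of parent-to-child edges.

```python
from collections import deque

# Toy hierarchy (parent -> children); names are made up for illustration.
children = {
    "root": ["a", "b"],
    "a": ["a1"],
    "a1": ["a2"],
    "a2": [],
    "b": [],
}


def distance_to_nearest_leaf(node):
    """Breadth-first search for the closest descendant with no children."""
    queue = deque([(node, 0)])
    seen = {node}
    while queue:
        current, dist = queue.popleft()
        if not children[current]:
            return dist
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                queue.append((child, dist + 1))


max_distance = 1

# Exclude the root itself, mirroring with_self=False in the first loop above.
kept = sorted(
    n for n in children
    if n != "root" and distance_to_nearest_leaf(n) <= max_distance
)
```

Here `a` is two edges away from its nearest leaf, so it is dropped when `max_distance = 1`, while `a1`, `a2`, and `b` are kept.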



## Get the possible values that are included in the Cell Census

Next, query the Census to find which cell type values it contains. This step was originally required because a data query returned an error if any requested cell type value was missing from the Census. The Census has since been updated to return only the values that are present, but it is still useful to have an explicit list of the values we have data for.

In [29]:
census = cellxgene_census.open_soma(census_version="stable")


The "stable" release is currently 2023-12-15. Specify 'census_version="2023-12-15"' in future calls to open_soma() to ensure data consistency.


In [30]:
def check_subset(value_filter, col):
    '''
    This function checks an active census object to identify the unique values contained in the
    column of interest, after filtering on an initial column.
    
    Assumes there is an active census object already open. Assumes you only want to query on cell metadata. 
    Gene metadata querying not currently supported. Currently only supports querying one column at a time.
    
    Parameters
    ----------
    value_filter : string
        string containing obs parameter filter
        
    col : string
        string containing column of interest for identifying unique values
                
    Returns
    -------
        array of the unique values for the input column after applying the filter
    
    '''
    cell_data = (
        census["census_data"]["homo_sapiens"]
        # read only the column of interest to reduce the amount of data loaded
        .obs.read(value_filter=value_filter, column_names=[col])
        .concat()
        .to_pandas()
    )
    
    return cell_data[col].unique()
    

In [34]:
census_vals = check_subset('assay == "10x 3\' v3"', 'cell_type_ontology_term_id')
# To also restrict to specific cell types, extend the filter with:
#   and cell_type_ontology_term_id in ["CL:0000738","CL:0000542"]

After filtering on  assay == "10x 3' v3" the unique values for  cell_type_ontology_term_id are:
['CL:0002605' 'CL:0000128' 'CL:4023051' 'CL:0000129' 'CL:0002453'
 'CL:1001602' 'CL:4023012' 'CL:4023013' 'CL:4023038' 'CL:4023041'
 'CL:4023070' 'CL:4023016' 'CL:4023015' 'CL:4023011' 'CL:4023017'
 'CL:4023018' 'CL:4023036' 'CL:4023040' 'CL:0000115' 'CL:0008034'
 'CL:0002138' 'CL:0000763' 'CL:0000057' 'CL:0002319' 'CL:0000542'
 'CL:0000077' 'CL:0000097' 'CL:0002131' 'CL:0002129' 'CL:0000136'
 'CL:0002046' 'CL:0000817' 'CL:0000051' 'CL:0000826' 'CL:0001029'
 'CL:0000990' 'CL:0000785' 'CL:0000816' 'CL:0000786' 'CL:0000784'
 'CL:0000557' 'CL:0000084' 'CL:0000814' 'CL:0000837' 'CL:0000037'
 'CL:0000576' 'CL:0000050' 'CL:0001054' 'CL:0002397' 'CL:0000003'
 'CL:0000913' 'CL:0002338' 'CL:0000939' 'CL:0000895' 'CL:0000623'
 'CL:0000794' 'CL:0000904' 'CL:0011025' 'CL:0000900' 'CL:0000540'
 'CL:0000700' 'CL:0000127' 'CL:0000815' 'CL:0000912' 'CL:0000940'
 'CL:0000899' 'CL:0000312' 'CL:0000798' 'CL:10

## Compare ontology and cell_census values to select only ontology values in cell census

Finally, keep only the ontology values that also appear in the Census. This overlap is the final list.

In [48]:
cell_type_list = [x for x in census_vals if x in ontology_list]


In [49]:
print('There are {} values in ontology_list'.format(len(ontology_list)))
print('There are {} values in census_vals'.format(len(census_vals)))

print('There are {} values in cell_type_list'.format(len(cell_type_list)))

There are 666 values in ontology_list
There are 448 values in census_vals
There are 146 values in cell_type_list
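
In the run above, 146 of the 666 ontology values appear in the Census. The sketch below shows the same comparison using sets, which also makes it easy to list the ontology values that have no data. The two input lists are toy stand-ins, and the names `ontology_ids`, `census_ids`, and `not_in_census` are assumptions for illustration.

```python
# Toy stand-ins for ontology_list and census_vals (values are illustrative only).
ontology_ids = ["CL:0000542", "CL:0000738", "CL:0000084", "CL:0000236"]
census_ids = ["CL:0000128", "CL:0000542", "CL:0000084"]

ontology_set = set(ontology_ids)
census_set = set(census_ids)

# Same result as the list comprehension above, keeping the Census order.
overlap = [x for x in census_ids if x in ontology_set]

# Ontology terms with no matching value in the Census.
not_in_census = sorted(ontology_set - census_set)

print("{} overlapping, {} not in the Census".format(len(overlap), len(not_in_census)))
```

Set membership tests stay fast as the lists grow, whereas `x in some_list` scans the whole list each time.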


In [50]:
print(cell_type_list)

['CL:0000129', 'CL:0000763', 'CL:0000542', 'CL:0000097', 'CL:0002046', 'CL:0000817', 'CL:0000051', 'CL:0000826', 'CL:0001029', 'CL:0000990', 'CL:0000785', 'CL:0000816', 'CL:0000786', 'CL:0000784', 'CL:0000557', 'CL:0000084', 'CL:0000814', 'CL:0000837', 'CL:0000037', 'CL:0000576', 'CL:0000050', 'CL:0001054', 'CL:0002397', 'CL:0000913', 'CL:0002338', 'CL:0000939', 'CL:0000895', 'CL:0000623', 'CL:0000794', 'CL:0000904', 'CL:0011025', 'CL:0000900', 'CL:0000815', 'CL:0000912', 'CL:0000940', 'CL:0000899', 'CL:0000798', 'CL:0000235', 'CL:0000775', 'CL:0000236', 'CL:0000453', 'CL:0002343', 'CL:0001078', 'CL:3000001', 'CL:0000451', 'CL:0000094', 'CL:0000788', 'CL:0000787', 'CL:0000492', 'CL:0000625', 'CL:0000624', 'CL:0000809', 'CL:0000038', 'CL:0000909', 'CL:0002393', 'CL:0000860', 'CL:0000897', 'CL:0000921', 'CL:0001065', 'CL:0000232', 'CL:0000233', 'CL:0002394', 'CL:0002399', 'CL:0000764', 'CL:0000908', 'CL:0000875', 'CL:0000767', 'CL:0000980', 'CL:0000049', 'CL:0000738', 'CL:2000055', 'CL:0
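
As a possible next step, the final list can be turned into an obs filter string of the same form used in the query above. This is a sketch only: `build_value_filter` is a hypothetical helper and is not part of `cellxgene_census` or `download_cell_census.py`.

```python
def build_value_filter(cell_type_ids, assay=None):
    """Build an obs filter string from a list of CL IDs (hypothetical helper)."""
    # Double quotes are used because assay names such as 10x 3' v3 contain a single quote.
    quoted = ", ".join('"{}"'.format(x) for x in cell_type_ids)
    clauses = ["cell_type_ontology_term_id in [{}]".format(quoted)]
    if assay is not None:
        clauses.insert(0, 'assay == "{}"'.format(assay))
    return " and ".join(clauses)


example_filter = build_value_filter(["CL:0000738", "CL:0000542"], assay="10x 3' v3")
print(example_filter)
```

The resulting string could be passed as the first argument of `check_subset`, or to any other query that accepts an obs filter.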