# CCM Benchmate APIs module tutorial, part 2

Many APIs are already integrated, and more will likely be added, so there may be a part 3.

In the last notebook we looked at Ensembl, BioGRID, IntAct and STRING. In this notebook we will look at the NCBI, UniProt, Reactome and RNAcentral APIs.

## NCBI

This is probably the thinnest wrapper of all the APIs. The main reason is that it covers essentially the entirety of the NCBI databases, so you have a lot of options and flexibility for querying. As always, that flexibility comes with the burden of verbosity. We will cover **some** of the endpoints in this tutorial; for the rest, you can check the E-utilities guide and the NCBI website.

In [1]:
from ccm_benchmate.apis.ncbi import Ncbi
ncbi = Ncbi(email="alper.celik@sickkids.ca")  # the email lets NCBI contact you and ask you to stop abusing their resources; NCBI also allows a higher request rate when an API key is provided.

In [2]:
# Let's get some info. These are all the databases you can search. I would not use the pubmed and pmc searches here; use the literature submodule instead (see its README and notebook).
ncbi.databases

['pubmed',
 'protein',
 'nuccore',
 'ipg',
 'nucleotide',
 'structure',
 'genome',
 'annotinfo',
 'assembly',
 'bioproject',
 'biosample',
 'blastdbinfo',
 'books',
 'cdd',
 'clinvar',
 'gap',
 'gapplus',
 'grasp',
 'dbvar',
 'gene',
 'gds',
 'geoprofiles',
 'medgen',
 'mesh',
 'nlmcatalog',
 'omim',
 'orgtrack',
 'pmc',
 'proteinclusters',
 'pcassay',
 'protfam',
 'pccompound',
 'pcsubstance',
 'seqannot',
 'snp',
 'sra',
 'taxonomy',
 'biocollections',
 'gtr']

In [4]:
omim_codes=ncbi.search(db="omim", query="cancer", retmax=1000)
omim_codes

['621238', '621215', '621206', '621205', '621204', '621181', '621173', '621165', '621132', '621115', '621105', '621094', '621050', '621047', '621041', '621039', '621022', '621014', '621005', '620995', '620992', '620980', '620961', '620959', '620946', '620934', '620933', '620929', '620909', '620896', '620874', '620862', '620859', '620839', '620831', '620824', '620821', '620791', '620788', '620770', '620761', '620735', '620733', '620723', '620720', '620701', '620697', '620696', '620694', '620691', '620682', '620671', '620638', '620579', '620566', '620555', '620554', '620552', '620544', '620539', '620529', '620524', '620517', '620497', '620477', '620464', '620459', '620442', '620436', '620431', '620421', '620412', '620410', '620408', '620396', '620390', '620373', '620365', '620355', '620302', '620290', '620267', '620259', '620255', '620230', '620226', '620214', '620189', '620163', '620162', '620153', '620119', '620109', '620087', '620079', '620064', '620062', '620059', '620054', '620051',

In [7]:
# The search function returns IDs and nothing else; you can then use the other methods to get what you need.
# Wow, that's a lot of OMIM codes related to cancer. I'm going to take the first 10.
mycodes=omim_codes[0:10]
summaries=[ncbi.summary("omim", code)[0] for code in mycodes]

In [10]:
# we have a list of summaries, let's take a look at one
summaries[0]

{'Item': [], 'Id': '621238', 'Oid': '*621238', 'Title': 'RNA, U2 SMALL NUCLEAR, 2; RNU2-2', 'AltTitles': '', 'Locus': ''}
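
Since each OMIM summary is a small dictionary, a list of them can be collapsed into an ID-to-title lookup. The following is a minimal sketch, not part of `ccm_benchmate`: `titles_by_id` is a hypothetical helper, and the `summaries` list here is a stand-in holding only the single record shown above.

```python
# Stand-in for the `summaries` list built above; this record is copied
# from the output of summaries[0].
summaries = [
    {"Item": [], "Id": "621238", "Oid": "*621238",
     "Title": "RNA, U2 SMALL NUCLEAR, 2; RNU2-2", "AltTitles": "", "Locus": ""},
]


def titles_by_id(records):
    """Map each OMIM ID to its title, skipping records that lack either key."""
    return {rec["Id"]: rec["Title"] for rec in records
            if "Id" in rec and "Title" in rec}


omim_titles = titles_by_id(summaries)
print(omim_titles["621238"])
```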

In [12]:
# You can get the full record using fetch. Sometimes there is no full record and the summary is the record; this is the case for OMIM,
# but not for things like genes.
full_record=ncbi.fetch("omim", omim_codes[0])
full_record

['621238']

In [23]:
genes=ncbi.search("gene", "cancer", retmax=10)
genes

['141653481', '141569333', '141567492', '14910', '141633492', '141592724', '7157', '141569264', '141584944', '141579888']

In [25]:
#brief information
ncbi.summary("gene", genes[0])

{'DocumentSummarySet': DictElement({'DocumentSummary': [DictElement({'Name': 'LOC141653481', 'Description': 'protein BREAST CANCER SUSCEPTIBILITY 2 homolog B-like', 'Status': '0', 'CurrentID': '0', 'Chromosome': '4', 'GeneticSource': 'genomic', 'MapLocation': '', 'OtherAliases': '', 'OtherDesignations': 'protein BREAST CANCER SUSCEPTIBILITY 2 homolog B-like', 'NomenclatureSymbol': '', 'NomenclatureName': '', 'NomenclatureStatus': '', 'Mim': [], 'GenomicInfo': [{'ChrLoc': '4', 'ChrAccVer': 'NC_133529.1', 'ChrStart': '47356275', 'ChrStop': '47340589', 'ExonCount': '22'}], 'GeneWeight': '0', 'Summary': '', 'ChrSort': '~~last', 'ChrStart': '999999999', 'Organism': {'ScientificName': 'Silene latifolia', 'CommonName': '', 'TaxID': '37657'}, 'LocationHist': [{'AnnotationRelease': 'RS_2025_06', 'AssemblyAccVer': 'GCF_048544455.1', 'ChrAccVer': 'NC_133529.1', 'ChrStart': '47356275', 'ChrStop': '47340589'}]}, attributes={'uid': '141653481'})], 'DbBuild': 'Build250617-2150m.1'}, attributes={'stat
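
The first hit above is a gene from *Silene latifolia*, a plant, because a bare `cancer` query matches genes from every organism. Entrez search terms support field tags such as `[Organism]`, so one way to narrow things down is to build a restricted term. This is only a sketch: `build_term` is a hypothetical helper, and it assumes that `ncbi.search` forwards `query` unchanged as the ESearch term, which this tutorial does not confirm.

```python
def build_term(text, organism=None):
    """Combine free text with an optional Entrez [Organism] restriction."""
    if organism is None:
        return text
    return f'({text}) AND "{organism}"[Organism]'


term = build_term("cancer", organism="Homo sapiens")
print(term)  # (cancer) AND "Homo sapiens"[Organism]

# Hypothetical usage, assuming the query string is passed straight to ESearch:
# human_genes = ncbi.search("gene", term, retmax=10)
```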

In [26]:
#full record
print(ncbi.fetch("gene", genes[0]))

[{'Entrezgene_track-info': {'Gene-track': {'Gene-track_geneid': '141653481', 'Gene-track_status': StringElement('0', attributes={'value': 'live'}), 'Gene-track_create-date': {'Date': {'Date_std': {'Date-std': {'Date-std_year': '2025', 'Date-std_month': '6', 'Date-std_day': '12'}}}}, 'Gene-track_update-date': {'Date': {'Date_std': {'Date-std': {'Date-std_year': '2025', 'Date-std_month': '6', 'Date-std_day': '13'}}}}}}, 'Entrezgene_type': StringElement('6', attributes={'value': 'protein-coding'}), 'Entrezgene_source': {'BioSource': {'BioSource_genome': StringElement('1', attributes={'value': 'genomic'}), 'BioSource_origin': StringElement('1', attributes={'value': 'natural'}), 'BioSource_org': {'Org-ref': {'Org-ref_taxname': 'Silene latifolia', 'Org-ref_db': [{'Dbtag_db': 'taxon', 'Dbtag_tag': {'Object-id': {'Object-id_id': '37657'}}}], 'Org-ref_orgname': {'OrgName': {'OrgName_name': {'OrgName_name_binomial': {'BinomialOrgName': {'BinomialOrgName_genus': 'Silene', 'BinomialOrgName_species

I'm not going to go into a lot of detail, partly because there are so many different databases, and what the summary and fetch methods return differs for each one. However, the response for a given call and database is quite consistent, so if you know what you are looking for, it is not difficult to streamline searching and knowledge gathering using these endpoints.
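
As an illustration of that consistency, the nested gene summary shown earlier can be flattened with a small function. This is a sketch under assumptions: `flatten_gene_summary` is a hypothetical helper, and `gene_summary` is a plain-dict stand-in trimmed to a few of the fields visible in the output above (the real object uses dict-like `DictElement` values).

```python
# Plain-dict stand-in for the return value of ncbi.summary("gene", ...),
# trimmed to a handful of fields copied from the output above.
gene_summary = {
    "DocumentSummarySet": {
        "DocumentSummary": [
            {
                "Name": "LOC141653481",
                "Description": "protein BREAST CANCER SUSCEPTIBILITY 2 homolog B-like",
                "Chromosome": "4",
                "Organism": {"ScientificName": "Silene latifolia",
                             "CommonName": "", "TaxID": "37657"},
            }
        ]
    }
}


def flatten_gene_summary(summary):
    """Return one small flat dict per DocumentSummary entry."""
    rows = []
    for doc in summary["DocumentSummarySet"]["DocumentSummary"]:
        organism = doc.get("Organism", {})
        rows.append({
            "name": doc.get("Name", ""),
            "description": doc.get("Description", ""),
            "chromosome": doc.get("Chromosome", ""),
            "organism": organism.get("ScientificName", ""),
            "taxid": organism.get("TaxID", ""),
        })
    return rows


gene_rows = flatten_gene_summary(gene_summary)
print(gene_rows[0]["name"], gene_rows[0]["organism"])
```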

## Uniprot

UniProt is an extensive database of proteins and protein features. It has several API endpoints; the integrated ones are the most comprehensive: proteins, mutagenesis (high-throughput mutagenesis experiments), isoforms and variation. You can query all of them with a single command, like so:

In [1]:
from ccm_benchmate.apis.uniprot import UniProt
uniprot=UniProt()

In [2]:
results=uniprot.search_uniprot(uniprot_id="P01308", get_isoforms=True, get_variations=True,
                       get_mutagenesis=True, get_interactions=True, consolidate_refs=True, )

It seems there are no isoforms for the protein we looked up. If that is the case for any of the other endpoints, you will get a similar warning. Let's see what the results look like.

In [3]:
results

{'id': 'P01308',
 'name': 'Insulin',
 'sequence': 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN',
 'organism': {'name': [{'type': 'scientific', 'value': 'Homo sapiens'},
   {'type': 'common', 'value': 'Human'}],
  'taxid': 9606},
 'gene': [{'name': {'value': 'INS'}}],
 'feature_types': {'DISULFID',
  'HELIX',
  'PEPTIDE',
  'PROPEP',
  'SIGNAL',
  'STRAND',
  'TURN',
  'VARIANT'},
 'comment_types': {'ALTERNATIVE_PRODUCTS',
  'DISEASE',
  'FUNCTION',
  'INTERACTION',
  'PHARMACEUTICAL',
  'SEQUENCE_CAUTION',
  'SIMILARITY',
  'SUBCELLULAR_LOCATION',
  'SUBUNIT',
  'WEBRESOURCE'},
 'references': ['9235985',
  '5560404',
  '2991050',
  '6248962',
  '4803504',
  '3470784',
  '8421693',
  '6382002',
  '4443293',
  '7242673',
  '14426955',
  '6312455',
  '2196279',
  '3306677',
  '9141561',
  '18162506',
  '5101771',
  '1646635',
  '3057496',
  '381941',
  '6368587',
  '17855560',
  '1601997',
  '18192540',
  '3511099',
  '735

That's a lot. There are a few parts that I think are more interesting than others, so let's look at those.

In [4]:
results.keys()

dict_keys(['id', 'name', 'sequence', 'organism', 'gene', 'feature_types', 'comment_types', 'references', 'xref_types', 'xrefs', 'description', 'json', 'secondary_accessions', 'variation', 'interactions', 'mutagenesis', 'isoforms'])

In [5]:
results["references"]

['9235985',
 '5560404',
 '2991050',
 '6248962',
 '4803504',
 '3470784',
 '8421693',
 '6382002',
 '4443293',
 '7242673',
 '14426955',
 '6312455',
 '2196279',
 '3306677',
 '9141561',
 '18162506',
 '5101771',
 '1646635',
 '3057496',
 '381941',
 '6368587',
 '17855560',
 '1601997',
 '18192540',
 '3511099',
 '7350438',
 '18451997',
 '4698555',
 '23106816',
 '15070567',
 '1433291',
 '6243748',
 '2271664',
 '4019786',
 '503234',
 '8358440',
 '6339950',
 '6424111',
 '8636380',
 '18171712',
 '9667398',
 '6261753',
 '3537011',
 '2036420',
 '6927840',
 '4698553',
 '12952878',
 '6371526',
 '25423173',
 '20226046',
 '15489334']

These are PubMed IDs that you can feed into a literature search in the literature module or use with the Paper class directly.

In [6]:
# Here is a free-text description of what this protein is. This will be useful when we talk about RAG applications in the literature module.
results["description"]

'Insulin decreases blood glucose concentration. It increases cell permeability to monosaccharides, amino acids and fatty acids. It accelerates glycolysis, the pentose phosphate cycle, and glycogen synthesis in liver\nHeterodimer of a B chain and an A chain linked by two disulfide bonds (PubMed:25423173)\nThe disease is caused by variants affecting the gene represented in this entry\nThe disease is caused by variants affecting the gene represented in this entry\nThe disease is caused by variants affecting the gene represented in this entry\nThe disease is caused by variants affecting the gene represented in this entry\nAvailable under the names Humulin or Humalog (Eli Lilly) and Novolin (Novo Nordisk). Used in the treatment of diabetes. Humalog is an insulin analog with 52-Lys-Pro-53 instead of 52-Pro-Lys-53\nBelongs to the insulin family\n\n'
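
The description above is several comment texts joined by newlines, and some of them repeat (the disease sentence appears four times). If the text is going to be embedded for retrieval, it may help to drop the duplicates first. This is a minimal sketch with a hypothetical helper, `unique_lines`, run on an abridged copy of the description.

```python
# Abridged copy of results["description"]; the full text has more lines.
description = (
    "Insulin decreases blood glucose concentration.\n"
    "The disease is caused by variants affecting the gene represented in this entry\n"
    "The disease is caused by variants affecting the gene represented in this entry\n"
    "Belongs to the insulin family\n\n"
)


def unique_lines(text):
    """Split on newlines, drop blanks and repeats, keep first-seen order."""
    seen = set()
    kept = []
    for line in text.split("\n"):
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


clean_lines = unique_lines(description)
print("\n".join(clean_lines))
```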

In [7]:
# Cross-references to other databases, so you don't need to look them up manually. Similar to ensembl.xrefs.
results["xrefs"]

| | type | id | properties | isoform | evidences |
|---|---|---|---|---|---|
| 0 | EMBL | V00565 | {'molecule type': 'Genomic_DNA', 'protein sequ... | | |
| 1 | EMBL | M10039 | {'molecule type': 'Genomic_DNA', 'protein sequ... | | |
| 2 | EMBL | J00265 | {'molecule type': 'Genomic_DNA', 'protein sequ... | | |
| 3 | EMBL | X70508 | {'molecule type': 'mRNA', 'protein sequence ID... | | |
| 4 | EMBL | L15440 | {'molecule type': 'Genomic_DNA', 'protein sequ... | | |
| ... | ... | ... | ... | ... | ... |
| 972 | PRINTS | PR00277 | {'entry name': 'INSULIN'} | | |
| 973 | PRINTS | PR00276 | {'entry name': 'INSULINFAMLY'} | | |
| 974 | SMART | SM00078 | {'match status': '1', 'entry name': 'IlGF'} | | |
| 975 | SUPFAM | SSF56994 | {'match status': '1', 'entry name': 'Insulin-l... | | |


In [8]:
results["variation"]

Unnamed: 0,type,alternativeSequence,begin,end,xrefs,genomicLocation,locations,consequenceType,wildType,mutatedType,...,sourceType,cytogeneticBand,clinicalSignificances,association,descriptions,populationFrequencies,codon,predictions,ftId,evidences
0,VARIANT,?,1,1,"[{'name': 'cosmic curated', 'id': 'COSV9917115...","[NC_000011.10:g.2160969C>A, NC_000011.10:g.216...","[{'loc': 'p.Met1?', 'seqId': 'ENST00000381330'...",missense,M,?,...,large_scale_study,,,,,,,,,
1,VARIANT,I,1,1,"[{'name': 'ClinGen', 'id': 'CA344913', 'url': ...","[NC_000011.10:g.2160969C>T, NC_000011.10:g.216...","[{'loc': 'p.Met1Ile', 'seqId': 'ENST0000038133...",missense,M,I,...,large_scale_study,11p15.5,"[{'type': 'Variant of uncertain significance',...","[{'name': 'Diabetes mellitus, permanent neonat...","[{'value': 'Diabetes mellitus, permanent neona...",,,,,
2,VARIANT,V,1,1,"[{'name': 'ClinGen', 'id': 'CA5818208', 'url':...",[NC_000011.10:g.2160971T>C],"[{'loc': 'p.Met1Val', 'seqId': 'ENST0000038133...",missense,M,V,...,large_scale_study,11p15.5,"[{'type': 'Pathogenic', 'sources': ['ClinVar']...","[{'name': 'Diabetes mellitus, permanent neonat...","[{'value': 'Type 1 diabetes mellitus 2', 'sour...","[{'populationName': 'MAF', 'frequency': 1e-05,...",,,,
3,VARIANT,G,2,2,"[{'name': 'Ensembl', 'id': 'rs1845881727', 'ur...",[NC_000011.10:g.2160967G>C],"[{'loc': 'p.Ala2Gly', 'seqId': 'ENST0000038133...",missense,A,G,...,large_scale_study,11p15.5,,,,,GCC/GGC,"[{'predictionValType': 'unknown', 'predictorTy...",,
4,VARIANT,T,2,2,"[{'name': '1000Genomes', 'id': 'rs535989053', ...",[NC_000011.10:g.2160968C>T],"[{'loc': 'p.Ala2Thr', 'seqId': 'ENST0000038133...",missense,A,T,...,large_scale_study,11p15.5,,,,"[{'populationName': 'AF', 'frequency': 1.5754e...",GCC/ACC,"[{'predictionValType': 'unknown', 'predictorTy...",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
164,VARIANT,,108,108,"[{'name': 'ClinGen', 'id': 'CA645372891', 'url...",[NC_000011.10:g.2159860AGT[1]],"[{'loc': 'p.Tyr108del', 'seqId': 'ENST00000381...",inframe deletion,Y,,...,large_scale_study,11p15.5,"[{'type': 'Likely benign', 'sources': ['ClinVa...",,,,,,,
165,VARIANT,N,108,108,"[{'name': 'cosmic curated', 'id': 'COSV9917111...",[NC_000011.10:g.2159863A>T],"[{'loc': 'p.Tyr108Asn', 'seqId': 'ENST00000250...",missense,Y,N,...,large_scale_study,,,,,,,,,
166,VARIANT,*,109,109,"[{'name': 'gnomAD', 'id': 'rs1271239621', 'url...",[NC_000011.10:g.2159858G>T],"[{'loc': 'p.Cys109Ter', 'seqId': 'ENST00000381...",stop gained,C,*,...,large_scale_study,11p15.5,,,,,TGC/TGA,,,
167,VARIANT,F,109,109,"[{'name': 'ClinGen', 'id': 'CA379120644', 'url...",[NC_000011.10:g.2159859C>A],"[{'loc': 'p.Cys109Phe', 'seqId': 'ENST00000381...",missense,C,F,...,large_scale_study,11p15.5,"[{'type': 'Likely pathogenic', 'sources': ['En...","[{'name': 'Neonatal diabetes mellitus (NDM)', ...","[{'value': 'Neonatal diabetes mellitus', 'sour...",,TGC/TTC,"[{'predictionValType': 'probably damaging', 'p...",,
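
The `clinicalSignificances` column holds a list of dictionaries for some variants and nothing for others, so filtering needs a small guard. The sketch below works on plain row dictionaries; it assumes `results["variation"]` is a pandas DataFrame that could be converted with `.to_dict("records")`. The rows are abridged stand-ins based on the output above, and `has_significance` is a hypothetical helper.

```python
# Abridged stand-ins for rows of results["variation"]; only a few columns
# are kept, and truncated values from the output are left out.
variant_rows = [
    {"begin": 1, "wildType": "M", "mutatedType": "V",
     "clinicalSignificances": [{"type": "Pathogenic", "sources": ["ClinVar"]}]},
    {"begin": 2, "wildType": "A", "mutatedType": "G",
     "clinicalSignificances": None},
    {"begin": 108, "wildType": "Y", "mutatedType": "",
     "clinicalSignificances": [{"type": "Likely benign"}]},
    {"begin": 109, "wildType": "C", "mutatedType": "F",
     "clinicalSignificances": [{"type": "Likely pathogenic"}]},
]


def has_significance(row, wanted):
    """True if any clinical significance of the row is in `wanted`."""
    sigs = row.get("clinicalSignificances")
    if not isinstance(sigs, list):  # missing values may be None or NaN
        return False
    return any(sig.get("type", "").lower() in wanted for sig in sigs)


wanted = {"pathogenic", "likely pathogenic"}
pathogenic = [
    f'{row["wildType"]}{row["begin"]}{row["mutatedType"]}'
    for row in variant_rows
    if has_significance(row, wanted)
]
print(pathogenic)  # ['M1V', 'C109F']
```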


In [9]:
results["mutagenesis"]

| | type | description | start | end | alt | pubmed_id |
|---|---|---|---|---|---|---|
| 0 | MUTAGENESIS | | 59 | 59 | A | [23106816] |
| 1 | MUTAGENESIS | | 67 | 67 | A | [23106816] |
| 2 | MUTAGENESIS | | 67 | 67 | A | [23106816] |
| 3 | MUTAGENESIS | | 83 | 83 | A | [23106816] |
| 4 | MUTAGENESIS | | 83 | 83 | A | [23106816] |


In [10]:
results["feature_types"]

{'DISULFID',
 'HELIX',
 'PEPTIDE',
 'PROPEP',
 'SIGNAL',
 'STRAND',
 'TURN',
 'VARIANT'}

In [15]:
results["comment_types"]

{'ALTERNATIVE_PRODUCTS',
 'DISEASE',
 'FUNCTION',
 'INTERACTION',
 'PHARMACEUTICAL',
 'SEQUENCE_CAUTION',
 'SIMILARITY',
 'SUBCELLULAR_LOCATION',
 'SUBUNIT',
 'WEBRESOURCE'}

In [14]:
# Pass the raw JSON return to extract features; the same is true for comments.
uniprot.get_features(results["json"], "SIGNAL")

[{'type': 'SIGNAL',
  'category': 'MOLECULE_PROCESSING',
  'description': '',
  'begin': '1',
  'end': '24',
  'molecule': '',
  'evidences': [{'code': 'ECO:0000269',
    'source': {'name': 'PubMed',
     'id': '14426955',
     'url': 'http://www.ncbi.nlm.nih.gov/pubmed/14426955',
     'alternativeUrl': 'https://europepmc.org/abstract/MED/14426955'}}]}]
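
Feature coordinates such as `begin` and `end` are 1-based and inclusive, and they arrive as strings, so pulling the matching residues out of `results["sequence"]` takes a small conversion. A minimal sketch, with `feature_sequence` as a hypothetical helper and the sequence and feature values copied from the outputs above:

```python
# Insulin precursor sequence, copied from results["sequence"] above.
sequence = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

# The SIGNAL feature from the output above, trimmed to the coordinate fields.
signal_feature = {"type": "SIGNAL", "begin": "1", "end": "24"}


def feature_sequence(seq, feature):
    """Slice a feature out of a sequence using 1-based inclusive coordinates."""
    begin = int(feature["begin"])
    end = int(feature["end"])
    return seq[begin - 1:end]


signal_peptide = feature_sequence(sequence, signal_feature)
print(signal_peptide)  # MALWMRLLPLLALLALWGPDPAAA
```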

In [16]:
# get disease annotations
uniprot.get_comments(results["json"], "DISEASE")


[{'type': 'DISEASE',
  'diseaseId': 'Hyperproinsulinemia',
  'acronym': 'HPRI',
  'dbReference': {'type': 'MIM', 'id': '616214'},
  'description': {'value': 'An autosomal dominant condition characterized by elevated levels of serum proinsulin-like material.',
   'evidences': [{'code': 'ECO:0000269',
     'source': {'name': 'PubMed',
      'id': '1601997',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/1601997',
      'alternativeUrl': 'https://europepmc.org/abstract/MED/1601997'}},
    {'code': 'ECO:0000269',
     'source': {'name': 'PubMed',
      'id': '2196279',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/2196279',
      'alternativeUrl': 'https://europepmc.org/abstract/MED/2196279'}},
    {'code': 'ECO:0000269',
     'source': {'name': 'PubMed',
      'id': '3470784',
      'url': 'http://www.ncbi.nlm.nih.gov/pubmed/3470784',
      'alternativeUrl': 'https://europepmc.org/abstract/MED/3470784'}},
    {'code': 'ECO:0000269',
     'source': {'name': 'PubMed',
      'id': '4019
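
The disease comments carry a cross-reference to OMIM (`dbReference` with type `MIM`), which ties this section back to the NCBI one. The sketch below pulls out the disease name, acronym and MIM ID; `disease_table` is a hypothetical helper, and the input is trimmed from the first entry of the output above.

```python
# First disease comment from the output above, without the evidence list.
disease_comments = [
    {"type": "DISEASE", "diseaseId": "Hyperproinsulinemia", "acronym": "HPRI",
     "dbReference": {"type": "MIM", "id": "616214"}},
]


def disease_table(comments):
    """Return (disease name, acronym, cross-reference) tuples."""
    rows = []
    for comment in comments:
        ref = comment.get("dbReference")
        xref = f'{ref["type"]}:{ref["id"]}' if ref else ""
        rows.append((comment.get("diseaseId", ""),
                     comment.get("acronym", ""),
                     xref))
    return rows


diseases = disease_table(disease_comments)
print(diseases)

# Hypothetical follow-up, reusing the NCBI wrapper from the first section:
# ncbi.summary("omim", "616214")
```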

I think you get the idea. Please create an issue if we missed something.

## Reactome

Reactome is more concerned with biological reactions, pathways and the genes/proteins associated with them. You need to know your Reactome ID, but I think we can figure that out through either Ensembl or UniProt.

In [1]:
from ccm_benchmate.apis.reactome import Reactome
reactome=Reactome()

In [2]:
# Initialization gathers some up-to-date information. These are the fields you can search on.
reactome.show_fields()

['species', 'type', 'keyword', 'compartment']

In [3]:
reactome.show_values("species")

['Homo sapiens',
 'Mus musculus',
 'Rattus norvegicus',
 'Sus scrofa',
 'Bos taurus',
 'Canis familiaris',
 'Gallus gallus',
 'Drosophila melanogaster',
 'Caenorhabditis elegans',
 'Xenopus tropicalis',
 'Danio rerio',
 'Dictyostelium discoideum',
 'Saccharomyces cerevisiae',
 'Schizosaccharomyces pombe',
 'Entries without species',
 'Plasmodium falciparum',
 'Severe acute respiratory syndrome coronavirus 2',
 'Human immunodeficiency virus 1',
 'Human SARS coronavirus',
 'Influenza A virus',
 'Human respiratory syncytial virus A',
 'Mycobacterium tuberculosis',
 'Human cytomegalovirus',
 'Clostridium botulinum',
 'Xenopus laevis',
 'Hepatitis C Virus',
 'Oryctolagus cuniculus',
 'Rotavirus',
 'Human herpesvirus 1',
 'Escherichia coli',
 'Hepatitis B virus',
 'Infectious bronchitis virus',
 'Measles virus',
 'Bacillus anthracis',
 'Salmonella typhimurium',
 'Cricetulus griseus',
 'Neisseria meningitidis serogroup B',
 'Chlamydia trachomatis',
 'Listeria monocytogenes',
 'Leishmania majo

In [4]:
results=reactome.query(query="cancer", species="Homo sapiens", force_filters=False)

In [5]:
# the main results are in results section, but they are also divided into different groups.
results.keys()

dict_keys(['Pathway', 'Reaction', 'Interactor', 'Set', 'Protein', 'Complex', 'DNA Sequence', 'Icon'])

In [6]:
# we can take a look at one of them
results["Complex"]

[{'dbId': '2029919',
  'stId': 'R-HSA-2029919',
  'id': 'R-HSA-2029919',
  'name': 'FGFR2 W290C mutant dimer',
  'exactType': 'Complex',
  'species': ['Homo sapiens'],
  'compartmentNames': ['plasma membrane'],
  'compartmentAccession': ['0005886'],
  'isDisease': True,
  'icon': False,
  'disease': True},
 {'dbId': '8874796',
  'stId': 'R-HSA-8874796',
  'id': 'R-HSA-8874796',
  'name': 'TFAP2C homodimer:EGFR gene',
  'exactType': 'Complex',
  'species': ['Homo sapiens'],
  'compartmentNames': ['nucleoplasm'],
  'compartmentAccession': ['0005654'],
  'isDisease': False,
  'icon': False,
  'disease': False},
 {'dbId': '5655216',
  'stId': 'R-HSA-5655216',
  'id': 'R-HSA-5655216',
  'name': 'Activated FGFR1 mutants:p-FRS2',
  'exactType': 'Complex',
  'species': ['Homo sapiens'],
  'compartmentNames': ['plasma membrane'],
  'compartmentAccession': ['0005886'],
  'isDisease': True,
  'icon': False,
  'disease': True},
 {'dbId': '5655328',
  'stId': 'R-HSA-5655328',
  'id': 'R-HSA-5655328
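
Every entry in a result group carries an `isDisease` flag, so narrowing a group to disease-associated entries is a one-line filter. A minimal sketch on the first three complexes from the output above, trimmed to the fields used here:

```python
# First three entries of results["Complex"], trimmed to three fields.
complexes = [
    {"stId": "R-HSA-2029919", "name": "FGFR2 W290C mutant dimer",
     "isDisease": True},
    {"stId": "R-HSA-8874796", "name": "TFAP2C homodimer:EGFR gene",
     "isDisease": False},
    {"stId": "R-HSA-5655216", "name": "Activated FGFR1 mutants:p-FRS2",
     "isDisease": True},
]

# Keep only disease-associated complexes; .get() tolerates a missing flag.
disease_complexes = [c["name"] for c in complexes if c.get("isDisease")]
print(disease_complexes)
# ['FGFR2 W290C mutant dimer', 'Activated FGFR1 mutants:p-FRS2']
```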

In [7]:
# That's a lot, so let's take one.
results["Pathway"][0]

{'dbId': '9842640',
 'stId': 'R-HSA-9842640',
 'id': 'R-HSA-9842640',
 'name': 'Signaling by LTK in <span class="highlighting" >cancer</span>',
 'exactType': 'Pathway',
 'species': ['Homo sapiens'],
 'summation': 'LTK is a member of the anaplastic lymphoma kinase (ALK)/LTK subfamily within the insulin receptor superfamily of RTKs. LTK encodes an 864-amino-acid protein consisting of extracellular, transmembrane, and tyrosine kinase domains and a short carboxy terminus. The LTK kinase domain shares 80% identity with ALK (Roll and Reuther, 2012). The biological role of LTK is not well defined under normal physiological conditions, and unlike ALK, a clear role for LTK in <span class="highlighting" >cancer</span> is also not yet well established. LTK is overexpressed in leukemia, and high expression of LTK in early-stage non-small cell lung <span class="highlighting" >cancer</span> (NSCLC) has been associated with greater risk of metastasis (Mueller-Tidow et al, 2005; Roll and Reuther, 2012
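
Search hits wrap the matched term in `<span class="highlighting" >` tags, which you probably do not want in downstream text. The sketch below strips them with a regular expression; `strip_highlighting` is a hypothetical helper, and it only targets `span` tags rather than attempting general HTML cleanup.

```python
import re
from html import unescape

# Matches opening and closing span tags, with or without attributes.
SPAN_RE = re.compile(r"</?span\b[^>]*>")


def strip_highlighting(text):
    """Remove search-highlight span tags and decode any HTML entities."""
    return unescape(SPAN_RE.sub("", text))


name = 'Signaling by LTK in <span class="highlighting" >cancer</span>'
clean_name = strip_highlighting(name)
print(clean_name)  # Signaling by LTK in cancer
```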

In [10]:
# That's great: there is a plain-text summary of what this is all about. Again, think RAG; you don't have to do the thinking.
details=reactome.get_details(results["Pathway"][0]["dbId"])
details

{'dbId': 9842640,
 'displayName': 'Signaling by LTK in cancer',
 'stId': 'R-HSA-9842640',
 'stIdVersion': 'R-HSA-9842640.1',
 'created': {'dbId': 9842653,
  'displayName': 'Rothfels, Karen, 2023-08-27',
  'dateTime': '2023-08-27 21:38:46',
  'author': [{'dbId': 1226097,
    'displayName': 'Rothfels, K',
    'firstname': 'Karen',
    'initial': 'K',
    'orcidId': '0000-0002-0705-7048',
    'project': 'Reactome',
    'surname': 'Rothfels',
    'className': 'Person',
    'schemaClass': 'Person'}],
  'className': 'InstanceEdit',
  'schemaClass': 'InstanceEdit'},
 'modified': {'dbId': 9863688,
  'displayName': 'Rothfels, Karen, 2024-03-04',
  'dateTime': '2024-03-04 21:44:07',
  'author': [1226097],
  'className': 'InstanceEdit',
  'schemaClass': 'InstanceEdit'},
 'isInDisease': True,
 'isInferred': False,
 'name': ['Signaling by LTK in cancer'],
 'releaseDate': '2024-03-27',
 'speciesName': 'Homo sapiens',
 'authored': [{'dbId': 9851215,
   'displayName': 'Rothfels, Karen, 2023-10-14',
  

Whoa! There is even more stuff, some of which I do not really care about, like who created the entry, but a closer look is warranted.

In [12]:
details.keys()

dict_keys(['dbId', 'displayName', 'stId', 'stIdVersion', 'created', 'modified', 'isInDisease', 'isInferred', 'name', 'releaseDate', 'speciesName', 'authored', 'disease', 'edited', 'literatureReference', 'species', 'summation', 'reviewStatus', 'hasDiagram', 'hasEHLD', 'hasEvent', 'normalPathway', 'schemaClass', 'className'])

In [13]:
details["disease"]

[{'dbId': 1247848,
  'displayName': 'non-small cell lung carcinoma',
  'databaseName': 'DOID',
  'identifier': '3908',
  'name': ['non-small cell lung carcinoma'],
  'synonym': ['Non-small cell lung cancer (disorder)', 'NSCLC', 'NSCLC'],
  'url': 'https://www.ebi.ac.uk/ols/ontologies/doid/terms?obo_id=DOID:3908',
  'className': 'Disease',
  'schemaClass': 'Disease'},
 {'dbId': 1500689,
  'displayName': 'cancer',
  'databaseName': 'DOID',
  'definition': 'A disease of cellular proliferation that is malignant and primary, characterized by uncontrolled cellular proliferation, local cell invasion and metastasis.',
  'identifier': '162',
  'name': ['cancer'],
  'synonym': ['malignant tumor', 'malignant neoplasm', 'primary cancer'],
  'url': 'https://www.ebi.ac.uk/ols/ontologies/doid/terms?obo_id=DOID:162',
  'className': 'Disease',
  'schemaClass': 'Disease'}]
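
Each disease entry pairs a `databaseName` with an `identifier`, which together form an ontology ID such as `DOID:3908` (the same form that appears in the `url` field). A minimal sketch that builds a name-to-ID lookup from the two entries above, trimmed to the fields used:

```python
# The two entries of details["disease"], trimmed to three fields.
disease_entries = [
    {"displayName": "non-small cell lung carcinoma",
     "databaseName": "DOID", "identifier": "3908"},
    {"displayName": "cancer", "databaseName": "DOID", "identifier": "162"},
]

# Join database name and identifier into a compact ontology ID.
ontology_ids = {
    entry["displayName"]: f'{entry["databaseName"]}:{entry["identifier"]}'
    for entry in disease_entries
}
print(ontology_ids)
```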

In [14]:
details["summation"]

[{'dbId': 9842650,
  'displayName': 'LTK is a member of the anaplastic lymphoma kinase (ALK)/LTK ...',
  'text': 'LTK is a member of the anaplastic lymphoma kinase (ALK)/LTK subfamily within the insulin receptor superfamily of RTKs. LTK encodes an 864-amino-acid protein consisting of extracellular, transmembrane, and tyrosine kinase domains and a short carboxy terminus. The LTK kinase domain shares 80% identity with ALK (Roll and Reuther, 2012). The biological role of LTK is not well defined under normal physiological conditions, and unlike ALK, a clear role for LTK in cancer is also not yet well established. LTK is overexpressed in leukemia, and high expression of LTK in early-stage non-small cell lung cancer (NSCLC) has been associated with greater risk of metastasis (Mueller-Tidow et al, 2005; Roll and Reuther, 2012). More recently, a novel CLIP1-LTK fusion protein has been identified in a small proportion of NSCLC cases (Izumi et al, 2021).',
  'className': 'Summation',
  'schemaCl

In [15]:
details["literatureReference"]

[{'dbId': 9842429,
  'displayName': 'ALK-activating homologous mutations in LTK induce cellular transformation',
  'title': 'ALK-activating homologous mutations in LTK induce cellular transformation',
  'author': [{'dbId': 8950297,
    'displayName': 'Roll, JD',
    'firstname': 'J Devon',
    'initial': 'JD',
    'surname': 'Roll',
    'className': 'Person',
    'schemaClass': 'Person',
    'publications': [9842429]},
   {'dbId': 2029784,
    'displayName': 'Reuther, GW',
    'firstname': 'Gary W',
    'initial': 'GW',
    'surname': 'Reuther',
    'className': 'Person',
    'schemaClass': 'Person',
    'publications': [9842429]}],
  'journal': 'PLoS One',
  'pages': 'e31733',
  'pubMedIdentifier': 22347506,
  'volume': 7,
  'year': 2012,
  'url': 'http://www.ncbi.nlm.nih.gov/pubmed/22347506',
  'className': 'LiteratureReference',
  'schemaClass': 'LiteratureReference'},
 {'dbId': 9842438,
  'displayName': 'The CLIP1-LTK fusion is an oncogenic driver\xa0in non-small-cell lung cancer',

In [17]:
# Hmm, I think I can do better
pubmedids=[item["pubMedIdentifier"] for item in details["literatureReference"] if "pubMedIdentifier" in item]

In [18]:
pubmedids

[22347506, 34819663, 15753374]
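
Note that Reactome returns PubMed IDs as integers, while the UniProt references earlier were strings. If you want to pool IDs from both sources, it helps to normalise them first. This is a sketch with a hypothetical helper, `merge_pmids`; the UniProt list is just the first three IDs from that section.

```python
reactome_pmids = [22347506, 34819663, 15753374]   # from the output above
uniprot_pmids = ["9235985", "5560404", "2991050"]  # first three UniProt references


def merge_pmids(*sources):
    """Merge ID lists into unique strings, keeping first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for pmid in source:
            pmid = str(pmid)
            if pmid not in seen:
                seen.add(pmid)
                merged.append(pmid)
    return merged


all_pmids = merge_pmids(reactome_pmids, uniprot_pmids)
print(all_pmids)
```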

There is a lot more to explore here but I don't need to know everything about cancer right now.

## RNAcentral

Last but not least, we have RNAcentral.

In [1]:
from ccm_benchmate.apis.rnacentral import RnaCentral

#you need the rnacentral id to search
rnacentral=RnaCentral()

results=rnacentral.get_information(id="URS00000CE0D1")

In [3]:
results.keys()

dict_keys(['url', 'rnacentral_id', 'md5', 'sequence', 'length', 'xrefs', 'publications', 'is_active', 'description', 'rna_type', 'count_distinct_organisms', 'distinct_databases', 'references'])

Some of these are pretty obvious, let's look at some of the more interesting ones.

In [4]:
results["xrefs"]

Unnamed: 0,upi,database,is_active,first_seen,last_seen,taxid,url,id,parent_ac,seq_version,...,ndb_external_url,mirbase_mature_products,mirbase_precursor,refseq_mirna_mature_products,refseq_mirna_precursor,refseq_splice_variants,gencode_transcript_id,gencode_ensembl_url,ensembl_url,quickgo_hits
0,URS00000CE0D1,Expression Atlas,True,2025-04-01 00:00:00,2025-04-01 00:00:00,9606,http://rnacentral.org/api/v1/accession/EXPRESS...,EXPRESSIONATLAS:ENSG00000269821,,1.0,...,,,,,,,,,,
1,URS00000CE0D1,Ensembl/GENCODE,True,2020-12-17 00:00:00,2025-03-19 00:00:00,9606,http://rnacentral.org/api/v1/accession/GENCODE...,GENCODE:ENST00000597346.1,11.GRCh38,1.0,...,,,,,,,ENST00000597346.1,http://ensembl.org/Homo_sapiens/Transcript/Sum...,,
2,URS00000CE0D1,GeneCards,True,2019-12-06 00:00:00,2025-03-11 00:00:00,9606,http://rnacentral.org/api/v1/accession/GENECAR...,GENECARDS:KCNQ1OT1:URS00000CE0D1_9606,,1.0,...,,,,,,,,,,
3,URS00000CE0D1,MalaCards,True,2019-12-06 00:00:00,2025-03-19 00:00:00,9606,http://rnacentral.org/api/v1/accession/MALACAR...,MALACARDS:KCNQ1OT1:URS00000CE0D1_9606,,1.0,...,,,,,,,,,,
4,URS00000CE0D1,LncBook,True,2022-07-18 00:00:00,2022-08-30 00:00:00,9606,http://rnacentral.org/api/v1/accession/LncBook...,LncBook:HSALNT0361046,,1.0,...,,,,,,,,,,
5,URS00000CE0D1,Ensembl,True,2017-04-25 00:00:00,2025-03-19 00:00:00,9606,http://rnacentral.org/api/v1/accession/ENST000...,ENST00000597346.1,11.GRCh38,1.0,...,,,,,,,,,,
6,URS00000CE0D1,HGNC,True,2024-02-22 00:00:00,2025-03-11 00:00:00,9606,http://rnacentral.org/api/v1/accession/HGNC:62...,HGNC:6295,,,...,,,,,,,,,,
7,URS00000CE0D1,LNCipedia,True,2018-07-27 00:00:00,2024-11-08 00:00:00,9606,http://rnacentral.org/api/v1/accession/LNCIPED...,LNCIPEDIA:KCNQ1OT1:5,CM000673,1.0,...,,,,,,,,,,
8,URS00000CE0D1,RefSeq,True,2023-10-20 00:00:00,2025-03-11 00:00:00,9606,http://rnacentral.org/api/v1/accession/NR_0027...,NR_002728.4:1..91667:ncRNA,NR_002728,4.0,...,,,,,,,,,,


In [5]:
results["description"]

'lncRNA from 1 species'

In [7]:
results["references"]

| | title | publication | pmid | doi | pub_id | expert_db |
|---|---|---|---|---|---|---|
| 0 | A maternally methylated CpG island in KvLQT1 i... | Proc Natl Acad Sci U S A 96(14):8064-8069 (1999) | 10393948 | 10.1073/pnas.96.14.8064 | 545532 | False |
| 1 | A review on the role of KCNQ1OT1 lncRNA in hum... | Pathol Res Pract 255:155188 (2024) | 38330620 | 10.1016/j.prp.2024.155188 | 1378213 | False |
| 2 | Associations between KCNQ1OT1 genetic variatio... | Environ Mol Mutagen 64(6):354-358 (2023) | 37349861 | 10.1002/em.22559 | 1281029 | False |
| 3 | Beckwith-Wiedemann Syndrome | (1993) | 20301568 | | 1295838 | False |
| 4 | Biological role of long non-coding RNA KCNQ1OT... | Biomed Pharmacother 169:115876 (2023) | 37976888 | 10.1016/j.biopha.2023.115876 | 1297892 | False |
| 5 | Epigenetics of imprinted long noncoding RNAs | Epigenetics 4(5):277-286 (2009) | 19617707 | 10.4161/epi.4.5.9242 | 783763 | False |
| 6 | Exploring the clinical and cellular mechanisms... | J Matern Fetal Neonatal Med 37(1):2337723 (2024) | 38637274 | 10.1080/14767058.2024.2337723 | 1360974 | False |
| 7 | Initial assessment of human gene diversity and... | Nature 377(6547 suppl):3-174 (1995) | 7566098 | | 887580 | False |
| 8 | Kcnq1ot1 antisense noncoding RNA mediates line... | Mol Cell 32(2):232-246 (2008) | 18951091 | 10.1016/j.molcel.2008.08.022 | 539134 | False |
| 9 | LIT1, an imprinted antisense RNA in the human ... | Hum Mol Genet 8(7):1209-1217 (1999) | 10369866 | 10.1093/hmg/8.7.1209 | 539689 | False |


Again, we can use the IDs in the xrefs to connect to other databases and use the PubMed IDs to dig a bit deeper.
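
As a sketch of that first step, the xrefs can be grouped by source database so that the right identifier is easy to pick up. The rows below are stand-ins trimmed from the xrefs output above; `ids_by_database` is a hypothetical helper, and it assumes the xrefs table can be turned into a list of row dictionaries (for example with `.to_dict("records")` if it is a pandas DataFrame).

```python
from collections import defaultdict

# A few (database, id) pairs copied from the xrefs output above.
xref_rows = [
    {"database": "Ensembl", "id": "ENST00000597346.1"},
    {"database": "HGNC", "id": "HGNC:6295"},
    {"database": "RefSeq", "id": "NR_002728.4:1..91667:ncRNA"},
]


def ids_by_database(rows):
    """Group cross-reference IDs by the database they come from."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["database"]].append(row["id"])
    return dict(grouped)


xref_ids = ids_by_database(xref_rows)
print(xref_ids["Ensembl"])  # ['ENST00000597346.1']

# The RefSeq ID carries coordinates after the accession; keep the first part.
refseq_accession = xref_ids["RefSeq"][0].split(":")[0]
print(refseq_accession)  # NR_002728.4
```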