## Semantic search with OpenAI `text-embedding-ada-002` embeddings

Here's an example of using the underlying `ssed` (semantic search with ENCODE data) Python library to calculate embeddings for a custom set of JSON documents and to find the _k_ most similar documents for a custom query or an existing document.

In [1]:
import os

import requests

import openai as openai_client

from ssed.remote.openai import OpenAI
from ssed.remote.openai import OpenAIProps

from ssed.embeddings import Embeddings
from ssed.embeddings import EmbeddingsProps

Explicitly set your OpenAI API key:

In [2]:
openai_client.api_key = os.environ['OPENAI_API_KEY'] # Assumes OPENAI_API_KEY is defined in environment

Get some ENCODE Project Award documents (with select fields):

In [3]:
url = (
    'https://www.encodeproject.org'
    '/search/?type=Award&frame=object&format=json&status=current&limit=all'
    '&field=pi.title&field=description&field=title&field=rfa'
)
response = requests.get(url)
documents = response.json()['@graph']
documents[0]

{'@id': '/awards/U01HG009380/',
 '@type': ['Award', 'Item'],
 'description': 'While much progress has been made generating high quality chromatin state and accessibility data from the ENCODE and Roadmap consortia, accurately identifying cell-type specific enhancers from these data remains a significant challenge. We have recently developed a computational approach (gkmSVM) to predict regulatory elements from DNA sequence, and we have shown that when gkmSVM is trained on DHS data from each of the human and mouse ENCODE and Roadmap cells and tissues, it can predict both cell specific enhancer activity and the impact of regulatory variants (deltaSVM) with greater precision than alternative approaches. The gkmSVM model encapsulates a set of cell-type specific weights describing the regulatory binding site vocabulary controlling chromatin accessibility in each cell type. A striking observation is that the significant gkmSVM weights are generally identifiable with a small (~20) set of TF bin

In [4]:
[doc['title'] for doc in documents]

['SYSTEMATIC IDENTIFICATION OF CORE REGULATORY CIRCUITRY FROM ENCODE DATA',
 'NUCLEOME POSITIONING SYSTEM FOR SPATIOTEMPORAL GENOME ORGANIZATION AND REGULATION',
 'WHOLE GENOME CHROMATIN INTERACTION ANALYSIS USING PAIR-END-DITAGGING (CIA-PET)',
 'ENCODING GENOMIC ARCHITECTURE IN THE ENCYCLOPEDIA: LINKING DNA ELEMENTS, CHROMATIN STATE, AND GENE EXPRESSION IN 3D',
 'A COMPREHENSIVE FUNCTIONAL MAP OF HUMAN PROTEIN-RNA INTERACTIONS',
 'PRODUCTION CENTER FOR MAPPING REGULATORY REGIONS OF THE HUMAN GENOME',
 'AN ENCODE CHIP-SEQ PIPELINE USING ENDOGENOUSLY TAGGED HUMAN DNA-ASSOCIATED PROTEINS',
 'CONNECTING TRANSPOSABLE ELEMENTS AND REGULATORY INNOVATION USING ENCODE DATA',
 'THE SAN DIEGO EPIGENOME CENTER',
 'X CHROMOSOME REGULATION AND ROLE IN ANEUPLOIDY',
 'ANALYSIS OF FUNCTIONAL GENETIC VARIANTS IN RNA PROCESSING AND EXPRESSION',
 'GENOME WIDE MAPPING OF LOOPS USING IN SITU HI-C',
 'EPIGENOMICS DATA ANALYSIS AND COORDINATION CENTER AT BAYLOR COLLEGE OF MEDICINE',
 'DECODING THE REGULATORY

In [5]:
print(len(documents))

65


Create an `Embeddings` object from the documents:

In [6]:
embeddings = Embeddings.from_documents(
    props=EmbeddingsProps(
        openai=OpenAI()
    ),
    documents=documents,
    id_key='@id'
)

Calculate the embeddings (this is rate limited by OpenAI's API and could take a long time if you have a lot of documents):

In [7]:
values = await embeddings.get_values()
values

array([[-0.0203915 ,  0.00433584, -0.0037502 , ..., -0.01621442,
        -0.00690418,  0.0022226 ],
       [-0.02199213,  0.01176144, -0.0105888 , ..., -0.02159891,
        -0.01867786, -0.0273427 ],
       [-0.0376124 ,  0.00952829, -0.00936832, ..., -0.02256188,
        -0.00536575, -0.02071186],
       ...,
       [ 0.00107457, -0.00064013, -0.0274494 , ..., -0.02273974,
         0.00525227, -0.0354076 ],
       [-0.01652613,  0.01145403, -0.02446871, ..., -0.01226222,
        -0.01578761, -0.02778508],
       [-0.02497705,  0.00330579, -0.01010448, ..., -0.01341026,
        -0.00121888, -0.02165047]])

Every row is a document and every embedding has 1536 dimensions:

In [8]:
values.shape

(65, 1536)
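Because the embedding step above is rate limited, a common way to cope with rejected calls is to retry them with exponential backoff. The following is a minimal sketch of that general pattern, not of anything `ssed` is known to do internally: `with_backoff` and `FlakyCall` are hypothetical names introduced here, and the flaky call only simulates a rate-limit error.

```python
import asyncio
import random


async def with_backoff(make_call, max_attempts=5, base_delay=1.0, retry_on=(Exception,)):
    """Await make_call(), retrying with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay * 2**attempt seconds, plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, 0.1 * delay))


class FlakyCall:
    """Hypothetical stand-in for an API call that fails a few times first."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    async def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError('simulated rate limit')
        return 'ok'


flaky = FlakyCall(failures=2)
print(asyncio.run(with_backoff(flaky, base_delay=0.01)))
```

Inside a notebook, where an event loop is already running, you would write `await with_backoff(...)` instead of wrapping the call in `asyncio.run`.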

Making a query to the `Embeddings` object returns a `Results` object. The raw results of a query show the document index and similarity score.

In [9]:
crispr_results = await embeddings.get_k_results_most_similar_to_query(
    query='crispr',
    k=3
)
crispr_results.raw

[(3, 0.8344338672426572), (24, 0.8264028965176172), (57, 0.8218114454369716)]

Under the hood, we embed the query in the same high-dimensional space as the documents and calculate the cosine similarity between the query embedding and every document embedding.
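The ranking step can be sketched with plain NumPy. This is an illustration under stated assumptions rather than the `ssed` implementation: `top_k_cosine` is a hypothetical helper, and the toy 3-dimensional vectors stand in for the 1536-dimensional embeddings. It returns `(row_index, score)` pairs in the same shape as the raw results shown above.

```python
import numpy as np


def top_k_cosine(document_vectors, query_vector, k=3):
    """Return [(row_index, cosine_similarity), ...] for the k closest rows."""
    docs = np.asarray(document_vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # Sort descending and keep the k best row indices.
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]


# Toy vectors standing in for real document and query embeddings.
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
toy_query = np.array([1.0, 0.1, 0.0])
print(top_k_cosine(toy_docs, toy_query, k=2))
```

With the real data, `document_vectors` would be the `values` array and `query_vector` the embedding of the query string.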

Iterating through the results yields `(document, score)` tuples. We can define a helper function to print out the relevant fields from a document and its score:

In [10]:
def print_results(results, description_max=1500):
    for document, score in results:
        print('\nID:', document['@id'])
        print('Title:', document['title'])
        print('Lab:', document.get('pi', {}).get('title', ''))
        print('Description:', document.get('description', '')[:description_max] + '...')
        print('Similarity score:', score)

In [11]:
print_results(crispr_results, description_max=2000)


ID: /awards/U01HG009395/
Title: ENCODING GENOMIC ARCHITECTURE IN THE ENCYCLOPEDIA: LINKING DNA ELEMENTS, CHROMATIN STATE, AND GENE EXPRESSION IN 3D
Lab: Christina Leslie
Description: Most of the 1000s of sequencing experiments generated by ENCODE provide 1D readouts of the epigenetic landscape or transcriptional output of a 3D genome. New chromosome conformation capture (3C) technologies – in particular Hi-C and ChIA-PET – have begun to provide insight into the hierarchical 3D organization of the genome: the partition of chromosomes into open and closed compartments; the existence of structural subunits defined as topologically associated domains (TADs); and the presence of regulatory and structural DNA loops within TADs. New experimental evidence using CRISPR/Cas-mediated genome editing suggests that disruption of local 3D structure can alter regulation of neighboring genes, and there have been early efforts to use data on 3D DNA looping to predict the impact of non-coding SNPs from 

Instead of an arbitrary query we can also use a document's embedding to find other similar documents:

In [12]:
similar_awards = await embeddings.get_k_results_most_similar_to_id(
    '/awards/UM1HG009375/',
    k=3
)
print(similar_awards)

/awards/UM1HG009375/: 0.9999999923035059
/awards/U01HG009395/: 0.9244261232904212
/awards/1U54DK107980/: 0.9152081446814886


As expected, the document is most similar to itself (cosine similarity of approximately 1).

All of the documents seem to be related to the structure of the genome in three-dimensional space:

In [13]:
print_results(similar_awards)


ID: /awards/UM1HG009375/
Title: GENOME WIDE MAPPING OF LOOPS USING IN SITU HI-C
Lab: Erez Aiden
Description: The roughly two meters of DNA in the human genome is intricately packaged to form the chromatin and chromosomes in each cell nucleus. In addition to its structural role, this organization has critical regulatory functions. In particular, the formation of loops in the human genome plays an essential role in regulating genes. We recently demonstrated the ability to create reliable maps of these loops, using an in situ Hi-C method for three-dimensional genome sequencing. Hi-C characterizes the three dimensional configuration of the genome by determining the frequency of physical contact between all pairs of loci, genome-wide. The proposed center will apply Hi-C and other new technologies to characterize genomic loops, their regulation, and their functions. We will specifically examine these structures in a wide variety of ENCODE cell types. The principles deduced from our study wi
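A lookup by ID like the one above can be reproduced directly on an embedding matrix by using one of its rows as the query. The sketch below is a hypothetical helper (`most_similar_rows` is not part of `ssed`) run on toy 2-dimensional vectors; it also shows how the self-match could be dropped if only the other documents are of interest.

```python
import numpy as np


def most_similar_rows(vectors, row, k=3, include_self=True):
    """Rank all rows by cosine similarity to the given row."""
    matrix = np.asarray(vectors, dtype=float)
    # Normalize every row to unit length so a dot product is a cosine.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[row]
    order = np.argsort(scores)[::-1]
    if not include_self:
        # Remove the query row, which otherwise ranks first with score ~1.
        order = order[order != row]
    return [(int(i), float(scores[i])) for i in order[:k]]


toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
])
print(most_similar_rows(toy_vectors, row=0, k=2))
print(most_similar_rows(toy_vectors, row=0, k=2, include_self=False))
```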

Here are more examples of arbitrary queries:

In [14]:
data_coordination_results = await embeddings.get_k_results_most_similar_to_query(
    query='data coordination center',
    k=3
)

In [15]:
print_results(data_coordination_results)


ID: /awards/U24HG009397/
Title: A DATA COORDINATING CENTER FOR ENCODE
Lab: J. Michael Cherry
Description: The goals of the ENCODE Data Coordinating Center (DCC) is to support the ENCODE Consortium by defining and establishing a strategy that connects all participants to the data and by creating avenues of access that distribute these data to the greater biological research community. The ENCODE Consortium brings together laboratories that generate complex data types via experimental assays with laboratories that integrate these unique data using computational analyses to discover how chromosomal elements function together to define human cells and tissues. The DCC's participation enhances the data created by these laboratories through the creation of structured procedures for the verification and validation of all submitted data and providing processes for the documentation of metadata that describe each biological sample and assay method. To facilitate access to all the data created 

In [16]:
print_results(
    await embeddings.get_k_results_most_similar_to_query(
        query='snyder',
        k=2
    )
)


ID: /awards/U01HG007919/
Title: GENOMICS OF GENE REGULATION IN PROGENITOR TO DIFFERENTIATED KERATINOCYTES
Lab: Michael Snyder
Description: The modeling of transcription to genome proximal elements to date has revealed associations, but seldom are disruptions performed to confirm mechanistic possibilities and substantiate causality. In studying multistate cell systems, many processes important to human health are difficult to study due to low cell availability and/or dyssynchrony leading to heterogeneous cell populations. We propose to study a model of the human epidermal differentiation system, which by its intrinsic properties does not have these problems but at the same time closely simulates the native process. We plan to perform multiple next generation sequencing modalities of transcription and gene proximal components over a time course spanning the transition from progenitor to differentiated keratinocytes. A network based on boosting methods and dynamic Bayesian networks will 

In [17]:
print_results(
    await embeddings.get_k_results_most_similar_to_query(
        query='how cell phenotype is affected by gene mutations',
        k=5
    )
)


ID: /awards/U01HG007910/
Title: RULES OF GENE EXPRESSION MODELED ON HUMAN DENDRITIC CELL RESPONSE TO PATHOGENS
Lab: Jeremy Luban
Description: The developmental shifts that occur when cells respond to environmental stimuli are controlled in large part by gene expression programs involving thousands of genes. Transcription factors (TFs), chromatin modifying enzymes, and cis-acting DNA elements contribute to the networks that underlie such programs. The code that links these variables in such a way that the expression of a given gene can be predicted based on the presence of specific components has yet to be deciphered. A model for such a code will be constructed here based on genome-wide analysis of human dendritic cells (DCs) as they mature in response to pathogens. DCs are antigen-presenting cells that initiate and determine the quality and magnitude of the host immune response. Recent technical advances in stem cell biology, reverse-genetic tools for primary human cells, and genome-w

In [18]:
print_results(
    await embeddings.get_k_results_most_similar_to_query(
        query='how transcription factors regulate gene expression',
        k=3
    )
)


ID: /awards/U01HG007910/
Title: RULES OF GENE EXPRESSION MODELED ON HUMAN DENDRITIC CELL RESPONSE TO PATHOGENS
Lab: Jeremy Luban
Description: The developmental shifts that occur when cells respond to environmental stimuli are controlled in large part by gene expression programs involving thousands of genes. Transcription factors (TFs), chromatin modifying enzymes, and cis-acting DNA elements contribute to the networks that underlie such programs. The code that links these variables in such a way that the expression of a given gene can be predicted based on the presence of specific components has yet to be deciphered. A model for such a code will be constructed here based on genome-wide analysis of human dendritic cells (DCs) as they mature in response to pathogens. DCs are antigen-presenting cells that initiate and determine the quality and magnitude of the host immune response. Recent technical advances in stem cell biology, reverse-genetic tools for primary human cells, and genome-w

You can save your document embeddings for later use so that you don't have to recalculate them with OpenAI's API:

In [19]:
embeddings.save('./data/awards')

And load them again to make queries:

In [20]:
loaded_embeddings = Embeddings.load(
    props=EmbeddingsProps(
        openai=OpenAI()
    ),
    path='./data/awards'
)
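The on-disk format that `save` and `load` use is not described here. As a rough illustration of the idea only, the sketch below persists a matrix of vectors and their IDs with NumPy and JSON. The function names, the file names `values.npy` and `ids.json`, and the placeholder IDs are all assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_vectors(directory, ids, vectors):
    """Write the IDs as JSON and the vectors as a .npy file."""
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    np.save(path / 'values.npy', np.asarray(vectors))
    (path / 'ids.json').write_text(json.dumps(list(ids)))


def load_vectors(directory):
    """Read back the IDs and vectors written by save_vectors."""
    path = Path(directory)
    ids = json.loads((path / 'ids.json').read_text())
    vectors = np.load(path / 'values.npy')
    return ids, vectors


# Round trip through a temporary directory with placeholder IDs.
with tempfile.TemporaryDirectory() as tmp:
    save_vectors(tmp, ['/awards/a/', '/awards/b/'], np.eye(2))
    ids, vectors = load_vectors(tmp)
    print(ids, vectors.shape)
```

Keeping the IDs next to the vectors matters because row order is the only link between an embedding and its document.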

In [21]:
print_results(
    await loaded_embeddings.get_k_results_most_similar_to_query(
        'machine learning',
        k=5,
    ),
    description_max=5000,
)


ID: /awards/NSERC06150/
Title: Unsupervised machine learning methods that discover the molecular programs underlying cellular biology
Lab: Max Libbrecht
Description: ...
Similarity score: 0.7874600180306519

ID: /awards/U01HG009380/
Title: SYSTEMATIC IDENTIFICATION OF CORE REGULATORY CIRCUITRY FROM ENCODE DATA
Lab: Michael Beer
Description: While much progress has been made generating high quality chromatin state and accessibility data from the ENCODE and Roadmap consortia, accurately identifying cell-type specific enhancers from these data remains a significant challenge. We have recently developed a computational approach (gkmSVM) to predict regulatory elements from DNA sequence, and we have shown that when gkmSVM is trained on DHS data from each of the human and mouse ENCODE and Roadmap cells and tissues, it can predict both cell specific enhancer activity and the impact of regulatory variants (deltaSVM) with greater precision than alternative approaches. The gkmSVM model encapsula

Given some results, we can also determine whether they are relevant to the query by asking `gpt-3.5-turbo`.

In [22]:
from ssed.expert import SearchRelevancyExpert

In [23]:
results = await embeddings.get_k_results_most_similar_to_query(
    query='dnase-seq',
    k=3
)

In [24]:
print_results(results)


ID: /awards/UM1HG009444/
Title: ENCODE MAPPING CENTER-A COMPREHENSIVE CATALOG OF DNASE I HYPERSENSITIVE SITES
Lab: John Stamatoyannopoulos
Description: The overall mission of this Mapping Center is to create and disseminate open access, comprehensive, high-quality, high-resolution reference maps of DNase I hypersensitive sites (DHSs) in the human and mouse genomes, at previously unattainable levels of cellular and anatomical definition. Regulatory DNA is actuated in an exceptionally state-specific manner;; accordingly, achieving a comprehensive map of DHSs necessitates the interrogation of an expansive and finely partitioned range of cell and tissue samples. Progressive technical improvements and recent innovations have resulted in dramatic (>100x) decreases in requisite input biological sample quantities coupled with equally dramatic (>100x) increases in assay throughput, and corresponding decreases in the incremental cost of generating reference-quality DHS maps. These advances have

In [26]:
expert = SearchRelevancyExpert.from_results(results)

In [27]:
await expert.evaluate()

'The results match the query as they all relate to the study of DNase-seq, which is a technique used to identify DNase I hypersensitive sites (DHSs) in the genome. The results include descriptions of projects that aim to create comprehensive maps of DHSs in human and mouse genomes, as well as proposals to use DNase-seq to identify novel candidate regulatory elements and functional DNA elements in the human genome.'
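How `SearchRelevancyExpert` phrases its request to the model is not shown here. One plausible way to assemble such a request is to put the query and the top results into a single prompt, as in the sketch below. The function `build_relevancy_prompt`, its wording, and the sample document are hypothetical and are not taken from `ssed`.

```python
def build_relevancy_prompt(query, documents, description_max=500):
    """Build a plain-text prompt asking whether results match a query."""
    lines = [
        'Decide whether the search results below match the query.',
        f'Query: {query}',
        'Results:',
    ]
    for position, document in enumerate(documents, start=1):
        title = document.get('title', '')
        # Truncate long descriptions to keep the prompt short.
        description = document.get('description', '')[:description_max]
        lines.append(f'{position}. {title}: {description}')
    lines.append('Answer in one or two sentences and explain why.')
    return '\n'.join(lines)


# Hypothetical document with the same fields as the awards above.
sample_documents = [
    {'title': 'EXAMPLE AWARD', 'description': 'Maps DNase I hypersensitive sites.'},
]
prompt = build_relevancy_prompt('dnase-seq', sample_documents)
print(prompt)
```

The resulting string would then be sent as the user message of a chat completion request.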

In [28]:
results = await embeddings.get_k_results_most_similar_to_query(
    query='jupiter mars venus moon',
    k=5
)

In [29]:
print_results(results)


ID: /awards/ROADMAP/
Title: ROADMAP
Lab: Aleksandar Milosavljevic
Description: ...
Similarity score: 0.7407589433313034

ID: /awards/R21HG011280-01/
Title: NIH Exploratory/Developmental Research Grant Award, Conesa Lab
Lab: 
Description: ...
Similarity score: 0.7314526039532072

ID: /awards/ucsf-award-from-yin-shen/
Title: UCSF award from Yin Shen
Lab: Yin Shen
Description: Funded by UCSF from Yin Shen...
Similarity score: 0.7310959907281356

ID: /awards/ENCODE/
Title: ENCODE
Lab: 
Description: ...
Similarity score: 0.7222473131937561

ID: /awards/Pew-00032016/
Title: Pew Biomedical Scholars, Award Year 2018
Lab: 
Description: ...
Similarity score: 0.7220591440971962


In [30]:
await SearchRelevancyExpert.from_results(results).evaluate()

'The results do not match the query as they are all related to awards and do not contain information about Jupiter, Mars, Venus, or the Moon.'