## Semantic search with OpenAI `text-embedding-ada-002` embeddings

This example uses the underlying `ssed` (semantic search with ENCODE data) Python library to calculate embeddings for a custom set of JSON documents, and to find the _k_ documents most similar to a custom query or to an existing document.

In [1]:
import requests

import openai as openai_client # Assumes OPENAI_API_KEY is defined in environment

from ssed.remote.openai import OpenAI
from ssed.remote.openai import OpenAIProps

from ssed.embeddings import Embeddings
from ssed.embeddings import EmbeddingsProps

Get some ENCODE Project Award documents (with select fields):

In [2]:
url = (
    'https://www.encodeproject.org'
    '/search/?type=Award&frame=object&format=json&status=current&limit=all'
    '&field=pi.title&field=description&field=title&field=rfa'
)
response = requests.get(url)
documents = response.json()['@graph']
documents[0]

{'@id': '/awards/1U54DK107967/',
 '@type': ['Award', 'Item'],
 'description': 'PROJECT SUMMARY / ABSTRACT This proposal seeks to fulfill a community need for a comprehensive, high-resolution genome-mapping platform that will enable investigation of the structural, functional and spatiotemporal organization of the human genome. Our ultimate goal is to deliver complex chromatin interaction network maps in the context of 3D genome structures from which the dynamics of individual genomic elements can be monitored and referenced. Here, we propose to develop a Nucleome Positioning System (NPS)-comprised of 1) a robust genome- wide mapping technology platform, 2) advanced computational modeling algorithms and 3) state-of-the-art nuclear imaging methods-that will allow users community-wide to uncover the regulatory functions of 3D genome organization in human cells. NPS will be based upon the established ChIA-PET method (1,2), enhanced by process optimizations-i.e., microfluidic-based miniatur

In [3]:
[doc['title'] for doc in documents]

['NUCLEOME POSITIONING SYSTEM FOR SPATIOTEMPORAL GENOME ORGANIZATION AND REGULATION',
 'ENCODING GENOMIC ARCHITECTURE IN THE ENCYCLOPEDIA: LINKING DNA ELEMENTS, CHROMATIN STATE, AND GENE EXPRESSION IN 3D',
 'COMPREHENSIVE FUNCTIONAL CHARACTERIZATION AND DISSECTION OF NONCODING REGULATORY ELEMENTS AND HUMAN GENETIC VARIATION',
 'A COMPREHENSIVE FUNCTIONAL MAP OF HUMAN PROTEIN-RNA INTERACTIONS',
 'A DATA COORDINATING CENTER FOR ENCODE',
 'CENTER FOR 3D STRUCTURE AND PHYSICS OF THE GENOME',
 'HIGHER PRECISION HUMAN AND MOUSE TRANSCRIPTOMES',
 'CONNECTING TRANSPOSABLE ELEMENTS AND REGULATORY INNOVATION USING ENCODE DATA',
 'GENOME WIDE MAPPING OF LOOPS USING IN SITU HI-C',
 'AN ENCODE CHIP-SEQ PIPELINE USING ENDOGENOUSLY TAGGED HUMAN DNA-ASSOCIATED PROTEINS',
 'IDENTIFICATION OF FUNCTIONAL DNA ELEMENTS BY HSQPCR',
 'MAPPING SITES OF TRANSCRIPTION AND REGULATION',
 'ANALYSIS OF FUNCTIONAL GENETIC VARIANTS IN RNA PROCESSING AND EXPRESSION',
 'ENCODE MAPPING CENTER-A COMPREHENSIVE CATALOG OF 

In [4]:
print(len(documents))

64


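Create an `Embeddings` object from the documents. Note that `openai_client.Embedding` is the interface provided by versions of the `openai` package before 1.0; it was removed in 1.0: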

In [5]:
openai = OpenAI(
    props=OpenAIProps(
        embedding_client=openai_client.Embedding
    )
)

embeddings = Embeddings.from_documents(
    props=EmbeddingsProps(
        openai=openai
    ),
    documents=documents,
    id_key='@id'
)

Calculate the embeddings (this is rate limited by OpenAI's API and could take a long time if you have a lot of documents):

In [6]:
embeddings.get_values()

array([[-0.02199213,  0.01176144, -0.0105888 , ..., -0.02159891,
        -0.01867786, -0.0273427 ],
       [-0.01924773,  0.02382784, -0.00755157, ..., -0.01639569,
        -0.01431637, -0.02554187],
       [-0.01985164,  0.0066079 , -0.0167782 , ..., -0.02340006,
        -0.01160923, -0.01816124],
       ...,
       [-0.02352513,  0.00578107, -0.00970999, ..., -0.01259189,
        -0.00209922, -0.01774752],
       [ 0.00107457, -0.00064013, -0.0274494 , ..., -0.02273974,
         0.00525227, -0.0354076 ],
       [-0.01652613,  0.01145403, -0.02446871, ..., -0.01226222,
        -0.01578761, -0.02778508]])
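Whether `ssed` retries rate-limited requests internally is not shown here. If you call a rate-limited API yourself, one common pattern is to retry with exponential backoff. The sketch below is a generic illustration, not part of `ssed` or the OpenAI client: `call_with_backoff` and `flaky_embedding_request` are hypothetical names, and the stand-in request function simply fails twice before returning a made-up vector.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Real code should catch only the client's rate-limit error.
            # Re-raise after the final attempt; otherwise back off and retry.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for an embedding request that is rate limited twice.
calls = {'count': 0}


def flaky_embedding_request():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('rate limited')
    return [0.1, 0.2, 0.3]


# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
vector = call_with_backoff(flaky_embedding_request, sleep=delays.append)
print(vector, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.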

Every row is a document and every embedding has 1536 dimensions:

In [7]:
embeddings.get_values().shape

(64, 1536)

Making a query to the `Embeddings` object returns a `Result` object. The raw results of a query show the document index and similarity score.

In [8]:
crispr_results = embeddings.get_k_results_most_similar_to_query(
    query='crispr',
    k=3
)
crispr_results.raw

[(1, 0.8344355876893013), (2, 0.8264028965176172), (56, 0.8218114454369716)]

Under the hood, the query is embedded in the same high-dimensional space as the documents, and the cosine similarity between the query embedding and every document embedding is calculated.
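As a rough illustration of that ranking step, here is a minimal NumPy sketch. It is not the `ssed` implementation: `top_k_cosine` is a hypothetical helper, and the three-dimensional toy vectors stand in for real 1536-dimensional embeddings.

```python
import numpy as np


def top_k_cosine(document_embeddings, query_embedding, k=3):
    """Return [(row_index, cosine_similarity), ...] for the k most similar rows."""
    docs = np.asarray(document_embeddings, dtype=float)
    query = np.asarray(query_embedding, dtype=float)
    # Scale every vector to unit length so a dot product is the cosine similarity.
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = docs_unit @ query_unit
    # Sort by descending score and keep the first k row indices.
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# Toy data: one row per "document".
toy_documents = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
toy_query = np.array([1.0, 0.1, 0.0])

toy_results = top_k_cosine(toy_documents, toy_query, k=2)
print(toy_results)
```

The returned list has the same `(index, score)` shape as the raw results shown above, with the best match first.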

Iterating through the results yields a `(document, score)` tuple. We can define a helper function to print out the relevant fields from a document and its score:

In [9]:
def print_results(results, description_max=1500):
    for document, score in results:
        print('\nID:', document['@id'])
        print('Title:', document['title'])
        print('Lab:', document.get('pi', {}).get('title', ''))
        print('Description:', document.get('description', '')[:description_max] + '...')
        print('Similarity score:', score)

In [10]:
print_results(crispr_results, description_max=2000)


ID: /awards/U01HG009395/
Title: ENCODING GENOMIC ARCHITECTURE IN THE ENCYCLOPEDIA: LINKING DNA ELEMENTS, CHROMATIN STATE, AND GENE EXPRESSION IN 3D
Lab: Christina Leslie
Description: Most of the 1000s of sequencing experiments generated by ENCODE provide 1D readouts of the epigenetic landscape or transcriptional output of a 3D genome. New chromosome conformation capture (3C) technologies – in particular Hi-C and ChIA-PET – have begun to provide insight into the hierarchical 3D organization of the genome: the partition of chromosomes into open and closed compartments; the existence of structural subunits defined as topologically associated domains (TADs); and the presence of regulatory and structural DNA loops within TADs. New experimental evidence using CRISPR/Cas-mediated genome editing suggests that disruption of local 3D structure can alter regulation of neighboring genes, and there have been early efforts to use data on 3D DNA looping to predict the impact of non-coding SNPs from 

Instead of an arbitrary query, we can also use a document's embedding to find other similar documents:

In [11]:
similar_awards = embeddings.get_k_results_most_similar_to_id(
    '/awards/UM1HG009375/',
    k=3
)
print(similar_awards)

/awards/UM1HG009375/: 1.0000000050508615
/awards/U01HG009395/: 0.9244691689010602
/awards/1U54DK107980/: 0.9151913345001851


As expected, the document is most similar to itself (cosine similarity of 1, up to numerical error).
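The self-match can be checked, and dropped if it is not wanted, with a few lines of plain Python and NumPy. This is a sketch under stated assumptions: the random vector is a stand-in for a real embedding, and the `(id, score)` pairs are copied from the output above rather than read from the `Result` object, whose attributes are not shown here.

```python
import numpy as np

# A random stand-in for a 1536-dimensional document embedding.
rng = np.random.default_rng(0)
vector = rng.normal(size=1536)

# Cosine similarity of a vector with itself is 1, up to numerical error.
self_similarity = vector @ vector / (np.linalg.norm(vector) ** 2)
print(np.isclose(self_similarity, 1.0))

# (id, score) pairs copied from the printed output above.
query_id = '/awards/UM1HG009375/'
pairs = [
    ('/awards/UM1HG009375/', 1.0000000050508615),
    ('/awards/U01HG009395/', 0.9244691689010602),
    ('/awards/1U54DK107980/', 0.9151913345001851),
]

# Keep only the other documents by filtering out the query document itself.
others = [(doc_id, score) for doc_id, score in pairs if doc_id != query_id]
print(others)
```

When the self-match is excluded like this, asking for `k + 1` results still leaves `k` neighbours.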

All of the documents seem to be related to the structure of the genome in three-dimensional space:

In [12]:
print_results(similar_awards)


ID: /awards/UM1HG009375/
Title: GENOME WIDE MAPPING OF LOOPS USING IN SITU HI-C
Lab: Erez Aiden
Description: The roughly two meters of DNA in the human genome is intricately packaged to form the chromatin and chromosomes in each cell nucleus. In addition to its structural role, this organization has critical regulatory functions. In particular, the formation of loops in the human genome plays an essential role in regulating genes. We recently demonstrated the ability to create reliable maps of these loops, using an in situ Hi-C method for three-dimensional genome sequencing. Hi-C characterizes the three dimensional configuration of the genome by determining the frequency of physical contact between all pairs of loci, genome-wide. The proposed center will apply Hi-C and other new technologies to characterize genomic loops, their regulation, and their functions. We will specifically examine these structures in a wide variety of ENCODE cell types. The principles deduced from our study wi

Here are more examples of arbitrary queries:

In [13]:
data_coordination_results = embeddings.get_k_results_most_similar_to_query(
    query='data coordination center',
    k=3
)

In [14]:
print_results(data_coordination_results)


ID: /awards/U24HG009397/
Title: A DATA COORDINATING CENTER FOR ENCODE
Lab: J. Michael Cherry
Description: The goals of the ENCODE Data Coordinating Center (DCC) is to support the ENCODE Consortium by defining and establishing a strategy that connects all participants to the data and by creating avenues of access that distribute these data to the greater biological research community. The ENCODE Consortium brings together laboratories that generate complex data types via experimental assays with laboratories that integrate these unique data using computational analyses to discover how chromosomal elements function together to define human cells and tissues. The DCC's participation enhances the data created by these laboratories through the creation of structured procedures for the verification and validation of all submitted data and providing processes for the documentation of metadata that describe each biological sample and assay method. To facilitate access to all the data created 

In [15]:
print_results(
    embeddings.get_k_results_most_similar_to_query(
        query='snyder',
        k=2
    )
)


ID: /awards/U01HG007919/
Title: GENOMICS OF GENE REGULATION IN PROGENITOR TO DIFFERENTIATED KERATINOCYTES
Lab: Michael Snyder
Description: The modeling of transcription to genome proximal elements to date has revealed associations, but seldom are disruptions performed to confirm mechanistic possibilities and substantiate causality. In studying multistate cell systems, many processes important to human health are difficult to study due to low cell availability and/or dyssynchrony leading to heterogeneous cell populations. We propose to study a model of the human epidermal differentiation system, which by its intrinsic properties does not have these problems but at the same time closely simulates the native process. We plan to perform multiple next generation sequencing modalities of transcription and gene proximal components over a time course spanning the transition from progenitor to differentiated keratinocytes. A network based on boosting methods and dynamic Bayesian networks will 

In [16]:
print_results(
    embeddings.get_k_results_most_similar_to_query(
        query='how cell phenotype is affected by gene mutations',
        k=5
    )
)


ID: /awards/U01HG007910/
Title: RULES OF GENE EXPRESSION MODELED ON HUMAN DENDRITIC CELL RESPONSE TO PATHOGENS
Lab: Jeremy Luban
Description: The developmental shifts that occur when cells respond to environmental stimuli are controlled in large part by gene expression programs involving thousands of genes. Transcription factors (TFs), chromatin modifying enzymes, and cis-acting DNA elements contribute to the networks that underlie such programs. The code that links these variables in such a way that the expression of a given gene can be predicted based on the presence of specific components has yet to be deciphered. A model for such a code will be constructed here based on genome-wide analysis of human dendritic cells (DCs) as they mature in response to pathogens. DCs are antigen-presenting cells that initiate and determine the quality and magnitude of the host immune response. Recent technical advances in stem cell biology, reverse-genetic tools for primary human cells, and genome-w

In [21]:
print_results(
    embeddings.get_k_results_most_similar_to_query(
        query='how transcription factors regulate gene expression',
        k=3
    )
)


ID: /awards/U01HG007910/
Title: RULES OF GENE EXPRESSION MODELED ON HUMAN DENDRITIC CELL RESPONSE TO PATHOGENS
Lab: Jeremy Luban
Description: The developmental shifts that occur when cells respond to environmental stimuli are controlled in large part by gene expression programs involving thousands of genes. Transcription factors (TFs), chromatin modifying enzymes, and cis-acting DNA elements contribute to the networks that underlie such programs. The code that links these variables in such a way that the expression of a given gene can be predicted based on the presence of specific components has yet to be deciphered. A model for such a code will be constructed here based on genome-wide analysis of human dendritic cells (DCs) as they mature in response to pathogens. DCs are antigen-presenting cells that initiate and determine the quality and magnitude of the host immune response. Recent technical advances in stem cell biology, reverse-genetic tools for primary human cells, and genome-w

You can save your document embeddings for later use, which avoids recalculating them with OpenAI's API:

In [18]:
embeddings.save('./data/awards')

And load them again to make queries:

In [19]:
loaded_embeddings = Embeddings.load(
    props=EmbeddingsProps(
        openai=openai
    ),
    path='./data/awards'
)
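The on-disk format that `ssed` uses is not shown here. As a generic illustration of the same idea, the sketch below stores an embedding matrix with `numpy.save` and the matching document IDs as JSON, then reads both back. The helper names, the file names `values.npy` and `ids.json`, and the toy IDs and values are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_embeddings(directory, ids, values):
    """Write the embedding matrix and its row IDs to a directory."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    np.save(directory / 'values.npy', np.asarray(values))
    (directory / 'ids.json').write_text(json.dumps(list(ids)))


def load_embeddings(directory):
    """Read back the (ids, values) pair written by save_embeddings."""
    directory = Path(directory)
    values = np.load(directory / 'values.npy')
    ids = json.loads((directory / 'ids.json').read_text())
    return ids, values


# Round trip with made-up IDs and two-dimensional toy vectors.
with tempfile.TemporaryDirectory() as tmp:
    toy_ids = ['/awards/a/', '/awards/b/']
    toy_values = np.array([[0.1, 0.2], [0.3, 0.4]])
    save_embeddings(Path(tmp) / 'awards', toy_ids, toy_values)
    loaded_ids, loaded_values = load_embeddings(Path(tmp) / 'awards')

print(loaded_ids == toy_ids, np.array_equal(loaded_values, toy_values))
```

Keeping the IDs next to the matrix preserves the row-to-document mapping, which is what allows a lookup by ID after loading.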

In [20]:
print_results(
    loaded_embeddings.get_k_results_most_similar_to_query(
        'machine learning',
        k=5,
    ),
    description_max=5000,
)


ID: /awards/NSERC06150/
Title: Unsupervised machine learning methods that discover the molecular programs underlying cellular biology
Lab: Max Libbrecht
Description: ...
Similarity score: 0.7874355634695548

ID: /awards/U01HG009380/
Title: SYSTEMATIC IDENTIFICATION OF CORE REGULATORY CIRCUITRY FROM ENCODE DATA
Lab: Michael Beer
Description: While much progress has been made generating high quality chromatin state and accessibility data from the ENCODE and Roadmap consortia, accurately identifying cell-type specific enhancers from these data remains a significant challenge. We have recently developed a computational approach (gkmSVM) to predict regulatory elements from DNA sequence, and we have shown that when gkmSVM is trained on DHS data from each of the human and mouse ENCODE and Roadmap cells and tissues, it can predict both cell specific enhancer activity and the impact of regulatory variants (deltaSVM) with greater precision than alternative approaches. The gkmSVM model encapsula