## Semantic search with OpenAI `text-embedding-ada-002` embeddings

This example uses the underlying Python library to calculate embeddings for a custom set of documents, then finds the _k_ most similar documents for either a free-text query or an existing document.

In [1]:
import requests

import openai as openai_client  # Assumes OPENAI_API_KEY is defined in the environment

from ssed.remote.openai import OpenAI
from ssed.remote.openai import OpenAIProps

from ssed.embeddings import Embeddings
from ssed.embeddings import EmbeddingsProps

Fetch the current ENCODE Award documents, requesting only selected fields:

In [2]:
url = (
    'https://www.encodeproject.org'
    '/search/?type=Award&frame=object&format=json&status=current&limit=all'
    '&field=pi.title&field=description&field=title&field=rfa'
)
response = requests.get(url)
response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error body
documents = response.json()['@graph']
documents[0]

{'@id': '/awards/1U54DK107967/',
 '@type': ['Award', 'Item'],
 'description': 'PROJECT SUMMARY / ABSTRACT This proposal seeks to fulfill a community need for a comprehensive, high-resolution genome-mapping platform that will enable investigation of the structural, functional and spatiotemporal organization of the human genome. Our ultimate goal is to deliver complex chromatin interaction network maps in the context of 3D genome structures from which the dynamics of individual genomic elements can be monitored and referenced. Here, we propose to develop a Nucleome Positioning System (NPS)-comprised of 1) a robust genome- wide mapping technology platform, 2) advanced computational modeling algorithms and 3) state-of-the-art nuclear imaging methods-that will allow users community-wide to uncover the regulatory functions of 3D genome organization in human cells. NPS will be based upon the established ChIA-PET method (1,2), enhanced by process optimizations-i.e., microfluidic-based miniatur

In [3]:
[doc['title'] for doc in documents]

['NUCLEOME POSITIONING SYSTEM FOR SPATIOTEMPORAL GENOME ORGANIZATION AND REGULATION',
 'ENCODING GENOMIC ARCHITECTURE IN THE ENCYCLOPEDIA: LINKING DNA ELEMENTS, CHROMATIN STATE, AND GENE EXPRESSION IN 3D',
 'COMPREHENSIVE FUNCTIONAL CHARACTERIZATION AND DISSECTION OF NONCODING REGULATORY ELEMENTS AND HUMAN GENETIC VARIATION',
 'A COMPREHENSIVE FUNCTIONAL MAP OF HUMAN PROTEIN-RNA INTERACTIONS',
 'A DATA COORDINATING CENTER FOR ENCODE',
 'CENTER FOR 3D STRUCTURE AND PHYSICS OF THE GENOME',
 'HIGHER PRECISION HUMAN AND MOUSE TRANSCRIPTOMES',
 'CONNECTING TRANSPOSABLE ELEMENTS AND REGULATORY INNOVATION USING ENCODE DATA',
 'GENOME WIDE MAPPING OF LOOPS USING IN SITU HI-C',
 'AN ENCODE CHIP-SEQ PIPELINE USING ENDOGENOUSLY TAGGED HUMAN DNA-ASSOCIATED PROTEINS',
 'IDENTIFICATION OF FUNCTIONAL DNA ELEMENTS BY HSQPCR',
 'MAPPING SITES OF TRANSCRIPTION AND REGULATION',
 'ANALYSIS OF FUNCTIONAL GENETIC VARIANTS IN RNA PROCESSING AND EXPRESSION',
 'ENCODE MAPPING CENTER-A COMPREHENSIVE CATALOG OF 

In [4]:
print(len(documents))

64


Create an `Embeddings` object from the documents:

In [5]:
openai = OpenAI(
    props=OpenAIProps(
        embedding_client=openai_client.Embedding
    )
)

embeddings = Embeddings.from_documents(
    props=EmbeddingsProps(
        openai=openai
    ),
    documents=documents,
    id_key='@id'
)

Calculate the embedding values:

In [6]:
embeddings.get_values()

array([[-0.02199213,  0.01176144, -0.0105888 , ..., -0.02159891,
        -0.01867786, -0.0273427 ],
       [-0.01916428,  0.023857  , -0.00747463, ..., -0.01642452,
        -0.01433106, -0.02548681],
       [-0.01985164,  0.0066079 , -0.0167782 , ..., -0.02340006,
        -0.01160923, -0.01816124],
       ...,
       [-0.02352513,  0.00578107, -0.00970999, ..., -0.01259189,
        -0.00209922, -0.01774752],
       [ 0.00107457, -0.00064013, -0.0274494 , ..., -0.02273974,
         0.00525227, -0.0354076 ],
       [-0.01652613,  0.01145403, -0.02446871, ..., -0.01226222,
        -0.01578761, -0.02778508]])
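
The similarity searches below return `(index, score)` pairs. The source does not show how `ssed` computes them. As a rough sketch, assuming the scores are cosine similarities between a query vector and the rows of the embedding matrix, a top-_k_ lookup could look like the following. The function name `top_k_cosine` and the toy vectors are hypothetical and not part of `ssed`.

```python
import numpy as np


def top_k_cosine(values, query_vector, k=3):
    """Return (row index, cosine similarity) pairs for the k closest rows.

    Hypothetical helper for illustration; not the ssed implementation.
    """
    values = np.asarray(values, dtype=float)
    query_vector = np.asarray(query_vector, dtype=float)
    # Dividing the dot products by both norms yields cosine similarity.
    row_norms = np.linalg.norm(values, axis=1)
    query_norm = np.linalg.norm(query_vector)
    scores = values @ query_vector / (row_norms * query_norm)
    # argsort is ascending, so reverse it and keep the first k indices.
    top_indices = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top_indices]


# Toy 2-D vectors standing in for the real 1536-dimensional embeddings.
toy_values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_results = top_k_cosine(toy_values, query_vector=[1.0, 0.1], k=2)
print(toy_results)
```

With real data, `values` would be the array returned by `embeddings.get_values()` and `query_vector` would be the embedding of the query text.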

Define a helper that prints the matching documents for a list of `(index, score)` results:

In [29]:
def print_results(indices_and_scores, embeddings, description_max=1500):
    # Each result is an (index, score) pair; index is a position in embeddings.documents.
    for index, score in indices_and_scores:
        document = embeddings.documents[index]
        print('\nID:', document['@id'])
        print('Title:', document['title'])
        print('Lab:', document['pi']['title'])
        # Truncate long descriptions to description_max characters.
        print('Description:', document['description'][:description_max] + '...')
        print('Similarity score:', score)

Find the three documents most similar to a free-text query:

In [26]:
crispr_results = embeddings.get_k_results_most_similar_to_query(
    query='crispr',
    k=3
)
crispr_results

[(1, 0.8344338672426572), (2, 0.8264028965176172), (56, 0.8218114454369716)]

In [32]:
print_results(crispr_results, embeddings, description_max=2000)


ID: /awards/U01HG009395/
Title: ENCODING GENOMIC ARCHITECTURE IN THE ENCYCLOPEDIA: LINKING DNA ELEMENTS, CHROMATIN STATE, AND GENE EXPRESSION IN 3D
Lab: Christina Leslie
Description: Most of the 1000s of sequencing experiments generated by ENCODE provide 1D readouts of the epigenetic landscape or transcriptional output of a 3D genome. New chromosome conformation capture (3C) technologies – in particular Hi-C and ChIA-PET – have begun to provide insight into the hierarchical 3D organization of the genome: the partition of chromosomes into open and closed compartments; the existence of structural subunits defined as topologically associated domains (TADs); and the presence of regulatory and structural DNA loops within TADs. New experimental evidence using CRISPR/Cas-mediated genome editing suggests that disruption of local 3D structure can alter regulation of neighboring genes, and there have been early efforts to use data on 3D DNA looping to predict the impact of non-coding SNPs from 

Find the three documents most similar to an existing document, identified by its `@id`:

In [34]:
similar_awards = embeddings.get_k_results_most_similar_to_id(
    '/awards/UM1HG009375/',
    k=3
)
similar_awards

[(8, 1.0000000050508615), (1, 0.9244194182656655), (5, 0.9151913345001851)]
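
The top hit here is the queried award itself (index 8), and its score sits marginally above 1 because of floating-point rounding. If only the other awards are of interest, one option is to request _k_ + 1 results and drop the query's own index. The helper below is a hypothetical sketch that operates on the `(index, score)` pairs shown above; it is not part of `ssed`.

```python
def drop_self_match(indices_and_scores, self_index, k):
    """Remove the query document's own index and keep at most k results.

    Hypothetical helper for illustration; not part of ssed.
    """
    others = [
        (index, score)
        for index, score in indices_and_scores
        if index != self_index
    ]
    return others[:k]


# Pairs copied from the output above; index 8 is the queried award itself.
results = [
    (8, 1.0000000050508615),
    (1, 0.9244194182656655),
    (5, 0.9151913345001851),
]
neighbors = drop_self_match(results, self_index=8, k=2)
print(neighbors)
```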

In [35]:
print_results(similar_awards, embeddings)


ID: /awards/UM1HG009375/
Title: GENOME WIDE MAPPING OF LOOPS USING IN SITU HI-C
Lab: Erez Aiden
Description: The roughly two meters of DNA in the human genome is intricately packaged to form the chromatin and chromosomes in each cell nucleus. In addition to its structural role, this organization has critical regulatory functions. In particular, the formation of loops in the human genome plays an essential role in regulating genes. We recently demonstrated the ability to create reliable maps of these loops, using an in situ Hi-C method for three-dimensional genome sequencing. Hi-C characterizes the three dimensional configuration of the genome by determining the frequency of physical contact between all pairs of loci, genome-wide. The proposed center will apply Hi-C and other new technologies to characterize genomic loops, their regulation, and their functions. We will specifically examine these structures in a wide variety of ENCODE cell types. The principles deduced from our study wi

Run another free-text query, this time for a multi-word phrase:

In [36]:
data_coordination_results = embeddings.get_k_results_most_similar_to_query(
    query='data coordination center',
    k=3
)

In [37]:
print_results(data_coordination_results, embeddings)


ID: /awards/U24HG009397/
Title: A DATA COORDINATING CENTER FOR ENCODE
Lab: J. Michael Cherry
Description: The goals of the ENCODE Data Coordinating Center (DCC) is to support the ENCODE Consortium by defining and establishing a strategy that connects all participants to the data and by creating avenues of access that distribute these data to the greater biological research community. The ENCODE Consortium brings together laboratories that generate complex data types via experimental assays with laboratories that integrate these unique data using computational analyses to discover how chromosomal elements function together to define human cells and tissues. The DCC's participation enhances the data created by these laboratories through the creation of structured procedures for the verification and validation of all submitted data and providing processes for the documentation of metadata that describe each biological sample and assay method. To facilitate access to all the data created 