## Semantic search with OpenAI `text-embedding-ada-002` embeddings

This example uses the underlying Python library to calculate embeddings for a custom set of documents, then finds the _k_ documents most similar to a free-text query or to an existing document.

In [6]:
import requests

import openai as openai_client  # Assumes OPENAI_API_KEY is defined in the environment

from ssed.remote.openai import OpenAI
from ssed.remote.openai import OpenAIProps

from ssed.embeddings import Embeddings
from ssed.embeddings import EmbeddingsProps

Get a few example Award documents from the ENCODE portal, requesting only selected fields:

In [2]:
url = (
    'https://www.encodeproject.org'
    '/search/?type=Award&frame=object&format=json'
    '&field=pi.title&field=description&field=title&field=rfa'
)
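response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on an HTTP error instead of parsing an error body
documents = response.json()['@graph']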
documents[0]

{'@id': '/awards/R01GM083337/',
 '@type': ['Award', 'Item'],
 'description': 'Abnormal temporal control of replication is observed in many diseases but causal linkages are unknown. This gap will remain incomprehensible until the mechanisms regulating replication timing during normal development are understood. The long-term goal is to understand the relationship of replication timing to cellular epigenetic states and disease. The immediate goal is to identify cis-acting DNA/chromatin elements that regulate changes in replication timing during differentiation of mouse embryonic stem cells (ESCs). Mouse ESCs are an ideal experimental system due to the availability of chromosome engineering tools, directed cell differentiation systems, and comprehensive genome-wide maps of replication timing and transcription. These maps have identified the molecular coordinates of programmed changes in replication timing that occur in 400-800kb units termed "replication domains". The central hypothesis i

In [51]:
[doc['title'] for doc in documents]

['CIS-ACTING ELEMENTS REGULATING DEVELOPMENTAL CONTROL OF REPLICATION TIMING',
 'GENOME-WIDE MAPPING OF CHROMOSOMAL PROTEINS IN DROSOPHILIA',
 'GENCODE: COMPREHENSIVE GENOME ANNOTATION FOR HUMAN AND MOUSE',
 'IDENTIFICATION OF FUNCTIONAL DNA ELEMENTS BY HSQPCR',
 'A CATALOG OF CELL TYPES AND GENOMIC ELEMENTS IN TISSUES, ORGANOIDS AND DISEASE',
 'WHOLE GENOME CHROMATIN INTERACTION ANALYSIS USING PAIR-END-DITAGGING (CIA-PET)',
 'FUNCTIONALLY SPECIALIZED COMPONENTS OF DISEASE HERITABILITY IN ENCODE DATA',
 'Unattributed',
 'DECODING THE REGULATORY ARCHITECTURE OF THE HUMAN GENOME ACROSS CELL TYPES, INDIVIDUALS AND DISEASE',
 'Unsupervised machine learning methods that discover the molecular programs underlying cellular biology',
 'A COMPREHENSIVE CATALOG OF DNASEI HYPERSENSITIVE SITES',
 'HIGH-THROUGHPUT FUNCTIONAL CHARACTERIZATION OF HUMAN ENHANCERS',
 'HIGH THROUGHPUT CRISPR-MEDIATED FUNCTIONAL VALIDATION OF REGULATORY ELEMENTS',
 'ANALYSIS OF FUNCTIONAL GENETIC VARIANTS IN RNA PROCESSI

In [46]:
print(len(documents))

25


Create an `Embeddings` object from the documents:

In [10]:
openai = OpenAI(
    props=OpenAIProps(
        # openai.Embedding is the pre-1.0 interface of the openai package
        embedding_client=openai_client.Embedding
    )
)

embeddings = Embeddings.from_documents(
    props=EmbeddingsProps(
        openai=openai
    ),
    documents=documents,
    id_key='@id'
)

Calculate the embedding values:

In [11]:
embeddings.get_values()

array([[-0.03312446,  0.00638379, -0.00196582, ..., -0.00536663,
         0.00491456, -0.02282272],
       [-0.03454168, -0.00143894, -0.02220738, ...,  0.00072738,
        -0.01035125, -0.00953553],
       [-0.02822752, -0.00126266, -0.01149023, ..., -0.0355229 ,
        -0.00266562, -0.00618003],
       ...,
       [-0.02541946,  0.00930381, -0.01822204, ..., -0.01340948,
        -0.02533377, -0.00939663],
       [-0.02193064,  0.01178017, -0.01062814, ..., -0.02159346,
        -0.01869935, -0.02731145],
       [-0.03558172,  0.01007418, -0.01048537, ..., -0.02404096,
        -0.01983311, -0.01965492]])
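A quick sanity check on the returned array can catch problems early. The sketch below uses a random stand-in matrix rather than real API output (an assumption, so that it runs without an API key); with real data, `values` would be the result of `embeddings.get_values()`. `text-embedding-ada-002` returns 1536-dimensional vectors, and OpenAI documents its embeddings as normalized to unit length, so each row norm should be close to 1.

```python
import numpy as np

# Stand-in for embeddings.get_values(): 25 random rows scaled to unit length.
# Real values would come from the API; this matrix is illustrative only.
rng = np.random.default_rng(0)
values = rng.normal(size=(25, 1536))
values /= np.linalg.norm(values, axis=1, keepdims=True)

n_documents, n_dimensions = values.shape
row_norms = np.linalg.norm(values, axis=1)

print(n_documents, n_dimensions)    # one row per document, one column per dimension
print(np.allclose(row_norms, 1.0))  # True when every row has unit length
```

If the rows are unit length, cosine similarity reduces to a plain dot product.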

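Find the three documents most similar to a free-text query. Each result is an `(index, score)` pair, where `index` is a position in `documents`:

In [58]: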
crispr_results = embeddings.get_k_results_most_similar_to_query(
    query='crispr',
    k=3
)
crispr_results

[(11, 0.8191637486093268), (17, 0.815446654778106), (14, 0.8153190754152267)]
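The library's scoring code is not shown in this notebook. The scores look like cosine similarities (an assumption), and under that assumption a top-_k_ search can be sketched with NumPy. The function name `top_k_cosine` and the toy vectors are hypothetical and only illustrate the idea.

```python
import numpy as np


def top_k_cosine(matrix, query, k):
    """Return [(row_index, cosine_similarity), ...] for the k best rows."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    scores = matrix @ query / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    # Sort descending by score and keep the first k indices.
    best = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in best]


# Toy 2-D "embeddings"; real ones would have 1536 dimensions.
toy_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_results = top_k_cosine(toy_matrix, query=[1.0, 0.1], k=2)
print(toy_results)  # rows 0 and 2 point most nearly in the query's direction
```

For a real query, the query string would first be embedded with the same model, and the resulting vector compared against every row of the embedding matrix.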

Define a helper that prints the matching documents:

In [64]:
def print_results(indices_and_scores, embeddings, description_max=1500):
    for index, score in indices_and_scores:
        print('\nID:', embeddings.documents[index]['@id'])
        print('Title:', embeddings.documents[index]['title'])
        print('Lab:', embeddings.documents[index]['pi']['title'])
        print('Description:', embeddings.documents[index]['description'][:description_max] + '...')
        print('Similarity score:', score)

In [66]:
print_results(crispr_results, embeddings)


ID: /awards/UM1HG009393/
Title: HIGH-THROUGHPUT FUNCTIONAL CHARACTERIZATION OF HUMAN ENHANCERS
Lab: John Lis
Description: Specific enhancers interact with promoters to specify the cellular pattern, timing, and levels of gene expression. Enhancers can reside up to megabases away from their target gene promoters and strongly activate transcription. Aim 1 will characterize active enhancer elements and their relationship to promoter elements in vivo in human K562 (a tier 1 ENCODE cell line) by testing a broad array of Transcription Regulatory Elements (TREs) for their enhancer activity using eSTARR-seq, our modified element-clone- compatible STARR-seq assay. This collection of TREs will be selected based on a variety of criteria established by ENCODE and others. Large numbers of selected TREs can be handled using our new Clone- seq method, and then tested for enhancer activity by eSTARR-seq. For the TREs that have significant enhancer activity, ~10,000 synthetic mutations will be generate

Find the three documents most similar to an existing document, identified by its `@id`:

In [67]:
similar_awards = embeddings.get_k_results_most_similar_to_id(
    '/awards/R01GM083337/',
    k=3
)
similar_awards

[(0, 1.000000068437994), (24, 0.8899324659385169), (5, 0.8851815932001575)]
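The top hit is the query document itself (index 0), with a score a hair above 1, most likely because of floating-point rounding. When only other documents are wanted, one option is to request `k + 1` results and filter out the query's own index. The helper `drop_self` below is hypothetical, not part of the library, and is applied to the result shown above.

```python
def drop_self(indices_and_scores, self_index):
    """Remove the entry whose index matches the query document itself."""
    return [
        (index, score)
        for index, score in indices_and_scores
        if index != self_index
    ]


# Values copied from the output above; index 0 is '/awards/R01GM083337/'.
similar_awards = [
    (0, 1.000000068437994),
    (24, 0.8899324659385169),
    (5, 0.8851815932001575),
]
other_awards = drop_self(similar_awards, self_index=0)
print(other_awards)
```

Filtering by index rather than by a score threshold avoids discarding a different document that happens to have near-identical text.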

In [68]:
print_results(similar_awards, embeddings)


ID: /awards/R01GM083337/
Title: CIS-ACTING ELEMENTS REGULATING DEVELOPMENTAL CONTROL OF REPLICATION TIMING
Lab: David Gilbert
Description: Abnormal temporal control of replication is observed in many diseases but causal linkages are unknown. This gap will remain incomprehensible until the mechanisms regulating replication timing during normal development are understood. The long-term goal is to understand the relationship of replication timing to cellular epigenetic states and disease. The immediate goal is to identify cis-acting DNA/chromatin elements that regulate changes in replication timing during differentiation of mouse embryonic stem cells (ESCs). Mouse ESCs are an ideal experimental system due to the availability of chromosome engineering tools, directed cell differentiation systems, and comprehensive genome-wide maps of replication timing and transcription. These maps have identified the molecular coordinates of programmed changes in replication timing that occur in 400-800k

In [69]:
documents

[{'@id': '/awards/R01GM083337/',
  '@type': ['Award', 'Item'],
  'description': 'Abnormal temporal control of replication is observed in many diseases but causal linkages are unknown. This gap will remain incomprehensible until the mechanisms regulating replication timing during normal development are understood. The long-term goal is to understand the relationship of replication timing to cellular epigenetic states and disease. The immediate goal is to identify cis-acting DNA/chromatin elements that regulate changes in replication timing during differentiation of mouse embryonic stem cells (ESCs). Mouse ESCs are an ideal experimental system due to the availability of chromosome engineering tools, directed cell differentiation systems, and comprehensive genome-wide maps of replication timing and transcription. These maps have identified the molecular coordinates of programmed changes in replication timing that occur in 400-800kb units termed "replication domains". The central hypothesi

Run another free-text query:

In [70]:
data_coordination_results = embeddings.get_k_results_most_similar_to_query(
    query='data coordination center',
    k=3
)

In [71]:
print_results(data_coordination_results, embeddings)


ID: /awards/U01HG004695/
Title: EDAC: ENCODE DATA ANALYSIS CENTER
Lab: Ewan Birney
Description: The ENCODE Data Analysis Center (EDAC) proposal aims to provide a flexible analysis resource for the ENCODE project. The ENCODE project is a large multi center project which aims to define all the functional elements in the human genome. This will be achieved using many different experimental techniques coupled with numerous computational techniques. A critical part in delivering this set of functional elements is the integration of data from multiple sources. The ED AC proposal aims to provide this integration. As proscribed by the RFA for this proposal, the precise prioritization for the EDAC's work will be set by an external group, the Analysis Working Group (AWG). Based on previous experience, these analysis methods will require a variety of techniques. We expect to have to apply sophisticated statistical models to the integration of the data, in particular mitigating the problems of th