# Gene Embedding Generation for GenePT

This notebook generates custom GenePT embeddings for genes using OpenAI's text-embedding-3-large model. The process includes:

1. Loading gene descriptions originally sourced from NCBI and UniProt
2. Generating enhanced gene descriptions using GPT-4o (or another chat model of your choice)
3. Creating embeddings for each gene using the combined descriptions
4. Handling duplicate genes by averaging their embeddings
5. Saving the final embeddings to a parquet file

The prompt we use by default creates embeddings that capture information about:
- Gene associations
- Tissues
- Cell types
- Drug interactions
- Biological pathways

Output: A 3072-dimensional embedding vector for each gene

This notebook uses `dotenv` to load the OpenAI API key from the `.env` file. To install the `dotenv` package, run
```
pip install python-dotenv
```
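
As a minimal sketch of a fail-fast check (the helper name `require_env_var` is hypothetical and not part of this repo), the following verifies that a key such as `OPENAI_API_KEY` is present after the `.env` file has been loaded. The optional `environ` argument is only there so the check can be tried without a real key.

```python
import os


def require_env_var(name, environ=None):
    """Return the value of an environment variable, or raise a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Example with an explicit mapping instead of the real environment.
print(require_env_var("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```

In the notebook itself this would be called as `require_env_var("OPENAI_API_KEY")` after `load_dotenv()`.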

## Setup 

In [1]:
# Variables imported from notebook_setup.ipynb
# this just gets the linter to stop complaining

repo_dir = None  # type: ignore
data_dir = None  # type: ignore

%run notebook_setup.ipynb

autoreload enabled
repo_dir set to /Users/rj/personal/GenePT-tools
File already exists at /Users/rj/personal/GenePT-tools/data/GenePT_emebdding_v2.zip
Extracting files...
Extracting GenePT_emebdding_v2/
Skipping GenePT_emebdding_v2/NCBI_UniProt_summary_of_genes.json - already exists with same size
Skipping GenePT_emebdding_v2/GenePT_gene_embedding_ada_text.pickle - already exists with same size
Skipping GenePT_emebdding_v2/GenePT_gene_protein_embedding_model_3_text.pickle. - already exists with same size
Skipping GenePT_emebdding_v2/NCBI_summary_of_genes.json - already exists with same size
Extraction complete!
Skipping embedding_original_ada_text.parquet - already exists
Skipping embedding_original_large_3.parquet - already exists
Skipping embedding_associations_age_cell_type_drugs_pathways_openai_large.parquet - already exists
Skipping embedding_associations_age_drugs_pathways_openai_large.parquet - already exists
Skipping embedding_associations_cell_type_openai_large.parquet - alrea

# Load `gene_info_table`

We use `gene_info_table` as the reference table.  This is from the original GenePT paper, and allows us to use the same set of genes for our custom embeddings.

In [2]:
import pandas as pd

gene_info_table = pd.read_parquet(data_dir / "gene_info_table.parquet")
gene_info_table

index,ensembl_id,gene_type
TSPAN6,ENSG00000000003,protein_coding
TNMD,ENSG00000000005,protein_coding
DPM1,ENSG00000000419,protein_coding
SCYL3,ENSG00000000457,protein_coding
C1orf112,ENSG00000000460,protein_coding
...,...,...
LINC02481,ENSG00000246526,
LINC01856,ENSG00000237574,
LINC02698,ENSG00000256717,
BOLA2-SMG1P6,ENSG00000261740,


In [3]:
from src.utils import setup_data_dir
from pathlib import Path


embedding_dir = data_dir / "GenePT_emebdding_v2"
ncbi_summary_of_genes_path = embedding_dir / "NCBI_summary_of_genes.json"
ncbi_uniprot_summary_of_genes_path = (
    embedding_dir / "NCBI_UniProt_summary_of_genes.json"
)

print("embedding_dir exists:", embedding_dir.exists())
print("ncbi_summary_of_genes_path exists:", ncbi_summary_of_genes_path.exists())
print(
    "ncbi_uniprot_summary_of_genes_path exists:",
    ncbi_uniprot_summary_of_genes_path.exists(),
)

embedding_dir exists: True
ncbi_summary_of_genes_path exists: True
ncbi_uniprot_summary_of_genes_path exists: True


In [4]:
import json

# Use context managers so the file handles are closed after loading
with open(ncbi_summary_of_genes_path) as f:
    ncbi_summary_of_genes = json.load(f)
with open(ncbi_uniprot_summary_of_genes_path) as f:
    ncbi_uniprot_summary_of_genes = json.load(f)

In [6]:
import os
from src.prompt_templates import (
    NCBI_UNIPROT_ASSOCIATED_CELL_TYPE_TISSUE_DRUG_PATHWAY_PROMPT_V1,
)

prompt_template = NCBI_UNIPROT_ASSOCIATED_CELL_TYPE_TISSUE_DRUG_PATHWAY_PROMPT_V1

print(
    prompt_template.format(
        "LOC124907803", ncbi_uniprot_summary_of_genes["LOC124907803"]
    )
)

Tell me about the LOC124907803 gene.

Here is the NCBI and UniProt summary of the gene:

Gene Symbol LOC124907803

----

In addition to the provided information, please:

1. List any other genes that the gene is associated with, particularly those not mentioned in the summaries above.
2. List any tissues or tissue classes that the gene is expressed in.
3. List any cell types or cell classes that the gene is expressed in.
4. List any drug or drug classes that are known to interact with this gene. 
5. List pathways and biological processes that this gene is involved in.

Only include specific information about the gene or gene class.
Avoid generic language and non-informative discussion.
Do not make suggestions about other sources of information.
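
The template is filled with two positional arguments: the gene symbol and its summary. A minimal sketch of that mechanism, using a shortened, hypothetical stand-in for the real template in `src.prompt_templates` (`EXAMPLE_PROMPT` and `build_prompt` are illustrative names only):

```python
# Hypothetical, shortened stand-in for the real prompt template.
EXAMPLE_PROMPT = (
    "Tell me about the {} gene.\n\n"
    "Here is the NCBI and UniProt summary of the gene:\n\n"
    "{}\n\n----\n\n"
    "List pathways and biological processes that this gene is involved in."
)


def build_prompt(gene_symbol, summary, template=EXAMPLE_PROMPT):
    """Fill the two positional placeholders: gene symbol first, summary second."""
    return template.format(gene_symbol, summary)


prompt = build_prompt("LOC124907803", "Gene Symbol LOC124907803")
print(prompt)
```

Only the template string is parsed by `str.format`, so braces that happen to appear inside a summary are inserted verbatim.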



# Prompt quality characterization

To characterize the quality of the prompt, let's look at a few example genes.

* **BRCA1** is a very well documented breast cancer gene
* **LOC124907803** is completely undocumented
* **PRDM9** is an obscure but studied gene

In [7]:
from dotenv import load_dotenv
from openai import OpenAI

# Load OPENAI_API_KEY from the .env file before creating the client
load_dotenv()

client = OpenAI()


In [8]:

gene_completion_test = ["LOC124907803", "BRCA1", 'PRDM9']
for gene in sorted(gene_completion_test):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a detail oriented bioinformatics and genetics expert help to give accurate and precise descriptions of genes",
            },
            {
                "role": "user",
                "content": prompt_template.format(
                    gene, ncbi_uniprot_summary_of_genes[gene]
                ),
            }
        ],
        temperature=0.0,
    )
    print(
        f"""
{ncbi_uniprot_summary_of_genes[gene]}
                    
{completion.choices[0].message.content}

"""
    )
    # Print a separator after each gene's output
    print("-" * 100)


Gene Symbol BRCA1 This gene encodes a 190 kD nuclear phosphoprotein that plays a role in maintaining genomic stability, and it also acts as a tumor suppressor. The BRCA1 gene contains 22 exons spanning about 110 kb of DNA. The encoded protein combines with other tumor suppressors, DNA damage sensors, and signal transducers to form a large multi-subunit protein complex known as the BRCA1-associated genome surveillance complex (BASC). This gene product associates with RNA polymerase II, and through the C-terminal domain, also interacts with histone deacetylase complexes. This protein thus plays a role in transcription, DNA repair of double-stranded breaks, and recombination. Mutations in this gene are responsible for approximately 40% of inherited breast cancers and more than 80% of inherited breast and ovarian cancers. Alternative splicing plays a role in modulating the subcellular localization and physiological function of this gene. Many alternatively spliced transcript variants, som

# Create request batches

We run a small `test_batch` first to make sure that all of the machinery works and that the prompts behave as expected. Then we run a full batch of all the genes.
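
For orientation, here is a sketch of what one batch request per gene can look like. It follows the JSONL request shape used by the OpenAI Batch API (`custom_id`, `method`, `url`, `body`). The function `build_chat_batch_requests` is a hypothetical stand-in; it is an assumption that `embeddings.get_gene_text_batch_requests` produces something similar, and the real helper may differ.

```python
import json


def build_chat_batch_requests(summaries, template, id_prefix, model="gpt-4o"):
    """Build one chat-completion batch request per gene (illustrative only)."""
    requests = []
    for i, (gene, summary) in enumerate(summaries.items()):
        requests.append(
            {
                # custom_id lets each response be matched back to its request
                "custom_id": f"{id_prefix}-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "temperature": 0.0,
                    "messages": [
                        {"role": "user", "content": template.format(gene, summary)}
                    ],
                },
            }
        )
    return requests


def to_jsonl(requests):
    """Serialize requests as JSON Lines, one request per line."""
    return "\n".join(json.dumps(request) for request in requests)


example_requests = build_chat_batch_requests(
    {"BRCA1": "Gene Symbol BRCA1", "PRDM9": "Gene Symbol PRDM9"},
    "Tell me about the {} gene.\n\n{}",
    "test_batch_requests",
)
print(to_jsonl(example_requests))
```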

In [9]:
from src import embeddings


test_batch_info = embeddings.BatchInfo(
    batch_name="test_batch",
    request_data=embeddings.get_gene_text_batch_requests(
        dict(list(ncbi_uniprot_summary_of_genes.items())[:10]),
        prompt_template,
        "test_batch_requests",
        model="gpt-4o",
    ),
    batch_description="small batch of gene embeddings for testing the batch API",
    data_dir=data_dir,
)
test_batch_job = embeddings.create_batch_job(test_batch_info, "completion", client)
test_batch_job

Batch(id='batch_67c532ae6b34819082856708518e011a', completion_window='24h', created_at=1740976814, endpoint='/v1/chat/completions', input_file_id='file-ViqkaePht9qvaNWXjbiUYG', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1741063214, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'small batch of gene embeddings for testing the batch API'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

In [10]:
test_batch_status = embeddings.monitor_batch_status(client, test_batch_job, check_interval=60, verbose=True)

2025-03-02 20:40:18 - Completed: 0, Failed: 0, Total: 10
2025-03-02 20:41:19 - Completed: 10, Failed: 0, Total: 10

Batch completed
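
The monitoring step is essentially a polling loop. The sketch below is an assumption about how such a loop can be written, not the actual implementation of `embeddings.monitor_batch_status`. The name `wait_for_batch` is hypothetical, and the `retrieve` callable is injected so the loop can be exercised without network access.

```python
import time
from types import SimpleNamespace

# Batch statuses after which no further progress is expected.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, check_interval=60, sleep=time.sleep):
    """Poll retrieve(batch_id) until the batch reaches a terminal status."""
    while True:
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(check_interval)


# Fake retrieve function that walks through a typical status sequence.
statuses = iter(["validating", "in_progress", "finalizing", "completed"])


def fake_retrieve(batch_id):
    return SimpleNamespace(id=batch_id, status=next(statuses))


final = wait_for_batch(
    fake_retrieve, "batch_example", check_interval=0, sleep=lambda seconds: None
)
print(final.status)
```

With the real client, the same loop could be driven by `client.batches.retrieve`, for example `wait_for_batch(client.batches.retrieve, test_batch_job.id)`.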


In [11]:
from src import embeddings
test_output_path = embeddings.save_batch_response(test_batch_info, test_batch_status, client)


In [12]:

# responses = embeddings.load_batch_responses(test_batch_info)
full_batch_info = embeddings.BatchInfo(
    batch_name="full_batch",
    request_data=embeddings.get_gene_text_batch_requests(
        ncbi_uniprot_summary_of_genes,
        prompt_template,
        "full_batch_requests",
        model="gpt-4o",
    ),
    batch_description="full batch of gene embedding text",
    data_dir=data_dir,
)
full_batch_job = embeddings.create_batch_job(full_batch_info, "completion", client)
full_batch_job

Batch(id='batch_67c53558e8c48190a11544c0e04679bb', completion_window='24h', created_at=1740977496, endpoint='/v1/chat/completions', input_file_id='file-DmEJuSERwQeobwdWG4Wpsi', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1741063896, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'full batch of gene embedding text'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

In [None]:
# batch_indices = pd.read_csv(
#     "/Users/rj/personal/GenePT-tools/data/generated/batch-requests/failed_batch_indices.txt",
#     header = None
#     )

In [37]:
# error_request_data = [
#     request_data
#     for i, request_data in enumerate(full_batch_info.request_data)
#     if i in set(batch_indices[0])
# ]

In [None]:
# error_batch_info = embeddings.BatchInfo(
#     batch_name = "error_batch",
#     request_data = error_request_data,
#     batch_description = "error batch of gene embeddings for testing the batch API",
#     data_dir = data_dir
# )
# error_batch_job = embeddings.create_batch_job(error_batch_info, "completion", client)
# error_batch_job

Batch(id='batch_67c5090072548190b8013752c4063f3d', completion_window='24h', created_at=1740966144, endpoint='/v1/chat/completions', input_file_id='file-DMYipneBtzyXSR3WuFzVby', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1741052544, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'error batch of gene embeddings for testing the batch API'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

In [None]:
# error_batch_status = embeddings.monitor_batch_status(client, error_batch_job, check_interval=60)

2025-03-02 17:43:02 - Completed: 0, Failed: 0, Total: 11948
2025-03-02 17:44:03 - Completed: 487, Failed: 0, Total: 11948
2025-03-02 17:45:03 - Completed: 827, Failed: 0, Total: 11948
2025-03-02 17:46:03 - Completed: 1233, Failed: 0, Total: 11948
2025-03-02 17:47:03 - Completed: 1631, Failed: 0, Total: 11948
2025-03-02 17:48:03 - Completed: 2056, Failed: 0, Total: 11948
2025-03-02 17:49:03 - Completed: 2624, Failed: 0, Total: 11948
2025-03-02 17:50:04 - Completed: 3016, Failed: 0, Total: 11948
2025-03-02 17:51:04 - Completed: 3408, Failed: 0, Total: 11948
2025-03-02 17:52:04 - Completed: 3591, Failed: 0, Total: 11948
2025-03-02 17:53:04 - Completed: 4240, Failed: 0, Total: 11948
2025-03-02 17:54:04 - Completed: 4621, Failed: 0, Total: 11948
2025-03-02 17:55:05 - Completed: 4928, Failed: 0, Total: 11948
2025-03-02 17:56:05 - Completed: 5589, Failed: 0, Total: 11948
2025-03-02 17:57:05 - Completed: 5961, Failed: 0, Total: 11948
2025-03-02 17:58:05 - Completed: 6260, Failed: 0, Total: 119

In [43]:
# error_batch_output_path = embeddings.save_batch_response(error_batch_info, error_batch_status, client)


In [45]:
# responses = embeddings.load_batch_responses(error_batch_info)

In [47]:
# original_responses = []
# with open("/Users/rj/personal/GenePT-tools/data/generated/batch-requests/batch_67c4cede540c81908b6bfc81a658e8df_output.jsonl", "r") as f:
#     for line in f:
#         original_responses.append(json.loads(line))


In [48]:
# original_responses[:10]

[{'id': 'batch_req_67c4de1099b4819090d77ffbb78730e6',
  'custom_id': None,
  'response': {'status_code': 200,
   'request_id': 'cbd5e437c3edf6aaf2b5f0c5b495dcb0',
   'body': {'id': 'chatcmpl-B6lFdKfCBUNe2aVCx5pWkAsDMdAOU',
    'object': 'chat.completion',
    'created': 1740951301,
    'model': 'gpt-4o-2024-08-06',
    'choices': [{'index': 0,
      'message': {'role': 'assistant',
       'content': 'LINC01409 is categorized as a long intergenic non-protein coding RNA (lincRNA) gene. LincRNAs are a subtype of long non-coding RNAs (lncRNAs) that are located between protein-coding genes. These RNA molecules do not code for proteins but are known to play various roles in gene regulation, including chromatin remodeling, transcriptional regulation, and post-transcriptional processing.\n\nRegarding the expression of LINC01409, it is known to be expressed in specific human tissues and cell types. RNA expression datasets, such as those compiled from RNA-seq analyses, indicate that LINC01409 sh

In [None]:
# error_responses = responses
# responses = (original_responses + error_responses)
# responses.sort(key=lambda x: x["id"])

In [58]:
responses

{'id': 'batch_req_67c5151b0320819090d952ecde5ec19d',
 'custom_id': 'full_batch_requests-28703',
 'response': {'status_code': 200,
  'request_id': 'c9fbf9bf1ba5e4d8bfb1c8f69a940562',
  'body': {'id': 'chatcmpl-B6pMv9VGGGFu2z4cGxiGDrIkAi4Hv',
   'object': 'chat.completion',
   'created': 1740967129,
   'model': 'gpt-4o-2024-08-06',
   'choices': [{'index': 0,
     'message': {'role': 'assistant',
      'content': "The SCRG1 gene is primarily expressed in the central nervous system of adults and is notably absent in the fetal brain. Beyond this, high transcript levels are observed in the testis and aorta. While specific cell types or classes are not detailed in the summary, the gene's expression in the central nervous system suggests a potential presence in neuronal cells or related supporting cells. Its role in interactions with BST1 indicates a possible expression or activity in stromal and mesenchymal stem cells, pivotal for tissue and bone regeneration.",
      'refusal': None},
     

In [13]:
full_batch_status = embeddings.monitor_batch_status(client, full_batch_job, check_interval=60)

2025-03-02 20:51:52 - Completed: 0, Failed: 0, Total: 33703
2025-03-02 20:52:52 - Completed: 248, Failed: 0, Total: 33703
2025-03-02 20:53:52 - Completed: 650, Failed: 0, Total: 33703
2025-03-02 20:54:52 - Completed: 1192, Failed: 0, Total: 33703
2025-03-02 20:55:53 - Completed: 1673, Failed: 0, Total: 33703
2025-03-02 20:56:53 - Completed: 2146, Failed: 0, Total: 33703
2025-03-02 20:57:53 - Completed: 2481, Failed: 0, Total: 33703
2025-03-02 20:58:53 - Completed: 2963, Failed: 0, Total: 33703
2025-03-02 21:31:47 - Completed: 19409, Failed: 0, Total: 33703
2025-03-02 21:32:47 - Completed: 20425, Failed: 0, Total: 33703
2025-03-02 21:33:47 - Completed: 20897, Failed: 0, Total: 33703
2025-03-02 21:34:47 - Completed: 21515, Failed: 0, Total: 33703
2025-03-02 21:35:47 - Completed: 21941, Failed: 0, Total: 33703
2025-03-02 21:36:48 - Completed: 21941, Failed: 0, Total: 33703
2025-03-02 21:37:48 - Completed: 22653, Failed: 0, Total: 33703
2025-03-02 21:38:48 - Completed: 23496, Failed: 0, To

In [14]:
full_batch_output_path = embeddings.save_batch_response(full_batch_info, full_batch_status, client)


In [15]:
responses = embeddings.load_batch_responses(full_batch_info)

In [61]:
# custom_ids = [response["custom_id"][len('full_batch_requests-'):] for response in responses]

TypeError: 'NoneType' object is not subscriptable

In [None]:
# len(error_responses)


11948

In [16]:

gene_descriptions_df = embeddings.create_gene_descriptions_dataframe(
    ncbi_uniprot_summary_of_genes,
    responses,
    gene_info_table
)
gene_descriptions_df

gene_name,description,gpt_response,ensembl_id,gene_type
LINC01409,Gene Symbol LINC01409,LINC01409 is a long intergenic non-protein cod...,ENSG00000237491,
FAM87B,Gene Symbol FAM87B,FAM87B (Family with Sequence Similarity 87 Mem...,ENSG00000177757,lincRNA
LINC01128,Gene Symbol LINC01128,"The LINC01128 gene, part of the long intergeni...",ENSG00000228794,lncRNA
LINC00115,Gene Symbol LINC00115,LINC00115 is a long intergenic non-protein cod...,ENSG00000225880,lincRNA
FAM41C,Gene Symbol FAM41C,"The FAM41C gene, also known as Family with Seq...",ENSG00000230368,lincRNA
...,...,...,...,...
ZUP1,Gene Symbol ZUP1 This gene encodes a protein c...,1. **Associated Genes:**\n - No additional s...,ENSG00000153975,
ZWILCH,Gene Symbol ZWILCH Involved in protein localiz...,1. **Associated Genes**: The ZWILCH gene is pa...,ENSG00000174442,protein_coding
ZXDA,Gene Symbol ZXDA This gene encodes one of two ...,1. Associated Genes:\n - Besides its duplica...,ENSG00000198205,protein_coding
ZXDB,Gene Symbol ZXDB Protein summary: Cooperates w...,1. **Associated Genes**: The ZXDB gene is part...,ENSG00000198455,protein_coding


In [17]:
gene_descriptions_df.loc["ZUP1"]

description     Gene Symbol ZUP1 This gene encodes a protein c...
gpt_response    1. **Associated Genes:**\n   - No additional s...
ensembl_id                                        ENSG00000153975
gene_type                                                    None
Name: ZUP1, dtype: object

In [18]:
print(gene_descriptions_df.shape)
print(gene_info_table.shape)
print(gene_info_table.loc["SNORD112"].shape)

(37262, 4)
(84425, 2)
(51, 2)


In [19]:
# Create the dataset directory if it doesn't exist
dataset_dir = data_dir / "generated" / "huggingface_dataset"
dataset_dir.mkdir(parents=True, exist_ok=True)

dataset_file_name = "generated_descriptions_gpt4o_cell_type_tissue_drug_pathway.parquet"
dataset_file_path = dataset_dir / dataset_file_name
gene_descriptions_df.to_parquet(dataset_file_path)


# Identify genes with missing Ensembl IDs

In [20]:
missing_ensembl_ids = gene_descriptions_df[
    gene_descriptions_df.ensembl_id.isna()
].index
missing_ensembl_ids

print(f"Descriptions with missing ensembl ids: {len(gene_descriptions_df.loc[missing_ensembl_ids])}")

Descriptions with missing ensembl ids: 576
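
Because gene names repeat in the index (for example, `SNORD112` appears 51 times in `gene_info_table`), label-based lookups with `.loc` can return a row more than once. The toy example below, with made-up gene names and a placeholder ID, sketches the difference between counting rows with a boolean mask, counting distinct gene names, and counting via `.loc`. Whether this affects the count reported above depends on which genes are missing IDs, so treat it as a check to run rather than a correction.

```python
import pandas as pd

# Toy frame with a duplicated gene name; values are placeholders.
toy = pd.DataFrame(
    {"ensembl_id": [None, None, "ENSG_EXAMPLE"]},
    index=pd.Index(["GENE_A", "GENE_A", "GENE_B"], name="gene_name"),
)

missing_mask = toy["ensembl_id"].isna()

# Number of rows whose Ensembl ID is missing.
rows_missing = int(missing_mask.sum())

# Number of distinct gene names with at least one missing Ensembl ID.
genes_missing = toy.index[missing_mask.to_numpy()].nunique()

# Label-based lookup: every repeated label selects all rows with that label.
rows_via_loc = len(toy.loc[toy.index[missing_mask.to_numpy()]])

print(rows_missing, genes_missing, rows_via_loc)
```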


In [21]:
full_batch_of_embedding_requests = embeddings.get_gene_embedding_batch_requests(gene_descriptions_df, "full_batch_embedding_requests")

In [22]:
len(full_batch_of_embedding_requests)

37262

In [23]:
full_batch_of_embedding_requests[1000]

{'custom_id': 'full-batch-embedding-request-1000',
 'method': 'POST',
 'url': '/v1/embeddings',
 'body': {'model': 'text-embedding-3-large',
  'input': '\n    Gene Symbol SYCP1 Enables double-stranded DNA binding activity. Involved in protein homotetramerization. Predicted to be located in synaptonemal complex. Predicted to be active in central element; male germ cell nucleus; and transverse filament. Protein summary: Major component of the transverse filaments of synaptonemal complexes, formed between homologous chromosomes during meiotic prophase. Required for normal assembly of the central element of the synaptonemal complexes. Required for normal centromere pairing during meiosis. Required for normal meiotic chromosome synapsis during oocyte and spermatocyte development and for normal male and female fertility.\n    \n    1. **Associated Genes:**\n   - SYCP1 is closely associated with other synaptonemal complex proteins such as SYCP2 and SYCP3, which are also essential for synapsis

In [24]:
embedding_batch_info = embeddings.BatchInfo(
    batch_name = "full_batch_embeddings",
    request_data = full_batch_of_embedding_requests,
    batch_description = "full batch of gene embeddings based on texted generated by GPT-4o based on the NCBI_UNIPROT_ASSOCIATED_CELL_TYPE_TISSUE_DRUG_PATHWAY_PROMPT_V1 prompt",
    data_dir = data_dir
)

embedding_batch_job = embeddings.create_batch_job(embedding_batch_info, 'embedding', client)

print(embedding_batch_job)

Batch(id='batch_67c5ca6d808481908dc69d09a8593c1a', completion_window='24h', created_at=1741015661, endpoint='/v1/embeddings', input_file_id='file-5KQEGxFJnvvrjTXwgc2zKV', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1741102061, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'full batch of gene embeddings based on texted generated by GPT-4o based on the NCBI_UNIPROT_ASSOCIATED_CELL_TYPE_TISSUE_DRUG_PATHWAY_PROMPT_V1 prompt'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))


In [26]:
from src import embeddings

# # to monitor a batch job that you don't have the job object for, you can use the batch id
# final_status = embeddings.monitor_batch_status(
#     client, "batch_67bb84122cc48190b5c95ced84631145", check_interval=60, verbose=True
# )

final_status = embeddings.monitor_batch_status(
    client, embedding_batch_job, check_interval=60, verbose=True
)

2025-03-03 08:52:02 - Completed: 37262, Failed: 0, Total: 37262
Batch(id='batch_67c5ca6d808481908dc69d09a8593c1a', completion_window='24h', created_at=1741015661, endpoint='/v1/embeddings', input_file_id='file-5KQEGxFJnvvrjTXwgc2zKV', object='batch', status='finalizing', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1741102061, failed_at=None, finalizing_at=1741020416, in_progress_at=1741015668, metadata={'description': 'full batch of gene embeddings based on texted generated by GPT-4o based on the NCBI_UNIPROT_ASSOCIATED_CELL_TYPE_TISSUE_DRUG_PATHWAY_PROMPT_V1 prompt'}, output_file_id=None, request_counts=BatchRequestCounts(completed=37262, failed=0, total=37262))
Batch(id='batch_67c5ca6d808481908dc69d09a8593c1a', completion_window='24h', created_at=1741015661, endpoint='/v1/embeddings', input_file_id='file-5KQEGxFJnvvrjTXwgc2zKV', object='batch', status='finalizing', cancelled_at=None, cancelling_at=None, comple

# Save the embeddings to a file

We save the embeddings to a file, and then load them into a dataframe to make sure that we don't lose our work!

In [28]:
full_embedding_batch_response_path = embeddings.save_batch_response(
    embedding_batch_info, final_status, client
)

In [29]:
responses = []
with open(full_embedding_batch_response_path, "r") as f:
    for line in f:
        responses.append(json.loads(line))

embedding_df = pd.DataFrame(
    (response["response"]["body"]["data"][0]["embedding"] for response in responses),
    index=pd.Series(gene_descriptions_df.index),
)
embedding_df

gene_name,0,1,2,3,4,5,6,7,8,9,...,3062,3063,3064,3065,3066,3067,3068,3069,3070,3071
LINC01409,-0.026849,0.022704,-0.013214,-0.004473,0.005596,-0.022308,0.053253,-0.000448,-0.013548,0.024276,...,-0.016357,0.011482,0.023051,-0.023991,0.003693,-0.011043,0.025043,-0.015441,-0.006483,-0.016666
FAM87B,-0.020295,0.020836,-0.009747,-0.022720,-0.000100,-0.035224,0.025599,-0.021920,0.021686,0.006701,...,-0.000447,0.003594,0.013883,-0.012412,-0.007963,-0.014966,0.022030,-0.015434,-0.029242,-0.015790
LINC01128,-0.015540,0.018474,-0.009361,0.011891,-0.015590,-0.005849,0.023584,-0.007385,0.018661,-0.016212,...,-0.016411,0.002513,0.032299,-0.031155,0.000467,-0.026754,0.030011,-0.016460,-0.006216,-0.033269
LINC00115,-0.011257,-0.010442,-0.007174,-0.001170,0.017488,-0.036631,0.067470,-0.012438,0.019131,0.009067,...,-0.024766,0.008555,0.016685,-0.023707,0.003621,-0.023244,0.021821,-0.020190,-0.006328,-0.033954
FAM41C,-0.026513,0.050997,-0.011402,-0.014598,-0.027553,-0.003094,0.014151,-0.022078,0.006033,0.004550,...,-0.025933,-0.000810,0.025257,-0.009033,-0.011770,-0.021619,0.002426,-0.009758,-0.034658,-0.010362
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZUP1,-0.005932,0.007535,0.001352,0.016483,0.009970,-0.016021,0.030834,-0.015661,0.058328,-0.036230,...,0.002473,0.005145,0.010696,-0.009135,0.024423,-0.000833,0.017511,0.011331,-0.005698,-0.005348
ZWILCH,0.027868,0.008611,-0.003215,0.015072,-0.018436,-0.023897,0.009770,-0.020560,0.024476,0.003277,...,-0.005985,0.020684,0.041258,-0.005405,0.009087,0.002163,-0.002096,0.009501,-0.005771,-0.016354
ZXDA,0.009211,0.029213,-0.006353,0.031258,0.037814,-0.010653,0.041354,-0.018422,0.048959,-0.027193,...,-0.009408,0.001146,0.016442,-0.004120,0.017084,-0.011138,0.006421,0.012043,-0.004101,-0.020205
ZXDB,0.023583,0.021932,-0.007256,0.043214,0.040235,-0.026778,0.045163,-0.016990,0.037528,-0.024260,...,-0.012692,0.002606,0.014039,-0.007595,0.020443,-0.019887,0.014391,-0.000501,0.000410,-0.023962


# Remove duplicates

We average the embeddings of rows that share the same gene name, since they should point in roughly the same direction if they correspond to the same gene.
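
A toy sketch of the averaging step, using made-up two-dimensional vectors. One thing to keep in mind: the mean of several unit-length vectors is generally shorter than unit length, so if downstream code treats a dot product as cosine similarity, re-normalizing after averaging may be worthwhile. The re-normalization here is an optional addition, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Toy embeddings: GENE_A appears twice, GENE_B once.
toy_embeddings = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    index=pd.Index(["GENE_A", "GENE_A", "GENE_B"], name="gene_name"),
)

# Average all rows that share the same index label.
averaged = toy_embeddings.groupby(level=0).mean()

# Optional: rescale each averaged vector back to unit length.
norms = np.linalg.norm(averaged.to_numpy(), axis=1)
renormalized = averaged.div(norms, axis=0)

print(averaged)
print(renormalized)
```

When the duplicates are nearly identical, as the small standard deviations for `SNORA31` suggest, the averaged vectors stay very close to unit length and the rescaling changes little.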

In [30]:
pd.Series(embedding_df.index).value_counts()

gene_name
SNORD112    51
SNORA31     26
SNORA40     24
SNORA48     21
SNORA25     20
            ..
EMX2OS       1
EMX2         1
EMSY         1
EMP2         1
ZYXP1        1
Name: count, Length: 33703, dtype: int64

In [31]:
embedding_df.loc["SNORA31"].describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3062,3063,3064,3065,3066,3067,3068,3069,3070,3071
count,26.0,26.0,26.0,26.0,26.0,26.0,26.0,26.0,26.0,26.0,...,26.0,26.0,26.0,26.0,26.0,26.0,26.0,26.0,26.0,26.0
mean,-0.004922,0.003821,-0.012852,0.009555,-0.000842,-0.02165,0.037605,-0.016877,0.012646,0.008419,...,0.002801,0.007631,0.024997,-0.009218,-0.00885,-0.027147,0.017134,0.010462,-0.002243,0.000869
std,0.000224,0.000172,3.3e-05,0.000188,0.000187,0.000113,7.8e-05,0.000166,0.000168,0.000172,...,9.6e-05,0.000126,0.000101,9.1e-05,6.1e-05,0.000194,0.00019,8.9e-05,8.9e-05,0.000131
min,-0.00536,0.003296,-0.012914,0.008705,-0.001574,-0.021897,0.037503,-0.017647,0.012514,0.008198,...,0.002521,0.007051,0.024923,-0.009595,-0.008971,-0.027246,0.017022,0.010067,-0.002312,0.000245
25%,-0.005029,0.003727,-0.012869,0.009504,-0.00092,-0.021691,0.037546,-0.016913,0.012587,0.008274,...,0.002749,0.007622,0.02496,-0.009251,-0.008895,-0.027213,0.01705,0.010457,-0.00227,0.000864
50%,-0.004907,0.003823,-0.012862,0.009596,-0.000786,-0.021665,0.037589,-0.016832,0.012598,0.008415,...,0.0028,0.007654,0.02497,-0.009204,-0.008848,-0.027184,0.017094,0.010488,-0.002263,0.000906
75%,-0.004901,0.003966,-0.012845,0.009636,-0.000717,-0.021599,0.037677,-0.016799,0.012635,0.008485,...,0.002867,0.007686,0.024996,-0.009166,-0.00881,-0.02717,0.017145,0.010495,-0.002238,0.000919
max,-0.003983,0.004038,-0.012735,0.009704,-0.000617,-0.021316,0.037806,-0.016783,0.013381,0.00901,...,0.003064,0.007717,0.025469,-0.009105,-0.008756,-0.026216,0.017995,0.010532,-0.001828,0.000928


In [32]:
embedding_df_averaged = embedding_df.groupby(level=0).mean()

In [33]:
embedding_df_averaged.loc["SNORA31"]

0      -0.004922
1       0.003821
2      -0.012852
3       0.009555
4      -0.000842
          ...   
3067   -0.027147
3068    0.017134
3069    0.010462
3070   -0.002243
3071    0.000869
Name: SNORA31, Length: 3072, dtype: float64

In [34]:
embedding_df_averaged.shape

(33703, 3072)

In [36]:
# Create embeddings directory if it doesn't exist
embedding_dir = data_dir / "generated" / "embeddings"
embedding_dir.mkdir(parents=True, exist_ok=True)

# Convert column labels to strings before writing to parquet
embedding_df_averaged.columns = [str(col) for col in embedding_df_averaged.columns]
# Save the averaged embeddings to a parquet file
embedding_file_name = "embedding_associations_cell_type_tissue_drug_pathway_openai_large.parquet"
embedding_file_path = embedding_dir / embedding_file_name
embedding_df_averaged.to_parquet(embedding_file_path)


# Upload the embeddings to HuggingFace

Put your HuggingFace write token in a `.env` file in the root of the repo as `HF_WRITE_TOKEN`. Do not commit it: `.env` is already listed in `.gitignore`.

In [37]:
from dotenv import load_dotenv

load_dotenv()

from datasets import Dataset
from huggingface_hub import HfApi
import os

# Initialize Hugging Face API
token = os.getenv("HF_WRITE_TOKEN")
api = HfApi(token=token)



## Upload the gene descriptions dataset


In [38]:
dataset_file_path.name, embedding_file_name, embedding_file_path

('generated_descriptions_gpt4o_cell_type_tissue_drug_pathway.parquet',
 'embedding_associations_cell_type_tissue_drug_pathway_openai_large.parquet',
 PosixPath('/Users/rj/personal/GenePT-tools/data/generated/embeddings/embedding_associations_cell_type_tissue_drug_pathway_openai_large.parquet'))

In [39]:
dataset_repo_id = "honicky/genept-composable-embeddings-source-data"

# Upload the gene descriptions parquet file directly
try:
    api.upload_file(
        path_or_fileobj=str(dataset_file_path),
        path_in_repo=dataset_file_path.name,
        repo_id=dataset_repo_id,
        repo_type="dataset",
    )

    print(f"Successfully uploaded {dataset_file_path}")

except Exception as e:
    # Include the exception so a failed upload can be diagnosed
    print(f"Error uploading {dataset_file_path}: {e}")

generated_descriptions_gpt4o_cell_type_tissue_drug_pathway.parquet: 100%|██████████| 29.6M/29.6M [00:01<00:00, 20.9MB/s]


Successfully uploaded /Users/rj/personal/GenePT-tools/data/generated/huggingface_dataset/generated_descriptions_gpt4o_cell_type_tissue_drug_pathway.parquet


## Upload the embedding model

Don't forget to update the README.md to add the new embedding model to the list of models if needed!

In [40]:
model_repo_id = "honicky/genept-composable-embeddings"

try:
    api.upload_file(
        path_or_fileobj=embedding_file_path,
        path_in_repo=embedding_file_name,
        repo_id=model_repo_id,
        repo_type="model",
    )
    print(f"Successfully uploaded {embedding_file_name}")
except Exception as e:
    # Include the exception so a failed upload can be diagnosed
    print(f"Error uploading {embedding_file_path}: {e}")



embedding_associations_cell_type_tissue_drug_pathway_openai_large.parquet: 100%|██████████| 1.04G/1.04G [00:35<00:00, 28.9MB/s]


Successfully uploaded embedding_associations_cell_type_tissue_drug_pathway_openai_large.parquet
