## System Report
#### Goal: 
 - Assign a candidate name for the system
 - Explain the assignment
 - Provide information about the system and its proteins
 - Provide results of analyses
 
#### Report Structure:
 - The system ID
 - The assigned name
 - A brief summary of the system
 - Alternative names
 - An explanation of the name
 - Supporting information
     - system level
         - summary of shared terms
         - summary of shared diseases
     - per gene, from UniProt and SPOKE
         - gene name
         - gene summary
         - GO BP
         - GO CC
         - Disease association
     - ChatGPT
         - important concepts
         - search suggestions
             - Google
             - PubMed
         - shared disease symptoms
         

In [24]:
import requests
import json
import os
import openai



### Design
 - set the system proteins (SP)
 - set the disease of interest (DOI)
 - annotate SPs with associations with the DOI
 - annotate SPs with any other relevant data 
 - get the system similarity network
 - get the source networks, e.g. "BioPlex 3" or an AP-MS experiment
 
 #### Sections
 Each section will be written out as both a JSON document and an HTML file
 - section: get the analysis results of the input interactome
     - SPs covered
     - interactome modularity analysis for the SPs
     - interactome modularity analysis for the DOIs
     - membership in the DOI
     - significance in relevant data sets   
 - **section: SP information**
     - GO annotations
     - Disease associations
     - UniProt description
     - aliases
 - **section: perform enrichment analyses**
     - adjusting for the SPs covered in the interactome and enrichment sources 
     - identify which SPs are not covered in the enrichment sources
     - a link to re-run the query
 - section: gather information on selected interactions
     - (selected where full nxn coverage is impractical)
 - section: gather summaries from analyses of child systems
 - **section: analyze the information to find features shared between n or more SPs**
 - **section: select subsets of the gathered information and query ChatGPT to summarize and extract key concepts**
 - **section: use the key concepts to make literature queries**
     - such as DOI + a few SP names
         - expand with aliases
     - evaluate and summarize the query results with ChatGPT queries
     - each query result is presented with a link to re-run the query
 - section: query ChatGPT to perform higher level summarization, including candidate system names
 - **propose names**
     - merge ChatGPT names with enrichment query names 
     - annotate the candidate names with coverage metrics: how many proteins support the name, which ones
     - sort the candidate names by coverage
 - **create a top-level page**
     - proposed names
     - a system summary
     - support for each name
     - outline of the sections with links
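
The "propose names" step above (merge, annotate with coverage, sort) could be sketched roughly as follows. This is only an illustration: the function name `rank_candidate_names`, the `(name, proteins)` input shape, and the example inputs are assumptions, not part of the notebook, and the "placeholder name" entry is not an analysis result.

```python
from collections import defaultdict

def rank_candidate_names(name_support):
    """Merge (name, supporting proteins) pairs and sort them by coverage.

    name_support: iterable of (candidate_name, set_of_protein_symbols).
    This input shape is an assumption made for the sketch.
    """
    merged = defaultdict(set)
    for name, proteins in name_support:
        # Normalise case and whitespace so the same name from two sources merges.
        merged[name.strip().lower()].update(proteins)
    ranked = [
        {"name": name, "coverage": len(proteins), "proteins": sorted(proteins)}
        for name, proteins in merged.items()
    ]
    # Highest coverage first; ties are broken alphabetically for a stable order.
    ranked.sort(key=lambda c: (-c["coverage"], c["name"]))
    return ranked

# Hypothetical inputs, for illustration only.
chatgpt_names = [("Semaphorin-plexin signaling", {"PLXNA2"})]
enrichment_names = [
    ("semaphorin-plexin signaling", {"PLXNA4"}),
    ("placeholder name", {"LYST", "MYO5A", "MYLK"}),
]
ranked_names = rank_candidate_names(chatgpt_names + enrichment_names)
```

Each entry records how many proteins support the name and which ones, so the top-level page can list the support next to each proposed name.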

### Organize the system proteins

In [25]:
# Replace with your system proteins
# AKAP11 ANAPC1 ANKRD11 ANKRD31 DOCK2 HECTD4 ITPR1 LYST MYLK MYO5A PCDH15 
# PFDN6 PLXNA2 PLXNA4 PTPN13 RALGAPA2 TRRAP
system_proteins = ["AKAP11", "ANAPC1", "ANKRD11", "ANKRD31", 
                 "DOCK2", "HECTD4", "ITPR1", "LYST", "MYLK", 
                 "MYO5A", "PCDH15", "PFDN6", "PLXNA2", "PLXNA4", 
                 "PTPN13", "RALGAPA2", "TRRAP"]

system_proteins_text = " ".join(system_proteins)
system_proteins_text

'AKAP11 ANAPC1 ANKRD11 ANKRD31 DOCK2 HECTD4 ITPR1 LYST MYLK MYO5A PCDH15 PFDN6 PLXNA2 PLXNA4 PTPN13 RALGAPA2 TRRAP'

### System protein information
 - GO annotations
 - Disease associations
 - UniProt description
 - aliases

### Enrichment Analysis
#### query_gprofiler
returns a list of objects with name, description, source, and p_value
#### gprofiler_results_to_json
writes the results as a JSON file: <system_id>_enrichment_analysis.json
#### gprofiler_results_to_section
writes the results to an HTML file: <system_id>_enrichment_analysis.html
formatted as a table
#### gprofiler_results_to_text
makes a text string to use in creating ChatGPT prompts
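
A minimal sketch of the two writer functions described above, assuming each result is a dict with the `name`, `description`, `source`, and `p_value` keys returned by `query_gprofiler`. The `system_id` and `out_dir` parameters are assumptions added here to build the file names; the notebook does not define them.

```python
import html
import json
from pathlib import Path

def gprofiler_results_to_json(gprofiler_results, system_id, out_dir="."):
    """Write the results to <system_id>_enrichment_analysis.json."""
    path = Path(out_dir) / f"{system_id}_enrichment_analysis.json"
    path.write_text(json.dumps(gprofiler_results, indent=2), encoding="utf-8")
    return path

def gprofiler_results_to_section(gprofiler_results, system_id, out_dir="."):
    """Write the results as an HTML table to <system_id>_enrichment_analysis.html."""
    rows = []
    for result in gprofiler_results:
        cells = (result["name"], result["description"],
                 result["source"], result["p_value"])
        # Escape every cell so term names cannot break the markup.
        rows.append(
            "<tr>" + "".join(f"<td>{html.escape(str(c))}</td>" for c in cells) + "</tr>"
        )
    table = (
        "<table>\n<thead><tr><th>Term Name</th><th>Description</th>"
        "<th>Source</th><th>p-value</th></tr></thead>\n<tbody>\n"
        + "\n".join(rows)
        + "\n</tbody>\n</table>\n"
    )
    path = Path(out_dir) / f"{system_id}_enrichment_analysis.html"
    path.write_text(table, encoding="utf-8")
    return path
```

Both functions return the path they wrote, which makes it easy to link the section from a top-level page.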

In [26]:

import requests
import json

def query_gprofiler(system_proteins):
    url = "https://biit.cs.ut.ee/gprofiler/api/gost/profile"
    headers = {"Content-Type": "application/json"}
    payload = {
        "organism": "hsapiens",
        "query": system_proteins,  # was the undefined name cluster_genes
        "sources": ["GO:BP", "KEGG", "REAC", "WP", "MIRNA", "HPA", "CORUM"],
        "user_threshold": 0.1,
        "all_results": False,
        "ordered": False,
        "no_iea": False,
        "combined": True,
        "measure_underrepresentation": False
    }
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    json_response = response.json()

    filtered_results = []
    for item in json_response['result']:
        filtered_item = {
            "name": item["name"],
            "description": item["description"],
            "source": item["source"],
            "p_value": item["p_values"]
        }
        filtered_results.append(filtered_item)

    return filtered_results

def gprofiler_results_to_section(gprofiler_results):
    # TODO: write the results as an HTML table (not yet implemented)
    pass

def gprofiler_results_to_text(gprofiler_results):
    result_names = [result['name'] for result in gprofiler_results]
    return '\n'.join(result_names)

gprofiler_results = query_gprofiler(system_proteins)

In [27]:

gprofiler_results

[{'name': 'SEMA3A-Plexin repulsion signaling by inhibiting Integrin adhesion',
  'description': 'SEMA3A-Plexin repulsion signaling by inhibiting Integrin adhesion',
  'source': 'REAC',
  'p_value': [0.02967279395498139]},
 {'name': 'MFAP5 effect on permeability and motility of endothelial cells via cytoskeleton rearrangement',
  'description': 'MFAP5 effect on permeability and motility of endothelial cells via cytoskeleton rearrangement',
  'source': 'WP',
  'p_value': [0.030285217532717647]},
 {'name': 'Sema3A PAK dependent Axon repulsion',
  'description': 'Sema3A PAK dependent Axon repulsion',
  'source': 'REAC',
  'p_value': [0.039080559405259445]},
 {'name': 'CRMPs in Sema3A signaling',
  'description': 'CRMPs in Sema3A signaling',
  'source': 'REAC',
  'p_value': [0.039080559405259445]},
 {'name': 'Other semaphorin interactions',
  'description': 'Other semaphorin interactions',
  'source': 'REAC',
  'p_value': [0.0555865049099508]},
 {'name': 'semaphorin-plexin signaling pathway

### Interaction Analysis
#### query_stringdb
#### query_indra

In [28]:
import requests
from indra.sources import indra_db_rest
from indra.statements import pretty_print_stmts

def query_stringdb(system_proteins):
    string_api_url = "https://string-db.org/api/json/network"
    string_params = {
        "identifiers": "%0d".join(system_proteins),  # was the undefined name cluster_genes
        "species": 9606,
        "caller_identity": "myapp"
    }
    response = requests.get(string_api_url, params=string_params)
    json_response = response.json()

    nodes = set()
    edges = []
    for interaction in json_response:
        nodes.add(interaction["preferredName_A"])
        nodes.add(interaction["preferredName_B"])
        edges.append({
            "source": interaction["preferredName_A"],
            "target": interaction["preferredName_B"]
        })

    return {"nodes": list(nodes), "edges": edges}

def query_indra(system_proteins):
    # TODO: not yet implemented. An example query is something like
    # p = indra_db_rest.get_statements(subject="HYDIN")
    # pretty_print_stmts(p.statements)
    pass

indra_results = query_indra(system_proteins)

# Named string_results because later cells refer to it by that name
string_results = query_stringdb(system_proteins)

def interaction_results_to_json(indra_results, stringdb_results):
    pass  # TODO: not yet implemented

def interaction_results_to_section(indra_results, stringdb_results):
    pass  # TODO: not yet implemented

def interaction_results_to_text(indra_results, stringdb_results):
    pass  # TODO: not yet implemented


In [29]:
string_results

{'nodes': ['ITPR1', 'PLXNA4', 'PLXNA2', 'MYLK', 'MYO5A', 'LYST'],
 'edges': [{'source': 'ITPR1', 'target': 'MYLK'},
  {'source': 'ITPR1', 'target': 'MYLK'},
  {'source': 'PLXNA4', 'target': 'PLXNA2'},
  {'source': 'PLXNA4', 'target': 'PLXNA2'},
  {'source': 'LYST', 'target': 'MYO5A'},
  {'source': 'LYST', 'target': 'MYO5A'}]}
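
The output above lists every edge twice. A small sketch for collapsing the duplicates follows, assuming the edges should be treated as undirected (so A-B and B-A count as the same edge); `dedupe_edges` is a hypothetical helper, not part of the notebook.

```python
def dedupe_edges(network):
    """Return a copy of the network with duplicate undirected edges removed."""
    seen = set()
    unique_edges = []
    for edge in network["edges"]:
        # frozenset ignores direction: {A, B} equals {B, A}.
        key = frozenset((edge["source"], edge["target"]))
        if key not in seen:
            seen.add(key)
            unique_edges.append(edge)
    return {"nodes": list(network["nodes"]), "edges": unique_edges}

# The network shown in the output above.
example_network = {
    "nodes": ["ITPR1", "PLXNA4", "PLXNA2", "MYLK", "MYO5A", "LYST"],
    "edges": [
        {"source": "ITPR1", "target": "MYLK"},
        {"source": "ITPR1", "target": "MYLK"},
        {"source": "PLXNA4", "target": "PLXNA2"},
        {"source": "PLXNA4", "target": "PLXNA2"},
        {"source": "LYST", "target": "MYO5A"},
        {"source": "LYST", "target": "MYO5A"},
    ],
}
deduped_network = dedupe_edges(example_network)
```

The first occurrence of each edge is kept, so the original ordering is preserved.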

### ChatGPT query functions

In [30]:
# placeholder data for ChatGPT
cluster_name = "my_cluster"
summary = "summary of the cluster"
chatgpt_analysis = "analysis by chatgpt"

# Load your API key from an environment variable or secret management service
openai.api_key = os.getenv("OPENAI_API_KEY")

gprofiler_text = gprofiler_results_to_text(gprofiler_results)

def generate_summary(system_proteins):
    # Combine the background input and questions into a single prompt
    prompt = f"write a brief analysis of these genes {system_proteins_text} \n based on background knowledge plus these processes relevant to some of the genes \n{gprofiler_text}"
    #print(prompt)
    response = chatgpt_query(prompt)

    return response

def chatgpt_query_prompt_template(system_proteins, prompt_template):
    pass  # TODO: not yet implemented

def chatgpt_query(prompt):
    # Call the OpenAI API to generate answers
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=1000,
        n=1,
        stop=None,
        temperature=0,
    )
    return response

# Parse the response to get the text of the first choice
response = generate_summary(system_proteins)
#print(response)
chatgpt_analysis = response.choices[0].text

print(chatgpt_analysis)







The following is a brief description of the genes and their functions.

AKAP11 is a gene that encodes a protein called A-kinase anchor protein 11. This protein is a member of the AKAP family, which is a group of proteins that bind to the regulatory subunit of protein kinase A (PKA) and anchor it to the cytoskeleton. This protein is expressed in the brain, and is thought to be involved in the regulation of PKA activity.

ANAPC1 is a gene that encodes a protein called Anaphase-promoting complex subunit 1. This protein is a subunit of the anaphase-promoting complex (APC), which is a complex that targets proteins for degradation. This protein is thought to be involved in the regulation of the cell cycle.

ANKRD11 is a gene that encodes a protein called ankyrin repeat domain 11. This protein is a member of the ankyrin repeat protein family, which is a group of proteins that contain ankyrin repeats. This protein is thought to be involved in the regulation of the cell cycle.

ANKRD31 is a g

### Shared Features 
analyze the information to find features shared between n or more SPs
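
One way to sketch this step, assuming the gathered information has already been reduced to a mapping from each protein to a set of feature labels (GO terms, diseases, and so on). The function name, the input shape, and the "term A/B/C" labels are assumptions for illustration only.

```python
from collections import defaultdict

def shared_features(protein_features, n=2):
    """Return the features annotated to at least n proteins.

    protein_features: dict mapping protein symbol -> set of feature labels.
    Result: dict mapping feature label -> sorted list of proteins sharing it.
    """
    members = defaultdict(set)
    for protein, features in protein_features.items():
        for feature in features:
            members[feature].add(protein)
    return {
        feature: sorted(proteins)
        for feature, proteins in members.items()
        if len(proteins) >= n
    }

# Placeholder annotations, not real data.
example_features = {
    "PLXNA2": {"term A", "term B"},
    "PLXNA4": {"term A"},
    "MYO5A": {"term C"},
}
shared = shared_features(example_features, n=2)
```

The same function works for any annotation type, since it only compares labels.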


### Summarize and select keywords
select subsets of the gathered information and query ChatGPT to summarize and extract key concepts

GPT prompts:
 - I am analyzing a system of proteins: *SPs*
 - *DOI proteins* are known to be associated with <DOI>
 - The following table lists disease associations that are shared by two or more of the proteins, which may be involved in a shared biological process or mechanism
 - These sets of proteins share a disease association
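
The prompt fragments above could be assembled along these lines. This is a sketch under assumptions: `build_summary_prompt`, its parameters, and the "example disease" input are hypothetical, and the wording simply follows the fragments listed above.

```python
def build_summary_prompt(system_proteins, doi, doi_proteins, shared_diseases):
    """Assemble a prompt from the fragments listed above.

    shared_diseases: dict mapping disease name -> list of proteins sharing it.
    """
    lines = [f"I am analyzing a system of proteins: {' '.join(system_proteins)}"]
    if doi_proteins:
        lines.append(
            f"{' '.join(doi_proteins)} are known to be associated with {doi}"
        )
    if shared_diseases:
        lines.append("These sets of proteins share a disease association:")
        # Sort by disease name so the prompt is reproducible between runs.
        for disease, proteins in sorted(shared_diseases.items()):
            lines.append(f" - {disease}: {', '.join(proteins)}")
    return "\n".join(lines)

# Placeholder inputs, not real associations.
example_prompt = build_summary_prompt(
    ["PLXNA2", "PLXNA4", "MYO5A"],
    "example disease",
    ["PLXNA2"],
    {"example disease": ["PLXNA2", "PLXNA4"]},
)
```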
    

### Summarize system

GPT prompts:
 - 

### Query Literature
use the key concepts to make literature queries
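
A rough sketch of building a literature query from the DOI plus a few SP names, expanded with aliases, together with links that re-run the query. The function name, the choice to AND the proteins together, and the alias shown are assumptions; the URLs are the public PubMed and Google search pages.

```python
from urllib.parse import urlencode

def literature_query_links(doi, proteins, aliases=None):
    """Build a boolean query and re-runnable PubMed and Google links.

    aliases: optional dict mapping protein symbol -> list of alternative names.
    """
    aliases = aliases or {}
    groups = []
    for protein in proteins:
        # Each protein matches under its symbol or any of its aliases.
        names = [protein] + list(aliases.get(protein, []))
        groups.append("(" + " OR ".join(names) + ")")
    query = f'"{doi}" AND ' + " AND ".join(groups)
    return {
        "query": query,
        "pubmed": "https://pubmed.ncbi.nlm.nih.gov/?" + urlencode({"term": query}),
        "google": "https://www.google.com/search?" + urlencode({"q": query}),
    }

# Placeholder disease and alias, for illustration only.
links = literature_query_links(
    "example disease",
    ["PLXNA2", "PLXNA4"],
    aliases={"PLXNA2": ["PLXNA2_ALIAS"]},
)
```

Requiring every protein with AND keeps the result set small; switching the join to OR would broaden it, which may suit systems where few papers mention several SPs together.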

## Report generation
The report is generated using the Jinja2 Python templating library.
The following code was generated by ChatGPT 3.5.

TODO: Output the report to files. If possible, save as a PDF and/or a google doc page
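
For the file-output part of the TODO, a minimal sketch is given here. It only covers saving the rendered HTML string; PDF or Google Docs export is not addressed. The function name and the `<system_id>_report.html` file name are assumptions.

```python
from pathlib import Path

def save_html_report(html_report, system_id, out_dir="."):
    """Write the rendered HTML report to <out_dir>/<system_id>_report.html."""
    out_path = Path(out_dir) / f"{system_id}_report.html"
    # Create the output directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(html_report, encoding="utf-8")
    return out_path
```

A call such as `save_html_report(html_report, cluster_name)` would then write the string returned by `generate_html_report` to the current directory.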

In [31]:
from jinja2 import Template

def generate_html_report(cluster_name, summary, gprofiler_results, string_results, chatgpt_analysis):
    template_string = '''
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>{{ cluster_name }} Cluster Report</title>
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css">
        <script src="https://code.jquery.com/jquery-3.3.1.min.js"></script>
        <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js"></script>
    </head>
    <body>
        <div class="container">
            <h1>{{ cluster_name }} Cluster Report</h1>
            <p>{{ summary }}</p>
            <h2>ChatGPT Analysis</h2>
            <p>{{ chatgpt_analysis }}</p>
            <h2>g:Profiler Functional Enrichment Results</h2>
            <table class="table table-striped">
                <thead>
                    <tr>
                        <th>Term Name</th>
                        <th>Description</th>
                        <th>Source</th>
                        <th>p-value</th>
                    </tr>
                </thead>
                <tbody>
                    {% for result in gprofiler_results %}
                    <tr>
                        <td>{{ result.name }}</td>
                        <td>{{ result.description }}</td>
                        <td>{{ result.source }}</td>
                        <td>{{ result.p_value }}</td>
                    </tr>
                    {% endfor %}
                </tbody>
            </table>
            <h2>STRING Interaction Network</h2>
            <table class="table table-striped">
                <thead>
                    <tr>
                        <th>Source</th>
                        <th>Target</th>
                    </tr>
                </thead>
                <tbody>
                    {% for edge in string_results.edges %}
                    <tr>
                        <td>{{ edge.source }}</td>
                        <td>{{ edge.target }}</td>
                    </tr>
                    {% endfor %}
                </tbody>
            </table>
        </div>
    </body>
    </html>
    '''
    template = Template(template_string)
    html_report = template.render(cluster_name=cluster_name, summary=summary, chatgpt_analysis=chatgpt_analysis, gprofiler_results=gprofiler_results, string_results=string_results)
    return html_report




In [1]:
from IPython.display import HTML
# chatgpt_analysis is already set in the ChatGPT cell; response_text was never defined
html_report = generate_html_report(cluster_name, summary, gprofiler_results, string_results, chatgpt_analysis)
#HTML(html_report)
print("done")