## System Report
#### Goal: 
 - Assign a candidate name for the system
 - Explain the assignment
 - Provide information about the system and its proteins
 - Provide results of analyses
 
#### Report Structure:
 - the system ID
 - The assigned name
 - A brief summary of the system
 - Alternative names
 - An explanation of the name
 - Supporting information
     - system level
         - summary of shared terms
         - summary of shared diseases
     - per gene from uniprot and spoke
         - gene name
         - gene summary
         - go BP
         - go CC
         - Disease association
     - ChatGPT
         - important concepts
         - search suggestions
             - google
             - pubmed
         - shared disease symptoms
         

### Design
 - set the system proteins (SP)
 - set the disease of interest (DOI)
 - annotate SPs with associations with the DOI
 - annotate SPs with any other relevant data 
 - get the system similarity network
 - get the source networks, e.g. "BioPlex 3" or an AP-MS experiment
 
 #### Sections
 Each section will be written out as both a JSON document and an HTML file
 - section: get the analysis results of the input interactome
     - SPs covered
     - interactome modularity analysis for the SPs
     - interactome modularity analysis for the DOIs
     - membership in the DOI
     - significance in relevant data sets   
 - **section: SP information**
     - GO annotations
     - Disease associations
     - UniProt description
     - aliases
 - **section: perform enrichment analyses**
     - adjusting for the SPs covered in the interactome and enrichment sources 
     - identify which SPs are not covered in the enrichment sources
     - a link to re-run the query
 - section: gather information on selected interactions
     - (selected where full nxn coverage is impractical)
 - section: gather summaries from analyses of child systems
 - **section: analyze the information to find features shared between n or more SPs**
 - **section: select subsets of the gathered information and query ChatGPT to summarize and extract key concepts**
 - **section: use the key concepts to make literature queries**
     - such as DOI + a few SP names
         - expand with aliases
     - evaluate and summarize the query results with ChatGPT queries
     - each query result is presented with a link to re-run the query
 - section: query ChatGPT to perform higher level summarization, including candidate system names
 - **propose names**
     - merge ChatGPT names with enrichment query names 
     - annotate the candidate names with coverage metrics: how many proteins support the name, which ones
     - sort the candidate names by coverage
 - **create a top-level page**
     - proposed names
     - a system summary
     - support for each name
     - outline of the sections with links

### Folder structure

In [1]:
import os
# Get the user's home directory
home_dir = os.path.expanduser("~")

# Create the path for the "models" folder in the home directory
models_path = os.path.join(home_dir, "models")

print(models_path)

model_name = "nesa"
version = "1"
model_path = os.path.join(models_path, model_name, version)
print(model_path)

/Users/depratt/models
/Users/depratt/models/nesa/1


### Load the model

In [2]:
import json
cx2_filename = "hidef_50_0.75_5_leiden_pruned.edges.cx2"
temp_model_path = os.path.join(home_dir, model_name, cx2_filename)
with open(temp_model_path, encoding='utf-8') as f:
    data = f.read()
    model = json.loads(data)


### Select the system


In [85]:
import os
import json

'''
Cluster2-157	6
Cluster6-30	9
Cluster5-35	51
Cluster4-40	58
Cluster3-32	120
Cluster2-11	394
Cluster5-1	461
Cluster4-3	700
Cluster4-1	931
Cluster3-2	1705
Cluster2-1	2905

'''

def find_first_dict_with_key(dicts, key_name):
    """
    Find the first dictionary in a list of dictionaries that contains a specific key.

    :param dicts: A list of dictionaries.
    :param key_name: The key to search for.
    :return: The first dictionary that contains the key, or None if not found.
    """
    for dictionary in dicts:
        if key_name in dictionary:
            return dictionary
    return None


def find_system_in_systems(systems, name):
    for system in systems:
        values = system["v"]
        if name == values.get("name"):
            return system
    return None

        
def read_system_file(system_name, extension=None, suffix="json"):
    if extension is not None:
        filename = os.path.join(model_path, f'{system_name}_{extension}.{suffix}')
    else:
        filename = os.path.join(model_path, f'{system_name}.{suffix}')
    with open(filename, encoding='utf-8') as f:
        print(f'reading {filename}')
        data = f.read()
        model = json.loads(data)
        return model
    
    
def write_system_file(system_name, data, extension=None, suffix="json"):  
    if extension is not None:
        filename = os.path.join(model_path, f'{system_name}_{extension}.{suffix}')
    else:
        filename = os.path.join(model_path, f'{system_name}.{suffix}')
    with open(filename, "w") as f:
        print(f'writing {filename}')
        json.dump(data, f, indent=4)
        
        
def get_root_path():
    home_dir = os.path.expanduser("~")
    # Get the root path from the environment variable
    root_path = os.path.join(home_dir, os.getenv("MODEL_ANNOTATION_ROOT"))

    # Check if the environment variable is set and the path exists
    if not root_path or not os.path.isdir(root_path):
        raise ValueError("MODEL_ANNOTATION_ROOT environment variable is not set or does not exist")

    return root_path


def read_system_json(model, version, system_name, extension, root_path):
    # Check if the file exists
    file_path = os.path.join(root_path, model, version, f"{system_name}_{extension}.json")
    if not os.path.isfile(file_path):
        return None

    # Read the JSON data from the file
    with open(file_path, "r") as f:
        json_data = json.load(f)

    return json_data


def write_system_json(json_data, model, version, system_name, extension, root_path):
    # Create the folder for the system if it does not already exist
    folder_path = os.path.join(root_path, model, version)
    os.makedirs(folder_path, exist_ok=True)

    # Write the JSON data to a file
    file_path = os.path.join(folder_path, f"{system_name}_{extension}.json")
    with open(file_path, "w") as f:
        json.dump(json_data, f, indent=4)


def write_system_page(html_content, model, version, system_name, extension, root_path):
    # Create the folder for the system if it does not already exist
    folder_path = os.path.join(root_path, model, version)
    os.makedirs(folder_path, exist_ok=True)

    # Write the HTML content to a file
    file_name = f"{system_name}_{extension}.html"
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, "w") as f:
        f.write(html_content)

    # Update the urls.json file
    urls_file_path = os.path.join(root_path, "urls.json")
    urls_data = []

    if os.path.isfile(urls_file_path) is True:
        with open(urls_file_path, "r") as f:
            urls_data = json.load(f)

    # Find the model and version in the urls data
    model_data = next((x for x in urls_data if x["name"] == model), None)
    if not model_data:
        model_data = {"name": model, "versions": []}
        urls_data.append(model_data)

    version_data = next((x for x in model_data["versions"] if x["name"] == version), None)
    if not version_data:
        version_data = {"name": version, "files": []}
        model_data["versions"].append(version_data)

    # Add the new system file to the version data
    file_url = f"{model}/{version}/{file_name}"
    version_data["files"].append({"name": system_name, "url": file_url})

    # Write the updated urls data to the file
    with open(urls_file_path, "w") as f:
        json.dump(urls_data, f, indent=4)
        

def write_model_page(model_name, version, root_path):
    """
    Write an index.html file to the same directory as the urls.json file.
    The index.html page should display links to the documents in a hierarchical format.
    It should also have a title "<model name> Annotation Analysis".

    :param model_name: The name of the model.
    :param version: The version of the model.
    :param root_path: The root path of the model.
    """
    # Create the HTML for the index page
    html = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>{model_name} Annotation Analysis</title>
        <style>
            body {{
                font-family: Arial, sans-serif;
                background-color: #f5f5f5;
                color: #333333;
                margin: 0;
                padding: 0;
            }}

            h1 {{
                font-size: 2em;
                margin: 0 0 0.5em;
            }}

            ul {{
                list-style-type: none;
                margin: 0;
                padding: 0;
            }}

            li {{
                font-size: 1.2em;
                margin: 0;
            }}

            .container {{
                max-width: 800px;
                margin: 0 auto;
                padding: 2em;
                background-color: #ffffff;
            }}
        </style>
    </head>
    <body>
        <div class="container">
            <h1>{model_name} Annotation Analysis</h1>
            <ul>
                {get_file_links(model_name, version, root_path)}
            </ul>
        </div>
    </body>
    </html>
    """

    # Write the HTML to the index file
    index_file_path = os.path.join(root_path, model_name, version, f"{model_name}_v_{version}.html")
    with open(index_file_path, "w") as index_file:
        index_file.write(html)

'''
def get_file_links(model_name, version, root_path):
    """
    Get the HTML for the links to the files in the model directory.

    :param model_name: The name of the model.
    :param version: The version of the model.
    :param root_path: The root path of the model.
    :return: The HTML for the links to the files in the model directory.
    """
    file_links = ""
    model_path = os.path.join(root_path, model_name, version)

    # Get a list of all the files in the model directory
    for root, dirs, files in os.walk(model_path):
        # Don't include the root directory in the links
        if root == model_path:
            continue

        # Add a heading for the current directory
        file_links += f"<li><strong>{os.path.basename(root)}</strong></li>"

        # Add links for all the files in the current directory
        for file in files:
            file_path = os.path.join(root, file)
            file_url = os.path.relpath(file_path, model_path)
            file_links += f"<li><a href='{file_url}'>{file}</a></li>"

    return file_links
'''

def get_file_links(model_name, version, root_path):
    """
    Get the HTML for the links to the files in the model directory.

    :param model_name: The name of the model.
    :param version: The version of the model.
    :param root_path: The root path of the model.
    :return: The HTML for the links to the files in the model directory.
    """
    file_links = ""
    model_path = os.path.join(root_path, model_name, version)

    # Get a sorted list of all the files in the model directory (excluding hidden files)
    file_list = sorted(f for f in os.listdir(model_path) if not f.startswith('.') and os.path.isfile(os.path.join(model_path, f)))

    # Group the files by directory
    files_by_dir = {}
    for file in file_list:
        file_path = os.path.join(model_path, file)
        dir_path = os.path.dirname(os.path.relpath(file_path, model_path))
        if dir_path not in files_by_dir:
            files_by_dir[dir_path] = []
        files_by_dir[dir_path].append(file)

    # Generate the links for each directory and file
    for dir_path, files in files_by_dir.items():
        # Skip directories that don't have any visible files
        if not files:
            continue

        # Add a heading for the current directory
        file_links += f"<li><strong>{os.path.basename(dir_path)}</strong></li>"

        # Add links for all the files in the current directory
        for file in sorted(files):
            file_path = os.path.join(model_path, dir_path, file)
            file_url = os.path.relpath(file_path, model_path)
            file_links += f"<li><a href='{file_url}'>{file}</a></li>"

    return file_links


        
        
def write_system_tsv(tsv_data, model, version, system_name, extension, root_path):
    # Create the folder for the system if it does not already exist
    folder_path = os.path.join(root_path, model, version)
    os.makedirs(folder_path, exist_ok=True)

    # Write the TSV data to a file
    file_path = os.path.join(folder_path, f"{system_name}_{extension}.tsv")
    with open(file_path, "w") as f:
        f.write(tsv_data)

# root_path = os.getenv("MODEL_ANNOTATION_ROOT")
# print(f'model annotation root = {root_path}')

system_name = "Cluster4-40"
systems = find_first_dict_with_key(model, "nodes")["nodes"]
system = find_system_in_systems(systems, system_name).get("v")
write_system_json(system, model_name, version, system_name, "data", get_root_path())
#write_system_file(system_name, system)


### Load ASD Gene Data


In [97]:
import pandas as pd
import os

def dataframe_to_dict(df):
    """
    Convert a pandas DataFrame into a dictionary indexed by the first column.

    :param df: The pandas DataFrame to convert.
    :return: A dictionary indexed by the first column.
    """
    # Set the index to be the first column
    df = df.set_index(df.columns[0])

    # Convert the DataFrame to a dictionary
    result_dict = df.to_dict(orient='index')

    return result_dict

def make_gene_candidacy_text(gene_data, selected_genes):
    attribute_descriptions = {
        'hasHighConfidenceMut': "Genes with high confidence mutation in ASD-diagnosed individuals:",
        'in_WES_2020': "ASD-risk genes identified in Satterstrom et al., 2020:",
        'in_WES_2022': "ASD-risk genes identified in Fu et al., 2022:",
        'connectedToASDPPI': "Proteins connected to ASD-risk proteins (AP-MS experiment):",
        'in_SFARI_cat_2_3': "ASD-risk in SFARI categories 2 and 3:"
    }
    attributes = {key: [] for key in attribute_descriptions.keys()}

    for gene, attributes_data in gene_data.items():
        if gene not in selected_genes:
            continue
        for attribute in attributes.keys():
            if attributes_data[attribute] == 1:
                attributes[attribute].append(gene)

    text_output = ''
    for attribute, genes in attributes.items():
        if len(genes) != 0:
            gene_list = ', '.join(genes)
            text_output += f"{attribute_descriptions[attribute]} {gene_list}\n"

    return text_output.strip()


# Get the path to the 'nesa' folder in your home directory
home_dir = os.path.expanduser("~")
nesa_folder_path = os.path.join(home_dir, "nesa")

# Set the file path for 'geneCandidacy_DF.xlsx' in the 'nesa' folder
file_path = os.path.join(nesa_folder_path, 'geneCandidacy_DF.xlsx')

# Load the first worksheet of the Excel file into a DataFrame
df = pd.read_excel(file_path, sheet_name=0)

# Convert the DataFrame to a dictionary indexed by the first column
gene_data = dataframe_to_dict(df)
system = read_system_json(model_name, version, system_name, "data", get_root_path())
gene_names = system.get("CD_MemberList").split(" ")
print(gene_names)
gene_candidacy_text = make_gene_candidacy_text(gene_data, gene_names)
gene_candidacy_text             


['ADAMTS13', 'ADGRV1', 'C16orf46', 'C2CD4A', 'C2CD4B', 'CACNG2', 'CARD16', 'CASP1', 'CC2D1B', 'CELSR1', 'CELSR2', 'CELSR3', 'CHRM5', 'CLBA1', 'CMA1', 'COL20A1', 'DCANP1', 'DCDC1', 'DCDC2B', 'DCHS1', 'DICER1', 'DTWD2', 'DTX4', 'FAM169BP', 'FAT1', 'FAT3', 'FAT4', 'FREM2', 'FZD3', 'HPSE2', 'IFNE', 'INSL4', 'LAG3', 'LINC01588', 'LINC02694', 'LIX1L', 'NUP210P1', 'NXPH2', 'PASK', 'PCDH20', 'PCDH7', 'PPIAL4E', 'PPIAL4G', 'PRSS37', 'RBPJL', 'RPS14P3', 'RPSAP52', 'SMTNL1', 'SOCS1', 'SPATA4', 'SSX6P', 'TAFA5', 'UCN3', 'XAGE1A', 'XAGE1B', 'XCL1', 'XYLB', 'ZBBX']


'Genes with high confidence mutation in ASD-diagnosed individuals: DCDC2B, DICER1, FAT3\nASD-risk in SFARI categories 2 and 3: FAT1'

In [87]:
import requests
def get_hugo_data(system_name):
    root_path = get_root_path()
    print(f'reading system info from root_path = {root_path}')
    system = read_system_json(model_name, version, system_name, "data", root_path)
    gene_names = system.get("CD_MemberList").split(" ")
    #gene_names = [gene_names[0]]
    print(gene_names)
    hugo_data = {}

    for gene in gene_names:
        print(f'getting Hugo data for {gene}')
        url = "https://rest.genenames.org/"
        headers = {
            'Content-Type': 'application/json',
            'Accept': 'application/json'}
        response = requests.get(f'https://rest.genenames.org/fetch/symbol/{gene}',
                                headers=headers)
        #response = requests.get(url, params=params)
        data = json.loads(response.content)["response"]["docs"][0]
        #print(data)
        hugo_data[gene] = data
    return hugo_data

hugo_data = get_hugo_data(system_name)
#write_system_file(system_name, hugo_data, "hugo")
write_system_json(hugo_data, model_name, version, system_name, "hugo", get_root_path())
hugo_data
        

reading system info from root_path = /Users/depratt/Dropbox (Personal)/GitHub/model_annotation
['ADAMTS13', 'ADGRV1', 'C16orf46', 'C2CD4A', 'C2CD4B', 'CACNG2', 'CARD16', 'CASP1', 'CC2D1B', 'CELSR1', 'CELSR2', 'CELSR3', 'CHRM5', 'CLBA1', 'CMA1', 'COL20A1', 'DCANP1', 'DCDC1', 'DCDC2B', 'DCHS1', 'DICER1', 'DTWD2', 'DTX4', 'FAM169BP', 'FAT1', 'FAT3', 'FAT4', 'FREM2', 'FZD3', 'HPSE2', 'IFNE', 'INSL4', 'LAG3', 'LINC01588', 'LINC02694', 'LIX1L', 'NUP210P1', 'NXPH2', 'PASK', 'PCDH20', 'PCDH7', 'PPIAL4E', 'PPIAL4G', 'PRSS37', 'RBPJL', 'RPS14P3', 'RPSAP52', 'SMTNL1', 'SOCS1', 'SPATA4', 'SSX6P', 'TAFA5', 'UCN3', 'XAGE1A', 'XAGE1B', 'XCL1', 'XYLB', 'ZBBX']
getting Hugo data for ADAMTS13
getting Hugo data for ADGRV1
getting Hugo data for C16orf46
getting Hugo data for C2CD4A
getting Hugo data for C2CD4B
getting Hugo data for CACNG2
getting Hugo data for CARD16
getting Hugo data for CASP1
getting Hugo data for CC2D1B
getting Hugo data for CELSR1
getting Hugo data for CELSR2
getting Hugo data for CEL

{'ADAMTS13': {'hgnc_id': 'HGNC:1366',
  'symbol': 'ADAMTS13',
  'name': 'ADAM metallopeptidase with thrombospondin type 1 motif 13',
  'status': 'Approved',
  'locus_type': 'gene with protein product',
  'prev_symbol': ['C9orf8'],
  'prev_name': ['a disintegrin-like and metalloprotease (reprolysin type) with thrombospondin type 1 motif, 13'],
  'alias_symbol': ['VWFCP',
   'TTP',
   'vWF-CP',
   'FLJ42993',
   'MGC118899',
   'MGC118900',
   'DKFZp434C2322'],
  'location': '9q34.2',
  'date_approved_reserved': '1999-08-23T00:00:00Z',
  'date_modified': '2023-03-15T00:00:00Z',
  'date_name_changed': '2015-11-09T00:00:00Z',
  'ena': ['AJ011374'],
  'entrez_id': '11093',
  'mgd_id': ['MGI:2685556'],
  'iuphar': 'objectId:1685',
  'merops': 'M12.241',
  'orphanet': 117776,
  'pubmed_id': [11557746, 11535495],
  'refseq_accession': ['NM_139025'],
  'gene_group': ['ADAM metallopeptidases with thrombospondin type 1 motif'],
  'date_symbol_changed': '2001-09-21T00:00:00Z',
  'vega_id': 'OTTHUM

### Get Uniprot Data
Write a python program to gather a protein's function, pathway, disease association, aliases, and summary description data from the uniprot database using its REST api

Update the function to take a set of human gene names from a file "system_system_id", find the primary uniprot id for each name, get data for that uniprot id, merge the results for each gene into a data structure, and then write that structure to a json file named "system_system_id_analysis"

In [93]:


def query_uniprot_by_id(uniprot_id):
    """
    Query UniProt for data about a protein given its gene symbol.

    :param gene_symbol: The gene symbol to search for.
    :return: The response text from UniProt or None if not found.
    """
    url = f"https://www.uniprot.org/uniprot/{uniprot_id}.json"
    #url = "https://www.uniprot.org/uniprot/"

    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'}
    print(f'querying uniprot id {uniprot_id}')
    response = requests.get(url, 
                            headers=headers)
    # print(response.text)
    if response.status_code == 200:
        return json.loads(response.text)
    else:
        return None
    
def filter_uniprot_response(uniprot_json):
    """
    Filter a UniProt JSON response to include specific fields.

    :param uniprot_json: A parsed JSON response from UniProt.
    :return: A dictionary containing the filtered data.
    """
    filtered_data = {
        "UniProtKB_ID": uniprot_json["uniProtkbId"],
        "Description": uniprot_json["proteinDescription"]["recommendedName"]["fullName"]["value"],
        "GO": [],
        "Location": [],
        "Disease": [],
        "Disease_description": [],
        "Complexes": []
    }

    for comment in uniprot_json.get("comments", []):
        
        comment_type = comment.get("commentType")
        # print(f'comment of type: {comment_type}')
        if comment_type is not None:       
            if comment_type == "SUBCELLULAR LOCATION":
                for loc in comment.get("subcellularLocations"):
                    location = loc.get("location")
                    # print(location)
                    values = location.get("value").split(",")
                    for value in values:
                        cleaned_value = value.strip().lower()
                        # print(cleaned_value)
                        filtered_data["Location"].append(cleaned_value)
            if comment_type == "DISEASE":
                #print(comment)
                disease = comment.get("disease")
                #print(disease)
                if disease is not None:
                    disease_id = disease.get("diseaseId")
                    disease_name = disease_id.split(",")[0]
                    description = disease.get("description")
                    filtered_data["Disease"].append(disease_name)
                    filtered_data["Disease_description"].append(f'{disease_name}: {description}')
                        
            if comment_type == "SUBUNIT":
                    for text in comment.get("texts"):
                        filtered_data["Complexes"].append(text.get("value"))
    for keyword in uniprot_json.get("keywords"):
        if keyword.get("category") == "Cellular component":
            name = keyword.get("name")
            cleaned_value = name.strip().lower()
            # print(cleaned_value)
            filtered_data["Location"].append(cleaned_value)
        if keyword.get("category") == "Disease":
            filtered_data["Disease"].append(keyword.get("name"))
    
    for db_ref in uniprot_json.get("uniProtKBCrossReferences", []):
        if db_ref["database"] == "GO":
                for property in db_ref["properties"]:
                    if property["key"] == "GoTerm":
                        if property["value"].split("0")[0] != "M":
                            filtered_data["GO"].append(property["value"].split(":")[1])
                        
    filtered_data["Location"] = list(set(filtered_data["Location"]))

    return filtered_data


def get_uniprot_data_for_system(system_name):
    system = read_system_json(model_name, version, system_name, "data", get_root_path())
    gene_names = system.get("CD_MemberList").split(" ")
    hugo_data = read_system_json(model_name, version, system_name, "hugo", get_root_path())
    print(hugo_data)
    analysis_data = {}
    # gene_names = [gene_names[0], gene_names[1]]
    print(f'gene names: {gene_names}')

    for gene_name in gene_names:
        print(f'gene name = {gene_name}')
        hugo_gene = hugo_data[gene_name]
        uniprot_ids = hugo_gene.get("uniprot_ids")
        print(f'uniprot_ids = {uniprot_ids}')
        if uniprot_ids is not None:
            uniprot_id = uniprot_ids[0]
            uniprot_data = query_uniprot_by_id(uniprot_id)

            if uniprot_data:
                filtered_data = filter_uniprot_response(uniprot_data)
                analysis_data[gene_name] = filtered_data
        else:
            print(f'no uniprot id found for {gene_name}')

    #write_system_file(system_name, analysis_data, "uniprot")
    write_system_json(analysis_data, model_name, version, system_name, "uniprot", get_root_path())
    return analysis_data


uniprot_data = get_uniprot_data_for_system(system_name)

{'ADAMTS13': {'hgnc_id': 'HGNC:1366', 'symbol': 'ADAMTS13', 'name': 'ADAM metallopeptidase with thrombospondin type 1 motif 13', 'status': 'Approved', 'locus_type': 'gene with protein product', 'prev_symbol': ['C9orf8'], 'prev_name': ['a disintegrin-like and metalloprotease (reprolysin type) with thrombospondin type 1 motif, 13'], 'alias_symbol': ['VWFCP', 'TTP', 'vWF-CP', 'FLJ42993', 'MGC118899', 'MGC118900', 'DKFZp434C2322'], 'location': '9q34.2', 'date_approved_reserved': '1999-08-23T00:00:00Z', 'date_modified': '2023-03-15T00:00:00Z', 'date_name_changed': '2015-11-09T00:00:00Z', 'ena': ['AJ011374'], 'entrez_id': '11093', 'mgd_id': ['MGI:2685556'], 'iuphar': 'objectId:1685', 'merops': 'M12.241', 'orphanet': 117776, 'pubmed_id': [11557746, 11535495], 'refseq_accession': ['NM_139025'], 'gene_group': ['ADAM metallopeptidases with thrombospondin type 1 motif'], 'date_symbol_changed': '2001-09-21T00:00:00Z', 'vega_id': 'OTTHUMG00000020876', 'lsdb': ['LRG_544|http://ftp.ebi.ac.uk/pub/data

gene name = ADGRV1
uniprot_ids = ['Q8WXG9']
querying uniprot id Q8WXG9
gene name = C16orf46
uniprot_ids = ['Q6P387']
querying uniprot id Q6P387
gene name = C2CD4A
uniprot_ids = ['Q8NCU7']
querying uniprot id Q8NCU7
gene name = C2CD4B
uniprot_ids = ['A6NLJ0']
querying uniprot id A6NLJ0
gene name = CACNG2
uniprot_ids = ['Q9Y698']
querying uniprot id Q9Y698
gene name = CARD16
uniprot_ids = ['Q5EG05']
querying uniprot id Q5EG05
gene name = CASP1
uniprot_ids = ['P29466']
querying uniprot id P29466
gene name = CC2D1B
uniprot_ids = ['Q5T0F9']
querying uniprot id Q5T0F9
gene name = CELSR1
uniprot_ids = ['Q9NYQ6']
querying uniprot id Q9NYQ6
gene name = CELSR2
uniprot_ids = ['Q9HCU4']
querying uniprot id Q9HCU4
gene name = CELSR3
uniprot_ids = ['Q9NYQ7']
querying uniprot id Q9NYQ7
gene name = CHRM5
uniprot_ids = ['P08912']
querying uniprot id P08912
gene name = CLBA1
uniprot_ids = ['Q96F83']
querying uniprot id Q96F83
gene name = CMA1
uniprot_ids = ['P23946']
querying uniprot id P23946
gene name

### Summarized Features 
analyze the information to find features shared between n or more SPs


In [94]:
import pandas as pd
from io import StringIO

def summarize_features(data):
    """
    Take a data structure and output a new list of dictionaries 
    summarizing specific features.

    :param data: The input data structure.
    :return: A list of dictionaries summarizing the features.
    """
    summarized_data = []
    target_sections = ["GO", "Location", "Disease"]

    for gene, gene_data in data.items():
        for feature, values in gene_data.items():
            if feature in target_sections and isinstance(values, list):
                for value in values:
                    existing_summary = next((item for item in summarized_data if value in item), None)
                    if existing_summary:
                        if gene not in existing_summary[value]["genes"]:
                            existing_summary[value]["number_of_genes"] += 1
                            existing_summary[value]["genes"].append(gene)
                    else:
                        new_summary = {
                            value: {
                                "number_of_genes": 1,
                                "genes": [gene]
                            }
                        }
                        summarized_data.append(new_summary)

    return summarized_data

def summarized_data_to_tsv(summarized_data):
    """
    Take the summarized data, sort it by the number of genes for each feature, and output it as tab-delimited text.

    :param summarized_data: A list of dictionaries containing summarized data.
    :return: A string containing the tab-delimited text.
    """
    # Flatten the summarized data
    flattened_data = []
    for feature_dict in summarized_data:
        for feature, feature_data in feature_dict.items():
            flattened_data.append({
                "feature": feature,
                "number_of_genes": feature_data["number_of_genes"],
                "genes": ", ".join(feature_data["genes"])
            })

    # Sort the flattened data by the number of genes
    sorted_data = sorted(flattened_data, key=lambda x: x["number_of_genes"], reverse=True)

    # Convert the sorted data to tab-delimited text
    tsv_data = "Feature\tNumber of Genes\tGenes\n"
    for item in sorted_data:
        tsv_data += f"{item['feature']}\t{item['number_of_genes']}\t{item['genes']}\n"

    return tsv_data

def get_system_uniprot_data(system, root_path):
    # Load the UniProt data from the uniprot_data.json file
    uniprot_data = read_system_json(model_name, version, system_name, "uniprot", get_root_path())
    
    

    # Extract the gene names from the system
    gene_names = system.get("CD_MemberList").split(" ")
    
    print(gene_names)

    # Create a dictionary of the UniProt data for each gene in the system
    system_uniprot_data = {}
    for gene_name in gene_names:
        
        if gene_name in uniprot_data:
            system_uniprot_data[gene_name] = uniprot_data[gene_name]

    return system_uniprot_data

u_data = get_system_uniprot_data(system, root_path)
summarized_data = summarize_features(u_data)
tsv_data = summarized_data_to_tsv(summarized_data)
write_system_tsv(tsv_data, model_name, version, system_name, "uniprot_summary", get_root_path())


tsv_file = StringIO(tsv_data)
df = pd.read_csv(tsv_file, sep='\t')

df


['ADAMTS13', 'ADGRV1', 'C16orf46', 'C2CD4A', 'C2CD4B', 'CACNG2', 'CARD16', 'CASP1', 'CC2D1B', 'CELSR1', 'CELSR2', 'CELSR3', 'CHRM5', 'CLBA1', 'CMA1', 'COL20A1', 'DCANP1', 'DCDC1', 'DCDC2B', 'DCHS1', 'DICER1', 'DTWD2', 'DTX4', 'FAM169BP', 'FAT1', 'FAT3', 'FAT4', 'FREM2', 'FZD3', 'HPSE2', 'IFNE', 'INSL4', 'LAG3', 'LINC01588', 'LINC02694', 'LIX1L', 'NUP210P1', 'NXPH2', 'PASK', 'PCDH20', 'PCDH7', 'PPIAL4E', 'PPIAL4G', 'PRSS37', 'RBPJL', 'RPS14P3', 'RPSAP52', 'SMTNL1', 'SOCS1', 'SPATA4', 'SSX6P', 'TAFA5', 'UCN3', 'XAGE1A', 'XAGE1B', 'XCL1', 'XYLB', 'ZBBX']


Unnamed: 0,Feature,Number of Genes,Genes
0,cytoplasm,17,"ADGRV1, CASP1, CELSR2, DCANP1, DCDC1, DICER1, ..."
1,membrane,16,"ADGRV1, CACNG2, CASP1, CELSR1, CELSR2, CELSR3,..."
2,plasma membrane,16,"ADGRV1, CACNG2, CASP1, CELSR1, CELSR2, CELSR3,..."
3,nucleus,14,"C2CD4A, C2CD4B, CC2D1B, DCANP1, DICER1, DTWD2,..."
4,cell membrane,13,"ADGRV1, CASP1, CELSR1, CELSR2, CELSR3, CHRM5, ..."
...,...,...,...
559,glucuronate catabolic process to xylulose 5-ph...,1,XYLB
560,phosphorylation,1,XYLB
561,xylulose catabolic process,1,XYLB
562,xylulose metabolic process,1,XYLB


In [98]:
import pandas as pd

def create_chatGPT_prompt(protein_list, tsv_data, n_genes=2, gene_candidacy_text=''):
    """
    Create a ChatGPT prompt based on the given protein list and TSV data.

    :param protein_list: A list of protein names.
    :param tsv_data: A string containing TSV formatted summary data.
    :param n_genes: An integer representing the minimum number of genes for a feature to be included.
    :return: A string containing the ChatGPT prompt in HTML format.
    """
    # Read the TSV data into a DataFrame
    tsv_file = StringIO(tsv_data)
    df = pd.read_csv(tsv_file, sep='\t')

    # Filter the DataFrame based on the n_genes criterion
    #df['Number of Genes'] = pd.to_numeric(df['Number of Genes'], errors='coerce')
    #print(df[df['Number of Genes']])

    #df = df[df['Number of Genes'] >= n_genes]

    # Generate the ChatGPT prompt in HTML format
    prompt_text = f"Your response should be formatted as HTML paragraphs"
    prompt_text += f"The following is a system of interacting proteins."
    prompt_text += f' Write a critical analysis of this system, describing your reasoning as you go.'
    prompt_text += f'\nWhat mechanisms and biological processes are performed by this system?'
    prompt_text += f'\nWhat cellular components and complexes are involved in this system?'
    prompt_text += f'\nProteins: '
    prompt_text += ", ".join(protein_list) + ".\n\n"
    prompt_text += f"\nA critical goal of the analysis is to determine what, if any, relationship this system has to ASD (Autism Spectrum Disorder)"
    prompt_text += f"\nHere are some ASD-related facts about these proteins"
    prompt_text += f"\n{gene_candidacy_text}"
    prompt_text += f'\n\nSystem features from a Uniprot analysis: \n'

    for index, row in df.iterrows():
        number_of_genes = 0
        if row['Number of Genes'] is not None:
            number_of_genes = int(row['Number of Genes'])
        if number_of_genes >= n_genes:
            prompt_text += f"{row['Feature']}: {row['Number of Genes']} proteins: {row['Genes']}\n"
        

    prompt = f"<div class='code-section'><button class='copy-prompt-button' onclick='copyPrompt()'>Copy Prompt</button>"
    prompt += f"<pre><code id='prompt-code'>{prompt_text}</code></pre></div>"
    prompt += "<script>function copyPrompt() {var copyText = document.getElementById('prompt-code').innerText; navigator.clipboard.writeText(copyText);}</script>"
    return prompt

def create_system_prompt_page(system_name, prompt):
    """
    Create an HTML page with a ChatGPT prompt for the specified system.

    :param system_name: The name of the system.
    :param prompt: The ChatGPT prompt for the system.
    :return: A string containing the HTML page.
    """
    # Create the HTML page with the specified title and prompt content
    html = f"<!DOCTYPE html>\n<html>\n<head>\n<title>{system_name} Summary ChatGPT Prompt</title>\n</head>\n<body>\n{prompt}\n</body>\n</html>"
    return html

def dataframe_to_html_table(df):
    table_html = "<table>\n"
    table_html += "<thead>\n<tr>\n"
    table_html += "".join([f"<th>{col}</th>" for col in df.columns])
    table_html += "</tr>\n</thead>\n<tbody>\n"
    for index, row in df.iterrows():
        table_html += "<tr>\n"
        table_html += "".join([f"<td>{val}</td>" for val in row.values])
        table_html += "</tr>\n"
    table_html += "</tbody>\n</table>"
    return table_html


def create_system_analysis_page(protein_list, tsv_data, n_genes=2):
    # Read the TSV data into a DataFrame
    tsv_file = StringIO(tsv_data)
    df = pd.read_csv(tsv_file, sep='\t')

    # Filter the DataFrame based on the n_genes criterion
    df = df[df['Number of Genes'] >= n_genes]
    
    uniprot_table = dataframe_to_html_table(df)

    # Create the ChatGPT analysis section with a placeholder for the analysis text
    chatgpt_analysis = "<h2>ChatGPT 4 Analysis</h2>\n<p>Paste ChatGPT analysis here:</p>\n<!-- Analysis goes here -->"

    # Create the HTML page with the system summary
    page_title = f"{system_name} Summary"
    html_page = f"<!DOCTYPE html>\n<html>\n<head>\n<title>{page_title}</title>\n</head>\n<body>\n<h1>{system_name} System Summary</h1>\n<h2>Proteins</h2>\n<p>{', '.join(protein_list)}</p>\n<h2>UniProt Data</h2>\n{uniprot_table}\n{chatgpt_analysis}\n</body>\n</html>"

    return html_page


def get_file_links(model_name, version, root_path):
    """
    Get the HTML for the links to the files in the model directory.

    :param model_name: The name of the model.
    :param version: The version of the model.
    :param root_path: The root path of the model.
    :return: The HTML for the links to the files in the model directory.
    """
    file_links = ""
    model_path = os.path.join(root_path, model_name, version)

    # Get a list of all the files in the model directory
    for root, dirs, files in os.walk(model_path):
        # Don't include the root directory in the links
        #if root == model_path:
         #   continue

        # Add a heading for the current directory
        file_links += f"<li><strong>{os.path.basename(root)}</strong></li>"

        # Add links for all the files in the current directory
        for file in files:
            file_path = os.path.join(root, file)
            file_url = os.path.relpath(file_path, model_path)
            file_links += f"<li><a href='{file_url}'>{file}</a></li>"

    return file_links

sys = read_system_json(model_name, version, system_name, "data", get_root_path())
protein_list = sys.get("CD_MemberList").split(" ")
prompt = create_chatGPT_prompt(protein_list, tsv_data, gene_candidacy_text=gene_candidacy_text)
prompt_page = create_system_prompt_page(system_name, prompt)
write_system_page(prompt_page, model_name, version, system_name, "chatgtp_prompt", get_root_path())
analysis_page = create_system_analysis_page(protein_list, tsv_data)
write_system_page(analysis_page, model_name, version, system_name, "analysis", get_root_path())

write_model_page(model_name, version, get_root_path())

In [None]:
# Next round of prompts
'''
Your response should be formatted as HTML paragraphs
The following is a system of interacting proteins. 
Write a critical analysis of this system, describing your reasoning as you go.
What mechanisms and biological processes are performed by this system?
What cellular components and complexes are involved in this system?
Do not recapitulate the list of proteins, do not simply restate the other information provided. 
'''

'''
Give me 5 candidate names for this system. 
The names should not include direct references to ASD. Do not create acronyms.
format your output as an HTML list

'''

'''

Genes with high confidence mutation in ASD-diagnosed individuals: DCDC2B, DICER1, FAT3
ASD-risk in SFARI categories 2 and 3: FAT1
Are any of the genes with high-confidence mutations in this system potentially novel ASD-risk genes? 
If a gene is included in one of the ASD-risk gene sets, such as SFARI, it is not novel. 
What other evidence supports each candidate? 
For example, is it associated with a disease that is co-morbid with ASD? 
Give specific, analytic reasons for each candidate. Be succinct and omit general caveats.
Format your output as HTML paragraphs

'''




### Uniprot Report

In [None]:
import pandas as pd

def format_uniprot_report(system_id, uniprot_data):
    df = pd.DataFrame(uniprot_data).transpose()
    html_table = df.to_html()

    html_content = f"""<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{system_id} UniProt Report</title>
    <style>
        body {{
            font-family: Arial, sans-serif;
            margin: 40px;
        }}
        table {{
            border-collapse: collapse;
            width: 100%;
        }}
        th, td {{
            border: 1px solid #dddddd;
            padding: 8px;
            text-align: left;
        }}
        th {{
            background-color: #f2f2f2;
        }}
    </style>
</head>
<body>
    <h1>{system_id} UniProt Report</h1>
    <table>
        {html_table}
    </table>
</body>
</html>
"""

    with open(f"{system_id}_uniprot_report.html", "w") as f:
        f.write(html_content)

    print(f"UniProt report saved to {system_id}_uniprot_report.html")


### Get Tissue Expression Data

In [None]:
import pandas as pd

# GTEx
def download_gtex_data():
    gtex_sample_url = "https://storage.googleapis.com/gtex_analysis_v8/sample_attributes/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt"
    gtex_expression_url = "https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz"
    
    gtex_sample_metadata = pd.read_csv(gtex_sample_url, sep='\t')
    gtex_expression_data = pd.read_csv(gtex_expression_url, sep='\t', skiprows=2, compression='gzip', nrows=100)  # Limit to 100 rows for demonstration purposes

    return gtex_sample_metadata, gtex_expression_data

def filter_gtex_brain_expression(gtex_sample_metadata, gtex_expression_data):
    brain_samples = gtex_sample_metadata[gtex_sample_metadata['SMTS'] == 'Brain']
    brain_sample_ids = set(brain_samples['SAMPID'])
    gtex_brain_expression = gtex_expression_data[['Name', 'Description'] + list(brain_sample_ids.intersection(gtex_expression_data.columns))]
    
    return gtex_brain_expression

# BrainSpan
def download_brainspan_data():
    brainspan_url = "http://www.brainspan.org/static/download.html"
    brainspan_expression_url = "http://www.brainspan.org/api/v2/well_known_file_download/267666525"
    
    brainspan_metadata = pd.read_html(brainspan_url)[0]
    brainspan_expression_data = pd.read_csv(brainspan_expression_url, sep='\t', nrows=100)  # Limit to 100 rows for demonstration purposes

    return brainspan_metadata, brainspan_expression_data

def filter_brainspan_brain_expression(brainspan_metadata, brainspan_expression_data):
    brain_regions = brainspan_metadata[brainspan_metadata['Column Type'] == 'brain']
    brain_column_ids = set(brain_regions['Column ID'])
    brainspan_brain_expression = brainspan_expression_data[['gene_id', 'ensembl_gene_id'] + list(brain_column_ids.intersection(brainspan_expression_data.columns))]
    
    return brainspan_brain_expression

def download_and_filter_brain_expression_data():
    # GTEx
    gtex_sample_metadata, gtex_expression_data = download_gtex_data()
    gtex_brain_expression = filter_gtex_brain_expression(gtex_sample_metadata, gtex_expression_data)
    
    # BrainSpan
    brainspan_metadata, brainspan_expression_data = download_brainspan_data()
    brainspan_brain_expression = filter_brainspan_brain_expression(brainspan_metadata, brainspan_expression_data)

    # Print samples of the data
    print("GTEx brain expression data (first 5 rows):")
    print(gtex_brain_expression.head())
    print("\nBrainSpan brain expression data (first 5 rows):")
    print(brainspan_brain_expression.head())
    return brainspan_brain_expression, gtex_brain_expression



In [None]:
brainspan_metadata, brainspan_expression_data = download_brainspan_data()

### Tissue Expression Report
Report on the BrainScan and GTEX data

In [None]:
def format_expression_report(system_id, gtex_data, brainspan_data):
    gtex_df = pd.DataFrame(gtex_data)
    brainspan_df = pd.DataFrame(brainspan_data)

    gtex_html_table = gtex_df.to_html()
    brainspan_html_table = brainspan_df.to_html()

    html_content = f"""<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{system_id} Expression Report</title>
    <style>
        body {{
            font-family: Arial, sans-serif;
            margin: 40px;
        }}
        table {{
            border-collapse: collapse;
            width: 100%;
        }}
        th, td {{
            border: 1px solid #dddddd;
            padding: 8px;
            text-align: left;
        }}
        th {{
            background-color: #f2f2f2;
        }}
    </style>
</head>
<body>
    <h1>{system_id} Expression Report</h1>
    <h2>GTEx Data</h2>
    <table>
        {gtex_html_table}
    </table>
    <h2>BrainSpan Data</h2>
    <table>
        {brainspan_html_table}
    </table>
</body>
</html>
"""

    with open(f"{system_id}_expression_report.html", "w") as f:
        f.write(html_content)

    print(f"Expression report saved to {system_id}_expression_report.html")


In [None]:
format_expression_report(system_id, gtex_brain_expression, brainspan_brain_expression)

### ChatGPT Summarize Analysis Parcel


In [None]:
import os
import openai
import json

openai.api_key = os.getenv("OPENAI_API_KEY")

def create_parcel(data):
    parcel = "\n".join(f"{key}: {value}" for key, value in data.items())
    return parcel

def summarize_parcel(parcel):
    prompt = f"Summarize this information about my set of proteins as brief text and a set of relevant keywords and key phrases:\n\n{parcel}"
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt,
        max_tokens=100,
        n=1,
        stop=None,
        temperature=0.5,
    )

    summary = response.choices[0].text.strip()
    return summary

def write_analysis_to_file(filename, analysis_data):
    with open(filename, "w") as f:
        json.dump(analysis_data, f, indent=4)



### Summarize and select keywords
select subsets of the gathered information and query ChatGPT to summarize and extract key concepts

GPT prompts:
 - I am analyzing a system of proteins: *SPs*
 - *DOI proteins* are known to be associated with <DOI>
 - The following table lists disease association that are shared by two or more of the proteins are involved in shared biological process or mechanism
 - These sets of proteins share a disease association
    

### Summarize the Analysis

**ChatGPT prompts: required some cleanup because they were originally generated as "main" functions.**

Write a function in which makes parcels of (1) the uniprot data file created by the uniprot data downloader (2) the GTEX and BrainSpan data file created by its downloader. The summarize each into a datastructure with a description and a list of keywords. Then format the summaries and keywords to provide to ChatGPT along with  "Review these summaries and keywords for my set of proteins. Critique them and synthesize the information into (1) candidate names for the set of proteins and (2) an outline of the reasoning behind the candidate names. Return this as a datastructure. ". Then append this to the datastructure with the parcels, include the list of gene names, and write out as <system id>analysis

In [None]:
import json

def read_data_from_file(filename):
    with open(filename, "r") as f:
        data = json.load(f)
    return data

def create_parcel_from_uniprot_data(uniprot_data):
    parcel = "\n".join(f"{key}: {value}" for key, value in uniprot_data.items() if key not in ['function', 'pathway', 'disease_association'])
    return parcel

def create_parcel_from_expression_data(expression_data):
    parcel = f"Gene expression data from GTEx and BrainSpan: {expression_data}"
    return parcel

def summarize_analyses(system_id:"001"):
    output_filename = f"system_{system_id}_analysis.json"

    uniprot_data_filename = f"system_{system_id}_analysis"
    gtex_brain_expression_filename = "gtex_brain_expression.json"
    brainspan_brain_expression_filename = "brainspan_brain_expression.json"

    uniprot_data = read_data_from_file(uniprot_data_filename)
    gtex_brain_expression = read_data_from_file(gtex_brain_expression_filename)
    brainspan_brain_expression = read_data_from_file(brainspan_brain_expression_filename)

    parcels = []
    for gene_name, gene_uniprot_data in uniprot_data.items():
        parcel = create_parcel_from_uniprot_data(gene_uniprot_data)
        summary = summarize_parcel(parcel)
        parcels.append({
            "gene_name": gene_name,
            "parcel": parcel,
            "summary": summary
        })

    expression_data = {
        "gtex": gtex_brain_expression,
        "brainspan": brainspan_brain_expression
    }
    parcel = create_parcel_from_expression_data(expression_data)
    summary = summarize_parcel(parcel)
    parcels.append({
        "parcel": parcel,
        "summary": summary
    })

    summaries_and_keywords = [parcel["summary"] for parcel in parcels]
    chatgpt_prompt = f"Review these summaries and keywords for my set of proteins:\n\n{summaries_and_keywords}\n\nCritique them and synthesize the information into (1) candidate names for the set of proteins and (2) an outline of the reasoning behind the candidate names."

    chatgpt_response = summarize_parcel(chatgpt_prompt)

    analysis_data = {
        "gene_names": list(uniprot_data.keys()),
        "parcels": parcels,
        "chatgpt_response": chatgpt_response
    }

    write_analysis_to_file(output_filename, analysis_data)
    print(f"Analysis data saved to {output_filename}")
    return analysis_data



### System Analysis Report

In [None]:
def format_system_analysis_report(system_id, analysis_data):
    gene_names = analysis_data['gene_names']
    parcels = analysis_data['parcels']
    chatgpt_response = analysis_data['chatgpt_response']

    parcels_html = ""
    for i, parcel in enumerate(parcels):
        parcels_html += f"<h3>Parcel {i+1}: {parcel['gene_name']}</h3>"
        parcels_html += f"<h4>Original Data</h4><pre>{parcel['parcel']}</pre>"
        parcels_html += f"<h4>Summary</h4><p>{parcel['summary']}</p>"

    html_content = f"""<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{system_id} System Analysis Report</title>
    <style>
        body {{
            font-family: Arial, sans-serif;
            margin: 40px;
        }}
        pre {{
            background-color: #f2f2f2;
            padding: 1em;
            white-space: pre-wrap;
        }}
    </style>
</head>
<body>
    <h1>{system_id} System Analysis Report</h1>
    <h2>Gene Names</h2>
    <p>{', '.join(gene_names)}</p>
    <h2>Parcels</h2>
    {parcels_html}
    <h2>ChatGPT Synthesis</h2>
    <p>{chatgpt_response}</p>
</body>
</html>
"""

    with open(f"{system_id}_system_analysis_report.html", "w") as f:
        f.write(html_content)

    print(f"System analysis report saved to {system_id}_system_analysis_report.html")


### Query INDRA, prioritizing more specific interactions

##### ChatGPT 4 prompts. The output needed a little cleanup and removing a first attempt at ranking statements, superseded by _sort_response_by_relationship_type_ 

Write a function that takes my set of proteins and queries INDRA for the statements for the interactions between each pair. For each pair, return up to 50 statements.

Give me a python list of the INDRA relationships you described ranked by an estimate how specific a relationship is.

Write a function, sort_response_by_relationship_type, that bins a list of Indra statements first by the order of the relationship (a relationship b vs b relationship a) and then by the relationship types in the list indra_relationships_ranked. The function should then make a list of bins ranked  from most specific to least specific. Finally, it should repeatedly loop through the bins, picking the statement with the highest confidence score and adding to a list ranked_statements until there are no statements remaining. If there are no statements remaining in a bin, skip the bin. Return the ranked statements. Also, comment the function.

In [143]:
import requests
from itertools import combinations
import indra

def query_indra(agent_a, agent_b, limit=50):
    base_url = "https://db.indra.bio/statements/from_agents"
    query = f'HasAgent({agent_a}) & HasAgent({agent_b})'
    indra.
    params = {
        "subject": agent_a,
        "object": agent_b,
        "offset": 0,
        "limit": limit,
        "format": "json",
    }

    response = requests.get(base_url, params=params)
    if response.status_code == 200:
        data = response.json()
        print(data["statements"])
        return data["statements"]
    else:
        print(f"Error {response.status_code}: {response.text}")
        return []

def get_indra_pairwise_interactions(proteins, limit=50):
    pairwise_interactions = {}
    for agent_a, agent_b in combinations(proteins, 2):
        print(f'Querying {agent_a} - {agent_b}')
        
        statements = 
        statements_a_b = query_indra(agent_a, agent_b)
        print(f'Querying {agent_b} -> {agent_a}')
        statements_b_a = query_indra(agent_b, agent_a)
        ranked_statements = rank_statements(statements)[:limit]
        pairwise_interactions[(agent_a, agent_b)] = [s[1] for s in statements]

    return pairwise_interactions

'''
You can refer to the INDRA documentation for a more comprehensive 
and up-to-date list: https://indra.readthedocs.io/en/latest/statements.html
'''
indra_relationships_ranked = [
    "Complex",
    "Binding",
    "Activation",
    "Inhibition",
    "IncreaseAmount",
    "DecreaseAmount",
    "Translocation",
    "Phosphorylation",
    "Dephosphorylation",
    "Ubiquitination",
    "Deubiquitination",
    "Acetylation",
    "Deacetylation",
    "Methylation",
    "Demethylation",
    "Glycosylation",
    "Deglycosylation",
    "Palmitoylation",
    "Depalmitoylation",
    "Myristoylation",
    "Demyristoylation",
    "Hydroxylation",
    "Dehydroxylation",
    "Sumoylation",
    "Desumoylation",
    "Autophosphorylation",
    "SelfModification",
    "ActiveForm",
]

def sort_response_by_relationship_type(statements, indra_relationships_ranked):
    """
    Sort a list of INDRA statements based on the order of the relationship and the relationship types in
    the provided indra_relationships_ranked list.

    :param statements: A list of INDRA statements to be sorted.
    :param indra_relationships_ranked: A list of INDRA relationship types ranked by specificity.
    :return: A list of ranked INDRA statements.
    """

    # Create bins for each relationship type
    bins = {relation: [] for relation in indra_relationships_ranked}

    # Add statements to the appropriate bin based on the relationship type
    for statement in statements:
        relation = statement["type"]
        if relation in bins:
            bins[relation].append(statement)

    # Sort each bin based on the confidence score
    for relation in bins:
        bins[relation].sort(key=lambda s: s["evidence"][0]["confidence"], reverse=True)

    # Create a list to store the ranked statements
    ranked_statements = []

    # Loop through the bins, picking the statement with the highest confidence score
    # and adding it to the ranked_statements list until there are no statements remaining
    while True:
        statements_added = 0
        for relation in indra_relationships_ranked:
            if bins[relation]:
                ranked_statements.append(bins[relation].pop(0))
                statements_added += 1
        if statements_added == 0:
            break

    return ranked_statements


In [144]:
# Test Query
#protein_list = system.get("CD_MemberList").split(" ")
protein_list = ["KCNMA1", "SCN1A", "SCN2A"]
pairwise_interactions = get_indra_pairwise_interactions(protein_list)
ranked_statements = sort_response_by_relationship_type(pairwise_interactions, 
                                                       indra_relationships_ranked)

for (agent_a, agent_b), statements in pairwise_interactions.items():
    print(f"Interactions between {agent_a} and {agent_b}:")
    for statement in ranked_statements:
        print(f"- {statement['type']} (Evidence: {statement['evidence'][0]['text']})")
    print()


Querying KCNMA1 -> SCN1A
{}
Querying SCN1A -> KCNMA1
{}


NameError: name 'rank_statements' is not defined

### Query Literature
use the key concepts to make literature queries

## report generation
The report is generated using the jinja2 python templating library
The following code was generated by ChatGPT 3.5

TODO: Output the report to files. If possible, save as a PDF and/or a google doc page

In [31]:
from jinja2 import Template

def generate_html_report(cluster_name, summary, gprofiler_results, string_results, chatgpt_analysis):
    template_string = '''
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>{{ cluster_name }} Cluster Report</title>
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css">
        <script src="https://code.jquery.com/jquery-3.3.1.min.js"></script>
        <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js"></script>
    </head>
    <body>
        <div class="container">
            <h1>{{ cluster_name }} Cluster Report</h1>
            <p>{{ summary }}</p>
            <h2>ChatGPT Analysis</h2>
            <p>{{ chatgpt_analysis }}</p>
            <h2>g:Profiler Functional Enrichment Results</h2>
            <table class="table table-striped">
                <thead>
                    <tr>
                        <th>Term Name</th>
                        <th>Description</th>
                        <th>Source</th>
                        <th>p-value</th>
                    </tr>
                </thead>
                <tbody>
                    {% for result in gprofiler_results %}
                    <tr>
                        <td>{{ result.name }}</td>
                        <td>{{ result.description }}</td>
                        <td>{{ result.source }}</td>
                        <td>{{ result.p_value }}</td>
                    </tr>
                    {% endfor %}
                </tbody>
            </table>
            <h2>STRING Interaction Network</h2>
            <table class="table table-striped">
                <thead>
                    <tr>
                        <th>Source</th>
                        <th>Target</th>
                    </tr>
                </thead>
                <tbody>
                    {% for edge in string_results.edges %}
                    <tr>
                        <td>{{ edge.source }}</td>
                        <td>{{ edge.target }}</td>
                    </tr>
                    {% endfor %}
                </tbody>
            </table>
        </div>
    </body>
    </html>
    '''
    template = Template(template_string)
    html_report = template.render(cluster_name=cluster_name, summary=summary, chatgpt_analysis=chatgpt_analysis, gprofiler_results=gprofiler_results, string_results=string_results)
    return html_report


