## Creating Datasets from the Dengue Hierarchy and Interactome
- The goal is to generate datasets in various formats based on the assemblies in a hierarchical model.
- This involves:
    - filtering the assemblies on assembly names, min size, and max size
    - filtering the data by columns and by row values
    - changing column names
    - generating/adding experiment, assembly, and content descriptions
    - cleaning the data, e.g. handling non-numeric values
    - limiting the precision of numeric values
    - optionally saving the datasets to the database
    - optionally adding interaction data
    - optionally adding information from other sources, such as genecards
- The dengue data is mapped onto the STRING+diffusion-based interactome
- We can also get the data directly from the dengue_with_uniprot.csv
- Laura Martin-Sancho is most interested in the assemblies listed in the interesting_dengue_communities.xlsx spreadsheet
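
The column selection, renaming, cleaning, and rounding steps listed above can be sketched with plain pandas. This is a minimal illustration, not the project's implementation: `prepare_table`, the gene names, and the values in `raw` are made up for the example, and the real notebook delegates this work to `Hierarchy.add_data_from_file`.

```python
import pandas as pd

def prepare_table(df, column_map, numeric_columns=None, decimal_places=None):
    """Select and rename columns, coerce numeric columns, and round floats.

    Hypothetical helper for illustration only.
    """
    # Keep only the mapped columns, then rename them
    out = df[list(column_map)].rename(columns=column_map)
    for col in numeric_columns or []:
        # Non-numeric entries such as "n/a" become NaN
        series = pd.to_numeric(out[col], errors="coerce")
        if decimal_places is not None:
            series = series.round(decimal_places)
        out[col] = series
    return out

# Made-up input table; the gene names and values are placeholders
raw = pd.DataFrame({
    "HGNC": ["GENE_A", "GENE_B", "GENE_C"],
    "log2FC": ["2.30710712", "n/a", "-2.46610508"],
    "unused": [1, 2, 3],
})

clean = prepare_table(
    raw,
    column_map={"HGNC": "GeneSymbol", "log2FC": "protein_log2FC_24h"},
    numeric_columns=["protein_log2FC_24h"],
    decimal_places=2,
)

# Filter rows by value: keep genes with at least a 4-fold change
strong = clean[clean["protein_log2FC_24h"].abs() >= 2]
print(clean)
print(strong)
```

Coercing with `errors="coerce"` keeps the row and marks the bad value as missing, which is usually preferable to dropping the gene entirely.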

In [None]:
import sys
import os

# Add the parent of the current working directory to the Python path
# so that the project packages (models, services, app) can be imported
cwd = os.getcwd()
dirname = os.path.dirname(cwd)
print(cwd)
print(dirname)
sys.path.append(dirname)

print(sys.path)

from models.analysis_plan import AnalysisPlan
from services.analysisrunner import AnalysisRunner
from models.review_plan import ReviewPlan
from services.reviewrunner import ReviewRunner
from app.sqlite_database import SqliteDatabase
from app.config import load_database_config

# Load the db connection details
# db_type, uri, user, password = load_database_config(path='~/ae_config/test_config.ini')
# self.db = Database(uri, db_type, user, password)

_, database_uri, _, _ = load_database_config()
db = SqliteDatabase(database_uri)

In [None]:


dengue_column_name_mapping = {}

## Get the assemblies of interest

In [None]:
import os
import pandas as pd

cwd = os.getcwd()
dirname = os.path.dirname(cwd)
assembly_spreadsheet_filename = "interesting_dengue_communities.xlsx"
assembly_spreadsheet_path = os.path.join(dirname, "data", assembly_spreadsheet_filename)

# Read the first two sheets of the spreadsheet
top_20_assembly_names_df = pd.read_excel(assembly_spreadsheet_path, sheet_name=0)
top_10_assembly_names_df = pd.read_excel(assembly_spreadsheet_path, sheet_name=1)

## Get the model and the interactome from NDEx in CX2

In [None]:
from models.hierarchy import Hierarchy
import json
import ndex2 
from ndex2.cx2 import RawCX2NetworkFactory

# Create NDEx2 python client
client = ndex2.client.Ndex2()

# Create CX2Network factory
factory = RawCX2NetworkFactory()

# Download BioGRID: Protein-Protein Interactions (SARS-CoV) from NDEx
# https://www.ndexbio.org/viewer/networks/669f30a3-cee6-11ea-aaef-0ac135e8bacf
# client_resp = client.get_network_as_cx2_stream('669f30a3-cee6-11ea-aaef-0ac135e8bacf')

# Dengue string interactome network c223d6db-b0e2-11ee-8a13-005056ae23aa
client_resp = client.get_network_as_cx2_stream('c223d6db-b0e2-11ee-8a13-005056ae23aa')

# Convert downloaded interactome network to CX2Network object
interactome = factory.get_cx2network(json.loads(client_resp.content))

# Dengue hierarchy
# https://www.ndexbio.org/viewer/networks/59bbb9f1-e029-11ee-9621-005056ae23aa
client_resp = client.get_network_as_cx2_stream('59bbb9f1-e029-11ee-9621-005056ae23aa')

# Convert downloaded hierarchy network to CX2Network object
hierarchy = factory.get_cx2network(json.loads(client_resp.content))

# Display information about the hierarchy network
print('Name: ' + hierarchy.get_name())
print('Number of nodes: ' + str(len(hierarchy.get_nodes())))
print('Number of edges: ' + str(len(hierarchy.get_edges())))

# Display information about the interactome network
print('Name: ' + interactome.get_name())
print('Number of nodes: ' + str(len(interactome.get_nodes())))
print('Number of edges: ' + str(len(interactome.get_edges())))

# Keep this description brief so that the prompt context stays small and processing stays fast.
brief_dengue_dataset_description = """
The dataset includes the following for genes/proteins: 

"binds": Dengue virus proteins bound to the human protein
"knockdown_inhibits": 1 = siRNA knockdown inhibits dengue virus infection, 0 = no effect

RNA and protein expression changes (log2 fold change) following dengue virus infection:
"protein_log2FC_24h"
"protein_log2FC_48h"
"rna_log2FC-24h"
"rna_log2FC_48h"
"""
      
hierarchy.add_network_attribute("experiment_description", brief_dengue_dataset_description)


## Preview the data columns

In [None]:
# csv_path = os.path.join(dirname, "data", "dengue_with_uniprot.csv")
# data = pd.read_csv(csv_path)

excel_path = os.path.join(dirname, "data", "dengue_with_uniprot_full.xlsx")
data = pd.read_excel(excel_path)

# data.columns.to_list()
column_dict = {}
for column in data.columns:
    column_dict[column] = column
column_dict

## Make the Hierarchy object and annotate it with data
- Select the columns and optionally rename them
- Annotate only the assemblies selected by the optional filter, which is based on name and size range
- Optionally reduce the precision of floats
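
The name and size-range filter described above can be sketched as a standalone function. This is only an illustration under stated assumptions: the `name` and `members` fields, the `min_size`/`max_size` parameter names, and `filter_assemblies` itself are hypothetical and may not match how `Hierarchy` represents or filters assemblies.

```python
def filter_assemblies(assemblies, names=None, min_size=None, max_size=None):
    """Return the assemblies that pass every criterion that was supplied.

    Hypothetical helper: each assembly is assumed to be a dict with a
    "name" string and a "members" list.
    """
    selected = []
    for assembly in assemblies:
        size = len(assembly["members"])
        if names is not None and assembly["name"] not in names:
            continue
        if min_size is not None and size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        selected.append(assembly)
    return selected

# Placeholder assemblies, not taken from the dengue hierarchy
example_assemblies = [
    {"name": "C1", "members": ["A", "B", "C"]},
    {"name": "C2", "members": ["A"]},
    {"name": "C3", "members": ["A", "B", "C", "D", "E", "F"]},
]

by_size = filter_assemblies(example_assemblies, min_size=2, max_size=5)
by_name = filter_assemblies(example_assemblies, names=["C2", "C3"])
print([a["name"] for a in by_size])
print([a["name"] for a in by_name])
```

Treating each criterion as optional mirrors the way the notebook passes only `{"names": assembly_list}` when no size limits are needed.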


In [None]:
import json
import csv
from io import StringIO

def data_dict_to_csv(data_dict, columns=None, decimal_places=None):
    """Convert a {gene_symbol: {property: value}} dict, or its JSON string, to CSV text."""
    if isinstance(data_dict, str):
        data_dict = json.loads(data_dict)
    
    # Scan the data_dict for all properties if columns is not specified
    if columns is None:
        columns = set()
        for item in data_dict.values():
            columns.update(item.keys())
        columns = sorted(columns)
    
    # The gene symbol is always written as the first column, so do not repeat it
    columns = [col for col in columns if col != 'GeneSymbol']
    
    # Prepare the CSV data
    output = StringIO()
    writer = csv.writer(output)
    
    # Write header
    writer.writerow(['GeneSymbol'] + columns)
    
    # Write data rows; missing properties become empty cells
    for gene_symbol, gene_data in data_dict.items():
        row = [gene_symbol]
        for col in columns:
            value = gene_data.get(col, '')
            # Preserve the format, especially for numbers
            if isinstance(value, (int, float)):
                if isinstance(value, float) and decimal_places is not None:
                    # Round floats to a fixed number of decimal places
                    value = f"{value:.{decimal_places}f}"
                else:
                    value = f"{value}"
            row.append(value)
        writer.writerow(row)
    
    return output.getvalue()

# Test Cases
test_data_dicts = [
    '{"STOM": {"GeneSymbol": "STOM", "GeneID": 2040, "rna_log2FC_48h": 1.88, "protein_log2FC_48h": 2.08, "binds": "DENV2 16681 NS4A,ZIKVfp NS4A,ZIKVug NS4A", "knockdown_inhibits": 0}, "SERPINH1": {"GeneSymbol": "SERPINH1", "GeneID": 871, "rna_log2FC_48h": 1.07, "knockdown_inhibits": 0}, "CLDN5": {"GeneSymbol": "CLDN5", "GeneID": 7122, "knockdown_inhibits": 1}, "COL15A1": {"GeneSymbol": "COL15A1", "GeneID": 1306, "knockdown_inhibits": 1}, "DCC": {"GeneSymbol": "DCC", "GeneID": 1630, "knockdown_inhibits": 1}, "COL6A5": {"GeneSymbol": "COL6A5", "GeneID": 256076, "knockdown_inhibits": 1}, "IGFBP7": {"GeneSymbol": "IGFBP7", "GeneID": 3490, "knockdown_inhibits": 1}, "ITGA11": {"GeneSymbol": "ITGA11", "GeneID": 22801, "knockdown_inhibits": 1}, "ITGAD": {"GeneSymbol": "ITGAD", "GeneID": 3681, "knockdown_inhibits": 1}, "LAMC2": {"GeneSymbol": "LAMC2", "GeneID": 3918, "knockdown_inhibits": 1}, "MSN": {"GeneSymbol": "MSN", "GeneID": 4478, "knockdown_inhibits": 1}, "MSX1": {"GeneSymbol": "MSX1", "GeneID": 4487, "knockdown_inhibits": 1}, "NBL1": {"GeneSymbol": "NBL1", "GeneID": 4681, "knockdown_inhibits": 1}, "NEO1": {"GeneSymbol": "NEO1", "GeneID": 4756, "knockdown_inhibits": 1}, "PF4V1": {"GeneSymbol": "PF4V1", "GeneID": 5197, "knockdown_inhibits": 1}, "PHEX": {"GeneSymbol": "PHEX", "GeneID": 5251, "knockdown_inhibits": 1}, "PLS1": {"GeneSymbol": "PLS1", "GeneID": 5357, "protein_log2FC_48h": -2.11, "knockdown_inhibits": 1}, "SELP": {"GeneSymbol": "SELP", "GeneID": 6403, "knockdown_inhibits": 1}, "SMAD6": {"GeneSymbol": "SMAD6", "GeneID": 4091, "knockdown_inhibits": 1}, "SNX17": {"GeneSymbol": "SNX17", "GeneID": 9784, "knockdown_inhibits": 1}, "SYNPO2": {"GeneSymbol": "SYNPO2", "GeneID": 171024, "protein_log2FC_48h": 5.0, "knockdown_inhibits": 1}, "NCF1": {"GeneSymbol": "NCF1", "GeneID": 653361, "protein_log2FC_24h": 2.30710712368644, "protein_log2FC_48h": 3.27, "knockdown_inhibits": 0}, "ESPNL": {"GeneSymbol": "ESPNL", "GeneID": 339768, "protein_log2FC_24h": -2.46610507958088, "knockdown_inhibits": 0}, "OSR2": {"GeneSymbol": "OSR2", "GeneID": 116039, "protein_log2FC_24h": 5.18836762083554, "protein_log2FC_48h": 6.57, "knockdown_inhibits": 0}, "BMP2": {"GeneSymbol": "BMP2", "GeneID": 650, "protein_log2FC_48h": 3.92, "knockdown_inhibits": 0}, "CAV1": {"GeneSymbol": "CAV1", "GeneID": 857, "protein_log2FC_48h": 2.86, "knockdown_inhibits": 0}, "COL6A2": {"GeneSymbol": "COL6A2", "GeneID": 1292, "protein_log2FC_48h": -2.5, "knockdown_inhibits": 0}, "DMP1": {"GeneSymbol": "DMP1", "GeneID": 1758, "protein_log2FC_48h": -2.17, "knockdown_inhibits": 0}, "F13A1": {"GeneSymbol": "F13A1", "GeneID": 2162, "protein_log2FC_48h": -3.49, "knockdown_inhibits": 0}, "FLNA": {"GeneSymbol": "FLNA", "GeneID": 2316, "protein_log2FC_48h": -2.59, "knockdown_inhibits": 0}, "FSTL1": {"GeneSymbol": "FSTL1", "GeneID": 11167, "protein_log2FC_48h": 2.42, "knockdown_inhibits": 0}, "ITGA2": {"GeneSymbol": "ITGA2", "GeneID": 3673, "protein_log2FC_48h": 2.59, "knockdown_inhibits": 0}, "ITGA9": {"GeneSymbol": "ITGA9", "GeneID": 3680, "protein_log2FC_48h": -2.52, "knockdown_inhibits": 0}, "ITGB3": {"GeneSymbol": "ITGB3", "GeneID": 3690, "protein_log2FC_48h": -2.05, "knockdown_inhibits": 0}, "ITGB8": {"GeneSymbol": "ITGB8", "GeneID": 3696, "protein_log2FC_48h": 3.11, "knockdown_inhibits": 0}, "LAMA3": {"GeneSymbol": "LAMA3", "GeneID": 3909, "protein_log2FC_48h": 3.12, "knockdown_inhibits": 0}, "PTPN13": {"GeneSymbol": "PTPN13", "GeneID": 5783, "protein_log2FC_48h": -2.11, "knockdown_inhibits": 0}, "RND3": {"GeneSymbol": "RND3", "GeneID": 390, "protein_log2FC_48h": 3.33, "knockdown_inhibits": 0}, "SAT1": {"GeneSymbol": "SAT1", "GeneID": 6303, "protein_log2FC_48h": 2.02, "knockdown_inhibits": 0}, "SPARC": {"GeneSymbol": "SPARC", "GeneID": 6678, "protein_log2FC_48h": -2.27, "knockdown_inhibits": 0}, "STAB1": {"GeneSymbol": "STAB1", "GeneID": 23166, "protein_log2FC_48h": -2.82, "knockdown_inhibits": 0}, "UNC5C": {"GeneSymbol": "UNC5C", "GeneID": 8633, "protein_log2FC_48h": 4.01, "knockdown_inhibits": 0}, "VCAM1": {"GeneSymbol": "VCAM1", "GeneID": 7412, "protein_log2FC_48h": 2.34, "knockdown_inhibits": 0}, "VWF": {"GeneSymbol": "VWF", "GeneID": 7450, "protein_log2FC_48h": -3.17, "knockdown_inhibits": 0}}',
    '{"SERPINH1": {"GeneSymbol": "SERPINH1", "GeneID": 871, "rna_log2FC_48h": 1.07, "knockdown_inhibits": 0}, "COL15A1": {"GeneSymbol": "COL15A1", "GeneID": 1306, "knockdown_inhibits": 1}, "COL6A5": {"GeneSymbol": "COL6A5", "GeneID": 256076, "knockdown_inhibits": 1}, "ITGA11": {"GeneSymbol": "ITGA11", "GeneID": 22801, "knockdown_inhibits": 1}, "LAMC2": {"GeneSymbol": "LAMC2", "GeneID": 3918, "knockdown_inhibits": 1}, "PHEX": {"GeneSymbol": "PHEX", "GeneID": 5251, "knockdown_inhibits": 1}, "SYNPO2": {"GeneSymbol": "SYNPO2", "GeneID": 171024, "protein_log2FC_48h": 5.0, "knockdown_inhibits": 1}, "COL6A2": {"GeneSymbol": "COL6A2", "GeneID": 1292, "protein_log2FC_48h": -2.5, "knockdown_inhibits": 0}, "DMP1": {"GeneSymbol": "DMP1", "GeneID": 1758, "protein_log2FC_48h": -2.17, "knockdown_inhibits": 0}, "FLNA": {"GeneSymbol": "FLNA", "GeneID": 2316, "protein_log2FC_48h": -2.59, "knockdown_inhibits": 0}, "ITGA2": {"GeneSymbol": "ITGA2", "GeneID": 3673, "protein_log2FC_48h": 2.59, "knockdown_inhibits": 0}, "ITGA9": {"GeneSymbol": "ITGA9", "GeneID": 3680, "protein_log2FC_48h": -2.52, "knockdown_inhibits": 0}, "ITGB3": {"GeneSymbol": "ITGB3", "GeneID": 3690, "protein_log2FC_48h": -2.05, "knockdown_inhibits": 0}, "ITGB8": {"GeneSymbol": "ITGB8", "GeneID": 3696, "protein_log2FC_48h": 3.11, "knockdown_inhibits": 0}, "LAMA3": {"GeneSymbol": "LAMA3", "GeneID": 3909, "protein_log2FC_48h": 3.12, "knockdown_inhibits": 0}, "SAT1": {"GeneSymbol": "SAT1", "GeneID": 6303, "protein_log2FC_48h": 2.02, "knockdown_inhibits": 0}}'
]

# Run Tests
for i, test_data in enumerate(test_data_dicts):
    print(f"Test Case {i + 1}:")
    csv_output = data_dict_to_csv(test_data)
    print(csv_output)
    print("\n" + "="*50 + "\n")

# Test with specific columns
specific_columns = ['GeneID', 'knockdown_inhibits', 'protein_log2FC_48h']
print("Test with specific columns:")
csv_output = data_dict_to_csv(test_data_dicts[0], columns=specific_columns)
print(csv_output)
print("\n" + "="*50 + "\n")

# Test with decimal_places
print("Test with decimal_places=2:")
csv_output = data_dict_to_csv(test_data_dicts[0], decimal_places=2)
print(csv_output)

In [None]:
import os

dengue_hierarchy = Hierarchy(hierarchy, interactome)

# csv_path = os.path.join(dirname, "data", "dengue_with_uniprot.csv")
excel_path = os.path.join(dirname, "data", "dengue_with_uniprot_full.xlsx")

assembly_list = top_10_assembly_names_df["Community"].to_list()

data_columns = {'GeneID': 'GeneID',
                        # 'UniprotID': 'UniprotID',
                        # 'GeneSymbol': 'GeneSymbol',
                        # 'siRNA_GeneSymbol': 'siRNA_GeneSymbol',
                        'DV3_24h-Mock_24h': 'rna_log2FC-24h',
                        'DV3_48h-Mock_48h': 'rna_log2FC_48h',
                        # 'siRNA_Screen_Average_Zscore': 'siRNA_Screen_Average_Zscore',
                        'log2FC': 'protein_log2FC_24h',
                        #'Condition': 'Condition',
                        # 'GeneSymbol_48hpi': 'GeneSymbol_48hpi',
                        'log2FC_48hpi': 'protein_log2FC_48h',
                        # 'Condition_48hpi': 'Condition_48hpi',
                        'dengue_protein_list': 'binds',
                        # 'dengue_MiST_list': 'dengue_MiST_list',
                        # 'PPI_GeneSymbol': 'PPI_GeneSymbol',
                        # 'viral_interaction': 'viral_interaction',
                        'has_siRNA': 'knockdown_inhibits',
                        # 'has_protein_24hr': 'has_protein_24hr',
                        # 'has_protein_48hr': 'has_protein_48hr',
                        # 'has_rnaSeq_24hr': 'has_rnaSeq_24hr',
                        # 'has_rnaSeq_48hr': 'has_rnaSeq_48hr',
                        # 'UniProtID': 'UniProtID',
                        'HGNC': 'GeneSymbol'}

print(data_columns)
# add_data_from_file(self, file_path, key_column='name', columns=None, filter=None, sheet_name=0, delimiter=None):
dengue_assemblies = dengue_hierarchy.add_data_from_file(excel_path,
                                                 key_column="HGNC",
                                                 filter={"names": assembly_list},
                                                 columns=data_columns)

dengue_assemblies[0]

## Generate Datasets for a list of assemblies
- optionally selecting columns
- get a list of their ids to use elsewhere

In [None]:
from models.hierarchy import dataset_from_assembly

dataset_ids = []

dataset_columns = {'GeneID': 'GeneID',
                        'rna_log2FC-24h': 'rna_log2FC-24h',
                        'rna_log2FC_48h': 'rna_log2FC_48h',
                        'protein_log2FC_24h': 'protein_log2FC_24h',
                        'protein_log2FC_48h': 'protein_log2FC_48h',
                        'binds': 'binds',
                        'knockdown_inhibits': 'knockdown_inhibits',
                        'name': 'GeneSymbol'}

for assembly in dengue_assemblies:
    dataset = dataset_from_assembly(db, assembly, 
                                    type="csv",
                                    columns=dataset_columns,
                                    experiment_description=brief_dengue_dataset_description)
    print(dataset.name)
    dataset_ids.append(dataset.object_id)
 

In [None]:
def list_assemblies(db, hierarchy, filter=None):
    for assembly in hierarchy.get_assemblies(filter=filter):
        #print(json.dumps(assembly, indent=4))
        print(assembly["v"].get("name"))
        print(f'--LLM-> {assembly["v"].get("LLM Name")}')
        print(f'--ENR-> {assembly["v"].get("CD_CommunityName")}')
        print(assembly['v'].get('data'))
        #print(dataset_from_assembly(db, assembly))
        print('___________________________________________________________________')

list_assemblies(db, dengue_hierarchy, filter={"names": top_10_assembly_names_df["Community"].to_list(),
                                              "columns": ["binds", 
                                                          "knockdown_inhibits", 
                                                          "protein_log2FC_24h", "protein_log2FC_48h",
                                                          "rna_log2FC-24h", "rna_log2FC_48h"]})

```
from models.dataset import Dataset

c4493827 = Dataset.load(db, "dataset_ac5bf71e-9e92-4004-8c49-3d62f538044a")

print(c4493827.data)
```

```
You are a brilliant graduate student in molecular biology. Your work is reviewed by your world-renowned advisor, who is very demanding. You will need to give your very best effort in this project.
```
----------

```
<task>
Based on the experiment description, analyze the dataset and develop a mechanistic, causal hypothesis for the processes that led to the observed data. The proteins/genes in the dataset are hypothesized to be known to interact; use your knowledge of these proteins and their interactions to develop chains of events that connect experimental perturbations with molecular and phenotypic observations. The hypothesis should make specific predictions that could be experimentally validated. 
</task>

<biological_context>
The hypothesis should support the higher-level goal of developing drug therapies for the disease.
</biological_context>

<detailed_instructions>
1. Review the data to ensure that you understand the meaning of each observation. In your hypothesis, be sure that you correctly use the data and do not hallucinate any observations.

2. Review your knowledge of the functions of these proteins and the known interactions between them.

3. Based on the data and your knowledge, construct the hypothesis that you think best meets these criteria:
- Plausible
- Non-trivial
- Supports the higher-level goal of drug development
- Novel
- Actionable: is cost-effective in both time and money

4. Your lab has limited resources: reagents, equipment, and your time. Remember, your time is precious. You must use it well if you are to get your doctorate. If you do not think that there is any hypothesis that is worth following up with a validation experiment, say so.

5. Output your hypothesis as follows:

## Knowledge Graph:
A concise knowledge graph, written as a list of the causal relationships between proteins, complexes, events, disease states, etc. "Therefore:" indicates hypothesized relationships. For example:

(A binds B) inactivates (B)
(B) performs (phosphorylation C)
(phosphorylation C) increases (active C)
Therefore: (A) decreases (active C)


## Hypothesis:
short descriptive paragraph. 

</detailed_instructions>

<experiment_description>
{experiment_description}
</experiment_description>

<dataset>
{data}
</dataset>
```
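
The `{experiment_description}` and `{data}` placeholders in the template above can be filled with `str.format`. The sketch below uses a shortened stand-in template and a made-up two-row CSV; `build_prompt` is a hypothetical helper, not part of the project code.

```python
# Shortened stand-in for the full prompt template
PROMPT_TEMPLATE = """<experiment_description>
{experiment_description}
</experiment_description>

<dataset>
{data}
</dataset>"""

def build_prompt(template, experiment_description, data):
    """Fill the two named placeholders in the template (hypothetical helper)."""
    return template.format(
        experiment_description=experiment_description.strip(),
        data=data.strip(),
    )

# Placeholder inputs for illustration only
example_description = "Example description of the measured columns."
example_csv = "GeneSymbol,knockdown_inhibits\nGENE_A,1\nGENE_B,0\n"

prompt = build_prompt(PROMPT_TEMPLATE, example_description, example_csv)
print(prompt)
```

`str.format` treats every `{` and `}` in the template as a placeholder delimiter, so this approach works only while the template contains no other literal braces; otherwise they would need to be doubled (`{{`, `}}`).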