## Creating Datasets from the Dengue Hierarchy and Interactome
- The goal is to generate datasets in various formats based on the assemblies in a hierarchical model.
- This involves:
    - filtering the assemblies on assembly names, min size, and max size
    - filtering the data by columns and by row values
    - changing column names
    - generating/adding experiment, assembly, and content descriptions
    - cleaning the data, for example handling non-numeric values
    - limiting the precision of numeric values
    - optionally saving the datasets to the database
    - optionally adding interaction data
    - optionally adding information from other sources, such as GeneCards
- The Dengue data is stored on the nodes of the STRING+diffusion-based interactome
- We can also get the data directly from dengue_with_uniprot.csv
- Laura Martin-Sancho is most interested in the assemblies listed in the interesting_dengue_communities.xlsx spreadsheet
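
The filtering and renaming steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the dict layout (`name`, `members`), the helper names `filter_assemblies` and `select_and_rename`, the placeholder gene names, and the renamed column `log2fc_48h` are all hypothetical and are not the `Hierarchy` class API used later in this notebook.

```python
def filter_assemblies(assemblies, names=None, min_size=None, max_size=None):
    """Keep assemblies matching the optional name and size constraints."""
    selected = []
    for assembly in assemblies:
        size = len(assembly["members"])
        if names is not None and assembly["name"] not in names:
            continue
        if min_size is not None and size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        selected.append(assembly)
    return selected


def select_and_rename(rows, columns, name_mapping):
    """Keep only the requested columns and rename them via name_mapping."""
    return [
        {name_mapping.get(col, col): row[col] for col in columns if col in row}
        for row in rows
    ]


# Hypothetical assemblies; gene names and values are placeholders
assemblies = [
    {"name": "Small assembly",
     "members": [{"name": "GENE_A", "log2FC_48hpi": 1.5, "GeneID": 1},
                 {"name": "GENE_B", "GeneID": 2}]},
    {"name": "Large assembly",
     "members": [{"name": f"GENE_{i}"} for i in range(10)]},
]

small = filter_assemblies(assemblies, max_size=4)
rows = select_and_rename(
    small[0]["members"],
    columns=["name", "log2FC_48hpi"],
    name_mapping={"log2FC_48hpi": "log2fc_48h"},
)
print([a["name"] for a in small])
print(rows)
```

Passing `None` for a constraint disables it, so the same helper covers name-only, size-only, and combined filters.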

In [1]:
import sys
import os

# Add the parent of the current working directory (the repository root) to the Python path
cwd = os.getcwd()
dirname = os.path.dirname(cwd)
print(cwd)
print(dirname)
sys.path.append(dirname)

print(sys.path)

from models.analysis_plan import AnalysisPlan
from services.analysisrunner import AnalysisRunner
from models.review_plan import ReviewPlan
from services.reviewrunner import ReviewRunner
from app.sqlite_database import SqliteDatabase
from app.config import load_database_config

# Load the db connection details
# db_type, uri, user, password = load_database_config(path='~/ae_config/test_config.ini')
# self.db = Database(uri, db_type, user, password)

_, database_uri, _, _ = load_database_config()
db = SqliteDatabase(database_uri)

/Users/idekeradmin/Dropbox/GitHub/agent_evaluation/notebooks
/Users/idekeradmin/Dropbox/GitHub/agent_evaluation
['/Users/idekeradmin/Dropbox/GitHub/agent_evaluation/notebooks', '/opt/anaconda3/lib/python311.zip', '/opt/anaconda3/lib/python3.11', '/opt/anaconda3/lib/python3.11/lib-dynload', '', '/Users/idekeradmin/.local/lib/python3.11/site-packages', '/opt/anaconda3/lib/python3.11/site-packages', '/opt/anaconda3/lib/python3.11/site-packages/aeosa', '/Users/idekeradmin/Dropbox/GitHub/agent_evaluation']


In [2]:
# This description is kept brief so the context stays small and operations stay fast.
brief_dengue_dataset_description = """
The dataset includes measurements of “protein abundance” and “mRNA expression”
changes after infection of human cells with Dengue virus as compared to mock 
infected controls. The dataset also includes an siRNA screen average Z-score 
as well as information about viral interacting proteins for each given human gene.

The data has 2 time points: 24 hours after infection or 48 hours after infection.

Here is a detailed description of the columns in the dataset:

A- Official human gene symbol (HGNC)
B- Protein abundance 24 hours after infection
C- Protein abundance 48 hours after infection
D- siRNA screen average Z-score
E- mRNA expression 24 hours after infection
F- mRNA expression 48 hours after infection

Please note:

“Protein abundance” and “mRNA expression” measurements <0 reflect a "decrease" while measurements >0 indicate an "increase".

For the siRNA Z-scores in column D: the higher the score, the stronger the negative effect on viral replication caused by that gene's silencing. The Z-score can range from 1.1 (minor negative effect on viral replication) to 2.9 (strong negative effect on viral replication). If this datapoint is available for a certain gene, it means that siRNA experiments have already been performed and you should not include this approach in the "Actionability" criteria described below.
Usually, if a gene has a Z-score value in column D, it will not have values in the other columns within the dataset.

"""

dengue_column_name_mapping = {}
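
The sign and Z-score conventions in the description above can be expressed as two small helpers. This is a minimal sketch: the function names are hypothetical, treating an exact zero as "no change" is an assumption (the description only covers values below and above zero), and the key `siRNA_Screen_Average_Zscore` is the node attribute name that appears in the interactome declarations.

```python
def describe_change(value):
    """Translate a protein abundance or mRNA expression change into words."""
    if value is None:
        return "not measured"
    if value > 0:
        return "increase"
    if value < 0:
        return "decrease"
    # Assumption: the description does not say how to label an exact zero
    return "no change"


def sirna_already_screened(record):
    """A gene with a Z-score has already been tested in the siRNA screen."""
    return record.get("siRNA_Screen_Average_Zscore") is not None


print(describe_change(-2.09))
print(sirna_already_screened({"name": "GENE_A", "siRNA_Screen_Average_Zscore": 1.8}))
```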

In [3]:
import pandas as pd

cwd = os.getcwd()
dirname = os.path.dirname(cwd)

# Build the spreadsheet path once and reuse it for both sheets
assembly_spreadsheet_filename = "interesting_dengue_communities.xlsx"
assembly_spreadsheet_path = os.path.join(dirname, "files", assembly_spreadsheet_filename)

top_20_assembly_names_df = pd.read_excel(assembly_spreadsheet_path, sheet_name=0)
top_10_assembly_names_df = pd.read_excel(assembly_spreadsheet_path, sheet_name=1)
top_10_assembly_names_df

Unnamed: 0,Community
0,Chromatin remodelling and transcriptional regu...
1,DNA repair and maintenance in response to vira...
2,Endothelial barrier function and viral entry m...
3,Extracellular matrix organization and cell adh...
4,Kynurenine pathway modulation and immune respo...
5,Oxidative stress response and protein quality ...
6,Protein quality control and intracellular sign...
7,RAS pathway modulation and apoptosis regulation
8,RNA metabolism (and modification in viral infe...
9,Urea cycle and redox homeostasis


In [4]:
from models.hierarchy import Hierarchy
import json
import ndex2 
from ndex2.cx2 import RawCX2NetworkFactory

# Create NDEx2 python client
client = ndex2.client.Ndex2()

# Create CX2Network factory
factory = RawCX2NetworkFactory()

# Download BioGRID: Protein-Protein Interactions (SARS-CoV) from NDEx
# https://www.ndexbio.org/viewer/networks/669f30a3-cee6-11ea-aaef-0ac135e8bacf
# client_resp = client.get_network_as_cx2_stream('669f30a3-cee6-11ea-aaef-0ac135e8bacf')

# Dengue string interactome network c223d6db-b0e2-11ee-8a13-005056ae23aa
client_resp = client.get_network_as_cx2_stream('c223d6db-b0e2-11ee-8a13-005056ae23aa')

# Convert downloaded interactome network to CX2Network object
interactome = factory.get_cx2network(json.loads(client_resp.content))

# Dengue hierarchy
# https://www.ndexbio.org/viewer/networks/59bbb9f1-e029-11ee-9621-005056ae23aa
client_resp = client.get_network_as_cx2_stream('59bbb9f1-e029-11ee-9621-005056ae23aa')

# Convert downloaded hierarchy network to CX2Network object
hierarchy = factory.get_cx2network(json.loads(client_resp.content))

# Display information about the hierarchy network
print('Name: ' + hierarchy.get_name())
print('Number of nodes: ' + str(len(hierarchy.get_nodes())))
print('Number of edges: ' + str(len(hierarchy.get_edges())))

# Display information about the interactome network
print('Name: ' + interactome.get_name())
print('Number of nodes: ' + str(len(interactome.get_nodes())))
print('Number of edges: ' + str(len(interactome.get_edges())))

Name: Dengue model - hidef string 12.0 0.7 (GPT-4 annotated) - L2R
Number of nodes: 203
Number of edges: 249
Name: dengue string 12.0 0.7
Number of nodes: 1375
Number of edges: 2792


In [5]:
# print(json.dumps(list(hierarchy.get_nodes().values())[0], indent=2))
hierarchy_object = Hierarchy(hierarchy)
print(top_10_assembly_names_df.shape)
print(top_10_assembly_names_df["Community"][1])

hierarchy_object.get_assemblies(filter={"names": [top_10_assembly_names_df["Community"][0]]})

(10, 1)
DNA repair and maintenance in response to viral infection


[]

## Hierarchy Class

In [6]:
hierarchy.add_network_attribute("experiment_description", "in this experiment, cells were infected with dengue virus and 'omics and phenotypic screening data  was measured", "string")

In [7]:
interactome.get_attribute_declarations()

{'nodes': {'has_rnaSeq_48hr': {'d': 'integer'},
  'has_rnaSeq_24hr': {'d': 'integer'},
  'Condition_48hpi': {'d': 'string'},
  'type': {'v': 'protein', 'd': 'string'},
  'GeneSymbol_48hpi': {'d': 'string'},
  'represents': {'a': 'r', 'd': 'string'},
  'has_protein_24hr': {'d': 'integer'},
  'has_protein_48hr': {'d': 'integer'},
  'UniprotID': {'d': 'string'},
  'siRNA_GeneSymbol': {'d': 'string'},
  'alias': {'d': 'list_of_string'},
  'PPI_GeneSymbol': {'d': 'string'},
  'dengue_MiST_list': {'d': 'string'},
  'GeneSymbol': {'d': 'string'},
  'Condition': {'d': 'string'},
  'viral_interaction': {'d': 'integer'},
  'DV3_24h-Mock_24h': {'d': 'double'},
  'log2FC_48hpi': {'d': 'double'},
  'has_siRNA': {'d': 'integer'},
  'dengue_protein_list': {'d': 'string'},
  'siRNA_Screen_Average_Zscore': {'d': 'double'},
  'name': {'a': 'n', 'd': 'string'},
  'GeneID': {'d': 'integer'},
  'log2FC': {'d': 'double'},
  'DV3_48h-Mock_48h': {'d': 'double'}},
 'networkAttributes': {'disease': {'d': 'strin
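
In CX2 attribute declarations, `d` gives the data type, `a` an optional short alias for the attribute name, and `v` an optional default value. The sketch below shows how a raw node's values could be expanded with those declarations. It works on plain dicts, the helper name `expand_node_attributes` is hypothetical, and the `ndex2` library may already perform this expansion when it loads a network; the sample values are taken from the PEMT record in this notebook's output.

```python
def expand_node_attributes(raw_values, declarations):
    """Replace aliases with full names and fill in declared defaults."""
    alias_to_name = {
        decl["a"]: name for name, decl in declarations.items() if "a" in decl
    }
    expanded = {alias_to_name.get(key, key): value
                for key, value in raw_values.items()}
    for name, decl in declarations.items():
        if "v" in decl and name not in expanded:
            expanded[name] = decl["v"]
    return expanded


# A subset of the node declarations shown above
node_declarations = {
    "type": {"v": "protein", "d": "string"},
    "represents": {"a": "r", "d": "string"},
    "name": {"a": "n", "d": "string"},
    "GeneID": {"d": "integer"},
    "log2FC_48hpi": {"d": "double"},
}

# Raw values as they might appear in a node's "v" dict, using the alias "n"
raw_values = {"n": "PEMT", "GeneID": 10400, "log2FC_48hpi": -2.09090804358616}

expanded = expand_node_attributes(raw_values, node_declarations)
print(expanded)
```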

In [9]:
dengue_hierarchy = Hierarchy(hierarchy, interactome)
print(dengue_hierarchy.get_experiment_description())
assemblies = dengue_hierarchy.add_data_to_assemblies(
    member_attributes=["name", "GeneSymbol", "GeneID", "log2FC_48hpi", "pvalue_48hpi", "log2FC_72hpi", "pvalue_72hpi"],
    filter={"max_size": 4})
assemblies[0]


in this experiment, cells were infected with dengue virus and 'omics and phenotypic screening data  was measured


[{'name': 'PEMT', 'log2FC_48hpi': -2.09090804358616, 'GeneID': 10400},
 {'name': 'CHKA', 'GeneSymbol': 'CHKA', 'GeneID': 1119},
 {'name': 'PTDSS2', 'GeneSymbol': 'PTDSS2', 'GeneID': 81490},
 {'name': 'PLA1A', 'log2FC_48hpi': 4.07675395617054, 'GeneID': 51365}]


In [None]:
datasets[1].name
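
The member records in the list output above are sparse: each gene carries only the attributes it actually has. A minimal sketch of the cleaning and precision-limiting steps is shown below, assuming a hypothetical helper `clean_member` that fills missing columns with `None`, drops non-numeric or non-finite values in numeric columns, and rounds the rest. The choice of three decimal places is arbitrary.

```python
import math


def clean_member(record, columns, numeric_columns, precision=3):
    """Return a record with a fixed set of columns and rounded numeric values."""
    cleaned = {}
    for col in columns:
        value = record.get(col)  # missing attributes become None
        if col in numeric_columns:
            try:
                number = float(value)
            except (TypeError, ValueError):
                number = None  # non-numeric or missing value
            if number is not None and not math.isfinite(number):
                number = None  # drop NaN and infinities
            value = None if number is None else round(number, precision)
        cleaned[col] = value
    return cleaned


# Two of the member records from the output above
members = [
    {"name": "PEMT", "log2FC_48hpi": -2.09090804358616, "GeneID": 10400},
    {"name": "CHKA", "GeneSymbol": "CHKA", "GeneID": 1119},
]

columns = ["name", "GeneSymbol", "GeneID", "log2FC_48hpi"]
table = [clean_member(m, columns, numeric_columns={"log2FC_48hpi"}) for m in members]
for row in table:
    print(row)
```

Giving every record the same columns makes the result straightforward to load into a `pandas` DataFrame or write out as CSV.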