## Creating Datasets from the Dengue Hierarchy and Interactome
- The goal is to generate datasets in various formats based on the assemblies in a hierarchical model.
- This involves:
    - filtering the assemblies on assembly names, min size, and max size
    - filtering the data by columns and by row values
    - changing column names
    - generating/adding experiment, assembly, and content descriptions
    - cleaning the data, such as handling non-numeric values
    - limiting the precision of numeric values
    - optionally saving the datasets to the database
    - optionally adding interaction data
    - optionally adding information from other sources, such as genecards
- The dengue data is mapped onto the STRING+diffusion-based interactome
- We can also get the data directly from dengue_with_uniprot.csv
- Laura Martin-Sancho is most interested in the assemblies listed in the interesting_dengue_communities.xlsx spreadsheet
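
The column selection, renaming, cleaning, and precision steps listed above can be illustrated on a single data row. This is a minimal sketch, not the code used by the `Hierarchy` class: `clean_row` and its parameters are hypothetical names introduced here, and treating non-numeric entries as missing (`None`) is an assumption about the desired behavior.

```python
import math

def clean_row(row, columns, numeric_columns, decimal_places=3):
    """Hypothetical helper: select and rename columns, then clean and round numbers."""
    cleaned = {}
    for old_name, new_name in columns.items():
        if old_name not in row:
            continue  # column absent from this row
        value = row[old_name]
        if new_name in numeric_columns:
            try:
                value = float(value)
            except (TypeError, ValueError):
                value = None  # assumption: non-numeric entries such as "NA" become missing
            else:
                # NaN is also treated as missing; real numbers are rounded
                value = None if math.isnan(value) else round(value, decimal_places)
        cleaned[new_name] = value
    return cleaned

# Mapping from original column names to the names used in the dataset
columns = {"HGNC": "GeneSymbol", "log2FC_48hpi": "rna_log2FC_48h", "log2FC": "rna_log2FC_24h"}

# Illustrative row; "Condition" is not in the mapping, so it is dropped
raw_row = {"HGNC": "ZNF724", "log2FC_48hpi": -2.46518191325064, "log2FC": "NA", "Condition": "x"}

cleaned_row = clean_row(raw_row, columns, numeric_columns={"rna_log2FC_24h", "rna_log2FC_48h"})
print(cleaned_row)
```

Working on plain dictionaries keeps the sketch independent of any particular file format; the same logic could be applied to each row read from a CSV or Excel file.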

In [1]:
import sys
import os

# Add the parent of the current working directory (the repository root) to the Python path
cwd = os.getcwd()
dirname = os.path.dirname(cwd)
print(cwd)
print(dirname)
sys.path.append(dirname)

print(sys.path)

from models.analysis_plan import AnalysisPlan
from services.analysisrunner import AnalysisRunner
from models.review_plan import ReviewPlan
from services.reviewrunner import ReviewRunner
from app.sqlite_database import SqliteDatabase
from app.config import load_database_uri

# Load the db connection details
# db_type, uri, user, password = load_database_config(path='~/ae_config/test_config.ini')
# self.db = Database(uri, db_type, user, password)

database_uri = load_database_uri()
db = SqliteDatabase(database_uri)

/Users/idekeradmin/Dropbox/GitHub/agent_evaluation/notebooks
/Users/idekeradmin/Dropbox/GitHub/agent_evaluation
['/opt/anaconda3/envs/ae2/lib/python311.zip', '/opt/anaconda3/envs/ae2/lib/python3.11', '/opt/anaconda3/envs/ae2/lib/python3.11/lib-dynload', '', '/Users/idekeradmin/.local/lib/python3.11/site-packages', '/opt/anaconda3/envs/ae2/lib/python3.11/site-packages', '/Users/idekeradmin/Dropbox/GitHub/agent_evaluation']




In [2]:


dengue_column_name_mapping = {}

## Get the assemblies of interest

In [3]:
import pandas as pd

cwd = os.getcwd()
dirname = os.path.dirname(cwd)

# Build the spreadsheet path once and read both sheets from it.
# Judging by the variable names, sheet 0 holds the top-20 list and sheet 1 the top-10 list.
assembly_spreadsheet_filename = "interesting_dengue_communities.xlsx"
assembly_spreadsheet_path = os.path.join(dirname, "data", assembly_spreadsheet_filename)
top_20_assembly_names_df = pd.read_excel(assembly_spreadsheet_path, sheet_name=0)
top_10_assembly_names_df = pd.read_excel(assembly_spreadsheet_path, sheet_name=1)

## Get the model and the interactome from NDEx in CX2

In [4]:
from models.hierarchy import Hierarchy
import json
import ndex2 
from ndex2.cx2 import RawCX2NetworkFactory

# Create NDEx2 python client
client = ndex2.client.Ndex2()

# Create CX2Network factory
factory = RawCX2NetworkFactory()

# Download BioGRID: Protein-Protein Interactions (SARS-CoV) from NDEx
# https://www.ndexbio.org/viewer/networks/669f30a3-cee6-11ea-aaef-0ac135e8bacf
# client_resp = client.get_network_as_cx2_stream('669f30a3-cee6-11ea-aaef-0ac135e8bacf')

# Dengue string interactome network c223d6db-b0e2-11ee-8a13-005056ae23aa
client_resp = client.get_network_as_cx2_stream('c223d6db-b0e2-11ee-8a13-005056ae23aa')

# Convert downloaded interactome network to CX2Network object
interactome = factory.get_cx2network(json.loads(client_resp.content))

# Dengue hierarchy
# https://www.ndexbio.org/viewer/networks/59bbb9f1-e029-11ee-9621-005056ae23aa
client_resp = client.get_network_as_cx2_stream('59bbb9f1-e029-11ee-9621-005056ae23aa')

# Convert downloaded hierarchy network to CX2Network object
hierarchy = factory.get_cx2network(json.loads(client_resp.content))

# Display information about the hierarchy network
print('Name: ' + hierarchy.get_name())
print('Number of nodes: ' + str(len(hierarchy.get_nodes())))
print('Number of edges: ' + str(len(hierarchy.get_edges())))

# Display information about the interactome network
print('Name: ' + interactome.get_name())
print('Number of nodes: ' + str(len(interactome.get_nodes())))
print('Number of edges: ' + str(len(interactome.get_edges())))

# Keep this description brief so that the context stays small and operations stay fast.
brief_dengue_dataset_description = """

This data integrates four datasets and is intended to identify factors that negatively support the dengue virus and are supported by other orthogonal datasets.

The study created the following novel datasets. Primary human dendritic cells were infected with dengue virus (serotype 3) and were subjected to: 
(1) siRNA screening to identify human host factors that act to restrict viral replication, 
(2) Proteomics (Protein Abundance) to look at human proteins that change in abundance following infection. This was done at 24h and 48h post-infection.
Jeff Johnson, Krogan lab
(3) RNAseq was used to examine cellular mRNAs that are differently expressed following infection. This was done 24h and 48h post-infection. 
Stephen Wolinski, NWU

In this analysis, we are also incorporating Priya Shah’s published dengue protein-protein interaction (PPI) dataset (Shah et al., Cell 2018). 

The dataset includes the following columns: 

"binds": Dengue virus proteins bound to the human protein
"knockdown_inhibits": 1 = siRNA knockdown inhibits dengue virus infection, 2 = no effect

RNA and protein expression changes following dengue virus infection:
"protein_log2FC_24h"
"protein_log2FC_48h"
"rna_log2FC_24h"
"rna_log2FC_48h"

Missing values for measurements of a given gene/protein indicate that the change was below the significance threshold chosen for that modality.
"""

hierarchy.add_network_attribute("experiment_description", brief_dengue_dataset_description)


Name: Dengue model - hidef string 12.0 0.7 (GPT-4 annotated) - L2R
Number of nodes: 203
Number of edges: 249
Name: dengue string 12.0 0.7
Number of nodes: 1375
Number of edges: 2792


## Preview the data columns

In [5]:
# csv_path = os.path.join(dirname, "data", "dengue_with_uniprot.csv")
# data = pd.read_csv(csv_path)

excel_path = os.path.join(dirname, "data", "dengue_with_uniprot_full.xlsx")
data = pd.read_excel(excel_path)

# data.columns.to_list()
# Build an identity mapping of column names, usable as a template for renaming
column_dict = {column: column for column in data.columns}
column_dict

{'GeneID': 'GeneID',
 'UniprotID': 'UniprotID',
 'GeneSymbol': 'GeneSymbol',
 'siRNA_GeneSymbol': 'siRNA_GeneSymbol',
 'DV3_24h-Mock_24h': 'DV3_24h-Mock_24h',
 'DV3_48h-Mock_48h': 'DV3_48h-Mock_48h',
 'siRNA_Screen_Average_Zscore': 'siRNA_Screen_Average_Zscore',
 'log2FC': 'log2FC',
 'Condition': 'Condition',
 'GeneSymbol_48hpi': 'GeneSymbol_48hpi',
 'log2FC_48hpi': 'log2FC_48hpi',
 'Condition_48hpi': 'Condition_48hpi',
 'dengue_protein_list': 'dengue_protein_list',
 'dengue_MiST_list': 'dengue_MiST_list',
 'PPI_GeneSymbol': 'PPI_GeneSymbol',
 'viral_interaction': 'viral_interaction',
 'has_siRNA': 'has_siRNA',
 'has_protein_24hr': 'has_protein_24hr',
 'has_protein_48hr': 'has_protein_48hr',
 'has_rnaSeq_24hr': 'has_rnaSeq_24hr',
 'has_rnaSeq_48hr': 'has_rnaSeq_48hr',
 'UniProtID': 'UniProtID',
 'HGNC': 'HGNC'}
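
A mapping like the one previewed above can drive both column selection and renaming in pandas. The sketch below uses a small toy frame with made-up values (only the column names come from the spreadsheet); `column_mapping` and `selected` are illustrative names, not part of the project code.

```python
import pandas as pd

# Toy frame: column names follow the spreadsheet, values are placeholders
data = pd.DataFrame({
    "HGNC": ["GENE_A", "GENE_B"],
    "log2FC_48hpi": [-2.5, None],
    "has_siRNA": [0, 1],
    "Condition": ["a", "b"],
})

# Keys are original column names, values are the names wanted in the dataset
column_mapping = {
    "HGNC": "GeneSymbol",
    "log2FC_48hpi": "rna_log2FC_48h",
    "has_siRNA": "knockdown_inhibits",
}

# Keep only the mapped columns, in mapping order, then rename them
selected = data[list(column_mapping)].rename(columns=column_mapping)
print(selected.columns.to_list())
```

Columns that are not keys of the mapping (here `Condition`) are dropped, so the mapping serves as both a filter and a rename table.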

## Make the Hierarchy object and annotate it with data
- Select the columns and optionally rename them
- Only annotate the assemblies selected by the optional filter, based on name and size range
- Optionally reduce the precision of floats
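
The name and size filter described above could look roughly like the following. This is a sketch under stated assumptions, not the filter implemented by `Hierarchy.add_data_from_file`: `filter_assemblies` is a hypothetical helper, names are assumed to match the `LLM Name` attribute case-insensitively, and assembly size is assumed to be the number of genes in the space-separated `CD_MemberList`.

```python
def filter_assemblies(assemblies, names=None, min_size=None, max_size=None):
    """Hypothetical helper: keep assemblies matching a name list and a size range."""
    wanted = {name.lower() for name in names} if names is not None else None
    selected = []
    for assembly in assemblies:
        attributes = assembly.get("v", {})
        # Assumption: size is the number of genes in the space-separated member list
        size = len(attributes.get("CD_MemberList", "").split())
        label = str(attributes.get("LLM Name", "")).lower()
        if wanted is not None and label not in wanted:
            continue
        if min_size is not None and size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        selected.append(assembly)
    return selected

# Toy assemblies with placeholder gene names (not the real member lists)
toy_assemblies = [
    {"id": 1, "v": {"LLM Name": "Urea cycle and redox homeostasis",
                    "CD_MemberList": "G1 G2 G3"}},
    {"id": 2, "v": {"LLM Name": "RAS Pathway Modulation and Apoptosis Regulation",
                    "CD_MemberList": "G4 G5 G6 G7 G8"}},
    {"id": 3, "v": {"LLM Name": "RNA Metabolism and Viral Defense Mechanism",
                    "CD_MemberList": "G9"}},
]

kept = filter_assemblies(
    toy_assemblies,
    names=["RAS pathway modulation and apoptosis regulation",
           "RNA Metabolism and Viral Defense Mechanism"],
    min_size=2,
    max_size=10,
)
kept_ids = [assembly["id"] for assembly in kept]
print(kept_ids)
```

Each criterion is skipped when its argument is `None`, which mirrors the "optional filter" wording: calling the helper with no arguments returns every assembly.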


In [6]:
import os

dengue_hierarchy = Hierarchy(hierarchy, interactome)

# csv_path = os.path.join(dirname, "data", "dengue_with_uniprot.csv")
excel_path = os.path.join(dirname, "data", "dengue_with_uniprot_full.xlsx")

assembly_list = top_10_assembly_names_df["Community"].to_list()

# This is for testing with just one assembly
#assembly_list = ["RAS pathway modulation and apoptosis regulation"]

# Map original spreadsheet column names to the names used in the datasets
data_columns = {'GeneID': 'GeneID',
                'DV3_24h-Mock_24h': 'protein_log2FC_24h',
                'DV3_48h-Mock_48h': 'protein_log2FC_48h',
                'log2FC': 'rna_log2FC_24h',
                'log2FC_48hpi': 'rna_log2FC_48h',
                'dengue_protein_list': 'binds',
                'has_siRNA': 'knockdown_inhibits',
                'HGNC': 'GeneSymbol'}

print(data_columns)
# Signature: add_data_from_file(self, file_path, key_column='name', columns=None, filter=None, sheet_name=0, delimiter=None)
dengue_assemblies = dengue_hierarchy.add_data_from_file(excel_path,
                                                        key_column="HGNC",
                                                        filter={"names": assembly_list},
                                                        columns=data_columns)

dengue_assemblies[0]

{'GeneID': 'GeneID', 'DV3_24h-Mock_24h': 'protein_log2FC_24h', 'DV3_48h-Mock_48h': 'protein_log2FC_48h', 'log2FC': 'rna_log2FC_24h', 'log2FC_48hpi': 'rna_log2FC_48h', 'dengue_protein_list': 'binds', 'has_siRNA': 'knockdown_inhibits', 'HGNC': 'GeneSymbol'}
mapped key column GeneSymbol is not in data row {'GeneSymbol': 'ZNF724', 'GeneID': 440519, 'rna_log2FC_48h': -2.46518191325064, 'knockdown_inhibits': 0}
mapped key column GeneSymbol is not in data row {'GeneSymbol': 'ZNF724', 'GeneID': 440519, 'rna_log2FC_48h': -2.46518191325064, 'knockdown_inhibits': 0}
mapped key column GeneSymbol is not in data row {'GeneSymbol': 'ZNF724', 'GeneID': 440519, 'rna_log2FC_48h': -2.46518191325064, 'knockdown_inhibits': 0}
mapped key column GeneSymbol is not in data row {'GeneSymbol': 'ZNF724', 'GeneID': 440519, 'rna_log2FC_48h': -2.46518191325064, 'knockdown_inhibits': 0}
mapped key column GeneSymbol is not in data row {'GeneSymbol': 'ZNF724', 'GeneID': 440519, 'rna_log2FC_48h': -2.46518191325064, 'kno

{'id': 5971062,
 'v': {'CD_Labeled': True,
  'CD_AnnotatedAlgorithm': 'Annotated by gProfiler [Docker: coleslawndex/cdgprofilergenestoterm:0.3.0] {{--organism=hsapiens, --maxpval=0.00001, --minoverlap=0.05, --maxgenelistsize=5000}} via CyCommunityDetection Cytoscape App (1.10.0-SNAPSHOT)',
  'name': 'C4493683',
  'CommunityDetectionTally::viral_interaction': 1,
  'CD_AnnotatedMembers_Pvalue': 7.71565013380798e-21,
  'CD_AnnotatedMembers_Size': 11,
  'CD_AnnotatedMembers_Overlap': 0.091,
  'CommunityDetectionTally::has_protein_24hr': 0,
  'CD_MemberList_LogSize': 5.459,
  'CommunityDetectionTally::has_protein_48hr': 2,
  'CommunityDetectionTally::has_siRNA': 19,
  'CD_MemberList': 'BMP2 CAV1 CLDN5 COL15A1 COL6A2 COL6A5 DCC DMP1 ESPNL F13A1 FLNA FSTL1 IGFBP7 ITGA11 ITGA2 ITGA9 ITGAD ITGB3 ITGB8 LAMA3 LAMC2 MSN MSX1 NBL1 NCF1 NEO1 OSR2 PF4V1 PHEX PLS1 PTPN13 RND3 SAT1 SELP SERPINH1 SMAD6 SNX17 SPARC STAB1 STOM SYNPO2 UNC5C VCAM1 VWF',
  'LLM Name': 'Extracellular Matrix Organization and C

In [7]:
# properties = dengue_assemblies[0].get("v")
# data = json.loads(properties.get("data"))
# for gene, properties in data.items():
#     print(f'{gene}  :  {properties}')

# thing = data_dict_to_csv(data, 
#                        columns={'GeneID': 'GeneID',
#                         'rna_log2FC-24h': 'rna_log2FC-24h',
#                         'rna_log2FC_48h': 'rna_log2FC_48h',
#                         'protein_log2FC_24h': 'protein_log2FC_24h',
#                         'protein_log2FC_48h': 'protein_log2FC_48h',
#                         'binds': 'binds',
#                         'knockdown_inhibits': 'knockdown_inhibits',
#                         'name': 'GeneSymbol'}, 
#                         decimal_places=None)

# thing

## Generate Datasets for a list of assemblies
- Optionally select columns
- Collect the dataset ids to use elsewhere
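
To make the CSV output concrete, the sketch below turns a per-gene data dictionary into CSV text with selected, renamed columns and optional rounding. It is only an illustration: `rows_to_csv` is a hypothetical helper, not the project's `dataset_from_assembly` or `data_dict_to_csv`, and the assumption that the gene symbol is exposed under the key `name` follows the column mapping used in this notebook.

```python
import csv
import io

def rows_to_csv(data, columns, decimal_places=None):
    """Hypothetical helper: write {gene: {column: value}} as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.values())  # header uses the renamed columns
    for gene, properties in data.items():
        # Assumption: the gene symbol is made available under the key "name"
        record = {"name": gene, **properties}
        row = []
        for key in columns:
            value = record.get(key, "")  # missing measurements become empty cells
            if isinstance(value, float) and decimal_places is not None:
                value = round(value, decimal_places)
            row.append(value)
        writer.writerow(row)
    return buffer.getvalue()

# One gene, using values that appear in the notebook output above
example_data = {
    "ZNF724": {"GeneID": 440519, "rna_log2FC_48h": -2.46518191325064, "knockdown_inhibits": 0},
}
example_columns = {
    "name": "GeneSymbol",
    "GeneID": "GeneID",
    "rna_log2FC_48h": "rna_log2FC_48h",
    "protein_log2FC_48h": "protein_log2FC_48h",
}

csv_text = rows_to_csv(example_data, example_columns, decimal_places=2)
print(csv_text)
```

Leaving missing measurements as empty cells matches the experiment description, which states that missing values mean the change was below the significance threshold.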

In [8]:
from models.hierarchy import dataset_from_assembly

dataset_ids = []

# Keys must match the column names assigned when the data was added to the hierarchy
dataset_columns = {'GeneID': 'GeneID',
                   'rna_log2FC_24h': 'rna_log2FC_24h',
                   'rna_log2FC_48h': 'rna_log2FC_48h',
                   'protein_log2FC_24h': 'protein_log2FC_24h',
                   'protein_log2FC_48h': 'protein_log2FC_48h',
                   'binds': 'binds',
                   'knockdown_inhibits': 'knockdown_inhibits',
                   'name': 'GeneSymbol'}

for assembly in dengue_assemblies:
    dataset = dataset_from_assembly(db, assembly, 
                                    type="csv",
                                    columns=dataset_columns,
                                    experiment_description=brief_dengue_dataset_description)
    print(dataset.name)
    dataset_ids.append(dataset.object_id)
 

Extracellular Matrix Organization and Cell Adhesion in Response to Viral Infection
Extracellular Matrix Organization and Cell Adhesion in Response to Viral Infection
Urea cycle and redox homeostasis
DNA Repair and Maintenance in Response to Viral Infection
RAS Pathway Modulation and Apoptosis Regulation
Oxidative Stress Response and Protein Quality Control in Neurodegeneration
RNA Metabolism and Viral Defense Mechanism
Kynurenine Pathway Modulation and Immune Response Regulation
RNA Metabolism and Modification in Viral Infection
Endothelial Barrier Function and Viral Entry Modulation
