# Exploratory Data Analysis: Networks Dataframe (from Omnipath database)

[//]: # (------------------------------------------    DO NOT MODIFY THIS    ------------------------------------------)
<style type="text/css">
.tg  {border-collapse:collapse;
      border-spacing:0;
     }
.tg td{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg th{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       font-weight:normal;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg .tg-fymr{border-color:inherit;
             font-weight:bold;
             text-align:left;
             vertical-align:top
            }
.tg .tg-0pky{border-color:inherit;
             text-align:left;
             vertical-align:top
            }
[//]: # (--------------------------------------------------------------------------------------------------------------)

[//]: # (-------------------------------------    FILL THIS OUT WITH YOUR DATA    -------------------------------------)
</style>
<table class="tg">
    <tbody>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Title:</td>
        <td class="tg-0pky">Exploratory Data Analysis: Networks Dataframe (from Omnipath database)</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Authors:</td>
        <td class="tg-0pky">
            <a href="https://github.com/ecarrenolozano" target="_blank" rel="noopener noreferrer">Edwin Carreño</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Affiliations:</td>
        <td class="tg-0pky">
            <a href="https://www.ssc.uni-heidelberg.de/en" target="_blank" rel="noopener noreferrer">Scientific Software Center</a>,
            <a href="https://saezlab.org/" target="_blank" rel="noopener noreferrer">Saez-Rodriguez Group</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Date Created:</td>
        <td class="tg-0pky">19.03.2025</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Description:</td>
        <td class="tg-0pky">Extraction of metadata for building database tables </td>
      </tr>
    </tbody>
</table>

[//]: # (--------------------------------------------------------------------------------------------------------------)

## Overview

This notebook explores the information contained in the "Networks" dataset from the Omnipath database.

## Setup (if required)

If your code requires dependencies to be installed before the main code runs, please add the installation commands here.

### Pandas installation

In [5]:
%pip install pandas -q

Note: you may need to restart the kernel to use updated packages.


## Importing Libraries

Recommendations:

- Respect the order of the imports; the groups are indicated by the numbers *1, 2, 3*.
- One import per line is recommended, which makes any modified line easy to track with git.
- Absolute imports are recommended (see *3. Local application/library specific imports* below); they improve readability and give better error messages.
- Put a blank line between each group of imports.

In [11]:
# 1. Standard library imports
import os

# 2. Related third party imports
import numpy as np
import pandas as pd

# 3. Local application/library specific imports
# import <mypackage>.<MyClass>         # this is an example
# from <mypackage> import <MyClass>    # this is another example 

## Introduction

TO DO


## Section 1. Load "Networks" dataset

### Section 1.1. Setting dataset path

In [45]:
dataset_path_networks = os.path.join("../data/omnipath_networks/omnipath_webservice_interactions__latest.tsv.gz")
# dataset_path_networks = os.path.join("../data_testing/subset_networks_1000.tsv")

In [31]:
print("Does this file exist? {}".format(os.path.exists(dataset_path_networks)))

Does this file exist? True


### Section 1.2. Load dataset as Pandas DataFrame

#### Configuring Pandas view

In [12]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)  

#### Load data into Pandas Dataframe (without predefined data types)

By default, the option *keep_default_na* is True, which means that Pandas interprets empty fields and common null markers (for example, "NA" or "NaN") as NaN values.

In [46]:
networks_df = pd.read_table(dataset_path_networks, sep="\t", keep_default_na=True)
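
The effect of *keep_default_na* can be checked on a tiny in-memory table. This is a minimal sketch: the table `toy_tsv` and its values are made up for illustration and are not taken from the Omnipath file.

```python
import io

import pandas as pd

# Hypothetical two-row table: the second row has an empty field and the literal text "NA".
toy_tsv = "source\treferences\tlevel\nP1\tPMID:1\tA\nP2\t\tNA\n"

with_default = pd.read_table(io.StringIO(toy_tsv), keep_default_na=True)
without_default = pd.read_table(io.StringIO(toy_tsv), keep_default_na=False)

# With the default, both the empty field and "NA" are parsed as missing values.
print(with_default.isna().sum().to_dict())
# With keep_default_na=False (and no custom na_values), they are kept as strings.
print(without_default.isna().sum().to_dict())
```

With *keep_default_na=False* the empty field is read as an empty string, so null checks such as `isnull()` no longer flag it.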



In [33]:
networks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 36 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   source                 1000 non-null   object 
 1   target                 1000 non-null   object 
 2   source_genesymbol      1000 non-null   object 
 3   target_genesymbol      1000 non-null   object 
 4   is_directed            1000 non-null   int64  
 5   is_stimulation         1000 non-null   int64  
 6   is_inhibition          1000 non-null   int64  
 7   consensus_direction    1000 non-null   int64  
 8   consensus_stimulation  1000 non-null   int64  
 9   consensus_inhibition   1000 non-null   int64  
 10  sources                1000 non-null   object 
 11  references             987 non-null    object 
 12  omnipath               1000 non-null   bool   
 13  kinaseextra            1000 non-null   bool   
 14  ligrecextra            1000 non-null   bool   
 15  pathw

## Section 2. Metadata

We are interested in a table with the following information for each column:


- Column name.
- Data type.
- Whether the column contains null values.
- Number of unique values.

The cell in Section 2.2 builds this table:

### Section 2.1. Overview

### Section 2.2. Unique values per column

In [31]:
metadata = pd.DataFrame({
    'Column Name': networks_df.columns,
    'Data Type': networks_df.dtypes.values,
    'Nullable': networks_df.isnull().any().values,
    'Unique Values': [networks_df[col].nunique() for col in networks_df.columns]
})

metadata

   Column Name            Data Type  Nullable  Unique Values
0  source                 object     False     27944
1  target                 object     False     47976
2  source_genesymbol      object     False     22117
3  target_genesymbol      object     False     36930
4  is_directed            int64      False     2
5  is_stimulation         int64      False     2
6  is_inhibition          int64      False     2
7  consensus_direction    int64      False     2
8  consensus_stimulation  int64      False     2
9  consensus_inhibition   int64      False     2
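
The same metadata table can be wrapped in a small helper and extended with a null count per column. This is a sketch only: the function name `build_metadata`, the extra "Null Count" column, and the toy DataFrame are assumptions added for illustration.

```python
import numpy as np
import pandas as pd


def build_metadata(df):
    """Summarise each column: dtype, nullability, null count, unique values."""
    return pd.DataFrame({
        "Column Name": df.columns,
        "Data Type": df.dtypes.values,
        "Nullable": df.isnull().any().values,
        "Null Count": df.isnull().sum().values,
        # nunique() ignores NaN by default, so missing values are not counted as a value.
        "Unique Values": [df[col].nunique() for col in df.columns],
    })


# Hypothetical stand-in for networks_df.
toy_df = pd.DataFrame({
    "source": ["P1", "P2", "P1"],
    "references": ["PMID:1", np.nan, np.nan],
})
toy_metadata = build_metadata(toy_df)
print(toy_metadata)
```

On the real data the call would be `build_metadata(networks_df)`.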


To list all the unique values in a column, assign the column's name to the variable *field*:

In [32]:
field = "dorothea_curated"

print("List of unique values in field: {}\n\t{}".format(field, networks_df[field].unique()))

List of unique values in field: dorothea_curated
	[nan True False 'True' 'False' '1']
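
The output above mixes real booleans with the strings 'True', 'False' and '1'. One possible way to bring such a column to a single nullable boolean type is sketched below. The helper `to_nullable_bool`, the accepted string sets, and the choice to treat unrecognised text as missing are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Assumed spellings for true and false values (compared in lower case).
TRUE_STRINGS = {"true", "1"}
FALSE_STRINGS = {"false", "0"}


def to_nullable_bool(value):
    """Map one raw cell to True, False, or pd.NA."""
    if pd.isna(value):
        return pd.NA
    if isinstance(value, str):
        text = value.strip().lower()
        if text in TRUE_STRINGS:
            return True
        if text in FALSE_STRINGS:
            return False
        return pd.NA  # assumption: unrecognised text is treated as missing
    return bool(value)


# Same kinds of values as in the unique() output above.
raw = pd.Series([np.nan, True, False, "True", "False", "1"], dtype=object)
normalised = raw.map(to_nullable_bool).astype("boolean")
print(normalised.tolist())
```

Section 4 reaches a similar result at load time by passing the 'boolean' dtype to the reader.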


## Section 3. Free Exploratory Analysis

In this section you can explore the data freely: filter rows, select columns, count values, and so on. Feel free to explore as much as you want.

In [36]:
value_counts_per_column = {col: networks_df[col].value_counts() for col in networks_df.columns}

In [37]:
value_counts_per_column['type']

type
transcriptional                866068
post_translational             331753
post_transcriptional            11118
mirna_transcriptional            4961
small_molecule_protein           3819
lncrna_post_transcriptional       181
Name: count, dtype: int64

In [22]:
def flatten(xss):
    return [x for xs in xss for x in xs]

out = pd.Series(flatten(networks_df.sources.str.split(";").tolist()))

In [39]:
networks_df.sources.value_counts()["Wang"]

np.int64(10)
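
Note that `value_counts()` on *sources* counts rows whose value is exactly "Wang", while the flattened series counts every occurrence of a resource inside the ";"-separated lists. A minimal sketch of the difference, using a made-up DataFrame and `explode()` as an alternative to the `flatten` helper:

```python
import pandas as pd

# Hypothetical stand-in for networks_df with a ";"-separated sources column.
toy_df = pd.DataFrame({
    "sources": ["Wang", "CancerDrugsDB;SIGNOR", "SIGNOR;Wang", "SIGNOR"],
})

# Counts of the full, unsplit strings.
exact_counts = toy_df["sources"].value_counts()

# Split each string into a list, put every resource on its own row, then count.
per_resource_counts = toy_df["sources"].str.split(";").explode().value_counts()

print(exact_counts["Wang"])         # rows whose sources value is exactly "Wang"
print(per_resource_counts["Wang"])  # occurrences of "Wang" across all rows
```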

### Listing Null values in references

In [None]:
networks_df.references[networks_df['references'].isnull()]

### Counting Null values in references

In [None]:
num_nulls_in_references = networks_df['references'].isnull().sum()
num_nulls_in_references

### Counting True values

In [None]:
num_True_in_dorothea_curated = (networks_df.dorothea_curated==True).sum()
num_True_in_dorothea_curated

### Counting "True" values

In [None]:
num_true_in_dorothea_curated = (networks_df.dorothea_curated=="True").sum()
num_true_in_dorothea_curated

### Counting "1" values

In [None]:
num_one_in_dorothea_curated = (networks_df.dorothea_curated=="1").sum()
num_one_in_dorothea_curated

### Counting False values

In [None]:
num_False_in_dorothea_curated = (networks_df.dorothea_curated==False).sum()
num_False_in_dorothea_curated

### Counting "False" values

In [None]:
num_false_in_dorothea_curated = (networks_df.dorothea_curated=="False").sum()
num_false_in_dorothea_curated
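
The separate counts above can also be gathered in a single tally. The sketch below maps every value through `repr()` so that the boolean True and the string 'True' get different labels. The toy series is an assumption standing in for `networks_df.dorothea_curated`.

```python
import numpy as np
import pandas as pd

# Hypothetical column with the same kinds of values as dorothea_curated.
toy_column = pd.Series([np.nan, True, False, "True", "False", "1", True], dtype=object)

# repr() keeps the boolean True and the string 'True' apart in the index labels;
# missing values appear under the label "nan".
tally = toy_column.map(repr).value_counts()
print(tally)
```

In the tally, labels with quotes (for example `'True'`) are strings, and labels without quotes are booleans or NaN.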

### Filtering

In [None]:
filtered = networks_df[(networks_df["source"]=="Q16254")
                         & (networks_df["target"]=="O43683")]
#filtered[["source", "target", "is_stimulation", "omnipath"]]

filtered

In [None]:
omnipath_df = networks_df[(networks_df["omnipath"])==True]
omnipath_df.info()

In [None]:
omnipath_df[(omnipath_df["source"])==(omnipath_df["target"])]

### Finding the neighbors of a certain protein

In [None]:
neighbors = networks_df[(networks_df['source_genesymbol']=='NF1') | (networks_df['target_genesymbol']=='NF1')]
neighbors

Unnamed: 0,source,target,source_genesymbol,target_genesymbol,is_directed,is_stimulation,is_inhibition,consensus_direction,consensus_stimulation,consensus_inhibition,...,dorothea_coexp,dorothea_level,type,curation_effort,extra_attrs,evidences,ncbi_tax_id_source,entity_type_source,ncbi_tax_id_target,entity_type_target
2843,P21359,P01112,NF1,HRAS,1,1,1,1,0,1,...,,,post_translational,7,"{""SPIKE_effect"":""1"",""SPIKE_mechanism"":""Other"",...","{""id_a"":""P21359"",""id_b"":""P01112"",""positive"":[{...",9606,protein,9606,protein
5161,P17947,P21359,SPI1,NF1,1,0,0,0,0,0,...,,,post_translational,2,"{""SPIKE_effect"":""3"",""SPIKE_mechanism"":""Transcr...","{""id_a"":""P17947"",""id_b"":""P21359"",""positive"":[]...",9606,protein,9606,protein
11368,P21359,P01111,NF1,NRAS,1,0,1,1,0,1,...,,,post_translational,2,"{""CA1_effect"":""_"",""CA1_type"":""Binding""}","{""id_a"":""P21359"",""id_b"":""P01111"",""positive"":[]...",9606,protein,9606,protein
71568,Q9UBF6,P21359,RNF7,NF1,1,0,1,1,0,1,...,,,post_translational,1,"{""SIGNOR_mechanism"":[""ubiquitination""]}","{""id_a"":""Q9UBF6"",""id_b"":""P21359"",""positive"":[]...",9606,protein,9606,protein
81529,P17252,P21359,PRKCA,NF1,1,0,0,0,0,0,...,,,post_translational,2,"{""PhosphoSite_evidence"":[""AB"",""WB"",""MA""]}","{""id_a"":""P17252"",""id_b"":""P21359"",""positive"":[]...",9606,protein,9606,protein
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1215792,10288191,P21359,BINIMETINIB,NF1,1,0,0,0,0,0,...,,,small_molecule_protein,0,{},"{""id_a"":""10288191"",""id_b"":""P21359"",""positive"":...",-1,small_molecule,9606,protein
1216019,16222096,P21359,COTELLIC,NF1,1,0,0,0,0,0,...,,,small_molecule_protein,0,{},"{""id_a"":""16222096"",""id_b"":""P21359"",""positive"":...",-1,small_molecule,9606,protein
1216108,44462760,P21359,DABRAFENIB,NF1,1,0,0,0,0,0,...,,,small_molecule_protein,0,{},"{""id_a"":""44462760"",""id_b"":""P21359"",""positive"":...",-1,small_molecule,9606,protein
1217393,11707110,P21359,TRAMETINIB,NF1,1,0,0,0,0,0,...,,,small_molecule_protein,0,{},"{""id_a"":""11707110"",""id_b"":""P21359"",""positive"":...",-1,small_molecule,9606,protein
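
If only the names of the neighbors are needed, the same filter can be reduced to a set of gene symbols. This is a sketch: the helper `neighbor_symbols` is hypothetical, and the toy edge list reuses three NF1 rows from the output above plus one made-up unrelated edge.

```python
import pandas as pd


def neighbor_symbols(df, gene):
    """Return the gene symbols that share at least one interaction with `gene`."""
    outgoing = df.loc[df["source_genesymbol"] == gene, "target_genesymbol"]
    incoming = df.loc[df["target_genesymbol"] == gene, "source_genesymbol"]
    return set(outgoing) | set(incoming)


# Hypothetical stand-in for networks_df.
toy_edges = pd.DataFrame({
    "source_genesymbol": ["NF1", "SPI1", "NF1", "TP53"],
    "target_genesymbol": ["HRAS", "NF1", "NRAS", "MDM2"],
})

nf1_neighbors = neighbor_symbols(toy_edges, "NF1")
print(sorted(nf1_neighbors))
```

A self-interaction (source equal to target) would put the gene itself in the set; whether that is wanted depends on the analysis.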


In [52]:
neighbors.type.value_counts()

type
transcriptional           52
post_translational        27
post_transcriptional       6
small_molecule_protein     5
mirna_transcriptional      2
Name: count, dtype: int64

In [53]:
neighbors.sources.value_counts()

sources
Wang                                                                                                                                                              16
CollecTRI;ExTRI_CollecTRI                                                                                                                                         16
DoRothEA;PAZAR_DoRothEA                                                                                                                                            9
miRTarBase                                                                                                                                                         6
CancerDrugsDB                                                                                                                                                      5
CollecTRI;DoRothEA;TRRUST_CollecTRI;TRRUST_DoRothEA                                                                                                                2
Co

In [56]:
trametinib_interactions = networks_df[(networks_df['source_genesymbol']=='TRAMETINIB')]
trametinib_interactions.sources.value_counts()

sources
CancerDrugsDB           30
CancerDrugsDB;SIGNOR     2
Name: count, dtype: int64

In [None]:
small_molecule_protein_interactions = networks_df[(networks_df['type']=='small_molecule_protein')]
small_molecule_protein_interactions.source_genesymbol.value_counts()

source_genesymbol
TOLAK                        63
BORTEZOMIB                   63
OTREXUP                      61
GEMCITABINE                  54
KYPROLIS                     50
                             ..
ALOGLIPTIN                    1
68848                         1
115111                        1
NARATRIPTAN HYDROCHLORIDE     1
5312830                       1
Name: count, Length: 946, dtype: int64

In [59]:
small_molecule_protein_interactions.sources.value_counts()

sources
CancerDrugsDB                                     1906
SIGNOR                                            1405
Cellinker;Guide2Pharma_Cellinker                   308
CancerDrugsDB;SIGNOR                               194
Cellinker;Guide2Pharma_Cellinker;SIGNOR              5
CancerDrugsDB;Cellinker;Guide2Pharma_Cellinker       1
Name: count, dtype: int64

In [61]:
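small_molecule_protein_interactions_filename = "../data/omnipath_networks/omnipath_webservice_interactions__small_molecule_interactions.tsv.gz"
# Note: the result of the next filter is not assigned, so it does not affect what is saved.
small_molecule_protein_interactions[small_molecule_protein_interactions['sources']!="CancerDrugsDB"]
# Write all small-molecule interactions as a gzip-compressed TSV (compression inferred from the ".gz" suffix).
small_molecule_protein_interactions.to_csv(small_molecule_protein_interactions_filename, sep="\t", compression='infer')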
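
In the cell above, the filter on *sources* is evaluated but its result is not assigned, so the saved file contains all small-molecule interactions. If the intent was to exclude rows whose only source is CancerDrugsDB (an assumption, the notebook does not say), the filtered frame has to be assigned before saving. A minimal sketch on made-up data, writing to a temporary directory instead of the real path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for small_molecule_protein_interactions.
toy = pd.DataFrame({
    "source_genesymbol": ["TRAMETINIB", "BORTEZOMIB", "TRAMETINIB"],
    "sources": ["CancerDrugsDB", "SIGNOR", "CancerDrugsDB;SIGNOR"],
})

# Assign the filtered frame; a bare filter expression would be discarded.
without_cancerdrugsdb_only = toy[toy["sources"] != "CancerDrugsDB"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_small_molecule_interactions.tsv.gz")
    # index=False leaves the row index out of the file; gzip is inferred from ".gz".
    without_cancerdrugsdb_only.to_csv(out_path, sep="\t", index=False, compression="infer")
    # Read the file back to confirm what was written.
    round_trip = pd.read_table(out_path)

print(round_trip)
```

Passing `index=False` is a design choice of this sketch: without it the row index is written as an extra unnamed first column.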

## Section 4. Load the dataset with predefined data types.

In [36]:
# Data types for interactions
dtype = {'source': 'string',
         'target': 'string',
         'source_genesymbol': 'string',
         'target_genesymbol': 'string',
         'is_directed': 'boolean',
         'is_stimulation': 'boolean',
         'is_inhibition': 'boolean',
         'consensus_direction': 'boolean',
         'consensus_stimulation': 'boolean',
         'consensus_inhibition': 'boolean',
         'sources': 'string',
         'references': 'string',
         'omnipath': 'boolean',
         'kinaseextra': 'boolean',
         'ligrecextra': 'boolean',
         'pathwayextra': 'boolean',
         'mirnatarget': 'boolean',
         'dorothea': 'boolean',
         'collectri': 'boolean',
         'tf_target': 'boolean',
         'lncrna_mrna': 'boolean',
         'tf_mirna': 'boolean',
         'small_molecule': 'boolean',
         'dorothea_curated': 'boolean',
         'dorothea_chipseq': 'boolean',
         'dorothea_tfbs': 'boolean',
         'dorothea_coexp': 'boolean',
         'dorothea_level': 'string',
         'type': 'string',
         'curation_effort': 'Int64',
         'extra_attrs': 'string',
         'evidences': 'string',
         'ncbi_tax_id_source': 'Int64',
         'entity_type_source': 'string',
         'ncbi_tax_id_target': 'Int64',
         'entity_type_target': 'string'
}

In [37]:
networks_df = pd.read_table(dataset_path_networks, dtype=dtype)
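
How the nullable types behave at load time can be checked on a small in-memory table. In this sketch the table `toy_tsv`, its column names, and its values are made up; only the dtype names ('string', 'boolean', 'Int64') come from the dictionary above.

```python
import io

import pandas as pd

# Hypothetical table mixing 1/0, True/False and an empty field in one flag column.
toy_tsv = (
    "source\tflag\tcuration_effort\n"
    "P1\t1\t7\n"
    "P2\tFalse\t\n"
    "P3\t\t2\n"
    "P4\tTrue\t1\n"
)

toy_dtype = {"source": "string", "flag": "boolean", "curation_effort": "Int64"}
toy_typed = pd.read_table(io.StringIO(toy_tsv), dtype=toy_dtype)

# Empty fields become <NA> instead of forcing the column to object or float.
print(toy_typed.dtypes)
print(toy_typed)
```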

In [38]:
networks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1217900 entries, 0 to 1217899
Data columns (total 36 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   source                 1217900 non-null  string 
 1   target                 1217900 non-null  string 
 2   source_genesymbol      1217900 non-null  string 
 3   target_genesymbol      1217900 non-null  string 
 4   is_directed            1217900 non-null  boolean
 5   is_stimulation         1217900 non-null  boolean
 6   is_inhibition          1217900 non-null  boolean
 7   consensus_direction    1217900 non-null  boolean
 8   consensus_stimulation  1217900 non-null  boolean
 9   consensus_inhibition   1217900 non-null  boolean
 10  sources                1217900 non-null  string 
 11  references             413020 non-null   string 
 12  omnipath               1217900 non-null  boolean
 13  kinaseextra            1217900 non-null  boolean
 14  ligrecextra       

Note that specifying the data types reduced the in-memory size of the dataset by **21.83%**.
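
The notebook does not show how this percentage was measured. One plausible way, sketched here as an assumption, is to compare `memory_usage(deep=True)` before and after typing. The helper `memory_reduction_percent` and the toy data are hypothetical.

```python
import pandas as pd


def memory_reduction_percent(before_df, after_df):
    """Percentage by which after_df is smaller in memory than before_df."""
    before = before_df.memory_usage(deep=True).sum()
    after = after_df.memory_usage(deep=True).sum()
    return 100.0 * (before - after) / before


# Hypothetical stand-in: 0/1 flags and small counts stored as int64 by default.
toy_raw = pd.DataFrame({
    "is_directed": [1, 0, 1, 1] * 250,
    "curation_effort": [7, 2, 2, 1] * 250,
})
toy_typed = toy_raw.astype({"is_directed": "boolean", "curation_effort": "Int64"})

reduction = memory_reduction_percent(toy_raw, toy_typed)
print(f"Memory reduced by {reduction:.2f}%")
```

On the real data, the two frames would be the result of loading without and with the `dtype` dictionary. The saving in this sketch comes from the 'boolean' column, while 'Int64' adds a small mask on top of the int64 values.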