## Instructions for Using the Script and Data

__Data Location__: /data/sjw6257/xDTD_database/zip_files/tables (`.tar.gz` file of data)

__Description__: 
This script performs a schema analysis of each model's Path Result (mechanism of action). The data for each model are stored in KG databases (SQLite). There are three models: ExplainableDTD_v1.3_KG2.8.0.1, ExplainableDTD_v1.0_KG2.8.3, and ExplainableDTD_v1.0_KG2.8.6. Before running the analysis, extract `PATH_RESULT_TABLE` from each database as a CSV file.
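The export step described above might look like the following minimal sketch. It assumes the table is literally named `PATH_RESULT_TABLE` inside each SQLite file; the helper name `export_path_results` is hypothetical and not part of the original script.

```python
import sqlite3

import pandas as pd


def export_path_results(db_path, csv_path, table="PATH_RESULT_TABLE"):
    """Dump one table from a SQLite database to a CSV file.

    `table` is assumed to be the name used inside the xDTD databases;
    adjust it if the actual schema differs.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so the identifier is quoted.
        df = pd.read_sql_query(f'SELECT * FROM "{table}"', conn)
    finally:
        conn.close()
    df.to_csv(csv_path, index=False)
    return df
```

A call would take the form `export_path_results(<database file>, 'v1.0_KG2.8.6_PathResult_table.csv')`, where the database file name depends on what the archive actually contains.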

Please refer to the code below as an example of the schema analysis:

In [None]:
# Set working directory
# Copy the compressed data file (tar.gz) to the working directory before starting

import os
os.chdir('/home/grads/sjw6257/xDTD/xDTD_analysis')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
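Once the compressed file has been copied into the working directory, it can be unpacked from Python. This is a minimal sketch using the standard-library `tarfile` module; the helper name `extract_archive` is hypothetical, and the archive's actual file name is not given here, so it is left as a parameter.

```python
import tarfile


def extract_archive(archive_path, dest_dir="."):
    """Unpack a .tar.gz archive into dest_dir and return the member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest_dir)
    return names
```

The returned list of member names is a quick way to check which files the archive contained.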

### Extract Path and Path Score

In [None]:
df_2801 = pd.read_csv('v1.3_KG2.8.0.1_PathResult_table.csv')
df_2801 = df_2801[['path','path_score']]
#df_2801

In [None]:
df_283 = pd.read_csv('v1.0_KG2.8.3_refresh_PathResult_table.csv')
df_283 = df_283[['path','path_score']]
#df_283

In [None]:
df_286 = pd.read_csv('v1.0_KG2.8.6_PathResult_table.csv')
df_286 = df_286[['path','path_score']]
#df_286

### Model v2.8.0.1 versus Model v2.8.3 Comparison

In [None]:
# Looking at pathways that are present in BOTH KG2.8.0.1 & KG2.8.3
df_2801_intr_283 = pd.merge(df_2801,df_283, how='inner', on=['path'],suffixes=('_2801','_283')) 
df_2801_intr_283

In [None]:
# Share of each table's paths that also appear in the intersection

percent_match_1 = (len(df_2801_intr_283) / len(df_2801)) * 100
percent_match_2 = (len(df_2801_intr_283) / len(df_283)) * 100
print(f"\nPercentage of paths in df_2801 that match: {percent_match_1:.2f}%")
print(f"\nPercentage of paths in df_283 that match: {percent_match_2:.2f}%")
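The merge-based percentages above are only accurate if each `path` value occurs once per table, because an inner merge repeats rows when a key is duplicated on either side. A minimal sketch of a set-based alternative follows. The helper name `overlap_percentages` is hypothetical, and the toy frames are made-up stand-ins for `df_2801` and `df_283`.

```python
import pandas as pd


def overlap_percentages(left, right, key="path"):
    """Return (% of unique left keys found in right, % of unique right keys found in left)."""
    left_keys = set(left[key].dropna())
    right_keys = set(right[key].dropna())
    shared = left_keys & right_keys
    pct_left = 100 * len(shared) / len(left_keys) if left_keys else 0.0
    pct_right = 100 * len(shared) / len(right_keys) if right_keys else 0.0
    return pct_left, pct_right


# Toy frames with invented values; note the duplicated "p2" in demo_a.
demo_a = pd.DataFrame({"path": ["p1", "p2", "p2", "p3"], "path_score": [0.9, 0.8, 0.7, 0.6]})
demo_b = pd.DataFrame({"path": ["p2", "p3", "p4", "p5"], "path_score": [0.5, 0.4, 0.3, 0.2]})
print(overlap_percentages(demo_a, demo_b))
```

Because the comparison runs on unique keys, the percentages cannot exceed 100 even when a table lists the same path more than once.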

### Model v2.8.0.1 versus Model v2.8.6 Comparison

In [None]:
# Looking at pathways that are present in BOTH KG2.8.0.1 & KG2.8.6
df_2801_intr_286 = pd.merge(df_2801,df_286, how='inner', on=['path'],suffixes=('_2801','_286'))
df_2801_intr_286

In [None]:
# Share of each table's paths that also appear in the intersection

percent_match_1 = (len(df_2801_intr_286) / len(df_2801)) * 100
percent_match_2 = (len(df_2801_intr_286) / len(df_286)) * 100
print(f"\nPercentage of paths in df_2801 that match: {percent_match_1:.2f}%")
print(f"\nPercentage of paths in df_286 that match: {percent_match_2:.2f}%")

### Model v2.8.6 versus Model v2.8.3 Comparison

In [None]:
# Looking at pathways that are present in BOTH KG2.8.6 & KG2.8.3
df_283_intr_286 = pd.merge(df_283,df_286, how='inner', on=['path'],suffixes=('_283','_286'))
df_283_intr_286

In [None]:
# Share of each table's paths that also appear in the intersection

percent_match_1 = (len(df_283_intr_286) / len(df_283)) * 100
percent_match_2 = (len(df_283_intr_286) / len(df_286)) * 100
print(f"\nPercentage of paths in df_283 that match: {percent_match_1:.2f}%")
print(f"\nPercentage of paths in df_286 that match: {percent_match_2:.2f}%")

### Paths present in ALL three models

In [None]:
# Looking at pathways that are present in ALL THREE: KG2.8.0.1, KG2.8.3_refresh, and KG2.8.6
# Rename each score column first so it keeps its KG version after merging
df = (
    df_2801.rename(columns={'path_score': 'path_score_2801'})
    .merge(df_283.rename(columns={'path_score': 'path_score_283'}), on='path')
    .merge(df_286.rename(columns={'path_score': 'path_score_286'}), on='path')
)

df_all = df[['path']]
df_all

In [None]:
# Share of each table's paths that appear in all three tables
# (the loop variable is named `frame` so it does not overwrite the merged `df` above)
dataframes = {'df_2801': df_2801, 'df_283': df_283, 'df_286': df_286}
for name, frame in dataframes.items():
    percent_match = (len(df_all) / len(frame)) * 100
    print(f"\nPercentage of matching paths in {name}: {percent_match:.2f}%")


### Venn Diagram

In [None]:
## Install the matplotlib-venn package
#!pip install matplotlib-venn 

In [None]:
import matplotlib.pyplot as plt
from matplotlib_venn import venn3

# Count the paths in each table and in each intersection
A, B, C = len(df_2801), len(df_283), len(df_286)
AB, AC, BC, ABC = len(df_2801_intr_283), len(df_2801_intr_286), len(df_283_intr_286), len(df_all)

# Exclusive region sizes (inclusion-exclusion), keyed by region ID
regions = {'100': A - AB - AC + ABC, '010': B - AB - BC + ABC, '001': C - AC - BC + ABC,
           '110': AB - ABC, '101': AC - ABC, '011': BC - ABC, '111': ABC}

# Create the Venn diagram. venn3 expects the seven exclusive region sizes in the
# order (Abc, aBc, ABc, abC, AbC, aBC, ABC), not the total set sizes.
plt.figure(figsize=(8, 8))
order = ('100', '010', '110', '001', '101', '011', '111')
venn_diagram = venn3(subsets=tuple(regions[r] for r in order), set_labels=('KG2.8.0.1', 'KG2.8.3', 'KG2.8.6'))

plt.title("Comparison by Path Result")
plt.show()

**Note:**
The current `matplotlib-venn` package cannot draw a proper diagram for certain fringe cases, e.g. sets nested inside one another, without showing the 0 values.
We recommend drawing the three-way Venn diagram manually or using another illustration tool to generate the figures.
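
For a manual drawing, the seven region counts can be computed directly from the sets of paths. The sketch below is one way to do that. `venn3_region_counts` is a hypothetical helper, not part of `matplotlib-venn`, and the toy sets stand in for `set(df_2801['path'])`, `set(df_283['path'])`, and `set(df_286['path'])`.

```python
def venn3_region_counts(a, b, c):
    """Count elements in each of the seven exclusive regions of a three-set Venn diagram.

    Keys follow the membership-string convention: '110' means in a and b but not in c.
    """
    a, b, c = set(a), set(b), set(c)
    return {
        "100": len(a - b - c),
        "010": len(b - a - c),
        "001": len(c - a - b),
        "110": len((a & b) - c),
        "101": len((a & c) - b),
        "011": len((b & c) - a),
        "111": len(a & b & c),
    }


# Toy example with made-up path identifiers.
counts = venn3_region_counts({"p1", "p2", "p3"}, {"p2", "p3", "p4"}, {"p3", "p5"})
print(counts)
```

Regions with a count of 0 can then simply be left out of the hand-drawn figure.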