# Graphistasis

![Graphistasis](assets/Logo.png)

Using a Graph Neural Network to mine gene-gene (epistatic) interactions for neurologic disease in the BioSNAP DGMiner dataset.

### Set up .venv
```bash
brew install uv
uv init
uv venv --python 3.11
source .venv/bin/activate

uv add torch torch_geometric pandas numpy matplotlib seaborn scikit-learn tqdm pip ipykernel biopython pymedtermino networkx requests

# pip install pyg-lib
```

### Import libraries

```python
from scripts.utils import *
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm
import json
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import torch
import torch_geometric
```

### Run data setup
```bash
bash scripts/unpack.sh
```

### Filter the dataset

```python
# All genes
all = filter_data()
```

```text
Reading in DGMiner TSV file as DataFrame object...
Filtering out missing values...
Cleaning disease names...
Dropping diseases that don't start with 'D'...
```


```python
diseases = pd.unique(all['# Disease(MESH)']).tolist()
```

```python
# # Map the disease names from their MESH IDs
# mesh_to_disease = mesh_to_name(diseases)
# with open('data/mesh_to_disease.json', 'w') as f:
#     json.dump(mesh_to_disease, f)
```

```python
# Load the data
with open('data/mesh_to_disease.json', 'r') as f:
    mesh_to_disease = json.load(f)
```

### Create the Epistatic Interaction Dataset

```python
# Write the unique UniProt IDs to a text file for ID mapping
all_uniprot = pd.unique(all['Gene'])
with open('data/all_uniprot.txt', 'w') as f:
    for gene in all_uniprot:
        f.write(f"{gene}\n")
```

Upload `data/all_uniprot.txt` to the [UniProt ID mapping tool](https://www.uniprot.org/id-mapping) to convert the UniProt IDs to gene names, then save the result as `data/gene_mapping.tsv`.
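Once the mapping result is downloaded, it can be turned into lookup dictionaries. The following is a minimal sketch of that idea, not the repository's `create_mappings` helper: the function name `load_id_mapping` is hypothetical, and it assumes the file has a header row with `From` (UniProt accession) and `To` (gene name) columns.

```python
import csv
import io
from collections import defaultdict


def load_id_mapping(handle):
    """Build UniProt -> gene and gene -> UniProt lookups from a mapping TSV.

    Assumption: the file has `From` and `To` columns (hypothetical helper,
    not the project's own create_mappings).
    """
    uniprot_to_gene = {}
    gene_to_uniprot = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        accession = (row.get("From") or "").strip()
        gene = (row.get("To") or "").strip()
        if not accession or not gene:
            continue  # skip incomplete rows
        # Keep the first gene name seen for each accession
        uniprot_to_gene.setdefault(accession, gene)
        # One gene name can map back to several accessions
        if accession not in gene_to_uniprot[gene]:
            gene_to_uniprot[gene].append(accession)
    return uniprot_to_gene, dict(gene_to_uniprot)


# Tiny in-memory example; for the real file, pass open("data/gene_mapping.tsv")
sample = io.StringIO("From\tTo\nP04637\tTP53\nP38398\tBRCA1\n")
uniprot_to_gene_demo, gene_to_uniprot_demo = load_id_mapping(sample)
print(uniprot_to_gene_demo)
```

Keeping a list per gene name in the reverse lookup avoids silently dropping accessions when several UniProt entries share one gene name.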

```python
# Mapping dictionary
uniprot_to_gene, gene_to_uniprot, from_to = create_mappings(file_path='data/gene_mapping.tsv')
```

```python
access_key = "97121fd52d13c24ff0ac6d80d8f0266e"
all_genes = pd.unique(from_to.To)
gene_list = all_genes.tolist()
```

Retrieve epistatic interaction data from BioGRID.

```python
# # Fetching epistatic interactions
# gene_interactions = fetch_epistatic_interactions(gene_list, access_key)
# with open('data/gene_interactions.json', 'w') as f:
#     json.dump(gene_interactions, f)
```
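For orientation, here is a rough sketch of how a BioGRID REST query for a gene list could be assembled. It is not the repository's `fetch_epistatic_interactions`; the helper names, the batch size, and the choice of parameters (`geneList`, `searchNames`, `includeInteractors`, `taxId`, `format`) are assumptions based on the public BioGRID REST service and should be checked against its documentation. The sketch only builds URLs and sends no requests.

```python
from urllib.parse import urlencode

# Assumed base URL of the BioGRID REST interactions endpoint
BIOGRID_URL = "https://webservice.thebiogrid.org/interactions/"


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_biogrid_query(genes, access_key, tax_id=9606):
    """Return a query URL for one batch of gene names (hypothetical helper)."""
    params = {
        "accesskey": access_key,
        "format": "json",
        "geneList": "|".join(genes),   # pipe-separated gene names
        "searchNames": "true",         # match against official gene symbols
        "includeInteractors": "true",  # also returns partners outside the batch
        "taxId": tax_id,               # 9606 = Homo sapiens
    }
    return f"{BIOGRID_URL}?{urlencode(params)}"


# Long gene lists are split into batches to keep each URL short
demo_genes = ["TP53", "BRCA1", "CFTR"]
demo_urls = [build_biogrid_query(batch, "YOUR_ACCESS_KEY")
             for batch in batched(demo_genes, 2)]
print(demo_urls[0])
```

Each URL could then be passed to `requests.get(url).json()`. Because batches are queried separately, pairs whose partner is not in the full gene list would still need to be filtered out afterwards.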

```python
# Loading the epistatic interactions
with open('data/gene_interactions.json', 'r') as f:
    interactions = json.load(f)
```

### Clean up the dataset

```python
all_dg = clean_and_map_data(all, uniprot_to_gene, mesh_to_disease, 'data/all_DG_clean.tsv')
```

### Generate the Epistatic Interaction TSV file

```python
# generate_epistatic_interactions_tsv(interactions, gene_list, "data/epistatic_interactions.tsv")
```

```text
Processing Interactions: 100%|██████████| 17745/17745 [00:00<00:00, 28070.29it/s]
Epistatic interactions TSV file saved to data/epistatic_interactions.tsv
```

| Unnamed: 0 | Gene1 | Gene2 |
|---|---|---|
| 0 | PPIP5K2 | VAC14 |
| 1 | PPIP5K2 | WDR74 |
| 2 | PPIP5K2 | WDR6 |
| 3 | PPIP5K2 | ZC3H15 |
| 4 | PPIP5K2 | ZBTB8B |
| ... | ... | ... |
| 3964689 | DAZ3 | CFTR |
| 3964690 | HLA-DQA2 | LHFPL5 |
| 3964691 | HLA-DQA2 | HLA-DQA1 |
| 3964692 | HLA-DQA2 | LHFPL5 |


### Generate the Graph

```python
epistatic_graph = generate_graph_from_tsv("data/epistatic_interactions.tsv")
plot_graph(epistatic_graph, "figures/epistatic_interactions.png")
```

```python
dg_graph = generate_graph_from_tsv("data/all_DG_clean.tsv")
plot_graph(dg_graph, "figures/disease_gene_interactions.png")
```

```python
dg_and_epistatic = generate_graph_from_tsvs(["data/all_DG_clean.tsv", "data/epistatic_interactions.tsv"])
plot_graph(dg_and_epistatic, "figures/disease_gene_interactions_and_epistatic.png")
```
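
To feed an edge list like the one above into a Graph Neural Network, gene names have to become integer node indices. The sketch below shows one possible way to do that with the standard library only. It is not the repository's `generate_graph_from_tsv`; the function name `edge_index_from_tsv` is hypothetical, and the `Gene1`/`Gene2` column names are taken from the table shown earlier.

```python
import csv
import io


def edge_index_from_tsv(handle, source_col="Gene1", target_col="Gene2"):
    """Return (node_ids, edge_index) for an undirected graph (hypothetical helper).

    node_ids maps each node name to an integer index; edge_index is a pair of
    lists [sources, targets] that holds every edge in both directions.
    """
    node_ids = {}
    edges = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assign the next free index the first time a name is seen
        u = node_ids.setdefault(row[source_col], len(node_ids))
        v = node_ids.setdefault(row[target_col], len(node_ids))
        # Store each pair once, regardless of direction or repetition
        edges.add((min(u, v), max(u, v)))
    sources, targets = [], []
    for u, v in sorted(edges):
        sources += [u, v]
        targets += [v, u]
    return node_ids, [sources, targets]


# Small in-memory example that includes one repeated row
sample_edges = io.StringIO(
    "Gene1\tGene2\n"
    "PPIP5K2\tVAC14\n"
    "HLA-DQA2\tLHFPL5\n"
    "HLA-DQA2\tLHFPL5\n"
)
demo_nodes, demo_edge_index = edge_index_from_tsv(sample_edges)
print(demo_nodes)
print(demo_edge_index)
```

The result can be wrapped with `torch.tensor(demo_edge_index, dtype=torch.long)` to get the `[2, num_edges]` tensor that `torch_geometric.data.Data(edge_index=...)` expects. Deduplicating with a set matters here because the interaction table contains repeated pairs.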