# Theoretical Computer Science Project

#### Sergio Peignier and Théotime Grohens

\section{1 - Graph Theory}

In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

\subsection{1.1 Introduction}

In this project we will apply graph algorithms to study the gene regulatory network (GRN)
of \textit{Saccharomyces cerevisiae}.
This species of yeast is a small single-celled eukaryote with a short generation time and
two possible forms: a haploid one and a diploid one. Moreover, this organism is easy to
culture, and it has an important economic impact since it is used extensively in, for instance,
winemaking, baking, and brewing. Because of these characteristics, \textit{Saccharomyces cerevisiae}
is studied as an important model organism.

In this work we will study the gene regulatory network of \textit{Saccharomyces cerevisiae}
using graph theory algorithms. The files provided for this project were used
in [MCK+12] as gold standards to assess gene regulatory network inference algorithms. They
are the result of biological experiments based on ChIP binding data [MWG+06] and
systematic transcription factor deletions [HKI07]. Each dataset is described in detail
below:

\begin{itemize}
\item GRN_edges_S_cerevisiae.txt: contains the edges of the S. cerevisiae regulatory network
(from transcription factors to target genes). The intended meaning is that if there is
an edge from transcription factor X to target gene A, then X regulates the
transcription of A;

\item net4_transcription_factors.tsv: a single-column file containing the identifiers of the S. cerevisiae transcription factors that were studied;

\item net4_gene_ids.tsv: the two previous files use specific identifiers to denote genes, and this file gives the gene name associated with each gene identifier;

\item go_slim_mapping.tab.txt: only columns 0 and 5 will be used in this work. Column 0 contains the gene name, and column 5 contains its Gene Ontology (GO) annotation
(http://www.geneontology.org/). Note that the same gene may appear on several rows with
different Gene Ontology annotations.
\end{itemize}
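Since a gene can appear on several rows of the GO file, it can be convenient to gather all its annotations in one place. The following is a minimal sketch on a made-up, in-memory stand-in for go_slim_mapping.tab.txt; the gene names, GO labels, filler columns and the variable names `go` and `go_by_gene` are assumptions for illustration, not values from the real file.

```python
import io
import pandas as pd

# Toy stand-in for go_slim_mapping.tab.txt: tab-separated, no header row.
# Only column 0 (gene name) and column 5 (GO annotation) are meaningful here;
# every other column holds placeholder values.
toy_tsv = (
    "GENE_A\ta\tb\tc\td\tGO_1\tx\n"
    "GENE_A\ta\tb\tc\td\tGO_2\tx\n"
    "GENE_B\ta\tb\tc\td\tGO_1\tx\n"
)

# Keep only the two columns used in this work and give them readable names.
go = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=None, usecols=[0, 5])
go.columns = ["gene", "go_annotation"]

# One set of annotations per gene, since a gene may span several rows.
go_by_gene = go.groupby("gene")["go_annotation"].apply(set).to_dict()
print(go_by_gene)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path to go_slim_mapping.tab.txt.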

\subsection{1.2 Exercises}

\textbf{Exercise 1: Exploration and characterization of the gene regulatory network}

1) Load the dataset and create a NetworkX graph instance.

We import the datasets with pandas:

In [None]:
GRN_edges_SC = pd.read_csv("./datas/GRN_edges_S_cerevisiae.txt", sep = ',',  header=0)
net4_transcription_factors = pd.read_csv("./datas/net4_transcription_factors.tsv", sep = '\t',  header=0)  # single-column file; recent pandas versions reject sep='\n'
net4_gene_ids = pd.read_csv("./datas/net4_gene_ids.tsv", sep = '\t', header=0) 
go_slim_mappingtab = pd.read_csv("./datas/go_slim_mapping.tab.txt", sep = '\t', header=None) 

We check that everything was imported correctly:

In [None]:
# Drop the first column (presumably a row index saved with the file)
GRN_edges_SC = GRN_edges_SC.iloc[:,1:]

In [None]:
# The DataFrame can also be converted to a NumPy array (in case it is needed later)
GRN_edges_SC_np = GRN_edges_SC.to_numpy()

In [None]:
net4_gene_ids.head()

In [None]:
net4_transcription_factors.head()

In [None]:
go_slim_mappingtab.head()

In [None]:
GRN_edges_SC.head()

In [None]:
G = nx.from_pandas_edgelist(GRN_edges_SC, "transcription_factor", "target_gene")

In [None]:
# First look at the graph
plt.figure(figsize=(15,8))
nx.draw(G, node_size = 12)

In [None]:
# Convert G to a directed graph (not very useful here, and it takes a long time to load)
#G = nx.DiGraph(G)
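Calling `nx.DiGraph(G)` on an undirected graph stores every edge in both directions, so the regulator-to-target orientation is not recovered. If the orientation is needed, one option is to build the directed graph straight from the edge list. The sketch below uses a small made-up edge list with the same column names as `GRN_edges_SC`; the node names and the variables `toy_edges` and `toy_DG` are hypothetical.

```python
import pandas as pd
import networkx as nx

# Toy edge list with the same column names as GRN_edges_SC (values are made up).
toy_edges = pd.DataFrame({
    "transcription_factor": ["TF1", "TF1", "TF2"],
    "target_gene": ["G1", "TF2", "G1"],
})

# create_using=nx.DiGraph keeps each edge oriented from regulator to target.
toy_DG = nx.from_pandas_edgelist(
    toy_edges, "transcription_factor", "target_gene", create_using=nx.DiGraph
)

# Out-degree: number of genes a node regulates; in-degree: number of regulators it has.
out_deg = dict(toy_DG.out_degree())
in_deg = dict(toy_DG.in_degree())
print(out_deg, in_deg)
```

The same call with `GRN_edges_SC` in place of `toy_edges` would give a directed version of the full network.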

In [None]:
# Make the graph more readable
plt.figure(figsize=(18,10))
pos = nx.spring_layout(G, k = 0.6) # node positions; k sets the optimal distance between nodes
nx.draw(G, node_size = 18, 
        pos = pos, 
        width = 0.4, 
        node_color = 'cyan',
        edge_color = 'DarkSlateGray')

#### Bipartite graph:

In [None]:
from networkx.algorithms import bipartite

In [None]:
g = nx.Graph()

In [None]:
g.add_nodes_from(GRN_edges_SC['transcription_factor'], bipartite = 'transcription_factor')

In [None]:
g.add_nodes_from(GRN_edges_SC['target_gene'], bipartite = 'target_gene')

In [None]:
g.add_edges_from(zip(GRN_edges_SC['transcription_factor'], GRN_edges_SC['target_gene']))
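One point worth checking with this construction: `add_nodes_from` updates the attributes of a node that already exists, so an identifier that appears in both columns ends up labelled `'target_gene'` only. The following sketch, on a made-up edge list (all names are hypothetical), shows how such nodes could be detected.

```python
import pandas as pd
import networkx as nx

# Toy edge list (made up): TF2 is both a regulator and a target.
toy_edges = pd.DataFrame({
    "transcription_factor": ["TF1", "TF1", "TF2"],
    "target_gene": ["G1", "TF2", "G1"],
})

tf_set = set(toy_edges["transcription_factor"])
target_set = set(toy_edges["target_gene"])

# Identifiers present on both sides of the network.
both = tf_set & target_set

# Same construction as for g: the second call overwrites the label of shared nodes.
toy_g = nx.Graph()
toy_g.add_nodes_from(tf_set, bipartite="transcription_factor")
toy_g.add_nodes_from(target_set, bipartite="target_gene")
print(both, toy_g.nodes["TF2"]["bipartite"])
```

If `both` is empty for the real data, the two node sets are disjoint and the labels can be used as they are.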

In [None]:
#tf = transcription_factor
tf_nodes = [n for n in g.nodes() if g.nodes[n]['bipartite'] == 'transcription_factor']

In [None]:
# gn = gene = target_gene
gn_nodes = [n for n in g.nodes() if g.nodes[n]['bipartite'] == 'target_gene']

In [None]:
pos_ = nx.bipartite_layout(g, tf_nodes, scale = 1)

The figure below shows the two sides of the bipartite graph:
    \begin{itemize}
        \item on the left, the nodes representing the transcription factors;
        \item on the right, the nodes representing the target genes they regulate.
    \end{itemize}
Overall, it is immediately apparent that some transcription factors act on many more target genes than others.

In [None]:
plt.figure(figsize=(30,20))
nx.draw(g, pos = pos_, node_size = 14,
       node_color = 'forestgreen',
       edge_color = 'darkblue',
       width = 0.1)
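The visual impression that some transcription factors regulate many more genes than others can be quantified by counting distinct targets per factor. This is a minimal sketch on a made-up edge list with the same column names as `GRN_edges_SC`; the values and the name `targets_per_tf` are assumptions for illustration.

```python
import pandas as pd

# Toy edge list (made up): TF1 regulates three genes, TF2 only one.
toy_edges = pd.DataFrame({
    "transcription_factor": ["TF1", "TF1", "TF1", "TF2"],
    "target_gene": ["G1", "G2", "G3", "G1"],
})

# Number of distinct target genes per transcription factor, largest first.
targets_per_tf = (
    toy_edges.groupby("transcription_factor")["target_gene"]
    .nunique()
    .sort_values(ascending=False)
)
print(targets_per_tf)
```

Applied to `GRN_edges_SC`, the head of this series would list the factors with the most targets.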

\subsection{K-shell Decomposition}

We can first compute k-shells with the \textit{'k_shell'} function provided by the networkx.algorithms.core module, for several values of $k$.

In [None]:
from networkx.algorithms.core import k_shell

In [None]:
nx.draw(k_shell(G, k=2), node_size = 10)

In [None]:
nx.draw(k_shell(G, k=4), node_size = 10)

In [None]:
nx.draw(k_shell(G, k=6), node_size = 10)

In [None]:
nx.draw(k_shell(G, k=7), node_size = 10)

For $k > 7$, no nodes remain.
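Rather than trying values of $k$ one by one, the largest non-empty shell can be read from the core numbers, which `nx.core_number` returns for every node. The sketch below uses a small made-up graph (a 4-clique with a short path attached), not the yeast network; `toy`, `shell_sizes` and `k_max` are hypothetical names.

```python
from collections import Counter
import networkx as nx

# Toy graph: a 4-clique on nodes 0..3, plus a path 3 - 4 - 5 hanging off it.
toy = nx.complete_graph(4)
toy.add_edges_from([(3, 4), (4, 5)])

# core[n] is the largest k such that node n belongs to the k-core.
core = nx.core_number(toy)

# The k-shell is the set of nodes whose core number is exactly k.
shell_sizes = Counter(core.values())
k_max = max(core.values())

# Beyond k_max, the k-shell is empty.
empty_shell = nx.k_shell(toy, k=k_max + 1)
print(shell_sizes, k_max, empty_shell.number_of_nodes())
```

On the yeast graph `G`, `max(nx.core_number(G).values())` would be expected to match the largest $k$ for which the drawings above are non-empty.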

Now we try to implement the k_shell function ourselves:

In [None]:
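def my_kcore(G, k):
    """Return the k-core of G: what remains after repeatedly removing nodes of degree < k."""
    H = G.copy()  # work on a copy so that the input graph is not modified
    low = [n for n, d in H.degree() if d < k]
    while low:
        H.remove_nodes_from(low)
        # degrees change after each removal, so they must be recomputed
        low = [n for n, d in H.degree() if d < k]
    return H

def my_kshell(G, k):
    """Return the k-shell of G: nodes in the k-core that are not in the (k+1)-core."""
    core_k = my_kcore(G, k)
    core_k_plus_1 = my_kcore(core_k, k + 1)
    return core_k.subgraph(set(core_k) - set(core_k_plus_1)).copy()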

In [None]:
new_G = my_kshell(G, 3)

In [None]:
nx.draw(new_G, node_size = 10)