# Theoretical Computer Sciences Project

#### Sergio Peignier and Théotime Grohens

\section{1 - Graph Theory}

In [None]:
import random
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns

\subsection{1.1 Introduction}

In this project we apply graph algorithms to study the gene regulatory network (GRN)
of \textit{Saccharomyces cerevisiae}.
This species of yeast is a small single-cell eukaryote with a short generation time and
two possible forms: a haploid one and a diploid one. Moreover, this organism is easy to
culture, and it has an important economic impact since it is extensively used in, for instance,
winemaking, baking, and brewing. Due to these characteristics, \textit{Saccharomyces cerevisiae}
is studied as an important model organism.

In this work we study the \textbf{gene regulatory network} of \textit{Saccharomyces cerevisiae} using graph theory algorithms. The files provided for this project were used in [MCK+12] as gold standards to assess gene regulatory network inference algorithms. They result from biological experiments based on ChIP binding data [MWG+06] and systematic transcription factor deletions [HKI07]. Each dataset is described in detail below:

\begin{itemize}
\item GRN_edges_S_cerevisiae.txt: contains the edges of the \textit{S. cerevisiae} regulatory network (from transcription factors to target genes). If there is an edge between transcription factor X and target gene A, then X regulates the transcription of A;

\item net4_transcription_factors.tsv: contains, in a single column, the identifiers of the \textit{S. cerevisiae} transcription factors that were studied;

\item net4_gene_ids.tsv: the two previous files use specific identifiers to denote genes, and this file contains the gene name associated with each gene identifier;

\item go_slim_mapping.tab.txt: only columns 0 and 5 are used in this work. Column 0 contains the gene name, and column 5 contains its Gene Ontology (GO) annotation (http://www.geneontology.org/). Note that two different rows may give different Gene Ontology annotations for the same gene.
\end{itemize}

\subsection{1.2 Exercises}

### \textbf{Exercise 1 : } Exploration and characterization of the gene regulatory network

#### 1) Load the dataset and create a NetworkX graph instance.

We import the datasets with pandas:

In [None]:
GRN_edges_SC = pd.read_csv("./datas/GRN_edges_S_cerevisiae.txt", sep = ',',  header=0)
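# Single-column file: recent pandas versions reject sep='\n', and any separator absent from the file reads it as one column
net4_transcription_factors = pd.read_csv("./datas/net4_transcription_factors.tsv", sep = '\t',  header=0) 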
net4_gene_ids = pd.read_csv("./datas/net4_gene_ids.tsv", sep = '\t', header=0) 
go_slim_mappingtab = pd.read_csv("./datas/go_slim_mapping.tab.txt", sep = '\t', header=None) 

We check that everything was imported correctly.

In [None]:
# Drop the first column (presumably a row index saved in the file) and keep the two edge columns
GRN_edges_SC = GRN_edges_SC.iloc[:,1:]

In [None]:
# the DataFrame can be converted to a NumPy array (in case it is needed)
GRN_edges_SC_np = GRN_edges_SC.to_numpy()

In [None]:
net4_gene_ids.head()

In [None]:
net4_transcription_factors.head()

In [None]:
go_slim_mappingtab.head()

In [None]:
GRN_edges_SC.head()

The dataset was imported correctly, so we build a graph $G = \langle V, E \rangle$ whose node set $V$ contains the transcription factors and the target genes, and whose edge set $E$ represents the regulation of genes by transcription factors.

In [None]:
G = nx.from_pandas_edgelist(GRN_edges_SC, "transcription_factor", "target_gene")
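
Since the edges run from a transcription factor to its target, a directed graph may be a more faithful representation of the data. The following is a minimal sketch, assuming the same two column names as above; the toy `edges` frame and its gene identifiers are made up for illustration and are not taken from the project files.

```python
import networkx as nx
import pandas as pd

# Toy edge list (hypothetical identifiers), same column names as GRN_edges_SC
edges = pd.DataFrame({
    "transcription_factor": ["G1", "G1", "G2"],
    "target_gene": ["G2", "G3", "G3"],
})

# create_using=nx.DiGraph keeps the orientation TF -> target
D = nx.from_pandas_edgelist(
    edges, "transcription_factor", "target_gene", create_using=nx.DiGraph
)

print(D.out_degree("G1"))  # number of genes regulated by G1
print(D.in_degree("G3"))   # number of regulators of G3
```

With a directed graph, the out-degree of a node counts the genes it regulates and the in-degree counts its regulators, a distinction that is lost in the undirected version.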

#### 2) Plot the gene regulatory network, the plot should be readable, understandable, and informative. Which information did you decide to convey in your plot? Why?

We first draw the graph G as is.

In [None]:
# First look at the graph
plt.figure(figsize=(15,8))
nx.draw(G, node_size = 12)

In [None]:
# Convert G to a directed graph (not very useful and takes a long time to load)
#G = nx.DiGraph(G)

In [None]:
# Make the graph more readable
plt.figure(figsize=(18,10))
pos = nx.spring_layout(G, k = 0.6) # returns the relative positions of the nodes, k = optimal distance between nodes
nx.draw(G, node_size = 18, 
        pos = pos, 
        width = 0.4, 
        node_color = 'cyan',
        edge_color = 'DarkSlateGray')

As is, the graph is not readable and gives no information about the nature of the data.\\

The set $V$ can be split into two subsets:
\begin{itemize}
    \item $X$: the set of transcription factors;
    \item $A$: the set of genes.
\end{itemize}
It would therefore be more informative to represent the two subsets $X$ and $A$ separately.

##### Bipartite graph:

In [None]:
from networkx.algorithms import bipartite

In [None]:
g = nx.Graph()

In [None]:
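# Target genes are added first, so that a node that is both a regulator and a target keeps the 'transcription_factor' label
g.add_nodes_from(GRN_edges_SC['target_gene'], bipartite = 'target_gene')

In [None]:
g.add_nodes_from(GRN_edges_SC['transcription_factor'], bipartite = 'transcription_factor')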

In [None]:
g.add_edges_from(zip(GRN_edges_SC['transcription_factor'], GRN_edges_SC['target_gene']))

In [None]:
# tf = transcription_factor
tf_nodes = [n for n in g.nodes() if g.nodes[n]['bipartite'] == 'transcription_factor']

In [None]:
# gn = gene = target_gene
gn_nodes = [n for n in g.nodes() if g.nodes[n]['bipartite'] == 'target_gene']

In [None]:
pos_ = nx.bipartite_layout(g, tf_nodes, scale = 1)

The plot below shows the two parts of the graph:
    \begin{itemize}
        \item on the left, the nodes representing the transcription factors;
        \item on the right, the nodes representing the target genes of a regulation.
    \end{itemize}
At a glance, some transcription factors clearly act on many more target genes than others.
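
The visual impression that a few regulators dominate can be quantified by counting the targets of each transcription factor. The sketch below assumes an edge table with the same two column names as the dataset; the toy rows are invented for illustration.

```python
import pandas as pd

# Toy edge list (hypothetical identifiers), same column names as GRN_edges_SC
edges = pd.DataFrame({
    "transcription_factor": ["G1", "G1", "G1", "G2"],
    "target_gene": ["G3", "G4", "G5", "G3"],
})

# Number of distinct target genes per transcription factor, largest first
targets_per_tf = (
    edges.groupby("transcription_factor")["target_gene"]
    .nunique()
    .sort_values(ascending=False)
)
print(targets_per_tf)
```

The head of the sorted series lists the regulators with the most targets, which are natural candidates to highlight or label in the plot.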

In [None]:
plt.figure(figsize=(30,20))
nx.draw(g, pos = pos_, node_size = 14,
       node_color = 'forestgreen',
       edge_color = 'darkblue',
       width = 0.1)

#### 3) Describe the network by computing pertinent local and global metrics, explain your choices, represent the results graphically if necessary, and interpret the results.

Clustering coefficient: this metric is probably of little use here, since the graph is (almost) bipartite.
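
The reason is that a bipartite graph contains no odd cycle, hence no triangle, so the clustering coefficient is zero at every node. The toy graph below is an assumption made for illustration. In the real network, edges between two transcription factors can create triangles, in which case some coefficients are non-zero.

```python
import networkx as nx

# Toy bipartite graph: 2 "transcription factors" (nodes 0-1) linked to 3 "genes" (nodes 2-4)
B = nx.complete_bipartite_graph(2, 3)

clust = nx.clustering(B)
print(clust)  # every coefficient is 0: there is no triangle

# Adding one edge inside the same side creates triangles
B.add_edge(0, 1)
clust_with_tf_edge = nx.clustering(B)
print(clust_with_tf_edge)
```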

In [None]:
tf_clust = nx.clustering(G, tf_nodes)
gn_clust = nx.clustering(G, gn_nodes)
clustering = [[k for k in tf_clust.values()], [k for k in gn_clust.values()]]

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(18,10))
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel("Clustering Coefficient",fontsize=20)
plt.ylabel("Nodes",fontsize=20)
plt.title("Histogram of clustering coefficients", fontsize=20)

x1 = clustering[1]
x2 = clustering[0]
plt.hist([x1, x2], color = ['coral', 'mediumorchid'], 
         edgecolor = 'black', label = ["gene's nodes", "TF's nodes"])

plt.legend()
plt.show()

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(18,10))
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel("Clustering Coefficient",fontsize=20)
plt.ylabel("Nodes",fontsize=20)
plt.title("Histogram of clustering coefficients", fontsize=20)

x1 = clustering[1]
plt.hist(x1, color = ['coral'], 
         edgecolor = 'black', label = ["gene's nodes"])

plt.legend()
plt.show() 

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(18,10))
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel("Clustering Coefficient",fontsize=20)
plt.ylabel("Nodes",fontsize=20)
plt.title("Histogram of clustering coefficients", fontsize=20)

x2 = clustering[0]
plt.hist(x2, color = ['blue'], 
         edgecolor = 'black', label = ["TF's nodes"])

plt.legend()
plt.show()

In [None]:
plt.hist(clustering[1], alpha=0.6)
plt.hist(clustering[0], alpha=0.6)
plt.show()

In [None]:
print(np.mean(clustering[0]))
print(np.mean(clustering[1]))

In [None]:
nx.average_clustering(G)

In [None]:
def overlap(G):
    # Neighbourhood overlap of each edge (u, v):
    # number of common neighbours divided by the number of nodes adjacent to u or v (u and v excluded)
    out = []
    for u, v in G.edges():
        n = len((set(G[u]) & set(G[v])) - {u, v})
        denom = (G.degree(u) - 1) + (G.degree(v) - 1) - n
        O = n / denom if denom > 0 else 0
        out.append(((u, v), O))
    return out

In [None]:
overlap(G)

In [None]:
nx.closeness_centrality(G) # inverse of the mean distance between n and the other nodes

In [None]:
nx.betweenness_centrality(G) # fraction of shortest paths that passes through n

In [None]:
nx.density(G)

In [None]:
RC = nx.rich_club_coefficient(G)

In [None]:
lists = sorted(RC.items()) # sorted by key, return a list of tuples

x, y = zip(*lists) # unpack a list of pairs into two tuples

plt.plot(x, y)
plt.xlabel("Degree k",fontsize=10)
plt.ylabel("Rich Club Coefficient",fontsize=10)
plt.show()

In [None]:
def plot_degree_dist(G):
    degrees = [G.degree(n) for n in G.nodes()]
    plt.hist(degrees)
    plt.show()

In [None]:
degree_freq = nx.degree_histogram(G)
degrees = range(len(degree_freq))
plt.figure(figsize=(12, 8)) 
plt.loglog(degrees, degree_freq) 
plt.xlabel('Degree')
plt.ylabel('Frequency')

#### 4) Implement and apply the k-shell decomposition algorithm.

We can first compute k-shells with the \textit{'k_shell'} function that is already implemented in the networkx.algorithms.core library, for different values of $k$.

In [None]:
from networkx.algorithms.core import k_shell

In [None]:
nx.draw(nx.k_shell(G, k=2), node_size = 10)

In [None]:
nx.draw(k_shell(G, k=4), node_size = 10)

In [None]:
nx.draw(k_shell(G, k=6), node_size = 10)

In [None]:
nx.draw(k_shell(G, k=7), node_size = 10)

With $k > 7$ there are no nodes left.
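
Rather than trying successive values of $k$, the shell of every node can be read from its core number: the $k$-shell is the set of nodes whose core number is exactly $k$. This is a minimal sketch on a toy graph (the graph and the variable names are illustrative assumptions), using `nx.core_number`:

```python
import networkx as nx
from collections import defaultdict

# Toy graph: a 4-clique (nodes 0-3) with a pendant path 3-4-5
T = nx.complete_graph(4)
T.add_edges_from([(3, 4), (4, 5)])

# node -> largest k such that the node belongs to the k-core
core = nx.core_number(T)

# Group the nodes by core number: shells[k] is the k-shell
shells = defaultdict(list)
for node, k in core.items():
    shells[k].append(node)

for k in sorted(shells):
    print(k, sorted(shells[k]))
```

The largest key of `shells` is the deepest non-empty shell, which gives the value of $k$ beyond which no node is left.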

We now implement the k_shell function ourselves:

In [None]:
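def my_kshell(G, k):
    # Work on a copy of G so that the input graph is left untouched
    GG = nx.Graph(G)

    def prune(max_degree):
        # Repeatedly remove the nodes of degree <= max_degree until none is left
        while True:
            rm = [n for n in GG.nodes() if GG.degree(n) <= max_degree]
            if not rm:
                break
            GG.remove_nodes_from(rm)

    prune(k - 1)             # GG is now the k-core
    k_core = set(GG.nodes())
    prune(k)                 # GG is now the (k+1)-core
    # The k-shell is the part of the k-core that is not in the (k+1)-core
    return G.subgraph(k_core - set(GG.nodes())).copy()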

In [None]:
new_G = my_kshell(G, 7)

In [None]:
nx.draw(new_G, node_size = 10)

#### 5) For at least 4 of the metrics that you have used: what is the time complexity of the algorithm that calculates it (explain)?

### \textbf{Exercise 2 : } Community detection

#### 1) You can choose between the Girvan Newman method and the Louvain algorithm to find communities in the graph. Describe both algorithms, and their time complexities (explain).

#### 2) Which algorithm did you choose, why ?

#### 3) For the Girvan Newman method, the user should select one of the output partitions, explain the criterion that could be used to make this choice, and its complexity.
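
A criterion often used for this choice is modularity: each partition produced by the algorithm is scored, and the highest-scoring one is kept. The following is a minimal sketch on a toy graph. The barbell graph and the variable names are illustrative assumptions, not the project data. Scoring one partition requires a pass over the edges of the graph, so the total cost grows with the number of partitions examined.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Toy graph: two triangles (nodes 0-2 and 3-5) joined by a single bridge edge
H = nx.barbell_graph(3, 0)

best_partition, best_q = None, float("-inf")
# girvan_newman yields successive partitions with 2, 3, ... communities
for partition in girvan_newman(H):
    q = modularity(H, partition)
    if q > best_q:
        best_partition, best_q = partition, q

print(sorted(sorted(c) for c in best_partition), round(best_q, 3))
```

On a large network, the loop can be stopped after a fixed number of partitions, for example with `itertools.islice`, since each further level requires recomputing edge betweenness.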

#### 4) Study the GO composition of each community. To do this you can produce a counting matrix $M$, such that $M_{i,j}$ is the number of genes from community $j$ that have GO annotation $i$.
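
One possible way to build such a matrix is a pandas cross-tabulation. The sketch below uses invented gene names, GO terms, and a hypothetical `community` dictionary mapping each gene to its community index. With the project data, the two columns would come from columns 0 and 5 of go_slim_mapping.tab.txt.

```python
import pandas as pd

# Hypothetical inputs: one row per (gene, GO annotation) and a gene -> community map
go = pd.DataFrame({
    "gene": ["A", "A", "B", "C", "C"],
    "go": ["transport", "signaling", "transport", "signaling", "translation"],
})
community = {"A": 0, "B": 0, "C": 1}

# Remove repeated (gene, annotation) rows so that each gene counts once per GO term
go = go.drop_duplicates()
go["community"] = go["gene"].map(community)

# M.loc[i, j] = number of genes of community j that carry GO annotation i
M = pd.crosstab(go["go"], go["community"])
print(M)
```

Dividing each column of `M` by its sum would give the GO composition of each community as proportions, which is easier to compare across communities of different sizes.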

#### 5) Is there a relationship between graph communities and particular cell functions ?

\section{Around the Traveling Salesman Problem}
    \subsection{Introduction}
#### Exact solution

In a complete graph, every node is adjacent to every other node. Therefore, if we take all the nodes of a complete graph in any order, there is a path through those nodes in that order, i.e. a Hamiltonian path. Joining the two ends of that path then gives a Hamiltonian cycle.
\newline

Let $(v_1, v_2, \ldots, v_n)$ be a Hamiltonian path in a complete graph of size $n$. There are $n$ possible starting nodes $v_1$. Then, as one of the $n$ nodes has already been visited, only $n-1$ options are left for $v_2$ (a node cannot be visited twice), $n-2$ for $v_3$, and so on until only one unvisited node is left for $v_n$. Thus, there are $n \times (n-1) \times \cdots \times 1 = n!$ distinct Hamiltonian paths in a complete graph of size $n$.
\newline

Closing each of these paths gives a Hamiltonian cycle, but not all of these $n!$ cycles are different. Any given cycle:
\begin{itemize}
    \item can be started at any of its $n$ different nodes;
    \item is reversible: it can be traveled in $2$ directions.
\end{itemize}

So each of the $n!$ paths belongs to a set of $2n$ paths that all yield the same cycle, i.e. the same set of edges.
We finally get $\frac{n!}{2n} = \frac{(n-1)!}{2}$ distinct Hamiltonian cycles (for $n \geq 3$).
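
The count $\frac{(n-1)!}{2}$ can be checked by brute force for small $n$. This is a sketch, with a helper name chosen here for illustration: every ordering of the nodes is enumerated, and each cycle is identified by its set of undirected edges, so rotations and reversals of the same tour collapse to one entry.

```python
from itertools import permutations
from math import factorial

def count_hamiltonian_cycles(n):
    # Distinct Hamiltonian cycles of the complete graph on n >= 3 nodes
    cycles = set()
    for perm in permutations(range(n)):
        # Edge set of the closed tour perm[0] -> perm[1] -> ... -> perm[0]
        edges = frozenset(
            frozenset((perm[i], perm[(i + 1) % n])) for i in range(n)
        )
        cycles.add(edges)
    return len(cycles)

# Compare the brute-force count with the closed formula (n-1)!/2
for n in range(3, 7):
    print(n, count_hamiltonian_cycles(n), factorial(n - 1) // 2)
```

The enumeration visits all $n!$ orderings, so it is only usable for very small $n$, which is precisely why an exact search over all tours does not scale.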



In [None]:
# Our complete graph (just a test for the following algorithms)

Gc = nx.complete_graph(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
list(Gc.nodes())

In [None]:
nx.draw(Gc)

In [None]:
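def hamilton(G, pt):
    # pt : index of the starting node in list(G.nodes())
    # Depth-first search: each stack entry holds the graph of unvisited nodes and the path so far
    F = [(G,[list(G.nodes())[pt]])]
    n = G.number_of_nodes()
    while F:
        graph,path = F.pop()
        temp = []
        neighbors = (node for node in graph.neighbors(path[-1]) if node != path[-1]) #exclude self loops
        for neighbor in neighbors:
            cppath = path + [neighbor] # copy of the path extended by one node
            cpgraph = nx.Graph(graph)
            cpgraph.remove_node(path[-1]) # the last node cannot be visited again
            temp.append((cpgraph, cppath))
        for g,p in temp:
            if len(p) == n:
                return p # first Hamiltonian path found
            else:
                F.append((g,p))
    return None # no Hamiltonian path from this starting node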

In [None]:
hamilton(Gc, 1)

In [None]:
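# returns the adjacency list of a complete graph of size n with random integer weights
def complete_weight_graph(n): 
    # G[v] is a list of (u, {'weight': w}) pairs, one per neighbor u of v
    G = {v : [] for v in range(n)}
    for v in range(n):
        for u in range(v + 1, n):
            w = random.randrange(10)+1 # random weight between 1 and 10
            # the graph is undirected: the same weight is stored in both directions
            G[v].append((u, {'weight':w}))
            G[u].append((v, {'weight':w}))

    return G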

In [None]:
G = complete_weight_graph(4)
G

In [None]:
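def Hamiltonian_path(G, i):
    # Enumerates all Hamiltonian paths starting at node i
    # Each path is a list [{'dist': total weight}, i, ..., last node]
    N = len(G)
    paths = [[{'dist':0}, i]]
    for level in range(N-1):
        cppaths = paths + [] # shallow copy, so that paths can be modified inside the loop
        for p in cppaths: 
            for n in list(G[p[len(p)-1]]): # n : (neighbor, attributes) pair of the last node of p
                if n[0] not in p:
                    newp = p + [n[0]] 
                    newdist = p[0]['dist']+n[1]['weight']
                    newp[0]= {'dist' : newdist}
                    paths.append(newp)
            if len(p)<(N+1):
                paths.remove(p) # p has been extended, drop the partial path
                
    return paths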

In [None]:
Hamiltonian_path(G, 0)

In [None]:
t = (5,2)
u = t + (5,)


In [None]:
paths = {(1,5):0}

In [None]:

paths = {(1,):0}
p = (1,)
newp = p + (2,)
paths[newp] = 0
paths