# Layouts tutorial 

The purpose of this tutorial is to show why graph layouts are useful and to offer insight into how they are produced.



## The Pipeline

In essence, we want to go from a graph represented in matrix form to a good estimate of what it would look like drawn with nodes and edges.

This is accomplished by first taking the given graph matrix and embedding it. There are many different ways to embed a graph. We will first use Node2Vec, as that is what the current code supports. Other options include ASE and LSE.

Once we have this embedded matrix, which is n x d, we must then down-project it to two dimensions. This can be accomplished with either UMAP or t-SNE, which are two different nonlinear manifold learning algorithms.

We now have a matrix that is n x 2. The next step is to apply ``nooverlap`` (https://github.com/microsoft/graspologic/tree/dev/graspologic/layouts/nooverlap) to make sure that nodes do not appear on top of each other, which would obscure the graph layout.

This, in effect, creates a layout. One can optionally cluster and color the nodes using ``leiden``.
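To make the overlap-removal step more concrete, here is a minimal sketch of the idea in plain Python. It is not the algorithm used by ``nooverlap``; it is a naive pairwise version, and the function name ``remove_overlaps`` and its parameters are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def remove_overlaps(positions: List[Point], radii: List[float],
                    max_passes: int = 100, padding: float = 1e-6) -> List[Point]:
    """Naive pairwise overlap removal (illustrative only, not graspologic's nooverlap)."""
    pts = [list(p) for p in positions]
    for _ in range(max_passes):
        moved = False
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                dx = pts[j][0] - pts[i][0]
                dy = pts[j][1] - pts[i][1]
                dist = math.hypot(dx, dy)
                min_dist = radii[i] + radii[j]
                if dist >= min_dist:
                    continue  # these two circles do not overlap
                if dist == 0.0:
                    ux, uy = 1.0, 0.0  # coincident nodes: pick an arbitrary direction
                else:
                    ux, uy = dx / dist, dy / dist
                # Push each node half of the overlap (plus padding) along the joining line
                shift = (min_dist - dist + padding) / 2.0
                pts[i][0] -= ux * shift
                pts[i][1] -= uy * shift
                pts[j][0] += ux * shift
                pts[j][1] += uy * shift
                moved = True
        if not moved:
            break  # a full pass with no overlaps means we are done
    return [(x, y) for x, y in pts]


# Two overlapping nodes and one that is far away
layout = remove_overlaps([(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)], [1.0, 1.0, 1.0])
```

The pairwise loop is quadratic in the number of nodes, so this sketch is only suitable for small examples.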

## The callable functions and where they fit into the framework above
``layout_tsne`` 

-- Automatic graph layout generation by creating a generalized node2vec embedding, then using t-SNE for dimensionality reduction to 2d space.

-- An example is shown in the "Using TSNE" section below.

``layout_umap``

-- Automatic graph layout generation by creating a generalized node2vec embedding, then using UMAP for dimensionality reduction to 2d space.

-- An example is shown in the "Using UMAP" section below.

``categorical_colors``

-- Generates a node -> color mapping based on the partitions provided. The partitions are ordered by population descending, and a series of perceptually balanced, complementary colors are chosen in sequence.

``sequential_colors``

-- Generates a node -> color mapping where a color is chosen for the value as it maps the value range into the sequential color space.


``show_graph``

-- Draws the graph from the node positions and node -> color mapping provided.


#### What is missing

Need to do an example for each of the main function calls, with further explanation as to how each one works.

Use some play data for now; will talk to Dwayne about getting some more interesting data later.


# Using TSNE
By default, ``layout_tsne`` automatically attempts to prune each graph to a maximum of 10,000,000 edges by removing the lowest weight edges. This pruning is approximate and will leave your graph with at most ``max_edges``, but is not guaranteed to be precisely ``max_edges``.

In addition to pruning edges by weight, this function also only operates over the largest connected component in the graph.

After dimensionality reduction, sizes are generated for each node based upon their degree centrality, and these sizes and positions are further refined by an overlap removal phase. Lastly, a global partitioning algorithm (``graspologic.partition.leiden``) is executed for the largest connected component and the partition ID is included with each node position.
    
``layout_tsne`` handles much of the pipeline described above all at once. In later iterations this will be broken down into more callable functions to allow access to intermediate results.
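As a rough illustration of what pruning by weight means, the sketch below keeps only the heaviest edges of a small edge list. This is an assumption-laden toy: the name ``prune_lowest_weight_edges`` is hypothetical, and unlike the pruning described above, which is approximate, this version is exact.

```python
from typing import List, Tuple

Edge = Tuple[str, str, float]  # (source, target, weight)


def prune_lowest_weight_edges(edges: List[Edge], max_edges: int) -> List[Edge]:
    """Keep at most ``max_edges`` edges, dropping the lowest-weight ones first."""
    if len(edges) <= max_edges:
        return list(edges)  # nothing to prune
    # Sort by weight, heaviest first, and keep the top ``max_edges``
    return sorted(edges, key=lambda edge: edge[2], reverse=True)[:max_edges]


edges = [("a", "b", 0.1), ("b", "c", 0.9), ("c", "d", 0.5)]
pruned = prune_lowest_weight_edges(edges, max_edges=2)
```

Note that pruning can disconnect a graph, which is one reason it matters that the layout only operates over the largest connected component.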

## The parameters

``graph : networkx.Graph`` Any graph object can be used.

For other interesting graphs, see the NetworkX graph generators: https://networkx.org/documentation/stable/reference/generators.html

In [None]:
import tarfile

# Path to the downloaded archive and the directory to extract it into
archive_path = r'C:\Users\dfran\Downloads\download.tsv.iceland.tar.bz2'
extract_dir = r'D:\Hopkins\Hopkins_senior\Neurodata\ndd_prac\ndd_stuff\sprint3\data'

# The context manager closes the archive even if extraction fails
with tarfile.open(archive_path, "r:bz2") as tar:
    tar.extractall(extract_dir)

In [None]:
import networkx as nx

# Read a comma-separated edge list (source,target,weight) into an undirected graph
with open('large-graph.txt', "r") as data:
    next(data, None)  # skip the first line in the input file
    g = nx.parse_edgelist(data, delimiter=',', create_using=nx.Graph(),
                          nodetype=str, data=(('weight', float),))


``perplexity : int``
The perplexity is related to the number of nearest neighbors used in other manifold learning algorithms.
It is a sensitive parameter; a reasonable choice is usually between 4 and 100, and larger datasets usually require a larger perplexity.

Manifold learning helps with nonlinear projection to lower dimension

``max_edges : int``
The maximum number of edges to use when generating the embedding. Default is ``10000000``.

``n_iter : int``
Maximum number of iterations for the optimization. We have found in practice that larger graphs require more iterations. We hope to eventually have more guidance on the number of iterations based on the size of the graph and the density of the edge connections.
        
``random_seed : int`` set for reproducible results 

In [None]:
# Note that we relabel the nodes here so that every label is a string;
# this is because ``leiden`` requires that all the nodes be strings
g = nx.relabel_nodes(g, {node: str(node) for node in g.nodes})

``layout_tsne`` returns a tuple of ``nx.Graph, List[NodePosition]``.
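The sketch below shows one way to work with a list of node positions once you have it. It uses a stand-in class rather than importing graspologic, so it runs on its own. The ``x``, ``y`` and ``community`` fields are the ones read later in this tutorial; ``node_id`` and ``size`` are assumed field names, and ``NodePositionSketch`` and ``positions_by_node`` are hypothetical helpers.

```python
from typing import Dict, List, NamedTuple, Tuple


class NodePositionSketch(NamedTuple):
    """Stand-in for NodePosition; ``node_id`` and ``size`` are assumed field names."""
    node_id: str
    x: float
    y: float
    size: float
    community: int


def positions_by_node(positions: List[NodePositionSketch]) -> Dict[str, Tuple[float, float]]:
    """Map each node ID to its (x, y) coordinates."""
    return {p.node_id: (p.x, p.y) for p in positions}


positions = [
    NodePositionSketch("0", 0.0, 1.0, 2.5, 0),
    NodePositionSketch("1", 3.0, -1.0, 1.0, 1),
]
coords = positions_by_node(positions)
# Keying by the ID stored on each position keeps values aligned with nodes
communities = {p.node_id: p.community for p in positions}
```

Keying dictionaries by the ID carried on each position, rather than by the order of ``g.nodes``, avoids mismatches when the layout covers only the largest connected component.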

In [None]:
from graspologic.layouts import layout_tsne
import time

start = time.time()
tupl = layout_tsne(g, perplexity=3, n_iter=250, random_seed=23)
print(time.time() - start)  # elapsed wall-clock time in seconds

## Coloring of nodes

If you do not want to color the nodes using any of the functionality in ``graspologic.layouts``, you can build the node -> color mapping yourself, for example with seaborn:

In [None]:
import seaborn as sns
nodes = list(g.nodes)
colors = sns.color_palette(n_colors = g.number_of_nodes())
node_to_color = dict(zip(nodes, colors))

You can also use ``categorical_colors`` or ``sequential_colors``; these are discussed later.

## Using ``show_graph``

In [None]:
from graspologic.layouts import show_graph
show_graph(tupl[0], tupl[1], node_to_color)

You will notice that we are using just the basic defaults for everything. Later examples demonstrate more clearly how to change the appearance of the graph.

# Using UMAP
``layout_umap`` performs automatic graph layout generation by creating a generalized node2vec embedding, then using UMAP for dimensionality reduction to 2d space.

By default, this function automatically attempts to prune each graph to a maximum of 10,000,000 edges by removing the lowest weight edges. This pruning is approximate and will leave your graph with at most ``max_edges``, but is not guaranteed to be precisely ``max_edges``.

In addition to pruning edges by weight, this function also only operates over the largest connected component in the graph.

After dimensionality reduction, sizes are generated for each node based upon their degree centrality, and these sizes and positions are further refined by an overlap removal phase. Lastly, a global partitioning algorithm (``graspologic.partition.leiden``) is executed for the largest connected component and the partition ID is included with each node position.
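Two of the steps above, restricting to the largest connected component and computing degree centrality, can be sketched in plain Python as follows. This is a minimal illustration on a toy edge list, assuming a simple undirected graph with no self loops or duplicate edges; the function names are hypothetical, and how the library turns centrality into node sizes is not shown here.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def largest_connected_component(edges: List[Tuple[str, str]]) -> Set[str]:
    """Return the node set of the largest connected component (breadth-first search)."""
    adjacency: Dict[str, Set[str]] = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen: Set[str] = set()
    best: Set[str] = set()
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def degree_centrality(edges: List[Tuple[str, str]], nodes: Set[str]) -> Dict[str, float]:
    """Degree of each node divided by (n - 1), counting only edges inside ``nodes``."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        if u in degree and v in degree:
            degree[u] += 1
            degree[v] += 1
    scale = 1.0 / (len(nodes) - 1) if len(nodes) > 1 else 1.0
    return {node: d * scale for node, d in degree.items()}


edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("x", "y")]
lcc = largest_connected_component(edges)
centrality = degree_centrality(edges, lcc)
```

In this toy graph the pair ``x``, ``y`` forms a separate component and is dropped, which mirrors why the number of returned positions can be smaller than the number of nodes in the input graph.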

## The parameters

``graph : networkx.Graph``


``min_dist : float``
    The effective minimum distance between embedded points. Default is ``0.75``.
    Smaller values will result in a more clustered/clumped embedding where nearby
    points on the manifold are drawn closer together, while larger values will
    result in a more even dispersal of points. The value should be set relative to
    the ``spread`` value, which determines the scale at which embedded points will
    be spread out.
    
   
``n_neighbors : int``
    The size of local neighborhood (in terms of number of neighboring sample points)
    used for manifold approximation. Default is ``25``. Larger values result in
    more global views of the manifold, while smaller values result in more local
    data being preserved.
    

``max_edges : int``
    The maximum number of edges to use when generating the embedding.  Default is
    ``10000000``. The edges with the lowest weights will be pruned until at most
    ``max_edges`` exist. Warning: this pruning is approximate and more edges than
    are necessary may be pruned. When running in a 32-bit environment you will most
    likely need to reduce this number or you will run out of memory.
    
  
``random_seed : int``
    Seed to be used for reproducible results. Default is None and will produce
    random results.


In [None]:
import time

import networkx as nx
from graspologic.layouts import layout_umap

t1 = time.time()
tupl = layout_umap(g)
print(f"UMAP takes {time.time() - t1:.1f} seconds on {g.number_of_nodes()} nodes")

You will notice that, because the graph is large, there is a wait of about 7 minutes to produce the projection. For ``layout_tsne``, ``n_iter`` needs to be at least 250, but ``perplexity`` can be increased for larger graphs.

In [None]:
from graspologic.layouts import show_graph
show_graph(tupl[0], tupl[1], node_to_color)

# Graph coloring 

## ``categorical_colors``
The input to this coloring function is a dict that maps each node to its partition.

Here we color the graph based upon the estimated communities. We can treat each community as a partition and color accordingly.

In [None]:
from graspologic.layouts import categorical_colors

# Key by the node ID stored on each position: the layout only covers the largest
# connected component, so its order and length may not match ``g.nodes``
parts = {node.node_id: node.community for node in tupl[1]}
cat_cols = categorical_colors(parts)
show_graph(tupl[0], tupl[1], cat_cols)
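To illustrate the idea behind this kind of coloring (order partitions by population, then hand out colors in sequence), here is a small self-contained sketch. The palette, the tie-breaking rule, the wrap-around when the palette runs out, and the name ``categorical_colors_sketch`` are all assumptions made for illustration, not the library's implementation.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical palette; the library chooses its own perceptually balanced colors
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def categorical_colors_sketch(partitions: Dict[str, int],
                              palette: List[str] = PALETTE) -> Dict[str, str]:
    """Assign colors to partitions in order of population, largest first."""
    counts = Counter(partitions.values())
    # Largest partition first; ties are broken by partition ID (an assumption)
    ordered = sorted(counts, key=lambda part: (-counts[part], part))
    # Wrap around the palette if there are more partitions than colors (an assumption)
    color_of = {part: palette[i % len(palette)] for i, part in enumerate(ordered)}
    return {node: color_of[part] for node, part in partitions.items()}


partitions = {"a": 1, "b": 1, "c": 0, "d": 1, "e": 0, "f": 2}
node_colors = categorical_colors_sketch(partitions)
```

Here partition ``1`` has the most members, so its nodes receive the first palette color, followed by partition ``0`` and then partition ``2``.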

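Partitions do not have to come from ``leiden``. Next we fit a Gaussian mixture model (GMM) to the 2d node positions and color by its cluster labels. You will notice that this gives almost identical coloring to before.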

In [None]:
import numpy as np
from sklearn.mixture import GaussianMixture
from graspologic.layouts import categorical_colors

positions = tupl[1]
# Use one mixture component per distinct ``leiden`` community
n_components = len({node.community for node in positions})
X = np.array([[node.x, node.y] for node in positions])
labels = GaussianMixture(n_components=n_components, random_state=0).fit_predict(X)
# Key by the node ID stored on each position so that labels and nodes stay aligned
cat_cols = categorical_colors({node.node_id: int(label) for node, label in zip(positions, labels)})
show_graph(tupl[0], positions, cat_cols)

## ``sequential_colors``

Like the coloring function above, this one takes a dictionary keyed by node. However, the color is based solely on where the floating point value associated with each node falls within the overall range of values.

In [None]:
from graspologic.layouts import sequential_colors

# Color each node by its x coordinate, keyed by the node ID stored on each position
parts = {node.node_id: node.x for node in tupl[1]}
seq_cols = sequential_colors(parts)
show_graph(tupl[0], tupl[1], seq_cols)
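The idea of mapping a value range into a sequential color space can be sketched as a linear interpolation between two end colors. This is only an illustration under stated assumptions: the two end colors, the linear scaling, and the name ``sequential_colors_sketch`` are hypothetical and not the library's implementation.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]


def sequential_colors_sketch(values: Dict[str, float],
                             low: RGB = (255, 255, 204),
                             high: RGB = (128, 0, 38)) -> Dict[str, str]:
    """Map each value linearly onto a two-color gradient and return hex colors."""
    smallest, largest = min(values.values()), max(values.values())
    span = largest - smallest
    colors = {}
    for node, value in values.items():
        # Position of the value within the range, from 0.0 (smallest) to 1.0 (largest)
        t = (value - smallest) / span if span else 0.0
        rgb = tuple(round(a + t * (b - a)) for a, b in zip(low, high))
        colors[node] = "#{:02x}{:02x}{:02x}".format(*rgb)
    return colors


values = {"a": 0.0, "b": 5.0, "c": 10.0}
gradient = sequential_colors_sketch(values)
```

The node with the smallest value receives the ``low`` color, the node with the largest value receives the ``high`` color, and everything else lands in between.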

# Modifications to the ``show_graph`` parameters

For the graphs above we have not really found it necessary to change the ``show_graph`` default parameters, mainly because ``layouts`` handles these large graphs with larger edge weights well. If we had a smaller graph, or one with smaller edge weights, it would be necessary to edit these parameters. Let us walk through an example of this.

In [None]:
import seaborn as sns

g1 = nx.complete_graph(100)

# Relabel the integer nodes as strings, as required by ``leiden``
g1 = nx.relabel_nodes(g1, {node: str(node) for node in g1.nodes})

nodes = list(g1.nodes)
colors = sns.color_palette(n_colors=g1.number_of_nodes())
node_to_color = dict(zip(nodes, colors))
tupl = layout_umap(g1)
show_graph(tupl[0], tupl[1], node_to_color)

The edges are not as clear as before, and the nodes also begin to look too translucent at the default ``alpha`` due to the ``nooverlap``. So we shall change the parameters as shown below.

In [None]:
show_graph(tupl[0], tupl[1], node_to_color, vertex_alpha=1, edge_line_width=1, edge_alpha=1)

These parameters are the ones that will most frequently be changed in order to make the graph look more appropriate.