# Complex IO

`topologic` contains extensive `io` and `projection` packages for loading data in many ways.

Some sources of data are multigraphs, and you may need to make some hard decisions about how to convert a multigraph into an undirected or directed simple graph.

On edge duplication, do you want to:
- Sum the weights?
- Average them?
- Take the latest?
- Exclude edges based on some other attribute criteria?
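
As a rough illustration of the first three options, the sketch below collapses a small `networkx.MultiDiGraph` into a simple `DiGraph`. The helper name `collapse_multigraph` and the `timestamp` edge attribute are assumptions made for this example only; neither is part of `topologic`.

```python
import networkx as nx
from statistics import mean


def collapse_multigraph(multigraph: nx.MultiDiGraph, strategy: str = "sum") -> nx.DiGraph:
    """Collapse parallel edges into one weighted edge (hypothetical helper)."""
    # group the attribute dictionaries of all parallel edges by (source, target)
    grouped = {}
    for source, target, data in multigraph.edges(data=True):
        grouped.setdefault((source, target), []).append(data)

    simple = nx.DiGraph()
    simple.add_nodes_from(multigraph.nodes)
    for (source, target), records in grouped.items():
        weights = [record.get("weight", 1) for record in records]
        if strategy == "sum":
            weight = sum(weights)
        elif strategy == "average":
            weight = mean(weights)
        elif strategy == "latest":
            # assumes every edge carries a sortable `timestamp` attribute
            newest = max(records, key=lambda record: record["timestamp"])
            weight = newest.get("weight", 1)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        simple.add_edge(source, target, weight=weight)
    return simple


multigraph = nx.MultiDiGraph()
multigraph.add_edge("a", "b", weight=1, timestamp="2016-05-01 10:00:00")
multigraph.add_edge("a", "b", weight=3, timestamp="2016-06-01 10:00:00")
multigraph.add_edge("b", "c", weight=2, timestamp="2016-05-02 10:00:00")

summed = collapse_multigraph(multigraph, "sum")
averaged = collapse_multigraph(multigraph, "average")
latest = collapse_multigraph(multigraph, "latest")
```

Each strategy yields the same two edges; only the weight stored on `a` to `b` differs.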

You can always pre-process your data to answer this question for you, and write everything you need from scratch.  This is a valid strategy and if you feel most comfortable with it, you can either create your own `networkx` Graph objects or use the simple `topologic.io.from_file` function to create a graph for you.

`topologic` also contains a number of utility functions and an opinionated paradigm for operating over input files when building a graph. These can help you quickly transform your various source files into the exact graph you wish to analyze with the rest of `topologic`'s capabilities, and this notebook shows how to use them.  The paradigm pays off most when you expect to run similar projections, possibly with minor configuration differences, across many different source files.  Building a corpus of convenient projections could save you a lot of time in the future, especially in an enterprise environment.

# Data
The data we are using is located in `test_data/`, in the same directory as this notebook.  It is a directed multigraph from <a href="https://snap.stanford.edu/data/">Stanford's Large Network Dataset Collection</a>.  Specifically, we are going to use the <a href="https://snap.stanford.edu/data/soc-RedditHyperlinks.html">Social Network: Reddit Hyperlink Network</a>. This dataset was generated for the paper:

- S. Kumar, W.L. Hamilton, J. Leskovec, D. Jurafsky. <a href="https://cs.stanford.edu/~srijan/pubs/conflict-paper-www18.pdf">Community Interaction and Conflict on the Web.</a> World Wide Web Conference, 2018.

The file has been cut down to size by removing the `POST_PROPERTIES` column of the original data file, primarily to speed up cloning the repository.  To do this, we ran:

```bash
cut -f 1,2,3,4 soc-redditHyperlinks-body.tsv > test_data/smaller-redditHyperlinks-body.tsv
```

The **only** reason this file was modified with command-line utilities is to reduce its size within our git repository, which speeds up cloning and removes the `git-lfs` requirement. `topologic` would be quite happy to process a file of the original size under normal circumstances.

The tab-separated file has the format of:
```
SOURCE_SUBREDDIT    TARGET_SUBREDDIT    POST_ID TIMESTAMP   POST_LABEL
```
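
To make the column positions concrete, here is a minimal sketch that reads one record in this layout with the standard-library `csv` module and looks up the indices of the three fields used later in this notebook. The record values are made up for illustration.

```python
import csv
import io

# a header plus one illustrative record in the layout above; the values are invented
sample = (
    "SOURCE_SUBREDDIT\tTARGET_SUBREDDIT\tPOST_ID\tTIMESTAMP\tPOST_LABEL\n"
    "example_source\texample_target\tabc123\t2016-05-02 08:15:00\t1\n"
)

reader = csv.reader(io.StringIO(sample), dialect="excel-tab")
header = next(reader)

# 0-based positions of the fields a projection function needs
source_index = header.index("SOURCE_SUBREDDIT")
target_index = header.index("TARGET_SUBREDDIT")
date_index = header.index("TIMESTAMP")

row = next(reader)
source, target, timestamp = row[source_index], row[target_index], row[date_index]
```

The resulting indices are 0, 1 and 3, which are the values passed to the projection function further down.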

# Scenario

We want to load this graph, filter out any record dated before a given timestamp, and then aggregate the implied weight (1) of each remaining record with any existing `source` to `target` link between the same two vertices, using the
`topologic.io` and `topologic.projection` packages.

We will then make some edge weight cuts using the `topologic.statistics` package.

Finally we will show nominal usage of the `topologic.embedding.node2vec_embedding` function.

# Disclaimers

As we are trying to show the **capability** of the library, we are not necessarily going to make the best decisions with regard to cut dates or weight-based edge cuts, and we shouldn't expect to glean any useful information from our node2vec embedding.

# Initial Setup and Projection Function Creation
First we're going to import the libraries we're going to use, including `topologic`, and set some of the constants we 
expect to use later.

Then we're going to create our projection function.  This projection function is responsible
for processing a single row of data from our CSV parser and, depending on the business rules
we've put in place, updating the `networkx` graph.

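The `topologic.io.from_dataset` function expects to be given a function with the signature 
`Callable[[nx.Graph], Callable[[List[str]], None]]`.

That signature describes a function that takes a graph and returns a per-row function. We wrap it in one more function that holds our configuration, so what we define below is a function that returns a function that returns a function.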

The outermost function lets us specify configuration properties that we can use later, such as
the cutoff date to apply when we build a graph, or which column index (0-based) holds a given field.
After defining it, we will call it with our actual parameters.

The first inner function will be called by the `topologic.io.from_dataset` function. It will pass in the networkx
Graph object we will be using.

The second inner function will be called on a per csv-record basis, and will be responsible for actually applying
our business rules and deciding whether to update the graph or not.

In [1]:
import networkx as nx
import topologic as tc
from typing import Callable, List

reddit_hyperlinks_path = "test_data/smaller-redditHyperlinks-body.tsv"

# data file contains edges from Jan 2014 - April 2017 - we'll only take a year's worth
# timestamps are in `YYYY-MM-DD HH:mm:SS` format (e.g. `2013-12-31 16:39:58`)
# this means we can do a reasonably fast string comparison to determine whether the edge should be taken into
# consideration or not

date_cutoff = "2016-05-01"



def sum_after_date(
    keep_after_date: str,
    source_index: int,
    target_index: int,
    date_index: int
) -> Callable[[nx.Graph], Callable[[List[str]], None]]:
    def _csv_parser_setup(
        graph: nx.Graph
    ) -> Callable[[List[str]], None]:
        def _process_row(row: List[str]) -> None:
            # Called once per record. We drop any record dated before `keep_after_date`
            # (May 1st of 2016 in this notebook) and otherwise add 1 to the weight of the
            # source -> target edge, creating the edge if it does not exist yet.
            source = row[source_index]
            target = row[target_index]
            date = row[date_index]
            # timestamps are `YYYY-MM-DD HH:mm:SS` strings, so string order matches date order
            if date >= keep_after_date:
                original_weight = graph[source][target]["weight"] if graph.has_edge(source, target) else 0
                graph.add_edge(source, target, weight=original_weight + 1)
        return _process_row
    return _csv_parser_setup
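
The date filter relies on the fact that zero-padded `YYYY-MM-DD HH:mm:SS` strings sort in the same order as the moments they represent. The sketch below checks that claim against `datetime` parsing on a few made-up timestamps. It is only a sanity check and is not something `topologic` requires.

```python
from datetime import datetime

cutoff = "2016-05-01"
timestamps = [
    "2013-12-31 16:39:58",
    "2016-04-30 23:59:59",
    "2016-05-01 00:00:00",
    "2017-04-30 12:00:00",
]

# plain string comparison, as used in the projection function
kept_by_string = [ts for ts in timestamps if ts >= cutoff]

# the same filter using parsed datetime objects
cutoff_dt = datetime.strptime(cutoff, "%Y-%m-%d")
kept_by_datetime = [
    ts for ts in timestamps
    if datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") >= cutoff_dt
]
```

Both filters keep the same records. A timestamp at exactly midnight on the cutoff date is kept, because a longer string that starts with the cutoff compares as greater than the cutoff itself.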

# topologic.io.CsvDataset Setup

Now that we've defined our projection function, we need to define our `CsvDataset`.

As we mentioned earlier, we are using a tab-separated file with standard Unix `\n` line terminators. We use
the built-in Python `csv` module to parse our files, so this information determines which csv `Dialect`
we should use.  The file also includes a header row, which we want to skip.
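
For reference, this small standard-library sketch shows what the `excel-tab` dialect provides. Its declared line terminator is `\r\n`, but Python's `csv` reader ignores that setting and accepts `\n` as well, so Unix-style files parse without a custom dialect. The two data rows are made up.

```python
import csv
import io

dialect = csv.get_dialect("excel-tab")
delimiter = dialect.delimiter        # the tab character
terminator = dialect.lineterminator  # only consulted when writing

# Unix line endings are read correctly even though the dialect declares "\r\n"
data = io.StringIO("SOURCE\tTARGET\na\tb\nb\tc\n")
reader = csv.reader(data, dialect=dialect)
next(reader)  # skip the header row
records = list(reader)
```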

In [2]:
digraph = nx.DiGraph()
projection_function = sum_after_date(date_cutoff, 0, 1, 3)

with open(reddit_hyperlinks_path, "r") as data_input:
    dataset = tc.io.CsvDataset(
        source_iterator=data_input,
        has_headers=True,
        dialect="excel-tab"
    )
    
    digraph = tc.io.from_dataset(
        csv_dataset=dataset,
        projection_function_generator=projection_function,
        graph=digraph
    )

In [3]:
def print_graph(graph: nx.Graph):
    print(f"Number of Graph Vertices: {len(graph)}")
    print(f"Number of Graph Edges: {len(graph.edges())}")
    print(f"Maximum Edge Weight: {max(weight for _, _, weight in graph.edges(data='weight'))}")
    print(f"Minimum Edge Weight: {min(weight for _, _, weight in graph.edges(data='weight'))}")

In [4]:
print_graph(digraph)

Number of Graph Vertices: 19696
Number of Graph Edges: 52903
Maximum Edge Weight: 184
Minimum Edge Weight: 1


# Pruning Graphs by Edge Weight
As we can see from the above print statements, we have a graph of 19696 nodes and 52903 edges.  The file itself
contains `286561` records (`wc -l test_data/smaller-redditHyperlinks-body.tsv`, minus 1 for the header). The graph has
far fewer edges because we dropped every record before our cutoff date and merged repeated source to target links
into a single weighted edge.

We also know that the initial edge list did not contain a weight. We simply counted the number of source to target 
relationships within our time window and used that count as the weight.

Now we want to explore some of the tools available to make graph cuts.  In this specific case, we're going to explore
making graph cuts via the edge weight parameter (`topologic` also supports degree centrality and betweenness centrality
using an almost identical API). 


In [5]:
# histogram of weights
tc.statistics.histogram_edge_weight(digraph)

DefinedHistogram(histogram=array([52539,   242,    78,    22,    10,     4,     3,     1,     2,
           2]), bin_edges=array([  1. ,  19.3,  37.6,  55.9,  74.2,  92.5, 110.8, 129.1, 147.4,
       165.7, 184. ]))
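
The output above has the same shape as the result of `numpy.histogram` with its default of ten equal-width bins. The sketch below reproduces that binning on a tiny hand-built graph. It assumes, without confirmation from the `topologic` source, that `histogram_edge_weight` bins weights in a comparable way.

```python
import networkx as nx
import numpy as np

graph = nx.DiGraph()
graph.add_weighted_edges_from([
    ("a", "b", 1), ("b", "c", 1), ("c", "d", 2), ("d", "e", 20), ("e", "a", 184),
])

weights = [weight for _, _, weight in graph.edges(data="weight")]

# numpy defaults to 10 equal-width bins spanning the minimum and maximum weight
counts, bin_edges = np.histogram(weights)
```

With a minimum of 1 and a maximum of 184, each bin is 18.3 wide, which matches the bin edges printed above.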

In [6]:
# remove every edge with a weight greater than 92.5 (the top five histogram bins)
cut_graph = tc.statistics.cut_edges_by_weight(
    digraph,
    92.5,
    tc.statistics.MakeCuts.LARGER_THAN_EXCLUSIVE,
    prune_isolates=False
)

print_graph(cut_graph)


Number of Graph Vertices: 19696
Number of Graph Edges: 52891
Maximum Edge Weight: 92
Minimum Edge Weight: 1


It may be surprising that we have exactly the same number of vertices as before, even though 12 edges were removed.  This is because we did not
tell our edge cut to ALSO prune isolated vertices.  To address this, set `prune_isolates=True` and run it again.
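
To see why isolated vertices remain, here is a plain `networkx` sketch of the same idea. The helper `drop_heavy_edges` is hypothetical and is not how `topologic` implements `cut_edges_by_weight`. It only mirrors the behavior described above on a toy graph.

```python
import networkx as nx


def drop_heavy_edges(
    graph: nx.DiGraph, threshold: float, prune_isolates: bool = False
) -> nx.DiGraph:
    """Return a copy without edges whose weight exceeds `threshold` (hypothetical helper)."""
    result = graph.copy()
    heavy = [(u, v) for u, v, w in result.edges(data="weight") if w > threshold]
    result.remove_edges_from(heavy)
    if prune_isolates:
        # vertices with no remaining edges in either direction
        result.remove_nodes_from(list(nx.isolates(result)))
    return result


graph = nx.DiGraph()
graph.add_weighted_edges_from([("a", "b", 5), ("b", "c", 3), ("c", "x", 100)])

kept_isolates = drop_heavy_edges(graph, 92.5)
pruned = drop_heavy_edges(graph, 92.5, prune_isolates=True)
```

Removing the edge into `x` leaves `x` in the graph with no edges. It only disappears when isolates are pruned explicitly.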

In [7]:
cut_graph = tc.statistics.cut_edges_by_weight(
    digraph,
    92.5,
    tc.statistics.MakeCuts.LARGER_THAN_EXCLUSIVE,
    prune_isolates=True
)

print_graph(cut_graph)

Number of Graph Vertices: 19694
Number of Graph Edges: 52891
Maximum Edge Weight: 92
Minimum Edge Weight: 1


We've now pruned the two vertices that were left isolated by the cut!

Now let's extract the largest connected component from our graph. Multiple connected components skew the resulting
embeddings more than we would like, so we'll work over a single connected component. Note: it may make sense to create 
separate embeddings for each connected component if their sizes are roughly equivalent.
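
For a directed graph, one common reading of "largest connected component" is the largest weakly connected component, where edge direction is ignored when grouping vertices. The sketch below shows that reading with plain `networkx`. Whether `tc.largest_connected_component` uses weak connectivity internally is an assumption here, not something this notebook states.

```python
import networkx as nx

graph = nx.DiGraph()
graph.add_edges_from([("a", "b"), ("b", "c"), ("c", "a"), ("x", "y")])

# direction is ignored when grouping vertices into weakly connected components
largest_nodes = max(nx.weakly_connected_components(graph), key=len)

# copy the induced subgraph so it can be modified independently of the original
largest = graph.subgraph(largest_nodes).copy()
```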

In [8]:
lcc = tc.largest_connected_component(cut_graph)

print_graph(lcc)

embedding_container = tc.embedding.node2vec_embedding(lcc)


Number of Graph Vertices: 18601
Number of Graph Edges: 52243
Maximum Edge Weight: 92
Minimum Edge Weight: 1
