# Project 1, Milestone 2

## 1. Install Neo4j server

Consult the Workflow section for details.

## 2. Configure Neo4j server

Consult the Workflow section for details.

## 3. Create new database

Consult the Workflow section for details.

## 4. Create the graph

In this step, we will convert the adjacency matrix we created in the previous milestone and combine it with our original dataset to create a pair of files of the form shown below. These will serve as input files to the `neo4j-admin import` command to load this data into a Neo4j graph.

**nodes.csv**

```
doc_id:ID,title,category,:LABEL
0704.0302,Spline Single-Index Prediction Model,stat.TH,Article
0704.0326,On generalized entropy measures and pathways,stat.TH,Article
...
```

**edges.csv**
```
:START_ID,:END_ID,:TYPE
0704.0302,1802.02649,SIMILAR_TO
0704.0517,0704.0744,SIMILAR_TO
...
```

Note that the first line in each file describes metadata that will be used to populate node and edge properties.

For example, the first line in **nodes.csv** tells us that each node has a `doc_id` property that uniquely identifies it (and can therefore serve as the `:ID`), plus `title` and `category` properties, both of which are strings. The node type is given by the `:LABEL` column; here every node is labeled `Article`. The subsequent lines give the corresponding values for each node in our graph.

Similarly, the first line of **edges.csv** tells us that each subsequent line describes one edge: the `:START_ID` and `:END_ID` columns hold the `doc_id:ID` values of the two `Article` nodes it connects, and the `:TYPE` column gives the edge type, here `SIMILAR_TO`.
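To see how the header names pair up with the values in each row, here is a minimal sketch that parses the two sample node rows with Python's `csv.DictReader`. It only illustrates the column-to-property mapping and makes no claim about how `neo4j-admin import` parses the file internally; the names `sample_nodes` and `rows` are made up for this example.

```python
import csv
import io

# The sample rows from nodes.csv above, held in memory for illustration.
sample_nodes = """doc_id:ID,title,category,:LABEL
0704.0302,Spline Single-Index Prediction Model,stat.TH,Article
0704.0326,On generalized entropy measures and pathways,stat.TH,Article
"""

# DictReader uses the first line as the keys for every following row.
rows = list(csv.DictReader(io.StringIO(sample_nodes)))
for row in rows:
    print(row[":LABEL"], row["doc_id:ID"], "->", row["title"], "/", row["category"])
```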

In [None]:
import csv
import numpy as np
import os

from scipy.sparse import load_npz

In [None]:
DATA_DIR = "../../data/project-1"

PATTERN = "av"

# our inputs from previous milestones
# input to milestone 1
ABSTRACTS_FILE = os.path.join(DATA_DIR, "stat-abstracts.tsv")
# outputs of milestone 1
DOCIDS_FILE = os.path.join(DATA_DIR, "stat-{:s}-docids.txt".format(PATTERN))
INPUT_MATRIX_FILE = os.path.join(DATA_DIR, "{:s}-adjmatrix.npz".format(PATTERN))

# our deliverables from milestone 2
NODE_FILE = os.path.join(DATA_DIR, "{:s}-nodes.csv".format(PATTERN))
RELS_FILE = os.path.join(DATA_DIR, "{:s}-edges.csv".format(PATTERN))

### Deserialize mapping of row/col id to docID

The adjacency matrix we generated in the previous milestone is available to us as a serialized NumPy file. On deserializing it, we get a square matrix of shape (50426, 50426), where 50426 is the number of articles in our initial dataset.

Since we need to think in terms of a graph of actual documents now, we would like to relate the row / column ids in the adjacency matrix to actual articles and their document IDs. Therefore we need to produce a mapping of row / column ids to their corresponding article IDs.

Read the file indicated by `DOCIDS_FILE` and extract from it a dictionary mapping the row / column ID to the article `docID`. Call this dictionary `id_to_docid`.
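The layout of the docIDs file depends on how you wrote it in the previous milestone, so the following is only a sketch under a stated assumption: each line holds a docID and its row ID separated by a tab. The sample content and the name `id_to_docid_demo` are hypothetical; adapt the split to your own file.

```python
import io

# ASSUMPTION: one "<docID>\t<row_id>" pair per line. Your file may differ.
sample_docids = io.StringIO("0704.0302\t0\n0704.0326\t1\n")

id_to_docid_demo = {}
for line in sample_docids:
    doc_id, row_id = line.strip().split("\t")
    # Row IDs become ints so they can index the matrix; docIDs stay strings
    # so that a value such as "0704.0302" keeps its leading zero.
    id_to_docid_demo[int(row_id)] = doc_id

print(id_to_docid_demo)
```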

In [None]:
id_to_docid = {}

## your code goes here
## Hint: Open the DOCIDS_FILE, loop through it, extracting docID and id values
## and populate a dictionary

## end of your code goes here
len(id_to_docid)

### Create mapping of docID to (title, category)

In order to create our **nodes.csv** file, we need the ability to look up the document title and category from the input file (indicated by `ABSTRACTS_FILE`) provided to us.

The `ABSTRACTS_FILE` contains the `docID`, `title`, `categories` and `abstract_text` for each article. To keep things simple, we will set the article category to the first `stat` category that the authors have marked it up with. Write a function that will take a concatenated set of categories separated by the semi-colon (";") character, and select the first category that starts with the string "stat.". Assign this category to the variable `first_stat_cat`.
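One possible way to pick the first `stat.` category is sketched below. The helper name `pick_first_stat` is made up for this illustration, and returning `None` when nothing matches is an assumption, chosen to match the `None` default in the function stub that follows.

```python
def pick_first_stat(categories_str):
    """Return the first category starting with 'stat.', or None if there is none."""
    for cat in categories_str.split(";"):
        cat = cat.strip()  # tolerate stray whitespace around separators
        if cat.startswith("stat."):
            return cat
    return None


print(pick_first_stat("cs.ML;stat.SE;stat.TH"))  # stat.SE
print(pick_first_stat("cs.ML;math.PR"))          # None
```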

In [None]:
def get_first_stat_category(categories_str):
    first_stat_cat = None
    ### your code goes here

    ### end of your code goes here
    return first_stat_cat


# test your function, you should get stat.SE
get_first_stat_category("cs.ML;stat.SE;stat.TH")

Next, we will parse the `ABSTRACTS_FILE` to extract `docID`, `title` and `categories` fields. We will extract the first `stat.*` category from the categories.

In [None]:
docid_to_title_cat = {}

num_docs = 0
with open(ABSTRACTS_FILE, "r") as fabs:
    for line in fabs:
        if num_docs % 10000 == 0:
            print("{:d} doc IDs read".format(num_docs))
        ## your code here
        ## Hint: Get docId, title and categories, then extract the first stat. category

        ## end of your code here
        docid_to_title_cat[doc_id] = (title, stat_cat)
        num_docs += 1
        
print("{:d} doc IDs read, COMPLETE".format(num_docs))
len(docid_to_title_cat)

### Create the nodes CSV file

Using the `id_to_docid` and `docid_to_title_cat` dictionaries you just created, write out the nodes CSV file **nodes.csv** indicated by `NODE_FILE` above.

As a reminder, here is what the output should look like.

**nodes.csv**

```
doc_id:ID,title,category,:LABEL
0704.0302,Spline Single-Index Prediction Model,stat.TH,Article
0704.0326,On generalized entropy measures and pathways,stat.TH,Article
...
```

Make sure to handle commas within the `title` field, either by quoting the title or replacing the comma character in the title with some other character.
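If you choose the quoting route, Python's `csv.writer` can do it for you: with its default settings it wraps any field that contains a comma in double quotes. The sketch below writes to an in-memory buffer; the docID and title in the second row are invented for the example.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["doc_id:ID", "title", "category", ":LABEL"])
# Hypothetical article whose title contains a comma.
writer.writerow(["0000.0000", "Graphs, trees and forests", "stat.TH", "Article"])
print(buf.getvalue())

# Reading the text back confirms that every line still has 4 columns.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
print([len(row) for row in parsed])
```

The title comes out as `"Graphs, trees and forests"`, quotes included, so the embedded comma is no longer read as a column separator.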

In [None]:
num_docs = 0
fnode = open(NODE_FILE, "w")
## your code goes here
## write out the header as shown in the example

## end of your code goes here
for i in range(len(id_to_docid)):
    if num_docs % 10000 == 0:
        print("{:d} nodes written".format(num_docs))
    ## your code goes here
    ## write out each line as shown in the example

    ## end of your code goes here
    num_docs += 1
    
print("{:d} nodes written, COMPLETE\n".format(num_docs))
fnode.close()

### Check your work

The `check_csv` function below checks that your CSV file contains the expected number of columns and rows. If it does not, it prints a message; for a column mismatch, the message includes the line number where the problem occurred. We expect **nodes.csv** to have 4 columns per line and 50426 data lines after the header, since we write `doc_id`, `title`, `category` and `:LABEL` for each of the 50426 articles in our dataset.

**If your nodes.csv file looks good, you should see NO OUTPUT from the `check_csv` function.**

In [None]:
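def check_csv(file_path, expected_num_cols, expected_num_rows):
    # index of the last line read; the header is line 0, so for a
    # well-formed file this equals the number of data rows
    last_idx = 0
    with open(file_path, "r", newline="") as fcsv:
        reader = csv.reader(fcsv)
        for last_idx, row in enumerate(reader):
            if len(row) != expected_num_cols:
                print("Invalid CSV file: {:d} columns in line {:d}, expected {:d}"
                      .format(len(row), last_idx + 1, expected_num_cols))
                return  # stop here so we do not also report a misleading row count
    if last_idx != expected_num_rows:
        print("Invalid CSV file: {:d} rows found, expected {:d}".format(last_idx, expected_num_rows))


check_csv(NODE_FILE, 4, 50426)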

### Deserialize the Adjacency Matrix

Deserialize the adjacency matrix you created in the last milestone (indicated by `INPUT_MATRIX_FILE`) into a variable named `A`, and verify that its shape is `(50426, 50426)`.

If you have saved the adjacency matrix as a dense matrix (i.e., using `np.save()`) in the last milestone, then you should use `np.load()` to deserialize. If you saved it as a sparse COO matrix (our recommended approach), then you should use `load_npz()` instead. Deserializing from a sparse representation is significantly faster than deserializing from a dense representation.
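As a small illustration of the sparse route, the sketch below saves a toy 3x3 matrix with `save_npz()` and reads it back with `load_npz()`. The matrix, the temporary file name and the variable `A_toy` are made up for this example; in the notebook you would load `INPUT_MATRIX_FILE` instead.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import coo_matrix, load_npz, save_npz

# Toy symmetric adjacency matrix with ones on the diagonal.
dense = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1]])
sparse = coo_matrix(dense)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-adjmatrix.npz")
    save_npz(path, sparse)   # sparse serialization
    A_toy = load_npz(path)   # sparse deserialization

# Shape and number of stored non-zero entries survive the round trip.
print(A_toy.shape, A_toy.nnz)
```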

In [None]:
## your code goes here

## end of your code goes here

### Create the edges.csv file

Use the deserialized matrix `A` and our `id_to_docid` dictionary to create the **edges.csv** file indicated by `RELS_FILE`. As a reminder, here is what the output should look like.

```
:START_ID,:END_ID,:TYPE
0704.0302,1802.02649,SIMILAR_TO
0704.0517,0704.0744,SIMILAR_TO
...
```

Remember that graph edges here are based on document similarity and are hence symmetric, i.e., `A[i, j] == A[j, i]` for any pair of articles $doc_i$ and $doc_j$. While Neo4j stores only directed edges, the Cypher query language can ignore edge direction when querying the graph. Therefore, when creating the graph, we will write only a single edge for any given $doc_i$, $doc_j$ pair.

Also remember that adjacency matrices built from similarity typically have their highest values along the diagonal (i.e., a document is most similar to itself), so in the general case you should skip the diagonal entries as well.

A good way to handle both is to look only at non-zero entries in `A` where `i < j`, i.e.,

```
if i < j and A[i, j] > 0:
    # add an edge between i and j
```

**NOTE: This operation will take some time to complete!**
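If `A` is a sparse COO matrix, element access such as `A[i, j]` is not supported, so one alternative is to iterate over the stored entries directly. The sketch below uses `scipy.sparse.triu` with `k=1` to keep only entries above the diagonal (that is, `i < j`) on a toy matrix. The names `A_toy`, `toy_ids` and `edges` and the docIDs are hypothetical, and this is one possible approach rather than the expected solution.

```python
import numpy as np
from scipy.sparse import coo_matrix, triu

# Toy symmetric adjacency matrix with ones on the diagonal.
A_toy = coo_matrix(np.array([[1, 1, 0],
                             [1, 1, 1],
                             [0, 1, 1]]))
toy_ids = {0: "doc-a", 1: "doc-b", 2: "doc-c"}  # hypothetical docIDs

# k=1 drops the diagonal and everything below it; the result is in COO format.
upper = triu(A_toy, k=1)

edges = [
    (toy_ids[int(i)], toy_ids[int(j)], "SIMILAR_TO")
    for i, j, v in zip(upper.row, upper.col, upper.data)
    if v > 0  # guard against explicitly stored zeros
]
print(edges)
```

Each symmetric pair appears once and no document is linked to itself, which is the same effect as the `i < j` condition above.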

In [None]:
num_docs, num_edges = 0, 0
frels = open(RELS_FILE, "w")
frels.write(":START_ID,:END_ID,:TYPE\n")
for i in range(A.shape[0]):
    if i % 10000 == 0:
        print("{:d} docs read".format(num_docs))
    ## your code goes here
    ## Find the start and end docIDs from the indexes i and j
    ## Remember that our graph is based on document similarity and hence symmetric
    ## Also remember to skip diagonal elements (see if condition above)

    ## end of your code goes here
            num_edges += 1
    num_docs += 1

print("{:d} docs read, COMPLETE".format(num_docs))
print("--")
print("{:d} edges written, COMPLETE".format(num_edges))
frels.close()

### Check your work

**If your edges.csv file looks good, you should see no output from the call to `check_csv` below.**

In [None]:
check_csv(RELS_FILE, 3, 25603984)

### Compute graph sparsity

As a fun exercise, compute the graph sparsity, that is, the ratio of `num_edges` to the maximum number of possible edges in a fully connected graph.

With directed edges, the maximum number of edges would be `num_nodes * (num_nodes - 1)`, which counts two edges between any two nodes i and j. However, since the `SIMILAR_TO` relationship is symmetric, we count only one edge per pair, so the maximum is `num_nodes * (num_nodes - 1) / 2`.

Your answer rounded to 2 decimal places should be `0.02`.
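The arithmetic can be checked with the node and edge counts quoted earlier in this notebook. This is a sketch only; the variable names `n`, `m`, `max_edges` and `ratio` are chosen for the example.

```python
n = 50426      # number of articles (nodes)
m = 25603984   # number of rows expected in edges.csv

# One undirected edge per unordered pair of distinct nodes.
max_edges = n * (n - 1) // 2
ratio = m / max_edges
print(round(ratio, 2))  # 0.02
```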

In [None]:
## your code here

## end of your code here

### Load data into Neo4j

While there are multiple ways to load data into a Neo4j database, by far the fastest is the `neo4j-admin import` command.

**NOTE: This requires the Neo4j server to be shut down.**

First shut down the Neo4j server, then run the command to import the data into Neo4j. We will restart the server after updating the configuration.

```
$NEO4J_HOME/bin/neo4j stop
$NEO4J_HOME/bin/neo4j-admin import --database=av-graph --nodes=av-nodes.csv --relationships=av-edges.csv
```

We are using Neo4j Community Edition, which makes only a single graph database accessible at a time, so we need to make `av-graph` the default database. In `$NEO4J_HOME/conf/neo4j.conf`, set `dbms.default_database` to `av-graph`.

```
# $NEO4J_HOME/conf/neo4j.conf
...
dbms.default_database=av-graph
...
```

Then restart the Neo4j server.

```
# start in the background, or replace "start" with "console" to run in the foreground
$NEO4J_HOME/bin/neo4j start
```

## 5. Run Basic Exploratory Commands

You should be able to verify that the data was successfully loaded. You can use the Neo4j browser interface at http://localhost:7474 to count the number of nodes and edges using the following Cypher commands.

```
MATCH (n) RETURN COUNT(n) as num_nodes
```

```
MATCH ()-[r]->() RETURN COUNT(r) as num_edges
```

Alternatively, you can use the `py2neo` interface to do this programmatically from Python.

In [None]:
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "admin"))

In [None]:
num_nodes_g = graph.run("MATCH (n) RETURN COUNT(n) AS num_nodes")
num_nodes_g = list(num_nodes_g.data())
num_nodes_g[0]["num_nodes"]

Use `py2neo` to find the number of edges in the graph, using the Cypher query shown earlier.

In [None]:
## your code here
num_edges_g = graph.run("MATCH ()-[r]->() RETURN COUNT(r) AS num_edges")
num_edges_g = list(num_edges_g.data())
num_edges_g[0]["num_edges"]
## end of your code here