In [1]:
import sys, os
sys.path.append(os.path.abspath('package'))
from package import KnowledgeGraphBuilder
import pandas as pd
import pickle



Note: to be able to use all crisp methods, you need to install some additional packages:  {'graph_tool', 'wurlitzer'}
Note: to be able to use all crisp methods, you need to install some additional packages:  {'wurlitzer'}


# 1. Build knowledge graph

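Initialize the `KnowledgeGraphBuilder` object. A GitHub token must be given as a parameter; the cell below also passes a second token (`hf_token`, presumably for Hugging Face). Both are read from local text files.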

In [2]:
with open('./_/git_token.txt', 'r') as f:
    git_token = f.read().strip()

with open('./_/hf_token.txt', 'r') as f:
    hf_token = f.read().strip()

kgb = KnowledgeGraphBuilder(git_token, hf_token)

Build the knowledge graph with the `KnowledgeGraphBuilder.build_knowledge_graph()` method. Parameters:
- *repo_name*: Name of the repository. Must match the format "owner/repo_name", as it is used for GitHub API calls.
- *graph_type* (optional): Type of subgraph to build from the functions. Can be "CFG" (Control Flow Graph) or "AST" (Abstract Syntax Tree). Default is "CFG".
- *num_of_PRs* (optional): Number of pull requests to retrieve in detail. Defaults to 0 (all).
- *create_embedding* (optional): Whether to create embeddings for the nodes. Defaults to False.
- *repo_path_modifier* (optional): Path modifier for the repository for cases when only a subfolder is meant to be parsed.
- *URI* (optional): URI of the Neo4j database to save the graph to.
- *user* (optional): Username for the Neo4j database.
- *password* (optional): Password for the Neo4j database.
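
Since *repo_name* is used for GitHub API calls, it can help to check the "owner/repo_name" shape before building. The following is a minimal sketch: `is_valid_repo_name` is a hypothetical helper that is not part of the package, and the pattern is only a loose approximation of GitHub's naming rules.

```python
import re

# Hypothetical helper, not part of the package.
# Owner: letters, digits and hyphens, not starting or ending with a hyphen.
# Repository: letters, digits, '.', '_' and '-'. This is an approximation.
_REPO_NAME_RE = re.compile(
    r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?/[A-Za-z0-9._-]+"
)


def is_valid_repo_name(repo_name: str) -> bool:
    """Return True if repo_name looks like 'owner/repo_name'."""
    return _REPO_NAME_RE.fullmatch(repo_name) is not None
```

For example, `is_valid_repo_name('scikit-learn/scikit-learn')` passes, while a bare `'scikit-learn'` without an owner does not.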

In [3]:
repograph = kgb.build_knowledge_graph(
    repo_path='./repos/mongodb-erlang/',
    project_language='erlang',
    num_of_PRs=0,
    num_of_issues=0,
    semantic_clustering=False,
    repo_path_modifier='src/api/'
)

Building CG...
Creating subgraphs for each function...


115it [00:00, 359.24it/s]


Subgraphs created.
Filtering graph nodes...
Graph nodes filtered. Creating hierarchical edges...
Hierarchical graph building successful.
Clustering skipped (semantic_clustering=False).
Issues scraped.
PRs scraped.
Issue to PR edges created.
Artifacts scraped.
Failed to fetch contributors: 'NoneType' object has no attribute 'get_contributors'


In [4]:
repograph.keys()

dict_keys(['function_nodes', 'function_edges', 'subgraph_nodes', 'subgraph_edges', 'subgraph_function_edges', 'function_subgraph_edges', 'import_nodes', 'class_nodes', 'class_function_edges', 'class_class_edges', 'file_nodes', 'file_edges', 'file_function_edges', 'file_class_edges', 'file_import_edges', 'config_nodes', 'file_config_edges', 'import_function_edges', 'pr_nodes', 'pr_function_edges', 'issue_nodes', 'issue_pr_edges', 'artifacts', 'cluster_nodes', 'cluster_function_edges', 'functionversion_nodes', 'functionversion_edges', 'functionversion_function_edges', 'developer_nodes', 'developer_function_edges', 'question_nodes', 'question_cluster_edges'])

In [5]:
repograph['developer_nodes']

Unnamed: 0,ID,dev_name,dev_email,dev_full
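
The keys listed above follow a `*_nodes` / `*_edges` naming pattern, so a quick size overview can show which tables are empty (as `developer_nodes` is here). This is a minimal sketch with hypothetical helper names; it assumes only that every value in the dictionary supports `len()`, as pandas DataFrames and lists do.

```python
def summarize_graph(graph: dict) -> dict:
    """Return {key: number of entries} for every table in the graph dictionary."""
    # Assumption: each value supports len() (e.g. a pandas DataFrame or a list).
    return {key: len(value) for key, value in graph.items()}


def split_keys(graph: dict) -> tuple:
    """Split the keys into node tables and edge tables by their suffix."""
    node_keys = [key for key in graph if key.endswith("_nodes")]
    edge_keys = [key for key in graph if key.endswith("_edges")]
    return node_keys, edge_keys
```

Calling `summarize_graph(repograph)` returns one count per key. Keys such as `artifacts` that match neither suffix are still counted, but do not appear in either list from `split_keys`.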


Issues and PRs can also be scraped separately for a repository with the `scrape_issue_pr_data` method:

In [None]:
with open('./_/graph_v4.pkl', 'rb') as f:
    repograph = pickle.load(f)

issues_prs = kgb.scrape_issue_pr_data(
    repo_name='scikit-learn/scikit-learn',
    cg_nodes=repograph['function_nodes'],
    num_of_issues=100,
    num_of_PRs=1000,
)

In [4]:
with open('issues_prs.pkl', 'wb') as f:
    pickle.dump(issues_prs, f)

# 2. Visualize graph

Create an HTML visualization of the graph with the `visualize_graph` method. NOTE: for large graphs, it is advised to plot only a fraction of the nodes, otherwise the visualization might not render properly. Parameters:
- *repograph*: The dictionary containing the created repository graph.
- *show_subgraph_nodes* (optional): Whether to plot the subgraph (CFG or AST) nodes. Defaults to *False*.
- *save_path* (optional): The file path to save the visualization. Defaults to "./graph.html".
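
Plotting only a fraction of the nodes, as advised in the note above, amounts to sampling nodes and keeping the edges whose endpoints both survive. The sketch below works on plain node IDs and `(source, target)` pairs. `sample_subgraph` is a hypothetical helper, and mapping the result back onto the tables in `repograph` depends on column names that this notebook does not show.

```python
import random


def sample_subgraph(node_ids, edges, fraction, seed=0):
    """Keep a random fraction of nodes and only the edges between kept nodes."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    node_ids = list(node_ids)
    rng = random.Random(seed)  # fixed seed so the same sample is drawn each run
    # Keep at least one node when the input is non-empty.
    k = max(1, round(len(node_ids) * fraction)) if node_ids else 0
    kept = set(rng.sample(node_ids, k))
    # Drop every edge that touches a node that was sampled out.
    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges
```

Filtering the edges together with the nodes avoids dangling edges, which a visualization would otherwise have to resolve on its own.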

In [8]:
kgb.visualize_graph(repograph, show_subgraph_nodes=False)

Graph visualization saved to ./graph.html


<networkx.classes.graph.Graph at 0x247c906d590>

# 3. Save the graph

The graph can be saved in different formats.

### 3.1 Save it as a dictionary

Save and load the resulting graph dictionary as a pickle file.
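
The save and load steps can be wrapped in two small functions so that the same path is used in both directions. This is a minimal sketch using only the standard library; `save_graph` and `load_graph` are hypothetical names, not part of the package.

```python
import pickle
from pathlib import Path


def save_graph(graph, path):
    """Write the graph dictionary to a pickle file."""
    with Path(path).open("wb") as f:
        pickle.dump(graph, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_graph(path):
    """Read a graph dictionary back from a pickle file."""
    # Only unpickle files from a trusted source: loading a pickle can run arbitrary code.
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

A file written with `save_graph(repograph, 'graph.pkl')` can be read back with `load_graph('graph.pkl')`; the file name here is only an example.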

In [21]:
with open('graph_v10_nosubgraph.pkl', 'wb') as f:
    pickle.dump(repograph, f)

In [None]:
with open('graph_TEST.pkl', 'rb') as f:
    repograph = pickle.load(f)

In [None]:
repograph.keys()

### 3.2 Save it to a Neo4j database

The result can be saved to a Neo4j database by calling the `store_knowledge_graph_in_neo4j` method. Parameters:
- *uri*: URI of the Neo4j database.
- *user*: Username for the Neo4j database.
- *password*: Password for the Neo4j database.
- *knowledge_graph*: The knowledge graph to save.

If the *URI*, *user* and *password* parameters are passed to the `build_knowledge_graph` method, this method is called automatically and the graph is saved to Neo4j.
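
To keep the connection details out of the notebook, they can be read from environment variables and passed on as keyword arguments. This is a minimal sketch: `neo4j_settings_from_env` and the variable names `NEO4J_URI`, `NEO4J_USER` and `NEO4J_PASSWORD` are assumptions made for illustration, and only the keyword names `uri`, `user` and `password` come from the call shown in this notebook.

```python
import os


def neo4j_settings_from_env(env=None):
    """Collect the Neo4j connection keyword arguments from environment variables."""
    env = os.environ if env is None else env
    # Hypothetical variable names; adapt them to your own setup.
    names = {"uri": "NEO4J_URI", "user": "NEO4J_USER", "password": "NEO4J_PASSWORD"}
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The result can then be unpacked into the call, for example `kgb.store_knowledge_graph_in_neo4j(knowledge_graph=repograph, **neo4j_settings_from_env())`.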

In [4]:
kgb.store_knowledge_graph_in_neo4j(
    uri="neo4j://127.0.0.1:7687",
    user="neo4j",
    password="password",
    knowledge_graph=repograph
)

Loading nodes to neo4j: 100%|██████████| 33/33 [00:32<00:00,  1.03it/s]
Loading edges to neo4j: 100%|██████████| 33/33 [03:02<00:00,  5.54s/it]
