In [2]:
import sys, os
sys.path.append(os.path.abspath('package'))
from package import KnowledgeGraphBuilder
import pandas as pd
import pickle



Note: to be able to use all crisp methods, you need to install some additional packages:  {'wurlitzer', 'graph_tool'}
Note: to be able to use all crisp methods, you need to install some additional packages:  {'wurlitzer'}


# 1. Build knowledge graph

Initialize the `KnowledgeGraphBuilder` object. A GitHub token must be given as a parameter; the cell below also passes a Hugging Face token. Both are read from local text files.

In [3]:
with open('./_/git_token.txt', 'r') as f:
    git_token = f.read().strip()

with open('./_/hf_token.txt', 'r') as f:
    hf_token = f.read().strip()

kgb = KnowledgeGraphBuilder(git_token, hf_token)
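Token files kept next to a notebook are easy to commit by accident. As a minimal sketch, the tokens could be looked up in environment variables first, falling back to the files used above. The helper `read_token` and the variable names `GITHUB_TOKEN` and `HF_TOKEN` are assumptions for illustration, not part of the package.

```python
import os
from pathlib import Path


def read_token(env_var, fallback_path):
    """Return a token from the environment, else from a local text file.

    Hypothetical helper: neither the function nor the environment
    variable names come from the package itself.
    """
    token = os.environ.get(env_var)
    if token:
        return token.strip()
    path = Path(fallback_path)
    if path.is_file():
        return path.read_text().strip()
    raise FileNotFoundError(f"Set {env_var} or create {fallback_path}")


# Possible usage, assuming the variable names above:
# git_token = read_token("GITHUB_TOKEN", "./_/git_token.txt")
# hf_token = read_token("HF_TOKEN", "./_/hf_token.txt")
```

The resulting strings can be passed to `KnowledgeGraphBuilder` exactly as in the cell above.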

Build the knowledge graph with the `KnowledgeGraphBuilder.build_knowledge_graph()` method. Parameters:
- *repo_name*: Name of the repository. Must match the format "owner/repo_name", as it is used for GitHub API calls.
- *graph_type* (optional): Type of subgraph to build from the functions. Can be "CFG" (Control Flow Graph) or "AST" (Abstract Syntax Tree). Defaults to "CFG".
- *num_of_PRs* (optional): Number of pull requests to retrieve in detail. Defaults to 0 (all).
- *create_embedding* (optional): Whether to create embeddings for the nodes. Defaults to False.
- *repo_path_modifier* (optional): Path modifier for the repository, for cases when only a subfolder should be parsed.
- *URI* (optional): URI of the Neo4j database to save the graph to.
- *user* (optional): Username for the Neo4j database.
- *password* (optional): Password for the Neo4j database.
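Because *repo_name* is used directly in GitHub API calls, a malformed value only fails once requests are being made. A small check before the build can catch this earlier. The sketch below uses a deliberately loose pattern, not GitHub's exact naming rules, and `is_valid_repo_name` is a hypothetical helper rather than part of the package.

```python
import re

# Loose "owner/repo_name" pattern: exactly one slash, with letters,
# digits, dots, underscores, or hyphens on each side. This is an
# approximation, not GitHub's exact naming rules.
_REPO_NAME = re.compile(r"[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+")


def is_valid_repo_name(repo_name):
    """Return True if repo_name looks like 'owner/repo_name'."""
    return _REPO_NAME.fullmatch(repo_name) is not None


ok = is_valid_repo_name("scikit-learn/scikit-learn")
missing_owner = is_valid_repo_name("scikit-learn")
```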

In [4]:
repograph = kgb.build_knowledge_graph(
    repo_name='scikit-learn/scikit-learn',
    num_of_PRs=300,
    num_of_issues=3000,
    scrape_comments=False,
    semantic_clustering=True,
    create_embedding=False,
    repo_functions_only=True,
    repo_path_modifier='sklearn/'
)

Repo already exists here: ./repos\scikit-learn
Building CG...
Creating subgraphs for each function...


10537it [51:01,  3.44it/s]


Subgraphs created.
Filtering graph nodes...
Graph nodes filtered. Creating hierarchical edges...
Hierarchical graph building successful.


Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 15.15it/s]
Device set to use cuda:0


Number of clusters: 20 with silhouette score: 0.01682088667840439
[{'generated_text': 'These function names belong to one cluster:\nclone; clone parametrized; classifier mixin score; regressor mixin score; bicluster mixin biclusters; bicluster mixin get indices; bicluster mixin get shape; bicluster mixin get submatrix; density mixin score; is classifier; is regressor; is clusterer; is outlier detector; decorator; wrapper; calibrated classifier cv   init; calibrated classifier cv predict proba; calibrated classifier cv predict; calibrated classifier cv get metadata routing; calibrated classifier   init; calibrated classifier predict proba; sigmoid calibration; loss grad; convert to logits; sigmoid calibration predict; temperature scaling log loss; temperature scaling predict; calibration curve; calibration display   init; calibration display plot; calibration display from predictions; raccoon face or skip; global dtype; fetch fixture; wrapped; pytest collection modifyitems; pyplot; pyte

Request GET /users/bellet failed with 403: Forbidden
Setting next backoff to 215.597688s


In [5]:
repograph.keys()

dict_keys(['function_nodes', 'function_edges', 'subgraph_nodes', 'subgraph_edges', 'subgraph_function_edges', 'function_subgraph_edges', 'import_nodes', 'class_nodes', 'class_function_edges', 'class_class_edges', 'file_nodes', 'file_edges', 'file_function_edges', 'file_class_edges', 'file_import_edges', 'config_nodes', 'file_config_edges', 'import_function_edges', 'pr_nodes', 'pr_function_edges', 'issue_nodes', 'issue_pr_edges', 'artifacts', 'actions', 'cluster_nodes', 'cluster_function_edges', 'functionversion_nodes', 'functionversion_edges', 'functionversion_function_edges', 'developer_nodes', 'developer_function_edges', 'question_nodes', 'question_cluster_edges'])
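With this many entries, a quick size overview helps to spot tables that came back empty. The sketch below assumes only that `repograph` is a dictionary; it reports the type and length of each value without relying on any particular column layout. `summarize_graph` is a hypothetical helper.

```python
def summarize_graph(graph):
    """Return (key, type name, length) for every entry of a graph dict.

    The length is None for values that do not support len(). For a
    pandas DataFrame, len() is the number of rows.
    """
    summary = []
    for key, value in graph.items():
        size = len(value) if hasattr(value, "__len__") else None
        summary.append((key, type(value).__name__, size))
    return summary


# Toy stand-in for the real graph dictionary.
toy_graph = {"function_nodes": [1, 2, 3], "artifacts": {"a": 1}, "flag": 5}
toy_summary = summarize_graph(toy_graph)
```

Calling `summarize_graph(repograph)` on the real dictionary would list each of the keys above with its row count.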

In [17]:
repograph['developer_nodes']

Unnamed: 0,ID,dev_name,dev_email,dev_full
0,1,ogrisel,olivier.grisel@ensta.org,Olivier Grisel
1,2,amueller,t3kcit+githubspam@gmail.com,Andreas Mueller
2,3,larsmans,,Lars
3,4,agramfort,alexandre.gramfort@m4x.org,Alexandre Gramfort
4,5,glouppe,g.louppe@gmail.com,Gilles Louppe
...,...,...,...,...
406,407,mrbeann,jiachengliu@cuhk.edu.hk,mrbean
407,408,JPFrancoia,jeanpatrick.francoia@gmail.com,JPFrancoia
408,409,jaglima,jesselima@protonmail.com,Jesse Lima
409,410,fishcorn,,Josephine Moeller


Scrape issues and pull requests for a repository with the `scrape_issue_pr_data` method.

In [None]:
with open('./_/graph_v4.pkl', 'rb') as f:
    repograph = pickle.load(f)

issues_prs = kgb.scrape_issue_pr_data(
    repo_name='scikit-learn/scikit-learn',
    cg_nodes=repograph['function_nodes'],
    num_of_issues=100,
    num_of_PRs=1000,
)

In [4]:
with open('issues_prs.pkl', 'wb') as f:
    pickle.dump(issues_prs, f)

# 2. Visualize graph

Create an HTML visualization of the graph with the `visualize_graph` method. NOTE: for large graphs, plot only a fraction of the nodes, otherwise the visualization might not render properly. Parameters:
- *repograph*: The dictionary containing the created repository graph.
- *show_subgraph_nodes* (optional): Whether to plot the subgraph (CFG or AST) nodes. Defaults to *False*.
- *save_path* (optional): The file path to save the visualization. Defaults to "./graph.html".

In [None]:
kgb.visualize_graph(repograph, show_subgraph_nodes=False)
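One way to plot only a fraction of the nodes is to sample a node table and keep only the edges whose endpoints both survive. The sketch below shows the idea on toy DataFrames. The column names `ID`, `source_id`, and `target_id` are assumptions, since the edge table layout is not shown in this notebook, and `sample_graph` is a hypothetical helper.

```python
import pandas as pd


def sample_graph(nodes, edges, frac, id_col, source_col, target_col, seed=0):
    """Keep a random fraction of nodes and the edges between kept nodes."""
    kept_nodes = nodes.sample(frac=frac, random_state=seed)
    kept_ids = set(kept_nodes[id_col])
    # An edge survives only if both of its endpoints were sampled.
    mask = edges[source_col].isin(kept_ids) & edges[target_col].isin(kept_ids)
    return kept_nodes, edges[mask]


# Toy data with hypothetical column names: a chain 0 -> 1 -> ... -> 9.
nodes = pd.DataFrame({"ID": list(range(10))})
edges = pd.DataFrame({"source_id": list(range(9)),
                      "target_id": list(range(1, 10))})

small_nodes, small_edges = sample_graph(
    nodes, edges, frac=0.5,
    id_col="ID", source_col="source_id", target_col="target_id",
)
```

The reduced tables could then replace the corresponding entries in a copy of `repograph` before calling `visualize_graph`. Other tables that reference the dropped nodes may need the same filtering.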

# 3. Save the graph

Saving the graph in different formats.

### 3.1 Save it as a dictionary

Saving and loading the resulting graph dictionary as a pickle.

In [21]:
with open('graph_v10_nosubgraph.pkl', 'wb') as f:
    pickle.dump(repograph, f)

In [None]:
with open('graph_TEST.pkl', 'rb') as f:
    repograph = pickle.load(f)

In [None]:
repograph.keys()
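Since building the graph can take close to an hour, an interrupted write that corrupts the saved file is costly. The sketch below writes to a temporary file first and renames it only after the dump succeeds. `save_graph` and `load_graph` are hypothetical helpers built on the standard library. Note that `pickle.load` can execute arbitrary code, so only files from a trusted source should be loaded.

```python
import pickle
import tempfile
from pathlib import Path


def save_graph(graph, path):
    """Pickle graph to path via a temporary file and a final rename."""
    path = Path(path)
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    with open(tmp_path, "wb") as f:
        pickle.dump(graph, f, protocol=pickle.HIGHEST_PROTOCOL)
    # Replace the target only after the dump finished without error.
    tmp_path.replace(path)


def load_graph(path):
    """Load a pickled graph. Only use on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip on a toy dictionary inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "graph.pkl"
    save_graph({"function_nodes": [1, 2, 3]}, target)
    restored = load_graph(target)
```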

### 3.2 Save it to a Neo4j database

The result can be saved to a Neo4j database by calling the `store_knowledge_graph_in_neo4j` method. Parameters:
- *URI*: URI of the Neo4j database.
- *user*: Username for the Neo4j database.
- *password*: Password for the Neo4j database.
- *knowledge_graph*: The knowledge graph to save.

If the *URI*, *user* and *password* parameters are passed to the `build_knowledge_graph` method, this function is called automatically and the graph is saved to Neo4j.

In [None]:
kgb.store_knowledge_graph_in_neo4j(
    URI="neo4j://127.0.0.1:7687",
    user="neo4j",
    password="password",
    knowledge_graph=repograph
)
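Hard-coded credentials tend to end up in shared notebooks. As a minimal sketch, the connection settings could be collected from environment variables instead. The variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `neo4j_settings` are assumptions; the dictionary keys match the keyword arguments used in the cell above.

```python
import os


def neo4j_settings(env=None):
    """Collect Neo4j connection settings from environment variables.

    Hypothetical helper. The password has no default, so a missing
    variable raises KeyError instead of silently using a guess.
    """
    if env is None:
        env = os.environ
    return {
        "URI": env.get("NEO4J_URI", "neo4j://127.0.0.1:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env["NEO4J_PASSWORD"],
    }


# Demonstration with an explicit mapping instead of the real environment.
settings = neo4j_settings({"NEO4J_PASSWORD": "secret"})

# Possible usage with the builder from this notebook:
# kgb.store_knowledge_graph_in_neo4j(knowledge_graph=repograph, **neo4j_settings())
```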