<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Discover-Thoth's-Graph-Structure" data-toc-modified-id="Discover-Thoth's-Graph-Structure-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Discover Thoth's Graph Structure</a></span><ul class="toc-item"><li><span><a href="#Connect-to-JanusGraph-Instance" data-toc-modified-id="Connect-to-JanusGraph-Instance-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Connect to JanusGraph Instance</a></span></li><li><span><a href="#Vertex-and-Edge-Labels" data-toc-modified-id="Vertex-and-Edge-Labels-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Vertex and Edge Labels</a></span></li><li><span><a href="#Vertex-and-Edge-Instances" data-toc-modified-id="Vertex-and-Edge-Instances-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Vertex and Edge Instances</a></span></li></ul></li><li><span><a href="#Discover-the-packages-inside-Thoth" data-toc-modified-id="Discover-the-packages-inside-Thoth-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Discover the packages inside Thoth</a></span></li><li><span><a href="#Select-one-package-and-discover-more-about-it" data-toc-modified-id="Select-one-package-and-discover-more-about-it-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Select one package and discover more about it</a></span></li></ul></div>

# Discover Thoth's Graph Structure

This notebook is addressed to users and developers of Thoth who want to explore the content of Thoth's graph database.

First, it is important to look at the schema of Thoth's graph database.

![Thoth Graph Database](https://raw.githubusercontent.com/thoth-station/storages/master/docs/schema.png)

## Connect to JanusGraph Instance

To discover what is inside the graph database, we first need to connect to a JanusGraph instance.

In [1]:
from thoth.storages.graph import GraphDatabase
from thoth.lab import GraphQueryResult as gqr

graph_db = GraphDatabase.create('janusgraph.test.thoth-station.ninja', port=8182)
graph_db.connect()
g = graph_db.g   # We will use raw Gremlin traversal in examples.

Next, we import the objects that will be used in the rest of the notebook.

In [2]:
import pandas as pd
from pprint import pprint
from thoth.solver import pip_compile
from thoth.storages.graph.models import ALL_MODELS
from gremlin_python.process.graph_traversal import has
from gremlin_python.process.traversal import Operator
from gremlin_python.process.traversal import Pop
from gremlin_python.process.traversal import not_
from gremlin_python.process.traversal import P
from gremlin_python.process.graph_traversal import identity
from gremlin_python.process.graph_traversal import path
from gremlin_python.process.graph_traversal import outE
from gremlin_python.process.graph_traversal import out
from gremlin_python.process.graph_traversal import inE
from gremlin_python.process.graph_traversal import inV
from gremlin_python.process.graph_traversal import select
from gremlin_python.process.graph_traversal import values
from gremlin_python.process.graph_traversal import fold
from gremlin_python.process.graph_traversal import constant
from gremlin_python.process.graph_traversal import project

## Vertex and Edge Labels

List all the available vertex labels in the graph database

In [3]:
# Collect the labels of all vertex models
vertex_labels = [element.__label__ for element in ALL_MODELS if element.__type__ == "vertex"]
        
# Create the pandas DataFrame 
df = pd.DataFrame(vertex_labels, columns = ['Vertex'])
df

Unnamed: 0,Vertex
0,cve
1,inspection_software_stack
2,software_stack_observation
3,adviser_software_stack
4,rpm_requirement
5,buildtime_environment
6,python_artifact
7,python_package_index
8,deb_package_version
9,user_software_stack


List all the available edge labels in the graph database

In [4]:
# Collect the labels of all edge models
edge_labels = [element.__label__ for element in ALL_MODELS if element.__type__ == "edge"]
        
# Create the pandas DataFrame 
df = pd.DataFrame(edge_labels, columns = ['Edge'])
df

Unnamed: 0,Edge
0,runs_in
1,has_artifact
2,observed
3,deb_pre_depends
4,requires
5,builds_on
6,solved
7,depends_on
8,runs_on
9,has_version


## Vertex and Edge Instances

Let's get an idea of the size of Thoth's graph.

In [5]:
print(f"Number of vertex instances in the graph database: {gqr(g.V().count().next()).result:d}")
print(f"Number of edge instances in the graph database: {gqr(g.E().count().next()).result:d}")

Number of vertex instances in the graph database: 39865
Number of edge instances in the graph database: 637271


Let's see which vertex label has the most instances.

In [32]:
vertices_counts = graph_db.


Number of vertex instances present in the graph database (sum): 39864
Number of vertex instances present in the graph database: 39865
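
The cell above is truncated in this export, so the original call is not known. By analogy with the edge cell further down, the sketch below shows one way the `list_vertices_counts` list used in the next cell could be built from a label-to-count mapping. It also lists labels that appear in the database but not among the models, which is one possible (unconfirmed) explanation for the two totals above differing by one. The sample data is made up for illustration.

```python
def counts_per_label(known_labels, counts):
    """Pair every known label with its count, defaulting to 0 when absent."""
    return [[label, counts.get(label, 0)] for label in known_labels]


def unknown_labels(known_labels, counts):
    """Return labels present in the database counts but missing from the models."""
    return sorted(set(counts) - set(known_labels))


# Made-up sample data. In the notebook, `counts` would come from a query such as
# g.V().groupCount().by("__label__") and `known_labels` from ALL_MODELS.
known_labels = ["package", "cve", "runtime_environment"]
counts = {"package": 3, "cve": 2, "legacy_label": 1}

list_vertices_counts = counts_per_label(known_labels, counts)
summed = sum(count for _, count in list_vertices_counts)
total = sum(counts.values())
print(list_vertices_counts, summed, total, unknown_labels(known_labels, counts))
```

In this sample, the sum over known labels is one less than the total because one vertex carries a label that no model declares.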


In [33]:
# Show the number of instances for each vertex label
df = pd.DataFrame(list_vertices_counts, columns = ['Vertex', 'N. Instances']).sort_values(by='N. Instances',ascending=False)
df

Unnamed: 0,Vertex,N. Instances
0,python_package_version,22015
8,python_artifact,15407
2,package,946
3,cve,796
9,rpm_requirement,513
1,rpm_package_version,174
10,python_package_index,7
16,ecosystem_solver,4
6,buildtime_environment,1
13,runtime_environment,1


Let's see which edge label has the most instances.

In [9]:
# List of edge labels
edge_labels = [element.__label__ for element in ALL_MODELS if element.__type__ == "edge"]

# Dict of edge labels and counts 
edges_number = gqr(g.E().has("__type__", "edge").groupCount().by("__label__").next()).result

# Pair every known edge label with its count (0 when the label has no instances)
list_edges_counts = [[edge, edges_number.get(edge, 0)] for edge in edge_labels]

print(f"\nNumber of edge instances present in the graph database (sum): {sum(edge_c[1] for edge_c in list_edges_counts)}")
print(f"Number of edge instances present in the graph database: {gqr(graph_db.g.E().count().next()).result:d}")


Number of edge instances present in the graph database (sum): 637271
Number of edge instances present in the graph database: 637271


In [12]:
# Show the number of instances for each edge label
df = pd.DataFrame(list_edges_counts, columns = ['Edge', 'N. Instances']).sort_values(by='N. Instances',ascending=False)
df

Unnamed: 0,Edge,N. Instances
6,solved,405001
7,depends_on,156006
15,has_vulnerability,34810
9,has_version,22283
1,has_artifact,15557
4,requires,3350
11,is_part_of,264
0,runs_in,0
2,observed,0
3,deb_pre_depends,0


Let's look at the relations between vertices that are currently instantiated.

In [96]:
vertex_edge_vertex = gqr(g.V()
                         .outE()
                         .otherV()
                         .groupCount()
                         .by(path().by("__label__"))
                         .next()).result

vertex_edge_vertex_list = []
for triple_string, counts in vertex_edge_vertex.items():
    triple = triple_string.lstrip("[").rstrip("]").replace(" ","").split(",")
    vertex_edge_vertex_list.append([str(triple[0]), str(triple[1]), str(triple[2]), counts])
vertex_edge_vertex_list.sort(key=lambda x: x[0])
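
The string handling above relies on each key being rendered as `[source, edge, target]`, with no commas inside a label. As a minimal sketch under that assumption, the parsing can be isolated in a helper that also fails loudly when a key does not have exactly three parts. The helper name `parse_path_key` is introduced here for illustration only.

```python
def parse_path_key(key):
    """Split a "[source, edge, target]" path key into its three labels."""
    labels = [part.strip() for part in key.strip().strip("[]").split(",")]
    if len(labels) != 3:
        raise ValueError(f"expected 3 labels, got {len(labels)}: {key!r}")
    return tuple(labels)


# Assumed key shape, inferred from the bracket and comma handling in the cell above.
sample = {"[package, has_version, python_package_version]": 22016}
rows = sorted([*parse_path_key(key), count] for key, count in sample.items())
print(rows)
```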

In [97]:
df = pd.DataFrame(vertex_edge_vertex_list, columns = ['SOURCE VERTEX', 'EDGE', 'TARGET VERTEX', 'COUNTS'])
df

Unnamed: 0,SOURCE VERTEX,EDGE,TARGET VERTEX,COUNTS
0,ecosystem_solver,solved,python_package_version,405001
1,package,has_version,rpm_package_version,267
2,package,has_version,python_package_version,22016
3,python_package_version,depends_on,python_package_version,156006
4,python_package_version,has_vulnerability,cve,34810
5,python_package_version,is_part_of,buildtime_environment,1
6,python_package_version,has_artifact,python_artifact,15557
7,rpm_package_version,requires,rpm_requirement,3350
8,rpm_package_version,is_part_of,buildtime_environment,174
9,rpm_package_version,is_part_of,runtime_environment,89


# Discover the packages inside Thoth

Check the allowed sources (package indexes) for the packages inside Thoth's database.

In [6]:
known_thoth_url = graph_db.get_python_package_index_urls()
df = pd.DataFrame(known_thoth_url, columns = ['URL'])
pd.set_option('max_colwidth', 800)
df

Unnamed: 0,URL
0,https://pypi.org/simple
1,https://tensorflow.pypi.thoth-station.ninja/index/rhel7.5/jemalloc/simple
2,https://tensorflow.pypi.thoth-station.ninja/index/fedora26/jemalloc/simple
3,https://tensorflow.pypi.thoth-station.ninja/index/fedora28/jemalloc/simple
4,https://tensorflow.pypi.thoth-station.ninja/index/fedora27/jemalloc/simple
5,https://tensorflow.pypi.thoth-station.ninja/index/centos7/jemalloc/simple
6,https://tensorflow.pypi.thoth-station.ninja/index/rhel7.5/cuda9.2+jemalloc/simple


Check how many packages there are for each index URL.

In [3]:
n_packages_per_index = gqr(g.V()
                         .has("index_url")
                         .groupCount()
                         .by("index_url")
                         .next()).result

n_packages_per_index_list = [[url_index, n_packages] for url_index, n_packages in n_packages_per_index.items()]
df = pd.DataFrame(n_packages_per_index_list, columns = ['URL', 'N. of Packages'])
pd.set_option('max_colwidth', 800)
df

Unnamed: 0,URL,N. of Packages
0,https://tensorflow.pypi.thoth-station.ninja/index/rhel7.5/jemalloc/simple,601
1,https://pypi.org/simple,18398
2,https://tensorflow.pypi.thoth-station.ninja/index/fedora28/jemalloc/simple,605
3,https://tensorflow.pypi.thoth-station.ninja/index/rhel7.5/cuda9.2+jemalloc/simple,601
4,https://tensorflow.pypi.thoth-station.ninja/index/fedora27/jemalloc/simple,604
5,https://tensorflow.pypi.thoth-station.ninja/index/centos7/jemalloc/simple,605
6,https://tensorflow.pypi.thoth-station.ninja/index/fedora26/jemalloc/simple,600


We can also list the packages available on a given index.

In [4]:
graph_db.get_python_packages_for_index("https://tensorflow.pypi.thoth-station.ninja/index/rhel7.5/jemalloc/simple")

{'aiohttp',
 'astropy',
 'awscli',
 'beaker',
 'bigchaindb-driver',
 'bise-theme',
 'block-io',
 'bodhi',
 'cloudinary',
 'collective-tablepage',
 'collective-xmpp-chat',
 'cryptography',
 'deis',
 'django',
 'django-anymail',
 'django-autocomplete-light',
 'django-awl',
 'django-ca',
 'django-fiber',
 'django-jet',
 'django-safedelete',
 'django-select2',
 'django-session-security',
 'django-social-auth',
 'djangorestframework-simplejwt',
 'djblets',
 'drf-tracking',
 'dulwich',
 'fedmsg',
 'flask-micropub',
 'foolscap',
 'ftw-dashboard-portlets-postit',
 'futoin-cid',
 'homeassistant',
 'indico',
 'ipython',
 'jinja2',
 'ldap3',
 'luigi',
 'lxml',
 'mako',
 'markdown2',
 'mercurial',
 'mistune',
 'mitmproxy',
 'mockup',
 'mollie-api-python',
 'morepath',
 'mysql-connector-python',
 'newrelic',
 'notable',
 'oauth2',
 'oauthlib',
 'oci',
 'onegov-form',
 'paste',
 'pastescript',
 'peewee',
 'phileo',
 'pip',
 'plone-app-content',
 'plone-app-contenttypes',
 'plone-app-discussion',
 'p

In [5]:
# Extract all packages
all_packages = gqr(
    g.V()
    .has('__label__', 'package')
    .order().by('package_name')
    .project('package').by('package_name')
    .toList()
).result

Let's take a closer look at which Python packages are inside Thoth.

In [9]:
# Extract Packages for selected letter

# Select a letter
letter = 'g'

packages_list = [package['package'] for package in all_packages if package['package'].startswith(letter)]
print(f"The number of packages for letter {letter} is: {len(packages_list)}\n")

The number of packages for letter g is: 16
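
Filtering the full list works well for a single letter. To browse several letters, one pass can bucket the names by their initial instead. The following is a minimal sketch; `group_by_initial` is a hypothetical helper, and the sample names are a handful taken from the outputs in this notebook.

```python
from collections import defaultdict


def group_by_initial(names):
    """Map each lowercase first character to the sorted names starting with it."""
    groups = defaultdict(list)
    for name in names:
        if name:  # skip empty names, which have no first character
            groups[name[0].lower()].append(name)
    return {initial: sorted(group) for initial, group in groups.items()}


# A handful of package names taken from the outputs in this notebook.
sample_names = ["gevent", "django", "gast", "aiohttp", "girder"]
by_initial = group_by_initial(sample_names)
print(by_initial.get("g", []))
```

In the notebook, the input would be `[package['package'] for package in all_packages]`.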



In [10]:
# Visualize packages for selected letter
df = pd.DataFrame(packages_list, columns = [letter])
df

Unnamed: 0,g
0,gandi-cli
1,gast
2,genshi
3,gevent
4,geventhttpclient
5,girder
6,gitlab-languages
7,gns3-gui
8,go-http
9,goblin


For the packages considered, let's see how many versions are available

In [None]:
%%time
# Count all versions (Python and RPM) for the packages considered for the selected letter
package_versions_results = []

for package in packages_list:
    
    n_python_package_versions = gqr(g.V()
                              .has("__label__", "package")
                              .has("package_name", package)
                              .outE()
                              .has("__label__","has_version")
                              .inV()
                              .has('__label__', 'python_package_version')
                              .count()
                              .next()
                             ).result
    
    n_rpm_package_versions = gqr(g.V()
                          .has("__label__", "package")
                          .has("package_name", package)
                          .outE()
                          .has("__label__","has_version")
                          .inV()
                          .has('__label__', 'rpm_package_version')
                          .count()
                          .next()
                         ).result
    
    package_versions_results.append([package,
                                 n_python_package_versions,
                                 n_rpm_package_versions,
                                 n_python_package_versions +  n_rpm_package_versions])
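
The loop above sends two queries per package. If that becomes slow, one alternative is to fetch `(package name, version vertex label)` pairs once and tally them locally. The sketch below covers only the local tally: how the pairs are fetched is left open, `tally_versions` is a hypothetical helper, and the input pairs are illustrative values rather than data from the graph.

```python
from collections import Counter


def tally_versions(pairs, labels=("python_package_version", "rpm_package_version")):
    """Build rows of [package, count per label..., total] from (package, label) pairs."""
    pairs = list(pairs)
    counts = Counter(pairs)
    rows = []
    for package in sorted({package for package, _ in pairs}):
        per_label = [counts[(package, label)] for label in labels]
        rows.append([package, *per_label, sum(per_label)])
    return rows


# Illustrative input only; these pairs are not taken from the graph database.
example_pairs = [
    ("gast", "python_package_version"),
    ("gast", "python_package_version"),
    ("go-http", "python_package_version"),
    ("go-http", "rpm_package_version"),
]
version_rows = tally_versions(example_pairs)
print(version_rows)
```

The rows have the same column order as `package_versions_results`, so they could be passed to the same `pd.DataFrame` call.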
    

In [18]:
# Visualize the version counts for the packages of the selected letter
df = pd.DataFrame(package_versions_results, columns = ['package_name', 'n_python_package_version',
                                                      'n_rpm_package_version', 'total_package_versions'])
df

Unnamed: 0,package_name,n_python_package_version,n_rpm_package_version,total_package_versions
0,gandi-cli,3,0,3
1,gast,5,0,5
2,genshi,15,0,15
3,gevent,18,0,18
4,geventhttpclient,2,0,2
5,girder,49,0,49
6,gitlab-languages,12,0,12
7,gns3-gui,51,0,51
8,go-http,1,0,1
9,google-appengine,1,0,1


# Select one package and discover more about it

In [14]:
# Select the package
name_of_package = 'tensorflow'

We retrieve all the versions available in Thoth's graph from any ecosystem for the selected package

In [4]:
%%time

gqr(
    g.V()
    .has('package_name', name_of_package)
    .outE().has('__label__', 'has_version')
    .inV()
    .order().by('package_version')
    .project('package', 'version', 'ecosystem','index_url')
    .by('package_name').by('package_version').by('ecosystem').by('index_url')
    .toList()
).to_dataframe()

CPU times: user 18.4 ms, sys: 1.85 ms, total: 20.2 ms
Wall time: 4.75 s


Unnamed: 0,ecosystem,index_url,package,version
0,pypi,https://pypi.org/simple,tensorflow,0.12.0
1,pypi,https://pypi.org/simple,tensorflow,0.12.0rc0
2,pypi,https://pypi.org/simple,tensorflow,0.12.0rc1
3,pypi,https://pypi.org/simple,tensorflow,0.12.1
4,pypi,https://pypi.org/simple,tensorflow,1.0.0
5,pypi,https://pypi.org/simple,tensorflow,1.0.1
6,pypi,https://pypi.org/simple,tensorflow,1.1.0
7,pypi,https://pypi.org/simple,tensorflow,1.1.0rc0
8,pypi,https://pypi.org/simple,tensorflow,1.1.0rc1
9,pypi,https://pypi.org/simple,tensorflow,1.1.0rc2


Get all the direct dependencies of the selected package, regardless of the version.

In [5]:
gqr(
    g.V()
    .has('__label__', 'python_package_version')
    .has('package_name', name_of_package)
    .outE().has('__label__', 'depends_on')
    .inV()
    .dedup()
    .group().by('package_name').by('package_version')
    .toList()
).to_dataframe()

Unnamed: 0,absl-py,astor,gast,google-pasta,grpcio,keras-applications,keras-preprocessing,numpy,protobuf,setuptools,six,tb-nightly,tensorboard,tensorflow-estimator,termcolor,tf-estimator-nightly,wheel
0,"[0.1.13, 0.2.2, 0.1.7, 0.5.0, 0.1.8, 0.1.9, 0....","[0.6, 0.7.0, 0.7.1, 0.6.2, 0.6.1]","[0.2.1.post1, 0.2.1.post0, 0.2.1, 0.2.0, 0.2.2]","[0.1.2, 0.1.3, 0.1.4]","[1.10.0rc1, 1.10.0, 1.11.0rc1, 1.11.1, 1.13.0,...","[1.0.6, 1.0.7, 1.0.5]","[1.0.5, 1.0.9, 1.0.6, 1.0.8, 1.0.3, 1.0.4]","[1.14.1, 1.14.3, 1.15.4, 1.16.0, 1.16.2, 1.15....","[3.7.0rc3, 3.7.0, 3.6.1, 3.7.0rc2, 3.6.0, 3.5....","[20.6.6, 8.2, 18.3.1, 18.0.1, 18.5, 18.7.1, 18...","[1.10.0, 1.11.0, 1.12.0]","[1.5.0a20171209, 1.5.0a20171210, 1.5.0a2017121...","[1.12.0, 1.0.0a2, 1.6.0rc0, 1.10.0, 1.0.0a5, 1...","[1.10.12, 1.13.0rc0, 1.13.0, 1.10.7, 1.10.8, 1...",[1.1.0],"[1.12.0.dev20181124, 1.12.0.dev20181204, 1.12....","[0.26.0, 0.32.3, 0.33.1, 0.30.0a0, 0.32.2, 0.3..."
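
The single wide row above is hard to scan. Assuming the grouped result is a mapping from dependency name to a list of versions (which is what the one-row output suggests), a short summary of how many versions each dependency has can be easier to read. `dependency_summary` is a hypothetical helper, and the sample entries are copied from the columns of the output above that are not truncated.

```python
def dependency_summary(grouped):
    """Return (dependency, number of distinct versions) pairs, most versions first."""
    summary = [(name, len(set(versions))) for name, versions in grouped.items()]
    return sorted(summary, key=lambda item: (-item[1], item[0]))


# Entries copied from the untruncated columns of the output above.
grouped = {
    "termcolor": ["1.1.0"],
    "six": ["1.10.0", "1.11.0", "1.12.0"],
    "google-pasta": ["0.1.2", "0.1.3", "0.1.4"],
}
dependency_counts = dependency_summary(grouped)
print(dependency_counts)
```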


Check whether any version of the selected package is marked as unsolvable, i.e. has solver_error = TRUE.

In [13]:
# Extract all packages with solver error TRUE and unsolvable
unsolvable_pypy_package = graph_db.retrieve_unsolvable_pypi_packages()

In [21]:
if unsolvable_pypy_package.get(name_of_package):
    print(unsolvable_pypy_package.get(name_of_package))
else:
    print(f'No version of package "{name_of_package}" which has solved_error=TRUE')

No version of package "tensorflow" which has solved_error=TRUE
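
The lookup above can be wrapped in a small helper. This sketch assumes that the retrieved object is a mapping from package name to a collection of version strings, which the notebook does not show explicitly. `report_unsolvable` and the sample mapping are hypothetical.

```python
def report_unsolvable(unsolvable, package_name):
    """Describe the unsolvable versions recorded for a package, if any."""
    versions = unsolvable.get(package_name)
    if versions:
        return f'Unsolvable versions of "{package_name}": {sorted(versions)}'
    return f'No version of package "{package_name}" which has solved_error=TRUE'


# Hypothetical mapping; the real one would come from the graph database.
sample_unsolvable = {"some-package": ["2.0.0", "1.0.0"]}
print(report_unsolvable(sample_unsolvable, "tensorflow"))
print(report_unsolvable(sample_unsolvable, "some-package"))
```

Note that `dict.get` takes the key itself; passing a list such as `[name_of_package]` raises `TypeError` because lists are not hashable.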
