![Egeria Logo](https://raw.githubusercontent.com/odpi/egeria/main/assets/img/ODPi_Egeria_Logo_color.png)

### Performance Suite Results
# Crux Plugin Repository Connector

## Introduction

Following are detailed results for the Crux Connector's performance at various scales. In each set of results, the Crux Connector was tested under the following conditions (full details are available in each results directory's `deployment` file):

- Running through a Kubernetes pod with a single OMAG Platform running both the Performance Test Suite server and the Crux Plugin Repository
- Running a co-located Bitnami Kafka (and Zookeeper) pod on the same Kubernetes node running the OMAG Platform
- Resources allocated are a minimum of 2 cores to a maximum of 4 cores, and a minimum of 8GB memory to a maximum of 16GB memory

### Versions

Component | Version | Notes
---|---|---
Egeria | 2.10 | OMAG Platform, CTS, PTS
Crux Plugin Repository Connector | 2.10 |
Crux | 21.05-1.17.0-beta | Embedded in Crux Plugin Repository Connector
RocksDB | 6.15.2 | Used for transaction log, document store and index store for Crux
Kafka | 2.3.1 | Used for cohort event bus
Lucene | 8.8.2 | Used for text indexing in Crux

## Setup

### Results locations

Locations for the results (see subdirectories in the same location where this notebook resides to review the raw results themselves):

In [None]:
results = [
    "pts-05-02",
    "pts-10-05",
    "janus-05-02"
]

### Analysis and visualization methods

The following defines methods necessary to parse, process and visualize the results, and must be run prior to the subsequent cells.

In [None]:
import os
import json
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display

def validateProfileResultsLocation(location):
    profile_details_location = location + os.path.sep + "profile-details"
    print("Validating profile-details location:", profile_details_location)
    if os.path.isdir(profile_details_location):
        print(" ... directory exists.")
    else:
        print(" ... ERROR: could not find this directory. Is the location specified correct?")

# Define the profile ordering
profile_order=[
    'Entity creation', 'Entity search', 'Relationship creation', 'Relationship search',
    'Entity classification', 'Classification search', 'Entity update', 'Relationship update',
    'Classification update', 'Entity undo', 'Relationship undo', 'Entity retrieval', 'Entity history retrieval',
    'Relationship retrieval', 'Relationship history retrieval', 'Entity history search', 'Relationship history search',
    'Graph queries', 'Graph history queries', 'Entity re-home', 'Relationship re-home', 'Entity declassify',
    'Entity re-type', 'Relationship re-type', 'Entity re-identify', 'Relationship re-identify',
    'Relationship delete', 'Entity delete', 'Entity restore', 'Relationship restore', 'Relationship purge',
    'Entity purge'
]

# Given a profileResult.requirementResults object, parse all of its positiveTestEvidence
# for discovered properties
def parseProperties(df, repositoryName, requirementResults):
    if (requirementResults is not None and 'positiveTestEvidence' in requirementResults):
        print("Parsing properties for:", requirementResults['name'], "(" + repositoryName + ")")
        data_array = []
        for evidence in requirementResults['positiveTestEvidence']:
            if ('propertyName' in evidence and 'propertyValue' in evidence):
                data = {
                    'repo': repositoryName,
                    'property_name': evidence['propertyName'],
                    'property_value': evidence['propertyValue']
                }
                data_array.append(data)
        # DataFrame.append was removed in pandas 2.0, so concatenate the new rows instead
        df = pd.concat([df, pd.DataFrame(data_array)], ignore_index=True)
    return df

# Given a profileResult.requirementResults object, parse all of its positiveTestEvidence
# and group the results by methodName
def parseEvidence(df, repositoryName, requirementResults):
    if (requirementResults is not None and 'positiveTestEvidence' in requirementResults):
        print("Parsing evidence for:", requirementResults['name'], "(" + repositoryName + ")")
        data_array = []
        for evidence in requirementResults['positiveTestEvidence']:
            if ('methodName' in evidence and 'elapsedTime' in evidence):
                data = {
                    'repo': repositoryName,
                    'method_name': evidence['methodName'],
                    'elapsed_time': evidence['elapsedTime'],
                    'profile_name': requirementResults['name'],
                    'test_case_id': evidence['testCaseId'],
                    'assertion_id': evidence['assertionId']
                }
                data_array.append(data)
        # DataFrame.append was removed in pandas 2.0, so concatenate the new rows instead
        df = pd.concat([df, pd.DataFrame(data_array)], ignore_index=True)
    return df

# Given a profile detail JSON file, retrieve all of its profileResult.requirementResults[] objects
def parseRequirementResults(profileFile):
    with open(profileFile) as f:
        profile = json.load(f)
    # This first case covers files retrieved via API
    if ('profileResult' in profile and 'requirementResults' in profile['profileResult']):
        return profile['profileResult']['requirementResults']
    # This second case covers files created by the CLI client
    elif ('requirementResults' in profile):
        return profile['requirementResults']
    else:
        return None

def getEnvironmentProfile(profileLocation):
    detailsLocation = profileLocation + os.path.sep + "profile-details"
    return detailsLocation + os.path.sep + "Environment.json"

def parseEnvironmentDetailsIntoDF(df, profileFile, qualifier):
    profileResults = parseRequirementResults(profileFile)
    if profileResults is not None:
        for result in profileResults:
            df = parseProperties(df, qualifier, result)
    return df

# Retrieve a listing of all of the profile detail JSON files
def getAllProfiles(profileLocation):
    detailsLocation = profileLocation + os.path.sep + "profile-details"
    _, _, filenames = next(os.walk(detailsLocation))
    full_filenames = []
    for filename in filenames:
        full_filenames.append(detailsLocation + os.path.sep + filename)
    return full_filenames

# Parse all of the provided profile file's details into the provided dataframe
def parseProfileDetailsIntoDF(df, profileFile, qualifier):
    profileResults = parseRequirementResults(profileFile)
    if profileResults is not None:
        for result in profileResults:
            df = parseEvidence(df, qualifier, result)
    return df

def plotMethod(df, methodName, remove_outliers=False, by_repo=False, by_assertion=False):
    dfX = df[df['method_name'] == methodName]
    if not dfX.empty:
        if remove_outliers:
            dfX = dfX[dfX['elapsed_time'].between(dfX['elapsed_time'].quantile(.00), dfX['elapsed_time'].quantile(.99))]
        sns.set(font_scale=1.2)
        sns.set_style("whitegrid")
        fix, axs = plt.subplots(ncols=1, nrows=1, figsize=(18,9))
        if by_repo:
            # Display the repos within the method in alphabetical order for consistency
            repos = dfX['repo'].unique()
            figure = sns.histplot(ax=axs, data=dfX, x="elapsed_time", hue="repo",
                                  hue_order=sorted(repos), kde=True, discrete=False)
        elif by_assertion:
            # Display the assertions within the method in alphabetical order for consistency
            assertions = dfX['assertion_id'].unique()
            figure = sns.histplot(ax=axs, data=dfX, x="elapsed_time", hue="assertion_id",
                                  hue_order=sorted(assertions), kde=True, discrete=False)
        else:
            figure = sns.histplot(ax=axs, data=dfX, x="elapsed_time",
                                  kde=True, discrete=False)
        figure.set(xlabel="Elapsed time (ms)")
        figure.set_title(methodName)
        display(fix)
        plt.close(fix)

def plotProfile(df, profileName, remove_outliers=False):
    dfX = df[df['profile_name'] == profileName]
    # Only attempt to plot if there is anything left in the dataframe
    if not dfX.empty:
        if remove_outliers:
            # If we have been asked to remove outliers, drop anything outside the 2nd and 98th percentiles
            dfX = dfX[dfX['elapsed_time'].between(dfX['elapsed_time'].quantile(.02), dfX['elapsed_time'].quantile(.98))]
        sns.set(font_scale=1.2)
        sns.set_style("whitegrid")
        fix, axs = plt.subplots(ncols=1, nrows=1, figsize=(18,9))
        # Display the methods within the profile in alphabetical order for consistency
        methods = dfX['method_name'].unique()
        figure = sns.histplot(ax=axs, data=dfX, x="elapsed_time", hue="method_name",
                              hue_order=sorted(methods), kde=True, discrete=False)
        figure.set(xlabel="Elapsed time (ms)")
        figure.set_title(profileName)
        figure.get_legend().set(title='Method')
        display(fix)
        plt.close(fix)

def slowestRunning(df, num=10, methodName=None):
    pd.set_option('display.max_colwidth', None)
    pd.set_option('display.max_rows', None)
    if methodName:
        df = df[df['method_name'] == methodName]
    display(df.sort_values(by=['elapsed_time'], ascending=False).groupby('method_name').head(num))

def compareProfiles(df, profileName, left, right, remove_outliers=False):
    dfX = df[df['profile_name'] == profileName]
    # Only attempt to plot if there is anything left in the dataframe
    if not dfX.empty:
        if remove_outliers:
            # If we have been asked to remove outliers, drop anything outside the 2nd and 98th percentiles
            dfX = dfX[dfX['elapsed_time'].between(dfX['elapsed_time'].quantile(.02), dfX['elapsed_time'].quantile(.98))]
        sns.set(font_scale=1.2)
        sns.set_style("whitegrid")
        fix, axs = plt.subplots(ncols=1, nrows=1, figsize=(18,9))
        # Display the methods within the profile in alphabetical order for consistency
        methods = dfX['method_name'].unique()
        figure = sns.violinplot(x="method_name", y="elapsed_time", ax=axs, hue="repo",
                                hue_order=[left, right], split=True, scale='count',
                                inner='quartile', cut=0, data=dfX)
        # If there are more than 4 methods in the profile, rotate them so they are still readable
        if (len(methods) > 4):
            figure.set_xticklabels(figure.get_xticklabels(), rotation=10)
        figure.set(xlabel="Method name", ylabel="Elapsed time (ms)")
        figure.set_title(profileName + ' comparison')
        figure.get_legend().set(title='Test')
        display(fix)
        plt.close(fix)

# The results

## instancesPerType=5, maxSearchResults=2

In [None]:
results0 = results[0]

validateProfileResultsLocation(results0)
files = getAllProfiles(results0)

df1 = pd.DataFrame({'repo': [], 'method_name': [], 'elapsed_time': [], 'profile_name': [], 'test_case_id': [], 'assertion_id': []})
dfEnv = None

for profile_file in files:
    df1 = parseProfileDetailsIntoDF(df1, profile_file, results0)

### Environment details

In [None]:
results0_env = getEnvironmentProfile(results0)
env0 = pd.DataFrame({'repo': [], 'property_name': [], 'property_value': []})
env0 = parseEnvironmentDetailsIntoDF(env0, results0_env, results0)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
display(env0)

### Full response time profiles

The following plots the response times of each method within each profile in full, including any extreme / outlying values. (As this is rendering 30+ detailed visualizations it may take a little time to complete.)

From these visualizations, we can quickly see the range of response times for a given method and where the values are more typical (high peaks) than not (long tails). This allows us to quickly assess two important areas:

1. Any methods that appear to consistently run for a longer time than we may want or expect.
1. Any particular combination of parameters that may cause a method that in most cases runs quickly to in certain cases run particularly slowly.
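
The same distinction between peaks and tails can also be put into numbers. The following is a minimal sketch on synthetic data, not part of the original analysis: the helper `tail_summary` and the method names `steady` and `long_tail` are hypothetical, and only the `method_name` and `elapsed_time` columns are taken from the dataframes used in this notebook.

```python
import pandas as pd

def tail_summary(df):
    """Per method: the typical (median) time versus the tail (99th percentile, max)."""
    grouped = df.groupby("method_name")["elapsed_time"]
    summary = pd.DataFrame({
        "median": grouped.median(),
        "p99": grouped.quantile(0.99),
        "max": grouped.max(),
    })
    # A large ratio means a long tail relative to the typical response time
    summary["max_over_median"] = summary["max"] / summary["median"]
    return summary

# Synthetic timings (ms): one method is uniformly fast, the other has one slow call
sample = pd.DataFrame({
    "method_name": ["steady"] * 20 + ["long_tail"] * 20,
    "elapsed_time": [10] * 20 + [10] * 19 + [2000],
})
tails = tail_summary(sample)
print(tails)
```

A method that is consistently slow would show a high median, whereas a method that is only occasionally slow would show a low median with a high `max_over_median`.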

In [None]:
for profile in profile_order:
    plotProfile(df1, profile)

#### Analysis

From the plots above we can see that most methods have very high peaks towards the left of the graph, indicating that the vast majority of the executions of that method have response times in that range. However, there are a number of cases where various methods run for much longer than this usual response time (even up to several seconds).

To see whether these are rare outliers, we can re-plot the profiles, this time ignoring the fastest and slowest 2% of the response times. Stated differently, this will show the response times of the middle 96% of the method calls: if there is a consistently slow combination of parameters, we would expect it to still show up within this cut-off in these plots.
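
The percentile cut-off can be sketched on synthetic data. This is only an illustration: the helper `trim_by_quantile` is hypothetical and the timings are made up. It uses the same `between(...quantile(...))` pattern that `plotProfile` and `plotMethod` apply, with the bounds left as parameters.

```python
import pandas as pd

def trim_by_quantile(df, column="elapsed_time", lower=0.0, upper=0.99):
    """Keep only rows whose `column` value lies between the given quantiles (inclusive)."""
    low_value = df[column].quantile(lower)
    high_value = df[column].quantile(upper)
    return df[df[column].between(low_value, high_value)]

# Synthetic response times (ms): 99 fast calls and a single very slow one
sample = pd.DataFrame({"elapsed_time": [10 + (i % 5) for i in range(99)] + [4000]})
trimmed = trim_by_quantile(sample, upper=0.99)

print("rows before:", len(sample), "rows after:", len(trimmed))
print("max before:", sample["elapsed_time"].max(), "max after:", trimmed["elapsed_time"].max())
```

A single extreme value is dropped by the cut-off, while a slow case that recurs often enough to sit inside the retained range would remain visible.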

### "Typical" response time profiles

The following plots the response times of each method within each profile focusing only on the typical values -- specifically removing any outliers within the top and bottom 2% of the response times. From these visualizations, we can quickly see the "typical" response times for a given method, keeping in mind that we are ignoring the outlying extreme values here.

In [None]:
for profile in profile_order:
    plotProfile(df1, profile, remove_outliers=True)

Without the outliers, we can more clearly see the typical distribution of each method's response times, and that in most cases (the middle 96% of each method's executions) the response times are sub-second (in most cases even less than 250ms).

We can also see, however, that there are a few exceptions to this -- the various graph queries all have very long tails, which suggests a number of very long-running calls. In addition, various write operations have long tails that occur relatively infrequently but nonetheless extend to around 1 second within this range.

We can start by looking at the top-10 slowest response times for each of these individual methods:

In [None]:
slowest = ['updateEntityProperties', 'getEntityNeighborhood', 'getLinkingEntities', 'getRelatedEntities', 'getRelationshipsForEntity']
for slow in slowest:
    slowestRunning(df1, num=10, methodName=slow)

We can see that the top-10 slowest results for each of these methods have similar response times, and that each comes from running the method against a different set of parameters (for example, against different types of instances). This suggests that these response times were not simply a one-off or pseudo-random occurrence caused by something like a garbage collection pause, but that there is more likely to be some fundamental underlying reason for this performance. To find out more, we need to delve back into the repository connector itself, with deeper profiling of these particular combinations of parameters for each method, to see if there is some further optimization that can be done.
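
The reasoning above (similar timings spread across different parameters point to a systematic cause) can be checked with a small summary of the slowest calls. The sketch below is an assumption-laden illustration: `summarize_slowest` is a hypothetical helper, and the `tc-A` style identifiers and timings are synthetic rather than taken from the real results.

```python
import pandas as pd

def summarize_slowest(df, num=10):
    """For each method, summarize the spread and variety of its `num` slowest calls."""
    slowest = (df.sort_values("elapsed_time", ascending=False)
                 .groupby("method_name")
                 .head(num))
    return slowest.groupby("method_name").agg(
        fastest_of_slowest=("elapsed_time", "min"),
        slowest=("elapsed_time", "max"),
        distinct_test_cases=("test_case_id", "nunique"),
    )

# Synthetic evidence: identifiers and timings are illustrative only
evidence = pd.DataFrame({
    "method_name": ["getRelatedEntities"] * 4 + ["updateEntityProperties"] * 4,
    "test_case_id": ["tc-A", "tc-B", "tc-C", "tc-D", "tc-E", "tc-E", "tc-E", "tc-E"],
    "elapsed_time": [950, 940, 930, 20, 900, 15, 12, 10],
})
summary = summarize_slowest(evidence, num=3)
print(summary)
```

A narrow gap between `fastest_of_slowest` and `slowest` across several distinct test cases is consistent with a systematic cause, while a single isolated high value looks more like a one-off pause.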

## Comparing results

Up to this point, we have done some analysis of the performance of a single set of volume parameters. However, we may also be interested in comparing and contrasting these results with additional volume parameters to investigate the scalability of the connector as the volume of metadata within the repository grows.

In [None]:
results1 = results[1]

validateProfileResultsLocation(results1)
files = getAllProfiles(results1)

for profile_file in files:
    df1 = parseProfileDetailsIntoDF(df1, profile_file, results1)

### instancesPerType=10, maxSearchResults=5 details

In [None]:
results1_env = getEnvironmentProfile(results1)
env1 = pd.DataFrame({'repo': [], 'property_name': [], 'property_value': []})
env1 = parseEnvironmentDetailsIntoDF(env1, results1_env, results1)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
display(env1)

### instancesPerType=5, maxSearchResults=2 compared to instancesPerType=10, maxSearchResults=5

In [None]:
for profile in profile_order:
    compareProfiles(df1, profile, results0, results1, remove_outliers=True)

#### Analysis

For the most part, the performance for each method is comparable -- even though we have doubled the number of instances involved (from 4470 to 8940) and more than doubled the number of results returned by each page of a search (from 2 to 5).

The notable exceptions are the various search methods and the graph queries, in particular `getRelatedEntities` and `getLinkingEntities` which we can see have a significant additional peak. This may be understandable, given the additional number of instances is likely to equate to a significant increase in the number of relationships and linked entities that these methods will retrieve in the higher volume environment (since these methods do not page results, but retrieve all relationships and entities involved).
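
To complement the violin plots, the comparison can also be expressed as a per-method ratio of median response times. The following is a sketch under stated assumptions: `median_by_repo` is a hypothetical helper, the `small` and `large` labels stand in for two result sets, and the timings are synthetic.

```python
import pandas as pd

def median_by_repo(df, baseline, candidate):
    """Median elapsed time per method for each result set, plus candidate/baseline ratio."""
    medians = df.pivot_table(index="method_name", columns="repo",
                             values="elapsed_time", aggfunc="median")
    medians["ratio"] = medians[candidate] / medians[baseline]
    # Methods whose median grew the most appear first
    return medians.sort_values("ratio", ascending=False)

# Synthetic timings (ms) for two hypothetical result sets
timings = pd.DataFrame({
    "repo": ["small"] * 6 + ["large"] * 6,
    "method_name": (["getEntityDetail"] * 3 + ["getRelatedEntities"] * 3) * 2,
    "elapsed_time": [10, 12, 14, 100, 120, 140, 11, 12, 13, 300, 360, 420],
})
comparison = median_by_repo(timings, baseline="small", candidate="large")
print(comparison)
```

A ratio close to 1 indicates a method whose typical response time is unaffected by the larger volume; the median alone will not reveal a secondary peak, so the plots remain the better tool for spotting those.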

### Other repositories

We may also want to do some comparative analysis between repositories. The following looks at results from the JanusGraph repository at the same volume parameters to compare and contrast the relative performance of the two repositories.

In [None]:
results2 = results[2]

validateProfileResultsLocation(results2)
files = getAllProfiles(results2)

for profile_file in files:
    df1 = parseProfileDetailsIntoDF(df1, profile_file, results2)

In [None]:
results2_env = getEnvironmentProfile(results2)
env2 = pd.DataFrame({'repo': [], 'property_name': [], 'property_value': []})
env2 = parseEnvironmentDetailsIntoDF(env2, results2_env, results2)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
display(env2)

In [None]:
for profile in profile_order:
    compareProfiles(df1, profile, results0, results2, remove_outliers=True)

#### Analysis

Here we can see that in _almost all_ cases, based on their default configurations only, the Crux repository connector is faster than the JanusGraph repository connector:

- The Crux connector appears to be significantly faster (~3-4x) with write operations (create, update, delete, purge, restore, re-identify, etc.)
- The Crux connector also appears to be significantly faster for most search operations
- For some operations (the graph queries) we have no JanusGraph results at all, because each per-type test had still not completed after more than 3 hours (vs. Crux's few seconds for the same tests, at the same volume).

Only the retrieval methods are roughly equivalent between the two repositories. Of course, there may be further optimisations possible with either or both repositories to further improve their performance for certain aspects: this is only comparing the default configuration of each.

In [None]:
plotMethod(df1, "findEntities", by_repo=True)

In [None]:
slowestRunning(df1[df1['repo'] == 'janus-05-02'], num=10, methodName='findEntities')

Interestingly, we can see that some expected suspects like `Referenceable` and `OpenMetadataRoot` are particularly slow-performing; however, they are not alone: `UserAccessDirectory`, `VerificationPoint`, and `UserProfileManager` each also show response times that exceed 5 seconds (closely followed by a number of others that come close to 5 seconds).

Instead of the metadata type being the distinguishing factor, it appears it is the search parameters that are most important:

- For `Referenceable` and `OpenMetadataRoot` the slow-running examples come from the `repository-entity-retrieval-performance` set of tests: these run `findEntities` with only a type GUID as a filter.
- All of the other slowest-running examples come from the `repository-entity-classification-performance` set of tests, where `findEntities` is called with classification criteria to retrieve a limited number of results.

It would therefore appear that searching based on classification, and searching based only on a very abstract supertype, are both significantly slower in the JanusGraph repository connector than in the Crux repository connector.