# Overview

This notebook downloads data from the ROS 2 build farm and preprocesses it so that it is much easier to work with. The build farm runs a variety of tests and generates a single CSV file for each test; these files are then zipped together. This notebook takes all of those raw CSV files and merges them into a handful of larger CSV files that are much easier to work with using pandas.

There are three types of test data that come off the build farm:

1. `overhead_node` These files evaluate a single spinning ROS node in terms of CPU and memory consumption and a few other metrics.
2. `overhead_tests` These tests examine interoperability between different RMW vendors: for each pair of vendors, one acts as the publishing node and the other as the subscribing node. These tests profile the performance of this network configuration.
3. `two_process_perf` These tests create a publisher and subscriber that use the same RMW vendor. The nodes send messages from publisher to subscriber and the whole assembly is instrumented to collect system performance and networking performance data. 
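
The test families above can be told apart by file name alone. The following is a minimal sketch, assuming the file name prefixes seen in this build's archive; `classify` and the `performance_results` label are hypothetical names introduced here for illustration, not part of the build farm tooling.

```python
import os


def classify(path):
    """Map a build farm CSV file name to a test family by its prefix."""
    name = os.path.basename(path)
    if name.startswith("overhead_node_test_results_"):
        return "overhead_node"
    if name.startswith("overhead_test_results_"):
        return "overhead_tests"
    if name.startswith("performance_test_two_process_results_"):
        return "two_process_perf"
    if name.startswith("performance_test_results_"):
        # Hypothetical label for the performance files that are not two-process runs.
        return "performance_results"
    return "unknown"


print(classify("./data/build_farm/raw/overhead_node_test_results_rmw_connextdds_sync.csv"))
```

Matching on the prefix rather than on a substring avoids counting a file in more than one family.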

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import glob
# Blessed build for evaluation is August 31st 
# https://build.ros2.org/job/Rci__nightly-performance_ubuntu_focal_amd64/387/
# https://build.ros2.org/job/Rci__nightly-performance_ubuntu_focal_amd64/387/artifact/ws/test_results/buildfarm_perf_tests/*.csv/*zip*/buildfarm_perf_tests.zip
# The next block will pull down the zip file and extract it to the correct location 

In [2]:
! wget http://build.ros2.org/job/Rci__nightly-performance_ubuntu_focal_amd64/387/artifact/ws/test_results/buildfarm_perf_tests/*.csv/*zip*/buildfarm_perf_tests.zip
! rm -rf ./data/build_farm/
! mkdir -p ./data/build_farm/raw/
! mv buildfarm_perf_tests.zip ./data/build_farm/raw/
! unzip ./data/build_farm/raw/buildfarm_perf_tests.zip -d ./data/build_farm/raw/ 

--2021-10-12 11:55:32--  http://build.ros2.org/job/Rci__nightly-performance_ubuntu_focal_amd64/387/artifact/ws/test_results/buildfarm_perf_tests/*.csv/*zip*/buildfarm_perf_tests.zip
Resolving build.ros2.org (build.ros2.org)... 13.52.151.147
Connecting to build.ros2.org (build.ros2.org)|13.52.151.147|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://build.ros2.org/job/Rci__nightly-performance_ubuntu_focal_amd64/387/artifact/ws/test_results/buildfarm_perf_tests/*.csv/*zip*/buildfarm_perf_tests.zip [following]
--2021-10-12 11:55:32--  https://build.ros2.org/job/Rci__nightly-performance_ubuntu_focal_amd64/387/artifact/ws/test_results/buildfarm_perf_tests/*.csv/*zip*/buildfarm_perf_tests.zip
Connecting to build.ros2.org (build.ros2.org)|13.52.151.147|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘buildfarm_perf_tests.zip’

buildfarm_perf_test     [ <=>                ]  95.59
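
The download and unzip steps can also be done without shell commands. This is a rough sketch using only the Python standard library; `download` and `extract_csvs` are hypothetical helper names, and the layout of the files inside the archive is an assumption (the sketch simply extracts whatever the zip contains).

```python
import os
import urllib.request
import zipfile

RAW_DIR = "./data/build_farm/raw"  # same location the shell commands use


def download(url, dest_path):
    """Fetch url to dest_path, creating the parent directory if needed."""
    parent = os.path.dirname(dest_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    urllib.request.urlretrieve(url, dest_path)
    return dest_path


def extract_csvs(zip_path, dest_dir=RAW_DIR):
    """Unpack zip_path into dest_dir and return the names of the CSV members."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return sorted(n for n in archive.namelist() if n.endswith(".csv"))
```

A call such as `extract_csvs(download(url, RAW_DIR + "/buildfarm_perf_tests.zip"))` would mirror the `wget`, `mv` and `unzip` lines, with `url` set to the artifact link from the first cell.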

In [3]:
# First let's try to figure out blocks of data
# I.e. what are the "sets" of files we can process.
out = glob.glob("./data/build_farm/raw/*.csv")

print("Total Files: {0}".format(len(out)))    
perf_files = [f for f in out if "performance" in f]
print("Performance Files: {0}".format(len(perf_files)))
overhead_files = [f for f in out if "overhead" in f]
print("Overhead Files: {0}".format(len(overhead_files)))
two_files = [f for f in out if "two_process" in f]
print("Two Process Files: {0}".format(len(two_files)))
sync_files = [f for f in out if "_sync" in f]
print("Sync Files: {0}".format(len(sync_files)))
async_files = [f for f in out if "async" in f]
print("Async Files: {0}".format(len(async_files)))
pub_files = [f for f in out if "_pub" in f]
print("pub Files: {0}".format(len(pub_files)))
sub_files = [f for f in out if "_sub" in f]
print("sub Files: {0}".format(len(sub_files)))
node_files = [f for f in out if "node" in f]
print("node Files: {0}".format(len(node_files)))


Total Files: 270
Performance Files: 207
Overhead Files: 63
Two Process Files: 99
Sync Files: 155
Async Files: 115
pub Files: 28
sub Files: 28
node Files: 7


In [4]:
perf_cols = ['mean virtual memory (Mb)',
             'median virtual memory (Mb)',
             'virtual memory (Mb)',
             'mean cpu_usage (%)',
             'median cpu_usage (%)',
             'cpu_usage (%)',
             'mean physical memory (Mb)',
             'median physical memory (Mb)',
             'physical memory (Mb)',
             'mean resident anonymous memory (Mb)',
             'median resident anonymous memory (Mb)',
             'resident anonymous memory (Mb)']

In [5]:
# Take all of the "overhead node" files and merge them into a single table.
for p in node_files:
    print(p)

df = pd.read_csv(node_files[0])
df.columns = perf_cols
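# DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
frames = [df]
for p in node_files[1:]:
    temp = pd.read_csv(p)
    temp.columns = perf_cols
    frames.append(temp)
df = pd.concat(frames)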
# parse the filenames and add that data.
# str.strip() removes a set of characters, not a prefix, so use replace() instead
node_head = './data/build_farm/raw/overhead_node_test_results_rmw_'
node_names = [n.replace(node_head, '').replace('.csv', '') for n in node_files]
df["config"] = ["_".join(n.split('_')[1:]) for n in node_names]
df["vendor"] = [n.split('_')[0] for n in node_names]
df = df[df.columns[::-1]]
df["file_name"] = node_files
df.to_csv("./data/build_farm/node_perf.csv")
print(len(df))
print(df["file_name"])

./data/build_farm/raw/overhead_node_test_results_rmw_cyclonedds_cpp_sync.csv
./data/build_farm/raw/overhead_node_test_results_rmw_connextdds_async.csv
./data/build_farm/raw/overhead_node_test_results_rmw_fastrtps_dynamic_cpp_async.csv
./data/build_farm/raw/overhead_node_test_results_rmw_connextdds_sync.csv
./data/build_farm/raw/overhead_node_test_results_rmw_fastrtps_cpp_sync.csv
./data/build_farm/raw/overhead_node_test_results_rmw_fastrtps_dynamic_cpp_sync.csv
./data/build_farm/raw/overhead_node_test_results_rmw_fastrtps_cpp_async.csv
7
0    ./data/build_farm/raw/overhead_node_test_resul...
0    ./data/build_farm/raw/overhead_node_test_resul...
0    ./data/build_farm/raw/overhead_node_test_resul...
0    ./data/build_farm/raw/overhead_node_test_resul...
0    ./data/build_farm/raw/overhead_node_test_resul...
0    ./data/build_farm/raw/overhead_node_test_resul...
0    ./data/build_farm/raw/overhead_node_test_resul...
Name: file_name, dtype: object
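
A note on pulling the vendor and config out of these file names: `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix, so it can eat into the vendor name. The sketch below shows the effect on one of the file names listed above and a prefix-based alternative; `parse_node_fname` is a hypothetical helper, not part of the notebook.

```python
import os

NODE_PREFIX = "overhead_node_test_results_rmw_"


def parse_node_fname(path):
    """Split an overhead-node file name into vendor and config."""
    name = os.path.basename(path)
    if name.endswith(".csv"):
        name = name[:-len(".csv")]
    if name.startswith(NODE_PREFIX):
        name = name[len(NODE_PREFIX):]
    # Everything before the first underscore is the vendor, the rest is the config.
    vendor, _, config = name.partition("_")
    return {"vendor": vendor, "config": config}


raw = "./data/build_farm/raw/overhead_node_test_results_rmw_fastrtps_cpp_sync.csv"
head = "./data/build_farm/raw/overhead_node_test_results_rmw_"

# strip() also removes the leading "fastrt" because those letters occur in head.
print(raw.strip(head).split("_")[0])  # prints "ps", not "fastrtps"
print(parse_node_fname(raw))
```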


In [6]:
def fname_to_data(fname, head="./data/build_farm/raw/overhead_test_results_rmw_", tail="_ROS2_pub.csv"):
    """
    Munge a file name into metadata. Pull out the first and second RMW
    along with the "flavor" information
    """
    fname = fname.replace(head, "").replace(tail, "")
    parts = fname.split("_rmw_")
    first = parts[0].split("_")
    second = parts[1].split("_")
    # format is rmw _ <name> _ <config> _ rmw _ <name2> _ <config2>
    ret_val = {}
    ret_val["first_rmw"] = first[0]
    ret_val["second_rmw"] = second[0]
    ret_val["first_flavor"] = " ".join(first[1:])
    ret_val["second_flavor"] = " ".join(second[1:])
    return ret_val

fname_to_data("./data/build_farm/raw/overhead_test_results_rmw_fastrtps_cpp_async_rmw_connext_cpp_ROS2_pub.csv")

{'first_flavor': 'cpp async',
 'first_rmw': 'fastrtps',
 'second_flavor': 'cpp',
 'second_rmw': 'connext'}
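
The same parsing can be written with a single regular expression, which also captures whether the file is a publisher or subscriber result and fails loudly on names that do not fit the pattern. This is only a sketch of an alternative; `OVERHEAD_RE`, `parse_overhead_fname` and the `role` key are hypothetical names, and the pattern assumes the naming scheme shown in the example above.

```python
import os
import re

OVERHEAD_RE = re.compile(
    r"^overhead_test_results_rmw_(?P<first>.+?)_rmw_(?P<second>.+)"
    r"_ROS2_(?P<role>pub|sub)\.csv$"
)


def parse_overhead_fname(path):
    """Parse an overhead test file name into RMW, flavor and role fields."""
    match = OVERHEAD_RE.match(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected file name: {0}".format(path))
    first = match.group("first").split("_")
    second = match.group("second").split("_")
    return {
        "first_rmw": first[0],
        "first_flavor": " ".join(first[1:]),
        "second_rmw": second[0],
        "second_flavor": " ".join(second[1:]),
        "role": match.group("role"),
    }


print(parse_overhead_fname(
    "./data/build_farm/raw/overhead_test_results_rmw_fastrtps_cpp_async_rmw_connext_cpp_ROS2_pub.csv"
))
```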

In [7]:
pub_sub_cols = ['mean virtual memory (Mb)',
                'median virtual memory (Mb)',
                'virtual memory (Mb)',
                'mean cpu_usage (%)',
                'median cpu_usage (%)',
                'cpu_usage (%)',
                'mean physical memory (Mb)',
                'median physical memory (Mb)',
                'physical memory (Mb)',
                'mean resident anonymous memory (Mb)',
                'median resident anonymous memory (Mb)',
                'resident anonymous memory (Mb)',
                'mean latency_mean (ms)',
                'median latency_mean (ms)',
                'Top 5% latency (ms)',
                'max ru_maxrss',
                'mean received',
                'mean sent',
                'sum lost',
                'mean system_cpu_usage (%)',
                'mean system virtual memory (Mb)']

In [8]:
# Pull out data for the pub files and repeat for sub files. 
pub_df = pd.read_csv(pub_files[0])
print(pub_files[0])

print("DF Cols {0} vs known cols {1}".format(len(pub_df.columns),len(pub_sub_cols)))    
# squish all the files into one table
pub_df.columns = pub_sub_cols
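# DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
frames = [pub_df]
for p in pub_files[1:]:
    temp = pd.read_csv(p)
    temp.columns = pub_sub_cols
    frames.append(temp)
pub_df = pd.concat(frames)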
# parse the file names into data and add them back to table. 
flavors = [fname_to_data(flavor) for flavor in pub_files]
pub_df["from_rmw"]= [flavor["first_rmw"] for flavor in flavors]
pub_df["from_rmw_flavor"]= [flavor["first_flavor"] for flavor in flavors]
pub_df["to_rmw"]= [flavor["second_rmw"] for flavor in flavors]
pub_df["to_rmw_flavor"]= [flavor["second_flavor"] for flavor in flavors]
pub_df["file_name"] = pub_files
pub_df = pub_df[pub_df.columns[::-1]]
pub_df.to_csv("./data/build_farm/pub_perf.csv")
pub_df.head()

./data/build_farm/raw/overhead_test_results_rmw_fastrtps_cpp_async_rmw_connextdds_ROS2_pub.csv
DF Cols 21 vs known cols 21


Unnamed: 0,file_name,to_rmw_flavor,to_rmw,from_rmw_flavor,from_rmw,mean system virtual memory (Mb),mean system_cpu_usage (%),sum lost,mean sent,mean received,...,mean resident anonymous memory (Mb),physical memory (Mb),median physical memory (Mb),mean physical memory (Mb),cpu_usage (%),median cpu_usage (%),mean cpu_usage (%),virtual memory (Mb),median virtual memory (Mb),mean virtual memory (Mb)
0,./data/build_farm/raw/overhead_test_results_rm...,,connextdds,cpp async,fastrtps,1752.971,27.926067,0.0,4.36,4.36,...,8.869302,35.6797,35.6797,35.477227,1.890617,1.70223,1.764062,162.95,162.95,160.131387
0,./data/build_farm/raw/overhead_test_results_rm...,,connextdds,cpp sync,fastrtps,1752.594194,28.7346,0.0,4.4,4.52,...,8.811556,35.5469,35.5469,35.246245,2.51683,2.29258,2.170923,182.985,182.985,177.656413
0,./data/build_farm/raw/overhead_test_results_rm...,cpp,fastrtps,dynamic cpp sync,fastrtps,1802.063667,28.35133,0.0,4.44,4.48,...,8.969271,36.1484,36.1484,35.877047,2.345141,2.120775,2.243746,199.147,199.147,193.981293
0,./data/build_farm/raw/overhead_test_results_rm...,cpp,fastrtps,async,connextdds,1752.387742,28.567597,0.0,4.346154,4.307692,...,12.465678,50.5859,50.5859,49.862616,1.719645,1.62,1.570665,239.521,239.521,232.23129
0,./data/build_farm/raw/overhead_test_results_rm...,,connextdds,async,connextdds,1761.191667,28.412267,0.0,4.4,4.4,...,12.820502,52.3594,52.3594,51.282183,2.095654,1.891895,1.847275,240.067,240.067,232.513857


In [9]:
# Now repeat for subscribers
sub_df = pd.read_csv(sub_files[0])
print(sub_files[0])

print("DF Cols {0} vs known cols {1}".format(len(sub_df.columns),len(pub_sub_cols)))    

sub_df.columns = pub_sub_cols
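# DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
frames = [sub_df]
for p in sub_files[1:]:
    temp = pd.read_csv(p)
    temp.columns = pub_sub_cols
    frames.append(temp)
sub_df = pd.concat(frames)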
    
flavors = [fname_to_data(flavor,tail="_ROS2_sub.csv") for flavor in sub_files]
sub_df["from_rmw"]= [flavor["first_rmw"] for flavor in flavors]
sub_df["from_rmw_flavor"]= [flavor["first_flavor"] for flavor in flavors]
sub_df["to_rmw"]= [flavor["second_rmw"] for flavor in flavors]
sub_df["to_rmw_flavor"]= [flavor["second_flavor"] for flavor in flavors]
sub_df["file_name"] = sub_files
sub_df = sub_df[sub_df.columns[::-1]]
sub_df.to_csv("./data/build_farm/sub_perf.csv")
sub_df.head()

./data/build_farm/raw/overhead_test_results_rmw_cyclonedds_cpp_sync_rmw_cyclonedds_cpp_ROS2_sub.csv
DF Cols 21 vs known cols 21


Unnamed: 0,file_name,to_rmw_flavor,to_rmw,from_rmw_flavor,from_rmw,mean system virtual memory (Mb),mean system_cpu_usage (%),sum lost,mean sent,mean received,...,mean resident anonymous memory (Mb),physical memory (Mb),median physical memory (Mb),mean physical memory (Mb),cpu_usage (%),median cpu_usage (%),mean cpu_usage (%),virtual memory (Mb),median virtual memory (Mb),mean virtual memory (Mb)
0,./data/build_farm/raw/overhead_test_results_rm...,cpp,cyclonedds,cpp sync,cyclonedds,1734.742667,28.326873,0.0,0.0,4.44,...,7.143878,29.8477,29.8477,28.575543,26.48983,25.7825,23.988257,144.67,144.419,139.968091
0,./data/build_farm/raw/overhead_test_results_rm...,cpp,fastrtps,cpp sync,fastrtps,1743.916774,28.59229,0.0,0.0,4.48,...,8.645447,35.5703,35.5703,34.581767,27.06145,26.8221,25.597341,199.147,199.147,192.509635
0,./data/build_farm/raw/overhead_test_results_rm...,dynamic cpp,fastrtps,dynamic cpp sync,fastrtps,1801.767667,28.299863,0.0,0.0,4.44,...,8.678513,35.7695,35.7695,34.714032,26.918485,26.71685,25.387021,199.161,199.161,192.301825
0,./data/build_farm/raw/overhead_test_results_rm...,dynamic cpp,fastrtps,cpp sync,fastrtps,1744.119,28.040883,0.0,0.0,4.36,...,8.860903,36.418,36.418,35.44365,26.61738,26.1652,24.66297,199.164,199.164,192.304985
0,./data/build_farm/raw/overhead_test_results_rm...,dynamic cpp,fastrtps,cpp async,fastrtps,1743.983,27.956407,0.0,0.0,4.4,...,8.68037,35.8281,35.8281,34.72146,26.91762,26.76425,25.574776,179.156,179.156,172.963658
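
The publisher and subscriber cells follow the same pattern: read each CSV, rename its columns, and stack the rows. A small helper could capture that pattern and turn the printed column-count comparison into a hard check. The sketch below is one possible shape for it; `merge_csvs` is a hypothetical name, the in-memory CSVs hold made-up values, and unlike the cells above it renumbers the index with `ignore_index=True`.

```python
import io

import pandas as pd


def merge_csvs(sources, columns):
    """Read each CSV, rename its columns, and stack the rows into one table."""
    frames = []
    for src in sources:
        frame = pd.read_csv(src)
        if len(frame.columns) != len(columns):
            raise ValueError(
                "{0}: expected {1} columns, found {2}".format(
                    src, len(columns), len(frame.columns)
                )
            )
        frame.columns = columns
        frames.append(frame)
    # pd.concat replaces the repeated DataFrame.append calls.
    return pd.concat(frames, ignore_index=True)


# Two tiny in-memory stand-ins for build farm CSV files (made-up values).
a = io.StringIO("x,y\n1,2\n")
b = io.StringIO("x,y\n3,4\n")
merged = merge_csvs([a, b], ["mean latency_mean (ms)", "sum lost"])
print(merged.shape)
```

With such a helper, a call like `merge_csvs(pub_files, pub_sub_cols)` would stand in for the read, rename and append loop.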


In [10]:
# now aggregate the performance results, there are two types: "two process" and "results"
two_process_perf = [p for p in perf_files if "two_process" in p]
result_perf_file = [p for p in perf_files if "two_process" not in p]
print("{0} two process files and {1} results files. {2} total files.".format(len(two_process_perf),len(result_perf_file),len(perf_files)))

# From: https://github.com/ahcorde/buildfarm_perf_tests/blob/master/test/test_performance.py.in#L48
perf_col_names = [
    'mean latency_mean (ms)',
    'median latency_mean (ms)',
    '95th Percentile Latency',
    'max ru_maxrss',
    'mean received',
    'mean sent',
    'sum lost',
    'mean cpu_usage (%)',
    '95th Percentile CPU',
    'median cpu_usage (%)',
    'mean data_received (Mb)',
    'median data_received (Mb)',
    '95th Percentile Data Received (Mb)']


99 two process files and 108 results files. 207 total files.


In [11]:
def fname_to_rmw_and_data(fname):
    """
    Parse file names of the format
    performance_test_results_<optional rmw>_<rmw_name>_<rmw_flavor>_<datatype>.csv
    and return the vendor, flavor, and data type.
    E.g.
    ./data/build_farm/raw/performance_test_results_rmw_fastrtps_dynamic_cpp_async_Array32k.csv
    ./data/build_farm/raw/performance_test_results_FastRTPS_sync_Array2m.csv
    ./data/build_farm/raw/performance_test_results_CycloneDDS_sync_Array1k.csv
    """
    fname = fname.replace("./data/build_farm/raw/performance_test_two_process_results_rmw_","")
    fname = fname.replace("./data/build_farm/raw/performance_test_two_process_results_","")
    fname = fname.replace("./data/build_farm/raw/performance_test_results_","")
    
    fname = fname.replace(".csv","")
    parts = fname.split("_")
    ret_val = {}
    ret_val["type"] = parts[-1] # last entry is type, easy
    if parts[0] == "rmw":
        parts = parts[1:] # drop the first value if it is RMW
    ret_val["vendor"] = parts[0].lower() # both upper and lower is present
    ret_val["flavor"] = "_".join(parts[1:-1])
    return ret_val 
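
The `flavor` string returned above combines an optional implementation variant (such as `cpp` or `dynamic_cpp`) with a trailing `sync` or `async` token. If those are wanted as separate columns, they can be split apart afterwards. This is a minimal sketch, assuming the flavor always ends in `sync` or `async` when a mode is present; `split_flavor` is a hypothetical helper and the variant/mode terminology is introduced here only for illustration.

```python
def split_flavor(flavor):
    """Split e.g. 'dynamic_cpp_sync' into ('dynamic_cpp', 'sync')."""
    variant, _, mode = flavor.rpartition("_")
    if mode not in ("sync", "async"):
        # No recognizable trailing mode: treat the whole string as the variant.
        return flavor, ""
    return variant, mode


for flavor in ["sync", "cpp_async", "dynamic_cpp_sync"]:
    print(flavor, "->", split_flavor(flavor))
```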

In [12]:
perf_df = pd.read_csv(result_perf_file[0])
print(result_perf_file[0])

print("DF Cols {0} vs known cols {1}".format(len(perf_df.columns),len(perf_col_names)))

perf_df.columns = perf_col_names

# smush main csv files together
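# DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
frames = [perf_df]
for p in result_perf_file[1:]:
    temp = pd.read_csv(p)
    temp.columns = perf_col_names
    frames.append(temp)
perf_df = pd.concat(frames)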
# parse file names 
fname_data = [fname_to_rmw_and_data(p) for p in result_perf_file]
perf_df["vendor"] = [p["vendor"] for p in fname_data]
perf_df["flavor"] = [p["flavor"] for p in fname_data]
perf_df["data_type"] = [p["type"] for p in fname_data]
perf_df["file_name"] = result_perf_file
perf_df = perf_df[perf_df.columns[::-1]]
perf_df.to_csv("./data/build_farm/perf_network_results.csv")
perf_df.head()


./data/build_farm/raw/performance_test_results_rmw_cyclonedds_cpp_sync_Array4m.csv
DF Cols 13 vs known cols 13


Unnamed: 0,file_name,data_type,flavor,vendor,95th Percentile Data Received (Mb),median data_received (Mb),mean data_received (Mb),median cpu_usage (%),95th Percentile CPU,mean cpu_usage (%),sum lost,mean sent,mean received,max ru_maxrss,95th Percentile Latency,median latency_mean (ms),mean latency_mean (ms)
0,./data/build_farm/raw/performance_test_results...,Array4m,cpp_sync,cyclonedds,4000.376185,4000.017879,4000.003327,11.24,12.49,11.434444,0.0,999.518519,999.518519,88816.0,0.46748,0.4256,0.429648
0,./data/build_farm/raw/performance_test_results...,Array4m,sync,fastrtps,4003.188868,3999.255228,4000.088347,9.997,10.425,9.915037,0.0,999.222222,999.222222,94996.0,0.38561,0.3685,0.368093
0,./data/build_farm/raw/performance_test_results...,PointCloud512k,sync,cyclonedds,500.677456,500.175248,500.252393,2.0,2.25,2.046185,0.0,999.148148,999.185185,87364.0,0.062795,0.05561,0.05731
0,./data/build_farm/raw/performance_test_results...,Array32k,async,connextdds,31.266631,31.265222,31.265211,3.247,3.6713,3.256148,0.0,999.481481,999.481481,90120.0,0.091345,0.08145,0.079563
0,./data/build_farm/raw/performance_test_results...,Array2m,cpp_sync,fastrtps,2000.186412,2000.024097,2000.013389,5.245,5.495,5.235704,0.0,999.518519,999.518519,93600.0,0.18651,0.1843,0.184367


In [13]:
twop_df = pd.read_csv(two_process_perf[0])
print(two_process_perf[0])

print("DF Cols {0} vs known cols {1}".format(len(twop_df.columns),len(perf_col_names)))

twop_df.columns = perf_col_names

# smush main csv files together
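# DataFrame.append was removed in pandas 2.0; collect the frames and concatenate once
frames = [twop_df]
for p in two_process_perf[1:]:
    temp = pd.read_csv(p)
    temp.columns = perf_col_names
    frames.append(temp)
twop_df = pd.concat(frames)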
# parse file names 
fname_data = [fname_to_rmw_and_data(p) for p in two_process_perf]
twop_df["vendor"] = [p["vendor"] for p in fname_data]
twop_df["flavor"] = [p["flavor"] for p in fname_data]
twop_df["data_type"] = [p["type"] for p in fname_data]
twop_df["file_name"] = two_process_perf
twop_df = twop_df[twop_df.columns[::-1]]
twop_df.to_csv("./data/build_farm/two_process_perf_network_results.csv")
twop_df.head()


./data/build_farm/raw/performance_test_two_process_results_rmw_connextdds_sync_Array60k.csv
DF Cols 13 vs known cols 13


Unnamed: 0,file_name,data_type,flavor,vendor,95th Percentile Data Received (Mb),median data_received (Mb),mean data_received (Mb),median cpu_usage (%),95th Percentile CPU,mean cpu_usage (%),sum lost,mean sent,mean received,max ru_maxrss,95th Percentile Latency,median latency_mean (ms),mean latency_mean (ms)
0,./data/build_farm/raw/performance_test_two_pro...,Array60k,sync,connextdds,58.610945,58.608705,58.556918,2.747,3.172,2.670556,0.0,0.0,998.518519,90296.0,0.12944,0.1173,0.118704
0,./data/build_farm/raw/performance_test_two_pro...,Array2m,sync,connextdds,0.0,0.0,0.073959,6.99,8.7305,7.10537,17856.0,0.0,0.0,582404.0,0.0,0.0,1.025926
0,./data/build_farm/raw/performance_test_two_pro...,Array8m,async,fastrtps,605.502727,567.914353,567.004951,10.75,11.175,10.823333,25064.0,0.0,69.888889,87456.0,21.292,20.47,20.364074
0,./data/build_farm/raw/performance_test_two_pro...,Array4k,dynamic_cpp_sync,fastrtps,3.921659,3.921494,3.917588,1.748,1.998,1.729481,0.0,0.0,998.407407,93872.0,0.062124,0.05297,0.052825
0,./data/build_farm/raw/performance_test_two_pro...,Array16k,cpp_async,fastrtps,15.640858,15.640251,15.624621,1.998,2.498,2.015741,0.0,0.0,998.518519,93404.0,0.11121,0.09198,0.093627


In [14]:
total = len(pub_files)+len(sub_files)+len(node_files)+len(two_process_perf)+len(result_perf_file)
print("processed {0} of {1}".format(total,len(glob.glob("./data/build_farm/raw/*.csv"))))

processed 270 of 270
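
Comparing totals confirms that the counts add up, but a count can still match if one file lands in two groups while another lands in none. A set-based check makes that distinction explicit. The sketch below uses a hypothetical helper, `check_partition`, and made-up file names to show the idea.

```python
def check_partition(all_files, groups):
    """Return (missing, duplicated) file sets for the named groups."""
    seen = {}
    duplicated = set()
    for name, files in groups.items():
        for f in files:
            if f in seen:
                duplicated.add(f)
            seen[f] = name
    missing = set(all_files) - set(seen)
    return missing, duplicated


# Made-up example: three files, three group entries, yet the split is wrong.
all_files = ["a.csv", "b.csv", "c.csv"]
groups = {"pub": ["a.csv", "b.csv"], "sub": ["b.csv"]}
missing, duplicated = check_partition(all_files, groups)
print("missing:", sorted(missing), "duplicated:", sorted(duplicated))
```

In this notebook the groups would be `pub_files`, `sub_files`, `node_files`, `two_process_perf` and `result_perf_file`, checked against the full glob result.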
