# Process the SAP Sales & Distribution Benchmark
In this notebook, we attempt to correlate SPECintbase2006 with an arbitrary "SAPS per Core" that we obtain from the SAP Sales & Distribution 2-tier benchmark.

We do this because far more SPECint2006 data is available than SAP SD2 data. If we accept the assumption that SAP SD2 is a reasonable approximation of an enterprise workload, and if we can find a good correlation between SPECint2006 and SAP SD2, then we can size enterprise workloads directly from SPECint2006 data, which is quite abundant.

In [None]:
import pandas as pd
import numpy as np

In [None]:
!wget -O export-sd.csv https://www.sap.com/dmc/exp/2018-benchmark-directory/assets/export-sd.csv

In [None]:
# skip malformed rows instead of raising (on_bad_lines requires pandas >= 1.3)
dfsaps = pd.read_csv("export-sd.csv", sep=";", on_bad_lines="skip", header=0, encoding="ISO8859-1")

dfsaps['Server Name'] = dfsaps['Server Name'].str.upper() 
dfsaps['CPU Architecture'] = dfsaps['CPU Architecture'].str.upper() 
dfsaps['CPU Speed'] = dfsaps['CPU Speed'].str.upper() 
dfsaps['Technology Partner'] = dfsaps['Technology Partner'].str.upper() 

# we should only get 2-tier results
dfsaps = dfsaps[ dfsaps['Configuration'] == '2-tier' ]

dfsaps = dfsaps.drop(dfsaps.columns[[8, 12, 13, 14, 22, 24]], axis=1)

# unfortunately some of the benchmarks have incorrect core counts
# e.g. 2005021 has a null value for Cores in the CSV (but the long description shows 32 processors)
# so we just drop any entries where Cores is not defined

dfsaps.dropna(subset=['Cores'], inplace=True)


# also, there are further errors in the CSV e.g. 2013010 reports 64 cores, but the details show 8 x 
# calculate a naive "SAPS per Core" value
dfsaps['SAPS per Core'] = dfsaps['saps'] / dfsaps['Cores']

dfsaps.head()

In [None]:
dfsaps.to_csv("saps.csv", index=False)
dfsaps.sort_values(['SAPS per Core'], ascending=[False])
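
As a minimal sketch of the per-core calculation above, here is the same logic on a made-up three-row frame rather than the real CSV. The certification labels and the second and third rows are illustrative assumptions; the 4130 SAPS on 4 cores is the Sun M3000 figure quoted in the conclusion of this notebook.

```python
import pandas as pd

# Toy stand-in for export-sd.csv; labels "A".."C" are hypothetical.
toy = pd.DataFrame({
    "Certification Number": ["A", "B", "C"],
    "saps": [4130.0, 10000.0, 5000.0],
    "Cores": [4, 8, None],  # third row mimics an entry with no core count
})

# Drop rows with no core count, then compute the naive per-core value.
toy = toy.dropna(subset=["Cores"]).copy()
toy["SAPS per Core"] = toy["saps"] / toy["Cores"]
print(toy)
```

Rows with a wrong (rather than missing) core count pass through this filter unchanged, so the per-core value is only as reliable as the `Cores` column.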

## Attempt to Derive Correlation Between SAPS and SPEC
SAPS is a whole-system, complex benchmark, which is more relevant for enterprise workloads. However, the number of available SAPS benchmarks is low. If we can find a strong correlation between SAPS and SPEC, then we can use SPEC as a proxy for estimating performance of different processor architectures on SAPS-like, enterprise workloads.

To get a good match between SAPS and SPEC, we will only use SAPS entries whose CPU Architecture description starts with "INTEL XEON". This yields 296 entries, a little less than half of the total.

In [None]:
dfintel = dfsaps[ (dfsaps['CPU Architecture'].str.contains(r'^INTEL XEON'))].sort_values(['SAPS per Core'], ascending=[False])

dfintel.head(10)

Load the SPEC ratings, and extract only those for INTEL XEON where a clock speed is specified in SYSTEM NAME

In [None]:
dfspec = pd.read_csv("spec.csv")

# only get the SPEC results for INTEL XEON which have a clock speed (so we can match it to the SAPS dataframe)
# raw string so that \( is passed to the regex engine as a literal parenthesis
dfspecintel = dfspec[ (dfspec['SYSTEM NAME'].str.contains(r'\(INTEL XEON'))]
dfspecintel = dfspecintel[ (dfspecintel['SYSTEM NAME'].str.contains('GHZ'))]

dfspecintel.head(10)
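
To illustrate what the two filters above keep and discard, here is a small sketch on hypothetical `SYSTEM NAME` strings. The server names are invented, and the upper-case layout is an assumption based on the patterns used in this notebook.

```python
import pandas as pd

# Hypothetical SYSTEM NAME values, assumed to be upper-cased like spec.csv.
names = pd.Series([
    "SERVER A (INTEL XEON E5-2690, 2.90 GHZ)",
    "SERVER B (AMD OPTERON 6174, 2.20 GHZ)",
    "SERVER C (INTEL XEON E5-2690)",
])

# Keep names containing a literal "(INTEL XEON" and also a "GHZ" clock speed.
is_xeon = names.str.contains(r"\(INTEL XEON", regex=True)
has_clock = names.str.contains("GHZ", regex=False)
mask = is_xeon & has_clock
print(names[mask].tolist())
```

Only the first name survives: the second is not a Xeon, and the third has no clock speed to match against the SAPS data.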

## Correlating Technology Partner
These are the Technology Partners with the most Intel Xeon submissions in the SAPS data.

In [None]:
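# sort first, then take the top 10 (head before sort would only rank the first 10 groups alphabetically)
dfintel.groupby("Technology Partner")["Certification Number"].count().sort_values(ascending=False).head(10)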

And these are the top TEST SPONSOR submissions in SPEC. We have to "fix" these so that a join is possible.

In [None]:
# sort first, then take the top 10
dfspecintel.groupby("TEST SPONSOR")["RESULTS BASE 2006"].count().sort_values(ascending=False).head(10)

In [None]:
dfspecintel = dfspecintel.replace(to_replace='DELL INC.', value='DELL', regex=False)
dfspecintel = dfspecintel.replace(to_replace='DELL INC', value='DELL', regex=False)
dfspecintel = dfspecintel.replace(to_replace='BULL SAS', value='BULL', regex=False)

In [None]:
# sort first, then take the top 10
dfspecintel.groupby("TEST SPONSOR")["RESULTS BASE 2006"].count().sort_values(ascending=False).head(10)
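
The sponsor clean-up can also be scoped to a single column with an alias table, which may be easier to extend as more mismatched names turn up. This is a sketch on a hypothetical series; only the three aliases already used above come from this notebook, and the `aliases` name is an assumption.

```python
import pandas as pd

# Hypothetical sponsor values; the aliases mirror the replacements above.
sponsors = pd.Series(["DELL INC.", "DELL INC", "BULL SAS", "FUJITSU"])
aliases = {"DELL INC.": "DELL", "DELL INC": "DELL", "BULL SAS": "BULL"}

# Series.replace with a dict swaps exact cell values and leaves others untouched.
normalized = sponsors.replace(aliases)
print(normalized.value_counts())
```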

Iterate over the "Intel Xeon" SAPS dataframe and filter the Intel SPEC dataframe by Technology Partner / TEST SPONSOR, Cores / PROCESSOR ENABLED CORES, Server Name / SYSTEM NAME (substring), CPU Speed / SYSTEM NAME (substring).

In [None]:
import re

pd.set_option('display.max_colwidth', None)
pd.set_option('mode.chained_assignment', None)

c = 0
d = 0
e = 0

# 3 or more digits.. assume this is the Xeon model number
r1 = re.compile(r'\s+([a-zA-Z0-9_-]*\d{3,})')

# assume this is clock speed
r2 = re.compile(r'(\d+\.\d+)\s?GHZ')

newdf = pd.DataFrame()
newdf2 = pd.DataFrame()

while (c < len(dfintel.index)):
    
    row = dfintel.iloc[c]
    tech_partner = row['Technology Partner']
    server_name =  row['Server Name']
    cpu_arch = row['CPU Architecture']
    cpu_speed = row['CPU Speed']
    cores = row['Cores']
    certnum = row['Certification Number']
    certdate = row['Certification Date']
    saps = row['saps']
    saps_per_core = row['SAPS per Core']


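    # for CPU architecture we just want to extract the Xeon model number (3 or more digits)
    # reset per row so a failed match cannot reuse the previous row's value
    model = ''
    m = r1.search( cpu_arch )
    if m:
        model = m.group(1)

    # for CPU speed, we want to get rid of the GHZ bit
    clock_speed = ''
    m = r2.match( cpu_speed )
    if m:
        clock_speed = m.group(1)

    # only attempt a match when both values were extracted
    # (a bitwise & on the lengths would wrongly reject e.g. a 4-character clock speed)
    if model and clock_speed: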
        # just match Xeon model number, cores, clock speed
        # regex=False treats the values as literal substrings,
        # so "." in the clock speed is not a wildcard
        res = dfspecintel[ (
#            (dfspecintel['TEST SPONSOR'] == tech_partner) &
            (dfspecintel['PROCESSOR ENABLED CORES'] == cores) &
            (dfspecintel["SYSTEM NAME"].str.contains(clock_speed, regex=False)) &
#            (dfspecintel["SYSTEM NAME"].str.contains(server_name, regex=False)) &
            (dfspecintel["SYSTEM NAME"].str.contains(model, regex=False))
        )]

        # stricter match including manufacturer and server name
        res2 =  dfspecintel[ (
            (dfspecintel['TEST SPONSOR'] == tech_partner) &
            (dfspecintel['PROCESSOR ENABLED CORES'] == cores) &
            (dfspecintel["SYSTEM NAME"].str.contains(clock_speed, regex=False)) &
            (dfspecintel["SYSTEM NAME"].str.contains(server_name, regex=False)) &
            (dfspecintel["SYSTEM NAME"].str.contains(model, regex=False))
        )]

        if len(res.index) > 0:
            res["Technology Partner"] = tech_partner
            res["Server Name"] = server_name
            res["CPU Architecture"] = cpu_arch
            res["CPU Speed"] = cpu_speed
            res["Cores"] = cores
            res['Certification Number'] = certnum
            res['Certification Date'] = certdate
            res['SAPS'] = saps
            res['SAPS per Core'] = saps_per_core
            
            # DataFrame.append was removed in pandas 2.0; pd.concat works on all versions
            newdf = pd.concat([newdf, res])
            d = d + 1

        if len(res2.index) > 0:
            res2["Technology Partner"] = tech_partner
            res2["Server Name"] = server_name
            res2["CPU Architecture"] = cpu_arch
            res2["CPU Speed"] = cpu_speed
            res2["Cores"] = cores
            res2['Certification Number'] = certnum
            res2['Certification Date'] = certdate
            res2['SAPS'] = saps
            res2['SAPS per Core'] = saps_per_core
            
            # DataFrame.append was removed in pandas 2.0; pd.concat works on all versions
            newdf2 = pd.concat([newdf2, res2])
            e = e + 1

    c = c + 1

print('%d, %d of %d matched' % (d, e, c))

When we only match the Xeon model number, core count, and clock speed, we get a good number of matches. This allows for a larger data set.
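
The looser match can also be expressed as a join, which is a different technique from the row-by-row loop used in this notebook. The sketch below extracts a model number and clock speed into columns on both sides and merges on them. All rows are hypothetical, and the `model` and `clock` column names are assumptions introduced for illustration. Unlike the substring test in the loop, the join requires the extracted values to be equal, so it may match fewer rows on real data.

```python
import pandas as pd

# Hypothetical rows standing in for dfintel and dfspecintel.
saps_toy = pd.DataFrame({
    "CPU Architecture": ["INTEL XEON PROCESSOR E5-2690", "INTEL XEON PROCESSOR X5570"],
    "CPU Speed": ["2.90 GHZ", "2.93 GHZ"],
    "Cores": [16, 8],
    "SAPS per Core": [2000.0, 1500.0],
})
spec_toy = pd.DataFrame({
    "SYSTEM NAME": [
        "SERVER A (INTEL XEON E5-2690, 2.90 GHZ)",
        "SERVER B (INTEL XEON E5-2690, 2.90 GHZ)",
        "SERVER C (INTEL XEON X5570, 2.93 GHZ)",
    ],
    "PROCESSOR ENABLED CORES": [16, 16, 4],
    "RESULTS BASE 2006": [50.0, 52.0, 30.0],
})

clock_re = r"(\d+\.\d+)\s?GHZ"

# Extract the join keys on the SAPS side.
saps_toy["model"] = saps_toy["CPU Architecture"].str.extract(
    r"\s+([A-Z0-9_-]*\d{3,})", expand=False)
saps_toy["clock"] = saps_toy["CPU Speed"].str.extract(clock_re, expand=False).astype(float)

# On the SPEC side, anchor on "(INTEL XEON" so server model numbers are not picked up.
spec_toy["model"] = spec_toy["SYSTEM NAME"].str.extract(
    r"\(INTEL XEON\s+([A-Z0-9_-]*\d{3,})", expand=False)
spec_toy["clock"] = spec_toy["SYSTEM NAME"].str.extract(clock_re, expand=False).astype(float)
spec_toy = spec_toy.rename(columns={"PROCESSOR ENABLED CORES": "Cores"})

# Inner join on core count, model number and clock speed.
matched = saps_toy.merge(spec_toy, on=["Cores", "model", "clock"], how="inner")
print(matched[["model", "clock", "Cores", "SAPS per Core", "RESULTS BASE 2006"]])
```

In this toy data the first SAPS row matches two SPEC rows, while the second matches none because the core counts differ.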

In [None]:
newdf.sort_values(['Certification Number'], ascending=True).head(10)

In [None]:
newdf.to_csv("correlated.csv", index=False)

The problem is that the matched entries have a minimum SAPS per core of 924, which does not correspond to a very old machine, so old/slow systems are not represented. Also, each submission in the SAPS benchmark results in multiple matches in the SPECintbase2006 benchmark.

Correlation of SAPS per core and SPECintbase2006 is not very strong. However, we are able to obtain a reasonable best-fit line.

In [None]:
newdf['SAPS per Core'].corr(newdf['RESULTS BASE 2006'])
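
Because each SAPS submission can match several SPEC results, submissions with many matches weigh more heavily in the correlation. One possible way to check the effect, sketched here on hypothetical data, is to average the SPEC results per certification number before correlating. This is an illustration only and not a step the notebook performs.

```python
import pandas as pd

# Hypothetical matched rows: submission "A" matched three SPEC results.
toy = pd.DataFrame({
    "Certification Number": ["A", "A", "A", "B", "C"],
    "SAPS per Core": [1500.0, 1500.0, 1500.0, 1800.0, 2100.0],
    "RESULTS BASE 2006": [28.0, 30.0, 32.0, 41.0, 50.0],
})

# Correlation over every matched row (duplicates included).
r_all = toy["SAPS per Core"].corr(toy["RESULTS BASE 2006"])

# One row per submission: SAPS is constant within a group, SPEC is averaged.
per_cert = toy.groupby("Certification Number", as_index=False).agg(
    {"SAPS per Core": "first", "RESULTS BASE 2006": "mean"}
)
r_per_cert = per_cert["SAPS per Core"].corr(per_cert["RESULTS BASE 2006"])

print(r_all, r_per_cert)
```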

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

d2 = newdf[['RESULTS BASE 2006', 'SAPS per Core', ]].copy()

X = d2.iloc[:, 0].values.reshape(-1, 1)
Y = d2.iloc[:, 1].values.reshape(-1, 1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
Y_pred = linear_regressor.predict(X)

plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

# coef_ has shape (1, 1) and intercept_ has shape (1,), so index into them to get scalars
print('Coefficient/Intercept: %f %f\n' % (linear_regressor.coef_[0][0], linear_regressor.intercept_[0]))

Let's try it with the much smaller data set from the stricter match (`newdf2`). Notice that the correlation is better.

In [None]:
newdf2['SAPS per Core'].corr(newdf2['RESULTS BASE 2006'])

In [None]:
d2 = newdf2[['RESULTS BASE 2006', 'SAPS per Core', ]].copy()

X = d2.iloc[:, 0].values.reshape(-1, 1)
Y = d2.iloc[:, 1].values.reshape(-1, 1)
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
Y_pred = linear_regressor.predict(X)

plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()

# coef_ has shape (1, 1) and intercept_ has shape (1,), so index into them to get scalars
print('Coefficient/Intercept: %f %f\n' % (linear_regressor.coef_[0][0], linear_regressor.intercept_[0]))
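
To compare how well the two linear fits describe their data, one option is the coefficient of determination R², which for a straight-line fit with an intercept equals the square of the Pearson correlation. The sketch below uses `np.polyfit` on made-up numbers instead of the notebook's `LinearRegression` object; the data values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical SPECintbase2006 scores and SAPS-per-core values.
spec = np.array([20.0, 30.0, 40.0, 50.0])
saps_per_core = np.array([1400.0, 1700.0, 1800.0, 2200.0])

# Degree-1 least-squares fit: returns slope first, then intercept.
slope, intercept = np.polyfit(spec, saps_per_core, 1)
pred = slope * spec + intercept

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = ((saps_per_core - pred) ** 2).sum()
ss_tot = ((saps_per_core - saps_per_core.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# Pearson correlation of the same data, for comparison.
r = np.corrcoef(spec, saps_per_core)[0, 1]
print(slope, intercept, r_squared, r ** 2)
```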

### Polynomial fitting for potentially better correlation

We can try different degrees to find a better fit; too high a degree will result in overfitting. A degree of 3 looks empirically sufficient to model the old/slow machines properly.

In [None]:
d2 = newdf[['RESULTS BASE 2006', 'SAPS per Core', ]].copy()

X = d2.iloc[:, 0].values.reshape(-1, 1)
Y = d2.iloc[:, 1].values.reshape(-1, 1)

X_seq = np.linspace(X.min(),X.max(),300).reshape(-1,1)

# change this as an experiment
degree=3

polyreg=make_pipeline(PolynomialFeatures(degree),LinearRegression())
polyreg.fit(X,Y)

plt.figure()
plt.scatter(X,Y)
plt.plot(X_seq,polyreg.predict(X_seq),color="black")
plt.title("Polynomial regression with degree "+str(degree))
plt.show()

It is clear from the above polynomial fit that SAPS will be generally over-estimated at the low end (where SPECintbase2006 is 20 or less). This will have the tendency of inflating the SAPS rating of old/slow boxes.
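
The degree was chosen by eye here. A more systematic check, sketched below as an assumption rather than something this notebook does, is to compare degrees by cross-validated error. The data is synthetic (a noisy straight line with a fixed seed), so the scores say nothing about the real SAPS data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for (SPECintbase2006, SAPS per Core).
rng = np.random.default_rng(0)
X = np.linspace(10, 60, 40).reshape(-1, 1)
y = 20.0 * X.ravel() + 1000.0 + rng.normal(0, 50, size=40)

# Shuffled 5-fold split so each fold covers the whole X range.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for degree in (1, 2, 3, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    neg_mse = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    scores[degree] = -neg_mse.mean()  # mean squared error on held-out folds

for degree, mse in scores.items():
    print(degree, round(mse, 1))
```

A degree whose held-out error is clearly higher than that of a lower degree is a sign of overfitting.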

In [None]:
# np.polyfit returns coefficients from the highest power down,
# e.g. for degree 3: y = ax^3 + bx^2 + cx + d
coeffs = np.polyfit(d2['RESULTS BASE 2006'], d2['SAPS per Core'], degree)
print(coeffs)

# formula for SAPS per core from SPECintbase2006
def spec2saps(spec: float) -> float:
    # np.polyval evaluates the fitted polynomial for whatever degree was chosen above
    saps = float(np.polyval(coeffs, spec))

    # the lowest SAPS per core on the official benchmark is 145 (Sun T2000),
    # so use that as a floor
    return max(saps, 145.0)

In [None]:
# Sun M3000 (actual SAPS/core = 1032)
print (spec2saps(13.58))

## Conclusion

There is a usable correlation between SAPS (per core) and SPECintbase2006, even when only doing a simple linear regression, although it is not very strong on the larger data set and improves with the stricter match. With the larger data set, the best-fit line is defined approximately by:

**SAPS per Core = (20.57 * SPECintbase2006) + 1061**

The above formula **will not work** for really old systems with very low SPECintbase2006. If we posit a pathological system with **ZERO** SPECintbase2006, the formula would still predict a SAPS per core of 1061.

Example: Sun M3000 with 4 cores reports 4130 SAPS and 1032 SAPS/core. Average SPECintbase2006 for this system is 13.58. The above formula predicts a SAPS per core (based on SPECintbase2006) of 1340 using the linear regression. This is much higher than the actual SAPS per core of 1032. Meanwhile, if we use a polynomial fit with degree 3, the predicted SAPS is 890 which is less than actual.
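
The arithmetic behind the Sun M3000 example can be reproduced in a few lines. The coefficients and the actual and polynomial figures are the ones quoted above; the function and variable names are assumptions introduced for this sketch.

```python
def linear_saps_per_core(spec: float) -> float:
    """Best-fit line from the larger data set, as quoted in the conclusion."""
    return 20.57 * spec + 1061

actual = 4130 / 4                                  # 1032.5 SAPS per core
linear_pred = linear_saps_per_core(13.58)          # about 1340
poly_pred = 890.0                                  # degree-3 prediction quoted above

# Relative error of each prediction against the actual value, in percent.
linear_err_pct = (linear_pred - actual) / actual * 100
poly_err_pct = (poly_pred - actual) / actual * 100

print(round(linear_pred, 1), round(linear_err_pct, 1), round(poly_err_pct, 1))
# prints: 1340.3 29.8 -13.8
```

For this system the linear formula over-estimates by roughly 30%, while the degree-3 fit under-estimates by roughly 14%.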

