# Emulator: Support Vector Machine (scikit-learn)

This notebook shows how to emulate the two-point correlation function using a Support Vector Machine (SVM).

SVM is a supervised learning algorithm that can be used for both classification and regression tasks. This notebook uses SVM for regression.

Support Vector Regression (SVR) uses the same principles as SVM for classification, with a few minor differences. The margin is not used to separate two classes; instead, it defines a region of interest around the regression function (the $\epsilon$-tube), within which errors are not penalized.

![../images/svr.png](../images/svr.png)

The main idea is to minimize the error by finding the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.
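As a minimal sketch of what "part of the error is tolerated" means, the snippet below evaluates the $\epsilon$-insensitive loss that SVR is built around: residuals smaller than $\epsilon$ cost nothing, and larger ones are penalized linearly. The function name and the toy values are made up for illustration and are not part of this notebook's data.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the epsilon-tube, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# Toy values (hypothetical, for illustration only)
y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.2, 0.5])

loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(loss)  # the first prediction lies inside the tube, so its loss is zero
```

A larger `epsilon` widens the tube, so more training points contribute no loss and the fitted function tends to be smoother.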

SVM often performs well on small training samples compared with other methods.

For more information, see: https://www.saedsayad.com/support_vector_machine_reg.htm

#### Index<a name="index"></a>
1. [Import packages](#imports)
2. [Load data](#loadData)
    1. [Load train data](#loadTrainData)
    2. [Load test data](#loadTestData)
3. [Visualize dataset](#visualizeData)
    1. [Data structure](#dataStructure)
    2. [Plot datasets](#plotData)
4. [Emulator method](#emulator)
    1. [Scale data](#scaleData)
    2. [Train emulator](#trainEmu)
    3. [Predict on test data](#predEmu)
    4. [Plot results](#plotEmu)
    5. [Improving the emulator](#improveEmu)

## 1. Import packages<a name="imports"></a>

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pickle

from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

## 2. Load data<a name="loadData"></a>

Read the training and test data from `.pickle` files:

### 2.1. Load train data<a name="loadTrainData"></a>

In [None]:
import pandas as pd

df = pd.read_pickle('../data/cosmology_train.pickle')

The cosmology dataset contains the cosmological parameters ($\Omega_m$, $\sigma_8$, $\Omega_b$) as inputs and the correlation function $\xi(r)$ as output. The correlation function is measured at 10 separation values $r$.

### 2.2. Load test data<a name="loadTestData"></a>

In [None]:
df_test = pd.read_pickle('../data/cosmology_test.pickle')

## 3. Visualizing the dataset<a name="visualizeData"></a>

### 3.1 Data Structure<a name="dataStructure"></a>

The loaded object stores the inputs, the outputs, and the 10 separation values $r$ under separate keys:

In [None]:
df_in = df['input_data']
df_in

In [None]:
df_out = df['output_data']
df_out

In [None]:
rvals = df['extra_input']['r_vals']
rvals

In [None]:
ys_train = df_out[[r'$\xi(r_0)$', r'$\xi(r_1)$', r'$\xi(r_2)$', r'$\xi(r_3)$',
       r'$\xi(r_4)$', r'$\xi(r_5)$', r'$\xi(r_6)$', r'$\xi(r_7)$', r'$\xi(r_8)$',
       r'$\xi(r_9)$']].to_numpy()

xs_train = df_in[[r'$\Omega_m$', r'$\sigma_8$', r'$\Omega_b$']].to_numpy()

In [None]:
print('x shape:',xs_train.shape)
print('y shape:',ys_train.shape)

### 3.2 Plot datasets<a name="plotData"></a>

In [None]:
plt.figure(figsize=(8,6))
ys_train_plot = ys_train.copy()
np.random.shuffle(ys_train_plot) # shuffle so that color order isn't weird
plt.plot(rvals, ys_train_plot.T, alpha=0.8)

plt.xlabel('$r$',fontsize=18)
plt.ylabel(r'$\xi(r)$',fontsize=18)

Let's do the same for our test set:

In [None]:
df_test_in = df_test['input_data']
df_test_out = df_test['output_data']

ys_test = df_test_out[[r'$\xi(r_0)$', r'$\xi(r_1)$', r'$\xi(r_2)$', r'$\xi(r_3)$',
       r'$\xi(r_4)$', r'$\xi(r_5)$', r'$\xi(r_6)$', r'$\xi(r_7)$', r'$\xi(r_8)$',
       r'$\xi(r_9)$']].to_numpy()

xs_test = df_test_in[[r'$\Omega_m$', r'$\sigma_8$', r'$\Omega_b$']].to_numpy()

In [None]:
n_test = xs_test.shape[0]
n_values = ys_test.shape[1]
n_params = xs_test.shape[1]
print("Number of datapoints:", n_test)
print("Number of input parameters:", n_params)
print("Number of output values:", n_values)

In [None]:
plt.figure(figsize=(8,6))
plt.plot(rvals, ys_test.T, alpha=0.8)
plt.xlabel('$r$')
plt.ylabel(r'$\xi(r)$')

## 4. Emulator method<a name="emulator"></a>

SVM method, based on this scikit-learn example: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html#sphx-glr-auto-examples-svm-plot-svm-regression-py

In [None]:
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt

# Training inputs/outputs and test inputs
X = xs_train
y = ys_train

Xt = xs_test

# Candidate regression models, one per kernel (instantiated here, not yet fitted).
# `degree` only affects the 'poly' kernel; `gamma` is not used by the 'linear' kernel.
svr_rbf = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.01)
svr_lin = SVR(kernel='linear', C=100, epsilon=.1)
svr_poly = SVR(kernel='poly', C=100, gamma='auto', degree=12, epsilon=.1,
               coef0=1)
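To make the role of `gamma` in the `rbf` kernel concrete, here is a small sketch that computes $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$ by hand and compares it with scikit-learn's `rbf_kernel`. The helper `rbf_kernel_manual` and the random arrays are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_kernel_manual(A, B, gamma):
    """RBF kernel matrix between the rows of A and the rows of B."""
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * sq_dists)

# Random stand-in points with 3 features, like the 3 cosmological parameters
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
B = rng.normal(size=(4, 3))

K_manual = rbf_kernel_manual(A, B, gamma=0.1)
K_sklearn = rbf_kernel(A, B, gamma=0.1)
print(np.allclose(K_manual, K_sklearn))
```

A larger `gamma` makes the kernel decay faster with distance, so each training point influences a smaller neighbourhood of parameter space.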

In [None]:
svr_rbf

In [None]:
print(X.shape)
print(y.shape)

### 4.1 Scale data<a name="scaleData"></a>

In [None]:
scaler = StandardScaler()
scaler.fit(xs_train)

In [None]:
xs_train = scaler.transform(xs_train)
xs_test = scaler.transform(xs_test)

y_mean = np.mean(ys_train, axis=0)
ys_train = ys_train/y_mean
ys_test = ys_test/y_mean
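The scaling step can be checked in isolation. The sketch below uses random stand-in arrays (not the cosmology data; all names ending in `_demo` are hypothetical) to show that the scaler is fitted on the training inputs only, that the outputs are divided by the training mean of each bin, and that multiplying by the same mean recovers the original values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data with the same layout: 3 inputs, 10 output bins
rng = np.random.default_rng(42)
xs_train_demo = rng.uniform(size=(50, 3))
xs_test_demo = rng.uniform(size=(10, 3))
ys_train_demo = rng.uniform(1.0, 2.0, size=(50, 10))

# Fit the scaler on the training inputs only, then apply it to both sets
scaler_demo = StandardScaler().fit(xs_train_demo)
xs_train_scaled = scaler_demo.transform(xs_train_demo)
xs_test_scaled = scaler_demo.transform(xs_test_demo)

# Normalize each output bin by its training mean
y_mean_demo = ys_train_demo.mean(axis=0)
ys_train_scaled = ys_train_demo / y_mean_demo

# Undo the normalization to get back to physical units
ys_recovered = ys_train_scaled * y_mean_demo
print(np.allclose(ys_recovered, ys_train_demo))
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing.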

### 4.2 Train emulator<a name="trainEmu"></a>

In [None]:
def do_regression(kwargs):
    """Fit one SVR per output value and score each one (R^2) on the test set."""
    regrs = np.empty(n_values, dtype=object)
    scores = np.empty(n_values, dtype=object)
    for j in range(n_values):
        # Train and evaluate on the j-th separation bin only
        ys_train_r = ys_train[:,j]
        ys_test_r = ys_test[:,j]
        regr = SVR(**kwargs).fit(xs_train, ys_train_r)
        score = regr.score(xs_test, ys_test_r)
        print(f"Value {j} score:", score)
        regrs[j] = regr
        scores[j] = score
    print()
    return regrs, scores

In [None]:
kwargs = {'kernel':'rbf', 'epsilon':5e-4, 'C':11, 'gamma':0.09,'tol':1e-6}
r,s = do_regression(kwargs)
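`SVR` predicts a single target, which is why the training loop above fits one model per output value. As an alternative sketch (not what this notebook does), scikit-learn's `MultiOutputRegressor` wraps the same per-output fitting behind one estimator. The synthetic data below is made up for illustration, and the hyperparameters are simply copied from the `kwargs` above.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Synthetic stand-in data: 3 inputs, 4 smooth output columns
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(80, 3))
Y_demo = np.column_stack([np.sin(X_demo[:, 0]) + 0.5 * k * X_demo[:, 1]
                          for k in range(4)])

# One independent SVR is fitted per output column
multi_svr = MultiOutputRegressor(
    SVR(kernel='rbf', epsilon=5e-4, C=11, gamma=0.09, tol=1e-6))
multi_svr.fit(X_demo, Y_demo)

Y_pred = multi_svr.predict(X_demo)
print(Y_pred.shape, len(multi_svr.estimators_))
```

The explicit loop remains useful when the score of each bin needs to be inspected separately or when bins should get different hyperparameters.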

In [None]:
regrs = r

### 4.3 Predict on test data<a name="predEmu"></a>

In [None]:
ys_predict = np.zeros((n_test, n_values))
for j in range(n_values):  
    ys_predict_r = regrs[j].predict(xs_test)
    ys_predict[:,j] = ys_predict_r

In [None]:
n_plot = int(0.2*n_test)
idxs = np.random.choice(np.arange(n_test), n_plot, replace=False) # sample without repeats
color_idx = np.linspace(0, 1, n_plot)
colors = np.array([plt.cm.rainbow(c) for c in color_idx])

In [None]:
ys_train = ys_train*y_mean
ys_test = ys_test*y_mean
ys_predict = ys_predict*y_mean

### 4.4 Plot results<a name="plotEmu"></a>

In [None]:
plt.figure(figsize=(8,6))
for i in range(n_plot):
    ys_test_plot = ys_test[idxs,:][i]
    ys_predict_plot = ys_predict[idxs][i]
    if i==0:
        label_test = 'truth'
        label_predict = 'emu_prediction'
    else:
        label_test = None
        label_predict = None
    plt.plot(rvals[:n_values], ys_test_plot, alpha=0.8, label=label_test, marker='o', markerfacecolor='None', ls='None', color=colors[i])
    plt.plot(rvals[:n_values], ys_predict_plot, alpha=0.8, label=label_predict, color=colors[i])
plt.xlabel('$r$')
plt.ylabel(r'$\xi(r)$')
plt.title('SVM')
plt.legend()

In [None]:
plt.figure(figsize=(8,6))
for i in range(n_plot):
    ys_test_plot = ys_test[idxs,:][i]
    ys_predict_plot = ys_predict[idxs][i]
    frac_err = (ys_predict_plot-ys_test_plot)/ys_test_plot
    plt.plot(rvals, frac_err, alpha=0.8, color=colors[i])
plt.axhline(0.0, color='k')
plt.xlabel('$r$')
plt.ylabel(r'fractional error')
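Besides plotting individual curves, the fractional error can be condensed into one number per separation bin. The helper below is a hypothetical sketch using tiny made-up arrays; in the notebook it could be applied to `ys_test` and `ys_predict` once they are back in physical units.

```python
import numpy as np

def fractional_error_summary(y_true, y_pred):
    """Mean and maximum absolute fractional error for each output column."""
    frac_err = (y_pred - y_true) / y_true
    abs_err = np.abs(frac_err)
    return abs_err.mean(axis=0), abs_err.max(axis=0)

# Tiny made-up example: 2 samples, 2 bins
y_true_demo = np.array([[1.0, 2.0],
                        [2.0, 4.0]])
y_pred_demo = np.array([[1.1, 2.0],
                        [1.8, 4.4]])

mean_err, max_err = fractional_error_summary(y_true_demo, y_pred_demo)
print("mean |frac err| per bin:", mean_err)
print("max  |frac err| per bin:", max_err)
```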

### 4.5 Improving the emulator<a name="improveEmu"></a>

Let's do a grid search to find the best hyperparameters.

In [None]:
ys_train = ys_train/y_mean
ys_test = ys_test/y_mean

#### Which kernel is best?

In [None]:
from sklearn.metrics import make_scorer
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

scoring = make_scorer(r2_score)
param_grid = [{'kernel': ['linear'], 'gamma': [7e-2,9e-2,11e-2],
               'C': [ 1, 7, 10], 'epsilon':[1e-3,1e-4]},
             {'kernel': ['poly'], 'gamma': [7e-2,9e-2,11e-2],
               'C': [ 1, 7, 10], 'epsilon':[1e-3,1e-4]},
             {'kernel': ['rbf'], 'gamma': [7e-2,9e-2,11e-2],
               'C': [ 1, 7, 10], 'epsilon':[1e-3,1e-4]}]

# param_grid = [{'kernel': ['rbf'], 'gamma': [7e-2,9e-2,11e-2],
#                'C': [ 1, 3, 7, 10]}]

g_cv = GridSearchCV(SVR(), param_grid, scoring=scoring, refit=True, cv=10)
g_cv.fit(xs_train, ys_train[:,7])
score = r2_score(ys_test[:,7], g_cv.predict(xs_test))

print("Best parameters set found on development set:")
print()
print('%.5f'%score,':',g_cv.best_params_)
print()
print()
print("Grid scores on development set:")
print()
means = g_cv.cv_results_['mean_test_score']
stds = g_cv.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, g_cv.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))
print()
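One possible refinement of the grid search, sketched here under assumptions rather than taken from the notebook, is to wrap the scaler and the regressor in a `Pipeline`, so that the scaling is refitted inside every cross-validation fold. The data below is synthetic, the step names `scale` and `svr` are arbitrary, and the grid values mirror the `rbf` entry above. Leaving `scoring` unset makes `GridSearchCV` use the estimator's own `score` method, which is $R^2$ for `SVR`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: 3 inputs, one smooth target
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(60, 3))
y_demo = 1.0 + X_demo[:, 0] - 0.5 * X_demo[:, 1] ** 2

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([('scale', StandardScaler()),
                 ('svr', SVR(kernel='rbf'))])

# Parameters of a pipeline step are addressed as '<step>__<parameter>'
param_grid_demo = {'svr__gamma': [7e-2, 9e-2, 11e-2],
                   'svr__C': [1, 7, 10],
                   'svr__epsilon': [1e-3, 1e-4]}

search = GridSearchCV(pipe, param_grid_demo, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```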

Answer: `rbf` performs best, followed by `linear` and `poly`.

#### What is the best setup for all bins?

In [None]:
from sklearn.metrics import make_scorer
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

scoring = make_scorer(r2_score)
param_grid = [{'kernel': ['rbf'], 'gamma': [5e-2,7e-2,9e-2],
               'C': [ 7, 10, 13, 20], 'epsilon':[5e-4,1e-4]}]

print("Best parameters set found on development set:")
for ix in range(n_values):
    g_cv = GridSearchCV(SVR(), param_grid, scoring=scoring, refit=True, cv=10)
    g_cv.fit(xs_train, ys_train[:,ix])
    score = r2_score(ys_test[:,ix], g_cv.predict(xs_test))
    print()
    print('radii bin %i'%ix)
    print('%.4f'%score,':',g_cv.best_params_)
    print()

It is best to use a single mean value of each parameter across all bins: `epsilon = 0.05`, `gamma = 0.09`, `C = 7`.