## SVC from SAS® Viya® on Higgs Data   

### About the [Higgs data set](https://archive.ics.uci.edu/dataset/280/higgs)
The original data was generated with Monte Carlo simulations. The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21; physicists derived these higher-level features to help discriminate between the two classes: 1 for signal and 0 for background. This example uses a 0.1% sample of the original data.

### Import necessary libraries and get the data:

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

### Load the dataset

In [None]:
workspace=f'{os.path.abspath("")}/../data/'
higgs_df = pd.read_csv(workspace+'higgs.csv')

In [None]:
# View dimension of the dataset
higgs_df.shape

In [None]:
print(higgs_df.info())

The higgs_df DataFrame has 10,984 rows and 29 columns. All columns except label have a data type of float64, which indicates they are all interval-level variables. Additionally, there are no missing values in the data.
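
The statements about data types and missing values can also be checked programmatically instead of being read off the `info()` output. A minimal sketch, assuming a small stand-in DataFrame (`demo_df` and its values are made up for illustration; in this notebook you would use `higgs_df` instead):

```python
import pandas as pd

# Hypothetical stand-in for higgs_df: an integer label plus float64 features.
demo_df = pd.DataFrame({
    "label": [1, 0, 1, 0],
    "m_jj": [0.9, 1.1, 0.8, 1.3],
    "m_lv": [1.0, 1.0, 0.9, 1.2],
})

# Count missing values per column; all zeros means the data is complete.
missing_per_column = demo_df.isna().sum()

# Check that every column except the label is stored as float64.
feature_dtypes = demo_df.drop(columns=["label"]).dtypes
all_float64 = all(dtype == "float64" for dtype in feature_dtypes)

print(missing_per_column)
print("All features are float64:", all_float64)
```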

### Split the data into predictor and response dataframes

In [None]:
X_df = higgs_df.drop(['label'], axis=1)
y = higgs_df['label']

### Calculate target level percentages

In [None]:
percentages = y.value_counts() / len(y) * 100
print(percentages)

### Let's view the distribution of the data

In [None]:
numeric_X_df = X_df.select_dtypes(exclude=['object'])
numeric_X_df.describe().T

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, axs = plt.subplots(ncols=7, nrows=4, figsize=(20, 10))
index = 0
axs = axs.flatten()
for k,v in numeric_X_df.items():
    sns.boxplot(y=k, data=numeric_X_df, ax=axs[index])
    index += 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)

The plots above indicate that some columns have outliers. Let's determine the percentage of outliers in each column.

In [None]:
for k, v in numeric_X_df.items():
    # Interquartile range (IQR) of the column
    q1 = v.quantile(0.25)
    q3 = v.quantile(0.75)
    iqr = q3 - q1
    # Values at or beyond 1.5 * IQR from the quartiles are counted as outliers
    v_col = v[(v <= q1 - 1.5 * iqr) | (v >= q3 + 1.5 * iqr)]
    perc = len(v_col) * 100.0 / len(v)
    print("Column %s outliers = %.2f%%" % (k, perc))

The columns **m_jj** and **m_lv** have a higher percentage of outliers than the other columns.

### Remove outliers from the m_jj and m_lv columns

In [None]:
# Keep rows whose m_jj value lies within 3 standard deviations of the column mean
m_jj = higgs_df["m_jj"]
higgsWithoutOutliers = higgs_df[np.abs(m_jj - m_jj.mean()) <= 3 * m_jj.std()]

# Apply the same rule to m_lv, using statistics from the already-filtered rows
m_lv = higgsWithoutOutliers["m_lv"]
higgsWithoutOutliers = higgsWithoutOutliers[np.abs(m_lv - m_lv.mean()) <= 3 * m_lv.std()]

print("Shape of the dataframe before outliers removed: ", higgs_df.shape)
print("Shape of the dataframe after outliers removed: ", higgsWithoutOutliers.shape)
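
Note that the cell above filters rows with a three-standard-deviation rule, not the 1.5 × IQR rule used earlier to count outliers. The same sequential filter can be wrapped in a small helper. This is only a sketch: `drop_beyond_k_sigma` is a hypothetical name, not part of pandas or `sasviya`, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd

def drop_beyond_k_sigma(df, columns, k=3.0):
    """Hypothetical helper: drop rows farther than k standard deviations
    from the column mean, one column at a time."""
    out = df
    for col in columns:
        # Mean and std are recomputed on the rows that survived the previous filter.
        deviation = (out[col] - out[col].mean()).abs()
        out = out[deviation <= k * out[col].std()]
    return out

# Synthetic demo data with one extreme value planted in each column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "m_jj": rng.normal(1.0, 0.1, 200),
    "m_lv": rng.normal(1.0, 0.1, 200),
})
demo.loc[0, "m_jj"] = 50.0
demo.loc[1, "m_lv"] = 50.0

cleaned = drop_beyond_k_sigma(demo, ["m_jj", "m_lv"])
print("Before:", demo.shape, "After:", cleaned.shape)
```

Because each column's statistics are computed after the previous column has been filtered, the result can depend on the order of `columns`, just as it does in the cell above.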

### Examine the correlation between the variables using a correlation matrix

In [None]:
X_df = higgsWithoutOutliers.drop(['label'], axis=1)
y = higgsWithoutOutliers['label']

plt.figure(figsize=(20, 10))
sns.heatmap(X_df.corr().abs(),  annot=False, cmap="coolwarm")
plt.show()

The correlation matrix heatmap shows that most columns are only weakly correlated, except for 'm_wbb' and 'm_wwbb', whose absolute correlation is greater than 0.8.
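
Reading pairs off a 28 × 28 heatmap is error-prone, so it can help to list the strongly correlated pairs directly. A minimal sketch, assuming a hypothetical helper `high_corr_pairs` and synthetic demo data (the column names are reused from this data set only for illustration; in this notebook you would pass `X_df`):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Hypothetical helper: list (col_a, col_b, |r|) for every pair of columns
    whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return [(a, b, float(r)) for (a, b), r in pairs[pairs > threshold].items()]

# Synthetic demo: the second column is a noisy copy of the first, the third is independent.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "m_wbb": base,
    "m_wwbb": base + 0.1 * rng.normal(size=500),
    "m_bb": rng.normal(size=500),
})

strong_pairs = high_corr_pairs(demo, threshold=0.8)
print(strong_pairs)
```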

### Split the data into training and test sets 

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_df,y,test_size=0.25,random_state=3)

### Train the SAS Viya machine learning Support Vector Classifier with kernel='poly' and method='ipoint'

By default, `SVC` automatically scales the features to the [0,1] range. For details about using the `SVC` class, see the [SVC documentation](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=p1udx0532v47xfn1l3ix3scjh8uj.htm).
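
For intuition about what scaling to the [0,1] range means, the following sketch shows generic min-max scaling in NumPy. It illustrates the general technique only: this notebook does not show how `SVC` implements its scaling internally, and both the `min_max_scale` helper and its handling of constant columns are assumptions made for this example.

```python
import numpy as np

def min_max_scale(X):
    """Illustrative helper: rescale each column of X linearly so that its
    minimum maps to 0 and its maximum maps to 1."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: constant columns are mapped to 0 to avoid dividing by zero.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 40.0]])
X_scaled = min_max_scale(X_demo)
print(X_scaled)
```

Because the scaling happens inside the estimator by default, the notebook passes the unscaled `X_train` and `X_test` directly to `fit` and `predict`.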

In [None]:
from sasviya.ml.svm import SVC 

model = SVC(kernel='poly', method='ipoint')

# train the model on train set
model.fit(X_train, y_train)
  
# predict class labels for the test set
y_pred = model.predict(X_test)

In [None]:
model.get_params()

In [None]:
# Compute train accuracy and format it as a percentage with 2 decimal places
print("Train Accuracy = {:.2%}".format(model.score(X_train, y_train)))

In [None]:
# Compute test accuracy and format it as a percentage with 2 decimal places
print("Test Accuracy = {:.2%}".format(model.score(X_test, y_test)))

In [None]:
print(classification_report(y_test, y_pred))

The model achieves a prediction accuracy of approximately 65% on the test set.
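
An accuracy figure is easier to interpret next to the accuracy of always predicting the most frequent class. The sketch below computes both on made-up labels (`y_true_demo` and `y_pred_demo` are illustrative values, not results from this notebook; here you would use `y_test` and `y_pred`):

```python
import numpy as np

# Made-up binary labels and predictions for illustration only.
y_true_demo = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 1, 1, 0, 1, 1, 1])

# Accuracy: fraction of predictions that match the true labels.
accuracy = float(np.mean(y_true_demo == y_pred_demo))

# Majority-class baseline: accuracy of always predicting the most frequent label.
class_counts = np.bincount(y_true_demo)
baseline = float(class_counts.max() / class_counts.sum())

print("Accuracy = {:.2%}".format(accuracy))
print("Majority-class baseline = {:.2%}".format(baseline))
```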

In [None]:
# Use a distinct name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
class_names = [0,1]
fig,ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks,class_names)
plt.yticks(tick_marks,class_names)
sns.heatmap(pd.DataFrame(cm), annot = True, cmap = 'Pastel1_r', fmt = 'g')
ax.xaxis.set_label_position('top')
plt.tight_layout()
plt.title('Confusion Matrix for sasviya SVC')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()