### This is a simple notebook to train a Support Vector Machine to discriminate between two types of collisional events. This week, we'll deal with data preprocessing, try out a simple Linear SVM, and diagnose its performance.

It accompanies Chapter 4 of the book.

Data for this exercise were kindly provided by [Sascha Caron](https://www.nikhef.nl/~scaron/).

Copyright: Viviana Acquaviva (2023)

Modifications by Julieta Gruszko (2025)

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)




### Group names

In [None]:
import numpy as np
import itertools
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.svm import SVC, LinearSVC # New algorithm!
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict, cross_validate
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn import metrics

In [None]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 100)
rc('text', usetex=False)

### The first part of this notebook walks through the manipulation I did to get the data in the format we need and to select a random sample to keep computation times more manageable.

This csv data set is a bit tricky to deal with: it has two different delimiters, ';' and ','.

The semicolons separate the first 5 columns, which apply to the full event, and are also used to separate the lists of info about each of the products detected. So using the semicolons as a separator would give you:
$$\texttt{numID; processID; weight; MET; METphi; P1\_info; P2\_info; P3\_info; ...}$$

Commas are used to separate info about each of the products, so for each of the "Pn_info" columns separated by semicolons, there's a comma-separated list of the label and 4-momentum values:
$$\texttt{Pn\_info = Pn\_label, Pn\_E, Pn\_pt, Pn\_eta, Pn\_phi}$$

First, we'll make an array for the column labels, allowing for up to 19 products per event.
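
As a sketch, the same labels can also be built with plain Python lists, which makes the expected total easy to check: 5 event-level columns plus 5 columns for each of 19 products. The names `event_cols`, `product_fields`, and `names_list` are illustrative and are not used elsewhere in this notebook.

```python
# Event-level columns, followed by five columns per detected product
event_cols = ['numID', 'processID', 'weight', 'MET', 'METphi']
product_fields = ['type', 'E', 'pt', 'eta', 'phi']

# One label per (product index, field) pair, in the order they appear in the file
names_list = event_cols + [f'P{i}_{field}' for i in range(19) for field in product_fields]

print(len(names_list))    # 5 + 19 * 5 columns in total
print(names_list[5:10])   # the columns describing the first product
```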

In [None]:
# make an array of column labels
names = np.array(['numID', 'processID', 'weight', 'MET', 'METphi'])
names = np.append(names, [['P'+str(i)+'_type', 'P'+str(i)+'_E', 'P'+str(i)+'_pt', 'P'+str(i)+'_eta', 'P'+str(i)+'_phi'] for i in range(19)])
print(names)


Pandas' Python parsing engine accepts a regular expression as the delimiter, so we can still read in the csv data file in one line.
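
To see how the mixed delimiter behaves, here is a minimal sketch on two made-up events held in memory. The values, the `other` label, and the names `toy_text`, `toy_names`, and `toy` are assumptions for illustration only. The second event has fewer products, so its trailing columns are filled with NaN.

```python
import io
import pandas as pd

# Two invented events: the first has two products, the second only one
toy_text = (
    "0;4top;0.001;52000;-1.2;j,250000,200000,0.5,1.0;b,120000,100000,-0.3,2.1\n"
    "1;other;0.002;31000;0.7;j,90000,80000,1.1,-2.0\n"
)

toy_names = ['numID', 'processID', 'weight', 'MET', 'METphi']
toy_names += [f'P{i}_{field}' for i in range(2) for field in ['type', 'E', 'pt', 'eta', 'phi']]

# ';|,' is a regular expression meaning "split on a semicolon OR a comma";
# regular-expression delimiters require engine='python'
toy = pd.read_csv(io.StringIO(toy_text), delimiter=';|,', engine='python', names=toy_names)

print(toy.shape)
print(toy[['processID', 'P0_type', 'P1_type', 'P1_E']])
```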

In [None]:
df = pd.read_csv('../Data/TrainingValidationData.csv', delimiter=';|,', engine = 'python', names=names)
print(df.columns)

In [None]:
df.head()

We can use the $\texttt{describe}$ function to look at the numerical columns. The ATLAS detector calorimeter accepts tracks with $ -4.9 < \eta < 4.9$ and any value of azimuthal angle ($-\pi < \phi < \pi$). You can see nearly the full range of allowed values for these columns present in the P0 track. 

If you scroll all the way over to the final columns, you can see that we have some empty ones. We'll drop those below.

In [None]:
df.describe()

In [None]:
X = df.drop(['numID', 'processID', 'weight'], axis = 1) #drop indices, labels, and weights (used to weigh statistics in simulation studies of spectra)
X = X.drop(['P18_type', 'P18_E', 'P18_pt', 'P18_eta', 'P18_phi'], axis = 1) #drop empty columns
X.describe()

In [None]:
len(df.columns)

In [None]:
X.head() 

### Saving subset of the data to a new file

We'll select 5000 random instances to reduce the size of the data set, reset the indices of both the features and labels, and store both to csv files. That will let us load the same data more easily if we want to work with it again (e.g. in next week's studio).


First, we'll generate a random list of event indices to select 5000 events. We'll set the seed so everyone is using the same subset.
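
The short sketch below illustrates two properties of this kind of selection on a small, hypothetical index range (`n_events` is a made-up size). Re-seeding reproduces the same draw, and `np.random.choice` samples with replacement unless `replace=False` is passed, so repeated indices are possible by default.

```python
import numpy as np

n_events = 20  # hypothetical number of events in the full data set

# Re-seeding before each draw gives identical selections
np.random.seed(10)
first = np.random.choice(n_events, 8)
np.random.seed(10)
second = np.random.choice(n_events, 8)
print(np.array_equal(first, second))

# replace=False guarantees that every selected index is distinct
unique_sel = np.random.choice(n_events, 8, replace=False)
print(len(np.unique(unique_sel)) == len(unique_sel))
```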

In [None]:
np.random.seed(10)

sel = np.random.choice(df.shape[0], 5000) # note: samples with replacement by default, so a few events may repeat

Now, we'll select those rows from the feature data frame and save the subset to a file.

In [None]:
features = X.iloc[sel,:]

print(features.shape)
print(features.columns)

# reset index
features.reset_index(drop=True, inplace=True)
features.head()

# Export the feature data to a file
features.to_csv('../Data/ParticleID_features.csv', columns=features.columns, index_label= 'ID')

#### Now, the labels

In [None]:
#Select the labels
y = df.processID[sel].values # values makes it an array
print(y)
#Export labels to file
np.savetxt('../Data/ParticleID_labels.txt', y, fmt = '%s')

## Now we're ready to start!

Read in features and labels.

In [None]:
features = pd.read_csv('../Data/ParticleID_features.csv', index_col='ID')

In [None]:
features.head()

In [None]:
features.shape

In [None]:
y = np.genfromtxt('../Data/ParticleID_labels.txt', dtype = str)

In [None]:
y

#### We need to turn categorical (string-type) labels into a numerical array, e.g. 0/1.

scikit-learn has a nice preprocessing tool we can use for this.
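
As a quick illustration on toy labels (`other` is a placeholder class name, not necessarily the one in our data), `LabelEncoder` assigns integers according to the sorted order of the class names, and the fitted `classes_` attribute shows which name received which integer. For a binary problem, `1 - encoded` swaps the two labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

toy_labels = np.array(['other', '4top', 'other', 'other', '4top'])  # 'other' is a placeholder name

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)

print(toy_le.classes_)   # classes in sorted order; the position is the assigned integer
print(toy_encoded)

toy_flipped = 1 - toy_encoded  # swap 0 and 1 for a binary problem
print(toy_flipped)
```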

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() #turns categorical labels into integers 0 ... N-1

In [None]:
y

In [None]:
y = le.fit_transform(y)

In [None]:
y 

Our transformer assigns integers following the sorted order of the class names, so 4top ended up encoded as 0. We actually want 4top to be the positive label (1), so we'll flip the labels:

In [None]:
target = np.abs(y - 1)

In [None]:
target # Happier now.

#### Let's take a look at these features, using the "describe" method.

In [None]:
features.describe() #Note that this automatically excludes non-numerical type columns

### Important:

Looking at the "count" row, we can see that the whole data set has 5,000 rows, but some columns are present only for a fraction of them. This is because of the variable number of products in each collision.
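
A compact way to quantify this, sketched on a made-up frame (the values and the name `toy` are assumptions), is to take the mean of `isna()` for each column, which gives the fraction of rows where that column is missing.

```python
import numpy as np
import pandas as pd

# Invented energies for three products; later products are missing more often
toy = pd.DataFrame({
    'P0_E': [250.0, 90.0, 410.0, 130.0],
    'P1_E': [120.0, np.nan, 300.0, 75.0],
    'P2_E': [np.nan, np.nan, 55.0, np.nan],
})

missing_fraction = toy.isna().mean()  # fraction of NaN entries per column
print(missing_fraction)
print(toy.count())                    # the same information as the "count" row of describe()
```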

#### Option 1: Only consider the missing energy features and the 4-momentum values of the first four products, so we have limited imputing/manipulation problems.

In our data set, the products are ordered by their energy, so choosing the first 4 should give us a lot of the relevant information.

For the moment, we'll work just with the numerical features. Next week, we'll see some options for incorporating the track types as additional features.

We have a trade-off between keeping more features, but having a more severe missing data/imputing problem, or keeping fewer features, but dealing with a simpler imputing problem. We are choosing the latter.

In [None]:
features_lim = features[['MET', 'METphi', 'P0_E', 'P0_pt', 'P0_eta', 'P0_phi', 'P1_E', 'P1_pt', 'P1_eta', 'P1_phi', 'P2_E', 'P2_pt', 'P2_eta', 'P2_phi', 'P3_E', 'P3_pt', 'P3_eta', 'P3_phi']]

In [None]:
features_lim.head()

In [None]:
features_lim.describe() #This automatically excludes non-numerical type columns, and missing values/NaNs are not counted.

There are still some feature columns with different counts! This means there are NaN values. Let's replace them with 0 for the moment.

In [None]:
features_lim = features_lim.fillna(0) #Fill with 0 everywhere there is a NaN

Note: this is the simplest but worst possible choice - imputing a constant value skews the model :D One step up would be to impute the mean or median of each column. However, because only a limited number of instances have missing data, the choice of imputing strategy doesn't matter too much.
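
For comparison, here is a minimal sketch of median imputation with scikit-learn's `SimpleImputer` on a made-up frame (the values and variable names are assumptions). Filling with 0 drags the column mean down, while the median keeps the filled value inside the observed range.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'P3_E': [100.0, 200.0, np.nan, 300.0],
                    'P3_pt': [80.0, np.nan, 120.0, 160.0]})

# Constant imputation, as done in this notebook
zero_filled = toy.fillna(0)

# Median imputation: each NaN is replaced by the median of its own column
imputer = SimpleImputer(strategy='median')
median_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(zero_filled.mean())
print(median_filled.mean())
```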

#### Let's see what "describe" says now.

In [None]:
features_lim.describe()

Yay - we now have consistent sizes, so we can use these as feature arrays, BUT be mindful of possible negative impacts of our imputing strategies.

### Let's move on to a quick exploration of labels and benchmarking.

In [None]:
np.sum(target)/len(target) #distribution 

84\% in the negative label, 16\% in the positive label. 

This means that a classifier that puts everything in the negative class will have 84\% accuracy.

How about a random classifier that just assigns a random value according to class distribution?

In [None]:
#Numerical solution

acc=0
for i in range(1000):
    x = np.random.choice(target,5000)
    acc += metrics.accuracy_score(target,x)
print(acc/1000)

#Analytic solution 

print(0.8378*(0.8378) + 0.1622*0.1622)

### Question: 
Describe what the "numerical solution" code is doing to get an estimated accuracy for a random classifier. You can give a description or pseudocode, your choice. 

In conclusion, a "random" classifier would have 73% accuracy; a "lazy" classifier that predicts the most frequent class would have 84% accuracy. These are useful in order to set the expectation for what "a good result" is and what constitutes a significant improvement.
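
The analytic numbers can be wrapped in a small helper, shown here as a sketch (`baseline_accuracies` is a hypothetical name). If a fraction $p$ of events is negative and random guesses follow the same distribution independently of the truth, the guess matches when both are negative ($p^2$) or both are positive ($(1-p)^2$). The lazy classifier is right whenever the event belongs to the majority class.

```python
def baseline_accuracies(p_negative):
    """Expected accuracy of the 'lazy' and 'random' baselines for a binary problem."""
    p_positive = 1.0 - p_negative
    lazy = max(p_negative, p_positive)            # always predict the majority class
    random_guess = p_negative**2 + p_positive**2  # guess and truth agree by chance
    return lazy, random_guess

lazy_acc, random_acc = baseline_accuracies(0.8378)
print(round(lazy_acc, 3), round(random_acc, 3))
```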

### Let's start with a linear model; model = LinearSVC()

Define a cross-validation strategy; establish benchmark for a linear model.

In [None]:
bmodel = LinearSVC(dual = False) #Prefer dual=False when n_samples > n_features. If not, will not converge!!

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101) 

In [None]:
l_benchmark_lim = cross_validate(bmodel, features_lim, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [None]:
l_benchmark_lim

In [None]:
np.round(l_benchmark_lim['test_score'].mean(),3), np.round(l_benchmark_lim['test_score'].std(), 3)

### Question:
Evaluate the performance of the initial Linear SVM you trained. Does it out-perform the "lazy" or random classifiers described above?

We can also check the predicted labels and the confusion matrix. $\texttt{cross\_val\_predict}$ compiles the label predicted for each object when it was in the test fold.
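
As a reminder of the layout, the sketch below builds a confusion matrix for a handful of invented labels (`toy_true` and `toy_pred` are made up). In scikit-learn, rows correspond to the true class and columns to the predicted class, so for binary labels `ravel()` returns the counts in the order TN, FP, FN, TP.

```python
import numpy as np
from sklearn import metrics

toy_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
toy_pred = np.array([0, 0, 1, 1, 0, 0, 0, 0])

# Rows: true class (0, 1); columns: predicted class (0, 1)
cm = metrics.confusion_matrix(toy_true, toy_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```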

In [None]:
ypred_bench_lim = cross_val_predict(bmodel, features_lim, target, cv = cv)
metrics.confusion_matrix(target,ypred_bench_lim) 

### Question:
How many 4-top events are missed by this classifier? How many non-4-top events are misidentified as 4-top events?

### Question: is there perhaps something that we should have done before building the SVM model?

### How about scaling?

Implementation notes: Technically, standardizing/normalizing data using the entire learning set introduces leakage between train and test set (the test set "knows" about the mean and standard deviation of the entire data set). Usually this is not a dramatic effect, but the correct procedure is to derive the scaler within each CV fold (i.e. after splitting into train and test), only on the train set, and apply the same transformation to the test set. The model then becomes a pipeline.

In [None]:
from sklearn.pipeline import make_pipeline #This allows one to build different steps together

In [None]:
piped_model = make_pipeline(StandardScaler(), LinearSVC(dual = False)) #make a pipeline with standard scaler and linear SVM

piped_model.get_params() #return the parameters of the full pipeline. You should see some associated with the scaler, and some with the SVM algorithm

Now that we have a pipeline, when we use cross-validation, then for each fold, it will run all the steps of the pipeline. In this case, that means that for each fold, it:
- fits the standard scaler on the train data and applies it to the train data
- trains the linear SVM using this scaled train set, returning the train score
- scales the test data using the scaler fitted on the train data
- makes predictions for the test data, returning the test accuracy score
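
The steps above can be written out by hand. The sketch below does so on synthetic stand-in data (an assumption; it is not the collision data set) and compares the result with the pipeline run through `cross_validate` on the same folds. All names ending in `_toy`, along with `manual_scores` and `piped_scores`, are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data with features on very different scales
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_toy = X_toy * np.array([1, 10, 100, 1000, 1, 1])
toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)

manual_scores = []
for train_idx, test_idx in toy_cv.split(X_toy, y_toy):
    # The scaler only ever sees the train fold
    scaler = StandardScaler().fit(X_toy[train_idx])
    model = LinearSVC(dual=False).fit(scaler.transform(X_toy[train_idx]), y_toy[train_idx])
    # The test fold is transformed with the train-fold mean and standard deviation
    manual_scores.append(model.score(scaler.transform(X_toy[test_idx]), y_toy[test_idx]))

piped = make_pipeline(StandardScaler(), LinearSVC(dual=False))
piped_scores = cross_validate(piped, X_toy, y_toy, cv=toy_cv, scoring='accuracy')['test_score']

print(np.round(manual_scores, 3))
print(np.round(piped_scores, 3))
```

The two sets of fold scores should agree, since the pipeline performs the same fit and transform steps inside each fold.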

In [None]:
benchmark_lim_piped = cross_validate(piped_model, features_lim, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [None]:
benchmark_lim_piped

In [None]:
#get the mean and standard deviation of the test score
np.round(benchmark_lim_piped['test_score'].mean(),3), np.round(benchmark_lim_piped['test_score'].std(), 3)

In [None]:
#get the mean and standard deviation of the train score
np.round(benchmark_lim_piped['train_score'].mean(),3), np.round(benchmark_lim_piped['train_score'].std(), 3)

This is a significant improvement (woo-ooh!), and the comparison between test and train scores tells us already something about the problem that we have. We can formalize this by looking at the learning curves, which tell us both about gap between train/test scores, AND whether we need more data.
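
Before plotting, it can help to look at the raw numbers a learning curve is made of. This sketch calls `learning_curve` directly on synthetic stand-in data (an assumption; `X_toy`, `y_toy`, and `toy_model` are illustrative names) and prints the mean train score, the mean test score, and their gap at each training-set size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
toy_model = make_pipeline(StandardScaler(), LinearSVC(dual=False))

# One row of scores per training-set size, one column per CV fold
sizes, train_scores, test_scores = learning_curve(
    toy_model, X_toy, y_toy, cv=5, train_sizes=[0.1, 0.5, 1.0], scoring='accuracy')

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n_train={int(n):4d}  train={tr:.3f}  test={te:.3f}  gap={tr - te:.3f}")
```

How the gap and the test score change as the training set grows is the information the plots below display graphically.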

### Learning curves 

In [None]:
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=5,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5), scoring = 'accuracy', scale = False):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, optional (default=-1)
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like, shape (n_ticks,), dtype float or int
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the dtype is float, it is regarded as a
        fraction of the maximum size of the training set (that is determined
        by the selected validation method), i.e. it has to be within (0, 1].
        Otherwise it is interpreted as absolute sizes of the training sets.
        Note that for classification the number of samples usually have to
        be big enough to contain at least one sample from each class.
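        (default: np.linspace(0.1, 1.0, 5))

    scoring : string, optional (default='accuracy')
        Scoring metric passed to learning_curve.

    scale : bool, optional (default=False)
        If True, standardize X with StandardScaler before computing the
        curves. Note that this fits the scaler on the full data set; pass a
        pipeline as the estimator to scale within each fold instead.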
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("# of training examples",fontsize = 14)
 
    plt.ylabel("Accuracy score",fontsize = 14)
    
    if scale:
        # StandardScaler is imported at the top of the notebook
        scaler = StandardScaler()
        X = scaler.fit_transform(X)
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring = scoring)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
#    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="b")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="b",
             label="Training score from CV")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Test score from CV")

    plt.legend(loc="best",fontsize = 12)
    return plt

In [None]:
plot_learning_curve(piped_model, 'Generalized Learning Curves, linear SVC model, no reg', features_lim, target, train_sizes = np.array([0.05,0.1,0.2,0.5,1.0]), cv = KFold(n_splits=5, shuffle=True));

### Linear SVM Conclusion Questions

1. How does the classifier performance compare to "lazy" and random classifiers?
2. Briefly explain why scaling the data improved the SVM performance. You may find it helpful to think back to the different behavior we observed with respect to scaling effects in the Decision Tree and kNN algorithms. 
3. Does the model suffer from high variance, high bias, or both? How do you know?
4. Would you expect that having more data would improve the performance of the classifier? How do you know?
5. Given your diagnosis from questions 3 and 4, what are some things we should try to improve this classifier's performance? 



### Acknowledgement statement:

### When you're finished, submit this studio to Gradescope. 