### This is a follow-up to Studio 4, using SVMs to classify 4-top ATLAS events. 
In this notebook, you'll use nested cross-validation to optimize hyperparameters for SVM models, comparing to the benchmark linear model that you trained in Studio 4. You'll also add newly-engineered features to the model and test the resulting performance. 

It accompanies Chapter 4 of the book.

Data for this exercise were kindly provided by [Sascha Caron](https://www.nikhef.nl/~scaron/).

Copyright: Viviana Acquaviva (2023)
Modifications by Julieta Gruszko (2025)

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)




### Group Names:

In [None]:
import numpy as np
import itertools
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.svm import SVC, LinearSVC # New algorithm!
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict, cross_validate, cross_val_score
from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV
from sklearn import metrics

In [None]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 100)
rc('text', usetex=False)

## We'll begin by opening the feature and label files you prepared last week.

Read in features and labels.

In [None]:
features = pd.read_csv('../Data/ParticleID_features.csv', index_col='ID')

In [None]:
features.head()

In [None]:
features.shape

In [None]:
y = np.genfromtxt('../Data/ParticleID_labels.txt', dtype = str)

In [None]:
y

#### As we did last week, we'll turn categorical (string-type) labels into an array, e.g. 0/1.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() #turns categorical labels into integers 0 ... N-1
y = le.fit_transform(y)
target = np.abs(y - 1) #flip the labels, so 4-top is 1 and t/t-bar is 0
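
To see what the encoding and the flip do, here is a minimal sketch on a toy array. The label strings `'4top'` and `'ttbar'` are assumptions made for illustration; the strings in the real label file may differ.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label strings (the real file may use different names)
toy_labels = np.array(['ttbar', '4top', '4top', 'ttbar'])

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_labels)  # classes are sorted, then numbered from 0
print(toy_le.classes_)                          # position in classes_ = integer code

toy_flipped = np.abs(toy_encoded - 1)           # swaps 0 <-> 1 (only valid for two classes)
print(toy_encoded, toy_flipped)
```

Checking `classes_` is a quick way to confirm which class ended up as 1 before relying on the flip.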


In [None]:
y

In [None]:
features.describe() #Note that this automatically excludes non-numerical type columns

### Imputing Missing Data
As in Studio 4, we'll keep just the first 4 products and fill any remaining missing values with 0's. You'll test other imputation strategies on the homework. 


In [None]:
features_lim = features[['MET', 'METphi', 'P0_E', 'P0_pt', 'P0_eta', 'P0_phi', 'P1_E', 'P1_pt', 'P1_eta', 'P1_phi', 'P2_E', 'P2_pt', 'P2_eta', 'P2_phi', 'P3_E', 'P3_pt', 'P3_eta', 'P3_phi']]
features_lim = features_lim.fillna(0) #Fill with 0 everywhere there is a NaN
features_lim.describe()

### Let's first reproduce our benchmark linear model (with scaling) from Studio 4 so we have it for comparison; model = LinearSVC().

In [None]:
from sklearn.pipeline import make_pipeline #This allows one to build different steps together
piped_model = make_pipeline(StandardScaler(), LinearSVC(dual=False)) #make a pipeline with standard scaler and linear SVM
cv = StratifiedKFold(n_splits = 5, shuffle=True, random_state=101)# make a 5-fold stratified cross-validation, setting shuffle to "True" and random state to 101

benchmark_lim_piped = cross_validate(piped_model, features_lim, target, cv = cv, scoring = 'accuracy', return_train_score=True)
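
As a quick illustration of why we use a stratified split, the sketch below checks the class balance of each test fold on toy labels. The arrays `X_toy` and `y_toy` are made up for this example and are not part of the studio data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 20 positives out of 100 instances
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.zeros((100, 1))  # feature values do not affect how the folds are built

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)

# Fraction of positives in each test fold
fractions = [y_toy[test_idx].mean() for _, test_idx in skf.split(X_toy, y_toy)]
print(fractions)
```

Each test fold keeps the overall 20% positive fraction, so no fold is scored on an unrepresentative class mix.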

In [None]:
benchmark_lim_piped

In [None]:
np.round(benchmark_lim_piped['test_score'].mean(),3), np.round(benchmark_lim_piped['test_score'].std(), 3)

In [None]:
np.round(benchmark_lim_piped['train_score'].mean(),3), np.round(benchmark_lim_piped['train_score'].std(), 3)

### Parameter optimization 

When we optimize parameters with a grid search, we choose the parameters that give the best test scores. This is different from what would happen with new data: to be fair, we are not allowed to look at the test labels at any point in the training procedure. Therefore, we need to do <b> nested cross validation </b> to avoid leakage between the parameter optimization and the cross validation procedure, and to properly evaluate the generalization error.

If you don't do nested cross-validation, you'll unintentionally bias your generalization error estimate, as you'll be choosing the model you use based on the same test data you're using to evaluate the performance on new data. 

Since we're doing a lot of new things here, we'll first run an optimization <b> without </b> nested cross-validation. This is the approach you'd use to pick a set of optimal hyperparameters, which you could use to make predictions on new data.

Then, we'll run the same optimization <b> with </b> nested cross-validation, which is the approach used to get the correct generalization error. Then we'll compare the results from the two approaches. 

### Setup for both steps:

First we'll set up the model and hyperparameters to scan over, which are common to both approaches. We'll also set up 1 cross-validation, which will eventually be the outer one.

In [None]:
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101) #1 layer of cross-validation

piped_model = make_pipeline(StandardScaler(), SVC()) #now using the general SVC so I can change the kernel

piped_model.get_params() #this shows how we can access parameters both for the scaler and the classifier


Then we define a dictionary of parameter values that we'll run the optimization over.

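You'll notice that we're not using a linear kernel. That's because the RBF (Gaussian) kernel with a very small $\gamma$ (and a suitably large $C$) behaves approximately like a linear kernel: for small $\gamma$, $e^{-\gamma \|x - x'\|^2} \approx 1 - \gamma \|x - x'\|^2$, which depends on a pair of points only through their norms and a term linear in $x \cdot x'$.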

In [None]:
parameters = {'svc__kernel':['poly', 'rbf'], \
              'svc__gamma':[0.00001,'scale', 0.01, 0.1], 'svc__C':[0.1, 1.0, 10.0, 100.0], \
              'svc__degree': [2, 4, 8]}

### A few questions:
- Briefly describe what each hyperparameter in the dictionary does. 
- How many SVC's will be trained if we do 5-fold cross-validation for each combination of hyperparameters?
- How many of the SVC's trained correspond to models that are actually distinct? Hint: there are degeneracies! E.g. does the "degree" parameter change anything if you're using the rbf kernel?
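
If you want to check your counting, one possible approach is to enumerate the grid with $\texttt{ParameterGrid}$. The sketch below uses a deliberately smaller toy grid (not the one above) and an assumed 3-fold split, so it shows the method without giving away the answer.

```python
from sklearn.model_selection import ParameterGrid

# Toy grid for illustration only (smaller than the studio grid)
toy_grid = {'svc__kernel': ['poly', 'rbf'],
            'svc__C': [1.0, 10.0],
            'svc__degree': [2, 3]}

combos = list(ParameterGrid(toy_grid))  # every combination of the listed values
n_folds = 3                             # assumed number of CV folds for this toy case
n_fits = len(combos) * n_folds          # fits run during the search (excluding the final refit)

# 'degree' is ignored by the rbf kernel, so collapse it when counting distinct models
distinct = {(p['svc__kernel'], p['svc__C'],
             p['svc__degree'] if p['svc__kernel'] == 'poly' else None)
            for p in combos}

print(len(combos), n_fits, len(distinct))
```

The same pattern applies to the studio grid: swap in `parameters`, use 5 folds, and extend the degeneracy rule to any other hyperparameter that a given kernel ignores.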

### Now we'll run the optimization with just 1 layer of cross-validation.
Note that this might take a while (~1 min on my laptop); the early estimates output by this cell may be misleading because more complex models (in particular high gamma) take longer. To speed things up, we're running 4 jobs in parallel.

Notice that the $\texttt{GridSearchCV()}$ function constructs our $\texttt{model}$ object, but doesn't actually train any models! The training happens when we call $\texttt{fit}$, as usual.

Once you run this cell, the $\texttt{model}$ object will have attributes $\texttt{best\_score\_}$, $\texttt{best\_params\_}$ and $\texttt{best\_estimator\_}$, which give us access to the optimal estimator (printed out), as well as $\texttt{cv\_results\_}$ that can be used to visualize the performance of all models.

In [None]:
#optimizing SVC: THIS IS NOT YET NESTED CV

model = GridSearchCV(piped_model, parameters, cv = outer_cv, \
                     verbose = 2, n_jobs = 4, return_train_score=True)

In [None]:
model.fit(features_lim,target)

In [None]:
print('Best score, best params:', "{:.4f}".format(model.best_score_), \
      model.best_params_)
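
Once the search has been fit, it can be used directly to predict on new data. Here is a minimal sketch on synthetic data (generated with $\texttt{make\_classification}$, purely as a stand-in for the studio features); it relies on the default $\texttt{refit=True}$, which retrains the best configuration on all the data passed to $\texttt{fit}$.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the ATLAS features)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    {'svc__C': [0.1, 1.0], 'svc__gamma': ['scale', 0.1]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
toy_search.fit(X_toy, y_toy)

# With refit=True (the default), predict() delegates to best_estimator_
new_predictions = toy_search.predict(X_toy[:5])
same_as_best = toy_search.best_estimator_.predict(X_toy[:5])
print(toy_search.best_params_, new_predictions)
```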

Which model performed the best? Give its relevant hyperparameters. What was its accuracy score?

#### We can visualize the models in a data frame, and rank them according to their test scores.

I like to look at the mean and std of the test scores, the mean of the train scores (so I can evaluate if they differ and the significance of the result), and also fitting time (we may pick a faster model instead of the best model if the scores are comparable)!

In [None]:
scores_lim = pd.DataFrame(model.cv_results_)

scores_lim.columns

In [None]:
scores_lim[['params','mean_test_score','std_test_score','mean_train_score', \
            'mean_fit_time']].sort_values(by = 'mean_test_score', ascending = False)
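
One way to act on the fit-time information is to pick the fastest model whose mean test score is within one standard deviation of the best. This is only a sketch of one possible selection rule; the small table below is invented and merely mimics the column names of $\texttt{cv\_results\_}$.

```python
import pandas as pd

# Invented scores that mimic the cv_results_ columns used above
toy_scores = pd.DataFrame({
    'params': ['A', 'B', 'C', 'D'],
    'mean_test_score': [0.900, 0.898, 0.850, 0.895],
    'std_test_score': [0.010, 0.012, 0.010, 0.008],
    'mean_fit_time': [4.0, 0.5, 0.2, 1.5],
})

# Best model and a "within one std of the best" threshold
best = toy_scores.loc[toy_scores['mean_test_score'].idxmax()]
threshold = best['mean_test_score'] - best['std_test_score']

# Among comparable models, take the one that trains fastest
comparable = toy_scores[toy_scores['mean_test_score'] >= threshold]
fastest = comparable.sort_values('mean_fit_time').iloc[0]
print(fastest['params'])
```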

#### We can also isolate one type of kernel to look at it more closely.

In [None]:
scores_lim[scores_lim['param_svc__kernel'] == 'poly'][['params','mean_test_score','std_test_score',\
                        'mean_train_score','mean_fit_time']].sort_values(by = 'mean_test_score', ascending = False)

### A few questions:
- Why do some of the models have identical scores (e.g., the top 3 models)?
- What hyperparameter values are common to all the best-performing models? 
- What hyperparameters do not strongly affect the accuracy? Ignore degenerate models in this discussion. 

### Next, we'll run nested cross-validation to get the generalization error.

First, we need to make one more layer of cross-validation. We'll do 4 splits for the inner layer.

In [None]:
inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=10)


This time, the cross-validation applied to the hyperparameter optimization step is the $\texttt{inner\_cv}$.

Now, we'll use $\texttt{cross\_val\_score}$ with $\texttt{outer\_cv}$ (instead of $\texttt{fit}$!) to fit the model and get its score. 

This may be a bit confusing: remember that the $\texttt{GridSearchCV()}$ is just a constructor, it doesn't run any model fits! 


I found this explanation, from Arpit Omprakash on StackOverflow (https://stackoverflow.com/a/78544053), extremely helpful:

"In the first line of code here, we are instantiating the GridSearchCV object using the inner_cv cross validator (but not fitting it). 

In the second line, we are doing a lot of things. First, using cross_val_score and outer_cv we break the initial data into different splits, let's call it x_tr_0, x_ts_0, x_tr_1, x_ts_1, x_tr_2, x_ts_2, x_tr_3, x_ts_3, x_tr_4, x_ts_4, (since there are five splits, and in each split, we have training and testing data). The training data is passed on to the GridSearchCV method in each fold. So, the inner_cv cross validator works on the training data splits from the outer_cv cross validator. So, in the GridSearchCV method, we are basically breaking down x_tr_0 into 4 splits: x_tr_0_0, x_tr_0_1, x_tr_0_2, x_tr_0_3. In these "inner" splits we are doing the hyperparameter tuning.

Once the optimal hyperparameters are calculated, we use these in the "outer" split for calculating the model performance. In this case, the model evaluation is done on the outer_cv split test data (which is unseen by the hyperparameter tuning "inner" split). This ensures that the performance values we are getting are more generalizable and there is no data leakage."

Because I'm using $\texttt{cross\_val\_score}$, all I'm returning is the scores of each of the 5 outer folds. If you want to return the models as well, you can do so with the $\texttt{cross\_validate}$ function.
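
As a rough sketch of that alternative, $\texttt{cross\_validate}$ with $\texttt{return\_estimator=True}$ returns the fitted $\texttt{GridSearchCV}$ object from each outer fold, so you can inspect which hyperparameters were selected in each one. The data here are synthetic and the grid is reduced to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the ATLAS features)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
toy_outer = StratifiedKFold(n_splits=4, shuffle=True, random_state=2)

# Inner loop: hyperparameter search. Outer loop: unbiased scoring of the whole procedure.
toy_search = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                          {'svc__C': [0.1, 1.0, 10.0]}, cv=toy_inner)
toy_nested = cross_validate(toy_search, X_toy, y_toy, cv=toy_outer,
                            return_estimator=True)

# One fitted search per outer fold, each with its own best_params_
for score, est in zip(toy_nested['test_score'], toy_nested['estimator']):
    print(round(score, 3), est.best_params_)
```

The selected hyperparameters can differ from fold to fold; nested cross-validation scores the whole tuning procedure rather than one fixed model.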


Because the entire grid search now runs once for each of the 5 outer folds, this will take a bit longer! It ran in 1.5 minutes on my laptop.

In [None]:
nested_model = GridSearchCV(piped_model, parameters, cv = inner_cv, verbose = 2, n_jobs = 4, return_train_score=True)
nested_score = cross_val_score(nested_model, X=features_lim, y=target, cv= outer_cv, n_jobs=4)

In [None]:
nested_score #an array of all the test scores for the outer CV, using the optimal hyperparameters from the inner CV step

In [None]:
print(np.round(nested_score.mean(), 3), np.round(nested_score.std(), 3))

What is the average accuracy and generalization error of the model? Is the accuracy you found for the optimal model (without using nested cross-validation) compatible with this result?

Results for a model with optimized hyperparameters are reported as: optimal model accuracy $\pm$ generalization error

Report your results for the optimized SVC model:

### Diagnosis 

- Compare the performance of the best-performing model found in the optimization to the benchmark scaled Linear SVC model. Does it perform measurably better?

- Our diagnosis of the scaled Linear SVC model was that it had high bias. Has this problem been corrected by making the model more complex using different hyperparameters?



The problem here is high bias, which is not that surprising given that we are using only a subset of features.

We can try two things: making up new features which might help, based on what we know about the problem, and using an imputing strategy to include information about the discarded features. Here we'll focus on adding new engineered features, and you'll try the imputing approach on Homework 2.

### Next step: Feature Engineering

First, we'll define some new variables.
We'll go back to the full list of features (not the abbreviated list we tested above) to develop our new engineered features.

In [None]:
features = features.fillna(0) #takes care of nan
features = features.replace('', 0) #takes care of empty string values
features.head()

#### Let's start by looking at what kind of particles we have as a product of the collision.

In [None]:
# make a 2D numpy array of all the values of the particle type columns, storing the values as strings
ptypes = np.array([features['P'+str(i)+'_type'].values for i in range(0,18)])

print(ptypes)

print(np.shape(ptypes)) #note that the shape might be the transpose of what you expect! There are 18 rows (one for each particle), 5000 columns (one for each instance)

print(ptypes[0, 0:5]) #e.g. the type of particle 0 in instances 0 - 4

In [None]:
#list the unique values of the particle type columns
np.unique(ptypes.astype('str'))

#### Here are the proposed new features (justification can be found in Chapter 4).
    
    1. The total number of particles produced
    2. The total number of b jets
    3. The total number of jets
    4. The total number of leptons (electrons, positrons, mu+, mu-)

In [None]:
#count number of non-zero types for each instance

ntot = np.array([(np.sum(np.array([ptypes[i][j] != 0 for i in range(ptypes.shape[0])]))) for j in range(features.shape[0])])
ntot
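
The nested list comprehension above can also be written as a vectorized NumPy comparison. The sketch below checks on a tiny made-up array (3 particle slots by 3 events, with 0 marking an empty slot) that the two give the same counts; `toy_ptypes` is an illustrative stand-in for `ptypes`.

```python
import numpy as np

# Toy stand-in for ptypes: rows are particle slots, columns are events; 0 marks an empty slot
toy_ptypes = np.array([['b', 'j', 'b'],
                       ['j', 0, 'e-'],
                       [0, 0, 'b']], dtype=object)

# Loop version, mirroring the cell above
ntot_loop = np.array([np.sum(np.array([toy_ptypes[i][j] != 0
                                       for i in range(toy_ptypes.shape[0])]))
                      for j in range(toy_ptypes.shape[1])])

# Vectorized version: compare elementwise, then sum over the particle axis
ntot_vec = (toy_ptypes != 0).sum(axis=0)
nb_vec = (toy_ptypes == 'b').sum(axis=0)

print(ntot_loop, ntot_vec, nb_vec)
```

Summing over `axis=0` collapses the particle slots and leaves one count per event, which matches the "transposed" shape of `ptypes` noted earlier.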

In [None]:
#define new column in my data frame
features['Total_products'] = ntot

In [None]:
#count number of b jets 
nbtot = np.array([np.sum(np.array([ptypes[i][j] == 'b' for i in range(ptypes.shape[0])])) for j in range(features.shape[0])])
#define new column in my data frame
features['Total_b'] = nbtot

In [None]:
#You get the idea, let's count all types (jets, photons g, e-, e+, mu-, mu+)
njtot = np.array([np.sum(np.array([ptypes[i][j] == 'j' for i in range(ptypes.shape[0])])) for j in range(features.shape[0])])
ngtot = np.array([np.sum(np.array([ptypes[i][j] == 'g' for i in range(ptypes.shape[0])])) for j in range(features.shape[0])])

# count each of the lepton types separately, then sum to get total leptons
n_el_tot = np.array([np.sum(np.array([ptypes[i][j] == 'e-' for i in range(ptypes.shape[0])])) for j in range(features.shape[0])])
n_pos_tot = np.array([np.sum(np.array([ptypes[i][j] == 'e+' for i in range(ptypes.shape[0])])) for j in range(features.shape[0])])
n_muneg_tot = np.array([np.sum(np.array([ptypes[i][j] == 'm-' for i in range(ptypes.shape[0])])) for j in range(features.shape[0])])
n_mupos_tot = np.array([np.sum(np.array([ptypes[i][j] == 'm+' for i in range(ptypes.shape[0])])) for j in range(features.shape[0])])
n_lepton_tot = n_el_tot + n_pos_tot + n_muneg_tot + n_mupos_tot

And here we define the other new features:

In [None]:
features['Total_j'] = njtot
features['Total_g'] = ngtot
features['Total_leptons'] = n_lepton_tot

In [None]:
features.head() #scroll to the final columns to see your new features

### Feature engineering 1: impact of engineered variables

We'll add these new count features (the 4 listed above, plus the photon count) to our original set to see if that improves our classifier.

In [None]:
features_lim_2 = features[['MET', 'METphi', 'P0_E', 'P0_pt', 'P0_eta', 'P0_phi', 
                           'P1_E', 'P1_pt', 'P1_eta', 'P1_phi', 
                           'P2_E', 'P2_pt', 'P2_eta', 'P2_phi', 
                           'P3_E', 'P3_pt', 'P3_eta', 'P3_phi',
                           'Total_products', 'Total_b' ,'Total_j','Total_g','Total_leptons']]

First, we'll try our benchmark model (Standard Scaler and Linear SVC), using 5-fold cross-validation

In [None]:
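piped_model = make_pipeline(StandardScaler(), LinearSVC(dual=False)) #re-create the benchmark: piped_model was reassigned to the SVC pipeline during the optimization
piped_model #remember our benchmark model?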

In [None]:
benchmark_lim2 = cross_validate(piped_model, features_lim_2, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [None]:
benchmark_lim2

In [None]:
np.round(benchmark_lim2['test_score'].mean(),3), np.round(benchmark_lim2['test_score'].std(), 3)

In [None]:
np.round(benchmark_lim2['train_score'].mean(),3), np.round(benchmark_lim2['train_score'].std(), 3)

What is the average accuracy and generalization error of the model? Note: we haven't done any parameter optimization here, so we don't need nested cross-validation to get the error!

Compare the performance of this enhanced-feature Linear SVC to your optimized model. Which change had the larger impact on the bias: hyperparameter optimization or feature engineering?

### This is a very significant improvement, which cuts our error rate in half!

In my experience, this knowledge-informed feature engineering is often very successful, more so than hyperparameter optimization. Machine learning methods are often touted for their ability to learn relevant representations, but non-deep-learning methods are less capable of doing so, and providing informative features is very helpful.

We can optimize this model as well, just as we did before. For the moment we'll skip this for the sake of time, but you'll get lots of practice with this on HW 2.

### Techniques for Feature Engineering: One-Hot Encoding

Another feature engineering approach we could try is to use the type of product in the i-th position as a feature. To do this, we need to turn the particle types into numerical features, since that's all SVMs can handle.

We could do it with label encoding, as we did earlier in this notebook, but such a strategy introduces an artificial notion of distance (labels that are mapped to 0 and 1 are interpreted as closer to each other than labels that are mapped to 0 and 7). 

As an alternative, we can introduce as many new columns as possible values for each categorical variable we are re-mapping, and we just use a 0/1 to indicate that the particle is of that type. This is known as "one-hot encoding," since only one of the categorical variable columns we add can ever be "hot" at a time (that is to say, a 1, instead of a 0). 

This is achieved with the wonderfully-named "get_dummies" function:

In [None]:
features_add = pd.get_dummies(data=features, columns=['P'+str(i)+'_type' for i in range(0,18)])
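
To see what $\texttt{get\_dummies}$ produces, here is a minimal sketch on a made-up two-column frame. The column names follow the notebook's pattern, but the values are invented for illustration.

```python
import pandas as pd

# Invented particle types for three events
toy = pd.DataFrame({'P0_type': ['b', 'j', 'b'],
                    'P1_type': ['e-', 'j', 'b']})

# Each categorical column is replaced by one indicator column per observed value
toy_one_hot = pd.get_dummies(data=toy, columns=['P0_type', 'P1_type'])
print(toy_one_hot.columns.tolist())

# Exactly one indicator is "hot" per original column and row
p1_cols = [c for c in toy_one_hot.columns if c.startswith('P1_type_')]
p1_row_sums = toy_one_hot[p1_cols].sum(axis=1)
print(p1_row_sums.tolist())
```

Note that only values that actually occur in a column get an indicator, which is why the number of new columns differs from particle to particle.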

In [None]:
features_add.columns[77:90] #A subset of the new features

Notice a couple of things:
- Particle 0 only comes in type "b" or "j" -- this probably has to do with how the ATLAS collaboration decides to trigger on events. Presumably they're only storing events that have at least 1 jet in them (whether or not it's a b-jet). 
- Empty particle tracks are now listed as "type 0" in the encoding scheme. This may be acceptable, or it could be something you decide to clean up on a later test, if you notice problems classifying events with missing tracks. Developing an ML method is often an iterative process! You don't need to have everything perfectly correct the first time through; try it and see what happens before investing a lot of time into perfecting things. 

In [None]:
features_add.shape

features_add

### Feature engineering 2: add other variables (type of product) for the first four particles.

In [None]:
features_lim_3 = features_add[['MET', 'METphi', 'P0_E', 'P0_pt', 'P0_eta', 'P0_phi', 
                           'P1_E', 'P1_pt', 'P1_eta', 'P1_phi', 
                           'P2_E', 'P2_pt', 'P2_eta', 'P2_phi', 
                           'P3_E', 'P3_pt', 'P3_eta', 'P3_phi',
                           'Total_products', 'Total_b' ,'Total_j','Total_g','Total_leptons',
                           'P0_type_b', 'P0_type_j',
                            'P1_type_0', 'P1_type_b', 'P1_type_e+', 'P1_type_e-', 'P1_type_g', 'P1_type_j', 'P1_type_m+', 'P1_type_m-', 
                            'P2_type_0', 'P2_type_b', 'P2_type_e+', 'P2_type_e-', 'P2_type_g', 'P2_type_j', 'P2_type_m+', 'P2_type_m-', 
                            'P3_type_0', 'P3_type_b', 'P3_type_e+', 'P3_type_e-', 'P3_type_g', 'P3_type_j', 'P3_type_m+', 'P3_type_m-']]

In [None]:
features_lim_3.head()

Let's train the benchmark model one more time, using the new features.

In [None]:
benchmark_lim3 = cross_validate(piped_model, features_lim_3, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [None]:
benchmark_lim3

In [None]:
np.round(benchmark_lim3['test_score'].mean(),3), np.round(benchmark_lim3['test_score'].std(), 3)

In [None]:
np.round(benchmark_lim3['train_score'].mean(),3), np.round(benchmark_lim3['train_score'].std(), 3)

What is the average accuracy and generalization error of the model? Note: we haven't done any parameter optimization here, so we don't need nested cross-validation to get the error!

Compare the performance of this enhanced-feature Linear SVC to the first feature-engineered model (the one with the numbers of products as added features). Do you see any improvement?

#### Next, we would normally optimize the model. Again, we'll skip this for the sake of time in this Studio.

### Finally, we can try with all the features.

In [None]:
features_add.shape

In [None]:
benchmark_all = cross_validate(piped_model, features_add, target, cv = cv, scoring = 'accuracy', return_train_score=True)

In [None]:
benchmark_all

In [None]:
np.round(benchmark_all['test_score'].mean(),3), np.round(benchmark_all['test_score'].std(), 3)

In [None]:
np.round(benchmark_all['train_score'].mean(),3), np.round(benchmark_all['train_score'].std(), 3)

What is the average accuracy and generalization error of the model? Note: we haven't done any parameter optimization here, so we don't need nested cross-validation to get the error!

Compare the bias and variance of this model (using all the features) to the other two models using engineered features. What happened to the variance as the number of features increased?

We could run the optimization, but as you might have anticipated, it won't help much, and it is very time consuming.

### Take-home message: feature engineering often works best if we use subject matter knowledge, and building more features is not necessarily better.

### Acknowledgement Statement:

### You're done! Upload your work to Gradescope.