# Machine learning with artificial neural networks

This project applies machine learning models to a simulated dataset.\
The dataset was taken from [OpenML](https://www.openml.org/search?type=data&sort=runs&id=4532&status=active). It is a tabular dataset consisting of 28 features that are (functions of) kinematic properties.
It also contains a binary target denoting whether a process is tied to a Higgs boson (signal) or to another process (background).
For faster runtime and independence from external sources, a smaller sample dataset was created. \
An artificial neural network is used to determine whether a process is signal or background based on the given features.

With the sample dataset, the entire notebook should run in about 1 min.

## Part 1: Obtaining the data
- Download of the [Higgs dataset](https://www.openml.org/search?type=data&sort=runs&id=4532&status=active) or load a smaller sample dataset.
- The data is saved as a pandas DataFrame.

## Part 2: Cleaning the data
- Removal of rows with missing values, infinite values and duplicate rows

## Part 3: Data Summary and Statistics
- Presentation of some basic information of the data.
- Plot of the distributions of `lepton_pT`, `lepton_eta`, and `lepton_phi` as histograms for the signal and background events.

## Part 4: Pre-Process Features
- Re-separation of features and target and preprocessing of the features.

## Part 5: Split Dataset
- 70% will be used for training, 15% for testing and 15% for validation.
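
The 70/15/15 split can be sketched with plain NumPy as follows. The sample count `n_samples` and the seed are arbitrary choices for illustration; the notebook itself uses `train_test_split` from scikit-learn for this step.

```python
import numpy as np

n_samples = 1000                      # hypothetical dataset size
rng = np.random.default_rng(42)       # fixed seed for reproducibility
indices = rng.permutation(n_samples)  # shuffled row indices

# integer arithmetic avoids floating-point surprises in the cut points
n_train = n_samples * 70 // 100
n_val = n_samples * 15 // 100

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]   # the remaining 15%

print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Indexing the feature matrix and the target with the same index arrays keeps rows and labels aligned across the three subsets.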

## Part 6: Simple MLP Model with Keras
- Definition and compilation of a simple Multi-Layer Perceptron (MLP) model with [Keras](https://keras.io/).
- Training of the model with the given data.
- Plot of the training and validation loss over the training epochs as well as the ROC-curve.
- Evaluation of the model on the test set.
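
To illustrate what the ROC curve and its AUC measure, here is a small hand-rolled computation on made-up labels and scores. It is only a sketch and assumes that no two scores are tied; the notebook relies on `roc_curve` and `auc` from scikit-learn instead.

```python
import numpy as np

# made-up true labels and classifier scores (illustration only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# sort by decreasing score; each sample in turn acts as the threshold
order = np.argsort(-y_score)
y_sorted = y_true[order]

# cumulative true/false positives as the threshold is lowered
tps = np.cumsum(y_sorted)
fps = np.cumsum(1 - y_sorted)

# convert to rates and prepend the (0, 0) starting point
tpr = np.concatenate(([0.0], tps / tps[-1]))
fpr = np.concatenate(([0.0], fps / fps[-1]))

# area under the curve via the trapezoidal rule
roc_auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(roc_auc)  # 0.75
```

An AUC of 0.5 corresponds to the diagonal of a random classifier, while 1.0 means signal and background scores are perfectly separated.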

## Part 7: Feature Importance
- Training of the MLP while dropping features.
- Ranking of the first 10 features based on importance. (Only 10 features are considered to keep the runtime short.)

## Part 8: Brute Force Single Feature Attack on MLP
- For 5 correctly classified samples each feature will be varied over its valid range (min/max values from the training set).
- Misclassifications after changing a single feature will be considered.
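
The idea of the single-feature attack can be sketched on a toy classifier. Everything in this example is hypothetical: `toy_predict` is a fixed logistic model standing in for the trained MLP, and the sample, weights and valid ranges are made up. The sketch evaluates all candidate values of one feature in a single batched call, which is one possible way to reduce the number of prediction calls.

```python
import numpy as np

def toy_predict(X):
    """Hypothetical stand-in for a trained model: a fixed logistic model."""
    w = np.array([2.0, -1.0, 0.3])
    return 1.0 / (1.0 + np.exp(-(X @ w)))

sample = np.array([1.0, 0.5, 0.0])        # made-up sample, classified correctly
true_label = 1
feat_min = np.array([-3.0, -3.0, -3.0])   # assumed valid range per feature
feat_max = np.array([3.0, 3.0, 3.0])
steps = 10

flipping_features = []
for j in range(sample.size):
    # build all candidate samples for feature j at once (shape: steps x 3)
    candidates = np.tile(sample, (steps, 1))
    candidates[:, j] = np.linspace(feat_min[j], feat_max[j], steps)

    # one batched prediction instead of one call per candidate value
    probs = toy_predict(candidates)
    if (np.abs(probs - true_label) > 0.5).any():
        flipping_features.append(j)

print(flipping_features)  # [0, 1]
```

Here the two features with the largest weights can flip the prediction on their own, while the third cannot within its valid range.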

---
## Part 1

In [None]:
import openml  # to download the full dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# to prepare data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# for the ROC-curve
from sklearn.metrics import roc_curve, auc

# to construct the neural network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

%matplotlib inline
plt.rcParams.update({'font.size': 14})  # larger text in plots

In [None]:
full_dataset = False


if full_dataset:
    # download the full Higgs dataset from openml (98 050 events)
    higgs_dataset = openml.datasets.get_dataset(4532)

    # separate the data into features/target and combine into a pandas DataFrame
    X_higgs, y_higgs, _, _ = higgs_dataset.get_data(target=higgs_dataset.default_target_attribute)
    df = pd.DataFrame(X_higgs)  # feature data
    df['target'] = y_higgs  # add target-data

else:
    # load the provided sample dataset (only 10 000 events of the full dataset)
    df = pd.read_csv("data/higgs_sample.csv")


# extract column names for parts 7 and 8
features = df.columns.to_numpy()

# print some basic information
print(f'Shape: \n{df.shape}\n')
print(f'Head: \n{df.head()}\n')
print(f'Columns: \n{features}')

## Part 2

In [None]:
# pre-cleaning shape
print(f'Original shape: \n{df.shape}\n')

# remove NaN-entries and infinities
df = df[np.isfinite(df).all(axis=1)]

# remove identical rows
# (reassigning instead of inplace=True avoids modifying a filtered view of the DataFrame)
df = df.drop_duplicates()

# post-cleaning shape
print(f'Cleaned shape: \n{df.shape}')

## Part 3

In [None]:
# print relevant information
print(f'Names and data types: \n{df.dtypes}\n')
print(f'Basic statistics: \n{df.describe()}\n')
print(df['target'].value_counts())


# plot histograms

nbins = 100  # bin-number

fig, ax = plt.subplots(3, 1, figsize=(10,8))
fig.tight_layout(pad=4.0)

# data
ax[0].hist(df['lepton_pT'][df['target'] == 1], bins=nbins, label='signal', alpha=.7)
ax[0].hist(df['lepton_pT'][df['target'] == 0], bins=nbins, label='background', alpha=.7)

ax[1].hist(df['lepton_eta'][df['target'] == 1], bins=nbins, label='signal', alpha=.7)
ax[1].hist(df['lepton_eta'][df['target'] == 0], bins=nbins, label='background', alpha=.7)

ax[2].hist(df['lepton_phi'][df['target'] == 1], bins=nbins, label='signal', alpha=.7)
ax[2].hist(df['lepton_phi'][df['target'] == 0], bins=nbins, label='background', alpha=.7)

# appearance
ax[0].set_title(r'$\mathtt{pT}$-distribution')
ax[0].set_xlabel(r'$\mathtt{lepton\_pT}$')
ax[0].set_ylabel('Counts')
ax[0].grid(True, 'both')
ax[0].legend(loc='upper right')
        
ax[1].set_title(r'$\mathtt{eta}$-distribution')
ax[1].set_xlabel(r'$\mathtt{lepton\_eta}$')
ax[1].set_ylabel('Counts')
ax[1].grid(True, 'both')
ax[1].legend(loc='upper right')

ax[2].set_title(r'$\mathtt{phi}$-distribution')
ax[2].set_xlabel(r'$\mathtt{lepton\_phi}$')
ax[2].set_ylabel('Counts')
ax[2].grid(True, 'both')
ax[2].legend(loc='lower right')

plt.show()

## Part 4

In [None]:
# reseparate DataFrame and ensure integer classification
X_higgs = df.iloc[:, :-1]
y_higgs = df.iloc[:, -1].astype(int)

# mean and std before preprocessing (using ".values" to print only numerical data from pandas series)
print(f'Averages before preprocessing: \n{np.mean(X_higgs, axis=0).values}\n')
print(f'Standard deviations before preprocessing: \n{np.std(X_higgs, axis=0).values}\n')

# preprocessing
X_higgs = StandardScaler().fit_transform(X_higgs)

# mean and std after preprocessing
print(f'Averages after preprocessing: \n{np.mean(X_higgs, axis=0)}\n')
print(f'Standard deviations after preprocessing: \n{np.std(X_higgs, axis=0)}')

The averages before preprocessing vary over almost four orders of magnitude, which preprocessing reduces to about one order. The standard deviations initially vary over about one order of magnitude and are all equal after preprocessing.
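
As a minimal sketch of what the scaling step does, the following toy example standardizes two columns by hand with NumPy. The array `X_toy` and its values are made up for illustration; the formula (subtract the column mean, divide by the column's population standard deviation) corresponds to the default behaviour of `StandardScaler`.

```python
import numpy as np

# toy feature matrix (made-up values): two columns on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

# column-wise mean and population standard deviation (ddof=0)
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)

# z-score: every column ends up with mean 0 and standard deviation 1
X_scaled = (X_toy - mu) / sigma

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

In general the means vanish only up to floating-point rounding, so tiny non-zero averages after scaling are expected.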

## Part 5

In [None]:
# split data using train_test_split (must be done twice since data is only split into 2 parts)
# split into training and non-training
X_train, X_test, y_train, y_test = train_test_split(X_higgs, y_higgs,
                                                    test_size=0.3, random_state=42)
# split into test and validation
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test,
                                                test_size=0.5, random_state=42)

print(f"Training features' shape: {X_train.shape}")
print(f"Training targets' shape: {y_train.shape}")

print(f"Test features' shape: {X_test.shape}")
print(f"Test targets' shape: {y_test.shape}")

print(f"Validation features' shape: {X_val.shape}")
print(f"Validation targets' shape: {y_val.shape}")

## Part 6

In [None]:
# define the model
model = Sequential([
    Dense(64, input_dim=28, activation='relu'),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# model summary
model.summary()  # prints the summary itself; print(model.summary) would only show the method object

# training the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=25,
    batch_size=512
)

# plot losses over epochs
plt.figure(figsize=(8,5))

plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Neural Network Training vs. Validation Loss')
plt.legend()
plt.grid(True)
plt.show()

# evaluate with test set
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f'\n Test accuracy: {test_acc}')


# consider ROC curve

# Predict probabilities (as flat array)
y_pred_nn = model.predict(X_val).ravel()

# Compute ROC curve and AUC
fpr_nn, tpr_nn, _ = roc_curve(y_val, y_pred_nn)
roc_auc_nn = auc(fpr_nn, tpr_nn)

# Plot ROC curve
plt.figure(figsize=(8, 6))

plt.plot(fpr_nn, tpr_nn, color='blue', label=f'Neural Network (AUC = {roc_auc_nn:.2f})')
plt.plot([0, 1], [0, 1], 'k--')  # Diagonal line

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (Neural Network Classifier)')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()

## Part 7

In [None]:
# define a simpler model to reduce runtime; a fresh model is built for every
# dropped feature so that no trained weights carry over between runs
def build_light_model():
    model_light = Sequential([
        Dense(32, input_dim=27, activation='relu'),
        Dense(16, activation='relu'),
        Dense(1, activation='sigmoid')
    ])

    # compile the model
    model_light.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model_light

# create an array to be filled with accuracies
dropped_acc = np.zeros(10)

# loop through features
for i in range(10):
    # start from untrained weights for each dropped feature
    model_light = build_light_model()

    # training the model, where np.delete() is used to drop certain features
    history = model_light.fit(
        np.delete(X_train, i, axis=1), y_train,
        validation_data=(np.delete(X_val, i, axis=1), y_val),
        epochs=15,
        batch_size=512,
        verbose=0
    )

    # extract accuracies
    _, dropped_acc[i] = model_light.evaluate(np.delete(X_test, i, axis=1), y_test)
    
# sort features by increasing accuracy after dropping them, i.e. decreasing importance
# (dropped_acc[i] belongs to features[i], so the index must be the first 10 names in order)
print('\n The first 10 features and corresponding accuracies in decreasing importance:')
print(pd.Series(dropped_acc, features[:10]).sort_values())  # print as pd.Series for readability

## Part 8

In [None]:
# calculate probability for a sample to be signal based on the model
prob = model.predict(X_test, verbose=0).flatten()

# get indices where the difference between probability and target is < 0.5
args_correct = np.where(np.abs(prob - y_test) < .5 )[0]

# select 5 distinct correctly classified samples (without replacement, so none is picked twice)
args_selected = np.random.choice(args_correct, 5, replace=False)

# extract and print valid ranges
features_min = np.min(X_train, axis=0)
features_max = np.max(X_train, axis=0)

print('Features and the corresponding valid ranges:')
print(pd.DataFrame({'feature': features[:-1],
                    'min': features_min,
                    'max': features_max}).to_string(index=False))

# create an array of possible feature values
steps = 10
possible_feature_vals = np.linspace(features_min, features_max, steps)  # shape: (steps, 28)

# define dictionary to be filled later
misclass_details = {'Sample no.': [], 'Feature': [],
                    'Old value': [], 'New value': [],
                    'Prediction': [], 'True label': []}

# loop over samples
for arg in args_selected:
    sample = X_test[arg]
    stop = False  # will be used to stop changing features after misclassification
    
    # change each feature of each sample
    for feature_ind in range(28):
        for step_number in range(steps):
            
            # copy the sample to implement changes
            test_sample = sample.copy()
            test_sample[feature_ind] = possible_feature_vals[step_number, feature_ind]
            
            new_prob = model.predict(test_sample[np.newaxis, :], verbose=0)  # mind dimensions
            
            # check for misclassification
            if np.abs(new_prob - y_test.iloc[arg]) > .5: 
                misclass_details['Sample no.'].append(arg)
                misclass_details['Feature'].append(features[feature_ind])
                misclass_details['Old value'].append(sample[feature_ind])
                misclass_details['New value'].append(test_sample[feature_ind])
                misclass_details['Prediction'].append(new_prob[0][0])
                misclass_details['True label'].append(y_test.iloc[arg])
                
                stop = True
                break   # break step_number loop
                
        if stop:
            break   # break feature_ind loop
            
# print results
print(f"\n \n{len(misclass_details['Sample no.'])} out of 5 samples were misclassified:\n")
print(pd.DataFrame(misclass_details).to_string(index=False))