# Neural Network Autoencoder Method for Outlier Detection
**Author:** Juan A. Monleón de la Lluvia  
**Date:** 29-08-2023  

## Description
This Jupyter Notebook demonstrates an approach for identifying outliers in datasets from proton-induced reaction experiments. Using a neural network autoencoder, it walks through data preprocessing, model building, evaluation, and anomaly detection. The goal is a replicable, step-by-step methodology for outlier analysis in scientific datasets.

In [None]:
from EXFOR_ProtonReactions_UtilityFunctions import *
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
import numpy as np
import matplotlib.pyplot as plt  # used for the loss plot below
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 12)

## Data Import and Cleaning

In [None]:
path = r'D:\OneDrive\ETSII\MASTER\TFM\Scripts\exfortables\EXFOR_ProtonReactions_Classified_Group_1.csv'
df = pd.read_csv(path)
df = clean_dataframe(df)
df

In [None]:
# Save the IDs and drop them from the dataframe
x4_id = df['X4_ID']
df = df.drop(columns=['X4_ID'])

## Building the Autoencoder

In [None]:
# Data Splitting (Train/Test)
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# Data Scaling
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
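
The scaler is fitted on the training split only, so rows outside the training range can map outside [0, 1]. The following is a minimal NumPy sketch of the same min-max formula, using hypothetical toy values rather than the EXFOR data; it is not the `MinMaxScaler` implementation itself.

```python
import numpy as np

# Hypothetical toy data: one feature whose training range is [10, 20]
X_train_toy = np.array([[10.0], [15.0], [20.0]])
X_test_toy = np.array([[12.0], [25.0]])  # 25 lies outside the training range

# Min-max scaling with statistics taken from the training split only
train_min = X_train_toy.min(axis=0)
train_max = X_train_toy.max(axis=0)
X_train_scaled = (X_train_toy - train_min) / (train_max - train_min)
X_test_scaled = (X_test_toy - train_min) / (train_max - train_min)

print(X_train_scaled.ravel())  # training values fall in [0, 1]
print(X_test_scaled.ravel())   # the out-of-range row maps above 1
```

Because the decoder below ends in a sigmoid, its outputs stay between 0 and 1, so a scaled value above 1 cannot be reconstructed exactly and contributes to the reconstruction error.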

In [None]:
# Model Architecture
input_dim = X_train.shape[1]
encoding_dim = input_dim // 2  # half the input size for simplicity; adjust as needed

input_layer = Input(shape=(input_dim,))
encoder = Dense(encoding_dim, activation='relu')(input_layer)
decoder = Dense(input_dim, activation='sigmoid')(encoder)

autoencoder = Model(inputs=input_layer, outputs=decoder)
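
To make the data flow through this architecture concrete, here is a rough NumPy sketch of the forward pass of a one-hidden-layer autoencoder with a ReLU encoder and a sigmoid decoder. The weights are random and untrained, and `input_dim = 6` is an assumed value chosen for illustration; Keras learns the real weights during `fit`.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim = 6                    # assumed feature count, for illustration only
encoding_dim = input_dim // 2    # bottleneck is half the input size

# Hypothetical, untrained weights standing in for the two Dense layers
W_enc = rng.normal(size=(input_dim, encoding_dim))
b_enc = np.zeros(encoding_dim)
W_dec = rng.normal(size=(encoding_dim, input_dim))
b_dec = np.zeros(input_dim)

x = rng.random((5, input_dim))   # 5 rows of features already scaled to [0, 1]

code = np.maximum(0.0, x @ W_enc + b_enc)                # ReLU encoder
recon = 1.0 / (1.0 + np.exp(-(code @ W_dec + b_dec)))    # sigmoid decoder

print(code.shape, recon.shape)   # compressed code and reconstruction shapes
```

The bottleneck forces the network to keep only the structure shared by most rows, which is why rows that do not follow that structure tend to reconstruct poorly.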

In [None]:
# Model Compilation
autoencoder.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')

In [None]:
# Model Training
history = autoencoder.fit(X_train, X_train, epochs=50, batch_size=256, validation_data=(X_test, X_test))

## Model Evaluation

In [None]:
# Loss Plot
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.legend()
plt.yscale('log')
plt.show()

In [None]:
# Performance Metrics
predictions = autoencoder.predict(X_test)
mse = np.mean(np.power(X_test - predictions, 2), axis=1)
print(f'MSE: {np.mean(mse)}')

mean_value = np.mean(X_train, axis=0)
mse_naive = np.mean((X_test - mean_value) ** 2)
print(f"MSE of naive model: {mse_naive}")

std_value = np.std(X_train, axis=0)
print(f"Standard deviation: {std_value}")
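
One way to read these numbers is as a ratio: the naive model predicts the training mean of each feature, so a model-to-naive MSE ratio below 1 means the autoencoder reconstructs better than that baseline. The sketch below uses synthetic arrays as stand-ins for the scaled data and the predictions; the names `mse_model` and `ratio` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scaled data standing in for X_train and X_test
X_train_demo = rng.random((200, 4))
X_test_demo = rng.random((50, 4))

# Hypothetical reconstructions: the test data plus small Gaussian noise
predictions_demo = X_test_demo + rng.normal(0.0, 0.05, size=X_test_demo.shape)

# Both errors are averaged over every row and feature
mse_model = np.mean((X_test_demo - predictions_demo) ** 2)
mse_naive = np.mean((X_test_demo - X_train_demo.mean(axis=0)) ** 2)

ratio = mse_model / mse_naive
print(f"Model MSE / naive MSE: {ratio:.3f}")
```

A ratio close to 1 would suggest the autoencoder has learned little beyond the per-feature mean.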

## Outlier Detection

In [None]:
# Preparing the DataFrame for Outlier Detection
df = pd.read_csv(path)
df = clean_dataframe(df)
X4_ID_column = df['X4_ID'].copy()
df_without_X4_ID = df.drop('X4_ID', axis=1)

In [None]:
# Scaling and Predictions
df_scaled_complete = scaler.transform(df_without_X4_ID)
predictions = autoencoder.predict(df_scaled_complete)

In [None]:
# Detecting Outliers
reconstruction_error = np.mean(np.power(df_scaled_complete - predictions, 2), axis=1)
threshold = np.percentile(reconstruction_error, 99.0)
anomalies_col2 = reconstruction_error > threshold
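
Note that a percentile threshold flags a fixed share of rows by construction: with the 99th percentile, roughly 1% of the rows are marked as outliers regardless of how the errors are distributed. A small sketch with synthetic reconstruction errors (hypothetical values, not the notebook's results) illustrates this.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical reconstruction errors for 1,000 rows
reconstruction_error_demo = rng.exponential(scale=0.01, size=1000)

# Same rule as above: flag everything above the 99th percentile
threshold_demo = np.percentile(reconstruction_error_demo, 99.0)
anomalies_demo = reconstruction_error_demo > threshold_demo

share = anomalies_demo.mean() * 100
print(f"Flagged rows: {anomalies_demo.sum()} ({share:.2f}%)")
```

The percentile therefore sets how many candidates are reviewed, not whether a given row is truly anomalous; the visual verification step further down serves that purpose.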

In [None]:
# Post-processing
df_original_values = pd.DataFrame(scaler.inverse_transform(df_scaled_complete), columns=df_without_X4_ID.columns, index=df.index)
df_original_values['X4_ID'] = X4_ID_column
df_original_values['Outliers'] = anomalies_col2

In [None]:
# Extracting the Outliers
outliers_df = df_original_values[df_original_values['Outliers']].drop('Outliers', axis=1)
print('Percentage of outliers: {:.2f}%'.format(len(outliers_df)/len(df)*100))
outliers_df

## Visual Representation and Verification of Outliers

For the visual representations, the whole dataset needs to be loaded into memory. This is done here with the `read_experiments_from_binary` function; the `read_experiments_from_txt` function could be used instead. Both are available in the `EXFOR_ProtonReactions_UtilityFunctions.py` file.

In [None]:
experiments = read_experiments_from_binary('EXFOR_ProtonReactions_Database.bin')

In [None]:
plot_outliers(outliers_df, experiments, ylog=True)