# Case Study 6 - Neural Networks

__Team Members:__ Amber Clark, Andrew Leppla, Jorge Olmos, Paritosh Rai

# Content
* [Business Understanding](#business-understanding)
    - [Scope](#scope)
    - [Introduction](#introduction)
    - [Methods](#methods)
    - [Results](#results)
* [Data Evaluation](#data-evaluation)
    - [Loading Data](#loading-data) 
    - [Data Summary](#data-summary)
    - [Missing Values](#missing-values)
    - [Feature Removal](#feature-removal)
    - [Exploratory Data Analysis (EDA)](#eda)
* [Model Preparations](#model-preparations)
    - [Sampling & Scaling Data](#sampling-scaling-data)
    - [Proposed Method](#proposed-metrics)
    - [Evaluation Metrics](#evaluation-metrics)
    - [Feature Selection](#feature-selection)
* [Model Building & Evaluations](#model-building)
    - [Sampling Methodology](#sampling-methodology)
    - [Model](#model)
    - [Performance Analysis](#performance-analysis)
* [Model Interpretability & Explainability](#model-explanation)
    - [Examining Feature Importance](#examining-feature-importance)
* [Conclusion](#conclusion)
    - [Final Model Proposal](#final-model-proposal)
    - [Future Considerations, Model Enhancements and Alternative Modeling Approaches](#model-enhancements)

# Business Understanding & Executive Summary <a id='business-understanding'/>

## Objective:

The objective of this case study is to predict the detection of a new subatomic particle with high accuracy from a dataset with 7 million records.  

## Introduction:
No information about the data was provided for this case study; the only stipulation was to classify a binary variable representing "the existence of a particle" using a neural network. For the binary target, 1 represents detection and 0 represents non-detection. The client has advised that this massive amount of data is best modeled with neural networks and that a high level of accuracy is critical.

### Artificial Neural Networks

Neural networks are modeled on brain biology and simulate the brain's function. In the brain, neurons are connected to other neurons by axons. An Artificial Neural Network (ANN) applies the same concept. An ANN comprises groups of "neurons" called layers. These layers are connected in a network that takes inputs from the dataset, fits model weights to the inputs, and eventually produces outputs that can be used to classify a target variable. The layers between the inputs and the target outputs in a neural network are called hidden layers.

<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_6/Neural_Network_fig.png" width=300 height=300 />

Physiologically, neurons work by firing signals only when a certain signal "threshold" is reached. This behavior is mimicked by ANNs. Any signal input below the threshold will not result in an output from the neuron, while any signal at or above the threshold will result in a constant output. Various activation functions are used to mathematically approximate how a neuron works. Activation functions are equations that determine the output of a neural network model. 

Some of the common activation functions are discussed below:


<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_6/Activation%20Function.png" width=300 height=300 />
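As a rough illustration of how activation functions map a neuron's input signal to its output, here is a minimal sketch in plain Python. The four functions chosen (step, sigmoid, tanh, ReLU) are an assumption about what counts as "common"; only sigmoid and ReLU are used in the model later in this notebook.

```python
import math

def step(z, threshold=0.0):
    """Hard threshold: constant output of 1.0 at or above the threshold, else 0.0."""
    return 1.0 if z >= threshold else 0.0

def sigmoid(z):
    """Smoothly squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any real input into (-1, 1), centered at zero."""
    return math.tanh(z)

def relu(z):
    """Passes positive inputs through unchanged and clips negatives to zero."""
    return max(0.0, z)

# Compare the four functions on a few illustrative inputs
for z in (-2.0, 0.0, 2.0):
    print(z, step(z), round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```

The step function is the closest to the biological "threshold" description, while the smooth alternatives are differentiable, which is what gradient-based training relies on.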


Each neuron represents a regression in the neural network and calculates an output. A neural network is an ensemble of many regressors that will take the outputs of previous regressors as inputs. This results in a large ensemble of regression models.
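The "neuron as a regression" idea can be sketched in a few lines of plain Python. The weights, biases, and inputs below are made-up illustrative values, and `neuron` and `dense_layer` are hypothetical helper names, not part of any library.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """One neuron: a weighted sum of inputs (a linear regression) passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation):
    """A layer is a list of neurons that all see the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]

# Illustrative values only: 3 input features, 2 hidden neurons, 1 output neuron
x = [0.5, -1.0, 2.0]
hidden = dense_layer(x, [[0.2, -0.4, 0.1], [-0.3, 0.1, 0.5]], [0.0, 0.1], relu)
output = neuron(hidden, [1.0, -1.0], 0.0, sigmoid)  # outputs of one layer feed the next
print(hidden, output)
```

Training a network amounts to adjusting every weight and bias in this chain so that the final output matches the target as closely as possible.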


## Modeling:

### Training and Test Split
70/30 split ---> batch and epochs?


### Key Metrics
talk about accuracy/auc and why that's our metric 


### Results



### Feature Importance
Add linear feature importances 


## Conclusion



## Future Considerations



# Data Evaluation <a id='data-evaluation'/>
    

Summarize the data being used in the case using appropriate mediums (charts, graphs, tables); address questions such as: Are there missing values? Which variables are needed (which ones are not)? What assumptions or conclusions are you drawing that need to be relayed to your audience?

## Loading Data <a id='loading-data'/>

In [1]:
# standard libraries
import pandas as pd
import numpy as np
import os
from IPython.display import Image
#from IPython.display import clear_output
#import sklearn
import time
#import re

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from tabulate import tabulate

# data pre-processing
from sklearn.impute._base import _BaseImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection._split import BaseShuffleSplit
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# prediction models
from sklearn.linear_model import LogisticRegression
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import fbeta_score

# import warnings filter
'''import warnings
warnings.filterwarnings('ignore')
from warnings import simplefilter 
simplefilter(action='ignore', category=FutureWarning)'''



In [3]:
from os import listdir
from os.path import isfile, join

class FilePathManager:
    def __init__(self, local_dir: str):
        self.local_dir = local_dir
    
    def retrieve_full_path(self):
        # os.path.join builds the path with the correct separator on any OS
        return os.path.join(os.getcwd(), self.local_dir)

In [4]:
class Loader:
    df = pd.DataFrame()
    
    #@abstractmethod
    def load_data(self, file_name):
        pass
    
    #@abstractmethod
    def get_df(self):
        pass
    
    def size(self):
        return len(self.df)

In [9]:
from typing import Callable
 
class CSVLoader(Loader):
    def __init__(self, file_path_manager: FilePathManager):
        self.file_path_manager = file_path_manager
        
    def load_data(self, _prepare_data: Callable[[pd.DataFrame], pd.DataFrame] = None):
        self.df = pd.read_csv(self.file_path_manager.retrieve_full_path())
        if _prepare_data:
            self.df = _prepare_data(self.df)
    
    def get_df(self):
        return self.df
    
    def size(self):
        return len(self.df)  

In [7]:
def clean_data(df):
    df['# label'] = df['# label'].astype(int)
    return df

In [10]:
loader = CSVLoader(FilePathManager('data/all_train.csv'))
loader.load_data(clean_data)
df = loader.get_df()

## Data Summary <a id='data-summary'/>
    
### Data Exploration and Manipulation:
    
The provided data, although not described in detail, is a large dataset consisting of 28 features and a binary class. The column mass is the only named feature; all others are arbitrarily numbered, and all features are numeric. There are seven million observations. There are no known missing values in the data, with the caveat that it is unknown whether zeros could constitute missing data.
The only manipulation required to prepare this data for a neural network model is to cast the target class to an integer type (as done in `clean_data`) and to normalize the range of the features, which was performed after the data was split into training and test sets. In addition, the target classes are very well balanced in the dataset.


## Missing Values <a id='missing-values'/>
There are no missing Values -- elaborate on this point later 


## Exploratory Data Analysis (EDA) <a id='eda'/>



In [None]:
df.head()

### Scaling and Skew

In [None]:
feature_summary = df.iloc[:, 1:29].describe().T
feature_summary

### Highly Skewed Features

In [None]:
right_skew = feature_summary.loc[feature_summary['max'] > feature_summary['mean'] + feature_summary['std']*4]
left_skew = feature_summary.loc[feature_summary['min'] < feature_summary['mean'] - feature_summary['std']*4]
skew = pd.concat([right_skew.T, left_skew.T], axis=0, join='outer')
skew.head(8)

#### Skewed, no major outliers

In [None]:
fig, axes = plt.subplots(2, 6, figsize=(14, 5))
fig.suptitle('Skewed Features - Boxplots')
for i,j in zip(skew.columns, range(11)):
    sns.boxplot(ax = axes[int(j/6), j%6], x = df[i])

fig.tight_layout()

    
### Target Variable Class Distribution

    
<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_6/Target_Variable_Class_Distribution.png" width=300 height=300 />

    
Correlations were also observed among features f6, f10, f14, f18, and f26, and between these features and the target variable. However, all variables will be included in model fitting, because without domain knowledge of the features there is no basis for excluding some of them in favor of others.

    
        
<img src="https://raw.githubusercontent.com/olmosjorge28/QTW-SPRING-2022/main/ds7333_case_study_6/Correlation_heat_map.png" width=300 height=300 />
   

# Model Preparations <a id='model-preparations'/>

Which methods are you proposing to utilize to solve the problem?  Why is this method appropriate given the business objective? How will you determine if your approach is useful (or how will you differentiate which approach is more useful than another)?  More specifically, what evaluation metrics are most useful given that the problem is a classification one (ex., Accuracy, F1-score, Precision, Recall, AUC, etc.)?

## Sampling & Scaling Data <a id='sampling-scaling-data' />


Training and test sets were created from the data using stratified sampling to maintain the ratio of the binary outcome. This was done out of an abundance of caution, since the classes are already almost perfectly balanced. 30% of the data was withheld for the test set, and the features were standardized using a scaler fit on the training set only.
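The two ideas in this step, splitting each class at the same ratio and fitting the scaler on the training rows only, can be sketched in plain Python. This is a simplified stand-in for `StratifiedShuffleSplit` and `StandardScaler`, not their actual implementation; `stratified_split` and `fit_standardizer` are hypothetical helpers and the data is synthetic.

```python
import random
from statistics import mean, pstdev

def stratified_split(labels, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) so that every class is split at the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def fit_standardizer(train_values):
    """Learn mean and standard deviation from training data only."""
    mu, sigma = mean(train_values), pstdev(train_values)
    def scale(v):
        return (v - mu) / sigma if sigma else 0.0
    return scale

# Synthetic example: 100 rows, perfectly balanced binary target, one feature
labels = [0, 1] * 50
feature = [float(i) for i in range(100)]

train_idx, test_idx = stratified_split(labels, test_size=0.3)
scale = fit_standardizer([feature[i] for i in train_idx])
train_scaled = [scale(feature[i]) for i in train_idx]
test_scaled = [scale(feature[i]) for i in test_idx]  # reuses training statistics
```

Reusing the training mean and standard deviation on the test rows keeps information about the held-out data from leaking into preprocessing.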


## Proposed Method <a id='proposed-metrics' />

The stakeholders asked our team to prioritize a model that predicts the existence of a new particle with high accuracy; model interpretability was not a priority. With this in mind, our team chose an Artificial Neural Network. The network has an input layer of 28 neurons, one per feature, followed by two hidden layers and a single-neuron output layer with a sigmoid activation function and a BinaryCrossentropy loss, since our target variable is binary. The hidden layers use a ReLU activation, chosen because its non-linearity helps the network approximate non-linear functions. The hidden layers have 200 and 50 neurons, respectively.

In our experiments, the best results were achieved with a batch size of 1000. Each batch was a large enough sample to limit unnecessary fluctuations during training, which gave the model a good balance between bias and variance. The batches were also small enough to fit in memory, but not so small that processing time increased dramatically. The team ran 100 epochs; however, no further improvement was observed after 30 epochs.
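To put the batch size in context, the number of weight updates per epoch follows from simple arithmetic. The sketch below assumes exactly 7,000,000 records and a 70/30 split; `steps_per_epoch` is a hypothetical helper, and both batch sizes that appear in this notebook (1000 in the text, 100000 in the final training call) are shown.

```python
import math

def steps_per_epoch(n_train, batch_size):
    """Number of mini-batches (weight updates) needed to pass over the training set once."""
    return math.ceil(n_train / batch_size)

n_records = 7_000_000                 # assumption: exactly seven million rows
n_train = n_records * 70 // 100       # 70% of rows kept for training

for batch_size in (1000, 100_000):
    per_epoch = steps_per_epoch(n_train, batch_size)
    print(f"batch={batch_size}: {per_epoch} updates per epoch, "
          f"{per_epoch * 30} updates over 30 epochs")
```

Smaller batches mean many more updates per pass over the data, which is the processing-time trade-off described above.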

### Code

In [None]:
class BaseImputer:
    #@abstractmethod
    def fit(self, X, y=None):
        pass
    
    #@abstractmethod
    def transform(self, X):
        pass

class BaseModel:
    #@abstractmethod
    def fit(self, X, y, sample_weight=None):
        pass
    
    #@abstractmethod
    def predict(self, X):
        pass

In [None]:
class Modeling:
    _X_train_fitted = None
    _X_test_fitted = None
    _y_train = None
    _y_test = None
    _y_preds = None
    
    def __init__(self, data: pd.DataFrame, 
                 target_name: str, 
                 shuffle_splitter: BaseShuffleSplit, 
                 imputer: BaseImputer, 
                 model: BaseModel, 
                 scaler = None):
        self._data = data
        self._target_name = target_name
        self._shuffle_splitter = shuffle_splitter
        self._imputer = imputer
        self._model = model
        self._X, self._y = self._split_data()
        self._scaler = scaler
        
    @property
    def X(self):
        return self._X
    
    @property
    def y(self):
        return self._y

    @property
    def model(self):
        return self._model
    
    @model.setter
    def model(self, model):
        self._model = model
     
    @property
    def X_train(self):
        return self._X_train_fitted
    
    @property
    def X_test(self):
        return self._X_test_fitted
    
    @property
    def y_train(self):
        return self._y_train
    
    @property
    def y_test(self):
        return self._y_test
    
    @property
    def y_preds(self):
        return self._y_preds
    
    def _split_data(self):
        X = self._data.copy()
        return X.drop([self._target_name], axis=1) , X[self._target_name]
    
    def _shuffle_split(self):
        X = self.X
        y = self.y
        for train_index, test_index in self._shuffle_splitter.split(X,y):
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            # the splitter yields positional indices, so index y by position as well
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        return X_train, X_test, y_train, y_test
    
    def _fit_imputer(self, train):
        if self._imputer is not None:
            self._imputer.fit(train)
    
    def _fit_scaler(self, train):
        if self._scaler is not None:
            self._scaler.fit(train)
    
    def _impute_data(self, X: pd.DataFrame):
        if self._imputer is not None:
            return pd.DataFrame(self._imputer.transform(X), columns = self.X.columns, index = X.index)
        return X
    
    def _scale_data(self, X: pd.DataFrame):
        if self._scaler is not None:
            # keep the original row index so X stays aligned with y
            X = pd.DataFrame(self._scaler.transform(X), columns = self._X.columns, index = X.index)
        return X
    
    def prepare(self):
        X_train, X_test, y_train, y_test = self._shuffle_split()   
        self._fit_imputer(X_train)
        X_train = self._impute_data(X_train)
        X_test = self._impute_data(X_test)
        self._fit_scaler(X_train)
        self._X_train_fitted = self._scale_data(X_train)
        self._X_test_fitted = self._scale_data(X_test)
        self._y_train = y_train
        self._y_test = y_test
        
    def prepare_and_train(self):
        self.prepare()
        return self.train()
        
    def train(self): #, epoch=None, batch=None
        self._model.fit(self.X_train, self.y_train) #, batch_size=batch, epochs=epoch
        self._y_preds = self._model.predict(self.X_train)
        
        return self.metrics(self.y_train, self.y_preds)
        
    def test(self):
        return self.metrics(self.y_test, self._model.predict(self.X_test))
       
        
    def metrics(self, y_true = None, y_pred = None):
        pass

In [None]:
class ClassificationModeling(Modeling):
    def __init__(self, 
                 data: pd.DataFrame, 
                 target_name: str, 
                 shuffle_splitter: BaseShuffleSplit, 
                 imputer: BaseImputer, 
                 model: BaseModel, 
                 scaler = None,
                 beta: int = 1,
                 classification: str = 'binary'):
        super().__init__(data, target_name, shuffle_splitter, imputer, model, scaler)
        self.beta = beta
        self.classification = classification
    
    def metrics(self, y_true = None, y_pred = None):
        if y_true is None and y_pred is None:
            y_true = self.y_train
            y_pred = self.y_preds
        return ({'matrix': confusion_matrix(y_true, y_pred), 
            'accuracy': accuracy_score(y_true, y_pred), 
            'precision': precision_score(y_true, y_pred, average=self.classification), 
            'recall': recall_score(y_true, y_pred, average=self.classification),
             'f1': f1_score(y_true, y_pred),
            'f{}'.format(self.beta) : fbeta_score(y_true, y_pred, average=self.classification, beta=self.beta) } )

In [None]:
class NNClassificationModeling(ClassificationModeling):
    def __init__(self, 
             data: pd.DataFrame, 
             target_name: str, 
             shuffle_splitter: BaseShuffleSplit, 
             imputer: BaseImputer, 
             model: BaseModel, 
             scaler = None,
             beta: int = 1,
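             classification: str = 'binary',
             tb_callback = None):
        super().__init__(data, target_name, shuffle_splitter, imputer, model, scaler, beta, classification)
        # create the default callback per instance instead of once at class definition
        self.tb_callback = tb_callback if tb_callback is not None else TensorBoard(log_dir="logs/", histogram_freq=1)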
        
        
    def train(self, epoch, batch):
        logDir = "logs/{epoch}-{batchsize}-{time}".format(epoch=epoch, batchsize=batch, time=time.time())
        self.tb_callback.log_dir = logDir
        self._model.fit(self.X_train, self.y_train, batch_size=batch, epochs=epoch, validation_data=(self.X_test, self.y_test), callbacks=[self.tb_callback])
        self._y_preds = self._model.predict(self.X_train)
        return self.metrics(self.y_train, self.y_preds)
    
    def metrics(self, y_true = None, y_pred = None):
        if y_true is None and y_pred is None:
            y_true = self.y_train
            y_pred = self.y_preds
            
        # Keras returns probabilities with shape (n, 1): flatten, then threshold at 0.5
        y_pred = pd.Series((y_pred.ravel() > 0.5).astype(int), index=y_true.index)
        return super().metrics(y_true,y_pred)
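
The thresholding and F-beta steps used in the `metrics` methods above can be illustrated without any libraries. This is a minimal sketch with made-up labels and probabilities; `threshold`, `confusion_counts`, and `fbeta` are hypothetical helpers that mirror what `confusion_matrix` and `fbeta_score` compute for the binary case.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs into hard 0/1 predictions."""
    return [1 if p > cutoff else 0 for p in probabilities]

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def fbeta(tp, fp, fn, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Made-up example values
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.4, 0.3, 0.8, 0.6, 0.1]

y_pred = threshold(probs)
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(y_pred, (tp, fp, fn, tn), accuracy, fbeta(tp, fp, fn, beta=2))
```

With `beta=2`, as passed to the modeling classes in this notebook, a missed detection (false negative) costs more than a false alarm.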
    

## Evaluation Metrics <a id='evaluation-metrics' />

### Baseline Model

For our baseline, the team ran a logistic regression model with a 70/30 stratified split, an L1 penalty, and the saga solver. Logistic regression was chosen because it is a simple, quick, and interpretable model. This gave the team a benchmark accuracy against which to compare the proposed artificial neural network.

In [None]:
baseline = ClassificationModeling(df,'# label',
                           StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=12343),
                           None,
                           LogisticRegression(penalty='l1', solver='saga', random_state=12343),
                           StandardScaler(), beta=2)

baseline.prepare()

In [None]:
baseline_rows = []

for i in [0.0001, 0.0005, 0.001, .005, 1, 100]:
    baseline.model.C = i
    # DataFrame.append was removed in pandas 2.0, so collect rows in a list first
    baseline_rows.append({"C": i,
                          "Train Accuracy": round( baseline.train()['accuracy'], 4),
                          "Test Accuracy": round( baseline.test()['accuracy'], 4)})

baseline_results = pd.DataFrame(baseline_rows)

In [None]:
sns.lineplot(data=baseline_results, x='C', y='Train Accuracy', color='blue')
sns.lineplot(data=baseline_results, x='C', y='Test Accuracy', color='red')
plt.title('Tuning C for Logistic Regression with L1')
plt.legend(['Train', 'Test'])
plt.xscale('log')
plt.axvline(0.001, color='black', ls='--')
plt.show()

#### Best Baseline Logistic Model

In [11]:
baseline.model.C = 0.001
baseline.train() #epoch=None, batch=None

In [None]:
baseline.test()

## Feature Selection <a id='feature-selection' />

All features were used in the proposed neural network model. The team chose not to use regularization because the evaluation metrics on the training and test sets aligned, which indicates that the neural network was not overfitting.


# Model Building & Evaluations <a id='model-building'/>

In this case, your primary task is to construct a neural network to detect the existence of new particles and will involve the following steps:

- Construct your neural network's architecture
- Fit your neural network to your training data
- Analyze your model's performance - referencing your chosen evaluation metric (including supplemental visuals and analysis where appropriate)


The team initially fit a Logistic Regression model to the data set to get a baseline accuracy rate for the prediction. Then, a neural network model was fit to assess the improvement in the accuracy rate.


## Sampling Methodology <a id='sampling-methodology'/>

## Modeling -- TODO: Talk about final model

### Final Model


In [None]:
NN = NNClassificationModeling(df,'# label',
                           StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=12343),
                           None,
                           None,
                           StandardScaler(), beta=2)

NN.prepare()

In [None]:
NN.model = tf.keras.Sequential() # model object

NN.model.add( tf.keras.layers.Input( shape=(NN.X_train.shape[1],) ) )
# specify data shape for first input layer
# columns (features) only, # rows specified by batch size later in fit() 

NN.model.add( tf.keras.layers.Dense(200, activation = 'relu') )
# add these layers sequentially with decreasing # neurons
NN.model.add( tf.keras.layers.Dense(50, activation = 'relu') )

NN.model.add( tf.keras.layers.Dense(1, activation = 'sigmoid') )
# Final layer: a single sigmoid neuron outputs the probability of class 1
# (for multi-class problems, use a 'softmax' output layer instead)

NN.model.compile(optimizer='adam', loss='BinaryCrossentropy', metrics=['accuracy'])
# The model must be compiled after its layers are specified

NN.train(batch = 100000, epoch=40)

In [None]:
NN.test()

## Model's Performance Analysis <a id='performance-analysis'/>


Runtime, training, confusion matrices, accuracy and AUC

## Model Interpretability & Explainability <a id='model-explanation'/>

### Final Model Proposal <a id='final-model-proposal'/>

### Examining Feature Importance <a id='examining-feature-importance'/>


#### Feature importance for baseline Model

!!! TODO: expand on this point


In [None]:
# pair each feature name with its fitted logistic regression coefficient
feat_coef = pd.DataFrame({'feature': baseline.X_train.columns,
                          'coef': baseline.model.coef_[0]})
# L1 shrinks some coefficients to exactly zero; keep only the non-zero ones
top_feat_baseline = feat_coef.loc[feat_coef['coef'].abs() > 0].sort_values(by='coef')

feat_plot = sns.barplot(data=top_feat_baseline, x='feature', y='coef', palette = "ch:s=.25,rot=-.25")
plt.xticks(rotation=90)
plt.title('LR Feature Importance with L1')
plt.show()

# Conclusion <a id='conclusion'/>

After all of your technical analysis and modeling; what are you proposing to your audience and why?  How should they view your results and what should they consider when moving forward?  Are there other approaches you'd recommend exploring?  This is where you "bring it all home" in language they understand.

### Future Considerations, Model Enhancements and Alternative Modeling Approaches <a id='model-enhancements'/>

## References