# Case Study 7

__Team Members:__ Amber Clark, Andrew Leppla, Jorge Olmos, Paritosh Rai

# Content
* [Business Understanding](#business-understanding)
    - [Introduction](#introduction)
    - [Methods](#methods)
    - [Results](#results)
* [Data Evaluation](#data-evaluation)
    - [Loading Data](#loading-data) 
    - [Data Summary](#data-summary)
    - [Missing Values](#missing-values)
    - [Exploratory Data Analysis (EDA)](#eda)
* [Model Preparations](#model-preparations)
    - [Sampling & Scaling Data](#sampling-scaling-data)
    - [Proposed Method](#proposed-method)
    - [Evaluation Metrics](#evaluation-metrics)
    - [Feature Selection](#feature-selection)
* [Model Building & Evaluations](#model-building)
    - [Performance Analysis](#performance-analysis)
* [Model Interpretability & Explainability](#model-explanation)
    - [Examining Feature Importance](#examining-feature-importance)
* [Conclusion](#conclusion)
    - [Final Model Proposal](#final-model-proposal)
    - [Future Considerations, Model Enhancements and Alternative Modeling Approaches](#model-enhancements)

# Business Understanding & Executive Summary <a id='business-understanding'/>

## Objective:




## Introduction:





## Modeling:

### Training and Test Split


### Key Metrics
       

### Model Building


### Results

   


## Conclusion



## Future Considerations


# Data Evaluation <a id='data-evaluation'/>
    

Summarize the data being used in the case using appropriate mediums (charts, graphs, tables); address questions such as: Are there missing values? Which variables are needed (which ones are not)? What assumptions or conclusions are you drawing that need to be relayed to your audience?

## Loading Data <a id='loading-data'/>

In [1]:
# standard libraries
import os
import time
from abc import ABC, abstractmethod

import numpy as np
import pandas as pd
from IPython.display import Image

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from tabulate import tabulate
from IPython.display import clear_output
import xgboost

# data pre-processing
from scipy.io import arff
#from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.impute._base import _BaseImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection._split import BaseShuffleSplit
from sklearn.datasets import load_digits
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# prediction models
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.svm._base import BaseSVC 
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import fbeta_score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
import tensorflow as tf

# import warnings filter
import warnings
warnings.filterwarnings('ignore')
from warnings import simplefilter 
simplefilter(action='ignore', category=FutureWarning)

In [2]:
class FilePathManager:
    def __init__(self, local_dir: str):
        self.local_dir = local_dir
    
    def retrieve_full_path(self):
        return os.path.join(os.getcwd(), self.local_dir)

In [3]:
class Loader:
    df = pd.DataFrame()
    
    def load_data(self, file_name):
        pass
    
    def get_df(self):
        pass
    
    def size(self):
        return len(self.df)

In [4]:
from typing import Callable
 
class CSVLoader(Loader):
    def __init__(self, file_path_manager: FilePathManager):
        self.file_path_manager = file_path_manager
        
    def load_data(self, _prepare_data: Callable[[pd.DataFrame], pd.DataFrame] = None):
        self.df = pd.read_csv(self.file_path_manager.retrieve_full_path())
        if _prepare_data:
            self.df = _prepare_data(self.df)
    
    def get_df(self):
        return self.df
    
    def size(self):
        return len(self.df)  

In [5]:
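def clean_data(df):
    # Cast the target to integer class labels.
    df['y'] = df['y'].astype(int)
    # Strip the unit symbols so the columns parse as floats. regex=False treats
    # '%' and '$' as literal characters ('$' is otherwise an end-of-string anchor).
    df['x32'] = df['x32'].str.replace('%', '', regex=False).astype(float)
    df['x37'] = df['x37'].str.replace('$', '', regex=False).astype(float)
    return df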

In [6]:
loader = CSVLoader(FilePathManager('final_project(5).csv'))
loader.load_data(clean_data)
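
The cleaning step can be checked on a toy frame before loading the full file. This is a minimal sketch, assuming `x32` holds percent strings and `x37` holds dollar strings as `clean_data` implies; the sample values and the `clean_toy` helper are invented for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values) mimicking the raw string columns.
toy = pd.DataFrame({
    "x32": ["0.01%", "-0.02%", None],
    "x37": ["$1313.96", "$-25.5", None],
    "y": [0.0, 1.0, 0.0],
})

def clean_toy(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_data that works on a copy."""
    out = df.copy()
    out["y"] = out["y"].astype(int)
    # regex=False makes '%' and '$' literal characters, not regex tokens.
    out["x32"] = out["x32"].str.replace("%", "", regex=False).astype(float)
    out["x37"] = out["x37"].str.replace("$", "", regex=False).astype(float)
    return out

cleaned = clean_toy(toy)
print(cleaned.dtypes)
```

Missing entries stay missing after the conversion, so they can still be handled by the imputer later in the pipeline.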

## Data Summary <a id='data-summary'/>
    
### Data Exploration and Manipulation:
    



## Missing Values <a id='missing-values'/>



## Exploratory Data Analysis (EDA) <a id='eda'/>



# Model Preparations <a id='model-preparations'/>

In [8]:
class BaseImputer:
    def fit(self, X, y=None):
        pass
    
    def transform(self, X):
        pass

class BaseModel:

    def fit(self, X, y, sample_weight=None):
        pass
    
    def predict(self, X):
        pass

In [9]:
# Inherit from ABC so that @abstractmethod on metrics() is actually enforced.
class Modeling(ABC):
    _X_train_fitted = None
    _X_test_fitted = None
    _y_train = None
    _y_test = None
    _y_preds = None
    _y_preds_proba = None
    
    def __init__(self, data: pd.DataFrame, 
                 target_name: str, 
                 shuffle_splitter: BaseShuffleSplit, 
                 imputer: BaseImputer, 
                 model: BaseModel, scaler = None, encoder = None):
        self._data = data
        self._target_name = target_name
        self._shuffle_splitter = shuffle_splitter
        self._imputer = imputer
        self._model = model
        self._encoder = encoder
        self._X, self._y = self._split_data()
        self._scaler = scaler
        
    @property
    def X(self):
        return self._X
    
    @property
    def y(self):
        return self._y

    @property
    def model(self):
        return self._model
    
    @model.setter
    def model(self, model):
        self._model = model
     
    @property
    def X_train(self):
        return self._X_train_fitted
    
    @property
    def X_test(self):
        return self._X_test_fitted
    
    @property
    def y_train(self):
        return self._y_train
    
    @property
    def y_test(self):
        return self._y_test
    
    @property
    def y_preds(self):
        return self._y_preds
    
    def _split_data(self):
        X = self._data.copy()
        return X.drop([self._target_name], axis=1) , X[self._target_name]
    
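    def _shuffle_split(self):
        X = self.X
        y = self.y
        # Only the last split is kept, so n_splits=1 is the intended use.
        for train_index, test_index in self._shuffle_splitter.split(X,y):
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            # The splitter yields positions, so index y positionally as well.
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        return X_train, X_test, y_train, y_test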
    
    def _fit_imputer(self, train):
        if self._imputer is not None:
            self._imputer.fit(train)
    
    def _fit_scaler(self, train, cont_vars = None):
        transform_cols = None
        if cont_vars is None:
            transform_cols = self.X.columns
        else:
            transform_cols = cont_vars
            
        if self._scaler is not None:
            self._scaler.fit(train[transform_cols])
    
    def _impute_data(self, X: pd.DataFrame):
        if self._imputer is not None:
            return pd.DataFrame(self._imputer.transform(X), columns = self.X.columns, index = X.index)
        return X
    
    def _scale_data(self, X: pd.DataFrame, cont_vars = None):
        transform_cols = None
        if cont_vars is None:
            transform_cols = X.columns
        else:
            transform_cols = cont_vars
        if self._scaler is None:
            return X
        # Work on a copy and keep the row index so the scaled values align on assignment.
        X = X.copy()
        X[transform_cols] = pd.DataFrame(self._scaler.transform(X[transform_cols]),
                                         columns = transform_cols, index = X.index)
        return X
    
    def _encode_data(self):
        df = self.X.copy()
        cont_vars = df.describe().columns
        cat_vars = set(df.columns) - set(cont_vars)
        for column in [*cat_vars]:
            df[column] = self._encoder.fit_transform(df[column].astype(str))
        self._X = df
        return cont_vars, cat_vars
        
    
    def prepare(self):
        cont_vars = None
        if self._encoder is not None: 
            cont_vars, _ = self._encode_data()
        X_train, X_test, y_train, y_test = self._shuffle_split()   
        self._fit_imputer(X_train)
        X_train = self._impute_data(X_train)
        X_test = self._impute_data(X_test)
        self._fit_scaler(X_train, cont_vars)
        self._X_train_fitted = self._scale_data(X_train, cont_vars)
        self._X_test_fitted = self._scale_data(X_test, cont_vars)
        self._y_train = y_train
        self._y_test = y_test
        
    def prepare_and_train(self):
        self.prepare()
        return self.train()
        
    def train(self):
        self._model.fit(self.X_train, self.y_train)
        self._y_preds = self._model.predict(self.X_train)
        self._y_preds_proba = self._model.predict_proba(self.X_train)
        
        return self.metrics(self.y_train, self.y_preds, self._y_preds_proba)
        
    def test(self):
        return self.metrics(self.y_test, self._model.predict(self.X_test), self._model.predict_proba(self.X_test))
       
    @abstractmethod
    def metrics(self, y_true = None, y_pred = None, y_preds_proba = None):
        pass
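
The order of operations in `prepare` (split first, then fit the imputer and scaler on the training rows only) can be sketched outside the class. This is an illustrative example on synthetic data, not the case-study dataset; the column names, missing-value rate, and the choice of `StandardScaler` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 100 rows, 3 features, roughly 10% missing.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
X = X.mask(rng.random(X.shape) < 0.1)
y = pd.Series(np.tile([0, 1], 50), name="y")

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# Positional indexing keeps features and target aligned after the shuffle.
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Fit on the training rows only, then apply the same transform to both sets.
imputer = SimpleImputer(strategy="mean").fit(X_train)
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)
scaler = StandardScaler().fit(X_train_imp)
X_train_prep = pd.DataFrame(scaler.transform(X_train_imp),
                            columns=X.columns, index=X_train.index)
X_test_prep = pd.DataFrame(scaler.transform(X_test_imp),
                           columns=X.columns, index=X_test.index)
print(X_train_prep.shape, X_test_prep.shape)
```

Passing `index=` when rebuilding the frames keeps the prepared features aligned with `y_train` and `y_test`.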

In [10]:
class ClassificationModeling(Modeling):
    def __init__(self, 
                 data: pd.DataFrame, 
                 target_name: str, 
                 shuffle_splitter: BaseShuffleSplit, 
                 imputer: BaseImputer, 
                 model: BaseModel, 
                 scaler = None,
                 encoder = None,
                 beta: int = 1, 
                 classification: str = 'binary'):
        super().__init__(data, target_name, shuffle_splitter, imputer, model, scaler, encoder)
        self.beta = beta
        self.classification = classification
        
    @abstractmethod
    def metrics(self, y_true = None, y_pred = None, y_preds_proba=None):
        pass

In [11]:
from typing import Type, TypeVar

# Generic placeholder for the estimator class passed to parameter_tuning.
TClass = TypeVar("TClass")

class TuningClassificationModeling(ClassificationModeling):

    def __init__(self,
                 data: pd.DataFrame,
                 target_name: str,
                 shuffle_splitter: BaseShuffleSplit,
                 imputer: BaseImputer,
                 model: BaseModel,
                 scaler = None,
                 encoder = None,
                 beta: int = 1,
                 classification: str = 'binary',
                 classification_type: str = 'logistic'):
        super().__init__(data, target_name, shuffle_splitter, imputer, model, scaler, encoder, beta, classification)
        # Model family being tuned: 'logistic', 'xgb' or 'neural'.
        self.classification_type = classification_type
        # Results of the most recent parameter_tuning run (kept per instance).
        self.all_models = []


    def parameter_tuning(self, params, class_to_instantiate: Type[TClass]):
        list_of_models = []
        combination = []
        params_base = {}
        output = []
        for key, value in params.items():
            if isinstance(value, list):
                combination.append((key,value))
            else:
                params_base[key]=value
        # With no list-valued parameters, tune a single model built from params_base.
        result = [{}]
        if len(combination) > 0:
            result = TuningClassificationModeling.get_combinations(combination)
        print(params_base)
        for r in result:
            list_of_models.append(class_to_instantiate(**{**params_base, **r}))
            
        for a_model in list_of_models:
            self.model = a_model
            startTrain = time.time()
            train_metrics = self.train()
            endTrain = time.time()
            test_metrics = self.test()
            endTest = time.time()
            train_time = endTrain - startTrain
            test_time = endTest - endTrain
            output.append({'model': a_model, 'train_metrics': {**train_metrics,**{'elapsed_time':train_time}}, 'test_metrics': {**test_metrics,**{'elapsed_time':test_time}}})
        self.all_models = output
        return output
        
    def find_best_model(self, metric):
        # Pick the model with the highest test-set value of `metric`;
        # ties are broken in favor of the faster test-time prediction.
        best_score = self.all_models[0]['test_metrics'][metric]
        location = 0
        for indx, output_metrics in enumerate(self.all_models):
            if best_score < output_metrics['test_metrics'][metric]:
                best_score = output_metrics['test_metrics'][metric]
                location = indx
            elif best_score == output_metrics['test_metrics'][metric]:
                if output_metrics['test_metrics']['elapsed_time'] < self.all_models[location]['test_metrics']['elapsed_time']:
                    location = indx
                
        return self.all_models[location]
    
    @staticmethod
    def get_combinations(tuples):
        length = len(tuples)
        if length > 1:
            total_params = []
            tuple_copy = tuples.copy()
            a_tuple = tuple_copy.pop(0)
            params_list = TuningClassificationModeling.get_combinations(tuple_copy)
            for value in a_tuple[1]:
                for a_params in params_list:
                    temp = { a_tuple[0]: value}
                    total_params.append({**temp, **a_params})
            return total_params
        else:
            params_list = []
            a_tuple =  tuples[0]
            for value in a_tuple[1]:
                temp = {}
                temp[a_tuple[0]] = value
                params_list.append(temp)
            return params_list
            
    
    def metrics(self, y_true = None, y_pred = None, y_pred_proba=None):
        if y_true is None and y_pred is None:
            y_true = self.y_train
            y_pred = self.y_preds
        conf_matrix = confusion_matrix(y_true, y_pred)
        return  {
                'matrix': conf_matrix, 
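                # AUC is computed from hard class predictions here, so it equals balanced accuracy at the model's default cutoff.
                'auc': roc_auc_score(y_true, y_pred),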
                'accuracy': round(accuracy_score(y_true, y_pred), 5), 
                'precision': precision_score(y_true, y_pred, average=self.classification), 
                'recall': recall_score(y_true, y_pred, average=self.classification),
                'f1': f1_score(y_true, y_pred),
                'cost': TuningClassificationModeling.cost_calc(conf_matrix),
                'y_pred': y_pred,
                'y_pred_proba': y_pred_proba
               }
    
    @staticmethod
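    def cost_calc(conf_matrix):
        # Rows are true classes, columns are predicted classes:
        # a false positive costs 100 and a false negative costs 25.
        cost_matrix = np.array([[0,-100],[-25,0]])
        # Average (negative) cost per observation.
        cost = np.sum(cost_matrix*conf_matrix)/np.sum(conf_matrix)
        return cost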
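
The grid expansion in `parameter_tuning` separates fixed parameters from list-valued ones and builds one configuration per combination. The sketch below shows the same idea with `itertools.product` from the standard library; `expand_grid` is a hypothetical helper, not part of the class above.

```python
from itertools import product

def expand_grid(params: dict) -> list:
    """Expand list-valued entries into one dict per combination."""
    fixed = {k: v for k, v in params.items() if not isinstance(v, list)}
    swept = {k: v for k, v in params.items() if isinstance(v, list)}
    keys = list(swept)
    # product() over zero iterables yields one empty tuple, so a grid with
    # no list-valued entries still produces a single configuration.
    return [{**fixed, **dict(zip(keys, combo))}
            for combo in product(*(swept[k] for k in keys))]

grid = expand_grid({
    "penalty": "l1",
    "solver": "liblinear",
    "C": [0.001, 0.01, 1, 10],
})
print(len(grid))  # 4
print(grid[0])    # {'penalty': 'l1', 'solver': 'liblinear', 'C': 0.001}
```

As in `get_combinations`, the first list-valued key varies slowest and the last varies fastest.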

In [12]:
class NNModel:
    model = None
    epoch = 50
    batch_size = 32
    loss = 'BinaryCrossentropy'
    metric = 'accuracy'
    optimizer = 'adam'
    
    def __init__(self,**inputs):
        self.model = tf.keras.Sequential()
        for arg, content in inputs.items():
            if arg.startswith('input'):
                self.model.add( tf.keras.layers.Input( shape=(content,) ) )
            if arg.startswith('layer'):
                self.model.add( tf.keras.layers.Dense(content['s'], activation = content['activation']) )
            if arg == 'epoch':
                self.epoch = content
            if arg == 'bs':
                self.batch_size = content
            if arg == 'optimizer':
                self.optimizer = content
            if arg == 'loss':
                self.loss = content
            if arg == 'metric':
                self.metric = content
        self.model.compile(optimizer=self.optimizer, loss=self.loss, metrics=[self.metric])
        print(self.model)
    
    def fit(self, X, y):
        self.model.fit(X, y, batch_size=self.batch_size, epochs=self.epoch)
    
    def predict(self, X):
        y_pred_proba = self.predict_proba(X)
        return pd.Series( (y_pred_proba>0.5).astype(int))
        
    
    def predict_proba(self, X):
        # Keras returns shape (n_samples, 1); flatten to a 1-D Series of class-1 probabilities.
        y_pred_proba = self.model.predict(X)
        return pd.Series(y_pred_proba.ravel())

In [13]:
def tune_cost_proba(train_proba, test_proba, y_train, y_test, conf_train=None, conf_test=None):
    # train_proba/test_proba hold P(class 0); a row is predicted as class 1 when P(class 0) < threshold.
    # conf_train/conf_test are kept for backward compatibility and recomputed per threshold.
    rows = []
    for i in range(11):
        thresh = round(i * 0.05, 2)
        yhat_train = pd.Series(train_proba < thresh).astype(int)
        yhat_test = pd.Series(test_proba < thresh).astype(int)
        conf_train = confusion_matrix(y_train, yhat_train)
        conf_test = confusion_matrix(y_test, yhat_test)
        rows.append({"Threshold": thresh,
                     "Train Cost": -TuningClassificationModeling.cost_calc(conf_train),
                     "Test Cost":  -TuningClassificationModeling.cost_calc(conf_test)})
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows.
    return pd.DataFrame(rows)
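
The threshold sweep can also be written directly in terms of the class-1 probability. Because `tune_cost_proba` predicts class 1 when P(class 0) falls below a threshold between 0 and 0.5, the equivalent cutoffs on P(class 1) run from 1.0 down to 0.5. The following is a sketch on synthetic scores; `average_cost`, the random labels, and the score distribution are assumptions for illustration only.

```python
import numpy as np

FP_COST, FN_COST = 100.0, 25.0  # per-error costs stated in the case prompt

def average_cost(y_true, p_class1, cutoff):
    """Mean cost per row when predicting 1 for p_class1 > cutoff."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(p_class1) > cutoff).astype(int)
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    return (FP_COST * fp + FN_COST * fn) / len(y_true)

# Synthetic labels and scores (hypothetical), only to exercise the sweep.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
p1 = np.clip(0.5 + 0.25 * (2 * y - 1) + rng.normal(0, 0.2, size=1000), 0, 1)

# Eleven cutoffs from 0.50 to 1.00 in steps of 0.05.
cutoffs = np.round(np.arange(0.5, 1.0001, 0.05), 2)
costs = [average_cost(y, p1, c) for c in cutoffs]
best = cutoffs[int(np.argmin(costs))]
print(best, min(costs))
```

Since a false positive costs four times as much as a false negative, the cost-minimizing cutoff will typically sit above 0.5, which is why the sweep only explores that range.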

Which methods are you proposing to utilize to solve the problem? Why is this method appropriate given the business objective? How will you determine if your approach is useful (or how will you differentiate which approach is more useful than another)? More specifically, what evaluation metrics are most useful given that the problem is a binary-classification one (e.g., Accuracy, F1-score, Precision, Recall, AUC)?
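
One point to keep in mind when weighing AUC against the other metrics: the `metrics` method passes hard class predictions to `roc_auc_score`, and with thresholded labels the score reduces to balanced accuracy at that single cutoff rather than a ranking measure over all cutoffs. A small sketch with invented labels and probabilities shows that the two inputs give different values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and class-1 probabilities for six rows.
y_true = np.array([0, 0, 0, 1, 1, 1])
p_class1 = np.array([0.10, 0.40, 0.45, 0.35, 0.60, 0.90])

# Ranking-based AUC uses the probabilities themselves.
auc_from_scores = roc_auc_score(y_true, p_class1)
# Feeding 0/1 predictions gives (TPR + TNR) / 2 at the 0.5 cutoff.
auc_from_labels = roc_auc_score(y_true, (p_class1 > 0.5).astype(int))
print(auc_from_scores, auc_from_labels)
```

Either value can be reported, but the two should not be compared with each other across models.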

# Model Building & Evaluations <a id='model-building'/>

## Linear Model

In [14]:
linear_modeling = TuningClassificationModeling(loader.get_df(),'y',
                                           StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=12343),
                                           SimpleImputer(missing_values=np.nan, strategy='mean'), LogisticRegression, None, LabelEncoder(), beta=1)

In [15]:
linear_modeling.prepare()

In [17]:
linear_result = linear_modeling.parameter_tuning( { 
    'penalty':'l1',
    'random_state':1,
    'solver': 'liblinear',
    'C':  [0.001, 0.01, 1, 10],
 }, LogisticRegression)

{'penalty': 'l1', 'random_state': 1, 'solver': 'liblinear'}


In [18]:
linear_modeling.find_best_model('auc')

{'model': LogisticRegression(C=0.001, penalty='l1', random_state=1, solver='liblinear'),
 'train_metrics': {'matrix': array([[55175, 11887],
         [21306, 23632]]),
  'auc': 0.6743131085040095,
  'accuracy': 0.70363,
  'precision': 0.6653340465666263,
  'recall': 0.5258801014731408,
  'f1': 0.5874442248654561,
  'cost': -15.369196428571428,
  'y_pred': array([1, 1, 0, ..., 0, 0, 1]),
  'y_pred_proba': array([[0.1317954 , 0.8682046 ],
         [0.30796321, 0.69203679],
         [0.62385404, 0.37614596],
         ...,
         [0.83326334, 0.16673666],
         [0.64663224, 0.35336776],
         [0.31716649, 0.68283351]]),
  'elapsed_time': 5.558510065078735},
 'test_metrics': {'matrix': array([[23539,  5202],
         [ 9156, 10103]]),
  'auc': 0.6717950589503955,
  'accuracy': 0.70088,
  'precision': 0.6601110748121529,
  'recall': 0.5245859078872216,
  'f1': 0.5845966901978937,
  'cost': -15.60625,
  'y_pred': array([1, 0, 0, ..., 0, 1, 0]),
  'y_pred_proba': array([[0.33479881, 0.

In [None]:
best_linear = linear_modeling.find_best_model('auc')
train_proba = best_linear['train_metrics']['y_pred_proba']
test_proba = best_linear['test_metrics']['y_pred_proba']
conf_train = best_linear['train_metrics']['matrix']
conf_test = best_linear['test_metrics']['matrix']

# Column 0 holds P(class 0), which tune_cost_proba compares against each threshold.
cost_results = tune_cost_proba(train_proba[:,0], test_proba[:,0], linear_modeling.y_train, linear_modeling.y_test, conf_train, conf_test)

## XGB Model

In [19]:
xgb_classifier = TuningClassificationModeling(loader.get_df(),'y',
                                           StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=12343),
                                           None, XGBClassifier, None, LabelEncoder(), beta=1,classification_type = 'xgb' )

In [20]:
xgb_classifier.prepare()

In [None]:
xgb_results = xgb_classifier.parameter_tuning( { 
    'max_depth': [3,6,10],
    'learning_rate': [0.05, 0.1],
    'n_estimators': [100, 500, 1000],
    'colsample_bytree': [0.3, 0.7],
 }, XGBClassifier)

{}


In [None]:
xgb_classifier.find_best_model('auc')

In [None]:
best_xgb = xgb_classifier.find_best_model('auc')
train_proba = best_xgb['train_metrics']['y_pred_proba']
test_proba = best_xgb['test_metrics']['y_pred_proba']
conf_train = best_xgb['train_metrics']['matrix']
conf_test = best_xgb['test_metrics']['matrix']

# Column 0 holds P(class 0), which tune_cost_proba compares against each threshold.
cost_results = tune_cost_proba(train_proba[:,0], test_proba[:,0], xgb_classifier.y_train, xgb_classifier.y_test, conf_train, conf_test)

## NN Model

In [None]:
nn_modeling = TuningClassificationModeling(loader.get_df(),'y',
                                           StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=12343),
                                           SimpleImputer(missing_values=np.nan, strategy='mean'), NNModel, None, LabelEncoder(), beta=1,classification_type='neural' )

In [None]:
nn_modeling.prepare()

In [None]:
nn_model_tunning = nn_modeling.parameter_tuning( { 
        'input':50,
        'layer1':{'s':300, 'activation': 'relu'}, 
        'layer2':{'s':200, 'activation': 'relu'}, 
        'layer3':{'s':100, 'activation': 'relu'},
        'layer4':{'s':1, 'activation':'sigmoid'},
        'loss':'BinaryCrossentropy',
        'metric':'accuracy',
        'epoch':[10,30,100],
        'bs':[10,100,1000,10000], 
        'optimizer':'adam'
 }, NNModel)        

In [None]:
nn_modeling.find_best_model('auc')

In [None]:
best_nn = nn_modeling.find_best_model('auc')
train_proba = best_nn['train_metrics']['y_pred_proba']
test_proba = best_nn['test_metrics']['y_pred_proba']
conf_train = best_nn['train_metrics']['matrix']
conf_test = best_nn['test_metrics']['matrix']

# NNModel returns P(class 1), so pass 1 - p to give tune_cost_proba P(class 0).
cost_results = tune_cost_proba(1-train_proba, 1-test_proba, nn_modeling.y_train, nn_modeling.y_test, conf_train, conf_test)

In this case, your primary task is to train a model (or models) capable of generalizing on a binary-target that will minimize the monetary loss for your customer and will involve the following steps:

- Specify your sampling methodology
- Set up your model(s) - highlighting any important parameters
- Analyze the performance of your model(s) - referencing your chosen evaluation metric (including supplemental visuals and analysis where appropriate)

### Final Model


# Monetary Outcome

What is the expected monetary cost (or loss) associated with your model and how might you best translate this to your customer? Remember, predicting class 1 incorrectly costs the customer $100, while incorrectly predicting class 0 costs the customer $25. Said another way, False Positives = -$100 and False Negatives = -$25.
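
As a worked check of these costs, the arithmetic can be applied to the test-set confusion matrix reported for the logistic model earlier in this notebook. The sketch assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Rows are true classes, columns are predicted classes.
cost_matrix = np.array([[0, -100],   # true 0: [TN, FP]
                        [-25, 0]])   # true 1: [FN, TP]

# Test-set confusion matrix from the logistic regression output above.
conf_test = np.array([[23539, 5202],
                      [9156, 10103]])

total_cost = int(np.sum(cost_matrix * conf_test))
cost_per_row = total_cost / conf_test.sum()
print(total_cost, cost_per_row)  # -749100 -15.60625
```

The per-row figure matches the `cost` value of -15.60625 in the reported test metrics: 5,202 false positives at $100 plus 9,156 false negatives at $25, spread over 48,000 test rows.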

# Conclusion <a id='conclusion'/>

After all of your technical analysis and modeling; what are you proposing to your audience and why?  How should they view your results and what should they consider when moving forward?  Are there other approaches you'd recommend exploring?  This is where you "bring it all home" in language they understand.

### Future Considerations, Model Enhancements and Alternative Modeling Approaches <a id='model-enhancements'/>

  

## References


## Appendix

