# Model Building
In this stage, you will build several machine learning models on the cleaned data set and attempt to train a model that performs better than a baseline. What counts as "better" depends on your data set and on the metric you choose to evaluate it.
## Imports

In [None]:
#Loading main package, setting working directory, and loading in data sets
library(tidyverse)
setwd("C:/Users/User/Documents/nick-kaggle-training")
data <- read_csv("data/clean_financial.csv") 
imputedm <- read_csv("data/imputedm_financial.csv")

## Functions
For your convenience, we have included a few pre-written functions, which you might find useful in your model building. They are by no means necessary, but feel free to use any or all of them.

### score_classification
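score_classification takes the predicted labels from a model and scores them on nine classification metrics (accuracy, balanced accuracy, precision, recall, F1, ROC AUC, Brier loss, log loss, and Jaccard) for both the training and test sets. It also prints the confusion matrix for the test set.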

Parameters:
- y_train: (1d array-like) The correct y values for the training data set
- y_train_pred: (1d array-like) The predicted y values from the training data set
- y_test: (1d array-like) The correct y values for the test data set
- y_test_pred: (1d array-like) The predicted y values from the test data set

This function uses [sklearn](https://scikit-learn.org/stable/modules/classes.html).metrics to calculate each score. The required functions are imported inside the function.

In [None]:
def score_classification(y_train, y_train_pred, y_test, y_test_pred):
    import numpy as np
    import pandas as pd  # pd.DataFrame is used to build the score table below
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import roc_auc_score
    from sklearn.metrics import average_precision_score
    from sklearn.metrics import brier_score_loss
    from sklearn.metrics import f1_score
    from sklearn.metrics import log_loss
    from sklearn.metrics import precision_score
    from sklearn.metrics import recall_score
    from sklearn.metrics import jaccard_score
    from sklearn.metrics import confusion_matrix
    
    scores = pd.DataFrame(data = np.array([[accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)],
                                          [balanced_accuracy_score(y_train, y_train_pred), balanced_accuracy_score(y_test, y_test_pred)],
                                          [precision_score(y_train, y_train_pred), precision_score(y_test, y_test_pred)],
                                          [recall_score(y_train, y_train_pred), recall_score(y_test, y_test_pred)],
                                          [f1_score(y_train, y_train_pred), f1_score(y_test, y_test_pred)],
                                          [roc_auc_score(y_train, y_train_pred), roc_auc_score(y_test, y_test_pred)],
                                          [brier_score_loss(y_train, y_train_pred), brier_score_loss(y_test, y_test_pred)],
                                          [log_loss(y_train, y_train_pred), log_loss(y_test, y_test_pred)],
                                          [jaccard_score(y_train, y_train_pred), jaccard_score(y_test, y_test_pred)]]),
                          index = ['Accuracy', 
                                   'Balanced_Accuracy', 
                                   'Precision', 
                                   'Recall', 
                                   'f1',
                                   'ROC_AUC',
                                   'Brier_Loss',
                                   'Log_Loss',
                                   'Jaccard'], 
                          columns = ['Train', 'Test'])
    print(scores)
    print(confusion_matrix(y_test, y_test_pred))
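
score_classification receives hard class predictions, so ROC AUC, Brier loss, and log loss are computed from 0/1 labels. These three metrics are usually more informative when given predicted probabilities. The following is a minimal sketch of that alternative, not part of the function above; the synthetic data, the choice of LogisticRegression, and the names `y_test_proba` and `proba_scores` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: binary target coded 0/1)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class; column 1 corresponds to model.classes_[1]
y_test_proba = model.predict_proba(X_test)[:, 1]

# Probability-based metrics computed from scores rather than hard labels
proba_scores = {
    "ROC_AUC": roc_auc_score(y_test, y_test_proba),
    "Brier_Loss": brier_score_loss(y_test, y_test_proba),
    "Log_Loss": log_loss(y_test, y_test_proba),
}
print(proba_scores)
```

The label-based metrics (accuracy, precision, recall, F1, Jaccard) still need hard predictions from `model.predict`.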

### downsample
Takes a dataframe and the name (string) of its target column and [downsamples](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data) the majority class to the size of the minority class. The target is assumed to be binary and coded as 0 and 1.

Parameters:
- df: a Pandas DataFrame containing the data to be downsampled
- target: string. The name of the target variable.

This function uses the Python libraries [Pandas](https://pandas.pydata.org/docs/reference/index.html) (pd) and [resample](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) from the [sklearn](https://scikit-learn.org/stable/modules/classes.html) library. Both are imported inside the function.

In [None]:
def downsample(df, target):
    import pandas as pd
    from sklearn.utils import resample

    is_0 =  df[target]==0 
    is_1 =  df[target]==1

    if is_0.sum() > is_1.sum():
        df_majority = df[is_0]
        df_minority = df[is_1]
    else:
        df_majority = df[is_1]
        df_minority = df[is_0]

    df_majority_downsampled = resample(df_majority, 
                                       replace=False,   
                                       n_samples=df_minority.shape[0],    
                                       random_state=42)
    df_downsampled = pd.concat([df_majority_downsampled, df_minority])

    return df_downsampled

### scaled_model_search 
Takes a list of scalers and models, along with test-train split data, and runs a search over every possible combination of scaler and model. It prints out the best result. Currently the metric used is accuracy, but it would be simple enough to change depending on the situation.

Parameters:
- scalers: a list of initialized scaler objects (ex: scalers = [StandardScaler(), RobustScaler(), QuantileTransformer(random_state = 42)])
- models: a list of initialized model objects (ex: models = [LogisticRegression(), ExtraTreesClassifier(random_state = 42), RandomForestClassifier(random_state = 42)])
- X_train: DataFrame containing the training data set without the target variable
- y_train: DataFrame containing the target variable for the training data.
- X_test: DataFrame containing the test data set without the target variable
- y_test: DataFrame containing the target variable for the test data.

This function uses [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) from [sklearn](https://scikit-learn.org/stable/modules/classes.html) as the metric for comparing models, and [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to chain each scaler with each model. Both are imported inside the function.

In [None]:
def scaled_model_search(scalers, models, X_train, y_train, X_test, y_test):
    from sklearn.metrics import accuracy_score
    from sklearn.pipeline import Pipeline

    best_score = 0
    best_model = None   # stays None if no combination scores above 0
    best_scaler = None
    
    for scaler in scalers:
        for model in models:
            pipe = Pipeline(steps=[('scaler', scaler),
                              ('classifier', model)])
            pipe.fit(X_train, y_train)
            y_pred = pipe.predict(X_test)
            score = accuracy_score(y_test, y_pred)

            if score > best_score:
                best_score = score
                best_model = model
                best_scaler = scaler
    print("The best model is {}, scaled by {}, with a test (accuracy) score of {}.".format(best_model, best_scaler, best_score)) 
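
As a rough sketch of how the metric could be changed, the variant below takes the scoring function as an argument and returns the winning combination instead of printing it. The name `scaled_model_search_with_metric`, the `metric` parameter, and the synthetic data are hypothetical and only illustrate the idea.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler


def scaled_model_search_with_metric(scalers, models, X_train, y_train,
                                    X_test, y_test, metric):
    """Hypothetical variant: returns (best_score, best_scaler, best_model)."""
    best = (float("-inf"), None, None)
    for scaler in scalers:
        for model in models:
            # clone() gives each combination fresh, unfitted estimators
            pipe = Pipeline(steps=[("scaler", clone(scaler)),
                                   ("classifier", clone(model))])
            pipe.fit(X_train, y_train)
            score = metric(y_test, pipe.predict(X_test))
            if score > best[0]:
                best = (score, scaler, model)
    return best


# Synthetic stand-in data (assumption: binary target coded 0/1)
X, y = make_classification(n_samples=400, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

scalers = [StandardScaler(), RobustScaler()]
models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(n_estimators=50, random_state=42)]

best_score, best_scaler, best_model = scaled_model_search_with_metric(
    scalers, models, X_train, y_train, X_test, y_test, metric=f1_score
)
print(best_score, best_scaler, best_model)
```

Because the winning combination is chosen on the test set, the reported score is optimistic. A separate validation split would give a fairer estimate.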

## Data
Read in the clean data set from your data_preparation notebook. It should be ready for some preliminary model-building by now, but you should consider your variables and decide if you want to use all of them to train a model. You should have a clear reason for excluding any variables. Also consider time-series data (if applicable to your set). If you have data from multiple years, should you train and test on each year individually? Train on one year and test on another?
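
If you decide to train on earlier years and test on a later one, the split can be done with a simple filter. This is only a sketch: the toy DataFrame and the column names `year`, `feature`, and `target` are made up for illustration and will differ in your data.

```python
import pandas as pd

# Toy data standing in for a multi-year data set (hypothetical columns)
df = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019, 2020, 2020],
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "target": [0, 1, 0, 1, 0, 1],
})

# Train on everything before the final year, test on the final year
test_year = df["year"].max()
train = df[df["year"] < test_year]
test = df[df["year"] == test_year]

# Drop the target (and the year, if it should not be a predictor)
X_train, y_train = train.drop(columns=["target", "year"]), train["target"]
X_test, y_test = test.drop(columns=["target", "year"]), test["target"]

print(len(train), len(test))
```

Splitting by time like this keeps information from the test year out of training, which a random split would not guarantee.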

## Data Splitting
Once you have an idea of how you plan to use the data, split your data into train and test groups or, if you prefer a more complicated approach, multiple folds. 
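
Both options can be sketched with sklearn's model_selection tools. The example below uses synthetic data and assumes a classification target; the 80/20 split and five folds are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, mildly imbalanced stand-in data
X, y = make_classification(n_samples=200, n_features=5,
                           weights=[0.8, 0.2], random_state=42)

# Simple hold-out split; stratify keeps class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Alternative: stratified folds over the training portion
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [(len(train_idx), len(val_idx))
              for train_idx, val_idx in skf.split(X_train, y_train)]
print(fold_sizes)
```

Keeping a final test set untouched by cross-validation lets you report performance on data the model selection process never saw.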

## Baseline Model
Before anything else, let's build a baseline model. This will serve as a "sanity check" for everything that comes after. Choose a simple model and, without any preprocessing or tuning, train it on the training set. How well does it perform on the test set?
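
One possible baseline is a model that always predicts the most frequent class. The sketch below uses sklearn's DummyClassifier on synthetic data; whether this or an untuned real model is the right baseline for your project is a judgment call.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 70% class 0)
X, y = make_classification(n_samples=400, n_features=6,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, y_pred)
baseline_balanced = balanced_accuracy_score(y_test, y_pred)
print(baseline_accuracy, baseline_balanced)
```

On imbalanced data, plain accuracy can look respectable for this trivial model, while balanced accuracy stays at 0.5. That gap is one reason to track more than one metric.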

## Model Improvement
Now you can work on improving on the baseline. There's no linear approach to this process and the steps you take will depend on the data. Below are some steps that are commonly used in building robust models. You can use any, all, or only some of them, and you are encouraged to add your own steps for your specific data set.

As you go through this process, keep in mind all that you learned during the data understanding phase and consider the following questions:
- What sort of model should you train? (e.g., classification or regression? A neural network?)
- Given the distribution of your data, the presence or absence of missing data, and various other factors, is there a particular model (or ensemble) that you think will work well? (e.g., RandomForest, ExtraTrees, SVM...?)
- Depending on what sort of model you train and what your data look like, you may find different evaluation metrics useful. How can you be certain that you have a well-rounded view of how well your model is performing? What metric or metrics will best capture your model priorities (and what are your model priorities)?

### Scaling
Some models assume data have a normal distribution, and performance will suffer when they do not. Many models, particularly distance-based and gradient-based ones, will also suffer if variables have vastly different scales, while tree-based models are largely unaffected. Do you need to scale your data? If so, how should you go about doing so?
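
A common way to scale is to put the scaler and the model in one pipeline, so the scaler is fitted on the training data only. The sketch below compares a k-nearest-neighbors model with and without StandardScaler on synthetic data in which one feature has been deliberately inflated; the data and model choice are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X[:, 0] *= 1000.0  # exaggerate one feature's scale so it dominates distances

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Same model, without and with scaling
unscaled = KNeighborsClassifier().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled.fit(X_train, y_train)  # scaler statistics come from X_train only

unscaled_acc = unscaled.score(X_test, y_test)
scaled_acc = scaled.score(X_test, y_test)
print(unscaled_acc, scaled_acc)
```

Calling `predict` or `score` on the pipeline applies the training-set scaling to the test data automatically, which avoids leaking test statistics into preprocessing.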

### Feature Selection and Engineering
Are all of your variables necessary, or do you have a lot of them taking up time and computing power without adding much to model building? Can some variables be combined to make a better model? Are variables linearly related to your target variable, or would it be worthwhile to include some polynomial features?

### Hyperparameter Tuning
Once you have a model that is performing decently well, you'll want to adjust the hyperparameters to improve performance.
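
One common approach is a cross-validated grid search. The sketch below tunes two RandomForestClassifier hyperparameters on synthetic data; the model, the grid values, and the use of accuracy as the scoring metric are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Small illustrative grid; real grids depend on the model and the data
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training set
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)

# The refitted best model is evaluated once on the held-out test set
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

Tuning with cross-validation on the training set leaves the test set for a final, unbiased check of the chosen configuration.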

### Additional Tuning, Processing, or Model-Improvement
What else can you do to improve your model from the baseline?

## Outcome
At the end of this notebook, you should have a model that is performing better than the baseline model. You should be able to explain what steps you took to train this model and why each one was chosen.