## Develop Classification Models

In [44]:
import pandas as pd
from scipy.stats import distributions
import mlflow 

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import f1_score

pd.set_option("display.max_columns", None)
%config Completer.use_jedi = False

#### Helper Functions

In [30]:
class LogUniformInt:
    
    def __init__(self, min_val, max_val):
        self.min_val = min_val
        self.max_val = max_val 
        
    def rvs(self, random_state=None):
        """
        rvs method that is needed by RandomizedSearchCV
        """
        
        # call the loguniform distribution that is built into scipy
        lu = distributions.loguniform(self.min_val, self.max_val)
        
        # convert output to integer (int() truncates, so max_val itself is effectively excluded)
        rand_int = int(lu.rvs(random_state=random_state))
        
        return rand_int
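
To get a feel for what this helper produces, the sketch below draws many values at once from a `loguniform(2, 50)` distribution and truncates them the same way `int()` does. It is a vectorized stand-in for calling `rvs` repeatedly, not the class itself; the bounds, sample size, and seed are illustrative assumptions.

```python
import numpy as np
from scipy.stats import loguniform

# Illustrative bounds (assumed here to match a typical min_samples_split range).
min_val, max_val = 2, 50

# Draw continuous log-uniform samples; sample size and seed are arbitrary.
samples = loguniform(min_val, max_val).rvs(size=10_000, random_state=0)

# int() truncates toward zero, which for positive values is the same as floor.
int_samples = np.floor(samples).astype(int)

# Small values dominate: about half of the mass lies below sqrt(2 * 50) = 10.
share_below_10 = (int_samples < 10).mean()

print("min:", int_samples.min(), "max:", int_samples.max())
print("share below 10:", round(share_below_10, 3))
```

Because of the truncation, `max_val` itself is essentially never returned. If the upper bound should be reachable, one option is to pass `max_val + 1` to the helper.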

#### Create our MLflow Database to Log Results

#### Read in and standardize data


We will apply a standard scaler.
This is just for computational convenience (it helps avoid floating-point issues with very large or very small feature values).

Decision-tree-based methods are largely scale invariant and don't really need standardization.
This makes intuitive sense: when splitting on a feature,
we are just picking the threshold that minimizes impurity (for example, entropy or Gini). As long as
the rank order within the feature column is preserved, we should get the same split.

Relevant [Stack Exchange post](https://stats.stackexchange.com/questions/255765/does-random-forest-need-input-variables-to-be-scaled-or-centered).
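
The scale-invariance argument can be checked empirically. The sketch below fits two decision trees with identical settings, one on raw features and one on standardized features, and measures how often their predictions agree. The synthetic data from `make_classification`, the column scaling factors, and the tree depth are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real features (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Put the columns on very different scales, then standardize a copy.
X_raw = X * np.array([1.0, 10.0, 100.0, 1_000.0, 10_000.0])
X_scaled = StandardScaler().fit_transform(X_raw)

# Identical hyperparameters and seed; only the feature scale differs.
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_raw, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)

# Fraction of training rows on which the two trees give the same label.
agreement = (tree_raw.predict(X_raw) == tree_scaled.predict(X_scaled)).mean()
print("agreement between raw and standardized trees:", agreement)
```

Since standardization is a monotonic transformation of each column, the two trees should partition the rows in the same way, so the agreement is expected to be at or very near 1.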


In [37]:
# get training data
X_train_df = pd.read_csv("train_cleaned.csv")
y_train_df = X_train_df.pop("Credit_Score")
y_train = y_train_df.values

# get dev data
X_dev_df = pd.read_csv("dev_cleaned.csv")
y_dev_df = X_dev_df.pop("Credit_Score")
y_dev = y_dev_df.values

# Standardize our data: fit the scaler on the training set only,
# then apply the same transformation to the dev set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_df)
X_dev = scaler.transform(X_dev_df)

#### Train Model

According to the `RandomizedSearchCV` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), the values in `param_distributions` are either lists, which are sampled uniformly, or distributions with an `rvs` method for sampling (such as those from `scipy.stats.distributions`).


[Example project](https://jamesrledoux.com/code/randomized_parameter_search) using `RandomizedSearchCV`.
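
The three kinds of values that `param_distributions` accepts can be seen side by side with `ParameterSampler`, the utility that `RandomizedSearchCV` uses to draw candidates. This is a minimal sketch: the `EvenInt` class and the parameter names and ranges are hypothetical and chosen only to show the mechanics.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler


class EvenInt:
    """Hypothetical custom distribution: any object with an rvs method works."""

    def __init__(self, low, high):
        self.dist = randint(low, high)

    def rvs(self, random_state=None):
        # Draw an integer in [low, high) and double it, so the result is even.
        return 2 * int(self.dist.rvs(random_state=random_state))


param_dists = {
    "criterion": ["gini", "entropy"],   # list: sampled uniformly
    "max_depth": randint(2, 20),        # scipy distribution: sampled via rvs
    "min_samples_leaf": EvenInt(1, 6),  # custom object exposing rvs
}

# Draw five candidate settings, as a randomized search with n_iter=5 would.
samples = list(ParameterSampler(param_dists, n_iter=5, random_state=0))
for s in samples:
    print(s)
```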

For values that can span multiple orders of magnitude,
we will want to sample using a [loguniform distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.loguniform.html#scipy.stats.loguniform).
[This blog post](https://towardsdatascience.com/why-is-the-log-uniform-distribution-useful-for-hyperparameter-tuning-63c8d331698) does a good job of explaining why to use the log-uniform distribution.
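
A quick way to see the difference is to compare how often each distribution lands in the lowest decade of a range. The sketch below samples from `[1, 1000]`; the bounds, sample size, and seed are illustrative assumptions.

```python
from scipy.stats import loguniform, uniform

low, high = 1, 1000  # three orders of magnitude (illustrative bounds)

# scipy's uniform is parameterized by loc and scale: the range is [loc, loc + scale].
uni = uniform(loc=low, scale=high - low).rvs(size=10_000, random_state=0)
logu = loguniform(low, high).rvs(size=10_000, random_state=0)

# Share of samples that fall in the first decade, [1, 10).
frac_uni = (uni < 10).mean()
frac_logu = (logu < 10).mean()

print("uniform, share below 10:    ", round(frac_uni, 3))
print("log-uniform, share below 10:", round(frac_logu, 3))
```

A uniform draw puts about 1% of its samples below 10, while the log-uniform draw gives each decade roughly a third of the samples, so small and large values are explored equally often.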

In [None]:
rfc_model = RandomForestClassifier(n_estimators = 1000)

# ----- Parameter Distributions -----
param_dists = {"max_depth": distributions.randint(2, 20),  # upper bound is exclusive
               "min_samples_split": LogUniformInt(min_val=2, max_val=50),
               "max_leaf_nodes": distributions.randint(2, 10),
               # normally distributed max_features, with mean .25 stddev 0.1, bounded between 0 and 1
               # truncnorm takes a and b in standard deviations from loc: (bound - loc) / scale
               "max_features":  distributions.truncnorm(a=(0 - 0.25) / 0.1, b=(1 - 0.25) / 0.1,
                                                        loc=0.25, scale=0.1)}


clf = RandomizedSearchCV(rfc_model,
                         param_dists,
                         n_iter=2,
                         cv=5, # for classification defaults to StratifiedKFold
                         random_state=99)


cv_model = clf.fit(X_train, y_train)


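# get the full parameter set of the best estimator
# (includes fixed settings such as n_estimators, not only the searched ones)
best_params = cv_model.best_estimator_.get_params()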
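
One detail of the `max_features` distribution is worth spelling out: `truncnorm` takes its bounds `a` and `b` in units of standard deviations relative to `loc`, not as raw values. The sketch below, using the same mean of 0.25 and standard deviation of 0.1, contrasts converted bounds with raw ones; the variable names, sample size, and seed are illustrative.

```python
from scipy.stats import truncnorm

loc, scale = 0.25, 0.1
lower, upper = 0.0, 1.0

# Convert the desired raw bounds into standard-deviation units.
a = (lower - loc) / scale
b = (upper - loc) / scale

bounded = truncnorm(a=a, b=b, loc=loc, scale=scale)
draws = bounded.rvs(size=5_000, random_state=0)

# Passing a=0, b=1 directly instead limits samples to [loc, loc + scale].
narrow = truncnorm(a=0, b=1, loc=loc, scale=scale)
narrow_draws = narrow.rvs(size=5_000, random_state=0)

print("converted bounds:", draws.min(), draws.max())
print("raw a=0, b=1:    ", narrow_draws.min(), narrow_draws.max())
```

With the converted bounds the samples can fall anywhere between 0 and 1, while `a=0, b=1` confines them to the interval from 0.25 to 0.35.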

__Generate predictions__ on our dev set, so that we can compare to other models

In [None]:
best_rfc_model = RandomForestClassifier(**best_params)
best_rfc_model.fit(X_train, y_train)

y_pred = best_rfc_model.predict(X_dev)
score = f1_score(y_true=y_dev,
                 y_pred=y_pred,
                 average="weighted")
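
For reference, `average="weighted"` computes one F1 score per class and averages them weighted by each class's support (its number of true instances). The sketch below reproduces that by hand on made-up toy labels, which are not taken from the credit score data.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels, invented for illustration (three classes, imbalanced).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0])

# One F1 score per class, in sorted label order.
per_class_f1 = f1_score(y_true, y_pred, average=None)

# Support: number of true instances of each class.
support = np.bincount(y_true)

# Support-weighted mean of the per-class scores.
manual_weighted = np.sum(per_class_f1 * support) / support.sum()
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")

print("per-class F1:", per_class_f1)
print("manual weighted F1: ", manual_weighted)
print("sklearn weighted F1:", sklearn_weighted)
```

This weighting means frequent classes dominate the score, which is worth keeping in mind when comparing models on an imbalanced target.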

__Log Results__