# Starting Off

In classification problems such as fraud detection, the dataset often has many non-fraud cases and relatively few fraud cases. How could this class imbalance cause a problem with your model?

*use the term bias in your answer*
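As a starting point for the discussion, here is a minimal sketch of how a model that is fully biased toward the majority class can still look good on accuracy. The labels are hypothetical (990 non-fraud, 10 fraud), not taken from any dataset used in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 non-fraud (0) and 10 fraud (1) cases
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that is completely biased toward the majority class:
# it predicts non-fraud for every single case
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)  # share of actual fraud cases that were caught

print('Accuracy:', acc)
print('Recall on fraud:', rec)
```

Accuracy is high only because the majority class dominates the count; recall on the fraud class shows that no fraud case is detected.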

# Random Forest Practicum with Class Imbalance

Agenda:
- Review class imbalance
- Review code for different ways to handle class imbalance
- Review code for Random Forest with grid search
- Practice both class imbalance and Random Forest on credit data

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

In [2]:
# Read in the cleaned Titanic data to be used in the models
titanic = pd.read_csv('https://raw.githubusercontent.com/learn-co-students/nyc-mhtn-ds-042219-lectures/master/Module_4/cleaned_titanic.csv', index_col='PassengerId')



In [3]:
# Create matrix of features
X = titanic.drop('Survived', axis = 1) # grabs everything else but 'Survived'

# Create target variable
y = titanic['Survived'] # y is the column we're trying to predict

# Create a list of the features being used in the model
feature_cols = X.columns

# Fitting a Random Forest Classifier

In [4]:
# Instantiate the classifier using 100 trees
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state = 23, n_estimators=100)

In [5]:
# let's look at all the default parameters
rfc

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=23, verbose=0, warm_start=False)
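The repr above shows `class_weight=None`, meaning every observation counts equally. One way to handle class imbalance is to set `class_weight='balanced'`, which weights each class inversely to its frequency. The sketch below uses synthetic data from `make_classification` (an assumption, roughly 90% class 0 and 10% class 1) rather than the Titanic data, and the names `X_demo`, `y_demo`, and `rfc_balanced` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (assumption: roughly 90% class 0, 10% class 1)
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=23)

# 'balanced' weights follow n_samples / (n_classes * np.bincount(y))
classes = np.unique(y_demo)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
print(dict(zip(classes, weights)))

# Same classifier as in the notebook, but with errors on the rare class weighted more heavily
rfc_balanced = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=23)
rfc_balanced.fit(X_demo, y_demo)
```

The minority class receives the larger weight, so misclassifying it costs more when each tree chooses its splits.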

In [6]:
# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=23)
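When classes are imbalanced, a purely random split can leave the test set with a different share of the rare class than the training set. A hedged sketch of the `stratify` option of `train_test_split`, using a hypothetical target with 10% positives and a placeholder feature column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 900 zeros and 100 ones
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=23, stratify=y_demo
)

print('Train positive rate:', y_tr.mean())
print('Test positive rate: ', y_te.mean())
```

In the notebook this would correspond to adding `stratify=y` to the call above; whether that is desirable for the Titanic data is a judgment call.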

In [7]:
#fit the model to the training data
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=23, verbose=0, warm_start=False)

In [8]:
#use the fitted model to predict on the test data
rfc_pred = rfc.predict(X_test)



# checking accuracy on the test data
print('Test Accuracy score: ', accuracy_score(y_test, rfc_pred))


# checking F1 score on the test data
print('Test F1 score: ', f1_score(y_test, rfc_pred))

Test Accuracy score:  0.7802690582959642
Test F1 score:  0.6754966887417219


In [11]:
print(rfc.feature_importances_)

[0.08859929 0.21849423 0.05507378 0.04150652 0.27509539 0.01864985
 0.26823803 0.01218626 0.02215665]
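The raw array is hard to read without the column names. A minimal sketch of pairing importances with feature names and sorting them, using synthetic data and hypothetical column names (`f0` to `f4`) in place of the Titanic features:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=23
)
X_demo = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3', 'f4'])

rfc_demo = RandomForestClassifier(n_estimators=100, random_state=23).fit(X_demo, y_demo)

# Label each importance with its column name, largest first
importances = pd.Series(
    rfc_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In this notebook the equivalent would be `pd.Series(rfc.feature_importances_, index=feature_cols)`, since `feature_cols` holds the columns of `X`.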


### Let's use grid search to identify the best tuning parameters to use

In [9]:
from sklearn.model_selection import GridSearchCV

In [12]:
#create a dictionary of all the parameters you want to tune
param_grid = { 
    'n_estimators': [200,300,400,500],
    'max_features': [0.1,0.2,0.25],
    'max_depth': [1,2,3,4],
    'criterion': ['gini','entropy']
}
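Before running a grid search it helps to know how many models it will train. The sketch below counts the combinations in the grid above with `ParameterGrid`; the fold count of 3 is an assumption that matches the `cv=3` used in the next cell.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [200, 300, 400, 500],
    'max_features': [0.1, 0.2, 0.25],
    'max_depth': [1, 2, 3, 4],
    'criterion': ['gini', 'entropy']
}

# Every combination of the listed values is one candidate model
n_candidates = len(ParameterGrid(param_grid))  # 4 * 3 * 4 * 2

# Each candidate is fit once per cross-validation fold (assuming 3 folds)
n_fits = n_candidates * 3

print(n_candidates, n_fits)  # 96 288
```

With `refit=True`, the default, one more fit is run on the full training set using the best combination.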

In [13]:
#create a grid search object and fit it to the data
rfc = RandomForestClassifier(random_state = 23)
CV_rfc = GridSearchCV(rfc, param_grid, cv=3,n_jobs=-1)
CV_rfc.fit(X_train,y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=23, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [200, 300, 400, 500], 'max_features': [0.1, 0.2, 0.25], 'max_depth': [1, 2, 3, 4], 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
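The output above shows `scoring=None`, so the search ranks candidates by the classifier's default score, which is accuracy. With imbalanced classes it can be worth ranking by F1 instead. A hedged sketch on synthetic data, with a deliberately small, hypothetical grid so that it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (assumption: roughly 85% class 0, 15% class 1)
X_demo, y_demo = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=23)

# Small illustrative grid, not the one used in the notebook
small_grid = {'n_estimators': [50], 'max_depth': [2, 4]}

# scoring='f1' makes cross-validation select on F1 for the positive class
cv_f1 = GridSearchCV(
    RandomForestClassifier(random_state=23), small_grid, cv=3, scoring='f1'
)
cv_f1.fit(X_demo, y_demo)

print(cv_f1.best_params_)
print(cv_f1.best_score_)  # mean cross-validated F1 of the best candidate
```

Here `best_score_` is a mean F1 across folds rather than an accuracy, so it is not directly comparable to scores from a search run with the default scoring.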

In [18]:
# Identify the best params
print(CV_rfc.best_params_)

# Identify the best score during fitting with cross-validation
print(CV_rfc.best_score_)

# Feature importances of the best estimator found by the search
print(CV_rfc.best_estimator_.feature_importances_)

In [19]:
#predict on the test set

rfc_pred = CV_rfc.best_estimator_.predict(X_test)

# checking accuracy (sklearn metrics expect y_true first, then y_pred)
print('Test Accuracy score: ', accuracy_score(y_test, rfc_pred))


# checking F1 score
print('Test F1 score: ', f1_score(y_test, rfc_pred))

Test Accuracy score:  0.820627802690583
Test F1 score:  0.6923076923076923
