# Random forest classifier: diabetes prediction

A bare-bones minimum viable product (MVP) solution.

## 1. Data acquisition

In [None]:
# Handle imports up-front
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.stats import uniform, norm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

### 1.1. Load the data

In [None]:
# Load the data from the URL
data_df=pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/decision-tree-project-tutorial/main/diabetes.csv')

### 1.2. Inspect the data

In [None]:
# Your code here....

### 1.3. Train-test split

In [None]:
# Separate features from labels
labels=data_df['Outcome']
features=data_df.drop('Outcome', axis=1)

# Split the data into training and testing features and labels
training_features, testing_features, training_labels, testing_labels=train_test_split(
    features,
    labels,
    test_size=0.25,
    random_state=315
)

### 1.4. Encoding

In [None]:
# Your code here....

## 2. EDA

### 2.1. Baseline model performance

In [None]:
# Define a reusable helper function for cross-validation here. We are going to
# be doing a lot of cross-validation, so this lets us reuse the code
# without copy-pasting it over and over.

def cross_val(model, features: pd.DataFrame, labels: pd.Series) -> np.ndarray:
    '''Reusable helper function to run cross-validation on a model. Takes model,
    Pandas data frame of features and Pandas data series of labels. Returns
    NumPy array of cross-validation fold accuracy scores as fractions (0 to 1).'''

    # Define the cross-validation strategy
    cross_validation=StratifiedKFold(n_splits=7, shuffle=True, random_state=315)

    # Run the cross-validation, collecting the scores
    scores=cross_val_score(
        model,
        features,
        labels,
        cv=cross_validation,
        n_jobs=-1,
        scoring='accuracy'
    )

    # Print mean and standard deviation of the scores
    print(f'Cross-validation accuracy: {(scores.mean() * 100):.2f} +/- {(scores.std() * 100):.2f}%')

    # Return the scores
    return scores

In [None]:
# Instantiate a random forest classifier model
model=RandomForestClassifier(random_state=315)

# Run the cross-validation
scores=cross_val(model, training_features, training_labels)

### 2.2. Missing and/or extreme values

In [None]:
# Your code here....

### 2.3. Feature selection

In [None]:
# Your code goes here...

## 3. Model training

In [None]:
# Your code goes here...

## 4. Model optimization
### 4.1. Hyperparameter optimization

In [None]:
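# Silence warnings (e.g. the one raised when n_iter exceeds the number of
# candidate settings in the search space)
import warnings
warnings.filterwarnings('ignore')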

# Define the hyperparameter distributions to sample from
distributions={
    'n_estimators': list(range(2, 100))
}

# Instantiate a random forest classifier model
model=RandomForestClassifier(random_state=315)

# Define the cross-validation strategy
cross_validation=StratifiedKFold(n_splits=7, shuffle=True, random_state=315)

# Set up the randomized search
search=RandomizedSearchCV(
    model,
    distributions,
    scoring='accuracy',
    n_jobs=-1,
    cv=cross_validation,
    n_iter=100,
    random_state=315,
    return_train_score=True
)

# Run the randomized search
results=search.fit(training_features, training_labels)

# Print the best parameter settings found at the end
print(f'Best hyperparameters: {results.best_params_}')
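Because the search was run with `return_train_score=True`, the fitted object's `cv_results_` attribute holds training and validation scores for every sampled setting, not just the best one. The sketch below shows one way to tabulate them. It runs on synthetic stand-in data from `make_classification` with a smaller search so that it is self-contained; the `demo_` names and the reduced search settings are assumptions for illustration, not part of this notebook's pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data (an assumption; this is not the diabetes dataset)
demo_features, demo_labels=make_classification(
    n_samples=200,
    n_features=8,
    random_state=315
)

# A deliberately small search so the example runs quickly
demo_search=RandomizedSearchCV(
    RandomForestClassifier(random_state=315),
    {'n_estimators': list(range(2, 30))},
    scoring='accuracy',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=315),
    n_iter=5,
    random_state=315,
    return_train_score=True
)

demo_results=demo_search.fit(demo_features, demo_labels)

# cv_results_ is a dictionary of arrays with one entry per sampled setting
results_df=pd.DataFrame(demo_results.cv_results_)

# Keep the columns needed to compare training and validation accuracy
columns=[
    'param_n_estimators',
    'mean_train_score',
    'mean_test_score',
    'std_test_score',
    'rank_test_score'
]

summary_df=results_df[columns].sort_values('rank_test_score')
print(summary_df)
```

A large gap between `mean_train_score` and `mean_test_score` for a setting is a hint that the model is overfitting the training folds.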

### 4.2. Cross-validation of optimized model

In [None]:
# Your code goes here...

### 4.3. Final model evaluation

In [None]:
# Your code goes here...

## 5. A note on interpretation & some statistics

As professional data scientists, we want not only to make the model as accurate as possible, but also to have a good estimate of how accurate it will be on unseen test data. With our cross-validation results, we can use some simple statistics to talk about probabilities and confidence intervals.

### 5.1. Confidence intervals on model performance

The first useful thing to look at is a 95% confidence interval around the mean cross-validation accuracy, treating the fold scores as approximately normally distributed:

In [None]:
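# 95% interval of a normal distribution with the mean and standard
# deviation of the cross-validation fold scores
lower_bound, upper_bound=norm.interval(0.95, loc=scores.mean(), scale=scores.std())
print(f'95% CI = {lower_bound*100:.1f}% - {upper_bound*100:.1f}% accuracy')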
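The interval above is built from the spread of the individual fold scores. A related but narrower quantity is a confidence interval on the mean score itself, which uses the standard error of the mean and, with only a handful of folds, a t-distribution. The following is a minimal sketch of the difference, assuming made-up fold scores (`example_scores` is hypothetical, not output from this notebook). Fold scores share training data and are not truly independent, so either interval should be read as approximate.

```python
import numpy as np
from scipy.stats import norm, t

# Hypothetical fold accuracies, for illustration only
example_scores=np.array([0.74, 0.78, 0.71, 0.80, 0.76, 0.73, 0.77])

n_folds=len(example_scores)
mean_score=example_scores.mean()

# Sample standard deviation of the fold scores
sample_std=example_scores.std(ddof=1)

# Interval describing where a single fold score is expected to fall
fold_lower, fold_upper=norm.interval(0.95, loc=mean_score, scale=sample_std)

# Standard error of the mean: spread of the mean score, not of single folds
standard_error=sample_std / np.sqrt(n_folds)

# t-based confidence interval on the mean score
mean_lower, mean_upper=t.interval(
    0.95,
    df=n_folds - 1,
    loc=mean_score,
    scale=standard_error
)

print(f'Single-fold interval: {fold_lower*100:.1f}% - {fold_upper*100:.1f}% accuracy')
print(f'Interval on the mean: {mean_lower*100:.1f}% - {mean_upper*100:.1f}% accuracy')
```

Which interval is appropriate depends on the question: the first asks where one more fold-sized score might land, the second asks how well the mean itself is pinned down.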

### 5.2. Likelihood of test results

We can also use SciPy's stats module to calculate how probable our observed test set result is, given our cross-validation results. This gives us a way to quantify how well cross-validation estimated true out-of-sample performance. If the test score is likely, we are in good shape. If it is very unlikely, something is probably wrong.

For example, if our test result is 76.0% accuracy, did we do a good job estimating out-of-sample performance with cross-validation?

In [None]:
# Use your test set accuracy here
testing_percent_accuracy=76.0

# Convert the test accuracy to a z-score using the mean and standard
# deviation from the cross-validation
z_score=((testing_percent_accuracy/100) - scores.mean()) / scores.std()

# Use the standard normal distribution's survival function to get the
# two-sided probability of a score at least this far from the mean
# (norm.pdf would return a density, not a probability)
probability=2 * norm.sf(abs(z_score))

print(f'Probability: {probability*100:.1f}%')

As a second example, what would happen if our test result were only 68.0% accuracy, just outside of our 95% confidence interval?

In [None]:
testing_percent_accuracy=68.0

# Two-sided tail probability of a score at least this far from the mean
z_score=((testing_percent_accuracy/100) - scores.mean()) / scores.std()
probability=2 * norm.sf(abs(z_score))

print(f'Probability: {probability*100:.1f}%')
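The two checks above repeat the same z-score calculation, so it can be packaged as a small helper. This sketch assumes that the quantity of interest is a two-sided tail probability (the chance of a score at least as far from the cross-validation mean as the one observed) and that fold scores are roughly normal. The names `score_probability` and `example_scores` are hypothetical, and the fold scores are made up rather than taken from this notebook.

```python
import numpy as np
from scipy.stats import norm

def score_probability(test_accuracy: float, cv_scores: np.ndarray) -> tuple[float, float]:
    '''Takes a test accuracy as a fraction and an array of cross-validation
    fold accuracies. Returns the z-score of the test accuracy and the
    two-sided tail probability under a normal approximation.'''

    # Distance of the test accuracy from the mean, in standard deviations
    z_score=(test_accuracy - cv_scores.mean()) / cv_scores.std()

    # Probability of a score at least this far from the mean, in either direction
    probability=2 * norm.sf(abs(z_score))

    return z_score, probability

# Hypothetical fold accuracies, for illustration only
example_scores=np.array([0.74, 0.78, 0.71, 0.80, 0.76, 0.73, 0.77])

for test_accuracy in (0.76, 0.68):
    z_score, probability=score_probability(test_accuracy, example_scores)
    print(f'Test accuracy {test_accuracy*100:.1f}%: z = {z_score:.2f}, probability = {probability*100:.1f}%')
```

A test accuracy equal to the cross-validation mean gives a z-score of zero and a probability of 100%, and the probability shrinks toward zero as the test accuracy moves away from the mean in either direction.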