# Predicting Lung Cancer and Survey Analysis

## Summary / Introduction

This notebook analyzes survey responses to identify behavioral and demographic factors linked to lung cancer. Using permutation-based feature importance, we find that 8 of the 15 survey questions (the features still retained at the end of the selection process) are enough to accurately predict the presence of lung cancer. We also show how to tune the model to minimize false negatives by scoring it with the F-beta score (with beta > 1), which emphasizes recall over precision. This trade-off is critical in medical contexts, where failing to identify at-risk individuals can delay diagnosis and treatment. By surfacing high-risk patterns in survey responses, this approach offers a potential tool for early detection and clinical follow-up.
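
For reference, the F-beta score combines precision $P$ and recall $R$ as shown here. This is the standard definition, restated for convenience rather than derived in this notebook:

$$
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}
$$

With $\beta = 2$, the value passed to the scorer later in this notebook, recall is weighted more heavily than precision, so a missed positive case lowers the score more than a false alarm does.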

## Data Overview

The [Lung Cancer Dataset](https://www.kaggle.com/datasets/aagambshah/lung-cancer-dataset) contains **309 survey responses** capturing demographic and behavioral factors. The target variable is `LUNG_CANCER`, indicating whether a respondent has been diagnosed with lung cancer. Below is a description of each feature:

| Feature Name            | Description                                      |
| ----------------------- | ------------------------------------------------ |
| `GENDER`                | Gender of the respondent (Male/Female)           |
| `AGE`                   | Age of the respondent                            |
| `SMOKING`               | Smoking habit (Yes/No)                           |
| `YELLOW_FINGERS`        | Presence of yellowing fingers (Yes/No)           |
| `ANXIETY`               | Presence of anxiety (Yes/No)                     |
| `PEER_PRESSURE`         | Experience of peer pressure (Yes/No)             |
| `CHRONIC DISEASE`       | Existing chronic diseases (Yes/No)               |
| `FATIGUE`               | Presence of fatigue (Yes/No)                     |
| `ALLERGY`               | Allergic conditions (Yes/No)                     |
| `WHEEZING`              | Wheezing symptoms (Yes/No)                       |
| `ALCOHOL CONSUMING`     | Alcohol consumption habit (Yes/No)               |
| `COUGHING`              | Frequent coughing (Yes/No)                       |
| `SHORTNESS OF BREATH`   | Symptom of shortness of breath (Yes/No)          |
| `SWALLOWING DIFFICULTY` | Difficulty in swallowing (Yes/No)                |
| `CHEST PAIN`            | Presence of chest pain (Yes/No)                  |
| `LUNG_CANCER`           | Lung cancer diagnosis (target variable) (Yes/No) |


## Imports

In [None]:
# Core libraries
import os
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Kaggle dataset access
import kagglehub

# Scikit-learn: modeling and evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.inspection import permutation_importance

from sklearn.metrics import (
    accuracy_score,
    fbeta_score,
    make_scorer,
    precision_score,
    recall_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    precision_recall_curve,
    PrecisionRecallDisplay,
    roc_curve,
    RocCurveDisplay
)

## Data Cleaning & Preprocessing

In [None]:
# Download latest version
path = kagglehub.dataset_download("aagambshah/lung-cancer-dataset")

# Load dataset into DataFrame
df = pd.read_csv(os.path.join(path, 'survey lung cancer.csv'))

# Preview
df.head()

In [None]:
# Check for nulls
df.isna().sum()

In [None]:
# Check data types
df.dtypes

In [None]:
# View unique values
for col in df.select_dtypes(include=['object', 'int64']):
    print(f'{col}: {df[col].unique()}')

In [None]:
# Clean column names
df.columns = df.columns.str.strip().str.replace(' ', '_')

In [None]:
# Helper function
def one_hot_encode(array, positive_class):
    """
    Encodes a binary categorical array into 0s and 1s based on a specified positive class.

    Parameters:
        array (array-like): Input array containing binary values (e.g., 'Yes'/'No', 1/2).
        positive_class (str or int): The class to encode as 1. All other values become 0.

    Returns:
        np.ndarray: Array of 0s and 1s, where 1 represents the positive_class.
    """
    return np.where(array == positive_class, 1, 0)


In [None]:
# Binary-encode each categorical column as 0/1
df_encoded = df.copy()
df_encoded['GENDER'] = one_hot_encode(df['GENDER'].values, 'M')
df_encoded['LUNG_CANCER'] = one_hot_encode(df['LUNG_CANCER'].values, 'YES')

# The remaining survey columns use 2 as the positive value; AGE stays numeric
binary_cols = [
    'SMOKING', 'YELLOW_FINGERS', 'ANXIETY', 'PEER_PRESSURE', 'CHRONIC_DISEASE',
    'FATIGUE', 'ALLERGY', 'WHEEZING', 'ALCOHOL_CONSUMING', 'COUGHING',
    'SHORTNESS_OF_BREATH', 'SWALLOWING_DIFFICULTY', 'CHEST_PAIN',
]
for col in binary_cols:
    df_encoded[col] = one_hot_encode(df[col].values, 2)

# Preview
df_encoded.head()

## Exploratory Data Analysis (EDA)

In [None]:
# Histograms of all features
ax = df_encoded.hist(figsize=(12, 10), edgecolor='black', grid=False)
plt.suptitle("Distribution of Encoded Features", fontsize=20)
plt.tight_layout(rect=[0, 0.03, 1, 0.97])

The `LUNG_CANCER` histogram shows that the dataset is imbalanced: respondents with lung cancer far outnumber those without. To keep the model from simply favoring the majority class, the splits below are stratified on the target and the model is scored with F-beta rather than accuracy.
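
To make the imbalance concrete, here is a minimal sketch of the majority-class baseline: the accuracy a model would reach by always predicting the most common label. The label list and the helper name `majority_baseline` are made up for illustration and are not the survey data.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy of always predicting that label)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical labels (not the survey data): 9 positives, 1 negative
toy_labels = [1] * 9 + [0]
label, acc = majority_baseline(toy_labels)
print(f"Always predicting {label} gives accuracy {acc:.2f}")
```

A constant prediction already scores 0.90 on this toy example, which is why accuracy alone can look deceptively good on imbalanced data.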

In [None]:
# Check for collinearity
correlation_matrix = df_encoded.corr()

# Visualize
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.title('Feature Correlation Matrix')
plt.show()

## Model Building

### Datasets

In [None]:
# Separate features from target
X = df_encoded.drop('LUNG_CANCER', axis=1).values
y = df_encoded['LUNG_CANCER'].values
feature_names = df_encoded.drop('LUNG_CANCER', axis=1).columns

# Split X, y into train and test splits and stratify based on target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True, stratify=y)

# Split X_train, y_train into train and validation splits and stratify based on y_train (for feature importance scoring)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=42, shuffle=True, stratify=y_train)
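
As a rough check on how much data each split receives, the arithmetic can be sketched as follows. This assumes that a fractional `test_size` is rounded up to a whole number of rows, which is my understanding of how `train_test_split` behaves; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of a fractional split, assuming the held-out share is rounded up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_rest, n_test = split_sizes(309, 0.33)      # first split: (train + val) vs test
n_train, n_val = split_sizes(n_rest, 0.33)   # second split: train vs val
print(n_train, n_val, n_test)
```

Under that assumption, the 309 responses end up as roughly 45% train, 22% validation, and 33% test.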

In [None]:
# Helper function
def plot_cv_results(results1, results2, x1='param_max_depth', x2='param_min_samples_leaf'):
    # Create subplots
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 6), constrained_layout=True, sharey=True)

    # Left plot
    ax1.plot(results1[x1].astype('int'), results1['mean_train_score'], label='Train')
    ax1.fill_between(results1[x1].astype('int'),
                     [m + s for m, s in zip(results1['mean_train_score'], results1['std_train_score'])],
                     [m - s for m, s in zip(results1['mean_train_score'], results1['std_train_score'])],
                     alpha=0.3)
    
    ax1.plot(results1[x1].astype('int'), results1['mean_test_score'], label='Val')
    ax1.fill_between(results1[x1].astype('int'),
                     [m + s for m, s in zip(results1['mean_test_score'], results1['std_test_score'])],
                     [m - s for m, s in zip(results1['mean_test_score'], results1['std_test_score'])],
                     alpha=0.3)
    
    ax1.legend()
    ax1.set_xlabel(x1)
    ax1.set_ylabel('F-beta Score')
    ax1.set_title('Maximum Tree Depth Approach')

    # Right plot
    ax2.plot(results2[x2].astype('int'), results2['mean_train_score'], label='Train')
    ax2.fill_between(results2[x2].astype('int'),
                     [m + s for m, s in zip(results2['mean_train_score'], results2['std_train_score'])],
                     [m - s for m, s in zip(results2['mean_train_score'], results2['std_train_score'])],
                     alpha=0.3)
    
    ax2.plot(results2[x2].astype('int'), results2['mean_test_score'], label='Val')
    ax2.fill_between(results2[x2].astype('int'),
                     [m + s for m, s in zip(results2['mean_test_score'], results2['std_test_score'])],
                     [m - s for m, s in zip(results2['mean_test_score'], results2['std_test_score'])],
                     alpha=0.3)
    
    ax2.legend()
    ax2.set_xlabel(x2)
    ax2.set_ylabel('F-beta Score')
    ax2.set_title('Minimum Samples Per Leaf Approach')
    
    fig.suptitle('Cross-Validation F-beta Scores')
    plt.show()

### Tree-Depth Approach vs Minimum Samples Per Leaf Approach

In [None]:
# Define parameter grid
params1 = {'max_depth': list(range(2, 11)),
           'random_state': [42]}
params2 = {'min_samples_leaf': list(range(1, 11)),
           'random_state': [42]}

# Scoring with F-beta score
fbeta_scorer = make_scorer(fbeta_score, beta=2)

# Create Grid Search object
grid1 = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=params1,
    scoring=fbeta_scorer,
    cv=5,
    return_train_score=True
)
grid2 = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=params2,
    scoring=fbeta_scorer,
    cv=5,
    return_train_score=True
)

# Train
grid1.fit(X_train, y_train)
grid2.fit(X_train, y_train)

# Plot results
cv_results1 = pd.DataFrame(grid1.cv_results_)[['param_max_depth', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
cv_results2 = pd.DataFrame(grid2.cv_results_)[['param_min_samples_leaf', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
plot_cv_results(cv_results1, cv_results2, x1='param_max_depth', x2='param_min_samples_leaf')

Tuning the maximum tree depth leads to overfitting: the score on the training data keeps rising with depth while the score on the validation folds falls. Tuning the minimum samples per leaf generalizes better: the train and validation scores stay close together, and the variance across cross-validation folds is low.
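
One simple way to quantify this comparison is the gap between the `mean_train_score` and `mean_test_score` columns for each hyperparameter value. The sketch below uses invented numbers purely to show the calculation; they are not results from this notebook, and `generalization_gap` is a hypothetical helper.

```python
def generalization_gap(mean_train, mean_val):
    """Per-setting difference between train and validation scores."""
    return [round(t - v, 3) for t, v in zip(mean_train, mean_val)]

# Hypothetical scores for illustration only (not results from this notebook)
depth_train = [0.95, 0.97, 0.99, 1.00]
depth_val = [0.94, 0.94, 0.93, 0.92]

gaps = generalization_gap(depth_train, depth_val)
print(gaps)
```

A gap that widens as the hyperparameter grows is the overfitting pattern described above; a gap that stays small and flat indicates better generalization.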

In [None]:
# Verify we agree with GridSearchCV's best parameters
grid2.best_params_

## Feature Importance (Permutation-Based)
Now that we have a first pass at a successful model, let's measure permutation-based feature importance to decide which survey questions actually contribute to predictive performance. The model was trained on the training set; importance is measured on the held-out validation set.
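
To show what permutation importance measures, here is a from-scratch sketch on toy data. It is not scikit-learn's implementation: it scores with plain accuracy instead of F-beta for brevity, and the names `permutation_importance_sketch` and `toy_model` are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows the model labels correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance_sketch(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in score when each column is shuffled on its own."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled_col = [r[col] for r in rows]
            rng.shuffle(shuffled_col)
            # Rebuild the rows with only this one column permuted
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled_col)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical model that only reads column 0."""
    return row[0]

# Toy data: the label equals column 0; column 1 is unrelated to the label
rows = [[i % 2, (i // 2) % 2] for i in range(12)]
labels = [r[0] for r in rows]

imp = permutation_importance_sketch(toy_model, rows, labels)
print(imp)
```

Shuffling the column the model relies on lowers the score, so its importance is positive. Shuffling the column the model ignores changes nothing, so its importance is exactly zero.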

In [None]:
# Compute permutation importance
result = permutation_importance(grid2, X_val, y_val, n_repeats=30, random_state=42, scoring=fbeta_scorer)

# Create a sorted Series for plotting
importances = pd.Series(result.importances_mean, index=feature_names)
importances_std = pd.Series(result.importances_std, index=feature_names)
sorted_importances = importances.sort_values(ascending=False)

# Plot with matching error bars
fig, ax = plt.subplots(figsize=(10, 6))
sorted_importances.plot.bar(yerr=importances_std[sorted_importances.index], ax=ax)
ax.set_title("Permutation Importance (Validation Set)")
ax.set_ylabel("Mean F-beta Score Decrease")
fig.tight_layout()
plt.show()

`AGE` and `CHRONIC_DISEASE` both have negative importance: the model scores better when their values are shuffled, which suggests they contribute noise rather than signal. Let's remove them and repeat the process.
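
The elimination rule applied in the rest of this section (drop features whose mean importance is zero or negative) can be written as a small helper. This is only a sketch: the feature names and values are placeholders, and `non_contributing` is a hypothetical function.

```python
def non_contributing(importances, tol=0.0):
    """Names whose mean permutation importance is at or below `tol`."""
    return sorted(name for name, value in importances.items() if value <= tol)

# Hypothetical importances for illustration (not this notebook's results)
toy_importances = {"A": 0.04, "B": 0.00, "C": -0.01, "D": 0.02}
print(non_contributing(toy_importances))
```

In the notebook, the same selection could be made from the `importances` Series instead of reading the bar chart by eye.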

In [None]:
# Drop features (datasets do not leak since random_state is kept constant)
X = df_encoded.drop(columns=['AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).values
y = df_encoded['LUNG_CANCER'].values
feature_names = df_encoded.drop(columns=['AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).columns

# Split X, y into train and test splits and stratify based on target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True, stratify=y)

# Split X_train, y_train into train and validation splits and stratify based on y_train (for feature importance scoring)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=42, shuffle=True, stratify=y_train)

# Train
grid1.fit(X_train, y_train)
grid2.fit(X_train, y_train)

# Plot results
cv_results1 = pd.DataFrame(grid1.cv_results_)[['param_max_depth', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
cv_results2 = pd.DataFrame(grid2.cv_results_)[['param_min_samples_leaf', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
plot_cv_results(cv_results1, cv_results2, x1='param_max_depth', x2='param_min_samples_leaf')

In [None]:
# Verify we agree with GridSearchCV's best parameters
grid2.best_params_

In [None]:
# Compute permutation importance
result = permutation_importance(grid2, X_val, y_val, n_repeats=30, random_state=42, scoring=fbeta_scorer)

# Create a sorted Series for plotting
importances = pd.Series(result.importances_mean, index=feature_names)
importances_std = pd.Series(result.importances_std, index=feature_names)
sorted_importances = importances.sort_values(ascending=False)

# Plot with matching error bars
fig, ax = plt.subplots(figsize=(10, 6))
sorted_importances.plot.bar(yerr=importances_std[sorted_importances.index], ax=ax)
ax.set_title("Permutation Importance (Validation Set)")
ax.set_ylabel("Mean F-beta Score Decrease")
fig.tight_layout()
plt.show()

Now shuffling `GENDER`, `SMOKING`, or `FATIGUE` leaves the model score unchanged, so these features are not contributing either. Let's remove them as well and repeat the process.

In [None]:
# Drop features (datasets do not leak since random_state is kept constant)
X = df_encoded.drop(columns=['GENDER', 'SMOKING', 'FATIGUE', 'AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).values
y = df_encoded['LUNG_CANCER'].values
feature_names = df_encoded.drop(columns=['GENDER', 'SMOKING', 'FATIGUE', 'AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).columns

# Split X, y into train and test splits and stratify based on target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True, stratify=y)

# Split X_train, y_train into train and validation splits and stratify based on y_train (for feature importance scoring)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=42, shuffle=True, stratify=y_train)

# Train
grid1.fit(X_train, y_train)
grid2.fit(X_train, y_train)

# Plot results
cv_results1 = pd.DataFrame(grid1.cv_results_)[['param_max_depth', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
cv_results2 = pd.DataFrame(grid2.cv_results_)[['param_min_samples_leaf', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
plot_cv_results(cv_results1, cv_results2, x1='param_max_depth', x2='param_min_samples_leaf')

In [None]:
# Verify we agree with GridSearchCV's best parameters
grid2.best_params_

In [None]:
# Compute permutation importance
result = permutation_importance(grid2, X_val, y_val, n_repeats=30, random_state=42, scoring=fbeta_scorer)

# Create a sorted Series for plotting
importances = pd.Series(result.importances_mean, index=feature_names)
importances_std = pd.Series(result.importances_std, index=feature_names)
sorted_importances = importances.sort_values(ascending=False)

# Plot with matching error bars
fig, ax = plt.subplots(figsize=(10, 6))
sorted_importances.plot.bar(yerr=importances_std[sorted_importances.index], ax=ax)
ax.set_title("Permutation Importance (Validation Set)")
ax.set_ylabel("Mean F-beta Score Decrease")
fig.tight_layout()
plt.show()

Now, shuffling `ANXIETY` tends to make the model perform better. Therefore, this feature needs to be removed as well.

In [None]:
# Drop features (datasets do not leak since random_state is kept constant)
X = df_encoded.drop(columns=['ANXIETY', 'GENDER', 'SMOKING', 'FATIGUE', 'AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).values
y = df_encoded['LUNG_CANCER'].values
feature_names = df_encoded.drop(columns=['ANXIETY', 'GENDER', 'SMOKING', 'FATIGUE', 'AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).columns

# Split X, y into train and test splits and stratify based on target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True, stratify=y)

# Split X_train, y_train into train and validation splits and stratify based on y_train (for feature importance scoring)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=42, shuffle=True, stratify=y_train)

# Train
grid1.fit(X_train, y_train)
grid2.fit(X_train, y_train)

# Plot results
cv_results1 = pd.DataFrame(grid1.cv_results_)[['param_max_depth', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
cv_results2 = pd.DataFrame(grid2.cv_results_)[['param_min_samples_leaf', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
plot_cv_results(cv_results1, cv_results2, x1='param_max_depth', x2='param_min_samples_leaf')

In [None]:
# Verify we agree with GridSearchCV's best parameters
grid2.best_params_

In [None]:
# Compute permutation importance
result = permutation_importance(grid2, X_val, y_val, n_repeats=30, random_state=42, scoring=fbeta_scorer)

# Create a sorted Series for plotting
importances = pd.Series(result.importances_mean, index=feature_names)
importances_std = pd.Series(result.importances_std, index=feature_names)
sorted_importances = importances.sort_values(ascending=False)

# Plot with matching error bars
fig, ax = plt.subplots(figsize=(10, 6))
sorted_importances.plot.bar(yerr=importances_std[sorted_importances.index], ax=ax)
ax.set_title("Permutation Importance (Validation Set)")
ax.set_ylabel("Mean F-beta Score Decrease")
fig.tight_layout()
plt.show()

Once again, a feature that does not affect the model's score has surfaced: `SHORTNESS_OF_BREATH`. Let's remove it and repeat.

In [None]:
# Drop features (datasets do not leak since random_state is kept constant)
X = df_encoded.drop(columns=['SHORTNESS_OF_BREATH', 'ANXIETY', 'GENDER', 'SMOKING', 'FATIGUE', 'AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).values
y = df_encoded['LUNG_CANCER'].values
feature_names = df_encoded.drop(columns=['SHORTNESS_OF_BREATH', 'ANXIETY', 'GENDER', 'SMOKING', 'FATIGUE', 'AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).columns

# Split X, y into train and test splits and stratify based on target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True, stratify=y)

# Split X_train, y_train into train and validation splits and stratify based on y_train (for feature importance scoring)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=42, shuffle=True, stratify=y_train)

# Train
grid1.fit(X_train, y_train)
grid2.fit(X_train, y_train)

# Plot results
cv_results1 = pd.DataFrame(grid1.cv_results_)[['param_max_depth', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
cv_results2 = pd.DataFrame(grid2.cv_results_)[['param_min_samples_leaf', 'mean_train_score', 'std_train_score', 'mean_test_score', 'std_test_score']]
plot_cv_results(cv_results1, cv_results2, x1='param_max_depth', x2='param_min_samples_leaf')

In [None]:
# Verify we agree with GridSearchCV's best parameters
grid2.best_params_

In [None]:
# Compute permutation importance
result = permutation_importance(grid2, X_val, y_val, n_repeats=30, random_state=42, scoring=fbeta_scorer)

# Create a sorted Series for plotting
importances = pd.Series(result.importances_mean, index=feature_names)
importances_std = pd.Series(result.importances_std, index=feature_names)
sorted_importances = importances.sort_values(ascending=False)

# Plot with matching error bars
fig, ax = plt.subplots(figsize=(10, 6))
sorted_importances.plot.bar(yerr=importances_std[sorted_importances.index], ax=ax)
ax.set_title("Permutation Importance (Validation Set)")
ax.set_ylabel("Mean F-beta Score Decrease")
fig.tight_layout()
plt.show()

Great! Every remaining feature now contributes to the model's score. We can proceed to fine-tuning the model.

## Fine-Tuning: Minimizing False Negatives

The current model uses the hyperparameters that gave the best cross-validated F-beta score. However, `RandomForestClassifier` assigns the positive class using a default probability threshold of 0.5. Below, we compare the trained model's predictive performance as this threshold changes.
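
To illustrate why the threshold matters, here is a minimal sketch that applies two different cutoffs to the same predicted probabilities and counts the resulting errors. The labels and probabilities are invented for illustration, and both helper functions are hypothetical.

```python
def predict_at_threshold(probabilities, threshold=0.5):
    """Label a sample positive when its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def count_errors(y_true, y_pred):
    """Return (false_negatives, false_positives)."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn, fp

# Hypothetical labels and probabilities, for illustration only
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.6, 0.4, 0.45, 0.1, 0.3]

for t in (0.5, 0.35):
    fn, fp = count_errors(y_true, predict_at_threshold(y_prob, t))
    print(f"threshold={t}: FN={fn}, FP={fp}")
```

On this toy data, lowering the threshold from 0.5 to 0.35 removes one false negative at the cost of one false positive, which is the trade-off explored in this section.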

First, we need to retrain the model on the combined train and validation datasets.

In [None]:
# Drop features (datasets do not leak since random_state is kept constant)
X = df_encoded.drop(columns=['SHORTNESS_OF_BREATH', 'ANXIETY', 'GENDER', 'SMOKING', 'FATIGUE', 'AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).values
y = df_encoded['LUNG_CANCER'].values
feature_names = df_encoded.drop(columns=['SHORTNESS_OF_BREATH', 'ANXIETY', 'GENDER', 'SMOKING', 'FATIGUE', 'AGE', 'CHRONIC_DISEASE', 'LUNG_CANCER']).columns

# Split X, y into train and test splits and stratify based on target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True, stratify=y)

# Create classifier with best hyperparameters
clf = RandomForestClassifier(**grid2.best_params_)

# Train
clf.fit(X_train, y_train)

In [None]:
# Get predicted positive class probabilities
y_score = clf.predict_proba(X_train)[:, 1]

# Determine Precision-Recall scores
prec, recall, thresholds = precision_recall_curve(y_train, y_score)
pr_display = PrecisionRecallDisplay(precision=prec, recall=recall)

# Count confusion-matrix entries at every candidate threshold
tn_list, fp_list, fn_list, tp_list = [], [], [], []
for threshold in thresholds:
    y_pred_at_threshold = np.where(y_score >= threshold, 1, 0)

    # Calculate true negatives, false positives, false negatives, true positives
    tn, fp, fn, tp = confusion_matrix(y_train, y_pred_at_threshold).ravel().tolist()
    tn_list.append(tn)
    fp_list.append(fp)
    fn_list.append(fn)
    tp_list.append(tp)

# # Threshold for best F-beta score
# beta = 2
# f_beta_score = ((1 + beta ** 2) * prec * recall) / ((beta ** 2 * prec) + recall) # as a function of precision and recall
# idx = f_beta_score.argmax()
# best_fb_threshold = thresholds[idx]
# print('Best Threshold=%f, F-Score=%.3f' % (best_fb_threshold, f_beta_score[idx]))

# # Confusion Matrix (default threshold)
# y_pred = clf.predict(X_train)
# cm = confusion_matrix(y_train, y_pred)
# cm_display = ConfusionMatrixDisplay(cm)

# # Confusion Matrix (threshold for best F-beta score)
# y_pred_fb = np.where(y_score >= best_fb_threshold, 1, 0)
# cm = confusion_matrix(y_train, y_pred_fb)
# cm_display_fb = ConfusionMatrixDisplay(cm)

# # Plot all together
# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# cm_display.plot(ax=ax1)
# ax1.set_title('Default Threshold=0.5')
# cm_display_fb.plot(ax=ax2)
# ax2.set_title('Best F-beta Score Threshold=%f' % best_fb_threshold)
# plt.show()

# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# pr_display.plot(ax=ax1)
# default_ind = np.argwhere(thresholds == 0.5)
# ax1.scatter(recall[default_ind], prec[default_ind], marker='o', c='r')
# ax1.legend(['PRC', 'Default Threshold=0.5'])
# pr_display.plot(ax=ax2)
# ax2.scatter(recall[idx], prec[idx], marker='o', c='r')
# ax2.legend(['PRC', 'Best F-beta Score Threshold=%f' % best_fb_threshold])
# plt.show()

In [None]:
plt.plot(thresholds, fn_list, label='False Negatives')
plt.plot(thresholds, fp_list, label='False Positives')
plt.xlabel('Threshold')
plt.ylabel('Count')
plt.legend()
plt.show()
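
One way to turn the two curves into a single operating point is to pick the threshold with the best F-beta score. The sketch below does this from confusion counts in plain Python, with beta = 2 to match the scorer used earlier. The data are invented for illustration, and `fbeta_from_counts` and `best_threshold` are hypothetical helpers rather than library functions.

```python
def fbeta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def best_threshold(y_true, y_prob, beta=2.0):
    """Try each distinct probability as a cutoff and keep the best F-beta."""
    best = (None, -1.0)
    for t in sorted(set(y_prob)):
        pred = [1 if p >= t else 0 for p in y_prob]
        tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
        fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
        fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
        score = fbeta_from_counts(tp, fp, fn, beta)
        if score > best[1]:
            best = (t, score)
    return best

# Hypothetical labels and probabilities, for illustration only
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.6, 0.4, 0.45, 0.1, 0.3]

threshold, score = best_threshold(y_true, y_prob)
print(threshold, round(score, 3))
```

Because beta is greater than 1, the selected cutoff tolerates a false positive in order to avoid false negatives, which matches the goal of this section.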

## Results and Insights

## Conclusion and Next Steps


Thank you for checking out my notebook!