In [2]:
# packages
import pandas as pd
from mod02_build_bot_predictor import train_model

### Define a function to extract predictions from the model

In [3]:
def predict_bot(df, model=None):
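    """
    Predict whether each account is a bot (1) or human (0).

    df: feature DataFrame (without the "is_bot" column).
    model: fitted model exposing .predict; if None, one is built by
           calling train_model() with no arguments.
    Returns a pd.Series of predictions aligned to df.index.
    """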
    if model is None:
        model = train_model()

    preds = model.predict(df)
    return pd.Series(preds, index=df.index)

### Define a function to evaluate model error

In [4]:
def confusion_matrix_and_metrics(y_true, y_pred):
    """
    Computes confusion matrix and common error rates for binary classification.

    Assumes labels:
      0 = negative class
      1 = positive class

    Returns:
      dict with:
        tn, fp, fn, tp
        misclassification_rate
        false_positive_rate
        false_negative_rate
    """
    tn = fp = fn = tp = 0

    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1
        elif yt == 0 and yp == 1:
            fp += 1
        elif yt == 1 and yp == 0:
            fn += 1
        elif yt == 1 and yp == 1:
            tp += 1
        else:
            raise ValueError("Labels must be 0 or 1")

    total = tn + fp + fn + tp

    misclassification_rate = (fp + fn) / total if total > 0 else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    false_negative_rate = fn / (fn + tp) if (fn + tp) > 0 else 0.0

    return {
        "tp": tp,
        "tn": tn,
        "fp": fp,
        "fn": fn,
        "misclassification_rate": misclassification_rate,
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
    }
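
As a quick sanity check of the rate definitions above, here is a minimal standalone sketch that tallies the four cells with `collections.Counter` on a tiny hand-made example. The toy labels and the `_toy` variable names are made up for illustration and are not from the dataset.

```python
from collections import Counter

# Hypothetical toy labels (not from the dataset): 1 = bot, 0 = human
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 0, 0]

# Count each (true label, predicted label) pair
cells = Counter(zip(y_true_toy, y_pred_toy))
tn, fp = cells[(0, 0)], cells[(0, 1)]
fn, tp = cells[(1, 0)], cells[(1, 1)]

misclass = (fp + fn) / len(y_true_toy)  # share of all predictions that are wrong
fpr = fp / (fp + tn)                    # share of humans flagged as bots
fnr = fn / (fn + tp)                    # share of bots labeled as human

print(tn, fp, fn, tp, misclass, fpr, fnr)
```

The denominators differ: the misclassification rate divides by all accounts, while the false positive and false negative rates divide only by the actual humans and actual bots, respectively.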


### Load the data

In [5]:
TRAIN_PATH = "mod02_data/train.csv"
train = pd.read_csv(TRAIN_PATH)

TEST_PATH = "mod02_data/test.csv"
test = pd.read_csv(TEST_PATH)

### Format the data by independent vs. dependent variables

In [6]:
X_train = train.drop(columns=["is_bot"])
y_train = train["is_bot"]

X_test = test.drop(columns=["is_bot"])
y_test = test["is_bot"]

### Build the model on training data

In [7]:
model = train_model(X_train, y_train)

### Get the model predictions on training and test data

In [8]:
y_pred_train = predict_bot(X_train, model)
y_pred_test = predict_bot(X_test, model)

### Check results on the training set (data used to build the model)

In [9]:
confusion_matrix_and_metrics(y_train, y_pred_train)

{'tp': 97,
 'tn': 2594,
 'fp': 43,
 'fn': 266,
 'misclassification_rate': 0.103,
 'false_positive_rate': 0.016306408797876374,
 'false_negative_rate': 0.7327823691460055}

### Check results on the test set (new data not yet seen by the model)

In [10]:
confusion_matrix_and_metrics(y_test, y_pred_test)

{'tp': 30,
 'tn': 865,
 'fp': 9,
 'fn': 96,
 'misclassification_rate': 0.105,
 'false_positive_rate': 0.010297482837528604,
 'false_negative_rate': 0.7619047619047619}

# Discussion Questions

### Based on the misclassification rate of your model, discuss your confidence in the ability to predict a bot. 

My test misclassification rate is 0.105, so the model is correct about 89.5% of the time, which looks decent. However, the errors are not evenly spread across the two classes, so overall accuracy overstates how well the model detects bots. The false negative rate is very high (0.762): the model misses 96 of the 126 actual bots in the test set. I'm fairly confident the model avoids falsely labeling humans as bots (only 9 false positives, an FPR of about 0.010), but I'm not confident it reliably catches bots, and I wouldn't want to use it for that purpose.
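
To put the 89.5% figure in context, the sketch below recomputes a few quantities from the test-set confusion counts reported above. The comparison against an "always predict human" rule is an added illustration, not part of the original analysis, and the variable names are chosen here for clarity.

```python
# Test-set confusion counts reported in the notebook output
tp, tn, fp, fn = 30, 865, 9, 96
total = tp + tn + fp + fn

accuracy = (tp + tn) / total

# Accuracy of a trivial rule that labels every account as human (0):
# it gets every actual human right and every actual bot wrong
baseline_accuracy = (tn + fp) / total

recall = tp / (tp + fn)     # share of actual bots the model catches
precision = tp / (tp + fp)  # share of bot predictions that are correct

print(f"accuracy={accuracy:.3f}, baseline={baseline_accuracy:.3f}")
print(f"recall={recall:.3f}, precision={precision:.3f}")
```

With these counts, the trivial rule already reaches 87.4% accuracy because only 126 of the 1,000 test accounts are bots, so the model's 89.5% is a small improvement. Recall is about 0.238, meaning the model catches roughly one in four bots.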

### What are potential ramifications of false positives from the model?

A false positive is when a real human gets labeled as a bot. Potential ramifications include user dissatisfaction (lockouts, failed logins, blocked account creation, extra verification challenges) and lower customer trust and retention, because legitimate users would feel unfairly blocked.

### What are potential ramifications of false negatives from the model?

A false negative is when a bot is labeled as human. The main ramification is that bots get past the security measures, which can lead to fraud and account takeover attempts, spam, and other abuse that would upset real users. Because this model specifically has such a high false negative rate, it would let most bots through, and the platform could become infested with them.