# First Iteration - Model Prototypes

## Setup

In [1]:
#import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, roc_auc_score, recall_score
from imblearn.over_sampling import SMOTE

In [2]:
df = pd.read_csv("../data/df_final.csv")

I will prototype three models: Logistic Regression, Random Forest, and Naïve Bayes. Fitting each one with default hyperparameters gives a quick read on their relative performance and informs my final model selection.

Because my data is imbalanced, I will first use SMOTE to generate synthetic samples for the less frequent class (`is_child` == 1). SMOTE is applied to the training data only, so the test set keeps the original class distribution.

In [3]:
#get features and target
X = df.drop("is_child", axis=1)
y = df["is_child"]

In [4]:
#split into train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=100000, test_size=40000, random_state=42, stratify=y
)

In [5]:
#oversample using smote
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

#code adapted from Singh (2025)

In [6]:
#check class balance
pre_smote = y_train.value_counts(normalize=True)
post_smote = y_train_res.value_counts(normalize=True)
print(f"Minority class pre-SMOTE: {pre_smote[1]:.3f} | Post-SMOTE: {post_smote[1]:.3f}")

Minority class pre-SMOTE: 0.052 | Post-SMOTE: 0.500
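
To make the oversampling step above more concrete, here is a minimal sketch of the idea behind SMOTE: each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the `imblearn` implementation; the function `smote_sketch`, its parameters, and the toy points are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified illustration only: base samples are drawn at random here.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # pick a random minority sample as the starting point
        base = rng.choice(minority)
        # find its k nearest minority neighbours (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # place the new point somewhere on the segment between the two
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# hypothetical toy minority class with two features
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
new_points = smote_sketch(minority_points, n_new=5)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Because every synthetic point is an interpolation of two real minority samples, it always falls within the range already covered by the minority class, which is why the test set is left untouched and only the training data is resampled.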


## Logistic Regression Prototype

In [7]:
#instantiate model
lr = LogisticRegression(random_state=42)

#fit model
lr.fit(X_train_res, y_train_res)

#predict
y_pred_proba = lr.predict_proba(X_test)[:, 1]
y_pred = lr.predict(X_test)

#evaluate
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
print(classification_report(y_test, y_pred))

ROC AUC: 0.8472
              precision    recall  f1-score   support

           0       0.98      0.80      0.88     37900
           1       0.17      0.73      0.28      2100

    accuracy                           0.80     40000
   macro avg       0.58      0.77      0.58     40000
weighted avg       0.94      0.80      0.85     40000
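
As a quick aid to reading the report above, the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it for the minority class from the rounded values in the report; the helper name `f1_from` is my own and not part of scikit-learn.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# values copied from the class 1 row of the report above (rounded to 2 d.p.)
precision_1, recall_1 = 0.17, 0.73
f1_1 = f1_from(precision_1, recall_1)
print(f"F1 for class 1: {f1_1:.2f}")
```

This gives 0.28, matching the report. The low precision pulls the score down even though recall is fairly high, because the harmonic mean is dominated by the smaller of the two values.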



## Random Forest Prototype

In [8]:
#instantiate
rf = RandomForestClassifier(random_state=42)

#fit
rf.fit(X_train_res, y_train_res)

#predict
y_pred_proba = rf.predict_proba(X_test)[:, 1]
y_pred = rf.predict(X_test)

#evaluate
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
print(classification_report(y_test, y_pred))

ROC AUC: 0.8381
              precision    recall  f1-score   support

           0       0.96      0.99      0.97     37900
           1       0.55      0.28      0.37      2100

    accuracy                           0.95     40000
   macro avg       0.76      0.63      0.67     40000
weighted avg       0.94      0.95      0.94     40000



## Naïve Bayes Prototype

In [9]:
#instantiate
nb = GaussianNB()

#fit
nb.fit(X_train_res, y_train_res)

#predict
y_pred_proba = nb.predict_proba(X_test)[:, 1]
y_pred = nb.predict(X_test)

#evaluate
print(f"ROC AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
print(classification_report(y_test, y_pred))

ROC AUC: 0.7861
              precision    recall  f1-score   support

           0       0.98      0.63      0.77     37900
           1       0.10      0.77      0.18      2100

    accuracy                           0.64     40000
   macro avg       0.54      0.70      0.48     40000
weighted avg       0.93      0.64      0.74     40000
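
To compare the three prototypes side by side, the sketch below collects the minority-class figures from the reports above into one table. The numbers are copied by hand from the printed outputs, and the baseline is derived from the test-set support (37900 of 40000 samples are class 0); the variable names are my own.

```python
# minority-class (is_child == 1) figures copied from the three reports above
results = {
    "Logistic Regression": {"roc_auc": 0.8472, "precision": 0.17, "recall": 0.73, "f1": 0.28},
    "Random Forest": {"roc_auc": 0.8381, "precision": 0.55, "recall": 0.28, "f1": 0.37},
    "Naive Bayes": {"roc_auc": 0.7861, "precision": 0.10, "recall": 0.77, "f1": 0.18},
}

# accuracy of a trivial model that always predicts the majority class
baseline_accuracy = 37900 / 40000

print(f"{'model':<20}{'roc_auc':>9}{'precision':>11}{'recall':>8}{'f1':>6}")
for name, m in results.items():
    print(f"{name:<20}{m['roc_auc']:>9.4f}{m['precision']:>11.2f}{m['recall']:>8.2f}{m['f1']:>6.2f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")

# which prototype leads on each criterion
best_auc = max(results, key=lambda name: results[name]["roc_auc"])
best_recall = max(results, key=lambda name: results[name]["recall"])
print(f"Highest ROC AUC: {best_auc} | Highest minority recall: {best_recall}")
```

The baseline line is a reminder that accuracy alone says little here: always predicting class 0 would already score about 0.95, so ROC AUC and the minority-class precision and recall are the more informative columns.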

