# **Data Science Lecture 1 September 10 2025**

- Random forests and other tree ensembles are often not deployed to production because they are computationally expensive
- Averaging and boosting methods tend to do better on precision than on recall, because they can be trained to be very precise about when an outcome happens
- **Precision**: When the model says yes, how often is it right?
- **Recall**: Of all the things that are actually positive, how many did the model catch?
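
As a small worked example of the two definitions above, the sketch below counts true positives, false positives, and false negatives by hand. The labels are made-up toy values, not from the lecture data, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: of everything the model flagged as positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that really was positive, how much did the model catch?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: 4 actual positives; the model says "yes" 3 times and is right twice.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f}, recall={recall:.3f}")  # precision=0.667, recall=0.500
```

Here the model is fairly precise (2 of its 3 "yes" calls are right) but catches only half of the actual positives.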
### **Bagging**
- Great way to reduce risk by averaging predictions
- Bagging can be used on all types of models not just trees
- In Python, scikit-learn's bagging estimators expose only basic sampling controls (`max_samples`, `max_features`, `bootstrap`), not a custom sampling strategy. To control how samples are drawn, build a custom bagging model from scratch, or start from the sklearn source code and add your own sampling functions

- Extra Trees is less computationally expensive than Random Forest and typically performs about as well
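
The custom-sampling idea from the bagging notes can be sketched roughly as follows. This is a minimal illustration, not sklearn's implementation: `balanced_bootstrap`, `fit_custom_bagging`, and `predict_proba_bagged` are hypothetical names, the data is synthetic, and the balanced sampler is just one possible sampling strategy.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def balanced_bootstrap(y, rng):
    """Hypothetical sampler: draw, with replacement, the same number of rows from each class."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()
    idx = [rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
           for c in classes]
    return np.concatenate(idx)


def fit_custom_bagging(X, y, sampler, n_estimators=25, max_depth=3, seed=0):
    """Fit one shallow tree per resample; `sampler` decides which rows each tree sees."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = sampler(y, rng)
        tree = DecisionTreeClassifier(max_depth=max_depth,
                                      random_state=int(rng.integers(1_000_000)))
        models.append(tree.fit(X[idx], y[idx]))
    return models


def predict_proba_bagged(models, X):
    """Average the positive-class probability across all fitted trees."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic, imbalanced toy data (an assumption for illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 1.0).astype(int)

models = fit_custom_bagging(X, y, balanced_bootstrap)
proba = predict_proba_bagged(models, X)
```

Swapping `balanced_bootstrap` for another function with the same `(y, rng)` signature changes the sampling strategy without touching the rest of the ensemble.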

### **Boosting**
- Adaboost works with weak learners
- Boosted trees focus on what the previous tree was not good at predicting
- In theory, boosted models don't really need imbalance handling, because subsequent trees are built to address what earlier trees got wrong, including the minority class. In practice, observations are randomly sampled, so there could still be an imbalance problem: there is only a small subset of minority-class observations to sample from compared to the majority class
- **Gradient Tree Boosting**
    - Gradient tree boosting builds full trees
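
To make "focus on what the previous tree was not good at predicting" concrete, here is a sketch of one round of the classic discrete AdaBoost reweighting step on made-up predictions. It illustrates the textbook update only; sklearn's `AdaBoostClassifier` may differ in detail, and gradient boosting works differently (it fits each new tree to the gradient of the loss rather than reweighting rows).

```python
import numpy as np

# Toy labels and the predictions of one weak learner (two mistakes, at index 1 and 7).
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 0, 1, 1])

# Every observation starts with the same weight.
w = np.full(len(y_true), 1 / len(y_true))

miss = y_true != y_pred
err = w[miss].sum()                      # weighted error rate of this learner
alpha = 0.5 * np.log((1 - err) / err)    # the learner's say in the final vote

# Up-weight the mistakes, down-weight the correct rows, then renormalise.
w_new = w * np.exp(np.where(miss, alpha, -alpha))
w_new /= w_new.sum()

print(f"err={err:.2f}, alpha={alpha:.3f}")
print("weight on misclassified rows:", w_new[miss].sum())
```

After the update the misclassified rows carry half of the total weight, so the next weak learner is pushed to get them right.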

### **Voting Based Models**
- Soft voting averages the models' predicted probabilities, and the models can be weighted; hard voting is just majority rule on the predicted labels
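
A tiny numeric sketch of the difference, using made-up probabilities from three hypothetical models for a single observation:

```python
import numpy as np

# P(class = 1) from three hypothetical models for one observation.
probas = np.array([0.90, 0.45, 0.40])

# Hard voting: each model casts a 0/1 vote, majority wins.
hard_votes = (probas >= 0.5).astype(int)
hard_pred = int(hard_votes.sum() > len(probas) / 2)

# Soft voting: average the probabilities, then threshold.
soft_pred = int(probas.mean() >= 0.5)

# Weighted soft voting: trust the second and third model three times as much.
weights = np.array([1, 3, 3])
weighted_proba = np.average(probas, weights=weights)
weighted_pred = int(weighted_proba >= 0.5)

print(hard_pred, soft_pred, weighted_pred)  # 0 1 0
```

The one very confident model flips the unweighted soft vote to 1, while hard voting and the weighted soft vote both land on 0.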

### **Stacked Generalization**
- Not used in industry
- Just combines different types of models vertically (the predictions of one layer feed the next), which is computationally expensive

- LightGBM is the standard in industry (a computationally cheaper alternative to XGBoost); Random Forest is not used in industry
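
For reference, the stacked generalization described above can be sketched with scikit-learn's `StackingClassifier`. The synthetic data, the choice of base models, and the logistic-regression meta-model are all assumptions made for illustration, not something prescribed in the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustration only).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)

# Layer 1: different model types. Layer 2: a meta-model trained on their
# out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ('tree', DecisionTreeClassifier(max_depth=3, random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
test_accuracy = stack.score(X_te, y_te)
print(f"stacked test accuracy: {test_accuracy:.3f}")
```

The cost noted above shows up here: with `cv=5`, each base model is fitted on five folds to produce the meta-model's training inputs and once more on the full training set.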

In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance

from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score, precision_score, 
                             recall_score, confusion_matrix, log_loss, RocCurveDisplay,
                             PrecisionRecallDisplay, DetCurveDisplay, ConfusionMatrixDisplay, brier_score_loss)

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier, RandomForestClassifier, ExtraTreesClassifier, VotingClassifier)

In [24]:
df = pd.read_csv("data/train.csv")

Simple feature engineering

In [25]:
# converting 'Sex' and 'Embarked' to categorical variables
# defining a new column 'FamilySize'
# defining a new column 'IsAlone'
# extracting 'Title' from 'Name'
df['Sex'] = df['Sex'].astype('category')
df['Embarked'] = df['Embarked'].astype('category')
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize']==1).astype(int)
df['Title'] = (
    df['Name'].str.extract(r",\s*([^.]*)\.", expand=False)
    .str.strip()
    .replace({'Mlle':'Miss', 'Ms':'Miss', 'Mme':'Mrs', 'Lady':'Noble', 'Countess':'Noble',
              'Sir':'Noble', 'Don':'Noble', 'Dona':'Noble', 'Jonkheer':'Noble', 'Dr':'Officer',
              'Col':'Officer', 'Major':'Officer', 'Capt':'Officer', 'Rev':'Officer'})
    .astype('category')
)

In [26]:
# setting target variable and inputs
y = df['Survived'].astype(int)
x = df[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','FamilySize','IsAlone','Title']]

In [27]:
# separating numerical and categorical columns
num_cols = ['Age','SibSp','Parch','Fare','FamilySize','Pclass']
cat_cols = ['Sex','Embarked','IsAlone','Title']

Creating processing pipelines

In [28]:
# imputing missing numerical values with median and scaling them
# imputing missing categorical values with the most frequent value and one-hot encoding them
# creating a preprocessing pipeline that can handle both numerical and categorical features
numeric = Pipeline([('imputer', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler(with_mean=False))])
categorical = Pipeline([('imputer',SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

prep = ColumnTransformer([('num',numeric, num_cols),('cat',categorical, cat_cols)], sparse_threshold=0.3) # transformations for the model

In [29]:
# creating train test split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)

Bagging classifier documentation https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html 

In [30]:
# bagging classifier pipeline
bag = Pipeline([
    ('prep',prep),
    ('bag',BaggingClassifier(estimator = DecisionTreeClassifier(max_depth=3), n_estimators=200, 
                             max_samples=0.8, bootstrap=True, n_jobs=-1, random_state=42))
])

AdaBoost classifier documentation https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

In [31]:
# adaboost classifier pipeline
# max_depth=2 is for weak learner
ada = Pipeline([
      ('prep',prep),
      ('clf', AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2),
                                 n_estimators=300, learning_rate=0.5,random_state=42))
])

Building a random forest model https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [32]:
# random forest classifier pipeline
rf = Pipeline([
    ('prep',prep),
    ('clf', RandomForestClassifier(n_estimators=400, max_depth=None, random_state=42, n_jobs=-1))
])


Extra trees classifier https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

In [33]:
# extra trees classifier pipeline
et = Pipeline([
    ('prep',prep),
    ('clf', ExtraTreesClassifier(n_estimators=400, max_depth=None, random_state=42, bootstrap=True, n_jobs=-1))
])

Voting based classifier 

In [34]:
# voting classifier
vote = Pipeline([
    ('prep',prep),
    ('clf', VotingClassifier(estimators=[
        ('bag',BaggingClassifier(estimator = DecisionTreeClassifier(max_depth=3), n_estimators=200, 
                             max_samples=0.8, bootstrap=True, n_jobs=-1, random_state=42)),
        ('ada',AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2),
                                 n_estimators=300, learning_rate=0.5,random_state=42)),
        ('rf',RandomForestClassifier(n_estimators=400, max_depth=None, random_state=42, n_jobs=-1)),
        ('et',ExtraTreesClassifier(n_estimators=400, max_depth=None, random_state=42, bootstrap=True, n_jobs=-1))], voting='soft'))
])

In [35]:
# building an object that holds all the models
models = {'Bagging':bag, 'AdaBoost':ada, 'RandomForest':rf, 'ExtraTrees':et, 'Voting':vote}

Stratified k-fold cross-validation https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

In [36]:
# cross validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scoring = {'accuracy':'accuracy', 'roc_auc':'roc_auc', 'f1':'f1', 'precision':'precision', 'recall':'recall', 'neg_log_loss':'neg_log_loss'}

In [37]:
# cross_validate was not in the earlier import cell; the other names are already imported
from sklearn.model_selection import cross_validate

In [38]:
cv_rows = []

for name, pipe in models.items():
  res = cross_validate(pipe, X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)
  cv_rows.append({
      'model':name,
      'acc_mean': res['test_accuracy'].mean(),
      'auc_mean': res['test_roc_auc'].mean(),
      'f1_mean': res['test_f1'].mean(),
      'prec_mean': res['test_precision'].mean(),
      'recall_mean': res['test_recall'].mean(),
      'logloss_mean': -res['test_neg_log_loss'].mean()
  })
  
cv_df = pd.DataFrame(cv_rows).sort_values("auc_mean", ascending=False)
cv_df



| | model | acc_mean | auc_mean | f1_mean | prec_mean | recall_mean | logloss_mean |
|---|---|---|---|---|---|---|---|
| 4 | Voting | 0.813277 | 0.878558 | 0.748843 | 0.775771 | 0.725253 | 0.429643 |
| 0 | Bagging | 0.816005 | 0.870244 | 0.746214 | 0.792714 | 0.706936 | 0.414278 |
| 2 | RandomForest | 0.797754 | 0.867506 | 0.733868 | 0.74012 | 0.728889 | 0.767524 |
| 3 | ExtraTrees | 0.799153 | 0.863415 | 0.736214 | 0.741041 | 0.73266 | 0.70286 |
| 1 | AdaBoost | 0.80196 | 0.845252 | 0.737383 | 0.75148 | 0.725387 | 0.647294 |


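Lower logloss_mean is better. A model that always predicts 0.5 (the same as flipping a coin) scores ln 2 ≈ 0.693 on a binary problem, so a logloss_mean near 0.69 means the predicted probabilities are about as informative as chance. Log loss can go above that when a model is confidently wrong, which is likely why RandomForest and ExtraTrees score above 0.69 here even though their AUC is good.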
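
A short sketch of where the chance level for log loss sits, using a hand-rolled binary log loss on made-up labels (`binary_log_loss` is a hypothetical helper written for illustration; `sklearn.metrics.log_loss` is what the notebook itself uses):

```python
import math


def binary_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, proba):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0, 0]

# Coin flip: always predict 0.5, which gives ln(2) regardless of the labels.
coin_flip = binary_log_loss(y_true, [0.5] * len(y_true))

# Confident and right: low loss.
good = binary_log_loss(y_true, [0.9, 0.1, 0.8, 0.2, 0.1])

# Confident and wrong on one row: a single bad row dominates the average.
overconfident = binary_log_loss(y_true, [0.9, 0.1, 0.001, 0.2, 0.1])

print(f"coin flip: {coin_flip:.3f}")  # coin flip: 0.693
print(f"good: {good:.3f}, overconfident: {overconfident:.3f}")
```

The overconfident case ends up worse than the coin flip even though four of its five predictions are good, which is how a model with decent accuracy can still post a log loss above 0.693.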

Per-observation scorecard

In [46]:
def per_observation_scoreboard(fitted, X_test, y_test, id_cols=None):
  df = pd.DataFrame(index=X_test.index)
  df['y_true'] = y_test.values

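  # column names collected per model so the ensemble summaries below can select them
  prob_cols, pred_cols, correct_cols = [], [], []

  for name, model in fitted.items():
    proba = model.predict_proba(X_test)[:,1]
    pred = (proba >= 0.5).astype(int)
    df[f"{name}_proba"] = proba
    df[f"{name}_pred"] = pred
    df[f"{name}_correct"] = (pred == y_test.values).astype(int)
    # register the new columns; without this the lists stay empty and the
    # majority / unanimous columns below are computed over zero models
    prob_cols.append(f"{name}_proba")
    pred_cols.append(f"{name}_pred")
    correct_cols.append(f"{name}_correct")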

  preds = df[pred_cols].values
  votes = preds.sum(axis=1)
  majority = (votes > (len(pred_cols)/2)).astype(int)
  df['majority_pred'] = majority
  df['majority_correct'] = (df['majority_pred'] == df['y_true']).astype(int)
  
  df['num_models_correct'] = df[correct_cols].sum(axis=1)
  df['unanimous_correct'] = (df['num_models_correct'] == len(correct_cols)).astype(int)
  df['unanimous_wrong'] = (df['num_models_correct'] == 0).astype(int)

  if id_cols:
    df = pd.concat([X_test[id_cols].reset_index(drop=True), df.reset_index(drop=True)], axis=1)
  
  df = df.sort_values(['unanimous_wrong','unanimous_correct', 'num_models_correct'], ascending= [False, True, True])

  return df

In [47]:
test_rows, fitted = [], {}

for name, pipe in models.items():
  pipe.fit(X_train, y_train)
  fitted[name] = pipe
  proba = pipe.predict_proba(X_test)[:,1]
  pred = (proba >= 0.5).astype(int)

  test_rows.append({
      'model':name,
      'accuracy': accuracy_score(y_test, pred),
      'roc_auc': roc_auc_score(y_test, proba),
      'f1': f1_score(y_test, pred),
      'precision': precision_score(y_test, pred),
      'recall': recall_score(y_test, pred),
      'log_loss': log_loss(y_test, proba)
  })
  
test_df = pd.DataFrame(test_rows).sort_values("roc_auc", ascending=False)
test_df



| | model | accuracy | roc_auc | f1 | precision | recall | log_loss |
|---|---|---|---|---|---|---|---|
| 0 | Bagging | 0.821229 | 0.857312 | 0.75 | 0.813559 | 0.695652 | 0.426546 |
| 4 | Voting | 0.826816 | 0.847167 | 0.763359 | 0.806452 | 0.724638 | 0.459856 |
| 3 | ExtraTrees | 0.821229 | 0.832279 | 0.761194 | 0.784615 | 0.73913 | 0.539121 |
| 2 | RandomForest | 0.815642 | 0.830435 | 0.755556 | 0.772727 | 0.73913 | 0.54338 |
| 1 | AdaBoost | 0.731844 | 0.74697 | 0.647059 | 0.656716 | 0.637681 | 0.668784 |


In [48]:
scoreboard = per_observation_scoreboard(fitted, X_test, y_test, id_cols=None)

In [49]:
scoreboard

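Note: in this output the last five columns are degenerate (majority_pred is always 0, num_models_correct is always 0.0, and both unanimous flags are always 1) because `pred_cols` and `correct_cols` were empty when the cell ran. Only the per-model `_proba`, `_pred`, and `_correct` columns are reliable here.

| | y_true | Bagging_proba | Bagging_pred | Bagging_correct | AdaBoost_proba | AdaBoost_pred | AdaBoost_correct | RandomForest_proba | RandomForest_pred | RandomForest_correct | ... | ExtraTrees_pred | ExtraTrees_correct | Voting_proba | Voting_pred | Voting_correct | majority_pred | majority_correct | num_models_correct | unanimous_correct | unanimous_wrong |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 565 | 0 | 0.127689 | 0 | 1 | 0.492913 | 0 | 1 | 0.157500 | 0 | 1 | ... | 0 | 1 | 0.210151 | 0 | 1 | 0 | 1 | 0.0 | 1 | 1 |
| 160 | 0 | 0.109291 | 0 | 1 | 0.476029 | 0 | 1 | 0.035000 | 0 | 1 | ... | 0 | 1 | 0.159455 | 0 | 1 | 0 | 1 | 0.0 | 1 | 1 |
| 553 | 1 | 0.097014 | 0 | 0 | 0.483462 | 0 | 0 | 0.102500 | 0 | 0 | ... | 0 | 0 | 0.192619 | 0 | 0 | 0 | 0 | 0.0 | 1 | 1 |
| 860 | 0 | 0.115554 | 0 | 1 | 0.480723 | 0 | 1 | 0.012500 | 0 | 1 | ... | 0 | 1 | 0.157819 | 0 | 1 | 0 | 1 | 0.0 | 1 | 1 |
| 241 | 1 | 0.620355 | 1 | 1 | 0.504417 | 1 | 1 | 0.862500 | 1 | 1 | ... | 1 | 1 | 0.722443 | 1 | 1 | 0 | 0 | 0.0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 880 | 1 | 0.909191 | 1 | 1 | 0.492372 | 0 | 0 | 0.790000 | 1 | 1 | ... | 1 | 1 | 0.776641 | 1 | 1 | 0 | 0 | 0.0 | 1 | 1 |
| 91 | 0 | 0.100693 | 0 | 1 | 0.493480 | 0 | 1 | 0.002500 | 0 | 1 | ... | 0 | 1 | 0.150262 | 0 | 1 | 0 | 1 | 0.0 | 1 | 1 |
| 883 | 0 | 0.100778 | 0 | 1 | 0.488720 | 0 | 1 | 0.002500 | 0 | 1 | ... | 0 | 1 | 0.148676 | 0 | 1 | 0 | 1 | 0.0 | 1 | 1 |
| 473 | 1 | 0.877843 | 1 | 1 | 0.544272 | 1 | 1 | 0.937500 | 1 | 1 | ... | 1 | 1 | 0.828654 | 1 | 1 | 0 | 0 | 0.0 | 1 | 1 |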


In [51]:
# Passenger at index 553 survived, but every model predicted otherwise.
# He is an anomaly: many of his characteristics (male, 3rd class, low fare,
# travelling alone) differ from those of most people who survived.
df.loc[553]

PassengerId                                  554
Survived                                       1
Pclass                                         3
Name           Leeni, Mr. Fahim ("Philip Zenni")
Sex                                         male
Age                                         22.0
SibSp                                          0
Parch                                          0
Ticket                                      2620
Fare                                       7.225
Cabin                                        NaN
Embarked                                       C
FamilySize                                     1
IsAlone                                        1
Title                                         Mr
Name: 553, dtype: object