**Can we approach this as a regression problem instead of classification?**

For any metric other than ROC-AUC, the answer would be NO, as we have two discrete classes to predict. But *roc_auc_score* cares only about the order of predictions, so it will work. Specifically, I will use Lasso regression because its L1 penalty squeezes out some features by setting their coefficients to zero, thus performing feature selection. None of the features will be eliminated in this case, but we will get their relative importance at the end.
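
To see why the order is all that matters, here is a minimal sketch that computes AUC straight from its pairwise definition: the fraction of (positive, negative) pairs where the positive gets the higher score. The helper `pairwise_auc` and the toy numbers are made up for illustration; this is not how *roc_auc_score* is implemented internally, just the same quantity computed the slow way.

```python
import math


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): regression-style scores, not probabilities
labels = [0, 0, 1, 0, 1, 1, 0, 1]
raw_scores = [-0.3, 0.1, 0.4, 0.05, 1.2, 0.08, -0.1, 0.9]

# Two strictly increasing transforms of the same scores
sigmoid_scores = [1.0 / (1.0 + math.exp(-s)) for s in raw_scores]
shifted_scores = [10.0 * s - 3.0 for s in raw_scores]

auc_raw = pairwise_auc(labels, raw_scores)
auc_sigmoid = pairwise_auc(labels, sigmoid_scores)
auc_shifted = pairwise_auc(labels, shifted_scores)
print(auc_raw, auc_sigmoid, auc_shifted)
```

All three values come out identical, because neither transform changes which score in a pair is larger. That is why unbounded regression outputs are acceptable for this metric.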

In [None]:
from datetime import datetime
import pandas as pd
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import mean_squared_error, roc_auc_score
import joblib
import matplotlib.pyplot as plt
from category_encoders.leave_one_out import LeaveOneOutEncoder
from sklearn.preprocessing import StandardScaler
from scipy import special


def timer(start_time=None):
    """Return the current time if called without a start time; otherwise print the elapsed time."""
    if not start_time:
        return datetime.now()
    tmin, tsec = divmod((datetime.now() - start_time).total_seconds(), 60)
    print(" Time taken: %i minutes and %s seconds." % (tmin, round(tsec, 2)))


DATA_TRAIN_PATH = "/kaggle/input/playground-series-s3e2/train.csv"
DATA_TEST_PATH = "/kaggle/input/playground-series-s3e2/test.csv"


def load_data(path_train=DATA_TRAIN_PATH, path_test=DATA_TEST_PATH):
    train_loader = pd.read_csv(path_train)
    train = train_loader.drop(["stroke", "id"], axis=1)
    features = train.columns.tolist()
    print("\n Train dataset shape:", train.shape)
    train_labels = train_loader["stroke"].values
    train_ids = train_loader["id"].values

    test_loader = pd.read_csv(path_test)
    test = test_loader[features]
    print(" Whole test dataset shape:", test.shape)
    test_ids = test_loader["id"].values

    return train, train_labels, train_ids, features, test, test_ids


if __name__ == "__main__":

    # We will do repeated K-fold cross-validation
    folds = 10
    repeats = 10
    # For historic reasons I use one of these 3 seeds
    seeds = [6772, 6659, 7622]

    # Load data set and target values
    start_time = timer(None)
    print("\n# Reading and Processing Data")
    X_train, y, train_ids, features, X_test, test_ids = load_data()

    all_cols = features
    cols_cat = [
        "gender",
        "ever_married",
        "work_type",
        "Residence_type",
        "smoking_status",
    ]
    num_features = [col for col in all_cols if col not in cols_cat]

Like most generalized linear models, Lasso works only with numerical features and does not tolerate missing values. That means we have to toss out the categorical features or convert them to numbers. We will do the latter. I didn't check whether the features are normally distributed, but it doesn't hurt to standardize them, and the L1 penalty treats all features evenly only when they are on a comparable scale.
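
For readers who haven't met leave-one-out encoding, here is a rough sketch of the idea: each row's category is replaced by the mean target of the *other* rows in the same category, so a row never sees its own label. The function `leave_one_out_encode` and the toy data are hypothetical, and this simplified version skips the noise term (`sigma`) and makes its own choice for categories that occur only once. It is not the `category_encoders` implementation.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of the other rows in its category.

    Simplified sketch: no noise term, and a category seen only once falls
    back to the global target mean (a choice made for this example).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            # Remove this row's own target before averaging
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            encoded.append(global_mean)
    return encoded


# Made-up toy column and binary target
smoking = ["smokes", "never", "smokes", "never", "smokes", "unknown"]
stroke = [1, 0, 0, 0, 1, 0]
encoded = leave_one_out_encode(smoking, stroke)
print(encoded)
```

Note that the three "smokes" rows do not all get the same number: the encoded value depends on the row's own label being held out, which is what limits target leakage.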

In [None]:
print("\n Encoding categorical variables ...")
# sigma=0.05 injects a bit of noise so we don't overfit
ce = LeaveOneOutEncoder(cols=cols_cat, random_state=2022, sigma=0.05, verbose=1)
X_train = ce.fit_transform(X_train, y)
X_test = ce.transform(X_test)
print("\n Train Set Matrix Dimensions: %d x %d" % (X_train.shape[0], X_train.shape[1]))
print(" Test Set Matrix Dimensions: %d x %d" % (X_test.shape[0], X_test.shape[1]))

print(
    " Potential NaN or Inf values in train data:  ",
    np.isnan(X_train[features].values).any(),
    "  ",
    np.isinf(X_train[features].values).any(),
)
print(
    " Potential NaN or Inf values in test data:  ",
    np.isnan(X_test[features].values).any(),
    "  ",
    np.isinf(X_test[features].values).any(),
)
timer(start_time)

scaler = StandardScaler()
scaler.fit(X_train[features].values)
X_train[features] = scaler.transform(X_train[features].values)
joblib.dump(scaler, "StandardScaler_Lasso-01-v1.joblib")
# scaler = joblib.load('StandardScaler_Lasso-01-v1.joblib')
X_test[features] = scaler.transform(X_test[features].values)

All right, now we have all-numerical data. Time to set up cross-validation and let Lasso rip.

In [None]:
rkf_grid = list(
    RepeatedKFold(n_splits=folds, n_repeats=repeats, random_state=seeds[0]).split(
        X_train, y
    )
)
start_time = timer(None)
# Run Lasso cross-validation to determine coefficient alpha and the intercept
# Doing repeated CV to test many different folds and avoid overfitting
print("\n Running Lasso:")
model_llcv = LassoCV(
    precompute="auto",
    fit_intercept=True,
    max_iter=1000,
    verbose=False,
    eps=1e-04,
    cv=rkf_grid,
    n_alphas=1000,
    n_jobs=8,
)
model_llcv.fit(X_train, y)
joblib.dump(model_llcv, "Stroke_Lasso-01-v1.joblib")
#    model_llcv = joblib.load('Stroke_Lasso-01-v1.joblib')
print(" Best alpha value: %.10f" % model_llcv.alpha_)
print(" Intercept: %.10f" % model_llcv.intercept_)
print(" LassoCV score: %.10f" % model_llcv.score(X_train, y))


We have our model parameters. The value of alpha is critical, as we don't want to over- or under-fit the model. Large alpha values will push most feature coefficients to zero, and the model will be very simple and likely under-fitted. When alpha is 0, we have ordinary linear regression, which means no regularization and a stronger potential for over-fitting.
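
A small sketch of how alpha zeroes out coefficients. It relies on a simplifying assumption that does not hold for our data: if the standardized features were exactly uncorrelated, the Lasso solution would be the least-squares coefficients pulled toward zero by alpha (soft-thresholding). The coefficient values and feature names below are made up.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; anything within [-alpha, alpha] becomes 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


# Hypothetical least-squares coefficients for four standardized,
# uncorrelated features (an assumption made only for this illustration)
ols_coefs = {"feat_a": 0.80, "feat_b": -0.30, "feat_c": 0.05, "feat_d": -0.01}

nonzero_counts = {}
for alpha in [0.0, 0.02, 0.1, 0.5, 1.0]:
    lasso_coefs = {name: soft_threshold(c, alpha) for name, c in ols_coefs.items()}
    nonzero_counts[alpha] = sum(1 for c in lasso_coefs.values() if c != 0.0)
    print(alpha, lasso_coefs)
```

With alpha at 0 every coefficient survives untouched, and each increase in alpha removes the weakest remaining feature until nothing is left. Real data with correlated features behaves less neatly, but the trend is the same.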

Let's do a quick test to see what kind of score we can expect.

In [None]:
RMSE_nocv = np.sqrt(mean_squared_error(y, model_llcv.predict(X_train)))
AUC_nocv = roc_auc_score(y, model_llcv.predict(X_train))
print("\n Non cross-validated LassoCV RMSE: %.6f" % RMSE_nocv)
print(" Non cross-validated AUC: %.6f" % AUC_nocv)


Now we score the model over a large number of different folds (10 folds repeated 10 times). The idea again is to avoid overfitting and guard against uneven data distributions within folds: every training point lands in a validation fold once per repeat, so it gets predicted 10 times in out-of-fold fashion. We have to keep track of how many times each point has been predicted and divide its sum of predictions by that count.
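
The bookkeeping can be sketched on its own with a toy setup. Everything here is hypothetical: `repeated_kfold_indices` is a bare-bones stand-in for `RepeatedKFold`, the targets are made up, and the "model" is just the mean of the training fold, refit on every fold.

```python
import random


def repeated_kfold_indices(n_samples, n_splits, n_repeats, seed):
    """Yield (train_idx, val_idx) pairs; each repeat reshuffles the rows."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for k in range(n_splits):
            val_idx = order[k::n_splits]  # every n_splits-th shuffled row
            val_set = set(val_idx)
            train_idx = [i for i in order if i not in val_set]
            yield train_idx, val_idx


# Made-up binary targets
y_toy = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
n_splits, n_repeats = 4, 3

pred_sum = [0.0] * len(y_toy)
pred_count = [0] * len(y_toy)
for train_idx, val_idx in repeated_kfold_indices(len(y_toy), n_splits, n_repeats, seed=0):
    # "Fit" on the training rows only: the prediction is their mean target
    fold_mean = sum(y_toy[i] for i in train_idx) / len(train_idx)
    for i in val_idx:
        pred_sum[i] += fold_mean  # accumulate this row's out-of-fold prediction
        pred_count[i] += 1        # remember how often the row was predicted

oof_pred = [s / c for s, c in zip(pred_sum, pred_count)]
print(pred_count)
print(oof_pred)
```

Each row ends up with a count equal to the number of repeats, so dividing the sum by the count gives the average out-of-fold prediction per row.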

In [None]:
cv_sum = 0
cv_sum_auc = 0
pred = []
fpred = []
avreal = y
avpred = np.zeros(X_train.shape[0])
avpred_count = np.zeros(X_train.shape[0])
idpred = train_ids

train_time = timer(None)
repeats = 10
rkf_grid = list(
    RepeatedKFold(n_splits=folds, n_repeats=repeats, random_state=seeds[0]).split(
        X_train, y
    )
)

for i, (train_index, val_index) in enumerate(rkf_grid):
    fold_time = timer(None)
    print("\n Fold %02d" % (i + 1))
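    # Positional indexing: RepeatedKFold yields row positions, not index labels
    Xtrain, Xval = X_train.iloc[train_index], X_train.iloc[val_index]
    ytrain, yval = y[train_index], y[val_index]

    # model_llcv was already fit on the full training set and is not refit here,
    # so these validation scores are not strictly out-of-fold
    scores_val = model_llcv.predict(Xval)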
    corr_val = np.sqrt(mean_squared_error(yval, scores_val))
    corr_val_auc = roc_auc_score(yval, scores_val)
    print(" Fold %02d RMSE: %.6f" % ((i + 1), corr_val))
    print(" Fold %02d AUC: %.6f" % ((i + 1), corr_val_auc))
    y_pred = model_llcv.predict(X_test)
    timer(fold_time)

    avpred[val_index] += scores_val
    avpred_count[val_index] += 1
    # Running sum of test predictions across all folds
    pred = y_pred if i == 0 else pred + y_pred
    cv_sum = cv_sum + corr_val
    cv_sum_auc = cv_sum_auc + corr_val_auc

timer(train_time)

cv_score = cv_sum / (folds * repeats)
cv_score_auc = cv_sum_auc / (folds * repeats)
avpred = avpred / avpred_count
oof_corr = np.sqrt(mean_squared_error(avreal, avpred))
oof_corr_auc = roc_auc_score(avreal, avpred)
print("\n Average RMSE: %.6f" % cv_score)
print(" Out-of-fold RMSE: %.6f" % oof_corr)
print(" Average AUC: %.6f" % cv_score_auc)
print(" Out-of-fold AUC: %.6f" % oof_corr_auc)
score = str(round(oof_corr_auc, 6)).replace(".", "")
mpred = pred / (folds * repeats)

now = datetime.now()
# Not strictly necessary, but the sigmoid maps all predictions into the (0, 1) range;
# it is monotonic, so the AUC is unchanged
oof_result = pd.DataFrame(avreal, columns=["stroke"])
oof_result["prediction"] = special.expit(avpred)
oof_result["id"] = idpred
oof_result = oof_result[["id", "stroke", "prediction"]]
sub_file = (
    "train_OOF_10_by_10x-Lasso_"
    + score
    + "_"
    + str(now.strftime("%Y-%m-%d-%H-%M"))
    + ".csv"
)
print("\n Writing out-of-fold train file::  %s" % sub_file)
oof_result.to_csv(sub_file, index=False, float_format="%.6f")

# Not strictly necessary, but the sigmoid maps all predictions into the (0, 1) range;
# it is monotonic, so the AUC is unchanged
result = pd.DataFrame(special.expit(mpred), columns=["stroke"])
result["id"] = test_ids
result = result[["id", "stroke"]]
print("\n First 10 lines of your prediction:")
print(result.head(10))
sub_file = (
    "test_10_by_10x-Lasso_" + score + "_" + str(now.strftime("%Y-%m-%d-%H-%M")) + ".csv"
)
print("\n Writing submission file::  %s" % sub_file)
result.to_csv(sub_file, index=False, float_format="%.6f")
result.to_csv("submission.csv", index=False, float_format="%.6f")


We are done with predictions. Without any fancy setup or parameter tuning, this will score ~0.87 on the public leaderboard (LB).

Let's see what features Lasso selected from the whole group. Because the features were standardized, the absolute value of each coefficient is roughly proportional to the feature's importance, and features with coefficients close to zero contribute less to the final prediction.
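
One thing to watch when turning coefficients into relative scores: Lasso coefficients can be negative, so dividing by the plain sum lets positive and negative values cancel. A quick sketch with made-up coefficients (placeholder feature names, not values from this model) shows the difference between the two normalizations.

```python
# Hypothetical coefficients for standardized features (names are placeholders)
coefs = {"feat_a": 0.5, "feat_b": -0.4, "feat_c": 0.1}

# Dividing by the signed sum lets positive and negative values cancel
signed_total = sum(coefs.values())
signed_share = {name: c / signed_total for name, c in coefs.items()}

# Dividing absolute values by their total keeps every share in [0, 1]
abs_total = sum(abs(c) for c in coefs.values())
abs_share = {name: abs(c) / abs_total for name, c in coefs.items()}

print(signed_share)
print(abs_share)
```

The signed version produces shares above 1 and below 0, which are hard to read as importances, while the absolute version sums to 1.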

In [None]:
coef = pd.DataFrame(model_llcv.coef_, columns=["LassoCV_score"])
coef["Feature"] = features
# Normalize by absolute values so negative coefficients don't cancel positive ones
coef["Relative score"] = coef["LassoCV_score"].abs() / coef["LassoCV_score"].abs().sum()
coef = coef.sort_values("Relative score", ascending=False)
coef = coef[["Feature", "LassoCV_score", "Relative score"]]
coef.to_csv("feature_importance_LassoCV.csv", index=False, float_format="%.6f")
coef.plot(kind="barh", x="Feature", y="LassoCV_score", legend=False, figsize=(6, 12))
plt.title("Feature Coefficients")
plt.xlabel("LassoCV score")
plt.savefig(
    "feature_importance_Lasso-01-v1.png",
    bbox_inches="tight",
    pad_inches=0.25,
    dpi=150,
)
plt.show()
