# An eye for feature engineering

## Baseline

In [71]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

In [72]:
dino = pd.read_csv("https://storage.googleapis.com/aiolympiadmy/maio_2025_eye_for_feature_engineering.csv", index_col=0)

In [73]:
X, y = dino[["feature1", "feature2"]], dino["class"]

In [74]:
def create_new_feature(X):
    return X["feature1"]

In [75]:
# Create the feature itself
X["feature3"] = create_new_feature(X)

In [76]:
logreg = LogisticRegression()

In [77]:
logreg.fit(X, y)
y_pred_logreg = logreg.predict(X)

In [78]:
print("Logreg precision / recall / f1_score",
    precision_score(y, y_pred_logreg, zero_division=0, pos_label=1, average="binary"),
    recall_score(y, y_pred_logreg, zero_division=0, pos_label=1, average="binary"),
    f1_score(y, y_pred_logreg, zero_division=0, pos_label=1, average="binary")
)

Logreg precision / recall / f1_score 0.0 0.0 0.0


## Your task

Above is a peculiar dataset passed through a logistic regression classifier. Notice that the baseline example scores 0 for precision, recall and F1 score. (Search the web or ask ChatGPT and friends if you're meeting these terms for the first time!)
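
For readers new to the three metrics, here is a minimal sketch that computes them from raw counts for a binary problem. The helper name `precision_recall_f1` and the toy labels are made up for illustration; the notebook itself uses the scikit-learn functions. A model that never predicts the positive class has no true positives, which is one way to end up with 0 on all three, as the baseline does.

```python
def precision_recall_f1(y_true, y_pred, pos_label=1):
    """Compute precision, recall and F1 for one positive class from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)  # true positives
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)  # false positives
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)  # false negatives

    # Mirror zero_division=0: report 0.0 whenever a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy labels, purely illustrative
y_true = [1, 1, 1, 0, 0, 0]
never_positive = [0, 0, 0, 0, 0, 0]  # a model that never predicts class 1
mixed = [1, 1, 0, 1, 0, 0]           # tp=2, fp=1, fn=1

print(precision_recall_f1(y_true, never_positive))
print(precision_recall_f1(y_true, mixed))
```

F1 is the harmonic mean of precision and recall, so it only gets high when both are high.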

Do what you can to raise the F1 score as much as possible, subject to the following restrictions:

- You cannot edit the existing model prediction logic in the Your Submission section:
    - except for the cell containing `create_new_feature()` itself
    - except for the cell marked for you to import new libraries
- You can still add new code cells to this notebook under the Scratchpad section below. Do all your exploration and testing there. However, code in Your Submission must not depend on code in your Scratchpad in any way. Only code from Your Submission will be run during evaluation.

This challenge will be graded via notebook submission only. Scoring is as follows:

- Up to 10 pts for model performance: F1 score × 10. Partial credit may be granted for incomplete submissions at the graders' discretion, so show your work below!
- +3 pts if F1 score >= 0.5 and no neural networks are involved. Neural networks here are strictly defined as the use of learnable weights and biases.
- +2 pts if F1 score >= 0.5 and the `%%timeit` cell reports runtime <= 10 milliseconds
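
Outside a notebook, the runtime criterion can be approximated with the standard-library `timeit` module. The sketch below assumes 7 runs of 10 loops each, matching what `%%timeit -n 10` reports; `demo_feature` and the random arrays are hypothetical stand-ins for `create_new_feature` and the challenge data.

```python
import timeit

import numpy as np

# Hypothetical stand-in data: 200 random points in a 0-100 range
rng = np.random.default_rng(0)
feature1 = rng.uniform(0, 100, size=200)
feature2 = rng.uniform(0, 100, size=200)


def demo_feature(a, b):
    """Hypothetical cheap feature, standing in for create_new_feature."""
    return a * b / 600


repeat, number = 7, 10  # 7 runs, 10 loops each
totals = timeit.repeat(lambda: demo_feature(feature1, feature2),
                       repeat=repeat, number=number)

# Each total covers `number` calls, so convert to milliseconds per call
ms_per_call = [1000 * t / number for t in totals]
mean_ms = sum(ms_per_call) / repeat

print(f"{mean_ms:.4f} ms per call (mean of {repeat} runs, {number} loops each)")
print("within the 10 ms budget:", mean_ms <= 10)
```

Timings vary between machines, so a result close to the 10 ms limit is worth re-running in the grading environment if possible.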


### Scratchpad

In [69]:
# Write all your exploratory code here

# Import necessary libraries
import numpy as np # mathematical transformation
from sklearn.cluster import KMeans # automatically discover clusters

"""STEP 1: Find clusters using only class 1 points"""
# keep only the rows labelled class 1
class_1_points = X[y == 1]  # the feature1/feature2 columns are selected when fitting below

# set number of clusters (k), originally used 1, increased until 3
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)

# train kmeans clustering using only 'feature1' and 'feature2' from class 1 points
kmeans.fit(class_1_points[["feature1", "feature2"]])

# get discovered class 1 cluster centers
cluster_centers = kmeans.cluster_centers_
print("Discovered class 1 cluster centers:", cluster_centers)  # print found cluster locations

"""STEP 2: Feature engineering function"""
def create_new_feature(X):
    sigma = 15  # spread of each Gaussian bump (20 and 10 were also tried)
    cluster_weight = 2  # scaling factor to increase cluster importance

    # weighted Gaussian score: one bump per cluster centre, decaying with squared distance
    cluster_score = sum(
        cluster_weight * np.exp(-((X["feature1"] - cx) ** 2 + (X["feature2"] - cy) ** 2) / (2 * sigma ** 2))
        for cx, cy in cluster_centers
    )

    # LATEST: box indicator for points near the discovered clusters
    # adds 3 for every cluster centre whose +/-9-unit box (per feature) contains the point
    # (half-widths of 15 and 10 were also tried)
    binary_flag = sum(
        ((X["feature1"] > cx - 9) & (X["feature1"] < cx + 9) &
         (X["feature2"] > cy - 9) & (X["feature2"] < cy + 9))
        for cx, cy in cluster_centers
    ).astype(int) * 3

    # interaction term to capture feature relationships
    interaction = (X["feature1"] * X["feature2"]) / 600

    return cluster_score + interaction + binary_flag

# DO NOT EDIT BELOW - scoring cell
X["feature3"] = create_new_feature(X)

logreg = LogisticRegression()
logreg.fit(X, y)
y_pred_logreg = logreg.predict(X)

print("Logreg precision / recall",
    precision_score(y, y_pred_logreg, zero_division=0, pos_label=1, average="binary"),
    recall_score(y, y_pred_logreg, zero_division=0, pos_label=1, average="binary"),
    f1_score(y, y_pred_logreg, zero_division=0, pos_label=1, average="binary")
)

Discovered class 1 cluster centers: [[51.77156364 82.28438182]
 [80.156      33.23205   ]
 [30.25643333 61.53846667]]
Logreg precision / recall 1.0 1.0 1.0
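
The Gaussian part of the feature is `weight * exp(-d² / (2 * sigma²))` summed over cluster centres, where `d` is the distance from a point to a centre. The sketch below is a NumPy-only version of that term, to show how quickly it decays; the function name `gaussian_cluster_score`, the single centre and the three test points are hypothetical and are not taken from the dataset.

```python
import numpy as np


def gaussian_cluster_score(points, centers, sigma=15.0, weight=2.0):
    """Sum of weighted Gaussian bumps, one per centre, evaluated at each point."""
    points = np.asarray(points, dtype=float)    # shape (n_points, 2)
    centers = np.asarray(centers, dtype=float)  # shape (n_centers, 2)
    # Squared Euclidean distance from every point to every centre: (n_points, n_centers)
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return (weight * np.exp(-sq_dist / (2 * sigma ** 2))).sum(axis=1)


# One hypothetical centre and three probe points
centers = [[50.0, 80.0]]
points = [
    [50.0, 80.0],  # exactly on the centre: score equals the weight
    [65.0, 80.0],  # one sigma away: score is weight * exp(-0.5)
    [50.0, 20.0],  # four sigmas away: score is close to zero
]
scores = gaussian_cluster_score(points, centers)
print(scores)
```

A smaller `sigma` makes each bump narrower, which is the trade-off behind trying 20, 15 and 10 in the cell above.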


## Your submission

In [79]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

In [80]:
# EDIT ME - import additional libraries here
# e.g. import this
import numpy as np
from sklearn.cluster import KMeans

In [81]:
# EDIT ME
"""STEP 1: Find clusters using only class 1 points"""
# keep only the rows labelled class 1
class_1_points = X[y == 1]  # the feature1/feature2 columns are selected when fitting below

# set number of clusters (k), originally used 1, increased until 3
optimal_k = 3
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)

# train kmeans clustering using only 'feature1' and 'feature2' from class 1 points
kmeans.fit(class_1_points[["feature1", "feature2"]])

# get discovered class 1 cluster centers
cluster_centers = kmeans.cluster_centers_
print("Discovered class 1 cluster centers:", cluster_centers)  # print found cluster locations

"""STEP 2: Feature engineering function"""
def create_new_feature(X):
    sigma = 15  # spread of each Gaussian bump (20 and 10 were also tried)
    cluster_weight = 2  # scaling factor to increase cluster importance

    # weighted Gaussian score: one bump per cluster centre, decaying with squared distance
    cluster_score = sum(
        cluster_weight * np.exp(-((X["feature1"] - cx) ** 2 + (X["feature2"] - cy) ** 2) / (2 * sigma ** 2))
        for cx, cy in cluster_centers
    )

    # box indicator for points near the discovered clusters
    # adds 3 for every cluster centre whose +/-9-unit box (per feature) contains the point
    # (half-widths of 15 and 10 were also tried)
    binary_flag = sum(
        ((X["feature1"] > cx - 9) & (X["feature1"] < cx + 9) &
         (X["feature2"] > cy - 9) & (X["feature2"] < cy + 9))
        for cx, cy in cluster_centers
    ).astype(int) * 3

    # interaction term to capture feature relationships
    interaction = (X["feature1"] * X["feature2"]) / 600

    return cluster_score + interaction + binary_flag


Discovered class 1 cluster centers: [[51.77156364 82.28438182]
 [80.156      33.23205   ]
 [30.25643333 61.53846667]]


In [88]:
%%timeit -n 10
# DO NOT EDIT - timing cell
create_new_feature(X)

4.34 ms ± 471 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [83]:
# DO NOT EDIT - scoring cell
X["feature3"] = create_new_feature(X)

logreg = LogisticRegression()
logreg.fit(X, y)
y_pred_logreg = logreg.predict(X)

print("Logreg precision / recall",
    precision_score(y, y_pred_logreg, zero_division=0, pos_label=1, average="binary"),
    recall_score(y, y_pred_logreg, zero_division=0, pos_label=1, average="binary"),
    f1_score(y, y_pred_logreg, zero_division=0, pos_label=1, average="binary")
)

Logreg precision / recall 1.0 1.0 1.0
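
To see why a distance-to-centre feature can help a linear model, here is a stylized sketch on synthetic data, not the challenge dataset. It assumes a small positive cluster surrounded by a ring of negatives, a layout no straight line can separate, and compares logistic regression with and without one Gaussian feature. The data layout, the helper names and `weight=5.0` are all assumptions made for illustration. As in the notebook, the model is fitted and scored on the same points.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic layout (hypothetical): 40 negatives on a ring of radius 40,
# 9 positives on a small 3x3 grid around the origin
angles = np.linspace(0, 2 * np.pi, 40, endpoint=False)
negatives = np.column_stack([40 * np.cos(angles), 40 * np.sin(angles)])
offsets = np.array([-5.0, 0.0, 5.0])
positives = np.array([[dx, dy] for dx in offsets for dy in offsets])

X_raw = np.vstack([negatives, positives])
y = np.array([0] * len(negatives) + [1] * len(positives))


def gaussian_feature(points, center, sigma=15.0, weight=5.0):
    """Weighted Gaussian bump around one centre; large near it, near zero far away."""
    sq_dist = ((points - center) ** 2).sum(axis=1)
    return weight * np.exp(-sq_dist / (2 * sigma ** 2))


# Append the engineered column to the two raw coordinates
X_eng = np.column_stack([X_raw, gaussian_feature(X_raw, np.array([0.0, 0.0]))])


def fit_and_score(features, labels):
    """Fit a default LogisticRegression and return F1 on the same data."""
    model = LogisticRegression().fit(features, labels)
    return f1_score(labels, model.predict(features), zero_division=0)


f1_raw = fit_and_score(X_raw, y)
f1_eng = fit_and_score(X_eng, y)
print("F1 with raw coordinates only:", f1_raw)
print("F1 with the Gaussian feature added:", f1_eng)
```

The weight matters because `LogisticRegression` applies L2 regularization by default: a feature on a larger scale needs a smaller coefficient to move the decision boundary, which may be part of why the scaling factors in `create_new_feature` help.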
