In [19]:
import pandas as pd
import numpy as np
import sklearn.preprocessing

Because simple models such as a single decision tree did not seem very effective on their own, I designed a Stacked Generalization architecture that combines the complementary strengths of three distinct machine learning paradigms: a Gradient Boosted Decision Tree (CatBoost), a Kernel Ridge Regressor (KRR), and a Multilayer Perceptron (MLP, a small neural network). The reasoning is that, in theory, CatBoost excels at identifying the sharp, non-linear cutoffs inherent in periodic table trends, KRR uses radial basis functions to interpolate properties between chemically similar alloys, and the MLP captures complex, high-order feature interactions that simpler models might miss.

To integrate these experts without overfitting, I used an Out-of-Fold (OOF) prediction strategy: each base model's prediction for a sample comes from a copy of that model trained without the sample. This ensures that the inputs to the second layer (a Lasso regressor) reflect how each model performs on unseen data rather than on data it has already fitted. Furthermore, the entire pipeline operates on log-transformed targets to respect the physical constraint that band gaps are non-negative and to stabilize experimental noise.
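
The OOF idea can be sketched with scikit-learn as below. This is only an illustration under stated assumptions, not the pipeline used in this notebook: the data is synthetic, `GradientBoostingRegressor` stands in for CatBoost, and all hyperparameters (`alpha`, `hidden_layer_sizes`, the number of folds) are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for CatBoost (assumption)
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 150 samples, 5 features, non-negative target
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = np.abs(X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + 0.1 * rng.normal(size=150))
y_log = np.log1p(y)  # models are trained on ln(y + 1)

base_models = {
    "gbdt": GradientBoostingRegressor(random_state=0),
    # KRR and MLP get a scaler that is re-fitted inside every training fold
    "krr": make_pipeline(StandardScaler(), KernelRidge(kernel="rbf", alpha=1.0)),
    "mlp": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each column holds one model's predictions for rows it never saw while fitting
oof = np.column_stack(
    [cross_val_predict(model, X, y_log, cv=cv) for model in base_models.values()]
)

# Second layer: Lasso learns how much weight to give each base model
meta = Lasso(alpha=0.01).fit(oof, y_log)
stacked_pred = np.expm1(meta.predict(oof))  # back to the original units

print(dict(zip(base_models, meta.coef_.round(3))))
```

Because every row of `oof` was predicted by a model that did not train on that row, the Lasso weights are fitted on honest estimates of each base model's behaviour.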

Step 1: Data preprocessing

Because band gap energy is non-negative and errors are larger for wide-bandgap insulators than for metals, a model trained on the raw values may skew towards predicting low band gaps to keep its overall error low. To counter this, the band gap (y) is transformed to y' = ln(y + 1). We add 1 so that metals (y = 0) map to ln(1) = 0, preserving 0 as the floor.
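
As a small worked example of the transform and its inverse, the sketch below uses made-up band gap values (the array `gaps_ev` is hypothetical). NumPy's `log1p` and `expm1` compute ln(y + 1) and exp(y') - 1 respectively.

```python
import numpy as np

gaps_ev = np.array([0.0, 0.5, 1.1, 3.4, 9.0])  # hypothetical band gaps in eV

transformed = np.log1p(gaps_ev)    # y' = ln(y + 1); a metal (0 eV) stays at 0
recovered = np.expm1(transformed)  # y = exp(y') - 1 undoes the transform

print(transformed.round(3))
print(np.allclose(recovered, gaps_ev))
```

The transform compresses the wide-gap end: 9.0 eV maps to ln(10), about 2.30, while 0 eV stays at 0. Predictions made on the transformed scale need `np.expm1` applied before they are reported in eV.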

In [20]:
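df = pd.read_csv("team-a-cleaned.csv")
# ln(y + 1) transform of the experimental band gap
df["bandgap_transformed"] = np.log1p(df["gap expt"])
# keep everything after the first column (the formula string)
df_num = df[df.columns[1:]]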


In [21]:
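# Global data cleaning:
# A column with zero standard deviation carries no information, so remove it
constant_cols = [col for col in df_num.columns if np.std(df_num[col]) == 0]
df = df.drop(columns=constant_cols)
df_num = df_num.drop(columns=constant_cols)

# Report any column that contains NaN values
for col in df_num.columns:
    if df_num[col].isna().any():
        print(col)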


In [22]:
corr = df_num.corr(method="pearson")

# keep only upper triangle (avoids duplicates + self-corr on diagonal)
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# find strong positive correlations (r >= 0.95); strongly negative pairs are not flagged here
strong = (
    upper.stack()
         .loc[lambda s: (s >= 0.95)]
         .sort_values(ascending=False)
)

# print results
for (row, col), r in strong.items():
    print(f"{row} vs {col}: r={r:.4f}")

maximum NfUnfilled vs range NfUnfilled: r=0.9993
maximum Number vs maximum AtomicWeight: r=0.9993
maximum GSmagmom vs range GSmagmom: r=0.9989
mean Number vs mean AtomicWeight: r=0.9988
maximum GSbandgap vs range GSbandgap: r=0.9988
mode Number vs mode AtomicWeight: r=0.9987
minimum Number vs minimum AtomicWeight: r=0.9979
maximum NsUnfilled vs range NsUnfilled: r=0.9969
avg_dev Number vs avg_dev AtomicWeight: r=0.9965
range Number vs range AtomicWeight: r=0.9960
maximum NdUnfilled vs range NdUnfilled: r=0.9913
maximum NfValence vs range NfValence: r=0.9790
mean Number vs mean Row: r=0.9633
mode Number vs mode Row: r=0.9616
maximum GSvolume_pa vs range GSvolume_pa: r=0.9580
gap expt vs bandgap_transformed: r=0.9570
minimum Number vs minimum Row: r=0.9543
range NfUnfilled vs avg_dev NfUnfilled: r=0.9519
mean AtomicWeight vs mean Row: r=0.9514
maximum NfUnfilled vs avg_dev NfUnfilled: r=0.9513


These pairs are almost perfectly correlated, which can distort feature importance: when several columns carry effectively the same information, that information is over-represented and its importance is split between the duplicates. To fix this, we check which column in a pair is more strongly correlated with the target and discard the other one.

Concretely, I find groups of features that are strongly correlated with each other, treating the relation as transitive (if A is tied to B and B to C, all three land in one group). From each group, I keep only ONE feature (the one that is most correlated with the target). This reduces multicollinearity without arbitrarily deleting columns.

In [23]:
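# ---------- 1) Cluster highly correlated features with union-find ----------
cols = df_num.columns
features = [c for c in cols if c != "bandgap_transformed"]

# disjoint-set bookkeeping: every feature starts as the root of its own set
parent = {c: c for c in features}
rank = {c: 0 for c in features}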

def find(a):
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def union(a, b):
    ra, rb = find(a), find(b)
    if ra == rb:
        return
    if rank[ra] < rank[rb]:
        parent[ra] = rb
    elif rank[ra] > rank[rb]:
        parent[rb] = ra
    else:
        parent[rb] = ra
        rank[ra] += 1

# add an edge for each highly-correlated pair in `strong`
for (a, b), r in strong.items():
    # ignore anything involving the target or non-existent cols
    if a not in parent or b not in parent:
        continue
    union(a, b)

# group columns into connected components (clusters)
clusters = {}
for c in features:
    root = find(c)
    clusters.setdefault(root, []).append(c)

# ---------- 2) For each cluster, keep the feature most correlated with the target ----------
# absolute Pearson correlation vs target (strength only, sign ignored)
target_corr = df_num[features].corrwith(df_num["bandgap_transformed"], method="pearson").abs()

keep = set()
drop = set()

for group in clusters.values():
    if len(group) == 1:
        keep.add(group[0])
        continue

    best = target_corr.loc[group].idxmax()  # highest |corr(feature, target)|
    keep.add(best)
    drop.update([c for c in group if c != best])

# ---------- 3) Apply to df_num ----------
df_num_reduced = df_num.drop(columns=sorted(drop), errors="ignore")

print(f"Clusters found: {sum(len(v) > 1 for v in clusters.values())}")
print(f"Kept {len(keep)} features; Dropped {len(drop)} features")
print("Dropped columns:", sorted(drop))

Clusters found: 13
Kept 116 features; Dropped 17 features
Dropped columns: ['avg_dev AtomicWeight', 'avg_dev NfUnfilled', 'maximum GSmagmom', 'maximum NdUnfilled', 'maximum NfUnfilled', 'maximum NfValence', 'maximum NsUnfilled', 'maximum Number', 'mean Number', 'mean Row', 'minimum AtomicWeight', 'minimum Row', 'mode Number', 'mode Row', 'range AtomicWeight', 'range GSbandgap', 'range GSvolume_pa']


In [24]:
# "gap expt" is the untransformed target, so it can't be used as a feature
df_num_reduced = df_num_reduced.drop(columns="gap expt")
df_num_reduced.head(5)

Unnamed: 0,minimum Number,range Number,avg_dev Number,minimum MendeleevNumber,maximum MendeleevNumber,range MendeleevNumber,mean MendeleevNumber,avg_dev MendeleevNumber,mode MendeleevNumber,maximum AtomicWeight,...,mean GSmagmom,avg_dev GSmagmom,mode GSmagmom,minimum SpaceGroupNumber,maximum SpaceGroupNumber,range SpaceGroupNumber,mean SpaceGroupNumber,avg_dev SpaceGroupNumber,mode SpaceGroupNumber,bandgap_transformed
0,16.0,63.0,25.28,65.0,88.0,23.0,74.6,10.72,66.0,196.966569,...,0.0,0.0,0.0,70.0,225.0,155.0,163.0,74.4,70.0,0.0
1,35.0,39.0,15.619048,51.0,95.0,44.0,81.0,18.666667,95.0,183.84,...,0.0,0.0,0.0,64.0,229.0,165.0,118.809524,73.079365,64.0,0.0
2,16.0,66.0,23.552913,65.0,88.0,23.0,83.482759,4.984542,88.0,207.2,...,0.0,0.0,0.0,70.0,225.0,155.0,139.482759,76.67063,70.0,1.040277
3,32.0,50.0,17.388823,65.0,89.0,24.0,84.034483,5.479191,89.0,207.2,...,0.0,0.0,0.0,14.0,225.0,211.0,108.586207,104.370987,14.0,0.920283
4,5.0,42.0,14.25,65.0,95.0,30.0,74.25,10.375,65.0,107.8682,...,0.0,0.0,0.0,64.0,225.0,161.0,170.0,55.0,225.0,0.0


Now we need two dataframes: a raw one and a standardized one. Tree-based models (like CatBoost) work well on raw data because they split on physical thresholds (e.g. melting point < 1000), and such splits do not depend on feature scale. Distance-based models (KRR) and gradient-based models (MLP) need standardized inputs, so we use StandardScaler to rescale every feature to mean 0 and standard deviation 1. Without this, features with large values such as "Atomic Mass" (values ~100) would dominate features with small values such as "Electronegativity" (values ~2) in the KRR distance calculation.
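
The scale problem can be illustrated with a toy example. The three rows below are hypothetical values for atomic mass and electronegativity, and the manual standardization uses the population standard deviation, which is what StandardScaler uses.

```python
import numpy as np

# Hypothetical rows: [atomic mass, electronegativity]
X = np.array([
    [100.0, 1.0],
    [101.0, 3.0],  # similar mass, very different electronegativity
    [110.0, 1.0],  # different mass, identical electronegativity
])

def dist(A, i, j):
    """Euclidean distance between rows i and j of A."""
    return float(np.linalg.norm(A[i] - A[j]))

# On raw values the mass column decides which rows look "close"
raw_01 = dist(X, 0, 1)
raw_02 = dist(X, 0, 2)

# Standardize each column to mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_01 = dist(X_std, 0, 1)
std_02 = dist(X_std, 0, 2)

print(f"raw:          d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")
print(f"standardized: d(0,1)={std_01:.2f}  d(0,2)={std_02:.2f}")
```

On the raw values, row 2 looks several times farther from row 0 than row 1 does, purely because mass is measured on a larger scale. After standardization the two distances are comparable, so the large electronegativity difference counts as much as the mass difference.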

In [25]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_num_reduced),
    columns=df_num_reduced.columns,
    index=df_num_reduced.index
)

df_scaled.head(5)


Unnamed: 0,minimum Number,range Number,avg_dev Number,minimum MendeleevNumber,maximum MendeleevNumber,range MendeleevNumber,mean MendeleevNumber,avg_dev MendeleevNumber,mode MendeleevNumber,maximum AtomicWeight,...,mean GSmagmom,avg_dev GSmagmom,mode GSmagmom,minimum SpaceGroupNumber,maximum SpaceGroupNumber,range SpaceGroupNumber,mean SpaceGroupNumber,avg_dev SpaceGroupNumber,mode SpaceGroupNumber,bandgap_transformed
0,-0.218569,1.215385,1.544964,1.228853,0.483321,-1.035492,0.608397,-0.685339,-0.190451,1.09399,...,-0.270252,-0.307462,-0.134781,-0.186733,0.370629,0.300482,0.427213,0.661024,-0.3889,-0.800261
1,1.067644,0.015067,0.291638,0.703012,1.250445,-0.267287,1.083642,0.10822,0.976142,0.837197,...,-0.270252,-0.307462,-0.134781,-0.263138,0.52962,0.425743,-0.366779,0.62256,-0.458259,-0.800261
2,-0.218569,1.365425,1.320907,1.228853,0.483321,-1.035492,1.268004,-1.258085,0.69455,1.294184,...,-0.270252,-0.307462,-0.134781,-0.186733,0.370629,0.300482,0.004668,0.727157,-0.3889,0.937602
3,0.864558,0.565213,0.521233,1.228853,0.59291,-0.998911,1.308974,-1.208689,0.734778,1.294184,...,-0.270252,-0.307462,-0.134781,-0.899845,0.370629,1.001944,-0.550466,1.533938,-1.036248,0.737143
4,-0.963219,0.165107,0.11403,1.228853,1.250445,-0.779424,0.582407,-0.719791,-0.230678,-0.649027,...,-0.270252,-0.307462,-0.134781,-0.263138,0.370629,0.375639,0.552986,0.095993,1.402867,-0.800261
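
Note that the preview above shows bandgap_transformed being standardized along with the features (0.0 becomes -0.800261), so predictions from a model trained on this frame would have to be un-scaled before the log transform is inverted. A minimal sketch of one alternative, assuming the target should stay on its ln(y + 1) scale, standardizes the feature columns only. The miniature frame and the "mean Electronegativity" column name are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of df_num_reduced: two features plus the target
toy = pd.DataFrame({
    "mean AtomicWeight": [107.9, 183.8, 207.2, 28.1],
    "mean Electronegativity": [1.93, 2.36, 2.33, 1.90],
    "bandgap_transformed": [0.0, 0.0, 1.04, 0.92],
})

target_col = "bandgap_transformed"
feature_cols = [c for c in toy.columns if c != target_col]

# Fit the scaler on the features only and leave the target untouched
scaler = StandardScaler()
toy_scaled = toy.copy()
toy_scaled[feature_cols] = scaler.fit_transform(toy[feature_cols])

print(toy_scaled)
```

When the scaled data later goes through cross-validation, fitting the scaler on each training fold only (for example by wrapping it with the model in `sklearn.pipeline.make_pipeline`) keeps statistics from the held-out fold from leaking into training.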


In [26]:
# Add the formula column back so that we can potentially use it later on
df_scaled = pd.concat([df["formula"], df_scaled], axis=1)
df_unscaled = pd.concat([df["formula"], df_num_reduced], axis=1)

df_unscaled.head(5)

Unnamed: 0,formula,minimum Number,range Number,avg_dev Number,minimum MendeleevNumber,maximum MendeleevNumber,range MendeleevNumber,mean MendeleevNumber,avg_dev MendeleevNumber,mode MendeleevNumber,...,mean GSmagmom,avg_dev GSmagmom,mode GSmagmom,minimum SpaceGroupNumber,maximum SpaceGroupNumber,range SpaceGroupNumber,mean SpaceGroupNumber,avg_dev SpaceGroupNumber,mode SpaceGroupNumber,bandgap_transformed
0,Ag(AuS)2,16.0,63.0,25.28,65.0,88.0,23.0,74.6,10.72,66.0,...,0.0,0.0,0.0,70.0,225.0,155.0,163.0,74.4,70.0,0.0
1,Ag(W3Br7)2,35.0,39.0,15.619048,51.0,95.0,44.0,81.0,18.666667,95.0,...,0.0,0.0,0.0,64.0,229.0,165.0,118.809524,73.079365,64.0,0.0
2,Ag0.5Ge1Pb1.75S4,16.0,66.0,23.552913,65.0,88.0,23.0,83.482759,4.984542,88.0,...,0.0,0.0,0.0,70.0,225.0,155.0,139.482759,76.67063,70.0,1.040277
3,Ag0.5Ge1Pb1.75Se4,32.0,50.0,17.388823,65.0,89.0,24.0,84.034483,5.479191,89.0,...,0.0,0.0,0.0,14.0,225.0,211.0,108.586207,104.370987,14.0,0.920283
4,Ag2BBr,5.0,42.0,14.25,65.0,95.0,30.0,74.25,10.375,65.0,...,0.0,0.0,0.0,64.0,225.0,161.0,170.0,55.0,225.0,0.0


In [27]:
df_scaled.head(5)

Unnamed: 0,formula,minimum Number,range Number,avg_dev Number,minimum MendeleevNumber,maximum MendeleevNumber,range MendeleevNumber,mean MendeleevNumber,avg_dev MendeleevNumber,mode MendeleevNumber,...,mean GSmagmom,avg_dev GSmagmom,mode GSmagmom,minimum SpaceGroupNumber,maximum SpaceGroupNumber,range SpaceGroupNumber,mean SpaceGroupNumber,avg_dev SpaceGroupNumber,mode SpaceGroupNumber,bandgap_transformed
0,Ag(AuS)2,-0.218569,1.215385,1.544964,1.228853,0.483321,-1.035492,0.608397,-0.685339,-0.190451,...,-0.270252,-0.307462,-0.134781,-0.186733,0.370629,0.300482,0.427213,0.661024,-0.3889,-0.800261
1,Ag(W3Br7)2,1.067644,0.015067,0.291638,0.703012,1.250445,-0.267287,1.083642,0.10822,0.976142,...,-0.270252,-0.307462,-0.134781,-0.263138,0.52962,0.425743,-0.366779,0.62256,-0.458259,-0.800261
2,Ag0.5Ge1Pb1.75S4,-0.218569,1.365425,1.320907,1.228853,0.483321,-1.035492,1.268004,-1.258085,0.69455,...,-0.270252,-0.307462,-0.134781,-0.186733,0.370629,0.300482,0.004668,0.727157,-0.3889,0.937602
3,Ag0.5Ge1Pb1.75Se4,0.864558,0.565213,0.521233,1.228853,0.59291,-0.998911,1.308974,-1.208689,0.734778,...,-0.270252,-0.307462,-0.134781,-0.899845,0.370629,1.001944,-0.550466,1.533938,-1.036248,0.737143
4,Ag2BBr,-0.963219,0.165107,0.11403,1.228853,1.250445,-0.779424,0.582407,-0.719791,-0.230678,...,-0.270252,-0.307462,-0.134781,-0.263138,0.370629,0.375639,0.552986,0.095993,1.402867,-0.800261


In [30]:
df_scaled.to_csv("data_scaled.csv",index=False)
df_unscaled.to_csv("data_regular.csv",index=False)