Let's try basic versions of two regression models: Random Forest and LightGBM.

First, let's load the data, one-hot encode the non-numeric columns, and hold out 20% of the training set for validation:

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import lightgbm as lgb

RANDOM_SEED = 42

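train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

# Separate the target from the features.
y = train["log_pSat_Pa"]
X = train.drop(columns=["log_pSat_Pa"])

# One-hot encode the non-numeric columns in both sets.
non_numeric_cols = X.select_dtypes(include=['object']).columns
X = pd.get_dummies(X, columns=non_numeric_cols, drop_first=True)
test = pd.get_dummies(test, columns=non_numeric_cols, drop_first=True)

# Give test exactly the training columns; dummy columns absent from test become NaN.
X, test = X.align(test, join='left', axis=1)
# Note: this also sets any other missing values in test to 0.
test = test.fillna(0)

# Hold out 20% of the training data for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)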
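
To see what the `get_dummies` / `align` / `fillna` steps above do to mismatched categories, here is a minimal sketch on toy data. The frames `toy_train` and `toy_test` and their values are made up for illustration and are not part of the dataset:

```python
import pandas as pd

# Hypothetical toy data: the test frame lacks categories "b" and "c"
# and contains a category "d" that never appears in training.
toy_train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "group": ["a", "b", "c"]})
toy_test = pd.DataFrame({"x": [4.0, 5.0], "group": ["a", "d"]})

# Encode each frame separately, as in the cell above.
toy_train_enc = pd.get_dummies(toy_train, columns=["group"], drop_first=True)
toy_test_enc = pd.get_dummies(toy_test, columns=["group"], drop_first=True)

# join="left" keeps exactly the training columns: "group_d" is dropped from
# the test frame, while "group_b" and "group_c" are added as all-NaN columns.
toy_train_enc, toy_test_enc = toy_train_enc.align(toy_test_enc, join="left", axis=1)
toy_test_enc = toy_test_enc.fillna(0)

print(list(toy_test_enc.columns))
```

After alignment, the test row with the unseen category "d" has zeros in every dummy column, so the model cannot tell it apart from the dropped baseline category "a".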

Then let's try Random Forest:

In [4]:
rf_model = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=RANDOM_SEED)
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_val)
rf_r2 = r2_score(y_val, rf_predictions)
print(f"Random Forest R2 Score: {rf_r2:.4f}")

Random Forest R2 Score: 0.7177
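
For reference, the score that `r2_score` reports is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below recomputes it by hand on made-up numbers; the helper `r2_by_hand` and the toy lists are illustrative only:

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical values, not taken from the validation set.
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.0, 2.0, 3.0, 3.0]

print(f"R2 by hand: {r2_by_hand(y_true_toy, y_pred_toy):.4f}")
```

A model that always predicts the mean of the true values scores 0 and a perfect model scores 1, so on this definition a score of 0.7177 means the squared error on the validation set is roughly 28% of that of the mean predictor.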


And LightGBM:

In [14]:
lgb_model = lgb.LGBMRegressor(
    boosting_type='gbdt',
    num_leaves=31,
    learning_rate=0.1,
    n_estimators=200,
    random_state=RANDOM_SEED
)
lgb_model.fit(X_train, y_train)
lgb_predictions = lgb_model.predict(X_val)
lgb_r2 = r2_score(y_val, lgb_predictions)
print(f"LightGBM R2 Score: {lgb_r2:.4f}")

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 925
[LightGBM] [Info] Number of data points in the train set: 21309, number of used features: 28
[LightGBM] [Info] Start training from score -5.539761
LightGBM R2 Score: 0.7432
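
The `[LightGBM]` lines in the output above come from LightGBM's logging, including its hint about `force_row_wise`. As a sketch, and assuming the installed LightGBM version accepts the documented `force_row_wise` and `verbose` parameters as keyword arguments to `LGBMRegressor`, the same hyperparameters could be passed together with those two extras. The dictionary name `lgb_params` is illustrative, and this variant was not run on this dataset:

```python
# Same hyperparameters as the cell above, plus two optional extras.
lgb_params = {
    "boosting_type": "gbdt",
    "num_leaves": 31,
    "learning_rate": 0.1,
    "n_estimators": 200,
    "random_state": 42,       # same value as RANDOM_SEED
    "force_row_wise": True,   # skip the row-wise vs. col-wise timing test
    "verbose": -1,            # log fatal errors only
}

# With lightgbm imported as lgb, the model would be built as:
# lgb_model = lgb.LGBMRegressor(**lgb_params)
print(sorted(lgb_params))
```

These two settings affect only how training is run and logged, not the hyperparameters themselves.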
