# CalCOFI Ocean chemistry prediction
EDS 232 - Machine Learning

March 18, 2025

Marina Kochuten

Team: Bailey Jorgensen, Jordan Sibley, Nicole Pepper

## Your Task

- Acquire domain knowledge provided by Dr. Satterthwaite in her presentation
- Explore the data
  - Load the dataset and perform initial exploratory data analysis to inform your modeling choices
- Preprocessing (if necessary)
  - Is the data ready to be used in your model?
- Choose and train a model
  - Select an appropriate machine learning algorithm for this task
  - Train your model on the provided training data
- Tune relevant parameters
  - Use cross-validation to optimize model performance
  - Experiment with different hyperparameters to reduce error
- Submit your prediction
  - Generate predictions on the provided test dataset

## Data

This dataset was downloaded from the CalCOFI data portal. Bottle and cast data were downloaded and merged, then the relevant variables were selected.

You will use the data contained in the train.csv file to train a model that will predict dissolved inorganic carbon (DIC) content in the water samples.

A database description is available here: https://calcofi.org/data/oceanographic-data/bottle-database/

Files

- train.csv: the training set
- test.csv: the test set


## Variables

- Lat_Dec: Observed Latitude in decimal degrees
- Lon_Dec: Observed Longitude in decimal degrees
- NO2uM: Micromoles Nitrite per liter of seawater
- NO3uM: Micromoles Nitrate per liter of seawater
- NH3uM: Micromoles Ammonia per liter of seawater
- R_TEMP: Reported (Potential) Temperature in degrees Celsius
- R_Depth: Reported Depth (from pressure) in meters
- R_Sal: Reported Salinity (from Specific Volume Anomaly, M³/Kg)
- R_DYNHT: Reported Dynamic Height in units of dynamic meters (work per unit mass)
- R_Nuts: Reported Ammonium concentration
- R_Oxy_micromol.Kg: Reported Oxygen micromoles/kilogram
- PO4uM: Micromoles Phosphate per liter of seawater
- SiO3uM: Micromoles Silicate per liter of seawater
- TA1: Total Alkalinity micromoles per kilogram solution
- Salinity1: Salinity (likely Practical Salinity Scale 1978)
- Temperature_degC: Not included in the database description; appears to be another temperature measurement in degrees Celsius

In [1]:
# Load packages
import numpy as np
import pandas as pd
import time
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import BaggingRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, confusion_matrix
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.linear_model import LinearRegression
import xgboost as xgb
from xgboost import XGBRegressor
from scipy.stats import uniform, randint
import matplotlib.pyplot as plt
import seaborn as sns

## Data Pre-Processing

In [2]:
# Read in data
df = pd.read_csv('train.csv')
final_test = pd.read_csv('test.csv')

# Look at the training data
df.head()

Unnamed: 0,id,Lat_Dec,Lon_Dec,NO2uM,NO3uM,NH3uM,R_TEMP,R_Depth,R_Sal,R_DYNHT,R_Nuts,R_Oxy_micromol.Kg,Unnamed: 12,PO4uM,SiO3uM,TA1.x,Salinity1,Temperature_degC,DIC
0,1,34.38503,-120.66553,0.03,33.8,0.0,7.79,323,141.2,0.642,0.0,37.40948,,2.77,53.86,2287.45,34.198,7.82,2270.17
1,2,31.418333,-121.998333,0.0,34.7,0.0,7.12,323,140.8,0.767,0.0,64.81441,,2.57,52.5,2279.1,34.074,7.15,2254.1
2,3,34.38503,-120.66553,0.18,14.2,0.0,11.68,50,246.8,0.144,0.0,180.2915,,1.29,13.01,2230.8,33.537,11.68,2111.04
3,4,33.48258,-122.53307,0.013,29.67,0.01,8.33,232,158.5,0.562,0.01,89.62595,,2.27,38.98,2265.85,34.048,8.36,2223.41
4,5,31.41432,-121.99767,0.0,33.1,0.05,7.53,323,143.4,0.74,0.05,60.03062,,2.53,49.28,2278.49,34.117,7.57,2252.62


In [3]:
# Look at the testing data
final_test.head()

Unnamed: 0,id,Lat_Dec,Lon_Dec,NO2uM,NO3uM,NH3uM,R_TEMP,R_Depth,R_Sal,R_DYNHT,R_Nuts,R_Oxy_micromol.Kg,PO4uM,SiO3uM,TA1,Salinity1,Temperature_degC
0,1455,34.321666,-120.811666,0.02,24.0,0.41,9.51,101,189.9,0.258,0.41,138.8383,1.85,25.5,2244.94,33.83,9.52
1,1456,34.275,-120.033333,0.0,25.1,0.0,9.84,102,185.2,0.264,0.0,102.7092,2.06,28.3,2253.27,33.963,9.85
2,1457,34.275,-120.033333,0.0,31.9,0.0,6.6,514,124.1,0.874,0.0,2.174548,3.4,88.1,2316.95,34.241,6.65
3,1458,33.828333,-118.625,0.0,0.0,0.2,19.21,1,408.1,0.004,0.2,258.6743,0.27,2.5,2240.49,33.465,19.21
4,1459,33.828333,-118.625,0.02,19.7,0.0,10.65,100,215.5,0.274,0.0,145.8399,1.64,19.4,2238.3,33.72,10.66


In [4]:
# Check the shape
df.shape

(1454, 19)

In [5]:
# Is Unnamed 12 all NA?
df['Unnamed: 12'].isna().sum()

1454

In [6]:
# Drop `Unnamed: 12`
df = df.drop('Unnamed: 12', axis = 1)

# Confirm dropping worked
df.columns

Index(['id', 'Lat_Dec', 'Lon_Dec', 'NO2uM', 'NO3uM', 'NH3uM', 'R_TEMP',
       'R_Depth', 'R_Sal', 'R_DYNHT', 'R_Nuts', 'R_Oxy_micromol.Kg', 'PO4uM',
       'SiO3uM', 'TA1.x', 'Salinity1', 'Temperature_degC', 'DIC'],
      dtype='object')

In [7]:
# Define features and target
X = df.drop('DIC', axis = 1)
y = df['DIC']

In [8]:
# Rename `TA1.x` in train data to match test (to fix error when scaling)
X.rename(columns={'TA1.x': 'TA1'}, inplace=True)

In [9]:
# Split data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 808)

In [10]:
# Scale X values
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns = X.columns)
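
One way to keep the scaling step from leaking information across splits is to bundle the scaler and the model into a single scikit-learn `Pipeline`, so the scaler is fit only on the data the pipeline is trained on. The sketch below uses synthetic data and a `Ridge` model purely for illustration; the names `X_demo`, `y_demo`, and `pipe` are hypothetical and not part of the analysis in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is not the CalCOFI data)
rng = np.random.default_rng(808)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=808)

# The scaler inside the pipeline is fit on X_tr only; X_te is only transformed
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)

# Confirm the scaler statistics come from the training split alone
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print(f"Held-out R^2: {pipe.score(X_te, y_te):.3f}")
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.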

## SVM

In [14]:
# Initialize SVM regressor
svm = SVR()

# Initialize KFold CV object
svm_kfold = KFold(n_splits = 5)

# Set up parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'gamma': ['scale', 'auto']
}

# Set up GridSearch
gs_svm = GridSearchCV(svm, param_grid, cv = svm_kfold, n_jobs = 5, verbose = 0)

# Fit grid search
gs_svm.fit(X_train_scaled, y_train)

# Print the best parameters
print(f"Best SVM Parameters: {gs_svm.best_params_}")

# Initialize best SVM model
svm_best = SVR(**gs_svm.best_params_)

# Train SVM model
svm_best.fit(X_train_scaled, y_train)

# Generate SVM predictions
svm_best_pred_train = svm_best.predict(X_train_scaled)
svm_best_pred_test = svm_best.predict(X_test_scaled)

# Calculate RMSE
svm_rmse_train = np.sqrt(mean_squared_error(y_train, svm_best_pred_train))
svm_rmse_test = np.sqrt(mean_squared_error(y_test, svm_best_pred_test))
print(f"SVM Best RMSE Train: {svm_rmse_train}")
print(f"SVM Best RMSE Test: {svm_rmse_test}")

Best SVM Parameters: {'C': 100, 'gamma': 'scale', 'kernel': 'linear'}
SVM Best RMSE Train: 5.367844886462071
SVM Best RMSE Test: 7.243489280603425
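
Because no `scoring` argument is passed, `GridSearchCV` ranks the SVR candidates by the estimator's default score, which for regressors is R². If the goal is to select on the same metric that is reported afterwards, a `scoring` string can be supplied. A minimal sketch on synthetic data (the names `X_demo`, `y_demo`, and `gs_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: this is not the CalCOFI data)
rng = np.random.default_rng(808)
X_demo = rng.normal(size=(120, 3))
y_demo = 5.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=120)

gs_demo = GridSearchCV(
    SVR(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=KFold(n_splits=5, shuffle=True, random_state=808),
    scoring="neg_root_mean_squared_error",  # select candidates on RMSE instead of R^2
)
gs_demo.fit(X_demo, y_demo)

# scikit-learn maximizes scores, so RMSE is stored negated
cv_rmse = -gs_demo.best_score_
print(gs_demo.best_params_, f"CV RMSE: {cv_rmse:.3f}")
```

With the default `refit=True`, `best_estimator_` is already refit on all of the data passed to `fit`, so re-initializing a model from `best_params_` is optional.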


## XGBoost


In [15]:
# Split train one more time to have eval data
X_train2, X_val, y_train2, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state = 808)

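# Scale features with a separate scaler so the one fit on X_train is not overwritten
xgb_scaler = StandardScaler()
X_train2_scaled = xgb_scaler.fit_transform(X_train2)
X_val_scaled = xgb_scaler.transform(X_val)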

In [16]:
# Create XGB model
xgb_model1 = xgb.XGBRegressor(n_estimators = 1000,
                              learning_rate = 0.1,
                              eval_metric = "rmse", 
                              early_stopping_rounds = 100,
                              random_state = 808,
                              n_jobs = 5)

# Fit model
xgb_model1.fit(X_train2_scaled, y_train2, eval_set = [(X_val_scaled, y_val)], verbose = 0)

# Get the best boosting round (0-indexed, so the tree count is best_iteration + 1)
best_num_trees = xgb_model1.best_iteration

# Print the best number of trees
print(f"Best number of trees: {best_num_trees}")

Best number of trees: 743


In [17]:
# Tune learning rate
xgb_model2 = xgb.XGBRegressor(n_estimators = best_num_trees,
                              eval_metric = "rmse", 
                              random_state = 808,
                              n_jobs = 5)

# Define hyperparameter distributions
param_dist = {
    "learning_rate":uniform(0.001, 0.5)
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(xgb_model2, 
                                   param_dist, 
                                   n_iter = 20,
                                   cv = 5,  
                                   random_state = 808)

# Run random search
random_search.fit(X_train2_scaled, y_train2, verbose = 0)

# Print the best learning rate
best_learning_rate = random_search.best_params_['learning_rate']
print(f"Best learning rate: {best_learning_rate}")

Best learning rate: 0.054415222825961285
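
Note that `scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, not `[loc, scale]`, so `uniform(0.001, 0.5)` draws learning rates between 0.001 and 0.501. A quick check (the variable names are illustrative):

```python
from scipy.stats import uniform

# uniform(loc, scale) covers [loc, loc + scale]
dist = uniform(0.001, 0.5)
lower = dist.ppf(0)  # lower end of the interval
upper = dist.ppf(1)  # upper end of the interval
samples = dist.rvs(size=1000, random_state=808)

print(lower, upper)
print(samples.min() >= lower, samples.max() <= upper)
```

The same convention applies to `uniform(0.05, 0.2)` and `uniform(0.5, 0.5)` in the later tuning cells, which cover [0.05, 0.25] and [0.5, 1.0] respectively.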


In [18]:
# Tune tree specific params

# Initialize model using best number of trees and learning rate
xgb_model3 = xgb.XGBRegressor(n_estimators = best_num_trees,
                              learning_rate = best_learning_rate,
                              eval_metric = "rmse",
                              random_state = 808,
                              n_jobs = 5)

# Define parameter dictionary
param_dict = {
    "max_depth": randint(3,8),  #randint upper bound is not inclusive: [a,b)
    "min_child_weight": randint(1,8),
    "gamma": uniform(0.05, 0.2)
}

# Set up new RandomizedSearchCV
random_search = RandomizedSearchCV(xgb_model3, 
                                   param_dict, 
                                   n_iter = 20,
                                   cv = 5, 
                                   random_state = 808,
                                   n_jobs = 5)

# Run random search
random_search.fit(X_train2_scaled, y_train2, verbose = 0)

# Print best parameters
best_params = random_search.best_params_
print(f"Best parameters: {best_params}")

Best parameters: {'gamma': 0.14742953267361908, 'max_depth': 6, 'min_child_weight': 1}


In [19]:
# Tune stochastic components

# Initialize model
xgb_model4 = xgb.XGBRegressor(n_estimators = best_num_trees,
                              learning_rate = best_learning_rate,
                              gamma = 0.14742953267361908,
                              max_depth = 6,
                              min_child_weight = 1,
                              eval_metric = "rmse", 
                              random_state = 808)

# Define parameter dictionary
param_dict = {
    "subsample": uniform(0.5, 0.5),
    "colsample_bytree": uniform(0.5, 0.5),
}

# Set up new RandomizedSearchCV
random_search = RandomizedSearchCV(xgb_model4, 
                                   param_dict, 
                                   n_iter = 20,
                                   cv = 5, 
                                   random_state = 808,
                                   n_jobs = 5)

# Run random search
random_search.fit(X_train2_scaled, y_train2, verbose = 0)

# Print best parameters
best_params = random_search.best_params_
print(f"Best parameters: {best_params}")

Best parameters: {'colsample_bytree': 0.8107414551062586, 'subsample': 0.5657684073330207}


In [21]:
# Final model
xgb_model = xgb.XGBRegressor(n_estimators = best_num_trees,
                             learning_rate = best_learning_rate,
                             gamma = 0.14742953267361908,
                             max_depth = 6,
                             min_child_weight = 1,
                             eval_metric = "rmse",
                             random_state = 808,
                             subsample = 0.5657684073330207,
                             colsample_bytree = 0.8107414551062586)

# Fit to full training data
xgb_model.fit(X_train_scaled, y_train)

# Predictions
xgb_pred_train = xgb_model.predict(X_train_scaled)
xgb_pred_test = xgb_model.predict(X_test_scaled)

# Calculate RMSE
xgb_rmse_train = np.sqrt(mean_squared_error(y_train, xgb_pred_train))
xgb_rmse_test = np.sqrt(mean_squared_error(y_test, xgb_pred_test))
print(f"XGBoost RMSE Train: {xgb_rmse_train}")
print(f"XGBoost RMSE Test: {xgb_rmse_test}")

XGBoost RMSE Train: 0.21302371684553612
XGBoost RMSE Test: 6.614784065005989


## Decision Tree & KNN

In [23]:
# Initialize models
knn = KNeighborsRegressor(n_neighbors = 5)
dt = DecisionTreeRegressor()

# Train (fit) both models
knn.fit(X_train_scaled, y_train)
dt.fit(X_train_scaled, y_train)

# Predictions
knn_y_train_pred = knn.predict(X_train_scaled)
dt_y_train_pred = dt.predict(X_train_scaled)
knn_y_test_pred = knn.predict(X_test_scaled)
dt_y_test_pred = dt.predict(X_test_scaled)

# Compute RMSE
knn_train_rmse = np.sqrt(mean_squared_error(y_train, knn_y_train_pred))
knn_test_rmse = np.sqrt(mean_squared_error(y_test, knn_y_test_pred))
dt_train_rmse = np.sqrt(mean_squared_error(y_train, dt_y_train_pred))
dt_test_rmse = np.sqrt(mean_squared_error(y_test, dt_y_test_pred))

# Print train and test RMSE for both models
print(f'KNN RMSE Train: {knn_train_rmse}')
print(f'KNN RMSE Test: {knn_test_rmse}')
print(f'DT RMSE Train: {dt_train_rmse}')
print(f'DT RMSE Test: {dt_test_rmse}')

KNN RMSE Train: 9.14596056848805
KNN RMSE Test: 12.529606465726378
DT RMSE Train: 0.0
DT RMSE Test: 9.070576276949653
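
A training RMSE of exactly 0 is expected for an unconstrained `DecisionTreeRegressor`: with no depth limit, the tree keeps splitting until every leaf holds a single training value (as long as no two identical feature rows have different targets). The sketch below, on synthetic data with hypothetical names, contrasts that with a depth-limited tree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: this is not the CalCOFI data)
rng = np.random.default_rng(808)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=300)

# Unconstrained tree versus a tree limited to depth 4
full_tree = DecisionTreeRegressor(random_state=808).fit(X_demo, y_demo)
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=808).fit(X_demo, y_demo)

full_rmse = np.sqrt(mean_squared_error(y_demo, full_tree.predict(X_demo)))
shallow_rmse = np.sqrt(mean_squared_error(y_demo, shallow_tree.predict(X_demo)))
print(f"Unconstrained train RMSE: {full_rmse:.4f}")
print(f"max_depth=4 train RMSE: {shallow_rmse:.4f}")
```

Limiting `max_depth` or raising `min_samples_leaf` trades some training fit for a model that is less likely to memorize noise.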


## Bagged tree

In [24]:
# Initialize bagging regressor
bagging = BaggingRegressor(n_estimators = 100,
                           max_samples = 0.5,
                            bootstrap = True,
                            random_state = 808)

# Train the model
bagging.fit(X_train_scaled, y_train)

# Make predictions
bagging_preds_train = bagging.predict(X_train_scaled)
bagging_preds_test = bagging.predict(X_test_scaled)

# Compute rmse
bag_train_rmse = np.sqrt(mean_squared_error(y_train, bagging_preds_train))
bag_test_rmse = np.sqrt(mean_squared_error(y_test, bagging_preds_test))

# Print train and test RMSE for the bagging model
print(f'Bagging training RMSE: {bag_train_rmse}')
print(f'Bagging test RMSE: {bag_test_rmse}')

Bagging training RMSE: 3.5750039035816723
Bagging test RMSE: 7.081236805874178


## Linear Regression

In [57]:
# Initialize and fit the model
lr = LinearRegression().fit(X_train_scaled, y_train)

# Make predictions
lr_pred_train = lr.predict(X_train_scaled)
lr_pred_test = lr.predict(X_test_scaled)

# Calculate RMSE
lr_rmse_train = np.sqrt(mean_squared_error(y_train, lr_pred_train))
lr_rmse_test = np.sqrt(mean_squared_error(y_test, lr_pred_test))

print(f'LR RMSE Train: {lr_rmse_train}')
print(f'LR RMSE Test: {lr_rmse_test}')

LR RMSE Train: 5.1813774806839445
LR RMSE Test: 6.724247896221981


## Polynomial regression

In [27]:
# Transform features to include polynomial terms (degree 2 for quadratic terms)
poly = PolynomialFeatures(2, include_bias = False)
X_poly_train = poly.fit_transform(X_train_scaled)
X_poly_test = poly.transform(X_test_scaled)

# Train the model on polynomial features 
poly_model = LinearRegression().fit(X_poly_train, y_train)

# Make predictions using the polynomial model
y_poly_pred_train = poly_model.predict(X_poly_train)
y_poly_pred_test = poly_model.predict(X_poly_test)

# Calculate RMSE
poly_rmse_train = np.sqrt(mean_squared_error(y_train, y_poly_pred_train))
poly_rmse_test = np.sqrt(mean_squared_error(y_test, y_poly_pred_test))
print(f'Poly LR RMSE Train: {poly_rmse_train}')
print(f'Poly LR RMSE Test: {poly_rmse_test}')


Poly LR RMSE Train: 3.70118553972158
Poly LR RMSE Test: 15.74800385502208


## Ridge and Lasso

In [28]:
# Create OLS instance and fit it
ols = LinearRegression()
ols.fit(X_train_scaled, y_train)

# Define a fixed alpha (lambda)
alpha_fixed = 10

# Create Ridge regression instance and fit it
ridge = Ridge(alpha = alpha_fixed)
ridge.fit(X_train_scaled, y_train)

# Predictions using ridge model
ridge_train_pred = ridge.predict(X_train_scaled)
ridge_test_pred = ridge.predict(X_test_scaled)

# Evaluate RMSE
ridge_rmse_train = np.sqrt(mean_squared_error(y_train, ridge_train_pred))
ridge_rmse_test = np.sqrt(mean_squared_error(y_test, ridge_test_pred))

print(f"Train RMSE (alpha = {alpha_fixed}): {ridge_rmse_train:.4f}")
print(f"Test RMSE (alpha = {alpha_fixed}): {ridge_rmse_test:.4f}")

Train RMSE (alpha = 10): 5.4523
Test RMSE (alpha = 10): 6.9659


In [29]:
# Try many alphas
alphas = np.logspace(start = -4, stop = 4, num = 100)  # Alphas from 0.0001 to 10,000

# Fit RidgeCV
ridge_cv = RidgeCV(alphas = alphas, cv = 10).fit(X_train_scaled, y_train)

# Predictions with the best alpha
y_test_pred_cv = ridge_cv.predict(X_test_scaled)
y_train_pred_cv = ridge_cv.predict(X_train_scaled)

# Calculate RMSE
ridgecv_rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred_cv))
ridgecv_rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred_cv))

print(f"RidgeCV RMSE Train: {ridgecv_rmse_train:.4f}")
print(f"RidgeCV RMSE Test: {ridgecv_rmse_test:.4f}")

RidgeCV RMSE Train: 5.1930
RidgeCV RMSE Test: 6.7328
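
After fitting, `RidgeCV` exposes the selected penalty as the `alpha_` attribute, which is useful for checking that the optimum is not sitting on the edge of the search grid. A sketch on synthetic data (all names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data (assumption: this is not the CalCOFI data)
rng = np.random.default_rng(808)
X_demo = rng.normal(size=(150, 6))
coefs = np.array([3.0, 0.0, -1.5, 0.0, 2.0, 0.5])
y_demo = X_demo @ coefs + rng.normal(scale=1.0, size=150)

alphas_demo = np.logspace(-4, 4, 100)
ridge_demo = RidgeCV(alphas=alphas_demo, cv=10).fit(X_demo, y_demo)

# Inspect the selected penalty and whether it hit a grid boundary
best_alpha = ridge_demo.alpha_
on_edge = best_alpha in (alphas_demo[0], alphas_demo[-1])
print(f"Selected alpha: {best_alpha:.4g} (on grid edge: {on_edge})")
```

If the selected value lands on a boundary, widening the `alphas` range is usually worth a try.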


In [51]:
# Fit lasso regression with cross-validation
lasso_cv = LassoCV(alphas = alphas, cv = 10, max_iter=10000).fit(X_train_scaled, y_train)

# Predict
lassocv_test_pred = lasso_cv.predict(X_test_scaled)
lassocv_train_pred = lasso_cv.predict(X_train_scaled)

# Calculate RMSE
lassocv_rmse_test = np.sqrt(mean_squared_error(y_test, lassocv_test_pred))
lassocv_rmse_train = np.sqrt(mean_squared_error(y_train, lassocv_train_pred))

print(f"LassoCV RMSE Train: {lassocv_rmse_train:.4f}")
print(f"LassoCV RMSE Test: {lassocv_rmse_test:.4f}")

LassoCV RMSE Train: 5.3506
LassoCV RMSE Test: 6.9606




## Random Forest

In [35]:
# Construct parameter grid
param_grid = {
    "max_features":["sqrt", 6, None],
    "n_estimators":[50, 100, 200],
    "max_depth":[3,4,5,6,7],
    "min_samples_split":[2,5,10],
    "min_samples_leaf":[1,2,4]
}


# Initialize Random forest regressor
rf = RandomForestRegressor(random_state = 808)

# Use cross-validation to find best combo of parameter values
gs = GridSearchCV(rf, param_grid = param_grid, n_jobs = -1, 
                  return_train_score = True, scoring = "neg_mean_squared_error")
gs.fit(X_train_scaled, y_train)

# Print best combo of parameters
print(f"GS Best Parameters: {gs.best_params_}")

# Train the best estimator
best_rf = RandomForestRegressor(**gs.best_params_, random_state = 808)
best_rf.fit(X_train_scaled, y_train)

# Generate predictions
rf_pred_train = best_rf.predict(X_train_scaled)
rf_pred_test = best_rf.predict(X_test_scaled)

# Calculate RMSE
rf_rmse_train = np.sqrt(mean_squared_error(y_train, rf_pred_train))
rf_rmse_test = np.sqrt(mean_squared_error(y_test, rf_pred_test))
print(f"\nRF RMSE Train: {rf_rmse_train:.3f}")
print(f"RF RMSE Test: {rf_rmse_test:.3f}")

# Extract feature importance from RF model
importance = best_rf.feature_importances_

# Create a list of feature names
feature_names = X_train_scaled.columns

# Create feature importance df
importance_df = (pd.DataFrame(zip(feature_names, importance), columns=['Feature', 'Importance'])
                 .sort_values(by = 'Importance', key = abs, ascending = False)
                 .reset_index(drop=True))

# Print the sorted feature importance
print("\nFeature Importances:\n", importance_df)

GS Best Parameters: {'max_depth': 7, 'max_features': 6, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}

RF RMSE Train: 3.407
RF RMSE Test: 6.993

Feature Importances:
               Feature  Importance
0               PO4uM    0.318997
1              SiO3uM    0.231175
2   R_Oxy_micromol.Kg    0.145176
3               R_Sal    0.096353
4           Salinity1    0.067644
5               NO3uM    0.058657
6                 TA1    0.028580
7              R_TEMP    0.024043
8             R_Depth    0.012058
9    Temperature_degC    0.011295
10            R_DYNHT    0.004388
11              NO2uM    0.001267
12                 id    0.000098
13            Lon_Dec    0.000091
14            Lat_Dec    0.000089
15              NH3uM    0.000056
16             R_Nuts    0.000034


In [46]:
# Try only using most important features
filtered_train = X_train_scaled[['PO4uM', 'SiO3uM', 'R_Oxy_micromol.Kg', 'R_Sal', 
                                 'Salinity1', 'NO3uM', 'TA1', 'R_TEMP', 'R_Depth']]
filtered_test = X_test_scaled[['PO4uM', 'SiO3uM', 'R_Oxy_micromol.Kg', 'R_Sal', 
                               'Salinity1', 'NO3uM', 'TA1', 'R_TEMP', 'R_Depth']]

# Fit
best_rf.fit(filtered_train, y_train)

# Predict
rf_filtered_train_preds = best_rf.predict(filtered_train)
rf_filtered_test_preds = best_rf.predict(filtered_test)

# Calculate RMSE
rf_filtered_rmse_train = np.sqrt(mean_squared_error(y_train, rf_filtered_train_preds))
rf_filtered_rmse_test = np.sqrt(mean_squared_error(y_test, rf_filtered_test_preds))

print("RF Filtered RMSE Train:", rf_filtered_rmse_train)
print("RF Filtered RMSE Test:", rf_filtered_rmse_test)

RF Filtered RMSE Train: 3.258669015911448
RF Filtered RMSE Test: 6.977185484989777


## Comparing models

Let's compare all of the models side by side.

In [54]:
print(f"SVM Train: {svm_rmse_train:.4f}")
print(f"SVM Test: {svm_rmse_test:.4f}")
print(f"\nXGBoost Train: {xgb_rmse_train:.4f}")
print(f"XGBoost Test: {xgb_rmse_test:.4f}")
print(f'\nKNN Train: {knn_train_rmse:.4f}')
print(f'KNN Test: {knn_test_rmse:.4f}')
print(f'\nDT Train: {dt_train_rmse:.4f}')
print(f'DT Test: {dt_test_rmse:.4f}')
print(f'\nBagging Train: {bag_train_rmse:.4f}')
print(f'Bagging Test: {bag_test_rmse:.4f}')
print(f'\nLinear Regression Train: {lr_rmse_train:.4f}')
print(f'Linear Regression Test: {lr_rmse_test:.4f}')
print(f'\nPoly LR Train: {poly_rmse_train:.4f}')
print(f'Poly LR Test: {poly_rmse_test:.4f}')
print(f"\nRidge Train: {ridge_rmse_train:.4f}")
print(f"Ridge Test: {ridge_rmse_test:.4f}")
print(f"\nRidgeCV Train: {ridgecv_rmse_train:.4f}")
print(f"RidgeCV Test: {ridgecv_rmse_test:.4f}")
print(f"\nLassoCV Train: {lassocv_rmse_train:.4f}")
print(f"LassoCV Test: {lassocv_rmse_test:.4f}")
print(f"\nRF Train: {rf_rmse_train:.4f}")
print(f"RF Test: {rf_rmse_test:.4f}")
print(f"\nRF Filtered Train: {rf_filtered_rmse_train:.4f}")
print(f"RF Filtered Test: {rf_filtered_rmse_test:.4f}")

SVM Train: 5.3678
SVM Test: 7.2435

XGBoost Train: 0.2130
XGBoost Test: 6.6148

KNN Train: 9.1460
KNN Test: 12.5296

DT Train: 0.0000
DT Test: 9.0706

Bagging Train: 3.5750
Bagging Test: 7.0812

Linear Regression Train: 5.1814
Linear Regression Test: 6.7242

Poly LR Train: 3.7012
Poly LR Test: 15.7480

Ridge Train: 5.4523
Ridge Test: 6.9659

RidgeCV Train: 5.1930
RidgeCV Test: 6.7328

LassoCV Train: 5.3506
LassoCV Test: 6.9606

RF Train: 3.4072
RF Test: 6.9926

RF Filtered Train: 3.2587
RF Filtered Test: 6.9772


XGBoost had the lowest test RMSE (6.61), but the large gap between its training RMSE (0.21) and its test RMSE suggests it is overfitting.

Linear Regression (6.72) and RidgeCV (6.73) had the next-lowest test RMSE. I will submit predictions from these three models.
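
The scores printed above can also be collected into a table and sorted by test RMSE, which makes the ranking easier to read. The values below are copied from the output above; the variable name `results` is illustrative.

```python
import pandas as pd

# RMSE values copied from the comparison output above
results = pd.DataFrame(
    {
        "model": ["SVM", "XGBoost", "KNN", "DT", "Bagging", "Linear Regression",
                  "Poly LR", "Ridge", "RidgeCV", "LassoCV", "RF", "RF Filtered"],
        "train_rmse": [5.3678, 0.2130, 9.1460, 0.0000, 3.5750, 5.1814,
                       3.7012, 5.4523, 5.1930, 5.3506, 3.4072, 3.2587],
        "test_rmse": [7.2435, 6.6148, 12.5296, 9.0706, 7.0812, 6.7242,
                      15.7480, 6.9659, 6.7328, 6.9606, 6.9926, 6.9772],
    }
)

# Gap between test and train error as a rough overfitting indicator
results["gap"] = results["test_rmse"] - results["train_rmse"]
results = results.sort_values("test_rmse").reset_index(drop=True)
print(results)
```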

## Submissions

In [None]:
###### XGB Submission ######

# Scale the real test data using statistics from the training split (do not refit on test data)
test = StandardScaler().fit(X_train).transform(final_test)
# Predict
pred = xgb_model.predict(test)

# Make submission df
# Create list of id
ids = list(final_test.id)

# Create list of DIC preds
dic = list(pred)

# Create submission df
xgboost_submission_df = pd.DataFrame(zip(ids, dic), columns=['id', 'DIC'])
xgboost_submission_df.to_csv('xgboost_submission.csv', index=False)

In [58]:
###### Linear Regression Submission ######

# Scale the real test data using statistics from the training split (do not refit on test data)
test = StandardScaler().fit(X_train).transform(final_test)
# Predict
pred = lr.predict(test)

# Make submission df
# Create list of id
ids = list(final_test.id)

# Create list of DIC preds
dic = list(pred)

# Create submission df
lr_df = pd.DataFrame(zip(ids, dic), columns=['id', 'DIC'])
lr_df.to_csv('lr-marina.csv', index=False)



In [None]:
###### RidgeCV Submission ######

# Scale the real test data using statistics from the training split (do not refit on test data)
test = StandardScaler().fit(X_train).transform(final_test)
# Predict
pred = ridge_cv.predict(test)

# Make submission df
# Create list of id
ids = list(final_test.id)

# Create list of DIC preds
dic = list(pred)

# Create submission df
ridgecv_df = pd.DataFrame(zip(ids, dic), columns=['id', 'DIC'])
ridgecv_df.to_csv('ridgecv.csv', index=False)
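
The three submission cells share the same steps, so they could be folded into one helper. The function below is a hypothetical sketch (`write_submission` is not part of the original notebook); it assumes a scaler that was already fit on training data and only calls `transform` on the test set. It is demonstrated here on synthetic data with made-up column names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def write_submission(model, fitted_scaler, test_df, path):
    """Scale test_df with an already-fitted scaler, predict, and save id/DIC pairs."""
    scaled = fitted_scaler.transform(test_df)
    submission = pd.DataFrame(
        {"id": test_df["id"].to_numpy(), "DIC": model.predict(scaled)}
    )
    submission.to_csv(path, index=False)
    return submission


# Synthetic demonstration (assumption: x1 and x2 stand in for the real features)
rng = np.random.default_rng(808)
train_demo = pd.DataFrame(
    {"id": np.arange(50), "x1": rng.normal(size=50), "x2": rng.normal(size=50)}
)
y_demo = 2000 + 10 * train_demo["x1"] - 5 * train_demo["x2"]
test_demo = pd.DataFrame(
    {"id": np.arange(50, 60), "x1": rng.normal(size=10), "x2": rng.normal(size=10)}
)

# Fit the scaler and model on training data only, then reuse both for the test set
scaler_demo = StandardScaler().fit(train_demo)
model_demo = LinearRegression().fit(scaler_demo.transform(train_demo), y_demo)
submission_demo = write_submission(model_demo, scaler_demo, test_demo, "demo_submission.csv")
print(submission_demo.head())
```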