# Predict Podcast Listening Time
By Josh Houlding

<b>Competition Page:</b> [https://www.kaggle.com/competitions/playground-series-s5e4/overview](https://www.kaggle.com/competitions/playground-series-s5e4/overview)

This notebook is an AutoML solution to the April 2025 Kaggle competition on predicting a podcast episode's listening time from various episode features, implemented with the H2O library.

In [9]:
# Install H2O AutoML library
#pip install h2o

In [10]:
import h2o
import multiprocessing

# Get the number of CPU cores
num_cores = multiprocessing.cpu_count()

# Set the desired CPU percentage (e.g., 50%)
cpu_percentage = 0.5

# Calculate the number of threads to use (at least 1, even on a single-core machine)
desired_threads = max(1, int(num_cores * cpu_percentage))

# Initialize H2O with the specified number of threads
h2o.init(nthreads=desired_threads)

# Load data
train = h2o.import_file("train.csv")
#train_sample = train[0:10000] # Take sample of training set for better performance
test = h2o.import_file("test.csv")

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; OpenJDK 64-Bit Server VM (Temurin)(build 25.432-b06, mixed mode)
  Starting server from C:\Users\jdh10\anaconda3\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\jdh10\AppData\Local\Temp\tmpfdjx9qm9
  JVM stdout: C:\Users\jdh10\AppData\Local\Temp\tmpfdjx9qm9\h2o_jdh10_started_from_python.out
  JVM stderr: C:\Users\jdh10\AppData\Local\Temp\tmpfdjx9qm9\h2o_jdh10_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,05 secs
H2O_cluster_timezone:,America/Los_Angeles
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.7
H2O_cluster_version_age:,6 days
H2O_cluster_name:,H2O_from_python_jdh10_epu0pu
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,5.304 Gb
H2O_cluster_total_cores:,16
H2O_cluster_allowed_cores:,8


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [11]:
# Identify label and features
y = "Listening_Time_minutes"
x = train.columns
x.remove(y) # Get only features
x.remove("id") # ID is not useful for prediction
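
The cell above builds the feature list by removing entries from the list returned by `train.columns`. A minimal sketch of an equivalent selection that leaves the source list untouched is shown here; the `columns` list is a stand-in for `train.columns`, and the names `feature_a` and `feature_b` are hypothetical placeholders, not the dataset's real column names.

```python
# Stand-in for train.columns; "feature_a" and "feature_b" are hypothetical placeholders
columns = ["id", "feature_a", "feature_b", "Listening_Time_minutes"]

y = "Listening_Time_minutes"
excluded = {y, "id"}  # the label and the row identifier are not features

# Keep every column that is not excluded, preserving the original order
x = [col for col in columns if col not in excluded]
print(x)
```

Collecting the excluded names in one set keeps the rule in a single place if more columns need to be dropped later.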

In [12]:
# Install libraries for multithreaded conversion of H2O dataframes to Pandas dataframes
# !pip install polars
# !pip install pyarrow

In [13]:
from h2o.automl import H2OAutoML
import pandas as pd

# Train AutoML
aml = H2OAutoML(max_runtime_secs=10, seed=1)  # 10-second budget for a quick run; raise it to let AutoML train more models
aml.train(x=x, y=y, training_frame=train)

# View leaderboard
lb = aml.leaderboard
print(lb)

# Get best model
best_model = aml.leader

# Make predictions on test data
predictions = best_model.predict(test)

# Format predictions for submission
predicted_values = predictions["predict"].as_data_frame(use_multi_thread=True)
ids = test["id"].as_data_frame(use_multi_thread=True)
submission = pd.concat([ids, predicted_values], axis=1)
submission.columns = ["id", "Listening_Time_minutes"]

# Save the submission DataFrame to a CSV file
submission.to_csv("submission.csv", index=False)

AutoML progress: |
21:08:17.475: AutoML: XGBoost is not available; skipping it.

███████████████████████████████████████████████████████████████| (done) 100%
model_id                                                    rmse      mse       mae       rmsle    mean_residual_deviance
StackedEnsemble_BestOfFamily_1_AutoML_1_20250402_210816  13.0844  171.201   9.56584  nan                          171.201
GLM_1_AutoML_1_20250402_210816                           13.3549  178.355   9.80337  nan                          178.355
GBM_1_AutoML_1_20250402_210816                           14.6739  215.323  11.3383     0.578994                   215.323
DRF_1_AutoML_1_20250402_210816                           17.4512  304.543  12.3502     0.535024                   304.543
[4 rows x 6 columns]

stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%
Export File progress: |██████████████████████████████████████████████████████████| (done) 100%
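
The leaderboard columns are related in simple ways: `rmse` is the square root of `mse`, and `mae` averages the absolute errors instead of the squared ones. The sketch below computes the three metrics in plain Python on a small made-up example; the `regression_metrics` helper and the toy values are illustrative assumptions, not part of H2O's API or the competition data.

```python
import math


def regression_metrics(actual, predicted):
    """Return (mse, rmse, mae) for two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)   # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
    return mse, math.sqrt(mse), mae


# Toy values, made up for illustration only
actual = [30.0, 45.0, 60.0, 75.0]
predicted = [33.0, 41.0, 60.0, 80.0]

mse, rmse, mae = regression_metrics(actual, predicted)
print(f"mse={mse:.3f} rmse={rmse:.3f} mae={mae:.3f}")
```

Applying the same relation to the top row of the leaderboard, the square root of 171.201 is about 13.084, which matches the reported `rmse`.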

In [14]:
# Preview a random sample of the submission
submission.sample(5, random_state=42)

,id,Listening_Time_minutes
38683,788683,54.6937
64939,814939,49.46725
3954,753954,79.86723
120374,870374,64.671026
172861,922861,68.857285
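
Before uploading, it can be worth checking the structure of the submission. The sketch below is one possible check in plain Python, run against the five sampled rows shown above rather than the real `submission.csv`; the `validate_submission` helper and the rules it enforces (exact header, unique ids, finite predictions) are assumptions for illustration, not requirements stated by the competition.

```python
import csv
import io
import math

EXPECTED_HEADER = ["id", "Listening_Time_minutes"]


def validate_submission(csv_text):
    """Return a list of problems found in submission CSV text (empty if none)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]

    problems = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields, got {len(row)}")
            continue
        row_id, value = row
        if row_id in seen_ids:
            problems.append(f"line {line_no}: duplicate id {row_id}")
        seen_ids.add(row_id)
        try:
            prediction = float(value)
        except ValueError:
            problems.append(f"line {line_no}: non-numeric prediction {value!r}")
            continue
        if not math.isfinite(prediction):
            problems.append(f"line {line_no}: non-finite prediction {value!r}")
    return problems


# The five sampled rows from the output above, without the index column
sample = """id,Listening_Time_minutes
788683,54.6937
814939,49.46725
753954,79.86723
870374,64.671026
922861,68.857285
"""
print(validate_submission(sample))
```

To check the real file, the same function could be called on the text read from `submission.csv`.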


In [15]:
# Shut down H2O
h2o.cluster().shutdown()

H2O session _sid_9c61 closed.
