# Import Libraries

In [24]:
import os

import numpy as np
import pandas as pd
import sklearn
# `kaggle` is imported further down, once the API credentials file is in place

### Using Kaggle data on your own machine

Kaggle limits your weekly time using a GPU machine. The limits are very generous, but you may well still find it's not enough! In that case, you'll want to use your own GPU server, or a cloud server such as Colab, Paperspace Gradient, or SageMaker Studio Lab (all of which have free options). To do so, you'll need to be able to download Kaggle datasets.

The easiest way to download Kaggle datasets is to use the Kaggle API. You can install this using `pip` by running this in a notebook cell:

    !pip install kaggle

You need an API key to use the Kaggle API. To get one, click your profile picture on the Kaggle website, choose My Account, then click Create New API Token. This saves a file called *kaggle.json* to your PC. You then need to copy this key to your GPU server: open the file you downloaded, copy its contents, and paste them into the following cell (e.g., `creds = '{"username":"xxx","key":"xxx"}'`):

In [25]:

creds = ''  # paste the contents of kaggle.json between the quotes

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

Then execute this cell (this only needs to be run once):

In [26]:
# for working with paths in Python, I recommend using `pathlib.Path`
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
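
A slightly more defensive variant of the cell above, as a sketch: it checks that the pasted string is valid JSON with `username` and `key` fields before writing it. The helper name `write_kaggle_creds` and the temporary demo directory are hypothetical and only there for illustration; they are not part of the Kaggle API.

```python
import json
import tempfile
from pathlib import Path


def write_kaggle_creds(creds: str, cred_path: Path) -> bool:
    """Write `creds` to `cred_path` unless the file already exists.

    Returns True if a file was written, False if it was already there.
    (Hypothetical helper, not part of the Kaggle API.)
    """
    if cred_path.exists():
        return False
    parsed = json.loads(creds)  # raises ValueError on malformed JSON
    missing = {"username", "key"} - set(parsed)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)  # readable and writable by the owner only
    return True


# Demo against a throwaway directory rather than the real ~/.kaggle
demo_dir = Path(tempfile.mkdtemp())
demo_path = demo_dir / ".kaggle" / "kaggle.json"
first = write_kaggle_creds('{"username":"xxx","key":"xxx"}', demo_path)
second = write_kaggle_creds('{"username":"xxx","key":"xxx"}', demo_path)
print(first, second)
```

The second call leaves the existing file alone, which mirrors the "only needs to be run once" behaviour of the original cell.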

Now you can download datasets from Kaggle.

In [27]:
path = Path('playground-series-s4e5')

And use the Kaggle API to download the dataset to that path, and extract it:

In [28]:
if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

Note that you can easily download notebooks from Kaggle and upload them to other cloud services. So if you're low on Kaggle GPU credits, give this a try!

## Import and EDA

In [29]:
if iskaggle:
    path = Path('../input/playground-series-s4e5')

This competition's data is tabular: one sample per row in a [CSV file](https://realpython.com/python-csv/), with one column per feature plus, in the training set, the `FloodProbability` target.

Let's look at our data and see what we've got. In Jupyter you can run any bash/shell command by starting a line with a `!`, and use `{}` to include Python variables, like so:

In [30]:
!ls {path}

'ls' is not recognized as an internal or external command,
operable program or batch file.
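
The message above is what the Windows command prompt prints when `ls` is unavailable. A portable alternative, sketched here with `pathlib`, lists the folder from Python instead of the shell. The helper name `list_files` is hypothetical, and the demo uses a throwaway folder with empty placeholder files named after the two CSVs read later in this notebook.

```python
import tempfile
from pathlib import Path


def list_files(folder: Path) -> list[str]:
    """Return the sorted names of the entries directly inside `folder`."""
    return sorted(p.name for p in folder.iterdir())


# Demo on a throwaway folder; the empty files only stand in for the real data
demo = Path(tempfile.mkdtemp())
for name in ("train.csv", "test.csv"):
    (demo / name).touch()

print(list_files(demo))
```

In the notebook itself, `list_files(path)` would play the role of `!ls {path}` on any operating system.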


It looks like this competition uses CSV files. For opening, manipulating, and viewing CSV files, it's generally best to use the Pandas library, which is explained brilliantly in [this book](https://wesmckinney.com/book/) by the lead developer (it's also an excellent introduction to matplotlib and numpy, both of which I use in this notebook). Generally it's imported as the abbreviation `pd`.

Let's load the training and test sets:

In [31]:
train = pd.read_csv(path/'train.csv')
test = pd.read_csv(path/'test.csv')

This creates a [DataFrame](https://pandas.pydata.org/docs/user_guide/10min.html), which is a table of named columns, a bit like a database table. To view the first and last rows, and row count of a DataFrame, just type its name:

In [32]:
train

Unnamed: 0,id,MonsoonIntensity,TopographyDrainage,RiverManagement,Deforestation,Urbanization,ClimateChange,DamsQuality,Siltation,AgriculturalPractices,...,DrainageSystems,CoastalVulnerability,Landslides,Watersheds,DeterioratingInfrastructure,PopulationScore,WetlandLoss,InadequatePlanning,PoliticalFactors,FloodProbability
0,0,5,8,5,8,6,4,4,3,3,...,5,3,3,5,4,7,5,7,3,0.445
1,1,6,7,4,4,8,8,3,5,4,...,7,2,0,3,5,3,3,4,3,0.450
2,2,6,5,6,7,3,7,1,5,4,...,7,3,7,5,6,8,2,3,3,0.530
3,3,3,4,6,5,4,8,4,7,6,...,2,4,7,4,4,6,5,7,5,0.535
4,4,5,3,2,6,4,4,3,3,3,...,2,2,6,6,4,1,2,3,5,0.415
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1117952,1117952,3,3,4,10,4,5,5,7,10,...,7,8,7,2,2,1,4,6,4,0.495
1117953,1117953,2,2,4,3,9,5,8,1,3,...,9,4,4,3,7,4,9,4,5,0.480
1117954,1117954,7,3,9,4,6,5,9,1,3,...,5,5,5,5,6,5,5,2,4,0.485
1117955,1117955,7,3,3,7,5,2,3,4,6,...,6,8,5,3,4,6,7,6,4,0.495
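
The exported table above does not show the row and column counts. As a small sketch, `shape`, `head()` and `tail()` give the same information explicitly. The toy frame below reuses the first four rows of three columns from the table; the name `toy` is an assumption for illustration.

```python
import pandas as pd

# Toy frame: first four rows of three columns copied from the table above
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "MonsoonIntensity": [5, 6, 6, 3],
    "FloodProbability": [0.445, 0.450, 0.530, 0.535],
})

n_rows, n_cols = toy.shape   # (rows, columns)
first_two = toy.head(2)      # first two rows
last_two = toy.tail(2)       # last two rows

print(n_rows, n_cols)
print(first_two)
print(last_two)
```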


It's important to carefully read the [dataset description](https://www.kaggle.com/competitions/playground-series-s4e5/data) to understand how each of these columns is used.

One of the most useful features of `DataFrame` is the `describe()` method:

In [33]:
train.describe(include='all')

Unnamed: 0,id,MonsoonIntensity,TopographyDrainage,RiverManagement,Deforestation,Urbanization,ClimateChange,DamsQuality,Siltation,AgriculturalPractices,...,DrainageSystems,CoastalVulnerability,Landslides,Watersheds,DeterioratingInfrastructure,PopulationScore,WetlandLoss,InadequatePlanning,PoliticalFactors,FloodProbability
count,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,...,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0,1117957.0
mean,558978.0,4.92145,4.926671,4.955322,4.94224,4.942517,4.934093,4.955878,4.927791,4.942619,...,4.946893,4.953999,4.931376,4.929032,4.925907,4.92752,4.950859,4.940587,4.939004,0.5044803
std,322726.5,2.056387,2.093879,2.072186,2.051689,2.083391,2.057742,2.083063,2.065992,2.068545,...,2.072333,2.088899,2.078287,2.082395,2.064813,2.074176,2.068696,2.081123,2.09035,0.0510261
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285
25%,279489.0,3.0,3.0,4.0,4.0,3.0,3.0,4.0,3.0,3.0,...,4.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0,0.47
50%,558978.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,0.505
75%,838467.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,...,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,0.54
max,1117956.0,16.0,18.0,16.0,17.0,17.0,17.0,16.0,16.0,16.0,...,17.0,17.0,16.0,16.0,17.0,18.0,19.0,16.0,16.0,0.725


In [34]:
test.describe(include='all')

Unnamed: 0,id,MonsoonIntensity,TopographyDrainage,RiverManagement,Deforestation,Urbanization,ClimateChange,DamsQuality,Siltation,AgriculturalPractices,...,IneffectiveDisasterPreparedness,DrainageSystems,CoastalVulnerability,Landslides,Watersheds,DeterioratingInfrastructure,PopulationScore,WetlandLoss,InadequatePlanning,PoliticalFactors
count,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,...,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0,745305.0
mean,1490609.0,4.91561,4.930288,4.960027,4.946084,4.938424,4.933524,4.958468,4.927651,4.945308,...,4.947436,4.944003,4.957209,4.92762,4.93072,4.926062,4.926957,4.948424,4.940204,4.943918
std,215151.2,2.056295,2.094117,2.071722,2.052602,2.081816,2.059243,2.089312,2.06811,2.073404,...,2.081322,2.072335,2.088787,2.079006,2.083348,2.065638,2.073692,2.065891,2.079128,2.087387
min,1117957.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1304283.0,3.0,3.0,4.0,4.0,3.0,3.0,4.0,3.0,3.0,...,3.0,4.0,3.0,3.0,3.0,3.0,3.0,4.0,3.0,3.0
50%,1490609.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
75%,1676935.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,...,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0
max,1863261.0,16.0,17.0,16.0,17.0,17.0,17.0,16.0,16.0,16.0,...,16.0,17.0,17.0,16.0,16.0,17.0,19.0,22.0,16.0,16.0
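
One thing the two `describe()` outputs make visible is that a few test columns reach values never seen in training. The sketch below compares the `max` rows for three columns, with the numbers copied by hand from the tables above; on the real frames the same dictionaries could be built from `train.max()` and `test.max()`.

```python
# `max` values copied from the describe() outputs above (three columns only)
train_max = {"TopographyDrainage": 18, "PopulationScore": 18, "WetlandLoss": 19}
test_max = {"TopographyDrainage": 17, "PopulationScore": 19, "WetlandLoss": 22}

# Columns where the test set goes beyond the largest value seen in training
exceeds = {
    col: (train_max[col], test_max[col])
    for col in train_max
    if test_max[col] > train_max[col]
}

for col, (tr, te) in exceeds.items():
    print(f"{col}: train max {tr}, test max {te}")
```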


## Explanation of columns
- Target value: FloodProbability

In [35]:
target_column = train['FloodProbability']
train['id'] = train['id'] + 1

In [36]:
categorical_columns = train.columns.drop(['FloodProbability'])
train[categorical_columns] = train[categorical_columns].astype('category')
test[categorical_columns] = test[categorical_columns].astype('category')
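
Because `train` and `test` are cast separately, pandas infers each frame's category set from its own values, so the two sets can differ. A minimal sketch of one way to keep them aligned is a shared `pd.CategoricalDtype`. The toy frames and variable names below are assumptions for illustration, and whether this matters in practice depends on how the downstream library handles unseen levels.

```python
import pandas as pd

# Toy frames (made-up rows): the test column contains a level, 22, unseen in train
toy_train = pd.DataFrame({"WetlandLoss": [3, 5, 19]})
toy_test = pd.DataFrame({"WetlandLoss": [5, 22]})

# Casting independently infers a different category set for each frame
independent_train = toy_train["WetlandLoss"].astype("category")
independent_test = toy_test["WetlandLoss"].astype("category")
print(list(independent_train.cat.categories))
print(list(independent_test.cat.categories))

# Casting with one shared dtype keeps the category sets identical
levels = sorted(set(toy_train["WetlandLoss"]) | set(toy_test["WetlandLoss"]))
shared = pd.CategoricalDtype(categories=levels)
shared_train = toy_train["WetlandLoss"].astype(shared)
shared_test = toy_test["WetlandLoss"].astype(shared)
print(list(shared_train.cat.categories) == list(shared_test.cat.categories))
```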


In [37]:
from sklearn.model_selection import train_test_split

selected_columns = train.columns.drop(['FloodProbability'])
X = train[selected_columns]
y = train['FloodProbability']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [38]:
print('Shape of train:', X_train.shape, y_train.shape)
print('Shape of test:', X_test.shape, y_test.shape)

Shape of train: (894365, 21) (894365,)
Shape of test: (223592, 21) (223592,)
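
As a sanity check on those shapes, the split sizes can be reproduced by hand. This sketch assumes the hold-out size is `test_size * n_samples` rounded up, which matches the numbers printed above.

```python
import math

n_samples = 1_117_957   # rows in train, from the describe() output
test_size = 0.2

n_holdout = math.ceil(test_size * n_samples)   # hold-out rows, rounded up
n_fit = n_samples - n_holdout                  # remaining rows used for fitting

print(n_fit, n_holdout)
```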


In [None]:
import optuna
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
# Create a study that minimizes the objective (here, the hold-out MSE)
study = optuna.create_study(direction="minimize")

# Define the objective function
def objective(trial):
  # Set the hyperparameter values
  learning_rate = trial.suggest_float("learning_rate", 0.01, 0.1)
  num_leaves = trial.suggest_int("num_leaves", 2, 256)
  max_depth = trial.suggest_int("max_depth", -1, 50)
  min_child_samples = trial.suggest_int("min_child_samples", 5, 100)
  # LightGBM only applies `subsample` when `subsample_freq` is greater than 0
  subsample = trial.suggest_float("subsample", 0.5, 1.0)
  colsample_bytree = trial.suggest_float("colsample_bytree", 0.5, 1.0)
  n_estimators = trial.suggest_int("n_estimators", 64, 1024)

  # Create and train the model
  model = lgb.LGBMRegressor(
    learning_rate=learning_rate,
    num_leaves=num_leaves,
    max_depth=max_depth,
    min_child_samples=min_child_samples,
    subsample=subsample,
    colsample_bytree=colsample_bytree,
    n_estimators=n_estimators,
    random_state=42
  )
  model.fit(X_train, y_train)

  # Evaluate the model and return the metric
  y_pred = model.predict(X_test)
  mse = mean_squared_error(y_test, y_pred)
  return mse

# Run the study and examine the results
study.optimize(objective, n_trials=20)
print("Best trial:")
print(" Value: {}".format(study.best_trial.value))
print(" Params: {}".format(study.best_trial.params))

[I 2024-05-25 13:25:44,947] A new study created in memory with name: no-name-a1bd0186-de9d-4108-b970-470361e1b91e


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.071957 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:26:41,238] Trial 0 finished with value: 0.00041807379081809697 and parameters: {'learning_rate': 0.04634997859101308, 'num_leaves': 44, 'max_depth': 42, 'min_child_samples': 18, 'subsample': 0.8729351996364592, 'colsample_bytree': 0.7822918769781728, 'n_estimators': 717}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.032651 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:27:26,704] Trial 1 finished with value: 0.0004366566969048873 and parameters: {'learning_rate': 0.0446660788261972, 'num_leaves': 253, 'max_depth': 38, 'min_child_samples': 53, 'subsample': 0.7240015660903805, 'colsample_bytree': 0.9779316466038285, 'n_estimators': 313}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.076656 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:28:13,716] Trial 2 finished with value: 0.00042827304240965864 and parameters: {'learning_rate': 0.06183833860951176, 'num_leaves': 193, 'max_depth': 35, 'min_child_samples': 11, 'subsample': 0.6860464771172161, 'colsample_bytree': 0.7107328953580572, 'n_estimators': 437}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.085300 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:28:52,062] Trial 3 finished with value: 0.0004294345393978112 and parameters: {'learning_rate': 0.06940173422098746, 'num_leaves': 169, 'max_depth': 12, 'min_child_samples': 58, 'subsample': 0.6900216556829788, 'colsample_bytree': 0.6804918969032419, 'n_estimators': 487}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.019924 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:29:26,473] Trial 4 finished with value: 0.0004299806701821142 and parameters: {'learning_rate': 0.09324732463029146, 'num_leaves': 90, 'max_depth': 18, 'min_child_samples': 67, 'subsample': 0.5407076415582814, 'colsample_bytree': 0.5447103273942675, 'n_estimators': 550}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.082097 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:29:43,183] Trial 5 finished with value: 0.0012364123292321473 and parameters: {'learning_rate': 0.01856794286579284, 'num_leaves': 255, 'max_depth': -1, 'min_child_samples': 58, 'subsample': 0.5776445325647115, 'colsample_bytree': 0.6255473896714684, 'n_estimators': 125}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.080792 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:29:50,101] Trial 6 finished with value: 0.0009261501894996605 and parameters: {'learning_rate': 0.0821727190699555, 'num_leaves': 171, 'max_depth': 2, 'min_child_samples': 47, 'subsample': 0.7052882599984518, 'colsample_bytree': 0.6886460652674287, 'n_estimators': 207}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.038882 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:30:06,964] Trial 7 finished with value: 0.0004718394820951608 and parameters: {'learning_rate': 0.08956775506150669, 'num_leaves': 169, 'max_depth': 2, 'min_child_samples': 34, 'subsample': 0.8662816520533103, 'colsample_bytree': 0.892800190309152, 'n_estimators': 543}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.081322 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:30:18,093] Trial 8 finished with value: 0.0010771249925268988 and parameters: {'learning_rate': 0.03961366426226605, 'num_leaves': 32, 'max_depth': 2, 'min_child_samples': 42, 'subsample': 0.6545806059200546, 'colsample_bytree': 0.8479618763713024, 'n_estimators': 344}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.018000 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:30:52,598] Trial 9 finished with value: 0.000425064117189221 and parameters: {'learning_rate': 0.09294516737127564, 'num_leaves': 41, 'max_depth': 39, 'min_child_samples': 38, 'subsample': 0.8128073849848045, 'colsample_bytree': 0.5525947624704192, 'n_estimators': 763}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.077820 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:31:17,348] Trial 10 finished with value: 0.0008616863574461962 and parameters: {'learning_rate': 0.023574590127920007, 'num_leaves': 3, 'max_depth': 49, 'min_child_samples': 87, 'subsample': 0.9649698871467792, 'colsample_bytree': 0.8113729632713232, 'n_estimators': 962}. Best is trial 0 with value: 0.00041807379081809697.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.016977 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:32:02,851] Trial 11 finished with value: 0.00041779817307033285 and parameters: {'learning_rate': 0.04750784209715817, 'num_leaves': 72, 'max_depth': 36, 'min_child_samples': 17, 'subsample': 0.8428032129760529, 'colsample_bytree': 0.5098830359430316, 'n_estimators': 784}. Best is trial 11 with value: 0.00041779817307033285.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.101787 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:32:56,408] Trial 12 finished with value: 0.00042084708900307733 and parameters: {'learning_rate': 0.04589851084666564, 'num_leaves': 91, 'max_depth': 29, 'min_child_samples': 6, 'subsample': 0.9106254086928621, 'colsample_bytree': 0.7835532670267296, 'n_estimators': 755}. Best is trial 11 with value: 0.00041779817307033285.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.017431 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:33:48,921] Trial 13 finished with value: 0.0004164518557891902 and parameters: {'learning_rate': 0.032336622012503397, 'num_leaves': 85, 'max_depth': 48, 'min_child_samples': 22, 'subsample': 0.8094141096220654, 'colsample_bytree': 0.5085884937300454, 'n_estimators': 739}. Best is trial 13 with value: 0.0004164518557891902.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.071202 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:35:01,244] Trial 14 finished with value: 0.00041540732285444415 and parameters: {'learning_rate': 0.03013007459364077, 'num_leaves': 96, 'max_depth': 49, 'min_child_samples': 24, 'subsample': 0.7909081820551932, 'colsample_bytree': 0.5896370155655485, 'n_estimators': 973}. Best is trial 14 with value: 0.00041540732285444415.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.018185 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:36:23,291] Trial 15 finished with value: 0.0004159689872469134 and parameters: {'learning_rate': 0.030868647120986987, 'num_leaves': 126, 'max_depth': 48, 'min_child_samples': 27, 'subsample': 0.7918710586241333, 'colsample_bytree': 0.5970574426595995, 'n_estimators': 994}. Best is trial 14 with value: 0.00041540732285444415.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.103456 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:38:06,322] Trial 16 finished with value: 0.0004272044656603823 and parameters: {'learning_rate': 0.014873926551318604, 'num_leaves': 130, 'max_depth': 25, 'min_child_samples': 29, 'subsample': 0.7591579981934622, 'colsample_bytree': 0.636531972228447, 'n_estimators': 998}. Best is trial 14 with value: 0.00041540732285444415.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.021269 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:39:21,045] Trial 17 finished with value: 0.0004166214781593593 and parameters: {'learning_rate': 0.030632920088536113, 'num_leaves': 126, 'max_depth': 45, 'min_child_samples': 72, 'subsample': 0.7620108025246923, 'colsample_bytree': 0.6121344864703552, 'n_estimators': 873}. Best is trial 14 with value: 0.00041540732285444415.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.064549 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:40:26,818] Trial 18 finished with value: 0.00041654771622057934 and parameters: {'learning_rate': 0.029782026415673114, 'num_leaves': 127, 'max_depth': 32, 'min_child_samples': 28, 'subsample': 0.9803364762515061, 'colsample_bytree': 0.5813287440061443, 'n_estimators': 888}. Best is trial 14 with value: 0.00041540732285444415.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.089840 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 347
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 20
[LightGBM] [Info] Start training from score 0.504480


[I 2024-05-25 13:42:23,424] Trial 19 finished with value: 0.00045679230685414786 and parameters: {'learning_rate': 0.011534249640746584, 'num_leaves': 212, 'max_depth': 50, 'min_child_samples': 100, 'subsample': 0.6183299501394925, 'colsample_bytree': 0.7391792556965442, 'n_estimators': 895}. Best is trial 14 with value: 0.00041540732285444415.


Best trial:
 Value: 0.00041540732285444415
 Params: {'learning_rate': 0.03013007459364077, 'num_leaves': 96, 'max_depth': 49, 'min_child_samples': 24, 'subsample': 0.7909081820551932, 'colsample_bytree': 0.5896370155655485, 'n_estimators': 973}
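
To put the best MSE on a more readable scale, here is a rough sketch that converts it to an RMSE and to an approximate R-squared. The R-squared figure is only an estimate: it uses the standard deviation of `FloodProbability` over the whole training set (from `train.describe()`), not the variance of the hold-out split.

```python
import math

best_mse = 0.00041540732285444415   # best trial value reported above
target_std = 0.0510261              # std of FloodProbability from train.describe()

rmse = math.sqrt(best_mse)
# Rough estimate only: full-train std, not the hold-out split's variance
approx_r2 = 1 - best_mse / target_std**2

print(f"RMSE ~ {rmse:.4f}, approximate R^2 ~ {approx_r2:.3f}")
```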



In [40]:
print('Shape of X_train:', X_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of y_test:', y_test.shape)
print('Shape of test:', test.shape)


Shape of X_train: (894365, 21)
Shape of X_test: (223592, 21)
Shape of y_train: (894365,)
Shape of y_test: (223592,)
Shape of test: (745305, 21)


In [42]:
import lightgbm as lgb

# Set the best hyperparameters
best_params = {
    'learning_rate': 0.03013007459364077,
    'num_leaves': 96,
    'max_depth': 49,
    'min_child_samples': 24,
    'subsample': 0.7909081820551932,
    'colsample_bytree': 0.5896370155655485,
    'n_estimators': 973
}

# Create the LightGBM model with the best hyperparameters
model = lgb.LGBMRegressor(**best_params)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the competition test set (not the hold-out split `X_test`)
y_pred = model.predict(test)




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.018776 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 352
[LightGBM] [Info] Number of data points in the train set: 894365, number of used features: 21
[LightGBM] [Info] Start training from score 0.504480


In [43]:
# Create a dataframe with the predicted values
submission_df = pd.DataFrame({'id': test['id'], 'FloodProbability': y_pred})

# Save the dataframe to a CSV file
submission_df.to_csv('submission.csv', index=False)

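# Hold-out evaluation (not working yet)

#### The cell below fails as written: `y_pred` was overwritten with predictions for the competition test set (745,305 rows) to build the submission file, while `y_test` is the hold-out split (223,592 rows).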

In [44]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate root mean squared error
rmse = np.sqrt(mse)

# Calculate mean absolute error
mae = mean_absolute_error(y_test, y_pred)

# Calculate R-squared
r2 = r2_score(y_test, y_pred)

mse, rmse, mae, r2


ValueError: Found input variables with inconsistent numbers of samples: [223592, 745305]
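
The `ValueError` comes from comparing arrays of different lengths. A minimal sketch of the intended evaluation follows, assuming the predictions are made on the hold-out features (in this notebook, something like `model.predict(X_test)`) so that they line up with `y_test`. To stay standalone, the metrics are hand-rolled with numpy rather than taken from `sklearn.metrics`, and the numbers are made up.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return (mse, rmse, mae, r2) for two equal-length 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"length mismatch: {y_true.shape} vs {y_pred.shape}")
    err = y_true - y_pred
    mse = float(np.mean(err**2))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err**2))                          # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, float(np.sqrt(mse)), mae, r2


# Made-up numbers standing in for a hold-out target and predictions on it
holdout_y = [0.45, 0.50, 0.55, 0.60]
holdout_pred = [0.46, 0.49, 0.55, 0.58]
mse, rmse, mae, r2 = regression_report(holdout_y, holdout_pred)
print(mse, rmse, mae, r2)
```

Keeping the hold-out predictions in their own variable, separate from the submission predictions, avoids the overwrite that caused the error.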

In [None]:
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def objective(trial):
  """
  Objective function to be optimized by Optuna.

  Args:
      trial: Optuna trial object for suggesting hyperparameters.

  Returns:
      The negated mean cross-validation R-squared of the Random Forest model
      (negated because the study below minimizes the objective).
  """

  # Suggest hyperparameters using distributions or specific values
  # step=32 divides the 64..128 range evenly, giving 64, 96, or 128
  n_estimators = trial.suggest_int("n_estimators", 64, 128, step=32)
  max_depth = trial.suggest_int("max_depth", 2, 10, step=2)
  min_samples_split = trial.suggest_int("min_samples_split", 2, 10, step=2)
  min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 5, step=2)

  # Create a Random Forest regressor with suggested hyperparameters
  model = RandomForestRegressor(n_estimators=n_estimators,
                                 max_depth=max_depth,
                                 min_samples_split=min_samples_split,
                                 min_samples_leaf=min_samples_leaf)

  # Perform faster cross-validation with fewer folds (adjust as needed)
  score = cross_val_score(model, X_train, y_train, cv=3, scoring="r2")

  # Return the negative score (Optuna minimizes the objective function)
  return -score.mean()  # Minimize negative R-squared (maximize R-squared)

# Create an Optuna study to manage the hyperparameter search
study = optuna.create_study(direction="minimize")

# Optimize hyperparameters with fewer trials initially (adjust as needed)
study.optimize(objective, n_trials=20)

# Get the best trial with suggested hyperparameters
best_trial = study.best_trial

# Print the best hyperparameters for clarity
print("Best hyperparameters:", best_trial.params)

# OPTIONAL: Re-run optimization with more trials if needed
# if not satisfied with results:
#     study.optimize(objective, n_trials=50)  # Increase trials if necessary
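
When a search space uses `step`, it helps to check which values are actually reachable. The sketch below assumes stepped suggestions are confined to `low + k * step` within the bounds; `stepped_values` is a hypothetical helper, not an Optuna function.

```python
def stepped_values(low: int, high: int, step: int) -> list[int]:
    """Values low, low + step, ... that do not exceed high."""
    return list(range(low, high + 1, step))


# A step that does not divide the range never reaches the upper bound
print(stepped_values(64, 128, 50))
# A step that divides the range covers both ends
print(stepped_values(64, 128, 32))
print(stepped_values(2, 10, 2))
```

Under that assumption, a step of 50 over 64..128 only ever yields 64 and 114, so the upper bound of 128 is never tried.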
