# Hyperparameter Exploration
## This notebook researches which hyperparameters would work best for our LightGBM model

Our research found that the most significant parameters for achieving good performance are n_estimators, learning_rate, num_leaves, max_depth, and min_child_samples. Other parameters can also be tuned to prevent overfitting: feature_fraction, bagging_fraction, bagging_freq, and the regularization parameters (L1/L2).

Since our dataset is not very large, we probably won't need feature_fraction, bagging_fraction, bagging_freq, lambda_l1, or lambda_l2 at first. These mainly add regularization, so we can revisit them if the model overfits.

## Core Parameters:

- n_estimators (or num_boost_round) sets the number of boosting iterations. Start with a moderate value (e.g., 100-500) and use early stopping to find the optimal point.

- learning_rate sets the step size of each boosting iteration. Smaller learning rates often generalize better but require more boosting iterations. We could start with 0.1 and reduce to 0.01, or even further to 0.001.

- num_leaves limits the number of leaves in each tree. A larger num_leaves allows for more complex trees but can lead to overfitting. A common guideline is to keep it below 2^(max_depth), since a tree of that depth cannot have more leaves than that.

- max_depth limits the depth of the trees. A smaller max_depth is recommended to prevent overfitting. We should experiment with values between 3 and 10.

- min_child_samples (or min_data_in_leaf) sets the minimum number of data points a leaf must contain. This limits overfitting by not allowing leaves to be created from just a few data points.
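
The relationship between num_leaves and max_depth can be checked with a few lines of plain Python. A binary tree of depth d has at most 2^d leaves, so 2^(max_depth) is an upper bound on num_leaves. This is only a sketch: `max_leaves_for_depth` and `candidate_num_leaves` are hypothetical helper names (not LightGBM functions), and the fractions used to pick candidates are an arbitrary assumption.

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the number of leaves in a binary tree of this depth."""
    if max_depth < 1:
        raise ValueError("max_depth must be a positive integer here")
    return 2 ** max_depth


def candidate_num_leaves(max_depth, fractions=(0.25, 0.5, 0.75)):
    """Hypothetical helper: num_leaves candidates strictly below 2**max_depth."""
    cap = max_leaves_for_depth(max_depth)
    # LightGBM requires num_leaves > 1, so clamp the lower end at 2.
    values = {max(2, int(cap * f)) for f in fractions}
    return sorted(v for v in values if v < cap)


# Print the cap and the candidate values for a few depths.
for depth in (3, 5, 7):
    print(depth, max_leaves_for_depth(depth), candidate_num_leaves(depth))
```

The candidate lists could be used as the num_leaves values in a search once a max_depth range has been chosen.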


## Other Parameters:

There are other important parameters that we could experiment with. Most of them reduce overfitting or model complexity, or help with high-dimensional data. Some of these are:

- feature_fraction controls the ratio of features to be selected randomly for each tree. This can help with high dimensionality and overfitting.

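- bagging_fraction and bagging_freq enable random sampling of the training rows. bagging_fraction sets the fraction of rows to sample, and bagging_freq sets how often (every k iterations) the sample is redrawn; bagging stays off while bagging_freq is 0.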

- lambda_l1 and lambda_l2: L1 and L2 regularization coefficients penalize model complexity.

- min_gain_to_split specifies the minimum gain required for a split of a node. Used to regulate tree complexity.

- num_threads specifies how many threads LightGBM will use. For faster training, the LightGBM documentation suggests matching it to the number of physical CPU cores rather than hyper-threads.
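
The parameter names in these lists are LightGBM's native names, while the search examples in this notebook use the scikit-learn wrapper (LGBMRegressor), which exposes some of them under different names. The sketch below records the aliases as a plain dictionary. `NATIVE_TO_SKLEARN` and `to_sklearn_names` are hypothetical names introduced only for illustration, and the sample values are made up.

```python
# Native LightGBM parameter name -> name used by the scikit-learn wrapper.
NATIVE_TO_SKLEARN = {
    "num_boost_round": "n_estimators",
    "min_data_in_leaf": "min_child_samples",
    "feature_fraction": "colsample_bytree",
    "bagging_fraction": "subsample",
    "bagging_freq": "subsample_freq",
    "lambda_l1": "reg_alpha",
    "lambda_l2": "reg_lambda",
    "min_gain_to_split": "min_split_gain",
    "num_threads": "n_jobs",
}


def to_sklearn_names(params):
    """Hypothetical helper: rename native keys, leave all other keys unchanged."""
    return {NATIVE_TO_SKLEARN.get(key, key): value for key, value in params.items()}


# Made-up values, only to show the renaming.
native_params = {
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "lambda_l1": 0.1,
    "learning_rate": 0.05,
}
print(to_sklearn_names(native_params))
```

Names such as learning_rate, num_leaves, and max_depth are the same in both interfaces, so they pass through unchanged.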


Some other important aspects for working with LightGBM I have found are: 

- Start with the base parameters and gradually adjust them based on our data and goal.

- Use early stopping, because it is crucial for finding the optimal number of boosting steps and preventing overfitting.

- Cross-validation: evaluate the model with cross-validation to get a reliable measure of its performance.

- Maybe use the GPU-accelerated version of LightGBM for faster training.

- We could try tuning a few major parameters at a time and changing them gradually, as opposed to tuning all the parameters simultaneously.

- For regularization we could use parameters like feature_fraction, bagging_fraction, and regularization parameters (L1/L2) to prevent overfitting, especially for large datasets.

- Because LightGBM grows trees leaf-wise, which can produce deeper trees, we can adjust num_leaves and max_depth to fine-tune the trees' complexity. (These last two points use the parameters described above.)
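
To make the early stopping point concrete, here is a minimal sketch of the general idea in plain Python: keep the round with the best validation score and stop once it has not improved for a set number of rounds. This is not LightGBM's own implementation (LightGBM offers this through its `early_stopping` callback). The function name `best_iteration_with_patience` and the RMSE values are made up for illustration.

```python
def best_iteration_with_patience(val_scores, patience):
    """Return (best_index, stopped_index) for a lower-is-better validation curve."""
    best_idx = 0
    best_score = float("inf")
    for i, score in enumerate(val_scores):
        if score < best_score:
            # New best round: remember it and keep going.
            best_score, best_idx = score, i
        elif i - best_idx >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_idx, i
    # The curve ended before the patience ran out.
    return best_idx, len(val_scores) - 1


# Made-up validation RMSE per boosting round, for illustration only.
rmse_per_round = [1.00, 0.80, 0.70, 0.65, 0.66, 0.67, 0.68, 0.69]
best, stopped = best_iteration_with_patience(rmse_per_round, patience=3)
print("best round:", best, "stopped at round:", stopped)
```

With this logic, n_estimators only needs to be large enough; the validation curve decides how many rounds are actually kept.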


Before we can start, we need to train our LightGBM model with default parameters and test it manually. This helps us verify the data pipeline, check for data issues, and get a performance baseline.

## Hyperparameter Tuning with Grid Search

A grid search is an automated way of trying different combinations of hyperparameters to find the best one for our model. This will be useful when we start testing!

An example is the following, which I got from ChatGPT:

In [29]:
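from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

# Assumes X_train and y_train were created earlier in the notebook.
param_grid = {
    'num_leaves': [31, 50, 100],
    'max_depth': [-1, 10, 20],  # -1 means no depth limit
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# subsample (bagging_fraction) is ignored unless subsample_freq > 0
grid = GridSearchCV(
    LGBMRegressor(subsample_freq=1),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=5,
    verbose=1,
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
# The scorer returns negative RMSE, so flip the sign to report RMSE
print("Best RMSE:", -grid.best_score_)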

Before we can run this, we have to make sure the data is pre-processed (we did that) and also encode text-based features like actors, directors, genre, and theme as numbers (one-hot, label, or frequency encoding). And of course, split the data into training and testing sets.
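
As an illustration of the frequency encoding option mentioned above, here is a minimal sketch in plain Python. The genre values are made up, and `fit_frequency_encoding` and `apply_frequency_encoding` are hypothetical helper names. The one design choice worth noting is that the mapping is learned from the training rows only, so nothing about the test rows leaks into the features.

```python
from collections import Counter


def fit_frequency_encoding(values):
    """Map each category to its relative frequency in the given (training) values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def apply_frequency_encoding(values, mapping, default=0.0):
    """Replace each category by its learned frequency; unseen categories get `default`."""
    return [mapping.get(value, default) for value in values]


# Made-up genre column, split into train and test rows.
train_genres = ["drama", "comedy", "drama", "horror"]
test_genres = ["comedy", "western"]

mapping = fit_frequency_encoding(train_genres)
print(apply_frequency_encoding(train_genres, mapping))
print(apply_frequency_encoding(test_genres, mapping))
```

Multi-valued columns such as actors would need to be split into individual names first; that step is not shown here.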

We could also use RandomizedSearchCV, a scikit-learn class that would also help us find the best hyperparameters for our model. It is faster and more efficient when we have more parameters or large ranges to explore, which is why it would be a good first step.

In contrast to GridSearchCV, RandomizedSearchCV tries a random subset of the parameter combinations we give it (as opposed to all of them), so it usually finds a near-best combination with fewer trials. GridSearchCV, however, is guaranteed to find the best combination in the grid, not just a near-best one. That's why it would be a good option to use both, starting with RandomizedSearchCV.
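
To see the cost difference in numbers, we can count model fits. The sketch below reuses the grid from the GridSearchCV example and the n_iter=50, cv=5 settings from the RandomizedSearchCV example in this notebook. It only counts fits and does not train anything; the final refit on the full training set is left out of both counts.

```python
from math import prod

# Same grid as in the GridSearchCV example of this notebook.
param_grid = {
    'num_leaves': [31, 50, 100],
    'max_depth': [-1, 10, 20],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

cv_folds = 5
n_iter = 50  # random combinations tried by RandomizedSearchCV

# Grid search tries every combination, once per fold.
n_combinations = prod(len(values) for values in param_grid.values())
grid_fits = n_combinations * cv_folds

# Randomized search tries only n_iter combinations, once per fold.
random_fits = n_iter * cv_folds

print("grid combinations:", n_combinations)
print("grid search fits:", grid_fits)
print("randomized search fits:", random_fits)
```

Adding one more value to any parameter list multiplies the grid search cost, while the randomized search cost stays fixed by n_iter.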

## RandomizedSearchCV with LightGBM example:

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMRegressor
from scipy.stats import randint, uniform

In [None]:
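# randint(a, b) samples integers in [a, b);
# uniform(loc, scale) samples floats in [loc, loc + scale].
param_dist = {
    'num_leaves': randint(20, 150),
    'max_depth': randint(3, 20),
    'learning_rate': uniform(0.01, 0.2),    # 0.01 to 0.21
    'n_estimators': randint(50, 500),
    'subsample': uniform(0.5, 0.5),         # 0.5 to 1.0
    'colsample_bytree': uniform(0.5, 0.5),  # 0.5 to 1.0
}

search = RandomizedSearchCV(
    # subsample (bagging_fraction) is ignored unless subsample_freq > 0
    LGBMRegressor(subsample_freq=1),
    param_distributions=param_dist,
    n_iter=50,  # number of random combinations to try
    scoring='neg_root_mean_squared_error',
    cv=5,
    random_state=42,
    verbose=1
)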

search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best RMSE:", -search.best_score_)

### Another Example of how it would work

In [25]:
from lightgbm import LGBMRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Narrower ranges than the previous example.
# uniform(loc, scale) samples from [loc, loc + scale].
param_dist = {
    'num_leaves': randint(20, 100),
    'max_depth': randint(3, 15),
    'learning_rate': uniform(0.01, 0.1),    # 0.01 to 0.11
    'n_estimators': randint(50, 300),
    'subsample': uniform(0.6, 0.4),         # 0.6 to 1.0
    'colsample_bytree': uniform(0.6, 0.4),  # 0.6 to 1.0
}

search = RandomizedSearchCV(
    # subsample (bagging_fraction) is ignored unless subsample_freq > 0
    LGBMRegressor(subsample_freq=1),
    param_distributions=param_dist,
    n_iter=30,  # try 30 combinations
    scoring='neg_root_mean_squared_error',
    cv=3,
    verbose=1,
    random_state=42
)

# Assumes X_train and y_train were created earlier in the notebook.
search.fit(X_train, y_train)

print("Best RMSE:", -search.best_score_)
print("Best Params:", search.best_params_)