# Experimenting with LightGBM

To understand how to use LightGBM, let us install it and experiment a little. Most of this notebook follows the official documentation to put together some trial code.

LightGBM website: https://lightgbm.readthedocs.io/en/stable/

In [2]:
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


### Defining the dataset

Let's create a random dataset to test the model's features. LightGBM was chosen because it can handle numeric and categorical features together without one-hot encoding; to test this, the dataset mixes integer columns ('age', 'income') with string columns ('gender', 'job').

The target variable is 'bought_product', an integer from 0 to 3. We will use LightGBM to predict how many products a customer will buy depending on their traits.

This dataset was created with ChatGPT using the prompt: Generate a random dataset with mixed datatypes.

In [5]:
data = pd.DataFrame({
    'age': np.random.randint(18, 70, size=100),
    'income': np.random.randint(20000, 100000, size=100),
    'gender': np.random.choice(['Male', 'Female'], size=100),
    'job': np.random.choice(['Engineer', 'Doctor', 'Artist', 'Teacher'], size=100),
    'bought_product': np.random.randint(0, 4, size=100)
})

### Features and variables

According to the LightGBM documentation, we first need to tell the model which features are categorical so that it can handle them accordingly. For this, I will use the 'categorical_feature' parameter of lgb.Dataset.

In [6]:
categorical_features = ["gender", "job"]

Now we have to define the feature variables and the target variable. Since I want to predict how many products each customer has bought, 'bought_product' is our target. To obtain our features (x), we drop that column from the dataset, and we save it separately as our target (y).

In [7]:
# We will use x as the feature
x = data.drop("bought_product", axis=1)

# We will use y as the target
y = data["bought_product"]

Now, let's split our data into training and testing sets, so that we can fit the model on the training set and then measure its accuracy on the testing set. First we split the data (a common split is 70% training and 30% testing), and then we wrap each part in a LightGBM Dataset.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=100)
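To make the 70/30 split concrete, here is a rough sketch of what a shuffled split does to the row indices. This is not scikit-learn's implementation; `shuffle_split` is a hypothetical helper written only for illustration, and it uses the standard library so that it runs on its own.

```python
import random


def shuffle_split(n_rows, train_size=0.7, seed=100):
    """Shuffle the row indices once, then cut them at the train fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed -> same split every run
    n_train = round(n_rows * train_size)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = shuffle_split(100)
print(len(train_idx), len(test_idx))
```

Every row ends up in exactly one of the two parts, which is why the test set can serve as unseen data.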

In [9]:
train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=categorical_features)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

### Parameters

The LightGBM documentation lists many parameters, and understanding all of them would take a lot of time. To determine which ones matter most whenever I use the model, I asked ChatGPT for an overview of the most important ones to keep in mind. The link below contains the prompts I gave it and the replies it provided.

ChatGPT parameter conversation: https://chatgpt.com/share/684fff01-80c4-8003-a8a6-178d8e54f148

---

<u>Core Parameters</u>

**objective:** Defines the learning task and the loss function that the model optimizes. For this test model we want to use multiclass (which is the same case for our blockbuster dataset). Multiclass objectives also require num_class to be set to the number of classes.
- Possible values: regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie, binary, multiclass, multiclassova, cross_entropy, cross_entropy_lambda, lambdarank, rank_xendcg
- Aliases: objective_type, app, application, loss
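The multiclass objective is a softmax objective: the model keeps one raw score per class and turns the scores into probabilities that sum to 1. A minimal sketch of that conversion, where the raw scores are made-up numbers and not output from a trained model:

```python
import math


def softmax(raw_scores):
    """Turn one raw score per class into probabilities that sum to 1."""
    highest = max(raw_scores)
    # Subtracting the maximum avoids overflow and does not change the result
    exps = [math.exp(score - highest) for score in raw_scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores for the four classes 0, 1, 2, 3
probs = softmax([0.2, 1.5, -0.3, 0.0])
print(probs)
print(sum(probs))
```

The class with the highest raw score always gets the highest probability, so picking the most probable class and picking the highest score give the same answer.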

**boosting_type:** The type of boosting algorithm that we want to use. Since we want to use gradient boosting decision trees (gbdt) for our blockbuster model, I will use that one in this notebook as well.
- Possible values: gbdt, rf, dart
- The documentation lists this parameter as boosting, with boosting_type and boost as aliases

**num_leaves:** The maximum number of leaves in one tree. With too many leaves, each tree can fit noise in the training data, so this is where we have to be careful about overfitting.
- The default value for this is 31

**learning_rate:** The step size taken at every boosting iteration. With trees, this is how much each new tree contributes to the final prediction, so a smaller learning rate requires more trees.
- The default value for this is 0.1
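The trade-off between learning rate and number of trees can be seen in a toy example. The sketch below is not LightGBM: it assumes an idealized "tree" that fits the current residual perfectly, and `rounds_to_converge` is a hypothetical helper that counts how many rounds are needed to get close to a single target value.

```python
def rounds_to_converge(learning_rate, target=10.0, tolerance=0.01, max_rounds=10_000):
    """Count boosting rounds until the prediction is within tolerance of target."""
    prediction = 0.0
    for round_number in range(1, max_rounds + 1):
        residual = target - prediction          # what the next "tree" would fit
        prediction += learning_rate * residual  # add only a shrunken share of it
        if abs(target - prediction) < tolerance:
            return round_number
    return max_rounds


for lr in (0.5, 0.1, 0.01):
    print(lr, rounds_to_converge(lr))
```

Each round removes a fraction learning_rate of the remaining error, so lowering the rate increases the number of rounds needed.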

---

<u>Learning Parameters</u>

**feature_fraction:** The fraction of features (columns) randomly selected for building each tree. Values below 1.0 are one way to reduce overfitting, since no single tree sees every feature.
- Typical values are between 0.6 and 1; the default is 1.0

**bagging_fraction:** The fraction of the data (rows, not features) that is randomly sampled, without resampling, for training. It only takes effect when bagging_freq is greater than 0.
- The default value for this is 1.0

**bagging_freq:** How often bagging is performed. A value of 0 disables bagging, and a value of k means a new sample of rows is drawn every k iterations.
- The default value for this is 0
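The interaction between bagging_fraction and bagging_freq is easier to see as a schedule. The sketch below only illustrates the documented meaning of the two parameters; LightGBM's actual sampling is internal, and `bagging_schedule` is a hypothetical helper.

```python
import random


def bagging_schedule(n_rows, bagging_fraction, bagging_freq, n_iterations, seed=0):
    """Return the list of row indices used at each boosting iteration."""
    rng = random.Random(seed)
    all_rows = list(range(n_rows))
    current_rows = all_rows  # without bagging, every iteration uses all rows
    bagging_enabled = bagging_freq > 0 and bagging_fraction < 1.0
    schedule = []
    for iteration in range(n_iterations):
        # Draw a fresh sample (without replacement) every bagging_freq iterations
        if bagging_enabled and iteration % bagging_freq == 0:
            sample_size = int(n_rows * bagging_fraction)
            current_rows = sorted(rng.sample(all_rows, sample_size))
        schedule.append(current_rows)
    return schedule


with_bagging = bagging_schedule(10, bagging_fraction=0.5, bagging_freq=2, n_iterations=6)
without_bagging = bagging_schedule(10, bagging_fraction=0.5, bagging_freq=0, n_iterations=6)
for iteration, rows in enumerate(with_bagging):
    print(iteration, rows)
```

With bagging_freq=2 the same sample is reused for two iterations in a row, and with bagging_freq=0 the fraction is ignored and all rows are used.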

**verbose:** Controls how much LightGBM logs. Values below 0 show only fatal errors, 0 shows errors and warnings, 1 shows info messages, and values above 1 show debug information.
- The default value for this is 1

---

<u>Metric Parameters</u>

**metric:** Defines the metric used to evaluate the model's predictions on the validation data. For this test model we want to use multi_logloss (which is the same case for our blockbuster dataset).
- There are many possible values; see the "Metric Parameters" section of the LightGBM parameter documentation for more information.
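To get a feel for multi_logloss, here is a minimal sketch of the standard multiclass log loss: the average negative log of the probability assigned to the true class. The labels and probabilities are made-up, and the clipping constant `eps` is my own choice to avoid taking the log of 0.

```python
import math


def multi_logloss(y_true, y_prob, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


# Hypothetical predictions for three customers and four classes
y_true = [0, 2, 3]
y_prob = [
    [0.70, 0.10, 0.10, 0.10],  # confident and correct
    [0.25, 0.25, 0.25, 0.25],  # no idea
    [0.10, 0.10, 0.20, 0.60],  # fairly confident and correct
]
loss = multi_logloss(y_true, y_prob)
uniform_loss = multi_logloss([0, 1, 2, 3], [[0.25] * 4] * 4)
print(loss, uniform_loss)
```

A model that always predicts 25% for each of four classes scores ln(4), roughly 1.386. Since the target in this notebook is random, a value near that baseline is what I would expect from the trained model.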

---

Using this information, we can now define the parameters that will be used for this specific model.


In [10]:
parameters = {
    'objective': 'multiclass',
    'num_class': 4,  # required for multiclass: bought_product takes values 0-3
    'metric': 'multi_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.1,
    'feature_fraction': 1.0,
    'bagging_fraction': 1.0,
    'bagging_freq': 0,
    'verbose': 0
}

In [11]:
model = lgb.train(parameters, train_data, valid_sets=[test_data], num_boost_round=100)

ValueError: pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: gender: object, job: object
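The error says that string columns cannot be passed as they are. One way around it is the pandas 'category' dtype, which stores each distinct string once and represents the column as integer codes. A small sketch on a toy frame (the three rows are made-up and separate from the dataset above):

```python
import pandas as pd

# Toy frame with the same two string columns as the dataset
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "job": ["Artist", "Doctor", "Artist"],
})

# Cast the string columns to the pandas 'category' dtype
for col in ["gender", "job"]:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)
print(toy["job"].cat.categories.tolist())  # the distinct levels
print(toy["job"].cat.codes.tolist())       # the integer code of each row
```

The original strings are still shown when the frame is printed; only the underlying storage changes.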

In [12]:

# LightGBM rejects pandas 'object' columns, so convert the categorical
# features to the 'category' dtype. Casting the full frame first keeps the
# category levels identical in the train and test sets.
for col in categorical_features:
    x[col] = x[col].astype('category')

# Redo the split so that X_train and X_test carry the new dtypes; the same
# random_state reproduces the earlier partition
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=100)


In [13]:

# Prepare LightGBM dataset
train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=categorical_features)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)


In [14]:

# Set parameters for multiclass classification
params = {
    'objective': 'multiclass',
    'num_class': 4,
    'metric': 'multi_logloss',
    'verbose': -1
}


In [15]:

# Train the model. lgb.train() no longer accepts early_stopping_rounds
# (removed in LightGBM 4.0); early stopping is passed as a callback instead.
model = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[test_data],
    callbacks=[lgb.early_stopping(stopping_rounds=10)],
)


In [16]:

# Make predictions: for multiclass, predict() returns one row of class
# probabilities per sample, so take the most probable class
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)


In [17]:

# Evaluate the model
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_test, y_pred_classes))
print(classification_report(y_test, y_pred_classes))
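For reference, the last two steps (turning class probabilities into a predicted class, then scoring accuracy) can be sketched without any library. The probabilities below are made-up rather than model output, and `to_class_labels` and `accuracy` are hypothetical helpers that mirror what `np.argmax` and `accuracy_score` are used for here.

```python
def to_class_labels(probability_rows):
    """Pick the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probability_rows]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


# Hypothetical probabilities for three customers and four classes
probabilities = [
    [0.10, 0.60, 0.20, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.30, 0.20],
]
predicted = to_class_labels(probabilities)
score = accuracy([1, 3, 2], predicted)
print(predicted, score)
```

Because 'bought_product' is drawn at random, there is no real signal to learn, so an accuracy near chance level (about 25% for four classes) would not be surprising in this notebook.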