# LightGBM

For: NEIL HEINRICH BRAUN

The goal of this file is to train a LightGBM model on the given dataset. The data files you will eventually need to import are unfortunately not ready yet. For now, write and test the code using `model_building_data.csv`, which is provided in the data folder. Keep in mind that the final training/testing files will have more fields.

A great thing about LightGBM is that it can handle missing data as is. LightGBM also copes with unusual or unscaled feature ranges better than neural-network-based models or even models like an SVM regressor. So in practice, you don't need to do much preprocessing, because the data file should already contain appropriate data for you to use.

Of course, this does not mean you should skip processing entirely. You should explore dimension reduction techniques and do feature selection where appropriate. Also, some columns might have too many NaNs and should be removed entirely.
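One way to remove columns with too many NaNs is to threshold the per-column missing share. The sketch below assumes a cutoff of 50%; the helper name `drop_sparse_columns`, the cutoff, and the toy columns are all hypothetical and should be adjusted to the real data.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_nan_frac: float = 0.5) -> pd.DataFrame:
    """Return a copy of df without columns whose NaN share exceeds max_nan_frac."""
    nan_frac = df.isna().mean()  # fraction of NaNs per column
    keep = nan_frac[nan_frac <= max_nan_frac].index
    return df[keep].copy()

toy = pd.DataFrame({
    "tic": ["BAC", "JPM", "WFC", "BAC"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% NaN
    "mostly_present": [0.1, np.nan, 0.3, 0.4],        # 25% NaN
})

cleaned = drop_sparse_columns(toy, max_nan_frac=0.5)
print(list(cleaned.columns))  # ['tic', 'mostly_present']
```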

Furthermore, LightGBM is *not* a time-series model: it treats every row independently. Therefore, you should engineer lagged variables for prediction as well.
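Lagged variables can be built by shifting each column within its ticker group. The sketch below is one possible helper; `add_lags` and the toy `revenue` column are hypothetical names, and it assumes quarter strings such as `2001Q1` sort chronologically as text.

```python
import pandas as pd

def add_lags(df, group_col, time_col, value_cols, lags):
    """Return a copy of df with per-group lagged copies of value_cols."""
    # Sort so that shifting within a group always looks back in time.
    out = df.sort_values([group_col, time_col]).copy()
    for col in value_cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out.groupby(group_col)[col].shift(lag)
    return out

toy = pd.DataFrame({
    "tic": ["BAC", "JPM", "BAC", "JPM", "BAC"],
    "datacqtr": ["2001Q1", "2001Q1", "2001Q2", "2001Q2", "2001Q3"],
    "revenue": [1.0, 10.0, 2.0, 20.0, 3.0],
})

lagged = add_lags(toy, "tic", "datacqtr", ["revenue"], lags=[1, 2])
print(lagged)
```

The first `lag` rows of every ticker have no history and come out as NaN, so each extra lag costs one more quarter of data per ticker if NaN rows are later dropped.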

One last thing to keep in mind: some rows might have missing revenue but non-missing CAR, and vice versa. If you drop NaNs, drop them separately for each y variable to prevent unnecessary data loss.
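The effect of dropping NaNs per target can be seen on a three-row toy frame. The column names `log_revenue` and `CAR` are placeholders for whatever the final files call the two targets.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "tic": ["BAC", "JPM", "WFC"],
    "feature": [0.1, 0.2, 0.3],
    "log_revenue": [np.nan, 4.5, 4.7],  # revenue missing for BAC
    "CAR": [0.5, np.nan, 0.25],         # CAR missing for JPM
})

# One frame per target: only rows missing *that* target are dropped.
rev_rows = toy.dropna(subset=["log_revenue"]).drop(columns=["CAR"])
car_rows = toy.dropna(subset=["CAR"]).drop(columns=["log_revenue"])

# A single dropna() over both targets would keep only WFC.
both_rows = toy.dropna()

print(len(rev_rows), len(car_rows), len(both_rows))  # 2 2 1
```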

Tune all parameters using 3-fold CV with the time-split function, as in assignment 1. I'll write a different time-split function, and we'll rerun with 5-10 fold CV later before submission.
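The assignment 1 time-split function is not reproduced here. As a stand-in, the sketch below uses scikit-learn's `TimeSeriesSplit` with three splits to show what time-ordered folds look like; the twelve placeholder rows are assumed to be sorted from oldest to newest.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve placeholder rows, assumed to be already sorted from oldest to newest.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for i, (train_idx, valid_idx) in enumerate(folds, start=1):
    # Every validation index is later than every training index.
    print(f"fold {i}: train={train_idx.tolist()} valid={valid_idx.tolist()}")
```

Because the splitter works on row positions, the folds are only chronological if the rows are sorted by quarter before being passed in.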

This file should save the predictions in the following format:

| ticker | quarter_year  | log_revenue_prediction | CAR_prediction |
|--------|---------------|------------------------|----------------|
| BAC    | Q1 2001       | 123                    | 0.5            |
| JPM    | Q1 2001       | 456                    | 0.8            |
| WFC    | Q1 2001       | 789                    | 0.25           |
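One way to assemble a table in this layout is to merge the per-target predictions on ticker and quarter, then reformat the quarter label. The sketch below assumes quarter strings in the `2001Q1` form used by `datacqtr`; the two input frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical per-target predictions keyed by ticker and quarter.
rev_pred = pd.DataFrame({
    "tic": ["BAC", "JPM"],
    "datacqtr": ["2001Q1", "2001Q1"],
    "log_revenue_prediction": [123.0, 456.0],
})
car_pred = pd.DataFrame({
    "tic": ["BAC", "JPM"],
    "datacqtr": ["2001Q1", "2001Q1"],
    "CAR_prediction": [0.5, 0.8],
})

# Outer merge keeps rows that have a prediction for only one of the targets.
out = rev_pred.merge(car_pred, on=["tic", "datacqtr"], how="outer")

# "2001Q1" -> "Q1 2001"
out["quarter_year"] = out["datacqtr"].str[4:] + " " + out["datacqtr"].str[:4]

out = out.rename(columns={"tic": "ticker"})[
    ["ticker", "quarter_year", "log_revenue_prediction", "CAR_prediction"]
]
print(out.to_string(index=False))
```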

Enjoy!

In [35]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, KFold
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import lightgbm as lgb

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

## Revenue

In [2]:
df_train_rev = pd.read_csv("data/train_data_REV_with_text.csv")
df_test_rev = pd.read_csv("data/test_data_REV_with_text.csv")
df_train_rev = df_train_rev.sort_values(by=['datacqtr', 'tic']).reset_index(drop=True)
df_test_rev = df_test_rev.sort_values(by=['datacqtr', 'tic']).reset_index(drop=True)

In [3]:
# Create lag-1 to lag-4 features per ticker for each column below.
# Rows are sorted by quarter, so shifting within each 'tic' group looks back in time.
rev_lag_cols = ['Total Current Operating Revenue', 'Net Charge-Offs', 'Invested Capital - Total']

for df in (df_train_rev, df_test_rev):
    for col in rev_lag_cols:
        for lag in range(1, 5):
            df[f'{col}_lag{lag}'] = df.groupby('tic')[col].shift(lag)

In [4]:
# Drop NA

df_train_rev = df_train_rev.dropna()
df_test_rev = df_test_rev.dropna()

In [5]:
X_rev_train = df_train_rev.drop(columns=["datacqtr", "tic", "Total Current Operating Revenue"]).copy().to_numpy()
y_rev_train = df_train_rev["Total Current Operating Revenue"].copy().to_numpy()

X_rev_test = df_test_rev.drop(columns=["datacqtr", "tic", "Total Current Operating Revenue"]).copy().to_numpy()
y_rev_test = df_test_rev["Total Current Operating Revenue"].copy().to_numpy()

In [11]:
param_grid = {
    'num_leaves': [31, 50, 70],
    'max_depth': [-1, 10, 20],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 500],
}

cv_strategy = TimeSeriesSplit(n_splits=5)

model_rev = lgb.LGBMRegressor(verbosity=-1)

# verbose must be non-negative in recent scikit-learn releases; 0 keeps the search silent
grid_search_rev = GridSearchCV(estimator=model_rev, param_grid=param_grid, cv=cv_strategy, n_jobs=15, verbose=0, scoring='neg_mean_squared_error')
grid_search_rev.fit(X_rev_train, y_rev_train)

In [15]:
# 1. Show the best parameters
print("Best Parameters:", grid_search_rev.best_params_)

# 2. Show the best (lowest) MSE (note: scoring was negative MSE)
best_mse = -grid_search_rev.best_score_
print("Best Cross-Validated MSE:", best_mse)

# 3. Make predictions on the test set
best_model_rev = grid_search_rev.best_estimator_
y_rev_pred = best_model_rev.predict(X_rev_test)

# 4. Evaluate predictions
mse = mean_squared_error(y_rev_test, y_rev_pred)
r2 = r2_score(y_rev_test, y_rev_pred)
mae = mean_absolute_error(y_rev_test, y_rev_pred)

print("Test Set Evaluation:")
print(f"  R² Score: {r2:.4f}")
print(f"  MSE: {mse:.8f}")
print(f"  MAE: {mae:.8f}")

Best Parameters: {'learning_rate': 0.05, 'max_depth': 10, 'n_estimators': 100, 'num_leaves': 50}
Best Cross-Validated MSE: 0.00015108183566477914
Test Set Evaluation:
  R² Score: 0.9949
  MSE: 0.00015343
  MAE: 0.00772067


In [16]:
df_test_rev_predict = df_test_rev[["tic", "datacqtr"]].copy()
df_test_rev_predict["light_gbm_rev_predict"] = y_rev_pred

In [17]:
df_test_rev_predict.to_csv("results/lightgbm_rev_predict_test.csv", index=False)

In [38]:
# Out-of-fold (OOF) predictions for stacking: each training row is predicted
# by a model that was fitted without that row
oof_preds = np.zeros(len(X_rev_train))

cv_strategy = KFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, valid_idx in cv_strategy.split(X_rev_train):
    X_train_fold, y_train_fold = X_rev_train[train_idx], y_rev_train[train_idx]
    X_valid_fold, y_valid_fold = X_rev_train[valid_idx], y_rev_train[valid_idx]

    fold_model = lgb.LGBMRegressor(**grid_search_rev.best_params_, verbosity=-1)
    fold_model.fit(X_train_fold, y_train_fold)

    oof_preds[valid_idx] = fold_model.predict(X_valid_fold)



In [39]:
df_oof_rev_prediction = df_train_rev[["tic", "datacqtr"]].copy()
df_oof_rev_prediction["light_gbm_rev_predict"] = oof_preds

In [40]:
df_oof_rev_prediction.to_csv("data/stacking_data/lightgbm_rev_predict.csv", index=False)

## CAR

In [29]:
df_train_car = pd.read_csv("data/train_data_CAR5_with_text.csv")
df_test_car = pd.read_csv("data/test_data_CAR5_with_text.csv")
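# NOTE: sorting by 'tic' first (unlike the revenue section) means TimeSeriesSplit below
# splits along ticker order rather than calendar time; sort by ['datacqtr', 'tic'] for chronological folds.
df_train_car = df_train_car.sort_values(by=['tic', 'datacqtr']).reset_index(drop=True)
df_test_car = df_test_car.sort_values(by=['tic', 'datacqtr']).reset_index(drop=True)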

In [30]:
# Create the lag-1 features per ticker (lags 2-4 are disabled for CAR)
car_lag_cols = ['Total Current Operating Revenue', 'Net Charge-Offs', 'Invested Capital - Total', 'car5']
car_lags = [1]  # extend to [1, 2, 3, 4] to re-enable the longer lags

for df in (df_train_car, df_test_car):
    for col in car_lag_cols:
        for lag in car_lags:
            df[f'{col}_lag{lag}'] = df.groupby('tic')[col].shift(lag)

In [31]:
df_train_car

Unnamed: 0,datacqtr,tic,car5,GDP CHANGE (-1 to 1),UNEMPLOYMENT RATE (0 to 1),PRIME LOAN RATE (0 to 1),DEPOSITS CHANGE (-1 to 1),CONSUMER PRICE INDEX (0 to 1),SAVINGS PER GROSS INCOME (-1 to 1),Net Interest Income,...,reviews_rating,text_blob_reviews_sentiment,vader_reviews_sentiment_neg,vader_reviews_sentiment_pos,bert_reviews_label,bert_reviews_score,Total Current Operating Revenue_lag1,Net Charge-Offs_lag1,Invested Capital - Total_lag1,car5_lag1
0,2002Q3,ABVA,-0.000718,0.536682,0.226950,0.240000,0.155932,0.858035,0.630435,0.000000,...,0.500000,0.000000,0.500000,0.500000,0.500000,0.500000,,,,
1,2002Q4,ABVA,-0.035508,0.525962,0.241135,0.193016,0.207168,0.754757,0.652174,0.001665,...,0.500000,0.000000,0.500000,0.500000,0.500000,0.500000,0.018448,0.907823,0.000000,-0.000718
2,2003Q1,ABVA,0.137994,0.544421,0.241135,0.160000,0.109794,0.700878,0.554348,0.003983,...,0.500000,0.000000,0.500000,0.500000,0.500000,0.500000,0.017356,0.908680,0.000000,-0.035508
3,2003Q2,ABVA,-0.051305,0.557456,0.269504,0.158750,0.161457,0.579028,0.565217,0.007631,...,0.500000,0.000000,0.500000,0.500000,0.500000,0.500000,0.032957,0.908803,0.000000,0.137994
4,2003Q3,ABVA,0.049040,0.616419,0.269504,0.120000,0.201673,0.521375,0.554348,0.017963,...,0.500000,0.000000,0.500000,0.500000,0.500000,0.500000,0.053341,0.909171,0.003413,-0.051305
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8014,2019Q4,ZION,0.068700,0.543011,0.000000,0.253548,0.179586,0.721845,0.608696,0.633072,...,0.500000,0.000000,0.500000,0.500000,0.500000,0.500000,0.604401,0.823592,0.591243,-0.044618
8015,2020Q1,ZION,-0.056801,0.429512,0.024823,0.183226,0.188084,0.711620,0.652174,0.630463,...,0.415414,0.119543,0.079556,0.177932,0.419173,0.688792,0.601003,0.523349,0.594882,0.068700
8016,2020Q2,ZION,0.048808,0.000000,1.000000,0.000000,1.000000,0.474737,0.217391,0.633502,...,0.500000,0.000000,0.500000,0.500000,0.500000,0.500000,0.595861,0.653172,0.596922,-0.056801
8017,2020Q3,ZION,-0.007967,1.000000,0.553191,0.000000,0.271433,0.514690,0.315217,0.632209,...,0.500000,0.000000,0.500000,0.500000,0.500000,0.500000,0.588731,0.482752,0.593118,0.048808


In [32]:
df_train_car = df_train_car.dropna()
df_test_car = df_test_car.dropna()

In [33]:
X_car_train = df_train_car.drop(columns=["datacqtr", "tic", "car5"]).copy().to_numpy()
y_car_train = df_train_car["car5"].copy().to_numpy()

X_car_test = df_test_car.drop(columns=["datacqtr", "tic", "car5"]).copy().to_numpy()
y_car_test = df_test_car["car5"].copy().to_numpy()

In [34]:
cv_strategy = TimeSeriesSplit(n_splits=5)
model_car = lgb.LGBMRegressor(verbosity=-1)

# verbose must be non-negative in recent scikit-learn releases; 0 keeps the search silent
grid_search_car = GridSearchCV(estimator=model_car, param_grid=param_grid, cv=cv_strategy, n_jobs=15, verbose=0, scoring='neg_mean_squared_error')
grid_search_car.fit(X_car_train, y_car_train)

In [37]:
# 1. Show the best parameters
print("Best Parameters:", grid_search_car.best_params_)

# 2. Show the best (lowest) MSE (note: scoring was negative MSE)
best_mse = -grid_search_car.best_score_
print("Best Cross-Validated MSE:", best_mse)

# 3. Make predictions on the test set
best_model_car = grid_search_car.best_estimator_
y_car_pred = best_model_car.predict(X_car_test)

# 4. Evaluate predictions
mse = mean_squared_error(y_car_test, y_car_pred)
r2 = r2_score(y_car_test, y_car_pred)
mae = mean_absolute_error(y_car_test, y_car_pred)

print("Test Set Evaluation:")
print(f"  R² Score: {r2:.4f}")
print(f"  MSE: {mse:.8f}")
print(f"  MAE: {mae:.8f}")

Best Parameters: {'learning_rate': 0.01, 'max_depth': 20, 'n_estimators': 500, 'num_leaves': 50}
Best Cross-Validated MSE: 0.002973352572601879
Test Set Evaluation:
  R² Score: -0.0754
  MSE: 0.00380143
  MAE: 0.04608584


In [41]:
df_test_car_predict = df_test_car[["tic", "datacqtr"]].copy()
df_test_car_predict["light_gbm_car_predict"] = y_car_pred

In [42]:
df_test_car_predict.to_csv("results/lightgbm_car_predict_test.csv", index=False)

In [43]:
# Out-of-fold (OOF) predictions for stacking: each training row is predicted
# by a model that was fitted without that row
oof_preds = np.zeros(len(X_car_train))

cv_strategy = KFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, valid_idx in cv_strategy.split(X_car_train):
    X_train_fold, y_train_fold = X_car_train[train_idx], y_car_train[train_idx]
    X_valid_fold, y_valid_fold = X_car_train[valid_idx], y_car_train[valid_idx]

    fold_model = lgb.LGBMRegressor(**grid_search_car.best_params_, verbosity=-1)
    fold_model.fit(X_train_fold, y_train_fold)

    oof_preds[valid_idx] = fold_model.predict(X_valid_fold)



In [44]:
df_oof_car_prediction = df_train_car[["tic", "datacqtr"]].copy()
df_oof_car_prediction["light_gbm_car_predict"] = oof_preds

In [48]:
df_oof_car_prediction.to_csv("data/stacking_data/lightgbm_car_predict.csv", index=False)