###### Note: The models were not trained in this notebook because the kernel kept disconnecting. The code itself runs correctly, and Manas advised us to submit the notebook as it is for now, without training.
###### If you have any questions, feel free to contact us and we will answer as soon as possible. Thanks.

#### Two notebooks have been submitted for the Beat the Speed Pricing Models Competition. This notebook uses stacking regression (combining LightGBM and XGBoost) to train on the dataset.

### Explainability 
LightGBM, XGBoost, and a stacking ensemble are used to predict "val_lvsvcharge".
- Each model uses only a few parameters, to increase accuracy while limiting overfitting.
- LightGBM and XGBoost are tree-based, so they are easy to interpret, while the stacking ensemble combines the two algorithms.
- The models have smooth sensitivities to the inputs (a small change in a variable does not cause a jump in the result).
- The models have clear comments and explanations (see the code below).
- Model training is fully explained and transparent.
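
The smoothness claim can be checked empirically. Tree ensembles are piecewise constant, so their predictions move in small steps rather than continuously, and it may be worth measuring how large those steps are. The sketch below is one possible check, not the method used in this notebook: `max_prediction_jump`, `smooth_model`, and `step_model` are hypothetical names, and the two toy models only stand in for a fitted regressor's `predict`.

```python
import numpy as np

def max_prediction_jump(predict, x, feature_idx, eps=1e-3, steps=200):
    """Largest change in the prediction between consecutive small steps of one feature."""
    # Build a grid of rows that differ from x only in the chosen feature
    rows = np.tile(x, (steps + 1, 1))
    rows[:, feature_idx] = x[feature_idx] + eps * np.arange(steps + 1)
    preds = predict(rows)
    return float(np.max(np.abs(np.diff(preds))))

# Hypothetical stand-in models (replace with e.g. stack_model.predict)
def smooth_model(rows):
    return 2.0 * rows[:, 0] + rows[:, 1]

def step_model(rows):
    # Behaves like a single tree split at 0.1 on the first feature
    return np.where(rows[:, 0] > 0.1, 5.0, 1.0) + rows[:, 1]

x0 = np.array([0.0, 1.0])
smooth_jump = max_prediction_jump(smooth_model, x0, feature_idx=0)
step_jump = max_prediction_jump(step_model, x0, feature_idx=0)
print(smooth_jump, step_jump)
```

A large value relative to the typical prediction scale would indicate a jump at some point along the scanned range of that feature.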

### Importing Libraries and Dependencies

In [None]:
import alphien
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import linear_model
from matplotlib import pyplot as plt
from sklearn.svm import SVR
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')


In [None]:
%pip install mlxtend



### Preprocessing and Feature Extraction 

In [None]:
# Clean the raw dataset (as provided by Alphien) and transform it into the input
# format the models expect. The input is an alphien.data.DataLoader. Columns with
# object dtype are removed, the combined size of the training and test sets is
# set here, and the data is split into features/label and training/test sets.
import re
def data_processing (raw):
    dl = raw
    dataGen = dl.batch(fromRow=1, toRow=500000)
    data = next(dataGen)
    data = data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
    data = data[:150000]
    print(data.shape)
    data = data.select_dtypes(exclude=['object'])
    # Target column; X is every remaining numeric column
    y = data['val_lvsvcharge'].reset_index(drop=True)
    X = data.drop(columns='val_lvsvcharge').reset_index(drop=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    return X_train, X_test, y_train, y_test
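
The renaming step in `data_processing` strips every character that is not an ASCII letter, digit, or underscore from the column names, presumably because LightGBM rejects feature names that contain special JSON characters. The sketch below shows the effect of the same pattern on a few invented column names (they are not taken from the Alphien dataset) and checks that cleaning has not made two names collide.

```python
import re

def clean_column_name(name):
    # Same pattern as in data_processing: keep only letters, digits, underscores
    return re.sub('[^A-Za-z0-9_]+', '', name)

# Hypothetical raw column names, for illustration only
raw_names = ['val_lvsvcharge', 'policy term (yrs)', 'rate-%', 'age[0]']
cleaned = [clean_column_name(n) for n in raw_names]
print(cleaned)

# Distinct raw names can become identical after cleaning, so check for collisions
has_collision = len(set(cleaned)) != len(cleaned)
print('collision:', has_collision)
```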

In [None]:
X_train, X_test, y_train, y_test = data_processing(alphien.data.DataLoader()) #initializing training and testing

In [None]:
print(X_test.shape)

In [None]:
#define mean squared error and max absolute error for prediction
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import max_error, mean_absolute_error
from sklearn.metrics import mean_squared_error

kfolds = KFold(n_splits=10, shuffle=True, random_state=42)

def mse(y, y_pred):
    return mean_squared_error(y, y_pred)

def cv_rmse(model):
    testing = -cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv=kfolds)
    print(testing)
    rmse = np.sqrt(testing)
    return (rmse)

def mae(y, y_pred):
    # Despite its name, this returns the maximum absolute error, not the mean absolute error
    return max_error(y, y_pred)


def cv_rmax(model):
    # Square root of the cross-validated maximum absolute error, one value per fold
    rmax = np.sqrt(-cross_val_score(model, X_train, y_train, scoring="neg_max_error", cv=kfolds))
    return rmax
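
The helpers in this cell wrap scikit-learn metrics. As a worked example of what they compute, the NumPy sketch below evaluates the same quantities by hand on a small invented set of targets and predictions. It also shows how the maximum absolute error (what `mae` returns here) differs from the mean absolute error.

```python
import numpy as np

# Toy targets and predictions, invented for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

errors = y_true - y_pred                    # [-0.5, 0.0, 1.0, -0.5]
mse_value = np.mean(errors ** 2)            # (0.25 + 0 + 1 + 0.25) / 4
rmse_value = np.sqrt(mse_value)
max_abs_error = np.max(np.abs(errors))      # worst single prediction
mean_abs_error = np.mean(np.abs(errors))    # average miss

print(mse_value, rmse_value, max_abs_error, mean_abs_error)
```

The maximum absolute error is driven entirely by the single worst prediction, so it is much more sensitive to outliers than MSE or the mean absolute error.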


In [None]:
#plot feature importance for a specific model and return important features for feature selection
def plotImp(model, total_features):
    features_used = []
    features_used_amt = []
    feature_list = list(model.feature_importances_)
    feature_len = len(feature_list)
    for i in range(feature_len):
        if feature_list[i] > 0:
            features_used.append(total_features[i])
            features_used_amt.append(feature_list[i])
    print(features_used)
    sns.barplot(y=list(features_used), x=list(features_used_amt))
    return features_used

### Models

### LightGBM
To achieve higher accuracy in the LightGBM model, we have included the following parameters:
- num_leaves
    - a larger num_leaves can give higher accuracy, but may cause overfitting
- max_bin
    - a larger max_bin can give higher accuracy, but training is slower (it is not set explicitly in the cell below, so the library default applies)
- learning_rate
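
To make the role of max_bin concrete: LightGBM is histogram-based and buckets each continuous feature into at most max_bin discrete bins before searching for splits. The sketch below illustrates the idea with simple quantile binning in NumPy. It is a simplified stand-in, not LightGBM's actual bin construction, and `bin_feature` is a hypothetical helper.

```python
import numpy as np

def bin_feature(values, max_bin):
    """Map continuous values to at most `max_bin` integer bins using quantile edges."""
    # Interior quantile edges; duplicates are dropped for features with few distinct values
    quantiles = np.linspace(0.0, 1.0, max_bin + 1)[1:-1]
    edges = np.unique(np.quantile(values, quantiles))
    return np.searchsorted(edges, values, side='right')

rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)   # synthetic feature, for illustration only

coarse = bin_feature(feature, max_bin=15)
fine = bin_feature(feature, max_bin=255)
print(len(np.unique(coarse)), len(np.unique(fine)))
```

More bins mean more candidate split points per feature, which is why a larger max_bin can improve accuracy at the cost of slower training.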

In [None]:
#initialize lightGBM model and train it as preliminary experiment for feature selection
from math import sqrt
lgb_model = LGBMRegressor(objective='regression',num_leaves=35, 
                          learning_rate=0.05, n_estimators=300)
lgb_model.fit(X_train, y_train)
y_lgb_pred = lgb_model.predict(X_test)
LGB_model_error = mse(y_test, y_lgb_pred)
print(f'LGBM Mean Squared Error - {LGB_model_error}')

In [None]:
#plot feature importance and extract important features
total_features = list(X_train)
# plotImp(lgb_model, total_features)
L_column_selected = plotImp(lgb_model, total_features)

### XGBoost
To achieve higher accuracy with XGBRegressor, we have included the following parameters in this model:
- learning_rate
- n_estimators
- max_depth 
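
learning_rate and n_estimators interact: each tree's contribution is scaled by the learning rate, so a smaller rate generally needs more trees to reach the same training error. The sketch below illustrates this with a minimal gradient-boosting loop over decision stumps on synthetic 1-D data. It is a teaching sketch under those assumptions, not XGBoost's algorithm; `fit_stump` and `boost` are hypothetical helpers.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:   # exclude the max so the right side is never empty
        left = x <= threshold
        left_mean, right_mean = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - left_mean) ** 2).sum() + ((residual[~left] - right_mean) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, learning_rate, n_estimators):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    pred = np.full_like(y, y.mean())
    for _ in range(n_estimators):
        threshold, left_mean, right_mean = fit_stump(x, y - pred)
        pred += learning_rate * np.where(x <= threshold, left_mean, right_mean)
    return pred

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

results = {}
for lr, n in [(0.03, 30), (0.03, 300), (0.3, 30)]:
    results[(lr, n)] = np.mean((y - boost(x, y, lr, n)) ** 2)
    print(f'learning_rate={lr}, n_estimators={n}: training MSE = {results[(lr, n)]:.4f}')
```

These are training errors only. A lower training error does not by itself mean better performance on held-out data, which is why max_depth and the number of trees are usually tuned against a validation set.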


In [None]:
#initialize XGBoost model and train it as preliminary experiment for feature selection
XGB_model = XGBRegressor(learning_rate=0.03,n_estimators=500, 
                         max_depth=6, objective='reg:squarederror')
XGB_model.fit(X_train, y_train)
y_XGB_predict = XGB_model.predict(X_test)
XGB_model_error = mse(y_test, y_XGB_predict)
print(f'XGBoost Mean Squared Error - {XGB_model_error}')

In [None]:
#plot feature importance and extract important features
# plotImp(XGB_model, total_features)
X_column_selected = plotImp(XGB_model, total_features)

#### Feature Selection

In [None]:
#union of the important features obtained from the two pretrained models (order preserved, duplicates dropped)
select = list(dict.fromkeys(X_column_selected + L_column_selected))
select_X_train=X_train[X_train.columns.intersection(select)]
select_X_test=X_test[X_test.columns.intersection(select)]

In [None]:
print(select_X_test.shape)
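
One pitfall when combining two feature lists in Python is that the `or` operator does not compute a union: between two lists it simply returns the first one that is non-empty. The sketch below contrasts `or` with an order-preserving union, using invented feature names in place of the two `plotImp` results.

```python
# Hypothetical feature lists standing in for the two plotImp results
xgb_features = ['age', 'term', 'rate']
lgb_features = ['rate', 'premium', 'age']

# `or` between two lists returns the first non-empty list, not a union
via_or = xgb_features or lgb_features

# Ordered union: dict keys keep insertion order and drop duplicates
union = list(dict.fromkeys(xgb_features + lgb_features))

print(via_or)
print(union)
```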

### Stacking Regression
Stacking regression is used to combine lgb_model and XGB_model via a meta-regressor. Compared with either base model alone, stacking can:
- reduce variance and bias
- improve accuracy
- lower the risk of overfitting
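
The mechanics of cross-validated stacking can be sketched in a few lines of NumPy. In this illustration, two deliberately weak linear models stand in for the LightGBM and XGBoost base regressors, and a linear meta-regressor is fitted on their out-of-fold predictions. It is a simplified stand-in for the idea, not mlxtend's implementation, and it omits passing the original features to the meta-regressor. `fit_linear` and `predict_linear` are hypothetical helpers and the data is synthetic.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 2.0 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)

# Two weak base learners: each one sees only part of the features
base_columns = [[0, 1], [1, 2]]

# Step 1: out-of-fold predictions, so the meta-regressor never sees a
# prediction made by a model that was trained on the same row
folds = np.array_split(rng.permutation(len(X)), 5)
oof = np.zeros((len(X), len(base_columns)))
for fold in folds:
    train = np.setdiff1d(np.arange(len(X)), fold)
    for j, cols in enumerate(base_columns):
        coef = fit_linear(X[train][:, cols], y[train])
        oof[fold, j] = predict_linear(coef, X[fold][:, cols])

# Step 2: the meta-regressor learns how to weight the base predictions
meta_coef = fit_linear(oof, y)
stacked = predict_linear(meta_coef, oof)

base_mses = [np.mean((y - oof[:, j]) ** 2) for j in range(len(base_columns))]
stacked_mse = np.mean((y - stacked) ** 2)
print('base MSEs:', base_mses)
print('stacked MSE:', stacked_mse)
```

The stacked error here is measured on the same rows the meta-regressor was fitted on, so it is optimistic. A held-out test set, as used in the rest of this notebook, gives the fairer comparison.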

In [None]:
# Training
# Note: this block acts as the training loop.
#       Run the previous cells, starting from the "Models" section, before running this cell.
from sklearn.feature_selection import SelectFromModel
from numpy import sort
from mlxtend.regressor import StackingCVRegressor
stack_model = StackingCVRegressor(regressors=(lgb_model, XGB_model),
                                meta_regressor=XGB_model,
                                use_features_in_secondary=True)

# flatten the target to 1-D, the shape the regressors expect
stack_model.fit(np.array(select_X_train), np.array(y_train).ravel())
y_stack_pred = stack_model.predict(np.array(select_X_test))

In [None]:
print('MSE score on testing data:')
print(mse(y_test, y_stack_pred))
print('Max absolute error on testing data:')
print(mae(y_test, y_stack_pred))

### Prediction

In [None]:
# sanitize column names, drop 'object'-dtype columns, and keep only the selected features
def predDataTransform(unknown_data, unknown_select):
    unknown_data = unknown_data.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))
    unknown_data = unknown_data.select_dtypes(exclude=['object'])
    unknown_data = unknown_data[unknown_data.columns.intersection(unknown_select)]
    return unknown_data 

In [None]:
# Suppose you have unseen data.
# Change toRow in dl.batch below to set the number of rows to predict.
dl = alphien.data.DataLoader()
unseen = dl.batch(fromRow=1, toRow=10000)
data = next(unseen)
data = data.iloc[:,:-1]  # drop the last column (assumed to be the target)
selected_data = predDataTransform(data, select)
def myPredictFunc(newData, model):
    # predict on the data passed in rather than on the global selected_data
    return model.predict(np.array(newData))

In [None]:
# stack_model is used for prediction function
ypred = myPredictFunc(selected_data, stack_model)

In [None]:
print(ypred[:10])