**Import Python Libraries**

In [1]:
import seaborn as sns

Seaborn is a data visualization library built on top of Matplotlib, providing a high-level interface for
creating informative and attractive statistical graphics.

In [2]:
import matplotlib.pyplot as plt

Matplotlib is a powerful plotting library that enables the creation of a wide variety of static, animated, and 
interactive visualizations in Python.

In [3]:
import numpy as np

NumPy is a powerful library for numerical operations, providing support for large, multi-dimensional arrays 
and matrices, along with mathematical functions to operate on these arrays.

In [4]:
import pandas as pd

Pandas is a powerful and popular library for data manipulation and analysis. It provides data structures like 
DataFrame for efficient handling of structured data.

In [5]:
import catboost as ctb

CatBoost is a machine learning library specifically designed for gradient boosting on decision trees.

**Import Python dependencies (functions and classes)**

In [6]:
from numpy import mean 

from numpy import std

from numpy import asarray

The mean function is used to calculate the arithmetic mean or average of numerical data.
The std function is used to compute the standard deviation of a set of numerical values.
The asarray function is used to convert input to an array.
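
As a small illustration of these three functions, the sketch below uses made-up sample values. One detail worth knowing is that std computes the population standard deviation by default (ddof=0); passing ddof=1 gives the sample standard deviation.

```python
from numpy import asarray, mean, std

# Hypothetical sample values, chosen only for illustration.
scores = asarray([2, 4, 4, 4, 5, 5, 7, 9])

avg = mean(scores)                    # arithmetic mean
spread = std(scores)                  # population standard deviation (ddof=0)
sample_spread = std(scores, ddof=1)   # sample standard deviation (divides by n - 1)

print(avg, spread, sample_spread)
```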

In [7]:
from sklearn.preprocessing import Normalizer

The Normalizer class is used for normalizing samples individually to have unit norm.

In [8]:
from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import train_test_split : 
This imports the train_test_split function, which is commonly used to split a dataset into training and testing sets.

from sklearn.model_selection import cross_val_score : 
This imports the cross_val_score function, which is used for cross-validation.
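
A minimal sketch of how these two functions are typically used together is shown below. The data is synthetic and LinearRegression is only a stand-in estimator; neither is taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data, made up for illustration: 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# 5-fold cross-validation on the training portion only.
# scikit-learn reports errors as negative scores so that higher is always better.
scores = cross_val_score(
    LinearRegression(), X_train, y_train, cv=5,
    scoring="neg_root_mean_squared_error",
)

print(X_train.shape, X_test.shape)
print(-scores.mean())  # average RMSE across the folds
```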

In [9]:
from sklearn.metrics import mean_squared_error

from sklearn.metrics import accuracy_score

from sklearn.metrics import r2_score

from sklearn.metrics import mean_squared_error : 
This imports the mean_squared_error function, a metric for regression tasks. It computes the mean of the squared differences between the actual and predicted values, so lower values indicate a better fit.

from sklearn.metrics import accuracy_score : 
This imports the accuracy_score function, which is a classification metric. It measures the accuracy of classification models by comparing the predicted labels to the true labels.

from sklearn.metrics import r2_score : 
This imports the r2_score function, also known as the coefficient of determination. It assesses the goodness of fit of a
regression model by indicating the proportion of the variance in the dependent variable that is predictable from the independent variables.
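
To make the two regression metrics concrete, here is a short worked example on four hypothetical values. It also shows RMSE as the square root of the MSE, which is how the loss is computed later in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted values, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

# The same quantities computed by hand.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, r2)
print(ss_res / len(y_true), 1 - ss_res / ss_tot)
```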

In [10]:
np.random.seed(7)

np.random.seed(7): This uses the seed function from the NumPy library to seed the random number generator. The argument (in this case, 7) is an arbitrary value, and setting it ensures reproducibility in random number generation.
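
The effect of seeding can be checked with a short sketch: reseeding with the same value reproduces the same draws. Note that this only covers code that draws from NumPy's global random number generator; other libraries may keep their own random state (CatBoost, for example, has its own random_seed parameter).

```python
import numpy as np

np.random.seed(7)
first = np.random.rand(3)

np.random.seed(7)              # reseed with the same value
second = np.random.rand(3)

print(np.array_equal(first, second))  # True
```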

In [11]:
from hyperopt import hp

from hyperopt import hp: This imports the hp module, which stands for "hyperparameters." Hyperopt is a powerful library for hyperparameter optimization, and the hp module provides a convenient way to define and search through hyperparameter spaces.

**Setting Parameters of Regression**

In [12]:
ctb_params_reg = {
    'learning_rate' : hp.choice('learning_rate' , np.arange(0.05, 0.31, 0.05)),
    'max_depth' : hp.choice('max_depth' , np.arange(5, 16, 5, dtype = int)),
    'colsample_bylevel' : hp.choice('colsample_bylevel' , np.arange(0.3, 0.8, 0.1)),
    'n_estimators' : 100,
    'eval_metric' : 'RMSE'
}

This code defines a dictionary ctb_params_reg that represents a search space for hyperparameters used in the CatBoostRegressor model within the context of hyperparameter optimization using Hyperopt.

'learning_rate': hp.choice('learning_rate', np.arange(0.05, 0.31, 0.05)): This defines a hyperparameter for the learning rate with a choice space represented by np.arange(0.05, 0.31, 0.05).

'max_depth': hp.choice('max_depth', np.arange(5, 16, 5, dtype=int)): This defines a hyperparameter for the maximum depth of trees with a choice space represented by np.arange(5, 16, 5, dtype=int).

'colsample_bylevel': hp.choice('colsample_bylevel', np.arange(0.3, 0.8, 0.1)): This defines a hyperparameter for the fraction of features to consider for each split with a choice space represented by np.arange(0.3, 0.8, 0.1).

'n_estimators': 100: This sets a fixed value for the number of estimators (trees) in the ensemble to 100.

'eval_metric': 'RMSE': This sets the evaluation metric to Root Mean Squared Error (RMSE), which is commonly used for regression tasks.
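
To see which candidate values each hp.choice actually draws from, the arrays can be printed directly. This sketch only uses NumPy; the values are rounded for display because np.arange with a floating-point step accumulates small rounding errors.

```python
import numpy as np

learning_rates = np.arange(0.05, 0.31, 0.05)
max_depths = np.arange(5, 16, 5, dtype=int)
colsample = np.arange(0.3, 0.8, 0.1)

print(np.round(learning_rates, 2))  # six candidates from 0.05 to 0.30
print(max_depths)                   # three candidates: 5, 10, 15
print(np.round(colsample, 2))
```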

In [13]:
ctb_params_fit = {
    'early_stopping_rounds' : 10,
    'verbose' : False
}

This code defines a dictionary ctb_params_fit that represents parameters used during the training (fitting) phase of the CatBoostRegressor model.

'early_stopping_rounds': 10: This parameter is set to 10, indicating that the training process will stop if the performance metric (evaluated on a validation set) does not improve after 10 consecutive rounds.

'verbose': False: This parameter is set to False, which means that the training process will not display detailed progress information. Setting verbose to False can be useful to reduce the amount of output during training.

In [14]:
ctb_para = dict()
ctb_para['params_reg'] = ctb_params_reg
ctb_para['params_fit'] = ctb_params_fit
ctb_para['func_loss'] = lambda y, pred: np.sqrt(mean_squared_error(y, pred))

This code creates a dictionary ctb_para that combines parameters and functions for configuring the CatBoostRegressor model within the context of hyperparameter optimization and training.

ctb_para['params_reg'] = ctb_params_reg: This associates the hyperparameter search space defined earlier (ctb_params_reg) with the key 'params_reg' in the dictionary.

ctb_para['params_fit'] = ctb_params_fit: This associates the training parameters defined earlier (ctb_params_fit) with the key 'params_fit' in the dictionary.

ctb_para['func_loss'] = lambda y, pred: np.sqrt(mean_squared_error(y, pred)): This associates a loss function with the key 'func_loss' in the dictionary. The loss function is defined as the square root of the mean squared error, which is a common metric for regression tasks. The lambda function takes the true labels (y) and the predicted values (pred) as inputs.

**Creating CatOptimizer Class**

In [15]:
from hyperopt import fmin, tpe, Trials, STATUS_OK, STATUS_FAIL

class CatOptimizer(object):
    def __init__ (self, x_train, x_test, y_train, y_test):
        self.x_train = x_train
        self.x_test = x_test
        self.y_train = y_train
        self.y_test = y_test
        
    def process (self, f_name, space, trials, algo, max_evals):
        fn = getattr(self, f_name)
        try:
            result = fmin(fn = fn, space = space, algo = algo, max_evals = max_evals, trials = trials)
        except Exception as e:
            return { 'status' : STATUS_FAIL,
                    'exception' : str(e)}
        return result, trials
    
    def cat_reg (self, para):
        reg = ctb.CatBoostRegressor(**para['params_reg'])
        return self.train_reg(reg, para)
    
    def train_reg (self, reg, para):
        reg.fit(self.x_train, self.y_train, eval_set = [(self.x_train, self.y_train), (self.x_test, self.y_test)],
                **para['params_fit'])
        pred = reg.predict(self.x_test)
        loss = para['func_loss'](self.y_test, pred)
        return {'loss' : loss, 'status' : STATUS_OK}

This code defines a Python class CatOptimizer that serves as a wrapper for hyperparameter optimization of a CatBoostRegressor model using the Hyperopt library.

The __init__ method initializes the class instance with training and testing data.

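The process method is a generic method for hyperparameter optimization. It takes the name of a method (f_name), a search space (space), a Trials object (trials), an optimization algorithm (algo), and the maximum number of evaluations (max_evals). It looks the method up with getattr and passes it to fmin. On success it returns a tuple of the best result and the trials; if an exception is raised it instead returns a dictionary with STATUS_FAIL and the exception message, so callers need to handle both return shapes.

The cat_reg method is the specific function to be optimized, creating a CatBoostRegressor model with hyperparameters provided in para['params_reg'].

The train_reg method trains the CatBoostRegressor model with both the training and test sets passed as eval_set, predicts on the test set, and computes the loss using the specified loss function. Because the test set serves as the evaluation set here, the loss that guides the search is measured on the same data that is later used for the final metrics.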

**Reading data from train and test sets**

In [16]:
df_t_train = pd.read_csv("2021-NIRF_train.csv")
train_data = pd.DataFrame(df_t_train)
print(train_data.columns.values)

['SCORE' 'SS/20' 'FSR/30' 'FQE/20' 'FRU/30' 'TLR/100' 'PU/35' 'QP/40'
 'IPR/15' 'FPPP/10' 'RP/100' 'GPH/40' 'GUE/15' 'MS/25' 'GPHD/20' 'GO/100'
 'RD/30' 'WD/30' 'ESCS/20' 'PCS/20' 'OI/100' 'PR/100' 'RANK']


df_t_train = pd.read_csv("2021-NIRF_train.csv"): This line uses Pandas (pd) to read data from a CSV file named "2021-NIRF_train.csv" and stores it in a DataFrame called df_t_train.

train_data = pd.DataFrame(df_t_train): This line wraps df_t_train in a new DataFrame object called train_data. The call is redundant, since read_csv already returns a DataFrame, and by default passing a DataFrame to the constructor does not make an independent copy of the data. If a separate copy is wanted, df_t_train.copy() is the explicit way to get one.

print(train_data.columns.values): This line prints the column names of the train_data DataFrame. The columns attribute contains the column labels, and values returns them as a NumPy array. This statement helps to quickly inspect the column names in the train dataset.
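
If an independent copy of a DataFrame is actually needed, a minimal sketch looks like the following. The small frame is hypothetical and simply stands in for the CSV contents.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the CSV contents.
original = pd.DataFrame({"SCORE": [80.0, 75.5], "RANK": [1, 2]})

independent = original.copy()        # deep copy by default
independent.loc[0, "SCORE"] = 0.0    # modify the copy only

print(original.loc[0, "SCORE"])      # the original value is untouched
print(independent.loc[0, "SCORE"])
```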

In [17]:
test_data = pd.read_csv("2021-NIRF_test.csv")
print(test_data.columns.values)

['SCORE' 'SS/20' 'FSR/30' 'FQE/20' 'FRU/30' 'TLR/100' 'PU/35' 'QP/40'
 'IPR/15' 'FPPP/10' 'RP/100' 'GPH/40' 'GUE/15' 'MS/25' 'GPHD/20' 'GO/100'
 'RD/30' 'WD/30' 'ESCS/20' 'PCS/20' 'OI/100' 'PR/100' 'RANK']


test_data = pd.read_csv("2021-NIRF_test.csv"): This line uses Pandas (pd) to read data from a CSV file named "2021-NIRF_test.csv" and stores it in a DataFrame called test_data. This is similar to what was done with the training data.

print(test_data.columns.values): This line prints the column names of the test_data DataFrame. The columns attribute contains the column labels, and values returns them as a NumPy array. This statement helps to quickly inspect the column names in the test dataset.

**Feature Engineering**

In [18]:
train_labels = train_data['RANK']
test_labels = test_data['RANK']
train_features = train_data.drop('RANK', axis = 1)
test_features = test_data.drop('RANK', axis = 1)

train_labels = train_data['RANK']: This line extracts the column labeled 'RANK' from the train_data DataFrame and assigns it to the variable train_labels. This assumes that 'RANK' is the target variable or the labels for the training set.

test_labels = test_data['RANK']: Similarly, this line extracts the column labeled 'RANK' from the test_data DataFrame and assigns it to the variable test_labels. This assumes that 'RANK' is the target variable or the labels for the test set.

train_features = train_data.drop('RANK', axis=1): This line creates a new DataFrame train_features by removing the column labeled 'RANK' from the train_data DataFrame along the columns (axis=1). This DataFrame is now assumed to contain the features used for training.

test_features = test_data.drop('RANK', axis=1): Similarly, this line creates a new DataFrame test_features by removing the column labeled 'RANK' from the test_data DataFrame along the columns (axis=1). This DataFrame is assumed to contain the features used for testing.
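
The same split can be tried on a toy frame. The values below are invented; only the column names echo the dataset. The sketch also shows that drop returns a new DataFrame and leaves the source frame unchanged.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names echo the dataset.
toy = pd.DataFrame({
    "SCORE": [80.0, 75.5, 70.2],
    "TLR/100": [90.1, 85.3, 78.8],
    "RANK": [1, 2, 3],
})

labels = toy["RANK"]                  # Series holding the target
features = toy.drop("RANK", axis=1)   # new DataFrame; toy itself is unchanged

print(features.shape, labels.shape)
print(list(toy.columns))
```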

In [19]:
print(train_features.shape)
print(test_features.shape)
print(train_labels.shape)
print(test_labels.shape)

(190, 22)
(20, 22)
(190,)
(20,)


print(train_features.shape): This line prints the shape of the train_features DataFrame, indicating the number of rows and columns. The output will be in the form (number_of_rows, number_of_columns).

print(test_features.shape): Similarly, this line prints the shape of the test_features DataFrame, providing information about the number of rows and columns in the test set.

print(train_labels.shape): This line prints the shape of the train_labels Series, indicating the number of elements. Since this represents the labels for the training set, the shape will be a tuple with one element, indicating the total number of labels.

print(test_labels.shape): Similarly, this line prints the shape of the test_labels Series, indicating the number of elements in the test set labels.

In [20]:
train_X = train_features
test_X = test_features
train_Y = train_labels
test_Y = test_labels

train_X = train_features: This line assigns the DataFrame train_features to the variable train_X. This typically represents the feature set used for training machine learning models.

test_X = test_features: Similarly, this line assigns the DataFrame test_features to the variable test_X. This typically represents the feature set used for testing or predicting with machine learning models.

train_Y = train_labels: This line assigns the Series train_labels to the variable train_Y. This typically represents the target variable or labels corresponding to the training set.

test_Y = test_labels: Similarly, this line assigns the Series test_labels to the variable test_Y. This typically represents the target variable or labels corresponding to the test set.

**Feature Normalization**

In [21]:
transformer_train = Normalizer().fit(train_X)
train_X = transformer_train.transform(train_X)
t_test = test_X
test_X = transformer_train.transform(test_X)

transformer_train = Normalizer().fit(train_X): This line creates an instance of the Normalizer class from scikit-learn and calls fit on the training features (train_X). Normalizer is stateless: it rescales each sample (row) independently to unit norm (L2 by default), so fit learns no parameters from the data and exists only for API consistency.

train_X = transformer_train.transform(train_X): This line rescales each row of the training features to unit norm. The result is a NumPy array rather than a DataFrame.

t_test = test_X: This line binds a second name, t_test, to the original test features DataFrame; it does not make a copy. Because the next line rebinds test_X to a new array, t_test keeps referring to the unnormalized test features. This line is not necessary for the normalization process but might be used for comparison or other purposes.

test_X = transformer_train.transform(test_X): This line applies the same row-wise normalization to the test features. Since each row is scaled by its own norm, the result does not depend on the training set.
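
The row-wise behaviour of Normalizer can be verified with a short sketch on made-up arrays: every transformed row has unit L2 norm, and the data passed to fit has no influence on the output.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two unrelated, made-up datasets to "fit" on.
X_a = np.array([[3.0, 4.0], [1.0, 0.0]])
X_b = np.array([[100.0, 0.0], [0.0, 0.5]])

new_rows = np.array([[6.0, 8.0]])

# Fitting on different data gives the same transform, since nothing is learned.
out_a = Normalizer().fit(X_a).transform(new_rows)
out_b = Normalizer().fit(X_b).transform(new_rows)

# Each row is divided by its own L2 norm.
row_norms = np.linalg.norm(Normalizer().fit_transform(X_a), axis=1)

print(out_a, out_b)
print(row_norms)
```

If the goal is to rescale each feature (column) using statistics learned from the training set, a scaler such as StandardScaler or MinMaxScaler would be the usual choice instead.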

**Creating Result class**

In [22]:
class Result():
    def __init__ (self, y_true):
        self.results = pd.DataFrame({'true_test' : y_true})
        self.metrics = pd.DataFrame(columns = ('model','rmse','r2'))
    def record (self, model, y_pred):
        y_true = self.results.true_test.values
        y_pred = pd.Series(y_pred, name = model+'_pred')
        self.results = pd.concat([self.results, y_pred], axis = 1)
        rmse = np.sqrt(mean_squared_error(y_true,y_pred))
        r_squared = r2_score(y_true, y_pred)
        row_loc = len(self.metrics) + 1
        self.metrics.loc[row_loc] = [model, rmse, r_squared]
    def get_metrics(self):
        return self.metrics
    def get_results(self):
        return self.results
    
res_r = Result(test_Y)

def __init__(self, y_true): This is the constructor method that initializes an instance of the Result class. It takes a parameter y_true, representing the true values for the test set, and creates two DataFrames: results to store the true values and predictions, and metrics to store evaluation metrics.

def record(self, model, y_pred): This method records the predictions and calculates evaluation metrics for a given model. It takes the model name (model) and the predicted values (y_pred) as parameters.

def get_metrics(self): This method returns the DataFrame containing the recorded metrics.

def get_results(self): This method returns the DataFrame containing the true values and predictions.

res_r = Result(test_Y): An instance of the Result class is created with the true values for the test set (test_Y) passed as a parameter, and it is assigned to the variable res_r.
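
One detail of the record method is worth a sketch: pd.concat with axis=1 aligns rows by index label, not by position. In this notebook test_Y comes straight from read_csv, so its index is the default 0..n-1 and the columns line up. With a non-default index (hypothetical values below), the rows would not match unless the index is reset first.

```python
import pandas as pd

# Hypothetical true values whose index does not start at 0,
# as can happen after filtering or splitting a DataFrame.
y_true = pd.Series([10, 20, 30], index=[5, 6, 7], name="true_test")
y_pred = pd.Series([11.0, 19.0, 31.0], name="model_pred")  # default index 0, 1, 2

# Rows are matched by label, so the result holds the union of both indexes.
misaligned = pd.concat([y_true, y_pred], axis=1)

# Resetting the index makes both Series line up by position.
aligned = pd.concat([y_true.reset_index(drop=True), y_pred], axis=1)

print(misaligned.shape)
print(aligned.shape)
```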

In [23]:
opt_model = CatOptimizer(train_X, test_X, train_Y, test_Y)
ctb_opt = opt_model.process(f_name = 'cat_reg', space = ctb_para, trials = Trials(), algo = tpe.suggest, max_evals = 100)
print(ctb_opt)

100%|███████████████████████████████████████████████| 100/100 [25:50<00:00, 15.51s/trial, best loss: 7.836680825832662]
({'colsample_bylevel': 3, 'learning_rate': 3, 'max_depth': 0}, <hyperopt.base.Trials object at 0x0000022E13521050>)


opt_model = CatOptimizer(train_X, test_X, train_Y, test_Y): This line creates an instance of the CatOptimizer class, initializing it with the training and testing data (train_X, test_X, train_Y, test_Y).

ctb_opt = opt_model.process(f_name='cat_reg', space=ctb_para, trials=Trials(), algo=tpe.suggest, max_evals=100): This line uses the process method of the opt_model instance to perform hyperparameter optimization. It optimizes the cat_reg function (CatBoostRegressor) using the hyperparameter search space defined in ctb_para, with the TPE (Tree-structured Parzen Estimator) algorithm, and a maximum of 100 evaluations. The result is stored in the variable ctb_opt.

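print(ctb_opt): This line prints the result of the hyperparameter optimization: a tuple of the best assignment found and the Trials object. For parameters defined with hp.choice, the assignment holds the index of the selected option rather than the value itself, which is why the output shows integers such as 3 and 0.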

In [24]:
best_p = {'learning_rate' : np.arange(0.05, 0.31, 0.05)[ctb_opt[0]['learning_rate']],
         'max_depth' : np.arange(5, 16, 5, dtype = int)[ctb_opt[0]['max_depth']],
         'colsample_bylevel' : np.arange(0.3, 0.8, 0.1)[ctb_opt[0]['colsample_bylevel']]}

'learning_rate': np.arange(0.05, 0.31, 0.05)[ctb_opt[0]['learning_rate']]: This line takes the index stored in ctb_opt[0]['learning_rate'] and uses it to select the corresponding value from the array generated by np.arange(0.05, 0.31, 0.05).

'max_depth': np.arange(5, 16, 5, dtype=int)[ctb_opt[0]['max_depth']]: Similarly, this line takes the index for max depth and selects the corresponding value from the array generated by np.arange(5, 16, 5, dtype=int). The array must match the one used in the search space (step 5); with a different step, the same index would map to a different depth.

'colsample_bylevel': np.arange(0.3, 0.8, 0.1)[ctb_opt[0]['colsample_bylevel']]: This line takes the index for colsample_bylevel and selects the corresponding value from the array generated by np.arange(0.3, 0.8, 0.1).
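
The index lookup described above can be sketched without running the optimizer. The candidate arrays mirror the hp.choice definitions in the search space, and the indices are the ones printed by the optimization run; the names choices, best_idx, and best_values are introduced here only for illustration.

```python
import numpy as np

# Candidate arrays, mirroring the hp.choice definitions in the search space.
choices = {
    "learning_rate": np.arange(0.05, 0.31, 0.05),
    "max_depth": np.arange(5, 16, 5, dtype=int),
    "colsample_bylevel": np.arange(0.3, 0.8, 0.1),
}

# Index assignment as printed by the optimization run.
best_idx = {"colsample_bylevel": 3, "learning_rate": 3, "max_depth": 0}

# Translate each index into the value it stands for.
best_values = {name: choices[name][idx] for name, idx in best_idx.items()}

print(best_values)
```

Keeping the candidate arrays in one place and reusing them for both the search space and the lookup avoids the two definitions drifting apart. Hyperopt also provides a space_eval function intended for this kind of decoding, which may be a simpler option.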

In [25]:
model = ctb.CatBoostRegressor( verbose = 0, n_estimators = 100, colsample_bylevel = best_p['colsample_bylevel'],
                             learning_rate = best_p['learning_rate'], max_depth = best_p['max_depth'], 
                             early_stopping_rounds = 10)

model = ctb.CatBoostRegressor(...): This line creates an instance of the CatBoostRegressor model from the CatBoost library for regression tasks.

verbose=0: This parameter sets the verbosity level to 0, meaning no output will be printed during the training process.

n_estimators=100: This sets the number of trees (estimators) in the ensemble to 100.

colsample_bylevel=best_p['colsample_bylevel']: This sets the fraction of features to consider for each split during training, using the value obtained from the hyperparameter optimization process.

learning_rate=best_p['learning_rate']: This sets the learning rate for the model, using the value obtained from the hyperparameter optimization process.

max_depth=best_p['max_depth']: This sets the maximum depth of the trees in the ensemble, using the value obtained from the hyperparameter optimization process.

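early_stopping_rounds=10: This parameter sets the early stopping criterion: training stops if the evaluation metric has not improved for 10 consecutive rounds. Early stopping relies on a validation set, so it only takes effect when an eval_set is passed to fit, which the fit call in the next cell does not do.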

In [26]:
model.fit(train_X, train_Y)

<catboost.core.CatBoostRegressor at 0x22e134c7a50>

model.fit(train_X, train_Y): This line trains the CatBoostRegressor model (model) on the training data. The train_X variable represents the feature matrix (input variables), and train_Y represents the target variable or labels for the training set.

In [27]:
pred_y = model.predict(test_X)

pred_y = model.predict(test_X): This line applies the trained CatBoostRegressor model (model) to the test features (test_X) to generate predictions.

In [28]:
res_r.record('Cat Boost', pred_y)

res_r.record('Cat Boost', pred_y): This line calls the record method of the res_r instance, where:
'Cat Boost' is passed as the model name.
pred_y represents the predicted values obtained from the CatBoostRegressor model.

**Final R2 and RMSE value**

In [29]:
res_r.get_metrics().head()

   model      rmse      r2
1  Cat Boost  7.663461  0.98169


res_r.get_metrics(): This calls the get_metrics method of the res_r instance, which returns the DataFrame containing the recorded evaluation metrics for different models.

.head(): This method is used to display the first few rows of the DataFrame, providing a quick overview of the recorded metrics.

**Utilization of CatBoost instead of other models:**
The CatBoost model was chosen for its distinctive strengths in handling tabular data and regression scenarios. CatBoost, short for "Categorical Boosting," possesses several key advantages that contribute to its efficacy:

**Handling Categorical Features:**
CatBoost inherently handles categorical features without the need for extensive preprocessing. This is particularly beneficial when dealing with datasets containing a mix of numerical and categorical variables.

**Robust Performance:**
CatBoost is known for its robustness and ability to perform well "out of the box." It requires minimal hyperparameter tuning compared to other gradient boosting algorithms, making it a suitable choice for efficient model development.

**Optimized Tree Building:**
By default, CatBoost builds symmetric (oblivious) decision trees, in which every node at a given depth uses the same split condition. This regular structure keeps the trees compact and makes prediction fast, which is advantageous for large datasets or scenarios where computational efficiency is crucial.

**Built-in Regularization:**
CatBoost incorporates built-in regularization techniques, which contribute to enhanced model generalization and mitigating overfitting. This is valuable when working with complex datasets to ensure the model's reliability on unseen data.

**Handling Missing Data:**
CatBoost has effective strategies for handling missing data, reducing the need for explicit imputation techniques. This is advantageous when working with real-world datasets that often contain missing values.

**Gradient Boosting with Categorical Features:**
CatBoost employs a gradient boosting framework, effectively capturing complex relationships within the data. The incorporation of categorical features into the boosting process enhances its ability to model intricate patterns in diverse datasets.

In summary, the CatBoost model presents a compelling solution for regression tasks, offering a balance between performance, ease of use, and robustness, particularly in scenarios where datasets exhibit a mix of categorical and numerical features. Its ability to handle complexities inherent in real-world data makes it a prudent choice for achieving accurate and reliable regression models.