# 1 - Multiple Linear Regression in Bitcoin Price Prediction

## 1.1 Methodology for MLR

Multiple Linear Regression is a statistical technique that aims to predict the value of a dependent variable based on multiple independent variables. The step-by-step process for implementing Multiple Linear Regression for Bitcoin price prediction is outlined below.

1. Data preprocessing for time-series data:
- The data is a time series, so the observations are sequential. Ordinary (shuffled) cross-validation is therefore not suitable for estimating the error, because it would let the model train on future data. Instead we use TimeSeriesSplit, a cross-validation technique designed for time-series data. TimeSeriesSplit splits the data into folds so that earlier data is always used as the training set and later data is used only as the test set. For example, with 3 splits the folds could look like this:
    - Fold 1: Data from January 2016 to December 2017 (training set) and data from January 2018 to December 2018 (test set).    
    - Fold 2: Data from January 2016 to December 2018 (training set) and data from January 2019 to December 2019 (test set).    
    - Fold 3: Data from January 2016 to December 2019 (training set) and data from January 2020 to December 2020 (test set). <br>

The code `tscv = TimeSeriesSplit(n_splits=3)` creates a time-series cross-validation object that splits the data into 3 folds in chronological order. 
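
The fold structure described above can be checked on a toy array. This is a minimal sketch, assuming scikit-learn is installed; the 12-row array `X` is a hypothetical stand-in for the chronologically sorted price data, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for the price data: 12 observations, oldest first.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

folds = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Every training index comes before every test index in the same fold.
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows from fold to fold while the test window always lies after it, which is what prevents training on future data.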

2. Linear Regression methodology: 

2.1 Definition
- Types of Linear Regression:
    - Simple Linear Regression: only one independent variable, Y = b0 + b1*X
    - Multiple Linear Regression: more than one independent variable,   
    Y = m0 + m1X1 + m2X2 + m3X3 + ... + mNXN
    - Polynomial Regression: the independent variable appears with an order higher than 1 (for example, order 2 or 3).
- Logistic Regression, by contrast, is used for classification problems: it predicts the probability of a class of the dependent variable from the independent variables.

2.2 How Linear Regression Model works:
- In Linear Regression, we need to find the parameter set m_i (i from 0 to N) so that the function Y = f(X) best fits the training data set.

- These parameters m_0, m_1, ..., m_N are found during training by minimizing the MSE loss function, using Gradient Descent, Stochastic Gradient Descent, or the closed-form Normal Equation.

The MSE (Mean Squared Error) loss function measures the average squared difference between the predicted values and the true values:

 $$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2$$
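
As an illustration of the fitting step, the sketch below solves the least-squares problem (the Normal Equation route) with NumPy and evaluates the MSE formula above. The synthetic data and the helper name `mse` are assumptions made for this example only.

```python
import numpy as np

# Synthetic data (hypothetical): y = 2 + 3*x1 - 1*x2, with no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Add a column of ones so the intercept m0 is learned with the other parameters.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of the normal equation (X^T X) m = X^T y.
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum((y_i - y_hat_i)^2)."""
    return float(np.mean((y_true - y_pred) ** 2))

y_hat = X_design @ params
print("parameters m0, m1, m2:", np.round(params, 3))
print("training MSE:", mse(y, y_hat))
```

Because the synthetic target is exactly linear, the recovered parameters match the ones used to generate it and the training MSE is essentially zero; real price data would not fit this cleanly.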
 

3. Lasso and Ridge Regression

3.1.1 Definition: 
- Lasso and Ridge Regression are two regularization methods used to reduce overfitting in a Linear Regression model.    
    - Lasso uses L1 regularization: it can remove unimportant variables completely by pushing their coefficients to exactly zero. 
    - Ridge uses L2 regularization: it shrinks the coefficients of unimportant variables close to zero, but in general not exactly to zero, so no variable is removed completely.    
- The choice between these two methods depends on the number of features and the influence of each feature.

3.1.2 How do the models work? They work like Linear Regression, but with a penalty term added to the loss function. 

The formula for the Lasso Regression loss function is:

$$MSE_{Lasso} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \alpha \sum_{j=1}^{p} |w_j| = MSE + \alpha \sum_{j=1}^{p} |w_j|$$  

The formula for Ridge Regression loss function is:

$$MSE_{Ridge} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \alpha \sum_{j=1}^{p} w_j^2 = MSE + \alpha \sum_{j=1}^{p} w_j^2$$  

- $MSE$ is the mean squared error between the predicted values and the true values.
- $n$ is the number of training samples.
- $p$ is the number of features in the model.
- $y_i$ is the true value of the ith sample.
- $\hat{y_i}$ is the predicted value of the ith sample.
- $w_j$ is the weight corresponding to the jth feature in the model.
- $\alpha$ is the regularization parameter.
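
To make the two penalty terms concrete, the sketch below evaluates both loss formulas exactly as written above on a tiny hand-made example. The function names and numbers are assumptions for illustration; library implementations may scale the terms differently (scikit-learn's Lasso, for instance, uses a factor of 1/(2n) on the squared-error term).

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

def lasso_loss(y_true, y_pred, weights, alpha):
    """MSE + alpha * sum(|w_j|), the L1-penalized loss."""
    return mse(y_true, y_pred) + alpha * float(np.sum(np.abs(weights)))

def ridge_loss(y_true, y_pred, weights, alpha):
    """MSE + alpha * sum(w_j^2), the L2-penalized loss."""
    return mse(y_true, y_pred) + alpha * float(np.sum(weights ** 2))

# Hypothetical values chosen so the arithmetic is easy to follow.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])   # MSE = (0 + 0 + 4) / 3
weights = np.array([0.5, -2.0])      # feature weights w_1, w_2
alpha = 0.1

print("Lasso loss:", lasso_loss(y_true, y_pred, weights, alpha))  # 4/3 + 0.1 * 2.5
print("Ridge loss:", ridge_loss(y_true, y_pred, weights, alpha))  # 4/3 + 0.1 * 4.25
```

The large weight (-2.0) contributes 2.0 to the L1 sum but 4.0 to the L2 sum, which shows why Ridge punishes large coefficients more strongly while Lasso applies the same pressure to small and large ones.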
<br>

3.2 Packaging the LinearRegressionModel class for hyperparameter tuning in Ridge and Lasso Regression.

Grid Search and Randomized Search are two approaches to finding good hyperparameters for machine learning models:
- Grid Search tries every combination in a predefined grid:    
    - It is guaranteed to find the best combination within the grid.    
    - Its results are easy to reproduce.    
    - It takes a lot of computation time when the model has many hyperparameters.
- Randomized Search is the reverse: it samples only a fixed number of random combinations, so it stays affordable with many hyperparameters, but it is not guaranteed to find the best combination and is reproducible only with a fixed random seed.
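
A grid search over the regularization parameter could look like the sketch below. It assumes scikit-learn is installed and uses synthetic data; the candidate `alpha` values are arbitrary, and the code does not reproduce the LinearRegressionModel class mentioned above, whose implementation is not shown in these notes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, chronologically ordered data (hypothetical stand-in for the features).
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}  # assumed candidate values

search = GridSearchCV(
    estimator=Ridge(),
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=3),      # training data always precedes test data
    scoring="neg_mean_squared_error",    # scores are maximized, so MSE is negated
)
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("best cross-validated MSE:", -search.best_score_)
```

Swapping `GridSearchCV` for `RandomizedSearchCV` (with an `n_iter` budget and a `random_state`) would give the randomized variant; the same `TimeSeriesSplit` object can be reused as `cv`.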

3.3 Important Features

To identify important features and improve model performance, we can experiment with the features in several ways:

1. Remove features that are not important or that add noise.
2. Test feature combinations (combine existing features to create new features).
3. Consider a different model if the linear model is not powerful enough for the complex relationships between the variables.
4. Combine the important features with other models.
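
One simple way to run the first experiment (removing an unimportant feature) is sketched below: standardize the features, rank them by the absolute size of their linear coefficients, drop the weakest one, and compare the error. The feature names and synthetic data are hypothetical, and the error is measured on the training data only to keep the sketch short; in practice the comparison would be made on the time-series cross-validation folds.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
feature_names = ["volume", "open", "noise"]  # hypothetical feature names
X = rng.normal(size=(n, 3))
# Only the first two features drive the target; "noise" is irrelevant.
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

def fit_and_score(X, y):
    """Fit least squares with an intercept; return (feature coefficients, MSE)."""
    A = np.column_stack([np.ones(len(X)), X])
    params, *_ = np.linalg.lstsq(A, y, rcond=None)
    mse = float(np.mean((y - A @ params) ** 2))
    return params[1:], mse

# Standardize so coefficient magnitudes are comparable across features.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
coefs, mse_all = fit_and_score(X_std, y)

ranking = [feature_names[i] for i in np.argsort(-np.abs(coefs))]
print("features ranked by |coefficient|:", ranking)

# Experiment: drop the lowest-ranked feature and compare the error.
keep = [i for i, name in enumerate(feature_names) if name != ranking[-1]]
_, mse_reduced = fit_and_score(X_std[:, keep], y)
print("MSE with all features:", mse_all)
print("MSE without", ranking[-1] + ":", mse_reduced)
```

If the error barely changes after dropping a feature, that feature is a candidate for removal.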



## 1.2 Idea Slide for MLR + Pictures

1. Linear Regression:
    - Simple Linear Regression: only one independent variable, Y = b0 + b1*X
    - Multiple Linear Regression: more than one independent variable,   
    Y = m0 + m1X1 + m2X2 + m3X3 + ... + mNXN

MSE (Mean Squared Error) loss function: the mean squared error between the predicted values and the true values.

The Linear Regression loss function is:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2$$

2. Lasso and Ridge Regression: 

Lasso Regression loss function is:

$$MSE_{Lasso} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \alpha \sum_{j=1}^{p} |w_j|$$

Ridge Regression loss function is:

$$MSE_{Ridge} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \alpha \sum_{j=1}^{p} w_j^2$$



- $w_j$ is the weight corresponding to the jth feature in the model.
- $\alpha$ is the regularization parameter.
<br>

3. Important Features

Picture: 



## 1.3 Presentation Script for MLR: (with Result of Model)

1. Definition: 

- Multiple Linear Regression is a statistical technique that aims to predict the value of a dependent variable based on multiple independent variables.

We need to distinguish between Simple Linear Regression and Multiple Linear Regression.

- Simple Linear Regression: only one independent variable, Y = b0 + b1*X, where:
    - `Y` is the dependent variable
    - `X` is the independent variable
    - `b0` is the intercept (also known as the constant term)
    - `b1` is the coefficient, or the slope of the regression line
- Multiple Linear Regression: more than one independent variable,   
    Y = m0 + m1X1 + m2X2 + m3X3 + ... + mNXN

Here Y, X1 to XN, and m0 to mN are defined in the same way as in Simple Linear Regression.

2. How Model works:
- In Linear Regression, we need to find the parameter set m_i (i from 0 to N) so that the function Y = f(X) best fits the training data set.

- These parameters m_0, m_1, ..., m_N are found during training by minimizing the MSE loss function, using Gradient Descent, Stochastic Gradient Descent, or the closed-form Normal Equation.

This is the MSE (Mean Squared Error) loss function, which measures the average squared difference between the predicted values and the true values:

 $$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2$$

<!-- Read this formula aloud in English: MSE equals one over n, multiplied by ... -->
"The Mean Squared Error (MSE) is equal to one over n, multiplied by the sum of the squared differences between the actual values (y) and the predicted values (ŷ)."

Where:
- $y_i$ is the actual output value of the i-th data point;
- $\hat{y_i}$ is the output value predicted by the model for the input $x_i$;
- $n$ is the number of data points in the training set.


3. Lasso and Ridge Regression
- To reduce overfitting in the Linear Regression model, we use Lasso and Ridge Regression.
    - Lasso uses L1 regularization: it can remove unimportant variables completely by pushing their coefficients to exactly zero. 
    - Ridge uses L2 regularization: it shrinks the coefficients of unimportant variables close to zero, but in general not exactly to zero, so no variable is removed completely.    
- The choice between these two methods depends on the number of features and the influence of each feature.

3.2. How do the models work? They work like Linear Regression, but with a penalty term added to the loss function. 

The formula for the Lasso Regression loss function is:

$$MSE_{Lasso} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \alpha \sum_{j=1}^{p} |w_j| = MSE + \alpha \sum_{j=1}^{p} |w_j|$$  

The MSE of Lasso equals the MSE plus the regularization parameter alpha multiplied by the sum of the absolute values of the model coefficients (w_j).

The formula for Ridge Regression loss function is:

$$MSE_{Ridge} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \alpha \sum_{j=1}^{p} w_j^2 = MSE + \alpha \sum_{j=1}^{p} w_j^2$$  

The MSE of Ridge equals the MSE plus the regularization parameter alpha multiplied by the sum of squares of the model coefficients (w_j).

Where: 
- $MSE$ is the mean squared error between the predicted values and the true values.
- $n$ is the number of training samples.
- $p$ is the number of features in the model.
- $y_i$ is the true value of the ith sample.
- $\hat{y_i}$ is the predicted value of the ith sample.
- $w_j$ is the weight corresponding to the jth feature in the model.
- $\alpha$ is the regularization parameter.
<br>


# 2 - XGBoost Regression (Extreme Gradient Boosting) in Bitcoin Price Prediction

## 2.1 Methodology for XGB Regression

XGBoost (Extreme Gradient Boosting) is a powerful machine learning algorithm that is widely used for regression and classification tasks. It is an ensemble learning method that combines the predictions of multiple weak learners to create a strong predictive model. Here is the step-by-step methodology for implementing XGBoost Regression for Bitcoin price prediction:

1. Data preprocessing:
- Splitting the dataset: Since this is a time-series dataset, it is crucial to split the data chronologically, ensuring that the training data comes before the testing data.
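
A chronological split needs no shuffling: the first part of the sorted data becomes the training set and the rest becomes the test set. The helper below is a minimal sketch; the function name, the 80/20 ratio, and the placeholder rows are assumptions, not values taken from the project.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split rows that are already sorted by time into (train, test), without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Placeholder rows (label, value), oldest first; not real price data.
rows = [(f"day-{i:02d}", 100.0 + i) for i in range(10)]

train, test = chronological_split(rows)
print("last training row:", train[-1])
print("first test row:", test[0])
```

Every training row is older than every test row, so the model is never evaluated on data that precedes what it was trained on.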

2. XGBoost Regression model methodology:

2.1 Definition: 
- Ensemble Methods: methods that combine multiple models (to do: search Google for a Machine Learning picture)    
    - Stacking techniques    
    - Bagging technique: the Random Forest algorithm.  
    - Boosting technique: XGBoost, LightGBM, CatBoost, AdaBoost. The training process is repeated with different weights, so each new model focuses on the errors of the previous ones.        
        - XGBoost: regularization to reduce overfitting, gradient boosting to optimize tree weights, parallelization within tree construction to speed up the training process (the trees themselves are built one after another), and tools to monitor the training process.        
        - AdaBoost was one of the first boosting algorithms, CatBoost focuses on categorical features, and LightGBM focuses on speed and scalability for large datasets.

- Definition of XGBoost:         
XGBoost is a machine learning algorithm based on decision trees and the gradient boosting technique. It is designed to handle classification and continuous-value estimation (regression) problems, and to be efficient and accurate on large, high-dimensional data sets.

2.2 How does the XGBoost model work? 

- The goal of the XGBoost model is to optimize the loss function, minimizing the difference between the predicted value and the actual value.
- XGBoost uses: regularization to reduce overfitting, gradient boosting to optimize tree weights, parallelization within tree construction to speed up the training process (the trees themselves are built one after another), and tools to monitor the training process.  
- The objective of XGBoost is the sum of the loss over all training data points plus the regularization terms of the trees in the ensemble.

The objective function of XGBoost at iteration t is: 

$$Obj^{(t)}=\sum_{i=1}^{n}l(y_i, \hat{y}_i^{(t-1)}+f_t(x_i))+\Omega(f_t) $$

Where:
- $Obj^{(t)}$ is the value of the objective function at the t-th iteration.
- $n$ is the number of training data points.
- $y_i$ is the target value of the i-th data point.
- $\hat{y}_i^{(t-1)}$ is the model's predicted value for the i-th data point after the previous iteration, t-1.
- $f_t(x_i)$ is the predicted value of the t-th tree on the i-th data point.
- $l(y_i, \hat{y}_i^{(t-1)}+f_t(x_i))$ is the loss function, which measures the difference between the predicted value and the actual value on the i-th data point.
- $\Omega(f_t)$ is the regularization term, which measures the complexity of the t-th tree.
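
The additive update inside the objective, where the prediction at iteration t is the previous prediction plus the output of a new tree, can be illustrated with a very small gradient boosting loop. This is a sketch under stated assumptions, not XGBoost itself: it uses squared-error loss, one-split trees (stumps), a fixed learning rate, and it leaves out the regularization term $\Omega(f_t)$ and XGBoost's second-order approximation. All function names are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the one-split regression tree (stump) that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the largest value so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda q: np.where(q <= threshold, left_value, right_value)

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Additive training: y_hat(t) = y_hat(t-1) + learning_rate * f_t(x)."""
    y_hat = np.full_like(y, y.mean())            # initial prediction
    history = [float(np.mean((y - y_hat) ** 2))]
    for _ in range(n_rounds):
        residual = y - y_hat                     # negative gradient of squared error (up to a factor of 2)
        f_t = fit_stump(x, residual)             # the new tree for this round
        y_hat = y_hat + learning_rate * f_t(x)
        history.append(float(np.mean((y - y_hat) ** 2)))
    return y_hat, history

# Synthetic one-dimensional regression problem (hypothetical).
x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
y_hat, history = boost(x, y)
print("training MSE before boosting:", history[0])
print("training MSE after 20 rounds:", history[-1])
```

Each round fits a new stump to what the current ensemble still gets wrong, so the training error shrinks round by round; XGBoost adds the complexity penalty on each tree to keep this process from overfitting.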

 

## 2.2 Idea Slide for XGB Regression + Pictures

1. Ensemble Methods: 
- Ensemble Methods: methods that combine multiple models (to do: search Google for a Machine Learning picture)    
    - Stacking techniques    
    - Bagging technique: the Random Forest algorithm.  
    - Boosting technique: XGBoost, LightGBM, CatBoost, AdaBoost. The training process is repeated with different weights, so each new model focuses on the errors of the previous ones.        
      

2. XGBoost: 

- Definition of XGBoost:         
XGBoost is a machine learning algorithm based on decision trees and the gradient boosting technique. It is designed to handle classification and continuous-value estimation (regression) problems, and to be efficient and accurate on large, high-dimensional data sets.

- XGBoost uses: regularization to reduce overfitting, gradient boosting to optimize tree weights, parallelization within tree construction to speed up the training process (the trees themselves are built one after another), and tools to monitor the training process.  

The objective function of XGBoost at iteration t is: 

$$Obj^{(t)}=\sum_{i=1}^{n}l(y_i, \hat{y}_i^{(t-1)}+f_t(x_i))+\Omega(f_t)$$

Where:
- $Obj^{(t)}$ is the value of the objective function at the t-th iteration.
- $n$ is the number of training data points.
- $y_i$ is the target value of the i-th data point.
- $\hat{y}_i^{(t-1)}$ is the model's predicted value for the i-th data point after the previous iteration, t-1.
- $f_t(x_i)$ is the predicted value of the t-th tree on the i-th data point.
- $l(y_i, \hat{y}_i^{(t-1)}+f_t(x_i))$ is the loss function, which measures the difference between the predicted value and the actual value on the i-th data point.
- $\Omega(f_t)$ is the regularization term, which measures the complexity of the t-th tree.

 

## 2.3 Presentation Script for XGB Regression (with Result of Model)

0. Ensemble methods combine multiple models. Three common ensemble techniques are Stacking, Bagging, and Boosting.

1. Definition of XGBoost:         
- XGBoost (eXtreme Gradient Boosting) is a machine learning algorithm based on decision trees and the gradient boosting technique. It is designed to handle classification and continuous-value estimation (regression) problems, and to be efficient and accurate on large, high-dimensional data sets.

2. How does the XGBoost model work? 
- The goal of the XGBoost model is to optimize the loss function, minimizing the difference between the predicted value and the actual value.

- XGBoost uses: regularization to reduce overfitting, gradient boosting to optimize tree weights, parallelization within tree construction to speed up the training process (the trees themselves are built one after another), and tools to monitor the training process.  
- The objective of XGBoost is the sum of the loss over all training data points plus the regularization terms of the trees in the ensemble.

The objective function of XGBoost at iteration t is: 

$$Obj^{(t)}=\sum_{i=1}^{n}l(y_i, \hat{y}_i^{(t-1)}+f_t(x_i))+\Omega(f_t) $$

"Objective value at iteration t $Obj^{(t)}$ is equal to the sum of the loss function ($l$) evaluated for each training data point $i$,
 where the loss function compares the target value $(y_i)$ 
 with the predicted value at the previous iteration $\hat{y}_i^{(t-1)}$ plus the predicted value of the t-th tree $f_t(x_i)$,
  and this sum is then added to the regularization term $\Omega(f_t)$."


Where:
- $Obj^{(t)}$ is the value of the objective function at the t-th iteration.
- $n$ is the number of training data points.
- $y_i$ is the target value of the i-th data point.
- $\hat{y}_i^{(t-1)}$ is the model's predicted value for the i-th data point after the previous iteration, t-1.
- $f_t(x_i)$ is the predicted value of the t-th tree on the i-th data point.
- $l(y_i, \hat{y}_i^{(t-1)}+f_t(x_i))$ is the loss function, which measures the difference between the predicted value and the actual value on the i-th data point.
- $\Omega(f_t)$ is the regularization term, which measures the complexity of the t-th tree.