## LASSO (least absolute shrinkage and selection operator):

<br>
<br>
<div style="max-width: 800px; margin: 0 auto; text-align: center;">
<span style="font-family: 'Helvetica'; font-size: 14pt;">
    <b>In statistics and machine learning</b>, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a <b>regression analysis method</b> that performs both <b>variable selection</b> and <b>regularization</b> in order to enhance the <b>prediction accuracy</b> and <b>interpretability</b> of the resulting statistical model.
</span>
</div>

### Statistics and Machine Learning: 
    These are two fields where Lasso is commonly used. Statistics is the science of collecting, analyzing, and drawing conclusions from data, and machine learning is a field of artificial intelligence in which systems learn patterns from data.

### Lasso (Least absolute shrinkage and selection operator): 
    Lasso is a regularized form of linear regression: it fits a linear model while penalizing the absolute size of the coefficients.

### Regression analysis method: 
    Regression analysis is a statistical method used to understand the relationship between dependent and independent variables. In simple terms, you're trying to find a function that best fits your data.

### Variable selection: 
    This is the process of selecting which variables to include in the model. This is important because not all variables contribute significantly to the prediction. Some variables may even make the model perform worse.

### Regularization: 
    Regularization is a technique used to prevent overfitting, which is when a model learns the training data too well and performs poorly on unseen data. Regularization addresses overfitting by adding a penalty term to the loss function. This penalty discourages learning a more complex model and encourages learning a simpler model, thus reducing the risk of overfitting.
    
    There are several types of regularization, two of the most common ones being L1 and L2 regularization:

    L1 Regularization (Lasso): Adds a penalty proportional to the sum of the absolute values of the coefficients.

    L2 Regularization (Ridge): Adds a penalty proportional to the sum of the squared coefficients.

    In the context of linear regression, the loss function is typically the sum of squared residuals. Regularization adds a term to this function that penalizes large values of the coefficients. 
    
    Lasso Regularization Formula

    The objective of linear regression is to minimize the residual sum of squares (RSS), given by:

    RSS = Σ (y_i - (α + Σ β_j * x_ij))^2

    where:

        y_i is the observed output,
        α is the intercept,
        β_j is the coefficient for variable j,
        x_ij is the i-th observation on the j-th variable.
        
    The Lasso regression then introduces a penalty term to this objective, resulting in the following minimization objective:

    Lasso = Σ (y_i - (α + Σ β_j * x_ij))^2 + λ Σ |β_j|

    where:

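    λ is a tuning parameter that controls how strongly the coefficients are penalized. The greater the value of λ, the stronger the penalty and the simpler the resulting model; with λ = 0 the objective reduces to ordinary least squares. This is the parameter we adjust to control overfitting. In scikit-learn it is passed as the `alpha` argument (not to be confused with the intercept α above).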
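
To make the objective above concrete, here is a minimal pure-Python sketch that evaluates RSS + λ Σ |β_j| for given coefficients. The function name `lasso_objective` and the toy data are hypothetical and only serve as an illustration; this is not how scikit-learn computes it internally.

```python
def lasso_objective(X, y, intercept, beta, lam):
    """Return RSS + lam * sum(|beta_j|), following the formula above."""
    rss = 0.0
    for x_i, y_i in zip(X, y):
        prediction = intercept + sum(b * x for b, x in zip(beta, x_i))
        rss += (y_i - prediction) ** 2
    l1_penalty = lam * sum(abs(b) for b in beta)
    return rss + l1_penalty


# Hypothetical toy data: 3 observations, 2 features.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_toy = [2.0, 1.0, 3.0]

# beta = [2, 1] fits the toy data exactly, so RSS = 0 and only the penalty remains.
print(lasso_objective(X_toy, y_toy, 0.0, [2.0, 1.0], lam=0.0))  # 0.0
print(lasso_objective(X_toy, y_toy, 0.0, [2.0, 1.0], lam=0.5))  # 1.5
```

Note that scikit-learn's `Lasso` minimizes a rescaled version of this objective (the RSS is divided by twice the number of samples), so its `alpha` is not numerically interchangeable with the λ used here.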

### Prediction accuracy: 
    This refers to how close the model's predictions are to the actual values. In this notebook it is measured by the mean squared error (MSE) on the held-out test set, where a lower value means the predictions are closer to the actual values.
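
As a small illustration of the metric used in the cells below, the mean squared error can be written by hand. This is a sketch with hypothetical toy numbers; the notebook itself relies on `sklearn.metrics.mean_squared_error`.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Errors are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3.
print(mse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))
```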

### Interpretability: 
    This is about how well we can understand the model. An interpretable model is one where we can easily understand the relationship between the inputs and the output.


    

In [1]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
# This dataset consists of 10 baseline variables (age, sex, body mass index,
# average blood pressure, and six blood serum measurements) measured on
# 442 patients, and an indication of disease progression after one year.
diabetes = load_diabetes()

X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

y = diabetes.target

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [3]:
X.head()

        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031988 -0.046641


In [4]:
# linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)

print("Linear Regression Coefficients:")
print(pd.Series(lr.coef_, index=X.columns))

Linear Regression Coefficients:
age     29.254013
sex   -261.706469
bmi    546.299723
bp     388.398341
s1    -901.959668
s2     506.763241
s3     121.154351
s4     288.035267
s5     659.268951
s6      41.376701
dtype: float64


In [5]:
# lasso regression
lasso = Lasso(alpha=0.2)
lasso.fit(X_train, y_train)

print("\nLasso Coefficients:")
print(pd.Series(lasso.coef_, index=X.columns))


Lasso Coefficients:
age      0.000000
sex   -102.887375
bmi    554.780358
bp     300.339972
s1      -0.000000
s2      -0.000000
s3    -238.030096
s4       0.000000
s5     331.459416
s6       0.000000
dtype: float64


In [6]:
# compute mean squared error on test set for linear regression
lr_pred = lr.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_pred)

# compute mean squared error on test set for lasso
lasso_pred = lasso.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)

print("\nMean Squared Error (Linear Regression): ", lr_mse)
print("Mean Squared Error (Lasso): ", lasso_mse)


Mean Squared Error (Linear Regression):  2821.7509810013107
Mean Squared Error (Lasso):  2806.108707743924


## Ridge regression:

<br>
<br>
<div style="max-width: 800px; margin: 0 auto; text-align: center;">
<span style="font-family: 'Helvetica'; font-size: 14pt;">
    Similar to LASSO regression, Ridge regression uses regularization to prevent overfitting. They both work by adding a penalty to the loss function that the model minimizes. The key differences between them lie in the form of this penalty.
</span>
</div>

### Regularization: 
    Ridge regression uses L2 regularization: it adds a penalty proportional to the sum of the squared coefficients. Instead of minimizing RSS alone, it minimizes RSS + λ ||β||_2^2, where ||β||_2^2 = Σ β_j^2. Lasso, by contrast, uses the L1 penalty λ Σ |β_j|.
    
### Feature Selection:
    Lasso can perform feature selection: it can shrink the coefficients of less important features exactly to zero, eliminating them from the model altogether. Ridge does not have this property. It shrinks the coefficients towards zero, and a larger regularization parameter λ (the `alpha` argument in scikit-learn) shrinks them further, but in general it does not set any of them exactly to zero.
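
The difference is easiest to see in the simplest possible case: a single coefficient b fitted to a single number z. The sketch below assumes this simplified one-dimensional setting (it is not the general multi-feature solver), and the function names `lasso_1d` and `ridge_1d` are hypothetical. Minimizing (z - b)^2 + λ|b| gives a soft-thresholding rule, while minimizing (z - b)^2 + λ b^2 gives proportional shrinkage.

```python
def lasso_1d(z, lam):
    """Minimizer of (z - b)**2 + lam * |b|: soft-thresholding at lam / 2."""
    shrunk = max(abs(z) - lam / 2.0, 0.0)
    if shrunk == 0.0:
        return 0.0  # small inputs are set exactly to zero
    return shrunk if z > 0 else -shrunk


def ridge_1d(z, lam):
    """Minimizer of (z - b)**2 + lam * b**2: proportional shrinkage."""
    return z / (1.0 + lam)


# Compare both rules for a large, a small, and a small negative input.
for z in (3.0, 0.5, -0.2):
    print(z, lasso_1d(z, lam=2.0), ridge_1d(z, lam=2.0))
```

With λ = 2, every input whose absolute value is at most 1 is mapped to exactly zero by the lasso rule, whereas the ridge rule only divides it by 3 and never reaches zero for a nonzero input.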
    
### Multicollinearity:
    When faced with highly correlated (multicollinear) features, Ridge regression tends to distribute the coefficient load among them, while Lasso is likely to pick one and disregard the others.

In [7]:
# linear regression
lr = LinearRegression()
lr.fit(X_train, y_train)

print("Linear Regression Coefficients:")
print(pd.Series(lr.coef_, index=X.columns))

Linear Regression Coefficients:
age     29.254013
sex   -261.706469
bmi    546.299723
bp     388.398341
s1    -901.959668
s2     506.763241
s3     121.154351
s4     288.035267
s5     659.268951
s6      41.376701
dtype: float64


In [8]:
# ridge regression
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)

print("\nRidge Coefficients:")
print(pd.Series(ridge.coef_, index=X.columns))


Ridge Coefficients:
age     39.663529
sex   -213.846880
bmi    505.914292
bp     341.714474
s1    -108.806301
s2     -70.575810
s3    -211.906580
s4     160.193540
s5     332.773542
s6      77.680452
dtype: float64


In [9]:
# compute mean squared error on test set for linear regression
lr_pred = lr.predict(X_test)
lr_mse = mean_squared_error(y_test, lr_pred)

# compute mean squared error on test set for ridge
ridge_pred = ridge.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_pred)

print("\nMean Squared Error (Linear Regression): ", lr_mse)
print("Mean Squared Error (Ridge): ", ridge_mse)


Mean Squared Error (Linear Regression):  2821.7509810013107
Mean Squared Error (Ridge):  2805.401298319341


## Decision Tree:

<br>
<br>
<div style="max-width: 800px; margin: 0 auto; text-align: center;">
<span style="font-family: 'Helvetica'; font-size: 14pt;">
    Decision Trees are a type of supervised learning algorithm that is often used for classification problems and, as in this notebook, for regression. They work for both categorical and continuous input and output variables. In a decision tree, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
</span>
</div>

### Selection of the Best Attribute:
    The first step is to select the best attribute for the root node. This is done using different attribute selection measures like Information Gain, Gini Index, Chi-Square, etc.
    
### Tree Creation:
    After finding the root node, the data is divided into subsets. The subsets should be created in such a way that each subset contains data with the same value for an attribute. This process is repeated for each branch (recursive partitioning).
    
### Stop Condition:
    Recursive partitioning can produce a tree with too many branches and too much depth, which leads to overfitting. We therefore set a stopping criterion that halts tree growth, such as a maximum depth or a minimum number of samples at a node.
    
### DecisionTreeRegressor:
    A Machine Learning model that uses a decision tree to predict a continuous outcome variable (regression). It belongs to the family of supervised learning algorithms.
    
### How it works:
    The DecisionTreeRegressor, like all decision trees, works by splitting the dataset into distinct subsets based on certain conditions. The goal is to create subsets that are as pure as possible, meaning that the instances in each subset should have similar target values. The decision tree algorithm accomplishes this by selecting the conditions that produce the most significant improvement in purity for each split. This process is repeated recursively until a stopping condition is met.
    
    For regression, the quality of a split is usually measured by the reduction in variance of the target values (the variance is the squared standard deviation). The best split is found as follows:
    
    1) For each feature, the dataset is sorted by that feature's values.
    2) Then, for each unique value of that feature (a candidate threshold), the dataset is split into two: one subset with instances whose feature value is below or equal to the threshold, and one with instances above that threshold.
    3) The variance of the target values in each subset is calculated and combined in a weighted sum, where the weights are the proportions of instances in each subset.
    4) The best split is the one that has the smallest weighted variance.
    5) The algorithm keeps track of the feature and threshold that produces the best split.
    6) The process is repeated recursively for each subset until a stopping condition is met, such as reaching a maximum depth, or if the improvement in variance reduction falls below a threshold.
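
Steps 1 to 5 above can be sketched in plain Python. This is a minimal, unoptimized illustration on hypothetical toy data, with hypothetical helper names `variance` and `best_split`; scikit-learn's `DecisionTreeRegressor` is far more efficient and differs in its details.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(X, y):
    """Return (feature_index, threshold, weighted_variance) of the best split.

    X is a list of rows (lists of floats), y the list of target values.
    """
    n = len(y)
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        # Steps 1-2: try each unique value of feature j as a threshold.
        for threshold in sorted(set(row[j] for row in X)):
            left = [y_i for row, y_i in zip(X, y) if row[j] <= threshold]
            right = [y_i for row, y_i in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            # Step 3: weighted sum of the variances of the two subsets.
            score = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
            # Steps 4-5: remember the feature and threshold with the smallest score.
            if score < best[2]:
                best = (j, threshold, score)
    return best


# Hypothetical toy data: the target jumps when feature 0 exceeds 2.0,
# while feature 1 is unrelated to the target.
X_toy = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
y_toy = [10.0, 10.0, 20.0, 20.0]
print(best_split(X_toy, y_toy))  # (0, 2.0, 0.0)
```

Step 6 would apply `best_split` again to the left and right subsets until a stopping condition is met.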

In [10]:
dt = DecisionTreeRegressor(random_state=42, max_depth=3)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

mse_dt = mean_squared_error(y_test, y_pred_dt)
mse_lr = mean_squared_error(y_test, y_pred_lr)


print("\nMean Squared Error (Linear Regression): ", mse_lr)
print("Mean Squared Error (Decision Tree Regressor): ", mse_dt)


Mean Squared Error (Linear Regression):  2821.7509810013107
Mean Squared Error (Decision Tree Regressor):  3616.769894653006


## Random Forest Regression:

<br>
<br>
<div style="max-width: 800px; margin: 0 auto; text-align: center;">
<span style="font-family: 'Helvetica'; font-size: 14pt;">
    Random Forest Regression is a machine learning algorithm that employs an ensemble of decision trees to perform regression tasks. "Ensemble" here means that it uses multiple learning models internally, and the final prediction is made by averaging the predictions of each individual model.
</span>
</div>

### Bootstrapping: 
    Random Forest starts by drawing random samples from the original dataset, one per tree, with replacement (meaning the same instance can be selected multiple times within a sample). This technique is called bootstrapping. Each sample is then used to train a separate decision tree.
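
Bootstrapping itself is simple to sketch with the standard library. The example below is an illustration only: the helper name `bootstrap_sample` is hypothetical, and it assumes each sample has the same size as the original dataset.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
data = list(range(10))   # hypothetical stand-in for 10 training rows

# One bootstrap sample per tree; some rows repeat and others are left out.
for tree_index in range(3):
    sample = bootstrap_sample(data, rng)
    unique = len(set(sample))
    print(f"tree {tree_index}: {sample} ({unique} unique rows)")
```

Because sampling is done with replacement, each tree typically sees only part of the distinct rows, which is one source of the diversity between trees.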
    
### Random Feature Selection:
    At each node in each decision tree, Random Forest selects a random subset of features rather than using all features to determine the best split. This randomness ensures that the trees are de-correlated and makes the model more robust against overfitting.
    
### Training and Prediction:
    Each tree is grown until a stopping condition is met (for example a maximum depth, if one is set) and makes its prediction independently. For regression tasks, the final prediction of the Random Forest is the average of the predictions of all trees.
    
### Differences to DecisionTreeRegressor:

    1) Ensemble vs Single Model: The fundamental difference between a RandomForestRegressor and a DecisionTreeRegressor is that the former is an ensemble of decision trees, whereas the latter is a single decision tree. This means a RandomForestRegressor can capture more complexity than a single DecisionTreeRegressor.

    2) Overfitting: Decision trees are notorious for overfitting to the training data, especially if the trees are allowed to grow too deep or complex. Random forests, however, are less prone to overfitting. This is because each tree in the ensemble is trained on a different subset of the data and uses a random subset of features to make splits. The randomness ensures that each tree is somewhat different, and the final prediction, which is an average of all the trees, is less likely to be influenced by the noise or fluctuations in the data.

    3) Variance and Bias: RandomForestRegressor generally has lower variance than a single DecisionTreeRegressor, at the cost of a similar or slightly higher bias. Averaging the predictions of many de-correlated trees reduces the variance, while the slight increase in bias comes from each tree seeing only a bootstrap sample and a random subset of features. The net effect is usually a better overall model.

    4) Interpretability: Decision trees are relatively simple to interpret because you can visualize the entire tree structure and follow the decision path. Random forests, on the other hand, are more challenging to interpret due to the large number of trees. However, they offer feature importance metrics, which can provide insights about which features have contributed the most to the predictions.

In [11]:
rf = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=3)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

# Compare Mean Squared Errors
mse_rf = mean_squared_error(y_test, y_pred_rf)
mse_lr = mean_squared_error(y_test, y_pred_lr)

print(f"Mean Squared Error (Linear Regression): {mse_lr}")
print(f"Mean Squared Error (Random Forest Regressor): {mse_rf}")


Mean Squared Error (Linear Regression): 2821.7509810013107
Mean Squared Error (Random Forest Regressor): 2737.34564593494


## Boosting:

<br>
<br>
<div style="max-width: 800px; margin: 0 auto; text-align: center;">
<span style="font-family: 'Helvetica'; font-size: 14pt;">
    Boosting is a type of ensemble learning technique that aims to convert a set of weak learners into a single strong learner. A weak learner is a model that performs relatively poorly: its accuracy is above chance, but just barely. Boosting algorithms iteratively train a sequence of weak learners, with each iteration aiming to correct the mistakes made by the previous learners. The idea is that by combining many weak models, we can produce a powerful, robust model that has lower bias and variance. Here are a few popular boosting methods:
</span>
</div>




### AdaBoost (Adaptive Boosting): 
    This is one of the earliest and simplest boosting algorithms. Each new weak learner is trained with a weighted version of the data, where the weights depend on the errors of the previous weak learner: instances that were misclassified by the previous learner are given more weight, while instances that were correctly classified are given less weight. This encourages the new weak learner to focus more on the harder cases.
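
The reweighting step can be sketched as follows. This assumes the classic discrete AdaBoost update for binary classification; the function name `adaboost_reweight` and the four-sample example are hypothetical.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One AdaBoost weight update for binary classification.

    weights: current sample weights (summing to 1).
    misclassified: list of booleans, True where the weak learner was wrong.
    Returns (learner_weight, new_weights).
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must be in (0, 0.5) for a weak learner")
    # Learners with a lower weighted error get a larger say in the ensemble.
    learner_weight = 0.5 * math.log((1.0 - error) / error)
    # Increase the weight of misclassified samples, decrease the others.
    updated = [
        w * math.exp(learner_weight if wrong else -learner_weight)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return learner_weight, [w / total for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, False, False, True]
learner_weight, new_weights = adaboost_reweight(weights, misclassified)
print(round(learner_weight, 3), [round(w, 3) for w in new_weights])
# 0.549 [0.167, 0.167, 0.167, 0.5]
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.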

### Gradient Boosting: 
    This approach is more general than AdaBoost. Instead of tweaking instance weights, Gradient Boosting fits each new weak learner to the residual errors of the current ensemble. For squared-error loss these residuals are (up to a constant factor) the negative gradient of the loss with respect to the predictions, so adding the new learner to the ensemble takes a step in the direction that reduces the overall training loss (hence the "gradient" in "gradient boosting"). The same idea works with any differentiable loss function, not just the one used by AdaBoost.
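
A minimal pure-Python sketch of this idea, assuming squared-error loss, a single input feature, and one-split "stumps" as weak learners. All function names, the learning rate, and the toy data are hypothetical choices for illustration; real libraries such as XGBoost are considerably more elaborate.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump; returns (threshold, left_value, right_value)."""
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for x_i, r in zip(x, residuals) if x_i <= threshold]
        right = [r for x_i, r in zip(x, residuals) if x_i > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x_i):
    threshold, left_value, right_value = stump
    return left_value if x_i <= threshold else right_value


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps on squared-error loss.

    Returns (initial_prediction, stumps, training_mse_per_round).
    """
    initial = sum(y) / len(y)  # start from the mean of the targets
    predictions = [initial] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is proportional to the residual.
        residuals = [y_i - p for y_i, p in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Add a damped version of the new learner to the ensemble.
        predictions = [p + learning_rate * predict_stump(stump, x_i)
                       for p, x_i in zip(predictions, x)]
        history.append(sum((y_i - p) ** 2 for y_i, p in zip(y, predictions)) / len(y))
    return initial, stumps, history


# Hypothetical toy data: a noise-free curve on one feature.
x_toy = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
_, _, history = gradient_boost(x_toy, y_toy)
print(f"training MSE after round 1: {history[0]:.3f}, after round 20: {history[-1]:.3f}")
```

Each round lowers the training error a little. This sketch only tracks training error; on real data, running many rounds without regularization or a held-out check can overfit.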

### XGBoost (Extreme Gradient Boosting): 
    XGBoost is an efficient and flexible implementation of gradient boosting. It uses a more regularized model formalization to control overfitting, which often gives it better performance. It also includes features for handling missing values and for making the algorithm computationally efficient.

### LightGBM: 
    This is a gradient boosting framework that uses tree-based learning algorithms and grows trees leaf-wise, whereas many other tree-based algorithms grow them level-wise. It is designed to handle large datasets and to use less memory.

### CatBoost: 
    CatBoost is a gradient boosting library that handles categorical variables natively, so they do not need to be manually encoded beforehand. This lets you focus on tuning the model rather than on preprocessing errors. It often gives strong results with its default parameters, without extensive hyperparameter tuning.



In [12]:
# random_state is the scikit-learn style name for the random seed
xgbr = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000, random_state=42)
xgbr.fit(X_train, y_train)
y_pred_xgb = xgbr.predict(X_test)

# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mse_lr = mean_squared_error(y_test, y_pred_lr)

print(f"Mean Squared Error (Linear Regression): {mse_lr}")
print(f"Mean Squared Error (XGBoost Regressor): {mse_xgb}")


Mean Squared Error (Linear Regression): 2821.7509810013107
Mean Squared Error (XGBoost Regressor): 3582.7540439353597
