<h1>Import libraries</h1>

In [1]:
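# Explicit imports for names used throughout the notebook; listed first so that
# same-named exports from the project modules below still take precedence
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import KFold, train_test_split
from Model.Ridge_from_scratch import *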
from Model.Lasso_from_scratch import *
from Model.ElasticNet_from_scratch import *
from Metrics.Classification_metrics import *
from Metrics.Regression_metrics import *
from Plots.Prediction_plots import *
Plots_predictions = Prediction_plots()

<h1>Explanation of Regularization</h1>

$\text{Regularization is a method used to counter overfitting in linear and logistic regression.}$<p>
$\text{Overfitting occurs when our algorithm returns good predictions on the training data, but much weaker ones on unseen test data.}$<p>
$\text{Generally speaking, for any Machine Learning algorithm it is important that it fits the data as well as possible,}$
$\text{while minimizing the variance of the error (between predictions and true values) across different sets of analyzed data.}$<p>
$\text{Adding a "regularization term" typically increases the error slightly (on the training data), while decreasing the variance.}$<p>
$\text{Each regularization algorithm reduces the complexity of the model, thus reducing overfitting.}$<p>
$\text{The trade-off between error and variance is a classic problem for any algorithm.}$<p>
$\text{The following sections present regularization algorithms for which the following relationships hold:}$<p>
$\text{- As } \alpha \text{ increases, the error }\textbf{increases,}$<p>
$\text{- As } \alpha \text{ increases, the variance }\textbf{decreases}.$
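
$\text{As a rough illustration of these two relationships, the sketch below fits an L2-penalized regression in closed form (without intercept) for several values of } \alpha.$<p>
$\text{The synthetic data and the helper function are assumptions made for this example only; the shrinking coefficient norm serves as a simple stand-in for the decreasing variance.}$

```python
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=50)

def ridge_fit(X, y, alpha):
    """Closed-form L2-penalized coefficients (no intercept): (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
train_mse = []
coef_norm = []
for alpha in alphas:
    beta = ridge_fit(X, y, alpha)
    train_mse.append(np.mean((y - X @ beta) ** 2))
    coef_norm.append(np.linalg.norm(beta))
    print("alpha: {}; train MSE: {:.4f}; coefficient norm: {:.4f}".format(alpha, train_mse[-1], coef_norm[-1]))
```

$\text{The training error should not decrease as } \alpha \text{ grows, while the coefficients are pulled towards zero.}$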

<h1>Advantages and disadvantages of Regularization</h1>

$\text{Advantages of Regularization:}$<p>
$\text{- Reduces overfitting by reducing the complexity of the model,}$<p>
$\text{- If there is correlation between variables, it works better than ordinary (unregularized) regression,}$<p>
$\text{- For a large number of variables, shorter computation time than ordinary regressions.}$<br>

$\text{Disadvantages of Regularization:}$<p>
$\text{- Reducing the complexity of the model can lead to underfitting (the model is too simple),}$<p>
$\text{- Regularization can make the model harder to interpret due to the restrictions placed on the coefficients.}$

<h2>Explanation of Ridge Regression</h2>

$\text{The first regularization algorithm that I would like to explain is Ridge Regression.}$<p><br>

$\text{Ridge Regression is an } \textbf{L2 regularization algorithm.}$<p>
$\text{Our linear regression minimization problem, for Ridge (after taking into account the "regularization term"), will look like this:}$<p>
$$RSS_{Ridge}=\sum_{i=1}^{N}\left(y_i-{\hat{\beta}}_{Ridge}\times X_i\right)^2+\alpha\times\sum_{m=0}^{M}{{\hat{\beta}}_{Ridge,\ m}\ }^2$$
$\text{Where: } y_i \text{ - dependent variable for observation } i,$<p>
${\hat{\beta}}_{Ridge} \text{ - vector of estimated coefficients,}$<p>
$X_i \text{ - vector of independent variables for observation } i,$<p>
$\alpha \text{ - shrinkage parameter,}$<p>
$M \text{ - the number of independent variables used in the model,}$<p>
${\hat{\beta}}_{Ridge,\ m} \text{ - estimated coefficient value for variable m.}$<p>

$\text{Thus, when using gradient optimization:}$<p>
$$Loss=\frac{1}{N}\times\left(\sum_{i=1}^{N}\left(y_i-{\hat{\beta}}_{Ridge}\times X_i\right)^2+\alpha\times\sum_{m=0}^{M}{{\hat{\beta}}_{Ridge,\ m}\ }^2\right)=\frac{1}{N}\times\left(\sum_{i=1}^{N}\left({y_i}^2-2\times y_i\times{\hat{\beta}}_{Ridge}\times X_i+{{\hat{\beta}}_{Ridge}}^2\times{X_i}^2\right)+\alpha\times\sum_{m=0}^{M}{{\hat{\beta}}_{Ridge,\ m}\ }^2\right)$$

$$\frac{\partial Loss}{\partial{\hat{\beta}}_{Ridge}}=\frac{1}{N}\times\left(\sum_{i=1}^{N}\left(-2\times y_i\times X_i+2\times{\hat{\beta}}_{Ridge}\times{X_i}^2\right)+2\times\alpha\times\sum_{m=0}^{M}{\hat{\beta}}_{Ridge,\ m}\right)=-\frac{2}{N}\times\left(\sum_{i=1}^{N}\left(X_i\times\left(y_i-{\hat{\beta}}_{Ridge}\times X_i\right)\right)-\alpha\times\sum_{m=0}^{M}{\hat{\beta}}_{Ridge,\ m}\right)$$

$${\hat{\beta}}_{New}={\hat{\beta}}_{Old}+learning\ rate\times\frac{2}{N}\times\left(\sum_{i=1}^{N}\left(X_i\times\left(y_i-{\hat{\beta}}_{Old}\times X_i\right)\right)-\alpha\times\sum_{m=0}^{M}{\hat{\beta}}_{Old,\ m}\right)$$

$\text{If we would like to write this in matrix form then:}$<p>
$${\hat{\beta}}_{Ridge}=\left(X^\prime\times X+\alpha\times I\right)^{-1}\times X^\prime\times y$$
$\text{Where: } \alpha \text{ - shrinkage parameter,}$<p>
$I \text{ - identity matrix.}$<p>
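
$\text{The gradient update and the matrix formula above should lead to the same coefficients. A minimal sketch on synthetic data, without intercept:}$<p>
$\text{the data, learning rate and number of iterations are assumptions, and the penalty gradient is applied per coefficient (} \alpha \text{ times each coefficient).}$

```python
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(1)
N, M = 200, 3
X = rng.normal(size=(N, M))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.3, size=N)
alpha = 5.0

# Matrix form: (X'X + alpha*I)^-1 X'y
beta_closed = np.linalg.solve(X.T @ X + alpha * np.eye(M), X.T @ y)

# Gradient descent using the update rule from the text
beta = np.zeros(M)
learning_rate = 0.1
for _ in range(2000):
    residual = y - X @ beta
    beta = beta + learning_rate * (2 / N) * (X.T @ residual - alpha * beta)

print("closed form:     ", np.round(beta_closed, 6))
print("gradient descent:", np.round(beta, 6))
```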

$\text{We can also apply the Ridge algorithm to a classification problem.}$<p>
$\text{To do that we have to encode our dependent variable as: } \{1,-1\} \text{ and treat the task as a regression problem.}$<p>
$\text{Thus, the calculations are exactly the same as for regression!}$<p>
$\text{This means that we determine the optimal coefficients using the formula:}$

$${\hat{\beta}}_{Ridge}=\left(X^\prime\times X+\alpha\times I\right)^{-1}\times X^\prime\times y$$

$\text{In the multiclass case, we will have to treat each class as a separate one-vs-rest problem, by which we will}$<p>
$\text{obtain a coefficient matrix with dimensions (when the intercept is included in model): } [number\ of\ features+1;\ number\ of\ classes].$
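
$\text{A minimal sketch of this multiclass scheme, assuming a one-vs-rest encoding, the closed-form Ridge solution and synthetic data (all names below are illustrative).}$<p>
$\text{The intercept is penalized like every other coefficient here, in line with the sums above that start at } m=0.$

```python
import numpy as np

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(2)
n_samples, n_features, n_classes = 120, 4, 3
X = rng.normal(size=(n_samples, n_features))
labels = rng.integers(0, n_classes, size=n_samples)
alpha = 1.0

# Add a column of ones so that the first coefficient acts as the intercept
X_design = np.hstack([np.ones((n_samples, 1)), X])

# One target column per class: +1 for that class, -1 for all others
Y = np.where(labels[:, None] == np.arange(n_classes)[None, :], 1.0, -1.0)

# Solve the Ridge closed form for all classes at once
coefficients = np.linalg.solve(X_design.T @ X_design + alpha * np.eye(n_features + 1), X_design.T @ Y)
print(coefficients.shape)  # (5, 3)

# Predict the class whose regression score is the highest
predicted = np.argmax(X_design @ coefficients, axis=1)
```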

<h2>Explanation of Lasso</h2>

$\text{Lasso (Least Absolute Shrinkage and Selection Operator) is an } \textbf{L1 regularization algorithm.}$<p>
$\text{Our linear regression minimization problem, for Lasso (after taking into account the "regularization term"), will look like this:}$<p>

$$RSS_{Lasso}=\sum_{i=1}^{N}\left(y_i-{\hat{\beta}}_{Lasso}\times X_i\right)^2+\alpha\times\sum_{m=0}^{M}\left|{\hat{\beta}}_{Lasso,m}\right|$$

$\text{Where: } y_i \text{ - dependent variable for observation i,}$<p>
${\hat{\beta}}_{Lasso} \text{ - vector of estimated coefficients,}$<p>
$X_i \text{ - vector of independent variables for observation i,}$<p>
$\alpha \text{ - shrinkage parameter,}$<p>
$M \text{ - the number of independent variables used in the model,}$<p>
${\hat{\beta}}_{Lasso,m} \text{ - estimated coefficient value for variable m.}$<p>

$\text{Thus, when using gradient optimization:}$

$$Loss=\frac{1}{N}\times\left(\sum_{i=1}^{N}\left(y_i-{\hat{\beta}}_{Lasso}\times X_i\right)^2+\alpha\times\sum_{m=0}^{M}\left|{\hat{\beta}}_{Lasso,m}\right|\right)=\frac{1}{N}\times\left(\sum_{i=1}^{N}\left({y_i}^2-2\times y_i\times{\hat{\beta}}_{Lasso}\times X_i+{{\hat{\beta}}_{Lasso}}^2\times{X_i}^2\right)+\alpha\times\sum_{m=0}^{M}\left|{\hat{\beta}}_{Lasso,m}\right|\right)$$

$$\frac{\partial Loss}{\partial{\hat{\beta}}_{Lasso}}=\frac{1}{N}\times\left(\sum_{i=1}^{N}\left(-2\times y_i\times X_i+2\times{\hat{\beta}}_{Lasso}\times{X_i}^2\right)+\alpha\times\sum_{m=0}^{M}\frac{{\hat{\beta}}_{Lasso,m}}{\left|{\hat{\beta}}_{Lasso,m}\right|}\right)=-\frac{2}{N}\times\left(\sum_{i=1}^{N}\left(X_i\times\left(y_i-{\hat{\beta}}_{Lasso}\times X_i\right)\right)-\frac{\alpha}{2}\times\sum_{m=0}^{M}\frac{{\hat{\beta}}_{Lasso,m}}{\left|{\hat{\beta}}_{Lasso,m}\right|}\right)$$

$${\hat{\beta}}_{New}={\hat{\beta}}_{Old}+learning\ rate\times\frac{2}{N}\times\left(\sum_{i=1}^{N}\left(X_i\times\left(y_i-{\hat{\beta}}_{Old}\times X_i\right)\right)-\frac{\alpha}{2}\times\sum_{m=0}^{M}\frac{{\hat{\beta}}_{Old,m}}{\left|{\hat{\beta}}_{Old,m}\right|}\right)$$
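
$\text{A single step of the update above can be sketched as follows. The helper is hypothetical and applies the penalty per coefficient;}$<p>
$\text{it uses the sign function, which returns 0 for a coefficient equal to 0, where the ratio in the formula is undefined.}$

```python
import numpy as np

def lasso_step(beta_old, X, y, alpha, learning_rate):
    """One Lasso update following the formula above (hypothetical helper)."""
    n_obs = X.shape[0]
    residual = y - X @ beta_old
    # np.sign returns 0 at beta = 0, one valid subgradient choice for |beta|
    penalty = (alpha / 2) * np.sign(beta_old)
    return beta_old + learning_rate * (2 / n_obs) * (X.T @ residual - penalty)

# Tiny example: the fit is already perfect, so only the penalty moves the coefficient
X = np.array([[1.0], [1.0]])
y = np.array([1.0, 1.0])
beta_new = lasso_step(np.array([1.0]), X, y, alpha=2.0, learning_rate=0.5)
print(beta_new)  # [0.5]
```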

$\text{We can also apply the Lasso algorithm to a classification problem.}$<p>
$\text{To do that we have to encode our dependent variable as: } \{1,-1\} \text{ and treat the task as a regression problem.}$<br>

$\text{In the multiclass case, we will have to treat each class as a separate one-vs-rest problem, by which we will}$<p>
$\text{obtain a coefficient matrix with dimensions (when the intercept is included in model): } [number\ of\ features+1;\ number\ of\ classes].$

<h2>Explanation of ElasticNet</h2>

$\text{ElasticNet is a combination of } \textbf{L1 and L2 regularization algorithms.}$<p>
$\text{Our linear regression minimization problem, for ElasticNet (after taking into account both "regularization terms"), will look like this:}$<p>

$$RSS_{Elastic}=\sum_{i=1}^{N}\left(y_i-{\hat{\beta}}_{Elastic}\times X_i\right)^2+\alpha_1\times\sum_{m=0}^{M}\left|{\hat{\beta}}_{Elastic,\ m}\right|+\alpha_2\times\sum_{m=0}^{M}{{\hat{\beta}}_{Elastic,\ m}}^2$$

$\text{Where: } y_i \text{ - dependent variable for observation i,}$<p>
${\hat{\beta}}_{Elastic} \text{ - vector of estimated coefficients,}$<p>
$X_i \text{ - vector of independent variables for observation i,}$<p>
$\alpha_1,\ \alpha_2 \text{ - shrinkage parameters of the L1 and L2 terms,}$<p>
$M \text{ - the number of independent variables used in the model,}$<p>
${\hat{\beta}}_{Elastic,m} \text{ - estimated coefficient value for variable m.}$<p>

$\text{Thus, when using gradient optimization:}$

$$Loss=\frac{1}{N}\times\left(\sum_{i=1}^{N}\left(y_i-{\hat{\beta}}_{Elastic}\times X_i\right)^2+\alpha_1\times\sum_{m=0}^{M}\left|{\hat{\beta}}_{Elastic,\ m}\right|+\alpha_2\times\sum_{m=0}^{M}{{\hat{\beta}}_{Elastic,\ m}}^2\right)=\frac{1}{N}\times\left(\sum_{i=1}^{N}\left({y_i}^2-2\times y_i\times{\hat{\beta}}_{Elastic}\times X_i+{{\hat{\beta}}_{Elastic}}^2\times{X_i}^2\right)+\alpha_1\times\sum_{m=0}^{M}\left|{\hat{\beta}}_{Elastic,\ m}\right|+\alpha_2\times\sum_{m=0}^{M}{{\hat{\beta}}_{Elastic,\ m}}^2\right)$$

$$\frac{\partial Loss}{\partial{\hat{\beta}}_{Elastic}}=\frac{1}{N}\times\left(\sum_{i=1}^{N}\left(-2\times y_i\times X_i+2\times{\hat{\beta}}_{Elastic}\times{X_i}^2\right)+\alpha_1\times\sum_{m=0}^{M}\frac{{\hat{\beta}}_{Elastic,\ m}}{\left|{\hat{\beta}}_{Elastic,\ m}\right|}+2\times\alpha_2\times\sum_{m=0}^{M}{\hat{\beta}}_{Elastic,\ m}\right)=-\frac{2}{N}\times\left(\sum_{i=1}^{N}\left(X_i\times\left(y_i-{\hat{\beta}}_{Elastic}\times X_i\right)\right)-\frac{\alpha_1}{2}\times\sum_{m=0}^{M}\frac{{\hat{\beta}}_{Elastic,\ m}}{\left|{\hat{\beta}}_{Elastic,\ m}\right|}-\alpha_2\times\sum_{m=0}^{M}{\hat{\beta}}_{Elastic,\ m}\right)$$

$${\hat{\beta}}_{New}={\hat{\beta}}_{Old}+learning\ rate\times\frac{2}{N}\times\left(\sum_{i=1}^{N}\left(X_i\times\left(y_i-{\hat{\beta}}_{Old}\times X_i\right)\right)-\frac{\alpha_1}{2}\times\sum_{m=0}^{M}\frac{{\hat{\beta}}_{Old,\ m}}{\left|{\hat{\beta}}_{Old,\ m}\right|}-\alpha_2\times\sum_{m=0}^{M}{\hat{\beta}}_{Old,\ m}\right)$$
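
$\text{A single ElasticNet step can be sketched in the same way. The helper below is hypothetical and applies both penalties per coefficient;}$<p>
$\text{setting one of the two shrinkage parameters to 0 reduces it to a pure L1 or a pure L2 step.}$

```python
import numpy as np

def elastic_net_step(beta_old, X, y, alpha_1, alpha_2, learning_rate):
    """One ElasticNet update following the formula above (hypothetical helper)."""
    n_obs = X.shape[0]
    residual = y - X @ beta_old
    l1_part = (alpha_1 / 2) * np.sign(beta_old)   # L1 subgradient, 0 where beta is 0
    l2_part = alpha_2 * beta_old                  # L2 gradient, per coefficient
    return beta_old + learning_rate * (2 / n_obs) * (X.T @ residual - l1_part - l2_part)

# Tiny example: the fit is already perfect, so only the penalties move the coefficient
X = np.array([[1.0], [1.0]])
y = np.array([1.0, 1.0])
beta_old = np.array([1.0])

both = elastic_net_step(beta_old, X, y, alpha_1=2.0, alpha_2=1.0, learning_rate=0.5)
l1_only = elastic_net_step(beta_old, X, y, alpha_1=2.0, alpha_2=0.0, learning_rate=0.5)
l2_only = elastic_net_step(beta_old, X, y, alpha_1=0.0, alpha_2=1.0, learning_rate=0.5)
print(both, l1_only, l2_only)  # [0.] [0.5] [0.5]
```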

$\text{We can also apply the ElasticNet algorithm to a classification problem.}$<p>
$\text{To do that we have to encode our dependent variable as: } \{1,-1\} \text{ and treat the task as a regression problem.}$<br>

$\text{In the multiclass case, we will have to treat each class as a separate one-vs-rest problem, by which we will}$<p>
$\text{obtain a coefficient matrix with dimensions (when the intercept is included in model): } [number\ of\ features+1;\ number\ of\ classes].$

<h2>Comparison of Regularizers</h2>

$\text{- Ridge (L2) works better than Lasso (L1) when we have a lot of statistically significant variables in the dataset, because it will keep them in the model,}$<p>
$\text{- Lasso works better than Ridge when we have few statistically significant variables in the dataset,}$<p>
$\text{- Lasso can be used for variable selection unlike Ridge,}$<p>
$\text{- In the case where we have a huge number of variables (for example, 10000), even if there is correlation between them,}$<p>
$\text{Lasso should be better in terms of memory if the fitted model needs to be stored and reused.}$<p>
$\text{In addition, for such a large number of variables, even with Ridge we may still see overfitting, as there will be too many variables and too much noise,}$<p>
$\text{- Lasso's problem, on the other hand, is that it removes variables in situations where there is correlation between predictors,}$<p>
$\text{causing the algorithm to lose important information,}$<p>
$\text{- It is in such situations (like the one described two points above) that ElasticNet is useful: on the one hand it will remove some noise,}$<p>
$\text{but on the other we have some chance of keeping correlated variables (ElasticNet works best for large data sets).}$
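
$\text{The variable selection point can be illustrated on synthetic data in which only two of ten variables matter.}$<p>
$\text{Note that this sketch swaps in coordinate descent with soft-thresholding for Lasso instead of the gradient update described earlier, because it sets coefficients to exactly 0;}$<p>
$\text{the data, the helper names and the penalty values are assumptions chosen for the example.}$

```python
import numpy as np

# Synthetic data (an assumption): only the first two variables carry signal
rng = np.random.default_rng(3)
N, M = 200, 10
X = rng.normal(size=(N, M))
true_beta = np.zeros(M)
true_beta[:2] = [3.0, -2.0]
y = X @ true_beta + rng.normal(scale=0.1, size=N)

def soft_threshold(value, threshold):
    """Shrink value towards 0 by threshold, returning 0 inside [-threshold, threshold]."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Minimizes (1/(2N)) * RSS + alpha * sum(|beta|) by cyclic coordinate descent."""
    n_obs, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with the current contribution of variable j removed
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n_obs
            beta[j] = soft_threshold(rho, alpha) / (X[:, j] @ X[:, j] / n_obs)
    return beta

beta_lasso = lasso_coordinate_descent(X, y, alpha=0.5)
beta_ridge = np.linalg.solve(X.T @ X + 10.0 * np.eye(M), X.T @ y)

print("Lasso coefficients equal to 0:", int(np.sum(beta_lasso == 0)))
print("Ridge coefficients equal to 0:", int(np.sum(beta_ridge == 0)))
```

$\text{Ridge shrinks the coefficients of the noise variables but keeps them in the model, while Lasso drops them.}$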

<h1>Classification</h1>

<h1>Preprocessing</h1>

<h2>Download data</h2>

In [2]:
data = pd.read_csv("Data/bank-balanced.csv")
X = data.drop("deposit", axis=1)
y = data["deposit"]

In [3]:
print("Number of observations in data: {}".format(len(data)))
data.head()

Number of observations in data: 11162


Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,deposit
0,59,admin.,married,secondary,no,2343,yes,no,unknown,5,may,1042,1,-1,0,unknown,yes
1,56,admin.,married,secondary,no,45,no,no,unknown,5,may,1467,1,-1,0,unknown,yes
2,41,technician,married,secondary,no,1270,yes,no,unknown,5,may,1389,1,-1,0,unknown,yes
3,55,services,married,secondary,no,2476,yes,no,unknown,5,may,579,1,-1,0,unknown,yes
4,54,admin.,married,tertiary,no,184,no,no,unknown,5,may,673,2,-1,0,unknown,yes


<h2>Check for null data</h2>

In [4]:
data.isnull().sum()/len(data)

age          0.0
job          0.0
marital      0.0
education    0.0
default      0.0
balance      0.0
housing      0.0
loan         0.0
contact      0.0
day          0.0
month        0.0
duration     0.0
campaign     0.0
pdays        0.0
previous     0.0
poutcome     0.0
deposit      0.0
dtype: float64

<h3>Check dtypes of dataset</h3>

In [5]:
data.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
deposit      object
dtype: object

<h2>Divide our data into train and test sets</h2>

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, random_state=17, test_size=0.2)

$\text{Because the regularization terms penalize the size of the coefficients, which depends on the scale of each variable, it is worth standardizing our continuous variables before modeling.}$<p>
$\text{For this purpose, the Z-score will be used, which is described by the following formula:}$

$${\hat{X}}_{m, i}=\frac{X_{m, i}-{\bar{X}}_m}{\sigma_m}$$

$\text{Where: } {\hat{X}}_{m, i} \text{ - standardized observation } i \text{ of variable } m,$<p>
$X_{m, i} \text{ - observation } i \text{ of variable } m,$<p>
${\bar{X}}_m \text{ - mean of variable } m,$<p>
$\sigma_m \text{ - standard deviation of variable } m.$

$\text{To properly approach the modeling process, the test set should remain unknown until the prediction is made.}$<p>
$\text{For this reason, we will only learn the mean and standard deviation for the training set and transform}$<p>
$\text{the two data sets based on just these values.}$ 

In [7]:
categorical_data_train = X_train.select_dtypes(include="object")
continous_data_train = X_train.select_dtypes(exclude="object")
categorical_data_test = X_test.select_dtypes(include="object")
continous_data_test = X_test.select_dtypes(exclude="object")
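categorical_data_train = pd.get_dummies(categorical_data_train, drop_first=True, dtype=int)
categorical_data_test = pd.get_dummies(categorical_data_test, drop_first=True, dtype=int)
# Align the test dummies with the training columns in case a category is missing from one split
categorical_data_test = categorical_data_test.reindex(columns=categorical_data_train.columns, fill_value=0)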
mean_train = np.mean(continous_data_train, axis=0)
std_train = np.std(continous_data_train, axis=0)
continous_data_train = (continous_data_train-mean_train)/std_train
continous_data_test = (continous_data_test-mean_train)/std_train

In [8]:
X_train_final = pd.concat([categorical_data_train, continous_data_train], axis=1)
X_test_final = pd.concat([categorical_data_test, continous_data_test], axis=1)

<h1>Evaluation and Visualization</h1>

$\text{To verify how well our algorithms are able to perform, a cross-validation will be used on the training set (in order to average the results obtained).}$<p>
$\text{Then we will check whether the algorithms will perform equally well (or even better) on the test data.}$

$\text{Simple Cross-Validation class}$

In [9]:
class Cross_Validation():
    def __init__(self, metric, algorithm_instance, cross_validation_instance):
        metrics = {"accuracy": [lambda y, y_pred: accuracy_score(y, y_pred), "preds"],
                    "roc_auc": [lambda y, y_pred: roc_auc_score(y, y_pred), "probs"],
                    "mse": [lambda y, y_pred: mean_squared_error(y, y_pred), "preds"],
                    "rmse": [lambda y, y_pred: mean_squared_error(y, y_pred)**0.5, "preds"],
                    "mae": [lambda y, y_pred: mean_absolute_error(y, y_pred), "preds"]}
        if metric not in metrics:
            raise ValueError('Unsupported metric: {}'.format(metric))
        self.eval_metric = metrics[metric][0]
        self.metric_type = metrics[metric][1]
        self.algorithm = algorithm_instance
        self.cv = cross_validation_instance
    
    def fit(self, X, y, verbose=False):
        X = self.check_X(X=X)
        y = self.check_y(y=y)
        self.train_scores, self.valid_scores = [], []
        for fold, (train_idx, valid_idx) in enumerate(self.cv.split(X, y)):
            X_train, X_valid = X[train_idx, :], X[valid_idx, :]
            y_train, y_valid = y[train_idx], y[valid_idx]
            self.algorithm.fit(X_train, y_train)
            if self.metric_type == "preds":
                y_train_pred = self.algorithm.predict(X_train)
                y_valid_pred = self.algorithm.predict(X_valid)
            else:
                # Probability of the positive class, needed for metrics such as ROC AUC
                y_train_pred = self.algorithm.predict_proba(X_train)[:, 1]
                y_valid_pred = self.algorithm.predict_proba(X_valid)[:, 1]
            # Compute each score once and reuse it for storing and printing
            train_score = self.eval_metric(y_train, y_train_pred)
            valid_score = self.eval_metric(y_valid, y_valid_pred)
            self.train_scores.append(train_score)
            self.valid_scores.append(valid_score)
            if verbose:
                print("Iter {}: train scores: {}; valid scores: {}".format(fold, np.round(train_score, 5), np.round(valid_score, 5)))
        return np.mean(self.train_scores), np.mean(self.valid_scores)
    
    def check_X(self, X):
        if not isinstance(X, pd.DataFrame) and not isinstance(X, np.ndarray) and not torch.is_tensor(X):
            raise TypeError('Wrong type of X. It should be dataframe, numpy array or torch tensor.')
        X = np.array(X)
        if(X.ndim == 1):
            X = X[None, :]
        return X
    
    def check_y(self, y):
        if not isinstance(y, pd.DataFrame) and not isinstance(y, pd.Series) and not isinstance(y, np.ndarray) and not torch.is_tensor(y):
            raise TypeError('Wrong type of y. It should be pandas DataFrame, pandas Series, numpy array or torch tensor.')
        y = np.array(y)
        if(y.ndim == 2):
            y = y.squeeze()
        return y

$\text{Ridge}$

In [10]:
CV = Cross_Validation(metric="roc_auc", algorithm_instance=Ridge_Classifier(alpha=1, fit_intercept=True), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train_final, y=y_train, verbose=True)
print("Mean of train scores: {}; Mean of valid scores: {}".format(np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Iter 0: train scores: 0.90176; valid scores: 0.90198
Iter 1: train scores: 0.90251; valid scores: 0.89812
Iter 2: train scores: 0.902; valid scores: 0.90082
Iter 3: train scores: 0.90388; valid scores: 0.89291
Iter 4: train scores: 0.90194; valid scores: 0.903
Mean of train scores: 0.90242; Mean of valid scores: 0.89937


$\text{Lasso}$

In [11]:
CV = Cross_Validation(metric="roc_auc", algorithm_instance=Lasso_Classifier(alpha=1, fit_intercept=True), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train_final, y=y_train, verbose=True)
print("Mean of train scores: {}; Mean of valid scores: {}".format(np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Iter 0: train scores: 0.86528; valid scores: 0.85678
Iter 1: train scores: 0.83656; valid scores: 0.82935
Iter 2: train scores: 0.84203; valid scores: 0.85691
Iter 3: train scores: 0.85935; valid scores: 0.84001
Iter 4: train scores: 0.84683; valid scores: 0.8431
Mean of train scores: 0.85001; Mean of valid scores: 0.84523


$\text{ElasticNet}$

In [12]:
CV = Cross_Validation(metric="roc_auc", algorithm_instance=ElasticNet_Classifier(l1_term=1, l2_term=1, fit_intercept=True), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train_final, y=y_train, verbose=True)
print("Mean of train scores: {}; Mean of valid scores: {}".format(np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Iter 0: train scores: 0.86622; valid scores: 0.85773
Iter 1: train scores: 0.83783; valid scores: 0.8306
Iter 2: train scores: 0.84355; valid scores: 0.85801
Iter 3: train scores: 0.86078; valid scores: 0.84162
Iter 4: train scores: 0.84806; valid scores: 0.84455
Mean of train scores: 0.85129; Mean of valid scores: 0.8465


$\text{Logistic Regression}$

In [13]:
from sklearn.linear_model import LogisticRegression
CV = Cross_Validation(metric="roc_auc", algorithm_instance=LogisticRegression(penalty=None, fit_intercept=True), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train_final, y=y_train, verbose=True)
print("Mean of train scores: {}; Mean of valid scores: {}".format(np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Iter 0: train scores: 0.90451; valid scores: 0.90404
Iter 1: train scores: 0.90536; valid scores: 0.90014
Iter 2: train scores: 0.90446; valid scores: 0.90353
Iter 3: train scores: 0.90626; valid scores: 0.89701
Iter 4: train scores: 0.90436; valid scores: 0.90488
Mean of train scores: 0.90499; Mean of valid scores: 0.90192


$\text{As we can see, our unregularized Logistic Regression is actually doing better.}$<p>
$\text{Moreover, train and valid scores are similar for Logistic Regression, so we can conclude that there might be no need to use regularization.}$<p>

$\text{Just to be sure, compare predictions for train and test data sets.}$

In [14]:
ridge = Ridge_Classifier(alpha=1, fit_intercept=True)
lasso = Lasso_Classifier(alpha=1, fit_intercept=True)
elastic = ElasticNet_Classifier(l1_term=1, l2_term=1, fit_intercept=True)
logistic = LogisticRegression(penalty=None, fit_intercept=True)
ridge.fit(X_train_final, y_train)
y_ridge_prob_train = ridge.predict_proba(X_train_final)[:,1]
y_ridge_prob_test = ridge.predict_proba(X_test_final)[:,1]
print("Ridge: Train: {}; Test: {}".format(np.round(roc_auc_score(y_train, y_ridge_prob_train), 4), np.round(roc_auc_score(y_test, y_ridge_prob_test), 4)))
lasso.fit(X_train_final, y_train)
y_lasso_prob_train = lasso.predict_proba(X_train_final)[:,1]
y_lasso_prob_test = lasso.predict_proba(X_test_final)[:,1]
print("Lasso: Train: {}; Test: {}".format(np.round(roc_auc_score(y_train, y_lasso_prob_train), 4), np.round(roc_auc_score(y_test, y_lasso_prob_test), 4)))
elastic.fit(X_train_final, y_train)
y_elastic_prob_train = elastic.predict_proba(X_train_final)[:,1]
y_elastic_prob_test = elastic.predict_proba(X_test_final)[:,1]
print("ElasticNet: Train: {}; Test: {}".format(np.round(roc_auc_score(y_train, y_elastic_prob_train), 4), np.round(roc_auc_score(y_test, y_elastic_prob_test), 4)))
logistic.fit(X_train_final, np.array(y_train).squeeze())
y_logistic_prob_train = logistic.predict_proba(X_train_final)[:,1]
y_logistic_prob_test = logistic.predict_proba(X_test_final)[:,1]
print("Logistic Regression: Train: {}; Test: {}".format(np.round(roc_auc_score(np.array(y_train).squeeze(), y_logistic_prob_train), 4), np.round(roc_auc_score(np.array(y_test).squeeze(), y_logistic_prob_test), 4)))

Ridge: Train: 0.9022; Test: 0.9021
Lasso: Train: 0.8645; Test: 0.8667
ElasticNet: Train: 0.8365; Test: 0.832
Logistic Regression: Train: 0.9047; Test: 0.9042


$\text{Again, the best results are for Logistic Regression, and again they are similar for the train and test data sets, which means that the algorithm is not overfitted.}$<p>
$\text{Obviously, for some datasets there is no need to use regularization.}$

<h1>Regression</h1>

$\text{For regression we will use the diabetes dataset, expanded with interaction features up to degree 3, to see how well our algorithms are able to predict with the noise this adds.}$
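
$\text{To get a feeling for how many columns such an expansion produces, the count can be worked out by hand.}$<p>
$\text{The sketch assumes the 10 base features of the diabetes data and products of distinct features only, plus one bias column.}$

```python
from math import comb

n_base_features = 10   # the diabetes data has 10 columns
degree = 3

# With products of distinct features only, each order k contributes C(10, k) columns;
# k = 0 is the bias column, k = 1 the original features
n_columns = sum(comb(n_base_features, k) for k in range(degree + 1))
print(n_columns)  # 176
```

$\text{With only a few hundred observations available, this many columns leaves an unregularized model a lot of room to fit noise.}$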

In [15]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import load_diabetes
# Load features and target in a single call
X, y = load_diabetes(return_X_y=True)
poly = PolynomialFeatures(degree = 3, interaction_only=True)
X_poly_transformed = poly.fit_transform(X)

<h2>Divide our data into train and test sets</h2>

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_poly_transformed, y, shuffle=True, random_state=17, test_size=0.2)

<h1>Evaluation and Visualization</h1>

$\text{To verify how well our algorithms are able to perform, a cross-validation will be used on the training set (in order to average the results obtained).}$<p>
$\text{Then we will check whether the algorithms will perform equally well (or even better) on the test data.}$

$\text{Using the same cross-validation class}$

$\text{Ridge}$

In [17]:
CV = Cross_Validation(metric="mse", algorithm_instance=Ridge_Regressor(alpha=1, fit_intercept=True), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train, y=y_train, verbose=True)
print("Mean of train scores: {}; Mean of valid scores: {}".format(np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Iter 0: train scores: 3332.89605; valid scores: 3862.14291
Iter 1: train scores: 3477.34297; valid scores: 3655.18482
Iter 2: train scores: 3435.1236; valid scores: 3447.49526
Iter 3: train scores: 3546.01792; valid scores: 2654.42716
Iter 4: train scores: 3353.94337; valid scores: 4158.55277
Mean of train scores: 3429.06478; Mean of valid scores: 3555.56058


$\text{Lasso}$

In [18]:
CV = Cross_Validation(metric="mse", algorithm_instance=Lasso_Regressor(alpha=1, fit_intercept=True, learning_rate=0.3, max_iter=1500), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train, y=y_train, verbose=True)
print("Mean of train scores: {}; Mean of valid scores: {}".format(np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Iter 0: train scores: 2756.83865; valid scores: 3444.59901
Iter 1: train scores: 2876.97863; valid scores: 2697.05806
Iter 2: train scores: 2780.87022; valid scores: 2995.2885
Iter 3: train scores: 2857.21508; valid scores: 2692.37004
Iter 4: train scores: 2751.77135; valid scores: 3264.88895
Mean of train scores: 2804.73479; Mean of valid scores: 3018.84091


$\text{ElasticNet}$

In [19]:
CV = Cross_Validation(metric="mse", algorithm_instance=ElasticNet_Regressor(l1_term=1, l2_term=1, fit_intercept=True, learning_rate=0.3, max_iter=1500), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train, y=y_train, verbose=True)
print("Mean of train scores: {}; Mean of valid scores: {}".format(np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Iter 0: train scores: 2760.45038; valid scores: 3445.50135
Iter 1: train scores: 2879.79203; valid scores: 2700.27571
Iter 2: train scores: 2784.5875; valid scores: 2996.30815
Iter 3: train scores: 2860.73193; valid scores: 2695.38855
Iter 4: train scores: 2753.97096; valid scores: 3270.95494
Mean of train scores: 2807.90656; Mean of valid scores: 3021.68574


$\text{Linear Regression}$

In [20]:
from sklearn.linear_model import LinearRegression
CV = Cross_Validation(metric="mse", algorithm_instance=LinearRegression(fit_intercept=True), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train, y=y_train, verbose=True)
print("Mean of train scores: {}; Mean of valid scores: {}".format(np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Iter 0: train scores: 3699.72695; valid scores: 27979.6338
Iter 1: train scores: 1233.44231; valid scores: 20710.24547
Iter 2: train scores: 2869.92908; valid scores: 21294.60563
Iter 3: train scores: 1877.69611; valid scores: 12042.07143
Iter 4: train scores: 1114.85866; valid scores: 55752.4
Mean of train scores: 2159.13062; Mean of valid scores: 27555.79127


$\text{Now check for our test data.}$

In [21]:
ridge = Ridge_Regressor(alpha=1, fit_intercept=True)
lasso = Lasso_Regressor(alpha=1, fit_intercept=True, learning_rate=0.3, max_iter=1500)
elastic = ElasticNet_Regressor(l1_term=1, l2_term=1, fit_intercept=True, learning_rate=0.3, max_iter=1500)
linear = LinearRegression(fit_intercept=True)
ridge.fit(X_train, y_train)
y_ridge_pred_train = ridge.predict(X_train)
y_ridge_pred_test = ridge.predict(X_test)
print("Ridge: Train: {}; Test: {}".format(np.round(mean_squared_error(y_train, y_ridge_pred_train), 10), np.round(mean_squared_error(y_test, y_ridge_pred_test), 4)))
lasso.fit(X_train, y_train)
y_lasso_pred_train = lasso.predict(X_train)
y_lasso_pred_test = lasso.predict(X_test)
print("Lasso: Train: {}; Test: {}".format(np.round(mean_squared_error(y_train, y_lasso_pred_train), 4), np.round(mean_squared_error(y_test, y_lasso_pred_test), 4)))
elastic.fit(X_train, y_train)
y_elastic_pred_train = elastic.predict(X_train)
y_elastic_pred_test = elastic.predict(X_test)
print("ElasticNet: Train: {}; Test: {}".format(np.round(mean_squared_error(y_train, y_elastic_pred_train), 4), np.round(mean_squared_error(y_test, y_elastic_pred_test), 4)))
linear.fit(X_train, np.array(y_train).squeeze())
y_linear_pred_train = linear.predict(X_train)
y_linear_pred_test = linear.predict(X_test)
print("Linear Regression: Train: {}; Test: {}".format(np.round(mean_squared_error(np.array(y_train).squeeze(), y_linear_pred_train), 10), np.round(mean_squared_error(np.array(y_test).squeeze(), y_linear_pred_test), 4)))

Ridge: Train: 3322.2976222304; Test: 3396.5909
Lasso: Train: 2827.2603; Test: 3182.1703
ElasticNet: Train: 2830.4023; Test: 3185.5175
Linear Regression: Train: 2103.6062322946; Test: 30916.7191


$\text{In this dataset we can clearly see that a simple linear regression model gives the best results on the train data.}$<p>
$\text{However, on the valid/test sets its predictions are really poor - our model is overfitted.}$<p>
$\text{The regularization algorithms, on the other hand, give predictions that are much closer to the real values (compared to Linear Regression).}$<p>
$\text{We will tune our regularization models a bit to see if we can improve the results.}$

$\text{Ridge}$

In [22]:
for alpha in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    CV = Cross_Validation(metric="mse", algorithm_instance=Ridge_Regressor(alpha=alpha, fit_intercept=True), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
    mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train, y=y_train, verbose=False)
    print("Alpha: {}; Mean of train scores: {}; Mean of valid scores: {}".format(alpha, np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Alpha: 0.001; Mean of train scores: 2364.9751; Mean of valid scores: 2966.03738
Alpha: 0.01; Mean of train scores: 2628.89097; Mean of valid scores: 2972.03753
Alpha: 0.1; Mean of train scores: 2786.73719; Mean of valid scores: 3011.22406
Alpha: 1; Mean of train scores: 3429.06478; Mean of valid scores: 3555.56058
Alpha: 10; Mean of train scores: 5113.65518; Mean of valid scores: 5145.78677
Alpha: 100; Mean of train scores: 5822.44381; Mean of valid scores: 5829.08668
Alpha: 1000; Mean of train scores: 5917.19329; Mean of valid scores: 5920.72745


$\text{Lasso}$

In [23]:
for alpha in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    CV = Cross_Validation(metric="mse", algorithm_instance=Lasso_Regressor(alpha=alpha, fit_intercept=True, learning_rate=0.3, max_iter=1500), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
    mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train, y=y_train, verbose=False)
    print("Alpha: {}; Mean of train scores: {}; Mean of valid scores: {}".format(alpha, np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

Alpha: 0.001; Mean of train scores: 2804.53525; Mean of valid scores: 3018.63305
Alpha: 0.01; Mean of train scores: 2804.72292; Mean of valid scores: 3018.81419
Alpha: 0.1; Mean of train scores: 2804.73445; Mean of valid scores: 3018.82892
Alpha: 1; Mean of train scores: 2804.73479; Mean of valid scores: 3018.84091
Alpha: 10; Mean of train scores: 2805.69304; Mean of valid scores: 3024.94072
Alpha: 100; Mean of train scores: 4376.87077; Mean of valid scores: 4942.15764
Alpha: 1000; Mean of train scores: 201661.54105; Mean of valid scores: 203082.81363
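
$\text{The Lasso model used here takes a learning rate and an iteration limit, which suggests an iterative solver; its actual update rule is not shown in this section.}$<p>
$\text{As a hedged illustration, the sketch below assumes proximal gradient descent (ISTA) on the objective (1/2n) times the squared error plus alpha times the L1 norm of the weights.}$<p>
$\text{All function names and the synthetic data are hypothetical. The sketch shows how a larger alpha pushes more coefficients to exactly zero.}$

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink toward zero and clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha, max_iter=1500):
    """Proximal gradient descent for (1/2n)||y - Xw - b||^2 + alpha * ||w||_1."""
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering keeps the intercept unpenalized
    # Step size 1/L, with L the Lipschitz constant of the smooth part's gradient
    step = n / (np.linalg.norm(Xc, 2) ** 2)
    w = np.zeros(p)
    for _ in range(max_iter):
        grad = Xc.T @ (Xc @ w - yc) / n
        w = soft_threshold(w - step * grad, step * alpha)
    return w, y_mean - x_mean @ w

# Synthetic data with a sparse ground truth (not the notebook's dataset)
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(200, 8))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_w + rng_demo.normal(scale=0.1, size=200)

for alpha in [0.001, 0.1, 1, 10]:
    w, b = lasso_ista(X_demo, y_demo, alpha)
    print(f"Alpha: {alpha}; non-zero coefficients: {np.count_nonzero(w)}")
```

$\text{Under this formulation, once alpha exceeds the largest absolute correlation between a centered feature and the centered target, every coefficient is zero and the model predicts only the mean.}$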


$\text{ElasticNet}$

$\text{The Ridge and Lasso runs show that the validation error is lowest for small values of the penalty term.}$<p>
$\text{Therefore, ElasticNet is tuned only over penalty values in the range: } [0.001, 1]$

In [24]:
for l1_term in [0.001, 0.01, 0.1, 1]:
    for l2_term in [0.001, 0.01, 0.1, 1]:
        CV = Cross_Validation(metric="mse", algorithm_instance=ElasticNet_Regressor(l1_term=l1_term, l2_term=l2_term, fit_intercept=True, learning_rate=0.3, max_iter=1500), cross_validation_instance=KFold(n_splits=5, shuffle=True, random_state=17))
        mean_of_train_scores, mean_of_valid_scores = CV.fit(X=X_train, y=y_train, verbose=False)
        print("L1 term: {}; L2 term: {}; Mean of train scores: {}; Mean of valid scores: {}".format(l1_term, l2_term, np.round(mean_of_train_scores, 5), np.round(mean_of_valid_scores, 5)))

L1 term: 0.001; L2 term: 0.001; Mean of train scores: 2805.65053; Mean of valid scores: 3019.55215
L1 term: 0.001; L2 term: 0.01; Mean of train scores: 2807.87447; Mean of valid scores: 3021.6121
L1 term: 0.001; L2 term: 0.1; Mean of train scores: 2808.07484; Mean of valid scores: 3021.83561
L1 term: 0.001; L2 term: 1; Mean of train scores: 2808.09033; Mean of valid scores: 3021.84961
L1 term: 0.01; L2 term: 0.001; Mean of train scores: 2805.34587; Mean of valid scores: 3019.28274
L1 term: 0.01; L2 term: 0.01; Mean of train scores: 2807.71611; Mean of valid scores: 3021.46419
L1 term: 0.01; L2 term: 0.1; Mean of train scores: 2808.05825; Mean of valid scores: 3021.82087
L1 term: 0.01; L2 term: 1; Mean of train scores: 2808.08867; Mean of valid scores: 3021.84815
L1 term: 0.1; L2 term: 0.001; Mean of train scores: 2804.79221; Mean of valid scores: 3018.92376
L1 term: 0.1; L2 term: 0.01; Mean of train scores: 2806.34309; Mean of valid scores: 3020.17983
L1 term: 0.1; L2 term: 0.1; Mean o

$\text{For the best values of the tuned hyperparameters, check whether we can improve our test data scores.}$
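
$\text{The best hyperparameters are the ones with the lowest mean validation MSE in the sweeps above.}$<p>
$\text{A small sketch of that selection follows, using the validation scores copied from the printed output. The helper function is illustrative.}$<p>
$\text{The ElasticNet output is cut off, so only the rows that are fully visible are included for it.}$

```python
# Mean validation MSE per hyperparameter value, copied from the printed output
ridge_valid = {0.001: 2966.03738, 0.01: 2972.03753, 0.1: 3011.22406, 1: 3555.56058,
               10: 5145.78677, 100: 5829.08668, 1000: 5920.72745}
lasso_valid = {0.001: 3018.63305, 0.01: 3018.81419, 0.1: 3018.82892, 1: 3018.84091,
               10: 3024.94072, 100: 4942.15764, 1000: 203082.81363}
# Keys are (l1_term, l2_term); only the fully visible rows of the truncated output
elastic_valid_visible = {
    (0.001, 0.001): 3019.55215, (0.001, 0.01): 3021.6121,
    (0.001, 0.1): 3021.83561, (0.001, 1): 3021.84961,
    (0.01, 0.001): 3019.28274, (0.01, 0.01): 3021.46419,
    (0.01, 0.1): 3021.82087, (0.01, 1): 3021.84815,
    (0.1, 0.001): 3018.92376, (0.1, 0.01): 3020.17983,
}

def best_params(scores):
    """Return the hyperparameter key with the lowest validation MSE."""
    return min(scores, key=scores.get)

best = {
    "Ridge": best_params(ridge_valid),
    "Lasso": best_params(lasso_valid),
    "ElasticNet": best_params(elastic_valid_visible),
}
print(best)
```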

In [25]:
ridge = Ridge_Regressor(alpha=0.001, fit_intercept=True)
lasso = Lasso_Regressor(alpha=0.001, fit_intercept=True, learning_rate=0.3, max_iter=1500)
elastic = ElasticNet_Regressor(l1_term=0.1, l2_term=0.001, fit_intercept=True, learning_rate=0.3, max_iter=1500)
linear = LinearRegression(fit_intercept=True)
ridge.fit(X_train, y_train)
y_ridge_pred_train = ridge.predict(X_train)
y_ridge_pred_test = ridge.predict(X_test)
print("Ridge: Train: {}; Test: {}".format(np.round(mean_squared_error(y_train, y_ridge_pred_train), 10), np.round(mean_squared_error(y_test, y_ridge_pred_test), 4)))
lasso.fit(X_train, y_train)
y_lasso_pred_train = lasso.predict(X_train)
y_lasso_pred_test = lasso.predict(X_test)
print("Lasso: Train: {}; Test: {}".format(np.round(mean_squared_error(y_train, y_lasso_pred_train), 4), np.round(mean_squared_error(y_test, y_lasso_pred_test), 4)))
elastic.fit(X_train, y_train)
y_elastic_pred_train = elastic.predict(X_train)
y_elastic_pred_test = elastic.predict(X_test)
print("ElasticNet: Train: {}; Test: {}".format(np.round(mean_squared_error(y_train, y_elastic_pred_train), 4), np.round(mean_squared_error(y_test, y_elastic_pred_test), 4)))
linear.fit(X_train, np.array(y_train).squeeze())
y_linear_pred_train = linear.predict(X_train)
y_linear_pred_test = linear.predict(X_test)
print("Linear Regression: Train: {}; Test: {}".format(np.round(mean_squared_error(np.array(y_train).squeeze(), y_linear_pred_train), 10), np.round(mean_squared_error(np.array(y_test).squeeze(), y_linear_pred_test), 4)))

Ridge: Train: 2408.2648219278; Test: 3623.3098
Lasso: Train: 2827.065; Test: 3181.933
ElasticNet: Train: 2827.3356; Test: 3182.0772
Linear Regression: Train: 2103.6062322946; Test: 30916.7191


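$\text{Tuning improved our results just a bit for Lasso and ElasticNet.}$<p>
$\text{All three regularized models reach a test MSE between about 3182 and 3623, while Linear Regression reaches 30916.72 despite having the lowest train MSE.}$<p>
$\text{We can now see how regularization algorithms work and also when we should use them!}$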
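
$\text{One simple way to quantify the overfitting seen in the final scores is the ratio of test MSE to train MSE.}$<p>
$\text{The sketch below computes it from the numbers printed by the last cell; the ratio is an illustrative choice of measure, not something the notebook itself computes.}$

```python
# (train MSE, test MSE) as printed by the final comparison cell
reported = {
    "Ridge": (2408.2648219278, 3623.3098),
    "Lasso": (2827.065, 3181.933),
    "ElasticNet": (2827.3356, 3182.0772),
    "Linear Regression": (2103.6062322946, 30916.7191),
}

# A ratio close to 1 means train and test errors are similar;
# a large ratio means the model fits the training data far better than unseen data.
gap = {name: test / train for name, (train, test) in reported.items()}

for name, ratio in sorted(gap.items(), key=lambda item: item[1]):
    print(f"{name}: test/train MSE ratio = {ratio:.2f}")
```

$\text{Linear Regression has the lowest train MSE but by far the largest ratio, while Lasso and ElasticNet have the smallest ratios and the lowest test MSE.}$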