# Introduction to Machine Learning
## Lesson 4 Lasso, Ridge and Cross Validation

## Introduction


The task of this lesson is to understand how Lasso, Ridge and Cross Validation work. This includes going through the theory and following a set of instructions to complete the code.


## Task

In this lab we will learn about cross validation, Lasso and Ridge. A simplified version of the taxi dataset will be used.

### About Taxi Dataset
Taxi trip records include id, vendor_id, passenger_count, store_and_fwd_flag, distance_km, log_trip_duration.
Thanks to its simplicity and well-defined structure, the dataset is convenient for testing regression algorithms, such as predicting log_trip_duration.

### Performing the Task
To do so you will need to:
- Obtain the data from the competition
- Create a Jupyter notebook that produces a file for submission
- Submit the file to the competition

### Questions

*Choose the correct statements about Lasso and Ridge*: 

1. Ridge regularization is more likely to zero out model weights than Lasso.
2. Lasso regularization is more likely to zero out model weights than Ridge.
3. Ridge and Lasso regularization are designed to deal with underfitting.
4. The essence of Lasso regularization is to add the sum of the absolute values of the trained coefficients to the loss function being minimized.
5. The essence of Lasso regularization is to add the sum of squares of the trained coefficients to the loss function being minimized.

_______________________________________________________

*Choose the correct statements about multicollinearity*:

1. Multicollinearity guarantees that we get an overfitted model.
2. Multicollinearity occurs when there is a strong linear dependence in the object-feature matrix.
3. If there are linearly dependent features in the object-feature matrix, then with 50% probability we cannot apply the matrix formula for finding the optimal regression coefficients.
4. If the object-feature matrix contains linearly dependent features, then the loss function being minimized has not a single minimum but 2-3 minima.
5. If the object-feature matrix contains linearly dependent features, then the loss function being minimized has an infinite number of minima.
6. Multicollinearity can be cured by removing dependent features or by using regularization.

_______________________________________________________

## Importing required Libraries

First we need to import the necessary libraries:

[Pandas](https://pandas.pydata.org/) - For data analysis and manipulation

[Numpy](https://numpy.org/) - To deal with matrices

Unused fields are dropped later, once the data has been loaded.

In [None]:
import numpy as np
import pandas as pd

## Preparing Data
Preparing data for machine learning involves several steps such as data collection, cleaning the data from noise and outliers, transforming the data into a suitable format, normalizing or standardizing values, and creating and selecting features (feature engineering) to improve the quality of the model. This process is important to ensure the accuracy and reliability of machine learning models because data quality directly affects their performance.



In [None]:

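# Load the preprocessed trips, using the trip id as the index.
processed_data = pd.read_csv('processed_data.csv', index_col='id')
# Move the target to a log scale: log1p(x) = log(1 + x).
processed_data = processed_data.assign(log_trip_duration=np.log1p(processed_data['trip_duration']))
# Drop the raw duration; only the log-transformed target is kept.
processed_data = processed_data.drop('trip_duration', axis=1)

processed_data.head()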
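
As a small illustration of the log transform used above, the sketch below applies `np.log1p` to a few made-up durations and maps them back with `np.expm1`. The values in `trip_duration` are hypothetical and are not taken from the taxi dataset.

```python
import numpy as np

# Hypothetical trip durations in seconds (illustrative values only).
trip_duration = np.array([0.0, 59.0, 600.0, 3600.0])

# log1p(x) = log(1 + x): defined at x = 0 and compresses the long right tail.
log_trip_duration = np.log1p(trip_duration)

# expm1(x) = exp(x) - 1 is the inverse, so values can be mapped back to seconds.
restored = np.expm1(log_trip_duration)

print(log_trip_duration)
print(restored)
```

The same inverse can be applied to model predictions when durations in seconds are needed rather than log values.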

#### Exercise 1
Measure the quality of Linear Regression on the processed data using cross-validation with 4 folds.

Use LinearRegression and cross_validate from sklearn.
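
Before working on the taxi data, it may help to see the general shape of a `cross_validate` call. The following is a minimal sketch on synthetic data, not the solution to the exercise: `X_demo`, `y_demo` and `selector_demo` are hypothetical names, and the choice of `neg_mean_squared_error` as the scorer is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (assumption: stands in for the real features and target).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Same splitting scheme as in the lab: 4 shuffled folds with a fixed seed.
selector_demo = KFold(n_splits=4, shuffle=True, random_state=33)

cv_demo = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=selector_demo,
    scoring="neg_mean_squared_error",  # sklearn scorers are "higher is better"
)

# Flip the sign and take the root to get one RMSE value per fold.
rmse_per_fold = np.sqrt(-cv_demo["test_score"])
print(rmse_per_fold, rmse_per_fold.mean())
```

The result is a dictionary; `test_score` holds one value per fold, so averaging it gives a single quality estimate.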

In [14]:
from sklearn.model_selection import KFold

selector = KFold(n_splits=4, shuffle=True, random_state=33)

from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

### Your code is here


#### Exercise 2

In linear algebra the concept of **rank of a matrix** is often used. It corresponds to the number of linearly independent columns in the matrix. In other words, it allows us to estimate whether there is an excess of information in our dataframe. If the rank of the matrix is less than the number of columns used, some features should be removed, because otherwise strict multicollinearity occurs.

To measure the rank of our matrices, we can use the function numpy.linalg.matrix_rank.

The constant feature can be neglected in this exercise.

Print the result to the console with Python's **print** function.
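
The idea can be sketched on a toy matrix. In the example below (hypothetical data, not the taxi features) the third column is an exact linear combination of the first two, so the rank is smaller than the number of columns.

```python
import numpy as np

# Two independent columns and a third that is a linear combination of them.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([0.0, 1.0, 0.0, 1.0])
X_toy = np.column_stack([a, b, 2 * a + b])

rank = np.linalg.matrix_rank(X_toy)

# Rank below the column count signals strict multicollinearity.
print(rank, X_toy.shape[1])  # prints: 2 3
```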

In [4]:
### Your code is here

#### Exercise 3
Could the 4 new features be causing a multicollinearity problem? How can we get adequate quality on the one hand, and keep the new features on the other?

Regularization will help us here.

For both Ridge and Lasso, find a value of the regularization parameter $\lambda$ (called `alpha` in sklearn) such that the RMSLE on cross-validation is strictly less than 0.4.
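
For reference, a common definition of RMSLE is given below. It is stated here as an assumption, since the lab does not spell the formula out; $y_i$ denotes the true duration, $\hat{y}_i$ the predicted one, and $n$ the number of objects.

```latex
\mathrm{RMSLE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl( \log(1 + \hat{y}_i) - \log(1 + y_i) \bigr)^2 }
```

Because log_trip_duration is already the log1p of the raw duration, an ordinary RMSE computed on this target corresponds to RMSLE on the raw durations, provided the predictions are also on the log scale.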

**WARNING**: scale the data (use MinMaxScaler) before applying regularization. Important: to keep training independent of validation, at each iteration of cross-validation the scaling parameters must be estimated exclusively on the training folds and then applied to the validation fold.
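
One way to satisfy this requirement is to put the scaler and the model into a single `Pipeline` and pass the pipeline to the cross-validation routine, which refits every step on the training part of each split. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `pipe_demo`); the value `alpha=1.0` is arbitrary and is not a recommended answer.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on a large scale (assumption: stands in for the real data).
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(120, 4))
y_demo = X_demo @ np.array([0.02, -0.01, 0.03, 0.0]) + rng.normal(scale=0.05, size=120)

# The scaler is a pipeline step, so it is fit only on the training folds
# and then applied, with those parameters, to the validation fold.
pipe_demo = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", Ridge(alpha=1.0)),
])

folds = KFold(n_splits=4, shuffle=True, random_state=33)
scores = cross_validate(
    pipe_demo, X_demo, y_demo, cv=folds, scoring="neg_root_mean_squared_error"
)

rmse = -scores["test_score"]
print(rmse.mean())
```

Scaling the whole dataset once before cross-validation would instead let the minimum and maximum of the validation fold influence training.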

In [40]:
from sklearn.preprocessing import MinMaxScaler

from sklearn.pipeline import Pipeline

from sklearn.linear_model import Lasso, Ridge

from sklearn.model_selection import GridSearchCV

In [5]:
### Your code is here

Find the best model on cross-validation. Use **best_estimator_** from **cv_lasso**.

In [7]:
### Your code is here

List the scores of all the models tried above. Use **cv_results_**.
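
As a rough illustration of how `best_estimator_` and `cv_results_` can be read after a grid search, here is a sketch on synthetic data. The names `search_demo`, `grid_demo` and the alpha grid are hypothetical; the grid is chosen only for the demonstration and is not a hint for the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: only the first two of five features affect the target.
rng = np.random.default_rng(2)
X_demo = rng.uniform(size=(150, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=150)

pipe_demo = Pipeline([("scaler", MinMaxScaler()), ("model", Lasso())])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
grid_demo = {"model__alpha": [0.001, 0.01, 0.1, 1.0]}

search_demo = GridSearchCV(
    pipe_demo,
    grid_demo,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=4, shuffle=True, random_state=33),
)
search_demo.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read.
results = pd.DataFrame(search_demo.cv_results_)[
    ["param_model__alpha", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)

# best_estimator_ is the pipeline refit on all data with the best parameters.
print(search_demo.best_estimator_.named_steps["model"].coef_)
```

Scores are negative because sklearn maximizes scorers; the row with `rank_test_score` equal to 1 is the one with the smallest error.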

In [8]:
### Your code is here

#### Exercise 4
In this assignment, tune the hyperparameters of the Ridge regression model using cross-validation. Create a pipeline with MinMaxScaler and Ridge, define a grid of alpha values, and use GridSearchCV with the negative root mean squared error as the score and cross-validation. Then call fit on the X and Y data to find the optimal hyperparameters.

In [9]:
### Your code is here

Find the optimal hyperparameters of the Ridge regression model using cross-validation. Use **best_estimator_** from **cv_ridge**.

In [10]:
### Your code is here

Output the mean test scores for each combination of parameters. Use **cv_results_**, as in Exercise 3.

In [11]:
### Your code is here