# Introduction to Machine Learning

## Lesson 2. Linear regression

## Introduction

This lab introduces linear regression and shows how to apply it in practice. You will also learn how to prepare a dataset for a machine learning model.

## Task

In this lab, we will perform a basic regression on the Taxi dataset.

### About Taxi Dataset
Taxi trip records include id, vendor_id, pickup_datetime, passenger_count, store_and_fwd_flag, trip_duration, distance_km.
The dataset is often used for testing algorithms for classification and pattern recognition due to its simplicity and well-defined structure.

### Performing the Regression
To do so, you will need to:
- Obtain the data from the competition
- Create a Jupyter notebook that produces a file for submission
- Submit it to the competition

### Questions
*To solve a linear regression problem, use the following formula:
$\beta^*=(X^T \cdot X)^{-1} \cdot X^T \cdot Y$.*: 

1. No
2. Yes, this is the only sure-fire way. 
3. When to do it depends on the specifics of the data.

-------------------------------------------------------

*If we want to find the minimum of an arbitrary differentiable function, we need to find the points at which the derivative is zero.*: 

1. Yes, but check that the derivative to the left of this point is negative and to the right is positive.
2. Yes, but check that the derivative to the left of that point is positive and to the right is negative.
3. Yes, take any such point near which the derivative has changed its sign.
4. No, it is necessary to use the apparatus of working with matrices.

-------------------------------------------------------

*When we have found the minimum of a function, it means that we have found the point at which the function reaches its minimum value for all possible arguments.*: 

1. Yes, because that's what a minimum is.
2. No, a function can have several different minima. To find the minimum value, you need to find all the minima and compare them to each other. 
3. No, a function can have several minima that differ from each other. Depending on the nature of the function's behavior, there may (but not necessarily) be a global one among them.
4. No correct answer

## Importing required Libraries

First we need to import necessary libraries:

[Pandas](https://pandas.pydata.org/) - For data analysis and manipulation

In [2]:
import pandas as pd

## Preparing Data
Data in machine learning and deep learning is usually divided into `train` and `test` splits. Sometimes there is a `validation` split as well.

The main purpose of the train-test split is to assess how well a machine learning model generalizes to unseen data. By splitting the dataset, we can train the model on one subset of the data (the training set) and test its performance on another subset (the test set).


The train-test split usually serves two purposes:
1.   **Avoiding Overfitting**: By using a separate test set, we ensure that the estimate of the model’s performance is not overly optimistic, as the model has not seen the test data during training. This helps to detect overfitting, where the model performs well on the training data but poorly on unseen data.

2.   **Model Validation**: It provides a straightforward way to validate the model, giving insights into how it might perform in real-world scenarios.


Here your goal is to train any appropriate ML model on `train` split and run inference on `test` split.
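
As a minimal sketch of the split described above, the snippet below uses scikit-learn's `train_test_split` on a small made-up table. The values, the name `toy`, the 80/20 ratio, and `random_state=42` are assumptions chosen for illustration and are not part of the lab data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values); the column names mirror the taxi dataset.
toy = pd.DataFrame({
    "distance_km": [1.5, 1.8, 6.4, 1.5, 1.2, 3.0, 2.2, 4.1, 0.9, 5.3],
    "trip_duration": [455, 663, 2124, 429, 435, 900, 700, 1300, 300, 1700],
})

X = toy[["distance_km"]]
y = toy["trip_duration"]

# Hold out 20% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

Fixing `random_state` makes the split reproducible between runs, which is convenient when comparing models.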



In [3]:
df = pd.read_csv('taxi_dataset.csv', index_col=0)

df.head()

| id | vendor_id | pickup_datetime | passenger_count | store_and_fwd_flag | trip_duration | distance_km | prediction_1 | prediction_2 |
|---|---|---|---|---|---|---|---|---|
| id2875421 | 1 | 2016-03-14 17:24:55 | 930.399753 | 0 | 455.0 | 1.500479 | 578.156451 | 355.27071 |
| id2377394 | 0 | 2016-06-12 00:43:35 | 930.399753 | 0 | 663.0 | 1.807119 | 962.657188 | 674.295781 |
| id3858529 | 1 | 2016-01-19 11:35:24 | 930.399753 | 0 | 2124.0 | 6.39208 | 2546.180515 | 2422.132431 |
| id3504673 | 1 | 2016-04-06 19:32:31 | 930.399753 | 0 | 429.0 | 1.487155 | 737.926214 | 795.992362 |
| id2181028 | 1 | 2016-03-26 13:30:55 | 930.399753 | 0 | 435.0 | 1.189925 | 666.070794 | -4.158492 |


#### Exercise 1

Use the target column (*trip_duration*) as **Y** and all attributes except the order start time (*pickup_datetime*) as the sample **X**.


To get started, let's use the out-of-the-box solution. To do this, create a `model` variable and assign it an instance of the **LinearRegression** class from the **linear_model** module of the **sklearn** library.

Next, call the **fit** method, passing it the **X** array of object attributes (a pandas DataFrame or a NumPy array) and the **Y** array of targets.
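
A minimal sketch of this workflow on a made-up table. The values and the name `toy` are hypothetical; in the lab you would work with `df` instead.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy rows with the same kind of columns as the taxi data.
toy = pd.DataFrame({
    "vendor_id": [1, 0, 1, 1, 0, 1],
    "pickup_datetime": ["2016-03-14 17:24:55"] * 6,
    "distance_km": [1.5, 1.8, 6.4, 1.5, 1.2, 3.0],
    "trip_duration": [455.0, 663.0, 2124.0, 429.0, 435.0, 900.0],
})

# Features: everything except the target and the pickup timestamp.
X = toy.drop(columns=["trip_duration", "pickup_datetime"])
Y = toy["trip_duration"]

model = LinearRegression()
model.fit(X, Y)

# One prediction per row of X.
predictions = model.predict(X)
```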

In [3]:
from sklearn.linear_model import LinearRegression

### Your code is here

#### Exercise 2

To look at the values of the fitted model coefficients, refer to the **coef_** attribute of the linear regression object. To view the intercept (the free term), refer to the **intercept_** attribute.
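
To illustrate what these attributes hold, the sketch below fits a model to synthetic data generated from known weights (2 and 3) and a known intercept (5), all chosen here purely for illustration. Because the data is noise-free, the fitted values should match them up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data with assumed true parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 5.0

model = LinearRegression().fit(X, y)

print(model.coef_)       # one weight per feature column
print(model.intercept_)  # the constant (free) term
```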

In [2]:
### Your code is here


## Fitting the Model

Fitting is the process of training a model on a dataset so that it learns the underlying patterns and relationships within the data. This involves adjusting the model's parameters so that its predictions closely match the actual target values. During fitting, an optimization algorithm minimizes the error between the predictions and the true outcomes, as measured by a loss function. The result is a trained model that can make predictions on new, unseen data.
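
For linear regression, the loss is commonly the mean squared error (MSE). The sketch below, on synthetic data with assumed true parameters, compares the MSE of a least-squares fit with the MSE of slightly perturbed parameters. The helper name `mse` is hypothetical, and `np.linalg.lstsq` is used here only as one convenient way to obtain the least-squares solution.

```python
import numpy as np

def mse(X, y, beta):
    """Mean squared error of the linear predictions X @ beta."""
    residuals = y - X @ beta
    return float(np.mean(residuals ** 2))

rng = np.random.default_rng(1)
features = rng.normal(size=(100, 2))
# Append a column of ones so the last coefficient acts as the intercept.
X = np.hstack([features, np.ones((100, 1))])
y = X @ np.array([1.5, -2.0, 4.0]) + rng.normal(scale=0.1, size=100)

# Least-squares solution: the parameters that minimize the MSE.
beta_fit, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any other parameter vector gives a loss that is at least as large.
beta_other = beta_fit + np.array([0.1, 0.0, -0.1])

print(mse(X, y, beta_fit), mse(X, y, beta_other))
```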

#### Exercise 3
Now implement the `LinearRegressionByMatrix` function, which takes three parameters as input:

The object-feature matrix **(X)**, the vector of answers **(Y)**, and the Boolean parameter **fit_intercept**, which adds a constant feature (a column of ones) if True and does nothing if False.

The function should return a one-dimensional np.array object with the estimated **$\beta_1, ..., \beta_n, \beta_0$**.
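
One possible way to build the constant feature is sketched below. It assumes the column of ones is appended as the last column, so that the intercept ends up last in the coefficient vector, matching the order $\beta_1, ..., \beta_n, \beta_0$. The matrix values are made up.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Column of ones with one entry per object (row) of X.
ones = np.ones((X.shape[0], 1))

# Appending it on the right keeps beta_0 in the last position.
X_with_const = np.hstack([X, ones])

print(X_with_const.shape)
```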

$$
\beta^* = (X^T \cdot X)^{-1} \cdot X^T \cdot Y
$$
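
This formula follows from minimizing the sum of squared errors. A short derivation, assuming $X^T \cdot X$ is invertible:

$$
L(\beta) = \|Y - X \beta\|^2 = (Y - X \beta)^T \cdot (Y - X \beta)
$$

$$
\nabla_\beta L = -2 X^T \cdot (Y - X \beta) = 0
\;\Longrightarrow\;
X^T \cdot X \cdot \beta = X^T \cdot Y
\;\Longrightarrow\;
\beta^* = (X^T \cdot X)^{-1} \cdot X^T \cdot Y
$$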

In [4]:
import numpy as np

### Your code is here

def LinearRegressionByMatrix(X: np.ndarray, Y: np.ndarray, fit_intercept: bool = True) -> np.ndarray:
    """
    :param X: matrix of objects
    :param Y: vector (matrix with 1 column) of responses
    :param fit_intercept: whether to add a constant feature (a column of ones) to the data

    :return: one-dimensional numpy array with the obtained beta coefficients
    """

    return

#### Exercise 4
Are the coefficients the same as in the out-of-the-box version?

With the model coefficients in hand, you can compute the predictions for each object!

Do this with a matrix product of the **X** matrix and the coefficients returned by *LinearRegressionByMatrix*.
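
A minimal sketch of the matrix product, using made-up coefficients in place of the output of *LinearRegressionByMatrix*. The values of `beta` are hypothetical and are assumed to be ordered as $\beta_1, \beta_2, \beta_0$.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
beta = np.array([10.0, 100.0, 1.0])  # hypothetical beta_1, beta_2, beta_0

# Add the constant column so that beta_0 is multiplied by 1 for every row.
X_with_const = np.hstack([X, np.ones((X.shape[0], 1))])

predictions = X_with_const @ beta
print(predictions)  # [211. 431.]
```

If the model was fitted without an intercept, the constant column is not needed and the product is simply `X @ beta`.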

In [5]:
### Your code is here