# Introduction

>This notebook introduces the concepts of Machine Learning, and linear regression in particular, to show how well suited Python is to machine learning problems. All these concepts will be presented in more detail and put into practice in the modules dedicated to Machine Learning.
>
> Machine Learning is a sub-field of Artificial Intelligence that gives the computer the ability to learn to perform tasks automatically from data. When the task is to predict a variable from examples where its value is already known, we speak of supervised learning.
>
> Linear regression is one of the first supervised learning predictive models to have been studied. This model makes it possible to predict a quantitative variable. It remains one of the most widely used models in practice, largely because of its simplicity.
>
> In the linear regression model, $y$ is the quantitative variable to predict (called the target variable), and the explanatory variables are the inputs used to make the prediction.

### Univariate Linear Regression

> In the univariate linear model, we work with two variables: $y$ is known as the **target** variable, and $x$ is referred to as the **explanatory variable**. Linear regression involves modeling the relationship between these two variables using an **affine function**. Therefore, the formula for the univariate linear model is expressed as:
>
>$$y \approx \beta_1 x + \beta_0 $$
>
>Here:
>- $y$ represents the variable we aim to predict.
>- $x$ stands for the explanatory variable.
>- $\beta_1$ and $\beta_0$ denote the parameters of the affine function. $\beta_1$ determines the **slope**, while $\beta_0$ determines the **y-intercept** (also known as the **bias**).
>
>**The objective of linear regression is to find the optimal values of $\beta_0$ and $\beta_1$ for predicting the variable $y$ based on a given value of $x$**.
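>
> As a minimal sketch of what "finding the optimal values" can look like, the example below assumes that "optimal" means minimizing the sum of squared errors (ordinary least squares, which is the criterion used by scikit-learn's `LinearRegression`). The toy data and the variable names are hypothetical and are not part of the automobile dataset.

```python
import numpy as np

# Hypothetical toy data (not from automobiles.csv): y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Ordinary least squares estimates for the univariate model y ≈ beta_1 * x + beta_0.
x_mean, y_mean = x.mean(), y.mean()
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
beta_0 = y_mean - beta_1 * x_mean                                          # y-intercept

print(f"slope: {beta_1:.3f}, intercept: {beta_0:.3f}")
```

> The estimated slope and intercept should land close to the values 2 and 1 used to generate the data, without matching them exactly because of the noise.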

### Multivariate Linear Regression


> Multivariate linear regression involves establishing a linear relationship between a target variable $y$ and **multiple explanatory variables** $x_1$, $x_2$, ..., $x_p$, commonly referred to as *features*:
>
> $$
\begin{align}
    y & \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \\
      & \approx \beta_0 + \sum_{j=1}^{p} \beta_j x_j
\end{align}
$$
>
> There are now $p + 1$ parameters $\beta_j$ to find.
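>
> The sketch below illustrates the $p + 1$ parameters on hypothetical synthetic data with $p = 3$. It assumes a least-squares fit and uses NumPy's `np.linalg.lstsq`; a column of ones is added so that $\beta_0$ is estimated together with $\beta_1, \dots, \beta_p$. The names `X_design`, `true_beta` and `beta_hat` are illustrative only.

```python
import numpy as np

# Hypothetical toy data: n = 200 observations, p = 3 features.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([4.0, 1.5, -2.0, 0.5])  # [beta_0, beta_1, beta_2, beta_3]
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that beta_0 is estimated alongside the other coefficients.
X_design = np.column_stack([np.ones(n), X])

# Least-squares solution: one estimate per parameter, p + 1 in total.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat.shape)  # (4,)
print(np.round(beta_hat, 2))
```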


# 1. Utilizing scikit-learn for Linear Regression

In this section, we will explore how to employ the **`scikit-learn`** library to solve a Machine Learning problem using Linear Regression.

Throughout the upcoming exercises, our objective will be to predict the **selling price of cars** based on their **characteristics**.

### Importing the Dataset

The dataset we will be working with contains a variety of attributes related to different cars from the year 1985.

For simplicity, only numeric variables have been retained, and rows with missing values have been removed.

* **(a)** Import the `pandas` module and assign it an alias `pd`.

* **(b)** Load the `automobiles.csv` dataset into a `DataFrame` named `df` using the `read_csv` function from `pandas`. The file is located in the same directory as the notebook runtime environment.

* **(c)** Display the first 5 rows of the `df` DataFrame to verify a successful import.


In [2]:
import pandas as pd
# The file is expected in the same directory as the notebook
df = pd.read_csv('automobiles.csv')
df.head()

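| Unnamed: 0 | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 164 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 1 | 2 | 164 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| 2 | 1 | 158 | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | 136 | 3.19 | 3.4 | 8.5 | 110 | 5500 | 19 | 25 | 17710 |
| 3 | 1 | 158 | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | 131 | 3.13 | 3.4 | 8.3 | 140 | 5500 | 17 | 20 | 23875 |
| 4 | 2 | 192 | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | 108 | 3.5 | 2.8 | 8.8 | 101 | 5800 | 23 | 29 | 16430 |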


* The `symboling` variable indicates how risky the car is from the insurer's point of view, taking into account factors like the risk of accidents and breakdowns.

* The `normalized-losses` variable represents the relative average annual cost of vehicle insurance. It is normalized across cars of the same type (SUV, utility, sports, etc.).

* The next 13 variables refer to technical specifications of the cars, including dimensions, engine displacement, horsepower, etc.

* The final variable, `price`, denotes the selling price of the vehicle. This is the variable we aim to predict.


### Separating Explanatory Variables from Target Variable

> We will now split the data into two separate objects: one containing the explanatory variables and the other containing the target variable `price`.

* **(d)** Create a `DataFrame` named `X` by copying all the explanatory variables from our dataset, excluding `price`.

* **(e)** Create a variable named `y` containing the target variable `price`. Selecting a single column with `df['price']` returns a `pandas` `Series`.


In [9]:
X = df[df.columns[:-1]]
y = df['price']
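
> The sketch below shows an equivalent way to separate the variables that does not rely on the target being the last column: the target is dropped by name with `DataFrame.drop`. The frame `toy` is a hypothetical miniature stand-in for `df`, built from three columns of the first rows displayed above.

```python
import pandas as pd

# Hypothetical miniature frame standing in for df (values taken from the first rows shown above).
toy = pd.DataFrame({
    "horsepower": [102, 115, 110],
    "curb-weight": [2337, 2824, 2844],
    "price": [13950, 17450, 17710],
})

X_toy = toy.drop(columns="price")  # every column except the target: a DataFrame
y_toy = toy["price"]               # single-bracket selection: a Series

print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)
```

> Note that in `df` the column `Unnamed: 0` is only a row counter carried over from the CSV file. With either approach it ends up in `X` unless it is dropped explicitly.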

### Splitting the Data into Training and Test Sets

> We are now going to divide our dataset into two distinct sets: a **training** set and a **test** set.
>
>> - The training set is used to train the model, i.e., to determine the optimal $\beta_0$, ..., $\beta_p$ parameters for this dataset.
>>
>> - The test set is utilized to assess the trained model. This evaluation gauges its ability to make accurate predictions on data it has **never encountered** before.

> The `train_test_split` function from the `model_selection` submodule of **`scikit-learn`** performs this split.

* **(f)** Execute the following cell to import the `train_test_split` function.


In [10]:
from sklearn.model_selection import train_test_split

> This function is used as follows:
>
> ```python
>    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
> ```
>
> - `X_train` and `y_train` represent the explanatory and target variables of the **training** dataset.
>
>
> - `X_test` and `y_test` represent the explanatory and target variables of the **test** dataset.
>
>
> - The `test_size` parameter dictates the **proportion** of the dataset allocated for the test set. In the example above, this proportion is set to 20% of the initial dataset.
>
>
> - The `random_state` argument makes the split reproducible. The split is random, so two successive calls will generally produce two different partitions. As long as `random_state` is given the same value (which value does not matter), `train_test_split` returns the same split every time.
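>
> The effect of `test_size` and `random_state` can be checked on a small example. The arrays `X_demo` and `y_demo` below are hypothetical toy data with 20 observations, so a 20% test set contains 4 of them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 observations, 2 features.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same random_state produce the same split.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)     # (16, 2) (4, 2)
print(np.array_equal(a_test, b_test))  # True
```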

* **(g)** Using the `train_test_split` function, divide the dataset into a training set (`X_train`, `y_train`) and a test set (`X_test`, `y_test`) so that the test set contains **15% of the original dataset**. Specify the parameter `random_state = 42`.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.15,
                                                    random_state=42)

### Training the regression model

> To train a linear regression model on this dataset, we will use the **`LinearRegression`** class contained in the `linear_model` submodule of `scikit-learn`.

* **(h)** Execute the following cell to import the `LinearRegression` class.

In [12]:
from sklearn.linear_model import LinearRegression

> The `scikit-learn` API makes it easy to train and evaluate models. Its predictive model classes, such as `LinearRegression`, share the following two methods:
>
>> * **`fit`**: Train the model on the dataset given as input.
>>
>>
>> * **`predict`**: Make a prediction from a set of explanatory variables given as input.
>
> Below is an example of training a model with scikit-learn:
>
>
> ```python
> # Instantiation of the model
> linreg = LinearRegression()
>
> # Training the model on the training set
> linreg.fit(X_train, y_train)
>
> # Prediction of the target variable for the test dataset. These predictions are stored in y_pred.
> y_pred = linreg.predict(X_test)
>
> ```
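>
> After `fit`, a `LinearRegression` object exposes the estimated parameters: `intercept_` holds the estimate of $\beta_0$ and `coef_` holds the estimates of $\beta_1, \dots, \beta_p$. The sketch below checks this on hypothetical noise-free data generated from a known affine function; the names `X_demo`, `y_demo` and `model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free toy data generated from y = 3 + 2*x_1 - 1*x_2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

print(model.intercept_)  # estimate of beta_0
print(model.coef_)       # estimates of beta_1 and beta_2

# predict applies the fitted affine function: intercept_ + X @ coef_
manual = model.intercept_ + X_demo @ model.coef_
print(np.allclose(manual, model.predict(X_demo)))
```

> Because the toy data contain no noise, the fitted parameters should recover the generating values up to floating-point error.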


* **(i)** Instantiate a `LinearRegression` model named **`lr`**.


* **(j)** Train `lr` on the training dataset.


* **(k)** Make a prediction on the training data. Store these predictions in `y_pred_train`.


* **(l)** Make a prediction on the test data. Store these predictions in `y_pred_test`.

In [13]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

### Evaluating Model Performance

> To assess the accuracy of the model's predictions, we have several predefined metrics available in the `scikit-learn` library. One commonly used metric for regression tasks is the Mean Squared Error (MSE). It is accessible through the `mean_squared_error` function in the `metrics` submodule of `scikit-learn`.
> 
> The MSE is computed by averaging the squared differences between the predicted values obtained from the regression function and the actual target values.
>
> The `mean_squared_error` function of `scikit-learn` is used as follows:
>
> ```python
>    mean_squared_error(y_true, y_pred)
> ```
> where:
>
>> * `y_true` contains the true values of the target variable.
>> * `y_pred` contains the values **predicted** by our model for the same explanatory variables.
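>
> To make the definition concrete, the sketch below computes the MSE by hand and compares it with `mean_squared_error`. The predicted values are hypothetical and chosen only for illustration; they do not come from the model trained in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# True prices taken from the first rows of the dataset; predictions are hypothetical.
y_true = np.array([13950.0, 17450.0, 17710.0, 23875.0])
y_pred = np.array([14500.0, 16900.0, 18100.0, 22000.0])

# MSE: average of the squared differences between predictions and true values.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)
```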


* **(m)** Import the **`mean_squared_error`** function from the `sklearn.metrics` submodule.


* **(n)** Evaluate the prediction quality of the model on **training data**. Store the result in a variable named `mse_train`.


* **(o)** Evaluate model prediction quality on **test data**. Store the result in a variable named `mse_test`.


In [14]:
from sklearn.metrics import mean_squared_error

In [15]:
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
print('MSE for train: ', mse_train)
print('MSE for test: ', mse_test)

MSE for train:  5025260.95589875
MSE for test:  6976108.119005606


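>The Mean Squared Error (MSE) is expressed in squared units of the target variable. With prices in the thousands or tens of thousands, values in the millions are therefore expected, which makes the MSE hard to interpret directly.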
>
>To address this, we'll also utilize another metric called the **Mean Absolute Error** (MAE). This metric is on the same scale as the target variable, making it more easily interpretable.
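>
> As a minimal sketch of why the MAE is easier to read, the example below computes it by hand on hypothetical predictions (not the output of the model trained here) and compares it with `mean_absolute_error`. It also shows the square root of the MSE, which is another way to return to the units of the target; this is an aside and is not used in the exercises.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# True prices taken from the first rows of the dataset; predictions are hypothetical.
y_true = np.array([13950.0, 17450.0, 17710.0, 23875.0])
y_pred = np.array([14500.0, 16900.0, 18100.0, 22000.0])

# MAE: average of the absolute differences, in the same units as the target.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

# The square root of the MSE is also expressed in the units of the target.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(mae_manual, mae_sklearn, rmse)
```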
>
* **(p)** Import the `mean_absolute_error` function from the `sklearn.metrics` submodule.

* **(q)** Assess the prediction quality on both the test and training data using the mean absolute error.

* **(r)** Calculate the average purchase price for all vehicles from the `DataFrame` `df`. Based on this, do you find the model's predictions to be reliable?


In [17]:
from sklearn.metrics import mean_absolute_error

In [20]:
mae_train = mean_absolute_error(y_train, y_pred_train)
mae_test = mean_absolute_error(y_test, y_pred_test)
print('MAE for train: ', mae_train)
print('MAE for test: ', mae_test)
print('Relative errors: ', mae_test / df['price'].mean())

MAE for train:  1664.3472259040404
MAE for test:  1917.5752957207417
Relative errors:  0.16753631000197153


In [19]:
df['price'].mean()

np.float64(11445.729559748428)

# Conclusion and Recap

>In this notebook, we walked through how to solve a Machine Learning problem.
>
> The different stages that we studied are the classic stages of any project:
>
> * Data exploration with the `pandas` library
>
> * Data preparation by separating the explanatory variables from the target variable
>
> * Separation of the dataset into two (a training set and a test set) using the `train_test_split` function from the `scikit-learn` library
>
> * Identification of the type of problem: here a regression
>
> * Instantiation of a model like `LinearRegression` with the `scikit-learn` library
>
> * Training the model on the training dataset using the `fit` method
>
> * Prediction on test data using the `predict` method
>
> * Evaluation of model performance by calculating the error between these predictions and the true values of the target variable in the test data. A regression model is easily evaluated with the `mean_squared_error` or `mean_absolute_error` functions of the `metrics` submodule of `scikit-learn`.
>
> In the next notebook, we will carry out the same steps to solve a classification problem.