<a href="https://colab.research.google.com/github/piotr-osiwianski/data-science-bootcamp/blob/master/06_uczenie_maszynowe/03_metryki_regresja.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

* @author: krakowiakpawel9@gmail.com  
* @site: e-smartdata.org

### scikit-learn
>Library website: [https://scikit-learn.org](https://scikit-learn.org)  
>
>Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)
>
>The core machine learning library for Python.
>
>To install scikit-learn, run the command below:
```
pip install scikit-learn
```

### Metrics - Regression problems:
1. [Importing libraries](#a0)
2. [Graphical interpretation](#a2)
3. [Mean Absolute Error - MAE](#a3)
4. [Mean Squared Error - MSE](#a4)
5. [Root Mean Squared Error - RMSE](#a5)
6. [Max Error](#a6)
7. [R2 score - coefficient of determination](#a7)

    

### <a name='a0'></a> Importing libraries

In [4]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

In [13]:

# 50 samples drawn from a normal distribution with mean 100 and standard deviation 20
y_true = 100 + 20 * np.random.randn(50)
y_true

array([125.3811523 , 132.84479151,  94.17514019,  72.64789   ,
        93.39456092,  99.31765741, 106.42431633, 110.64518412,
       109.50765913,  85.89075333, 129.03229092, 108.52523509,
       109.97997916, 112.97274009, 136.12165641, 141.2700659 ,
        87.06546312,  86.28222595,  87.2444297 , 100.87255678,
       112.43852659,  79.8998623 ,  93.53764617,  96.74669104,
       101.84661172, 124.3340282 , 129.02820486, 125.16298328,
       122.03904582,  86.627924  , 129.31765451, 115.75229451,
        64.80394364,  76.00088619, 114.96550146,  77.49767571,
       103.67674647, 113.5927107 , 136.87376778, 149.5390332 ,
       104.85141828,  89.2120048 ,  81.08798471,  98.79189918,
        49.14650095,  93.42724806,  90.86148545,  89.55278833,
       116.75337946, 111.69288461])

In [14]:
# simulate model predictions by adding Gaussian noise (standard deviation 10) to the true values
y_pred = y_true + 10 * np.random.randn(50)
y_pred

array([114.82802365, 134.24594419,  88.96445844,  65.67823263,
        94.81312803, 105.84251642, 113.41937017,  96.90015493,
       103.43862201,  79.50133099, 133.20552147, 112.17795345,
       113.26502645, 119.74860884, 142.40406592, 137.14097395,
        75.53959978,  92.14419468,  96.20597558,  89.05882045,
       112.15573675,  65.69350056, 106.51641783,  96.72463087,
        77.54834763, 134.6140117 , 138.275178  , 111.84896606,
       122.58262493,  98.98155955, 130.49901511, 115.1180164 ,
        54.61029594,  72.10992187, 120.63971386,  72.15441105,
       103.67332149, 114.38958905, 148.3886451 , 145.22980585,
        99.73806781,  86.98580392,  72.79246874, 111.88448414,
        53.83096455, 102.38766294,  98.04622096,  77.77481778,
       126.77985475, 101.8569436 ])
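
Because `np.random.randn` draws fresh numbers on every run, the arrays printed above will differ between sessions. A minimal sketch of a reproducible variant follows. It assumes NumPy's `default_rng` generator and an arbitrary seed of 42. The helper name `make_data` is hypothetical and not part of the original notebook.

```python
import numpy as np

def make_data(n=50, seed=42):
    """Return (y_true, y_pred) drawn from a seeded generator (illustrative helper)."""
    rng = np.random.default_rng(seed)
    y_true = 100 + 20 * rng.standard_normal(n)     # mean 100, std 20
    y_pred = y_true + 10 * rng.standard_normal(n)  # true values plus noise with std 10
    return y_true, y_pred

# the same seed yields the same arrays on every call
y_true_a, y_pred_a = make_data()
y_true_b, y_pred_b = make_data()
print(np.array_equal(y_true_a, y_true_b))
```

With a fixed seed, the metric values computed later in the notebook would stay the same from run to run.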

In [19]:
results = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
# head() displays the first 5 rows
results.head()

|   | y_true | y_pred |
|---|---|---|
| 0 | 125.381152 | 114.828024 |
| 1 | 132.844792 | 134.245944 |
| 2 | 94.17514 | 88.964458 |
| 3 | 72.64789 | 65.678233 |
| 4 | 93.394561 | 94.813128 |


In [21]:
# error = difference between the true value and the value predicted by the model
results['error'] = results['y_true'] - results['y_pred']
results.head()

|   | y_true | y_pred | error |
|---|---|---|---|
| 0 | 125.381152 | 114.828024 | 10.553129 |
| 1 | 132.844792 | 134.245944 | -1.401153 |
| 2 | 94.17514 | 88.964458 | 5.210682 |
| 3 | 72.64789 | 65.678233 | 6.969657 |
| 4 | 93.394561 | 94.813128 | -1.418567 |



### <a name='a2'></a> Graphical interpretation

In [23]:
def plot_regression_results(y_true, y_pred):
    results = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
    # smallest and largest values overall, used as endpoints of the y = x reference line
    # (named lo/hi so the built-in min() and max() are not shadowed)
    lo = results[['y_true', 'y_pred']].min().min()
    hi = results[['y_true', 'y_pred']].max().max()

    fig = go.Figure(data=[go.Scatter(x=results['y_true'], y=results['y_pred'], mode='markers'),
                          go.Scatter(x=[lo, hi], y=[lo, hi])],
                    layout=go.Layout(showlegend=False, width=800, height=500,
                                     xaxis_title='y_true',
                                     yaxis_title='y_pred',
                                     title='Regression results'))
    fig.show()

plot_regression_results(y_true, y_pred)

In [25]:
y_true = 100 + 20 * np.random.randn(1000)
y_pred = y_true + 10 * np.random.randn(1000)
results = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
results['error'] = results['y_true'] - results['y_pred']

px.histogram(results, x='error', nbins=50, width=800)
# error = y_true - y_pred, so more mass on the right (positive errors) means the model underestimates,
# and more mass on the left means it overestimates; ideally the distribution is symmetric around zero
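
Since the error is defined as `y_true - y_pred`, the sign of the average error shows the direction of a systematic bias. The sketch below illustrates this with two hypothetical models, one shifted 5 units too low and one 5 units too high. The shift, the seed, and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
y_actual = 100 + 20 * rng.standard_normal(1000)
noise = 10 * rng.standard_normal(1000)

# hypothetical model that predicts about 5 units too low
y_low = y_actual - 5 + noise
# hypothetical model that predicts about 5 units too high
y_high = y_actual + 5 + noise

# positive mean error: predictions fall below the true values (underestimation)
mean_err_low = (y_actual - y_low).mean()
# negative mean error: predictions exceed the true values (overestimation)
mean_err_high = (y_actual - y_high).mean()

print(mean_err_low, mean_err_high)
```

In a histogram of the error, the first model would shift the distribution to the right of zero and the second to the left.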

### <a name='a3'></a> Mean Absolute Error - MAE
### $$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_{true} - y_{pred}|$$

In [26]:
def mean_absolute_error(y_true, y_pred):
    return abs(y_true - y_pred).sum() / len(y_true)

mean_absolute_error(y_true, y_pred)

7.879966736022787

In [27]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_true, y_pred)

7.879966736022787
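
To make the formula concrete, here is a small worked example on four hand-picked values. The arrays are illustrative, not taken from the notebook's data.

```python
import numpy as np

y_t = np.array([3.0, 5.0, 2.0, 7.0])
y_p = np.array([2.5, 5.0, 4.0, 8.0])

# absolute errors per sample: 0.5, 0.0, 2.0, 1.0
abs_errors = np.abs(y_t - y_p)

# MAE = (0.5 + 0.0 + 2.0 + 1.0) / 4 = 0.875
mae = abs_errors.mean()
print(mae)
```

Each sample contributes to MAE in proportion to the size of its error, so the result reads directly as "off by 0.875 units on average".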

### <a name='a4'></a> Mean Squared Error - MSE
### $$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{true} - y_{pred})^{2}$$

In [28]:
def mean_squared_error(y_true, y_pred):
    return ((y_true - y_pred) ** 2).sum() / len(y_true)

mean_squared_error(y_true, y_pred)

96.81396977612414

In [29]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_true, y_pred)

96.81396977612414
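
Squaring the errors makes MSE more sensitive to large individual mistakes than MAE. The sketch below uses made-up arrays to compare two sets of predictions with the same MAE but very different MSE. The helper names `mae_of` and `mse_of` are introduced only for this illustration.

```python
import numpy as np

def mae_of(y_true, y_pred):
    return np.abs(y_true - y_pred).mean()

def mse_of(y_true, y_pred):
    return ((y_true - y_pred) ** 2).mean()

y_t = np.array([10.0, 10.0, 10.0, 10.0])
y_even = np.array([12.0, 8.0, 12.0, 8.0])       # four errors of size 2
y_outlier = np.array([10.0, 10.0, 10.0, 18.0])  # a single error of size 8

# both prediction sets have MAE = 2
print(mae_of(y_t, y_even), mae_of(y_t, y_outlier))
# MSE is 4 for the evenly spread errors but 16 for the single large one
print(mse_of(y_t, y_even), mse_of(y_t, y_outlier))
```

This is one reason to report both metrics: a large gap between them hints at a few large errors rather than many small ones.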

### <a name='a5'></a> Root Mean Squared Error - RMSE
### $$RMSE = \sqrt{MSE}$$

In [30]:
# a typical size of the prediction error, expressed in the same units as the target
def root_mean_squared_error(y_true, y_pred):
    return np.sqrt(((y_true - y_pred) ** 2).sum() / len(y_true))

root_mean_squared_error(y_true, y_pred)

9.83940901559256

In [31]:
np.sqrt(mean_squared_error(y_true, y_pred))

9.83940901559256
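
Taking the square root brings the metric back to the units of the target, which MSE does not have. The sketch below illustrates this by rescaling made-up values from metres to centimetres. The helper name `rmse_of` and the data are assumptions made for the example.

```python
import numpy as np

def rmse_of(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# illustrative targets and predictions in metres
y_m = np.array([1.0, 2.0, 3.0])
pred_m = np.array([1.5, 2.0, 2.0])

# the same values expressed in centimetres
y_cm, pred_cm = 100 * y_m, 100 * pred_m

mse_m = np.mean((y_m - pred_m) ** 2)
mse_cm = np.mean((y_cm - pred_cm) ** 2)

# MSE grows by a factor of 100**2, RMSE only by a factor of 100
print(mse_cm / mse_m)
print(rmse_of(y_cm, pred_cm) / rmse_of(y_m, pred_m))
```

scikit-learn 1.4 and later also provide `sklearn.metrics.root_mean_squared_error`. On older versions, `np.sqrt(mean_squared_error(...))` as used above gives the same result.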

### <a name='a6'></a> Max Error

$$ME = \max(|y_{true} - y_{pred}|)$$

In [32]:
def max_error(y_true, y_pred):
    return abs(y_true - y_pred).max()

In [33]:
max_error(y_true, y_pred)

29.146490290560834

In [34]:
from sklearn.metrics import max_error

max_error(y_true, y_pred)

29.146490290560834
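
The max error is a single number, so it is often useful to know which sample produced it. A minimal sketch using `np.argmax` on made-up values follows. The arrays and variable names are illustrative.

```python
import numpy as np

y_t = np.array([100.0, 120.0, 90.0, 110.0])
y_p = np.array([98.0, 125.0, 60.0, 111.0])

# absolute errors per sample: 2, 5, 30, 1
abs_err = np.abs(y_t - y_p)

# index of the worst prediction and the size of its error
worst_idx = int(np.argmax(abs_err))
worst = abs_err[worst_idx]
print(worst_idx, worst)
```

The value at that index equals `abs_err.max()`, which is the max error as defined above.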

### <a name='a7'></a> R2 score - coefficient of determination
### $$R2\_score = 1 - \frac{\sum_{i=1}^{n}(y_{true} - y_{pred})^{2}}{\sum_{i=1}^{n}(y_{true} - \overline{y_{true}})^{2}}$$

In [35]:
from sklearn.metrics import r2_score

r2_score(y_true, y_pred)

0.7657087453125047

In [37]:
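def r2_score(y_true, y_pred):
    numerator = ((y_true - y_pred) ** 2).sum()
    denominator = ((y_true - y_true.mean()) ** 2).sum()
    # NumPy division by zero does not raise ZeroDivisionError (it returns inf/nan
    # with a warning), so the denominator is checked explicitly
    if denominator == 0:
        raise ValueError('R2 is undefined: y_true is constant (division by zero)')
    return 1 - numerator / denominator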

In [38]:
r2_score(y_true, y_pred)

0.7657087453125047
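
Three reference cases help when reading an R2 value. Perfect predictions give 1, always predicting the mean of `y_true` gives 0, and predictions worse than that mean baseline give a negative score. The sketch below checks this on a small made-up array. The helper name `r2_of` is an assumption chosen to avoid clashing with the functions above.

```python
import numpy as np

def r2_of(y_true, y_pred):
    ss_res = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
    return 1 - ss_res / ss_tot

y_t = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_of(y_t, y_t)                             # predictions equal the targets
baseline = r2_of(y_t, np.full_like(y_t, y_t.mean()))  # always predict the mean (2.5)
worse = r2_of(y_t, y_t[::-1])                         # predictions in reversed order

print(perfect, baseline, worse)
```

For the reversed predictions, the residual sum of squares is 20 and the total sum of squares is 5, so the score is 1 - 20/5 = -3.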