
### scikit-learn
>Library website: [https://scikit-learn.org](https://scikit-learn.org)  
>
>Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)
>
>The core machine learning library for Python.
>
>To install scikit-learn, run the command below:
```
pip install scikit-learn
```

### Metrics - regression problems:

1. [Importing libraries](#a1)
2. [Graphical interpretation](#a2)
3. [Mean Absolute Error - MAE](#a3)
4. [Mean Squared Error - MSE](#a4)
5. [Root Mean Squared Error - RMSE](#a5)
6. [Max Error](#a6)
7. [R2 score - coefficient of determination](#a7)

### <a name='a1'></a> Importing libraries

In [3]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

In [4]:
y_true = 100 + 20 * np.random.randn(50)
y_true

array([ 93.95827432,  96.63923478,  83.98612065,  83.77775013,
        88.16687714, 104.43253564, 110.60659595,  84.76130101,
        93.34643139,  70.24643644, 109.93510992, 105.45437671,
        84.78635188,  86.49140986, 116.09115521,  75.11834371,
        76.01917202,  84.92021764, 113.06340612,  87.20059287,
        81.15330673, 110.89136709, 120.42449648,  82.67821518,
       107.22497496,  81.90294094, 111.38104193,  94.89871342,
       129.37934759, 100.25705633, 119.12282726, 101.35464054,
       131.77291348,  85.92089396, 113.22377507, 111.9479476 ,
       122.17811069, 100.18258214, 117.32744895, 118.01555307,
       104.86042189, 113.95085941, 115.39083974,  70.98349407,
        96.04077034, 114.78722501, 100.25415709,  91.00541626,
       108.21571303,  81.3287804 ])

In [5]:
y_pred = y_true + 10 * np.random.randn(50)
y_pred

array([ 93.76910275,  92.39727576,  82.29848719,  95.79894258,
        72.1311244 , 112.96223823,  93.46336988, 103.75605404,
        88.18394743,  84.79376705, 119.7517848 , 116.00751749,
        66.29394542,  98.82892962, 113.81389487,  81.22939556,
        80.12075244,  86.00224749, 119.22735001,  85.81849957,
        82.17414433, 100.95317368, 112.79436744,  75.08707792,
       111.89554177,  86.70532339, 124.85771598,  88.74927338,
       120.78486539, 108.7873895 , 114.10072935,  88.79224811,
       117.22640041,  98.5435774 , 115.9073574 , 102.00514441,
       123.57701819, 102.47594952, 112.03591129, 121.2444636 ,
        96.47953157, 130.11048077,  98.05105497,  63.75480492,
       100.74113384, 111.10946716, 105.71701532, 110.77745084,
       129.52201293,  95.02075681])

In [6]:
results = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
results.head()

| | y_true | y_pred |
|---|---|---|
| 0 | 93.958274 | 93.769103 |
| 1 | 96.639235 | 92.397276 |
| 2 | 83.986121 | 82.298487 |
| 3 | 83.77775 | 95.798943 |
| 4 | 88.166877 | 72.131124 |


In [7]:
results['error'] = results['y_true'] - results['y_pred']
results.head()

| | y_true | y_pred | error |
|---|---|---|---|
| 0 | 93.958274 | 93.769103 | 0.189172 |
| 1 | 96.639235 | 92.397276 | 4.241959 |
| 2 | 83.986121 | 82.298487 | 1.687633 |
| 3 | 83.77775 | 95.798943 | -12.021192 |
| 4 | 88.166877 | 72.131124 | 16.035753 |


### <a name='a2'></a> Graphical interpretation

In [8]:
def plot_regression(y_true, y_pred):
  results = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
  # Common axis range; named min_val/max_val to avoid shadowing the built-in min() and max()
  min_val = results[['y_true', 'y_pred']].min().min()
  max_val = results[['y_true', 'y_pred']].max().max()

  # Markers: one point per observation; line: the diagonal y_pred = y_true (perfect predictions)
  fig = go.Figure(data=[go.Scatter(x=results['y_true'], y=results['y_pred'], mode='markers'),
                        go.Scatter(x=[min_val, max_val], y=[min_val, max_val])],
                  layout=go.Layout(showlegend=False, width=800, height=500,
                                   xaxis_title='y_true',
                                   yaxis_title='y_pred',
                                   title='Regression results'))
  fig.show()

plot_regression(y_true, y_pred)

In [9]:
y_true = 100 + 20 * np.random.randn(1000)
y_pred = y_true + 10 * np.random.randn(1000)
results = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred})
results['error'] = results['y_true'] - results['y_pred']

px.histogram(results, x='error', nbins = 50, width = 800)
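
The histogram can be complemented with two summary numbers. The sketch below regenerates data the same way as the cell above and reports the mean and standard deviation of the errors. Because the noise added to `y_pred` is `10 * randn`, the mean error should be close to 0 and the standard deviation close to 10. The seeded generator `rng` and the `_sim` variable names are assumptions introduced here for reproducibility; they are not part of the original notebook.

```python
import numpy as np

# Seeded generator (an assumption) so the numbers are reproducible
rng = np.random.default_rng(0)

# Same construction as in the notebook: targets around 100, noise with scale 10
y_true_sim = 100 + 20 * rng.standard_normal(1000)
y_pred_sim = y_true_sim + 10 * rng.standard_normal(1000)
errors = y_true_sim - y_pred_sim

mean_error = errors.mean()  # systematic bias of the predictions, expected near 0
std_error = errors.std()    # spread of the errors, expected near 10
print(mean_error, std_error)
```

A mean error far from zero would indicate that the predictions are systematically too high or too low, which a symmetric histogram centered at zero rules out.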

### <a name='a3'></a> Mean Absolute Error - MAE
### $$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_{true} - y_{pred}|$$

In [10]:
def mean_absolute_error(y_true, y_pred):
  return abs(y_true - y_pred).sum() / len(y_true)

mean_absolute_error(y_true, y_pred)

8.000039648563577

In [11]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_true, y_pred)

8.000039648563577
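
To make the formula concrete, here is a minimal sketch on three hand-checkable values. The arrays `y_true_toy` and `y_pred_toy` are hypothetical and were chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical toy data, small enough to verify by hand
y_true_toy = np.array([3.0, 5.0, 2.0])
y_pred_toy = np.array([2.0, 5.0, 4.0])

# Absolute errors: |3 - 2|, |5 - 5|, |2 - 4| give 1, 0, 2
abs_errors = np.abs(y_true_toy - y_pred_toy)

# MAE is the mean of the absolute errors: (1 + 0 + 2) / 3 = 1.0
mae_toy = abs_errors.mean()
print(mae_toy)
```

MAE is expressed in the same units as the target, so a value of 1.0 here means the predictions miss by one unit on average.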

### <a name='a4'></a> Mean Squared Error - MSE
### $$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{true} - y_{pred})^{2}$$

In [12]:
def mean_squared_error(y_true, y_pred):
  return ((y_true - y_pred) ** 2).sum() / len(y_true)

mean_squared_error(y_true, y_pred)

97.167194131639

In [13]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred)

97.167194131639
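
Squaring makes MSE more sensitive to large errors than MAE. The sketch below compares two hypothetical error vectors with the same MAE: one with four moderate errors and one with a single large error. The vectors are illustrative assumptions, not data from the notebook.

```python
import numpy as np

# Hypothetical error vectors (y_true - y_pred) with identical mean absolute error
errors_even = np.array([2.0, 2.0, 2.0, 2.0])     # four moderate errors
errors_outlier = np.array([0.0, 0.0, 0.0, 8.0])  # one large error

mae_even = np.abs(errors_even).mean()        # (2 + 2 + 2 + 2) / 4 = 2.0
mae_outlier = np.abs(errors_outlier).mean()  # (0 + 0 + 0 + 8) / 4 = 2.0

mse_even = (errors_even ** 2).mean()         # (4 + 4 + 4 + 4) / 4 = 4.0
mse_outlier = (errors_outlier ** 2).mean()   # (0 + 0 + 0 + 64) / 4 = 16.0

print(mae_even, mae_outlier)
print(mse_even, mse_outlier)
```

Both vectors look equally good under MAE, while MSE rates the one containing the outlier four times worse.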

### <a name='a5'></a> Root Mean Squared Error - RMSE
### $$RMSE = \sqrt{MSE}$$

In [14]:
def root_mean_squared_error(y_true, y_pred):
  return np.sqrt(((y_true - y_pred) ** 2).sum() / len(y_true))

root_mean_squared_error(y_true, y_pred)

9.857342143379167

In [15]:
np.sqrt(mean_squared_error(y_true, y_pred))

9.857342143379167
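
One reason to report RMSE rather than MSE is that RMSE is in the same units as the target. The sketch below illustrates this by expressing the same hypothetical data in two units (for example thousands versus base units): MSE changes by the square of the scale factor, RMSE by the factor itself. The helper `mse` and the toy arrays are assumptions made for this illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, same formula as in the section above."""
    return ((y_true - y_pred) ** 2).mean()

# Hypothetical targets and predictions expressed in thousands
y_true_k = np.array([1.0, 2.0, 3.0])
y_pred_k = np.array([1.5, 2.0, 2.0])

# The same data expressed in base units (multiplied by 1000)
mse_k = mse(y_true_k, y_pred_k)
mse_units = mse(y_true_k * 1000, y_pred_k * 1000)

rmse_k = np.sqrt(mse_k)
rmse_units = np.sqrt(mse_units)

print(mse_units / mse_k)    # ratio of MSE values: the scale factor squared
print(rmse_units / rmse_k)  # ratio of RMSE values: the scale factor itself
```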

### <a name='a6'></a> Max Error

$$ME = \max(|y_{true} - y_{pred}|)$$

In [16]:
def max_error(y_true, y_pred):
  return abs(y_true - y_pred).max()

In [17]:
max_error(y_true, y_pred)

31.468156789463293

In [18]:
from sklearn.metrics import max_error

max_error(y_true, y_pred)

31.468156789463293
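
Max error reports only the size of the worst miss. In practice it is often useful to know which observation produced it, and `np.argmax` gives that. The toy arrays below are hypothetical.

```python
import numpy as np

# Hypothetical toy data
y_true_toy = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_toy = np.array([11.0, 19.0, 36.0, 40.0])

# Absolute errors: 1, 1, 6, 0
abs_errors = np.abs(y_true_toy - y_pred_toy)

# Position and size of the largest absolute error
worst_index = int(np.argmax(abs_errors))
worst_error = abs_errors[worst_index]
print(worst_index, worst_error)
```

Unlike MAE or MSE, this metric depends on a single observation, so it does not improve when the remaining predictions get better.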

### <a name='a7'></a> R2 score - coefficient of determination
### $$R^{2} = 1 - \frac{\sum_{i=1}^{n}(y_{true} - y_{pred})^{2}}{\sum_{i=1}^{n}(y_{true} - \overline{y_{true}})^{2}}$$

In [19]:
from sklearn.metrics import r2_score

r2_score(y_true, y_pred)

0.7748858250325559

In [20]:
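def r2_score(y_true, y_pred):
  # Residual sum of squares
  numerator = ((y_true - y_pred) ** 2).sum()
  # Total sum of squares around the mean of y_true
  denominator = ((y_true - y_true.mean()) ** 2).sum()
  # NumPy does not raise ZeroDivisionError for float division, so check explicitly
  if denominator == 0:
    print('Division by zero')
    return np.nan  # assumption: an undefined score is reported as NaN
  return 1 - numerator / denominator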

In [21]:
r2_score(y_true, y_pred)

0.7748858250325559
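
Three reference cases help with reading an R2 value: perfect predictions give 1, always predicting the mean of `y_true` gives 0, and predictions worse than the mean give a negative score. The sketch below checks these on hypothetical toy data. The function name `r2_manual` is an assumption, used here to avoid clashing with `r2_score`.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R2 from the formula above: 1 - SS_res / SS_tot."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

# Hypothetical targets; mean is 3, SS_tot = 4 + 1 + 0 + 1 + 4 = 10
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect predictions: SS_res = 0, so R2 = 1
r2_perfect = r2_manual(y_true_toy, y_true_toy.copy())

# Constant prediction equal to the mean: SS_res = SS_tot, so R2 = 0
r2_mean = r2_manual(y_true_toy, np.full(5, y_true_toy.mean()))

# Reversed predictions: SS_res = 16 + 4 + 0 + 4 + 16 = 40, so R2 = 1 - 40/10 = -3
r2_reversed = r2_manual(y_true_toy, y_true_toy[::-1])

print(r2_perfect, r2_mean, r2_reversed)
```

Read this way, the value of about 0.77 obtained above says the predictions remove roughly 77% of the squared error that a constant mean prediction would leave.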