# Mean squared error

## Setup

We need the following Python packages:

- pandas
- scikit-learn

In [1]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

## Data

### Create data

In [2]:
df = pd.DataFrame({
    'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
    'ads': [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100],
})

### Data structure

In [3]:
df

| | sales | ads |
|---|---|---|
| 0 | 2500 | 900 |
| 1 | 4500 | 1400 |
| 2 | 6500 | 3600 |
| 3 | 8500 | 3800 |
| 4 | 10500 | 6200 |
| 5 | 12500 | 5200 |
| 6 | 14500 | 6800 |
| 7 | 16500 | 8300 |
| 8 | 18500 | 9800 |
| 9 | 20500 | 10100 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   sales   10 non-null     int64
 1   ads     10 non-null     int64
dtypes: int64(2)
memory usage: 288.0 bytes


## Model

In [5]:
intercept = 500 
slope = 2

df['sales_prediction'] = intercept + slope * df['ads'] 
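
To make the model step concrete, the following sketch wraps the linear rule in a small helper and checks the first prediction by hand (500 + 2 × 900 = 2300). The function name `predict_sales` is hypothetical and not part of the original notebook; the data and parameters are the ones defined above.

```python
import pandas as pd

# Same data as in the notebook
df = pd.DataFrame({
    'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
    'ads': [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100],
})

def predict_sales(ads, intercept=500, slope=2):
    """Hypothetical helper: linear prediction intercept + slope * ads."""
    return intercept + slope * ads

# Works element-wise on a whole pandas column
df['sales_prediction'] = predict_sales(df['ads'])

# First observation: 500 + 2 * 900
first_prediction = int(df['sales_prediction'].iloc[0])
print(first_prediction)
```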

### Calculate mean squared error

---

Step 1: Calculate the squared difference between the true outcome and the prediction

The general formula:

$$Squared\ error \ (SE_i) = (y_i-\hat{y_i})^2$$

For our data this means:

$$se_i = (sales_i - sales\_prediction_i)^2$$

In [6]:
df['se'] = (df['sales'] - df['sales_prediction']) ** 2

In [7]:
df.head()

| | sales | ads | sales_prediction | se |
|---|---|---|---|---|
| 0 | 2500 | 900 | 2300 | 40000 |
| 1 | 4500 | 1400 | 3300 | 1440000 |
| 2 | 6500 | 3600 | 7700 | 1440000 |
| 3 | 8500 | 3800 | 8100 | 160000 |
| 4 | 10500 | 6200 | 12900 | 5760000 |
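
As a worked check of Step 1, here is the arithmetic for the first and the fifth observation shown in the output (values taken from the data above). Note that the negative error in the fifth row becomes positive once it is squared:

$$(2500 - 2300)^2 = 200^2 = 40000$$

$$(10500 - 12900)^2 = (-2400)^2 = 5760000$$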


---

Step 2: Sum up all squared errors

Calculate the sum of squared errors (SSE)

The general formula:

$$Sum\ of\ squared\ errors \ (SSE) = \sum_{i=1}^{n}(y_i-\hat{y_i})^2$$

which equals

$$Sum\ of\ squared\ errors \ (SSE) = \sum_{i=1}^{n} SE_i$$



For our data this means:



$$sse = \sum_{i=1}^{n}(sales_i-sales\_prediction_i)^2$$


which equals

$$sse = \sum_{i=1}^{n} se_i$$

In [8]:
# Step 2: sum up all squared errors
sse = df['se'].sum()

In [9]:
sse

14520000
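
One way to see why the errors are squared before summing: a minimal sketch, using the same data and model as above, that compares the sum of the raw errors with the sum of the squared errors. The variable names `errors`, `sum_of_errors` and `sum_of_squared_errors` are illustrative and not from the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
    'ads': [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100],
})
df['sales_prediction'] = 500 + 2 * df['ads']

# Raw errors can be positive or negative
errors = df['sales'] - df['sales_prediction']

# Positive and negative errors partly cancel in a plain sum ...
sum_of_errors = int(errors.sum())
# ... but not after squaring
sum_of_squared_errors = int((errors ** 2).sum())

print(sum_of_errors)          # -2200
print(sum_of_squared_errors)  # 14520000
```

The plain sum is small relative to the individual errors because over- and underpredictions offset each other, whereas the SSE counts every deviation.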

---

Step 3: Divide by the number of examples

The general formula:

$$Mean\ squared\ error \ (MSE) = \frac{1}{n} \times \sum_{i=1}^{n}(y_i-\hat{y_i})^2$$

which equals

$$Mean\ squared\ error \ (MSE) = \frac{1}{n} \times SSE$$

*with n = the number of observations*



For our data this means:



$$mse = \frac{1}{n} \times \sum_{i=1}^{n}(sales_i-sales\_prediction_i)^2$$

which equals

$$mse = \frac{1}{n} \times sse $$


*with n = the number of observations*



In [10]:
# Obtain number of examples (observations)
n = len(df)

In [11]:
n

10

In [12]:
# Calculate mean squared error
mse = sse / n

In [13]:
mse

1452000.0

In [14]:
print(f'MSE: {mse:.0f}')

MSE: 1452000
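
The three steps can also be collapsed into a single expression, since dividing a sum by n is the same as taking the mean. The following is a sketch with the same data and model as above; the name `mse_direct` is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
    'ads': [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100],
})
df['sales_prediction'] = 500 + 2 * df['ads']

# Steps 1 to 3 in one go: square the errors, then average them
squared_errors = (df['sales'] - df['sales_prediction']) ** 2
mse_direct = squared_errors.mean()

print(f'MSE: {mse_direct:.0f}')
```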


### Root mean squared error

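Another important measure is the root mean squared error (RMSE), the square root of the MSE. Taking the square root returns the error to the units of the outcome (here: sales), which makes it easier to interpret than the MSE.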

The general formula:

$$Root\ mean\ squared\ error \ (RMSE) = \sqrt{\frac{1}{n} \times \sum_{i=1}^{n}(y_i-\hat{y_i})^2}$$

which equals

$$Root\ mean\ squared\ error \ (RMSE) = \sqrt{\frac{1}{n} \times SSE}$$

*with n = the number of observations*



For our data this means:



$$rmse = \sqrt{\frac{1}{n} \times \sum_{i=1}^{n}(sales_i-sales\_prediction_i)^2}$$

which equals

$$rmse = \sqrt{\frac{1}{n} \times sse} $$

which also equals

$$rmse = \left(\frac{1}{n} \times sse\right)^{\frac{1}{2}} $$



*with n = the number of observations*

In [15]:
# Calculate root mean squared error
rmse = mse ** 0.5

In [16]:
rmse

1204.9896265113655
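
Raising to the power 0.5 and calling a square-root function are interchangeable here. As a small sketch, the same value can be obtained with `math.sqrt` from the standard library, starting from the MSE computed above (entered as a literal so the snippet runs on its own):

```python
import math

mse = 1452000.0  # mean squared error from the previous step

# Square root of the MSE, equivalent to mse ** 0.5
rmse = math.sqrt(mse)

print(f'RMSE: {rmse:.2f}')  # RMSE: 1204.99
```

Read in the units of the outcome, the predictions deviate from the true sales by roughly 1205 in the root-mean-square sense.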

### Mean absolute error

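Another performance measure (which is less commonly used) is the mean absolute error (MAE). It averages the absolute values of the errors, so positive and negative errors cannot cancel each other out.

The general formula:

$$Mean\ absolute\ error \ (MAE) = \frac{1}{n} \times \sum_{i=1}^{n}|y_i-\hat{y_i}|$$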
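
The MAE can be computed by hand in the same three steps as the MSE, with the absolute value taking the place of the square. This is a minimal sketch on the same data and model; the column name `ae` and the variable `mae` are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
    'ads': [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100],
})
df['sales_prediction'] = 500 + 2 * df['ads']

# Step 1: absolute error per observation
df['ae'] = (df['sales'] - df['sales_prediction']).abs()

# Step 2: sum up all absolute errors
sae = df['ae'].sum()

# Step 3: divide by the number of observations
mae = sae / len(df)

print(f'MAE: {mae:.0f}')  # MAE: 980
```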



### Use functions

#### Mean squared error

In [17]:
mean_squared_error(df['sales'], df['sales_prediction'])

1452000.0

#### Root mean squared error

In [18]:
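# Note: `squared=False` is deprecated since scikit-learn 1.4; newer versions provide `root_mean_squared_error` instead
mean_squared_error(df['sales'], df['sales_prediction'], squared=False)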

1204.9896265113655

#### Mean absolute error

In [19]:
mean_absolute_error(df['sales'], df['sales_prediction'])

980.0