**Important notes:**

- You can complete this exercise either in Colab or on your local machine:
  1. **Colab**: Click on the 🚀 symbol at the top of the page and select Colab. When you have finished the exercise, download the file: `File` > `Download` > `ipynb`.
  2. **Local**: Click on the download button at the top of the page and choose `.ipynb`. Activate the conda environment `mr` before you start: `conda activate mr`


- Don't change the name of the file and don't delete any cells.


- Make sure you fill in any place that says  <font color='green'> \# YOUR CODE HERE </font> or "YOUR ANSWER HERE", as well as your name and (if necessary) collaborators below.


- The line containing **NotImplementedError()** raises an error and thereby prevents you from handing in assignments with empty cells. Simply delete that line when you start working on a cell that contains it.


- Before you turn this problem in (i.e., after you have completed all tasks), make sure everything runs as expected by restarting the kernel and running all cells:
  - in *Colab*: in the menubar, select `Runtime` and click on `Restart and run all`
  - if you use *Jupyter Notebook*: in the menubar, select `Kernel` and click on `Restart & Run All`
  - if you use *Visual Studio Code*: select "Restart" and then "Run All" 


Good luck!

In [1]:
NAME = "lm156"
COLLABORATORS = ""

In [2]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

# Mean squared error

## Setup

We need the following modules:

- pandas
- scikit-learn

In [3]:
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

## Data

### Create data

In [4]:
df = pd.DataFrame(
    {'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
      'ads'  : [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100]}
)

### Data structure

In [5]:
df

|   | sales | ads |
|---|-------|-----|
| 0 | 2500 | 900 |
| 1 | 4500 | 1400 |
| 2 | 6500 | 3600 |
| 3 | 8500 | 3800 |
| 4 | 10500 | 6200 |
| 5 | 12500 | 5200 |
| 6 | 14500 | 6800 |
| 7 | 16500 | 8300 |
| 8 | 18500 | 9800 |
| 9 | 20500 | 10100 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   sales   10 non-null     int64
 1   ads     10 non-null     int64
dtypes: int64(2)
memory usage: 288.0 bytes


## Model

In [7]:
intercept = 500 
slope = 2

df['sales_prediction'] = intercept + slope * df['ads'] 
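
The expression above is evaluated element-wise: pandas multiplies every value in `ads` by `slope` and adds `intercept`. A minimal sketch of that behaviour, using a hypothetical three-row frame `demo` (a name introduced here for illustration) built from the first three `ads` values:

```python
import pandas as pd

# Hypothetical mini frame holding the first three `ads` values from above
demo = pd.DataFrame({'ads': [900, 1400, 3600]})

intercept = 500
slope = 2

# Scalar arithmetic on a Series is applied to every row at once
demo['sales_prediction'] = intercept + slope * demo['ads']

print(demo['sales_prediction'].tolist())  # [2300, 3300, 7700]
```

For the first row this is simply 500 + 2 × 900 = 2300.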

### Calculate mean squared error

---

Step 1: Calculate the squared difference between the true outcome and the prediction:


The general formula:


$$Squared\ error \ (SE_i) = (y_i-\hat{y_i})^2$$



For our data this means:

$$ se_i = (sales_i - sales\_prediction_i)^2$$

Hint

---

```python
df['___'] = (df['___'] - df['___']) ___ ___

```

---

- Use the `**` (power) operator to raise a value to the power of 2.

In [8]:
df['se'] = (df['sales'] - df['sales_prediction']) ** 2

In [9]:
df.head()

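|   | sales | ads | sales_prediction | se |
|---|-------|-----|------------------|----|
| 0 | 2500 | 900 | 2300 | 40000 |
| 1 | 4500 | 1400 | 3300 | 1440000 |
| 2 | 6500 | 3600 | 7700 | 1440000 |
| 3 | 8500 | 3800 | 8100 | 160000 |
| 4 | 10500 | 6200 | 12900 | 5760000 |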


In [10]:
# Check your code
assert df.loc[0, 'se'] == 40000

---

Step 2: Sum up all squared errors

Calculate the sum of squared errors (SSE)

The general formula:

$$Sum\ of\ squared\ errors \ (SSE) = \sum_{i=1}^{n}(y_i-\hat{y_i})^2$$

which equals

$$Sum\ of\ squared\ errors \ (SSE) = \sum_{i=1}^{n} SE_i$$



For our data this means:



$$sse = \sum_{i=1}^{n}(sales_i-sales\_prediction_i)^2$$


which equals

$$sse = \sum_{i=1}^{n} se_i$$

Hint

Sum up the squared errors and save the result in a new object called `sse` (not in the pandas DataFrame).

---

```python
___ = df['___'].___
```

---

- Save the result as `sse`.
- Use the method `.sum()` to sum up all `se` values.

In [11]:
sse = df['se'].sum()

In [12]:
sse

14520000

In [13]:
# Check your code
assert sse == 14520000
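
As a cross-check that does not rely on pandas, the same sum can be sketched with plain Python lists. The names `predictions` and `sse_check` are introduced here for illustration; the values are those of the `sales` and `ads` columns.

```python
sales = [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500]
ads = [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100]

# Same model as above: intercept 500, slope 2
predictions = [500 + 2 * a for a in ads]

# Square each error and add the squares up
sse_check = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predictions))

print(sse_check)  # 14520000
```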

---

Step 3: Divide by the number of examples

The general formula:

$$Mean\ squared\ error \ (MSE) = \frac{1}{n} \times \sum_{i=1}^{n}(y_i-\hat{y_i})^2$$

which equals

$$Mean\ squared\ error \ (MSE) = \frac{1}{n} \times SSE$$

*with n = the number of observations*



For our data this means:



$$mse = \frac{1}{n} \times \sum_{i=1}^{n}(sales_i-sales\_prediction_i)^2$$

which equals

$$mse = \frac{1}{n} \times sse $$


*with n = the number of observations*



In [14]:
# Obtain number of examples (observations)
n = len(df)

In [15]:
n

10

Calculate the MSE and save the result as `mse`:

In [16]:
mse = 1 / n * sse

In [17]:
mse

1452000.0

In [18]:
# alternative way to print the result
print(f'The MSE equals {mse:.0f}')

The MSE equals 1452000


In [19]:
# Check your code
assert 1451000 < mse < 1453000
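
Because the MSE is the average of the squared errors, steps 2 and 3 can also be collapsed into a single call to `.mean()`. The sketch below rebuilds the data so that it runs on its own; `df_check` and `mse_direct` are names introduced here for illustration.

```python
import pandas as pd

df_check = pd.DataFrame({
    'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
    'ads': [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100],
})
df_check['sales_prediction'] = 500 + 2 * df_check['ads']

# Mean of the squared errors, i.e. SSE divided by n
mse_direct = ((df_check['sales'] - df_check['sales_prediction']) ** 2).mean()

print(mse_direct)  # 1452000.0
```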

### Root mean squared error

---

Another important measure is the root mean squared error:

The general formula:

$$Root\ mean\ squared\ error \ (RMSE) = \sqrt{\frac{1}{n} \times \sum_{i=1}^{n}(y_i-\hat{y_i})^2}$$

which equals

$$Root\ mean\ squared\ error \ (RMSE) = \sqrt{\frac{1}{n} \times SSE}$$

*with n = the number of observations*






For our data this means:


$$rmse = \sqrt{\frac{1}{n} \times \sum_{i=1}^{n}(sales_i-sales\_prediction_i)^2}$$

which equals

$$rmse = \sqrt{\frac{1}{n} \times sse} $$

which also equals

$$rmse = \left(\frac{1}{n} \times sse\right)^{\frac{1}{2}} $$



*with n = the number of observations*

Calculate the RMSE and save the result as `rmse` (not in your DataFrame):

In [20]:
rmse = mse ** (1 / 2)

In [21]:
rmse

1204.9896265113655

In [22]:
# Check your code
assert 1203 < rmse < 1206
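
Taking the square root brings the error back to the units of `sales`, which makes the RMSE easier to interpret than the MSE. A small sketch using `math.sqrt` from the standard library, with the MSE value from above typed in by hand (`mse_value` and `rmse_value` are illustration names):

```python
import math

mse_value = 1452000.0  # MSE computed above

# math.sqrt gives the same result as raising to the power 1/2,
# up to floating-point rounding
rmse_value = math.sqrt(mse_value)

print(round(rmse_value, 2))  # 1204.99
```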

### Mean absolute error

Another performance measure (which is less commonly used) is the mean absolute error:

The general formula:

$$Mean\ absolute\ error \ (MAE) = \frac{1}{n} \times \sum_{i=1}^{n}|y_i-\hat{y_i}|$$
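
The MAE can be computed by hand in the same way as the MSE, taking the absolute value of each error instead of its square. The sketch below rebuilds the data so that it runs on its own (`df_check`, `errors`, `mae` and `mean_error` are names introduced here for illustration). It also shows that averaging the raw errors without the absolute value lets positive and negative errors cancel.

```python
import pandas as pd

df_check = pd.DataFrame({
    'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
    'ads': [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100],
})
df_check['sales_prediction'] = 500 + 2 * df_check['ads']

errors = df_check['sales'] - df_check['sales_prediction']

# Mean absolute error: average size of the errors, ignoring their sign
mae = errors.abs().mean()

# Mean of the signed errors: over- and under-predictions partly cancel
mean_error = errors.mean()

print(mae)         # 980.0
print(mean_error)  # -220.0
```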



### Use functions

#### Mean squared error

In [23]:
mean_squared_error(df['sales'], df['sales_prediction'])

1452000.0

#### Root mean squared error

In [24]:
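# `squared=False` was deprecated in scikit-learn 1.4 and dropped in later releases,
# so take the square root of the MSE instead
mean_squared_error(df['sales'], df['sales_prediction']) ** 0.5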

1204.9896265113655

#### Mean absolute error

In [25]:
mean_absolute_error(df['sales'], df['sales_prediction'])

980.0
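
To confirm that the library functions and the manual calculations agree, the two can be compared directly. A minimal sketch, assuming pandas and scikit-learn are installed; the data is rebuilt so the block runs on its own, and all variable names are introduced here for illustration.

```python
import math

import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

df_check = pd.DataFrame({
    'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
    'ads': [900, 1400, 3600, 3800, 6200, 5200, 6800, 8300, 9800, 10100],
})
df_check['sales_prediction'] = 500 + 2 * df_check['ads']

errors = df_check['sales'] - df_check['sales_prediction']

# Manual metrics, following the steps from this exercise
mse_manual = (errors ** 2).mean()
rmse_manual = mse_manual ** 0.5
mae_manual = errors.abs().mean()

# Library metrics
mse_sklearn = mean_squared_error(df_check['sales'], df_check['sales_prediction'])
mae_sklearn = mean_absolute_error(df_check['sales'], df_check['sales_prediction'])

print(math.isclose(mse_manual, mse_sklearn))          # True
print(math.isclose(rmse_manual, mse_sklearn ** 0.5))  # True
print(math.isclose(mae_manual, mae_sklearn))          # True
```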