**Important notes:**

- You can complete this exercise either 1) in Colab or 2) on your local machine:
  1. **Colab**: Click the 🚀 symbol at the top of the page and select Colab. When you have finished the exercise, download the file: `File` > `Download` > `ipynb`.
  2. **Local**: Click the download button at the top of the page and choose `.ipynb`. Activate the conda environment `mr` before you start: `conda activate mr`


- Don't change the name of the file and don't delete any cells.


- Make sure you fill in any place that says  <font color='green'> \# YOUR CODE HERE </font> or "YOUR ANSWER HERE", as well as your name and (if necessary) collaborators below.


- The **NotImplementedError()** statement prevents you from handing in assignments with empty cells. Simply delete it when you start working on a cell that contains it.


- Before you turn this problem in (i.e., after you have completed all tasks), make sure everything runs as expected. Restart the kernel and run all cells:
  - in *Colab*: in the menubar, select `Runtime` and click on `Restart and run all`
  - if you use *Jupyter Notebook*: in the menubar, select `Kernel` and click on `Restart & Run All`
  - if you use *Visual Studio Code*: select "Restart" and then "Run All" 


Good luck!

In [1]:
NAME = "yy015"
COLLABORATORS = ""

In [2]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

# Introduction to linear regression

## Setup

In [3]:
import pandas as pd
import altair as alt
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Data

We create our own data:

### Create data

In [4]:
df = pd.DataFrame(
    {'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
      'ads'  : [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]}
)

### Data structure

In [5]:
df

|   | sales | ads |
|---|-------|-----|
| 0 | 2500 | 1000 |
| 1 | 4500 | 2000 |
| 2 | 6500 | 3000 |
| 3 | 8500 | 4000 |
| 4 | 10500 | 5000 |
| 5 | 12500 | 6000 |
| 6 | 14500 | 7000 |
| 7 | 16500 | 8000 |
| 8 | 18500 | 9000 |
| 9 | 20500 | 10000 |


*Do you recognize a relationship between the two variables?*

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   sales   10 non-null     int64
 1   ads     10 non-null     int64
dtypes: int64(2)
memory usage: 288.0 bytes


## Analysis

Show the relationship between the variables:

In [7]:
chart = alt.Chart(df).mark_point().encode(
   x=alt.X('ads', axis=alt.Axis(title='Ads (in $)')),
   y=alt.Y('sales', axis=alt.Axis(title="Sales (in units)")),
   tooltip=['ads', 'sales']
).interactive()

chart

## Model

Let's take a closer look at an ad spending of 2000. What value of sales would you expect?

In [8]:
callout = alt.Chart(df.iloc[1:2]).mark_point(
    color='red', size=300, tooltip="Ads = 2000, Sales = 4500"
).encode(
    x=alt.X('ads', axis=alt.Axis(title='Ads (in $)')),
    y=alt.Y('sales', axis=alt.Axis(title="Sales (in units)"))
)

chart + callout

### Prediction

What is your sales prediction for a TV ad spending of 2000?

- calculate a prediction for sales using an ad spending of 2000
- in your code, use the variables `number_0` and `number_1` to obtain your prediction
- save the result as `sales_prediction`

Hint:

---

```python

number_0 = ___
number_1 = ___
ad_spendings = ___

sales_prediction = number_0 + number_1 * ad_spendings

```

---
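
In regression notation, the formula in the hint is commonly written as a simple linear model. The symbols below are an assumed convention and do not appear in the exercise itself: $\hat{y}$ stands for `sales_prediction`, $b_0$ for `number_0` (the intercept), $b_1$ for `number_1` (the slope), and $x$ for `ad_spendings`.

$$
\hat{y} = b_0 + b_1 x
$$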

In [9]:
number_0 = 500
number_1 = 2
ad_spendings = 2000

sales_prediction = number_0 + number_1 * ad_spendings



In [10]:
# check your code
assert 4000 <= sales_prediction <= 5000
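
One way to arrive at values such as `number_0 = 500` and `number_1 = 2` is to read two points off the data and work out the line through them. The following is a minimal sketch of that arithmetic; the names `slope` and `intercept` are illustrative and not part of the exercise.

```python
# Two observed data points from the table above: (ads, sales)
x1, y1 = 1000, 2500
x2, y2 = 2000, 4500

# Slope: change in sales per additional unit of ad spending
slope = (y2 - y1) / (x2 - x1)

# Intercept: the sales value the line implies at ads = 0
intercept = y1 - slope * x1

print(slope, intercept)  # 2.0 500.0
```

Here `intercept` plays the role of `number_0` and `slope` the role of `number_1`.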

Next, apply your formula in pandas to every value of ad spending (`ads`) and save the result in your DataFrame (as `sales_prediction`):

Hint:

---

```python
df['___'] = number_0 + number_1 * df['___'] 
```

---

- name the new column `sales_prediction`


In [11]:
df['sales_prediction'] = number_0 + number_1 * df['ads'] 

In [12]:
# Check your code
assert 2000 <= df.iloc[0, 2] <= 3000

In [13]:
df.head()

|   | sales | ads | sales_prediction |
|---|-------|-----|------------------|
| 0 | 2500 | 1000 | 2500 |
| 1 | 4500 | 2000 | 4500 |
| 2 | 6500 | 3000 | 6500 |
| 3 | 8500 | 4000 | 8500 |
| 4 | 10500 | 5000 | 10500 |
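
The predictions in the table match the observed sales exactly. As a cross-check, a least-squares line can be fitted to the same data. This sketch uses `numpy.polyfit`, which the exercise itself does not use; `data`, `slope_fit`, `intercept_fit`, and `residuals` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same data as in the exercise
data = pd.DataFrame({
    'sales': [2500, 4500, 6500, 8500, 10500, 12500, 14500, 16500, 18500, 20500],
    'ads': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000],
})

x = data['ads'].to_numpy()
y = data['sales'].to_numpy()

# Fit a degree-1 polynomial (a straight line) by least squares.
# np.polyfit returns the coefficients from highest degree to lowest.
slope_fit, intercept_fit = np.polyfit(x, y, deg=1)

# Residuals: observed sales minus fitted sales
residuals = y - (intercept_fit + slope_fit * x)

print(slope_fit, intercept_fit)
print(np.allclose(residuals, 0))
```

Up to floating-point rounding, the fitted slope and intercept should agree with `number_1 = 2` and `number_0 = 500`, because the data lie exactly on a straight line.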


Visualize the predictions as a line:

In [14]:
line = alt.Chart(df).mark_line().encode(
         alt.X('ads', axis=alt.Axis(title='Ads (in $)')),
         alt.Y('sales_prediction', axis=alt.Axis(title="Sales (in units)")),
         color=alt.value("#0001F5"))

chart + line

Next, we want to see where our blue line crosses the y-axis. To do so, we need to add a row with `ads = 0` to our DataFrame.

Hint:

---

```python

df_new = pd.DataFrame({
    'ads': ___ , 
    'sales': ___, 
    'sales_prediction': ___ + ___ * ___
    }, ___)

___ = ___.___([___ , ___], ignore_index = ___)

```

---



First, we create a new DataFrame called `df_new` with only one row:

- the value for `ads` is `0`
- we don't know the real value of `sales`, so we write `None` in this cell (pandas will treat this as a missing value, shown as `NaN`, which means "Not a Number")
- we use our formula from above to obtain the value of `sales_prediction` (use the value of `0` for ads)
- we need to manually include an index. Simply use `index=[0]`

Next, we need to append `df_new` to the end of our DataFrame object `df`:

- use `pd.concat()` to combine `df` and `df_new` and save the result in `df` 
- we clear the existing index and reset it in the result by setting the `ignore_index` option to `True`.
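
The effect of `ignore_index` is easiest to see on a tiny example. The sketch below uses two small, made-up DataFrames (`first` and `second` are illustrative names, not part of the exercise) and compares the resulting index with and without the option.

```python
import pandas as pd

first = pd.DataFrame({'ads': [1000, 2000]})      # index 0, 1
second = pd.DataFrame({'ads': [0]}, index=[0])   # index 0

# Default: the original index labels are kept, so 0 appears twice
kept = pd.concat([first, second])

# ignore_index=True: the result gets a fresh index 0..n-1
reset = pd.concat([first, second], ignore_index=True)

print(list(kept.index))   # [0, 1, 0]
print(list(reset.index))  # [0, 1, 2]
```

With a fresh index, the appended row can be addressed unambiguously by its position at the end of the DataFrame.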


In [15]:
df_new = pd.DataFrame({
    'ads': 0 , 
    'sales': None, 
    'sales_prediction': number_0 + number_1 * 0
    }, index=[0])

df_new

|   | ads | sales | sales_prediction |
|---|-----|-------|------------------|
| 0 | 0 | NaN | 500 |


In [16]:
df = pd.concat([df , df_new], ignore_index = True)

In [17]:
# check your code
assert len(df) == 11
assert df.loc[10, 'ads'] == 0

In [18]:
df

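|    | sales | ads | sales_prediction |
|----|-------|-----|------------------|
| 0 | 2500.0 | 1000 | 2500 |
| 1 | 4500.0 | 2000 | 4500 |
| 2 | 6500.0 | 3000 | 6500 |
| 3 | 8500.0 | 4000 | 8500 |
| 4 | 10500.0 | 5000 | 10500 |
| 5 | 12500.0 | 6000 | 12500 |
| 6 | 14500.0 | 7000 | 14500 |
| 7 | 16500.0 | 8000 | 16500 |
| 8 | 18500.0 | 9000 | 18500 |
| 9 | 20500.0 | 10000 | 20500 |
| 10 | NaN | 0 | 500 |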


In [19]:
chart_new = alt.Chart(df).mark_point().encode(
   x=alt.X('ads', axis=alt.Axis(title='Ads (in $)')),
   y=alt.Y('sales', axis=alt.Axis(title="Sales (in units)"))
)

line_new = alt.Chart(df).mark_line().encode(
         alt.X('ads', axis=alt.Axis(title='Ads (in $)')),
         alt.Y('sales_prediction', axis=alt.Axis(title="Sales (in units)")),
         color=alt.value("#0001F5"))

callout_new = alt.Chart(df.iloc[10:11]).mark_point(  # select only the new row (ads = 0)
    color='red', 
    size=300, 
    tooltip="Ads = 0, Sales prediction = 500"
).encode(
    x='ads',
    y='sales_prediction'
)

chart_new + line_new + callout_new