# Exercise 04 - Train your first model

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split

## Read and display data
We assume that the data is located in a directory called `Data` one level above this notebook (the code below reads `../Data/advertising.csv`). To read in the data, we use the Pandas function `read_csv()`.

In [2]:
df = pd.read_csv("../Data/advertising.csv")
df

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,58.5,16.5
4,180.8,10.8,58.4,17.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5


## Select features and labels
We select the columns of the dataframe that serve as features and the column that serves as the label.

In [9]:
features = ["TV", "Radio"] # List of Features
X = df[features]
y = df.Sales

display(X)
display(y)


Unnamed: 0,TV,Radio
0,230.1,37.8
1,44.5,39.3
2,17.2,45.9
3,151.5,41.3
4,180.8,10.8
...,...,...
195,38.2,3.7
196,94.2,4.9
197,177.0,9.3
198,283.6,42.0


0      22.1
1      10.4
2      12.0
3      16.5
4      17.9
       ... 
195     7.6
196    14.0
197    14.8
198    25.5
199    18.4
Name: Sales, Length: 200, dtype: float64

In [11]:
df[features]

Unnamed: 0,TV,Radio
0,230.1,37.8
1,44.5,39.3
2,17.2,45.9
3,151.5,41.3
4,180.8,10.8
...,...,...
195,38.2,3.7
196,94.2,4.9
197,177.0,9.3
198,283.6,42.0
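
As a small illustration of the selection step, the following sketch builds a stand-in dataframe from the first three rows shown above (the name `demo` is made up for this example) and checks what kind of object each selection returns. Indexing with a list of column names yields a two-dimensional `DataFrame`, while selecting a single column yields a one-dimensional `Series`.

```python
import pandas as pd

# Stand-in dataframe built from the first three rows of the Advertising data shown above.
demo = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2],
    "Radio": [37.8, 39.3, 45.9],
    "Newspaper": [69.2, 45.1, 69.3],
    "Sales": [22.1, 10.4, 12.0],
})

X_demo = demo[["TV", "Radio"]]  # list of column names -> DataFrame (2D)
y_demo = demo["Sales"]          # single column name -> Series (1D), same as demo.Sales

print(type(X_demo).__name__, X_demo.shape)
print(type(y_demo).__name__, y_demo.shape)
```

This is the shape scikit-learn estimators typically expect: a 2D feature table `X` and a 1D label vector `y`.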


## Training-test split
Next, we split the data into training and test data. scikit-learn provides the function `train_test_split` in the `sklearn.model_selection` submodule for this purpose. By default, it randomly selects 25% of the data as test data. Setting the `random_state` parameter to a fixed number guarantees that the random generator selects the same 25% of the data each time the function is called, which makes results reproducible and errors easier to track down during model development.

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
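
To illustrate the effect of `random_state`, here is a minimal sketch on a small synthetic dataframe (not the Advertising data; the names `demo`, `a`, `b` and `label` are made up for this example). It checks that two calls with the same seed return the same rows and that the default test share is 25%.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 rows, two feature columns and one label column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=20), "b": rng.normal(size=20)})
demo["label"] = 2.0 * demo["a"] - demo["b"]

X_demo = demo[["a", "b"]]
y_demo = demo["label"]

# Two calls with the same random_state select exactly the same rows.
Xa_train, Xa_test, ya_train, ya_test = train_test_split(X_demo, y_demo, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(X_demo, y_demo, random_state=42)
same_split = Xa_test.index.equals(Xb_test.index)

# With the default test_size, 25% of the rows end up in the test set.
n_train, n_test = len(Xa_train), len(Xa_test)
print(same_split, n_train, n_test)
```

Without a fixed `random_state`, repeated calls would generally produce different splits, so the metrics would vary from run to run.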

## Model training
We build a linear regression model. Like all models in scikit-learn, it has a `.fit` method that we use to train the model on the training data.

In [28]:
m = linear_model.LinearRegression()
m.fit(X_train, y_train)

## Evaluation of the solution
We determine the $R^2$ value and the mean squared error (MSE) on the training and test data; the root mean squared error (RMSE) is the square root of the MSE. To do this, we first determine the model prediction `y_pred`. Model predictions in scikit-learn are made using the `.predict` method, which each model has. The learned parameters can be accessed using `m.coef_` (the weights $w$) and `m.intercept_` (the intercept $w_0$).

In [33]:
# Metrics on the training data
y_pred = m.predict(X_train)
print(f"R^2 Training data: {metrics.r2_score(y_train, y_pred)}.")
print(f"MSE Training data: {metrics.mean_squared_error(y_train, y_pred)}.")

# Metrics on the test data
y_pred = m.predict(X_test)
print(f"R^2 Test data: {metrics.r2_score(y_test, y_pred)}.")
print(f"MSE Test data: {metrics.mean_squared_error(y_test, y_pred)}.")

# Learned coefficients of the model function:
print(f"w0 = {m.intercept_}")
print(f"w = {m.coef_}")

R^2 Training data: 0.8955275873177727.
MSE Training data: 2.8420054812102697.
R^2 Test data: 0.9133183876478477.
MSE Test data: 2.356396290987085.
w0 = 4.8193142960357385
w = [0.05461317 0.10204696]
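
The connection between the learned parameters and the predictions can be sketched as follows. The example uses synthetic data with a made-up linear relationship (not the Advertising data; `X_demo`, `y_demo` and `m_demo` are illustrative names) and checks that `.predict` matches the linear model function $w_0 + Xw$ evaluated by hand. It also derives the RMSE from the MSE.

```python
import numpy as np
from sklearn import linear_model, metrics

# Synthetic data with a known linear relationship plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 2))
y_demo = 3.0 + 0.5 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=0.5, size=50)

m_demo = linear_model.LinearRegression()
m_demo.fit(X_demo, y_demo)

# .predict evaluates the linear model function w0 + X @ w.
y_pred = m_demo.predict(X_demo)
y_manual = m_demo.intercept_ + X_demo @ m_demo.coef_

# The RMSE is the square root of the MSE and has the same unit as the label.
mse = metrics.mean_squared_error(y_demo, y_pred)
rmse = np.sqrt(mse)
print(np.allclose(y_pred, y_manual), mse, rmse)
```

Because the RMSE is in the same unit as `Sales`, it is often easier to interpret than the MSE.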


## Task
1. Look at the sample code again.

2. Delete the code and run a linear regression for the Advertising dataset using all three features `TV`, `Radio` and `Newspaper`. To do this, perform the following steps in sequence:
    * Read in data.
    * Select features and label.
    * Split data into test and training data.
    * Create and train the model.
    * Evaluate the model. Is the model better than the two-feature model?

3. Bonus: Visualize the data as scatter plots. You can use e.g. the plot functionality of Pandas for this: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
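
For the bonus task, one possible approach is sketched below: one scatter plot per feature against `Sales`, drawn with `DataFrame.plot`. The sketch uses a stand-in dataframe built from the first five rows shown earlier (the name `demo` is made up); in the notebook you would use `df` instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in dataframe built from the first five rows of the Advertising data shown earlier.
demo = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 12.0, 16.5, 17.9],
})

# One panel per feature, all sharing the Sales axis.
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, feature in zip(axes, ["TV", "Radio", "Newspaper"]):
    demo.plot(kind="scatter", x=feature, y="Sales", ax=ax)
fig.tight_layout()
```

Sharing the y-axis makes it easier to compare how strongly each feature relates to `Sales`.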

In [34]:
features2 = ["TV", "Radio", "Newspaper"]
X2 = df[features2]
y2 = df.Sales

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.25, random_state=42)

m2 = linear_model.LinearRegression()
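m2.fit(X2_train, y2_train)  # fit needs both the features and the labels

# Metrics on the new training data
y2_pred = m2.predict(X2_train)
print(f"R^2 Training data: {metrics.r2_score(y2_train, y2_pred)}.")
print(f"MSE Training data: {metrics.mean_squared_error(y2_train, y2_pred)}.")

# Metrics on the new test data
y2_pred = m2.predict(X2_test)
print(f"R^2 Test data: {metrics.r2_score(y2_test, y2_pred)}.")
print(f"MSE Test data: {metrics.mean_squared_error(y2_test, y2_pred)}.")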

