# Exercise 2: Linear Regression
----------
In this exercise, you are going to implement a first machine learning model and get to know the libraries *pandas* and *scikit-learn*.

## Dataset
We will use a data set originally published here: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/9/auto+mpg)

Download the data set from Moodle. The data set consists of two files:
- auto-mpg.data: contains the data
- auto-mpg.names: contains information about the data set

The data set describes 398 different car models. Besides the car name, it contains the following attributes:
- fuel consumption in miles per gallon
- cylinders
- engine displacement
- horsepower
- weight
- acceleration
- model year
- origin

The goal of this exercise is to predict the fuel consumption of the cars using the other available attributes as input to a linear regression algorithm.

## Importing Data with *pandas*

*pandas* is a widely used data science library for handling data sets. It includes functions to analyze, explore, and manipulate data.
You can check out information about *pandas* on their website: [https://pandas.pydata.org/docs/index.html](https://pandas.pydata.org/docs/index.html)

When working with data sets in *pandas*, the data is loaded into a pandas DataFrame, which is a two-dimensional structure similar to a table. In general, the columns of the DataFrame refer to the different features of the data set, while the rows represent the instances of the data. *pandas* offers many ways to handle and analyze the data in the DataFrame, e.g. to calculate statistical properties or to clean the data.


In [1]:
import pandas as pd

data = {
    "Height": [180, 165, 172, 201, 177],
    "Weight": [80, 56, 105, 102, 68],
    "Name": ['Jack', 'John', 'Oliver', 'George', 'William']
}

# load the data into a data frame
dataframe = pd.DataFrame(data)

print(dataframe)

   Height  Weight     Name
0     180      80     Jack
1     165      56     John
2     172     105   Oliver
3     201     102   George
4     177      68  William


In [2]:
# the info() function gives you a first overview of the data like the number of rows and columns and the data types.
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Height  5 non-null      int64 
 1   Weight  5 non-null      int64 
 2   Name    5 non-null      object
dtypes: int64(2), object(1)
memory usage: 252.0+ bytes


In [3]:
# Select a column of the dataframe
print(dataframe['Height'])
# The result of the selection is a Pandas Series, which is a one-dimensional array
print(type(dataframe['Height']))

0    180
1    165
2    172
3    201
4    177
Name: Height, dtype: int64
<class 'pandas.core.series.Series'>


In [4]:
# Select rows and columns of the dataframe using loc
# loc expects the labels of the rows and columns; a label slice such as 0:2 includes its end label
print(dataframe.loc[0:2,['Weight', 'Name']])

   Weight    Name
0      80    Jack
1      56    John
2     105  Oliver


In [5]:
# Select rows and columns of the dataframe using iloc
# iloc uses integer positions; the end position of a slice is excluded
print(dataframe.iloc[0:3,1:3])
# this gives the same output as the code block above

   Weight    Name
0      80    Jack
1      56    John
2     105  Oliver
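
The statistical properties mentioned at the start of this section can be sketched in the same way. The following block is only a minimal illustration on the toy data from above; the variable names `summary`, `mean_height` and `tall` are chosen for this example.

```python
import pandas as pd

dataframe = pd.DataFrame({
    "Height": [180, 165, 172, 201, 177],
    "Weight": [80, 56, 105, 102, 68],
    "Name": ["Jack", "John", "Oliver", "George", "William"],
})

# describe() summarizes the numeric columns (count, mean, std, min, quartiles, max)
summary = dataframe.describe()
print(summary)

# single statistics are available as methods of a column (a pandas Series)
mean_height = dataframe["Height"].mean()
print(mean_height)

# a boolean mask keeps only the rows for which the condition is True
tall = dataframe[dataframe["Height"] > 175]
print(tall)
```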


------

## Task 1: Load data
Load the car data set for this exercise using the read_csv function from pandas and take a first look at the data to ensure it was properly loaded.
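
As a hint, the following sketch shows how `read_csv` can be called on a file without a header row. It reads a few made-up lines from memory instead of the real file. The whitespace-separated layout, the `?` marker for missing values and the column names are assumptions for this illustration, so check auto-mpg.names and the data file before adopting them.

```python
import io
import pandas as pd

# made-up rows standing in for a data file (assumed layout, not the real data)
raw_text = (
    '10.0  4  100.0  50.0  2000.  10.0  70  1\t"car a"\n'
    '20.0  6  200.0  ?     3000.  15.0  71  2\t"car b"\n'
    '30.0  8  300.0  90.0  4000.  20.0  72  3\t"car c"\n'
)

# hypothetical column names; the file itself is assumed to have no header row
column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                "acceleration", "model_year", "origin", "car_name"]

cars = pd.read_csv(
    io.StringIO(raw_text),  # replace with the path to the downloaded file
    sep=r"\s+",             # columns are separated by one or more spaces or tabs
    header=None,            # the first line is data, not a header
    names=column_names,
    na_values="?",          # assumed marker for missing values
)

# take a first look at the loaded data
print(cars.head())
cars.info()
```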

## Task 2: Clean data

As described in the auto-mpg.names file, there are six missing horsepower values within the data set. For this exercise, we are going to ignore the six cars with this missing information. Use pandas to find and delete the six instances with missing horsepower information.
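
Finding and deleting rows with missing entries could look roughly like the sketch below. It uses a small made-up DataFrame and assumes that the missing values are already stored as `NaN`; if they are stored as a placeholder string instead, they have to be converted first.

```python
import numpy as np
import pandas as pd

# made-up example with two missing horsepower values
cars = pd.DataFrame({
    "mpg": [20.0, 30.0, 25.0, 15.0],
    "horsepower": [100.0, np.nan, 80.0, np.nan],
})

# count the missing entries per column
print(cars.isna().sum())

# show the rows that are affected
print(cars[cars["horsepower"].isna()])

# drop the rows with missing horsepower and renumber the remaining rows
cleaned = cars.dropna(subset=["horsepower"]).reset_index(drop=True)
print(cleaned)
```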

## Task 3: Linear Regression

Your task is to predict the fuel consumption in miles per gallon of the cars. Use the formula for linear regression from the lecture $\beta = (X^{T}X)^{-1}X^{T}y$ to perform this task. Choose the available numeric features cylinders, displacement, horsepower, weight, acceleration, model year and origin as input features.

Calculate the root mean square error of your prediction.
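
The formula can be tried out on a small toy example before applying it to the car data. The sketch below uses made-up values with a single input feature and assumes that a column of ones is added to $X$ so that the model contains an intercept; whether this matches the convention from the lecture should be checked.

```python
import numpy as np

# toy data with one input feature
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 4.5, 5.8, 7, 10, 13, 14.6, 16])

# design matrix: a column of ones (intercept) next to the feature column
X = np.column_stack([np.ones_like(x), x])

# beta = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# predictions and root mean square error
y_pred = X @ beta
rmse = np.sqrt(np.mean((y - y_pred) ** 2))

print(beta)  # first entry: intercept, second entry: slope
print(rmse)
```

For features with very different scales, `np.linalg.solve` or `np.linalg.lstsq` is usually a numerically more stable choice than computing the inverse explicitly.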

-------------

## Scikit-Learn

As seen in Task 3, the linear regression model can be easily implemented in Python. For more complex algorithms, it makes sense to use existing libraries. *scikit-learn* is a very helpful library in the field of machine learning. With the help of *scikit-learn*, many machine learning models can be easily implemented. It also contains methods to transform and pre-process data before applying the machine learning algorithm and can be used for evaluation as well.

For example, a linear regression model can be implemented as shown in the following code block.

In [13]:
# import scikit-learn
from sklearn.linear_model import LinearRegression
import numpy as np

# data set: x must be a 2D array of shape (n_samples, n_features), y can stay one-dimensional
x = np.array([1,2,3,4,5,6,7,8]).reshape(-1, 1)
y = np.array([3,4.5,5.8,7,10,13,14.6,16])

# define a linear model and fit it to the given data
lm = LinearRegression(fit_intercept=True).fit(x,y)

# print the coefficients of the linear model
print(lm.intercept_, lm.coef_)

# calculate the predicted values
y_pred = lm.predict(x)

# calculate the root mean square error as the square root of the mean squared error
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y, y_pred))
print('The root mean squared error is {:.2f}'.format(rmse))

0.3392857142857153 [1.97738095]
The root mean squared error is 0.63


--------------
## Task 4: Optimization and Evaluation

In this task, we want to compare the results of the classic linear regression that we have already implemented above with the results of a lasso approach. In order to compare the results on unseen data, we have to define a training and a test data set.

a) Use the *scikit-learn* function "train_test_split" to split the data into the two sets. Choose a size of 70% for the training data and 30% for the test data.

b) Learn the classic linear regression model on the training data set and evaluate its performance on training and test data set using the root mean square error as evaluation metric.

c) Learn a linear regression model with lasso regularization on the training data set and evaluate its performance on training and test data set using the root mean square error as evaluation metric. Use the "Lasso" class from *scikit-learn* to perform this task and set the alpha-value to 1. Compare the model coefficients and the performance with the classic linear regression approach from above. Which of the models seems better suited for the given task?

d) Perform a hyperparameter optimization for the lasso approach by trying different values for the hyperparameter alpha. Evaluate the performance and find the alpha value with the best performance.
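
The overall workflow of steps a) to d) can be sketched on synthetic data as follows. This is only an illustration under assumptions: the data, the list of alpha values and the helper function `rmse` are made up for this example and are not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic data: 200 instances, 5 features, two of them without influence on y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 0.5])
y = X @ true_coef + 1.0 + rng.normal(scale=0.5, size=200)

def rmse(y_true, y_hat):
    """Root mean square error (hypothetical helper for this sketch)."""
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))

# a) 70% training data, 30% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# b) classic linear regression
ols = LinearRegression().fit(X_train, y_train)
print("OLS  ", rmse(y_train, ols.predict(X_train)), rmse(y_test, ols.predict(X_test)))

# c) lasso regression with alpha = 1
lasso = Lasso(alpha=1.0).fit(X_train, y_train)
print("Lasso", rmse(y_train, lasso.predict(X_train)), rmse(y_test, lasso.predict(X_test)))
print(ols.coef_, lasso.coef_)

# d) try several alpha values and keep the one with the lowest test RMSE
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
results = {}
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    results[alpha] = rmse(y_test, model.predict(X_test))

best_alpha = min(results, key=results.get)
print(results, best_alpha)
```

Note that choosing alpha on the test data makes the reported test error slightly optimistic. A separate validation set or cross-validation, e.g. with `LassoCV` from *scikit-learn*, is a common alternative.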