### Regression: Predicting continuous labels

In contrast with the discrete labels of a classification algorithm, we will next look at a simple *regression* task in which the labels are continuous quantities.

Consider the data shown in the following figure, which consists of a set of points each with a continuous label:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/12-01-regression-1.png?raw=true" width="600" align="center"/>

As with the classification example, we have two-dimensional data: that is, there are two features describing each data point.
The color of each point represents the continuous label for that point.

There are a number of possible regression models we might use for this type of data, but here we will use a simple linear regression to predict the labels.
This simple linear regression model assumes that if we treat the label as a third spatial dimension, we can fit a plane to the data.
This is a higher-dimensional generalization of the well-known problem of fitting a line to data with two coordinates.
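
As a minimal sketch of the plane-fitting idea, the snippet below builds synthetic two-feature data and fits a plane to it with scikit-learn's `LinearRegression`. The data, the "true" plane (slopes 1.5 and -2.0, intercept 0.5), and the names `X_demo`, `y_demo`, and `plane` are assumptions made up for illustration; they are not the data shown in the figures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): two features, and a label that
# lies close to the plane  label = 1.5 * f1 - 2.0 * f2 + 0.5
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.1, size=200)
y_demo = 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.5 + noise

# Fitting the model recovers the slopes and intercept of that plane
plane = LinearRegression().fit(X_demo, y_demo)
print("slopes:", plane.coef_)
print("intercept:", plane.intercept_)

# The fitted plane gives a label for any new (feature 1, feature 2) pair
new_point = np.array([[0.2, -1.0]])
print("predicted label:", plane.predict(new_point))
```

The fitted slopes and intercept should land close to the values used to generate the data, which is what "fitting a plane" means here.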

We can visualize this setup as shown in the following figure:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/12-01-regression-2.png?raw=true" width="800" align="center"/>

Notice that the *feature 1-feature 2* plane here is the same as in the two-dimensional plot from before; in this case, however, we have represented the labels by both color and three-dimensional axis position.
From this view, it seems reasonable that fitting a plane through this three-dimensional data would allow us to predict the expected label for any set of input parameters.
Returning to the two-dimensional projection, when we fit such a plane we get the result shown in the following figure:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/12-01-regression-3.png?raw=true" width="600" align="center"/>

This plane of fit gives us what we need to predict labels for new points.
Visually, we find the results shown in the following figure:


<img src="https://github.com/soltaniehha/Business-Analytics/blob/master/figs/12-01-regression-4.png?raw=true" width="900" align="center"/>

As with the classification example, this may seem rather trivial in a low number of dimensions.
But the power of these methods is that they can be straightforwardly applied and evaluated in the case of data with many, many features.

For example, this is similar to the task of computing the apparent ("feels like") temperature, for which we might use the following features and labels:

- *feature 1*, *feature 2*, etc. $\to$ temperature, humidity, or wind speed
- *label* $\to$ apparent temperature

The apparent temperature for a small number of data points might be determined through an independent set of (typically more expensive) observations.
The apparent temperature for the remaining data points could then be estimated using a suitable regression model, without the need to make the more expensive observation across the entire set.
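
The workflow just described can be sketched with toy numbers: label a small subset, fit a regression on it, and estimate the labels for everything else. Everything here is hypothetical, including the feature ranges, the made-up linear "feels like" rule used to generate labels, and the choice of 50 labeled points; it only illustrates the idea.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1000

# Hypothetical features: temperature (C), humidity (0-1), wind speed (km/h)
features = np.column_stack([
    rng.uniform(-10, 35, n),
    rng.uniform(0.2, 1.0, n),
    rng.uniform(0, 40, n),
])

# Made-up linear rule, used only to generate toy labels
true_label = 1.1 * features[:, 0] + 2.0 * features[:, 1] - 0.1 * features[:, 2] - 2.0

# Pretend only the first 50 points were measured with the expensive method
labeled = slice(0, 50)
unlabeled = slice(50, None)
measured = true_label[labeled] + rng.normal(scale=0.5, size=50)

# Fit on the small labeled subset, then estimate the remaining 950 labels
estimator = LinearRegression().fit(features[labeled], measured)
estimated = estimator.predict(features[unlabeled])

mean_abs_gap = np.mean(np.abs(estimated - true_label[unlabeled]))
print("mean absolute gap on unlabeled points:", mean_abs_gap)
```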

# Example Linear Regression - Apparent Temperature

Goal: predict the apparent temperature from a series of measurements.

The data was downloaded from [Kaggle](https://www.kaggle.com/budincsevity/szeged-weather#weatherHistory.csv) and can be loaded directly from the course's GitHub. It includes hourly weather data for the Szeged, Hungary area from 2006 to 2016:

In [47]:
from IPython.display import Pretty as disp
hint = 'https://raw.githubusercontent.com/soltaniehha/Business-Analytics/master/docs/hints/'  # path to hints on GitHub

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white")
df = pd.read_csv('https://raw.githubusercontent.com/soltaniehha/Business-Analytics/master/data/weatherHistory.csv')
df.head(3)

Unnamed: 0,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.


In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96453 entries, 0 to 96452
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Formatted Date            96453 non-null  object 
 1   Summary                   96453 non-null  object 
 2   Precip Type               95936 non-null  object 
 3   Temperature (C)           96453 non-null  float64
 4   Apparent Temperature (C)  96453 non-null  float64
 5   Humidity                  96453 non-null  float64
 6   Wind Speed (km/h)         96453 non-null  float64
 7   Wind Bearing (degrees)    96453 non-null  float64
 8   Visibility (km)           96453 non-null  float64
 9   Loud Cover                96453 non-null  float64
 10  Pressure (millibars)      96453 non-null  float64
 11  Daily Summary             96453 non-null  object 
dtypes: float64(8), object(4)
memory usage: 8.8+ MB


## Preprocessing

1. There are a small number of NAs in 'Precip Type'; we will drop all rows that contain them
2. We won't need the following fields, so we will drop them: 'Formatted Date', 'Summary', 'Daily Summary'
3. We will convert 'Precip Type' to dummy variables. Passing `drop_first=True` drops one of the categories for us, so we won't have to remove it ourselves

In [49]:
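df = df.dropna()  # drop the rows with a missing 'Precip Type'
df = df.drop(['Formatted Date', 'Summary', 'Daily Summary'], axis=1)
# columns= names the field to encode; a list passed as the second positional argument is read as `prefix`
df = pd.get_dummies(df, columns=['Precip Type'], drop_first=True)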
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95936 entries, 0 to 96452
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Temperature (C)           95936 non-null  float64
 1   Apparent Temperature (C)  95936 non-null  float64
 2   Humidity                  95936 non-null  float64
 3   Wind Speed (km/h)         95936 non-null  float64
 4   Wind Bearing (degrees)    95936 non-null  float64
 5   Visibility (km)           95936 non-null  float64
 6   Loud Cover                95936 non-null  float64
 7   Pressure (millibars)      95936 non-null  float64
 8   Precip Type_snow          95936 non-null  uint8  
dtypes: float64(8), uint8(1)
memory usage: 6.7 MB
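
To see the dummy-variable step in isolation, here is a small sketch on a three-row toy DataFrame (the `toy` frame and its values are made up for illustration). With two categories, `rain` and `snow`, `drop_first=True` keeps only the `snow` indicator, since a row that is not snow must be rain.

```python
import pandas as pd

# Toy data (illustrative assumption), mimicking two of the real columns
toy = pd.DataFrame({
    "Temperature (C)": [9.5, -1.2, 4.0],
    "Precip Type": ["rain", "snow", "rain"],
})

# Encode the categorical column; the first category ("rain") is dropped
encoded = pd.get_dummies(toy, columns=["Precip Type"], drop_first=True)
print(encoded.columns.tolist())
# ['Temperature (C)', 'Precip Type_snow']
```

The dtype of the indicator column depends on the pandas version (for example `uint8` or `bool`), so it may differ from the `info()` output shown above.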


Create a feature DataFrame called `X`. Our target variable is 'Apparent Temperature (C)', so that is the column we need to exclude from the feature DataFrame:

In [50]:
# Your answer goes here
X = df.drop('Apparent Temperature (C)', axis=1)
X.shape

(95936, 8)

In [51]:
# Don't run this cell to keep the outcome as your frame of reference

In [52]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-03-X')

Create a target vector with 'Apparent Temperature (C)' and call it `y`:

In [53]:
# Your answer goes here
y = df['Apparent Temperature (C)']
y.shape

(95936,)

In [54]:
# Don't run this cell to keep the outcome as your frame of reference

In [55]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-03-y')

We would like to evaluate the model on data it has not seen before, so we will split the data into a training set and a testing set. Hold out 30% of the data for testing. You can use the seed value 833 if you would like to reproduce the values in this notebook:

In [56]:
# Your answer goes here
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=833)

In [57]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-03-split')

In [58]:
print("Xtrain shape:", Xtrain.shape)
print("Xtest shape:", Xtest.shape)

Xtrain shape: (67155, 8)
Xtest shape: (28781, 8)
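
The behaviour of `train_test_split` is easy to check on a tiny example. The sketch below uses made-up arrays (`X_toy`, `y_toy`) of 10 rows to show that `test_size=0.3` holds out 3 rows, that features and labels stay paired after shuffling, and that reusing the same `random_state` reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative assumption): row i is [2*i, 2*i + 1] with label i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

a_train, a_test, b_train, b_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=833
)
print(a_train.shape, a_test.shape)  # (7, 2) (3, 2)

# Rows and labels are shuffled together, so each row keeps its own label
print(np.array_equal(a_train[:, 0], 2 * b_train))  # True

# The same seed gives the same split again
_, a_test_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=833
)
print(np.array_equal(a_test, a_test_again))  # True
```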


With the data arranged, we can follow our recipe to predict the labels:

First, instantiate a simple linear regression model. You will need to import `LinearRegression` first; it can be found under the `linear_model` module in `sklearn`. Call this model `model`.

We will instantiate the model with all the default parameters:

In [59]:
# Your answer goes here
from sklearn.linear_model import LinearRegression  # 1. choose model class
model = LinearRegression()  # 2. instantiate with default parameters

In [60]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-03-model')

Fit the model to the training data:

In [61]:
# Your answer goes here
model.fit(Xtrain, ytrain)

In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-03-fit')

In [None]:
print("Model coefficients:    ", model.coef_)
print("Model intercept:", model.intercept_)
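
The coefficients and intercept are all a linear regression needs to make a prediction: each feature is multiplied by its coefficient, the products are summed, and the intercept is added. A minimal sketch on synthetic data (the arrays `X_small`, `y_small` and the model `lin` are illustrative assumptions, separate from the weather model) confirms that this manual computation matches `predict`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (illustrative assumption)
rng = np.random.default_rng(2)
X_small = rng.normal(size=(50, 3))
y_small = X_small @ np.array([2.0, -1.0, 0.5]) + 4.0

lin = LinearRegression().fit(X_small, y_small)

# Prediction by hand: features times coefficients, plus the intercept
manual = X_small @ lin.coef_ + lin.intercept_
print(np.allclose(manual, lin.predict(X_small)))  # True
```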

Predict on new (test) data and store the results as `y_model`:

In [None]:
# Your answer goes here
y_model = model.predict(Xtest)  # predicted apparent temperature for each test row


In [None]:
# SOLUTION: Uncomment and execute the line below to get help
#disp(hint + '11-03-predict')

Now that our predictions are ready, we can join them, along with the ground truth (our 'Apparent Temperature (C)' field), to the test features and visually inspect the model's performance:

In [None]:
test = Xtest.join(ytest).reset_index()
test.join(pd.Series(y_model, name='predicted')).head()
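
The `reset_index()` call matters because `join` aligns rows by index label: the test rows keep their shuffled labels from the original DataFrame, while a Series built from a plain array of predictions is labeled 0, 1, 2, and so on. The sketch below shows the difference on a made-up three-row example (`X_part`, `y_part`, and `preds` are hypothetical stand-ins for the test features, test labels, and predictions).

```python
import numpy as np
import pandas as pd

# Stand-ins with shuffled row labels, as they would be after a random split
X_part = pd.DataFrame({"Humidity": [0.9, 0.5, 0.7]}, index=[42, 7, 19])
y_part = pd.Series([7.4, 12.0, 9.1], index=[42, 7, 19],
                   name="Apparent Temperature (C)")
preds = np.array([7.0, 12.5, 9.0])  # hypothetical predictions, no index

# Without reset_index: labels 42, 7, 19 never match 0, 1, 2, so all NaN
misaligned = X_part.join(pd.Series(preds, name="predicted"))
print(misaligned["predicted"].isna().all())  # True

# With reset_index: both sides are labeled 0, 1, 2 and line up by position
aligned = X_part.join(y_part).reset_index().join(pd.Series(preds, name="predicted"))
print(aligned["predicted"].tolist())  # [7.0, 12.5, 9.0]
```

The old row labels are kept in a column named `index`, which makes it possible to trace each test row back to the original data.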

Calculating the mean absolute error:

In [None]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(ytest, y_model)
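
The mean absolute error is the average of the absolute differences between the true and predicted values, so here it is expressed in degrees Celsius. As a small worked sketch (the arrays `actual` and `guess` are made-up numbers, not results from the weather model), the manual computation agrees with `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true values and predictions, for illustration only
actual = np.array([3.0, -0.5, 2.0, 7.0])
guess = np.array([2.5, 0.0, 2.0, 8.0])

# Absolute errors are 0.5, 0.5, 0.0, 1.0; their mean is 2.0 / 4 = 0.5
mae_manual = np.mean(np.abs(actual - guess))
mae_sklearn = mean_absolute_error(actual, guess)
print(mae_manual, mae_sklearn)  # 0.5 0.5
```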