<hr style="height:2px; border:none; color:black; background-color:black;">

<img src="./images/logo-iug@2x.png" alt="IUG" style="width:300px;"/>

# Data Day@IUG 
**Learning Lab #3**: Data Exploration with Python by Dr. N. Tsourakis

[ntsourakis@iun.ch](mailto:ntsourakis@iun.ch)

## Time Series and Simple Linear Regression 
* **Simple linear regression** is the **simplest** regression algorithm
* Using a **time series** of average New York City January high-temperature data for 1895 through 2018, we'll
    * Perform **simple linear regression**
    * Display a **scatter plot** with a **regression line** 
    * Use the **coefficient** and **intercept values** calculated by the estimator to **make predictions**
* Temperature data stored in **`ave_hi_nyc_jan_1895-2018.csv`**

Given a collection of numeric values representing an **independent variable** and a **dependent variable**, simple linear regression **describes the relationship between these variables with a straight line**, known as the **regression line**.

<img src="./images/linear-regression.png" alt="Linear Regression" width="600px"/>

How can we choose the best regression line?

<img src="./images/linear-regression-example.png" alt="Linear Regression Example" width="600px"/>
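One common way to compare candidate lines is to add up the squared vertical distances between each data point and the line: the smaller the total, the better the fit. The sketch below illustrates this with made-up points (not the NYC temperature data) and a hypothetical helper named `sum_squared_errors`.

```python
import numpy as np

# Made-up points for illustration only (not the NYC temperature data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_squared_errors(slope, intercept):
    """Sum of squared vertical distances between the points and a line."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

sse_a = sum_squared_errors(2.0, 0.0)  # candidate line y = 2x
sse_b = sum_squared_errors(1.0, 3.0)  # candidate line y = x + 3
print(f'y = 2x: {sse_a:.2f}, y = x + 3: {sse_b:.2f}')
```

Under this criterion, the first candidate fits these points far better than the second.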

### Loading the Average High Temperatures into a `DataFrame` 
* Load the data from `ave_hi_nyc_jan_1895-2018.csv`, rename the `'Value'` column to `'Temperature'`, remove `01` from the end of each date value and display a few data samples:

**We added `%matplotlib inline` to enable Matplotlib in this notebook.**

In [None]:
%matplotlib inline
import pandas as pd

In [None]:
# Load the data.
nyc = pd.read_csv('./data/ave_hi_nyc_jan_1895-2018.csv')

In [None]:
# Show the first three samples.
nyc.head(3)

In [None]:
# Rename the columns.
nyc.columns = ['Date', 'Temperature', 'Anomaly']

In [None]:
# Extract the year.
nyc.Date = nyc.Date.floordiv(100)

In [None]:
# Show again the first three samples.
nyc.head(3)
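To see the renaming and the `floordiv(100)` step in isolation, here is a minimal sketch on a tiny hand-built `DataFrame`. The rows are made up; they only assume the same `YYYYMM` date layout and column order as the CSV file.

```python
import pandas as pd

# Made-up rows in the same YYYYMM layout; not values from the CSV file.
sample = pd.DataFrame({'Date': [189501, 189601, 189701],
                       'Value': [30.0, 31.0, 32.0],
                       'Anomaly': [-1.0, 0.0, 1.0]})

# Rename 'Value' to 'Temperature' by assigning all column names.
sample.columns = ['Date', 'Temperature', 'Anomaly']

# Integer division by 100 drops the trailing month digits (01).
sample.Date = sample.Date.floordiv(100)
print(sample)
```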

<hr style="height:2px; border:none; color:black; background-color:black;">

### Splitting the Data for Training and Testing (1 of 3)
* We’ll use the **`LinearRegression`** estimator from **`sklearn.linear_model`** 
* By default, this estimator uses **all** the **numeric features** in a dataset to perform **multiple linear regression**  
* For **simple linear regression** select **one** feature (the `Date` here) as the **independent variable**
    * A column in a `DataFrame` is a **one-dimensional** `Series` 
    * Scikit-learn estimators require training and testing data to be **two-dimensional** 
    * We'll transform the **`Series` of _n_ elements** into a two-dimensional array containing **_n_ rows** and **one column** 

### Splitting the Data for Training and Testing (2 of 3)
* `nyc.Date.values` returns a NumPy array containing the `Date` column’s values
* **`reshape(-1, 1)`** tells `reshape` to **infer** the number of rows, based on the number of columns (`1`) and the number of elements (124) in the array
    * Transformed array will have 124 rows and one column
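A minimal sketch of the reshape step, using a stand-in array of the years 1895 through 2018 instead of the real `Date` column:

```python
import numpy as np

years = np.arange(1895, 2019)   # stand-in for nyc.Date.values: 124 years
column = years.reshape(-1, 1)   # -1 lets NumPy infer the number of rows

print(years.shape)   # one-dimensional
print(column.shape)  # two-dimensional: n rows, one column
```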

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Use part of the data for training and part for testing.
X_train, X_test, y_train, y_test = train_test_split(
    nyc.Date.values.reshape(-1, 1), nyc.Temperature.values, random_state=11)

### Splitting the Data for Training and Testing (3 of 3)
* Confirm the **75%–25% train-test split** (the `train_test_split` default when no `test_size` is given)

In [None]:
X_train.shape

In [None]:
X_test.shape
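The split sizes can be checked without the CSV file. The sketch below assumes 124 samples, as in the dataset, and uses placeholder targets and hypothetical `demo_` variable names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

demo_X = np.arange(1895, 2019).reshape(-1, 1)  # 124 rows, one column
demo_y = np.zeros(124)                         # placeholder targets

# No test_size is given, so the default 25% test share applies.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=11)

print(demo_X_train.shape, demo_X_test.shape)
```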

<hr style="height:2px; border:none; color:black; background-color:black;">

### Training the Model (1 of 2)
* [**LinearRegression default settings**](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
linear_regression = LinearRegression()

In [None]:
# Use the training data to find the regression line.
linear_regression.fit(X=X_train, y=y_train)

* To find the **best-fitting regression line** for the data, the `LinearRegression` estimator computes the **slope** and **intercept** that **minimize** the **sum of the squares** of the data points’ vertical **distances** from the line (**ordinary least squares**), solving for them directly rather than by iterative adjustment 
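For a single feature, the least-squares slope and intercept have well-known closed-form expressions. The sketch below computes them by hand on made-up points and compares the result with what `LinearRegression` reports; it illustrates the criterion, not scikit-learn's internal implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least-squares estimates for one feature.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# The estimator needs a two-dimensional feature array.
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(slope, model.coef_[0])
print(intercept, model.intercept_)
```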


<hr style="height:2px; border:none; color:black; background-color:black;">

### Training the Model (2 of 2)
* We'll soon use the **slope** and **intercept** to make **predictions** with the equation

\begin{equation}
y = m x + b
\end{equation}

* Slope is the estimator’s **`coef_`** attribute (**m** in the equation) 
* Intercept is the estimator’s **`intercept_`** attribute (**b** in the equation)

In [None]:
# Slope of the line.
linear_regression.coef_

In [None]:
# Point of intercept with the y-axis.
linear_regression.intercept_
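Note that `coef_` is an array with one entry per feature, while `intercept_` is a single number when the target is one-dimensional. A minimal sketch on made-up points that lie exactly on the line y = 2x + 1 (the name `demo` is hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on y = 2x + 1.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2 * X_demo.ravel() + 1

demo = LinearRegression().fit(X_demo, y_demo)

print(demo.coef_.shape)  # one coefficient, because there is one feature
print(demo.coef_[0])     # slope m
print(demo.intercept_)   # intercept b
```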

<hr style="height:2px; border:none; color:black; background-color:black;">

### Testing the Model
* Test the model using the data in **`X_test`** and check some of the **predictions**

In [None]:
predicted = linear_regression.predict(X_test)

In [None]:
expected = y_test

In [None]:
for p, e in zip(predicted[::5], expected[::5]):  # check every 5th element
    print(f'predicted: {p:.2f}, expected: {e:.2f}')
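Beyond eyeballing individual pairs, the differences between predicted and expected values can be summarized in one number. The sketch below uses made-up arrays (not the model's actual output) and computes the mean absolute difference with NumPy.

```python
import numpy as np

# Made-up values for illustration; not results from the NYC model.
predicted_demo = np.array([37.5, 38.0, 36.9, 38.4])
expected_demo = np.array([36.0, 39.5, 37.4, 38.4])

errors = predicted_demo - expected_demo
mean_abs_error = float(np.mean(np.abs(errors)))

print(f'mean absolute error: {mean_abs_error:.3f}')
```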

<hr style="height:2px; border:none; color:black; background-color:black;">

### Predicting Future Temperatures and Estimating Past Temperatures 
* Use the **coefficient** and **intercept** values to make **predictions** 

In [None]:
# lambda implements y = mx + b
predict = (lambda x: linear_regression.coef_ * x + 
                     linear_regression.intercept_)

In [None]:
# Predict a future value.
predict(2019)

In [None]:
# 'Predict' a past value.
predict(1890)
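Because `coef_` is an array, an expression of the form `coef_ * x + intercept_` returns a one-element array rather than a plain number, and it should agree with the estimator's own `predict` method. A minimal sketch on made-up points (the name `demo` and the line y = 0.5x - 960 are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on y = 0.5x - 960.
X_demo = np.array([[2000.0], [2001.0], [2002.0], [2003.0]])
y_demo = 0.5 * X_demo.ravel() - 960

demo = LinearRegression().fit(X_demo, y_demo)

manual = demo.coef_ * 2019 + demo.intercept_  # y = mx + b by hand
builtin = demo.predict(np.array([[2019]]))    # needs a 2D input

print(manual, builtin)
print(float(manual[0]))  # extract a plain number if one is needed
```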

<hr style="height:2px; border:none; color:black; background-color:black;">

### Visualizing the Dataset with the Regression Line 
* Create a **scatter plot** with a regression line 
* **Cooler** temperatures shown in **darker colors**

* **Instructor Note:** All code that modifies a graph must be in the same notebook cell

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Scatter plot of the temperatures, colored by temperature value.
axes = sns.scatterplot(data=nyc, x='Date', y='Temperature',
    hue='Temperature', palette='winter', legend=False)  

axes.set_ylim(10, 70)  # scale y-axis 

# Regression line drawn between the earliest and latest years.
x = np.array([min(nyc.Date.values), max(nyc.Date.values)])
y = predict(x)
line = plt.plot(x, y)

<hr style="height:2px; border:none; color:black; background-color:black;">