# Intro to Machine Learning

In this notebook, we will try to solve the problem of predicting tomorrow's temperature through machine learning. We will begin with simple linear regression, move on to multiple linear regression, and finally try to choose a set of features that best predicts tomorrow's temperature.

## Retrieving the Data
Let's download some weather data. We will download 2019 weather data to fit our model to and 2020 weather data to test it against. We will be getting the data using [wwo-hist](https://towardsdatascience.com/obtain-historical-weather-forecast-data-in-csv-format-using-python-5a6c090fc828). This code will take a little while to download the data.

In [None]:
# Import the libraries used throughout the notebook and download
# the weather data we will train and test on
import pandas as pd
import numpy as np
from wwo_hist import retrieve_hist_data
import matplotlib.pyplot as plt
import csv
from sklearn.linear_model import LinearRegression

api_key = '080bb880d8ba4f43ad5231331211703'
location_list = ['43210']

# Download data to train on
train_data = retrieve_hist_data(api_key,
                                location_list,
                                '1-JAN-2019',
                                '1-JAN-2020',
                                24,
                                store_df = True)[0]

# Download data to test on
test_data = retrieve_hist_data(api_key,
                                location_list,
                                '1-JAN-2020',
                                '1-JAN-2021',
                                24,
                                store_df = True)[0]


## Examining the Data
Let's look at the data that we just downloaded. Here are the first 6 days of the 2019 data. Each column is a different measurement that we have access to.

In [None]:
print(train_data.columns.to_list())
train_data.head(6)

# Simple Linear Regression
Now we will try to predict tomorrow's temperature in one of the simplest ways: linear regression. With linear regression, we make predictions by scaling our input and adding an offset. The equation for linear regression is y = a + bx, where b is the slope and a is the intercept.

In our case, we will predict tomorrow's temperature using today's temperature. This will give us the equation

Tomorrow = b * (Today) + a

Using our data from 2019, we will fit this model. By that, we mean that we will learn the values of a and b that best match our data.
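
As a rough illustration of what "fitting" means here, the sketch below computes a and b with the standard closed-form least-squares formulas. The six temperatures are made up for the example and are not taken from the downloaded data.

```python
import numpy as np

# Made-up temperatures in °C for six consecutive days (not from the dataset).
temps = np.array([10.0, 12.0, 11.0, 14.0, 13.0, 15.0])
today = temps[:-1]      # days 1 to 5
tomorrow = temps[1:]    # days 2 to 6

# Closed-form least-squares estimates for Tomorrow = b * (Today) + a
b = (np.sum((today - today.mean()) * (tomorrow - tomorrow.mean()))
     / np.sum((today - today.mean()) ** 2))
a = tomorrow.mean() - b * today.mean()

print('slope b = %f, intercept a = %f' % (b, a))
```

For a single input, `LinearRegression().fit` should arrive at the same slope and intercept, since it also minimizes the squared error.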

In [None]:
# code to set up the 1 variable simple linear regression

temps = np.array(train_data["tempC"], float)

# Shift temperatures so that, for any row, today_temp has today's temp
# and tomorrow_temp has the temperature for the next day
today_temp = temps[:-1]
tomorrow_temp = temps[1:]

x = np.array([today_temp]).T
y = tomorrow_temp

simple_reg = LinearRegression().fit(x, y)

# Prints regression equation
coefficients = simple_reg.coef_
coefficient = coefficients[0]
intercept = simple_reg.intercept_
r_squared = simple_reg.score(x,y)   # score() returns R^2, not the correlation
print('slope is %f' %(coefficient))
print('intercept is %f' %(intercept))
print('R^2 is %f' %(r_squared))
print('y-hat = %fx + %f explains about %f%% of variation' %(coefficient, intercept, r_squared*100))



## Graph
Plot the regression line over today's temperature versus tomorrow's temperature.

In [None]:
# Points on the regression line
xplt = np.array([min(x) - 2, max(x) + 2])          
yhat = simple_reg.predict(xplt)

# Create plot
plt.plot(x,y,'o')                    # Plot the data points
plt.plot(xplt,yhat,'-',linewidth=3)  # Plot the regression line
plt.xlabel('Today\'s temperature °C')
plt.ylabel('Tomorrow\'s temperature °C')
plt.suptitle('Today\'s Temperature vs Tomorrow\'s Temperature in °C')
plt.grid(True)
plt.savefig('linear_reg.png')
plt.show()

Plot the simple linear regression model's prediction for each day of 2019.

In [None]:
# Model predictions for each day
yhat = simple_reg.predict(x)
day = range(len(today_temp))
dayplt = range(len(yhat))

# Create plot
plt.plot(day,y,'o')                    # Plot the data points
plt.plot(dayplt,yhat,'-',linewidth=3)  # Plot the model's predictions
plt.xlabel('Day')
plt.ylabel('Temperature °C')
plt.suptitle('Predicting Temperature in °C')
plt.grid(True)
plt.savefig('daily_linear_reg.png')
plt.show()

## Testing
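This will measure how well the model predicts the 2020 data, which it has not seen before. The score is the r squared (R²) value: 1 minus the ratio of the model's squared prediction error to the squared error of always guessing the mean temperature. A value of 1 means perfect predictions, 0 means no better than guessing the mean, and a negative value means worse than guessing the mean.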
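
To make the r squared value concrete, here is a minimal sketch that computes it by hand. The function name `r_squared` and the sample numbers are made up for illustration; scikit-learn's `score` method on a regressor uses the same definition.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of always guessing the mean
    return 1.0 - ss_res / ss_tot

actual = [10.0, 12.0, 14.0, 16.0]
print(r_squared(actual, actual))                     # perfect predictions -> 1.0
print(r_squared(actual, [13.0, 13.0, 13.0, 13.0]))   # always guessing the mean -> 0.0
print(r_squared(actual, [16.0, 14.0, 12.0, 10.0]))   # worse than the mean -> -3.0
```

The last case shows that the score can go negative, which is worth keeping in mind when a test score looks surprisingly low.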

In [None]:
temps = np.array(test_data["tempC"], float)

# Shift temperatures so that, for any row, today_temp has today's temp
# and tomorrow_temp has the temperature for the next day
today_temp = temps[:-1]
tomorrow_temp = temps[1:]

x_test = np.array([today_temp]).T
y_test = tomorrow_temp

r_squared = simple_reg.score(x_test, y_test)
print('Score on test data is %f%%' % (r_squared * 100))

# Multiple Linear Regression
Multiple linear regression uses several inputs instead of one. Here we use 3 variables (today's temperature, pressure, and humidity) to try to improve the prediction of tomorrow's temperature.
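
With three inputs the model becomes Tomorrow = b1 * temp + b2 * pressure + b3 * humidity + a. The sketch below shows one way such a fit can be computed, using `np.linalg.lstsq` on a small made-up table rather than the downloaded data. It is meant as an illustration, not as a description of how `LinearRegression` is implemented.

```python
import numpy as np

# Made-up readings for five days (illustrative only, not from the dataset).
temp     = np.array([10.0, 12.0, 11.0, 14.0, 13.0])
pressure = np.array([1012.0, 1009.0, 1015.0, 1011.0, 1008.0])
humidity = np.array([80.0, 75.0, 70.0, 65.0, 72.0])
tomorrow = np.array([12.0, 11.0, 14.0, 13.0, 15.0])

# One row per day, one column per variable, plus a column of ones for the intercept.
A = np.column_stack([temp, pressure, humidity, np.ones(len(temp))])

# Least-squares solution of A @ w ≈ tomorrow
w = np.linalg.lstsq(A, tomorrow, rcond=None)[0]
b1, b2, b3, a = w
print('Tomorrow = %f*temp + %f*pressure + %f*humidity + %f' % (b1, b2, b3, a))
```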

Set the independent variables.

In [None]:
# independent variables
var1 = list(map(int, train_data["tempC"].tolist()))[:-1]
var2 = list(map(int, train_data["pressure"].tolist()))[:-1]
var3 = list(map(int, train_data["humidity"].tolist()))[:-1]

Calculate multiple linear regression for the independent variables.

In [None]:
# Build the feature matrix: one row per day, one column per variable
xplt = np.array(list(zip(var1, var2, var3)))
multi_reg = LinearRegression().fit(xplt, y)

yhat = multi_reg.predict(xplt)

day = range(len(y))
dayplt = range(len(yhat))

#Prints information
coefficients = multi_reg.coef_
intercept = multi_reg.intercept_
r_squared = multi_reg.score(xplt,y)   # score() returns R^2, not the correlation
numVar = (xplt.shape)[1]
for j in range(numVar):
    print('slope of x_%d is %f' %(j,coefficients[j]))
print('intercept is %f' %(intercept))
print('R^2 is %f' %(r_squared))
print('y-hat explains about %f%% of variation' %(r_squared*100))

## Graph
Plot the multiple linear regression model's prediction for each day of 2019.

In [None]:
plt.plot(day,y,'o')                    # Plot the data points
plt.plot(dayplt,yhat,'-',linewidth=3)  # Plot the model's predictions
plt.xlabel('Day')
plt.ylabel('Temperature °C')
plt.suptitle('Predicting Tomorrow\'s Temperature in °C')
plt.grid(True)
plt.savefig('multi_linear_reg.png')
plt.show()

Plot the predicted versus actual temperature.

In [None]:
plt.plot(yhat,y,'o')
plt.xlabel('Predicted Temperature °C')
plt.ylabel('Actual Temperature °C')
plt.suptitle('Predicted vs Actual Temperature in °C')
plt.grid(True)
plt.savefig('plot_multi_linear_reg.png')
plt.show()

## Testing
Calculate how well the multiple linear regression model fits temperature data from another year.

In [None]:
# independent variables
var1 = list(map(int, test_data["tempC"].tolist()))[:-1]
var2 = list(map(int, test_data["pressure"].tolist()))[:-1]
var3 = list(map(int, test_data["humidity"].tolist()))[:-1]

x_test = np.array(list(zip(var1, var2, var3)))
y_test = np.array(test_data["tempC"], float)[1:]   # np.float was removed in newer NumPy versions

r_squared = multi_reg.score(x_test, y_test)
print('Score on test data is %f%%' % (r_squared * 100))

# Deciding Features

In this section, we are going to see just how good a model we can make. As an example, we made a model that is trained on every available feature as well as some transformed features. As you might expect, this scores very well on the training data.
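
Transformed features let a linear model follow a curved relationship. The sketch below uses made-up data where y depends on x squared, and a hypothetical helper `fit_and_score`, to compare a fit on x alone with a fit on x and x squared.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 13)
y = x ** 2 + 1.0                 # a curved relationship, made up for illustration

def fit_and_score(features, target):
    """Fit least squares with an intercept and return R^2 on the same data."""
    A = np.column_stack(features + [np.ones(len(target))])
    w = np.linalg.lstsq(A, target, rcond=None)[0]
    pred = A @ w
    return 1.0 - np.sum((target - pred) ** 2) / np.sum((target - target.mean()) ** 2)

print('x only:     R^2 = %f' % fit_and_score([x], y))
print('x and x**2: R^2 = %f' % fit_and_score([x, x ** 2], y))
```

With x alone the best straight line is flat, so the score is about 0. Adding the squared column lets the model match the curve, and the score rises to about 1.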

In [None]:
# Put together all the features we want to train on
def extract_features(data):
    return np.array([
            np.array(data["maxtempC"], float),                  # Every datafield given
            np.array(data["mintempC"], float),
            np.array(data["totalSnow_cm"], float),              
            np.array(data["sunHour"], float),
            np.array(data["uvIndex"], float),
            np.array(data["moon_illumination"], float),
            np.array(data["DewPointC"], float),
            np.array(data["FeelsLikeC"], float),
            np.array(data["HeatIndexC"], float),
            np.array(data["WindChillC"], float),
            np.array(data["WindGustKmph"], float),
            np.array(data["cloudcover"], float),
            np.array(data["humidity"], float),
            np.array(data["precipMM"], float),
            np.array(data["pressure"], float),
            np.array(data["tempC"], float),
            np.array(data["visibility"], float),
            np.array(data["winddirDegree"], float),
            np.array(data["windspeedKmph"], float),
            np.array(data["maxtempC"], float) ** 2,             # Every datafield squared
            np.array(data["mintempC"], float) ** 2,
            np.array(data["totalSnow_cm"], float) ** 2,
            np.array(data["sunHour"], float) ** 2,
            np.array(data["uvIndex"], float) ** 2,
            np.array(data["moon_illumination"], float) ** 2,    
            np.array(data["DewPointC"], float) ** 2,
            np.array(data["FeelsLikeC"], float) ** 2,
            np.array(data["HeatIndexC"], float) ** 2,
            np.array(data["WindChillC"], float) ** 2,
            np.array(data["WindGustKmph"], float) ** 2,
            np.array(data["cloudcover"], float) ** 2,
            np.array(data["humidity"], float) ** 2,
            np.array(data["precipMM"], float) ** 2,
            np.array(data["pressure"], float) ** 2,
            np.array(data["tempC"], float) ** 2,
            np.array(data["visibility"], float) ** 2,
            np.array(data["winddirDegree"], float) ** 2,
            np.array(data["windspeedKmph"], float) ** 2,
            np.sin(np.array(data["visibility"], float)),        # Sin, log, and sqrt of visibility
            np.log(np.array(data["visibility"], float)),
            np.sqrt(np.array(data["visibility"], float)),
    ]).T[:-1]

X = extract_features(train_data)
y = np.array(train_data["tempC"], float)[1:]

feature_reg = LinearRegression().fit(X, y)

# Print info about regression
coefficients = feature_reg.coef_
intercept = feature_reg.intercept_
r_squared = feature_reg.score(X,y)   # score() returns R^2
for j in range(X.shape[1]):
    print('slope of feature %d is %f' %(j+1 ,coefficients[j]))
print('intercept is %f' %(intercept))
print('Score on train data is %f%%' % (r_squared * 100))

In [None]:
# predicted vs actual
yhat = feature_reg.predict(X)
plt.plot(yhat,y,'o')
plt.xlabel('Predicted Temperature °C')
plt.ylabel('Actual Temperature °C')
plt.suptitle('Predicted vs Actual Temperature in °C')
plt.grid(True)
plt.savefig('plot_feature_reg.png')   # separate file so the earlier plot is not overwritten
plt.show()

## Testing

Now let's see how it performs on the test data. In this example, where we trained on everything, the test data prediction is poor. This is to be expected: by giving our model many features, we allowed it to overfit our training data.
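
The effect can be reproduced on a small scale. The sketch below uses randomly generated data (one useful feature plus 30 columns of pure noise, all made up) and compares a model trained on the useful feature with one trained on all 31. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Made-up data: one useful feature plus 30 columns of pure noise."""
    useful = rng.normal(size=n)
    noise = rng.normal(size=(n, 30))
    target = 2.0 * useful + rng.normal(scale=0.5, size=n)
    return np.column_stack([useful, noise]), target

def add_intercept(X):
    return np.column_stack([X, np.ones(len(X))])

def r_squared(X, y, w):
    pred = add_intercept(X) @ w
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

X_train, y_train = make_data(40)
X_test, y_test = make_data(40)

results = {}
for n_features in (1, 31):
    cols = slice(0, n_features)
    w = np.linalg.lstsq(add_intercept(X_train[:, cols]), y_train, rcond=None)[0]
    results[n_features] = (r_squared(X_train[:, cols], y_train, w),
                           r_squared(X_test[:, cols], y_test, w))
    print('%2d features: train R^2 = %f, test R^2 = %f' % ((n_features,) + results[n_features]))
```

Adding columns can never lower the training score, so the 31-feature model always looks at least as good on the training data. On the test data it typically does worse, because the extra coefficients were fitted to noise.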

In [None]:
X_test = extract_features(test_data)
y_test = np.array(test_data["tempC"], float)[1:]

r_squared = feature_reg.score(X_test, y_test)
print('Score on test data is %f%%' % (r_squared * 100))

# Your turn

Go back through the deciding features section and see how high you can get the test score!
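
One possible way to experiment is a small helper that takes a list of column names and reports the test score. The sketch below is only a starting point: `score_on_test` is a hypothetical name, it fits with `np.linalg.lstsq` instead of `LinearRegression`, and the two small dictionaries are made-up stand-ins for `train_data` and `test_data`.

```python
import numpy as np

def score_on_test(train, test, columns, target="tempC"):
    """Fit on train using today's `columns`, return R^2 for tomorrow's `target` on test."""
    def design(data):
        cols = [np.array(data[c], float)[:-1] for c in columns]   # today's values
        return np.column_stack(cols + [np.ones(len(cols[0]))])    # plus intercept column
    def tomorrow(data):
        return np.array(data[target], float)[1:]                  # next day's target

    w = np.linalg.lstsq(design(train), tomorrow(train), rcond=None)[0]
    pred = design(test) @ w
    y = tomorrow(test)
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Tiny made-up stand-ins for train_data and test_data.
fake_train = {"tempC": [10, 12, 11, 14, 13, 15, 14], "humidity": [80, 75, 78, 70, 72, 65, 68]}
fake_test = {"tempC": [9, 11, 12, 12, 14, 13, 16], "humidity": [82, 77, 74, 73, 69, 71, 64]}

for columns in (["tempC"], ["tempC", "humidity"]):
    print(columns, score_on_test(fake_train, fake_test, columns))
```

Because the helper only indexes columns by name, a call such as `score_on_test(train_data, test_data, ["tempC", "pressure", "humidity"])` should also work with the downloaded DataFrames, which makes it quick to compare different feature lists.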