# Intro to Machine Learning

In this notebook, we will try to solve the problem of predicting tomorrow's temperature through machine learning. We will begin with simple linear regression, move on to multiple linear regression, and finally try to choose a set of features that best predicts tomorrow's temperature.

## Retrieving the Data
Let's download some weather data. We will download 2019 weather data to fit our model to and 2020 weather data to test it against. We will be getting the data using [wwo-hist](https://towardsdatascience.com/obtain-historical-weather-forecast-data-in-csv-format-using-python-5a6c090fc828). This code will take a little while to download the data.

In [None]:
# Download the 2019 weather data (to fit the model) and the 2020 weather
# data (to test it) for the chosen location
import pandas as pd
import numpy as np
from wwo_hist import retrieve_hist_data
import matplotlib.pyplot as plt
import csv
from sklearn.linear_model import LinearRegression

api_key = '080bb880d8ba4f43ad5231331211703'
location_list = ['43210']
train_data = retrieve_hist_data(api_key,
                                location_list,
                                '1-JAN-2019',
                                '1-JAN-2020',
                                24,
                                store_df = True)[0]

test_data = retrieve_hist_data(api_key,
                                location_list,
                                '1-JAN-2020',
                                '1-JAN-2021',
                                24,
                                store_df = True)[0]


## Examining the Data
Let's look at the data that we just downloaded. Here are the first 6 days of the 2019 data. Each row is one day, and each column is a different type of measurement that we have access to.

In [None]:
train_data.head(6)

# Simple Linear Regression
Now we will try to predict tomorrow's temperature in one of the simplest ways: linear regression. With linear regression, we make predictions by scaling our input and adding an offset. The equation for linear regression is y = a + bx, where b is the slope and a is the intercept.

In our case, we will predict tomorrow's temperature using today's temperature. This will give us the equation

Tomorrow = b * (Today) + a

Using our data from 2019, we will fit this model to our data. By that, we mean that we will learn the values of a and b that best match our data.
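
To make "learning a and b" concrete, here is a minimal sketch of the least-squares formulas for a single input, written in plain Python. The temperatures are made-up values rather than the downloaded data, and the names `temps` and `predict` are only illustrative.

```python
# Hypothetical daily temperatures in degrees C (made-up values, not the downloaded data)
temps = [4.0, 6.0, 5.0, 9.0, 12.0, 10.0, 13.0]

# Pair each day with the day that follows it
today = temps[:-1]
tomorrow = temps[1:]

n = len(today)
mean_x = sum(today) / n
mean_y = sum(tomorrow) / n

# Least-squares slope: co-variation of x and y divided by the variation of x
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(today, tomorrow))
sxx = sum((x - mean_x) ** 2 for x in today)
b = sxy / sxx

# Least-squares intercept: the fitted line passes through the point of means
a = mean_y - b * mean_x


def predict(today_temp):
    """Predict tomorrow's temperature using Tomorrow = b * (Today) + a."""
    return b * today_temp + a


print('slope b = %f, intercept a = %f' % (b, a))
print('prediction for a 10 degree day: %f' % predict(10.0))
```

A library routine such as scikit-learn's `LinearRegression` finds the same least-squares line for one input, so the formulas above are mainly useful for seeing what "fitting" means.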

In [None]:
# code to set up the 1 variable simple linear regression

today_temp = list(map(int, train_data["tempC"].tolist()))
tomorrow_temp = list(map(int, train_data["tempC"].tolist()))

tomorrow_temp.pop(0)
today_temp.pop()

x = np.array(today_temp).reshape(-1,1)
y = tomorrow_temp

reg = LinearRegression().fit(x, y)

# Points on the regression line
xplt = np.array([min(x) - 2, max(x) + 2])          
yplt = reg.predict(xplt)

# Prints regression equation
coefficients = reg.coef_
coefficient = coefficients[0]
intercept = reg.intercept_
r_squared = reg.score(x, y)  # score() returns R^2, not the correlation
print('slope is %f' %(coefficient))
print('intercept is %f' %(intercept))
print('R^2 is %f' %(r_squared))
print('y-hat = %fx + %f explains about %f%% of variation' %(coefficient, intercept, r_squared*100))



## Testing
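This will calculate the r squared value of the model, which measures how closely its predictions match the real data. The r squared value is 1 minus the ratio of the model's squared prediction error to the total variation in tomorrow's temperature. A value of 1 means the predictions are perfect, and a value of 0 means the model does no better than always predicting the average temperature. Note that the code below scores the model on the same 2019 data it was fit to.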
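
As a rough illustration of what the r squared value measures, the sketch below computes it by hand as 1 minus the squared prediction error divided by the total variation. The helper name `r_squared` and the temperatures are assumptions made for this example, not part of the notebook's data.

```python
def r_squared(actual, predicted):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    # Squared error of the predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the actual values around their mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed temperatures in degrees C (made-up values)
actual = [5.0, 7.0, 6.0, 10.0, 12.0]

# Two reference "models": perfect predictions, and always predicting the mean
perfect = list(actual)
mean_only = [sum(actual) / len(actual)] * len(actual)

print('R^2 of perfect predictions: %f' % r_squared(actual, perfect))
print('R^2 of always predicting the mean: %f' % r_squared(actual, mean_only))
```

Predictions that are worse than always guessing the mean give a negative value. Scoring a model on data it was not fit to (here that would be the 2020 data) generally gives a more honest picture than scoring it on its own training data.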

In [None]:
r_squared=reg.score(x, y)
print('R^2 value is %f' %(r_squared))

## Graph
Plot the regression line with today's temperature versus tomorrow's temperature.

In [None]:
plt.plot(x,y,'o')                    # Plot the data points
plt.plot(xplt,yplt,'-',linewidth=3)  # Plot the regression line
plt.xlabel('Today\'s temperature')
plt.ylabel('Tomorrow\'s temperature')
plt.suptitle('Today\'s Temperature vs Tomorrow\'s Temperature in degrees C')
plt.grid(True)
plt.savefig('linear_reg.png')
plt.show()

In [None]:
xplt = np.array(today_temp).reshape(-1,1)
yplt = reg.predict(xplt)
day = range(len(y))  # one point per (today, tomorrow) pair, whatever the year length
dayplt = range(len(yplt))

plt.plot(day,y,'o')                    # Plot the data points
plt.plot(dayplt,yplt,'-',linewidth=3)  # Plot the regression line
plt.xlabel('Day')
plt.ylabel('Temperature')
plt.suptitle('Predicting Temperature in Degrees C')
plt.grid(True)
plt.savefig('daily_linear_reg.png')
plt.show()

## Multiple Linear Regression
Multiple linear regression uses several variables to improve the prediction of tomorrow's temperature. Here we use 3: today's temperature plus pressure and humidity. The additional variables may be swapped for other columns of the data.
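
The idea can be sketched on a handful of made-up readings before applying it to the real data. This example assumes the model Tomorrow = a + b1 * temp + b2 * pressure + b3 * humidity and uses NumPy's least-squares solver rather than scikit-learn; all values and variable names are illustrative.

```python
import numpy as np

# Made-up readings for seven consecutive days (not the downloaded data)
temp = np.array([4.0, 6.0, 5.0, 9.0, 12.0, 10.0, 13.0])
pressure = np.array([1012.0, 1009.0, 1015.0, 1011.0, 1007.0, 1010.0, 1013.0])
humidity = np.array([80.0, 75.0, 82.0, 70.0, 65.0, 72.0, 68.0])

# Features come from today (every day but the last);
# the target is the following day's temperature
X = np.column_stack([temp[:-1], pressure[:-1], humidity[:-1]])
y = temp[1:]

# Add a column of ones so the intercept is fitted along with the slopes
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slopes = coeffs[0], coeffs[1:]

predictions = A @ coeffs
print('intercept:', intercept)
print('slopes (temp, pressure, humidity):', slopes)
```

Each row of `X` is one day and each column is one variable, which is the same layout scikit-learn's `LinearRegression().fit` expects.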

In [None]:
# independent variables
numVar = 3
var1 = list(map(int, train_data["tempC"].tolist()))
var1.pop()
var2 = list(map(int, train_data["pressure"].tolist()))
var2.pop()
var3 = list(map(int, train_data["humidity"].tolist()))
var3.pop()  # drop the last day so every input lines up with tomorrow_temp

# tomorrow temp
tomorrow_temp = list(map(int, train_data["tempC"].tolist()))
tomorrow_temp.pop(0)

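# transform variable (optional candidate features, not used in the fit below)
var1Float = np.array(var1, dtype=float)
sqVar1 = var1Float ** 2
# the reciprocal is undefined at 0 degrees C and the logs at or below 0,
# so those days become inf or NaN; silence the warnings NumPy would raise
with np.errstate(divide='ignore', invalid='ignore'):
    reciprocalVar1 = np.reciprocal(var1Float)
    log10Var1 = np.log10(var1Float)
    lnVar1 = np.log(var1Float)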

# create x and y arrays
x1 = np.array(var1)
x2 = np.array(var2)
x3 = np.array(var3)
y = tomorrow_temp

# Build the feature matrix: one row per day, one column per variable
x = []       
for i in range(len(x1)):
    x.append([x1[i],x2[i],x3[i]])
xplt = np.array(x)
reg = LinearRegression().fit(xplt, y)

yplt = reg.predict(xplt)

day = range(len(y))  # one point per (today, tomorrow) pair, whatever the year length
dayplt = range(len(yplt))

#Prints information
coefficients = reg.coef_
intercept = reg.intercept_
r_squared = reg.score(xplt, y)  # score() returns R^2, not the correlation
for j in range(numVar):
    print('slope of x%d is %f' %(j + 1, coefficients[j]))
print('intercept is %f' %(intercept))
print('R^2 is %f' %(r_squared))
print('y-hat explains about %f%% of variation' %(r_squared*100))

## Graph
Plot the predictions of the multiple linear regression, made from temperature, pressure, and humidity, against the actual temperature for each day.

In [None]:
plt.plot(day,y,'o')                    # Plot the data points
plt.plot(dayplt,yplt,'-',linewidth=3)  # Plot the regression line
plt.xlabel('Day')
plt.ylabel('Temperature')
plt.suptitle('Predicting Tomorrow\'s Temperature in Degrees C')
plt.grid(True)
plt.savefig('multi_linear_reg.png')
plt.show()

## Testing
This will calculate the r squared value of the multiple linear regression model on the 2019 data it was fit to.

In [None]:
r_squared=reg.score(xplt, y)
print('R^2 value is %f' %(r_squared))

In this section you may select as many variables as you want to be inputs to the linear regression. This will serve to demonstrate overfitting and underfitting, as well as how some data is more beneficial for solving a problem while other data is unrelated to what we are trying to predict.
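
One way to see this effect is sketched below on synthetic data, where only the first variable actually influences the target and the rest are noise. It uses NumPy's least-squares solver in place of `LinearRegression` so that it needs nothing beyond NumPy. The helper names `make_data`, `fit`, and `r_squared` are hypothetical, and the data is randomly generated rather than taken from the weather download.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n_days, n_features):
    """Synthetic data: only the first feature influences the target."""
    X = rng.normal(size=(n_days, n_features))
    y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=n_days)
    return X, y


def fit(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs


def r_squared(coeffs, X, y):
    """R^2 = 1 - SS_res / SS_tot for the fitted coefficients on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    residuals = y - A @ coeffs
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)


n_features = 15
X_train, y_train = make_data(20, n_features)   # small training set
X_test, y_test = make_data(200, n_features)    # held-out data

train_scores, test_scores = [], []
for k in range(1, n_features + 1):
    # Fit using only the first k variables, then score on both data sets
    coeffs = fit(X_train[:, :k], y_train)
    train_scores.append(r_squared(coeffs, X_train[:, :k], y_train))
    test_scores.append(r_squared(coeffs, X_test[:, :k], y_test))
    print('variables=%2d  train R^2=%.3f  test R^2=%.3f'
          % (k, train_scores[-1], test_scores[-1]))
```

The training score can never go down as variables are added, because a larger model can always reproduce a smaller one. The score on held-out data has no such guarantee and typically stalls or drops once the extra variables carry no real information, which is the signature of overfitting.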

In [None]:
# code here sets up a linear regression of n variables