# Linear Regression on the Weather History dataset using sklearn
Regression approximates a function $f$ from sampled data points on its graph. Linear regression constructs an approximation that is linear, so it works well when the data points of $f$ lie near a line, a plane (two inputs and one output, so the graph lives in $\mathbb{R}^3$), or more generally a hyperplane. It learns the equation of this hyperplane and can then predict outputs for new input values.
The learnt function $h$ is defined by the parametric equation $h(\vec x) = \vec w \cdot \vec x + b$ (also written $\vec n \cdot \vec x + c$). The parameters $\vec w = (w_1, w_2, \cdots, w_n)$ and $b$ can be learnt as in binary classification, using gradient descent (with gradients from backpropagation) to update them.
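
To make the gradient descent idea concrete, here is a minimal sketch that fits $h(\vec x) = \vec w \cdot \vec x + b$ on synthetic data. The function name `fit_gd`, the learning rate, the step count and the data are illustrative assumptions, not part of this notebook or of ``sklearn``.

```python
import numpy as np

def fit_gd(X, y, lr=0.1, steps=2000):
    """Fit h(x) = w . x + b by batch gradient descent on the mean squared error.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(steps):
        residual = X @ w + b - y                    # h(x_i) - y_i for every example
        grad_w = 2.0 * X.T @ residual / n_samples   # d(MSE)/dw
        grad_b = 2.0 * residual.mean()              # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noise-free data lying exactly on the plane y = 3*x1 - 2*x2 + 0.5
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5

w, b = fit_gd(X, y)
print(w, b)  # should be close to [3, -2] and 0.5
```

With noise-free data the iterates approach the true plane, but only approximately and only after many steps.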

Details:
Actually, ``sklearn`` uses ordinary least squares ("OLS") to minimize the L2 loss (strictly, the residual norm $\sqrt{N \cdot \text{L2 loss}}$, which has the same minimizer). The weights and bias (i.e. $\vec \theta$) are found directly as the least-squares solution, which satisfies the normal equation (this follows from the classic least squares approximation, see notes).
Unlike gradient descent, this approach finds the minimizer **exactly** (up to floating-point error), so a given training set always yields the same parameters and the same loss. The small run-to-run variation in the loss values here comes from the shuffling, which picks a different random training set (the matrix $X$ in the OLS computation) and test set each time.
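
As a sketch of what "found directly" means, the snippet below solves the normal equation $A^\top A \, \vec\theta = A^\top \vec y$ with NumPy and compares it with `np.linalg.lstsq`. The synthetic data and the names `A`, `theta_normal` and `theta_lstsq` are assumptions for illustration; this is not a claim about the exact routine ``sklearn`` runs internally.

```python
import numpy as np

# Synthetic data: y = 1.5*x1 - 0.7*x2 + 0.3 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.7]) + 0.3 + 0.01 * rng.normal(size=100)

# Append a column of ones so the bias is learnt as the last entry of theta
A = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation: (A^T A) theta = A^T y
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# The same least-squares problem via a dedicated solver
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(theta_normal)                            # close to [1.5, -0.7, 0.3]
print(np.allclose(theta_normal, theta_lstsq))  # both routes agree
```

Both routes return the same $\vec\theta$ with no iteration and no learning rate to tune.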

The following implementation uses ``sklearn`` and abstracts the details of the optimization.

Reading material: [CS50 AI Lecture 4](https://cs50.harvard.edu/ai/2020/notes/4/)

In [136]:
from sklearn.linear_model import LinearRegression
from csv import reader
import numpy as np
from numpy.random import shuffle

### Part 1: Regress humidity as a function of Temperature

In [137]:
def load_data(file):
    # A list of dictionaries, one per row, each with two keys: "temperature" (the input, as a
    # one-element list) and "humidity" (the output). The keys only make explicit what is being regressed.
    data = []

    def load(file_obj):    
        csv_reader = reader(file_obj) # csv_reader is an iterable iterating over rows of the file.
        next(csv_reader) # skip the title row
        for row in csv_reader: # row is a list of strings containing the data of each row in the file.
            data.append({
                # input: temperature, function output: humidity
                "temperature": [float(row[3])], 
                "humidity": float(row[5]) # the float is since row is list[str]
            })
    
    # load data based on type of file. file is assumed to be either str(filename) or a file object.
    if isinstance(file, str):
        with open(file) as f:
            load(f)
    else:
        load(file)
    return data

In [138]:
model = LinearRegression() # model is an instance of the LinearRegression class. C++ code would be LinearRegression model() or auto model = LinearRegression()
data = load_data("weatherHistory.csv")
print("Dataset loaded.")

# divide data into training and test sets.
holdout = int(0.4 * len(data))
shuffle(data)
testing_set = data[:holdout]
training_set = data[holdout:]

# train model on training set
X_train = [example["temperature"] for example in training_set]
Y_train = [example["humidity"] for example in training_set]
model.fit(X_train, Y_train)
print("Model trained with", len(X_train), "examples.")

# test model predictions on testing set
X_test = [example["temperature"] for example in testing_set]
Y_test = [example["humidity"] for example in testing_set]
predictions = model.predict(X=X_test)

# calculate the L2 loss function.
loss = 0
sz = 0 # size of testset
for y, yhat in zip(Y_test, predictions):
    loss += (y - yhat)**2
    sz += 1
loss /= sz
# the loss is around 0.02 with answers in [0, 1], i.e. a root-mean-square error of about 0.15.
print(f"Mean squared loss over {sz} testcases is {loss:.4f}.")

Dataset loaded.
Model trained with 57872 examples.
Mean squared loss over 38581 testcases is 0.0230.
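
The loss printed above changes slightly between runs because `shuffle` is unseeded. One possible way to make the split repeatable is to draw a seeded permutation of indices, as in the sketch below. The helper `split_indices` and its `seed` parameter are hypothetical and not part of this notebook.

```python
import numpy as np

def split_indices(n, test_fraction=0.4, seed=0):
    """Return (train_idx, test_idx) from a seeded random permutation of range(n).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    holdout = int(test_fraction * n)
    return order[holdout:], order[:holdout]

train_idx, test_idx = split_indices(10, seed=42)
again_train, again_test = split_indices(10, seed=42)

print(len(train_idx), len(test_idx))          # 6 4
print(np.array_equal(test_idx, again_test))   # True: same seed, same split
```

Indexing the loaded rows with `train_idx` and `test_idx` would then give the same training and test sets, and hence the same loss, on every run.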


### Part 2: Regress humidity as a function of Temperature and Pressure
We present a general algorithm to regress any output parameter as a function of any list of input parameters.

In [143]:
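# General load function. We do it without csv.reader for variety and learning.
# Note: splitting on ',' ignores CSV quoting, so the selected columns must come before any quoted field that contains commas.
def load(file: str, params: list, output: str): # params and output are expected to be lowercase
    with open(file) as f:
        # strip() removes the trailing newline that would otherwise stick to the last title
        titles = [title.strip().lower() for title in f.readline().split(',')]

        missing = set(params) - set(titles)
        if missing:
            raise ValueError(f"Invalid input variables: {sorted(missing)}")
        if output not in titles:
            raise ValueError("Invalid output variable.")

        # input columns are kept in file order, not in the order given in params
        indices = [i for i, title in enumerate(titles) if title in params]
        outindex = titles.index(output)

        data = []
        for row in f:
            words = row.split(',')
            data.append({"inputs": [float(words[i]) for i in indices], "output": float(words[outindex])})
    return data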

def calc_l2loss(actual, prediction):
    # mean of the squared errors; equivalent to looping over zip(actual, prediction)
    # and averaging (y - yhat)**2
    return np.mean((np.asarray(actual) - np.asarray(prediction))**2)

In [144]:
params = ["temperature (c)", "pressure (millibars)"]
output = "humidity"
data = load("weatherHistory.csv", params=params, output=output)
# print(data[:10])
print("Dataset loaded.")

model = LinearRegression()

shuffle(data)
holdout = int(0.4 * len(data))
train, test = data[holdout:], data[:holdout]
X_train = [example["inputs"] for example in train]
y_train = [example["output"] for example in train]
X_test  = [example["inputs"] for example in test]
y_test  = [example["output"] for example in test]

model.fit(X_train, y_train)
print("Model trained with", len(X_train), "examples.")

prediction = model.predict(X_test)
loss = calc_l2loss(y_test, prediction)
print(f"Mean squared loss while regressing {output} against {params} with {len(prediction)} testcases is {loss:.4f}.")

Dataset loaded.
Model trained with 57872 examples.
Mean squared loss while regressing humidity against ['temperature (c)', 'pressure (millibars)'] with 38581 testcases is 0.0228.
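
A mean squared loss is easier to judge against a reference. One common choice, sketched here as an assumption rather than something this notebook computes, is to compare the model with a baseline that always predicts the mean of the training outputs. The helper names and the tiny made-up numbers are illustrative only.

```python
import numpy as np

def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return float(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

def score_vs_mean_baseline(y_train, y_test, predicted):
    """1 - MSE(model) / MSE(always predicting the training mean).

    Hypothetical helper: 1.0 means perfect predictions, 0.0 means no better
    than the constant baseline, negative means worse than the baseline.
    """
    baseline = np.full(len(y_test), np.mean(y_train))
    return 1.0 - mse(y_test, predicted) / mse(y_test, baseline)

# Tiny made-up example (not taken from the weather dataset)
y_train = [0.0, 0.5, 1.0]          # training mean is 0.5
y_test = [0.25, 0.75]

print(score_vs_mean_baseline(y_train, y_test, [0.25, 0.75]))  # 1.0
print(score_vs_mean_baseline(y_train, y_test, [0.5, 0.5]))    # 0.0
```

Passing `y_train`, `y_test` and `prediction` from the cell above to such a helper would show how much of the variation in humidity the two inputs actually explain.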
