## **Linear Regression**
We will use linear regression to predict house prices.

We are using a Kaggle dataset: https://www.kaggle.com/harlfoxem/housesalesprediction

In [None]:
# Let's import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split


### **Dataset Preparation**

In [None]:
# Execute this cell to load the dataset into a pandas DataFrame

from IPython.display import clear_output
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=16x6-8Znn2T50zFwVvKlzsdN7Jd1hpjct' -O Linear_regression_dataset

data_df = pd.read_csv("Linear_regression_dataset")

In [None]:
# Let's have a quick look at the dataset

print("(No. of rows, No. of columns) =", data_df.shape)
data_df.head()

So there are **19** features (of course, we will not use id as a feature :) ) and 1 variable to predict (price).

Note that the **date** column contains strings, so we first remove the T000000 part from it and cast the rest to integers. Then we convert the data to a numpy array.

In [None]:
# Remove the "T000000" part from the date column. Hint: look up the .str.replace() method. :)
data_df['date'] = data_df["date"].str.replace("T000000", "", regex=False).astype(int)

# Create a numpy array that does not have the "id" field
data_array = data_df.drop(columns=["id"]).to_numpy()
assert (data_array.shape == (21613,20))

data_df.head()
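
As a small illustration of the date conversion, here is a sketch on a toy Series. The two date strings are made up for the example and only mimic the `YYYYMMDDT000000` layout. The `pd.to_datetime` variant is an alternative that this notebook does not use; it is shown only as one possible option.

```python
import pandas as pd

# Toy date strings in the same "YYYYMMDDT000000" layout as the date column
dates = pd.Series(["20141013T000000", "20150225T000000"])

# Approach used in the cell above: strip the constant suffix and cast to int
as_int = dates.str.replace("T000000", "", regex=False).astype(int)
print(as_int.tolist())

# Alternative (an assumption, not part of this notebook): parse real datetimes
parsed = pd.to_datetime(dates, format="%Y%m%dT%H%M%S")
year_month = parsed.dt.year * 100 + parsed.dt.month   # e.g. 201410 for October 2014
print(year_month.tolist())
```

The integer form keeps dates ordered, but the gaps between values are not uniform in time (20141231 to 20150101 is one day but a jump of 8870), which is worth keeping in mind when it is used as a numeric feature.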

Now the next task is **normalization**.

We will scale each column of the dataset by x -> (x-u)/s

where u is mean(x) and s is the standard deviation of x

In [None]:
mean = data_array.mean(axis=0)    # array; each entry is the mean of a column
sd = data_array.std(axis=0)       # array; each entry is the standard deviation of a column

data_array_norm = (data_array-mean)/sd

print(data_array_norm.shape)
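
A quick way to sanity-check the normalization is to confirm that every column ends up with mean close to 0 and standard deviation close to 1. The sketch below does this on a made-up toy array (the values and the `toy_*` names are hypothetical); the same two checks could be run on `data_array_norm`.

```python
import numpy as np

# Toy data: 4 samples, 2 columns on very different scales (hypothetical values)
toy = np.array([[1.0, 1000.0],
                [2.0, 2000.0],
                [3.0, 3000.0],
                [4.0, 4000.0]])

toy_mean = toy.mean(axis=0)
toy_sd = toy.std(axis=0)          # population standard deviation (ddof=0), numpy's default
toy_norm = (toy - toy_mean) / toy_sd

print(toy_norm.mean(axis=0))      # approximately 0 for each column
print(toy_norm.std(axis=0))       # approximately 1 for each column
```

Note that a column with zero standard deviation (a constant column) would cause a division by zero here, so such columns would need to be dropped or handled separately.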

The last step is to make the train and test datasets and to create a separate vector for price.

In [None]:
labels = data_array_norm[:, 1]                          # extract the price column from the data
x_array_norm = np.delete(data_array_norm, 1, axis=1)    # delete the price column from data_array_norm. Hint: use np.delete()

# split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x_array_norm,labels,test_size=0.15,random_state=42,shuffle=True)

print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

### **Loss and gradient descent**
We will use mean squared error (MSE) as the loss.

Use the gradient descent algorithm that you learned from the tutorials.

Your task is to complete the following functions.
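
For reference, the quantities involved can be written out as follows. The notation is introduced here for convenience and is not from the tutorials: $X$ is the feature matrix of shape (m, n), $\hat{y}$ the vector of predictions, and $\eta$ the learning rate.

$$
\hat{y} = X a + b, \qquad L(a, b) = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2
$$

$$
\frac{\partial L}{\partial a} = \frac{2}{m} X^{\top} (\hat{y} - y), \qquad \frac{\partial L}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)
$$

$$
a \leftarrow a - \eta \, \frac{\partial L}{\partial a}, \qquad b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
$$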

In [None]:
def loss(y_pred,y_true):
  """
  input:
  y_pred = [array] predicted value of y
  y_true = [array] ground truth

  output:
  mse: [scalar] the MSE loss
  """
  mse = np.mean((y_pred - y_true) ** 2)    # mean of the squared errors

  return mse

In [None]:
def y(x,a,b):
  """
  This function should return the predicted value of y = x@a + b
  input:
  x: [array] the feature vector of shape (m,n)
  a: [array] weights of shape (n,)
  b: [scalar] bias

  output:
  y_pred: [array] predicted value of y of shape (m,)
  """

  m,n = x.shape
  y_pred = x@a+b    # matrix-vector product plus bias

  assert(y_pred.shape==(m,))
  return y_pred

In [None]:
def gradient(x,a,b,y_true):
  """
  This function should return the gradient of the loss
  input:
  x: [array] the feature vector of shape (m,n)
  a: [array] weights of shape (n,)
  b: [scalar] bias
  y_true: [array] ground truth of shape (m,)

  output:
  grad: [tuple] a tuple (derivative with respect to a[array of shape(n,)], derivative with respect to b[scalar])
  """
  m,n = x.shape
  yp = y(x,a,b)

  da = (2/m)*(x.T@(yp-y_true))      # derivative of the loss with respect to a
  db = (2/m)*np.sum((yp-y_true))    # derivative of the loss with respect to b

  assert(da.shape ==(n,))
  return (da,db)
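
One common way to check a hand-written gradient is to compare it against central finite differences. The sketch below does that on random toy data. The helper names `mse` and `analytic_gradient` are introduced here only to keep the example self-contained; they restate the same formulas as `loss()`, `y()` and `gradient()`.

```python
import numpy as np

def mse(x, a, b, y_true):
    # Mean squared error of the linear model x @ a + b
    return np.mean((x @ a + b - y_true) ** 2)

def analytic_gradient(x, a, b, y_true):
    # Same formulas as gradient(): (2/m) * x.T @ err and (2/m) * sum(err)
    m = x.shape[0]
    err = x @ a + b - y_true
    return (2 / m) * (x.T @ err), (2 / m) * np.sum(err)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))      # toy features
y_true = rng.normal(size=50)      # toy targets
a = rng.normal(size=3)
b = 0.5
eps = 1e-6

da, db = analytic_gradient(x, a, b, y_true)

# Central finite differences, perturbing one weight at a time
da_num = np.zeros_like(a)
for j in range(a.size):
    step = np.zeros_like(a)
    step[j] = eps
    da_num[j] = (mse(x, a + step, b, y_true) - mse(x, a - step, b, y_true)) / (2 * eps)
db_num = (mse(x, a, b + eps, y_true) - mse(x, a, b - eps, y_true)) / (2 * eps)

# Both differences should be tiny if the analytic gradient is right
print(np.max(np.abs(da - da_num)), abs(db - db_num))
```

If the two gradients disagree by much more than rounding error, the usual suspects are a missing factor of 2/m or a transposed matrix product.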

In [None]:
def gradient_descent(x,y_true,learning_rate=0.01,epochs = 10):
  """
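  This function performs gradient descent and minimizes the loss
  input:
  x: [array] the feature vector of shape (m,n)
  y_true: [array] ground truth of shape (m,)
  learning_rate: [scalar] step size used for each update
  epochs: [int] number of gradient descent steps

  output:
  loss_mse: [list] MSE loss after each epoch, of length epochs
  a: [array] learned weights of shape (n,)
  b: [scalar] learned bias
  """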
  m,n = x.shape
  loss_mse = []        # list that stores the loss after each epoch
  a = np.zeros(n)      # initialize weights a and bias b to zero
  b = 0

  for i in range(epochs):
    # calculate derivative using gradient() function
    da, db = gradient(x, a, b, y_true)
    # apply gradient descent now to update a and b
    a = a - learning_rate * da
    b = b - learning_rate * db

    l_mse = loss(y(x, a, b), y_true)    # calculate the loss after this update
    loss_mse.append(l_mse)

    print("Epoch ",i+1," Completed!","loss = ",l_mse)

  print("Training completed!!")

  assert(a.shape==(n,))

  return (loss_mse,a,b)

### **Training**

In [None]:
epochs = 500              # tweak this!!!
learn_rate = 0.02          # choose learning rate wisely otherwise loss may diverge!!

train_loss,a,b = gradient_descent(x_train, y_train, learning_rate=learn_rate, epochs=epochs)
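
To see why the learning rate matters, here is a minimal sketch on a synthetic one-feature problem (not the house data). The function `run_gd` and the values 0.1 and 1.5 are chosen for this illustration only; for this particular toy problem with a standardized feature, steps above roughly 1.0 overshoot. The threshold for the real dataset will differ.

```python
import numpy as np

def run_gd(x, y_true, learning_rate, epochs):
    # Plain gradient descent on MSE for y = a*x + b with a single feature
    a, b = 0.0, 0.0
    m = x.shape[0]
    losses = []
    for _ in range(epochs):
        err = a * x + b - y_true
        a -= learning_rate * (2 / m) * np.sum(err * x)
        b -= learning_rate * (2 / m) * np.sum(err)
        losses.append(np.mean((a * x + b - y_true) ** 2))
    return losses

rng = np.random.default_rng(0)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()                          # standardized feature
y_true = 3.0 * x + rng.normal(scale=0.1, size=200)    # synthetic targets

small = run_gd(x, y_true, learning_rate=0.1, epochs=20)
large = run_gd(x, y_true, learning_rate=1.5, epochs=20)

print("lr=0.1:", small[0], "->", small[-1])   # loss shrinks
print("lr=1.5:", large[0], "->", large[-1])   # loss grows
```

If the printed training loss increases from epoch to epoch, lowering the learning rate is usually the first thing to try.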

### **Evaluation and Visualization**
Let's compute the loss on the test data and plot how the training loss varies with epochs.


In [None]:
test_loss = loss(y(x_test, a, b), y_test)

print("Loss on test data = ",test_loss)

# Visualization of loss

plt.plot(train_loss)                   # plot loss versus epochs
plt.title("Training Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.show()
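
Because the price column was normalized too, the losses above are in normalized units. The sketch below shows one way to translate such a loss back into dollars, using made-up prices and predictions (all names and numbers here are hypothetical). In this notebook the price is column 1 of `data_array`, so `sd[1]` would play the role of `price_sd`.

```python
import numpy as np

# Hypothetical prices in dollars and hypothetical model predictions, for illustration only
prices = np.array([200000.0, 350000.0, 500000.0, 650000.0])
pred_prices = np.array([220000.0, 330000.0, 520000.0, 600000.0])

price_mean, price_sd = prices.mean(), prices.std()

# Normalized targets and predictions, as the model sees them
y_norm = (prices - price_mean) / price_sd
pred_norm = (pred_prices - price_mean) / price_sd

mse_norm = np.mean((pred_norm - y_norm) ** 2)

# Errors scale by price_sd, so the RMSE in dollars is sqrt(MSE) * price_sd
rmse_dollars = np.sqrt(mse_norm) * price_sd
print(mse_norm, rmse_dollars)
```

The mean cancels out of the error, so only the standard deviation is needed to rescale the loss; the mean matters only when converting individual predictions back to prices.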