# Linear Regression Assignment

In [28]:
# Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Data Preprocessing

### **Exploring the dataset**

Let's start by loading the training data from the CSV file into a pandas DataFrame.

In [7]:
df = pd.read_csv('train_processed_splitted.csv')

Let's see what the first 5 rows of this dataset look like.

In [10]:
df.head()

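| | LotArea | TotalBsmtSF | GrLivArea | GarageArea | PoolArea | OverallCond | Utilities | SalePrice |
|---|---|---|---|---|---|---|---|---|
| 0 | 11553 | 1051 | 1159 | 336 | 0 | 5 | AllPub | 158000 |
| 1 | 8400 | 1052 | 1052 | 288 | 0 | 5 | AllPub | 138500 |
| 2 | 8960 | 1008 | 1028 | 360 | 0 | 6 | AllPub | 115000 |
| 3 | 11100 | 0 | 930 | 308 | 0 | 7 | AllPub | 84900 |
| 4 | 15593 | 1304 | 2287 | 667 | 0 | 4 | AllPub | 225000 |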


What are all the features present? What is the range for each of the features along with their mean?

In [11]:
df.columns.values

array(['LotArea', 'TotalBsmtSF', 'GrLivArea', 'GarageArea', 'PoolArea',
       'OverallCond', 'Utilities', 'SalePrice'], dtype=object)
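
One way to answer the range and mean question is to aggregate each numeric column. The sketch below uses a tiny hypothetical frame (`toy_df`, not part of the assignment) built from a few of the values shown by `df.head()` so that it runs on its own; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame reusing a few values shown by df.head()
toy_df = pd.DataFrame({
    "LotArea": [11553, 8400, 8960, 11100, 15593],
    "GrLivArea": [1159, 1052, 1028, 930, 2287],
    "Utilities": ["AllPub"] * 5,
})

# Keep only the numeric columns, then compute min, max and mean per feature
numeric = toy_df.select_dtypes(include="number")
summary = numeric.agg(["min", "max", "mean"]).T
summary["range"] = summary["max"] - summary["min"]
print(summary)
```

Categorical columns such as `Utilities` are excluded here because a minimum, maximum or mean is not meaningful for them; `value_counts()` is the usual summary for those.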

### **Feature Scaling and One-Hot Encoding**

You must have noticed that some features (such as `Utilities`) are not continuous values.

These features contain values indicating different categories and must be converted to numbers before the model can use them, since linear regression operates on numbers and not on strings.

These features are called categorical features. We can represent them using a `One-Hot Representation`.


You must have also noticed that the other features are each on a different scale. This can be detrimental to the performance of our linear regression model, so we normalize them so that all of them lie in the range $[0,1]$.

> NOTE: When you do feature scaling, store the min/max values that you use for normalization. They are needed again at testing time. Try to think about why we do this.
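
As a minimal sketch of the note above, min-max scaling can be split into a step that computes the statistics on the training data and a step that applies stored statistics to any data. The helper names `fit_min_max` and `apply_min_max` and the numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def fit_min_max(X_train):
    """Return the per-column min and max computed on the training data only."""
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_min_max(X, col_min, col_max):
    """Scale X with previously stored statistics; constant columns map to 0."""
    span = col_max - col_min
    span = np.where(span == 0, 1.0, span)  # avoid division by zero
    return (X - col_min) / span

# Hypothetical values: two features for four training rows and one test row
X_train = np.array([[11553., 1159.], [8400., 1052.], [8960., 1028.], [15593., 2287.]])
X_test = np.array([[20000., 900.]])

col_min, col_max = fit_min_max(X_train)
X_train_scaled = apply_min_max(X_train, col_min, col_max)
X_test_scaled = apply_min_max(X_test, col_min, col_max)  # may fall outside [0, 1]
print(X_train_scaled)
print(X_test_scaled)
```

Because the test row is scaled with the training statistics, its values can land below 0 or above 1. That is expected: the model only ever saw features expressed in the training scale.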

In [29]:
# Do the one-hot encoding here
print(df['Utilities'].unique())
df['Utilities'].value_counts()
one_hot_encoded_data = pd.get_dummies(df, columns = ['Utilities'], dtype=float)
print(one_hot_encoded_data)

['AllPub' 'NoSeWa']
      LotArea  TotalBsmtSF  GrLivArea  GarageArea  PoolArea  OverallCond  \
0       11553         1051       1159         336         0            5   
1        8400         1052       1052         288         0            5   
2        8960         1008       1028         360         0            6   
3       11100            0        930         308         0            7   
4       15593         1304       2287         667         0            4   
...       ...          ...        ...         ...       ...          ...   
1309     9020         1127       1165         490         0            7   
1310    10793          780       1620         462         0            5   
1311     8885          864        902         484         0            5   
1312    11275          710       2978         564         0            7   
1313    10206            0        944         528         0            3   

      SalePrice  Utilities_AllPub  Utilities_NoSeWa  
0        1580

In [60]:
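# Do the feature scaling here
from sklearn.preprocessing import MinMaxScaler

# The fitted scaler keeps the training min/max (scaler.data_min_, scaler.data_max_),
# so the same object can later be applied to the test data with scaler.transform
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(one_hot_encoded_data)
scaled_df = pd.DataFrame(scaled_data, columns=one_hot_encoded_data.columns)
scaled_df.head()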

| | LotArea | TotalBsmtSF | GrLivArea | GarageArea | PoolArea | OverallCond | SalePrice | Utilities_AllPub | Utilities_NoSeWa |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.047924 | 0.172013 | 0.155426 | 0.236953 | 0.0 | 0.5 | 0.170948 | 1.0 | 0.0 |
| 1 | 0.033186 | 0.172177 | 0.135268 | 0.203103 | 0.0 | 0.5 | 0.143869 | 1.0 | 0.0 |
| 2 | 0.035804 | 0.164975 | 0.130746 | 0.253879 | 0.0 | 0.625 | 0.111235 | 1.0 | 0.0 |
| 3 | 0.045806 | 0.0 | 0.112283 | 0.217207 | 0.0 | 0.75 | 0.069435 | 1.0 | 0.0 |
| 4 | 0.066807 | 0.213421 | 0.367935 | 0.470381 | 0.0 | 0.375 | 0.263991 | 1.0 | 0.0 |


### **Conversion to NumPy**

Now that we have preprocessed all the data, we need to convert it to NumPy arrays for our linear regression model.

Assume that our dataset has a total of $N$ datapoints, each with $D$ features (after one-hot encoding). We want our NumPy array to be of shape $(N, D)$.

In our task, we have to predict the `SalePrice`. We will need 2 NumPy arrays $(X, Y)$, which represent the features and the targets respectively.

In [62]:
# Convert to numpy arrays: X holds the features, Y holds the target.
# The scaled frame is used so that every feature lies in [0, 1].
feature_cols = [c for c in scaled_df.columns if c != 'SalePrice']
X = scaled_df[feature_cols].to_numpy()    # shape (N, D)
Y = scaled_df[['SalePrice']].to_numpy()   # shape (N, 1)


## Linear Regression formulation
  
We now have our data in the form we need. Let's create a linear model to get our initial (really bad) predictions.


Let's say a single datapoint in our dataset consists of 3 features $(x_1, x_2, x_3)$. We can pose it as a linear equation as follows:
$$ y = w_1x_1 + w_2x_2 + w_3x_3 + b $$
Here we have to learn 4 parameters $(w_1, w_2, w_3, b)$.
  
  
Now how do we extend this to multiple datapoints?  
  
  
Try to answer the following:
- How many parameters will we have to learn in the case of our dataset? (Don't forget the bias term)
- Form a linear equation for our dataset. We need just a single matrix equation which correctly represents all the datapoints in our dataset
- Implement the linear equation as an equation using NumPy arrays (Start by randomly initializing the weights from a standard normal distribution)
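
As a hedged sketch for the second bullet, one common way to stack all $N$ datapoints is to collect the features in a matrix $X \in \mathbb{R}^{N \times D}$ and the weights in a column vector $W \in \mathbb{R}^{D \times 1}$. The symbols $\hat{Y}$ and $\mathbf{1}_N$ (a column of $N$ ones) are notation introduced here, not taken from the assignment.

$$ \hat{Y} = XW + b\,\mathbf{1}_N, \qquad \hat{Y} \in \mathbb{R}^{N \times 1} $$

Row $i$ of this equation reads $\hat{y}_i = w_1x_{i1} + \dots + w_Dx_{iD} + b$, which is the single-datapoint equation above applied to datapoint $i$. Under this formulation the weights and bias are shared by all datapoints, giving $D + 1$ parameters in total.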

In [63]:
# N data points, D features (after one-hot encoding)
N, D = X.shape

# Randomly initialize one weight per feature from a standard normal distribution.
# The weights are shared by all data points, so the shape is (D, 1), not (N, D).
W = np.random.randn(D, 1)

# Randomly initialize a single bias term shared by all data points
b = np.random.randn()

# Calculate predicted outputs for all data points at once:
# (N, D) @ (D, 1) + b -> (N, 1)
predictions = np.dot(X, W) + b

How well does our model perform? Try comparing our predictions with the actual values
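
A minimal sketch of such a comparison is shown below. It uses hypothetical synthetic arrays (`X_toy`, `Y_toy`) instead of the assignment data so that it runs on its own; the idea is to put actual and predicted values side by side and to summarise the mismatch in one number.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the scaled features and targets
N, D = 6, 3
X_toy = rng.random((N, D))
true_W = np.array([[0.5], [0.2], [0.1]])
Y_toy = X_toy @ true_W + 0.05  # targets generated from a known linear rule

# Untrained model: random weights and bias
W0 = rng.standard_normal((D, 1))
b0 = rng.standard_normal()
Y_hat = X_toy @ W0 + b0

# Side-by-side view (columns: actual, predicted) and a single summary number
comparison = np.hstack([Y_toy, Y_hat])
initial_mse = np.mean((Y_toy - Y_hat) ** 2)
print(comparison)
print("MSE with random weights:", initial_mse)
```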

### **Learning weights using gradient descent**

So these results are really horrible. We need to somehow update our weights so that they correctly represent our data. How do we do that?

We must do the following:
- We need some numerical indication for our performance, for this we define a Loss Function ( $\mathscr{L}$ )
- Find the gradients of the `Loss` with respect to the `Weights`
- Update the weights in accordance with the gradients: $W = W - \alpha\nabla_W \mathscr{L}$

Let's define the loss function:
- We will use the MSE loss since it is a regression task. (Specify the assumptions we make while doing so as taught in the class).
- Implement this loss as a function. (Use numpy as much as possible)

In [64]:
import numpy as np

def mse_loss_fn(y_true, y_pred):
    """
    Calculate the Mean Squared Error (MSE) loss.

    Parameters:
    - y_true: Array of true target values.
    - y_pred: Array of predicted values.

    Returns:
    - mse_loss: MSE loss value.
    """
    # Calculate the squared differences
    squared_diff = (y_true - y_pred) ** 2
    
    # Calculate the mean of squared differences
    mse_loss = np.mean(squared_diff)
    
    return mse_loss


Calculate the gradients of the loss with respect to the weights (and bias). First write the equations down on a piece of paper, then proceed to implement them.
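
For checking your own derivation, here is a sketch of the result under the assumption that the model is $\hat{Y} = XW + b$ with $X$ of shape $(N, D)$ and that the loss is $\mathscr{L} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$. Applying the chain rule to each squared term gives:

$$ \nabla_W \mathscr{L} = -\frac{2}{N} X^\top (Y - \hat{Y}), \qquad \frac{\partial \mathscr{L}}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i) $$

The factor $2$ comes from differentiating the square, and the minus sign from differentiating $-\hat{y}_i$ with respect to the parameters.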

In [67]:
def get_gradients(y_true, y_pred, W, b, X):
    """
    Calculates the gradients for the MSE loss function with respect to the weights (and bias)

    Args:
        y_true: The true values of the target variable
        y_pred: The predicted values of the target variable using our model (X @ W + b)
        W: The weights of the model
        b: The bias of the model
        X: The input features

    Returns:
        dW: The gradients of the loss function with respect to the weights
        db: The gradients of the loss function with respect to the bias
    """
    
    N = len(y_true)  # Number of data points
    
    # Calculate the gradient with respect to the weights
    dW = (-2/N) * np.dot(X.T, (y_true - y_pred))
    
    # Calculate the gradient with respect to the bias
    db = (-2/N) * np.sum(y_true - y_pred)
    
    return dW, db
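
A quick way to gain confidence in gradient formulas like these is to compare them against central finite differences. The sketch below is self-contained: `analytic_gradients` repeats the same formulas as `get_gradients`, and the data, the helper names and the step size `eps` are hypothetical choices for illustration.

```python
import numpy as np

def mse(W, b, X, y):
    """Mean squared error of the linear model X @ W + b."""
    return np.mean((y - (X @ W + b)) ** 2)

def analytic_gradients(W, b, X, y):
    # Same formulas as get_gradients, repeated so the sketch runs on its own
    residual = y - (X @ W + b)
    n = len(y)
    return (-2 / n) * X.T @ residual, (-2 / n) * np.sum(residual)

rng = np.random.default_rng(0)
X_toy = rng.random((20, 3))
y_toy = rng.random((20, 1))
W = rng.standard_normal((3, 1))
b = 0.3

dW, db = analytic_gradients(W, b, X_toy, y_toy)

# Central finite differences, perturbing one parameter at a time
eps = 1e-6
dW_numeric = np.zeros_like(W)
for j in range(W.shape[0]):
    step = np.zeros_like(W)
    step[j] = eps
    dW_numeric[j] = (mse(W + step, b, X_toy, y_toy) - mse(W - step, b, X_toy, y_toy)) / (2 * eps)
db_numeric = (mse(W, b + eps, X_toy, y_toy) - mse(W, b - eps, X_toy, y_toy)) / (2 * eps)

print("max |dW - dW_numeric|:", np.max(np.abs(dW - dW_numeric)))
print("|db - db_numeric|:", abs(db - db_numeric))
```

If both differences are close to zero, the analytic gradients agree with the numerical ones; a large gap usually points to a sign error or a missing factor.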



Update the weights using the gradients

In [66]:
def update(weights, bias, gradients_weights, gradients_bias, lr):
    """
    Updates the weights (and bias) using the gradients and the learning rate

    Args:
        weights: The current weights of the model
        bias: The current bias of the model
        gradients_weights: The gradients of the loss function with respect to the weights
        gradients_bias: The gradients of the loss function with respect to the bias
        lr: The learning rate

    Returns:
        weights_new: The updated weights of the model
        bias_new: The updated bias of the model
    """
    
    # Update the weights using the gradients and learning rate
    weights_new = weights - lr * gradients_weights
    
    # Update the bias using the gradients and learning rate
    bias_new = bias - lr * gradients_bias
    
    return weights_new, bias_new


Put all of these together in a loop that computes the loss, computes its gradients and finally updates the weights. Feel free to play around with different learning rates and numbers of epochs.

> NOTE: The code in comments is just meant to be used as a guide. You will have to make changes based on your own code.
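
To get a feel for the effect of the learning rate before running on the real data, the sketch below trains the same model with a few different values on hypothetical synthetic data. The function `train`, the data and the chosen learning rates are assumptions made for illustration only.

```python
import numpy as np

def train(X, y, lr, num_epochs, seed=0):
    """Plain batch gradient descent on the MSE loss; returns the loss per epoch."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], 1))
    b = 0.0
    losses = []
    for _ in range(num_epochs):
        residual = y - (X @ W + b)
        losses.append(np.mean(residual ** 2))
        W = W - lr * (-2 / len(y)) * X.T @ residual
        b = b - lr * (-2 / len(y)) * np.sum(residual)
    return losses

# Hypothetical synthetic data: features in [0, 1], noiseless linear targets
rng = np.random.default_rng(1)
X_toy = rng.random((200, 3))
y_toy = X_toy @ np.array([[0.4], [0.3], [0.2]]) + 0.1

results = {}
for lr in (1e-3, 2e-2, 2e-1):
    results[lr] = train(X_toy, y_toy, lr, num_epochs=200)
    print(f"lr={lr:g}: first loss {results[lr][0]:.4f}, last loss {results[lr][-1]:.6f}")
```

On this toy problem a very small learning rate makes slow progress within the same number of epochs, while a learning rate that is too large for the data would make the loss grow instead of shrink.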

In [71]:
NUM_EPOCHS = 10
LEARNING_RATE = 2e-2

# Initialize one weight per feature and a single bias
W = np.random.randn(X.shape[1], 1)
b = np.random.randn()

losses = []

for epoch in range(NUM_EPOCHS):
    # Forward pass: (N, D) @ (D, 1) + b -> (N, 1)
    y_pred = np.dot(X, W) + b

    # Calculate the Mean Squared Error (MSE) loss
    loss = mse_loss_fn(Y, y_pred)
    losses.append(loss)

    # Compute the gradients of the loss with respect to weights and bias
    gradient_weights, gradient_bias = get_gradients(Y, y_pred, W, b, X)

    # Update weights and bias using gradient descent
    W, b = update(W, b, gradient_weights, gradient_bias, LEARNING_RATE)

    # Print the loss at each epoch
    print(f"Epoch {epoch + 1}/{NUM_EPOCHS}: Loss = {loss:.4f}")

# After training, the weights (W) and bias (b) have been updated.
print("Trained Weights:", W)
print("Trained Bias:", b)

Now use matplotlib to plot the loss graph

In [68]:
# Use len(losses) so the plot does not depend on NUM_EPOCHS being defined
plt.plot(range(1, len(losses) + 1), losses)
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Loss Over Epochs")
plt.show()

### **Testing with test data**

Load and apply all the preprocessing steps used in the training data for the testing data as well. Remember to use the **SAME** min/max values which you used for the training set and not recalculate them from the test set. Also mention why we are doing this.

Using the weights learnt above, predict the values in the test dataset. Also answer the following questions:
- Are the predictions good?
- What is the MSE loss for the test set?
- Is the MSE loss for testing greater or lower than for training?
- Why is this the case?
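
The preprocessing described above can be sketched as follows. The miniature `train` and `test` frames are hypothetical (the values are made up, not read from the assignment files), and `W` and `b` are placeholders standing in for the trained parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames with a subset of the assignment columns
train = pd.DataFrame({"GrLivArea": [1159, 1052, 2287],
                      "Utilities": ["AllPub", "NoSeWa", "AllPub"],
                      "SalePrice": [158000, 138500, 225000]})
test = pd.DataFrame({"GrLivArea": [930, 1620],
                     "Utilities": ["AllPub", "AllPub"],
                     "SalePrice": [84900, 180000]})

# One-hot encode both frames, then force the test frame to have the training columns
train_enc = pd.get_dummies(train, columns=["Utilities"], dtype=float)
test_enc = pd.get_dummies(test, columns=["Utilities"], dtype=float)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)

# Min/max come from the training frame only and are reused for the test frame
train_min, train_max = train_enc.min(), train_enc.max()
span = (train_max - train_min).replace(0, 1.0)  # guard against constant columns
test_scaled = (test_enc - train_min) / span

feature_cols = [c for c in train_enc.columns if c != "SalePrice"]
X_test = test_scaled[feature_cols].to_numpy()
Y_test = test_scaled[["SalePrice"]].to_numpy()

# Placeholder parameters; in the notebook these would be the trained W and b
W = np.zeros((len(feature_cols), 1))
b = 0.5
test_mse = np.mean((Y_test - (X_test @ W + b)) ** 2)
print(test_enc.columns.tolist())
print("Test MSE:", test_mse)
```

The `reindex` call matters when a category seen during training (here `NoSeWa`) does not occur in the test set: without it the test matrix would have fewer columns than the weight vector expects.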