# Linear Regression

In [None]:
# Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Data Preprocessing

### **Exploring the dataset**

Let's start by loading the training data from the CSV file into a pandas DataFrame.



Load the datasets from GitHub. The training dataset has already been loaded for you into `df` below. To get the test dataset, use the commented code.

In [None]:
import pandas as pd # Import pandas before using it

df = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/train_processed_splitted.csv')


# 11553,1051,1159,336,0,5,AllPub,158000
# 8400,1052,1052,288,0,5,AllPub,138500
print("First few rows of the dataset:")
print(df.head())

#Select only numerical columns
numerical_df = df.select_dtypes(include=['number'])

# Compute the range (max - min) for each numerical feature
feature_ranges = numerical_df.max() - numerical_df.min()

# Compute the mean for each numerical feature
feature_means = numerical_df.mean()

# Print the results
print("\nFeature statistics:")
for feature in numerical_df.columns: # Iterate over numerical columns only
    print(f"Feature '{feature}':")
    print(f"  Range: {feature_ranges[feature]}")
    print(f"  Mean: {feature_means[feature]}")
    print()


First few rows of the dataset:
   LotArea  TotalBsmtSF  GrLivArea  GarageArea  PoolArea  OverallCond  \
0    11553         1051       1159         336         0            5   
1     8400         1052       1052         288         0            5   
2     8960         1008       1028         360         0            6   
3    11100            0        930         308         0            7   
4    15593         1304       2287         667         0            4   

  Utilities  SalePrice  
0    AllPub     158000  
1    AllPub     138500  
2    AllPub     115000  
3    AllPub      84900  
4    AllPub     225000  

Feature statistics:
Feature 'LotArea':
  Range: 213945
  Mean: 10622.104261796043

Feature 'TotalBsmtSF':
  Range: 6110
  Mean: 1058.3112633181127

Feature 'GrLivArea':
  Range: 5308
  Mean: 1512.9003044140031

Feature 'GarageArea':
  Range: 1418
  Mean: 473.48021308980213

Feature 'PoolArea':
  Range: 738
  Mean: 2.6430745814307457

Feature 'OverallCond':
  Range: 8
  Mean: 5.582191780821918

Feature 'SalePrice':
  Range: 720100
  Mean: 180795.50456621006

Let's see what the first 5 rows of this dataset look like.

In [None]:
df.head()

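|   | LotArea | TotalBsmtSF | GrLivArea | GarageArea | PoolArea | OverallCond | Utilities | SalePrice |
|---|---------|-------------|-----------|------------|----------|-------------|-----------|-----------|
| 0 | 11553   | 1051        | 1159      | 336        | 0        | 5           | AllPub    | 158000    |
| 1 | 8400    | 1052        | 1052      | 288        | 0        | 5           | AllPub    | 138500    |
| 2 | 8960    | 1008        | 1028      | 360        | 0        | 6           | AllPub    | 115000    |
| 3 | 11100   | 0           | 930       | 308        | 0        | 7           | AllPub    | 84900     |
| 4 | 15593   | 1304        | 2287      | 667        | 0        | 4           | AllPub    | 225000    |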


What are all the features present? What is the range for each of the features along with their mean?

In [None]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/train_processed_splitted.csv')

# Display all features present
print("All features present:\n", df.columns.tolist())

# Select only numerical columns
numerical_df = df.select_dtypes(include=['number'])

# Compute the range (max - min) for each numerical feature
feature_ranges = numerical_df.max() - numerical_df.min()

# Compute the mean for each numerical feature
feature_means = numerical_df.mean()

# Print the range and mean for each numerical feature
print("\nFeature statistics:")
for feature in numerical_df.columns:
    print(f"Feature '{feature}':")
    print(f"  Range: {feature_ranges[feature]}")
    print(f"  Mean: {feature_means[feature]}")
    print()

All features present:
 ['LotArea', 'TotalBsmtSF', 'GrLivArea', 'GarageArea', 'PoolArea', 'OverallCond', 'Utilities', 'SalePrice']

Feature statistics:
Feature 'LotArea':
  Range: 213945
  Mean: 10622.104261796043

Feature 'TotalBsmtSF':
  Range: 6110
  Mean: 1058.3112633181127

Feature 'GrLivArea':
  Range: 5308
  Mean: 1512.9003044140031

Feature 'GarageArea':
  Range: 1418
  Mean: 473.48021308980213

Feature 'PoolArea':
  Range: 738
  Mean: 2.6430745814307457

Feature 'OverallCond':
  Range: 8
  Mean: 5.582191780821918

Feature 'SalePrice':
  Range: 720100
  Mean: 180795.50456621006



### **Feature Scaling and One-Hot Encoding**

You must have noticed that some features (such as `Utilities`) do not take continuous values.
  
These features contain values indicating different categories and must be converted to numbers so that the model can work with them (models operate on numbers, not strings).
  
These features are called categorical features. We can represent them using a one-hot representation.
  
  
You must have also noticed that the other features are each on a different scale. This can be detrimental to the performance of our linear regression model, so we normalize them so that all of them lie in the range $[0,1]$.

> NOTE: When you are doing feature scaling, store the min/max values that you use for normalization. They are needed again at testing time. Try to think about why we do this.

In [None]:
# Do the one-hot encoding here
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/train_processed_splitted.csv')

# Identify categorical features
# Carefully check for typos and capitalization against the actual DataFrame
categorical_features = ['MSZoning', 'Street', 'Utilities', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle']

# Print the columns of your dataframe to verify the names and compare to your list
print(df.columns)

# Check if all categorical features are present in the Dataframe
missing_features = [feature for feature in categorical_features if feature not in df.columns]
if missing_features:
    print("Warning: The following features are not in the Dataframe:", missing_features)

# Apply one-hot encoding to categorical features that are present
present_categorical_features = [feature for feature in categorical_features if feature in df.columns]
df = pd.get_dummies(df, columns=present_categorical_features)

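# Select numerical columns. Note: recent pandas versions create boolean one-hot columns,
# which 'number' does not match; pass dtype=int to get_dummies to include them here.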
numerical_df = df.select_dtypes(include=['number'])

# Compute the range (max - min) for each numerical feature
feature_ranges = numerical_df.max() - numerical_df.min()

# Compute the mean for each numerical feature
feature_means = numerical_df.mean()

# Print the range and mean for each numerical feature
print("\nFeature statistics (including one-hot encoded features):")
for feature in numerical_df.columns:
    print(f"Feature '{feature}':")
    print(f"  Range: {feature_ranges[feature]}")
    print(f"  Mean: {feature_means[feature]}")
    print()

Index(['LotArea', 'TotalBsmtSF', 'GrLivArea', 'GarageArea', 'PoolArea',
       'OverallCond', 'Utilities', 'SalePrice'],
      dtype='object')

Feature statistics (including one-hot encoded features):
Feature 'LotArea':
  Range: 213945
  Mean: 10622.104261796043

Feature 'TotalBsmtSF':
  Range: 6110
  Mean: 1058.3112633181127

Feature 'GrLivArea':
  Range: 5308
  Mean: 1512.9003044140031

Feature 'GarageArea':
  Range: 1418
  Mean: 473.48021308980213

Feature 'PoolArea':
  Range: 738
  Mean: 2.6430745814307457

Feature 'OverallCond':
  Range: 8
  Mean: 5.582191780821918

Feature 'SalePrice':
  Range: 720100
  Mean: 180795.50456621006



In [None]:
# Do the feature scaling here
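
One possible way to fill in this cell is min-max scaling, sketched below on a toy frame built from the first rows printed earlier. The names `train_df`, `feature_cols`, `train_min`, `train_max` and `scaled_features` are assumptions made for illustration, and scaling only the input features (not `SalePrice`) is a choice of this sketch rather than a requirement of the exercise.

```python
import pandas as pd

# Toy stand-in for the training frame (values copied from the first rows shown earlier)
train_df = pd.DataFrame({
    "LotArea": [11553, 8400, 8960, 11100, 15593],
    "GrLivArea": [1159, 1052, 1028, 930, 2287],
    "SalePrice": [158000, 138500, 115000, 84900, 225000],
})

feature_cols = ["LotArea", "GrLivArea"]  # hypothetical choice: scale the inputs only

# Keep the training min/max so the same values can be reused at test time
train_min = train_df[feature_cols].min()
train_max = train_df[feature_cols].max()

# Guard against division by zero for columns that are constant in the training data
train_range = (train_max - train_min).replace(0, 1)

# Min-max scaling: every training value ends up in [0, 1]
scaled_features = (train_df[feature_cols] - train_min) / train_range
print(scaled_features)
```

Keeping `train_min` and `train_max` as separate variables is what makes it possible to apply exactly the same transformation to the test set later.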


### **Conversion to NumPy**

Now that we have preprocessed all the data, we need to convert it to NumPy arrays for our linear regression model.
  
Assume that our dataset has a total of $N$ datapoints, each having a total of $D$ features (after one-hot encoding). We want our NumPy array to be of shape $(N, D)$.

In our task, we have to predict the `SalePrice`. We will need 2 NumPy arrays $(X, Y)$, which represent the features and targets respectively.

In [None]:
# Convert to numpy array
import pandas as pd
import numpy as np

# ... (Your existing code for one-hot encoding)


# Sample DataFrame (replace this with how you actually create your DataFrame)
numerical_df = pd.DataFrame({
    'feature1': [1, 2, 3],
    'feature2': [4, 5, 6],
    'SalePrice': [7, 8, 9]
})


# Extract the features (X) and target variable (Y)
X = numerical_df.drop('SalePrice', axis=1).values  # Assuming 'SalePrice' is your target
Y = numerical_df['SalePrice'].values

print("Shape of feature matrix X:", X.shape)  # Verify the shape (N, D)
print("Shape of target vector Y:", Y.shape)  # Verify the shape (N,)

Shape of feature matrix X: (3, 2)
Shape of target vector Y: (3,)


## Linear Regression formulation
  
We now have our data in the form we need. Let's try to create a linear model to get our initial (really bad) prediction.


Let's say a single datapoint in our dataset consists of 3 features $(x_1, x_2, x_3)$, we can pose it as a linear equation as follows:
$$ y = w_1x_1 + w_2x_2 + w_3x_3 + b $$
Here we have to learn 4 parameters $(w_1, w_2, w_3, b)$
  
  
Now how do we extend this to multiple datapoints?  
  
  
Try to answer the following:
- How many parameters will we have to learn in the case of our dataset? (Don't forget the bias term)
- Form a linear equation for our dataset. We need just a single matrix equation which correctly represents all the datapoints in our dataset
- Implement this linear equation using NumPy arrays (start by randomly initializing the weights from a standard normal distribution)
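
As a hedged sketch of one way to write this down (the symbols $\hat{Y}$, $N$ and $D$ follow the notation used in the NumPy conversion section, and treating $b$ as a single scalar added to every row is an assumption of this sketch): with $D$ features there are $D + 1$ parameters to learn, and all datapoints can be stacked into one matrix equation.

$$ \hat{Y} = XW + b, \qquad X \in \mathbb{R}^{N \times D},\; W \in \mathbb{R}^{D \times 1},\; \hat{Y} \in \mathbb{R}^{N \times 1} $$

Row $i$ of this equation reads $\hat{y}_i = w_1x_{i1} + w_2x_{i2} + \dots + w_Dx_{iD} + b$, which is the single-datapoint equation above applied to the $i$-th datapoint.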

In [None]:
import numpy as np

# Set random seed for reproducibility
np.random.seed(0)

# Define the dimensions
n_samples = 100  # number of data points
n_features = 5   # number of features
n_outputs = 1    # number of outputs

# Initialize the input matrix X (n_samples x n_features)
X = np.random.rand(n_samples, n_features)

# Initialize the weights matrix W (n_features x n_outputs)
W = np.random.randn(n_features, n_outputs)  # standard normal distribution

# Initialize the bias vector b (n_outputs)
b = np.random.randn(n_outputs)  # standard normal distribution

# Compute the output matrix Y
Y = np.dot(X, W) + b

# Print the shapes to verify
print("Shape of X:", X.shape)
print("Shape of W:", W.shape)
print("Shape of b:", b.shape)
print("Shape of Y:", Y.shape)

# Print the first few values to check the results
print("First few rows of X:\n", X[:5])
print("Weights W:\n", W)
print("Bias b:\n", b)
print("First few rows of Y:\n", Y[:5])


Shape of X: (100, 5)
Shape of W: (5, 1)
Shape of b: (1,)
Shape of Y: (100, 1)
First few rows of X:
 [[0.86231855 0.0486903  0.25364252 0.44613551 0.10462789]
 [0.34847599 0.74009753 0.68051448 0.62238443 0.7105284 ]
 [0.20492369 0.34169811 0.67624248 0.87923476 0.54367805]
 [0.28269965 0.03023526 0.71033683 0.0078841  0.37267907]
 [0.53053721 0.92211146 0.08949455 0.40594232 0.0243132 ]]
Weights W:
 [[-0.49901664]
 [ 0.02135122]
 [-0.91911344]
 [ 0.19275385]
 [-0.36505522]]
Bias b:
 [-1.79132755]
First few rows of Y:
 [[-2.40592613]
 [-2.71430599]
 [-2.53683239]
 [-2.71916269]
 [-2.04927059]]


How well does our model perform? Try comparing our predictions with the actual values

### **Learning weights using gradient descent**

So these results are really horrible. We need to somehow update our weights so that they correctly represent our data. How do we do that?

We must do the following:
- We need some numerical indication for our performance, for this we define a Loss Function ( $\mathscr{L}$ )
- Find the gradients of the `Loss` with respect to the `Weights`
- Update the weights according to the gradients: $W = W - \alpha\nabla_W \mathscr{L}$ (and likewise for the bias $b$)

Let's define the loss function:
- We will use the MSE loss since it is a regression task. (Specify the assumptions we make while doing so, as taught in class.)
- Implement this loss as a function. (Use numpy as much as possible)

In [None]:
def mse_loss_fn(y_true, y_pred):
    # TODO: Implement the MSE loss function here
    pass
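
A minimal sketch of what such a loss could look like is given below. The name `mse_loss_sketch` is hypothetical (chosen so that it does not clash with the stub above), and flattening both inputs is an assumption of this sketch: it avoids accidental broadcasting when one array has shape `(N,)` and the other `(N, 1)`.

```python
import numpy as np

def mse_loss_sketch(y_true, y_pred):
    """Mean of the squared differences between predictions and targets."""
    # Flatten both inputs so that shapes (N,) and (N, 1) are treated alike
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    return np.mean((y_pred - y_true) ** 2)

# Only the last prediction is off (by 2), so the loss is 2**2 / 3
print(mse_loss_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```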

Calculate the gradients of the loss with respect to the weights (and bias). First write the equations down on a piece of paper, then proceed to implement them.

In [None]:
def get_gradients(y_true, y_pred, W, b, X):
    """
    Calculates the gradients for the MSE loss function with respect to the weights (and bias)

    Args:
        y_true: The true values of the target variable (SalePrice in our case)
        y_pred: The predicted values of the target variable using our model (W*X + b)

        W: The weights of the model
        b: The bias of the model
        X: The input features

    Returns:
        dW: The gradients of the loss function with respect to the weights
        db: The gradients of the loss function with respect to the bias
    """

    # TODO: Implement the gradient calculations here
    pass
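
For reference, here is a sketch of the derivation, assuming the loss is the plain mean of squared errors and the model is $\hat{Y} = XW + b$ with a scalar bias (both are assumptions of this sketch; adapt it if your formulation differs):

$$ \mathscr{L} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2, \qquad \hat{Y} = XW + b $$

$$ \nabla_W \mathscr{L} = \frac{2}{N}X^\top(\hat{Y} - Y), \qquad \frac{\partial \mathscr{L}}{\partial b} = \frac{2}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i) $$

Some implementations drop the factor of 2 (equivalently, they minimise half the MSE). This only rescales the gradients and can be absorbed into the learning rate $\alpha$.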

Update the weights using the gradients

In [None]:
import numpy as np

def calculate_gradients(X, y_true, y_pred):
    """
    Calculates the gradients of the loss with respect to weights and bias.

    Args:
        X: Input data (shape: [number of samples, number of features])
        y_true: True target values (shape: [number of samples])
        y_pred: Predicted target values (shape: [number of samples])

    Returns:
        A tuple containing (gradient_w, gradient_b)
    """

    n_samples = X.shape[0]  # Number of data points

    # Calculate gradients of the MSE loss, mean((y_pred - y_true)**2)
    gradient_w = (2/n_samples) * np.dot(X.T, (y_pred - y_true))
    gradient_b = (2/n_samples) * np.sum(y_pred - y_true)

    return gradient_w, gradient_b


In [None]:
def update(weights, bias, gradients_weights, gradients_bias, lr):
    """
    Updates the weights (and bias) using the gradients and the learning rate

    Args:
        weights: The current weights of the model
        bias: The current bias of the model

        gradients_weights: The gradients of the loss function with respect to the weights
        gradients_bias: The gradients of the loss function with respect to the bias

        lr: The learning rate

    Returns:
        weights_new: The updated weights of the model
        bias_new: The updated bias of the model
    """

    # TODO Implement the update step here

    pass
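
A minimal sketch of the update step is shown below. The name `update_sketch` and the toy values are hypothetical, and returning the updated bias together with the updated weights is an assumption of this sketch.

```python
import numpy as np

def update_sketch(weights, bias, gradients_weights, gradients_bias, lr):
    """One gradient descent step: move against the gradient, scaled by lr."""
    weights_new = weights - lr * gradients_weights
    bias_new = bias - lr * gradients_bias
    return weights_new, bias_new

# Toy values, only to show the call and the shapes involved
W = np.array([[1.0], [2.0]])
b = np.array([0.5])
dW = np.array([[0.5], [-1.0]])
db = np.array([1.0])

W_new, b_new = update_sketch(W, b, dW, db, lr=0.1)
print(W_new, b_new)
```

Returning new arrays instead of modifying `weights` in place keeps the function free of side effects, at the cost of having to reassign the result in the training loop.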

Put all of these together in a loop: compute the loss value, compute its gradients, and finally update the weights. Feel free to play around with different learning rates and numbers of epochs.
  
> NOTE: The code in the comments is just meant to be used as a guide. You will have to make changes based on your code.

In [None]:
NUM_EPOCHS = 10
LEARNING_RATE = 2e-2

losses = []

for epoch in range(NUM_EPOCHS):
    # y_pred =
    # loss = mse_loss_fn(y_true, y_pred)
    # losses.append(loss)
    # gradients_weights, gradients_bias = get_gradients(y_true, y_pred, W, b, X)
    # W, b = update(W, b, gradients_weights, gradients_bias, LEARNING_RATE)
    pass  # placeholder so the loop body is valid until the lines above are filled in
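
To show how the pieces fit together, here is a self-contained sketch of the whole loop on synthetic data. Everything in it is an assumption made for illustration: the data is random rather than the housing dataset, `true_W` and the bias value 0.3 are hypothetical ground-truth parameters, and the loss and gradients are written inline instead of calling the functions above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with features already in [0, 1]
N, D = 200, 3
X = rng.random((N, D))
true_W = np.array([[2.0], [-1.0], [0.5]])  # hypothetical ground-truth weights
Y = X @ true_W + 0.3                       # noiseless targets, shape (N, 1)

# Random initial weights from a standard normal distribution, zero bias
W = rng.standard_normal((D, 1))
b = np.zeros(1)

NUM_EPOCHS = 500
LEARNING_RATE = 0.1
losses = []

for epoch in range(NUM_EPOCHS):
    y_pred = X @ W + b                      # forward pass
    error = y_pred - Y
    losses.append(float(np.mean(error ** 2)))  # MSE loss for this epoch

    dW = (2.0 / N) * (X.T @ error)          # gradient w.r.t. the weights
    db = (2.0 / N) * np.sum(error)          # gradient w.r.t. the bias

    W = W - LEARNING_RATE * dW              # gradient descent step
    b = b - LEARNING_RATE * db

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

The learning rate here works because the features lie in $[0,1]$. On unscaled features the same value may make the loss diverge, which is one reason for the feature scaling step earlier.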


Now use matplotlib to plot the loss graph
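
A minimal plotting sketch follows. The `losses` list here is a placeholder curve generated only so that the snippet runs on its own; replace it with the list filled in by your training loop.

```python
import matplotlib.pyplot as plt

# Placeholder curve; replace with the `losses` list from the training loop
losses = [1.0 / (epoch + 1) for epoch in range(10)]

fig, ax = plt.subplots()
ax.plot(range(1, len(losses) + 1), losses, marker="o")
ax.set_xlabel("Epoch")
ax.set_ylabel("MSE loss")
ax.set_title("Training loss")
plt.show()
```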

### **Testing with test data**

Load and apply all the preprocessing steps used in the training data for the testing data as well. Remember to use the **SAME** min/max values which you used for the training set and not recalculate them from the test set. Also mention why we are doing this.

To load test data from GitHub, use the code below.


In [None]:
#testdf = pd.read_csv('https://raw.githubusercontent.com/cronan03/DevSoc_AI-ML/main/test_processed_splitted.csv')

In [None]:
import pandas as pd

# Your data snippet (let's put it into a DataFrame)
data = {'LotArea': [15623, 11952],
        'TotalBsmtSF': [2396, 808],
        'GrLivArea': [4476, 1969],
        'GarageArea': [813, 534],
        'PoolArea': [555, 0],
        'OverallCond': [5, 6],
        'Utilities': ['AllPub', 'AllPub'],
        'SalePrice': [745000, 190000]}

train_data = pd.DataFrame(data)

# Calculate minimum and maximum values (excluding non-numerical columns)
numerical_columns = ['LotArea', 'TotalBsmtSF', 'GrLivArea', 'GarageArea', 'PoolArea', 'OverallCond', 'SalePrice']
train_min_values = train_data[numerical_columns].min()
train_max_values = train_data[numerical_columns].max()

print("Minimum Values:\n", train_min_values)
print("\nMaximum Values:\n", train_max_values)

Minimum Values:
 LotArea         11952
TotalBsmtSF       808
GrLivArea        1969
GarageArea        534
PoolArea            0
OverallCond         5
SalePrice      190000
dtype: int64

Maximum Values:
 LotArea         15623
TotalBsmtSF      2396
GrLivArea        4476
GarageArea        813
PoolArea          555
OverallCond         6
SalePrice      745000
dtype: int64
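
The stored training minimum and maximum can then be applied to the test frame. The sketch below uses a single column with toy values (the `train` values come from the first rows printed earlier and the `test` values from the snippet above). The frame and variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the training and test frames
train = pd.DataFrame({"GrLivArea": [930, 1052, 2287]})
test = pd.DataFrame({"GrLivArea": [1969, 4476]})

# Statistics come from the training data only
train_min = train.min()
train_max = train.max()

# Apply the training statistics to the test data; do not recompute them on `test`
test_scaled = (test - train_min) / (train_max - train_min)
print(test_scaled)
```

Because the statistics come from the training set, scaled test values are not guaranteed to stay inside $[0,1]$: the second test value here is larger than the training maximum, so it maps to a number above 1.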


Using the weights learnt above, predict the values in the test dataset. Also answer the following questions:
- Are the predictions good?
- What is the MSE loss for the test set?
- Is the MSE loss for testing greater or lower than for training?
- Why is this the case?