# Linear Regression From Scratch

**Dataset:** Housing Prices Dataset (Kaggle)  
**Source:** https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

**Objective:**  
To understand how linear regression works internally by implementing it from
scratch using only basic Python, NumPy, and Pandas without using high-level
machine learning libraries such as sklearn or statsmodels.

**Instructions to Run:**  
Download `Housing.csv` from the above Kaggle link and place it in the same
directory as this notebook before running the cells.


## Part 1 – Data Loading and Preprocessing

This section focuses on preparing the Housing Prices dataset for machine
learning. It includes loading the data, understanding its structure,
handling categorical variables, splitting the dataset into training and
testing sets, and applying feature scaling while avoiding information leakage.


In [19]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("Housing.csv")

# Basic inspection
df.head()


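|   | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | yes | no | no | no | yes | 2 | yes | furnished |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | yes | no | no | no | yes | 3 | no | furnished |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | yes | no | yes | no | no | 2 | yes | semi-furnished |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | yes | no | yes | no | yes | 3 | yes | furnished |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | yes | yes | yes | no | yes | 2 | no | furnished |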


In [20]:
# Dataset structure and info
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   price             545 non-null    int64 
 1   area              545 non-null    int64 
 2   bedrooms          545 non-null    int64 
 3   bathrooms         545 non-null    int64 
 4   stories           545 non-null    int64 
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64 
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB


In [21]:
# Statistical summary
df.describe()

|   | price | area | bedrooms | bathrooms | stories | parking |
|---|---|---|---|---|---|---|
| count | 545.0 | 545.0 | 545.0 | 545.0 | 545.0 | 545.0 |
| mean | 4766729.0 | 5150.541284 | 2.965138 | 1.286239 | 1.805505 | 0.693578 |
| std | 1870440.0 | 2170.141023 | 0.738064 | 0.50247 | 0.867492 | 0.861586 |
| min | 1750000.0 | 1650.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 3430000.0 | 3600.0 | 2.0 | 1.0 | 1.0 | 0.0 |
| 50% | 4340000.0 | 4600.0 | 3.0 | 1.0 | 2.0 | 0.0 |
| 75% | 5740000.0 | 6360.0 | 3.0 | 2.0 | 2.0 | 1.0 |
| max | 13300000.0 | 16200.0 | 6.0 | 4.0 | 4.0 | 3.0 |


### Dataset Understanding

The Housing Prices dataset contains information about houses and their selling prices.
The target variable for prediction is **price**.

The dataset includes:
- **Numerical features** such as area, number of bedrooms, bathrooms, stories, and parking.
- **Categorical features** such as mainroad, guestroom, basement, hotwaterheating,
  airconditioning, prefarea, and furnishingstatus, which are represented as text values.

Since machine learning models require numerical input, categorical variables must be
converted into numerical form during preprocessing.


### Handling Categorical Variables

Seven features are stored as text: six binary "yes"/"no" columns and
`furnishingstatus`, which has three levels (furnished, semi-furnished, and
unfurnished).

One-hot encoding transforms each categorical feature into binary indicator
columns. To avoid perfect multicollinearity (the dummy variable trap), one
category from each categorical feature is dropped, so the dropped category
acts as the baseline for that feature.
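
To make the effect of dropping one category concrete, the following minimal sketch encodes a toy frame with and without `drop_first=True`. The three rows are made-up values for illustration only; just the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names come from the dataset
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})
cols = ["mainroad", "furnishingstatus"]

# Full encoding: one indicator column per category
full = pd.get_dummies(toy, columns=cols)

# Reduced encoding: the first category (in sorted order) of each feature is dropped
reduced = pd.get_dummies(toy, columns=cols, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the full encoding, the indicator columns of each feature always sum to 1 in every row, which duplicates the information carried by an intercept term. The reduced encoding removes that redundancy.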


In [22]:
# Identify categorical columns
categorical_cols = [
    'mainroad',
    'guestroom',
    'basement',
    'hotwaterheating',
    'airconditioning',
    'prefarea',
    'furnishingstatus'
]

# Apply one-hot encoding
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Preview the encoded dataset
df_encoded.head()


|   | price | area | bedrooms | bathrooms | stories | parking | mainroad_yes | guestroom_yes | basement_yes | hotwaterheating_yes | airconditioning_yes | prefarea_yes | furnishingstatus_semi-furnished | furnishingstatus_unfurnished |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13300000 | 7420 | 4 | 2 | 3 | 2 | True | False | False | False | True | True | False | False |
| 1 | 12250000 | 8960 | 4 | 4 | 4 | 3 | True | False | False | False | True | False | False | False |
| 2 | 12250000 | 9960 | 3 | 2 | 2 | 2 | True | False | True | False | False | True | True | False |
| 3 | 12215000 | 7500 | 4 | 2 | 2 | 3 | True | False | True | False | True | True | False | False |
| 4 | 11410000 | 7420 | 4 | 1 | 2 | 2 | True | True | True | False | True | False | False | False |


In [23]:
# Verify that no text (object) columns remain; encoded columns are bool
df_encoded.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   price                            545 non-null    int64
 1   area                             545 non-null    int64
 2   bedrooms                         545 non-null    int64
 3   bathrooms                        545 non-null    int64
 4   stories                          545 non-null    int64
 5   parking                          545 non-null    int64
 6   mainroad_yes                     545 non-null    bool 
 7   guestroom_yes                    545 non-null    bool 
 8   basement_yes                     545 non-null    bool 
 9   hotwaterheating_yes              545 non-null    bool 
 10  airconditioning_yes              545 non-null    bool 
 11  prefarea_yes                     545 non-null    bool 
 12  furnishingstatus_semi-furnished  545 non-null    bool 
 13  furnishingstatus_unfurnished     545 non-null    bool 
dtypes: bool(8), int64(6)

### Separating Features and Target Variable

For training a supervised machine learning model, the dataset must be divided
into input features and the target variable.

- **Features (X)** include all independent variables used for prediction.
- **Target (y)** is the dependent variable that the model aims to predict.

In this dataset, **price** is the target variable, and all remaining columns
are treated as input features.


In [24]:
# Separate features and target
X = df_encoded.drop('price', axis=1).values
y = df_encoded['price'].values

# Check shapes
print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)


Feature matrix shape: (545, 13)
Target vector shape: (545,)


### Train–Test Split

To evaluate how well the model generalizes to unseen data, the dataset is split
into training and testing sets.

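The training set is used to learn the model parameters, while the test set is
used only for evaluation. The rows are shuffled before splitting: the preview
above suggests the file is ordered by descending price, so an unshuffled split
would leave the test set with only the cheapest houses.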

This split is performed **before feature scaling** to avoid information leakage
from the test set into the training process.
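
The shuffle-then-slice procedure can also be wrapped in a small helper, sketched below. `shuffle_split` is a hypothetical name, and the sketch uses `np.random.default_rng` rather than the global seed, so it would not reproduce the exact rows selected by the notebook's cell. The arrays are synthetic stand-ins with the same shape as the encoded data.

```python
import numpy as np

def shuffle_split(X, y, train_fraction=0.8, seed=42):
    """Shuffle row indices once, then slice them into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    train_size = int(train_fraction * X.shape[0])
    train_idx, test_idx = indices[:train_size], indices[train_size:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic stand-in: 545 rows and 13 features, like the encoded housing data
X_demo = np.arange(545 * 13, dtype=float).reshape(545, 13)
y_demo = np.arange(545, dtype=float)

X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

Indexing `X` and `y` with the same index arrays keeps each feature row paired with its target, and slicing a single permutation guarantees that no row lands in both sets.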


In [25]:
# Set random seed for reproducibility
np.random.seed(42)

# Create shuffled indices
indices = np.arange(X.shape[0])
np.random.shuffle(indices)

# Define split ratio
train_size = int(0.8 * X.shape[0])

# Split indices
train_indices = indices[:train_size]
test_indices = indices[train_size:]

# Create train-test split
X_train = X[train_indices]
X_test = X[test_indices]
y_train = y[train_indices]
y_test = y[test_indices]

# Verify shapes
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (436, 13)
X_test shape: (109, 13)
y_train shape: (436,)
y_test shape: (109,)


### Feature Scaling

Feature scaling is applied so that all input features are on a comparable
scale. Gradient descent is sensitive to feature scales: `area` takes values in
the thousands while the encoded indicators are 0 or 1, so without scaling a
single learning rate cannot suit every weight and convergence becomes slow or
unstable.

In this implementation, **standardization** is used, where each feature is
transformed using the formula:

x' = (x − μ) / σ

The mean (μ) and standard deviation (σ) are computed **only on the training
data** and then applied to both training and test sets to avoid information
leakage.
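
A minimal sketch of this fit-on-train, apply-to-both pattern follows. The helper names `fit_standardizer` and `standardize` are hypothetical, and the data is randomly generated rather than taken from the housing set.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and std from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # constant columns: avoid dividing by zero
    return mean, std

def standardize(X, mean, std):
    """Apply previously fitted statistics to any split."""
    return (X - mean) / std

# Synthetic data standing in for the real train and test splits
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5000, scale=2000, size=(400, 2))
X_test_demo = rng.normal(loc=5000, scale=2000, size=(100, 2))

mean_demo, std_demo = fit_standardizer(X_train_demo)
train_scaled = standardize(X_train_demo, mean_demo, std_demo)
test_scaled = standardize(X_test_demo, mean_demo, std_demo)

print(train_scaled.mean(axis=0), test_scaled.mean(axis=0))
```

The training columns come out with mean 0 and standard deviation 1 by construction. The test columns are only approximately centered, which is expected: their statistics were never used.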


In [26]:
# Convert feature matrices to float type
X_train = X_train.astype(float)
X_test = X_test.astype(float)

# Compute mean and standard deviation from training data
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Avoid division by zero (safety step)
std[std == 0] = 1

# Standardize training and test data
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

# Verify scaling
print("Training data mean (approx):", np.round(X_train_scaled.mean(axis=0), 2))
print("Training data std (approx):", np.round(X_train_scaled.std(axis=0), 2))


Training data mean (approx): [ 0. -0.  0.  0. -0.  0. -0.  0.  0. -0.  0. -0. -0.]
Training data std (approx): [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


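The explicit `astype(float)` conversion is needed because the encoded frame
mixes integer and boolean columns, so `.values` yields an `object`-dtype array
rather than a numeric one. Casting to float turns `True`/`False` into 1.0/0.0
and gives standardization a purely numerical matrix to work with.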


## Part 2 – Linear Regression Model

Linear regression models the relationship between input features and a target
variable by fitting a linear equation to the data.

The model is defined as:

ŷ = Xw + b

where:
- X is the input feature matrix
- w is the vector of model weights (coefficients)
- b is the bias (intercept)
- ŷ is the predicted output
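
As a small worked example of this equation, the sketch below evaluates ŷ = Xw + b for two samples with two features. All numbers are arbitrary and chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

# Two samples (rows), two features (columns); values are arbitrary
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
w_demo = np.array([0.5, -1.0])
b_demo = 2.0

# Row 1: 0.5*1 + (-1)*2 + 2 =  0.5
# Row 2: 0.5*3 + (-1)*4 + 2 = -0.5
y_hat = X_demo @ w_demo + b_demo
print(y_hat)  # [ 0.5 -0.5]
```

The matrix product yields one prediction per row of X, and the scalar bias is broadcast across all of them.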


In [27]:
def predict(X, weights, bias):
    """
    Predict target values using linear regression model.

    Parameters:
    X : numpy array of shape (n_samples, n_features)
    weights : numpy array of shape (n_features,)
    bias : float

    Returns:
    y_pred : numpy array of predicted values
    """
    return np.dot(X, weights) + bias


In [28]:
# Initialize weights and bias for testing
test_weights = np.zeros(X_train_scaled.shape[1])
test_bias = 0

# Make predictions
test_preds = predict(X_train_scaled, test_weights, test_bias)

print("Sample predictions:", test_preds[:5])


Sample predictions: [0. 0. 0. 0. 0.]


## Part 3 – Training Algorithm (Gradient Descent)

To train the linear regression model, a loss function is defined to measure the
difference between predicted values and actual target values.

The **Mean Squared Error (MSE)** is used as the loss function:

MSE = (1 / n) * Σ (y − ŷ)²

Gradient descent is an optimization algorithm that iteratively updates the
model parameters (weights and bias) in order to minimize the loss function.
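
The gradient expressions used in the training function follow from differentiating the MSE with respect to each parameter. The derivation is sketched below; the symbol α for the learning rate and the per-sample index i are notation introduced here, not taken from the notebook.

```latex
\begin{aligned}
\hat{y}_i &= \mathbf{x}_i^\top \mathbf{w} + b \\
\mathrm{MSE} &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\frac{\partial\,\mathrm{MSE}}{\partial \mathbf{w}}
  &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)\,\mathbf{x}_i
   = -\frac{2}{n}\, X^\top (\mathbf{y} - \hat{\mathbf{y}}) \\
\frac{\partial\,\mathrm{MSE}}{\partial b}
  &= -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) \\
\mathbf{w} &\leftarrow \mathbf{w} - \alpha\, \frac{\partial\,\mathrm{MSE}}{\partial \mathbf{w}},
\qquad
b \leftarrow b - \alpha\, \frac{\partial\,\mathrm{MSE}}{\partial b}
\end{aligned}
```

The minus sign comes from the chain rule, since ŷ enters the residual (y − ŷ) with a negative sign.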


In [29]:
def mean_squared_error(y_true, y_pred):
    """
    Compute Mean Squared Error (MSE).

    Parameters:
    y_true : numpy array of true values
    y_pred : numpy array of predicted values

    Returns:
    mse : float
    """
    return np.mean((y_true - y_pred) ** 2)


In [30]:
def gradient_descent(X, y, learning_rate=0.01, epochs=1000):
    """
    Train linear regression model using gradient descent.

    Parameters:
    X : numpy array of shape (n_samples, n_features)
    y : numpy array of shape (n_samples,)
    learning_rate : float
    epochs : int

    Returns:
    weights : numpy array of shape (n_features,)
    bias : float
    """
    n_samples, n_features = X.shape

    # Initialize parameters
    weights = np.zeros(n_features)
    bias = 0

    # Gradient descent loop
    for _ in range(epochs):
        # Predictions
        y_pred = predict(X, weights, bias)

        # Compute gradients
        dw = (-2 / n_samples) * np.dot(X.T, (y - y_pred))
        db = (-2 / n_samples) * np.sum(y - y_pred)

        # Update parameters
        weights -= learning_rate * dw
        bias -= learning_rate * db

    return weights, bias
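
One possible extension, sketched here rather than taken from the notebook, is to record the loss at every epoch so that convergence can be inspected. The function name `gradient_descent_with_history` and the synthetic data are assumptions for illustration; the update rule is the same as above.

```python
import numpy as np

def gradient_descent_with_history(X, y, learning_rate=0.01, epochs=1000):
    """Same update rule as above, but also returns the MSE at each epoch."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    history = []

    for _ in range(epochs):
        error = y - (X @ weights + bias)
        history.append(np.mean(error ** 2))  # loss before this epoch's update

        dw = (-2 / n_samples) * (X.T @ error)
        db = (-2 / n_samples) * np.sum(error)
        weights -= learning_rate * dw
        bias -= learning_rate * db

    return weights, bias, history

# Synthetic data with known coefficients (not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

w_demo, b_demo, loss_history = gradient_descent_with_history(X_demo, y_demo)
print(loss_history[0], loss_history[-1])
```

A loss curve that flattens suggests the chosen learning rate and epoch count are sufficient; a curve that grows indicates the learning rate is too large.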


### Model Training

The linear regression model is trained using gradient descent on the
standardized training data. During training, the model iteratively updates
its weights and bias to minimize the Mean Squared Error loss.


In [31]:
# Train the linear regression model
learning_rate = 0.01
epochs = 1000

weights, bias = gradient_descent(
    X_train_scaled,
    y_train,
    learning_rate=learning_rate,
    epochs=epochs
)

print("Training completed.")
print("Bias:", bias)
print("First 5 weights:", weights[:5])


Training completed.
Bias: 4788865.680013891
First 5 weights: [543087.76344709 100303.83970572 492881.85553305 410773.30439896
 219141.83771986]
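
One way to sanity-check weights learned by gradient descent is to compare them with the least-squares solution, which NumPy can compute directly with `np.linalg.lstsq`. The sketch below does this on synthetic data, not on the housing data, and inlines a compact version of the same update rule; the variable names are illustrative.

```python
import numpy as np

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
true_w = np.array([3.0, -2.0, 0.0, 1.5])
y_demo = X_demo @ true_w + 10.0 + rng.normal(scale=0.5, size=300)
n = len(y_demo)

# Gradient descent with the same MSE gradients as the training function
w_gd = np.zeros(4)
b_gd = 0.0
for _ in range(2000):
    error = y_demo - (X_demo @ w_gd + b_gd)
    w_gd -= 0.01 * (-2 / n) * (X_demo.T @ error)
    b_gd -= 0.01 * (-2 / n) * error.sum()

# Least squares: append a column of ones so the last coefficient is the bias
A = np.column_stack([X_demo, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w_ls, b_ls = coef[:-1], coef[-1]

print(np.max(np.abs(w_gd - w_ls)), abs(b_gd - b_ls))
```

If gradient descent has converged, both differences should be close to zero. A noticeable gap would point to too few epochs or an unsuitable learning rate.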


## Part 4 – Testing and Evaluation

After training, the model is evaluated on the test dataset to measure how well
it generalizes to unseen data. The Mean Squared Error (MSE) is used as the
evaluation metric. Because MSE is expressed in squared price units, its value
is very large for this dataset.


In [32]:
# Predict on test data
y_test_pred = predict(X_test_scaled, weights, bias)

# Compute test MSE
test_mse = mean_squared_error(y_test, y_test_pred)

print("Test Mean Squared Error:", test_mse)


Test Mean Squared Error: 1241717058160.144


In [33]:
# Compare a few actual vs predicted values
for i in range(5):
    print(f"Actual: {y_test[i]:.0f}, Predicted: {y_test_pred[i]:.0f}")


Actual: 9870000, Predicted: 7488384
Actual: 3990000, Predicted: 2787121
Actual: 3850000, Predicted: 2977323
Actual: 7000000, Predicted: 7118468
Actual: 4200000, Predicted: 4122126
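
Two metrics that are often easier to read than raw MSE are the root mean squared error (RMSE), which is in the same units as the price, and the coefficient of determination (R²). The sketch below is an optional addition rather than part of the original evaluation. The function names are illustrative, and the demo arrays hold only the five actual and predicted pairs printed above, so the resulting R² describes those five rows, not the whole test set.

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE; same units as the target."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# The five actual/predicted pairs printed by the notebook
y_true_demo = np.array([9870000, 3990000, 3850000, 7000000, 4200000], dtype=float)
y_pred_demo = np.array([7488384, 2787121, 2977323, 7118468, 4122126], dtype=float)

print("RMSE (5 rows):", root_mean_squared_error(y_true_demo, y_pred_demo))
print("R^2  (5 rows):", r_squared(y_true_demo, y_pred_demo))

# RMSE implied by the test MSE reported above
reported_rmse = np.sqrt(1241717058160.144)
print("RMSE (full test set):", reported_rmse)
```

The reported test MSE corresponds to an RMSE of roughly 1.11 million, which can be read against the mean price of about 4.77 million shown in the statistical summary.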


### Conclusion

In this notebook, linear regression was implemented entirely from scratch using
only basic Python, NumPy, and Pandas. The objective was to understand the internal
working of linear regression, including data preprocessing, model formulation,
loss computation, gradient descent optimization, and model evaluation without
using high-level machine learning libraries. Trained for 1000 epochs with a
learning rate of 0.01, the model reached a test MSE of about 1.24 × 10¹².
