# Data Preparation and Model Training

This notebook demonstrates the workflow for preparing the house price dataset, training a regression model, making predictions, and analyzing results.

## 1. Import Required Libraries and Modules

We start by importing the required modules and adding the project's parent directory to `sys.path`, so that `my_project` can be imported from the notebook.

In [1]:
import sys
import os
# Add the parent directory to sys.path to allow imports from my_project
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

from my_project.dataset import HousePricingDataModule

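The path setup above can be made a little more robust. The following is a minimal sketch, assuming the notebook is started from a subdirectory of the project root (as the `'..'` in the cell above implies). The helper name `add_project_root` is hypothetical and not part of `my_project`. It resolves the parent directory with `pathlib` and only inserts it when it is not already on `sys.path`, so re-running the cell does not add duplicates.

```python
import sys
from pathlib import Path


def add_project_root(start=None):
    """Put the parent of `start` (default: current directory) on sys.path once.

    Hypothetical helper for illustration; not part of my_project.
    """
    base = Path(start) if start is not None else Path.cwd()
    root = str(base.resolve().parent)
    if root not in sys.path:
        # Insert at the front so the project package is found first.
        sys.path.insert(0, root)
    return root


project_root = add_project_root()
print(project_root)
```
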
## 2. Initialize and Prepare Data

We initialize the data module and call `prepare_data()`, which loads the raw CSV, splits it into train, validation, and test sets, and normalizes the data.

In [2]:
# Initialize data module and process data
data_module = HousePricingDataModule(data_dir='../data/raw/house_price_regression_dataset.csv')
data_module.prepare_data()

Dataset loaded with shape: (1000, 8)
Features shape: (1000, 7), Target shape: (1000,)
Data split into train, val, and test sets. Interim files saved.
Data normalized and processed files saved.

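The log above mentions a train/val/test split followed by normalization. The internals of `HousePricingDataModule` are not shown in this notebook, so the following is only a sketch of one common way to do this, assuming a 70/15/15 split and z-score normalization with statistics fitted on the training split only. The function names, split ratios, seed, and synthetic data are all assumptions.

```python
import numpy as np


def split_indices(n_samples, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle row indices and cut them into train/val/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx


def fit_standardizer(x_train):
    """Compute per-feature mean and std on the training split only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std


def standardize(x, mean, std):
    """Apply the training-split statistics to any split."""
    return (x - mean) / std


# Synthetic stand-in with the same shape as the features in the log: (1000, 7).
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(1000, 7))

train_idx, val_idx, test_idx = split_indices(len(features))
mean, std = fit_standardizer(features[train_idx])

x_train = standardize(features[train_idx], mean, std)
x_val = standardize(features[val_idx], mean, std)
x_test = standardize(features[test_idx], mean, std)

print(x_train.shape, x_val.shape, x_test.shape)
```

Fitting the mean and standard deviation on the training rows alone keeps information from the validation and test rows out of the preprocessing step.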

## 3. Train the Model

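We train the house price regression model on the processed data. The training hyperparameters (batch size, number of data-loader workers, learning rate, weight decay, and number of epochs) are passed to `train.main()` as an `argparse.Namespace`, which mimics command-line arguments.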

In [3]:
import argparse

# sys.path was already extended in the first cell, so my_project is importable here
from my_project.modeling import train, predict

# Simulate command-line arguments for train.main()
args = argparse.Namespace(
    batch_size=64,
    num_workers=4,
    lr=1e-3,
    weight_decay=0.0,
    epochs=20,
)

# Start training
train.main(args)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs

  | Name    | Type       | Params | Mode 
-----------------------------------------------
0 | net     | Sequential | 2.6 K  | train
1 | loss_fn | MSELoss    | 0      | train
-----------------------------------------------
2.6 K     Trainable params
0         Non-trainable params
2.6 K     Total params
0.010     Total estimated model params size (MB)
7         Modules in train mode
0         Modules in eval mode

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

c:\Users\joaqu\Desktop\MASTER\Software Development\Practice 1\Software_Development\.venv\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:428: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization.


                                                                            

c:\Users\joaqu\Desktop\MASTER\Software Development\Practice 1\Software_Development\.venv\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:428: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


Epoch 19: 100%|██████████| 10/10 [00:05<00:00,  1.76it/s, v_num=0, val_loss=4.86e+11, train_loss=4.34e+11]

`Trainer.fit` stopped: `max_epochs=20` reached.



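Passing an `argparse.Namespace` works because a `main(args)` function typically only reads attributes from the object it receives. The sketch below illustrates the pattern with a hypothetical `build_parser` and a stand-in `main`. Neither is taken from `my_project.modeling.train`, whose actual parser and defaults are not shown in this notebook. Only the argument names mirror the fields used in the cell above.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing the same fields as the Namespace above.

    The defaults are assumptions chosen to match the notebook cell.
    """
    parser = argparse.ArgumentParser(description="Train a regression model")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--num_workers", type=int, default=4)
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--weight_decay", type=float, default=0.0)
    parser.add_argument("--epochs", type=int, default=20)
    return parser


def main(args):
    """Stand-in for a training entry point: it only reads attributes of args."""
    return {
        "batch_size": args.batch_size,
        "lr": args.lr,
        "epochs": args.epochs,
    }


# From a shell the parser would read sys.argv; here an explicit list is passed.
cli_args = build_parser().parse_args(["--epochs", "5"])

# From a notebook, a hand-built Namespace with the same attributes is equivalent.
nb_args = argparse.Namespace(
    batch_size=64, num_workers=4, lr=1e-3, weight_decay=0.0, epochs=5
)

print(main(cli_args) == main(nb_args))
```

The trade-off of building the `Namespace` by hand is that any defaults defined in the real parser are bypassed, so every attribute that `main` reads has to be supplied explicitly.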

## 4. Make Predictions

After training, we load the saved checkpoint and make predictions on the test set. The predictions are written to a CSV file for further analysis.

In [6]:
# Run prediction
output_csv = predict.run_predict(
    data_dir="data/processed",
    models_dir="models",
    output_path="models/test_predictions.csv",
    device="auto",
    target_col="House_Price",
)


Using checkpoint: c:\Users\joaqu\Desktop\MASTER\Software Development\Practice 1\Software_Development\models\house_price_regressor.ckpt
Test RMSE: 669855.877397
Predictions saved to: c:\Users\joaqu\Desktop\MASTER\Software Development\Practice 1\Software_Development\models\test_predictions.csv

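The training log reports `train_loss` and `val_loss` from an `MSELoss`, so they are presumably mean squared errors, whereas the prediction step reports a root mean squared error. Taking the square root puts them on the same scale. The small calculation below uses only the numbers printed above.

```python
import math

val_mse = 4.86e11          # val_loss from the last epoch in the training log
train_mse = 4.34e11        # train_loss from the last epoch in the training log
test_rmse = 669855.877397  # Test RMSE reported by the prediction step

# RMSE is the square root of MSE, in the same units as the target.
val_rmse = math.sqrt(val_mse)
train_rmse = math.sqrt(train_mse)

print(f"train RMSE ~ {train_rmse:,.0f}")
print(f"val RMSE   ~ {val_rmse:,.0f}")
print(f"test RMSE  = {test_rmse:,.0f}")
```

In this run all three values are several hundred thousand, the same order of magnitude as the house prices themselves.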

## 5. Analyze Results

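Finally, we load the predictions and display the first few rows. In this run the predictions (roughly 68 to 213) are several orders of magnitude smaller than the true prices (roughly 0.5 to 1.0 million), which is consistent with the large test RMSE reported above.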

In [7]:
import pandas as pd
df_preds = pd.read_csv("../models/test_predictions.csv")
df_preds.head()

| Unnamed: 0 | prediction | y_true |
|---:|---:|---:|
| 0 | 119.99303 | 901000.5 |
| 1 | 68.461914 | 494537.5 |
| 2 | 129.48296 | 949404.2 |
| 3 | 213.0077 | 1040389.0 |
| 4 | 160.33043 | 794010.0 |

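Error metrics can be computed directly from the `prediction` and `y_true` columns. The sketch below rebuilds the five displayed rows as an in-memory DataFrame so that it runs without the CSV file; in the notebook, `df_preds` would be used instead. The name `df_head` and the `ratio` column are introduced here for illustration, and metrics on five rows are not the test-set metrics.

```python
import numpy as np
import pandas as pd

# The five rows displayed above, copied by hand (not the full test set).
df_head = pd.DataFrame(
    {
        "prediction": [119.99303, 68.461914, 129.48296, 213.0077, 160.33043],
        "y_true": [901000.5, 494537.5, 949404.2, 1040389.0, 794010.0],
    }
)

error = df_head["prediction"] - df_head["y_true"]
rmse = float(np.sqrt((error ** 2).mean()))
mae = float(error.abs().mean())

# Ratio of true value to prediction, to see how far apart the scales are.
df_head["ratio"] = df_head["y_true"] / df_head["prediction"]

print(f"RMSE on these rows: {rmse:,.1f}")
print(f"MAE on these rows:  {mae:,.1f}")
print(df_head)
```

Because the training loss is itself of order 1e11, the targets appear to be in raw price units during training, so a gap like this would be consistent with a model that has not yet converged. Scaling the target or training for longer are common things to try, though neither is verified here.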