<a href="https://colab.research.google.com/github/proffranciscofernando/house_price_prediction/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# California House Price Prediction

This notebook demonstrates a machine learning pipeline built with Scikit-Learn to predict house prices in California. It includes data preprocessing, model training, hyperparameter optimisation using Grid Search with cross-validation, and an interactive interface for making predictions based on user input.

## 1. Importing Necessary Libraries

First, we import the required libraries and suppress warnings to keep the output clean. Note that `warnings.filterwarnings('ignore')` silences all warnings, not only non-critical ones.

In [None]:
# Importing Necessary Libraries and Suppressing Warnings

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import joblib  # For saving and loading the model

# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')
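
Silencing every warning can also hide deprecation notices that matter. As a hedged alternative, the sketch below ignores a single category and leaves the rest visible. The choice of `FutureWarning` and the helper name `count_visible_warnings` are illustrative assumptions, not part of the original notebook.

```python
import warnings

def count_visible_warnings():
    """Emit one FutureWarning and one UserWarning; return the ones that get through."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')  # start from a known filter state
        # Ignore only FutureWarning (an illustrative choice, not the notebook's setting)
        warnings.filterwarnings('ignore', category=FutureWarning)
        warnings.warn('this API will change', FutureWarning)
        warnings.warn('check your input', UserWarning)
    return [w.category.__name__ for w in caught]

visible = count_visible_warnings()
print(visible)
```

Only the `UserWarning` is recorded, which shows that the filter is selective rather than global.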

## 2. Loading and Exploring the Dataset

We will use the **California Housing Dataset** to predict house prices.

### 2.1 Loading the Dataset

In [None]:
# Loading the dataset
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)
df = data.frame

# Viewing the first few rows
df.head()

### 2.2 Exploratory Data Analysis

Let's check the basic information of the dataset to understand its structure.

In [None]:
# Basic information about the dataset
df.info()

### 2.3 Selecting Features and Target Variable

We use all eight feature columns as predictors and `MedHouseVal` (the median house value, in units of $100,000) as the target.

In [None]:
# Selecting features and target variable
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

### 2.4 Handling Missing Values

We check the feature matrix for missing values.

In [None]:
# Checking for missing values
X.isnull().sum()

As we can see, there are no missing values in the dataset.

## 3. Data Preparation

### 3.1 Splitting Data into Training and Testing Sets

We split the dataset into training and testing sets to evaluate the performance of the models.

In [None]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
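
With `test_size=0.2`, one fifth of the rows is held out, and a fixed `random_state` makes the split repeatable. The sketch below illustrates both points on a synthetic array with 20,640 rows, the row count of the California housing data. The array itself is a stand-in, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the California housing data
rows = np.arange(20640).reshape(-1, 1)

train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))        # sizes of the two partitions
print(np.array_equal(test_a, test_b))   # same seed, same split
```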

## 4. Building the Pipeline

### 4.1 Creating the Preprocessing Pipeline

All eight features are numeric, so the only transformation we define is standardisation with `StandardScaler`.

In [None]:
# Defining numerical columns (there are no categorical columns in this dataset)
numeric_features = X.columns.tolist()
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Since we have no categorical features, we do not define a categorical transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ])
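
Because the scaler lives inside the pipeline, its mean and standard deviation are estimated from whatever data `fit` receives, which is the training set (or each training fold during cross-validation), and never from the test rows. A minimal sketch with made-up single-feature data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative values, not from the housing dataset)
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler().fit(train)    # statistics come from the training rows only
print(scaler.mean_)                     # mean of 1, 2, 3

scaled_test = scaler.transform(test)    # test row is scaled with the training statistics
print(scaled_test)
```

The test value is far from the training mean, so it maps to a large standardised value. The scaler does not re-centre itself on the test data.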

## 5. Hyperparameter Optimisation with Grid Search

### 5.1 Defining Models and Parameters

We define a dictionary of regression models together with the hyperparameter grid to search for each one. Linear Regression has no hyperparameters to tune here, so its grid is empty.

In [None]:
# Defining models and their hyperparameters
models_params = {
    'Linear Regression': {
        'model': LinearRegression(),
        'params': {}
    },
    'Ridge Regression': {
        'model': Ridge(),
        'params': {
            'regressor__alpha': [0.1, 1.0, 10.0]
        }
    },
    'Lasso Regression': {
        'model': Lasso(),
        'params': {
            'regressor__alpha': [0.001, 0.01, 0.1, 1.0]
        }
    },
    'Random Forest': {
        'model': RandomForestRegressor(random_state=42),  # fixed seed for reproducible results
        'params': {
            'regressor__n_estimators': [50, 100, 200],
            'regressor__max_depth': [None, 5, 10]
        }
    },
    'SVR': {
        'model': SVR(),
        'params': {
            'regressor__C': [0.1, 1, 10],
            'regressor__kernel': ['linear', 'rbf']
        }
    }
}
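
The grid keys follow the pattern `<step name>__<parameter>`: the prefix `regressor` must match the name given to the model step when the pipeline is built, and the double underscore separates it from the estimator's own parameter name. The sketch below shows the same addressing scheme with `set_params` on a small stand-in pipeline (the `scaler` step here replaces the notebook's `preprocessor` for brevity).

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('regressor', Ridge()),
])

# 'regressor__alpha' addresses the alpha parameter of the step named 'regressor'
pipe.set_params(regressor__alpha=10.0)
print(pipe.named_steps['regressor'].alpha)

# Every addressable name is listed by get_params()
print('regressor__alpha' in pipe.get_params())
```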

### 5.2 Running Grid Search with Cross-Validation

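For each model, we wrap the preprocessor and the regressor in a pipeline, run a 5-fold cross-validated Grid Search scored by negative mean squared error, and then evaluate the best estimator on the test set. Models with an empty grid are fitted directly.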

In [None]:
# Running Grid Search with cross-validation for each model
results = []

for name, mp in models_params.items():
    model = mp['model']
    params = mp.get('params', {})

    # Creating the pipeline with the current model
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    if params:
        # Grid Search with cross-validation
        grid_search = GridSearchCV(
            estimator=pipeline,
            param_grid=params,
            cv=5,
            scoring='neg_mean_squared_error',
            n_jobs=-1
        )

        # Training the model
        grid_search.fit(X_train, y_train)

        # Best model found
        best_model = grid_search.best_estimator_
    else:
        # If there are no hyperparameters to tune
        pipeline.fit(X_train, y_train)
        best_model = pipeline
        grid_search = None

    # Making predictions on the test set
    y_pred = best_model.predict(X_test)

    # Calculating metrics
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    # Storing the results
    results.append({
        'Model': name,
        'Best Hyperparameters': grid_search.best_params_ if params else 'N/A',
        'MAE': mae,
        'RMSE': rmse,
        'R²': r2,
        'Pipeline': best_model  # Storing the best model
    })
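
Scikit-Learn always maximises the scoring value, which is why the mean squared error is passed as `neg_mean_squared_error`. As a result, `best_score_` is zero or negative, and the square root of its negation gives an RMSE based on the mean cross-validated MSE. The sketch below illustrates this on synthetic data from `make_regression`, which is an assumption for the demo and not the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for the demo, not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={'alpha': [0.1, 1.0, 10.0]},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds, so it is never positive
cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, round(cv_rmse, 3))
```

Because this figure is computed from the training folds only, it could also be used to rank candidate models without consulting the test set.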

### 5.3 Comparing Optimised Models

Let's collect the test-set metrics of each tuned model in a DataFrame, sorted by RMSE (lowest first).

In [None]:
# Comparing the optimised models
df_results = pd.DataFrame(results)
df_results = df_results.sort_values(by='RMSE')
df_results.reset_index(drop=True, inplace=True)
df_results[['Model', 'MAE', 'RMSE', 'R²', 'Best Hyperparameters']]
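
To make the three metrics concrete, the sketch below computes them directly from their definitions in plain Python. The function name `regression_metrics` and the sample values are illustrative. Since the target is expressed in units of $100,000, an MAE or RMSE of 0.5 corresponds to roughly $50,000.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Made-up values in units of $100,000
mae, rmse, r2 = regression_metrics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
print(mae, rmse, r2)
```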

## 6. Selecting and Saving the Best Model

### 6.1 Selecting the Best Model

In [None]:
# Selecting the best model based on RMSE
best_model_name = df_results.loc[0, 'Model']
best_pipeline = df_results.loc[0, 'Pipeline']

print(f"The best model was: {best_model_name}")
print(f"With hyperparameters: {df_results.loc[0, 'Best Hyperparameters']}")

### 6.2 Saving the Model

In [None]:
# Saving the model using joblib
joblib.dump(best_pipeline, 'best_model.pkl')
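
A quick way to gain confidence in a saved model is to reload it and confirm that it reproduces the original predictions. The sketch below does this for a tiny stand-in pipeline fitted on made-up data. The file name `toy_model.pkl` is hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline fitted on made-up data
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])
toy_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression()),
]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')  # hypothetical file name
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

same_predictions = np.allclose(toy_pipeline.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources, ideally with the same Scikit-Learn version that saved them.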

## 7. Loading the Model and Making Predictions

### 7.1 Making Predictions with New User Input

This section builds a simple text-based prompt loop: the user enters the eight feature values, and the loaded model returns a predicted price.

In [None]:
# Loading the saved model
loaded_model = joblib.load('best_model.pkl')

#### Function to Get User Input Data

In [None]:
# Function to get user input data
def read_float(prompt, name, minimum=None, strict=False):
    """Prompt until the user enters a number that satisfies the lower bound.

    minimum=None accepts any number; strict=True also rejects the minimum itself.
    """
    if minimum is None:
        hint = "a number"
    elif strict:
        hint = "a positive number"
    else:
        hint = "a non-negative number"

    raw = input(prompt)
    while True:
        try:
            value = float(raw)
            if minimum is not None and (value < minimum or (strict and value == minimum)):
                raise ValueError
            return value
        except ValueError:
            print(f"Invalid value. Please enter {hint}.")
            raw = input(f"{name}: ")


def get_user_data():
    print("Please enter the house details:")

    # (column name, description shown in the first prompt, lower bound, strict bound?)
    fields = [
        ('MedInc', 'Median income in tens of thousands of dollars', 0, False),
        ('HouseAge', 'Median house age in the area', 0, False),
        ('AveRooms', 'Average rooms per house', 0, True),
        ('AveBedrms', 'Average bedrooms per house', 0, True),
        ('Population', 'Population of the block', 0, True),
        ('AveOccup', 'Average occupancy per house', 0, True),
        ('Latitude', None, None, False),
        ('Longitude', None, None, False),
    ]

    # Prompting user for each value; each becomes a one-row DataFrame column
    values = {}
    for name, description, minimum, strict in fields:
        prompt = f"{name} ({description}): " if description else f"{name}: "
        values[name] = [read_float(prompt, name, minimum, strict)]

    # Creating a DataFrame with the entered data (same column order as the training data)
    new_house = pd.DataFrame(values)

    return new_house
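
Prompt loops like the one above are awkward to check by hand. One option, sketched here under the assumption that the standard library's `unittest.mock` is acceptable in the notebook, is to replace `input` temporarily with a scripted sequence of answers. The function `ask_non_negative` is a simplified, hypothetical stand-in for a single prompt.

```python
from unittest.mock import patch

def ask_non_negative(prompt):
    """Prompt until the user types a non-negative number (a simplified stand-in)."""
    while True:
        try:
            value = float(input(prompt))
            if value < 0:
                raise ValueError
            return value
        except ValueError:
            print("Invalid value. Please enter a non-negative number.")

# Simulate a user who types two bad answers before a good one
with patch('builtins.input', side_effect=['abc', '-1', '8.3']):
    answer = ask_non_negative("MedInc: ")

print(answer)
```

The first two scripted answers are rejected and the third is returned, which exercises both the parsing and the range check.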

#### Interactive Loop for Predictions

In [None]:
# Interactive loop for making predictions
while True:
    # Get user input data
    new_house = get_user_data()

    # Make prediction
    prediction = loaded_model.predict(new_house)

    predicted_price = prediction[0]

    print(f"\nPredicted house price: ${predicted_price * 100000:.2f}\n")

    # Ask if the user wants to input another house
    continue_input = input("Would you like to enter another house? (y/n): ").lower()
    if continue_input != 'y':
        print("Ending predictions.")
        break
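
The multiplication by 100,000 in the loop reflects the target's unit: `MedHouseVal` is expressed in hundreds of thousands of dollars. The conversion can be isolated in a small helper, sketched below. The name `format_price` is hypothetical, and the thousands separator is an optional readability choice.

```python
def format_price(pred_in_100k):
    """Convert a prediction in units of $100,000 to a dollar string."""
    dollars = pred_in_100k * 100_000
    return f"${dollars:,.2f}"

print(format_price(2.5))
print(format_price(0.849))
```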

## 8. Conclusion

In this notebook, we built a Scikit-Learn pipeline that standardises the features and fits one of five regression models. We used Grid Search with 5-fold cross-validation to tune the hyperparameters of each model that has any, and compared the tuned models on the test set using MAE, RMSE, and R². We then selected the model with the lowest test RMSE, saved it with joblib, and wrapped it in an interactive prompt that takes new feature values and returns a predicted price.

This workflow keeps the preprocessing and the model in a single object, so the saved pipeline applies the same scaling at prediction time as during training. It also makes the choice of model and hyperparameters explicit and repeatable, which simplifies reusing the model later.

## 9. References

- [Scikit-Learn Documentation on Pipelines](https://scikit-learn.org/stable/modules/compose.html#pipeline)
- [Scikit-Learn Documentation on GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [Saving Models with Joblib](https://scikit-learn.org/stable/modules/model_persistence.html)
- [California Housing Dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset)
- [Cross-Validation in Scikit-Learn](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Regression Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)