# Predicting House Prices with Machine Learning

**Note**: This project uses a Kaggle dataset. [Dataset](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques)

This notebook works through the Kaggle House Prices - Advanced Regression Techniques competition.

## 1. Problem Definition

The goal of this project is to build a machine learning model that predicts the sale price of residential properties from features describing each house's physical attributes, location, and condition. Such estimates can support buyers, sellers, and real estate professionals in making informed decisions.

This is a supervised regression task: the target variable, the sale price, is a continuous numerical value.

---

## 2. Data

This project uses the dataset provided by the Kaggle House Prices Competition. The dataset contains 79 explanatory variables describing various attributes of residential properties in Ames, Iowa, along with the target variable `SalePrice`, which represents the property's sale price in dollars.

### Data Files
- `train.csv`: The training dataset, including features and the target variable, `SalePrice`.
- `test.csv`: The test dataset, containing features but no target variable. The model's predictions for these houses are submitted to Kaggle for scoring.
- `data_description.txt`: A detailed guide to the meaning and significance of each feature.
- `sample_submission.csv`: An example of the required format for model predictions.
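
As a minimal sketch of how these files might be loaded, the helper below reads `train.csv` and `test.csv` with pandas and separates `SalePrice` from the training features. The function name `load_house_prices` and its `data_dir` argument are hypothetical and only serve this illustration; the directory holding the files depends on where you downloaded them.

```python
from pathlib import Path

import pandas as pd


def load_house_prices(data_dir):
    """Read train.csv and test.csv from data_dir and split off the target.

    `data_dir` is an assumed location for the downloaded Kaggle files.
    """
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")

    # SalePrice exists only in the training file.
    y_train = train["SalePrice"]
    X_train = train.drop(columns=["SalePrice"])
    return X_train, y_train, test
```

A call such as `X_train, y_train, X_test = load_house_prices("data")` would then return the training features, the training target, and the test features, assuming the CSV files sit in a local `data` folder.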

---

## 3. Evaluation

The primary objective of this project is to predict the sale price (`SalePrice`) for each house in the test set. Submissions are scored on the **Root Mean Squared Logarithmic Error (RMSLE)**, that is, the root mean squared error between the logarithm of the predicted `SalePrice` and the logarithm of the actual `SalePrice`.

### **Why RMSLE?**
Taking logarithms makes the score depend on the ratio between the predicted and actual price rather than on their dollar difference, so the same percentage error on an expensive house and on an inexpensive house affects the result equally.
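
To make this concrete, the sketch below compares two made-up predictions that are each 10% too high, one for a cheaper house and one for a more expensive house. The prices and the helper names `dollar_error` and `log_error` are invented for illustration and are not taken from the dataset.

```python
import math


def dollar_error(prediction, actual):
    """Absolute error in dollars."""
    return abs(prediction - actual)


def log_error(prediction, actual):
    """Absolute difference of log(x + 1), the per-house term inside RMSLE."""
    return abs(math.log1p(prediction) - math.log1p(actual))


# Hypothetical (prediction, actual) pairs, both overestimated by 10%.
cheap = (110_000, 100_000)
expensive = (550_000, 500_000)

for label, (pred, actual) in [("cheap", cheap), ("expensive", expensive)]:
    print(label, dollar_error(pred, actual), round(log_error(pred, actual), 4))
```

The dollar errors differ by a factor of five (10,000 versus 50,000), while the two log errors are almost identical at roughly 0.095, which is why neither house dominates the score.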

### **Metric**
The RMSLE is computed as:

$$
RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log(\text{Prediction}_i + 1) - \log(\text{Actual}_i + 1))^2}
$$

Where:
- $n$: Total number of predictions.
- $\log(\text{Prediction}_i + 1)$: Logarithm of the predicted house price plus 1.
- $\log(\text{Actual}_i + 1)$: Logarithm of the actual house price plus 1.
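
The formula above can be sketched in plain Python as follows. This is an illustrative implementation of the expression as written here, not Kaggle's own scoring code; the function name `rmsle` is an assumption, and `math.log1p(x)` computes the natural logarithm of `x + 1`.

```python
import math


def rmsle(predictions, actuals):
    """Root mean squared logarithmic error, following the formula above."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must have the same length")
    if len(predictions) == 0:
        raise ValueError("at least one prediction is required")

    # Squared difference of log(x + 1) for each prediction/actual pair.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predictions, actuals)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

If scikit-learn is available, taking the square root of `sklearn.metrics.mean_squared_log_error` should give the same value, since that function also works on `log(1 + x)`.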

For more information, refer to the [Kaggle evaluation section](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation).

**Note**: As with most regression error metrics, a lower value is better. The goal of this project is to build a machine learning model that minimizes the **Root Mean Squared Logarithmic Error (RMSLE)**.
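
One common way to pursue this goal, offered here as an assumption rather than as the method this project prescribes, is to train the model on `log(SalePrice + 1)` instead of on the raw price. Ordinary squared error on the transformed target then matches the squared log error inside RMSLE, and predictions are converted back to dollars before submission. The helper names below are hypothetical.

```python
import math


def to_log_target(prices):
    """Transform sale prices to log(price + 1) for training."""
    return [math.log1p(p) for p in prices]


def from_log_target(log_prices):
    """Invert the transform: exp(value) - 1 recovers prices in dollars."""
    return [math.expm1(v) for v in log_prices]
```

With NumPy arrays or pandas columns, `numpy.log1p` and `numpy.expm1` perform the same pair of transforms element-wise.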

---

## 4. Features

The 79 features describe the physical attributes, location, and condition of each house, and serve as the inputs for predicting the target variable, `SalePrice`.

### **Feature Categories**
The features can be grouped into the following categories, each shown with a few key variables:

- **Property Characteristics**:
  - `MSSubClass`: Building class (e.g., type of dwelling).
  - `MSZoning`: General zoning classification (e.g., residential, commercial).
  - `LotArea`: Lot size in square feet.
  - `Neighborhood`: Physical location within Ames city limits.

- **Structural Details**:
  - `OverallQual`: Overall material and finish quality.
  - `YearBuilt`: Original construction date.
  - `GrLivArea`: Above-grade (ground) living area in square feet.

- **Basement and Garage**:
  - `TotalBsmtSF`: Total square feet of basement area.
  - `GarageCars`: Size of garage in car capacity.
  - `GarageArea`: Size of garage in square feet.

- **Amenities**:
  - `Fireplaces`: Number of fireplaces.
  - `PoolArea`: Pool area in square feet.
  - `Fence`: Fence quality.

- **Sales Information**:
  - `SaleType`: Type of sale (e.g., warranty deed, contract).
  - `SaleCondition`: Condition of sale (e.g., normal, partial).
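
The grouping above can be captured in a small lookup table, which is handy for exploring one category at a time. This is only a sketch: the category names are labels used in this document rather than anything defined by the dataset, and the names `FEATURE_GROUPS` and `columns_in_group` are hypothetical.

```python
# Example variables per category, as listed above (not all 79 features).
FEATURE_GROUPS = {
    "Property Characteristics": ["MSSubClass", "MSZoning", "LotArea", "Neighborhood"],
    "Structural Details": ["OverallQual", "YearBuilt", "GrLivArea"],
    "Basement and Garage": ["TotalBsmtSF", "GarageCars", "GarageArea"],
    "Amenities": ["Fireplaces", "PoolArea", "Fence"],
    "Sales Information": ["SaleType", "SaleCondition"],
}


def columns_in_group(group, available_columns):
    """Return the group's columns that exist in available_columns, in group order."""
    available = set(available_columns)
    return [col for col in FEATURE_GROUPS[group] if col in available]
```

Given a pandas DataFrame `train` holding the training data, `train[columns_in_group("Amenities", train.columns)]` would select just the amenity columns listed here.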

### **Target Variable**
- `SalePrice`: The sale price of the house in dollars. This is the target variable the model is trained to predict.

### **Source**
For a comprehensive list of all features and their descriptions, refer to the official [data description file](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) provided on Kaggle.
