<a href="https://www.kaggle.com/code/piyushgone/singapore-resale-flat-prices-predicting?scriptVersionId=150566806" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Singapore Resale Flat Price Prediction**

> **Skills Takeaway from This Project:** Data Wrangling, EDA, Model Building, Model Deployment

> **Domain:** Real Estate



> ### **Problem Statement:**
The objective of this project is to develop a machine learning model and deploy it as a user-friendly web application that predicts the resale prices of flats in Singapore. This predictive model will be based on historical data of resale flat transactions, and it aims to assist both potential buyers and sellers in estimating the resale value of a flat.


 ### **Project Overview:**
This project encompasses a series of tasks aimed at creating a predictive model for resale flat transactions in Singapore, specifically focusing on data from the Housing and Development Board (HDB) spanning from 1990 to the present day.

### **Tasks:**

1. **Data Collection and Preprocessing:**
   - Gather a comprehensive dataset of resale flat transactions from HDB: https://beta.data.gov.sg/collections/189/view
   - Perform data preprocessing to clean and structure the dataset, preparing it for machine learning applications.

2. **Feature Engineering:**
   - Extract pertinent features from the dataset, such as town, flat type, storey range, floor area, flat model, and lease commence date.
   - Introduce additional features, if necessary, to augment prediction accuracy.

3. **Model Selection and Training:**
   - Choose a suitable machine learning regression model (e.g., linear regression, decision trees, or random forests).
   - Train the selected model on historical data, utilizing a portion of the dataset for training purposes.

4. **Model Evaluation:**
   - Assess the model's predictive performance using regression metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R2 Score.

5. **Streamlit Web Application:**
   - Develop an intuitive web application using Streamlit.
   - Enable users to input flat details (town, flat type, storey range, etc.) and utilize the trained model to predict the resale price based on user inputs.

6. **Deployment on Render:**
   - Deploy the Streamlit application on the Render platform to make it accessible to users over the internet.

7. **Testing and Validation:**
   - Conduct thorough testing of the deployed application to ensure its correct functionality and the delivery of accurate predictions.
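
The four metrics named in task 4 can be illustrated with a small, dependency-free sketch. The helper name `regression_metrics` and the sample prices are made up for illustration; in practice a library such as scikit-learn or PyCaret would report these values.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n    # mean absolute error
    mse = sum(e * e for e in errors) / n     # mean squared error
    rmse = math.sqrt(mse)                    # same unit as the target
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    # R2 is undefined when every true value is identical
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Made-up resale prices (SGD), for illustration only
actual = [300000, 450000, 520000, 610000]
predicted = [310000, 440000, 500000, 640000]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE is often the easiest of the four to interpret here because it is expressed in the same unit as the resale price.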

# Step 0: Imports and Reading Data


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot') 

# from sklearn.model_selection import train_test_split
# from sklearn.linear_model import LinearRegression
# from sklearn.ensemble import RandomForestRegressor
# import xgboost as xgb
# import lightgbm as lgb
# import joblib


import warnings

# warnings.filterwarnings("ignore", module="matplotlib")
# warnings.filterwarnings("ignore", module="seaborn")
warnings.filterwarnings("ignore")

In [None]:
# Note: the variable suffixes do not match the file ranges:
# df_12 holds Jan 2015 - Dec 2016 and df_15 holds Jan 2017 onwards.
df_90 = pd.read_csv(r'/kaggle/input/singapore-resale-flat-prices-data-set/ResaleFlatPricesBasedonApprovalDate19901999.csv')
df_00 = pd.read_csv(r'/kaggle/input/singapore-resale-flat-prices-data-set/ResaleFlatPricesBasedonApprovalDate2000Feb2012.csv')
df_12 = pd.read_csv(r'/kaggle/input/singapore-resale-flat-prices-data-set/ResaleFlatPricesBasedonRegistrationDateFromJan2015toDec2016.csv')
df_15 = pd.read_csv(r'/kaggle/input/singapore-resale-flat-prices-data-set/ResaleflatpricesbasedonregistrationdatefromJan2017onwards.csv')


In [None]:
# Only the most recent file (Jan 2017 onwards) is analysed below
df = df_15
df.sample(10)

### Step 1.1 - Understanding Data


In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

### Step 1.2 - Data Preparation

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum()

In [None]:
df = df.drop_duplicates()

In [None]:
sns.boxplot(data=df)  # check the numeric columns for outliers

In [None]:
sns.boxplot(x = 'resale_price', data = df)

In [None]:
def remove_outliers(df, column_name):
    """
    Remove outliers from a specific column in a DataFrame using the IQR method.

    Parameters:
    - df: pandas DataFrame
    - column_name: Name of the column for outlier removal

    Returns:
    - DataFrame with outliers removed
    """

    # Calculate the first and third quartiles
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)

    # Calculate the IQR (Interquartile Range)
    IQR = Q3 - Q1

    # Define the lower and upper bounds for outlier removal
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Remove outliers
    df_no_outliers = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]

    return df_no_outliers

updated_df = remove_outliers(df, 'resale_price')
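
To make the IQR rule concrete, here is a small worked example on made-up numbers, written with the standard library only. The helper `iqr_bounds` and the sample values are illustrative assumptions; `method="inclusive"` is chosen because it interpolates linearly between data points, as the default of pandas' `Series.quantile` does.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices in thousands of SGD; 1200 is the obvious outlier
prices = [250, 300, 320, 350, 380, 400, 420, 450, 1200]
lower, upper = iqr_bounds(prices)
kept = [p for p in prices if lower <= p <= upper]
print(lower, upper, kept)
```

Here Q1 is 320 and Q3 is 420, so the IQR is 100 and only values between 170 and 570 are kept.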


In [None]:
sns.boxplot(x = 'resale_price', data = updated_df)

In [None]:
# Share of rows dropped by the resale_price filter, relative to the deduplicated data
c = round(((df.shape[0] - updated_df.shape[0]) / df.shape[0]) * 100, 3)
print(f'Removed outliers, which made up {c}% of the original data')

In [None]:
sns.boxplot(updated_df['floor_area_sqm'])

In [None]:
# Filter the already-cleaned frame so the resale_price outlier removal is not discarded
updated_df = remove_outliers(updated_df, 'floor_area_sqm')

In [None]:
print(updated_df.shape)
print(f'Outliers removed: {round(((df_15.shape[0] - updated_df.shape[0]) / df_15.shape[0])*100 , 3)} %')

# Step 2: Feature Engineering

In [None]:
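# Keep only the modelling columns; .copy() avoids SettingWithCopyWarning
# when new columns are added in the next cell
df = updated_df[['flat_type',
                 'storey_range',
                 'floor_area_sqm',
                 'flat_model',
                 'lease_commence_date',
                 'resale_price',
                 'town']].copy()
df.head()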

In [None]:
df['age_of_flat'] = 2023 - df['lease_commence_date']
df['price_per_square_meter'] = df['resale_price'] / df['floor_area_sqm']
df['town'] = df.town.astype('category')
df['flat_type'] = df.flat_type.astype('category')

In [None]:
df.head()

In [None]:
df.to_csv('data.csv', index=False)

In [None]:
df = pd.read_csv(r'/kaggle/working/data.csv')

# Step 3: Exploratory Data Analysis (EDA)

#### Univariate Analysis

In [None]:
sns.countplot(data = df,
              y = 'flat_type',
              order = df['flat_type'].value_counts().index)
plt.title('Transactions by Flat Type')
plt.show()

In [None]:
sns.countplot(data = df,
              y = 'flat_model',
              order = df['flat_model'].value_counts().index)
plt.title('Transactions by Flat Model')
plt.show()

In [None]:
sns.countplot(data = df,
              y = 'town',
              order = df['town'].value_counts().index)
plt.title('Transactions by Town')
plt.show()

In [None]:
sns.histplot(df['floor_area_sqm'])
plt.title('Floor Area (sqm) Distribution')
plt.xlabel('Floor area (sqm)')
plt.show()

In [None]:
sns.histplot(df['age_of_flat'])
plt.xlabel('Age of the Flat')
plt.title('Histogram of Age of the Flat')
plt.show()

In [None]:
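# sns.distplot is deprecated; histplot with kde=True draws the same histogram plus KDE curve
sns.histplot(df['price_per_square_meter'], kde=True, stat='density')
plt.xlabel('Price / Square Meter')
plt.title('Distribution of Price per Square Meter')
plt.show()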

#### Bivariate Analysis

In [None]:
sns.scatterplot(x = 'resale_price',
                y = 'floor_area_sqm',
                data = df)

In [None]:
sns.scatterplot(x = 'resale_price',
                y = 'price_per_square_meter',
                hue = 'flat_type',
                data = df)

In [None]:
sns.scatterplot(x = 'age_of_flat',
                y = 'price_per_square_meter',
                hue = 'flat_type',
                data = df)

In [None]:
sns.kdeplot(x = 'age_of_flat',
            y = 'price_per_square_meter',
            # hue = 'flat_type',
            data = df)

#### Multivariate Analysis 

In [None]:
sns.pairplot(data = df, 
             hue = 'flat_type')

In [None]:
df.columns

In [None]:
df_corr = df[['floor_area_sqm','age_of_flat','price_per_square_meter','resale_price']].corr()
df_corr

In [None]:
sns.heatmap(df_corr, annot=True)

# Step 4: Model Selection and Training

In [None]:
!pip install pycaret[full]

In [None]:
df.columns

In [None]:
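from pycaret.regression import *

# price_per_square_meter is derived from the target, so it is ignored to avoid leakage.
# data.csv was written with index=False, so there is no 'Unnamed: 0' column to ignore.
s = setup(df, target = 'resale_price',
          ignore_features = ['price_per_square_meter'],
          session_id = 123,
          n_jobs = -1,
          use_gpu = False)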

In [None]:
best = compare_models()

In [None]:
print(best)

Best Model Found: Random Forest Regressor

In [None]:
rfr = create_model('rf')

In [None]:
evaluate_model(rfr)

In [None]:
plot_model(rfr, plot = 'residuals')

In [None]:
plot_model(rfr, plot = 'feature')

In [None]:
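# Score the hold-out split created by setup()
predict_model(rfr)

In [None]:
# Predict on the full dataset; this includes rows the model was trained on,
# so the errors here are optimistic
predictions = predict_model(rfr, data=df)
predictions.head()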

In [None]:
save_model(rfr, 'RFR_pipeline')

In [None]:
loaded_model = load_model('RFR_pipeline')
print(loaded_model)
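
The Streamlit step from the task list is not shown in this notebook. The following is a minimal sketch of how the saved pipeline might be wired into an app, not the deployed implementation. The helper `build_input_row`, the default input values, and the file name `app.py` are assumptions. Whether the pipeline also needs the ignored `price_per_square_meter` column at prediction time has not been verified here.

```python
REFERENCE_YEAR = 2023  # same constant the notebook uses for age_of_flat


def build_input_row(town, flat_type, storey_range, floor_area_sqm,
                    flat_model, lease_commence_date):
    """Assemble one row with the feature columns used in training."""
    return {
        "flat_type": flat_type,
        "storey_range": storey_range,
        "floor_area_sqm": float(floor_area_sqm),
        "flat_model": flat_model,
        "lease_commence_date": int(lease_commence_date),
        "town": town,
        "age_of_flat": REFERENCE_YEAR - int(lease_commence_date),
    }


def main():
    # Imported lazily so build_input_row can be used without these packages
    import pandas as pd
    import streamlit as st
    from pycaret.regression import load_model, predict_model

    st.title("Singapore Resale Flat Price Prediction")
    model = load_model("RFR_pipeline")  # file written by save_model above

    # Free-text inputs keep the sketch short; a real app would likely offer
    # select boxes filled with the categories seen in the training data.
    town = st.text_input("Town", "ANG MO KIO")
    flat_type = st.text_input("Flat type", "4 ROOM")
    storey_range = st.text_input("Storey range", "10 TO 12")
    flat_model = st.text_input("Flat model", "Improved")
    floor_area = st.number_input("Floor area (sqm)", min_value=20.0, value=90.0)
    lease_year = st.number_input("Lease commence year", min_value=1960,
                                 max_value=REFERENCE_YEAR, value=1990)

    if st.button("Predict"):
        row = build_input_row(town, flat_type, storey_range, floor_area,
                              flat_model, lease_year)
        result = predict_model(model, data=pd.DataFrame([row]))
        # PyCaret 3.x names the output column "prediction_label";
        # older 2.x releases used "Label".
        price = result["prediction_label"].iloc[0]
        st.write(f"Estimated resale price: {price:,.0f} SGD")


# Streamlit executes the script top to bottom, so app.py would end with a
# call to main(). It is left commented out here so the helper can be run
# without Streamlit installed.
# main()
```

Saved as `app.py` with the final call enabled, the sketch would be started with `streamlit run app.py`. Keeping the row construction in a plain function makes it possible to check that the app sends the same columns the model was trained on.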