![Housing In India](https://github.com/mobadara/housing-in-india/blob/main/assets/images/houses.jpeg?raw=1)

# 🏠 **Housing Prices in India: Exploratory Data Analysis**


[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mobadara/housing-in-india/blob/main/notebooks/exploratory-data-analysis.ipynb)


## 🧭 **Project Overview**

India’s housing market is diverse and dynamic, shaped by rapid urbanization, regional disparities, and evolving economic factors. Understanding property price variations across different regions requires more than just statistical summaries — it calls for an analytical deep dive into **location, infrastructure, and socioeconomic variables**.

This notebook focuses on performing an **Exploratory Data Analysis (EDA)** to understand the structure and relationships within the dataset before any cleaning or modeling is performed.  
Insights gained from this analysis will directly guide **data cleaning, transformation, and feature engineering** in the next phase.

---

## 🗂️ **Notebook Outline**
0. Setup
1. Dataset Overview  
2. Descriptive Statistics  
3. Distribution Analysis  
4. Feature Relationships  
5. Outlier Detection  
6. Missing Data Patterns  
7. Cleaning & Transformation Recommendations  

---


> 💡 **Note:** This notebook focuses purely on analysis.  
> All cleaning, preprocessing, and modeling steps will be implemented in subsequent notebooks for reproducibility and clarity.

---

## **Setup**
In this phase, I imported the required Python packages (local ones and those installed with `pip`), downloaded the dataset using the `kagglehub` package, and put it in the right format. The downloaded dataset contains three CSV files: `train.csv`, `test.csv` and `sample_submission.csv`. `sample_submission.csv` is a **one-column** dataset whose single column holds the predicted **house price** for the rows of the **test dataset** (`test.csv`). The test dataset does not contain the target (price in lacs).

The versions of the packages used are also printed for reproducibility and reporting purposes.
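
One way to report package versions is sketched below. It relies only on the standard library's `importlib.metadata`; the helper name `package_versions` and the list of packages passed to it are illustrative assumptions, not part of the original notebook.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def package_versions(names):
    """Return a dict mapping each package name to its installed version.

    Hypothetical helper: packages that are not installed are reported
    as "not installed" instead of raising an error.
    """
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


# Report the interpreter version, then the packages this notebook imports.
print(f"python: {sys.version.split()[0]}")
for name, ver in package_versions(["pandas", "kagglehub"]).items():
    print(f"{name}: {ver}")
```

Reading versions from package metadata avoids importing each package just to inspect its `__version__` attribute.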

In [1]:
# Import necessary packages
import kagglehub
import os
import sys
import pandas as pd

In [2]:
# Download the latest version of the dataset.
# The download is commented out in this run, so `path` is never defined.
# try:
#     path = kagglehub.dataset_download("anmolkumar/house-price-prediction-challenge")
# except Exception as e:
#     print(f"Error downloading dataset: {e}")
#     print("Use the available dataset in the data folder.")

In [3]:
# Set up Environment
MODULE_PATH = os.path.abspath(os.path.join("..", "src"))
if not os.path.exists("../data"):
    os.makedirs('../data')
if MODULE_PATH not in sys.path:
    sys.path.insert(0, MODULE_PATH)
!cp -r {path}/* ../data/
!rm -r {path}
print(f'Dataset downloaded to ../data/')
print(f'Files in ../data/: {os.listdir("../data/")}')

cp: cannot stat '{path}/*': No such file or directory
rm: cannot remove '{path}': No such file or directory
Dataset downloaded to ../data/
Files in ../data/: ['sample_submission.csv', 'train.csv', 'test.csv']
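
The `cp` and `rm` errors above appear because `path` is never defined while the download cell is commented out, yet the cell still reports a successful download. A minimal sketch of a guarded copy step follows, assuming a hypothetical helper named `copy_dataset`; it uses `shutil` and `pathlib` instead of shell commands and simply copies nothing when no download directory is available.

```python
import shutil
from pathlib import Path


def copy_dataset(src, dest="../data"):
    """Copy every file in src into dest and return the copied file names.

    Hypothetical helper: if src is None or is not an existing directory,
    nothing is copied and an empty list is returned.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    if src is None or not Path(src).is_dir():
        return []
    copied = []
    for item in Path(src).iterdir():
        if item.is_file():
            shutil.copy2(item, dest_dir / item.name)
            copied.append(item.name)
    return sorted(copied)
```

With the download cell enabled, `copy_dataset(path)` would copy the downloaded files into `../data/`. With it disabled, `copy_dataset(None)` leaves the existing files in `../data/` untouched, and the returned list makes it clear whether anything was actually copied.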


In [4]:
# Import personal packages
from analysis.inspection import Inspector

## **Dataset Overview**

In [5]:
df_train = pd.read_csv("../data/train.csv")

In [6]:
inspector = Inspector(data=df_train)
inspector.inspect()

Dataset Shape: (29451, 12)

Data Types:
------------------------------


POSTED_BY                 object
UNDER_CONSTRUCTION         int64
RERA                       int64
BHK_NO.                    int64
BHK_OR_RK                 object
SQUARE_FT                float64
READY_TO_MOVE              int64
RESALE                     int64
ADDRESS                   object
LONGITUDE                float64
LATITUDE                 float64
TARGET(PRICE_IN_LACS)    float64
dtype: object


Basic Information:
------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29451 entries, 0 to 29450
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   POSTED_BY              29451 non-null  object 
 1   UNDER_CONSTRUCTION     29451 non-null  int64  
 2   RERA                   29451 non-null  int64  
 3   BHK_NO.                29451 non-null  int64  
 4   BHK_OR_RK              29451 non-null  object 
 5   SQUARE_FT              29451 non-null  float64
 6   READY_TO_MOVE          29451 non-null  int64  
 7   RESALE                 29451 non-null  int64  
 8   ADDRESS                29451 non-null  object 
 9   LONGITUDE              29451 non-null  float64
 10  LATITUDE               29451 non-null  float64
 11  TARGET(PRICE_IN_LACS)  29451 non-null  float64
dtypes: float64(4), int64(5), object(3)
memory usage: 2.7+ MB


None

<analysis.inspection.Inspector at 0x735c15f78440>
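
`Inspector` is imported from the project's local `analysis.inspection` module, whose source is not shown in this notebook. The class below is a hypothetical stand-in, named `SketchInspector` to avoid confusion, that would produce a similar report. It is a sketch of one possible design, not the project's actual implementation.

```python
import pandas as pd


class SketchInspector:
    """Hypothetical stand-in for the project's Inspector class."""

    def __init__(self, data):
        self.data = data

    def inspect(self):
        """Print shape, dtypes and basic info, then return self."""
        print(f"Dataset Shape: {self.data.shape}")
        print("\nData Types:\n" + "-" * 30)
        print(self.data.dtypes)
        print("\nBasic Information:\n" + "-" * 30)
        # DataFrame.info() prints its report directly and returns None,
        # so it is called on its own rather than wrapped in print().
        self.data.info()
        # Returning self allows chaining; in a notebook the returned
        # object's repr is shown as the cell's last output.
        return self
```

The stray `None` in the output above would be consistent with wrapping `DataFrame.info()` in `print`, since `info()` returns `None`. Likewise, the trailing `<analysis.inspection.Inspector at ...>` line would be consistent with `inspect()` returning the object itself.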

## **Data Inspection**
- The train dataset contains **29,451** instances with **12** columns recorded for each instance (11 features plus the target), and no column has missing values.
- There are three columns of the `object` dtype, which pandas typically uses for strings; here they hold categorical or free-text data. These columns are `POSTED_BY`, `BHK_OR_RK` and `ADDRESS`.
- The data contains four `float` columns: the area covered by the house (`SQUARE_FT`), the coordinates (`LONGITUDE` and `LATITUDE`), and the target **price**, recorded as `TARGET(PRICE_IN_LACS)`.
- The dataset contains five `int` columns: `UNDER_CONSTRUCTION`, `RERA`, `BHK_NO.`, `READY_TO_MOVE` and `RESALE`. Integer columns often store binary (0, 1) flags rather than true counts, and in other cases they hold ordinal data. Each case will be processed accordingly.

### **Actions to be taken:**
- The inconsistencies in the column names (for example, the trailing period in `BHK_NO.` and the parentheses in `TARGET(PRICE_IN_LACS)`) will be handled in the **data cleaning** phase of the project.
- Take note of the `int` columns that store binary values.
- Take note of the `object` columns, and think about which insights to extract from them and the right tools to use.
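
The second action item can be checked programmatically. The sketch below assumes a hypothetical helper named `binary_int_columns` and a toy frame with made-up values that reuses three column names from the dataset; it does not report results from the real data.

```python
import pandas as pd


def binary_int_columns(df):
    """Return the integer columns whose values are all 0 or 1."""
    return [
        col
        for col in df.columns
        if pd.api.types.is_integer_dtype(df[col]) and df[col].isin([0, 1]).all()
    ]


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame(
    {
        "RERA": [0, 1, 1],
        "BHK_NO.": [2, 3, 1],
        "POSTED_BY": ["Owner", "Dealer", "Owner"],
    }
)
print(binary_int_columns(toy))  # → ['RERA']
```

Calling `binary_int_columns(df_train)` would list the integer columns that behave like flags, which separates them from count-like columns before any encoding decisions are made.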