# Data Science Assignment: Mumbai Property Price Prediction

## 1. Data Understanding

### Dataset Overview
The provided dataset (`Assignment Data Scientist(in).csv`) contains market trend data aggregated by Locality and Quarter.

**Column Explanations:**
- **Locality**: The specific neighborhood (e.g., "Andheri West").
- **Quarter**: Time period (e.g., "Jul-Sep 2024").
- **Average Price**: The average price **per square foot** in INR.
- **Price Range**: The low-to-high range of prices reported for the locality in that quarter.
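
The `Quarter` labels can be split into their parts before any time-based filtering. The sketch below assumes every label follows the `Mon-Mon YYYY` shape of the single example above ("Jul-Sep 2024"); `parse_quarter_label` and `QUARTER_PATTERN` are hypothetical names, not part of the original notebook.

```python
import re

# Assumed label shape: three-letter month, hyphen, three-letter month, space, 4-digit year.
QUARTER_PATTERN = re.compile(r"^\s*([A-Za-z]{3})-([A-Za-z]{3})\s+(\d{4})\s*$")


def parse_quarter_label(label):
    """Return (year, start_month, end_month), or None if the label does not match."""
    if not isinstance(label, str):
        return None
    match = QUARTER_PATTERN.match(label)
    if match is None:
        return None
    start_month, end_month, year = match.groups()
    return int(year), start_month, end_month


parsed = parse_quarter_label("Jul-Sep 2024")
unparsed = parse_quarter_label("unknown")
print(parsed, unparsed)
```

Returning `None` for labels that do not match makes malformed rows easy to count before deciding whether to drop them.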

### Quality Checks & Assumptions
- **Target**: `Average Price`, parsed into a numeric `price_per_sqft` column.
- **Missing Features**: The dataset lacks granular property details (Bedrooms, Floor, etc.), so we model `price_per_sqft ~ Locality` to predict a base rate per locality.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pickle

# Configuration
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
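# Helper Function: Clean Price Text
def convert_price_text_to_number(price_text):
    # Non-string values (already numeric, or NaN) pass through unchanged
    if not isinstance(price_text, str):
        return price_text
    try:
        if '-' in price_text:
            # Range text ("low-high"): use the midpoint of the first two parts
            low, high = price_text.split('-')[:2]
            return (float(low.replace(',', '').strip()) + float(high.replace(',', '').strip())) / 2
        return float(price_text.replace(',', '').strip())
    except ValueError:
        # Unparseable text (empty or non-numeric) becomes a missing value
        return np.nan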

# Load Data
file_path = "Assignment Data Scientist(in).csv"
# Note: on_bad_lines='skip' silently drops malformed rows
property_data = pd.read_csv(file_path, on_bad_lines='skip')
print(f"Data Loaded. Shape: {property_data.shape}")
property_data.head()
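
The cleaning step can be checked on a few hand-written inputs before it is applied to the whole column. The sketch below uses a hypothetical stand-alone variant, `parse_price`, and made-up example strings; it is an illustration of the intended behavior, not the notebook's own helper.

```python
import math


def parse_price(value):
    """Hypothetical price parser: returns a float, or NaN when the input is unusable."""
    if isinstance(value, (int, float)):
        return float(value)
    if not isinstance(value, str):
        return math.nan
    # Remove thousands separators, then split a possible "low-high" range
    parts = [part.strip() for part in value.replace(",", "").split("-")]
    try:
        numbers = [float(part) for part in parts]
    except ValueError:
        return math.nan
    if len(numbers) == 1:
        return numbers[0]
    if len(numbers) == 2:
        return (numbers[0] + numbers[1]) / 2  # midpoint of the range
    return math.nan


# Made-up inputs covering a single value, a range, junk text, and a missing value
examples = ["12,500", "10,000-12,000", "N/A", None]
parsed_examples = [parse_price(value) for value in examples]
print(parsed_examples)
```

Mapping every unusable input to NaN lets a later `dropna` call remove those rows in one place.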

## 2. Exploratory Data Analysis (EDA)

We will perform the following checks:
1. Missing Values
2. Duplicates
3. Outlier Detection
4. Distribution of Prices
5. Top Expensive Localities
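
A boxplot flags outliers visually; the same rule can be applied numerically to count them. The sketch below uses the common 1.5 × IQR convention on made-up prices. `iqr_outlier_mask` and `sample_prices` are hypothetical names, and the threshold `k=1.5` is an assumption rather than something the dataset prescribes.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up prices per sqft: five similar values and one extreme value
sample_prices = pd.Series([18000, 19500, 20000, 21000, 22500, 95000], dtype=float)
mask = iqr_outlier_mask(sample_prices)
print(f"Outliers flagged: {mask.sum()}")
print(sample_prices[mask])
```

Whether flagged rows are removed, capped, or kept is a modeling decision; premium localities may be genuine high values rather than errors.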


In [None]:
# Clean Target Variable
property_data['price_per_sqft'] = property_data['Average Price'].apply(convert_price_text_to_number)

# Check Missing Values
print("Missing Values per Column:")
print(property_data.isnull().sum())

# Drop missing targets
property_data = property_data.dropna(subset=['price_per_sqft'])

# Check Duplicates
print(f"\nDuplicate Rows: {property_data.duplicated().sum()}")

In [None]:
# 1. Outlier Detection (Boxplot)
plt.figure(figsize=(10, 4))
sns.boxplot(x=property_data['price_per_sqft'], color='orange')
plt.title('Boxplot of Price per Sqft (Outlier Detection)')
plt.show()

# 2. Price Distribution
plt.figure(figsize=(10, 5))
sns.histplot(property_data['price_per_sqft'], bins=50, kde=True, color='blue')
plt.title('Distribution of Property Prices (per Sqft)')
plt.show()

# 3. Top 10 Most Expensive Localities
avg_price = property_data.groupby('Locality')['price_per_sqft'].mean().sort_values(ascending=False).head(10)
plt.figure(figsize=(12, 6))
sns.barplot(x=avg_price.values, y=avg_price.index, palette='magma')
plt.title('Top 10 Most Expensive Localities')
plt.xlabel('Avg Price per Sqft')
plt.show()

## 3. Machine Learning Model

**Goal:** Predict `price_per_sqft` based on `Locality`.

**Approach:**
- **Model:** Linear Regression.
- **Preprocessing:** One-Hot Encoding for `Locality`.
- **Validation:** 80/20 Train/Test Split to evaluate generalization on unseen data.


In [None]:
# 1. Feature Engineering
# Extract the year from the last token of the Quarter label; unparseable values become 0
def extract_year(quarter):
    try:
        return int(str(quarter).split(' ')[-1])
    except ValueError:
        return 0

property_data['year'] = property_data['Quarter'].apply(extract_year)
# Keep recent data only (2023 onwards)
recent_data = property_data[property_data['year'] >= 2023].copy()

# Normalize Locality
recent_data['Locality'] = recent_data['Locality'].str.lower().str.strip()

X = recent_data[['Locality']]
y = recent_data['price_per_sqft']

# 2. Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")

# 3. Pipeline (Encoder + Model)
model_pipeline = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
    ('regressor', LinearRegression())
])

# 4. Train
model_pipeline.fit(X_train, y_train)

# 5. Evaluate
y_pred = model_pipeline.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("\nModel Performance on Test Set:")
print(f"RMSE: {rmse:.2f}")
print(f"R2 Score: {r2:.4f}")

In [None]:
# Save the Pipeline
with open('locality_price_model.pkl', 'wb') as f:
    pickle.dump(model_pipeline, f)
print("Model pipeline saved successfully.")