# Data Science Assignment: Mumbai Property Price Prediction

## 1. Data Understanding

### Dataset Overview
The provided dataset (`Assignment Data Scientist(in).csv`) contains locality-level market trend data rather than individual property listings.

**Column Explanations:**
- **Locality**: The specific neighborhood in Mumbai (e.g., "Andheri West").
- **Quarter**: The time period of the data point (e.g., "Jul-Sep 2024").
- **Average Price**: The average price **per square foot** in INR for that locality during that quarter.
- **Price Range**: The low and high range of prices per sqft.
- **Growth Type**: Percentage growth quarter-over-quarter.
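
As a rough illustration of how the text columns above could be turned into typed values, the sketch below parses a `Quarter` label and a `Price Range` string. The helper names `parse_quarter` and `parse_price_range` are hypothetical, and the range format (`"20,000-22,000"`) is an assumption based on example values rather than something verified for every row.

```python
from datetime import datetime


def parse_quarter(label):
    """Return the first day of a quarter label such as "Jul-Sep 2024"."""
    months, year = label.strip().split(" ")
    start_month = months.split("-")[0]  # "Jul" from "Jul-Sep"
    return datetime.strptime(f"{start_month} {year}", "%b %Y")


def parse_price_range(text):
    """Split an assumed "low-high" string into two floats (INR per sqft)."""
    low, high = (float(part.replace(",", "").strip()) for part in text.split("-"))
    return low, high


quarter_start = parse_quarter("Jul-Sep 2024")
low, high = parse_price_range("20,000-22,000")
```

Parsing `Quarter` into a date keeps the rows sortable in time, which a plain string column is not.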

### Numerical vs Categorical
- **Categorical**: `Locality`, `Quarter`, `City`, `Type`.
- **Numerical**: `Average Price`, `Price Range` (numeric only after cleaning the text values).

### Target Selection
- **Target**: `Average Price` (Price per Sqft).
- **Reason**: This is the primary indicator of property value in the dataset. Because the data has no unit-level features such as 'bedrooms', predicting the rate per sqft lets us estimate the total price of any hypothetical apartment as rate per sqft × area in sqft.
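
The rate-times-area estimate can be sketched as follows. The function name `estimate_total_price` and the numbers in the usage line are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def estimate_total_price(rate_per_sqft, area_sqft):
    """Estimate total price in INR as rate per sqft multiplied by area in sqft."""
    if rate_per_sqft < 0 or area_sqft <= 0:
        raise ValueError("rate must be non-negative and area must be positive")
    return rate_per_sqft * area_sqft


# Hypothetical apartment: 650 sqft at an illustrative rate of 25,000 INR/sqft
total = estimate_total_price(25_000, 650)  # 16,250,000 INR
```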

### Data Quality & Assumptions
- **Discrepancy**: The assignment requested features like 'bedrooms' and 'bathrooms', but the data is aggregated by locality. 
- **Assumption**: We assume the `Average Price` represents the market rate per sqft for a standard apartment. We will use this rate to predict total prices for the API.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score
import pickle

# Helper function: clean price text
def convert_price_text_to_number(price_text):
    """Convert "20,000" or "20,000-22,000" to a float (ranges use the midpoint)."""
    if not isinstance(price_text, str):
        return price_text  # already numeric (or NaN)
    try:
        # Handle ranges like "20,000-22,000"
        if '-' in price_text:
            parts = price_text.split('-')
            low = float(parts[0].replace(',', '').strip())
            high = float(parts[1].replace(',', '').strip())
            return (low + high) / 2
        # Handle single numbers
        return float(price_text.replace(',', '').strip())
    except ValueError:
        # Unparseable text (e.g. "NA") becomes NaN and is dropped later
        return np.nan

# Load data
file_path = "Assignment Data Scientist(in).csv"
# Note: on_bad_lines='skip' silently drops malformed rows
property_data = pd.read_csv(file_path, on_bad_lines='skip')

print(f"Data Loaded. Shape: {property_data.shape}")
property_data.head()

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Clean the Target Variable
property_data['price_per_sqft'] = property_data['Average Price'].apply(convert_price_text_to_number)
property_data = property_data.dropna(subset=['price_per_sqft'])

# Extract Year for Trend Analysis
property_data['year'] = property_data['Quarter'].apply(
    lambda x: int(x.split(' ')[-1]) if isinstance(x, str) and ' ' in x else 0
)

# Analysis 1: Top 10 most expensive localities
avg_price_by_locality = property_data.groupby('Locality')['price_per_sqft'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=avg_price_by_locality.head(10).values, y=avg_price_by_locality.head(10).index, palette='viridis')
plt.title('Top 10 Most Expensive Localities (Avg Price/Sqft)')
plt.xlabel('Price in INR per Sqft')
plt.show()

In [None]:
# Analysis 2: Price distribution
plt.figure(figsize=(10, 6))
sns.histplot(property_data['price_per_sqft'], bins=50, kde=True)
plt.title('Distribution of Property Prices (per Sqft)')
plt.xlabel('Price/Sqft')
plt.show()

## 3. Minimal ML Model

**Approach**: Locality-Based Rate Card.
Because `Locality` is the strongest predictor available in this dataset, the model is a lookup table that returns the mean price per sqft for a given locality. Only data from 2023 onwards is used so that the rates reflect current prices.
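
A minimal sketch of how such a rate card might be queried is shown below. The function `predict_rate`, the fallback to the overall mean for unseen localities, and the toy rates are all assumptions made for illustration; the notebook itself does not define what happens for an unknown locality.

```python
def predict_rate(locality, rate_card, default_rate=None):
    """Look up the price per sqft for a locality in a {locality: rate} dict.

    Unknown localities fall back to default_rate, or to the mean of all
    known rates when no default is given (an assumed policy).
    """
    key = locality.strip()
    if key in rate_card:
        return rate_card[key]
    if default_rate is not None:
        return default_rate
    return sum(rate_card.values()) / len(rate_card)


# Toy rate card with made-up values, not taken from the dataset
toy_rate_card = {"Andheri West": 30000.0, "Example Locality": 50000.0}
known = predict_rate("Andheri West", toy_rate_card)
unknown = predict_rate("Somewhere Else", toy_rate_card)
```

Falling back to the overall mean keeps the lookup total, at the cost of a crude estimate for localities that never appeared in the data.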

In [None]:
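# Filter for recent data (2023+); fall back to all data if nothing qualifies
recent_data = property_data[property_data['year'] >= 2023]
if recent_data.empty:
    recent_data = property_data
    print(f"No records from 2023 onwards; training on all {len(recent_data)} records.")
else:
    print(f"Training on {len(recent_data)} records from 2023 onwards.")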

# Train Model: Compute Mean Price per Locality
locality_price_map = recent_data.groupby('Locality')['price_per_sqft'].mean().to_dict()

# Validation (in-sample: these are the same rows used to build the rate card,
# so the scores measure fit, not performance on unseen data)
scored = recent_data.assign(
    predicted=recent_data['Locality'].map(locality_price_map)
).dropna(subset=['predicted'])
actual_prices = scored['price_per_sqft']
predicted_prices = scored['predicted']

rmse = np.sqrt(mean_squared_error(actual_prices, predicted_prices))
r2 = r2_score(actual_prices, predicted_prices)

print("Model Results (in-sample):")
print(f"RMSE: {rmse:.2f}")
print(f"R2 Score: {r2:.2f}")
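
Because the scores above are computed on the same rows that built the rate card, they are likely optimistic. One way to get a less biased estimate is a hold-out split, sketched below on a made-up toy frame. The function `holdout_scores`, the 25% test fraction, the fallback to the training mean, and the toy data are all assumptions for illustration; RMSE and R2 are computed by hand from their standard definitions.

```python
import numpy as np
import pandas as pd


def holdout_scores(df, test_frac=0.25, seed=42):
    """Build the rate card on a training split and score it on held-out rows."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)

    rate_card = train.groupby("Locality")["price_per_sqft"].mean()
    fallback = train["price_per_sqft"].mean()  # for localities unseen in train

    actual = test["price_per_sqft"].to_numpy()
    predicted = test["Locality"].map(rate_card).fillna(fallback).to_numpy()

    rmse = float(np.sqrt(np.mean((actual - predicted) ** 2)))
    ss_res = float(np.sum((actual - predicted) ** 2))
    ss_tot = float(np.sum((actual - actual.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2


# Toy data with made-up localities and prices, only to exercise the function
toy = pd.DataFrame({
    "Locality": ["A"] * 8 + ["B"] * 8,
    "price_per_sqft": [100.0 + i for i in range(8)] + [200.0 + i for i in range(8)],
})
holdout_rmse, holdout_r2 = holdout_scores(toy)
```

With quarterly data, splitting by time (training on earlier quarters and testing on the latest) may be a closer match to how the model would be used.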

In [None]:
# Save the Model for the API
with open('locality_price_model.pkl', 'wb') as f:
    pickle.dump(locality_price_map, f)

print("Model saved to locality_price_model.pkl")
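
On the API side, the saved dictionary can be loaded back with `pickle.load`. The sketch below round-trips a toy rate card through a temporary file to show the pattern; the helper name `load_rate_card` and the toy rate are hypothetical. Note that `pickle.load` should only be used on files from a trusted source.

```python
import os
import pickle
import tempfile


def load_rate_card(path):
    """Load a {locality: price_per_sqft} dict saved with pickle.dump."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a made-up rate card in a temporary directory
toy_card = {"Andheri West": 30000.0}
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "locality_price_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(toy_card, f)
    loaded_card = load_rate_card(model_path)
```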