# City House Rent Prediction (India) — End-to-End Regression Project

We will build a model that predicts **monthly rent (₹)** from rental listing attributes.

This notebook is written like a real ML project notebook:
- short code cells
- clear narration
- **inference after key outputs**
- train / validation / test split
- cross-validation for model comparison
- pipelines for preprocessing

We keep the math light. We focus on decisions and interpretation.

## 1) Load the dataset

We start by loading the CSV into a DataFrame.
Then we preview a few rows to understand the columns.

In [None]:
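# Pandas is the standard library for working with tables in Python.
from pathlib import Path

import pandas as pd

# Path joins with the right separator on Windows, macOS, and Linux.
DATA_PATH = Path("data") / "cities_magicbricks_rental_prices.csv"
df = pd.read_csv(DATA_PATH)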

df.head()

> **Inference**
> - `rent` is the target we want to predict. It is continuous → **regression**.
> - We have numeric features (`area`, `beds`, …) and categorical features (`city`, `furnishing`, …).
> - `house_type` is free text. We will skip it in the first baseline model.

## 2) Quick data checks

We check size, data types, and missing values.
This tells us what preprocessing we need.

In [None]:
# Quick dataset size summary (readable format)
from IPython.display import display, Markdown

rows, cols = df.shape
display(Markdown(f"**Dataset size:** `{rows:,}` rows × `{cols:,}` columns"))


> **Inference**
> - More rows usually means more stable model estimates.
> - A small dataset can make model comparisons noisy.

In [None]:
# Basic schema overview
df.info()

# Rich schema summary (types + missingness)
from IPython.display import display
import pandas as pd

schema = pd.DataFrame({
    "column": df.columns,
    "dtype": [str(t) for t in df.dtypes],
    "missing": df.isna().sum().values,
    "missing_%": (df.isna().mean() * 100).round(2).values,
})
schema = schema.sort_values(["missing", "column"], ascending=[False, True]).reset_index(drop=True)

display(
    schema.style
        .format({"missing": "{:,}", "missing_%": "{:.2f}%"})
        .bar(subset=["missing_%"], align="mid")
        .set_caption("Schema summary (sorted by missingness)")
)


> **Inference**
> - Categorical columns typically show up as `object` (or a string dtype in newer pandas versions).
> - Numeric columns show up as `int64` or `float64`.
> - You should see a **small number of missing values**. We injected them on purpose so we can demonstrate missing-value handling.

In [None]:
# Missing values by column (top)
import pandas as pd
from IPython.display import display

missing = (df.isna().sum().sort_values(ascending=False)).reset_index()
missing.columns = ["column", "missing_count"]
missing["missing_%"] = (missing["missing_count"] / len(df) * 100).round(2)

display(
    missing.head(12).style
        .format({"missing_count": "{:,}", "missing_%": "{:.2f}%"})
        .bar(subset=["missing_%"], align="mid")
        .set_caption("Top columns with missing values")
)


> **Inference**
> - Missing values are scattered.
> - This is common in real data pipelines.

## 3) Handling missing data (explicit section)

Missing data is normal in real projects.

Common strategies:
1. **Drop** rows or columns (fast, but you may lose signal)  
2. **Impute** missing values (simple and effective for baselines)  
3. **Model-based imputation** (KNN, MICE, etc. — more complex)

In this notebook:
- For **numeric** columns, we will impute using the **mean**.
- For **categorical** columns, we will impute using the **most frequent** category.
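
To make the first two strategies concrete, here is a minimal sketch on a made-up four-row table. The `toy` DataFrame and its values are hypothetical and are not taken from the rental dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy listings; the values are invented for illustration.
toy = pd.DataFrame({
    "area": [500.0, 750.0, np.nan, 1200.0],
    "furnishing": ["Furnished", None, "Unfurnished", "Furnished"],
})

# Strategy 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Strategy 2: impute (mean for numeric, most frequent value for categorical).
imputed = toy.copy()
imputed["area"] = imputed["area"].fillna(imputed["area"].mean())
imputed["furnishing"] = imputed["furnishing"].fillna(imputed["furnishing"].mode()[0])

print(len(toy), "rows before,", len(dropped), "rows after dropping")
print(imputed)
```

Dropping halves this tiny table, while imputation keeps every row. That trade-off is the reason we prefer imputation when missingness is small.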

In [None]:
# Missingness percentage (top)
import pandas as pd
from IPython.display import display

missing_pct = (df.isna().mean() * 100).sort_values(ascending=False).reset_index()
missing_pct.columns = ["column", "missing_%"]

display(
    missing_pct.head(12).style
        .format({"missing_%": "{:.2f}%"})
        .bar(subset=["missing_%"], align="mid")
        .set_caption("Missingness percentage (top columns)")
)


> **Inference**
> - If missingness is very high in a column (say 40%+), consider dropping it or collecting better data.
> - Here, missingness is intentionally small. This makes imputation a reasonable choice.

### 3.1 Simple imputation demo (before pipelines)

We will demonstrate imputation on a small subset.
Later, we will do this properly inside a pipeline.

In [None]:
# SimpleImputer is scikit-learn's standard tool for filling missing values.
from sklearn.impute import SimpleImputer

numeric_cols = ["area", "beds", "bathrooms", "balconies", "area_rate"]
numeric_cols = [c for c in numeric_cols if c in df.columns]

cat_cols = ["city", "furnishing"]
cat_cols = [c for c in cat_cols if c in df.columns]

num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

df_num_demo = pd.DataFrame(num_imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)
df_cat_demo = pd.DataFrame(cat_imputer.fit_transform(df[cat_cols]), columns=cat_cols)

# (missing before, missing after) for the numeric and the categorical demo subsets
(
    (df[numeric_cols].isna().sum().sum(), df_num_demo.isna().sum().sum()),
    (df[cat_cols].isna().sum().sum(), df_cat_demo.isna().sum().sum()),
)

> **Inference**
> - After imputation, the demo subsets contain **no missing values**.
> - This is exactly what we want before training most models.
> - We will do the same thing again inside a pipeline (the safer approach).
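
Why is the pipeline approach safer? The sketch below, which uses hypothetical single-column arrays, shows the idea: the imputer learns its fill value from the training rows only and then reuses it on unseen rows. An extreme value in the unseen data therefore cannot influence the fill value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical arrays standing in for one numeric feature.
train_col = np.array([[500.0], [700.0], [np.nan], [900.0]])
test_col = np.array([[np.nan], [5000.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train_col)                     # fill value comes from training rows only
test_filled = imputer.transform(test_col)  # reuse that value on unseen rows

print("learned mean:", imputer.statistics_[0])
print("filled test column:", test_filled.ravel())
```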

## 4) Summary statistics

We inspect numeric summaries to spot outliers and scale differences.

In [None]:
# Summary statistics (styled)
import pandas as pd
from IPython.display import display

summary = df.describe(include="all").transpose()
# Transposed so that each column of df becomes one row of the summary table
display(
    summary.style
        .set_caption("Summary statistics (all columns)")
)


> **Inference**
> - Rent often has a long right tail (premium listings).
> - Outliers can dominate RMSE.
> - Category counts can be unbalanced (some cities have far more listings).
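
The claim that outliers can dominate RMSE is easy to check with a few made-up numbers. In this sketch the arrays are hypothetical; four predictions are close and one premium listing is badly missed.

```python
import numpy as np

# Hypothetical rents (₹): the last listing is a premium outlier.
actual = np.array([20_000, 25_000, 30_000, 35_000, 200_000], dtype=float)
predicted = np.array([22_000, 24_000, 31_000, 33_000, 60_000], dtype=float)

diff = actual - predicted
mae = np.mean(np.abs(diff))         # every error counts in proportion to its size
rmse = np.sqrt(np.mean(diff ** 2))  # squaring lets large errors dominate

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```

Here a single large miss pushes RMSE to more than twice the MAE, even though the other four predictions are within ₹ 2,000.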

## 5) Split into train / validation / test

We use three splits:
- **train**: fit the model  
- **validation**: compare models / choices  
- **test**: final unbiased evaluation  

We keep the test set untouched until the end.

In [None]:
# A fixed random_state makes train_test_split reproducible.
from sklearn.model_selection import train_test_split

X = df.drop(columns=["rent"])
y = df["rent"]

# 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# From remaining 80%, take 25% as validation -> 20% overall
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

(len(X_train), len(X_val), len(X_test))

> **Inference**
> - This is a safe pattern: we do not tune on the test set.
> - Validation is where we compare models.

## 6) Exploratory Data Analysis (EDA)

EDA builds intuition.
We ask:
- What does rent look like?
- Which numeric features move rent?
- How do cities and furnishing levels differ?

We will use **Seaborn** for cleaner, more “report-like” visuals.

In [None]:
# Seaborn sits on top of Matplotlib and gives nicer default plots.
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

sns.set_theme(style="whitegrid")

def format_inr(x, pos=None):
    try:
        return "₹ {:,}".format(int(x))
    except Exception:
        return str(x)

inr_formatter = FuncFormatter(format_inr)

### 6.1 Distribution of rent (histogram + box)

This is inspired by many Kaggle EDA notebooks:
- histogram for shape
- boxplot for outliers

In [None]:
import matplotlib.gridspec as gridspec

fig = plt.figure(figsize=(11, 6))
gs = gridspec.GridSpec(2, 1, height_ratios=[4, 1], hspace=0.05)

ax1 = fig.add_subplot(gs[0])
ax2 = fig.add_subplot(gs[1], sharex=ax1)

sns.histplot(y_train, bins=50, kde=True, ax=ax1, color="#4C78A8")
ax1.set_title("Distribution of Monthly Rent (Train Set)")
ax1.set_xlabel("")
ax1.set_ylabel("Count")
ax1.xaxis.set_major_formatter(inr_formatter)

sns.boxplot(x=y_train, ax=ax2, color="#72B7B2")
ax2.set_xlabel("Monthly Rent (₹)")
ax2.xaxis.set_major_formatter(inr_formatter)

plt.show()

> **Inference**
> - The distribution is right-skewed. A few premium rentals exist.
> - The boxplot makes outliers obvious.
> - RMSE will be sensitive to extreme rents.
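
The points a boxplot draws as outliers follow the usual 1.5 × IQR rule, and the same rule can be used to count them. The sketch below uses a synthetic right-skewed sample (`rent_sample` is generated data, not the real rents) purely to illustrate the calculation.

```python
import numpy as np

# Synthetic right-skewed sample standing in for rent; not real data.
rng = np.random.default_rng(0)
rent_sample = rng.lognormal(mean=10, sigma=0.6, size=1000)

q1, q3 = np.percentile(rent_sample, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # points above this are drawn as outliers
n_high = int((rent_sample > upper_fence).sum())

print(f"mean:   {rent_sample.mean():,.0f}")
print(f"median: {np.median(rent_sample):,.0f}")
print("points above the upper fence:", n_high)
```

For a right-skewed variable the mean sits above the median, which is another quick numeric signal of the long tail seen in the histogram.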

### 6.2 Numeric features vs rent

We plot rent against each numeric feature.
We also add a trend line to make the direction clearer.

In [None]:
numeric_cols = ["area", "beds", "bathrooms", "balconies", "area_rate"]
numeric_cols = [c for c in numeric_cols if c in df.columns]

for col in numeric_cols:
    plt.figure(figsize=(10, 5.5))
    sns.regplot(
        data=df,
        x=col,
        y="rent",
        scatter_kws={"alpha": 0.25, "s": 18},
        line_kws={"linewidth": 2},
        color="#F58518"
    )
    plt.gca().yaxis.set_major_formatter(inr_formatter)
    plt.title(f"{col} vs Rent (with trend)")
    plt.xlabel(col)
    plt.ylabel("Monthly Rent (₹)")
    plt.show()

#### 6.2.1 A 3D view (rent vs area vs area_rate)

A 3D scatter can be a fun way to see how **two** inputs relate to rent at the same time.
Use it for intuition, not for precise conclusions.

In [None]:
import plotly
import plotly.io as pio

print("Plotly version:", plotly.__version__)
print("Available renderers:", pio.renderers)

# Best defaults for JupyterLab:
# Try one of these (first one usually works)
pio.renderers.default = "notebook_connected"   # good for JupyterLab + classic
#pio.renderers.default = "jupyterlab"         # works in some setups
#pio.renderers.default = "iframe"             # always works, slightly heavier


In [None]:
import plotly.express as px

# Prepare clean data for 3D visualization
cols3d = ["area", "area_rate", "rent"]
df3d = df[cols3d + ["city"]].dropna().copy()

fig = px.scatter_3d(
    df3d,
    x="area",
    y="area_rate",
    z="rent",
    color="city",                     # remove this line if you want single-color points
    opacity=0.5,
    title="3D View: Area vs Area Rate vs Rent",
)

fig.update_layout(
    scene=dict(
        xaxis_title="Area (sqft)",
        yaxis_title="Area Rate",
        zaxis_title="Monthly Rent (₹)",
    ),
    margin=dict(l=0, r=0, b=0, t=40),
)

fig.show()


> **Inference**
> - Listings with similar area can have very different rent if the area rate differs.
> - This supports why both `area` and `area_rate` can be useful features.

> **Inference (2D scatter plots from 6.2)**
> - `area` typically shows the clearest upward trend.
> - Discrete counts (`beds`, `bathrooms`) create “bands”.
> - The trend line is a guide. It does not guarantee linearity.

### 6.3 Categorical overview (counts)

We check how many categories exist.
Then we visualize the top counts.

In [None]:
import pandas as pd
from IPython.display import display

# Explicitly include both legacy object and new string dtypes
categorical_cols = df.select_dtypes(include=["object", "string"]).columns.tolist()

cardinality = (
    pd.DataFrame({
        "column": categorical_cols,
        "unique_categories": [df[c].nunique(dropna=True) for c in categorical_cols]
    })
    .sort_values("unique_categories", ascending=False)
)

display(
    cardinality.style
        .format({"unique_categories": "{:,}"})
        .bar(subset=["unique_categories"], align="mid")
        .set_caption("Category counts (unique values)")
)


> **Inference**
> - `city` and `furnishing` are low-cardinality. They are great for one-hot encoding.
> - `locality` can have many categories. We will skip it in the first baseline model.

In [None]:
top_cities = (
    df["city"]
    .value_counts()
    .head(12)
    .reset_index()
)
top_cities.columns = ["city", "count"]

plt.figure(figsize=(11, 5.5))
sns.barplot(
    data=top_cities,
    x="city",
    y="count",
    hue="city",
    palette="viridis",
    legend=False
)
plt.title("Top Cities by Listing Count")
plt.xlabel("City")
plt.ylabel("Number of Listings")
plt.xticks(rotation=25, ha="right")
plt.show()


> **Inference**
> - If one city dominates, the model may become city-biased.
> - Later, check error by city.

In [None]:
plt.figure(figsize=(8.5, 5))
sns.countplot(
    data=df,
    x="furnishing",
    hue="furnishing",     # explicit hue
    order=df["furnishing"].value_counts().index,
    palette="Set2",
    legend=False
)
plt.title("Furnishing Distribution")
plt.xlabel("Furnishing")
plt.ylabel("Count")
plt.xticks(rotation=15, ha="right")
plt.show()


> **Inference**
> - Furnishing often shifts the rent baseline.
> - This is a strong reason to keep it as a feature.

### 6.4 Rent by category (boxplots)

Boxplots show how rent spreads across categories.
We plot the top cities for readability.

In [None]:
top_city_names = df["city"].value_counts().head(8).index.tolist()
df_top = df[df["city"].isin(top_city_names)].copy()

plt.figure(figsize=(12, 6))
sns.boxplot(data=df_top, x="city", y="rent", hue="city", palette="pastel", showfliers=False, legend=False)
plt.gca().yaxis.set_major_formatter(inr_formatter)
plt.title("Rent Distribution by City (Top 8 Cities)")
plt.xlabel("City")
plt.ylabel("Monthly Rent (₹)")
plt.xticks(rotation=25, ha="right")
plt.show()

> **Inference**
> - Some cities have a clearly higher rent range.
> - A global model needs to learn both city-level shifts and within-city trends.

## Key insights from EDA

Before we model, we write down a few observations from the charts.
This keeps the workflow realistic: **EDA should influence decisions**.

> **Inference**
> - Note the top cities. A model may learn dominant cities better than rare ones.
> - Compare median rent across furnishing levels. If the gaps are large, furnishing should help prediction.
> - Check which numeric feature correlates most with rent (often `area` or `area_rate`).
> - Confirm missingness exists. This justifies imputation steps later.

## 7) Correlation (numeric features only)

Correlation is a quick way to rank numeric relationships.
It does not prove causation.

In [None]:
# Correlation with rent (numeric only)
import pandas as pd
from IPython.display import display

corr = df[numeric_cols + ["rent"]].corr(numeric_only=True)

corr_with_rent = corr["rent"].drop("rent").sort_values(ascending=False).reset_index()
corr_with_rent.columns = ["feature", "correlation_with_rent"]

display(
    corr_with_rent.style
        .format({"correlation_with_rent": "{:.3f}"})
        .bar(subset=["correlation_with_rent"], align="mid")
        .set_caption("Numeric correlation with rent (higher magnitude → stronger linear relationship)")
)


### 7.1 Correlation heatmap (numeric features)

A heatmap makes correlation patterns easier to scan.
We only include numeric columns here.

In [None]:
# A heatmap is a compact way to view correlations.
# We keep it to numeric columns to avoid mixing types.
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True, cbar=True)
plt.title("Correlation Heatmap (Numeric Features)")
plt.show()

> **Inference**
> - `area` and `area_rate` often rank high.
> - Categorical effects do not show here. Use boxplots for that.
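
One more caveat: `DataFrame.corr` uses Pearson correlation by default, which only measures linear association. The sketch below uses a hypothetical toy relationship that always increases but is strongly curved. Pearson understates it, while a rank-based (Spearman-style) correlation, computed here as Pearson on the ranks, does not.

```python
import numpy as np
import pandas as pd

# Hypothetical monotonic but non-linear relationship.
x = np.arange(1, 21, dtype=float)
toy = pd.DataFrame({"x": x, "y": np.exp(x / 2)})

pearson = toy["x"].corr(toy["y"])                # linear association only
spearman = toy["x"].rank().corr(toy["y"].rank())  # Pearson on ranks = Spearman

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A low Pearson value therefore does not rule out a useful relationship; it is worth checking the scatter plot as well.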

## 8) Feature selection (baseline model)

We start with a small, stable set of features.

We use:
- numeric: `area`, `beds`, `bathrooms`, `balconies`, `area_rate`
- categorical: `city`, `furnishing`

We skip for now:
- `locality` (high-cardinality)
- `house_type` (free text)

Future idea:
- Use `locality` to map to a **pincode** or zone, then use grouped features.

In [None]:
num_features = ["area", "beds", "bathrooms", "balconies", "area_rate"]
cat_features = ["city", "furnishing"]

X_train_sel = X_train[num_features + cat_features].copy()
X_val_sel   = X_val[num_features + cat_features].copy()
X_test_sel  = X_test[num_features + cat_features].copy()

X_train_sel.head()

> **Inference**
> - Baselines should be simple and explainable.
> - You can add complexity later once you trust the pipeline.

## 9) Preprocessing with pipelines (imputation + one-hot)

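A pipeline chains preprocessing and modeling steps into one object.
It also helps avoid data leakage, because imputation values and encoder categories are learned only from the data passed to `fit`.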

Our preprocessor will:
- impute numeric missing values with the **mean**
- impute categorical missing values with the **most frequent**
- one-hot encode categorical columns

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
])

categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, num_features),
    ("cat", categorical_pipeline, cat_features),
])

preprocess

> **Inference**
> - We do not clean the DataFrame by hand.
> - The pipeline learns imputation values from training data only.
> - `handle_unknown="ignore"` makes inference safer if a new city appears later.
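
To see what `handle_unknown="ignore"` does in practice, the sketch below fits an encoder on two hypothetical cities and then transforms a city it has never seen. The city values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical city values; "Jaipur" never appears in the training frame.
train_cities = pd.DataFrame({"city": ["Mumbai", "Pune", "Mumbai"]})
new_cities = pd.DataFrame({"city": ["Pune", "Jaipur"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cities)

encoded = encoder.transform(new_cities).toarray()
print(encoder.categories_[0])  # learned categories: Mumbai, Pune
print(encoded)                 # the unseen city becomes an all-zero row
```

With the default `handle_unknown="error"`, the same `transform` call would raise an error instead of returning a row of zeros.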

## 10) Cross-validation (CV) for model comparison

Cross-validation trains the model multiple times on different folds of the training set.
This reduces the risk of a lucky or unlucky split.

We compare three models:
- Linear Regression
- Ridge Regression
- Decision Tree Regressor

Metric: RMSE (lower is better).
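
To picture what the folds look like, this sketch splits ten hypothetical row ids with the same `KFold` settings used in this section. Each row lands in the held-out part exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

row_ids = np.arange(10)  # hypothetical row ids
kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(kf.split(row_ids))

for i, (fit_idx, holdout_idx) in enumerate(folds, start=1):
    print(f"fold {i}: fit on {len(fit_idx)} rows, hold out rows {sorted(holdout_idx)}")

# Every row is held out exactly once across the five folds.
held_out = np.sort(np.concatenate([holdout for _, holdout in folds]))
print(held_out)
```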

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
import numpy as np

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=1.0)": Ridge(alpha=1.0, random_state=42),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
}

def cv_rmse_scores(model, X, y):
    from sklearn.pipeline import Pipeline
    pipe = Pipeline(steps=[("preprocess", preprocess), ("model", model)])
    # scikit-learn maximizes scores, so RMSE is exposed as a negative value.
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    return -scores  # convert to positive RMSE

cv_scores = {name: cv_rmse_scores(model, X_train_sel, y_train) for name, model in candidates.items()}
cv_scores

> **Inference**
> - You see fold-by-fold RMSE for each model.
> - Lower RMSE is better.
> - High spread across folds means instability.

In [None]:
plt.figure(figsize=(11, 6))
sns.boxplot(data=pd.DataFrame(cv_scores), palette="Set3")
plt.title("Cross-Validation RMSE by Model (Train Set)")
plt.ylabel("RMSE (₹)")
plt.show()

[(k, float(np.mean(v)), float(np.std(v))) for k, v in cv_scores.items()]

> **Inference**
> - Prefer models with low mean RMSE and low variance.
> - Trees can overfit and show higher variance.

## 11) Train each model and evaluate on the validation set

Now we train on the training split once.
Then we score on the validation split.
This is a realistic model-selection loop.

In [None]:
import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def train_and_eval(model, X_tr, y_tr, X_eval, y_eval):
    from sklearn.pipeline import Pipeline
    # Clone so every pipeline owns its own fitted preprocessor and model.
    pipe = Pipeline(steps=[("preprocess", clone(preprocess)), ("model", clone(model))])
    pipe.fit(X_tr, y_tr)
    preds = pipe.predict(X_eval)
    # RMSE: square root of the mean squared error, in the same units as rent (₹).
    rmse = float(np.sqrt(mean_squared_error(y_eval, preds)))
    r2 = float(r2_score(y_eval, preds))
    return pipe, preds, rmse, r2

val_results = []
val_preds = {}
trained = {}

for name, model in candidates.items():
    pipe, preds, rmse, r2 = train_and_eval(model, X_train_sel, y_train, X_val_sel, y_val)
    trained[name] = pipe
    val_preds[name] = preds
    val_results.append((name, rmse, r2))

sorted(val_results, key=lambda x: x[1])

> **Inference**
> - RMSE is in the same units as rent (₹), which makes the validation error easy to interpret.
> - R² is a sanity check. Do not optimize for it blindly.
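
For readers who want to see what the two metrics compute, here is a small sketch that derives RMSE and R² by hand and compares them with scikit-learn. The `y_true` and `y_pred` arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted rents (₹).
y_true = np.array([10_000.0, 20_000.0, 30_000.0, 40_000.0])
y_pred = np.array([12_000.0, 18_000.0, 33_000.0, 37_000.0])

# RMSE: square root of the average squared error.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:,.1f}  (sklearn: {np.sqrt(mean_squared_error(y_true, y_pred)):,.1f})")
print(f"R²:   {r2_manual:.3f}  (sklearn: {r2_score(y_true, y_pred):.3f})")
```

R² compares the model against always predicting the mean rent, so it has no units. RMSE tells you how large a typical miss is in ₹.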

### 11.1 Validation residuals (per model)

Residual = actual − predicted.
We want residuals centered around 0 with reasonable spread.

In [None]:
plt.figure(figsize=(12, 6))
for name, preds in val_preds.items():
    residuals = y_val.values - preds
    sns.histplot(residuals, bins=35, stat="count", element="step", fill=False, label=name)

plt.gca().xaxis.set_major_formatter(inr_formatter)
plt.title("Validation Residual Distributions (Actual − Predicted)")
plt.xlabel("Residual (₹)")
plt.ylabel("Count")
plt.legend()
plt.show()

> **Inference**
> - Narrower residual distributions usually mean lower error.
> - Long tails usually come from premium listings and outliers.

## 12) Pick the best model and evaluate on the test set

We pick the model with the best validation RMSE.
Then we refit it on train + validation.
Finally, we evaluate once on the test set.

In [None]:
best_name, best_rmse, best_r2 = sorted(val_results, key=lambda x: x[1])[0]
best_name, best_rmse, best_r2

> **Inference**
> - After this point, we stop comparing models.
> - The test set is used only once.

In [None]:
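import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

# Clone so the final fit starts from unfitted copies and leaves earlier pipelines untouched.
best_model = clone(candidates[best_name])

X_trainval_sel = X_trainval[num_features + cat_features].copy()

final_pipe = Pipeline(steps=[("preprocess", clone(preprocess)), ("model", best_model)])
final_pipe.fit(X_trainval_sel, y_trainval)

test_preds = final_pipe.predict(X_test_sel)

# RMSE: square root of the mean squared error, in the same units as rent (₹).
test_rmse = float(np.sqrt(mean_squared_error(y_test, test_preds)))
test_r2 = float(r2_score(y_test, test_preds))

test_rmse, test_r2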

> **Inference**
> - This test RMSE is your final unbiased estimate.
> - Do not tune hyperparameters using the test set.

## 13) Error analysis on the test set

We inspect:
- residual distribution
- biggest mistakes (largest absolute error)
- average error by city (quick check)

In [None]:
test_residuals = y_test.values - test_preds

plt.figure(figsize=(11, 5.5))
sns.histplot(test_residuals, bins=40, kde=True, color="#B279A2")
plt.gca().xaxis.set_major_formatter(inr_formatter)
plt.title(f"Test Residual Distribution — {best_name}")
plt.xlabel("Residual (₹)  (Actual − Predicted)")
plt.ylabel("Count")
plt.show()

> **Inference**
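> - If the residuals are centered away from 0, the model is biased: a shift to the right (positive residuals) means it under-predicts, and a shift to the left means it over-predicts.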
> - Wide spread means the model is not precise.
> - Long tails indicate outliers.

In [None]:
# Biggest errors (top 15)
from IPython.display import display
import pandas as pd
import numpy as np

# Build the error table from the test features, actual rents, and predictions
errors = X_test_sel.copy()
errors["actual_rent"] = y_test.values
errors["predicted_rent"] = test_preds
errors["abs_error"] = np.abs(errors["actual_rent"] - errors["predicted_rent"])

top_err = errors.sort_values("abs_error", ascending=False).head(15).copy()

display(
    top_err.style
        .format({
            "actual_rent": "₹ {:,.0f}",
            "predicted_rent": "₹ {:,.0f}",
            "abs_error": "₹ {:,.0f}",
        })
        .bar(subset=["abs_error"], align="mid")
        .set_caption("Largest absolute errors on the test set")
)


> **Inference**
> - Use this table to look for patterns among the worst predictions, such as a particular city or an unusual area.
> - It shows which listings need better features or a stronger model.

In [None]:
from IPython.display import display

if "city" in errors.columns:
    # Mean absolute error per city, worst first
    city_mae = errors.groupby("city")["abs_error"].mean().sort_values(ascending=False)
    top_city_mae = city_mae.head(10)

    plt.figure(figsize=(10.5, 5.5))
    sns.barplot(x=top_city_mae.index, y=top_city_mae.values, hue=top_city_mae.index, palette="magma", legend=False)
    plt.gca().yaxis.set_major_formatter(inr_formatter)
    plt.title("Average Absolute Error by City (Top 10) — Test Set")
    plt.xlabel("City")
    plt.ylabel("Avg Absolute Error (₹)")
    plt.xticks(rotation=25, ha="right")
    plt.show()

    # Also show the table for quick reading
    city_mae_table = top_city_mae.reset_index()
    city_mae_table.columns = ["city", "avg_abs_error"]

    display(
        city_mae_table.style
            .format({"avg_abs_error": "₹ {:,.0f}"})
            .bar(subset=["avg_abs_error"], align="mid")
            .set_caption("Avg absolute error by city (top 10) — test set")
    )


> **Inference**
> - If one city has much higher error, you may need more data or better features for that city.
> - This is a common production issue for geo-based models.

## 14) Recap and next steps

You built a realistic ML workflow:
- EDA with readable plots
- explicit missing data handling section
- train / validation / test split
- preprocessing pipeline (impute + one-hot)
- model comparison using cross-validation
- final evaluation on the test set
- error analysis

**Next steps**
- Try stronger models (RandomForest, GradientBoosting)
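- Transform the target (for example, predict log(rent)) and engineer new features; avoid features computed from `rent` itself, such as rent per sqft, because they leak the target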
- Add locality carefully (high-cardinality strategies)
- Map locality to zones/pincodes for richer location signals
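
As a starting point for the log(rent) idea, here is a minimal sketch using scikit-learn's `TransformedTargetRegressor`. It fits the model on a log-transformed target and converts predictions back to rupees automatically. The synthetic `area` and `rent` arrays are assumptions made for illustration and are not derived from the dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: rent grows multiplicatively with area (illustration only).
rng = np.random.default_rng(0)
area = rng.uniform(300, 3000, size=200).reshape(-1, 1)
rent = np.exp(9 + 0.0008 * area.ravel() + rng.normal(0, 0.2, size=200))

log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,          # applied to the target before fitting
    inverse_func=np.expm1,  # applied to predictions, so outputs are back in ₹
)
log_model.fit(area, rent)

preds = log_model.predict(area[:5])
print(np.round(preds))
```

In this notebook, the same wrapper could take the place of the `"model"` step in the pipeline, so that metrics are still computed in ₹ and stay comparable with the earlier results.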