# End-to-End Machine Learning Project (Regression)
## Predicting House Rent in Indian Cities

We will build a model that predicts **monthly rent** from rental listing attributes.

This notebook is practical and step-by-step.
After most code cells, you will see a short **inference** section.
It tells you what to observe and what not to over-interpret.

**Workflow**
1. Load and inspect data  
2. Split into train / validation / test  
3. Explore patterns (EDA)  
4. Prepare data with pipelines (missing values + one-hot)  
5. Compare models with cross-validation  
6. Select a final model and evaluate on the test set  
7. Analyze errors and next steps

## 1) Load the dataset

We load the CSV into a Pandas DataFrame.
Then we preview a few rows.

In [None]:
# pandas is the most widely used library for tabular data in Python.
import pandas as pd

DATA_PATH = r"data/cities_magicbricks_rental_prices.csv"
df = pd.read_csv(DATA_PATH)

df.head()

**Inference**
- `rent` is the target. It is a continuous number, so this is a regression task.
- We have numeric features (`area`, `beds`, …) and categorical features (`city`, `furnishing`).
- `house_type` is long text. We will skip it in the first model.

## 2) Quick data checks

We inspect:
- number of rows and columns
- data types
- missing values

This tells us what preprocessing we need.

In [None]:
df.shape

**Inference**
- More rows usually means more stable model comparisons.
- If your dataset is tiny, cross-validation results can swing a lot.

In [None]:
df.info()

**Inference**
- Categorical columns show up as `object`.
- Numeric columns show up as `int64` or `float64`.
- You should see a small number of missing values.
  We injected them on purpose so we can demonstrate imputation.

In [None]:
df.isna().sum().sort_values(ascending=False).head(12)

**Inference**
- Missing values are scattered, not nicely grouped.
- This is common in real data pipelines.
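Raw counts are easier to judge as a share of rows. The sketch below uses a small made-up DataFrame (the values are illustrative and not taken from the rental dataset) to show the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy listings; values are invented for illustration only.
toy = pd.DataFrame({
    "area": [650.0, np.nan, 1200.0, 900.0],
    "city": ["Mumbai", "Pune", None, "Delhi"],
    "rent": [30000, 18000, 45000, 25000],
})

# isna().mean() gives the fraction of missing values in each column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same one-liner works on `df`. A column with a large missing share may deserve more thought than a default imputer.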

## 3) Summary statistics

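We look at summary statistics for every column.
With `include="all"`, numeric columns get mean, spread, and quantiles, and categorical columns get counts and the most frequent value.
This helps us spot scale differences and outliers.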

In [None]:
df.describe(include="all").transpose()

**Inference**
- Rent usually has a long right tail (premium listings).
- Outliers can dominate error metrics like RMSE.
- Category counts can be unbalanced (some cities have more listings).
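To see why outliers can dominate RMSE, here is a minimal sketch with invented error values: nine ordinary listings and one premium outlier. It compares RMSE with mean absolute error (MAE), which is not used elsewhere in this notebook and appears here only as a reference point.

```python
import numpy as np

# Invented absolute errors in rupees: nine ordinary listings and one premium outlier.
toy_errors = np.array([2000.0] * 9 + [100000.0])

mae = np.mean(np.abs(toy_errors))
rmse = np.sqrt(np.mean(toy_errors ** 2))

# Share of the total squared error contributed by the single outlier.
outlier_share = toy_errors[-1] ** 2 / np.sum(toy_errors ** 2)

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"Outlier share of squared error: {outlier_share:.1%}")
```

Because errors are squared before averaging, one listing accounts for almost all of the squared error in this toy case.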

## 4) Split into train / validation / test

We use three splits:
- **train**: fit model parameters
- **validation**: compare choices (model selection)
- **test**: final unbiased evaluation

We do not touch the test set until the end.

In [None]:
# train_test_split gives a clean and reproducible split.
from sklearn.model_selection import train_test_split

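# Rows without a target cannot be used for supervised learning, so drop them first.
df = df.dropna(subset=["rent"])

X = df.drop(columns=["rent"])
y = df["rent"]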

# 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# From the remaining 80%, make 25% validation -> 20% overall
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

(len(X_train), len(X_val), len(X_test))

**Inference**
- This is a safe pattern: test is held out early and never reused.
- Validation is where we compare models.
- Fixing `random_state` makes your demo repeatable.
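The two-step split yields 60% / 20% / 20% because 25% of the remaining 80% equals 20% of the original data. A quick check on a stand-in list of 1,000 row ids (an assumption for illustration, not the real dataset size):

```python
from sklearn.model_selection import train_test_split

rows = list(range(1000))  # stand-in for 1,000 listings

trainval, test = train_test_split(rows, test_size=0.2, random_state=42)
train, val = train_test_split(trainval, test_size=0.25, random_state=42)

# 0.25 of the remaining 80% is 20% of the original data.
print(len(train), len(val), len(test))
```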

## 5) Exploratory Data Analysis (EDA)

EDA builds intuition.
We ask:
- What does rent look like?
- Which features have a visible relationship with rent?
- How do categories like city and furnishing differ?

We format currency as Indian Rupees (₹).

In [None]:
# Matplotlib is a common plotting library.
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

def format_inr(x, pos=None):
    # Example: 120000 -> "₹ 120,000" (groups of three; Indian style would be 1,20,000)
    try:
        return "₹ {:,}".format(int(x))
    except Exception:
        return str(x)

inr_formatter = FuncFormatter(format_inr)
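
The `format_inr` helper above relies on Python's `,` format specifier, which groups digits in threes (`120,000`). Indian convention keeps the last three digits together and then groups in pairs (`1,20,000`). A minimal sketch of that grouping follows; `format_inr_indian` is a hypothetical helper name and is not part of the original notebook:

```python
def format_inr_indian(value, pos=None):
    # Round to whole rupees and keep track of the sign separately.
    n = int(round(value))
    sign = "-" if n < 0 else ""
    digits = str(abs(n))

    # Up to three digits need no separator.
    if len(digits) <= 3:
        return f"{sign}₹ {digits}"

    # The last three digits form one group; the rest is grouped in pairs from the right.
    head, tail = digits[:-3], digits[-3:]
    pairs = []
    while len(head) > 2:
        pairs.insert(0, head[-2:])
        head = head[:-2]
    pairs.insert(0, head)

    return f"{sign}₹ {','.join(pairs)},{tail}"


print(format_inr_indian(120000))
print(format_inr_indian(12345678))
```

It accepts the same `(x, pos)` arguments as `format_inr`, so it could be wrapped in `FuncFormatter` in the same way if Indian grouping is preferred on the axes.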

### 5.1 Rent distribution

We plot a histogram of rent values.
This shows skew and typical ranges.

In [None]:
plt.figure(figsize=(10.5, 5.5))
plt.hist(y_train.dropna(), bins=45, color="#4C78A8", edgecolor="white", alpha=0.9)
plt.gca().xaxis.set_major_formatter(inr_formatter)
plt.title("Rent Distribution (Train Set)")
plt.xlabel("Monthly Rent")
plt.ylabel("Number of Listings")
plt.grid(axis="y", alpha=0.25)
plt.show()

**Inference**
- Right skew is common in rent data.
- RMSE is sensitive to the high-rent tail.
- Do not assume “most rentals are expensive” just because the tail exists.
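Skew can be quantified, and a log transform often reduces it. The sketch below uses synthetic lognormal "rents" (the distribution and its parameters are assumptions for illustration, not a claim about this dataset) and compares skewness before and after taking logs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "rents"; the lognormal parameters are arbitrary.
toy_rents = rng.lognormal(mean=10.0, sigma=0.8, size=5000)

def skewness(x):
    # Third standardized moment: close to 0 for a symmetric distribution.
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

raw_skew = skewness(toy_rents)
log_skew = skewness(np.log(toy_rents))

print(f"skew of rent:      {raw_skew:.2f}")
print(f"skew of log(rent): {log_skew:.2f}")
```

A clearly positive value for the raw data and a value near zero after the log is the pattern to look for when running the same check on `y_train`.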

### 5.2 Numeric features vs rent

We plot rent against each numeric feature.
This makes trends and outliers visible.

In [None]:
numeric_cols = ["area", "beds", "bathrooms", "balconies", "area_rate"]
numeric_cols = [c for c in numeric_cols if c in df.columns]

for col in numeric_cols:
    plt.figure(figsize=(9.5, 5.5))
    # Plot the train split only, so the test set stays unseen during EDA.
    plt.scatter(X_train[col], y_train, alpha=0.25, s=20, color="#F58518")
    plt.gca().yaxis.set_major_formatter(inr_formatter)
    plt.title(f"{col} vs Rent (Train Set)")
    plt.xlabel(col)
    plt.ylabel("Monthly Rent")
    plt.grid(alpha=0.2)
    plt.show()

**Inference**
- `area` usually shows the clearest upward trend.
- Discrete counts (`beds`, `bathrooms`) create vertical “bands”, one column of points per count value.
- Outliers matter. They can drive model error and influence model choice.

### 5.3 Categorical overview

First, we check how many unique categories exist.
Then we visualize counts.

In [None]:
categorical_cols = ["city", "furnishing", "locality"]
categorical_cols = [c for c in categorical_cols if c in df.columns]

{c: df[c].nunique(dropna=True) for c in categorical_cols}

**Inference**
- `city` and `furnishing` are low-cardinality. They are great for one-hot encoding.
- `locality` can be high-cardinality. It can create many one-hot columns.
- For a first model, we keep features simple.
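One common way to tame a high-cardinality column such as `locality` is to merge rare values into a shared label before one-hot encoding. A minimal sketch with invented locality names; `min_count` is an arbitrary illustrative threshold, and in a real workflow the frequent set would be learned from the train split only:

```python
import pandas as pd

# Invented locality values; "min_count" is an arbitrary illustrative threshold.
toy_locality = pd.Series(
    ["Andheri"] * 5 + ["Bandra"] * 4 + ["Powai", "Worli", "Kurla"]
)
min_count = 3

counts = toy_locality.value_counts()
frequent = counts[counts >= min_count].index

# Keep frequent localities; replace everything else with a shared "Other" label.
locality_grouped = toy_locality.where(toy_locality.isin(frequent), "Other")

print(toy_locality.nunique(), "->", locality_grouped.nunique())
```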

In [None]:
city_counts = df["city"].value_counts().head(15)

plt.figure(figsize=(11, 6))
plt.bar(city_counts.index.astype(str), city_counts.values, color="#54A24B")
plt.title("Top Cities by Listing Count")
plt.xlabel("City")
plt.ylabel("Number of Listings")
plt.xticks(rotation=30, ha="right")
plt.grid(axis="y", alpha=0.25)
plt.show()

**Inference**
- If one city dominates, the model can become “city-biased”.
- Check error by city before you deploy. Section 12.1 does a quick version.

In [None]:
furn_counts = df["furnishing"].value_counts()

plt.figure(figsize=(8.5, 5.2))
plt.bar(furn_counts.index.astype(str), furn_counts.values, color="#E45756")
plt.title("Furnishing Distribution")
plt.xlabel("Furnishing")
plt.ylabel("Number of Listings")
plt.xticks(rotation=20, ha="right")
plt.grid(axis="y", alpha=0.25)
plt.show()

**Inference**
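- This chart shows listing counts only. It says nothing about rent yet.
- Furnishing often acts as a “step feature” for rent, which is a practical reason to keep categorical features. Confirm it with a rent-by-furnishing boxplot.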

### 5.4 Rent spread by category

Boxplots show rent distributions by category.
We plot top cities to keep the chart readable.

In [None]:
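# Use the train split only, so the test set stays unseen during EDA.
train_df = X_train.assign(rent=y_train)

top_cities = train_df["city"].value_counts().head(8).index.tolist()
df_top = train_df[train_df["city"].isin(top_cities)].copy()

groups = [df_top.loc[df_top["city"] == c, "rent"].dropna().values for c in top_cities]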

plt.figure(figsize=(12, 6))
plt.boxplot(
    groups,
    labels=top_cities,
    showfliers=False,
    patch_artist=True,
    boxprops=dict(facecolor="#72B7B2", alpha=0.85),
    medianprops=dict(color="#000000"),
)
plt.gca().yaxis.set_major_formatter(inr_formatter)
plt.title("Rent Distribution by City (Top 8 Cities)")
plt.xlabel("City")
plt.ylabel("Monthly Rent")
plt.xticks(rotation=25, ha="right")
plt.grid(axis="y", alpha=0.25)
plt.show()

**Inference**
- City is not just noise. It shifts the entire rent range.
- A single global model must learn both local and global patterns.

## 6) Correlation (numeric only)

Correlation helps us rank numeric relationships quickly.
It does not prove causation.

In [None]:
# Correlations are computed on the train split only, so the test set stays unseen.
corr = X_train[numeric_cols].assign(rent=y_train).corr(numeric_only=True)
corr["rent"].sort_values(ascending=False)

**Inference**
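- `area` and `area_rate` often rank high. Check how `area_rate` was computed: if it is derived from rent (for example, rent divided by area), it leaks the target and should not be a feature.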
- Categorical effects do not show here. Use boxplots for that.
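Pearson correlation measures linear association only. A low value does not mean a feature is useless. The toy example below (synthetic values, unrelated to the rental data) has a target fully determined by the feature, yet the correlation is essentially zero:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
y_toy = x ** 2  # fully determined by x, but not linearly

r = np.corrcoef(x, y_toy)[0, 1]
print(f"Pearson r = {r:.3f}")
```

This is one more reason to read the scatter plots alongside the correlation table.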

## 7) Feature selection (first version)

We start with a clean feature set.
We keep it interpretable and stable.

We will use:
- numeric: `area`, `beds`, `bathrooms`, `balconies`, `area_rate`
- categorical: `city`, `furnishing`

We skip:
- `house_type` (long text)
- `locality` (can be high-cardinality)

In [None]:
num_features = ["area", "beds", "bathrooms", "balconies", "area_rate"]
cat_features = ["city", "furnishing"]

X_train_sel = X_train[num_features + cat_features].copy()
X_val_sel   = X_val[num_features + cat_features].copy()
X_test_sel  = X_test[num_features + cat_features].copy()

X_train_sel.head()

**Inference**
- Start simple. Then iterate.
- If you add `locality`, re-check training time and feature explosion.

## 8) Preprocessing with pipelines

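A pipeline chains preprocessing and modeling steps into one object.
It also helps prevent data leakage, because every step is fitted only on the data passed to `fit`.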

We will build a **preprocessor**:
- numeric: impute missing values using median
- categorical: impute missing using most frequent, then one-hot encode

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])

categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, num_features),
    ("cat", categorical_pipeline, cat_features),
])

preprocess

**Inference**
- We do not “clean the DataFrame by hand”.
- The pipeline learns imputation values from training data only.
- One-hot makes categories usable for most ML models.
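The two behaviors that matter most here can be checked on tiny arrays. This sketch uses invented values: the imputer fills new data with the median learned during `fit`, and `handle_unknown="ignore"` encodes an unseen category as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Invented values for illustration.
area_train = np.array([[500.0], [700.0], [900.0], [np.nan]])
area_new = np.array([[np.nan], [5000.0]])

imputer = SimpleImputer(strategy="median").fit(area_train)
# The fill value is the training median, regardless of what the new data looks like.
area_new_filled = imputer.transform(area_new)

city_train = np.array([["Delhi"], ["Mumbai"], ["Delhi"]], dtype=object)
encoder = OneHotEncoder(handle_unknown="ignore").fit(city_train)
# "Pune" was never seen during fit, so it encodes as all zeros instead of raising.
unseen = encoder.transform(np.array([["Pune"]], dtype=object)).toarray()

print(imputer.statistics_, area_new_filled.ravel(), unseen)
```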

## 9) Cross-validation (CV) for model comparison

CV splits the training set into k folds and trains the model k times, each time holding out a different fold for scoring.
This reduces the risk of judging a model on one lucky or unlucky split.

We will compare three models:
- Linear Regression
- Ridge Regression
- Decision Tree Regressor

We use RMSE as the main metric.

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
import numpy as np

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=1.0)": Ridge(alpha=1.0, random_state=42),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
}

def cv_rmse_scores(model, X, y):
    pipe = Pipeline(steps=[("preprocess", preprocess), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    return -scores  # convert to positive RMSE

cv_scores = {}
for name, model in candidates.items():
    cv_scores[name] = cv_rmse_scores(model, X_train_sel, y_train)

cv_scores

**Inference**
- You now see fold-by-fold RMSE for each model.
- Lower RMSE is better.
- High variation across folds suggests instability.
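To make the fold-by-fold numbers less of a black box, the sketch below reproduces `cross_val_score` with an explicit loop. It runs on synthetic data (the coefficients and noise level are arbitrary assumptions), and all names ending in `_demo` are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data; the coefficients and noise level are arbitrary.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: fit on four folds, score RMSE on the held-out fold.
manual_rmse = []
for train_idx, test_idx in cv_demo.split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    fold_preds = fold_model.predict(X_demo[test_idx])
    manual_rmse.append(np.sqrt(np.mean((y_demo[test_idx] - fold_preds) ** 2)))

# Library call: scores are negated RMSE, so flip the sign.
library_rmse = -cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=cv_demo, scoring="neg_root_mean_squared_error",
)

print(np.round(manual_rmse, 3))
print(np.round(library_rmse, 3))
```

Both lines should print the same five values, one per held-out fold.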

In [None]:
# Visualize CV RMSE distribution per model
plt.figure(figsize=(10.5, 5.5))
plt.boxplot(
    [cv_scores[k] for k in cv_scores.keys()],
    labels=list(cv_scores.keys()),
    patch_artist=True,
    boxprops=dict(facecolor="#9D755D", alpha=0.75),
    medianprops=dict(color="#000000")
)
plt.title("Cross-Validation RMSE by Model (Train Set)")
plt.ylabel("RMSE")
plt.grid(axis="y", alpha=0.25)
plt.show()

# Show mean ± std
[(k, float(np.mean(v)), float(np.std(v))) for k, v in cv_scores.items()]

**Inference**
- A model can have a low mean RMSE but high variance (risky choice).
- Ridge often improves stability over plain Linear Regression.
- Trees can overfit and show higher variance.

## 10) Train on train split and evaluate on validation split

Cross-validation compares models inside the training data.
Validation is a separate held-out set.
It is a strong signal for model selection here.

We train each model once on `train` and score on `validation`.

In [None]:
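from sklearn.base import clone
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(model, X_tr, y_tr, X_eval, y_eval):
    # Clone so every pipeline owns its own unfitted preprocessor and model.
    pipe = Pipeline(steps=[("preprocess", clone(preprocess)), ("model", clone(model))])
    pipe.fit(X_tr, y_tr)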
    preds = pipe.predict(X_eval)
    rmse = float(np.sqrt(mean_squared_error(y_eval, preds)))
    r2 = float(r2_score(y_eval, preds))
    return pipe, preds, rmse, r2

val_results = []
trained_pipes = {}
val_preds = {}

for name, model in candidates.items():
    pipe, preds, rmse, r2 = evaluate(model, X_train_sel, y_train, X_val_sel, y_val)
    trained_pipes[name] = pipe
    val_preds[name] = preds
    val_results.append((name, rmse, r2))

sorted(val_results, key=lambda x: x[1])

**Inference**
- Validation RMSE is the main number we use to pick the final model.
- R² helps sanity-check fit quality, but RMSE is easier to interpret in ₹.
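Both metrics are short formulas. A minimal sketch that computes them by hand on three invented listings, which can help when reading the numbers above:

```python
import numpy as np

# Invented rents and predictions in rupees.
toy_actual = np.array([10000.0, 20000.0, 30000.0])
toy_predicted = np.array([12000.0, 18000.0, 33000.0])

toy_residuals = toy_actual - toy_predicted
toy_rmse = np.sqrt(np.mean(toy_residuals ** 2))

ss_res = np.sum(toy_residuals ** 2)                    # variation the model leaves unexplained
ss_tot = np.sum((toy_actual - toy_actual.mean()) ** 2) # variation around the mean rent
toy_r2 = 1.0 - ss_res / ss_tot

print(f"RMSE = {toy_rmse:,.0f}, R2 = {toy_r2:.3f}")
```

RMSE is expressed in rupees, while R² is unitless and compares the model against always predicting the mean rent.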

### 10.1 Validation residuals (per model)

Residual = actual - predicted.
We want residuals centered near 0 with reasonable spread.

In [None]:
plt.figure(figsize=(11, 6))
for name, preds in val_preds.items():
    residuals = y_val.values - preds
    plt.hist(residuals, bins=35, alpha=0.45, label=name)

plt.gca().xaxis.set_major_formatter(inr_formatter)
plt.title("Validation Residual Distributions (Actual - Predicted)")
plt.xlabel("Residual (₹)")
plt.ylabel("Count")
plt.legend()
plt.grid(axis="y", alpha=0.25)
plt.show()

**Inference**
- Narrower residual histograms usually indicate lower error.
- Long tails often come from premium listings and outliers.

## 11) Final model selection and test evaluation

We pick the best model based on validation RMSE.
Then we refit on **train + validation**.
Finally, we evaluate once on the test set.

In [None]:
best_name, best_rmse, best_r2 = sorted(val_results, key=lambda x: x[1])[0]
best_name, best_rmse, best_r2

**Inference**
- We are done selecting. We do not try more models after this point.
- The next step is final testing.

In [None]:
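from sklearn.base import clone

# Clone so the refit does not overwrite the pipelines already fitted on the train split.
best_model = clone(candidates[best_name])

X_trainval_sel = X_trainval[num_features + cat_features].copy()

final_pipe = Pipeline(steps=[("preprocess", clone(preprocess)), ("model", best_model)])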
final_pipe.fit(X_trainval_sel, y_trainval)

test_preds = final_pipe.predict(X_test_sel)

test_rmse = float(np.sqrt(mean_squared_error(y_test, test_preds)))
test_r2 = float(r2_score(y_test, test_preds))

test_rmse, test_r2

**Inference**
- This test RMSE is our best estimate of performance on new listings, provided they resemble the data we collected.
- Do not tune hyperparameters using the test set.

## 12) Error analysis on the test set

We inspect:
- residual distribution
- biggest mistakes (largest absolute error)

This helps identify where the model fails.

In [None]:
test_residuals = y_test.values - test_preds

plt.figure(figsize=(10.5, 5.5))
plt.hist(test_residuals, bins=40, color="#B279A2", edgecolor="white", alpha=0.9)
plt.gca().xaxis.set_major_formatter(inr_formatter)
plt.title(f"Test Residual Distribution (Actual - Predicted) — {best_name}")
plt.xlabel("Residual (₹)")
plt.ylabel("Count")
plt.grid(axis="y", alpha=0.25)
plt.show()

**Inference**
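- If residuals are centered away from 0, the model is biased: a shift to the right (positive residuals) means it under-predicts, and a shift to the left means it over-predicts.
- A wide spread means predictions are imprecise, even if they are unbiased on average.
- Outliers show up as long tails.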
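A histogram shows bias visually; two summary numbers make it explicit. The sketch below uses invented values where the model predicts too low on every listing:

```python
import numpy as np

# Invented values: the model predicts too low on every listing.
toy_actual = np.array([20000.0, 35000.0, 50000.0, 80000.0])
toy_predicted = np.array([18000.0, 33000.0, 45000.0, 60000.0])

toy_residuals = toy_actual - toy_predicted
mean_residual = toy_residuals.mean()        # sensitive to the high-rent tail
median_residual = np.median(toy_residuals)  # robust to a few extreme listings

# Positive values mean actual > predicted, i.e. the model under-predicts.
print(mean_residual, median_residual)
```

The same two lines applied to `test_residuals` give a quick numeric check to pair with the plot.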

In [None]:
errors = X_test_sel.copy()
errors["actual_rent"] = y_test.values
errors["predicted_rent"] = test_preds
errors["abs_error"] = np.abs(errors["actual_rent"] - errors["predicted_rent"])

errors.sort_values("abs_error", ascending=False).head(15)

**Inference**
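- Use this table to look for patterns among the worst predictions, such as a shared city or an unusually large `area`.
- Those patterns tell you which cases need better features or different modeling.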

### 12.1 Error by city (quick check)

City can shift the rent baseline.
We check average absolute error by city on the test set.

In [None]:
city_err = errors.groupby("city")["abs_error"].mean().sort_values(ascending=False)
print(city_err)

**Inference**
- If one city has much higher error, you may need more data or better features for that city.
- This is a common production issue in geo-based models.
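Average absolute error tends to be larger wherever rents are larger, so an expensive city can look worse even when its predictions are proportionally better. One way to account for this is to add a relative error column. This is a suggestion rather than part of the original analysis, and the values below are invented:

```python
import pandas as pd

# Invented test-set results for two cities.
toy_results = pd.DataFrame({
    "city": ["Mumbai", "Mumbai", "Pune", "Pune"],
    "actual_rent": [80000.0, 100000.0, 20000.0, 25000.0],
    "predicted_rent": [72000.0, 110000.0, 16000.0, 30000.0],
})

toy_results["abs_error"] = (toy_results["actual_rent"] - toy_results["predicted_rent"]).abs()
# Relative error: absolute error as a fraction of the actual rent.
toy_results["pct_error"] = toy_results["abs_error"] / toy_results["actual_rent"]

by_city = toy_results.groupby("city")[["abs_error", "pct_error"]].mean()
print(by_city)
```

In this toy case the city with the larger absolute error has the smaller relative error, so the two views can rank cities differently.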

## 13) Recap and next steps

You built a realistic ML workflow:
- EDA with readable plots
- train / validation / test split
- preprocessing pipeline (missing values + one-hot)
- model comparison using cross-validation
- model selection using validation
- final evaluation on the test set
- error analysis

**Next steps**
- Try stronger models (Random Forest, Gradient Boosting)
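- Transform the target (for example, predict `log(rent)` or rent per sqft). Both are derived from `rent`, so use them as targets, never as input features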
- Add locality with high-cardinality strategies
- Add hyperparameter tuning after you lock the baseline
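
As a starting point for the tuning step, here is a hedged sketch of a grid search over Ridge's `alpha` inside a pipeline. It runs on synthetic data; the grid values and the names `tuning_pipe`, `X_demo`, and `y_demo` are illustrative assumptions. In this notebook the search would be fitted on the train split only, with `preprocess` as the first pipeline step.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic regression data; the coefficients and noise level are arbitrary.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=150)

tuning_pipe = Pipeline(steps=[("model", Ridge())])

# "model__alpha" targets the alpha parameter of the step named "model".
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    tuning_pipe, param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(-search.best_score_, 3))
```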