
# End-to-End Regression: Predicting House Rent in Indian Cities

This notebook walks through a complete machine learning workflow using a real Indian rental housing dataset.

**Goal:** Predict monthly rent using housing attributes.  
**Focus:** Practical ML workflow, not mathematical depth.  
**Style:** Clean steps, clear reasoning, reusable code.



## 1. Load the data

We begin by loading the dataset and taking a first look at its structure.


In [None]:

import pandas as pd

df = pd.read_csv("data/indian_house_rent.csv")
df.head()



## 2. Understand the data

We inspect the schema, data types, and basic statistics.
This helps identify missing values and feature types.


In [None]:

df.info()


In [None]:

df.describe()
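
Because this step is meant to surface missing values, a per-column count makes them explicit. The sketch below runs on a small made-up frame (`sample` and all of its values are hypothetical, not taken from the dataset); in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for df; values are invented for illustration.
sample = pd.DataFrame({
    "area": [650.0, np.nan, 1200.0, 900.0],
    "furnishing": ["Furnished", None, "Unfurnished", "Semi-Furnished"],
    "rent": [15000, 22000, 40000, 28000],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": sample.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```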



We also inspect categorical distributions to understand location and furnishing patterns.


In [None]:

df["city"].value_counts()


In [None]:

df["furnishing"].value_counts()



## 3. Define the machine learning task

- **Target:** `rent` (continuous value)
- **Type:** Supervised regression problem


In [None]:

X = df.drop(columns="rent")
y = df["rent"]



## 4. Train–test split

We separate data into training and test sets.
The test set simulates unseen future data.


In [None]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
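
A quick sanity check on the split can confirm the 80/20 sizes, that a fixed `random_state` reproduces the same rows, and that no row lands in both sets. This is a minimal sketch on synthetic data; `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for X and y (100 rows, invented values).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"area": rng.integers(300, 3000, size=100)})
y_demo = pd.Series(rng.integers(5000, 80000, size=100), name="rent")

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # 80 20

# Same seed, same rows in the test set.
same_rows = X_te.index.equals(split_b[1].index)
# No row appears in both the training and the test set.
no_overlap = X_tr.index.intersection(X_te.index).empty
print(same_rows, no_overlap)
```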



## 5. Exploratory Data Analysis (EDA)

We explore key relationships using simple visualizations.


In [None]:

import matplotlib.pyplot as plt

plt.hist(y_train, bins=50)
plt.xlabel("Monthly Rent")
plt.ylabel("Count")
plt.title("Rent Distribution")
plt.show()
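
Rent distributions are often right-skewed, although that is an assumption to check against the histogram rather than a given for this dataset. The sketch below measures skewness before and after a `log1p` transform on synthetic log-normal values (`y_demo` is hypothetical); in the notebook the same two calls would be applied to `y_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed rents drawn from a log-normal; not the real data.
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.lognormal(mean=10, sigma=0.8, size=5000), name="rent")

# Skewness near 0 indicates a roughly symmetric distribution.
skew_raw = y_demo.skew()
skew_log = np.log1p(y_demo).skew()

print(f"skew (raw):   {skew_raw:.2f}")
print(f"skew (log1p): {skew_log:.2f}")
```

If the real target turns out to be strongly skewed, modelling `log1p(rent)` is one option to consider, since a handful of very expensive listings would otherwise dominate RMSE.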


In [None]:

# Plot the training split only, so the test set stays unseen during EDA.
plt.scatter(X_train["area"], y_train, alpha=0.3)
plt.xlabel("Area (sqft)")
plt.ylabel("Rent")
plt.title("Area vs Rent")
plt.show()



## 6. Correlation analysis

We check numeric correlations to build intuition.


In [None]:

numeric_cols = ["area", "beds", "bathrooms", "balconies", "area_rate"]
# Correlate on the training split only, so the test set stays unseen.
X_train[numeric_cols].corrwith(y_train).sort_values(ascending=False)
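
One thing worth checking before trusting these correlations is whether any feature is derived from the target. If `area_rate` means rent per square foot (an assumption; the column is not defined here), then `area * area_rate` would reproduce `rent` almost exactly. The sketch below builds a hypothetical frame (`demo`) where that is true by construction and shows the check; in the notebook the same comparison would be run on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where area_rate is defined as rent / area (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"area": rng.integers(300, 3000, size=200).astype(float)})
demo["rent"] = demo["area"] * rng.uniform(15, 60, size=200)
demo["area_rate"] = demo["rent"] / demo["area"]

# If this reconstruction matches rent, area_rate leaks the target.
reconstructed = demo["area"] * demo["area_rate"]
share_matching = np.isclose(reconstructed, demo["rent"], rtol=0.01).mean()

print(f"rows where area * area_rate ~= rent: {share_matching:.0%}")
```

A share near 100% on the real data would suggest dropping `area_rate` from the features, since it would not be known for a listing whose rent is still to be predicted.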



## 7. Feature selection

We keep clean, interpretable features.
Free-text columns are excluded for simplicity.


In [None]:

num_features = ["area", "beds", "bathrooms", "balconies", "area_rate"]
cat_features = ["city", "furnishing"]

X_train = X_train[num_features + cat_features]
X_test = X_test[num_features + cat_features]



## 8. Preprocessing pipelines

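We prepare numeric and categorical features separately: missing numeric values are filled with the median, while categorical values are filled with the most frequent category and then one-hot encoded.
Wrapping these steps in a pipeline means the imputation statistics and category lists are learned from the training data only, which prevents data leakage and keeps the transformations reusable.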


In [None]:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, num_features),
    ("cat", categorical_pipeline, cat_features)
])



## 9. Model training

We train three regression models using the same preprocessing pipeline, then compare them on the test set using RMSE and R².


In [None]:

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "DecisionTree": DecisionTreeRegressor(random_state=42)
}


In [None]:

from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

results = {}

for name, model in models.items():
    pipe = Pipeline([
        ("preprocess", preprocess),
        ("model", model)
    ])
    
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    r2 = r2_score(y_test, preds)
    
    results[name] = {"RMSE": rmse, "R2": r2}

results
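
RMSE values are easier to judge against a naive reference. The sketch below compares a linear model with scikit-learn's `DummyRegressor`, which here always predicts the training median. The data is synthetic and the `rmse` helper is hypothetical, so the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: rent grows with area plus noise (invented relationship).
rng = np.random.default_rng(0)
area = rng.uniform(300, 3000, size=500)
rent = 25 * area + rng.normal(0, 5000, size=500)
X_demo = area.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, rent, test_size=0.2, random_state=42
)

def rmse(model):
    """Fit on the training split and return RMSE on the test split."""
    model.fit(X_tr, y_tr)
    return np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

baseline_rmse = rmse(DummyRegressor(strategy="median"))
linear_rmse = rmse(LinearRegression())

print(f"baseline RMSE: {baseline_rmse:.0f}")
print(f"linear RMSE:   {linear_rmse:.0f}")
```

In the notebook, adding a `DummyRegressor` entry to `models` would put the baseline in the same `results` table as the other three.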



## 10. Error analysis

We inspect where predictions fail the most.


In [None]:

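from sklearn.base import clone

# Pick the model with the lowest test RMSE from the comparison above,
# rather than assuming which one performed best.
best_name = min(results, key=lambda name: results[name]["RMSE"])

best_model = Pipeline([
    ("preprocess", preprocess),
    ("model", clone(models[best_name]))
])

best_model.fit(X_train, y_train)
preds = best_model.predict(X_test)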

errors = X_test.copy()
errors["actual_rent"] = y_test
errors["predicted_rent"] = preds
errors["abs_error"] = abs(errors["actual_rent"] - errors["predicted_rent"])

errors.sort_values("abs_error", ascending=False).head(10)
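
Beyond the ten worst rows, grouping errors by a categorical feature can show whether the model struggles in particular segments. This is a minimal sketch on a hypothetical table (`errors_demo`, invented values) with the same column names as `errors`.

```python
import pandas as pd

# Hypothetical error table with the same columns as `errors`; values invented.
errors_demo = pd.DataFrame({
    "city": ["Mumbai", "Mumbai", "Delhi", "Delhi", "Pune"],
    "actual_rent": [60000, 45000, 20000, 25000, 18000],
    "predicted_rent": [40000, 50000, 21000, 23000, 18500],
})
errors_demo["abs_error"] = (
    errors_demo["actual_rent"] - errors_demo["predicted_rent"]
).abs()

# Row count, mean, and median absolute error per city, worst first.
by_city = (
    errors_demo.groupby("city")["abs_error"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(by_city)
```

Reporting the median next to the mean helps separate a city with uniformly poor predictions from one where a single extreme listing inflates the average.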



## 11. Wrap-up

- We walked through an end-to-end regression workflow.
- The same structure applies to many ML problems.
- Classification follows a similar pipeline with different targets and metrics.

**Next steps:**
- Cross-validation
- Better models (Random Forest, Gradient Boosting)
- Feature engineering
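
As a starting point for the cross-validation step, the sketch below scores a full pipeline with 5-fold cross-validation, so preprocessing is refit inside each fold. The data is synthetic and `X_demo`, `y_demo`, and `pipe` are hypothetical names; in the notebook the same call would take the real pipeline with `X_train` and `y_train`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data; columns mirror a subset of the notebook's features.
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "area": rng.uniform(300, 3000, size=n),
    "city": rng.choice(["Mumbai", "Delhi", "Pune"], size=n),
})
y_demo = 25 * X_demo["area"] + rng.normal(0, 5000, size=n)

pipe = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["area"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])),
    ("model", Ridge(alpha=1.0)),
])

# scikit-learn maximises scores, so RMSE is reported as a negative value.
scores = cross_val_score(
    pipe, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -scores

print(f"RMSE per fold: {np.round(cv_rmse)}")
print(f"mean RMSE: {cv_rmse.mean():.0f} (std {cv_rmse.std():.0f})")
```

Comparing models on cross-validated RMSE from the training data leaves the test set for a single final evaluation.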
