# Linear Regression for House Prices
Linear regression models a continuous output as a weighted sum of the inputs (features) plus an intercept, so it assumes the relationship is linear. Here we predict house prices from **area**, **bedrooms**, and **location**. To judge how well the model generalises, we use **K-Fold Cross-Validation**: the data is split into K folds, the model is trained K times with a different fold held out for testing each time, and the scores are averaged. This makes the performance estimate less dependent on any single train/test split.
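
To make the splitting concrete, here is a minimal sketch of how `KFold` partitions row indices. It uses 10 placeholder rows (an assumption chosen to match the size of the sample data below) and the same splitter settings as the rest of this notebook. The name `folds` is only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 10 placeholder samples; only the row count matters for the split
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    folds.append((train_idx, test_idx))
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each row lands in the test set exactly once, and every fold trains on the remaining 8 rows.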

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

In [None]:
# Sample data: price is in thousands of dollars
data = pd.DataFrame({
    "area_sqft": [800, 950, 1100, 1300, 1500, 1700, 1900, 2100, 2300, 2500],
    "bedrooms": [2, 2, 3, 3, 3, 4, 4, 4, 5, 5],
    "location": ["City", "Suburb", "City", "Suburb", "City", "City", "Suburb", "Suburb", "City", "Suburb"],
    "price_k": [180, 190, 220, 240, 260, 310, 330, 350, 370, 390]
})

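# One-hot encode location so the model receives numeric inputs.
# drop_first=True keeps a single indicator column (location_Suburb);
# "City" becomes the baseline, which avoids a redundant, perfectly collinear column.
features = pd.get_dummies(data[["area_sqft", "bedrooms", "location"]], drop_first=True, dtype=float)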
target = data["price_k"]
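
As a quick illustration of what the encoding step produces, the sketch below runs `pd.get_dummies` on a three-row toy frame (made-up rows, not the sample data) and inspects the resulting columns. The names `toy` and `encoded` are hypothetical.

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "area_sqft": [800, 950, 1100],
    "location": ["City", "Suburb", "City"],
})

# With drop_first=True the first category ("City") is dropped and
# becomes the implicit baseline; only location_Suburb remains.
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)
print(encoded.columns.tolist())
print(encoded)
```

A row with `location_Suburb` equal to 0 is a City house, so no information is lost by dropping the first column.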

In [None]:
# Create the regression model and a K-Fold splitter
model = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negative MSE (so that higher is always better); negate it to recover the MSE
mse_scores = -cross_val_score(model, features, target, cv=kf, scoring="neg_mean_squared_error")
# With 10 rows and 5 folds, each test fold holds only 2 rows, so per-fold R^2 is very noisy
r2_scores = cross_val_score(model, features, target, cv=kf, scoring="r2")

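# price_k is in thousands of dollars, so MSE is in (thousand $)^2 and RMSE in thousand $
print("Mean MSE across folds:", mse_scores.mean())
print("Mean RMSE across folds:", np.sqrt(mse_scores).mean())
print("Mean R^2 across folds:", r2_scores.mean())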
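
To show what `cross_val_score` does under the hood, the following sketch repeats the evaluation with an explicit loop over the folds. It rebuilds the sample data so it can run on its own. The names `fold_model`, `manual_mse`, and `library_mse` are illustrative, and the loop is a simplified view of the procedure rather than scikit-learn's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Same sample data as above; price is in thousands of dollars
data = pd.DataFrame({
    "area_sqft": [800, 950, 1100, 1300, 1500, 1700, 1900, 2100, 2300, 2500],
    "bedrooms": [2, 2, 3, 3, 3, 4, 4, 4, 5, 5],
    "location": ["City", "Suburb", "City", "Suburb", "City",
                 "City", "Suburb", "Suburb", "City", "Suburb"],
    "price_k": [180, 190, 220, 240, 260, 310, 330, 350, 370, 390],
})
features = pd.get_dummies(
    data[["area_sqft", "bedrooms", "location"]], drop_first=True, dtype=float
)
target = data["price_k"]

kf = KFold(n_splits=5, shuffle=True, random_state=42)

manual_mse = []
for train_idx, test_idx in kf.split(features):
    # Fit a fresh model on the training rows of this fold only
    fold_model = LinearRegression()
    fold_model.fit(features.iloc[train_idx], target.iloc[train_idx])
    # Score on the held-out rows the model has not seen
    preds = fold_model.predict(features.iloc[test_idx])
    manual_mse.append(mean_squared_error(target.iloc[test_idx], preds))
manual_mse = np.array(manual_mse)

# The library call should give the same per-fold values
library_mse = -cross_val_score(
    LinearRegression(), features, target, cv=kf, scoring="neg_mean_squared_error"
)

print("Per-fold MSE (manual loop):", manual_mse)
print("Matches cross_val_score:", np.allclose(manual_mse, library_mse))
```

Writing the loop out is mainly useful when you need something `cross_val_score` does not return, such as the per-fold predictions or fitted coefficients.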