# Theoretical Concepts and Linear Regression

## Part 1: Theoretical Concepts
1. Define the following terms in the context of statistical learning:
- **Training error**: The error the model makes on the training dataset. It measures how well the model fits the data it was trained on.
- **Test error**: The error the model makes on unseen data. A low test error indicates good generalization, while a high test error may suggest overfitting or underfitting.
- **Bias-variance trade-off**: The tension between bias (error from overly rigid assumptions, which makes the model miss real patterns) and variance (error from sensitivity to the particular training sample). Making a model more flexible usually lowers bias but raises variance, so the two must be balanced to minimize test error.
- **Overfitting**: When a model learns the training data too well, including the noise, and performs poorly on unseen data.
- **Model complexity**: How flexible a model is, for example the number of parameters or features it uses. A more complex model can fit the training data more closely, but it is more likely to overfit and generalize poorly to new, unseen data.
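
As a minimal sketch of the first two definitions, the snippet below fits a straight line to synthetic data and reports the error on the rows used for fitting versus the held-out rows. The data, the 150/50 split, and the helper name `mse` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y depends linearly on x plus noise
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=200)

# Hold out the last 50 rows as "unseen" test data
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

# Fit a degree-1 polynomial (a straight line) on the training rows only
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def mse(x_vals, y_vals):
    """Mean squared error of the fitted line on the given rows."""
    pred = intercept + slope * x_vals
    return float(np.mean((y_vals - pred) ** 2))

train_error = mse(x_train, y_train)
test_error = mse(x_test, y_test)
print(f"Training MSE: {train_error:.3f}")
print(f"Test MSE:     {test_error:.3f}")
```

Because the line is fitted only on the training rows, the test MSE is the more honest estimate of how the model behaves on new data.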

2.  Explain the difference between parametric and nonparametric models in statistical learning. Provide an example of each type of model. 

| Parametric Models | Nonparametric Models |
|:-----------|:------------|
| Fixed structure (the functional form is assumed in advance) | Flexible structure (the form is learned from the data) |
| Fixed, usually small, number of parameters | Data-driven; the effective number of parameters grows with the data |
| Simpler and computationally efficient | More computationally demanding and prone to overfitting with small datasets |
| Example: linear regression | Example: k-Nearest Neighbors (k-NN) |
| Suppose we model house price based on square footage. A parametric model (linear regression) assumes a straight-line relationship: as square footage increases, price increases at a fixed rate. It simplifies reality into a slope and an intercept. | Instead of assuming a straight line, k-NN looks at the prices of the most similar houses (neighbors in size, location, number of rooms, etc.) and predicts based on those. This adapts to neighborhoods where price jumps aren’t linear (e.g., houses near a beach are disproportionately expensive). |
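
The contrast in the table can be sketched in code. The example below uses synthetic house data with a deliberate price jump above 2,000 square feet; the data and the helper names `predict_linear` and `predict_knn` are assumptions for illustration, and the k-NN predictor is a hand-written simplification rather than a library implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumed): price rises with size, with a sharp jump above 2,000 sq ft
size = rng.uniform(500, 3500, size=300)
price = 100_000 + 50 * size + 150_000 * (size > 2000) + rng.normal(0, 10_000, size=300)

# Parametric: linear regression is fully described by two numbers
slope, intercept = np.polyfit(size, price, deg=1)

def predict_linear(s):
    """Prediction from the fitted straight line."""
    return float(intercept + slope * s)

# Nonparametric: k-NN keeps the training data and averages the k nearest prices
def predict_knn(s, k=10):
    """Average price of the k training houses closest in size to s."""
    nearest = np.argsort(np.abs(size - s))[:k]
    return float(price[nearest].mean())

for s in (1900, 2100):
    print(f"size={s}: linear={predict_linear(s):,.0f}  kNN={predict_knn(s):,.0f}")
```

The straight line can only spread the jump across the whole size range, while the k-NN predictions on either side of 2,000 square feet differ by roughly the size of the jump.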




3.    Discuss the bias-variance trade-off in relation to model performance. Provide an example of how a highly complex model could lead to overfitting and how a simple model might underfit the data. 
- The bias-variance trade-off explains how model complexity affects prediction:
    - **Simple models** have high bias (too rigid), which can cause *underfitting*: they miss important patterns
    - **Complex models** have high variance (too sensitive to data), which can cause *overfitting*: they memorize noise instead of generalizing
- Example:
    - A simple linear model predicting house prices from size alone may underfit
    - A deep neural network with many layers may overfit by memorizing training data
- The goal is to balance bias and variance for the best performance.
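
The trade-off can be illustrated with polynomials of increasing degree fitted to noisy samples of a sine curve. This is a sketch on synthetic data; the sample sizes, noise level, and chosen degrees are assumptions, and the exact numbers depend on the random seed.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def make_data(n):
    """Noisy samples of a sine curve (synthetic data, assumed for illustration)."""
    x = rng.uniform(0, 2 * np.pi, size=n)
    y = np.sin(x) + rng.normal(0, 0.3, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(500)

results = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit; a higher degree means a more flexible model
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = float(np.mean((y_train - poly(x_train)) ** 2))
    test_mse = float(np.mean((y_test - poly(x_test)) ** 2))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training error can only fall as the degree rises, because each model contains the simpler ones. The degree-1 line underfits the curve, and the degree-12 polynomial typically shows the lowest training error without a matching improvement on the test set.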

---


## **Part 2: Linear Regression** 

### **Fit a simple linear regression model using Size to predict Price.**

#### Step 1: Import Libraries and Load Data

In [3]:
import pandas as pd
import statsmodels.api as sm

# Load dataset (California Housing dataset)
data = pd.read_csv("dataset/housing.csv")
data.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


#### Step 2: Add a Constant (Intercept)

In [4]:

# Independent variable (predictor)
X = data["total_rooms"]

# Dependent variable (target)
y = data["median_house_value"]

# Add a constant so regression includes intercept
X = sm.add_constant(X)
X.head()

| | const | total_rooms |
|:---|:---|:---|
| 0 | 1.0 | 880.0 |
| 1 | 1.0 | 7099.0 |
| 2 | 1.0 | 1467.0 |
| 3 | 1.0 | 1274.0 |
| 4 | 1.0 | 1627.0 |


#### Step 3: Fit the Regression Model

In [5]:
# Fit simple linear regression
model = sm.OLS(y, X).fit()
model.summary()

| | | | |
|:---|:---|:---|:---|
| Dep. Variable: | median_house_value | R-squared: | 0.018 |
| Model: | OLS | Adj. R-squared: | 0.018 |
| Method: | Least Squares | F-statistic: | 378.2 |
| Date: | Tue, 16 Sep 2025 | Prob (F-statistic): | 1.69e-83 |
| Time: | 10:41:08 | Log-Likelihood: | -269680.0 |
| No. Observations: | 20640 | AIC: | 539400.0 |
| Df Residuals: | 20638 | BIC: | 539400.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|:---|:---|:---|:---|:---|:---|:---|
| const | 1.882e+05 | 1248.379 | 150.717 | 0.000 | 1.86e+05 | 1.91e+05 |
| total_rooms | 7.0960 | 0.365 | 19.448 | 0.000 | 6.381 | 7.811 |

| | | | |
|:---|:---|:---|:---|
| Omnibus: | 2504.907 | Durbin-Watson: | 0.333 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 3511.341 |
| Skew: | 0.991 | Prob(JB): | 0.0 |
| Kurtosis: | 3.393 | Cond. No. | 5370.0 |


#### Step 4: Extract Coefficients

In [6]:
# Extract intercept and slope
intercept, slope = model.params
print(f"Intercept: {intercept:.2f}, Slope: {slope:.2f}")

Intercept: 188152.52, Slope: 7.10


#### Step 5: Build the Regression Equation


The regression equation is:

$$
\text{Price} = \text{Intercept} + \text{Slope} \times \text{Total\_Rooms}
$$

Substituting the coefficients:

$$
\text{Price} = 188{,}152.52 + 7.10 \times \text{Total\_Rooms}
$$

##### Interpretation:
- **Intercept (188,152.52):**  
  This represents the predicted median house value when `total_rooms` is zero. Although not realistic in practice, it serves as the model's baseline.

- **Slope (7.10):**  
  For each additional room, the model predicts that the median house value increases by approximately **\$7.10**. Note that `total_rooms` counts all rooms in a census block group rather than in a single house, and that the model explains very little of the variation in price (R-squared = 0.018).
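
As a quick arithmetic check of the fitted equation, the sketch below plugs values into it. The coefficients are copied from the regression output; the input of 2,000 rooms and the helper name `predict_price` are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the regression output above
INTERCEPT = 188_152.52
SLOPE = 7.0960  # dollars per additional room in `total_rooms`

def predict_price(total_rooms: float) -> float:
    """Predicted median house value from the fitted simple regression line."""
    return INTERCEPT + SLOPE * total_rooms

# Hypothetical inputs, chosen only for illustration
base = predict_price(2000)
plus_one = predict_price(2001)
print(f"Prediction at 2,000 rooms: {base:,.2f}")              # 202,344.52
print(f"Change for one extra room: {plus_one - base:.2f}")    # 7.10
```

The difference between the two predictions equals the slope, which is why the slope is read as the change in predicted price per additional room.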


#### **How AI/ML Finds the Equation**

The regression model uses **Ordinary Least Squares (OLS)** to minimize the **Sum of Squared Errors (SSE):**
$$
SSE = \sum (y_i - \hat{y}_i)^2
$$

**Definitions:**

$$
y_i = \text{actual house price}
$$

$$
\hat{y}_i = \text{predicted house price}
$$

The slope and intercept are chosen to minimize this error, producing the best-fit regression line.
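
For a single predictor, the minimizing slope and intercept have a well-known closed form. The sketch below applies it to synthetic data (an assumption for illustration; it does not use the housing file) and checks that nudging either coefficient increases the SSE. The helper name `sse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data (assumed for illustration)
x = rng.uniform(0, 100, size=500)
y = 10.0 + 2.5 * x + rng.normal(0, 5.0, size=500)

# Closed-form OLS solution for one predictor:
#   slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   intercept = y_mean - slope * x_mean
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

def sse(b0, b1):
    """Sum of squared errors for the line y_hat = b0 + b1 * x."""
    return float(np.sum((y - (b0 + b1 * x)) ** 2))

best = sse(intercept, slope)
print(f"intercept={intercept:.3f}, slope={slope:.3f}, SSE={best:.1f}")

# Any other line gives a larger SSE
print(sse(intercept + 1.0, slope) > best)  # True
print(sse(intercept, slope + 0.1) > best)  # True
```

Library routines such as `sm.OLS` solve the same minimization, generalized to several predictors.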

---



### **Multiple linear regression model**

We will fit a **multiple linear regression** model to predict house price (`median_house_value`) using:

- `total_rooms` (proxy for Size)  
- `total_bedrooms` (proxy for Bedrooms)  
- `housing_median_age` (proxy for Age)  
- `longitude` (used here as proxy for distance from city center)

#### Step 1: Import Libraries and Load Data

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

# 1) Load dataset
data = pd.read_csv("dataset/housing.csv")  # <- adjust path if needed

#### Step 2: Define Variables

In [None]:
# Build a single DataFrame for aligned cleaning
df = data.loc[:, ["median_house_value", "total_rooms", "total_bedrooms", "housing_median_age", "longitude"]].copy()
df.columns = ["Price", "Size", "Bedrooms", "Age", "Distance_to_city_center"]

# Replace ±inf with NaN and drop rows with missing values in X or y
df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=["Price","Size","Bedrooms","Age","Distance_to_city_center"])

# Split into X, y and add intercept
y = df["Price"]
X = df[["Size","Bedrooms","Age","Distance_to_city_center"]]
X = sm.add_constant(X)

# (Optional sanity check)
assert not X.isna().any().any(), "X still has NaNs"
assert not np.isinf(X.values).any(), "X still has ±inf"
assert not y.isna().any(), "y has NaNs"
assert not np.isinf(y.values).any(), "y has ±inf"

#### Step 3: Fit the Model

In [None]:
# Fit OLS
model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.090
Model:                            OLS   Adj. R-squared:                  0.090
Method:                 Least Squares   F-statistic:                     503.6
Date:                Tue, 16 Sep 2025   Prob (F-statistic):               0.00
Time:                        13:44:46   Log-Likelihood:            -2.6621e+05
No. Observations:               20433   AIC:                         5.324e+05
Df Residuals:                   20428   BIC:                         5.325e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                    1

#### Step 4: Extract Coefficients

In [22]:
# Coefficients
intercept, b_size, b_bedrooms, b_age, b_long = model.params
print("\nCoefficients:")
print(f"  Intercept: {intercept:.2f}")
print(f"  Size (total_rooms): {b_size:.2f}")
print(f"  Bedrooms (total_bedrooms): {b_bedrooms:.2f}")
print(f"  Age (housing_median_age): {b_age:.2f}")
print(f"  Distance proxy (longitude): {b_long:.2f}")



Coefficients:
  Intercept: 11493.05
  Size (total_rooms): 38.64
  Bedrooms (total_bedrooms): -156.16
  Age (housing_median_age): 1700.36
  Distance proxy (longitude): -1077.33


In [None]:
# Standardized (beta) coefficients to compare impact strengths
# Standardize X and y so coefficients are unitless
features = ["Size", "Bedrooms", "Age", "Distance_to_city_center"]
scaler = StandardScaler()
X_std = scaler.fit_transform(df[features])
y_std = (y - y.mean()) / y.std()

# add_constant prepends the intercept column, so it sits at position 0
X_std = sm.add_constant(X_std)
model_beta = sm.OLS(y_std, X_std).fit()

# Pair each feature name with its coefficient, skipping the intercept
beta = np.asarray(model_beta.params)
beta_map = dict(zip(features, beta[1:]))

print("Standardized (beta) coefficients:")
for k, v in beta_map.items():
    print(f"  {k:>24s}: {v:+.4f}")


Standardized (beta) coefficients:
                      Size: +0.7314
                  Bedrooms: -0.5700
                       Age: +0.1855
   Distance_to_city_center: -0.0187



#### Step 5: Regression Equation

The regression equation is:

$$
\text{Price} = 11{,}493.05
\;+\; 38.64\cdot \text{Size}
\;-\; 156.16\cdot \text{Bedrooms}
\;+\; 1{,}700.36\cdot \text{Age}
\;-\; 1{,}077.33\cdot \text{Distance\_to\_city\_center}
$$


Where:
- **Intercept** = baseline predicted price  
- **Coefficients** = effect of each independent variable on Price
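
To show how the equation is used, the sketch below evaluates it for one hypothetical census block. The coefficients are copied from the printed output in Step 4; the input values and the helper name `predict_price` are assumptions for illustration.

```python
# Coefficients copied from the "Coefficients" output in Step 4
COEFS = {
    "Intercept": 11493.05,
    "Size": 38.64,
    "Bedrooms": -156.16,
    "Age": 1700.36,
    "Distance_to_city_center": -1077.33,  # longitude, used as a proxy
}

def predict_price(size, bedrooms, age, longitude):
    """Predicted price from the fitted multiple regression equation."""
    return (
        COEFS["Intercept"]
        + COEFS["Size"] * size
        + COEFS["Bedrooms"] * bedrooms
        + COEFS["Age"] * age
        + COEFS["Distance_to_city_center"] * longitude
    )

# Hypothetical block: 2,000 rooms, 400 bedrooms, median age 30, longitude -120
predicted = predict_price(size=2000, bedrooms=400, age=30, longitude=-120.0)
print(f"Predicted price: {predicted:,.2f}")  # 206,599.45
```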

#### Step 6: Interpretation

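- **Intercept (11,493.05):** Predicted house price when all independent variables are zero (not meaningful, since a longitude of zero is far outside the data, but it anchors the regression equation).  
- **Size (total_rooms, 38.64):** Each additional room is associated with a \$38.64 higher predicted price, holding other factors constant.  
- **Bedrooms (total_bedrooms, -156.16):** Each additional bedroom is associated with a \$156.16 lower predicted price, holding the total number of rooms and the other factors constant.  
- **Age (housing_median_age, 1,700.36):** Each additional year of median house age is associated with a \$1,700.36 higher predicted price, holding other factors constant.  
- **Longitude (proxy for distance, -1,077.33):** Each one-degree increase in longitude (a move east) is associated with a \$1,077.33 lower predicted price, holding other factors constant.  

The raw coefficients are measured in different units, so they cannot be compared directly. Based on the standardized (beta) coefficients, **Size** (+0.7314) has the **strongest impact** on house prices, followed by Bedrooms (-0.5700), Age (+0.1855), and the longitude proxy (-0.0187).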
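
The scale adjustment can be sketched as follows: a standardized coefficient equals the raw coefficient multiplied by the predictor's standard deviation and divided by the target's standard deviation. The example uses synthetic data with two predictors on very different scales; the data and the helper name `ols` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Synthetic predictors on very different scales (assumed for illustration)
x1 = rng.normal(2000, 800, size=n)  # e.g. a room count
x2 = rng.normal(30, 10, size=n)     # e.g. an age in years
y = 50_000 + 40 * x1 + 1500 * x2 + rng.normal(0, 20_000, size=n)

X = np.column_stack([x1, x2])

def ols(features, target):
    """Least-squares coefficients with the intercept in position 0."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

raw = ols(X, y)[1:]

# Refit on z-scored variables to get standardized coefficients
X_z = (X - X.mean(axis=0)) / X.std(axis=0)
y_z = (y - y.mean()) / y.std()
beta = ols(X_z, y_z)[1:]

# Same numbers from the raw fit: beta_j = b_j * sd(x_j) / sd(y)
beta_from_raw = raw * X.std(axis=0) / y.std()

print("raw coefficients:         ", np.round(raw, 2))
print("standardized coefficients:", np.round(beta, 3))
print("rescaled raw coefficients:", np.round(beta_from_raw, 3))
```

In this synthetic case the second predictor has the larger raw coefficient, yet the first has the larger standardized one, which is why raw coefficients alone should not be used to rank predictors.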



## References:
- Lecture note: https://colab.research.google.com/drive/15Lq4kiSe4JkdOJifvjS4Ac2_jMODrdoI#scrollTo=pBHLavikDQUC
- Model Complexity: https://ishanjainoffical.medium.com/model-complexity-explained-intuitively-e179e38866b6