
#Madhura Vilas Suroshe
#24102C2002
# Task
Perform a comprehensive house price prediction analysis using the "housing.csv" dataset. This analysis should include data preprocessing (handling missing values, encoding categorical features, and scaling numerical features), splitting the data into training and testing sets, and training and evaluating three different regression models: Linear Regression, Ridge Regression, and Decision Tree Regressor. For each model, calculate and report RMSE (Train), RMSE (Test), and MAE (Test). Conclude the analysis by comparing the models' performances, discussing instances of underfitting and overfitting, commenting on at least one relevant real-world machine learning issue, and providing a comprehensive summary of the findings.

Load dataset

In [None]:
import requests

# URL of the raw housing.csv file on GitHub
url = 'https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv'

# Define the local filename
filename = 'housing.csv'

# Send a GET request to the URL
response = requests.get(url, timeout=30)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Write the content to a local file
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"'{filename}' downloaded successfully.")
else:
    print(f"Failed to download '{filename}'. Status code: {response.status_code}")


import pandas as pd

df = pd.read_csv('housing.csv')

print("First 5 rows of the DataFrame:")
print(df.head())

print("\nConcise summary of the DataFrame (info()):")
df.info()

print("\nDescriptive statistics of the DataFrame (describe()):")
print(df.describe())

'housing.csv' downloaded successfully.
First 5 rows of the DataFrame:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  


## Handle Missing Values and Categorical Features

In [None]:
print("Missing values before handling:")
print(df.isnull().sum())

Missing values before handling:
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64


Handling missing values

In [None]:
median_total_bedrooms = df['total_bedrooms'].median()
df['total_bedrooms'].fillna(median_total_bedrooms, inplace=True)

print("Missing values after filling 'total_bedrooms':")
print(df.isnull().sum())

# Identify categorical features
categorical_features = df.select_dtypes(include=['object']).columns
print(f"\nCategorical features identified: {list(categorical_features)}")

# Apply one-hot encoding to 'ocean_proximity'
df = pd.get_dummies(df, columns=['ocean_proximity'], drop_first=False)

print("\nDataFrame after one-hot encoding (first 5 rows):")
print(df.head())

print("\nConcise summary of the DataFrame after preprocessing (info()):")
df.info()

Missing values after filling 'total_bedrooms':
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

Categorical features identified: ['ocean_proximity']

DataFrame after one-hot encoding (first 5 rows):
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value  \
0       322.0       126.0         8.3252            452600.0   
1      2401.0    

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['total_bedrooms'].fillna(median_total_bedrooms, inplace=True)


In [None]:
median_total_bedrooms = df['total_bedrooms'].median()
df['total_bedrooms'] = df['total_bedrooms'].fillna(median_total_bedrooms)

print("Missing values after filling 'total_bedrooms':")
print(df.isnull().sum())

# Identify categorical features
categorical_features = df.select_dtypes(include=['object']).columns
print(f"\nCategorical features identified: {list(categorical_features)}")

# Apply one-hot encoding to 'ocean_proximity' if it still exists
if 'ocean_proximity' in df.columns:
    df = pd.get_dummies(df, columns=['ocean_proximity'], drop_first=False)
    print("\n'ocean_proximity' column successfully one-hot encoded.")
else:
    print("\n'ocean_proximity' column not found, likely already one-hot encoded in a previous step.")

print("\nDataFrame after preprocessing (first 5 rows):")
print(df.head())

print("\nConcise summary of the DataFrame after preprocessing (info()):")
df.info()

Missing values after filling 'total_bedrooms':
longitude                     0
latitude                      0
housing_median_age            0
total_rooms                   0
total_bedrooms                0
population                    0
households                    0
median_income                 0
median_house_value            0
ocean_proximity_<1H OCEAN     0
ocean_proximity_INLAND        0
ocean_proximity_ISLAND        0
ocean_proximity_NEAR BAY      0
ocean_proximity_NEAR OCEAN    0
dtype: int64

Categorical features identified: []

'ocean_proximity' column not found, likely already one-hot encoded in a previous step.

DataFrame after preprocessing (first 5 rows):
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.

## Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

# 1. Separate the target variable `median_house_value` from the features
X = df.drop('median_house_value', axis=1)
y = df['median_house_value']

# 2. Identify all numerical feature columns in `X` that are not boolean
numerical_features = X.select_dtypes(include=['float64', 'int64']).columns

# 3-4. Instantiate a StandardScaler object (imported at the top of this cell)
scaler = StandardScaler()

# 5. Fit the StandardScaler to the identified numerical features in `X` and then transform these features
X[numerical_features] = scaler.fit_transform(X[numerical_features])

# 6. Print the first few rows of the updated `X` DataFrame to verify the scaling
print("First 5 rows of X after feature scaling:")
print(X.head())

First 5 rows of X after feature scaling:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0  -1.327835  1.052548            0.982143    -0.804819       -0.972476   
1  -1.322844  1.043185           -0.607019     2.045890        1.357143   
2  -1.332827  1.038503            1.856182    -0.535746       -0.827024   
3  -1.337818  1.038503            1.856182    -0.624215       -0.719723   
4  -1.337818  1.038503            1.856182    -0.462404       -0.612423   

   population  households  median_income  ocean_proximity_<1H OCEAN  \
0   -0.974429   -0.977033       2.344766                      False   
1    0.861439    1.669961       2.332238                      False   
2   -0.820777   -0.843637       1.782699                      False   
3   -0.766028   -0.733781       0.932968                      False   
4   -0.759847   -0.629157      -0.012881                      False   

   ocean_proximity_INLAND  ocean_proximity_ISLAND  ocean_proximity_NEAR BAY  \
0 
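As a small illustration of what the scaling step computes, the sketch below checks that `StandardScaler` is equivalent to subtracting the column mean and dividing by the population standard deviation. The toy column `values` and the name `scaler_demo` are hypothetical and are not taken from `housing.csv`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column, not taken from housing.csv
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(values)

# StandardScaler divides by the population standard deviation (ddof=0),
# which is also NumPy's default for std()
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print("Matches manual z-score:", np.allclose(scaled, manual))
```

After scaling, each numerical column has mean 0 and unit variance, which is why the values printed above for `X.head()` are small and centered around zero.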

## Split Data

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (16512, 13)
Shape of X_test: (4128, 13)
Shape of y_train: (16512,)
Shape of y_test: (4128,)
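Note that the cells above compute the imputation median and the scaler statistics on the full dataset before splitting, so the test rows influence the preprocessing. One possible way to avoid this is to split first and let a scikit-learn `Pipeline` learn the preprocessing from the training rows only. The following is a minimal sketch on synthetic data, not the notebook's actual run: the frame `demo`, its values, and the names `preprocess` and `model` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (hypothetical values)
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "median_income": rng.uniform(1, 10, n),
    "total_bedrooms": rng.uniform(100, 1000, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], n),
})
demo.loc[::20, "total_bedrooms"] = np.nan  # inject some missing values
target = 40000 * demo["median_income"] + rng.normal(0, 10000, n)

# Split BEFORE any statistics are computed
X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.2, random_state=42)

numeric = ["median_income", "total_bedrooms"]
categorical = ["ocean_proximity"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# fit() learns the median, mean, std, and categories from X_tr only
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(X_tr, y_tr)
print(f"Test R^2: {model.score(X_te, y_te):.3f}")
```

For the linear and tree models in this notebook the effect of the leakage is probably small, but the pipeline form keeps the test set fully unseen and applies the same preprocessing automatically at prediction time.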


# Task
Train and evaluate a Linear Regression model using `X_train` and `y_train`, then predict on `X_train` and `X_test`, calculating and printing the RMSE for both training and testing sets, and the MAE for the testing set.

## Train and Evaluate Linear Regression

### Subtask:
Instantiate and train a Linear Regression model using `X_train` and `y_train`. Make predictions on both training and testing sets, then calculate and print RMSE for both sets and MAE for the testing set.


In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# 2. Instantiate a LinearRegression model. Initialize a dictionary named `results` to store metrics for all models.
lin_reg = LinearRegression()
results = {}

# 3. Train the LinearRegression model using X_train and y_train.
lin_reg.fit(X_train, y_train)

# 4. Make predictions on X_train and X_test, storing them as y_train_pred and y_test_pred, respectively.
y_train_pred = lin_reg.predict(X_train)
y_test_pred = lin_reg.predict(X_test)

# 5. Calculate the Root Mean Squared Error (RMSE) for the training set
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))

# 6. Calculate the Root Mean Squared Error (RMSE) for the test set
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# 7. Calculate the Mean Absolute Error (MAE) for the test set
mae_test = mean_absolute_error(y_test, y_test_pred)

# 8. Store these metrics in the `results` dictionary
results['Linear Regression'] = {
    'RMSE (Train)': rmse_train,
    'RMSE (Test)': rmse_test,
    'MAE (Test)': mae_test
}

# 9. Print the calculated metrics for Linear Regression.
print("Linear Regression Metrics:")
print(f"  RMSE (Train): {results['Linear Regression']['RMSE (Train)']:.2f}")
print(f"  RMSE (Test): {results['Linear Regression']['RMSE (Test)']:.2f}")
print(f"  MAE (Test): {results['Linear Regression']['MAE (Test)']:.2f}")

Linear Regression Metrics:
  RMSE (Train): 68433.94
  RMSE (Test): 70060.52
  MAE (Test): 50670.74
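To make the difference between the two reported metrics concrete, here is a small worked example on made-up numbers (the arrays `y_true` and `y_pred` are hypothetical). MAE averages the absolute errors, while RMSE squares them first, so a single large error raises RMSE much more than MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions: three small errors and one large one
y_true = np.array([100.0, 100.0, 100.0, 100.0])
y_pred = np.array([101.0, 99.0, 101.0, 109.0])

# MAE: (1 + 1 + 1 + 9) / 4 = 3.0
mae = mean_absolute_error(y_true, y_pred)

# RMSE: sqrt((1 + 1 + 1 + 81) / 4) = sqrt(21), roughly 4.58
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

This is why the RMSE values in this notebook are consistently larger than the MAE values for the same model.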


## Train and Evaluate Ridge Regression

### Subtask:
Instantiate and train a Ridge Regression model using `X_train` and `y_train`. Make predictions on both training and testing sets, then calculate and print RMSE for both sets and MAE for the testing set. Store the results in the `results` dictionary.

#### Instructions
1. Import `Ridge` from `sklearn.linear_model`.
2. Instantiate a `Ridge` model. (`random_state` is only used by the stochastic `sag` and `saga` solvers; with the default solver Ridge regression is deterministic, so setting `random_state=42` is optional.)
3. Train the `Ridge` model using `X_train` and `y_train`.
4. Make predictions on `X_train` and `X_test`, storing them as `y_train_pred_ridge` and `y_test_pred_ridge`, respectively.
5. Calculate the Root Mean Squared Error (RMSE) for the training set by taking the square root of `mean_squared_error` between `y_train` and `y_train_pred_ridge`. Store this as `rmse_train_ridge`.
6. Calculate the Root Mean Squared Error (RMSE) for the test set by taking the square root of `mean_squared_error` between `y_test` and `y_test_pred_ridge`. Store this as `rmse_test_ridge`.
7. Calculate the Mean Absolute Error (MAE) for the test set using `mean_absolute_error` between `y_test` and `y_test_pred_ridge`. Store this as `mae_test_ridge`.
8. Store these metrics in the `results` dictionary under the key 'Ridge Regression', using sub-keys 'RMSE (Train)', 'RMSE (Test)', and 'MAE (Test)'.
9. Print the calculated `RMSE (Train)`, `RMSE (Test)`, and `MAE (Test)` for Ridge Regression.

In [11]:
from sklearn.linear_model import Ridge

# 2. Instantiate a Ridge model.
ridge_reg = Ridge(random_state=42)

# 3. Train the Ridge model using X_train and y_train.
ridge_reg.fit(X_train, y_train)

# 4. Make predictions on X_train and X_test
y_train_pred_ridge = ridge_reg.predict(X_train)
y_test_pred_ridge = ridge_reg.predict(X_test)

# 5. Calculate the Root Mean Squared Error (RMSE) for the training set
rmse_train_ridge = np.sqrt(mean_squared_error(y_train, y_train_pred_ridge))

# 6. Calculate the Root Mean Squared Error (RMSE) for the test set
rmse_test_ridge = np.sqrt(mean_squared_error(y_test, y_test_pred_ridge))

# 7. Calculate the Mean Absolute Error (MAE) for the test set
mae_test_ridge = mean_absolute_error(y_test, y_test_pred_ridge)

# 8. Store these metrics in the `results` dictionary
results['Ridge Regression'] = {
    'RMSE (Train)': rmse_train_ridge,
    'RMSE (Test)': rmse_test_ridge,
    'MAE (Test)': mae_test_ridge
}

# 9. Print the calculated metrics for Ridge Regression.
print("Ridge Regression Metrics:")
print(f"  RMSE (Train): {results['Ridge Regression']['RMSE (Train)']:.2f}")
print(f"  RMSE (Test): {results['Ridge Regression']['RMSE (Test)']:.2f}")
print(f"  MAE (Test): {results['Ridge Regression']['MAE (Test)']:.2f}")

Ridge Regression Metrics:
  RMSE (Train): 68435.00
  RMSE (Test): 70067.35
  MAE (Test): 50677.17


## Train and Evaluate Decision Tree Regressor

### Subtask:
Instantiate and train a Decision Tree Regressor model using `X_train` and `y_train`. Make predictions on both training and testing sets, then calculate and print RMSE for both sets and MAE for the testing set. Store the results in the `results` dictionary.

#### Instructions
1. Import `DecisionTreeRegressor` from `sklearn.tree`.
2. Instantiate a `DecisionTreeRegressor` model. Set `random_state=42` for reproducibility.
3. Train the `DecisionTreeRegressor` model using `X_train` and `y_train`.
4. Make predictions on `X_train` and `X_test`, storing them as `y_train_pred_dt` and `y_test_pred_dt`, respectively.
5. Calculate the Root Mean Squared Error (RMSE) for the training set by taking the square root of `mean_squared_error` between `y_train` and `y_train_pred_dt`. Store this as `rmse_train_dt`.
6. Calculate the Root Mean Squared Error (RMSE) for the test set by taking the square root of `mean_squared_error` between `y_test` and `y_test_pred_dt`. Store this as `rmse_test_dt`.
7. Calculate the Mean Absolute Error (MAE) for the test set using `mean_absolute_error` between `y_test` and `y_test_pred_dt`. Store this as `mae_test_dt`.
8. Store these metrics in the `results` dictionary under the key 'Decision Tree Regressor', using sub-keys 'RMSE (Train)', 'RMSE (Test)', and 'MAE (Test)'.
9. Print the calculated `RMSE (Train)`, `RMSE (Test)`, and `MAE (Test)` for Decision Tree Regressor.

In [12]:
from sklearn.tree import DecisionTreeRegressor

# 2. Instantiate a DecisionTreeRegressor model.
decision_tree_reg = DecisionTreeRegressor(random_state=42)

# 3. Train the DecisionTreeRegressor model using X_train and y_train.
decision_tree_reg.fit(X_train, y_train)

# 4. Make predictions on X_train and X_test
y_train_pred_dt = decision_tree_reg.predict(X_train)
y_test_pred_dt = decision_tree_reg.predict(X_test)

# 5. Calculate the Root Mean Squared Error (RMSE) for the training set
rmse_train_dt = np.sqrt(mean_squared_error(y_train, y_train_pred_dt))

# 6. Calculate the Root Mean Squared Error (RMSE) for the test set
rmse_test_dt = np.sqrt(mean_squared_error(y_test, y_test_pred_dt))

# 7. Calculate the Mean Absolute Error (MAE) for the test set
mae_test_dt = mean_absolute_error(y_test, y_test_pred_dt)

# 8. Store these metrics in the `results` dictionary
results['Decision Tree Regressor'] = {
    'RMSE (Train)': rmse_train_dt,
    'RMSE (Test)': rmse_test_dt,
    'MAE (Test)': mae_test_dt
}

# 9. Print the calculated metrics for Decision Tree Regressor.
print("Decision Tree Regressor Metrics:")
print(f"  RMSE (Train): {results['Decision Tree Regressor']['RMSE (Train)']:.2f}")
print(f"  RMSE (Test): {results['Decision Tree Regressor']['RMSE (Test)']:.2f}")
print(f"  MAE (Test): {results['Decision Tree Regressor']['MAE (Test)']:.2f}")

Decision Tree Regressor Metrics:
  RMSE (Train): 0.00
  RMSE (Test): 69116.03
  MAE (Test): 43600.67


**Reasoning**:
All three requested models have been trained and evaluated, and their metrics stored in the `results` dictionary. Now, I will display the `results` dictionary in a clear format to prepare for the final comparison and analysis as required by the main task.



In [13]:
import pandas as pd

# Display the results dictionary as a DataFrame for easy comparison
results_df = pd.DataFrame(results).T
print("\nModel Performance Comparison:")
print(results_df.round(2))



Model Performance Comparison:
                         RMSE (Train)  RMSE (Test)  MAE (Test)
Linear Regression            68433.94     70060.52    50670.74
Ridge Regression             68435.00     70067.35    50677.17
Decision Tree Regressor          0.00     69116.03    43600.67
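The three evaluation cells above repeat the same fit, predict, and score steps. As a sketch of how that could be factored out, the hypothetical helper `evaluate_model` below returns the same three metrics for any scikit-learn regressor. It is checked here on tiny synthetic data (`X_demo`, `y_demo`), not on the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error


def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return the three metrics reported in this notebook."""
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    return {
        "RMSE (Train)": np.sqrt(mean_squared_error(y_train, train_pred)),
        "RMSE (Test)": np.sqrt(mean_squared_error(y_test, test_pred)),
        "MAE (Test)": mean_absolute_error(y_test, test_pred),
    }


# Tiny synthetic check (hypothetical data): y = 2x + 1 with no noise,
# so a linear model should recover it almost exactly
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + 1
metrics = evaluate_model(LinearRegression(), X_demo[:8], y_demo[:8], X_demo[8:], y_demo[8:])
print(metrics)
```

With such a helper, the `results` dictionary could be filled in a loop over `(name, model)` pairs instead of three near-identical cells.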


## Model Comparison and Analysis

**Performance Comparison:**

Let's analyze the performance of the three models based on the calculated metrics:

| Model                   | RMSE (Train) | RMSE (Test) | MAE (Test) |
|:------------------------|:-------------|:------------|:-----------|
| Linear Regression       | 68433.94     | 70060.52    | 50670.74   |
| Ridge Regression        | 68435.00     | 70067.35    | 50677.17   |
| Decision Tree Regressor | 0.00         | 69116.03    | 43600.67   |

**Observations:**

1.  **Linear Regression and Ridge Regression:**
    *   Both models show nearly identical metrics on the training and test sets, so the L2 penalty applied by Ridge Regression at its default `alpha=1.0` had almost no effect. The most likely reason is that this penalty is small relative to the squared-error term summed over 16,512 training rows, not that the features are uncorrelated: count-based features such as `total_rooms`, `total_bedrooms`, `population`, and `households` tend to be strongly correlated with one another. Their RMSE (Train) and RMSE (Test) are close, so neither model is overfitting. The test RMSEs are slightly higher than the training RMSEs, which is expected.

2.  **Decision Tree Regressor:**
    *   **RMSE (Train): 0.00** This is a strong indicator of **overfitting**. A perfect RMSE on the training set means the model has learned the training data too well, essentially memorizing it, including noise. This typically happens with deep decision trees that are allowed to grow without restrictions (like `max_depth`).
    *   **RMSE (Test): 69116.03** The test RMSE is slightly lower than that of Linear and Ridge Regression (by about 940), so despite memorizing the training set the tree generalizes no worse than the linear models here. This suggests the tree captures some non-linear patterns that the linear models miss. The training score, however, says nothing about generalization: only the test metrics are informative for this model.
    *   **MAE (Test): 43600.67** The Decision Tree Regressor achieved the lowest MAE on the test set, about 7,070 lower than the linear models, so its typical prediction is closer to the actual value. Because its RMSE advantage is much smaller than its MAE advantage, and RMSE weights large errors more heavily, the tree probably makes a number of large individual errors, which is consistent with a high-variance model.
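The near-identical Linear and Ridge results can be explored by varying `alpha`. Ridge adds `alpha` times the squared L2 norm of the coefficients to the sum of squared errors, so a fixed `alpha` matters less as the number of rows grows. The sketch below uses synthetic data (`X_demo`, `y_demo`, and the chosen `alpha` values are assumptions for illustration) to show how far the Ridge coefficients move away from the ordinary least squares solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem with known coefficients (hypothetical data)
rng = np.random.default_rng(0)
n_samples = 5000
X_demo = rng.normal(size=(n_samples, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=n_samples)

ols_coef = LinearRegression().fit(X_demo, y_demo).coef_

shifts = {}
for alpha in [1.0, 100.0, 10000.0]:
    ridge_coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    # Largest absolute difference between Ridge and OLS coefficients
    shifts[alpha] = np.max(np.abs(ridge_coef - ols_coef))
    print(f"alpha={alpha:>8}: largest coefficient change vs. OLS = {shifts[alpha]:.4f}")
```

On data like this, `alpha=1.0` barely changes the coefficients, while much larger values shrink them noticeably. A similar sweep on the housing training set (for example with `RidgeCV`) would show whether any `alpha` actually improves the test metrics.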

**Underfitting and Overfitting:**

*   **Underfitting:** Underfitting shows up as errors that are high on both the training and test sets and close to each other, because the model is too simple to learn the patterns. The linear models fit this description to a degree: their train and test RMSEs are close (about 68,400 and 70,100), and the Decision Tree's lower test MAE indicates that some structure in the data is not captured by a linear fit. This points to mild underfitting (high bias) in Linear and Ridge Regression rather than a severe case, since they still capture the general trends.
*   **Overfitting:** The Decision Tree Regressor exhibits severe **overfitting**. Its RMSE of 0.00 on the training set means it has reproduced every training label exactly, while its test RMSE is 69116.03. This gap between a perfect training score and a far-from-perfect test score is the hallmark of overfitting, even though the test RMSE itself is competitive with the linear models.
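A common way to reduce this kind of overfitting is to limit the tree's growth, for example with `max_depth`. The following is a hedged sketch on a synthetic noisy sine curve, not on the housing data: the data, the depth value 4, and the variable names are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D problem (hypothetical): noisy sine curve
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(600, 1))
y_demo = np.sin(X_demo.ravel()) + rng.normal(scale=0.3, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

depth_rmse = {}
for depth in [None, 4]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    rmse_tr = np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
    rmse_te = np.sqrt(mean_squared_error(y_te, tree.predict(X_te)))
    depth_rmse[depth] = (rmse_tr, rmse_te)
    print(f"max_depth={depth}: train RMSE={rmse_tr:.3f}, test RMSE={rmse_te:.3f}")
```

The unconstrained tree reproduces its training labels exactly, as in the notebook, while the depth-limited tree gives up a perfect training fit and typically narrows the gap between train and test error. For the housing data, a suitable depth would need to be chosen by validation (for example with `GridSearchCV`); the value 4 here is only illustrative.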

**Relevant Real-World Machine Learning Issue: Interpretability vs. Performance Trade-off**

This analysis highlights the classic trade-off between model interpretability and predictive performance, often tied to complexity. Linear and Ridge Regression are highly interpretable models; their coefficients directly indicate the impact of each feature on the house price. However, they might struggle to capture complex non-linear relationships. Decision Trees, especially when unconstrained, can capture very complex relationships, leading to high performance on the training data. However, their interpretability can decrease with depth, and they are prone to overfitting. In a real-world scenario for house price prediction, a highly interpretable model might be preferred by stakeholders (e.g., real estate agents, appraisers) to understand **why** a house is priced a certain way, even if a slightly less interpretable model (like a Random Forest or Gradient Boosting, which are ensembles of Decision Trees) offers marginally better predictive accuracy. The choice often depends on the specific business needs: is it about accurate prediction *or* understanding the driving factors?
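As an illustration of the interpretability point, the sketch below fits a linear model on synthetic, already-standardized features and lists its coefficients by magnitude. The frame `demo_X`, the target `demo_y`, and the coefficient values built into them are hypothetical and are not results from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic standardized features with known effects (hypothetical values)
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({
    "median_income": rng.normal(size=500),
    "housing_median_age": rng.normal(size=500),
})
demo_y = (
    80000 * demo_X["median_income"]
    + 10000 * demo_X["housing_median_age"]
    + 200000
    + rng.normal(scale=5000, size=500)
)

reg = LinearRegression().fit(demo_X, demo_y)

# With standardized inputs, each coefficient is the change in predicted
# price for a one-standard-deviation increase in that feature
coef_table = pd.Series(reg.coef_, index=demo_X.columns).sort_values(key=np.abs, ascending=False)
print(coef_table.round(0))
```

The same pattern could be applied to `lin_reg.coef_` with `X_train.columns` as the index. An unconstrained decision tree offers no comparably compact summary, although `feature_importances_` gives a rough ranking.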

## Final Task

### Subtask:
Summarize the performance metrics for all three models (Linear Regression, Ridge Regression, and Decision Tree Regressor), including RMSE (Train), RMSE (Test), and MAE (Test) for each. This will allow for a comparative analysis of their performance.


## Summary:

### Q&A
The task was to summarize the performance metrics for Linear Regression, Ridge Regression, and Decision Tree Regressor models, including RMSE (Train), RMSE (Test), and MAE (Test) for each, to facilitate a comparative analysis.

The performance metrics for the models are summarized as follows:

| Model                   | RMSE (Train) | RMSE (Test) | MAE (Test) |
| :---------------------- | :----------- | :---------- | :--------- |
| Linear Regression       | \$68433.94   | \$70060.52  | \$50670.74 |
| Ridge Regression        | \$68435.00   | \$70067.35  | \$50677.17 |
| Decision Tree Regressor | \$0.00       | \$69116.03  | \$43600.67 |

### Data Analysis Key Findings

*   **Linear and Ridge Regression Performance:** Both Linear Regression and Ridge Regression models exhibited very similar performance, with RMSE (Train) at \$68433.94 and \$68435.00 respectively, and RMSE (Test) at \$70060.52 and \$70067.35 respectively. Their MAE (Test) values were also comparable at \$50670.74 and \$50677.17. This suggests that the regularization in Ridge Regression had a minimal impact at the default `alpha=1.0`, which is a weak penalty for a training set of 16,512 rows.
*   **Decision Tree Regressor Overfitting:** The Decision Tree Regressor showed clear signs of severe overfitting, achieving a perfect RMSE (Train) of \$0.00. This indicates it memorized the training data entirely.
*   **Decision Tree Regressor Test Performance:** Despite overfitting on the training data, the Decision Tree Regressor's RMSE (Test) was competitive at \$69116.03, and it achieved the lowest MAE (Test) among all models at \$43600.67, meaning its typical error on unseen data was the smallest of the three.
*   **No Severe Underfitting:** No model shows severe underfitting. The linear models have close train and test scores, which rules out overfitting, but their higher test MAE relative to the Decision Tree suggests they may be mildly underfitting non-linear structure in the data.