In [4]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt


In [5]:
# Load the California Housing dataset with scikit-learn's fetch_california_housing
# and wrap it in a DataFrame, adding the target as an extra column.

housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['MedHouseVal'] = housing.target

# Display the first few rows of the dataframe
df.head()


,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [6]:
# Define features (X) and target (y)
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']


# Split the data into 80% training and 20% testing (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [7]:
# Initialize and train an ordinary least squares model with scikit-learn's LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)


The **coefficient of determination**, denoted as R², is a key metric used to evaluate the performance of a regression model. It provides a measure of how well the independent variables (features) explain the variability of the dependent variable (target).

**Interpretation:**

R² = 1: The model perfectly explains all the variability of the target variable.

R² = 0: The model does no better than always predicting the mean of the target variable.

0 < R² < 1: The model explains part of the variability of the target variable.

R² < 0: The model performs worse than always predicting the mean of the target variable (a horizontal line).

**Explanation:**
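R² is computed as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the target from its mean. It therefore measures the proportion of the variance in the dependent variable that is predictable from the independent variables. An R² value close to 1 indicates high explanatory power.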
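To make the interpretation cases above concrete, here is a minimal sketch that computes R² directly from its definition. The helper name `r2_from_definition` and the toy values are illustrative assumptions, not part of the notebook; plain Python is used so each term of the formula is visible.

```python
def r2_from_definition(y_true, y_pred):
    """Return R² = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared residuals: how far predictions are from the actual values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far actual values are from their mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy target values (hypothetical); their mean is 6.0
y_true = [3.0, 5.0, 7.0, 9.0]

r2_perfect = r2_from_definition(y_true, [3.0, 5.0, 7.0, 9.0])    # exact predictions
r2_mean_only = r2_from_definition(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
r2_reversed = r2_from_definition(y_true, [9.0, 7.0, 5.0, 3.0])   # worse than the mean

print(r2_perfect, r2_mean_only, r2_reversed)
```

The three calls correspond to the R² = 1, R² = 0, and R² < 0 cases. scikit-learn's `r2_score` uses the same definition, so it should agree with this helper on the same inputs.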


**Mean Squared Error (MSE)** is another important metric for evaluating the performance of a regression model. It represents the average squared difference between the actual and predicted values.

**Interpretation:**

MSE = 0: Perfect prediction, no error.

MSE > 0: The higher the MSE, the larger the average squared difference between the actual and predicted values, indicating poorer model performance.



**Explanation:**

MSE gives an idea of how close the predicted values are to the actual values. By squaring the differences, MSE penalizes larger errors more than smaller ones. Unlike R², MSE depends on the scale of the target: it is expressed in squared units of the target variable, so it is most useful for comparing models on the same dataset.
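The effect of squaring can be seen by comparing MSE with the mean absolute error (MAE) on two sets of predictions that have the same total absolute error. This is a small sketch with made-up values; the helper names `mse` and `mae` are assumptions for illustration.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual values and two sets of predictions
y_true = [2.0, 2.0, 2.0, 2.0]
even_errors = [3.0, 3.0, 3.0, 3.0]    # four errors of size 1
one_big_error = [2.0, 2.0, 2.0, 6.0]  # a single error of size 4

mae_even, mae_big = mae(y_true, even_errors), mae(y_true, one_big_error)
mse_even, mse_big = mse(y_true, even_errors), mse(y_true, one_big_error)

print(f"MAE: even={mae_even}, one big={mae_big}")
print(f"MSE: even={mse_even}, one big={mse_big}")
```

Both prediction sets have the same MAE, but the one with a single large miss has four times the MSE, which is what "penalizes larger errors more" means in practice.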

In [8]:
# Calculate the coefficient of determination (R²) and Mean Squared Error (MSE).

# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Evaluate the model
mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

print(f"Training MSE: {mse_train}")
print(f"Testing MSE: {mse_test}")
print(f"Training R²: {r2_train}")
print(f"Testing R²: {r2_test}")


Training MSE: 0.5179331255246699
Testing MSE: 0.5558915986952444
Training R²: 0.6125511913966952
Testing R²: 0.5757877060324508


**Training MSE (0.5179)** and **Testing MSE (0.5559)**: The testing MSE is only about 7% higher than the training MSE, indicating the model's predictive performance is consistent between training and testing sets.

**Training R² (0.6126)** and **Testing R² (0.5758)**: These values suggest that the model explains a moderate amount of the variance in the data, with a small drop in explanatory power when applied to new data.
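Because MSE is in squared units, taking its square root (RMSE) gives a figure in the units of the target. The sketch below applies this to the MSE values printed above. It assumes the target is expressed in units of $100,000, as the scikit-learn description of this dataset states; the constant name `UNIT_DOLLARS` is introduced here for illustration.

```python
import math

# MSE values copied from the output above
mse_train = 0.5179331255246699
mse_test = 0.5558915986952444

# Assumption: MedHouseVal is measured in units of $100,000
UNIT_DOLLARS = 100_000

rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"Training RMSE: {rmse_train:.4f} (about ${rmse_train * UNIT_DOLLARS:,.0f})")
print(f"Testing RMSE:  {rmse_test:.4f} (about ${rmse_test * UNIT_DOLLARS:,.0f})")
```

Read this way, a typical prediction error is on the order of tens of thousands of dollars, which is easier to judge than the raw MSE.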

In [9]:
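# Examine the coefficients of the linear regression model.
# Each coefficient is the change in the prediction per one-unit change in that feature,
# so magnitudes are not directly comparable across features with different scales.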

feature_coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
feature_coefficients = feature_coefficients.sort_values(by='Coefficient', ascending=False)
feature_coefficients


,Feature,Coefficient
3,AveBedrms,0.783145
0,MedInc,0.448675
1,HouseAge,0.009724
4,Population,-2e-06
5,AveOccup,-0.003526
2,AveRooms,-0.123323
6,Latitude,-0.419792
7,Longitude,-0.433708
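The coefficients above were fit on unscaled features, so their sizes partly reflect units: Population's coefficient is tiny because it is a change per person. One common way to compare features is to standardize them before fitting. The following is a sketch on synthetic data, not the notebook's result; the feature names and generated values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two hypothetical features on very different scales
small_scale = rng.normal(0.0, 1.0, n)     # e.g. a ratio
large_scale = rng.normal(0.0, 1000.0, n)  # e.g. a head count
X_demo = np.column_stack([small_scale, large_scale])

# Both features move the target by a similar amount per standard deviation
y_demo = 2.0 * small_scale + 0.002 * large_scale + rng.normal(0.0, 0.1, n)

# Fit once on raw features and once on standardized features
raw_model = LinearRegression().fit(X_demo, y_demo)
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

raw_coef = raw_model.coef_
std_coef = scaled_model.named_steps["linearregression"].coef_

print("Raw coefficients:         ", raw_coef)
print("Standardized coefficients:", std_coef)
```

On raw features the second coefficient looks negligible, while the standardized coefficients are of similar size. In the notebook, the same pipeline could be fit on `X_train` and `y_train` to rank the housing features on a common scale.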


### Interpretation and Analysis Summary

**Model Fit:**
- Both training and testing R² values are above 0.5, meaning the model explains more than half of the variance in housing prices.
- The model is not a perfect fit, suggesting other factors may influence housing prices that are not captured by the current model.

**Generalization:**
- Training MSE (0.5179) is slightly lower than testing MSE (0.5559), and training R² (0.6126) is slightly higher than testing R² (0.5758).
- The small increase in MSE and decrease in R² on the testing set suggests the model generalizes to new data without substantial overfitting.
- This conclusion rests on a single train/test split, so it is an indication rather than proof of robustness.
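A single split can be lucky or unlucky, so k-fold cross-validation is a common way to check generalization more thoroughly. The sketch below uses `cross_val_score` on synthetic data from `make_regression` so that it runs on its own; the data and the names `X_demo` and `y_demo` are assumptions, and in the notebook `X` and `y` would take their place.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data (hypothetical values)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# 5-fold cross-validation: each fold is held out once and scored with R²
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print("R² per fold:", scores)
print(f"Mean R²: {scores.mean():.4f}, std: {scores.std():.4f}")
```

A small spread across folds would support the claim that performance does not depend on one particular split.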

**Model Performance:**
- The model explains about 61% of the variance in training data and about 58% in testing data, indicating reasonable performance in predicting housing prices.
- There is potential for improvement by incorporating additional features, engineering the existing ones, or using regularized or nonlinear models whose hyperparameters can be tuned.