In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load the California Housing dataset
data = fetch_california_housing()

# Convert to a pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target

# Display basic information about the dataset
print(df.info())
print(df.describe())

# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Perform Standard Scaling
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[data.feature_names])
df[data.feature_names] = scaled_features

# Display the first few rows after preprocessing
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   Target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
None
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   

Loading Dataset: The California Housing dataset is loaded using fetch_california_housing().
Missing Values: The dataset does not have missing values by default, but we check to ensure data integrity.
Conversion to DataFrame: Converting to a pandas DataFrame simplifies handling and analysis.
Feature Scaling: Standard scaling rescales each feature to zero mean and unit variance, so all features are on a consistent scale for the regression algorithms. This matters most for scale-sensitive models such as SVR.
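
A minimal sketch of the scaling step on synthetic data. The feature values, coefficients, and variable names below are illustrative assumptions, not the housing data. It wraps StandardScaler and LinearRegression in a pipeline so that the scaler's mean and standard deviation come from the training split only, which is one common way to keep test data out of preprocessing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two features on very different scales
rng = np.random.default_rng(42)
X = rng.normal(loc=[3.0, 1400.0], scale=[2.0, 1100.0], size=(200, 2))
y = 0.5 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training split only, then fits the model
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# After scaling, each training feature has mean ~0 and standard deviation ~1
scaler = pipeline.named_steps["standardscaler"]
X_train_scaled = scaler.transform(X_train)
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)

# score() applies the training-set scaling to X_test before predicting
test_r2 = pipeline.score(X_test, y_test)
print("Scaled training means:", train_means)
print("Scaled training stds:", train_stds)
print(f"Test R2: {test_r2:.3f}")
```

The same pipeline object could be passed anywhere a plain estimator is expected, so the scaling is reapplied consistently at prediction time.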

In [2]:
# 2. Regression Algorithm Implementation

# Import regression models and performance metrics
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X = df[data.feature_names]
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Support Vector Regressor (SVR)": SVR()
}

# Train and evaluate models
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results.append({"Model": name, "MSE": mse, "MAE": mae, "R2": r2})
    print(f"{name}: MSE={mse:.2f}, MAE={mae:.2f}, R2={r2:.2f}")

Linear Regression: MSE=0.56, MAE=0.53, R2=0.58
Decision Tree: MSE=0.49, MAE=0.45, R2=0.62
Random Forest: MSE=0.26, MAE=0.33, R2=0.81
Gradient Boosting: MSE=0.29, MAE=0.37, R2=0.78
Support Vector Regressor (SVR): MSE=0.36, MAE=0.40, R2=0.73


In [3]:
# 3. Model Evaluation and Comparison

# Convert results to DataFrame for comparison
results_df = pd.DataFrame(results)
print(results_df)

# Identify the best and worst-performing models
best_model = results_df.loc[results_df['R2'].idxmax()]
worst_model = results_df.loc[results_df['R2'].idxmin()]

print("\nBest Performing Model:")
print(best_model)
print("\nWorst Performing Model:")
print(worst_model)


                            Model       MSE       MAE        R2
0               Linear Regression  0.555892  0.533200  0.575788
1                   Decision Tree  0.494272  0.453784  0.622811
2                   Random Forest  0.255498  0.327613  0.805024
3               Gradient Boosting  0.293999  0.371650  0.775643
4  Support Vector Regressor (SVR)  0.355198  0.397763  0.728941

Best Performing Model:
Model    Random Forest
MSE           0.255498
MAE           0.327613
R2            0.805024
Name: 2, dtype: object

Worst Performing Model:
Model    Linear Regression
MSE               0.555892
MAE                 0.5332
R2                0.575788
Name: 0, dtype: object


Explanation of Algorithms:
1. Linear Regression: Fits a linear function of the features; suitable for datasets with linear relationships.
  Linear regression works well if there is a linear relationship between the features and the target variable (house prices).
  It is simple, interpretable, and computationally efficient, making it a great starting point for analysis.
2. Decision Tree Regressor: Recursively splits the data into subsets; suitable for non-linear relationships.
  Decision trees handle non-linear relationships and interactions between features effectively.
  They can capture complex patterns in the dataset without requiring prior feature scaling.
3. Random Forest Regressor: Ensemble of decision trees; reduces overfitting.
  Random forests average the predictions of many decision trees, reducing overfitting and improving generalization.
  They are robust to outliers and can handle large feature sets, making them well suited to datasets with diverse housing features.
4. Gradient Boosting Regressor: Boosting technique; combines weak learners for strong predictions.
  Gradient boosting builds trees sequentially, each one correcting the errors of the previous ones, which often leads to highly accurate predictions.
  It works well for datasets with subtle patterns and is effective at minimizing errors in housing price predictions.
5. Support Vector Regressor (SVR): Fits a function that tolerates small errors within a margin; kernels allow non-linear fits.
  SVR can handle high-dimensional datasets and can model non-linear relationships using kernels.
  It is sensitive to feature scale, so it benefits from the standard scaling applied earlier.
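
To illustrate the difference between a linear model and a tree-based model described above, here is a small sketch on synthetic data with a purely quadratic relationship. The data, the max_depth value, and the variable names are assumptions chosen for illustration; this is not the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): y depends on x quadratically, plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A straight line cannot follow the symmetric curve of x**2
linear = LinearRegression().fit(X_train, y_train)

# A shallow tree approximates the curve with piecewise-constant steps
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)

linear_r2 = r2_score(y_test, linear.predict(X_test))
tree_r2 = r2_score(y_test, tree.predict(X_test))
print(f"Linear Regression R2: {linear_r2:.2f}")
print(f"Decision Tree R2:     {tree_r2:.2f}")
```

On data like this the tree scores far higher than the linear model, which mirrors, in exaggerated form, the gap seen between the linear and tree-based models in the results table.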

Evaluation Metrics:
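1. MSE (Mean Squared Error): Measures the average of the squared errors; large errors are penalized more heavily.
2. MAE (Mean Absolute Error): Measures the average magnitude of the errors, in the same units as the target.
3. R-Squared Score (R²): Measures the proportion of variance in the target that is explained by the model.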
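
The three metrics above can be computed by hand on a tiny example. The values in y_true and y_pred below are made up for illustration; sklearn's mean_squared_error, mean_absolute_error, and r2_score compute the same quantities.

```python
# Toy values (assumption): four actual targets and four predictions
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]
n = len(y_true)

# Per-sample errors: actual minus predicted
errors = [t - p for t, p in zip(y_true, y_pred)]

# MSE: mean of squared errors
mse = sum(e ** 2 for e in errors) / n

# MAE: mean of absolute errors
mae = sum(abs(e) for e in errors) / n

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.4f}, MAE={mae:.4f}, R2={r2:.4f}")
# prints: MSE=0.5625, MAE=0.6250, R2=0.8475
```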


The best-performing model is Random Forest and the worst-performing model is Linear Regression.

Random Forest achieved the lowest MSE (0.26) and MAE (0.33), indicating that its predictions were the closest to the actual values. Its R² score of 0.81 means it explains about 81% of the variance in the target variable, making it the most accurate model.

Linear Regression performed worst, with the highest MSE (0.56) and MAE (0.53) and the lowest R² score of 0.58. This indicates that linear regression struggled to capture the patterns in the dataset, likely because the relationships between the features and the target are not purely linear.
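
As a rough aid to interpreting these numbers, the sketch below converts the MSE values from the results table into RMSE. It relies on the fact that the target in this dataset is the median house value expressed in units of $100,000; the dictionary and variable names are illustrative.

```python
import math

# MSE values copied from the results table above
mse_by_model = {
    "Linear Regression": 0.555892,
    "Decision Tree": 0.494272,
    "Random Forest": 0.255498,
    "Gradient Boosting": 0.293999,
    "Support Vector Regressor (SVR)": 0.355198,
}

# RMSE is the square root of MSE, so it is in the same units as the target
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

# The target is in units of $100,000, so multiply to express RMSE in dollars
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE={rmse:.3f} (about ${rmse * 100_000:,.0f})")
```

Read this way, the gap between the best and worst models corresponds to a difference of tens of thousands of dollars in typical prediction error on this test split.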