In [13]:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
data = fetch_california_housing(as_frame=True)
df = data.frame

# Check basic information
print("Dataset Information:")
print(df.info())

# Describe dataset statistics
print("\nDataset Description:")
print(df.describe())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Separate features and target variable
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Verify variable definitions and print their types and shapes
print("\nVariable Verification:")
print(f"X_train_scaled: {type(X_train_scaled)}, Shape: {X_train_scaled.shape}")
print(f"y_train: {type(y_train)}, Shape: {y_train.shape}")
print(f"X_test_scaled: {type(X_test_scaled)}, Shape: {X_test_scaled.shape}")
print(f"y_test: {type(y_test)}, Shape: {y_test.shape}")


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
None

Dataset Description:
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1

### **Preprocessing Steps and Justification**

1. **Loading the Dataset**:
   - We loaded the California Housing dataset with `fetch_california_housing(as_frame=True)` from `sklearn.datasets`. It contains 20,640 rows, one per California district, with eight numeric features (median income, house age, average rooms, average bedrooms, population, average occupancy, latitude, and longitude) and the target variable `MedHouseVal`, the median house value expressed in units of $100,000.
   
   **Justification**: This step is essential to obtain the dataset, which we will use to train and evaluate our regression models.

2. **Exploring the Dataset**:
   - We used `df.info()` to display the data types and the number of non-null entries in each column, and `df.describe()` to get summary statistics of the numeric features.
   
   **Justification**: Understanding the data structure and getting a statistical summary is crucial to detect potential issues such as missing data, outliers, or extreme values that could impact the modeling process.

3. **Checking for Missing Values**:
   - We checked for missing values using `df.isnull().sum()`.
   
   **Justification**: Missing data can cause problems for machine learning algorithms, leading to biased or incorrect predictions. Identifying missing values early helps decide how to handle them (e.g., imputation or removal). In this dataset every column has 20,640 non-null entries, so neither step is needed.

4. **Feature and Target Variable Separation**:
   - We separated the features (independent variables) from the target variable (`MedHouseVal`) by dropping the target column from the feature set.
   
   **Justification**: In supervised learning, we need to separate the features (input variables) from the target variable (output). This separation allows the model to learn from the features and predict the target.

5. **Splitting the Data into Training and Testing Sets**:
   - We used `train_test_split` from `sklearn.model_selection` to split the data into 80% for training and 20% for testing (`test_size=0.2`), with `random_state=42` so that the split is reproducible.
   
   **Justification**: Splitting the dataset into training and testing sets ensures that the model can be trained on one subset of the data and tested on another unseen subset. This helps to evaluate the model's performance and generalizability.

6. **Feature Scaling**:
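   - We standardized the features with `StandardScaler`. The scaler was fit on the training set only (`fit_transform`) and then applied to the test set (`transform`), so the training features have a mean of 0 and a standard deviation of 1, and the test features are scaled with the same training statistics.
   
   **Justification**: Scale-sensitive algorithms, such as gradient-descent-based models and kernel methods like SVR, can be dominated by features with large magnitudes (here `Population` has a standard deviation above 1,000, while that of `AveBedrms` is below 1). Standardizing puts the features on a comparable scale and speeds up the convergence of gradient-based optimizers. Tree-based models are largely unaffected by this scaling. Fitting the scaler on the training data alone keeps test-set information out of the preprocessing.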

7. **Variable Verification**:
   - After scaling, we verified the types and shapes of the resulting variables (`X_train_scaled`, `X_test_scaled`, `y_train`, `y_test`) to ensure that the data was processed correctly.
   
   **Justification**: Verifying the data after preprocessing ensures that the dataset is ready for model training. It helps catch any issues such as missing variables or incorrect transformations that could affect the model's performance.
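
The scaling step (step 6) can be illustrated without scikit-learn. The following is a minimal sketch, assuming a single feature column and hypothetical values that are not taken from the dataset; the helper names `fit_standardizer` and `standardize` are illustrative only. It uses the population standard deviation, which is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Return the mean and population standard deviation of one feature column."""
    return fmean(column), pstdev(column)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]

# Hypothetical median-income values (illustrative, not from the dataset).
train_income = [2.0, 4.0, 6.0, 8.0]
test_income = [3.0, 9.0]

# Statistics are learned from the training rows only ...
mean, std = fit_standardizer(train_income)
train_scaled = standardize(train_income, mean, std)
# ... and reused unchanged for the test rows.
test_scaled = standardize(test_income, mean, std)

print("train:", train_scaled)
print("test: ", test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction. The test column generally does not, which is expected: it is measured against the training distribution.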


### 1. Linear Regression ###

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Initialize and train the model
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test_scaled)

# Evaluate the model
mse_lr = mean_squared_error(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)

# Display results
print("Linear Regression Performance:")
print(f"Mean Squared Error (MSE): {mse_lr}")
print(f"Mean Absolute Error (MAE): {mae_lr}")
print(f"R-squared (R²): {r2_lr}")

Linear Regression Performance:
Mean Squared Error (MSE): 0.5558915986952441
Mean Absolute Error (MAE): 0.5332001304956566
R-squared (R²): 0.575787706032451


### 2. Decision Tree Regressor ###

In [18]:
from sklearn.tree import DecisionTreeRegressor

# Initialize and train the model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_dt = dt_model.predict(X_test_scaled)

# Evaluate the model
mse_dt = mean_squared_error(y_test, y_pred_dt)
mae_dt = mean_absolute_error(y_test, y_pred_dt)
r2_dt = r2_score(y_test, y_pred_dt)

# Display results
print("\nDecision Tree Regressor Performance:")
print(f"Mean Squared Error (MSE): {mse_dt}")
print(f"Mean Absolute Error (MAE): {mae_dt}")
print(f"R-squared (R²): {r2_dt}")


Decision Tree Regressor Performance:
Mean Squared Error (MSE): 0.49396854311945243
Mean Absolute Error (MAE): 0.45390448401162786
R-squared (R²): 0.6230424613065773


### 3. Random Forest Regressor ###

In [20]:
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model
rf_model = RandomForestRegressor(random_state=42, n_estimators=100)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test_scaled)

# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

# Display results
print("\nRandom Forest Regressor Performance:")
print(f"Mean Squared Error (MSE): {mse_rf}")
print(f"Mean Absolute Error (MAE): {mae_rf}")
print(f"R-squared (R²): {r2_rf}")


Random Forest Regressor Performance:
Mean Squared Error (MSE): 0.255169737347244
Mean Absolute Error (MAE): 0.3274252027374032
R-squared (R²): 0.8052747336256919


### 4. Gradient Boosting Regressor ###

In [22]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize and train the model
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_gb = gb_model.predict(X_test_scaled)

# Evaluate the model
mse_gb = mean_squared_error(y_test, y_pred_gb)
mae_gb = mean_absolute_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

# Display results
print("\nGradient Boosting Regressor Performance:")
print(f"Mean Squared Error (MSE): {mse_gb}")
print(f"Mean Absolute Error (MAE): {mae_gb}")
print(f"R-squared (R²): {r2_gb}")



Gradient Boosting Regressor Performance:
Mean Squared Error (MSE): 0.29399901242474274
Mean Absolute Error (MAE): 0.37165044848436773
R-squared (R²): 0.7756433164710084


### 5. Support Vector Regressor (SVR) ###

In [32]:
from sklearn.svm import SVR

# Initialize and train the model
svr_model = SVR(kernel='rbf')
svr_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_svr = svr_model.predict(X_test_scaled)

# Evaluate the model
mse_svr = mean_squared_error(y_test, y_pred_svr)
mae_svr = mean_absolute_error(y_test, y_pred_svr)
r2_svr = r2_score(y_test, y_pred_svr)

# Display results
print("\nSupport Vector Regressor Performance:")
print(f"Mean Squared Error (MSE): {mse_svr}")
print(f"Mean Absolute Error (MAE): {mae_svr}")
print(f"R-squared (R²): {r2_svr}")


Support Vector Regressor Performance:
Mean Squared Error (MSE): 0.3570040319338641
Mean Absolute Error (MAE): 0.3985990769520539
R-squared (R²): 0.7275628923016779


### **Regression Algorithms: Explanation and Suitability**

1. **Linear Regression**:
   - **How it Works**: Linear Regression models the relationship between the target variable and one or more independent variables by fitting a straight line (in higher dimensions, a hyperplane) that minimizes the sum of squared residuals.
   - **Suitability for this Dataset**: Linear Regression is suitable for datasets where there is a linear relationship between the features and the target variable. In this dataset, it provides a baseline model to evaluate how well simple linear relationships capture the variability in median house values.

2. **Decision Tree Regressor**:
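   - **How it Works**: Decision Tree Regressor recursively splits the dataset into subsets based on feature thresholds, creating a tree structure. At each split it picks the feature and threshold that most reduce the squared error (variance) of the target within the resulting subsets, and each leaf predicts the mean target value of the training samples that reach it.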
   - **Suitability for this Dataset**: This algorithm is non-parametric and can capture complex, non-linear relationships between features and the target variable, which might be present in housing data.

3. **Random Forest Regressor**:
   - **How it Works**: Random Forest is an ensemble method that trains many decision trees, each on a bootstrap sample of the training data and, optionally, with a random subset of features considered at each split. It averages the trees' predictions, which reduces variance and overfitting compared with a single tree.
   - **Suitability for this Dataset**: The ensemble approach of Random Forest is effective for datasets with high variability, like this housing dataset, as it reduces overfitting and provides robust predictions.

4. **Gradient Boosting Regressor**:
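   - **How it Works**: Gradient Boosting builds a series of shallow decision trees sequentially. Each new tree is fit to the negative gradient of the loss with respect to the current ensemble's predictions (for squared error, these are simply the residuals), and its prediction, scaled by a learning rate, is added to the ensemble. This amounts to gradient descent on the loss in function space.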
   - **Suitability for this Dataset**: Gradient Boosting is well-suited for datasets where high prediction accuracy is required. It can model complex relationships in the data effectively.

5. **Support Vector Regressor (SVR)**:
   - **How it Works**: SVR fits a function that tolerates errors up to a margin ε: deviations smaller than ε incur no loss (the ε-insensitive loss), and only points on or outside that margin become support vectors. With the RBF kernel used here, the fitted function is non-linear in the original features.
   - **Suitability for this Dataset**: SVR can capture non-linear relationships and is sensitive to feature scale, so it benefits from the standardization applied during preprocessing. However, it may struggle with larger datasets because of its computational cost: training time grows more than quadratically with the number of samples.
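
To make the tree-splitting and boosting descriptions above concrete, here is a minimal sketch of gradient boosting with squared-error loss on a single feature. It assumes depth-1 trees (stumps) and hypothetical toy data; the function names `best_stump` and `boost` are illustrative, and this is not how scikit-learn implements `GradientBoostingRegressor` internally.

```python
from statistics import fmean

def best_stump(xs, targets):
    """Find the threshold that minimizes the summed squared error of a one-split tree.

    Returns (threshold, left_value, right_value); each side predicts its mean target.
    """
    best = None
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = fmean(left), fmean(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Fit stumps one after another to the residuals of the current predictions."""
    base = fmean(ys)                      # initial prediction: the mean target
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = best_stump(xs, residuals)
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return preds

def mse(ys, preds):
    return fmean([(y - p) ** 2 for y, p in zip(ys, preds)])

# Hypothetical toy data with a step-like, non-linear pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9, 5.1, 4.8]

mse_baseline = mse(ys, [fmean(ys)] * len(ys))
mse_1_round = mse(ys, boost(xs, ys, n_rounds=1))
mse_20_rounds = mse(ys, boost(xs, ys, n_rounds=20))
print(mse_baseline, mse_1_round, mse_20_rounds)
```

On training data, each round can only lower the squared error, which is why the three printed values decrease. A Random Forest differs in that its trees are trained independently on resampled data and averaged, rather than fit one after another to residuals.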


### **Model Evaluation and Comparison**

We evaluated the performance of each regression algorithm using the following metrics:

1. **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual values. Lower values indicate better performance.
2. **Mean Absolute Error (MAE)**: Measures the average absolute difference between predicted and actual values. Lower values indicate better performance.
3. **R-squared Score (R²)**: Indicates the proportion of variance in the target variable explained by the model. Values closer to 1 indicate better performance.
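
The three metrics can be written out directly from their definitions. The sketch below is an illustrative pure-Python version with hypothetical values; the notebook itself uses `mean_squared_error`, `mean_absolute_error`, and `r2_score` from `sklearn.metrics`, and the helper name `regression_metrics` is an assumption made for this example.

```python
from statistics import fmean

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R² from their definitions."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = fmean([e ** 2 for e in errors])          # mean of squared errors
    mae = fmean([abs(e) for e in errors])          # mean of absolute errors
    mean_true = fmean(y_true)
    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

# Hypothetical targets and predictions (not from the dataset).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

mse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"MSE={mse}, MAE={mae}, R2={r2}")

# A model that always predicts the mean of y_true scores R² = 0.
_, _, r2_mean_model = regression_metrics(y_true, [fmean(y_true)] * len(y_true))
print(f"R2 of mean predictor: {r2_mean_model}")
```

Because MSE squares each error, it penalizes large misses more heavily than MAE does, so the two metrics can rank models differently when a few predictions are far off.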

| Algorithm                   | MSE    | MAE    | R²     |
|-----------------------------|--------|--------|--------|
| Linear Regression           | 0.5559 | 0.5332 | 0.5758 |
| Decision Tree Regressor     | 0.4940 | 0.4539 | 0.6230 |
| Random Forest Regressor     | 0.2552 | 0.3274 | 0.8053 |
| Gradient Boosting Regressor | 0.2940 | 0.3717 | 0.7756 |
| Support Vector Regressor    | 0.3570 | 0.3986 | 0.7276 |
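
MSE is in squared target units, which makes it hard to read directly. Taking the square root gives the RMSE in the target's own unit. The sketch below converts the MSE values from the table, assuming the scikit-learn convention that `MedHouseVal` is expressed in units of $100,000; the dollar figures are a rough reading aid, not additional results.

```python
import math

# MSE values copied from the comparison table above.
mse_by_model = {
    "Linear Regression": 0.5559,
    "Decision Tree Regressor": 0.4940,
    "Random Forest Regressor": 0.2552,
    "Gradient Boosting Regressor": 0.2940,
    "Support Vector Regressor": 0.3570,
}

# RMSE = sqrt(MSE), expressed in the same unit as the target.
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    # Assumption: one target unit corresponds to $100,000.
    print(f"{name:<28} RMSE = {rmse:.4f}  (about ${rmse * 100_000:,.0f})")
```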

---

### **Best-Performing Algorithm**
- **Algorithm**: Random Forest Regressor
- **Justification**: 
  - It achieved the lowest Mean Squared Error (MSE) and Mean Absolute Error (MAE), indicating that its predictions are closest to the actual values.
  - It also obtained the highest R-squared (R²) score of 0.8053, showing that it explains the largest proportion of variance in the target variable. 
  - The ensemble nature of Random Forest helps it effectively model complex relationships and reduces overfitting.

### **Worst-Performing Algorithm**
- **Algorithm**: Linear Regression
- **Reasoning**: 
  - Linear Regression assumes a linear relationship between features and the target variable, which is a significant limitation for this dataset.
  - It had the highest MSE and MAE among all models, and its R-squared score of 0.5758 was the lowest, indicating it explains the least variance in the target variable.

---

### **Conclusion**
On this 80/20 split, with every model at its scikit-learn default hyperparameters (apart from fixed random seeds), the ensemble methods performed best: Random Forest (R² = 0.8053) and Gradient Boosting (R² = 0.7756) clearly outperformed Linear Regression (R² = 0.5758). Linear Regression remains a useful baseline, but it cannot capture the non-linear patterns in the data. Random Forest Regressor outperformed all other models on every metric, making it the best choice among the models tested here.
