# USA - California Housing Price Prediction 🏡  
This notebook trains a **Random Forest Regressor** on the **California Housing Dataset** from `scikit-learn`.  
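The goal is to predict the median house value of a California district (block group), expressed in units of 100,000 USD, from eight numeric features such as median income, house age, and location.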

## 📌 Steps:
1. Load the dataset
2. Split the data
3. Train a model
4. Evaluate performance
5. Analyze feature importance
6. Save the trained model

In [None]:
import numpy as np
import pandas as pd
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
# Note: root_mean_squared_error requires scikit-learn 1.4 or newer
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

In [None]:
# 📌 Step 1: Load dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("Dataset loaded successfully!")
df.head()  # Display first 5 rows

## 📌 Step 2: Split Data  
We'll split the dataset into **training (80%)** and **testing (20%)** sets.

In [None]:
X = df.drop(columns=['target'])  # Drop target column only
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Data split complete!")
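
To see what the 80/20 split and `random_state=42` do in practice, here is a minimal sketch on a made-up 20-row array. The names `X_demo` and `y_demo` are illustrative assumptions and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, and the row index as target
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Same settings as the notebook: 20% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(f"Train rows: {len(X_tr)}, test rows: {len(X_te)}")
print(f"Split is reproducible: {np.array_equal(X_te, X_te2)}")
print(f"Rows shared by train and test: {len(set(y_tr) & set(y_te))}")
```

Fixing `random_state` keeps the evaluation numbers comparable between runs; changing it produces a different split and slightly different metrics.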

## 📌 Step 3: Train the Model  
We'll use a **Random Forest Regressor** with 100 trees.

In [None]:
# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Model training complete!")

## 📌 Step 4: Evaluate Model Performance  
We'll measure performance using **Mean Absolute Error (MAE)**, **Root Mean Squared Error (RMSE)**, and **R² Score**.

In [None]:

y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Express the errors relative to the mean target value of the test set
mean_price = np.mean(y_test)
mae_percentage = (mae / mean_price) * 100
rmse_percentage = (rmse / mean_price) * 100

# Heuristic "accuracy": 100% minus the relative MAE (not a standard regression metric)
model_accuracy = (1 - (mae / mean_price)) * 100
error_percentage = 100 - model_accuracy  # equal to mae_percentage by construction

# Print raw evaluation results
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R² Score: {r2:.2f}")
print(f"MAE as Percentage: {mae_percentage:.2f}%")
print(f"RMSE as Percentage: {rmse_percentage:.2f}%")
print(f"Model Accuracy: {model_accuracy:.2f}%")
print(f"Error Percentage: {error_percentage:.2f}%")

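### 🔍 Model Evaluation Summary:
- **Mean Absolute Error (MAE):** 0.33 (15.94% of the mean target value)
- **Root Mean Squared Error (RMSE):** 0.51 (24.59% of the mean target value)
- **R² Score:** 0.81
- **Model Accuracy:** 84.06% (computed as 100% minus the relative MAE)
- **Error Percentage:** 15.94% (the relative MAE again)

💡 The target is expressed in units of 100,000 USD, so an MAE of 0.33 means the predictions are off by roughly 33,000 USD on average. The **84.06% accuracy** figure is a heuristic: it is 100% minus the MAE divided by the mean test-set value. It does not mean that 15.94% of predictions are wrong; nearly every prediction carries some error, and the average size of that error is about 15.94% of the mean house value. The R² score of 0.81 indicates that the model explains about 81% of the variance in the test-set prices. The RMSE is noticeably larger than the MAE, which suggests that a few districts have much larger errors than the typical one.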
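
To make the metrics above concrete, here is a minimal sketch that recomputes them with plain NumPy on a made-up four-value example. The helper `regression_metrics` and the toy arrays are illustrative assumptions; the notebook itself relies on the `sklearn.metrics` functions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R² and the MAE relative to the mean target (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))          # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))   # penalizes large errors more strongly
    ss_res = np.sum(residuals ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot

    return {
        "mae": mae,
        "rmse": rmse,
        "r2": r2,
        "mae_pct_of_mean": 100.0 * mae / y_true.mean(),
    }

# Made-up values, in the same units as the target (100,000 USD)
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]

metrics = regression_metrics(toy_true, toy_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# mae: 0.250, rmse: 0.354, r2: 0.900, mae_pct_of_mean: 10.000
```

In this toy case two of the four predictions are exact, yet the relative MAE is 10%. The percentage describes the average size of the error, not the share of predictions that are wrong.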

## 📌 Step 5: Feature Importance Analysis  
We'll visualize the importance of different features in predicting house prices.

In [None]:
# Get feature importances
feature_importances = model.feature_importances_
feature_names = X.columns

# Create a DataFrame for visualization
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df, palette='viridis', hue='Feature', legend=False)
plt.title('Feature Importance in House Price Prediction')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

### 🔍 Understanding Feature Importance:
Feature importance shows which variables the model relies on most when predicting house prices. The values from `feature_importances_` are impurity-based: they measure how much each feature reduces prediction error across the trees' splits on the training data, and they sum to 1. They describe the model's behavior, not a causal effect on prices.

- Features with **higher bars** have a **stronger influence** on the model's predictions.  
- Features with very low importance are candidates for removal in future models, provided the test metrics are rechecked afterwards.  
- This insight can help homeowners, real estate agents, and policymakers see which factors the model associates most strongly with property values.  

🔹 **Top Influential Features:**
1. **MedInc (Median Income)** - The most influential feature in the model.
2. **AveRooms (Average Rooms per Household)** - Districts with more rooms per household are often associated with higher values.
3. **AveOccup (Average Occupancy per Household)** - The average number of household members, a rough indicator of crowding and housing demand.
4. **Latitude** - Geographic location plays a role in property valuation.
5. **Longitude** - Together with latitude, this captures regional price differences.
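
Impurity-based importances are computed on the training data, so it can help to cross-check them on held-out data. One option, not used in this notebook, is scikit-learn's `permutation_importance`, which measures how much the score drops when a feature's values are shuffled. The sketch below uses synthetic data in which only the first column matters; `X_demo`, `y_demo`, and `demo_model` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration): only column 0 drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and record the drop in R²
result = permutation_importance(
    demo_model, X_te, y_te, n_repeats=5, random_state=0
)

# Rank features from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.3f} +/- {std:.3f}")
```

In the notebook, the same call would presumably take `model`, `X_test`, and `y_test`; if the two rankings disagree strongly, the impurity-based one deserves a closer look.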


## 📌 Step 6: Save the Trained Model  
We'll save the trained model as `model.pkl` for deployment.

In [None]:
joblib.dump(model, "model.pkl")
print("Model saved as model.pkl")
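
For deployment, the saved file has to be loaded again before it can make predictions. The sketch below shows a save/load round trip with `joblib` on a small synthetic model; the file name `demo_model.pkl`, the temporary directory, and the demo data are assumptions chosen so the example does not overwrite `model.pkl`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model with 8 features, like the housing data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)
demo_model = RandomForestRegressor(n_estimators=10, random_state=0)
demo_model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"
    joblib.dump(demo_model, model_path)   # save to disk
    restored = joblib.load(model_path)    # load it back

# The restored model should give the same predictions as the original
original_pred = demo_model.predict(X_demo[:5])
restored_pred = restored.predict(X_demo[:5])
predictions_match = np.allclose(original_pred, restored_pred)
print(f"Predictions match after reload: {predictions_match}")
```

Pickle-based files can execute arbitrary code when loaded, so only call `joblib.load` on files from a trusted source, and load the model with the same scikit-learn version that was used to train it.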