---
# Analyzing and Predicting California Housing Prices
---
## Introduction:

In this analysis, we will be working with a comprehensive dataset focused on the California housing market. The dataset provides valuable information about various features that influence housing prices in the region. By exploring and analyzing this dataset, we aim to gain insights into the factors that affect housing values in different areas of California.


---
# Dataset: Overview and Features
---
## Data Description
The dataset comprises multiple columns representing different aspects of housing in California. Here is an overview of the features included in the dataset:

1. **Longitude**: The longitude coordinate of the housing unit's location.
2. **Latitude**: The latitude coordinate of the housing unit's location.
3. **Housing Median Age**: The median age of houses in the specified area.
4. **Total Rooms**: The total number of rooms across all housing units in the specified area.
5. **Total Bedrooms**: The total number of bedrooms across all housing units in the specified area.
6. **Population**: The total population count in the specified area.
7. **Households**: The total number of households in the specified area.
8. **Median Income**: The median income of households in the area.
9. **Median House Value**: The median value of owner-occupied houses in the area, which serves as the target variable.
10. **Ocean Proximity**: The proximity of the housing unit to the ocean (categories include '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN').
---
## Purpose

The dataset's purpose is to provide a comprehensive and detailed representation of various factors influencing housing prices in California. With this dataset, we can explore the relationships between different features and the median house value, allowing us to gain insights into the key determinants of housing prices in different locations.

---
## Importance

Understanding the factors that contribute to housing prices is crucial for various stakeholders, including home buyers, sellers, real estate agents, and policymakers. By analyzing this dataset, we can identify important trends, patterns, and correlations, ultimately enabling us to make informed decisions and predictions related to housing prices in California.

---
## Conclusion
The dataset obtained from Kaggle provides a comprehensive snapshot of the California housing market, encompassing various features that influence housing prices. Through the analysis of this dataset, we aim to uncover valuable insights into the factors that affect housing values and develop accurate predictive models to forecast median house prices based on the available features. This analysis will contribute to a deeper understanding of the California housing market and facilitate informed decision-making for individuals and professionals in the real estate industry.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

---
# Prepare Data
---
## Import

In [2]:
data = pd.read_csv('California Housing Prices.csv')

In [3]:
data.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [None]:
data['median_house_value'].min()

In [None]:
data['median_house_value'].max()

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.dropna(inplace = True)

In [None]:
# Columns to plot ('data' is the DataFrame loaded above)
columns = ['housing_median_age', 'total_rooms', 'total_bedrooms', 'population',
           'households', 'median_income', 'median_house_value', 'ocean_proximity']

# Set the size of the figure and define the number of rows and columns
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 8))

# Flatten the axes array
axes = axes.flatten()

# Loop through each variable and create a histogram
# (the loop variable is named 'column' so it does not overwrite the 'columns' list)
for i, column in enumerate(columns):
    ax = axes[i]
    sns.histplot(data=data, x=column, kde=True, color='blue', ax=ax)
    ax.set_title(f"Distribution of {column}")
    ax.set_xlabel(column)
    ax.set_ylabel("Frequency")

# Adjust spacing between subplots
plt.tight_layout()
# Save the plot to a file
plt.savefig('NO_clean-Distributions.png')
# Display the plot
plt.show()


In [None]:
# Apply logarithmic transformation to fix skewness
data['median_house_value'] = np.log1p(data['median_house_value'])

# Plot the distribution after transformation
plt.figure(figsize=(8, 6))
sns.histplot(data['median_house_value'], kde=True)
plt.title('Distribution of Transformed Median House Value')
plt.xlabel('Transformed Median House Value')
plt.ylabel('Frequency')
plt.show()
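
Because the target is now on a log scale, anything computed from it afterwards (group means, model predictions) is in log units and has to be mapped back with the inverse transform. A minimal sketch, using made-up prices rather than values from the dataset, of how `np.log1p` and `np.expm1` round-trip:

```python
import numpy as np

# Hypothetical house values, for illustration only (not taken from the dataset)
prices = np.array([15000.0, 120000.0, 250000.0, 500000.0])

log_prices = np.log1p(prices)    # log(1 + x), the transform applied above
restored = np.expm1(log_prices)  # exp(x) - 1, its inverse

print(log_prices.round(3))
print(np.allclose(restored, prices))
```

The transform preserves ordering, so rankings of districts by value are unchanged; only the spacing between values is compressed.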


In [29]:
import numpy as np

# Calculate the IQR
Q1 = np.percentile(data['median_house_value'], 25)
Q3 = np.percentile(data['median_house_value'], 75)
IQR = Q3 - Q1

# Define the outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
data = data[(data['median_house_value'] >= lower_bound) & (data['median_house_value'] <= upper_bound)]

# Plot the distribution after removing outliers
plt.figure(figsize=(8, 6))
sns.histplot(data['median_house_value'], kde=True)
plt.title('Distribution of Median House Value (Outliers Removed)')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')
plt.show()
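
The IQR rule used in the cell above can be checked by hand on a small array. This is a sketch on toy numbers (not the housing data), where the value 100 is a deliberately planted outlier:

```python
import numpy as np

# Toy data; 100 is the planted outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Same 1.5 * IQR fences as in the notebook
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

kept = values[(values >= lower_bound) & (values <= upper_bound)]
print(f"{lower_bound:.2f} {upper_bound:.2f}")  # -3.50 14.50
print(kept)
```

Note that `np.percentile` needs a non-empty array without missing values, so the column should be checked before the quartiles are computed.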



A square root transformation can reduce this skewness and make the distributions more symmetrical. Taking the square root of each value in the `housing_median_age`, `total_rooms`, `total_bedrooms`, `population`, `households`, and `median_income` columns compresses large values more than small ones, which gives a more balanced representation of the data.
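
The effect of the square root on skewness can be illustrated with synthetic data. The sketch below assumes a log-normal sample as a stand-in for a right-skewed count column such as `total_rooms`; the numbers are not from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a count column (assumption)
counts = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5000))

skew_before = counts.skew()          # sample skewness of the raw values
skew_after = np.sqrt(counts).skew()  # sample skewness after the transform

print(f"skewness before: {skew_before:.2f}, after sqrt: {skew_after:.2f}")
```

A skewness closer to zero indicates a more symmetrical distribution; the square root usually reduces right skew without removing it entirely.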

In [None]:
import plotly.express as px

# Create a map plot
fig = px.scatter_mapbox(
    data_frame=data, lat='latitude', lon='longitude',
    color='median_house_value', hover_data=['ocean_proximity'],
    zoom=9, mapbox_style='carto-positron',
    title='Ocean Proximity Map'
)

# Update marker properties
fig.update_traces(marker=dict(size=10, opacity=0.8))

# Customize layout
fig.update_layout(
    margin=dict(l=0, r=0, t=50, b=0),
    font=dict(family='Arial', size=12),
    coloraxis_colorbar=dict(title='Median House Value'),
)

# Show the plot
fig.show()


In [None]:
# Group the data by 'ocean_proximity' and calculate the mean of 'median_house_value'
mean_values = data.groupby('ocean_proximity')['median_house_value'].mean()

# Plot the bar plot
sns.barplot(x=mean_values.index, y=mean_values.values, palette="viridis")

# Add labels and title
plt.title('Mean Median House Value by Ocean Proximity', fontsize=14, fontweight='bold')
plt.xlabel('Ocean Proximity', fontsize=12)
plt.ylabel('Mean Median House Value', fontsize=12)

# Save the plot to a file
plt.savefig('Mean Median House Value by Ocean Proximity.png')
# Show the plot
plt.show()


The bar plot shows that the ISLAND category has the highest mean house value and the INLAND category the lowest. The NEAR OCEAN, <1H OCEAN, and NEAR BAY categories fall in between, at levels similar to one another.

### Build Wrangle Function 


In [9]:
def wrangle(filepath):
    # Load the CSV file
    df = pd.read_csv(filepath)

    # Reduce right skew with a square root transformation
    skewed_columns = ['total_rooms', 'total_bedrooms', 'population',
                      'households', 'median_income']
    df[skewed_columns] = np.sqrt(df[skewed_columns])

    # Calculate the IQR of the target
    Q1 = np.percentile(df['median_house_value'], 25)
    Q3 = np.percentile(df['median_house_value'], 75)
    IQR = Q3 - Q1

    # Define the outlier boundaries
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Remove rows whose target falls outside the boundaries
    df = df[(df['median_house_value'] >= lower_bound) & (df['median_house_value'] <= upper_bound)]

    # Drop the 'housing_median_age' column
    df = df.drop(columns=['housing_median_age'])

    # One-hot encode 'ocean_proximity' into separate 0/1 columns
    ocean_proximity_split = df['ocean_proximity'].str.get_dummies(sep=',')

    # Append the new columns and drop the original 'ocean_proximity' column
    df = pd.concat([df, ocean_proximity_split], axis=1)
    df = df.drop(columns=['ocean_proximity'])

    # Drop rows with missing values
    df = df.dropna()

    return df


In [10]:
df = wrangle('California Housing Prices.csv')
df.head()

|   | longitude | latitude | total_rooms | total_bedrooms | population | households | median_income | median_house_value | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 29.664794 | 11.357817 | 17.944358 | 11.224972 | 2.885342 | 452600.0 | 0 | 0 | 0 | 1 | 0 |
| 1 | -122.22 | 37.86 | 84.255564 | 33.256578 | 49.0 | 33.734256 | 2.881215 | 358500.0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -122.24 | 37.85 | 38.301436 | 13.784049 | 22.271057 | 13.304135 | 2.693956 | 352100.0 | 0 | 0 | 0 | 1 | 0 |
| 3 | -122.25 | 37.85 | 35.693137 | 15.32971 | 23.622024 | 14.798649 | 2.375521 | 341300.0 | 0 | 0 | 0 | 1 | 0 |
| 4 | -122.25 | 37.85 | 40.336088 | 16.733201 | 23.769729 | 16.093477 | 1.961173 | 342200.0 | 0 | 0 | 0 | 1 | 0 |


In [None]:
df.info()
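
The one-hot encoding step inside `wrangle` can be seen in isolation on a toy column. This sketch uses `pd.get_dummies`, a common alternative to the `str.get_dummies` call used above; the toy rows are made up for illustration.

```python
import pandas as pd

# Toy frame reusing category labels from the dataset (rows are invented)
toy = pd.DataFrame({'ocean_proximity': ['NEAR BAY', 'INLAND', '<1H OCEAN', 'INLAND']})

# One 0/1 column per category; each row has exactly one 1
dummies = pd.get_dummies(toy['ocean_proximity'], dtype=int)

print(dummies)
```

Only categories that actually occur in the column get an indicator, so a model trained on one file expects the same set of categories at prediction time.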

## Exploring Distributions of Multiple Variables in the Dataset

In [None]:
columns = ['total_rooms', 'total_bedrooms', 'population',
           'median_income', 'median_house_value', '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

# Set the size of the figure and define the number of rows and columns
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(16, 8))

# Flatten the axes array
axes = axes.flatten()

# Loop through each variable and create a histogram
# (the loop variable is named 'column' so it does not overwrite the 'columns' list)
for i, column in enumerate(columns):
    ax = axes[i]
    sns.histplot(data=df, x=column, kde=True, color='blue', ax=ax, alpha=0.4)
    ax.set_title(f"Distribution of {column}")
    ax.set_xlabel(column)
    ax.set_ylabel("Frequency")

# Hide the subplots that have no variable assigned
for ax in axes[len(columns):]:
    ax.set_visible(False)

# Adjust spacing between subplots
plt.tight_layout()

# Save the plot to a file
plt.savefig('Clean_Distributions.png')
# Display the plot
plt.show() 


In [None]:
df.corr()

In [None]:
# Calculate the correlation matrix
corr_matrix = df.corr()

# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
# Save the plot to a file
plt.savefig('Correlation.png')
plt.show()


Important relationships:

- There is a strong negative correlation (-0.92) between 'longitude' and 'latitude', indicating that locations with higher longitude tend to have lower latitude values.
- 'housing_median_age' has a weak negative correlation (-0.11) with 'longitude' and a weak positive correlation (0.11) with 'median_income'. Because `wrangle` drops this column, these values refer to the raw data rather than to `df`.
- 'total_rooms' and 'total_bedrooms' have a strong positive correlation (0.94) with each other, indicating that areas with more rooms tend to have more bedrooms.
- 'population' and 'households' have a strong positive correlation (0.92) with 'total_rooms' and 'total_bedrooms', suggesting that areas with more rooms and bedrooms tend to have larger populations and households.
- 'median_income' has a moderate positive correlation (0.68) with 'median_house_value', indicating that higher median incomes are associated with higher median house values.
- Among the 'ocean_proximity' categories, '<1H OCEAN' has a moderate positive correlation (0.32) with 'median_house_value', suggesting that areas close to the ocean have higher median house values. Conversely, 'INLAND' has a moderate negative correlation (-0.48) with 'median_house_value', indicating that inland areas have lower median house values.
- 'NEAR BAY' and 'NEAR OCEAN' also show some correlation with 'median_house_value', although less pronounced compared to '<1H OCEAN' and 'INLAND'.
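- 'ISLAND' does not show a significant correlation with 'median_house_value', most likely because very few rows fall into this category, so it has little effect on the overall correlation even though its mean house value is the highest.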

These observations provide insights into the relationships between different variables in the dataset and their potential impact on the target variable, 'median_house_value'.
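
Reading a single column of the correlation matrix is often easier than scanning the whole heatmap. A minimal sketch on a tiny made-up frame (the values are illustrative, not from the dataset) that ranks features by their correlation with the target:

```python
import pandas as pd

# Tiny invented frame, for illustration only
toy = pd.DataFrame({
    'median_income':      [2.0, 3.5, 5.0, 6.5, 8.0],
    'INLAND':             [1, 1, 0, 0, 0],
    'population':         [30.0, 22.0, 28.0, 25.0, 27.0],
    'median_house_value': [90000.0, 150000.0, 240000.0, 310000.0, 420000.0],
})

# Correlation of every feature with the target, strongest positive first
target_corr = (
    toy.corr()['median_house_value']
    .drop('median_house_value')
    .sort_values(ascending=False)
)

print(target_corr)
```

The same three lines applied to `df` would give the ranking behind the bullet points above.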

## Split the Data

In [12]:
df.head()

|   | longitude | latitude | total_rooms | total_bedrooms | population | households | median_income | median_house_value | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 29.664794 | 11.357817 | 17.944358 | 11.224972 | 2.885342 | 452600.0 | 0 | 0 | 0 | 1 | 0 |
| 1 | -122.22 | 37.86 | 84.255564 | 33.256578 | 49.0 | 33.734256 | 2.881215 | 358500.0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -122.24 | 37.85 | 38.301436 | 13.784049 | 22.271057 | 13.304135 | 2.693956 | 352100.0 | 0 | 0 | 0 | 1 | 0 |
| 3 | -122.25 | 37.85 | 35.693137 | 15.32971 | 23.622024 | 14.798649 | 2.375521 | 341300.0 | 0 | 0 | 0 | 1 | 0 |
| 4 | -122.25 | 37.85 | 40.336088 | 16.733201 | 23.769729 | 16.093477 | 1.961173 | 342200.0 | 0 | 0 | 0 | 1 | 0 |


In [11]:
# Split the DataFrame into target and feature variables
target = df['median_house_value']
features = df.drop('median_house_value', axis=1)

In [13]:
target

0        452600.0
1        358500.0
2        352100.0
3        341300.0
4        342200.0
           ...   
20635     78100.0
20636     77100.0
20637     92300.0
20638     84700.0
20639     89400.0
Name: median_house_value, Length: 19369, dtype: float64

In [14]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

In [15]:
X_train.head(4)

|   | longitude | latitude | total_rooms | total_bedrooms | population | households | median_income | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9898 | -122.28 | 38.3 | 22.93469 | 12.328828 | 15.652476 | 11.401754 | 1.418943 | 0 | 0 | 0 | 1 | 0 |
| 11939 | -117.42 | 33.93 | 53.712196 | 24.392622 | 38.845849 | 24.289916 | 1.783115 | 0 | 1 | 0 | 0 | 0 |
| 3931 | -118.59 | 34.21 | 48.321838 | 25.865034 | 44.56456 | 25.39685 | 1.7313 | 1 | 0 | 0 | 0 | 0 |
| 19481 | -120.97 | 37.66 | 52.535702 | 25.13961 | 35.496479 | 24.0 | 1.422217 | 0 | 1 | 0 | 0 | 0 |


# Build Model

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

## Iterate 

**LinearRegression Model**

In [17]:
#Linear Regression 
linear_model = LinearRegression()
#fit the model
linear_model.fit(X_train, y_train)

LinearRegression()

In [18]:
linear_y_predict = linear_model.predict(X_test)
linear_y_predict

array([244742.35408019,  95670.97446307, 189400.16941342, ...,
       166108.63083582,  85133.35110414, 229242.59230777])

In [19]:
# Evaluate Linear Regression model
linear_mse = mean_squared_error(y_test, linear_y_predict)
# Take the square root directly; the 'squared=False' argument is deprecated in recent scikit-learn releases
linear_rmse = np.sqrt(linear_mse)
linear_r2 = r2_score(y_test, linear_y_predict)


# Create the results DataFrame
results = pd.DataFrame({
    'Model': ['Linear Regression'],
    'Mean Squared Error (MSE)': [linear_mse],
    'Root Mean Squared Error (RMSE)': [linear_rmse],
    'R-squared (R2)': [linear_r2]
})
results


|   | Model | Mean Squared Error (MSE) | Root Mean Squared Error (RMSE) | R-squared (R2) |
|---|---|---|---|---|
| 0 | Linear Regression | 3600926000.0 | 60007.713743 | 0.610666 |


In [None]:
# Set the style of the plot
sns.set(style='whitegrid')

# Scatter plot of actual vs. predicted values; the black line is the ideal y = x reference, not a fitted line
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(y_test, linear_y_predict, edgecolors=(0, 0, 0), alpha=0.5)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k-', lw=4)
ax.set_xlabel('Actual Values')
ax.set_ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values (Linear Regression)')

# Set background and gridlines
sns.despine()
plt.grid(True, linestyle='-', linewidth=0.5, alpha=0.5)

# Save the plot as an image file
plt.savefig('linear_regression_plot.png')

# Show the plot
plt.show()

**Random Forest Regression**

In [None]:
# Random Forest Regression
rf_model = RandomForestRegressor()

# Train the model using the training data
rf_model.fit(X_train, y_train)

# Make predictions on the test set
rf_y_pred = rf_model.predict(X_test)

# Evaluate Random Forest Regression model
rf_mse = mean_squared_error(y_test, rf_y_pred)
# Take the square root directly; the 'squared=False' argument is deprecated in recent scikit-learn releases
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_y_pred)

# Create a table DataFrame for the evaluation metrics
results = pd.DataFrame({
    'Model': ['Random Forest Regression'],
    'Mean Squared Error (MSE)': [rf_mse],
    'Root Mean Squared Error (RMSE)': [rf_rmse],
    'R-squared (R2)': [rf_r2]
})
results

- Mean Squared Error (MSE): The MSE value for the Random Forest Regression model is 1.940063e+09. This metric measures the average squared difference between the predicted and actual values. A lower MSE indicates better model performance in terms of minimizing the prediction errors.

- Root Mean Squared Error (RMSE): The RMSE value for the Random Forest Regression model is 44046.148555. This metric is the square root of the MSE and provides a measure of the average magnitude of the prediction errors. A lower RMSE indicates better accuracy and closer fit to the actual values.

- R-squared (R2): The R-squared value for the Random Forest Regression model is 0.801203. R-squared represents the proportion of the variance in the target variable that is predictable from the independent variables. A higher R-squared indicates that the model explains a larger portion of the variability in the data, with 1 being the perfect fit. In this case, the model has an R-squared of 0.801203, indicating that approximately 80.12% of the variance in the target variable can be explained by the independent variables.

Overall, the Random Forest Regression model clearly outperforms Linear Regression: its RMSE drops from about 60,008 to about 44,046, and its R-squared rises from 0.61 to 0.80, so it captures a substantially larger share of the variance in the target variable.
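
The three metrics described above can be computed by hand, which makes their definitions concrete. A minimal sketch with four made-up actual and predicted values (not model output):

```python
import numpy as np

# Invented actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)   # average squared error
rmse = np.sqrt(mse)          # same units as the target

ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"{mse:.4f} {rmse:.4f} {r2:.4f}")  # 0.7500 0.8660 0.8500
```

Because RMSE is in the units of the target, an RMSE of about 44,046 means that errors of that size in house value are typical for the model.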

In [None]:
# Set seaborn style
sns.set(style='ticks')

# Scatter plot of actual vs. predicted values; the black line is the ideal y = x reference, not a line of best fit
fig, ax = plt.subplots()
ax.scatter(y_test, rf_y_pred, edgecolors=(0, 0, 0), alpha=0.5)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k-', lw=4)
ax.set_xlabel('Actual Values')
ax.set_ylabel('Predicted Values')
plt.title('Actual vs Predicted Values (Random Forest Regression)')

# Set gridlines
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.5)
plt.savefig('Actual vs Predicted Values (Random Forest Regression).png')

plt.show()


**GridSearchCV**

In [None]:
# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [100, 200, 300],  # Test different numbers of estimators
    'max_depth': [None, 5, 10],  # Test different maximum depths
    'min_samples_split': [2, 5, 10],  # Test different minimum samples split
    'min_samples_leaf': [1, 2, 4]  # Test different minimum samples leaf
}



# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5)

# Fit the model to the training data
grid_search.fit(X_train, y_train) 

# Get the best parameter combination found by GridSearchCV
best_params = grid_search.best_params_

# GridSearchCV refits the best configuration on the full training set by default (refit=True),
# so the tuned model can be taken directly from the search object
best_rf_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_rf_model.predict(X_test)
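
Grid search can be slow because every parameter combination is trained once per cross-validation fold. The sketch below only counts the fits implied by the grid and `cv=5` used above; it does not train anything.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# Every combination of one value per parameter
candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)  # 81 405
```

With the default `refit=True`, one additional fit of the best combination on the full training set follows. Passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel where several cores are available.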



In [None]:
# Calculate the R^2 score of the tuned model on the test set
# (for regressors, .score() returns R^2, not classification accuracy)
test_r2 = best_rf_model.score(X_test, y_test)

# Print the score
print("R^2: {:.2f}".format(test_r2))


In [None]:
# Create a scatter plot of actual versus predicted values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, edgecolors=(0, 0, 0), alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k-', lw=4)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values (Random Forest Regression)')

# Set background and gridlines
sns.despine()
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.5)
plt.savefig('Act Vs. Prd (RDFR_Tun).png')
# Show the plot
plt.show()

In [None]:
import joblib

# Save the tuned model to a file
joblib.dump(best_rf_model, 'random_forest_regression_model.pkl')


## The Most Important Features

In [None]:
# Reuse the tuned model selected by the grid search (already fitted on the training data),
# so the importances describe the same model that was evaluated and saved above
best_params = grid_search.best_params_
best_rf_model = grid_search.best_estimator_


In [None]:
# Get feature importances
importances = best_rf_model.feature_importances_

# Create a list of feature names
feature_names = X_train.columns

# Create a dictionary to store feature importance scores
feature_importances = dict(zip(feature_names, importances))

# Sort the feature importances in descending order
sorted_feature_importances = sorted(feature_importances.items(), key=lambda x: x[1], reverse=True)

# Print the most important features
num_features = 10  # Number of top features to display
print(f"Top {num_features} most important features:")
for feature, importance in sorted_feature_importances[:num_features]:
    print(f"{feature}: {importance}")


In [None]:
# Extract top features and their importances
top_features = [feature for feature, importance in sorted_feature_importances[:num_features]]
top_importances = [importance for feature, importance in sorted_feature_importances[:num_features]]

# Plot the most important features
sns.barplot(x=top_importances, y=top_features, orient='horizontal')
plt.title(f"Top {num_features} Most Important Features")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.savefig(f'Top {num_features} important features.png')
plt.show()

1. `median_income` has the highest importance score of 0.4418. This suggests that the median income of the households in the area has a strong influence on the prediction of the target variable. Higher median income is likely associated with higher median house values.

2. `INLAND` is the second most important feature with an importance score of 0.1657. It indicates whether the property is located inland or not. The presence of this feature suggests that the location of the property significantly affects the predicted house values, with inland properties potentially having different price dynamics compared to those near the coast.

3. `longitude` and `latitude` have importance scores of 0.1181 and 0.1091, respectively. These geographical coordinates indicate the precise location of the properties. The importance of these features suggests that specific geographic locations can have a noticeable impact on the predicted house values.

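4. `housing_median_age` has an importance score of 0.0486, indicating that the age of the houses in the area is a somewhat influential factor in predicting median house values. Older or newer houses may have different price ranges. Note that this score can only come from a model trained with this column included; the `wrangle` function shown above drops it.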

5. `population`, `total_rooms`, `total_bedrooms`, and `households` have importance scores ranging from 0.0347 to 0.0192. These features represent population-related characteristics and the number of rooms, bedrooms, and households in the area. While not as influential as some other features, they still contribute to the prediction of median house values.

6. `NEAR OCEAN` has the lowest importance score of 0.0076. It indicates whether the property is located near the ocean. Although it has the lowest importance among the top 10 features, it still contributes to the model's predictions, suggesting that proximity to the ocean can have a minor impact on the median house values.

Overall, the importance scores show each feature's relative contribution to the model's predictions of median house values. Median income, location (inland vs. coastal), and geographic coordinates contribute most, while population-related characteristics contribute less. These scores describe what the model relies on, not causal effects on prices.
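
The scores quoted above are the impurity-based importances that `RandomForestRegressor` exposes as `feature_importances_`. One common cross-check is permutation importance, which measures how much the score drops when a feature's values are shuffled. The sketch below runs both on synthetic data in which only the first feature matters; it is an illustration under that assumption, not a rerun of the notebook's model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: three features, only feature 0 drives the target
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances (normalized to sum to 1)
impurity = model.feature_importances_

# Permutation importances: mean score drop over repeated shuffles
perm = permutation_importance(model, X, y, n_repeats=5, random_state=0)

print(impurity.round(3))
print(perm.importances_mean.round(3))
```

In practice, permutation importance is usually computed on held-out data such as `X_test` and `y_test`, so that it reflects what the model needs in order to generalize.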

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create a figure and subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 6))

# Plotting feature importances
axes[0].barh([feature for feature, _ in sorted_feature_importances[:num_features]],
             [importance for _, importance in sorted_feature_importances[:num_features]])
axes[0].set_title(f"Top {num_features} Most Important Features")
axes[0].set_xlabel("Importance Score")
axes[0].set_ylabel("Feature")

# Plotting actual vs predicted values
axes[1].scatter(y_test, y_pred, edgecolors=(0, 0, 0))
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k-', lw=4)
axes[1].set_title("Actual vs Predicted Values")
axes[1].set_xlabel("Actual")
axes[1].set_ylabel("Predicted")

# Plotting residuals
residuals = y_test - y_pred
sns.histplot(residuals, kde=True, ax=axes[2])
axes[2].set_title("Residuals Distribution")
axes[2].set_xlabel("Residuals")
axes[2].set_ylabel("Frequency")

# Adjust spacing between subplots
plt.tight_layout()

# Save the figure as a PNG image
plt.savefig("best_model.png", dpi=300)

# Display the figure
plt.show()


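If the residuals are centered on zero and roughly symmetric, the model's predictions are not systematically too high or too low. A distribution close to normal with few extreme values further suggests that large errors are rare, which makes the predictions more dependable for stakeholders. Note that normally distributed residuals are an assumption of linear regression, not of Random Forest, so here the plot serves as a diagnostic rather than an assumption check.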
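
The visual check can be backed by two numbers: the mean of the residuals (bias) and their skewness (asymmetry). A minimal sketch on synthetic values, where the predictions are assumed to equal the actual values plus symmetric noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "actual" values and predictions with symmetric noise (assumption)
y_true = rng.uniform(50_000, 500_000, size=1000)
y_hat = y_true + rng.normal(0, 40_000, size=1000)

residuals = y_true - y_hat
mean_resid = residuals.mean()

# Sample skewness: third standardized moment of the residuals
skewness = np.mean(((residuals - mean_resid) / residuals.std()) ** 3)

print(f"mean residual: {mean_resid:.0f}, skewness: {skewness:.2f}")
```

A mean residual far from zero relative to the RMSE would point to systematic over- or under-prediction, and a large absolute skewness to errors that are lopsided in one direction.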

# Conclusion:

The findings from this analysis provide valuable insights into the California housing market. The relationship between ocean proximity and median house values suggests that properties near the ocean or on islands tend to have higher prices. Additionally, variables such as longitude, latitude, housing age, room and bedroom counts, population, household size, and median income all have varying degrees of impact on housing prices.

The Random Forest Regression model, with an R-squared of about 0.80 (compared with 0.61 for Linear Regression), demonstrates strong predictive performance, indicating its usefulness in estimating housing prices. This information can guide decisions related to real estate investments, property valuation, and market analysis.

By drawing on the insights and predictions from this analysis, stakeholders can make better-informed decisions in their respective areas of the California housing market.

In [None]:
import random

# Assumes a fitted RandomForestRegressor named 'rf_model' (trained above)

# The ocean-proximity columns are one-hot encoded, so exactly one of them is set to 1
categories = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
chosen_category = random.choice(categories)

# Create one random sample; the count and income features get the same
# square root transformation that wrangle() applied to the training data
prediction_data = pd.DataFrame({
    'longitude': [random.uniform(-124.3, -114.3)],
    'latitude': [random.uniform(32.5, 42.5)],
    'total_rooms': [np.sqrt(random.randint(1, 10000))],
    'total_bedrooms': [np.sqrt(random.randint(1, 5000))],
    'population': [np.sqrt(random.randint(1, 10000))],
    'households': [np.sqrt(random.randint(1, 5000))],
    'median_income': [np.sqrt(random.uniform(0, 15))],
    **{category: [int(category == chosen_category)] for category in categories}
})

# Make a prediction for the random sample
predictions = rf_model.predict(prediction_data)

# Use the predictions as needed
print(predictions)


In [None]:
target

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# A single random sample has no ground truth, so evaluate on the held-out test set instead
test_predictions = rf_model.predict(X_test)
actual_values = y_test.values

# Calculate the mean squared error (MSE)
mse = mean_squared_error(actual_values, test_predictions)

# Calculate the root mean squared error (RMSE)
rmse = np.sqrt(mse)

# Calculate the R-squared (R2) score
r2 = r2_score(actual_values, test_predictions)

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2):", r2)
