<a href="https://colab.research.google.com/github/jspe406/C964/blob/main/house_pred.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn import metrics

In [None]:
# Load the built-in California Housing dataset from sklearn
house_price_dataset = sklearn.datasets.fetch_california_housing()
print(house_price_dataset)

The dataset needs to be converted to a pandas DataFrame, which makes it easier to visualize, clean, and work with.

In [None]:
housing = pd.DataFrame(house_price_dataset.data, columns=house_price_dataset.feature_names)
housing.head(10)

In [None]:
# Add target column
housing['Price'] = house_price_dataset.target
housing.head(10)

In [None]:
# Check the number of rows and columns
housing.shape

In [None]:
# Count the missing values in each column
housing.isnull().sum()

In [None]:
# Statistical measures of the dataset
housing.describe()

In [None]:
# Correlation between various features in the dataset
correlation = housing.corr()
# Heatmap to understand the correlation
plt.figure(figsize=(10,10))
sns.heatmap(correlation, cbar=True, square=True, fmt='.1f', annot=True, annot_kws={'size':8}, cmap='Blues')

In [None]:
# Split the data and target
X = housing.drop(['Price'], axis=1)
Y = housing['Price']
print(X)
print(Y)

In [None]:
# Split the data into training data and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
print(X.shape, X_train.shape, X_test.shape)
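
The split above holds out 20% of the rows for testing, and `random_state=2` makes the shuffle repeatable. As a rough plain-Python sketch of that idea (not scikit-learn's actual implementation, so the selected rows will differ from `train_test_split`), the hypothetical helper `split_indices` below shuffles row indices with a fixed seed and slices off the test portion. Rounding the test size up with `math.ceil` is an assumption of this sketch.

```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=2):
    """Return (train_indices, test_indices) from a seeded shuffle.

    Illustrative only: scikit-learn uses its own random number
    generator, so it will not pick these same rows.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same order every run
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # prints: 8 2
```

Calling `split_indices` twice with the same seed returns the same partition, which is what makes results reproducible between runs.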

In [None]:
# Load the XGBoost Regressor for model training
model = XGBRegressor()

In [None]:
# Training the model with X_train
model.fit(X_train, Y_train)

In [None]:
# Predictions on the training data
training_data_prediction = model.predict(X_train)
print(training_data_prediction)

In [None]:
# R-squared score (1.0 is a perfect fit; higher is better)
r2_train = metrics.r2_score(Y_train, training_data_prediction)
print('R-squared score: ', r2_train)

# Mean Absolute Error
mae_train = metrics.mean_absolute_error(Y_train, training_data_prediction)
print('Mean Absolute Error: ', mae_train)

In [None]:
# Predictions on the test data
test_data_prediction = model.predict(X_test)
print(test_data_prediction)

In [None]:
# R-squared score (1.0 is a perfect fit; higher is better)
r2_test = metrics.r2_score(Y_test, test_data_prediction)
print('R-squared score: ', r2_test)

# Mean Absolute Error
mae_test = metrics.mean_absolute_error(Y_test, test_data_prediction)
print('Mean Absolute Error: ', mae_test)
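
To make the two metrics above concrete, here is a minimal plain-Python sketch of what they measure. The helper names `r2_by_hand` and `mae_by_hand` and the toy values are made up for illustration; the notebook itself relies on `metrics.r2_score` and `metrics.mean_absolute_error`.

```python
def mae_by_hand(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2_by_hand(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values, not taken from the housing data
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

mae = mae_by_hand(actual, predicted)
r2 = r2_by_hand(actual, predicted)
print(round(mae, 4), round(r2, 4))  # prints: 0.25 0.9
```

MAE is in the same units as the target, so lower is better. R-squared is unitless: 1.0 means the predictions match exactly, and 0.0 means the model does no better than always predicting the mean.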

## 1. Scatter Plot with Regression Line:
This plot compares the actual house prices (Y_test) with the model's predictions (test_data_prediction). The regression line is a least-squares fit of predicted price on actual price, so it shows the overall trend of the predictions. Ideally, the points cluster tightly around that line and the line itself lies close to the diagonal where predicted equals actual. Widely scattered points, or a line that departs clearly from the diagonal, indicate that the model is not capturing the relationship between the features and the target well.

In [None]:
# 1. Scatter Plot with Regression Line
plt.figure(figsize=(8, 6))
sns.regplot(x=Y_test, y=test_data_prediction, scatter_kws={'alpha':0.5})
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted Price (Test Data)")
plt.show()
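
One way to read the regression line numerically is to look at its slope and intercept: a slope near 1 and an intercept near 0 mean the line sits close to the diagonal where predicted equals actual. The sketch below fits that line by ordinary least squares in plain Python. The helper `fit_line` and the toy values are hypothetical and are not part of the notebook's pipeline.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values, not taken from the housing data
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.4, 2.1, 2.9, 3.6]

slope, intercept = fit_line(actual, predicted)
print(round(slope, 2), round(intercept, 2))  # prints: 0.74 0.65
```

In this toy case the slope is below 1, which means the predictions are pulled toward the middle: low prices are over-predicted and high prices are under-predicted.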

## 2. Residual Plot:
The residuals are the differences between the actual prices and the predicted prices (Y_test - test_data_prediction). This plot shows the residuals against the predicted values. A good model should have residuals scattered randomly around zero, with no clear patterns or trends. A visible pattern suggests that the model is systematically under- or over-estimating prices in certain ranges. For example, a curved band of residuals suggests the model is missing some non-linear structure in the data.

In [None]:
# 2. Residual Plot
plt.figure(figsize=(8, 6))
residuals = Y_test - test_data_prediction
plt.scatter(test_data_prediction, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel("Predicted Price")
plt.ylabel("Residuals")
plt.title("Residual Plot (Test Data)")
plt.show()
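
A pattern in the residual plot can also be checked numerically by averaging the residuals within ranges of the predicted value. If the averages drift away from zero from one range to the next, the model is probably biased in those ranges. This is a minimal plain-Python sketch; the helper `mean_residual_by_bin`, the choice of three equal-width bins, and the toy values are all assumptions made for illustration.

```python
def mean_residual_by_bin(predicted, residuals, n_bins=3):
    """Average residual within equal-width bins of the predicted value.

    Returns one mean per bin, or None for a bin with no points.
    """
    lo, hi = min(predicted), max(predicted)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width if all values match
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, r in zip(predicted, residuals):
        idx = min(int((p - lo) / width), n_bins - 1)  # clamp the maximum value
        sums[idx] += r
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy values, not taken from the housing data
predicted = [1.0, 1.5, 2.5, 3.0, 4.0, 4.5]
residuals = [-0.2, -0.4, 0.1, -0.1, 0.5, 0.7]

bin_means = mean_residual_by_bin(predicted, residuals)
print([round(m, 2) for m in bin_means])
```

Here the bin averages rise from negative to positive as the predicted price grows, which is the kind of trend that would show up as a tilted cloud in the residual plot.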

## 3. Distribution of Residuals:
This histogram shows the distribution of the residuals. Ideally, the residuals are roughly symmetric and centered on zero. A mean far from zero points to systematic bias, a pronounced skew means the model errs more in one direction than the other, and a wide spread means large errors are common. A roughly bell-shaped distribution centered on zero is consistent with errors that are mostly random, although it does not prove the model is unbiased for every range of prices.


In [None]:
# 3. Distribution of Residuals
plt.figure(figsize=(8, 6))
sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Distribution of Residuals (Test Data)")
plt.show()
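
The shape of the histogram can be summarized with a few numbers: the mean (is it near zero?), the standard deviation (how wide is the spread?), and the skewness (is one tail longer?). The sketch below computes them with the standard library. The helper `residual_summary` and the toy values are hypothetical, and the skewness used here is the simple population form, which is one of several common definitions.

```python
import statistics


def residual_summary(residuals):
    """Return (mean, population std, population skewness) of the residuals."""
    mean = statistics.fmean(residuals)
    std = statistics.pstdev(residuals)
    # Average cubed z-score: 0 for a symmetric sample, > 0 for a long right tail
    skew = sum(((r - mean) / std) ** 3 for r in residuals) / len(residuals)
    return mean, std, skew


# Toy values, not taken from the housing data
symmetric = [-1.0, 0.0, 0.0, 1.0]
right_tailed = [-0.5, -0.5, -0.5, 1.5]

sym_mean, sym_std, sym_skew = residual_summary(symmetric)
rt_mean, rt_std, rt_skew = residual_summary(right_tailed)
print(round(sym_skew, 3), round(rt_skew, 3))
```

Both toy samples have a mean of zero, but only the second has a long right tail, so its skewness is clearly positive. This shows why the mean alone is not enough to judge the histogram.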

In summary, these three plots help us assess the performance and potential issues of the trained model by visually inspecting the relationship between actual and predicted values, the presence of patterns in the errors, and the distribution of the errors.

In [None]:
# Add the model's predictions for every row (training and test rows alike) as a new column
housing['Predicted_Price'] = model.predict(X)

# Display the updated dataframe with the predicted prices
print(housing.head())


In [None]:
#@title Interactive Querying with ipywidgets

import ipywidgets as widgets
from IPython.display import display

# Function to filter the DataFrame based on user input
def filter_data(MedInc, HouseAge):
  filtered_df = housing[(housing['MedInc'] >= MedInc) & (housing['HouseAge'] <= HouseAge)]
  print(filtered_df.head())


# Create widgets for user input
medinc_slider = widgets.FloatSlider(value=2, min=0, max=15, step=0.1, description='MedInc:')
houseage_slider = widgets.IntSlider(value=50, min=0, max=50, step=1, description='HouseAge:')

# Create an output widget to display the filtered data
output = widgets.Output()

# Define a function to handle widget interactions
def on_value_change(change):
  with output:
    output.clear_output()
    filter_data(medinc_slider.value, houseage_slider.value)

# Observe the widget values for changes
medinc_slider.observe(on_value_change, names='value')
houseage_slider.observe(on_value_change, names='value')


# Display widgets and output
display(medinc_slider, houseage_slider, output)
