<a href="https://colab.research.google.com/github/joehawkens/MachineLearning/blob/main/MODULE_3_Thursday.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MODULE 3**

# **Module Resources**

**Problem:** Regression

**Target:** Home Price

- Module Overview: https://byui-cse.github.io/cse450-course/module-03/
- Data Dictionary: https://byui-cse.github.io/cse450-course/module-03/housing-dictionary.txt
- Dataset: https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing.csv
- Holdout Dataset: https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing_holdout_test.csv
- Holdout Mini Dataset: https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing_holdout_test_mini.csv
- Module Hints: https://byui-cse.github.io/cse450-course/module-03/hints.html


# **Data Normalization and Cleaning**
- Are there outliers that will skew the data?
- Is there any missing data?
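
Both questions can be checked with a few lines of pandas. The sketch below runs on a small made-up DataFrame; the values and the 1.5 × IQR cutoff are illustrative assumptions, not results from the housing data.

```python
import pandas as pd

# Toy data (hypothetical values): one missing bedroom count, one extreme price
toy = pd.DataFrame({
    'price': [300_000, 450_000, 520_000, 610_000, 7_700_000],
    'bedrooms': [3, 4, None, 2, 5],
})

# Missing data: count the nulls in each column instead of a single yes/no answer
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Outliers: flag prices more than 1.5 * IQR outside the middle 50% of the data
q1, q3 = toy['price'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy['price'] < q1 - 1.5 * iqr) | (toy['price'] > q3 + 1.5 * iqr)]
print(outliers)
```

The same two checks should work on `house_data` by swapping it in for `toy`.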

In [80]:
import pandas as pd
import altair as alt
house_data = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing.csv")

# Finding missing values
# Check if the DataFrame contains any missing data:
if house_data.isnull().any().any():
    print('The DataFrame contains missing data')
else:
    print('The DataFrame does not contain missing data')


# Features I think won't be useful:

# lat - Latitude - We already have a zip code.
# long - Longitude - We already have a zip code.
# id - Unique ID for each home sold - This is a database key, it has nothing to do with the home price.

# All the features I think have potential:

# date - Date of the home sale
# price - Price of each home sold
# bedrooms - Number of bedrooms
# bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
# sqft_living - Square footage of the apartment's interior living space
# sqft_lot - Square footage of the land space
# floors - Number of floors
# waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not
# view - An index from 0 to 4 of how good the view of the property was
# condition - An index from 1 to 5 on the condition of the apartment
# grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
# sqft_above - The square footage of the interior housing space that is above ground level
# sqft_basement - The square footage of the interior housing space that is below ground level
# yr_built - The year the house was initially built
# yr_renovated - The year of the house’s last renovation
# zipcode - What zipcode area the house was listed in
# sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
# sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors



house_data['price'].describe()

The DataFrame does not contain missing data


count    2.000000e+04
mean     5.394367e+05
std      3.664334e+05
min      7.500000e+04
25%      3.220000e+05
50%      4.500000e+05
75%      6.416250e+05
max      7.700000e+06
Name: price, dtype: float64

# **Data Exploration**

## **PRICE**


In [81]:
house_data['price'].value_counts()

# Bin prices into 250,000-wide ranges; the upper edge must cover the maximum price (7.7 million)
ranges = list(range(0, 8_000_001, 250_000))
labels = [f'{low // 1000}K-{high // 1000}K' for low, high in zip(ranges[:-1], ranges[1:])]  # e.g. '250K-500K'

house_data['price_range'] = pd.cut(house_data['price'], ranges, labels=labels)

count_data = house_data['price_range'].value_counts().reset_index()
count_data.columns = ['Price Range', 'Count']

chart = alt.Chart(count_data).mark_bar().encode(
    x=alt.X('Price Range:O', title='Price Range', sort=labels),  # keep bins in price order, not alphabetical
    y=alt.Y('Count:Q', title='Count'),
).properties(
    title='Count of House Prices by Range'
).configure_axis(
    labelFontSize=12,
    titleFontSize=14
).configure_title(
    fontSize=16
)

chart

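# Use a different name than the built-in max(): rebinding it makes the next call to max() raise a TypeError
max_price = house_data['price'].max()
print(max_price)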

## **SQFT_BASEMENT**

In [None]:
house_data['sqft_basement'].value_counts()



sampled_data = house_data.sample(5000)  # Create a random sample of 5000 rows

chart = alt.Chart(sampled_data).mark_bar().encode(
    x='sqft_basement:O',
    y='count()'
).properties(
    title='Value Counts of sqft_basement'
)

chart


## **BEDROOMS**

In [None]:
# Group the data by the number of bedrooms and calculate the average price for each category
price_avg = house_data.groupby('bedrooms')['price'].mean().reset_index()

# Create an Altair bar chart
chart = alt.Chart(price_avg).mark_bar().encode(
    x=alt.X('bedrooms:O', axis=alt.Axis(title='Number of Bedrooms')),
    y=alt.Y('price:Q', axis=alt.Axis(title='Price')),
    tooltip=['bedrooms:O', 'price:Q']
).properties(
    title='Average Price by Number of Bedrooms'
)

# Display the chart
chart


# nine_bedroom = house_data[house_data['bedrooms'] == 9]
# nine_bedroom['price'].value_counts()
house_data['bedrooms'].value_counts()

bedrooms_filtered = house_data[(house_data['bedrooms'] >= 0) & (house_data['bedrooms'] <= 5)]

correlation = bedrooms_filtered['bedrooms'].corr(bedrooms_filtered['price'])
correlation

## **YR_RENOVATED**

In [None]:
house_data['yr_renovated'].value_counts()

## **DATE BUILT**

In [None]:
import numpy as np
from IPython.display import display

# Floor each build year to the start of its decade (e.g. 1987 -> 1980).
# No further binning is needed: pd.cut's default right-closed bins would place 1910 in the
# 1900 bin and leave years outside the outer edges as NaN.
house_data['decade_built'] = (house_data['yr_built'] // 10) * 10

# Count the homes built in each decade
house_data['decade_built'].value_counts()

# house_data now has the columns "decade_built" and "price" used below

# Enable Altair rendering in Google Colab
# alt.renderers.enable('colab')

# Group the data by decade and calculate the average housing price for each decade
decade_price_avg = house_data.groupby('decade_built')['price'].mean().reset_index()

# Create an Altair bar chart
chart = alt.Chart(decade_price_avg).mark_bar().encode(
    x='decade_built:O',
    y='price:Q',
    tooltip=['decade_built:O', 'price:Q']
).properties(
    title='Average Housing Price by Decade Built'
)

# Display the chart
display(chart)
house_data.head(5)
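
Decade binning is easy to get subtly wrong at the bin edges. The sketch below, on a handful of made-up years, compares floor division with `pd.cut`; the years and bin edges are assumptions chosen to hit the edge cases.

```python
import pandas as pd

# Hypothetical build years, including values that sit exactly on decade boundaries
years = pd.Series([1900, 1909, 1910, 1987, 2010, 2015])

# Floor division maps every year to the first year of its decade
decade_floor = (years // 10) * 10

# pd.cut gives the same result only with left-closed bins: [1900, 1910), [1910, 1920), ...
edges = list(range(1900, 2021, 10))
decade_cut = pd.cut(years, bins=edges, right=False, labels=edges[:-1]).astype(int)

# With the default right-closed bins, 1900 falls outside (1900, 1910] and becomes NaN
default_cut = pd.cut(years, bins=edges)

print(pd.DataFrame({'year': years, 'floor': decade_floor, 'cut': decade_cut}))
print('NaN rows with default bins:', default_cut.isna().sum())
```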

## **FLOORS**

In [None]:
house_data['floors'].value_counts()


# Group the data by the number of floors and calculate the average home price for each category
floor_price_avg = house_data.groupby('floors')['price'].mean().reset_index()

# Create an Altair bar chart
chart = alt.Chart(floor_price_avg).mark_bar().encode(
    x='floors:O',
    y='price:Q',
    tooltip=['floors:O', 'price:Q']
).properties(
    title='Average Home Price by Floors'
)

# Display the chart
chart


## **CONDITION** - Use scale

In [None]:
house_data['condition'].value_counts()

# Uses the "condition" and "price" columns of house_data

# Group the data by the condition and calculate the average home price for each category
condition_price_avg = house_data.groupby('condition')['price'].mean().reset_index()

# Create an Altair bar chart
chart = alt.Chart(condition_price_avg).mark_bar().encode(
    x='condition:O',
    y='price:Q',
    tooltip=['condition:O', 'price:Q']
).properties(
    title='Average Home Price by Condition'
)

# Display the chart
chart

#house_data['condition'].value_counts()




## **VIEW**

In [None]:
house_data['view'].value_counts()

# Uses the "view" and "price" columns of house_data

# Group the data by view rating and calculate the average home price for each rating
view_price_avg = house_data.groupby('view')['price'].mean().reset_index()

# Create an Altair bar chart
chart = alt.Chart(view_price_avg).mark_bar().encode(
    x='view:O',
    y='price:Q',
    tooltip=['view:O', 'price:Q']
).properties(
    title='Average Home Price by View'
)

# Display the chart
chart


house_data['view'].value_counts()



## **SQFT_LOT**

In [None]:
house_data['sqft_lot'].value_counts()

In [None]:
# Define the binning parameters
bin_width = 30000  # Adjust bin width as per your preference

# Create a binned column for sqft_lot and convert to string representation
house_data['sqft_lot_bin'] = pd.cut(house_data['sqft_lot'], bins=range(0, int(house_data['sqft_lot'].max()) + bin_width, bin_width)).astype(str)

# Group the data by the sqft_lot bin and calculate the average price for each bin
price_avg = house_data.groupby('sqft_lot_bin')['price'].mean().reset_index()

# Create an Altair bar chart
chart = alt.Chart(price_avg).mark_bar().encode(
    x=alt.X('sqft_lot_bin:O', axis=alt.Axis(title='Lot Size')),
    y=alt.Y('price:Q', axis=alt.Axis(title='Price')),
    tooltip=['sqft_lot_bin:O', 'price:Q']
).properties(
    title='Average Price by Lot Size'
)

# Display the chart
chart

## **WATERFRONT**

In [None]:
house_data['waterfront'].value_counts()

# Uses the "waterfront" and "price" columns of house_data

# Group the data by the waterfront flag and calculate the average home price for each group
waterfront_price_avg = house_data.groupby('waterfront')['price'].mean().reset_index()

# Create an Altair bar chart
chart = alt.Chart(waterfront_price_avg).mark_bar().encode(
    x='waterfront:O',
    y='price:Q',
    tooltip=['waterfront:O', 'price:Q']
).properties(
    title='Average Home Price by Waterfront Property'
)

# Display the chart
chart


house_data['waterfront'].value_counts()



## **DATE**

In [None]:
house_data['date'].value_counts()

# Number of homes sold per month.
# Parse the sale date with pandas and pull out the month of each sale.
df = pd.DataFrame({'Month': pd.to_datetime(house_data['date']).dt.month})

df_agg = df.groupby('Month').size().reset_index(name='Count')

chart = alt.Chart(df_agg).mark_bar().encode(
    alt.X('Month:O', title='Month'),
    alt.Y('Count:Q', title='Count')
).properties(
    title='Count of Home Sales by Month'
)

chart

# Housing price by month sold. ======================================================================

house_data['date'] = pd.to_datetime(house_data['date'])

# Extract month from 'date' column
house_data['month'] = house_data['date'].dt.month

# Calculate the average sell price for each month
average_price_by_month = house_data.groupby('month')['price'].mean().reset_index()

# Create a bar chart using Altair
chart = alt.Chart(average_price_by_month).mark_bar().encode(
    alt.X('month:O', title='Month', sort=alt.EncodingSortField(field='price', op='mean', order='descending')),
    alt.Y('price:Q', title='Average Sell Price')
).properties(
    title='Average Sell Price by Month'
)

# Display the chart
chart


# Housing price by year sold ======================================================================
# house_data['date'] = pd.to_datetime(house_data['date'])

# house_data['year'] = house_data['date'].dt.year

# average_price_by_year = house_data.groupby('year')['price'].mean().reset_index()

# Create a bar chart using Altair
# chart = alt.Chart(average_price_by_year).mark_bar().encode(
#     alt.X('year:O', title='Year'),
#     alt.Y('price:Q', title='Average Sell Price')
# ).properties(
#     title='Average Sell Price by Year'
# )

#chart
# house_data['year'] = house_data['date'].dt.year
# house_data['year'].value_counts()


## **CORRELATION**

In [None]:
# Calculate the correlation between each numeric feature and price.
# 'date' is left out: it is not numeric, and pandas 2.x no longer silently drops non-numeric columns in corr().
feature_data = house_data[['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'sqft_living15', 'sqft_lot15']]
correlation_with_price = feature_data.corr()['price'].drop('price')
correlation_df = correlation_with_price.reset_index().rename(columns={'index': 'feature', 'price': 'correlation'})

# Create heatmap in Altair:
heatmap = alt.Chart(correlation_df).mark_rect().encode(
    x='feature:O',
    y=alt.Y('correlation:O', axis=alt.Axis(format='0.2f')),
    color='correlation:Q'
).properties(
    width=400,
    height=300,
    title='Correlation with Price Heatmap'
)


# 1 = high positive correlation, 0 = no correlation, -1 = high negative correlation
heatmap
correlation_with_price.sort_values()

In [None]:
# 'date' is left out because corr() only makes sense for numeric columns
feature_data = house_data[['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'sqft_living15', 'sqft_lot15']]

# Calculate the correlation matrix
correlation_matrix = feature_data.corr()

# Reset the index to convert the correlation matrix into a dataframe
correlation_df = correlation_matrix.reset_index()

# Melt the dataframe to convert it into long format for heatmap visualization
melted_df = pd.melt(correlation_df, id_vars='index', value_vars=correlation_df.columns[1:], var_name='feature1', value_name='correlation')

# Create the heatmap using Altair
heatmap = alt.Chart(melted_df).mark_rect().encode(
    x='index:O',
    y='feature1:O',
    color='correlation:Q'
).properties(
    width=300,
    height=300,
    title='Correlation Heatmap'
)

# Display the heatmap
heatmap

# **Feature Selection**
- We need a metric to determine which features are most useful in determining home price.
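
One simple candidate metric is the absolute Pearson correlation with price, which is what the lists below rely on. A minimal sketch on synthetic data follows; the column names mirror the dataset, but every value is generated and purely illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for house_data: price depends on sqft_living, not on 'noise'
rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(500, 4000, n)
toy = pd.DataFrame({
    'sqft_living': sqft,
    'noise': rng.normal(size=n),
    'price': 200 * sqft + rng.normal(0, 100_000, n),
})

# Rank features by the strength (absolute value) of their correlation with price
ranking = toy.corr()['price'].drop('price').abs().sort_values(ascending=False)
print(ranking)
```

Correlation only captures linear relationships, so a feature such as zip code can rank low here and still be useful to a tree-based model.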


### High Positive Correlation with Price:
- sqft_living   = 0.70
- grade         = 0.66
- sqft_above    = 0.60
- sqft_living15 = 0.58
- bathrooms     = 0.52

### High Negative Correlation with Price:
- None discovered yet

### To be determined:
- Condition
- Zip Code (homes in high-income zip codes sell for more)
- Year Built (I binned the homes by the decade they were built; some of the distributions look unusual and are worth a closer look)
- Date (most homes are sold in May, April, July, and June)


### **Ignore** these features:
- sqft_lot15                (almost no correlation)
- sqft_lot                  (almost no correlation)
- lat                       (not useful, we already have the zip code)
- long                      (not useful, we already have the zip code)
- id                        (database key, unrelated to the home price)
- yr_renovated              (over 95% of the values are 0, i.e. no data)
- bedrooms                  (very low positive correlation + outliers skew data)
- view                      (highly imbalanced distribution among values, over 95% is 0)
- waterfront                (highly imbalanced distribution among values, over 99% is 0)
- sqft_basement             (over 95% of the values are 0, i.e. no data)
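
The "mostly one value" claims above can be measured rather than estimated by eye. The sketch below uses a hypothetical helper, `dominant_value_share`, on made-up columns; the percentages it prints belong to the toy data, not to the housing dataset.

```python
import pandas as pd

def dominant_value_share(df):
    """Return, per column, the fraction of rows taken up by the most common value."""
    shares = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])
    return shares.sort_values(ascending=False)

# Toy columns (hypothetical): two heavily imbalanced, one evenly spread
toy = pd.DataFrame({
    'waterfront': [0] * 99 + [1],
    'view': [0] * 90 + [2] * 10,
    'grade': [6, 7, 8, 9, 10] * 20,
})

shares = dominant_value_share(toy)
print(shares)
```

Running the helper on the candidate columns of `house_data` would give the exact share for each feature on the ignore list.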

## **FEATURE CLEANING**

### **Price**

In [None]:
house_data['price'] # Probably needs to be scaled, too widely distributed.
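
One common way to tame a wide, right-skewed target is a log transform, shown here as a possible approach rather than what this notebook ends up doing. The sample values are the min, quartiles, and max printed by `describe()` earlier.

```python
import numpy as np

# Min, 25%, 50%, 75%, and max of price, taken from the describe() output above
prices = np.array([75_000.0, 322_000.0, 450_000.0, 641_625.0, 7_700_000.0])

# log1p compresses the long right tail; expm1 is its exact inverse
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)

print('raw max/min ratio:', prices.max() / prices.min())
print('log max/min ratio:', log_prices.max() / log_prices.min())
```

A model trained on `log_prices` predicts in log space, so its predictions need `np.expm1` before the error is reported in dollars.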

### **Sqft_living**

In [91]:
house_data['sqft_living'].value_counts()
house_data = house_data[house_data['sqft_living'] <= 8000]


# Calculate the average price for each sqft_living
average_price_by_sqft_living = house_data.groupby('sqft_living')['price'].mean().reset_index()

# Create the bar graph for average price by sqft_living
bar_chart_sqft_living = alt.Chart(average_price_by_sqft_living).mark_bar().encode(
    x=alt.X('sqft_living:Q', title='Sqft Living'),
    y=alt.Y('price:Q', title='Average Price'),
    tooltip=['sqft_living', 'price']
).properties(
    title='Average Price by Sqft Living'
)

# Display the bar chart
bar_chart_sqft_living

### **Sqft_living15**

In [84]:
house_data['sqft_living15'].value_counts()
house_data = house_data[house_data['sqft_living15'] >= 500]
house_data = house_data[house_data['sqft_living15'] <= 4000]
# Calculate the count of sqft_living15 values
count_sqft_living15 = house_data['sqft_living15'].value_counts().reset_index()

# Rename the columns
count_sqft_living15.columns = ['sqft_living15', 'count']

# Create the bar chart for count of sqft_living15
bar_chart = alt.Chart(count_sqft_living15).mark_bar().encode(
    x=alt.X('sqft_living15:Q', title='Sqft Living 15'),
    y=alt.Y('count:Q', title='Count'),
    tooltip=['sqft_living15', 'count']
).properties(
    title='Count of Sqft Living 15'
)

# Display the bar chart
bar_chart

### **Bathrooms**

In [85]:
house_data = house_data[house_data['bathrooms'] >= 1]
house_data = house_data[house_data['bathrooms'] <= 5]
house_data['bathrooms'].value_counts()

2.50    4976
1.00    3552
1.75    2841
2.25    1875
2.00    1769
1.50    1346
2.75    1091
3.00     678
3.50     627
3.25     502
3.75     131
4.00     105
4.50      76
4.25      56
4.75      16
5.00      13
1.25       8
Name: bathrooms, dtype: int64
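
Each range filter in this section silently drops rows. A small helper can make the cost of every filter visible. `apply_range_filters` is a hypothetical name, the toy rows are made up, and the lower bound of 0 for `sqft_living` is an assumption (the notebook only sets an upper bound).

```python
import pandas as pd

def apply_range_filters(df, limits):
    """Keep rows whose values lie within the inclusive (low, high) limits; report rows dropped."""
    for column, (low, high) in limits.items():
        before = len(df)
        df = df[df[column].between(low, high)]
        print(f'{column}: dropped {before - len(df)} rows')
    return df

# Toy rows (hypothetical): one oversized home and two out-of-range bathroom counts
toy = pd.DataFrame({
    'sqft_living': [900, 2500, 9000, 1800],
    'bathrooms': [0.5, 2.0, 3.0, 6.0],
})

cleaned = apply_range_filters(toy, {'sqft_living': (0, 8000), 'bathrooms': (1, 5)})
print(cleaned)
```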

### **Grade**

In [86]:
# Count homes per grade, ordered by grade (avoids rebinding the built-in sorted())
house_data['grade'].value_counts().sort_index()

# Calculate the average price for each grade
average_price_by_grade = house_data.groupby('grade')['price'].mean().reset_index()

# Create the bar graph using Altair
bar_chart = alt.Chart(average_price_by_grade).mark_bar().encode(
    x='grade:O',
    y='price:Q'
)

# Display the bar chart
bar_chart


# **XGBoost Model**

TEST AND TRAIN SETS

In [87]:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split


X = house_data[['sqft_living', 'grade', 'sqft_above', 'sqft_living15', 'bathrooms']]
y = house_data['price']


# How do we want to do the testing size?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X

       sqft_living  grade  sqft_above  sqft_living15  bathrooms
0             3760      8        2740           3280       3.25
1             1460      7        1040           1310       1.75
2             1340      7        1340           1900       1.00
3             1440      8        1440           1790       1.75
4             1780      7        1080           1690       1.50
...            ...    ...         ...            ...        ...
19994         4400     11        3390           2150       4.50
19995         1000      7        1000           1000       1.50
19996         3087      8        3087           2927       2.50
19997         2120      7        2120           1690       2.50
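
On the question of test size: a single 80/20 split gives one error estimate that depends on which rows land in the test set. K-fold cross-validation is one alternative, sketched below on synthetic data. `LinearRegression` is a stand-in chosen to keep the example light, not the model used in this notebook, and all values are generated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data: price grows with square footage, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(200, 1))
y = 150 * X[:, 0] + rng.normal(0, 50_000, 200)

# Every row is used for testing exactly once across the 5 folds
fold_rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    residuals = y[test_idx] - model.predict(X[test_idx])
    fold_rmse.append(float(np.sqrt(np.mean(residuals ** 2))))

print('RMSE per fold:', fold_rmse)
print('Mean RMSE:', np.mean(fold_rmse))
```

The spread across folds hints at how much a single split's score can move by chance.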


In [None]:
# from sklearn.preprocessing import StandardScaler

# # Assuming house_data is a pandas DataFrame and 'price' is a column in the DataFrame

# # Create a StandardScaler object
# scaler = StandardScaler()

# # Fit the scaler to the 'price' column
# scaler.fit(house_data[['price']])

# # Transform the 'price' column using the fitted scaler
# house_data['price_scaled'] = scaler.transform(house_data[['price']])
# house_data['price_scaled']

TRAINING

In [88]:
model = XGBRegressor()
             
model.fit(X_train, y_train)

PREDICTIONS

In [89]:
predictions = model.predict(X_test)
predictions

array([723629.1 , 450964.47, 417803.53, ..., 392257.9 , 499058.44,
       557258.25], dtype=float32)

RESULTS

In [90]:
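import numpy as np
from sklearn.metrics import mean_squared_error
# holdout_mini = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing_holdout_test_mini.csv")

# Root mean squared error (RMSE), in dollars. Taking the square root explicitly avoids the
# `squared=False` argument, which newer scikit-learn releases deprecate.
result = np.sqrt(mean_squared_error(y_test, predictions))
result # Not good: about $211,000, roughly 39% of the mean sale price (~$539,000) from describe().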

211013.23812023952
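
To judge whether an RMSE is "good", it helps to compare it with a naive baseline that always predicts the mean price. The sketch below computes RMSE by hand on four made-up homes; the numbers are illustrative and unrelated to the result above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average them, take the square root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted prices
y_true = np.array([300_000, 450_000, 650_000, 1_200_000])
y_pred = np.array([320_000, 430_000, 700_000, 1_000_000])

model_rmse = rmse(y_true, y_pred)

# Baseline: predict the mean price for every home
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean(), dtype=float))

print('model RMSE:   ', model_rmse)
print('baseline RMSE:', baseline_rmse)
```

Because the errors are squared before averaging, the single $200,000 miss dominates the model's RMSE here, which is also why a few very expensive homes can inflate the score on the real data.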

In [None]:
# # Assuming scaler is the StandardScaler used for scaling the 'price' feature
# predictions_inverse = scaler.inverse_transform(predictions.reshape(-1, 1))

# # Calculate the MSE in the original scale
# result = mean_squared_error(y_test, predictions_inverse, squared=False)
# result
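
If the target is ever scaled as in the commented-out cells, the scaler should be fitted on the training prices only, and predictions need to be mapped back to dollars before computing the error. A minimal round-trip sketch with made-up prices and made-up scaled predictions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training prices as a single-column 2D array, the shape StandardScaler expects
y_train = np.array([[250_000.0], [400_000.0], [550_000.0], [900_000.0]])

# Fit on the training target only, so nothing about the test set leaks into the scaling
scaler = StandardScaler().fit(y_train)
y_train_scaled = scaler.transform(y_train)

# Pretend these came from a model trained on the scaled target
scaled_predictions = np.array([[-0.5], [0.0], [1.2]])

# Map predictions back to dollars before reporting RMSE
predictions_in_dollars = scaler.inverse_transform(scaled_predictions).ravel()
print(predictions_in_dollars)
```

A scaled prediction of 0 maps back to the mean training price, which is a quick sanity check on the round trip.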