<a href="https://colab.research.google.com/github/joehawkens/MachineLearning/blob/main/MODULE_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MODULE 3**

# **Module Resources**

**Problem:** Regression

**Target:** Home Price

- Module Overview: https://byui-cse.github.io/cse450-course/module-03/
- Data Dictionary: https://byui-cse.github.io/cse450-course/module-03/housing-dictionary.txt
- Dataset: https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing.csv
- Holdout Dataset: https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing_holdout_test.csv
- Holdout Mini Dataset: https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing_holdout_test_mini.csv
- Module Hints: https://byui-cse.github.io/cse450-course/module-03/hints.html


# **Data Normalization and Cleaning**
- Are there outliers that will skew the data?
- Is there any missing data?
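
A yes/no check for missing data does not say where the gaps are. The sketch below is one possible way to count missing values per column. The helper name `missing_summary` and the toy DataFrame are illustrative assumptions, not part of the module's dataset.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for house_data (values are made up for illustration).
toy = pd.DataFrame({
    "price": [300000.0, 450000.0, None, 520000.0],
    "bedrooms": [3, 4, 2, 3],
})

print(missing_summary(toy))
```

Calling `missing_summary(house_data)` after loading the CSV should give the same per-column breakdown for the real data.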

In [27]:
import pandas as pd
import altair as alt
house_data = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/housing.csv")

# Finding missing values
# Check if the DataFrame contains any missing data:
if house_data.isnull().any().any():
    print('The DataFrame contains missing data')
else:
    print('The DataFrame does not contain missing data')


# Features I think won't be useful:

# lat - Latitude - We already have a zip code.
# long - Longitude - We already have a zip code.
# id - Unique ID for each home sold - This is a database key, it has nothing to do with the home price.

# All the features I think have potential:

# date - Date of the home sale
# price - Price of each home sold
# bedrooms - Number of bedrooms
# bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
# sqft_living - Square footage of the apartment's interior living space
# sqft_lot - Square footage of the land space
# floors - Number of floors
# waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not
# view - An index from 0 to 4 of how good the view of the property was
# condition - An index from 1 to 5 on the condition of the apartment
# grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
# sqft_above - The square footage of the interior housing space that is above ground level
# sqft_basement - The square footage of the interior housing space that is below ground level
# yr_built - The year the house was initially built
# yr_renovated - The year of the house’s last renovation
# zipcode - What zipcode area the house was listed in
# sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
# sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors



house_data['price'].describe()

The DataFrame does not contain missing data


count    2.000000e+04
mean     5.394367e+05
std      3.664334e+05
min      7.500000e+04
25%      3.220000e+05
50%      4.500000e+05
75%      6.416250e+05
max      7.700000e+06
Name: price, dtype: float64
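
The summary above shows a maximum of 7.7 million against a 75th percentile of about 642 thousand, which suggests a long right tail. One common rule of thumb (an assumption here, not something the module prescribes) flags values more than 1.5 × IQR beyond the quartiles. With the quartiles printed above, the upper fence would be 641,625 + 1.5 × 319,625 ≈ 1.12 million. A minimal sketch, with `iqr_outlier_mask` as a hypothetical helper and a made-up toy series:

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy prices for illustration only; the last value mimics an extreme sale.
toy_prices = pd.Series(
    [322000, 450000, 641625, 500000, 380000, 7700000], name="price"
)

mask = iqr_outlier_mask(toy_prices)
print(toy_prices[mask])  # rows flagged as potential outliers
```

On the real data, `house_data[iqr_outlier_mask(house_data['price'])]` would list the flagged homes. Whether to drop, cap, or keep them is a modeling decision.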

# **Data Exploration**

In [31]:
feature_data = house_data[['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'sqft_living15', 'sqft_lot15']]

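# Calculate the correlation matrix.
# Restrict to numeric columns: corr() cannot use non-numeric columns such as a raw date string.
correlation_matrix = feature_data.corr(numeric_only=True)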

# Reset the index to convert the correlation matrix into a dataframe
correlation_df = correlation_matrix.reset_index()

# Melt the dataframe to convert it into long format for heatmap visualization
melted_df = pd.melt(correlation_df, id_vars='index', value_vars=correlation_df.columns[1:], var_name='feature1', value_name='correlation')

# Create the heatmap using Altair
heatmap = alt.Chart(melted_df).mark_rect().encode(
    x='index:O',
    y='feature1:O',
    color='correlation:Q'
).properties(
    width=300,
    height=300,
    title='Correlation Heatmap'
)

# Display the heatmap
heatmap



In [32]:
# Calculate the correlation between each feature and price
feature_data = house_data[['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'sqft_living15', 'sqft_lot15']]
correlation_with_price = feature_data.corr(numeric_only=True)['price'].drop('price')
correlation_df = correlation_with_price.reset_index().rename(columns={'index': 'feature', 'price': 'correlation'})

# Create the heatmap in Altair:
heatmap = alt.Chart(correlation_df).mark_rect().encode(
    x='feature:O',
    y=alt.Y('correlation:O', axis=alt.Axis(format='0.2f')),
    color='correlation:Q'
).properties(
    width=400,
    height=300,
    title='Correlation with Price Heatmap'
)


# 1 = strong positive correlation, 0 = no linear correlation, -1 = strong negative correlation
heatmap



# **Feature Selection**
- We need a metric to determine which features are most useful in determining home price.
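
One simple candidate metric is the absolute Pearson correlation with price, which reuses the correlations computed in the exploration step. The sketch below is an assumption about how such a ranking could look, not the module's prescribed method. `rank_by_abs_correlation` and the toy data are hypothetical. Note that correlation only captures linear relationships and treats codes such as `zipcode` as plain numbers, so a low score does not prove a feature is useless.

```python
import pandas as pd


def rank_by_abs_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric features by the absolute value of their correlation with target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)  # keep the sign, order by strength


# Toy data for illustration only (not taken from housing.csv).
toy = pd.DataFrame({
    "price": [200, 300, 400, 500, 600],
    "sqft_living": [1000, 1500, 2000, 2500, 3000],  # rises with price
    "yr_built": [2000, 1985, 1990, 1970, 1965],     # falls as price rises
    "noise": [3, 1, 4, 1, 5],                       # weakly related
})

ranked = rank_by_abs_correlation(toy, "price")
print(ranked)
```

Keeping the sign while sorting by magnitude means strongly negative features are not pushed to the bottom of the list.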

# **XGBoost Model**