# Project 2
#### COMP4433
#### Joseph Beightol

### Project Description
#### With data of your choice (apart from built-in data sources) use Plotly, Dash and any other required Python modules to build an interactive Dash application. This should be a user-facing application, so the final product should be reasonably well-polished and ready to be deployed.  

#### While the first project in this course was exploratory in nature, this application should be more explanatory or should serve a specific purpose. For example, your final product might be used to convey relevant information to the user, provide an empirical solution to a question based on user input, or offer an interface for a user to intentionally engage with data on a topic of interest. 

#### Your final product should be a deployment-ready application. While deployment is not a requirement, you will be asked to provide a link to a GitHub repository with all the necessary materials to run your project in localhost.  

#### The data that you use may be collected by your script. This can range in complexity from using pandas.read_csv() to pull a CSV from the web to making API calls to scraping. Be sure to include a data file or files for redundancy purposes.  Alternatively, you may strictly use static files. Note that the emphasis should be on your visualizations and communication, not the process through which your application gathers and ingests data. Ensure that the README in your project details any nuances that someone may need to know when executing your code. 

#### As with the first project, the aesthetics of your plots should be fine tuned to the extent that they reflect best practices discussed throughout the course, and relevant plot elements such as tick labeling, titles, descriptive string formatting and legends should be included as appropriate. 

Be sure to include: 
- At least four Dash Core Components (dropdowns, radio buttons, text entry fields, etc.) to ingest user input. 
- At least one callback decorator to achieve interactivity. 
- At least three different plots from Plotly. 
- Sufficient narrative and/or instructional information for users to be able to navigate the application and understand its intent. 

## Import Libraries

In [1]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import norm
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV, KFold
import matplotlib.ticker as mticker
import matplotlib.dates as mdates

import plotly.express as px
import plotly.io as pio
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output
#pio.renderers.default='notebook_connected'
pio.renderers.default='vscode'
pio.templates.default = 'simple_white' 
import yfinance as yf
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import ssl
ssl._create_default_https_context = ssl._create_unverified_context #work around for SSL certificate


## Introduction



https://www.zillow.com/research/data/

https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset

https://fred.stlouisfed.org/series/MEHOINUSCOA646N

## Zillow Data

### Load in Datasets

In [2]:
#City Data
zillowCitySF= pd.read_csv('City_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_month.csv') #city single family data
zillowCitySF['HouseType'] = 'Single Family' #add column for house type and make it single family
zillowCityCondo = pd.read_csv('City_zhvi_uc_condo_tier_0.33_0.67_sm_sa_month.csv') #city condo data
zillowCityCondo['HouseType'] = 'Condo' #add column for house type and make it condo

#Metro Data
zillowMetroSF= pd.read_csv('Metro_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_month.csv') #metro single family data
zillowMetroSF['HouseType'] = 'Single Family' #add column for house type and make it single family
zillowMetroCondo = pd.read_csv('Metro_zhvi_uc_condo_tier_0.33_0.67_sm_sa_month.csv') #metro condo data
zillowMetroCondo['HouseType'] = 'Condo' #add column for house type and make it condo

### Combine Datasets

In [3]:
#Concatenate DF into one large set
zillowCombined = pd.concat([zillowCityCondo, zillowCitySF, zillowMetroCondo, zillowMetroSF], ignore_index=True) #concatenate the dataframes vertically
zillowCombined.drop(columns=['RegionID', 'SizeRank', 'State', 'Metro', 'CountyName'], inplace=True) #drop unnecessary columns
usHousing = zillowCombined[zillowCombined['RegionType'] == 'country'] #set US housing by grabbing only the rows whose RegionType is 'country'
zillowCombined = zillowCombined[zillowCombined['RegionType'] != 'country'] #drop rows that are for all US
zillowCombined.head() #display first few rows of data

Unnamed: 0,RegionName,RegionType,StateName,2000-01-31,2000-02-29,2000-03-31,2000-04-30,2000-05-31,2000-06-30,2000-07-31,...,2024-04-30,2024-05-31,2024-06-30,2024-07-31,2024-08-31,2024-09-30,2024-10-31,2024-11-30,2024-12-31,HouseType
0,New York,city,NY,216377.029413,217627.547532,218882.589493,221574.088587,224531.335867,227760.610358,230909.418754,...,670674.913778,673313.146899,674645.475852,677312.199033,681294.870945,683913.973515,685577.105579,689223.756544,695048.507313,Condo
1,Los Angeles,city,CA,176367.406469,176672.09334,177556.285177,179300.884466,181430.166041,183442.597566,185482.516636,...,660290.734719,659878.540633,658407.685428,659083.417436,660800.973725,663763.978872,665172.477921,665981.183285,666476.917671,Condo
2,Houston,city,TX,83032.804087,83133.682319,83345.001432,83309.964602,83191.665522,82929.5492,83312.82287,...,156944.602859,156861.78753,156411.128135,155576.912233,154700.307105,153782.129679,152940.677974,152149.12425,151500.41603,Condo
3,Chicago,city,IL,163040.128431,163197.27794,163821.121244,165271.56225,167189.660811,169196.97913,171123.164659,...,274667.028219,274706.059405,274028.610787,273385.196813,273639.711072,273922.778199,274388.71471,274367.56315,275420.554431,Condo
4,San Antonio,city,TX,75224.145174,75278.643879,75308.984151,75357.497866,75012.034561,74561.299405,74096.249129,...,188497.733036,188046.772226,187181.722482,185636.305265,184203.993144,182853.173707,181799.273732,180642.371734,179647.797402,Condo


### Fill NA

In [4]:
#define function to fill na values with regressor model
def fill_missing_rows_numeric(df):
    numDF = df.select_dtypes(include=[np.number]).copy() #only look at numeric values (the monthly price columns)
    positions = np.arange(numDF.shape[1]) #time position of each date column

    for idx in range(len(numDF)):  #iterate through rows by position
        row = numDF.iloc[idx]  #grab the row data
        missing = row.isna().to_numpy()  #mask of dates (columns) with missing data
        known = ~missing  #mask of dates with known values

        if missing.any() and known.sum() > 1:  #if missing and enough known values
            X_train = positions[known].reshape(-1, 1)  #x-train: actual time positions of the known values
            y_train = row.to_numpy()[known]  #y-train
            X_pred = positions[missing].reshape(-1, 1)  #time positions to predict
            #model=RandomForestRegressor(random_state=42) #<------- for slower but more accurate results
            model = LinearRegression()  #set model as linear regressor
            model.fit(X_train, y_train)  #fit the model
            predictions = model.predict(X_pred)  #predict missing values
            numDF.iloc[idx, positions[missing]] = predictions  #replace missing data by position (index labels are not 0..n-1 after filtering)

    df.update(numDF)  #update df
    return df #return
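
The row-wise trend fill can be illustrated on a single toy row. This is a minimal sketch rather than the notebook's exact method: it swaps in `np.polyfit` for `LinearRegression` to keep dependencies small, and the prices and month labels are invented for illustration. The detail worth noting is that the fit uses each known value's actual position in time, so gaps at the start or in the middle are predicted at the right place on the trend line.

```python
import numpy as np
import pandas as pd

# Toy price row (hypothetical values): the first and fourth months are missing
row = pd.Series([np.nan, 110.0, 120.0, np.nan, 140.0],
                index=['2000-01', '2000-02', '2000-03', '2000-04', '2000-05'])

positions = np.arange(len(row))      # 0..4, one position per month
known = row.notna().to_numpy()       # mask of months with a price

# Fit price = slope * position + intercept on the known months only
slope, intercept = np.polyfit(positions[known], row.to_numpy()[known], deg=1)

# Write the trend-line value into each missing month
filled = row.copy()
filled.loc[row.isna()] = slope * positions[~known] + intercept
print(filled.round(2))
```

Because a straight line is extrapolated, long gaps at the start of a series can produce implausibly low (even negative) values, so it may be worth inspecting the filled rows before plotting.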


In [5]:
#fill na values
zillowCombined = fill_missing_rows_numeric(zillowCombined)
zillowCombined.head()

Unnamed: 0,RegionName,RegionType,StateName,2000-01-31,2000-02-29,2000-03-31,2000-04-30,2000-05-31,2000-06-30,2000-07-31,...,2024-04-30,2024-05-31,2024-06-30,2024-07-31,2024-08-31,2024-09-30,2024-10-31,2024-11-30,2024-12-31,HouseType
0,New York,city,NY,216377.029413,217627.547532,218882.589493,221574.088587,224531.335867,227760.610358,230909.418754,...,670674.913778,673313.146899,674645.475852,677312.199033,681294.870945,683913.973515,685577.105579,689223.756544,695048.507313,Condo
1,Los Angeles,city,CA,176367.406469,176672.09334,177556.285177,179300.884466,181430.166041,183442.597566,185482.516636,...,660290.734719,659878.540633,658407.685428,659083.417436,660800.973725,663763.978872,665172.477921,665981.183285,666476.917671,Condo
2,Houston,city,TX,83032.804087,83133.682319,83345.001432,83309.964602,83191.665522,82929.5492,83312.82287,...,156944.602859,156861.78753,156411.128135,155576.912233,154700.307105,153782.129679,152940.677974,152149.12425,151500.41603,Condo
3,Chicago,city,IL,163040.128431,163197.27794,163821.121244,165271.56225,167189.660811,169196.97913,171123.164659,...,274667.028219,274706.059405,274028.610787,273385.196813,273639.711072,273922.778199,274388.71471,274367.56315,275420.554431,Condo
4,San Antonio,city,TX,75224.145174,75278.643879,75308.984151,75357.497866,75012.034561,74561.299405,74096.249129,...,188497.733036,188046.772226,187181.722482,185636.305265,184203.993144,182853.173707,181799.273732,180642.371734,179647.797402,Condo


### Melt Data Frames

In [6]:
zillowCombined = zillowCombined.melt(
    id_vars=['RegionName', 'RegionType', 'StateName', 'HouseType'], 
    var_name='Date', 
    value_name='Price'
)
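
As a rough illustration of what the melt does, the sketch below reshapes a tiny wide table (one column per month) into long format (one row per region and month). The two regions and their prices are made-up stand-ins, not values from the Zillow files.

```python
import pandas as pd

# Hypothetical wide table: one row per region, one column per month
wide = pd.DataFrame({
    'RegionName': ['Denver', 'Boulder'],
    'HouseType': ['Condo', 'Condo'],
    '2000-01-31': [100.0, 200.0],
    '2000-02-29': [101.0, 202.0],
})

# Every column not listed in id_vars becomes a (Date, Price) pair
long = wide.melt(id_vars=['RegionName', 'HouseType'],
                 var_name='Date',
                 value_name='Price')
long['Date'] = pd.to_datetime(long['Date'])  # column labels were strings
print(long)
```

The long layout is what Plotly Express expects when a single column (here `Date`) drives an axis or an animation frame.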

### Drop NA

In [7]:
zillowCombined.dropna(inplace=True)
zillowCombined['Date'] = pd.to_datetime(zillowCombined['Date'])

### Get median cost data

In [8]:
#grab only median of each state
stateMedian = zillowCombined.groupby('StateName')['Price'].median().round(2).reset_index() #create new dataframe
stateMedian['Rank'] = stateMedian['Price'].rank(method='first', ascending=False).astype(int) #add rank
stateMedian = stateMedian.sort_values(by='Rank')  #sort
stateMedian.head()

Unnamed: 0,StateName,Price,Rank
11,HI,482468.43,1
7,DC,394983.37,2
4,CA,356467.61,3
31,NJ,335977.0,4
19,MA,312910.13,5


### Bar Chart

In [35]:
#create bar plot
fig = px.bar(stateMedian, 
             x='StateName', 
             y='Price', 
             text='Price',  # Display price on bars
             color='StateName',  # Different colors for each state
             labels={'StateName': 'State', 'Price': 'Median Home Price'},
             title='Median Home Price by State',
             template='plotly_dark')

# Format text on bars
fig.update_traces(texttemplate='$%{text:,.0f}', textposition='outside')

# Adjust layout
fig.update_layout(
    xaxis_title='State',
    yaxis_title='Price ($)',
    width=1200,
    height=800
)

fig.show()


### Map

In [9]:
fig3d = px.choropleth(stateMedian, 
                      locations='StateName', 
                      locationmode='USA-states',
                      color='Price', 
                      color_continuous_scale='Viridis',
                      scope='usa',
                      labels={'StateName': 'State', 'Rank': 'Rank', 'Price': 'Median Cost'},
                      title='US Median Home Cost',
                      template='plotly_dark',
                      hover_data={'Rank': True, 'Price': True})
fig3d.show() 

### Animation

In [25]:
zillowCO = zillowCombined[zillowCombined['StateName']=='CO'] #grab only data with Colorado
zillowCO = zillowCO[zillowCO['RegionName'].isin(['Denver', 'Superior', 'Highlands Ranch', 'Boulder', 'Longmont', 'Colorado Springs', 'Aurora', 'Lakewood', 'Fort Collins', 'Broomfield'])] #grab only certain cities
cityCOMedian = zillowCO.groupby(['RegionName', 'Date'])['Price'].median().round(2).reset_index() #create new dataframe
cityCOMedian['Rank'] = cityCOMedian['Price'].rank(method='first', ascending=False).astype(int) #add rank
cityCOMedian = cityCOMedian.sort_values(by='Date')  #sort
#cityCOMedian = cityCOMedian[(cityCOMedian['Rank'] >= 50) & (cityCOMedian['Rank'] <=100)]  #keep only top 25 priced cities
cityCOMedian

Unnamed: 0,RegionName,Date,Price,Rank
0,Aurora,2000-01-31,123886.18,2984
2700,Superior,2000-01-31,188222.86,2324
600,Broomfield,2000-01-31,154203.55,2817
900,Colorado Springs,2000-01-31,125714.39,2978
300,Boulder,2000-01-31,227209.50,1687
...,...,...,...,...
899,Broomfield,2024-12-31,534446.34,203
599,Boulder,2024-12-31,787248.08,30
299,Aurora,2024-12-31,390861.91,603
2699,Longmont,2024-12-31,489076.20,270
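
The `Rank` column in the table above is computed over every city and month at once, which is why the values run into the thousands. If a ranking within each month is wanted instead (an assumption about intent, not something the notebook states), grouping by `Date` before ranking is one way to get it. The sketch below uses invented prices and a hypothetical column name, `RankInMonth`.

```python
import pandas as pd

# Toy data (hypothetical prices): three cities observed in two months
prices = pd.DataFrame({
    'RegionName': ['Aurora', 'Boulder', 'Denver'] * 2,
    'Date': pd.to_datetime(['2000-01-31'] * 3 + ['2024-12-31'] * 3),
    'Price': [100.0, 300.0, 200.0, 400.0, 800.0, 600.0],
})

# Rank cities against each other separately for each month (1 = most expensive)
prices['RankInMonth'] = (
    prices.groupby('Date')['Price']
          .rank(method='first', ascending=False)
          .astype(int)
)
print(prices)
```

With a per-month rank, marker sizes in an animated scatter stay on the same small scale in every frame.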


In [26]:
#Animation
gpm = px.data.gapminder()

In [33]:
fig = px.scatter(cityCOMedian, 
                 x='Date', 
                 y='Price', 
                 color='RegionName',
                 color_discrete_sequence=px.colors.qualitative.D3,
                 size='Rank', 
                 size_max=50,
                 hover_data={'RegionName': True,
                             'Date': False,
                             'Rank': True,
                             'Price': ':.0f'
                            }, 
                 labels=dict(RegionName='City',
                             Price='Median House Price'),
                 animation_frame='Date',
                 range_x=[cityCOMedian['Date'].min(), cityCOMedian['Date'].max()],
                 range_y=[0, cityCOMedian['Price'].max()],
                 title='Median House Price by Colorado City',
                 template='plotly_dark')

fig.update_layout(
    width=1200,  # Increase width
    height=800,  # Increase height
    legend=dict(
        yanchor='bottom',
        y=0,
        xanchor='right',
        x=1
    )
)

fig.show()

## Median House Income

### Load in Data

In [None]:
medHouseIncomeCO = pd.read_csv('MEHOINUSCOA646N.csv') #colorado median house income
medHouseIncomeCO.rename(columns={'MEHOINUSCOA646N': 'Price', 'observation_date': 'Date'}, inplace=True) #rename columns
medHouseIncomeCO.tail() #look at end of data to see where the date ends

#grab starting from 2000
medHouseIncomeCO['Date'] = pd.to_datetime(medHouseIncomeCO['Date']) #switch to date type
medHouseIncomeCO = medHouseIncomeCO[medHouseIncomeCO['Date'].dt.year >= 2000]  #only grab years from 2000 on
medHouseIncomeCO.head()

In [10]:
#Separate data
zillowSingleFamilyCO = zillowCO[zillowCO['HouseType']=='Single Family'] #grab single family colorado homes
zillowCondoCO = zillowCO[zillowCO['HouseType']=='Condo'] #grab condo colorado homes

#zillowCO is already in long format (Date and Price columns) from the earlier melt,
#so melting again would scramble those two columns; copy the subsets instead
zillowMeltedSFCO = zillowSingleFamilyCO.copy()
zillowMeltedCondoCO = zillowCondoCO.copy()

#Convert Date column to datetime format
zillowMeltedSFCO['Date'] = pd.to_datetime(zillowMeltedSFCO['Date'])
zillowMeltedCondoCO['Date'] = pd.to_datetime(zillowMeltedCondoCO['Date'])

In [None]:
medianPriceCO = zillowCO.groupby('Date')['Price'].median().reset_index() #create new dataframe (zillowCO is the long-format Colorado subset; zillowCOMelted was never defined)
medianPriceCO.head()

In [17]:
medianPriceCO = medianPriceCO[medianPriceCO['Date'].dt.month == 1] #only grab info for January
medianPriceCO = medianPriceCO[medianPriceCO['Date'].dt.year <= 2023].copy() #grab data that is before or in 2023 (remove 2024)
#snap both date columns to the first day of the month so the two series line up on Date
medianPriceCO['Date'] = medianPriceCO['Date'].dt.to_period('M').dt.to_timestamp()
medHouseIncomeCO['Date'] = pd.to_datetime(medHouseIncomeCO['Date']).dt.to_period('M').dt.to_timestamp()

In [None]:
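#combine home price and income on Date (both frames have 'Date' and 'Price' columns)
df = medianPriceCO.merge(medHouseIncomeCO, on='Date', suffixes=('_home', '_income'))
df = df.rename(columns={'Price_home': 'Median Home Price', 'Price_income': 'Median Income'})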

# Create Figure
fig = go.Figure()

# Add initial traces (Housing Price & Income)
fig.add_trace(go.Scatter(x=df['Date'], y=df['Median Home Price'], mode='lines+markers',
                         name="Median Home Price", yaxis="y1", line=dict(color='blue')))
fig.add_trace(go.Scatter(x=df['Date'], y=df['Median Income'], mode='lines+markers',
                         name="Median Income", yaxis="y2", line=dict(color='red')))

# Create animation frames
frames = []
for i in range(len(df)):
    frames.append(go.Frame(
        data=[
            go.Scatter(x=df['Date'][:i+1], y=df['Median Home Price'][:i+1], mode='lines+markers',
                       name="Median Home Price", yaxis="y1", line=dict(color='blue')),
            go.Scatter(x=df['Date'][:i+1], y=df['Median Income'][:i+1], mode='lines+markers',
                       name="Median Income", yaxis="y2", line=dict(color='red'))
        ],
        name=str(df['Date'][i])
    ))

# Add animation to figure
fig.update(frames=frames)

# Layout settings for dual axes
fig.update_layout(
    title="Median Home Price & Household Income Over Time",
    xaxis=dict(title="Date"),
    yaxis=dict(title=dict(text="Median Home Price ($)", font=dict(color="blue")), tickfont=dict(color="blue")),
    yaxis2=dict(title=dict(text="Median Household Income ($)", font=dict(color="red")),
                tickfont=dict(color="red"), overlaying="y", side="right"),
    updatemenus=[{
        "buttons": [
            {"args": [None, {"frame": {"duration": 500, "redraw": True}, "fromcurrent": True}], 
             "label": "Play", "method": "animate"},
            {"args": [[None], {"frame": {"duration": 0, "redraw": True}, "mode": "immediate"}], 
             "label": "Pause", "method": "animate"}
        ],
        "direction": "left",
        "pad": {"r": 10, "t": 87},
        "showactive": False,
        "type": "buttons",
        "x": 0.1,
        "xanchor": "right",
        "y": 0,
        "yanchor": "top"
    }]
)

fig.show()


## House Feature and Price Data

In [None]:
realtorDF= pd.read_csv('realtor-data.zip.csv') #load in data
realtorDF.drop(columns=['prev_sold_date', 'status', 'brokered_by', 'street'],inplace=True) #drop unnecessary columns
realtorCO = realtorDF[realtorDF['state']=='Colorado'] #grab subset for colorado
realtorCO = realtorCO.dropna() #drop all missing values
realtorCO.head() #display first few rows of data

In [None]:
#Correlation matrix
fig, ax = plt.subplots(figsize=(6, 6)) #initiate plot
corrMat = realtorCO.corr(numeric_only=True) #create correlation matrix
mask = np.triu(np.ones_like(corrMat, dtype=bool)) #only care about bottom half
sns.heatmap(corrMat, square=True, annot=True, cbar=True, cmap='crest', mask=mask, linewidths=0.5) #create heatmap to visualize strength of correlation
plt.tight_layout() #tidy spacing around the heatmap
plt.show()

In [None]:
#grab top correlated features
realtorCONum = realtorCO[['price', 'bed', 'bath', 'acre_lot', 'house_size']]
corr_df = realtorCONum.corr().stack().reset_index().rename(columns = {'level_0':'variable1', 
                                                            'level_1':'variable2', 
                                                             0:'correlation'})
corr_df = corr_df[corr_df.variable2 > corr_df.variable1]
corr_df = corr_df.loc[corr_df.correlation.abs().sort_values(ascending= False).index]
corr_df

## Dash

Ideas for dash
1. callback function to look at realtor data to see how different home characteristics affect