# Project 2
#### COMP4433
#### Joseph Beightol

### Project Description
#### With data of your choice (apart from built-in data sources) use Plotly, Dash and any other required Python modules to build an interactive Dash application. This should be a user-facing application, so the final product should be reasonably well-polished and ready to be deployed.  

#### While the first project in this course was exploratory in nature, this application should be more explanatory or should serve a specific purpose. For example, your final product might be used to convey relevant information to the user, provide an empirical solution to a question based on user input, or offer an interface for a user to intentionally engage with data on a topic of interest. 

#### Your final product should be a deployment-ready application. While deployment is not a requirement, you will be asked to provide a link to a GitHub repository with all the necessary materials to run your project in localhost.  

#### The data that you use may be collected by your script. This can range in complexity from using pandas.read_csv() to pull a CSV from the web to making API calls to scraping. Be sure to include a data file or files for redundancy purposes.  Alternatively, you may strictly use static files. Note that the emphasis should be on your visualizations and communication, not the process through which your application gathers and ingests data. Ensure that the README in your project details any nuances that someone may need to know when executing your code. 

#### As with the first project, the aesthetics of your plots should be fine tuned to the extent that they reflect best practices discussed throughout the course, and relevant plot elements such as tick labeling, titles, descriptive string formatting and legends should be included as appropriate. 

Be sure to include: 
- At least four Dash Core Components (dropdowns, radio buttons, text entry fields, etc.) to ingest user input. 
- At least one callback decorator to achieve interactivity. 
- At least three different plots from Plotly. 
- Sufficient narrative and/or instructional information for users to be able to navigate the application and understand its intent. 

## Import Libraries

In [13]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import norm
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV, KFold
import matplotlib.ticker as mticker
import matplotlib.dates as mdates

import plotly.express as px
import plotly.io as pio
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output
#pio.renderers.default='notebook_connected'
pio.renderers.default='vscode'
pio.templates.default = 'simple_white' 
import yfinance as yf
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import ssl
ssl._create_default_https_context = ssl._create_unverified_context #workaround for SSL certificate errors (disables certificate verification, so only use with trusted sources)


## Introduction



https://www.zillow.com/research/data/

https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset

https://fred.stlouisfed.org/series/MEHOINUSCOA646N

## Load in Datasets

#### The first step is to load the datasets. We start with the Zillow data to explore how housing prices have increased over the years. Below we import four datasets: city-level and metro-level data for single-family homes and condos.

In [2]:
#City Data
zillowCitySF= pd.read_csv('City_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_month.csv') #city single family data
zillowCitySF['HouseType'] = 'Single Family' #add column for house type and make it single family
zillowCityCondo = pd.read_csv('City_zhvi_uc_condo_tier_0.33_0.67_sm_sa_month.csv') #city condo data
zillowCityCondo['HouseType'] = 'Condo' #add column for house type and make it condo

#Metro Data
zillowMetroSF = pd.read_csv('Metro_zhvi_uc_sfr_tier_0.33_0.67_sm_sa_month.csv') #metro single family data
zillowMetroSF['HouseType'] = 'Single Family' #add column for house type and make it single family
zillowMetroCondo = pd.read_csv('Metro_zhvi_uc_condo_tier_0.33_0.67_sm_sa_month.csv') #metro condo data
zillowMetroCondo['HouseType'] = 'Condo' #add column for house type and make it condo

## Data Manipulation

In [None]:
#Concatenate DF into one large set
zillowCombined = pd.concat([zillowCityCondo, zillowCitySF, zillowMetroCondo, zillowMetroSF], ignore_index=True) #concatenate the dataframes vertically
zillowCombined.drop(columns=['RegionID', 'SizeRank', 'State', 'Metro', 'CountyName'], inplace=True) #drop unnecessary columns
usHousing = zillowCombined[zillowCombined['RegionType'] == 'country'] #set US housing by grabbing only the country-level rows
zillowCombined = zillowCombined[zillowCombined['RegionType'] != 'country'] #drop rows that are for all US
zillowCombined.head() #display first few rows of data

## Data Cleaning

In [5]:
#define function to fill na values with regressor model
#note: df is updated in place and also returned
def fill_missing_rows_numeric(df):
    numDF = df.select_dtypes(include=[np.number]) #only look at numeric values
    positions = np.arange(numDF.shape[1]).reshape(-1, 1) #column position acts as the time axis

    for idx in range(len(numDF)):  #iterate through rows by position
        row = numDF.iloc[idx]  #grab the row data
        isMissing = row.isna().to_numpy()  #flag dates (columns) with missing data

        if isMissing.any() and (~isMissing).sum() > 1:  #if missing and enough known values
            X_train = positions[~isMissing]  #x-train: positions of the known values
            y_train = row.to_numpy()[~isMissing]  #y-train: the known values
            X_pred = positions[isMissing]  #positions to predict
            #model=RandomForestRegressor(random_state=42) #<------- for slower but more accurate results
            model = LinearRegression()  #set model as linear regressor
            model.fit(X_train, y_train)  #fit the model
            predictions = model.predict(X_pred)  #predict missing values
            numDF.iloc[idx, np.flatnonzero(isMissing)] = predictions  #replace missing data by position, so a non-default index is safe

    df.update(numDF)  #update df
    return df #return


In [None]:
#fill na values
zillowPredictCombined = fill_missing_rows_numeric(zillowCombined.copy()) #pass a copy so zillowCombined keeps its original missing values
zillowPredictCombined.head()
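
As a quick sanity check of the row-wise fill, the sketch below applies the same idea to a tiny hypothetical table. It assumes column position stands in for time, and it swaps `LinearRegression` for `np.polyfit` with degree 1 (the same least-squares line) to keep the example small. The `toy` frame and the `fill_row_linear` helper are illustrative names, not part of the project.

```python
import numpy as np
import pandas as pd

def fill_row_linear(values):
    """Fill NaNs in a 1-D array with a least-squares line fit over position."""
    values = np.asarray(values, dtype=float)
    positions = np.arange(len(values))
    known = ~np.isnan(values)
    if known.all() or known.sum() < 2:
        return values  # nothing to fill, or too few points to fit a line
    slope, intercept = np.polyfit(positions[known], values[known], deg=1)
    filled = values.copy()
    filled[~known] = slope * positions[~known] + intercept
    return filled

# hypothetical wide table: one row per region, one column per month
toy = pd.DataFrame(
    {
        "2000-01-31": [100.0, 50.0],
        "2000-02-29": [np.nan, 60.0],
        "2000-03-31": [120.0, np.nan],
        "2000-04-30": [np.nan, 80.0],
    },
    index=["Region A", "Region B"],
)

# apply the fill to each row, keeping the original column labels
toy_filled = toy.apply(
    lambda row: pd.Series(fill_row_linear(row.to_numpy()), index=row.index),
    axis=1,
)
print(toy_filled.round(2))
```

A straight line also extrapolates past the last known month, so values filled far from any observed data should be treated with caution.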

In [7]:
zillowCombined = zillowCombined.melt(
    id_vars=['RegionName', 'RegionType', 'StateName', 'HouseType'], 
    var_name='Date', 
    value_name='Price'
)

zillowPredictCombined = zillowPredictCombined.melt(
    id_vars=['RegionName', 'RegionType', 'StateName', 'HouseType'], 
    var_name='Date', 
    value_name='Price'
)
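
To make the wide-to-long reshape concrete, here is a minimal sketch on a hypothetical two-region sample shaped like the Zillow files. The `wide` and `long_df` names and all values are made up for illustration.

```python
import pandas as pd

# hypothetical sample: identifier columns plus one column per month
wide = pd.DataFrame({
    "RegionName": ["Denver", "Boulder"],
    "HouseType": ["Condo", "Condo"],
    "2000-01-31": [150000.0, 180000.0],
    "2000-02-29": [151000.0, None],
})

# every non-identifier column becomes a (Date, Price) row
long_df = wide.melt(
    id_vars=["RegionName", "HouseType"],
    var_name="Date",
    value_name="Price",
)
long_df["Date"] = pd.to_datetime(long_df["Date"])  # column labels were strings
long_df = long_df.dropna(subset=["Price"]).reset_index(drop=True)
print(long_df)
```

The long format is what Plotly Express expects when a single column is mapped to the x-axis and another to color or facets.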

In [8]:
zillowCombined.dropna(inplace=True)
zillowCombined['Date'] = pd.to_datetime(zillowCombined['Date'])
zillowPredictCombined.dropna(inplace=True)
zillowPredictCombined['Date'] = pd.to_datetime(zillowPredictCombined['Date'])

zillowCO = zillowCombined[zillowCombined['StateName']=='CO'] #grab only data with Colorado
zillowCOPredict = zillowPredictCombined[zillowPredictCombined['StateName']=='CO']

In [None]:
medHouseIncomeCO = pd.read_csv('MEHOINUSCOA646N.csv') #colorado median house income
medHouseIncomeCO.rename(columns={'MEHOINUSCOA646N': 'Price', 'observation_date': 'Date'}, inplace=True) #rename columns
medHouseIncomeCO.tail() #look at end of data to see where the date ends

#grab starting from 2000
medHouseIncomeCO['Date'] = pd.to_datetime(medHouseIncomeCO['Date']) #switch to date type
medHouseIncomeCO = medHouseIncomeCO[medHouseIncomeCO['Date'].dt.year >= 2000]  #only grab years from 2000 on
medHouseIncomeCO.head()

## Data Visualization

In [10]:
#Separate data
zillowSingleFamilyCO = zillowCO[zillowCO['HouseType']=='Single Family'] #grab single family colorado homes
zillowCondoCO = zillowCO[zillowCO['HouseType']=='Condo'] #grab condo colorado homes

#zillowCO is already in long format with a datetime Date column and a Price column,
#so melting again would fold Date and Price into the value column; copy the subsets instead
zillowMeltedSFCO = zillowSingleFamilyCO.copy()
zillowMeltedCondoCO = zillowCondoCO.copy()

In [None]:
medianPriceCO = zillowCO.groupby('Date')['Price'].median().reset_index() #median Colorado price per date (zillowCO is already in long format)
medianPriceCO.head()

In [17]:
medianPriceCO = medianPriceCO[(medianPriceCO['Date'].dt.month == 1) & (medianPriceCO['Date'].dt.year <= 2023)].copy() #only grab January values up to and including 2023 (remove 2024)
medianPriceCO['Date'] = medianPriceCO['Date'].dt.to_period('M').dt.to_timestamp() #snap dates to the first of the month
medHouseIncomeCO['Date'] = pd.to_datetime(medHouseIncomeCO['Date']).dt.to_period('M').dt.to_timestamp() #same for median income so both series share the same dates
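
Once both series share month-start dates, they can be joined on `Date`. The sketch below shows one way to do that and derive a price-to-income ratio. All numbers are hypothetical, and the income column is called `Income` here for readability (in the notebook it was renamed to `Price`).

```python
import pandas as pd

# hypothetical values, for illustration only
price = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-31", "2022-01-31", "2023-01-31"]),
    "Price": [400000.0, 480000.0, 500000.0],
})
income = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2022-01-01", "2023-01-01"]),
    "Income": [80000.0, 80000.0, 100000.0],
})

# snap both series to the first of the month so the join keys match
price["Date"] = price["Date"].dt.to_period("M").dt.to_timestamp()
income["Date"] = income["Date"].dt.to_period("M").dt.to_timestamp()

merged = price.merge(income, on="Date", how="inner")
merged["PriceToIncome"] = merged["Price"] / merged["Income"]
print(merged)
```

An inner join keeps only years present in both sources, which avoids plotting a price with no matching income.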

## Analysis on House Features and Price

In [None]:
realtorDF= pd.read_csv('realtor-data.zip.csv') #load in data
realtorDF.head()

In [None]:
realtorDF.drop(columns=['prev_sold_date', 'status', 'brokered_by', 'street'],inplace=True)
realtorDF.head()

In [None]:
realtorCO = realtorDF[realtorDF['state']=='Colorado'] #grab subset for colorado
len(realtorCO)

In [None]:
realtorCO = realtorCO.dropna() #drop all missing values
realtorCO.isna().sum()

In [None]:
#Correlation matrix
fig, ax = plt.subplots(figsize=(6, 6)) #initiate plot
corrMat = realtorCO.corr(numeric_only=True) #create correlation matrix
mask = np.triu(np.ones_like(corrMat, dtype=bool)) #only care about bottom half
sns.heatmap(corrMat, square=True, annot=True, cbar=True, cmap='crest', mask=mask, linewidths=0.5) #create heatmap to visualize strength of correlation
plt.tight_layout()
plt.show()

In [None]:
#grab top correlated features
realtorCONum = realtorCO[['price', 'bed', 'bath', 'acre_lot', 'house_size']]
corr_df = realtorCONum.corr().stack().reset_index().rename(columns = {'level_0':'variable1', 
                                                            'level_1':'variable2', 
                                                             0:'correlation'})
corr_df = corr_df[corr_df.variable2 > corr_df.variable1]
corr_df = corr_df.loc[corr_df.correlation.abs().sort_values(ascending= False).index]
corr_df
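
An alternative way to list each pair once is to mask the correlation matrix to its strict upper triangle before stacking. This is a sketch on hypothetical listings, not the notebook's data, and the `toy` and `pairs` names are illustrative.

```python
import numpy as np
import pandas as pd

# hypothetical listings, for illustration only
toy = pd.DataFrame({
    "price": [200, 300, 400, 500, 650],
    "bed": [2, 3, 3, 4, 5],
    "bath": [1, 2, 2, 3, 3],
    "house_size": [900, 1400, 1600, 2100, 2600],
})

corr = toy.corr()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # True strictly above the diagonal

pairs = (
    corr.where(upper)   # blank out the diagonal and the lower triangle
    .stack()
    .dropna()           # drop the masked cells if stack kept them
    .rename_axis(["variable1", "variable2"])
    .reset_index(name="correlation")
)
# strongest relationships first, regardless of sign
pairs = pairs.sort_values("correlation", key=lambda s: s.abs(), ascending=False)
print(pairs)
```

The mask makes the intent explicit and does not depend on how the variable names compare as strings.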

## Plotly

Ideas for Plotly visualizations:
1. scatter plot of income vs housing price
2. animation of home cost over the years for states and areas of colorado
3. Map of prices by state

## Dash

Ideas for Dash:
1. callback function to look at realtor data to see how different home characteristics affect