# Assignment 1

You have been engaged by the Salt Lake County Regional Economic Development office to investigate the effects of building a sports stadium on nearby property values. One component of your analysis will be to evaluate the effects of building the Rio Tinto Stadium (now America First Credit Union Field) in Sandy, UT (home of Real Salt Lake). Did it have an effect on property values?

To prepare your analysis, you have been provided the MLS sales data for Sandy, UT as well as several other nearby suburbs in Salt Lake County. You may choose which of these to include in your analysis.

### Environment

Load the libraries used in this notebook so that Julia can resolve the functions referenced below.

In [None]:
using CSV
using DataFrames
using Geodesy
using Plots
using StatsPlots

### Data

Read in and clean the MLS data for Sandy and Draper, then convert the text columns for air conditioning and heating into dummy variables.

In [None]:
# Read in comma-separated data into DataFrame
mls_data = CSV.read("../../Data/SandyData.csv", DataFrame);
draper = CSV.read("../../Data/DraperData.csv", DataFrame);
mls_data = vcat(mls_data, draper);

# Drop extra columns and rows w/ missing values
mls_data = mls_data[(mls_data.Latitude .> 0.0) .& (mls_data.Longitude .< 0.0), 
    [:SoldPrice, :SOLDYRMO, :Acres, :TotSqf, :TotBed, :TotBth, :GaragCap, :Latitude, :Longitude, :AirType, :Heat]];
mls_data = dropmissing(mls_data);

# Create dummy variables for AC and heat (1 if the text column contains the phrase, else 0)
mls_data[!, "AC"] = Int.(occursin.("Central Air", mls_data.AirType));
mls_data[!, "CentralHeating"] = Int.(occursin.("Central", mls_data.Heat));
# Drop old heat/air columns
mls_data = select(mls_data, Not(:AirType));
mls_data = select(mls_data, Not(:Heat));

first(mls_data, 5)

Create the dummy variables used to examine the impact of the stadium.

In [None]:
# Create transformation from latitude/longitude to UTM zone 12 north (coordinates in meters)
utm_utah = UTMfromLLA(12, true, wgs84)

# America First Field
stadium_loc = utm_utah(LLA(40.5829, -111.8934, 0.0))

function distance_from_stadium(lat, lon)
    # Convert to UTM
    house_loc = utm_utah(LLA(lat, lon, 0.0));
    # Calculate the Euclidean distance in meters (square each component, then sum)
    delta = (stadium_loc.x - house_loc.x, stadium_loc.y - house_loc.y)
    dist = √(sum(delta .^ 2));
    # Convert to miles from meters
    return dist / 1609.344
end

mls_data[!, "distFromStadium"] = map(distance_from_stadium, mls_data.Latitude, mls_data.Longitude);
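# Treatment group: within 1 mile of the stadium; comparison group: at least 4 miles away
mls_data[!, "nearStadium"] = 1 * (mls_data.distFromStadium .<= 1);
mls_data[!, "awayFromStadium"] = 1 * (mls_data.distFromStadium .>= 4);
# Keep only homes that fall in one of the two groups (drops the band between 1 and 4 miles)
mls_data = mls_data[mls_data.nearStadium + mls_data.awayFromStadium .> 0, :]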

describe(mls_data.distFromStadium)
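
As a sanity check on the distances computed above, we can compare them against a great-circle distance calculated directly from latitude and longitude. The sketch below is not part of the original analysis: `haversine_miles` is a hypothetical helper, and it assumes a spherical Earth with a mean radius of 6,371,008.8 m, so small differences from the UTM-based figure are expected.

```julia
# Great-circle (haversine) distance in miles between two latitude/longitude points.
# Assumption: spherical Earth with mean radius 6,371,008.8 m.
function haversine_miles(lat1, lon1, lat2, lon2)
    earth_radius_m = 6_371_008.8
    φ1, φ2 = deg2rad(lat1), deg2rad(lat2)
    Δφ = φ2 - φ1
    Δλ = deg2rad(lon2 - lon1)
    a = sin(Δφ / 2)^2 + cos(φ1) * cos(φ2) * sin(Δλ / 2)^2
    # Clamp to 1.0 to guard against floating-point overshoot before asin
    meters = 2 * earth_radius_m * asin(min(1.0, sqrt(a)))
    return meters / 1609.344
end

# Stadium coordinates as used in the cell above
stadium_lat, stadium_lon = 40.5829, -111.8934

# Distance to a point one degree of latitude due north of the stadium
one_degree_north = haversine_miles(stadium_lat, stadium_lon, stadium_lat + 1.0, stadium_lon)
```

Calling `haversine_miles(lat, lon, stadium_lat, stadium_lon)` for a few rows and comparing the result with `distFromStadium` should give nearly identical values. A large gap would point to a problem in the distance calculation.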

### Questions

1. Describe how you plan to address your research question. What is the dependent variable? What is/are the key independent variable(s)? Do you want to recover a causal effect?

To explore the impact that America First Field had on the housing market in Sandy, we examine how home prices changed with proximity to the site and with whether the sale took place before or after the field was completed. Using the home sale data, the dependent variable is the sold price of the home. The key independent variables are dummy variables for whether the house was near the stadium and whether it was sold after the stadium was constructed. We also include the main characteristics that affect a house's price (total square footage, acreage, number of bedrooms, number of bathrooms, etc.) as controls.

2. Assuming you will use regression analysis to explore this issue, describe the model you propose.

We propose a linear regression model that includes the aforementioned independent variables as well as the interaction between proximity and the house being sold after the stadium was completed.

*SoldPrice* = &beta;<sub>0</sub> + &beta;<sub>1</sub>*AfterStadiumBuilt* + &beta;<sub>2</sub>*NearStadium* + &beta;<sub>3</sub>*SquareFootage* + &beta;<sub>4</sub>*TotBth* + ... + &beta;<sub>*N*</sub>*X*<sub>*N*</sub> + &delta;(*NearStadium* &times; *AfterStadiumBuilt*) + &epsilon;
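
A specification with a treatment dummy, a timing dummy, and their interaction is often called a difference-in-differences model, and the coefficient on the *NearStadium* &times; *AfterStadiumBuilt* interaction is the one of interest. The sketch below shows one way the timing dummy and the design matrix could be built and the model fit by ordinary least squares. Several pieces are assumptions rather than part of the original notebook: the helper names `after_stadium`, `did_design`, and `fit_ols` are hypothetical, `SOLDYRMO` is assumed to be an integer of the form YYYYMM, the cutoff assumes an October 2008 opening, and the numbers are toy values chosen only so the result can be checked.

```julia
using LinearAlgebra

# Hypothetical helper: 1 if the sale month falls on or after the stadium opening.
# Assumes SOLDYRMO is an integer of the form YYYYMM and an October 2008 opening.
after_stadium(soldyrmo; cutoff = 200810) = soldyrmo >= cutoff ? 1 : 0

# Design matrix: intercept, both dummies, their interaction, then the controls.
function did_design(after, near, controls)
    return hcat(ones(length(after)), after, near, after .* near, controls)
end

# Ordinary least squares via Julia's least-squares solve for tall matrices.
fit_ols(X, y) = X \ y

# Toy numbers for illustration only (not taken from the MLS data)
soldyrmo = [200605, 200711, 200901, 201204, 200603, 200708, 201006, 201502]
near     = [0, 1, 0, 1, 0, 1, 0, 1]
sqft     = [1000.0, 1500.0, 1200.0, 1800.0, 2000.0, 1100.0, 1600.0, 1300.0]
after    = after_stadium.(soldyrmo)

# Build prices from known coefficients (no noise) so the fit can be checked
price = 100.0 .+ 10.0 .* after .+ 5.0 .* near .+ 20.0 .* (after .* near) .+ 0.1 .* sqft

# β[4] is the coefficient on the interaction term
β = fit_ols(did_design(after, near, sqft), price)
```

Because the toy prices contain no noise, the fit should recover the chosen coefficients up to floating-point error. With the real data, `controls` would be a matrix of the property characteristics, and a regression package that reports standard errors would be the more practical choice.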

3. What variables do you propose to use? Why? What effect do you expect they will have on your dependent variable? What informs your expectation?

From the original data, we will use the variables we think have the greatest explanatory power for home price: acreage, square footage, number of bedrooms, number of bathrooms, having AC, and having central heating. We expect each of these to have a positive relationship with the home price. For the treatment variables (being sold after the stadium was built and being near the stadium), we also expect a positive relationship with the home price, on the assumption that a new sports stadium attracts more business as people come in from outside the city.

4. Do your data contain all the independent variables you intend to use? If not, what other variables do you need? Where will you get them?

The original data contain most of the variables needed. Other independent variables we would like to include are the unemployment rate and the interest rate, both of which can be obtained from the FRED database.

5. Provide a description of your data
    1. What do these data contain?
    
    The data contain the details of home sales in the city of Sandy, Utah from 2000 to 2017. This includes the property characteristics listed by the real estate agent that would be of interest to a buyer, as well as the price and date (month) of the sale.
    
    2. What do the data not contain that may be of interest?
    
    The data are missing some exogenous variables that affect the housing market, such as migration into and out of the state, the unemployment rate, and the interest rate. These influence people's ability to purchase a house and therefore the prices observed, especially over time. This matters because we want to compare prices before and after the stadium without confounding the stadium's effect with these other factors.
    
    3. Time period covered – is the time period sufficient for your analysis
    
    The data go back to 2000, 8 years before the stadium's completion, and run through 2017, 9 years after its completion, which should be sufficient for our purposes.

6. Provide a table of univariate statistics for key variables such as price (explain if it’s the sold price or the net price as discussed in class), square footage, acreage, etc.

* SoldPrice: Contracted sold price of the residence in U.S. dollars
* Acres: Total acreage of the property sold
* TotBed: Total number of bedrooms in the residence
* TotBth: Total number of bathrooms in the residence
* AC: Dummy equal to 1 if the residence has any form of central air conditioning
* CentralHeating: Any form of central heating in the residence, including radiant and forced air
* TotSqf: Total square footage of the residence

In [None]:
describe(mls_data)

7. Do any observations indicate they should be deleted? Why?

Observations with missing values for any of the variables used are dropped, which reduces the number of observations. There is also one home on a property much larger than the rest of the data, which would skew the estimated effect of acreage on price.

In [None]:
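# Plot acreage against sold price to spot unusually large properties
scatter(mls_data.Acres, mls_data.SoldPrice, xlabel="Acres", ylabel="SoldPrice", legend=false)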
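
The unusually large property mentioned above can also be flagged programmatically instead of by eye. The sketch below uses the common 1.5 &times; IQR rule of thumb, which is our choice for illustration and not a method prescribed by the assignment. The helper `iqr_outliers` is hypothetical and the acreages are toy values, not taken from the MLS data.

```julia
using Statistics

# Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is a common rule of thumb.
function iqr_outliers(x; k = 1.5)
    q1, q3 = quantile(x, [0.25, 0.75])
    spread = q3 - q1
    return (x .< q1 - k * spread) .| (x .> q3 + k * spread)
end

# Toy acreages for illustration only (not taken from the MLS data)
acres = [0.20, 0.25, 0.30, 0.22, 0.28, 15.0]
flags = iqr_outliers(acres)
```

Applied to the real data, `mls_data[.!iqr_outliers(mls_data.Acres), :]` would keep only the unflagged rows. Acreage is typically right-skewed, so this rule may flag more than the single extreme property, and whether to drop those rows remains a judgment call.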