## Final Project Submission

Please fill out:
* Student name: James Benedito
* Student pace: Part Time
* Instructor name: Morgan Jones

## Introduction

Drive through a suburban neighborhood and you will almost certainly encounter at least one "for sale" sign. These signs typically carry the logo of a large real estate company, such as Century 21 or RE/MAX. Real estate companies are businesses that deal with buying and selling properties. When purchasing a house, every buyer has non-negotiables: some want a big kitchen or backyard, while others want a basement or a certain number of bathrooms. In this Jupyter notebook, I will explore the variables that affect a home's value. The analysis is intended for real estate companies and for people who are trying to sell their house.

## Business Problem

King County is a county in Washington state with a population of approximately 2.2 million people, according to 2022 US Census Bureau estimates. As of 2021, the median household income was about $106,000 (https://www.census.gov/quickfacts/kingcountywashington).

A theoretical real estate company in King County helps homeowners sell their homes. The company wants to give its clients concrete, data-driven advice on renovations that can boost a house's value before it goes on the market. My data analysis will shed light on the variables that have the greatest impact on a home's sale price in King County and will thus inform the company's clients about which renovations to prioritize.

## Goal

My goal is to come up with three concrete suggestions based on a linear regression analysis. Sale price is the dependent variable throughout the exploration process. The final linear regression model will include three independent variables: those that are the best predictors of a house's sale price.

## Dataset

The dataset I am using for the analysis is kc_house_data.csv. This dataset includes data on houses in King County, which is where the theoretical real estate company and their clients are located.

## Dataset Exploration

I will begin by exploring the dataset, using the .head() method to display the first five rows in a table format. I will then use the .info() and .describe() methods to get a better understanding of the overall dataset: column dtypes, missing values, and summary statistics.

In [2]:
# initial exploration of dataset
import pandas as pd

house_data = pd.read_csv('data/kc_house_data.csv')  # forward slash works on Windows, macOS, and Linux
house_data.head()

,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,greenbelt,...,sewer_system,sqft_above,sqft_basement,sqft_garage,sqft_patio,yr_built,yr_renovated,address,lat,long
0,7399300360,5/24/2022,675000.0,4,1.0,1180,7140,1.0,NO,NO,...,PUBLIC,1180,0,0,40,1969,0,"2102 Southeast 21st Court, Renton, Washington ...",47.461975,-122.19052
1,8910500230,12/13/2021,920000.0,5,2.5,2770,6703,1.0,NO,NO,...,PUBLIC,1570,1570,0,240,1950,0,"11231 Greenwood Avenue North, Seattle, Washing...",47.711525,-122.35591
2,1180000275,9/29/2021,311000.0,6,2.0,2880,6156,1.0,NO,NO,...,PUBLIC,1580,1580,0,0,1956,0,"8504 South 113th Street, Seattle, Washington 9...",47.502045,-122.2252
3,1604601802,12/14/2021,775000.0,3,3.0,2160,1400,2.0,NO,NO,...,PUBLIC,1090,1070,200,270,2010,0,"4079 Letitia Avenue South, Seattle, Washington...",47.56611,-122.2902
4,8562780790,8/24/2021,592500.0,2,2.0,1120,758,2.0,NO,NO,...,PUBLIC,1120,550,550,30,2012,0,"2193 Northwest Talus Drive, Issaquah, Washingt...",47.53247,-122.07188


In [4]:
house_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30155 entries, 0 to 30154
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             30155 non-null  int64  
 1   date           30155 non-null  object 
 2   price          30155 non-null  float64
 3   bedrooms       30155 non-null  int64  
 4   bathrooms      30155 non-null  float64
 5   sqft_living    30155 non-null  int64  
 6   sqft_lot       30155 non-null  int64  
 7   floors         30155 non-null  float64
 8   waterfront     30155 non-null  object 
 9   greenbelt      30155 non-null  object 
 10  nuisance       30155 non-null  object 
 11  view           30155 non-null  object 
 12  condition      30155 non-null  object 
 13  grade          30155 non-null  object 
 14  heat_source    30123 non-null  object 
 15  sewer_system   30141 non-null  object 
 16  sqft_above     30155 non-null  int64  
 17  sqft_basement  30155 non-null  int64  
 18  sqft_g

From the .info() method, we see that there are 25 columns and 30,155 rows in the dataset. The heat_source and sewer_system columns have missing values (32 and 14 missing entries, respectively). In terms of dtype, the dataset contains a mix of integers, floats, and objects.
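The missing-value counts read off the .info() output can also be computed directly. The following is a minimal sketch, not part of the original notebook: `missing_value_summary` is a hypothetical helper, and the small `toy` frame (with made-up values) stands in for `house_data` so the snippet runs on its own.

```python
import pandas as pd


def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest count first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for house_data (values are illustrative, not taken from the real file)
toy = pd.DataFrame({
    "price": [675000.0, 920000.0, 311000.0, 775000.0],
    "heat_source": ["Gas", None, "Electricity", None],
    "sewer_system": ["PUBLIC", "PUBLIC", None, "PUBLIC"],
})

missing = missing_value_summary(toy)
print(missing)
```

In the notebook itself, the same helper could be called as `missing_value_summary(house_data)`.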

In [5]:
house_data.describe()

,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,sqft_above,sqft_basement,sqft_garage,sqft_patio,yr_built,yr_renovated,lat,long
count,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0,30155.0
mean,4538104000.0,1108536.0,3.41353,2.334737,2112.424739,16723.6,1.543492,1809.826098,476.039396,330.211142,217.412038,1975.163953,90.922301,47.328076,-121.317397
std,2882587000.0,896385.7,0.981612,0.889556,974.044318,60382.6,0.567717,878.306131,579.631302,285.770536,245.302792,32.067362,416.473038,1.434005,5.725475
min,1000055.0,27360.0,0.0,0.0,3.0,402.0,1.0,2.0,0.0,0.0,0.0,1900.0,0.0,21.27424,-157.79148
25%,2064175000.0,648000.0,3.0,2.0,1420.0,4850.0,1.0,1180.0,0.0,0.0,40.0,1953.0,0.0,47.40532,-122.326045
50%,3874011000.0,860000.0,3.0,2.5,1920.0,7480.0,1.5,1560.0,0.0,400.0,150.0,1977.0,0.0,47.55138,-122.225585
75%,7287100000.0,1300000.0,4.0,3.0,2619.5,10579.0,2.0,2270.0,940.0,510.0,320.0,2003.0,0.0,47.669913,-122.116205
max,9904000000.0,30750000.0,13.0,10.5,15360.0,3253932.0,4.0,12660.0,8020.0,3580.0,4370.0,2022.0,2022.0,64.82407,-70.07434


The .describe() method shows statistics for the 15 numeric columns (dtype integer or float). By default, columns of dtype object are not included in the table generated above.
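The non-numeric columns can be summarized as well by telling .describe() to exclude numeric dtypes. This is a small sketch under the assumption that a count/unique/top/freq summary is what is wanted; the `toy` frame below uses made-up values in place of `house_data`.

```python
import pandas as pd

# Toy stand-in for house_data (values are illustrative)
toy = pd.DataFrame({
    "price": [675000.0, 920000.0, 311000.0, 775000.0],
    "waterfront": ["NO", "NO", "NO", "YES"],
    "sewer_system": ["PUBLIC", "PUBLIC", "PUBLIC", "PUBLIC"],
})

# Excluding numeric dtypes leaves the text columns, which are summarized
# with count, unique, top (most frequent value), and freq
object_summary = toy.describe(exclude=["number"])
print(object_summary)
```

The equivalent call in the notebook would be `house_data.describe(exclude=["number"])`.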

Next, I will compute the correlation between sale price and each of the integer and float variables.
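One possible way to rank the numeric variables by their correlation with price is sketched below. `price_correlations` is a hypothetical helper, the choice to ignore the `id` column (an identifier rather than a house feature) is my assumption, and the `toy` frame contains made-up values so the snippet is self-contained.

```python
import pandas as pd


def price_correlations(df: pd.DataFrame, target: str = "price",
                       ignore: tuple = ("id",)) -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    numeric = df.select_dtypes(include="number").drop(columns=list(ignore), errors="ignore")
    corr = numeric.corr()[target].drop(target)
    # Rank by absolute value so strong negative correlations are not buried
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy stand-in for house_data (values are illustrative)
toy = pd.DataFrame({
    "id": [101, 102, 103, 104, 105],
    "price": [300000.0, 500000.0, 700000.0, 900000.0, 1100000.0],
    "sqft_living": [1000, 1500, 2100, 2500, 3100],
    "bedrooms": [3, 2, 4, 3, 4],
    "waterfront": ["NO", "NO", "NO", "YES", "NO"],
})

ranked = price_correlations(toy)
print(ranked.head(3))
```

Calling `price_correlations(house_data).head(3)` in the notebook would list the three numeric columns most strongly correlated with price. Correlation alone does not account for overlap between predictors (for example, different square-footage columns), so the ranking is a starting point rather than a final selection.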

In [None]:
# visualize data distribution to see if it needs to be normalized first
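A quick numeric check can complement a histogram when deciding whether price needs to be transformed. The sketch below compares the skewness of raw and log-transformed prices; the log transform is my assumption about what "normalized" could mean here, and the values are made up rather than drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Toy prices with one very expensive house (illustrative values)
prices = pd.Series([400000.0, 650000.0, 860000.0, 1300000.0, 5000000.0], name="price")

raw_skew = prices.skew()          # positive value indicates a long right tail
log_skew = np.log(prices).skew()  # log transform compresses the right tail

print(f"skew of price:      {raw_skew:.2f}")
print(f"skew of log(price): {log_skew:.2f}")
```

For the visual check itself, `house_data['price'].hist(bins=50)` draws a histogram, provided matplotlib is installed.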

In [None]:
# start by figuring out the Top 3 variables (in terms of corr coefficients) w/ price

In [None]:
# price is dependent variable
# see how 3 different independent variables impact price and use for final model
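As a rough sketch of what the final model could look like, the snippet below fits an ordinary least squares regression of price on three predictors using NumPy's least-squares solver. Several things here are assumptions: the notebook has not yet chosen its three variables, so `sqft_living`, `bathrooms`, and `sqft_patio` are placeholders; `fit_ols` is a hypothetical helper; NumPy is used only to keep the example dependency-light and is not necessarily the library the analysis will use; and the toy data is generated from known coefficients so the fit can be checked.

```python
import numpy as np
import pandas as pd


def fit_ols(df: pd.DataFrame, predictors: list, target: str = "price"):
    """Fit price ~ intercept + predictors by least squares; return coefficients and R^2."""
    X = df[predictors].to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the intercept
    y = df[target].to_numpy(dtype=float)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coefs
    ss_res = ((y - fitted) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r_squared = 1 - ss_res / ss_tot
    return pd.Series(coefs, index=["intercept", *predictors]), r_squared


# Toy data built from known coefficients (illustrative, noise-free)
toy = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000, 1200],
    "bathrooms": [1.0, 2.0, 2.0, 3.0, 2.5, 1.5],
    "sqft_patio": [40, 240, 0, 270, 30, 100],
})
toy["price"] = (100000 + 300 * toy["sqft_living"]
                + 50000 * toy["bathrooms"] + 100 * toy["sqft_patio"])

# Placeholder predictors; swap in the three variables chosen from the correlation step
coefs, r_squared = fit_ols(toy, ["sqft_living", "bathrooms", "sqft_patio"])
print(coefs)
print(f"R^2 = {r_squared:.3f}")
```

Each coefficient is the estimated change in price for a one-unit increase in that predictor with the others held fixed, which is the quantity a renovation recommendation would lean on.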