# Business Understanding
## Introduction
This project addresses a practical need: advising homeowners on whether home renovations might increase the estimated value of their homes, and by how much. The problem matters to a real estate agency because better renovation advice helps it serve its clients. If the agency understands how renovations relate to home value, it can give homeowners more accurate guidance on improving their properties, which can lead to more sales and more satisfied clients.

## Stakeholders
The main stakeholders are real estate agents, homeowners, and home improvement contractors. Real estate agents could use the project to give clients more accurate advice about the value of their homes. Homeowners could use it to decide whether a renovation is worthwhile. Home improvement contractors could use it to identify the most profitable renovations to offer their clients.

## Conclusion
The project can improve understanding of how home renovations affect home value. By providing accurate, actionable insights, it can help homeowners, real estate agents, and home improvement contractors make better decisions about their homes.


# Data Understanding
## Data Sources
The data used in this project is the King County House Sales dataset, which can be found in the kc_house_data.csv file in the data folder in this assignment's GitHub repository. The description of the column names can be found in the column_names.md file in the same folder.

The data was collected by Zillow and includes information on 21,597 home sales in King County, Washington between 2014 and 2017 (the row count is the one reported by `data.shape` in the notebook). The data includes information on the following features:

- Sale Price (`price`): The price at which the home was sold.
- Square Footage (`sqft_living`): The square footage of the home's living area.
- Number of Bedrooms (`bedrooms`): The number of bedrooms in the home.
- Number of Bathrooms (`bathrooms`): The number of bathrooms in the home.
- Year Built (`yr_built`): The year the home was built.
- Location (`zipcode`): The ZIP code where the home is located, used here as a proxy for neighborhood.
- Year Renovated (`yr_renovated`): The year the home was renovated, if any. The dataset has no columns for renovation type or renovation cost.

## Data Size and Descriptive Statistics
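The data set contains 21,597 rows and 21 columns, as reported by `data.shape`. The following descriptive statistics for the most important features are taken from the `data.describe()` output. The `yr_built` row lists only the mean because the rest of that column is cut off in the output.

| Feature | Mean | Standard Deviation | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| Sale Price (`price`) | $540,297 | $367,368 | $78,000 | $7,700,000 |
| Square Footage (`sqft_living`) | 2,080 | 918 | 370 | 13,540 |
| Number of Bedrooms (`bedrooms`) | 3.4 | 0.9 | 1 | 33 |
| Number of Bathrooms (`bathrooms`) | 2.1 | 0.8 | 0.5 | 8 |
| Year Built (`yr_built`) | 1971 | | | |
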
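
A summary table like the one above can be produced directly with pandas. The following is a minimal sketch, assuming the column names that appear in the notebook output (`price`, `sqft_living`, `bedrooms`). The helper `summarize` is hypothetical, and the small frame is made-up illustration data, not rows from the dataset.

```python
import pandas as pd


def summarize(df, columns):
    """Return mean, std, min and max for the given columns, one row per column."""
    # agg() with a list of function names yields one row per statistic;
    # transposing gives one row per feature, matching the report table layout.
    return df[columns].agg(["mean", "std", "min", "max"]).T


# Made-up rows for illustration only. In the notebook this would be
# pd.read_csv("data/kc_house_data.csv") instead.
sample = pd.DataFrame(
    {
        "price": [300000.0, 450000.0, 600000.0],
        "sqft_living": [1200, 1900, 2500],
        "bedrooms": [2, 3, 4],
    }
)

summary = summarize(sample, ["price", "sqft_living", "bedrooms"])
print(summary.round(1))
```

Computing the table from the loaded frame, rather than typing the numbers by hand, keeps the report consistent with whatever `data.describe()` actually returns.
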
## Feature Inclusion
The following features were included in the analysis because they are most relevant to the business problem of understanding how home renovations impact the value of homes:

- Sale Price (`price`): This is the dependent variable that we are trying to predict.
- Year Renovated (`yr_renovated`): This is the only renovation field in the dataset. It records the year of a renovation, if any. The dataset has no renovation type or renovation cost columns, so the analysis cannot distinguish a kitchen remodel from a bathroom remodel or an addition.
- Square Footage (`sqft_living`): This is a continuous variable that indicates the size of the home's living area.
- Number of Bedrooms (`bedrooms`): This is a discrete count of the bedrooms in the home.
- Number of Bathrooms (`bathrooms`): This is a discrete variable that indicates the number of bathrooms in the home. It includes fractional values (the minimum is 0.5).
- Year Built (`yr_built`): This is a numeric variable that indicates the year the home was built.

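
Selecting features by name only works if the names match the columns in the CSV exactly. The sketch below shows one way to check this before subsetting, so that a mismatch such as `"Sale Price"` versus `price` fails with a readable message. The helper `select_features` is hypothetical, and the sample frame is made-up data that reuses the column names visible in the notebook output.

```python
import pandas as pd


def select_features(df, wanted):
    """Return a copy of df restricted to `wanted`, failing early if any name is absent."""
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        raise KeyError(
            f"Columns not found: {missing}. Available columns: {list(df.columns)}"
        )
    return df[wanted].copy()


# Made-up rows for illustration only, not taken from the dataset.
sample = pd.DataFrame(
    {
        "id": [1, 2],
        "price": [300000.0, 450000.0],
        "sqft_living": [1200, 1900],
        "bedrooms": [2, 3],
        "bathrooms": [1.0, 2.25],
        "yr_built": [1955, 1987],
        "yr_renovated": [0.0, 1991.0],
    }
)

features = ["price", "sqft_living", "bedrooms", "bathrooms", "yr_built", "yr_renovated"]
selected = select_features(sample, features)
print(selected.head())
```
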
## Limitations of the Data
The following are some limitations of the data that have implications for the project:

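- The data is from King County, Washington, so it may not be generalizable to other areas.
- The data is from 2014 to 2017, so it may not be up-to-date.
- The data is self-reported, so there may be errors or omissions.
- The `yr_renovated` column has missing values: `data.describe()` reports 17,755 non-null entries out of 21,597 rows.
- The dataset does not record renovation type or renovation cost, so it cannot show what a given kind of renovation is worth.
- Some values look like data-entry errors or extreme outliers, such as a maximum of 33 bedrooms.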
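
Omissions of the kind listed above can be measured rather than assumed. This is a minimal sketch that counts missing values per column; `missing_report` is a hypothetical helper, and the sample frame is made-up data using column names from the notebook output.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return the count and share of missing values for columns that have any."""
    report = pd.DataFrame(
        {
            "missing": df.isna().sum(),  # number of NaN entries per column
            "share": df.isna().mean(),   # fraction of rows that are NaN
        }
    )
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Made-up rows for illustration only, not taken from the dataset.
sample = pd.DataFrame(
    {
        "price": [300000.0, 450000.0, 600000.0, 520000.0],
        "yr_renovated": [0.0, np.nan, 1991.0, np.nan],
    }
)

report = missing_report(sample)
print(report)
```
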

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the data
data = pd.read_csv("data/kc_house_data.csv")

# Check the size of the data
print(data.shape)

(21597, 21)


In [3]:
# Print some descriptive statistics
print(data.describe())

# Select the features that we want to use, using the column names in the CSV
features = ["price", "sqft_living", "bedrooms", "bathrooms", "yr_built", "yr_renovated"]

# Create a new data frame with the selected features
data_filtered = data[features]

# Print the first few rows of the filtered data
print(data_filtered.head())

                 id         price      bedrooms     bathrooms   sqft_living  \
count  2.159700e+04  2.159700e+04  21597.000000  21597.000000  21597.000000   
mean   4.580474e+09  5.402966e+05      3.373200      2.115826   2080.321850   
std    2.876736e+09  3.673681e+05      0.926299      0.768984    918.106125   
min    1.000102e+06  7.800000e+04      1.000000      0.500000    370.000000   
25%    2.123049e+09  3.220000e+05      3.000000      1.750000   1430.000000   
50%    3.904930e+09  4.500000e+05      3.000000      2.250000   1910.000000   
75%    7.308900e+09  6.450000e+05      4.000000      2.500000   2550.000000   
max    9.900000e+09  7.700000e+06     33.000000      8.000000  13540.000000   

           sqft_lot        floors    sqft_above      yr_built  yr_renovated  \
count  2.159700e+04  21597.000000  21597.000000  21597.000000  17755.000000   
mean   1.509941e+04      1.494096   1788.596842   1970.999676     83.636778   
std    4.141264e+04      0.539683    827.759761    

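
The only renovation field visible in the output above is `yr_renovated`, so any renovation analysis has to start from it. The sketch below derives a simple renovated/not-renovated flag and compares median prices. It rests on an assumption: a value of 0 or a missing value means no recorded renovation. The mean of about 83.6 for a year column suggests most entries are 0, but this should be checked against the data. `add_renovated_flag` is a hypothetical helper, and the sample rows are made up.

```python
import numpy as np
import pandas as pd


def add_renovated_flag(df):
    """Return a copy of df with a boolean `renovated` column.

    Assumption: yr_renovated of 0 or NaN means no recorded renovation.
    """
    out = df.copy()
    out["renovated"] = out["yr_renovated"].fillna(0) > 0
    return out


# Made-up rows for illustration only, not taken from the dataset.
sample = pd.DataFrame(
    {
        "price": [300000.0, 450000.0, 600000.0, 520000.0],
        "yr_renovated": [0.0, np.nan, 1991.0, 2005.0],
    }
)

flagged = add_renovated_flag(sample)

# Median sale price for homes with and without a recorded renovation.
median_by_flag = flagged.groupby("renovated")["price"].median()
print(median_by_flag)
```

A difference in medians is only descriptive. Renovated homes may also differ in size, age, or location, so the comparison does not by itself show how much value a renovation adds.
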