## Final Project Submission

Please fill out:
* Student name: 
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:

# Business Problem
Our data analytics firm has been hired by a local house-flipping company. Their business is to purchase homes that are selling below market value, repair them, and then sell them for the highest possible price.

Our client wants to know which types of home sell for the highest prices and in the highest volume, which homes they should purchase, and how much renovation they should put into each property.

# Summary of Recommendations

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# import and inspect data
df = pd.read_csv('data/kc_house_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            21597 non-null  float64
 18  long  

# Column Names and Descriptions for King County Data Set
* `id` - Unique identifier for a house
* `date` - Date house was sold
* `price` - Sale price (prediction target)
* `bedrooms` - Number of bedrooms
* `bathrooms` - Number of bathrooms
* `sqft_living` - Square footage of living space in the home
* `sqft_lot` - Square footage of the lot
* `floors` - Number of floors (levels) in house
* `waterfront` - Whether the house is on a waterfront
  * Includes Duwamish, Elliott Bay, Puget Sound, Lake Union, Ship Canal, Lake Washington, Lake Sammamish, other lake, and river/slough waterfronts
* `view` - Quality of view from house
  * Includes views of Mt. Rainier, Olympics, Cascades, Territorial, Seattle Skyline, Puget Sound, Lake Washington, Lake Sammamish, small lake / river / creek, and other
* `condition` - How good the overall condition of the house is. Related to maintenance of house.
  * See the [King County Assessor Website](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r) for further explanation of each condition code
* `grade` - Overall grade of the house. Related to the construction and design of the house.
  * See the [King County Assessor Website](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r) for further explanation of each building grade code
* `sqft_above` - Square footage of house apart from basement
* `sqft_basement` - Square footage of the basement
* `yr_built` - Year when house was built
* `yr_renovated` - Year when house was renovated
* `zipcode` - ZIP Code used by the United States Postal Service
* `lat` - Latitude coordinate
* `long` - Longitude coordinate
* `sqft_living15` - The square footage of interior housing living space for the nearest 15 neighbors
* `sqft_lot15` - The square footage of the land lots of the nearest 15 neighbors


In [3]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,NONE,...,7 Average,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
1,6414100192,12/9/2014,538000.0,3,2.25,2570,7242,2.0,NO,NONE,...,7 Average,2170,400.0,1951,1991.0,98125,47.721,-122.319,1690,7639
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,NO,NONE,...,6 Low Average,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
3,2487200875,12/9/2014,604000.0,4,3.0,1960,5000,1.0,NO,NONE,...,7 Average,1050,910.0,1965,0.0,98136,47.5208,-122.393,1360,5000
4,1954400510,2/18/2015,510000.0,3,2.0,1680,8080,1.0,NO,NONE,...,8 Good,1680,0.0,1987,0.0,98074,47.6168,-122.045,1800,7503


In [4]:
# Any duplicated records?
# Count every row whose id appears more than once
df.duplicated(subset=['id'], keep=False).sum()

353

In [5]:
# Remove duplicates
df.drop_duplicates(subset=['id'], keep='first', inplace=True)
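
Note that `keep='first'` keeps whichever row comes first in the file, which is not necessarily the earliest or the latest sale. If the repeated `id` values are the same house sold more than once (an assumption, not something verified here), one option is to keep the most recent sale instead. A minimal sketch on a hypothetical toy frame, where `toy` and `latest_sales` are illustrative names only:

```python
import pandas as pd

# Toy data: house 1 appears twice (hypothetical values, not from the dataset).
toy = pd.DataFrame({
    "id": [1, 2, 1],
    "date": ["5/1/2014", "6/15/2014", "3/1/2015"],
    "price": [300000.0, 450000.0, 380000.0],
})

# Parse the dates so that sorting is chronological rather than alphabetical.
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y")

# Sort by sale date, then keep the last (most recent) row for each id.
latest_sales = (
    toy.sort_values("date")
       .drop_duplicates(subset=["id"], keep="last")
       .reset_index(drop=True)
)
print(latest_sales)
```

Whether the first or the latest sale is the better record depends on the question being asked; for a flipping analysis the pair of sales for one house may itself be informative.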

In [6]:
# How many NaN values does each column have?
print(df.isna().sum())

# Preview rows that contain at least one NaN
df[df.isna().any(axis=1)].head()

id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2353
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3804
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64


Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,10/13/2014,221900.0,3,1.0,1180,5650,1.0,,NONE,...,7 Average,1180,0.0,1955,0.0,98178,47.5112,-122.257,1340,5650
2,5631500400,2/25/2015,180000.0,2,1.0,770,10000,1.0,NO,NONE,...,6 Low Average,770,0.0,1933,,98028,47.7379,-122.233,2720,8062
7,2008000270,1/15/2015,291850.0,3,1.5,1060,9711,1.0,NO,,...,7 Average,1060,0.0,1963,0.0,98198,47.4095,-122.315,1650,9711
10,1736800520,4/3/2015,662500.0,3,2.5,3560,9796,1.0,,NONE,...,8 Good,1860,1700.0,1965,0.0,98007,47.6007,-122.145,2210,8925
12,114101516,5/28/2014,310000.0,3,1.0,1430,19901,1.5,NO,NONE,...,7 Average,1430,0.0,1927,,98028,47.7558,-122.229,1780,12697
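
Raw counts are easier to judge as a share of the rows. With 21,420 rows left after removing duplicates, `waterfront` is missing in about 11% of rows (2,353), `yr_renovated` in about 17.8% (3,804), and `view` in about 0.3% (63). A small sketch of how such a summary could be built, shown on a hypothetical toy frame (`toy` and `missing` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps (hypothetical values, not from the dataset).
toy = pd.DataFrame({
    "waterfront": ["NO", np.nan, "YES", "NO"],
    "view": ["NONE", "NONE", np.nan, "NONE"],
    "price": [1.0, 2.0, 3.0, 4.0],
})

# isna().sum() counts missing values; isna().mean() gives the missing fraction.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(2),
})

# Keep only the columns that actually have missing values.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```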


In [7]:
# Any placeholders?
# Look for the top occurring values in each column
print('Dataframe\n')
for col in df.columns:
    print(col, '\n', df[col].value_counts(normalize=True).head(10), '\n')

Dataframe

id 
 7129300520    0.000047
8562720230    0.000047
2856100360    0.000047
8929000230    0.000047
3543900418    0.000047
8137500730    0.000047
104500730     0.000047
7575610760    0.000047
629800540     0.000047
7215730120    0.000047
Name: id, dtype: float64 

date 
 6/23/2014     0.006629
6/25/2014     0.006116
6/26/2014     0.006116
7/8/2014      0.005929
4/27/2015     0.005882
3/25/2015     0.005696
7/9/2014      0.005649
4/14/2015     0.005602
6/24/2014     0.005556
10/28/2014    0.005462
Name: date, dtype: float64 

price 
 350000.0    0.008030
450000.0    0.007983
550000.0    0.007283
500000.0    0.007049
425000.0    0.007003
325000.0    0.006863
400000.0    0.006769
375000.0    0.006443
525000.0    0.006116
300000.0    0.006116
Name: price, dtype: float64 

bedrooms 
 3     0.454295
4     0.319748
2     0.127731
5     0.074043
6     0.012372
1     0.008917
7     0.001774
8     0.000607
9     0.000280
10    0.000140
Name: bedrooms, dtype: float64 

bathrooms 
 2.50   
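
Reading the top values column by column works, but a placeholder can be easy to miss. A complementary check, sketched below under the assumption that placeholders are non-numeric strings sitting in an otherwise numeric column, is to coerce each text column to numbers and list whatever fails to convert. The frame `toy`, the dictionary `suspects`, and the 0.5 threshold are all illustrative choices:

```python
import pandas as pd

# Toy data: a numeric column stored as text with a '?' placeholder,
# plus a genuinely categorical column (hypothetical values).
toy = pd.DataFrame({
    "sqft_basement": ["0.0", "400.0", "?", "910.0", "?"],
    "condition": ["Average", "Good", "Average", "Fair", "Good"],
})

suspects = {}
for col in toy.columns:
    # Columns that are already numeric cannot hide string placeholders.
    if pd.api.types.is_numeric_dtype(toy[col]):
        continue
    as_number = pd.to_numeric(toy[col], errors="coerce")
    # Only flag columns where most values do parse as numbers.
    if as_number.notna().mean() > 0.5:
        leftovers = toy.loc[as_number.isna() & toy[col].notna(), col]
        suspects[col] = leftovers.value_counts().to_dict()

print(suspects)
```

Truly categorical columns such as `condition` are skipped because almost none of their values parse as numbers.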

## Replacing or Removing NaN Values
- `waterfront`: replace `NaN` with `'NO'`
- `yr_renovated`: replace 0 with `NaN`, since 0 means 'never renovated'
- `sqft_basement`: replace the `'?'` placeholder with 0
- Change the categorical values in `condition`, `view`, and `waterfront` to integers and map them with a dictionary
    - `view`: replace `NaN` with `NONE` (use a dictionary such as `0: 'None'`, `1: 'Good'`, etc.)

In [8]:
# replace waterfront NaN with 'NO'
# (assign the result back; in-place fillna on a selected column is unreliable in recent pandas)
df['waterfront'] = df['waterfront'].fillna('NO')

In [9]:
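# replace 0 in `yr_renovated` with NaN, as 0 means 'never renovated'
# (scoped to this one column so zeros elsewhere in the dataframe are left untouched)
df['yr_renovated'] = df['yr_renovated'].replace(0, np.nan)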

In [10]:
# clean and fix data type of sqft_basement
## sqft_basement has '?' as a placeholder. Set this to 0.
print('sqft_basement with ? as placeholder:', len(df.loc[df['sqft_basement'] == '?', 'sqft_basement']))
df.loc[df['sqft_basement'] == '?', 'sqft_basement'] = 0.0
df['sqft_basement'] = df['sqft_basement'].astype(float)
print('Removed ? as placeholder:', len(df.loc[df['sqft_basement'] == '?', 'sqft_basement']))

sqft_basement with ? as placeholder: 452
Removed ? as placeholder: 0
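
Setting the placeholder to 0 treats every unknown basement as no basement. An alternative worth considering, assuming that `sqft_living` equals `sqft_above` plus `sqft_basement` (the five rows shown by `df.head()` are consistent with this, but it has not been checked for the full dataset), is to derive the missing value from the other two columns. A minimal sketch on a toy frame built from three of those rows, with one basement value replaced by `'?'` for illustration:

```python
import pandas as pd

# Rows taken from df.head(); the second basement value is masked with '?'
# here purely to demonstrate the fill.
toy = pd.DataFrame({
    "sqft_living": [2570, 1960, 1180],
    "sqft_above": [2170, 1050, 1180],
    "sqft_basement": ["400.0", "?", "0.0"],
})

# Turn '?' (and anything else non-numeric) into NaN.
basement = pd.to_numeric(toy["sqft_basement"], errors="coerce")

# ASSUMPTION: living area = above-ground area + basement area.
derived = (toy["sqft_living"] - toy["sqft_above"]).astype(float)

# Keep recorded values; fill only the gaps with the derived figure.
toy["sqft_basement"] = basement.fillna(derived)
print(toy)
```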


In [11]:
# replace `NaN` with `NONE` for column `view`
df['view'] = df['view'].fillna('NONE')

In [12]:
# Clean up grade column
## strip out by spaces and keep the first string, which should be the number
df['grade'] = df['grade'].apply(lambda x: x.split(' ', 1)[0]).astype(int)

In [13]:
# Check that the date is in the correct format
## May have to control for date for my analysis
# format example: 10/13/2014

df['date'] = pd.to_datetime(df['date'])

# Make new columns with Year, Month of sale
df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.month
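
Since the expected format is known (`10/13/2014`), the parse can also be made explicit so that malformed entries are surfaced rather than guessed at. A small sketch, assuming month/day/year order throughout; `toy`, `parsed`, and `n_bad` are illustrative names and the bad value is invented for the example:

```python
import pandas as pd

# Two dates in the dataset's format plus one deliberately malformed entry.
toy = pd.DataFrame({"date": ["10/13/2014", "2/25/2015", "not a date"]})

# An explicit format avoids day/month ambiguity; errors="coerce" turns
# anything that does not match into NaT instead of raising.
parsed = pd.to_datetime(toy["date"], format="%m/%d/%Y", errors="coerce")

# Count the entries that failed to parse so they can be inspected.
n_bad = int(parsed.isna().sum())
print("unparseable dates:", n_bad)
```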

In [14]:
df.describe()

Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15,Year,Month
count,21420.0,21420.0,21420.0,21420.0,21420.0,21420.0,21420.0,21420.0,21420.0,21420.0,21420.0,740.0,21420.0,21420.0,21420.0,21420.0,21420.0,21420.0,21420.0
mean,4580940000.0,540739.3,3.37395,2.118429,2083.132633,15128.04,1.495985,7.662792,1791.170215,285.904342,1971.092997,1996.017568,98077.87437,47.560197,-122.213784,1988.38408,12775.718161,2014.318954,6.590336
std,2876761000.0,367931.1,0.925405,0.76872,918.808412,41530.8,0.540081,1.171971,828.692965,440.008202,29.387141,15.578983,53.47748,0.138589,0.140791,685.537057,27345.621867,0.466082,3.107924
min,1000102.0,78000.0,1.0,0.5,370.0,520.0,1.0,3.0,370.0,0.0,1900.0,1934.0,98001.0,47.1559,-122.519,399.0,651.0,2014.0,1.0
25%,2123537000.0,322500.0,3.0,1.75,1430.0,5040.0,1.0,7.0,1200.0,0.0,1952.0,1987.0,98033.0,47.4712,-122.328,1490.0,5100.0,2014.0,4.0
50%,3904921000.0,450000.0,3.0,2.25,1920.0,7614.0,1.5,7.0,1560.0,0.0,1975.0,2000.0,98065.0,47.5721,-122.23,1840.0,7620.0,2014.0,6.0
75%,7308900000.0,645000.0,4.0,2.5,2550.0,10690.5,2.0,8.0,2220.0,550.0,1997.0,2008.0,98117.0,47.6781,-122.125,2370.0,10086.25,2015.0,9.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0,2015.0,12.0


In [15]:
# Summarize the remaining object (categorical) columns
# TODO: change id, zipcode, lat, long to string
df.describe(include = 'object')

Unnamed: 0,waterfront,view,condition
count,21420,21420,21420
unique,2,5,5
top,NO,NONE,Average
freq,21274,19316,13900


In [16]:
# Change categorical values in columns condition, view, and waterfront to integers and map them with a dictionary (maybe)
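
One way this mapping could look is sketched below. The label sets are assumptions: only `NO`, `NONE`, and `Average` appear in the output above, so the remaining labels and the integer codes should be checked against `df[col].unique()` before use. The frame `toy` and the `*_map` dictionaries are illustrative names.

```python
import pandas as pd

# ASSUMED label sets and ordinal codes; verify against the real columns.
condition_map = {"Poor": 1, "Fair": 2, "Average": 3, "Good": 4, "Very Good": 5}
view_map = {"NONE": 0, "FAIR": 1, "AVERAGE": 2, "GOOD": 3, "EXCELLENT": 4}
waterfront_map = {"NO": 0, "YES": 1}

# Toy data standing in for the cleaned dataframe (hypothetical values).
toy = pd.DataFrame({
    "condition": ["Average", "Good", "Very Good"],
    "view": ["NONE", "GOOD", "NONE"],
    "waterfront": ["NO", "YES", "NO"],
})

maps = {"condition": condition_map, "view": view_map, "waterfront": waterfront_map}
for col, mapping in maps.items():
    # Series.map silently yields NaN for unmapped labels, so check first.
    unknown = set(toy[col].dropna()) - set(mapping)
    if unknown:
        raise ValueError(f"{col}: unmapped labels {sorted(unknown)}")
    toy[col] = toy[col].map(mapping)

print(toy)
```

Keeping the dictionaries around makes it possible to translate the integer codes back into labels when presenting results.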