# Data Wrangling
The housing price data comes from two different data sets found at:
1. [Washington D.C. Data](https://www.kaggle.com/christophercorrea/dc-residential-properties#DC_Properties.csv)
2. [King County Data](https://www.kaggle.com/harlfoxem/housesalesprediction)

The first step is to load each raw data set and inspect it.

## Table 1: Washington D.C. Housing Data

In [1]:
# Import necessary modules
import pandas as pd
import numpy as np

In [2]:
# Filepath when using pc
dc_data = "/Users/Garrett/Desktop/Springboard/capstone_project_2/data/dc_data.csv"

dc_df = pd.read_csv(dc_data)
dc_df.head()


| Unnamed: 0.1 | Unnamed: 0 | BATHRM | HF_BATHRM | HEAT | AC | NUM_UNITS | ROOMS | BEDRM | AYB | YR_RMDL | ... | LONGITUDE | ASSESSMENT_NBHD | ASSESSMENT_SUBNBHD | CENSUS_TRACT | CENSUS_BLOCK | WARD | SQUARE | X | Y | QUADRANT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 0 | Warm Cool | Y | 2.0 | 8 | 4 | 1910.0 | 1988.0 | ... | -77.040832 | Old City 2 | 040 D Old City 2 | 4201.0 | 004201 2006 | Ward 2 | 152 | -77.040429 | 38.914881 | NW |
| 1 | 1 | 3 | 1 | Warm Cool | Y | 2.0 | 11 | 5 | 1898.0 | 2007.0 | ... | -77.040764 | Old City 2 | 040 D Old City 2 | 4201.0 | 004201 2006 | Ward 2 | 152 | -77.040429 | 38.914881 | NW |
| 2 | 2 | 3 | 1 | Hot Water Rad | Y | 2.0 | 9 | 5 | 1910.0 | 2009.0 | ... | -77.040678 | Old City 2 | 040 D Old City 2 | 4201.0 | 004201 2006 | Ward 2 | 152 | -77.040429 | 38.914881 | NW |
| 3 | 3 | 3 | 1 | Hot Water Rad | Y | 2.0 | 8 | 5 | 1900.0 | 2003.0 | ... | -77.040629 | Old City 2 | 040 D Old City 2 | 4201.0 | 004201 2006 | Ward 2 | 152 | -77.040429 | 38.914881 | NW |
| 4 | 4 | 2 | 1 | Warm Cool | Y | 1.0 | 11 | 3 | 1913.0 | 2012.0 | ... | -77.039361 | Old City 2 | 040 D Old City 2 | 4201.0 | 004201 2006 | Ward 2 | 152 | -77.040429 | 38.914881 | NW |


## Table 2: King County Housing Data

In [3]:
# Filepath when using pc
kc_data = "/Users/Garrett/Desktop/Springboard/capstone_project_2/data/kc_data.csv"

kc_df = pd.read_csv(kc_data)
kc_df.head()

| Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


# Feature Selection
The two data sets share several features. I want to examine how these shared features contribute to the price of a housing unit in each geographic area. Focusing on the shared features, rather than the ones unique to each data set, makes it clearer how each one affects price. I will examine the following housing features:

1. Price
2. Sale Date
3. #Bathrooms
4. #Bedrooms
5. Living SQFT
6. Lot SQFT
7. Stories
8. Condition
9. Grade
10. Year Built
11. Year Remodeled
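
The shared features go by different column names in each data set. As a minimal sketch, the pairing can be written down explicitly as a dictionary and applied with `rename`. The column names are the ones used in this notebook; the `DC_TO_KC` name and the toy values are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical lookup from D.C. column names to their King County equivalents
DC_TO_KC = {
    'PRICE': 'price',
    'SALEDATE': 'date',
    'BATHRM': 'bathrooms',
    'BEDRM': 'bedrooms',
    'GBA': 'sqft_living',
    'LANDAREA': 'sqft_lot',
    'STORIES': 'floors',
    'CNDTN': 'condition',
    'GRADE': 'grade',
    'AYB': 'yr_built',
    'YR_RMDL': 'yr_renovated',
}

# Toy frame with made-up values, standing in for the real D.C. data
toy_dc = pd.DataFrame({'PRICE': [500000.0], 'BEDRM': [3], 'GBA': [1500.0]})

# rename() matches by name, so column order does not matter
toy_renamed = toy_dc.rename(columns=DC_TO_KC)
print(list(toy_renamed.columns))
```

Matching by name rather than by position means the pairing still holds if a column is dropped or reordered along the way.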

In [4]:
# Define the columns from each data frame
dc_cols = ['PRICE', 'SALEDATE', 'BATHRM', 'HF_BATHRM', 'BEDRM', 'GBA', 
           'LANDAREA', 'STORIES', 'CNDTN', 'GRADE', 'AYB', 'YR_RMDL']
kc_cols = ['price', 'date', 'bathrooms', 'bedrooms', 'sqft_living', 
           'sqft_lot', 'floors', 'condition', 'grade', 'yr_built', 'yr_renovated']

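# Copy the selections so later column assignments do not act on a view of the original frames
dc_df = dc_df[dc_cols].copy()
kc_df = kc_df[kc_cols].copy()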

In [5]:
# Combine full and half bathrooms into one column for D.C. (a half bathroom counts as 0.5)
dc_df['BATHRM'] = dc_df['BATHRM'] + 0.5 * dc_df['HF_BATHRM']
dc_df = dc_df.drop(columns='HF_BATHRM')

In [6]:
# Add location tag to each data frame
dc_df['location'] = 'DC'
kc_df['location'] = 'KC'
dc_cols.append('location')
kc_cols.append('location')

# Filtering the Data to Keep Data Frames Consistent

One additional modification concerns the sale dates. The King County data set only contains sales between May 2014 and May 2015. To keep the two data sets consistent, I will filter the Washington D.C. housing sales data to the same window (May 2014 - May 2015) so that the predicted housing price model is not skewed.

Beyond selecting the columns the two data sets have in common, I also want each pair of corresponding columns to have the same data type before merging the two data frames. Missing values need to be handled before the merge as well, so I am going to remove rows where the price of the sold housing unit is missing.
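
A quick way to check both points is to line up the dtypes of the shared columns and count the missing prices. The sketch below uses two toy frames with made-up values; the `compare_dtypes` helper is a hypothetical name introduced only for illustration.

```python
import pandas as pd

def compare_dtypes(left, right):
    """Return a table of dtypes for the columns the two frames share."""
    shared = [col for col in left.columns if col in right.columns]
    return pd.DataFrame({
        'left': left[shared].dtypes.astype(str),
        'right': right[shared].dtypes.astype(str),
    })

# Toy frames with made-up values, standing in for the two housing data sets
toy_a = pd.DataFrame({'price': [1.0, None], 'bedrooms': [3, 2]})
toy_b = pd.DataFrame({'price': [2.0, 3.0], 'bedrooms': [3.0, 4.0]})

report = compare_dtypes(toy_a, toy_b)

# Columns whose dtypes disagree need a conversion before the merge
mismatched = report[report['left'] != report['right']]

# Number of rows that would be dropped for a missing price
missing_prices = toy_a['price'].isna().sum()

print(mismatched)
print(missing_prices)
```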

In [7]:
# Rename D.C. data frame columns to match K.C. column names
# (assigned by position, so both frames must list the shared features in the same order)
dc_df.columns = kc_cols

# Filter D.C. data to May 2014 - May 2015 rows and remove rows where the 'price' is missing.
# 'date' is still stored as text at this point, so these are string comparisons.
dc_df = dc_df.loc[(dc_df.date >= '2014-05-01') & (dc_df.date <= '2015-05-31')].reset_index(drop=True)
dc_df = dc_df.dropna(subset=['price'])
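
The filter in the cell above compares `date` as text, which orders ISO-formatted strings correctly but makes the edges of the window depend on the exact string format. As a hedged alternative, the sketch below parses the dates first and then filters with `Series.between`, which includes both endpoints by default. The toy rows are made up for illustration and the result is not the count reported in this notebook.

```python
import pandas as pd

# Toy sales with made-up values; two fall outside the window, one has no price
toy = pd.DataFrame({
    'date': ['2014-04-30 00:00:00', '2014-05-01 00:00:00', '2014-12-15 00:00:00',
             '2015-05-31 00:00:00', '2015-06-01 00:00:00'],
    'price': [100.0, 200.0, None, 300.0, 400.0],
})

# Parse the text column into datetimes before comparing
toy['date'] = pd.to_datetime(toy['date'])

# between() is inclusive on both ends by default
in_window = toy['date'].between('2014-05-01', '2015-05-31')

# Keep rows inside the window, drop missing prices, then renumber the index
filtered = toy.loc[in_window].dropna(subset=['price']).reset_index(drop=True)
print(filtered)
```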

In [8]:
len(dc_df)

7160

In [9]:
# Convert date columns to Datetime
dc_df.date = pd.to_datetime(dc_df.date)
kc_df.date = pd.to_datetime(kc_df.date)

# Convert sqft_living in K.C. data frame to float
kc_df.sqft_living = kc_df.sqft_living.astype('float')

# Convert yr_built in D.C. data frame to int
dc_df.yr_built = dc_df.yr_built.astype('int')

In [10]:
dc_df.head()

| Unnamed: 0 | price | date | bathrooms | bedrooms | sqft_living | sqft_lot | floors | condition | grade | yr_built | yr_renovated | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 993500.0 | 2014-10-08 | 5.0 | 3 | 1148.0 | 814 | 2.0 | Very Good | Average | 1907 | 2014.0 | DC |
| 2 | 1280000.0 | 2014-08-19 | 2.5 | 3 | 1630.0 | 1000 | 2.0 | Good | Good Quality | 1906 | 2004.0 | DC |
| 4 | 1440000.0 | 2015-04-22 | 3.5 | 4 | 1686.0 | 1424 | 2.0 | Very Good | Above Average | 1908 | 2015.0 | DC |
| 5 | 1050000.0 | 2014-12-23 | 2.0 | 2 | 1440.0 | 1800 | 2.0 | Average | Above Average | 1885 | 1984.0 | DC |
| 8 | 900000.0 | 2014-06-05 | 1.5 | 2 | 1728.0 | 900 | 3.0 | Good | Average | 1880 | 2003.0 | DC |


In [11]:
kc_df.head()

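| Unnamed: 0 | price | date | bathrooms | bedrooms | sqft_living | sqft_lot | floors | condition | grade | yr_built | yr_renovated | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900.0 | 2014-10-13 | 1.0 | 3 | 1180.0 | 5650 | 1.0 | 3 | 7 | 1955 | 0 | KC |
| 1 | 538000.0 | 2014-12-09 | 2.25 | 3 | 2570.0 | 7242 | 2.0 | 3 | 7 | 1951 | 1991 | KC |
| 2 | 180000.0 | 2015-02-25 | 1.0 | 2 | 770.0 | 10000 | 1.0 | 3 | 6 | 1933 | 0 | KC |
| 3 | 604000.0 | 2014-12-09 | 3.0 | 4 | 1960.0 | 5000 | 1.0 | 5 | 7 | 1965 | 0 | KC |
| 4 | 510000.0 | 2015-02-18 | 2.0 | 3 | 1680.0 | 8080 | 1.0 | 3 | 8 | 1987 | 0 | KC |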


In [12]:
# Replace K.C. 'condition' and 'grade' numerical values with the categorical labels found in the D.C. data
condition_labels = {1: 'Poor', 2: 'Average', 3: 'Good', 4: 'Very Good', 5: 'Excellent'}
grade_labels = {1: 'Fair Quality', 3: 'Average', 4: 'Above Average', 5: 'Good Quality',
                6: 'Very Good', 7: 'Excellent', 8: 'Superior', 9: 'Exceptional-A',
                10: 'Exceptional-B', 11: 'Exceptional-C', 12: 'Exceptional-D', 13: 'Exceptional-D'}

# Assign the result back instead of using inplace=True on a column, which may not update the frame
kc_df['condition'] = kc_df['condition'].replace(condition_labels)
kc_df['grade'] = kc_df['grade'].replace(grade_labels)
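
The grade mapping above has no entry for code 2, and `replace` leaves any unlisted code untouched, so a stray numeric value could end up mixed in with the text labels. A minimal sketch of a check for such codes follows. The labels are copied from the cell above; the `GRADE_LABELS` name and the toy values are illustrative assumptions.

```python
import pandas as pd

# Grade labels copied from the cell above (code 2 has no label)
GRADE_LABELS = {
    1: 'Fair Quality', 3: 'Average', 4: 'Above Average', 5: 'Good Quality',
    6: 'Very Good', 7: 'Excellent', 8: 'Superior', 9: 'Exceptional-A',
    10: 'Exceptional-B', 11: 'Exceptional-C', 12: 'Exceptional-D', 13: 'Exceptional-D',
}

# Toy grades with made-up values, including one unlabelled code
toy_kc = pd.DataFrame({'grade': [7, 2, 13]})

# Codes present in the data that the mapping does not cover
unmapped = sorted(set(toy_kc['grade'].unique().tolist()) - set(GRADE_LABELS))

# Unlike replace(), map() turns uncovered codes into NaN, which makes them easy to spot
mapped = toy_kc['grade'].map(GRADE_LABELS)

print(unmapped)
print(mapped)
```

Whether an uncovered code should be dropped, relabelled, or kept is a modelling decision; the check only makes the gap visible.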

In [36]:
type(dc_df.bathrooms.iloc[0])

numpy.float64