In [None]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.read_csv('data/kc_house_data.csv')

We examined the first few rows of the data, the data types, and the null values.

In [None]:
df.head()

In [None]:
df.info()

There are 6 columns that contain object data types. If kept, these will need to be converted in order to apply linear regression models.

In [None]:
df.isna().sum()

The view, waterfront, and yr_renovated columns all contain null values. We will inspect each one individually to determine whether its nulls should be dropped or recategorized.

In [None]:
df['waterfront'].value_counts()

Several rows contain null values. To preserve data, we decided to recategorize null as "NO".

In [None]:
df["waterfront"] = df["waterfront"].fillna("NO")

In [None]:
df['view'].value_counts()

After reading the description of view and seeing the large number of houses with no view, we decided not to use this in our analysis and dropped it from the dataframe.

In [None]:
df = df.drop('view', axis=1)

In [None]:
df['yr_renovated'].value_counts()

A value of 0 indicates that a house has not been renovated, and some houses had an unknown (null) renovation status. Renovated and non-renovated houses had different average prices, so we checked whether houses with unknown status behaved more like renovated or non-renovated houses. They looked very similar to non-renovated houses, so we set the unknown values to 0.

In [None]:
df["yr_renovated"] = df["yr_renovated"].fillna(0)
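One way such a comparison could be sketched is shown below. This is a minimal illustration on made-up toy data, not the analysis that was actually run; the `reno_status` helper column and all values are assumptions introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; all values are invented.
toy = pd.DataFrame({
    "price": [300_000, 320_000, 310_000, 500_000, 520_000, 305_000, 315_000],
    "yr_renovated": [0, 0, 0, 1995, 2005, np.nan, np.nan],
})

# Label each row as unknown, not renovated, or renovated
# (reno_status is a hypothetical helper column).
toy["reno_status"] = np.select(
    [toy["yr_renovated"].isna(), toy["yr_renovated"] == 0],
    ["unknown", "not renovated"],
    default="renovated",
)

# Compare the average price of the three groups.
mean_price = toy.groupby("reno_status")["price"].mean()
print(mean_price)
```

If the "unknown" group's average sits close to the "not renovated" group's, filling the nulls with 0 is a reasonable simplification.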

In [None]:
df['sqft_basement'].value_counts()

We saw that '?' appears as a value. It was not explained and added no information, so we decided to add a column that calculates the basement square footage by subtracting the above-ground square footage from the total living square footage.

In [None]:
df['sqft_basment_calc'] = df['sqft_living'] - df['sqft_above']
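As a hedged sanity check, the derived value can be compared against the reported one wherever the reported one is usable. The sketch below uses invented toy rows and assumes that living area equals above-ground area plus basement area, which is what the calculation above implies.

```python
import pandas as pd

# Toy rows; the values are invented for illustration.
toy = pd.DataFrame({
    "sqft_living": [2000, 1500, 1800],
    "sqft_above": [1500, 1500, 1200],
    "sqft_basement": ["500.0", "?", "600.0"],
})

# Coerce the string column to numbers; '?' becomes NaN.
reported = pd.to_numeric(toy["sqft_basement"], errors="coerce")

# Basement area derived from the two complete columns.
derived = toy["sqft_living"] - toy["sqft_above"]

# Where a value was reported, check that it matches the derived one.
known = reported.notna()
agrees = (reported[known] == derived[known]).all()
print(agrees)
```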

In [None]:
# Keep only houses with fewer than 11 bedrooms
df = df.loc[df['bedrooms'] < 11]

In [None]:
df.head()

To focus on the features that can increase the value of a house, we decided to drop the following columns, which are used for data entry purposes only:

id

date

For the reasons explained above, the following columns are dropped as well (view was already removed earlier):

sqft_basement
view

In [None]:
# Drop the data-entry columns and the original basement column in one call
df = df.drop(columns=['id', 'date', 'sqft_basement'])

We used OrdinalEncoder to transform the categorical string data in grade, condition, and waterfront into numerical data so that we could run multivariate linear regressions. We chose OrdinalEncoder rather than OneHotEncoder because the values in grade, condition, and waterfront indicate differences in quality that scale from worst to best, rather than unordered categories.
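The difference between the two encoders can be sketched on a toy column. The data below is made up; the category order matches the condition scale used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column; the values are invented for illustration.
toy = pd.DataFrame({"condition": ["Fair", "Very Good", "Poor", "Average"]})
order = [["Poor", "Fair", "Average", "Good", "Very Good"]]

# Ordinal: one column whose value is the category's position in `order`.
ordinal = OrdinalEncoder(categories=order).fit_transform(toy)

# One-hot: one 0/1 column per category, with no ordering implied.
one_hot = OneHotEncoder(categories=order).fit_transform(toy).toarray()

print(ordinal.ravel())
print(one_hot.shape)
```

The ordinal version keeps the worst-to-best ranking in a single feature, while the one-hot version spreads it across five columns that the regression treats as unrelated.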

In [None]:
grades = df[['grade']]
categories = [['3 Poor', '4 Low', '5 Fair', '6 Low Average', '7 Average', '8 Good', '9 Better', '10 Very Good', '11 Excellent', '12 Luxury', '13 Mansion']]
ords = OrdinalEncoder(categories=categories)
ords.fit(grades)
ords.transform(grades)

In [None]:
grades_encoded = pd.DataFrame(
    ords.transform(grades),
    index=df.index
)

In [None]:
df.drop('grade', axis=1, inplace=True)

In [None]:
df = pd.concat([df, grades_encoded], axis=1)

In [None]:
df['grades'] = df[0]

In [None]:
df.drop(0, axis=1, inplace=True)

In [None]:
waterfront_e = df[['waterfront']]
categories1 = [['NO', 'YES']]
ords1 = OrdinalEncoder(categories=categories1)
ords1.fit(waterfront_e)
ords1.transform(waterfront_e)

In [None]:
waterfront_encoded = pd.DataFrame(
    ords1.transform(waterfront_e),
    index=df.index
)

In [None]:
df.drop('waterfront', axis=1, inplace=True)
df = pd.concat([df, waterfront_encoded], axis=1)

In [None]:
df['waterfront'] = df[0]
df.drop(0, axis=1, inplace=True)

In [None]:
conditions_e = df[['condition']]
categories2 = [['Poor', 'Fair', 'Average', 'Good', 'Very Good']]
ords2 = OrdinalEncoder(categories=categories2)
ords2.fit(conditions_e)
ords2.transform(conditions_e)

In [None]:
conditions_encoded = pd.DataFrame(
    ords2.transform(conditions_e),
    index=df.index
)


In [None]:
df.drop('condition', axis=1, inplace=True)
df = pd.concat([df, conditions_encoded], axis=1)

In [None]:
df['condition'] = df[0]
df.drop(0, axis=1, inplace=True)
df.info()
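The drop, concat, and rename steps used for each column above could also be collapsed into a single pass. The following is a sketch of that alternative on toy data, not the approach taken in this notebook. It relies on OrdinalEncoder accepting one category list per input column, in column order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "waterfront": ["NO", "YES", "NO"],
    "condition": ["Average", "Poor", "Very Good"],
    "price": [300_000, 450_000, 520_000],
})

cols = ["waterfront", "condition"]

# One category list per column, in the same order as `cols`.
encoder = OrdinalEncoder(categories=[
    ["NO", "YES"],
    ["Poor", "Fair", "Average", "Good", "Very Good"],
])

# Encode both columns at once and write the result back under the same names.
encoded = pd.DataFrame(encoder.fit_transform(toy[cols]), columns=cols, index=toy.index)
toy[cols] = encoded
print(toy)
```

Writing the result back under the original column names avoids the temporary column `0` and keeps the column order unchanged.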

Now all data is numeric and ready for EDA.

In [None]:
df.info()
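Besides reading the `df.info()` output, the "all numeric" claim can be checked programmatically. The sketch below runs on a made-up toy frame; the same two lines could be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the cleaned data; the values are invented.
toy = pd.DataFrame({
    "price": [300_000.0, 450_000.0],
    "grades": [4.0, 6.0],
    "waterfront": [0.0, 1.0],
})

# Columns whose dtype is not numeric (expected to be an empty list).
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Total number of remaining null values across all columns.
remaining_nulls = int(toy.isna().sum().sum())

print(non_numeric, remaining_nulls)
```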

In [None]:
df.to_csv('cleaned_kc_house_data.csv', index=False)