In [None]:
import pandas as pd
from functions import impute_knn

df = pd.read_csv('housing.csv')
df.info()

In [None]:
df.head()

We can see that some attributes have missing values. These have to be filled in to keep the data consistent.
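
As a quick cross-check on the `df.info()` output, the gaps can also be counted directly. The sketch below uses a small made-up frame (the values are placeholders, not taken from `housing.csv`) to show the idea:

```python
import numpy as np
import pandas as pd

# Toy frame with one gap; the numbers are invented for illustration only
toy = pd.DataFrame({
    "total_rooms": [800.0, 1200.0, 1500.0],
    "total_bedrooms": [150.0, np.nan, 310.0],
})

# Number of missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = toy[toy.isna().any(axis=1)]
print(rows_with_gaps)
```

Running the same two calls on `df` would show which columns need imputation.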

In [None]:
# imported helper function that fills in missing data using a KNeighborsRegressor
df = impute_knn(df)
df.info()
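
The body of `impute_knn` lives in the separate `functions` module and is not shown here. As a rough idea of how such a helper might work, the following is a minimal sketch, assuming scikit-learn is available; the name `impute_knn_sketch`, its parameters, and the toy data are my own assumptions and not the actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor


def impute_knn_sketch(df: pd.DataFrame, n_neighbors: int = 5) -> pd.DataFrame:
    """Hypothetical stand-in for impute_knn: fill NaNs in numeric columns."""
    out = df.copy()
    numeric = out.select_dtypes(include="number")
    # Numeric columns without any gaps serve as predictors
    complete_cols = [c for c in numeric.columns if numeric[c].notna().all()]
    for col in numeric.columns:
        missing = numeric[col].isna()
        # Skip columns with nothing to fill or nothing to learn from
        if not missing.any() or missing.all() or not complete_cols:
            continue
        known = numeric.loc[~missing]
        model = KNeighborsRegressor(n_neighbors=min(n_neighbors, len(known)))
        model.fit(known[complete_cols], known[col])
        # Predict the missing entries from the rows' complete columns
        out.loc[missing, col] = model.predict(numeric.loc[missing, complete_cols])
    return out


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "total_rooms": [100.0, 200.0, 300.0, 400.0],
    "total_bedrooms": [20.0, np.nan, 60.0, 80.0],
})
filled = impute_knn_sketch(toy, n_neighbors=2)
print(filled)
```

In this toy case the gap is filled with the mean of its two nearest neighbours by `total_rooms`. A real helper would probably also scale the predictor columns first, since distances are otherwise dominated by the columns with the largest values.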

Now the missing data is filled in. The original data describes housing per district. For my use case I want to look at it per house, which is why I add some fields, rename some columns, and delete the columns that are no longer necessary in the following.

In [None]:
# TODO: turn this into a function in the end
# Derive per-house averages from the district totals and the number of households in the district
df['rooms'] = df['total_rooms'] / df['households']
df['bedrooms'] = df['total_bedrooms'] / df['households']
df['number_of_people'] = df['population'] / df['households']

# Round the average rooms per house to the nearest whole number and store it as an integer
df['rooms'] = df['rooms'].round(0).astype(int)

# Same procedure as with the rooms
df['bedrooms'] = df['bedrooms'].round(0).astype(int)
df['number_of_people'] = df['number_of_people'].round(0).astype(int)

# Rename the remaining columns so that they describe a single house
df.rename(columns={"housing_median_age": "house_age", "median_income": "monthly_income_in_k_USD", "median_house_value": "house_value"}, inplace=True)

# Drop the district-level columns that are no longer necessary
drop_list = ["ocean_proximity", "total_rooms", "total_bedrooms", "households", "population"]
df.drop(columns=drop_list, inplace=True)
df.head()
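
The cell above is marked to become a function later. One possible shape for it is sketched below; the name `district_to_house` and the toy values are hypothetical, and the column names are the ones used in this notebook:

```python
import pandas as pd


def district_to_house(district_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: convert district totals into per-house averages."""
    out = district_df.copy()
    # New per-house column -> district total it is derived from
    per_house = {
        "rooms": "total_rooms",
        "bedrooms": "total_bedrooms",
        "number_of_people": "population",
    }
    for new_col, total_col in per_house.items():
        out[new_col] = (out[total_col] / out["households"]).round(0).astype(int)
    out = out.rename(columns={
        "housing_median_age": "house_age",
        "median_income": "monthly_income_in_k_USD",
        "median_house_value": "house_value",
    })
    return out.drop(columns=["ocean_proximity", "total_rooms", "total_bedrooms",
                             "households", "population"])


# Toy districts, invented for illustration only
toy = pd.DataFrame({
    "housing_median_age": [30.0, 12.0],
    "total_rooms": [600.0, 1000.0],
    "total_bedrooms": [120.0, 300.0],
    "population": [260.0, 700.0],
    "households": [100.0, 250.0],
    "median_income": [3.5, 5.1],
    "median_house_value": [150000.0, 240000.0],
    "ocean_proximity": ["INLAND", "NEAR BAY"],
})
house_df = district_to_house(toy)
print(house_df)
```

Working on a copy instead of using `inplace=True` keeps the input frame untouched, which makes the cell safe to re-run.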

Now I look at the data to find any outliers or other issues that need to be fixed.

In [None]:
df.hist(bins=60, figsize=(15,9))

There seems to be an issue with the house value having a lot of outliers at the maximum. Therefore, I want to get rid of these and plot the histograms again to see if the adjustment fixed the issue.

In [None]:
# changed code from the original idea; same source as the impute_knn function
# drop every row whose house value equals the maximum
maxval = df['house_value'].max()
df = df[df['house_value'] != maxval]
print(df.columns)
df.hist(bins=60, figsize=(15,9))
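
Before dropping rows like this, it can be useful to check how many of them sit exactly at the maximum, since the filter removes all of them at once. A small sketch with made-up values (not taken from the dataset):

```python
import pandas as pd

# Toy house values, invented for illustration; three rows share the maximum
toy = pd.DataFrame({"house_value": [120000, 350000, 500000, 500000, 500000, 90000]})

cap = toy["house_value"].max()
at_cap = int((toy["house_value"] == cap).sum())
share = at_cap / len(toy)
print(f"{at_cap} of {len(toy)} rows ({share:.0%}) sit at the maximum of {cap}")

# Same filter as in the cell above
trimmed = toy[toy["house_value"] != cap]
print(trimmed["house_value"].max())
```

If only a single row matched the maximum, the spike in the histogram would have a different cause and this filter would not remove it.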

In [None]:
# save the data as a CSV to be used in the method for estimating the house price using linear regression
df.to_csv('final_data.csv')
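
One thing to keep in mind when loading this file later: `to_csv` writes the row index as an extra unnamed column by default, and after the filtering above that index has gaps. The sketch below uses an in-memory buffer and toy values (both are my assumptions, chosen to avoid touching files) to show the difference that `index=False` makes:

```python
import io

import pandas as pd

# Toy frame with a gappy index, as left behind by row filtering
toy = pd.DataFrame({"house_value": [120000, 350000], "rooms": [5, 6]}, index=[3, 7])

# Default behaviour: the index is written as the first, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# With index=False only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

Either form works as long as the reading side matches it, for example by passing `index_col=0` to `read_csv` when the index was written.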