# House prices: preparing the dataset

Kairos (April 2018)


## Description
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

## Data
The dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.

## Challenge
Predict the final price of each home.

## Method
We'll use TensorFlow to develop the project.

## 1. Set Up
In this first cell, we'll load the necessary libraries.

In [1]:
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_io, estimator

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

ModuleNotFoundError: No module named 'tensorflow'

## 2. Load our data set
Next, we'll load our data set and show information about it.

In [None]:
housing_dataframe = pd.read_csv("input/train.csv", sep=",")
housing_dataframe.shape
housing_dataframe.describe()
housing_dataframe.info()
housing_dataframe

## 3. Clean dirty data

### Handle Missing Values
Let's compute the number of missing values and determine how to handle them.


In [None]:
null_counts = housing_dataframe.isnull().sum()
print("Number of null values in each column:\n")
# Series.items() yields (column name, null count) pairs
for name, val in null_counts.items():
    if val > 0:
        print(name, val)


Notice that while most of the columns have no missing values, 18 of them do.
Let's remove columns entirely where more than 1% of the rows for that column contain a null value. In addition, we'll remove the remaining rows containing null values, which means we'll lose a bit of data, but in return keep some extra features to use for prediction.

### Let's remove columns where more than 1% (15) of the rows contain a null value

In [None]:
# Collect the columns with more than 15 null values (about 1% of the rows)
cols = [name for name, val in null_counts.items() if val > 15]

housing_dataframe.drop(cols, inplace=True, axis=1)
housing_dataframe
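
The cut-off of 15 rows is tied to the size of this particular training set. As a minimal sketch, assuming a small made-up frame in place of `housing_dataframe`, the same 1% rule can be written as a fraction so it follows the row count automatically. The names `toy`, `MAX_NULL_FRACTION`, and `cols_to_drop` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for housing_dataframe (values are made up).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],  # 3 of 4 rows missing
    "SalePrice": [208500, 181500, 223500, 140000],
})

MAX_NULL_FRACTION = 0.01  # the 1% rule described above

# isnull().mean() gives the fraction of missing rows for each column.
null_fraction = toy.isnull().mean()
cols_to_drop = null_fraction[null_fraction > MAX_NULL_FRACTION].index.tolist()

toy_reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(toy_reduced.columns.tolist())
```

Expressing the threshold as a fraction keeps the rule unchanged if the number of rows changes, for example after filtering.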

### Let's see the rest of columns with NaN values

In [None]:
null_counts = housing_dataframe.isnull().sum()
print("Number of null values in each column:\n\n")
for name, val in null_counts.items():
    if val > 0:
        print(name, val)


### Let's use the dropna method to remove all rows with a missing value in 'MasVnrType', 'MasVnrArea' or 'Electrical'

In [None]:
housing_dataframe = housing_dataframe.dropna()
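
Called without arguments, `dropna` removes every row that has a missing value in any column, so it can help to compare the row count before and after. The sketch below uses a made-up three-row frame. The column names are borrowed from the dataset, but the values and the `toy` variables are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, one missing MasVnrArea, one missing Electrical.
toy = pd.DataFrame({
    "MasVnrArea": [196.0, np.nan, 162.0],
    "Electrical": ["SBrkr", "SBrkr", np.nan],
    "SalePrice": [208500, 181500, 223500],
})

rows_before = len(toy)
toy_clean = toy.dropna()  # drop rows with a null in any column
rows_after = len(toy_clean)

print("Rows dropped:", rows_before - rows_after)
```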


### Let's check that no missing data remains

In [None]:
# Missing data: total and fraction of null values per column
total = housing_dataframe.isnull().sum().sort_values(ascending=False)
percent = (housing_dataframe.isnull().sum() / housing_dataframe.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(20))

### Let's investigate Categorical Columns
Keep in mind that the goal of this section is for every column to be numeric (int or float data type) and free of missing values. We just dealt with the missing values, so let's now count the columns of the object data type and then convert them to numeric form.

In [None]:
print("Data types and their frequency\n{}".format(housing_dataframe.dtypes.value_counts()))


We have 29 object columns containing text that needs to be converted into numeric features. Let's select just the object columns using the DataFrame method select_dtypes, then display a sample row to get a better sense of how the values in each column are formatted.

In [None]:
object_columns_df = housing_dataframe.select_dtypes(include=['object'])
print(object_columns_df.iloc[0])


These columns seem to represent categorical values.

In [None]:
object_filter_df = housing_dataframe.select_dtypes(include=['object']).copy()
object_filter_df


### Let's convert these columns to values by category
1. Convert character/object columns to integer values.
2. Drop the original object columns from housing_dataframe.
3. Concatenate both dataframes.

In [None]:
# Create a dataframe to hold the encoded values
values_df = object_filter_df.select_dtypes(include=['object']).copy()

char_cols = object_filter_df.dtypes.pipe(lambda x: x[x == 'object']).index

for c in char_cols:
    # Replace each category with an integer code
    values_df[c] = pd.factorize(object_filter_df[c])[0]
    # Drop the original text column
    housing_dataframe.drop(c, inplace=True, axis=1)

# Concatenate both dataframes
housing_dataframe = pd.concat([housing_dataframe, values_df], axis=1)
housing_dataframe
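
To see what `pd.factorize` returns, here is a small sketch on a made-up column. The `street` series and its values are illustrative. The function returns an array of integer codes, assigned in order of first appearance, together with the distinct values those codes refer to.

```python
import pandas as pd

# Made-up categorical column.
street = pd.Series(["Pave", "Grvl", "Pave", "Pave"], name="Street")

# codes: one integer per row; uniques: the distinct values, indexed by code.
codes, uniques = pd.factorize(street)

print(list(codes))
print(list(uniques))
```

The codes identify categories but carry no ordering: a larger code only means the category appeared later in the column.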
    

### Let's check everything is ok
Every column is int or float type in housing_dataframe.

In [None]:
housing_dataframe.info()

### Let's create a dictionary to save the encoding for future use.

In [None]:
char_cols = object_filter_df.dtypes.pipe(lambda x: x[x == 'object']).index
label_mapping = {}

for c in char_cols:
    # factorize returns (codes, uniques); keep the uniques to decode later
    object_filter_df[c], label_mapping[c] = pd.factorize(object_filter_df[c])
print(label_mapping)
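
One possible way to reuse such a dictionary later, sketched on made-up data: the uniques that `pd.factorize` returns for a Series form a pandas `Index`, so `get_indexer` can encode new values with the same codes and `take` can turn codes back into labels. The `train_col` and `new_col` series are hypothetical, and mapping unseen categories to -1 is simply how `get_indexer` behaves, not a choice made in this notebook.

```python
import pandas as pd

# Hypothetical training column and the mapping saved for it.
train_col = pd.Series(["Pave", "Grvl", "Pave"])
codes, uniques = pd.factorize(train_col)
label_mapping = {"Street": uniques}

# Encode new data with the saved mapping; "Dirt" was never seen, so it gets -1.
new_col = pd.Series(["Grvl", "Pave", "Dirt"])
new_codes = label_mapping["Street"].get_indexer(new_col)

# Decode the original codes back to their labels.
decoded = label_mapping["Street"].take(codes)

print(list(new_codes))
print(list(decoded))
```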

## 4. Save cleaned data to CSV
It is good practice to store the final output of each section or stage of your workflow in a separate CSV file. One benefit is that we can change a later step of the data processing flow without having to recalculate everything before it.

In [None]:
housing_dataframe.to_csv("input/cleaned_houses_prices.csv", index=False)
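
A quick way to confirm that the saved file can be picked up in a later stage is to read it back and compare it with the frame in memory. The sketch below does this with a made-up frame and a temporary directory. The file name `cleaned_toy.csv` and the `toy` values are illustrative, not part of the notebook.

```python
import os
import tempfile

import pandas as pd

# Made-up, fully numeric frame standing in for the cleaned data.
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": [0, 1],
    "SalePrice": [208500, 181500],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_toy.csv")
    toy.to_csv(path, index=False)  # index=False avoids an extra index column
    reloaded = pd.read_csv(path)

# equals() compares values, column labels, and dtypes.
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```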