# Course 3: Project - Part 2 - House prices

<a name="top"></a>
This notebook is concerned with Part 2 - House prices.

**Contents:**
* [Imports](#step-0)
* [Data cleaning](#step-1)

## Imports<a name="step-0"></a> ([top](#top))
---

In [None]:
# Standard library:
import collections
import json
import pathlib
import typing

# 3rd party:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

## Data cleaning<a name="step-1"></a> ([top](#top))
---

In [None]:
df = pd.read_csv(pathlib.Path.cwd() / 'house-prices.csv')

We start by taking a look at the data set:

In [None]:
df.shape

In [None]:
with pd.option_context('display.max_columns', None):
    display(df.head())

In [None]:
df.info(verbose=True, show_counts=True)

### Step 1 - Find and handle incorrect and missing values
---

We are told that the data set contains incorrect and missing values. Our plan is:

* **Qualitative variables:** Make sure that they only take valid values. To that end, we have prepared a small JSON document that lists the valid values for each nominal and ordinal variable. (This document was generated by parsing the documentation. The code is in `qualitative_variables.py`.)
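
Judging from the loader `load_qualitative_variables` used further down, the JSON document is a list of objects with `name`, `kind`, and `values` keys. The following is only a sketch of that shape: the entries, the spelling of the `kind` labels, and the name `example_vars` are assumptions for illustration, not the contents of the real file.

```python
import json

# Hypothetical excerpt; the real qualitative_variables.json is not shown here.
EXAMPLE_JSON = """
[
  {"name": "MS Zoning", "kind": "nominal", "values": ["A", "C", "I", "RL"]},
  {"name": "KitchenQual", "kind": "ordinal", "values": ["Ex", "Gd", "TA", "Fa", "Po"]}
]
"""

definitions = json.loads(EXAMPLE_JSON)

# Map each variable name to its kind and its set of valid values,
# mirroring the structure that the loader builds.
example_vars = {
    d["name"]: {"kind": d["kind"], "values": set(d["values"])}
    for d in definitions
}

print(sorted(example_vars))
```

Storing the valid values as a set makes the later edits to a definition (removing or adding a single value) straightforward.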

In [None]:
def count_null(series):
    return series.isna().sum()


def count_invalid(series, valid_values):
    return (~series.isin(valid_values)).sum()
    

def check_quantitative(series):
    print('null: {}'.format(count_null(series)))
    print(series.describe())

In [None]:
ql_fixes = []

#### Qualitative variables

We read the definitions of the qualitative variables:

In [None]:
def load_qualitative_variables(path):
    with open(path, 'r') as f:
        definitions = json.load(f)
        result = collections.OrderedDict()
        for definition in definitions:
            feature = definition['name']
            attrs = {
                'kind': definition['kind'],
                'values': set(definition['values'])
            }
            result[feature] = attrs
        return result

In [None]:
ql_vars = load_qualitative_variables('qualitative_variables.json')

# As per the description of the task, we will not check 'PID':
ql_vars.pop('PID')

print(f'qualitative variables: {len(ql_vars)}')

The 1st thing we notice when we try to check the qualitative variables is that the names of some variables differ between the data set and the documentation. *We decide to align the definitions to match the data set:*

In [None]:
def rename_var(cur_name, new_name):
    attrs = ql_vars.pop(cur_name)
    ql_vars[new_name] = attrs
    
    
# 'Exterior 1' is 'Exterior 1st' in the data set:
rename_var('Exterior 1', 'Exterior 1st')
# 'Exterior 2' is 'Exterior 2nd' in the data set:
rename_var('Exterior 2', 'Exterior 2nd')
# 'BsmtFinType 2' is 'BsmtFin Type 2' in the data set:
rename_var('BsmtFinType 2', 'BsmtFin Type 2')
# 'HeatingQC' is 'Heating QC' in the data set:
rename_var('HeatingQC', 'Heating QC')
# 'KitchenQual' is 'Kitchen Qual' in the data set:
rename_var('KitchenQual', 'Kitchen Qual')
# 'FireplaceQu' is 'Fireplace Qu' in the data set:
rename_var('FireplaceQu', 'Fireplace Qu')

The 2nd thing we notice is that _NA_ in the documentation is represented by `np.nan` in the data set. *We decide to align the definitions to match the data set:*

In [None]:
for feature, attrs in ql_vars.items():
    values = attrs['values']
    if 'NA' in values:
        print(f'fixing: {feature}')
        values.remove('NA')
        values.add(np.nan)
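
Putting `np.nan` into the set of valid values has a useful side effect: for an object-dtype column, `Series.isin` matches missing entries against a `NaN` in the valid values. A missing entry therefore only counts as invalid for variables whose definition does not allow _NA_. A small sketch on a made-up series (the data and the names `toy`, `with_na`, and `without_na` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up column with one missing entry and one unknown label.
toy = pd.Series(['Gd', np.nan, 'Bad'], dtype=object)

with_na = {'Gd', 'TA', np.nan}  # definition that allows NA
without_na = {'Gd', 'TA'}       # definition that does not

# Same expression as in count_invalid.
invalid_with_na = int((~toy.isin(with_na)).sum())
invalid_without_na = int((~toy.isin(without_na)).sum())

print(invalid_with_na, invalid_without_na)
```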

We can now check the qualitative variables and collect the number of null values and the number of invalid values for each of them in a data frame:

In [None]:
data = []
for feature, attrs in ql_vars.items():
    series = df[feature]
    null_count = count_null(series)
    invalid_count = count_invalid(series, attrs['values'])
    data.append((feature, attrs['kind'], null_count, invalid_count))
df_ql = pd.DataFrame(data=data, columns=['feature', 'kind', 'null_count', 'invalid_count'])

Here are the qualitative variables that we need to investigate:

In [None]:
df_ql[df_ql['invalid_count'] > 0]

**MS Zoning:**

We look at the invalid values:

In [None]:
feature = 'MS Zoning'
values = ql_vars[feature]['values']
invalid = df.loc[~df[feature].isin(values), feature]
invalid.unique()

We implement a correction and register it for later use:

In [None]:
def correct(df, ql_vars):
    feature = 'MS Zoning'
    values = ql_vars[feature]['values']
    # We only update invalid values:
    invalid = df.loc[~df[feature].isin(values), feature]
    corrected = invalid.map({'I (all)': 'I', 'C (all)': 'C', 'A (agr)': 'A'})
    df.loc[corrected.index, feature] = corrected
    return df


# Apply the correction:
df = correct(df, ql_vars)
assert count_invalid(df[feature], values) == 0, f"variable: '{feature}' not properly corrected"

# Register the correction:
ql_fixes.append(correct)
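
Registering each correction as a function makes it possible to replay all of them on another frame with the same columns, for instance a freshly loaded copy. The sketch below illustrates the idea, assuming every registered function takes `(df, ql_vars)` and returns the corrected frame, as `correct` does. The toy frame and the names `fix_zoning`, `apply_fixes`, `toy_vars`, and `toy_fixes` are hypothetical.

```python
import pandas as pd


def fix_zoning(df, ql_vars):
    # Same pattern as `correct`: only touch values outside the valid set.
    feature = 'MS Zoning'
    valid = ql_vars[feature]['values']
    invalid = df.loc[~df[feature].isin(valid), feature]
    df.loc[invalid.index, feature] = invalid.map({'C (all)': 'C'})
    return df


def apply_fixes(df, ql_vars, fixes):
    # Work on a copy so that the caller's frame is left untouched.
    df = df.copy()
    for fix in fixes:
        df = fix(df, ql_vars)
    return df


toy_vars = {'MS Zoning': {'kind': 'nominal', 'values': {'C', 'RL'}}}
toy_fixes = [fix_zoning]

raw = pd.DataFrame({'MS Zoning': ['RL', 'C (all)', 'RL']})
clean = apply_fixes(raw, toy_vars, toy_fixes)

print(clean['MS Zoning'].tolist())
```

Because the fixes are applied in registration order, a later correction can rely on the results of an earlier one.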

**Neighborhood (Nominal):**

We look at the invalid values:

In [None]:
feature = 'Neighborhood'
values = ql_vars[feature]['values']
invalid = df.loc[~df[feature].isin(values), feature]
invalid.unique()

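The data set uses `NAmes` where the definition lists `Names`. Given the capitalization of `NWAmes`, `NAmes` looks like the intended spelling, so we decide to align the definition to match the data set: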

In [None]:
feature = 'Neighborhood'
values = ql_vars[feature]['values']
values.remove('Names')
values.add('NAmes')

assert count_invalid(df[feature], values) == 0, f"variable: '{feature}' not properly corrected"