# Course 3: Project - Part 2 - House prices

<a name="top"></a>
This notebook is concerned with Part 2 - House prices.

**Contents:**
* [Imports](#imports)
* [Data cleaning](#stage-1)
  * [Step 1 - Find and handle incorrect and missing values](#stage-1-step-1)
  * [Step 2 - Correct inconsistencies](#stage-1-step-2)
  * [Step 3 - Handle outliers](#stage-1-step-3)

## Imports<a name="imports"></a> ([top](#top))
---

In [1]:
# Standard library:
import collections
import json
import pathlib
import typing as T

# 3rd party:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

In [2]:
pd.__version__

'0.24.1'

## Data cleaning<a name="stage-1"></a> ([top](#top))
---

In [48]:
df = pd.read_csv(pathlib.Path.cwd() / 'house-prices.csv')

We start by taking a look at the data set:

In [4]:
df.shape

(2430, 82)

In [5]:
with pd.option_context('display.max_columns', None):
    display(df.head())

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,484,528275070,60,RL,,8795,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,TA,No,GLQ,300.0,Unf,0.0,652.0,952.0,GasA,Ex,Y,SBrkr,980,1276,0,2256,0.0,0.0,2,1,4,1,Gd,8,Typ,1,TA,BuiltIn,2000.0,Fin,2.0,554.0,TA,TA,Y,224,54,0,0,0,0,,,,0,4,2009,WD,Normal,236000
1,2586,535305120,20,RL,75.0,10170,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1951,1951,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,522.0,TA,TA,CBlock,TA,TA,No,Unf,0.0,Unf,0.0,216.0,216.0,GasA,TA,Y,SBrkr,1575,0,0,1575,0.0,0.0,1,1,2,1,Gd,5,Typ,1,Gd,Attchd,1951.0,Unf,2.0,400.0,TA,TA,Y,0,0,0,0,0,0,,,,0,6,2006,WD,Normal,155000
2,2289,923228250,160,RM,21.0,2001,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,Twnhs,2Story,4,5,1970,1970,Gable,CompShg,CemntBd,CmentBd,BrkFace,80.0,TA,TA,CBlock,TA,TA,No,Unf,0.0,Unf,0.0,546.0,546.0,GasA,Fa,Y,SBrkr,546,546,0,1092,0.0,0.0,1,1,3,1,TA,6,Typ,0,,Attchd,1970.0,Unf,1.0,286.0,TA,TA,Y,0,0,0,0,0,0,,,,0,1,2007,WD,Normal,75000
3,142,535152150,20,RL,70.0,10552,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,5,1959,1959,Hip,CompShg,BrkFace,BrkFace,,0.0,TA,TA,CBlock,TA,TA,No,Rec,1018.0,Unf,0.0,380.0,1398.0,GasA,Gd,Y,SBrkr,1700,0,0,1700,0.0,1.0,1,1,4,1,Gd,6,Typ,1,Gd,Attchd,1959.0,RFn,2.0,447.0,TA,TA,Y,0,38,0,0,0,0,,,,0,4,2010,WD,Normal,165500
4,2042,903475060,190,RM,60.0,10120,Pave,,IR1,Bnk,AllPub,Inside,Gtl,OldTown,Feedr,Norm,2fmCon,2.5Unf,7,4,1910,1950,Hip,CompShg,Wd Sdng,Wd Sdng,,0.0,Fa,TA,CBlock,TA,TA,No,Unf,0.0,Unf,0.0,925.0,925.0,GasA,TA,N,FuseF,964,925,0,1889,0.0,0.0,1,1,4,2,TA,9,Typ,1,Gd,Detchd,1960.0,Unf,1.0,308.0,TA,TA,N,0,0,264,0,0,0,,MnPrv,,0,1,2007,WD,Normal,122000


In [6]:
df.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2430 entries, 0 to 2429
Data columns (total 82 columns):
Order              2430 non-null int64
PID                2430 non-null int64
MS SubClass        2430 non-null int64
MS Zoning          2430 non-null object
Lot Frontage       2010 non-null float64
Lot Area           2430 non-null int64
Street             2430 non-null object
Alley              163 non-null object
Lot Shape          2430 non-null object
Land Contour       2430 non-null object
Utilities          2430 non-null object
Lot Config         2430 non-null object
Land Slope         2430 non-null object
Neighborhood       2430 non-null object
Condition 1        2430 non-null object
Condition 2        2430 non-null object
Bldg Type          2430 non-null object
House Style        2430 non-null object
Overall Qual       2430 non-null int64
Overall Cond       2430 non-null int64
Year Built         2430 non-null int64
Year Remod/Add     2430 non-null int64
Roof Style         24

In order to make it easier to work with all the variables, we have prepared a little JSON document that, for each variable, lists its name and its kind (nominal, ordinal, discrete, continuous). For qualitative variables, it also lists the valid values. (This document was generated by parsing the documentation. The code is in `parse_qualitative_variables.py`.)

In [7]:
Attrs = T.Dict[str, T.Any]


def load_variables(path: str) -> T.Dict[str, Attrs]:
    
    def make_pair(definition):
        name = definition['name']
        attrs = {
            'kind': definition['kind']
        }
        if 'values' in definition:
            attrs['values'] = set(definition['values'])
        return (name, attrs)
    
    with open(path, 'r') as f:
        definitions = json.load(f)
        pairs = [make_pair(definition) for definition in definitions]
        return collections.OrderedDict(pairs)       

    
def is_qualitative(attrs):
    return attrs['kind'] in ['Nominal', 'Ordinal']


def is_quantitative(attrs):
    return attrs['kind'] in ['Discrete', 'Continuous']


variables = load_variables('variables.json')
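
For reference, here is a minimal sketch of the structure that `load_variables` expects. The keys `name`, `kind` and `values` are the ones read by the loader above; the two sample entries are illustrative assumptions and not an excerpt of the real `variables.json`.

```python
import json

# Hypothetical excerpt: the real file is generated from the documentation
# and is not reproduced in this notebook.
sample = '''
[
  {"name": "Street", "kind": "Nominal", "values": ["Grvl", "Pave"]},
  {"name": "Lot Area", "kind": "Continuous"}
]
'''

sample_variables = {}
for definition in json.loads(sample):
    attrs = {'kind': definition['kind']}
    # Only qualitative variables list their valid values:
    if 'values' in definition:
        attrs['values'] = set(definition['values'])
    sample_variables[definition['name']] = attrs

print(sample_variables)
```

Storing the valid values as a set makes the membership checks used later independent of their order in the file.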

We drop `Order` and `PID` and quickly check the counts:

In [8]:
variables.pop('Order')
variables.pop('PID')

kinds = ['Nominal', 'Ordinal', 'Discrete', 'Continuous']
for kind in kinds:
    count = sum(1 if attrs['kind'] == kind else 0 for attrs in variables.values())
    print(f'variables of kind {kind.lower()}: {count}')

variables of kind nominal: 23
variables of kind ordinal: 22
variables of kind discrete: 14
variables of kind continuous: 19


### Step 1 - Find and handle incorrect and missing values<a name="stage-1-step-1"></a> ([top](#top))
---

We are told that the data set contains incorrect and missing values. Our plan is:

* **Qualitative variables:** We want to make sure that qualitative variables take only valid values.
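* **Quantitative variables:** We want to find null values and implausible values (for example, negative values) and decide case by case how to handle them.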

In [9]:
def count_null(series: pd.Series) -> int:
    return series.isna().sum()


def count_invalid(series: pd.Series, valid_values: T.Set[str]) -> int:
    return (~series.isin(valid_values)).sum()


def get_invalid(df: pd.DataFrame,
                variables: T.Dict[str, Attrs],
                name: str) -> np.ndarray:
    attrs = variables[name]
    values = attrs['values']
    invalid = df.loc[~df[name].isin(values), name]
    # `unique()` returns a NumPy array, not a series:
    return invalid.unique()


def replace_values(df: pd.DataFrame,
                   variables: T.Dict[str, Attrs],
                   name: str,
                   replacements: T.Dict[str, str]) -> pd.DataFrame:
    attrs = variables[name]
    values = attrs['values']
    # We only update invalid values:
    invalid = df.loc[~df[name].isin(values), name]
    corrected = invalid.map(replacements)
    df.loc[corrected.index, name] = corrected
    return df

We will collect the required corrections as we go so that we can apply them, in order, to any new data set:

In [10]:
corrections = []
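
A minimal sketch of how such a list could be replayed on another data set. The helper `apply_corrections` and the toy correction `strip_street` are hypothetical names introduced for illustration; the only assumption is that every registered function takes `(df, variables)` and returns the data-frame.

```python
import pandas as pd


def apply_corrections(df, variables, corrections):
    """Apply each registered correction in order (hypothetical helper)."""
    df = df.copy()  # leave the caller's data-frame untouched
    for correct in corrections:
        df = correct(df, variables)
    return df


# Toy correction with the same signature as the ones registered in this notebook:
def strip_street(df, variables):
    df['Street'] = df['Street'].str.strip()
    return df


toy = pd.DataFrame({'Street': ['Pave ', 'Grvl']})
cleaned = apply_corrections(toy, variables={}, corrections=[strip_street])
print(cleaned['Street'].tolist())
```

Because the corrections run in list order, a correction that must run first has to be inserted at the front of the list.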

#### Qualitative variables
---

The first thing we noticed when checking the qualitative variables is that the names of some variables differ between the data set and the documentation. **We decide to align the definitions to match the data set.**

In [11]:
replacements = {
    # 'Exterior 1' is 'Exterior 1st' in the data set:
    'Exterior 1': 'Exterior 1st',
    # 'Exterior 2' is 'Exterior 2nd' in the data set:
    'Exterior 2': 'Exterior 2nd',
    # 'BsmtFinType 2' is 'BsmtFin Type 2' in the data set:
    'BsmtFinType 2': 'BsmtFin Type 2',
    # 'HeatingQC' is 'Heating QC' in the data set:
    'HeatingQC': 'Heating QC',
    # 'KitchenQual' is 'Kitchen Qual' in the data set:
    'KitchenQual': 'Kitchen Qual',
    # 'FireplaceQu' is 'Fireplace Qu' in the data set:
    'FireplaceQu': 'Fireplace Qu'
}

# We preserve the order:
pairs = [(replacements.get(name, name), attrs) for name, attrs in variables.items()]
variables = collections.OrderedDict(pairs)

The second thing we noticed is that the value _NA_ in the documentation is represented by `np.nan` in the data set. **We decide to align the data set to match the definitions.**

In [12]:
def correct(df, variables):
    for name, attrs in variables.items():
        if not is_qualitative(attrs):
            continue
        if 'NA' in attrs['values']:
            df[name] = df[name].fillna('NA')
    return df


# Register:
corrections.append(correct)

# Apply:
df = correct(df, variables)
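
One plausible cause of this mismatch, which we have not verified for this file, is that `pd.read_csv` parses the string `NA` as a missing value by default. The sketch below shows that behaviour on a small in-memory CSV (made-up rows) together with the `keep_default_na` parameter that switches it off.

```python
import io

import pandas as pd

# Made-up sample, not taken from house-prices.csv:
csv_text = 'Alley,Lot Area\nNA,8795\nGrvl,10170\n'

default = pd.read_csv(io.StringIO(csv_text))
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default['Alley'].isna().sum())  # number of values parsed as missing
print(literal['Alley'].tolist())      # values kept as plain strings
```

With `keep_default_na=False`, genuinely empty fields are no longer parsed as missing either, so filling null values per variable, as done above, remains the more targeted option.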

We check qualitative variables and build a data-frame with the number of null values and the number of invalid values:

In [13]:
data = []
for name, attrs in variables.items():
    if not is_qualitative(attrs):
        continue
    series = df[name]
    null_count = count_null(series)
    invalid_count = count_invalid(series, attrs['values'])
    data.append((name, attrs['kind'], null_count, invalid_count))
df_ql = pd.DataFrame(data=data, columns=['feature', 'kind', 'null_count', 'invalid_count'])

The qualitative variables that we need to investigate are:

In [14]:
df_ql[(df_ql['null_count'] > 0) | (df_ql['invalid_count'] > 0)]

Unnamed: 0,feature,kind,null_count,invalid_count
1,MS Zoning,Nominal,0,21
9,Neighborhood,Nominal,0,361
12,Bldg Type,Nominal,0,230
19,Exterior 2nd,Nominal,0,181
20,Mas Vnr Type,Nominal,20,20
32,Electrical,Ordinal,1,1
43,Sale Type,Nominal,0,2116
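
Note that `count_invalid` also counts null values, because `np.nan` is never a member of the set of valid values. This is why `Mas Vnr Type` and `Electrical` show the same number in both columns. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values: one null and one misspelled entry.
toy = pd.Series(['BrkFace', np.nan, 'None', 'Brick'])
valid_values = {'BrkFace', 'None', 'Stone'}

null_count = toy.isna().sum()
invalid_count = (~toy.isin(valid_values)).sum()
print(null_count, invalid_count)
```

Here the invalid count covers both the null entry and `Brick`, so it is always at least as large as the null count.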


**`MS Zoning`:**

We take a look at the invalid values:

In [15]:
name = 'MS Zoning'
values = variables[name]['values']
get_invalid(df, variables, name)

array(['I (all)', 'C (all)', 'A (agr)'], dtype=object)

We need to map invalid values to valid ones. **We decide to correct the data set.**

In [16]:
def correct(df, variables, name=name):
    replacements = {'I (all)': 'I', 'C (all)': 'C', 'A (agr)': 'A'}
    return replace_values(df, variables, name, replacements)


# Register:
corrections.append(correct)

# Apply & check:
df = correct(df, variables)
assert count_invalid(df[name], values) == 0

**`Neighborhood`:**

We take a look at the invalid values:

In [17]:
name = 'Neighborhood'
values = variables[name]['values']
get_invalid(df, variables, name)

array(['NAmes'], dtype=object)

Looking at the other values, we see that `Northwest Ames` is abbreviated as `NWAmes`. It therefore makes sense for `North Ames` to be abbreviated as `NAmes` rather than `Names`. **We decide to align the definition to match the data set.**

In [18]:
# Align & check:
values.remove('Names')
values.add('NAmes')
assert count_invalid(df[name], values) == 0

**`Bldg Type`:**

We take a look at the invalid values:

In [19]:
name = 'Bldg Type'
values = variables[name]['values']
get_invalid(df, variables, name)

array(['Twnhs', '2fmCon', 'Duplex'], dtype=object)

In [20]:
df[name].value_counts()

1Fam      2016
TwnhsE     184
Twnhs       88
Duplex      86
2fmCon      56
Name: Bldg Type, dtype: int64

The value `Duplex` used in the data set makes more sense than `Duplx` in the documentation. **We decide to align the definition to match the data set.** The remaining invalid values need to be mapped to valid ones: `2fmCon` differs from `2FmCon` only in capitalization and, since `TwnhsI` is completely absent from the data set, we assume that `Twnhs` should be mapped to `TwnhsI`. **We decide to correct the data set.**

In [21]:
# Align:
values.remove('Duplx')
values.add('Duplex')


def correct(df, variables, name=name):
    replacements = {'Twnhs': 'TwnhsI', '2fmCon': '2FmCon'}
    return replace_values(df, variables, name, replacements)


# Register:
corrections.append(correct)

# Apply & check:
df = correct(df, variables)
assert count_invalid(df[name], values) == 0

**`Exterior 2nd`:**

We take a look at the invalid values:

In [22]:
name = 'Exterior 2nd'
values = variables[name]['values']
get_invalid(df, variables, name)

array(['CmentBd', 'Wd Shng', 'Brk Cmn'], dtype=object)

In [23]:
df[name].value_counts()

VinylSd    850
MetalSd    366
HdBoard    342
Wd Sdng    327
Plywood    228
CmentBd     94
Wd Shng     67
BrkFace     43
Stucco      37
AsbShng     32
Brk Cmn     20
ImStucc     12
Stone        5
AsphShn      4
CBlock       2
PreCast      1
Name: Exterior 2nd, dtype: int64

We need to map invalid values to valid ones. `CmentBd` and `Brk Cmn` look like misspellings of `CemntBd` and `BrkComm`. Since `WdShing` is completely absent from the data set, we assume that `Wd Shng` should be mapped to `WdShing`. **We decide to correct the data set.**

In [24]:
def correct(df, variables, name=name):
    replacements = {'CmentBd': 'CemntBd', 'Wd Shng': 'WdShing', 'Brk Cmn': 'BrkComm'}
    return replace_values(df, variables, name, replacements)


# Register:
corrections.append(correct)

# Apply & check:
df = correct(df, variables)
assert count_invalid(df[name], values) == 0

**`Mas Vnr Type`:**

We take a look at the invalid values:

In [99]:
name = 'Mas Vnr Type'
values = variables[name]['values']
get_invalid(df, variables, name)

array([nan], dtype=object)

In [100]:
df[name].value_counts(dropna=False)

None       1442
BrkFace     736
Stone       210
BrkCmn       21
NaN          20
CBlock        1
Name: Mas Vnr Type, dtype: int64

At this point it makes sense to look at `Mas Vnr Type` and `Mas Vnr Area` together, so we will return to `Mas Vnr Type` when we investigate quantitative variables below.

**`Electrical`:**

We take a look at the invalid values:

In [27]:
name = 'Electrical'
values = variables[name]['values']
get_invalid(df, variables, name)

array([nan], dtype=object)

We need to map `np.nan` to a valid value but have no way to do so with certainty. Since only 1 row is impacted, we could drop it or use the mode. **We decide to introduce a new value - `Unknown`.**

In [28]:
# Extend:
values.add('Unknown')


def correct(df, variables, name=name):
    df[name] = df[name].fillna('Unknown')
    return df


# Register:
corrections.append(correct)

# Apply & check:
df = correct(df, variables)
assert count_invalid(df[name], values) == 0

**`Sale Type`:**

We take a look at the invalid values:

In [29]:
name = 'Sale Type'
values = variables[name]['values']
get_invalid(df, variables, name)

array(['WD '], dtype=object)

Here we are bitten by trailing white space. **We decide to correct the data set.** It actually seems reasonable to strip leading and trailing white space from all qualitative variables as a precaution and to make this correction the first one that we apply.

In [30]:
def correct(df, variables):
    for name, attrs in variables.items():
        if not is_qualitative(attrs):
            continue
        if pd.api.types.is_object_dtype(df[name]):
            df[name] = df[name].str.strip()
    return df


# Register:
corrections.insert(0, correct)  # prepend

# Apply & check:
df = correct(df, variables)
assert count_invalid(df[name], values) == 0
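
The effect of stripping white space on the validity check can be seen on a few made-up values (the set of valid values below is only a subset chosen for illustration):

```python
import pandas as pd

# Made-up values with trailing and leading white space:
toy = pd.Series(['WD ', 'New', ' COD'])
valid_values = {'WD', 'New', 'COD'}

invalid_before = (~toy.isin(valid_values)).sum()
invalid_after = (~toy.str.strip().isin(valid_values)).sum()
print(invalid_before, invalid_after)
```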

We also keep a table for values that indicate the absence of a feature:
    
| Variable           | Value    | Meaning            |
|--------------------|----------|--------------------|
| Alley              | NA       | No alley access    |
| Mas Vnr Type       | None     | None               |
| Bsmt Qual          | NA       | No basement        |
| Bsmt Cond          | NA       | No basement        |
| Bsmt Exposure      | NA       | No basement        |
| BsmtFin Type 1     | NA       | No basement        |
| BsmtFin Type 2     | NA       | No basement        |
| Electrical         | Unknown  | Unknown            |
| Fireplace Qu       | NA       | No Fireplace       |
| Garage Type        | NA       | No garage          |
| Garage Finish      | NA       | No garage          |
| Garage Qual        | NA       | No garage          |
| Garage Cond        | NA       | No garage          |
| Pool QC            | NA       | No pool            |
| Fence              | NA       | No fence           |
| Misc Feature       | NA       | None               |

#### Quantitative variables
---

The first thing we noticed when checking the quantitative variables is that, again, the names of some variables differ between the data set and the documentation. **We decide to align the definitions to match the data set.**

In [31]:
replacements = {
    # 'Bedroom' is 'Bedroom AbvGr' in the data set:
    'Bedroom': 'Bedroom AbvGr',
    # 'Kitchen' is 'Kitchen AbvGr' in the data set:
    'Kitchen': 'Kitchen AbvGr',
    # 'TotRmsAbvGrd' is 'TotRms AbvGrd' in the data set:
    'TotRmsAbvGrd': 'TotRms AbvGrd',
    # '3-Ssn Porch' is '3Ssn Porch' in the data set:
    '3-Ssn Porch': '3Ssn Porch'
}

# We preserve the order:
pairs = [(replacements.get(name, name), attrs) for name, attrs in variables.items()]
variables = collections.OrderedDict(pairs)

We check quantitative variables and build a data-frame with the number of null values, zero values, the min and the max:

In [32]:
data = []
for name, attrs in variables.items():
    if not is_quantitative(attrs):
        continue
    series = df[name]
    null_count = count_null(series)
    zero_count = (series == 0.0).sum()
    data.append((name, attrs['kind'], null_count, zero_count, series.min(), series.max()))
df_qt = pd.DataFrame(data=data, columns=['feature', 'kind', 'null_count', 'zero_count', 'min', 'max'])

The quantitative variables that we need to investigate are:

In [33]:
df_qt[(df_qt['null_count'] > 0) | (df_qt['min'] < 0)]

Unnamed: 0,feature,kind,null_count,zero_count,min,max
0,Lot Frontage,Continuous,420,0,21.0,313.0
4,Mas Vnr Area,Continuous,20,1438,0.0,1600.0
5,BsmtFin SF 1,Continuous,1,773,0.0,5644.0
6,BsmtFin SF 2,Continuous,1,2135,0.0,1526.0
7,Bsmt Unf SF,Continuous,1,210,0.0,2336.0
8,Total Bsmt SF,Continuous,1,70,0.0,6110.0
12,Bsmt Full Bath,Discrete,2,1412,0.0,3.0
13,Bsmt Half Bath,Discrete,2,2285,0.0,2.0
20,Garage Yr Blt,Discrete,138,0,1896.0,2207.0
21,Garage Cars,Discrete,1,136,0.0,4.0


**`Lot Frontage`:**

Here, _null_ seems to mean that that particular piece of information is missing since 0 would not make sense. We cannot afford to drop 420 rows out of 2430 (17 %), so we will have to somehow impute the null values. We want to keep things as simple as possible. **We decide to replace null values by the median value in the neighborhood where the property is located if the median exists and by the median value overall otherwise.**

In [35]:
name = 'Lot Frontage'


def correct(df, variables, name=name):
    
    def median_or(series, default):
        # `median()` skips null values by default, so the result is only
        # null if every value in the neighborhood is null:
        value = series.median()
        return default if np.isnan(value) else value
        
    median_overall = df[name].median()
    df[name] = df.groupby(by='Neighborhood')[name].transform(
        lambda series: series.fillna(median_or(series, median_overall)))
    return df


# Register:
corrections.append(correct)

# Apply & check:
df = correct(df, variables)
assert df[name].isna().sum() == 0
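
The imputation rule stated above can be checked on a small example. The sketch below uses made-up neighborhoods `A`, `B` and `C` and an alternative formulation based on `transform('median')`; it is not the code registered in `corrections`.

```python
import numpy as np
import pandas as pd

# Made-up data: neighborhood C has no known frontage at all.
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Lot Frontage': [60.0, 80.0, np.nan, 20.0, np.nan, np.nan],
})

median_overall = toy['Lot Frontage'].median()
median_by_group = toy.groupby('Neighborhood')['Lot Frontage'].transform('median')

# Neighborhood median first, overall median as the fallback:
filled = toy['Lot Frontage'].fillna(median_by_group).fillna(median_overall)
print(filled.tolist())
```

The null value in `A` receives the median of `A` (70), the one in `B` receives 20, and the one in `C` falls back to the overall median (60).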

**`Mas Vnr Area` and `Mas Vnr Type`:**

Here we have to be careful because `Mas Vnr Area` has both _Na_ and 0 values, and `Mas Vnr Type` has both _Na_ and `None` values. We take a closer look at the different combinations.

In [49]:
name_area = 'Mas Vnr Area'
name_type = 'Mas Vnr Type'

In [103]:
df_vnr = df[[name_type, name_area]].copy()
df_vnr[f'{name_type} Simpl'] = (
    df_vnr[name_type]
    .mask(df_vnr[name_type].isna(), 'Na')
    .mask(df_vnr[name_type].notna() & (df_vnr[name_type] != 'None'), 'Other')
)
df_vnr[f'{name_area} Simpl'] = (
    df_vnr[name_area]
    .mask(df_vnr[name_area].isna(), 'Na')
    .mask(df_vnr[name_area] == 0, 'Zero')
    .mask(df_vnr[name_area].notna() & (df_vnr[name_area] != 0), 'Positive')
)
df_vnr['Count'] = 1
df_vnr.groupby(by=[f'{name_type} Simpl', f'{name_area} Simpl']).agg({'Count': 'sum'})

Mas Vnr Type Simpl,Mas Vnr Area Simpl,Count
Na,Na,20
None,Positive,7
None,Zero,1435
Other,Positive,965
Other,Zero,3


In [107]:
df.loc[(df[name_type] == 'None') & (df[name_area] > 0), [name_type, name_area]]

Unnamed: 0,Mas Vnr Type,Mas Vnr Area
631,None,285.0
1286,None,1.0
1546,None,344.0
1737,None,312.0
1975,None,1.0
2135,None,288.0
2256,None,1.0


In [108]:
df.loc[df[name_type].notna() & (df[name_type] != 'None') & (df[name_area] == 0), [name_type, name_area]]

Unnamed: 0,Mas Vnr Type,Mas Vnr Area
678,Stone,0.0
2220,BrkFace,0.0
2227,BrkFace,0.0


Notes:
* The expected cases are `(None, Zero)` and `(Other, Positive)`.
* For the case `(Na, Na)`: There is no way to know with certainty. We assume that the type and area were left out because they did not apply. **We decide to set the type to `None` and the area to 0.**
* For the case `(None, Positive)`: There is no way to know with certainty. We assume that an area of 1.0 square feet is a mistake. We also assume that an area greater than 1.0 square feet is correct and that the type was left out by mistake. **We decide to set the type to `Unknown` and the area to 0 if the area is equal to 1.0 and to only set the type to `Unknown` otherwise.**
* For the case `(Other, Zero)`: There is no way to know with certainty. We assume that the type is correct and that the area was left out by mistake. **We decide to set the area to the median value in the neighborhood where the property is located if the median exists and to the median value overall otherwise.**
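
The decisions above could be turned into a correction along the following lines. This is only a sketch on made-up rows: the helper name `correct_mas_vnr` is hypothetical, and we assume that the medians are taken over positive areas only, since the notes do not say which rows enter the median.

```python
import numpy as np
import pandas as pd


def correct_mas_vnr(df, name_type='Mas Vnr Type', name_area='Mas Vnr Area'):
    """Hypothetical helper that applies the decisions listed above."""
    kind, area = df[name_type], df[name_area]

    # Classify the rows before changing anything:
    both_na = kind.isna() & area.isna()
    none_positive = (kind == 'None') & (area > 0)
    other_zero = kind.notna() & (kind != 'None') & (area == 0)

    # Assumption: medians are computed over positive areas only.
    positive = area.where(area > 0)
    by_neighborhood = positive.groupby(df['Neighborhood']).transform('median')
    fill = by_neighborhood.fillna(positive.median())

    out = df.copy()
    out.loc[both_na, name_type] = 'None'
    out.loc[both_na, name_area] = 0.0
    out.loc[none_positive, name_type] = 'Unknown'
    out.loc[none_positive & (area == 1.0), name_area] = 0.0
    out.loc[other_zero, name_area] = fill[other_zero]
    return out


# Made-up rows covering each case:
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'A', 'B'],
    'Mas Vnr Type': [np.nan, 'None', 'None', 'Stone', 'BrkFace', 'BrkFace'],
    'Mas Vnr Area': [np.nan, 1.0, 285.0, 0.0, 120.0, 100.0],
})
fixed = correct_mas_vnr(toy)
print(fixed)
```

If this approach were adopted, `Unknown` would also have to be added to the valid values of `Mas Vnr Type`, as was done for `Electrical`.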

**`BsmtFin SF 1`, `BsmtFin SF 2`, `Bsmt Unf SF`, `Total Bsmt SF`, `Bsmt Full Bath` and `Bsmt Half Bath`:**

In [119]:
names = ['BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath']
df.loc[df[names].isna().any(axis=1), names]

Unnamed: 0,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Bsmt Full Bath,Bsmt Half Bath
104,,,,,,
2056,0.0,0.0,0.0,0.0,,


Fortunately, only 2 records are impacted. There is no way to know with certainty. We assume that the values were left out because they did not apply. **We decide to set both continuous and discrete values to 0.**
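
A minimal sketch of the corresponding correction, shown on two made-up rows that mirror the records above. The function name `correct_basement` is hypothetical; it follows the same `(df, variables)` signature as the other corrections.

```python
import numpy as np
import pandas as pd

names = ['BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
         'Bsmt Full Bath', 'Bsmt Half Bath']


def correct_basement(df, variables=None, names=names):
    # Null is taken to mean "no basement", hence 0 everywhere:
    df[names] = df[names].fillna(0)
    return df


toy = pd.DataFrame(
    [[np.nan] * 6, [0.0, 0.0, 0.0, 0.0, np.nan, np.nan]],
    columns=names)
toy = correct_basement(toy)
print(toy)
```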

### Step 2 - Correct inconsistencies<a name="stage-1-step-2"></a> ([top](#top))
---

### Step 3 - Handle outliers<a name="stage-1-step-3"></a> ([top](#top))
---

Laundry list:
* Ordinals as numbers and strings
  Numeric: Overall Qual

**`SalePrice`**: Our target variable is `SalePrice`, so we look at it first. We print its descriptive statistics and plot its distribution below. Turning to the Internet, we also found 2 related measures: [skewness](https://en.wikipedia.org/wiki/Skewness) and [kurtosis](https://en.wikipedia.org/wiki/Kurtosis) ("tailedness"), both available out of the box in Pandas. We do not have any background in statistics but, for comparison, the normal distribution has a skewness of 0 and a kurtosis of 3. Note that `Series.kurt()` reports the excess kurtosis (Fisher's definition), for which the normal distribution scores 0. We also print these below.

In [None]:
def plot_distribution(series, xlabel):
    fig, ax = plt.subplots()
    ax.hist(series.dropna(), bins=100, color='cornflowerblue', ec='black', label=None)
    ax.axvline(series.mean(), color='red', linestyle='-', label='mean')
    ax.axvline(series.median(), color='lime', linestyle='-', label='median')
    ax.set_xlabel(xlabel)
    ax.set_ylabel('count')
    ax.legend()
    plt.show()

In [None]:
series = df['SalePrice']
series.describe()

In [None]:
plot_distribution(series, 'Sales price in USD')
print(f'skewness: {series.skew():.2f}')
print(f'kurtosis: {series.kurt():.2f}')
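
To have a reference point for these two numbers, we can compute them on a large sample drawn from a normal distribution. The sample size and the seed below are arbitrary choices. Both values come out close to 0, because `Series.kurt()` uses Fisher's definition (excess kurtosis).

```python
import numpy as np
import pandas as pd

# Arbitrary seed and sample size, for illustration only:
rng = np.random.RandomState(0)
normal = pd.Series(rng.normal(size=100_000))

print(f'skewness: {normal.skew():.2f}')
print(f'kurtosis: {normal.kurt():.2f}')
```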

**Comment:** As we have already seen in the course, the sales price is positively skewed, and a log transformation brings its distribution closer to the normal distribution.

In [None]:
mod_series = np.log(series)
print(mod_series.describe())

In [None]:
plot_distribution(mod_series, 'Natural logarithm of sales price in USD')
print(f'skewness: {mod_series.skew():.2f}')
print(f'kurtosis: {mod_series.kurt():.2f}')

In [None]:
df['SalePrice'] = mod_series
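
Since `SalePrice` now holds the natural logarithm of the price, any value computed on that scale has to be mapped back with `np.exp` before it can be read in USD. A small sketch using three prices from the first rows shown earlier:

```python
import numpy as np
import pandas as pd

prices = pd.Series([75000.0, 155000.0, 236000.0])
log_prices = np.log(prices)

# np.exp is the inverse of np.log:
restored = np.exp(log_prices)
print(restored.round(2).tolist())
```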