# 1_business_data_understanding_suto

**Purpose:** This notebook covers the business and data understanding phase as described in [2020, Studer et al.](https://arxiv.org/abs/2003.05155) "Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology".

**Methodology:** In addition to the methodology described by 2020, Studer et al., I will use the [EDA framework proposed by Tony Ojeda](https://www.youtube.com/watch?v=YEBRkLo568Q).

**Results:** Describe and comment the most important results.

---

**Suggested next steps**

- [ ] State suggested next steps, based on results obtained in this notebook.


# Setup

## Library import
We import all the required Python libraries

In [1]:
%matplotlib inline

import os
import pickle
from typing import List

# Data manipulation
import pandas as pd
import numpy as np

# Visualizations
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from pypmml import Model

os.chdir('../')
from src.utils.data_describe import breve_descricao, serie_nulos, cardinalidade
os.chdir('./notebooks/')

# Options for pandas
# pd.options.display.max_columns = None
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# pd.options.display.max_rows = 120

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

## Parameter definition
We set all relevant parameters for our notebook. By convention, parameters are uppercase, while all the 
other variables follow Python's guidelines.

In [2]:
RAW_FOLDER = '../data/raw/'
REPORTS_FOLDER = '../reports/'
RANDOM_STATE = 42


# Data import
We retrieve all the required data for the analysis.

In [3]:
df = pd.read_csv(RAW_FOLDER + 'train.csv', index_col=0)
df_evaluation = df.copy() 
df_evaluation.shape

(1460, 80)

## Initial evaluation

In [4]:
# Data types
df_evaluation.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [5]:
lst_columns_null = serie_nulos(df_evaluation, corte=0.5).index.tolist()

lst_columns_null

4 atributos/features/campos possuem mais de 0.5 de valores nulos.


['PoolQC', 'MiscFeature', 'Alley', 'Fence']
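
The helper `serie_nulos` lives in `src/utils/data_describe.py` and is not shown in this notebook. As a rough, assumed equivalent, the null proportion per column can be computed with plain pandas. The function name `null_ratio_above` and the toy frame below are hypothetical and only illustrate the idea; the real helper may behave differently.

```python
import numpy as np
import pandas as pd


def null_ratio_above(frame: pd.DataFrame, cutoff: float = 0.5) -> pd.Series:
    """Return the null proportion of every column whose ratio is strictly above `cutoff`."""
    ratios = frame.isna().mean()  # mean of a boolean mask = proportion of nulls
    return ratios[ratios > cutoff].sort_values(ascending=False)


# Hypothetical toy data, not taken from train.csv
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 75% null
    "b": [1.0, 2.0, 3.0, 4.0],           # 0% null
    "c": [np.nan, np.nan, 3.0, 4.0],     # exactly 50% null, so not above the cutoff
})
high_null = null_ratio_above(toy, cutoff=0.5)
print(high_null.index.tolist())
```

Calling `.index.tolist()` on the result mirrors how `lst_columns_null` is built in the cell above.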

In [6]:
cardinalidade(df_evaluation)

Unnamed: 0,Atributo,DType,Cardinalidade,Valores,Proporção Nulos
38,CentralAir,object,2,"[N, Y]",0.0
3,Street,object,2,"[Grvl, Pave]",0.0
7,Utilities,object,2,"[AllPub, NoSeWa]",0.0
4,Alley,object,3,"[Grvl, NaN, Pave]",0.937671
45,BsmtHalfBath,int64,3,"[0, 1, 2]",0.0
47,HalfBath,int64,3,"[0, 1, 2]",0.0
9,LandSlope,object,3,"[Gtl, Mod, Sev]",0.0
61,PavedDrive,object,3,"[N, P, Y]",0.0
44,BsmtFullBath,int64,4,"[0, 1, 2, 3]",0.0
24,ExterQual,object,4,"[Ex, Fa, Gd, TA]",0.0
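
`cardinalidade` is also a project helper whose source is not part of this notebook. Judging only from the output above (where `NaN` is counted as a value for 'Alley'), a comparable summary could be sketched as follows. The function name `cardinality_table`, its column names, and the toy frame are assumptions made for illustration.

```python
import pandas as pd


def cardinality_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise distinct-value count (NaN included) and null ratio per column."""
    summary = pd.DataFrame({
        "attribute": list(frame.columns),
        "cardinality": [frame[c].nunique(dropna=False) for c in frame.columns],
        "null_ratio": [frame[c].isna().mean() for c in frame.columns],
    })
    # Stable sort keeps the original column order among ties
    return summary.sort_values("cardinality", kind="stable").reset_index(drop=True)


# Hypothetical toy data
toy = pd.DataFrame({
    "street": ["Pave", "Grvl", "Pave"],
    "alley": ["Grvl", None, "Pave"],
    "cars": [0, 1, 2],
})
table = cardinality_table(toy)
print(table)
```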


In [7]:
lst_time = [x for x in df_evaluation.columns if ('yr' in x.lower()) or ('year' in x.lower())]
# After reading the data description, I realized that 'MoSold' is a time attribute too.
lst_time.append('MoSold')

print(f"""There is/are {len(lst_time)} time attributes:
{lst_time}""")

There is/are 5 time attributes:
['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold', 'MoSold']


In [8]:
lst_area = [x for x in df_evaluation.columns if ('area' in x.lower()) or ('sf' in x.lower())]

print(f"""There is/are {len(lst_area)} area attributes:
{lst_area}""")

There is/are 14 area attributes:
['LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'PoolArea']


In [9]:
lst_float = [
    x for x in df_evaluation.select_dtypes(include='float64').columns.tolist() if (x not in lst_area) and (x not in lst_time)
]

print(f"""There is/are {len(lst_float)} float attributes:
{lst_float}""")

There is/are 1 float attributes:
['LotFrontage']


### Partial conclusions:
- Of the 81 attributes in the raw file (the 80 columns shown above plus 'Id', which is used as the index), we have:
 - float64(3), int64(35), object(43)

- There are 4 attributes with more than 50% of null values:
 - PoolQC         0.995205
 - MiscFeature    0.963014
 - Alley          0.937671
 - Fence          0.807534

**Action:**

**30/05/2022**:
- The field 'id' will be dropped.
- The 4 attributes with more than 50% of null values will be dropped (all four are in fact above 80% null).
- For the baseline model, I will take only the numerical and time fields:
 - 5 time attributes;
 - 14 area attributes; and
 - 1 float attribute.
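
The selection described in the action list can be sketched as below. This is a minimal illustration on a hypothetical toy frame; the list names follow the ones built earlier in this notebook, but their contents here are shortened for the example, and `df_baseline` is an assumed name.

```python
import pandas as pd

# Hypothetical toy frame with one column per attribute group
toy = pd.DataFrame({
    "YrSold": [2008, 2009],
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, 80.0],
    "Alley": [None, None],
    "MSZoning": ["RL", "RL"],
})

# Shortened stand-ins for the lists computed in the cells above
lst_columns_null = ["Alley"]
lst_time = ["YrSold"]
lst_area = ["LotArea"]
lst_float = ["LotFrontage"]

# Drop the high-null columns, then keep only time, area and float attributes
baseline_columns = lst_time + lst_area + lst_float
df_baseline = toy.drop(columns=lst_columns_null)[baseline_columns]
print(df_baseline.columns.tolist())
```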

## Data types

In the previous section, we did not look closely at the type of each attribute. The objective of this section is a deeper analysis of the data type of each attribute.

This analysis covers all attributes except the four with more than 50% of null values ('PoolQC', 'MiscFeature', 'Alley', 'Fence'), proceeding from the lowest cardinality to the highest.

In [10]:
# Removing columns with high null proportion:
df_evaluation = df.drop(columns=lst_columns_null).copy()

print(f"df_evaluation's shape: {df_evaluation.shape}")

df_evaluation's shape: (1460, 76)


### Numerical attributes

Some of the numerical attributes stored as integers are actually continuous quantities (e.g., any attribute related to area). Using [data_descriptio.txt](../references/data_descriptio.txt), I will identify them and change their data types to float.

In [11]:
# np.number already covers float and 'float64', so a single selector is enough
df_numerical_cardinality = cardinalidade(df_evaluation.select_dtypes(include=np.number)).copy()
df_numerical_cardinality

Unnamed: 0,Atributo,DType,Cardinalidade,Valores,Proporção Nulos
15,BsmtHalfBath,int64,3,"[0, 1, 2]",0.0
17,HalfBath,int64,3,"[0, 1, 2]",0.0
14,BsmtFullBath,int64,4,"[0, 1, 2, 3]",0.0
21,Fireplaces,int64,4,"[0, 1, 2, 3]",0.0
16,FullBath,int64,4,"[0, 1, 2, 3]",0.0
19,KitchenAbvGr,int64,4,"[0, 1, 2, 3]",0.0
22,GarageCars,int64,5,"[0, 1, 2, 3, 4]",0.0
32,YrSold,int64,5,"[2006, 2007, 2008, 2009, 2010]",0.0
18,BedroomAbvGr,int64,8,"[0, 1, 2, 3, 4, 5, 6, 8]",0.0
29,PoolArea,int64,8,"[0, 480, 512, 519, 555, 576, 648, 738]",0.0


**Int to Float**

Attributes described in 'square feet' in data description, but without 'area' or 'sf' in the name:

- 'EnclosedPorch'
- '3SsnPorch'
- 'ScreenPorch'

'SalePrice' is our target, but it is a continuous value too, so it will also be converted to float.

In [12]:
# Attributes with area in the name or sf ('square feet') or described in 'square feet' in data description:

lst_area = [x for x in df_evaluation.columns if ('area' in x.lower()) or ('sf' in x.lower())]
lst_area.extend(['EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'SalePrice'])

print(f"""Qty of area attributes: {len(lst_area)}
---
Attributes:
{lst_area}""")

for column in lst_area:
    df_evaluation[column] = df[column].astype(float)
    
cardinalidade(df_evaluation.select_dtypes(include=np.number))

Qty of area attributes: 18
---
Attributes:
['LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'PoolArea', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'SalePrice']


Unnamed: 0,Atributo,DType,Cardinalidade,Valores,Proporção Nulos
6,BsmtHalfBath,int64,3,"[0, 1, 2]",0.0
8,HalfBath,int64,3,"[0, 1, 2]",0.0
5,BsmtFullBath,int64,4,"[0, 1, 2, 3]",0.0
12,Fireplaces,int64,4,"[0, 1, 2, 3]",0.0
7,FullBath,int64,4,"[0, 1, 2, 3]",0.0
10,KitchenAbvGr,int64,4,"[0, 1, 2, 3]",0.0
13,GarageCars,int64,5,"[0, 1, 2, 3, 4]",0.0
16,YrSold,int64,5,"[2006, 2007, 2008, 2009, 2010]",0.0
9,BedroomAbvGr,int64,8,"[0, 1, 2, 3, 4, 5, 6, 8]",0.0
2,OverallCond,int64,9,"[1, 2, 3, 4, 5, 6, 7, 8, 9]",0.0


### Categorical attributes

#### Binaries

The attributes 'CentralAir', 'Street', and 'Utilities' have a cardinality of two. According to [data_descriptio.txt](../references/data_descriptio.txt), however, 'Utilities' is not a binary attribute; only two of its possible values appear in this dataset. Despite this, I will treat 'Utilities' as a binary attribute, renaming it to 'is_all_pub_utilities'.

'CentralAir' and 'Street' will be replaced by 'has_central_air' and 'is_paved_street', respectively.

In [None]:
df_evaluation['has_central_air'] = np.where(df_evaluation['CentralAir']=='Y', 1, 0)
df_evaluation['is_paved_street'] = np.where(df_evaluation['Street']=='Pave', 1, 0)
df_evaluation['is_all_pub_utilities'] = np.where(df_evaluation['Utilities']=='AllPub', 1, 0)

df_evaluation.loc[:, [
    'has_central_air', 'CentralAir', 'is_paved_street', 'Street', 'is_all_pub_utilities', 'Utilities'
]].sample(10, random_state=RANDOM_STATE)

In [None]:
# Compute the summary once and filter it, instead of calling the helper twice
df_cardinality = cardinalidade(df_evaluation)
df_cardinality.loc[df_cardinality['Cardinalidade'] <= 3, :]

## Categorical attributes

In [13]:
lst_non_categorical = lst_columns_null.copy()
lst_non_categorical.extend(lst_area)
lst_non_categorical.extend(lst_time)
lst_non_categorical.extend(lst_float)

df_categorical_eval = df[[x for x in df.columns if x not in lst_non_categorical]].copy()
lst_categorical = df_categorical_eval.select_dtypes(include=['object', 'int64']).columns.tolist()

df_categorical_eval = df_categorical_eval[lst_categorical]

df_categorical_eval.head()

Id,MSSubClass,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageFinish,GarageCars,GarageQual,GarageCond,PavedDrive,MiscVal,SaleType,SaleCondition
1,60,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,RFn,2,TA,TA,Y,0,WD,Normal
2,20,RL,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,Gable,CompShg,MetalSd,MetalSd,,TA,TA,CBlock,Gd,TA,Gd,ALQ,Unf,GasA,Ex,Y,SBrkr,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,RFn,2,TA,TA,Y,0,WD,Normal
3,60,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Mn,GLQ,Unf,GasA,Ex,Y,SBrkr,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,RFn,2,TA,TA,Y,0,WD,Normal
4,70,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,Gable,CompShg,Wd Sdng,Wd Shng,,TA,TA,BrkTil,TA,Gd,No,ALQ,Unf,GasA,Gd,Y,SBrkr,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,Unf,3,TA,TA,Y,0,WD,Abnorml
5,60,RL,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,RFn,3,TA,TA,Y,0,WD,Normal


In [14]:
# Qty of categorical attributes:
print(
    f"""Qty of categorical attributes: {len(lst_categorical)}
---
List:
{lst_categorical}"""
)


Qty of categorical attributes: 52
---
List:
['MSSubClass', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'GarageQual', 'GarageCond', 'PavedDrive', 'MiscVal', 'SaleType', 'SaleCondition']


In [None]:
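# Compute the summary once and filter it, instead of calling the helper twice
df_categorical_cardinality = cardinalidade(df_categorical_eval)
df_categorical_cardinality[df_categorical_cardinality['Cardinalidade'] < 5]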

# EDA framework

<img src="../references/eda_framework.png" alt="eda" class="bg-primary" width="500px">

In [None]:
df.sample(10, random_state=RANDOM_STATE)

In [None]:
df.loc[df['GarageYrBlt'].isna(), [x for x in df.columns if 'arage' in x]]

## Identity

### Types of information

The entities are essentially records of houses sold, and probably each record is a different house, because there are no duplicates.

### Entities in dataset

We could aggregate the entities in different views. These are some that crossed my mind:

1. Neighborhood + SaleType + YrSold
2. Neighborhood + YrSold
3. MSZoning (zoning classification) + YrSold

In [None]:
# In the dataset, we have a 5-year timespan.

print("In the dataset, we have a 5-year timespan:")
sorted(df_evaluation['YrSold'].unique())

#### 1. Neighborhood + SaleType + YrSold

In [None]:
df_vw_house_neighborhood_saletype_yrsold = df_evaluation.groupby(by=['Neighborhood', 'SaleType', 'YrSold']).agg({'SalePrice': 'median', 'MSSubClass': 'count'}).rename(
    columns={'SalePrice': 'SalePrice_median', 'MSSubClass': '#houses'}
).reset_index()

df_vw_house_neighborhood_saletype_yrsold.sort_values(by=['Neighborhood', 'SaleType', 'YrSold'], inplace=True)

df_vw_house_neighborhood_saletype_yrsold

#### 2. Neighborhood + YrSold

In [None]:
df_vw_house_neighborhood_yrsold = df_evaluation.groupby(by=['Neighborhood', 'YrSold']).agg({'SalePrice': 'median', 'MSSubClass': 'count'}).rename(
    columns={'SalePrice': 'SalePrice_median', 'MSSubClass': '#houses'}
).reset_index()

df_vw_house_neighborhood_yrsold.sort_values(by=['Neighborhood', 'YrSold'], inplace=True)

df_vw_house_neighborhood_yrsold

#### 3. MSZoning (zoning classification) + YrSold

In [None]:
df_vw_mszoning_yrsold = df_evaluation.groupby(by=['MSZoning', 'YrSold']).agg({'SalePrice': 'median', 'MSSubClass': 'count'}).rename(
    columns={'SalePrice': 'SalePrice_median', 'MSSubClass': '#houses'}
).reset_index()

df_vw_mszoning_yrsold.sort_values(by=['MSZoning', 'YrSold'], inplace=True)

df_vw_mszoning_yrsold
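
The three views above share one pattern: group by a set of keys, take the median 'SalePrice', and count the rows. As a sketch, that pattern could be factored into a single helper. The function name `median_price_view` and the toy data are hypothetical, and the count column is called `n_houses` here (rather than '#houses') because pandas named aggregation requires valid Python identifiers.

```python
from typing import List

import pandas as pd


def median_price_view(frame: pd.DataFrame, keys: List[str]) -> pd.DataFrame:
    """Median SalePrice and number of houses for each combination of `keys`."""
    return (
        frame.groupby(keys)
        .agg(SalePrice_median=("SalePrice", "median"), n_houses=("SalePrice", "size"))
        .reset_index()
        .sort_values(keys)
        .reset_index(drop=True)
    )


# Hypothetical toy data
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "YrSold": [2006, 2006, 2007],
    "SalePrice": [100.0, 200.0, 300.0],
})
view = median_price_view(toy, ["Neighborhood", "YrSold"])
print(view)
```

With such a helper, each view would differ only in the list of keys passed in.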

## Review

### Transformation methods

- Filtering
- Aggregation / disaggregation
- Pivoting
- Graph transformation
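
As a small illustration of the first three transformation methods, the sketch below applies filtering, aggregation, and pivoting to a hypothetical toy frame that borrows column names from the dataset. Graph transformation is left out because it would need an additional library.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "YrSold": [2006, 2007, 2006, 2007],
    "SalePrice": [100.0, 120.0, 200.0, 210.0],
})

# Filtering: keep only sales from 2007 onwards
recent = toy[toy["YrSold"] >= 2007]

# Aggregation: median price per year
by_year = toy.groupby("YrSold")["SalePrice"].median()

# Pivoting: neighborhoods as rows, years as columns
pivot = toy.pivot_table(
    index="Neighborhood", columns="YrSold", values="SalePrice", aggfunc="median"
)
print(pivot)
```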

### Visualization methods

- Barcharts
- Multi-line graphs
- Scatter-plots
- Heatmaps
- Network visualizations
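
For instance, a multi-line graph of the median price per neighborhood over the years could be drawn as in the sketch below. The data are hypothetical, and the `Agg` backend is selected only so the example also runs outside a notebook; inside this notebook `%matplotlib inline` would be used instead.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed here for standalone runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data in the shape of the aggregated views
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "YrSold": [2006, 2007, 2006, 2007],
    "SalePrice_median": [100.0, 120.0, 200.0, 210.0],
})

# One column per neighborhood, one row per year
pivot = toy.pivot(index="YrSold", columns="Neighborhood", values="SalePrice_median")

fig, ax = plt.subplots()
for name in pivot.columns:
    ax.plot(pivot.index, pivot[name], marker="o", label=name)
ax.set_xlabel("YrSold")
ax.set_ylabel("SalePrice_median")
ax.legend(title="Neighborhood")
```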

## Create

### Category aggregations
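
One possible reading of category aggregation is to merge rare categories into a single level. The sketch below does this on hypothetical data; the threshold of 2 and the label "Other" are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical categorical attribute
neighborhood = pd.Series(["A", "A", "A", "B", "B", "C"])

# Categories seen fewer than `min_count` times are treated as rare
min_count = 2
counts = neighborhood.value_counts()
rare = counts[counts < min_count].index

# Replace rare categories with a single "Other" level
lumped = neighborhood.where(~neighborhood.isin(rare), "Other")
print(lumped.tolist())
```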

### Continuous bins
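
A continuous attribute can be turned into bins with `pd.qcut` (equal-frequency bins) or `pd.cut` (fixed edges). The sketch below uses `pd.qcut` on hypothetical prices; the number of bins and the labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical continuous attribute
prices = pd.Series([100, 150, 200, 250, 300, 350, 400, 450], name="SalePrice")

# Four equal-frequency bins (quartiles) with readable labels
price_bins = pd.qcut(prices, q=4, labels=["q1", "q2", "q3", "q4"])
print(price_bins.value_counts().sort_index())
```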

### Cluster categories

## Using pandas-profiling

In [None]:
profile = ProfileReport(df, title="Pandas Profiling Report", correlations={"cramers": {"calculate": False}})
profile.to_file(REPORTS_FOLDER + "EDA_01.html")
profile.to_notebook_iframe()