# Business and data understanding

## Purpose
This notebook covers the business and data understanding phase described in [Studer et al., 2020](https://arxiv.org/abs/2003.05155), "Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology".

## Methodology
Besides the methodology described by Studer et al. (2020), I will use the [EDA framework proposed by Tony Ojeda](https://www.youtube.com/watch?v=YEBRkLo568Q).

## WIP - improvements

## Results

## Suggested next steps
- [ ] It was not possible to use the 'cardinalidade' function on 16 attributes. <- Next step: analyze why it happened.


# Setup

## Library import
We import all the required Python libraries.

In [2]:
import os

# Data manipulation
import pandas as pd
import numpy as np

# Visualizations
import matplotlib.pyplot as plt  # pyplot, not the top-level package, provides the plotting API
from pandas_profiling import ProfileReport
import plotly
import plotly.graph_objs as go
import plotly.offline as ply
import seaborn as sns

os.chdir('../')
from src.utils.data_describe import breve_descricao, serie_nulos, cardinalidade
os.chdir('./notebooks/')

# Options for pandas
# pd.options.display.max_columns = None
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# pd.options.display.max_rows = 120

plotly.offline.init_notebook_mode(connected=True)

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

## Parameter definition
We set all relevant parameters for our notebook. By convention, parameters are uppercase, while all the 
other variables follow Python's guidelines.

In [3]:
RAW_FOLDER = '../data/raw/'
REPORTS_FOLDER = '../reports/'
RANDOM_STATE = 42


# Data import
We retrieve all the required data for the analysis.

In [4]:
df = pd.read_csv(RAW_FOLDER + 'train.csv', index_col=0)
df_evaluation = df.copy() 
df_evaluation.shape

(1460, 80)

## Initial evaluation

In [5]:
# Data types
df_evaluation.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [6]:
lst_columns_null = serie_nulos(df_evaluation, corte=0.5).index.tolist()

lst_columns_null

4 attributes/features/fields have more than 0.5 of null values.


['PoolQC', 'MiscFeature', 'Alley', 'Fence']
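
`serie_nulos` is a project helper whose source is not shown in this notebook. A minimal sketch of the same idea in plain pandas, assuming the helper ranks columns by their fraction of missing values and keeps those above a cutoff. The function name `null_fraction_above` and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd


def null_fraction_above(frame: pd.DataFrame, cutoff: float = 0.5) -> pd.Series:
    """Return the fraction of nulls per column, keeping only columns above `cutoff`."""
    fractions = frame.isna().mean()  # mean of a boolean mask = share of nulls
    return fractions[fractions > cutoff].sort_values(ascending=False)


# Toy data (hypothetical), not the housing dataset.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, "Gd"],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave"],
    "Alley": [np.nan, "Grvl", np.nan, "Pave", np.nan],
})
high_null = null_fraction_above(toy, cutoff=0.5)
print(high_null.index.tolist())
```

On the real data, `null_fraction_above(df_evaluation)` would be the call to compare against the list above.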

In [7]:
lst_bad_columns = []
lst_good_columns = []

for column in df_evaluation.select_dtypes(include='object').columns:
    try:
        cardinalidade(df_evaluation[[column]])
        lst_good_columns.append(column)
    except Exception as e:
        lst_bad_columns.append(column)
        
print(f"""
Using the function 'cardinalidade':
- {len(lst_bad_columns)} columns could not be analyzed;
- {len(lst_good_columns)} columns could be analyzed.
""")


Using the function 'cardinalidade':
- 16 columns could not be analyzed;
- 27 columns could be analyzed.



In [8]:
cardinalidade(df_evaluation[lst_good_columns])

Unnamed: 0,Atributo,Cardinalidade,Valores
21,CentralAir,2,"[N, Y]"
1,Street,2,"[Grvl, Pave]"
4,Utilities,2,"[AllPub, NoSeWa]"
6,LandSlope,3,"[Gtl, Mod, Sev]"
24,PavedDrive,3,"[N, P, Y]"
16,ExterQual,4,"[Ex, Fa, Gd, TA]"
22,KitchenQual,4,"[Ex, Fa, Gd, TA]"
3,LandContour,4,"[Bnk, HLS, Low, Lvl]"
2,LotShape,4,"[IR1, IR2, IR3, Reg]"
10,BldgType,5,"[1Fam, 2fmCon, Duplex, Twnhs, TwnhsE]"


In [9]:
df_evaluation[lst_bad_columns].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Alley         91 non-null     object
 1   MasVnrType    1452 non-null   object
 2   BsmtQual      1423 non-null   object
 3   BsmtCond      1423 non-null   object
 4   BsmtExposure  1422 non-null   object
 5   BsmtFinType1  1423 non-null   object
 6   BsmtFinType2  1422 non-null   object
 7   Electrical    1459 non-null   object
 8   FireplaceQu   770 non-null    object
 9   GarageType    1379 non-null   object
 10  GarageFinish  1379 non-null   object
 11  GarageQual    1379 non-null   object
 12  GarageCond    1379 non-null   object
 13  PoolQC        7 non-null      object
 14  Fence         281 non-null    object
 15  MiscFeature   54 non-null     object
dtypes: object(16)
memory usage: 193.9+ KB
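
Every column in the listing above has fewer than 1460 non-null values, so one plausible cause of the failures is missing data: sorting a mix of strings and float `NaN` raises a `TypeError` in Python. The source of `cardinalidade` is not shown, so this is only a hypothesis. The sketch below is a hypothetical NaN-tolerant stand-in (`cardinality_table` is not part of the project) that drops nulls before sorting and reports them separately:

```python
import numpy as np
import pandas as pd


def cardinality_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical NaN-tolerant stand-in for `cardinalidade`."""
    rows = []
    for column in frame.columns:
        series = frame[column]
        # Drop NaN first: sorting a mix of str and float NaN raises TypeError.
        values = sorted(series.dropna().unique().tolist())
        rows.append({
            "Attribute": column,
            "Cardinality": len(values),
            "Values": values,
            "Nulls": int(series.isna().sum()),
        })
    table = pd.DataFrame(rows)
    return table.sort_values("Cardinality", kind="stable").reset_index(drop=True)


# Toy data (hypothetical), using two of the affected column names.
toy = pd.DataFrame({
    "Alley": ["Grvl", np.nan, "Pave", np.nan],
    "Fence": ["MnPrv", np.nan, np.nan, np.nan],
})
result = cardinality_table(toy)
print(result)
```

If the hypothesis holds, recording the exception message inside the `except` block of the loop above would confirm it.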


In [10]:
# Evaluating the int attributes:
cardinalidade(df_evaluation.select_dtypes(include='int64'))

Unnamed: 0,Atributo,Cardinalidade,Valores
15,BsmtHalfBath,3,"[0, 1, 2]"
17,HalfBath,3,"[0, 1, 2]"
14,BsmtFullBath,4,"[0, 1, 2, 3]"
21,Fireplaces,4,"[0, 1, 2, 3]"
16,FullBath,4,"[0, 1, 2, 3]"
19,KitchenAbvGr,4,"[0, 1, 2, 3]"
22,GarageCars,5,"[0, 1, 2, 3, 4]"
32,YrSold,5,"[2006, 2007, 2008, 2009, 2010]"
18,BedroomAbvGr,8,"[0, 1, 2, 3, 4, 5, 6, 8]"
29,PoolArea,8,"[0, 480, 512, 519, 555, 576, 648, 738]"


In [11]:
lst_time = [x for x in df_evaluation.columns if ('yr' in x.lower()) or ('year' in x.lower())]
# After reading the data description, I realized that 'MoSold' is a time attribute too.
lst_time.append('MoSold')

print(f"""There is/are {len(lst_time)} time attributes:
{lst_time}""")

There is/are 5 time attributes:
['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold', 'MoSold']


In [12]:
lst_area = [x for x in df_evaluation.columns if ('area' in x.lower()) or ('sf' in x.lower())]

print(f"""There is/are {len(lst_area)} area attributes:
{lst_area}""")

There is/are 14 area attributes:
['LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'PoolArea']


In [13]:
lst_float = [
    x for x in df_evaluation.select_dtypes(include='float64').columns.tolist() if (x not in lst_area) and (x not in lst_time)
]

print(f"""There is/are {len(lst_float)} float attributes:
{lst_float}""")

There is/are 1 float attributes:
['LotFrontage']


### Partial conclusions:
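- Of the 81 attributes in the raw file (`Id` included; the loaded DataFrame has 80 columns because `Id` is used as the index), we have:
 - float64(3), int64(35), object(43)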

- There are 4 attributes with more than 50% of null values:
 - PoolQC         0.995205
 - MiscFeature    0.963014
 - Alley          0.937671
 - Fence          0.807534
 
- It was not possible to use the 'cardinalidade' function on 16 attributes. <- Next step.

**Action:**

**30/05/2022**:
- The `Id` field will be dropped.
- The 4 attributes with more than 80% of null values will be dropped.
- For the baseline model, I will take only the numerical and time fields:
 - 5 time attributes;
 - 14 area attributes; and
 - 1 float attribute.
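
The baseline selection can be sketched as follows. This is a minimal illustration on a toy frame that reuses a few of the real column names; the variable names `baseline_features`, `X`, and `y` are assumptions, and treating `SalePrice` as the target is implied by the dataset rather than stated above:

```python
import pandas as pd

# Toy frame (hypothetical) with a handful of the real column names.
toy = pd.DataFrame({
    "YrSold": [2006, 2010],
    "MoSold": [2, 4],
    "LotArea": [8414, 12256],
    "1stFlrSF": [1068, 1500],
    "LotFrontage": [70.0, 98.0],
    "Street": ["Pave", "Pave"],
    "SalePrice": [154500, 325000],
})

# Same name-based rules as in the cells above.
time_cols = [c for c in toy.columns if "yr" in c.lower() or "year" in c.lower()]
time_cols.append("MoSold")
area_cols = [c for c in toy.columns if "area" in c.lower() or "sf" in c.lower()]
float_cols = [
    c for c in toy.select_dtypes(include="float64").columns
    if c not in area_cols and c not in time_cols
]

baseline_features = time_cols + area_cols + float_cols
X = toy[baseline_features]  # model inputs
y = toy["SalePrice"]        # assumed target
print(baseline_features)
```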

# EDA framework

<img src="../references/eda_framework.png" alt="eda" class="bg-primary" width="500px">

In [14]:
df.sample(10, random_state=RANDOM_STATE)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
893,20,RL,70.0,8414,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,1Fam,1Story,6,8,1963,2003,Hip,CompShg,HdBoard,HdBoard,,0.0,TA,TA,CBlock,TA,TA,No,GLQ,663,Unf,0,396,1059,GasA,TA,Y,SBrkr,1068,0,0,1068,0,1,1,0,3,1,TA,6,Typ,0,,Attchd,1963.0,RFn,1,264,TA,TA,Y,192,0,0,0,0,0,,MnPrv,,0,2,2006,WD,Normal,154500
1106,60,RL,98.0,12256,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,1994,1995,Gable,CompShg,HdBoard,HdBoard,BrkFace,362.0,Gd,TA,PConc,Ex,TA,Av,GLQ,1032,Unf,0,431,1463,GasA,Ex,Y,SBrkr,1500,1122,0,2622,1,0,2,1,3,1,Gd,9,Typ,2,TA,Attchd,1994.0,RFn,2,712,TA,TA,Y,186,32,0,0,0,0,,,,0,4,2010,WD,Normal,325000
414,30,RM,56.0,8960,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1Story,5,6,1927,1950,Gable,CompShg,WdShing,Wd Shng,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,1008,1008,GasA,Gd,Y,FuseA,1028,0,0,1028,0,0,1,0,2,1,TA,5,Typ,1,Gd,Detchd,1927.0,Unf,2,360,TA,TA,Y,0,0,130,0,0,0,,,,0,3,2010,WD,Normal,115000
523,50,RM,50.0,5000,Pave,,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Feedr,Norm,1Fam,1.5Fin,6,7,1947,1950,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,Gd,CBlock,TA,TA,No,ALQ,399,Unf,0,605,1004,GasA,Ex,Y,SBrkr,1004,660,0,1664,0,0,2,0,3,1,TA,7,Typ,2,Gd,Detchd,1950.0,Unf,2,420,TA,TA,Y,0,24,36,0,0,0,,,,0,10,2006,WD,Normal,159000
1037,20,RL,89.0,12898,Pave,,IR1,HLS,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,1Story,9,5,2007,2008,Hip,CompShg,VinylSd,VinylSd,Stone,70.0,Gd,TA,PConc,Ex,TA,Gd,GLQ,1022,Unf,0,598,1620,GasA,Ex,Y,SBrkr,1620,0,0,1620,1,0,2,0,2,1,Ex,6,Typ,1,Ex,Attchd,2008.0,Fin,3,912,TA,TA,Y,228,0,0,0,0,0,,,,0,9,2009,WD,Normal,315500
615,180,RM,21.0,1491,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,TwnhsE,SFoyer,4,6,1972,1972,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,Gd,TA,Av,LwQ,150,GLQ,480,0,630,GasA,Ex,Y,SBrkr,630,0,0,630,1,0,1,0,1,1,TA,3,Typ,0,,,,,0,0,,,Y,96,24,0,0,0,0,,,,0,5,2010,WD,Normal,75500
219,50,RL,,15660,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,1.5Fin,7,9,1939,2006,Gable,CompShg,VinylSd,VinylSd,BrkFace,312.0,Gd,Gd,CBlock,TA,TA,No,BLQ,341,Unf,0,457,798,GasA,Ex,Y,SBrkr,1137,817,0,1954,0,1,1,1,3,1,Gd,8,Typ,2,TA,Attchd,1939.0,Unf,2,431,TA,TA,Y,0,119,150,0,0,0,,,,0,5,2008,WD,Normal,311500
1161,160,RL,24.0,2280,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NPkVill,Norm,Norm,Twnhs,2Story,6,5,1978,1978,Gable,CompShg,Plywood,Brk Cmn,,0.0,TA,TA,CBlock,Gd,TA,No,ALQ,311,Unf,0,544,855,GasA,Fa,Y,SBrkr,855,601,0,1456,0,0,2,1,3,1,TA,7,Typ,1,TA,Attchd,1978.0,Unf,2,440,TA,TA,Y,26,0,0,0,0,0,,,,0,7,2010,WD,Normal,146000
650,180,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,Inside,Gtl,MeadowV,Norm,Norm,Twnhs,SFoyer,4,6,1970,1970,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,Gd,TA,Av,BLQ,131,GLQ,499,0,630,GasA,Gd,Y,SBrkr,630,0,0,630,1,0,1,0,1,1,TA,3,Typ,0,,,,,0,0,,,Y,0,0,0,0,0,0,,MnPrv,,0,12,2007,WD,Normal,84500
888,50,RL,59.0,16466,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,1Fam,1.5Fin,5,7,1955,1955,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,Gd,PConc,TA,TA,No,Unf,0,Unf,0,816,816,GasA,TA,Y,SBrkr,872,521,0,1393,0,0,1,1,3,1,TA,8,Typ,0,,Attchd,1955.0,Unf,1,300,TA,TA,Y,121,0,0,0,265,0,,,,0,4,2008,WD,Normal,135500


In [18]:
df.loc[df['GarageYrBlt'].isna(), [x for x in df.columns if 'arage' in x]]

Id,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond
40,,,,0,0,,
49,,,,0,0,,
79,,,,0,0,,
89,,,,0,0,,
90,,,,0,0,,
100,,,,0,0,,
109,,,,0,0,,
126,,,,0,0,,
128,,,,0,0,,
141,,,,0,0,,
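
The rows above suggest that `GarageYrBlt` is missing for houses that have no garage (`GarageCars` and `GarageArea` are both 0). A minimal sketch of how that could be checked across the whole frame, shown here on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), mimicking the garage columns.
toy = pd.DataFrame({
    "GarageYrBlt": [1963.0, np.nan, 1994.0, np.nan],
    "GarageCars": [1, 0, 2, 0],
    "GarageArea": [264, 0, 712, 0],
})

missing_year = toy["GarageYrBlt"].isna()
no_garage = (toy["GarageCars"] == 0) & (toy["GarageArea"] == 0)

# True only if the year is missing exactly for the houses without a garage.
same_rows = bool((missing_year == no_garage).all())
print(same_rows)
```

If the check holds on the real data, the missing years are structural rather than random, which matters when choosing how to fill them.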


## Identity

### Types of information

The entities are essentially records of house sales, and each record is probably a different house, because there are no duplicates.
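
The duplicate check itself is not shown in the notebook. A minimal sketch of one way to run it with pandas, on hypothetical toy data; it counts both fully repeated rows and repeated `Id` values:

```python
import pandas as pd

# Toy data (hypothetical): the third row repeats the first under a new Id.
toy = pd.DataFrame(
    {"LotArea": [8414, 12256, 8414], "SalePrice": [154500, 325000, 154500]},
    index=pd.Index([893, 1106, 2000], name="Id"),
)

# DataFrame.duplicated() compares column values only and ignores the index.
n_duplicate_rows = int(toy.duplicated().sum())
n_duplicate_ids = int(toy.index.duplicated().sum())
print(n_duplicate_rows, n_duplicate_ids)
```

Note that identical rows would only hint at a repeated house; the dataset has no explicit house identifier besides `Id`.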

### Entities in dataset

We could aggregate the entities in different views. These are some that crossed my mind:

1. Neighborhood + SaleType + YrSold
2. Neighborhood + YrSold
3. MSZoning (zoning classification) + YrSold

In [67]:
# In the dataset, we have a 5-year timespan.

print("In the dataset, we have a 5-year timespan:")
sorted(df_evaluation['YrSold'].unique())

In the dataset, we have a 5-year timespan:


[2006, 2007, 2008, 2009, 2010]

#### 1. Neighborhood + SaleType + YrSold

In [89]:
df_vw_house_neighborhood_saletype_yrsold = df_evaluation.groupby(by=['Neighborhood', 'SaleType', 'YrSold']).agg({'SalePrice': 'median', 'MSSubClass': 'count'}).rename(
    columns={'SalePrice': 'SalePrice_median', 'MSSubClass': '#houses'}
).reset_index()

df_vw_house_neighborhood_saletype_yrsold.sort_values(by=['Neighborhood', 'SaleType', 'YrSold'], inplace=True)

df_vw_house_neighborhood_saletype_yrsold

Unnamed: 0,Neighborhood,SaleType,YrSold,SalePrice_median,#houses
0,Blmngtn,New,2006,246578.0,3
1,Blmngtn,New,2007,183350.5,2
2,Blmngtn,WD,2006,214245.0,4
3,Blmngtn,WD,2008,175447.5,2
4,Blmngtn,WD,2009,175900.0,5
5,Blmngtn,WD,2010,192000.0,1
6,Blueste,COD,2008,151000.0,1
7,Blueste,WD,2009,124000.0,1
8,BrDale,COD,2008,85400.0,1
9,BrDale,COD,2009,112000.0,1


#### 2. Neighborhood + YrSold

In [80]:
df_vw_house_neighborhood_yrsold = df_evaluation.groupby(by=['Neighborhood', 'YrSold']).agg({'SalePrice': 'median', 'MSSubClass': 'count'}).rename(
    columns={'SalePrice': 'SalePrice_median', 'MSSubClass': '#houses'}
).reset_index()

df_vw_house_neighborhood_yrsold.sort_values(by=['Neighborhood', 'YrSold'], inplace=True)

df_vw_house_neighborhood_yrsold

Unnamed: 0,Neighborhood,YrSold,SalePrice_median,#houses
0,Blmngtn,2006,215000.0,7
1,Blmngtn,2007,183350.5,2
2,Blmngtn,2008,175447.5,2
3,Blmngtn,2009,175900.0,5
4,Blmngtn,2010,192000.0,1
5,Blueste,2008,151000.0,1
6,Blueste,2009,124000.0,1
7,BrDale,2006,93000.0,4
8,BrDale,2007,113000.0,3
9,BrDale,2008,94750.0,4


#### 3. MSZoning (zoning classification) + YrSold

In [90]:
df_vw_mszoning_yrsold = df_evaluation.groupby(by=['MSZoning', 'YrSold']).agg({'SalePrice': 'median', 'MSSubClass': 'count'}).rename(
    columns={'SalePrice': 'SalePrice_median', 'MSSubClass': '#houses'}
).reset_index()

df_vw_mszoning_yrsold.sort_values(by=['MSZoning', 'YrSold'], inplace=True)

df_vw_mszoning_yrsold

Unnamed: 0,MSZoning,YrSold,SalePrice_median,#houses
0,C (all),2006,71655.5,2
1,C (all),2007,133900.0,1
2,C (all),2008,60500.0,2
3,C (all),2009,59950.0,2
4,C (all),2010,68400.0,3
5,FV,2006,185000.0,13
6,FV,2007,208900.0,15
7,FV,2008,206725.0,14
8,FV,2009,229456.0,15
9,FV,2010,215600.0,8
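
The three views above repeat the same group-by, aggregate, and rename steps. A minimal sketch of a shared helper using pandas named aggregation; `median_price_view` and the column name `n_houses` (standing in for `#houses`, which is not a valid keyword) are hypothetical, and the toy data is made up:

```python
import pandas as pd


def median_price_view(frame: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Median SalePrice and number of rows per combination of `keys`."""
    return (
        frame.groupby(keys)  # groupby sorts by the keys by default
        .agg(
            SalePrice_median=("SalePrice", "median"),
            n_houses=("SalePrice", "size"),
        )
        .reset_index()
    )


# Toy data (hypothetical), not the housing dataset.
toy = pd.DataFrame({
    "Neighborhood": ["Blmngtn", "Blmngtn", "Blmngtn", "BrDale"],
    "YrSold": [2006, 2006, 2007, 2006],
    "SalePrice": [200000, 230000, 180000, 90000],
})
view = median_price_view(toy, ["Neighborhood", "YrSold"])
print(view)
```

With such a helper, each view reduces to one call, for example `median_price_view(df_evaluation, ['MSZoning', 'YrSold'])`.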


## Review

### Transformation methods

- Filtering
- Aggregation / disaggregation
- Pivoting
- Graph transformation

### Visualization methods

- Barcharts
- Multi-line graphs
- Scatter-plots
- Heatmaps
- Network visualizations
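
As an illustration of how a transformation feeds a visualization, the sketch below pivots a long group-by view into a wide zoning-by-year table, which is the shape a heatmap expects. The `FV` values come from the MSZoning view above; the `RL` values are made up for the example:

```python
import pandas as pd

long_view = pd.DataFrame({
    "MSZoning": ["FV", "FV", "RL", "RL"],
    "YrSold": [2006, 2007, 2006, 2007],
    # RL medians are hypothetical placeholders.
    "SalePrice_median": [185000.0, 208900.0, 160000.0, 165000.0],
})

# One row per zoning class, one column per year.
wide = long_view.pivot(index="MSZoning", columns="YrSold", values="SalePrice_median")
print(wide)
# `wide` can then be passed to a heatmap function, e.g. seaborn's sns.heatmap(wide).
```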

## Create

### Category aggregations

### Continuous bins

### Cluster categories

## Using pandas-profiling

In [None]:
profile = ProfileReport(df, title="Pandas Profiling Report", correlations={"cramers": {"calculate": False}})
profile.to_file(REPORTS_FOLDER + "EDA_01.html")
profile.to_notebook_iframe()


# Data processing
Put the core of the notebook here. Feel free to further split this section into subsections.
