# Business and data understanding

## Purpose
This notebook covers the business and data understanding phase, following [Studer et al., 2020](https://arxiv.org/abs/2003.05155), "Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology".

## Methodology
Besides the methodology described by Studer et al. (2020), I use the [EDA framework proposed by Tony Ojeda](https://www.youtube.com/watch?v=YEBRkLo568Q).

## WIP - improvements

## Results

## Suggested next steps
- [ ] The 'cardinalidade' function failed on 16 attributes. Next step: investigate the cause.


# Setup

## Library import
We import all the required Python libraries.

In [20]:
import os

# Data manipulation
import pandas as pd
import numpy as np

# Options for pandas
pd.options.display.max_columns = None
pd.options.display.max_rows = 100

# Visualizations
import cufflinks as cf
import matplotlib.pyplot as plt  # pyplot, not the top-level package, provides the plt API
import seaborn as sns
import plotly
import plotly.graph_objs as go
import plotly.offline as ply

os.chdir('../')
from src.utils.data_describe import breve_descricao, serie_nulos, cardinalidade
os.chdir('./notebooks/')


plotly.offline.init_notebook_mode(connected=True)
cf.go_offline(connected=True)
cf.set_config_file(theme='white')

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2

# Parameter definition
We set all relevant parameters for our notebook. By convention, parameters are uppercase, while all other variables follow Python's naming guidelines.

In [16]:
RAW_FOLDER = '../data/raw/'
RANDOM_STATE = 42


# Data import
We retrieve all the required data for the analysis.

In [17]:
df = pd.read_csv(RAW_FOLDER + 'train.csv')
df.shape

(1460, 81)

## Initial evaluation

In [40]:
# Data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [52]:
serie_nulos(df, corte=0.5)

4 atributos/features/campos possuem mais de 0.5 de valores nulos.


PoolQC         0.995205
MiscFeature    0.963014
Alley          0.937671
Fence          0.807534
dtype: float64
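
The source of `serie_nulos` is not shown in this notebook, so the following is only a minimal sketch of how such a null-fraction summary could be computed with plain pandas. The function name `null_fraction`, its `threshold` parameter, and the toy data are assumptions made for illustration; they are not the project's implementation.

```python
import pandas as pd


def null_fraction(frame: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return the fraction of nulls per column, keeping columns above `threshold`."""
    fractions = frame.isna().mean()            # share of missing values per column
    above = fractions[fractions > threshold]   # keep only the sparse columns
    return above.sort_values(ascending=False)


# Toy data standing in for the real training set (values are made up).
toy = pd.DataFrame({
    "a": [1, None, None, None],
    "b": [1, 2, None, 4],
    "c": [1, 2, 3, 4],
})
result = null_fraction(toy, threshold=0.5)
print(result)
```

Taking the mean of the boolean mask from `isna()` gives the fraction directly, which avoids dividing counts by `len(frame)` by hand.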

In [47]:
lst_bad_columns = []
lst_good_columns = []

for column in df.select_dtypes(include='object').columns:
    try:
        cardinalidade(df[[column]])
        lst_good_columns.append(column)
    except Exception as e:
        lst_bad_columns.append(column)
        
print(f"""
Using the function 'cardinalidade':
- {len(lst_bad_columns)} columns could not be analyzed;
- {len(lst_good_columns)} columns could be analyzed.
""")


Using the function 'cardinalidade':
- 16 columns could not be analyzed;
- 27 columns could be analyzed.
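
The loop above discards the exception, so the cause of the 16 failures stays hidden. A possible variation, sketched below, keeps the error type and message for each failing column. The helper `probe_columns` and the stand-in function `fragile` are hypothetical; `fragile` only imitates one way a cardinality function might fail and is not the real `cardinalidade`.

```python
import pandas as pd


def probe_columns(frame, func):
    """Apply `func` to each single-column frame; map failing columns to their error."""
    ok, failed = [], {}
    for column in frame.columns:
        try:
            func(frame[[column]])
            ok.append(column)
        except Exception as exc:  # broad on purpose: we are diagnosing, not handling
            failed[column] = f"{type(exc).__name__}: {exc}"
    return ok, failed


def fragile(frame):
    """Hypothetical stand-in: sorting a mix of strings and missing values raises."""
    return {name: sorted(frame[name].unique().tolist()) for name in frame.columns}


# Toy data (made up): one complete column, one with a missing value.
toy = pd.DataFrame({"full": ["a", "b", "a"], "gaps": ["x", None, "y"]})
ok, failed = probe_columns(toy, fragile)
print(ok)
print(failed)
```

Printing `failed` for the real data would show whether all failing columns share the same exception, which narrows down the investigation.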



In [48]:
cardinalidade(df[lst_good_columns])

| | Atributo | Cardinalidade | Valores |
|---|---|---|---|
| 21 | CentralAir | 2 | [N, Y] |
| 1 | Street | 2 | [Grvl, Pave] |
| 4 | Utilities | 2 | [AllPub, NoSeWa] |
| 6 | LandSlope | 3 | [Gtl, Mod, Sev] |
| 24 | PavedDrive | 3 | [N, P, Y] |
| 16 | ExterQual | 4 | [Ex, Fa, Gd, TA] |
| 22 | KitchenQual | 4 | [Ex, Fa, Gd, TA] |
| 3 | LandContour | 4 | [Bnk, HLS, Low, Lvl] |
| 2 | LotShape | 4 | [IR1, IR2, IR3, Reg] |
| 10 | BldgType | 5 | [1Fam, 2fmCon, Duplex, Twnhs, TwnhsE] |


In [50]:
df[lst_bad_columns].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Alley         91 non-null     object
 1   MasVnrType    1452 non-null   object
 2   BsmtQual      1423 non-null   object
 3   BsmtCond      1423 non-null   object
 4   BsmtExposure  1422 non-null   object
 5   BsmtFinType1  1423 non-null   object
 6   BsmtFinType2  1422 non-null   object
 7   Electrical    1459 non-null   object
 8   FireplaceQu   770 non-null    object
 9   GarageType    1379 non-null   object
 10  GarageFinish  1379 non-null   object
 11  GarageQual    1379 non-null   object
 12  GarageCond    1379 non-null   object
 13  PoolQC        7 non-null      object
 14  Fence         281 non-null    object
 15  MiscFeature   54 non-null     object
dtypes: object(16)
memory usage: 182.6+ KB
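
Every one of the 16 columns listed above contains null values, which suggests (but does not prove) that missing values are what breaks `cardinalidade`. As a sketch of a null-tolerant alternative, the hypothetical `cardinality_table` below drops nulls before sorting the distinct values. The output column names mirror the table shown earlier; everything else, including the toy data, is an assumption.

```python
import pandas as pd


def cardinality_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Count distinct non-null values per column and list them in sorted order."""
    rows = []
    for name in frame.columns:
        # Nulls are removed first so that sorting only compares like with like.
        values = sorted(frame[name].dropna().unique().tolist())
        rows.append({"Atributo": name, "Cardinalidade": len(values), "Valores": values})
    return pd.DataFrame(rows).sort_values("Cardinalidade", ignore_index=True)


# Toy data (made up) with the kind of gaps seen in the failing columns.
toy = pd.DataFrame({
    "Alley": ["Grvl", None, "Pave", None],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "PoolQC": [None, None, "Ex", None],
})
table = cardinality_table(toy)
print(table)
```

Whether nulls should be dropped or counted as their own category is a modelling decision; this sketch drops them only to keep the sort well defined.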


### Partial conclusions:
- Of the 81 attributes, the data types are:
 - float64(3), int64(35), object(43)

- Four attributes have more than 50% null values:
 - PoolQC         0.995205
 - MiscFeature    0.963014
 - Alley          0.937671
 - Fence          0.807534
 
- It was not possible to use the 'cardinalidade' function on 16 attributes. <- Next step.

# Data processing
Put the core of the notebook here. Feel free to further split this section into subsections.
