# **Exploratory Analysis**

- Inspect the dataset
- Descriptive statistics
- Data distributions (especially the target)
- Preprocess the data to support analysis and extract insights (correlations, etc.)

## Libraries

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

## Functions

In [18]:
# Read a table from, or write a DataFrame to, a SQLite database
def manage_db(operation, table_name, db_path, df=None):
    """
    Read from or write to a SQLite database.

    Parameters:
    operation (str): 'r' to read or 'w' to write.
    table_name (str): Name of the table to read or write.
    db_path (str): Path to the SQLite database file (e.g. 'data/house_prices.db').
    df (pd.DataFrame, optional): DataFrame to write to the table (required when operation is 'w').

    Returns:
    pd.DataFrame if operation is 'r'. None if operation is 'w'.

    Raises:
    ValueError: if the operation is invalid or no DataFrame is given for a write.
    """

    # Validate the arguments before opening a connection
    if operation not in ('r', 'w') or (operation == 'w' and df is None):
        raise ValueError("Invalid operation, or no DataFrame provided for writing.")

    # Connect to the database
    conn = sqlite3.connect(db_path)
    try:
        # Read: return the whole table as a DataFrame
        if operation == 'r':
            # table_name is interpolated into the SQL, so it must come from trusted code
            query = f"SELECT * FROM {table_name}"
            return pd.read_sql(query, conn)

        # Write: save df, replacing the table if it already exists
        df.to_sql(table_name, conn, if_exists='replace', index=False)
        return None
    finally:
        # Always close the connection, even if the read or write fails
        conn.close()
# ---------------------------------------------------------------------------------------

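# Convert the dataset's columns to the given types
def convert_column_types(dataframe, column_types: dict):
    """Return a copy of `dataframe` with each column in `column_types` converted."""
    df = dataframe.copy()

    for column, dtype in column_types.items():
        if dtype in (int, float):
            # Unparseable values become NaN, so an int column containing any NaN ends up float64
            df[column] = pd.to_numeric(df[column], errors='coerce')
        else:
            df[column] = df[column].astype(dtype)
    return df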


## Loading data

In [5]:
df_train = manage_db(operation='r', table_name='df_train', db_path='data/house_prices.db')
df_train.head()

| | Type | Region | MunicipalityCode | Prefecture | Municipality | DistrictName | NearestStation | TimeToNearestStation | MinTimeToNearestStation | MaxTimeToNearestStation | ... | Breadth | CityPlanning | CoverageRatio | FloorAreaRatio | Period | Year | Quarter | Renovation | Remarks | TradePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pre-owned Condominiums etc. | | 13103 | Tokyo | Minato Ward | Kaigan | Takeshiba | 1 | 1.0 | 1.0 | ... | | Quasi-industrial Zone | 60.0 | 400.0 | 1st quarter 2011 | 2011 | 1 | Done | | 24000000 |
| 1 | Residential Land(Land and Building) | Residential Area | 13120 | Tokyo | Nerima Ward | Nishiki | Kamiitabashi | 15 | 15.0 | 15.0 | ... | 4.0 | Category I Exclusively Low-story Residential Zone | 60.0 | 200.0 | 3rd quarter 2013 | 2013 | 3 | | Dealings including private road | 51000000 |
| 2 | Residential Land(Land Only) | Residential Area | 13201 | Tokyo | Hachioji City | Shimoongatamachi | Takao (Tokyo) | 1H-1H30 | 60.0 | 90.0 | ... | 4.5 | Category I Exclusively Low-story Residential Zone | 40.0 | 80.0 | 4th quarter 2007 | 2007 | 4 | | | 14000000 |
| 3 | Pre-owned Condominiums etc. | | 13208 | Tokyo | Chofu City | Kamiishiwara | Nishichofu | 16 | 16.0 | 16.0 | ... | | Quasi-industrial Zone | 60.0 | 200.0 | 2nd quarter 2015 | 2015 | 2 | Not yet | | 23000000 |
| 4 | Residential Land(Land Only) | Residential Area | 13117 | Tokyo | Kita Ward | Shimo | Shimo | 6 | 6.0 | 6.0 | ... | 4.5 | Category I Exclusively Medium-high Residential... | 60.0 | 200.0 | 4th quarter 2015 | 2015 | 4 | | | 33000000 |


## General Dataset Information

- Dimensions
- Column names
- Descriptive statistics
- Nulls, duplicates, and outliers
- Type checks

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 325260 entries, 0 to 325259
Data columns (total 37 columns):
 #   Column                       Non-Null Count   Dtype 
---  ------                       --------------   ----- 
 0   Type                         325260 non-null  object
 1   Region                       325260 non-null  object
 2   MunicipalityCode             325260 non-null  object
 3   Prefecture                   325260 non-null  object
 4   Municipality                 325260 non-null  object
 5   DistrictName                 325260 non-null  object
 6   NearestStation               325260 non-null  object
 7   TimeToNearestStation         325260 non-null  object
 8   MinTimeToNearestStation      325260 non-null  object
 9   MaxTimeToNearestStation      325260 non-null  object
 10  FloorPlan                    325260 non-null  object
 11  Area                         325260 non-null  object
 12  AreaIsGreaterFlag            325260 non-null  object
 13  UnitPrice     

### Type checks

All columns are stored as strings (pandas dtype 'object'). Let's convert them to the correct types.

In [19]:
# Map each column to its correct type
column_types = {
    'Type': str,
    'Region': str,
    'MunicipalityCode': int,
    'Prefecture': str,
    'Municipality': str,
    'DistrictName': str,
    'NearestStation': str,
    'TimeToNearestStation': str,
    'MinTimeToNearestStation': int,
    'MaxTimeToNearestStation': int,
    'FloorPlan': str,
    'Area': int,
    'AreaIsGreaterFlag': int,
    'UnitPrice': float,
    'PricePerTsubo': float,
    'LandShape': str,
    'Frontage': float,
    'FrontageIsGreaterFlag': int,
    'TotalFloorArea': int,
    'TotalFloorAreaIsGreaterFlag': int,
    'BuildingYear': int,
    'PrewarBuilding': int,
    'Structure': str,
    'Use': str,
    'Purpose': str,
    'Direction': str,
    'Classification': str,
    'Breadth': float,
    'CityPlanning': str,
    'CoverageRatio': int,
    'FloorAreaRatio': int,
    'Period': str,
    'Year': int,
    'Quarter': int,
    'Renovation': str,
    'Remarks': str,
    'TradePrice': int
}

df_train = convert_column_types(df_train, column_types)
df_train.dtypes

Type                            object
Region                          object
MunicipalityCode                 int64
Prefecture                      object
Municipality                    object
DistrictName                    object
NearestStation                  object
TimeToNearestStation            object
MinTimeToNearestStation        float64
MaxTimeToNearestStation        float64
FloorPlan                       object
Area                             int64
AreaIsGreaterFlag                int64
UnitPrice                      float64
PricePerTsubo                  float64
LandShape                       object
Frontage                       float64
FrontageIsGreaterFlag          float64
TotalFloorArea                 float64
TotalFloorAreaIsGreaterFlag      int64
BuildingYear                   float64
PrewarBuilding                   int64
Structure                       object
Use                             object
Purpose                         object
Direction                
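
Several columns mapped to `int` above come back as `float64` (for example `MinTimeToNearestStation` and `BuildingYear`). This is what `pd.to_numeric(..., errors='coerce')` does when a value cannot be parsed: the value becomes `NaN`, and NumPy integer columns cannot hold `NaN`. The sketch below reproduces the effect on made-up values (not taken from the dataset) and shows pandas' nullable `Int64` dtype as one possible way to keep integers alongside missing values:

```python
import pandas as pd

# Toy values for illustration only; "1H-1H30" stands in for any non-numeric string
raw = pd.Series(["15", "60", "1H-1H30", None])

# Unparseable and missing entries become NaN, which forces a float64 column
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced.dtype)         # float64
print(coerced.isna().sum())  # 2

# Optional: the nullable Int64 dtype keeps integers and marks missing values as <NA>
nullable = coerced.astype("Int64")
print(nullable.dtype)        # Int64
```

Whether to keep `float64` or switch to `Int64` depends on what the downstream plotting and modelling code expects.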

### Descriptive Statistics
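
A minimal sketch of this step, using a small made-up frame in place of `df_train` (the values are illustrative, not dataset rows). On a DataFrame, `describe()` summarises the numeric columns by default; on a text column it reports count, unique, top and freq instead:

```python
import pandas as pd

# Toy stand-in for df_train; values are illustrative only
toy = pd.DataFrame({
    "Type": ["Land", "Condo", "Land", "Condo"],
    "Area": [100, 50, 200, 50],
    "TradePrice": [10_000_000, 20_000_000, 30_000_000, 40_000_000],
})

# Numeric summary: count, mean, std, min, quartiles, max
numeric_summary = toy.describe()
print(numeric_summary)

# Summary of a text column: count, unique, top, freq
type_summary = toy["Type"].describe()
print(type_summary)

# Price targets are often right-skewed, so skewness is worth checking on the real data
print(toy["TradePrice"].skew())
```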

### Handling Nulls and Duplicates
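
One way this step could look, sketched on made-up data rather than `df_train`. Since `df_train.info()` reported the listed columns as fully non-null while every value was a string, the sketch assumes that missing values may be stored as empty strings and converts them to real `NaN` first. That assumption should be verified against the actual data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train; values are illustrative only
toy = pd.DataFrame({
    "Region": ["Residential Area", "", "Residential Area", ""],
    "Breadth": [4.0, np.nan, 4.0, np.nan],
    "TradePrice": [51_000_000, 24_000_000, 51_000_000, 23_000_000],
})

# Assumption: blanks in text columns mean "missing", so turn them into NaN
toy = toy.replace("", np.nan)

# Share of missing values per column, largest first
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)

# Count fully duplicated rows, then drop them (the first occurrence is kept)
n_duplicates = toy.duplicated().sum()
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```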

### Handling Outliers
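
A common starting point is Tukey's IQR rule, sketched here on made-up prices (not dataset rows). The 1.5 multiplier and the choice to flag rather than drop are assumptions; for a heavily skewed target, applying the rule to the log of the price may be more appropriate.

```python
import pandas as pd

# Toy prices for illustration only; the last value is an obvious outlier
prices = pd.Series([10, 12, 14, 16, 18, 20, 22, 500], name="TradePrice")

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of flagged rows; nothing is removed at this point
is_outlier = (prices < lower) | (prices > upper)
print(prices[is_outlier].tolist())  # [500]
```

Flagging first makes it possible to inspect the extreme rows before deciding whether to remove, cap, or keep them.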