# **Creating the Database**

This notebook contains the commands needed to create a SQL database in SQLite and load the data from the provided files into it.

**Provided files:**
- *train_houses.xlsx*
- *test_houses.xlsx*: has no values in the price column (the target).

## Libraries

In [16]:
import pandas as pd
import sqlite3
import re

## Functions

In [17]:
# Function to format the raw files
def format_table(df):
    # Step 1: Replace each comma followed by whitespace with a single space
    df = df.apply(lambda col: col.map(lambda x: re.sub(r',\s', ' ', x) if isinstance(x, str) else x))

    # Step 2: Replace commas that are outside double quotes and not followed by whitespace with ';'
    df = df.apply(lambda col: col.map(lambda x: re.sub(r',(?=(?:[^"]*"[^"]*")*[^"]*$)(?!\s)', ';', x) if isinstance(x, str) else x))

    # Step 3: Split the values on ';' into columns
    # Join all columns of each row into a single string before splitting
    df_combined = df.apply(lambda row: ';'.join(row.dropna().astype(str)), axis=1)
    data = df_combined.str.split(';', expand=True)

    # Step 4: Use the first row as the header
    columns = data.iloc[0]
    data = data[1:]
    data.columns = columns

    return data

## Loading the Data

In [18]:
df_train = pd.read_excel('data/train_houses.xlsx', header=None)
df_train.head()

Unnamed: 0,0
0,"Type,Region,MunicipalityCode,Prefecture,Munici..."
1,"Pre-owned Condominiums, etc.,,13103,Tokyo,Mina..."
2,"Residential Land(Land and Building),Residentia..."
3,"Residential Land(Land Only),Residential Area,1..."
4,"Pre-owned Condominiums, etc.,,13208,Tokyo,Chof..."


In [19]:
df_test = pd.read_excel('data/test_houses.xlsx', header=None)
df_test.head()

Unnamed: 0,0
0,"Type,Region,MunicipalityCode,Prefecture,Munici..."
1,"Pre-owned Condominiums, etc.,,13103,Tokyo,Mina..."
2,"Pre-owned Condominiums, etc.,,13110,Tokyo,Megu..."
3,"Pre-owned Condominiums, etc.,,13112,Tokyo,Seta..."
4,"Pre-owned Condominiums, etc.,,13121,Tokyo,Adac..."


## Formatting the Data

Not every comma in the dataset is a column delimiter. Two kinds of commas belong to the field values themselves:
- commas followed by a space: *example, etc*
- commas inside a double-quoted expression: *"City, Province"*
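
The two rules above can be sketched on a single raw line. This is a minimal illustration that applies the same two regular expressions used in `format_table` to a plain string instead of a DataFrame. The helper name `split_row` and the sample line (including the quoted `"Minato,Kaigan"` field) are made up for the example and are not taken from the dataset.

```python
import re

# Same patterns as in format_table, applied to one string instead of a DataFrame.
COMMA_SPACE = re.compile(r',\s')
DELIMITER_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)(?!\s)')


def split_row(line):
    """Hypothetical helper: split one raw line into its fields."""
    # Rule 1: a comma followed by whitespace is part of the text, not a delimiter.
    line = COMMA_SPACE.sub(' ', line)
    # Rule 2: only commas outside double quotes become the ';' delimiter.
    line = DELIMITER_COMMA.sub(';', line)
    return line.split(';')


# Made-up sample line, loosely modeled on the raw rows shown earlier.
sample = 'Pre-owned Condominiums, etc.,,13103,Tokyo,"Minato,Kaigan",Takeshiba'
fields = split_row(sample)
print(fields)
```

In this sketch the comma inside the quoted field survives, the empty value between two consecutive commas becomes an empty string, and the double quotes themselves stay in the field because neither pattern removes them.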

In [20]:
# Apply the formatting function to df_train
df_train_formatted = format_table(df_train)

print("Formatted TRAIN DataFrame:")
df_train_formatted.sample(5)

Formatted TRAIN DataFrame:


Unnamed: 0,Type,Region,MunicipalityCode,Prefecture,Municipality,DistrictName,NearestStation,TimeToNearestStation,MinTimeToNearestStation,MaxTimeToNearestStation,...,Breadth,CityPlanning,CoverageRatio,FloorAreaRatio,Period,Year,Quarter,Renovation,Remarks,TradePrice
200950,Residential Land(Land and Building),Residential Area,13213,Tokyo,Higashimurayama City,Akitsucho,Shin-akitsu,17,17.0,17.0,...,5.0,Category I Exclusively Low-story Residential Zone,40.0,80.0,3rd quarter 2010,2010,3,,,41000000
131235,Residential Land(Land Only),Residential Area,13108,Tokyo,Koto Ward,Kameido,Kameidosuijin,2,2.0,2.0,...,22.5,Quasi-industrial Zone,80.0,400.0,1st quarter 2007,2007,1,,,90000000
279196,Residential Land(Land Only),Residential Area,13212,Tokyo,Hino City,Higashitoyoda,Toyota,18,18.0,18.0,...,6.0,Category I Exclusively Low-story Residential Zone,40.0,80.0,4th quarter 2013,2013,4,,,35000000
274722,Pre-owned Condominiums etc.,,13115,Tokyo,Suginami Ward,Wada,Higashikoenji,7,7.0,7.0,...,,Category I Exclusively Medium-high Residential...,60.0,200.0,4th quarter 2010,2010,4,Done,,31000000
246241,Pre-owned Condominiums etc.,,13107,Tokyo,Sumida Ward,Higashimukojima,Higashimukojima,2,2.0,2.0,...,,Commercial Zone,80.0,500.0,4th quarter 2013,2013,4,Not yet,,18000000


In [21]:
# Apply the formatting function to df_test
df_test_formatted = format_table(df_test)

print("Formatted TEST DataFrame:")
df_test_formatted.sample(5)

Formatted TEST DataFrame:


Unnamed: 0,Type,Region,MunicipalityCode,Prefecture,Municipality,DistrictName,NearestStation,TimeToNearestStation,MinTimeToNearestStation,MaxTimeToNearestStation,...,Breadth,CityPlanning,CoverageRatio,FloorAreaRatio,Period,Year,Quarter,Renovation,Remarks,TradePrice
23067,Residential Land(Land and Building),Residential Area,13207,Tokyo,Akishima City,Mihoricho,Seibutachikawa,2,2,2,...,5.0,Category I Exclusively Medium-high Residential...,60,200,3rd quarter 2014,2014,3,,,
37125,Pre-owned Condominiums etc.,,13123,Tokyo,Edogawa Ward,Minamikasai,Kasairinkaikoen,26,26,26,...,,Category I Residential Zone,60,200,1st quarter 2011,2011,1,Not yet,,
1646,Residential Land(Land and Building),Residential Area,13112,Tokyo,Setagaya Ward,Shimmachi,Sakurashinmachi,9,9,9,...,4.0,Neighborhood Commercial Zone,80,300,2nd quarter 2008,2008,2,,,
3624,Pre-owned Condominiums etc.,,13201,Tokyo,Hachioji City,Kitanomachi,Kitano (Tokyo),9,9,9,...,,Quasi-industrial Zone,60,200,1st quarter 2009,2009,1,Not yet,,
48745,Pre-owned Condominiums etc.,,13114,Tokyo,Nakano Ward,Yayoicho,Nakanoshinbashi,4,4,4,...,,Category I Residential Zone,60,200,3rd quarter 2006,2006,3,Not yet,,


## Creating the Connection to the SQL Database

In [22]:
# Connect to the SQLite database (the file is created if it does not exist)
conn = sqlite3.connect('data/house_prices.db')
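
As a side note on connection handling, the sketch below shows one way to make sure a SQLite connection is closed even if a later step raises an exception. It uses an in-memory database (`":memory:"`) and a hypothetical `demo` table purely for illustration; the notebook itself connects to `data/house_prices.db`. Note that `sqlite3.connect` creates a missing database file but not a missing directory, so the `data/` folder has to exist beforehand.

```python
import sqlite3
from contextlib import closing

# ":memory:" is a throwaway in-memory database used only for this illustration.
with closing(sqlite3.connect(":memory:")) as conn:
    # Using the connection as a context manager commits on success
    # and rolls back if an exception is raised inside the block.
    with conn:
        conn.execute("CREATE TABLE demo (id INTEGER, name TEXT)")  # hypothetical table
        conn.execute("INSERT INTO demo VALUES (?, ?)", (1, "example"))
    rows = conn.execute("SELECT id, name FROM demo").fetchall()

# At this point the connection has been closed by contextlib.closing.
print(rows)
```

The `with conn:` block only manages the transaction, which is why `contextlib.closing` is added to close the connection itself.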

## Loading the Data into the SQL Database

In [23]:
# Save to the database as table 'df_train'
df_train_formatted.to_sql('df_train', conn, if_exists='replace', index=False)

# Save to the database as table 'df_test'
df_test_formatted.to_sql('df_test', conn, if_exists='replace', index=False)

81314
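
The lone `81314` above is the value of the cell's last expression, that is, the return value of the second `to_sql` call. Recent pandas versions return the number of rows written, so it presumably reflects the size of `df_test`. A quick way to cross-check both tables directly in SQLite is sketched below. The helper `table_row_counts` is hypothetical, and the in-memory database with tiny stand-in tables only mimics the notebook's table names.

```python
import sqlite3


def table_row_counts(conn):
    """Hypothetical helper: return {table_name: row_count} for every table."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Table names come from sqlite_master, so quoting them is enough here.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# Stand-in data: an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df_train (Type TEXT, TradePrice TEXT)")
conn.execute("CREATE TABLE df_test (Type TEXT, TradePrice TEXT)")
conn.executemany("INSERT INTO df_train VALUES (?, ?)", [("A", "1"), ("B", "2")])
conn.execute("INSERT INTO df_test VALUES (?, ?)", ("C", None))

counts = table_row_counts(conn)
conn.close()
print(counts)
```

Run against the real connection, the counts could be compared with `len(df_train_formatted)` and `len(df_test_formatted)`.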

## Checking the Tables in the SQL Database

In [24]:
# Query the database to verify that the data was loaded correctly
query_train = "SELECT * FROM df_train LIMIT 5"
df_train_check = pd.read_sql(query_train, conn)

print("Data in table df_train:")
df_train_check

Data in table df_train:


Unnamed: 0,Type,Region,MunicipalityCode,Prefecture,Municipality,DistrictName,NearestStation,TimeToNearestStation,MinTimeToNearestStation,MaxTimeToNearestStation,...,Breadth,CityPlanning,CoverageRatio,FloorAreaRatio,Period,Year,Quarter,Renovation,Remarks,TradePrice
0,Pre-owned Condominiums etc.,,13103,Tokyo,Minato Ward,Kaigan,Takeshiba,1,1.0,1.0,...,,Quasi-industrial Zone,60.0,400.0,1st quarter 2011,2011,1,Done,,24000000
1,Residential Land(Land and Building),Residential Area,13120,Tokyo,Nerima Ward,Nishiki,Kamiitabashi,15,15.0,15.0,...,4.0,Category I Exclusively Low-story Residential Zone,60.0,200.0,3rd quarter 2013,2013,3,,Dealings including private road,51000000
2,Residential Land(Land Only),Residential Area,13201,Tokyo,Hachioji City,Shimoongatamachi,Takao (Tokyo),1H-1H30,60.0,90.0,...,4.5,Category I Exclusively Low-story Residential Zone,40.0,80.0,4th quarter 2007,2007,4,,,14000000
3,Pre-owned Condominiums etc.,,13208,Tokyo,Chofu City,Kamiishiwara,Nishichofu,16,16.0,16.0,...,,Quasi-industrial Zone,60.0,200.0,2nd quarter 2015,2015,2,Not yet,,23000000
4,Residential Land(Land Only),Residential Area,13117,Tokyo,Kita Ward,Shimo,Shimo,6,6.0,6.0,...,4.5,Category I Exclusively Medium-high Residential...,60.0,200.0,4th quarter 2015,2015,4,,,33000000


In [25]:
# Query the database to verify that the data was loaded correctly
query_test = "SELECT * FROM df_test LIMIT 5"
df_test_check = pd.read_sql(query_test, conn)

print("Data in table df_test:")
df_test_check

Data in table df_test:


Unnamed: 0,Type,Region,MunicipalityCode,Prefecture,Municipality,DistrictName,NearestStation,TimeToNearestStation,MinTimeToNearestStation,MaxTimeToNearestStation,...,Breadth,CityPlanning,CoverageRatio,FloorAreaRatio,Period,Year,Quarter,Renovation,Remarks,TradePrice
0,Pre-owned Condominiums etc.,,13103,Tokyo,Minato Ward,Toranomon,Kamiyacho,4,4,4,...,,Commercial Zone,80,500,3rd quarter 2016,2016,3,Not yet,,
1,Pre-owned Condominiums etc.,,13110,Tokyo,Meguro Ward,Higashiyama,Ikejiriohashi,7,7,7,...,,Category I Residential Zone,60,300,3rd quarter 2012,2012,3,,,
2,Pre-owned Condominiums etc.,,13112,Tokyo,Setagaya Ward,Kitakarasuyama,Chitosekarasuyama,25,25,25,...,,Category I Exclusively Low-story Residential Zone,50,100,4th quarter 2015,2015,4,Done,,
3,Pre-owned Condominiums etc.,,13121,Tokyo,Adachi Ward,Ayase,Ayase,4,4,4,...,,Commercial Zone,80,500,2nd quarter 2017,2017,2,Done,,
4,Residential Land(Land and Building),Residential Area,13107,Tokyo,Sumida Ward,Honjo,Honjoazumabashi,7,7,7,...,6.0,Neighborhood Commercial Zone,80,300,3rd quarter 2016,2016,3,,,
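
Because `format_table` builds every column with `str.split`, the values are strings, and the tables are therefore likely stored with `TEXT` columns. This would also explain why the same field shows up as `17.0` in one table and `17` in the other. The sketch below shows one way to inspect the declared column types with `PRAGMA table_info` and to cast a text column inside a query. The in-memory table and its three rows are stand-ins, not the notebook's actual schema.

```python
import sqlite3

# Stand-in table: a few columns declared as TEXT, as string data would be.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df_train (Type TEXT, Year TEXT, TradePrice TEXT)")
conn.executemany(
    "INSERT INTO df_train VALUES (?, ?, ?)",
    [
        ("Residential Land(Land Only)", "2007", "90000000"),
        ("Pre-owned Condominiums etc.", "2010", "31000000"),
    ],
)

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
column_types = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(df_train)")}

# Cast the text column to an integer before aggregating.
avg_price = conn.execute(
    "SELECT AVG(CAST(TradePrice AS INTEGER)) FROM df_train"
).fetchone()[0]

conn.close()
print(column_types)
print(avg_price)
```

An alternative would be to convert the numeric columns in pandas before calling `to_sql`, so that the types are set once at load time.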


In [26]:
# Close the database connection
conn.close()

print("Data successfully loaded into the SQLite database.")

Data successfully loaded into the SQLite database.
