# **Creating the Database**

Notebook with the commands needed to create a SQL database in SQLite and load the provided data files into it.

**Provided files:**
- *train_houses.xlsx*
- *test_houses.xlsx*

## Imports

In [1]:
import pandas as pd
import sqlite3

## Creating the connection to the SQL database

In [3]:
# Connect to the SQLite database (the file is created if it does not exist)
conn = sqlite3.connect('data/house_prices.db')
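
`sqlite3.connect` creates the database file when it is missing, but it does not create the folder that holds it; if `data/` does not exist, the call fails with `sqlite3.OperationalError`. A minimal sketch of a guard against that, assuming a hypothetical helper `connect_db` (not part of this notebook) and a temporary directory so that no real data is touched:

```python
import sqlite3
import tempfile
from pathlib import Path


def connect_db(db_path):
    """Open a SQLite connection, creating the parent folder first (hypothetical helper)."""
    path = Path(db_path)
    # sqlite3 creates the .db file on demand, but not missing directories.
    path.parent.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(path)


# Demonstration in a throwaway directory, standing in for the real 'data/' folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "data" / "house_prices.db"
    demo_conn = connect_db(demo_path)
    # Write something so the file certainly exists on disk before checking.
    demo_conn.execute("CREATE TABLE probe (x INTEGER)")
    demo_conn.commit()
    db_created = demo_path.exists()
    demo_conn.close()

print(db_created)
```

In this notebook the equivalent call would be `connect_db('data/house_prices.db')`.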

## Fixing the formatting of the raw data

- The raw data came as XLSX files.
- On visual inspection, however, the contents were comma-separated values, like a CSV.
- For some reason, reading the XLSX with Pandas and then writing it out as CSV did not work: Pandas did not recognize the ',' delimiter.
- So I opened the XLSX files in Excel,
- then used **Find and Replace** on every occurrence of ', ' (a comma followed by a space), which was what confused Pandas,
- and finally used **Save As** to save each XLSX file as CSV UTF-8.
- Now all that remains is to read the CSV files with Pandas and write them to the SQL database.
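
A quick way to confirm that Pandas is really splitting a file on ',' is to look at the number of columns it parsed: when the delimiter is not recognized, each line ends up in a single column. The sketch below uses small in-memory samples with made-up rows (not the real files). The quoted variant is only one example of how a line can fail to split; it is an assumption for illustration, not a diagnosis of the original files.

```python
import io

import pandas as pd

# Well-formed sample: three comma-separated fields per line (illustrative rows only).
good_csv = io.StringIO(
    "Type,Prefecture,Year\n"
    "Pre-owned Condominiums etc.,Tokyo,2011\n"
    "Residential Land(Land Only),Tokyo,2007\n"
)

# Hypothetical broken sample: each line is wrapped in quotes, so the commas
# are treated as part of one field instead of as delimiters.
bad_csv = io.StringIO(
    '"Type,Prefecture,Year"\n'
    '"Pre-owned Condominiums etc.,Tokyo,2011"\n'
)

df_good = pd.read_csv(good_csv, sep=",")
df_bad = pd.read_csv(bad_csv, sep=",")

# shape[1] is the number of columns Pandas actually parsed.
n_good_columns = df_good.shape[1]
n_bad_columns = df_bad.shape[1]
print(n_good_columns, n_bad_columns)
```

Running the same `shape[1]` check on `df_train` and `df_test` right after `pd.read_csv` gives an early warning before anything is written to the database.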

In [60]:
# Read the corrected training CSV file
df_train = pd.read_csv('data/train_houses_corrigido.csv', sep=',', encoding='utf-8')

df_train.head()

Unnamed: 0,Type,Region,MunicipalityCode,Prefecture,Municipality,DistrictName,NearestStation,TimeToNearestStation,MinTimeToNearestStation,MaxTimeToNearestStation,...,Breadth,CityPlanning,CoverageRatio,FloorAreaRatio,Period,Year,Quarter,Renovation,Remarks,TradePrice
0,Pre-owned Condominiums etc.,,13103.0,Tokyo,Minato Ward,Kaigan,Takeshiba,1,1.0,1.0,...,,Quasi-industrial Zone,60.0,400.0,1st quarter 2011,2011.0,1.0,Done,,24000000.0
1,Residential Land(Land and Building),Residential Area,13120.0,Tokyo,Nerima Ward,Nishiki,Kamiitabashi,15,15.0,15.0,...,4.0,Category I Exclusively Low-story Residential Zone,60.0,200.0,3rd quarter 2013,2013.0,3.0,,Dealings including private road,51000000.0
2,Residential Land(Land Only),Residential Area,13201.0,Tokyo,Hachioji City,Shimoongatamachi,Takao (Tokyo),1H-1H30,60.0,90.0,...,4.5,Category I Exclusively Low-story Residential Zone,40.0,80.0,4th quarter 2007,2007.0,4.0,,,14000000.0
3,Pre-owned Condominiums etc.,,13208.0,Tokyo,Chofu City,Kamiishiwara,Nishichofu,16,16.0,16.0,...,,Quasi-industrial Zone,60.0,200.0,2nd quarter 2015,2015.0,2.0,Not yet,,23000000.0
4,Residential Land(Land Only),Residential Area,13117.0,Tokyo,Kita Ward,Shimo,Shimo,6,6.0,6.0,...,4.5,Category I Exclusively Medium-high Residential...,60.0,200.0,4th quarter 2015,2015.0,4.0,,,33000000.0


In [61]:
# Read the corrected test CSV file
df_test = pd.read_csv('data/test_houses_corrigido.csv', sep=',', encoding='utf-8')

df_test.head()

Unnamed: 0,Type,Region,MunicipalityCode,Prefecture,Municipality,DistrictName,NearestStation,TimeToNearestStation,MinTimeToNearestStation,MaxTimeToNearestStation,...,Breadth,CityPlanning,CoverageRatio,FloorAreaRatio,Period,Year,Quarter,Renovation,Remarks,TradePrice
0,Pre-owned Condominiums etc.,,13103.0,Tokyo,Minato Ward,Toranomon,Kamiyacho,4.0,4.0,4.0,...,,Commercial Zone,80.0,500.0,3rd quarter 2016,2016.0,3.0,Not yet,,
1,Pre-owned Condominiums etc.,,13110.0,Tokyo,Meguro Ward,Higashiyama,Ikejiriohashi,7.0,7.0,7.0,...,,Category I Residential Zone,60.0,300.0,3rd quarter 2012,2012.0,3.0,,,
2,Pre-owned Condominiums etc.,,13112.0,Tokyo,Setagaya Ward,Kitakarasuyama,Chitosekarasuyama,25.0,25.0,25.0,...,,Category I Exclusively Low-story Residential Zone,50.0,100.0,4th quarter 2015,2015.0,4.0,Done,,
3,Pre-owned Condominiums etc.,,13121.0,Tokyo,Adachi Ward,Ayase,Ayase,4.0,4.0,4.0,...,,Commercial Zone,80.0,500.0,2nd quarter 2017,2017.0,2.0,Done,,
4,"Residential Land(Land and Building),Residentia...",,,,,,,,,,...,,,,,,,,,,
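
Row 4 of the preview above holds what looks like a whole record inside `Type`, with every other column empty, which suggests that some lines may still not have been split. A hedged sketch of how such rows could be counted, using a small made-up frame and a hypothetical helper `find_unsplit_rows` (neither is part of the original notebook):

```python
import pandas as pd


def find_unsplit_rows(df, first_col="Type"):
    """Return rows whose other columns are all missing (hypothetical helper)."""
    other_cols = df.columns.drop(first_col)
    # A row that was not split keeps its full text in the first column
    # and leaves every remaining column as NaN.
    all_others_missing = df[other_cols].isna().all(axis=1)
    has_comma = df[first_col].astype(str).str.contains(",", regex=False)
    return df[all_others_missing & has_comma]


# Made-up frame: the second row imitates a record that stayed in one field.
demo = pd.DataFrame(
    {
        "Type": [
            "Pre-owned Condominiums etc.",
            "Residential Land(Land and Building),Residential Area,13120.0",
        ],
        "Prefecture": ["Tokyo", None],
        "Year": [2016.0, None],
    }
)

unsplit = find_unsplit_rows(demo)
print(len(unsplit))
```

Calling `find_unsplit_rows(df_test)` would show how many rows of the real table are affected, if any beyond the one visible in the preview.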


## Loading the data into the SQL database

In [63]:
# Save to the database
df_train.to_sql('df_train', conn, if_exists='replace', index=False)
df_test.to_sql('df_test', conn, if_exists='replace', index=False)

81314
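
The `81314` above is what the notebook displays for the last expression in the cell, so it is most likely the return value of the second `to_sql` call, which in recent Pandas versions is the number of rows written. Independently of that return value, the load can be checked by comparing `COUNT(*)` in the table with the length of the DataFrame. A minimal sketch with an in-memory database and a made-up frame (the names `conn_demo` and `df_demo` are illustrative):

```python
import sqlite3

import pandas as pd

# In-memory database and a made-up frame, so the sketch runs without the real files.
conn_demo = sqlite3.connect(":memory:")
df_demo = pd.DataFrame({"Type": ["A", "B", "C"], "TradePrice": [1.0, 2.0, None]})

df_demo.to_sql("df_demo", conn_demo, if_exists="replace", index=False)

# Compare the number of rows stored in the table with the DataFrame length.
rows_in_table = conn_demo.execute("SELECT COUNT(*) FROM df_demo").fetchone()[0]
counts_match = rows_in_table == len(df_demo)
print(rows_in_table, counts_match)

conn_demo.close()
```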

## Checking the tables in the SQL database

In [65]:
# Query the data in the df_train table
query_train = "SELECT * FROM df_train LIMIT 5"
df_train_check = pd.read_sql(query_train, conn)
print("Data from table df_train:")
df_train_check

Data from table df_train:


Unnamed: 0,Type,Region,MunicipalityCode,Prefecture,Municipality,DistrictName,NearestStation,TimeToNearestStation,MinTimeToNearestStation,MaxTimeToNearestStation,...,Breadth,CityPlanning,CoverageRatio,FloorAreaRatio,Period,Year,Quarter,Renovation,Remarks,TradePrice
0,Pre-owned Condominiums etc.,,13103.0,Tokyo,Minato Ward,Kaigan,Takeshiba,1,1.0,1.0,...,,Quasi-industrial Zone,60.0,400.0,1st quarter 2011,2011.0,1.0,Done,,24000000.0
1,Residential Land(Land and Building),Residential Area,13120.0,Tokyo,Nerima Ward,Nishiki,Kamiitabashi,15,15.0,15.0,...,4.0,Category I Exclusively Low-story Residential Zone,60.0,200.0,3rd quarter 2013,2013.0,3.0,,Dealings including private road,51000000.0
2,Residential Land(Land Only),Residential Area,13201.0,Tokyo,Hachioji City,Shimoongatamachi,Takao (Tokyo),1H-1H30,60.0,90.0,...,4.5,Category I Exclusively Low-story Residential Zone,40.0,80.0,4th quarter 2007,2007.0,4.0,,,14000000.0
3,Pre-owned Condominiums etc.,,13208.0,Tokyo,Chofu City,Kamiishiwara,Nishichofu,16,16.0,16.0,...,,Quasi-industrial Zone,60.0,200.0,2nd quarter 2015,2015.0,2.0,Not yet,,23000000.0
4,Residential Land(Land Only),Residential Area,13117.0,Tokyo,Kita Ward,Shimo,Shimo,6,6.0,6.0,...,4.5,Category I Exclusively Medium-high Residential...,60.0,200.0,4th quarter 2015,2015.0,4.0,,,33000000.0


In [66]:
# Query the data in the df_test table
query_test = "SELECT * FROM df_test LIMIT 5"
df_test_check = pd.read_sql(query_test, conn)
print("\nData from table df_test:")
df_test_check


Data from table df_test:


Unnamed: 0,Type,Region,MunicipalityCode,Prefecture,Municipality,DistrictName,NearestStation,TimeToNearestStation,MinTimeToNearestStation,MaxTimeToNearestStation,...,Breadth,CityPlanning,CoverageRatio,FloorAreaRatio,Period,Year,Quarter,Renovation,Remarks,TradePrice
0,Pre-owned Condominiums etc.,,13103.0,Tokyo,Minato Ward,Toranomon,Kamiyacho,4.0,4.0,4.0,...,,Commercial Zone,80.0,500.0,3rd quarter 2016,2016.0,3.0,Not yet,,
1,Pre-owned Condominiums etc.,,13110.0,Tokyo,Meguro Ward,Higashiyama,Ikejiriohashi,7.0,7.0,7.0,...,,Category I Residential Zone,60.0,300.0,3rd quarter 2012,2012.0,3.0,,,
2,Pre-owned Condominiums etc.,,13112.0,Tokyo,Setagaya Ward,Kitakarasuyama,Chitosekarasuyama,25.0,25.0,25.0,...,,Category I Exclusively Low-story Residential Zone,50.0,100.0,4th quarter 2015,2015.0,4.0,Done,,
3,Pre-owned Condominiums etc.,,13121.0,Tokyo,Adachi Ward,Ayase,Ayase,4.0,4.0,4.0,...,,Commercial Zone,80.0,500.0,2nd quarter 2017,2017.0,2.0,Done,,
4,"Residential Land(Land and Building),Residentia...",,,,,,,,,,...,,,,,,,,,,
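
Besides previewing rows, it can help to confirm which tables actually exist in the file. SQLite keeps that list in its built-in catalog `sqlite_master`. A minimal sketch against an in-memory database with two tiny stand-in tables (the one-row frames are made up for illustration):

```python
import sqlite3

import pandas as pd

# Stand-in database with the same table names used in this notebook.
conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({"x": [1]}).to_sql("df_train", conn_demo, index=False)
pd.DataFrame({"x": [2]}).to_sql("df_test", conn_demo, index=False)

# sqlite_master is SQLite's catalog of tables, indexes, and views.
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn_demo
)
table_names = tables["name"].tolist()
print(table_names)

conn_demo.close()
```

The same query can be run on `conn` before it is closed to list the tables of `data/house_prices.db`.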


In [67]:
# Close the connection to the database
conn.close()

print("Data loaded successfully into the SQLite database.")

Data loaded successfully into the SQLite database.
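
If a cell fails midway, `conn.close()` at the end is never reached. One alternative, shown here only as a sketch, is `contextlib.closing`, which closes the connection when the block exits. Note that using a `sqlite3` connection directly in a `with` statement manages transactions but does not close it. The table and variable names below are illustrative:

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() calls conn_demo.close() when the block exits, even after an exception.
with closing(sqlite3.connect(":memory:")) as conn_demo:
    pd.DataFrame({"x": [1, 2]}).to_sql("demo", conn_demo, index=False)
    n_rows = conn_demo.execute("SELECT COUNT(*) FROM demo").fetchone()[0]

# Using the connection after the block fails because it is already closed.
try:
    conn_demo.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True

print(n_rows, closed)
```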
