## Loading Data and Required Libraries

In [2]:
import pandas as pd
import numpy as np

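# Read the raw file; low_memory=False makes pandas infer each column's dtype from the whole file.
df = pd.read_csv('../raw/airbnb-dataset.csv', low_memory=False)
# Work on a copy so the raw DataFrame stays untouched.
df_silver = df.copy()
print("Dataset loaded successfully!")
df_silver.head()

Dataset loaded successfully!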


Unnamed: 0,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,...,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,...,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,...,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,...,$124,3.0,0.0,,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and...",
3,1002755,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,...,$74,30.0,270.0,7/5/2019,4.64,4.0,1.0,322.0,,
4,1003689,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,...,$41,10.0,9.0,11/19/2018,0.1,3.0,1.0,289.0,"Please no smoking in the house, porch or on th...",


# Data Cleaning

## Standardizing Column Names

In [3]:

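def clean_column_names(df):
    """Lower-case the column names and replace spaces with underscores (in place)."""
    new_cols = []
    for col in df.columns:
        new_col = col.lower()                # 'NAME' -> 'name'
        new_col = new_col.replace(' ', '_')  # 'host id' -> 'host_id'
        new_cols.append(new_col)
    df.columns = new_cols
    return df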


df_silver = clean_column_names(df_silver)


print(df_silver.columns)

Index(['id', 'name', 'host_id', 'host_identity_verified', 'host_name',
       'neighbourhood_group', 'neighbourhood', 'lat', 'long', 'country',
       'country_code', 'instant_bookable', 'cancellation_policy', 'room_type',
       'construction_year', 'price', 'service_fee', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'review_rate_number', 'calculated_host_listings_count',
       'availability_365', 'house_rules', 'license'],
      dtype='object')
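
The same renaming can also be written with the vectorized string methods of the column index. This is a minimal sketch of an alternative, not the notebook's own code: the helper name `clean_column_names_vectorized` and the empty sample frame are hypothetical. Unlike the loop above, it returns a renamed copy and leaves its input unchanged.

```python
import pandas as pd

def clean_column_names_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lower-case, underscore-separated column names."""
    out = df.copy()
    # Index.str applies the string operation to every column label at once.
    out.columns = out.columns.str.lower().str.replace(' ', '_', regex=False)
    return out

# Hypothetical sample using three of the raw column names shown earlier.
sample = pd.DataFrame(columns=['NAME', 'host id', 'availability 365'])
renamed = clean_column_names_vectorized(sample)
print(list(renamed.columns))
```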


## Removing Unnecessary Columns

Since almost every row of **license** is null, we chose not to work with that column. We also removed **country** and **country_code**, since we know that every listing is in the United States and carries the country code "US".

In [4]:

cols_to_drop = ['country', 'country_code', 'license']


df_silver.drop(columns=cols_to_drop, inplace=True)

print(df_silver.columns)

Index(['id', 'name', 'host_id', 'host_identity_verified', 'host_name',
       'neighbourhood_group', 'neighbourhood', 'lat', 'long',
       'instant_bookable', 'cancellation_policy', 'room_type',
       'construction_year', 'price', 'service_fee', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'review_rate_number', 'calculated_host_listings_count',
       'availability_365', 'house_rules'],
      dtype='object')
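
A decision like this can be supported by measuring, per column, the share of missing values and the number of distinct values. The sketch below runs on a toy frame with made-up values; the 0.5 threshold and the "single distinct value" rule are assumptions chosen for illustration, not criteria stated in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (made up) that mimics the situation described above.
toy = pd.DataFrame({
    'country': ['United States', 'United States', np.nan],
    'country_code': ['US', 'US', 'US'],
    'license': [np.nan, np.nan, 'ABC-123'],
    'price': [100.0, 250.0, 80.0],
})

null_share = toy.isna().mean()        # fraction of missing values per column
distinct = toy.nunique(dropna=True)   # number of distinct non-null values per column

# Hypothetical rule: flag columns that are mostly empty or carry a single value.
candidates = [c for c in toy.columns if null_share[c] > 0.5 or distinct[c] <= 1]
print(candidates)
```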


## Data Type Corrections

1) **price** and **service_fee** contain the special character "$" and are stored as strings, so we convert them to a numeric type (float).

In [6]:
# Strip the "$" sign, thousands separators, and surrounding whitespace, then cast to float.
df_silver['price'] = df_silver['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.strip().astype(float)
df_silver['service_fee'] = df_silver['service_fee'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.strip().astype(float)

# This cell also converts construction_year and host_identity_verified,
# which items 2 and 3 below handle again.
df_silver['construction_year'] = df_silver['construction_year'].astype('Int64')

df_silver['host_identity_verified'] = df_silver['host_identity_verified'].map({'verified': True, 'unconfirmed': False})

print("Data types after correction:")
print(df_silver[['price', 'service_fee', 'construction_year', 'host_identity_verified']].dtypes)

Data types after correction:
price                     float64
service_fee               float64
construction_year           Int64
host_identity_verified     object
dtype: object
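
Because the same cleaning chain is applied to two columns, it could be factored into a helper. This is a sketch under assumptions: the name `parse_currency` is hypothetical, and it swaps `astype(float)` for `pd.to_numeric(..., errors='coerce')`, which turns unparseable strings into NaN instead of raising. That is a deliberate change from the cell above.

```python
import pandas as pd

def parse_currency(series: pd.Series) -> pd.Series:
    """Convert strings such as '$1,060 ' to floats; missing values stay missing."""
    cleaned = (
        series.str.replace('$', '', regex=False)   # drop the currency sign
              .str.replace(',', '', regex=False)   # drop thousands separators
              .str.strip()                         # drop surrounding whitespace
    )
    # errors='coerce' maps anything that is not a number to NaN instead of raising.
    return pd.to_numeric(cleaned, errors='coerce')

# Toy input: two valid prices, one missing value, one unparseable string.
prices = pd.Series(['$193', '$1,060 ', None, 'n/a'])
parsed = parse_currency(prices)
print(parsed.tolist())
```

With such a helper, the two assignments would become `df_silver['price'] = parse_currency(df_silver['price'])` and the equivalent line for `service_fee`.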


2) **construction_year**: Storing the construction year as a float makes no sense, since a year never has a fractional part. We use the nullable `Int64` dtype.

In [7]:
df_silver['construction_year'] = df_silver['construction_year'].astype('Int64')

print(df_silver['construction_year'].dtype)

Int64
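
The capital "I" matters here: `Int64` is the nullable pandas integer dtype, whereas the plain NumPy `int64` cannot represent missing values. A minimal sketch on toy years (values made up for illustration):

```python
import numpy as np
import pandas as pd

years = pd.Series([2020.0, 2007.0, np.nan])   # the NaN forces a float dtype

# Nullable integer: whole numbers are kept, the gap becomes <NA>.
as_int = years.astype('Int64')
print(as_int.dtype)
print(as_int.isna().sum())

# The plain NumPy integer dtype refuses to convert the missing value.
try:
    years.astype('int64')
except ValueError as exc:
    print(f"plain int64 fails: {exc}")
```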


3) **host_identity_verified**: this column can be converted to a boolean (True/False), which is more efficient and semantically correct.

In [9]:


# Caution: host_identity_verified was already mapped to True/False in cell In [6].
# Here astype(str) turns those values into 'true', 'false', and 'nan', none of which
# is a key of the mapping below, so every row becomes <NA> (see the output).
df_silver['host_identity_verified'] = df_silver['host_identity_verified'].astype(str).str.lower().str.strip()

df_silver['host_identity_verified'] = df_silver['host_identity_verified'].map({
    'verified': True,
    'unconfirmed': False
})


df_silver['host_identity_verified'] = df_silver['host_identity_verified'].astype('boolean')


print("Column dtype AFTER the conversion:")
print(df_silver['host_identity_verified'].dtype)
print("\nUnique values in the column AFTER the conversion:")
print(df_silver['host_identity_verified'].unique())
print("\nValue counts in the column AFTER the conversion:")
print(df_silver['host_identity_verified'].value_counts(dropna=False))

Column dtype AFTER the conversion:
boolean

Unique values in the column AFTER the conversion:
<BooleanArray>
[<NA>]
Length: 1, dtype: boolean

Value counts in the column AFTER the conversion:
host_identity_verified
<NA>    102599
Name: count, dtype: Int64
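
Since the recorded output contains only `<NA>`, it may help to see the conversion in isolation on raw strings. The sketch below assumes the column still holds the original `'verified'` / `'unconfirmed'` text and is converted exactly once; the helper name `to_verified_flag` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def to_verified_flag(series: pd.Series) -> pd.Series:
    """Map 'verified'/'unconfirmed' strings to a nullable boolean column."""
    # Normalize through the .str accessor so missing values stay missing
    # (astype(str) would turn NaN into the literal string 'nan').
    normalized = series.str.lower().str.strip()
    mapped = normalized.map({'verified': True, 'unconfirmed': False})
    return mapped.astype('boolean')

# Toy input with inconsistent casing, stray whitespace, and a missing value.
raw = pd.Series(['verified', ' Unconfirmed', np.nan, 'verified'])
flag = to_verified_flag(raw)
print(flag.value_counts(dropna=False))
```

Running a helper like this a second time on its own result would fail, because the `.str` accessor is not available on a boolean column. That makes an accidental double conversion visible instead of silently producing `<NA>`.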


## Fixing Data Inconsistencies

### Fixing typos in neighbourhood group names that appeared as "brookln" and "manhatan"

In [10]:

df_silver['neighbourhood_group'] = df_silver['neighbourhood_group'].replace({
    'brookln': 'Brooklyn',
    'manhatan': 'Manhattan'
})


print(df_silver['neighbourhood_group'].unique())

['Brooklyn' 'Manhattan' 'Queens' nan 'Staten Island' 'Bronx']
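
After a correction like this, a quick check against the expected set of values can confirm that no other misspelling remains. This is a minimal sketch on toy data; the `expected` set is taken from the `unique()` output above and treated here as the full list of valid groups, which is an assumption.

```python
import numpy as np
import pandas as pd

# Toy column containing the two typos mentioned above.
groups = pd.Series(['Brooklyn', 'brookln', 'manhatan', 'Queens', np.nan])
fixed = groups.replace({'brookln': 'Brooklyn', 'manhatan': 'Manhattan'})

# Assumed list of valid values, copied from the unique() output above.
expected = {'Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx'}

# Any value outside the expected set (ignoring missing values) is a leftover typo.
unexpected = set(fixed.dropna().unique()) - expected
print(unexpected)
```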


### Handling Missing Values

1) Removing listings without a price or service fee.

In [11]:

df_silver.dropna(subset=['price', 'service_fee'], inplace=True)

print(f"Null values in 'price' after removal: {df_silver['price'].isnull().sum()}")

Null values in 'price' after removal: 0
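
With a `subset`, `dropna` removes a row as soon as any of the listed columns is missing (the default `how='any'`). The toy frame below is made up; it shows this behavior and reports how many rows were lost, which is useful to log for each cleaning step.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'price': [100.0, np.nan, 80.0, 55.0],
    'service_fee': [20.0, 10.0, np.nan, 11.0],
})

before = len(toy)
# A row is dropped if price OR service_fee is missing.
kept = toy.dropna(subset=['price', 'service_fee'])
print(f"dropped {before - len(kept)} of {before} rows")
```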


2) Creating a boolean column for house_rules: since about half of the values are null, we create a new column that indicates whether the listing has house rules defined.

In [12]:

df_silver['has_house_rules'] = df_silver['house_rules'].notna()

# The original column can now be dropped if its text content will not be used
# df_silver.drop(columns=['house_rules'], inplace=True)


print(df_silver['has_house_rules'].value_counts())

has_house_rules
False    51853
True     50260
Name: count, dtype: int64
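
Note that `notna()` only tests for missing values, so an empty or whitespace-only string still counts as "has rules". Whether such strings occur in this dataset is not shown here; the sketch below uses made-up values and a hypothetical stricter variant, purely to illustrate the difference.

```python
import numpy as np
import pandas as pd

# Toy values: real text, a missing value, an empty string, whitespace only.
rules = pd.Series(['No smoking.', np.nan, '', '   '])

# Same test as above: anything that is not NaN counts as present.
has_rules_notna = rules.notna()

# Hypothetical stricter variant: empty or whitespace-only text counts as missing too.
has_rules_strict = rules.fillna('').str.strip().ne('')

print(has_rules_notna.tolist())
print(has_rules_strict.tolist())
```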


3) Filling the few listings without a name with "Sem nome informado" ("No name provided")

In [14]:
df_silver['name'] = df_silver['name'].fillna('Sem nome informado')

4) Removing rows without **neighbourhood** or **neighbourhood_group**

In [15]:
df_silver.dropna(subset=['neighbourhood', 'neighbourhood_group'], inplace=True)
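
Items 3 and 4 treat missing values differently: a missing name is replaced with a placeholder and the row is kept, while a missing location removes the row. A minimal sketch on a made-up frame shows both effects together:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name': ['Skylit Midtown Castle', np.nan, 'Cozy room'],
    'neighbourhood_group': ['Manhattan', 'Brooklyn', np.nan],
    'neighbourhood': ['Midtown', 'Clinton Hill', 'Harlem'],
})

# Item 3: keep the row, substitute a placeholder for the missing name.
toy['name'] = toy['name'].fillna('Sem nome informado')
# Item 4: drop rows where either location column is missing.
toy = toy.dropna(subset=['neighbourhood', 'neighbourhood_group'])

# Total number of missing cells left in the frame.
print(toy.isna().sum().sum())
```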

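## Handling Invalid Values

Some negative values were found in the **availability_365** column, which make no sense in this context. The filter below keeps only non-negative values. Rows where availability_365 is missing are dropped as well, because a comparison with NaN evaluates to False.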

In [16]:
df_silver = df_silver[df_silver['availability_365'] >= 0]
print(df_silver['availability_365'].min())

0.0
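
The effect of this filter on missing values can be checked on a small series. In the sketch below, the values are made up and `mask_keep_nan` is a hypothetical variant for cases where rows with unknown availability should survive. It is not what the notebook does.

```python
import numpy as np
import pandas as pd

avail = pd.Series([286.0, -10.0, np.nan, 0.0, 365.0])

# Same condition as above: NaN >= 0 is False, so the missing row is removed too.
mask_original = avail >= 0

# Hypothetical variant: remove negatives but keep rows with missing availability.
mask_keep_nan = avail.isna() | (avail >= 0)

print(mask_original.tolist())
print(mask_keep_nan.tolist())
```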


## Data Dictionary - Silver

In [20]:
print("--- Random Sample of 20 Rows from the 'df_silver' DataFrame ---")

# .sample(20) draws 20 random rows from the DataFrame.
# .reset_index(drop=True) discards the old index for a cleaner display.
display(df_silver.sample(20).reset_index(drop=True))


print("\n\n--- Info Summary ---")
df_silver.info()

--- Random Sample of 20 Rows from the 'df_silver' DataFrame ---


Unnamed: 0,id,name,host_id,host_identity_verified,host_name,neighbourhood_group,neighbourhood,lat,long,instant_bookable,...,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365,house_rules,has_house_rules
0,36957723,ENTIRE LARGE 4 BEDROOMS 2 BATHS NEXT TO ALL,2869888874,,Lucky Day,Queens,Rego Park,40.72714,-73.86146,False,...,141.0,1.0,8.0,7/4/2021,0.53,5.0,2.0,180.0,,False
1,20927215,A Good Night Sleep,2063633272,,Sixta,Brooklyn,East New York,40.65894,-73.89343,True,...,121.0,2.0,54.0,6/26/2019,5.63,4.0,2.0,60.0,"No smoking, no loud parties. I would consider ...",True
2,10939422,Upper East side Cozy apartment.,16613564066,,Ilkay,Manhattan,East Harlem,40.7992,-73.93879,False,...,26.0,1.0,0.0,,,2.0,1.0,0.0,,False
3,3578368,Gorgeous Apt 1 block from Subway,73783982967,,Cecile,Brooklyn,Cobble Hill,40.68785,-73.99181,False,...,46.0,3.0,13.0,8/14/2016,0.22,1.0,1.0,157.0,No Smoking. No Pets.,True
4,45320650,Private sun-filled 1-bedroom apt with own back...,85125586000,,Marc,Brooklyn,Bushwick,40.69207,-73.92331,False,...,114.0,3.0,17.0,5/25/2019,0.66,3.0,2.0,0.0,1.CHECK-IN TIME IS AFTER 3 P.M. EST AND CHECK-...,True
5,18038686,Beautiful room,84043686167,,Elizabeth,Bronx,University Heights,40.85989,-73.91189,True,...,165.0,2.0,7.0,1/1/2019,0.46,2.0,2.0,156.0,Please treat my apartment as if it were your own.,True
6,31821885,138 Bowery-Modern King Studio,72352821859,,Jeniffer,Manhattan,Lower East Side,40.72023,-73.99401,True,...,97.0,15.0,4.0,2/1/2022,0.32,5.0,51.0,47.0,,False
7,30350558,"Comfortable Private Room in Midwood area, Broo...",58505491149,,George,Brooklyn,Midwood,40.61446,-73.95445,True,...,141.0,3.0,14.0,2/19/2022,1.47,5.0,1.0,247.0,,False
8,31668346,Bedroom w/Private Bath,86582675587,,Sheena,Brooklyn,Prospect-Lefferts Gardens,40.65782,-73.96096,False,...,101.0,2.0,176.0,2/4/2022,2.01,2.0,1.0,86.0,,False
9,26422048,Private Room İn Hell's Kitchen 2,5200786755,,Ahmet,Manhattan,Hell's Kitchen,40.76487,-73.98912,True,...,137.0,1.0,1.0,6/9/2019,1.0,4.0,7.0,198.0,,False




--- Info Summary ---
<class 'pandas.core.frame.DataFrame'>
Index: 101204 entries, 0 to 102598
Data columns (total 24 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              101204 non-null  int64  
 1   name                            101204 non-null  object 
 2   host_id                         101204 non-null  int64  
 3   host_identity_verified          0 non-null       boolean
 4   host_name                       100816 non-null  object 
 5   neighbourhood_group             101204 non-null  object 
 6   neighbourhood                   101204 non-null  object 
 7   lat                             101196 non-null  float64
 8   long                            101196 non-null  float64
 9   instant_bookable                101118 non-null  object 
 10  cancellation_policy             101143 non-null  object 
 11  room_type                       101204 non-

## Saving the Dataset

In [21]:

df_silver.to_csv('airbnb-dataset-silver.csv', index=False)

print("Silver layer dataset saved successfully!")

Silver layer dataset saved successfully!
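
CSV files carry no dtype information, so nullable dtypes such as `Int64` and `boolean` are inferred again when the file is read back. One way to preserve them is to pass an explicit `dtype` mapping to `pd.read_csv`. The sketch below uses an in-memory buffer and a two-column toy frame instead of the real file, so the round trip can be tried without touching disk.

```python
import io
import pandas as pd

# Toy frame with the two nullable dtypes used in this notebook.
silver = pd.DataFrame({
    'construction_year': pd.array([2020, None], dtype='Int64'),
    'host_identity_verified': pd.array([True, None], dtype='boolean'),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
silver.to_csv(buffer, index=False)
buffer.seek(0)

# Declare the dtypes explicitly so they survive the round trip.
restored = pd.read_csv(
    buffer,
    dtype={'construction_year': 'Int64', 'host_identity_verified': 'boolean'},
)
print(restored.dtypes)
```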
