### NYC Airbnb dataset

#### Author: Luiz Henrique
#### Date: April, 2022

In [2]:
import wandb
import pandas as pd
import pandas_profiling

In [3]:
run = wandb.init(
    project="nyc_airbnb", 
    group="eda",
    save_code=True
)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: luizhenriqueds (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.14 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


#### Fetching the data from W&B

In [4]:
local_path = wandb.use_artifact("sample.csv:latest").file()

In [5]:
df = pd.read_csv(local_path)

#### Displaying a sample of the dataset

In [6]:
df.head(2)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,9138664,Private Lg Room 15 min to Manhattan,47594947,Iris,Queens,Sunnyside,40.74271,-73.92493,Private room,74,2,6,2019-05-26,0.13,1,5
1,31444015,TIME SQUARE CHARMING ONE BED IN HELL'S KITCHEN...,8523790,Johlex,Manhattan,Hell's Kitchen,40.76682,-73.98878,Entire home/apt,170,3,0,,,1,188


#### Profiling dataset

In [7]:
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()


#### Findings about the data

Here we list some issues found in the dataset:

- `calculated_host_listings_count`: high percentage of missing values
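
As a rough cross-check of the profile report, the share of missing values per column can also be computed directly with pandas. The snippet below is a minimal sketch on a toy frame: the name `toy` and its values are illustrative assumptions (the first two rows loosely follow the sample shown earlier), not the real data.

```python
import pandas as pd

# Toy frame standing in for a few columns of the dataset (illustrative values).
toy = pd.DataFrame(
    {
        "price": [74, 170, 95, 120],
        "last_review": ["2019-05-26", None, None, "2019-06-01"],
        "reviews_per_month": [0.13, None, None, 1.2],
    }
)

# isna() flags missing cells; the mean of the boolean flags per column
# is the fraction of missing values, scaled here to a percentage.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Running the same expression on `df` would give the per-column percentages for the actual sample.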

#### Fixing data issues

##### Keeping prices on a reasonable scale

In [9]:
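# Drop price outliers: keep only listings priced within [min_price, max_price]
min_price = 10
max_price = 350

# Boolean mask of rows whose price falls inside the allowed range
idx = df['price'].between(min_price, max_price)
# .copy() avoids chained-assignment warnings on later column updates
df = df[idx].copy()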
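
To make the filter's boundary behavior explicit: `Series.between` is inclusive on both ends by default, so listings priced exactly at `min_price` or `max_price` are kept. A small sketch with made-up prices (the `prices` series is an assumption for illustration only):

```python
import pandas as pd

# Made-up prices straddling both bounds (illustrative values only).
prices = pd.Series([5, 10, 74, 350, 351], name="price")
min_price, max_price = 10, 350

# between() includes both endpoints by default.
mask = prices.between(min_price, max_price)
kept = prices[mask]
print(kept.tolist())
```

Only the values 5 and 351 fall outside the range and are dropped.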

##### Converting `last_review` date
The field `last_review` holds the date of the most recent review. However, it is stored as `object` (string), so we need to convert it to datetime to allow for efficient date operations. Fortunately, pandas already provides a function for this conversion.

In [10]:
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
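
The effect of the conversion can be checked on a tiny example. This is a sketch under the assumption that listings without reviews have an empty `last_review`, as in the second row of the sample above; the `raw` series is illustrative.

```python
import pandas as pd

# One real-looking date and one missing entry (listing with no reviews yet).
raw = pd.Series(["2019-05-26", None], name="last_review")
parsed = pd.to_datetime(raw)

# The column now has a datetime dtype instead of object.
print(pd.api.types.is_datetime64_any_dtype(parsed))
# Missing entries become NaT, so they still count as missing.
print(parsed.isna().tolist())
# The .dt accessor exposes date components without string parsing.
print(parsed.dt.year.iloc[0])
```

Missing dates are preserved as `NaT` rather than raising an error, so the conversion does not change how many values count as missing.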

#### Data info

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  nu

#### Uploading notebook to W&B

In [None]:
# run.finish()