# EDA for NYC Airbnb rental price estimator

This notebook performs a brief exploratory data analysis (EDA) on a sample of the NYC Airbnb rental price dataset. The following cells download and profile the sample data, then apply minimal data cleaning to get a clearer picture from the profiling.

In [None]:
# Basic config, getting .csv file from Weights & Biases and loading it into a pandas dataframe

import wandb
import pandas as pd

run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

In [3]:
# Generate a profile report to get an overview of data distributions, value ranges, etc.

import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()


## Profiling report insights

The **price** column contains outliers that don't seem to make much sense: many entries are close to zero dollars per night, and others are close to 10000 dollars per night. Dropping such values will help us get better insights from the report.\
**last_review** should be a datetime column, but it is stored as a string. Converting it would be helpful.
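
Before dropping rows, it can be useful to check how many would be affected by a given price range. The following is a minimal sketch on made-up data, not on the actual sample: the `toy` frame and the `count_outside` helper are hypothetical, and the bounds 10 and 350 are only illustrative choices.

```python
import pandas as pd

# Hypothetical toy data standing in for the real sample; the values are made up.
toy = pd.DataFrame({"price": [0, 5, 10, 80, 150, 350, 999, 10000]})


def count_outside(frame, column, low, high):
    """Return how many rows have `column` outside the inclusive range [low, high]."""
    inside = frame[column].between(low, high)  # inclusive on both ends by default
    return int((~inside).sum())


n_out = count_outside(toy, "price", 10, 350)
share = n_out / len(toy)
print(f"{n_out} of {len(toy)} rows ({share:.0%}) fall outside [10, 350]")
```

If the share of affected rows turns out to be large, the bounds are probably cutting into the bulk of the distribution rather than trimming outliers, and may need to be reconsidered.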

In [4]:
# Drop outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])

In [5]:
# Generate another profiling report, now with cleaner data
profile_fixed = pandas_profiling.ProfileReport(df)
profile_fixed.to_widgets()

## Profiling report insights, revisited

Now that the range of the **price** column has been restricted, we can get a better understanding of its distribution.

We can also view **last_review** minimum and maximum values.
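
The same quantities can also be read directly from pandas, without the profiling widget. The sketch below uses a hypothetical `toy` frame with made-up values; only the column names `price` and `last_review` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data; the notebook itself works on the downloaded sample.
toy = pd.DataFrame({
    "price": [40, 120, 300],
    "last_review": ["2019-05-21", None, "2018-11-03"],
})

# Missing review dates become NaT after conversion
toy["last_review"] = pd.to_datetime(toy["last_review"])

# Summary statistics for price: count, mean, std, min, quartiles, max
price_summary = toy["price"].describe()

# min() and max() skip NaT values by default
oldest = toy["last_review"].min()
newest = toy["last_review"].max()

print(price_summary)
print(f"last_review ranges from {oldest.date()} to {newest.date()}")
```

Once the column has a datetime dtype, minimum and maximum are chronological rather than lexicographic, which is why the conversion matters for this kind of check.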

In [6]:
# Finish the W&B run
run.finish()
