# Exploratory Data Analysis

## Loading data

In [1]:
import wandb
import pandas as pd

run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

[34m[1mwandb[0m: Currently logged in as: [33mlaurent4ml[0m. Use [1m`wandb login --relogin`[0m to force relogin


### Installing dependencies

In [2]:
import sys
!{sys.executable} -m pip install ydata_profiling


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


In [3]:
!{sys.executable} -m pip install ipywidgets


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


## Creating Profile Report

In [4]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df)
profile.to_widgets()


## Display report in iframe

In [5]:
profile.to_notebook_iframe()


## Discoveries from report
- Number of Observations: 48895
- Number of Features: 16 (10 Numeric, 3 Text, 2 Categorical, 1 DateTime)
- No duplicate rows
- 2.6% missing cells
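
The overview figures above can also be cross-checked directly in pandas, without the profile report. The following is a minimal sketch on a small invented frame (`toy` and its values are assumptions for illustration, not the real sample); the same three expressions could be run against `df`.

```python
import pandas as pd

# Toy stand-in for the listings sample; values are invented for illustration.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "price": [0, 120, 85, 10000],
        "last_review": ["2019-05-21", None, "2018-11-03", None],
    }
)

# Number of observations (rows) and features (columns).
n_rows, n_cols = toy.shape

# Rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Share of missing cells over all cells, as a percentage.
missing_pct = 100 * toy.isna().sum().sum() / toy.size

print(f"{n_rows} observations, {n_cols} features")
print(f"{n_duplicates} duplicate rows, {missing_pct:.1f}% missing cells")
```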

### Price Feature Analysis
- Smallest values: 0, 10, 11, 12, 13, 15, 16, 18, 19, 20 (we should remove at least the 0)
- Largest values: 10000, 9999, 8500, 8000, 7703, 7500, 6800, 6500, 6419, 6000
- 95th percentile: 355
- Mean: 152.72
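
These price statistics can be recomputed with plain pandas calls. The sketch below uses a short invented series (`prices` is an assumption, not the real column) to show `nsmallest`, `nlargest`, `quantile` and `mean`; on the real data the same calls would be applied to `df['price']`.

```python
import pandas as pd

# Invented prices with one zero and one extreme value, for illustration only.
prices = pd.Series([0, 10, 11, 45, 60, 80, 120, 150, 355, 10000], name="price")

# Extreme values at both ends of the distribution.
smallest = prices.nsmallest(3).tolist()
largest = prices.nlargest(3).tolist()

# 95th percentile (linear interpolation by default) and mean.
p95 = prices.quantile(0.95)
mean_price = prices.mean()

print("smallest:", smallest)
print("largest:", largest)
print("95th percentile:", p95)
print("mean:", mean_price)
```

In this toy series a single extreme value pulls the mean far above most of the prices, which is one reason to look at a percentile rather than the maximum when choosing a cutoff.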

### Last Review Feature Analysis
- 20.6% missing values
- Column format: date
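
As a rough illustration of how missing dates behave, the sketch below measures the missing share of a date column and parses it with `pd.to_datetime`. The frame `reviews` and its values are invented for this example; missing entries become `NaT` rather than raising an error.

```python
import pandas as pd

# Invented review dates; None marks listings without any review.
reviews = pd.DataFrame(
    {"last_review": ["2019-05-21", None, "2018-11-03", None, "2019-07-05"]}
)

# Percentage of missing values in the column.
missing_pct = 100 * reviews["last_review"].isna().mean()

# Parse the strings as datetimes; missing entries are kept as NaT.
reviews["last_review"] = pd.to_datetime(reviews["last_review"])

print(f"{missing_pct:.1f}% missing")
print(reviews.dtypes)
```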

### Latitude
- minimum: 40.499979
- maximum: 40.91306
- mean: 40.728949

### Longitude
- minimum: -74.24442
- maximum: -73.71299
- mean: -73.95217
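
One possible use of these ranges, not part of the original analysis, is a sanity check that flags coordinates falling outside the observed bounding box. The sketch below assumes the minimum and maximum values reported above as bounds and runs on invented coordinates (`coords` is hypothetical).

```python
import pandas as pd

# Bounds copied from the latitude and longitude ranges reported above.
LAT_MIN, LAT_MAX = 40.499979, 40.91306
LON_MIN, LON_MAX = -74.24442, -73.71299

# Invented coordinates: one inside the box, two with out-of-range latitude.
coords = pd.DataFrame(
    {
        "latitude": [40.7289, 40.4999, 41.2],
        "longitude": [-73.9522, -74.0, -73.9],
    }
)

# A row is in bounds only if both coordinates fall inside the box (inclusive).
in_bounds = coords["latitude"].between(LAT_MIN, LAT_MAX) & coords[
    "longitude"
].between(LON_MIN, LON_MAX)

print(coords[~in_bounds])
```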

#### Conclusion
- We should remove all rows where `price` is 0
- We want to cap the price at roughly the 95th percentile (355)
- We want to parse `last_review` as a pandas datetime column


### Drop price outliers and convert last_review to datetime

In [6]:
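# Keep listings priced between 10 and 350 (bounds inclusive); this drops the
# zero-price rows and the extreme values above roughly the 95th percentile.
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])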

## Verifying feature data types

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46428 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              46428 non-null  int64         
 1   name                            46413 non-null  object        
 2   host_id                         46428 non-null  int64         
 3   host_name                       46407 non-null  object        
 4   neighbourhood_group             46428 non-null  object        
 5   neighbourhood                   46428 non-null  object        
 6   latitude                        46428 non-null  float64       
 7   longitude                       46428 non-null  float64       
 8   room_type                       46428 non-null  object        
 9   price                           46428 non-null  int64         
 10  minimum_nights                  46428 non-null  int64         
 11  nu

### Verify Minimum price

In [8]:
df['price'].min()

10

### Verify Maximum price

In [9]:
df['price'].max()

350
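
The two manual checks above could also be written as assertions, so that a later change to the bounds fails loudly instead of silently. This is a minimal sketch on invented prices; `filter_price` is a hypothetical helper that wraps the same `between` filter used earlier.

```python
import pandas as pd


def filter_price(frame, min_price, max_price):
    """Hypothetical helper: keep rows whose price lies in [min_price, max_price]."""
    return frame[frame["price"].between(min_price, max_price)].copy()


# Invented prices straddling both bounds.
toy = pd.DataFrame({"price": [0, 10, 75, 350, 351, 10000]})
cleaned = filter_price(toy, 10, 350)

# Both bounds are inclusive, so 10 and 350 themselves are kept.
assert cleaned["price"].min() >= 10
assert cleaned["price"].max() <= 350
print(cleaned["price"].tolist())
```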

In [10]:
run.finish()