# EDA - NYC Airbnb dataset

### Importing the libraries we will need for this exercise

In [1]:
import wandb
import pandas as pd
import pandas_profiling

## Downloading the dataset

We download the latest version of the data file `sample.csv` stored in **W&B** (`sample.csv:latest`) and load it into a pandas DataFrame (**df**).

In [2]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

[34m[1mwandb[0m: Currently logged in as: [33mpruanju[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Pandas profiling report


<p>We are going to generate the pandas profiling report, which provides useful information for analyzing the dataset, for example:</p>
<br>

<li>Essentials: type, unique values, indication of missing values.</li>
<li>Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.</li>
<li>Histograms: categorical and numerical.</li>
<li>Correlations: high correlation warnings.</li>
<li>Missing values: through counts, matrix, heatmap and dendrograms.</li>
<li>Duplicate rows: list of the most common duplicated rows.</li>
<li>Extreme values.</li>

In [3]:
profile = pandas_profiling.ProfileReport(df)

In [4]:
profile.to_widgets()


### The report can also be exported in `html` format, which is easier to read

In [5]:
prof = pandas_profiling.ProfileReport(df)
prof.to_file(output_file='output.html')


## Analysis of the `pandas profiling` report

The report shows missing values in some columns, such as `last_review` and `reviews_per_month`, with **4,123** missing values reported for these columns, so we need a strategy to either impute the missing values or delete the rows that contain them.
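
The two strategies mentioned above can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the cleaning applied in this notebook; filling `reviews_per_month` with 0 rests on the assumption that a listing without reviews has a review rate of zero.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame(
    {
        "last_review": ["2019-05-21", None, "2018-11-03", None],
        "reviews_per_month": [0.21, np.nan, 1.50, np.nan],
        "number_of_reviews": [9, 0, 45, 0],
    }
)

# Count the missing values in each column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Option A: delete every row that has a missing value in these columns.
dropped = toy.dropna(subset=["last_review", "reviews_per_month"])

# Option B: keep the rows and fill reviews_per_month with 0
# (assumption: no reviews means a rate of zero).
filled = toy.assign(reviews_per_month=toy["reviews_per_month"].fillna(0))
print(len(dropped), len(filled))
```

Option A shrinks the dataset, while Option B keeps every row but bakes an assumption into the data, so the choice depends on how the columns are used later.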

### Wrong data types
Another important point is that the `last_review` column is stored as a `string` but should be of type `date`.
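
To illustrate what the conversion does, here is a small sketch on made-up values (the series `raw` is hypothetical, not taken from the dataset). It suggests that `pd.to_datetime` turns date strings into a datetime dtype and maps missing entries to `NaT`.

```python
import pandas as pd

# Toy column of date strings with one missing entry (made-up values).
raw = pd.Series(["2019-05-21", None, "2018-11-03"], name="last_review")

# Parse the strings; missing entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw)

print(raw.dtype)
print(parsed.dtype)
print(parsed.isna().sum())

# Once parsed, datetime accessors such as .dt.year become available.
years = parsed.dt.year
print(years.max())
```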

### Outliers
The report also shows extreme values in the `price` column. The decision was to remove listings with prices below `$10` or above `$350`.
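
As a rough sketch of how such a price filter behaves, the example below applies `Series.between` to a made-up list of prices. Note that `between` includes both endpoints by default, so listings priced at exactly 10 or 350 are kept.

```python
import pandas as pd

min_price, max_price = 10, 350

# Made-up prices chosen to sit on and around the two bounds.
toy = pd.DataFrame({"price": [0, 9, 10, 120, 350, 351, 10000]})

# Boolean mask: True where min_price <= price <= max_price (inclusive).
in_range = toy["price"].between(min_price, max_price)

kept = toy[in_range].copy()
n_removed = int((~in_range).sum())

print(kept["price"].tolist())
print(n_removed)
```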



The pandas **DataFrame.info()** method prints a concise summary of the DataFrame. It comes in handy during exploratory analysis for getting a quick overview of the dataset.


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

As we said before, we are going to keep only the rows whose **price** is between `$10` and `$350`, and convert **last_review** to a `date` type.

In [7]:
# Drop outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])

Quick check with `df.info()` after converting **last_review** to a date type.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46428 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              46428 non-null  int64         
 1   name                            46413 non-null  object        
 2   host_id                         46428 non-null  int64         
 3   host_name                       46407 non-null  object        
 4   neighbourhood_group             46428 non-null  object        
 5   neighbourhood                   46428 non-null  object        
 6   latitude                        46428 non-null  float64       
 7   longitude                       46428 non-null  float64       
 8   room_type                       46428 non-null  object        
 9   price                           46428 non-null  int64         
 10  minimum_nights                  46428 non-null  int64         
 11  nu
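
Besides reading the `df.info()` output, the same checks could be made programmatically. The helper `check_cleaned` below is hypothetical (it is not part of this notebook) and is shown on a made-up toy frame; it is only a sketch of the idea.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_cleaned(frame, min_price=10, max_price=350):
    """Raise AssertionError if the two cleaning steps were not applied."""
    # Every price must lie inside the inclusive [min_price, max_price] range.
    assert frame["price"].between(min_price, max_price).all(), "price out of range"
    # last_review must have been converted to a datetime dtype.
    assert is_datetime64_any_dtype(frame["last_review"]), "last_review is not datetime"


# Made-up frame that mimics the cleaned data.
toy = pd.DataFrame(
    {
        "price": [10, 120, 350],
        "last_review": pd.to_datetime(["2019-05-21", None, "2018-11-03"]),
    }
)

check_cleaned(toy)  # passes silently when both conditions hold
```

In the notebook itself, the equivalent call would be `check_cleaned(df)` after the cleaning cell.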

## Finishing the notebook

When we are done, we need to close the **W&B** run so that it is marked as finished and its data is uploaded.

In [9]:
run.finish()
