#### Short Term Rentals
# Exploratory Data Analysis
Nikolas Hunt

[✉️](mailto:nikolashunt@protonmail.ch) | November 2022

## Introduction

Our property management company rents rooms and properties for short terms on various platforms. Our business problem is that we want a more accurate estimate of the typical price for a property, based on the features of that property.

Our company receives new data on properties in bulk every week. The price prediction model will need retraining on that cadence, which calls for a retrainable pipeline.

In a real scenario, I would spend a great deal more time on this phase, uncovering insights about the dataset. This project focuses on building the pipeline, so the exploration has a light touch.

We will do some profiling of the data and produce an accompanying commentary, perform some data cleansing based on those observations, and log the code to [the Weights and Biases project](https://wandb.ai/nikohunt/nyc_airbnb?workspace=user-nikohunt).

In [1]:
import wandb
import pandas as pd
import pandas_profiling

## Weights and Biases Interaction

Here we log in to Weights and Biases, ensuring that the code for the run is saved.

In [2]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)

wandb: Currently logged in as: nikohunt (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.13.5 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


Download the sample CSV artifact for analysis.

In [3]:
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

## Profiling

We use pandas profiling to understand the basic characteristics of our dataset.

In [5]:
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()


## Observations

* Only four features contain nulls: name, host_name, last_review and reviews_per_month.
* The features name and host_name have <0.1% nulls (16 and 21 respectively); last_review and reviews_per_month have a much greater frequency (20.6% each).
* The feature name will benefit from text processing as it is natural language. Interestingly, beyond the expected words that describe the property or accommodation itself (e.g. "room"), the adjective "cozy" is the sixth most frequently occurring word. One avenue of exploration would be to see whether this and other popular or interesting adjectives correlate with a higher price.
* The feature last_review is a date but it is in string format.
* The price feature is highly skewed, owing to some very high prices. Notably, there are 11 zeroes. Although these represent less than 0.1% of observations, these should be investigated.
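The null and zero-price counts above come from the profiling report, but the same checks can be reproduced directly in pandas. The following is a minimal sketch on made-up toy data, not the actual dataset; the helper names `null_summary` and `count_zero_prices` are hypothetical and introduced here only for illustration.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and percentage for each column that has nulls."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {
            "n_null": counts,
            "pct_null": (counts / len(df) * 100).round(1),
        }
    )
    # Keep only the columns that actually contain nulls
    return summary[summary["n_null"] > 0]


def count_zero_prices(df: pd.DataFrame, price_col: str = "price") -> int:
    """Count the rows whose price is exactly zero."""
    return int((df[price_col] == 0).sum())


# Toy data standing in for the real listings (values are invented)
toy = pd.DataFrame(
    {
        "name": ["Cozy room", None, "Loft", "Studio"],
        "price": [0, 80, 120, 9000],
        "last_review": ["2019-05-01", None, None, "2019-06-15"],
    }
)

nulls = null_summary(toy)
zero_count = count_zero_prices(toy)
print(nulls)
print(f"Zero-priced rows: {zero_count}")
```

Running the same helpers on `df` would give figures to compare against the profiling report.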

## Decisions

* After talking to stakeholders, we have decided to consider only properties priced from a minimum of $10 to a maximum of $350 per night.
* The occurrences of zeroes in price need following up with the product owner.
* The last_review feature should be converted to a datetime to aid manipulation and further analysis.

## Data Cleansing Implementation

Following the Decisions above, we drop outliers in price and convert the last_review field to a datetime.

In [7]:
# Drop outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()

# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
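Because the pipeline is retrained weekly, the same cleansing steps will eventually need to run outside this notebook. A minimal sketch of how they might be wrapped in a reusable function follows; the function name `clean_listings` and the toy data are assumptions for illustration, not part of the project code.

```python
import pandas as pd


def clean_listings(
    df: pd.DataFrame, min_price: float = 10.0, max_price: float = 350.0
) -> pd.DataFrame:
    """Drop rows with a price outside [min_price, max_price] and parse last_review."""
    # Series.between is inclusive on both ends by default
    in_range = df["price"].between(min_price, max_price)
    cleaned = df[in_range].copy()
    # Missing review dates become NaT rather than raising an error
    cleaned["last_review"] = pd.to_datetime(cleaned["last_review"])
    return cleaned


# Toy data standing in for the real listings (values are invented)
toy = pd.DataFrame(
    {
        "price": [0, 10, 120, 350, 9000],
        "last_review": ["2019-05-01", None, "2019-06-15", "2018-12-31", None],
    }
)

cleaned = clean_listings(toy)
print(cleaned)
```

The function returns a copy, so the input frame is left unchanged and the price bounds can be passed in as parameters when the stakeholders' limits change.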

## Housekeeping

All that remains is to finish the W&B run.

**Please use Logout and Quit via the Jupyter Notebook UI to ensure that the W&B run finishes properly.**

In [8]:
run.finish()
