# This notebook contains the EDA of the cleaned dataset that we used for modeling

## 1. Libraries and loading CSV

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set the plotting context and font scale for all plots
sns.set_context("talk", font_scale=1.5)

In [None]:
# load dataset
df_master = pd.read_csv('../data/excellent_master.csv')

In [None]:
# First look at the dataset
df_master.head()

## 2. Remove unnecessary columns and show shape of dataframe

In [None]:
# Remove unnecessary unnamed columns
df_master.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0_x', 'Unnamed: 0_y'], axis=1, inplace=True)
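
Columns named `Unnamed: ...` typically appear when a dataframe is written to CSV together with its index and then read back. As a minimal sketch, assuming every such column is a leftover index, they could also be dropped by name pattern instead of being listed one by one. The toy frame and its values below are illustrative only and are not taken from `excellent_master.csv`.

```python
import pandas as pd

# Toy frame standing in for df_master; the values are made up for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Unnamed: 0_x": [0, 1],
    "listing_id": [101, 102],
    "inquiry_rate": [0.02, 0.05],
})

# Boolean mask marking every column whose name starts with "Unnamed"
unnamed_mask = df.columns.str.startswith("Unnamed")

# Keep only the remaining columns
df_clean = df.loc[:, ~unnamed_mask]

print(list(df_clean.columns))
```

This avoids a `KeyError` when one of the hard-coded column names is missing, at the cost of also removing any column that legitimately starts with "Unnamed".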

In [None]:
# Shape of the dataset
print('The dataset contains %s observations and %s features' %(df_master.shape[0], df_master.shape[1]))

## 3. Convert date features to the right data type and show a first description

In [None]:
# Convert column arrival_date to datetime
df_master['arrival_date'] = pd.to_datetime(df_master['arrival_date'])
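
If some date strings cannot be parsed, `pd.to_datetime` raises an error by default. The following sketch shows one possible way to check for such values with `errors="coerce"`, which turns them into `NaT` so they can be counted. The example strings are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical raw date strings, including one invalid entry
raw_dates = pd.Series(["2019-07-01", "2019-12-24", "not a date"])

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Count how many values could not be converted
n_invalid = parsed.isna().sum()
print("Unparseable dates:", n_invalid)
```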

In [None]:
# First description of the numerical features
df_master.describe().round(3)

## 4. Number of properties and filtering for year 2019

In [None]:
# Number of unique properties and the included years
print('Number of unique properties:', df_master.listing_id.nunique())
print('The included years are', df_master.year.unique())

In [None]:
# Number of observations per year (absolute and as a percentage of all observations)
print(df_master.query('year == 2019').shape[0])
print(df_master.query('year == 2019').shape[0] / df_master.shape[0] * 100)
print(df_master.query('year == 2020').shape[0])
print(df_master.query('year == 2020').shape[0] / df_master.shape[0] * 100)

The dataset includes inquiries from the years 2019 and 2020. It covers 13,761 different properties and a total of 1,521,132 observations: 476,790 from 2019 (31.3%) and 1,044,342 from 2020 (68.7%).
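
The counts and percentages per year can also be obtained in one step with `value_counts`. This is a minimal sketch on a toy `year` column; the five rows below are made up and only illustrate the calculation.

```python
import pandas as pd

# Toy data standing in for df_master; only the `year` column matters here.
df = pd.DataFrame({"year": [2019, 2020, 2020, 2019, 2020]})

# Absolute number of observations per year
counts = df["year"].value_counts().sort_index()

# Share of each year in percent, rounded to one decimal
shares = (df["year"].value_counts(normalize=True).sort_index() * 100).round(1)

summary = pd.DataFrame({"observations": counts, "share_percent": shares})
print(summary)
```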

Because the COVID-19 pandemic affected the inquiries in 2020 (as we could see in the EDA of the inquiries), we focus on the year 2019.

In [None]:
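# Filter dataset for year 2019
# .copy() creates an independent dataframe, so columns added later do not trigger a SettingWithCopyWarning
df_master_2019 = df_master.query('year == 2019').copy()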

In [None]:
# Number of unique properties in year 2019
print('Number of unique properties in year 2019:', df_master_2019.listing_id.nunique())

The dataset includes 476,790 inquiries for 8,388 different properties in the year 2019.

## 5. Grouping / Clustering features by inquiry rate 

### 1. Define three categories of inquiry rate: low, middle, high

The inquiry rate was calculated from expose views and inquiry count. Let's look at the distribution of the inquiry rate.

In [None]:
# Boxplot inquiry rate
ax = sns.boxplot(x=df_master_2019["inquiry_rate"])

In [None]:
# Inquiry rate per month
ax = sns.boxplot(x="month", y="inquiry_rate", data=df_master_2019)

We define the category "low" as the lowest 25% of inquiry rates, the category "high" as the highest 25% of inquiry rates, and the category "middle" as the inquiry rates between these two groups.

In [None]:
# Calculate inquiry_rate for quartiles to define categorical groups
print(round(df_master_2019.inquiry_rate.describe(), 2))
print(df_master_2019.inquiry_rate.quantile(.25))
print(df_master_2019.inquiry_rate.quantile(.75))

In [None]:
# Create new column with the categories for inquiry rate

# Compute the quartile thresholds once
q25 = df_master_2019['inquiry_rate'].quantile(.25)
q75 = df_master_2019['inquiry_rate'].quantile(.75)

# Create a list of our conditions
conditions = [
    (df_master_2019['inquiry_rate'] <= q25),
    (df_master_2019['inquiry_rate'] > q25) & (df_master_2019['inquiry_rate'] < q75),
    (df_master_2019['inquiry_rate'] >= q75)
]

# Create a list of the values we want to assign for each condition
values = ['low', 'middle', 'high']

# Create a new column and use np.select to assign values to it using our lists as arguments.
# A string default keeps the result dtype consistent and marks rows that match no condition (e.g. missing rates).
df_master_2019['cat_inquiry_rate'] = np.select(conditions, values, default='unknown')
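
As an alternative to `np.select`, the same quartile-based grouping can be sketched with `pd.qcut`. This is not the method used in this notebook, and it differs in one detail: `pd.qcut` uses right-closed bins, so a value exactly equal to the 75% quantile falls into "middle" rather than "high". The toy series below is made up for illustration.

```python
import pandas as pd

# Toy inquiry rates with distinct values (illustrative only)
rates = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], name="inquiry_rate")

# Bin edges at the 0%, 25%, 75% and 100% quantiles, one label per bin
categories = pd.qcut(rates, q=[0, 0.25, 0.75, 1], labels=["low", "middle", "high"])

print(categories.value_counts().sort_index())
```

The result is an ordered categorical, so plots and group-bys keep the order low, middle, high without an explicit `order` argument. Note that `pd.qcut` raises a `ValueError` when two quantile edges coincide, which can happen if many properties share the same inquiry rate.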

In [None]:
df_master_2019.head(2)

### 2. Group features by category inquiry rate

We only keep the plots in which a difference between the categories is visible.
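
Before plotting, the features could also be screened numerically. The sketch below compares the mean of each feature across the three categories and ranks the features by the difference between "high" and "low". For 0/1 amenity features the mean is the share of properties that have the amenity. The toy data and the choice of the high-minus-low difference as a ranking criterion are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy data standing in for df_master_2019 (values are made up)
df = pd.DataFrame({
    "cat_inquiry_rate": ["low", "low", "middle", "middle", "high", "high"],
    "dishwasher": [0, 0, 0, 1, 1, 1],
    "toaster": [1, 0, 1, 0, 1, 0],
})

order = ["low", "middle", "high"]
features = ["dishwasher", "toaster"]

# Mean of each feature per category, with rows sorted low -> middle -> high
group_means = df.groupby("cat_inquiry_rate")[features].mean().reindex(order)

# Difference between the high and the low category, largest first
spread = (group_means.loc["high"] - group_means.loc["low"]).sort_values(ascending=False)

print(group_means)
print(spread)
```

In this toy example `dishwasher` shows a clear difference between the categories, while `toaster` shows none.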

In [None]:
list(df_master_2019.columns)

#### Inquiry rate

As expected, the inquiry rate differs between the categories, because the categories were derived from this feature.

In [None]:
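# sns.lineplot does not accept an `order` argument; sns.pointplot does and connects the category means with a line
ax = sns.pointplot(x="cat_inquiry_rate", y="inquiry_rate", data=df_master_2019, order=['low', 'middle', 'high'])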

#### Price per day

Properties with a high inquiry rate tend to have a higher price per day than properties with lower inquiry rates.

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="filled_in_price_per_day", data=df_master_2019, order=['low', 'middle', 'high'])

#### Coordinates

Properties with a higher inquiry rate tend to be located further east and north. This coincides with the highest request value for the Baltic Sea.

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="lng", data=df_master_2019, order=['low', 'middle', 'high'])

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="lat", data=df_master_2019, order=['low', 'middle', 'high'])

#### Living area



In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="living_area", data=df_master_2019, order=['low', 'middle', 'high'])

#### Maximum number of guests

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="max_guests", data=df_master_2019, order=['low', 'middle', 'high'])

#### Top listing

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="top", data=df_master_2019, order=['low', 'middle', 'high'])

#### Dishwasher

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="dishwasher", data=df_master_2019, order=['low', 'middle', 'high'])

#### Vacuum cleaner

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="vacuum_cleaner", data=df_master_2019, order=['low', 'middle', 'high'])

#### Baby bed and high chair

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="babybed", data=df_master_2019, order=['low', 'middle', 'high'])

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="high_chair", data=df_master_2019, order=['low', 'middle', 'high'])

#### DVD player

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="DVD-player", data=df_master_2019, order=['low', 'middle', 'high'])

#### Make-up mirror

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="make-up_mirror", data=df_master_2019, order=['low', 'middle', 'high'])

#### Toaster

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="toaster", data=df_master_2019, order=['low', 'middle', 'high'])

#### Freezer

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="freezer", data=df_master_2019, order=['low', 'middle', 'high'])

#### Fly-screen

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="fly_screen", data=df_master_2019, order=['low', 'middle', 'high'])


#### Crockery

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="crockery", data=df_master_2019, order=['low', 'middle', 'high'])

#### Spices

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="spices", data=df_master_2019, order=['low', 'middle', 'high'])

#### First aid kit

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="first-aid_kit", data=df_master_2019, order=['low', 'middle', 'high'])