# Exploratory data analysis (EDA) of the cleaned dataset used for modeling

## 1. Libraries and loading CSV

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Set the seaborn plotting context (larger fonts for all plots)
sns.set_context("talk", font_scale=1.5)

In [None]:
# load dataset
df_master = pd.read_csv('../data/super_master.csv')

In [None]:
# First look at the dataset
df_master.head()

## 2. Remove unnecessary columns and show shape of dataframe

In [None]:
# Remove unnecessary unnamed columns
df_master.drop(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0_x', 'Unnamed: 0_y'], axis=1, inplace=True)

In [None]:
# Shape of the dataset
print('The dataset contains %s observations and %s features' % (df_master.shape[0], df_master.shape[1]))

## 3. Convert date features to the right data type and show a first description

In [None]:
# Convert column arrival_date to datetime
df_master['arrival_date'] = pd.to_datetime(df_master['arrival_date'])

In [None]:
# First description of the numerical features
round(df_master.describe(),3)

## 4. Number of properties and filtering for year 2019

In [None]:
# Number of unique properties and the included years
print('Number of unique properties:', df_master.listing_id.nunique())
print('The included years are', df_master.year.unique())

In [None]:
# Number of observations per year (absolute and as a percentage of all rows)
n_total = df_master.shape[0]
for year in (2019, 2020):
    n_year = (df_master['year'] == year).sum()
    print(f'{year}: {n_year} observations ({n_year / n_total * 100:.1f}%)')

The dataset includes inquiries from 2019 and 2020 for 17,185 different properties. Of 6,081,983 observations in total, 1,881,180 are from 2019 (30.9%) and 4,200,803 are from 2020 (69.1%).
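
The per-year counts and shares quoted above can also be derived in one step. The following is a minimal sketch on toy data, assuming a `year` column as in `df_master`; the helper name `observations_per_year` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def observations_per_year(years: pd.Series) -> pd.DataFrame:
    """Return the number of rows per year and their share of all rows in percent."""
    counts = years.value_counts().sort_index()
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"observations": counts, "share_pct": shares})


# Toy data (not the real dataset): 3 rows from 2019 and 7 rows from 2020
toy_years = pd.Series([2019] * 3 + [2020] * 7, name="year")
summary = observations_per_year(toy_years)
print(summary)
```

On the real data this would presumably be called as `observations_per_year(df_master['year'])`.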

Because the COVID-19 pandemic affected the inquiries in 2020 (as we could see in the EDA of the inquiries), we focus on the year 2019.

In [None]:
# Filter dataset for year 2019 (copy, so that new columns can be added later without a SettingWithCopyWarning)
df_master_2019 = df_master.query('year == 2019').copy()

In [None]:
# Save Master 2019 as csv
#df_master_2019.to_csv('../data/super_master_2019.csv')

# Import Master 2019
#df_master_2019 = pd.read_csv('../data/super_master_2019.csv')

In [None]:
# Number of unique properties in year 2019
print('Number of unique properties in year 2019:', df_master_2019.listing_id.nunique())

The dataset includes 1,881,180 inquiries for 10,270 different properties in the year 2019.

## 5. Grouping / Clustering features by inquiry rate 

### 1. Define categories of inquiry rate

We will define three categories of inquiry rate: low, middle and high. The inquiry rate was calculated from expose views and inquiry count. Let's look at its distribution first.

In [None]:
# Boxplot inquiry rate
ax = sns.boxplot(x=df_master_2019["inquiry_rate"])

In [None]:
# Inquiry rate per month
ax = sns.boxplot(x="month", y="inquiry_rate", data=df_master_2019)

We define the category "low" as the lowest 25% inquiry rates, the category "high" as the highest 25% inquiry rates and the category "middle" as the inquiry rates between the lowest and highest group.
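
The rule above can be written as a small helper. This is only a sketch under assumptions: the function name `categorize_inquiry_rate` and the toy values are hypothetical, and missing rates are kept as missing, which the original notebook does not state. Values exactly on a quartile boundary fall into "low" or "high".

```python
import numpy as np
import pandas as pd

ORDER = ["low", "middle", "high"]


def categorize_inquiry_rate(rates: pd.Series) -> pd.Series:
    """Label each rate as low (<= Q1), high (>= Q3) or middle (in between)."""
    q25 = rates.quantile(0.25)
    q75 = rates.quantile(0.75)
    labels = np.select([rates <= q25, rates >= q75], ["low", "high"], default="middle")
    # Keep missing rates missing instead of silently labelling them "middle"
    result = pd.Series(labels, index=rates.index).where(rates.notna())
    return result.astype(pd.CategoricalDtype(categories=ORDER, ordered=True))


# Toy inquiry rates (not the real dataset)
toy_rates = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
toy_categories = categorize_inquiry_rate(toy_rates)
print(toy_categories.value_counts().reindex(ORDER))
```

With nine evenly spaced toy values, the quartiles are 3.0 and 7.0, so each category receives three values.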

In [None]:
# Calculate inquiry_rate for quartiles to define categorical groups
print(round(df_master_2019.inquiry_rate.describe(), 2))
print(df_master_2019.inquiry_rate.quantile(.25))
print(df_master_2019.inquiry_rate.quantile(.75))

In [None]:
# Create new column with the categories for inquiry rate

# Quartile boundaries of the inquiry rate
q25 = df_master_2019['inquiry_rate'].quantile(.25)
q75 = df_master_2019['inquiry_rate'].quantile(.75)

# Create a list of our conditions
conditions = [
    (df_master_2019['inquiry_rate'] <= q25),
    (df_master_2019['inquiry_rate'] > q25) & (df_master_2019['inquiry_rate'] < q75),
    (df_master_2019['inquiry_rate'] >= q75)
]

# Create a list of the values we want to assign for each condition
values = ['low', 'middle', 'high']

# Create a new column and use np.select to assign values to it using our lists as arguments.
# A string default is passed because newer NumPy versions reject the integer default 0 next to string choices;
# rows that match no condition (e.g. a missing inquiry_rate) are labelled 'unknown'.
df_master_2019['cat_inquiry_rate'] = np.select(conditions, values, default='unknown')

In [None]:
df_master_2019.head(2)

### 2. Group features by inquiry rate category

We only keep the plots where a difference between the categories is visible.

No difference in: top, children`s_room, corridor, dining_room, living_bedroom, separate_WC, washroom, wellness, Blu-ray_player, CDs_DVDs, additional_bed, awning_, beach_chair, bicycles, bread_service, bunk_bed, carport, chest_of_drawers, child's_bed, children_toilet_seat, coffee_machine, colouring_book_pencils, desk, double_bed, double_wash_basin, fireplace, first-aid_kit, food_processor, garage, garden_furniture, garden_shed, make-up_mirror, mirror, phone, playground, pond, private_parking, refrigerator, safe, sandpit, sandwich_toaster, shower, slide, socket_covers, spices, stair_gate, sunshade, swing, table_tennis, toilet, toys, trampoline, walk-in_shower, wardrobe, wash_basin, windbreak, cooking, pets_count, option_non_smoking_only, option_holiday_with_your_horse, option_close_to_the_beach, option_family_travel, option_holiday_with_your_baby, babybed, tv, pool, sauna, garden, bathrooms, bedrooms, DVD-player, living_room, bed_linen, cleaning_supplies, dining_table, drying_rack, hair_dryer, high_chair, hot_water, sofa_bed, single_bed, sun_umbrella_, tea_towels, towels, option_holiday_with_your_pet

Positive Difference: inquiry_count, kitchen_living, armchair, bath_towels, bathtub, central_heating, crockery, laundry_service, lawn, mixer, radio, stereo_system, sun_loungers, option_allergic, option_holiday_with_your_dog, option_wheelchair_accessible, option_close_to_the_water, option_long_term_holiday, option_technicians, washingmachine, dryer, grill, max_guests, living_area, lat, lng, filled_in_price_per_day, inquiry_rate

Negative Difference: kitchen, living_/_dining_room, storage_room, books, egg_cooker, fire_alarm, flat_iron, fly_screen, freezer, games, microwave, sofa, underfloor_heating, vacuum_cleaner, option_fully_accessible, internet, balcony

Other features plotted below: toaster, adult_count, length_stay, option_close_to_the_ski_lift, dishwasher

Not numeric: title, month, holiday_region
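
The visual comparison behind these lists can be complemented by a numeric screen that compares the mean of each feature per inquiry rate category. The following is a minimal sketch on toy data; the helper name `mean_by_category` and the toy values are assumptions, and the sign of `high_minus_low` only hints at which list a feature might belong in.

```python
import pandas as pd

ORDER = ["low", "middle", "high"]


def mean_by_category(df, features, category_col="cat_inquiry_rate"):
    """Mean of each feature per category, plus the difference between high and low."""
    means = df.groupby(category_col, observed=True)[features].mean()
    # Use plain string labels so the result does not depend on a categorical dtype
    means.index = means.index.astype(str)
    summary = means.reindex(ORDER).T
    summary["high_minus_low"] = summary["high"] - summary["low"]
    return summary.sort_values("high_minus_low", ascending=False)


# Toy data (not the real dataset): binary amenity flags for six inquiries
toy = pd.DataFrame({
    "cat_inquiry_rate": ["low", "low", "middle", "middle", "high", "high"],
    "grill": [0, 0, 0, 1, 1, 1],
    "kitchen": [1, 1, 1, 0, 0, 0],
    "towels": [1, 0, 1, 0, 1, 0],
})
screen = mean_by_category(toy, ["grill", "kitchen", "towels"])
print(screen)
```

For binary features the group mean is the share of inquiries with that amenity, which is often easier to read than a boxplot of 0/1 values.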

In [None]:
list(df_master_2019.columns)

In [None]:
# Five most frequent holiday regions
df_master_2019.holiday_region.value_counts()[:5]

In [None]:
# Restrict the data to a selection of holiday regions
region_list = ['Ostsee', 'Nordsee', 'Oberbayern', 'Oberallgäu', 'Bodensee']

region = df_master_2019[df_master_2019['holiday_region'].isin(region_list)]

In [None]:
# Inquiry rate per holiday region, with rotated tick labels for readability
ax = sns.boxplot(x="holiday_region", y="inquiry_rate", data=region)
ax.tick_params(axis='x', rotation=90)

#### Inquiry rate

As expected, the inquiry rate differs between the categories, because the categories were derived from this feature.

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="inquiry_rate", data=df_master_2019, order=['low', 'middle', 'high'])

#### Price per day

The price per day tends to be higher with a high inquiry rate than with lower rates.
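
To put a number on such a tendency, the median per category can be compared alongside the boxplot. This is a sketch on toy values only; the prices below are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Toy data (not the real dataset): daily prices for nine inquiries
toy_prices = pd.DataFrame({
    "cat_inquiry_rate": ["low"] * 3 + ["middle"] * 3 + ["high"] * 3,
    "filled_in_price_per_day": [50, 60, 70, 70, 80, 90, 90, 100, 130],
})

# Median price per category, shown in the same order as the boxplots
median_price = (
    toy_prices.groupby("cat_inquiry_rate")["filled_in_price_per_day"]
    .median()
    .reindex(["low", "middle", "high"])
)
print(median_price)
```

The median is used here because it is the centre line of a boxplot and is less sensitive to a few very expensive properties than the mean.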

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="filled_in_price_per_day", data=df_master_2019, order=['low', 'middle', 'high'])

#### Coordinates

Properties with a higher inquiry rate tend to lie further east and north. This coincides with the highest inquiry values being found for the Baltic Sea.

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="lng", data=df_master_2019, order=['low', 'middle', 'high'])

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="lat", data=df_master_2019, order=['low', 'middle', 'high'])

#### Living area



In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="living_area", data=df_master_2019, order=['low', 'middle', 'high'])

#### Maximum number of guests

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="max_guests", data=df_master_2019, order=['low', 'middle', 'high'])

#### Balcony

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="balcony", data=df_master_2019, order=['low', 'middle', 'high'])

#### Grill

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="grill", data=df_master_2019, order=['low', 'middle', 'high'])

#### Washing machine and dryer

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="dryer", data=df_master_2019, order=['low', 'middle', 'high'])


In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="washingmachine", data=df_master_2019, order=['low', 'middle', 'high'])

#### Dishwasher

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="dishwasher", data=df_master_2019, order=['low', 'middle', 'high'])

#### Internet

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="internet", data=df_master_2019, order=['low', 'middle', 'high'])

#### Wheelchair access

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="wheelchairaccess", data=df_master_2019, order=['low', 'middle', 'high'])

#### Property close to the ski lift

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="option_close_to_the_ski_lift", data=df_master_2019, order=['low', 'middle', 'high'])

#### Option technician

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="option_technicians", data=df_master_2019, order=['low', 'middle', 'high'])

#### Option fully accessible

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="option_fully_accessible", data=df_master_2019, order=['low', 'middle', 'high'])

#### Option long term holiday

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="option_long_term_holiday", data=df_master_2019, order=['low', 'middle', 'high'])

#### Option close to the water

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="option_close_to_the_water", data=df_master_2019, order=['low', 'middle', 'high'])

#### Holiday with your dog

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="option_holiday_with_your_dog", data=df_master_2019, order=['low', 'middle', 'high'])

#### Option allergic

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="option_allergic", data=df_master_2019, order=['low', 'middle', 'high'])

#### Length of stay

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="length_stay", data=df_master_2019, order=['low', 'middle', 'high'])

#### Children Count

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="children_count", data=df_master_2019, order=['low', 'middle', 'high'])

#### Adult count

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="adult_count", data=df_master_2019, order=['low', 'middle', 'high'])

#### Vacuum cleaner

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="vacuum_cleaner", data=df_master_2019, order=['low', 'middle', 'high'])

#### Underfloor heating

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="underfloor_heating", data=df_master_2019, order=['low', 'middle', 'high'])

#### Toaster

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="toaster", data=df_master_2019, order=['low', 'middle', 'high'])

#### Sun loungers

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="sun_loungers", data=df_master_2019, order=['low', 'middle', 'high'])

#### Stereo system

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="stereo_system", data=df_master_2019, order=['low', 'middle', 'high'])

#### Sofa

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="sofa", data=df_master_2019, order=['low', 'middle', 'high'])

#### Radio

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="radio", data=df_master_2019, order=['low', 'middle', 'high'])

#### Mixer

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="mixer", data=df_master_2019, order=['low', 'middle', 'high'])

#### Microwave

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="microwave", data=df_master_2019, order=['low', 'middle', 'high'])

#### Lawn

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="lawn", data=df_master_2019, order=['low', 'middle', 'high'])

#### Laundry-service

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="laundry_service", data=df_master_2019, order=['low', 'middle', 'high'])

#### Games

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="games", data=df_master_2019, order=['low', 'middle', 'high'])

#### Freezer

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="freezer", data=df_master_2019, order=['low', 'middle', 'high'])

#### Fly-screen

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="fly_screen", data=df_master_2019, order=['low', 'middle', 'high'])


#### Flat-iron

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="flat_iron", data=df_master_2019, order=['low', 'middle', 'high'])


#### Fire alarm

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="fire_alarm", data=df_master_2019, order=['low', 'middle', 'high'])

#### Egg cooker

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="egg_cooker", data=df_master_2019, order=['low', 'middle', 'high'])

#### Crockery

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="crockery", data=df_master_2019, order=['low', 'middle', 'high'])

#### Central heating

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="central_heating", data=df_master_2019, order=['low', 'middle', 'high'])

#### Books

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="books", data=df_master_2019, order=['low', 'middle', 'high'])

#### Bathtub

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="bathtub", data=df_master_2019, order=['low', 'middle', 'high'])

#### Bath towels

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="bath_towels", data=df_master_2019, order=['low', 'middle', 'high'])

#### Armchair

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="armchair", data=df_master_2019, order=['low', 'middle', 'high'])

#### Storage room

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="storage_room", data=df_master_2019, order=['low', 'middle', 'high'])

#### Living / Dining room and kitchen / living

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="living_/_dining_room", data=df_master_2019, order=['low', 'middle', 'high'])

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="kitchen_living", data=df_master_2019, order=['low', 'middle', 'high'])

#### Kitchen

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="kitchen", data=df_master_2019, order=['low', 'middle', 'high'])

#### Inquiry count

In [None]:
ax = sns.boxplot(x="cat_inquiry_rate", y="inquiry_count", data=df_master_2019, order=['low', 'middle', 'high'])

## 6. Correlations between some features

In [None]:
# Generate the heatmap (numeric columns only; text and datetime columns are excluded)
corr = df_master.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

In [None]:
# Generate table with correlations (numeric columns only)
df_master.corr(numeric_only=True).style.background_gradient(cmap='coolwarm')