# Initial EDA of the listings data for the TFW project

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset
df_listings = pd.read_csv('../data/listings_20210707.csv')

In [None]:
# Shape of the dataset
print('The dataset contains %s different accommodations and %s features' %(df_listings.shape[0], df_listings.shape[1]))

In [None]:
# Have a first look at the dataset
df_listings.head()

In [None]:
# First look at the info
df_listings.info()

In [None]:
# First description of the numerical features
df_listings.describe()

In [None]:
# Looking for categorical features
df_listings.nunique()

The dataset contains many categorical features that we need to process further.

## First cleaning steps

As Traum-Ferienwohnungen told us, all accommodations in the dataset are located in Germany, which the feature `country_title` confirms. The column therefore carries no information, so we can drop it.

In [None]:
# Drop the column country_title
df_listings = df_listings.drop('country_title', axis=1)

The feature `pets` contains only missing values and zeros. We assume this column was meant to record the number of pets allowed. Whether pets are allowed (or allowed on request) is already covered by the following columns: `option_holiday_with_your_pet`, `option_holiday_with_your_horse`, `option_holiday_with_your_dog`. For this reason, we drop this column too.

In [None]:
# Inspect the unique values, then drop the column pets
print(df_listings.pets.unique())
df_listings = df_listings.drop('pets', axis=1)

## Feature converting

First, convert the date feature `contract_end` to datetime.

In [None]:
# Convert column contract_end to datetime
df_listings['contract_end'] = pd.to_datetime(df_listings['contract_end'])
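
By default `pd.to_datetime` raises an error on entries it cannot parse. We have not verified whether `contract_end` contains such entries; if it does, a minimal sketch of a more forgiving conversion could look like this. The Series `raw` is made up for illustration.

```python
import pandas as pd

# Made-up values: two valid dates, one unparseable entry, one missing value
raw = pd.Series(['2021-07-07', '2022-01-31', 'unknown', None])

# errors='coerce' turns everything that cannot be parsed into NaT
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed)
print('Entries without a valid date:', parsed.isna().sum())
```

Counting the `NaT` values afterwards shows how many rows were affected by the coercion.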

The feature `living_area` contains some values given as a range. As Traum-Ferienwohnungen recommends, we take the first number of each range as the correct value and convert the column to a numeric type.

In [None]:
# Replace each range in `living_area` with its first number (only in this column)
df_listings['living_area'] = df_listings['living_area'].replace(
    ['70-280', '50-100', '50-70', '24-49', '16 - 26', '18 - 26', '88-100', '46-73', '50-80', '52-65', '50-60'],
    ['70', '50', '50', '24', '16', '18', '88', '46', '50', '52', '50'],
)

In [None]:
# Convert column `living_area` to float
df_listings['living_area'] = df_listings['living_area'].astype(float)
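
The hard-coded list above only covers the ranges present in this export. As a minimal sketch, assuming every range starts with the number we want to keep, the first number could also be extracted with a regular expression. The Series `living_area_raw` is invented for illustration and not taken from the dataset.

```python
import pandas as pd

# Invented sample values mimicking the formats seen in `living_area`
living_area_raw = pd.Series(['70-280', '16 - 26', '45', None, '52-65'])

# Capture the leading number (digits with an optional decimal part)
first_number = living_area_raw.str.extract(r'^\s*(\d+(?:\.\d+)?)', expand=False)

# Convert to float; missing entries stay NaN
living_area_clean = first_number.astype(float)
print(living_area_clean.tolist())
```

This way, ranges that appear in a later data export are handled without extending the list by hand.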

To use the option features in the model, we convert the booleans / categories to integers as follows:

- False / no / Not allowed >> 0
- True / yes / Allowed >> 1
- On request >> 2
- Unset >> 3

In [None]:
# Replacement to integers 
df_listings.replace(['False', 'no', 'not allowed', 'True', 'yes', 'allowed', 'on request', 'unset'], [0, 0, 0, 1, 1, 1, 2, 3], inplace=True)
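
Calling `replace` on the whole DataFrame touches every column, so a free-text column that happens to contain `yes` would be rewritten as well. A hedged alternative, assuming the option columns all share the `option_` prefix, applies the same encoding only to those columns. The toy frame and its column names are illustrative only.

```python
import pandas as pd

# Toy frame; column names and values are illustrative only
toy = pd.DataFrame({
    'option_garden': ['yes', 'no', 'unset'],
    'option_holiday_with_your_dog': ['allowed', 'on request', 'not allowed'],
    'title': ['yes', 'Seeblick', 'no'],  # free text that must stay untouched
})

# Same encoding as in the list above
encoding = {
    'False': 0, 'no': 0, 'not allowed': 0,
    'True': 1, 'yes': 1, 'allowed': 1,
    'on request': 2,
    'unset': 3,
}

option_cols = [c for c in toy.columns if c.startswith('option_')]
# Series.map turns any value that is missing from `encoding` into NaN
toy[option_cols] = toy[option_cols].apply(lambda s: s.map(encoding))
print(toy)
```

Unlike `replace`, `map` turns unknown categories into `NaN`, which makes unexpected values visible instead of silently keeping them.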

## Looking for correlations

In [None]:
# Generate the heatmap from the numeric and boolean columns only
corr = df_listings.select_dtypes(include=['number', 'bool']).corr()
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(
    corr,
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

In [None]:
# Generate table with correlations 
corr.style.background_gradient(cmap='coolwarm')
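
With many features, the heatmap and the styled table are hard to scan. The following sketch lists the most strongly correlated pairs instead. `top_pairs` is a hypothetical helper that is not part of the original analysis, and the toy data is random.

```python
import numpy as np
import pandas as pd

def top_pairs(corr, n=5):
    """Return the n column pairs with the largest absolute correlation."""
    cols = corr.columns
    # Indices of the upper triangle (k=1 skips the diagonal), so each pair appears once
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.Series(
        corr.to_numpy()[i, j],
        index=pd.MultiIndex.from_arrays([cols[i], cols[j]]),
    )
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Random toy data: max_guests depends on bedrooms, noise is independent
rng = np.random.default_rng(0)
toy = pd.DataFrame({'bedrooms': rng.integers(1, 6, size=200)})
toy['max_guests'] = toy['bedrooms'] * 2 + rng.integers(0, 2, size=200)
toy['noise'] = rng.normal(size=200)

print(top_pairs(toy.corr(), n=2))
```

Applied to the notebook's data, the call would be `top_pairs(corr)`.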

## Plotting distribution of the features

In [None]:
# Plotting histograms of numerical features 
df_listings.hist(bins=50, figsize = (30,30))
plt.show()

### Closer Look: categorical features

1. The histograms of the features `option_wheelchair_accessible` and `wheelchairaccess` look very similar. A check confirmed that the two columns are identical, so we drop one of them.

In [None]:
# Check for identical columns 
comparison_column = np.where(df_listings["option_wheelchair_accessible"] == df_listings["wheelchairaccess"], True, False)
print(np.all(comparison_column))
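
One caveat with an element-wise `==` check: `NaN == NaN` evaluates to `False`, so two identical columns that both contain missing values would be reported as different. A minimal sketch with made-up data shows the difference to `Series.equals`, which treats missing values in the same position as equal.

```python
import numpy as np
import pandas as pd

# Made-up columns that are identical, including one missing value
a = pd.Series([1.0, 0.0, np.nan, 3.0])
b = pd.Series([1.0, 0.0, np.nan, 3.0])

elementwise = bool((a == b).all())  # the NaN position compares as not equal
nan_aware = a.equals(b)             # NaNs in matching positions count as equal

print('element-wise:', elementwise)
print('Series.equals:', nan_aware)
```

Note that `Series.equals` also requires both columns to have the same dtype.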

In [None]:
# Drop the column wheelchairaccess
df_listings = df_listings.drop('wheelchairaccess', axis=1)

2. The histograms of the features `option_non_smoking_only` and `smoking` look like mirror images of each other. A check confirmed that their True / False values are opposite, so we drop one of the columns. We decided to drop `smoking` because `option_non_smoking_only` additionally distinguishes the unset and on request values.

In [None]:
# Count values for categories
print('option_non_smoking_only:\n', df_listings['option_non_smoking_only'].value_counts())
print('smoking:\n',df_listings['smoking'].value_counts())

In [None]:
# Create a sub dataset that contains only True / False values for the columns
smoking = df_listings.query("option_non_smoking_only == [0 ,1] & smoking == [1, 0]")

In [None]:
# Check for contrary columns 
comparison_column_smoking = np.where(smoking["option_non_smoking_only"] != smoking["smoking"], True, False)
print(np.all(comparison_column_smoking))
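
A cross-tabulation gives the same answer at a glance and also shows how the on request and unset codes are distributed. The sketch below uses `pd.crosstab` on invented values.

```python
import pandas as pd

# Invented example: 0 = no, 1 = yes, 2 = on request, 3 = unset
toy = pd.DataFrame({
    'option_non_smoking_only': [1, 1, 0, 0, 2, 3, 1],
    'smoking':                 [0, 0, 1, 1, 0, 0, 0],
})

table = pd.crosstab(toy['option_non_smoking_only'], toy['smoking'])
print(table)

# On the 0/1 subset the columns are opposites if the (0, 0) and (1, 1) cells are empty
binary = table.reindex(index=[0, 1], columns=[0, 1], fill_value=0)
is_contrary = binary.loc[0, 0] == 0 and binary.loc[1, 1] == 0
print(is_contrary)
```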

In [None]:
# Drop the column smoking
df_listings = df_listings.drop('smoking', axis=1)

3. The histograms of the features `option_close_to_the_beach` and `option_close_to_the_water` look very similar. All accommodations close to the beach are also close to the water. However, the difference between the two matters to guests (close to the water can also mean a dike, a harbour or a lake), so we keep both features.


4. For the features `option_close_to_the_beach`, `option_close_to_the_water`, `option_close_to_the_ski_lift`, `option_railway_station` and `option_airport`, the share of unset values is high:

In [None]:
# Calculate the percentage of unset values (encoded as 3) per feature
print('Percent of unset values in feature')
print('Beach nearby:', round(df_listings.query('option_close_to_the_beach == 3').shape[0]/df_listings.shape[0]*100, 1))
print('Water nearby:', round(df_listings.query('option_close_to_the_water == 3').shape[0]/df_listings.shape[0]*100, 1))
print('Ski lift nearby:', round(df_listings.query('option_close_to_the_ski_lift == 3').shape[0]/df_listings.shape[0]*100, 1))
print('Railway station:', round(df_listings.query('option_railway_station == 3').shape[0]/df_listings.shape[0]*100, 1))
print('Airport:', round(df_listings.query('option_airport == 3').shape[0]/df_listings.shape[0]*100, 1))
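
The same percentages can be computed for any number of columns in one step, because the mean of a boolean mask is the share of `True` values. A sketch on toy data follows; the code `3` for unset follows the encoding above, the values themselves are invented.

```python
import pandas as pd

# Invented values; 3 stands for 'unset'
toy = pd.DataFrame({
    'option_airport':            [3, 3, 3, 1],
    'option_close_to_the_beach': [3, 1, 0, 1],
})

# Share of rows per column that carry the 'unset' code, in percent
unset_share = toy.eq(3).mean().mul(100).round(1)
print(unset_share.sort_values(ascending=False))
```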

Because of the high share of unset values in the features `option_railway_station` (95.2%) and `option_airport` (98.2%), we drop these columns, as they carry almost no information. For now we keep the features `option_close_to_the_beach` (67%), `option_close_to_the_water` (63.8%) and `option_close_to_the_ski_lift` (80.7%), because they could be important for the clustering model and are important information for guests choosing the right accommodation.

In [None]:
# Drop the column option_railway_station and option_airport
df_listings = df_listings.drop(['option_railway_station', 'option_airport'], axis=1)

### Closer Look: numerical features

#### Bathrooms

In [None]:
# Description of the feature bathrooms
df_listings.bathrooms.describe()

In [None]:
# Number of accommodations per number of bathrooms
df_listings.groupby('bathrooms')['listing_id'].count()

We have a few accommodations with a high number of bathrooms and we have to decide how we want to handle this.

In [None]:
# The numerical features are strongly correlated, so we look at the median of each feature per number of bathrooms to see how they relate
numerical_features = df_listings[['bathrooms', 'bedrooms', 'max_guests', 'living_area']]
numerical_features.groupby('bathrooms').median()

With a higher number of bathrooms the number of bedrooms, maximum guests and living area also increase.
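
Medians of groups with only a handful of listings are not very reliable, so it can help to show the group size next to them. A sketch using named aggregation on invented values; the output column names are our own choice.

```python
import pandas as pd

# Invented values for illustration
toy = pd.DataFrame({
    'bathrooms':  [1, 1, 1, 2, 2, 6],
    'bedrooms':   [1, 2, 2, 3, 4, 9],
    'max_guests': [2, 4, 4, 6, 8, 20],
})

# One row per number of bathrooms: group size plus medians of the other features
summary = toy.groupby('bathrooms').agg(
    n_listings=('bedrooms', 'size'),
    median_bedrooms=('bedrooms', 'median'),
    median_max_guests=('max_guests', 'median'),
)
print(summary)
```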

#### Bedrooms

In [None]:
# Description of the feature bedrooms
df_listings.bedrooms.describe()

In [None]:
# Number of accommodations per number of bedrooms
df_listings.groupby('bedrooms')['listing_id'].count()

In [None]:
# The numerical features are strongly correlated, so we look at the median of each feature per number of bedrooms to see how they relate
numerical_features.groupby('bedrooms').median()

With a higher number of bedrooms the number of bathrooms, maximum guests and living area also increase.

#### Maximum guests

In [None]:
# Description of the feature maximum guests
df_listings.max_guests.describe()

In [None]:
# Number of accommodations per maximum number of guests
df_listings.groupby('max_guests')['listing_id'].count()

In [None]:
# The numerical features are strongly correlated, so we look at the median of each feature per maximum number of guests to see how they relate
numerical_features.groupby('max_guests').median()

As the maximum number of guests rises, the number of bathrooms, the number of bedrooms and the living area do not increase steadily. There is no clear pattern.

#### Living area

In [None]:
# Description of the feature living area
df_listings.living_area.describe()

## Regions

In [None]:
print(df_listings.region.nunique())
print(df_listings.region.unique())

In [None]:
print(df_listings.subregion.nunique())
print(df_listings.subregion.unique())

In [None]:
print(df_listings.holiday_region.nunique())
print(df_listings.holiday_region.unique())

In [None]:
print(df_listings.zip.nunique())

## Accommodation type

In [None]:
print(df_listings.property_type.nunique())
print(df_listings.property_type.unique())

## Missing values and outliers for numerical features

In [None]:
# Missing values
print('Missing values')
print('Bathrooms:', df_listings.bathrooms.isna().sum())
print('Bedrooms:', df_listings.bedrooms.isna().sum())
print('Maximum guests:', df_listings.max_guests.isna().sum())
print('Living area:', df_listings.living_area.isna().sum())

In [None]:
# Zero values
print('Zero values')
print('Bathrooms:', df_listings.query('bathrooms == 0').shape[0])
print('Bedrooms:', df_listings.query('bedrooms == 0').shape[0])

We will compare the missing data with the dataset room feature to see if we have information about the rooms there.

In [None]:
# boxplots
df_listings.boxplot(column=['bathrooms', 'bedrooms'])

In [None]:
df_listings.boxplot(column='max_guests')

In [None]:
df_listings.boxplot(column='living_area')

The numerical features contain outliers that we have to handle.
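
For orientation: the points drawn beyond the whiskers in the boxplots are, with the default settings, the values that lie more than 1.5 times the interquartile range (IQR) outside the quartiles. This is not the rule used in this notebook, where the thresholds are chosen by inspecting the data. The sketch only illustrates how those boxplot fences could be computed. `iqr_bounds` is a hypothetical helper and the values are invented.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Return the lower and upper fence of the k * IQR rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented bedroom counts with one obvious outlier
bedrooms = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 12])

lower, upper = iqr_bounds(bedrooms)
outliers = bedrooms[(bedrooms < lower) | (bedrooms > upper)]
print(lower, upper, outliers.tolist())
```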

In [None]:
# Calculate different quantiles for numerical features
print('Quantiles 0.95, 0.975, 0.99 and 1 for:')
print('Bathrooms:\n', df_listings.bathrooms.quantile([.95, .975, .99, 1]))
print('-----------------------')
print('Bedrooms:\n', df_listings.bedrooms.quantile([.95, .975, .99, 1]))
print('-----------------------')
print('Maximum guests:\n', df_listings.max_guests.quantile([.95, .975, .99, 1]))
print('-----------------------')
print('Living_area:\n', df_listings.living_area.quantile([.95, .975, .99, 1]))
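
`DataFrame.quantile` accepts a list of quantiles and returns one table for all selected columns, which is easier to compare than four separate printouts. A sketch on synthetic data; the numbers are random and not taken from the listings.

```python
import numpy as np
import pandas as pd

# Random toy data standing in for the numerical features
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'bedrooms': rng.integers(1, 10, size=1000),
    'max_guests': rng.integers(1, 30, size=1000),
})

# One row per quantile, one column per feature
quantile_table = toy.quantile([0.95, 0.975, 0.99, 1.0])
print(quantile_table)
```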

In [None]:
print(df_listings.query('bedrooms > 4').shape[0])
print(df_listings.query('bedrooms > 8').shape[0])
print('-----')
print(df_listings.query('bathrooms > 3').shape[0])
print(df_listings.query('bathrooms > 4').shape[0])
print(df_listings.query('bathrooms > 5').shape[0])
print('-----')
print(df_listings.query('max_guests > 13').shape[0])
print(df_listings.query('max_guests > 20').shape[0])
print('-----')
print(df_listings.query('living_area > 230').shape[0])
print(df_listings.query('living_area > 300').shape[0])
print(df_listings.query('living_area > 500').shape[0])

#### Outliers in bedrooms

We decided to drop all rows with 9 or more bedrooms.

In [None]:
# Get the index labels of rows where bedrooms is greater than or equal to 9
indexNames = df_listings[df_listings['bedrooms'] >= 9].index
# Delete these row indexes from dataset
df_listings.drop(indexNames , inplace=True)
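
Dropping by index labels works, but if the index contained duplicate labels, every row sharing a dropped label would be removed. A boolean mask avoids going through the labels. A sketch on invented values, which like the code above keeps rows with a missing bedroom count:

```python
import numpy as np
import pandas as pd

# Invented values, including one missing bedroom count
toy = pd.DataFrame({
    'listing_id': [1, 2, 3, 4],
    'bedrooms': [2, 9, np.nan, 12],
})

# Negate the outlier condition; NaN >= 9 is False, so rows with NaN are kept
filtered = toy[~(toy['bedrooms'] >= 9)].copy()
print(filtered)
```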

In [None]:
# Looking for the description of the numerical features after dropping the outliers in bedrooms
df_listings[numerical_features.columns].describe()

After dropping the outliers in bedrooms, the feature bathrooms looks reasonable, so we don't have to clean it for outliers.

#### Outliers in maximum guests

In [None]:
# To find a good threshold for outliers in maximum guests, we group the numerical features by number of bedrooms and look at the maximum values
df_listings[numerical_features.columns].groupby('bedrooms').max()

The maximum number of guests for accommodations with 8 bedrooms is 32. Let's see how many accommodations have a value greater than 32 for maximum guests:

In [None]:
print(df_listings.query('max_guests > 32').shape[0])

We decided to drop the rows with more than 32 maximum guests from the dataset.

In [None]:
# Get the index labels of rows where maximum guests is greater than 32
indexNames_guests = df_listings[df_listings['max_guests'] > 32].index
# Delete these row indexes from dataset
df_listings.drop(indexNames_guests , inplace=True)

In [None]:
df_listings[numerical_features.columns].groupby('bedrooms').max()

In [None]:
# Looking for the description of the numerical features after dropping the outliers in maximum guests
df_listings[numerical_features.columns].describe()

#### Outliers in living area

In [None]:
print(df_listings.query('living_area > 230').shape[0])
print(df_listings.query('living_area > 300').shape[0])
print(df_listings.query('living_area > 350').shape[0])
print(df_listings.query('living_area > 400').shape[0])
print(df_listings.query('living_area > 450').shape[0])
print(df_listings.query('living_area > 500').shape[0])
print(df_listings.query('living_area > 550').shape[0])

In [None]:
df_listings.query('living_area > 450')

In [None]:
# Number of accommodations per living area value
df_listings.groupby('living_area')['listing_id'].count()