# Berlin Airbnb Open Data

## About Dataset
### Context
Since 2008, guests and hosts have used Airbnb to travel in a more personalized way. As part of the Inside Airbnb initiative, this dataset describes the listing activity of homestays in Berlin, Germany.

### Content
The following Airbnb activity is included in this Berlin dataset:

- Listings, including full descriptions and average review score
- Reviews, including unique id for each reviewer and detailed comments
- Calendar, including listing id and the price and availability for that day

The data was published on 22 June 2024.

### Acknowledgement
This dataset is part of Inside Airbnb, and the original source can be found [here](http://insideairbnb.com/get-the-data.html).

## Approach
This analysis is guided by key questions inspired by the Inside Airbnb dashboard, focusing on the following aspects:
- **Distribution of room types:** What is the current distribution of different room types available on Airbnb in Berlin?
- **Booking and income trends:** What are the average number of nights booked and the average income generated by listings over the past 12 months?
- **Short-Term Rentals:** What are the "minimum nights" settings for listings? How is the distribution between short-term and long-term rentals?
- **Host listing distribution:** How are listings distributed across hosts? Specifically, how many listings do the top 10 hosts manage?

The data will be analyzed and cleaned as necessary to address each of these focus areas.

### Load Libraries and Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df_calendar = pd.read_csv('./data/calendar.csv')
df_listings = pd.read_csv('./data/listings.csv')
df_reviews = pd.read_csv('./data/reviews.csv')

### Distribution of Room Types
Airbnb hosts can list entire homes/apartments, private rooms, shared rooms, and, more recently, hotel rooms.

Depending on the room type and activity, a residential Airbnb listing can operate more like a hotel: it may disrupt neighbours, take housing off the market, and in some cases be illegal.

In [None]:
room_types = df_listings['room_type'].value_counts()

entire_home_apt = room_types['Entire home/apt']
private_room = room_types['Private room']
shared_room = room_types['Shared room']
hotel_room = room_types['Hotel room']

In [None]:
plt.barh(room_types.index, room_types.values)
plt.ylabel('Room Type')
plt.xlabel('Listings')

In [None]:
print("In Berlin, {}% of Airbnb listings are entire homes or apartments, totaling {} listings. Private rooms make up {}% with {} listings, while shared rooms and hotel rooms account for {}% ({} listings) and {}% ({} listings) respectively.".format(np.round(entire_home_apt / room_types.sum() * 100, 1), entire_home_apt, np.round(private_room / room_types.sum() * 100, 1), private_room, np.round(shared_room / room_types.sum() * 100, 1), shared_room, np.round(hotel_room / room_types.sum() * 100, 1), hotel_room))
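The shares printed above can also be computed directly with `value_counts(normalize=True)`. The sketch below uses a small, made-up series (`toy_room_types` is hypothetical, not the Berlin data) and shows one way to guard against a room type that is missing from the data, in which case a lookup such as `room_types['Hotel room']` would raise a `KeyError`.

```python
import pandas as pd

# Hypothetical toy data standing in for df_listings['room_type'].
toy_room_types = pd.Series(
    ['Entire home/apt', 'Entire home/apt', 'Private room', 'Shared room'],
    name='room_type',
)

# normalize=True returns fractions that sum to 1; scale to percentages.
toy_shares = (toy_room_types.value_counts(normalize=True) * 100).round(1)

# .get() returns a default instead of raising KeyError for an absent room type.
hotel_share = toy_shares.get('Hotel room', 0.0)

print(toy_shares)
print('Hotel room share:', hotel_share)
```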

### Booking and Income Trends
The minimum stay, price and number of reviews are used to estimate the number of nights booked and the income for each listing over the last 12 months.

Is the home, apartment or room rented frequently and displacing units of housing and residents? Does the income from Airbnb incentivise short-term rentals vs long-term housing?
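To make the estimate concrete, here is a minimal sketch on made-up numbers. It assumes, as this notebook does, that every review corresponds to one stay of exactly `minimum_nights` nights; the listings and values in `toy_activity` are hypothetical.

```python
import pandas as pd

# Hypothetical listings: price per night, minimum stay, reviews in the period.
toy_activity = pd.DataFrame({
    'listing_id': [1, 2, 3],
    'price_adj': [100.0, 50.0, 80.0],
    'minimum_nights': [2, 3, 1],
    'stays': [10, 0, 5],
})

# Estimated nights = minimum stay x number of reviews (one review = one stay).
toy_activity['total_nights'] = toy_activity['minimum_nights'] * toy_activity['stays']
# Estimated income = nightly price x estimated nights.
toy_activity['total_income'] = toy_activity['price_adj'] * toy_activity['total_nights']

# Averages are taken over listings with at least one stay.
active_listings = int((toy_activity['stays'] > 0).sum())
toy_avg_nights = toy_activity['total_nights'].sum() / active_listings
toy_avg_price = toy_activity['total_income'].sum() / toy_activity['total_nights'].sum()
toy_avg_income = toy_activity['total_income'].sum() / active_listings

print(toy_avg_nights, toy_avg_price, toy_avg_income)
```

Because listing 2 has no reviews, it contributes nothing to the totals and is left out of the denominators.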

In [None]:
df_calendar['date'] = pd.to_datetime(df_calendar['date'])
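# Strip the currency symbol and thousands separators (raw string avoids an invalid-escape warning)
df_calendar['price'] = df_calendar['price'].replace(r'[\$,]', '', regex=True).astype(float)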

df_calendar_price = df_calendar.groupby('listing_id').first()['price'].reset_index()
df_calendar_price.columns = ['id', 'calendar_price']
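The price columns are stored as text such as `$1,250.00`. The sketch below shows the same cleaning idea on a hypothetical series (`toy_prices`). It uses `pd.to_numeric` rather than `astype(float)`, which is an assumption of this example rather than the notebook's approach, so that missing values pass through as `NaN`.

```python
import pandas as pd

# Hypothetical raw price strings, including a missing value.
toy_prices = pd.Series(['$1,250.00', '$85.00', None])

# Remove the dollar sign and thousands separators, then convert to numbers.
cleaned = pd.to_numeric(toy_prices.str.replace(r'[\$,]', '', regex=True))

print(cleaned)
```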

In [None]:
df_listings['price'] = df_listings['price'].replace(r'[\$,]', '', regex=True).astype(float)
df_listings_subset = df_listings[['id', 'price', 'minimum_nights']]

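# Join on 'id' only; passing both on= and right_index=True raises a MergeError.
# A left join keeps listings that have no calendar entry.
df_listings_price = pd.merge(df_listings_subset, df_calendar_price, on='id', how='left')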
df_listings_price['price_adj'] = df_listings_price['price'].fillna(df_listings_price['calendar_price'])
df_listings_price = df_listings_price.rename(columns={'id': 'listing_id'})

In [None]:
df_reviews['date'] = pd.to_datetime(df_reviews['date'])
df_reviews_subset = df_reviews[(df_reviews['date'] >= '2023-06-01') & (df_reviews['date'] <= '2024-05-31')]

In [None]:
# Count reviews per listing as a proxy for stays (works regardless of how value_counts names its output)
df_listings_stays = df_reviews_subset.groupby('listing_id').size().reset_index(name='stays')
df_listings_activity = pd.merge(df_listings_price, df_listings_stays, on='listing_id', how='left')
# Listings without reviews in the period get zero stays
df_listings_activity['stays'] = df_listings_activity['stays'].fillna(0)
df_listings_activity['total_nights'] = df_listings_activity['minimum_nights'] * df_listings_activity['stays']
df_listings_activity['total_income'] = df_listings_activity['price_adj'] * df_listings_activity['total_nights']

avg_nights_booked = df_listings_activity['total_nights'].sum() / len(df_listings_activity[df_listings_activity['stays'] > 0])
avg_price_night = df_listings_activity['total_income'].sum() / df_listings_activity['total_nights'].sum()
avg_income = df_listings_activity['total_income'].sum() / len(df_listings_activity[df_listings_activity['stays'] > 0])

In [None]:
bins_occupancy = pd.cut(df_listings_activity.groupby('listing_id')['total_nights'].sum(), bins=[0, 1, 31, 61, 91, 121, 151, 181, 211, 241, float('inf')], right=False, labels=['0', '1-30', '31-60', '61-90', '91-120', '121-150', '151-180', '181-210', '211-240', '241-255+']).sort_index()
occupancy = bins_occupancy.value_counts().sort_index()
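The occupancy bins above rely on `pd.cut` with `right=False`, so each bin includes its left edge and excludes its right edge. A minimal sketch on hypothetical values (`toy_nights`, with fewer bins than the notebook uses) shows where the boundary values land.

```python
import pandas as pd

# Hypothetical total nights for five listings.
toy_nights = pd.Series([0, 1, 30, 31, 300])

# right=False makes the bins [0, 1), [1, 31), [31, inf).
toy_bins = pd.cut(
    toy_nights,
    bins=[0, 1, 31, float('inf')],
    right=False,
    labels=['0', '1-30', '31+'],
)

# value_counts on a categorical keeps every category; sort_index restores bin order.
toy_counts = toy_bins.value_counts().sort_index()

print(toy_counts)
```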

In [None]:
plt.figure(figsize=(10, 4.8))
plt.ylabel('Listings')
plt.xlabel('Occupancy (last 12 months)')
plt.xticks(range(len(occupancy)), np.array(occupancy.index.astype(str)))
plt.bar(range(len(occupancy)), occupancy.values)

In [None]:
print("In Berlin, the average number of nights booked per Airbnb listing with at least one review over the last 12 months is {}. The average price per night is €{}, leading to an average income of €{} per such listing.".format(round(avg_nights_booked, 0), round(avg_price_night, 1), round(avg_income, 1)))

### Short-Term Rentals
The housing policies of cities and towns can restrict short-term rentals to protect housing for residents.

By looking at the "minimum nights" setting for listings, we can see if the market has shifted to longer-term stays. Was it to avoid regulations, or in response to changes in travel demands?

In some cases, Airbnb has moved large numbers of its listings to longer stays to avoid short-term rental regulations and accountability.

In [None]:
bins_rental_duration = pd.cut(df_listings['minimum_nights'], bins=list(range(1, 36)) + [float('inf')], right=False, labels=list(range(1, 35)) + ['35+'])
rental_duration = bins_rental_duration.value_counts().sort_index()

# The first 29 bins cover minimum stays of 1-29 nights (short-term);
# the remaining bins cover minimum stays of 30 nights or more.
st_rentals = rental_duration.iloc[:29].sum()
lt_rentals = rental_duration.iloc[29:].sum()

In [None]:
plt.figure(figsize=(10, 4.8))
plt.ylabel('Listings')
plt.xlabel('Minimum Nights')
plt.xticks(range(len(rental_duration)), np.array(rental_duration.index.astype(str)))
vertical_line = plt.axvline(x=28.5, color='black', linestyle='--', label='Short-Term Rentals Threshold')
plt.legend(handles=[vertical_line])
plt.bar(range(len(rental_duration)), rental_duration)

In [None]:
print("In Berlin, {}% of Airbnb listings are categorized as short-term rentals, accounting for {} listings. The remaining {}% are longer-term rentals, which totals {} listings.".format(round(st_rentals / (st_rentals + lt_rentals) * 100, 1), st_rentals, round(lt_rentals / (st_rentals + lt_rentals) * 100, 1), lt_rentals))
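The split above treats listings with a minimum stay of fewer than 30 nights as short-term rentals. The same classification can be sketched without binning, using a boolean mask on hypothetical values; `toy_min_nights` and the constant `SHORT_TERM_MAX_NIGHTS` are names introduced here for illustration only.

```python
import pandas as pd

# Hypothetical minimum-night settings for five listings.
toy_min_nights = pd.Series([1, 2, 29, 30, 90])

# Assumed threshold, matching the notebook's split at 29 / 30 nights.
SHORT_TERM_MAX_NIGHTS = 29

is_short_term = toy_min_nights <= SHORT_TERM_MAX_NIGHTS
toy_st = int(is_short_term.sum())
toy_lt = int((~is_short_term).sum())
toy_st_share = round(toy_st / (toy_st + toy_lt) * 100, 1)

print(toy_st, toy_lt, toy_st_share)
```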

### Host Listing Distribution
Some Airbnb hosts have multiple listings.

A host may list separate rooms in the same apartment, or multiple apartments or homes available in their entirety.

Hosts with multiple listings are more likely to be running a business, are unlikely to be living in the property, and are likely to be in violation of most short-term rental laws designed to protect residential housing.

In [None]:
bins_listings_per_host = pd.cut(df_listings.groupby('host_id')['id'].count(), bins=list(range(1, 11)) + [float('inf')], right=False, labels=list(range(1, 10)) + ['10+'])
listings_per_host = pd.concat([bins_listings_per_host, df_listings.groupby('host_id')['id'].count()], axis=1)
listings_per_host.columns = ['bins', 'count']
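# observed=False keeps empty bins and makes the categorical grouping behaviour explicit
listings_per_host = listings_per_host.groupby('bins', observed=False).sum().sort_index()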

# Positional access: the first bin holds listings whose host has exactly one listing
single_listings = listings_per_host['count'].iloc[0]
multi_listings = listings_per_host['count'].sum() - single_listings

In [None]:
plt.ylabel('Listings')
plt.xlabel('Listings per Host')
plt.xticks(range(len(listings_per_host)), np.array(listings_per_host.index.astype(str)))
plt.bar(range(len(listings_per_host)), listings_per_host['count'].values)

In [None]:
print("In Berlin, {}% of Airbnb listings are multi-listings, meaning they are managed by hosts with multiple properties. This accounts for {} listings. The remaining {}% are single listings, totaling {} properties managed by hosts with only one listing.".format(round(multi_listings / listings_per_host['count'].sum() * 100, 1), multi_listings, round(single_listings / listings_per_host['count'].sum() * 100, 1), single_listings))
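As a cross-check on the binned totals, the single versus multi-listing split can be computed by attaching each host's listing count to every listing. This is a sketch on a hypothetical frame (`toy_listings`), not the notebook's method.

```python
import pandas as pd

# Hypothetical listings: host 'a' has three, host 'b' one, host 'c' two.
toy_listings = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'host_id': ['a', 'a', 'a', 'b', 'c', 'c'],
})

# transform('count') broadcasts each host's listing count back to its rows.
toy_listings['host_listing_count'] = (
    toy_listings.groupby('host_id')['id'].transform('count')
)

toy_single = int((toy_listings['host_listing_count'] == 1).sum())
toy_multi = int((toy_listings['host_listing_count'] > 1).sum())

print(toy_single, toy_multi)
```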

In [None]:
top_hosts = df_listings.groupby(['host_id', 'room_type']).size().unstack(fill_value=0)
top_hosts['Total listings'] = top_hosts.sum(axis=1)
top_hosts = top_hosts.sort_values(by='Total listings', ascending=False)
top_hosts[:10]

In [None]:
print("The table shows the top {} Airbnb hosts in Berlin. The host with the most listings manages {} listings, {} of which are entire homes/apartments. The other top hosts have a mix of room types, with the total number of listings per host ranging from {} to {}.".format(len(top_hosts[:10]), top_hosts[:10]['Total listings'].iloc[0], top_hosts[:10]['Entire home/apt'].iloc[0], top_hosts[:10]['Total listings'].iloc[-1], top_hosts[:10]['Total listings'].iloc[1]))