# **Introduction: Exploring Airbnb in New York City**

<img src='https://costar.brightspotcdn.com/dims4/default/9ef74e2/2147483647/strip/true/crop/2048x1366+0+0/resize/2048x1366!/quality/100/?url=http%3A%2F%2Fcostar-brightspot.s3.us-east-1.amazonaws.com%2FGettyImages-1424386174.jpg'>

Airbnb is an online platform founded in 2008, connecting hosts who offer short-term lodging with travelers seeking unique accommodations. Users can book homes, apartments, or rooms, fostering a community-driven travel experience. It has revolutionized the hospitality industry by offering a wide range of lodging options and cultural immersion opportunities.

The data for this project was gathered from [Inside Airbnb](http://insideairbnb.com/get-the-data).

Here's the [link](https://tinyurl.com/nyc-airbnb) to the presentation slides for this project.

# **Import libraries**

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import os
from matplotlib.ticker import FuncFormatter
import warnings

# Function to save figures
def save_figure(figure_name):
    figure_path = os.path.join('/content/drive/MyDrive/Data Science - Final Project/images', figure_name)
    plt.tight_layout()
    plt.savefig(figure_path)
    print(f"Figure saved at: {figure_path}")

# Function to format axis labels
def format_revenue(value, _):
    return f"{value/1000:.0f}K"

warnings.filterwarnings('ignore')
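
`format_revenue` has the `(value, position)` signature that matplotlib's `FuncFormatter` passes to its callback, so it could presumably be attached to an axis with `ax.yaxis.set_major_formatter(FuncFormatter(format_revenue))`. It is not called in the cells shown here. As a minimal sketch, the snippet below repeats the helper and applies it to a few illustrative tick values (`sample_ticks` is made up, not taken from the dataset).

```python
# Same helper as in the cell above, repeated so the snippet runs on its own
def format_revenue(value, _):
    return f"{value/1000:.0f}K"

# Illustrative tick values (hypothetical, not from the dataset)
sample_ticks = [0, 50_000, 125_000, 350_000]

# The second argument is the tick position, which the helper ignores
tick_labels = [format_revenue(tick, None) for tick in sample_ticks]
print(tick_labels)
```

The helper rounds to whole thousands, so it suits axes whose ticks are multiples of 1,000.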

# **Gather data**

In [None]:
# Import datasets
listings = pd.read_csv('/content/drive/MyDrive/Data Science - Final Project/data/listings.csv')

# **Assess data**

In [None]:
# Preview listings dataset
listings.head(3)

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | number_of_reviews_ltm | license |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2595 | Rental unit in New York · ★4.68 · Studio · 1 b... | 2845 | Jennifer | Manhattan | Midtown | 40.75356 | -73.98559 | Entire home/apt | 240 | 30 | 49 | 2022-06-21 | 0.29 | 3 | 351 | 0 | |
| 1 | 5121 | Rental unit in Brooklyn · ★4.52 · 1 bedroom · ... | 7356 | Garon | Brooklyn | Bedford-Stuyvesant | 40.68535 | -73.95512 | Private room | 66 | 30 | 50 | 2019-12-02 | 0.29 | 2 | 151 | 0 | |
| 2 | 6848 | Rental unit in Brooklyn · ★4.58 · 2 bedrooms ·... | 15991 | Allen & Irina | Brooklyn | Williamsburg | 40.70935 | -73.95342 | Entire home/apt | 81 | 30 | 191 | 2023-08-14 | 1.09 | 1 | 79 | 5 | |


In [None]:
print("This dataset consists of {} observations and {} attributes. \n".format(listings.shape[0], listings.shape[1]))

print("There are {} numerical attributes such as {}... \n".format(len(listings.select_dtypes(['float', 'int']).columns),
                                                                ", ".join(list(listings.select_dtypes(['float', 'int']).columns)[0:5])))

print("There are {} categorical attributes such as {}... \n".format(len(listings.select_dtypes(['object']).columns),
                                                                  ", ".join(list(listings.select_dtypes(['object']).columns)[0:5])))

This dataset consists of 38792 observations and 18 attributes. 

There are 11 numerical attributes such as id, host_id, latitude, longitude, price... 

There are 7 categorical attributes such as name, host_name, neighbourhood_group, neighbourhood, room_type... 



In [None]:
# Select the categorical attributes
cat_attribs = listings.select_dtypes(include=['object'])

# Create a table of missing categorical data with percentage
cat_null_total = cat_attribs.isnull().sum().sort_values(ascending=False)
cat_null_pct = (cat_attribs.isnull().sum() / cat_attribs.isnull().count() * 100).sort_values(ascending=False)
cat_nulls = pd.concat([cat_null_total, cat_null_pct],
                      axis=1,
                      keys=['nulls', '% of nulls']).reset_index().rename(columns={'index':'categorical attributes'})
cat_nulls

| | categorical attributes | nulls | % of nulls |
|---|---|---|---|
| 0 | license | 35853 | 92.423696 |
| 1 | last_review | 10352 | 26.685915 |
| 2 | host_name | 5 | 0.012889 |
| 3 | name | 0 | 0.0 |
| 4 | neighbourhood_group | 0 | 0.0 |
| 5 | neighbourhood | 0 | 0.0 |
| 6 | room_type | 0 | 0.0 |


In [None]:
# Select the numerical attributes
num_attribs = listings.select_dtypes(include=['float64', 'int64'])

# Create a table of missing numerical data with percentage
num_null_total = num_attribs.isnull().sum().sort_values(ascending=False)
num_null_pct = (num_attribs.isnull().sum() / num_attribs.isnull().count() * 100).sort_values(ascending=False)
num_nulls = pd.concat([num_null_total, num_null_pct],
                      axis=1,
                      keys=['nulls', '% of nulls']).reset_index().rename(columns={'index':'numerical attributes'})
num_nulls

| | numerical attributes | nulls | % of nulls |
|---|---|---|---|
| 0 | reviews_per_month | 10352 | 26.685915 |
| 1 | id | 0 | 0.0 |
| 2 | host_id | 0 | 0.0 |
| 3 | latitude | 0 | 0.0 |
| 4 | longitude | 0 | 0.0 |
| 5 | price | 0 | 0.0 |
| 6 | minimum_nights | 0 | 0.0 |
| 7 | number_of_reviews | 0 | 0.0 |
| 8 | calculated_host_listings_count | 0 | 0.0 |
| 9 | availability_365 | 0 | 0.0 |
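
The categorical and numerical null tables are built with the same steps, so the logic could be wrapped in one helper. The sketch below is one possible way to do that; `null_summary` and the toy frame are hypothetical names and values used only for illustration, and `isnull().mean()` is used as a shorthand for dividing the null count by the row count.

```python
import numpy as np
import pandas as pd

def null_summary(df, label):
    """Return null counts and percentages per column, most-missing first."""
    summary = pd.DataFrame({
        'nulls': df.isnull().sum(),
        '% of nulls': df.isnull().mean() * 100,  # share of missing rows per column
    })
    summary = summary.sort_values('nulls', ascending=False)
    return summary.rename_axis(label).reset_index()

# Toy frame standing in for `listings` (values are made up)
toy = pd.DataFrame({
    'price': [100, 80, 120, 95],
    'reviews_per_month': [0.5, np.nan, np.nan, 1.2],
    'host_name': ['A', None, 'C', 'D'],
})

num_nulls = null_summary(toy.select_dtypes(include='number'), 'numerical attributes')
cat_nulls = null_summary(toy.select_dtypes(exclude='number'), 'categorical attributes')
print(num_nulls)
print(cat_nulls)
```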


# **Preprocess data**

In [None]:
# Exclude columns that are not needed
cleaned_listings = listings.drop(columns=['license', 'availability_365', 'number_of_reviews_ltm'])

# Convert 'price' to numeric data type (strip '$' and ',' in case it was read as text)
cleaned_listings['price'] = cleaned_listings['price'].replace(r'\$|,', '', regex=True)
cleaned_listings['price'] = pd.to_numeric(cleaned_listings['price'])

# Rename columns
cleaned_listings.rename(columns={'neighbourhood_group': 'borough',
                                 'calculated_host_listings_count': 'host_listings_count'},
                        inplace=True)

# Estimate 'occupancy' (booked nights per year), capped at 85% of the year
cleaned_listings['reviews_per_month'] = cleaned_listings['reviews_per_month'].fillna(0)  # Listings without reviews count as 0
cleaned_listings['guests_per_month'] = cleaned_listings['reviews_per_month'] / 0.72  # Scale reviews up to stays (0.72 reviews per stay)
# Each stay lasts 'minimum_nights' when that exceeds 3 nights, otherwise 3 nights
cleaned_listings['occupancy'] = np.minimum(
    np.where(cleaned_listings['minimum_nights'] > 3,
             cleaned_listings['guests_per_month'] * cleaned_listings['minimum_nights'] * 12,
             cleaned_listings['guests_per_month'] * 3 * 12),
    0.85 * 365.25
)

# Calculate 'listing_earns'
cleaned_listings['listing_earns'] = cleaned_listings['price'] * cleaned_listings['occupancy']

# Calculate 'host_earns'
host_earnings = cleaned_listings.groupby('host_id')['listing_earns'].sum().reset_index()
host_earnings.columns = ['host_id', 'host_earns']
cleaned_listings = pd.merge(cleaned_listings,
                            host_earnings,
                            on='host_id',
                            how='left')

# Filter active listings and short-term listings
last_scraped = datetime(2023, 10, 1)
cleaned_listings['last_review'] = pd.to_datetime(cleaned_listings['last_review'])
six_months_ago = last_scraped - timedelta(days=6*30.4375)  # Six average-length months
# Active: reviewed within the last six months (listings with no review have NaT and are excluded)
active_listings = cleaned_listings[cleaned_listings['last_review'] > six_months_ago]
# Short-term: minimum stay under 30 nights
short_term_listings = active_listings[active_listings['minimum_nights'] < 30]
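
To make the occupancy and earnings estimate easier to follow, here is a small worked sketch of the same arithmetic. The constant names and `estimate_occupancy` are hypothetical and introduced only for illustration. The first two rows reuse `price`, `minimum_nights`, and `reviews_per_month` from the previewed listings (ids 2595 and 6848), and the third row is made up.

```python
import numpy as np
import pandas as pd

REVIEW_RATE = 0.72           # reviews per stay, as implied by the division in the cell above
DEFAULT_STAY_NIGHTS = 3      # stay length used when minimum_nights is 3 or less
MAX_NIGHTS = 0.85 * 365.25   # occupancy cap, roughly 310.5 nights per year

def estimate_occupancy(reviews_per_month, minimum_nights):
    """Estimated booked nights per year, mirroring the preprocessing cell."""
    guests_per_month = reviews_per_month / REVIEW_RATE
    nights_per_stay = np.where(minimum_nights > 3, minimum_nights, DEFAULT_STAY_NIGHTS)
    return np.minimum(guests_per_month * nights_per_stay * 12, MAX_NIGHTS)

sample = pd.DataFrame({
    'price': [240, 81, 150],
    'minimum_nights': [30, 30, 2],
    'reviews_per_month': [0.29, 1.09, 1.8],
})
sample['occupancy'] = estimate_occupancy(sample['reviews_per_month'], sample['minimum_nights'])
sample['listing_earns'] = sample['price'] * sample['occupancy']
print(sample.round(1))
```

Under these assumptions, a listing with a 30-night minimum reaches the cap at roughly one review per month (1.09 / 0.72 × 30 × 12 ≈ 545 nights before capping), so the cap has a large influence on the estimated earnings of long-minimum-stay listings.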

# **Analyze and visualize data**

### Market overview

**Objective:** Uncover and analyze general trends within the Airbnb rental market.

In [None]:
# Print total number of listings
print(f'Total number of listings: {cleaned_listings.shape[0]}')

Total number of listings: 38792


In [None]:
# Plot distribution of listings based on room type
room_type_count = cleaned_listings['room_type'].value_counts().reset_index()
room_type_count.columns = ['room_type', 'listings']  # Explicit names; the defaults differ between pandas versions

fig = px.bar(room_type_count,
             x='room_type',
             y='listings',
             labels={'room_type': 'Room Type', 'listings': 'Listings'},
             title='Distribution of Listings Based on Room Type',
             template='plotly',
             color='room_type')

fig.update_layout(xaxis_title='Room Type',
                  yaxis_title='Listings',
                  width=600,
                  showlegend=False)

# Show the plot
fig.show()

In [None]:
# Plot distribution of listings across all boroughs
# Filter out a small number of unusually expensive listings for better visibility
filtered_listings = cleaned_listings[~(cleaned_listings['price'] > 500)]
fig = px.scatter_mapbox(filtered_listings,
                        lat='latitude',
                        lon='longitude',
                        color='price',
                        mapbox_style='open-street-map',
                        size_max=15,
                        width=900,
                        height=900,
                        zoom=9.75,
                        center={'lat': filtered_listings['latitude'].median(),
                                'lon': filtered_listings['longitude'].median()},
                        opacity=0.5,
                        title='Distribution of Listings Across All Boroughs')

fig.update_traces(marker_size=5)

# Show the plot
fig.show()

In [None]:
# Bin the 'host_listings_count' data into categories
listings_count_categories = pd.cut(cleaned_listings['host_listings_count'],
                                   bins=list(range(1, 11)) + [float('inf')],
                                   labels=[str(i) for i in range(1, 10)] + ['10+'],
                                   right=False)

# Create a DataFrame to store the counts
listings_count_data = listings_count_categories.value_counts().reset_index()
listings_count_data.columns = ['listings_per_host', 'number_of_listings']

# Plot the distribution of listings based on the number of listings per host
fig = px.bar(listings_count_data,
             x='listings_per_host',
             y='number_of_listings',
             labels={'listings_per_host': 'Listings per Host', 'number_of_listings': 'Listings'},
             title='Distribution of Listings Based on Number of Listings per Host',
             category_orders={'listings_per_host': [str(i) for i in range(1, 10)] + ['10+']},
             width=600)

# Show the plot
fig.show()
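
The binning relies on `pd.cut` with `right=False` and an open-ended last edge. As a minimal sketch of how values on the boundaries are assigned, the snippet below applies the same edges and labels to a few made-up host listing counts.

```python
import pandas as pd

# Made-up host listing counts, including values on and beyond the last finite edge
counts = pd.Series([1, 2, 9, 10, 57])

edges = list(range(1, 11)) + [float('inf')]          # 1, 2, ..., 10, inf
labels = [str(i) for i in range(1, 10)] + ['10+']    # one label per interval

# right=False closes each interval on the left: [1, 2), [2, 3), ..., [10, inf)
binned = pd.cut(counts, bins=edges, labels=labels, right=False)
print(binned.tolist())
```

With left-closed intervals, a host with exactly 10 listings falls into the `10+` group, which matches the label.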

In [None]:
# Bin the 'minimum_nights' data into categories
minimum_nights_categories = pd.cut(cleaned_listings['minimum_nights'],
                                   bins=list(range(1, 36)) + [float('inf')],
                                   labels=[str(i) for i in range(1, 35)] + ['35+'],
                                   right=False)

# Create a DataFrame to store the counts
minimum_nights_data = minimum_nights_categories.value_counts().sort_index().reset_index()
minimum_nights_data.columns = ['minimum_nights', 'number_of_listings']

# Plot the distribution of listings based on the number of minimum nights using Plotly Express
fig = px.bar(minimum_nights_data,
             x='minimum_nights',
             y='number_of_listings',
             labels={'minimum_nights': 'Minimum Nights', 'number_of_listings': 'Listings'},
             title='Distribution of Listings Based on Minimum Nights',
             category_orders={'minimum_nights': [str(i) for i in range(1, 35)] + ['35+']},
             width=600)

# Show the plot
fig.show()

In [None]:
# Create bins for categorization
occupancy_bins = [0, 1, 30, 60, 90, 120, 150, 180, 210, 240, 255, float('inf')]

# Define corresponding labels for the bins
occupancy_labels = ['0', '1-30', '31-60', '61-90', '91-120', '121-150', '151-180', '181-210', '211-240', '241-255', '255+']

# Categorize listings based on occupancy
occupancy_categories = pd.cut(cleaned_listings['occupancy'],
                              bins=occupancy_bins,
                              labels=occupancy_labels,
                              right=False)

# Create a DataFrame to store the counts
occupancy_data = occupancy_categories.value_counts().sort_index().reset_index()
occupancy_data.columns = ['occupancy', 'number_of_listings']

# Plot the distribution of listings based on estimated bookings using Plotly Express
fig = px.bar(occupancy_data,
             x='occupancy',
             y='number_of_listings',
             labels={'occupancy': 'Booked Nights', 'number_of_listings': 'Listings'},
             title='Distribution of Listings Based on Booked Nights',
             width=600)

fig.update_layout(xaxis=dict(type='category'),
                  xaxis_title='Booked Nights',
                  yaxis_title='Listings',
                  xaxis_categoryorder='array',
                  xaxis_categoryarray=occupancy_labels,
                  xaxis_tickangle=-45,
                  bargap=0.2)

# Show the plot
fig.show()

### Main analysis

**Objective:** Uncover how much New York City Airbnb hosts earn in a year and their varying strategies for achieving different incomes.

In [None]:
# Filter out a small number of unusually expensive listings
filtered_listings = short_term_listings[~(short_term_listings['price'] > 3000)]

# Listings whose host earns six figures ($100K to under $1M)
six_figure_hosts = filtered_listings[
    (100000 <= filtered_listings['host_earns']) &
    (filtered_listings['host_earns'] < 1000000)
]

# Listings whose host earns $1M or more (no upper bound, so the remainder below is truly 'Below $100K')
seven_figure_hosts = filtered_listings[
    1000000 <= filtered_listings['host_earns']
]

# Pie chart (each slice counts listings, so a host with several listings is counted once per listing)
labels = ['Below $100K', '$100K - $1M', 'Above $1M']
sizes = [
    len(filtered_listings) - len(six_figure_hosts) - len(seven_figure_hosts),
    len(six_figure_hosts),
    len(seven_figure_hosts)
]

fig = px.pie(names=labels,
             values=sizes,
             title='Distribution of Airbnb Host Earnings',
             hole=0.4)

fig.update_traces(textinfo='percent+label',
                  pull=[0.1, 0.1, 0.1],
                  textposition='inside',
                  showlegend=False)

fig.update_layout(title_x=0.5)

# Show the plot
fig.show()
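
The slices above are sized by rows of the listings table, so a host with many listings contributes one row per listing. If the share of hosts rather than listings is of interest, one possible variation is to keep one row per `host_id` before assigning tiers. The sketch below uses made-up values, and `earnings_tier` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def earnings_tier(host_earns):
    """Assign each value to one of the three tiers used in the pie chart."""
    return pd.cut(host_earns,
                  bins=[0, 100_000, 1_000_000, float('inf')],
                  labels=['Below $100K', '$100K - $1M', 'Above $1M'],
                  right=False)

# Toy listings table: host 1 has three listings, hosts 2 and 3 have one each
toy = pd.DataFrame({
    'host_id':    [1, 1, 1, 2, 3],
    'host_earns': [1_200_000, 1_200_000, 1_200_000, 150_000, 40_000],
})

# Counting rows counts listings; dropping duplicate host_ids counts hosts
listing_counts = earnings_tier(toy['host_earns']).value_counts().sort_index()
host_counts = earnings_tier(toy.drop_duplicates('host_id')['host_earns']).value_counts().sort_index()

print(listing_counts)
print(host_counts)
```

In this toy example the top tier holds three of five listings but only one of three hosts, which shows how the two ways of counting can diverge.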

In [None]:
# Scatter plot
fig = px.scatter(filtered_listings,
                 x='price',
                 y='occupancy',
                 size='listing_earns',
                 color='room_type',
                 title='Relationship between Daily Rates, Occupancy, and Revenue per Listing',
                 labels={'price': 'Daily Rates',
                         'occupancy': 'Occupancy',
                         'listing_earns': 'Revenue per Listing ($)',
                         'room_type': 'Room Type'},
                 size_max=30,
                 opacity=0.5,
                 width=1000)

fig.update_layout(legend_title='Room Type')

# Show the plot
fig.show()

In [None]:
# Scatter plot
fig = px.scatter(filtered_listings,
                 x='host_listings_count',
                 y='host_earns',
                 trendline='ols',
                 labels={'host_listings_count': 'Listings per Host',
                         'host_earns': 'Revenue per Host ($)'},
                 title='Relationship between Listings per Host and Revenue per Host',
                 width=600)

fig.update_traces(marker=dict(opacity=0.5))

fig.update_traces(line=dict(color='red', width=0.5))

# Show the plot
fig.show()
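
Plotly Express computes the `trendline='ols'` fit through statsmodels, so that package needs to be installed for the cell above to run. To read the slope of such a line as a number rather than from the chart, a degree-1 `np.polyfit` gives the same kind of least-squares fit for a single predictor. The sketch below uses made-up points that lie exactly on a known line; the variable names are illustrative.

```python
import numpy as np

# Made-up points lying exactly on y = 20,000 * x + 5,000
listings_per_host = np.array([1, 2, 5, 10, 20], dtype=float)
revenue_per_host = 20_000 * listings_per_host + 5_000

# Degree-1 least-squares fit; coefficients come back as (slope, intercept)
slope, intercept = np.polyfit(listings_per_host, revenue_per_host, deg=1)
print(f"slope: {slope:.1f}, intercept: {intercept:.1f}")
```

On real data, the slope would be read as the average change in estimated revenue per additional listing, subject to the assumptions behind the occupancy estimate.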

In [None]:
# Scatter plot
fig = px.scatter(filtered_listings,
                 x='host_listings_count',
                 y='occupancy',
                 trendline='ols',
                 labels={'host_listings_count': 'Listings per Host',
                         'occupancy': 'Booked Nights'},
                 title='Relationship between Listings per Host and Booked Nights',
                 width=600)

fig.update_traces(marker=dict(opacity=0.5))

fig.update_traces(line=dict(color='red', width=0.5))

# Show the plot
fig.show()

**Conclusions:**

- On average, hosts earned more in the short-term rental market than they would in the long-term rental market.

  ⇒ Hosts may have varying strategies for achieving different income goals.

- While a single listing can be lucrative, hosts aiming for seven figures often adopt a portfolio approach.
  
  ⇒ Specialization or diversification: Hosts should consider whether to specialize in a niche with a single high-performing property or diversify with multiple listings.

- In general, hosts with more listings tend to generate higher revenue. However, there might be a point of diminishing returns or saturation.

  ⇒ More listings could mean spreading guests across a larger inventory, potentially leading to lower occupancy rates and revenue per listing.


**Key takeaways:**
- Hosts listing a 3-bedroom apartment in a prime Manhattan location such as SoHo, Times Square, or the Upper East Side can earn up to an estimated **$350,000** a year.

- A six-figure income is attainable for hosts, provided they secure **a minimum of 10 listings.**

- **After surpassing 83 listings,** hosts tend to see a decline in occupancy rates.
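
A threshold like the one in the last takeaway could be checked by comparing mean occupancy for listings on either side of it. The sketch below is only an illustration of that comparison: `occupancy_by_portfolio_size` is a hypothetical helper and the toy values are made up, so the printed numbers say nothing about the actual dataset.

```python
import pandas as pd

def occupancy_by_portfolio_size(df, threshold):
    """Mean occupancy for listings whose host is at or below vs. above `threshold` listings."""
    group = df['host_listings_count'].gt(threshold).map(
        {False: f'<= {threshold}', True: f'> {threshold}'})
    return df.groupby(group)['occupancy'].mean()

# Toy data (made up): one row per listing
toy = pd.DataFrame({
    'host_listings_count': [1, 5, 40, 90, 120],
    'occupancy': [200.0, 180.0, 160.0, 100.0, 80.0],
})

result = occupancy_by_portfolio_size(toy, threshold=83)
print(result)
```

Running the same function on `short_term_listings` with several thresholds would show how sensitive the conclusion is to the exact cutoff.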