# **Analysing NYC Airbnb Data**
**Airbnb.com** has changed the game when it comes to short-term rentals, making it easier than ever to find a cozy place to stay. With millions of users worldwide, it offers diverse properties, from entire homes to shared rooms, across different price ranges and locations.

This project dives into Airbnb data for **New York City**.
From historic brownstones to luxurious Manhattan penthouses, the Big Apple's vibrant rental market offers endless possibilities, making it one of the platform's most dynamic markets.

Let’s dive into the data and see what stories it reveals about one of the most exciting places in the world!

# **The New York Airbnb Open Data 2024**

Welcome to the Data Story!

Our journey through NYC's Airbnb landscape is powered by the New York Airbnb Open Data 2024 dataset from Kaggle. This dataset offers an overview of the market using 22 key variables. Let's look at our data.

## **Importing libraries and loading the data**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style 
import numpy as np
import folium
from folium.plugins import MarkerCluster
import folium.features

In [None]:
df_nyc_2024 = pd.read_csv('new_york_listings_2024.csv')
df_nyc_2024.head()

In [None]:
df_nyc_2024.info()

In [None]:
df_nyc_2024.duplicated().sum()

In [None]:
df_nyc_2024.isnull().sum()

We can see that this dataset is free of null values and duplicates, allowing for a more straightforward analysis.
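
The two checks above can be folded into a small helper that is easy to rerun after each cleaning step. This is only a minimal sketch on a made-up frame: `summarize_quality` and the toy columns are illustrative names, not part of the original notebook.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> dict:
    """Return the row count, duplicate-row count, and per-column null counts."""
    nulls = df.isnull().sum()
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        # Keep only the columns that actually contain nulls
        "null_counts": nulls[nulls > 0].to_dict(),
    }

# Toy data (hypothetical): one duplicated row and one missing price
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [120.0, 85.0, 85.0, None],
})

report = summarize_quality(toy)
print(report)
```

On the real data, a report with zero duplicates and an empty `null_counts` dictionary would correspond to the conclusion drawn above.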


To gain deeper insights, let’s explore some key variables more thoroughly by examining their statistical summaries.

In [None]:
df_nyc_2024.describe()

In [None]:
numeric_columns = df_nyc_2024.select_dtypes(include=['number'])

plt.figure(figsize=(10, 6))
plt.style.use('seaborn-v0_8-darkgrid') 

sns.heatmap(numeric_columns.corr(), annot=True, cmap='coolwarm', linewidth=2)
plt.title('Correlation Heatmap', fontsize=16, pad=20)
plt.show()


# **Unmasking the Outliers**
Our initial dive into the data reveals some eyebrow-raising values that require attention.

Take the **minimum_nights** variable - its maximum value of 1250 nights (that's about 3.4 years!) stands out as highly improbable for what's meant to be a short-term rental platform.

Similarly, when looking at **price**, we encounter listings allegedly charging $100,000 per night. Even in a luxury market like NYC, this raises red flags. The minimum value of $10 is also concerning. These extreme values suggest we'll need some careful data cleaning to ensure our analysis reflects the real market dynamics.
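
One common way to make "extreme" concrete is Tukey's interquartile-range rule, which flags values lying more than 1.5 × IQR outside the quartiles. The sketch below applies it to a made-up price series. It is not the rule this notebook ends up using (fixed thresholds are applied instead), and `iqr_bounds` is a hypothetical helper.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy prices (hypothetical), including one very low and one very high value
toy_prices = pd.Series([10, 80, 95, 110, 125, 140, 160, 100000])

lower, upper = iqr_bounds(toy_prices)
flagged = toy_prices[(toy_prices < lower) | (toy_prices > upper)]

print(f"Fences: {lower:.2f} to {upper:.2f}")
print(flagged.tolist())
```

On heavily right-skewed data such as nightly prices, this rule tends to flag many legitimate luxury listings as well, which is one reason to prefer hand-picked thresholds.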

In [None]:
df_nyc_2024['price'].describe()

In [None]:
print(f"The most common price is: {df_nyc_2024.price.mode()[0]} USD")

In [None]:
plt.figure(figsize=(13, 3))
plt.style.use('seaborn-v0_8-darkgrid') 

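plt.boxplot(df_nyc_2024['price'],
            vert=False,
            # Set the marker colors explicitly so the outliers are actually drawn in red
            flierprops=dict(marker='o', markerfacecolor='red', markeredgecolor='red', alpha=0.3, markersize=10))
plt.title('Boxplot of Airbnb Prices in NYC', fontsize=16)
# The boxplot is horizontal, so price runs along the x-axis
plt.xlabel('Price (USD)', fontsize=14)
plt.grid(axis='x')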

plt.show()

In [None]:
df_nyc_2024[df_nyc_2024['price']==100000]

There are two listings priced at 100,000 USD. Based on their location, they appear to be one-bedroom rental units in Brooklyn, offered by the same host and likely in the same building. Given this, we can reasonably consider these listings as outliers.

What about the other side of the spectrum - could an Airbnb in NYC really cost just $10?

In [None]:
df_nyc_2024[df_nyc_2024['price']==10].shape[0]

In [None]:
df_nyc_2024.sort_values('price').head(10)

There are 9 listings priced at $10 a night, and some are categorized as 'Entire home/apt.' An entire apartment in Manhattan for just $10 a night? That seems implausible. Even if we consider shared rooms, it's still difficult to believe that any accommodation in NYC would be priced so low.

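For now, let's remove the obvious outliers: the filter below drops every listing priced at $10 or less, along with anything priced at $20,000 or more, which takes out the two $100,000 listings.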

In [None]:
df_nyc_2024 = df_nyc_2024[(df_nyc_2024['price'] < 20000) & (df_nyc_2024['price'] > 10)]

Now, let's take a closer look at the minimum nights criteria.

In [None]:
plt.figure(figsize=(13, 3))
plt.style.use('seaborn-v0_8-darkgrid') 

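plt.boxplot(df_nyc_2024['minimum_nights'],
            vert=False,
            # Set the marker colors explicitly so the outliers are actually drawn in red
            flierprops=dict(marker='o', markerfacecolor='red', markeredgecolor='red', alpha=0.3, markersize=10))
plt.title('Boxplot of Minimum Nights in Airbnb', fontsize=16)
# The boxplot is horizontal, so minimum nights run along the x-axis
plt.xlabel('Minimum nights', fontsize=14)
plt.grid(axis='x')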

plt.show()

When we think about Airbnbs, we usually picture short stays - just a night or two. However, we noticed that many listings have unusually high minimum night requirements.
This raises an important question: **why are some hosts asking for longer stays?**

In [None]:
df_nyc_2024[df_nyc_2024['minimum_nights'] <= 90].value_counts('minimum_nights').sort_values(ascending=False)

Our data reveals an interesting pattern: 30 days stands out as the most common minimum stay requirement. This is unlikely to be a coincidence - **it most likely reflects New York City's Local Law 18**, which the city began enforcing in September 2023. This regulation transformed the city's short-term rental landscape by requiring hosts to:
- Register with the city for stays under 30 days
- Be physically present in the property during guest stays under 30 days
- Provide guests with full access to the property

Rather than navigating these requirements, many hosts have opted to set 30-day minimums, effectively positioning their listings in the medium-term rental market. This shift in hosting strategy demonstrates how local regulations can fundamentally reshape rental patterns across an entire city.
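
To put a number on how dominant the 30-night minimum is, one option is to compute the share of listings sitting exactly at 30 nights and the share at or above it. This is a sketch on made-up values; the toy frame is hypothetical and only borrows the `minimum_nights` column name from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): half of the listings sit exactly at 30 nights
toy = pd.DataFrame({"minimum_nights": [1, 2, 30, 30, 30, 30, 31, 45, 3, 30]})

# The mean of a boolean mask is the proportion of True values
share_exactly_30 = (toy["minimum_nights"] == 30).mean()
share_30_plus = (toy["minimum_nights"] >= 30).mean()
most_common = toy["minimum_nights"].mode()[0]

print(f"Most common minimum stay: {most_common} nights")
print(f"Exactly 30 nights: {share_exactly_30:.0%}")
print(f"30 nights or more: {share_30_plus:.0%}")
```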

Let's explore what types of properties appear on each side of this 30-day threshold.

In [None]:
properties_over_30_nights = df_nyc_2024[df_nyc_2024['minimum_nights'] >= 30]
print("Number of properties requiring a minimum stay of 30+ nights by room type:")
properties_over_30_nights['room_type'].value_counts()

In [None]:
properties_under_30_nights = df_nyc_2024[df_nyc_2024['minimum_nights'] < 30]
print("Number of properties requiring a minimum stay of under 30 nights by room type:")
properties_under_30_nights['room_type'].value_counts()

It makes sense that longer stays are typically for entire homes, which people might rent for a month or more. These stays usually cater to those seeking temporary housing, such as remote workers, students, or families relocating. Still, Airbnb isn’t usually the first place that comes to mind when looking for long-term accommodation.
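
Raw counts are hard to compare when the two groups differ in size, so a row-normalized cross-tabulation can make the room-type mix on each side of the threshold easier to read. The sketch below uses made-up rows; `stay_band` is a hypothetical helper column, not one from the dataset.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "minimum_nights": [1, 2, 3, 30, 30, 30, 45, 2],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Shared room"],
})

# Label each listing by the side of the 30-night threshold it falls on
toy["stay_band"] = toy["minimum_nights"].ge(30).map(
    {True: "30+ nights", False: "under 30 nights"}
)

# normalize="index" turns each row into shares that sum to 1
shares = pd.crosstab(toy["stay_band"], toy["room_type"], normalize="index")
print(shares.round(2))
```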

Let's see what percentage of the data consists of listings where Airbnb states that the minimum nights required is over 60.

In [None]:
# Build the mask once and reuse it for both the count and the percentage
over_60_mask = df_nyc_2024['minimum_nights'] > 60
over_60_nights = int(over_60_mask.sum())
over_60_nights_percentage = over_60_mask.mean() * 100

print(f"Listings with minimum nights over 60: {over_60_nights}")
print(f"Percentage of listings with minimum nights over 60: {over_60_nights_percentage:.4f}%")

Listings with a minimum night requirement exceeding 60 account for only about 1.5% of the data. In this analysis, we will remove these records to ensure a cleaner dataset, free from obvious outliers.

In [None]:
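# Take an explicit copy so that later column assignments do not act on a view of df_nyc_2024
df_nyc_filtered = df_nyc_2024[df_nyc_2024['minimum_nights'] <= 60].copy()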

# Borough Breakdown: Quantifying Airbnb Listings Across NYC
Now let's take a look at the spread of Airbnb listings across New York City's five iconic boroughs!

By plotting the number of listings in each borough, we get a clear picture of where Airbnb activity is most concentrated. This view sheds light on potential hotspots for visitors and hints at where hosts may face the most competition.

In [None]:
neighborhood_counts = df_nyc_filtered['neighbourhood_group'].value_counts()

plt.figure(figsize=(10, 6))
plt.style.use('seaborn-v0_8-darkgrid') 

neighborhood_counts.plot(kind='bar', color='indianred')
plt.title('Boroughs by Number of Listings', fontsize=16, pad=20)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Number of Listings', fontsize=14, labelpad=20)
plt.xticks(rotation=45)
plt.show()

As we can see (and as we probably expected), Manhattan stands out as the borough with the most listings, boasting nearly 8,000 available properties.

But is the pricing landscape just as predictable? Let’s explore the average prices across the boroughs to find out!

In [None]:
average_price_per_neighborhood = df_nyc_filtered.groupby('neighbourhood_group')['price'].mean().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
plt.style.use('seaborn-v0_8-darkgrid') 

average_price_per_neighborhood.plot(kind='bar', color='darksalmon')
plt.title('Boroughs by Average Price', fontsize=16, pad=20)
plt.xlabel('Borough', fontsize=14)
plt.ylabel('Average Price (USD)', fontsize=14, labelpad=20)
plt.xticks(rotation=45)
plt.ylim((0,300))
plt.show()

In [None]:
df_nyc_filtered[df_nyc_filtered['neighbourhood_group'] == 'Manhattan']['price'].mean()

Of course, Manhattan takes the lead once again, with an average price of ~$228 per night.
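
Because a handful of very expensive listings can pull an average upward, it may be worth reading the mean next to the median. The sketch below shows the effect on made-up prices; the numbers are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): one very expensive Manhattan listing
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Queens"] * 3,
    "price": [150, 200, 250, 5000, 80, 100, 120],
})

price_summary = (
    toy.groupby("neighbourhood_group")["price"]
       .agg(["mean", "median", "count"])
       .sort_values("median", ascending=False)
)
print(price_summary)
```

In this toy example the Manhattan mean is 1,400 while the median is 225, so a single listing accounts for most of the gap.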

We've previously discussed high prices, and it's clear that there are quite a few properties priced in the thousands! Let's take a closer look at what these properties are.

In [None]:
most_expensive_properties = df_nyc_filtered.nlargest(10, 'price')
most_expensive_properties[['name', 'neighbourhood_group', 'price', 'room_type', 'number_of_reviews','availability_365', 'bedrooms', 'host_name', 'rating']]


- Right off the bat, one host appears repeatedly: The Gregory Hotel, which seems to be a luxury hotel offering expensive stays. Its ratings are fairly middling (3.33 and 3.40), and one of its listings in the top 10 has no rating at all.
- Judging by review counts, the most expensive properties don't seem to be very popular. Most have between 1 and 29 reviews. The exception is a place in Brooklyn priced at $7,500 a night, which has 173 reviews, far more than any of the others.
- Some listings appear to be miscategorized. For example, a loft in Manhattan that supposedly has 4 bedrooms is listed as a "shared room." This suggests that other listings in our dataset may be miscategorized as well.
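
A rough way to look for other possibly miscategorized listings is to flag shared rooms that report several bedrooms. The sketch below runs on made-up rows. It assumes that `bedrooms` may contain non-numeric labels (such as "Studio"), and the "more than one bedroom" cutoff is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "name": ["Loft in SoHo", "Bunk in hostel", "Cozy studio", "Townhouse"],
    "room_type": ["Shared room", "Shared room", "Entire home/apt", "Entire home/apt"],
    "bedrooms": ["4", "1", "Studio", "5"],
})

# Non-numeric labels become NaN, and NaN never satisfies the comparison below
bedrooms_numeric = pd.to_numeric(toy["bedrooms"], errors="coerce")

suspicious = toy[(toy["room_type"] == "Shared room") & (bedrooms_numeric > 1)]
print(suspicious[["name", "room_type", "bedrooms"]])
```

Any rows flagged this way would still need a manual look, since a multi-bedroom home can legitimately offer a shared room.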

# Interactive Map
Let’s take a closer look at how Airbnb listings are distributed across New York City.

The interactive map below highlights properties throughout the city, giving you a clear view of where listings are concentrated. This visualization provides valuable insights into popular areas with high listing density as well as neighborhoods with fewer options.

Additionally, I have included a price legend to help you easily identify price ranges across the city.

To keep the map fast and responsive, I grouped the markers with folium's `MarkerCluster` plugin, which merges nearby listings into a single cluster until you zoom in.

In [None]:
map_nyc = folium.Map(
    location=[40.730610, -73.935242], 
    zoom_start=11
)

In [None]:
marker_cluster = MarkerCluster().add_to(map_nyc)

In [None]:
def get_color(price):
    if float(price) <= 100:
        return 'green'
    elif float(price) <= 200:
        return 'blue'
    elif float(price) <= 300:
        return 'purple'
    else:
        return 'red'
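
The same price bands can also be assigned to a whole column at once with `pd.cut`, which may be handy for checking how many listings fall into each color. This is a sketch on made-up prices. The bin edges mirror `get_color` above (right-inclusive, so exactly $100 is green), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy prices (hypothetical), chosen to sit on and around the band edges
prices = pd.Series([50, 100, 150, 200, 300, 301])

# Intervals are closed on the right by default: (-inf, 100], (100, 200], (200, 300], (300, inf)
colors = pd.cut(
    prices,
    bins=[-np.inf, 100, 200, 300, np.inf],
    labels=["green", "blue", "purple", "red"],
).astype(str)

print(colors.tolist())
print(colors.value_counts())
```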

In [None]:
for _, row in df_nyc_filtered.iterrows():
    try:
        lat = float(row['latitude'])
        lng = float(row['longitude'])
        price = float(row['price'])
        
        price_formatted = f"${price:,.2f}"
        
        name = str(row.get('name', 'Unnamed'))
        minimum_nights = str(row.get('minimum_nights', 'N/A'))
        neighbourhood_group = str(row.get('neighbourhood_group', 'N/A'))
        neighbourhood = str(row.get('neighbourhood', 'N/A'))
        
        color = get_color(price)
        
    except (KeyError, TypeError, ValueError):
        # Skip rows with missing or non-numeric coordinates or price
        continue
        
    popup_content = f"""
    <div style="font-family: 'Helvetica', sans-serif; padding: 10px; min-width: 200px;">
        <h4 style="margin: 0 0 10px 0; color: #333;">{name}</h4>
        <p style="margin: 5px 0; font-size: 14px;"><strong>Price:</strong> {price_formatted} per night</p>
        <p style="margin: 5px 0; font-size: 14px;"><strong>Min Stay:</strong> {minimum_nights} night(s)</p>
        <p style="margin: 5px 0; font-size: 14px;"><strong>Area:</strong> {neighbourhood_group} - {neighbourhood}</p>
    </div>
    """    
    
    folium.Marker(
        location=(lat, lng),
        popup=folium.Popup(popup_content, max_width=300),
        tooltip=f"{name} - {price_formatted}",
        icon=folium.Icon(color=color, prefix='fa', icon='home')
    ).add_to(marker_cluster)

In [None]:
legend_html = '''
<div style="
    position: fixed; 
    bottom: 20px; 
    right: 20px; 
    z-index: 1000; 
    background-color: white; 
    padding: 15px; 
    border-radius: 8px; 
    box-shadow: 0 2px 10px rgba(0,0,0,0.15);
    font-family: 'Helvetica', Arial, sans-serif;
    font-size: 12px;
    line-height: 1.4;
    width: 160px;
">
    <div style="font-weight: bold; margin-bottom: 8px; font-size: 14px; color: #333; border-bottom: 1px solid #eee; padding-bottom: 5px;">
        NYC Airbnb Prices
    </div>
    <div style="display: flex; align-items: center; margin: 6px 0;">
        <span style="background: green; width: 12px; height: 12px; display: inline-block; border-radius: 50%; margin-right: 8px;"></span>
        <span>Up to $100</span>
    </div>
    <div style="display: flex; align-items: center; margin: 6px 0;">
        <span style="background: blue; width: 12px; height: 12px; display: inline-block; border-radius: 50%; margin-right: 8px;"></span>
        <span>$100 - $200</span>
    </div>
    <div style="display: flex; align-items: center; margin: 6px 0;">
        <span style="background: purple; width: 12px; height: 12px; display: inline-block; border-radius: 50%; margin-right: 8px;"></span>
        <span>$200 - $300</span>
    </div>
    <div style="display: flex; align-items: center; margin: 6px 0;">
        <span style="background: red; width: 12px; height: 12px; display: inline-block; border-radius: 50%; margin-right: 8px;"></span>
        <span>Over $300</span>
    </div>
</div>
'''

In [None]:
legend = folium.Element(legend_html)
map_nyc.get_root().html.add_child(legend)

map_nyc

# Ratings: Which Boroughs Are Guests Happiest With?
Price and location are important, but what about guest satisfaction? Let’s analyze the rating column to see which boroughs have the highest and lowest average ratings.

In [None]:
df_nyc_filtered['rating'].unique()

We can observe a few things in our data:

- Trailing spaces – All values have an extra space at the end.
- Data type – The column is stored as object, meaning it's not properly formatted as numeric.
- Non-numeric values – There are entries like 'No rating' and 'New ', which may need special handling.

Let's clean up our data to continue with our ratings analysis.

In [None]:
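# Strip the trailing whitespace from every rating label
df_nyc_filtered = df_nyc_filtered.assign(rating=df_nyc_filtered['rating'].str.strip())
# Drop the non-numeric labels, then convert the remaining ratings to float on an explicit copy
df_nyc_filtered_ratings = df_nyc_filtered[~df_nyc_filtered['rating'].isin(['No rating', 'New'])].copy()
df_nyc_filtered_ratings['rating'] = df_nyc_filtered_ratings['rating'].astype(float)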

In [None]:
removed_count = len(df_nyc_filtered) - len(df_nyc_filtered_ratings)
print(f"Removed {removed_count} listings ({(removed_count / len(df_nyc_filtered)) * 100:.2f}%) due to missing or new ratings.")
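
An alternative to listing the non-numeric labels by hand is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN`. The sketch below uses made-up rating strings. Note that coercion silently hides any unexpected label, so inspecting the unique values first (as done above) remains a good idea.

```python
import pandas as pd

# Toy ratings (hypothetical), with trailing spaces and non-numeric labels
raw = pd.Series(["4.75 ", "No rating ", "New ", "5.0 ", "3.33 "])

# Strip whitespace, then coerce anything non-numeric to NaN
numeric = pd.to_numeric(raw.str.strip(), errors="coerce")

n_missing = int(numeric.isna().sum())
clean = numeric.dropna()

print(f"Dropped {n_missing} non-numeric ratings")
print(clean.tolist())
```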

Let's take a closer look at the ratings for each borough. We’ll calculate the mean, median, and count of ratings. This will help us identify not only the boroughs with the highest average ratings, but also the areas with the most feedback.

In [None]:
df_nyc_filtered_ratings.groupby('neighbourhood_group')['rating'].agg(['mean', 'median', 'count']).sort_values(by='mean', ascending=False)

The average ratings across the boroughs show slight differences, with Staten Island having the highest average rating at 4.775, followed by Brooklyn at 4.773.

While Staten Island's higher rating may appear notable, it is based on a small number of rated listings, so it may not fully represent guest sentiment in the borough. Larger boroughs, like Brooklyn and Manhattan, offer a more reliable picture because their averages are computed over many more listings.
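
One rough way to express this caveat in numbers is the standard error of the mean, which shrinks as the number of rated listings grows. The sketch below uses made-up ratings with identical averages but different sample sizes. It treats ratings as independent observations, which is a simplifying assumption.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): same spread of ratings, but 4 listings versus 16
toy = pd.DataFrame({
    "neighbourhood_group": ["Staten Island"] * 4 + ["Brooklyn"] * 16,
    "rating": [4.6, 4.8, 4.9, 5.0] + [4.6, 4.8, 4.9, 5.0] * 4,
})

stats = toy.groupby("neighbourhood_group")["rating"].agg(["mean", "std", "count"])

# Standard error of the mean: sample standard deviation divided by sqrt(n)
stats["sem"] = stats["std"] / np.sqrt(stats["count"])
print(stats.round(3))
```

Both toy boroughs share the same mean, yet the one with fewer listings has the larger standard error, so its average is the less certain of the two.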
