# Airbnb Data Deep Dive – Python + Pandas Challenge

### Table of Contents
<ul>
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessing">Data Assessment</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#enriching">Data Enrichment</a></li>
<li><a href="#analyzing">Data Analysis</a></li>
<li><a href="#insights">Key Insights</a></li>
</ul>

In [1]:
# Importing library to be used

import pandas as pd

<a id='gathering'></a>
## Data Gathering

In [None]:
# Loading the dataset

url = "https://data.insideairbnb.com/united-kingdom/england/london/2025-06-10/visualisations/listings.csv"

df = pd.read_csv(url)
print(df.head())

       id                                             name  host_id host_name  \
0  264776                      Huge Four Bedroom Apartment  1389063       Sue   
1  264777                            One Bedroom Apartment  1389063       Sue   
2  264778          Two Bedroom Newly Refurbished Apartment  1389063       Sue   
3  264779                Refurbished Two Bedroom Apartment  1389063       Sue   
4  264780  Spacious refurbished 2 bedroom apt with balcony  1389063       Sue   

   neighbourhood_group neighbourhood  latitude  longitude        room_type  \
0                  NaN      Lewisham  51.44306   -0.01948  Entire home/apt   
1                  NaN      Lewisham  51.44284   -0.01997  Entire home/apt   
2                  NaN      Lewisham  51.44359   -0.02275  Entire home/apt   
3                  NaN      Lewisham  51.44355   -0.02309  Entire home/apt   
4                  NaN      Lewisham  51.44333   -0.02307  Entire home/apt   

   price  minimum_nights  number_of_reviews 

<a id='assessing'></a>
## Data Assessment

In [115]:
# Concise summary of the dataframe

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96651 entries, 0 to 96650
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              96651 non-null  int64  
 1   name                            96651 non-null  object 
 2   host_id                         96651 non-null  int64  
 3   host_name                       96611 non-null  object 
 4   neighbourhood_group             0 non-null      float64
 5   neighbourhood                   96651 non-null  object 
 6   latitude                        96651 non-null  float64
 7   longitude                       96651 non-null  float64
 8   room_type                       96651 non-null  object 
 9   price                           62684 non-null  float64
 10  minimum_nights                  96651 non-null  int64  
 11  number_of_reviews               96651 non-null  int64  
 12  last_review                     

In [116]:
# Displaying descriptive statistics

df.describe()

Unnamed: 0,id,host_id,neighbourhood_group,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
count,96651.0,96651.0,0.0,96651.0,96651.0,62684.0,96651.0,96651.0,71487.0,96651.0,96651.0,96651.0,0.0
mean,6.52602e+17,209179000.0,,51.509818,-0.127087,213.366058,5.429504,20.891734,0.958877,16.38937,139.697365,5.634665,
std,5.708808e+17,214126600.0,,0.048945,0.100853,860.901557,23.315086,49.922266,1.282595,53.299577,137.426817,11.951389,
min,13913.0,2594.0,,51.295937,-0.49676,6.0,1.0,0.0,0.01,1.0,0.0,0.0,
25%,29555180.0,26731760.0,,51.48424,-0.18906,75.0,1.0,0.0,0.15,1.0,0.0,0.0,
50%,8.123206e+17,112868400.0,,51.513791,-0.12699,135.0,2.0,4.0,0.5,2.0,93.0,0.0,
75%,1.197378e+18,406376200.0,,51.539099,-0.06788,225.0,4.0,19.0,1.23,8.0,270.0,6.0,
max,1.439673e+18,700129800.0,,51.68263,0.27896,74100.0,1125.0,1855.0,38.41,495.0,365.0,355.0,


In [117]:
# Checking for missing values

df.isnull().sum()

id                                    0
name                                  0
host_id                               0
host_name                            40
neighbourhood_group               96651
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                             33967
minimum_nights                        0
number_of_reviews                     0
last_review                       25164
reviews_per_month                 25164
calculated_host_listings_count        0
availability_365                      0
number_of_reviews_ltm                 0
license                           96651
dtype: int64
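
The raw counts above are easier to compare as a share of all rows. The following is a minimal sketch of that calculation; `toy` is a small made-up frame standing in for `df`, and `missing_share` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, with a few deliberately missing values
toy = pd.DataFrame({
    "price": [100.0, np.nan, 250.0, np.nan],
    "host_name": ["Sue", "Ann", None, "Bob"],
    "license": [np.nan, np.nan, np.nan, np.nan],
})

def missing_share(frame):
    """Return the percentage of missing values per column, highest first."""
    return (frame.isnull().mean() * 100).round(1).sort_values(ascending=False)

missing_pct = missing_share(toy)
print(missing_pct)
```

Calling `missing_share(df)` on the real data would rank the columns the same way the counts do, but as percentages.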

In [118]:
# Checking for duplicate values

df.duplicated().value_counts()

False    96651
Name: count, dtype: int64
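
`df.duplicated()` only flags rows that match in every column. As a hedged complement, the sketch below also checks for repeated listing identifiers. It assumes `id` is meant to be unique per listing; `toy` is made-up data, not the real dataset.

```python
import pandas as pd

# Made-up data: two rows share an id but differ in name
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Flat A", "Flat B", "Flat B (updated)", "Flat C"],
})

# Rows identical across all columns
full_row_dupes = int(toy.duplicated().sum())

# Rows whose id already appeared earlier in the frame
id_dupes = int(toy.duplicated(subset="id").sum())

print(full_row_dupes, id_dupes)
```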

In [119]:
# Checking for unusual data types

df.infer_objects().dtypes

id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group               float64
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                             float64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
number_of_reviews_ltm               int64
license                           float64
dtype: object

<a id='cleaning'></a>
## Data Cleaning

In [None]:
# Converting price column to float

df['price'] = df['price'].replace(r'[\$,]', '', regex=True).astype(float)


In [123]:
# Parsing date columns into datetime objects

df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')
df['last_review'].dtype

dtype('<M8[ns]')

In [124]:
# Handling missing values
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
df['host_name'] = df['host_name'].fillna('Unknown')
df['neighbourhood_group'] = df['neighbourhood_group'].fillna('Not Specified')
df['last_review'] = df['last_review'].fillna('N/A')
df['license'] = df['license'].fillna('N/A')
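
Filling a datetime column with a string such as `'N/A'` generally means the column can no longer be used for date arithmetic. One possible alternative, sketched here on made-up data, is to leave missing dates as `NaT` and record their presence in a separate flag. The column names `has_review` and `days_since_review` are illustrative assumptions.

```python
import pandas as pd

# Made-up review dates, one of them missing
toy = pd.DataFrame({"last_review": ["2025-05-01", None, "2024-12-25"]})
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

# Keep NaT in place and flag whether a review date exists
toy["has_review"] = toy["last_review"].notna()

# Date arithmetic still works; rows without a review stay missing
latest = toy["last_review"].max()
toy["days_since_review"] = (latest - toy["last_review"]).dt.days

print(toy)
```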

In [125]:
# Removing listings with zero availability or a missing/non-positive price

df = df[(df['availability_365'] > 0) & (df['price'] > 0)]

df.isnull().sum()

id                                0
name                              0
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
number_of_reviews_ltm             0
license                           0
dtype: int64
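
The `price` column shows no missing values after this step because the comparison `price > 0` is false for `NaN`, so rows without a price are dropped along with the unavailable ones. A small sketch on made-up data shows how each condition contributes:

```python
import numpy as np
import pandas as pd

# Made-up listings: one unavailable, one with no price
toy = pd.DataFrame({
    "availability_365": [0, 120, 300, 45],
    "price": [80.0, np.nan, 150.0, 60.0],
})

mask_available = toy["availability_365"] > 0
mask_priced = toy["price"] > 0  # NaN > 0 evaluates to False

dropped_unavailable = int((~mask_available).sum())
dropped_missing_price = int(toy["price"].isna().sum())
kept = toy[mask_available & mask_priced]

print(dropped_unavailable, dropped_missing_price, len(kept))
```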

<a id='enriching'></a>
## Data Enrichment

In [126]:
# Creating a new column (price_per_booking)

df['price_per_booking'] = df['price'] * df['minimum_nights']


In [127]:
# Categorizing the availability column

def availability_category(x):
    if x > 300:
        return 'Full-time'
    elif 100 <= x <= 300:
        return 'Part-time'
    else:
        return 'Rare'

df['availability_category'] = df['availability_365'].apply(availability_category)
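
The same kind of banding can be written without a row-wise function by using `pd.cut`. The sketch below assumes the intended bands are fewer than 100 days ('Rare'), 100 to 300 days ('Part-time'), and more than 300 days ('Full-time'), and that availability is a whole number between 0 and 365. `toy` is made-up data.

```python
import pandas as pd

# Made-up availability values covering each band and its edges
toy = pd.Series([0, 99, 100, 300, 301, 365], name="availability_365")

# Bins are right-inclusive: (-1, 99], (99, 300], (300, 365]
categories = pd.cut(
    toy,
    bins=[-1, 99, 300, 365],
    labels=["Rare", "Part-time", "Full-time"],
)

print(categories.tolist())
```

`pd.cut` returns an ordered categorical, which can be convenient for later grouping or sorting by band.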

In [None]:
# Re-checking for missing values

df.isnull().sum()

id                                0
name                              0
host_id                           0
host_name                         0
neighbourhood_group               0
neighbourhood                     0
latitude                          0
longitude                         0
room_type                         0
price                             0
minimum_nights                    0
number_of_reviews                 0
last_review                       0
reviews_per_month                 0
calculated_host_listings_count    0
availability_365                  0
number_of_reviews_ltm             0
license                           0
price_per_booking                 0
availability_category             0
dtype: int64

<a id='analyzing'></a>
## Data Analysis

This section addresses the following questions:

* What are the top 10 most expensive neighborhoods by average price?
* What’s the average availability and price by room type?
* Which host has the most listings?
* How does average price vary across different boroughs or districts?
* How many listings have never been reviewed?

In [129]:
# What are the top 10 most expensive neighborhoods by average price?

exp_neighborhoods = df.groupby('neighbourhood')['price'].mean().sort_values(ascending=False).head(10)
exp_neighborhoods

neighbourhood
City of London            379.090909
Lambeth                   371.780972
Kensington and Chelsea    362.972805
Westminster               343.499335
Camden                    229.625084
Islington                 218.462021
Hammersmith and Fulham    193.441929
Wandsworth                189.752249
Richmond upon Thames      184.709962
Brent                     169.960661
Name: price, dtype: float64
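
Because the maximum price in the dataset (74,100) is far above the 75th percentile (225), averages can be pulled upward by a handful of extreme listings. As a hedged robustness check, the sketch below compares the mean with the median per neighbourhood on made-up data; the neighbourhood labels `A` and `B` are placeholders.

```python
import pandas as pd

# Made-up prices: neighbourhood A contains one extreme outlier
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B", "B", "B"],
    "price": [100.0, 120.0, 5000.0, 150.0, 160.0, 170.0],
})

# Mean, median, and listing count side by side, ranked by median
summary = (
    toy.groupby("neighbourhood")["price"]
    .agg(["mean", "median", "count"])
    .sort_values("median", ascending=False)
)

print(summary)
```

In this toy example, A has the higher mean but B has the higher median, which is the kind of reversal worth checking for in the real ranking.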

In [130]:
# What’s the average availability and price by room type?

room_type_stats = df.groupby('room_type')[['availability_365', 'price']].mean().round(2).sort_values('price',ascending=False)
room_type_stats

                 availability_365   price
room_type
Hotel room                 239.22  310.97
Entire home/apt            202.28  256.60
Private room               216.12  120.02
Shared room                261.08   83.64


In [139]:
# Which host has the most listings?

top_host = df[['host_name', 'calculated_host_listings_count']].sort_values('calculated_host_listings_count', ascending=False).head(1)
top_host

               host_name  calculated_host_listings_count
87926  LuxurybookingsFZE                             495
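
`calculated_host_listings_count` is a value supplied with the dataset, so it may not match the number of rows a host still has after cleaning. As a cross-check, listings can be counted directly from the frame. The sketch below groups by `host_id` as well as `host_name`, on the assumption that different hosts can share a display name; `toy` is made-up data.

```python
import pandas as pd

# Made-up listings: two different hosts are both called "Sue"
toy = pd.DataFrame({
    "host_id": [10, 10, 10, 20, 20, 30],
    "host_name": ["Sue", "Sue", "Sue", "Alex", "Alex", "Sue"],
})

# Count rows per (host_id, host_name) pair, largest first
listings_per_host = (
    toy.groupby(["host_id", "host_name"])
    .size()
    .sort_values(ascending=False)
    .rename("listings")
)

top_host_id, top_host_name = listings_per_host.index[0]
print(top_host_id, top_host_name, listings_per_host.iloc[0])
```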


In [144]:
# How does average price vary across different boroughs or districts?

avg_price_per_district = df.groupby('neighbourhood')[['price']].mean().round(2).sort_values('price', ascending=False)
avg_price_per_district

                         price
neighbourhood
City of London          379.09
Lambeth                 371.78
Kensington and Chelsea  362.97
Westminster             343.50
Camden                  229.63
Islington               218.46
Hammersmith and Fulham  193.44
Wandsworth              189.75
Richmond upon Thames    184.71
Brent                   169.96


In [None]:
# How many listings have never been reviewed?

never_been_reviewed_listings = (df['number_of_reviews'] == 0).sum()
print(f"The number of unreviewed listings is {never_been_reviewed_listings}")


The number of unreviewed listings is 14853
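
A count is easier to interpret next to the share of listings it represents. A minimal sketch on made-up data (the variable names are illustrative):

```python
import pandas as pd

# Made-up review counts for five listings
toy = pd.DataFrame({"number_of_reviews": [0, 3, 0, 12, 1]})

unreviewed = toy["number_of_reviews"] == 0
count_unreviewed = int(unreviewed.sum())

# The mean of a boolean mask is the fraction of True values
share_unreviewed = round(float(unreviewed.mean()) * 100, 1)

print(f"{count_unreviewed} listings ({share_unreviewed}%) have never been reviewed")
```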


<a id='insights'></a>
## Key Insights

The analysis of London Airbnb listings shows that prices are heavily influenced by **location**: central areas such as the **City of London** and **Lambeth** record the **highest** average nightly prices (above £370), while outer boroughs like Sutton and Harrow are far more affordable, with average prices below £100.

Among room types, **hotel rooms** have the **highest** average price at about £310.97 per night, followed by entire homes or apartments at roughly £256.60. Private rooms and shared rooms are much more affordable, averaging £120.02 and £83.64 respectively. **Availability also varies**: shared rooms and hotel rooms are available for more days per year on average (about 261 and 239) than private rooms (about 216) and entire homes/apartments (about 202).

The most active host based on this analysis is **LuxurybookingsFZE**, managing 495 listings, which indicates a strong professional or commercial presence in the London Airbnb market.

Interestingly, **14,853 listings have never been reviewed**, which may hint at low guest engagement or an oversupply of properties in some areas.