Project Title: Airbnb Listings Data Analysis – Insights & Visualization
Project Summary: I performed an in-depth exploratory data analysis (EDA) on an Airbnb listings dataset containing 102,000+ entries. The goal was to extract actionable insights on listings, pricing, host reliability, and potential revenue opportunities.
Key Steps & Techniques Used:
Data Cleaning & Preprocessing:
Counted missing values per column with df.isnull().sum() and replaced string placeholders such as "NA", "null", or blank strings with np.nan.
Dropped columns with excessive missing values (e.g., house_rules) using df.dropna(axis=1, thresh=int(len(df)*0.6)), which keeps only columns that are at least 60% non-null.
Converted columns such as price and availability 365 to numeric with pd.to_numeric after stripping symbols such as "$" and ",".
Filled missing values in categorical columns with mode (df['neighbourhood group'].fillna(df['neighbourhood group'].mode()[0])) and numeric columns with median (df['price'].fillna(df['price'].median())).
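The cleaning steps above can be sketched end to end as follows. This is a minimal illustration, not the project's actual notebook: the four-row frame and its values are made up, and only the column names mentioned in the steps are taken from the project. Note that thresh is a parameter of DataFrame.dropna, so that is the method the sketch uses for dropping sparse columns.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real listings data (illustrative values only).
df = pd.DataFrame({
    "price": ["$1,200", "$85", "NA", "$150"],
    "availability 365": ["365", "null", "120", "30"],
    "neighbourhood group": ["Brooklyn", " ", "Brooklyn", "Queens"],
    "house_rules": ["No smoking", "NA", "null", ""],
})

# Turn string placeholders and empty/whitespace-only cells into real NaN values.
df = df.replace(["NA", "null"], np.nan)
df = df.replace(r"^\s*$", np.nan, regex=True)

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Keep only columns that are at least 60% non-null.
# DataFrame.dropna (not DataFrame.drop) is the method that accepts thresh.
df = df.dropna(axis=1, thresh=int(len(df) * 0.6))

# Strip "$" and "," before converting; unparseable cells become NaN.
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)
df["availability 365"] = pd.to_numeric(df["availability 365"], errors="coerce")

# Impute: mode for the categorical column, median for the numeric columns.
df["neighbourhood group"] = df["neighbourhood group"].fillna(
    df["neighbourhood group"].mode()[0]
)
df["price"] = df["price"].fillna(df["price"].median())
df["availability 365"] = df["availability 365"].fillna(
    df["availability 365"].median()
)
```

Replacing placeholders before counting matters: until "NA" and "null" become np.nan, isnull() treats them as ordinary strings and under-reports the gaps.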
Data Analysis Using Pandas & NumPy:
Aggregated data to find average price by room type (df.groupby('room type')['price'].mean()).
Identified top neighbourhoods and most active hosts using value_counts() and groupby().
Calculated potential revenue per listing as an upper bound that assumes every available night is booked at the listed price: df['potential_revenue'] = df['price'] * df['availability 365'].
Summed potential revenue by host to find top-earning hosts (df.groupby('host name')['potential_revenue'].sum().sort_values(ascending=False).head(10)).
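A compact sketch of these aggregations on a small, made-up frame follows. The rows and host names are hypothetical, and the neighbourhood column name is an assumption, since the summary does not name the column used for neighbourhood counts.

```python
import pandas as pd

# Made-up, already-cleaned rows for illustration only.
df = pd.DataFrame({
    "host name": ["Ana", "Ana", "Ben", "Cy"],
    "neighbourhood": ["Harlem", "Harlem", "Astoria", "Harlem"],  # assumed column name
    "room type": ["Entire home/apt", "Private room", "Private room", "Shared room"],
    "price": [200.0, 80.0, 100.0, 40.0],
    "availability 365": [100, 300, 365, 50],
})

# Average price for each room type.
avg_price_by_room = df.groupby("room type")["price"].mean()

# Listing counts per neighbourhood and per host, largest first.
top_neighbourhoods = df["neighbourhood"].value_counts().head(10)
listings_per_host = df["host name"].value_counts().head(10)

# Upper-bound revenue: assumes every available night is booked at the listed price.
df["potential_revenue"] = df["price"] * df["availability 365"]

# Total potential revenue per host, top ten.
top_hosts = (
    df.groupby("host name")["potential_revenue"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
```

Grouping by host name merges distinct hosts who share a name. If the dataset carries a unique host identifier, grouping on that instead would avoid the collision.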
Data Visualization Using Matplotlib & Seaborn:
Bar plots to show listing counts by room type and top neighbourhoods.
Boxplots to visualize price distribution by room type.
Histograms to understand distribution of prices, availability, and review counts.
Scatter plots to explore relationships between price and number of reviews.
Heatmaps to identify missing values and correlations between numeric features.
Horizontal bar charts to highlight top hosts by potential revenue.
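Two of the chart types above, the price boxplot and the top-hosts bar chart, might be drawn roughly as follows. The sketch uses plain Matplotlib to keep dependencies minimal, although the project also used Seaborn (where seaborn.boxplot would be the usual equivalent for the first panel). The data is made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "host name": ["Ana", "Ana", "Ben", "Cy", "Cy"],
    "room type": ["Entire home/apt", "Private room", "Private room",
                  "Shared room", "Entire home/apt"],
    "price": [200.0, 80.0, 100.0, 40.0, 260.0],
    "potential_revenue": [20000.0, 24000.0, 36500.0, 2000.0, 13000.0],
})

fig, (ax_box, ax_bar) = plt.subplots(1, 2, figsize=(11, 4))

# Boxplot: one box of prices per room type.
room_types = sorted(df["room type"].unique())
prices_by_room = [
    df.loc[df["room type"] == rt, "price"].to_numpy() for rt in room_types
]
ax_box.boxplot(prices_by_room)
ax_box.set_xticks(range(1, len(room_types) + 1))
ax_box.set_xticklabels(room_types)
ax_box.set_title("Price by room type")
ax_box.set_ylabel("Price")

# Horizontal bars: hosts ranked by summed potential revenue.
top_hosts = df.groupby("host name")["potential_revenue"].sum().nlargest(10)
# barh draws the first item at the bottom, so reverse to put the largest on top.
ax_bar.barh(list(top_hosts.index[::-1]), top_hosts.to_numpy()[::-1])
ax_bar.set_title("Top hosts by potential revenue")
ax_bar.set_xlabel("Potential revenue")

fig.tight_layout()
```

In a Jupyter Notebook the figure renders inline without the Agg line. In a script, fig.savefig(...) or plt.show() with an interactive backend would display it.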
Key Insights Derived:
Entire homes tend to have higher average prices than private or shared rooms.
Certain neighbourhoods dominate the listings, indicating supply concentration and potential competition.
Most hosts are verified, and listings from verified hosts tend to have higher review counts.
The majority of listings have moderate availability, with a few full-time hosts dominating potential revenue.
Potential revenue analysis highlighted top hosts and neighbourhoods that could be targeted for investment or partnership.
Tools & Libraries Used:
Python: Pandas, NumPy
Visualization: Matplotlib, Seaborn
Jupyter Notebook for interactive analysis
Outcome: This project demonstrates practical data cleaning, EDA, aggregation, and visualization skills, providing actionable insights for Airbnb hosts, investors, or analysts to understand pricing trends, host performance, and neighbourhood potential.