<a href="https://colab.research.google.com/github/priyadharsh73/airbnb_eda/blob/main/airbnb_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In-depth Exploratory Data Analysis of Airbnb Dataset

## Introduction
This notebook walks through an exploratory data analysis (EDA) of the Airbnb dataset (`Airbnb_Open_Data.csv`, 102,599 rows and 26 columns). The objective is to uncover patterns in the data and better understand the factors that influence Airbnb listings.

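### Airbnb Dataset Column Descriptions

Note: when the CSV is first loaded, `price`, `service_fee`, and `last_review` are read as `object`. The types listed for those three columns apply only once they have been converted.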

| Column Name                      | Description                                                                 | Data Type   |
|----------------------------------|-----------------------------------------------------------------------------|-------------|
| id                               | Unique identifier for the listing                                           | int64       |
| name                             | Name of the listing                                                         | object      |
| host_id                          | Unique identifier for the host                                              | int64       |
| host_identity_verified           | Whether the host's identity is verified                                     | object      |
| host_name                        | Name of the host                                                            | object      |
| neighbourhood_group              | Grouping of neighbourhoods                                                  | object      |
| neighbourhood                    | Name of the neighbourhood                                                   | object      |
| lat                              | Latitude of the listing                                                     | float64     |
| long                             | Longitude of the listing                                                    | float64     |
| country                          | Country where the listing is located                                        | object      |
| country_code                     | Country code                                                                | object      |
| instant_bookable                 | Whether the listing is instantly bookable                                   | object      |
| cancellation_policy              | Cancellation policy of the listing                                          | object      |
| room_type                        | Type of room offered                                                        | object      |
| construction_year                | Year the property was constructed                                           | float64     |
| price                            | Price per night                                                             | float64     |
| service_fee                      | Service fee charged                                                         | float64     |
| minimum_nights                   | Minimum number of nights required to book                                   | float64     |
| number_of_reviews                | Total number of reviews received                                            | float64     |
| last_review                      | Date of the last review                                                     | datetime64  |
| reviews_per_month                | Average number of reviews per month                                         | float64     |
| review_rate_number               | Average rating of the listing                                               | float64     |
| calculated_host_listings_count   | Total number of listings by the host                                        | float64     |
| availability_365                 | Number of days the listing is available in a year                           | float64     |
| house_rules                      | House rules for the listing                                                 | object      |
| license                          | License number of the listing (if applicable)                               | object      |

## Data Loading and Preparation

### Import Libraries
First, import the necessary libraries for data manipulation, visualization, and mapping.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import folium
from folium.plugins import MarkerCluster
import zipfile
import requests
from io import BytesIO

### Download and Extract ZIP File from GitHub

This step involves downloading the ZIP file containing the Airbnb dataset from GitHub and extracting the CSV file from it. The `requests` library is used to download the file, and the `zipfile` and `io` libraries are used to handle the extraction. The extracted CSV file is then read into a pandas DataFrame for further analysis.


In [34]:
# URL of the ZIP file on GitHub (direct download link)
url = 'https://github.com/priyadharsh73/airbnb_eda/raw/main/airbnb_dataset.zip'
filename = 'Airbnb_Open_Data.csv'

# Download the ZIP file (the timeout keeps the cell from hanging indefinitely)
response = requests.get(url, timeout=60)

# Only try to unpack the archive if the download succeeded
if response.status_code == 200:
    try:
        # Wrap the downloaded bytes in a file-like object and open it as a ZIP archive
        with zipfile.ZipFile(BytesIO(response.content), 'r') as z:
            # List all files in the ZIP
            print(z.namelist())

            # Read the CSV straight from the archive into a DataFrame
            with z.open(filename) as f:
                df = pd.read_csv(f, low_memory=False)
    except zipfile.BadZipFile:
        print("Error: The file is not a valid ZIP file.")
else:
    print(f"Error: Failed to download the file. Status code: {response.status_code}")


['Airbnb_Open_Data.csv']


### Clean Column Names
- Use the `str.replace()` method to replace spaces with underscores.
- Use the `str.lower()` method to convert column names to lowercase.

Here’s the code to clean the column names:

In [35]:
# Clean column names by replacing spaces with underscores and converting to lowercase
df.columns = df.columns.str.replace(' ', '_').str.lower()

# Check the cleaned column names
print("Cleaned column names:")
print(df.columns)

Cleaned column names:
Index(['id', 'name', 'host_id', 'host_identity_verified', 'host_name',
       'neighbourhood_group', 'neighbourhood', 'lat', 'long', 'country',
       'country_code', 'instant_bookable', 'cancellation_policy', 'room_type',
       'construction_year', 'price', 'service_fee', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'review_rate_number', 'calculated_host_listings_count',
       'availability_365', 'house_rules', 'license'],
      dtype='object')


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102599 entries, 0 to 102598
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   id                              102599 non-null  int64  
 1   name                            102349 non-null  object 
 2   host_id                         102599 non-null  int64  
 3   host_identity_verified          102310 non-null  object 
 4   host_name                       102193 non-null  object 
 5   neighbourhood_group             102570 non-null  object 
 6   neighbourhood                   102583 non-null  object 
 7   lat                             102591 non-null  float64
 8   long                            102591 non-null  float64
 9   country                         102067 non-null  object 
 10  country_code                    102468 non-null  object 
 11  instant_bookable                102494 non-null  object 
 12  cancellation_pol

In [37]:
# Check for missing values
print(df.isnull().sum())


id                                     0
name                                 250
host_id                                0
host_identity_verified               289
host_name                            406
neighbourhood_group                   29
neighbourhood                         16
lat                                    8
long                                   8
country                              532
country_code                         131
instant_bookable                     105
cancellation_policy                   76
room_type                              0
construction_year                    214
price                                247
service_fee                          273
minimum_nights                       409
number_of_reviews                    183
last_review                        15893
reviews_per_month                  15879
review_rate_number                   326
calculated_host_listings_count       319
availability_365                     448
house_rules     

### Observations on Data Type Misinterpretation and Missing Values

#### Data Type Misinterpretation
1. **Price and Service Fee**: These columns are read as `object` because the values contain currency symbols. Strip the symbols (and any thousands separators) before converting the columns to `float`.
2. **Construction Year**: This column is read as `float64` because it contains missing values, but it holds whole years and should be an integer type. A plain `int64` cannot store missing values, so either handle them first or use pandas' nullable `Int64` dtype.
3. **Minimum Nights, Number of Reviews, Review Rate Number, Calculated Host Listings Count, Availability 365**: These columns are read as `float64` for the same reason and should also be integer types (`int64` after missing values are handled, or nullable `Int64`).
4. **Last Review**: This column is read as `object` (date strings) and should be converted to `datetime`.
5. **Reviews per Month**: This column is correctly interpreted as `float64`.
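
The conversions listed above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual cleaning code: the `sample` frame and its values are made up, `clean_types` is a hypothetical helper, and the `%m/%d/%Y` date format and the presence of thousands separators are assumptions that should be checked against the real data.

```python
import pandas as pd

# Hypothetical sample that mimics the raw columns; in the notebook the same
# steps would be applied to `df` instead.
sample = pd.DataFrame({
    "price": ["$966 ", "$1,060 ", None],
    "service_fee": ["$193 ", "$212 ", None],
    "construction_year": [2020.0, None, 2005.0],
    "minimum_nights": [10.0, 30.0, None],
    "last_review": ["10/19/2021", None, "5/21/2022"],
})

def clean_types(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with the dtype fixes applied."""
    out = frame.copy()

    # Remove "$", "," and whitespace, then parse as numbers.
    # Anything that still cannot be parsed becomes NaN.
    for col in ["price", "service_fee"]:
        stripped = out[col].str.replace(r"[$,\s]", "", regex=True)
        out[col] = pd.to_numeric(stripped, errors="coerce")

    # Nullable Int64 keeps missing values as <NA> instead of forcing floats.
    # This assumes the stored floats are whole numbers.
    for col in ["construction_year", "minimum_nights"]:
        out[col] = out[col].astype("Int64")

    # Assumed date format: month/day/year. Unparseable values become NaT.
    out["last_review"] = pd.to_datetime(
        out["last_review"], format="%m/%d/%Y", errors="coerce"
    )
    return out

cleaned = clean_types(sample)
print(cleaned.dtypes)
```

The other count-like columns (`number_of_reviews`, `review_rate_number`, `calculated_host_listings_count`, `availability_365`) could be added to the second loop in the same way.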

#### Missing Values
1. **High Missing Values**:
   - **House Rules**: 52,131 missing values (about 51% of rows), indicating that roughly half of the listings do not provide house rules.
   - **License**: 102,597 missing values out of 102,599 rows, so all but two listings have no license recorded.
   - **Last Review**: 15,893 missing values (about 15% of rows), most likely listings that have never been reviewed.
   - **Reviews per Month**: 15,879 missing values, nearly the same count as Last Review, which suggests both are missing for largely the same listings.

2. **Moderate Missing Values** (sorted from most to fewest):
   - **Country**: 532 missing values.
   - **Availability 365**: 448 missing values.
   - **Minimum Nights**: 409 missing values.
   - **Host Name**: 406 missing values.
   - **Review Rate Number**: 326 missing values.
   - **Calculated Host Listings Count**: 319 missing values.
   - **Host Identity Verified**: 289 missing values.
   - **Service Fee**: 273 missing values.
   - **Name**: 250 missing values.
   - **Price**: 247 missing values.
   - **Construction Year**: 214 missing values.
   - **Number of Reviews**: 183 missing values.
   - **Country Code**: 131 missing values.
   - **Instant Bookable**: 105 missing values.
   - **Cancellation Policy**: 76 missing values.

3. **Low Missing Values**:
   - **Neighbourhood Group**: 29 missing values.
   - **Neighbourhood**: 16 missing values.
   - **Latitude and Longitude**: 8 missing values each.
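
The grouping above can also be derived programmatically. The sketch below is one possible way to do it: `missing_summary` is a hypothetical helper, and the thresholds (`high=0.10`, `low=0.0005`) are assumptions chosen so that, on a frame of 102,599 rows, the cut-offs fall between the counts listed in the three groups. The `toy` frame only demonstrates the mechanics.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame, high: float = 0.10, low: float = 0.0005) -> pd.DataFrame:
    """Count missing values per column and label each column's severity."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(frame),  # fraction of rows that are missing
    })

    # Start with "moderate" and overwrite the extremes.
    summary["level"] = "moderate"
    summary.loc[summary["share"] >= high, "level"] = "high"
    summary.loc[summary["share"] < low, "level"] = "low"
    summary.loc[summary["missing"] == 0, "level"] = "none"

    return summary.sort_values("missing", ascending=False)

# Toy frame with 20 rows: "a" is complete, "b" misses 1 value, "c" misses 10.
toy = pd.DataFrame({
    "a": range(20),
    "b": [None] + [1.0] * 19,
    "c": [None] * 10 + ["x"] * 10,
})

summary = missing_summary(toy)
print(summary)
```

In the notebook, calling `missing_summary(df)` would replace reading the raw `df.isnull().sum()` output by eye.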

#### Next Steps
1. **Handle Missing Values**: Decide on strategies to fill or drop missing values based on the significance of each column.
2. **Correct Data Types**: Convert columns to appropriate data types for accurate analysis.
3. **Further Cleaning**: Address any other inconsistencies or errors in the data.
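
As a starting point for the first step, here is a hedged sketch of one possible missing-value strategy. None of these choices come from the notebook itself: `handle_missing` is a hypothetical helper, the 95% threshold for dropping a column is an assumption, and filling `reviews_per_month` with 0 assumes that a missing value means the listing has no reviews.

```python
import pandas as pd

def handle_missing(frame: pd.DataFrame, drop_above: float = 0.95) -> pd.DataFrame:
    """Apply a simple, column-specific missing-value strategy."""
    out = frame.copy()

    # Drop columns that are almost entirely empty (e.g. a nearly unused field).
    mostly_empty = [c for c in out.columns if out[c].isnull().mean() > drop_above]
    out = out.drop(columns=mostly_empty)

    # Rows without coordinates cannot be mapped, so drop them.
    out = out.dropna(subset=["lat", "long"])

    # Assumption: no recorded review rate means zero reviews per month.
    out["reviews_per_month"] = out["reviews_per_month"].fillna(0)

    return out.reset_index(drop=True)

# Toy frame that mirrors a few of the dataset's columns.
toy = pd.DataFrame({
    "lat": [40.6, None, 40.8, 40.7],
    "long": [-73.9, -73.9, -74.0, -73.9],
    "reviews_per_month": [0.2, 1.0, None, 3.1],
    "license": [None, None, None, None],
})

tidy = handle_missing(toy)
print(tidy)
```

Whether to drop, fill, or keep missing values depends on how each column is used later in the analysis, so the rules above should be revisited per column.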