<a href="https://colab.research.google.com/github/lmansf/EDA-with-AirBnb/blob/main/AirBnB_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis: AirBnB Open Data
## Data Source
https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata/data
## Goals
- Clean and transform the data to prepare it for analysis.
- Identify features associated with a high `review rate number`.
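
One possible way to approach the second goal, once the data is clean, is to rank numeric columns by their correlation with `review rate number`. The sketch below is illustrative only: the `toy` frame holds made-up values, and `rank_by_correlation` is a hypothetical helper, not something defined in this notebook.

```python
import pandas as pd

# Toy stand-in for the cleaned listings table; values are invented for illustration.
toy = pd.DataFrame({
    "review rate number": [1, 2, 3, 4, 5],
    "minimum nights": [30, 10, 5, 3, 1],
    "availability 365": [100, 300, 200, 250, 150],
})

def rank_by_correlation(frame, target):
    """Return numeric columns ordered by absolute Pearson correlation with target."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

ranking = rank_by_correlation(toy, "review rate number")
print(ranking)
```

Correlation only captures linear association, so a ranking like this is a starting point for plots and closer inspection rather than a conclusion.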


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv('/content/Airbnb_Open_Data.csv')


In [3]:
# read_csv already returns a DataFrame; work on a copy so the raw data stays untouched
df = data.copy()

In [4]:
df.head(5)

,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,...,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,...,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,...,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,...,$124,3.0,0.0,,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and...",
3,1002755,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,...,$74,30.0,270.0,7/5/2019,4.64,4.0,1.0,322.0,,
4,1003689,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,...,$41,10.0,9.0,11/19/2018,0.1,3.0,1.0,289.0,"Please no smoking in the house, porch or on th...",
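
The preview shows `service fee` stored as text such as `$193`, so it would need converting before any numeric analysis. A minimal sketch of one way to do that follows; `currency_to_number` is a hypothetical helper, and the handling of thousands separators (for example `$1,200`) is an assumption about values not visible in the preview.

```python
import pandas as pd

# Fee strings copied from the preview above, plus one missing entry.
fees = pd.Series(["$193", "$28", "$124", None])

def currency_to_number(series):
    """Strip '$', commas and whitespace, then parse as numbers (missing stays missing)."""
    cleaned = series.str.replace(r"[$,\s]", "", regex=True)
    # errors="coerce" turns anything that still fails to parse into a missing value
    return pd.to_numeric(cleaned, errors="coerce")

fees_numeric = currency_to_number(fees)
print(fees_numeric)
```

On the real table this could be applied as `df['service fee'] = currency_to_number(df['service fee'])`, assuming the column holds strings in this format.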


In [6]:
!pip install gender-guesser -q
import gender_guesser.detector as gender

# Initialize the gender detector
d = gender.Detector()

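def identify_gender(name):
    # Missing names cannot be classified
    if pd.isna(name):
        return 'Unknown'

    # Use only the first token of the name, capitalized for the lookup
    # (e.g. "john doe" -> "John"); guard against blank strings
    parts = str(name).split()
    if not parts:
        return 'Unknown'
    first_name = parts[0].capitalize()

    gender_pred = d.get_gender(first_name)

    # Encode predictions as '0' (female or mostly_female), '1' (male or
    # mostly_male), or 'Unknown' (androgynous or unrecognized names).
    # 'female' is tested first because it contains the substring 'male'.
    if 'female' in gender_pred:
        return '0'
    elif 'male' in gender_pred:
        return '1'
    else:
        return 'Unknown'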

# Apply the function to the 'host name' column
df['host_gender'] = df['host name'].apply(identify_gender)

# Display the new column alongside the names to verify
df[['host name', 'host_gender']].head()

,host name,host_gender
0,Madaline,0
1,Jenna,0
2,Elise,0
3,Garry,1
4,Lyndon,1


In [9]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon for sentiment analysis
nltk.download('vader_lexicon')

# Initialize the sentiment analyzer
sid = SentimentIntensityAnalyzer()

def rate_positivity(text):
    if pd.isna(text):
        return None

    # Get the polarity scores
    scores = sid.polarity_scores(str(text))

    # Compound score ranges from -1 (most negative) to 1 (most positive)
    # We treat anything <= 0 as 0 (neutral/negative), and scale positive values to 0-5
    compound = scores['compound']

    if compound <= 0:
        return 0.0
    else:
        return compound * 5.0

# Apply the function to the 'house_rules' column
df['house_rules_positivity'] = df['house_rules'].apply(rate_positivity)

# Display the results to verify
display(df[['house_rules', 'house_rules_positivity']].head())

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


,house_rules,house_rules_positivity
0,Clean up and treat the home the way you'd like...,3.454
1,Pet friendly but please confirm with me if the...,4.8305
2,"I encourage you to use my kitchen, cooking and...",0.0
3,,
4,"Please no smoking in the house, porch or on th...",1.7
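
The scaling step inside `rate_positivity` can be isolated to see how compound scores map onto the 0 to 5 range. This is a small sketch of that arithmetic only; `scale_compound` is a hypothetical name, and the compound values are picked for illustration (0.9661 is the value that reproduces the 4.8305 shown for row 1).

```python
def scale_compound(compound):
    """Map a VADER compound score in [-1, 1] onto the 0-5 scale used above."""
    # Negative and neutral scores collapse to 0; positive scores scale linearly.
    return max(compound, 0.0) * 5.0

for c in (-0.5, 0.0, 0.34, 0.9661):
    print(c, "->", round(scale_compound(c), 4))
```

One consequence of this design is that negative and neutral rule texts become indistinguishable: both end up at 0.0, as in row 2 of the output.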


In [18]:
df.dropna(subset=['NAME','number of reviews','host name'],inplace=True)
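
Before dropping rows, it can help to check how many values are missing in each required column and how many rows the drop removes. The sketch below uses a made-up three-row frame (`toy`) rather than the real data, and returns a new frame instead of using `inplace=True` so the before and after sizes can be compared.

```python
import pandas as pd

# Invented rows for illustration; only the column names match the notebook.
toy = pd.DataFrame({
    "NAME": ["Skylit Midtown Castle", None, "Loft"],
    "host name": ["Jenna", "Garry", None],
    "number of reviews": [45.0, 270.0, 9.0],
})

required = ["NAME", "number of reviews", "host name"]

# Count missing values per required column
missing_per_column = toy[required].isna().sum()

# Keep only rows where every required column is present
kept = toy.dropna(subset=required)

print(missing_per_column)
print(f"dropped {len(toy) - len(kept)} of {len(toy)} rows")
```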

In [21]:
df['has_rules'] = df['house_rules'].notna()

In [36]:
# Encode the boolean flag as 0/1
df['has_rules'] = df['has_rules'].astype(int)
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which pandas warns will stop working in pandas 3.0
df['house_rules_positivity'] = df['house_rules_positivity'].fillna(3)


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 86127 entries, 0 to 102597
Data columns (total 29 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              86127 non-null  int64  
 1   NAME                            86127 non-null  object 
 2   host id                         86127 non-null  int64  
 3   host_identity_verified          85893 non-null  object 
 4   host name                       86127 non-null  object 
 5   neighbourhood group             86103 non-null  object 
 6   neighbourhood                   86111 non-null  object 
 7   lat                             86120 non-null  float64
 8   long                            86120 non-null  float64
 9   country                         85640 non-null  object 
 10  country code                    86015 non-null  object 
 11  instant_bookable                86038 non-null  object 
 12  cancellation_policy             8606