# Airbnb Data Visualization

This notebook performs basic exploratory data analysis (EDA) on a cleansed, open-source Airbnb dataset from Kaggle, using data visualization (data viz) and some numerical analysis.

The dataset includes listing-level information such as location, price, availability, and review activity.

🔗 Source of raw, uncleansed data: [Airbnb Open Data on Kaggle](https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata)

Note: This is a personal data science project for educational purposes. To reproduce the results, download the raw dataset from Kaggle and run it through my data cleansing notebook, available in the [GitHub cleansing repo](https://github.com/mg-ds-portfolio/prj_open_airbnb_data_cleanse.git).

In [None]:
# Import packages.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

In [None]:
# Configure chart displays.

# Allow in-line plotting.
%matplotlib inline

# Set plot style.
plt.style.use("seaborn-v0_8-whitegrid")

In [None]:
# Import the cleansed data and take a copy to work on.
df_raw = pd.read_csv("/home/mark/data_viz_practice/data/airbnb_open_data_cleaned.csv")

df = df_raw.copy()

In [4]:
# Inspect data.
df.head()

Unnamed: 0,id,neighbourhood_group,neighbourhood,lat,long,instant_bookable,cancellation_policy,room_type,construction_year,price,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365
0,1001254,brooklyn,kensington,40.64749,-73.97237,False,strict,private room,2020.0,966.0,193.0,10.0,9.0,2021-10-19,0.21,4.0,6.0,286.0
1,1002102,manhattan,midtown,40.75362,-73.98377,False,moderate,entire home/apt,2007.0,142.0,28.0,30.0,45.0,2022-05-21,0.38,4.0,2.0,228.0
2,1002403,manhattan,harlem,40.80902,-73.9419,True,flexible,private room,2005.0,620.0,124.0,3.0,0.0,,0.0,5.0,1.0,352.0
3,1002755,brooklyn,clinton hill,40.68514,-73.95976,True,moderate,entire home/apt,2005.0,368.0,74.0,30.0,270.0,2019-07-05,4.64,4.0,1.0,322.0
4,1003689,manhattan,east harlem,40.79851,-73.94399,False,moderate,entire home/apt,2009.0,204.0,41.0,10.0,9.0,2018-11-19,0.1,3.0,1.0,289.0


In [12]:
# Summarize data types, NA counts, and unique-value counts per column.
data_types = df.dtypes
count_nas = df.isna().sum()
count_unique = df.nunique()

pd.concat([data_types, count_nas, count_unique], axis=1).rename(
    columns={0: "data_type", 1: "na_count", 2: "num_unique_values"}
)

Unnamed: 0,data_type,na_count,num_unique_values
id,int64,0,99812
neighbourhood_group,object,0,5
neighbourhood,object,0,224
lat,float64,0,21846
long,float64,0,17643
instant_bookable,bool,0,2
cancellation_policy,object,0,3
room_type,object,0,4
construction_year,float64,0,20
price,float64,0,1151


Observations:
- The only column with NAs is last_review. All others have been cleansed.
- Data types broadly align with content (int/float for numbers, object for strings).
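
The observations above can be checked programmatically. The sketch below is only an illustration: `sample` is a small stand-in for the notebook's `df`, built from three rows of the `df.head()` output. The whole-number check is an extra step not taken in the notebook, included because it hints at which float columns might suit an integer type.

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's `df`: three rows copied from the df.head() output.
sample = pd.DataFrame(
    {
        "price": [966.0, 142.0, 620.0],
        "number_of_reviews": [9.0, 45.0, 0.0],
        "last_review": ["2021-10-19", "2022-05-21", np.nan],
    }
)

# Names of columns that contain at least one missing value.
cols_with_na = sample.columns[sample.isna().any()].tolist()

# Float columns whose non-missing values are all whole numbers.
float_cols = sample.select_dtypes(include="float64").columns
whole_number_cols = [
    col for col in float_cols if (sample[col].dropna() % 1 == 0).all()
]

print(cols_with_na)
print(whole_number_cols)
```

Running the same two checks against the full `df` would show whether last_review really is the only column with NAs.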

Actions:
- Convert columns to more specific data types where appropriate (e.g. neighbourhood_group to a categorical type).

In [None]:
# Group column names by the data type they should be converted to.
category_type_cols = ["neighbourhood_group"]
numeric_type_cols = []
string_type_cols = []

id                                  int64
neighbourhood_group                object
neighbourhood                      object
lat                               float64
long                              float64
instant_bookable                     bool
cancellation_policy                object
room_type                          object
construction_year                 float64
price                             float64
service_fee                       float64
minimum_nights                    float64
number_of_reviews                 float64
last_review                        object
reviews_per_month                 float64
review_rate_number                float64
calculated_host_listings_count    float64
availability_365                  float64
dtype: object
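
The conversion listed under Actions could be sketched as follows. This is a minimal illustration, not the notebook's final approach. The column groupings are assumptions (only neighbourhood_group appears in `category_type_cols` so far), and `sample` is a small stand-in for `df` built from the `df.head()` output.

```python
import pandas as pd

# Stand-in for the notebook's `df`: three rows copied from the df.head() output.
sample = pd.DataFrame(
    {
        "neighbourhood_group": ["brooklyn", "manhattan", "manhattan"],
        "room_type": ["private room", "entire home/apt", "private room"],
        "minimum_nights": [10.0, 30.0, 3.0],
        "last_review": ["2021-10-19", "2022-05-21", None],
    }
)

# Assumed groupings for illustration; the notebook has not settled on these.
category_cols = ["neighbourhood_group", "room_type"]
integer_cols = ["minimum_nights"]
datetime_cols = ["last_review"]

# Build one column -> dtype mapping and convert in a single astype call.
dtype_map = {col: "category" for col in category_cols}
dtype_map.update({col: "int64" for col in integer_cols})
converted = sample.astype(dtype_map)

for col in datetime_cols:
    # Missing dates become NaT instead of raising an error.
    converted[col] = pd.to_datetime(converted[col], format="%Y-%m-%d")

print(converted.dtypes)
```

Casting to `int64` only works when a column has no missing values, which matches the NA counts reported above for every column except last_review.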