# Airbnb Data Cleansing

This notebook performs initial data cleansing on an open-source Airbnb dataset from Kaggle. The goal is to prepare the data for further analysis by identifying and handling missing values, correcting data types, and filtering out invalid or duplicate entries.

The dataset includes listing-level information such as location, price, availability, and review activity.

🔗 Source: [Airbnb Open Data on Kaggle](https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata)

*Note: This is a personal data science project for educational purposes.*

In [34]:
# Import packages.
import pandas as pd
import numpy as np

In [35]:
# Change settings to show all columns.
pd.set_option('display.max_columns', None)

In [36]:
# Import raw data and take a copy to work on.
df_raw = pd.read_csv("/home/mark/data_cleansing_practice/prj_open_airbnb_data_cleanse/data/Airbnb_Open_Data.csv")
df = df_raw.copy()


In [37]:
# Format column names.
df.columns = df.columns.str.lower().str.replace(" ", "_")
df.columns

Index(['id', 'name', 'host_id', 'host_identity_verified', 'host_name',
       'neighbourhood_group', 'neighbourhood', 'lat', 'long', 'country',
       'country_code', 'instant_bookable', 'cancellation_policy', 'room_type',
       'construction_year', 'price', 'service_fee', 'minimum_nights',
       'number_of_reviews', 'last_review', 'reviews_per_month',
       'review_rate_number', 'calculated_host_listings_count',
       'availability_365', 'house_rules', 'license'],
      dtype='object')
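
As a small illustration of the renaming step, the sketch below applies the same chain to a toy set of headers. The header names are made up for the example, and the extra `.str.strip()` is an assumption that guards against stray edge whitespace; the original cell does not need it for this dataset.

```python
import pandas as pd

# Toy headers in a similar style to the raw file (illustrative only).
toy = pd.DataFrame(columns=["NAME", "host id", " Construction year ", "review rate number"])

# Strip edge whitespace first so it does not turn into leading/trailing underscores.
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(toy.columns))
```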

In [38]:
# Inspect the data.
df.head()

Unnamed: 0,id,name,host_id,host_identity_verified,host_name,neighbourhood_group,neighbourhood,lat,long,country,country_code,instant_bookable,cancellation_policy,room_type,construction_year,price,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,US,False,strict,Private room,2020.0,$966,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,US,False,moderate,Entire home/apt,2007.0,$142,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,US,True,flexible,Private room,2005.0,$620,$124,3.0,0.0,,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and...",
3,1002755,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,US,True,moderate,Entire home/apt,2005.0,$368,$74,30.0,270.0,7/5/2019,4.64,4.0,1.0,322.0,,
4,1003689,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,US,False,moderate,Entire home/apt,2009.0,$204,$41,10.0,9.0,11/19/2018,0.1,3.0,1.0,289.0,"Please no smoking in the house, porch or on th...",


Observations:
- We have columns we probably won't need for EDA.
- Some data is NA.
- There are different data types (dates, strings, floats, integers).
- The $ sign in the price columns probably means they are stored as strings.
- Letter case is inconsistent across columns, e.g. neighbourhood_group is capitalised while cancellation_policy is lowercase.

Actions:
1. Remove unnecessary columns.
2. Remove duplicate rows.
3. Remove $ signs.
4. Change data types.
5. Make string cases consistent.
6. Fill/remove NA data.
7. Check for outliers.

In [39]:
# Remove unnecessary columns. Keep id as a key even though it's not used. Otherwise you get lots of duplicate rows.
# Removing country and country code because I know all the data is from the US.
cols_to_drop = ["name", "host_id", "host_identity_verified", "host_name", "country", "country_code", "house_rules", "license"]

df = df.drop(columns = cols_to_drop)

In [40]:
# Visual check.
df.head()

Unnamed: 0,id,neighbourhood_group,neighbourhood,lat,long,instant_bookable,cancellation_policy,room_type,construction_year,price,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365
0,1001254,Brooklyn,Kensington,40.64749,-73.97237,False,strict,Private room,2020.0,$966,$193,10.0,9.0,10/19/2021,0.21,4.0,6.0,286.0
1,1002102,Manhattan,Midtown,40.75362,-73.98377,False,moderate,Entire home/apt,2007.0,$142,$28,30.0,45.0,5/21/2022,0.38,4.0,2.0,228.0
2,1002403,Manhattan,Harlem,40.80902,-73.9419,True,flexible,Private room,2005.0,$620,$124,3.0,0.0,,,5.0,1.0,352.0
3,1002755,Brooklyn,Clinton Hill,40.68514,-73.95976,True,moderate,Entire home/apt,2005.0,$368,$74,30.0,270.0,7/5/2019,4.64,4.0,1.0,322.0
4,1003689,Manhattan,East Harlem,40.79851,-73.94399,False,moderate,Entire home/apt,2009.0,$204,$41,10.0,9.0,11/19/2018,0.1,3.0,1.0,289.0


In [41]:
# Check no rows were removed.
print(f"Original # rows: {len(df_raw)}")
print(f"Trimmed # rows: {len(df)}")

print("----------")

# Check only the selected columns were removed.
print(f"# of columns to drop: {len(cols_to_drop)}")
print(f"# of columns dropped: {len(df_raw.columns) - len(df.columns)}")

Original # rows: 102599
Trimmed # rows: 102599
----------
# of columns to drop: 8
# of columns dropped: 8


In [42]:
# Check how many duplicate rows we have.
print(f"# of duplicated rows: {df.duplicated().sum()}")

# Drop duplicate rows.
df = df.drop_duplicates()

# Check the right number of rows were removed.
print(f"# of rows dropped: {len(df_raw) - len(df)}")

print(f"# of non-duplicated rows: {len(df)}")


# of duplicated rows: 541
# of rows dropped: 541
# of non-duplicated rows: 102058
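
The earlier comment about keeping `id` as a key can be illustrated on toy data. This is a minimal sketch with made-up rows: with `id` present only a genuinely repeated listing is flagged, while without it distinct listings that happen to share attributes are flagged too.

```python
import pandas as pd

# Made-up listings: id 3 is a true repeat, ids 1 and 2 merely share attributes.
toy = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "room_type": ["Private room", "Private room", "Shared room", "Shared room"],
    "price": ["$100", "$100", "$80", "$80"],
})

# With id kept, only the repeated listing counts as a duplicate.
dupes_with_id = int(toy.duplicated().sum())

# Without id, two different listings with the same attributes also collide.
dupes_without_id = int(toy.drop(columns="id").duplicated().sum())

print(dupes_with_id, dupes_without_id)
```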


In [43]:
# Remove the $ signs from price and service_fee.
df["price"] = df["price"].str.replace(r"[^0-9.]", "", regex = True)
df["service_fee"] = df["service_fee"].str.replace(r"[^0-9.]", "", regex = True)

df[["price", "service_fee"]]

Unnamed: 0,price,service_fee
0,966,193
1,142,28
2,620,124
3,368,74
4,204,41
...,...,...
102053,696,
102054,909,
102055,387,
102056,848,
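
The regex keeps only digits and the decimal point, so it also removes thousands separators and stray spaces. A minimal sketch on made-up values (the `errors="coerce"` argument is an addition for the example, not something the notebook uses at this point):

```python
import pandas as pd

# Made-up currency strings, including a thousands separator and a missing value.
raw = pd.Series(["$966", "$1,200 ", None, "$28"])

# Drop everything except digits and the decimal point; missing values stay missing.
cleaned = raw.str.replace(r"[^0-9.]", "", regex=True)

# errors="coerce" turns anything still unparseable (e.g. an empty string) into NaN.
as_number = pd.to_numeric(cleaned, errors="coerce")

print(as_number.tolist())
```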


In [44]:
# Check data types of each column, and number of unique values in each column.
data_types = df.dtypes
num_unique_values = df.nunique()
print(pd.concat([data_types, num_unique_values], axis = 1).rename(columns = {0: "data_type", 1: "no_unique_values"}))

                               data_type  no_unique_values
id                                 int64            102058
neighbourhood_group               object                 7
neighbourhood                     object               224
lat                              float64             21991
long                             float64             17774
instant_bookable                  object                 2
cancellation_policy               object                 3
room_type                         object                 4
construction_year                float64                20
price                             object              1151
service_fee                       object               231
minimum_nights                   float64               153
number_of_reviews                float64               476
last_review                       object              2477
reviews_per_month                float64              1016
review_rate_number               float64                

In [45]:
# Let's convert numerical columns to either int or float, and text columns to strings. Then we can format and decide if some should be categories.

num_type_cols = ["id", "lat", "long", "construction_year", "price", "service_fee", "minimum_nights", "number_of_reviews", "reviews_per_month", "review_rate_number", "calculated_host_listings_count", "availability_365"]
date_type_cols = ["last_review"]
text_type_cols = ["neighbourhood_group", "neighbourhood", "instant_bookable", "cancellation_policy", "room_type"]

for val in num_type_cols:
    df[val] = pd.to_numeric(df[val])

for val in date_type_cols:
    df[val] = pd.to_datetime(df[val])

for val in text_type_cols:
    df[val] = df[val].astype("string").str.lower().str.strip()

print(f"# numeric cols: {len(num_type_cols)}")
print(f"# date cols: {len(date_type_cols)}")
print(f"# text cols: {len(text_type_cols)}")

print("-----")

print(f"total # cols: {len(df.columns)}")


# numeric cols: 12
# date cols: 1
# text cols: 5
-----
total # cols: 18
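
The conversions above raise an error if a value cannot be parsed. One alternative, sketched here on made-up values, is to coerce failures to NA and count how many new NAs the conversion introduced. The explicit `%m/%d/%Y` format is an assumption based on the dates visible in `df.head()`.

```python
import pandas as pd

# Made-up columns containing one unparseable value each.
toy = pd.DataFrame({
    "minimum_nights": ["10", "30", "three"],
    "last_review": ["10/19/2021", "5/21/2022", "not a date"],
})

# Coerce failures to NaN/NaT instead of letting the conversion raise.
nights = pd.to_numeric(toy["minimum_nights"], errors="coerce")
dates = pd.to_datetime(toy["last_review"], format="%m/%d/%Y", errors="coerce")

# Any increase in the NA count is a value that failed to parse.
new_na_nights = int(nights.isna().sum() - toy["minimum_nights"].isna().sum())
new_na_dates = int(dates.isna().sum() - toy["last_review"].isna().sum())

print(new_na_nights, new_na_dates)
```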


In [46]:
# I want to see the values from columns with a low number of unique values. This will decide whether they are categorical or not.
category_type_cols = ["neighbourhood_group", "instant_bookable", "cancellation_policy", "room_type", "review_rate_number"]

for col in category_type_cols:
    dtype_row_label = col + ": "
    for val in df[col].drop_duplicates():
        dtype_row_label += f"{val}, "
    print(dtype_row_label)


neighbourhood_group: brooklyn, manhattan, brookln, manhatan, queens, <NA>, staten island, bronx, 
instant_bookable: false, true, <NA>, 
cancellation_policy: strict, moderate, flexible, <NA>, 
room_type: private room, entire home/apt, shared room, hotel room, 
review_rate_number: 4.0, 5.0, 3.0, nan, 2.0, 1.0, 


Observation + action:
- Fix spelling mistakes in the neighbourhood_group column

In [47]:
df["neighbourhood_group"] = df["neighbourhood_group"].replace({"brookln": "brooklyn", "manhatan": "manhattan"})

# Check.
df["neighbourhood_group"].drop_duplicates()

0           brooklyn
1          manhattan
47            queens
74              <NA>
170    staten island
172            bronx
Name: neighbourhood_group, dtype: string
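
The two misspellings were spotted by eye. For longer lists, one possible aid is the standard library's `difflib.get_close_matches`, sketched below on made-up labels. The rule that a label seen more than once is "trusted" and the 0.85 cutoff are assumptions chosen for this toy data, and any suggestion would still need a manual check.

```python
import difflib
import pandas as pd

# Made-up labels: two rare misspellings among more frequent correct values.
labels = pd.Series([
    "brooklyn", "manhattan", "brookln", "manhatan", "queens",
    "bronx", "brooklyn", "manhattan", "brooklyn",
])

counts = labels.value_counts()
trusted = counts[counts > 1].index.tolist()   # assumed-correct labels
rare = counts[counts == 1].index.tolist()     # candidates for misspellings

# Map each rare label to its closest trusted label, if one is similar enough.
suggestions = {}
for label in rare:
    match = difflib.get_close_matches(label, trusted, n=1, cutoff=0.85)
    if match:
        suggestions[label] = match[0]

print(suggestions)
```

The resulting dictionary has the same shape as the one passed to `.replace()` above.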

In [48]:
# Manually check for and fix spelling mistakes in neighbourhood column.
neighbourhood_list = df["neighbourhood"].drop_duplicates().sort_values()

for n in neighbourhood_list:
    print(n)

allerton
arden heights
arrochar
arverne
astoria
bath beach
battery park city
bay ridge
bay terrace
bay terrace, staten island
baychester
bayside
bayswater
bedford-stuyvesant
belle harbor
bellerose
belmont
bensonhurst
bergen beach
boerum hill
borough park
breezy point
briarwood
brighton beach
bronxdale
brooklyn heights
brownsville
bull's head
bushwick
cambria heights
canarsie
carroll gardens
castle hill
castleton corners
chelsea
chelsea, staten island
chinatown
city island
civic center
claremont village
clason point
clifton
clinton hill
co-op city
cobble hill
college point
columbia st
concord
concourse
concourse village
coney island
corona
crown heights
cypress hills
ditmars steinway
dongan hills
douglaston
downtown brooklyn
dumbo
dyker heights
east elmhurst
east flatbush
east harlem
east morrisania
east new york
east village
eastchester
edenwald
edgemere
elmhurst
eltingville
emerson hill
far rockaway
fieldston
financial district
flatbush
flatiron district
flatlands
flushing
fordham
for

In [49]:
# Check how many cells in each column have missing data. Express as number and %.
count_missing_data = df.isna().sum()

percent_missing_data = round(count_missing_data/len(df)*100, 2)

print(pd.concat([count_missing_data, percent_missing_data, df.dtypes], axis = 1).rename(columns = {0: "missing_count", 1: "missing_%", 2: "data_type"}))

print("----------")

# Check how many rows have at least 1 NA value, excluding columns with lots of missing data.
cols_to_na_check = df.columns.difference(["last_review", "reviews_per_month"])

num_rows_with_na_excl = df[df[cols_to_na_check].isna().any(axis=1)].shape[0]

print(f"# rows with NA (excl last_review, reviews_per_month): {num_rows_with_na_excl}")


                                missing_count  missing_%       data_type
id                                          0       0.00           int64
neighbourhood_group                        29       0.03  string[python]
neighbourhood                              16       0.02  string[python]
lat                                         8       0.01         float64
long                                        8       0.01         float64
instant_bookable                          105       0.10  string[python]
cancellation_policy                        76       0.07  string[python]
room_type                                   0       0.00  string[python]
construction_year                         214       0.21         float64
price                                     247       0.24         float64
service_fee                               273       0.27         float64
minimum_nights                            400       0.39         float64
number_of_reviews                         183      


----------
# rows with NA (excl last_review, reviews_per_month): 2246


Observation:
- There are relatively few NAs (< 0.5%) in all columns except last_review and reviews_per_month.
- **NB. If this cell is re-run after the rows are dropped below, the table will show no NAs outside last_review and reviews_per_month.**

Action:
- Drop all rows with NA, except those in last_review and reviews_per_month.

Justification for dropping rather than filling:
- We have roughly 102,000 rows. Missing values make up < 0.5% of each column, and the 2,246 affected rows are about 2.2% of the data, so dropping them is unlikely to skew results.
- There is no evidence that the missingness is non-random, and the data will only be used for EDA, not complex modelling.

In [50]:
# Drop NA rows in subset of columns.
df = df.dropna(subset = cols_to_na_check)
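
A quick way to confirm how much data a subset drop removes is to compare row counts before and after. The sketch below uses a made-up frame; the column names mirror the notebook but the values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0],
    "room_type": ["private room", "shared room", None, "hotel room"],
    "reviews_per_month": [0.2, 1.5, 0.4, np.nan],
})

# Columns where an NA should cause the row to be dropped.
required = toy.columns.difference(["reviews_per_month"])

before = len(toy)
kept = toy.dropna(subset=required)
dropped = before - len(kept)

# The row whose only NA is in reviews_per_month survives the drop.
print(f"dropped {dropped} of {before} rows ({dropped / before:.1%})")
```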

In [51]:
# Convert columns that are categories to category data type
for col in category_type_cols:
    df[col] = df[col].astype("category")

print(df.dtypes)

id                                         int64
neighbourhood_group                     category
neighbourhood                     string[python]
lat                                      float64
long                                     float64
instant_bookable                        category
cancellation_policy                     category
room_type                               category
construction_year                        float64
price                                    float64
service_fee                              float64
minimum_nights                           float64
number_of_reviews                        float64
last_review                       datetime64[ns]
reviews_per_month                        float64
review_rate_number                      category
calculated_host_listings_count           float64
availability_365                         float64
dtype: object


In [52]:
# Check for outliers.
df.describe().round(2)


Unnamed: 0,id,lat,long,construction_year,price,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
count,99812.0,99812.0,99812.0,99812.0,99812.0,99812.0,99812.0,99812.0,84541,84539.0,99812.0,99812.0
mean,29309473.89,40.73,-73.95,2012.49,625.43,125.09,8.08,27.32,2019-06-10 23:21:42.567984128,1.38,7.96,140.71
min,1001254.0,40.5,-74.25,2003.0,50.0,10.0,-1223.0,0.0,2012-07-11 00:00:00,0.01,1.0,-10.0
25%,15240596.75,40.69,-73.98,2007.0,340.0,68.0,2.0,1.0,2018-10-27 00:00:00,0.22,1.0,3.0
50%,29392041.5,40.72,-73.95,2012.0,625.0,125.0,3.0,7.0,2019-06-13 00:00:00,0.74,1.0,95.0
75%,43376692.0,40.76,-73.93,2017.0,913.0,183.0,5.0,30.0,2019-07-05 00:00:00,2.01,2.0,268.0
max,57360237.0,40.92,-73.71,2022.0,1200.0,240.0,5645.0,1024.0,2058-06-16 00:00:00,90.0,332.0,3677.0
std,16224445.07,0.06,0.05,5.76,331.7,66.34,28.73,49.14,,1.75,32.35,135.37


Observations:
- Columns with negative mins - minimum_nights, availability_365
- Columns with high max - minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count, availability_365

Other logic to check:
- service_fee as a % of price
- If number_of_reviews is 0 there should be no last_review, and reviews_per_month should be 0 (assuming "review" means comments, not just star rating)
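
The `describe()` output also shows a last_review maximum in 2058, which cannot be a real review date. The notebook does not act on it, but a possible check is sketched below on made-up dates. The cutoff is hypothetical and would need to be replaced with the date the data was actually collected.

```python
import pandas as pd

# Made-up review dates, including one in the future and one missing.
last_review = pd.Series(pd.to_datetime(["2019-06-13", "2058-06-16", None, "2022-05-21"]))

# Hypothetical cutoff: replace with the real collection date of the dataset.
cutoff = pd.Timestamp("2022-12-31")

# NaT compares as False, so missing dates are not flagged.
future_mask = last_review > cutoff
n_future = int(future_mask.sum())

print(n_future)
```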

In [53]:
# Check description of negative values for minimum_nights and availability_365 columns.
print("Descriptive stats for minimum_nights rows < 0")
print("----------")
print(df.loc[df["minimum_nights"] < 0, "minimum_nights"].describe().round(2))
print("\n")
print("Descriptive stats for availability_365 rows < 0")
print("----------")
print(df.loc[df["availability_365"] < 0, "availability_365"].describe().round(2))

Descriptive stats for minimum_nights rows < 0
----------
count      12.00
mean     -164.25
std       351.80
min     -1223.00
25%      -143.75
50%       -10.00
75%        -8.25
max        -1.00
Name: minimum_nights, dtype: float64


Descriptive stats for availability_365 rows < 0
----------
count    413.00
mean      -5.46
std        2.87
min      -10.00
25%       -8.00
50%       -5.00
75%       -3.00
max       -1.00
Name: availability_365, dtype: float64


Observations:
- minimum_nights has very few negative values (12 rows).
- availability_365 has more but they are all close to 0 (413 rows, min is -10)

Actions:
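- Clip each column's lower bound at zero.

Justification:
- Negative values are not meaningful for either column, and the affected rows are either very few (minimum_nights) or close to zero (availability_365), so clipping changes very little.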

In [54]:
# Clip minimum_nights and availability_365 lower bound to 0.
df["minimum_nights"] = df["minimum_nights"].clip(lower = 0)
df["availability_365"] = df["availability_365"].clip(lower = 0)

# Check count of negative values.
print(f"# minimum_night rows < 0: {(df['minimum_nights'] < 0).sum()}")
print(f"# availability_365 rows < 0: {(df['availability_365'] < 0).sum()}")

# minimum_night rows < 0: 0
# availability_365 rows < 0: 0


In [55]:
# Check descriptive stats of columns with high max values.
high_max_cols = ["minimum_nights", "number_of_reviews", "reviews_per_month", "calculated_host_listings_count", "availability_365"]
df[high_max_cols].describe().round(2)

Unnamed: 0,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
count,99812.0,99812.0,84539.0,99812.0,99812.0
mean,8.1,27.32,1.38,7.96,140.73
std,28.43,49.14,1.75,32.35,135.34
min,0.0,0.0,0.01,1.0,0.0
25%,2.0,1.0,0.22,1.0,3.0
50%,3.0,7.0,0.74,1.0,95.0
75%,5.0,30.0,2.01,2.0,268.0
max,5645.0,1024.0,90.0,332.0,3677.0


The approach for high max values will vary:

minimum_nights:
- This is probably a user-entered value. Extremely high values are likely meant as indicators rather than literal values.
- Action: Clip at 365. This is high enough to act as a categorical indicator if we want to inspect the data later.

number_of_reviews:
- This is likely counted by the Airbnb system. Although high, 1,024 as the max value could be valid.
- Action: None.

reviews_per_month:
- This is likely counted by the Airbnb system. It seems unlikely that a property could receive more than one review per day.
- Action: Clip at 30, an assumed max number of reviews per month.

calculated_host_listings_count:
- This is likely counted by the Airbnb system. Although high, 332 as the max could be valid.
- Action: None.

availability_365:
- This is probably a user-entered value. Max cannot be higher than 365.
- Action: Clip at 365.

In [56]:
# Clip to upper bounds as per above.
df["minimum_nights"] = df["minimum_nights"].clip(upper = 365)
df["reviews_per_month"] = df["reviews_per_month"].clip(upper = 30)
df["availability_365"] = df["availability_365"].clip(upper = 365)
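
To see how many values a clip actually alters, the clipped Series can be compared with the original. A minimal sketch on made-up availability values:

```python
import pandas as pd

# Made-up values spanning the valid range plus one entry below and one above it.
availability = pd.Series([-10, 0, 95, 365, 3677])

clipped = availability.clip(lower=0, upper=365)

# Rows where clipping changed the value, i.e. the entries that were out of range.
n_changed = int((clipped != availability).sum())

print(clipped.tolist(), n_changed)
```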

In [57]:
# Sense check service fee as % of price
service_fee_pct = df["service_fee"]/df["price"]*100
service_fee_pct.describe()

count    99812.000000
mean        19.999347
std          0.114904
min         19.230769
25%         19.960861
50%         20.000000
75%         20.038911
max         20.754717
dtype: float64

Observations:
- All service fees are between 19.2% and 20.8% of the price.

Actions:
- None. This all seems in order.
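
The narrow spread is consistent with a fee of 20% of the price rounded to whole dollars, although the dataset does not confirm this. If the check needed to flag individual rows rather than summarise the ratio, it could look like the sketch below. The data is made up and the tolerance of one percentage point is an assumption.

```python
import pandas as pd

# Made-up prices and fees; the last row has a deliberately odd fee.
toy = pd.DataFrame({
    "price": [966.0, 142.0, 620.0, 100.0],
    "service_fee": [193.0, 28.0, 124.0, 35.0],
})

fee_pct = toy["service_fee"] / toy["price"] * 100

# Assumed tolerance: within 1 percentage point of 20%.
outside = toy[(fee_pct - 20).abs() > 1]

print(outside)
```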

In [58]:
# Sense check properties with zero reviews
print(f"# properties with zero reviews: {(df['number_of_reviews'] == 0).sum()}")
print(f"# missing rows for reviews_per_month: {df['reviews_per_month'].isna().sum()}")
print(f"# missing rows for last_review: {df['last_review'].isna().sum()}")

# properties with zero reviews: 15259
# missing rows for reviews_per_month: 15273
# missing rows for last_review: 15271


Observations:
- Given the similar numbers, it's possible the zero review rows correspond to the missing reviews_per_month and last_review data.

Actions:
- Check whether rows with zero number_of_reviews match the missing data in the reviews_per_month and last_review columns.
- If so, fill missing reviews_per_month with zero.

NB. last_review was correctly converted to datetime earlier. The only valid missing value here is NaT. I will still check that the majority of these are NA because number_of_reviews is zero, but I am comfortable leaving them as NaT.

In [59]:
# Create a boolean mask for rows with zero reviews.
zero_reviews = df['number_of_reviews'] == 0

# Filter the whole df to only show rows with zero reviews.
zero_review_rows = df[zero_reviews]

# Focus on just the reviews_per_month and last_review column.
zero_review_rows_target_cols = zero_review_rows[["reviews_per_month", "last_review"]]

# Create a boolean mask for these columns that flags NA entries.
# Check that every entry in every row and column is True.
# NB. Calling .all() on a DataFrame operates down each column and returns a Series. Calling .all() on a Series returns a single True/False.
print(zero_review_rows_target_cols.isna().all().all())

True


Observations:
- This test returned True which means every row that has zero reviews has NA in the reviews_per_month and last_review columns.

Actions:
- Fill this subset of reviews_per_month NAs with 0

NB. There are some NAs in reviews_per_month and last_review which do not correspond to rows with zero reviews so we have to ensure we only fill the subset.

In [60]:
# Set entries in reviews_per_month on rows with zero reviews to 0
df.loc[zero_reviews, "reviews_per_month"] = 0
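
The check and the targeted fill can be seen end to end on a toy frame. This is a sketch with made-up rows: the intermediate `per_column` Series shows what the first `.all()` returns, and the final assignment leaves the unrelated NaN on the 14-review row untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "number_of_reviews": [0, 9, 0, 14],
    "reviews_per_month": [np.nan, 0.21, np.nan, np.nan],
    "last_review": pd.to_datetime([None, "2021-10-19", None, "2021-04-25"]),
})

zero_reviews = toy["number_of_reviews"] == 0

na_flags = toy.loc[zero_reviews, ["reviews_per_month", "last_review"]].isna()
per_column = na_flags.all()           # Series: one True/False per column
everything = bool(per_column.all())   # single True/False

# Fill only the zero-review rows; the NaN on the 14-review row is left alone.
toy.loc[zero_reviews, "reviews_per_month"] = 0

print(everything)
print(toy["reviews_per_month"].tolist())
```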

In [61]:
# Check remaining number of NAs
df.isna().sum()

id                                    0
neighbourhood_group                   0
neighbourhood                         0
lat                                   0
long                                  0
instant_bookable                      0
cancellation_policy                   0
room_type                             0
construction_year                     0
price                                 0
service_fee                           0
minimum_nights                        0
number_of_reviews                     0
last_review                       15271
reviews_per_month                    14
review_rate_number                    0
calculated_host_listings_count        0
availability_365                      0
dtype: int64

In [62]:
# It's a very small number, but let's look at the 14 rows with NA for reviews_per_month
reviews_per_month_mask = df["reviews_per_month"].isna()
df.loc[reviews_per_month_mask]

Unnamed: 0,id,neighbourhood_group,neighbourhood,lat,long,instant_bookable,cancellation_policy,room_type,construction_year,price,service_fee,minimum_nights,number_of_reviews,last_review,reviews_per_month,review_rate_number,calculated_host_listings_count,availability_365
164,1091913,brooklyn,dumbo,40.70207,-73.98571,True,strict,private room,2016.0,385.0,77.0,47.0,14.0,2021-04-25,,2.0,1.0,190.0
165,1092466,manhattan,upper east side,40.76123,-73.9642,False,flexible,entire home/apt,2021.0,950.0,190.0,81.0,4.0,2016-09-23,,1.0,1.0,335.0
166,1093018,brooklyn,gowanus,40.66858,-73.99083,False,strict,entire home/apt,2021.0,374.0,75.0,144.0,80.0,2019-07-06,,2.0,1.0,52.0
167,1093570,manhattan,harlem,40.82704,-73.94907,False,flexible,entire home/apt,2011.0,375.0,75.0,365.0,2.0,2015-11-02,,4.0,1.0,70.0
214,1119528,brooklyn,williamsburg,40.70979,-73.95162,False,strict,private room,2017.0,131.0,26.0,4.0,82.0,2019-06-10,,4.0,2.0,361.0
253,1141068,manhattan,east village,40.72477,-73.98161,True,strict,entire home/apt,2014.0,1067.0,213.0,3.0,185.0,2019-05-24,,4.0,2.0,248.0
254,1141620,manhattan,chelsea,40.74238,-73.99567,False,strict,entire home/apt,2008.0,794.0,159.0,9.0,62.0,2019-06-21,,2.0,2.0,80.0
68560,38867024,manhattan,hell's kitchen,40.76366,-73.99096,False,strict,entire home/apt,2003.0,394.0,79.0,2.0,65.0,NaT,,4.0,1.0,59.0
68561,38867576,manhattan,east village,40.7263,-73.98432,True,moderate,private room,2016.0,471.0,94.0,1.0,2.0,NaT,,3.0,2.0,0.0
91430,51498125,brooklyn,bath beach,40.60382,-74.01083,True,flexible,private room,2010.0,343.0,69.0,4.0,17.0,2019-06-07,,1.0,2.0,17.0


Observations:
- There is no obvious reason why these rows have NaN.
- They have a varied number of reviews and last_review dates.

Actions:
- Set these NaNs to zero.

Justification:
- Small number of rows, unlikely to impact analysis

In [63]:
# Set remaining reviews_per_month NAs to zero.
df.loc[reviews_per_month_mask, "reviews_per_month"] = 0

We have:
- Imported the data and created a working copy.
- Formatted column names.
- Removed unnecessary columns.
- Dropped duplicate rows.
- Fixed formatting for dollar amounts.
- Converted columns to the correct data type.
- Fixed spelling mistakes.
- Dropped or filled NAs.
- Conducted sense checks, corrected invalid entries and handled outliers.

We are ready to export for further EDA.

NB. Exporting to CSV will not preserve data types, so they need to be reapplied when the file is read back in.

In [64]:
df_cleaned = df.copy()

output_path = "../data/airbnb_open_data_cleaned.csv"
df_cleaned.to_csv(output_path, index = False)
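
Because CSV stores everything as text, category and datetime columns have to be declared again on import. The sketch below round-trips a made-up frame through an in-memory buffer instead of a real file and restores the types with `dtype` and `parse_dates`. Binary formats such as pickle or Parquet are an alternative if keeping the types matters more than having a plain-text file.

```python
import io
import pandas as pd

# Made-up frame with a category, a datetime and a float column.
cleaned = pd.DataFrame({
    "room_type": pd.Categorical(["private room", "entire home/apt"]),
    "last_review": pd.to_datetime(["2021-10-19", None]),
    "price": [966.0, 142.0],
})

# Round-trip through an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Declare the non-default types explicitly when reading back.
restored = pd.read_csv(
    buffer,
    dtype={"room_type": "category"},
    parse_dates=["last_review"],
)

print(restored.dtypes)
```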