# Asheville: Data Cleaning 

In [1]:
# ! pip install langdetect

In [2]:
import pandas as pd
import ast
from langdetect import detect

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)

In [3]:
listing = pd.read_csv('./data/Asheville_listings.csv')
review = pd.read_csv('./data/Asheville_reviews.csv')

In [4]:
print(f'Shape for Asheville Listings CSV: {listing.shape}')
print(f'Shape for Asheville Reviews CSV: {review.shape}')

Shape for Asheville Listings CSV: (2246, 74)
Shape for Asheville Reviews CSV: (181684, 6)


## Cleaning Review CSV file

In [5]:
review.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,108061,553741,2011-09-21,822907,Pedro & Katie,"Lisa is superb hostess, she will treat you lik..."
1,108061,683278,2011-11-01,236064,Tim,This was a lovely little place walking distanc...
2,108061,714889,2011-11-13,1382707,Shane,"Lisa was very nice to work with. However, we ..."
3,108061,1766157,2012-07-21,416731,Brenda,I feel very lucky to have found this beautiful...
4,108061,2033065,2012-08-19,1858880,Lindsey,"Great roomy little apartment, beautiful privat..."


Dropping unnecessary columns

In [6]:
review.drop(columns = ['id',
                       'reviewer_id',
                       'reviewer_name'],
            inplace = True)

Removing all reviews that are not from 2018-2019

In [7]:
review = review[(review['date'] >= '2018-01-01') & (review['date'] <= '2019-12-31')]
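
The string comparison above works because the dates are stored in ISO `YYYY-MM-DD` format, which sorts the same way as text and as dates. As a minimal sketch on made-up rows (the `sample` frame and its values are assumptions, not data from the reviews file), the same filter could be written against parsed timestamps:

```python
import pandas as pd

# Hypothetical rows that mimic the 'date' and 'comments' columns.
sample = pd.DataFrame({
    "date": ["2017-12-31", "2018-01-01", "2019-12-31", "2020-01-01"],
    "comments": ["too early", "first day", "last day", "too late"],
})

# Parse the ISO strings into timestamps, then keep 2018-2019 (inclusive).
parsed = pd.to_datetime(sample["date"], format="%Y-%m-%d")
in_range = parsed.between(pd.Timestamp("2018-01-01"), pd.Timestamp("2019-12-31"))
filtered = sample[in_range]

print(filtered["comments"].tolist())
```

Parsing first is mostly a safeguard: it would fail loudly if a date were ever stored in a different format, whereas a plain string comparison would silently misfilter it.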

In [8]:
review.shape

(86722, 3)

Dropping all rows with null values

In [9]:
review.isnull().sum()

listing_id     0
date           0
comments      33
dtype: int64

In [10]:
review = review.dropna()

After doing some EDA, these are the issues I noticed that needed to be addressed.

Remove very short reviews. The filter keeps comments containing more than four spaces, which corresponds to roughly six or more words.

In [12]:
review = review[review['comments'].str.count(' ') > 4]
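
Counting spaces is only a proxy for counting words. The sketch below, on a few invented comments, compares the space-based filter with a whitespace-split word count; the sample strings and the `>= 5` word threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical comments; the last one has two words separated by five spaces.
comments = pd.Series([
    "Great place",
    "We loved the cozy cabin",
    "We really loved the cozy cabin",
    "Great" + " " * 5 + "stay",
])

space_counts = comments.str.count(" ")         # what the cell above measures
word_counts = comments.str.split().str.len()   # splits on any run of whitespace

kept_by_spaces = comments[space_counts > 4]
kept_by_words = comments[word_counts >= 5]

print(pd.DataFrame({"spaces": space_counts, "words": word_counts}))
```

The two filters disagree on the five-word comment (four spaces) and on the padded two-word comment (five spaces), so the choice of proxy affects which reviews survive.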

Remove all reviews that are not in English.

In [13]:
review = review[review['comments'].apply(detect) == 'en']
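
As far as I know, `langdetect.detect` raises an exception when a text has no usable features (for example, only digits or punctuation), and its results can vary between runs unless `DetectorFactory.seed` is set beforehand. A hedged sketch of a wrapper that maps failures to `None` is shown below. To keep it runnable without the library, `toy_detector` is a made-up stand-in for `detect`, and `safe_detect` is a hypothetical helper name.

```python
import pandas as pd

def safe_detect(text, detector):
    """Return the detector's language code, or None if detection fails."""
    try:
        return detector(text)
    except Exception:  # with langdetect, one would catch its own exception type
        return None

def toy_detector(text):
    # Stand-in for langdetect.detect: fails on text without letters.
    if not any(ch.isalpha() for ch in text):
        raise ValueError("No features in text.")
    return "es" if "muy" in text.lower() else "en"

comments = pd.Series(["Lovely place to stay", "Muy bonito y limpio", "12345 !!!"])
languages = comments.apply(lambda text: safe_detect(text, toy_detector))

# Rows where detection failed compare unequal to 'en' and are dropped too.
english_only = comments[languages == "en"]
print(english_only.tolist())
```

In the notebook itself, `detect` would be passed in place of `toy_detector`.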

Remove '\n' since this is just an indicator for a line break.

In [14]:
review['comments'] = review['comments'].str.replace('\n', '')

Removing any numbers from the comments.

In [15]:
review['comments'] = review['comments'].replace(r'\d+', '', regex=True)
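
Deleting a line break outright joins the words on either side of it, and deleting digits can leave doubled spaces. The following is a sketch of one possible variant, not the notebook's method: it replaces line breaks with a space and collapses repeated whitespace afterwards. The sample comments are invented.

```python
import pandas as pd

# Hypothetical comments containing a line break and numbers.
comments = pd.Series([
    "Great stay!\nWould book again in 2020.",
    "Room 12 was clean",
])

cleaned = (
    comments
    .str.replace("\n", " ", regex=False)   # line break -> single space
    .str.replace(r"\d+", "", regex=True)   # drop runs of digits
    .str.replace(r"\s+", " ", regex=True)  # collapse leftover whitespace
    .str.strip()
)

print(cleaned.tolist())
```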

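Removing reviews that contain any non-ASCII characters. This catches reviews in non-Latin scripts (for example, Asian languages) that the language detection missed, but it also drops reviews containing accented letters, emoji, or curly quotes.

In [16]:
review = review[~review['comments'].str.contains(r'[^\x00-\x7F]')]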
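
The pattern `[^\x00-\x7F]` matches any character outside the ASCII range, not only characters from Asian scripts. A small sketch with invented comments shows which kinds of text it flags:

```python
import pandas as pd

# Hypothetical comments: plain ASCII, an accented letter,
# a curly apostrophe, and a non-Latin script.
comments = pd.Series([
    "Nice cabin",
    "Café nearby was great",
    "The host’s tips helped",
    "很好的地方",
])

has_non_ascii = comments.str.contains(r"[^\x00-\x7F]", regex=True)
print(has_non_ascii.tolist())
```

Only the first comment passes, so English reviews with a single accented word or typographic quote would be removed by this filter as well.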

Removing rows where the comments were generated by Airbnb because the host cancelled a booked reservation.

In [17]:
review = review[~review['comments'].str.contains('This is an automated posting')]

## Cleaning Listing CSV file

In [18]:
listing.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,108061,https://www.airbnb.com/rooms/108061,20210217171651,2021-02-18,Walk to stores/parks/downtown. Fenced yard/Pet...,Walk to town in ten minutes! Monthly rental in...,"I love my neighborhood! Its friendly, easy-goi...",https://a0.muscache.com/pictures/41011975/0cdf...,320564,https://www.airbnb.com/users/show/320564,Lisa,2010-12-16,"Asheville, North Carolina, United States",I am a long time resident of Asheville and am ...,within a few hours,100%,25%,f,https://a0.muscache.com/im/users/320564/profil...,https://a0.muscache.com/im/users/320564/profil...,,2,2,"['email', 'phone', 'facebook', 'reviews', 'off...",t,t,"Asheville, North Carolina, United States",28801,,35.6067,-82.55563,Entire apartment,Entire home/apt,2,,1 bath,1.0,1.0,"[""Free parking on premises"", ""Essentials"", ""Co...",$82.00,30,365,30,30,1125,1125,30.0,1125.0,,t,0,0,0,242,2021-02-18,89,0,0,2011-09-21,2019-11-30,90.0,9.0,9.0,10.0,10.0,10.0,9.0,,f,2,2,0,0,0.78
1,155305,https://www.airbnb.com/rooms/155305,20210217171651,2021-02-18,Cottage! BonPaul + Sharky's Hostel,<b>The space</b><br />Private cottage located ...,"We are within easy walk of pubs, breweries, mu...",https://a0.muscache.com/pictures/8880711/cf38d...,746673,https://www.airbnb.com/users/show/746673,BonPaul,2011-06-26,"Asheville, North Carolina, United States",We operate two traveler's hostels located in H...,within an hour,91%,100%,t,https://a0.muscache.com/im/pictures/user/4dff7...,https://a0.muscache.com/im/pictures/user/4dff7...,,7,7,"['email', 'phone', 'facebook', 'reviews', 'off...",t,t,"Asheville, North Carolina, United States",28806,,35.57864,-82.59578,Entire guesthouse,Entire home/apt,2,,1 bath,1.0,1.0,"[""Oven"", ""Hair dryer"", ""Free parking on premis...",$90.00,1,365,1,1,7,1125,1.0,62.7,,t,26,55,85,310,2021-02-18,289,53,4,2011-07-31,2021-02-13,91.0,10.0,9.0,10.0,10.0,10.0,9.0,,t,7,1,2,4,2.48
2,156805,https://www.airbnb.com/rooms/156805,20210217171651,2021-02-19,"Private Room ""Ader"" at BPS Hostel",<b>The space</b><br />Private Rooms at Bon Pau...,"Easy walk to pubs, cafes, bakery, breweries, l...",https://a0.muscache.com/pictures/23447d55-fa7e...,746673,https://www.airbnb.com/users/show/746673,BonPaul,2011-06-26,"Asheville, North Carolina, United States",We operate two traveler's hostels located in H...,within an hour,91%,100%,t,https://a0.muscache.com/im/pictures/user/4dff7...,https://a0.muscache.com/im/pictures/user/4dff7...,,7,7,"['email', 'phone', 'facebook', 'reviews', 'off...",t,t,"Asheville, North Carolina, United States",28806,,35.57864,-82.59578,Private room in house,Private room,2,,2.5 shared baths,1.0,1.0,"[""Coffee maker"", ""Fire extinguisher"", ""Lock on...",$66.00,1,365,1,1,365,365,1.0,365.0,,t,0,0,0,16,2021-02-19,67,0,0,2011-09-20,2020-01-01,90.0,10.0,9.0,10.0,9.0,10.0,9.0,,t,7,1,2,4,0.58
3,156926,https://www.airbnb.com/rooms/156926,20210217171651,2021-02-18,"Mixed Dorm ""Top Bunk #1"" at BPS Hostel",This is a top bunk in the mixed dorm room<br /...,,https://a0.muscache.com/pictures/98f4e655-c4d6...,746673,https://www.airbnb.com/users/show/746673,BonPaul,2011-06-26,"Asheville, North Carolina, United States",We operate two traveler's hostels located in H...,within an hour,91%,100%,t,https://a0.muscache.com/im/pictures/user/4dff7...,https://a0.muscache.com/im/pictures/user/4dff7...,,7,7,"['email', 'phone', 'facebook', 'reviews', 'off...",t,t,,28806,,35.57864,-82.59578,Shared room in hostel,Shared room,1,,2.5 shared baths,1.0,6.0,"[""Carbon monoxide alarm"", ""Dishes and silverwa...",$31.00,1,365,1,1,1,365,1.0,49.7,,t,0,19,49,281,2021-02-18,282,6,0,2011-09-01,2020-12-31,94.0,10.0,9.0,10.0,9.0,10.0,10.0,,t,7,1,2,4,2.45
4,160594,https://www.airbnb.com/rooms/160594,20210217171651,2021-02-17,Historic Grove Park,Come enjoy the beautiful Grove Park neighborho...,,https://a0.muscache.com/pictures/92433837/d340...,769252,https://www.airbnb.com/users/show/769252,Elizabeth,2011-07-02,"Asheville, North Carolina, United States",We love having guests and sharing a piece of h...,,,,f,https://a0.muscache.com/im/users/769252/profil...,https://a0.muscache.com/im/users/769252/profil...,,1,1,"['email', 'phone', 'reviews', 'kba', 'work_ema...",t,f,,28801,,35.61442,-82.54127,Private room in house,Private room,2,,1 bath,1.0,1.0,"[""Lock on bedroom door"", ""Free parking on prem...",$125.00,30,1125,30,30,1125,1125,30.0,1125.0,,f,0,0,0,0,2021-02-17,58,0,0,2011-08-07,2015-10-19,99.0,10.0,10.0,10.0,10.0,10.0,10.0,,f,1,0,1,0,0.5


In [19]:
listing.isnull().sum().sum()

15073

In [20]:
listing.isnull().sum()

id                                                 0
listing_url                                        0
scrape_id                                          0
last_scraped                                       0
name                                               0
description                                        2
neighborhood_overview                            452
picture_url                                        0
host_id                                            0
host_url                                           0
host_name                                          0
host_since                                         0
host_location                                      2
host_about                                       732
host_response_time                               216
host_response_rate                               216
host_acceptance_rate                              89
host_is_superhost                                  0
host_thumbnail_url                            

### Dropping Columns

Columns with no relevant information

In [21]:
listing.drop(columns = ['last_scraped',                         
                        'license',
                        'host_id',
                        'scrape_id',                            
                        'listing_url',                          
                        'picture_url',                          
                        'host_url',                             
                        'host_thumbnail_url',                   
                        'host_picture_url',                     
                        'host_name',                            
                        'host_verifications',
                        'calendar_last_scraped',
                        'host_neighbourhood',
                        'host_location',
                        'host_response_rate',
                        'availability_30',
                        'availability_60',
                        'availability_90',
                        'availability_365',
                        'number_of_reviews_ltm',
                        'number_of_reviews_l30d',
                        'calculated_host_listings_count',
                        'calculated_host_listings_count_entire_homes',
                        'calculated_host_listings_count_private_rooms',
                        'calculated_host_listings_count_shared_rooms'],
            inplace = True)

Columns in which every value is null

In [22]:
listing.drop(columns = ['calendar_updated',                     
                        'neighbourhood_group_cleansed',         
                        'bathrooms'],
            inplace = True)
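
Columns that are entirely null can also be found programmatically rather than by reading the null counts. A minimal sketch on a made-up frame (the `sample` data is an assumption; only the column names echo the listings file):

```python
import pandas as pd

nan = float("nan")

# Hypothetical frame: 'bathrooms' is entirely null, 'bedrooms' only partly.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "bathrooms": [nan, nan, nan],
    "bedrooms": [1.0, nan, 2.0],
})

# isnull().all() is True only for columns where every row is null.
all_null_cols = sample.columns[sample.isnull().all()].tolist()
trimmed = sample.drop(columns=all_null_cols)

print(all_null_cols)
```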

Columns that repeat values from other columns

In [23]:
listing.drop(columns = ['minimum_minimum_nights',               
                        'maximum_minimum_nights',               
                        'minimum_maximum_nights',              
                        'maximum_maximum_nights',               
                        'minimum_nights_avg_ntm',               
                        'maximum_nights_avg_ntm',               
                        'neighbourhood',
                        'host_total_listings_count',
                        'beds',
                        'room_type'],                       
            inplace = True)

Columns in which almost all values are the same

In [24]:
listing.drop(columns = ['host_has_profile_pic',
                        'has_availability'], 
            inplace = True)
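
One way to quantify "almost all values are the same" is the share of rows taken by each column's most frequent value. The sketch below uses invented data, and the 0.95 cutoff is an arbitrary assumption rather than a threshold from this notebook.

```python
import pandas as pd

# Hypothetical frame: one near-constant column, one evenly split column.
sample = pd.DataFrame({
    "host_has_profile_pic": ["t"] * 99 + ["f"],
    "instant_bookable": ["t"] * 50 + ["f"] * 50,
})

# Share of rows held by the most frequent value in each column.
top_share = sample.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Flag columns dominated by a single value (cutoff chosen for illustration).
near_constant = top_share[top_share > 0.95].index.tolist()
print(top_share)
```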

Dropping columns due to multicollinearity

In [25]:
listing.drop(columns = ['host_identity_verified',
                        'host_is_superhost'],
             inplace = True)

### Dropping Nulls

The rows with no description are also missing data for many other columns.

In [26]:
listing.dropna(subset=['description'], inplace = True)

Since these rows have no values for these columns, it is implied that they had no reviews.

In [27]:
listing.dropna(subset=['first_review',
                       'last_review',
                       'review_scores_rating',
                       'review_scores_accuracy',
                       'review_scores_cleanliness',
                       'review_scores_checkin',
                       'review_scores_communication',
                       'review_scores_location',
                       'review_scores_value'],
               inplace = True)

### Imputing Nulls

Filling the nulls with 'No Content' rather than dropping the rows, because there are 452 nulls for 'neighborhood_overview' and 732 nulls for 'host_about'.

In [28]:
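# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the frame.
listing['neighborhood_overview'] = listing['neighborhood_overview'].fillna('No Content')
listing['host_about'] = listing['host_about'].fillna('No Content')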

Filling the null values for 'host_response_time' with 'within a few hours' as a moderate default; note that the most frequent value is 'within an hour'.

In [29]:
listing['host_response_time'].value_counts()

within an hour        1562
within a few hours     197
within a day            86
a few days or more      11
Name: host_response_time, dtype: int64

In [30]:
listing['host_response_time'] = listing['host_response_time'].fillna('within a few hours')

Filling the null values for 'host_acceptance_rate' with the mean value. First, converting the values from percentages into floats.

In [31]:
listing['host_acceptance_rate'] = listing['host_acceptance_rate'].str.replace('%', '').astype('float')/100.0

In [32]:
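# Mean of the rates themselves; value_counts().mean() would average the
# frequency counts of each distinct rate instead.
avg_acceptance_rate = listing['host_acceptance_rate'].mean()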

In [33]:
listing['host_acceptance_rate'] = listing['host_acceptance_rate'].fillna(avg_acceptance_rate)
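
The difference between averaging the rates and averaging their frequency counts is easy to see on a tiny made-up series. In this sketch, `rates` and its values are assumptions; the conversion mirrors the percentage-to-float step above.

```python
import pandas as pd

# Hypothetical acceptance rates as percentage strings, with one missing value.
rates = pd.Series(["100%", "50%", None, "100%"])

# Strip the percent sign and rescale to the 0-1 range; None becomes NaN.
as_float = rates.str.replace("%", "", regex=False).astype("float") / 100.0

mean_rate = as_float.mean()                       # average of the rates
mean_of_counts = as_float.value_counts().mean()   # average of how often each rate occurs

filled = as_float.fillna(mean_rate)
print(mean_rate, mean_of_counts)
```

Here the mean rate stays within the 0-1 range, while the mean of the counts does not, so only the former is a sensible fill value for a rate column.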

Filling the null values for 'bedrooms' with 1.0 since that is the most frequent value.

In [34]:
listing['bedrooms'].value_counts()

1.0    1016
2.0     482
3.0     261
4.0     102
5.0      18
6.0      10
7.0       3
8.0       2
Name: bedrooms, dtype: int64

In [35]:
# Fill with the float 1.0 (not the string '1.0') so the column stays numeric.
listing['bedrooms'] = listing['bedrooms'].fillna(1.0)
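
Instead of hard-coding the fill value, the most frequent value can be looked up with `Series.mode()`. A minimal sketch on an invented series (the `bedrooms` values here are assumptions, not the real column):

```python
import pandas as pd

# Hypothetical bedroom counts with one missing value.
bedrooms = pd.Series([1.0, 2.0, None, 1.0, 3.0])

# mode() returns a Series of the most frequent value(s); take the first.
most_common = bedrooms.mode().iloc[0]
filled = bedrooms.fillna(most_common)

print(filled.dtype)
```

Filling with a numeric value keeps the column's float dtype, which matters for any later arithmetic or modelling on it.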

## Save clean dataframes

Verify all nulls are dealt with.

In [36]:
listing.isnull().sum().sum()

0

In [37]:
review.isnull().sum().sum()

0

Checking the final number of rows and columns in each cleaned dataframe.

In [38]:
print(f'Shape for Asheville Listings CSV: {listing.shape}')
print(f'Shape for Asheville Reviews CSV: {review.shape}')

Shape for Asheville Listings CSV: (2038, 32)
Shape for Asheville Reviews CSV: (63073, 3)


Saving finalized dataframes as new CSV files.

In [39]:
listing.to_csv('./data/Asheville_Listings_Clean', index = False)
review.to_csv('./data/Asheville_Reviews_Clean', index = False)