<h1><center>System Investigation Part 4 - Airbnb Data Cleaning</center></h1>
<center>Madalyn Li</center>
<center>DATA 514 Spring 2022</center>

This notebook documents the steps taken to produce the cleaned, final versions of the Airbnb datasets used for the analysis and queries in Part 4 of the System Investigation Project. Cleaning and condensing the datasets is necessary because the queries do not require every column in the original files, and dropping the unused columns reduces both processing time and storage space.

The Airbnb data comes from [Inside Airbnb](http://insideairbnb.com/get-the-data/), which publishes data files on Airbnb listings, calendar bookings, reviews, and neighborhoods for various cities and countries. For this project, we were asked to focus on the following US cities: Los Angeles, Portland, Salem, and San Diego.


The three questions/queries that I selected to answer for this project are:
1. Display the list of stays in Portland, OR that are available for the next two days, with these details: name, neighbourhood, room type, number of guests accommodated, property type, amenities, and price per night, in descending order of rating.
2. Are there any neighbourhoods in any of the cities that don’t have any listings?
3. For each city, how many reviews are received for December of each year?

For the queries listed above, I only clean two of the data files for each city: Listings and Reviews. The process and steps for cleaning these files are described below. The Neighborhoods and Calendar files needed little updating or alteration, so for simplicity they are kept exactly as downloaded from the site.

In [1]:
import pandas as pd

<b>Airbnb Listings Dataset</b>

This section includes the code used to clean and condense the Listings data for all cities. For Los Angeles, Salem, and San Diego, we only need the columns 'id' and 'neighbourhood_cleansed', which query 2 uses to determine the distinct neighborhoods in each city. Portland requires more columns to answer query 1. Additionally, some string manipulation is applied to the amenities column in the Portland dataset: the list brackets are removed and the commas are replaced with '&', because when Spark reads in the data, it mistakes the comma-separated list inside amenities for separate column values of the table.
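The amenities clean-up described above can be tried on a single value before running it on the full file. This is a minimal sketch on a made-up sample row (the `sample` frame and its contents are assumptions for illustration, not data from Inside Airbnb); it applies the same three replacements used for Portland and passes `regex=False` so that the brackets are treated as literal characters.

```python
import pandas as pd

# Hypothetical sample row; real amenities values come from the Portland listings file.
sample = pd.DataFrame({'id': [1], 'amenities': ['["Wifi", "Kitchen", "Heating"]']})

# Remove the list brackets, then replace commas with '&' so that a
# comma-delimited reader no longer sees the list as separate fields.
cleaned = (sample['amenities']
           .str.replace('[', '', regex=False)
           .str.replace(']', '', regex=False)
           .str.replace(',', '&', regex=False))

print(cleaned.iloc[0])  # "Wifi"& "Kitchen"& "Heating"
```

The quotes around each amenity are left in place, matching what the Portland cell does.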

In [2]:
# Read the original LA listings csv datafile downloaded from the Inside Airbnb site
LA_listings = pd.read_csv('/Users/madalynli/Desktop/DATA_514/AirbnbData/LA_listings.csv')

# Keep only the columns needed for the LA listings data
LA_listings = LA_listings[['id', 'neighbourhood_cleansed']]

# Export updated LA listings data to a new csv file
LA_listings.to_csv('/Users/madalynli/Desktop/Airbnb_DataFiles/LA_listings.csv')

In [3]:
# Read the original Portland listings csv datafile downloaded from the Inside Airbnb site
Portland_listings = pd.read_csv('/Users/madalynli/Desktop/DATA_514/AirbnbData/Portland_listings.csv')

# Keep only the columns needed for the Portland listings data
Portland_listings = Portland_listings[['id', 'name', 'neighbourhood_cleansed', 'property_type', 'room_type', 
                                       'accommodates', 'amenities', 'price', 'has_availability', 'availability_30', 
                                       'review_scores_rating']]

# Strip the list brackets from the Portland amenities and replace commas with '&'.
# regex=False treats '[' and ']' as literal characters rather than regex patterns.
Portland_listings['amenities'] = Portland_listings['amenities'].str.replace('[', '', regex=False)
Portland_listings['amenities'] = Portland_listings['amenities'].str.replace(']', '', regex=False)
Portland_listings['amenities'] = Portland_listings['amenities'].str.replace(',', '&', regex=False)

# Export updated Portland listings data to a new csv file
Portland_listings.to_csv('/Users/madalynli/Desktop/Airbnb_DataFiles/Portland_listings.csv')

In [4]:
# Read the original Salem listings csv datafile downloaded from the Inside Airbnb site
Salem_listings = pd.read_csv('/Users/madalynli/Desktop/DATA_514/AirbnbData/Salem_listings.csv')

# Keep only the columns needed for the Salem listings data
Salem_listings = Salem_listings[['id', 'neighbourhood_cleansed']]

# Export updated Salem listings data to a new csv file
Salem_listings.to_csv('/Users/madalynli/Desktop/Airbnb_DataFiles/Salem_listings.csv')

In [5]:
# Read the original SD listings csv datafile downloaded from the Inside Airbnb site
SD_listings = pd.read_csv('/Users/madalynli/Desktop/DATA_514/AirbnbData/SD_listings.csv')

# Keep only the columns needed for the SD listings data
SD_listings = SD_listings[['id', 'neighbourhood_cleansed']]

# Export updated SD listings data to a new csv file
SD_listings.to_csv('/Users/madalynli/Desktop/Airbnb_DataFiles/SD_listings.csv')
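The Los Angeles, Salem, and San Diego cells above repeat the same read, select, and export pattern. As a sketch only, that pattern could be wrapped in a small helper; `condense_csv` is a hypothetical name that does not appear in the original notebook, and the demo below runs on a throwaway file rather than the real downloads. Passing `usecols` to `pd.read_csv` means only the needed columns are loaded in the first place.

```python
import os
import tempfile

import pandas as pd


def condense_csv(src_path, dst_path, columns):
    """Hypothetical helper: read only `columns` from src_path and export them to dst_path."""
    df = pd.read_csv(src_path, usecols=columns)
    # usecols keeps the file's column order, so reorder to match `columns` explicitly.
    df = df[columns]
    # Like the cells above, this also writes the DataFrame index as the first column.
    df.to_csv(dst_path)
    return df


# Demo on a temporary file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'raw_listings.csv')
    dst = os.path.join(tmp, 'listings.csv')
    pd.DataFrame({
        'id': [10, 11],
        'name': ['a', 'b'],
        'neighbourhood_cleansed': ['X', 'Y'],
    }).to_csv(src, index=False)

    result = condense_csv(src, dst, ['id', 'neighbourhood_cleansed'])
    exported = pd.read_csv(dst)

print(result)
```

For the real data, the helper would be called once per city with the same source and destination paths used in the cells above.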

<b>Airbnb Reviews Dataset</b>

This section includes the code used to clean and condense the Reviews data for all cities. For every city, we keep only the columns listing_id, id, and date, which are the data required to answer query 3.

In [6]:
# Read the original LA reviews csv datafile downloaded from the Inside Airbnb site
LA_reviews = pd.read_csv('/Users/madalynli/Desktop/DATA_514/AirbnbData/LA_reviews.csv')

# Keep only the columns needed for the LA reviews data
LA_reviews = LA_reviews[['listing_id', 'id', 'date']]

# Export updated LA reviews data to a new csv file
LA_reviews.to_csv('/Users/madalynli/Desktop/Airbnb_DataFiles/LA_reviews.csv')

In [7]:
# Read the original Portland reviews csv datafile downloaded from the Inside Airbnb site
Portland_reviews = pd.read_csv('/Users/madalynli/Desktop/DATA_514/AirbnbData/Portland_reviews.csv')

# Keep only the columns needed for the Portland reviews data
Portland_reviews = Portland_reviews[['listing_id', 'id', 'date']]

# Export updated Portland reviews data to a new csv file
Portland_reviews.to_csv('/Users/madalynli/Desktop/Airbnb_DataFiles/Portland_reviews.csv')

In [8]:
# Read the original Salem reviews csv datafile downloaded from the Inside Airbnb site
Salem_reviews = pd.read_csv('/Users/madalynli/Desktop/DATA_514/AirbnbData/Salem_reviews.csv')

# Keep only the columns needed for the Salem reviews data
Salem_reviews = Salem_reviews[['listing_id', 'id', 'date']]

# Export updated Salem reviews data to a new csv file
Salem_reviews.to_csv('/Users/madalynli/Desktop/Airbnb_DataFiles/Salem_reviews.csv')

In [9]:
# Read the original SD reviews csv datafile downloaded from the Inside Airbnb site
SD_reviews = pd.read_csv('/Users/madalynli/Desktop/DATA_514/AirbnbData/SD_reviews.csv')

# Keep only the columns needed for the SD reviews data
SD_reviews = SD_reviews[['listing_id', 'id', 'date']]

# Export updated SD reviews data to a new csv file
SD_reviews.to_csv('/Users/madalynli/Desktop/Airbnb_DataFiles/SD_reviews.csv')
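Before handing the condensed reviews files to the query in Part 4, it can be useful to confirm that the date column parses and to preview what query 3 should return. The following is a rough pandas sketch on made-up rows, assuming the dates are stored as `YYYY-MM-DD` strings; it is a sanity check only and not the query used for the project.

```python
import pandas as pd

# Hypothetical condensed reviews with the three retained columns.
reviews = pd.DataFrame({
    'listing_id': [1, 1, 2, 2],
    'id': [100, 101, 102, 103],
    'date': ['2019-12-05', '2020-06-10', '2020-12-24', '2020-12-31'],
})

# Parse the date strings so that month and year can be extracted.
reviews['date'] = pd.to_datetime(reviews['date'])

# Keep December reviews and count them per year.
december = reviews[reviews['date'].dt.month == 12]
december_counts = december.groupby(december['date'].dt.year).size()

print(december_counts)
```

Running the same steps on one city's exported file gives a per-year December count that the Part 4 result for that city can be compared against.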