# Introduction 

Airbnb is a technology company in the fast-growing ‘sharing economy’. It acts as an
online broker, allowing anyone to rent out their property or a spare room in their
house to guests through its web and phone apps. The company collects a small commission from each
booking and operates globally, with over $2.6 billion in annual revenue. The goal of this
project is to analyze Airbnb listing data to better understand trends in pricing. Data is
available for many cities, but we focus on the pricing of properties in
Metro Vancouver.

# The Data & Initial Analysis

The raw data comes from http://insideairbnb.com, where it is available as
several different files. After obtaining the data, we will inspect, clean and join it to produce a
data table with the variables we need, and then create plots of the data.

## Getting the data 

The data was obtained from the website and came in the five files below:

- listings.csv.gz: _Detailed Listings data for Vancouver_
- calendar.csv.gz: _Detailed Calendar Data for listings in Vancouver_
- reviews.csv.gz: _Detailed Review Data for listings in Vancouver_
- listings.csv: _Summary information and metrics for listings in Vancouver (good for visualisations)_
- reviews.csv: _Summary Review data and Listing ID (to facilitate time based analytics and visualisations linked to a listing)_


In [1]:
#get relevant packages
import csv
import pandas as pd
import urllib.request
import numpy as np
import os

In [None]:
#Import data from website url for a given month to explore

listing_full = 'http://data.insideairbnb.com/canada/bc/vancouver/2018-04-11/data/listings.csv.gz'
calendar_full = 'http://data.insideairbnb.com/canada/bc/vancouver/2018-04-11/data/calendar.csv.gz'
reviews_full = 'http://data.insideairbnb.com/canada/bc/vancouver/2018-04-11/data/reviews.csv.gz'

listing_summary = 'http://data.insideairbnb.com/canada/bc/vancouver/2018-04-11/data/listings.csv'
calendar_summary = 'http://data.insideairbnb.com/canada/bc/vancouver/2018-04-11/data/calendar.csv'
review_summary = 'http://data.insideairbnb.com/canada/bc/vancouver/2018-04-11/data/reviews.csv'

urls = [listing_full,listing_summary,calendar_full,calendar_summary,reviews_full,review_summary]

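dataframes = []
for i, url in enumerate(urls):
    if i % 2 == 1:
        #odd positions hold the uncompressed summary files
        dataframes.append(pd.read_csv(url, sep=',', header=0))
    else:
        #even positions hold the gzip-compressed full files
        #on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False
        dataframes.append(pd.read_csv(url, compression='gzip', header=0, sep=',', quotechar='"', on_bad_lines='skip'))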

In [None]:
listings = dataframes[0]
calendar = dataframes[2]
reviews = dataframes[4].iloc[:,[0,1,2,3]]

In [None]:
# relevant columns from listings (1-based positions, shifted to 0-based below)
col_idx = np.array([79,1,9,20,26,27,28,29,33,34,35,36,37,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,80,81,82,83,84,85,86,90,91,92,93,94,95,96])
col_idx = col_idx - 1
listings = listings.iloc[:,col_idx]
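
Selecting by position is brittle if the column layout changes between scrapes. As a minimal sketch (the helper name `positions_to_names` and the toy frame are hypothetical and not part of the original notebook), the 1-based positions could be translated into column names once, and the names reused for every file:

```python
import numpy as np
import pandas as pd


def positions_to_names(df, one_based_positions):
    """Return the column names found at the given 1-based positions."""
    zero_based = np.asarray(one_based_positions) - 1
    return list(df.columns[zero_based])


# Toy frame standing in for the listings table (column names are made up).
toy = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "price": ["$10", "$20"]})

wanted = positions_to_names(toy, [3, 1])
subset = toy.loc[:, wanted]  # same result as toy.iloc[:, [2, 0]]
```

Selecting with `.loc` and a list of names raises a `KeyError` if a column is missing, which is usually easier to diagnose than silently picking the wrong column by position.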

In [None]:
calendar.iloc[1:5]

In [None]:
list(dataframes[3].head())

In [None]:
list(dataframes[4].head())

In [None]:
list(dataframes[5].head())

In [None]:
dataframes[5].head()

Judging from the column headers, only the listings.csv.gz files are useful for our purposes. The exploration above covered a single scrape date. The next section defines functions that download the relevant data for a given date to a local directory for later use.

# Functions to get the data 

In [None]:
def download_listing(date):
    
    try:
        url = 'http://data.insideairbnb.com/canada/bc/vancouver/'+ date + '/data/listings.csv.gz'
        #save under data/listings/, the directory the merge step reads from
        file_name = "data/listings/"+date+"listing.gz"
        urllib.request.urlretrieve(url, file_name)
    except urllib.error.HTTPError as err:
        if(err.code == 404):
            pass
        else:
            print(err)

def download_reviews(date):
    
    try:
        url = 'http://data.insideairbnb.com/canada/bc/vancouver/'+ date + '/data/reviews.csv.gz'
        file_name = "data/reviews/"+date+"reviews.gz"
        urllib.request.urlretrieve(url, file_name)
    except urllib.error.HTTPError as err:
        if(err.code == 404):
            pass
        else:
            print(err)

            
def download_calendar(date):
    
    try:
        url = 'http://data.insideairbnb.com/canada/bc/vancouver/'+ date + '/data/calendar.csv.gz'
        file_name = "data/calendar/"+date+"calendar.gz"
        urllib.request.urlretrieve(url, file_name)
    except urllib.error.HTTPError as err:
        if(err.code == 404):
            pass
        else:
            print(err)
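
The three functions above differ only in the file type they request. A minimal sketch of one parameterized helper follows. The name `download_file`, the `out_root` and `retrieve` parameters, and the `data/<kind>/<date><kind>.gz` naming are assumptions made for illustration, not the notebook's own layout. The `retrieve` argument exists only so the function can be exercised without network access.

```python
import os
import urllib.error
import urllib.request

BASE_URL = 'http://data.insideairbnb.com/canada/bc/vancouver/'


def download_file(date, kind, out_root='data', retrieve=urllib.request.urlretrieve):
    """Download <kind>.csv.gz for one scrape date; return the saved path or None.

    kind is 'listings', 'reviews' or 'calendar'.
    Dates without a scrape (HTTP 404) are skipped silently.
    """
    url = BASE_URL + date + '/data/' + kind + '.csv.gz'
    out_dir = os.path.join(out_root, kind)
    os.makedirs(out_dir, exist_ok=True)  # create data/<kind>/ if it is missing
    file_name = os.path.join(out_dir, date + kind + '.gz')
    try:
        retrieve(url, file_name)
    except urllib.error.HTTPError as err:
        if err.code != 404:
            print(err)  # report anything other than "no scrape on this date"
        return None
    return file_name
```

A call such as `download_file('2018-04-11', 'reviews')` would then play the role of `download_reviews('2018-04-11')`.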

    


# Merging the files
Since there are several listing, calendar and review files it would be easier to merge the different types into one. 

ONLY RUN THIS ONCE! NEED TO KEEP WORKING ON THIS! 

In [23]:
#get all files in listings directory
filenames = os.listdir('data/listings')

#remove files that are not wanted
filenames.pop(2)
filenames.pop(len(filenames) - 9)

#read all files and write them to a new csv
#sort=False opts in to the future pandas default of not sorting the columns
combined_csv = pd.concat([pd.read_csv('data/listings/' + f) for f in filenames], sort=False)
combined_csv.to_csv("combined_listing.csv", index=True)
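
Dropping unwanted files by list position depends on the order `os.listdir` happens to return. As a hedged alternative, the sketch below selects files by name pattern instead. The function name `combine_listings`, the `*listing.gz` pattern and the added `source_file` column are assumptions for illustration.

```python
import glob
import os

import pandas as pd


def combine_listings(directory, pattern='*listing.gz'):
    """Concatenate every gzip-compressed CSV in `directory` matching `pattern`."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    frames = []
    for path in paths:
        frame = pd.read_csv(path, compression='gzip')
        # Record the source file so rows can be traced back to a scrape date.
        frame['source_file'] = os.path.basename(path)
        frames.append(frame)
    # pd.concat raises ValueError if no file matched the pattern.
    return pd.concat(frames, ignore_index=True, sort=False)
```

Because the pattern decides which files are read, rerunning the cell after new downloads does not require adjusting any indices.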


# ONLY RUN THE LOOP BELOW ONCE. IT WILL DOWNLOAD ALL THE DATA! IT WILL TAKE SOME TIME

In [9]:
#Call functions to query data
start = '2015-11-06'
end = '2018-11-08'
dates = pd.date_range(start, end)

for date in dates:
    day = date.strftime('%Y-%m-%d')
    download_reviews(day)
    download_calendar(day)


In [None]:
# columns we want 
col_idx = np.array([79,1,9,20,26,27,28,29,33,34,35,36,37,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,80,81,82,83,84,85,86,90,91,92,93,94,95,96])
col_idx = col_idx - 1


In [None]:
#use the full listings file (index 0), as in the earlier cell; index 1 is the summary file
listings = dataframes[0]
listings = listings.iloc[:,col_idx]

In [None]:
listings.iloc[1:5]

# TODO
- Build a function that takes a .gz listings file, keeps the relevant columns, and writes them to a new text/csv file.
- Build a function to get all the review data. Use the number of reviews left for a property and its minimum number of nights to estimate how much a host is making.
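
One possible starting point for these two items is sketched below. The function names are hypothetical. The column names `price`, `minimum_nights` and `number_of_reviews`, and the text price format, are assumptions about the listings file. The income formula (one stay of exactly the minimum length per review) is only a rough illustration of the idea in the TODO, not a validated model.

```python
import pandas as pd


def extract_columns(gz_path, out_path, columns):
    """Read a gzip-compressed listings file, keep `columns`, write a plain CSV."""
    listings = pd.read_csv(gz_path, compression='gzip', usecols=columns)
    listings = listings[columns]  # usecols does not preserve the requested order
    listings.to_csv(out_path, index=False)
    return listings


def estimate_income(listings):
    """Rough estimate: nightly price * minimum nights * number of reviews.

    Assumes every review corresponds to one stay of exactly `minimum_nights`.
    """
    # Prices are assumed to be stored as text such as "$1,200.00".
    price = (listings['price'].astype(str)
             .str.replace(r'[$,]', '', regex=True)
             .astype(float))
    return price * listings['minimum_nights'] * listings['number_of_reviews']
```

Since not every guest leaves a review and stays can exceed the minimum, an estimate of this kind would most likely understate actual income.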