# Preliminary model development Airbnb

This notebook presents some initial models for the Airbnb dataset. The main question is whether the COVID-19 pandemic affected prices in a significant way.

To answer this question, different models will be developed using the available datasets. Models fitted on one dataset will then be applied to data from other dates to see how well they fit. Two comparisons will be conducted:
* Within the same year, changes from June to September will be compared
* Year-to-year changes will also be calculated

Finally, an overall model will be developed, with the aim of producing an accurate prediction tool for future prices.

The initial hypothesis is that June 2020 prices will be heavily affected, with a slight rebound expected in September. This may be an indicator of longer-term impact or patterns.

**Next steps**

Once this analysis is complete, a more in-depth study should look at different regularizations and ways of computing the final model. For example, the effect of cross-validation on training, or bootstrapping, could be possible steps to take.

### Reading the dataset

In [149]:
using Random
using CSV
using Plots
using DataFrames
using Statistics
using LinearAlgebra
using Dates
using StatsBase

One of the encoded datasets is selected for an initial approach. Once a course of action has been determined, the steps will be replicated for all datasets.

In [150]:
# Recent CSV.jl versions require the sink type as the second argument
df = CSV.read("./data/treated data/2020_october_listings_encoded.csv", DataFrame);

In [151]:
first(df, 6)

Row,id,name,host_id,host_name
,Int64,String,Int64,String
1,2595,Skylit Midtown Castle,2845,Jennifer
2,3831,"Whole flr w/private bdrm, bath & kitchen(pls read)",4869,LisaRoxanne
3,5121,BlissArtsSpace!,7356,Garon
4,5136,"Spacious Brooklyn Duplex, Patio + Garden",7378,Rebecca
5,5178,Large Furnished Room Near B'way,8967,Shunichi
6,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen


### Eliminating neighbourhood one-hot encoding columns

The dataset has 249 columns, mostly because of the one-hot encoding of the `neighbourhood` feature, which has 223 distinct values. For a first approach, these columns will not be taken into account. A proposed method for further analysis is to replace them with neighbourhood-level information such as mean income, population density and other relevant variables.

In [152]:
feature_names = names(df);
print(feature_names)

["id", "name", "host_id", "host_name", "neighbourhood_group", "neighbourhood", "latitude", "longitude", "room_type", "price", "minimum_nights", "number_of_reviews", "last_review", "reviews_per_month", "calculated_host_listings_count", "availability_365", "period_code", "Manhattan", "Brooklyn", "Queens", "Staten Island", "Bronx", "Midtown", "Clinton Hill", "Bedford-Stuyvesant", "Sunset Park", "Hell's Kitchen", "Upper West Side", "West Village", "South Slope", "Williamsburg", "East Harlem", "Fort Greene", "Inwood", "East Village", "Harlem", "Flatbush", "Prospect-Lefferts Gardens", "Long Island City", "Jamaica", "Chelsea", "Greenpoint", "Kips Bay", "Lower East Side", "Nolita", "Upper East Side", "Prospect Heights", "Park Slope", "Washington Heights", "Woodside", "Brooklyn Heights", "Bushwick", "Carroll Gardens", "Gowanus", "Flatlands", "SoHo", "Cobble Hill", "Flushing", "Boerum Hill", "Sunnyside", "St. George", "Tribeca", "Highbridge", "Ridgewood", "Port Morris", "Morningside Heights", "M

In [153]:
neighbourhood_group_names = names(df)[18:22];
print(neighbourhood_group_names)

["Manhattan", "Brooklyn", "Queens", "Staten Island", "Bronx"]

In [154]:
room_type_names = names(df)[end-4:end];
print(room_type_names)

["Country Club", "Entire home/apt", "Private room", "Shared room", "Hotel room"]

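Columns 1-17 are the original variables and columns 18-22 are the one-hot encoding of `neighbourhood_group`. The one-hot encoding of `room_type` occupies the last 4 columns; the `end-4:end` slice above spans 5 columns, so it also picks up `Country Club`, which appears to be a `neighbourhood` column rather than a room type.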
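
Positional slices such as `end-4:end` depend on the exact column order and width. A minimal sketch of selecting the one-hot columns by name instead is given below. It assumes the category labels are the ones visible in the printed outputs, and `all_names` is a shortened, hypothetical stand-in for `names(df)`:

```julia
# Shortened, hypothetical stand-in for names(df).
all_names = ["id", "price", "Manhattan", "Brooklyn", "Queens", "Staten Island", "Bronx",
             "Midtown", "Country Club",
             "Entire home/apt", "Private room", "Shared room", "Hotel room"]

# Assumed category labels, taken from the printed outputs in this notebook.
group_labels = ["Manhattan", "Brooklyn", "Queens", "Staten Island", "Bronx"]
room_labels  = ["Entire home/apt", "Private room", "Shared room", "Hotel room"]

# Keep only the names that match a known label, preserving column order.
group_cols = filter(in(group_labels), all_names)
room_cols  = filter(in(room_labels), all_names)

println(group_cols)
println(room_cols)
```

Name vectors like these can be passed directly as column selectors, in the same way the notebook indexes with `setdiff(names(df_cut), [...])`.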

In [155]:
df_cut = hcat(df[:, 1:22], df[:, end-4:end]);
size(df_cut)

(44634, 27)

In [156]:
cut_feature_names = names(df_cut);
print(cut_feature_names)

["id", "name", "host_id", "host_name", "neighbourhood_group", "neighbourhood", "latitude", "longitude", "room_type", "price", "minimum_nights", "number_of_reviews", "last_review", "reviews_per_month", "calculated_host_listings_count", "availability_365", "period_code", "Manhattan", "Brooklyn", "Queens", "Staten Island", "Bronx", "Country Club", "Entire home/apt", "Private room", "Shared room", "Hotel room"]

### Eliminating unnecessary variables for model

Not all variables will be used for the model. For example, `id` and `name` are not needed. `host_id` would probably be very explanatory, but it is really a categorical variable and the number by itself carries no meaning, so it is removed as well, together with `host_name`.

Furthermore, there is no need for `neighbourhood_group`, `neighbourhood` or `room_type`, as their information is already present in other columns (or was previously eliminated, as is the case for `neighbourhood`).

If the model needs a boost in accuracy, a proposed course of action is to scrape expressions from the listing name. For example, including the word "Private" or "New" may increase the price.
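
As a rough illustration of that idea, keyword indicators could be derived from the listing names. The helper `has_keyword` and the chosen keywords are hypothetical; the sample names are copied from the first rows of the dataset preview.

```julia
# Hypothetical helper: 1.0 if the keyword appears in the listing name
# (case-insensitive), 0.0 otherwise.
has_keyword(name::AbstractString, keyword::AbstractString) =
    occursin(lowercase(keyword), lowercase(name)) ? 1.0 : 0.0

# Sample listing names copied from the dataset preview.
listing_names = ["Skylit Midtown Castle",
                 "Whole flr w/private bdrm, bath & kitchen(pls read)",
                 "Spacious Brooklyn Duplex, Patio + Garden"]

# One indicator vector per keyword, built by broadcasting over the names.
private_flag  = has_keyword.(listing_names, "private")
spacious_flag = has_keyword.(listing_names, "spacious")

println(private_flag)
println(spacious_flag)
```

Each resulting vector could be added to the feature matrix as one more 0/1 column, alongside the existing one-hot columns.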

In [157]:
df_filtered = df_cut[:,setdiff(names(df_cut), ["id", "name", "host_id", "host_name", "neighbourhood_group", "neighbourhood", "room_type"])];
filtered_feature_names = names(df_filtered);
print(filtered_feature_names)

["latitude", "longitude", "price", "minimum_nights", "number_of_reviews", "last_review", "reviews_per_month", "calculated_host_listings_count", "availability_365", "period_code", "Manhattan", "Brooklyn", "Queens", "Staten Island", "Bronx", "Country Club", "Entire home/apt", "Private room", "Shared room", "Hotel room"]

In [158]:
for i in 1:length(filtered_feature_names)
    println(string(i), "\t", string(filtered_feature_names[i]), "\t\t\t", string(eltype(df_filtered[!, i])))
end

1	latitude			Float64
2	longitude			Float64
3	price			Int64
4	minimum_nights			Int64
5	number_of_reviews			Int64
6	last_review			Date
7	reviews_per_month			Float64
8	calculated_host_listings_count			Int64
9	availability_365			Int64
10	period_code			Int64
11	Manhattan			Float64
12	Brooklyn			Float64
13	Queens			Float64
14	Staten Island			Float64
15	Bronx			Float64
16	Country Club			Float64
17	Entire home/apt			Float64
18	Private room			Float64
19	Shared room			Float64
20	Hotel room			Float64


### Encoding Date in a way that is useful for the model

The date will be encoded in an ordinal way. The categories will be:
* 1: last review was done within the last month
* 2: last review was done within the last three months
* 3: last review was done within the last year
* 4: last review was done earlier
* 5: there are no reviews yet (date is 1900-01-01)

The encoded dates will be stored in `last_review_dates_code`

In [160]:
last_review_dates = df_filtered[:, :last_review];
last_review_dates_code = zeros(length(last_review_dates));
size(last_review_dates) == size(last_review_dates_code)

true

Some tests regarding operations with Date objects

In [161]:
print(last_review_dates[16])
print("\n")
print(last_review_dates[16] == Date("1900-01-01"))
print("\n")
print(last_review_dates[15])
print("\n")
print(last_review_dates[15] > Date("1900-01-01"))

1900-01-01
true
2020-09-07
true

In [162]:
print(Dates.month(Date("1900-05-01")))
print("\n")
print(Dates.year(Date("1900-05-01")))
print("\n")
print(Dates.year(last_review_dates[15]))

5
1900
2020

The thresholds are defined. The recency thresholds are computed relative to the most recent review date in the dataset.

In [163]:
last_review_date = maximum(df_filtered[:, :last_review])

2020-10-11

In [164]:
# Period arithmetic stays valid across year boundaries and month ends
# Within the last month
threshold_1 = last_review_date - Month(1);

# Three months earlier
threshold_2 = last_review_date - Month(3);

# One year earlier
threshold_3 = last_review_date - Year(1);

# No review has been introduced
threshold_5 = Date("1900-01-01");

# This will be the "else" clause in the if statement
# threshold_4 = otherwise.
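
Period arithmetic from the `Dates` standard library keeps thresholds like these valid across year boundaries and month ends. A small sketch follows; the first reference date is the maximum shown above, and the two edge cases are illustrative dates that do not come from the dataset:

```julia
using Dates

# Reference date equal to the most recent review in this dataset.
reference = Date(2020, 10, 11)
one_month_back    = reference - Month(1)
three_months_back = reference - Month(3)
one_year_back     = reference - Year(1)

# Illustrative edge cases (not from the dataset).
january_case   = Date(2021, 1, 15) - Month(1)   # crosses a year boundary
month_end_case = Date(2020, 3, 31) - Month(1)   # day is clamped to the end of February

println((one_month_back, three_months_back, one_year_back))
println((january_case, month_end_case))
```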

`last_review_dates_code` is filled out with a loop by comparing dates to established thresholds

In [165]:

for i in 1:length(last_review_dates)
    
    if last_review_dates[i] >= threshold_1
        last_review_dates_code[i] = 1
    
    elseif last_review_dates[i] >= threshold_2
        last_review_dates_code[i] = 2

    elseif last_review_dates[i] >= threshold_3
        last_review_dates_code[i] = 3
    
    elseif last_review_dates[i] == threshold_5
        last_review_dates_code[i] = 5
        
    else
        last_review_dates_code[i] = 4
    
    end
        
end
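
Since the same encoding is meant to be repeated for every dataset, the loop could be wrapped in a function. This is a minimal sketch, assuming the category definitions listed earlier and the `1900-01-01` placeholder. The name `encode_last_review` and the sample dates are hypothetical, and the codes are returned as integers rather than the `Float64` values stored by the loop.

```julia
using Dates

# Hypothetical helper: map one review date to the ordinal codes 1-5.
# `no_review` is the placeholder date used for listings without reviews.
function encode_last_review(d::Date, reference::Date; no_review::Date = Date(1900, 1, 1))
    d == no_review            && return 5   # no reviews yet
    d >= reference - Month(1) && return 1   # within the last month
    d >= reference - Month(3) && return 2   # within the last three months
    d >= reference - Year(1)  && return 3   # within the last year
    return 4                                # anything earlier
end

reference = Date(2020, 10, 11)
sample = [Date(2020, 10, 1), Date(2020, 8, 1), Date(2019, 12, 25),
          Date(2018, 5, 2), Date(1900, 1, 1)]

# Broadcast over the vector; Ref keeps the reference date fixed for every element.
codes = encode_last_review.(sample, Ref(reference))
println(codes)
```

For another dataset, only `reference` would change, for example to the maximum of that dataset's `last_review` column.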

In [166]:
unique(last_review_dates_code)

5-element Array{Float64,1}:
 3.0
 1.0
 4.0
 2.0
 5.0

In [167]:
countmap(last_review_dates_code)

Dict{Float64,Int64} with 5 entries:
  4.0 => 14318
  2.0 => 3493
  3.0 => 12513
  5.0 => 10506
  1.0 => 3804

The original `last_review` column is dropped and replaced by the encoded values.

In [168]:
df_final = df_filtered[:,setdiff(names(df_filtered), ["last_review"])];

In [170]:
first(df_final, 6)

Row,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month
,Float64,Float64,Int64,Int64,Int64,Float64
1,40.7536,-73.9838,175,3,48,0.36
2,40.6851,-73.9598,76,1,354,4.82
3,40.6869,-73.956,60,29,50,0.36
4,40.6612,-73.9942,175,7,1,0.01
5,40.7649,-73.9849,73,2,473,3.4
6,40.8018,-73.9672,75,2,118,0.87


In [172]:
df_final[:, :last_review] = last_review_dates_code;

In [173]:
first(df_final, 6)

Row,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month
,Float64,Float64,Int64,Int64,Int64,Float64
1,40.7536,-73.9838,175,3,48,0.36
2,40.6851,-73.9598,76,1,354,4.82
3,40.6869,-73.956,60,29,50,0.36
4,40.6612,-73.9942,175,7,1,0.01
5,40.7649,-73.9849,73,2,473,3.4
6,40.8018,-73.9672,75,2,118,0.87


### Separation of training and testing set
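
The notebook does not show how the split is made. One possible sketch is a seeded random 80/20 split of the row indices. The ratio, the seed and the helper name `split_indices` are assumptions, not choices taken from the notebook:

```julia
using Random

# Hypothetical helper: shuffle 1:n reproducibly and cut it into
# train and test index vectors.
function split_indices(n::Integer; train_fraction::Float64 = 0.8, seed::Integer = 42)
    rng = MersenneTwister(seed)
    shuffled = randperm(rng, n)
    n_train = floor(Int, train_fraction * n)
    return shuffled[1:n_train], shuffled[n_train+1:end]
end

# 44634 rows, as reported by size(df_cut).
train_idx, test_idx = split_indices(44634)
println((length(train_idx), length(test_idx)))
```

The two index vectors could then select rows, for example `df_final[train_idx, :]` and `df_final[test_idx, :]`.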

### First model is built