# Data mining and cleaning Airbnb

In [3]:
using CSV
using Plots
using DataFrames
using Dates
using StatsBase # provides countmap, used in the date encoding section

There are 4 starting files:
* 2019_july_listings.csv
* 2019_october_listings.csv
* 2020_july_listings.csv
* 2020_october_listings.csv

These files were edited and tidied in Excel, starting from the original files available [here](http://insideairbnb.com/get-the-data.html).

An individual analysis is done on each file in case any of them has its own particular problems. The goal of this analysis is to produce 4 clean datasets that contain the information needed to work on the problem.

The files contain listing data from the previous month: the month in the file name is the month in which the data was uploaded. For example, 2019_july_listings.csv contains data from June.
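The naming convention can be made explicit with a small helper. This is only a sketch: `data_month` is a hypothetical function that is not part of the notebook, and it assumes file names of the form `<year>_<month name>_listings.csv`.

```julia
using Dates

# Hypothetical helper: derive the month the data describes from a file name
# that follows the pattern "<year>_<month name>_listings.csv".
function data_month(filename::AbstractString)
    parts = split(filename, "_")
    year = parse(Int, parts[1])
    # Look up the month number by its English name (1 = January ... 12 = December).
    month = findfirst(m -> lowercase(Dates.monthname(Date(2000, m, 1))) == lowercase(parts[2]), 1:12)
    month === nothing && error("unrecognised month name: $(parts[2])")
    # The file is named after the upload month; the data is from the month before.
    return Date(year, month) - Month(1)
end

println(data_month("2019_july_listings.csv"))
```

Subtracting `Month(1)` also handles a January upload, which maps to December of the previous year.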

## Missing value treatment

### October 2020

In [4]:
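df = DataFrame(CSV.File("./data/initial data/2020_october_listings.csv")); # recent CSV.jl versions need an explicit sink, so the file is wrapped in DataFrame

Look at the variables and check whether their types allow missing values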

In [5]:
feature_names = names(df);

In [6]:
for i in 1:length(feature_names)
    println(string(i), "\t", string(feature_names[i]), "\t\t\t", string(eltype(df[!, i])))
end

1	id			Int64
2	name			Union{Missing, String}
3	host_id			Int64
4	host_name			Union{Missing, String}
5	neighbourhood_group			String
6	neighbourhood			String
7	latitude			Float64
8	longitude			Float64
9	room_type			String
10	price			Int64
11	minimum_nights			Int64
12	number_of_reviews			Int64
13	last_review			Union{Missing, Date}
14	reviews_per_month			Union{Missing, Float64}
15	calculated_host_listings_count			Int64
16	availability_365			Int64


There are missing values in some columns. The next cells check how many there are and how much they matter.

Number of rows in the full dataset

In [7]:
n_rows_full = size(df,1)

44666

In [8]:
function percentage(fraction, total)
    return fraction/total
end

percentage (generic function with 1 method)

For each column, the percentage of missing values is computed. The goal is to see whether it makes sense to substitute the missing values with a dummy value or whether the affected rows can be eliminated.

In [9]:
for feature in feature_names
    index_missing_data = ismissing.(df[:, feature]) # For each column we obtain 1 if missing, 0 if not
    
    number_missing_data = sum(index_missing_data)
    
    # Percentage is calculated
    per = percentage(number_missing_data, n_rows_full)
    
    print("\n\n", feature, " --> ", round(per*100, digits = 3) ,"% missing values")
end



id --> 0.0% missing values

name --> 0.034% missing values

host_id --> 0.0% missing values

host_name --> 0.038% missing values

neighbourhood_group --> 0.0% missing values

neighbourhood --> 0.0% missing values

latitude --> 0.0% missing values

longitude --> 0.0% missing values

room_type --> 0.0% missing values

price --> 0.0% missing values

minimum_nights --> 0.0% missing values

number_of_reviews --> 0.0% missing values

last_review --> 23.546% missing values

reviews_per_month --> 23.546% missing values

calculated_host_listings_count --> 0.0% missing values

availability_365 --> 0.0% missing values

The results show which columns need further investigation:
* last_review
* reviews_per_month

For the remaining columns with missing values (`name` and `host_name`), the affected rows can be eliminated
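The per-column check can also be written compactly with `count(ismissing, col)`. The sketch below uses a few made-up toy columns instead of the real listing data, and `missing_share` is a hypothetical helper name.

```julia
# Toy columns standing in for the listing data (values are made up for illustration).
columns = (
    name = ["Loft", missing, "Studio", "Room"],
    reviews_per_month = [0.5, missing, missing, 1.2],
    price = [100, 80, 120, 60],
)

# Share of missing entries in a vector, expressed as a percentage.
missing_share(col) = 100 * count(ismissing, col) / length(col)

for (colname, col) in pairs(columns)
    println(colname, " --> ", round(missing_share(col), digits = 3), "% missing values")
end
```

Returning the value already multiplied by 100 keeps the helper's name and its result in agreement.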

In [10]:
size(df)

(44666, 16)

`last_review` is a date column and `reviews_per_month` is a float that indicates the average number of reviews per month. `Missing` values are taken to mean that the listing has no reviews yet, so they are substituted by the dummy date 1900-01-01 and the value 0 respectively.

In [11]:
df[1:10, :last_review]

10-element Array{Union{Missing, Date},1}:
 2019-11-04
 2020-09-20
 2019-12-02
 2014-01-02
 2020-03-15
 2017-07-21
 2019-08-10
 2020-09-09
 2019-12-09
 2020-03-16

In [12]:
dummy_date = Date(Dates.Month(1),Dates.Year(1900))

1900-01-01

In [13]:
df[:, :last_review] = coalesce.(df[:, :last_review], dummy_date);

In [14]:
df[:, :reviews_per_month] =  coalesce.(df[:, :reviews_per_month], 0);
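As a small illustration of what `coalesce.` does, the sketch below applies it to two made-up vectors rather than to the dataframe columns. The variable names ending in `_filled` are only for this example.

```julia
using Dates

# Toy vectors with missing entries (values are made up for illustration).
last_review = [Date(2019, 11, 4), missing, Date(2020, 9, 20)]
reviews_per_month = [0.35, missing, 2.1]

dummy_date = Date(1900, 1, 1)

# coalesce returns its first non-missing argument; the dot applies it element by element.
last_review_filled = coalesce.(last_review, dummy_date)
reviews_filled = coalesce.(reviews_per_month, 0.0)

println(last_review_filled)
println(reviews_filled)
```

Non-missing entries are left untouched, so only the gaps receive the dummy values.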

Because the remaining missing values are in `host_name` and `name`, and they represent a very small percentage of the total observations, the affected rows are eliminated

In [15]:
index_host_name = .!ismissing.(df[:, :host_name]); # 1 to rows with non missing values

In [16]:
size(df)

(44666, 16)

In [17]:
df_clean_host_name = df[index_host_name, :];

In [18]:
size(df_clean_host_name)

(44649, 16)

Now it will be checked if there are still rows with missing `name`

In [19]:
index_name = .!ismissing.(df_clean_host_name[:, :name]); # 1 to rows with non missing values

In [20]:
sum(index_name)

44634

As this number is less than the number of rows of `df_clean_host_name` cleaning is still needed

In [21]:
df_clean = df_clean_host_name[index_name, :];

In [22]:
size(df_clean)

(44634, 16)
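The two filtering passes can be combined into a single boolean mask. The sketch below shows the idea on made-up vectors; it is an alternative formulation, not the code the notebook ran.

```julia
# Toy columns (values are made up for illustration).
host_name = ["Ana", missing, "Li", "Sam"]
name = ["Loft", "Studio", missing, "Room"]
price = [100, 80, 120, 60]

# A row is kept only when both text fields are present.
keep = .!ismissing.(host_name) .& .!ismissing.(name)

println(keep)
println(price[keep])
```

With a dataframe, `dropmissing(df, [:host_name, :name])` from DataFrames should give the same rows in one call.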

Final check to confirm that no column has missing values left

In [23]:
any_missing = [any(ismissing, col) for col in eachcol(df_clean)];
sum(any_missing)

0

No values are missing from any of the columns. The clean dataframe is stored in a new variable to prevent overwriting

In [24]:
october_2020_dataframe = df_clean;

### July 2020

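The same process explained for October 2020 is followed

In [25]:
df = DataFrame(CSV.File("./data/initial data/2020_july_listings.csv")); # recent CSV.jl versions need an explicit sink, so the file is wrapped in DataFrame

Look at the variables and check whether their types allow missing values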

In [26]:
feature_names = names(df);

In [27]:
for i in 1:length(feature_names)
    println(string(i), "\t", string(feature_names[i]), "\t\t\t", string(eltype(df[!, i])))
end

1	id			Int64
2	name			Union{Missing, String}
3	host_id			Int64
4	host_name			Union{Missing, String}
5	neighbourhood_group			String
6	neighbourhood			String
7	latitude			Float64
8	longitude			Float64
9	room_type			String
10	price			Int64
11	minimum_nights			Int64
12	number_of_reviews			Int64
13	last_review			Union{Missing, Date}
14	reviews_per_month			Union{Missing, Float64}
15	calculated_host_listings_count			Int64
16	availability_365			Int64


There are missing values in some columns. The next cells check how many there are and how much they matter.

Number of rows in the full dataset

In [28]:
n_rows_full = size(df,1)

48588

In [29]:
function percentage(fraction, total)
    return fraction/total
end

percentage (generic function with 1 method)

For each column, the percentage of missing values is computed. The goal is to see whether it makes sense to substitute the missing values with a dummy value or whether the affected rows can be eliminated.

In [30]:
for feature in feature_names
    index_missing_data = ismissing.(df[:, feature]) # For each column we obtain 1 if missing, 0 if not
    
    number_missing_data = sum(index_missing_data)
    
    # Percentage is calculated
    per = percentage(number_missing_data, n_rows_full)
    
    print("\n\n", feature, " --> ", round(per*100, digits = 3) ,"% missing values")
end



id --> 0.0% missing values

name --> 0.033% missing values

host_id --> 0.0% missing values

host_name --> 0.025% missing values

neighbourhood_group --> 0.0% missing values

neighbourhood --> 0.0% missing values

latitude --> 0.0% missing values

longitude --> 0.0% missing values

room_type --> 0.0% missing values

price --> 0.0% missing values

minimum_nights --> 0.0% missing values

number_of_reviews --> 0.0% missing values

last_review --> 23.376% missing values

reviews_per_month --> 23.376% missing values

calculated_host_listings_count --> 0.0% missing values

availability_365 --> 0.0% missing values

Results are similar to before. The same columns need to be investigated:
* last_review
* reviews_per_month

For the remaining columns with missing values (`name` and `host_name`), the affected rows can be eliminated

In [31]:
size(df)

(48588, 16)

`last_review` is a date column and `reviews_per_month` is a float that indicates the average number of reviews per month. `Missing` values are taken to mean that the listing has no reviews yet, so they are substituted by the dummy date 1900-01-01 and the value 0 respectively.

In [32]:
df[1:10, :last_review]

10-element Array{Union{Missing, Date},1}:
 2008-09-22
 2019-11-04
 2020-06-21
 2016-12-23
 2019-10-13
 2019-12-02
 2014-01-02
 2020-03-15
 2017-07-21
 2019-07-29

In [33]:
dummy_date = Date(Dates.Month(1),Dates.Year(1900))

1900-01-01

In [34]:
df[:, :last_review] = coalesce.(df[:, :last_review], dummy_date);

In [35]:
df[:, :reviews_per_month] =  coalesce.(df[:, :reviews_per_month], 0);

Because the remaining missing values are in `host_name` and `name`, and they represent a very small percentage of the total observations, the affected rows are eliminated

In [36]:
index_host_name = .!ismissing.(df[:, :host_name]); # 1 to rows with non missing values

In [37]:
size(df)

(48588, 16)

In [38]:
df_clean_host_name = df[index_host_name, :];

In [39]:
size(df_clean_host_name)

(48576, 16)

Now it will be checked if there are still rows with missing `name`

In [40]:
index_name = .!ismissing.(df_clean_host_name[:, :name]); # 1 to rows with non missing values

In [41]:
sum(index_name)

48560

As this number is less than the number of rows of `df_clean_host_name` cleaning is still needed

In [42]:
df_clean = df_clean_host_name[index_name, :];

In [43]:
size(df_clean)

(48560, 16)

Final check to confirm that no column has missing values left

In [44]:
any_missing = [any(ismissing, col) for col in eachcol(df_clean)];
sum(any_missing)

0

No values are missing from any of the columns. The clean dataframe is stored in a new variable to prevent overwriting

In [45]:
july_2020_dataframe = df_clean;

### October 2019

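The same process explained for October 2020 is followed

In [46]:
df = DataFrame(CSV.File("./data/initial data/2019_october_listings.csv")); # recent CSV.jl versions need an explicit sink, so the file is wrapped in DataFrame

Look at the variables and check whether their types allow missing values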

In [47]:
feature_names = names(df);

In [48]:
for i in 1:length(feature_names)
    println(string(i), "\t", string(feature_names[i]), "\t\t\t", string(eltype(df[!, i])))
end

1	id			Int64
2	name			Union{Missing, String}
3	host_id			Int64
4	host_name			Union{Missing, String}
5	neighbourhood_group			String
6	neighbourhood			String
7	latitude			Float64
8	longitude			Float64
9	room_type			String
10	price			Int64
11	minimum_nights			Int64
12	number_of_reviews			Int64
13	last_review			Union{Missing, Date}
14	reviews_per_month			Union{Missing, Float64}
15	calculated_host_listings_count			Int64
16	availability_365			Int64


There are missing values in some columns. The next cells check how many there are and how much they matter.

Number of rows in the full dataset

In [49]:
n_rows_full = size(df,1)

48602

In [50]:
function percentage(fraction, total)
    return fraction/total
end

percentage (generic function with 1 method)

For each column, the percentage of missing values is computed. The goal is to see whether it makes sense to substitute the missing values with a dummy value or whether the affected rows can be eliminated.

In [51]:
for feature in feature_names
    index_missing_data = ismissing.(df[:, feature]) # For each column we obtain 1 if missing, 0 if not
    
    number_missing_data = sum(index_missing_data)
    
    # Percentage is calculated
    per = percentage(number_missing_data, n_rows_full)
    
    print("\n\n", feature, " --> ", round(per*100, digits = 3) ,"% missing values")
end



id --> 0.0% missing values

name --> 0.035% missing values

host_id --> 0.0% missing values

host_name --> 0.066% missing values

neighbourhood_group --> 0.0% missing values

neighbourhood --> 0.0% missing values

latitude --> 0.0% missing values

longitude --> 0.0% missing values

room_type --> 0.0% missing values

price --> 0.0% missing values

minimum_nights --> 0.0% missing values

number_of_reviews --> 0.0% missing values

last_review --> 19.133% missing values

reviews_per_month --> 19.133% missing values

calculated_host_listings_count --> 0.0% missing values

availability_365 --> 0.0% missing values

Results are similar to before. The same columns need to be investigated:
* last_review
* reviews_per_month

For the remaining columns with missing values (`name` and `host_name`), the affected rows can be eliminated

In [52]:
size(df)

(48602, 16)

`last_review` is a date column and `reviews_per_month` is a float that indicates the average number of reviews per month. `Missing` values are taken to mean that the listing has no reviews yet, so they are substituted by the dummy date 1900-01-01 and the value 0 respectively.

In [53]:
df[1:10, :last_review]

10-element Array{Union{Missing, Date},1}:
 2019-09-24
 2018-11-19
 2019-09-25
 2017-10-05
 2019-09-28
 2017-07-21
 2019-07-29
 2019-08-03
 2019-10-13
 2019-09-16

In [54]:
dummy_date = Date(Dates.Month(1),Dates.Year(1900))

1900-01-01

In [55]:
df[:, :last_review] = coalesce.(df[:, :last_review], dummy_date);

In [56]:
df[:, :reviews_per_month] =  coalesce.(df[:, :reviews_per_month], 0);

Because the remaining missing values are in `host_name` and `name`, and they represent a very small percentage of the total observations, the affected rows are eliminated

In [57]:
index_host_name = .!ismissing.(df[:, :host_name]); # 1 to rows with non missing values

In [58]:
size(df)

(48602, 16)

In [59]:
df_clean_host_name = df[index_host_name, :];

In [60]:
size(df_clean_host_name)

(48570, 16)

Now it will be checked if there are still rows with missing `name`

In [61]:
index_name = .!ismissing.(df_clean_host_name[:, :name]); # 1 to rows with non missing values

In [62]:
sum(index_name)

48553

As this number is less than the number of rows of `df_clean_host_name` cleaning is still needed

In [63]:
df_clean = df_clean_host_name[index_name, :];

In [64]:
size(df_clean)

(48553, 16)

Final check to confirm that no column has missing values left

In [65]:
any_missing = [any(ismissing, col) for col in eachcol(df_clean)];
sum(any_missing)

0

No values are missing from any of the columns. The clean dataframe is stored in a new variable to prevent overwriting

In [66]:
october_2019_dataframe = df_clean;

### July 2019

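The same process explained for October 2020 is followed

In [67]:
df = DataFrame(CSV.File("./data/initial data/2019_july_listings.csv")); # recent CSV.jl versions need an explicit sink, so the file is wrapped in DataFrame

Look at the variables and check whether their types allow missing values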

In [68]:
feature_names = names(df);

In [69]:
for i in 1:length(feature_names)
    println(string(i), "\t", string(feature_names[i]), "\t\t\t", string(eltype(df[!, i])))
end

1	id			Int64
2	name			Union{Missing, String}
3	host_id			Int64
4	host_name			Union{Missing, String}
5	neighbourhood_group			String
6	neighbourhood			String
7	latitude			Float64
8	longitude			Float64
9	room_type			String
10	price			Int64
11	minimum_nights			Int64
12	number_of_reviews			Int64
13	last_review			Union{Missing, Date}
14	reviews_per_month			Union{Missing, Float64}
15	calculated_host_listings_count			Int64
16	availability_365			Int64


There are missing values in some columns. The next cells check how many there are and how much they matter.

Number of rows in the full dataset

In [70]:
n_rows_full = size(df,1)

48895

In [71]:
function percentage(fraction, total)
    return fraction/total
end

percentage (generic function with 1 method)

For each column, the percentage of missing values is computed. The goal is to see whether it makes sense to substitute the missing values with a dummy value or whether the affected rows can be eliminated.

In [72]:
for feature in feature_names
    index_missing_data = ismissing.(df[:, feature]) # For each column we obtain 1 if missing, 0 if not
    
    number_missing_data = sum(index_missing_data)
    
    # Percentage is calculated
    per = percentage(number_missing_data, n_rows_full)
    
    print("\n\n", feature, " --> ", round(per*100, digits = 3) ,"% missing values")
end



id --> 0.0% missing values

name --> 0.033% missing values

host_id --> 0.0% missing values

host_name --> 0.043% missing values

neighbourhood_group --> 0.0% missing values

neighbourhood --> 0.0% missing values

latitude --> 0.0% missing values

longitude --> 0.0% missing values

room_type --> 0.0% missing values

price --> 0.0% missing values

minimum_nights --> 0.0% missing values

number_of_reviews --> 0.0% missing values

last_review --> 20.558% missing values

reviews_per_month --> 20.558% missing values

calculated_host_listings_count --> 0.0% missing values

availability_365 --> 0.0% missing values

Results are similar to before. The same columns need to be investigated:
* last_review
* reviews_per_month

For the remaining columns with missing values (`name` and `host_name`), the affected rows can be eliminated

In [73]:
size(df)

(48895, 16)

`last_review` is a date column and `reviews_per_month` is a float that indicates the average number of reviews per month. `Missing` values are taken to mean that the listing has no reviews yet, so they are substituted by the dummy date 1900-01-01 and the value 0 respectively.

In [74]:
df[1:10, :last_review]

10-element Array{Union{Missing, Date},1}:
 2018-10-19
 2019-05-21
 missing
 2019-07-05
 2018-11-19
 2019-06-22
 2017-10-05
 2019-06-24
 2017-07-21
 2019-06-09

In [75]:
dummy_date = Date(Dates.Month(1),Dates.Year(1900))

1900-01-01

In [76]:
df[:, :last_review] = coalesce.(df[:, :last_review], dummy_date);

In [77]:
df[:, :reviews_per_month] =  coalesce.(df[:, :reviews_per_month], 0);

Because the remaining missing values are in `host_name` and `name`, and they represent a very small percentage of the total observations, the affected rows are eliminated

In [78]:
index_host_name = .!ismissing.(df[:, :host_name]); # 1 to rows with non missing values

In [79]:
size(df)

(48895, 16)

In [80]:
df_clean_host_name = df[index_host_name, :];

In [81]:
size(df_clean_host_name)

(48874, 16)

Now it will be checked if there are still rows with missing `name`

In [82]:
index_name = .!ismissing.(df_clean_host_name[:, :name]); # 1 to rows with non missing values

In [83]:
sum(index_name)

48858

As this number is less than the number of rows of `df_clean_host_name` cleaning is still needed

In [84]:
df_clean = df_clean_host_name[index_name, :];

In [85]:
size(df_clean)

(48858, 16)

Final check to confirm that no column has missing values left

In [86]:
any_missing = [any(ismissing, col) for col in eachcol(df_clean)];
sum(any_missing)

0

No values are missing from any of the columns. The clean dataframe is stored in a new variable to prevent overwriting

In [87]:
july_2019_dataframe = df_clean;

## Variable encoding

Now that the data is clean, several columns are encoded to make the analysis easier:
* `neighbourhood_group`
* `neighbourhood`
* `room_type`

`last_review` could also be encoded into categories such as [recently, last 3 months, never]. This is something to consider

In [88]:
dataframes = [october_2020_dataframe, july_2020_dataframe, october_2019_dataframe, july_2019_dataframe];

In [89]:
dataframe_names = ["october_2020_dataframe", "july_2020_dataframe", "october_2019_dataframe", "july_2019_dataframe"];

In [90]:
variables_to_encode = ["neighbourhood_group", "neighbourhood", "room_type" ];

In [91]:
for i in 1:length(dataframe_names)
    print("\n\n ", dataframe_names[i])
    print("\n =======================")
    
    for variable in variables_to_encode
        print("\n ", variable, "\n")
        unique_values = unique(dataframes[i][:,variable] )
        print(length(unique_values))
    end
end




 october_2020_dataframe
 neighbourhood_group
5
 neighbourhood
221
 room_type
4

 july_2020_dataframe
 neighbourhood_group
5
 neighbourhood
222
 room_type
4

 october_2019_dataframe
 neighbourhood_group
5
 neighbourhood
223
 room_type
4

 july_2019_dataframe
 neighbourhood_group
5
 neighbourhood
221
 room_type
3

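* `neighbourhood_group`: has 5 unique values in every dataset, so it can be encoded directly in all of them
* `neighbourhood`: the number of unique values differs between datasets (221 to 223), so the values that differ (at least 2) need to be identified
* `room_type`: July 2019 has only 3 types, while the other datasets have 4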
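One way to find the values that appear in one dataset but not in another is `setdiff`. The sketch below uses two short made-up lists of neighbourhood names, not the real columns.

```julia
# Toy neighbourhood lists for two periods (made up for illustration).
period_a = ["Midtown", "Harlem", "Astoria"]
period_b = ["Midtown", "Astoria", "Chelsea"]

# setdiff(x, y) returns the elements of x that do not occur in y.
only_in_a = setdiff(period_a, period_b)
only_in_b = setdiff(period_b, period_a)

println("Only in period A: ", only_in_a)
println("Only in period B: ", only_in_b)
```

Applied to `unique(df[:, :neighbourhood])` for two of the dataframes, the same call would list the neighbourhoods behind the differing counts.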


The next step is to encode the three studied variables in all four datasets using the same coding pattern, so that the results are comparable across datasets. This is accomplished with the following steps:
1. Create a new column in every dataset indicating which period of time the row corresponds to (Oct 2020, Jul 2020...)
2. Join all datasets into a single one
3. Encode the necessary variables globally to ensure the encoding is consistent across all datasets
4. Separate the data to recover the initial datasets
#### 1. Create new column

`period_code` is a variable indicating which dataset a row comes from. Its purpose is to make it possible to separate the combined dataset afterwards.
* Oct 2020: `period_code` = 1
* Jul 2020: `period_code` = 2
* Oct 2019: `period_code` = 3
* Jul 2019: `period_code` = 4

In [92]:
october_2020_dataframe[!, :period_code] .= 1; # broadcast assignment fills the new column with a constant

In [93]:
july_2020_dataframe[!, :period_code] .= 2;

In [94]:
october_2019_dataframe[!, :period_code] .= 3;

In [95]:
july_2019_dataframe[!, :period_code] .= 4;

#### 2. Join all datasets

In [96]:
df_all_time_periods = vcat(october_2020_dataframe, july_2020_dataframe, october_2019_dataframe, july_2019_dataframe);

Check dimensions are ok

In [97]:
size(df_all_time_periods,1) == size(october_2020_dataframe,1) + size(july_2020_dataframe,1) + size(october_2019_dataframe,1) + size(july_2019_dataframe, 1)

true

In [98]:
size(df_all_time_periods)

(190605, 17)

#### 3. Encode all variables

The `onehot` function from Homework3 is used to encode the necessary variables

In [99]:
function onehot(column, cats=unique(column))
    result = zeros(size(column,1) , size(cats,1))
    
    for i in 1:length(column)
        for j in 1:length(cats)
            if column[i] == cats[j]
                result[i,j] = 1
            end
        end
    end
    return result
end

onehot (generic function with 2 methods)
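For comparison, the same encoding can be expressed with broadcasting instead of nested loops. This is a sketch of an equivalent formulation under the assumption that `column` is a plain vector. `onehot_broadcast` is a hypothetical name and the room types are toy values.

```julia
# Broadcast-based variant: compare every entry with every category at once.
# `permutedims` turns the category vector into a row, so `.==` yields a rows x categories matrix.
onehot_broadcast(column, cats = unique(column)) = Float64.(column .== permutedims(cats))

room_types = ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"]
encoded = onehot_broadcast(room_types)

# Column sums give the number of rows that fall in each category.
counts = vec(sum(encoded, dims = 1))

println(encoded)
println(counts)
```

The categories follow the order of first appearance, because that is the order `unique` returns.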

**`neighbourhood_group`**

In [100]:
onehot_neighbourhood_groups = onehot(df_all_time_periods[:, :neighbourhood_group]);
onehot_neighbourhood_groups[1:10,:]

10×5 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0

Column headers are created

In [101]:
neighbourhood_group_encoded_names = unique(df_all_time_periods[:, :neighbourhood_group]);

In [102]:
df_onehot_neighbourhood_groups = DataFrame(onehot_neighbourhood_groups, neighbourhood_group_encoded_names);
df_onehot_neighbourhood_groups[1:10, :]

| Row | Manhattan | Brooklyn | Queens | Staten Island | Bronx |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 9 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


The new dataframe is concatenated with the full one.

In [103]:
df_all_time_periods_one_hot1 = hcat(df_all_time_periods, df_onehot_neighbourhood_groups);

Additionally, a small dataframe is constructed with the number of times each `neighbourhood_group` occurs in the dataset

In [104]:
neighbourhood_group_distribution = transpose(sum(onehot_neighbourhood_groups, dims = 1));

In [105]:
neighbourhood_group_information =  hcat(neighbourhood_group_encoded_names, neighbourhood_group_distribution)

5×2 Array{Any,2}:
 "Manhattan"      84194.0
 "Brooklyn"       77274.0
 "Queens"         23144.0
 "Staten Island"   1421.0
 "Bronx"           4572.0

In [106]:
df_neighbourhood_group_information = DataFrame(neighbourhood_group_information, [:neighbourhood_group, :number_instances])

| Row | neighbourhood_group | number_instances |
|---|---|---|
| | Any | Any |
| 1 | Manhattan | 84194.0 |
| 2 | Brooklyn | 77274.0 |
| 3 | Queens | 23144.0 |
| 4 | Staten Island | 1421.0 |
| 5 | Bronx | 4572.0 |


**`neighbourhood`**

In [107]:
onehot_neighbourhood = onehot(df_all_time_periods[:, :neighbourhood]);
onehot_neighbourhood[1:10,:]

10×223 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0

Column headers are created

In [108]:
neighbourhood_encoded_names = unique(df_all_time_periods[:, :neighbourhood]);

In [109]:
df_onehot_neighbourhood = DataFrame(onehot_neighbourhood, neighbourhood_encoded_names);
df_onehot_neighbourhood[1:10, :]

| Row | Midtown | Clinton Hill | Bedford-Stuyvesant | Sunset Park | Hell's Kitchen | Upper West Side |
|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


The new dataframe is concatenated with the full one (which already contains the previous one-hot columns)

In [111]:
df_all_time_periods_one_hot2 = hcat(df_all_time_periods_one_hot1, df_onehot_neighbourhood);

Additionally, a small dataframe is constructed with the number of times each `neighbourhood` occurs in the dataset

In [112]:
neighbourhood_distribution = transpose(sum(onehot_neighbourhood, dims = 1));

In [113]:
neighbourhood_information =  hcat(neighbourhood_encoded_names, neighbourhood_distribution);

In [114]:
df_neighbourhood_information = DataFrame(neighbourhood_information, [:neighbourhood, :number_instances]);
first(df_neighbourhood_information, 6) # `head` is not available in recent DataFrames versions

| Row | neighbourhood | number_instances |
|---|---|---|
| | Any | Any |
| 1 | Midtown | 6261.0 |
| 2 | Clinton Hill | 2211.0 |
| 3 | Bedford-Stuyvesant | 14386.0 |
| 4 | Sunset Park | 1594.0 |
| 5 | Hell's Kitchen | 7627.0 |
| 6 | Upper West Side | 7558.0 |


**`room_type`**

In [115]:
onehot_room_type = onehot(df_all_time_periods[:, :room_type]);
onehot_room_type[1:10,:]

10×4 Array{Float64,2}:
 1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0
 1.0  0.0  0.0  0.0

Column headers are created

In [116]:
room_type_encoded_names = unique(df_all_time_periods[:, :room_type]);

In [117]:
df_onehot_room_type = DataFrame(onehot_room_type, room_type_encoded_names);
df_onehot_room_type[1:10, :]

| Row | Entire home/apt | Private room | Shared room | Hotel room |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 1.0 | 0.0 | 0.0 |
| 6 | 0.0 | 1.0 | 0.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 1.0 | 0.0 | 0.0 |
| 9 | 0.0 | 1.0 | 0.0 | 0.0 |
| 10 | 1.0 | 0.0 | 0.0 | 0.0 |


The new dataframe is concatenated with the full one.

In [118]:
df_all_time_periods_one_hot3 = hcat(df_all_time_periods_one_hot2, df_onehot_room_type);

Additionally, a small dataframe is constructed with the number of times each `room_type` occurs in the dataset

In [119]:
room_type_distribution = transpose(sum(onehot_room_type, dims = 1));

In [120]:
room_type_information =  hcat(room_type_encoded_names, room_type_distribution)

4×2 Array{Any,2}:
 "Entire home/apt"  98522.0
 "Private room"     86502.0
 "Shared room"       4339.0
 "Hotel room"        1242.0

In [121]:
df_room_type_information = DataFrame(room_type_information, [:room_type, :number_instances])

| Row | room_type | number_instances |
|---|---|---|
| | Any | Any |
| 1 | Entire home/apt | 98522.0 |
| 2 | Private room | 86502.0 |
| 3 | Shared room | 4339.0 |
| 4 | Hotel room | 1242.0 |


#### 4. Separate data by initial variable to obtain original datasets

In [122]:
df_final_combined = df_all_time_periods_one_hot3;

* Oct 2020 `:period_code` = 1
* Jul 2020 `:period_code` = 2
* Oct 2019 `:period_code` = 3
* Jul 2019 `:period_code` = 4

**October 2020**

In [123]:
october_2020_dataframe_encoded = df_final_combined[ df_final_combined[:,:period_code] .== 1, : ];

**July 2020**

In [124]:
july_2020_dataframe_encoded = df_final_combined[ df_final_combined[:,:period_code] .== 2, : ];

**October 2019**

In [125]:
october_2019_dataframe_encoded = df_final_combined[ df_final_combined[:,:period_code] .== 3, : ];

**July 2019**

In [126]:
july_2019_dataframe_encoded = df_final_combined[ df_final_combined[:,:period_code] .== 4, : ];

Check to see if conversions were correctly done

In [127]:
size(october_2020_dataframe_encoded, 1) == size(october_2020_dataframe, 1)

true

In [128]:
size(july_2020_dataframe_encoded, 1) == size(july_2020_dataframe, 1)

true

In [129]:
size(october_2019_dataframe_encoded, 1) == size(october_2019_dataframe, 1)

true

In [130]:
size(july_2019_dataframe_encoded, 1) == size(july_2019_dataframe, 1)

true
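The tag, join and split round trip can be illustrated without dataframes. The sketch below uses small made-up rows as named tuples. `tag` is a hypothetical helper, and the check at the end mirrors the row-count comparisons above.

```julia
# Toy rows standing in for two of the period datasets (values are made up).
october_rows = [(id = 1, price = 100), (id = 2, price = 80)]
july_rows = [(id = 3, price = 120)]

# Hypothetical helper: attach a period code to every row.
tag(rows, code) = [merge(row, (period_code = code,)) for row in rows]

# Join the tagged periods into one collection.
combined = vcat(tag(october_rows, 1), tag(july_rows, 2))

# Splitting by the code recovers each original set of rows.
recovered_october = [row for row in combined if row.period_code == 1]
recovered_july = [row for row in combined if row.period_code == 2]

println(length(recovered_october) == length(october_rows))
println(length(recovered_july) == length(july_rows))
```

A matching row count is a necessary check, though comparing the `id` values as well would be a stricter test that the right rows came back.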

## Cleaning '0' price values

There are several unreasonable values in the price column. These are eliminated from the datasets because they would affect the model-building process

In [131]:
minimum(october_2020_dataframe_encoded[:, :price])

0

The number of zeros is shown below

In [132]:
prices = october_2020_dataframe_encoded[:, :price];

In [133]:
count(i->(i== 0), prices)

25

Rows with zeros will be eliminated from all four datasets

In [134]:
function remove_zeros(df, column)
    
    # Check initial and expected future size
    original_row_number = size(df, 1)
    column_values = df[!, column]
    zeros_count = count(iszero, column_values)
    expected_final_row_number = original_row_number - zeros_count
    
    # Keep only the rows whose value in `column` is not zero
    df_no_zeros = df[column_values .!= 0, :]
    
    print("Size of final df is as expected:")
    print(size(df_no_zeros, 1) == expected_final_row_number)
    
    return df_no_zeros
end

remove_zeros (generic function with 1 method)
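The same counting and filtering logic is shown below on a plain vector of made-up prices, as a minimal sketch of what the function checks.

```julia
# Toy price vector (values are made up for illustration).
prices = [120, 0, 75, 0, 200]

# Number of zero entries, and the prices that remain after removing them.
zero_count = count(iszero, prices)
nonzero_prices = filter(!iszero, prices)

# The remaining length should equal the original length minus the zeros.
println(length(nonzero_prices) == length(prices) - zero_count)
```

For a dataframe, `filter(:price => !iszero, df)` in recent DataFrames versions should select the same rows.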

All dataframes get their zeroes in price removed

In [135]:
october_2020_dataframe_encoded = remove_zeros(october_2020_dataframe_encoded, :price);

Size of final df is as expected:true

In [136]:
july_2020_dataframe_encoded = remove_zeros(july_2020_dataframe_encoded, :price);

Size of final df is as expected:true

In [137]:
october_2019_dataframe_encoded = remove_zeros(october_2019_dataframe_encoded, :price);

Size of final df is as expected:true

In [138]:
july_2019_dataframe_encoded = remove_zeros(july_2019_dataframe_encoded, :price);

Size of final df is as expected:true

## Date encoding

The date is encoded in an ordinal way, relative to the most recent review date in each dataset. The categories are:
* 5: last review was within the last month
* 4: last review was within the last three months
* 3: last review was within the last year
* 2: last review was earlier
* 1: there are no reviews yet (date is 1900-01-01)

In [184]:
function encode_date(df)
    
    # Encoded dates will be stored in last_review_dates_code
    last_review_dates = df[:, :last_review];
    last_review_dates_code = zeros(length(last_review_dates));
    
    # The thresholds are defined relative to the most recent review date in the dataset
    last_review_date = maximum(df[:, :last_review])
    
    # One month earlier. Period arithmetic is used instead of rebuilding the date from its
    # components, which would throw for January or for days missing in the target month
    threshold_5 = last_review_date - Dates.Month(1);

    # Three months earlier
    threshold_4 = last_review_date - Dates.Month(3);

    # One year earlier
    threshold_3 = last_review_date - Dates.Year(1);

    # No review has been introduced
    threshold_1 = Date("1900-01-01");

    # This will be the "else" clause in the if statement
    # threshold_2 = otherwise.
    
    # `last_review_dates_code` is filled out with a loop by comparing dates to established thresholds
    
    for i in 1:length(last_review_dates)
        if last_review_dates[i] >= threshold_5
            last_review_dates_code[i] = 5
        elseif last_review_dates[i] >= threshold_4
            last_review_dates_code[i] = 4
        elseif last_review_dates[i] >= threshold_3
            last_review_dates_code[i] = 3
        elseif last_review_dates[i] == threshold_1
            last_review_dates_code[i] = 1
        else
            last_review_dates_code[i] = 2
        end
    end
    
    print("\n\nDate distribution is :")
    print(countmap(last_review_dates_code))
    
    # The new encoded date column is included
    df[:, :last_review_code] = last_review_dates_code;
    
    
    return df
end

encode_date (generic function with 1 method)
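The five categories can also be expressed as a function of a single date. The sketch below is an illustrative alternative, not the notebook's code: `review_code` is a hypothetical helper, and the reference date is chosen at the end of a month to show that period arithmetic stays valid at month boundaries.

```julia
using Dates

# Hypothetical helper: ordinal code for one review date, relative to a reference date.
function review_code(review_date::Date, reference::Date; no_review::Date = Date(1900, 1, 1))
    review_date == no_review && return 1            # no reviews yet
    review_date >= reference - Month(1) && return 5 # within the last month
    review_date >= reference - Month(3) && return 4 # within the last three months
    review_date >= reference - Year(1) && return 3  # within the last year
    return 2                                        # earlier
end

reference = Date(2020, 3, 31)

# Subtracting a month from March 31 gives the last valid day of February.
println(reference - Month(1))
println(review_code(Date(2020, 3, 10), reference))
println(review_code(Date(1900, 1, 1), reference))
```

Broadcasting it over a column, for example `review_code.(dates, maximum(dates))`, would produce integer codes rather than the floats stored by the loop.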

**The different months are encoded**

In [185]:
october_2020_dataframe_encoded = encode_date(october_2020_dataframe_encoded);
july_2020_dataframe_encoded = encode_date(july_2020_dataframe_encoded);
october_2019_dataframe_encoded = encode_date(october_2019_dataframe_encoded);
july_2019_dataframe_encoded = encode_date(july_2019_dataframe_encoded);



Date distribution is :Dict(4.0 => 3493,2.0 => 14318,3.0 => 12509,5.0 => 3804,1.0 => 10485)

Date distribution is :Dict(4.0 => 2080,2.0 => 12727,3.0 => 19200,5.0 => 3200,1.0 => 11326)

Date distribution is :Dict(4.0 => 5621,2.0 => 10198,3.0 => 5983,5.0 => 17459,1.0 => 9282)

Date distribution is :Dict(4.0 => 6297,2.0 => 9677,3.0 => 6876,5.0 => 15961,1.0 => 10036)

#### Save all encoded dataframes as csv

In [186]:
CSV.write("./data/treated data/2020_october_listings_encoded.csv", october_2020_dataframe_encoded)

"./data/treated data/2020_october_listings_encoded.csv"

In [187]:
CSV.write("./data/treated data/2020_july_listings_encoded.csv", july_2020_dataframe_encoded)

"./data/treated data/2020_july_listings_encoded.csv"

In [188]:
CSV.write("./data/treated data/2019_october_listings_encoded.csv", october_2019_dataframe_encoded)

"./data/treated data/2019_october_listings_encoded.csv"

In [189]:
CSV.write("./data/treated data/2019_july_listings_encoded.csv", july_2019_dataframe_encoded)

"./data/treated data/2019_july_listings_encoded.csv"