### EDA Airbnb

The hope is that this code can be applied to whichever file we end up choosing.

In [105]:
using CSV
using Plots
using DataFrames


In [106]:
# Recent CSV.jl versions require the sink type (here DataFrame) as the second argument
df = CSV.read("listings_treated.csv", DataFrame);

In [107]:
feature_names = names(df);

In [108]:
# Print the index, name and element type of every column
for (i, name) in enumerate(feature_names)
    println(i, "\t", name, "\t\t\t", eltype(df[!, i]))
end

1	id			Union{Missing, String}
2	name			Union{Missing, String}
3	host_id			Union{Missing, String}
4	host_name			Union{Missing, String}
5	neighbourhood_group			Union{Missing, String}
6	neighbourhood			Union{Missing, String}
7	latitude			Union{Missing, String}
8	longitude			Union{Missing, String}
9	room_type			Union{Missing, String}
10	price			Union{Missing, String}
11	minimum_nights			Union{Missing, String}
12	number_of_reviews			Union{Missing, String}
13	last_review			Union{Missing, String}
14	reviews_per_month			Union{Missing, String}
15	calculated_host_listings_count			Union{Missing, Float64}
16	availability_365			Union{Missing, Int64}
17	Column17			Union{Missing, Int64}
18	Column18			Union{Missing, Int64}


Every column listed above is typed as `Union{Missing, ...}`, so missing values can appear anywhere. Next we count how many there are in each column to judge how much they matter.

Number of rows in the full dataset:

In [109]:
n_rows_full = size(df,1)

44794

In [110]:
# Despite its name, this returns a ratio in [0, 1]; multiply by 100 to get a percentage
function percentage(fraction, total)
    return fraction/total
end

percentage (generic function with 1 method)

In [111]:
for feature in feature_names
    # Boolean mask: true where the value is missing, false otherwise
    index_missing_data = ismissing.(df[:, feature])
    number_missing_data = sum(index_missing_data)

    # Fraction of missing values in this column
    per = percentage(number_missing_data, n_rows_full)

    print("\n\n", feature, " --> ", round(per*100, digits = 2), "% missing values")
end





id --> 0.01% missing values

name --> 0.07% missing values

host_id --> 0.31% missing values

host_name --> 0.35% missing values

neighbourhood_group --> 0.32% missing values

neighbourhood --> 0.32% missing values

latitude --> 0.32% missing values

longitude --> 0.32% missing values

room_type --> 0.32% missing values

price --> 0.32% missing values

minimum_nights --> 0.32% missing values

number_of_reviews --> 0.36% missing values

last_review --> 23.77% missing values

reviews_per_month --> 23.73% missing values

calculated_host_listings_count --> 0.32% missing values

availability_365 --> 0.56% missing values

Column17 --> 99.99% missing values

Column18 --> 100.0% missing values

The results show which columns to investigate more:
* last_review
* reviews_per_month
* Columns 17 and 18
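
Since the aim is to reuse this analysis on other files, the list above could also be derived programmatically rather than read off the printout. The following is a minimal sketch on toy data, not part of the original notebook: the helper `missing_fraction`, the `columns` dictionary, and the `threshold` of 0.9 are all assumptions made for illustration.

```julia
# Hypothetical helper: fraction of `missing` entries in a vector (a value in [0, 1]).
missing_fraction(col) = count(ismissing, col) / length(col)

# Toy columns standing in for the real DataFrame (values are made up).
columns = Dict(
    "price"       => ["100", "80", missing, "120"],
    "last_review" => [missing, "19/12/2019", missing, "11/06/2018"],
    "Column17"    => [missing, missing, missing, missing],
)

threshold = 0.9  # assumed cut-off, not taken from the notebook

# Names of the columns whose missing fraction exceeds the threshold.
to_drop = sort([name for (name, col) in columns if missing_fraction(col) > threshold])

println(to_drop)
```

On the toy data only `Column17` is flagged. With a real DataFrame, the same idea would presumably iterate over `names(df)` and `df[!, name]` instead of the dictionary.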

`Column17` and `Column18` are almost entirely empty (99.99% and 100% missing), so they are dropped.

In [112]:
size(df)

(44794, 18)

In [113]:
select!(df, Not([:Column17, :Column18]))
size(df)


(44794, 16)

`last_review` is a date column and `reviews_per_month` is a float that gives the average number of reviews per month. Both are still stored as strings at this point, so `missing` values are replaced by the dummy date "01/01/1900" and by the string "0", respectively.
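
To illustrate the substitution and the typed values it eventually enables, here is a small sketch on toy vectors using only Base and the `Dates` standard library. The sample values are made up, and the day/month/year reading of the date strings is an assumption based on the placeholder format.

```julia
using Dates

# Toy data standing in for the two columns (values are made up).
last_review = ["19/12/2019", missing, "11/06/2018"]
reviews_per_month = ["0.21", missing, "1.5"]

# Replace `missing` with the placeholder values.
last_review_filled = coalesce.(last_review, "01/01/1900")
reviews_filled = coalesce.(reviews_per_month, "0")

# Parse the strings into typed values; day/month/year order is an assumption.
fmt = dateformat"dd/mm/yyyy"
dates = [Date(s, fmt) for s in last_review_filled]
rates = parse.(Float64, reviews_filled)

println(dates)
println(rates)
```

The placeholder date 1900-01-01 then acts as a sentinel for "never reviewed", which is worth remembering before computing anything like time since the last review.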

In [114]:
df[:, :last_review] =  coalesce.(df[:, :last_review], "01/01/1900");

In [115]:
df[:, :reviews_per_month] =  coalesce.(df[:, :reviews_per_month], "0");

Row 22695 has the values of `last_review` and `reviews_per_month` swapped:

In [116]:
df[22694:22696, [:last_review,:reviews_per_month]]

| Row | last_review (String?) | reviews_per_month (String?) |
| --- | --- | --- |
| 1 | 01/01/1900 | 0 |
| 2 | 13 | 11/06/2018 |
| 3 | 19/12/2019 | 0.21 |


Even with the two values swapped back, the row would report an unusually high 13 reviews per month. Since the data may be faulty, the row is deleted as an outlier.
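
Rows like this one could also be located automatically instead of by inspection. The sketch below is an illustration on toy data, not the notebook's method: `is_date` and `is_number` are hypothetical helpers, and the `dd/mm/yyyy` pattern is an assumption about the date layout.

```julia
# Hypothetical check: does the string look like a dd/mm/yyyy date?
is_date(s) = occursin(r"^\d{2}/\d{2}/\d{4}$", s)

# Hypothetical check: does the string parse as a floating-point number?
is_number(s) = tryparse(Float64, s) !== nothing

# Toy data mirroring the three rows shown above.
last_review = ["01/01/1900", "13", "19/12/2019"]
reviews_per_month = ["0", "11/06/2018", "0.21"]

# A row is well formed when both columns hold the expected kind of value.
ok = is_date.(last_review) .& is_number.(reviews_per_month)
suspect_rows = findall(.!ok)

println(suspect_rows)
```

A regular expression is used for the date check because it requires the complete `dd/mm/yyyy` layout, so a bare number such as "13" is rejected.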

In [117]:
df = df[setdiff(1:end, 22695), :];

In [118]:
df[22694:22696, [:last_review,:reviews_per_month]]

| Row | last_review (String?) | reviews_per_month (String?) |
| --- | --- | --- |
| 1 | 01/01/1900 | 0.0 |
| 2 | 19/12/2019 | 0.21 |
| 3 | 01/01/1900 | 0.0 |


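Now the column can be converted to `Float64`. The `MethodError` below shows that `parse` was handed `Float64` values, which suggests the column had already been converted when this cell was run.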
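
Because `parse` only accepts strings, a conversion that also tolerates values that are already numeric can be sketched with two small methods. `to_float` is a hypothetical helper introduced here for illustration, and the sample vectors are made up.

```julia
# Hypothetical helper: convert a value to Float64 whether it is a string or a number.
to_float(x::AbstractString) = parse(Float64, x)
to_float(x::Real) = Float64(x)

# Toy inputs: the same values stored as strings and as numbers.
as_strings = ["0", "0.21", "13"]
already_numeric = [0.0, 0.21, 13.0]

println(to_float.(as_strings))
println(to_float.(already_numeric))
```

Broadcasting `to_float` over the column gives the same result on a first run and on a re-run, so the cell becomes safe to execute more than once.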

In [120]:
# Parse each string into a Float64 and replace the column
df[!, :reviews_per_month] = parse.(Float64, df[!, :reviews_per_month]);

LoadError: MethodError: no method matching parse(::Type{Float64}, ::Float64)
Closest candidates are:
  parse(::Type{T}, !Matched::AbstractString; kwargs...) where T<:Real at parse.jl:376