# Train-Test split

The data is split into training and testing sets once and saved to disk, so that all three models receive the same input (format, data, and columns).

In [1]:
using CSV
using Random
using Statistics
using DataFrames

In [20]:
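# Recent CSV.jl versions require the sink type (here DataFrame) as the second argument
df_2020_oct = CSV.read("./data/treated data/2020_october_listings_encoded.csv", DataFrame);
df_2019_oct = CSV.read("./data/treated data/2019_october_listings_encoded.csv", DataFrame);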


In [21]:
Random.seed!(3);
function test_train_split(df)
    # Shuffle the rows so the split is random (reproducible via the seed above)
    df_shuffle = df[shuffle(1:nrow(df)), :]

    # Number of training observations (70/30 split)
    train_proportion = 0.7
    n = nrow(df_shuffle)
    ntrain = round(Int, train_proportion * n)

    # All columns but price
    features = select(df_shuffle, Not(:price))

    # Only price column
    target = df_shuffle[:, :price]

    # The first ntrain shuffled rows form the training set, the rest the test set
    train_x = features[1:ntrain, :]
    test_x = features[ntrain+1:end, :]
    train_y = DataFrame(price = target[1:ntrain])
    test_y = DataFrame(price = target[ntrain+1:end])

    return train_x, test_x, train_y, test_y
end

test_train_split (generic function with 1 method)

In [22]:
function save_test_train_split_as_csv(train_x, test_x, train_y, test_y, date)
    CSV.write(string("./data/train test data/", date, "_train_x.csv"), train_x)
    CSV.write(string("./data/train test data/", date, "_test_x.csv"), test_x)
    CSV.write(string("./data/train test data/", date, "_train_y.csv"), train_y)
    CSV.write(string("./data/train test data/", date, "_test_y.csv"), test_y)
    print("All files saved")
end

save_test_train_split_as_csv (generic function with 1 method)

In [23]:
train_x, test_x, train_y, test_y = test_train_split(df_2020_oct);

In [24]:
save_test_train_split_as_csv(train_x, test_x, train_y, test_y, "2020_october")

All files saved

In [25]:
train_x, test_x, train_y, test_y = test_train_split(df_2019_oct);

In [26]:
save_test_train_split_as_csv(train_x, test_x, train_y, test_y, "2019_october")

All files saved
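
A model notebook would then read a matching pair of files back and confirm that features and target still line up. The following is a minimal sketch of that round trip, not the project's actual loading code: it writes a small toy dataset to a temporary directory (the `toy_` file names and the columns `a` and `b` are hypothetical) and assumes a recent CSV.jl in which `CSV.read` takes the sink type as its second argument.

```julia
using CSV
using DataFrames

# Hypothetical toy data standing in for a saved train_x / train_y pair
dir = mktempdir()
toy_x = DataFrame(a = 1:7, b = 8:14)
toy_y = DataFrame(price = collect(10.0:10.0:70.0))

CSV.write(joinpath(dir, "toy_train_x.csv"), toy_x)
CSV.write(joinpath(dir, "toy_train_y.csv"), toy_y)

# Read the pair back, as a downstream notebook would
loaded_x = CSV.read(joinpath(dir, "toy_train_x.csv"), DataFrame)
loaded_y = CSV.read(joinpath(dir, "toy_train_y.csv"), DataFrame)

# Features and target must have the same number of rows,
# and the target must not have leaked into the features
rows_match = nrow(loaded_x) == nrow(loaded_y)
price_absent = !("price" in names(loaded_x))
```

Checking both conditions right after loading catches a mismatched file pair before any model is fitted.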

**Important!**

When loading these datasets, complete the following steps before fitting a model:
1. Handle non-numeric variables (date, room_type, host_name...) by encoding or dropping them.
2. Drop columns that you will not use (for example, one of the one-hot encoded neighbourhood columns).
3. Select the time range of interest (from April onwards).
4. Convert the resulting DataFrame into an array.

See `5.1 Model Development I` for reference.
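
The checklist above can be sketched on a toy DataFrame. This is an illustration under stated assumptions, not the code from `5.1 Model Development I`: the column names `neighbourhood_A`, `neighbourhood_B`, `neighbourhood_C` and `minimum_nights`, the example values, and the choice to drop (rather than encode) `date` and `host_name` are all hypothetical. The row filter is applied before the date column is dropped, and the same mask is applied to the target so that features and prices stay aligned.

```julia
using DataFrames
using Dates

# Hypothetical stand-ins for a loaded train_x / train_y pair
train_x = DataFrame(
    date = ["2020-03-15", "2020-04-10", "2020-05-02"],
    host_name = ["Ana", "Ben", "Cy"],
    neighbourhood_A = [1, 0, 0],
    neighbourhood_B = [0, 1, 0],
    neighbourhood_C = [0, 0, 1],
    minimum_nights = [2, 3, 1],
)
train_y = DataFrame(price = [80.0, 120.0, 95.0])

# Step 3: keep rows from April onwards; reuse the mask for the target
dates = Date.(train_x.date)            # ISO yyyy-mm-dd strings parse directly
keep = dates .>= Date(2020, 4, 1)
x = train_x[keep, :]
y = train_y[keep, :]

# Steps 1 and 2: drop the non-numeric columns and one one-hot column
x = select(x, Not([:date, :host_name, :neighbourhood_A]))

# Step 4: convert to plain arrays for model fitting
X = Matrix{Float64}(x)
y_vec = Vector{Float64}(y.price)
```

Keeping the boolean mask `keep` in a variable is the design choice that matters here: filtering `train_x` alone would silently misalign rows and prices.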