# Test new data
The client has provided a new query for obtaining expiry data (`./expiry_prepped_data.sql`).

This notebook verifies that the query returns data that is usable by the existing scripts.


In general, the steps for running queries provided by the client are:

0. Ensure with the client that they have set you up with the necessary permissions to query their project's tables.
1. Make sure the query definition points to the correct project (`radixbi-249015`) by prepending it to every table reference (FROM clause) that fails to mention it (i.e. change `prediction_vendors.predictions` to `radixbi-249015.prediction_vendors.predictions`). Otherwise, BigQuery will default to whichever project you're working in.
2. In BigQuery, run the query.
3. In BigQuery, save the results to a BigQuery table (this will create a table in your project).
4. In R, use the bigrquery package to load the data into memory:<br>
``sql <- paste("SELECT * FROM `radix2020.expiry.new_test`")
new_test_df <- bq_table_download(bq_project_query("radix2020", sql))``
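
Step 1 can be tedious for long queries, so here is a minimal sketch of how the prefixing might be automated. The helper `qualify_tables()` is hypothetical (it is not part of bigrquery or of the existing scripts), and it assumes table references appear as plain `dataset.table` names directly after `FROM` or `JOIN`. It does not parse SQL, so comments, CTE names, and unusual identifiers are not handled; check the result by eye before running it.

```r
# Hypothetical helper (an assumption, not an existing function): prefix
# two-part table references (dataset.table) that follow FROM or JOIN with a
# project ID. References that already carry a project are left untouched.
qualify_tables <- function(sql, project) {
  # (from|join), whitespace, optional backtick, dataset.table, optional
  # backtick, and no further name characters after it
  pattern <- "(?i)\\b(from|join)(\\s+)`?([a-z_][a-z0-9_]*\\.[a-z_][a-z0-9_]*)`?(?![.\\w`-])"
  # Backticks are added because the project ID contains a hyphen
  replacement <- paste0("\\1\\2`", project, ".\\3`")
  gsub(pattern, replacement, sql, perl = TRUE)
}

sql <- "SELECT * FROM prediction_vendors.predictions"
qualified <- qualify_tables(sql, "radixbi-249015")
cat(qualified, "\n")
```

Running the helper a second time on its own output should leave the query unchanged, which is a quick way to confirm that nothing gets prefixed twice.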

I've created a new BigQuery table (`radix2020.expiry.new_test`) that contains just the first 100 results of this query.

## TODO for John:

1. Modify load_prep_data_expiry_2.R so that the date filter is dynamic.
    - We don't want to analyze data that is 'premature': renewal status lags by 90 days (so domains that expired today won't have a correct renewal status until 90 days from today).
    - In general, we want to analyze 5 quarters of data.
    - Implement the above as script arguments (like min date and max date) to kill two birds with one stone.
    - We could also just bake this into the initial data pull (i.e. modify the date constraints in 2 above).
2. Create a script that, when given the name of a "local" BigQuery table (such as radix2020.expiry.new_test), first runs load_prep_data_expiry_2.R and then predictions_metalearning.R. This will generate a file of predictions across all models.
3. Create a cleaner version of training_metalearning.R.
    - One that uses the predictions from (2) (maybe as a script argument?).
    - It may be split into two scripts: one that trains the metalearning model and one that generates predictions on a new dataset.
    - It is a good idea to create a separate script file that contains functions (similar to how predictions_metalearning.R is kept "lean" and easy to read).
4. Ensure the output is in line with the Radix request.
    - expiry data columns + prediction column + model name column + model version/info column
    - write to a BigQuery table
    - see 16_fallbackmeta_analysis_deliv.ipynb for part of this
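
For item 1, one possible way to make the date filter dynamic is sketched below. The function `expiry_window()` and its argument names are hypothetical; the sketch only encodes the two rules stated above (a 90-day renewal lag and 5 quarters of 90 days each). Unlike the hard-coded filter in load_prep_data_expiry_2.R, it does not round the lower bound to the first of the month.

```r
# Hypothetical helper (an assumption, not part of the existing scripts):
# derive the expiry_date window from a reference date.
expiry_window <- function(as_of = Sys.Date(), lag_days = 90,
                          n_quarters = 5, days_per_quarter = 90) {
  # Latest expiry date whose renewal status should already be settled
  max_date <- as.Date(as_of) - lag_days
  # Go back n_quarters * days_per_quarter days from there
  min_date <- max_date - n_quarters * days_per_quarter
  list(min_date = min_date, max_date = max_date)
}

# A reference date of 2020-11-30 gives the window 2019-06-09 to 2020-09-01
w <- expiry_window(as.Date("2020-11-30"))
cat("Window:", format(w$min_date), "to", format(w$max_date), "\n")

# The bounds could then replace the hard-coded dates, for example:
# expiry_df <- expiry_df %>% filter(expiry_date >= w$min_date & expiry_date <= w$max_date)
```

If the script is run with `Rscript`, the reference date could be read from `commandArgs(trailingOnly = TRUE)` and passed in as `as_of`, which would cover the "script argument" idea without changing the function.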

In [7]:
library(bigrquery)
library(plotly)
library(data.table)
library(stringr)
library(readr)
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:data.table’:

    between, first, last


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [2]:
objects()

# Pull new data 02/09

In [5]:
# The following workaround does not seem to be necessary for this new dataset
# https://community.exploratory.io/t/google-bigquery-import-fails-with-invalid-value-at-start-index-type-uint64-1e-05-invalid/1901
# options(scipen = 20)

In [2]:
sql <- paste("SELECT * FROM `radix2020.expiry.new_test`")
new_test_df <- bq_table_download(bq_project_query("radix2020", sql))

In [3]:
dim(new_test_df)

In [4]:
# Following writes data to an RDS on the virtual machine and then copies it to an already created Google Cloud Storage bucket location
# saveRDS(expiry_20200902_20201102_20201127,"../../data/output/datapull_20201127/expiry_20200902_20201102_20201127")
# system("gsutil cp /home/jupyter/local/Domains_202003/data/output/datapull_20201127/* gs://data_outputt/output/")

# Test Scripts

### load_prep_data_expiry_2.R

In [12]:
cat("Loading data...")
expiry_df <- new_test_df  # readRDS("/home/jupyter/Domains_202003/data/output/expiry_20190601_20200901_20201116_excl")
cat("Loaded", expiry_df %>% nrow(),"rows\n")

# select most recent 5Q [1 quarter = 90 days, 5 quarters = 450 days ]
# 450 days before 20200901 is 20190609 ... round off to 20190601
cat("Removing", expiry_df %>% filter(expiry_date < as.Date("2019-06-01") | expiry_date > as.Date("2020-09-01")) %>% tally() %>% pull(n), "rows due to expiry_date constraints\n")
expiry_df <- expiry_df %>% filter(expiry_date >= as.Date("2019-06-01") & expiry_date <= as.Date("2020-09-01"))

# keep only rows with renewed_count == 1 (the message below counts rows with renewed_count > 1)
cat("Removing", expiry_df %>% filter(renewed_count>1) %>% tally() %>% pull(n) ,"rows due to renewed_count constraints\n")
expiry_df <- expiry_df %>% filter(renewed_count==1)

# remove rows where gibb_score is NA
cat("Removing", expiry_df %>% filter(is.na(gibb_score)) %>% tally() %>% pull(n) ,"rows due to missing gibb_score\n")
expiry_df <- expiry_df %>% filter(!is.na(gibb_score))
cat("... now dataset min(creation_date) is ", expiry_df %>% summarise(min(creation_date)) %>% pull(1) %>% as.character(),".\n")

# add necessary columns (non-positive reg_arpt is set to 0.0001 so that log() is defined)
expiry_df <- expiry_df %>% mutate(reg_arpt = ifelse(reg_arpt <= 0, 0.0001, reg_arpt),
                                  log_reg_arpt = log(reg_arpt),
                                  tld_registrar_index = tolower(paste(tld, reseller, sep = "")))

# test/train split 
set.seed(123) 
smp_siz = floor(0.8*nrow(expiry_df))
train_ind = sample(seq_len(nrow(expiry_df)),size = smp_siz) 
expiry_train_df = expiry_df[train_ind,] 
expiry_test_df = expiry_df[-train_ind,]

# split into lists
expiry_list <- split(expiry_df, expiry_df$tld_registrar_index)
expiry_train_list <- split(expiry_train_df, expiry_train_df$tld_registrar_index)
expiry_test_list <- split(expiry_test_df, expiry_test_df$tld_registrar_index)


Loading data...Loaded 100 rows
Removing 68 rows due to expiry_date constraints
Removing 4 rows due to renewed_count constraints
Removing 0 rows due to missing gibb_score
... now dataset min(creation_date) is  2018-06-08 .


In [13]:
dim(expiry_df)
dim(expiry_train_df)
dim(expiry_test_df)
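
The filter-and-report pattern in load_prep_data_expiry_2.R repeats for each constraint, and the reported count and the actual filter are written separately. A minimal sketch of a helper that ties the two together is shown below. `filter_report()` is a hypothetical name, the toy data are made up for illustration, and the sketch uses base R only so that it runs without extra packages.

```r
# Hypothetical helper (an assumption, not part of the existing scripts):
# keep rows where `keep` is TRUE and report how many rows were dropped.
filter_report <- function(df, keep, reason) {
  keep <- !is.na(keep) & keep  # an NA condition drops the row, as dplyr::filter() does
  cat("Removing", sum(!keep), "rows due to", reason, "\n")
  df[keep, , drop = FALSE]
}

# Toy data standing in for expiry_df (values are made up)
toy_df <- data.frame(
  expiry_date   = as.Date(c("2019-05-01", "2019-07-01", "2020-01-15",
                            "2020-03-01", "2020-10-01")),
  renewed_count = c(1, 2, 1, 1, 1),
  gibb_score    = c(0.5, 0.1, 0.3, NA, 0.9)
)

toy_df <- filter_report(toy_df,
                        toy_df$expiry_date >= as.Date("2019-06-01") &
                          toy_df$expiry_date <= as.Date("2020-09-01"),
                        "expiry_date constraints")
toy_df <- filter_report(toy_df, toy_df$renewed_count == 1,
                        "renewed_count constraints")
toy_df <- filter_report(toy_df, !is.na(toy_df$gibb_score),
                        "missing gibb_score")
```

Because the count is computed from the same condition that does the filtering, the message cannot drift away from what was actually removed.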

### predictions_metalearning
Not excluding any low-volume tld-re's using tld_registrar_excl() because the dataset is already so small.
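
With so few rows and no exclusions, a random 80/20 split can leave some tld-re's with rows in the test set but none in the training set. How `train_all()` and `pred_all()` treat such groups is not shown in this notebook, so a quick check may be worthwhile. The sketch below is one way to do it; `group_coverage()` is a hypothetical helper and the toy group names are made up.

```r
# Hypothetical helper (an assumption, not part of the existing scripts):
# rows per group in train and test, flagging groups that have no training rows.
group_coverage <- function(train_df, test_df, key = "tld_registrar_index") {
  groups <- union(train_df[[key]], test_df[[key]])
  n_train <- as.integer(table(factor(train_df[[key]], levels = groups)))
  n_test <- as.integer(table(factor(test_df[[key]], levels = groups)))
  data.frame(group = groups, n_train = n_train, n_test = n_test,
             test_only = n_train == 0 & n_test > 0,
             stringsAsFactors = FALSE)
}

# Toy stand-ins for expiry_train_df and expiry_test_df
train_toy <- data.frame(tld_registrar_index = c("a", "a", "b"),
                        stringsAsFactors = FALSE)
test_toy <- data.frame(tld_registrar_index = c("a", "c"),
                       stringsAsFactors = FALSE)

coverage <- group_coverage(train_toy, test_toy)
print(coverage)
```

In the notebook this would be called as `group_coverage(expiry_train_df, expiry_test_df)`; any row with `test_only` set to `TRUE` marks a tld-re for which no model could have been trained on its own data.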

In [None]:
# load & prep input data
source('/home/jupyter/Domains_202003/scripts/orig/functions_models.R')
source('/home/jupyter/Domains_202003/scripts/phaseII_03_forest/functions_eval.R')
source('/home/jupyter/Domains_202003/scripts/phaseII_06_fallbacksupp/functions_metalearning.R')
# source('/home/jupyter/Domains_202003/scripts/phaseII_06_fallbacksupp/load_prep_data_expiry_2.R')
# defines expiry_df & list of expiry_20180101_20190331
# as well as expiry_train_df, expiry_test_df,  expiry_train_list, expiry_test_list


# define output folder
fullDir='/home/jupyter/Domains_202003/data/output/models_20201104'
dir.create(fullDir)
dir.create(file.path(fullDir,'preds'))

# define tld-re's for training
tld_reseller_list = expiry_train_df %>%  distinct(tld_registrar_index) %>% pull(tld_registrar_index)
tld_registrar_excl_list = list() #tld_registrar_excl(train_list = expiry_train_list)

# train & save models
tld_reseller_list = train_all(  tld_reseller_list,
                                tld_registrar_excl_list,
                                train_list = expiry_train_list,
                                test_list = expiry_test_list,
                                model_agg_glm = NULL, 
                                model_agg_rf = NULL,
                                fullDir)   

# define tld-re's for testing
tld_reseller_list = expiry_test_df %>%  distinct(tld_registrar_index) %>% pull(tld_registrar_index)
tld_registrar_excl_list = list() #tld_registrar_excl(train_list = expiry_train_list)

# predict based on saved models
preds_df <- pred_all(tld_reseller_list, tld_registrar_excl_list,
                     test_list = expiry_test_list,
                     modelDir=fullDir,
                     fullDir=fullDir)

# write.csv(preds_df, file=file.path(fullDir,'preds','preds.csv'),row.names = FALSE)


In [None]:
preds_df