#### Load functions

The `%run` line below loads several notebooks that are needed to run the entire pipeline. These notebooks define the `config` variable and the functions used by the pipeline. The functions are stored in several files in the `src` folder, organized as follows:
* *`data`* contains functions for loading and processing data:
  * `get_config`: loads the `config` variable that is used throughout the code. It contains the paths of the input files, information on how to treat each data source, and the output paths. Note that this is the only file that defines a variable rather than functions. Because of how Databricks loads notebooks, this variable remains available for the rest of the execution.
  * `get_raw_data`: loads the functions that read the raw data files. These data files have previously been processed only to remove sensitive columns. This file also contains auxiliary functions, such as `format_as_character`, which normalizes text strings (makes them lowercase, removes leading and trailing spaces and appropriately replaces non-Unicode characters). In `config`, under the `data_params` section, the user can find the specification for each data set, containing information such as:
    * The columns to drop / keep.
    * The columns to rename.
    * How to format the columns (character, integer, float, id).
  * `get_processed_data`: takes the output of the `get_raw_data` functions and further processes the data sets. So far, only a function for the Service Requests table has been necessary.
  * `test_wrappers_for_data_loading`: helper functions that are used for different tests, mainly within the `get_raw_data` functions.
* *`features`* contains functions for creating the master table:
  * `customer_base`: creates a table with the ids (tenant and unit, named `htenant` and `hunit` respectively) and the appropriate date (column named `id_date`). For each date, only individual customers whose contract expires in three months appear. This function considers only the contract start and end dates, not the check out date.
  * `target`: creates a table with the move out date for all individual customers. 
  * `add_features`: functions that add variables to the master table: 
    * Demographic, such as the coordinates of the country of origin of the tenant.
    * Unit features, such as the square footage or boolean variables based on the description (e.g., `is_economic_unit` or `is_luxury_unit`).
    * Average price features, which contain ratios of the rent and of the rent per square foot to different averages (global, per community, per unit type).
    * Service requests features.
  * `master_table`: uses the other functions in this section to create the master table, and also contains a function, `get_feats`, that returns the features from the master table. The master table is created as follows:
    * It creates the customer base (the table with the customers that need to be scored each month).
    * It adds, through a left join, the check out date.
    * It filters out customers who churned before their scoring date.
    * It creates the column with the target.
    * It adds, through a left join, the demographic features.
    * It adds, through a left join, the unit features.
    * It adds, through a left join, the service request features.
    * It creates additional variables, such as the price per square foot.
    * It adds, through a left join, the ratios with the average prices.
* *`models`*:
  * `dataset_split`: defines a function for splitting the master table in train, dev and test, returning the features matrix as well as the target.
  * `model_metrics`: defines functions for evaluating the performance of the model, through the per decile metrics and the most important variables data frame.
  * `model_definition`: defines a function that returns a model object from the `sklearn` library with the appropriate parameters, as specified in the `config` dictionary.
* *`visualization`*: 
  * `plots`: defines a function for plotting the ROC curves.
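
The text normalization attributed to `format_as_character` above can be illustrated with a short sketch. The real function lives in `get_raw_data` and is not shown in this notebook, so the name `format_as_character_sketch` is hypothetical, and the choice to transliterate accented characters to ASCII is an assumption about what "appropriately replaces" means.

```python
import unicodedata


def format_as_character_sketch(value):
    """Normalize one text value: transliterate to ASCII, strip, lowercase.

    Hypothetical stand-in for `format_as_character`; the real rules may differ.
    """
    if value is None:
        return None
    # NFKD splits an accented letter into a base letter plus a combining mark,
    # so dropping the non-ASCII characters keeps the base letter.
    decomposed = unicodedata.normalize("NFKD", str(value))
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.strip().lower()


print(format_as_character_sketch("  São Paulo "))  # prints "sao paulo"
```

Working on single values keeps the sketch independent of any data frame library; in the pipeline the same logic would presumably be applied column by column.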

In [0]:
dbutils.fs.mkdirs("FileStore/tables/raw/")

In [0]:
%run /Users/ummeaafia@outlook.com/src/load_functions

In [0]:
config['paths']['raw']['coords']

In [0]:
config['data_params']
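
The `data_params` section is described earlier as holding, for each data set, the columns to keep or drop, the columns to rename, and the format of each column. The actual keys are not visible in this notebook, so the following is only a guess at how such a specification could be applied with pandas. The keys `keep`, `rename` and `format`, the data set name `units`, the column names and the function `apply_data_params` are all hypothetical.

```python
import pandas as pd

# Hypothetical specification for one data set; the real `config` may differ.
example_data_params = {
    "units": {
        "keep": ["HUNIT", "SQFT", "UNIT_TYPE"],
        "rename": {"HUNIT": "hunit", "SQFT": "sqft", "UNIT_TYPE": "unit_type"},
        "format": {"hunit": "id", "sqft": "float", "unit_type": "character"},
    }
}


def apply_data_params(df, spec):
    """Keep, rename and format the columns of `df` according to `spec`."""
    out = df[spec["keep"]].rename(columns=spec["rename"])
    for col, kind in spec["format"].items():
        if kind == "float":
            # Values that cannot be parsed become NaN instead of raising.
            out[col] = pd.to_numeric(out[col], errors="coerce").astype(float)
        elif kind == "integer":
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
        elif kind == "id":
            out[col] = out[col].astype(str).str.strip()
        elif kind == "character":
            out[col] = out[col].astype(str).str.strip().str.lower()
        else:
            raise ValueError(f"Unknown format '{kind}' for column '{col}'")
    return out
```

Keeping the specification in `config` rather than in the loading functions means a new column can be added or renamed without touching the code.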

#### Reading parameters from `config`

In the following cell, several parameters are read from the `config` dictionary, under `model_params`:
* `n_months_target` indicates the number of months after scoring during which a churn counts as a positive observation. For example, if scoring in January with `n_months_to_end_for_predicting = 3`, customers whose contracts end between April 1st and April 30th are considered for scoring, and a customer is labeled as a churner if he or she churns within the `n_months_target` months that start at the scoring date. If `n_months_target = 5`, positive observations include churners between January 1st and May 31st.
* `min_train_date` determines the earliest possible training date. It is set to June 2018 because earlier data (April and May) seemed to follow different distributions.
* `max_months_for_training` is the maximum number of months allowed for the training set.
* `n_months_to_end_for_predicting` determines how many months before the end of the contract the model will be scored. For instance, if `n_months_to_end_for_predicting = 3`, customers whose contract ends between April 1st and April 30th will be scored.
* `target_col` is the name of the column in the master table that contains the target for the model.
* `model_algorithm` is the name of the algorithm to use. Please note that for an algorithm to be used, it needs to be loaded from the `sklearn` library. So far, the code only loads `ExtraTreesClassifier` and `RandomForestClassifier`, as they had the best performance in the model development phase.
* `model_params` defines the parameters for the model (e.g., the number of trees to use).

This cell also includes a test to make sure that `n_months_target > n_months_to_end_for_predicting`.
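
The two windows in the January example can be worked out explicitly. The sketch below is not part of the pipeline: the function name `scoring_windows` and the keys of the returned dictionary are made up for illustration, and it assumes both windows are aligned to whole calendar months, as in the example.

```python
import pandas as pd


def scoring_windows(scoring_date, n_months_target, n_months_to_end_for_predicting):
    """Return the contract-end window and the target window for a scoring month."""
    if n_months_target <= n_months_to_end_for_predicting:
        raise ValueError(
            "n_months_target must be greater than n_months_to_end_for_predicting"
        )
    # Align to the first day of the scoring month.
    start = pd.Timestamp(scoring_date).normalize().replace(day=1)
    contract_from = start + pd.DateOffset(months=n_months_to_end_for_predicting)
    # MonthEnd(0) rolls forward to the last day of the same month.
    contract_to = contract_from + pd.offsets.MonthEnd(0)
    target_to = start + pd.DateOffset(months=n_months_target - 1) + pd.offsets.MonthEnd(0)
    return {
        "contract_end_from": contract_from,
        "contract_end_to": contract_to,
        "target_from": start,
        "target_to": target_to,
    }


windows = scoring_windows("2019-01-01", n_months_target=5, n_months_to_end_for_predicting=3)
```

With these inputs the contract-end window is April 1st to April 30th and the target window is January 1st to May 31st, so the target window covers the month in which the contract ends. That is presumably why the test requires `n_months_target` to be the larger value.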

In [0]:
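# Read the model parameters from the `config` dictionary.
n_months_target = config['model_params']['n_months_target']
min_train_date = pd.to_datetime(config['model_params']['min_train_date'])
max_months_for_training = config['model_params']['max_months_for_training']
n_months_to_end_for_predicting = config['model_params']['n_months_to_end_for_predicting']
target_col = config['model_params']['target_col']
model_algorithm = config['model_params']['model_algorithm']
model_params = config['model_params']['model_params']

# Sanity check: the target window must be longer than the prediction horizon.
assert n_months_target > n_months_to_end_for_predicting, \
    'n_months_target must be greater than n_months_to_end_for_predicting'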

#### Determining dates

In the following cell, the date parameters are calculated, partly from the parameters read in the previous cell:

* `scoring_date` is the date for the leads, computed as the first day of the month in which the code is executed. For example, with `n_months_to_end_for_predicting = 3`, if the code is executed in June it returns the leads for customers scored in June, whose contracts end between September 1st and September 30th.
* `test_date` is the latest month for which the model performance can be evaluated. For this to be possible, at least `n_months_target` months must have passed. For instance, if scoring in June with `n_months_target = 5`, this month would be January, as the time window for the target ends on May 31st.
* `dev_date` is the month prior to the test date, which is used as an additional evaluation of the model performance.
* `train_end` is the last month of the training set, and is the month prior to the dev date.
* `train_start` is the first month of the training set. If possible, it will be a date such that there are `max_months_for_training` between `train_start` and `train_end`, but always making sure `train_start` is not before `min_train_date`.

In [0]:
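# replace(day=1) pins the date to the first day of the current month; subtracting
# MonthBegin(1) would jump to the previous month when the code runs on the 1st.
scoring_date = pd.to_datetime('today').floor('d').replace(day=1)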
test_date = scoring_date - pd.offsets.MonthBegin(n_months_target)
dev_date = test_date - pd.offsets.MonthBegin(1)
train_end = dev_date - pd.offsets.MonthBegin(1)
train_start = max(train_end - pd.offsets.MonthBegin(max_months_for_training - 1), min_train_date)
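
To check the date logic without depending on the day of execution, the same calculation can be wrapped in a function that takes the execution date as an argument. The function name `compute_pipeline_dates` and the parameter values in the example are illustrative only. The sketch pins the scoring date with `replace(day=1)`, an assumption made here so that an execution on the first day of a month still maps to that month.

```python
import pandas as pd


def compute_pipeline_dates(today, n_months_target, max_months_for_training, min_train_date):
    """Compute the scoring, test, dev and train dates for a given execution date."""
    scoring_date = pd.Timestamp(today).normalize().replace(day=1)
    # The test month is the latest one whose target window has fully elapsed.
    test_date = scoring_date - pd.offsets.MonthBegin(n_months_target)
    dev_date = test_date - pd.offsets.MonthBegin(1)
    train_end = dev_date - pd.offsets.MonthBegin(1)
    # Use up to `max_months_for_training` months, but never start before `min_train_date`.
    train_start = max(
        train_end - pd.offsets.MonthBegin(max_months_for_training - 1),
        pd.Timestamp(min_train_date),
    )
    return {
        "scoring_date": scoring_date,
        "test_date": test_date,
        "dev_date": dev_date,
        "train_end": train_end,
        "train_start": train_start,
    }


dates = compute_pipeline_dates(
    today="2019-06-15",
    n_months_target=5,
    max_months_for_training=12,
    min_train_date="2018-06-01",
)
```

For an execution on June 15th, 2019 this gives a scoring date of June 2019, a test date of January 2019, a dev date of December 2018 and a training set that ends in November 2018. Twelve months of training would start in December 2017, so `train_start` is clipped to `min_train_date`, June 2018.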

#### Generating master table

This is done with the previously defined parameters and dates. In the following cells, the features (`feats`) of the master table (`mt`) are also determined, and some tests are performed to make sure the dates are consistent.

In [0]:
mt = get_master_table(
    config=config,
    n_months_target=n_months_target,
    n_months_to_end_for_predicting=n_months_to_end_for_predicting,
    from_date=train_start,
    to_date=scoring_date,
    target_col=target_col,
)

In [0]:
feats = get_feats(mt, test_for_NAs = True)
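
The first steps listed for `master_table` (customer base, left join with the check out date, filtering of early churners, target creation) can be sketched with pandas as follows. This is not the code behind `get_master_table`: the function name `build_master_table_sketch` and the column name `check_out_date` are assumptions, while `htenant`, `hunit` and `id_date` are taken from the description of `customer_base`.

```python
import pandas as pd


def build_master_table_sketch(customer_base, check_outs, n_months_target, target_col="target"):
    """Join check out dates onto the customer base and derive a churn target."""
    # A left join keeps every customer, with NaT for those who never checked out.
    mt = customer_base.merge(check_outs, on=["htenant", "hunit"], how="left")

    # Drop customers who had already churned before their scoring date.
    churned_before = mt["check_out_date"].notna() & (mt["check_out_date"] < mt["id_date"])
    mt = mt.loc[~churned_before].copy()

    # Positive if the check out falls within `n_months_target` months of scoring.
    window_end = mt["id_date"] + pd.DateOffset(months=n_months_target)
    in_window = mt["check_out_date"].notna() & (mt["check_out_date"] < window_end)
    mt[target_col] = in_window.astype(int)
    return mt
```

The feature tables would then be attached with further left joins on the same keys, so that a customer without, say, service requests still stays in the master table.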

#### Modelling phase

In this part of the code, several tasks are performed:

* Loading the model library.
* Defining the train, dev and test sets.
* Training the model.

In [0]:
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

In [0]:
X_train, X_dev, X_test, X_leads, y_train, y_dev, y_test = get_mt_sets(
    mt=mt,
    train_end_date=train_end,
    dev_end_date=dev_date,
    test_end_date=test_date,
    leads_date=scoring_date,
    feats=feats,
    target_col=target_col,
)
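
The implementation of `get_mt_sets` is not shown here. As a rough illustration of a split by date, the sketch below assumes that the training set contains every month up to `train_end`, and that the dev, test and leads sets each contain exactly one month. The function name `split_by_date_sketch` is hypothetical.

```python
import pandas as pd


def split_by_date_sketch(mt, train_end, dev_date, test_date, leads_date, feats, target_col):
    """Split a master table into feature matrices and targets by `id_date`."""
    masks = {
        "train": mt["id_date"] <= train_end,
        "dev": mt["id_date"] == dev_date,
        "test": mt["id_date"] == test_date,
        "leads": mt["id_date"] == leads_date,
    }
    X = {name: mt.loc[mask, feats] for name, mask in masks.items()}
    # The leads have no target yet: their observation window lies in the future.
    y = {name: mt.loc[masks[name], target_col] for name in ("train", "dev", "test")}
    return X, y
```

Under this assumption the months between the test date and the scoring date are left out of every set, because their target window has not fully elapsed.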

In [0]:
model = get_model(model_algorithm, model_params)
model.fit(X_train[feats], y_train)
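
`get_model` is defined in `model_definition` and is not shown in this notebook. One common way to turn an algorithm name and a parameter dictionary into a model object is a lookup table of supported classes, sketched below. The names `SUPPORTED_MODELS` and `get_model_sketch` are hypothetical, and the parameter values in the example are arbitrary.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Only the two algorithms mentioned in the text are supported in this sketch.
SUPPORTED_MODELS = {
    "ExtraTreesClassifier": ExtraTreesClassifier,
    "RandomForestClassifier": RandomForestClassifier,
}


def get_model_sketch(model_algorithm, model_params):
    """Return an unfitted sklearn model for the given name and parameters."""
    if model_algorithm not in SUPPORTED_MODELS:
        raise ValueError(
            f"Unsupported algorithm '{model_algorithm}'; "
            f"choose one of {sorted(SUPPORTED_MODELS)}"
        )
    return SUPPORTED_MODELS[model_algorithm](**model_params)


example_model = get_model_sketch(
    "RandomForestClassifier", {"n_estimators": 10, "random_state": 0}
)
```

An explicit table makes an unsupported algorithm name fail early with a readable message, instead of failing later with a `NameError`.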

#### Generating model outputs

In the following lines:
* A scores column, named `preds`, is added to a copy of the master table.
* A dictionary of performances in train, dev and test is calculated and saved.
* Three plots with the ROC curves for the train, dev and test sets are generated. These plots can be saved as images.
* The leads data frame is generated and saved.

In [0]:
mt_with_scores = add_model_scores(mt, model, feats, target_col, X_train, X_dev, X_test, X_leads)

In [0]:
performance_dict = get_performance_dict(mt_with_scores, target_col, feats, model)

In [0]:
display(sqlContext.createDataFrame(performance_dict['dev']))
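
The notebook does not show which per-decile metrics `get_performance_dict` computes. A typical per-decile table ranks the observations by score, cuts them into ten equally sized groups and reports the churn rate, lift and cumulative capture of each group. The sketch below follows that recipe; the function name `per_decile_metrics_sketch` and its column names are assumptions.

```python
import pandas as pd


def per_decile_metrics_sketch(scores, target, n_bins=10):
    """Churn rate, lift and cumulative capture per score decile (1 = highest scores)."""
    df = pd.DataFrame({"score": scores, "target": target})
    # Ranking first breaks ties, so every decile gets the same number of rows.
    ranks = df["score"].rank(method="first", ascending=False)
    df["decile"] = pd.qcut(ranks, n_bins, labels=False) + 1
    out = (
        df.groupby("decile")
        .agg(n=("target", "size"), churners=("target", "sum"), churn_rate=("target", "mean"))
        .reset_index()
    )
    # Lift compares the churn rate of the decile with the overall churn rate.
    out["lift"] = out["churn_rate"] / df["target"].mean()
    # Share of all churners found in this decile and the ones above it.
    out["cum_capture"] = out["churners"].cumsum() / df["target"].sum()
    return out
```

A model that ranks well concentrates the churners in the first deciles, which shows up as a lift well above 1 at the top of the table and a cumulative capture that approaches 1 quickly.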

In [0]:
from sklearn.metrics import roc_curve, auc

In [0]:
plot_roc_curve(mt_with_scores, 'train', target_col)

In [0]:
plot_roc_curve(mt_with_scores, 'dev', target_col)

In [0]:
plot_roc_curve(mt_with_scores, 'test', target_col)
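
`plot_roc_curve` is defined in `plots` and is not reproduced here. The quantities behind such a plot can be obtained with the `roc_curve` and `auc` functions imported above, as in this small sketch. The helper name `roc_points` and the toy labels and scores are for illustration only.

```python
from sklearn.metrics import auc, roc_curve


def roc_points(y_true, scores):
    """Return the false positive rates, true positive rates and the area under the curve."""
    fpr, tpr, _thresholds = roc_curve(y_true, scores)
    return fpr, tpr, auc(fpr, tpr)


fpr, tpr, roc_auc = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(round(roc_auc, 2))  # prints 0.75
```

Plotting `tpr` against `fpr` for the train, dev and test sets gives the three curves. A large gap between the train curve and the other two would suggest overfitting.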

In [0]:
output_folder_name = get_output_file_name(config = config, file_name = 'TBR' + '.xlsx', file_dir = None, is_model = True)

In [0]:
save_dict_to_excel(
    dict_to_save=performance_dict,
    config=config,
    full_file_name=output_folder_name.replace('TBR', config['model_params']['model_performance_excel_name']),
    is_model=False,
)

In [0]:
leads_df = get_leads_df(mt_with_scores)

In [0]:
leads_df.to_excel(output_folder_name.replace('TBR', config['model_params']['leads_df_excel_name']))