# Training Script
In this notebook, we create the training script whose hyperparameters will be tuned. This script is stored alone in a `scripts` directory both for ease of reference and because the Azure ML SDK limits the contents of this directory to at most 300 MB.

The notebook cells are each appended in turn to the training script, so it is essential that you run the notebook's cells _in order_ for the script to run correctly. If you edit this notebook's cells, be sure to preserve the blank lines at the start and end of the cells, as they prevent the contents of consecutive cells from being improperly concatenated.

The script sections are
- [import libraries](#import),
- [define utility functions and classes](#utility),
- [define the script input parameters](#parameters),
- [load and prepare the training and tuning data](#data),
- [define the training pipeline](#pipeline),
- [train the model](#train),
- [score the test data](#score), and
- [compute the test data performance](#performance).

[The final cell](#run) runs the script using the training data created by [the first notebook](00_Data_Prep.ipynb).

Start by creating the `scripts` directory, if it does not already exist.

In [None]:
!mkdir -p scripts

## Load libraries <a id='import'></a>

In [None]:
%%writefile scripts/TrainTestClassifier.py

from __future__ import print_function
import os
import warnings
import argparse
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.feature_extraction import text
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.externals import joblib
from azureml.core import Run
import azureml.core
print('azureml.core.VERSION={}'.format(azureml.core.VERSION))


## Define utility functions and classes <a id='utility'></a>

In [None]:
%%writefile --append scripts/TrainTestClassifier.py

def score_rank(scores):
    """Compute the ranks of the scores."""
    return pd.Series(scores).rank(ascending=False)


def label_index(label, label_order):
    """Compute the index of label in label_order."""
    loc = np.where(label == label_order)[0]
    if loc.shape[0] == 0:
        return None
    return loc[0]


def label_rank(label, scores, label_order):
    """Compute the rank of label using the scores."""
    loc = label_index(label, label_order)
    if loc is None:
        return len(scores) + 1
    return score_rank(scores)[loc]


def log_evaluation(logger, metric_name, period=1):
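    """Create a callback that logs the evaluation results.

    Parameters
    ----------
    logger : function
        The function called as ``logger(metric_name, value)`` to record a value.
    metric_name : str
        The name under which the value is logged.
    period : int, optional (default=1)
        The period to log the evaluation results.

    Returns
    -------
    callback : function
        The callback that logs one minus the first evaluation result every
        ``period`` iteration(s).
    """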
    def callback(env):
        """internal function"""
        if period > 0 and env.evaluation_result_list and (env.iteration + 1) % period == 0:
            logger(metric_name,
                   1-env.evaluation_result_list[0][2])
    callback.order = 10
    return callback

warnings.filterwarnings(action='ignore', category=UserWarning, module='lightgbm')


## Define the input parameters <a id='parameters'></a>
One of the most important parameters is `estimators`, the number of estimators, which lets you trade off accuracy, modeling time, and model size. The table below shows how run time, model size, and accuracy vary with the number of estimators. The default value is 100.

| Estimators | Run time (s) | Size (MB) | Accuracy@1 | Accuracy@2 | Accuracy@3 |
|------------|--------------|-----------|------------|------------|------------|
|        100 |           40 |  2 | 25.02% | 38.72% | 47.83% |
|       1000 |          177 |  4 | 46.79% | 60.80% | 69.11% |
|       2000 |          359 |  7 | 51.38% | 65.93% | 73.09% |
|       4000 |          628 | 12 | 53.39% | 67.40% | 74.74% |
|       8000 |          904 | 22 | 54.62% | 67.77% | 75.35% |

Other parameters that may be useful to tune include the following:
* `ngrams`: the maximum n-gram size for features, an integer of at least 1 (default 1),
* `min_child_samples`: the minimum number of samples in a leaf, an integer of at least 1 (default 20),
* `match`: the maximum number of training examples per duplicate question, an integer of at least 2 (default 20), and
* `unweighted`: whether to skip the sample weights that compensate for unbalanced data, `Yes` or `No` (default `No`, which applies the weights).
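
As a rough illustration of the default weighting, the sketch below reproduces the formula the script applies, `rows / (classes * class_count)`, in plain Python rather than pandas. The function name `balanced_weights` and the toy labels are made up for this example.

```python
from collections import Counter


def balanced_weights(labels):
    """Return one weight per class so that every class has equal total weight."""
    counts = Counter(labels)
    n_rows = len(labels)
    n_classes = len(counts)
    # Same formula as the script: rows / (classes * rows in this class).
    return {label: n_rows / (n_classes * count)
            for label, count in counts.items()}


# Toy data (an assumption for illustration): three non-matches, one match.
labels = [0, 0, 0, 1]
weights = balanced_weights(labels)

# Each row receives the weight of its class.
sample_weight = [weights[label] for label in labels]
print(weights)
print(sum(sample_weight))  # approximately the number of rows
```

With these weights the rare class contributes as much total weight as the common one, while the weights still sum to roughly the number of rows.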

The performance of the estimator is measured on held-aside test data. The statistic reported is the rank of the correct result in the list of results sorted by score, summarized as the fraction of test questions whose correct result appears within the top 1, 2, and so on. The `rank` parameter sets the largest such cutoff that is reported.
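
A minimal sketch of this statistic follows, in plain Python. It is a simplification of the script's utility functions: ties are broken by counting only strictly higher scores, whereas the script relies on the pandas `rank` method. The names `rank_of_label` and `accuracy_at`, and the answer IDs, are hypothetical.

```python
def rank_of_label(label, scores, label_order):
    """Return the 1-based rank of the label's score, highest score first.

    A label missing from label_order ranks one past the end of the list.
    """
    if label not in label_order:
        return len(scores) + 1
    target = scores[label_order.index(label)]
    # Simplification: count strictly higher scores, ignoring ties.
    return 1 + sum(score > target for score in scores)


def accuracy_at(ranks, k):
    """Return the fraction of ranks that fall within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


# Hypothetical answer IDs, in the order the scores are stored.
label_order = [101, 102, 103]
examples = [
    (101, (0.7, 0.2, 0.1)),  # correct answer has the top score
    (103, (0.5, 0.3, 0.2)),  # correct answer has the lowest score
    (102, (0.6, 0.3, 0.1)),  # correct answer is second
    (999, (0.4, 0.4, 0.2)),  # answer never seen in training
]
ranks = [rank_of_label(label, scores, label_order)
         for label, scores in examples]

for k in range(1, 4):
    print('Accuracy @{} = {:.2%}'.format(k, accuracy_at(ranks, k)))
```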

In [None]:
%%writefile --append scripts/TrainTestClassifier.py

if __name__ == '__main__':
    
    parser = argparse.ArgumentParser(description='Fit and evaluate a model'
                                     ' based on train-test datasets.')
    parser.add_argument('--data-folder', help='the path to the data',
                        dest='data_folder', default='.')
    parser.add_argument('--inputs', help='the inputs directory',
                        default='data')
    parser.add_argument('--data', help='the training dataset name',
                        default='balanced_pairs_train.tsv')
    parser.add_argument('--tune', help='the tune dataset name',
                        default='balanced_pairs_tune.tsv')
    parser.add_argument('--test', help='the test dataset name',
                        default='balanced_pairs_test.tsv')
    parser.add_argument('--estimators',
                        help='the number of learner estimators',
                        type=int, default=100)
    parser.add_argument('--min_child_samples',
                        help='the minimum number of samples in a child(leaf)',
                        type=int, default=20)
    parser.add_argument('--ngrams',
                        help='the maximum size of word ngrams',
                        type=int, default=1)
    parser.add_argument('--match',
                        help='the maximum number of duplicate matches',
                        type=int, default=20)
    parser.add_argument('--unweighted',
                        help="'Yes' to train without instance weights",
                        default='No')
    parser.add_argument('--rank',
                        help='the maximum rank of correct answers',
                        type=int, default=3)
    parser.add_argument('--outputs', help='the outputs directory',
                        default='outputs')
    parser.add_argument('--save', help='the model file base name', default='None')
    parser.add_argument('--verbose',
                        help='the verbosity of the estimator',
                        type=int, default=-1)
    args = parser.parse_args()
    

## Load and prepare the training data <a id='data'></a>

In [None]:
%%writefile --append scripts/TrainTestClassifier.py

    # Get a run logger.
    run = Run.get_context()

    # What to name the metric logged
    metric_name = "accuracy"

    print('Prepare the training data.')
    
    # Paths to the input data.
    data_path = args.data_folder
    inputs_path = os.path.join(data_path, args.inputs)
    data_path = os.path.join(inputs_path, args.data)
    tune_path = os.path.join(inputs_path, args.tune)
    test_path = os.path.join(inputs_path, args.test)

    # Paths for the output data.
    outputs_path = args.outputs
    model_path = os.path.join(outputs_path, '{}.pkl'.format(args.save))
    labels_path = os.path.join(outputs_path, '{}.csv'.format(args.save))

    # Create the outputs folder.
    os.makedirs(outputs_path, exist_ok=True)

    # Define the input data columns.
    feature_columns = ['Text_x', 'Text_y']
    label_column = 'Label'
    group_column = 'Id_x'
    answerid_column = 'AnswerId_y'
    name_columns = ['Id_x', 'Id_y']
    weight_column = 'Weight'

    # Load the training data.
    print('Reading {}'.format(data_path))
    train = pd.read_csv(data_path, sep='\t', encoding='latin1')

    # Limit the number of training duplicate matches.
    train = train[train.n < args.match]

    # Report on the dataset.
    print('train: {:,} rows with {:.2%} matches'
          .format(train.shape[0], train[label_column].mean()))
    
    # Load the tuning data.
    print('Reading {}'.format(tune_path))
    tune = pd.read_csv(tune_path, sep='\t', encoding='latin1')

    # Report on the dataset.
    print('tune: {:,} rows with {:.2%} matches'
          .format(tune.shape[0], tune[label_column].mean()))
    
    # Compute instance weights.
    if args.unweighted == 'Yes':
        print('No sample weights.')
        labels = train[label_column].unique()
        weight = pd.Series([1.0] * labels.shape[0], labels)
    else:
        print('Using sample weights.')
        label_counts = train[label_column].value_counts()
        weight = train.shape[0] / (label_counts.shape[0] * label_counts)
        print(weight)
    train[weight_column] = train[label_column].apply(lambda x: weight[x])

    # Select and format the training data.
    train_X = train[feature_columns]
    train_y = train[label_column]
    train_group = train[group_column]
    train_sample_weight = train[weight_column]
    train_names = train[name_columns]
    tune_X = tune[feature_columns]
    tune_y = tune[label_column]
    tune_group = tune[group_column]
    

## Define the featurization and estimator <a id='pipeline'></a>
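
The `ngrams` argument becomes the range `(1, ngrams)` passed to the TF-IDF vectorizers. To give a feel for what that range selects, here is a small sketch in plain Python. It assumes simple lowercasing and whitespace splitting, which only approximates the tokenization scikit-learn performs, and the function name `word_ngrams` is invented for this example.

```python
def word_ngrams(text, ngram_range):
    """List the word n-grams of text for every size in ngram_range (inclusive)."""
    low, high = ngram_range
    # Assumption: lowercase and split on whitespace; the real vectorizer
    # uses its own token pattern.
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]


# With --ngrams 2 the range is (1, 2): single words plus adjacent pairs.
grams = word_ngrams('How to sort arrays', (1, 2))
print(grams)
```

Larger values of `ngrams` add longer phrases as features, which grows the vocabulary and therefore the feature matrix.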

In [None]:
%%writefile --append scripts/TrainTestClassifier.py

    print('Define the model pipeline.')

    # Select the training hyperparameters.
    n_estimators = args.estimators
    min_child_samples = args.min_child_samples
    if args.ngrams > 0:
        ngram_range = (1, args.ngrams)
    else:
        ngram_range = None
    accuracy_rank = args.rank

    # Verify that the hyperparameter settings are valid.
    if n_estimators <= 0:
        raise Exception('n_estimators must be > 0')
    if min_child_samples <= 0:
        raise Exception('min_child_samples must be > 0')
    if (ngram_range is None
        or type(ngram_range) is not tuple
        or len(ngram_range) != 2
        or ngram_range[0] < 1
        or ngram_range[0] > ngram_range[1]):
        raise Exception('ngram_range must be a tuple with two integers (a, b) where a > 0 and a <= b')
    if accuracy_rank < 1:
        raise Exception("accuracy_rank must be at least 1.")

    # Define the featurization pipeline
    featurization = [
        (column, text.TfidfVectorizer(ngram_range=ngram_range), column)
        for column in feature_columns]
    features = ColumnTransformer(featurization)

    # Define the estimator.
    estimator = lgb.LGBMClassifier(n_estimators=n_estimators,
                                   min_child_samples=min_child_samples,
                                   verbose=args.verbose)

    # Put them together into the model pipeline.
    model = Pipeline([
        ('features', features),
        ('estimator', estimator)
    ])
    
    # Report the featurization.
    print('Estimators={:,}'.format(n_estimators))
    print('Ngram range={}'.format(ngram_range))
    print('Min child samples={}'.format(min_child_samples))
    print('Accuracy rank={}'.format(accuracy_rank))
    

## Train the model <a id='train'></a>

In [None]:
%%writefile --append scripts/TrainTestClassifier.py

    print('Fitting the model.')

    # Collect the ordered AnswerId for computing scores.
    labels = sorted(train[answerid_column].unique())
    label_order = pd.DataFrame({'label': labels})

    # Featurize the train and tune data.  It's important to only fit the
    # featurizer on the training data, so that the tuning data is treated the
    # same way the testing data will be later on.
    train_X_features = model.named_steps["features"].fit_transform(train_X)
    tune_X_features = model.named_steps["features"].transform(tune_X)

    # Fit the model.
    model.named_steps["estimator"].fit(
        train_X_features, train_y, sample_weight=train_sample_weight,
        feature_name=model.named_steps["features"].get_feature_names(),
        eval_set=[(tune_X_features, tune_y)], eval_names=["tune"],
        callbacks=[log_evaluation(run.log, metric_name, period=100)],
        verbose=False
    )

    # Write the model to file.
    if args.save != 'None':
        print('Saving the model to {}'.format(model_path))
        joblib.dump(model, model_path)
        print('{}: {:.2f} MB'
              .format(model_path, os.path.getsize(model_path)/(2**20)))
        print('Saving the labels to {}'.format(labels_path))
        label_order.to_csv(labels_path, sep='\t', index=False)
        

## Score the test data using the model <a id='score'></a>
This produces a dataframe of scores with one row per duplicate question.
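
The reshaping step can be sketched without pandas as follows. The rows are sorted by duplicate ID and then by answer ID, so that each position in a tuple of probabilities lines up with the sorted answer IDs, and the probabilities are then collected per duplicate. The row values and ID strings are made up, and `itertools.groupby` stands in for the pandas `groupby` used in the script.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (duplicate question ID, candidate answer ID, probability).
rows = [
    ('dupe-2', 11, 0.10),
    ('dupe-1', 12, 0.30),
    ('dupe-1', 11, 0.60),
    ('dupe-2', 12, 0.80),
]

# Order by duplicate ID, then answer ID, as the script does before grouping.
rows.sort(key=itemgetter(0, 1))

# Collect one tuple of probabilities per duplicate question.
scores = {
    dupe_id: tuple(probability for _, _, probability in group)
    for dupe_id, group in groupby(rows, key=itemgetter(0))
}
print(scores)
```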

In [None]:
%%writefile --append scripts/TrainTestClassifier.py

    print('Scoring the test data.')

    # Read the test data.
    print('Reading {}'.format(test_path))
    test = pd.read_csv(test_path, sep='\t', encoding='latin1')
    print('test: {:,} rows with {:.2%} matches'
          .format(test.shape[0], test[label_column].mean()))

    # Collect the model predictions.
    test_X = test[feature_columns]
    test['probabilities'] = model.predict_proba(test_X)[:, 1]

    # Order the testing data by dupe Id and question AnswerId.
    test.sort_values([group_column, answerid_column], inplace=True)

    # Extract the ordered probabilities.
    probabilities = (
        test.probabilities
        .groupby(test[group_column], sort=False)
        .apply(lambda x: tuple(x.values)))

    # Get the individual records.
    output_columns_x = ['Id_x', 'AnswerId_x', 'Text_x']
    test_score = (test[output_columns_x]
                  .drop_duplicates()
                  .set_index(group_column))
    test_score['probabilities'] = probabilities
    test_score.reset_index(inplace=True)
    test_score.columns = ['Id', 'AnswerId', 'Text', 'probabilities']
    

## Report the model's performance statistics on the test data <a id='performance'></a>

In [None]:
%%writefile --append scripts/TrainTestClassifier.py

    print("Evaluating the model's performance.")
    
    # Compute the ranks of the correct answers.
    test_score['Ranks'] = test_score.apply(lambda x:
                                           label_rank(x.AnswerId,
                                                      x.probabilities,
                                                      label_order.label),
                                           axis=1)

    # Report the fraction of correct answers ranked within each cutoff.
    for i in range(1, args.rank+1):
        print('Accuracy @{} = {:.2%}'
              .format(i, (test_score['Ranks'] <= i).mean()))
    mean_rank = test_score['Ranks'].mean()
    print('Mean Rank {:.4f}'.format(mean_rank))

    # Log the metric.
    accuracy_at_rank = (test_score['Ranks'] <= args.rank).mean()
    run.log(metric_name, accuracy_at_rank)
        

## Run the script to see that it works <a id='run'></a>
This should take around five minutes.

In [None]:
%run -t scripts/TrainTestClassifier.py --estimators 1000 --match 5 --ngrams 2 --min_child_samples 10 --save FAQ-ranker

In [the next notebook](02_Run_Locally.ipynb), we set up and use the AML SDK to run the training script.