# Modeling
In this notebook, we create the training script whose hyperparameters will be tuned. Each notebook cell is appended in turn to the training script.

## Load libraries

In [1]:
%%writefile TrainTestClassifier.py
from __future__ import print_function
import os
import warnings
import argparse
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.feature_extraction import text
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
import joblib
from sklearn.base import BaseEstimator, TransformerMixin

Overwriting TrainTestClassifier.py


## Define utility functions and classes

In [2]:
%%writefile --append TrainTestClassifier.py

class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at provided
    key(s).

    The data are expected to be stored in a 2D data structure, where
    the first index is over features and the second is over samples,
    i.e.

    >>> len(data[keys]) == n_samples

    Please note that this is the opposite convention to scikit-learn
    feature matrices (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[keys]).  Examples include: a dict of lists, 2D numpy array,
    Pandas DataFrame, numpy record array, etc.

    >>> data = {'a': [1, 5, 2, 5, 2, 8],
    ...         'b': [9, 4, 1, 4, 1, 3]}
    >>> ds = ItemSelector(keys='a')
    >>> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample
    (e.g. a list of dicts).  If your data are structured this way,
    consider a transformer along the lines of
    `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    keys : hashable or list of hashable, required
        The key(s) corresponding to the desired value(s) in a mappable.

    """

    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, *args, **kwargs):
        if isinstance(self.keys, list):
            assert all(key in x for key in self.keys), 'Not all keys in data'
        else:
            assert self.keys in x, 'key not in data'
        return self

    def transform(self, data_dict, *args, **kwargs):
        return data_dict[self.keys]

    
def score_rank(scores):
    return pd.Series(scores).rank(ascending=False)


def label_index(label, label_order):
    loc = np.where(label == label_order)[0]
    if loc.shape[0] == 0:
        return None
    return loc[0]


def label_rank(label, scores, label_order):
    loc = label_index(label, label_order)
    if loc is None:
        return len(scores) + 1
    return score_rank(scores)[loc]


warnings.filterwarnings(action='ignore', category=UserWarning, module='lightgbm')

Appending to TrainTestClassifier.py


## Define the input parameters
One of the most important parameters is `estimators`, the number of estimators, which lets you trade off accuracy, modeling time, and model size. The table below shows how run time, model size, and accuracy vary with the number of estimators.

| Estimators | Run time (s) | Size (MB) | Accuracy@1 | Accuracy@2 | Accuracy@3 |
|------------|--------------|-----------|------------|------------|------------|
|        100 |           40 |  2 | 25.02% | 38.72% | 47.83% |
|       1000 |          177 |  4 | 46.79% | 60.80% | 69.11% |
|       2000 |          359 |  7 | 51.38% | 65.93% | 73.09% |
|       4000 |          628 | 12 | 53.39% | 67.40% | 74.74% |
|       8000 |          904 | 22 | 54.62% | 67.77% | 75.35% |
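
The diminishing returns in the table can be made explicit by computing the Accuracy@1 gained per additional second of run time between consecutive rows. The numbers below are copied from the table; the calculation itself is only an illustrative sketch.

```python
# (estimators, run time in seconds, Accuracy@1 in percent), from the table.
rows = [
    (100, 40, 25.02),
    (1000, 177, 46.79),
    (2000, 359, 51.38),
    (4000, 628, 53.39),
    (8000, 904, 54.62),
]

gains = []
for (n0, t0, a0), (n1, t1, a1) in zip(rows, rows[1:]):
    # Percentage points of Accuracy@1 gained per extra second of run time.
    gain = (a1 - a0) / (t1 - t0)
    gains.append(gain)
    print('{:>4} -> {:>4} estimators: {:.4f} points/s'.format(n0, n1, gain))
```

Each step up in the number of estimators buys less accuracy per second than the previous one.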

Other parameters that may be useful to tune include the following:
* `ngrams`: the maximum n-gram size for features, an integer of at least 1,
* `min_child_samples`: the minimum number of samples in a leaf, an integer greater than 1,
* `match`: the maximum number of training examples per duplicate question, an integer of at least 2, and
* `unweighted`: a boolean flag that turns off the sample weights used to compensate for unbalanced data.
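
To make the `ngrams` setting concrete, the sketch below lists the word n-grams covered by a range of `(1, n)` for a short sentence. It is a simplified illustration that assumes whitespace tokenization; `word_ngrams` is a hypothetical helper, and scikit-learn's `TfidfVectorizer` applies its own tokenization, so its actual vocabulary may differ.

```python
def word_ngrams(text, max_n):
    """Return all word n-grams of size 1..max_n (hypothetical helper).

    Assumes simple whitespace tokenization, which only approximates
    what a real vectorizer does.
    """
    tokens = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams


unigrams = word_ngrams('how to sort a list', 1)
bigrams = word_ngrams('how to sort a list', 2)
print(len(unigrams), len(bigrams))
print(bigrams)
```

Raising `ngrams` from 1 to 2 keeps every unigram and adds each adjacent word pair, so the feature space grows with the setting.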

The estimator's performance is measured on held-out test data. The reported statistic is the rank of the correct result in the list of results sorted by score, and the `rank` parameter sets the maximum rank for which accuracy is reported.
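
The ranking statistic can be sketched on toy data as follows. The scores and labels are made up for illustration, and `rank_of` is a hypothetical stand-in that mirrors the `score_rank` and `label_rank` helpers defined earlier, assuming that `scores[i]` is the score for `label_order[i]`.

```python
import numpy as np
import pandas as pd

# Toy example: three candidate answers and their scores for one question.
label_order = np.array([101, 205, 309])   # hypothetical AnswerIds, sorted
scores = (0.10, 0.70, 0.20)               # one score per candidate


def rank_of(label, scores, label_order):
    """Rank (1 = best) of the correct label among the scores."""
    loc = np.where(label_order == label)[0]
    if loc.shape[0] == 0:
        # Unknown label: place it after every candidate.
        return len(scores) + 1
    return pd.Series(list(scores)).rank(ascending=False)[loc[0]]


ranks = pd.Series([rank_of(205, scores, label_order),   # highest score
                   rank_of(309, scores, label_order),   # second highest
                   rank_of(999, scores, label_order)],  # not a candidate
                  dtype=float)

# Accuracy@k is the share of questions whose correct answer ranks <= k.
for k in (1, 2, 3):
    print('Accuracy @{} = {:.2%}'.format(k, (ranks <= k).mean()))
```

A correct answer that is missing from the candidate list gets a rank one past the end of the list, so it never counts toward any Accuracy@k.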

In [3]:
%%writefile --append TrainTestClassifier.py

if __name__ == '__main__':
    
    parser = argparse.ArgumentParser(description='Fit and evaluate a model'
                                     ' based on train-test datasets.')
    parser.add_argument('--data', help='the training dataset name',
                        default='balanced_pairs_train.tsv')
    parser.add_argument('--test', help='the test dataset name',
                        default='balanced_pairs_test.tsv')
    parser.add_argument('--estimators',
                        help='the number of learner estimators',
                        type=int, default=100)
    parser.add_argument('--min_child_samples',
                        help='the minimum number of samples in a child(leaf)',
                        type=int, default=20)
    parser.add_argument('--ngrams',
                        help='the maximum size of word ngrams',
                        type=int, default=1)
    parser.add_argument('--match',
                        help='the maximum number of duplicate matches',
                        type=int, default=20)
    parser.add_argument('--unweighted',
                        help='do not use instance weights',
                        action='store_true')
    parser.add_argument('--rank',
                        help='the maximum rank of correct answers',
                        type=int, default=3)
    parser.add_argument('--inputs', help='the inputs directory',
                        default='.')
    parser.add_argument('--outputs', help='the outputs directory',
                        default='.')
    parser.add_argument('--save', help='save the model',
                        action='store_true')
    parser.add_argument('--model', help='the model file', default='model.pkl')
    parser.add_argument('--instances', help='the instances file',
                        default='inst.txt')
    parser.add_argument('--labels', help='the labels file',
                        default='labels.txt')
    parser.add_argument('--verbose',
                        help='the verbosity of the estimator',
                        type=int, default=-1)
    args = parser.parse_args()
    

Appending to TrainTestClassifier.py


## Load and prepare the training data
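
Unless `--unweighted` is given, the next cell balances the classes by weighting each label with `n_rows / (n_labels * label_count)`, so that every label contributes the same total weight. A minimal sketch of that formula on made-up labels with a 20% match rate:

```python
import pandas as pd

# Toy labels: 8 non-matches and 2 matches (20% positive).
labels = pd.Series([0] * 8 + [1] * 2)

label_counts = labels.value_counts()
# Same formula as the training script: n_rows / (n_labels * label_count).
weight = labels.shape[0] / (label_counts.shape[0] * label_counts)
sample_weight = labels.map(weight)

print(weight.sort_index())
# Each class now carries the same total weight.
totals = sample_weight.groupby(labels).sum()
print(totals)
```

With a 20% match rate, non-matches get a weight of 0.625 and matches a weight of 2.5.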

In [4]:
%%writefile --append TrainTestClassifier.py

    print('Prepare the training data.')
    
    # Paths to the input data.
    inputs_path = args.inputs
    data_path = os.path.join(inputs_path, args.data)
    test_path = os.path.join(inputs_path, args.test)

    # Paths for the output data.
    outputs_path = args.outputs
    model_path = os.path.join(outputs_path, args.model)
    instances_path = os.path.join(outputs_path, args.instances)
    labels_path = os.path.join(outputs_path, args.labels)

    # Create the outputs folder.
    os.makedirs(outputs_path, exist_ok=True)

    # Load the data.
    print('Reading {}'.format(data_path))
    train = pd.read_csv(data_path, sep='\t', encoding='latin1')

    # Limit the number of training duplicate matches.
    train = train[train.n < args.match].copy()

    # Define the input data columns.
    feature_columns = ['Text_x', 'Text_y']
    label_column = 'Label'
    group_column = 'Id_x'
    answerid_column = 'AnswerId_y'
    name_columns = ['Id_x', 'Id_y']
    weight_column = 'Weight'

    # Report on the dataset.
    print('train: {:,} rows with {:.2%} matches'.format(
        train.shape[0], train[label_column].mean()))
    
    # Compute instance weights.
    if args.unweighted:
        print('No sample weights.')
        weight = pd.Series(1.0, index=train[label_column].unique())
    else:
        print('Using sample weights.')
        label_counts = train[label_column].value_counts()
        weight = train.shape[0]/(label_counts.shape[0]*label_counts)
        print(weight)
    train[weight_column] = train[label_column].apply(lambda x: weight[x])

    # Select and format the training data.
    train_X = train[feature_columns]
    train_y = train[label_column]
    sample_weight = train[weight_column]
    groups = train[group_column]
    names = train[name_columns]
    

Appending to TrainTestClassifier.py


## Define the featurization and estimator

In [5]:
%%writefile --append TrainTestClassifier.py

    print('Define the model pipeline.')

    # Select the training hyperparameters.
    n_estimators = args.estimators
    min_child_samples = args.min_child_samples
    if args.ngrams > 0:
        ngram_range = (1, args.ngrams)
    else:
        ngram_range = None

    # Verify that the hyperparameter values are valid.
    assert n_estimators > 0
    assert min_child_samples > 1
    assert ngram_range is not None
    assert type(ngram_range) is tuple and len(ngram_range) == 2
    assert ngram_range[0] > 0 and ngram_range[0] <= ngram_range[1]

    # Define the featurization pipeline.
    featurization = [
        (column,
         make_pipeline(ItemSelector(column),
                       text.TfidfVectorizer(ngram_range=ngram_range)))
        for column in feature_columns]
    features = FeatureUnion(featurization)

    # Define the estimator.
    estimator = lgb.LGBMClassifier(n_estimators=n_estimators,
                                   min_child_samples=min_child_samples,
                                   verbose=args.verbose)

    # Put them together into the model pipeline.
    model = Pipeline([
        ('features', features),
        ('model', estimator)
    ])
    
    # Report the featurization.
    print('Estimators={:,}'.format(n_estimators))
    print('Ngram range={}'.format(ngram_range))
    print('Min child samples={}'.format(min_child_samples))
    

Appending to TrainTestClassifier.py


## Train the model

In [6]:
%%writefile --append TrainTestClassifier.py

    print('Fitting the model.')

    # Fit the model.
    model.fit(train_X, train_y, model__sample_weight=sample_weight)

    # Write the model to file.
    if args.save:
        joblib.dump(model, model_path)
        print('{}: {:.2f} MB'.format(
            model_path, os.path.getsize(model_path)/(2**20)))
        

Appending to TrainTestClassifier.py


## Score the test data using the model
This produces a DataFrame with one row per duplicate question, holding that question's match probabilities ordered by `AnswerId`.
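
The reshaping step can be sketched on a toy frame: pair-level probabilities are sorted by question and candidate answer, then collapsed into one tuple per question. The column names follow the script; the values are made up.

```python
import pandas as pd

# Toy pair-level scores: two questions, each paired with two candidates.
test = pd.DataFrame({
    'Id_x': [11, 10, 10, 11],
    'AnswerId_y': [2, 2, 1, 1],
    'probabilities': [0.4, 0.9, 0.1, 0.6],
})

# Sort so every question lists its candidates in the same AnswerId order.
test = test.sort_values(['Id_x', 'AnswerId_y'])

# Collapse to one entry per question, holding a tuple of ordered scores.
per_question = (
    test['probabilities']
    .groupby(test['Id_x'], sort=False)
    .apply(lambda x: tuple(x.values)))

print(per_question)
```

Because every tuple follows the same sorted `AnswerId` order, position `i` in a tuple can later be matched to the `i`-th entry of the sorted label list.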

In [7]:
%%writefile --append TrainTestClassifier.py

    print('Scoring the test data.')

    # Read the test data.
    print('Reading {}'.format(test_path))
    test = pd.read_csv(test_path, sep='\t', encoding='latin1')
    print('test {:,} rows with {:.2%} matches'.format(
        test.shape[0], test[label_column].mean()))

    # Collect the model predictions.
    test_X = test[feature_columns]
    test['probabilities'] = model.predict_proba(test_X)[:, 1]

    # Order the testing data by dupe Id and question AnswerId.
    test.sort_values([group_column, answerid_column], inplace=True)

    # Extract the ordered probabilities.
    probabilities = (
        test.probabilities
        .groupby(test[group_column], sort=False)
        .apply(lambda x: tuple(x.values)))

    # Get the individual records.
    output_columns_x = ['Id_x', 'AnswerId_x', 'Text_x']
    test_score = (test[output_columns_x]
                  .drop_duplicates()
                  .set_index(group_column))
    test_score['probabilities'] = probabilities
    test_score.reset_index(inplace=True)
    test_score.columns = ['Id', 'AnswerId', 'Text', 'probabilities']
    

Appending to TrainTestClassifier.py


## Report the model's performance statistics on the test data

In [8]:
%%writefile --append TrainTestClassifier.py

    print("Evaluating the model's performance.")
    
    # Collect the ordered AnswerId for computing the scores.
    labels = sorted(train[answerid_column].unique())
    label_order = pd.DataFrame({'label': labels})

    # Rank the correct answers.
    test_score['Ranks'] = test_score.apply(lambda x:
                                           label_rank(x.AnswerId,
                                                      x.probabilities,
                                                      label_order.label),
                                           axis=1)

    # Compute the share of questions with the correct answer in the top i.
    for i in range(1, args.rank+1):
        print('Accuracy @{} = {:.2%}'.format(
            i, (test_score['Ranks'] <= i).mean()))
    mean_rank = test_score['Ranks'].mean()
    print('Mean Rank {:.4f}'.format(mean_rank))

    # Write the scored instances.
    if args.save:
        test_score.to_csv(instances_path, sep='\t', index=False,
                          encoding='latin1')
        label_order.to_csv(labels_path, sep='\t', index=False)
        

Appending to TrainTestClassifier.py


## Run the script to see that it works

In [9]:
%run -t TrainTestClassifier.py --match 5 --estimators 1000 --ngrams 2 --min_child_samples 10

Prepare the training data.
Reading ./balanced_pairs_train.tsv
train: 33,415 rows with 20.00% matches
Using sample weights.
0    0.625
1    2.500
Name: Label, dtype: float64
Define the model pipeline.
Estimators=1,000
Ngram range=(1, 2)
Min child samples=10
Fitting the model.
Scoring the test data.
Reading ./balanced_pairs_test.tsv
test 287,014 rows with 0.55% matches
Evaluating the model's performance.
Accuracy @1 = 43.63%
Accuracy @2 = 58.91%
Accuracy @3 = 65.31%
Mean Rank 7.4350

IPython CPU timings (estimated):
  User   :     719.98 s.
  System :       2.81 s.
Wall time:     200.52 s.


In [the next notebook](02_Configure_Batch_AI.ipynb), we create a file to contain the Batch AI configuration we will use.