
Performing K-fold Cross-validation with Amazon Machine Learning

This document and the associated sample scripts show how to use Amazon Machine Learning's (Amazon ML’s) DataRearrangement parameter in the CreateDataSource* APIs to perform k-fold cross-validation for a binary classification model, using the AWS SDK for Python (Boto).

This example shows how to perform a 4-fold cross-validation using Amazon ML. Because k=4 in this example, the script generates four disjoint splitting ranges (folds) of equal length. Amazon ML takes one range and uses it to create an evaluation datasource, and then uses the data outside of the range to create a training datasource. This process is repeated three more times, once for each of the folds, resulting in four models and four evaluations. Amazon ML then averages the quality metrics from the four evaluations to produce a single metric that represents the aggregate model quality. In this example, the single metric is the mean of the area under curve (AUC) score for the four evaluations.
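
To make the splitting concrete, here is a minimal sketch (illustrative only; build_folds.py in this repository may generate its parameters differently) of how k=4 produces the four disjoint ranges and the two complementary DataRearrangement strings for each fold:

import json

def fold_ranges(k):
    # Generate k equal, disjoint percentage ranges covering 0-100.
    step = 100 // k
    return [(i * step, (i + 1) * step) for i in range(k)]

for begin, end in fold_ranges(4):
    # Evaluation datasource: records inside the range (complement = false).
    evaluation = {"splitting": {"percentBegin": begin, "percentEnd": end, "complement": False}}
    # Training datasource: records outside the range (complement = true).
    training = {"splitting": {"percentBegin": begin, "percentEnd": end, "complement": True}}
    print(json.dumps(evaluation))
    print(json.dumps(training))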

The single metric generated by cross-validating the model reflects how well the model succeeded in generalizing the patterns it found in the data (also known as the generalization performance). Cross-validating model evaluations helps you select model settings (also known as hyperparameters), without overfitting to the evaluation data, i.e., failing to generalize patterns. For example, say that we are deciding which value to use for the sgd.l2RegularizationAmount setting of a binary classification model, 1e-4 or 1e-5. To decide which value to use, use the sample scripts twice, once with sgd.l2RegularizationAmount = 1e-4 for all four folds, and once with sgd.l2RegularizationAmount = 1e-5 for all four folds. After you have the single metric from each of the two cross-validations, you can choose which sgd.l2RegularizationAmount value to use by comparing the two metrics. The model with the higher metric did a better job of generalizing the patterns in the data.
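
For illustration, training one fold's model with a candidate regularization value could look roughly like the following boto3 sketch (the repository's scripts use the older boto library, and the IDs below are placeholders, not values produced by the sample scripts):

import boto3

ml = boto3.client("machinelearning")

# Train one fold's binary model with a candidate regularization amount.
# "demo-fold-1-model" and "demo-fold-1-train" are placeholder IDs.
ml.create_ml_model(
    MLModelId="demo-fold-1-model",
    MLModelType="BINARY",
    Parameters={"sgd.l2RegularizationAmount": "1e-4"},
    TrainingDataSourceId="demo-fold-1-train",
)

Running the full cross-validation once per candidate value and comparing the averaged AUC metrics tells you which setting generalizes better.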

For information about cross-validation, see Cross-validation in the Amazon Machine Learning Developer Guide.

Setting Up

Install Python

The sample scripts are written in Python 3. Although we have tested them successfully on Python 2, we recommend using Python 3. To see which version of Python you have, run this command in your CLI:

python --version

Learn more about how to download the latest version of Python at Python.org.

Pull Dependent Libraries

The sample scripts run in an isolated Python environment. You create the environment with the virtualenv tool and then install the dependent package (boto) with pip into the ./local-python directory. The setup.sh script (in machine-learning-samples/k-fold-cross-validation/setup.sh) creates the environment and installs the package.

If you are a Python 3 developer, run:

source setup.sh

Note that Python 3 ships with pip and built-in virtual environment support, so no additional tools need to be installed.

If you are a Python 2 developer and do not already have virtualenv and pip tools in your local Python environment, you will need to install them before running the sample scripts. For example, if you are using Linux with apt-get, install them with the following command:

sudo apt-get update
sudo apt-get install python-pip python-virtualenv

Users of other operating systems and package managers can learn more about installing pip here, and about installing virtualenv here.

After you’ve installed the virtualenv and pip tools, run:

source setup.sh

Setup is complete. To exit the virtual environment, type deactivate. To clean up the dependent libraries after exiting, remove the ./local-python directory.

Configure AWS Credentials

Your AWS credentials must be stored in a ~/.boto or ~/.aws/credentials file. Your credential file should look like this:

[Credentials]
aws_access_key_id = YOURACCESSKEY
aws_secret_access_key = YOURSECRETKEY

To learn more about configuring your AWS credentials with Boto, go to Getting Started with Boto.

Add Sample Scripts

Get the samples by cloning this repository.

git clone https://github.com/awslabs/machine-learning-samples.git

After you have tried the sample scripts, you can integrate machine-learning-samples/k-fold-cross-validation/build_folds.py into your Python application by calling the build_folds function in the build_folds.py module.

Demo

The sample code includes two scripts, build_folds.py and collect_perf.py, in the machine-learning-samples/k-fold-cross-validation directory. The first script (build_folds.py) uses the DataRearrangement parameter of the CreateDataSourceFromS3, CreateDataSourceFromRedshift, or CreateDataSourceFromRDS APIs to create the training and evaluation datasources for the cross-validation models, and then trains and evaluates ML models using these datasources. All datasources, ML models, and evaluations are based on the sample data used in the Amazon ML tutorial. The second script (collect_perf.py) averages the quality metrics of the resulting evaluations to produce a single metric that you can use to compare models.

build_folds.py takes a resource name prefix and the number of folds as arguments, and generates the DataRearrangement parameters for each fold. It then uses the generated DataRearrangement parameters to create the datasources for training and evaluating the models. After it has created the datasources, it uses the training datasources to train the models, and uses the evaluation datasources to evaluate the models. For each fold, build_folds.py creates two datasources, one model, and one evaluation.

For example, let’s say that we want to perform a 4-fold cross-validation. We would run build_folds.py with the following command:

python build_folds.py --name 4-fold-cv-demo 4

In this example, the --name 4-fold-cv-demo argument (optional) defines the prefix that Amazon ML adds to the names of all of the entities created by build_folds.py (the datasources, models, and evaluations). The 4 argument (required) specifies the number of folds that Amazon ML creates for the cross-validation process. Replace these values with your own values when you execute build_folds.py.

When build_folds.py executes, it displays the IDs of the objects that it creates. For the datasources, it also displays the DataRearrangement parameter that it used to create the datasource. Here is an example of the DataRearrangement parameter from one of the four folds:

{
    "splitting": {
        "complement": true,
        "percentBegin": 25,
        "percentEnd": 50,
        "strategy": "random",
        "strategyParams": {
            "randomSeed": "RANDOMSEED"
        }
    }
}
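
Because complement is true here, this rearrangement selects the roughly 75% of records outside the 25-50 percent range, so it belongs to the fold's training datasource. The companion evaluation datasource for the same fold would use the same range with complement set to false (shown here for illustration; the exact output of build_folds.py may differ):

{
    "splitting": {
        "complement": false,
        "percentBegin": 25,
        "percentEnd": 50,
        "strategy": "random",
        "strategyParams": {
            "randomSeed": "RANDOMSEED"
        }
    }
}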

The DataRearrangement string contains the following parameters:

  • complement – This parameter is a boolean flag. To use data within the given range to create a datasource, set complement to false. To use data from outside of the given range to create a datasource, set complement to true. For example, suppose that the given range is [25, 50] and complement is false. Amazon ML selects only records within the 25-50 percent range. The selected range has roughly 25% of the input data’s records. In contrast, suppose that the given range is the same ([25, 50]) but that complement is true. In that case, Amazon ML selects the records outside of the range, and the selected range has roughly 75% of the input data’s records.
  • percentBegin, percentEnd – These parameters specify the beginning and end of the range of input data that is used to create a datasource. Valid values are [0, 100].
  • strategy – When the percentBegin and percentEnd parameters are specified, this parameter determines how records from your data are selected for inclusion in the specified range. The sequential strategy splits data sequentially, based on record order, while the random strategy selects records in a pseudo-random order. If your data is already shuffled, choose sequential. If your data has not been shuffled, we recommend the random splitting strategy, so that the distribution of data is consistent between the training and evaluation datasources. For more information about splitting strategies, see Splitting Your Data in the Amazon Machine Learning Developer Guide.
  • randomSeed – (Optional) This parameter specifies the seed value that is used by the random strategy. The default is an empty string. In this sample code, the seed value is defined in the config.py module (in machine-learning-samples/k-fold-cross-validation/config.py). You can replace this value with your own string in the config.py file before you execute the script. For more information about random seeds, see Splitting Your Data in the Amazon Machine Learning Developer Guide.
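
For reference, creating a single training datasource with one of these rearrangement strings looks roughly like the following boto3 sketch (the repository's scripts use the older boto library; the bucket, schema location, and IDs below are placeholders):

import json
import boto3

ml = boto3.client("machinelearning")

# DataRearrangement for the training half of one fold (range 25-50, complement true).
rearrangement = {
    "splitting": {
        "percentBegin": 25,
        "percentEnd": 50,
        "complement": True,
        "strategy": "random",
        "strategyParams": {"randomSeed": "RANDOMSEED"},
    }
}

ml.create_data_source_from_s3(
    DataSourceId="4-fold-cv-demo-train-2",            # placeholder ID
    DataSourceName="4-fold-cv-demo training fold 2",  # placeholder name
    DataSpec={
        "DataLocationS3": "s3://your-bucket/banking.csv",               # placeholder
        "DataSchemaLocationS3": "s3://your-bucket/banking.csv.schema",  # placeholder
        "DataRearrangement": json.dumps(rearrangement),
    },
    ComputeStatistics=True,  # statistics are required for datasources used to train a model
)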

After build_folds.py finishes creating the evaluations, use collect_perf.py to collect the AUC scores from the four evaluations and average them to produce a single AUC metric. Non-binary models use a different type of evaluation score, such as macro-average F1 or RMSE, but this script handles only binary models. For example, let’s say that build_folds.py created the following four evaluations: 4-fold-cv-demo-eval-1, 4-fold-cv-demo-eval-2, 4-fold-cv-demo-eval-3, and 4-fold-cv-demo-eval-4. To run collect_perf.py, we would run the following command:

python collect_perf.py 4-fold-cv-demo-eval-1 4-fold-cv-demo-eval-2 4-fold-cv-demo-eval-3 4-fold-cv-demo-eval-4

collect_perf.py takes one argument per fold (the ID of that fold's evaluation), so for a 4-fold cross-validation you would execute it with four arguments. Replace these values with your own evaluation IDs when you execute collect_perf.py.
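
The core of what collect_perf.py computes can be sketched as follows (a hedged boto3 illustration, not the repository's boto-based code):

import sys
import boto3

ml = boto3.client("machinelearning")

def binary_auc(evaluation_id):
    # For a binary model, the AUC score is reported in the BinaryAUC property
    # of the evaluation's PerformanceMetrics.
    evaluation = ml.get_evaluation(EvaluationId=evaluation_id)
    return float(evaluation["PerformanceMetrics"]["Properties"]["BinaryAUC"])

scores = [binary_auc(evaluation_id) for evaluation_id in sys.argv[1:]]
print("mean AUC across folds:", sum(scores) / len(scores))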

In addition to the single metric, collect_perf.py displays the variance of the collected AUC scores and a sorted list of all of the scores, so that you can see their distribution. If the variance is high enough that it affects your ability to compare the AUC metric with another AUC metric, there are several things that you can do to try to reduce it:

  1. If you’re using a large number of folds, try reducing the number of folds. For example, instead of creating a 10-fold cross-validation, try creating a 5-fold cross-validation. Fewer folds means that each model is evaluated with a larger datasource, which reduces the variability of the data from datasource to datasource, and therefore reduces the variation between the evaluation scores.
  2. If you’re using a random splitting strategy, try changing the value of the config.RANDOM_STRATEGY_RANDOM_SEED parameter in the config.py file, to change the way data is selected for your datasources.
  3. If you’re using a sequential splitting strategy, try shuffling your data by using the random splitting strategy.
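
As a rough illustration of the spread check, the variance of the per-fold AUC scores can be computed like this (the scores below are placeholder values, not results from the sample data):

from statistics import mean, pvariance

# Placeholder per-fold AUC values, not output from the sample scripts.
scores = [0.91, 0.89, 0.93, 0.90]
print("mean AUC:", mean(scores))
print("variance:", pvariance(scores))
print("sorted AUCs:", sorted(scores))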

All resources created by these scripts are billed at the regular Amazon ML rates. For information about Amazon ML pricing, see Amazon Machine Learning Pricing. For information on how to delete your resources, see Clean Up in the tutorial in the Amazon Machine Learning Developer Guide.

For more information about how cross-validation works in Amazon ML, see Cross-Validation in the Amazon Machine Learning Developer Guide. For more information about the DataRearrangement parameter, see Data Rearrangement in the Amazon Machine Learning Developer Guide. To learn about performance metrics for Amazon ML, see PerformanceMetrics in the Amazon Machine Learning API Reference.