# Model Training Sweep
In this notebook, we will do the following:
- Run a W&B Sweep using existing resources, including:
    - A custom training/inference Docker image
    - Training data loaded into an S3 bucket
- Spin multiple instances up and down automatically to run the sweep in
  parallel.

In [2]:
import json
import os

import botocore
import boto3
import sagemaker
from sagemaker.estimator import Estimator

%pip install wandb
import wandb

Note: you may need to restart the kernel to use updated packages.


In [3]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
client = boto3.client(service_name="sagemaker")



## Define Data Version, Training Script, Metrics, Etc.
- The data version and prefix point to the directory of the S3 bucket (`pipplet-data`) that holds the training data for the text scoring models.

In [4]:
data_s3_uri = "s3://my-data-bucket/v1/"
image_uri = "98717289289.dkr.ecr.us-east-1.amazonaws.com/my-image:1.0.0"
training_host = image_uri.split("/")[-1]  # Image name and tag, here "my-image:1.0.0"
output_bucket = "my-model-artifacts-bucket"
experiment_id = "test"
features = [
  "DIS_COH",
  "VPs_per_doc",
  "VPs_per_sent",
  "VocabularyRichness3",
  "acoustic_model_score",
  "num_silences_per_token",
  "num_silences_per_utt",
  "num_types",
  "num_types_per_utt",
  "percent_stressed_words",
  "percent_vocalic",
  "pitch_range",
  "rPVI_consonantal",
  "silence_abs_mean_deviation",
  "temporalCuesCount",
  "varco_consonantal",
  "varco_vocalic",
]
hyperparameters = {
    "features": ",".join(features),
    "experiment-id": experiment_id,
    "min-score": "0",
    "max-score": "17",
    "id-column": "test_instance_id",
    "train-label-column": "averaged_score",
    "test-label-column": "score1",
    "second-human-score-column": "score2",
    "subgroup-columns": "question_set_id",
    "select-transformations": "true",
    "use-scaled-predictions": "true",
    "oversample-highest": "true",
}
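All hyperparameter values above are passed as strings, so the training code inside the image has to convert them back. The following is a minimal sketch of that conversion, assuming the container receives the same flat string-to-string mapping. The function name `parse_hyperparameters` and the conversion rules (boolean strings, comma-separated lists for `features` and `*-columns` keys, integer score bounds) are illustrative assumptions, not the actual logic in the image.

```python
def parse_hyperparameters(raw):
    """Convert a flat dict of string hyperparameters into typed values.

    Hypothetical helper: the rules below are assumptions made for
    illustration, not the conversion used by the real training image.
    """
    parsed = {}
    for key, value in raw.items():
        name = key.replace("-", "_")  # "min-score" -> "min_score"
        if value.lower() in {"true", "false"}:
            parsed[name] = value.lower() == "true"
        elif name == "features" or name.endswith("_columns"):
            # Comma-separated strings become lists, dropping empty items.
            parsed[name] = [item for item in value.split(",") if item]
        elif name in {"min_score", "max_score"}:
            parsed[name] = int(value)
        else:
            parsed[name] = value
    return parsed


example = parse_hyperparameters(
    {
        "features": "DIS_COH,num_types",
        "min-score": "0",
        "max-score": "17",
        "id-column": "test_instance_id",
        "subgroup-columns": "question_set_id",
        "select-transformations": "true",
    }
)
```

Keeping every value a string on the notebook side avoids surprises about how numbers or booleans are serialized on the way into the container.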

## Create `W&B` Sweep

In [8]:
wandb_required_env_vars = {
    "WANDB_PROJECT": "my-experiments",
    "WANDB_ENTITY": "my-entity",
    "WANDB_API_KEY": function_that_gets_api_key(),
    "WANDB_BASE_URL": "https://api.wandb.ai",
}
for env_var_name, env_var_value in wandb_required_env_vars.items():
    os.environ[env_var_name] = env_var_value
sweep_count = 50
sweep_configuration = {
    "method": "grid",
    "name": "sweep_multi_instance_test",
    # Weighted kappa is an agreement metric, so higher values are better.
    "metric": {"goal": "maximize", "name": "rsmtool/eval_short.wtkappa.scale_trim"},
    "parameters": {
        "train_sample_size": {
            "values": [0.01, 0.02, 0.03, 0.04, 0.05],
        },
        "objective_function": {
            "values": ["quadratic_weighted_kappa", "neg_mean_squared_error"],
        },
        "learner": {
            "values": [
                "SVR",
                "RescaledSVR",
                "RandomForestRegressor",
                "RescaledRandomForestRegressor",
            ],
        },
    },
}
sweep_id = wandb.sweep(sweep=sweep_configuration)

Create sweep with ID: f4hu2cox
Sweep URL: https://wandb.ai/etslabs/pipplet-speech-experiments/sweeps/f4hu2cox
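A grid sweep tries every combination of the listed parameter values, so it helps to know how many runs that implies before choosing `sweep_count` and the number of instances. The sketch below counts and enumerates the combinations for the `parameters` block above. The helper names `grid_size` and `grid_combinations` are illustrative and not part of the `wandb` API.

```python
import itertools
import math

# Copy of the "parameters" block from the sweep configuration above.
parameters = {
    "train_sample_size": {"values": [0.01, 0.02, 0.03, 0.04, 0.05]},
    "objective_function": {
        "values": ["quadratic_weighted_kappa", "neg_mean_squared_error"]
    },
    "learner": {
        "values": [
            "SVR",
            "RescaledSVR",
            "RandomForestRegressor",
            "RescaledRandomForestRegressor",
        ]
    },
}


def grid_size(params):
    """Number of distinct combinations in a grid over discrete values."""
    return math.prod(len(spec["values"]) for spec in params.values())


def grid_combinations(params):
    """Yield one dict per combination, in itertools.product order."""
    names = list(params)
    for combo in itertools.product(*(params[name]["values"] for name in names)):
        yield dict(zip(names, combo))


total_runs = grid_size(parameters)  # 5 * 2 * 4 = 40
```

With 40 combinations and `sweep_count = 50`, the run count is unlikely to be the limiting factor for this grid, assuming all agents draw from the same sweep.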


## Create an `Estimator`

In [10]:
environment = {
    "WANDB_HOST": training_host,
    "WANDB_JOB_TYPE": "training",
    "WANDB_USERNAME": "mmulholland",
    "WANDB_USER_EMAIL": "mmulholland@ets.org",
    "INPUT_DATA_S3_URI": data_s3_uri,
    "OUTPUT_BUCKET": output_bucket,
    "SWEEP_COUNT": str(sweep_count),
    "SWEEP_ID": sweep_id,
    "SWEEP_CONFIGURATION": json.dumps(sweep_configuration),
    **wandb_required_env_vars,
}
max_run = 60 * 60 * 24  # One day of training time
max_wait = 60 * 60  # One extra hour to wait for Spot capacity (added to max_run below)
estimator = Estimator(
    image_uri=image_uri,
    base_job_name="speech",
    sagemaker_session=sagemaker_session,
    instance_type="ml.c5.2xlarge",
    instance_count=3,
    use_spot_instances=True,
    max_run=max_run,
    max_wait=max_run + max_wait,
    role=role,
    environment=environment,
    hyperparameters=hyperparameters,
    output_path=f"s3://{output_bucket}",
)
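With managed Spot training, the total wait time must be at least as long as the maximum run time, which is why the cell above passes `max_run + max_wait` as `max_wait`. The sketch below makes that relationship explicit. The helper name `spot_timing` is hypothetical and only illustrates the arithmetic.

```python
def spot_timing(max_run_seconds, extra_wait_seconds):
    """Return Estimator timing arguments for a managed Spot training job.

    Hypothetical helper: max_wait covers the training time plus the time
    spent waiting for Spot capacity, so it can never be below max_run.
    """
    if max_run_seconds <= 0:
        raise ValueError("max_run_seconds must be positive")
    if extra_wait_seconds < 0:
        raise ValueError("extra_wait_seconds must not be negative")
    return {
        "max_run": max_run_seconds,
        "max_wait": max_run_seconds + extra_wait_seconds,
    }


# One day of training plus one hour of waiting for capacity.
timing = spot_timing(60 * 60 * 24, 60 * 60)
```

The resulting dictionary could be unpacked into the constructor, for example `Estimator(..., use_spot_instances=True, **timing)`.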

## Run Sweep
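- Execute the training procedure defined in the image's `train.py` script.
- Because `wait=False` is passed, `fit` returns as soon as the training job is created instead of blocking until it finishes.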
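The contents of `train.py` are not shown in this notebook. As a rough sketch of how such a script might consume the `SWEEP_ID` and `SWEEP_COUNT` environment variables set on the estimator, the code below starts a W&B agent that pulls parameter combinations from the sweep. The function names `read_sweep_settings`, `run_one_trial`, and `main` are assumptions made for illustration, and the training step itself is left as a placeholder.

```python
import os


def read_sweep_settings(env=None):
    """Read the sweep ID and per-agent run count from the environment."""
    env = os.environ if env is None else env
    return env["SWEEP_ID"], int(env.get("SWEEP_COUNT", "1"))


def run_one_trial():
    """Run a single trial with the parameters chosen by the sweep."""
    import wandb  # Imported lazily so the helper above works without wandb.

    with wandb.init() as run:
        config = dict(run.config)  # e.g. learner, objective_function, ...
        # Placeholder: train and evaluate a model with `config` here, then
        # log the metric named in the sweep configuration via run.log().
        del config


def main():
    """Start an agent that executes up to `count` trials from the sweep."""
    import wandb

    sweep_id, count = read_sweep_settings()
    # The project and entity are taken from WANDB_PROJECT and WANDB_ENTITY.
    wandb.agent(sweep_id, function=run_one_trial, count=count)
```

In this sketch each instance would call `main()` from the container's entry point, so several agents would draw from the same sweep in parallel.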

In [11]:
train_config = sagemaker.inputs.TrainingInput(data_s3_uri, content_type="text/csv")
estimator.fit({"train": train_config}, wait=False)

INFO:sagemaker:Creating training-job with name: speech-2023-10-30-13-37-23-867
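Since `fit` was called with `wait=False`, the notebook does not block while the job runs. One way to check on it later is to poll the job status through the SageMaker client created earlier. The following is a minimal sketch. The helper name `wait_for_training_job` and its polling parameters are illustrative assumptions, while `describe_training_job` and the `TrainingJobStatus` field come from the boto3 SageMaker client.

```python
import time

# Statuses after which a training job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=60, max_polls=None):
    """Poll a training job until it reaches a terminal status.

    Returns the last observed status. If max_polls is set, polling stops
    after that many checks even if the job is still running.
    """
    polls = 0
    while True:
        response = client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return status
        time.sleep(poll_seconds)
```

In this notebook the call might look like `wait_for_training_job(client, estimator.latest_training_job.name)`, assuming the estimator exposes the job name that way in the installed SDK version.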
