In [1]:
!pip install datasets transformers torch nlpaug

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting datasets
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Collecting transformers
  Downloading transformers-4.21.3-py3-none-any.whl (4.7 MB)
Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-

# Download Data & Prepare

By default, the script `wrangling_segment.py` prepares the datasets for predicting the UNSPSC market segment from files that have already been downloaded. If the `--download` flag is passed, the script first downloads the data from its web sources and then creates the prepared data.
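The flag behaviour described above can be sketched with `argparse`. This is only an illustration of the pattern, not the contents of `wrangling_segment.py`: the helper names `parse_args` and `files_to_fetch`, and the `SOURCES` table with its example URL, are assumptions made for the sketch.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --download is a boolean switch that defaults to False."""
    parser = argparse.ArgumentParser(description="Prepare UNSPSC segment data")
    parser.add_argument(
        "--download",
        action="store_true",
        help="fetch the raw files from their web sources before preparing data",
    )
    return parser.parse_args(argv)


def files_to_fetch(sources, download):
    """Return the (url, path) pairs to download; empty when local files are reused."""
    if not download:
        return []
    return [(url, Path(path)) for url, path in sources.items()]


# Hypothetical source table; the real script defines its own URLs and paths.
SOURCES = {"https://example.org/codes.csv": "./data/codes/codes.csv"}

default_run = files_to_fetch(SOURCES, parse_args([]).download)
download_run = files_to_fetch(SOURCES, parse_args(["--download"]).download)
```

Without the flag, `default_run` is empty and preparation would proceed from whatever is already on disk. With the flag, every source is queued for download first.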

In [13]:
import torchtext

In [2]:
!python wrangling_segment.py --download

downloading https://data.ok.gov/dataset/18a622a6-32d1-48f6-842a-8232bc4ca06c/resource/b92ad3ac-b0f5-4c62-9bd0-eac023cfd083/download/data-unspsc-codes.csv to ./data/codes/data-unspsc-codes.csv
    done
downloading https://data.ca.gov/dataset/ae343670-f827-4bc8-9d44-2af937d60190/resource/bb82edc5-9c78-44e2-8947-68ece26197c5/download/purchase-order-data-2012-2015-.csv to ./data/california/purchase-order-data-2012-2015-.csv
    done
downloading https://data.gov.au/data/dataset/5c7fa69b-b0e9-4553-b8df-2a022dd2e982/resource/561a549b-5a65-450e-86cf-81d392d8fef3/download/20142015fy.csv to ./data/australia/20142015fy.csv
    done
downloading https://data.gov.au/data/dataset/5c7fa69b-b0e9-4553-b8df-2a022dd2e982/resource/21212500-169f-4745-86b3-6ac1c1174151/download/2016-2017-australian-government-contract-data.csv to ./data/australia/2016-2017-australian-government-contract-data.csv
    done
downloading https://data.gov.au/data/dataset/5c7fa69b-b0e9-4553-b8df-2a022dd2e982/resource/bc2097b7-8116-

# Conduct Data Augmentation

Class imbalance is a significant problem in this task. The largest segment has several thousand language samples, while the smallest has fewer than 100. Randomised synonym replacement is therefore used to augment the training set.
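As a rough illustration of randomised synonym replacement, the sketch below swaps words for synonyms with a fixed probability. The `SYNONYMS` table and the `synonym_replace` helper are hypothetical and exist only for this example. The notebook's script downloads WordNet, so it presumably draws synonyms from there rather than from a hand-written table.

```python
import random

# Toy synonym table, purely illustrative; a WordNet-backed augmenter would
# look synonyms up instead of hard-coding them.
SYNONYMS = {
    "laptop": ["notebook", "portable computer"],
    "repair": ["fix", "mend"],
    "services": ["support"],
}


def synonym_replace(text, rng, p=0.5):
    """Replace each word that has a known synonym with probability p."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = "laptop repair services for office"
variants = [synonym_replace(original, rng) for _ in range(3)]
```

Each call produces a new variant of the same record. Words without an entry in the table ("for", "office") are always left untouched.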

The script `data_augmentation.py` implements this augmentation and exposes options for the routine. In particular, the under-represented classes can be augmented up to a chosen level while the larger segments are undersampled, so that the classification problem is perfectly balanced. Excess samples are moved to the test set for later use in validation. The number of samples that the under-represented classes are lifted to can also be set.

The default behaviour lifts the number of samples in each of the under-represented classes to 1000 records, while leaving the over-represented classes unchanged.
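The two modes described above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the logic of `data_augmentation.py`: the name `rebalance`, its parameters, and the identity stand-in for the augmenter are all hypothetical. `target` plays the role of the 1000-record default, and `cap` enables the optional undersampling.

```python
import random
from collections import defaultdict


def rebalance(samples, target=1000, cap=None, augment=None, seed=0):
    """Lift small classes to `target`; optionally cap large classes at `cap`.

    samples: list of (text, label) pairs.
    cap: maximum class size (at least 1), or None to leave large classes alone.
    Returns (train, spill), where spill holds the samples removed by the cap.
    """
    rng = random.Random(seed)
    augment = augment or (lambda text, rng: text)  # identity stand-in
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    train, spill = [], []
    for label, texts in by_label.items():
        texts = texts[:]  # copy so the grouped data is not mutated
        rng.shuffle(texts)
        if cap is not None and len(texts) > cap:
            spill += [(t, label) for t in texts[cap:]]
            texts = texts[:cap]
        originals = list(texts)
        while len(texts) < target:
            # Augment a randomly chosen original until the class is large enough.
            texts.append(augment(rng.choice(originals), rng))
        train += [(t, label) for t in texts]
    return train, spill


# Tiny stand-in dataset: one large class and one small class.
data = [("a", "big")] * 12 + [("b", "small")] * 3
train, spill = rebalance(data, target=5)                       # default-style run
capped_train, capped_spill = rebalance(data, target=5, cap=5)  # balanced run
```

With `cap=None` the large class keeps all of its samples and only the small class grows. Setting `cap` equal to `target` yields equally sized classes, and `capped_spill` holds the excess that would be appended to the test set.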

Defaults are sufficient for our current purposes.

In [3]:
!python data_augmentation.py

[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data] Downloading package omw-1.4 to /home/ec2-user/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


# Upload Prepared Files to S3

In [5]:
# Upload the rebalanced, prepared data to S3.
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = 'unspsc-data'
prefix = 'segment_training'

data_input = sagemaker_session.upload_data(path='./prepared_data/rebalanced/', bucket=bucket, key_prefix=prefix)

# RNN Baseline Model

In [8]:
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()


Along with the entry point script, we need to pass a `requirements.txt` file so that SageMaker can install torchtext.

Information on how to do this was obtained from:

https://github.com/awslabs/sagemaker-privacy-for-nlp/blob/master/source/sagemaker/2.Model_Training.ipynb

I install pip-tools and use its `pip-compile` command to resolve every library needed to support the contents of `requirements.in`, which lists the modules imported by `train_rnn.py`.
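One way to draft the module list for `requirements.in` is to read the imports out of the training script with the standard `ast` module. The sketch below is an assumption about how that could be done, not a record of how the file was built. The helper `top_level_imports` and the `example_source` excerpt are hypothetical.

```python
import ast
import sys


def top_level_imports(source):
    """Return the sorted top-level module names imported by `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    # Drop standard-library modules where the interpreter can list them
    # (sys.stdlib_module_names exists from Python 3.10 onwards).
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(names - set(stdlib))


# Hypothetical excerpt standing in for the real train_rnn.py.
example_source = """
import os
import torch
import torchtext
from torch.utils.data import DataLoader
"""
modules = top_level_imports(example_source)
```

The result is a list of import names, which is not always the same as the package name on PyPI, so the output should be reviewed before it is written to `requirements.in`.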

In [None]:
!pip install pip-tools

In [None]:
!pip-compile requirements.in > requirements.txt

In [14]:
estimator = PyTorch(entry_point='train_rnn.py',
                    source_dir='./rnn_baseline/',
                    instance_count=1,
                    instance_type='ml.g4dn.xlarge',
                    framework_version='1.10',
                    py_version='py38',
                    role=role)

In [None]:
estimator.fit({'training':data_input}, wait = True)

# DistilBERT Hyperparameter Tuning

In [16]:
import sagemaker
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

In [17]:
data_input

's3://unspsc-data/segment_training'

In [18]:
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

In [None]:
huggingface_estimator = HuggingFace(entry_point='unspsc_distilbert_sagemaker_hpo.py',
                                    instance_count=1,
                                    instance_type="ml.g4dn.xlarge",
                                    transformers_version='4.12',
                                    pytorch_version='1.9',
                                    py_version='py38',
                                    role=role)

hyperparameter_ranges = {
    "lr": ContinuousParameter(1e-5, 1e-3),
    "batch-size": CategoricalParameter([32, 64, 128]),
    "epochs": IntegerParameter(1, 2),
    'eps': ContinuousParameter(1e-8, 1e-7)
}

objective_metric_name = "Balanced Accuracy Final:"
objective_type = "Maximize"
metric_definitions = [{"Name": "Balanced Accuracy Final:", "Regex": "Balanced Accuracy Final: ([0-9\\.]+)"}]

tuner = HyperparameterTuner(
    huggingface_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,
    max_parallel_jobs=1,
    objective_type=objective_type,
)

tuner.fit({"training": data_input})

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
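The tuner reads the objective metric from the training job's logs by applying the regex from `metric_definitions` and taking the captured group. The sketch below checks that pattern locally against a sample line. The log line and its value are hypothetical, chosen only to show the shape the training script would need to print.

```python
import re

# Same pattern as in metric_definitions, written here as a raw string.
METRIC_REGEX = r"Balanced Accuracy Final: ([0-9\.]+)"

# Hypothetical log line; the value is made up for illustration.
log_line = "Balanced Accuracy Final: 0.8125"

match = re.search(METRIC_REGEX, log_line)
value = float(match.group(1)) if match else None
```

If the script's output format changes, for example by dropping the colon, the pattern no longer matches and no metric value can be extracted from that line.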

# Optimal Parameters

# Profiling

# Profiling Information

# Deploy to Endpoint

# References

[Hugging Face tutorial notebook](https://github.com/huggingface/notebooks/blob/main/examples/text_classification.ipynb)

[Hugging Face SageMaker tutorial](https://github.com/huggingface/notebooks/blob/main/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb)

[Data augmentation from neptune.ai](https://neptune.ai/blog/data-augmentation-nlp)

[Textual augmentation example code](https://github.com/makcedward/nlpaug/blob/23800cbb9632c7fc8c4a88d46f9c4ecf68a96299/example/textual_augmenter.ipynb)