# Plagiarism Detection Model

Now that you've created training and test data, you are ready to define and train a model. Your goal in this notebook is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features you provide.

This task will be broken down into a few discrete steps:

* Upload your data to S3.
* Define a binary classification model and a training script.
* Train your model and deploy it.
* Evaluate your deployed classifier and answer some questions about your approach.

To complete this notebook, you'll have to complete all given exercises and answer all the questions.
> All your tasks will be clearly labeled **EXERCISE** and questions as **QUESTION**.

It will be up to you to explore different classification models and decide on a model that gives you the best performance for this dataset.

---

## Load Data to S3

In the last notebook, you should have created two files, `train.csv` and `test.csv`, which hold the features and class labels for the given corpus of plagiarized/non-plagiarized text data.

>The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, you can upload your locally stored data to S3.

Save your train and test `.csv` feature files locally. To do this, you can run the second notebook, "2_Plagiarism_Feature_Engineering", in SageMaker, or you can manually upload your files to this notebook using the upload icon in JupyterLab. Then you can upload the local files to S3 by using `sagemaker_session.upload_data` and pointing directly to where the training data is saved.

In [1]:
import pandas as pd
import boto3
import sagemaker
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

In [2]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## EXERCISE: Upload your training data to S3

Specify the `data_dir` where you've saved your `train.csv` file. Decide on a descriptive `prefix` that defines where your data will be uploaded in the default S3 bucket. Finally, create a pointer to your training data by calling `sagemaker_session.upload_data` and passing in the required parameters. It may help to look at the [Session documentation](https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.Session.upload_data) or previous SageMaker code examples.

You are expected to upload your entire directory. Later, the training script will only access the `train.csv` file.
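
As a rough sketch of what to expect after the upload, the helper below predicts the S3 URIs for the files in a local directory, assuming each file lands under `s3://<bucket>/<prefix>/<path relative to data_dir>`. `expected_s3_uris` is a hypothetical helper, not part of the SageMaker SDK, and it does not contact AWS; it only mirrors the key layout.

```python
import os


def expected_s3_uris(data_dir, bucket, prefix):
    """Return the S3 URIs we expect for every file under data_dir.

    Hypothetical helper: assumes keys follow <prefix>/<path relative to data_dir>.
    """
    uris = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            rel_path = os.path.relpath(os.path.join(root, name), data_dir)
            # S3 keys use forward slashes, whatever the local OS uses
            key = "/".join([prefix] + rel_path.split(os.sep))
            uris.append("s3://{}/{}".format(bucket, key))
    return sorted(uris)
```

Calling `expected_s3_uris('plagiarism_data', bucket, 'plagiarism-classifier')` gives a list you can compare against the keys printed by the test cell.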

In [3]:
# should be the name of directory you created to save your features data
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism-classifier'

# upload all data to S3
input_data = sagemaker_session.upload_data(data_dir, bucket=bucket, key_prefix=prefix)

### Test cell

Test that your data has been successfully uploaded. The cell below prints out the items in your S3 bucket and will throw an error if the bucket is empty. You should see the contents of your `data_dir` and perhaps some checkpoints. If you see any other files listed, then you may have some old model files that you can delete via the S3 console (though additional files shouldn't affect the performance of the model developed in this notebook).

In [4]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism-classifier/test.csv
plagiarism-classifier/train.csv
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1629-001-de1f9569/output/model.tar.gz
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1629-002-30e03f88/output/model.tar.gz
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1629-003-c9b93390/output/model.tar.gz
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1640-001-eadbc65c/output/model.tar.gz
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1640-002-931664fa/output/model.tar.gz
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1640-003-fd7b6bac/output/model.tar.gz
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1640-004-25d576a6/output/model.tar.gz
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1640-005-d0a70032/output/model.tar.gz
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1640-006-adc2a614/output/model.tar.gz
plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1640-007-af4ab029/output/model.tar.gz
plagiarism-sklear

---

# Modeling

Now that you've uploaded your training data, it's time to define and train a model!

The type of model you create is up to you. For a binary classification task, you can choose to go one of three routes:
* Use a built-in classification algorithm, like LinearLearner.
* Define a custom Scikit-learn classifier; a comparison of models can be found [here](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
* Define a custom PyTorch neural network classifier. 

It will be up to you to test out a variety of models and choose the best one. Your project will be graded on the accuracy of your final model. 
 
---

## EXERCISE: Complete a training script 

To implement a custom classifier, you'll need to complete a `train.py` script. You've been given the folders `source_sklearn` and `source_pytorch`, which hold starting code for a custom Scikit-learn model and a PyTorch model, respectively. Each directory has a `train.py` training script. To complete this project **you only need to complete one of these scripts**: the script that is responsible for training your final model.

A typical training script:
* Loads training data from a specified directory
* Parses any training & model hyperparameters (ex. nodes in a neural network, training epochs, etc.)
* Instantiates a model of your design, with any specified hyperparams
* Trains that model 
* Finally, saves the model so that it can be hosted/deployed, later
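
The first two steps can be sketched as follows. This is a minimal illustration, not the provided starter code: it assumes the training container exposes the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables, and the local fallbacks (`./model`, `./data`) and the `--epochs` argument are placeholders chosen for this example.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse directories and hyperparameters for a training script."""
    parser = argparse.ArgumentParser()
    # Inside a training container SageMaker sets these variables;
    # the fallbacks are assumptions that let the script run locally.
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "./data"))
    # Illustrative hyperparameter
    parser.add_argument("--epochs", type=int, default=10)
    return parser.parse_args(argv)


args = parse_args(["--epochs", "300"])
train_file = os.path.join(args.data_dir, "train.csv")
print("would load", train_file, "and save the model to", args.model_dir)
```

Passing an explicit list to `parse_args` keeps the sketch runnable inside a notebook; a real script would call `parse_args()` with no arguments so that it reads the command line.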

### Defining and training a model
Much of the training script code is provided for you. Almost all of your work will be done in the `if __name__ == '__main__':` section. To complete a `train.py` file, you will:
1. Import any extra libraries you need
2. Define any additional model training hyperparameters using `parser.add_argument`
3. Define a model in the `if __name__ == '__main__':` section
4. Train the model in that same section

Below, you can use `!pygmentize` to display an existing `train.py` file. Read through the code; all of your tasks are marked with `TODO` comments. 

**Note: If you choose to create a custom PyTorch model, you will be responsible for defining the model in the `model.py` file,** and a `predict.py` file is provided. If you choose to use Scikit-learn, you only need a `train.py` file; you may import a classifier from the `sklearn` library.

In [5]:
# directory can be changed to: source_sklearn or source_pytorch
# !pygmentize source_sklearn/train.py

### Provided code

If you read the code above, you can see that the starter code includes a few things:
* Model loading (`model_fn`) and saving code
* Getting SageMaker's default hyperparameters
* Loading the training data by name, `train.csv`, and extracting the features and labels, `train_x` and `train_y`
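
To illustrate the last point, here is a small sketch of splitting a header-less CSV into features and labels, with the class label in the first column. The notebook itself reads these files with pandas; this version uses only the standard-library `csv` module so that it runs anywhere, and `load_features_and_labels` is a hypothetical name.

```python
import csv
import io


def load_features_and_labels(csv_file):
    """Split a header-less CSV into (features, labels); the label is the first column."""
    train_x, train_y = [], []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        train_y.append(int(float(row[0])))
        train_x.append([float(value) for value in row[1:]])
    return train_x, train_y


# Two rows in the same layout as train.csv: label, then four features
sample = io.StringIO("0,0.398148,0.07907,0.0,0.191781\n"
                     "1,0.869369,0.719457,0.319444,0.846491\n")
train_x, train_y = load_features_and_labels(sample)
print(train_y)          # [0, 1]
print(len(train_x[0]))  # 4
```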

If you'd like to read more about model saving with [joblib for sklearn](https://scikit-learn.org/stable/modules/model_persistence.html) or with [torch.save](https://pytorch.org/tutorials/beginner/saving_loading_models.html), click on the provided links.

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script you completed above. To run a custom training script in SageMaker, construct an estimator and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn` OR `source_pytorch`.
* **role**: Role ARN, which was specified, above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.
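
The job descriptions later in this notebook show hyperparameter values stored as strings (for example `'epochs': '300'`), and in script mode they reach the entry point as command-line arguments. A minimal sketch of that round trip, where `to_cli_args` is a hypothetical stand-in for what the service does and the values are examples:

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten {'name': value} into ['--name', 'value', ...] (values become strings)."""
    cli = []
    for name, value in hyperparameters.items():
        cli.extend(["--{}".format(name), str(value)])
    return cli


parser = argparse.ArgumentParser()
# type= converts the incoming strings back into numbers
parser.add_argument("--learning_rate", type=float, default=0.1)
parser.add_argument("--n_estimators", type=int, default=100)

cli_args = to_cli_args({"learning_rate": 0.0003, "n_estimators": 416})
args = parser.parse_args(cli_args)
print(cli_args)           # ['--learning_rate', '0.0003', '--n_estimators', '416']
print(args.n_estimators)  # 416
```

The practical consequence is that every hyperparameter your script expects should be declared with an explicit `type`, otherwise it arrives as a string.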

Note: For a PyTorch model, there is another optional argument, **framework_version**, which selects the PyTorch version of the training container; the PyTorch estimator in this notebook sets it to `1.5.0`.

## EXERCISE: Define a Scikit-learn or PyTorch estimator

To import your desired estimator, use one of the following lines:
```
from sagemaker.sklearn.estimator import SKLearn
```
```
from sagemaker.pytorch import PyTorch
```

### 1. SKLearn Estimator

In [103]:
# your import and estimator code, here
from sagemaker.sklearn.estimator import SKLearn

sklearn_prefix = 'plagiarism-sklearn'
sklearn_output_path = 's3://{}/{}/output'.format(bucket, sklearn_prefix)

sklearn_estimator = SKLearn(entry_point='train.py',
                            source_dir='source_sklearn',
                            role=role,
                            sagemaker_session=sagemaker_session,
                            framework_version='0.20.0',
                            output_path=sklearn_output_path,
                            train_instance_count=1,
                            train_instance_type='ml.m5.large')

# Create a HyperParameter Tuner
sklearn_hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.0001, 1),'n_estimators': IntegerParameter(25, 500)}
sklearn_objective_metric_name = 'accuracy'
sklearn_objective_type = 'Maximize'
sklearn_metric_definitions = [{'Name': 'accuracy',
                               'Regex': 'accuracy = (.*?);'}]

sklearn_tuner = HyperparameterTuner(sklearn_estimator,
                                    sklearn_objective_metric_name,
                                    sklearn_hyperparameter_ranges,
                                    sklearn_metric_definitions,
                                    max_jobs=9,
                                    max_parallel_jobs=3,
                                    objective_type=sklearn_objective_type)


This is not the latest supported version. If you would like to use version 0.23-1, please add framework_version=0.23-1 to your constructor.


**Note:** I am not using the latest framework version because I ran into issues with importing joblib. With the latest version, the model trained and built fine, but it could not be deployed.
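
The tuner above reads its objective from the training logs using the `Regex` in `sklearn_metric_definitions`, so the training script has to print a line in exactly that shape. The sketch below shows how the pattern behaves with Python's `re` module; the log lines are made up for illustration, and `extract_metric` is a hypothetical helper rather than anything SageMaker exposes.

```python
import re

# Same pattern as in sklearn_metric_definitions
ACCURACY_REGEX = r"accuracy = (.*?);"


def extract_metric(log_line, pattern=ACCURACY_REGEX):
    """Return the captured metric as a float, or None if the line does not match."""
    match = re.search(pattern, log_line)
    return float(match.group(1)) if match else None


# Hypothetical log lines a train.py script might print
print(extract_metric("accuracy = 0.6429;"))  # 0.6429
print(extract_metric("accuracy: 0.6429"))    # None
```

Note the trailing semicolon: because the capture group is non-greedy, a line without it yields no match and therefore no metric.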

## EXERCISE: Train the estimator

Train your estimator on the training data stored in S3. This should create a training job that you can monitor in your SageMaker console.

In [104]:
%%time

# Train your estimator on S3 training data
sklearn_tuner.fit({'train':input_data})

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


CPU times: user 40.8 ms, sys: 0 ns, total: 40.8 ms
Wall time: 481 ms


In [105]:
sklearn_tuning_job_name = sklearn_tuner.describe()['HyperParameterTuningJobName']
sagemaker_session.wait_for_tuning_job(sklearn_tuning_job_name)

..................................................................................................................................................!


{'HyperParameterTuningJobName': 'sagemaker-scikit-lea-200920-1205',
 'HyperParameterTuningJobArn': 'arn:aws:sagemaker:us-east-1:736760211122:hyper-parameter-tuning-job/sagemaker-scikit-lea-200920-1205',
 'HyperParameterTuningJobConfig': {'Strategy': 'Bayesian',
  'HyperParameterTuningJobObjective': {'Type': 'Maximize',
   'MetricName': 'accuracy'},
  'ResourceLimits': {'MaxNumberOfTrainingJobs': 9,
   'MaxParallelTrainingJobs': 3},
  'ParameterRanges': {'IntegerParameterRanges': [{'Name': 'n_estimators',
     'MinValue': '25',
     'MaxValue': '500',
     'ScalingType': 'Auto'}],
   'ContinuousParameterRanges': [{'Name': 'learning_rate',
     'MinValue': '0.0001',
     'MaxValue': '1',
     'ScalingType': 'Auto'}],
   'CategoricalParameterRanges': []},
  'TrainingJobEarlyStoppingType': 'Off'},
 'TrainingJobDefinition': {'StaticHyperParameters': {'_tuning_objective_metric': 'accuracy',
   'sagemaker_container_log_level': '20',
   'sagemaker_enable_cloudwatch_metrics': 'false',
   'sagem

In [108]:
# output the details of the best training job
sklearn_tuner.describe()['BestTrainingJob']

{'TrainingJobName': 'sagemaker-scikit-lea-200920-1205-009-cd9b5810',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:736760211122:training-job/sagemaker-scikit-lea-200920-1205-009-cd9b5810',
 'CreationTime': datetime.datetime(2020, 9, 20, 12, 13, 58, tzinfo=tzlocal()),
 'TrainingStartTime': datetime.datetime(2020, 9, 20, 12, 16, 19, tzinfo=tzlocal()),
 'TrainingEndTime': datetime.datetime(2020, 9, 20, 12, 17, 41, tzinfo=tzlocal()),
 'TrainingJobStatus': 'Completed',
 'TunedHyperParameters': {'learning_rate': '0.15472718446636952',
  'n_estimators': '38'},
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'accuracy',
  'Value': 0.6428571343421936},
 'ObjectiveStatus': 'Succeeded'}

In [109]:
# get the best estimator
sklearn_best_estimator = sklearn_tuner.best_estimator()

This is not the latest supported version. If you would like to use version 0.23-1, please add framework_version=0.23-1 to your constructor.


2020-09-20 12:17:41 Starting - Preparing the instances for training
2020-09-20 12:17:41 Downloading - Downloading input data
2020-09-20 12:17:41 Training - Training image download completed. Training in progress.
2020-09-20 12:17:41 Uploading - Uploading generated training model
2020-09-20 12:17:41 Completed - Training job completed
2020-09-20 12:17:29,051 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-09-20 12:17:29,051 sagemaker-containers INFO     Failed to parse hyperparameter _tuning_objective_metric value accuracy to Json.
Returning the value itself
2020-09-20 12:17:29,053 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-09-20 12:17:29,063 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-09-20 12:17:29,424 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-09-

The best tuning job reached an accuracy of only about 0.64, which is quite poor, so I build a new model with fixed hyperparameters.

In [117]:
sklearn2_prefix = 'plagiarism-sklearn2'
sklearn2_output_path = 's3://{}/{}/output'.format(bucket, sklearn2_prefix)

sklearn2_estimator = SKLearn(entry_point='train.py',
                            source_dir='source_sklearn',
                            role=role,
                            sagemaker_session=sagemaker_session,
                            framework_version='0.20.0',
                            output_path=sklearn2_output_path,
                            train_instance_count=1,
                            train_instance_type='ml.m5.large',
                            hyperparameters={
                                'learning_rate': 0.0003446307791520923,
                                'n_estimators': 416
                            })

This is not the latest supported version. If you would like to use version 0.23-1, please add framework_version=0.23-1 to your constructor.


In [118]:
%%time

# Train your estimator on S3 training data
sklearn2_estimator.fit({'train':input_data})

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-09-20 12:39:27 Starting - Starting the training job...
2020-09-20 12:39:30 Starting - Launching requested ML instances......
2020-09-20 12:40:46 Starting - Preparing the instances for training......
2020-09-20 12:41:33 Downloading - Downloading input data...
2020-09-20 12:42:25 Training - Training image download completed. Training in progress..
2020-09-20 12:42:25,428 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-09-20 12:42:25,433 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-09-20 12:42:25,451 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-09-20 12:42:25,725 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-09-20 12:42:25,725 sagemaker-containers INFO     Generating setup.cfg
2020-09-20 12:42:25,726 sagemaker-containers INFO     Generating MANIFEST.in

### 2. PyTorch Estimator

In [26]:
# your import and estimator code, here
from sagemaker.pytorch import PyTorch

pytorch_prefix = 'plagiarism-pytorch'
pytorch_output_path = 's3://{}/{}/output'.format(bucket, pytorch_prefix)

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='source_pytorch',
                            role=role,
                            sagemaker_session=sagemaker_session,
                            framework_version='1.5.0',
                            output_path=pytorch_output_path,
                            train_instance_count=1,
                            train_instance_type='ml.m5.xlarge',
                            hyperparameters={
                                'epochs': 300,
                                'input_features': 4,
                                'output_dim': 1
                            })

# Create a HyperParameter Tuner
pytorch_hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.0001, .01),'hidden_dim': IntegerParameter(8, 24)}
pytorch_objective_metric_name = 'loss'
pytorch_objective_type = 'Minimize'
pytorch_metric_definitions = [{'Name': 'loss',
                               'Regex': 'Final-Loss = (.*?);'}]

pytorch_tuner = HyperparameterTuner(pytorch_estimator,
                                    pytorch_objective_metric_name,
                                    pytorch_hyperparameter_ranges,
                                    pytorch_metric_definitions,
                                    max_jobs=9,
                                    max_parallel_jobs=3,
                                    objective_type=pytorch_objective_type)


In [27]:
%%time

# Train your estimator on S3 training data
pytorch_tuner.fit({'train':input_data})

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


CPU times: user 74.7 ms, sys: 0 ns, total: 74.7 ms
Wall time: 2.01 s


In [28]:
pytorch_tuning_job_name = pytorch_tuner.describe()['HyperParameterTuningJobName']
sagemaker_session.wait_for_tuning_job(pytorch_tuning_job_name)

.....................................................................................................................................................................!


{'HyperParameterTuningJobName': 'pytorch-training-200920-0821',
 'HyperParameterTuningJobArn': 'arn:aws:sagemaker:us-east-1:736760211122:hyper-parameter-tuning-job/pytorch-training-200920-0821',
 'HyperParameterTuningJobConfig': {'Strategy': 'Bayesian',
  'HyperParameterTuningJobObjective': {'Type': 'Minimize',
   'MetricName': 'loss'},
  'ResourceLimits': {'MaxNumberOfTrainingJobs': 9,
   'MaxParallelTrainingJobs': 3},
  'ParameterRanges': {'IntegerParameterRanges': [{'Name': 'hidden_dim',
     'MinValue': '8',
     'MaxValue': '24',
     'ScalingType': 'Auto'}],
   'ContinuousParameterRanges': [{'Name': 'learning_rate',
     'MinValue': '0.0001',
     'MaxValue': '0.01',
     'ScalingType': 'Auto'}],
   'CategoricalParameterRanges': []},
  'TrainingJobEarlyStoppingType': 'Off'},
 'TrainingJobDefinition': {'StaticHyperParameters': {'_tuning_objective_metric': 'loss',
   'epochs': '300',
   'input_features': '4',
   'output_dim': '1',
   'sagemaker_container_log_level': '20',
   'sagem

In [29]:
# output the details of the best training job
pytorch_tuner.describe()['BestTrainingJob']

{'TrainingJobName': 'pytorch-training-200920-0821-009-0d8cb082',
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:736760211122:training-job/pytorch-training-200920-0821-009-0d8cb082',
 'CreationTime': datetime.datetime(2020, 9, 20, 8, 30, 11, tzinfo=tzlocal()),
 'TrainingStartTime': datetime.datetime(2020, 9, 20, 8, 33, 36, tzinfo=tzlocal()),
 'TrainingEndTime': datetime.datetime(2020, 9, 20, 8, 34, 41, tzinfo=tzlocal()),
 'TrainingJobStatus': 'Completed',
 'TunedHyperParameters': {'hidden_dim': '22',
  'learning_rate': '0.001570313491213259'},
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'loss',
  'Value': 0.18457180261611938},
 'ObjectiveStatus': 'Succeeded'}

In [30]:
# get the best estimator
pytorch_best_estimator = pytorch_tuner.best_estimator()

2020-09-20 08:34:41 Starting - Preparing the instances for training
2020-09-20 08:34:41 Downloading - Downloading input data
2020-09-20 08:34:41 Training - Training image download completed. Training in progress.
2020-09-20 08:34:41 Uploading - Uploading generated training model
2020-09-20 08:34:41 Completed - Training job completed
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-09-20 08:34:15,488 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-09-20 08:34:15,489 sagemaker-containers INFO     Failed to parse hyperparameter _tuning_objective_metric value loss to Json.
Returning the value itself
2020-09-20 08:34:15,491 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-09-20 08:34:15,501 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-0

### 3. SageMaker LinearLearner Estimator

In [31]:
# import LinearLearner
from sagemaker import LinearLearner

In [33]:
linear_prefix = 'plagiarism-linear'
linear_output_path = 's3://{}/{}/output'.format(bucket, linear_prefix)

linear_estimator = LinearLearner(role=role,
                                 sagemaker_session=sagemaker_session,
                                 output_path=linear_output_path,
                                 train_instance_count=1,
                                 train_instance_type='ml.m5.large',
                                 predictor_type = 'binary_classifier',
                                 epochs=30)

In [42]:
# read file from storage to convert to recordset format
import os
data_file = os.path.join(data_dir, 'train.csv')
df = pd.read_csv(data_file, header=None, names=None)
df.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 0.398148 | 0.07907 | 0.0 | 0.191781 |
| 1 | 1 | 0.869369 | 0.719457 | 0.319444 | 0.846491 |
| 2 | 1 | 0.593583 | 0.268817 | 0.044199 | 0.316062 |
| 3 | 0 | 0.544503 | 0.115789 | 0.0 | 0.242574 |
| 4 | 0 | 0.329502 | 0.053846 | 0.0 | 0.161172 |


In [43]:
# get the train data and labels
train_y = df.loc[:,0].to_numpy()
train_x = df.loc[:,1:].to_numpy()

In [44]:
train_x_np = train_x.astype('float32')
train_y_np = train_y.astype('float32').reshape(-1)
formatted_train_data = linear_estimator.record_set(train_x_np, labels=train_y_np)

In [45]:
%%time 
# train the estimator on formatted training data
linear_estimator.fit(formatted_train_data)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


2020-09-20 08:59:22 Starting - Starting the training job...
2020-09-20 08:59:26 Starting - Launching requested ML instances......
2020-09-20 09:00:41 Starting - Preparing the instances for training......
2020-09-20 09:01:31 Downloading - Downloading input data...
2020-09-20 09:02:10 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[09/20/2020 09:02:28 INFO 140462225987392] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_me

## EXERCISE: Deploy the trained model

After training, deploy your model to create a `predictor`. If you're using a PyTorch model, you'll need to create a trained `PyTorchModel` that accepts the trained `<model>.model_data` as an input parameter and points to the provided `source_pytorch/predict.py` file as an entry point. 

To deploy a trained model, you'll use `<model>.deploy`, which takes in two arguments:
* **initial_instance_count**: The number of deployed instances (1).
* **instance_type**: The type of SageMaker instance for deployment.

Note: If you run into an instance error, it may be because you chose the wrong training or deployment instance_type. It may help to refer to your previous exercise code to see which types of instances we used.

In [47]:
%%time

# uncomment, if needed
# from sagemaker.pytorch import PyTorchModel

# deploy your model to create a predictor
predictor = linear_estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


-----------------!CPU times: user 296 ms, sys: 25.7 ms, total: 321 ms
Wall time: 8min 33s


---
# Evaluating Your Model

Once your model is deployed, you can see how it performs when applied to our test data.

The provided cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [48]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## EXERCISE: Determine the accuracy of your model

Use your deployed `predictor` to generate predicted class labels for the test data. Compare those to the *true* labels, `test_y`, and calculate the accuracy as a value between 0 and 1.0 that indicates the fraction of test data that your model classified correctly. You may use [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) for this calculation.

**To pass this project, your model should get at least 90% test accuracy.**
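
For reference, accuracy as defined above is just the fraction of matching labels. The sketch below computes it in plain Python on toy labels; the `accuracy` helper is written for illustration and should agree with `sklearn.metrics.accuracy_score` on the same inputs.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels (between 0 and 1.0)."""
    if len(true_labels) == 0 or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if int(t) == int(p))
    return correct / len(true_labels)


# Toy labels for illustration only
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```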

In [53]:
# First: generate predicted, class labels
test_y_preds = predictor.predict(test_x.to_numpy().astype('float32'))

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [80]:
pred_y = [int(pred.label['predicted_label'].float32_tensor.values[0]) for pred in test_y_preds]
pred_y

[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

In [84]:
# Second: calculate the test accuracy
from sklearn.metrics import classification_report
report = classification_report(test_y.values, pred_y)

print(report)

## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(pred_y)
print('\nTrue class labels: ')
print(test_y.values)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        15

    accuracy                           1.00        25
   macro avg       1.00      1.00      1.00        25
weighted avg       1.00      1.00      1.00        25


Predicted class labels: 
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]


### Evaluate PyTorch Model

In [86]:
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(model_data=pytorch_best_estimator.model_data,
                             entry_point='predict.py',
                             source_dir='source_pytorch',
                             role=role,
                             sagemaker_session=sagemaker_session,
                             framework_version='1.5.0')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


In [87]:
# deploy the model
predictor = pytorch_model.deploy(initial_instance_count=1, instance_type='ml.m5.large')

'create_image_uri' will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


-------------!

In [88]:
# First: generate predicted, class labels
test_y_preds = predictor.predict(test_x.to_numpy())

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [90]:
# Second: calculate the test accuracy
from sklearn.metrics import classification_report
report = classification_report(test_y.values, test_y_preds)

print(report)

## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds.reshape(1,-1))
print('\nTrue class labels: ')
print(test_y.values)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        15

    accuracy                           1.00        25
   macro avg       1.00      1.00      1.00        25
weighted avg       1.00      1.00      1.00        25


Predicted class labels: 
[[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
  0.]]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]


### Evaluate SKLearn Model


In [119]:
# deploy the model
predictor = sklearn2_estimator.deploy(initial_instance_count=1, instance_type='ml.m5.large')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


---------------!

In [120]:
# First: generate predicted, class labels
test_y_preds = predictor.predict(test_x.to_numpy())

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [121]:
# Second: calculate the test accuracy
from sklearn.metrics import classification_report
report = classification_report(test_y.values, test_y_preds)

print(report)

## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.60      1.00      0.75        15

    accuracy                           0.60        25
   macro avg       0.30      0.50      0.37        25
weighted avg       0.36      0.60      0.45        25


Predicted class labels: 
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]




### Question 1: How many false positives and false negatives did your model produce, if any? And why do you think this is?

**Answer**: 
I will discount the SKLearn model because its accuracy is very poor: it predicted class 1 for every test sample, which gives 60% accuracy with 10 false positives and 0 false negatives.

With the LinearLearner and the PyTorch models, the accuracy was 100%, so the number of false positives and the number of false negatives were both **zero**.
The likely reason is that the engineered features, such as the *containment score* and the *longest common subsequence*, are very effective at identifying answers that share text with the source. Hence the high accuracy.

On the other hand, the problem would be harder if we were to predict the category of plagiarism rather than just the binary class. That introduces more nuance, and the model might not perform as well.
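
The false positive and false negative counts can be checked directly from the label arrays printed in the evaluation output. The sketch below is a minimal pure-Python version; `confusion_counts` is a hypothetical helper (not part of the notebook), and the lists are copied by hand from the SKLearn evaluation output, treating class 1 (plagiarized) as the positive class.

```python
# True labels and SKLearn predictions, copied from the evaluation output above.
true_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
               1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
sklearn_preds = [1] * 25  # the SKLearn model predicted class 1 for every sample


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating class 1 (plagiarized) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


tp, fp, tn, fn = confusion_counts(true_labels, sklearn_preds)
print(f"false positives: {fp}, false negatives: {fn}")
print(f"accuracy: {(tp + tn) / len(true_labels):.2f}")
```

Passing the predictions of a model that matches the true labels exactly would give zero for both error counts.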


### Question 2: How did you decide on the type of model to use? 

**Answer**:

I decided to try all three model types and compare them before making a choice.
Based on performance, I will not go ahead with the SKLearn model, since its results fluctuate considerably and its accuracy is generally quite poor.

I also tried hyperparameter tuning to find the best-performing hyperparameters for the SKLearn and PyTorch models.
For Linear Learner, I skipped tuning and instead chose a hyperparameter value based on intuition.

Based on the results, I prefer the SageMaker built-in algorithm, the **Linear Learner model**. It performs very well on this problem, and its complexity is very low.
That makes it the best fit for this problem.
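
One way to put a model's accuracy in context when comparing candidates is a majority-class baseline: the accuracy obtained by always predicting the most common label. The sketch below is illustrative only; `majority_baseline_accuracy` is a hypothetical helper, and the labels are copied from the test output earlier in this notebook.

```python
from collections import Counter

# True test labels, copied from the evaluation output earlier in the notebook.
true_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
               1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]


def majority_baseline_accuracy(y_true):
    """Return (majority_class, accuracy of always predicting that class)."""
    majority_class, count = Counter(y_true).most_common(1)[0]
    return majority_class, count / len(y_true)


majority_class, baseline = majority_baseline_accuracy(true_labels)
print(f"majority class: {majority_class}, baseline accuracy: {baseline:.2f}")
```

A model whose test accuracy is no higher than this baseline has not learned anything useful from the features, which is a reasonable argument for dropping it from the comparison.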

----
## EXERCISE: Clean up Resources

After you're done evaluating your model, **delete your model endpoint**. You can do this with a call to `.delete_endpoint()`. You need to show, in this notebook, that the endpoint was deleted. You may delete any other resources from the AWS console; more instructions on cleaning up all your resources are given below.

In [122]:
# uncomment and fill in the line below!
predictor.delete_endpoint()
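
One possible way to show that the endpoint is gone is to ask SageMaker to describe it and confirm that the lookup fails. The sketch below is a hedged example, not the notebook's own method: `endpoint_exists` is a hypothetical helper, and it assumes that `describe_endpoint` raises a `ClientError` with error code `ValidationException` when the endpoint cannot be found. The endpoint name has to be saved before calling `delete_endpoint()`.

```python
def endpoint_exists(sm_client, endpoint_name):
    """Return True if SageMaker still reports the endpoint, False if it is gone.

    sm_client is expected to be a boto3 SageMaker client, i.e.
    boto3.client("sagemaker"). Assumption: a missing endpoint surfaces as a
    ClientError whose error code is "ValidationException".
    """
    try:
        sm_client.describe_endpoint(EndpointName=endpoint_name)
    except sm_client.exceptions.ClientError as err:
        code = err.response.get("Error", {}).get("Code", "")
        if code == "ValidationException":
            return False
        raise  # any other error (permissions, throttling) is not evidence of deletion
    return True


# Example usage in the notebook (requires AWS credentials, so left commented out):
# import boto3
# endpoint_name = predictor.endpoint  # SDK v1 attribute; `endpoint_name` in SDK v2
# predictor.delete_endpoint()
# print(endpoint_exists(boto3.client("sagemaker"), endpoint_name))
```

The client is passed in as an argument so the check can be exercised with a stand-in object without touching AWS.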

### Deleting S3 bucket

When you are *completely* done with training and testing models, you can also delete your entire S3 bucket. If you do this before you are done training your model, you'll have to recreate your S3 bucket and upload your training data again.

In [123]:
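# empty the bucket by deleting every object in it
# (the bucket itself remains; calling bucket_to_delete.delete() afterwards would remove it)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()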

[{'ResponseMetadata': {'RequestId': 'F618C87CD13EBAA1',
   'HostId': 'nuEEgx93UuBNP9M1ERhogGOvj+0xgCCDaCXnmc/P6D+T5KOoKpv5Mg4UX6oYjO//HMCvLsAn+78=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'nuEEgx93UuBNP9M1ERhogGOvj+0xgCCDaCXnmc/P6D+T5KOoKpv5Mg4UX6oYjO//HMCvLsAn+78=',
    'x-amz-request-id': 'F618C87CD13EBAA1',
    'date': 'Sun, 20 Sep 2020 12:55:00 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1640-004-25d576a6/output/model.tar.gz'},
   {'Key': 'plagiarism-classifier/test.csv'},
   {'Key': 'plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1737-001-f73e3754/output/model.tar.gz'},
   {'Key': 'plagiarism-pytorch/output/pytorch-training-200920-0821-005-d70b78f7/output/model.tar.gz'},
   {'Key': 'plagiarism-sklearn/output/sagemaker-scikit-lea-200919-1640-009-413040bf/output/model

### Deleting all your models and instances

When you are _completely_ done with this project and do **not** ever want to revisit this notebook, you can choose to delete all of your SageMaker notebook instances and models by following [these instructions](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html). Before you delete this notebook instance, I recommend at least downloading a copy and saving it locally.

---
## Further Directions

There are many ways to improve or add on to this project to expand your learning or make this more of a unique project for you. A few ideas are listed below:
* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a lambda function to deploy your model to a web application.
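
As a rough illustration of the pipeline idea in the list above, the sketch below turns a source text and a submitted text into two similarity features and hands them to a classifier. Everything here is an assumption made for illustration: the helper names, the whitespace tokenization, the choice of unigram containment plus normalized longest common subsequence, and the `predict_fn` callable, which stands in for whatever trained model is available.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def containment(answer_tokens, source_tokens, n=1):
    """Fraction of the answer's n-grams that also occur in the source."""
    answer_counts = Counter(ngrams(answer_tokens, n))
    source_counts = Counter(ngrams(source_tokens, n))
    total = sum(answer_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, source_counts[g]) for g, c in answer_counts.items())
    return overlap / total


def lcs_norm(answer_tokens, source_tokens):
    """Longest common subsequence of words, divided by the answer length."""
    if not answer_tokens:
        return 0.0
    prev = [0] * (len(source_tokens) + 1)
    for a in answer_tokens:
        curr = [0]
        for j, s in enumerate(source_tokens, start=1):
            curr.append(prev[j - 1] + 1 if a == s else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / len(answer_tokens)


def classify_submission(answer_text, source_text, predict_fn):
    """Build the feature row and return (predicted label, features)."""
    answer_tokens = answer_text.lower().split()
    source_tokens = source_text.lower().split()
    features = [containment(answer_tokens, source_tokens, n=1),
                lcs_norm(answer_tokens, source_tokens)]
    return int(predict_fn([features])[0]), features


# Stand-in classifier with an arbitrary threshold, used only for this demo.
def demo_predict(rows):
    return [1 if row[0] > 0.8 else 0 for row in rows]


label, feats = classify_submission("a b c d", "a x c y", demo_predict)
print(label, feats)
```

In practice, `predict_fn` would wrap a deployed model, and the feature list would need to match, in order, the features the model was trained on.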

These are all just options for extending your work. If you've completed all the exercises in this notebook, you've completed a real-world application, and can proceed to submit your project. Great job!