# Plagiarism Detection Model

The goal of this notebook is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features provided to the model.

This task involves the following steps:

* Upload your data to S3.
* Define a binary classification model and a training script.
* Train your model and deploy it.
* Evaluate your deployed classifier and answer some questions about your approach.

---

## Load Data to S3

In the Feature Engineering Notebook, we created two files, `train.csv` and `test.csv`, which contain the features and class labels for the given corpus of plagiarized/non-plagiarized text data.

>The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, you can upload your locally stored data to S3.

Save your train and test `.csv` feature files locally. To do this, you can run the second notebook, "2_Plagiarism_Feature_Engineering", in SageMaker, or you can manually upload your files to this notebook using the upload icon in Jupyter Lab. Then you can upload the local files to S3 by calling `sagemaker_session.upload_data` and pointing it directly to where the training data is saved.
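Before uploading, it can help to confirm that the local files have the layout the later cells assume: no header row, and the class label in the first column. The following is a minimal sketch using only the standard library; `check_feature_csv` is a hypothetical helper, not part of the original notebook, and it assumes the labels are stored as 0 or 1.

```python
import csv

def check_feature_csv(path):
    """Return (n_rows, n_features) for a header-less feature file whose
    first column holds a 0/1 class label; raise ValueError otherwise."""
    n_rows, n_cols = 0, None
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            # every value must be numeric, so a header row fails here
            values = [float(v) for v in row]
            if values[0] not in (0.0, 1.0):
                raise ValueError(f"row {n_rows}: label {row[0]!r} is not 0 or 1")
            if n_cols is None:
                n_cols = len(values)
            elif len(values) != n_cols:
                raise ValueError(
                    f"row {n_rows}: expected {n_cols} columns, got {len(values)}")
            n_rows += 1
    if n_rows == 0:
        raise ValueError("file is empty")
    # the first column is the label, the remaining columns are features
    return n_rows, n_cols - 1
```

In this notebook the helper would be called on each file in the local data directory, for example on `train.csv` and `test.csv`, before running the upload cell.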

In [1]:
import pandas as pd
import boto3
import sagemaker
import os
import numpy as np

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Upload training data to S3

In [3]:
# name of the directory you created to save your features data
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory in S3
prefix = 'plagiarism-detection'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
input_data

's3://sagemaker-eu-west-1-392009495238/plagiarism-detection'

### Test cell

Test that your data has been successfully uploaded. The cell below prints the items in your S3 bucket and throws an error if the bucket is empty. You should see the contents of your `data_dir` and perhaps some checkpoints. If you see any other files listed, you may have some old model files that you can delete via the S3 console (though additional files shouldn't affect the performance of the model developed in this notebook).
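Because the default bucket can hold files from other projects, a narrower check is to list only the keys under this notebook's prefix. A minimal sketch, assuming a boto3 `Bucket` resource is passed in; `keys_under_prefix` is a hypothetical helper name, while `objects.filter(Prefix=...)` is the standard boto3 collection filter.

```python
def keys_under_prefix(bucket_resource, key_prefix):
    """List the object keys in a boto3 Bucket resource that start with key_prefix."""
    return [obj.key for obj in bucket_resource.objects.filter(Prefix=key_prefix)]

# Usage in the notebook (requires AWS credentials, so it is left commented out):
# import boto3
# keys = keys_under_prefix(boto3.resource('s3').Bucket(bucket), prefix)
# assert len(keys) != 0, 'No objects found under the prefix.'
```

Filtering by prefix means the assertion only passes when the uploaded feature files are actually present, rather than whenever the bucket contains anything at all.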

In [4]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

/creditcard/linear-learner-2020-02-11-15-32-53-367/output/model.tar.gz
/creditcard/linear-learner-2020-02-11-16-38-57-506/output/model.tar.gz
/creditcard/linear-learner-2020-02-11-16-54-49-054/output/model.tar.gz
/creditcard/linear-learner-2020-02-11-17-13-16-085/output/model.tar.gz
/creditcard/linear-learner-2020-02-11-17-19-33-442/output/model.tar.gz
/creditcard/linear-learner-2020-02-11-17-21-49-694/output/model.tar.gz
/plagiarism-detection/linear-learner-2020-02-19-15-57-14-013/output/model.tar.gz
/plagiarism-detection/linear-learner-2020-02-19-16-49-32-356/output/model.tar.gz
deepar-energy-consumption/output/forecasting-deepar-2020-02-12-15-43-57-156/output/model.tar.gz
deepar-energy-consumption/output/forecasting-deepar-2020-02-14-18-42-42-889/output/model.tar.gz
deepar-energy-consumption/test/test.json
deepar-energy-consumption/train/train.json
gasoline-barrrels/output/linear-learner-2020-02-14-17-01-03-073/output/model.tar.gz
gasoline-barrrels/train/linear_train.data
gasoline-b

---

# Modeling


In [5]:
!pygmentize source_sklearn/train.py

from __future__ import print_function

import argparse
import os
import pandas as pd

from sklearn.externals import joblib

## TODO: Import any additional libraries you need to define a model
from sklearn.svm import LinearSVC

# Provided model load function
def model_fn(model_dir):
    """Load model from the model_dir. This is the same model that is saved
    in the main if statement.
    """
    print("Loading model.")
    
    # load using joblib
    model = joblib.load(os.path.join(model_dir, 

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script shown above. To run a custom training script in SageMaker, construct an estimator and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn` OR `source_pytorch`.
* **role**: Role ARN, which was specified, above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.

Note: For a PyTorch model, there is another optional argument **framework_version**, which you can set to the latest version of PyTorch, `1.0`.
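To illustrate how the optional **hyperparameters** dictionary reaches the training script: SageMaker passes each entry to the entry point as a command-line argument, and it exposes the model and data locations through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`. The sketch below shows how a script might read them with `argparse`. The `--C` argument is a hypothetical hyperparameter chosen for illustration; it is not taken from the `train.py` shown above.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse the SageMaker-provided paths plus one illustrative hyperparameter."""
    parser = argparse.ArgumentParser()

    # SageMaker sets these environment variables inside the training container;
    # SM_CHANNEL_TRAIN corresponds to the 'train' channel passed to fit()
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN'))

    # hypothetical hyperparameter, e.g. supplied as hyperparameters={'C': 0.5}
    parser.add_argument('--C', type=float, default=1.0)

    return parser.parse_args(argv)
```

With this in place, constructing the estimator with `hyperparameters={'C': 0.5}` would make `parse_args().C` evaluate to `0.5` inside the container, while omitting the dictionary leaves the default in effect.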

### Define a Scikit-learn model


In [6]:
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(entry_point="train.py",
                    source_dir="source_sklearn",
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge')

## EXERCISE: Train the estimator

Train your estimator on the training data stored in S3. This should create a training job that you can monitor in your SageMaker console.

In [7]:
%%time

# Train your estimator on S3 training data
estimator.fit({'train': input_data})

2020-02-20 08:17:01 Starting - Starting the training job...
2020-02-20 08:17:03 Starting - Launching requested ML instances...
2020-02-20 08:17:59 Starting - Preparing the instances for training......
2020-02-20 08:18:51 Downloading - Downloading input data...
2020-02-20 08:19:20 Training - Downloading the training image.
2020-02-20 08:19:34,120 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-02-20 08:19:34,122 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-02-20 08:19:34,134 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-02-20 08:19:34,418 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-02-20 08:19:34,418 sagemaker-containers INFO     Generating setup.cfg
2020-02-20 08:19:34,419 sagemaker-containers INFO     Generating MANIFEST.in
2020-02-20 08:19:34,419 sage

## Deploy the trained model

In [8]:
%%time

# deploy the model to create a predictor
predictor = estimator.deploy(initial_instance_count=1,
                     instance_type='ml.m4.xlarge')


-----------!
CPU times: user 191 ms, sys: 7.15 ms, total: 199 ms
Wall time: 5min 31s


---
# Evaluate The Model

Once your model is deployed, you can see how it performs when applied to the test data.

The cell below reads in the test data, assuming it is stored locally in `data_dir` and named `test.csv`. The labels and features are extracted from the `.csv` file.

In [9]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determine the accuracy of the model


In [10]:
# First: generate predicted class labels
test_y_preds = predictor.predict(test_x)

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [11]:
# Second: calculate the test accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)


## print out the array of predicted and true labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0

Predicted class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]


In [12]:
pd.crosstab(test_y.values, test_y_preds, rownames=['actuals'], colnames=['predictions'])

| actuals \ predictions | 0 | 1 |
|---|---|---|
| 0 | 10 | 0 |
| 1 | 0 | 15 |
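The same counts can be derived without pandas, which also makes it easy to report precision and recall next to accuracy. The sketch below is illustrative only: `confusion_counts` and `precision_recall` are hypothetical helper names, and the label lists are copied from the printed output above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    counts = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts['tp' if truth == positive else 'fp'] += 1
        else:
            counts['fn' if truth == positive else 'tn'] += 1
    return counts

def precision_recall(counts):
    """Return (precision, recall); a metric is None when its denominator is zero."""
    predicted_pos = counts['tp'] + counts['fp']
    actual_pos = counts['tp'] + counts['fn']
    precision = counts['tp'] / predicted_pos if predicted_pos else None
    recall = counts['tp'] / actual_pos if actual_pos else None
    return precision, recall

# labels copied from the cell output above
true_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
               1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
pred_labels = list(true_labels)  # the printed predictions match the true labels

counts = confusion_counts(true_labels, pred_labels)
precision, recall = precision_recall(counts)
print(counts, precision, recall)
```

For these labels the counts are 15 true positives and 10 true negatives, with no false positives or false negatives, which agrees with the crosstab.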


----
## Clean up Resources


In [13]:
predictor.delete_endpoint()

## Train and Deploy a LinearLearner Model

In [14]:
#Create the estimator
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'linear-learner')

# use a single slash; a double slash creates S3 keys that begin with '/'
output_path = 's3://{}/{}'.format(bucket, prefix)

LL = sagemaker.estimator.Estimator(container,
                                   role=role,
                                   train_instance_count=1,
                                   train_instance_type='ml.c4.xlarge',
                                   output_path=output_path,
                                   sagemaker_session = sagemaker_session)

LL.set_hyperparameters(predictor_type='binary_classifier',
                       mini_batch_size=20)

In [22]:
%%time

# upload data to S3 (if the bucket has been deleted)
train_dir = sagemaker_session.upload_data(os.path.join(data_dir, 'train.csv'), bucket=bucket, key_prefix=prefix)

# load training data from the s3 bucket
s3_train = sagemaker.s3_input(s3_data=train_dir, content_type='text/csv')

LL.fit({'train' : s3_train})

2020-02-20 08:36:03 Starting - Starting the training job...
2020-02-20 08:36:04 Starting - Launching requested ML instances......
2020-02-20 08:37:05 Starting - Preparing the instances for training......
2020-02-20 08:38:12 Downloading - Downloading input data...
2020-02-20 08:39:01 Training - Training image download completed. Training in progress.
2020-02-20 08:39:01 Uploading - Uploading generated training model
2020-02-20 08:39:01 Completed - Training job completed
Docker entrypoint called with argument(s): train
[02/20/2020 08:38:51 INFO 139971438237504] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info'

In [23]:
%%time

# deploy the model to create a predictor
predictor = LL.deploy(initial_instance_count=1,
                     instance_type='ml.m4.xlarge')

-------------!
CPU times: user 216 ms, sys: 10.8 ms, total: 227 ms
Wall time: 6min 31s


## Evaluate the Model

In [25]:
# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

In [26]:
# serialize requests and deserialize responses that are specific to the algorithm
from sagemaker.predictor import csv_serializer, json_deserializer

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = json_deserializer

In [27]:
# manual check predictions vs scores
test_x_np = test_x.values.astype('float32')
predictor.predict(test_x_np[0])

{'predictions': [{'score': 0.9999998807907104, 'predicted_label': 1.0}]}
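The response above is a dictionary with a `predictions` list, where each entry carries a `score` and a `predicted_label`. A small helper can pull the labels out of that structure. This is a sketch under the assumption that every response has the shape shown above; `extract_labels` is a hypothetical name, and the example response is the one printed by the previous cell.

```python
def extract_labels(response):
    """Collect predicted_label from every entry of a LinearLearner-style response."""
    return [int(p['predicted_label']) for p in response['predictions']]

# response copied from the cell output above
example_response = {
    'predictions': [{'score': 0.9999998807907104, 'predicted_label': 1.0}]
}
labels = extract_labels(example_response)
print(labels)
```

If the endpoint accepts several rows in one request and returns one entry per row (an assumption that should be checked against the deployed model), the same helper would handle the whole test set in a single call instead of one request per row.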

In [28]:
# Generate predicted class labels
test_y_preds = np.array([predictor.predict(test_x_np[i])['predictions'][0]['predicted_label'] for i in range(len(test_x_np))])

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [29]:
# Calculate the test accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)


## print out the array of predicted and true labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0

Predicted class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0.]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]


In [30]:
pd.crosstab(test_y.values, test_y_preds, rownames=['actuals'], colnames=['predictions'])

| actuals \ predictions | 0.0 | 1.0 |
|---|---|---|
| 0 | 10 | 0 |
| 1 | 0 | 15 |


In [31]:
# delete endpoint
predictor.delete_endpoint()

### Deleting S3 bucket

When you are *completely* done with training and testing models, you can also delete your entire S3 bucket. If you do this before you are done training your model, you'll have to recreate your S3 bucket and upload your training data again.

In [None]:
# deleting bucket, uncomment lines below

# bucket_to_delete = boto3.resource('s3').Bucket(bucket)
# bucket_to_delete.objects.all().delete()