# Plagiarism Detection Model

Now that I've created training and test data, I'm ready to define and train a model. The goal is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features.

This task will be broken down into a few discrete steps:

* Upload the data to S3.
* Define a binary classification model and a training script.
* Train the model and deploy it.
* Evaluate the deployed classifier.

## Load Data to S3

In [1]:
import pandas as pd
import boto3
import sagemaker
import os

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# get the default S3 bucket for this session (created if it does not exist yet)
bucket = sagemaker_session.default_bucket()

## Upload the training data to S3

In [3]:
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism_detector'

# upload all data to S3
input_data = sagemaker_session.upload_data(path = data_dir, bucket=bucket, key_prefix=prefix)

In [4]:
train = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None)
train.head()

|   | 0 | 1 | 2 |
|---|---|----------|----------|
| 0 | 0 | 0.398148 | 0.0 |
| 1 | 1 | 0.869369 | 0.382488 |
| 2 | 1 | 0.593583 | 0.06044 |
| 3 | 0 | 0.544503 | 0.0 |
| 4 | 0 | 0.329502 | 0.0 |


### Test cell

In [5]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism_detector/test.csv
plagiarism_detector/train.csv
sagemaker-pytorch-2019-11-28-22-38-47-258/source/sourcedir.tar.gz
sagemaker-pytorch-2019-11-28-22-45-55-479/output/model.tar.gz
sagemaker-pytorch-2019-11-28-22-45-55-479/source/sourcedir.tar.gz
sagemaker-pytorch-2019-11-29-02-06-02-331/sourcedir.tar.gz
sagemaker-pytorch-2019-11-29-23-49-06-188/output/model.tar.gz
sagemaker-pytorch-2019-11-29-23-49-06-188/source/sourcedir.tar.gz
sagemaker-pytorch-2019-11-29-23-49-37-252/output/model.tar.gz
sagemaker-pytorch-2019-11-29-23-49-37-252/source/sourcedir.tar.gz
sagemaker-pytorch-2019-11-29-23-52-49-313/sourcedir.tar.gz
sagemaker-pytorch-2019-11-30-00-04-24-863/output/model.tar.gz
sagemaker-pytorch-2019-11-30-00-04-24-863/source/sourcedir.tar.gz
sagemaker-pytorch-2019-11-30-00-07-36-634/sourcedir.tar.gz
Test passed!


---

# Modeling

Now that I've uploaded the training data, it's time to define and train a model!

There are a few options for the type of model:
* Use a built-in classification algorithm, like LinearLearner.
* Define a custom Scikit-learn classifier; a comparison of models can be found [here](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
* Define a custom PyTorch neural network classifier.
 
---

## Write a training script 

A typical training script:
* Loads training data from a specified directory
* Parses any training and model hyperparameters (e.g., nodes in a neural network, training epochs)
* Instantiates a model with any specified hyperparameters
* Trains that model
* Finally, saves the model so that it can be hosted/deployed later
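
The argument-parsing step above can be sketched as follows. This is a minimal sketch, not the notebook's actual `train.py`: it assumes SageMaker passes each hyperparameter to the script as a `--name value` command-line argument and exposes the model and data locations through the `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` environment variables. The helper name `parse_args` and the local fallback paths are hypothetical; the hyperparameter names mirror the ones passed to the estimator in this notebook.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse SageMaker-provided paths and model hyperparameters (illustrative helper)."""
    parser = argparse.ArgumentParser()

    # Locations normally supplied by the training container via environment
    # variables; the fallbacks are assumptions for running outside SageMaker.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING', 'plagiarism_data'))

    # Hyperparameters, named as in the estimator's `hyperparameters` dictionary
    parser.add_argument('--input_features', type=int, default=2)
    parser.add_argument('--hidden_dim', type=int, default=20)
    parser.add_argument('--output_dim', type=int, default=1)
    parser.add_argument('--epochs', type=int, default=50)

    return parser.parse_args(argv)


# Simulate the arguments that would arrive on the command line
args = parse_args(['--epochs', '10'])
print(args.epochs, args.hidden_dim)  # prints: 10 20
```

Passing an explicit argument list, as in the last lines, makes the parser easy to exercise locally before launching a training job.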

In [6]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_pytorch/train.py

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m
[34mimport[39;49;00m [04m[36mtorch.optim[39;49;00m [34mas[39;49;00m [04m[36moptim[39;49;00m
[34mimport[39;49;00m [04m[36mtorch.utils.data[39;49;00m

[37m# imports the model in model.py by name[39;49;00m
[34mfrom[39;49;00m [04m[36mmodel[39;49;00m [34mimport[39;49;00m BinaryClassifier

[34mdef[39;49;00m [32mmodel_fn[39;49;00m(model_dir):
    [33m"""Load the PyTorch model from the `model_dir` directory."""[39;49;00m
    [34mprint[39;49;00m([33m"[39;49;00m[33mLoading model.[39;49;00m[33m"[39;49;00m)

    [37m# First, load the parameters used to create the model.[39;49;00m
    model_info = {}
    model_info_path = os.path.join(model_dir, [33m'[3

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script shown above. A SageMaker estimator for a custom training script takes the following constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training.
* **source_dir**: The path to the training script directory, `source_sklearn` OR `source_pytorch`.
* **role**: Role ARN, which was specified, above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.

Note: For a PyTorch model, there is another optional argument **framework_version**, which can be set to the latest version of PyTorch.

## Define a Scikit-learn or PyTorch estimator

To import the desired estimator, use one of the following lines:
```
from sagemaker.sklearn.estimator import SKLearn
```
```
from sagemaker.pytorch import PyTorch
```

In [7]:
from sagemaker.pytorch import PyTorch
estimator = PyTorch(entry_point="train.py",
                    source_dir="source_pytorch",
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    sagemaker_session = sagemaker_session,
                    hyperparameters={
                        'input_features': 2,  # num of features
                        'hidden_dim': 20,
                        'output_dim': 1,
                        'epochs': 50 
                    })

No framework_version specified, defaulting to version 0.4.


## Train the estimator

In [8]:
estimator.fit({'training': input_data})

2019-11-30 00:23:53 Starting - Starting the training job...
2019-11-30 00:23:54 Starting - Launching requested ML instances......
2019-11-30 00:24:57 Starting - Preparing the instances for training...
2019-11-30 00:25:51 Downloading - Downloading input data...
2019-11-30 00:26:12 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-30 00:26:32,899 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2019-11-30 00:26:32,902 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-11-30 00:26:32,913 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2019-11-30 00:26:34,365 sagemaker_pytorch_container.training INFO     Invoking user training script.
2019-11-30 00:26:34,591 sagemaker-containers INFO     Module train does not provi

## Deploy the trained model

In [9]:
from sagemaker.pytorch import PyTorchModel

# Create a model from the trained estimator data
# And point to the prediction script
model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='1.0',
                     entry_point='predict.py',
                     source_dir='source_pytorch')

# deploy the model to create a predictor
predictor = model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge')

--------------------------------------------------------------------------------------!

---
# Evaluating The Model

In [10]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determine the accuracy of the model

In [11]:
import numpy as np

test_y_preds =  np.squeeze(np.round(predictor.predict(test_x)))

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [12]:
# calculate the test accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)

## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

0.6

Predicted class labels: 
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1.]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]
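
The output above shows that every prediction is 1, so it is worth checking how much of the 0.6 accuracy reflects learning. The following is a quick sanity check and not part of the original notebook: it copies the true labels printed above, counts the confusion-matrix cells by hand, and compares the accuracy to a majority-class baseline. The helper name `confusion_counts` is hypothetical.

```python
# True labels copied from the cell output above; the classifier predicted 1 everywhere
true_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
               1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
predicted = [1] * len(true_labels)


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


tp, fp, tn, fn = confusion_counts(true_labels, predicted)
accuracy = (tp + tn) / len(true_labels)

# Accuracy obtained by always predicting the more common class
n_pos = sum(true_labels)
baseline = max(n_pos, len(true_labels) - n_pos) / len(true_labels)

print('TP={} FP={} TN={} FN={}'.format(tp, fp, tn, fn))  # TP=15 FP=10 TN=0 FN=0
print('accuracy={} baseline={}'.format(accuracy, baseline))
```

Because the accuracy equals the majority-class baseline, this run did not separate the two classes on the test set. All 10 non-plagiarized answers are counted as false positives.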


----
## Clean up Resources

In [15]:
# Accepts a predictor endpoint as input
# And deletes the endpoint by name

def delete_endpoint(predictor):
    try:
        boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint)
        print('Deleted {}'.format(predictor.endpoint))
    except Exception:
        # the call failed, most likely because the endpoint no longer exists
        print('Already deleted: {}'.format(predictor.endpoint))

# delete the predictor endpoint
delete_endpoint(predictor)

Deleted sagemaker-pytorch-2019-11-30-00-27-05-073


### Deleting S3 bucket

In [16]:
# empty the bucket: delete every object in it (the bucket itself is not removed)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': '9BCAD1C04C81500B',
   'HostId': 'npKH+jNcZekrzn4oLEJlUFMuTL+lU1+0A3Z37WpzPqGl8ciExVtH49tPHek2UFwIWXKme3rS1ug=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'npKH+jNcZekrzn4oLEJlUFMuTL+lU1+0A3Z37WpzPqGl8ciExVtH49tPHek2UFwIWXKme3rS1ug=',
    'x-amz-request-id': '9BCAD1C04C81500B',
    'date': 'Sat, 30 Nov 2019 00:36:15 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker-pytorch-2019-11-29-23-49-37-252/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-pytorch-2019-11-30-00-27-04-586/sourcedir.tar.gz'},
   {'Key': 'sagemaker-pytorch-2019-11-30-00-04-24-863/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-pytorch-2019-11-30-00-07-36-634/sourcedir.tar.gz'},
   {'Key': 'sagemaker-pytorch-2019-11-29-02-06-02-331/sourcedir.tar.gz'},
   {'Key': 'sagemaker-pytorch-2019-11-29-23-49-06-188/output/model.tar.gz'}

---
## Further Directions

* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a Lambda function to deploy the model to a web application.