# Plagiarism Detection: PyTorch Model

Now that I've created training and test data, I'm ready to define and train a model. The goal is to train a binary classifier that labels an answer file as plagiarized (1) or not (0), based on the similarity features computed for that file.

This task will be broken down into a few discrete steps:

* Upload the data to S3.
* Define a binary classification model and a training script.
* Train the model and deploy it.
* Evaluate the deployed classifier.

## Load Data to S3

In [1]:
import pandas as pd
import numpy as np
import boto3
import sagemaker
import os
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

In [3]:
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism_detector'

# upload all data to S3
input_data = sagemaker_session.upload_data(path = data_dir, bucket=bucket, key_prefix=prefix)

In [4]:
train = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None)
train.head()

   0         1         2
0  0  0.398148  0.191781
1  1  0.869369  0.846491
2  1  0.593583  0.316062
3  0  0.544503  0.242574
4  0  0.329502  0.161172


### Test cell

In [5]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism_detector/.ipynb_checkpoints/test-checkpoint.csv
plagiarism_detector/.ipynb_checkpoints/train-checkpoint.csv
plagiarism_detector/test.csv
plagiarism_detector/train.csv
sagemaker-pytorch-2020-06-29-16-18-42-537/debug-output/training_job_end.ts
sagemaker-pytorch-2020-06-29-16-18-42-537/output/model.tar.gz
sagemaker-pytorch-2020-06-29-16-18-42-537/source/sourcedir.tar.gz
sagemaker-pytorch-2020-06-29-16-23-28-350/debug-output/training_job_end.ts
sagemaker-pytorch-2020-06-29-16-23-28-350/output/model.tar.gz
sagemaker-pytorch-2020-06-29-16-23-28-350/source/sourcedir.tar.gz
sagemaker-pytorch-2020-06-29-16-27-12-642/sourcedir.tar.gz
sagemaker-pytorch-2020-06-29-17-10-48-590/debug-output/training_job_end.ts
sagemaker-pytorch-2020-06-29-17-10-48-590/output/model.tar.gz
sagemaker-pytorch-2020-06-29-17-10-48-590/source/sourcedir.tar.gz
sagemaker-pytorch-2020-06-29-17-15-01-127/sourcedir.tar.gz
sagemaker-scikit-learn-2020-06-29-16-38-54-431/debug-output/training_job_end.ts
sagemaker-sciki

## Modeling

In [6]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_pytorch/train.py

import argparse
import json
import os
import pandas as pd
import torch
import torch.optim as optim
import torch.utils.data

# imports the model in model.py by name
from model import BinaryClassifier

def model_fn(model_dir):
    """Load the PyTorch model from the `model_dir` directory."""
    print("Loading model.")

    # First, load the parameters used to create the model.
    model_info = {}
    model_info_path = os.path.join(model_dir, 'mode
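
The training script imports `BinaryClassifier` from `model.py`, which this notebook never prints. As a rough sketch only, a model compatible with the `input_features`, `hidden_dim`, and `output_dim` hyperparameters used in this notebook could be a network with one hidden layer and a sigmoid output. The layer layout and the ReLU activation are my assumptions here, not necessarily what `source_pytorch/model.py` contains.

```python
import torch
import torch.nn as nn


class BinaryClassifier(nn.Module):
    """Illustrative stand-in for the model defined in model.py (assumed architecture)."""

    def __init__(self, input_features, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_features, hidden_dim)  # features -> hidden layer
        self.fc2 = nn.Linear(hidden_dim, output_dim)      # hidden layer -> single score
        self.sig = nn.Sigmoid()                           # squash the score into (0, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.sig(self.fc2(x))


# two feature rows copied from the training data preview
batch = torch.tensor([[0.398148, 0.191781],
                      [0.869369, 0.846491]])

model = BinaryClassifier(input_features=2, hidden_dim=20, output_dim=1)
probs = model(batch)  # one value per row, interpretable as a class-1 probability
print(probs.shape)
```

With a sigmoid output like this, the natural training loss would be `nn.BCELoss`, and rounding the outputs gives 0/1 class labels.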

### Define a PyTorch estimator

In [7]:
from sagemaker.pytorch import PyTorch
estimator = PyTorch(entry_point="train.py",
                    source_dir="source_pytorch",
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    sagemaker_session = sagemaker_session,
                    framework_version='1.0',
                    hyperparameters={
                        'input_features': 2,  
                        'hidden_dim': 20,
                        'output_dim': 1,
                        'epochs': 50 
                    })
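
SageMaker passes the `hyperparameters` dictionary to the entry-point script as command-line arguments, and it exposes locations such as the model directory through environment variables (`SM_MODEL_DIR`, and `SM_CHANNEL_TRAINING` for an input channel named `training`). Below is a minimal sketch of how a training script could read them with `argparse`. The `parse_args` helper, the default values, and the local fallback paths are illustrative assumptions rather than the contents of my `train.py`.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker paths (hypothetical helper)."""
    parser = argparse.ArgumentParser()

    # hyperparameters: these arrive as --name value pairs on the command line
    parser.add_argument('--input_features', type=int, default=2)
    parser.add_argument('--hidden_dim', type=int, default=10)
    parser.add_argument('--output_dim', type=int, default=1)
    parser.add_argument('--epochs', type=int, default=10)

    # locations set by SageMaker inside the training container;
    # the fallbacks are assumptions that allow a local dry run
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING', 'plagiarism_data'))

    return parser.parse_args(argv)


# simulate the arguments produced by the estimator's hyperparameters
args = parse_args(['--input_features', '2', '--hidden_dim', '20',
                   '--output_dim', '1', '--epochs', '50'])
print(args.input_features, args.hidden_dim, args.output_dim, args.epochs)
```

Passing `argv` explicitly keeps the parser testable outside a training job; inside the container, calling `parse_args()` with no arguments reads `sys.argv` as usual.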

### Train the estimator

In [8]:
estimator.fit({'training': input_data})

2020-06-29 17:31:38 Starting - Starting the training job...
2020-06-29 17:31:40 Starting - Launching requested ML instances.........
2020-06-29 17:33:21 Starting - Preparing the instances for training...
2020-06-29 17:34:05 Downloading - Downloading input data...
2020-06-29 17:34:20 Training - Downloading the training image.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-06-29 17:34:42,349 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-06-29 17:34:42,352 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-06-29 17:34:42,366 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-06-29 17:34:42,367 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-06-29 17:34:42,668 sagemaker-containers INFO     Module train does not pro

## Deploy the trained model

In [None]:
from sagemaker.pytorch import PyTorchModel

# Create a model from the trained estimator data
# And point to the prediction script
model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='1.0',
                     entry_point='predict.py',
                     source_dir='source_pytorch')

# deploy the model to create a predictor
predictor = model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge')

---------

## Evaluating The Model

In [None]:
# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

### Determine the accuracy of the model

In [None]:
test_y_preds =  np.squeeze(np.round(predictor.predict(test_x)))

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')
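
The rounding step above turns the endpoint's outputs into 0/1 labels. Assuming those outputs are sigmoid probabilities, the same conversion can be written with an explicit threshold, which makes the cutoff easy to adjust. One small difference: `np.round` rounds halves to the nearest even number, so a value of exactly 0.5 becomes 0, while the sketch below assigns it to class 1. The `to_labels` helper and the example values are made up for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (plagiarized) if it meets the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


# made-up probabilities, not real endpoint output
example_probs = [0.03, 0.5, 0.97, 0.42]

labels = to_labels(example_probs)
print(labels)  # prints [0, 1, 1, 0]

# a stricter cutoff flags fewer answers as plagiarized
strict_labels = to_labels(example_probs, threshold=0.6)
print(strict_labels)  # prints [0, 0, 1, 0]
```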

In [None]:
# calculate the test accuracy
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)

# optionally, print the arrays of predicted and true labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

In [None]:
# classification report: per-class precision, recall, and F1 score
print(classification_report(test_y, test_y_preds))
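
To make the numbers in the report easier to interpret, here is a small sketch that derives accuracy, precision, and recall for the plagiarized class (1) directly from the confusion counts. The `binary_metrics` helper and the label lists are hypothetical examples, not results from my model.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion counts and basic metrics for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # plagiarized, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # original, wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # plagiarized, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # original, not flagged

    return {
        'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
        'accuracy': (tp + tn) / len(pairs) if pairs else 0.0,
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }


# hypothetical labels for illustration only
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For a plagiarism detector, false positives (accusing an honest answer) and false negatives (missing a copied one) carry different costs, which is why looking at precision and recall separately is more informative than accuracy alone.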

----
## Clean up Resources

In [None]:
# Accepts a predictor endpoint as input
# and deletes the endpoint by name
from botocore.exceptions import ClientError

def delete_endpoint(predictor):
    try:
        boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint)
        print('Deleted {}'.format(predictor.endpoint))
    except ClientError as e:
        # raised, for example, when the endpoint no longer exists
        print('Endpoint {} was not deleted: {}'.format(predictor.endpoint, e))

# delete the predictor endpoint
delete_endpoint(predictor)

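### Emptying the S3 bucket

In [None]:
# empty the bucket: this deletes every object in it, but the bucket itself remains

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

# uncomment to also remove the now-empty bucket
# bucket_to_delete.delete()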

---
## Further Directions

* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a Lambda function to expose the deployed model to a web application.