# Plagiarism Detection Model

Now that I've created training and test data, I'm ready to define and train a model. The goal is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features.

This task will be broken down into a few discrete steps:

* Upload the data to S3.
* Define a binary classification model and a training script.
* Train the model and deploy it.
* Evaluate the deployed classifier.

## Load Data to S3

In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# get the default S3 bucket for this session (created if it does not exist yet)
bucket = sagemaker_session.default_bucket()

## Upload the training data to S3

In [3]:
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism_detector'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
print(input_data)

s3://sagemaker-us-west-2-376940003530/plagiarism_detector
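The returned URI is the bucket name followed by the key prefix, and each uploaded file sits under that prefix. As a small illustration, the URI can be taken apart with the standard library. The helper name `split_s3_uri` is hypothetical and not part of boto3 or the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key_prefix = split_s3_uri(
    "s3://sagemaker-us-west-2-376940003530/plagiarism_detector"
)
# The training file is expected at <key_prefix>/train.csv inside that bucket.
train_key = key_prefix + "/train.csv"
print(bucket_name, train_key)
```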


### Test cell

In [4]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism_detector/test.csv
plagiarism_detector/train.csv
sagemaker-pytorch-2019-11-28-21-49-52-781/source/sourcedir.tar.gz
sagemaker-scikit-learn-2019-11-28-21-48-17-059/source/sourcedir.tar.gz
sagemaker-scikit-learn-2019-11-28-21-48-27-275/source/sourcedir.tar.gz
sagemaker-scikit-learn-2019-11-28-22-01-53-614/source/sourcedir.tar.gz
sagemaker-scikit-learn-2019-11-28-22-11-32-352/source/sourcedir.tar.gz
Test passed!
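Note that the check above passes as long as the bucket contains any object at all, including leftovers from earlier jobs. A stricter check, sketched here on a plain list of keys, confirms that the expected files sit under the prefix. The function name `missing_keys` is an assumption for illustration; with boto3, the keys could instead come from `Bucket(bucket).objects.filter(Prefix=prefix)`.

```python
def missing_keys(keys, prefix, expected_files):
    """Return the expected file names that are not present under prefix."""
    present = set(keys)
    return [name for name in expected_files
            if "{}/{}".format(prefix, name) not in present]


# A subset of the keys from the listing above.
keys = [
    "plagiarism_detector/test.csv",
    "plagiarism_detector/train.csv",
    "sagemaker-pytorch-2019-11-28-21-49-52-781/source/sourcedir.tar.gz",
]

missing = missing_keys(keys, "plagiarism_detector", ["train.csv", "test.csv"])
assert not missing, "Missing from S3: {}".format(missing)
```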


---

# Modeling

Now that I've uploaded the training data, it's time to define and train a model!

There are several options for the type of model:
* A built-in classification algorithm, like LinearLearner.
* A custom Scikit-learn classifier; a comparison of models can be found [here](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
* A custom PyTorch neural network classifier.
 
---

## Write a training script 

A typical training script:
* Loads training data from a specified directory
* Parses any training and model hyperparameters (e.g. nodes in a neural network, training epochs, etc.)
* Instantiates a model with any specified hyperparameters
* Trains that model
* Finally, saves the model so that it can be hosted/deployed later
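The steps above can be sketched as follows. This is a minimal outline under stated assumptions, not the project's actual `train.py`: it assumes a `LinearSVC` model, a `train.csv` file with no header and the label in the first column, and the standalone `joblib` package (newer scikit-learn releases no longer ship `sklearn.externals.joblib`). The function names `train`, `parse_args`, and `main` are hypothetical. `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are environment variables that SageMaker sets inside the training container; the fallback paths are only for local runs.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.svm import LinearSVC


def train(train_dir, model_dir):
    """Fit a LinearSVC on train.csv and save it as model.joblib."""
    # Assumption: no header row, label in the first column, features after it.
    train_data = pd.read_csv(os.path.join(train_dir, "train.csv"),
                             header=None, names=None)
    train_y = train_data.iloc[:, 0]
    train_x = train_data.iloc[:, 1:]

    model = LinearSVC()
    model.fit(train_x, train_y)

    # Save under the file name that model_fn loads.
    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    return model


def parse_args(argv=None):
    """Read the data and model directories, defaulting to SageMaker's env vars."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING", "plagiarism_data"))
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    return train(args.data_dir, args.model_dir)
```

In a real training script, `main()` would be called under `if __name__ == '__main__':` so that it runs when SageMaker executes the entry point.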

In [5]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_sklearn/train.py

from __future__ import print_function

import argparse
import os
import pandas as pd

from sklearn.externals import joblib

from sklearn.svm import LinearSVC


# Provided model load function
def model_fn(model_dir):
    """Load model from the model_dir. This is the same model that is saved
    in the main if statement.
    """
    print("Loading model.")

    # load using joblib
    model = joblib.load(os.path.join(model_dir, "model.joblib"))

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script shown above. The estimator for a custom training script takes the following constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training.
* **source_dir**: The path to the training script directory, `source_sklearn` OR `source_pytorch`.
* **role**: Role ARN, which was specified above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.

Note: For a PyTorch model, there is another optional argument **framework_version**, which can be set to the latest version of PyTorch.
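SageMaker passes each entry of the `hyperparameters` dictionary to the training script as a command-line argument of the form `--name value`, so the script can read them with `argparse`. A minimal sketch follows; the hyperparameter names `C` and `max_iter` and their defaults are illustrative assumptions, not values taken from this project's `train.py`.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters passed as --name value arguments."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameters for a linear SVM.
    parser.add_argument("--C", type=float, default=1.0)
    parser.add_argument("--max_iter", type=int, default=1000)
    # parse_known_args ignores any arguments this parser does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulates the arguments produced by hyperparameters={'C': 0.5, 'max_iter': 2000}.
args = parse_hyperparameters(["--C", "0.5", "--max_iter", "2000"])
print(args.C, args.max_iter)
```

The parsed values would then be handed to the model constructor inside the training script.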

## Define a Scikit-learn or PyTorch estimator

To import the desired estimator, use one of the following lines:
```
from sagemaker.sklearn.estimator import SKLearn
```
```
from sagemaker.pytorch import PyTorch
```

In [6]:
from sagemaker.sklearn.estimator import SKLearn
estimator = SKLearn(entry_point="train.py",
                    source_dir="source_sklearn",
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge'
                    )

## Train the estimator

In [7]:
estimator.fit({'training': input_data})

2019-11-28 22:17:32 Starting - Starting the training job...
2019-11-28 22:17:34 Starting - Launching requested ML instances......
2019-11-28 22:18:35 Starting - Preparing the instances for training...
2019-11-28 22:19:30 Downloading - Downloading input data...
2019-11-28 22:20:01 Training - Training image download completed. Training in progress..
2019-11-28 22:20:01,857 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2019-11-28 22:20:01,860 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-11-28 22:20:01,871 sagemaker_sklearn_container.training INFO     Invoking user training script.
2019-11-28 22:20:02,164 sagemaker-containers INFO     Module train does not provide a setup.py.
Generating setup.py
2019-11-28 22:20:02,165 sagemaker-containers INFO     Generating setup.cfg
2019-11-28 22:20:02,165 sagemaker-containers INFO     Generating MANIFEST.in
20

## Deploy the trained model

In [8]:
# from sagemaker.sklearn.model import SKLearnModel

# deploy the model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge')

--------------------------------------------------------------------------!

---
# Evaluating The Model

In [9]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determine the accuracy of the model

In [10]:
# First: generate predicted class labels
test_y_preds = predictor.predict(test_x)

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [12]:
# Second: calculate the test accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)

## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0

Predicted class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]
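Accuracy is the fraction of predictions that match the true labels. The sketch below recomputes it in plain Python from the arrays printed above and also counts false positives and false negatives, which accuracy alone does not distinguish. The function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes for binary labels, where 1 means plagiarized."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return tp, tn, fp, fn


# Labels copied from the output above.
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(tp, tn, fp, fn, accuracy)
```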


----
## Clean up Resources

In [13]:
# delete the endpoint through the predictor returned by deploy()
predictor.delete_endpoint()

### Emptying the S3 bucket

In [14]:
# delete all objects in the bucket (the bucket itself is kept)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'D63164D09D595398',
   'HostId': 'N+/6y+6DMiK4fIPw+c1lY3p8bfmJc/M6t3oCp8rxJtLibjtZBxx8PG0a3KmFSP/UCIbkXx/J0Qo=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'N+/6y+6DMiK4fIPw+c1lY3p8bfmJc/M6t3oCp8rxJtLibjtZBxx8PG0a3KmFSP/UCIbkXx/J0Qo=',
    'x-amz-request-id': 'D63164D09D595398',
    'date': 'Thu, 28 Nov 2019 22:35:31 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker-pytorch-2019-11-28-21-49-52-781/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-11-28-22-17-32-062/output/model.tar.gz'},
   {'Key': 'sagemaker-pytorch-2019-11-28-22-18-51-969/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-11-28-21-48-27-275/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-11-28-22-01-53-614/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-11-

---
## Further Directions

* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a Lambda function to deploy your model to a web application.