# Plagiarism Detection Model

Now that we've created training and test data, we are ready to define and train a model.

---

In [1]:
import os
import boto3
import pandas as pd
from sklearn.metrics import accuracy_score


import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.pytorch import PyTorchModel

In [2]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

In [3]:
data_dir = 'plagiarism_data'
prefix = 'plagiarism_data'

data_path = sagemaker_session.upload_data(data_dir, bucket=bucket, key_prefix=prefix)

---
#### Below is the training script for the model designed to identify plagiarism in the dataset created in the previous notebook.

---

In [32]:
!pygmentize source/train.py

import argparse
import json
import os
import pandas as pd

import torch
import torch.optim as optim
import torch.nn as nn
import torch.utils.data

from model import BinaryClassifier


def model_fn(model_dir):
    
    """Load the PyTorch model from the `model_dir` directory."""
    
    print("Loading model.")

    model_info = {}
    model_info_path = os.path.join(model_dir, 'model_info.pth

In [42]:
estimator = PyTorch(
    entry_point='train.py',
    source_dir='source',
    role=role,
    framework_version='1.0',
    sagemaker_session=sagemaker_session,
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
    hyperparameters={
        'input_features': 3,
        'hidden_dim': 32,
        'output_dim': 1,
        'epochs': 50
    }
)
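
SageMaker passes each entry in `hyperparameters` to the entry-point script as a command-line argument, and exposes the model and data directories through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`. The full `train.py` is not shown above, so the following is only a sketch of how such a script might read those values with `argparse`; the function name `parse_training_args` and the local fallback paths are assumptions.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Parse hyperparameters and container paths (hypothetical helper)."""
    parser = argparse.ArgumentParser()

    # Paths provided by the training container through environment variables.
    # The fallback values are assumptions so the sketch also runs locally.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', 'plagiarism_data'))

    # Hyperparameters: the names match the keys of the `hyperparameters` dict.
    parser.add_argument('--input_features', type=int, default=3)
    parser.add_argument('--hidden_dim', type=int, default=32)
    parser.add_argument('--output_dim', type=int, default=1)
    parser.add_argument('--epochs', type=int, default=50)

    return parser.parse_args(argv)


# Simulate the arguments the container would pass for the estimator above.
args = parse_training_args(['--input_features', '3', '--hidden_dim', '32',
                            '--output_dim', '1', '--epochs', '50'])
print(args.input_features, args.hidden_dim, args.output_dim, args.epochs)
```

The channel name used in `fit` (here `'train'`) determines the environment variable suffix, which is why the sketch reads `SM_CHANNEL_TRAIN`.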

## EXERCISE: Train the estimator

Train your estimator on the training data stored in S3. This should create a training job that you can monitor in your SageMaker console.
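
The `fit` call in the next cell takes a dictionary that maps a channel name (`'train'`) to an S3 URI. As a minimal sketch, the URI can also be built with `posixpath.join`, which always uses forward slashes regardless of the operating system the notebook runs on. The helper name `s3_join` and the bucket name below are hypothetical.

```python
import posixpath


def s3_join(base_uri, *parts):
    """Join S3 URI components with forward slashes (hypothetical helper)."""
    return posixpath.join(base_uri, *parts)


# 'example-bucket' is a placeholder, not the bucket used in this notebook.
train_uri = s3_join('s3://example-bucket/plagiarism_data', 'train.csv')
print(train_uri)  # s3://example-bucket/plagiarism_data/train.csv
```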

In [43]:
%%time
estimator.fit({'train': os.path.join(data_path, 'train.csv')})

2020-02-20 17:59:48 Starting - Starting the training job...
2020-02-20 17:59:50 Starting - Launching requested ML instances......
2020-02-20 18:00:58 Starting - Preparing the instances for training......
2020-02-20 18:02:16 Downloading - Downloading input data
2020-02-20 18:02:16 Training - Downloading the training image...
2020-02-20 18:02:46 Uploading - Uploading generated training model
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-02-20 18:02:37,260 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-02-20 18:02:37,262 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-02-20 18:02:37,274 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-02-20 18:02:40,290 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-02-20 18

In [48]:
%%time

model = PyTorchModel(
    model_data=estimator.model_data,
    role=role,
    framework_version='1.0',
    entry_point='predict.py',
    source_dir='source'
)

predictor = model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

---------------------!
CPU times: user 467 ms, sys: 14.2 ms, total: 481 ms
Wall time: 10min 32s


In [56]:


accuracy = accuracy_score(test_y, test_y_preds)
print(accuracy)

# Optionally print the predicted and true class labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0

Predicted class labels: 
[[1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [1.]
 [0.]
 [0.]]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]


---
### Observations


#### There are currently no false positives or false negatives. This could be because the test sample is small (25 examples), because the line between the answer and source texts is clear enough for the model to label every example correctly, or because some overfitting has taken place. In reality, it is likely a blend of these reasons.
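
One way to check the claim about false positives and negatives is to tally the confusion-matrix counts directly from the labels printed above. The sketch below does that in plain Python. It also computes the exact one-sided 95% lower bound on accuracy for the case where all `n` test examples are correct, which is `alpha ** (1 / n)`. The helper names `confusion_counts` and `accuracy_lower_bound` are hypothetical.

```python
# Labels copied from the output above; predictions are flattened from shape (25, 1).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
          1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0,
          1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        pred = int(round(pred))
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy_lower_bound(n, alpha=0.05):
    """One-sided lower bound on accuracy when all n examples are correct.

    If the true accuracy is p, the chance of n correct out of n is p ** n,
    so the bound is the p at which p ** n equals alpha.
    """
    return alpha ** (1.0 / n)


tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print('TP:', tp, 'FP:', fp, 'TN:', tn, 'FN:', fn)
print('95% lower bound on accuracy:', round(accuracy_lower_bound(len(y_true)), 3))
```

With 25 of 25 examples correct, the bound comes out to about 0.89, which is consistent with the caution about the small sample: a perfect score on this test set does not rule out a true error rate of around 10%.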

#### I chose to build a neural network with PyTorch for a few reasons. I wanted a good solution for the problem, not just a simple one. Linear models such as logistic regression work well for binary classification and are simpler to create and train, but I've found that neural networks that are developed and trained properly can produce more robust results.

---

In [58]:
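# Delete the endpoint so it stops incurring charges
predictor.delete_endpoint()

# Empty the S3 bucket used in this notebook (this removes every object in the default bucket)
bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()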