# Plagiarism Detection: Sklearn Model

Now that I've created training and test data, I'm ready to define and train a model. The goal is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features.

This task will be broken down into a few discrete steps:

* Upload the data to S3.
* Define a binary classification model and a training script.
* Train the model and deploy it.
* Evaluate the deployed classifier.

## Load Data to S3

In [1]:
import pandas as pd
import numpy as np
import boto3
import os
import sagemaker
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# get the session's default S3 bucket (created if it does not exist yet)
bucket = sagemaker_session.default_bucket()

In [3]:
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory
prefix = 'plagiarism_detector'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
print(input_data)

s3://sagemaker-us-west-2-376940003530/plagiarism_detector


### Test cell

In [4]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism_detector/.ipynb_checkpoints/test-checkpoint.csv
plagiarism_detector/.ipynb_checkpoints/train-checkpoint.csv
plagiarism_detector/test.csv
plagiarism_detector/train.csv
sagemaker-pytorch-2020-06-29-16-18-42-537/debug-output/training_job_end.ts
sagemaker-pytorch-2020-06-29-16-18-42-537/output/model.tar.gz
sagemaker-pytorch-2020-06-29-16-18-42-537/source/sourcedir.tar.gz
sagemaker-pytorch-2020-06-29-16-23-28-350/debug-output/training_job_end.ts
sagemaker-pytorch-2020-06-29-16-23-28-350/output/model.tar.gz
sagemaker-pytorch-2020-06-29-16-23-28-350/source/sourcedir.tar.gz
sagemaker-pytorch-2020-06-29-16-27-12-642/sourcedir.tar.gz
sagemaker-pytorch-2020-06-29-17-10-48-590/debug-output/training_job_end.ts
sagemaker-pytorch-2020-06-29-17-10-48-590/output/model.tar.gz
sagemaker-pytorch-2020-06-29-17-10-48-590/source/sourcedir.tar.gz
sagemaker-pytorch-2020-06-29-17-15-01-127/sourcedir.tar.gz
sagemaker-scikit-learn-2020-06-29-16-38-54-431/debug-output/training_job_end.ts
sagemaker-sciki
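The listing above walks the whole bucket, so it also shows artifacts from earlier training jobs and the notebook checkpoint copies that were uploaded along with the data. A minimal sketch of narrowing the check to the upload prefix follows; the helper name `keys_under_prefix` is hypothetical, and the sample keys are copied from the output above. With boto3, the same restriction can be applied on the server side through `Bucket(bucket).objects.filter(Prefix=prefix)`.

```python
def keys_under_prefix(keys, prefix):
    """Return keys stored under `prefix/`, skipping notebook checkpoint copies."""
    wanted = []
    for key in keys:
        # keep only objects inside the data "directory"
        if not key.startswith(prefix + "/"):
            continue
        # Jupyter checkpoint files are not part of the dataset
        if ".ipynb_checkpoints/" in key:
            continue
        wanted.append(key)
    return wanted


# a few keys taken from the bucket listing above
listed = [
    "plagiarism_detector/.ipynb_checkpoints/test-checkpoint.csv",
    "plagiarism_detector/.ipynb_checkpoints/train-checkpoint.csv",
    "plagiarism_detector/test.csv",
    "plagiarism_detector/train.csv",
    "sagemaker-pytorch-2020-06-29-16-18-42-537/output/model.tar.gz",
]

data_keys = keys_under_prefix(listed, "plagiarism_detector")
print(data_keys)
```

Asserting on `data_keys` instead of on every key in the bucket would make the test cell fail when the upload itself is missing, even if older job artifacts are still present.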

## Modeling

In [5]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_sklearn/train.py

[34mfrom[39;49;00m [04m[36m__future__[39;49;00m [34mimport[39;49;00m print_function

[34mimport[39;49;00m [04m[36margparse[39;49;00m
[34mimport[39;49;00m [04m[36mos[39;49;00m
[34mimport[39;49;00m [04m[36mpandas[39;49;00m [34mas[39;49;00m [04m[36mpd[39;49;00m

[34mfrom[39;49;00m [04m[36msklearn.externals[39;49;00m [34mimport[39;49;00m joblib

[34mfrom[39;49;00m [04m[36msklearn.svm[39;49;00m [34mimport[39;49;00m LinearSVC


[37m# Provided model load function[39;49;00m
[34mdef[39;49;00m [32mmodel_fn[39;49;00m(model_dir):
    [33m"""Load model from the model_dir. This is the same model that is saved[39;49;00m
[33m    in the main if statement.[39;49;00m
[33m    """[39;49;00m
    [34mprint[39;49;00m([33m"[39;49;00m[33mLoading model.[39;49;00m[33m"[39;49;00m)

    [37m# load using joblib[39;49;00m
    model = joblib.load(os.path.join(model_dir, [33m"[39;49;00m[33mmodel.joblib[39;49;00m[33m"[39;49;00m))
    [34mprint[39;49;

## Create an Estimator

### Define a Scikit-learn estimator

In [6]:
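from sagemaker.sklearn.estimator import SKLearn

# Argument names below follow SageMaker Python SDK v1; SDK v2 renames them to
# instance_count / instance_type and also requires framework_version.
estimator = SKLearn(entry_point="train.py",
                    source_dir="source_sklearn",
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge'
                    )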

### Train the estimator

In [7]:
estimator.fit({'training': input_data})

2020-06-29 17:23:58 Starting - Starting the training job...
2020-06-29 17:24:01 Starting - Launching requested ML instances.........
2020-06-29 17:25:35 Starting - Preparing the instances for training...
2020-06-29 17:26:26 Downloading - Downloading input data...
2020-06-29 17:26:51 Training - Downloading the training image...
2020-06-29 17:27:22 Uploading - Uploading generated training model
2020-06-29 17:27:22 Completed - Training job completed
2020-06-29 17:27:10,789 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-06-29 17:27:10,792 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-06-29 17:27:10,804 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-06-29 17:27:11,083 sagemaker-containers INFO     Module train does not provide a setup.py.
Generating setup.py
2020-06-29 17:27:11,083 sagemaker-containers INFO     Generating setup
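The dictionary key passed to `fit` is a channel name, and it decides where the training script finds its data inside the container. A small sketch of that naming convention, assuming the standard SageMaker container layout, is shown below; the two helper functions and the example bucket name are hypothetical and exist only for illustration.

```python
def channel_env_var(channel_name):
    """Environment variable that holds the local path of a channel's data."""
    return "SM_CHANNEL_" + channel_name.upper()


def channel_data_dir(channel_name):
    """Directory inside the training container where the channel is downloaded."""
    return "/opt/ml/input/data/" + channel_name


# hypothetical S3 location, standing in for the input_data value used above
inputs = {"training": "s3://example-bucket/plagiarism_detector"}

for name, s3_uri in inputs.items():
    print(name, "->", channel_env_var(name), "=", channel_data_dir(name))
```

With the key `'training'`, a script would read its CSV files from the directory named by `SM_CHANNEL_TRAINING`.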

## Deploy the trained model

In [8]:
# from sagemaker.sklearn.model import SKLearnModel

# deploy the model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge')

-------------!

## Evaluating The Model

In [9]:
# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

### Determine the accuracy of the model

In [10]:
# First: generate predicted class labels
test_y_preds = predictor.predict(test_x)

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [11]:
# Second: calculate the test accuracy
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)

# print out the arrays of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0

Predicted class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]


In [12]:
# Third: classification report
print(classification_report(test_y, test_y_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        15

   micro avg       1.00      1.00      1.00        25
   macro avg       1.00      1.00      1.00        25
weighted avg       1.00      1.00      1.00        25



Question 1: How many false positives and false negatives did your model produce, if any? And why do you think this is?

Answer: The model produced 0 false positives and 0 false negatives on the 25 test files. An SVM works well when there is a clear margin of separation between the classes, which appears to be the case for these features.
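The counts can be checked directly from the two label arrays printed earlier. The sketch below tallies the four confusion-matrix cells in plain Python; the helper name `confusion_counts` is hypothetical, and the label lists are copied from the output above. For binary labels, `confusion_matrix(test_y, test_y_preds).ravel()` from scikit-learn returns the same four numbers in the same order.

```python
# labels copied from the printed output of the accuracy cell
predicted = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 means plagiarized."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1  # flagged as plagiarized, but it was not
        else:
            fn += 1  # plagiarized, but missed
    return tn, fp, fn, tp


tn, fp, fn, tp = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp, "accuracy:", accuracy)
```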

Question 2: How did you decide on the type of model to use?

Answer: I tried the SageMaker built-in LinearLearner, a custom PyTorch model, and a support vector classifier (SVC) from scikit-learn, then compared their performance. The SVC model performed best.

----
## Clean up Resources

In [13]:
# delete the endpoint through the predictor returned by deploy()
predictor.delete_endpoint()

### Emptying the S3 bucket

In [14]:
# delete every object in the bucket (the bucket itself is kept)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'D63164D09D595398',
   'HostId': 'N+/6y+6DMiK4fIPw+c1lY3p8bfmJc/M6t3oCp8rxJtLibjtZBxx8PG0a3KmFSP/UCIbkXx/J0Qo=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'N+/6y+6DMiK4fIPw+c1lY3p8bfmJc/M6t3oCp8rxJtLibjtZBxx8PG0a3KmFSP/UCIbkXx/J0Qo=',
    'x-amz-request-id': 'D63164D09D595398',
    'date': 'Thu, 28 Nov 2019 22:35:31 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker-pytorch-2019-11-28-21-49-52-781/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-11-28-22-17-32-062/output/model.tar.gz'},
   {'Key': 'sagemaker-pytorch-2019-11-28-22-18-51-969/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-11-28-21-48-27-275/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-11-28-22-01-53-614/source/sourcedir.tar.gz'},
   {'Key': 'sagemaker-scikit-learn-2019-11-

---
## Further Directions

* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a Lambda function to deploy your model to a web application.