# Plagiarism Detection: Linear Model

Now that I've created training and test data, I'm ready to define and train a model. The goal is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features.

This task will be broken down into a few discrete steps:

* Upload the data to S3.
* Define a binary classification model and a training script.
* Train the model and deploy it.
* Evaluate the deployed classifier.

## Load Data to S3

In [1]:
import pandas as pd
import numpy as np
import boto3
import sagemaker
import os
from sagemaker import LinearLearner
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

In [3]:
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism_detector'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

### Test cell

In [4]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism_detector/.ipynb_checkpoints/test-checkpoint.csv
plagiarism_detector/.ipynb_checkpoints/train-checkpoint.csv
plagiarism_detector/linear-learner-2020-06-29-15-44-28-573/output/model.tar.gz
plagiarism_detector/test.csv
plagiarism_detector/train.csv
sagemaker-record-sets/LinearLearner-2020-06-29-15-44-26-538/.amazon.manifest
sagemaker-record-sets/LinearLearner-2020-06-29-15-44-26-538/matrix_0.pbr
Test passed!


## Modeling


### Create an Estimator

In [5]:
# specify an output path
prefix = 'plagiarism_detector'
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate LinearLearner, selecting the model by recall_at_target_precision
# because false positives are worse than false negatives
linear = LinearLearner(role=role,
                       train_instance_count=1, 
                       train_instance_type='ml.c4.xlarge',
                       predictor_type='binary_classifier',
                       output_path=output_path,
                       sagemaker_session=sagemaker_session,
                       epochs=15,
                       binary_classifier_model_selection_criteria='recall_at_target_precision',
                       target_precision=0.9,
                       positive_example_weight_mult='balanced')

### Train the estimator

In [6]:
train = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None)

# the label is in the first column; the features follow
train_y = train.iloc[:, 0]
train_x = train.iloc[:, 1:]

In [7]:
# convert features/labels to numpy
train_x_np = np.array(train_x).astype('float32')
train_y_np = np.array(train_y).astype('float32')

# create RecordSet
formatted_train_data = linear.record_set(train_x_np, labels=train_y_np)

In [8]:
linear.fit(formatted_train_data)

2020-06-29 15:50:16 Starting - Starting the training job...
2020-06-29 15:50:18 Starting - Launching requested ML instances......
2020-06-29 15:51:28 Starting - Preparing the instances for training......
2020-06-29 15:52:41 Downloading - Downloading input data...
2020-06-29 15:52:58 Training - Downloading the training image.
Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/29/2020 15:53:20 INFO 139985385948992] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'feature_dim': u'auto', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_met

## Deploy the trained model

In [9]:
# deploy the model to create a predictor
linear_predictor = linear.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge')

-------------!

## Evaluating The Model

In [10]:
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

### Determine the accuracy of the model

In [11]:
# test one prediction
test_x_np = np.array(test_x).astype('float32')
test_y = np.array(test_y).astype('float32')

result = linear_predictor.predict(test_x_np[0])
print(result)

[label {
  key: "predicted_label"
  value {
    float32_tensor {
      values: 1.0
    }
  }
}
label {
  key: "score"
  value {
    float32_tensor {
      values: 0.9930276274681091
    }
  }
}
]


In [12]:
# First: generate predicted class labels

predictions = linear_predictor.predict(test_x_np)
test_y_preds = np.array([prediction.label['predicted_label'].float32_tensor.values[0] for prediction in predictions])

# test that your model generates the correct number of labels
assert len(test_y_preds) == len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [13]:
# Second: calculate the test accuracy
accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)

# optionally print the predicted and true class labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y)

0.88

Predicted class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0.
 0.]

True class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0.]


In [15]:
# Third: classification report
print(classification_report(test_y, test_y_preds))

              precision    recall  f1-score   support

         0.0       1.00      0.70      0.82        10
         1.0       0.83      1.00      0.91        15

   micro avg       0.88      0.88      0.88        25
   macro avg       0.92      0.85      0.87        25
weighted avg       0.90      0.88      0.87        25
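
The report can be traced back to raw counts. The sketch below uses plain Python and the predicted and true labels printed earlier (copied by hand) to count true and false positives and negatives. The helper name `confusion_counts` is my own and is not part of scikit-learn.

```python
# labels copied from the printed output of the accuracy cell
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)        # of the answers flagged as plagiarized, how many were
recall = tp / (tp + fn)           # of the plagiarized answers, how many were flagged
accuracy = (tp + tn) / len(y_true)

print(tn, fp, fn, tp)  # 7 3 0 15
print(round(precision, 2), round(recall, 2), round(accuracy, 2))  # 0.83 1.0 0.88
```

All three errors on this split are false positives (non-plagiarized answers labeled as plagiarized). Precision for class 1 on the test set is therefore 0.83, below the 0.9 target used for model selection, which can happen because the decision threshold was chosen during training and never saw this test split.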



## Clean up Resources

In [16]:
# delete the endpoint through the predictor returned by deploy()
linear_predictor.delete_endpoint()

### Empty the S3 bucket

In [17]:
# delete every object in the bucket (the bucket itself is kept)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'AT7P4Q6W4G8S0M9T',
   'HostId': '1ne4pWe7dx+g8aZTlB92Y09Y1fSOVm2xigp/TKjLNlh/JXSKaKDgiKXMvLT9Ud+dzUDiFZ2YCd4=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': '1ne4pWe7dx+g8aZTlB92Y09Y1fSOVm2xigp/TKjLNlh/JXSKaKDgiKXMvLT9Ud+dzUDiFZ2YCd4=',
    'x-amz-request-id': 'AT7P4Q6W4G8S0M9T',
    'date': 'Mon, 29 Jun 2020 16:08:54 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'plagiarism_detector/.ipynb_checkpoints/train-checkpoint.csv'},
   {'Key': 'plagiarism_detector/train.csv'},
   {'Key': 'sagemaker-record-sets/LinearLearner-2020-06-29-15-50-11-686/.amazon.manifest'},
   {'Key': 'sagemaker-record-sets/LinearLearner-2020-06-29-15-44-26-538/matrix_0.pbr'},
   {'Key': 'sagemaker-record-sets/LinearLearner-2020-06-29-15-50-11-686/matrix_0.pbr'},
   {'Key': 'plagiarism_detector/linear-learner-2020-06-29-15-50-16-158/

## Further Directions

* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a Lambda function to deploy your model to a web application.
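
As one possible starting point for the character-level idea in the list above, here is a minimal sketch of a character n-gram Jaccard similarity between a source text and an answer text. It is not a feature used in this project; the function names and the default `n=3` are assumptions chosen for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def char_ngram_jaccard(source_text, answer_text, n=3):
    """Jaccard similarity of the two texts' character n-gram sets, in [0, 1]."""
    source_grams = char_ngrams(source_text, n)
    answer_grams = char_ngrams(answer_text, n)
    if not source_grams or not answer_grams:
        return 0.0  # texts shorter than n characters carry no n-grams
    shared = source_grams & answer_grams
    combined = source_grams | answer_grams
    return len(shared) / len(combined)

# hypothetical snippets, not taken from the plagiarism corpus
print(char_ngram_jaccard("abcd", "bcde"))  # shares 1 of 3 distinct trigrams
```

A value like this could be appended as one more column in `train.csv` and `test.csv`, next to the existing features, before retraining.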