# Plagiarism Detection Model

Now that I've created training and test data, I'm ready to define and train a model. The goal is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features.

This task will be broken down into a few discrete steps:

* Upload the data to S3.
* Define a binary classification model and a training script.
* Train the model and deploy it.
* Evaluate the deployed classifier.

## Load Data to S3

In [None]:
import os

import boto3
import numpy as np
import pandas as pd
import sagemaker

In [None]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Upload the training data to S3

In [None]:
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory
prefix = 'plagiarism_detector'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

### Test cell

In [None]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')
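
The check above lists every object in the bucket, so it can pass even when the only objects present belong to some other project. A minimal sketch of a stricter check is to keep only the keys under the upload prefix. The helper name `keys_under_prefix` and the sample keys are hypothetical, chosen for illustration; against a real bucket, boto3's `objects.filter(Prefix=prefix)` should return the same subset directly.

```python
def keys_under_prefix(keys, prefix):
    """Return only the object keys that live under the given prefix."""
    normalized = prefix.rstrip('/') + '/'
    return [key for key in keys if key.startswith(normalized)]


# Hypothetical keys, for illustration only.
all_keys = [
    'plagiarism_detector/train.csv',
    'plagiarism_detector/test.csv',
    'other_project/data.csv',
]

uploaded = keys_under_prefix(all_keys, 'plagiarism_detector')
assert len(uploaded) != 0, 'No objects found under the prefix.'
print(uploaded)
```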

---

# Modeling

Now that I've uploaded the training data, it's time to define and train a model!

There are a few options for the type of model:
* A built-in classification algorithm, like LinearLearner.
* A custom Scikit-learn classifier; a comparison of models can be found [here](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
* A custom PyTorch neural network classifier.

This notebook uses the first option, LinearLearner.

---
# Create an Estimator

In [None]:
# import LinearLearner
from sagemaker import LinearLearner

# specify an output path
prefix = 'plagiarism_detector'
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate LinearLearner
linear = LinearLearner(role=role,
                       train_instance_count=1, 
                       train_instance_type='ml.c4.xlarge',
                       predictor_type='binary_classifier',
                       output_path=output_path,
                       sagemaker_session=sagemaker_session,
                       epochs=15)
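
As a rough mental model of what a linear binary classifier computes, the sketch below scores one feature vector with a weighted sum, squashes the score through a sigmoid, and thresholds the result to get a label. This is not LinearLearner's actual implementation (the built-in algorithm also handles training and threshold tuning); the weights, bias, threshold, and feature values are made up for illustration.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (score, predicted_label) for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    score = sigmoid(z)
    label = 1.0 if score > threshold else 0.0
    return score, label


# Hypothetical parameters and similarity features, for illustration only.
weights = [2.0, 1.0]
bias = -1.5

score, label = predict_label([0.9, 0.6], weights, bias)
print(round(score, 3), label)
```

Training amounts to choosing `weights` and `bias` so that plagiarized answers score above the threshold and original answers score below it.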

## Train the estimator

In [None]:
train = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None)

# labels are in the first column, features in the remaining columns
train_y = train.iloc[:, 0]
train_x = train.iloc[:, 1:]

In [None]:
# convert features/labels to float32 numpy arrays
train_x_np = train_x.values.astype('float32')
train_y_np = train_y.values.astype('float32')

# create RecordSet
formatted_train_data = linear.record_set(train_x_np, labels=train_y_np)

In [None]:
linear.fit(formatted_train_data)

## Deploy the trained model

In [None]:
# deploy the model to create a predictor
linear_predictor = linear.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge')

---
# Evaluating the Model

In [None]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]
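
To make the label-first layout concrete, here is a plain-Python sketch of the same split, without pandas, on a few inline rows. The row values are made up; the only assumption carried over from the notebook is that column 0 holds the class label and the remaining columns hold the features.

```python
import csv
import io

# Made-up rows in the assumed layout: label first, then feature values.
sample_csv = "1,0.95,0.80\n0,0.10,0.05\n1,0.70,0.65\n"

labels, features = [], []
for row in csv.reader(io.StringIO(sample_csv)):
    labels.append(int(row[0]))                     # column 0 -> class label
    features.append([float(v) for v in row[1:]])   # columns 1.. -> features

print(labels)
print(features)
```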

## Determine the accuracy of the model

In [None]:
# test one prediction
test_x_np = np.array(test_x).astype('float32')
test_y = np.array(test_y)

result = linear_predictor.predict(test_x_np[0])
print(result)

In [None]:
# First: generate predicted class labels

predictions = linear_predictor.predict(test_x_np)
test_y_preds = np.array([prediction.label['predicted_label'].float32_tensor.values[0] for prediction in predictions])

# test that the model generates the correct number of labels
assert len(test_y_preds) == len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

In [None]:
# Second: calculate the test accuracy
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(test_y, test_y_preds)
print(accuracy)

# print out the arrays of predicted and true labels
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y)
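
Accuracy alone does not show which kind of mistake the classifier makes, and for plagiarism detection a false accusation and a missed case are quite different errors. The sketch below counts both and derives precision and recall from them. The helper names and the label lists are made up for illustration; with the real data, `test_y` and `test_y_preds` would take the place of `y_true` and `y_pred`.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1   # flagged as plagiarized, but original
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1   # plagiarized, but missed
    return tp, fp, tn, fn


def precision_recall(tp, fp, fn):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels, for illustration only.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, tn, fn)
print(precision, recall)
```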

---
## Clean up Resources

In [None]:
# delete the endpoint once it is no longer needed
# linear_predictor.delete_endpoint()

### Deleting S3 bucket

In [None]:
# delete all objects in the bucket

# bucket_to_delete = boto3.resource('s3').Bucket(bucket)
# bucket_to_delete.objects.all().delete()

---
## Further Directions

* Train a classifier to predict the *category* (1-3) of plagiarism and not just plagiarized (1) or not (0).
* Utilize a different and larger dataset to see if this model can be extended to other types of plagiarism.
* Use language or character-level analysis to find different (and more) similarity features.
* Write a complete pipeline function that accepts a source text and submitted text file, and classifies the submitted text as plagiarized or not.
* Use API Gateway and a Lambda function to deploy the model to a web application.