# Plagiarism Detection Model


## To-do:

* Upload the data to S3.
* Define a binary classification model and a training script.
* Train the model and deploy it.
* Evaluate the deployed classifier.

---

## Load Data to S3

In [1]:
import pandas as pd
import boto3
import sagemaker

In [2]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Upload your training data to S3

In [3]:
# should be the name of directory you created to save your features data
data_dir = 'plagiarism_data'
sagemaker_session = sagemaker.Session()

# set prefix, a descriptive name for a directory  
prefix = 'plagiarism'

# upload all data to S3
input_data = sagemaker_session.upload_data(data_dir, bucket=bucket, key_prefix=prefix)

### Test cell

Test that your data has been successfully uploaded. The cell below prints the items in your S3 bucket and throws an error if the bucket is empty. You should see the contents of your `data_dir` and perhaps some checkpoints. If any other files are listed, they may be old model files, which you can delete via the S3 console (additional files should not affect the performance of the model developed in this notebook).

In [4]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) !=0, 'S3 bucket is empty.'
print('Test passed!')

plagiarism-detection/test.csv
plagiarism-detection/train.csv
plagiarism/test.csv
plagiarism/train.csv
Test passed!


---

# Modeling
 
---

## Complete a training script 

**A typical training script:**
* Loads training data from a specified directory
* Parses any training and model hyperparameters (e.g., nodes in a neural network, training epochs)
* Instantiates a model of your design, with any specified hyperparameters
* Trains that model
* Finally, saves the model so that it can be hosted/deployed later
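
The steps in the list above can be sketched as a small script. This is a minimal illustration, not the notebook's actual `train.py`: the file name `model.joblib`, the local fallback paths, and the fixed `random_state` are assumptions made for the example. `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are environment variables that SageMaker sets inside the training container, and hyperparameters arrive as command-line arguments. The sketch uses `import joblib` directly, because `sklearn.externals.joblib` was removed in scikit-learn 0.23.

```python
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def parse_args(argv=None):
    """Parse directories and hyperparameters passed to the script."""
    parser = argparse.ArgumentParser()
    # SageMaker sets these variables in the training container; the fallback
    # values are hypothetical local paths for running the script by hand.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAIN", "plagiarism_data"))
    # Hyperparameters, matching the names used when the estimator is defined.
    parser.add_argument("--n_estimators", type=int, default=50)
    parser.add_argument("--max_depth", type=int, default=5)
    return parser.parse_args(argv)


def train(args):
    """Load train.csv, fit a random forest, and save it to the model directory."""
    # Labels are in the first column, features in the remaining columns.
    train_data = pd.read_csv(os.path.join(args.data_dir, "train.csv"), header=None, names=None)
    train_y = train_data.iloc[:, 0]
    train_x = train_data.iloc[:, 1:]

    model = RandomForestClassifier(
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        random_state=0,  # assumption: fixed seed for reproducible runs
    )
    model.fit(train_x, train_y)

    # Save the fitted model; "model.joblib" is an assumed file name.
    os.makedirs(args.model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
    return model


def model_fn(model_dir):
    """Load the model saved by train(); SageMaker calls this when hosting."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))
```

In a real script, `train(parse_args())` would be called under an `if __name__ == '__main__':` guard so that the hosting container can import `model_fn` without retraining.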

### Defining and training a model


In [6]:
# directory can be changed to: source_sklearn or source_pytorch
!pygmentize source_sklearn/train.py
#!pygmentize source_pytorch/train.py

from __future__ import print_function

import argparse
import os
import pandas as pd

from sklearn.externals import joblib
from sklearn.ensemble import RandomForestClassifier
## TODO: Import any additional libraries you need to define a model


# Provided model load function
def model_fn(model_dir):
    """Load model from the model_dir. This is the same model that is saved
    in the main if statement.
    """
    print("Loading model.")
    
    # load using joblib
    model =

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script shown above. To run a custom training script in SageMaker, construct an estimator and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn` OR `source_pytorch`.
* **role**: Role ARN, which was specified, above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.

Note: For a PyTorch model, there is another optional argument, **framework_version**, which you can set to a supported PyTorch version such as `1.0`.
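
The argument names listed above are those of version 1 of the SageMaker Python SDK. Version 2 renamed `train_instance_count` and `train_instance_type` to `instance_count` and `instance_type`, and its `SKLearn` estimator requires `framework_version`. The sketch below is a hypothetical helper (`sklearn_estimator_kwargs` is not part of the SDK) that builds the keyword arguments for either version; the instance type matches the one used in this notebook, and `"0.23-1"` is an assumed container version that should be checked against the SDK documentation.

```python
def sklearn_estimator_kwargs(role, sdk_major_version, hyperparameters=None):
    """Build keyword arguments for sagemaker.sklearn.estimator.SKLearn.

    Hypothetical helper for illustration only; it constructs a plain dict
    and does not contact AWS.
    """
    kwargs = {
        "entry_point": "train.py",
        "source_dir": "source_sklearn",
        "role": role,
        "hyperparameters": hyperparameters or {},
    }
    if sdk_major_version >= 2:
        # SDK v2 argument names; framework_version is required for SKLearn.
        kwargs.update(
            instance_count=1,
            instance_type="ml.m4.xlarge",
            framework_version="0.23-1",  # assumption: pick a version your SDK supports
        )
    else:
        # SDK v1 argument names, as used in this notebook.
        kwargs.update(
            train_instance_count=1,
            train_instance_type="ml.m4.xlarge",
        )
    return kwargs
```

The result would be passed along as `SKLearn(sagemaker_session=sagemaker_session, **kwargs)`, where `sagemaker.__version__` indicates which SDK version is installed.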

## Define a Scikit-learn or PyTorch estimator

In [7]:
from sagemaker.sklearn.estimator import SKLearn

estimator = SKLearn(entry_point="train.py",
                    source_dir="source_sklearn",
                    role=role,
                    train_instance_count=1,
                    train_instance_type='ml.m4.xlarge',
                    sagemaker_session=sagemaker_session,
                    #output_path=output_path,
                    hyperparameters={'n_estimators': 50 , 'max_depth':5})

## Train the estimator

Train the estimator on the training data stored in S3. This should create a training job that we can monitor in our SageMaker console.

In [8]:
%%time

# Train your estimator on S3 training data

estimator.fit({'train': input_data})


2020-07-03 12:41:02 Starting - Starting the training job...
2020-07-03 12:41:04 Starting - Launching requested ML instances......
2020-07-03 12:42:07 Starting - Preparing the instances for training...
2020-07-03 12:42:50 Downloading - Downloading input data...
2020-07-03 12:43:29 Training - Training image download completed. Training in progress.
2020-07-03 12:43:29 Uploading - Uploading generated training model.
2020-07-03 12:43:24,768 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-07-03 12:43:24,770 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-07-03 12:43:24,782 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-07-03 12:43:25,068 sagemaker-containers INFO     Module train does not provide a setup.py. 
Generating setup.py
2020-07-03 12:43:25,068 sagemaker-containers INFO     Generating setup.cfg
2020-07-03 12:43:25,0

## Deploy the trained model

In [9]:
%%time

# uncomment, if needed
# from sagemaker.pytorch import PyTorchModel


# deploy your model to create a predictor
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge') #m4, p2


---------------!
CPU times: user 255 ms, sys: 9.74 ms, total: 265 ms
Wall time: 7min 31s


---
# Evaluating the Model

Once the model is deployed, we can see how it performs when applied to our test data.

In [10]:
"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]
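
The cell above relies on the CSV having no header row and the label in the first column. A small sketch with made-up values (the numbers below are illustrative, not taken from `test.csv`) shows the split and why `header=None` matters: without it, pandas treats the first data row as column names and one sample is lost.

```python
import io

import pandas as pd

# Two made-up rows in the layout described above: label first, then features.
csv_text = "1,0.95,0.80\n0,0.10,0.05\n"

df = pd.read_csv(io.StringIO(csv_text), header=None, names=None)
y = df.iloc[:, 0]   # labels: first column
x = df.iloc[:, 1:]  # features: all remaining columns

print(y.tolist())  # [1, 0]
print(x.shape)     # (2, 2)

# Without header=None the first row is consumed as the header.
df_with_header = pd.read_csv(io.StringIO(csv_text))
print(len(df_with_header))  # 1
```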

## Determine the accuracy of your model

In [11]:
# First: generate predicted, class labels
test_y_preds = predictor.predict(test_x)


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!


In [12]:
# Second: calculate the test accuracy

from sklearn.metrics import classification_report, accuracy_score

accuracy = accuracy_score(test_y, test_y_preds)

print(accuracy)


## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0

Predicted class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]

True class labels: 
[1 1 1 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0]


### Result:

- No false positives
- No false negatives
- 100% accuracy
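
The counts behind this list can be checked directly from the label arrays printed by the previous cell. A minimal sketch in plain Python, with both arrays copied from that output (`sklearn.metrics.confusion_matrix` would report the same four counts):

```python
# Label arrays copied from the notebook output above.
predicted = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
actual = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # plagiarized, flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # original, not flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # original, wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # plagiarized, missed

print(tp, tn, fp, fn)  # 15 10 0 0
```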

This suggests that the features separate the two classes well, but the test set has only 25 samples, so 100% is almost suspiciously good and should not be read as proof that the model generalizes perfectly.
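
How much a perfect score says about the true accuracy depends on the size of the test set, which is 25 labels in the output above. As a rough sketch, assuming the test samples are independent (an assumption not stated in the notebook): if the true accuracy is p, the chance of getting all n samples right is p^n, so the smallest p still consistent with a perfect score at 95% confidence solves p^n = 0.05.

```python
n_test = 25        # number of test labels printed above
confidence = 0.95  # one-sided confidence level

# Smallest true accuracy p for which n_test correct predictions in a row
# still has probability (1 - confidence): solve p ** n_test = 1 - confidence.
lower_bound = (1 - confidence) ** (1 / n_test)

print(round(lower_bound, 3))  # 0.887
```

Under that assumption, a true accuracy of roughly 89% would still be compatible with the perfect score observed here.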


**Note: *Why did we choose this model?***: 
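- Binary classification problem --> a random forest (`RandomForestClassifier`, trained above with `n_estimators=50` and `max_depth=5`) is suitable for these kinds of problems, and it also copes well with larger datasets.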

----
## Clean up Resources

In [13]:
# delete the endpoint so that it stops incurring charges

predictor.delete_endpoint()

### Emptying the S3 bucket

In [14]:
# delete every object in the bucket (the bucket itself is kept)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': '4E4BF58B146F7464',
   'HostId': 'mdhVj3BxnZZl0DlXI5OHeUEUgevvk2w8/Yfl1ZNf0OGn/Vrd20TEuD9yNRBpgP7u+JEeX3ptJVM=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'mdhVj3BxnZZl0DlXI5OHeUEUgevvk2w8/Yfl1ZNf0OGn/Vrd20TEuD9yNRBpgP7u+JEeX3ptJVM=',
    'x-amz-request-id': '4E4BF58B146F7464',
    'date': 'Fri, 03 Jul 2020 12:53:31 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'plagiarism/train.csv'},
   {'Key': 'sagemaker-scikit-learn-2020-07-03-12-41-01-754/debug-output/training_job_end.ts'},
   {'Key': 'sagemaker-scikit-learn-2020-07-03-12-41-01-754/source/sourcedir.tar.gz'},
   {'Key': 'plagiarism-detection/test.csv'},
   {'Key': 'sagemaker-scikit-learn-2020-07-03-12-41-01-754/output/model.tar.gz'},
   {'Key': 'plagiarism-detection/train.csv'},
   {'Key': 'plagiarism/test.csv'}]}]