# Plagiarism Detection Model

Now that we've created training and test data, we are ready to define and train a model. Our goal in this notebook is to train a binary classification model that learns to label an answer file as either plagiarized or not, based on the features we provide to the model.

This task will be broken down into a few discrete steps:

* Upload your data to S3.
* Define a binary classification model and a training script.
* Train your model and deploy it.
* Evaluate your deployed classifier and answer some questions about your approach.

---

## Load Data to S3

In the last notebook, you should have created two files, `train.csv` and `test.csv`, with the features and class labels for the given corpus of plagiarized/non-plagiarized text data.

>The cells below load some AWS SageMaker libraries and create a default bucket. After creating this bucket, we can upload the locally stored data to S3.

Save the train and test `.csv` feature files locally. To do this, either run the second notebook, "2_Plagiarism_Feature_Engineering", in SageMaker, or manually upload the files to this notebook using the upload icon in JupyterLab. Then upload the local files to S3 by calling `sagemaker_session.upload_data` and pointing directly to where the training data is saved.

In [11]:
import pandas as pd
import boto3
import sagemaker
import numpy as np

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# create an S3 bucket
bucket = sagemaker_session.default_bucket()

## Upload your training data to S3

Specify the `data_dir` where you've saved your `train.csv` file. Decide on a descriptive `prefix` that defines where your data will be uploaded in the default S3 bucket. Finally, create a pointer to your training data by calling `sagemaker_session.upload_data` and passing in the required parameters. It may help to look at the [Session documentation](https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.Session.upload_data) or previous SageMaker code examples.

In [3]:
# should be the name of directory you created to save your features data
data_dir = 'plagiarism_data'

# set prefix, a descriptive name for a directory  
prefix = 'aws-ml-sagemaker-plagiarism'

# upload all data to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

In [4]:
print(input_data)

s3://sagemaker-us-east-2-635229580099/aws-ml-sagemaker-plagiarism
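The returned URI combines the bucket name and the `key_prefix`. As a rough sketch of the resulting layout, assuming each file under `data_dir` ends up at `key_prefix/<relative path>`, the helper below lists the keys an upload would be expected to create. `planned_s3_keys` is a hypothetical helper written for illustration; it does not call AWS, and a throwaway directory stands in for `plagiarism_data`.

```python
import os
import tempfile


def planned_s3_keys(data_dir, key_prefix):
    """List the object keys a directory upload would be expected to create."""
    keys = []
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        for name in filenames:
            # path of the file relative to the directory being uploaded
            rel_path = os.path.relpath(os.path.join(dirpath, name), data_dir)
            # S3 keys use forward slashes, whatever the local OS uses
            keys.append('/'.join([key_prefix] + rel_path.split(os.sep)))
    return sorted(keys)


# Demo: a temporary directory stands in for `plagiarism_data` (assumption)
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ('train.csv', 'test.csv'):
        with open(os.path.join(tmp_dir, name), 'w') as f:
            f.write('0,0.1,0.2,0.3\n')
    demo_keys = planned_s3_keys(tmp_dir, 'aws-ml-sagemaker-plagiarism')

print(demo_keys)
```

Comparing such a list with the keys actually found in the bucket is a quick way to spot a mistyped prefix before starting a training job.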


### Test cell

Test that your data has been successfully uploaded. The cell below prints the items in your S3 bucket and throws an error if the bucket is empty. You should see the contents of your `data_dir` and perhaps some checkpoints. If you see any other files listed, you may have some old model files that you can delete via the S3 console (though additional files shouldn't affect the performance of the model developed in this notebook).

In [5]:
# confirm that data is in S3 bucket
empty_check = []
for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')

aws-ml-sagemaker-plagiarism/test.csv
aws-ml-sagemaker-plagiarism/train.csv
Test passed!
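The check above only confirms that the bucket is not empty. A slightly stricter variant, sketched here in plain Python, confirms that specific files are present under the chosen prefix. `missing_files` is a hypothetical helper, and `listed_keys` simply repeats the keys printed above rather than querying S3.

```python
def missing_files(keys, prefix, expected_files):
    """Return the expected file names that have no matching key under `prefix`."""
    present = set(keys)
    return [name for name in expected_files
            if '{}/{}'.format(prefix, name) not in present]


# keys copied from the listing printed above (no S3 call is made here)
listed_keys = [
    'aws-ml-sagemaker-plagiarism/test.csv',
    'aws-ml-sagemaker-plagiarism/train.csv',
]

missing = missing_files(listed_keys, 'aws-ml-sagemaker-plagiarism',
                        ['train.csv', 'test.csv'])
assert not missing, 'Missing from S3: {}'.format(missing)
print('All expected files found.')
```

In the notebook, `listed_keys` could be replaced by the `empty_check` list built in the cell above.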


---

# Modeling

Now that you've uploaded your training data, it's time to define and train a model!

The type of model you create is up to you. For a binary classification task, we can choose to go one of three routes:
* Use a built-in classification algorithm, like LinearLearner.
* Define a custom Scikit-learn classifier; a comparison of models can be found [here](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
* Define a custom PyTorch neural network classifier.

In this project, we chose LinearLearner and PyTorch to create our models.

In [6]:
# read in the csv file
data_dir = 'plagiarism_data'
train = 'train.csv'
test = 'test.csv'

# print out some data
train = pd.read_csv('{}/{}'.format(data_dir, train),header=None)
test = pd.read_csv('{}/{}'.format(data_dir, test),header=None)
print('Data shape (rows, cols): ', train.shape)
print('Data shape (rows, cols): ', test.shape)

train.columns = ['Class', 'c_2' ,'c_1', 'lcs']
test.columns = ['Class', 'c_2' ,'c_1', 'lcs']

print()
train.head()

Data shape (rows, cols):  (70, 4)
Data shape (rows, cols):  (25, 4)



| | Class | c_2 | c_1 | lcs |
|---|---|---|---|---|
| 0 | 0 | 0.07907 | 0.398148 | 0.191781 |
| 1 | 1 | 0.719457 | 0.869369 | 0.846491 |
| 2 | 1 | 0.268817 | 0.593583 | 0.316062 |
| 3 | 0 | 0.115789 | 0.544503 | 0.242574 |
| 4 | 0 | 0.053846 | 0.329502 | 0.161172 |


### Slicing train data

In [22]:
# labels are the 'Class' column; the remaining columns are features
train_y = train[['Class']]
train_x = train.drop('Class', axis=1)

In [78]:
train_x.head()

| | c_2 | c_1 | lcs |
|---|---|---|---|
| 0 | 0.07907 | 0.398148 | 0.191781 |
| 1 | 0.719457 | 0.869369 | 0.846491 |
| 2 | 0.268817 | 0.593583 | 0.316062 |
| 3 | 0.115789 | 0.544503 | 0.242574 |
| 4 | 0.053846 | 0.329502 | 0.161172 |


# 1 - LinearLearner Model

### Create a LinearLearner Estimator

In [24]:
# import LinearLearner
from sagemaker import LinearLearner

# specify an output path
prefix = 'plagiarism_out'
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate LinearLearner
linear = LinearLearner(role=role,
                       train_instance_count=1, 
                       train_instance_type='ml.c4.xlarge',
                       predictor_type='binary_classifier',
                       output_path=output_path,
                       sagemaker_session=sagemaker_session,
                       epochs=15)

In [32]:
train_y.values.squeeze().astype('float32')

array([0., 1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0.,
       1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0.,
       0., 1., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 1., 1.,
       1., 1., 0., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1.,
       1., 0.], dtype=float32)

### Convert data into a RecordSet format

In [33]:
# convert features/labels to numpy
train_x_np = train_x.values.astype('float32')
train_y_np = train_y.values.squeeze().astype('float32')

# create RecordSet
formatted_train_data = linear.record_set(train_x_np, labels=train_y_np)

In [34]:
formatted_train_data

(<class 'sagemaker.amazon.amazon_estimator.RecordSet'>, {'s3_data': 's3://sagemaker-us-east-2-635229580099/sagemaker-record-sets/LinearLearner-2019-06-29-07-17-58-199/.amazon.manifest', 'feature_dim': 3, 'num_records': 70, 's3_data_type': 'ManifestFile', 'channel': 'train'})

In [35]:
train_x_np[10],train_y_np[10]

(array([0.7972028 , 0.9513889 , 0.89403975], dtype=float32), 1.0)

### Train the Estimator

In [36]:
linear.fit(formatted_train_data)

2019-06-29 07:18:09 Starting - Starting the training job...
2019-06-29 07:18:13 Starting - Launching requested ML instances......
2019-06-29 07:19:16 Starting - Preparing the instances for training......
2019-06-29 07:20:14 Downloading - Downloading input data...
2019-06-29 07:21:09 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
[06/29/2019 07:21:12 INFO 140111249409856] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-input.json: {u'loss_insensitivity': u'0.01', u'epochs': u'15', u'init_bias': u'0.0', u'lr_scheduler_factor': u'auto', u'num_calibration_samples': u'10000000', u'accuracy_top_k': u'3', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'num_point_for_scaler': u'10000', u'_log_level': u'info', u'quantile': u'0.5', u'bias_lr_mult': u'auto', u'lr_scheduler_step': u'auto', u'init_method': u'uniform', u'init_sigma': u'0.01', u'lr_scheduler

### Deploy the trained model

In [37]:
# deploy and create a predictor
linear_predictor = linear.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

--------------------------------------------------------------------------------------------------------------------!

# Evaluating Model

### Slicing test data

In [38]:
test.tail()

| | Class | c_2 | c_1 | lcs |
|---|---|---|---|---|
| 20 | 0 | 0.133758 | 0.613924 | 0.298343 |
| 21 | 1 | 0.9375 | 0.972763 | 0.927083 |
| 22 | 1 | 0.86722 | 0.96281 | 0.909804 |
| 23 | 0 | 0.102128 | 0.415254 | 0.177419 |
| 24 | 0 | 0.163793 | 0.532189 | 0.245833 |


In [39]:
# labels are the 'Class' column; the remaining columns are features
test_y = test[['Class']]
test_x = test.drop('Class', axis=1)

### Converting test data to numpy array

In [48]:
test_x_np = test_x.values.astype('float32')
test_y_np = test_y.values.squeeze().astype('float32')
test_x_np[0]

array([0.9846939, 1.       , 0.8207547], dtype=float32)

### Comparing labeled value with predicted value

In [50]:
result = linear_predictor.predict(test_x_np[12])

print(result)

[label {
  key: "predicted_label"
  value {
    float32_tensor {
      values: 1.0
    }
  }
}
label {
  key: "score"
  value {
    float32_tensor {
      values: 0.5811259746551514
    }
  }
}
]
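Each element of the response carries a `predicted_label` and a `score` under `.label[...]`, with the value stored in `float32_tensor.values`. The sketch below shows how those two fields can be pulled out into plain floats. Because it must run without a live endpoint, `fake_record` builds a stand-in object with the same attribute layout; both `fake_record` and `unpack` are hypothetical helpers, not part of the SageMaker SDK.

```python
from types import SimpleNamespace


def fake_record(predicted_label, score):
    """Build a stand-in with the same attribute layout as the printed response."""
    def tensor(value):
        return SimpleNamespace(float32_tensor=SimpleNamespace(values=[value]))
    return SimpleNamespace(label={'predicted_label': tensor(predicted_label),
                                  'score': tensor(score)})


def unpack(record):
    """Return (predicted_label, score) as plain Python floats."""
    label = record.label['predicted_label'].float32_tensor.values[0]
    score = record.label['score'].float32_tensor.values[0]
    return float(label), float(score)


# values copied from the response printed above
result_stub = [fake_record(1.0, 0.5811259746551514)]
label, score = unpack(result_stub[0])
print(label, round(score, 3))
```

With a real endpoint, `unpack(result[0])` would take the place of the stub.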


### Function to evaluate our model with test data

In [51]:
import numpy as np
# code to evaluate the endpoint on test data
# returns a variety of model metrics
def evaluate(predictor, test_features, test_labels, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  
    Return binary classification metrics.
    :param predictor: A prediction endpoint
    :param test_features: Test features
    :param test_labels: Class labels for test data
    :param verbose: If True, prints a table of all performance metrics
    :return: A dictionary of performance metrics.
    """
    
    # Send each test row to the prediction endpoint as its own request
    # (fine for this small test set; larger sets are usually sent in batches)
    prediction_batches = [predictor.predict(batch) for batch in test_features]
    
    # LinearLearner produces a `predicted_label` for each data point in a batch
    # get the 'predicted_label' for every point in a batch
    test_preds = np.concatenate([np.array([x.label['predicted_label'].float32_tensor.values[0] for x in batch]) 
                                 for batch in prediction_batches])
    
    # calculate true positives, false positives, true negatives, false negatives
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1-test_labels, test_preds).sum()
    tn = np.logical_and(1-test_labels, 1-test_preds).sum()
    fn = np.logical_and(test_labels, 1-test_preds).sum()
    
    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    
    # printing a table of metrics
    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=['actual (row)'], colnames=['prediction (col)']))
        print("\n{:<11} {:.3f}".format('Recall:', recall))
        print("{:<11} {:.3f}".format('Precision:', precision))
        print("{:<11} {:.3f}".format('Accuracy:', accuracy))
        print()
        
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn, 
            'Precision': precision, 'Recall': recall, 'Accuracy': accuracy}

In [52]:
metrics = evaluate(linear_predictor, test_x_np, test_y_np, True)

prediction (col)  0.0  1.0
actual (row)              
0.0                10    0
1.0                 0   15

Recall:     1.000
Precision:  1.000
Accuracy:   1.000
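As a cross-check, the metrics can be recomputed by hand from the confusion matrix above (15 true positives, 10 true negatives, no false positives or false negatives). This is a minimal sketch; `classification_metrics` is a hypothetical helper, and the guard against empty denominators is an addition that the `evaluate` function does not have.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute recall, precision and accuracy from confusion-matrix counts."""
    def safe_div(num, den):
        # return NaN instead of raising when a denominator is zero
        return num / den if den else float('nan')

    return {
        'Recall': safe_div(tp, tp + fn),
        'Precision': safe_div(tp, tp + fp),
        'Accuracy': safe_div(tp + tn, tp + fp + tn + fn),
    }


# counts read off the confusion matrix printed above
metrics_check = classification_metrics(tp=15, fp=0, tn=10, fn=0)
print(metrics_check)
```

With only 25 test rows, a single misclassified row would move accuracy by 0.04, so a perfect score here should be read with that granularity in mind.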



In [55]:
# Deletes a predictor endpoint
def delete_endpoint(predictor):
    try:
        boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint)
        print('Deleted {}'.format(predictor.endpoint))
    except Exception:
        # most likely the endpoint no longer exists
        print('Already deleted: {}'.format(predictor.endpoint))

In [56]:
## IMPORTANT
# delete the predictor endpoint 
delete_endpoint(linear_predictor)

Already deleted: linear-learner-2019-06-29-07-18-09-093


# 2 - PyTorch Model

### Provided code

The starter code for the PyTorch model implementation includes a few things:

* Model loading (`model_fn`) and saving code
* Getting SageMaker's default hyperparameters
* Loading the training data by name, `train.csv` and extracting the features and labels, `train_x`, and `train_y`

If you'd like to read more about model saving with [joblib for sklearn](https://scikit-learn.org/stable/modules/model_persistence.html) or with [torch.save](https://pytorch.org/tutorials/beginner/saving_loading_models.html), click on the provided links.

---
# Create an Estimator

When a custom model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained: the `train.py` script described above. To run a custom training script in SageMaker, construct an estimator and fill in the appropriate constructor arguments:

* **entry_point**: The path to the Python script SageMaker runs for training and prediction.
* **source_dir**: The path to the training script directory `source_sklearn` OR `source_pytorch`.
* **role**: Role ARN, which was specified, above.
* **train_instance_count**: The number of training instances (should be left at 1).
* **train_instance_type**: The type of SageMaker instance for training. Note: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* **sagemaker_session**: The session used to train on SageMaker.
* **hyperparameters** (optional): A dictionary `{'name':value, ..}` passed to the train function as hyperparameters.

Note: For a PyTorch model, there is another optional argument, **framework_version**, which we set to PyTorch version `1.0` in this notebook.

In [58]:
from sagemaker.pytorch import PyTorch

# specify an output path
# prefix is specified above
output_path = 's3://{}/{}'.format(bucket, prefix)

# instantiate a pytorch estimator
estimator = PyTorch(entry_point='train.py',
                    source_dir='source_pytorch', # directory containing train.py and predict.py
                    role=role,
                    framework_version='1.0',
                    train_instance_count=1,
                    train_instance_type='ml.c4.xlarge',
                    output_path=output_path,
                    sagemaker_session=sagemaker_session,
                    hyperparameters={
                        'input_features': 3,  # num of features
                        'hidden_dim': 30,
                        'output_dim': 1,
                        'epochs': 150 # could change to higher
                    })
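SageMaker forwards the `hyperparameters` dictionary to the entry-point script as command-line arguments. The provided `train.py` is not shown in this notebook, so the following is only a sketch of how such a script might read them with `argparse`. The function name `parse_hyperparameters` and the default values are assumptions made for illustration.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse hyperparameters passed as `--name value` command-line arguments."""
    parser = argparse.ArgumentParser()
    # defaults are placeholders chosen for illustration only (assumption)
    parser.add_argument('--input_features', type=int, default=3)
    parser.add_argument('--hidden_dim', type=int, default=10)
    parser.add_argument('--output_dim', type=int, default=1)
    parser.add_argument('--epochs', type=int, default=10)
    # ignore any extra arguments the training environment may add
    args, _unknown = parser.parse_known_args(argv)
    return args


# the same values as the estimator's `hyperparameters` dictionary above
demo_args = parse_hyperparameters(
    ['--input_features', '3', '--hidden_dim', '30',
     '--output_dim', '1', '--epochs', '150'])
print(demo_args.hidden_dim, demo_args.epochs)
```

Keeping the argument names identical to the dictionary keys is what ties the notebook cell to the script.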


### Create train csv file

In [59]:
import os

def make_csv(x, y, filename, data_dir):
    '''Merges features and labels and converts them into one csv file with labels in the first column.
       :param x: Data features
       :param y: Data labels
       :param filename: Name of csv file, ex. 'train.csv'
       :param data_dir: The directory where files will be saved
       '''
    # make data dir, if it does not exist
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    
    # first column is the labels and rest is features 
    pd.concat([pd.DataFrame(y), pd.DataFrame(x)], axis=1)\
             .to_csv(os.path.join(data_dir, filename), header=False, index=False)
    
    # nothing is returned, but a print statement indicates that the function has run
    print('Path created: '+str(data_dir)+'/'+str(filename))

In [60]:
data_dir = 'plagiarism_data_pytorch' # the folder we will use for storing data
name = 'train.csv'

# create 'train.csv'
make_csv(train_x_np, train_y_np, name, data_dir)

Path created: plagiarism_data_pytorch/train.csv
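The important convention in `make_csv` is that the label sits in the first column and the features follow, with no header row. The round trip below sketches that layout with the standard `csv` module and an in-memory buffer instead of a file. `rows_labels_first` and `split_labels_first` are hypothetical helpers, and the two sample rows are taken from the head of the training data shown earlier.

```python
import csv
import io


def rows_labels_first(features, labels):
    """Put each label in front of its feature row."""
    return [[label] + list(row) for row, label in zip(features, labels)]


def split_labels_first(rows):
    """Undo rows_labels_first: first column is the label, the rest are features."""
    labels = [row[0] for row in rows]
    features = [row[1:] for row in rows]
    return features, labels


demo_x = [[0.07907, 0.398148, 0.191781], [0.719457, 0.869369, 0.846491]]
demo_y = [0.0, 1.0]

# write without a header, then read the text back and parse it
buffer = io.StringIO()
csv.writer(buffer).writerows(rows_labels_first(demo_x, demo_y))
buffer.seek(0)
parsed = [[float(value) for value in row] for row in csv.reader(buffer)]

back_x, back_y = split_labels_first(parsed)
print(back_y)
```

A training script reading this file would need to apply the same split, taking column 0 as the label.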


In [61]:
# specify where to upload in S3
prefix = 'aws-ml-sagemaker-plagiarism_pytorch'

# upload to S3
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
print(input_data)

s3://sagemaker-us-east-2-635229580099/aws-ml-sagemaker-plagiarism_pytorch


## Train the estimator

Train the estimator on the training data stored in S3. This should create a training job that we can monitor in the SageMaker console.
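The cell below passes a dictionary to `fit`; its key, `train`, names an input channel. In SageMaker's framework containers, the data for a channel is typically made available to the script in a local directory whose path is exposed through an environment variable of the form `SM_CHANNEL_<NAME>`. Since the provided `train.py` is not shown here, the following is only a sketch of how a script might resolve that directory; `channel_dir` is a hypothetical helper.

```python
import os


def channel_dir(channel, environ=None):
    """Resolve the local directory for a named input channel."""
    environ = os.environ if environ is None else environ
    # fallback follows the /opt/ml/input/data/<channel> container layout
    default = '/opt/ml/input/data/{}'.format(channel)
    return environ.get('SM_CHANNEL_{}'.format(channel.upper()), default)


# with no variable set, the fallback path is used
print(channel_dir('train', environ={}))
# with the variable set, its value wins
print(channel_dir('train', environ={'SM_CHANNEL_TRAIN': '/tmp/train'}))
```

Passing `environ` explicitly keeps the helper easy to exercise outside a training container.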

In [62]:
%%time

# Train estimator on S3 training data

estimator.fit({'train': input_data})


2019-06-29 07:42:21 Starting - Starting the training job...
2019-06-29 07:42:22 Starting - Launching requested ML instances......
2019-06-29 07:43:25 Starting - Preparing the instances for training......
2019-06-29 07:44:40 Downloading - Downloading input data
2019-06-29 07:44:40 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-06-29 07:44:55,451 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2019-06-29 07:44:55,453 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-06-29 07:44:55,466 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2019-06-29 07:44:56,080 sagemaker_pytorch_container.training INFO     Invoking user training script.
2019-06-29 07:44:56,306 sagemaker-containers INFO     Module train does not prov

## Deploy the trained model

After training, deploy the model to create a `predictor`. Because this is a PyTorch model, we need to create a trained `PyTorchModel` that accepts the trained `<model>.model_data` as an input parameter and points to the provided `source_pytorch/predict.py` file as an entry point.

To deploy a trained model, we'll use `<model>.deploy`, which takes in two arguments:
* **initial_instance_count**: The number of deployed instances (1).
* **instance_type**: The type of SageMaker instance for deployment.

Note: If we run into an instance error, it may be because we chose the wrong training or deployment `instance_type`. It may help to refer to previous exercise code to see which types of instances were used.

In [63]:
# importing PyTorchModel
from sagemaker.pytorch import PyTorchModel

# Create a model from the trained estimator data
# And point to the prediction script
predictor = PyTorchModel(model_data=estimator.model_data,
                         role=role,
                         framework_version='1.0',
                         entry_point='predict.py',
                         source_dir='source_pytorch')

CPU times: user 9.5 ms, sys: 3.95 ms, total: 13.4 ms
Wall time: 69.9 ms


In [64]:
%%time
# deploy and create a predictor
predictor = predictor.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

---------------------------------------------------------------------------------------!
CPU times: user 612 ms, sys: 14.5 ms, total: 627 ms
Wall time: 7min 20s


---
# Evaluating The Model

Once our model is deployed, we can see how it performs when applied to our test data.

The cells below write the test data to `test.csv` in `data_dir` and then read it back, assuming it is stored locally. The labels and features are extracted from the `.csv` file.

In [65]:
name_test = 'test.csv'
make_csv(test_x_np, test_y_np, name_test, data_dir)

Path created: plagiarism_data_pytorch/test.csv


In [66]:
import os

# read in test data, assuming it is stored locally
test_data = pd.read_csv(os.path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

## Determine the accuracy of your model

In [67]:
# First: generate predicted class labels
import numpy as np
test_y_preds = np.squeeze(np.round(predictor.predict(test_x)))

# test that your model generates the correct number of labels
assert len(test_y_preds)==len(test_y), 'Unexpected number of predictions.'
print('Test passed!')

Test passed!
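The `np.round` call above turns each model output into a 0/1 class label. Assuming the endpoint returns one probability-like value between 0 and 1 per row (the prediction script is not shown here), the same step can be written with an explicit threshold, as in this sketch. `to_labels` is a hypothetical helper and the demo values are made up for illustration.

```python
def to_labels(probabilities, threshold=0.5):
    """Map probability-like outputs to 0.0/1.0 class labels."""
    return [1.0 if p >= threshold else 0.0 for p in probabilities]


# illustrative values only, not real endpoint output
demo_probs = [0.97, 0.58, 0.12, 0.5, 0.03]
print(to_labels(demo_probs))
```

One difference worth knowing: `np.round` rounds halves to the nearest even number, so an output of exactly 0.5 becomes 0.0, whereas the `>=` comparison here maps it to 1.0. An explicit threshold also makes it easy to trade precision against recall later.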


In [69]:
# Second: calculate the test accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y.values, test_y_preds)

print(accuracy)

## print out the array of predicted and true labels, if you want
print('\nPredicted class labels: ')
print(test_y_preds)
print('\nTrue class labels: ')
print(test_y.values)

1.0

Predicted class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0.]

True class labels: 
[1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.
 0.]


----
## Clean up Resources

In [76]:
# Deletes a predictor endpoint
def delete_endpoint(predictor):
    try:
        boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint)
        print('Deleted {}'.format(predictor.endpoint))
    except Exception:
        # most likely the endpoint no longer exists
        print('Already deleted: {}'.format(predictor.endpoint))
            
delete_endpoint(predictor)

Already deleted: sagemaker-pytorch-2019-06-29-07-51-36-278


### Deleting S3 bucket

When we are *completely* done with training and testing models, we can also delete the entire S3 bucket. If we do this before we are done training, we'll have to recreate the S3 bucket and upload the training data again.

In [77]:
# delete every object in the bucket (this empties the bucket)

bucket_to_delete = boto3.resource('s3').Bucket(bucket)
bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': 'C12A98E1254A6F89',
   'HostId': '2XFhPZgmJJG73uWDU6vjEJJZ8e6/nuc9rU5crUYY8dMaQTQmKR4dpcbLxs6okKwkE7Y1VUZvgmw=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': '2XFhPZgmJJG73uWDU6vjEJJZ8e6/nuc9rU5crUYY8dMaQTQmKR4dpcbLxs6okKwkE7Y1VUZvgmw=',
    'x-amz-request-id': 'C12A98E1254A6F89',
    'date': 'Sat, 29 Jun 2019 08:23:19 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker-record-sets/LinearLearner-2019-06-29-07-17-58-199/matrix_0.pbr'},
   {'Key': 'plagiarism_out/linear-learner-2019-06-29-07-18-09-093/output/model.tar.gz'},
   {'Key': 'sagemaker-pytorch-2019-06-29-07-42-21-012/source/sourcedir.tar.gz'},
   {'Key': 'aws-ml-sagemaker-plagiarism/test.csv'},
   {'Key': 'aws-ml-sagemaker-plagiarism/train.csv'},
   {'Key': 'sagemaker-pytorch-2019-06-29-07-51-35-970/sourcedir.tar.gz'},
   {'Key': 'plagiarism