# Notebook to Test External Tool
The following code tests the external tool "Origin" (formerly called Thor). Origin consists of four classes:
* Hash: contains the hash member and functions to create and view the hash
* Originality: takes in unprocessed text and returns a Hash containing neural and statistics dictionaries
* Database: opens, finds, and stores Hashes in the database
* Comparison: takes in a Hash and compares it against Hashes in the database for the same student ID

In [9]:
%load_ext autoreload
%autoreload 2

import origin
import os
import warnings

warnings.filterwarnings('ignore')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load dataset into dictionary
Here are the authors excluded from the network's training/validation sets.

In [23]:
new_authors = [
    'EdnaFernandes',
    'FumikoFujisaki',
    'JanLopatka',
    'KevinDrawbaugh',
    'MureDickie',
    'PierreTran',
    'SamuelPerry',
    'SarahDavison',
    'SimonCowell',
    'ToddNissen'
]

In [24]:
DATASET_PATH = 'data/C50/C50all/'

corpus = {}

for author in new_authors:
    author_dir = os.path.join(DATASET_PATH, author)
    corpus[author] = []

    # Read every article in the author's directory as one string.
    for text in os.listdir(author_dir):
        with open(os.path.join(author_dir, text), 'r') as f:
            corpus[author].append(f.read())
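
The loading loop above relies on `os.listdir`, which returns entries in arbitrary order, so an index such as `corpus[author][37]` may point at a different article on another machine. Below is a minimal sketch of an order-stable variant, assuming the same one-directory-per-author layout. `load_corpus` is a hypothetical helper name and is not part of `origin`.

```python
from pathlib import Path


def load_corpus(dataset_path, authors):
    """Read every file under dataset_path/<author>/ into a dict of lists.

    Hypothetical helper (not part of origin). Files are sorted by name so
    that list indices are reproducible across runs and machines.
    """
    corpus = {}
    for author in authors:
        author_dir = Path(dataset_path) / author
        corpus[author] = [
            path.read_text()
            for path in sorted(author_dir.iterdir())
            if path.is_file()
        ]
    return corpus
```

It would be called as `corpus = load_corpus(DATASET_PATH, new_authors)`. Sorting changes which article each index refers to, so the indices used in the rest of this notebook would not necessarily select the same texts.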

In [28]:
print("Here's the unprocessed, first text for author ToddNissen:")
corpus["ToddNissen"][0]

Here's the unprocessed, first text for author ToddNissen:


'Workers striking two Johnson Control Inc. seat assembly plants Tuesday scored a major victory when Ford Motor Co. said it would refuse to accept non-union made seats until the dispute is resolved.\n"We had a huge, huge victory here today," Bob King, director of United Auto Worker Region 1A, told a boisterous crowd of strikers and supporters outside of the plant in Plymouth, Mich., a suburb west of Detroit.\nMore than 300 workers at the plant walked off the job at 6 a.m. Tuesday after talks on a new contract broke down late Monday. The strike drew hundreds of other chanting supporters.\nAbout 200 employees of a Johnson Controls seat plant in Oberlin, Ohio, also struck the company when their talks collapsed. Workers at both plants agreed last fall to be represented by the UAW in a new labour contract.\nThe Plymouth plant provides final assembly for seats that go into Ford\'s hot-selling Expedition full-size sport utility vehicle, which is made at the nearby Michigan Truck plant in Wayne

## Clean database

In [48]:
db = origin.Database()
for author in corpus:
    # Delete first, then report, so the message reflects a completed call.
    db.delete(author)
    print("...deleted", author)

Connecting to database
...deleted EdnaFernandes
...deleted FumikoFujisaki
...deleted JanLopatka
...deleted KevinDrawbaugh
...deleted MureDickie
...deleted PierreTran
...deleted SamuelPerry
...deleted SarahDavison
...deleted SimonCowell
...deleted ToddNissen


## Entering first submission

In [49]:
interface = origin.Interface()

author = "ToddNissen"
article_num = 0

interface.score(author, corpus[author][article_num])

Started interface
Executing originality algorithm
Creating hash
Started comparison
Connecting to database

Author ID: ToddNissen
Submitted text: 'Workers striking two Johnson Control Inc. seat asse...'
Score:  -1 (new author; nothing to compare)


-1
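
As the output above shows, `interface.score` returns `-1` when the author has no earlier submissions. Code that consumes the return value has to treat that sentinel separately from a genuine similarity score. Below is a minimal sketch, assuming scores otherwise fall between 0 and 1 as in this notebook's outputs. The names `describe_score` and `NEW_AUTHOR` and the `threshold` value of 0.5 are hypothetical and not taken from `origin`.

```python
NEW_AUTHOR = -1  # sentinel seen in the output above for a first submission


def describe_score(score, threshold=0.5):
    """Turn a raw score into a short label.

    The threshold is an illustrative assumption; origin's real cut-off
    is not shown in this notebook.
    """
    if score == NEW_AUTHOR:
        return "new author; nothing to compare"
    if score < threshold:
        return "likely from a different author"
    return "consistent with prior submissions"


print(describe_score(-1))
print(describe_score(0.9))
```

Checking for the sentinel first keeps `-1` from being misread as a very low similarity score.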

## Entering second submission

In [52]:
author = "ToddNissen"
article_num = 37

interface.score(author, corpus[author][article_num])

Executing originality algorithm
Creating hash
Started comparison
Connecting to database
Creating hash

Author ID: ToddNissen
Submitted text: 'Calling it a crucial change for how it develops new...'
Average score: 0.99977

Here's a comparision against prior submissions:
1 (Mon 30 Jul 2018 05:16:53): Workers striking two Johnson Control Inc. seat asse: 0.99977


0.9997734213559645

## Entering new author

In [53]:
author = "KevinDrawbaugh"
article_num = 22

interface.score(author, corpus[author][article_num])

Executing originality algorithm
Creating hash
Started comparison
Connecting to database

Author ID: KevinDrawbaugh
Submitted text: 'The Food and Drug Administration is approving new d...'
Score:  -1 (new author; nothing to compare)


-1

In [55]:
author = "KevinDrawbaugh"
article_num = 33

interface.score(author, corpus[author][article_num])

Executing originality algorithm
Creating hash
Started comparison
Connecting to database
Creating hash

Author ID: KevinDrawbaugh
Submitted text: 'Amoco Corp is basing its 1997 business plans on ass...'
Average score: 0.19042
Submission likely from different author

Here's a comparision against prior submissions:
1 (Mon 30 Jul 2018 05:21:37): The Food and Drug Administration is approving new d: 0.19042


0.19042466174058711

## Entering text from the wrong author

In [54]:
author = "ToddNissen"
article_num = 39

interface.score(author, corpus["SarahDavison"][article_num])

Executing originality algorithm
Creating hash
Started comparison
Connecting to database
Creating hash
Creating hash

Author ID: ToddNissen
Submitted text: 'Hong Kong's Financial Secretary Donald Tsang issued...'
Average score: 0.00231

Here's a comparision against prior submissions:
1 (Mon 30 Jul 2018 05:16:53): Workers striking two Johnson Control Inc. seat asse: 0.00209
2 (Mon 30 Jul 2018 05:20:09): Calling it a crucial change for how it develops new: 0.00253


0.0023127032563767616
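
The reported average appears to be the arithmetic mean of the per-submission scores listed under it: for the run above, (0.00209 + 0.00253) / 2 = 0.00231. Below is a quick check of that reading, using the rounded values printed in the comparison output. That `origin` averages this way is an inference from the output, not something confirmed from its source.

```python
from statistics import mean

# Rounded per-submission scores copied from the comparison output above.
prior_scores = [0.00209, 0.00253]

average = mean(prior_scores)
print(f"Average score: {average:.5f}")
```

To five decimal places this matches the `Average score: 0.00231` line in the output; the small difference from the returned value 0.0023127032563767616 comes from using the rounded inputs.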