**Insights**
* There's huge value in having a single file where you can see the whole pipeline, whether it's a diagram, a makefile, or a Jupyter notebook.
* Identify up front which parts need to be flexible. Passing a trivial amount of data end to end should be very doable.
* Unit tests spend too much time on edge cases. At the very least, the top test in each file should be the 'happy path', so you can see what the result _should_ look like. Then there needs to be a connection between files.
    * This is much more feasible for data pipelines than for application code, so it's under-developed.

In [1]:
DATADIR = '/home/mritter/code/twitter_nlp/sandbox_data/'

In [None]:
# download_data.py

# loading_dock: [local] sandbox_data.txt, [internet] http server
# processing: download, assign IDs, split out test
# dial: progress bar
# shipping_dock: [local] .h5
manifest_filename = 'manifest.txt'
server_url = 'https://files.pushshift.io/hackernews/'
output_file_base = 'downloaded'
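
The "assign IDs, split out test" step named in the comments above could look something like the following. This is a minimal sketch, not the notebook's implementation: `assign_ids`, `split_out_test`, and `test_percent` are hypothetical names, and hashing each ID with `zlib.crc32` is just one assumed way to keep the train/test split stable between runs.

```python
import zlib


def assign_ids(lines):
    """Pair every line with a sequential integer ID (hypothetical helper)."""
    return list(enumerate(lines))


def split_out_test(records, test_percent=10):
    """Send roughly test_percent of the records to the test set.

    The bucket is derived from a checksum of the ID, so a given record
    always lands on the same side of the split.
    """
    train, test = [], []
    for record_id, line in records:
        bucket = zlib.crc32(str(record_id).encode('utf-8')) % 100
        if bucket < test_percent:
            test.append((record_id, line))
        else:
            train.append((record_id, line))
    return train, test
```

The two lists would then be written out under names built from `output_file_base`, which matches the `downloaded_train` and `downloaded_test` files that later cells read.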


In [None]:
# preprocess_data.py

# loading_dock: [local] train data
# processing: filter, split, tag, format labels
# lever: training, inference, evaluation
# dial: Dask status
# shipping_dock: [local] text-only and label-only files with IDs
status = 'training'
train_filename = 'downloaded_train'
filter_bools = {}  # Lines are filtered out if true
split_chars = (' ', '.')
tag_patterns = {r'http.*\w': ' <LINK> '}  # raw string so \w reaches the regex engine intact
output_file = 'preprocessed'
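
One way the `tag_patterns` and `split_chars` settings could be applied is sketched below. `tag_text` and `split_text` are hypothetical helpers, not functions from the notebook. Tagging runs before splitting here, on the assumption that URLs should be replaced before their dots are treated as separators. Note that the pattern as written is greedy, so it consumes everything from `http` to the last word character on the line.

```python
import re

split_chars = (' ', '.')
tag_patterns = {r'http.*\w': ' <LINK> '}


def tag_text(text, patterns):
    """Replace every match of each regex with its tag string."""
    for pattern, tag in patterns.items():
        text = re.sub(pattern, tag, text)
    return text


def split_text(text, chars):
    """Split on any of the given characters and drop empty tokens."""
    splitter = '[' + re.escape(''.join(chars)) + ']'
    return [token for token in re.split(splitter, text) if token]


def preprocess_line(line):
    """Tag first, then split, so URL dots never act as separators."""
    return split_text(tag_text(line, tag_patterns), split_chars)
```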


In [2]:
# train_w2v.py

# loading_dock: [local] train text
# processing: gensim w2v
# dial: estimate based on train size
# shipping_dock: [local] gensim model
train_filename = 'preprocessed'
output_file = 'w2v'


In [None]:
# index_text.py

# loading_dock: [local] train text, [local] gensim
# shipping_dock: [local] text as indexes, [local] 100d w2v array sorted with that index
w2v_file = 'w2v'
text_file = 'preprocessed'
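
The indexing step could be sketched as follows. A plain dict of word vectors stands in for the gensim model so the example runs without gensim installed. `build_index` and `index_tokens` are hypothetical names, and reserving row 0 for unknown words is an assumption rather than something the notebook specifies.

```python
def build_index(vectors):
    """Return (word_to_index, embedding_rows) for a non-empty word -> vector dict.

    Row i of embedding_rows holds the vector of the word whose index is i.
    Row 0 is an all-zero vector reserved for out-of-vocabulary words.
    """
    words = sorted(vectors)
    dim = len(vectors[words[0]])
    word_to_index = {word: i for i, word in enumerate(words, start=1)}
    embedding_rows = [[0.0] * dim] + [list(vectors[word]) for word in words]
    return word_to_index, embedding_rows


def index_tokens(tokens, word_to_index):
    """Convert tokens to integer indexes, using 0 for unknown words."""
    return [word_to_index.get(token, 0) for token in tokens]
```

Keeping the array in the same order as the index means the rows can be handed to an embedding layer unchanged, with each token's integer selecting its vector.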

In [None]:
# model.py

# loading_dock: 100d w2v array
# lever: params
# dial: model.summary()
# shipping_dock: [local] compiled model

w2v_file = 'w2v'
compiled_model = 'compiled_model'

In [None]:
# train.py

# loading_dock: [local] text as indexes, [local] compiled model
# lever: epochs
# dial: tensorboard
# shipping_dock: [local] saved model

text_file = 'preprocessed'
compiled_model = 'compiled_model'
trained_model = 'trained_model'

In [None]:
# inference.py

# loading_dock: candidate comment
# processing: call preprocess and index, then apply model
# shipping_dock: best comment

comment_text = """
Reminder, if you're in the US, the FTC says your eye doctor must give you your prescription after your exam. If a doctor refuses to do so, they can face legal action and penalties.

https://www.consumer.ftc.gov/blog/2016/05/buying-prescriptio...

That said, I don't think the FTC stipulates what information must appear on the prescription. Many docs leave off your PD (pupillary distance), which is a necessary measurement if you're buying online. Fortunately, there are a variety of easy ways to take this measurement yourself after the exam, although if you're really concerned about precision, you'll want the doctor's measurement.

And by the way, it should go without saying, but I'll say it anyway. Although the quality of eyewear available online can be comparable to what you'd get in store ... please don't think an online eye exam is an acceptable substitute for visiting an ophthalmologist in person and getting a comprehensive eye exam! 
"""

trained_model = 'trained_model'

In [None]:
# test.py

# loading_dock: [local] test file
# processing: call inference, then compare to labels 
# shipping_dock: accuracy printout

test_filename = 'downloaded_test'
trained_model = 'trained_model'
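
The "compare to labels" step might reduce to something like this. It is a minimal sketch under the assumption that inference returns one predicted label per test example. `accuracy` is a hypothetical helper, not part of the notebook.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that equal their label."""
    if len(predictions) != len(labels):
        raise ValueError('predictions and labels must have the same length')
    if not labels:
        raise ValueError('cannot compute accuracy on an empty test set')
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    return correct / len(labels)
```

The printout itself could then be as simple as `print('accuracy: {:.3f}'.format(accuracy(predictions, labels)))`.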