**Insights**
* There's huge value in having a single file where you can see the whole pipeline, whether it's a diagram, a Makefile, or a Jupyter notebook.
* Identify up front which parts need to be flexible. Passing a trivial amount of data end to end (e2e) should always be easy.
* Unit tests spend too much time on edge cases. At the very least, the first test in a file should be the 'happy path', so you can see what the code _should_ do. There also needs to be a visible connection between files.
    * This is much more feasible for data pipelines than for application code, so it's under-developed.
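
As a minimal sketch of the 'happy path first' idea: the test file below opens with one ordinary record passed end to end, and only then covers an edge case. `preprocess` and `run_pipeline` are hypothetical stand-ins, not functions from this notebook.

```python
import json
import unittest


def preprocess(record):
    """Hypothetical stage: lowercase the text and split it on spaces."""
    return {"id": record["id"], "tokens": record["text"].lower().split(" ")}


def run_pipeline(lines):
    """Hypothetical end-to-end run: parse JSON lines, then preprocess each record."""
    return [preprocess(json.loads(line)) for line in lines if line.strip()]


class TestPipeline(unittest.TestCase):
    def test_happy_path(self):
        # First test in the file: one ordinary record, all the way through.
        lines = ['{"id": 1, "text": "Hello world"}']
        expected = [{"id": 1, "tokens": ["hello", "world"]}]
        self.assertEqual(run_pipeline(lines), expected)

    def test_blank_lines_are_skipped(self):
        # Edge cases come after the happy path.
        self.assertEqual(run_pipeline(["", "  "]), [])
```

Saved as a test module, this can be run with `python -m unittest`; a reader who opens the file sees the expected input and output shape before any edge case.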

In [111]:
DATADIR = '/home/mritter/code/twitter_nlp/sandbox_data/'
NUM_SAMPLES = 10
SKIP_FIRST = 1000
test_pct = .5

In [112]:
# download_data.py

# loading_dock: [local] sandbox_data.txt, [internet] http server
# processing: download, assign IDs, split out test
# dial: progress bar
# shipping_dock: [local] .h5
manifest_filename = 'manifest.txt'
server_url = 'https://files.pushshift.io/hackernews/'
output_file_base = 'downloaded'


from requests import get
import bz2, json, tqdm
import numpy as np

stop = False
sample_l = []
with open(DATADIR+manifest_filename) as infile:
    for line in tqdm.tqdm(infile):
        # The remote filename is the second whitespace-separated field of each manifest line
        remote_filename = line.split()[1]
        print(remote_filename)
        as_bytes = get(server_url+remote_filename+'.bz2').content
        as_text = bz2.decompress(as_bytes)  # Still bytes; json.loads accepts them
        for sample in as_text.split(b'\n'):
            if not sample: continue  # Skip blank lines
            sample_l.append(json.loads(sample))
            # Stop once there are enough samples to skip SKIP_FIRST and keep NUM_SAMPLES
            if len(sample_l) >= (SKIP_FIRST + NUM_SAMPLES):
                stop = True
                break
        print(len(sample_l))
        if stop: break

# Drop the first SKIP_FIRST samples
sample_l = sample_l[SKIP_FIRST:]

1it [00:00,  7.29it/s]

HNI_2006-10
61
HNI_2006-12


2it [00:00,  7.49it/s]

62
HNI_2007-02
1010
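
The skip-then-take logic in the download cell could also be written with a generator and `itertools.islice`, which stops pulling data as soon as enough samples have been seen. This is only a sketch: `iter_samples`, `take_window`, and the in-memory `fake_chunks` are hypothetical and stand in for the decompressed monthly files.

```python
import json
from itertools import islice


def iter_samples(chunks):
    """Yield one parsed JSON object per non-empty line, across all chunks."""
    for chunk in chunks:
        for raw in chunk.split(b"\n"):
            if raw:
                yield json.loads(raw)


def take_window(chunks, skip_first, num_samples):
    """Skip `skip_first` samples, then return the next `num_samples`."""
    return list(islice(iter_samples(chunks), skip_first, skip_first + num_samples))


# Two in-memory chunks stand in for the decompressed monthly files.
fake_chunks = [b'{"id": 1}\n{"id": 2}\n', b'{"id": 3}\n{"id": 4}\n']
window = take_window(fake_chunks, skip_first=1, num_samples=2)
print(window)
```

If `chunks` is itself a generator that downloads one file per step, no further file is requested once the window is full.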





In [113]:
np.random.seed(42)  # Call the function; assigning to it (np.random.seed = 42) seeds nothing
np.random.shuffle(sample_l)

In [114]:
test_ix = int(len(sample_l)*test_pct)

with open(DATADIR+output_file_base+'_train.jsonl', 'w') as outfile:
    for entry in sample_l[test_ix:]:
        json.dump(entry, outfile)
        outfile.write('\n')
        
with open(DATADIR+output_file_base+'_test.jsonl', 'w') as outfile:
    for entry in sample_l[:test_ix]:
        json.dump(entry, outfile)
        outfile.write('\n')
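
A reproducible version of the shuffle and split can be sketched with a private, seeded random generator, so the result does not depend on global state. `split_train_test` is a hypothetical helper, and the seed value 42 is carried over from the shuffle cell.

```python
import random


def split_train_test(samples, test_pct, seed=42):
    """Shuffle a copy with a private seeded RNG, then split off the test share."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    test_ix = int(len(shuffled) * test_pct)
    # Same slicing as the cell above: the first test_ix entries are the test set.
    return shuffled[test_ix:], shuffled[:test_ix]


train, test = split_train_test(range(10), test_pct=0.5)
print(len(train), len(test))
```

Calling the function twice with the same seed returns the same split, which keeps the train and test files stable across notebook reruns.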

In [119]:
! head sandbox_data/*jsonl

==> sandbox_data/downloaded_test.jsonl <==
{"by": "volida", "id": 1009, "parent": 856, "retrieved_on": 1525542115, "text": "Ebay bought its Chinese clone for hundreds of millions of dollars, which afterwards collapsed because after moving the servers outside China the service's data were going through word filtering (e.g. during login) and there were failures...\n", "time": 1172412920, "type": "comment"}
{"by": "dngrmouse", "id": 1004, "parent": 363, "retrieved_on": 1525542114, "text": "1. Have it so you can be automatically logged in. I have to manually log in every time I visit the site (using Safari here).<p>2. Just like Reddit does, show the domain each link belongs to. Reddit has this in brackets after the headline, which works fine. Since I don't have much free time, there are some sites that have sub-par content which I avoid reading, and it helps to know where I would end up without having to hover over the link.", "time": 1172400507, "type": "comment"}
{"by": "msgbeepa", "d

In [138]:
re.split(r'a|f', 'asdfgfdadssf')

['', 'sd', 'g', 'd', 'dss', '']

In [164]:
# preprocess_data.py

# loading_dock: [local] train data
# processing: filter, split, tag, format labels
# lever: training, inference, evaluation
# dial: Dask status
# shipping_dock: [local] text-only and label-only files with IDs
status = 'training'
downloaded_filename = 'downloaded'
filter_bools = {'type':'story'}  # Lines are filtered out if true
split_regex = r' |\.'
remove_regex = r"\'|\""
tag_patterns = {r'http.*\w':' <LINK> '}  # Raw string; greedy, so it matches from 'http' to the last word character on the line
positive_labels = ('pg', 'patio11', 'volida')
output_file = 'preprocessed'

import re

data = []
with open(DATADIR+downloaded_filename+'_train.jsonl', 'r') as infile:
    for line in tqdm.tqdm(infile):
        data.append(json.loads(line))

for key, value in filter_bools.items():
    data = [x for x in data if x[key] != value]

texts = {}
labels = {}
temp_labels = []
for row in tqdm.tqdm(data):
    temp_text = row['text']
    for key, value in tag_patterns.items():
        temp_text = re.sub(key, value, temp_text)
    texts[row['id']] = re.split(split_regex, re.sub(remove_regex, '', temp_text))
    labels[row['id']] = (1, 0) if row['by'] in positive_labels else (0, 1)  # Not generalizable

for value in texts.values():
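    # Pad to exactly 300 tokens, truncating longer comments so each row fits the (300,) HDF5 dataset
    value[:] = (value + ['']*300)[:300]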
print(data)
print(texts)
print(labels)

5it [00:00, 16409.64it/s]
100%|██████████| 2/2 [00:00<00:00, 9228.39it/s]

[{'by': 'volida', 'id': 1010, 'parent': 856, 'retrieved_on': 1525542115, 'text': 'Ebay bought its Chinese clone for hundreds of millions of dollars', 'time': 1172413027, 'type': 'comment'}, {'by': 'rms', 'id': 1006, 'parent': 928, 'retrieved_on': 1525542114, 'text': "It's bad if you come out of the Techstars program without any funding and a non-sustainable company, but then you're probably screwed anyways. VCs are infamously inscrutable; we hear that they are always out to take advantage of naive or underfunded companies.<p>If you're good enough to get further investment after Techstars, you get it from a VC that you already know instead of having to deal with the typical painful negotiations. And if Brad Feld's Foundry Group will give you money, maybe you could get Bay Area VC money. Even better, the best companies will get to reinvest their own profits.", 'time': 1172403958, 'type': 'comment'}]
{1010: ['Ebay', 'bought', 'its', 'Chinese', 'clone', 'for', 'hundreds', 'of', 'millions',
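
The tag, remove, split, and pad steps can be gathered into one function, which makes them easy to try on a single string. This is a sketch under assumptions: `tokenize` is a hypothetical helper, truncating to `seq_len` is an added choice, and the `http\S+` pattern in the second call is an assumed non-greedy alternative, not the pattern used in the cell above.

```python
import re

SEQ_LEN = 300  # fixed token count used by the padding step in the cell above


def tokenize(text, tag_patterns, remove_regex, split_regex, seq_len=SEQ_LEN):
    """Tag patterns, strip quotes, split into tokens, then pad/truncate to seq_len."""
    for pattern, tag in tag_patterns.items():
        text = re.sub(pattern, tag, text)
    tokens = re.split(split_regex, re.sub(remove_regex, "", text))
    tokens = tokens[:seq_len]  # assumption: drop tokens past seq_len
    return tokens + [""] * (seq_len - len(tokens))


plain = tokenize("Ebay bought its clone.", {}, r"\'|\"", r" |\.", seq_len=6)
linked = tokenize("see http://a.b/c now", {r"http\S+": " <LINK> "},
                  r"\'|\"", r" |\.", seq_len=5)
print(plain)
print(linked)
```

Note that splitting on single spaces leaves empty tokens wherever two separators are adjacent, for example around an inserted ` <LINK> ` tag.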




In [169]:
len([str(n).encode("ascii", "ignore") for n in texts[ix]])

300

In [172]:
import h5py
with h5py.File(DATADIR+"preprocessed_text.h5", "w") as f:
    for ix in texts.keys():
        dset = f.create_dataset(str(ix), (300,), dtype='S100')
        dset[:] = [str(n).encode("ascii", "ignore") for n in texts[ix]]

with h5py.File(DATADIR+"preprocessed_label.h5", "w") as f:
    for ix in labels.keys():
        dset = f.create_dataset(str(ix), (2,), dtype='i')
        dset[:] = labels[ix]

with h5py.File(DATADIR+"preprocessed_label.h5", "r") as f:
    print(f['1010'][()])

[1 0]


In [2]:
# train_w2v.py

# loading_dock: [local] train text
# processing: gensim w2v
# dial: estimate based on train size
# shipping_dock: [local] gensim model
train_filename = 'preprocessed'
output_file = 'w2v'


In [None]:
# index_text.py

# loading_dock: [local] train text, [local] gensim
# shipping_dock: [local] text as indexes, [local] 100d w2v array sorted with that index
w2v_file = 'w2v'
text_file = 'preprocessed'
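
One way the indexing step might look, as a sketch only: each known word gets an integer index, and the embedding table is built in that same order so row `i` holds the vector for index `i`. `build_index` is hypothetical, and the plain dict `vectors` stands in for a trained word2vec lookup; index 0 is reserved here for padding and unknown words, which is an assumption.

```python
def build_index(texts, vectors, dim=100):
    """Map tokens to integer indexes and build an embedding table in index order."""
    vocab = {"": 0}            # index 0: padding and unknown words (assumption)
    table = [[0.0] * dim]      # row 0 is an all-zero vector
    for tokens in texts.values():
        for tok in tokens:
            if tok not in vocab and tok in vectors:
                vocab[tok] = len(vocab)
                table.append(list(vectors[tok]))
    indexed = {key: [vocab.get(tok, 0) for tok in tokens]
               for key, tokens in texts.items()}
    return indexed, table


# Toy 2-d vectors keep the example readable; the notebook plans 100-d vectors.
toy_texts = {1010: ["Ebay", "bought", "", ""], 1006: ["bought", "unknown"]}
toy_vectors = {"Ebay": [0.1, 0.2], "bought": [0.3, 0.4]}
indexed, table = build_index(toy_texts, toy_vectors, dim=2)
print(indexed)
print(table)
```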

In [None]:
# model.py

# loading_dock: 100d w2v array
# lever: params
# dial: model.summary()
# shipping_dock: [local] compiled model

w2v_file = 'w2v'
compiled_model = 'compiled_model'

In [None]:
# train.py

# loading_dock: [local] text as indexes, [local] compiled model
# lever: epochs
# dial: tensorboard
# shipping_dock: [local] saved model

text_file = 'preprocessed'
compiled_model = 'compiled_model'
trained_model = 'trained_model'

In [None]:
# inference.py

# loading_dock: candidate comment
# processing: call preprocess and index, then apply model
# shipping_dock: best comment

comment_text = """
Reminder, if you're in the US, the FTC says your eye doctor must give you your prescription after your exam. If a doctor refuses to do so, they can face legal action and penalties.

https://www.consumer.ftc.gov/blog/2016/05/buying-prescriptio...

That said, I don't think the FTC stipulates what information must appear on the prescription. Many docs leave off your PD (pupillary distance), which is a necessary measurement if you're buying online. Fortunately, there are a variety of easy ways to take this measurement yourself after the exam, although if you're really concerned about precision, you'll want the doctor's measurement.

And by the way, it should go without saying, but I'll say it anyway. Although the quality of eyewear available online can be comparable to what you'd get in store ... please don't think an online eye exam is an acceptable substitute for visiting an ophthalmologist in person and getting a comprehensive eye exam! 
"""

trained_model = 'trained_model'

In [None]:
# test.py

# loading_dock: [local] test file
# processing: call inference, then compare to labels 
# shipping_dock: accuracy printout

test_filename = 'downloaded_test'
trained_model = 'trained_model'
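
The comparison step for the accuracy printout could be sketched as follows. `accuracy` is a hypothetical helper; the labels use the `(1, 0)` / `(0, 1)` format from the preprocessing cell, and the `predictions` dict is made up for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of labelled IDs whose predicted class matches the label."""
    if not labels:
        raise ValueError("no labels to compare against")
    correct = sum(1 for key, label in labels.items()
                  if predictions.get(key) == label)
    return correct / len(labels)


labels = {1010: (1, 0), 1006: (0, 1), 1004: (0, 1)}
predictions = {1010: (1, 0), 1006: (1, 0), 1004: (0, 1)}  # illustrative values only
print(f"accuracy: {accuracy(predictions, labels):.2f}")
```

An ID that is missing from `predictions` counts as wrong, so a partially failed inference run lowers the score instead of hiding the gap.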