**Insights**
* The big problem I'm trying to solve here is to make the relatively complex system of NNs for NLP something I cant 'wrap my head around'
    * It's about learning, and one of the critical concepts in human learning is 'chunking'. So I've got three layers here: 
        * Top: "string of text" ==(NN system)==> username of interest [Y/N]
        * Second: Collection of scripts with input/knobs/dials/outputs
        * Third: Code itself, presented as close to linearly as possible
        * Important to not try going too deep (I made this mistake with bash tools). Once you've got some solid abstractions that give you flexibility and don't seem to leak, extend rather than deepening knowledge
    * Connections to concrete concepts that I already know are also key, so making the inputs/outputs clear in examples using Python datatypes that I'm comfortable with will help grasp the abstract transforms 
    * Reenforcement is key, so I'll try to copy this and apply it to a different problem almost as soon as I'm done
    * The problem with many other systems is that, in an admirable effort to be modular, they force you to jump 3-7 levels deep within functions to get to the actual Python code (the concrete concept). From the main script, it can take several minutes to trace how an input and parameter actually get combined into an output.
        * Even if effort was made to document, there's a "curse of knowledge" issue that makes it difficult to grasp the particulars without the mental model that the original writer had
    * Goal is to find the mental models where I can look at other people's output (code), break it down quickly and accurately, identify what's new about it, and how i could update my design process to take advantage of anything that I don't yet have
        * [This twitter thread](https://twitter.com/michael_nielsen/status/1074150124169773056) talks about meeting 'magicians' who are better in ways that you can't comprehend, then working to understand the implicit models that allow them to be 10x better. You probably can't do it just from their work, you need to communicate on a more abstract level. How do I do that here?
    * Get solid on one simple approach, and then identify ways to extend it that I don't fully understand yet. Nailing down specific questions, "I would have expected X, but I'm seeing Y - what's going on?" is critical
* There's huge value in having a single file where you can see everything, whether it's a diagram, makefile or Jupyter notebook
* Need to identify what needs to be flexible. Passing a trivial amount of data e2e should be very doable. Then building on that, so that the data is essentially flowing (not moving in massive blocks)
* From a documentation standpoint, unit tests waste too much time on the edge cases. At the very least, the top test should always be 'happy path' so you can see what it _should_ look like. Then there needs to be a connection between files
    * This is much more possible for data pipelines than application code, so it's under-developed
* One big thing that I think I lack is an intuitive sense of what's "expensive", in terms of Disk, IO, RAM and Compute (are there other limited resources?)
    * Is streaming data from a server going to be a bottleneck?
    * Is it computationally expensive to open a bunch of little data files, instead of one big one?

In [111]:
DATADIR = '/home/mritter/code/twitter_nlp/sandbox_data/'
NUM_SAMPLES = 10
SKIP_FIRST = 1000
test_pct = .5

In [112]:
# download_data.py

# loading_dock: [local] sandbox_data.txt, [internet] http server
# processing: download, assign IDs, split out test
# dial: progress bar
# shipping_dock: [local] .h5
manifest_filename = 'manifest.txt'
server_url = 'https://files.pushshift.io/hackernews/'
output_file_base = 'downloaded'


from requests import get
import bz2, json, tqdm
import numpy as np

stop = False
sample_l = []
with open(DATADIR+manifest_filename) as infile:
    for line in tqdm.tqdm(infile):
        remote_filename = line.split()[1]
        print(remote_filename)
        as_bytes = get(server_url+remote_filename+'.bz2').content
        as_text = bz2.decompress(as_bytes)
        for sample in as_text.split(b'\n'):
            if not len(sample): continue
            sample_l.append(json.loads(sample))
            if len(sample_l) >= (SKIP_FIRST + NUM_SAMPLES):
                stop = True
            if stop: break
        print(len(sample_l))
        if stop: break
            
sample_l = sample_l[SKIP_FIRST:]

1it [00:00,  7.29it/s]

HNI_2006-10
61
HNI_2006-12


2it [00:00,  7.49it/s]

62
HNI_2007-02
1010





In [113]:
np.random.seed = 42
np.random.shuffle(sample_l)

In [114]:
test_ix = int(len(sample_l)*test_pct)

with open(DATADIR+output_file_base+'_train.jsonl', 'w') as outfile:
    for entry in sample_l[test_ix:]:
        json.dump(entry, outfile)
        outfile.write('\n')
        
with open(DATADIR+output_file_base+'_test.jsonl', 'w') as outfile:
    for entry in sample_l[:test_ix]:
        json.dump(entry, outfile)
        outfile.write('\n')

In [119]:
! head sandbox_data/*jsonl

==> sandbox_data/downloaded_test.jsonl <==
{"by": "volida", "id": 1009, "parent": 856, "retrieved_on": 1525542115, "text": "Ebay bought its Chinese clone for hundreds of millions of dollars, which afterwards collapsed because after moving the servers outside China the service's data were going through word filtering (e.g. during login) and there were failures...\n", "time": 1172412920, "type": "comment"}
{"by": "dngrmouse", "id": 1004, "parent": 363, "retrieved_on": 1525542114, "text": "1. Have it so you can be automatically logged in. I have to manually log in every time I visit the site (using Safari here).<p>2. Just like Reddit does, show the domain each link belongs to. Reddit has this in brackets after the headline, which works fine. Since I don't have much free time, there are some sites that have sub-par content which I avoid reading, and it helps to know where I would end up without having to hover over the link.", "time": 1172400507, "type": "comment"}
{"by": "msgbeepa", "d

In [195]:
# preprocess_data.py

# loading_dock: [local] train data
# processing: filter, split, tag, format labels
# lever: training, inference, evaluation
# dial: Dask status
# shipping_dock: [local] text-only and label-only files with IDs
status = 'training'
downloaded_filename = 'downloaded'
filter_bools = {'type':'story'}  # Lines are filtered out if true
split_regex = r' |\.'
remove_regex = r"\'|\""
tag_patterns = {'http.*\w':' <LINK> '}
positive_labels = ('pg', 'patio11', 'volida')
output_file = 'preprocessed'

import re

data = []
with open(DATADIR+downloaded_filename+'_train.jsonl', 'r') as infile:
    for line in tqdm.tqdm(infile):
        data.append(json.loads(line))

for key, value in filter_bools.items():
    data = [x for x in data if x[key] != value]

texts = {}
labels = {}
temp_labels = []
for row in tqdm.tqdm(data):
    temp_text = row['text']
    temp_text = temp_text.lower()
    for key, value in tag_patterns.items():
        temp_text = re.sub(key, value, temp_text)
    texts[row['id']] = re.split(split_regex, re.sub(remove_regex, '', temp_text))
    labels[row['id']] = (1, 0) if row['by'] in positive_labels else (0, 1)  # Not generalizable

for value in texts.values():
    value += ['']*(300-len(value))
print(data)
print(texts)
print(labels)

5it [00:00, 14463.12it/s]
100%|██████████| 2/2 [00:00<00:00, 4422.04it/s]

[{'by': 'volida', 'id': 1010, 'parent': 856, 'retrieved_on': 1525542115, 'text': 'Ebay bought its Chinese clone for hundreds of millions of dollars', 'time': 1172413027, 'type': 'comment'}, {'by': 'rms', 'id': 1006, 'parent': 928, 'retrieved_on': 1525542114, 'text': "It's bad if you come out of the Techstars program without any funding and a non-sustainable company, but then you're probably screwed anyways. VCs are infamously inscrutable; we hear that they are always out to take advantage of naive or underfunded companies.<p>If you're good enough to get further investment after Techstars, you get it from a VC that you already know instead of having to deal with the typical painful negotiations. And if Brad Feld's Foundry Group will give you money, maybe you could get Bay Area VC money. Even better, the best companies will get to reinvest their own profits.", 'time': 1172403958, 'type': 'comment'}]
{1010: ['ebay', 'bought', 'its', 'chinese', 'clone', 'for', 'hundreds', 'of', 'millions',




In [196]:
import h5py
with h5py.File(DATADIR+"preprocessed_text.h5", "w") as f:
    for ix in texts.keys():
        dset = f.create_dataset(str(ix), (300,), dtype='S100')
        dset[:] = [str(n).encode("ascii", "ignore") for n in texts[ix]]

with h5py.File(DATADIR+"preprocessed_label.h5", "w") as f:
    for ix in labels.keys():
        dset = f.create_dataset(str(ix), (2,), dtype='i')
        dset[:] = labels[ix]

with h5py.File(DATADIR+"preprocessed_label.h5", "r") as f:
    print(f['1010'][()])

[1 0]


_insight_: w2v is generated with a simple NN autoencoder

In [197]:
# train_w2v.py

# loading_dock: [local] train text
# processing: gensim w2v
# dial: estimate based on train size
# shipping_dock: [local] gensim model
train_filename = 'preprocessed'
output_file = 'w2v'

class W2VIter:
    def __init__(self, texts):
        self.texts = texts.values()
    def __iter__(self):
        for text in self.texts:
            yield [token for token in text if token != '']
            
w2viter = W2VIter(texts)

import logging
from gensim.models import Word2Vec

logger= logging.getLogger()
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# TODO This doesn't work yet
class EpochFilter(logging.Filter):
    def filter(self, record):
        return record.getMessage().contains('EPOCH')

logger.addFilter(EpochFilter())

w2v = Word2Vec(w2viter, iter=2, min_count=1, size=100, workers=2)

2019-01-24 06:47:16,018 : INFO : collecting all words and their counts
2019-01-24 06:47:16,019 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-01-24 06:47:16,020 : INFO : collected 86 word types from a corpus of 113 raw words and 2 sentences
2019-01-24 06:47:16,020 : INFO : Loading a fresh vocabulary
2019-01-24 06:47:16,021 : INFO : min_count=1 retains 86 unique words (100% of original 86, drops 0)
2019-01-24 06:47:16,021 : INFO : min_count=1 leaves 113 word corpus (100% of original 113, drops 0)
2019-01-24 06:47:16,022 : INFO : deleting the raw counts dictionary of 86 items
2019-01-24 06:47:16,023 : INFO : sample=0.001 downsamples 86 most-common words
2019-01-24 06:47:16,023 : INFO : downsampling leaves estimated 41 word corpus (37.1% of prior 113)
2019-01-24 06:47:16,023 : INFO : estimated required memory for 86 words and 100 dimensions: 111800 bytes
2019-01-24 06:47:16,024 : INFO : resetting layer weights
2019-01-24 06:47:16,028 : INFO : training mode

In [198]:
print(w2v.wv['the'][:5])
print(w2v.wv.index2word[0], w2v.wv.index2word[1], w2v.wv.index2word[2])
print('Index of "the" is: {}'.format(w2v.wv.vocab['the'].index))
w2v.save(DATADIR+"myw2v")
w2v_loaded = Word2Vec.load(DATADIR+"myw2v")
print('Index of "the" is: {}'.format(w2v_loaded.wv.vocab['the'].index))

2019-01-24 06:47:20,973 : INFO : saving Word2Vec object under /home/mritter/code/twitter_nlp/sandbox_data/myw2v, separately None
2019-01-24 06:47:20,974 : INFO : not storing attribute vectors_norm
2019-01-24 06:47:20,974 : INFO : not storing attribute cum_table
2019-01-24 06:47:20,976 : INFO : saved /home/mritter/code/twitter_nlp/sandbox_data/myw2v
2019-01-24 06:47:20,977 : INFO : loading Word2Vec object from /home/mritter/code/twitter_nlp/sandbox_data/myw2v
2019-01-24 06:47:20,979 : INFO : loading wv recursively from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.wv.* with mmap=None
2019-01-24 06:47:20,979 : INFO : setting ignored attribute vectors_norm to None
2019-01-24 06:47:20,980 : INFO : loading vocabulary recursively from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.vocabulary.* with mmap=None
2019-01-24 06:47:20,980 : INFO : loading trainables recursively from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.trainables.* with mmap=None
2019-01-24 06:47:20,981 : INFO :

[ 0.00214999  0.00413239  0.00452458  0.00431084 -0.0031492 ]
of you to
Index of "the" is: 4
Index of "the" is: 4


In [229]:
l = []
for i, token in enumerate(w2v.wv.index2word): l.append(w2v.wv[token])
weights = np.array(l)
weights[:5, :5]

array([[-0.00060824,  0.0011704 , -0.00019772, -0.0045374 , -0.00032814],
       [-0.00120165, -0.00185802,  0.00117371,  0.00338027, -0.0031534 ],
       [ 0.00209898, -0.0028773 , -0.00276368, -0.00330904, -0.00310643],
       [-0.00444596, -0.00370529, -0.00366363,  0.00253816,  0.00022051],
       [ 0.00214999,  0.00413239,  0.00452458,  0.00431084, -0.0031492 ]],
      dtype=float32)

In [230]:
with h5py.File(DATADIR+"w2v.h5", "w") as f:
    dset = f.create_dataset('data', weights.shape, dtype='f')
    dset[:] = weights


In [240]:
# index_text.py

# loading_dock: [local] train text, [local] gensim
# shipping_dock: [local] text as indexes, [local] 100d w2v array sorted with that index
# TODO should I bring the w2v sorting into here?
# TODO I need to think about how IDs get passed around here
w2v_file = 'w2v'
text_file = 'preprocessed'

w2v_loaded = Word2Vec.load(DATADIR+"myw2v")
indexed_texts = {}
for key, wordlist in texts.items():
    indexed_texts[key] = []
    for word in wordlist:
        if word in w2v_loaded.wv.vocab:
            a = w2v_loaded.wv.vocab[word].index
        else:
            a = 0
        indexed_texts[key].append(a)
indexed_texts[1010][:20]

2019-01-24 07:19:15,687 : INFO : loading Word2Vec object from /home/mritter/code/twitter_nlp/sandbox_data/myw2v
2019-01-24 07:19:15,689 : INFO : loading wv recursively from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.wv.* with mmap=None
2019-01-24 07:19:15,690 : INFO : setting ignored attribute vectors_norm to None
2019-01-24 07:19:15,691 : INFO : loading vocabulary recursively from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.vocabulary.* with mmap=None
2019-01-24 07:19:15,691 : INFO : loading trainables recursively from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.trainables.* with mmap=None
2019-01-24 07:19:15,692 : INFO : setting ignored attribute cum_table to None
2019-01-24 07:19:15,692 : INFO : loaded /home/mritter/code/twitter_nlp/sandbox_data/myw2v


[16, 17, 5, 18, 19, 20, 21, 0, 22, 0, 23, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [None]:
# model.py

# loading_dock: 100d w2v array
# lever: params
# dial: model.summary()
# shipping_dock: [local] compiled model

w2v_file = 'w2v'
compiled_model = 'compiled_model'



In [None]:
# train.py

# loading_dock: [local] text as indexes, [local] compiled model
# lever: epocs
# dial: tensorboard
# shipping_dock: [local] saved model

text_file = 'preprocessed'
compiled_model = 'compiled_model'
trained_model = 'trained_model'

In [None]:
# inference.py

# loading_dock: candidate comment
# processing: call preprocess and index, then apply model
# shipping_dock: best comment

comment_text = """
Reminder, if you're in the US, the FTC says your eye doctor must give you your prescription after your exam. If a doctor refuses to do so, they can face legal action and penalties.

https://www.consumer.ftc.gov/blog/2016/05/buying-prescriptio...

That said, I don't think the FTC stipulates what information must appear on the prescription. Many docs leave off your PD (pupillary distance), which is a necessary measurement if you're buying online. Fortunately, there are a variety of easy ways to take this measurement yourself after the exam, although if you're really concerned about precision, you'll want the doctor's measurement.

And by the way, it should go without saying, but I'll say it anyway. Although the quality of eyewear available online can be comparable to what you'd get in store ... please don't think an online eye exam is an acceptable substitute for visiting an ophthalmologist in person and getting a comprehensive eye exam! 
"""

trained_model = 'trained_model'

In [None]:
# test.py

# loading_dock: [local] test file
# processing: call inference, then compare to labels 
# shipping_dock: accuracy printout

test_filename = 'downloaded_test'
trained_model = 'trained_model'