**Insights**
* The big problem I'm trying to solve here is to make the relatively complex system of NNs for NLP something I cant 'wrap my head around'
    * It's about learning, and one of the critical concepts in human learning is 'chunking'. So I've got three layers here: 
        * Top: "string of text" ==(NN system)==> username of interest [Y/N]
        * Second: Collection of scripts with input/knobs/dials/outputs
        * Third: Code itself, presented as close to linearly as possible
        * Important to not try going too deep (I made this mistake with bash tools). Once you've got some solid abstractions that give you flexibility and don't seem to leak, extend rather than deepening knowledge
    * Connections to concrete concepts that I already know are also key, so making the inputs/outputs clear in examples using Python datatypes that I'm comfortable with will help grasp the abstract transforms 
    * Reenforcement is key, so I'll try to copy this and apply it to a different problem almost as soon as I'm done
    * The problem with many other systems is that, in an admirable effort to be modular, they force you to jump 3-7 levels deep within functions to get to the actual Python code (the concrete concept). From the main script, it can take several minutes to trace how an input and parameter actually get combined into an output.
        * Even if effort was made to document, there's a "curse of knowledge" issue that makes it difficult to grasp the particulars without the mental model that the original writer had
    * Goal is to find the mental models where I can look at other people's output (code), break it down quickly and accurately, identify what's new about it, and how i could update my design process to take advantage of anything that I don't yet have
        * [This twitter thread](https://twitter.com/michael_nielsen/status/1074150124169773056) talks about meeting 'magicians' who are better in ways that you can't comprehend, then working to understand the implicit models that allow them to be 10x better. You probably can't do it just from their work, you need to communicate on a more abstract level. How do I do that here?
    * Get solid on one simple approach, and then identify ways to extend it that I don't fully understand yet. Nailing down specific questions, "I would have expected X, but I'm seeing Y - what's going on?" is critical
* There's huge value in having a single file where you can see everything, whether it's a diagram, makefile or Jupyter notebook
* Need to identify what needs to be flexible. Passing a trivial amount of data e2e should be very doable. Then building on that, so that the data is essentially flowing (not moving in massive blocks)
* From a documentation standpoint, unit tests waste too much time on the edge cases. At the very least, the top test should always be 'happy path' so you can see what it _should_ look like. Then there needs to be a connection between files
    * This is much more possible for data pipelines than application code, so it's under-developed
* One big thing that I think I lack is an intuitive sense of what's "expensive", in terms of Disk, IO, RAM and Compute (are there other limited resources?)
    * Is streaming data from a server going to be a bottleneck?
    * Is it computationally expensive to open a bunch of little data files, instead of one big one?

In [1]:
DATADIR = '/home/mritter/code/twitter_nlp/sandbox_data/'
NUM_SAMPLES = 10
SKIP_FIRST = 1000
test_pct = .5

In [2]:
# download_data.py

# loading_dock: [local] sandbox_data.txt, [internet] http server
# processing: download, assign IDs, split out test
# dial: progress bar
# shipping_dock: [local] .h5
manifest_filename = 'manifest.txt'
server_url = 'https://files.pushshift.io/hackernews/'
output_file_base = 'downloaded'


from requests import get
import bz2, json, tqdm
import numpy as np

stop = False
sample_l = []
with open(DATADIR+manifest_filename) as infile:
    for line in tqdm.tqdm(infile):
        remote_filename = line.split()[1]
        print(remote_filename)
        as_bytes = get(server_url+remote_filename+'.bz2').content
        as_text = bz2.decompress(as_bytes)
        for sample in as_text.split(b'\n'):
            if not len(sample): continue
            sample_l.append(sample.decode("ascii", "ignore"))
            if len(sample_l) >= (SKIP_FIRST + NUM_SAMPLES):
                stop = True
            if stop: break
        print(len(sample_l))
        if stop: break
            
sample_l = sample_l[SKIP_FIRST:]

0it [00:00, ?it/s]

HNI_2006-10


2it [00:00,  4.58it/s]

61
HNI_2006-12
62
HNI_2007-02
1010





In [3]:
np.random.seed = 42
np.random.shuffle(sample_l)

In [4]:
test_ix = int(len(sample_l)*test_pct)

with open(DATADIR+output_file_base+'_train.jsonl', 'w') as outfile:
    outfile.write('\n'.join(sample_l[test_ix:]))
        
with open(DATADIR+output_file_base+'_train.jsonl', 'w') as outfile:
    outfile.write('\n'.join(sample_l[:test_ix]))

In [5]:
! ls -lah sandbox_data/*jsonl

-rw-rw-r-- 1 mritter mritter 1.3K Jan 26 13:32 sandbox_data/downloaded_test.jsonl
-rw-rw-r-- 1 mritter mritter  978 Jan 26 16:46 sandbox_data/downloaded_train.jsonl


In [6]:
! wc -l sandbox_data/*jsonl

   5 sandbox_data/downloaded_test.jsonl
   4 sandbox_data/downloaded_train.jsonl
   9 total


In [7]:
! head sandbox_data/*jsonl

==> sandbox_data/downloaded_test.jsonl <==
{"by": "volida", "id": 1009, "parent": 856, "retrieved_on": 1525542115, "text": "Ebay bought its Chinese clone for hundreds of millions of dollars, which afterwards collapsed because after moving the servers outside China the service's data were going through word filtering (e.g. during login) and there were failures...\n", "time": 1172412920, "type": "comment"}
{"by": "phil", "id": 1003, "parent": 955, "retrieved_on": 1525542114, "text": "8.3% of what they did in 2005: wow.", "time": 1172399687, "type": "comment"}
{"by": "python_kiss", "descendants": 0, "id": 1002, "retrieved_on": 1525542114, "score": 2, "time": 1172397259, "title": "The Battle for Mobile Search", "type": "story", "url": "http://www.businessweek.com/technology/content/feb2007/tc20070220_828216.htm?campaign_id=rss_daily"}
{"by": "rms", "descendants": 3, "id": 1005, "kids": [1023, 1067], "retrieved_on": 1525542114, "score": 6, "time": 1172400839, "title": "CRV Quickstart:  

In [65]:
# preprocess_w2v_index.py

###
# PREPROCESSING_STEP
###

# loading_dock: [JSONL] raw data
# processing: filter, split, tag, format labels
# lever: training, inference, evaluation, sequence_length
# dial: Dask status
# shipping_dock: [local] text-only and label-only files with IDs
status = 'training'
downloaded_filename = 'downloaded'
filter_bools = {'type':'story'}  # Lines are filtered out if true
split_regex = r' |\.'
remove_regex = r"\'|\""
tag_patterns = {'http.*\w':' <LINK> '}
positive_labels = ('pg', 'patio11', 'volida')
sequence_length = 300
output_file = 'preprocessed'

import re

data = []
with open(DATADIR+downloaded_filename+'_train.jsonl', 'r') as infile:
    for line in tqdm.tqdm(infile):
        data.append(json.loads(line))

for key, value in filter_bools.items():
    data = [x for x in data if x[key] != value]

texts = {}
labels = {}
temp_labels = []
for row in tqdm.tqdm(data):
    temp_text = row['text']
    temp_text = temp_text.lower()
    for key, value in tag_patterns.items():
        temp_text = re.sub(key, value, temp_text)
    texts[row['id']] = re.split(split_regex, re.sub(remove_regex, '', temp_text))
    labels[row['id']] = (1, 0) if row['by'] in positive_labels else (0, 1)  # Not generalizable

for value in texts.values():
    value += ['']*(sequence_length-len(value))
print(data)
print(texts)
print(labels)

5it [00:00, 19490.26it/s]
100%|██████████| 2/2 [00:00<00:00, 9177.91it/s]

[{'by': 'phil', 'id': 1003, 'parent': 955, 'retrieved_on': 1525542114, 'text': '8.3% of what they did in 2005: wow.', 'time': 1172399687, 'type': 'comment'}, {'by': 'volida', 'id': 1010, 'parent': 856, 'retrieved_on': 1525542115, 'text': 'Ebay bought its Chinese clone for hundreds of millions of dollars', 'time': 1172413027, 'type': 'comment'}]
{1003: ['8', '3%', 'of', 'what', 'they', 'did', 'in', '2005:', 'wow', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',




_insight_: w2v is generated with a simple NN autoencoder

In [66]:
###
# W2V_STEP
###

# loading_dock: [local] train text
# processing: gensim w2v
# dial: estimate based on train size
# shipping_dock: [local] gensim model
output_file = 'w2v'

class W2VIter:
    def __init__(self, texts):
        self.texts = texts.values()
    def __iter__(self):
        for text in self.texts:
            yield [token for token in text if token != '']
            
w2viter = W2VIter(texts)

import logging
from gensim.models import Word2Vec

logger= logging.getLogger()
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# TODO This doesn't work yet
class EpochFilter(logging.Filter):
    def filter(self, record):
        return record.getMessage().contains('EPOCH')

logger.addFilter(EpochFilter())

w2v = Word2Vec(w2viter, iter=2, min_count=1, size=100, workers=2)

2019-01-26 17:13:06,875 : INFO : collecting all words and their counts
2019-01-26 17:13:06,876 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-01-26 17:13:06,877 : INFO : collected 18 word types from a corpus of 20 raw words and 2 sentences
2019-01-26 17:13:06,877 : INFO : Loading a fresh vocabulary
2019-01-26 17:13:06,877 : INFO : min_count=1 retains 18 unique words (100% of original 18, drops 0)
2019-01-26 17:13:06,878 : INFO : min_count=1 leaves 20 word corpus (100% of original 20, drops 0)
2019-01-26 17:13:06,878 : INFO : deleting the raw counts dictionary of 18 items
2019-01-26 17:13:06,879 : INFO : sample=0.001 downsamples 18 most-common words
2019-01-26 17:13:06,879 : INFO : downsampling leaves estimated 3 word corpus (15.0% of prior 20)
2019-01-26 17:13:06,880 : INFO : estimated required memory for 18 words and 100 dimensions: 23400 bytes
2019-01-26 17:13:06,880 : INFO : resetting layer weights
2019-01-26 17:13:06,881 : INFO : training model with

In [67]:
w2v.wv.vocab

{'8': <gensim.models.keyedvectors.Vocab at 0x7f6fe62c7828>,
 '3%': <gensim.models.keyedvectors.Vocab at 0x7f6fe62c73c8>,
 'of': <gensim.models.keyedvectors.Vocab at 0x7f6fe6044cc0>,
 'what': <gensim.models.keyedvectors.Vocab at 0x7f6fe6044c18>,
 'they': <gensim.models.keyedvectors.Vocab at 0x7f6fe6044f28>,
 'did': <gensim.models.keyedvectors.Vocab at 0x7f6fe72aa0f0>,
 'in': <gensim.models.keyedvectors.Vocab at 0x7f6fe6046ef0>,
 '2005:': <gensim.models.keyedvectors.Vocab at 0x7f6fe6046f28>,
 'wow': <gensim.models.keyedvectors.Vocab at 0x7f6fe601f908>,
 'ebay': <gensim.models.keyedvectors.Vocab at 0x7f6fe601f9b0>,
 'bought': <gensim.models.keyedvectors.Vocab at 0x7f6fe601f978>,
 'its': <gensim.models.keyedvectors.Vocab at 0x7f6fe601f710>,
 'chinese': <gensim.models.keyedvectors.Vocab at 0x7f6fe601f550>,
 'clone': <gensim.models.keyedvectors.Vocab at 0x7f6fe601f438>,
 'for': <gensim.models.keyedvectors.Vocab at 0x7f6fe601f470>,
 'hundreds': <gensim.models.keyedvectors.Vocab at 0x7f6fe601f

In [68]:
print(w2v.wv['what'][:5])
print(w2v.wv.index2word[0], w2v.wv.index2word[1], w2v.wv.index2word[2])
print('Index of "what" is: {}'.format(w2v.wv.vocab['what'].index))
w2v.save(DATADIR+"myw2v.w2v")
w2v_loaded = Word2Vec.load(DATADIR+"myw2v.w2v")
print('Index of "what" is: {}'.format(w2v_loaded.wv.vocab['what'].index))

2019-01-26 17:13:09,203 : INFO : saving Word2Vec object under /home/mritter/code/twitter_nlp/sandbox_data/myw2v.w2v, separately None
2019-01-26 17:13:09,205 : INFO : not storing attribute vectors_norm
2019-01-26 17:13:09,208 : INFO : not storing attribute cum_table
2019-01-26 17:13:09,210 : INFO : saved /home/mritter/code/twitter_nlp/sandbox_data/myw2v.w2v
2019-01-26 17:13:09,211 : INFO : loading Word2Vec object from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.w2v
2019-01-26 17:13:09,213 : INFO : loading wv recursively from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.w2v.wv.* with mmap=None
2019-01-26 17:13:09,214 : INFO : setting ignored attribute vectors_norm to None
2019-01-26 17:13:09,215 : INFO : loading vocabulary recursively from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.w2v.vocabulary.* with mmap=None
2019-01-26 17:13:09,216 : INFO : loading trainables recursively from /home/mritter/code/twitter_nlp/sandbox_data/myw2v.w2v.trainables.* with mmap=None
2019-01-

[ 0.00054838 -0.00357793  0.00177963 -0.00220928  0.00305515]
of 8 3%
Index of "what" is: 3
Index of "what" is: 3


In [69]:
l = []
for i, token in enumerate(w2v.wv.index2word): l.append(w2v.wv[token])
weights = np.array(l)
print(weights.shape)
weights[:5, :5]

(18, 100)


array([[-0.00253542, -0.00189092, -0.00271564,  0.00188061,  0.00152182],
       [-0.00137067,  0.00165398,  0.00346244, -0.00419546,  0.0022843 ],
       [ 0.00352789, -0.002979  , -0.00073002,  0.00038799,  0.00132004],
       [ 0.00054838, -0.00357793,  0.00177963, -0.00220928,  0.00305515],
       [ 0.00084377, -0.00152044,  0.00032784,  0.00402925,  0.00302467]],
      dtype=float32)

In [70]:
# index_text.py

# loading_dock: [local] train text, [local] gensim
# shipping_dock: [local] text as indexes, [local] 100d w2v array sorted with that index
# TODO should I bring the w2v sorting into here?
# TODO I need to think about how IDs get passed around here

indexed_texts = {}
for key, wordlist in texts.items():
    indexed_texts[key] = []
    for word in wordlist:
        if word in w2v.wv.vocab:
            a = w2v.wv.vocab[word].index
        else:
            a = 0
        indexed_texts[key].append(a)
indexed_texts[1003][:20]

[1, 2, 0, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [71]:
import h5py

with h5py.File(DATADIR+"w2v.h5", "w") as f:
    dset = f.create_dataset('weights', weights.shape, dtype='f')
    dset[:] = weights

In [72]:
# Write outputs
with h5py.File(DATADIR+"indexed_text.h5", "w") as f:
#     for ix in texts.keys():
#         dset = f.create_dataset(str(ix), (300,), dtype='int')
#         dset[:] = indexed_texts[key]
    f.create_dataset('training_data', (len(indexed_texts), 300), dtype='int', data=list(indexed_texts.values()))
    f.create_dataset('ordered_keys', (len(indexed_texts), 1), dtype='int', data=list(indexed_texts.keys()))
    f.create_dataset('max_token', (1,), dtype='int', data=weights.shape[0]+1)
    f.create_dataset('sequence_length', (1,), dtype='int', data=sequence_length)
    

with h5py.File(DATADIR+"indexed_label.h5", "w") as f:
#     for ix in labels.keys():
#         dset = f.create_dataset(str(ix), (2,), dtype='i')
#         dset[:] = labels[ix]
    f.create_dataset('training_labels', (len(labels), 2), dtype='int', data=list(labels.values()))
    f.create_dataset('ordered_keys', (len(labels), 1), dtype='int', data=list(labels.keys()))
    

In [73]:
# Try reading
with h5py.File(DATADIR+"indexed_text.h5", "r") as f:
    print(f['training_data'][:2])
    print(f['ordered_keys'][:2])
    print(f['sequence_length'][0])

with h5py.File(DATADIR+"indexed_label.h5", "r") as f:
    print(f['training_labels'][:2])
    print(f['ordered_keys'][:2])


[[ 1  2  0  3  4  5  6  7  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0]
 [ 9 10 11 12 13 14 15  0 16  0 17  0  0  0  0  0  0  0  0  0  0  0  0  0

In [74]:
ls -lh sandbox_data/

total 212K
-rw-rw-r-- 1 mritter mritter 138K Jan 26 17:10 compiled_model.keras
-rw-rw-r-- 1 mritter mritter 1.3K Jan 26 13:32 downloaded_test.jsonl
-rw-rw-r-- 1 mritter mritter  978 Jan 26 16:46 downloaded_train.jsonl
-rw-rw-r-- 1 mritter mritter 2.1K Jan 26 17:13 indexed_label.h5
-rw-rw-r-- 1 mritter mritter 9.0K Jan 26 17:13 indexed_text.h5
-rw-rw-r-- 1 mritter mritter 3.2K Jan 23 19:39 manifest.txt
-rw-rw-r-- 1 mritter mritter  27K Jan 26 17:13 myw2v.w2v
drwxrwxr-x 2 mritter mritter 4.0K Jan 26 14:24 [0m[01;34mtfrecords[0m/
-rw-rw-r-- 1 mritter mritter 9.1K Jan 26 17:13 w2v.h5


In [76]:
# model.py

# loading_dock: 100d w2v array, datafile
# lever: params
# dial: model.summary()
# shipping_dock: [local] compiled model

w2v_file = 'w2v.h5'
training_file = 'indexed_text.h5'
compiled_model = 'compiled_model.keras'

with h5py.File(DATADIR+training_file, "r") as f:
    max_token = f['max_token'][0]
    sequence_length = f['sequence_length'][0]

with h5py.File(DATADIR+w2v_file, "r") as f:
    embedding_matrix = f['weights'][()]
    embedding_dim = embedding_matrix.shape[1]

from keras.models import Sequential
from keras.layers import Dense, Input, Embedding, Flatten
from keras.initializers import Constant
from keras.models import Model

# model = Sequential()
sequence_input = Input(shape=(sequence_length,), dtype='int32')

embedded_sequences = Embedding(num_distinct_words,
                            embedding_dim,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=sequence_length,
                            trainable=False)(sequence_input)

x = Dense(units=64, activation='relu')(embedded_sequences)
x = Dense(units=32, activation='relu')(embedded_sequences)
x = Flatten()(x)
preds = Dense(units=2, activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

print(model.summary())
model.save(DATADIR+compiled_model)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_9 (InputLayer)         (None, 300)               0         
_________________________________________________________________
embedding_9 (Embedding)      (None, 300, 100)          10000     
_________________________________________________________________
dense_16 (Dense)             (None, 300, 32)           3232      
_________________________________________________________________
flatten_3 (Flatten)          (None, 9600)              0         
_________________________________________________________________
dense_17 (Dense)             (None, 2)                 19202     
Total params: 32,434
Trainable params: 22,434
Non-trainable params: 10,000
_________________________________________________________________
None


In [81]:
# train.py

# loading_dock: [local] text as indexes, [local] compiled model
# lever: epocs
# dial: tensorboard
# shipping_dock: [local] saved model

training_file = 'indexed_text.h5'
compiled_model = 'compiled_model.keras'
trained_model = 'trained_model.yaml'
trained_weights = 'trained_weights.h5'

from keras.callbacks import TensorBoard as tb
from datetime import datetime
t = datetime.now()
tensorboard = tb(log_dir='tensorboard_logs/{:%Y-%m-%d-%H-%M}'.format(t))

with h5py.File(DATADIR+training_file, "r") as f1:
    with h5py.File(DATADIR+"indexed_label.h5", "r") as f2:
        x_train = f1['training_data']  # Note that Keras is special in being able to read the HDF5 _object_
        y_train = f2['training_labels']
        
        model.fit(x_train, y_train,
                  batch_size=64, #128,
                  epochs=2,
                  shuffle='batch',  # Required for using HDF5
#                   validation_data=(x_val, y_val),
                  callbacks=[tensorboard])
model.save(DATADIR+trained_model)

# serialize model to YAML
# From https://machinelearningmastery.com/save-load-keras-deep-learning-models/
model_yaml = model.to_yaml()
with open(DATADIR+trained_model, "w") as yaml_file:
    yaml_file.write(model_yaml)
# serialize weights to HDF5
model.save_weights(DATADIR+trained_weights)
print("Saved model to disk")


Epoch 1/2
Epoch 2/2
Saved model to disk


In [None]:

# load YAML and create model
yaml_file = open('model.yaml', 'r')
loaded_model_yaml = yaml_file.read()
yaml_file.close()
loaded_model = model_from_yaml(loaded_model_yaml)
# load weights into new model
loaded_model.load_weights("model.h5")
print("Loaded model from disk")


In [99]:
# inference.py

# loading_dock: candidate comment
# processing: call preprocess and index, then apply model
# shipping_dock: best comment

comment_text = """
Reminder, if you're in the US, the FTC says your eye doctor must give you your prescription after your exam. If a doctor refuses to do so, they can face legal action and penalties.

https://www.consumer.ftc.gov/blog/2016/05/buying-prescriptio...

That said, I don't think the FTC stipulates what information must appear on the prescription. Many docs leave off your PD (pupillary distance), which is a necessary measurement if you're buying online. Fortunately, there are a variety of easy ways to take this measurement yourself after the exam, although if you're really concerned about precision, you'll want the doctor's measurement.

And by the way, it should go without saying, but I'll say it anyway. Although the quality of eyewear available online can be comparable to what you'd get in store ... please don't think an online eye exam is an acceptable substitute for visiting an ophthalmologist in person and getting a comprehensive eye exam! 
"""

trained_model = 'trained_model.yaml'
trained_weights = 'trained_weights.h5'


from keras.models import model_from_yaml

# load YAML and create model
with open(DATADIR+trained_model, 'r') as yaml_file:
    loaded_model_yaml = yaml_file.read()
    loaded_model = model_from_yaml(loaded_model_yaml)

# load weights into new model
loaded_model.load_weights(DATADIR+trained_weights)
print("Loaded model from disk")


indexed_texts = []
for word in comment_text.lower().split():
    if word in w2v.wv.vocab:
        a = w2v.wv.vocab[word].index
    else:
        a = 0
    indexed_texts.append(a)
    
indexed_texts += [0]*(sequence_length-len(indexed_texts))
asarray = np.array([indexed_texts])
loaded_model.predict(asarray)

Loaded model from disk


array([[0.4995369, 0.5004631]], dtype=float32)

In [None]:
# test.py

# loading_dock: [local] test file
# processing: call inference, then compare to labels 
# shipping_dock: accuracy printout

test_filename = 'downloaded_test'
trained_model = 'trained_model'