### tldr prototyping

short-term goal:

* build tools that generate input functions for the tensorflow Estimator API, covering different categories of problem

there seem to be 3 generic steps before you get to the tensorflow part:

  1. **loading** the data. depends on how the data is saved (e.g. a CSV, a folder of text files, etc.) but not directly on the language or the model.
  2. **preprocessing** the text: tokenizing, stemming, removing stopwords, etc. depends on the language and the application, and (to a lesser extent) on the model, but not on how the files were saved.
  3. **encoding** the data to send to tensorflow: whatever process maps tokens to a bag of words, a sequence of one-hot encoded vectors, etc. depends on the model structure more than anything.
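
To make the separation concrete, here is a minimal sketch of the three steps as independent functions, using only the standard library. All names (`load_lines`, `preprocess`, `build_vocab`, `encode_bag`) are made up for illustration and are not part of tldr; the tokenizer and stopword handling are deliberately naive assumptions.

```python
import re

def load_lines(lines, min_length=5):
    """Loading: depends only on how the data is stored (here, an iterable of lines)."""
    return [ln.strip() for ln in lines if len(ln.strip()) > min_length]

def preprocess(text, stopwords=frozenset()):
    """Preprocessing: lowercase, split into letter runs, drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]

def build_vocab(token_lists):
    """Assign each distinct token an integer index in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab

def encode_bag(tokens, vocab):
    """Encoding: map a token list to a dense vector of token counts."""
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

raw = ["Polished leather and strawberries.", "ok", "Good leather, good fruit."]
docs = [preprocess(d, stopwords={"and"}) for d in load_lines(raw)]
vocab = build_vocab(docs)
bags = [encode_bag(d, vocab) for d in docs]
print(bags)
```

Only `encode_bag` would need to change to feed a sequence model instead of a bag-of-words model, which is the point of keeping the steps apart.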

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import re
import nltk
import string
import pandas as pd

from collections import defaultdict

%matplotlib inline
from IPython.core.pylabtools import figsize

In [2]:
# load our sample data
textfile = "winereviews.txt"
with open(textfile, "rb") as f:
    # drop non-ASCII bytes and skip near-empty lines
    rawtext = [x.decode("ascii", errors="ignore") for x in f if len(x) > 5]
len(rawtext)

1230

In [3]:
import sys
sys.path.append("/home/joe/projects/tldr/")

In [4]:
import tldr

In [5]:
print(rawtext[0])

Lovely delicate, fragrant Rhone wine. Polished leather and strawberries. Perhaps a bit dilute, but good for drinking now. ***



In [6]:
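def extract_rating(x):
    """Split a raw review into (star rating, review text without the rating)."""
    x = x.lower().strip()
    if "no stars" in x:
        return 0, x.replace("no stars", "")
    # the rating is the length of the first run of asterisks
    r = re.findall(r"\*+", x)
    if len(r) > 0:
        return len(r[0]), x.replace("*", "")
    else:
        # no rating found; these reviews are filtered out in the next cells
        return np.nan, x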

In [7]:
extracted = [extract_rating(r) for r in rawtext]

In [8]:
text = [x[1] for x in extracted if not np.isnan(x[0])]
ratings = [x[0] for x in extracted if not np.isnan(x[0])]

In [9]:
#df = pd.DataFrame({"text":text, "rating":ratings})
#df.head()

In [10]:
#df.to_csv("wineratings.tsv", sep="\t", index=False)

In [11]:
#import tldr.prepare

In [12]:
bagger = tldr.prepare.Bagginator(text)
len(bagger)

2579

In [13]:
text[0]

'lovely delicate, fragrant rhone wine. polished leather and strawberries. perhaps a bit dilute, but good for drinking now. '

In [15]:
ind = bagger(text[0])
ind

[462,
 1272,
 2277,
 199,
 1788,
 759,
 378,
 2006,
 1671,
 703,
 767,
 1553,
 2241,
 381,
 50,
 1761,
 208]

In [16]:
bagger[ind]

['lovely',
 'delicate',
 'fragrant',
 'rhone',
 'wine',
 'polished',
 'leather',
 'and',
 'strawberries',
 'perhaps',
 'bit',
 'dilute',
 'but',
 'good',
 'for',
 'drinking',
 'now']

In [18]:
bagger.token_list[:10]

['rave',
 'vegetal',
 'firmly',
 'fine',
 'engaging',
 'mouvedre',
 'exhhuasted',
 'finer',
 'struck',
 'others']
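
The cells above use four behaviours of the `bagger` object: `len(bagger)`, calling it on a string to get token indices, indexing it with a list of indices to get tokens back, and a `token_list` attribute. The sketch below is a hypothetical stand-in with the same surface, not the actual `tldr.prepare.Bagginator` code. It assumes a regex tokenizer and a minimum token length of 2 (suggested by "a" being absent from the round-trip output above while "and" and "for" survive), and it sorts the vocabulary, whereas the real `token_list` is evidently in some other order.

```python
import re

class TokenIndexer:
    """Hypothetical stand-in for a Bagginator-like object (not the tldr implementation)."""

    def __init__(self, corpus, min_length=2):
        self.min_length = min_length  # assumption: single-character tokens are dropped
        vocab = set()
        for doc in corpus:
            vocab.update(self._tokenize(doc))
        self.token_list = sorted(vocab)
        self._index = {tok: i for i, tok in enumerate(self.token_list)}

    def _tokenize(self, text):
        tokens = re.findall(r"[a-z]+", text.lower())
        return [t for t in tokens if len(t) >= self.min_length]

    def __len__(self):
        # vocabulary size
        return len(self.token_list)

    def __call__(self, text):
        # text -> list of token indices; unknown tokens are skipped
        return [self._index[t] for t in self._tokenize(text) if t in self._index]

    def __getitem__(self, indices):
        # list of token indices -> list of tokens
        return [self.token_list[i] for i in indices]

indexer = TokenIndexer(["perhaps a bit dilute", "good for drinking now"])
ids = indexer("a bit good, perhaps")
tokens = indexer[ids]
print(len(indexer), ids, tokens)
```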

In [19]:
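# first 70% of the reviews for training, the remaining 30% for testing (no shuffling before the split)
split = int(0.7*len(text))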
split

806

In [23]:
corpus = [bagger(t) for t in text]

In [24]:
input_train, features = tldr.encode.encode_dense_bag_of_words(
    corpus[:split], ratings[:split],
    len(bagger), shuffle=True, num_epochs=100
    )

input_test, _ = tldr.encode.encode_dense_bag_of_words(
    corpus[split:], ratings[split:],
    len(bagger), shuffle=False, num_epochs=1
    )
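
The two calls differ only in `shuffle` and `num_epochs`. As a rough illustration of what those flags mean for the batches a model sees, here is a pure-Python sketch. It is not the `tldr.encode` implementation: the function name `iterate_batches`, the default batch size of 128, and the choice to cut batches from one continuous stream across epoch boundaries are all assumptions.

```python
import random

def iterate_batches(features, labels, batch_size=128, num_epochs=1, shuffle=False, seed=0):
    """Yield (features, labels) batches cut from a stream of num_epochs passes over the data."""
    rng = random.Random(seed)
    stream = []
    for _ in range(num_epochs):
        order = list(range(len(features)))
        if shuffle:
            rng.shuffle(order)  # new order each epoch
        stream.extend(order)
    # batches may straddle an epoch boundary; only the last one can be short
    for start in range(0, len(stream), batch_size):
        chosen = stream[start:start + batch_size]
        yield [features[i] for i in chosen], [labels[i] for i in chosen]

# evaluation-style stream: fixed order, here two passes to show the boundary behaviour
batches = list(iterate_batches(list("abcde"), [0, 1, 2, 3, 4], batch_size=2, num_epochs=2))

# training-style stream: shuffled, every example still appears once per epoch
shuffled = list(iterate_batches(list("abcde"), [0, 1, 2, 3, 4],
                                batch_size=2, num_epochs=2, shuffle=True))
seen = sorted(label for _, batch_labels in shuffled for label in batch_labels)
print(len(batches), seen)
```

For evaluation, `shuffle=False` with a single epoch means each test example is scored exactly once and in a reproducible order.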

In [25]:
logit_model = tf.estimator.LinearClassifier(feature_columns=features, n_classes=7)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_steps': None, '_save_summary_steps': 100, '_model_dir': '/tmp/tmp9oev5zig', '_log_step_count_steps': 100, '_keep_checkpoint_max': 5, '_session_config': None, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': 1, '_save_checkpoints_secs': 600}


In [26]:
logit_model.train(input_train)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmp9oev5zig/model.ckpt.
INFO:tensorflow:step = 1, loss = 249.077
INFO:tensorflow:global_step/sec: 134.718
INFO:tensorflow:step = 101, loss = 12.5402 (0.744 sec)
INFO:tensorflow:global_step/sec: 128.7
INFO:tensorflow:step = 201, loss = 7.48252 (0.777 sec)
INFO:tensorflow:global_step/sec: 145.385
INFO:tensorflow:step = 301, loss = 4.28763 (0.687 sec)
INFO:tensorflow:global_step/sec: 100.614
INFO:tensorflow:step = 401, loss = 3.02067 (0.994 sec)
INFO:tensorflow:global_step/sec: 118.636
INFO:tensorflow:step = 501, loss = 3.1227 (0.843 sec)
INFO:tensorflow:global_step/sec: 130.79
INFO:tensorflow:step = 601, loss = 2.13723 (0.765 sec)
INFO:tensorflow:Saving checkpoints for 630 into /tmp/tmp9oev5zig/model.ckpt.
INFO:tensorflow:Loss for final step: 1.19451.


<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x7f6ed3d6ccc0>

In [27]:
logit_model.evaluate(input_test)

INFO:tensorflow:Starting evaluation at 2018-01-04-04:49:06
INFO:tensorflow:Restoring parameters from /tmp/tmp9oev5zig/model.ckpt-630
INFO:tensorflow:Finished evaluation at 2018-01-04-04:49:07
INFO:tensorflow:Saving dict for global step 630: accuracy = 0.471098, average_loss = 1.85989, global_step = 630, loss = 214.507


{'accuracy': 0.47109827,
 'average_loss': 1.8598853,
 'global_step': 630,
 'loss': 214.50677}
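
The numbers in the logs can be sanity-checked with a little arithmetic. The sketch below assumes a batch size of 128 (not stated anywhere in this notebook) and 1152 rated reviews in total, which is the only count for which `int(0.7 * n)` gives the `split` of 806 shown earlier. It also assumes that `loss` is the per-batch summed loss averaged over evaluation batches, while `average_loss` is the per-example mean.

```python
import math

n_train = 806        # `split` from the notebook
n_total = 1152       # assumption: the only n with int(0.7 * n) == 806
n_test = n_total - n_train
batch_size = 128     # assumption: not shown in the notebook
num_epochs = 100     # from the training input function

# steps needed to consume 100 epochs of training data
train_steps = math.ceil(n_train * num_epochs / batch_size)

# evaluation: one pass over the test set
eval_batches = math.ceil(n_test / batch_size)
average_loss = 1.8598853                       # per-example loss from the output above
reported_loss = average_loss * n_test / eval_batches

# accuracy expressed as a count of correct test predictions
correct = round(0.47109827 * n_test)

print(train_steps, eval_batches, round(reported_loss, 3), correct)
```

Under these assumptions the figures line up with the logs: 630 training steps, and a `loss` of about 214.5 from an `average_loss` of about 1.86 spread over 3 evaluation batches.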

In [29]:
dnn_model = tf.estimator.DNNClassifier(hidden_units=[100, 100], feature_columns=features, n_classes=7)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_steps': None, '_save_summary_steps': 100, '_model_dir': '/tmp/tmp9fqle53k', '_log_step_count_steps': 100, '_keep_checkpoint_max': 5, '_session_config': None, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': 1, '_save_checkpoints_secs': 600}


In [30]:
dnn_model.train(input_train)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmp9fqle53k/model.ckpt.
INFO:tensorflow:step = 1, loss = 250.867
INFO:tensorflow:global_step/sec: 54.069
INFO:tensorflow:step = 101, loss = 0.177203 (1.849 sec)
INFO:tensorflow:global_step/sec: 68.0672
INFO:tensorflow:step = 201, loss = 0.124044 (1.470 sec)
INFO:tensorflow:global_step/sec: 75.864
INFO:tensorflow:step = 301, loss = 0.0413845 (1.317 sec)
INFO:tensorflow:global_step/sec: 76.3127
INFO:tensorflow:step = 401, loss = 0.0268679 (1.311 sec)
INFO:tensorflow:global_step/sec: 84.2824
INFO:tensorflow:step = 501, loss = 0.0171006 (1.187 sec)
INFO:tensorflow:global_step/sec: 80.7968
INFO:tensorflow:step = 601, loss = 0.0173371 (1.240 sec)
INFO:tensorflow:Saving checkpoints for 630 into /tmp/tmp9fqle53k/model.ckpt.
INFO:tensorflow:Loss for final step: 0.00931517.


<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x7f6ed3d10748>

In [31]:
dnn_model.evaluate(input_test)

INFO:tensorflow:Starting evaluation at 2018-01-04-04:53:26
INFO:tensorflow:Restoring parameters from /tmp/tmp9fqle53k/model.ckpt-630
INFO:tensorflow:Finished evaluation at 2018-01-04-04:53:26
INFO:tensorflow:Saving dict for global step 630: accuracy = 0.485549, average_loss = 2.66293, global_step = 630, loss = 307.124


{'accuracy': 0.48554912,
 'average_loss': 2.662926,
 'global_step': 630,
 'loss': 307.12411}
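
To put the two accuracy figures side by side, the sketch below converts them to counts of correct test predictions and compares the gap with a simple binomial standard error. The test-set size is inferred from `split = 806` rather than read from the notebook, and the standard-error formula treats test examples as independent; both are assumptions, so treat this as a rough check and not a proper significance test.

```python
import math

split = 806
# assumption: recover the total number of rated reviews from int(0.7 * n) == split
n_total = next(n for n in range(split, 2 * split) if int(0.7 * n) == split)
n_test = n_total - split

linear_acc = 0.47109827   # LinearClassifier accuracy reported above
dnn_acc = 0.48554912      # DNNClassifier accuracy reported above

linear_correct = round(linear_acc * n_test)
dnn_correct = round(dnn_acc * n_test)

# binomial standard error of one accuracy estimate on n_test examples
std_err = math.sqrt(linear_acc * (1 - linear_acc) / n_test)
gap = dnn_acc - linear_acc

print(n_test, linear_correct, dnn_correct)
print(f"gap = {gap:.4f}, standard error = {std_err:.4f}")
```

Under these assumptions the DNN gets 168 of 346 test reviews right against 163 for the linear model, a gap smaller than one standard error, so this single split does not clearly separate the two. The DNN's much lower training loss combined with its higher evaluation `average_loss` (2.66 against 1.86) is consistent with overfitting.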