### tldr prototyping

short-term goal:

* focus on tools to generate input functions for the TensorFlow Estimator API for different categories of problems

seems like there are 3 generic steps before you get to the tensorflow part:

  1. **loading** the data. depends on how data is saved (e.g. a CSV, folder of text files, etc) but not language or model directly
  2. **preprocessing** the text. tokenizing, stemming, stopwords, etc. depends on language and application, and (to a lesser extent) the model. but not on how the files were saved.
  3. **encoding** the data to send to tensorflow. whatever process maps tokens to a bag of words, sequence of one-hot encoded vectors, etc. depends on the model structure more than anything
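
a rough sketch of how the three steps could stay decoupled, using only the standard library. all of the helper names here (`load_lines`, `preprocess`, `build_vocabulary`, `encode_bag_of_words`) and the tiny stopword list are hypothetical placeholders, not anything from this notebook or from tensorflow; the regex tokenizer is a stand-in for a real one like nltk's.

```python
import re
from collections import Counter

# NOTE: every name below is a hypothetical placeholder for illustration.

def load_lines(lines, min_length=5):
    """Loading: depends only on how the data is stored (here, an iterable of lines)."""
    return [line.strip() for line in lines if len(line.strip()) > min_length]

def preprocess(text, stopwords=frozenset({"a", "and", "the", "but", "for"})):
    """Preprocessing: lowercase, pull out word tokens, drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]

def build_vocabulary(token_lists):
    """Map each distinct token to an integer index (sorted so the result is deterministic)."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    return {token: i for i, token in enumerate(vocab)}

def encode_bag_of_words(tokens, vocabulary):
    """Encoding: one count per vocabulary entry, in index order."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocabulary]

# made-up sample lines, just to exercise the three steps end to end
raw = ["Lovely delicate wine. ***\n", "\n", "Thin and sour wine. *\n"]
docs = load_lines(raw)
token_lists = [preprocess(d) for d in docs]
vocabulary = build_vocabulary(token_lists)
vectors = [encode_bag_of_words(t, vocabulary) for t in token_lists]
```

the point of the split is that swapping the storage format only touches `load_lines`, swapping language or tokenizer only touches `preprocess`, and swapping the model only touches the encoder.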

In [19]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import re
import nltk

%matplotlib inline
from IPython.core.pylabtools import figsize

In [2]:
# load our sample data, skipping blank or near-empty lines
textfile = "winereviews.txt"
with open(textfile, "rb") as f:
    # drop any non-ASCII bytes rather than failing on them
    rawtext = [x.decode("ascii", errors="ignore") for x in f if len(x) > 5]
len(rawtext)

1230

In [3]:
print(rawtext[0])

Lovely delicate, fragrant Rhone wine. Polished leather and strawberries. Perhaps a bit dilute, but good for drinking now. ***



In [22]:
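def extract_rating(x):
    """Return (rating, review text with the rating marker removed)."""
    x = x.lower().strip()
    if "no stars" in x:
        return 0, x.replace("no stars", "")
    # the rating is the length of the first run of asterisks
    r = re.findall(r"\*+", x)
    if len(r) > 0:
        return len(r[0]), x.replace("*", "")
    else:
        # no rating found; np.nan marks the review for filtering later
        return np.nan, x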

In [23]:
extracted = [extract_rating(r) for r in rawtext]

In [27]:
# keep only the reviews where a rating was found
text = [x[1] for x in extracted if not np.isnan(x[0])]
ratings = [x[0] for x in extracted if not np.isnan(x[0])]

In [33]:
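# word_tokenize needs NLTK's Punkt tokenizer data, fetched once via nltk.download
# (the exact resource name can differ between NLTK versions)
tokens = [nltk.word_tokenize(x) for x in text]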

In [34]:
text[0]

'lovely delicate, fragrant rhone wine. polished leather and strawberries. perhaps a bit dilute, but good for drinking now. '

In [35]:
tokens[0]

['lovely',
 'delicate',
 ',',
 'fragrant',
 'rhone',
 'wine',
 '.',
 'polished',
 'leather',
 'and',
 'strawberries',
 '.',
 'perhaps',
 'a',
 'bit',
 'dilute',
 ',',
 'but',
 'good',
 'for',
 'drinking',
 'now',
 '.']

In [39]:
from collections import defaultdict