# TensorFlow-based Text Classifier
This model uses a Kaggle dataset of news short descriptions and their categories, and tries to predict the correct news category from a short description.
This is mostly based on the Datalab sample for a TF text classifier, which can be found here: https://github.com/googledatalab/notebooks/blob/master/samples/TensorFlow/Text%20Classification%20with%20TensorFlow.ipynb

Import libraries and download the NLTK components

In [6]:
from io import BytesIO
import googleapiclient.discovery
import base64
import json
import os
import zipfile
import pandas as pd
import numpy as np
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
import re
ps = PorterStemmer()
nltk.download('all')
from sklearn.model_selection import train_test_split

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /content/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /content/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to /content/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to /content/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
[nltk_data]    | Downloading package cess_esp to /content/nltk_data...
[nltk_data]    |   Package cess_esp is already up-to

[nltk_data]    |   Package timit is already up-to-date!
[nltk_data]    | Downloading package toolbox to /content/nltk_data...
[nltk_data]    |   Package toolbox is already up-to-date!
[nltk_data]    | Downloading package treebank to /content/nltk_data...
[nltk_data]    |   Package treebank is already up-to-date!
[nltk_data]    | Downloading package twitter_samples to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Package twitter_samples is already up-to-date!
[nltk_data]    | Downloading package udhr to /content/nltk_data...
[nltk_data]    |   Package udhr is already up-to-date!
[nltk_data]    | Downloading package udhr2 to /content/nltk_data...
[nltk_data]    |   Package udhr2 is already up-to-date!
[nltk_data]    | Downloading package unicode_samples to
[nltk_data]    |     /content/nltk_data...
[nltk_data]    |   Package unicode_samples is already up-to-date!
[nltk_data]    | Downloading package universal_treebanks_v20 to
[nltk_data]    |     /content/nltk_data...
[nl

Kaggle requires an ID file to download files from its API. I keep the file encrypted in a storage bucket, and I use Cloud KMS to decrypt it.
This is based on the method documented here: https://cloud.google.com/kms/docs/store-secrets

In [7]:
%%storage read --object gs://mdh-secrets/kaggle.json.encrypted --variable kaggle_id

In [8]:
kms_client = googleapiclient.discovery.build('cloudkms', 'v1')
project="mdh-test-restricted-datalab"
location="global"
keyring="storage"
cryptokey="mykey"
key_name = 'projects/{}/locations/{}/keyRings/{}/cryptoKeys/{}'.format(project,location,keyring,cryptokey)
crypto_keys = kms_client.projects().locations().keyRings().cryptoKeys()
request = crypto_keys.decrypt(
  name=key_name,
  body={'ciphertext':base64.b64encode(kaggle_id).decode('ascii')}
)
plaintext = base64.b64decode(request.execute()['plaintext'].encode('ascii'))
os.makedirs(os.path.dirname('/content/.kaggle/kaggle.json'),exist_ok=True)
with open('/content/.kaggle/kaggle.json','wb') as f: # the context manager closes the file even if the write fails
  f.write(plaintext)
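
As far as I know, the Kaggle CLI complains when kaggle.json is readable by other users, so it can be worth writing the decrypted bytes with owner-only permissions. The following is a minimal sketch under that assumption; the helper name `write_secret`, the temporary directory and the fake credentials are all made up for illustration and are not part of the notebook.

```python
import os
import stat
import tempfile

def write_secret(path, secret_bytes):
    """Write secret_bytes to path so that only the owner can read or write it."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Create the file with mode 0o600 from the start rather than fixing it up later.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, 'wb') as f:
        f.write(secret_bytes)
    # Enforce the mode explicitly in case the file already existed with looser bits.
    os.chmod(path, 0o600)

# Hypothetical stand-in for the decrypted `plaintext` from the cell above.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, '.kaggle', 'kaggle.json')
write_secret(demo_path, b'{"username": "example", "key": "not-a-real-key"}')
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
```

In the notebook itself the call would presumably be `write_secret('/content/.kaggle/kaggle.json', plaintext)`.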

Use the Kaggle API to download the news dataset

In [9]:
!kaggle datasets download -p /content/kaggle rmisra/news-category-dataset

news-category-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


Extract the JSON file from the zip file

In [10]:
with zipfile.ZipFile("/content/kaggle/news-category-dataset.zip") as z:
  with z.open('News_Category_Dataset.json') as f:
    data = pd.read_json(f,lines=True)
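
The `lines=True` argument tells pandas that the file holds one JSON object per line rather than a single JSON array. To make that format concrete, here is a small standard-library sketch that builds an in-memory archive and reads it back line by line. The member name `example.json` and the two records are invented for illustration.

```python
import io
import json
import zipfile

# Build a tiny in-memory archive that mimics the layout of the dataset
# (the member name and the two records are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('example.json',
               '{"category": "CRIME", "short_description": "First record."}\n'
               '{"category": "ENTERTAINMENT", "short_description": "Second record."}\n')

# Read it back: one json.loads call per non-empty line.
buffer.seek(0)
with zipfile.ZipFile(buffer) as z:
    with z.open('example.json') as f:
        text = io.TextIOWrapper(f, encoding='utf-8')
        records = [json.loads(line) for line in text if line.strip()]
```

Each element of `records` is a plain dict, which is roughly what one row of the DataFrame corresponds to.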

In [11]:
data.head()

Unnamed: 0,authors,category,date,headline,link,short_description
0,Melissa Jeltsen,CRIME,2018-05-26,There Were 2 Mass Shootings In Texas Last Week...,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...
1,Andy McDonald,ENTERTAINMENT,2018-05-26,Will Smith Joins Diplo And Nicky Jam For The 2...,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.
2,Ron Dicker,ENTERTAINMENT,2018-05-26,Hugh Grant Marries For The First Time At Age 57,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...
3,Ron Dicker,ENTERTAINMENT,2018-05-26,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...
4,Ron Dicker,ENTERTAINMENT,2018-05-26,Julianna Margulies Uses Donald Trump Poop Bags...,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ..."


Reduce to just a category and a description

In [12]:
cat_desc = data[['category','short_description']]

Remove all records with null values

In [13]:
cat_desc.replace('',np.nan,inplace=True) # replace a blank string with a null value
cat_desc.dropna(inplace=True)
cat_desc.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


Unnamed: 0,category,short_description
0,CRIME,She left her husband. He killed their children...
1,ENTERTAINMENT,Of course it has a song.
2,ENTERTAINMENT,The actor and his longtime girlfriend Anna Ebe...
3,ENTERTAINMENT,The actor gives Dems an ass-kicking for not fi...
4,ENTERTAINMENT,"The ""Dietland"" actress said using the bags is ..."
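
The SettingWithCopyWarning messages in the output above appear because `cat_desc` is a column selection from `data` that is then modified in place. A sketch of one way to avoid them, assuming an explicit copy is acceptable here, is to call `.copy()` and reassign the result instead of using `inplace=True`. The three rows below are invented stand-ins for the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real `data` frame.
data = pd.DataFrame({
    'category': ['CRIME', 'ENTERTAINMENT', 'ENTERTAINMENT'],
    'headline': ['h1', 'h2', 'h3'],
    'short_description': ['She left her husband.', '', 'Of course it has a song.'],
})

# .copy() gives cat_desc its own data, so later changes are unambiguous.
cat_desc = data[['category', 'short_description']].copy()
# Blank strings become NaN, then rows with NaN are dropped; reassign instead of inplace=True.
cat_desc = cat_desc.replace('', np.nan).dropna()
```

The original `data` frame is left untouched, and the row with the blank description is gone from `cat_desc`.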


Obtain a list of unique categories. We'll use the length of this list to define the width of our classification layer.

In [14]:
unique_categories = list(set(cat_desc['category'].values))

category_id = {v: idx + 1 for idx, v in enumerate(unique_categories)}

cat_desc['category'] = cat_desc['category'].apply(lambda x: category_id[x])
len(unique_categories)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


31
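
Two details of this mapping are easy to miss. The order of a `set` of strings is not guaranteed to be the same from one Python process to the next, so the category IDs can change between runs. And because the IDs start at 1, the classification layer needs one more output than there are categories. The sketch below uses `sorted` as one possible way to make the mapping reproducible; the category list and the names `id_category` and `num_outputs` are illustrative assumptions.

```python
# Made-up labels standing in for cat_desc['category'].
categories = ['CRIME', 'ENTERTAINMENT', 'ENTERTAINMENT', 'POLITICS', 'CRIME']

# Sorting gives the same category -> id mapping on every run.
unique_categories = sorted(set(categories))
category_id = {name: idx + 1 for idx, name in enumerate(unique_categories)}

# Reverse mapping, handy for turning predictions back into category names.
id_category = {idx: name for name, idx in category_id.items()}

encoded = [category_id[c] for c in categories]

# IDs run from 1 to len(unique_categories), so the logits need max id + 1 slots.
num_outputs = max(category_id.values()) + 1
```

With 31 categories this gives 32 outputs, which matches the `num_outputs=32` used in the model further down.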

Split datasets into train and test

In [15]:
train_data,test_data = train_test_split(cat_desc,test_size=0.2)

Remove punctuation and non-ASCII characters from each description, split it into a list of words, then stem those words

In [16]:
def clean_and_tokenize(desc):

  x = re.sub(r'[\W_]', ' ', desc) # replace everything except letters and digits with a space
  x = x.lower() # convert to lowercase
  x = re.sub(r'[^\x00-\x7f]',r'', x) # remove non ascii texts
  tokens = word_tokenize(x)

  stemmed_tokens = [ps.stem(i) for i in tokens]
  return stemmed_tokens
    
clean_train_tokens = train_data['short_description'].apply(clean_and_tokenize)
clean_test_tokens = test_data['short_description'].apply(clean_and_tokenize)
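
To see what the regular-expression part of the cleanup is meant to do, here is a standalone sketch that assumes the goal stated in the comments (keep only letters, digits and spaces). It deliberately leaves out NLTK: `str.split` stands in for `word_tokenize` and there is no stemming, so the tokens differ from what the notebook produces. The function name `clean_text` and the sample sentence are made up.

```python
import re

def clean_text(desc):
    """Regex part of the cleanup only: no NLTK tokenizer, no stemming."""
    x = re.sub(r'[\W_]', ' ', desc)       # punctuation and underscores become spaces
    x = x.lower()
    x = re.sub(r'[^\x00-\x7f]', '', x)    # drop any remaining non-ASCII characters
    return x.split()                      # whitespace split as a stand-in for word_tokenize

cleaned = clean_text('Of course it has a song. Café_time!')
# cleaned == ['of', 'course', 'it', 'has', 'a', 'song', 'caf', 'time']
```

Note that the accented letter is dropped rather than transliterated, so "Café" becomes "caf".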

Collect the unique tokens from every description, so that each word counts at most once per description

In [18]:
def get_unique_tokens_per_row(token_list):
  words = []
  for row in token_list:
    words.extend(list(set(row)))
  return words

In [19]:
words = pd.DataFrame(get_unique_tokens_per_row(clean_train_tokens),columns=['words'])

The vocab is the set of known words that the model learns a value for. We drop any word that appears in 10 or fewer training descriptions (the filter is `> 10`, and each description counts a word at most once), as such rare words would act as outliers for the model.

In [21]:
vocab = token_frequency[token_frequency > 10]
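
The cell that defines `token_frequency` is not shown above. Judging from how it is used, it presumably maps each token to the number of training descriptions that contain it (in pandas, something like `words['words'].value_counts()` would produce that, though this is a guess). The sketch below shows the same idea with `collections.Counter`; the token lists and the `MIN_COUNT` name are invented for illustration.

```python
from collections import Counter

# Made-up token lists standing in for clean_train_tokens.
token_rows = [
    ['the', 'actor', 'the', 'film'],
    ['the', 'song'],
    ['actor', 'song', 'song'],
]

# Count each token once per row, i.e. in how many descriptions it occurs.
token_frequency = Counter(token for row in token_rows for token in set(row))

MIN_COUNT = 1  # the notebook uses 10; lowered here so the toy data keeps something
vocab = {token: count for token, count in token_frequency.items() if count > MIN_COUNT}
```

Here 'film' occurs in only one description and is dropped, while 'the', 'actor' and 'song' each occur in two and are kept.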

TensorFlow works on numbers, not strings, so we convert the tokens to integer IDs.

In [23]:
import six 

CONTROL_WORDS = ['<s>', '</s>', '<unk>']

vocab_id = {v[0]: idx + 1 for idx, v in enumerate(sorted(six.iteritems(vocab), key=lambda x: x[1], reverse=True))}
for c in CONTROL_WORDS:
  vocab_id[c] = len(vocab_id)
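
One side effect of this numbering is worth knowing: the words get IDs 1 to N, and the first control word is then assigned `len(vocab_id)`, which is also N, so `<s>` shares an ID with the least frequent vocab word. The toy example below reproduces that and sketches one possible alternative. The frequency table and the name `vocab_id_alt` are made up, and the alternative would need an embedding table with one extra row.

```python
# Toy frequency table standing in for `vocab` (token -> count).
vocab = {'the': 50, 'actor': 30, 'song': 20}
CONTROL_WORDS = ['<s>', '</s>', '<unk>']

ranked = sorted(vocab.items(), key=lambda kv: kv[1], reverse=True)

# Same scheme as the cell above: words get 1..N, control words get len(vocab_id).
vocab_id = {token: idx + 1 for idx, (token, _) in enumerate(ranked)}
for c in CONTROL_WORDS:
    vocab_id[c] = len(vocab_id)
shared = [t for t, i in vocab_id.items() if i == vocab_id['<s>']]

# One possible alternative: offset by one so every id is distinct.
# The embedding table would then need len(vocab_id_alt) + 1 rows (id 0 is padding).
vocab_id_alt = {token: idx + 1 for idx, (token, _) in enumerate(ranked)}
for c in CONTROL_WORDS:
    vocab_id_alt[c] = len(vocab_id_alt) + 1
```

In the toy data `shared` contains both 'song' and '<s>', while every ID in `vocab_id_alt` is unique.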

Convert the training and test data into lists of vocab IDs, then pad or truncate each list to 128 entries so that all lists have the same length

In [24]:
def filter_text_by_vocab(news_data, vocab_id):
    """Converts tokens to vocab ids, mapping unknown tokens to '<unk>'.
    Args:
        news_data: list, where each element is a token list
        vocab_id: dict mapping each known token to its integer id.
    Returns:
        List of id lists, each wrapped in '<s>' ... '</s>' and truncated to 128 ids
    """
    wids_all = []
    for row in news_data:
        wids = [vocab_id[token] if (token in vocab_id) else vocab_id['<unk>'] for token in row]
        wids = [vocab_id['<s>']] + wids + [vocab_id['</s>']]
        wids = wids[:128]
        wids_all.append(wids)
    return wids_all
  
clean_train_data = filter_text_by_vocab(clean_train_tokens,vocab_id)
clean_test_data = filter_text_by_vocab(clean_test_tokens,vocab_id)

In [25]:
def pad_wids(wids, length):
    """Pad each instance with zeros (or truncate it) to exactly `length` ids."""
    padded = []
    for r in wids:
        if len(r) >= length:
            padded.append(r[0:length])
        else:
            padded.append(r + [0] * (length - len(r)))
    return padded
  
padded_train_data = pad_wids(clean_train_data, 128)
padded_test_data = pad_wids(clean_test_data, 128)  
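
A small worked example may help show what the two steps above do to a single description. This is a condensed sketch that combines both functions into one hypothetical helper, `encode`, and uses a made-up five-entry vocabulary and a length of 8 instead of 128.

```python
# Made-up vocabulary; 0 is reserved for padding.
vocab_id = {'the': 1, 'actor': 2, '<s>': 3, '</s>': 4, '<unk>': 5}

def encode(tokens, vocab_id, length):
    """Map tokens to ids, wrap in <s>/</s>, then truncate or zero-pad to `length`."""
    wids = [vocab_id.get(t, vocab_id['<unk>']) for t in tokens]
    wids = [vocab_id['<s>']] + wids + [vocab_id['</s>']]
    wids = wids[:length]
    return wids + [0] * (length - len(wids))

short = encode(['the', 'actor', 'sing'], vocab_id, 8)
# short == [3, 1, 2, 5, 4, 0, 0, 0]  ('sing' is unknown, so it maps to <unk>)

long = encode(['the'] * 10, vocab_id, 8)
# long == [3, 1, 1, 1, 1, 1, 1, 1]   (truncation cuts off the closing </s>)
```

The second case shows that descriptions longer than the limit lose their end marker, which is also true of the notebook's version.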

Build a DNN Model

In [27]:
import tensorflow as tf
import shutil
import tensorflow.contrib.learn as tflearn
import tensorflow.contrib.layers as tflayers
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib.learn.python.learn.utils import saved_model_export_utils
from google.datalab.ml import Summary

TRAIN_BATCH_SIZE = 31
EVAL_BATCH_SIZE = 527
EMBEDDING_SIZE = 527
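
The model below selects a batch by drawing a batch index `i` and slicing `[i * batch_size:(i + 1) * batch_size]`. That only yields full batches while `i` stays below `len(data) // batch_size`. The sketch below shows this with a plain Python list, on the assumption that tensor slicing clamps out-of-range bounds the same way; the helper name `batch_slice` and the toy sizes are made up.

```python
def batch_slice(data, i, batch_size):
    """Return batch number i, using the same slice expression as the model."""
    return data[i * batch_size:(i + 1) * batch_size]

data = list(range(10))
batch_size = 3
num_full_batches = len(data) // batch_size    # 3 full batches: indices 0, 1, 2

# Batch sizes for every index from 0 to len(data) - 1.
sizes = [len(batch_slice(data, i, batch_size)) for i in range(len(data))]
# sizes == [3, 3, 3, 1, 0, 0, 0, 0, 0, 0]
```

Indices past the last full batch give a short or empty batch, and an empty batch has no meaningful mean loss, so the range of batch indices matters.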

In [28]:
def dnn_model(batch_size, train_data, targets, mode):
    """Build a DNN model."""
    
    with tf.name_scope(mode):
        raw_data = tf.convert_to_tensor(train_data, dtype=tf.int64) # convert our data into tensors
        targets = tf.convert_to_tensor(targets, dtype=tf.int64) # convert our classes into tensors
        batch_num = len(train_data) // batch_size - 1 # number of batch indices to draw from; each one selects a full batch
        i = tf.train.range_input_producer(batch_num, shuffle=True).dequeue() # draw a random batch index
        input_seqs = raw_data[i * batch_size: (i + 1) * batch_size] 
        targets = targets[i * batch_size: (i + 1) * batch_size] # get the batch of targets
        length = tf.count_nonzero(input_seqs, axis=1, dtype=tf.int32)

    embedding_map = tf.get_variable(
        name="embeddings_map",
        shape=[len(vocab_id), EMBEDDING_SIZE])
    seq_embeddings = tf.nn.embedding_lookup(embedding_map, input_seqs) # look up the embedding of each word id in the description
    
    # Simply combine embeddings.
    combined = tf.sqrt(tf.reduce_sum(tf.square(seq_embeddings), 1)) # combine the word embeddings into a single vector per description

    logits = tf.contrib.layers.fully_connected(
        inputs=combined,
        num_outputs=32,
        activation_fn=None) # a logit for each classification

    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits) # softmax cross-entropy against the true category
    losses= tf.reduce_mean(cross_entropy, name='xentropy_mean')
    predictions = tf.argmax(logits, 1)
    _, accuracy = tf.contrib.metrics.streaming_accuracy(targets, predictions)
    correct_predictions = tf.count_nonzero(tf.equal(predictions, targets))
    return losses, accuracy, correct_predictions
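
The shapes in `dnn_model` can be hard to follow from the TensorFlow code alone. Here is a NumPy sketch of the forward pass under toy sizes: embedding lookup, the square-root-of-sum-of-squares combination over the sequence axis, and a linear layer. It is not the notebook's implementation; the sizes and random weights are invented, and the bias term of the fully connected layer is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size, num_classes = 6, 4, 3   # toy sizes, not the notebook's

embedding_map = rng.uniform(-0.08, 0.08, size=(vocab_size, embedding_size))
input_seqs = np.array([[3, 1, 2, 4, 0, 0],
                       [3, 5, 4, 0, 0, 0]])          # [batch, sequence_length]

# Integer indexing plays the role of tf.nn.embedding_lookup.
seq_embeddings = embedding_map[input_seqs]           # [batch, sequence_length, embedding_size]

# Combine over the sequence axis: one non-negative vector per description.
combined = np.sqrt(np.sum(np.square(seq_embeddings), axis=1))   # [batch, embedding_size]

weights = rng.uniform(-0.08, 0.08, size=(embedding_size, num_classes))
logits = combined @ weights                          # [batch, num_classes]
predictions = logits.argmax(axis=1)                  # one class index per description
```

Two consequences of this combination step: word order is discarded, and the padding ID 0 has its own embedding row, so padded positions also contribute to the combined vector.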

The trainer for the model

In [29]:
def train(model_fn, train_steps, model_dir):
    """Model trainer."""

    g = tf.Graph()
    with g.as_default():
        uniform_initializer = tf.random_uniform_initializer(minval=-0.08, maxval=0.08)
        with tf.variable_scope("Model", reuse=None, initializer=uniform_initializer):
            losses_train, _, _ = model_fn(TRAIN_BATCH_SIZE, padded_train_data, train_data['category'].values, 'train')
        with tf.variable_scope("Model", reuse=True):
            _, accuracy, correct_predictions = model_fn(EVAL_BATCH_SIZE, padded_test_data, test_data['category'].values, 'eval')

        tf.summary.scalar('accuracy', accuracy)        
        tf.summary.scalar('losses', losses_train)  
        merged = tf.summary.merge_all()        
        
        global_step = tf.Variable(
            initial_value=0,
            name="global_step",
            trainable=False,
            collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES])
    
        train_op = tf.contrib.layers.optimize_loss(
            loss=losses_train,
            global_step=global_step,
            learning_rate=0.001,
            optimizer='Adam')

    def train_step_fn(sess, *args, **kwargs):
        total_loss, should_stop = tf.contrib.slim.python.slim.learning.train_step(sess, *args, **kwargs)

        if train_step_fn.train_steps % 50 == 0:
            summary = sess.run(merged)
            train_step_fn.eval_writer.add_summary(summary, train_step_fn.train_steps)
            total_correct_predictions = 0
            num_eval_batches = len(padded_test_data) // EVAL_BATCH_SIZE
            for i in range(num_eval_batches):
                total_correct_predictions += sess.run(correct_predictions)
            print('accuracy: %.4f' % (float(total_correct_predictions)/(num_eval_batches*EVAL_BATCH_SIZE)))

        train_step_fn.train_steps += 1
        return [total_loss, should_stop] 

    train_step_fn.train_steps = 0
    train_step_fn.eval_writer = tf.summary.FileWriter(os.path.join(model_dir, 'eval'))

    tf.contrib.slim.learning.train(
        train_op,
        model_dir,
        graph=g,
        global_step=global_step,
        number_of_steps=train_steps,
        log_every_n_steps=50,  
        train_step_fn=train_step_fn)
    train_step_fn.eval_writer.close()
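
The accuracy printed every 50 steps is the number of correct predictions summed over the evaluation batches, divided by the number of examples in those batches. A plain-Python sketch of that arithmetic follows. Note one difference: the trainer draws its evaluation batches from a shuffled queue, whereas this sketch simply walks through the data in order. The predictions and targets are made up.

```python
EVAL_BATCH_SIZE = 4
predictions = [1, 2, 2, 3, 1, 1, 2, 3, 3, 1]
targets     = [1, 2, 3, 3, 1, 2, 2, 3, 1, 1]

# Only full batches are evaluated; here the last 2 examples are left out.
num_eval_batches = len(targets) // EVAL_BATCH_SIZE

total_correct = 0
for b in range(num_eval_batches):
    start, end = b * EVAL_BATCH_SIZE, (b + 1) * EVAL_BATCH_SIZE
    total_correct += sum(p == t for p, t in zip(predictions[start:end], targets[start:end]))

accuracy = total_correct / (num_eval_batches * EVAL_BATCH_SIZE)
# accuracy == 0.75  (6 correct out of the 8 evaluated examples)
```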

In [30]:
# Start fresh. Skip this step to continue training from a previous checkpoint.
!rm -rf dnn

Train the model

In [31]:
train(dnn_model, train_steps=1500, model_dir='dnn')

Instructions for updating:
Please switch to tf.metrics.accuracy. Note that the order of the labels and predictions arguments has been switched.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path dnn/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 1.
accuracy: 0.0204
INFO:tensorflow:global step 50: loss = 2.6683 (0.175 sec/step)
accuracy: 0.2802
INFO:tensorflow:global step 100: loss = 2.6063 (0.235 sec/step)
accuracy: 0.2860
INFO:tensorflow:global step 150: loss = 2.7712 (0.236 sec/step)
accuracy: 0.2892
INFO:tensorflow:global step 200: loss = 2.5948 (0.130 sec/step)
accuracy: 0.3146
INFO:tensorflow:global step 250: loss = 2.7525 (0.239 sec/step)
accuracy: 0.3212
INFO:tensorflow:global step 300: loss = 2.5932 (0.214 sec/step)
accura

The results are OK, but not great: in the log excerpt above, accuracy climbs from about 0.02 to about 0.32 by step 250. Perhaps more data, or categories that overlap less, would help.