# The 20 newsgroups topic analysis

Instead of repeating the IMDB sentiment analysis from the lesson (frankly, I'm a little bored with sentiment analysis), I will try applying a similar deep-learning approach to NLP classification on a dataset a coworker has recently been experimenting with in `scikit-learn`: `sklearn.datasets.fetch_20newsgroups`.

http://people.csail.mit.edu/jrennie/20Newsgroups/

## Setup data

In [1]:
import os
current_dir = os.getcwd()

LESSON_HOME_DIR = current_dir + '/'
DATA_HOME_DIR = LESSON_HOME_DIR + 'data/'

DATASET_DIR = DATA_HOME_DIR + '20_newsgroup/'
MODEL_DIR = DATASET_DIR + 'models/'

In [2]:
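if not os.path.exists(MODEL_DIR):
    # makedirs creates any missing parents too (data/ and 20_newsgroup/),
    # and avoids failing when DATASET_DIR already exists but models/ does not
    os.makedirs(MODEL_DIR)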

In [3]:
from sklearn.datasets import fetch_20newsgroups

category_subset = [
    'alt.atheism',
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'soc.religion.christian',
]

x_train = fetch_20newsgroups(
    subset='train',
    categories = category_subset,
    shuffle = True,
    remove = ('headers', 'footers', 'quotes'))

x_test = fetch_20newsgroups(
    subset='test',
    categories = category_subset,
    shuffle = True,
    remove = ('headers', 'footers', 'quotes'))

In [4]:
x_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'soc.religion.christian']

The `target_names` match the requested categories.

In [5]:
x_train.filenames.shape, x_train.target.shape, len(x_train.data)

((2254,), (2254,), 2254)

In [6]:
x_test.filenames.shape, x_test.target.shape, len(x_test.data)

((1500,), (1500,), 1500)

In [7]:
x_train.filenames[:10], x_train.target[:10]

(array([ '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/comp.os.ms-windows.misc/9785',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20672',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20528',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/comp.os.ms-windows.misc/9983',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/alt.atheism/53756',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38394',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38595',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38353',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/alt.atheism/51229',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-train/alt.atheism/53289'], 
       dtype='|S93'), array

In [8]:
x_test.filenames[:10], x_test.target[:10]

(array([ '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38963',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/soc.religion.christian/21442',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/39021',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/comp.os.ms-windows.misc/10835',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/soc.religion.christian/21412',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38846',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/comp.os.ms-windows.misc/10633',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/alt.atheism/53640',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/39077',
        '/home/ubuntu/scikit_learn_data/20news_home/20news-bydate-test/soc.religion.christian/21735'], 
       dtype='|S92'), 

Keras implements `get_word_index()` for the IMDB dataset, which returns a dictionary of word -> index derived from a JSON file hosted on Amazon S3.

This seems bizarre to me. Anyway, sklearn doesn't provide a word index for this dataset, so let's create our own with `keras.preprocessing.text.Tokenizer` (https://keras.io/preprocessing/text/).

In [9]:
from keras.preprocessing.text import Tokenizer
from unidecode import unidecode

train_tokenizer = Tokenizer()
unidecoded_x_train = [unidecode(text) for text in x_train.data]
train_tokenizer.fit_on_texts(unidecoded_x_train) # builds the word index
train_sequences = train_tokenizer.texts_to_sequences(unidecoded_x_train)

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
  _warn_if_not_unicode(string)


In [10]:
train_word_index = train_tokenizer.word_index
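
For intuition about what ends up in `word_index`, here is a rough pure-Python sketch of a frequency-ranked index. It is only an approximation of what `Tokenizer` does, not its actual implementation: the names `build_word_index` and `texts_to_sequences`, the regex-based word splitting, and the alphabetical tie-breaking are all assumptions made for this toy.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")  # assumed word pattern for this toy only


def build_word_index(texts):
    """Toy stand-in for fitting a tokenizer: rank words by frequency."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    # The most frequent word gets index 1; index 0 stays free for padding.
    # Ties are broken alphabetically just to keep the toy deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index):
    """Map each text to a list of ids, dropping words missing from the index."""
    return [
        [word_index[w] for w in WORD_RE.findall(t.lower()) if w in word_index]
        for t in texts
    ]


toy_texts = ["the disk is full", "the drive is the problem"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index)
```

Frequent words get small ids, which is why common words such as "i" show up with low indices in the real index.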

Invert `word_index` into an `idx2word` lookup (index -> word).

In [11]:
# .items() works on both Python 2 and 3 (iteritems is Python 2 only)
train_idx2word = {v: k for k, v in train_word_index.items()}

Let's take a look at the first post, both as a list of indices and reconstructed from the indices.

In [12]:
', '.join(map(str, train_sequences[0]))

'6, 47, 1529, 37, 84, 69, 963, 110, 2, 676, 445, 832, 1268, 1135, 198, 72, 445, 832, 8, 736, 450, 7, 6, 95, 189, 3, 28, 3, 1203, 5, 171, 69, 62, 133, 50862, 12, 8, 970, 7537, 4, 117, 1270, 4, 1268, 7, 84, 94, 3755, 18, 109, 236, 26, 542, 29, 206, 244, 117, 69, 4, 134, 176, 213, 199, 16359, 18501, 15497, 14450, 10736, 2404, 144, 35, 15644, 11379, 9545, 2404, 144'

In [13]:
train_idx2word[6]

'i'

In [14]:
' '.join([train_idx2word[o] for o in train_sequences[0]])

"i was wondering if any one knew how the various hard drive compression utilities work my hard drive is getting full and i don't want to have to buy a new one what i'm intrested in is speed ease of use amount of compression and any other aspect you think might be important as i've never use one of these things before thanks morgan bullard mb4008 coewl cen uiuc edu or mjbb uxa cso uiuc edu"

I'm going to skip reducing the vocab size for now...

Distribution of post lengths (in tokens):

In [15]:
import numpy as np

# a list comprehension behaves the same on Python 2 and 3 (map is lazy on 3)
lens = np.array([len(seq) for seq in train_sequences])
(lens.max(), lens.min(), lens.mean())

(16306, 0, 289.51863354037266)

Weird that there are posts with 0 tokens in them... My guess is that these are posts left empty once the headers, footers and quotes were stripped by `remove`.
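
To see how many sequences are empty and how many would be cut by a given length cap, a small helper like the one below could be used. This is a sketch on made-up toy sequences, not output from the real data; `length_summary` and the `cutoff` parameter are hypothetical names introduced here.

```python
def length_summary(sequences, cutoff):
    """Count empty sequences, sequences longer than cutoff, and the median length."""
    lengths = [len(s) for s in sequences]
    return {
        "empty": sum(1 for n in lengths if n == 0),
        "over_cutoff": sum(1 for n in lengths if n > cutoff),
        # upper median: the middle element of the sorted lengths
        "median": sorted(lengths)[len(lengths) // 2],
    }


# Toy data standing in for train_sequences
toy_sequences = [[6, 47, 15], [], [2, 8], [1] * 12, [3, 3, 3, 3]]
summary = length_summary(toy_sequences, cutoff=10)
```

On the real data the call would presumably be `length_summary(train_sequences, 500)`, which helps judge whether a cap of 500 throws away much text.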

In [16]:
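# Reuse the tokenizer fitted on the training data so that test indices refer
# to the same words as the training indices. A tokenizer fitted separately on
# the test set would assign different ids to the same words.
# Words never seen in training are dropped.
unidecoded_x_test = [unidecode(text) for text in x_test.data]
test_sequences = train_tokenizer.texts_to_sequences(unidecoded_x_test)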
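
One thing worth keeping in mind is that the integer ids only mean something relative to the word index that produced them, which matters whenever train and test text are encoded. The toy sketch below (with a hypothetical `frequency_index` helper and invented example sentences) shows two indexes built from different corpora disagreeing about the same word.

```python
from collections import Counter


def frequency_index(texts):
    """Toy frequency-ranked word index; ties broken alphabetically."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ranked, start=1)}


# Invented sentences, only for illustration
train_texts = ["god is good", "god is great"]
test_texts = ["windows is slow", "windows crashed"]

index_from_train = frequency_index(train_texts)
index_from_test = frequency_index(test_texts)

# The same word gets a different id depending on which corpus built the index,
# and the same id points at different words.
same_word_ids = (index_from_train["is"], index_from_test["is"])
```

A model trained on ids from one index would see effectively scrambled input if fed ids from the other, so a single shared index is the safer choice.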

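Pad (with zeros) or truncate each post to a consistent length. By default `pad_sequences` pads and truncates at the front, so short posts get leading zeros and long posts keep their last `seq_len` tokens.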

In [17]:
from keras.preprocessing import sequence

seq_len = 500

trn = sequence.pad_sequences(train_sequences, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test_sequences, maxlen=seq_len, value=0)

In [18]:
trn[:10]

array([[    0,     0,     0, ...,  9545,  2404,   144],
       [    0,     0,     0, ...,  9485,    16,   546],
       [   26,   104,  8052, ...,   163,   490,   380],
       ..., 
       [    0,     0,     0, ...,   104, 47366,   103],
       [    0,     0,     0, ...,    26,  1263, 12586],
       [    0,     0,     0, ...,  5867,  5785,  2465]], dtype=int32)

In [19]:
trn.shape

(2254, 500)

In [20]:
test.shape

(1500, 500)
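
The padding behaviour that produced the leading zeros can be sketched in plain Python. This is a minimal illustration of front padding and front truncation for a single sequence, not the Keras implementation; `pad_or_truncate` is a hypothetical helper name.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad seq with value up to maxlen, or keep only its last maxlen ids."""
    seq = list(seq)
    if len(seq) >= maxlen:
        # front truncation: drop ids from the start, keep the tail
        return seq[len(seq) - maxlen:]
    # front padding: fill the start with the padding value
    return [value] * (maxlen - len(seq)) + seq


short = pad_or_truncate([9545, 2404, 144], maxlen=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
empty = pad_or_truncate([], maxlen=5)
```

Note that an empty post becomes a row of nothing but padding, which is worth remembering given the zero-length posts found earlier.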

## Create simple models