SUMMARY
---

In this mini-project we tackle a single-label, multiclass classification problem. The dataset we are using is the Reuters dataset, which is already included in the Keras package. It is a set of short newswires and their topics, published by Reuters in 1986. Each newswire belongs to one of 46 different topics.
It's a multiclass problem because there are 46 different topics, yet single-label because each data point is assigned exactly one of those 46 topics.

This problem is almost the same as the binary classification one. The differences are as follows:

* What sigmoid is for binary classification, softmax is for multiclass classification.
* Since the output space has 46 dimensions, we should not introduce an information bottleneck. A bottleneck arises when an earlier layer has fewer dimensions (aka units) than the final output layer. Such a layer compresses away information that could help to separate the 46 classes, later layers cannot recover it, and that is bad juju!
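
To make the sigmoid/softmax analogy from the list above concrete, here is a minimal NumPy sketch. The helper names `softmax` and `sigmoid`, the logit value `1.7` and the random logits are illustrative assumptions, not part of the notebook. It shows that softmax over two logits reduces to a sigmoid, and that softmax over 46 logits gives one probability per topic.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two classes: softmax over the logits [z, 0] gives the same probability as sigmoid(z).
z = 1.7  # arbitrary example logit
two_class = softmax(np.array([z, 0.0]))

# 46 classes: one probability per topic, summing to 1.
rng = np.random.default_rng(0)  # made-up logits, only for illustration
probs = softmax(rng.normal(size=46))
predicted_topic = int(np.argmax(probs))
```

In a Keras model this would correspond to a final layer with 46 units and a softmax activation, with hidden layers kept at least that wide to avoid the bottleneck.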

In [18]:
#Importing the dataset from Keras dataset library
from keras.datasets import reuters

#Importing libraries for data processing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Importing libraries for everything that deals with modeling
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.losses import categorical_crossentropy
from keras.metrics import accuracy

#For making ticks integer values
from matplotlib.ticker import MaxNLocator

## Importing the training and test parts of the set

Again, we are going to stick to the 10,000 most frequently used words to keep learning feasible.

In [19]:
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words = 10000)
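
To illustrate what restricting the vocabulary means, here is a small sketch. It assumes (to the best of my knowledge of the loader's defaults) that any word index at or above `num_words` is replaced by the "unknown" marker 2. The helper `cap_vocabulary` and the sample sequence are hypothetical, not part of Keras or of the dataset.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    # Hypothetical helper: keep indices below num_words, replace the rest
    # with the out-of-vocabulary marker (assumed to be 2).
    return [i if i < num_words else oov_char for i in sequence]

# Made-up encoded newswire containing two rare words (12045 and 10000).
sample = [1, 56, 12045, 9999, 10000]
capped = cap_vocabulary(sample, num_words=10000)
print(capped)  # [1, 56, 2, 9999, 2]
```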

In [21]:
#Checks to see if we imported data properly

train_data.shape, test_data.shape, train_labels.shape, test_labels.shape

((8982,), (2246,), (8982,), (2246,))

There are 8982 newswires in the training set and 2246 in the test set, which is close to an 80%/20% split. Next we will decode one of the newswires and explore all the unique topics that the newswires are associated with.
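
The split ratio can be checked with a couple of lines of arithmetic, using the set sizes reported above:

```python
n_train, n_test = 8982, 2246  # sizes reported by the shape check above
total = n_train + n_test

train_fraction = n_train / total
test_fraction = n_test / total
print(f"train: {train_fraction:.1%}, test: {test_fraction:.1%}")  # train: 80.0%, test: 20.0%
```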

## Decoding the first newswire

In [31]:
word_index = reuters.get_word_index()  # {word: index}; a lower index means a more frequent word
reverse_word_index = {index: word for word, index in word_index.items()}  # {index: word}

# Let's decode the first newswire in our training dataset.
# Indices are offset by 3 because 0, 1 and 2 are reserved for
# "padding", "start of sequence" and "unknown".
decoded_review_1 = ' '.join(reverse_word_index.get(i - 3, '?')
                            for i in train_data[0])
decoded_review_1

'? ? ? said as a result of its december acquisition of space co it expects earnings per share in 1987 of 1 15 to 1 30 dlrs per share up from 70 cts in 1986 the company said pretax net should rise to nine to 10 mln dlrs from six mln dlrs in 1986 and rental operation revenues to 19 to 22 mln dlrs from 12 5 mln dlrs it said cash flow per share this year should be 2 50 to three dlrs reuter 3'
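
The `i-3` offset in the decoding step is easier to see on a toy example. The sketch below uses a made-up three-word index and a made-up encoded sequence. It assumes the usual Keras convention that 1 marks the start of a sequence, 2 stands for an unknown word, and a word with index k is stored as k + 3.

```python
# Toy word index in the same {word: index} shape as reuters.get_word_index().
toy_word_index = {'the': 1, 'of': 2, 'said': 3}
toy_reverse_index = {index: word for word, index in toy_word_index.items()}

# Made-up encoded text: start marker, 'said', 'the', unknown word, 'of'.
toy_sequence = [1, 6, 4, 2, 5]

# Reserved markers have no entry after subtracting 3, so they decode to '?'.
decoded = ' '.join(toy_reverse_index.get(i - 3, '?') for i in toy_sequence)
print(decoded)  # ? said the ? of
```

This is why the decoded newswire above starts with question marks: the reserved markers have no word to map back to.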

## Decoding topics

In [38]:
type(train_labels), np.unique(train_labels)

(numpy.ndarray,
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]))

A naive approach, which forgets that the word index covers the words in the newswires, not the topic labels:

In [42]:
list_of_topics = []
for idx in np.unique(train_labels):
    list_of_topics.append(reverse_word_index.get(idx))
    
#list_of_topics

Here are the original ones, scraped from the web:

In [43]:
reuters_topics = ['cocoa','grain','veg-oil','earn','acq','wheat','copper','housing','money-supply',
                  'coffee','sugar','trade','reserves','ship','cotton','carcass','crude','nat-gas',
                  'cpi','money-fx','interest','gnp','meal-feed','alum','oilseed','gold','tin',
                  'strategic-metal','livestock','retail','ipi','iron-steel','rubber','heat','jobs',
                  'lei','bop','zinc','orange','pet-chem','dlr','gas','silver','wpi','hog','lead']

Reference: https://github.com/SteffenBauer/KerasTools/blob/master/KerasTools/datasets/decode.py
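
With the topic names in hand, integer labels can be turned into readable topics by plain list indexing. This is a small sketch, assuming the list order matches the label indices as in the referenced decoder. The `sample_labels` values are made up for illustration and are not taken from the dataset.

```python
reuters_topics = ['cocoa','grain','veg-oil','earn','acq','wheat','copper','housing','money-supply',
                  'coffee','sugar','trade','reserves','ship','cotton','carcass','crude','nat-gas',
                  'cpi','money-fx','interest','gnp','meal-feed','alum','oilseed','gold','tin',
                  'strategic-metal','livestock','retail','ipi','iron-steel','rubber','heat','jobs',
                  'lei','bop','zinc','orange','pet-chem','dlr','gas','silver','wpi','hog','lead']

# Hypothetical labels; in the notebook these would come from train_labels.
sample_labels = [3, 4, 16]
sample_topics = [reuters_topics[label] for label in sample_labels]
print(sample_topics)  # ['earn', 'acq', 'crude']
```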