## Reuters Newswire Dataset 
<br>
A collection of newswire articles assembled for text classification. A full description of the dataset is available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection). We load the data into a Jupyter notebook with Keras.

In [14]:
import numpy as np
import pandas as pd

In [42]:
import numpy as np
import pandas as pd
from keras.datasets import reuters 

max_words = 10000 #top 10000 most common words

(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=max_words)
print('Number of training examples: ', X_train.shape[0])
print('Number of test examples: ', X_test.shape[0])

print('Example of training data: ', X_train[1])
print('Example of training label: ', y_train[1])

Number of training examples:  8982
Number of test examples:  2246
Example of training data:  [1, 3267, 699, 3434, 2295, 56, 2, 7511, 9, 56, 3906, 1073, 81, 5, 1198, 57, 366, 737, 132, 20, 4093, 7, 2, 49, 2295, 2, 1037, 3267, 699, 3434, 8, 7, 10, 241, 16, 855, 129, 231, 783, 5, 4, 587, 2295, 2, 2, 775, 7, 48, 34, 191, 44, 35, 1795, 505, 17, 12]
Example of training label:  4


In [43]:
# peek at dataset 
from collections import Counter

print('Number of examples for each topic label: ', Counter(y_train))

Number of examples for each topic label:  Counter({3: 3159, 4: 1949, 19: 549, 16: 444, 1: 432, 11: 390, 20: 269, 13: 172, 8: 139, 10: 124, 9: 101, 21: 100, 25: 92, 2: 74, 18: 66, 24: 62, 0: 55, 34: 50, 12: 49, 36: 49, 6: 48, 28: 48, 30: 45, 23: 41, 17: 39, 31: 39, 40: 36, 32: 32, 41: 30, 14: 26, 26: 24, 39: 24, 43: 21, 15: 20, 29: 19, 37: 19, 38: 19, 45: 18, 5: 17, 7: 16, 22: 15, 27: 15, 42: 13, 44: 12, 33: 11, 35: 10})


### 1. Data Preprocessing 

All observations in the training dataset are lists of word indices.
<br>
#### Decode Training Data 

In [44]:

def decode_newswire(example):
    """
        Args:
            example: list of word indices
        Returns:
            String of the decoded words, joined by spaces
    """
    word_to_index = reuters.get_word_index()
    # invert the mapping: index -> word
    index_to_word = {index: word for word, index in word_to_index.items()}
    # indices are offset by 3: 0, 1, and 2 are reserved for padding,
    # start of sequence, and unknown words
    words = [index_to_word.get(i - 3, 'UNK') for i in example]
    return ' '.join(words)


# print one example newswire
decode_newswire(X_train[4])

'UNK seton co said its board has received a proposal from chairman and chief executive officer philip d UNK to acquire seton for 15 75 dlrs per share in cash seton said the acquisition bid is subject to UNK arranging the necessary financing it said he intends to ask other members of senior management to participate the company said UNK owns 30 pct of seton stock and other management members another 7 5 pct seton said it has formed an independent board committee to consider the offer and has deferred the annual meeting it had scheduled for march 31 reuter 3'

#### Construct Binary Input Data with N = max_words

In [45]:
def construct_binary_input(X):
    """Multi-hot encode each sequence: column j is 1 if word index j occurs in it."""
    encoded = np.zeros((len(X), max_words))
    for i, sequence in enumerate(X):
        encoded[i, sequence] = 1  # set every listed word index at once
    return encoded

X_train = construct_binary_input(X_train)
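
To make the encoding concrete, here is a minimal sketch of the same multi-hot idea on a toy input. The helper name `multi_hot`, the 6-word vocabulary, and the two toy sequences are made up for illustration and are not part of the notebook.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Encode each list of word indices as a 0/1 vector of length `dimension`."""
    encoded = np.zeros((len(sequences), dimension))
    for row, sequence in enumerate(sequences):
        # fancy indexing sets every listed column at once; repeated indices are harmless
        encoded[row, sequence] = 1.0
    return encoded

# Hypothetical toy input: two "newswires" over a 6-word vocabulary
toy_sequences = [[1, 3, 3, 5], [0, 2]]
toy_encoded = multi_hot(toy_sequences, dimension=6)
print(toy_encoded)
```

The first row has ones in columns 1, 3, and 5, and the second in columns 0 and 2. Word order and repeat counts are discarded, so every newswire becomes a fixed-length vector regardless of its original length.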

#### One-Hot Encode Labels

In [46]:
y_train = pd.get_dummies(y_train).values
#y_train[:5]

### 2. Construct Neural Network 



In [47]:

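def softmax(A):
    # subtract the row-wise max to prevent overflow in exp
    exps = np.exp(A - np.max(A, axis=-1, keepdims=True))
    # normalize each row (one example) so its probabilities sum to 1
    return exps / np.sum(exps, axis=-1, keepdims=True)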

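def cross_entropy_loss(model_output, target):
    # clip predictions away from 0 and 1 so np.log never receives 0
    model_output = np.clip(model_output, 1e-12, 1 - 1e-12)
    ce = -np.sum(target * np.log(model_output) + (1 - target) * np.log(1 - model_output))
    return ce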
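
The loss above sums a binary cross-entropy term for every class. For single-label problems with softmax outputs, a common alternative is categorical cross-entropy, which only scores the probability assigned to the true class. The sketch below illustrates that alternative on a toy batch; the names `softmax_rows` and `categorical_cross_entropy`, the `eps` value, and the toy arrays are assumptions for illustration, not the notebook's own code.

```python
import numpy as np

def softmax_rows(scores):
    """Softmax applied independently to each row of a 2-D score matrix."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)  # guard against overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

def categorical_cross_entropy(probabilities, one_hot_targets, eps=1e-12):
    """Mean over examples of -log(probability assigned to the true class)."""
    clipped = np.clip(probabilities, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(one_hot_targets * np.log(clipped), axis=1))

# Hypothetical toy batch: 2 examples, 3 classes
toy_scores = np.array([[2.0, 1.0, 0.1], [1000.0, 1000.0, 1000.0]])
toy_targets = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

toy_probs = softmax_rows(toy_scores)
toy_loss = categorical_cross_entropy(toy_probs, toy_targets)
print(toy_probs.sum(axis=1))  # each row should sum to 1
print(toy_loss)
```

The second row uses very large scores on purpose: without the max subtraction `np.exp(1000.0)` would overflow, whereas here the row comes out as a uniform distribution.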

In [48]:

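def forward_propagation(X, W):
    # linear layer followed by ReLU
    Z = np.maximum(np.dot(X, W), 0)
    return Z

def back_propagation(delta, X, Z):
    # ReLU passes gradient only where its output Z was positive
    delta = delta * (Z > 0)
    # gradient of the loss with respect to W, same shape as W
    gradient = np.dot(X.T, delta)
    return gradient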
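
One way to sanity-check a backward pass for a ReLU layer is to compare it against finite differences. The sketch below does this for a single layer under an assumed squared-error loss. The helper names, the loss, and the random toy data are all hypothetical and chosen only for the check; they are not the notebook's training setup.

```python
import numpy as np

def relu_layer(X, W):
    """Forward pass of a single dense layer with ReLU activation."""
    return np.maximum(X @ W, 0)

def relu_layer_gradient(delta, X, Z):
    """Gradient of the loss w.r.t. W, given delta = dLoss/dZ and the layer output Z."""
    return X.T @ (delta * (Z > 0))  # gradient flows only where ReLU was active

def squared_error(X, W, T):
    """Assumed loss for this check: 0.5 * sum of squared differences."""
    return 0.5 * np.sum((relu_layer(X, W) - T) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # 5 examples, 4 features
W = rng.normal(size=(4, 3))   # 4 features, 3 outputs
T = rng.normal(size=(5, 3))   # arbitrary targets

Z = relu_layer(X, W)
analytic = relu_layer_gradient(Z - T, X, Z)  # dLoss/dZ = Z - T for this loss

# central finite differences, one weight at a time
numeric = np.zeros_like(W)
step = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        bump = np.zeros_like(W)
        bump[i, j] = step
        numeric[i, j] = (squared_error(X, W + bump, T)
                         - squared_error(X, W - bump, T)) / (2 * step)

max_gap = np.max(np.abs(analytic - numeric))
print(max_gap)
```

A very small `max_gap` suggests the analytic gradient matches the numerical one. Note that the gradient has the same shape as `W`, and that the ReLU mask is applied to `delta` before the matrix product.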



In [49]:
X_train.shape

(8982, 10000)