## Tutorial on Softmax Regression using TensorFlow

In this section, we will implement Softmax Regression using TensorFlow.
Softmax Regression is a multinomial classifier: it assigns each data point to one of several classes.
Given a data point $\mathbf{x} \in \mathbb{R}^{d}$, the hypothesis of Softmax Regression is defined as follows:

$$ H(\mathbf{x}) = \mathbf{Wx + b}$$

where $\mathbf{W} \in \mathbb{R}^{C \times d}$ is a weight matrix and $\mathbf{b} \in \mathbb{R}^{C}$ is a bias vector, both to be learned; $d$ is the number of features and $C$ is the number of classes.
The returned value is a $C$-dimensional vector $\mathbf{y} = H(\mathbf{x})$, called the signal of the data point $\mathbf{x}$.
To turn the signal into a probability distribution over the classes, we pass it through the softmax function, defined as follows:

$$ S(y_{i}) = \frac{e^{y_{i}}}{\sum_{j}e^{y_j}}$$
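
As a quick check of the softmax definition above, here is a minimal sketch in plain Python (no TensorFlow). The scores are made-up values for a three-class problem, not taken from the tutorial's data, and subtracting the maximum score is an added numerical-stability step that leaves the result unchanged.

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow in exp.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical signal for one data point (illustrative values only).
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # the probabilities sum to 1 (up to rounding)
```

The largest score receives the largest probability, and the ordering of the scores is preserved.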

The loss function of Softmax Regression for a signal is represented as follows:

$$ D(S, L) = -\sum_{i}L_{i}\log(S_i)$$

where $S$ is the softmax vector and $L$ is the one-hot encoded label vector.
This loss is called the cross-entropy loss (a generalization of the logistic loss to more than two classes).
The final loss for a dataset of $N$ data points is the average of the per-point losses:

$$\mathcal{L}(X) = \frac{1}{N}\sum_{i=1}^{N} D(S(H(X_{i})), L_{i})$$
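
The cross-entropy and its dataset average can be worked through by hand. The sketch below is plain Python and uses hypothetical signals and labels (two data points, three classes); the helper names `one_hot`, `cross_entropy`, and `mean_loss` are illustrative and are not part of TensorFlow.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes):
    """Encode a class index as a one-hot label vector L."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def cross_entropy(probs, target):
    """D(S, L) = -sum_i L_i * log(S_i); terms with L_i = 0 contribute nothing."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

def mean_loss(signal_rows, labels, num_classes):
    """Average the per-point cross-entropy over the dataset."""
    losses = [cross_entropy(softmax(h), one_hot(y, num_classes))
              for h, y in zip(signal_rows, labels)]
    return sum(losses) / len(losses)

# Hypothetical signals H(X_i) and class indices (illustrative values only).
signal_rows = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
labels = [0, 2]
loss = mean_loss(signal_rows, labels, num_classes=3)
print(loss)
```

Because the label vector is one-hot, each per-point loss reduces to the negative log-probability of the true class. The second data point has a uniform signal, so its loss is $\log 3$.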

The following code implements Softmax Regression on a small toy dataset using the TensorFlow 1.x graph API (`tf.placeholder`, `tf.Session`).

In [28]:
import tensorflow as tf
import numpy as np
tf.set_random_seed(777)  # for reproducibility

x_data = [[1, 2, 1, 1],
          [2, 1, 3, 2],
          [3, 1, 3, 4],
          [4, 1, 5, 5],
          [1, 7, 5, 5],
          [1, 2, 5, 6],
          [1, 6, 6, 6],
          [1, 7, 7, 7]]
y_data = [[0, 0, 1],
          [0, 0, 1],
          [0, 0, 1],
          [0, 1, 0],
          [0, 1, 0],
          [0, 1, 0],
          [1, 0, 0],
          [1, 0, 0]]


# set values
N = len(x_data) # number of instances
F = 4 # number of features
C = 3 # number of classes

X = tf.placeholder(tf.float32, shape=[None, F])
Y = tf.placeholder(tf.int32, shape=[None, C])
W = tf.Variable(tf.random_normal([F, C]), name='weight')
b = tf.Variable(tf.random_normal([C]), name='bias')

# set hypothesis
H = tf.matmul(X, W) + b # hypothesis
S = tf.nn.softmax(H) # softmax probabilities of the signal H

# set loss function
loss_i = tf.nn.softmax_cross_entropy_with_logits(logits=H, labels=Y)
loss = tf.reduce_mean(loss_i) 
#loss = tf.reduce_mean(-tf.reduce_sum(tf.cast(Y, tf.float32) * tf.log(S), axis=1)) # equivalent to the above; reduce_mean already divides by N

# set train
train = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

# training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2001):
        loss_val, _ = sess.run([loss, train], feed_dict={X:x_data, Y:y_data})
        
        if i % 100 == 0:
            print(i, loss_val)

    # testing
    s_val = sess.run(S, feed_dict={X: [[1, 11, 7, 9],
                                        [1, 3, 4, 3]]})
    c = np.argmax(s_val, axis=1) # s_val is already a NumPy array, so no extra graph op is needed
    print(c)

0 4.63772
100 0.64858
200 0.568961
300 0.511829
400 0.461085
500 0.412936
600 0.36592
700 0.319551
800 0.27495
900 0.242613
1000 0.229299
1100 0.218094
1200 0.207919
1300 0.198633
1400 0.190123
1500 0.182293
1600 0.175065
1700 0.168373
1800 0.16216
1900 0.156376
2000 0.150978
[1 0]
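
To make the training step less opaque, here is a minimal sketch of batch gradient descent for Softmax Regression in plain Python, run on the same toy data. It is not what TensorFlow executes internally: TensorFlow derives gradients automatically, whereas this sketch hard-codes the standard closed-form gradient of the cross-entropy with respect to the signal, $S - L$. The zero initialization, the learning rate of 0.01, the 200 steps, and the function name `train_softmax_regression` are assumptions made for illustration, so the numbers will not match the output printed above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_softmax_regression(x_rows, labels, num_classes, learning_rate, steps):
    """Batch gradient descent on the mean cross-entropy loss."""
    n, f = len(x_rows), len(x_rows[0])
    W = [[0.0] * num_classes for _ in range(f)]  # shape [F, C], as in the TensorFlow code
    b = [0.0] * num_classes
    history = []  # loss measured before each update
    for _ in range(steps):
        grad_W = [[0.0] * num_classes for _ in range(f)]
        grad_b = [0.0] * num_classes
        loss = 0.0
        for x, y in zip(x_rows, labels):
            signal = [sum(x[j] * W[j][k] for j in range(f)) + b[k]
                      for k in range(num_classes)]
            probs = softmax(signal)
            loss -= math.log(probs[y]) / n
            for k in range(num_classes):
                # d(loss)/d(signal_k) = (S_k - L_k) / N
                delta = (probs[k] - (1.0 if k == y else 0.0)) / n
                grad_b[k] += delta
                for j in range(f):
                    grad_W[j][k] += x[j] * delta
        history.append(loss)
        for k in range(num_classes):
            b[k] -= learning_rate * grad_b[k]
            for j in range(f):
                W[j][k] -= learning_rate * grad_W[j][k]
    return W, b, history

x_data = [[1, 2, 1, 1], [2, 1, 3, 2], [3, 1, 3, 4], [4, 1, 5, 5],
          [1, 7, 5, 5], [1, 2, 5, 6], [1, 6, 6, 6], [1, 7, 7, 7]]
labels = [2, 2, 2, 1, 1, 1, 0, 0]  # class indices of the one-hot rows in y_data

W, b, history = train_softmax_regression(x_data, labels, num_classes=3,
                                         learning_rate=0.01, steps=200)
print(history[0], history[-1])  # the loss starts at log(3) and decreases
```

With all-zero parameters every class gets probability 1/3, which is why the first recorded loss equals $\log 3$.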


## Perform Softmax Regression with a Real-world Dataset

In [34]:
import numpy as np

xy = np.loadtxt('datasets/data-04-zoo.csv', delimiter=',', dtype=np.float32)
x_data = xy[:, 0:-1]
y_data = xy[:, [-1]] # in this case, y_data is not one-hot encoded

nb_classes = 7

# set values
N = len(x_data) # number of instances
F = 16 # number of features
C = 7 # number of classes

X = tf.placeholder(tf.float32, shape=[None, F])
Y = tf.placeholder(tf.int32, [None, 1])
Y_one_hot = tf.one_hot(Y, C) # shape=(?, 1, 7): one_hot appends a new depth axis (see the tf.one_hot reference)
Y_one_hot = tf.reshape(Y_one_hot, [-1, C]) #shape=(?, 7)

W = tf.Variable(tf.random_normal([F, C]), name='weight')
b = tf.Variable(tf.random_normal([C]), name='bias')

logits = tf.matmul(X, W) + b
H = tf.nn.softmax(logits)

loss_i = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y_one_hot)
loss = tf.reduce_mean(loss_i)

# set train
train = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

prediction = tf.argmax(H, 1)
correct_prediction = tf.equal(prediction, tf.argmax(Y_one_hot, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2001):
        loss_val, acc, _ = sess.run([loss, accuracy, train], feed_dict={X:x_data, Y:y_data})
        if i % 100 == 0:
            print(i, loss_val, acc)
            
    pred = sess.run(prediction, feed_dict={X: x_data})
    for p, y in zip(pred, y_data.flatten()):
        print("[{}] Prediction: {} True Y: {}".format(p==int(y), p, int(y)))

0 11.4009 0.19802
100 0.896816 0.712871
200 0.595756 0.831683
300 0.442951 0.861386
400 0.345247 0.871287
500 0.277158 0.910891
600 0.228423 0.930693
700 0.192996 0.950495
800 0.166649 0.960396
900 0.146478 0.960396
1000 0.130581 0.990099
1100 0.117726 0.990099
1200 0.10711 0.990099
1300 0.0981921 0.990099
1400 0.0905944 0.990099
1500 0.0840464 1.0
1600 0.0783481 1.0
1700 0.0733475 1.0
1800 0.0689274 1.0
1900 0.064995 1.0
2000 0.0614763 1.0
[True] Prediction: 0 True Y: 0
[True] Prediction: 0 True Y: 0
[True] Prediction: 3 True Y: 3
[True] Prediction: 0 True Y: 0
[True] Prediction: 0 True Y: 0
[True] Prediction: 0 True Y: 0
[True] Prediction: 0 True Y: 0
[True] Prediction: 3 True Y: 3
[True] Prediction: 3 True Y: 3
[True] Prediction: 0 True Y: 0
[True] Prediction: 0 True Y: 0
[True] Prediction: 1 True Y: 1
[True] Prediction: 3 True Y: 3
[True] Prediction: 6 True Y: 6
[True] Prediction: 6 True Y: 6
[True] Prediction: 6 True Y: 6
[True] Prediction: 1 True Y: 1
[True] Prediction: 0 True Y: