# Feature Distillation in Neural Networks
## AKA Learning Using Privileged Information in Neural Networks

This came about because I very much misunderstood model distillation. Joe clarified where I was going wrong and helped solidify some of the ideas. I was then going to throw the idea away, assuming it had been done before, but then Chris asked if anyone had any ideas for using privileged information in neural networks, which this approach gives you. Talking it over with Chris and Joe, we came to think this might be useful. However, I'm pretty sure it's similar to transfer learning. Maybe I need to do more reading.

## What this would look like
There are 2 steps to this. Firstly, train a network on all $m$ training examples using both the $n$ standard features and the $k$ privileged features, $X_{1 \dots m}\{x_1, x_2, \dots, x_n\}, x \in \mathcal{X}$ and $X^*_{1 \dots m}\{x^*_1, x^*_2, \dots, x^*_k\}, x^* \in \mathcal{X}^*$.

This should produce a network like that in the image below.
![Prior to distillation](Images/Model Prior To Feature Distillation.png "Model Prior To Feature Distillation")
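
As a rough sketch of what the big model's input looks like, the two feature blocks are simply placed side by side, one row per training example. The sizes and the random data here are hypothetical and only there to show the shapes.

```python
import numpy as np

m, n, k = 5, 3, 4  # hypothetical sizes: 5 rows, 3 standard and 4 privileged features

rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))       # standard features, available at deployment
X_star = rng.normal(size=(m, k))  # privileged features, training time only

# The big model sees both blocks side by side, one row per training example.
X_all = np.hstack((X, X_star))

print(X_all.shape)  # (5, 7)
```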

To 'distil' the above, we essentially copy the inputs to the output layer, but we do this in a kind of backwards way. Once we've learned the big model, we save the weights of its output layer, $w$, and the outputs it generates on the training data, $h$, i.e. $\langle h , w \rangle$. We then build a second network that only takes the standard features, give it the saved output layer with its weights fixed, and train it to reproduce $h$. With the weights fixed and the output set, the only way the network can achieve this output is to learn the inputs to the output layer, i.e. the activations of the penultimate layer. See below.

![During distillation](Images/Model During Feature Distillation Alternative.png "Model During Feature Distillation Alternative")
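
One way to write this step down, as a sketch rather than a definitive formulation: let $g_\theta$ be the part of the second network below the output layer, and let $b$ be the output layer's bias. Both symbols are assumptions introduced here. With a tanh output unit and a squared-error loss, as used in the implementation, distillation then looks for

$$
\hat{\theta} = \arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \left( \tanh\left( w^\top g_\theta(x_i) + b \right) - h_i \right)^2
$$

where $w$ and $b$ are held fixed at the values learned by the big model, $x_i$ contains only the standard features, and $h_i$ is the big model's output for training example $i$.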

After distillation we have a model that can be used with the reduced number of inputs, i.e. the standard features only. See below.

![Post distillation](Images/Model Post Feature Distillation.png "Model Post Feature Distillation")

## Implementation
We're going to use Keras, as it makes this kind of prototyping easy.

In [1]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import optimizers
from keras.callbacks import EarlyStopping
from sklearn.preprocessing import normalize
import numpy as np

Using TensorFlow backend.


Next we load the data. It's the TechTC data from Joe's LUPI paper, where the privileged information is the unselected features. The only features that were selected are those whose values differ in something like 95% of all rows, so there are a lot of unselected features.

In [2]:
EPOCHS = 100_000

In [3]:
train_x = np.loadtxt("data/fold_0/train_sel_inputs.txt")
train_s = np.loadtxt("data/fold_0/train_unsel_inputs.txt")
train_xs = np.hstack((train_x, train_s))
train_y = np.loadtxt("data/fold_0/train_labels.txt")

valid_x = np.loadtxt("data/fold_0/test_sel_inputs.txt")
valid_s = np.loadtxt("data/fold_0/test_unsel_inputs.txt")
valid_xs = np.hstack((valid_x, valid_s))
valid_y = np.loadtxt("data/fold_0/test_labels.txt")

test_x = np.loadtxt("data/test_sel_inputs.txt")
test_s = np.loadtxt("data/test_unsel_inputs.txt")
test_xs = np.hstack((test_x, test_s))
test_y = np.loadtxt("data/test_labels.txt")
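
The three loading blocks follow the same pattern, so they could be folded into a small helper. This is only a sketch: `load_split` is a hypothetical name, and it assumes every split uses the `<split>_sel_inputs.txt`, `<split>_unsel_inputs.txt` and `<split>_labels.txt` file names seen above.

```python
import os
import numpy as np

def load_split(directory, split):
    """Return selected (x), unselected (s), stacked (xs) features and labels (y)."""
    x = np.loadtxt(os.path.join(directory, f"{split}_sel_inputs.txt"))
    s = np.loadtxt(os.path.join(directory, f"{split}_unsel_inputs.txt"))
    y = np.loadtxt(os.path.join(directory, f"{split}_labels.txt"))
    # Stack the two feature blocks column-wise, as done for train_xs above.
    return x, s, np.hstack((x, s)), y
```

With this, the training data would be loaded as `train_x, train_s, train_xs, train_y = load_split("data/fold_0", "train")`.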

### The BIG Model
This is the model that gets trained on all the data and will eventually be distilled.

In [4]:
all_data_model = Sequential()
all_data_model.add(Dense(200, input_dim=train_xs.shape[1]))#, activation='sigmoid'))
all_data_model.add(Dropout(0.25))
all_data_model.add(Dense(100))#, activation='sigmoid'))
all_data_model.add(Dropout(0.25))
all_data_model.add(Dense(10))#, activation='sigmoid'))
all_data_model.add(Dense(1, activation='tanh'))

In [5]:
all_data_model.compile(optimizer='sgd', loss='mean_squared_error', metrics=['binary_accuracy'])

In [6]:
all_data_model.fit(train_xs, train_y, batch_size=32, epochs=EPOCHS, verbose=0, validation_data=(valid_xs, valid_y))

<keras.callbacks.History at 0x1a1d000198>

In [7]:
score = all_data_model.evaluate(test_xs, test_y, verbose=1)
print("Loss:", score[0], "Acc:", score[1])

Loss: 1.19442424774 Acc: 0.675


We save the big model's outputs, as these are the targets needed for distilling the model.

In [8]:
pred_y = all_data_model.predict(train_xs)

valid_pred_y = all_data_model.predict(valid_xs)
#valid_pred_y[valid_pred_y < 1] = -1
# List (prediction, label) pairs where the rounded prediction disagrees with the label.
[(valid_pred_y[i], valid_y[i]) for i in range(len(valid_y)) if np.round(valid_pred_y[i]) != valid_y[i]]

[(array([ 1.], dtype=float32), -1.0),
 (array([-0.74228102], dtype=float32), 1.0),
 (array([ 0.99999911], dtype=float32), -1.0),
 (array([-0.99987549], dtype=float32), 1.0),
 (array([-1.], dtype=float32), 1.0),
 (array([ 1.], dtype=float32), -1.0),
 (array([ 1.], dtype=float32), -1.0),
 (array([-0.99986571], dtype=float32), 1.0)]
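
A vectorised version of this check needs a little care, because `predict` returns an array of shape `(m, 1)` while the labels have shape `(m,)`. The sketch below uses small hypothetical arrays in place of `valid_pred_y` and `valid_y`. It also compares signs rather than rounded values, which is an assumption about the intended decision rule: an output such as 0.3 then counts as class +1 instead of rounding to 0.

```python
import numpy as np

# Hypothetical stand-ins: predictions of shape (m, 1), labels of shape (m,).
preds = np.array([[0.9], [-0.7], [0.99], [-1.0]], dtype=np.float32)
labels = np.array([1.0, 1.0, -1.0, -1.0])

# Flatten first: comparing (m, 1) with (m,) would broadcast to an (m, m) grid.
flat = preds.ravel()

# Indices where the sign of the prediction disagrees with the -1/+1 label.
wrong = np.flatnonzero(np.sign(flat) != labels)

for i in wrong:
    print(i, flat[i], labels[i])
```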

### The Distilled Model

In [9]:
distilled_model = Sequential()
distilled_model.add(Dense(200, input_dim=train_x.shape[1]))
distilled_model.add(Dropout(0.25))
distilled_model.add(Dense(100))
distilled_model.add(Dropout(0.25))
distilled_model.add(Dense(10))
# Reuse the big model's output layer (its last layer) and freeze its weights.
distilled_model.add(all_data_model.layers[-1])
distilled_model.layers[-1].trainable = False
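
To make it concrete what freezing the output layer does during training, here is a minimal NumPy sketch of the same idea, not the Keras code itself. Everything in it is a hypothetical stand-in: the toy data, the matrix `A` playing the role of the trained big model, and the single linear layer `B` playing the role of the distilled network. Only `B` is updated, while `w` and `b` stay at the big model's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 standard and 2 privileged features per row.
X = rng.normal(size=(200, 3))
X_star = rng.normal(size=(200, 2))

# Stand-in for the trained big model: one linear hidden layer (A), then a
# tanh output unit with weights w and bias b.
A = 0.5 * rng.normal(size=(5, 4))
w = np.array([[0.5], [-0.5], [0.25], [0.75]])
b = 0.1
h = np.tanh(np.hstack((X, X_star)) @ A @ w + b)  # big model outputs, shape (200, 1)

def student_forward(B):
    # The student only sees X; w and b are reused from the big model.
    return np.tanh(X @ B @ w + b)

def mse(B):
    return float(np.mean((student_forward(B) - h) ** 2))

B = np.zeros((3, 4))  # the only trainable parameters
w_before = w.copy()
initial_loss = mse(B)

for _ in range(500):
    out = student_forward(B)
    # Gradient of the MSE with respect to the pre-activation of the output unit.
    delta = 2.0 * (out - h) * (1.0 - out ** 2) / len(X)
    # Backpropagate through the frozen w into B; w and b are never updated.
    B -= 0.05 * (X.T @ (delta @ w.T))

final_loss = mse(B)
print(initial_loss, final_loss)
```

Because the privileged features in this toy data are independent of the standard ones, the student cannot match the big model exactly. The loss only shows how close it gets using `X` alone.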

In [10]:
distilled_model.compile(optimizer='sgd', loss='mean_squared_error', metrics=['binary_accuracy'])

In [11]:
distilled_model.fit(train_x, pred_y, batch_size=32, epochs=EPOCHS, verbose=0, validation_data=(valid_x, valid_pred_y))

<keras.callbacks.History at 0x1a1d76b240>

In [12]:
score = distilled_model.evaluate(test_x, test_y)
print("Loss:", score[0], "Acc:", score[1])

Loss: 1.27045834064 Acc: 0.675
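
The accuracy here comes from Keras's `binary_accuracy`, whose exact treatment of predictions (rounding or thresholding) may depend on the Keras version, so it is worth checking how it behaves with -1/+1 labels. As a cross-check, a sign-based accuracy can be computed directly. This is a sketch with hypothetical data, and `sign_accuracy` is a name made up for it.

```python
import numpy as np

def sign_accuracy(y_true, y_pred):
    """Fraction of rows where the prediction has the same sign as the -1/+1 label."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return float(np.mean(np.sign(y_pred) == y_true))

labels = np.array([1.0, -1.0, 1.0, -1.0])
outputs = np.array([[0.3], [-0.9], [-0.2], [0.8]])
print(sign_accuracy(labels, outputs))  # 0.5
```

In the notebook this would be called as `sign_accuracy(test_y, distilled_model.predict(test_x))`.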


Let's just demonstrate that the last layer of both models is the same. Index 6 of `get_weights()` is the kernel of the final `Dense` layer.

In [13]:
all_data_model.get_weights()[6]

array([[-0.10814332],
       [-0.73108137],
       [-0.24628025],
       [-0.30259526],
       [-0.20805749],
       [ 0.05454459],
       [-0.6163854 ],
       [ 0.53691632],
       [ 0.46061108],
       [-0.52381009]], dtype=float32)

In [14]:
distilled_model.get_weights()[6]

array([[-0.10814332],
       [-0.73108137],
       [-0.24628025],
       [-0.30259526],
       [-0.20805749],
       [ 0.05454459],
       [-0.6163854 ],
       [ 0.53691632],
       [ 0.46061108],
       [-0.52381009]], dtype=float32)
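
Rather than comparing the two printouts by eye, the check can be done programmatically. The sketch below uses the first three rows of the printed kernels as literal arrays. In the notebook itself the two arguments would be `all_data_model.get_weights()[6]` and `distilled_model.get_weights()[6]`.

```python
import numpy as np

# First three rows of the two printed kernels, copied from the outputs above.
big_kernel = np.array([[-0.10814332], [-0.73108137], [-0.24628025]], dtype=np.float32)
distilled_kernel = np.array([[-0.10814332], [-0.73108137], [-0.24628025]], dtype=np.float32)

# Exact equality is expected here because the layer was frozen, not retrained.
same = np.array_equal(big_kernel, distilled_kernel)
print(same)  # True
```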

### Regular Model
A normal setup: the same architecture, trained from scratch on the selected features and the true labels.

In [15]:
regular_model = Sequential()
regular_model.add(Dense(200, input_dim=train_x.shape[1]))
regular_model.add(Dropout(0.25))
regular_model.add(Dense(100))
regular_model.add(Dropout(0.25))
regular_model.add(Dense(10))
regular_model.add(Dense(1, activation='tanh'))

In [16]:
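# Note: this baseline uses hinge loss, whereas the other models use mean squared error,
# so its loss value is not directly comparable with theirs.
regular_model.compile(optimizer='sgd', loss='hinge', metrics=['binary_accuracy'])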

In [17]:
regular_model.fit(train_x, train_y, batch_size=32, epochs=EPOCHS, verbose=0, validation_data=(valid_x, valid_y))

<keras.callbacks.History at 0x1a2075ac18>

In [18]:
score = regular_model.evaluate(test_x, test_y)
print("Loss:", score[0], "Acc:", score[1])

Loss: 0.720061767101 Acc: 0.625


Again, just to demonstrate that the last layer's weights are different in this model, since it was trained independently.

In [19]:
regular_model.get_weights()[6]

array([[-0.4607977 ],
       [ 0.59714705],
       [ 0.21520783],
       [ 0.48031554],
       [ 0.73372406],
       [ 0.60548145],
       [-0.47019443],
       [-0.75899076],
       [ 0.11630869],
       [ 0.72340775]], dtype=float32)

# The Interesting Bit
The thing that I find really interesting is what happens if we do it the other way around, so that we only have the useless information (the unselected features) at deployment time, not the useful data. And this method looks like it might work. Maybe...

### The BIG Model
This is the model that gets trained on all the data and will eventually be distilled.

In [20]:
all_data_model_inverse = Sequential()
all_data_model_inverse.add(Dense(200, input_dim=train_xs.shape[1]))
all_data_model_inverse.add(Dropout(0.25))
all_data_model_inverse.add(Dense(100))
all_data_model_inverse.add(Dropout(0.25))
all_data_model_inverse.add(Dense(10))
all_data_model_inverse.add(Dense(1, activation='tanh'))

In [21]:
all_data_model_inverse.compile(optimizer='sgd', loss='mean_squared_error', metrics=['binary_accuracy'])

In [22]:
all_data_model_inverse.fit(train_xs, train_y, batch_size=32, epochs=EPOCHS, verbose=0, validation_data=(valid_xs, valid_y))

<keras.callbacks.History at 0x1a23af18d0>

In [23]:
score = all_data_model_inverse.evaluate(test_xs, test_y)
print("Loss:", score[0], "Acc:", score[1])

Loss: 1.09190224409 Acc: 0.65


In [24]:
pred_y_inverse = all_data_model_inverse.predict(train_xs)

valid_pred_y_inverse = all_data_model_inverse.predict(valid_xs)
#valid_pred_y_inverse[valid_pred_y_inverse < 1] = -1

### The Distilled Model

In [25]:
distilled_model_inverse = Sequential()
distilled_model_inverse.add(Dense(200, input_dim=train_s.shape[1]))
distilled_model_inverse.add(Dropout(0.25))
distilled_model_inverse.add(Dense(100))
distilled_model_inverse.add(Dropout(0.25))
distilled_model_inverse.add(Dense(10))
# Reuse the big model's output layer (its last layer) and freeze its weights.
distilled_model_inverse.add(all_data_model_inverse.layers[-1])
distilled_model_inverse.layers[-1].trainable = False

In [26]:
distilled_model_inverse.compile(optimizer='sgd', loss='mean_squared_error', metrics=['binary_accuracy'])

In [27]:
distilled_model_inverse.fit(train_s, pred_y_inverse, batch_size=32, epochs=EPOCHS, verbose=0, validation_data=(valid_s, valid_pred_y_inverse))

<keras.callbacks.History at 0x1a2deaa940>

In [28]:
score = distilled_model_inverse.evaluate(test_s, test_y)
print("Loss:", score[0], "Acc:", score[1])

Loss: 0.899837994576 Acc: 0.6


Let's just demonstrate that the last layer of both models is the same.

In [29]:
all_data_model_inverse.get_weights()[6]

array([[-0.34592503],
       [-0.43121678],
       [ 0.01841524],
       [-0.03538501],
       [ 0.23351964],
       [ 0.31625515],
       [ 0.08129516],
       [ 0.24874951],
       [-0.59812289],
       [ 0.25800267]], dtype=float32)

In [30]:
distilled_model_inverse.get_weights()[6]

array([[-0.34592503],
       [-0.43121678],
       [ 0.01841524],
       [-0.03538501],
       [ 0.23351964],
       [ 0.31625515],
       [ 0.08129516],
       [ 0.24874951],
       [-0.59812289],
       [ 0.25800267]], dtype=float32)

### Regular Model
A normal setup: the same architecture, trained from scratch on the unselected features and the true labels.

In [31]:
regular_model_inverse = Sequential()
regular_model_inverse.add(Dense(200, input_dim=train_s.shape[1]))
regular_model_inverse.add(Dropout(0.25))
regular_model_inverse.add(Dense(100))
regular_model_inverse.add(Dropout(0.25))
regular_model_inverse.add(Dense(10))
regular_model_inverse.add(Dense(1, activation='tanh'))

In [32]:
regular_model_inverse.compile(optimizer='sgd', loss='mean_squared_error', metrics=['binary_accuracy'])

In [33]:
regular_model_inverse.fit(train_s, train_y, batch_size=32, epochs=EPOCHS, verbose=0, validation_data=(valid_s, valid_y))

<keras.callbacks.History at 0x1a28878ef0>

In [34]:
score = regular_model_inverse.evaluate(test_s, test_y)
print("Loss:", score[0], "Acc:", score[1])

Loss: 0.969038486481 Acc: 0.5


Again, just to demonstrate that the last layer's weights are different in this model, since it was trained independently.

In [35]:
regular_model_inverse.get_weights()[6]

array([[-0.42132613],
       [-0.64012021],
       [ 0.52992225],
       [-0.41942331],
       [-0.07013734],
       [ 0.798958  ],
       [ 0.3937121 ],
       [ 0.1461388 ],
       [ 0.43921226],
       [-0.04010696]], dtype=float32)