## Boltzmann Machine
Boltzmann machine's learns the internal concepts (not defined by the user) that help to explain the observed data. This is achieved by the hidden units that are connected to the input. Commonly used variant of Boltzmann Machine is the Restricted Boltzmann Machine's where the hidden units are not connected amongst themselves (this gives a strong performance boost) 

Boltzmann Machines try to model the distribution of the data using binary hidden neurons. Furthermore, Boltzmann Machines are Unsupervised Stochastic Neural Networks (stochastic means they have an element of probability in them). 

The basic idea of Boltzmann Machine is drawn from concept from Physics which is Energy. Every state is given an energy, and the most likely state will have the least energy. Boltzmann Machine uses Energy function to calculate the probability of each state. The one that has highest probability is the state chosen by the system. The learning is done by the weights that are given to the connections. 

Architecture of Boltzmann Machine is as follows:
1. One visible layer: Where the input goes (like the users' movie ratings) 
2. One hidden layer: Which calculates these latent features not defined by user
3. Bias: A way to adjust the ratings of movie 

Visible layers are connected to hidden layers, but the they are not connected amongst themselves. That is, no visible unit is connect to another visible unit, and no hidden unit is connected to another hidden unit. The bias, on the other hand is connected to both, hidden and visible unit. 


### Learning Process
The learning process of RBMs is using a process called **Contrastive Divergence** which is approximately same as the process of gradient descent. 

Intuitive Process:
1. The input at visible node is multiplied by the weights at visible node and passed through activation function which will give the probability of activation of a hidden node. 
2. Now based on the activation of hidden node, the probability of the activation of visible node is calculated which tries to model the same input that it received first. 
3. Perform random walk for k times. This is the Contrastive Divergence process. 
3. At the end of the k steps, if there's an error, for instance, the input[0] was 1 and the output is 0, the error is backpropagated and the weights, and biases are adjusted accordingly


Mathematically:
1. Activation of a hidden unit is given by: 
$$a = \sum_i{w_{ij}x_i + b}$$
and the probability is $$p(x) = \sigma(a)$$
where, $\sigma(x)$ is the sigmoid activation function

2. TODO: The derivation of energy function and Contrastive Divergence

Variation of RBN is the Deep Belief Network. Deep Belief Network is just a stacked RBN. Another variation is Deep Boltzmann Machine which is nearly the same as DBN but the difference is, in DBN all the layers except the last 2 are directed downwards, in DBM there are no such limitations
<img src="https://qph.ec.quoracdn.net/main-qimg-0f880856d4d1e886bda98ba2b292e6aa">

Useful links:
1. [Detailed algorithm paper by Geoff Hinton](https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf)
2. [Intuitive understanding of the algorithm](http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines/)
2. [Derivation of Contrastive Divergence](http://image.diku.dk/igel/paper/AItRBM-proof.pdf)

In [38]:
import numpy as np
import pandas as pd
import torch

In [39]:
movies = pd.read_csv('ml-1m/movies.dat', sep='::', 
                      header=None, engine='python',
                      encoding='latin-1')
print movies.head()

users = pd.read_csv('ml-1m/users.dat', sep='::', 
                      header=None, engine='python',
                      encoding='latin-1')
print users.head()
ratings = pd.read_csv('ml-1m/ratings.dat', sep='::', 
                      header=None, engine='python',
                      encoding='latin-1')
ratings.head()

   0                                   1                             2
0  1                    Toy Story (1995)   Animation|Children's|Comedy
1  2                      Jumanji (1995)  Adventure|Children's|Fantasy
2  3             Grumpier Old Men (1995)                Comedy|Romance
3  4            Waiting to Exhale (1995)                  Comedy|Drama
4  5  Father of the Bride Part II (1995)                        Comedy
   0  1   2   3      4
0  1  F   1  10  48067
1  2  M  56  16  70072
2  3  M  25  15  55117
3  4  M  45   7  02460
4  5  M  25  20  55455


Unnamed: 0,0,1,2,3
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [40]:
training_set = pd.read_csv('ml-100k/u1.base', 
                           delimiter='\t').values 
test_set = pd.read_csv('ml-100k/u1.test', 
                       delimiter='\t').values

In [41]:
# Getting the number of users and movies 
nb_users = int(max(max(training_set[:, 0]),
                   max(test_set[:, 0])))
nb_movies = int(max(max(training_set[:, 1]),
                   max(test_set[:, 1])))

In [42]:
# convert the data into an an array with users 
# in lines and movies in columns.
# This is because torch expects the data to be
# in this fashion
def convert(data):
    """Create a list of lists. One list for each user. 
    The rating are from 1 to 5. User that didn't see 
    the movie will have a rating of 0"""
    new_data = []
    for id_users in range(1, nb_users + 1):
        # take the column with movie ids for this
        # specific user
        id_movies = data[:, 1][data[:, 0] == id_users] 
        # same for the ratings
        id_ratings = data[:, 2][data[:, 0] == id_users]
        # we also have to take care of the case where
        # this user didn't watch a specific movie
        # So, create a list of 1682 (total movies) elements
        # initialized with 0 and set the rating where this
        # person has watched the movies. 
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings
        new_data.append(list(ratings))
    return new_data
training_set = convert(training_set)
test_set = convert(test_set)

In [43]:
# Converting the data into torch tensors
training_set = torch.FloatTensor(training_set)
test_set = torch.FloatTensor(test_set)

In [44]:
# Convert the ratings into binary 1 (liked) or 0 (not liked)
# we have to do this because the output of RBM will be 
# 0 if it predicts user will like the movie and 1 if the 
# prediction is yes. 

# ratings will be modified to:
# 1. Not watched                   -1
# 2. Ratings 3 and above (liked)    1
# 3. Ratings 1 or 2 (disliked)      0

# convert 0 ratings to -1 
training_set[training_set == 0] = -1

# convert ratings for 1 and 2 to 0
# using training_set[training_set <= 2] = 0
# has inadvertent effect on rows with rating 0

training_set[training_set == 1] = 0
training_set[training_set == 2] = 0

# convert ratings 3 and above to 1
training_set[training_set >= 3] = 1

# same for test set
test_set[test_set == 0] = -1

test_set[test_set == 1] = 0
test_set[test_set == 2] = 0

# convert ratings 3 and above to 1
test_set[test_set >= 3] = 1


In [45]:
# creating the RBM architecture

class RBM(object):
    """ Creates a Restricted Boltzmann Machine

        Parameters
        ----------
        1. hidden :  number of hidden nodes in RBM
        2. visible : number of visible nodes in RBM
        """
    def __init__(self, visible, hidden):
        # The weights matrix is the weight of the path connected
        # from visible node i to hidden node j. Every path
        # will have a certain weight. 
        
        # initialise the weights according to normal distribution 
        # which has mean of 0 and variance of 1 dimensions will 
        # be hidden x weights. 
        self.weights = torch.randn(hidden, visible)
        # bias_h is the bias of hidden nodes which is required 
        # to calculate the probability of hidden node given 
        # visible node
        self.bias_h = torch.randn(1, hidden)
        # bias_v is the bias of visible nodes which is required
        # to calculate the probability of visible node given 
        # visible node
        self.bias_v = torch.randn(1, visible)
    
    def activate_hidden(self, visible):
        """This will activate some hidden nodes using the probability 
        of hidden given visible or p(h=1 | v) which is nothing but
        sigmoid applied to the sum(weights*input + bias). 
        The question then we'll ask is which hidden units will 
        activate given these active visible nodes. 
        
        Parameters
        ----------
        
        1. visible :    These are visible units in p(h=1 | v) or
                        the current visible nodes that are active.
                   
        Returns
        -------
        
        1. p_h_given_v : Returns the vector (one value for every 
                         hidden node) of float values.
        
        2. activated_values : The nodes will be activated according to
                         bernoulli's sampling. Which basically works
                         as follows: If you have a probability of 0.7
                         Then choose a random number between 0 and 1
                         if the number is less than 0.7, activate the
                         node, else leave it off. 
           
        """
        weight_x = torch.mm(visible, self.weights.t())
        # expand as will add a dimension so that bias is applied
        # to each line of the weight_x
        activation = weight_x + self.bias_h.expand_as(weight_x)
        # calculate p(h | v)
        p_h_given_v = torch.sigmoid(activation)
        return p_h_given_v, torch.bernoulli(p_h_given_v)
    
    def activate_visible(self, hidden):
        """This will activate visible nodes using the probability 
        of visible given hidden or p(v=1 | h) which is nothing but
        sigmoid applied to the sum(weights*input + bias). 
        The question then we'll ask is which visible units will 
        activate given these hidden nodes. 
        
        Parameters
        ----------
        
        1. hidden :      These are hidden units in p(v=1 | h) or
                         the current hidden nodes that are active.
                   
        Returns
        -------
        
        1. p_v_given_h : Returns the vector (one value for every 
                         hidden node) of float values.
        
        2. activated_values : The nodes will be activated according to
                         bernoulli's sampling. Which basically works
                         as follows: If you have a probability of 0.7
                         Then choose a random number between 0 and 1
                         if the number is less than 0.7, activate the
                         node, else leave it off. 
           
        """
        # This time we don't need to compute the transpose 
        # since we're going the other way
        weight_x = torch.mm(hidden, self.weights)
        # expand as will add a dimension so that bias is applied
        # to each line of the weight_x
        activation = weight_x + self.bias_v.expand_as(weight_x)
        # calculate p(v | h)
        p_v_given_h = torch.sigmoid(activation)
        return p_v_given_h, torch.bernoulli(p_v_given_h)
    
    def train(self, input_vector, predicted_visible_k, 
              p_h_given_v_0, p_h_given_v_k):
        """
        Performs Contrastive Divergence
        
        Parameters
        ----------
        
        1. input_vector :  The original input vector given by the user.
        
        2. predicted_visible_k : The visible vector obtained after k
                           iterations in Contrastive Divergence. 
        
        3. p_h_given_v_0 : The probabilities of hidden nodes given the
                           input vector. 
        
        4. p_h_given_v_k : The probabilities of hidden nodes at the kth
                           iteration in Contrastive Divergence. 
        
        """
        # update weights
        self.weights += torch.mm(input_vector.t(), p_h_given_v_0) \
                            - torch.mm(predicted_visible_k.t(), 
                                       p_h_given_v_k)
        # the zero in the end of sum() is to preserve the shape
        # as matrix. 
        self.bias_v += torch.sum((input_vector - predicted_visible_k), 0)
        self.bias_h += torch.sum((p_h_given_v_0 - p_h_given_v_k), 0)


In [46]:
visible_nodes_size = len(training_set[0])
hidden_nodes_size = 100
batch_size = 100
number_of_epochs = 15
rbm = RBM(visible_nodes_size, hidden_nodes_size)

for epoch in range(number_of_epochs):
    train_loss = float(0)
    counter = float(0)
    for id_user in range(0, nb_users - batch_size, batch_size):
        # visible_k is the configuration of visible nodes after
        # k iterations.
        # initially, it is the same as the input vector. 
        visible_k = training_set[id_user: id_user + batch_size]
        visible_0 = training_set[id_user: id_user + batch_size]
        p_h_given_v0, _ = rbm.activate_hidden(visible_0)
        # perform k steps of Contrastive Divergence
        for k in range(10):
            # perform random walk
            
            # hidden_k are the probabilities of hideen nodes
            # after k passes
            
            # initially visible_k = visible_0
            _, hidden_k = rbm.activate_hidden(visible_k)
            _, visible_k = rbm.activate_visible(hidden_k)
            # reset the nodes where the rating is -1
            # i.e. where the users didn't watch the movie. 
            visible_k[visible_0 < 0] = visible_0[visible_0 < 0]
        p_h_given_vk, _ = rbm.activate_hidden(visible_k)
        rbm.train(visible_0, visible_k, p_h_given_v0, p_h_given_vk)
        # don't measure the error where the ratings are -1
        train_loss += torch.mean(torch.abs(visible_0[visible_0 >= 0]
                                           - visible_k[visible_0 >= 0]))
        counter += 1
    print 'Epoch: {0}\tTraining Loss: {1}'.format(epoch + 1, 
                                                 train_loss/counter)

Epoch: 1	Training Loss: 0.292761849616
Epoch: 2	Training Loss: 0.251646004149
Epoch: 3	Training Loss: 0.252478372312
Epoch: 4	Training Loss: 0.250084695766
Epoch: 5	Training Loss: 0.250953539255
Epoch: 6	Training Loss: 0.249900776658
Epoch: 7	Training Loss: 0.251415914688
Epoch: 8	Training Loss: 0.250097338458
Epoch: 9	Training Loss: 0.249537683653
Epoch: 10	Training Loss: 0.248545808809
Epoch: 11	Training Loss: 0.249935396027
Epoch: 12	Training Loss: 0.250694418432
Epoch: 13	Training Loss: 0.253130697998
Epoch: 14	Training Loss: 0.249886217562
Epoch: 15	Training Loss: 0.250107781537


In [49]:
# Testing the predictions

test_loss = float(0)
counter = float(0)
for id_user in range(nb_users):
    # IMPORTANT: While testing, we start with the
    # trainin_set data for this user, which will
    # be used to predict the recommendations for
    # this user
    visible_predict = training_set[id_user: id_user + 1]
    # the test set contains actual movies watched
    # by the same person, if our predictions match
    # the actual results, then we're doing good. 
    visible_target = test_set[id_user: id_user + 1]
    # We have already performed k steps of Contrastive
    # Divergence, so while testing we need to do it
    # only once. 
    # Also, we cannot predict when the user hasn't
    # watched any movies, so keep that case in mind. 
    if len(visible_target[visible_target >= 0]) > 0:
        _, hidden_predict = rbm.activate_hidden(visible)
        _, visible_predict = rbm.activate_visible(hidden_predict)
        # don't measure the error where the ratings are -1
        test_loss += torch.mean(torch.abs(
                                visible_target[visible_target >= 0]
                                - visible_predict[visible_target >= 0]))
        counter += 1
print 'Test Loss: {1}'.format(epoch + 1, test_loss/counter)

Test Loss: 0.229938048396
