# Flappy Bird Value Approximator: Deep Reinforcement Learning

## Value Function Approximation
- Function Approximators: Linear Combinations and Neural Networks
- Incremental Methods: Stochastic Gradient Descent Prediction and Control
- Batch Methods: Least Squares Prediction and Control, Experience Replay

## Types of Function Approximators
- There are many function approximators, all of them supervised ML algorithms: linear combinations of features, neural networks, decision trees, etc.
- RL benefits most from differentiable function approximators such as linear models and neural networks, because their weights can be updated by following a gradient.
- Incremental methods update the weights after each sample, while batch methods update them once per batch of samples.
- Stochastic Gradient Descent (SGD) is an incremental, iterative optimization algorithm that finds the parameter values (weights) of a function that minimize a cost function.
- The least squares method is a form of regression analysis that finds the line of best fit for a dataset by minimizing the sum of squared errors, showing the relationship between the data points.
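The difference between the incremental and the batch approach can be sketched on a toy problem. The snippet below is only an illustration under stated assumptions: the data, the true weights `[2.0, -1.0]`, and the helper name `sgd_linear_fit` are hypothetical and not part of the Flappy Bird code. It fits a linear approximator once with per-sample SGD and once with a batch least-squares solve.

```python
import numpy as np

def sgd_linear_fit(features, targets, learning_rate=0.05, epochs=200, seed=0):
    """Fit weights so that features @ weights approximates targets,
    updating the weights after every single sample (incremental method)."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(features.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(features)):
            error = targets[i] - features[i] @ weights
            # Gradient of 0.5 * error**2 w.r.t. weights is -error * features[i]
            weights += learning_rate * error * features[i]
    return weights

# Hypothetical noise-free data generated from known weights
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(50, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w

w_sgd = sgd_linear_fit(X, y)                      # incremental: one sample at a time
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # batch: least squares on all samples
```

On this noise-free data both approaches should end up close to `true_w`; SGD gets there through many small steps, least squares in one solve.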

## Neural Network Approximating the Q Function

![title](images/va_nonlinear.png)
![title](images/va_nonlinear_z.png)


## Q-Learning with Non-Linear Approximation

- Step 1: Start with initial parameter values
- Step 2: Take action a according to an explore or exploit policy, transitioning from s to s’
- Step 3: Perform TD update for each parameter
     ![title](images/va_nonlinear_eq.png)
- Step 4: Go to Step 2

The loss surface typically has many local minima, so convergence is no longer guaranteed; even so, the method often works well in practice.
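For reference, the per-parameter TD update in Step 3 is usually written as the semi-gradient Q-learning rule below. This is the standard textbook form, given here on the assumption that it matches the equation image; the symbols α (step size), γ (discount factor), and w (network weights) are the conventional ones.

```latex
w \leftarrow w + \alpha \left[ r + \gamma \max_{a'} \hat{Q}(s', a'; w) - \hat{Q}(s, a; w) \right] \nabla_w \hat{Q}(s, a; w)
```

The bracketed term is the TD error; the gradient factor distributes that error over the parameters that influenced the prediction for (s, a).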


# Neural Network Concepts

- The perceptron, the first-generation neural network, is a simple mathematical model (a function) that mimics a neuron, the basic unit of the brain.
- The sigmoid neuron improved learning by passing the weighted input through a smooth sigmoid function instead of a hard threshold.
- A neural network is a directed graph organized in layers; each layer consists of a number of interconnected neurons (nodes).
- A typical neural network contains three kinds of layers: input, hidden, and output. If there is more than one hidden layer, it is called a deep neural network.
- Most of the processing happens in the hidden layers, where each neuron applies an activation function to a weighted sum of its inputs (from the previous layer).
- The performance of a neural network is measured with a cost (error) function, which depends on the weights.
- Forward- and back-propagation are two techniques that a neural network uses repeatedly until the weights are calibrated to predict accurate output.
- During forward-propagation, information moves forward through all the layers, with weights applied to the inputs at each layer. Back-propagation computes the gradient of the error with respect to the weights, and gradient descent uses that gradient to reduce the error at each iteration step.
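The forward/backward cycle described above can be sketched in plain NumPy. This is a minimal illustration, not the network used later in this notebook: the one-hidden-layer architecture, the XOR toy data, the learning rate, and the function names are all assumptions chosen to keep the example small.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, params, learning_rate=0.1):
    """One forward pass, one backward pass, one gradient-descent update.
    Returns the mean squared error measured before the update."""
    w1, b1, w2, b2 = params
    # Forward propagation: input -> sigmoid hidden layer -> linear output
    hidden = sigmoid(x @ w1 + b1)
    output = hidden @ w2 + b2
    error = output - y
    loss = float(np.mean(error ** 2))
    # Back-propagation: apply the chain rule layer by layer
    grad_output = 2.0 * error / error.size
    grad_w2 = hidden.T @ grad_output
    grad_b2 = grad_output.sum(axis=0)
    grad_hidden = grad_output @ w2.T
    grad_z1 = grad_hidden * hidden * (1.0 - hidden)   # sigmoid derivative
    grad_w1 = x.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)
    # Gradient descent: move every weight against its gradient (in place)
    w1 -= learning_rate * grad_w1
    b1 -= learning_rate * grad_b1
    w2 -= learning_rate * grad_w2
    b2 -= learning_rate * grad_b2
    return loss

# Hypothetical toy data (XOR) and small random initial weights
rng = np.random.default_rng(0)
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])
params = (rng.normal(scale=0.5, size=(2, 4)), np.zeros(4),
          rng.normal(scale=0.5, size=(4, 1)), np.zeros(1))

losses = [train_step(x, y, params) for _ in range(200)]
```

The list `losses` records the error before each update, so comparing its first and last entries shows whether the repeated forward/backward passes reduced the cost.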

![title](images/nn.png)


# Deep Neural Networks

- Deep learning uses neural networks with multiple hidden layers and can work with supervised or unsupervised datasets.
- Deep learning vectorizes the input and maps it into an output vector space by decomposing a complex geometric transformation into a series of simple ones. Each of these transformations passes through the neuron activation functions of one layer, parameterized by that layer's weights.
- A Convolutional Neural Network (CNN) consists of (1) convolutional layers, which identify features using weights and biases and are typically combined with nonlinearities and sub-sampling (max-pooling) to improve performance and control overfitting, followed by (2) fully connected layers, where each neuron is connected to all the neurons of the previous layer. Example applications include image and voice recognition.
- A Recurrent Neural Network (RNN) is another type of deep learning model. It applies the same shared weights at every step when processing sequential data, such as readings emitted by sensors or spoken words in NLP, and can handle input and output vectors of arbitrary size. Long Short-Term Memory (LSTM) is an advanced RNN that learns and remembers longer sequences by chaining a series of repeated neural network modules.

![title](images/deep_nn.png)
![title](images/cnn.png)
![title](images/rnn.png)

# Weight Sharing and Experience Replay

- **Weight Sharing**: A Convolutional Neural Network shares weights between local regions; a Recurrent Neural Network shares weights between time-steps.


- **Experience Replay**: Store each experience (S, A, R, Snext) in a replay buffer and sample mini-batches from it to train the network. This de-correlates the data, improves data efficiency, and gives better convergence behavior when training a function approximator. In the beginning, the replay buffer is filled with random experience.
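A replay buffer of this kind can be sketched with the standard library alone. The class name `ReplayBuffer` and its methods are hypothetical, chosen for illustration; the agent later in this notebook uses a bare `deque` in the same way.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the correlation
        # between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

A training loop would call `add` after every environment step and `sample` once the buffer holds at least `batch_size` transitions.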



# Deep Q-Learning Network (DQN)

- Step 1: Take action a_t according to an ε-greedy policy
- Step 2: Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
- Step 3: Sample a random mini-batch of transitions (s, a, r, s’) from D
- Step 4: Compute Q-learning targets w.r.t. old, fixed parameters w⁻
- Step 5: Optimize the MSE (mean squared error) between the Q-network and the Q-learning targets
    ![title](images/dqn.png)
- Step 6: Perform the optimization using a variant of stochastic gradient descent
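Steps 4 and 5 can be sketched in NumPy, independent of any deep-learning library. The function names and the array layout (one row per transition, one column per action) are assumptions for illustration; `next_q_values` stands for the output of the network evaluated with the old, fixed parameters.

```python
import numpy as np

def q_learning_targets(rewards, dones, next_q_values, gamma=0.95):
    """Target r + gamma * max_a' Q(s', a'; w_old), or just r for terminal states."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    max_next_q = np.max(next_q_values, axis=1)
    # Terminal transitions contribute no bootstrapped future value
    return rewards + gamma * max_next_q * (~dones)

def mse_loss(predicted_q, actions, targets):
    """Mean squared error between Q(s, a) for the actions taken and the targets."""
    chosen_q = predicted_q[np.arange(len(actions)), actions]
    return float(np.mean((targets - chosen_q) ** 2))

# Hypothetical mini-batch of two transitions with two actions each
next_q = np.array([[0.5, 2.0], [3.0, 1.0]])
targets = q_learning_targets([1.0, -1.0], [False, True], next_q, gamma=0.9)
loss = mse_loss(np.array([[2.0, 0.0], [0.0, 0.0]]), np.array([0, 1]), targets)
```

Here the first target is `1.0 + 0.9 * 2.0` and the second is just the reward `-1.0` because that transition ended the episode.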


## Some key aspects of the implementation:

Libraries used: Keras with TensorFlow (**GPU version**). The model was trained for several hours in an Azure Windows environment.

To scale the implementation, we pre-process the images by converting color images to grayscale and then resizing them to 80x80 pixels. We then stack 4 frames together so that the bird's velocity can be inferred.
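The frame-stacking part can be sketched with NumPy only. The helper names `initial_stack` and `push_frame` are hypothetical; the sketch assumes frames are already grayscale 80x80 arrays and uses the channels-last layout (1, 80, 80, 4).

```python
import numpy as np

def initial_stack(frame, n_frames=4):
    """Repeat the first frame n_frames times; result has shape (1, H, W, n_frames)."""
    stacked = np.stack([frame] * n_frames, axis=2)
    return stacked[np.newaxis, ...]

def push_frame(stack, frame):
    """Insert the newest frame as channel 0 and drop the oldest channel."""
    newest = frame[np.newaxis, :, :, np.newaxis]          # (1, H, W, 1)
    return np.append(newest, stack[:, :, :, :-1], axis=3)

# Hypothetical frames: an all-black frame followed by an all-white one
first = np.zeros((80, 80))
second = np.ones((80, 80))
stack = initial_stack(first)
stack = push_frame(stack, second)
```

Because consecutive channels hold consecutive frames, the difference between channel 0 and channel 1 carries the motion information a single frame lacks.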

- The input to the neural network consists of a stack of 4 images of 80x80 pixels (4x80x80).
- The first hidden layer convolves 32 filters of 8 x 8 with stride 4 and applies a ReLU activation function.
- The 2nd layer convolves 64 filters of 4 x 4 with stride 2 and applies a ReLU activation function.
- The 3rd layer convolves 64 filters of 3 x 3 with stride 1 and applies a ReLU activation function.
- The final hidden layer is fully connected and consists of 512 rectifier units.
- The output layer is a fully connected linear layer with a single output for each valid action.
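The size of each feature map follows from the strides. The short calculation below assumes `'same'` padding, as in the `border_mode='same'` setting of the model code in this notebook; with that padding the spatial size after a layer is the input size divided by the stride, rounded up. The helper name is hypothetical.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size of a convolution with 'same' padding."""
    return math.ceil(size / stride)

# (number of filters, stride) for the three convolutional layers listed above
conv_layers = [(32, 4), (64, 2), (64, 1)]

size = 80
shapes = []
for filters, stride in conv_layers:
    size = same_padding_output(size, stride)
    shapes.append((size, size, filters))

# Number of values fed into the 512-unit fully connected layer
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under this assumption the feature maps shrink from 80x80 to 20x20 and then 10x10, so the fully connected layer receives 10 * 10 * 64 = 6400 inputs.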

**Convolution** helps the computer learn higher-level features such as edges and shapes: after a convolution filter is applied, the edges stand out.
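The edge-highlighting effect can be reproduced with a hand-written filter. This sketch is an illustration only: the 6x6 test image and the Sobel-style kernel are assumptions, whereas the network learns its own filters during training.

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    """Slide the kernel over the image (no padding) and return the response map.
    Like most deep-learning libraries, this computes a cross-correlation
    (the kernel is not flipped)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Hypothetical image: dark left half, bright right half
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel-style kernel that responds to vertical edges
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-2.0, 0.0, 2.0],
                          [-1.0, 0.0, 1.0]])

response = convolve2d_valid(image, vertical_edge)
```

The response is zero over the flat dark and bright regions and non-zero only in the columns where the window straddles the dark-to-bright boundary.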

**Keras** makes it very easy to build a convolutional neural network. However, there are a few things to track:

- A) It is important to choose the right initialization method. I chose a normal distribution with σ = 0.01: `init=lambda shape, name: normal(shape, scale=0.01, name=name)`

- B) The ordering of the dimensions is important. The default setting is 4x80x80 (Theano setting), so if your input is 80x80x4 (TensorFlow setting) the dimensions will be wrong unless you set `dim_ordering='tf'` (`tf` means TensorFlow, `th` means Theano).

- C) In Keras, `subsample=(2,2)` means the image is down-sampled, for example from (80x80) to (40x40). In ML literature this is usually called the “stride”.

- D) We used an adaptive learning algorithm called Adam for the optimization. The learning rate is 1e-6.
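For point B, an array stored in the 4x80x80 (channels-first) layout can be rearranged into the 80x80x4 (channels-last) layout with a transpose. The snippet is a sketch with hypothetical variable names; a plain `reshape` would not be equivalent, because it keeps the memory order and scrambles the pixels.

```python
import numpy as np

# Hypothetical stack in Theano-style layout: (channels, height, width)
channels_first = np.zeros((4, 80, 80))
channels_first[2] = 1.0                      # mark the third frame

# Move the channel axis to the end: (height, width, channels)
channels_last = np.transpose(channels_first, (1, 2, 0))

# Add the leading batch dimension expected by the network input
batch = channels_last[np.newaxis, ...]
```

After the transpose, frame `k` is found at `channels_last[:, :, k]` instead of `channels_first[k]`.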

**Experience Replay:**

It was found that approximating Q-values with non-linear functions such as neural networks is not very stable. During game-play, all transitions (s, a, r, s′) are stored in replay memory D. When training the network, random mini-batches from the replay memory are used instead of the most recent transition, which greatly improves stability.

# Deep Neural Network Implementation

In [1]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [2]:
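import datetime
import gym
import random
import json
import os
import numpy as np
import skimage.color      # rgb2gray
import skimage.exposure   # rescale_intensity
import skimage.transform  # resize
from collections      import deque
from keras.models     import Sequential
# Activation, Flatten and Convolution2D are used by _build_model below
from keras.layers     import Dense, Activation, Flatten, Convolution2D
from keras.optimizers import Adam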

Using TensorFlow backend.


In [3]:
%run flappy_bird_env.py #open AI gym clone

In [4]:
class ValueApproxDeepNNModel(object):
    
    def __init__(self, state_size, action_size, algorithm):
        self.algorithm          = algorithm
        self.learning_rate      = 0.001        
        self.weight_backup      = "flappy_va_{}.h5".format(algorithm) 
        
        self.state_size         = state_size        
        self.action_size        = action_size    
        
        self.exploration_rate   = 1.0
        self.exploration_min    = 0.01        
        
        self.brain              = self._build_model()
    
    def _build_model(self):
        
        # Neural Net for Deep-Q learning Model
        model = Sequential()
        
        #input layer containing 4 80X80 images
        model.add(Convolution2D(32, 8, 8, subsample=(4, 4), border_mode='same',input_shape=(80,80,4)))  
        
        #hidden layers 
        model.add(Activation('relu'))
        model.add(Convolution2D(64, 4, 4, subsample=(2, 2), border_mode='same'))
        model.add(Activation('relu'))
        model.add(Convolution2D(64, 3, 3, subsample=(1, 1), border_mode='same'))
        model.add(Activation('relu'))
        model.add(Flatten())
        model.add(Dense(512))
        model.add(Activation('relu'))
        
        #Output Layer 
        model.add(Dense(2))
   
        #set the loss function
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate))

        #check file exists to load the weights from
        if os.path.isfile(self.weight_backup):
            model.load_weights(self.weight_backup)
            self.exploration_rate = self.exploration_min
        return model

    def save_model(self):
        self.brain.save(self.weight_backup)

In [5]:
class ValueApproxAgent(object):
    def __init__(self, state_size, action_size, model):        
        self.state_size         = state_size
        self.action_size        = action_size
        self.memory             = deque(maxlen=2000)
        self.learning_rate      = 0.001
        self.gamma              = 0.95
        self.exploration_rate   = 1.0
        self.exploration_min    = 0.01
        self.exploration_decay  = 0.995
        self.model              = model        
    
    def act(self, state):
        #*****************************************
        #ACT: agent will randomly select its action at first by a certain percentage, 
        #called ‘exploration rate’ (or ‘epsilon’). 
        #At the beginning, it is better for the DQN agent to try 
        #different things before it starts to search for a pattern
        #*****************************************
        
        if np.random.rand() <= self.exploration_rate:
            return random.randrange(self.action_size)
        
        act_values = self.model.brain.predict(state)
        return np.argmax(act_values[0])

   
    def remember(self, state, action, reward, next_state, done):
        #****************************************************
         #One of the most important steps in the learning process is to remember 
         #what we did in the past and how the reward was bound to that action
        #****************************************************
        self.memory.append((state, action, reward, next_state, done))

    
    def replay(self, sample_batch_size):
        #*****************************************
        #REPLAY: Now that we have our past experiences in an array, 
        #we can train our neural network. 
        #We cannot afford to go through all our memory; it would take too many resources. 
        #Therefore, we will only take a few samples (sample_batch_size and here set as 32) 
        #and we will just pick them randomly.
        #*************************************************************
        #not enough data to train; play another episode
        if len(self.memory) < sample_batch_size:
            return
        
        sample_batch = random.sample(self.memory, sample_batch_size)
        for state, action, reward, next_state, done in sample_batch:
            target = reward
            
            if not done:
                #q-learning
                target = reward + self.gamma * np.amax(self.model.brain.predict(next_state)[0])
                
            target_f = self.model.brain.predict(state)
            target_f[0][action] = target
                                    
            #online learning with One Sample and discard this after fitting it to the model
            self.model.brain.fit(state, target_f, epochs=1, verbose=0)
            
        #adjust the exploration based on decay
        if self.exploration_rate > self.exploration_min:
            self.exploration_rate *= self.exploration_decay   
                

In [6]:
class ValueApproxGame:
    
    def __init__(self, env, agent, max_iterations= 10000):
        
        self.sample_batch_size = 32 #only few samples
        self.episodes          = max_iterations
        self.agent             = agent
        self.env               = env #Open AI gym    
        self.state_size        = self.env.observation_space.shape[0]
        self.action_size       = self.env.action_space.n
        
        self.start             = datetime.datetime.now()      
        self.data              = []

    def run(self):
        
        try:
            for index_episode in range(self.episodes):
                state = self.env.reset()
                #x_t, r_0, terminal = game_state.frame_step(do_nothing)
                
                #Convert the image into grayscale (rgb2gray takes only the image as positional argument)
                state = skimage.color.rgb2gray(state)
                
                #Resize the image to 80X80
                state = skimage.transform.resize(state,(80,80))
                
                #Rescale the intensity of the image
                state = skimage.exposure.rescale_intensity(state,out_range=(0,255))

                #Scale the pixel values to the range [0, 1]
                state = state / 255.0
                
                #Stacking 4 images together
                state_t = np.stack((state, state, state, state), axis=2)
                
                #Reshaping the data for Keras input layer
                state_t = state_t.reshape(1, state_t.shape[0], state_t.shape[1], state_t.shape[2])

                done = False
                index = 0
                episode_reward = 0
                
                while not done:
                    self.env.render(close = True)

                   #take action
                    action = self.agent.act(state_t)                    
                    
                    next_state_colored, reward, done, _ = self.env.step(action, 3)
                    #accumulate the episode reward that save_stats reports
                    episode_reward += reward
                    #print("next_state_colored", next_state_colored)
                    next_state = skimage.color.rgb2gray(next_state_colored)
                    next_state = skimage.transform.resize(next_state,(80,80))
                    next_state = skimage.exposure.rescale_intensity(next_state, out_range=(0, 255))


                    next_state = next_state / 255.0


                    next_state = next_state.reshape(1, next_state.shape[0], next_state.shape[1], 1) #1x80x80x1
                    state_t1 = np.append(next_state, state_t[:, :, :, :3], axis=3)
                    
                    
                    #next_state = np.reshape(next_state, [1, self.state_size])
                    
                    #Stores the image transitions in memory
                    self.agent.remember(state_t, action, reward, state_t1, done)
                    
                    # state = new state
                    state_t = state_t1
                    index += 1                    
                #end while
                                
                self.save_stats(index_episode, episode_reward, self.env.score)
                
                self.agent.model.save_model()
                    
                #train after every episode
                self.agent.replay(self.sample_batch_size)
        finally:
            self.agent.model.save_model()

    def save_stats(self, episode, reward, score):
                
        duration = datetime.datetime.now() - self.start 
        
        if (score >= 50):
            print("Duration: {} Episode {} Score: {}".format(duration, 
                                                                episode, 
                                                                score))
        
        self.data.append(json.dumps({ "algorithm": self.agent.model.algorithm, 
                    "duration":  "{}".format(duration), 
                    "episode":   episode, 
                    "reward":    reward, 
                    "score":     score}))
        
        if (len(self.data) == 500):
            file_name = 'data/stats_flappy_bird_{}.json'.format(self.agent.model.algorithm)
            
            # delete the old file before the first write of this session
            # (episodes are 0-indexed, so the first flush happens at episode 499)
            if episode < 500 and os.path.exists(file_name):
                os.remove(file_name)
                
            # open the file in append mode to add more json data
            with open(file_name, 'a+') as stats_file:
                for item in self.data:
                    stats_file.write(item)
                    stats_file.write(",")
            #end with
            
            self.data = []
        #end if


In [None]:
if __name__ == "__main__":
    
    algorithm = "Deep_Neural_Network"
    max_episodes = 10000
    env = FlappyBirdEnv()
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n
    
    model = ValueApproxDeepNNModel(state_size, action_size, algorithm)
    agent = ValueApproxAgent(state_size, action_size, model)
    
    flappy = ValueApproxGame(env, agent, max_episodes)
    
    flappy.run()