## ***My Last Attempt at Training DQN for Breakout Before Using C51 Code***

# The Previous Version of the Function that Generates Memory and Stores it
This was the main problem as I see it. Generating data in this manner and storing it in a small buffer is inefficient.
In the C51 code uploaded, training starts only after 80,000 global steps have been collected in the buffer, and its maximum size is 1,000,000 transitions. This is a huge difference, and my naive implementation was far from approaching that magnitude.

So, I'll keep this function here as part of the code to demonstrate the first attempt, and create a new function following it for actual future use.

You can see in this version here how complex it was to implement this without the env wrappers. You can also see the debris of my logic.
I tried to figure out what might make it tough for my neural network to learn the game and used that to make the early training easier.
The problems I saw:


1.   **Sparse Rewards**
2.   **Inconsistent Actions**

The sparse rewards and inconsistent actions could create uncertainty and disrupt the logic: what caused the reward to be achieved?

The ideas I used to try and solve this:

*   **Momentum**:

    I figured that using momentum to set the duration of each action could address both issues. If at the beginning each action repeats for 25 steps, it might reduce the uncertainty about which action led to a reward, and possibly help the agent learn to be more consistent in its actions. The next part of the plan was to gradually decrease the number of action repetitions from 25 to 0 throughout the training process.

*   **Rewards Shaping**:

    I figured that a different reward function could improve the agent's performance. I tried:

      1.   Setting **future rewards** for actions that lead to rewards, by exponentially increasing the reward with each action until it reaches the actual reward. I experimented with different slopes, and tried applying this method exclusively to actions resulting in positive rewards.
      2.   **Reward scaling**: I explored various approaches, such as setting reward=-1 or reward=-10 for a lost life.
      3.   **Using only negative reward**: I experimented with the impact of sparse rewards and with the number of steps between the meaningful action and its outcome. I noticed that the negative reward (life lost) is reached in fewer steps than the positive reward (brick hit). Therefore, I thought that with fewer steps, and fewer Q-value iterations representing them, the connection between not approaching the ball and losing a life would be easier to establish. I wanted to focus on this and check whether the agent learns not to lose a life in fewer global steps. Additionally, at the beginning of the game, every time the paddle hits the ball without losing a life, the ball hits some bricks. So it made sense to me that this reward system could represent a good basic game strategy.
      4.   Applying a **basic reward** for each step the agent stayed alive.
      I tried giving the agent a basic reward for every step it stayed alive, inspired by my experience with 'cart-pole', where this worked well. In 'cart-pole', with a basic reward of 1 for each step and a gamma value of 0.9, the Q values stabilized around Q ~ 10 (the geometric-series limit 1/(1-0.9)), making the system stable. I noticed the breakout agent didn't naturally aim for stable Q values like 'cart-pole' did, so I applied a similar idea to breakout. I gave a constant reward for staying alive (reward_alive=1), larger rewards for hitting bricks (reward_brick=5*reward), and set penalties for life lost (reward_lost=-5) with the next Q value at 0.
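
The Q ~ 10 figure for 'cart-pole' follows from the discounted sum of a constant reward. The short check below is only an illustration of that arithmetic; the function names are made up for this sketch and are not part of the notebook.

```python
def discounted_return(reward, gamma, horizon):
    """Sum of gamma**k * reward for k = 0 .. horizon - 1."""
    return sum(reward * gamma ** k for k in range(horizon))


def limit_value(reward, gamma):
    """Closed form of the infinite geometric series: reward / (1 - gamma)."""
    return reward / (1.0 - gamma)


# Constant reward of 1 per step with gamma = 0.9, as in the cart-pole setup above
q_limit = limit_value(1.0, 0.9)
q_after_200_steps = discounted_return(1.0, 0.9, 200)
print(q_limit, q_after_200_steps)  # both are approximately 10
```

With the same gamma, a constant alive reward in breakout would pull the Q values of "safe" states toward the same limit, which is the stabilizing effect described above.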



Looking back, here are my thoughts about this process:

*   **Micromanaging**

    I now realize that some of my ideas may be considered micromanaging. What I find beautiful in the paper about DQN learning to play Atari is that the agent truly learns from the environment as it is: the original game's basic reward system, with no restrictions, no assistance, and no mitigations. And the agent successfully learned the game. If the agent can learn on its own with minimum interference from our side, it is more valuable to me.

*   **Clumsy implementation**

    The way I implemented the system in this code is somewhat clumsy. The training loop is not designed to sustain training over a memory buffer with a size of one million transitions. This implementation is essentially naive, vanilla, and not suitable for running training continuously for two days or more. While it works well for training a system on simpler problems like 'cart-pole,' it requires improvement to handle more extended training processes.

*   **Other work**

    It is recommended to step up my work to match the latest solutions by actually using and integrating their code with mine. You cannot make yourself available to solve new problems if you are too busy re-solving past problems that took other people years to solve. Harness what humanity has done so far, and use it to explore new horizons.

*   **Future work**

    If I had employed a better, more organized, and elegant code implementation like the one in the C51 code, I could have accurately observed and diagnosed the actual impact of the methods I used. Due to my clumsy implementation, it didn't lead to effective learning despite having the right logic behind it. In the future, I'm thinking of restructuring the code, running it with the suggested methods, and examining their effects on the agent's learning process.











IMPORTS

In [None]:
# @title
########################## IMPORTS #########################################
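%pip install "gym[atari, accept-rom-license]"
import numpy as np
# Note: the code below uses the classic Gym API (reset() returns the observation,
# step() returns 4 values), so only `gym` is imported; the earlier
# `import gymnasium as gym` was shadowed by this import anyway.
import gym
import random
import time
import cv2
from IPython.display import clear_output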
from google.colab.patches import cv2_imshow
import torch
import torch.nn as nn
import torch.optim as optim
import copy

!apt-get install -y python-opengl ffmpeg
%pip install pyvirtualdisplay
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
from gym.wrappers.record_video import RecordVideo
from IPython.display import display, HTML
from IPython.display import clear_output
import io
import base64
import tensorflow as tf
from torch.nn.utils import clip_grad_norm_
import math

USE GPU as device

In [None]:
# @title
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using the GPU")

else:
    device = torch.device("cpu")

# Build Neural Network

In [None]:
class Policy_NN(nn.Module):        ############### Keep it simple ################
    def __init__(self, n=8):
        super(Policy_NN, self).__init__()
        self.n_feature = n
        self.kernel_size = 8
        self.conv1 = nn.Conv2d(in_channels=4, out_channels=32, kernel_size=self.kernel_size,stride=4)
        self.conv2 = nn.Conv2d(32,64, kernel_size=self.kernel_size//2,stride=2 )
        self.conv3 = nn.Conv2d(64,64, kernel_size=3,stride=1 )

        self.fc1 = nn.Linear(64*7*7, 512)
        self.fc2 = nn.Linear(512,256)
        self.fc3 = nn.Linear(256,3)

        self.pool = nn.MaxPool2d(kernel_size=2)
        self.relu= nn.ReLU()
        self.leaky_relu= nn.LeakyReLU(0.1)
        self.log_softmax = nn.LogSoftmax( dim=1)
        self.softmax = nn.Softmax( dim=1)

    def forward(self, x):
                          # x size = (batch, 4, 84, 84)
        x = self.conv1(x) # x size = (batch, 32, 20, 20)
        x = self.relu(x)

        x = self.conv2(x) # x size = (batch, 64, 9, 9)
        x = self.relu(x)

        x = self.conv3(x) # x size = (batch, 64, 7, 7)
        x = self.relu(x)
        # No pooling between the conv layers: fc1 expects 64*7*7 inputs, and a 2x2
        # pooling after each conv would shrink the map below conv3's 3x3 kernel.

        batch_size = x.size(0)

        x = x.reshape(batch_size, -1) # x size = (batch, 64*7*7)


##### FULLY CONNECTED LAYERS
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x
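
The `64*7*7` input size of `fc1` can be checked with the usual output-size formula for an unpadded convolution, `(size - kernel) // stride + 1`. The helper names below are illustrative only; the kernel sizes and strides are the ones set in `__init__`.

```python
def conv_out(size, kernel, stride):
    """Spatial output size of a convolution with no padding."""
    return (size - kernel) // stride + 1


def conv_stack(size):
    """Apply the three (kernel, stride) pairs used by conv1, conv2, conv3."""
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        size = conv_out(size, kernel, stride)
    return size


side = conv_stack(84)        # 84 -> 20 -> 9 -> 7
flat = 64 * side * side      # 64 output channels from conv3
print(side, flat)            # 7 3136
```

This arithmetic covers the three conv layers alone. A 2x2 max-pool after each of them would give 84 -> 20 -> 10 -> 4 -> 2, which is already smaller than conv3's 3x3 kernel, so the two designs cannot share the same `fc1`.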

# Preprocess Images Function
Turn them to grayscale, resize them to (84,84), and threshold them to black and white.

In [None]:
def preprocess_image(rgb_image):
    # Convert the RGB image to grayscale
    grayscale_image = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2GRAY)

    # Resize the grayscale image to 84x84
    resized_image = cv2.resize(grayscale_image, (84, 84), interpolation=cv2.INTER_AREA)

    # Set a threshold value
    threshold_value = 10  # You can adjust this threshold as needed

    # Apply thresholding to create a black and white image
    _, black_and_white_image = cv2.threshold(resized_image, threshold_value, 255, cv2.THRESH_BINARY)

    return black_and_white_image

def preproces_mini_batch(last4states):
    # Preprocess the 4 most recent frames and stack them into one (84, 84, 4) state
    frames = [np.array(preprocess_image(frame)) for frame in last4states[:4]]
    stacked_state = np.stack(frames, axis=2)
    return stacked_state

# **Build an OFFLINE_gradient_descent function**

In [None]:

def OFFLINE_gradient_descent(model, memory, batch_size, learning_rate, gamma=0.9):
    model_copy = copy.deepcopy(model).to(device)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.MSELoss()

    len_memory = len(memory)
    print("len_memory=", len_memory, "batch_size=", batch_size)
    ## Can add  random.shuffle(memory)
    for batch_start in range(0, len_memory, batch_size):
        batch_memory = memory[batch_start:batch_start + batch_size]

        # Unpack batch_memory
        States, Actions, Rewards, Next_States, Future_Rewards, Lost_Ball = zip(*batch_memory)

        # Convert to tensors and move everything to the same device as the model
        States = torch.Tensor(np.array(States)).transpose(1, 3).transpose(2,3).to(device) #from [n,84,84,4] to [n,4,84,84]
        Actions = torch.tensor(Actions, dtype=torch.int64).to(device)
        Rewards = torch.tensor(Rewards, dtype=torch.float32).to(device)
        Next_States = torch.Tensor(np.array(Next_States)).transpose(1, 3).transpose(2,3).to(device) #from [n,84,84,4] to [n,4,84,84]
        Future_Rewards = torch.tensor(Future_Rewards,dtype=torch.float32).to(device)
        Lost_Ball = torch.tensor(Lost_Ball,dtype=torch.int64).to(device)  # 1 = ball still in play, 0 = life lost

        # Get Q-values for current states
        Q_values = model(States)

        # Compute targets
        with torch.no_grad():
            max_next_Q_values, _ = torch.max(model_copy(Next_States), dim=1, keepdim=True)
            #print("max_next_Q_values=",max_next_Q_values) ## Sanity check
            #print("Lost_Ball.int()=",Lost_Ball.int())  ## Sanity check
            #print("Rewards=", Rewards)  ## Sanity check
            targets = 10*Rewards + Lost_Ball.int() + gamma * (Lost_Ball.int()) * max_next_Q_values.squeeze(dim=1) ### I did not use the Future Rewards option
            #print("Future =", Future_Rewards) ## Sanity check
            #print("targets=",targets) ## Sanity check
            targets = targets.reshape(-1,1)
            Future_Rewards = Future_Rewards.reshape(-1,1)
            #print("targets after reshape=",targets) ## Sanity check
            #targets = targets + Future_Rewards
            #print("targets after reshape=",targets) ## Sanity check
            targets = targets.to(device)
            #targets = 10*torch.ones_like(targets, dtype=torch.float32) ######### Sanity check ####### Make sure to cancel !!!!!!

        # Select the Q-value of the action actually taken in each transition
        Q_values_selected = torch.gather(Q_values, 1, Actions.view(-1, 1))
        #print("Q_values_selected=",Q_values_selected) ## Sanity check

        # Compute the loss
        loss = criterion(targets,Q_values_selected)
        #print("Q_values_selected=", Q_values_selected) ## Sanity check
        #print("diff =",targets - Q_values_selected ) ## Sanity check

        # Perform optimization step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch_start % 50 == 0: model_copy = copy.deepcopy(model).to(device)  # Update target network
        # Print loss
        if batch_start % 50 == 0:
            print("----------------------------------------------------")
            print("batch=", batch_start, "loss=", float(loss.item()))
            print("Q_values_selected[-1]=",Q_values_selected[-1].item(),"targets[-1]=",targets[-1].item(),"max_next_Q_values[-1]=",max_next_Q_values[-1].item())


    return model
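
The target used in `OFFLINE_gradient_descent` can be written out on plain numbers to make the roles of its terms explicit. This is a sketch of the formula on the `targets = ...` line only; the function and parameter names are invented for illustration. Note that `Lost_Ball` is 1 while the ball is in play and 0 on the step where a life is lost, so it acts both as the alive bonus and as the mask on the bootstrap term.

```python
def td_target(reward, alive, max_next_q, gamma=0.9, reward_scale=10.0, alive_bonus=1.0):
    """Target = scaled reward + alive bonus + discounted bootstrap (masked when a life is lost).

    alive is 1 while the ball is in play and 0 on the step where a life is lost.
    """
    return reward_scale * reward + alive_bonus * alive + gamma * alive * max_next_q


# Ordinary step: no brick hit, ball still in play
print(td_target(reward=0.0, alive=1, max_next_q=0.54929518699646))  # about 1.4944

# Life lost: reward is -1 and the bootstrap term is switched off
print(td_target(reward=-1.0, alive=0, max_next_q=0.5))               # -10.0
```

The first call reproduces the `targets[-1]` value printed in the first logged batch of the training output, given the `max_next_Q_values[-1]` shown there.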


# Create the NN Instance

In [None]:
police = Policy_NN()
#police.load_state_dict(torch.load('/content/gdrive/My Drive/DeepLearning/BreakOut/ckpt-100.pk'))
police = police.to(device)

# The Previous Version of the Function that Generates and Stores Memory

In [None]:
def model_choose(model,stacked_state):
      xt = np.array([stacked_state])          ### Enter last 4 states stacked together
      XT = torch.Tensor(xt).transpose(1, 3).transpose(2,3) ### Turn to Tensor and transpose to (1,4,84,84)
      XT = XT.to(device)
      with torch.no_grad():                   ### No gradients needed when only choosing an action
          ZS = model(XT)                      ### ZS shape = (1, 3)
      ### Compare every Q-value with the first one; comparing ZS with ZS[0] is always True
      if torch.all(ZS == ZS[0, 0]).item(): ### In case all Q-values are equal, select randomly
          action = random.randint(0, 2)
      else : action = torch.argmax(ZS).item()
      return action

def print_and_track(printed_content,*args, sep=' ', end='\n'):
    content = sep.join(map(str, args)) + end
    printed_content.append(content)
    print(content, end='')

def play_states(states,sec):
    states = np.array(states)
    for i in range(states.shape[0]):
        cv2_imshow(states[i])
        time.sleep(sec)
        clear_output(wait=True)
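
The action choice used during data generation combines random exploration with a greedy pick that breaks ties randomly. A minimal stand-alone sketch of that idea on a plain list of Q-values is shown below. It is an assumption-laden illustration, not the notebook's code: `epsilon_greedy` is a hypothetical name, and unlike `Create_Data`, which explores only over RIGHT and LEFT, this version explores over all actions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, otherwise a greedy one.

    Ties between equally good actions are broken uniformly at random.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    ties = [index for index, q in enumerate(q_values) if q == best]
    return rng.choice(ties)


print(epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0))  # 1, the greedy action
print(epsilon_greedy([0.5, 0.5, 0.5], epsilon=0.0))  # any of 0, 1, 2
```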

In [None]:
# @title
def Create_Data(model,Explore,length,Epsilon,len_action,Force_good=True):
    printed_content = []
    env = gym.make("BreakoutNoFrameskip-v4") ### Create environment
    reset_state = env.reset()
    Good_Exp, Good_states = [],[]
    for episode in range(50):
        Momentum, life, count, lost_ball, step = 0, 5, 0, 1, 0
        print_and_track(printed_content,"new episode, step =", step)
        Exp, states, episode_rewards, episode_rewards_pos = [],[],[], []
        states.append(reset_state)
        if len(Good_Exp) > length :
            print_and_track(printed_content,"long enough")
            break ### enough

        while step < 2000:
            if step < 3 :
                print_and_track(printed_content,"step =", step)
                action = 0
            else :
                stacked_state = preproces_mini_batch(states[-4:])
                if Momentum >= len_action or step == 3 :
                    ### Choose new action
                    Momentum = 0
                    ### Choose explore or exploit
                    exploration_rate_threshold = random.uniform(0, 1)
                    if exploration_rate_threshold < Epsilon  :
                        action = random.randint(1, 2)  ### Random RIGHT (1) or LEFT (2); FIRE (0) is never explored
                        print_and_track(printed_content,"Random action")
                    else :
                        print_and_track(printed_content,"model choose")
                        action = model_choose(model,stacked_state)
            if lost_ball == 0 :
                action = 0 ## CHECK again
                lost_ball = 1 ## Should I change step to step = 0 ?

            state, reward, done, info = env.step(action+1)
            Momentum += 1 ## CHECK again
            count +=1
            states.append(state)
            episode_rewards_pos.append(reward)
            if info['lives'] < life :
                reward = -1
                lost_ball = 0
                Momentum = 0 #len_action - 1
                print_and_track(printed_content,"life=",info['lives'],"compare_life=",life,"break for lost ball")
            episode_rewards.append(reward)
            ### ADD to memory
            if count == 3 and len(Exp) > 0 :
                print_and_track(printed_content,"add next states in step =", step)
                if Exp[-1][1] == 0 : print("FIRE")
                if Exp[-1][1] == 1 : print("RIGHT")
                if Exp[-1][1] == 2 : print("LEFT")
                time.sleep(2)
                clear_output(wait=True)
                play_states(states[-4:],1)
                Exp[-1][3] = stacked_state
                new_momentum_reward = sum(episode_rewards[-4:])  ### Rewrite Past reward, because it may gained reward while 0<count<3
                Exp[-1][2] = new_momentum_reward
                #### Give Human Feedback
                User_Reward = input('What is the Reward? Reward in[-1,1]\n')
                print("User_Reward=", User_Reward)
                Exp[-1][4] = float(User_Reward)
            if ((Momentum%4 == 0 or lost_ball == 0) and step > 3) or step == 3 :
                count = 0 ## 4 Frames count
                stacked_next_state = preproces_mini_batch(states[-4:])
                print_and_track(printed_content,"add exp in step =", step)
                time.sleep(2)
                clear_output(wait=True)
                play_states(states[-4:],0.5)
                Exp.append([stacked_state, action, reward, stacked_next_state, 0,lost_ball])
                print_and_track(printed_content,"action = ", action)

            if done == True or lost_ball == 0 :
                reset_state = env.reset()
                step = length
                print_and_track(printed_content,"Done")
                break ### No more steps for this episode
            step += 1
        time.sleep(3)
        clear_output(wait=True)
        play_states(states,0.05)
        for item in printed_content:
            print(item, end='')
        total_reward_per_episode = sum(episode_rewards_pos)
        if Force_good == False :
            print_and_track(printed_content,"got it")
            Good_states.extend(states)
            Good_Exp.extend(Exp)
        elif total_reward_per_episode > 0.5 and Force_good == True :
            print_and_track(printed_content,"got it")
            Good_states.extend(states)
            Good_Exp.extend(Exp)
    env.close()                                            ### Close env
    return Good_Exp, Good_states

In [None]:
def Create_Data_no_print(model,Explore,length,Epsilon,len_action,Force_good=True):
    printed_content = []
    env = gym.make("BreakoutNoFrameskip-v4") ### Create environment
    reset_state = env.reset()
    Good_Exp, Good_states = [],[]
    for episode in range(50):
        Momentum, life, count, lost_ball, step = 0, 5, 0, 1, 0
        #print_and_track(printed_content,"new episode, step =", step)
        Exp, states, episode_rewards, episode_rewards_pos = [],[],[], []
        states.append(reset_state)
        if len(Good_Exp) > length :
            #print_and_track(printed_content,"long enough")
            break ### enough

        while step < 2000:
            if step < 3 :
                #print_and_track(printed_content,"step =", step)
                action = 0
            else :
                stacked_state = preproces_mini_batch(states[-4:])
                if Momentum >= len_action or step == 3 :
                    ### Choose new action
                    Momentum = 0
                    ### Choose explore or exploit
                    exploration_rate_threshold = random.uniform(0, 1)
                    if exploration_rate_threshold < Epsilon  :
                        action = random.randint(1, 2)  ### Random RIGHT (1) or LEFT (2); FIRE (0) is never explored
                        #print_and_track(printed_content,"Random action")
                    else :
                        #print_and_track(printed_content,"model choose")
                        action = model_choose(model,stacked_state)
            if lost_ball == 0 :
                action = 0 ## CHECK again
                lost_ball = 1 ## Should I change step to step = 0 ?

            state, reward, done, info = env.step(action+1)
            Momentum += 1 ## CHECK again
            count +=1
            states.append(state)
            episode_rewards_pos.append(reward)
            if info['lives'] < life :
                reward = -1
                lost_ball = 0
                Momentum = 0 #len_action - 1
                #print_and_track(printed_content,"life=",info['lives'],"compare_life=",life,"break for lost ball")
            episode_rewards.append(reward)
            ### ADD to memory
            if count == 3 and len(Exp) > 0 :
                #print_and_track(printed_content,"add next states in step =", step)
                '''if Exp[-1][1] == 0 : print("FIRE")
                if Exp[-1][1] == 1 : print("RIGHT")
                if Exp[-1][1] == 2 : print("LEFT")
                time.sleep(2)
                clear_output(wait=True)
                play_states(states[-4:],1)'''
                Exp[-1][3] = stacked_state
                new_momentum_reward = sum(episode_rewards[-4:])  ### Rewrite Past reward, because it may gained reward while 0<count<3
                Exp[-1][2] = new_momentum_reward
                #### Give Human Feedback
                # Nope.
            if ((Momentum%4 == 0 or lost_ball == 0) and step > 3) or step == 3 :
                count = 0 ## 4 Frames count
                stacked_next_state = preproces_mini_batch(states[-4:])
                '''print_and_track(printed_content,"add exp in step =", step)
                time.sleep(2)
                clear_output(wait=True)
                play_states(states[-4:],0.5)'''
                Exp.append([stacked_state, action, reward, stacked_next_state, 0,lost_ball])
                #print_and_track(printed_content,"action = ", action)

            if done == True or lost_ball == 0 :
                reset_state = env.reset()
                step = length
                #print_and_track(printed_content,"Done")
                break ### No more steps for this episode
            step += 1
        #time.sleep(3)
        #clear_output(wait=True)
       #play_states(states,0.01)
        for item in printed_content:
            print(item, end='')
        total_reward_per_episode = sum(episode_rewards_pos)
        if Force_good == False :
            #print_and_track(printed_content,"got it")
            Good_states.extend(states)
            Good_Exp.extend(Exp)
        elif total_reward_per_episode > 0.5 and Force_good == True :
            #print_and_track(printed_content,"got it","reward=",total_reward_per_episode)
            Good_states.extend(states)
            Good_Exp.extend(Exp)
    env.close()                                            ### Close env
    return Good_Exp, Good_states

In [None]:
G_Exp, G_states = Create_Data_no_print(police,True,600,1,25,Force_good=True)

# The New Version of the Function that Generates and Stores Memory
In the new version I am going to be efficient and use the Atari env wrappers and a large buffer with a maximum size of 1,000,000.
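
As a rough idea of what such a buffer could look like, here is a minimal sketch with a fixed capacity and a warm-up threshold. It is not the C51 implementation: the class name, method names, and the plain `deque` storage are assumptions for illustration, with the 1,000,000 capacity and 80,000 warm-up steps taken from the numbers mentioned at the top of this notebook.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size transition store; the oldest transitions are dropped first."""

    def __init__(self, capacity=1_000_000, learning_starts=80_000):
        self.storage = deque(maxlen=capacity)
        self.learning_starts = learning_starts

    def __len__(self):
        return len(self.storage)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def ready(self):
        """True once enough transitions are stored to begin training."""
        return len(self.storage) >= self.learning_starts

    def sample(self, batch_size):
        """Uniformly sample transitions and regroup them field by field."""
        indices = random.sample(range(len(self.storage)), batch_size)
        batch = [self.storage[i] for i in indices]
        return tuple(zip(*batch))


# Tiny capacity so the eviction behaviour is visible
buffer = ReplayBuffer(capacity=5, learning_starts=3)
for step in range(7):
    buffer.add(step, 0, 0.0, step + 1, False)
states, actions, rewards, next_states, dones = buffer.sample(4)
print(len(buffer), buffer.ready(), sorted(states))
```

In a real run the frames would need to be stored compactly (for example as uint8 arrays), since a million stacked 84x84x4 states is far too large to keep as float tensors.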

In [None]:
### CHECKOUT NP SAVE
#G_Exp = np.array(G_Exp)
#np.save('nice_exp.npy',G_Exp)
#arr_loaded = np.load('nice_exp.npy', allow_pickle= True)

# ***Start Training***

In [None]:
batch =10
m=0.0001

OFFLINE_gradient_descent(police,G_Exp, batch_size=batch, learning_rate=m, gamma=0.9)



len_memory= 493 batch_size= 10
----------------------------------------------------
batch= 0 loss= 10.67123031616211
Q_values_selected[-1]= -1.6344971656799316 targets[-1]= 1.4943656921386719 max_next_Q_values[-1]= 0.54929518699646
----------------------------------------------------
batch= 50 loss= 0.8541105389595032
Q_values_selected[-1]= 1.8050105571746826 targets[-1]= 1.4838006496429443 max_next_Q_values[-1]= 0.5375563502311707
----------------------------------------------------
batch= 100 loss= 2.7141571044921875
Q_values_selected[-1]= 3.615630626678467 targets[-1]= 2.9711811542510986 max_next_Q_values[-1]= 2.1902012825012207


KeyboardInterrupt: ignored

In [None]:
Nice_EXP = []
police = Policy_NN().to(device)  # fresh network, moved to the same device as the training tensors

In [None]:
batch =10
m=0.00000001
epsilon = 0.95
K = 25
for epoch in range(100):

    if epoch%5 == 0 and epoch!=0 : m=m/10  ##Create_Data_no_print(model,Explore,length,Epsilon,len_action,Force_good=True)
    print("*******************************************************\n*******************************************************")
    print("Epoch=",epoch, "LR=",m,"epsilon=",epsilon)
    if epoch%5 == 0 :
      Nice_EXP = []
      New_G_Exp, New_G_states = Create_Data_no_print(police,True,60,epsilon,25,Force_good=True)
      Nice_EXP.extend(New_G_Exp)
    epsilon = 0.99*epsilon
    for inner_epoch in range(10):
        OFFLINE_gradient_descent(police, Nice_EXP, batch_size=batch, learning_rate=m, gamma=0.9)
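
It can help to tabulate the learning rate and epsilon that this loop actually produces. The helper below is a sketch that mirrors the two update rules in the loop (divide the learning rate by 10 at every fifth epoch except epoch 0, multiply epsilon by 0.99 after every epoch); the function name and defaults are illustrative.

```python
def schedules(epoch, lr0=1e-8, eps0=0.95):
    """Learning rate and epsilon in effect at the start of a given epoch."""
    lr = lr0 / 10 ** (epoch // 5)      # one division by 10 per completed block of 5 epochs
    eps = eps0 * 0.99 ** epoch         # one multiplication by 0.99 per completed epoch
    return lr, eps


for epoch in (0, 5, 10, 50, 95):
    lr, eps = schedules(epoch)
    print(f"epoch={epoch:3d}  lr={lr:.1e}  epsilon={eps:.3f}")
```

Starting from 1e-8 and shrinking tenfold every five epochs, the step size becomes vanishingly small very quickly, so it is worth checking whether such a steep decay is what was intended for the later epochs.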

In [None]:
# @title
def creat_video(model,name,Explore,Force_goodnes = False, length = 100,Th = 0.5,K=25,Use = True):
    print("Explore =", Explore)
    # Roll out the given model greedily (Epsilon = 0). Th and Use are kept only so
    # that existing calls keep working; they are not used here.
    Memory , states_tryout = Create_Data_no_print(model,Explore,length,0,K,Force_good=Force_goodnes)

    # Assumption: rewardss / avg_rrs used to come from an older helper (create_more_data)
    # that is no longer in this notebook. Here they are rebuilt from the stored
    # per-transition rewards (index 2 of each memory item) and their mean.
    rewardss = np.array([item[2] for item in Memory])
    avg_rrs = float(rewardss.mean()) if len(rewardss) > 0 else 0.0
    print("Avg =",avg_rrs)

    image_arrays = np.array(states_tryout)

    # Define the video writer
    video_path = 'output_video_{}.mp4'.format(name)
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    video_writer = cv2.VideoWriter(video_path, fourcc, 60.0, (image_arrays[0].shape[1], image_arrays[0].shape[0]))

    # Write each frame to the video file (OpenCV expects BGR, the env returns RGB)
    for image_array in image_arrays:
        video_writer.write(cv2.cvtColor(image_array, cv2.COLOR_RGB2BGR))

    # Release the video writer
    video_writer.release()

    # Display the video in Colab
    with io.open(video_path, 'rb') as video_file:
        encoded = base64.b64encode(video_file.read())
    display(HTML(data='''<video alt="test" controls>
                    <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                </video>'''.format(encoded.decode('ascii'))))
    return Memory,states_tryout,rewardss, avg_rrs

import os
import glob

def delete_videos_by_name(name_pattern):
    video_files = glob.glob(f'output_video_{name_pattern}.mp4')

    for video_file in video_files:
        try:
            os.remove(video_file)
            print(f"Deleted: {video_file}")
        except OSError as e:
            print(f"Error deleting {video_file}: {e}")

# Replace 'your_name_pattern' with the actual name pattern you want to delete
delete_videos_by_name('your_name_pattern')

In [None]:
# @title
Memory,states_tryout,rewardss, avg_rrs = creat_video(police,202, False,False,100,0,25,Use =0)