
ppo2 performance and gpu utilization #308

Closed
hn2 opened this issue May 6, 2019 · 32 comments
Labels: question (Further information is requested), windows

Comments

hn2 commented May 6, 2019

I am running a PPO2 model. I see high CPU utilization and low GPU utilization.

When running:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

I get:

Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2019-05-06 11:06:02.117760: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-05-06 11:06:02.341488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.92GiB
2019-05-06 11:06:02.348112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-06 11:06:02.838521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-06 11:06:02.842724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-05-06 11:06:02.845154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-05-06 11:06:02.848092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 4641 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8905916217148098349
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 4866611609
locality {
  bus_id: 1
  links {
  }
}
incarnation: 7192145949653879362
physical_device_desc: "device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5"
]

I understand that TensorFlow is "seeing" my GPU.
Why is utilization so low when training a Stable Baselines model?

from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines import PPO2

# multiprocess environment (PortfolioEnv is a custom env defined elsewhere)
n_cpu = 4
env = PortfolioEnv(total_steps=settings['total_steps'], window_length=settings['window_length'], allow_short=settings['allow_short'])
env = SubprocVecEnv([lambda: env for i in range(n_cpu)])

if settings['policy'] == 'MlpPolicy':
    model = PPO2(MlpPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
elif settings['policy'] == 'MlpLstmPolicy': 
    model = PPO2(MlpLstmPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
elif settings['policy'] == 'MlpLnLstmPolicy': 
    model = PPO2(MlpLnLstmPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])

model.learn(total_timesteps=settings['total_timesteps'])

model_name = str(settings['model_name']) + '_' + str(settings['policy']) + '_' + str(settings['total_timesteps']) + '_' + str(settings['total_steps']) + '_' + str(settings['window_length']) + '_' + str(settings['allow_short'])  
model.save(model_name)
hn2 (Author) commented May 6, 2019

I see about 50% CPU utilization with a Core i7 CPU and <= 10% for the GPU.

hill-a (Owner) commented May 6, 2019

My guess is that your environment is too simple. This can cause the GPU and CPU to wait for each other: the CPU is trying to run the environment with high multiprocessing overhead (compared to the load), and then has to wait on the GPU latency for the given batch size.

You are also using a very powerful GPU for a very simple task, hence the 10% load on the GPU.

Just as a side note, what CPU are you using exactly? I'm surprised to see a high-power GPU combined with a 4-threaded i7; are you sure it's not 8 threads?

EDIT: checking Intel ARK for 4-threaded desktop CPUs, none of them are i7, and when switching to laptop CPUs they are all low-power parts for ultrabooks. n_cpu is the number of CPU threads, not the number of CPU cores.
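(As a minimal sketch, one way to pick n_cpu from the machine's logical thread count; multiprocessing.cpu_count() reports hardware threads, not physical cores:)

import multiprocessing

# Logical CPU count (hardware threads, not physical cores); a reasonable
# starting point for the number of SubprocVecEnv workers.
n_cpu = multiprocessing.cpu_count()
print(n_cpu)  # e.g. 12 on a 6-core / 12-thread i7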

hn2 (Author) commented May 6, 2019

This is my PC configuration:
https://www.userbenchmark.com/UserRun/16739440

Also, I tested my portfolio env several times with different instruments and parameters, and the reward never exceeded -0.5, which is weird.

hill-a (Owner) commented May 6, 2019

i7-8700, 6 cores / 12 threads.
Try again with n_cpu = 12.

As for the reward, it's possible these methods simply do not work on your problem. This is still machine learning, and there are no magic bullets unfortunately.

hn2 (Author) commented May 6, 2019

Yes, I tried that with n_cpu = 12. I still see 12 processes spawned in the task manager, with only one using the GPU at very low utilization (~2%). All other processes don't use the GPU at all (0%). As for the reward, the original implementation on GitHub works and is profitable. It doesn't make sense that out of millions of simulation runs not even one is profitable.

hill-a (Owner) commented May 6, 2019

Still I see 12 processes spawned in the task manager with only one using gpu with very low utilization ~ 2%. All other processes don't use gpu at all

That's normal. After the environment steps finish, the worker processes send their data to the master process, which then passes it through the neural network. So only one process is using the GPU, and the rest are simulating your environment on the CPUs. The goal of multi-CPU environments is to reduce the time needed to simulate the environment and run more steps per second to feed the GPU. If the CPUs cannot simulate any faster (either due to a lack of computing power or Amdahl's law), then the GPU will inevitably be slowed down.
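(A minimal sketch of how you could check whether the CPU side is the limit: time the vectorized environment alone with random actions, independent of the network. This assumes the PortfolioEnv construction from the earlier snippet; make_env is just a placeholder name:)

import time
import numpy as np
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env():
    # placeholder: build a fresh PortfolioEnv as in the snippet above
    return PortfolioEnv(total_steps=settings['total_steps'],
                        window_length=settings['window_length'],
                        allow_short=settings['allow_short'])

n_cpu = 12
venv = SubprocVecEnv([make_env for _ in range(n_cpu)])
venv.reset()

n_steps = 1000
start = time.time()
for _ in range(n_steps):
    # random actions, one per worker environment
    actions = np.stack([venv.action_space.sample() for _ in range(n_cpu)])
    venv.step(actions)
print("environment steps/sec: %.0f" % (n_steps * n_cpu / (time.time() - start)))

If this raw throughput is not much higher than what you see during training, the environments (and thus the CPUs) are the bottleneck, not the GPU.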

As to the reward, the original implementation on GitHub works and profitable. It doesn't make sense that out of millions of simulation runs not even one is profitable.

Can you show me a benchmark with a specific method I could compare against, so I can check whether this is an implementation issue for a given method?

hn2 (Author) commented May 6, 2019

Does this mean that I wasted money on the GPU? I cannot use it to accelerate training?

hill-a (Owner) commented May 6, 2019

Does this mean that I wasted money on gpu? I can not use it to accelerate training?

You can throw a bigger network at your problem (by default it is 2 layers of 64); that will use more GPU power and might help convergence.

From the documentation:

from stable_baselines.common.policies import FeedForwardPolicy

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[128, 128, 128],
                                                          vf=[128, 128, 128])],
                                           feature_extraction="mlp")


model = PPO2(CustomPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])

hn2 (Author) commented May 6, 2019

What is pi and what is vf?
What if I want a custom MlpLnLstmPolicy?

hill-a (Owner) commented May 6, 2019

on the documentation page, it says this:

The LstmPolicy can be used to construct recurrent policies in a similar way:

class CustomLSTMPolicy(LstmPolicy):
    def __init__(self, sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm=64, reuse=False, **_kwargs):
        super().__init__(sess, ob_space, ac_space, n_env, n_steps, n_batch, n_lstm, reuse,
                         net_arch=[8, 'lstm', dict(vf=[5, 10], pi=[10])],
                         layer_norm=True, feature_extraction="mlp", **_kwargs)

so:

from stable_baselines.common.policies import LstmPolicy

# Custom LSTM policy: one MLP layer of 8, an LSTM, then three layers of 128 for pi and vf
class CustomPolicy(LstmPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[8, 
                                                     'lstm', 
                                                     dict(pi=[128, 128, 128],
                                                          vf=[128, 128, 128])],
                                           layer_norm=True, feature_extraction="mlp")


model = PPO2(CustomPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])

What is pi and what is vf?

pi is the policy function, vf is the value function (here is a really good write-up if you want to know more about actor-critic models).
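(For illustration only, a sketch of how net_arch splits into shared layers versus separate pi/vf heads; the class name and layer sizes here are arbitrary examples:)

from stable_baselines.common.policies import FeedForwardPolicy

class SharedThenSplitPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        # One shared 64-unit layer, then separate heads: two 64-unit layers
        # for the policy network (pi) and one 64-unit layer for the value
        # function (vf).
        super(SharedThenSplitPolicy, self).__init__(
            *args, **kwargs,
            net_arch=[64, dict(pi=[64, 64], vf=[64])],
            feature_extraction="mlp")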

hn2 (Author) commented May 7, 2019

I am now using a custom policy, but GPU utilization is still very low (< 5%):

# Custom LSTM policy with five MLP layers for both pi and vf
class CustomPolicy(LstmPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                        net_arch=[8, 
                                                    'lstm', 
                                                    dict(pi=[2048, 1024, 512, 256, 128],
                                                         vf=[2048, 1024, 512, 256, 128])],
                                        layer_norm=True, feature_extraction="mlp")


model = PPO2(CustomPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])

hn2 (Author) commented May 7, 2019

Another question: once I have the model trained, how do I use it? Create an observation and use predict? Do I have to step the env?

n_cpu = 12
env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'])
env = SubprocVecEnv([lambda: env for i in range(n_cpu)])

model_name = str(settings['model_name']) + '_' + str(settings['policy']) + '_' + str(settings['total_timesteps']) + '_' + str(settings['total_steps']) + '_' + str(settings['window_length']) + '_' + str(settings['allow_short'])  
model = PPO2.load(model_name)

obs = env.reset()
action, _states = model.predict(obs)
obs, rewards, dones, info = env.step(action)

hill-a (Owner) commented May 10, 2019

I am using now a custom policy but GPU utilization still very low (< 5%)

How is your CPU utilization? At least one of the two will be the bottleneck, and it's not surprising that it's the CPU.
Just for reference, OpenAI used massive CNNs on 128,000 CPUs and 256 GPUs for OpenAI Five, so an MLP on 16 threads will have trouble saturating a GPU. You can benchmark with timing code; most likely you still get a non-negligible speed-up from the GPU.
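(A rough timing sketch along those lines, reusing the CustomPolicy and env from earlier in the thread; uncommenting the CUDA_VISIBLE_DEVICES line before TensorFlow is imported would force a CPU-only run for comparison:)

import os
import time

# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # uncomment to hide the GPU for a CPU-only run

from stable_baselines import PPO2

start = time.time()
model = PPO2(CustomPolicy, env, verbose=0)
model.learn(total_timesteps=100000)
print("wall time: %.1f s" % (time.time() - start))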

Another question, once I have the model trained, how do I use it? Create observation and use predict? Do I have to step the env?

When the model is trained you can simply give it the observations you want predictions for. However, if you are using recurrent networks, you need to pass the state to the predict function:

states = model.initial_state  # get the initial state vector for the recurrent network
dones = np.zeros(states.shape[0])  # set all environments to not done

...

# in your loop
action, _values, states, _neglog = model.predict(obs, states, dones) 
# where obs is the observation you want to use the model on in production

...

hn2 (Author) commented May 10, 2019

I am not sure I understand what the state is. This is my code; how do I construct the obs and state?

### Quantiacs RL
# import necessary Packages below:
import numpy as np
from quantiacsToolbox.quantiacsToolbox import runts
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2
from portfolio import PortfolioEnv


def myTradingSystem(DATE, OPEN, HIGH, LOW, CLOSE, VOL, exposure, equity, settings):
    ''' This system uses trend following techniques to allocate capital into the desired equities'''

    nMarkets = CLOSE.shape[1]
    pos = np.zeros(nMarkets)
    
    instruments = []
    history = np.empty(shape=(len(settings["markets"]), len(OPEN), 5), dtype=np.float)

    instruments = settings["markets"]
    for m in range(len(instruments)):     
        for d in range(len(OPEN)):
            history[m][d] = np.array([OPEN[d,m], HIGH[d,m], LOW[d,m], CLOSE[d,m], VOL[d,m]])

    # write_to_h5py(history, instruments, 'datasets/' + settings['model_name'] + '.h5')

    # multiprocess environment
    n_cpu = 12
    env = PortfolioEnv(history=history, abbreviation=instruments, steps=settings['steps'], window_length=settings['window_length'])
    env = SubprocVecEnv([lambda: env for i in range(n_cpu)])

    print(settings['model_filename'])
    model = PPO2.load(settings['model_filename'])

    obs = env.reset()
    action, _states = model.predict(obs)
    '''
    while True:
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
    '''
    #   weights = pos/np.nansum(abs(pos))

    weights = action
    return weights, settings


def mySettings():
    ''' Define your trading system settings here '''

    settings = {}

    settings['markets'] = ['CASH', 'F_AD', 'F_BP', 'F_CD', 'F_EC', 'F_JY','F_SF', 'F_ND'] 

    settings['lookback'] = 2300 
    settings['budget'] = 10**6
    settings['slippage'] = 0.05
    settings['endInSample'] = '20150101'
    settings['beginInSample'] = '20050101'

    model = 'currencies'

    settings['steps'] = 2000
    settings['window_length'] = 3
    settings['allow_short'] = False 
    settings['total_timesteps'] = 10000000     #   100000000
    settings['model_name'] = model  + '_' + settings['beginInSample'] + '_' + settings['endInSample']
    settings['model_filename'] = model  + '_' + settings['beginInSample'] + '_' + settings['endInSample'] + '_' + str(settings['total_timesteps']) + '_' + str(settings['steps']) + '_' + str(settings['window_length'])   
    #   tensorboard --logdir=tensorboard   tensorboard --logdir=src

    return settings

# Evaluate trading system defined in current file.
if __name__ == '__main__':
    results = runts(__file__)
    #optimize(__file__)

hill-a (Owner) commented May 10, 2019

I am not sure that I understand what state is.

In your case, the state is the LSTM internal state (denoted h_t and c_t)

(LSTM cell diagram, image from here)

This is my code, how do I construct the obs and state?

I already showed you how to construct the initial state:

# initialized here
states = model.initial_state  # get the initial state vector for the recurrent network
dones = np.zeros(states.shape[0])  # set all environments to not done

# updated here
action, _values, states, _neglog = model.predict(obs, states, dones) 

As for the observation, I don't know; this is not my code and I don't understand its usage or purpose. It should be a numpy array with the same shape as the environment's observation space.

hn2 (Author) commented May 10, 2019

OK, got it. Hopefully almost there. One more problem with model = PPO2.load(settings['model_filename']).
The file is there. I also tried with admin privileges, but it doesn't work.

<class 'PermissionError'>
Traceback (most recent call last):
  File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
    position, settings = TSobject.myTradingSystem(*argList)
  File "ppo2_quantiacs_test2.py", line 33, in myTradingSystem
    model = PPO2.load(settings['model_filename'])
  File "c:\users\hanna\stable-baselines\stable_baselines\common\base_class.py", line 550, in load
    data, params = cls._load_from_file(load_path)
  File "c:\users\hanna\stable-baselines\stable_baselines\common\base_class.py", line 361, in _load_from_file
    with open(load_path, "rb") as file:
PermissionError: [Errno 13] Permission denied: 'currencies_20050101_20150101_10000000_2000_3'

hill-a (Owner) commented May 10, 2019

PermissionError: [Errno 13] Permission denied: 'currencies_20050101_20150101_10000000_2000_3'

That's a directory, no?

hn2 (Author) commented May 10, 2019

Hmm... I have both a tensorboard log directory with that name and a .pkl file with the same name.

hn2 (Author) commented May 10, 2019

OK, directory renamed. Now I get:

Traceback (most recent call last):
  File "C:\Users\hanna\Anaconda3\lib\site-packages\quantiacsToolbox\quantiacsToolbox.py", line 871, in runts
    position, settings = TSobject.myTradingSystem(*argList)
  File "ppo2_quantiacs_test2.py", line 41, in myTradingSystem
    action, _values, states, _neglog = model.predict(obs, states, dones)
ValueError: not enough values to unpack (expected 4, got 2)
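(For what it's worth, predict in stable-baselines returns two values, (actions, states), not four, which is why the unpacking fails. A minimal sketch of the two-value call for a recurrent policy, reusing model and obs from above:)

import numpy as np

states = model.initial_state                   # initial LSTM state
dones = np.zeros(states.shape[0])              # no environment is done yet

action, states = model.predict(obs, state=states, mask=dones)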

hn2 (Author) commented May 11, 2019

Also, why is it refusing to accept feature_extraction='cnn'?

hn2 (Author) commented May 12, 2019

obs = env.reset()
states = model.initial_state  # get the initial state vector for the recurrent network
dones = np.zeros(states.shape[0])  # set all environments to not done

print(np.shape(obs))
print(np.shape(states))
print(np.shape(dones))

print(obs)
print(states)
print(dones)

# updated here
# action, _values, states, _neglog = model.predict(obs, states, dones)
action, _states = model.predict(obs, states, dones)

print(action)

Output:

(12, 120)
(12, 512)
(12,)

[[1.000000e+00 1.000000e+00 1.000000e+00 ... 1.057500e+05 1.061250e+05
  1.555900e+04]
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 9.853750e+04 9.980000e+04
  4.457200e+04]
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 1.042000e+05 1.044875e+05
  1.994300e+04]
 ...
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 9.571250e+04 9.615000e+04
  2.808500e+04]
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 9.853750e+04 9.980000e+04
  4.457200e+04]
 [1.000000e+00 1.000000e+00 1.000000e+00 ... 1.054250e+05 1.057750e+05
  1.149000e+04]]

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

[[nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]
 [nan nan nan nan nan nan nan nan]]

hn2 (Author) commented May 12, 2019

ValueError: could not broadcast input array from shape (12,8) into shape (8)

Why is it all nan, and where do the 12 rows come from?

hn2 (Author) commented May 15, 2019

Can anyone help with why predict doesn't work?

hn2 (Author) commented May 17, 2019

Anyone?

hn2 (Author) commented May 18, 2019

As for the GPU utilization problem, I think the Windows performance monitor doesn't show the correct utilization. I tried GPU-Z and it shows 30-60% GPU load.

hn2 (Author) commented May 20, 2019

I figured out that the 12 rows in the action come from the number of CPUs. When I change to n_cpu = 1 I get:
ValueError: Cannot feed value of shape (1, 120) for Tensor 'input/Ob:0', which has shape '(12, 120)'
How do I predict then? How do I combine results from a multiprocess env into one action?

op1490 commented Jun 12, 2019

I am also struggling with this - anyone have any ideas?

troychen728 commented:

I figured out that the 12 rows in the action comes from number of cpu's. When I change to n_cpu = 1 I get:
ValueError: Cannot feed value of shape (1, 120) for Tensor 'input/Ob:0', which has shape '(12, 120)'
How do I predict then? How do I combine results from multiprocess env to one action?

I am struggling with this too. In my case I just created 12 parallel test environments, and the result I get has dimension 12. I just flattened them and treated them as 12 individual test points? I am not sure. Would appreciate it a lot if someone can shed light on this.

araffin (Collaborator) commented Jul 2, 2019

@op1490 @troychen728 for predicting for only one env, you can find a solution here: #166 (comment)
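(The general idea behind that workaround, as I read it, sketched here under the assumption that the recurrent model was trained with 12 environments: pad the single observation up to the training batch size and keep only the first action. Names like n_train_envs and single_env are placeholders:)

import numpy as np

n_train_envs = 12                               # number of envs used during training
obs_single = single_env.reset()                 # one observation, shape (1, obs_dim)
obs_batch = np.tile(obs_single, (n_train_envs, 1))

states = model.initial_state
dones = np.zeros((n_train_envs,))

actions, states = model.predict(obs_batch, state=states, mask=dones)
action = actions[0]                             # only the first row corresponds to the real env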

dbsxdbsx commented:

Hi, I am new here, and I still don't understand how to use the GPU with TensorFlow within stable-baselines. Is the GPU automatically used when tensorflow-gpu is installed correctly?

Miffyli (Collaborator) commented Oct 27, 2019

@dbsxdbsx Yes, if you have tensorflow-gpu installed, then most of the stable-baselines algorithms will use the GPU.
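(A quick way to confirm from Python that the GPU build is actually active, using the TF 1.x API that stable-baselines targets:)

import tensorflow as tf

print(tf.test.is_gpu_available())   # True if TensorFlow can use a GPU
print(tf.test.gpu_device_name())    # e.g. '/device:GPU:0', or '' if none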

dbsxdbsx commented:

Thanks.
