Speed issues? #53
What is your GPU usage?
GPU usage is like 0-1%. Going from memory, I believe this issue started once I made the changes to vectorize state. So I'm taking my CSV > Table.matrix and then reshaping it. I tested whether the reshaping was taking significant time by using a constant state variable instead, but it doesn't seem to have an effect. Could something be broken in my vectorize state that's causing this? Also, do you have Discord or anything?
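If the per-call reshape really were the bottleneck, one fix is to do the expensive conversion once at load time and have the vectorize function only index into the precomputed array. A minimal sketch of that idea in plain Julia (all names here are illustrative stand-ins, not from the issue's actual code):

```julia
# Sketch of the "convert once" idea: instead of parsing/reshaping inside
# vectorize_state on every call, do the expensive work a single time at
# load and only slice into the result afterwards.
const STATES = let
    raw = collect(Float32, 1:3*3*4)   # stand-in for data parsed from a CSV
    reshape(raw, (3, 3, 4))           # reshape once: 4 states of shape 3×3
end

# A vectorize_state-style lookup then becomes a cheap slice.
vectorize(idx) = STATES[:, :, idx]
```

With this structure, timing `vectorize` directly (e.g. with `@time`) tells you whether state vectorization is actually where the time goes.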
Did you manage to train the connect four agent on your hardware, and how much time did it take?
It seems possible but unlikely. Are you sure the problem is not simply that your states are too big for your hardware?
Unfortunately not. I am currently lacking the time to be present on discussion platforms such as Slack or Discord.
Not fully; the progress bar was filling quite quickly. I confirmed good GPU usage, unsure about CPU.
Hm I'll dig into this more...
Ok I'll try to take a look at this.
Makes sense, given how productive you are :)
This means that the network is outputting something that is not a valid probability distribution. Maybe you should add some print statements to see what's going on. Another source of concern here is the 98.4% redundancy figure, which may indicate that you are not doing enough exploration.
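A cheap way to act on that suggestion is a self-contained sanity check dropped in right after the network's forward pass (plain Julia, no AlphaZero.jl dependency; the function name is mine):

```julia
# Check whether a vector is a valid probability distribution:
# all entries finite, all nonnegative, and summing to ~1.
function valid_distribution(p; atol=1e-5)
    all(isfinite, p) || return false     # catches NaN/Inf leaking out of the net
    all(x -> x >= 0, p) || return false  # no negative probabilities
    return isapprox(sum(p), 1.0; atol)   # normalization
end

valid_distribution([0.2, 0.3, 0.5])   # true
valid_distribution([0.5, NaN, 0.5])   # false: NaN survived the forward pass
valid_distribution([0.9, 0.2])        # false: sums to 1.1
```

Printing the offending vector whenever this returns `false` narrows down whether the problem is NaNs (often exploding gradients) or a normalization bug.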
I fixed it. Would rigging this up to multiple machines increase benchmark speed?
Currently looking quite grim in terms of how many instances I'd need to generate this.
Can you give me the numbers you are getting? Using multiple machines would increase speed, although right now only the self-play data generation is distributed, meaning that you would start hitting diminishing returns after about a dozen machines as training the network becomes the new bottleneck. (This will be improved in future releases, but I am thinking about the best way to do it without adding too much complexity.) If your problem is too complex for the hardware you have, you may want to reformulate it so that you end up with smaller states, or use a smaller neural network architecture (or even faster ML models such as gradient boosted trees).
If I could figure out a way to integrate an LSTM into this, it would cut the data by 1500x, but even then, wouldn't I still need to store that state? I need to research AlphaZero further and get a deep enough understanding to solve this. But in your expert opinion, is there any solution to this 'refeeding' of data problem?
Apparently OpenAI's AlphaZero stores previous states for its calculations; does AlphaZero.jl do the same in any capacity?
I am not sure what you mean by this. Previous states are typically stored in the MCTS tree, but there is no need to send previous states to the network (almost by definition of a state). More generally, I am wondering whether you are trying to apply AlphaZero to a problem where it does not really apply.

Also, here is some advice. Despite many people's efforts to make AlphaZero more accessible (including my own efforts with this library), AlphaZero is not an algorithm you should expect to use on your problems as a black box. Even for using it on simple board games, you will need some deep knowledge of the algorithm so that you can tune the hyperparameters properly without spending too much compute. For more unusual applications, you should be ready to modify the implementation itself so as to generalize some components or integrate domain knowledge to improve sample efficiency.

So my suggestion is to learn more about AlphaZero and make sure that it is the most appropriate algorithm for solving your problem. Here, it may be useful to write a post on Julia's Discourse or an ML forum where you explain your problem clearly and where people can chime in and give suggestions on what approaches make sense.
Think I figured out how to simplify it. It was quite obvious but somehow escaped me. Once again, sorry to be a pain, but I'm struggling with this bug and can't figure out the cause. I've added debug prints to narrow it down, but I can't tell where it's coming from. Perhaps something with vectorize states?
Seemed to figure it out. It doesn't appear to like my 'clever' way of simplifying things.
So yeah, is there any way to continue a game whilst forcing a single move?
Final update: I attempted to add an extra possible move that can always be made, but it doesn't seem to fix the problem. I have no idea why, but this recent logic I added causes it to break; something regarding the actions and the amask.
I think something is busted with state, maybe? Like perhaps some sort of desync? Could this be due to set_state! and the fact that I'm not setting all my variables from this state in set_state!?
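One way to test for exactly that kind of desync is a round-trip check: read the current state, set it on a fresh game, and compare everything the game exposes. A sketch against AlphaZero.jl's GameInterface (`GI`), where `spec` is your game spec and the helper name is mine; this is not runnable on its own since it needs your game implementation:

```julia
using AlphaZero: GI

# Round-trip check: if set_state! forgets to restore some variable,
# the fresh copy will disagree with the original on the state itself,
# the legal-actions mask, or whose turn it is.
function check_set_state(spec, game)
    s = GI.current_state(game)
    fresh = GI.init(spec)
    GI.set_state!(fresh, s)
    @assert GI.current_state(fresh) == s "current_state changed by round trip"
    @assert GI.actions_mask(fresh) == GI.actions_mask(game) "amask desynced"
    @assert GI.white_playing(fresh) == GI.white_playing(game) "turn desynced"
end
```

Calling this after every few random moves of a self-play game tends to surface which variable `set_state!` is not restoring.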
I'm currently running into major speed issues.
![image](https://user-images.githubusercontent.com/78921582/127699325-168dd77e-69d8-4ce5-b5f0-07b21484e715.png)
The progress bar doesn't even appear.
Prints from my game seem to show everything working as intended.
My CPU usage is also quite low, around 8%.
Any tips for tracking down the cause of this?