reduced plotting performance after upgrading from 0.0.5 to 0.1.1 #785

Open
stevekm opened this issue Jul 9, 2021 · 40 comments

@stevekm

stevekm commented Jul 9, 2021

After upgrading from chia_plot version 0.0.5 to 0.1.1 (on Windows), I saw a ~50% decrease in speed on the same system with the same configuration.

command used with 0.0.5:

chia_plot --threads 15 --tmpdir D:\ --tmpdir2 D:\ --farmerkey ... --poolkey ...

command used with 0.1.1 is the same as above except with --contract instead of --poolkey

the D:\ drive here is a single SSD. The default value of 256 buckets was not changed.

I also tested 0.1.1 again with more threads (--threads 30) and only saw minimal improvements. Some estimated average times:

phase   v0.0.5 (15 threads)   v0.1.1 (15 threads)   v0.1.1 (30 threads)
1       1100s                 1700s                 1400s
2       500s                  800s                  700s
3       480s                  1800s                 1800s
4       170s                  170s                  170s
total   2250s                 4470s                 4070s
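
To put a number on each phase, the estimated times above can be turned into slowdown factors. This is only a minimal Python sketch using the figures from this post; the helper name `slowdown` is made up for illustration.

```python
# Estimated phase times in seconds, copied from the figures above.
phases = ["1", "2", "3", "4"]
v005_15t = [1100, 500, 480, 170]   # v0.0.5, 15 threads
v011_15t = [1700, 800, 1800, 170]  # v0.1.1, 15 threads
v011_30t = [1400, 700, 1800, 170]  # v0.1.1, 30 threads


def slowdown(old, new):
    """Return how many times longer `new` takes compared with `old`."""
    return new / old


for name, old, new in zip(phases, v005_15t, v011_15t):
    print(f"phase {name}: {slowdown(old, new):.2f}x")

print(f"total (15 threads): {slowdown(sum(v005_15t), sum(v011_15t)):.2f}x")
print(f"total (30 threads): {slowdown(sum(v005_15t), sum(v011_30t)):.2f}x")
```

With these estimates, phase 3 stands out at 3.75x, while phase 4 is unchanged.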

system:

  • Ryzen 3950X
  • 64GB RAM (4x16GB) 3200MHz
  • Intel P4600 1.8TB SSD (used for both temp1 and temp2; est. 1600MB/s write, 3000MB/s read speeds)

From the looks of the Task Manager, there are a lot of unused system resources while plotting with 0.1.1, whereas with 0.0.5 chia_plot was frequently using nearly 100% of the available CPU time and SSD bandwidth.

I understand that changes have been made since version 0.0.5 that may have improved performance on some Xeon and Threadripper test systems, but it seems to have greatly hurt performance in this case.

Maybe we could get those new upgrades toggled on/off from the command line? As it stands right now, version 0.1.1 is required in order to use the new Chia pooling protocol, which also means all the old plots made with 0.0.5 will need to be re-created under 0.1.1 at this reduced speed.

@aj10017

aj10017 commented Jul 10, 2021

I'm also experiencing similar performance issues. My 3700x with 2x NVME would finish in 3300s-3500s but now it's ~4000s+

@stevekm
Author

stevekm commented Jul 10, 2021

worth noting that at ~75min/plot, I am getting about 19 plots/day, which is actually less than I was able to get with the official Chia plotter (maxed around 26 plots/day), which is not a good situation for mad max plotter to be in as an alternative plotter :(

Previously with version 0.0.5 I was getting about 40 plots/day on this system
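
For reference, the plots/day figures follow directly from the time per plot. A minimal sketch, assuming plots run strictly back to back with no gap between them (the function name is just for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def plots_per_day(seconds_per_plot):
    """Plots finished in 24 hours when plots run back to back (assumption)."""
    return SECONDS_PER_DAY / seconds_per_plot


print(round(plots_per_day(75 * 60), 1))  # 0.1.1 at ~75 min/plot
print(round(plots_per_day(2250), 1))     # 0.0.5 at ~2250 s/plot
```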

@Mattchew86

Mattchew86 commented Jul 10, 2021

I'm experiencing the same issue using a Ryzen 5900x.

Plot hardware:
Ryzen 5900x
128gb ddr 4 3600 ram
2 x 2tb firecuda 520 nvme's in raid 0

Plotter config
22 threads
256 & 128 buckets
t1: Nvme's raid0 drive
t2: ramdisk

Plot times are varying a lot. Anywhere between 40 minutes and 85 minutes. I've tried loads of different settings, it seems like the system is underperforming? I've been trawling the internet and other people with similar configs seem to be having the same issue.

Something weird seems to be going on with the temp directories. Even though -G is not set, it seems to be alternating the drives?

@reubes

reubes commented Jul 10, 2021

make sure to trim your SSDs regularly with 'sudo fstrim -v /mnt/ssdpath/' and also mount them (and any SMR HDDs) with the discard option.

@Mattchew86

make sure to trim your SSDs regularly with 'sudo fstrim -v /mnt/ssdpath/' and also mount them (and any SMR HDDs) with the discard option.

I have trim enabled and have also turned off indexing and the write cache buffer, with CPU priority set to realtime.

I just can't get the system to consistently write 30-40 minute plots.

@madMAx43v3r
Owner

is this all on windows?

@Mattchew86

is this all on windows?

Hey - yes, I am on windows10.

A commenter on issue #786 cut their plotting times in half by installing ubuntu - so looks like a windows issue?

@stotiks
Contributor

stotiks commented Jul 10, 2021

try this version
https://github.com/stotiks/chia-plotter/releases/download/v0.1.1/chia_plot_0.1.2a.zip

@altendky

make sure to trim your SSDs regularly with 'sudo fstrim -v /mnt/ssdpath/' and also mount them (and any SMR HDDs) with the discard option.

When you mount with discard you don't need to explicitly trim.

@Mattchew86

try this version
https://github.com/stotiks/chia-plotter/releases/download/v0.1.1/chia_plot_0.1.2a.zip

Thanks - testing now, will let you know how it goes after first plot

@aj10017

aj10017 commented Jul 10, 2021

try this version
https://github.com/stotiks/chia-plotter/releases/download/v0.1.1/chia_plot_0.1.2a.zip

Giving this a shot as well.

@ditaker

ditaker commented Jul 10, 2021

try this version
https://github.com/stotiks/chia-plotter/releases/download/v0.1.1/chia_plot_0.1.2a.zip

Will try as well because 3 of 4 PCs dropped speed on Windows 10 by approximately 20-60%.

@Mattchew86

Mattchew86 commented Jul 10, 2021

Number of threads: 22
Number of Buckets P1: 256
Number of Buckets P3+P4: 256
n 5

CPU Ryzen 5900x
Ram 128gb
T1: 2 x 2tb nvme firecuda's raid 0
T2: 115gb ram disk

Plot 1
Phase 1: 1018s
Phase 2: 540s
Phase 3: 1075s
Phase 4: 87s
Total Time: 2719s (45 minutes)

Only 1 plot written so far, but doesn't seem much different.

@stotiks
Contributor

stotiks commented Jul 10, 2021

@Mattchew86, maybe NVME overheating or something else
Here are my results with v0.1.1

AMD Ryzen 7 5800X
64GB@3600Mhz
T1: Gigabyte AORUS M.2 Gen4 PCIe X4 NVMe 2TB
T2: Gigabyte AORUS M.2 Gen4 PCIe X4 NVMe 2TB

Crafting plot 67 out of 145
Process ID: 3612
Number of Threads: 16
Number of Buckets P1: 2^9 (512)
Number of Buckets P3+P4: 2^8 (256)
Phase 1 took 994.423 sec
Phase 2 took 425.397 sec
Phase 3 took 504.534 sec, wrote 21872348936 entries to final plot
Phase 4 took 56.2946 sec, final plot size is 108806383894 bytes
Total plot creation time was 1980.75 sec (33.0126 min)
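
When comparing runs like this one, it can help to pull the phase times out of the log automatically. A minimal sketch, assuming the log lines look exactly like the ones pasted above; the regular expressions and the function name `parse_plot_log` are illustrative and not part of the plotter:

```python
import re

# Patterns matching the summary lines shown in the log above.
PHASE_RE = re.compile(r"Phase (\d) took ([\d.]+) sec")
TOTAL_RE = re.compile(r"Total plot creation time was ([\d.]+) sec")


def parse_plot_log(text):
    """Return ({phase_number: seconds}, total_seconds or None) for one plot."""
    phases = {int(n): float(s) for n, s in PHASE_RE.findall(text)}
    total = TOTAL_RE.search(text)
    return phases, (float(total.group(1)) if total else None)


sample = """\
Phase 1 took 994.423 sec
Phase 2 took 425.397 sec
Phase 3 took 504.534 sec, wrote 21872348936 entries to final plot
Phase 4 took 56.2946 sec, final plot size is 108806383894 bytes
Total plot creation time was 1980.75 sec (33.0126 min)
"""

phases, total = parse_plot_log(sample)
for n in sorted(phases):
    print(f"phase {n}: {phases[n]:.0f}s ({phases[n] / total:.0%} of total)")
```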

@Mattchew86

Mattchew86 commented Jul 10, 2021

@stotiks What OS are you on? I'm on Windows10 Pro 64-bit.

I don't think it's linked to the NVME's, as:

  1. The temperature for both drives is showing as 40°C in crystal disk and remains constant throughout the whole plotting process, and they have their own fan and thermal paste.

  2. I have another 1tb WD drive that I use for the OS and I tried that for plotting and no difference.

  3. There seems to be little difference between plotting only using the NVME's vs using T1 & T2 with t2 as a ram disk.

I have had a plot on v0.1.1 that has plotted in around 35 minutes, so something isn't right.

There seems to be an issue at phase 3.

Plot 2
Phase 1: 1079s
Phase 2: 818s
Phase 3: 1480s
Phase 4: 91s
Total Time: 3468s (58 minutes)

@vvavepacket

I am on Ubuntu and I'm experiencing the same issue.

Previous version: 20 mins
Current version: 60 mins

@Mattchew86

@stotiks Just for completeness, here are the results of the first 4 of my plots using v 0.1.2a

Plot 1
Phase 1: 1018s
Phase 2: 540s
Phase 3: 1075s
Phase 4: 87s
Total Time: 2719s (45 minutes)

Plot 2
Phase 1: 1079s
Phase 2: 818s
Phase 3: 1480s
Phase 4: 91s
Total Time: 3468s (58 minutes)

Plot 3
Phase 1: 1195s
Phase 2: 880s
Phase 3: 1419s
Phase 4: 78s
Total Time: 3571s (60 minutes)

Plot 4
Phase 1: 1057s
Phase 2: 709s
Phase 3: 1307s
Phase 4: 79s
Total Time: 3153s (53 minutes)
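
To see which phase varies the most across these four plots, the numbers can be summarized per phase. A minimal Python sketch using only the phase times listed above (the dictionary keys and the `summarize` helper are made up for illustration):

```python
from statistics import mean

# Phase times in seconds for plots 1-4, copied from the results above.
plots = [
    {"p1": 1018, "p2": 540, "p3": 1075, "p4": 87},
    {"p1": 1079, "p2": 818, "p3": 1480, "p4": 91},
    {"p1": 1195, "p2": 880, "p3": 1419, "p4": 78},
    {"p1": 1057, "p2": 709, "p3": 1307, "p4": 79},
]


def summarize(plots):
    """Return {phase: (min, mean, max)} across all plots."""
    summary = {}
    for phase in plots[0]:
        vals = [p[phase] for p in plots]
        summary[phase] = (min(vals), mean(vals), max(vals))
    return summary


for phase, (lo, avg, hi) in summarize(plots).items():
    print(f"{phase}: min {lo}s, mean {avg:.0f}s, max {hi}s, spread {hi - lo}s")
```

On these four runs, phases 2 and 3 show the widest spread (340s and 405s), while phase 4 barely moves.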

@Mattchew86

@vvavepacket @ditaker @aj10017 @stevekm

Is your temp plotting drive in RAID?

@ditaker

ditaker commented Jul 10, 2021

@vvavepacket @ditaker @aj10017 @stevekm

Is your temp plotting drive in RAID?

Not me. I have 4 PCs, all plotting on a 1TB M.2 SSD (speed +-2500). All have 4*8GB RAM. 3 PCs have 20 threads (Intel 10900X) and 1 PC has 8 threads (Intel 9700k, if I'm not mistaken).

So... 3 or 5 days ago they created plots in approximately 5000sec to 7000sec.
Now:
PC with 9700k (weakest): +-6000 sec (more or less unchanged)
PC with 10900x: +-10000sec (was 5000-6000sec on average before)

Maybe I did something wrong, but I updated to Chia 1.2, then downloaded the new plot.exe file and put it in the directory (replacing the old 1.1mb version with the new 1.8mb version).
I didn't change anything else.

No idea why this happens, or why the PCs with more threads and better RAM now work slower than the weakest of the 4 PCs :D.

I will have 4 free hours tomorrow and will try again if nobody with the same problem has found a resolution before then.

Here is the last timing from one PC:
Phase 1: 6000sec
Phase 2: 2166sec
Phase 3: 4992sec
Phase 4: 193sec
Total: 13623sec (before, this PC did it in +-5500sec).
Settings: r -18, u -7
When settings were: r -18, I -8, total time was +-10000sec
PC: 10900X (10 cores, 20 threads), 32GB ram, 1TB M2 SSD +-2500 write/read speed.

@Mattchew86

Mattchew86 commented Jul 10, 2021

Well I swapped to my OS NVME 1tb drive and two plots in a row sub 40 minutes - so that would suggest my NVME raid drive is causing an issue for me - nothing else changed

@stevekm
Author

stevekm commented Jul 10, 2021

@stotiks

try this version
https://github.com/stotiks/chia-plotter/releases/download/v0.1.1/chia_plot_0.1.2a.zip

Unfortunately this version actually runs slower for me, avg 4700s with 15 threads. Phase 3 in particular took avg 1950s.

@st-zelenin

st-zelenin commented Jul 11, 2021

what I see in my logs is that plot creation time remains the same (50 min on my machine), but total time for a single plot (creation + copy) increased (2 hours on my machine). Looks like the process of copying is not dedicated now and next plot creation is suspended until copying of the previous is finished.

Total plot creation time was 3386.29 sec (56.4382 min)
Started copy to F:\plots\plot-k32-2021-07-09-23-19-0abcfa6e2b8105fb92eac0299f274251b3a4aa54169a1f3261db75a196c11866.plot
Copy to F:\plots\plot-k32-2021-07-09-21-38-d80e5d3a9e28fcfdedafa6cbc38132109a9bf48e77d026bc19f3341aff80e5f4.plot finished, took 6174.27 sec, 16.8107 MB/s avg.
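
One way to read these log lines, offered as a rough model rather than a diagnosis: if only one copy runs at a time and each copy overlaps the next plot's creation, the slower of the two stages sets the pace. A minimal Python sketch under that assumption, using only the numbers from the log above (the function name is hypothetical):

```python
def seconds_per_plot(create_s, copy_s):
    """Steady-state time per plot when one copy overlaps the next plot.

    Assumes at most one copy runs at a time and the next copy cannot start
    until the previous one has finished, so the slower stage sets the pace.
    """
    return max(create_s, copy_s)


create_s = 3386.29   # "Total plot creation time was" from the log
copy_s = 6174.27     # "Copy to ... finished, took" from the log
rate_mb_s = 16.8107  # average copy rate reported in the log

print(f"copied ~{copy_s * rate_mb_s:,.0f} MB at {rate_mb_s} MB/s")
print(f"steady state: {seconds_per_plot(create_s, copy_s) / 60:.0f} min per plot")
```

Under this model, a faster destination drive would help only until the copy takes less time than plot creation.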

@vvavepacket

How do we start the file copy process asynchronously, so that it copies the file in the background while the next plot starts immediately?

@chiamaster

How do we start the file copy process asynchronously, so that it copies the file in the background while the next plot starts immediately?

It is already doing that as you describe.
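
For anyone curious what "copy in the background while the next plot starts" looks like in general, here is a minimal Python sketch of the pattern. It is not how chia_plot implements it; `make_plot` is a hypothetical stand-in for the plotting work, and allowing only one copy in flight is an assumption made for this example.

```python
import shutil
import threading


def copy_in_background(src, dst):
    """Start copying src to dst on a worker thread and return the thread."""
    worker = threading.Thread(target=shutil.copyfile, args=(src, dst))
    worker.start()
    return worker


def plot_loop(make_plot, final_dir, count):
    """Create `count` plots, overlapping each copy with the next plot.

    `make_plot` is a hypothetical callable that writes one plot file and
    returns its pathlib.Path; it stands in for the real plotting work.
    """
    pending = None
    for _ in range(count):
        plot_path = make_plot()  # runs while the previous copy is in flight
        if pending is not None:
            pending.join()  # allow only one copy at a time (assumption)
        pending = copy_in_background(plot_path, final_dir / plot_path.name)
    if pending is not None:
        pending.join()  # wait for the last copy before returning
```

With this pattern, a copy that takes longer than plot creation makes the loop wait at `pending.join()`, so a slow destination drive still limits overall throughput.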

@trevoriv

Hopefully this helps someone. I was using the previous version to plot in around 9000 seconds (10 year old i5 2500k), but since using 0.1.1 with the -c option, times slowed to around 13000 - 14000 seconds. Tried a few things including 0.1.2a, which was the same speed, possibly a bit slower.

However, after MS forced a Windows update last night, my speeds are back to close to normal. The last 3 plots using 0.1.1 have been 9600 seconds, 9800 and now 9700, so a little slower than before (10%ish) but close enough for me. Maybe something to do with Windows redistributable packages?

@daveooo11

daveooo11 commented Jul 12, 2021

I am experiencing the same thing on a 5800x across multiple brands of NVMe using the W10 version. Phase 1 and 3 both started spiking in times once I started NFT plotting with stotik's 0.1.1 version. I am not experiencing any heat throttling either. It seems like these times started getting long specifically after I updated.

Specifically, it seems like the P3-2 steps in Phase 3 take way longer now, and my CPU barely crosses 30% and normally sits in the 10-20% range.

Secondly, in Phase 1 the calculation time just got a bit longer, but I do not notice the same CPU % correlation as with the Phase 3-2 instances.

EDIT: I am noticing the CPU drops in phase 1 as well. But they are more drastic in Phase 3-2 instances.

5800X stock
Tomahawk b550
32GB Ram @3200
Firecuda 1TB Gen 4 (Used for all temp writing)

@stevekm
Author

stevekm commented Jul 12, 2021

are we sure this is specific to Windows? Seems like Linux users are also reporting performance drops?

Re: Windows updates; I am running Windows 10 Pro 21H1 with all updates applied and still getting the reduced plot rates

@localh0rst1337

idk what people expect from el cheapo nvme drives - check your nvme saturation in task manager and I promise, it is 100% all the time when the CPU is in 10-20% range. I plot with an enterprise HPE NVMe with 29PB TBW (TLC) ($2000) and it can barely keep up with a 5900X.

EDIT: I am noticing the CPU drops in phase 1 as well. But they are more drastic in Phase 3-2 instances.

5800X stock
Tomahawk b550
32GB Ram @3200
Firecuda 1TB Gen 4 (Used for all temp writing)

@daveooo11

daveooo11 commented Jul 13, 2021

I understand why you might think this if you've been plotting on 2k enterprise grade equipment. But the fact is it has nothing to do with the hardware as nothing has changed. Do you think everyone here just decided to all change out their NVMe drives right when the contract plotter came out thus increasing all their times? The point is even though there were no configuration changes from before and now, timings still increased seemingly for no reason.

And no, my NVMe drives stop being saturated in the same places I mentioned the CPU activity lowers: Phase 1 and the Phase 3-2 iterations. And this happens regardless of whether I slap in the Firecudas I use primarily or the cheapo 60 dollar WD blues I have. All their timings have increased by 40-60% due to Phase 1 and 3-2.

Here's what my NVMe activity looks like specifically in Phase 3-2 iterations since my plotter happened to be in it when I was posting this. It crashes to sometimes single digits. Then ramps back up to 100% in 3-1 iterations. This happens on every NVMe I try.
[Screenshot: NVMe activity during a Phase 3-2 iteration]

And here is what the activity time looks like once it hits Phase 3-1 iterations

[Screenshot: NVMe activity during a Phase 3-1 iteration]

@daveooo11

I have downloaded and tested 0.0.5 as well and the issue is still happening. Now I am suspicious this is related to some kind of Windows update that happened at roughly the same time as the contract plotter that is causing some bottlenecks in the plotter.

To note, my plotter is on 21H1 with the most recent KB updates.

@localh0rst1337

localh0rst1337 commented Jul 13, 2021

I get your point. In terms of different performance between 2 SW versions or Win updates, sure. From my experience, in 99% of what I read about "my CPU is not used as it should be", people forget that there is a limited sustained write rate until a (consumer) flash disk breaks. btw, this drive I have (got it for cheap, would never spend 2k on this) has its limits as well. It's just that with 29PB TBW I probably won't have to buy a new drive ever again ;)

In the first picture, what does the CPU load look like? I can't saturate my 5900X even with this NVMe, so I put in another (cheapo) NVMe I had lying around and am doing 2 plots in parallel with half the threads. The CPU is at 100% all the time and the NVMe's are no longer a bottleneck.

@stevekm
Author

stevekm commented Jul 13, 2021

@localh0rst1337 you'll see from my original post that I'm experiencing this while using a Ryzen 3950X with enterprise grade Intel SSDs. Previously both CPU and SSD reached saturation frequently while plotting. After upgrading the Chia plot software, this is no longer the case.

@daveooo11

daveooo11 commented Jul 13, 2021

@localh0rst1337

I do now understand I was wrong to measure by CPU loads. CPU loads will fluctuate in some phases, not really maxing out other than when needed for heavy calculation phases.

It is a bit odd that the actual plot drive crashes to such low read and write rates now, especially in Phase 3. Like I mentioned in a prior post, the 0.0.5 version I tried is also having the issue on my plotter now. So I am now suspicious it's a Windows-related issue with some kind of update. I have rolled back 21H1 and some KB updates and am trying again.

@localh0rst1337

@localh0rst1337 you'll see from my original post that I'm experiencing this while using a Ryzen 3950X with enterprise grade Intel SSDs. Previously both CPU and SSD reached saturation frequently while plotting. After upgrading the Chia plot software, this is no longer the case.

Yes, this is strange. It might really be a problem with the system itself. Unfortunately I have no comparison because I started plotting with Madmax only after 1.2 came out.

@daveooo11 , maybe the block size in P3 is changing, doing a lot of inserts with a large number of small blocks, creating an immensely high I/O load but comparatively low throughput. In this case, random IOPS matter; forget the (sequential) throughput. It would be interesting to set up monitoring with perfmon (Physical Disk IO, Wait Time, Queue length etc) and then compare it with your CPU load.
After all, this doesn't explain why it was faster before, but you could spot the bottleneck and investigate further.

@endurance1968

endurance1968 commented Jul 15, 2021

if it helps, here is my AMD Ryzen 9 5900X + 128GB 3600 DRAM + 2TB NVMe SSD using a 110GB RAMDISK
Processor is undervolted 1.05V 4200Mhz to increase lifetime, RAM XMP Profile activated.
setting -r 8 -u 256

Phase 1 took 1031.07 sec
Phase 2 took 474.573 sec
Phase 3 took 565.131 sec
Phase 4 took 89.2898 sec
Total plot creation time was 2160.17 sec (36.0029 min)

A setup on a Ryzen 9 3900X with 128GB RAM, but only running at 2400MHz (otherwise unstable :(), takes roughly 50min.
setting -r 6 -u 256

could increase threads but the machines are also used for office work in parallel. all running MM 0.1.1 on windows 10

@andriusst

Guys, it is so easy to confirm if the new version is causing the issue. Just go back to the older version that worked well for you. All older versions are still available for download here on GitHub.
https://github.com/stotiks/chia-plotter/releases

@daveooo11

daveooo11 commented Jul 17, 2021

Guys, it is so easy to confirm if the new version is causing the issue. Just go back to the older version that worked well for you. All older versions are still available for download here on GitHub.
https://github.com/stotiks/chia-plotter/releases

This kind of reply is incredibly ignorant.

1: How can we go back to an older version when only 0.1.1 and up have the NFT plotting? Did previous versions magically get plot to singleton capabilities? No.
2: If you read above, I did go back to a previous version and it still had the same timing issue, leading me to believe it's a Windows issue.

Please read the replies before commenting.

I have booted up a linux VM and passed through 16 vCPUs, 8GB of RAM. I cloned the current linux github and ran the program after reformatting the SSD plot drive I am using to Ext4. The timings are much better. And yes I know linux timings are naturally better. But this is like 100% faster (see times below). Phase one is faster, and Phase 3 hitches no longer happen. The SSD usage is always pegged at 100% unlike on windows where it will drop to 10% or less at times in Phase 3-2 iterations.

So in my case, this is 100% a Windows issue, as the Linux version, even when run through a VM, is kicking its rear. My timing for my setup on Windows is roughly 75 mins. On Linux, even through a VM: 36 mins.

@andriusst

  1. You are troubleshooting the issue, so there is no need to permanently go back. Just do a quick test and get timing results. Process of elimination. But you already had the same ignorant idea, haven't you?
  2. Yes I saw it but looks like other people are still guessing about potential changes in the new plotter version causing this. So it needed repeating. This is about helping everyone, not just you.
    I am not seeing such issue so I came with good intentions to offer help. But if you continue with offensive comments then I got better things to do.

@stevekm
Author

stevekm commented Sep 2, 2021

a small update on this: testing out the latest version (a9a490) on Linux with the same hardware gives much better times:

  • phase 1: 984s
  • phase 2: 453s
  • phase 3: 407s
  • phase 4: 43s
  • total: 1889s (31min)

so it seems like this could definitely be a Windows issue. It's not clear what happened in Windows to cause this. Watching htop with the plotter running, you can see that on Linux it's using much more CPU than Windows was on the latest releases.

@Peacemak3r96

Can somebody explain to me how it is possible to plot 20TB in one day with madmax?

System from @bways021
Asus ROG Strix TRX40-E | AMD Ryzen Threadripper 3970X | 256GB DDR4 2666

https://docs.google.com/spreadsheets/d/14Iw5drdvNJuKTSh6CQpTwnMM5855MQ46/edit#gid=7029096

If yes, could you give me the settings or a tutorial on how to do it?

regards :)
