Issue with pages with duplicate date not being generated on multicore PC #1293

Closed
ghost opened this Issue Jul 20, 2015 · 16 comments

ghost commented Jul 20, 2015

There appears to be an issue in hugo that causes pages in a larger blog to be generated (or not) seemingly at random. About 50% of the pages are there, the rest are not, but the reported totals at the end of the run are always the same.

Setting the number of processors in main.go to '1' seems to be a workaround for this problem (which suggests a race condition of some sort, but I'm a total Go newbie so I don't feel qualified to start ferreting it out); with that change the generated content is the same every time the program is run.

If you want to aid reproduction I can supply an archive of the site that has this issue (jacques@mattheij.com).

bep (Member) commented Jul 20, 2015

Hugo is widely used, even for big sites, so I would be surprised if this was an undiscovered issue.

You can start by giving better info.

  • Hugo version?
  • If not a prebuilt Hugo, what Go version used to build?
  • OS in use?
  • Does your PC really have multiple physical CPUs?
  • By "large blog", what do you mean? How large?
ghost commented Jul 20, 2015

Slowly homing in on the cause: it seems to be related to more than one file having the same date stamp. I'm not sure why that would matter, but if I give the files unique 'date:' entries in the front matter then the pages do get created.

I was originally using the pre-built hugo (0.14); I've since switched to building one from scratch and I'm now trying to create a minimal test case that will reproduce this error.

Large: about 300 articles written over the course of a couple of years, total text about 2 MB.

Running on Ubuntu, Go version is 1.3.3.

I meant 'cores', not CPUs, apologies; I've edited the title accordingly.
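To illustrate the trigger as described (filenames and titles are invented for the example), two posts whose front matter shares the same day-resolution date would look like:

```yaml
# content/post/first-post.md (hypothetical)
---
title: "First post"
date: 2011-03-09
---

# content/post/second-post.md (hypothetical)
---
title: "Second post"
date: 2011-03-09    # identical date stamp: the reported trigger
---
```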

@ghost ghost changed the title from more than one cpu in use causes lots of pages not to be generated to more than one core (not cpu, sorry!) in use causes lots of pages not to be generated Jul 20, 2015

@bep bep changed the title from more than one core (not cpu, sorry!) in use causes lots of pages not to be generated to Issue with pages with duplicate date not being generated on multicore PC Jul 20, 2015

ghost commented Jul 20, 2015

I've placed a minimal set of files on http://jacquesmattheij.com/hugo_date.tar.gz that allow the problem to be reproduced. If you run hugo on that dataset and inspect the 'public' directory after the generation there ought to be two generated pages, one in 'review-what-you-fork-on-github', the other in 'your-own-company-you-can-do-it', but that second one does not exist.

If I then change the date on the second one to '2011-03-10' and re-run hugo it does create the second directory and the index.html file.

This seems to happen even with the multi-core support turned off, so while that workaround may have improved things elsewhere, this particular combination is not affected by it.

bep commented Jul 20, 2015

I have built this in a loop many times on my multicore MacBook without hitting your problem. Maybe others can chime in.

On an added note, the date attribute in front matter is meant to be the date and time when this article was written:

http://golang.org/pkg/time/#Time

ghost commented Jul 20, 2015

Really odd... I've changed the date to '2011-03-09 11:00:01' and that also seems to cure the problem, but if I do that across the board I get errors on the date conversion from a module called caste for some of the files, so that's also not a good solution.

After that I moved the test data to another machine and tried it there, and I can reproduce it right away (that's an older box running Fedora 8).

I don't have a Mac here so I can't try it on OS X. Do you have access to a Linux machine? If not I can give you access to one.

bep commented Jul 20, 2015

The Linux box I currently have is one core only. I have limited time to spend on this, so let's just wait for other Linux users to chime in, @digitalcraftsman maybe?

The short version of the date gets parsed OK ... To get the time into play you have to use this format:

    2015-07-16T13:18:54+02:00
ghost commented Jul 20, 2015

Ah OK, that explains that. I've done a lot more work on trying to track this down but my lack of Go knowledge isn't helping. It seems that if I reduce the number of pageRenderer workers, a lot of files that otherwise are missing start to appear; when I reduce it to 1 I have most (but still not all) of the files generated. This is a pretty complex interplay of elements.

Here are the outputs of two subsequent runs; the first is with the multiplier (normally 4) set to '1' and maxprocs also set to '1':

jam@thinkpad:~/blog$ make | grep -i are-the-product
In sourceReader 2012-08-24-you-are-not-the-user-you-are-the-product.markdown
In pageConvertor 2012-08-24-you-are-not-the-user-you-are-the-product.markdown
beginning of renderpages 2012-08-24-you-are-not-the-user-you-are-the-product.markdown
sending 2012-08-24-you-are-not-the-user-you-are-the-product.markdown
pageRenderer 2012-08-24-you-are-not-the-user-you-are-the-product.markdown
RenderAndWritePage page 2012-08-24-you-are-not-the-user-you-are-the-product.markdown /you-are-not-the-customer-you-are-the-product

Then a second run with the multiplier back at 4. As you can see, the 'sending' line is where things derail, but at the top of the function renderpages the data is still in one piece:

jam@thinkpad:~/blog$ make | grep -i are-the-product
In sourceReader 2012-08-24-you-are-not-the-user-you-are-the-product.markdown
In pageConvertor 2012-08-24-you-are-not-the-user-you-are-the-product.markdown
beginning of renderpages 2012-08-24-you-are-not-the-user-you-are-the-product.markdown
jam@thinkpad:~/blog$

So it seems as if somehow the s.Pages structure is affected by the channel magic, but I can't see how that could be the case. Will continue digging, thanks for all your time so far!
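For readers following along, the pipeline in those logs is a classic Go fan-out: a sender feeds a channel that several pageRenderer workers drain. A self-contained sketch of that shape (names borrowed from the debug log, not Hugo's actual code) shows that the channel itself does not lose items as long as the workers only read the shared data:

```go
package main

import (
	"fmt"
	"sync"
)

// renderAll sends pages into a channel and lets N "pageRenderer" workers
// drain it, mirroring the fan-out shape in the logs above. Illustrative only.
func renderAll(pages []string, workers int) int {
	ch := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	rendered := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range ch { // receive until the channel is closed
				mu.Lock()
				rendered++ // stand-in for writing the page to disk
				mu.Unlock()
			}
		}()
	}
	for _, p := range pages {
		ch <- p // the "sending" step from the log
	}
	close(ch)
	wg.Wait()
	return rendered
}

func main() {
	fmt.Println(renderAll([]string{"a.md", "b.md", "c.md"}, 4)) // 3
}
```

The channel mechanics being sound is consistent with the suspicion that the problem lies in what happens to s.Pages before the send, not in the send itself.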

bep commented Jul 20, 2015

Those log statements make no sense without context.

ghost commented Jul 20, 2015

Here is the modified function:

http://pastebin.com/06GvAQKa

The loop at the top shows all the entries in s.Pages before the channels are set up, then during the sending of the pages through the channel the list is shown once more. But not all the entries that are present at the top are still present when they are sent through the channel. I have no idea how they could get lost in between those two points, it looks to me as if there is nothing there that could affect the s.Pages list.

The version in the pastebin is with '1' as the multiplier (line 15), in the original source that is a '4', and when I put 4 there the bug re-appears.

The makefile simply calls hugo like this:

/home/jam/hugo/hugo --logFile="log.out" --theme=sshq

bep commented Jul 20, 2015

OK, I think I can reproduce it on my MacBook, but I had to rebuild the site 1152 times to get there.

If someone else wants to take it for a spin, here is a simple loop:

```bash
#!/usr/bin/env bash

for i in $(seq 1 10000); do
    rm -rf public
    hugo -t sshq
    lc=$(ls -lf public | wc -l)
    if [ "$lc" -ne 12 ]; then
        echo "Failed, got: $lc - $i"
        exit 1
    fi
done
```
bep commented Jul 20, 2015

I think I have found the culprit. Thanks for the report. I will test a little more and push a fix soon.

@bep bep added the Bug label Jul 20, 2015

@bep bep added this to the v0.15 milestone Jul 20, 2015

ghost commented Jul 20, 2015

Super :) That was a pretty rare condition on your machine. Maybe it becomes worse under load? (I'm running quite a bit of stuff in the background on both test machines.) Really curious about what the actual cause was.

bep commented Jul 20, 2015

@tatsushid you know this part of the code better than me. There are some obvious data races in the page sorting department; in the site shown above it fails because both sorting and the Reverse function are used in the single page template.

I have verified that the problem above goes away by disabling Reverse:

bep@a286c1a

But the end solution is not obvious to me. As I see it there are several problems:

  1. The data race could be fixed by some mutex locking, but how and where isn't obvious
  2. How do we handle different "sortings" of the same page set?
  3. On the minor side: In the common use case, we do the same sorting and reversing repeatedly for every page on the site.

I guess the answer to 1 and 2 is mutex locks and replacement of the exported site.Pages with a method ... Or maybe create copies and cache them?

/cc @spf13

ghost commented Jul 21, 2015

Maybe an option would be to create just one extra copy in advance, a reverse-sorted one, and then have the various worker threads pick, read-only, the copy they need.

bep commented Jul 21, 2015

@jacquesmattheij the connected Pull Request should fix this, but needs more thinking and tests (please take it for a spin). A workaround for you would be to drop the ByDate.Reverse. The default sort is by date so you should get the latest first.

ghost commented Jul 21, 2015

Ok, I will try this, thank you!

edit: looking good so far


bep added a commit to bep/hugo that referenced this issue Jul 23, 2015

Fix data races in sorting and Reverse
The custom sort functions used from the templates had some subtle data-race and related issues,
especially when used in the single page template.

This commit fixes this by making copies and protecting the reads and writes with a RWMutex.

The results are cached (it will typically be invoked *number of pages* times with exactly the same data).

This is, not surprisingly, also faster:

```
benchmark                           old ns/op     new ns/op     delta
BenchmarkSortByWeightAndReverse     14228         491           -96.55%

benchmark                           old allocs     new allocs     delta
BenchmarkSortByWeightAndReverse     1              0              -100.00%

benchmark                           old bytes     new bytes     delta
BenchmarkSortByWeightAndReverse     32            0             -100.00%
```

Fixes #1293

@bep bep closed this in a9c5133 Jul 23, 2015

bep added a commit that referenced this issue Jul 23, 2015

Move apply before cache put
It just looks better.

See #1293

tychoish added a commit to tychoish/hugo that referenced this issue Aug 13, 2017

Fix data races in sorting and Reverse
tychoish added a commit to tychoish/hugo that referenced this issue Aug 13, 2017

Move apply before cache put