-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed silent failure when no post_date_gmt was present in "post" item #8
Conversation
date = item['wp:post_date']; | ||
if (date == null) { | ||
date='0000-00-00 00:00:00' | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix the indentation for this block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doh! I thought I fixed that! Will do!
} else { | ||
date = item['wp:post_date'].match(/(\d{4})-(\d+)-(\d+) (\d+):(\d+):(\d+)/); | ||
date = date.match(/(\d{4})-(\d+)-(\d+) (\d+):(\d+):(\d+)/); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if
/else
should not be necessary any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, did not notice it was the same match thing. Will fix.
@@ -119,10 +119,18 @@ exports.fromStream = function(stream) { | |||
|
|||
var date; | |||
|
|||
date = item['wp:post_date_gmt']; | |||
if (date == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also include checking for 0000-00-00 00:00:00
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original check for 0000-00-00 00:00:00
was only against post_date_gmt
, should it also check post_date
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I should change these to (date == undefined) checks. Both null and undefined give same results but consol.log(date) gives undefined so I guess that is a better value to compare against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
undefined
seems fine.
date='0000-00-00 00:00:00' | ||
} | ||
} | ||
|
||
if (item['wp:post_date_gmt'] !== "0000-00-00 00:00:00") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguably this entire patch could just be replaced with adding a check for && item['wp:post_date_gmt'] !== null
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it depends on if you want to check for if post_date
is not defined as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding && item['wp:post_date_gmt'] !== undefined
does help in the case wp:post_date_gmt
is not present but wp:post_date
is. However if both are missing, as it was in my case the program still crashes silently.
Example of what it crashed on:
<item>
<title>Ideas, not to be published!</title>
<link>/blog/58</link>
<content:encoded><![CDATA[<p>stuff...</p></content:encoded>
<excerpt:encoded />
<wp:post_name>58</wp:post_name>
<wp:post_type>post</wp:post_type>
<wp:post_id>39</wp:post_id>
<wp:status>draft</wp:status>
<pubDate>Fri, 24146140482 15:36:27 +0000</pubDate>
</item>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that my original example .xml misstakenly included a post_date_gmt. It should have been without both post_date_gmt and post_date. I guess that why we were not on the same here. My bad.
Including a test case would also be great. |
…defined since that is the value date gets when the tags post_date_gmt and post_date are missing
I can add testcases when I get a greenlight on the code. Hopefully my second commit is better. Should I assume that the time is year 0 or should I maybe go for 1970? |
The old code did not produce a sensible outcome in that case, so I think either is fine. 0000 is probably semantically more correct (1970 is just a UNIX timestamp artifact). The current change looks fine :) |
Testcases added in ecf12b4. I added two new items to the same test .xml. I also checked for their existence besides doing simple checks for the fallbacks working as intended (to some degree). I used year 0 BC as discussed. The values that we are checking for for the two dates (year 0 and 2011) were checked against http://www.epochconverter.com/. |
@@ -116,5 +116,23 @@ It has multiple lines, some <code>code</code>, and a bit more text.]]></wp:comme | |||
<wp:comment_user_id>0</wp:comment_user_id> | |||
</wp:comment> | |||
</item> | |||
|
|||
<item> | |||
<!-- An post missing both post_date_gmt and post_date --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "A post"
</item> | ||
|
||
<item> | ||
<!-- An post missing post_date_gmt but post_date is present --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "A post"
describe('Post date fallbacks', function(){ | ||
describe('no post_date_gmt or post_date', function(){ | ||
it("should return a default year 0000-00-00 00:00:00 BC", function() { | ||
post_missing_both_post_fields.created_at.should.equal(-2211753600000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This magic number isn't great. Can you use a Date()
here instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have an issue here. First of I just realised that there is no month or day 0. So we should use "0000-01-01 00:00:00 UTC" I guess.
As it so happens Date.UTC does not support years before 1900, well there is still just Date.
To use the normal Date you need to first set a year:
var date = new Date(0,1,1)
and since years 00-99 are defaulted to 19xx (of course) we instead have to use .setFullYear(0); which seems to be implemented for this very purpose.
date.setFullYear(0);
Now typing date into a console directly in chrome gives us:
Tue Feb 01 0 00:00:00 GMT+0100 (W. Europe Standard Time)
So somehow its now 0000-02-01.
Though playing around with date to string functions gives me this:
date.toLocaleDateString()
>"1-02-01"
date.toLocaleTimeString()
>"00:12:12"
date.toString()
>"Tue Feb 01 0 00:00:00 GMT+0100 (W. Europe Standard Time)"
So Im not sure if its year 0 or 1 right now and if the time is 12 minutes past midnight or not.
It is at this point I find out that there is no year 0 in ISO 8601 so I guess year 1 makes sense then and Jesus was not born on his birthday.
You can however, as a sidenote, write 0000-00-00 and get a date, as I did try a few times. A day or month 0 simply is simply the previous day/month in the last month/year. So by that logic I think year 0000 should default to 1 BC which is also supported by wikipedia
The "basic" format for year 0 is the four-digit form 0000, which equals the historical year 1 BC.
Also, for a fun read there is also this on that same just below:
Several "expanded" formats are possible: −0000 and +0000, as well as five- and six-digit versions. Earlier years are also negative four-, five- or six-digit years, which have an absolute value one less than the equivalent BC year, hence -0001 = 2 BC. Because only ISO 646 (7-bit ASCII) characters are allowed by ISO 8601, the minus sign is represented by a hyphen-minus."
Well knowing that the issue was that there is no year 0 Imma just put in a 1 there.
var date = new Date(1,1,1)
date.setFullYear(0);
date
Tue Feb 01 0 00:00:00 GMT+0100 (W. Europe Standard Time)
Notice Feb thing right there? Yeah, duno about that..
Ok so lets give date the longformat instead, where you even can specify the timezone. Which really looks like this:
Sat Jan 01 0001 00:00:00 UTC
But you can actually just skip the day of the week. which is great since you do not have to check that yourself! However I did check that in python and it says its a Monday, so Im not sure if I should care about this or not, Im hoping that I dont have to so I will just leave it at that!
>>> import time
>>> import datetime
>>> datetime.date(1,1,1)
>>> datetime.date(1,1,1).ctime()
'**Mon** Jan 1 00:00:00 000**1**'
Anyways, we can use that longer format now:
var date = new Date("Sat Jan 01 0001 00:00:00 UTC");
date.setFullYear(0);
Again, even with the long form it messes up the year, this time though to 2001. So it seems that 01 gives you 1901, but 0001 gives you 2001. just a FYI for anyone out there.
Now to get the unix time we use .getTime()
date.getTime()
-62167219200000
which returns you the number of milliseconds so / 1000 and we should be done finally!
unixtime/universal time of midnight january 1st at year 1 AC should be:
-62167219200
Great!
Since I have not come to trust javascripts Date and my understanding of how to use it I really want to make sure this is the correct time. A quick check against http://www.onlineconversion.com/unix_time.htm seems to give the same thing. However looking at the source of the page it uses javascript Date so thats moot for confirmation on the correctness of js Date.
I was thinking of using python for checking, but decided against it.
So leeme just make a quick check manually to see if it is resonable.
60*60*24*365.25 =31557600
, seconds in a typical year
(1970 - 1) * 31557600 = 62136914400
, 1969 years in between year 1 and 1970.
62167219200-62136914400 = 30304800
, seconds we differ
30304800 / 31557600 =0.96
, these seconds converted back to years. So we seem to have an off-by-one and then some. I'm fine with that, its late and I need to get to bed.
What I used in the code was -2211753600000 which I got out from http://www.epochconverter.com/ when giving it the time 0000-00-00 00:00:00. Looking at it now I can see that it gives a timestamp in milliseconds and that the "Human time" is 1899 which goes back to the issue mentioned early on, years 00-99 gets year 19xx instead, a 0 in day and month just means we go back in time a bit. Makes sense. Though somehow this site says that it should be the 30th of November, so while it makes sense that it goes backwards to 1899 its baffling why it would be November. Also, come to think about it 0000 got translated to 2001 in an earlier instance but I guess it all depends on how the site implements its things in the backend.
So while I agree that a magic number is not great I would suggest that we change the date to unixtime 0 instead and use:
var date = new Date(1970,01,01)
date.getTime()
> 2674800000
Ok, obviously something is a bit off here..
date
> Sun Feb 01 1970 00:00:00 GMT+0100 (W. Europe Standard Time)
Ok, so why is it Febuary again? As it turns out, when giving time in full javascript format or whatever then the months range from 01-12. But, and this is the fun part, when giving the numbers in any other way its range is 0-11. This makes sense since we are programmers and count from 0.
So lets do that last part again! And I kid you not, I assumed at this point that days would work the same way as months in this format, starting from 0. But no its 1-31. Again this makes sense since we are programmers we count from 0 besides when it would be confusing! asdasdaslfjhsg!
var date = new Date(1970,00,01)
date.getTime() / 1000 // lets convert to seconds.
> -3600
and now we just have my GMT+1 error left. So lets use the longform then.
var date = new Date("1970-01-01 00:00:00 UTC")
date.getTime() / 1000
> 0
So, finally! My suggestion is then! Let us change the default fallback day to unix time 0. I will replace the other magic number with the corresponding new Date("year-in-longform").getTime / 1000.
Rant done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the reason I used 0000-00-00 00:00:00
initially was that I actually saw that being used in Wordpress exports. While it's unclear which is semantically more correct to use when importing that into Ghost (given that it's clearly inaccurate), I agree with you that we should instead just go with whatever is easier to implement. UNIX 0 seems fine.
}); | ||
describe('no post_date_gmt', function(){ | ||
it("should fallback to post_date", function() { | ||
post_missing_both_post_date_gmt.created_at.should.equal(1303116069000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, try to use Date()
g = data.data; | ||
p = g.posts[0]; | ||
post_missing_both_post_fields = g.posts[1]; | ||
post_missing_both_post_date_gmt = g.posts[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be post_missing_post_date_gmt
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yupp, will fix.
when.then(function(data) { | ||
g = data.data; | ||
p = g.posts[0]; | ||
post_missing_both_post_fields = g.posts[1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think post_missing_all_date_fields
would be a better name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well there is still a pubDate field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True.. I wonder if we should add fallback to that too. But that's for another PR. post_missing_post_date_fields
?
Thanks! See inline comments. I think we're getting pretty close to merging. |
Great! Yeah its just some minor things left to do. While the Date thing took much longer than anticipated it was actually quite a fun trip down that rabbithole! 😉 |
Hehe, yes, it was an interesting read. Maybe you should post #8 (comment) to Hacker News! |
There we go. Hopefully there are not many issues left here :) |
Looks good to me! Thank you! |
When trying to convert my wordpres .xml I at first got a silent failure. No output was produced nor were any warnings displayed.
The problem and fix can be tested on the following input .xml example.
Without the fix nothing happens when running:
node bin/wp2ghost.js wordpress.xml
wordpress.xml:
With this fix the output is
which can successfully be imported into ghost.
I did assume that if post_date_gmt and post_date were missing we were out of luck. However we usually have a pubDate field as well which can be parsed as well. But in my case where the post_data_gmt and post_date were missing the pubDates were in a weird format seen below:
<pubDate>Fri, 24146140482 15:36:27 +0000</pubDate>
So the code would be even better if it were made to try to get the date from pubDate when possible. But I did not need that since this only happened for drafts and I did not personally care about correct dates for them since they will get a proper date once posted anyway.
I got the original wordpress.xml from when i used the export function on a squarespace trial account. It should not make a difference, but just in case Im leaving that tidbit here at the end in case someone else has problems and is searching for an answer.