Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed silent failure when no post_date_gmt was present in "post" item #8

Merged
merged 4 commits into from
Nov 2, 2016

Conversation

AlexTelon
Copy link
Contributor

@AlexTelon AlexTelon commented Oct 22, 2016

When trying to convert my wordpres .xml I at first got a silent failure. No output was produced nor were any warnings displayed.

The problem and fix can be tested on the following input .xml example.
Without the fix nothing happens when running:
node bin/wp2ghost.js wordpress.xml

wordpress.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<channel>
  <item>
    <pubDate>Mon, 18 Apr 2011 08:41:09 +0000</pubDate>
    <content:encoded>stuff</content:encoded>
    <wp:post_type>post</wp:post_type>
    <wp:post_id>1</wp:post_id>
  </item>
</channel>

With this fix the output is

{"meta":{"exported_on":1303116069000000,"version":"000"},"data":{"posts":[{"id":4,"title":"Untitled post","slug":"untitled-post","markdown":"stuff>","html":"stuff>","image":null,"featured":false,"page":0,"status":"draft","language":"en_US","meta_title":null,"meta_description":null,"created_at":1303116069000,"created_by":1,"updated_at":1303116069000,"updated_by":1,"published_at":1303116069000,"published_by":1}],"tags":[],"posts_tags":[]}}

which can successfully be imported into ghost.

I did assume that if post_date_gmt and post_date were missing we were out of luck. However we usually have a pubDate field as well which can be parsed as well. But in my case where the post_data_gmt and post_date were missing the pubDates were in a weird format seen below:
<pubDate>Fri, 24146140482 15:36:27 +0000</pubDate>
So the code would be even better if it were made to try to get the date from pubDate when possible. But I did not need that since this only happened for drafts and I did not personally care about correct dates for them since they will get a proper date once posted anyway.

I got the original wordpress.xml from when i used the export function on a squarespace trial account. It should not make a difference, but just in case Im leaving that tidbit here at the end in case someone else has problems and is searching for an answer.

@AlexTelon AlexTelon changed the title Fixed silent failure when no post_date_gmt was present in post Fixed silent failure when no post_date_gmt was present in "post" item Oct 22, 2016
date = item['wp:post_date'];
if (date == null) {
date='0000-00-00 00:00:00'
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix the indentation for this block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doh! I thought I fixed that! Will do!

} else {
date = item['wp:post_date'].match(/(\d{4})-(\d+)-(\d+) (\d+):(\d+):(\d+)/);
date = date.match(/(\d{4})-(\d+)-(\d+) (\d+):(\d+):(\d+)/);
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if/else should not be necessary any more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, did not notice it was the same match thing. Will fix.

@@ -119,10 +119,18 @@ exports.fromStream = function(stream) {

var date;

date = item['wp:post_date_gmt'];
if (date == null) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also include checking for 0000-00-00 00:00:00.

Copy link
Contributor Author

@AlexTelon AlexTelon Oct 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original check for 0000-00-00 00:00:00 was only against post_date_gmt , should it also check post_date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I should change these to (date == undefined) checks. Both null and undefined give same results but consol.log(date) gives undefined so I guess that is a better value to compare against.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

undefined seems fine.

date='0000-00-00 00:00:00'
}
}

if (item['wp:post_date_gmt'] !== "0000-00-00 00:00:00") {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arguably this entire patch could just be replaced with adding a check for && item['wp:post_date_gmt'] !== null here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it depends on if you want to check for if post_dateis not defined as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding && item['wp:post_date_gmt'] !== undefined does help in the case wp:post_date_gmtis not present but wp:post_date is. However if both are missing, as it was in my case the program still crashes silently.

Example of what it crashed on:

    <item>
      <title>Ideas, not to be published!</title>
      <link>/blog/58</link>
      <content:encoded><![CDATA[<p>stuff...</p></content:encoded>
      <excerpt:encoded />
      <wp:post_name>58</wp:post_name>
      <wp:post_type>post</wp:post_type>
      <wp:post_id>39</wp:post_id>
      <wp:status>draft</wp:status>
      <pubDate>Fri, 24146140482 15:36:27 +0000</pubDate>
    </item>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that my original example .xml misstakenly included a post_date_gmt. It should have been without both post_date_gmt and post_date. I guess that why we were not on the same here. My bad.

@jonhoo
Copy link
Owner

jonhoo commented Oct 22, 2016

Including a test case would also be great.

…defined since that is the value date gets when the tags post_date_gmt and post_date are missing
@AlexTelon
Copy link
Contributor Author

I can add testcases when I get a greenlight on the code. Hopefully my second commit is better.

Should I assume that the time is year 0 or should I maybe go for 1970?

@jonhoo
Copy link
Owner

jonhoo commented Oct 23, 2016

The old code did not produce a sensible outcome in that case, so I think either is fine. 0000 is probably semantically more correct (1970 is just a UNIX timestamp artifact). The current change looks fine :)

@AlexTelon
Copy link
Contributor Author

AlexTelon commented Oct 24, 2016

Testcases added in ecf12b4. I added two new items to the same test .xml. I also checked for their existence besides doing simple checks for the fallbacks working as intended (to some degree). I used year 0 BC as discussed. The values that we are checking for for the two dates (year 0 and 2011) were checked against http://www.epochconverter.com/.

@@ -116,5 +116,23 @@ It has multiple lines, some <code>code</code>, and a bit more text.]]></wp:comme
<wp:comment_user_id>0</wp:comment_user_id>
</wp:comment>
</item>

<item>
<!-- An post missing both post_date_gmt and post_date -->
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "A post"

</item>

<item>
<!-- An post missing post_date_gmt but post_date is present -->
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "A post"

describe('Post date fallbacks', function(){
describe('no post_date_gmt or post_date', function(){
it("should return a default year 0000-00-00 00:00:00 BC", function() {
post_missing_both_post_fields.created_at.should.equal(-2211753600000);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This magic number isn't great. Can you use a Date() here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an issue here. First of I just realised that there is no month or day 0. So we should use "0000-01-01 00:00:00 UTC" I guess.

As it so happens Date.UTC does not support years before 1900, well there is still just Date.
To use the normal Date you need to first set a year:
var date = new Date(0,1,1)
and since years 00-99 are defaulted to 19xx (of course) we instead have to use .setFullYear(0); which seems to be implemented for this very purpose.
date.setFullYear(0);

Now typing date into a console directly in chrome gives us:
Tue Feb 01 0 00:00:00 GMT+0100 (W. Europe Standard Time)

So somehow its now 0000-02-01.

Though playing around with date to string functions gives me this:

date.toLocaleDateString()
>"1-02-01" 
date.toLocaleTimeString()
>"00:12:12"
date.toString()
>"Tue Feb 01    0 00:00:00 GMT+0100 (W. Europe Standard Time)"

So Im not sure if its year 0 or 1 right now and if the time is 12 minutes past midnight or not.

It is at this point I find out that there is no year 0 in ISO 8601 so I guess year 1 makes sense then and Jesus was not born on his birthday.

You can however, as a sidenote, write 0000-00-00 and get a date, as I did try a few times. A day or month 0 simply is simply the previous day/month in the last month/year. So by that logic I think year 0000 should default to 1 BC which is also supported by wikipedia

The "basic" format for year 0 is the four-digit form 0000, which equals the historical year 1 BC.
Also, for a fun read there is also this on that same just below:
Several "expanded" formats are possible: −0000 and +0000, as well as five- and six-digit versions. Earlier years are also negative four-, five- or six-digit years, which have an absolute value one less than the equivalent BC year, hence -0001 = 2 BC. Because only ISO 646 (7-bit ASCII) characters are allowed by ISO 8601, the minus sign is represented by a hyphen-minus."

Well knowing that the issue was that there is no year 0 Imma just put in a 1 there.

var date = new Date(1,1,1)
date.setFullYear(0);
date
Tue Feb 01    0 00:00:00 GMT+0100 (W. Europe Standard Time)

Notice Feb thing right there? Yeah, duno about that..

Ok so lets give date the longformat instead, where you even can specify the timezone. Which really looks like this:
Sat Jan 01 0001 00:00:00 UTC

But you can actually just skip the day of the week. which is great since you do not have to check that yourself! However I did check that in python and it says its a Monday, so Im not sure if I should care about this or not, Im hoping that I dont have to so I will just leave it at that!

>>> import time
>>> import datetime
>>> datetime.date(1,1,1)
>>> datetime.date(1,1,1).ctime()
'**Mon** Jan  1 00:00:00 000**1**'

Anyways, we can use that longer format now:

var date = new Date("Sat Jan 01 0001 00:00:00 UTC");
date.setFullYear(0);

Again, even with the long form it messes up the year, this time though to 2001. So it seems that 01 gives you 1901, but 0001 gives you 2001. just a FYI for anyone out there.

Now to get the unix time we use .getTime()

date.getTime()
-62167219200000

which returns you the number of milliseconds so / 1000 and we should be done finally!

unixtime/universal time of midnight january 1st at year 1 AC should be:
-62167219200
Great!

Since I have not come to trust javascripts Date and my understanding of how to use it I really want to make sure this is the correct time. A quick check against http://www.onlineconversion.com/unix_time.htm seems to give the same thing. However looking at the source of the page it uses javascript Date so thats moot for confirmation on the correctness of js Date.

I was thinking of using python for checking, but decided against it.

So leeme just make a quick check manually to see if it is resonable.
60*60*24*365.25 =31557600, seconds in a typical year
(1970 - 1) * 31557600 = 62136914400, 1969 years in between year 1 and 1970.
62167219200-62136914400 = 30304800, seconds we differ
30304800 / 31557600 =0.96, these seconds converted back to years. So we seem to have an off-by-one and then some. I'm fine with that, its late and I need to get to bed.

What I used in the code was -2211753600000 which I got out from http://www.epochconverter.com/ when giving it the time 0000-00-00 00:00:00. Looking at it now I can see that it gives a timestamp in milliseconds and that the "Human time" is 1899 which goes back to the issue mentioned early on, years 00-99 gets year 19xx instead, a 0 in day and month just means we go back in time a bit. Makes sense. Though somehow this site says that it should be the 30th of November, so while it makes sense that it goes backwards to 1899 its baffling why it would be November. Also, come to think about it 0000 got translated to 2001 in an earlier instance but I guess it all depends on how the site implements its things in the backend.

So while I agree that a magic number is not great I would suggest that we change the date to unixtime 0 instead and use:

var date = new Date(1970,01,01)
date.getTime()
> 2674800000

Ok, obviously something is a bit off here..

date
> Sun Feb 01 1970 00:00:00 GMT+0100 (W. Europe Standard Time)

Ok, so why is it Febuary again? As it turns out, when giving time in full javascript format or whatever then the months range from 01-12. But, and this is the fun part, when giving the numbers in any other way its range is 0-11. This makes sense since we are programmers and count from 0.

So lets do that last part again! And I kid you not, I assumed at this point that days would work the same way as months in this format, starting from 0. But no its 1-31. Again this makes sense since we are programmers we count from 0 besides when it would be confusing! asdasdaslfjhsg!

var date = new Date(1970,00,01)
date.getTime() / 1000 // lets convert to seconds.
> -3600

and now we just have my GMT+1 error left. So lets use the longform then.

var date = new Date("1970-01-01 00:00:00 UTC") 
date.getTime() / 1000
> 0

So, finally! My suggestion is then! Let us change the default fallback day to unix time 0. I will replace the other magic number with the corresponding new Date("year-in-longform").getTime / 1000.

Rant done.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the reason I used 0000-00-00 00:00:00 initially was that I actually saw that being used in Wordpress exports. While it's unclear which is semantically more correct to use when importing that into Ghost (given that it's clearly inaccurate), I agree with you that we should instead just go with whatever is easier to implement. UNIX 0 seems fine.

});
describe('no post_date_gmt', function(){
it("should fallback to post_date", function() {
post_missing_both_post_date_gmt.created_at.should.equal(1303116069000);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, try to use Date()

g = data.data;
p = g.posts[0];
post_missing_both_post_fields = g.posts[1];
post_missing_both_post_date_gmt = g.posts[2];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be post_missing_post_date_gmt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yupp, will fix.

when.then(function(data) {
g = data.data;
p = g.posts[0];
post_missing_both_post_fields = g.posts[1];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think post_missing_all_date_fields would be a better name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well there is still a pubDate field?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True.. I wonder if we should add fallback to that too. But that's for another PR. post_missing_post_date_fields?

@jonhoo
Copy link
Owner

jonhoo commented Oct 24, 2016

Thanks! See inline comments. I think we're getting pretty close to merging.

@AlexTelon
Copy link
Contributor Author

Great! Yeah its just some minor things left to do. While the Date thing took much longer than anticipated it was actually quite a fun trip down that rabbithole! 😉

@jonhoo
Copy link
Owner

jonhoo commented Oct 24, 2016

Hehe, yes, it was an interesting read. Maybe you should post #8 (comment) to Hacker News!

@AlexTelon
Copy link
Contributor Author

There we go. Hopefully there are not many issues left here :)

@jonhoo jonhoo merged commit 241e7c0 into jonhoo:master Nov 2, 2016
@jonhoo
Copy link
Owner

jonhoo commented Nov 2, 2016

Looks good to me! Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants