use dateparser for parsing dates #159

Open
anarcat opened this issue Feb 19, 2019 · 8 comments
@anarcat

anarcat commented Feb 19, 2019

I recently worked with the dateparser module for parsing dates, and it seems it might be better than feedparser at its job. For example, it will support parsing internationalized dates and it can parse the date Tue,19 Feb 2019 00:27:25 GMT (notice the missing space?) or Sun, 15 Feb 2015 00:00:00 (no timezone??), both of which feedparser cannot parse.

I don't have time to patch this myself right now, but I wonder if it might not be as simple as doing:

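try:
    import dateparser

    def dateparser_tuple(string):
        # dateparser.parse() returns None when it cannot parse the string
        parsed = dateparser.parse(string)
        return parsed.utctimetuple() if parsed else None

    feedparser.registerDateHandler(dateparser_tuple)
except ImportError:
    # dateparser is optional; keep feedparser's built-in handlers
    pass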

In my tests, the above successfully parses the times specified above.

As dateparser trickles into various Linux distributions and (hopefully) standardizes, the built-in date parsing code could eventually be replaced by dateparser as well.

Thank you for any advice.

(Note: the issues affected by this are far ranging, and we'd need to see them one by one, but I suspect it could fix #135, #114, #91, #75, aaand #72, so it's totally worth it. :)

@kurtmckee
Owner

@anarcat I completely agree that it would be great to replace a lot of the edge case date parsing with a dedicated date parser (like dateparser). My first thought is that I'd like to keep dedicated parsers for the date formats that are documented in the RSS/Atom specs, but there's a lot of edge case code that could fall back to a dedicated date parser!
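To illustrate the "spec parsers first, generic parser as a fallback" idea, here is a minimal sketch using only the standard library. It is not feedparser's actual code: `parse_rfc822`, `parse_iso8601`, `SPEC_PARSERS`, and `parse_feed_date` are hypothetical names, and the `fallback` argument stands in for whichever dedicated library is chosen (for example a thin wrapper around `dateparser.parse`). Naive datetimes are assumed to be UTC, which is itself a design decision.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_rfc822(text):
    """Parse an RFC 822 style date (the RSS format); return None on failure."""
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):  # exception type depends on Python version
        return None


def parse_iso8601(text):
    """Parse the ISO 8601 subset that datetime.fromisoformat() accepts."""
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


# Hypothetical ordering: formats named by the specs are tried first.
SPEC_PARSERS = (parse_rfc822, parse_iso8601)


def parse_feed_date(text, fallback=None):
    """Return a UTC time.struct_time, or None if nothing could parse text.

    fallback is an optional callable taking a string and returning a
    datetime or None; it is only consulted after the spec parsers fail.
    """
    parsers = SPEC_PARSERS + ((fallback,) if fallback else ())
    for parser in parsers:
        parsed = parser(text)
        if parsed is None:
            continue
        if parsed.tzinfo is None:
            # Assumption: treat naive datetimes as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.utctimetuple()
    return None
```

With this shape, the generic library never overrides a result from the spec parsers; it only sees strings they rejected.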

Quick question: do you have experience with other libraries besides dateparser?

@anarcat
Author

anarcat commented Mar 6, 2019 via email

@kurtmckee
Owner

Okay great! Thank you for the quick response!

@kurtmckee
Owner

@anarcat there's an open merge request, #132, that adds support for dateutil.

Would this work well for your use case? I don't have experience with any of the libraries we're discussing so I'm open to feedback and suggestions, particularly if dateparser or dateutil work better for your use case!

@anarcat
Author

anarcat commented Apr 3, 2019

dateutil is less flexible than dateparser, but might be enough to handle the problematic case. Did anyone check the actual bug reports to see if they are fixed by the patch?

@kurtmckee
Owner

Good point, I haven't done that yet.

kurtmckee added the dates label Sep 1, 2020
@Ash-Crow

Hi,
I wanted to check whether dateparser produces the same results as feedparser's _parse_date function, so I took the samples from the docs, plus the French date format from a feed that made me look into this in the first place. Here is what I found:

  • They had the same result a bit less than half the time (13/29)
  • They made different assumptions and ended up with different dates 8 times, without either of them being wrong: 031231 can be interpreted as March 12, 2031 (dateparser), as December 31, 2003 (feedparser), or even as December 3, 2031. One date, though, is interpreted by dateparser as the year 312, which is unlikely to appear in an RSS feed.
  • feedparser was able to interpret as dates 5 formats that dateparser could not interpret:
    • valid ISO 8601 (yyyy-o)
    • bogus W3CDTF (invalid hour)
    • bogus W3CDTF (invalid minute)
    • bogus W3CDTF (invalid second)
    • bogus (Korean)
  • dateparser was able to interpret as dates 2 formats that feedparser could not interpret:
    • The French date format sample of course
    • but also the "bogus (Hungarian)" format from feedparser documentation.
  • Neither was able to interpret the "bogus RFC 822 (invalid day/month)". Feedparser should have correctly interpreted those two as they come from its documentation.

My test script, as well as the detailed result, is here: https://gist.github.com/Ash-Crow/08bededf6a2d87a0a06453ff7803f355
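The author's actual script is in the gist above. Separately, as a rough sketch of how this kind of tally can be produced, the helper below compares two arbitrary parser callables over a list of samples. The names `compare_parsers` and `_safe` are hypothetical, and the toy parsers in the usage note are placeholders, not feedparser or dateparser.

```python
from collections import Counter


def _safe(parser, text):
    """Call parser(text); treat any exception as 'could not parse'."""
    try:
        return parser(text)
    except Exception:
        return None


def compare_parsers(samples, parse_a, parse_b):
    """Count how often two parsers agree, disagree, or fail on samples."""
    tally = Counter()
    for text in samples:
        a = _safe(parse_a, text)
        b = _safe(parse_b, text)
        if a is None and b is None:
            tally["neither"] += 1
        elif a is None:
            tally["only_b"] += 1
        elif b is None:
            tally["only_a"] += 1
        elif a == b:
            tally["same"] += 1
        else:
            tally["different"] += 1
    return tally
```

For a real comparison, both callables would need to normalize their results to the same type (for instance the first six fields of a UTC time tuple) so that `a == b` is meaningful.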

@guyskk

guyskk commented Jan 20, 2021

I found the published field useful when a feed uses an unexpected format, e.g. <pubDate>1611146768</pubDate>.
If published_parsed is None, I can fall back to my own custom date parser.
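A minimal sketch of that fallback, assuming the entry behaves like a mapping with `published` and `published_parsed` keys, and assuming an all-digit value is a Unix timestamp in seconds. `published_time` is a hypothetical helper name, not part of feedparser.

```python
import time


def published_time(entry):
    """Return a UTC time.struct_time for an entry, or None.

    Prefers the already-parsed value; otherwise tries to read the raw
    published string as Unix epoch seconds (an assumption about the feed).
    """
    parsed = entry.get("published_parsed")
    if parsed is not None:
        return parsed
    raw = (entry.get("published") or "").strip()
    try:
        return time.gmtime(int(raw))
    except (ValueError, OverflowError, OSError):
        # Not an integer, or out of range for the platform.
        return None


# Usage with a plain dict standing in for a parsed feed entry:
entry = {"published": "1611146768", "published_parsed": None}
stamp = published_time(entry)
print(time.strftime("%Y-%m-%d %H:%M:%S", stamp))  # 2021-01-20 12:46:08
```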
