Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added fallback time extraction engine #135

Merged
merged 11 commits into from
Nov 8, 2020
Merged

Added fallback time extraction engine #135

merged 11 commits into from
Nov 8, 2020

Conversation

TheMulti0
Copy link
Contributor

@TheMulti0 TheMulti0 commented Nov 5, 2020

Solves #134, where for some reason time is not always correctly extracted, therefore I have added a fallback way of extracting time (if the current one fails), by trying to extract it from the element's text.

As of right now, it successfully works with two different type of dates:
Example 1: October 28 11:58 AM
Example 2: 9 hrs (9 hours ago, the example was only tested with hours)

The engine uses the following regex:
(\w+ \d{2} at \d{2}:\d{2} (AM|PM))|(\d{1,2} \w+)
with a segment for example 1 (\w+ \d{2} at \d{2}:\d{2} (AM|PM)) and for example 2 ((\d{1,2} \w+).

In order to easily parse dates, I have added a dependency on the package dateparser (added to requirements.txt and requirements-dev.txt)

Even though not all example of dates have been tested yet, the regex will probably be able to extract the different undiscovered date templates and let the date parser parse it automatically.

Copy link

@balazssandor balazssandor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't work in cases:

-> 'time': dateparser.parse(time)
(Pdb) time
'13 at'
(Pdb) dateparser.parse(time)
datetime.datetime(2020, 11, 13, 0, 0)

dateparser is not able to find the date from 13 at string

This was the date appearance from where it failed:
image

if time_match:
time = time_match.group(0)
return {
'time': dateparser.parse(time)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understood the problem, you tried to scrap a post with the date 'October 13 at 9:44 PM', or did you send the string 'at 13' to dateparser?

I would love seeing the original post if so.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that the matched group was ASFOROctober 13 at 8:44 PM, including a part of the page name or title and that's how (\d{1,2} \w+) ended up finding 13 at
I came with this ugly regex fix
time_regex = re.compile(r"((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d{1,2} at \d{1,2}:\d{2} (AM|PM))|(\d{1,2} \w+)")

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added also regex for Yesterday at 12:30 PM-like matching also:
((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d{1,2} at \d{1,2}:\d{2} (AM|PM))|Yesterday at \d{1,2}:\d{2} (AM|PM)|(\d{1,2} \w+)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thks balazssandor, there is also has Today, so will be this:
((Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \d{1,2} at \d{1,2}:\d{2} (AM|PM))|Today at \d{1,2}:\d{2} (AM|PM)|Yesterday at \d{1,2}:\d{2} (AM|PM)|(\d{1,2} \w+)

@TheMulti0
Copy link
Contributor Author

@evoup @balazssandor Thanks for the help!

I added the new regex to utils.py, in the latest commit (also extracted it a little bit to make it more readable), mind checking it out?

@valterartur
Copy link

valterartur commented Nov 6, 2020

Dateparser is not threadsafe, so I think it's not a really good realization. This issue is still open on GitHub. I have got the same errors with your way.

@TheMulti0
Copy link
Contributor Author

Thanks for the issue, I will implement custom date parsing with the desired formats.

I have got the same errors with your way.

What errors did you get with 'my way'? Thread safe errors?

@valterartur
Copy link

Thanks for the issue, I will implement custom date parsing with the desired formats.

I have got the same errors with your way.

What errors did you get with 'my way'? Thread safe errors?

I have got a problem described in the referenced issue ticket.

@baraths92
Copy link

@evoup @balazssandor Thanks for the help!

I added the new regex to utils.py, in the latest commit (also extracted it a little bit to make it more readable), mind checking it out?

Thank you. I tried it and it worked. Thanks a lot once again. The original author should consider fixing this in his code.

@TheMulti0
Copy link
Contributor Author

There are multiple different date formats we need to parse, since as @valterartur stated, using dateparser is not trival.

  1. Oct 16 at 11:00 PM
  2. October 16 at 11:00 PM
  3. Yesterday at 11:00 PM
  4. Today at 11:00 PM
  5. 16 hrs (16 hours ago)
  6. 16h (16 hours ago)

@kevinzg
Copy link
Owner

kevinzg commented Nov 7, 2020

I think this is alright.
The only issue seems to be that it isn't thread safe, however anyone that needs a thread safe date parsing function could monkeypatch utils.parse_date, so I'm inclined to merge it with the dateparser library unless someone suggests a better option.

@TheMulti0
Copy link
Contributor Author

I think this is alright.
The only issue seems to be that it isn't thread safe, however anyone that needs a thread safe date parsing function could monkeypatch utils.parse_date, so I'm inclined to merge it with the dateparser library unless someone suggests a better option.

I agree, this change is a fallback solution and the benefits are bigger than the downsides (the thread safety bug).
Parsing for all of the different 5 formats I specified isn't extremely complicated, it can be implemented if there is a need.

Let's merge!

@TheMulti0
Copy link
Contributor Author

Hi, can we please merge?

facebook_scraper/utils.py Outdated Show resolved Hide resolved
hour = r"\d{1,2}"
minute = r"\d{2}"
period = r"AM|PM"
exact_time = f"({date}) at {hour}:{minute} ({period})"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I think it shouldn't capture the date or period.

Suggested change
exact_time = f"({date}) at {hour}:{minute} ({period})"
exact_time = f"(?:{date}) at {hour}:{minute} (?:{period})"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand, why again shouldn't the whole date and am\pm be captured?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because those groups are not being used.

In parse_datetime only the group 0 is being retrieved, which is whole matched string.
It would make sense to capture every element of the date separately if we were parsing the date ourselves, but we are just passing the whole string to dateparser.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks!

minute = r"\d{2}"
period = r"AM|PM"
exact_time = f"({date}) at {hour}:{minute} ({period})"
relative_time = r"\d{1,2} \w+"
Copy link
Owner

@kevinzg kevinzg Nov 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit vague since it is being matched with the text from the whole post and not just a date.
For example, with the previous date April 3, 2018 at 8:02 PM, it was matching with 18 at and returning 2020-11-18 00:00.
What exactly is this trying to match?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to a previous comment it could be 16 hrs or 16h so maybe:

Suggested change
relative_time = r"\d{1,2} \w+"
relative_time = r"\b\d{1,2}(?:h| hrs)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only date I found is in the entire post text, and I only saw dates that have explicit month and day of month specification and a time, days that have 'Today' or 'Yesterday' and a time, and just relative time dates (16 hours ago, hours is the only case I found)

Copy link
Owner

@kevinzg kevinzg Nov 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only date I found is in the entire post text

I guess this is a reply to my suggestion of using self.element.find('abbr', first=True). If there are cases where the date is not inside that tag, then it could fallback to look in the entire text.

TheMulti0 and others added 3 commits November 8, 2020 21:54
Co-authored-by: Kevin Zúñiga <kevin.zun@gmail.com>
Co-authored-by: Kevin Zúñiga <kevin.zun@gmail.com>
Co-authored-by: Kevin Zúñiga <kevin.zun@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants