Added fallback time extraction engine #135

Merged: 11 commits, Nov 8, 2020
9 changes: 9 additions & 0 deletions facebook_scraper/extractors.py
@@ -203,6 +203,15 @@ def extract_time(self) -> PartialPost:
            except (KeyError, ValueError):
                continue

        try:
            date = utils.parse_date(element_full_text=self.element.full_text)
            if date:
                return {
                    'time': date
                }
        except:
            return None

        return None

    def extract_user_id(self) -> PartialPost:
35 changes: 35 additions & 0 deletions facebook_scraper/utils.py
@@ -1,7 +1,10 @@
import codecs
import re
from datetime import datetime
from typing import Optional
from urllib.parse import parse_qsl, unquote, urlencode, urljoin, urlparse, urlunparse

import dateparser
from html2text import html2text as _html2text
from requests_html import DEFAULT_URL, Element, PyQuery

@@ -43,3 +46,35 @@ def make_html_element(html: str, url=DEFAULT_URL) -> Element:

def html2text(html: str) -> str:
    return _html2text(html)


date = r"Jan(?:uary)?|" \
r"Feb(?:ruary)?|" \
r"Mar(?:ch)?|" \
r"Apr(?:il)?|" \
r"May|" \
r"Jun(?:e)?|" \
r"Jul(?:y)?|" \
r"Aug(?:ust)?|" \
r"Sep(?:tember)?|" \
r"Oct(?:ober)?|" \
r"Nov(?:ember)?|" \
r"Dec(?:ember)?|" \
r"Yesterday|" \
r"Today"
hour = r"\d{1,2}"
minute = r"\d{2}"
period = r"AM|PM"
exact_time = fr"({date}) at {hour}:{minute} ({period})"
relative_time = r"\d{1,2} \w+"
@kevinzg (Owner), Nov 8, 2020:

This pattern seems too loose, since it is matched against the text of the whole post and not just a date.
For example, with the earlier date "April 3, 2018 at 8:02 PM", it matched "18 at" and returned 2020-11-18 00:00.
What exactly is this trying to match?
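
The false match described in this comment can be reproduced with the standard library alone. This is a minimal sketch: the patterns are copied from the diff, while the helper name `first_time_match` is made up for illustration, and `dateparser` is left out so that only the regex step is shown.

```python
import re

# Patterns as they appear in the diff.
date = (
    r"Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?|Yesterday|Today"
)
hour = r"\d{1,2}"
minute = r"\d{2}"
period = r"AM|PM"
exact_time = fr"({date}) at {hour}:{minute} ({period})"
relative_time = r"\d{1,2} \w+"
time_regex = re.compile(fr"({exact_time}|{relative_time})")


def first_time_match(text):
    """Hypothetical helper: return the substring time_regex selects, or None."""
    match = time_regex.search(text)
    return match.group(0) if match else None


print(first_time_match("April 3, 2018 at 8:02 PM"))  # prints: 18 at
```

Here `exact_time` expects " at" immediately after the month name, so it never matches a date that carries a day and a year. The looser `relative_time` alternative then picks up the first one- or two-digit number followed by a space and a word, which is "18 at".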

Owner:

According to a previous comment it could be 16 hrs or 16h so maybe:

Suggested change
- relative_time = r"\d{1,2} \w+"
+ relative_time = r"\b\d{1,2}(?:h| hrs)"
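
To see what the suggested pattern changes, the two versions can be compared side by side. This is only a sketch: the names `loose`, `tight`, and `find` are illustrative and not part of the PR.

```python
import re

loose = re.compile(r"\d{1,2} \w+")          # pattern in the PR
tight = re.compile(r"\b\d{1,2}(?:h| hrs)")  # suggested replacement


def find(pattern, text):
    """Hypothetical helper: return the first match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


for sample in ("16 hrs", "16h", "April 3, 2018 at 8:02 PM"):
    print(repr(sample), "->", find(loose, sample), "|", find(tight, sample))
```

The tighter pattern accepts "16 hrs" and "16h" and no longer matches anything in the absolute date. It would also reject the spelled-out form "16 hours ago", so that case would need its own alternative if it occurs in practice.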

Contributor Author:

The only date I found is in the entire post text. I have only seen three forms: dates with an explicit month, day of month, and time; dates that use 'Today' or 'Yesterday' plus a time; and purely relative times (16 hours ago; hours is the only unit I found).
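
The three forms listed in this comment could each get their own pattern. What follows is only a sketch under stated assumptions: every name in it (`MONTH`, `CLOCK`, `ABSOLUTE`, `NAMED_DAY`, `RELATIVE`, `FALLBACK_RE`, `find_time_text`) is hypothetical and not part of the PR, the optional year is inferred from the example date earlier in the thread, and accepting "hrs" and "h" next to "hours" follows the variants mentioned in the review.

```python
import re
from typing import Optional

# All names below are illustrative assumptions, not code from the PR.
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
    r"Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
CLOCK = r"\d{1,2}:\d{2} (?:AM|PM)"

# 1. Explicit month and day, optional year, then a time.
ABSOLUTE = fr"{MONTH} \d{{1,2}}(?:, \d{{4}})? at {CLOCK}"
# 2. 'Today' or 'Yesterday' followed by a time.
NAMED_DAY = fr"(?:Today|Yesterday) at {CLOCK}"
# 3. Relative time in hours, with an optional trailing 'ago'.
RELATIVE = r"\b\d{1,2} ?(?:hours?|hrs|h)\b(?: ago)?"

FALLBACK_RE = re.compile(fr"{ABSOLUTE}|{NAMED_DAY}|{RELATIVE}")


def find_time_text(text: str) -> Optional[str]:
    """Return the first date-like substring of text, or None."""
    match = FALLBACK_RE.search(text)
    return match.group(0) if match else None


print(find_time_text("April 3, 2018 at 8:02 PM"))
print(find_time_text("Lunch Today at 12:30 PM was great"))
print(find_time_text("16 hours ago"))
```

Anchoring each alternative to a concrete shape keeps ordinary numbers in the post body from being mistaken for a date, at the cost of missing any format that has not been observed yet.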

@kevinzg (Owner), Nov 9, 2020:

> The only date I found is in the entire post text

I guess this is a reply to my suggestion of using self.element.find('abbr', first=True). If there are cases where the date is not inside that tag, then it could fall back to looking in the entire text.
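
The ordering proposed here (try the abbr tag first, then the whole post) might look roughly like the sketch below. It is standard-library only and makes several assumptions: `TIME_RE` is a simplified stand-in pattern, `locate_time_text` is a hypothetical helper, and the caller is assumed to have already pulled the tag's text out of the element (for example via the `self.element.find('abbr', first=True)` call mentioned above), passing `None` when the tag is absent.

```python
import re
from typing import Optional

# Simplified stand-in pattern for illustration; not the regex from the PR.
TIME_RE = re.compile(
    r"(?:Today|Yesterday) at \d{1,2}:\d{2} (?:AM|PM)|\b\d{1,2}(?:h| hrs)"
)


def locate_time_text(abbr_text: Optional[str], full_text: str) -> Optional[str]:
    """Search the abbr text first, then fall back to the whole post text."""
    for candidate in (abbr_text, full_text):
        if not candidate:
            continue  # tag missing or empty: try the next source
        match = TIME_RE.search(candidate)
        if match:
            return match.group(0)
    return None


# The abbr text wins even when the post body contains other numbers.
print(locate_time_text("16 hrs", "We sold 18 at the fair"))
# No abbr text: fall back to scanning the whole post.
print(locate_time_text(None, "Posted Yesterday at 9:15 PM by the page"))
```

Searching the narrow source first means the loose full-text scan only runs when nothing better is available, which limits the false matches discussed earlier in the thread.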


time_regex = re.compile(fr"({exact_time}|{relative_time})")


def parse_date(element_full_text: str) -> Optional[datetime]:
    time_match = time_regex.search(element_full_text)
    if time_match:
        time = time_match.group(0)
        return dateparser.parse(time)
    else:
        return None
1 change: 1 addition & 0 deletions requirements-dev.txt
@@ -223,3 +223,4 @@ yarl==1.4.2; python_version >= "3.6" \
zipp==3.1.0; python_version < "3.8" \
--hash=sha256:aa36550ff0c0b7ef7fa639055d797116ee891440eac1a56f378e2d3179e0320b \
--hash=sha256:c599e4d75c98f6798c509911d08a22e6c021d074469042177c8c86fb92eefd96
dateparser~=1.0.0
4 changes: 4 additions & 0 deletions requirements.txt
@@ -92,3 +92,7 @@ websockets==8.0.2 \
--hash=sha256:f5cb2683367e32da6a256b60929a3af9c29c212b5091cf5bace9358d03011bf5 \
--hash=sha256:049e694abe33f8a1d99969fee7bfc0ae6761f7fd5f297c58ea933b27dd6805f2 \
--hash=sha256:882a7266fa867a2ebb2c0baaa0f9159cabf131cf18c1b4270d79ad42f9208dc5

html2text~=2020.1.16
requests~=2.24.0
dateparser~=1.0.0