Add default schema to discovered schemaless links
pushshift committed May 7, 2019
1 parent fda3e87 commit 989ae5a
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions feed_seeker/feed_seeker.py
@@ -316,6 +316,9 @@ def find_internal_links(self):
         possible_links = []
         for link_node in self.soup.find_all('a', href=True):
             link = link_node.get('href')
+            # Sometimes links without schemas are discovered -- this applies a default "http" schema to the discovered link
+            if link.startswith('//'):
+                link = 'http:{}'.format(link)
             parsed_link = urlparse(link)
             if not parsed_link.hostname:
                 parsed_link = parsed_link._replace(netloc=parsed_url.hostname,
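
The effect of the three added lines can be sketched in isolation. This is an illustrative sketch, not code from the repository: the helper name `apply_default_scheme`, its `default_scheme` parameter, and the example.com URLs are all assumptions made for the example. It shows that a protocol-relative link such as `//host/path` parses with a hostname but an empty scheme, and that prefixing `http:` fills in the scheme while leaving other links untouched.

```python
from urllib.parse import urlparse


def apply_default_scheme(link, default_scheme="http"):
    """Hypothetical helper: give protocol-relative links ("//host/path") a scheme."""
    if link.startswith("//"):
        return "{}:{}".format(default_scheme, link)
    return link


# Example hrefs (made up for illustration): protocol-relative, path-only, absolute.
for href in ("//example.com/feed.xml", "/feed.xml", "https://example.com/rss"):
    before = urlparse(href)
    after = urlparse(apply_default_scheme(href))
    print(href, "| scheme before:", repr(before.scheme), "| scheme after:", repr(after.scheme))
```

Path-only links such as `/feed.xml` pass through unchanged, so in the commit they would still reach the existing `if not parsed_link.hostname:` branch.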
