Failed to parse URL correctly #38

ninoseki · 2020-02-14T04:03:39Z

A URL surrounded by Japanese characters is not parsed correctly: the trailing characters are included in the extracted URL.

import iocextract

print(list(iocextract.extract_urls('『http://example.com』あああああ')))
# => ['http://example.com』あああああ']

# Expected: ['http://example.com']

I'm not sure how to fix it, but checking the TLD might work well.
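As a rough illustration of the TLD-checking idea, here is a minimal sketch that trims a candidate string to an ASCII-only host and accepts it only if the host ends in a known TLD. This is not iocextract's actual implementation: `trim_url`, `KNOWN_TLDS`, and the regular expression are hypothetical names made up for this example, and the TLD set is a tiny placeholder rather than the full IANA list. It also deliberately ignores ports and internationalized (non-ASCII) hostnames.

```python
import re

# Hypothetical, deliberately tiny allowlist; a real check would load the IANA TLD list.
KNOWN_TLDS = {"com", "org", "net", "jp"}

# Scheme, an ASCII-only dotted hostname, then an optional ASCII path/query/fragment.
# Anything outside printable ASCII (such as the closing bracket U+300F) ends the match.
_URL_RE = re.compile(
    r"https?://"
    r"(?P<host>[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
    r"(?:[/?#][\x21-\x7e]*)?"
)


def trim_url(candidate):
    """Return the leading ASCII URL in `candidate`, or None if its TLD is unknown."""
    match = _URL_RE.match(candidate)
    if match is None:
        return None
    # The TLD is the last dot-separated label of the matched host.
    tld = match.group("host").rsplit(".", 1)[-1].lower()
    if tld not in KNOWN_TLDS:
        return None
    return match.group(0)
```

Under these assumptions, `trim_url('http://example.com』あああああ')` would return `'http://example.com'`, because the host match stops at the first character that cannot appear in an ASCII hostname.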

cmmorrow · 2020-05-22T01:38:31Z

Hello @ninoseki, I'll take a look at this and see if I can adjust the regular expression to get this to work.

cmmorrow · 2020-05-23T01:49:24Z

I think I have a solution. This works:

echo "『http://example.com』インコ\u1f99c" | python iocextract.py
http://example.com

cmmorrow self-assigned this May 22, 2020

cmmorrow added this to To do in Issues via automation May 22, 2020

cmmorrow linked a pull request May 23, 2020 that will close this issue

Bugfix Issue 38 #45

Merged

cmmorrow closed this as completed in #45 Jul 9, 2020

Issues automation moved this from To do to Done Jul 9, 2020
