Domain parsing fails with trailing spaces #75

Closed
evanvolgas opened this issue Nov 4, 2015 · 15 comments · Fixed by #76

Comments

@evanvolgas
Contributor

extract should strip trailing spaces from the text values passed to it, e.g.:

>>> tldextract.extract('ads.adiquity.com ')
ExtractResult(subdomain='ads.adiquity', domain='com ', suffix='')
>>> tldextract.extract('ads.adiquity.com   '.strip())
ExtractResult(subdomain='ads', domain='adiquity', suffix='com')
@floer32
Collaborator

floer32 commented Nov 4, 2015

I think this is the calling code's responsibility, not tldextract's. 2 cents.

@evanvolgas
Contributor Author

Perhaps. The lines are a bit blurry here for me. On the one hand, sure, it's the calling code's responsibility. On the other, I doubt my team is the first to stumble on this, or that we'll be the last.

In any event, that's certainly how we will respond to it. But either way I think it deserves to be mentioned here as an issue (perhaps, as you say, one that shouldn't be fixed).

@floer32
Collaborator

floer32 commented Nov 4, 2015

I get you, and I'm glad you opened the ticket. So: assessing whether it should be changed.

I don't think other tools would do this cleaning for you either. tldextract expects to be given a valid domain name, and a domain name with a space isn't valid. Note, for example, that you couldn't make a request to this URL:

>>> requests.get('http://ads.adiquity.com ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 423, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='ads.adiquity.com%20', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10ad50890>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))

Or, setting Python aside, let's try curl:

$ curl 'ads.adiquity.com '
curl: (6) Could not resolve host: ads.adiquity.com

$ curl 'ads.adiquity.com'
<html>
<head><title>302 Found</title></head>
<body bgcolor="white">
<center><h1>302 Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

I think users would generally assume it's their responsibility to clean their input (domain name, URL, what have you). If for no other reason: because other libraries also assume they have cleaned their input.

@floer32
Collaborator

floer32 commented Nov 4, 2015

But maybe tldextract should yell. If there's whitespace, the input is obviously invalid, so maybe an error should be thrown instead of returning a result that might be confusing. I must defer to @john-kurkowski on this question.

@evanvolgas
Contributor Author

I would certainly understand if tldextract threw an exception and said "hey, that isn't a URL." I would also understand if it trimmed its input. But what it's doing now is a little counterintuitive, IMO. Actually, we have noticed it handle non-URLs like:

>>> tldextract.extract('ads.adiquity.com:android')
ExtractResult(subdomain='ads', domain='adiquity', suffix='com')

Which turned out to be useful once, when we had to deal with some bad ones that had :android tacked onto the end of them. But at the same time, now that you mention it, I think I'd actually be more inclined to say it should scream when you throw it something that isn't a URL. I don't know... I also defer to @john-kurkowski here.

@tomhogans

I understand that you wouldn't want tldextract to make assumptions for the user, i.e. cleaning their input for them. I definitely like the idea of raising an exception when tldextract is able to detect an invalid domain being passed as input, however. At the very least, checking for "known" illegal characters could be useful even if the check isn't fully comprehensive.
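To sketch that idea (an illustration only, not anything in tldextract; the function name and the narrow whitespace-only check are assumptions):

import re

def reject_whitespace(url):
    """Hypothetical pre-check: whitespace can never appear in a valid
    URL or hostname, so fail loudly instead of mis-parsing."""
    if re.search(r'\s', url):
        raise ValueError('whitespace in URL: {0!r}'.format(url))

A fuller check could extend this to other characters that are illegal in hostnames, though, as noted, it would be hard to make comprehensive.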

@john-kurkowski
Owner

tldextract stripping whitespace sounds fine to me! I imagine it's a single call to str.strip()? It'd be in line with the library's lenient validation, like how input URLs don't require a scheme.
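For a sense of how small that change could be, here is a hypothetical convenience wrapper (a sketch, not tldextract's actual source; extract_lenient is a made-up name):

import tldextract

def extract_lenient(url):
    """Strip surrounding whitespace before delegating to tldextract."""
    return tldextract.extract(url.strip())

>>> extract_lenient('ads.adiquity.com ')
ExtractResult(subdomain='ads', domain='adiquity', suffix='com')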

@john-kurkowski
Owner

As for the bigger issue of raising errors on obviously bad input, like ads.adiquity.com:android? If it's in line with the conditionals that are already in the code, and not much extra code, tldextract could raise an error, sure.

However in the interest of users being able to get up and running with tldextract, it is lenient with URLs, and it is an explicit non-goal of the library to be a URL validator. See the FAQ.

If valid URLs are important to you, validate them before calling tldextract.

(A validation example would be nice.)
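For example, something along these lines (a sketch only; the regex is a deliberately simplified hostname pattern, not a full RFC validator, and extract_validated is a made-up helper, not part of tldextract):

import re
import tldextract
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2

# Simplified check: dot-separated alphanumeric labels ending in an
# alphabetic TLD. Deliberately not RFC-complete.
HOSTNAME_RE = re.compile(
    r'^([a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$')

def extract_validated(url):
    """Validate the hostname, then delegate to tldextract."""
    # Prepend '//' so scheme-less input still parses as a netloc.
    host = urlsplit(url if '://' in url else '//' + url).hostname or ''
    if not HOSTNAME_RE.match(host):
        raise ValueError('invalid hostname: {0!r}'.format(host))
    return tldextract.extract(url)

With this, extract_validated('ads.adiquity.com ') raises ValueError, while the stripped input parses as expected.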

@evanvolgas
Contributor Author

In keeping with the library's attitude towards URL validation, as documented in the FAQ, it seems to me that stripping the whitespace off of URLs makes more sense than leaving trimming up to the user's pre-validation. As it stands right now, it's pretty easy to get an unexpected result without warning or exception. Stripping the whitespace does make an assumption on behalf of the user... but in the absence of URL validation, I suspect it's better to assume that whitespace should be trimmed.

@mauricioabreu
Contributor

I am not sure we should trim whitespace.
Let's consider the use cases:

  • users can type URLs in a form, and tldextract can be used to extract or even validate the entry (my case)
  • the data can be generated by browsers, devices, or other non-user input

Both cases can give wrong results if we trim whitespace. Users should debug what they sent to the function and understand what we are doing.

@floer32
Collaborator

floer32 commented Nov 11, 2015

@mauricioabreu I think @EvanV is right about keeping with the library's existing attitude. A caller should clean their input before calling, but the library will tolerate a bit of fudgery.

tldextract should never be used to validate URLs in such a broad way, however. Use other tools for that, such as tools from the stdlib 😉 In any case, even if you wanted to validate user-inputted URLs, you should probably do some basic cleaning before validating or using them... what user wants a brittle input field? But this line of thinking starts to get off topic for tldextract...

@mauricioabreu
Contributor

@hangtwenty OK, I agree. And that is exactly my point.
I don't use it to validate URLs, but to validate whether URLs have the right TLDs.
Anyway, I think we should not clean anything, because it can mislead users. But I am not entirely against trimming whitespace. IMHO, I would not even notice this behavior.

@floer32
Collaborator

floer32 commented Nov 12, 2015

Maybe I've lost track of what you're getting at, @mauricioabreu. The consensus in this thread so far was that whitespace should be trimmed, because there is no case where valid input would have whitespace. That is, while the library shouldn't do a bunch of validation that's out of scope (and shouldn't be relied upon for validation), it's really not a problem to call .strip(), because it's cheap, easy, and should shock no one. At this point I cannot think of a case where it would be undesirable to strip whitespace. If you know one, can you articulate it? Otherwise, I'd like to close this issue (after merging the PR, #76).

@evanvolgas
Contributor Author

I'm not sure I follow you either, @mauricioabreu... could you try explaining your line of thinking again? My apologies... I'm just not sure I follow.

On my end, I still think that, as it stands right now, it's easy to get an unexpected result without warning or exception. If tldextract intended to validate URLs (which it doesn't), it would be a different matter... and in the absence of validation, it's very easy to silently get an unexpected result, which I would argue is really the worst thing that can happen. I think trimming whitespace is a better move than what it does right now, though certainly The Right Thing To Do is to sanitize your input.

@mauricioabreu
Contributor

I think we are having some trouble communicating. I am sorry for that.
After reading and thinking more about the problem and its context, it's OK, and I agree with you 👍
