Domain parsing fails with trailing spaces #75

Closed
evanvolgas opened this issue Nov 4, 2015 · 15 comments · Fixed by #76

Comments

@evanvolgas
Contributor

extract should strip trailing spaces from the text values passed to it, e.g.:

>>> tldextract.extract('ads.adiquity.com ')
ExtractResult(subdomain='ads.adiquity', domain='com ', suffix='')
>>> tldextract.extract('ads.adiquity.com   '.strip())
ExtractResult(subdomain='ads', domain='adiquity', suffix='com')
@floer32
Collaborator

floer32 commented Nov 4, 2015

I think this is the calling code's responsibility, not tldextract's. 2 cents.

@evanvolgas
Contributor Author

Perhaps. The lines are a bit blurry here for me. On the one hand, sure, it's the calling code's responsibility. On the other, I doubt my team is the first to stumble on this, or that we'll be the last.

In any event, that's certainly how we will respond to it. But either way I think it deserves to be mentioned here as an issue (perhaps, as you say, one that shouldn't be fixed).

@floer32
Collaborator

floer32 commented Nov 4, 2015

I get you, and I'm glad you opened the ticket. So: assessing whether it should be changed.

I don't think other tools would do this cleaning for you either. tldextract expects to be given a valid domain name, and a domain name with a space isn't valid. Note, for example, that you couldn't make a request to this URL:

>>> requests.get('http://ads.adiquity.com ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 423, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='ads.adiquity.com%20', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x10ad50890>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))

Or, setting Python aside, let's try curl:

$ curl 'ads.adiquity.com '
curl: (6) Could not resolve host: ads.adiquity.com

$ curl 'ads.adiquity.com'
<html>
<head><title>302 Found</title></head>
<body bgcolor="white">
<center><h1>302 Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

I think users would generally assume it's their responsibility to clean their input (domain name, URL, what have you). If for no other reason: because other libraries also assume they have cleaned their input.

@floer32
Collaborator

floer32 commented Nov 4, 2015

But maybe tldextract should yell. If there's whitespace, the input is obviously invalid, so maybe an error should be thrown instead of returning a result that might be confusing. I must defer to @john-kurkowski on this question.

@evanvolgas
Contributor Author

I would certainly understand if tldextract threw an exception and said "hey, that isn't a URL." I would also understand if it trimmed its input. But what it's doing now is a little counterintuitive, IMO. Actually, we have noticed it handle non-URLs like:

>>> tldextract.extract('ads.adiquity.com:android')
ExtractResult(subdomain='ads', domain='adiquity', suffix='com')

Which turned out to be useful once, when we had to deal with some bad ones that had :android tacked onto the end of them. But at the same time, now that you mention it, I think I'd actually be more inclined to say it should scream when you throw it something that isn't a URL. I don't know... I also defer to @john-kurkowski here.

@tomhogans

I understand that you wouldn't want tldextract to make assumptions for the user, i.e. cleaning their input for them. I definitely like the idea of raising an exception when tldextract is able to detect an invalid domain being passed as input, however. At the very least, checking for "known" illegal characters could be useful even if the check isn't fully comprehensive.
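To sketch that idea (an illustration only, not anything in tldextract; the function name and the narrow whitespace-only check are assumptions):

import re

def reject_whitespace(url):
    """Hypothetical pre-check: whitespace can never appear in a valid
    URL or hostname, so fail loudly instead of mis-parsing."""
    if re.search(r'\s', url):
        raise ValueError('whitespace in URL: {0!r}'.format(url))

A fuller check could extend this to other characters that are illegal in hostnames, though, as noted, it would be hard to make comprehensive.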

@john-kurkowski
Owner

tldextract stripping whitespace sounds fine to me! I imagine it's a single call to str.strip()? It'd be in line with the library's lenient validation, like how input URLs don't require a scheme.
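For a sense of how small that change could be, here is a hypothetical convenience wrapper (a sketch, not tldextract's actual source; extract_lenient is a made-up name):

import tldextract

def extract_lenient(url):
    """Strip surrounding whitespace before delegating to tldextract."""
    return tldextract.extract(url.strip())

>>> extract_lenient('ads.adiquity.com ')
ExtractResult(subdomain='ads', domain='adiquity', suffix='com')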

@john-kurkowski
Owner

As for the bigger issue of raising errors on obviously bad input, like ads.adiquity.com:android? If it's in line with the conditionals that are already in the code, and not much extra code, tldextract could raise an error, sure.

However in the interest of users being able to get up and running with tldextract, it is lenient with URLs, and it is an explicit non-goal of the library to be a URL validator. See the FAQ.

If valid URLs are important to you, validate them before calling tldextract.

(A validation example would be nice.)
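For example, something along these lines (a sketch only; the regex is a deliberately simplified hostname pattern, not a full RFC validator, and extract_validated is a made-up helper, not part of tldextract):

import re
import tldextract
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2

# Simplified check: dot-separated alphanumeric labels ending in an
# alphabetic TLD. Deliberately not RFC-complete.
HOSTNAME_RE = re.compile(
    r'^([a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$')

def extract_validated(url):
    """Validate the hostname, then delegate to tldextract."""
    # Prepend '//' so scheme-less input still parses as a netloc.
    host = urlsplit(url if '://' in url else '//' + url).hostname or ''
    if not HOSTNAME_RE.match(host):
        raise ValueError('invalid hostname: {0!r}'.format(host))
    return tldextract.extract(url)

With this, extract_validated('ads.adiquity.com ') raises ValueError, while the stripped input parses as expected.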

@evanvolgas
Contributor Author

In keeping with the library's attitude towards URL validation, as documented in the FAQ, it seems to me that stripping the whitespace off of URLs makes more sense than leaving trimming up to the user's pre-validation. As it stands right now, it's pretty easy to get an unexpected result without warning or exception. Stripping the whitespace does make an assumption on behalf of the user... but in the absence of URL validation, I suspect it's better to assume that whitespace should be trimmed.

@mauricioabreu
Contributor

I am not sure we should trim whitespace.
Let's consider the use cases:

  • users can type URLs in a form, and tldextract can be used to extract or even validate the entry (my case)
  • the data can be generated by browsers, devices, or other non-user input

Both cases can give wrong results if we trim whitespace. Users should debug what they sent to the function and understand what we are doing.

@floer32
Collaborator

floer32 commented Nov 11, 2015

@mauricioabreu I think @EvanV is right about keeping with the library's existing attitude. A caller should clean their input before calling, but the library will tolerate a bit of fudgery.

tldextract should never be used to validate URLs in such a broad way, however. Use other tools for that, such as tools from the stdlib 😉 In any case, even if you wanted to validate user-inputted URLs, you should probably do some basic cleaning before validating or using them... what user wants a brittle input field? But this line of thinking starts to get off topic for tldextract...

@mauricioabreu
Contributor

@hangtwenty OK, I agree. And that is exactly my point.
I don't use it to validate URLs, but to validate whether URLs have the right TLDs.
Anyway, I think we should not clean anything, because it can mislead users. But I am not entirely against trimming whitespace. IMHO, I would not even notice this behavior.

@floer32
Collaborator

floer32 commented Nov 12, 2015

Maybe I've lost track of what you're getting at, @mauricioabreu. The consensus in this thread so far was that whitespace should be trimmed, because there is no case where valid input would have whitespace. That is, while the library shouldn't do a bunch of validation that's out of scope (and shouldn't be relied upon for validation), it's really not a problem to call .strip(), because it's cheap, easy, and should shock no one. At this point I cannot think of a case where it would be undesirable to strip whitespace. If you know one, can you articulate it? Otherwise, I'd like to close this issue (after merging the PR, #76).

@evanvolgas
Contributor Author

I'm not sure I follow you either, @mauricioabreu... could you try explaining your line of thinking again? My apologies... I'm just not sure I follow.

On my end, I still think that, as it stands right now, it's easy to get an unexpected result without warning or exception. If tldextract intended to validate URLs (which it doesn't), it would be a different matter... and in the absence of validation, it's very easy to silently get an unexpected result, which I would argue is really the worst thing that can happen. I think trimming whitespace is a better move than what it does right now, though certainly The Right Thing To Do is to sanitize your input.

@mauricioabreu
Contributor

I think we are having some trouble communicating. I am sorry for that.
After reading and thinking more about the problem and its context, it's OK, and I agree with you 👍
