carbon date truncates arguments with "&" in them #13

Closed
phonedude opened this issue Oct 19, 2017 · 5 comments
Labels
bug

Comments

@phonedude
Member

@phonedude phonedude commented Oct 19, 2017

@phonedude phonedude added the bug label Oct 19, 2017
@HanySalahEldeen
Contributor

@HanySalahEldeen HanySalahEldeen commented Oct 19, 2017

@grantat
Member

@grantat grantat commented Oct 19, 2017

I see what's happening: it's treating the arg1 and arg2 parameters as part of the request to carbondate.cs.odu.edu rather than as part of the specified URI.

The parameters can make a difference when finding mementos for a URI like that one:
http://web.archive.org/web/*/www.cs.odu.edu/foo.cgi = 1 memento
http://web.archive.org/web/*/www.cs.odu.edu/foo.cgi&arg1=1&arg2=2 = 0 mementos

However, for something like youtube.com we definitely need those parameters.
For example, http://carbondate.cs.odu.edu/cd?url=www.youtube.com/watch&v=Tnf_Brn-zdA
is treated as www.youtube.com/watch, which redirects to www.youtube.com,
and that clearly isn't the video we want. We're looking for http://carbondate.cs.odu.edu/cd?url=www.youtube.com/watch?v=Tnf_Brn-zdA.
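
The truncation described above can be reproduced with Python's standard query-string parser. This is only a minimal sketch of the general behavior, not the actual Carbon Date code: the variable names are made up for illustration, and percent-encoding the target is shown as one possible client-side workaround, not as the fix adopted for this issue.

```python
from urllib.parse import parse_qs, quote, urlsplit

# The request from the report: the "&" ends the value of "url".
request = "http://carbondate.cs.odu.edu/cd?url=www.youtube.com/watch&v=Tnf_Brn-zdA"
params = parse_qs(urlsplit(request).query)
# "v" is now a separate parameter of the carbondate request itself,
# so the service only sees url=www.youtube.com/watch.

# Workaround sketch (assumption): percent-encode the whole target URI
# before placing it in the "url" parameter.
target = "www.youtube.com/watch?v=Tnf_Brn-zdA"
encoded = "http://carbondate.cs.odu.edu/cd?url=" + quote(target, safe="")
decoded = parse_qs(urlsplit(encoded).query)
# After decoding, "url" carries the complete target again.
```

With the first request, `params["url"]` is `["www.youtube.com/watch"]`; with the encoded one, `decoded["url"]` is `["www.youtube.com/watch?v=Tnf_Brn-zdA"]`.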

To correct this, I think I'll remove the "/cd?url=" query parameter and create a route such as "/cd/". I'm open to other suggestions as well.

@ibnesayeed
Member

@ibnesayeed ibnesayeed commented Oct 19, 2017

If I remember correctly, when we were discussing the output JSON structure, I also mentioned that this should be brought in line with how other archiving-related services work. They take the URI as the last path parameter, after every significant path prefix in the route. This eliminates the need for explicit URL encoding.
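
The path-based approach could look roughly like the sketch below. It assumes the server can see the raw request target (path plus query string) exactly as the client sent it; `extract_target_uri` and the `/cd/` default are hypothetical names for illustration, not the code merged for this issue.

```python
def extract_target_uri(raw_request_target: str, prefix: str = "/cd/") -> str:
    """Return everything after the route prefix, query string included.

    raw_request_target is assumed to be the unparsed request target,
    e.g. "/cd/www.youtube.com/watch?v=Tnf_Brn-zdA".
    """
    if not raw_request_target.startswith(prefix):
        raise ValueError(f"request target does not start with {prefix!r}")
    # No splitting on "?" or "&": the remainder is the target URI as given.
    return raw_request_target[len(prefix):]


video = extract_target_uri("/cd/www.youtube.com/watch?v=Tnf_Brn-zdA")
cgi = extract_target_uri("/cd/www.cs.odu.edu/foo.cgi?arg1=1&arg2=2")
```

Because nothing after the prefix is parsed as parameters of the service itself, both "?" and "&" in the target survive intact. A fragment ("#...") would still be lost, since clients do not send it to the server.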

@phonedude
Member Author

@phonedude phonedude commented Oct 19, 2017

thanks guys. yes, a structure like:

http://carbondate.cs.odu.edu/cd/www.youtube.com/watch&v=Tnf_Brn-zdA

would be better.

@ibnesayeed
Member

@ibnesayeed ibnesayeed commented Oct 19, 2017

Hey @HanySalahEldeen, it's great to hear from you. Hope you are doing good.

> Correct me if i am wrong, but isn't that a desired behavior? To clean up the url from parameters and find the source?

I think non-significant parameters/protocol/subdomain are removed as part of canonicalization. This is done by most web archives, but we can do canonicalization on our end too, to take advantage of it with non-archival sources. However, in this report, the URL parameters were dropped unintentionally, which is a bug.

@grantat grantat mentioned this issue Oct 19, 2017
@grantat grantat closed this in #14 Oct 21, 2017