carbon date truncates arguments with "&" in them #13

Closed
phonedude opened this Issue Oct 19, 2017 · 5 comments

4 participants
@phonedude
Member

phonedude commented Oct 19, 2017

@phonedude phonedude added the bug label Oct 19, 2017

HanySalahEldeen Oct 19, 2017

Contributor

grantat Oct 19, 2017

Member

I see what's happening: it's counting those arg1 and arg2 parameters as query parameters of the carbondate.cs.odu.edu request rather than as part of the URI specified.

The parameters can make a difference in finding mementos for something like that URI:
http://web.archive.org/web/*/www.cs.odu.edu/foo.cgi = 1 memento
http://web.archive.org/web/*/www.cs.odu.edu/foo.cgi&arg1=1&arg2=2 = 0 mementos

However, for something like youtube.com we definitely need those parameters.
For example, http://carbondate.cs.odu.edu/cd?url=www.youtube.com/watch&v=Tnf_Brn-zdA
is read as www.youtube.com/watch, which redirects to www.youtube.com,
and that clearly isn't the video we want. We're looking for http://carbondate.cs.odu.edu/cd?url=www.youtube.com/watch?v=Tnf_Brn-zdA.

To correct this, I think I'll remove the "/cd?url=" query parameter and create a route such as "/cd/". I'm open to other suggestions as well.
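
The truncation described above can be reproduced with a standard query-string parser. This is a minimal sketch using Python's `urllib.parse`, not the Carbon Date code itself; the helper names `naive_target` and `encoded_request` are hypothetical, and the target URI with `?arg1=1&arg2=2` is an illustrative assumption based on the URIs in this thread.

```python
from urllib.parse import parse_qs, quote, urlsplit

BASE = "http://carbondate.cs.odu.edu/cd?url="


def naive_target(request_url: str) -> str:
    """Return the `url` parameter as a standard query-string parser sees it."""
    params = parse_qs(urlsplit(request_url).query)
    return params["url"][0]


def encoded_request(target: str) -> str:
    """Build a request whose target URI is fully percent-encoded."""
    return BASE + quote(target, safe="")


# Illustrative target URI (an assumption, not taken from the original report).
target = "www.cs.odu.edu/foo.cgi?arg1=1&arg2=2"

# Unencoded: the parser splits on "&", so arg2 becomes a parameter of /cd.
print(naive_target(BASE + target))            # www.cs.odu.edu/foo.cgi?arg1=1

# Percent-encoded: the whole target survives as one value.
print(naive_target(encoded_request(target)))  # www.cs.odu.edu/foo.cgi?arg1=1&arg2=2
```

Percent-encoding avoids the problem but puts the burden on every caller, which is the motivation for moving the URI into the path instead.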

ibnesayeed Oct 19, 2017

Member

If I remember correctly, when we were discussing the output JSON structure, I also mentioned that this should be brought in line with how other archiving-related services work. They take the URI as the last path parameter, after any significant path prefixes in the route. This eliminates the need for explicit URL encoding.
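
One way to picture the path-based approach: the server takes everything after the route prefix, then re-attaches the raw query string, so the target URI arrives intact without encoding. The sketch below is framework-independent and is not the code merged for this issue; the function name `target_from_request` and the default `/cd/` prefix handling are assumptions for illustration.

```python
from urllib.parse import urlsplit


def target_from_request(request_url: str, prefix: str = "/cd/") -> str:
    """Recover the target URI from a path-style request (hypothetical helper)."""
    parts = urlsplit(request_url)
    if not parts.path.startswith(prefix):
        raise ValueError(f"path does not start with {prefix!r}")
    # Everything after the prefix belongs to the target URI.
    target = parts.path[len(prefix):]
    # The request's own query string is really the target's query string.
    if parts.query:
        target += "?" + parts.query
    return target


print(target_from_request(
    "http://carbondate.cs.odu.edu/cd/www.youtube.com/watch?v=Tnf_Brn-zdA"
))
```

Fragments (the part after `#`) are not sent to the server by browsers, so they would still need encoding if they ever mattered.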

phonedude Oct 19, 2017

Member

thanks guys. yes, a structure like:

http://carbondate.cs.odu.edu/cd/www.youtube.com/watch&v=Tnf_Brn-zdA

would be better.

ibnesayeed Oct 19, 2017

Member

Hey @HanySalahEldeen, it's great to hear from you. Hope you are doing good.

> Correct me if i am wrong, but isn't that a desired behavior? To clean up
> the url from parameters and find the source?

I think non-significant parameters, the protocol, and subdomains are removed as part of canonicalization. Most web archives do this, but we can also canonicalize on our end to get the same benefit with non-archival sources. However, in this report the URL parameters were dropped unintentionally, which is a bug.
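
As a rough illustration of the difference between deliberate canonicalization and accidental truncation, here is a minimal sketch. It is not how any particular web archive canonicalizes URIs: the `canonicalize` function, the rule of dropping only a leading `www.`, and the `IGNORED_PARAMS` set are all assumptions chosen for the example.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical list of parameters treated as non-significant.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}

SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def canonicalize(uri: str) -> str:
    """Drop the scheme, a leading 'www.', and non-significant parameters."""
    if not SCHEME.match(uri):
        uri = "http://" + uri  # let urlsplit find the host
    parts = urlsplit(uri)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep significant parameters only, in a stable order.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    query = urlencode(kept)
    return host + (parts.path or "/") + ("?" + query if query else "")


# The significant "v" parameter is kept; the tracking parameter is dropped.
print(canonicalize("https://www.youtube.com/watch?v=Tnf_Brn-zdA&utm_source=x"))
```

The key point is that canonicalization decides which parameters to drop, whereas the bug in this issue dropped them before the service ever saw them.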

@grantat grantat referenced this issue Oct 19, 2017

Merged

Add cd route #14

@grantat grantat closed this in #14 Oct 21, 2017
