Wayback machine image URLs still loading images from original Amazon S3 URL #1379

Open
jywarren opened this issue Mar 6, 2023 · 12 comments

@jywarren
Member

jywarren commented Mar 6, 2023

I found a strange issue when I pointed at a collection of JSON files which have had images routed to the Internet Archive's Wayback Machine caches.

As you can see, the image links are routed to Wayback URLs: https://ia601603.us.archive.org/20/items/mapknitter-wayback/ceres--2.json :

i.e.: https://web.archive.org/web/0id_/https://s3.amazonaws.com/grassrootsmapping/warpables/305268/PuglisiTerrazzeHaghiaTriadaCretaAntica2007-28.jpg

However, when I actually load a page like this, somehow it still loads images directly from Amazon s3, not the Internet Archive:

https://publiclab.github.io/Leaflet.DistortableImage/examples/archive?json=https://archive.org/download/mapknitter-wayback/ceres--2.json

I inspected in the console and still can't figure it out.

@segun-codes @7malikk I was curious, if you had an interest in this, what do you think is happening here? Could any application logic we've written be causing this?

See for example the images at https://publiclab.github.io/Leaflet.DistortableImage/examples/archive?json=https://archive.org/download/mapknitter-wayback/ceres--2.json

still loads https://s3.amazonaws.com/grassrootsmapping/warpables/306187/DJI_1207.JPG

jywarren added the bug label Mar 6, 2023
@segun-codes
Collaborator

Hi @jywarren, I am happy to check this out.

@segun-codes
Collaborator

Hi @jywarren, I checked the code. The transformation in the function below (in archive.js) is responsible for the behaviour you are describing. If my memory serves me right, we designed it this way at the time because of issues with accessing the images programmatically via IA. I also observed that the Wayback Machine itself simply loads the images from s3. What do you think?

// where imageSrc is in format: https://web.archive.org/web/20220803171120/https://s3.amazonaws.com/grassrootsmapping/warpables/48659/t82n_r09w_01-02_1985.jpg
// returns https://s3.amazonaws.com/grassrootsmapping/warpables/48659/t82n_r09w_01-02_1985.jpg or
// returns same url unchanged (no transformation required)
function extractImageSource(imageSrc) {
  if (imageSrc.startsWith('https://web.archive.org/web/')) {
    return imageSrc.substring(imageSrc.lastIndexOf('https'), imageSrc.length);
  }
  return imageSrc;
}

Illustration 1: [image not preserved]
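
For reference, here is a minimal sketch of how the extractImageSource() function quoted above behaves on the two Wayback URL shapes that appear in this thread. The second function, extractImageSourceAnyScheme, is a hypothetical variant and not part of archive.js; it only illustrates one detail that may be worth checking: because the original searches for 'https', a Wayback URL that wraps a plain http:// S3 address appears to be returned unchanged.

```javascript
// Copy of the extractImageSource() quoted above, behaviour unchanged.
function extractImageSource(imageSrc) {
  if (imageSrc.startsWith('https://web.archive.org/web/')) {
    return imageSrc.substring(imageSrc.lastIndexOf('https'), imageSrc.length);
  }
  return imageSrc;
}

// Hypothetical variant (an assumption, not the project's code):
// searching for 'http' matches both http:// and https:// inner URLs.
function extractImageSourceAnyScheme(imageSrc) {
  if (imageSrc.startsWith('https://web.archive.org/web/')) {
    return imageSrc.substring(imageSrc.lastIndexOf('http'));
  }
  return imageSrc;
}

// Both sample URLs are taken from this thread.
const httpsWrapped = 'https://web.archive.org/web/0id_/https://s3.amazonaws.com/grassrootsmapping/warpables/409/IMG_4155.JPG';
const httpWrapped = 'https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg';

// Inner URL is https: the Wayback prefix is stripped.
console.log(extractImageSource(httpsWrapped));
// Inner URL is http: lastIndexOf('https') is 0, so the input comes back as-is.
console.log(extractImageSource(httpWrapped) === httpWrapped);
// The variant strips the prefix in both cases.
console.log(extractImageSourceAnyScheme(httpWrapped));
```

Whether the http:// case matters in practice depends on which URL shapes the legacy JSON files actually contain.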

@jywarren
Member Author

jywarren commented Mar 14, 2023 via email

@segun-codes
Collaborator

Okay @jywarren, I'll look into this. Many thanks!

@jywarren
Member Author

jywarren commented Apr 2, 2023

Ah yes. I see - we get this error if we don't do that --

Access to image at 'https://web.archive.org/web/0id_/https://s3.amazonaws.com/grassrootsmapping/warpables/409/IMG_4155.JPG' from origin 'http://localhost:8082' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

I'm not sure... is there another way to access https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg without CORS issues? Otherwise, we could... upload that entire directory into an Archive collection, and serve it from there.

That is, Wayback URLs have CORS limitations, but images at regular archive.org/download/_____ URLs do not.
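
To compare the two kinds of URL, one option is to look at the Access-Control-Allow-Origin response header directly. The sketch below is only an illustration under stated assumptions: allowsOrigin and checkImageCors are hypothetical helper names, the check is simplified (it ignores credentialed requests), and it assumes a runtime with a global fetch, such as Node 18+, where the header can be read because CORS is not enforced outside the browser. Some servers may not answer HEAD requests, in which case a GET would be needed.

```javascript
// Hypothetical helper: does this Access-Control-Allow-Origin value let a
// page served from `origin` read the response? Simplified: credentialed
// requests are not considered.
function allowsOrigin(allowOriginHeader, origin) {
  if (allowOriginHeader === null || allowOriginHeader === undefined) {
    return false; // header absent, as in the error message above
  }
  const value = allowOriginHeader.trim();
  return value === '*' || value === origin;
}

// Hypothetical helper: requests only the headers for `url` and applies the
// check above. Assumes a global fetch (for example Node 18+).
async function checkImageCors(url, origin) {
  const response = await fetch(url, { method: 'HEAD' });
  return allowsOrigin(response.headers.get('access-control-allow-origin'), origin);
}
```

For example, checkImageCors(someImageUrl, 'http://localhost:8082') resolves to false whenever the header is missing, which is the situation the browser error above reports.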

@segun-codes
Collaborator

segun-codes commented Apr 3, 2023

Yes, I pointed out the CORS limitation in my previous message. It was the reason I fetched from s3 directly.

Okay, but is there something wrong with fetching from s3, given that the legacy JSON files all have image sources pointing to s3 either directly or indirectly? For instance, https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg simply points to s3 indirectly, nothing more.

@jywarren
Member Author

jywarren commented Apr 3, 2023

Yes, sorry, just agreeing and confirming from my test. Thank you!

The only issue with s3 is that it costs Public Lab money to host -- it's not forever storage. I think perhaps the best choice is to create an archive.org collection and add to this logic in extractImageSource(), where we replace http://s3.amazonaws.com/grassrootsmapping with https://archive.org/download/mapknitter-wayback

I'm working on uploading all the files, but it'll be a while. We can check in here again once it's complete!
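
A rough sketch of what that prefix replacement could look like is below. It is not the project's implementation: toArchiveOrgSource is a hypothetical name, and the sketch assumes the uploaded archive.org item keeps the same path layout as the S3 bucket (warpables/<id>/<file>), which cannot be confirmed until the upload is complete.

```javascript
// Matches the S3 bucket prefix over either http or https.
const S3_PREFIX = /^https?:\/\/s3\.amazonaws\.com\/grassrootsmapping/;
// Target prefix proposed in this thread.
const ARCHIVE_PREFIX = 'https://archive.org/download/mapknitter-wayback';

// Hypothetical helper: unwraps a Wayback URL if present, then swaps the S3
// prefix for the archive.org one. URLs that match neither pattern are
// returned unchanged.
function toArchiveOrgSource(imageSrc) {
  let src = imageSrc;
  if (src.startsWith('https://web.archive.org/web/')) {
    // 'http' matches both http:// and https:// inner URLs.
    src = src.substring(src.lastIndexOf('http'));
  }
  // ASSUMPTION: paths under the archive.org item mirror the S3 paths.
  return src.replace(S3_PREFIX, ARCHIVE_PREFIX);
}

console.log(toArchiveOrgSource(
  'https://web.archive.org/web/20200506081918id_/http://s3.amazonaws.com/grassrootsmapping/warpables/417/img_0135.jpg'
));
// prints https://archive.org/download/mapknitter-wayback/warpables/417/img_0135.jpg
```

Handling both the wrapped and the direct S3 forms in one place means the legacy JSON files would not need to be rewritten.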

@segun-codes
Collaborator

Ha! Okay, I understand now. So the archive.org option is definitely the route to take. I will check back then.

@jywarren
Member Author

jywarren commented Apr 3, 2023

Gosh, it's going to take a while! It's 631,813 files, and I've only downloaded 3,875 so far...

I may try another way at a remote server that's faster... we'll see!

@segun-codes
Collaborator

Yeah... this has to take a while

@Mustafa-Hersi

Is this issue being worked on?

@jywarren
Member Author

jywarren commented Aug 7, 2023

Hi, we are still working on uploading the archive.org collection, apologies!
