# Met API Investigations

In this notebook, I investigated how we could use the table describing the individual images to pull together the images from the relevant cultures.

### Table of contents

1. Analysis of API
2. BigQuery dataset
3. Reviewing Download Speeds
4. Assembly of table

## 1. Analysis of API

At first glance, it seemed that we could use the Met's API, so here I investigated connecting to it.

In [1]:
import requests

response = requests.get('https://collectionapi.metmuseum.org/public/collection/v1/objects')


In [13]:
response.json()['total']

474434

In [39]:
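# Note: names assigned inside %timeit are not kept, so IDs also has to be assigned in a normal cell
%timeit  IDs = response.json()['objectIDs'][1:300]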



108 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [32]:
for i in IDs[30:72]:
    url = 'https://collectionapi.metmuseum.org/public/collection/v1/objects/' + str(i)
    print(url)
    item = requests.get(url)
    

https://collectionapi.metmuseum.org/public/collection/v1/objects/32
https://collectionapi.metmuseum.org/public/collection/v1/objects/33
https://collectionapi.metmuseum.org/public/collection/v1/objects/34
https://collectionapi.metmuseum.org/public/collection/v1/objects/35
https://collectionapi.metmuseum.org/public/collection/v1/objects/36
https://collectionapi.metmuseum.org/public/collection/v1/objects/37
https://collectionapi.metmuseum.org/public/collection/v1/objects/38
https://collectionapi.metmuseum.org/public/collection/v1/objects/39
https://collectionapi.metmuseum.org/public/collection/v1/objects/40
https://collectionapi.metmuseum.org/public/collection/v1/objects/41
https://collectionapi.metmuseum.org/public/collection/v1/objects/42
https://collectionapi.metmuseum.org/public/collection/v1/objects/43
https://collectionapi.metmuseum.org/public/collection/v1/objects/44
https://collectionapi.metmuseum.org/public/collection/v1/objects/45
https://collectionapi.metmuseum.org/public/colle

In [33]:
item.json()

{'objectID': 79,
 'isHighlight': False,
 'accessionNumber': '37.174.2',
 'accessionYear': '1937',
 'isPublicDomain': True,
 'primaryImage': 'https://images.metmuseum.org/CRDImages/ad/original/113560.jpg',
 'primaryImageSmall': 'https://images.metmuseum.org/CRDImages/ad/web-large/113560.jpg',
 'additionalImages': [],
 'constituents': None,
 'department': 'The American Wing',
 'objectName': 'Andiron',
 'title': 'Andiron',
 'culture': '',
 'period': '',
 'dynasty': '',
 'reign': '',
 'portfolio': '',
 'artistRole': '',
 'artistPrefix': '',
 'artistDisplayName': '',
 'artistDisplayBio': '',
 'artistSuffix': '',
 'artistAlphaSort': '',
 'artistNationality': '',
 'artistBeginDate': '',
 'artistEndDate': '',
 'artistGender': '',
 'artistWikidata_URL': '',
 'artistULAN_URL': '',
 'objectDate': 'ca. 1700',
 'objectBeginDate': 1697,
 'objectEndDate': 1700,
 'medium': 'Brass, iron',
 'dimensions': '17 3/4 x 24 in. (45.1 x 61 cm)',
 'creditLine': 'Rogers Fund, 1937',
 'geographyType': 'Made in',
 

In [34]:
item.json()['primaryImage']

'https://images.metmuseum.org/CRDImages/ad/original/113560.jpg'
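Not every object is guaranteed to come with an image. As a minimal sketch, assuming that records without an image carry an empty `primaryImage` string (an assumption about the API, suggested by the many empty fields in the record above), the records could be filtered as follows. The `records` list and the helper `has_image` are hypothetical names made up for illustration, and the second record is invented.

```python
# Hypothetical records shaped like item.json(); only the fields used here are included.
records = [
    {"objectID": 79,
     "primaryImage": "https://images.metmuseum.org/CRDImages/ad/original/113560.jpg"},
    {"objectID": -1, "primaryImage": ""},  # made-up record with no image
]

def has_image(record):
    """True when the record has a non-empty primaryImage URL (assumed convention)."""
    return bool(record.get("primaryImage"))

# Keep only the records that point at an image.
with_images = [r for r in records if has_image(r)]
print([r["objectID"] for r in with_images])
```

Using `record.get` rather than indexing means a record that lacks the key altogether is also treated as having no image.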

In [36]:
import urllib.request

urllib.request.urlretrieve(item.json()['primaryImage'], "00000001.jpg")

('00000001.jpg', <http.client.HTTPMessage at 0x2531a62e320>)

Okay.  So, we can connect, pull a table that contains the object info, filter out the objects with no image or other missing data, and then pull the images.  How long does that take?

In [42]:
%timeit urllib.request.urlretrieve(item.json()['primaryImage'], "00000001.jpg")

455 ms ± 43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Based on this speed (roughly half a second per image, so about 2 images per second), let's calculate how many hours 100,000 images would take.

In [44]:
100000/2/60/60

13.88888888888889

Nearly fourteen hours; we may already need to use GCP for that part.
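The same estimate can be written as a small function so the inputs are explicit. This is only a sketch: `download_hours` is a hypothetical helper, the 0.455 s figure is the mean from the `%timeit` run above, and 100,000 is the image count assumed in the calculation above. Using the measured time instead of the rounded 2 images per second gives about 12.6 hours.

```python
def download_hours(n_images, seconds_per_image):
    """Hours needed to fetch n_images one after another, with no parallelism."""
    return n_images * seconds_per_image / 3600

# 100,000 images at the measured mean of 455 ms each
hours = download_hours(100_000, 0.455)
print(f"{hours:.1f} hours")
```

Either way the order of magnitude is the same: a sequential download is an overnight job.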

## 2. BigQuery Dataset

Below, I load the extension needed to use BigQuery.

In [8]:
%load_ext google.cloud.bigquery

To run this, I downloaded an auth key following this link:
https://cloud.google.com/docs/authentication/getting-started#windows


Then I ran this in PowerShell:
<code>$env:GOOGLE_APPLICATION_CREDENTIALS="C:\Users\matfl\Documents\MSCA 31009 Project -_NUMBER_.json"

set GOOGLE_APPLICATION_CREDENTIALS="C:\Users\matfl\Documents\MSCA 31009 Project -_NUMBER_.json"</code>


Trying another way, setting the variable from inside the notebook...

In [5]:
import os

credential_path = r"C:\Users\matfl\Documents\MSCA 31009 Project - (details).json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path
from google.cloud import bigquery

This works - I can now query BigQuery in Jupyter.

In [46]:
%%bigquery
SELECT * FROM `bigquery-public-data.the_met.objects` limit 50

Unnamed: 0,object_number,is_highlight,is_public_domain,object_id,department,object_name,title,culture,period,dynasty,...,subregion,locale,locus,excavation,river,classification,rights_and_reproduction,link_resource,metadata_date,repository
0,U565 .S68 1740,True,True,681274,The Libraries,,Proposition concernant le payement et la polic...,,,,...,,,,,,||,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"
1,M2149.5 .C5 1768,True,True,681259,The Libraries,,"Loffice de Noël, 1768",,,,...,,,,,,||,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"
2,GB1321 .M68 1766,True,True,682011,The Libraries,,Dialogo sobre hua nova obra no Rio Tejo ...,,,,...,,,,,,||,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"
3,272.4 Au8,True,True,591859,The Libraries,,Autographs and Sketches from Artist Friends to...,,,,...,,,,,,||,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"
4,201.9Av3 Av32,True,True,591857,The Libraries,,"S. P. Avery, Engraver on Wood","New York: [s.n.], [18--]",,,...,,,,,,||,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"
5,1984.489,False,True,38568,Asian Art,Plaque,,Tibet,,,...,,,,,,Bone,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"
6,1998.348,False,True,41900,Asian Art,Plaque,,"Afghanistan, possibly of West Indian manufacture",,,...,,,,,,Bone,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"
7,91.1.926,False,True,60517,Asian Art,Pipe case,,Japan,,,...,,,,,,Bone,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"
8,91.1.927,False,True,60518,Asian Art,Pipe case,,Japan,,,...,,,,,,Bone,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"
9,1980.528.1,False,True,38565,Asian Art,Plaque,,Tibet (or Nepal),,,...,,,,,,Bone,,http://www.metmuseum.org/art/collection/search...,2017-02-06 08:00:16+00:00,"Metropolitan Museum of Art, New York, NY"


Let's see how the objects are spread across departments. At this point, we hadn't yet decided which variable we planned to predict.  This distribution isn't great, so we'll look at another.

In [47]:
%%bigquery
SELECT department, count(*) FROM `bigquery-public-data.the_met.objects` GROUP BY department

Unnamed: 0,department,f0_
0,The Libraries,120
1,Asian Art,29844
2,Islamic Art,10435
3,Photographs,6583
4,Medieval Art,6838
5,The Cloisters,2268
6,Arms and Armor,4252
7,European Paintings,2322
8,Drawings and Prints,43488
9,Greek and Roman Art,12518


In [86]:
%%bigquery
SELECT a.department, count(DISTINCT b.gcs_url) 
FROM `bigquery-public-data.the_met.objects` a JOIN `bigquery-public-data.the_met.images` b
on a.object_id = b.object_id GROUP BY a.department ORDER BY 2 desc

Unnamed: 0,department,f0_
0,Asian Art,74039
1,Drawings and Prints,60452
2,European Sculpture and Decorative Arts,57371
3,Islamic Art,36275
4,Egyptian Art,24783
5,Costume Institute,22257
6,Medieval Art,19615
7,Greek and Roman Art,18389
8,American Decorative Arts,18201
9,Photographs,13668


So if we used the top 5 departments as classes, there would still be plenty of data (over 250,000 images), and it could work well.
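As a quick check of how much data the top 5 departments would give us, here is a sketch using the counts copied from the output above. The total of 401,477 distinct image URLs comes from a count later in this notebook; the variable names are just for illustration.

```python
# Image counts per department, copied from the query output above.
image_counts = {
    "Asian Art": 74039,
    "Drawings and Prints": 60452,
    "European Sculpture and Decorative Arts": 57371,
    "Islamic Art": 36275,
    "Egyptian Art": 24783,
    "Costume Institute": 22257,
    "Medieval Art": 19615,
    "Greek and Roman Art": 18389,
    "American Decorative Arts": 18201,
    "Photographs": 13668,
}
total_images = 401477  # distinct gcs_url values, counted later in this notebook

# Sort by count and keep the five largest departments.
top5 = sorted(image_counts.items(), key=lambda kv: kv[1], reverse=True)[:5]
top5_total = sum(n for _, n in top5)
share = top5_total / total_images
print(top5_total, f"{share:.0%}")
```

The classes would still be unbalanced (Asian Art has about three times as many images as Egyptian Art), which is worth keeping in mind when choosing the target.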

Let's investigate what information is available about each object.

In [73]:
%%bigquery tbl2
SELECT * FROM `bigquery-public-data.the_met.objects` limit 50

In [74]:
tbl2.columns

Index(['object_number', 'is_highlight', 'is_public_domain', 'object_id',
       'department', 'object_name', 'title', 'culture', 'period', 'dynasty',
       'reign', 'portfolio', 'artist_role', 'artist_prefix',
       'artist_display_name', 'artist_display_bio', 'artist_suffix',
       'artist_alpha_sort', 'artist_nationality', 'artist_begin_date',
       'artist_end_date', 'object_date', 'object_begin_date',
       'object_end_date', 'medium', 'dimensions', 'credit_line',
       'geography_type', 'city', 'state', 'county', 'country', 'region',
       'subregion', 'locale', 'locus', 'excavation', 'river', 'classification',
       'rights_and_reproduction', 'link_resource', 'metadata_date',
       'repository'],
      dtype='object')

Investigating date: the most common value is blank and the rest are free-text ranges rather than set dates, so this isn't the best route.

In [85]:
%%bigquery
SELECT a.object_date, count(DISTINCT b.gcs_url) 
FROM `bigquery-public-data.the_met.objects` a JOIN `bigquery-public-data.the_met.images` b
on a.object_id = b.object_id GROUP BY a.object_date ORDER BY 2 desc
limit 30

Unnamed: 0,object_date,f0_
0,,18418
1,18th century,15409
2,19th century,11639
3,17th century,8755
4,16th century,7706
5,early 19th century,3418
6,18th–19th century,3373
7,15th century,3276
8,1880s,2096
9,16th–17th century,1990


What about medium?

In [80]:
%%bigquery
SELECT a.medium, count(DISTINCT b.gcs_url) 
FROM `bigquery-public-data.the_met.objects` a JOIN `bigquery-public-data.the_met.images` b
on a.object_id = b.object_id GROUP BY a.medium ORDER BY 2 desc limit 50

Unnamed: 0,medium,f0_
0,Bronze,8652
1,Silver,8354
2,Silk,7474
3,Terracotta,7030
4,Glass,6078
5,Woodcut,5573
6,Etching,5457
7,silk,5048
8,Gold,4723
9,Engraving,4301


Medium is surprisingly disappointing: the values are fragmented, and the same medium shows up under different spellings (Silk and silk).  Ultimately, we settled on culture, which can be seen below.
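Part of the fragmentation is only capitalization. As a rough sketch, using the ten counts shown above and assuming no image is counted under both spellings, the labels could be normalized before counting. The `merged` counter and the lower-casing rule are illustrative choices, not something we actually applied.

```python
from collections import Counter

# (medium, image count) pairs copied from the query output above.
medium_counts = [
    ("Bronze", 8652), ("Silver", 8354), ("Silk", 7474), ("Terracotta", 7030),
    ("Glass", 6078), ("Woodcut", 5573), ("Etching", 5457), ("silk", 5048),
    ("Gold", 4723), ("Engraving", 4301),
]

# Merge labels that differ only in case or surrounding whitespace.
merged = Counter()
for medium, n in medium_counts:
    merged[medium.strip().lower()] += n

print(merged.most_common(3))
```

With the two silk rows merged, silk becomes the largest medium in this top ten, but the long tail of compound descriptions would still need more than case folding.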

In [81]:
%%bigquery
SELECT a.culture, count(DISTINCT b.gcs_url) 
FROM `bigquery-public-data.the_met.objects` a JOIN `bigquery-public-data.the_met.images` b
on a.object_id = b.object_id GROUP BY a.culture ORDER BY 2 desc limit 50

Unnamed: 0,culture,f0_
0,,155813
1,Japan,43278
2,China,23351
3,American,21249
4,French,20244
5,Italian,9102
6,British,8196
7,Roman,5575
8,German,5175
9,"French, Paris",4025


In [234]:
%%bigquery
SELECT a.culture, count(DISTINCT b.object_id) 
FROM `bigquery-public-data.the_met.objects` a JOIN `bigquery-public-data.the_met.images` b
on a.object_id = b.object_id GROUP BY a.culture ORDER BY 2 desc limit 10

Unnamed: 0,culture,f0_
0,,85443
1,Japan,14746
2,China,10429
3,American,9239
4,French,8853
5,Italian,4220
6,Roman,3563
7,British,3416
8,German,2108
9,Cypriot,1935


Now let's see how many images there are that we can access.

In [64]:
%%bigquery
SELECT count(DISTINCT b.gcs_url)
FROM `bigquery-public-data.the_met.objects` a JOIN `bigquery-public-data.the_met.images` b
on a.object_id = b.object_id

Unnamed: 0,f0_
0,401477


In [66]:
%%bigquery
select count(*) from `bigquery-public-data.the_met.images`

Unnamed: 0,f0_
0,401596


Let's see what this looks like in a table.

In [43]:
%%bigquery
SELECT
    *
FROM `bigquery-public-data.the_met.images`
LIMIT 45

Unnamed: 0,object_id,public_caption,title,original_image_url,caption,is_oasc,gcs_url
0,435868,"Fig. 8. X-radiograph of The Met, 61.101.1",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/435868/8.jpg
1,634108,"Fig. 5. Ferdinand Hodler, sketch for ""The Drea...",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/634108/5.jpg
2,634108,"Fig. 2. Ferdinand Hodler, sketch for ""The Drea...",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/634108/2.jpg
3,634108,"Fig. 7. Ferdinand Hodler, sketch for ""The Drea...",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/634108/7.jpg
4,634108,"Fig. 6. Ferdinand Hodler, sketch for ""The Drea...",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/634108/6.jpg
5,634108,"Fig. 3. Ferdinand Hodler, sketch for ""The Drea...",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/634108/3.jpg
6,634108,"Fig. 4. Ferdinand Hodler, sketch for ""The Drea...",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/634108/4.jpg
7,435868,"Fig. 1. Paul Cézanne, ""The Card Players,"" ca. ...",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/435868/1.jpg
8,435868,"Fig. 7. Antoine Le Nain, ""The Little Card Play...",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/435868/7.jpg
9,435868,"Fig. 3. Paul Cézanne, ""The Card Players,"" ca. ...",,http://images.metmuseum.org/CRDImages/ep/origi...,,False,gs://gcs-public-data--met/435868/3.jpg


Pulling it into a pandas dataframe to work with:

In [9]:
%%bigquery tbl1
SELECT
    *
FROM `bigquery-public-data.the_met.images`
ORDER BY object_id DESC
LIMIT 15

In [10]:
tbl1

Unnamed: 0,object_id,public_caption,title,original_image_url,caption,is_oasc,gcs_url
0,746939,,,http://images.metmuseum.org/CRDImages/es/origi...,,True,gs://gcs-public-data--met/746939/0.jpg
1,746938,,,http://images.metmuseum.org/CRDImages/es/origi...,,True,gs://gcs-public-data--met/746938/0.jpg
2,746760,,,http://images.metmuseum.org/CRDImages/es/origi...,,True,gs://gcs-public-data--met/746760/1.jpg
3,746760,,,http://images.metmuseum.org/CRDImages/es/origi...,,True,gs://gcs-public-data--met/746760/5.jpg
4,746760,,,http://images.metmuseum.org/CRDImages/es/origi...,,True,gs://gcs-public-data--met/746760/0.jpg
5,746760,,,http://images.metmuseum.org/CRDImages/es/origi...,,True,gs://gcs-public-data--met/746760/2.jpg
6,746760,,,http://images.metmuseum.org/CRDImages/es/origi...,,True,gs://gcs-public-data--met/746760/4.jpg
7,746760,,,http://images.metmuseum.org/CRDImages/es/origi...,,True,gs://gcs-public-data--met/746760/6.jpg
8,746760,,,http://images.metmuseum.org/CRDImages/es/origi...,,True,gs://gcs-public-data--met/746760/3.jpg
9,746253,,,http://images.metmuseum.org/CRDImages/eg/origi...,view 1,True,gs://gcs-public-data--met/746253/0.jpg


## 3. Reviewing Download Speeds

Now that I can find the location of every image, it's worth understanding how fast they are to download and how the data should be stored. Should it be put in our own bucket or on a local VM, or can it be accessed straight from the public bucket?  Understanding latency will be important.

In [11]:
import urllib.request
import os
from google.cloud import storage

In [27]:
#set the credentials and connect to the bucket
credential_path = r"C:\Users\matfl\Documents\MSCA 31009 Project -24078705271d.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

storage_client = storage.Client()
bucket = storage_client.get_bucket('met-image-bucket')

In [41]:
%%timeit
#download one image to a local file, then upload that file to the bucket
#build the local file name from the object id
fn = str(tbl1.loc[0, 'object_id']) + 'image.jpg'
urllib.request.urlretrieve(tbl1.loc[0, 'original_image_url'], fn)
#load the file onto gcp
blob = bucket.blob(fn)
blob.upload_from_filename(fn)


2.06 s ± 182 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [39]:
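#v2
fn = tbl1.loc[4,'object_id'].astype(str)+'image.jpg'
#set up the file
blob = bucket.blob(fn)
#see if we can upload it in one line, without saving a local file first
blob.upload_from_file()
#this doesn't work: upload_from_file expects a file-like object, and raw bytes give the error below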


AttributeError: 'bytes' object has no attribute 'tell'
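The error comes from handing raw bytes to something that expects a file-like object. The sketch below shows the difference using only the standard library; the four bytes in `data` are just a stand-in for a downloaded image. The two upload calls in the trailing comment are methods of the `google-cloud-storage` blob object, but they are not run here and I have not timed them.

```python
import io

data = b"\xff\xd8\xff\xdb"   # first bytes of a JPEG, standing in for a downloaded image

# bytes objects have no file position, hence the missing 'tell'
print(hasattr(data, "tell"))

# io.BytesIO wraps the same bytes in a file-like object
buf = io.BytesIO(data)
print(buf.tell(), buf.getvalue() == data)

# With the blob from the cell above, either of these should then avoid the local file
# (not run here, since they need bucket access):
#   blob.upload_from_file(io.BytesIO(data), content_type="image/jpeg")
#   blob.upload_from_string(data, content_type="image/jpeg")
```

Either route would skip the write to local disk that the first version does, which may or may not matter for the overall timing.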

In [37]:
#urllib.request.urlopen(tbl1.loc[0, 'original_image_url'])
import requests
#.content gives the raw response body as bytes
requests.get(tbl1.loc[0, 'original_image_url']).content

b'\xff\xd8\xff\xdb\x00C\x00\x03\x02\x02\x03\x02\x02\x03\x03\x03\x03\x04\x03\x03\x04\x05\x08\x05\x05\x04\x04\x05...

matflig@cloudshell:~ (msca-31009-project)$ gsutil cp gs://gcs-public-data--met/746760/3.jpg gs://met-image-bucket

This does it in gsutil!

matflig@cloudshell:~ (msca-31009-project)$ gsutil cp -r gs://gcs-public-data--met/74625* gs://met-image-bucket/

-r makes the copy recursive so it handles the folders, and -m before cp runs the copies in parallel...

FINAL RUN

matflig@cloudshell:~ (msca-31009-project)$ gsutil -m cp -r gs://gcs-public-data--met/* gs://met-image-bucket/

Now, I'll need to move it to persistent disk, restructure it into train and test sets, arrange it by y variable, and then possibly take the images out of their folders.

Size of data: matflig@cloudshell:~ (msca-31009-project)$ gsutil du -sh gs://met-image-bucket/

..Large.



Now, to attempt writing a shell script.

In [88]:
%%bigquery todel_test
SELECT object_id FROM `bigquery-public-data.the_met.objects` WHERE department='The Libraries'

In [109]:
script = '\n'.join(todel_test['object_id'].apply(lambda x: 'gsutil rm -r gs://met-image-bucket/'+str(x)).tolist())

In [110]:
script

'gsutil rm -r gs://met-image-bucket/681274\ngsutil rm -r gs://met-image-bucket/681259\ngsutil rm -r gs://met-image-bucket/682011\ngsutil rm -r gs://met-image-bucket/591859\ngsutil rm -r gs://met-image-bucket/591857\ngsutil rm -r gs://met-image-bucket/591868\ngsutil rm -r gs://met-image-bucket/705638\ngsutil rm -r gs://met-image-bucket/591846\ngsutil rm -r gs://met-image-bucket/591833\ngsutil rm -r gs://met-image-bucket/591837\ngsutil rm -r gs://met-image-bucket/591850\ngsutil rm -r gs://met-image-bucket/734108\ngsutil rm -r gs://met-image-bucket/734109\ngsutil rm -r gs://met-image-bucket/734110\ngsutil rm -r gs://met-image-bucket/734111\ngsutil rm -r gs://met-image-bucket/738516\ngsutil rm -r gs://met-image-bucket/700238\ngsutil rm -r gs://met-image-bucket/591856\ngsutil rm -r gs://met-image-bucket/591861\ngsutil rm -r gs://met-image-bucket/738466\ngsutil rm -r gs://met-image-bucket/700032\ngsutil rm -r gs://met-image-bucket/700050\ngsutil rm -r gs://met-image-bucket/591862\ngsutil rm 

In [111]:
with open("demofile2.sh", "w") as f:
    f.write(script)

#made a version called demofile without gsutil at the start 

In [113]:
script3 = '\n'.join(todel_test['object_id'].apply(lambda x: 'rm gs://met-image-bucket/'+str(x)).tolist())
with open("demofile3.sh", "w") as f:
    f.write(script3)


making sure I have a few of these to run on...

gsutil -m cp -r gs://gcs-public-data--met/681* gs://met-image-bucket/

Now in gsutil I can load this and then run

gsutil cat gs://met-image-bucket/demofile2.sh | sh -- didn't work!
gsutil cat gs://met-image-bucket/demofile2.sh | gsutil

May add a -m before the rms if this works...



Tried the same thing with a load (copy) script; it looks like I had to encode it as UTF-8?

In [123]:
scriptl = '\n'.join(todel_test['object_id'].apply(
    lambda x: 'gsutil -m cp -r gs://gcs-public-data--met/' + str(x) + '* gs://met-image-bucket/'
).tolist()).encode('utf8')
with open("demofilel.sh", "wb") as f:
    f.write(scriptl)


This is slow, so I am going to try putting it on one line.  This worked with an rm (I removed 100, and 1000, that I should add back later!)

In [133]:
#one gsutil call with every path as an argument
scriptfs = 'gsutil -m rm -r ' + ' '.join(todel_test['object_id'].apply(lambda x: 'gs://met-image-bucket/'+str(x)).tolist())
scriptfs = scriptfs.encode('utf8')

In [134]:
with open("demofilefs.sh", "wb") as f:
    f.write(scriptfs)

THIS WORKS: it deleted about 1000 objects in a couple of seconds!

To that end, I should probably delete what's on there, use Python to select the train, validation, and test sets and the categories, make their folders, and then copy in the relevant jpg objects.

There's a way to do this:
cat filelist | gsutil -m cp -I gs://my-bucket


cat filelist | gsutil -m cp -I gs://met-image-bucket/mftst

Apparently the list needs to have one path per line.

In [144]:
scriptcp = '\n'.join(todel_test['object_id'].apply(lambda x: 'gs://gcs-public-data--met/'+str(x)+'/*.jpg').tolist())
scriptcp = scriptcp.encode('utf8')

with open("demofilecp.sh", "wb") as f:
    f.write(scriptcp)

gsutil cat gs://met-image-bucket/demofilecp.sh | gsutil -m cp -I gs://met-image-bucket/mftst

This works!  Or, well, it would if the files didn't all have the same names (0.jpg, 1.jpg, and so on), so they collide once they land in a single folder.  We may need to copy all of the folders and then delete the json items, or figure out how to rename each picture on the way in.
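One way around the clashing names is to fold the object id into the file name. A minimal sketch of that mapping follows; `PUBLIC_PREFIX` and `flat_name` are hypothetical names, and the two URLs are taken from the image table shown earlier. This only computes the new names; it does not copy anything.

```python
PUBLIC_PREFIX = "gs://gcs-public-data--met/"

def flat_name(gcs_url, prefix=PUBLIC_PREFIX):
    """Turn '<prefix>700240/1.jpg' into '700240_1.jpg'."""
    if not gcs_url.startswith(prefix):
        raise ValueError(f"unexpected URL: {gcs_url}")
    # drop the bucket prefix, then replace the folder separator
    return gcs_url[len(prefix):].replace("/", "_")

urls = [
    "gs://gcs-public-data--met/700240/0.jpg",
    "gs://gcs-public-data--met/700032/0.jpg",
]
names = [flat_name(u) for u in urls]
print(names)
```

Both files are called 0.jpg in the public bucket, but the flattened names stay distinct because the object id is part of them.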

I think I'll have to copy all of the folders, json items included, with a slightly modified version of the list (with no /*.jpg at the end):

gsutil cat gs://met-image-bucket/demofilecp.sh | gsutil -m cp -r -I gs://met-image-bucket/mftst


This works.

From there, I can probably move these accordingly and rename them?  Let's see how slow this is, since it has to be done line by line...

In [None]:
#script to newly copy in each of these as folders
scriptcp = '\n'.join(todel_test['object_id'].apply(lambda x: 'gs://gcs-public-data--met/'+str(x)).tolist())
scriptcp = scriptcp.encode('utf8')

with open("demofilecp.sh", "wb") as f:
    f.write(scriptcp)

In [157]:
%%bigquery rn
SELECT b.*
FROM `bigquery-public-data.the_met.objects` a JOIN `bigquery-public-data.the_met.images` b
on a.object_id = b.object_id WHERE department='The Libraries'

In [158]:
rn

Unnamed: 0,object_id,public_caption,title,original_image_url,caption,is_oasc,gcs_url
0,700240,Title Page,,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700240/0.jpg
1,700240,Plate 2,,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700240/1.jpg
2,700240,Plate 6,,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700240/2.jpg
3,700032,"Title Page, Petri Gyllii De Bosporo thracio li...",,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700032/0.jpg
4,700032,"Title Page, Petri Gyllii De topographia Consta...",,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700032/1.jpg
5,700032,"Page 1, Petri Gyllii De topographia Constantin...",,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700032/2.jpg
6,700238,Title Page,,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700238/0.jpg
7,700238,Frontispiece,,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700238/1.jpg
8,700238,Plate 1,,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700238/2.jpg
9,700238,Plate 6,,http://images.metmuseum.org/CRDImages/li/origi...,,True,gs://gcs-public-data--met/700238/3.jpg


In [169]:
#script to try and rename: 700240/0.jpg -> 700240_0.jpg
#str(x)[26:] drops the 26-character prefix 'gs://gcs-public-data--met/'
scriptrn = '\n'.join(rn['gcs_url'].apply(lambda x: 'gsutil mv gs://met-image-bucket/mftst/'+str(x)[26:]+\
                                         ' gs://met-image-bucket/mftst/'+str(x)[26:].replace('/', '_')).tolist())
scriptrn = scriptrn.encode('utf8')

In [170]:
with open("demofilern.sh", "wb") as f:
    f.write(scriptrn)

This would be run with: gsutil cat gs://met-image-bucket/demofilern.sh | sh

And it does work, but it is far too slow.

A possibility from Stack Overflow:

gsutil ls -r gs://test-lab-12345-67890/\*\*/test_result_* | awk -F"/" {'system("gsutil cp "$0" \~/localpath/"$6)'}

https://stackoverflow.com/questions/48956181/gsutil-rename-files-as-they-are-being-copied-from-different-directories



In [None]:
testlist = rn['gcs_url'].apply(lambda x: str(x)[26:]).tolist()

In [225]:
testlist

['700240/0.jpg',
 '700240/1.jpg',
 '700240/2.jpg',
 '700032/0.jpg',
 '700032/1.jpg',
 '700032/2.jpg',
 '700238/0.jpg',
 '700238/1.jpg',
 '700238/2.jpg',
 '700238/3.jpg',
 '738758/1.jpg',
 '738758/2.jpg',
 '738758/3.jpg',
 '738758/0.jpg',
 '738765/1.jpg',
 '738765/3.jpg',
 '716639/0.jpg',
 '716648/0.jpg',
 '716648/1.jpg',
 '724625/0.jpg',
 '724625/1.jpg',
 '724625/2.jpg',
 '738466/0.jpg',
 '738466/1.jpg',
 '738466/2.jpg',
 '699505/0.jpg',
 '699505/1.jpg',
 '739504/0.jpg',
 '739504/1.jpg',
 '738463/0.jpg',
 '738463/1.jpg',
 '715005/0.jpg',
 '715005/1.jpg',
 '715005/2.jpg',
 '715005/3.jpg',
 '715005/4.jpg',
 '715005/5.jpg',
 '681372/0.jpg',
 '681388/0.jpg',
 '681388/4.jpg',
 '681551/0.jpg',
 '682013/0.jpg',
 '682013/4.jpg',
 '681546/0.jpg',
 '681384/0.jpg',
 '681149/0.jpg',
 '680832/0.jpg',
 '680832/5.jpg',
 '681274/0.jpg',
 '681274/5.jpg',
 '681246/0.jpg',
 '681128/1.jpg',
 '681128/3.jpg',
 '680819/0.jpg',
 '682015/0.jpg',
 '681545/0.jpg',
 '681545/3.jpg',
 '680318/0.jpg',
 '680821/0.jpg

In [220]:
import os
from google.cloud import storage

credential_path = r"C:\Users\matfl\Documents\MSCA 31009 Project -24078705271d.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

storage_client = storage.Client()
bucket = storage_client.get_bucket('met-image-bucket')



def rename_file(bucket, bucketFolder, fileName):
    """Rename file in GCP bucket."""
    blob = bucket.blob(bucketFolder + fileName)
    bucket.rename_blob(blob,
                       new_name=(bucketFolder+fileName.replace('/', '_')))

In [216]:
blob = bucket.blob('mftst/'+str(testlist[1]))

In [219]:
bucket.rename_blob(blob, new_name=('mftst/'+str(testlist[1]).replace('/','_')))

<Blob: met-image-bucket, mftst/700240_1.jpg, 1595053204932529>

In [226]:
%time rename_file(bucket, 'mftst/',str(testlist[15]))

Wall time: 543 ms


This could...scale? Slowly, but almost acceptably?

In [227]:
for i in testlist:
    rename_file(bucket, 'mftst/', str(i))

In [228]:
len(testlist)

519

This was only timed informally, but it took about 3 minutes for the 519 files in testlist.  Some back-of-the-envelope math, again assuming 100,000 images...

In [232]:
100000/519 * 3 / 60

9.633911368015415

That is still about ten hours of processing, which...we have the time to do, I guess?
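Since each rename mostly waits on the network, one option would be to run several at once with a thread pool. The sketch below is untested against the real bucket: `fake_rename` is a stand-in that sleeps instead of calling `rename_file`, the file names are generated for illustration, the worker count of 8 is arbitrary, and it assumes the storage client can safely be shared across threads, which would need checking.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_rename(name):
    """Stand-in for rename_file(bucket, 'mftst/', name); sleeps to mimic network latency."""
    time.sleep(0.05)
    return name.replace('/', '_')

# Made-up names in the same 'object_id/n.jpg' shape as testlist.
names = [f"{700000 + i}/0.jpg" for i in range(40)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map keeps the results in the same order as the inputs
    renamed = list(pool.map(fake_rename, names))
elapsed = time.perf_counter() - start

print(len(renamed), renamed[0], f"{elapsed:.2f} s")
```

With the real function, the body of `fake_rename` would be replaced by the `rename_file` call, and any speed-up would depend on how many concurrent requests the bucket tolerates.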