<a href="https://colab.research.google.com/github/mc2/LiteWeight/blob/master/ia_command_line.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial for learning to use the Internet Archive's command line tool


This tutorial introduces the Internet Archive's command line tool, `ia`: https://internetarchive.readthedocs.io/en/stable/cli.html

The first exercise shows how to run a more sophisticated query than the website UI allows.

The second exercise counts the words in the Project Gutenberg Bible. After that come some reports.

This is an experiment. We are using a Slack channel on internetarchive.slack.com for questions. Please write to mek@archive.org if you are interested.

-brewster



# Getting started and extracting some metadata


1. You will need your archive.org username and password. If you don't have an account, create one at https://archive.org/account/signup.

2. Back on this page, run the commands in the next box:
  
  Move your mouse over the gray area just below this box.

  The "`[ ]`" turns into a play button.

  Click the play button. (It will warn you; choose "Run Anyway".)

  This starts a virtual machine and installs some commands.

3. Enter your Internet Archive username (email) and password when prompted.

4. When the installation is complete, scroll down to play more. Have fun.

In [None]:
# Installing All Software necessary for this page
# Move your mouse here and hit the play button on the left <<
!echo "Scroll down to see output in the window"
!echo ""
!echo "Installing jq for processing json..."; apt-get install -y jq > /dev/null
!echo "Installing parallel to go fast..."; apt-get install -y parallel > /dev/null
!echo "Installing bc to do command line math..."; apt-get install -y bc > /dev/null
!echo "Finally... Install ia command line tool!"; sudo pip install internetarchive > /dev/null


# type your archive.org username (email) and password at prompt
!until ia configure; do echo -e "\n*** Try again! or reset password: https://archive.org/account/forgot-password ***\n"; done   

!echo "Scroll down"
!ia --help # show what you can now do 
!echo "READY!"

Scroll down to see output in the window

Installing jq for processing json...
Installing parallel to go fast...
Installing bc to do command line math...
Finally... Install ia command line tool!
Enter your Archive.org credentials below to configure 'ia'.

Email address: brewster@archive.org
Password: 

Config saved to: /root/.ia
Scroll down
A command line interface to Archive.org.

usage:
    ia [--help | --version]
    ia [--config-file FILE] [--log | --debug]
       [--insecure] [--host HOST] <command> [<args>]...

options:
    -h, --help
    -v, --version
    -c, --config-file FILE  Use FILE as config file.
    -l, --log               Turn on logging [default: False].
    -d, --debug             Turn on verbose logging [default: False].
    -i, --insecure          Use HTTP for all requests instead of HTTPS [default: false]
    -H, --host HOST         Host to use for requests (doesn't work for requests made to
                            s3.us.archive.org) [default: archive.org]

comm

Now let's search a collection and get some item identifiers. An archive.org identifier is the last part of an item's URL, https://archive.org/details/itemID, for instance https://archive.org/details/78_--and-mimi_frankie-carle-and-his-orchestra-gregg-lawrence-kennedy-simon_gbia0006176a

You can also see an item's metadata and its downloadable files by changing the "verb" in the URL: instead of "/details/", use "/metadata/" or "/download/":

https://archive.org/metadata/78_--and-mimi_frankie-carle-and-his-orchestra-gregg-lawrence-kennedy-simon_gbia0006176a

https://archive.org/download/78_--and-mimi_frankie-carle-and-his-orchestra-gregg-lawrence-kennedy-simon_gbia0006176a
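
The three URLs above differ only in the verb. As a small sketch in Python (the language this notebook runs on), the helper below builds all three from an identifier. The name `item_urls` is made up for illustration and is not part of the `ia` tool.

```python
# Hypothetical helper: build the details, metadata, and download URLs
# for an archive.org identifier by swapping the "verb" in the path.
BASE = "https://archive.org"

def item_urls(identifier):
    return {verb: f"{BASE}/{verb}/{identifier}"
            for verb in ("details", "metadata", "download")}

urls = item_urls(
    "78_--and-mimi_frankie-carle-and-his-orchestra-gregg-lawrence-kennedy-simon_gbia0006176a"
)
for verb, url in urls.items():
    print(verb, url)
```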

Move your mouse over the gray area below and hit the play button.


In [None]:
!ia search "collection:georgeblood" --itemlist | head

78_--and-mimi_frankie-carle-and-his-orchestra-gregg-lawrence-kennedy-simon_gbia0006176a
78_--cold_the-savannah-six-mills-and-barrif_gbia0009322a
78_-and-mimi-pronounced-mee-mee_frankie-carle-and-his-orchestra-gregg-lawrence-ken_gbia0018080a
78_-and-mimi_art-lund-kennedy-simon-johnny-thompson_gbia0012786b
78_-and-mimi_charlie-spivak-and-his-orchestra-tommy-mercer-jimmy-kennedy-nat-simon_gbia0078293b
78_-and-mimi_dick-haymes-gordon-jenkins-and-his-orchestra-jimmy-kennedy-nat-simon_gbia0019235a
78_-and-mimi_ray-dorey-kennedy-simon-jack-matthias_gbia0068930b
78_-and-mimi_the-dinning-sisters-the-art-van-damme-quintet-jimmy-kennedy-nat-simon_gbia0071740a
78_-but-what-are-these_pearl-bailey-mitchell-ayres-thomas-fairbanks_gbia0030610b
78_-hist-hvor-vejen-slaar-en-bugt-this-is-where-the-road-is-turning-2-den-lille-o_gbia0051679b


With an identifier, you can download the files or get metadata for it. The metadata comes back as JSON. Let's take one of the identifiers and use the "ia metadata" command.

In [None]:
!ia metadata 78_--and-mimi_frankie-carle-and-his-orchestra-gregg-lawrence-kennedy-simon_gbia0006176a

{"created": 1600015357, "d1": "ia601609.us.archive.org", "d2": "ia801609.us.archive.org", "dir": "/6/items/78_--and-mimi_frankie-carle-and-his-orchestra-gregg-lawrence-kennedy-simon_gbia0006176a", "files": [{"name": "- And Mimi - Frankie Carle and his Orchestra.afpk", "source": "derivative", "format": "Columbia Peaks", "original": "- And Mimi - Frankie Carle and his Orchestra.flac", "mtime": "1493242632", "size": "33944", "md5": "be4c294f6d0622779aace02fc5912d82", "crc32": "52d603ad", "sha1": "a409021a5b2536ee4f4fded32e63334a090f34cb"}, {"name": "- And Mimi - Frankie Carle and his Orchestra.flac", "source": "original", "mtime": "1493242116", "size": "60400750", "md5": "9db12efde2740c455496a30d6843a39d", "crc32": "0acfebc9", "sha1": "ae9b6902380c54047c09a987e488f9a2969096f1", "format": "24bit Flac", "length": "167.25", "height": "1680", "width": "1680", "matrix_number": "HCO2467", "collection-catalog-number": "GBIA0006176A", "publisher": "Columbia", "title": "- And Mimi", "album": "- An
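
Because the output is one long line of JSON, it can help to load it into a program and walk through it. The sketch below uses only Python's standard `json` module on a trimmed copy of the record above (just a few fields are kept, and the variable names are my own).

```python
import json

# A trimmed sample of the JSON shown above; most fields are left out for brevity.
sample = '''{"created": 1600015357,
 "d1": "ia601609.us.archive.org",
 "files": [
   {"name": "- And Mimi - Frankie Carle and his Orchestra.afpk",
    "source": "derivative", "format": "Columbia Peaks", "size": "33944"},
   {"name": "- And Mimi - Frankie Carle and his Orchestra.flac",
    "source": "original", "format": "24bit Flac", "size": "60400750"}
 ]}'''

record = json.loads(sample)

# Print each file with its format and size; note that sizes arrive as strings.
for f in record["files"]:
    print(f["name"], f["format"], int(f["size"]), sep="\t")

# Keep only the files that were uploaded, not derived.
originals = [f["name"] for f in record["files"] if f["source"] == "original"]
print(originals)
```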

Getting a field out of JSON is easy with a tool called jq. It is a world in itself, so it is well worth getting to know.

I use "jq -r" when I want the value without the surrounding quotation marks, and plain "jq" when I want to keep them.
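
The same raw-versus-quoted distinction exists in Python, which may make it easier to picture. This is only an analogy using the standard `json` module, not how jq works internally.

```python
import json

doc = json.loads('{"foo":"thing", "bar":"other thing"}')

raw = doc["bar"]                 # like `jq -r .bar`: the bare string
quoted = json.dumps(doc["bar"])  # like `jq .bar`: JSON-encoded, quotes included

print(raw)
print(quoted)
```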

In [None]:
!echo '{"foo":"thing", "bar":"other thing"}' | jq -r .bar
!echo '{"foo":"thing", "bar":"other thing"}' | jq .bar

other thing
"other thing"


Say you wanted to list all the Paramount 78s and their catalog numbers. (A researcher in the Netherlands wanted to know this, and there is no easy way to list it in archive.org's UI.)

Let's find the names of the metadata fields of one of the 78s, so we can pick the ones we want to pull out.

Before we do this, let's use another handy tool: GNU Parallel. It is helpful for speeding things up by running commands in parallel, but also just for constructing ia command lines. (One quirk of the parallel command: passing --will-cite suppresses the citation notice it otherwise prints.)

In [None]:
#print one identifier:
!echo ""
!echo "#One identifier:"
!ia search "collection:georgeblood AND publisher:paramount" --itemlist | head -1

#print the json, but it is all in one line, so it is hard to read
!echo ""
!echo "#Metadata for that identifier as json:"
!ia search "collection:georgeblood AND publisher:paramount" --itemlist | head -1 | parallel --will-cite 'ia metadata {}'

#pretty print only the "metadata" key of the response, which holds the item's catalog metadata (the response has other top-level keys too, such as "files")
!echo ""
!echo "Pretty print part of the metadata for that identifier as json:"
!ia search "collection:georgeblood AND publisher:paramount" --itemlist | head -1 | parallel --will-cite 'ia metadata {}' | jq .metadata


#One identifier:
78_1.-kitchen-parade-2.-a-family-is-a-blanket_chubbys-rascals-david-jackson_gbia0003323b

#Metadata for that identifier as json:
{"created": 1600021497, "d1": "ia600603.us.archive.org", "d2": "ia800603.us.archive.org", "dir": "/29/items/78_1.-kitchen-parade-2.-a-family-is-a-blanket_chubbys-rascals-david-jackson_gbia0003323b", "files": [{"name": "1. Kitchen Parade; 2. A Family Is A Blanket - Chubby's Rascals.afpk", "source": "derivative", "format": "Columbia Peaks", "original": "1. Kitchen Parade; 2. A Family Is A Blanket - Chubby's Rascals.flac", "mtime": "1489020126", "size": "47960", "md5": "750b906c751e76bf501a21960b1550dc", "crc32": "66196366", "sha1": "8747a905254f3eb1f95d6c1e046ac047d82cdc8e"}, {"name": "1. Kitchen Parade; 2. A Family Is A Blanket - Chubby's Rascals.flac", "source": "original", "mtime": "1489019670", "size": "92570551", "md5": "baef67d8c3f7ef6eb9d153464ed341e7", "crc32": "789aea7f", "sha1": "941a15382aafec829bc587d1f8441e2b0a2bac97", "format": "

OK, got it. We want the publisher-catalog-number field. Let's say we want to print the publisher, catalog number, title, and identifier as tab-delimited text. The first command below prints one row to test our command, and the second prints 100.



In [None]:
!ia metadata 78_1.-kitchen-parade-2.-a-family-is-a-blanket_chubbys-rascals-david-jackson_gbia0003323b | jq -r '[.metadata.publisher, .metadata."publisher-catalog-number", .metadata.title, .metadata.identifier]| @tsv'

ABC-Paramount	9755	1. Kitchen Parade; 2. A Family Is A Blanket	78_1.-kitchen-parade-2.-a-family-is-a-blanket_chubbys-rascals-david-jackson_gbia0003323b
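
For comparison, here is a rough Python sketch of what that jq expression does: pick four fields and join them with tabs. The dictionary is a stand-in filled with the values printed above, `to_tsv` is a made-up helper name, and unlike jq's `@tsv` it does not escape tabs or newlines inside values.

```python
# Stand-in for the "metadata" object returned by `ia metadata`,
# filled with the values from the output above.
metadata = {
    "publisher": "ABC-Paramount",
    "publisher-catalog-number": "9755",
    "title": "1. Kitchen Parade; 2. A Family Is A Blanket",
    "identifier": "78_1.-kitchen-parade-2.-a-family-is-a-blanket_chubbys-rascals-david-jackson_gbia0003323b",
}

FIELDS = ["publisher", "publisher-catalog-number", "title", "identifier"]

def to_tsv(meta, fields=FIELDS):
    # In this sketch, a missing field becomes an empty column.
    return "\t".join(str(meta.get(name, "")) for name in fields)

row = to_tsv(metadata)
print(row)
```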


In [None]:
!ia search "collection:georgeblood AND publisher:paramount" --itemlist | head -100 | parallel -j10 --will-cite 'ia metadata {}' | jq -r '[.metadata.publisher, .metadata."publisher-catalog-number", .metadata.title,.metadata.identifier]| @tsv'

ABC-Paramount	9755	1. Kitchen Parade; 2. A Family Is A Blanket	78_1.-kitchen-parade-2.-a-family-is-a-blanket_chubbys-rascals-david-jackson_gbia0003323b
ABC-Paramount	78-9874	A Very Special Love	78_a-very-special-love_johnny-nash-don-costa-robert-allen_gbia0067143a
ABC-Paramount	9755	1. The Little Rascal Song; 2. The Hippo And The Walrus	78_1.-the-little-rascal-song-2.-the-hippo-and-the-walrus_chubbys-rascals-jackson-dav_gbia0003323a
ABC-Paramount	9743	A Teenager Sings the Blues	78_a-teenager-sings-the-blues_johnny-nash-don-costa-his-orchestra-and-chorus-reid-marc_gbia0073275b
Paramount	30081-A	A Little Birch Canoe and You	78_a-little-birch-canoe-and-you_sterling-trio-callahan-roberts_gbia0195527a
Paramount	14004B	African Rag	78_african-rag_unknown-pianist_gbia0001305b
ABC-Paramount	9765	A Rose and a Baby Ruth	78_a-rose-and-a-baby-ruth_george-hamilton-iv-johnny-dee_gbia0059301a
Paramount	14004B	African Rag	78_african-rag_unknown-pianist_gbia0000937b
ABC-Paramount	9765	A Rose and a Baby 
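
The `parallel -j10` part runs up to ten `ia metadata` lookups at once. A loose Python analogue is a thread pool with ten workers, sketched below. The `fetch_metadata` function and the identifiers are placeholders so the sketch runs offline; a real version would do the lookup there.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_metadata(identifier):
    # Placeholder for a real lookup (for example, running `ia metadata`);
    # it returns a fake record so this sketch needs no network.
    return {"metadata": {"identifier": identifier}}

identifiers = [f"example-item-{n}" for n in range(25)]  # made-up identifiers

# max_workers=10 plays the role of `parallel -j10`.
with ThreadPoolExecutor(max_workers=10) as pool:
    records = list(pool.map(fetch_metadata, identifiers))

print(len(records), records[0]["metadata"]["identifier"])
```

One difference worth knowing: `pool.map` returns results in input order, while GNU Parallel prints each job's output as it finishes unless you ask it to keep the order.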

Now you know how to do a pretty sophisticated query.




# Count the words in the Bible from the Project Gutenberg collection of public domain texts



First, we have to find the item in the Archive. I went to the website and searched for "project gutenberg bible" but got too many results. So I searched for [Project Gutenberg](https://archive.org/search.php?query=project%20gutenberg) to find the collection and [clicked it](https://archive.org/details/gutenberg). There I [typed Bible](https://archive.org/details/gutenberg?and%5B%5D=bible&sin=) in the left sidebar and again got too many results, so I typed [King James Bible](https://archive.org/details/gutenberg?and%5B%5D=king+james+bible&sin=) instead and found [this one](https://archive.org/details/thebibleoldandne00010gut). Its Internet Archive item identifier is thebibleoldandne00010gut.

To find the right file in thebibleoldandne00010gut, let's list its files. This is done with the 'ia list' command.

In [None]:
!ia list thebibleoldandne00010gut

kjv10.txt
kjv10.zip
thebibleoldandne00010gut_archive.torrent
thebibleoldandne00010gut_files.xml
thebibleoldandne00010gut_meta.xml
thebibleoldandne00010gut_reviews.xml
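
With a listing like the one above, picking out the plain-text file can be automated. A minimal Python sketch, with the file names copied from the output and variable names of my own choosing:

```python
# File names copied from the `ia list` output above.
files = [
    "kjv10.txt",
    "kjv10.zip",
    "thebibleoldandne00010gut_archive.torrent",
    "thebibleoldandne00010gut_files.xml",
    "thebibleoldandne00010gut_meta.xml",
    "thebibleoldandne00010gut_reviews.xml",
]

# Keep only names ending in .txt (compared case-insensitively).
text_files = [name for name in files if name.lower().endswith(".txt")]
print(text_files)
```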


Now let's get the .txt file and print the first 35 lines. We can use the 'ia download' command to access the file, but instead of saving it onto the virtual machine, we can pass the --stdout parameter to pipe it into another command, in this case 'head'.

In [None]:
!ia download thebibleoldandne00010gut kjv10.txt --stdout | head -35

**Welcome To The World of Free Plain Vanilla Electronic Texts**

**Etexts Readable By Both Humans and By Computers, Since 1971**

*These Etexts Prepared By Hundreds of Volunteers and Donations*


August, 1989  [Etext #10]


******The Project Gutenberg Etext of The King James Bible******
******This file should be named kjv10.txt or kjv10.zip*****


Copyright laws are changing all over the world, be sure to check
the copyright laws for your country before posting these files!!

Please take a look at the important information in this header.
We encourage you to keep this file on your own disk, keeping an
electronic path open for the next readers.  Do not remove this.


**Welcome To The World of Free Plain Vanilla Electronic Texts**

**Etexts Readable By Both Humans and By Computers, Since 1971**

*These Etexts Prepared By Hundreds of Volunteers and Donations*

Information on contacting Project Gutenberg to get Etexts, and
further information is included below.

Now let's count the words in the whole txt file. Rather than writing it to the drive, let's stream it into the Unix word count command "wc".

In [None]:
!ia download thebibleoldandne00010gut kjv10.txt --stdout | wc -w

823156
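
The streaming idea carries over to Python: read the text line by line and add up the words, so the whole file never has to be held in memory. This sketch counts whitespace-separated words, the same notion of a word that `wc -w` uses; the sample text and the name `count_words` are just for illustration.

```python
import io

def count_words(stream):
    # Iterate line by line; str.split() with no arguments splits
    # on any run of whitespace and ignores leading/trailing blanks.
    return sum(len(line.split()) for line in stream)

# A tiny in-memory stand-in for the downloaded text stream.
sample = io.StringIO("In the beginning God created\nthe heaven and the earth.\n")
total = count_words(sample)
print(total)
```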


As a trick, to print the resulting number with commas there is a handy command: numfmt --grouping

In [None]:
!ia download thebibleoldandne00010gut kjv10.txt --stdout | wc -w | numfmt --grouping

823,156
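
If you are already in Python, the format specifier `,` does the same job. One difference: `numfmt --grouping` follows the machine's locale settings, while Python's `,` always groups with commas.

```python
word_count = 823156

# The "," format specifier inserts a comma between each group of three digits.
grouped = f"{word_count:,}"
print(grouped)
```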


Done, congratulations!

# Report on the Number of uploads per hour to a collection

When uploads of 78rpm records are in progress, I find it helpful to see how many are happening per hour. This is how I watch it; maybe it has some command line tricks that are helpful or at least instructive.

In [None]:

# only way I found to set an environment variable in google colab (sorry):
import os
os.environ['SEARCH'] = "collection:georgeblood AND publicdate:[2020-08-01 TO *]"
!echo "$SEARCH"

!printf "\n\n"; echo -n "Number of 78rpm items uploaded completely: "
!ia search "$SEARCH" --num-found

!echo -n "Now: "; date -u +"%Y-%m-%dT%H"
!echo "Number of uploads completed in each hour of recent 24 hours:"
!ia search "$SEARCH" -f publicdate -s publicdate | jq -r .publicdate | cut -d: -f1| sort -nr | uniq -c | head -24

!date



collection:georgeblood AND publicdate:[2020-08-01 TO *]


Number of 78rpm items uploaded completely: 8164
Number not included in index (not complete): 0
Now: 2020-09-13T16
Number completed per hour in each of 24 hours:
     16 2020-08-21T20
      7 2020-08-21T19
      1 2020-08-21T18
      4 2020-08-21T09
    169 2020-08-21T08
    181 2020-08-21T07
    167 2020-08-21T06
    166 2020-08-21T05
    174 2020-08-21T04
    172 2020-08-21T03
    163 2020-08-21T02
    170 2020-08-21T01
    188 2020-08-21T00
    182 2020-08-20T23
    180 2020-08-20T22
    175 2020-08-20T21
    193 2020-08-20T20
    167 2020-08-20T19
    159 2020-08-20T18
    170 2020-08-20T17
     14 2020-08-19T23
     31 2020-08-19T22
     50 2020-08-19T21
     56 2020-08-19T20
Sun Sep 13 16:58:47 UTC 2020
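
The `cut | sort | uniq -c` pipeline is a classic way to count things per bucket. Here is roughly the same logic in Python, using `collections.Counter`. The timestamps are made up, written in the same shape as the hourly buckets above.

```python
from collections import Counter

# Made-up publicdate values, for illustration only.
publicdates = [
    "2020-08-21T08:15:02Z",
    "2020-08-21T08:47:30Z",
    "2020-08-21T07:59:11Z",
    "2020-08-20T23:05:44Z",
]

# Keep everything before the first colon ("2020-08-21T08"), as `cut -d: -f1` does.
per_hour = Counter(stamp.split(":", 1)[0] for stamp in publicdates)

# Newest hour first, at most 24 rows, in the style of `uniq -c | head -24`.
for hour in sorted(per_hour, reverse=True)[:24]:
    print(f"{per_hour[hour]:7d} {hour}")
```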


# Producing a report on periodical uploads


Producing a report on uploaded microfilm periodicals is sometimes helpful, and might be seen as a template for other such reports.

The last command, "Pages processed", can take up to 10 minutes to run because it sums a field across more than 100k items.

If you want to go on to the next section, you can stop it by clicking on the spinning play button.


In [None]:
!date
!echo -n "Canisters cataloged:  "  # count all items in this collection
!ia search "collection:sim_raw_scans" --num-found |  numfmt --grouping

!echo -n "Canisters uploaded:   "  # count the canister items that have a jp2 in it, since that means it has been uploaded
!ia search "collection:sim_raw_scans AND format:jp2" --num-found |  numfmt --grouping

!echo -n "Issues uploaded:      "   #count the number of texts in this collection
!ia search "collection:sim_microfilm AND mediatype:texts" --num-found | numfmt --grouping

!echo -n "Issues processed:     "   # count the number of texts that have a pdf, so therefore processed
!ia search "collection:sim_microfilm AND mediatype:texts AND format:pdf" --num-found | numfmt --grouping

!echo -n "Issues unprocessed:   "   # count those that do not have a pdf, as those have not been processed
!ia search "collection:sim_microfilm AND mediatype:texts AND NOT format:pdf" --num-found |  numfmt --grouping

!echo -n "Pages processed:      "   # go through each item in this collection and sum the imagecount metadata values
!time ia search 'collection:sim_microfilm' -f imagecount | jq -r '.imagecount' | paste -sd+ - | bc |  numfmt --grouping

Mon Sep 14 17:54:05 UTC 2020
Canisters cataloged:  1,072
Canisters uploaded:   878
Issues uploaded:      143,815
Issues processed:     143,761
Issues unprocessed:   53
Pages processed:      13,198,506

real	3m21.751s
user	0m2.245s
sys	0m0.160s
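
The `paste -sd+ - | bc` trick glues all the numbers into one long sum and hands it to a calculator. In Python the same total is a one-line `sum`. The records below are invented, and since I am not assuming whether `imagecount` arrives as a number or a string, the sketch tolerates both as well as items where the field is missing.

```python
# Made-up search results; real ones would come from
# `ia search 'collection:sim_microfilm' -f imagecount`.
results = [
    {"identifier": "example-issue-1", "imagecount": 120},
    {"identifier": "example-issue-2", "imagecount": "88"},  # string value
    {"identifier": "example-issue-3"},                      # field missing
]

# Treat a missing or empty imagecount as zero pages.
total_pages = sum(int(r.get("imagecount") or 0) for r in results)
print(f"{total_pages:,}")
```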


# Watching microfilm processing to debug issues

Microfilm canisters (ribbons) go through [different states](https://docs.google.com/spreadsheets/d/1cadT7wCoMlVU0UXrokrcuzxfRzj4CwZf72lDdCkvPzk/edit#gid=0). Listing which states the canisters are in can help reveal workflow issues.

In [None]:
!echo "How many canisters are in each state (as represented in the metadata of the canister)"
!date
!ia search "ribbon_state:*" -f ribbon_state | jq -r .ribbon_state | sort | uniq -c

How many canisters are in each state (as represented in the metadata of the canister)
Mon Sep 14 19:26:17 UTC 2020
     28 output_complete
      1 outputting
      3 packaging
     52 packaging_complete
      3 pre_output
     28 prepped
    156 scanned
    401 upload_complete


In [None]:
!echo "list 50 most recent state changes on canisters"
!date
!ia search "ribbon_state:*" -f ribbon_state -f ribbon_state_modify_date | jq -r '[.ribbon_state_modify_date, .ribbon_state, .identifier] | @tsv' | sort -r |\
 head -50


list 50 most recent state changes on canisters
Mon Sep 14 19:16:31 UTC 2020
2020-09-15T02:37:37Z	packaging	sim_raw_scan_IA1504516-06
2020-09-15T02:35:57Z	outputting	sim_raw_scan_IA1504516-07
2020-09-15T02:35:00Z	packaging	sim_raw_scan_IA1504515-07
2020-09-14T22:15:50Z	upload_complete	sim_raw_scan_IA1504516-03
2020-09-14T21:53:53Z	upload_complete	sim_raw_scan_IA1504521-03
2020-09-14T21:47:41Z	scanned	sim_raw_scan_IA1532504-07
2020-09-14T21:44:20Z	upload_complete	sim_raw_scan_IA1504518-05
2020-09-14T21:41:44Z	packaging_complete	sim_raw_scan_IA1504511-07
2020-09-14T21:19:55Z	scanned	sim_raw_scan_IA1532504-05
2020-09-14T21:13:18Z	scanned	sim_raw_scan_IA1532503-05
2020-09-14T21:09:33Z	scanned	sim_raw_scan_IA1532504-04
2020-09-14T20:47:30Z	scanned	sim_raw_scan_IA1532504-02
2020-09-14T20:32:02Z	scanned	sim_raw_scan_IA1532504-01
2020-09-14T20:26:20Z	packaging_complete	sim_raw_scan_IA1504518-03
2020-09-14T20:16:08Z	upload_complete	sim_raw_scan_IA1504515-05
2020-09-14T20:07:57Z	scanned	sim_raw_s
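
The plain `sort -r` works here because timestamps written like these (year first, fixed width, same time zone) sort chronologically when compared as ordinary text. A short Python sketch of the same idea, using a few rows copied from the output above:

```python
# A few rows copied from the output above: (modify date, state, identifier).
rows = [
    ("2020-09-14T22:15:50Z", "upload_complete", "sim_raw_scan_IA1504516-03"),
    ("2020-09-15T02:35:57Z", "outputting", "sim_raw_scan_IA1504516-07"),
    ("2020-09-14T21:47:41Z", "scanned", "sim_raw_scan_IA1532504-07"),
    ("2020-09-15T02:37:37Z", "packaging", "sim_raw_scan_IA1504516-06"),
]

# Tuples compare by their first element first, so this sorts by timestamp text.
latest_first = sorted(rows, reverse=True)

for row in latest_first[:50]:
    print("\t".join(row))
```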

# What can you do now?

Now you can add to or modify these commands right in this tool! Hit "Copy to Drive" above to "fork" this page, make it your own, and add to it. Please share!