# Hesburgh Libraries Internet Archive Script

Script created in 2018 by Daniel Johnson, English and DH Librarian, at the Navari Family Center for Digital Scholarship, Hesburgh Library, University of Notre Dame. Use at your own risk -- the Hesburgh Libraries Internet Archive Script is released under a [University of Illinois/NCSA Open Source License](https://opensource.org/licenses/NCSA). For terms and details, see the `license` file.

**NOTE:** *If any of these explanation cells display as raw markup, press "ctrl+enter" on the cell to get nicely rendered text. Be aware, however, that pressing "ctrl+enter" on a cell which **does** contain computer code will execute that code.*

## Install and configure login

The first step is to install the Internet Archive package from the command line. It provides both the Python functionality and a command-line program for configuring your archive.org credentials (essentially a stored log-in). For anything to work, an Internet Archive administrator must first authorize your Internet Archive email address to contribute to the Notre Dame Internet Archive collection.

[Here are the installation instructions](https://internetarchive.readthedocs.io/en/latest/installation.html). Installation is decidedly friendlier on Mac or Linux, though it also works on Windows with varying degrees of troubleshooting. You may have to install pip, for example, though pip is supposed to come with recent versions of Python 3.
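
Before going further, it can help to confirm that Python can actually find the package. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and is not part of the Internet Archive package.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("internetarchive"):
    print("internetarchive is installed.")
else:
    print("internetarchive was not found; try: pip install internetarchive")
```

If the second message appears, revisit the installation instructions above before continuing.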

[Next, here are the login configuration instructions](https://internetarchive.readthedocs.io/en/latest/quickstart.html), under the *"Configuring"* header. The good thing is, once you have done this successfully, you shouldn't have to repeat it on that computer.

## Running the script

Your input spreadsheet (CSV) will need to follow a specific format to work correctly. Also, the name for the PDF you wish to upload with a given record should be the same as the item's "File ID", just with the file extension ".pdf" added to the end. The script expects these PDF files to be in a subfolder called `pdfs`.
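
Since a missing or misnamed PDF would only surface partway through an upload run, it may be worth checking the naming rule above beforehand. The sketch below is one possible pre-flight check, not part of the original script; the function name `missing_pdfs` is hypothetical, and it assumes (as the upload code does) that a "File ID" cell may list several IDs separated by pipes.

```python
import csv
import os


def missing_pdfs(csv_path, pdf_dir="pdfs"):
    """Return the expected PDF paths that do not exist on disk."""
    missing = []
    with open(csv_path, "r", newline="", encoding="utf-8") as csv_in:
        for row in csv.DictReader(csv_in):
            # A File ID cell may hold several IDs separated by pipes (|).
            for file_id in row["File ID"].split("|"):
                path = os.path.join(pdf_dir, file_id + ".pdf")
                if not os.path.isfile(path):
                    missing.append(path)
    return missing
```

Calling `missing_pdfs('MYFILE.csv')` should return an empty list when every record has its PDF in place.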

The IA prefers certain formats for certain fields.

[MARC21 Language Codes](https://www.loc.gov/marc/languages/language_name.html) are preferred for the language column.

[An ISO 8601 format](https://en.wikipedia.org/wiki/ISO_8601) is strongly requested for dates. Basically, that means one of the following:
* YYYY
* YYYY-MM-DD
* YYYY-MM-DD HH:MM:SS
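
The three layouts above can be checked mechanically before uploading. This is a sketch under the assumption that only those three layouts should pass; the name `is_accepted_date` is hypothetical. A regular expression enforces the zero-padded layout, and `datetime.strptime` then rejects impossible values such as February 30.

```python
import re
from datetime import datetime

# Each entry pairs a strict layout pattern with the matching strptime format.
_LAYOUTS = (
    (re.compile(r"\d{4}"), "%Y"),
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "%Y-%m-%d"),
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "%Y-%m-%d %H:%M:%S"),
)


def is_accepted_date(value):
    """Return True if value uses one of the three layouts and is a real date."""
    for pattern, fmt in _LAYOUTS:
        if pattern.fullmatch(value):
            try:
                datetime.strptime(value, fmt)
                return True
            except ValueError:
                # Layout is right but the date itself is impossible.
                return False
    return False
```

For example, `is_accepted_date("1999-02-28")` passes, while `is_accepted_date("Feb 1999")` and `is_accepted_date("1999-02-30")` do not.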

The following columns (underneath the appropriate column letter) are what the script expects. The first row of the spreadsheet should contain these headers exactly. Notes on specific usage are included below the appropriate header.

| A | B | C | D | E | F | G |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| **EMPTY** | **File ID** | **System Number** | **RDDS notes** | **OCLC Control Number** | **Title** | **Author** | 
| Leave empty | Must match pdf name | ---- | ---- | MUST be included | ---- | ---- |

| H | I | J | K | L | M |
| :---- | :---- | :---- | :---- | :---- | :---- |
| **Abstract** | **Subject Headings** | **Source** | **Edition** | **Place of Publication** | **Publisher** |
| ---- | Delimit multiple subjects with a pipe (<code>&#124;</code>) | ---- | ---- | ---- | ---- |

| N | O | P | Q | R | S |
| :---- | :---- | :---- | :---- | :---- | :---- |
| **Extent** | **Rights** | **Date Created** | **Date Copyrighted** | **Type** | **Language** |
| ---- | ---- | Use an ISO 8601 format | (currently not uploaded) | ---- | Use MARC21 language codes; delimit multiple languages with a pipe (<code>&#124;</code>) |

| T | U | V | W | 
| :---- | :---- | :---- | :---- |
| **Permission** | **Version** | **Table of Contents** | **Contributor Institution** |
| ---- | ---- | ---- | Should always be "University of Notre Dame" | 
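
Because the script looks up values by header name, a misspelled header causes a failure mid-run. The following sketch compares a spreadsheet's first row against the headers in the tables above. The names `EXPECTED_HEADERS` and `header_problems` are hypothetical, and column A is deliberately skipped on the assumption that its header may not read back cleanly.

```python
import csv

# Headers for columns A through W, as listed in the tables above.
EXPECTED_HEADERS = [
    "EMPTY", "File ID", "System Number", "RDDS notes", "OCLC Control Number",
    "Title", "Author", "Abstract", "Subject Headings", "Source", "Edition",
    "Place of Publication", "Publisher", "Extent", "Rights", "Date Created",
    "Date Copyrighted", "Type", "Language", "Permission", "Version",
    "Table of Contents", "Contributor Institution",
]


def header_problems(csv_path):
    """Return a list of mismatch descriptions for columns B onward."""
    with open(csv_path, "r", newline="", encoding="utf-8") as csv_in:
        actual = next(csv.reader(csv_in), [])
    problems = []
    # Start at index 1 so that column A is not checked.
    for index in range(1, len(EXPECTED_HEADERS)):
        expected = EXPECTED_HEADERS[index]
        found = actual[index] if index < len(actual) else None
        if found != expected:
            letter = chr(ord("A") + index)
            problems.append(
                f"column {letter}: expected {expected!r}, found {found!r}"
            )
    return problems
```

An empty list from `header_problems('MYFILE.csv')` means columns B through W match.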

The "EMPTY" column (A) is there to correct a potential byte-order issue that could arise when Excel is used on the CSV file.
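
To illustrate why a sacrificial first column helps, the sketch below assumes the byte-order issue is the UTF-8 byte-order mark (BOM) that Excel can prepend when saving a CSV; that is an assumption, not something the script documents. When such a file is read with `encoding='utf-8'`, the BOM is glued to the first header only, so the headers the script actually uses stay intact. The sample bytes and the function name `first_two_headers` are hypothetical.

```python
import csv
import io


def first_two_headers(raw_bytes):
    """Decode CSV bytes as plain UTF-8 and return the first two header names."""
    text = raw_bytes.decode("utf-8")  # plain utf-8 does NOT strip a BOM
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return reader.fieldnames[0], reader.fieldnames[1]


# Hypothetical file contents beginning with the UTF-8 BOM (EF BB BF).
with_bom = b"\xef\xbb\xbfEMPTY,File ID\r\n,abc123\r\n"
first, second = first_two_headers(with_bom)
print(repr(first))   # column A's header has absorbed the BOM
print(repr(second))  # column B's header is unaffected
```

Python's `utf-8-sig` codec strips a leading BOM when decoding, which would be an alternative to the extra column, but the script as written relies on column A instead.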

### Importing packages

First, there is a fairly simple import. We have the `upload` function from the Internet Archive package, and the `csv` module for working with our spreadsheet format. If this doesn't work, you can go no further in the program until you fix it.

In [None]:
from internetarchive import upload
import csv

### Running the upload

In [1]:
### ENTER input CSV file in place of MYFILE.csv here
with open('MYFILE.csv', 'r', newline='', encoding='utf-8') as csvIn:
    
    csvObject = csv.DictReader(csvIn)
    
    for row in csvObject:
              
        # Subjects are delimited on pipes (|) so we make a list of them to feed in.
        splitSubjects = (row['Subject Headings']).split("|")
        
        # Languages are delimited on pipes (|) so we make a list of them to feed in.
        splitLanguages = (row['Language']).split("|")
        
        # Formula for making a unique Notre Dame identifier
        itemIdentifier = 'nd' + row['OCLC Control Number']
        
        print(itemIdentifier)      
        
        # We create a dictionary of all the metadata
        # Make sure to check that dates are ISO-8601 compatible!
        md = dict(
            collection='universitynotredamelibraries', 
            title=row['Title'], 
            mediatype='texts', 
            contributor=row['Contributor Institution'], 
            subject=splitSubjects,
            
            system_number = row['System Number'], 
            oclc_number = row['OCLC Control Number'], 
            call_number = row['File ID'],
            creator = row['Author'], 
            abstract = row['Abstract'], 
            publisher = (row['Place of Publication'] + " : " + row['Publisher']),
            place_of_publication = row['Place of Publication'],
            extent = row['Extent'],
            date = row['Date Created'],
            language = splitLanguages,
                    
        )
        
        # Split the File ID on pipes (|) in case the record has more than one PDF.
        theFileList = (row['File ID']).split("|")
        
        # Prefix the directory (pdfs) and suffix the extension (.pdf) for each file.
        # The result must be assigned back, or the bare IDs would be passed to upload().
        theFileList = ["pdfs/" + s + ".pdf" for s in theFileList]
        
        print(theFileList)
        #print(md)
        
        # upload() returns a list of responses; print the status code of the first.
        firstfile = upload(itemIdentifier, files=theFileList, metadata=md)
        print(firstfile[0].status_code)




# General Internet Archive Documentation: https://internetarchive.readthedocs.io/en/latest/quickstart.html#uploading
# New URL for the above: https://archive.org/services/docs/api/internetarchive/quickstart.html#uploading

# Metadata Fields: https://internetarchive.readthedocs.io/en/latest/metadata.html



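
The row-handling logic in the upload cell can be exercised without contacting archive.org by moving it into a function and inspecting what it returns. This is a sketch, not the script's own structure: the function name `build_upload_args` and the sample row values are hypothetical, while the field mapping mirrors the `md` dictionary above.

```python
def build_upload_args(row):
    """Return (identifier, file list, metadata) for one spreadsheet row."""
    item_identifier = "nd" + row["OCLC Control Number"]
    # One path per pipe-delimited File ID, inside the pdfs folder.
    file_list = ["pdfs/" + s + ".pdf" for s in row["File ID"].split("|")]
    metadata = dict(
        collection="universitynotredamelibraries",
        title=row["Title"],
        mediatype="texts",
        contributor=row["Contributor Institution"],
        subject=row["Subject Headings"].split("|"),
        system_number=row["System Number"],
        oclc_number=row["OCLC Control Number"],
        call_number=row["File ID"],
        creator=row["Author"],
        abstract=row["Abstract"],
        publisher=row["Place of Publication"] + " : " + row["Publisher"],
        place_of_publication=row["Place of Publication"],
        extent=row["Extent"],
        date=row["Date Created"],
        language=row["Language"].split("|"),
    )
    return item_identifier, file_list, metadata


# Hypothetical row, for illustration only.
sample_row = {
    "File ID": "a|b", "System Number": "001", "OCLC Control Number": "12345",
    "Title": "Sample Title", "Author": "Sample Author", "Abstract": "",
    "Subject Headings": "X|Y", "Place of Publication": "Paris",
    "Publisher": "Hachette", "Extent": "10 p.", "Date Created": "1999",
    "Language": "fre|eng", "Contributor Institution": "University of Notre Dame",
}
identifier, files, md = build_upload_args(sample_row)
print(identifier)
print(files)
```

Printing the three return values for a few rows is a cheap way to spot bad dates or mis-split subjects before anything is sent to the Internet Archive.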