# Grabbing MODS for LC Web Archives

The LCWA site IDs below can be used to retrieve MODS records for web archives that were represented on the Library of Congress website [loc.gov](http://www.loc.gov/) as of August 2018.

These are newer MODS records that don't have LCSH:
```
lcwaN0010234,lcwaN0001999,lcwaN0003238,lcwaN0010144,lcwaN0010145,
lcwaN0012178,lcwaN0012179,lcwaN0012180,lcwaN0012184,lcwaN0012195,
lcwaN0010932,lcwaN0010933,lcwaN0010936,lcwaN0010937,lcwaN0010940,
```

These have LCSH in `<subject>`:

`lcwaN0010888,lcwaN0010226,lcwaN0009692,lcwaN0009700,lcwaN0010401`

These are election sites whose `<subject>` elements include both LCSH and local terms:

`lcwaE0008846,lcwaE0008263,lcwaE0008338,lcwaE0008918,lcwaE0008001`

In [1]:
# A list of LCWA site IDs that can be used to retrieve the corresponding records
siteIDs = [
    'lcwaN0010234', 'lcwaN0001999', 'lcwaN0003238', 'lcwaN0010144', 'lcwaN0010145',
    'lcwaN0012178', 'lcwaN0012179', 'lcwaN0012180', 'lcwaN0012184', 'lcwaN0012195',
    'lcwaN0010932', 'lcwaN0010933', 'lcwaN0010936', 'lcwaN0010937', 'lcwaN0010940',
    # Note: the records on the next line are older and have a different structure
    'lcwaN0010888', 'lcwaN0010226', 'lcwaN0009692', 'lcwaN0009700', 'lcwaN0010401',
    'lcwaE0008846', 'lcwaE0008263', 'lcwaE0008338', 'lcwaE0008918', 'lcwaE0008001',
]

print(len(siteIDs))

25


The URL for each ID is formed by appending the ID and a `.xml` extension to the MODS endpoint, for example:

`lcwaN0010888,https://cdn.loc.gov/service/webcapture/project_1/mods/lcwaN0010888.xml`
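The pattern above can be sketched as a small helper. This is only an illustration, assuming every record sits under the same `project_1/mods/` path as the example; the names `MODS_ENDPOINT` and `build_mods_url` are hypothetical and not part of the notebook.

```python
# Hypothetical helper: the base path is copied from the example URL above.
MODS_ENDPOINT = 'https://cdn.loc.gov/service/webcapture/project_1/mods/'

def build_mods_url(site_id, fmt='xml'):
    """Return the URL of the MODS record for one LCWA site ID."""
    return MODS_ENDPOINT + site_id + '.' + fmt

# Print ID,URL pairs in the same layout as the example line above.
example_ids = ['lcwaN0010888', 'lcwaE0008846']
for site_id in example_ids:
    print(site_id + ',' + build_mods_url(site_id))
```

Nothing is requested here; the helper only builds strings, so it is safe to run over the whole ID list before fetching anything.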

The function `request_single_lcwa_MODS()` below takes an LCWA identifier and a file format (here, `xml`) and requests the corresponding MODS file. You can use it to build collections of MODS records, which can be wrapped in a `<modsCollection>` element.

In [2]:
import requests
import xml.etree.ElementTree as ET

# base URL for LCWA MODS records
endpoint = 'https://cdn.loc.gov/service/webcapture/project_1/mods/'

def request_single_lcwa_MODS(lcwa, fo):
    """Request the MODS record for one LCWA identifier in the given format (e.g. 'xml')."""
    urlRequest = endpoint + lcwa + '.' + fo
    response = requests.get(urlRequest)
    return response

# only request the first record in the list as a quick test
for x in siteIDs[:1]:
    response_back = request_single_lcwa_MODS(x, 'xml')
    returnedHeaders = response_back.headers
    returnedContent = response_back.content
    print(returnedHeaders)
    print(returnedContent.decode('UTF-8'))
    print('Type: ', type(returnedContent))
    

{'Date': 'Mon, 27 Aug 2018 01:01:35 GMT', 'Content-Type': 'text/xml', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Set-Cookie': '__cfduid=dac995cd5603f413afdbdf9bb9354b7a31535331695; expires=Tue, 27-Aug-19 01:01:35 GMT; path=/; domain=.loc.gov; HttpOnly', 'Last-Modified': 'Tue, 12 Jun 2018 17:37:27 GMT', 'ETag': 'W/"25672b74-997-56e7551e84a87"', 'Access-Control-Allow-Origin': '*', 'CF-Cache-Status': 'HIT', 'Vary': 'Accept-Encoding', 'Expires': 'Mon, 27 Aug 2018 05:01:35 GMT', 'Cache-Control': 'public, max-age=14400', 'Expect-CT': 'max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"', 'Server': 'cloudflare', 'CF-RAY': '450a9b991e8023e4-IAD', 'Content-Encoding': 'gzip'}
<mods version="3.4" xmlns="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd"><identifier>lcwaN0010

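Some of the records (anything in the list with an index of 10 or greater) are generated differently: they include a top-level XML declaration and are printed line by line with indentation rather than as run-in text. This is better for human readability and should not substantively affect parsing of an individual record. An XML declaration is only permitted at the very start of a document, however, so it has to be removed before such a record is written inside a `<modsCollection>` wrapper; otherwise the resulting collection is not well-formed XML.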
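One caveat when combining records: an XML declaration may only appear at the very start of a document, so a record that carries one cannot be pasted as-is inside a wrapper element. A minimal sketch of one way to handle this follows, using only the standard library. The helper name `strip_xml_declaration` and the shortened sample record are assumptions for illustration, not part of the notebook.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration (plus surrounding whitespace) at the start of a string.
XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>\s*')

def strip_xml_declaration(text):
    """Return text without a leading XML declaration, if one is present."""
    return XML_DECL.sub('', text, count=1)

# Shortened stand-in for a record that starts with a declaration.
sample = ("<?xml version='1.0' encoding='UTF-8'?>\n"
          '<mods xmlns="http://www.loc.gov/mods/v3">'
          '<identifier>lcwaN0010888</identifier></mods>')

# Wrap the cleaned record and confirm that the result parses.
wrapped = '<modsCollection>' + strip_xml_declaration(sample) + '</modsCollection>'
root = ET.fromstring(wrapped)
print(root.tag, len(root))
```

Records without a declaration pass through unchanged, so the same helper can be applied to every record regardless of how it was generated.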

In [3]:
mods = request_single_lcwa_MODS('lcwaN0010888','xml')

print(mods.content.decode('UTF-8'))

<?xml version='1.0' encoding='UTF-8'?>
<mods version="3.4" xmlns="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd">
    <identifier>lcwaN0010888</identifier>
    <titleInfo>
        <title>Cute Overload! ;)</title>
    </titleInfo>
    <name authority="naf" type="corporate">
        <namePart><!-- TODO: Insert name authority here. --></namePart>
    </name>
    <language>
        <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
    </language>
    <physicalDescription>
        <form authority="marcform">electronic</form>
        <internetMediaType>text/html</internetMediaType>
        <digitalOrigin>born digital</digitalOrigin>
    </physicalDescription>
    <targetAudience>general</targetAudience>
    <typeOfResource>text</typeOfResource>
    <genre authority="marcgt">web site</genre>
   

In [4]:
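fname = '2018_lcwa_MODS_single.xml'
count2 = 0
# write as UTF-8 explicitly so the file matches its XML declaration
with open(fname, 'w', encoding='utf-8') as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n<modsCollection>\n')
    for x in siteIDs:
        # use this count to limit the number of MODS records that are written to the output file
        if count2 > 1:
            break
        count2 = count2 + 1
        print('Pulling record for', x)
        mods_response = request_single_lcwa_MODS(x, 'xml')
        record = mods_response.content.decode('UTF-8').lstrip()
        # an XML declaration is only allowed at the start of a file, so drop it from individual records
        if record.startswith('<?xml'):
            record = record.split('?>', 1)[1].lstrip()
        f.write(record)
        f.write('\n')
        print('writing', x, 'to file')
    f.write('</modsCollection>')
    print('wrote', count2, 'MODS to', fname)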


Pulling record for lcwaN0010234
writing lcwaN0010234 to file
Pulling record for lcwaN0001999
writing lcwaN0001999 to file
wrote 2 MODS to 2018_lcwa_MODS_single.xml
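
Once a collection file exists, it can be read back with `xml.etree.ElementTree`. The sketch below is a hedged illustration rather than part of the original notebook: it parses an inline stand-in instead of the real file, the function name `summarize_collection` is hypothetical, and the second record (ID, title, and subject) is made-up placeholder data that only shows how a `<subject authority="...">` attribute would be picked up.

```python
import xml.etree.ElementTree as ET

# MODS elements live in this namespace (it appears in the records above).
NS = {'mods': 'http://www.loc.gov/mods/v3'}

# Inline stand-in for the collection file; the second record is placeholder data.
sample_collection = """<modsCollection>
<mods xmlns="http://www.loc.gov/mods/v3"><identifier>lcwaN0010888</identifier>
<titleInfo><title>Cute Overload! ;)</title></titleInfo></mods>
<mods xmlns="http://www.loc.gov/mods/v3"><identifier>lcwaX0000000</identifier>
<titleInfo><title>Placeholder title</title></titleInfo>
<subject authority="lcsh"><topic>Placeholder topic</topic></subject></mods>
</modsCollection>"""

def summarize_collection(root):
    """Return (identifier, title, subject authorities) for each MODS record."""
    rows = []
    for record in root.findall('mods:mods', NS):
        identifier = record.findtext('mods:identifier', default='', namespaces=NS)
        title = record.findtext('mods:titleInfo/mods:title', default='', namespaces=NS)
        # Collect the distinct authority values used on <subject> elements.
        authorities = sorted({s.get('authority', 'none')
                              for s in record.findall('mods:subject', NS)})
        rows.append((identifier, title, authorities))
    return rows

rows = summarize_collection(ET.fromstring(sample_collection))
for row in rows:
    print(row)
```

To run this against the file written above, replace `ET.fromstring(sample_collection)` with `ET.parse(fname).getroot()`.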
