# Data extraction from MARC-XML files

Created in February 2024 for the University of Strathclyde by Gustavo Candela 

### About the Hutton Drawings Dataset

This dataset contains the complete descriptive metadata for the [Hutton Drawings](https://data.nls.uk/data/metadata-collections/hutton-drawings/), a digitised collection of drawings, maps, plans and prints relating mainly to Scottish churches and other ecclesiastical buildings, castles and other dwellings.
The original drawings date from 1781-1792 and 1811-1820 and are arranged by county. Some of the drawings are by George Henry Hutton, a professional soldier and amateur antiquary, who compiled the collection. 

- Data format: metadata available as MARCXML and Dublin Core
- Data source: https://data.nls.uk/data/metadata-collections/hutton-drawings/

### Table of contents

- [Preparation](#Preparation)
- [Data Extraction](#Extraction-of-the-data-to-a-CSV)

### Preparation

Import the libraries required to extract the information from MARCXML to a CSV file:

In [18]:
import pymarc, re, csv
from pymarc import parse_xml_to_array

### Extraction of the data to a CSV

To extract the metadata we'll mainly use [Pymarc](https://pymarc.readthedocs.io/en/latest/), a Python 3 library for working with bibliographic data encoded in MARC21. The metadata will be stored in a CSV (comma-separated values) text file. 

*Note: If you'd like to reuse this code for other MARC datasets, you may need to adapt it to retrieve additional or different MARC fields, depending on how the metadata is defined.*
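As a starting point for that kind of reuse, the field lookups can be factored into a small helper. This is only a sketch: `subfield_values` and `joined` are hypothetical names, not part of Pymarc, and they assume the record exposes `get_fields()` and each field exposes `get_subfields()` and `indicator2`, as Pymarc records and fields do.

```python
def subfield_values(record, tag, code, indicator2=None):
    """Return the first `code` subfield of every `tag` field in `record`.

    Assumption: `record` behaves like a pymarc Record (get_fields) and each
    field like a pymarc Field (get_subfields, indicator2).
    """
    values = []
    for field in record.get_fields(tag):
        # Optionally keep only fields with a given second indicator
        if indicator2 is not None and field.indicator2 != indicator2:
            continue
        matches = field.get_subfields(code)
        if matches:
            values.append(matches[0].strip())
    return values


def joined(record, tag, code, separator=' -- ', **kwargs):
    """Join the values found by subfield_values into a single CSV cell."""
    return separator.join(subfield_values(record, tag, code, **kwargs))
```

With helpers like these, adding another column would come down to one call per field, for example `joined(record, '650', 'a', indicator2='7')` for the subject headings used in this notebook.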



In [19]:
input_MARC_file = 'input/Hutton-Drawings/Hutton-Drawings-Dataset-MARC.xml'
output_CSV_file = 'output/Hutton-Drawings.csv'
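Opening the output file for writing fails if the `output` folder does not exist yet. A minimal sketch of a check that could run before the extraction is shown below; `prepare_paths` is a hypothetical helper name, and it relies only on the standard library.

```python
from pathlib import Path


def prepare_paths(input_path, output_path):
    """Check that the input file exists and create the output folder if needed."""
    input_path = Path(input_path)
    output_path = Path(output_path)
    if not input_path.is_file():
        raise FileNotFoundError(f'MARCXML file not found: {input_path}')
    # Create the parent folder(s) of the CSV file; do nothing if they exist
    output_path.parent.mkdir(parents=True, exist_ok=True)
    return input_path, output_path
```

It could be called as `prepare_paths(input_MARC_file, output_CSV_file)` before the next cell.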

In [20]:
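# newline='' is recommended by the csv module; utf-8 keeps non-ASCII characters intact
with open(output_CSV_file, 'w', newline='', encoding='utf-8') as csv_file: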
    csv_output = csv.writer(csv_file, delimiter = ',', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
    csv_output.writerow(['title', 'author', 'date', 'subjects', 'geographic_names'])

    # open in binary mode so the XML parser handles the file's declared encoding
    with open(input_MARC_file, 'rb') as marc_file:
        records = parse_xml_to_array(marc_file)
    
        for record in records:
    
            title = author = date = subjects = geographic_names = ''
    
            # title (245 $a); get_fields returns an empty list if the field is absent
            for f in record.get_fields('245'):
                titles = f.get_subfields('a')
                if titles:
                    title = titles[0].strip()
    
            # date
            for f in record.get_fields('264'):
                dates = f.get_subfields('c')
                if len(dates):
                    date = dates[0]

                    # remove '.' at the end
                    if date.endswith('.'): 
                        date = date[:-1]

            # subjects (650 $a where the second indicator is 7)
            subject_terms = []
            for f in record.get_fields('650'):
                if f.indicator2 == '7':
                    # [:1] takes the first $a, or nothing if the subfield is missing
                    subject_terms.extend(f.get_subfields('a')[:1])
            subjects = ' -- '.join(subject_terms)
    
            # geographic subjects (651 $a where the second indicator is 7
            # and the first $e is 'Place depicted')
            places = []
            for f in record.get_fields('651'):
                if f.indicator2 == '7' and f.get_subfields('e')[:1] == ['Place depicted']:
                    places.extend(f.get_subfields('a')[:1])
            geographic_names = ' -- '.join(places)
                
            # author (700 $a); replace embedded line breaks with spaces
            authors = []
            for f in record.get_fields('700'):
                for name in f.get_subfields('a')[:1]:
                    authors.append(name.strip().replace('\n', ' '))
            author = ' -- '.join(authors)
                
            csv_output.writerow([title,author,date,subjects,geographic_names])

### Now go to the output folder and check the CSV file!
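The file can also be checked without leaving the notebook. The sketch below reads the header and the first few rows back with the standard `csv` module; `preview_csv` is a hypothetical helper name, and the code assumes the CSV is UTF-8 encoded.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5):
    """Return the column names and the first `limit` rows of a CSV file."""
    # Assumption: the file is UTF-8 encoded
    with open(path, newline='', encoding='utf-8') as csv_file:
        reader = csv.DictReader(csv_file)
        rows = list(islice(reader, limit))
        return reader.fieldnames, rows
```

For example, `preview_csv(output_CSV_file)` should return the five column names written above together with up to five records.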