# Processing MARC Files Using the Pymarc Library

This notebook will guide you through processing MARC files with the Pymarc library. MARC (Machine-Readable Cataloging) is a standard format for library catalog records and metadata. We will load MARC files, iterate through records, extract the data we need, and save it to CSV and Excel files.

This notebook is suitable for both beginners and those who want to familiarize themselves with the Pymarc library and MARC data processing in Python.

## Prerequisites

This notebook does not require deep knowledge of Python, but a basic understanding of programming will be helpful.

## Notebook Structure

This notebook is divided into several sections:

0. **Preparation**: We will import the necessary libraries for processing the MARC file.

1. **Load the MARC file**: We will demonstrate how to load a MARC file and print individual records.

2. **Data Extraction**: We will learn how to extract specific data from MARC records, such as titles, authors, and genres.

3. **Export to CSV and Excel**: In the final part, we will show you how to save the extracted data to CSV and Excel files for further analysis.

## Additional Resources

- [LearnPython.org](https://www.learnpython.org/): This online course offers Python tutorials for both beginners and advanced learners. It can be a useful resource for those looking to expand their Python knowledge.

- [W3Schools.com/Python](https://www.w3schools.com/python/): An extensive tutorial that covers Python along with some popular Python libraries.


### 0. Preparation
First, we need to install the libraries we will be working with. Libraries are packages of functions that are not part of the Python language's core. <br>
To install a library, use the command `%pip install <library_name>`. Then, we add it to our notebook using the command `import <library_name>`, optionally followed by `as <alias>`. To access functions from the library, use `library_name.function_name`. <br>
If we only want to use a single function from a library, we add it using `from <library_name> import <function_name>`.
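
As a small illustration of the three import styles, the sketch below uses only modules from Python's standard library (`math` and `statistics`, chosen here just for the example), so it runs without installing anything.

```python
# Import a whole library and access its functions with library_name.function_name
import math

# Import a library under a shorter alias
import statistics as st

# Import a single function from a library
from math import sqrt

circle_area = math.pi * 2 ** 2      # area of a circle with radius 2
average = st.mean([1, 2, 3, 4])     # mean of a small list
root = sqrt(16)                     # no prefix needed for a directly imported function

print(circle_area, average, root)
```

The alias form is mostly a convenience for long library names, which is why `pandas` is commonly imported as `pd`.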

In [None]:
# Install libraries
%pip install pandas 
%pip install pymarc 
%pip install openpyxl

# Add libraries 
import pandas as pd
from pymarc import MARCReader


### 1. Load the MARC File

Our data are stored in the MARC format in files with the .mrc extension. To read them, we use the `MARCReader` class from the `pymarc` library, which reads data from the file and divides them into records so that we can access them individually.

The MARC files are located in the directory <b>data/marc</b>. Each database is stored under the name <b>ucla_\<database_code\>.mrc</b>.
The complete path, including the file name, is <b>data/marc/ucla_\<database_code\>.mrc</b>.

The following databases are available:

* <b>ret</b> - Retrospective Bibliography of Czech Literature 1770–1945

* <b>smz</b> - Czech Literary Samizdat Bibliography

* <b>int</b> - Czech Literary Web Bibliography

* <b>cle</b> - Czech Literary Exile Bibliography (1948–1989)


#### 1.1 Display Record

Let's see what a MARC record looks like. <br>
First, we choose a database and build its path. The `format()` method inserts the code of the selected database, stored in the `base` variable, into the path.<br>
Then, we specify which record we want to display by using the `ith` variable.<br> 
Next, we open the MARC file and iterate through it. The `enumerate()` function assigns numbers to the records, starting from 0. This allows us to display the i-th record.<br>
To avoid processing the entire file, we use the `break` keyword, which terminates the file processing.
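
The same pattern can be tried without a MARC file. The following minimal sketch uses a plain list of strings as a hypothetical stand-in for the records, and shows how `format()`, `enumerate()`, and `break` work together.

```python
# Hypothetical stand-ins for records; a real reader would yield MARC records
records = ['record A', 'record B', 'record C', 'record D']

# format() fills the {base} placeholder in the path template
base = 'cle'
path = 'data/marc/ucla_{base}.mrc'.format(base=base)

ith = 2
found = None

# enumerate() pairs every item with its position, starting from 0
for i, record in enumerate(records):
    if i == ith:
        found = record
        # Stop here so that the remaining items are never visited
        break

print(path)
print(found)
```

Because counting starts at 0, `ith = 2` selects the third item.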


In [None]:
# Select base
base = 'cle' 

# Path to marc file
database = 'data/marc/ucla_{base}.mrc'.format(base = base)

# ith record to print
ith = 5

# Open file
with open(database, 'rb') as data:
    # Read file
    reader = MARCReader(data)

    # Iterate through records in marc file 
    for i,record in enumerate(reader):        
        
        # If i is our record 
        if i == ith:
            # Print record
            print(record)

            # Terminate the loop
            break


We can see that the MARC record has a clear structure. It consists of fields, each typically labeled with a three-digit tag (the leader is shown under the label `LDR`). The tags follow an internal logic; for example, the tags of subject access fields always start with 6 (``6XX``).<br>
Following the field number (tag), there are usually two indicators. If an indicator is not defined, a backslash (\\) is used in its place.<br>
Most fields are further divided into subfields, indicated by a dollar sign ($) followed by either a single letter or a number.
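
To make this structure concrete, the sketch below takes one field line apart using plain string operations. The line itself is a made-up example, and the code assumes the printed layout described above (an equals sign, the tag, two spaces, two indicators, then the subfields). It is only an illustration, not how Pymarc parses records, and it would break on values that contain a dollar sign.

```python
# A hypothetical field line in the textual form described above
line = r'=655  \7$arecenze$2czenas'

tag = line[1:4]             # three-character field tag
indicators = line[6:8]      # two indicators; a backslash marks an undefined one

# Everything after the indicators is a run of subfields, each introduced by '$'
subfields = [(part[0], part[1:]) for part in line[8:].split('$') if part]

print(tag)
print(indicators)
print(subfields)
```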


<div class='alert alert-block alert-info'>
    <b>Try It!</b>  By using the 'ith' parameter, we can specify which record we want to print (don't forget that indexing starts at 0). If we want to print all records up to the 'ith' record, we change `if i == ith:` to `if i <= ith:` and remove the `break` statement.
</div>

#### 1.2 Displaying Individual Fields

To work with the database, we will usually need only specific fields from the MARC records. In this section, we demonstrate how to access individual fields within a record.
We display the leader, title, author, and genre. Some values can be accessed using dot notation (e.g., `record.title`), while others require the `get_fields(<field_number>)` function, which retrieves all fields with the given tag.


In [None]:
ith = 5

# Open file
with open(database, 'rb') as data:
    
    # Read marc file
    reader = MARCReader(data)

    # Iterate through records in marc file 
    for i, record in enumerate(reader):

        # If i is our record 
        if i == ith:
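            # Print selected parts of the record
            # Some values are accessible via dot notation, e.g. record.leader or record.title
            # str() is used because the leader may not be a plain string
            print("Record: " + str(record.leader))

            # It is better to check that a value exists (= is not None),
            # because joining None to a string with '+' raises a TypeError
            if record.title is not None:
                print("Title: " + record.title)
            if record.author is not None:
                print("Author: " + record.author)

            # We call .get_fields() if a field is not accessible via dot notation
            # It returns a list of all fields with the tag (empty if there are none)
            # Square brackets, e.g. record['655']['a'], also work,
            # but they fail when the field or subfield is missing
            for field in record.get_fields('655'):
                # get_subfields() returns the values of all matching subfields
                for genre in field.get_subfields('a'):
                    print("Genre: " + genre)
            break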

#### 1.3 Number of Records

We might also want to know how many records are in a given database. To find out, we create a separate function, `number_of_records(database)`. User-defined functions are defined using the keyword `def`. <br>
The function takes the path to the MARC database as its input. Inside the function, it opens the database, iterates through it, and increments a counter, `counter`, for each record encountered.
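
The counting loop is needed because iterators such as file readers generally cannot report their length up front. The sketch below shows the same idea on a generic iterable, together with a shorter equivalent; both function names are hypothetical and exist only for this illustration.

```python
def count_items(iterable):
    """Count the items of any iterable by walking through it once."""
    counter = 0
    # The underscore signals that the item itself is not used
    for _ in iterable:
        counter += 1
    return counter


def count_items_short(iterable):
    """Same result, written as a sum over a generator expression."""
    return sum(1 for _ in iterable)


# A generator has no len(), so it has to be counted item by item
print(count_items(n for n in range(1000)))
print(count_items_short(n for n in range(1000)))
```

Note that counting consumes an iterator, so the file has to be opened again before the records can be read a second time.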

In [None]:
# To define our own functions we use 'def'
def number_of_records(database):
    
    with open(database, 'rb') as data:
        # Read marc file
        reader = MARCReader(data)
        # Create a counter 
        counter = 0
        # Underscore is used to ignore values that we don't need
        for _ in reader:
            counter += 1
    
    # Function returns value         
    return counter 

print("There are " + str(number_of_records(database)) + " records in the database.")        

### 2. Data Extraction

Working directly with MARC documents can be challenging, so it's often better to extract data and store it in a simpler format like a CSV table. <br>
Now, we need to clarify which data we want to extract. In our example, we want to store the title, author, author's code, publication year, and the fields '``600 $a``', '``650 $a``', '``655 $a``', and '``773 $t``'.<br>

#### 2.1 MARC Fields

All fields whose tags start with 6 (6XX) contain subject information about the record. These fields may repeat.<br> 
Field '600' contains personal names that are the subject of the record, that is, the people the record is about or to whom it is dedicated. <br>
Field '650' contains topical terms that describe what the record is about. <br>
Field '655' contains the genre or form of the record. Unlike fields '600' and '650', field '655' should be present in every record. <br>
Fields with tags 76X–78X are linking fields used for referencing the source document (773) or the reviewed document (787).
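
The tags mentioned in this section can be summarized in a small lookup table. The dictionary and the helper function below are hypothetical and only restate the descriptions above; they are not part of Pymarc.

```python
# Labels restating the descriptions above; the dictionary is only an illustration
FIELD_LABELS = {
    '600': 'personal name as subject',
    '650': 'topical term',
    '655': 'genre/form',
    '773': 'source document',
    '787': 'reviewed document',
}


def field_group(tag):
    """Return a rough group for a tag, based on the ranges mentioned above."""
    if tag.startswith('6'):
        return 'subject access'
    if tag.isdigit() and 760 <= int(tag) <= 789:
        return 'linking field'
    return 'other'


for tag, label in FIELD_LABELS.items():
    print(tag, label, '->', field_group(tag))
```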


In [None]:
# ith record to print
ith = 10

# Open the file 
with open(database, 'rb') as data:
    # Read the MARC file
    reader = MARCReader(data)

    # Iterate through records in the MARC file
    for i,record in enumerate(reader):        
        
        # If 'i' is our desired record
        if i == ith:
            
            # str() is used because the leader may not be a plain string
            print("Record: " + str(record.leader))

            # get_fields() returns an empty list when the record has no such field,
            # so the loops below simply do nothing in that case
            # There may be more fields under the tag, so we iterate through all of them
            for field in record.get_fields('600'):
                # get_subfields() returns the values of subfield $a, if there are any
                for value in field.get_subfields('a'):
                    print("Personal name: " + value)

            for field in record.get_fields('650'):
                for value in field.get_subfields('a'):
                    print("Topical term: " + value)

            for field in record.get_fields('655'):
                for value in field.get_subfields('a'):
                    print("Genre/Form: " + value)

            # Terminate the loop
            break


#### 2.2 Field Selection

For data storage, we prepared a function called `save_to_dict(record, marc_dictionary, field_list)` that stores one record (`record`) in a Python dictionary (`marc_dictionary`).<br>
A dictionary in Python is a data structure consisting of key-value pairs. We read a value through its key in square brackets, `dict[<key>]`, and assign one with `dict[<key>] = <value>`.<br>

Generally, we don't need to save all fields, subfields, and indicators, so the function `save_to_dict` has a parameter `field_list` that specifies which fields we want to save. <br>
Because some fields (e.g., '700') may repeat, it is good practice to store the values of each field in a list. A list is an ordered collection of values, such as strings, integers, or floats. For generality, we save all values in lists: it is easier to work with a single type than to have some values stored in lists and others as strings.
However, we can store values individually, outside a list, if we are sure that the fields in the original records do not repeat. 
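
Before looking at the function, here is a brief sketch of the two data structures it relies on. All names and values are made up for the illustration.

```python
# A dictionary maps keys to values
book = {'title': 'Example title', 'year': '1969'}

print(book['title'])          # read a value through its key
book['year'] = '1970'         # assign a new value to an existing key
book['author'] = 'Doe, Jane'  # assigning to a new key adds it

# A repeatable field is stored as a list, even when it holds one value or none
record_row = {'author': ['Doe, Jane,'], 'genre': ['recenze', 'eseje'], 'figures': []}

for key, values in record_row.items():
    print(key, len(values))
```

Storing every value as a list means later steps can treat all columns the same way, at the cost of one extra level of nesting.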
 

In [None]:
def save_to_dict(record, marc_dictionary, field_list):
    # The reader can yield None for a record it could not parse, so we skip such records
    if record is None:
        return marc_dictionary

    # Values of the current record; they are added to 'marc_dictionary' only after
    # the whole record has been processed, so that all columns keep the same length
    row = {}
    try:
        # Iterate through 'field_list'
        for field_tags in field_list:

            # Key name in the dictionary
            dict_key_name = field_tags[0]

            # Field tag
            tag = field_tags[1]

            # Subfield tag
            subfield_tag = field_tags[2]

            # List for adding values to the dictionary
            dict_add_list = []

            # Iterate through all fields with the tag 'tag'
            for field in record.get_fields(tag):

                # If no subfield tag is given (the field has no subfields), add the whole field
                if subfield_tag is None:
                    dict_add_list.append(field.data)

                # If the subfield tag is a slice instance (we only want a part of a field
                # that does not have subfields), add the slice to 'dict_add_list'
                elif isinstance(subfield_tag, slice):
                    dict_add_list.append(field.data[subfield_tag])

                # If the field contains our subfield tag, add the subfield to 'dict_add_list'
                elif '$' + subfield_tag in str(field):
                    dict_add_list.append(str(field[subfield_tag]))

            # The leader is not a regular field, so we read it with dot notation
            if tag == 'LDR':
                dict_add_list.append(str(record.leader))

            # Remember the values found for this key
            row[dict_key_name] = dict_add_list
    except Exception as error:
        print("Exception: " + type(error).__name__)
        print("LDR: " + str(record.leader))
        # Skip the record that could not be processed
        return marc_dictionary

    # Add the collected values to the dictionary, one entry per key
    for dict_key_name, dict_add_list in row.items():
        marc_dictionary[dict_key_name].append(dict_add_list)
    return marc_dictionary

print("Function saved.")

Now, we use our function to extract data from the MARC file. First, we specify the values we want to extract in the `field_list`.<br>

`field_list` consists of tuples, where each tuple follows this format: 
1. The first position is the key name under which we save the field.
2. The second position is the field tag.
3. The third position is the subfield tag, e.g., ('author', '100', 'a') or ('author', '100', None).

Next, we create a variable `marc_dictionary`, to which we add values record by record using our `save_to_dict` function. The keys in `marc_dictionary` are the first values of the tuples in `field_list`. Each value is a list with one entry per record, and each entry is itself a list of the values found in that record.

Finally, we convert the data to a DataFrame, a structure similar to an Excel table, which makes the data easier to work with. Each row in the DataFrame represents a single record, and each column represents one extracted field (e.g., the author's name).
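
The sketch below shows, on made-up data, how the tuples drive the dictionary setup and how a `slice` object stored in a tuple is applied later. The shortened `field_list` and the sample 008 string are assumptions for the illustration only.

```python
# Hypothetical field list in the (key name, field tag, subfield tag) format
field_list = [('title', '245', 'a'),
              ('year', '008', slice(7, 11))]

# One empty list per key; every processed record later appends one entry to each
marc_dictionary = {key: [] for key, _, _ in field_list}

# A slice object can be stored in a variable and applied later with square brackets
control_field = '850423s1969    xr'     # made-up beginning of an 008 field
year = control_field[field_list[1][2]]

# Each record contributes one list per key
marc_dictionary['year'].append([year])
marc_dictionary['title'].append(['Example title /'])

print(marc_dictionary)
```

Since every key receives exactly one entry per record, all lists end up with the same length, which is what a table with one row per record requires.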


In [None]:
with open(database, 'rb') as data:
    reader = MARCReader(data)
    # List of values we want to save
    field_list = [('title', '245', 'a'),
                ('author', '100', 'a'),
                ('author code', '100', '7'),
                # The date of publication occupies the 8th to 11th characters of field 008,
                # so we use a slice. Indexing starts at 0, so these are indexes 7 to 10.
                ('year', '008', slice(7, 11)),
                ('figures', '600', 'a'),
                ('description', '650', 'a'),
                ('genre', '655', 'a'),
                ('magazine', '773', 't')]
    
    # Dictionary for saving our data
    marc_dictionary = {}
    
    # Iterate through tuples in 'field_list'
    for t in field_list:
        
        # Key name is first in the tuple     
        dict_key_name = t[0]
        
        # We add the key to the dictionary and an empty list (that we will later fill) as a value
        marc_dictionary[dict_key_name] = []
    
    # Iterate through all records in the database  
    for record in reader:
        
        # Call our function save_to_dict
        marc_dictionary = save_to_dict(record, marc_dictionary, field_list)

# Create a DataFrame from 'marc_dictionary'       
df = pd.DataFrame.from_dict(marc_dictionary)

print("MARC file saved to DataFrame df.")

<div class='alert alert-block alert-info'>
    <b>Try It!</b> We can easily retrieve additional values from the records by adding more tuples to the list, right after the last entry. In our case after ('magazine', '773', 't'). <br>
    We can also add fields that do not have subfields. For example, if we want to add field '005,' which contains information about the last transaction, we can use ('latest transaction','005', None). <br>
    To add the leader information, use this tuple: ('leader', 'LDR', None).
</div>


In [None]:
# Print last 5 records in the DataFrame 'df'
df.tail()

#### 2.3 Data Processing

We can see that names have an extra comma at the end, which we can easily remove. Similarly, titles have an unnecessary slash. Using the `apply()` function and `lambda`, we can modify all values in a column with a single line of code. In the `lambda` function, we define how we want to adjust the data. We apply the `lambda` function to all values using `apply()`.

All data are stored in lists, so we need to process each list separately. The `lambda` function receives the list stored in one cell (`x`) and goes through its values (`y`). For each value, it checks that the string is not empty (`len(y) > 0`). If it is not empty, the function removes extra commas and spaces from both ends of the string using `strip(' ,')`. For titles, we use `strip(' /')` to remove the extra slashes. Empty strings and empty lists are left unchanged.
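
The effect of `strip()` can be checked on plain Python lists before applying it to a whole column. The names and titles below are made up for the illustration.

```python
authors = ['Doe, Jane,', '', ' Roe, Richard, ']
titles = ['Example title /']

# strip(' ,') removes any run of spaces and commas from both ends of the string
clean_authors = [name.strip(' ,') for name in authors]

# strip(' /') does the same for spaces and slashes
clean_titles = [title.strip(' /') for title in titles]

print(clean_authors)
print(clean_titles)

# strip() also trims the start of the string; rstrip() trims only the end
print('/dev/null /'.strip(' /'))
print('/dev/null /'.rstrip(' /'))
```

Characters inside the string, such as the comma between surname and first name, are never touched. If a value may legitimately begin with one of the stripped characters, `rstrip()` is the safer choice.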


In [None]:
# Save name and surname without the extra comma at the end of the string
df['figures'] = df['figures'].apply(lambda x: [y.strip(' ,') if len(y) > 0  else y for y in x])  
df['author'] = df['author'].apply(lambda x: [y.strip(' ,') if len(y) > 0 else y for y in x]) 

# Save title without the extra slash at the end of the string  
df['title'] = df['title'].apply(lambda x: [y.strip(' /') if len(y) > 0 else y for y in x])  

# Print last 5 records in the DataFrame 'df'
df.tail()

### 3. Export to CSV and Excel

In the following step, we export the data into CSV format. Since CSV tables do not work well with lists, we join the values in the lists into a single string using the `join()` function. Once again, we utilize a lambda function for this purpose.

###### The `join()` function also accepts a plain string, in which case it would place a semicolon between every character. To prevent this (for example, when the cell is run a second time and the values are already strings), we use the `isinstance()` function within the lambda function to test whether the data are indeed in a list.
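
The difference is easy to see on a few sample values. The helper `join_values` is a hypothetical name that wraps the same expression as the lambda function.

```python
def join_values(value, separator=';'):
    """Join list items into one string; leave anything else untouched."""
    return separator.join(value) if isinstance(value, list) else value


joined = join_values(['recenze', 'eseje'])   # a list is joined into one string
unchanged = join_values('recenze')           # a string is returned as it is
empty = join_values([])                      # an empty list becomes an empty string

# Without the isinstance() check, a string would be split into characters
split_up = ';'.join('abc')

print(joined, unchanged, repr(empty), split_up)
```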


In [None]:
# Iterate through columns in the DataFrame 'df'
for column in df.columns:
    
    # Join all values in the list with a semicolon ';' 
    df[column] = df[column].apply(lambda x: ';'.join(x) if isinstance(x, list) else x )

# Print the last 5 records in the DataFrame 'df'
df.tail()

Finally, we save the DataFrame to both CSV and Excel formats in the 'data/csv' and 'data/excel' directories, respectively.
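
Saving typically fails when the target directory does not exist yet. A minimal sketch of creating the output directories beforehand is shown below; it works inside a temporary directory so that it does not touch the real project folder, and the variable names are chosen only for this example.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the project folder in this sketch
project_dir = Path(tempfile.mkdtemp())

for name in ('csv', 'excel'):
    out_dir = project_dir / 'data' / name
    # parents=True also creates 'data'; exist_ok=True makes the call safe to repeat
    out_dir.mkdir(parents=True, exist_ok=True)

created = sorted(p.name for p in (project_dir / 'data').iterdir())
print(created)
```

In the notebook itself, the same `mkdir` call could be applied to `Path('data/csv')` and `Path('data/excel')` before the export cell is run.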

In [None]:
out_csv = 'data/csv/out_{base}.csv'.format(base = base)

# Save DataFrame to CSV format
df.to_csv(out_csv, encoding = 'utf8', sep = ",", index=False)   

print("Data saved to csv.")

out_excel = 'data/excel/out_{base}.xlsx'.format(base = base)

# Save DataFrame to Excel format
df.to_excel(out_excel,  index=False) 

print("Data saved to xlsx.")