## ArchivesSpace Barcodes Project
### Notebook 3

This notebook contains code for matching Alma item records with ASpace top containers, using a combination of the Alma item `Enum A` field and the Alma holdings `Permanent Call Number` field.

In [1]:
import pandas as pd
import re
import pickle

`tc_df`: ASpace top containers with associated series information.

`tc_bibs_df`: Top containers with linked bib ID's.

Note that because the bib ID's are recorded on the **resource** record, not the top container, it's not always possible to match a given top container to a single bib record in Alma. In those cases, we need to check multiple bib records, looking across all the holdings to identify the best match based on the series and box information encoded in the item `Enum A` field and the holdings `Permanent Call Number`.

In [2]:
tc_df = pd.read_pickle('./aspace_data/merged-dataset.pkl.gz')
tc_bibs_df = pd.read_pickle('./aspace_data/tc-to-bib-mapping.pkl.gz')

In [3]:
# Convert pandas nulls to empty strings
tc_bibs_df.mms_id = tc_bibs_df.mms_id.fillna("")

In [4]:
tc_df.component_id = tc_df.component_id.fillna("")

Analytics path to Alma items report: `/shared/The George Washington University/SCRC Projects/barcodes-to-aspace/scrc_barcodes_final`

In [5]:
# Load Alma item records (from Alma Analytics report)
# Set columns as string to facilitate matching
dtypes={'MMS Id': str,
       'Barcode': str,
       'Enum A': str}
alma_items = pd.read_csv('./aspace_data/scrc_barcodes.csv', dtype=dtypes)

To map top containers (TC) to Alma item records:

1. Identify the bib record(s) associated with a given TC
2. Isolate the Alma items associated with that bib record
3. Construct the enum for the TC in 2 ways:
    - as the TC `indicator`
    - as the series-level (archival object) `component_id` plus the TC `indicator`
4. Attempt to match on the Alma item `Enum A` field

In [7]:
orig_id = r'(\d+)-wrlcdb'   # Pattern for the Originating System ID field in Alma
local_param = r'\(DGW\)(\d+)\-wrlcdb.+' # Pattern for the Local Param 06 field in Alma

In [8]:
# Fill nulls for string matching
alma_items['enum_a'] = alma_items['Enum A'].fillna('')
alma_items['voyager_id'] = alma_items['Local Param 06'].fillna("").str.extract(local_param, expand=False)

In [9]:
# Where local params is null, use Originating System ID
alma_items.voyager_id = alma_items.voyager_id.fillna(alma_items['Originating System ID'].str.extract(orig_id, expand=False))
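
The fallback between the two fields can be tried on single values without pandas. This is a minimal sketch: the helper name `voyager_id_for` and the sample field values are made up for illustration, and only the two regular expressions come from the cells above.

```python
import re

# Same patterns as in the cells above
orig_id = re.compile(r'(\d+)-wrlcdb')
local_param = re.compile(r'\(DGW\)(\d+)\-wrlcdb.+')

def voyager_id_for(local_param_06, originating_system_id):
    """Hypothetical helper: prefer Local Param 06, else Originating System ID."""
    candidates = ((local_param, local_param_06), (orig_id, originating_system_id))
    for pattern, value in candidates:
        # Treat missing values as empty strings, as the notebook does with fillna
        m = pattern.search(value or "")
        if m:
            return m.group(1)
    return None

# Made-up sample values, not taken from the real report
from_local = voyager_id_for("(DGW)1234567-wrlcdb-sample", "7654321-wrlcdb")
from_orig = voyager_id_for("", "7654321-wrlcdb")
missing = voyager_id_for(None, None)
```

With these inputs, the first call takes the ID from the local parameter, the second falls back to the originating system ID, and the third finds nothing.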

In [10]:
# Patterns in Permanent Call Number field
call_nos = re.compile(r'.*[Ss]eries ([0-9 ,\-]+)(?:and )?([0-9 ,\-]*)')
# na=False keeps rows with a missing call number out of the boolean mask
with_series = alma_items.loc[alma_items['Permanent Call Number'].str.contains('[Ss]eries', na=False)].copy()
ws_df = with_series.join(with_series['Permanent Call Number'].str.extract(call_nos))

In [11]:
# Combine matches and convert to list, preserving ranges
ws_df['call_no_series'] = (ws_df[0] + ws_df[1]).str.split(r'(?:,\s?)|\s+')
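
To see what the call-number parsing produces, the same two regular expressions can be applied to a single string. This is a sketch only: the function name `series_tokens` and the call numbers are invented for illustration, and ranges such as `1-3` are deliberately left unexpanded at this stage.

```python
import re

# Same pattern as in the cell above
call_nos = re.compile(r'.*[Ss]eries ([0-9 ,\-]+)(?:and )?([0-9 ,\-]*)')

def series_tokens(call_number):
    """Hypothetical helper: list the series tokens found in one call number."""
    m = call_nos.search(call_number)
    if not m:
        return []
    # Concatenate both capture groups, then split on commas and whitespace
    combined = m.group(1) + m.group(2)
    return re.split(r'(?:,\s?)|\s+', combined)

# Made-up call numbers
with_range = series_tokens("MS2000 Series 1-3, 5 and 7")
single = series_tokens("MS2123 Series 4")
no_series = series_tokens("MS2123")
```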

The following code takes the contents of the `Enum A` field on the Alma item record and attempts to extract series- and box-level indicators from the string representing that field.

In [12]:
series_patterns = (r'(?:\s|,|^)series ([0-9IVX]+)',)
box_patterns = (r'box\s?(\d+)(?:,|$|\s)', # Box 1
            r'box (up [a-zA-Z0-9]+)(?:,|$|\s)', # Box UP 3 (tried before the generic pattern, which would capture only "UP")
            r'box\s?([A-Z0-9\-\.]+)(?:,|$|\s)', # Box UP3, Box WPA-2014.066
            r'^([A-Z0-9\-\.]+)$'                # RG3.2-0000.001
            )
series_patterns = tuple((re.compile(p, re.IGNORECASE) for p in series_patterns))
box_patterns = tuple((re.compile(p, re.IGNORECASE) for p in box_patterns))

def extract_enum(enum):
    '''
    enum => Alma item record
    Returns: {'series': Series matching string, if present
             'box': Box matching string, if present}
    '''
    values = {'series': None,
             'box': None}
    for pat in series_patterns:
        m = pat.search(enum)
        if m:
            values['series'] =  m.group(1)
            break
    for pat in box_patterns:
        m = pat.search(enum)
        if m:
            values['box'] = m.group(1)
            break
    return values
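
A few sample strings show the kind of output to expect. To keep the sketch runnable on its own, it uses a trimmed copy of the patterns (the series pattern, the numeric box pattern, and the bare-identifier pattern), and `parse_enum` is a hypothetical stand-in for `extract_enum`, not a replacement for it. The sample enum strings are invented.

```python
import re

# Trimmed copies of the patterns above, compiled case-insensitively
series_pat = re.compile(r'(?:\s|,|^)series ([0-9IVX]+)', re.IGNORECASE)
box_pats = [re.compile(p, re.IGNORECASE) for p in (
    r'box\s?(\d+)(?:,|$|\s)',   # Box 1
    r'^([A-Z0-9\-\.]+)$',       # RG3.2-0000.001
)]

def parse_enum(enum):
    """Illustrative stand-in for extract_enum using the trimmed patterns."""
    series = series_pat.search(enum)
    # The first box pattern that matches wins, as in extract_enum
    box = next((m for m in (p.search(enum) for p in box_pats) if m), None)
    return {'series': series.group(1) if series else None,
            'box': box.group(1) if box else None}

both = parse_enum("Series 2, Box 14")
bare = parse_enum("RG3.2-0000.001")
neither = parse_enum("Folder 3")
```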

The following function takes 1) an ASpace TC `indicator` string, 2) an Alma item enum string, 3) an optional ASpace `component_id` string, and 4) an optional list of series numbers parsed from the `Permanent Call Number` field of an Alma holdings record.

It expects box-level information to be found on the Alma item enumeration.

It attempts to find a match in the following order:

| ASpace                                  | Alma                            |
| :-------------------------------------- | :------------------------------ |
| indicator (box) & component_id (series) | Enum A (series & box)           |
| indicator (box) & component_id (series) | Enum A (box) & Call No (series) |
| indicator (box)                         | Enum A (box)                    |

In [13]:
def find_match(indicator, enum, component=None, call_no_series=None):
    '''
    indicator => ASpace top container
    component => ASpace top-level archival object (series)
    enum => Alma item record
    call_no_series => list of series numbers from Alma call number
    Returns: ('S enum' => series (item-level) & box match,
              'S call_no' => series (holding-level) & box match,
             'B'  => box match,
             None => no match)
    '''
    enum_values = extract_enum(enum)
    # Must have a box value to match on
    if not enum_values['box']:
        return
    # Compare enum and indicator for box match
    indicator_match = enum_values['box'].lower() == indicator.lower()
    # Some numbers begin with leading zeroes
    if not indicator_match:
        ind_z = indicator.zfill(len(enum_values['box']))
        indicator_match = enum_values['box'] == ind_z
    # If there's a component, it may contain series information
    if component:
        series_comp = series_patterns[0].search(component)
        if series_comp and enum_values['series']:
            # Compare enum and component for series information
            series_match = enum_values['series'].lower() == series_comp.group(1).lower()
            if series_match and enum_values['series'] and indicator_match:
                return 'S enum'
        # Compare component and holdings call number
        # (series_comp can be None when the component has no series pattern)
        elif series_comp and call_no_series and (series_comp.group(1).lower() in call_no_series) and indicator_match:
            return 'S call_no'
    # No series match? Box match, if there is one
    if indicator_match:
        return 'B'
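
The box comparison inside `find_match`, including the leading-zero case, can be checked in isolation. The helper name `box_matches` is hypothetical and the values are made up. It mirrors only the two comparisons made in the function above.

```python
def box_matches(enum_box, indicator):
    """Hypothetical helper mirroring the box comparison in find_match."""
    # First try a case-insensitive exact comparison
    if enum_box.lower() == indicator.lower():
        return True
    # Then pad the indicator with zeroes to the length of the Alma value
    return enum_box == indicator.zfill(len(enum_box))

exact = box_matches("up3", "UP3")
padded = box_matches("007", "7")
different = box_matches("12", "7")
```

Padding the indicator rather than stripping zeroes from the Alma value means an enum of `7` will not match an indicator of `007`. The comparison only works in one direction.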
            

An instance of the following class encodes a single top container and the Alma items and holdings to which it *may* refer (based on the bib ID's from the resource record).

The `find_matches` method iterates through the possible matches and identifies any where the box and/or box + series information correspond.

In [15]:
from collections import defaultdict

# Each top container has one or more component id's, one indicator, and one resource record 
class TopContainer:
    
    def __init__(self, tc_id, tc_df):
        '''
        Initialized with DataFrame group of top containers from ArchivesSpace
        :param tc_id: id of top container
        :param tc_df: DataFrame with rows for that top container
        '''
        self.tc_id = tc_id
        self.resource_id = tc_df.resource_id.values[0]
        self.resource_title = tc_df.resource_title.values[0]
        # multiple components (top-level series) are possible per container
        self.component_ids = tc_df.component_id.unique()
        # one indicator per container
        self.indicator = tc_df.indicator.values[0]
        
    def add_bib_ids(self, bibs_df):
        '''
        :param bibs_df: DataFrame mapping a top container to Alma/Voyager bib IDs.
        A top container may be associated with more than one MMS Id/Voyager Id, since the latter 
        are recorded at the resource level, not the container level
        '''
        self.mms_ids = bibs_df.loc[bibs_df.top_container_id == self.tc_id].mms_id.dropna().unique()
        self.voyager_ids = bibs_df.loc[bibs_df.top_container_id == self.tc_id].voyager_id.dropna().unique()
        return self
    
    def add_alma_items(self, alma_df):
        '''
        :param alma_df: DataFrame of item-level info from Alma
        Each barcode should map to a single top container. However, we have to use a combination 
        of the MMS Id/Voyager Id, the call number (holding) and the item-level enumeration on the Alma record
        to make this match.
        '''
        # Create groups of items matching the associated bib ID's
        alma_tc_items = alma_df.loc[alma_df['MMS Id'].isin(self.mms_ids) | 
                                   alma_df.voyager_id.isin(self.voyager_ids)]
        # Group by holdings ID (for matching on the call number)
        holding_groups = alma_tc_items.groupby('Holding Id')
        # Retrieves values for a column and associates it with the group key
        self.barcodes = holding_groups.Barcode.apply(lambda x: x.values).to_dict()
        self.enums = holding_groups.enum_a.apply(lambda x: x.values).to_dict()
        return self
        
    def add_series_cn(self, call_no_df):
        '''
        :param call_no_df: DataFrame of call-number series info (Alma, holdings level)
        Series information can be encoded in Alma in the call number (on the holdings record).
        
        '''
        self.call_no_series = defaultdict(list)
        self.call_nos = {}
        # loop through holding ID's associated with this resource
        for holding_id in self.barcodes:
            # Retrieve series info
            holdings = call_no_df.loc[call_no_df['Holding Id'] == holding_id]
            # May not contain any series info
            if not holdings.empty:
                # Store the call number string for reference on result
                self.call_nos[holding_id] = holdings['Permanent Call Number'].values[0]
                # Only one call number per holding
                series_list = holdings.call_no_series.values[0]
                # Expand ranges: 1-3 => 1, 2, 3
                for s in series_list:
                    if '-' in s:
                        endpoints = s.split('-')
                        series_range = [str(n) for n in range(int(endpoints[0]),
                                                              int(endpoints[1]) + 1)]
                        self.call_no_series[holding_id].extend(series_range)
                    else:
                        self.call_no_series[holding_id].append(s)
        return self 
    
    def find_matches(self):
        '''
        Identifies possible matches between a top container (indicator/series component) and 
        an Alma item.
        Encodes the following assumptions:
          - A top container with multiple distinct series (archival-object components) will match on the indicator
          - A top container with a single unique series may match on the following:
            - indicator alone => item enum
            - indicator + component => item enum
            - indicator => item enum, component => call number
        Inspects all holdings/items associated with this top container and records matches.
        '''
        self.matches = []
        # Iterate over holdings
        for holding in self.barcodes:
            # Iterate over items
            for barcode, enum in zip(self.barcodes[holding], self.enums[holding]):
                if len(self.component_ids) == 1:
                    match = find_match(self.indicator, enum, component=self.component_ids[0], 
                                      call_no_series=self.call_no_series.get(holding))
                else:
                    match = find_match(self.indicator, enum, call_no_series=self.call_no_series.get(holding))
                if match:
                    self.matches.append({'barcode': barcode,
                                            'alma_item': enum,
                                            'alma_call_number': self.call_nos.get(holding),
                                            'holding_id': holding,
                                            'match_type': match,
                                            'indicator': self.indicator,
                                            'components': '; '.join(self.component_ids),
                                            'top_container': self.tc_id,
                                            'aspace_resource': self.resource_id,
                                            'resource_title': self.resource_title
                                        })
        return self
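
The range expansion performed in `add_series_cn` is easy to try on its own. In this sketch the function name `expand_series` is an assumption made for illustration. The logic follows the loop in the method: tokens containing a hyphen are treated as inclusive numeric ranges, and everything else is kept as is.

```python
def expand_series(series_list):
    """Hypothetical helper: expand tokens like '1-3' into '1', '2', '3'."""
    expanded = []
    for s in series_list:
        if '-' in s:
            # Inclusive numeric range, kept as strings for later comparison
            start, end = s.split('-')
            expanded.extend(str(n) for n in range(int(start), int(end) + 1))
        else:
            expanded.append(s)
    return expanded

expanded = expand_series(['1-3', '5'])
```

Keeping the values as strings matters because `find_match` tests membership with the series string taken from the component ID.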
                        
                
        

In [16]:
# Creates a top-container object for each top container in the ASpace data, associates it with the corresponding Alma holdings, and looks for matches
top_containers = []
for tc_id, df in tc_df.groupby('top_container_id'):
    tc = TopContainer(tc_id, df)
    tc.add_bib_ids(tc_bibs_df).add_alma_items(alma_items).add_series_cn(ws_df)
    tc.find_matches()
    top_containers.append(tc)

In [17]:
results = pd.DataFrame.from_records([rec for container in top_containers for rec in container.matches])

We separate the results into three batches:
1. Those with a 1:1 match between top container and Alma item record (barcode).
2. Those with a 1:many match between top container and Alma item records.
3. Those with a many:1 match between top containers and Alma item record.

In [18]:
one_to_one = results.groupby('top_container').filter(lambda x: len(x) == 1)
one_to_one = one_to_one.groupby('barcode').filter(lambda x: len(x) == 1)

In [20]:
one_to_one.to_csv('./aspace_data/one-to-one-matches.csv', index=False)

In [21]:
one_to_many = results.groupby('top_container').filter(lambda x: len(x) > 1)
one_to_many.to_csv('./aspace_data/one-to-many-matches.csv', index=False)

In [22]:
many_to_one = results.groupby('barcode').filter(lambda x: len(x) > 1)

In [25]:
many_to_one.to_csv('./aspace_data/many_to_one.csv', index=False)
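
To see how the three filters divide up the results, the same logic can be applied to a handful of made-up rows using only the standard library. This is a sketch under that assumption: the `(top_container, barcode)` tuples are invented, and the two-step 1:1 filter mirrors the order of the `groupby(...).filter(...)` calls above.

```python
from collections import Counter

# Made-up (top_container, barcode) match rows, for illustration only
rows = [("tc1", "b1"),
        ("tc2", "b2"), ("tc2", "b3"),   # one container, two items
        ("tc3", "b4"), ("tc4", "b4")]   # two containers, one item

tc_counts = Counter(tc for tc, _ in rows)
barcode_counts = Counter(bc for _, bc in rows)

# Step 1: keep rows whose top container matched exactly one item
single_tc = [r for r in rows if tc_counts[r[0]] == 1]
# Step 2: of those, keep rows whose barcode appears only once
single_counts = Counter(bc for _, bc in single_tc)
one_to_one = [r for r in single_tc if single_counts[r[1]] == 1]

one_to_many = [r for r in rows if tc_counts[r[0]] > 1]
many_to_one = [r for r in rows if barcode_counts[r[1]] > 1]
```

Because the 1:many and many:1 filters run independently on the full results, a row can appear in both of those batches.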

In [26]:
with open('./aspace_data/top-container-matches.pkl', 'wb') as f:
    pickle.dump(top_containers, f)

To load matches for further analysis:

In [27]:
with open('./aspace_data/top-container-matches.pkl', 'rb') as f:
    top_containers = pickle.load(f)