In [1]:
import pandas as pd
import re
import numpy as np

# Signatures Processing

In this notebook, the signatures of the Bibliotheca Hertziana are decoded into several levels of meaning.
Each signature encodes not only the location of a document in the library, but also its content.
The generic signatures are built like a tree with several levels:
- Level 1: Encompasses all the documents that are about this broad topic: e.g. A -> Manuals

    - Level 2: A subgroup of level 1, more specific: e.g. Aa -> Manuals -> General Manuals

        - Level 3: A group of numbers with an overall topic: e.g. Aa 60 - 75 -> Bibliographies
        
            - Level 4: A specific number with the topic of the document: e.g. Aa 60 -> Bibliographies of Bibliographies (whatever that is...)

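To make the tree concrete, the sketch below splits a generic signature such as `Aa 60` into its level 1 key, its level 2 key and its number. The helper `split_signature` and its regular expression are assumptions made for illustration; they are not part of the library's key or of the code in this notebook.

```python
import re

# Assumed shape of a generic signature: one uppercase letter, optional
# lowercase letters, then an optional number (e.g. "A", "Aa", "Aa 60").
GENERIC_SIGNATURE = re.compile(r"^([A-Z])([a-z]*)\s*(\d+)?$")


def split_signature(signature):
    """Return (level 1 key, level 2 key, number), or None if the shape differs."""
    match = GENERIC_SIGNATURE.match(signature.strip())
    if match is None:
        return None
    letter, sub, number = match.groups()
    level_2 = letter + sub if sub else None
    return letter, level_2, int(number) if number else None


print(split_signature("Aa 60"))  # ('A', 'Aa', 60)
```

Signatures of the non-generic types (for example `Ca-ABB 70`) do not fit this shape, so the helper returns `None` for them.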



In this notebook, the meaning of each signature of the 'generic type' (as described above) is extracted and saved in a CSV file.
Additionally, for some 'non-generic types' of signatures, i.e. people and places, the same information structure is created from the signatures.



The generic type signatures and their meaning are taken from the Excel sheets 'bhr1' and 'bhr2', which are internal library documents.
The non-generic types are decoded using an export of the SyCa database, where the librarians save newly allocated signatures.
Some non-generic type signatures (like the A and B Italian artists) have been added by hand.

            
    


## Import

In [4]:
# Importing the signature key in two parts

df1 = pd.read_csv('data/bhr1.csv',  sep=';')
columns = df1.columns
df2 = pd.read_csv('data/bhr2.csv', names = columns, sep=';')

#Concatenating
df = pd.concat([df1, df2], ignore_index=True)


## Cleaning

In [5]:
df.drop('vw', axis=1, inplace=True)

#### Text 

In [6]:
# getting rid of the '...' in the texts
df["text"] = df["text"].str.replace(r'\.{3,}', '', regex=True)

# getting rid of \n
df['sys'] = df['sys'].str.replace('\n', '')

#### Backreferencing

In [8]:
# replace idem (=...) with the reference
backreference = re.compile(r"idem\s*\(\s*=\s*(.*)\s*\)")

df["backreference"] = df["text"].str.extract(backreference, expand=False)

# Replaces the backreference with the first captured group
df["text"] = df["text"].str.replace(backreference, r"\1", regex=True)

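As a small illustration of the pattern used above, the sketch below applies the same regular expression to plain strings. The helper `resolve_backreference` and the sample texts are made up for this example; only the pattern itself is taken from the cell above.

```python
import re

# Same pattern as in the notebook: "idem (= <reference>)"
BACKREFERENCE = re.compile(r"idem\s*\(\s*=\s*(.*)\s*\)")


def resolve_backreference(text):
    """Return (text with the reference substituted, referenced term or None)."""
    match = BACKREFERENCE.search(text)
    if match is None:
        return text, None
    return BACKREFERENCE.sub(r"\1", text), match.group(1)


print(resolve_backreference("idem (= Bibliographies)"))
print(resolve_backreference("General Manuals"))
```

Texts without an `idem (= ...)` part pass through unchanged, which mirrors the missing values that `str.extract` produces for such rows.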


In [9]:
# find the remaining idems
idem = re.compile(r'\bidem\s*[,.]?\s*')

idem_df = df[df["text"].str.contains(idem, na=False)]

In [10]:
# replace the idem without backreference by iterating over the rows directly above to find what it references

for i, row in idem_df.iterrows():
    # loop over rows in the df DataFrame above the current row
    for j in range(i - 1, -1, -1):
        above = df.loc[j, "text"]
        # No 'idem' in the row, then it contains the reference (rows without text are skipped)
        if isinstance(above, str) and not idem.search(above):

            # find the reference, first group before comma
            reference = above.split(',')[0]

            # replace idem with the actual backreference
            df.loc[i, "text"] = idem.sub(f'{reference} ', df.loc[i, "text"])
            break

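The lookup for a bare `idem` can also be sketched on a plain list of texts. The function `resolve_idem` and the sample texts are assumptions for illustration; the pattern and the "first part before the comma" rule are the ones used in the cell above.

```python
import re

# Same pattern as in the notebook: a bare "idem", optionally followed by , or .
IDEM = re.compile(r"\bidem\s*[,.]?\s*")


def resolve_idem(texts):
    """Replace 'idem' with the first comma-separated part of the last text without 'idem'."""
    resolved = []
    reference = ""
    for text in texts:
        if IDEM.search(text):
            text = IDEM.sub(f"{reference} ", text)
        else:
            # This text is a potential reference for the entries that follow
            reference = text.split(",")[0]
        resolved.append(text)
    return resolved


print(resolve_idem(["Rome, churches", "idem, palaces"]))
```

This single pass always takes the reference from the last input entry that did not contain `idem`, which is a simplification compared to searching upwards in a DataFrame that is modified along the way.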


#### Signature Tree
Filling in the table with the meaning of the signatures for each level in the rows that belong to it.

The structure acts like a tree, and in each leaf node we want the information from the nodes above.
>A -> level 1 -> Manuals

>Aa -> level 2 -> General Manuals

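The idea of handing the text of each node down to the rows below it can be sketched without pandas. The function `propagate_levels` is a hypothetical helper that tracks the current heading per level while walking the rows in order; it is a variant of the idea, not the implementation used in this notebook.

```python
def propagate_levels(rows, max_level=4):
    """Add text_1 .. text_<max_level> with the headings above each row."""
    current = {}  # level -> text of the most recent heading at that level
    enriched_rows = []
    for row in rows:
        lev = row["lev"]
        # A new row at `lev` closes all headings at the same or a deeper level
        current = {level: text for level, text in current.items() if level < lev}
        enriched = dict(row)
        for level in range(1, max_level + 1):
            enriched[f"text_{level}"] = current.get(level, "")
        current[lev] = row["text"]
        enriched_rows.append(enriched)
    return enriched_rows


rows = [
    {"lev": 1, "sys": "A", "text": "Manuals"},
    {"lev": 2, "sys": "Aa", "text": "General Manuals"},
    {"lev": 3, "sys": "Aa 60 - 75", "text": "Bibliographies"},
    {"lev": 4, "sys": "Aa 60", "text": "Bibliographies of Bibliographies"},
]
leaf = propagate_levels(rows)[-1]
print(leaf["text_1"], "->", leaf["text_2"], "->", leaf["text_3"])
```

Each row ends up with the texts of all its ancestors, so a leaf such as `Aa 60` can be read without walking back up the table.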

In [11]:
# Function to propagate the meaning of each signature level to the rows that belong to it
def level_text(lev, text, df):
    df[text] = ''

    indices = df[df.lev == lev].index

    for pos, start in enumerate(indices):
        # Everything between this row and the next row of the same level
        # (or the end of the table) belongs to the current category
        end = indices[pos + 1] if pos + 1 < len(indices) else len(df)

        # add the level text to all rows below in the tree, in a single .loc call
        # (chained indexing would assign to a temporary copy)
        below = (df.index > start) & (df.index < end) & (df.lev >= lev)
        df.loc[below, text] = df.loc[start, 'text']

# Propagate the text of levels 1 to 4 (range(1, 5) stops before 5)
for i in range(1, 5):
    level_text(i, 'text_' + str(i), df)

#### Ranges

The `numbis` column holds the last number of a range of consecutive signatures that share the same meaning/category.

For lookup operations, it is simpler to have one row per signature instead of ranges, so the rows generated from the ranges are appended to the end of the df.

The rows are appended to the end, rather than directly after the row that defines the range, for efficiency reasons.

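As a minimal sketch of the expansion, the function below lists the signatures that a range adds. The name `expand_range` and the sample end number are assumptions for illustration; `sys` and `numbis` refer to the columns of the same name.

```python
import re


def expand_range(sys, numbis):
    """List the signatures that follow `sys`, up to and including `numbis`."""
    first = int(re.search(r"\d+", sys).group())
    last = int(numbis)
    # Replace only the first number so the rest of the signature stays untouched
    return [re.sub(r"\d+", str(n), sys, count=1) for n in range(first + 1, last + 1)]


print(expand_range("Aa 60", "63"))  # ['Aa 61', 'Aa 62', 'Aa 63']
```

The starting signature itself is not repeated, since its row already exists in the table.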
In [None]:
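# Appending all the ranges as rows, to make lookup operations easier later
df.numbis = df.numbis.fillna('')
range_rows = df[df.numbis.str.isdigit()]

new_rows = []
for i, row in range_rows.iterrows():

    # the range runs from the first number in sys up to and including numbis
    start = int(re.findall(r'\d+', row.sys)[0])
    end = int(row.numbis)

    for j in range(start, end):
        new_row = row.copy()
        # replace only the first occurrence, i.e. the start number itself
        new_row['sys'] = new_row['sys'].replace(str(start), str(j + 1), 1)
        new_rows.append(new_row)

# a single concat is much cheaper than growing df inside the loop
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)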


# Intermediate Export 

In [127]:
#df.to_csv('data/csv/signatures.csv', index=False)

In [14]:
df = pd.read_csv('data/csv/signatures.csv')

# Merge with SyCa database
The SyCa database contains the newly allocated signatures, mostly artists and topography.

In [15]:
syca = pd.read_csv('data/csv/signatures_C_E.csv', sep='\t')

## Parse People (C, W, Z; Hh)

Signatures starting with C, W, Z or Hh contain artists and other people. They follow a different pattern than the generic ones, as they contain the first three letters of the person's name (e.g. BER for BERnini).

>The generic artists (C and W) have only one number allocated to their name (e.g. Ca-ABB 70 -> Abbate, Niccolo dell')

>For A and B List artists, there are multiple numbers allocated, each containing a specific part of the literature concerning this artist (e.g. Ca-VER 1340 -> Bibliographies for VERonese)

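The structure of such a signature can be sketched with a regular expression. The pattern and the helper `split_person_signature` are assumptions derived from the two examples above (`Ca-ABB 70` and `Ca-VER 1340`); real signatures may deviate from this shape.

```python
import re

# Assumed shape: two-letter group, hyphen, three-letter name prefix, number
PERSON_SIGNATURE = re.compile(r"^([A-Z][a-z])-([A-Z]{3})\s+(\d+)$")


def split_person_signature(signature):
    """Return the parts of a person signature, or None if the shape differs."""
    match = PERSON_SIGNATURE.match(signature)
    if match is None:
        return None
    group, prefix, number = match.groups()
    return {"group": group, "prefix": prefix, "number": int(number)}


print(split_person_signature("Ca-ABB 70"))
print(split_person_signature("Ca-VER 1340"))
```
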
In [16]:
artists = pd.DataFrame(columns=df.columns)

rows = []

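# the second positional argument of str.startswith is `na`, not another prefix,
# so the three prefixes are matched with one pattern instead
artist_rows = syca[syca.sign.str.match(r'[CWZ]', na=False)]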

def parse_artists(row):
    """Return (level 1 text, level 2 text, name) for a C, W or Z signature."""
    field = row.sign
    name = row['name']
    # the second letter selects the subgroup; '' if the signature is a single letter
    sub = field[1] if len(field) > 1 else ''
    if field.startswith('C'):
        if sub == 'a':
            return 'Italienische Künstler', 'Alte Künstler (geboren vor 1870)', name
        elif sub == 'm':
            return 'Italienische Künstler', 'Moderne Künstler (geboren nach 1870)', name
        elif sub == 'f':
            return 'Italienische Künstler', 'Filmschaffende', name
    elif field.startswith('W'):
        if sub == 'a':
            return 'Ausseritalienische Künstler', 'Alte Künstler (geboren vor 1870)', name
        elif sub == 'm':
            return 'Ausseritalienische Künstler', 'Moderne Künstler (geboren nach 1870)', name
        elif sub == 'f':
            return 'Ausseritalienische Künstler', 'Filmschaffende', name
    elif field.startswith('Z'):
        if sub == 'o':
            return 'Nachbarwissenschaften', 'Italienische Dichter und ihre Werke', name
        elif sub == 'p':
            return 'Nachbarwissenschaften', 'Ausseritalienische Dichter und ihre Werke', name
        elif sub == 's':
            return 'Nachbarwissenschaften', 'Werkausgaben zur Philosophie, Pädagogik und anderen geisteswissenschaftlichen Disziplinen', name
        elif sub == 'u':
            return 'Nachbarwissenschaften', 'Werkausgaben zur Theologie und religiöse Devotionsschriften', name
    else:
        return '', '', ''
    # known group but unknown subgroup: always return a 3-tuple so unpacking works
    return None, None, None

        

# Mapping a and b artists to the start number of their signature

a_artists = {'Bernini': 1920, 'Giotto': 660, 'Leonardo Da Vinci': 220, 'Michelangelo': 20, 'Raffael': 140, 'Tiepolo, Giov. Batt.': 10, 'Tiziano': 10}
b_artists = {'Angelico (Fra Angelico)': 310, 'Bellini, Giovanni': 770, 'Borromini': 530, 'Botticelli': 180, 'Bramante': 270, 'Canaletto, Bernardo': 110,
             'Canova': 980, 'Caravaggio': 316, 'Cellini': 290, 'Correggio': 1080, 'Donatello': 70, 'Duccio di Buoninsegna': 90,
             'Franceschi, Piero': 250, 'Ghiberti, Lorenzo': 40, 'Ghirlandaio': 350, 'Giorgione': 580, 'Guardi, Francesco': 320, 
             'Mantegna': 980, 'Masaccio': 20, 'Palladio': 320, 'Perugino': 1200 ,'Reni, Guido': 70, 'Tintoretto, Jacopo': 220, 'Veronese': 670}

# Generic artists in C, W and Z

for i, row in artist_rows.iterrows():
    text_1, text_2, text_3 = parse_artists(row)
    sys = row.sign + ' ' + str(row.nr)
    new_row = {'lev': 3, 'sys': sys, 'text': text_3, 'text_1': text_1, 'text_2': text_2}
    rows.append(new_row)

# A artists

for a, start in a_artists.items():
    sys = 'Ca-' + a[:3].upper() + ' '
    rows += [{'lev': 3, 'sys': sys, 'text': a, 'text_2': 'Alte Künstler (geboren vor 1870)', 'text_1': 'Italienische Künstler'}]
    # enumerate already counts from `start`, so start + i equals 2 * start + offset
    # (e.g. Michelangelo: 20 -> Ca-MIC 40, Ca-MIC 41, ...)
    rows += [{'lev': 4, 'sys': sys + str(start+i), 'text': t, 'text_3': a, 'text_2': 'Alte Künstler (geboren vor 1870)', 'text_1': 'Italienische Künstler'} 
            for i, t in enumerate(['Bibliographien', 'Quellenpublikationen', 'Sammelschriften', 'Ausstellungskataloge', 
                                   'Vollbiographien und Oeuvreverzeichnisse des Gesamtlebenswerkes', 'Teilbiographien und Oeuvreverzeichnisse einzelner Arbeitsperioden', 
                                   'Teilbiographien und Oeuvreverzeichnisse einzelner Arbeitsgebiete', 'Werkmonographien', 'Einzelfragen'], start=start)]
# B artists

for b, start in b_artists.items(): 
    sys = 'Ca-' + b[:3].upper() + ' '
    rows += [{'lev': 3, 'sys': sys, 'text': b, 'text_2': 'Alte Künstler (geboren vor 1870)', 'text_1': 'Italienische Künstler'}]
    # enumerate already counts from `start`, so start + i equals 2 * start + offset
    # (e.g. Veronese: 670 -> Ca-VER 1340, Ca-VER 1341, ...)
    # text_3 must be the current B artist `b`, not the leftover `a` from the loop above
    rows += [{'lev': 4, 'sys': sys + str(start+i), 'text': t, 'text_3': b, 'text_2': 'Alte Künstler (geboren vor 1870)', 'text_1': 'Italienische Künstler'}
              for i,t in enumerate(['Bibliographien, Quellenpublikationen', 'Sammelschriften, Ausstellungskataloge, Voll- und Teilbiografien (Arbeitsperioden und gebiete)', 'Werkmonographien, Einzelfragen'], start=start)]



artists = pd.concat([artists, pd.DataFrame(rows)],ignore_index=True)

In [17]:
df = pd.concat([df, artists], ignore_index=True)

# Intermediate Export 

In [18]:
df.to_csv('data/csv/sig_with_artists.csv', index=False)


## Topography parsing (E, X, Y)

The topographies are similar to the people: they consist of a level 1 (and 2) identifier (e.g. 'E' or 'Xa') followed by the first three letters of the place (e.g. BOL -> BOLogna).
The letters are followed by a pair of numbers, *ALWAYS* starting with an odd one, which contains the art-historical literature; the following even number contains the non-art-historical literature.


>E contains Italian topography
>> For Italy there is a list of special cities with more numbers allocated (see it_cities further down)

>X contains European topography
>> The names date from a different century; the reader has been warned

>Y contains topography outside of Europe
>> The same warning as for Europe applies here.

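The pairing rule can be sketched as follows. The helper `topography_pair` and the prefix `E-ABC` are made up for illustration; only the rule that every pair starts on an odd number comes from the description above.

```python
def topography_pair(sign, nr):
    """Return the (odd, even) signature pair that the number `nr` belongs to."""
    # An even number belongs to the pair that starts one below it
    odd = nr if nr % 2 == 1 else nr - 1
    return f"{sign} {odd}", f"{sign} {odd + 1}"


print(topography_pair("E-ABC", 7))  # ('E-ABC 7', 'E-ABC 8')
print(topography_pair("E-ABC", 8))  # ('E-ABC 7', 'E-ABC 8')
```

Both numbers of a pair map to the same result, so a lookup works regardless of which of the two was stored in the database export.
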
In [19]:
df = pd.read_csv('data/csv/sig_with_artists.csv')



In [62]:
# generic topographies in Italy

topographies = pd.DataFrame(columns=df.columns)
rows = []

topo_rows_italy = syca[syca.sign.str.startswith('E')]

for i, row in topo_rows_italy.iterrows():

    # start should be odd, if it's even then start = nr - 1
    start = row.nr
    if start % 2 == 0: 
        start -= 1
    
    sys = row.sign + ' ' + str(start)
    new_row_odd = {'lev': 3, 'sys': sys, 'text': 'nicht-kunstgeschichtliche Literatur', 'text_1': 'Topographie Italien (ohne Rom)', 'text_2': row['name']}
    rows.append(new_row_odd)

    sys = row.sign + ' ' + str(start + 1)
    new_row_even = {'lev': 3, 'sys': sys, 'text': 'kunstgeschichtliche Literatur', 'text_1': 'Topographie Italien (ohne Rom)', 'text_2': row['name']}
    rows.append(new_row_even)

topographies = pd.concat([topographies, pd.DataFrame(rows)], ignore_index=True)

# Topographies in Europe: 

X_topos = {
  'Xa': 'Deutschland',
  'Xb': 'Österreich',
  'Xc': 'Schweiz und Liechtenstein',
  'Xd': 'Frankreich und Monaco',
  'Xe': 'Belgien',
  'Xf': 'Holland',
  'Xg': 'Luxemburg',
  'Xh': 'Grossbritannien und Irland',
  'Xi': 'Spanien mit Gibraltar und Andorra',
  'Xk': 'Portugal',
  'Xl': 'Dänemark und Island',
  'Xm': 'Schweden',
  'Xn': 'Norwegen',
  'Xo': 'Finnland',
  'Xp': 'Tschechoslowakei',
  'Xq': 'Polen',
  'Xr': 'Europäische Sowjetunion (einschließlich baltische Staaten)',
  'Xs': 'Ungarn',
  'Xt': 'Jugoslawien und Albanien',
  'Xu': 'Bulgarien',
  'Xw': 'Rumänien',
  'Xx': 'Griechenland (mit Rhodos) und Zypern',
  'Xy': 'Europäische Türkei'
}

rows = []
topo_rows_europe = syca[syca.sign.str.startswith('X')]

for i, row in topo_rows_europe.iterrows():

    # start should be odd, if it's even then start = nr - 1
    start = row.nr
    if start % 2 == 0: 
        start -= 1
    
    country = X_topos[row.sign[:2]]

    sys = row.sign + ' ' + str(start)
    new_row_odd = {'lev': 4, 'sys': sys, 'text': 'nicht-kunstgeschichtliche Literatur', 'text_1': 'Topographie Europa (ohne Italien)', 'text_2': country, 'text_3': row['name']}
    rows.append(new_row_odd)

    sys = row.sign + ' ' + str(start + 1)
    new_row_even = {'lev': 4, 'sys': sys, 'text': 'kunstgeschichtliche Literatur', 'text_1': 'Topographie Europa (ohne Italien)', 'text_2': country, 'text_3': row['name']}
    rows.append(new_row_even)

topographies = pd.concat([topographies, pd.DataFrame(rows)], ignore_index=True)


# Topographies not in Europe: 

Y_topos = {
  'Ya': 'Asiatische Türkei',
  'Yb': 'Syrien und Libanon',
  'Yc': 'Israel und Jordanien',
  'Ye': 'Saudi-Arabien mit Jemen, Aden und Oman',
  'Yf': 'Irak',
  'Yg': 'Iran (Persien)',
  'Yh': 'Afghanistan',
  'Yi': 'Indien, Pakistan und Nepal',
  'Yk': 'hinterindische Staaten (Burma, Thailand, Kambodscha, Vietnam etc.)',
  'Yl': 'Japan',
  'Ym': 'China',
  'Yn': 'asiatische Sowjetunion',
  'Yo': 'malaiische Inseln und Ozeanien (Südsee-Inseln)',
  'Yp': 'Ägypten',
  'Yq': 'Abessinien',
  'Yr': 'übrige nordafrikanische Staaten (Libyen mit Cyrenaica und Tripolitanien, Tunesien, Algerien, Marokko)',
  'Ys': 'mittel- und südafrikanische Staaten',
  'Yt': 'Kanada',
  'Yu': 'USA',
  'Yw': 'Mexiko',
  'Yx': 'Mittelamerika (Guatemala, Honduras, Salvador, Nicaragua, Costa Rica, Panama und die Inseln des Karibischen Meeres)',
  'Yy': 'Südamerika (Kolumbien, Venezuela, Guayana, Ecuador, Peru, Brasilien, Bolivien, Paraguay, Uruguay, Argentinien, Chile)',
  'Yz': 'Australien mit Neuseeland'
}

rows = []
topo_rows_world = syca[syca.sign.str.startswith('Y')]

for i, row in topo_rows_world.iterrows():

    # start should be odd, if it's even then start = nr - 1
    start = row.nr
    if start % 2 == 0: 
        start -= 1
    
    country = Y_topos[row.sign[:2]]
    # Level-1 label for Y signatures. The wording is an assumption: the Europe
    # label used for X does not apply to topography outside of Europe.
    text_1 = 'Topographie außerhalb Europas'

    sys = row.sign + ' ' + str(start)
    new_row_odd = {'lev': 4, 'sys': sys, 'text': 'nicht-kunstgeschichtliche Literatur', 'text_1': text_1, 'text_2': country, 'text_3': row['name']}
    rows.append(new_row_odd)

    sys = row.sign + ' ' + str(start + 1)
    new_row_even = {'lev': 4, 'sys': sys, 'text': 'kunstgeschichtliche Literatur', 'text_1': text_1, 'text_2': country, 'text_3': row['name']}
    rows.append(new_row_even)

topographies = pd.concat([topographies, pd.DataFrame(rows)], ignore_index=True)

# Special cities in Italy
rows = []

it_cities = {
    'Bologna': 60,
    'Brescia': 290,
    'Ferrara': 100,
    'Genova': 60,
    'Messina': 70,
    'Milano': 10,
    'Modena': 10,
    'Napoli': 10,
    'Padova': 90,
    'Palermo': 240,
    'Parma': 120,
    'Perugia': 310,
    'Pisa': 10,
    'Ravenna': 50,
    'Siena': 10,
    'Torino': 120,
    'Verona': 300
}

for city, start in it_cities.items():
    
    sys = 'E-' + city[:3].upper() + ' '
    rows += [{'lev': 2, 'sys': sys, 'text': city, 'text_1': 'Topographie Italien (ohne Rom)'}]
    # i counts upwards from the city's start number, so each topic gets its own signature
    rows += [{'lev': 3, 'sys': sys + str(i), 'text': t, 'text_2': city, 'text_1': 'Topographie Italien (ohne Rom)'} 
            for i, t in enumerate(['nicht-kunstgeschichtliche Literatur, Topographien und Bibliographien', 'Guiden', 'Kunst allgemein', 'Architektur', 
                                   'Plastik', 'Malerei, Grafik, Mosaik, Buchmalerei', 
                                   'Hauptkirche', 'sonstige einzelne Kirchen', 'einzelne Profangebäude', 'Varia'], start=start)]

topographies = pd.concat([topographies, pd.DataFrame(rows)],ignore_index=True)

In [63]:
df = pd.concat([df, topographies], ignore_index=True)

In [68]:
text_cols = ['text', 'text_1', 'text_2', 'text_3']

In [71]:
df.to_csv('data/csv/sig_updated.csv', index=False)