# Jupyter notebook to query the harvested metadata records from the IISG bibliographic materials (authority)

This notebook makes it possible to get overviews of, and to query, the metadata records of the International Institute of Social History (IISG) Bibliographic materials ("Biblio"). Its source is the file "converted.csv", obtained via metadata harvesting using the scripts in this repository (https://github.com/lilimelgar/iisg-metadata-overviews). It contains MARC records from the OAI-PMH endpoint.
The file contains one record per row, and each MARC property (field and subfield) is in a column.

Note: the data includes only metadata records at the "item" level.

Created by Liliana Melgar (April, 2024).

# A. Set up

## A1. Import the required python libraries 
*(nothing to change)*

In [1]:
import pandas as pd
import numpy as np
import csv
import re

from IPython.display import display, HTML
from IPython.display import clear_output
display(HTML("<style>.container { width:95% !important; }</style>"))
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# to add timestamp to file names
import time
# import os.path to add paths to files
import os

## A2. Set the path to the csv file 
*Nothing to change if you cloned the repository. If you downloaded only the file ("authority_as_csv_per_field.gzip"), set here the path to where you downloaded it.*

In [2]:
# path to where the relevant data is located
# authority
script_dir = os.getcwd()  # Gets the current working directory
project_root = os.path.abspath(os.path.join(script_dir, "..", ".."))  # Moves up two levels to reach 'repo'
data_directory_authority = os.path.join(project_root, "data", "authority")
data_converted_authority = os.path.join(data_directory_authority, 'converted')
data_downloads_authority = os.path.join(data_directory_authority, 'downloads') #path to the folder where the reports will be downloaded
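Before loading or saving anything, it can help to confirm that the folders built above actually exist. The following is a minimal sketch, not part of the original notebook: `ensure_directories` is a hypothetical helper, and a temporary folder stands in for `project_root` so that the example runs anywhere.

```python
import os
import tempfile

def ensure_directories(*paths):
    """Create each directory if it is missing and return the paths as a list."""
    for path in paths:
        os.makedirs(path, exist_ok=True)
    return list(paths)

# Stand-in for the repository root (assumption for this sketch);
# in the notebook this would be project_root.
demo_root = tempfile.mkdtemp()
demo_converted = os.path.join(demo_root, "data", "authority", "converted")
demo_downloads = os.path.join(demo_root, "data", "authority", "downloads")

created = ensure_directories(demo_converted, demo_downloads)
print(all(os.path.isdir(path) for path in created))
```

In the notebook itself, the same call could be made with `data_converted_authority` and `data_downloads_authority`.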

## A3. Read the csv file as a pandas dataframe
*Nothing to change here, just be patient: IT TAKES LONG TO LOAD (one run started at around 19:00h and finished sometime before 20:48h the same day).*

In [3]:
# read csv as dataframe
authority_df_v0 = pd.read_csv(f'{data_converted_authority}/authority_as_csv_per_field.gzip', sep="\t", compression='gzip', low_memory=False)
# low_memory=False was set after this warning message: "/var/folders/3y/xbjxw0b94jxg6x2bcbyjsmmcgvnf7q/T/ipykernel_987/2912965462.py:3: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False."
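The warning quoted above also suggests specifying a dtype on import. As a hedged alternative to `low_memory=False`, the sketch below reads every column as text with `dtype=str`. The rows are made up for illustration and are read from memory instead of from the gzip file.

```python
import io
import pandas as pd

# Tiny tab-separated stand-in for the harvested file (made-up rows).
sample = io.StringIO(
    "001\t005\t100\n"
    "1\t20021205191805.0\tAbdalla, Ahmed\n"
    "10\t\t\n"
)

# dtype=str keeps every value exactly as written in the file,
# so pandas does not have to guess a numeric type per column.
demo_df = pd.read_csv(sample, sep="\t", dtype=str)

print(demo_df)
```

Empty cells are still read as missing values here, so the fill-in step in section B3 would remain necessary.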

# B. First overview and data preparation

## B1. First overview: all fields and data types
Execute the cell to view the general information about the data: the Columns (MARC properties with subfields), the Non-Null Count (i.e., how many cells have values; for example, "1 non-null" means that only one row has a value), and the Dtype (object, i.e., a string or a combination of data types; float; or integer).
- Keep in mind that MARC labels have 3 characters, and that a fourth character can be an indicator or a subfield. For example, 1000 is MARC label 100 with indicator 0, and 100a is MARC label 100 with subfield a.
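The labelling convention can be sketched as a small function. This is an illustration only: `split_marc_column` is a hypothetical helper, and it assumes, following the example above, that a digit in fourth position is an indicator and a letter is a subfield. MARC subfield codes can also be digits, so this is a simplification. Labels longer than four characters (such as "leader") are returned unchanged.

```python
def split_marc_column(label):
    """Split a column label into (tag, kind, fourth_character)."""
    if len(label) > 4:
        # Not a tag plus one character, e.g. "leader": leave as is.
        return label, None, None
    tag, extra = label[:3], label[3:4]
    if not extra:
        return tag, None, None
    # Assumption for this sketch: digit = indicator, anything else = subfield.
    kind = "indicator" if extra.isdigit() else "subfield"
    return tag, kind, extra

for label in ["100", "1000", "100a", "leader"]:
    print(label, split_marc_column(label))
```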

In [4]:
authority_df_v0.info(verbose = True, show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610872 entries, 0 to 610871
Data columns (total 54 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   001     610872 non-null  int64  
 1   003     610872 non-null  object 
 2   005     600600 non-null  float64
 3   008     610870 non-null  object 
 4   030     1 non-null       object 
 5   035     579434 non-null  object 
 6   036     1 non-null       object 
 7   040     610864 non-null  object 
 8   041     1 non-null       object 
 9   051     1 non-null       object 
 10  100     427878 non-null  object 
 11  110     103066 non-null  object 
 12  111     17101 non-null   object 
 13  130     55130 non-null   object 
 14  148     1440 non-null    object 
 15  150     983 non-null     object 
 16  151     5169 non-null    object 
 17  155     71 non-null      object 
 18  370     175 non-null     object 
 19  371     3 non-null       object 
 20  372     7 non-null       object 
 21  373     34

## B2. Optional (documentation)
Ideally, each field above would have a definition explaining what it means and what kind of values it contains (in relation to the conventions for creating IISG metadata). That documentation can exist somewhere else (e.g., on Confluence), but this could be a place to start updating or writing those definitions, since here one can see in detail the data that the fields contain.

## B3. Prepare the data for search
Because we know that the data doesn't have proper numerical values to be computed, we convert all values to strings to facilitate querying. Empty values are also filled in with a standard string: "null".
*(nothing to change here)*

In [5]:
# convert datatypes and fill in empty values
for column in authority_df_v0.columns:
    # every column (float, integer or object) gets the same treatment:
    # empty cells become the string "null" and all values become strings
    authority_df_v0[column] = authority_df_v0[column].fillna('null').astype(str)

In [6]:
# create a copy
authority_df = authority_df_v0.copy()

In [7]:
# # # save the csv (in case one wants to inspect it outside this notebook). Make sure the "downloads" directory exists inside "authority"
# authority_df.to_csv(f'{data_downloads_authority}/authority_all.csv.gz', index=False, compression='gzip')

In [8]:
# Check again the general information of the data after having filled in the empty values and converted the data types
authority_df.info(verbose = True, show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610872 entries, 0 to 610871
Data columns (total 54 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   001     610872 non-null  object
 1   003     610872 non-null  object
 2   005     610872 non-null  object
 3   008     610872 non-null  object
 4   030     610872 non-null  object
 5   035     610872 non-null  object
 6   036     610872 non-null  object
 7   040     610872 non-null  object
 8   041     610872 non-null  object
 9   051     610872 non-null  object
 10  100     610872 non-null  object
 11  110     610872 non-null  object
 12  111     610872 non-null  object
 13  130     610872 non-null  object
 14  148     610872 non-null  object
 15  150     610872 non-null  object
 16  151     610872 non-null  object
 17  155     610872 non-null  object
 18  370     610872 non-null  object
 19  371     610872 non-null  object
 20  372     610872 non-null  object
 21  373     610872 non-null  object
 

# C. Get a glimpse of the data

## C1. First rows
Here you can see a sample of the records, one per line. You can change the number inside "head()" to any other desired sample size, preferably not too big. You can also use "tail" instead of "head" to see the records in the last rows.
- Remember to scroll horizontally and vertically to see the entire record.
- NaN means that the cell is empty.
- The omega "Ω" was chosen arbitrarily as the separator for multi-value cells; in the output below, subfields within a cell are separated by "⑄".
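Cells in the sample output pack several subfields into one string, for example `"a":IISG⑄"c":IISG`. A minimal sketch of how such a cell could be unpacked follows. `parse_subfields` is a hypothetical helper, and it assumes that "⑄" never occurs inside a subfield value.

```python
SUBFIELD_SEPARATOR = "⑄"  # separator seen in the sample output

def parse_subfields(cell):
    """Turn '"a":x⑄"c":y' into [('a', 'x'), ('c', 'y')]."""
    if cell in ("", "null"):
        return []
    pairs = []
    for chunk in cell.split(SUBFIELD_SEPARATOR):
        # Split on the first colon only, so colons inside the value survive.
        code, _, value = chunk.partition(":")
        pairs.append((code.strip('"'), value))
    return pairs

print(parse_subfields('"a":IISG⑄"c":IISG'))
print(parse_subfields('"a":1 mei⑄"d":(1903)'))
```

A list of pairs is used instead of a dictionary so that a repeated subfield code would not overwrite an earlier one.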

In [9]:
authority_df.head(20)

Unnamed: 0,001,003,005,008,030,035,036,040,041,051,100,110,111,130,148,150,151,155,370,371,372,373,374,377,378,400,401,405,410,411,419,430,450,455,4J0,500,505,510,511,530,550,651,663,667,680,710,880,901,905,941,942,999,leader,o35
0,1,NL-AMISG,20021205191805.0,130909n| acnaaabn |n anc d|||||||||||||d,,"""a"":(IISG)IISGa10000882",,"""a"":IISG⑄""c"":IISG",,,,"""a"":1. Leipziger gehörlosenverein ""1864""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":1⑄""t"":authority",,,,,00232nz a2200097o 45 0,
1,10,NL-AMISG,20021205191805.0,130909n| acnaaabn |n anc d|||||||||||||d,,"""a"":(IISG)IISGa10001010",,"""a"":IISG⑄""c"":IISG",,,,,"""a"":1 mei⑄""d"":(1903)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":10⑄""t"":authority",,,,,00208nz a2200097o 45 0,
2,100,NL-AMISG,20021205191805.0,130909n| acnaaabn |n anc d|||||||||||||d,,"""a"":(IISG)IISGa10001109",,"""a"":IISG⑄""c"":IISG",,,,,"""a"":1 mei⑄""d"":(1934)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":100⑄""t"":authority",,,,,00208nz a2200097o 45 0,
3,1000,NL-AMISG,20021205191805.0,130909n| acnaaabn |n aac d|||||||||||||d,,"""a"":(IISG)IISGa10012954",,"""a"":IISG⑄""c"":IISG",,,"""a"":Abdalla, Ahmed",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":1000⑄""t"":authority",,,,,00209nz a2200097o 45 0,
4,10000,NL-AMISG,20021205191805.0,021205n| acannaabn |n anc d,,"""a"":(IISG)IISGa10046623",,"""a"":IISG⑄""c"":IISG",,,,"""a"":Ahmadu Bello University (Zaria).⑄""b"":Department of English",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":10000⑄""t"":authority",,,,,00250nz a2200097o 45 0,
5,100000,NL-AMISG,20021205191805.0,021205n| acannaabn |n anc d,,"""a"":(IISG)IISGa10341430",,"""a"":IISG⑄""c"":IISG",,,,"""a"":Federal Reserve System (USA).⑄""b"":Board of Governors",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":100000⑄""t"":authority",,,,,00244nz a2200097o 45 0,
6,100001,NL-AMISG,20021205191805.0,021205n| acannaabn |n anc d,,"""a"":(IISG)IISGa10341437",,"""a"":IISG⑄""c"":IISG",,,,"""a"":Federal Supply Service (USA).⑄""b"":General Services Administration",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":100001⑄""t"":authority",,,,,00257nz a2200097o 45 0,
7,100002,NL-AMISG,20021205191805.0,021205n| acaabaaan |n anc d,,"""a"":(IISG)IISGa10341443",,"""a"":IISG⑄""c"":IISG",,,,,,"""a"":Europees jeugdwerk",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":100002⑄""t"":authority",,,,,00213nz a2200097o 45 0,
8,100003,NL-AMISG,20021205191805.0,021205n| acaaaaaan |n anc d,,"""a"":(IISG)IISGa10341457",,"""a"":IISG⑄""c"":IISG",,,,,,"""a"":Geschichte / Akademie-Verl",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":100003⑄""t"":authority",,,,,00221nz a2200097o 45 0,
9,100004,NL-AMISG,20021205191805.0,021205n| acaaaaaan |n anc d,,"""a"":(IISG)IISGa10341482",,"""a"":IISG⑄""c"":IISG",,,,,,"""a"":Geschichte am See",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":100004⑄""t"":authority",,,,,00212nz a2200097o 45 0,


## C2. Size (shape) of the data
Here you can see how many rows (first value) and how many columns (second value) are in the data.

In [10]:
authority_df.shape

(610872, 54)

## C3. Unique values
Here you can see a general description of the data, including how many unique values there are per column.

In [11]:
# describe the dataframe
authority_df.describe()

Unnamed: 0,001,003,005,008,030,035,036,040,041,051,100,110,111,130,148,150,151,155,370,371,372,373,374,377,378,400,401,405,410,411,419,430,450,455,4J0,500,505,510,511,530,550,651,663,667,680,710,880,901,905,941,942,999,leader,o35
count,610872,610872,610872.0,610872,610872.0,610872.0,610872.0,610872,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872.0,610872,610872.0,610872.0,610872.0,610872.0,610872,610872.0
unique,610872,1,195380.0,12650,2.0,579389.0,2.0,11,2.0,2.0,425395.0,102903.0,17080.0,55059.0,1391.0,981.0,5059.0,68.0,121.0,4.0,7.0,29.0,37.0,2.0,11.0,5143.0,3.0,6.0,12967.0,327.0,2.0,299.0,180.0,8.0,2.0,495.0,4.0,317.0,8.0,24.0,416.0,2.0,4.0,16.0,7.0,2.0,14.0,610872,4.0,2.0,5.0,5.0,2225,2.0
top,1,NL-AMISG,20021205191805.0,021205n| acannaabn |n aac d,,,,"""a"":IISG⑄""c"":IISG",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":1⑄""t"":authority",,,,,00209nz a2200097o 45 0,
freq,1,610872,355756.0,247394,610871.0,31438.0,610871.0,571518,610871.0,610871.0,182994.0,507806.0,593771.0,555742.0,609432.0,609889.0,605703.0,610801.0,610697.0,610869.0,610865.0,610838.0,610832.0,610871.0,610862.0,605711.0,610870.0,610867.0,597740.0,610542.0,610871.0,610571.0,610689.0,610865.0,610871.0,610366.0,610869.0,610546.0,610863.0,610849.0,610394.0,610871.0,610869.0,610857.0,610866.0,610871.0,610859.0,1,595615.0,610871.0,606828.0,610868.0,42190,610871.0
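Once empty cells have been filled with "null", the non-null counts are the same for every column, so they no longer show how many records really have a value. One possible way to recover that number is to count the cells that differ from "null". The sketch below uses a made-up three-row dataframe, not the real data.

```python
import pandas as pd

# Made-up stand-in for authority_df after the "null" fill-in.
demo_df = pd.DataFrame({
    "001": ["1", "10", "100"],
    "100": ["null", "null", "Abdalla, Ahmed"],
    "110": ["Ahmadu Bello University (Zaria).", "null", "null"],
})

# Comparing with "null" gives True/False per cell; summing counts the True values.
filled_counts = (demo_df != "null").sum().sort_values(ascending=False)
print(filled_counts)
```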


In [12]:
# Test
authority_df.head(5)

Unnamed: 0,001,003,005,008,030,035,036,040,041,051,100,110,111,130,148,150,151,155,370,371,372,373,374,377,378,400,401,405,410,411,419,430,450,455,4J0,500,505,510,511,530,550,651,663,667,680,710,880,901,905,941,942,999,leader,o35
0,1,NL-AMISG,20021205191805.0,130909n| acnaaabn |n anc d|||||||||||||d,,"""a"":(IISG)IISGa10000882",,"""a"":IISG⑄""c"":IISG",,,,"""a"":1. Leipziger gehörlosenverein ""1864""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":1⑄""t"":authority",,,,,00232nz a2200097o 45 0,
1,10,NL-AMISG,20021205191805.0,130909n| acnaaabn |n anc d|||||||||||||d,,"""a"":(IISG)IISGa10001010",,"""a"":IISG⑄""c"":IISG",,,,,"""a"":1 mei⑄""d"":(1903)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":10⑄""t"":authority",,,,,00208nz a2200097o 45 0,
2,100,NL-AMISG,20021205191805.0,130909n| acnaaabn |n anc d|||||||||||||d,,"""a"":(IISG)IISGa10001109",,"""a"":IISG⑄""c"":IISG",,,,,"""a"":1 mei⑄""d"":(1934)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":100⑄""t"":authority",,,,,00208nz a2200097o 45 0,
3,1000,NL-AMISG,20021205191805.0,130909n| acnaaabn |n aac d|||||||||||||d,,"""a"":(IISG)IISGa10012954",,"""a"":IISG⑄""c"":IISG",,,"""a"":Abdalla, Ahmed",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":1000⑄""t"":authority",,,,,00209nz a2200097o 45 0,
4,10000,NL-AMISG,20021205191805.0,021205n| acannaabn |n anc d,,"""a"":(IISG)IISGa10046623",,"""a"":IISG⑄""c"":IISG",,,,"""a"":Ahmadu Bello University (Zaria).⑄""b"":Department of English",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,"""c"":10000⑄""t"":authority",,,,,00250nz a2200097o 45 0,


# D. Inspect one record
This is a test to see whether the data is retrieved correctly for one record. Pick any TCN from the list above.

In [None]:
# TEST (see one record)
# check if a string value exists in a column (the string is exactly the same)
# test_exact = biblio_df[biblio_df['651a'] == '1362253']
# test_exact = authority_df[authority_df['151a'] == 'Srebrenica (Yugoslavia)']
test_exact = authority_df[authority_df['001'] == '239940'] #strikes
test_exact
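With 54 columns, most of them empty for any given record, a single record is easier to read vertically and without the "null" cells. A hedged sketch on made-up data (the column names follow the notebook, the values are illustrative):

```python
import pandas as pd

# Made-up stand-in for authority_df.
demo_df = pd.DataFrame({
    "001": ["1", "10"],
    "100": ["null", "null"],
    "111": ["null", '"a":1 mei⑄"d":(1903)'],
})

# Select the record by its identifier, take it as a vertical Series,
# then keep only the fields that actually hold a value.
record = demo_df[demo_df["001"] == "10"].iloc[0]
non_empty = record[record != "null"]
print(non_empty)
```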

In [None]:
# # You may want to download the table above to an excel file for further inspection:

# # choose any name for your file, the file will go to the data/authority/downloads folder.
# name_file = 'authority_151a_Srebrenica'

# test_exact.to_excel(f'{data_downloads_authority}/{name_file}.xlsx')

In [None]:
# biblio_df['100a'].unique().tolist()

## D1. Create a subset with certain column(s)/field(s)
At this point you may be curious to know which values are in one column. For example, 040 has 11 unique values (see C3); which are they?
- You can replace the field inside the quotation marks with any other field of interest.

In [None]:
# create subset with record Id and record of interest, here enter the name of the field(s) that you are interested in separated by commas, each field has to be within single quotation marks, e.g., biblio_df[['001','100e', '110e']]
# field_subset_df = biblio_df[['001','090a','901a','245a','245b','260a','852p','852j','866a','902a','leader']] #--> For LA periodicals
# field_subset_df = authority_df[['001','150a','151a','155a','leader']] #--> For geographic terms exploration
field_subset_df_v1 = authority_df[['001','150','450','550','leader']] #--> For subject thesaurus
# field_subset_df

In [None]:
# check again the number of unique values in your subset
field_subset_df_v1.describe()

In [None]:
# field_subset_df_v1['150'].unique().tolist()

In [None]:
field_subset_df_v1.tail()

In [None]:
#  At this point you may wonder which record has one of the values observed
# query_value_aprox = field_subset_df_v1[field_subset_df_v1['150'].str.contains("⑄", case=False, regex=True)]
# query_value_aprox = field_subset_df_v1[field_subset_df_v1['450'].str.contains("⑄", case=False, regex=True)]
# query_value_aprox = field_subset_df_v1[field_subset_df_v1['150'].str.contains("¶", case=False, regex=True)]
# query_value_aprox = field_subset_df_v1[field_subset_df_v1['450'].str.contains("¶", case=False, regex=True)] ## FUTURE WORK: GET THESE ONES IN SEPARATE ROWS

# query_value_aprox
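The future work noted above, getting multi-value cells into separate rows, could be approached with `str.split` followed by `DataFrame.explode`. This is only a sketch on made-up data, and it assumes that "¶" is the separator between repeated values.

```python
import pandas as pd

VALUE_SEPARATOR = "¶"  # assumption: separates repeated values in one cell

# Made-up stand-in for field_subset_df_v1.
demo_df = pd.DataFrame({
    "001": ["1", "2"],
    "450": ["term A¶term B", "term C"],
})

# Split each cell into a list, then give every list element its own row.
long_df = (
    demo_df.assign(**{"450": demo_df["450"].str.split(VALUE_SEPARATOR)})
    .explode("450")
    .reset_index(drop=True)
)
print(long_df)
```

The record identifier in "001" is repeated on every resulting row, so each value can still be traced back to its record.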

In [None]:
# for now just dropping the rows that have multiple values since it's only two records
# (drop returns a new dataframe, so the result is assigned back)
field_subset_df_v1 = field_subset_df_v1.drop([152622, 317560])

In [None]:
field_subset_df_v2 = field_subset_df_v1.reset_index(drop=True).copy()

In [None]:
# remove empty rows
# Replace string "null" with actual NaN
field_subset_df_v2.replace("null", np.nan, inplace=True)

# Drop rows where column '150' contains NaN
field_subset_df_v3 = field_subset_df_v2.dropna(subset=['150'])

In [None]:
field_subset_df = field_subset_df_v3.reset_index(drop=True).copy()

In [None]:
# You may want to download the table above to an excel file for further inspection:

# choose any name for your file, the file will go to the ../data/downloads folder.
# name_file = 'biblio_author_person_field_100a' #--> authors test
# name_file = 'biblio_geo_651a' #--> geoterms
# name_file = 'authorities_geo_151a_parenthesis'
name_file = 'subject_terms_per_150'

# field_subset_df.to_excel(f'{data_downloads}/{name_file}.xlsx')

## or download to csv
field_subset_df.to_csv(f'{data_downloads_authority}/{name_file}.csv', index=False) # if too big, use compression='gzip'

## D2. Create a subset of records with a certain value in a given column
You may also want to create a list of the records with a certain value in a given column, for example, for field 100e you got these unique values: ['creator.', 'null', 'creator']. You may want to get only the list of records that have "creator."

In [None]:
# when the file above is too big, it's useful sometimes to download it and upload it here again
path = '/Users/lilianam/workspace/iisg-metadata-overviews/biblio/data'
field_subset_df = pd.read_csv(f'{path}/biblio_titles.csv.gz', sep=",", compression='gzip', low_memory=False)
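One thing to keep in mind when re-reading a saved file: by default `pd.read_csv` treats the text "null" as a missing value, so the placeholder strings come back as NaN. A small sketch of the difference, using made-up rows read from memory:

```python
import io
import pandas as pd

csv_text = "001,100a\n1,null\n2,Example name\n"  # made-up rows

# Default behaviour: "null" is interpreted as a missing value (NaN).
default_df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# keep_default_na=False keeps "null" as an ordinary string.
kept_df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

print(default_df["100a"].isna().tolist())
print(kept_df["100a"].tolist())
```

Which behaviour is preferable depends on whether later cells expect the "null" strings or real missing values.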

In [None]:
field_subset_df.head(5)

In [None]:
# check if a string value exists in a column (the string is exactly the same)
query_value_exact = field_subset_df[field_subset_df['100a'] == 'Hajnal, Henri.']  # --> I used in ....

query_value_exact

In [None]:
# check if a string value exists in a column (the string is approximately the same)
# you may want to find the records that have either "creator." (with dot) or "creator" without dot, but not the null values
# here it's possible to use regular expressions

query_value_aprox = field_subset_df[field_subset_df['100a'].str.contains("Ka.*nelson, Berl", case=False, regex=True, na=False)]  # na=False treats empty (NaN) cells as non-matches
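The "creator." versus "creator" case mentioned in the comments can be sketched with a regular expression that makes the final dot optional. The data below is made up; `str.fullmatch` requires the whole cell to match, so "null" and longer strings are left out.

```python
import pandas as pd

# Made-up stand-in with a 100e column.
demo_df = pd.DataFrame({
    "001": ["1", "2", "3", "4"],
    "100e": ["creator.", "creator", "null", "co-creator"],
})

# r"creator\.?" means: the word "creator", optionally followed by a literal dot.
mask = demo_df["100e"].str.fullmatch(r"creator\.?", na=False)
creators = demo_df[mask]
print(creators)
```

Using `str.contains` with the same pattern would also return "co-creator", because it matches anywhere inside the cell.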

In [None]:
query_value_aprox.head(100)

In [None]:
# get some idea of how many rows are in this set
query_value_aprox.info(verbose = True, show_counts = True)

In [None]:
# check again the number of unique values in your subset
query_value_aprox.describe()

In [None]:
# You may want to download the table above to an excel file for further inspection:

# choose any name for your file, the file will go to the ../data/downloads folder.
# name_file = 'biblio_author_person_field_100a_henri'
name_file = 'biblio_to_map_la_periodicals_852j'

query_value_aprox.to_excel(f'{data_downloads_authority}/{name_file}.xlsx')

## or download to csv
# query_value_aprox.to_csv()

# E. Create subsets using inverse query
You may need to create a report with all the records that do not contain a certain value. For example, because we used "null" to fill in all empty values, one could create a list with all the records that have a value in a certain column.

In [None]:
# create a slice with the records that have non-null values in the column of interest
# Note: if you want to query the subset instead of the whole data, then replace "authority_df" with "field_subset_df" and run the cell again

query_inverse = authority_df[~authority_df['100'].str.contains("null", case=False, regex=True)]

query_inverse.head(10)
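A substring search for "null" also excludes any real value that happens to contain those letters. Comparing for exact inequality avoids this. The sketch below contrasts the two on made-up data; "Nullmeier, Example" is an invented name used only to show the effect.

```python
import pandas as pd

# Made-up stand-in for the full dataframe.
demo_df = pd.DataFrame({
    "001": ["1", "2", "3"],
    "100": ["null", "Nullmeier, Example", "Abdalla, Ahmed"],
})

# Substring search (case-insensitive): also drops "Nullmeier, Example".
by_substring = demo_df[~demo_df["100"].str.contains("null", case=False, regex=True)]

# Exact comparison: drops only cells that are exactly the placeholder.
by_equality = demo_df[demo_df["100"] != "null"]

print(by_substring["001"].tolist())
print(by_equality["001"].tolist())
```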

In [None]:
# get some info about the subset you got as a result of the query:
query_inverse.info(verbose=True, show_counts = True)

In [None]:
# You may want to download the table above to an excel file for further inspection:

# choose any name for your file, the file will go to the ../data/downloads folder.
name_file = 'biblio_author_person_field_100a_notEmpty'

query_inverse.to_excel(f'{data_downloads_authority}/{name_file}.xlsx')

## or download to csv
# query_inverse.to_csv()

# F. Query for a specific record
You may want to see the details of a specific record. This can be done in two ways:

In [None]:
# 1. by using the index position. Example: This item: ToDo has index position 0. 
# This position can be seen in the left corner of the entire table (cell above in section C1: authority_df.head(20))
# We will query it using the entire version of the data, not the subset

# show record vertically using index position
query_recordIndex = authority_df.iloc[0]
query_recordIndex

In [None]:
# 2. By using the record Id using the Marc field 001
query_recordId = authority_df[authority_df['001'] == '8']
query_recordId