# Jupyter notebook to query the harvested metadata records from the IISG bibliographic materials (authority)

This notebook makes it possible to get overviews of, and to query, the metadata records of the International Institute of Social History (IISG) Bibliographic materials ("Biblio"). Its source is the file "converted.csv", obtained via metadata harvesting with the scripts in this repository (https://github.com/lilimelgar/iisg-metadata-overviews). It contains MARC records from the OAI-PMH endpoint.
The file contains one record per row, and each MARC property (field and subfield) is in its own column.

Note: the data includes only metadata records at the "item" level.

Created by Liliana Melgar (April, 2024).

# A. Set up

## A1. Import the required python libraries 
*(nothing to change)*

In [1]:
import pandas as pd
import numpy as np
import csv
import re

from IPython.display import display, HTML
from IPython.display import clear_output
display(HTML("<style>.container { width:95% !important; }</style>"))
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# to add timestamp to file names
import time
# import os.path to add paths to files
import os

## A2. Set the path to the csv file 
*Nothing to change if you cloned the repository. If you downloaded only the file ("authority_as_csv.gzip"), set here the path to where you saved it.*

In [2]:
# path to where the transformed csv is located
data_directory = os.path.abspath(os.path.join('..', 'data'))
data_converted = os.path.join(data_directory, 'converted') #path to the repository folder where the csv file is located, if you have not cloned the repository, change the path here
data_downloads = os.path.join(data_directory, 'downloads') #path to the folder where the reports will be downloaded

## A3. Read the csv file as a pandas dataframe
*Nothing to change here. Be patient: the file takes a long time to load (in one run it started at 19:00h and finished sometime before 20:48h the same day).*
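
If the full load is too slow for a quick look, pandas can be asked to read only some columns or only the first rows. The sketch below is an illustration under assumptions: it uses a tiny in-memory, tab-separated sample instead of the real file, and the chosen columns are arbitrary. `usecols`, `nrows` and `dtype` are standard `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Tiny tab-separated stand-in for the real file (illustrative values only)
sample = (
    "001\t003\t100a\n"
    "1\tNL-AMISG\tAbdalla, Ahmed\n"
    "10\tNL-AMISG\t\n"
    "100\tNL-AMISG\t\n"
)

# Read only two columns and the first two rows, keeping everything as text
preview_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    usecols=["001", "100a"],
    nrows=2,
    dtype=str,
)
print(preview_df.shape)
```

With the real file, the same arguments could be added to the `pd.read_csv` call in the cell below, next to `compression='gzip'`.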

In [3]:
# read csv as dataframe
authority_df_v0 = pd.read_csv(f'{data_converted}/authority_as_csv.gzip', sep="\t", compression='gzip', low_memory=False)
# low_memory=False was set after this warning message: "/var/folders/3y/xbjxw0b94jxg6x2bcbyjsmmcgvnf7q/T/ipykernel_987/2912965462.py:3: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False."

# B. First overview and data preparation

## B1. First overview: all fields and data types
Execute the cell to view the general information of the data: the Column (MARC property with subfield), the Non-Null Count (how many rows have a value in that column; for example, "1 non-null" means that only one row has a value) and the Dtype (object, i.e., a string or a mix of data types; float; or integer).
- Keep in mind that MARC labels have 3 characters, and that the fourth character can be an indicator or a subfield. For example: 1000 is MARC label 100 with indicator 0, and 100a is MARC label 100 with subfield a.
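
The labelling convention can be sketched as a small helper that separates the 3-character MARC label from the optional fourth character. `split_marc_label` is a hypothetical name introduced here for illustration; it does not decide whether the fourth character is an indicator or a subfield, and it leaves labels that do not follow the pattern (such as "leader") untouched.

```python
def split_marc_label(label):
    """Split a column label such as '100a' into (tag, fourth_character)."""
    label = str(label)
    # Only labels made of three digits, optionally followed by one character, are split
    if len(label) in (3, 4) and label[:3].isdigit():
        return label[:3], label[3:]
    # Anything else (e.g. 'leader') is returned unchanged
    return label, ""


for label in ["001", "1000", "100a", "leader"]:
    print(label, split_marc_label(label))
```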

In [4]:
authority_df_v0.info(verbose = True, show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610056 entries, 0 to 610055
Data columns (total 154 columns):
 #    Column  Non-Null Count   Dtype  
---   ------  --------------   -----  
 0    001     610056 non-null  int64  
 1    003     610056 non-null  object 
 2    005     600159 non-null  float64
 3    008     610054 non-null  object 
 4    035     20 non-null      object 
 5    035a    579034 non-null  object 
 6    035b    1 non-null       object 
 7    035c    1 non-null       object 
 8    035d    7 non-null       object 
 9    040a    610048 non-null  object 
 10   040c    610048 non-null  object 
 11   040f    1 non-null       object 
 12   100     2 non-null       object 
 13   1000    8 non-null       object 
 14   1001    4 non-null       object 
 15   1004    1 non-null       object 
 16   1006    3 non-null       object 
 17   100C    1 non-null       object 
 18   100D    1 non-null       object 
 19   100a    427316 non-null  object 
 20   100b    477 non-null    

## B2. Optional (documentation)
Ideally, each field above would have a definition explaining what it means and what kind of values it contains (in relation to the conventions for creating IISG metadata). That documentation can exist somewhere else (e.g., on Confluence), but this could be a place to start updating or writing those definitions, since here one can see in detail the data that the fields contain.

## B3. Prepare the data for search
Because the data has no proper numerical values to compute with, we convert all values to strings to make querying easier. This also includes filling in empty values with a standard string: "null".
*(nothing to change here)*
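
A minimal sketch of what this conversion does, on a toy dataframe with one integer, one float and one text column (the values are illustrative stand-ins, not a real record):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "001": [1, 2],                         # integer column
    "005": [20021205191805.0, np.nan],     # float column with an empty cell
    "100a": ["Abdalla, Ahmed", None],      # text column with an empty cell
})

# Fill empty cells with the string "null", then turn every value into a string
for column in toy_df.columns:
    toy_df[column] = toy_df[column].fillna("null").astype(str)

print(toy_df)
```

Note that values from float columns keep their trailing ".0" once they become strings, which is why field 005 shows values such as "20021205191805.0".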

In [5]:
# convert datatypes and fill in empty values
# (float, integer and object columns all get the same treatment, so one loop body is enough)
for column in authority_df_v0.columns:
    authority_df_v0[column] = authority_df_v0[column].fillna('null').astype(str)

In [8]:
# create a copy
authority_df = authority_df_v0.copy()

In [14]:
# save the csv
authority_df.to_csv('authority_all.csv.gz', index=False, compression='gzip')

In [9]:
# Check the general information of the data again, after filling in the empty values and converting the data types
authority_df.info(verbose = True, show_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 610056 entries, 0 to 610055
Data columns (total 154 columns):
 #    Column  Non-Null Count   Dtype 
---   ------  --------------   ----- 
 0    001     610056 non-null  object
 1    003     610056 non-null  object
 2    005     610056 non-null  object
 3    008     610056 non-null  object
 4    035     610056 non-null  object
 5    035a    610056 non-null  object
 6    035b    610056 non-null  object
 7    035c    610056 non-null  object
 8    035d    610056 non-null  object
 9    040a    610056 non-null  object
 10   040c    610056 non-null  object
 11   040f    610056 non-null  object
 12   100     610056 non-null  object
 13   1000    610056 non-null  object
 14   1001    610056 non-null  object
 15   1004    610056 non-null  object
 16   1006    610056 non-null  object
 17   100C    610056 non-null  object
 18   100D    610056 non-null  object
 19   100a    610056 non-null  object
 20   100b    610056 non-null  object
 21   100c    

# C. Get a glimpse of the data

## C1. First rows
Here you can see a sample of the records, one per row. You can change the value "10" to any other sample size, preferably not too big. You can also use "tail" instead of "head" to see the last rows.
- Scroll horizontally and vertically to see the entire record.
- Empty cells were filled with the string "null" in B3; if you skipped that step, they appear as NaN.
- The omega "Ω" was chosen, arbitrarily, as the separator for multi-value cells.
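
To work with the individual values of a multi-value cell, the cell can be split on the omega separator. This is a minimal sketch: `split_multi_value` is a hypothetical helper, the sample value is invented, and stripping surrounding spaces is an assumption about how the values were joined.

```python
SEPARATOR = "Ω"


def split_multi_value(cell):
    """Return the list of values stored in one cell."""
    # Cells filled with the placeholder "null" hold no values
    if cell == "null":
        return []
    # Split on the separator and drop any surrounding whitespace
    return [part.strip() for part in cell.split(SEPARATOR)]


print(split_multi_value("Amsterdam Ω Rotterdam"))
```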

In [10]:
authority_df.head(10)

Unnamed: 0,001,003,005,008,035,035a,035b,035c,035d,040a,040c,040f,100,1000,1001,1004,1006,100C,100D,100a,100b,100c,100d,100e,100q,100t,100v,100x,1100,1106,110a,110b,110c,110d,110e,110g,110n,110x,111a,111b,111c,111d,111e,111g,111n,111t,111x,130a,130b,130d,130l,130n,130p,130v,130w,130x,1480,148a,1500,150a,151a,155a,370a,370b,370c,370e,370f,370s,370t,371a,371b,372a,373a,373s,373t,374a,374s,374t,377a,378q,400,4001,4006,400A,400a,400b,400c,400d,400q,401a,405a,405e,405g,410,4106,410a,410b,410c,410d,410e,410n,410s,410t,411a,411c,411d,411e,411n,419a,430a,430b,430e,450a,450e,455a,4J0a,500a,500b,500c,500d,500q,500w,505a,505e,510a,510b,510e,511a,511c,511d,511e,511n,530a,550a,651a,663a,663b,667a,680a,680i,710a,8806,880a,880d,880q,901c,901t,905u,941m,942m,999,999a,leader,o35a
0,1,NL-AMISG,20021205191805.0,130909n| acnaaabn |n anc d|||||||||||||d,,(IISG)IISGa10000882,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,"1. Leipziger gehörlosenverein ""1864""",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,authority,,,,,,00232nz a2200097o 45 0,
1,10,NL-AMISG,20021205191805.0,130909n| acnaaabn |n anc d|||||||||||||d,,(IISG)IISGa10001010,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,,,,,,,,,1 mei,,,(1903),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10,authority,,,,,,00208nz a2200097o 45 0,
2,100,NL-AMISG,20021205191805.0,130909n| acnaaabn |n anc d|||||||||||||d,,(IISG)IISGa10001109,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,,,,,,,,,1 mei,,,(1934),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,100,authority,,,,,,00208nz a2200097o 45 0,
3,1000,NL-AMISG,20021205191805.0,130909n| acnaaabn |n aac d|||||||||||||d,,(IISG)IISGa10012954,,,,IISG,IISG,,,,,,,,,"Abdalla, Ahmed",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1000,authority,,,,,,00209nz a2200097o 45 0,
4,10000,NL-AMISG,20021205191805.0,021205n| acannaabn |n anc d,,(IISG)IISGa10046623,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,Ahmadu Bello University (Zaria).,Department of English,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,10000,authority,,,,,,00250nz a2200097o 45 0,
5,100000,NL-AMISG,20021205191805.0,021205n| acannaabn |n anc d,,(IISG)IISGa10341430,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,Federal Reserve System (USA).,Board of Governors,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,100000,authority,,,,,,00244nz a2200097o 45 0,
6,100001,NL-AMISG,20021205191805.0,021205n| acannaabn |n anc d,,(IISG)IISGa10341437,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,Federal Supply Service (USA).,General Services Administration,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,100001,authority,,,,,,00257nz a2200097o 45 0,
7,100002,NL-AMISG,20021205191805.0,021205n| acaabaaan |n anc d,,(IISG)IISGa10341443,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Europees jeugdwerk,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,100002,authority,,,,,,00213nz a2200097o 45 0,
8,100003,NL-AMISG,20021205191805.0,021205n| acaaaaaan |n anc d,,(IISG)IISGa10341457,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Geschichte / Akademie-Verl,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,100003,authority,,,,,,00221nz a2200097o 45 0,
9,100004,NL-AMISG,20021205191805.0,021205n| acaaaaaan |n anc d,,(IISG)IISGa10341482,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Geschichte am See,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,100004,authority,,,,,,00212nz a2200097o 45 0,


## C2. Size (shape) of the data
Here you can see how many rows (first value) and how many columns (second value) are in the data.

In [11]:
authority_df.shape

(610056, 154)

## C3. Unique values
Here you can see a general description of the data, including how many unique values each column has.

In [12]:
# describe the dataframe
authority_df.describe()

Unnamed: 0,001,003,005,008,035,035a,035b,035c,035d,040a,040c,040f,100,1000,1001,1004,1006,100C,100D,100a,100b,100c,100d,100e,100q,100t,100v,100x,1100,1106,110a,110b,110c,110d,110e,110g,110n,110x,111a,111b,111c,111d,111e,111g,111n,111t,111x,130a,130b,130d,130l,130n,130p,130v,130w,130x,1480,148a,1500,150a,151a,155a,370a,370b,370c,370e,370f,370s,370t,371a,371b,372a,373a,373s,373t,374a,374s,374t,377a,378q,400,4001,4006,400A,400a,400b,400c,400d,400q,401a,405a,405e,405g,410,4106,410a,410b,410c,410d,410e,410n,410s,410t,411a,411c,411d,411e,411n,419a,430a,430b,430e,450a,450e,455a,4J0a,500a,500b,500c,500d,500q,500w,505a,505e,510a,510b,510e,511a,511c,511d,511e,511n,530a,550a,651a,663a,663b,667a,680a,680i,710a,8806,880a,880d,880q,901c,901t,905u,941m,942m,999,999a,leader,o35a
count,610056,610056,610056.0,610056,610056.0,610056.0,610056.0,610056.0,610056.0,610056,610056,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056.0,610056,610056,610056.0,610056.0,610056.0,610056.0,610056.0,610056,610056.0
unique,610056,1,194939.0,12650,21.0,578992.0,2.0,2.0,8.0,10,8,2.0,3.0,9.0,5.0,2.0,2.0,2.0,2.0,423455.0,180.0,5484.0,4839.0,27.0,1395.0,3.0,1.0,3.0,5.0,10.0,90297.0,14779.0,23.0,30.0,13.0,2.0,14.0,4.0,8239.0,2.0,2169.0,985.0,1572.0,2.0,287.0,2.0,4.0,54941.0,5.0,1.0,2.0,2.0,38.0,39.0,2.0,658.0,6.0,1377.0,1.0,980.0,5055.0,68.0,67.0,44.0,43.0,2.0,3.0,2.0,2.0,3.0,4.0,7.0,29.0,6.0,4.0,37.0,5.0,5.0,2.0,11.0,5.0,2.0,2.0,2.0,5024.0,13.0,86.0,197.0,21.0,3.0,5.0,6.0,3.0,2.0,3.0,12705.0,1212.0,2.0,3.0,2.0,2.0,2.0,2.0,291.0,96.0,86.0,129.0,33.0,2.0,276.0,31.0,57.0,180.0,2.0,8.0,2.0,492.0,3.0,23.0,17.0,2.0,2.0,3.0,4.0,312.0,25.0,2.0,8.0,2.0,2.0,3.0,2.0,24.0,416.0,2.0,4.0,4.0,16.0,3.0,5.0,2.0,9.0,15.0,2.0,2.0,610056,1,4.0,2.0,5.0,1.0,4.0,2223,2.0
top,1,NL-AMISG,20021205191805.0,021205n| acannaabn |n aac d,,,,,,IISG,IISG,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,authority,,,,,,00209nz a2200097o 45 0,
freq,1,610056,355756.0,247394,610036.0,31022.0,610055.0,610055.0,610049.0,571519,571519,610055.0,610054.0,610048.0,610052.0,610055.0,610053.0,610055.0,610055.0,182740.0,609579.0,600031.0,597435.0,607496.0,608271.0,610054.0,610056.0,610053.0,610052.0,610043.0,507241.0,590733.0,610033.0,610024.0,610043.0,610055.0,610042.0,610053.0,592989.0,610055.0,598369.0,595292.0,604443.0,610054.0,606442.0,610055.0,610052.0,554960.0,610052.0,610056.0,610055.0,610055.0,610019.0,610003.0,610055.0,609387.0,610049.0,608628.0,610056.0,609074.0,604890.0,609985.0,609987.0,610006.0,609948.0,610055.0,610052.0,610055.0,610055.0,610054.0,610053.0,610049.0,610022.0,610051.0,610053.0,610016.0,610052.0,610052.0,610055.0,610046.0,610052.0,610055.0,610052.0,610055.0,605012.0,610038.0,609960.0,609841.0,610036.0,610054.0,610051.0,610051.0,610054.0,610055.0,610048.0,596958.0,608746.0,610055.0,610054.0,610055.0,610055.0,610055.0,610055.0,609732.0,609947.0,609933.0,609869.0,609998.0,610055.0,609755.0,610026.0,610000.0,609873.0,610055.0,610049.0,610055.0,609551.0,610054.0,610028.0,610034.0,610055.0,610054.0,610053.0,610053.0,609730.0,610026.0,610055.0,610047.0,610054.0,610054.0,610049.0,610054.0,610033.0,609578.0,610055.0,610053.0,610053.0,610041.0,610054.0,610052.0,610055.0,610042.0,610042.0,610055.0,610055.0,1,610056,594808.0,610055.0,606012.0,610056.0,610053.0,42190,610055.0


# D. Check the values in one column (marc property)
At this point you may be curious to know which values are in one column. For example, 100e has only 27 unique values in the overview above (C3); which are they?
- You can replace the field inside the quotation marks with any other field of interest.
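
Listing the values of one column, and how often each occurs, can be sketched as follows. The toy dataframe stands in for the real data and its values are only illustrative; `unique` and `value_counts` are standard pandas methods.

```python
import pandas as pd

# Toy stand-in for one column of the real dataframe
toy_df = pd.DataFrame({"100e": ["creator", "creator.", "null", "creator", "null"]})

# Which distinct values occur in the column?
print(toy_df["100e"].unique().tolist())

# How many rows have each value?
print(toy_df["100e"].value_counts())
```

With the real data, `toy_df` would be replaced by the dataframe loaded in section A3.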

In [13]:
# TEST (see one record)
# check if a string value exists in a column (the string is exactly the same)
# test_exact = biblio_df[biblio_df['651a'] == '1362253']
test_exact = authority_df[authority_df['651a'] == 'Srebrenica (Yugoslavia)']
test_exact

Unnamed: 0,001,003,005,008,035,035a,035b,035c,035d,040a,040c,040f,100,1000,1001,1004,1006,100C,100D,100a,100b,100c,100d,100e,100q,100t,100v,100x,1100,1106,110a,110b,110c,110d,110e,110g,110n,110x,111a,111b,111c,111d,111e,111g,111n,111t,111x,130a,130b,130d,130l,130n,130p,130v,130w,130x,1480,148a,1500,150a,151a,155a,370a,370b,370c,370e,370f,370s,370t,371a,371b,372a,373a,373s,373t,374a,374s,374t,377a,378q,400,4001,4006,400A,400a,400b,400c,400d,400q,401a,405a,405e,405g,410,4106,410a,410b,410c,410d,410e,410n,410s,410t,411a,411c,411d,411e,411n,419a,430a,430b,430e,450a,450e,455a,4J0a,500a,500b,500c,500d,500q,500w,505a,505e,510a,510b,510e,511a,511c,511d,511e,511n,530a,550a,651a,663a,663b,667a,680a,680i,710a,8806,880a,880d,880q,901c,901t,905u,941m,942m,999,999a,leader,o35a


In [None]:
# You may want to download the table above to an Excel file for further inspection:

# choose any name for your file, the file will go to the ../data/downloads folder.
name_file = 'biblio_651a_Srebrenica'

test_exact.to_excel(f'{data_downloads}/{name_file}.xlsx')

In [None]:
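# list all unique values in a column (the list can be very long for fields with many values, such as 100a)
authority_df['100e'].unique().tolist()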

## D1. Create a subset with certain column(s)/field(s)
At this point you may want to correct records that contain an inconsistent value. For example, if you queried "authority_df['100e'].unique()" above, you may decide to change one or more of the returned values into another value. For this you need the TCN (record Id) numbers, which are in field 001. The command below creates a subset with the TCN and the field(s) of interest.


In [None]:
# create a subset with the record Id (001) and the field(s) of interest; enter the field names separated by commas, each within single quotation marks, e.g., authority_df[['001','100e','110e']]
# field_subset_df = biblio_df[['001','090a','901a','245a','245b','260a','852p','852j','866a','902a','leader']] #--> For LA periodicals
# only columns that exist in the authority data (see B1) can be selected here
field_subset_df = authority_df[['001','151a','651a','leader']] #--> For geographic terms exploration
# field_subset_df

In [None]:
# check again the number of unique values in your subset
field_subset_df.describe()

In [None]:
# You may want to download the table above to an Excel file for further inspection:

# choose any name for your file, the file will go to the ../data/downloads folder.
# name_file = 'biblio_author_person_field_100a' #--> authors test
name_file = 'biblio_geo_651a' #--> geoterms

# field_subset_df.to_excel(f'{data_downloads}/{name_file}.xlsx')

## or download to csv
field_subset_df.to_csv(f'{data_downloads}/{name_file}.csv', index=False) # if too big, use compression='gzip'

## D2. Create a subset of records with a certain value in a given column
You may also want to create a list of the records with a certain value in a given column. For example, if field 100e returned the unique values ['creator.', 'null', 'creator'], you may want only the list of records that have "creator."

In [None]:
# when the subset above is too big, it is sometimes useful to save it to a file and load it here again (adjust the path and file name to your own copy)
path = '/Users/lilianam/workspace/iisg-metadata-overviews/biblio/data'
field_subset_df = pd.read_csv(f'{path}/biblio_titles.csv.gz', sep=",", compression='gzip', low_memory=False)
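
One thing to watch when re-loading a saved file: by default `pd.read_csv` treats the string "null" as a missing value, so the placeholder written in B3 comes back as NaN. The sketch below shows the difference on a small in-memory sample (illustrative values); `keep_default_na=False` and `dtype=str` are standard `read_csv` parameters.

```python
import io
import pandas as pd

# Small comma-separated sample that uses the "null" placeholder
csv_text = '001,100a\n1,null\n2,"Abdalla, Ahmed"\n'

# Default behaviour: "null" is read back as NaN
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["100a"].isna().tolist())

# Keep every cell exactly as written, as text
kept_df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
print(kept_df["100a"].tolist())
```

Adding the same two arguments to the call above would keep the re-loaded subset consistent with the "null" convention used in the rest of the notebook.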

In [None]:
field_subset_df.head(5)

In [None]:
# check if a string value exists in a column (the string is exactly the same)
query_value_exact = field_subset_df[field_subset_df['100a'] == 'Hajnal, Henri.']
query_value_exact

In [None]:
# check if a string value exists in a column (the string is approximately the same)
# for example, find the records whose 852j value contains any of the codes listed below
# here it's possible to use regular expressions; na=False treats empty (NaN) cells as non-matches

query_value_aprox = field_subset_df[field_subset_df['852j'].str.contains("ZDF|ZF|ZDK|ZO|XZK|ZDO|ZK", case=True, regex=True, na=False)]
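
Keep in mind that `str.contains` matches the pattern anywhere in the value. The sketch below, on invented toy values, contrasts an unanchored pattern with one anchored to the start of the value using `^`.

```python
import pandas as pd

# Toy values, invented for illustration
toy_df = pd.DataFrame({
    "001": ["1", "2", "3", "4"],
    "852j": ["ZK 12345", "null", "PM 999", "XZK 7"],
})

# Unanchored: "ZK" may occur anywhere, so "XZK 7" also matches
anywhere = toy_df[toy_df["852j"].str.contains("ZK", case=True, regex=True, na=False)]

# Anchored: only values that start with "ZK" match
prefix_only = toy_df[toy_df["852j"].str.contains("^ZK", case=True, regex=True, na=False)]

print(anywhere["001"].tolist())
print(prefix_only["001"].tolist())
```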

In [None]:
query_value_aprox.head(5)

In [None]:
# get some idea of how many rows are in this set
query_value_aprox.info(verbose = True, show_counts = True)

In [None]:
# check again the number of unique values in your subset
query_value_aprox.describe()

In [None]:
# You may want to download the table above to an Excel file for further inspection:

# choose any name for your file, the file will go to the ../data/downloads folder.
# name_file = 'biblio_author_person_field_100a_henri'
name_file = 'biblio_to_map_la_periodicals_852j'

query_value_aprox.to_excel(f'{data_downloads}/{name_file}.xlsx')

## or download to csv
# query_value_aprox.to_csv()

# E. Create subsets using inverse query
You may need to create a report with all the records that do not contain a certain value. For example, because we used "null" to fill in all empty values, one could create a list with all the records that have a value in a certain column.
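
The inverse query can be written with a substring match or with an exact comparison; the two differ when a real value happens to contain the letters "null". A minimal sketch on toy data (the second name is invented to show the edge case):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "001": ["1", "2", "3"],
    "100a": ["Abdalla, Ahmed", "null", "Nullmeier, Example"],
})

# Substring match: also drops the real value that merely contains "null"
by_substring = toy_df[~toy_df["100a"].str.contains("null", case=False, regex=True)]

# Exact comparison: drops only the placeholder itself
by_exact = toy_df[toy_df["100a"] != "null"]

print(by_substring["001"].tolist())
print(by_exact["001"].tolist())
```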

In [None]:
# create a slice with the records that have non-null values in the column of interest
# Note: if you want to query the subset instead of the whole data, then replace "authority_df" with "field_subset_df" and run the cell again
# an exact comparison is used so that real values containing the letters "null" are not excluded
query_inverse = authority_df[authority_df['100a'] != 'null']

query_inverse.head(10)

In [None]:
# get some info about the subset you got as a result of the query:
query_inverse.info(verbose=True, show_counts = True)

In [None]:
# You may want to download the table above to an Excel file for further inspection:

# choose any name for your file, the file will go to the ../data/downloads folder.
name_file = 'biblio_author_person_field_100a_notEmpty'

query_inverse.to_excel(f'{data_downloads}/{name_file}.xlsx')

## or download to csv
# query_inverse.to_csv()

# F. Query for a specific record
You may want to see the details of a specific record, this can be done in two ways:

In [None]:
# 1. by using the index position. Example: the record with 001 = 1 has index position 0.
# This position can be seen in the left corner of the entire table (cell above in section C1: authority_df.head(10))
# We will query it using the entire version of the data, not the subset

# show record vertically using index position
query_recordIndex = authority_df.iloc[0]
query_recordIndex

In [None]:
# 2. by using the record Id in MARC field 001 (a string after the conversion in B3)
query_recordId = authority_df[authority_df['001'] == '8']
query_recordId