# INTRODUCTION
Jupyter notebook to get an overview of the different subject terms used at the IISH from four different sources
- dateCreated: 2025-01-06
- creator: Liliana Melgar

This notebook gives an overview of the current status (April 2025) of the subject terms in Evergreen and of the thesaurus in Poolparty.

# Preparation

## Import libraries

In [None]:
import csv
import os  # used to build the paths to the files
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML, clear_output, display

# for Venn diagrams
from matplotlib_venn import venn2, venn3

# widen the notebook and show dataframes in full
display(HTML("<style>.container { width:98% !important; }</style>"))
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

## Set paths to files

In [None]:
# paths to where the relevant data is located
script_dir = os.getcwd()  # Gets the current working directory
project_root = os.path.abspath(os.path.join(script_dir, "..", ".."))  # Moves up two levels to reach 'repo'

# biblio
data_directory_biblio = os.path.join(project_root, "data", "biblio")
data_downloads_biblio = os.path.join(data_directory_biblio, 'downloads')  # folder where the reports will be downloaded

# authority
data_directory_authority = os.path.join(project_root, "data", "authority")
data_downloads_authority = os.path.join(data_directory_authority, 'downloads')  # folder where the reports will be downloaded

# subjects (thesauri)
data_directory_subjects = os.path.join(project_root, "data", "subjects")
data_downloads_subjects = os.path.join(data_directory_subjects, 'downloads')  # folder where the reports will be downloaded


# BIBLIO subject terms overview
- Biblio is ...
- These terms are extracted from the IISH metadata using the public version of the OAI-PMH endpoint. For more information about what BIBLIO contains, see: https://confluence.socialhistoryservices.org/x/S4FeBw.
- The harvesting was done using the code from the "Metadata overviews" repository: https://github.com/lilimelgar/iisg-metadata-overviews
- **The harvesting date was April 4, 2025**.
- All the records from the catalog are included, but not all the columns: only the 650 field is included.

## Preparation

### Read csv file
This csv file was created using another jupyter notebook (https://github.com/lilimelgar/iisg-metadata-overviews/blob/main/biblio/src/biblio_query.ipynb). It creates a slice of the entire MARC metadata from Evergreen by selecting only the MARC field 650, because 650 corresponds to the subject terms in MARC (https://www.loc.gov/marc/bibliographic/bd650.html). In that notebook, the repeated fields and subfields are split into separate rows.
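As a rough illustration of that row-splitting step (a toy sketch, not the code of the harvesting notebook; the record ids and terms are invented), `DataFrame.explode` turns a list of repeated 650 values into one row per value. This is why the number of rows further below is larger than the number of records.

```python
import pandas as pd

# Hypothetical toy records: one list of 650$a values per record id (001)
toy = pd.DataFrame({
    "001": ["r1", "r2", "r3"],
    "subfield_a": [["Strikes", "Trade unions"], ["Strikes"], []],
})

# One row per repeated value; a record without 650 keeps a single empty row
rows = toy.explode("subfield_a").reset_index(drop=True)
rows["subfield_a"] = rows["subfield_a"].fillna("null")

n_rows = len(rows)                 # rows after splitting
n_records = rows["001"].nunique()  # distinct records
print(n_rows, n_records)
```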

In [None]:
# read csv as dataframe
biblio_650_df_v0 = pd.read_csv(f'{data_downloads_biblio}/subjects_600_subfields_20250928-103109.csv', sep=",", low_memory=False)

# low_memory=False was set after this warning message: "/var/folders/3y/xbjxw0b94jxg6x2bcbyjsmmcgvnf7q/T/ipykernel_987/2912965462.py:3: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False."

# history of use
# biblio_650_df_v0 = pd.read_csv(f'{data_downloads_biblio}/subjects_600_subfields_20250417-162028.csv', sep=",", low_memory=False)

### Inspect if import was correct

In [None]:
# This shows the subfields and the number of filled-in values for field 650
# 001 is the recordId (which is called TCN in Evergreen)
biblio_650_df_v0.info()

### Fill in empty values

In [None]:
# convert datatypes and fill in empty values
df_columns = biblio_650_df_v0.columns
for column in df_columns:
    dataType = biblio_650_df_v0.dtypes[column]
    if dataType == np.float64:
        biblio_650_df_v0[column] = biblio_650_df_v0[column].fillna('null')
        biblio_650_df_v0[column] = biblio_650_df_v0[column].astype(str)
    if dataType == np.int_:
        biblio_650_df_v0[column] = biblio_650_df_v0[column].fillna('null')
        biblio_650_df_v0[column] = biblio_650_df_v0[column].astype(str)
    if dataType == object:
        biblio_650_df_v0[column] = biblio_650_df_v0[column].fillna('null')
        biblio_650_df_v0[column] = biblio_650_df_v0[column].astype(str)
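The effect of this step can be seen on a small invented dataframe. The sketch below uses a shorter, whole-dataframe form (`fillna` followed by `astype(str)`), which is assumed to be equivalent for float and text columns. Note that a numeric column keeps its float formatting when converted to text.

```python
import numpy as np
import pandas as pd

# Toy data (invented): one numeric column and one text column, each with a gap
toy = pd.DataFrame({"001": [1.0, np.nan], "subfield_a": ["Strikes", None]})

# Replace missing values by the placeholder 'null', then store everything as text
filled = toy.fillna("null").astype(str)
print(filled.to_dict("list"))
```

If an identifier column were read as float, its values would therefore end in `.0` and would no longer match identifiers written as plain text.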

In [None]:
# Check if empty values were filled in
biblio_650_df = biblio_650_df_v0.copy()
biblio_650_df.info(verbose = True, show_counts = True)

## Overview of Biblio's 650 field

In [None]:
# shape shows the number of rows and columns
shape = biblio_650_df.shape
print(shape)

In [None]:
# this shows the number of unique records using the identifier (TCN) column from Evergreen records
number_of_records = biblio_650_df['001'].nunique()
print(number_of_records)

In [None]:
# explanation
print(f"Shape = {shape}: the data contains {shape[0]} rows and {shape[1]} columns. The number of rows differs from the number of records in the catalog ({number_of_records}) because, if a record had a 650 field with multiple values, each value was put in a separate row.")

In [None]:
# head shows the first 5 rows (if you want to see more rows, change 5 for another value)
biblio_650_df.head(5)

### Column names explained

The data that was extracted is what the "MARC" tab of the online record shows. There you can see that this record has three values in MARC field 650. The view displayed here is per subfield. The documentation explaining what each code means is here: https://www.loc.gov/marc/bibliographic/bd650.html. In sum:
- Column "0": is MARC subfield $0, "Authority record control number or standard number (R)"; it contains the identifier of that specific subject. In the code it is called `indicator_0`, although it is a subfield and not a MARC indicator
- Column "1": is MARC subfield $1: "Real World Object URI (R)"
- Column "2": is MARC subfield $2: "Source of heading or term (NR)"
- Column "4": is MARC subfield $4: "Relationship (R)"
- Column "6": is MARC subfield $6: "Linkage (NR)"
- Column "8": is MARC subfield $8: "Field link and sequence number (R)"

In [None]:
biblio_650_df.describe()

### Examples

In [None]:
# Fill in any record Id (TCN number) you want to explore
example_record_tcn = '1534711' #'1528765'
example_record = biblio_650_df[biblio_650_df['001'] == example_record_tcn] 
example_record

In [None]:
print(f"Above is the record with Id {example_record_tcn}, which can be found in the online catalog here: https://search.iisg.amsterdam/Record/{example_record_tcn}. Below are the records that contain a specific string, e.g., 'strikes'; regular expressions can be used here.")

In [None]:
search_records = biblio_650_df[biblio_650_df['subfield_a'].str.contains("strike*|staking*", case=False, regex=True)]
search_records
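A note on the pattern used in this search: in a regular expression, `*` is not a wildcard but means "zero or more of the preceding character". `strike*` therefore also matches "strik", and `staking*` also matches "stakin". The sketch below shows this on a few invented terms and compares it with a stricter pattern, which is only a suggestion.

```python
import re

# The pattern from the cell above, applied case-insensitively
loose = re.compile(r"strike*|staking*", flags=re.IGNORECASE)
# A stricter alternative (an assumption about the intent): whole words only
strict = re.compile(r"\bstrikes?\b|\bstaking(en)?\b", flags=re.IGNORECASE)

terms = ["Strikes", "Striking workers", "Stakingen", "General strike", "Stakeholders"]

loose_matches = [t for t in terms if loose.search(t)]
strict_matches = [t for t in terms if strict.search(t)]

print(loose_matches)
print(strict_matches)
```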

In [None]:
# see all variants of the term in the 650a subfield
search_unique_values = search_records['subfield_a'].unique()
values_list = search_unique_values.tolist()

# sort alphabetically
values_list_sorted = sorted(values_list, key=str.lower)

for line in values_list_sorted:
    print(line)

## Records with subjects
This section focuses on the presence or absence of subject terms (in MARC field 650) in ALL the Catalog records (Biblio).

### Number of records with/without subject terms in 650

In [None]:
# create subset of biblio with relevant columns
biblio_records_subjects_v0 = biblio_650_df[['001','indicator_0', '650', 'leader_code']]

In [None]:
biblio_records_subjects_v0.info()

In [None]:
# Count the unique records per value of '650' ('null' and 'notnull')
value_counts = biblio_records_subjects_v0.groupby('650')['001'].nunique()
value_counts

In [None]:
# Get total unique '001' count (to check that it's all correct the total should be the entire number of records in the catalog)
total_unique_ids = biblio_records_subjects_v0['001'].nunique()
total_unique_ids

### Plot of records with/without 650

In [None]:
# Plot (Label mapping)
labels_mapping = {
    'notnull': f'records with field 650 ({value_counts.get("notnull", 0)})',
    'null': f'records without field 650 ({value_counts.get("null", 0)})'
}
custom_labels = [labels_mapping[label] for label in value_counts.index]

# Colors
colors = ['#90ee90','#cccccc']

# Plot pie chart (shown as a donut)
fig1, ax = plt.subplots(figsize=(6,6))
wedges, texts, autotexts = ax.pie(
    value_counts, labels=custom_labels, autopct='%1.1f%%', colors=colors, startangle=90,
    wedgeprops={'edgecolor': 'white', 'linewidth': 2}, pctdistance=0.85
)

# Donut hole
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig1.gca().add_artist(centre_circle)

# Center text
ax.text(0, 0, f'Total number of records in Biblio\n{total_unique_ids}', 
        ha='center', va='center', fontsize=7, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
## To save the image
# name_file = 'plot1_recordsWithSubject650-'
# fig1.savefig(f'{data_downloads_subjects}/{name_file}.png', format='png', dpi=300, bbox_inches='tight')

### Number of records with/without subject terms in 650 per media type

The following table and plots show the number of records that have or don't have subject terms (field 650), distributed per media type. The media type is encoded in the MARC leader. An explanation of the codes can be found here: https://confluence.socialhistoryservices.org/x/OoJAC
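For orientation, in MARC 21 the leader is a fixed-length string in which position 06 holds the type of record and position 07 the bibliographic level. The sketch below reads those two positions from an invented leader. Whether the `leader_code` column of this notebook is built in exactly this way is an assumption.

```python
def leader_code(leader: str) -> str:
    """Return leader positions 06 (type of record) and 07 (bibliographic level)."""
    return leader[6:8]

# Invented leader: 'a' = language material, 'm' = monograph
example_leader = "00000nam a2200000 a 4500"
code = leader_code(example_leader)
print(code)
```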

In [None]:
# group by null and notnull and leader code
# value_counts_with_leader = biblio_records_subjects_v0.groupby(['650', 'leader_code'])['001'].count()
# table_counts_df = value_counts_with_leader.reset_index(name='count')

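# Step 1: Group and count unique records
# (the rows are per 650 value, so count() would count a record once per subject term)
value_counts_media_type = (
    biblio_records_subjects_v0
    .groupby(['650', 'leader_code'])['001']
    .nunique()
    .reset_index(name='count')
)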

In [None]:
# Step 3: Pivot the table
pivot_df = value_counts_media_type.pivot_table(
    index='650',
    columns='leader_code',
    values='count',
    fill_value=0
)

# Optional: Order rows as 'notnull' then 'null'
pivot_df = pivot_df.reindex(['notnull', 'null'])

# Step 4: Display nicely in Jupyter
from IPython.display import display

styled = (
    pivot_df.style
        .set_caption("Record Counts by Leader Code and 650 Field Presence")
        .format('{:,}')
        .background_gradient(cmap='Greens')
        .set_properties(**{'text-align': 'center'})
)

display(styled)

### Plot of records with/without 650 per media type

In [None]:
import plotly.express as px

# Step 1: Create long-form dataframe from your pivot table
long_df = pivot_df.reset_index().melt(id_vars='650', var_name='leader_code', value_name='count')

# Step 2: Get total NOTNULL count per leader_code
notnull_order = (
    long_df[long_df['650'] == 'notnull']
    .sort_values('count', ascending=False)
    ['leader_code']
    .tolist()
)

# Step 3: Convert 'leader_code' column to ordered category
long_df['leader_code'] = pd.Categorical(long_df['leader_code'], categories=notnull_order, ordered=True)

# Step 4: Plot as before
fig = px.bar(
    long_df,
    x='leader_code',
    y='count',
    color='650',
    barmode='stack',
    text='count',
    title='Stacked Bar Chart Sorted by Notnull Count',
)

fig.update_layout(
    xaxis=dict(type='category'),  # ensures equal bar width
    yaxis_title='Number of Records',
    xaxis_title='Leader Code',
    bargap=0.15
)

fig.show()

In [None]:
import plotly.express as px


fig = px.treemap(
    value_counts_media_type,
    path=['650', 'leader_code'],  # hierarchy levels
    values='count',
    title='Treemap: Record Distribution by Subject Presence and Code'
)

fig.show()

In [None]:

fig = px.bar(
    value_counts_media_type,
    x='leader_code',
    y='count',
    color='650',
    barmode='group',
    title='Grouped Bar Chart: Record Counts by Code and 650'
)

fig.show()

## Unique subject terms

### Overview

### Terms per frequency

In [None]:
# create a df grouping per term showing the counts
# Group by term ('subfield_a') and count unique '001' values for each term
biblio_subjectUniqueTerms_v0 = biblio_650_df.groupby('subfield_a', as_index=False).agg(
    count_of_records=('001', 'nunique'),
    ids=('001', lambda x: ','.join(map(str, x)))
)

# Sort by 'count_of_records' in descending order (most frequent to least frequent)
biblio_subjectUniqueTerms_v1 = biblio_subjectUniqueTerms_v0.sort_values(by='count_of_records', ascending=False)

In [None]:
biblio_subjectUniqueTerms_v1.info()

In [None]:
# remove column with the record Ids
biblio_subjectUniqueTerms_v2 = biblio_subjectUniqueTerms_v1[['subfield_a','count_of_records']]

In [None]:
biblio_subjectUniqueTerms_v2.info()

In [None]:
# dropping the row that contains the null values
biblio_subjectUniqueTerms_v3 = biblio_subjectUniqueTerms_v2[biblio_subjectUniqueTerms_v2['subfield_a'] != 'null']

In [None]:
biblio_subjectUniqueTerms_v3.info()

In [None]:
biblio_subjectUniqueTerms_v3.head()

In [None]:
biblio_subjectUniqueTerms_v4 = biblio_subjectUniqueTerms_v3.reset_index(drop=True)

In [None]:
biblio_subjectUniqueTerms_v4.info()

In [None]:
# determine how many subjects should be shown
top_r = 20
# create small df for displaying and plotting
biblio_subjectUniqueTerms_top = biblio_subjectUniqueTerms_v4.head(top_r).reset_index(drop=True).copy()

# plotting in a barh chart the top terms (top_r)
fig2, ax = plt.subplots(figsize=(20, 10))  # Create the figure object
# ax=ax draws the bars on fig2, so that fig2 can be saved afterwards
biblio_subjectUniqueTerms_top.groupby(['subfield_a'])['count_of_records'].sum().sort_values(ascending=True).tail(top_r).plot(kind='barh', ax=ax)
ax.set_title("Top terms in Biblio's 650$a field")
ax.set_xlabel("Number of Biblio records")
ax.set_ylabel("Term $a")

In [None]:
# # save the figure
# name_file = 'plot2_uniqueSubjectTerms650a--'
# # Save the figure as PNG
# fig2.savefig(f'{data_downloads_subjects}/{name_file}.png', format='png', dpi=300, bbox_inches='tight')

In [None]:
# # export all list of unique subject terms with the number of occurrences in Biblio
# biblio_subjectUniqueTerms_v1.rename(columns={'subfield_a': '650a', 'count_of_records': 'count_records_biblio', 'ids': 'biblio_record_ids'}, inplace=True)

# import time  # needed for time.strftime
# timestr = time.strftime("%Y%m%d-%H%M%S")
# name_file = 'unique_650a_with_counts_and_recordIds'

# biblio_subjectUniqueTerms_v1.to_csv(f'{data_downloads_subjects}/{name_file}_{timestr}.csv', index=False) # if too big, use compression='gzip'

In [None]:
biblio_subjectUniqueTerms_v4.info()

In [None]:
biblio_subjectUniqueTerms = biblio_subjectUniqueTerms_v4.reset_index(drop=True).copy()

In [None]:
list_unique_terms_biblio = biblio_subjectUniqueTerms['subfield_a'].unique().tolist()

In [None]:
len(list_unique_terms_biblio)

## Normalize lists for Venn diagram

### Normalize strings for Venn diagram

In [None]:
# normalize strings
biblio_subjectUniqueTerms['concept_string_normalized'] = (
    biblio_subjectUniqueTerms['subfield_a']
    .str.replace('-', ' ', regex=False)  # Replace dashes with spaces
    .str.title()                         # Convert to title case
    .str.replace(' ', '', regex=False)   # Remove spaces
)
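The same three steps can be tried on single strings to see which variants end up with the same normalized form. This is a plain-Python sketch with invented terms; the helper name `normalize` is hypothetical.

```python
def normalize(term: str) -> str:
    """Replace dashes with spaces, convert to title case, remove spaces."""
    return term.replace("-", " ").title().replace(" ", "")

examples = ["Labour market", "labour-market", "LABOUR MARKET", "Women's work"]
normalized = [normalize(t) for t in examples]

for original, result in zip(examples, normalized):
    print(f"{original!r} -> {result!r}")
```

Python's `str.title` also capitalizes the letter after an apostrophe, so "Women's work" becomes "Women'SWork". This does not prevent matching as long as both lists are normalized in the same way.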


In [None]:
biblio_subjectUniqueTerms.head(10)

In [None]:
query_test_venn0 = biblio_subjectUniqueTerms[biblio_subjectUniqueTerms['subfield_a'].str.contains("market", case=False, regex=True)] #
query_test_venn0

In [None]:
# check_tcn = '717530'
# check_record = biblio_subjectUniqueTerms[biblio_subjectUniqueTerms['001'] == check_tcn]
# check_record

biblio_subjectUniqueTerms[biblio_subjectUniqueTerms['subfield_a'].str.contains("Typists", case=False, regex=True)]

In [None]:
# Fill in any record Id (TCN number) you want to explore
example_record_tcn2 = '1005655' #'1528765'
example_record2 = biblio_650_df[biblio_650_df['001'] == example_record_tcn2] 
example_record2

In [None]:
biblio_subjects_list = biblio_subjectUniqueTerms['concept_string_normalized'].tolist()
len(biblio_subjects_list)

### Normalize the indicator to get AuthorityId

In [None]:
biblio_subjectUniqueTerms_v30 = biblio_650_df[['001','indicator_0','subfield_a']]
biblio_subjectUniqueTerms_v30.info()

In [None]:
# dropping the row that contains the null values
biblio_subjectUniqueTerms_v50 = biblio_subjectUniqueTerms_v30[biblio_subjectUniqueTerms_v30['indicator_0'] != 'null']

In [None]:
# filter only for the local id (to exclude OCLC or other Ids)
biblio_subjectUniqueTerms_v55 = biblio_subjectUniqueTerms_v50[biblio_subjectUniqueTerms_v50['indicator_0'].str.contains(r"\(NL-AMISG\)", case=False, regex=True)]

In [None]:
biblio_subjectUniqueTerms_v56 = biblio_subjectUniqueTerms_v55.reset_index(drop=True)

In [None]:
biblio_subjectUniqueTerms_v56['indicator_0_normalized'] = (biblio_subjectUniqueTerms_v56['indicator_0'].str.replace('(NL-AMISG)', '', regex=False).copy())
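To make the intended result of this step concrete, the sketch below strips the `(NL-AMISG)` prefix with a regular expression and returns nothing for other values. The sample values are invented, and the assumption is that a local identifier consists of the prefix followed by digits only.

```python
import re

# Assumed shape of a local identifier: "(NL-AMISG)" followed by digits
LOCAL_ID = re.compile(r"^\(NL-AMISG\)\s*(\d+)$")

def local_authority_id(value: str):
    """Return the bare authority id, or None if the value is not a local id."""
    match = LOCAL_ID.match(value.strip())
    return match.group(1) if match else None

values = ["(NL-AMISG)318765", "(OCoLC)fst01234567", "null"]
ids = [local_authority_id(v) for v in values]
print(ids)
```

Unlike a plain string replacement, this variant also exposes values that contain the prefix together with other content, since those return `None`.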

In [None]:
biblio_subjectUniqueTerms_v56.head(10)

In [None]:
# create a df grouping per identifier showing the counts
# Group by 'indicator_0_normalized' and count unique '001' values for each identifier

## aggregates all values from 'subfield_a'
# biblio_subjectUniqueTerms_v52 = biblio_subjectUniqueTerms_v51.groupby('indicator_0_normalized', as_index=False).agg(
#     count_of_records=('001', 'nunique'),
#     strings=('subfield_a', lambda x: ','.join(map(str, x)))
# )

#keeps unique values from subfield_a
biblio_subjectUniqueTerms_v57 = biblio_subjectUniqueTerms_v56.groupby('indicator_0_normalized', as_index=False).agg(
    count_of_records=('001', 'nunique'),
    unique_strings=('subfield_a', lambda x: len(set(x.dropna()))),
    strings=('subfield_a', lambda x: ','.join(sorted(set(x.dropna()))))
)

# Sort by 'count_of_records' in descending order (most frequent to least frequent)
biblio_subjectUniqueTerms_v58 = biblio_subjectUniqueTerms_v57.sort_values(by='count_of_records', ascending=False)

In [None]:
biblio_subjectUniqueTerms_v58.info()

In [None]:
biblio_subjectUniqueTerms_v59 = biblio_subjectUniqueTerms_v58.reset_index(drop=True)  # drop=True avoids an extra 'index' column

In [None]:
biblio_subjectUniqueTerms_v59.head()

In [None]:
biblio_subjectUniqueTerms_v59.shape

In [None]:
biblio_subjects_Ids_list = biblio_subjectUniqueTerms_v59['indicator_0_normalized'].tolist()
len(biblio_subjects_Ids_list)

In [None]:
biblio_subjects_Ids_wrong = biblio_subjectUniqueTerms_v59[biblio_subjectUniqueTerms_v59['unique_strings'] > 1]

In [None]:
biblio_subjects_Ids_wrong.shape

In [None]:
biblio_subjects_Ids_wrong.head()

In [None]:
total = biblio_subjects_Ids_wrong['count_of_records'].sum()
print(total)

## Problems with unique subject terms in Biblio

This section looks in more detail at the overview given in Section 3.2 (Overview of Biblio's 650 field).

### Orphan subjects 
These are subject terms without an identifier (the column `indicator_0` is empty).

In [None]:
# unique value counts per column "subfield_a" 
subfield_a_unique = biblio_650_df['subfield_a'].nunique()
indicator_0_unique = biblio_650_df['indicator_0'].nunique()

In [None]:
# explanation
print(f"We see that the unique value counts of subfield_a ({subfield_a_unique}) and indicator_0 ({indicator_0_unique}) differ. Because indicator_0 is an identifier for the term in subfield_a, one would expect a one-to-one relation.")

The following example record illustrates the issue:

In [None]:
# test record
test_record = biblio_650_df[biblio_650_df['001'] == "1021749"]  # exact match on the record Id
test_record

In the previous record, every row corresponds to an instance of field 650$a. As we see, in row 4187 there is a subject term that has an identifier from OCLC, and in row 4192 one with an identifier from the IISH (which comes from the Authority database). But we also see several subject terms that have no "indicator_0". These terms are not connected to the Authority database or to any other thesaurus or subject list via an identifier.
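The three situations described here (no identifier, external identifier, local identifier) can be summarised with a small classification function. The sketch below uses invented rows and follows the tests used in this notebook: the placeholder 'null' for an empty value and the string "AMISG" for a local identifier.

```python
from collections import Counter

def id_status(indicator_0: str) -> str:
    """Classify the identifier of a subject term."""
    if indicator_0.lower() == "null":
        return "orphan"    # no identifier at all
    if "amisg" in indicator_0.lower():
        return "local"     # identifier from the IISH Authority database
    return "external"      # identifier from another source, e.g. OCLC

# Invented (term, identifier) pairs
rows = [
    ("Strikes", "(NL-AMISG)100001"),
    ("Trade unions", "(OCoLC)fst00000000"),
    ("Factory councils", "null"),
    ("Wages", "null"),
]
counts = Counter(id_status(identifier) for _, identifier in rows)
print(counts)
```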

In [None]:
# How many subjects have this problem?

subjects_without_ids = biblio_650_df[
    (biblio_650_df['indicator_0'].str.lower() == "null") &
    (biblio_650_df['subfield_a'].str.lower() != "null")
]

shape = subjects_without_ids.shape
unique_records = subjects_without_ids['001'].nunique()

In [None]:
print(f"This shows that {shape[0]} subjects don't have an identifier, which occurs in {unique_records} records")

In [None]:
# Some subjects do have an identifier, but an external one; count how many those are

subjects_with_ids = biblio_650_df[
    (biblio_650_df['indicator_0'].str.lower() != "null") &
    (biblio_650_df['subfield_a'].str.lower() != "null")
]

subjects_without_local_ids = subjects_with_ids[~subjects_with_ids['indicator_0'].str.contains("AMISG", case=False, regex=True)] 
subjects_without_local_ids.head(10)
# subjects_without_local_ids.shape

In [None]:
subjects_without_local_ids.shape

In [None]:
unique_records_no_local = subjects_without_local_ids['001'].nunique()
unique_records_no_local

### Unstructured subjects

In [None]:
# # Get the longest cell (to get the most problematic as example)
# # Convert all cells to string and get their lengths
# lengths = biblio_650_df['subfield_a'].astype(str).map(len)

# # Find position (row, col) of the max length
# max_row, max_col = lengths.stack().idxmax()

# # Get the value from the original DataFrame
# longest_cell = biblio_650_df.loc[max_row, max_col]

# print(f"Longest cell is in row {max_row}, column '{max_col}' with length {len(str(longest_cell))}")
# print("Value:", longest_cell)

test_terms = biblio_650_df[biblio_650_df['subfield_a'].str.contains(";+", case=False, regex=True)]
test_terms.tail(10)
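If the semicolons turn out to separate several terms inside one subfield (an assumption that needs to be checked against the records), the values could be split as in the sketch below. The input string is invented and the helper name `split_terms` is hypothetical.

```python
import re

def split_terms(value: str) -> list[str]:
    """Split a subfield value on one or more semicolons and trim the parts."""
    parts = re.split(r"\s*;+\s*", value)
    return [part.strip() for part in parts if part.strip()]

terms = split_terms("Strikes; Trade unions ;; Wages")
print(terms)
```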

In [None]:
# generate report of recordIds where this problem occurs
# test_terms['001'].unique().tolist()

### Subjects with same string and multiple Ids

In [None]:
# First I should exclude the Ids that are external, e.g., OCLC...

subjects_with_local_ids = subjects_with_ids[subjects_with_ids['indicator_0'].str.contains("AMISG", case=False, regex=True)] 
subjects_with_local_ids.head(10)

In [None]:
# subjects_with_local_ids['indicator_0'].value_counts() --> this shows some messy local identifiers mixed with external

In [None]:
subjects_with_local_ids['001'].nunique()

In [None]:
# Group per term and collect the identifiers used with it.
# 'null' is the placeholder for an empty indicator_0 (filled in above), so it is not counted as an identifier.
biblio_subjectUniqueTerms_v20 = biblio_650_df.groupby('subfield_a', as_index=False).agg(
    count_of_records=('001', 'nunique'),
    ids=('001', lambda x: ','.join(map(str, x))),
    unique_ids=('indicator_0', lambda x: len(set(x.dropna()) - {'null'})),
    indicator_0=('indicator_0', lambda x: ','.join(sorted(set(x.dropna()) - {'null'})))
)


# Sort by 'count_of_records' in descending order (most frequent to least frequent)
biblio_subjectUniqueTerms_v21 = biblio_subjectUniqueTerms_v20.sort_values(by='count_of_records', ascending=False)

In [None]:
biblio_subjectUniqueTerms_v21.info()

In [None]:
# select and reorder the columns (the record Ids are kept in 'ids')
biblio_subjectUniqueTerms_v22 = biblio_subjectUniqueTerms_v21[['subfield_a', 'indicator_0', 'unique_ids', 'ids', 'count_of_records']]

In [None]:
biblio_subjectUniqueTerms_v22.head()

In [None]:
# dropping the row that contains the null values
biblio_subjectUniqueTerms_v23 = biblio_subjectUniqueTerms_v22[biblio_subjectUniqueTerms_v22['subfield_a'] != 'null']

In [None]:
biblio_subjectUniqueTerms_v23.info()

In [None]:
biblio_subjectUniqueTerms_v23.tail()

In [None]:
biblio_subjectUniqueTerms_v24 = biblio_subjectUniqueTerms_v23.reset_index(drop=True)

In [None]:
biblio_subjectUniqueTerms_v24.info()

In [None]:
# when is there more than 1 value?

biblio_subjects_Strings_wrong = biblio_subjectUniqueTerms_v24[biblio_subjectUniqueTerms_v24['unique_ids'] > 1]

In [None]:
biblio_subjects_Strings_wrong.tail()

In [None]:
len(biblio_subjects_Strings_wrong)

In [None]:
# which records have the problem? (record Ids of the terms with more than one identifier)
records_wrong_string = biblio_subjects_Strings_wrong['ids'].tolist()

In [None]:
records_wrong_string[0]

### Subjects with same Id and multiple strings (toDo)

#### Example record with both problems (no 1-1 relationship subject string/Id)
Subfields 0 and a should have a one-to-one relationship. Open questions:
1) Why are the unique counts different? Are the identifiers in 650-0 added only when a term is entered in Authorities? Why are there terms without field 650-0?
2) In the test term below, why do the inconsistencies occur? I noticed that if one picks a term from Authorities, both the id and the string are loaded, but the string can then be changed. Shouldn't we try to lock the edit in Biblio?
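Both directions of the one-to-one check can be expressed with `groupby(...).nunique()`. The sketch below uses an invented toy dataframe with the same two column names. It only illustrates the principle and is not the analysis itself.

```python
import pandas as pd

# Toy data (invented): id 1 has two strings, and "Strikes" has two ids
toy = pd.DataFrame({
    "indicator_0": ["(NL-AMISG)1", "(NL-AMISG)1", "(NL-AMISG)2", "(NL-AMISG)3"],
    "subfield_a": ["Nuclear weapons", "Nuclear arms", "Strikes", "Strikes"],
})

strings_per_id = toy.groupby("indicator_0")["subfield_a"].nunique()
ids_per_string = toy.groupby("subfield_a")["indicator_0"].nunique()

# A one-to-one relation would leave both lists empty
ids_with_several_strings = strings_per_id[strings_per_id > 1].index.tolist()
strings_with_several_ids = ids_per_string[ids_per_string > 1].index.tolist()

print(ids_with_several_strings)
print(strings_with_several_ids)
```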

In [None]:
# example to test why there is no one-to-one correspondence between the string and the id in the corresponding authority record
test_term2 = biblio_650_df[biblio_650_df['indicator_0'].str.contains("318765", case=False, regex=True)] # nuclear weapons
# test_term2 = biblio_650_df[biblio_650_df['indicator_0'].str.contains("370805", case=False, regex=True)]

test_term2

In [None]:
test_term_strings = test_term2['indicator_0'].unique().tolist()
for term in test_term_strings:
    print(term)

In [None]:
test_term_strings = test_term2['subfield_a'].unique().tolist()
for term in test_term_strings:
    print(term)

In [None]:
# grouping by LOCAL id (subjects_with_local_ids only)
biblio_subjectUniqueTerms_v60 = subjects_with_local_ids.groupby('indicator_0', as_index=False).agg(
    count_of_records=('001', 'nunique'),
    ids=('001', lambda x: ','.join(map(str, x))),
    unique_strings =('subfield_a', lambda x: len(set(x.dropna()))),
    strings=('subfield_a', lambda x: ','.join(sorted(set(x.dropna()))))
)


# Sort by 'count_of_records' in descending order (most frequent to least frequent)
biblio_subjectUniqueTerms_v61 = biblio_subjectUniqueTerms_v60.sort_values(by='count_of_records', ascending=False)

In [None]:
biblio_subjectUniqueTerms_v61.head()

In [None]:
biblio_subjectUniqueTerms_v64 = biblio_subjectUniqueTerms_v61[biblio_subjectUniqueTerms_v61['unique_strings'] > 1]

In [None]:
biblio_subjectUniqueTerms_v65 = biblio_subjectUniqueTerms_v64.reset_index(drop=True)

In [None]:
biblio_subjectUniqueTerms_v65['indicator_0'].nunique()

In [None]:
# Combine all 'ids' into one big string, split by commas, and count unique values
total_unique_ids = len(set(
    ','.join(biblio_subjectUniqueTerms_v65['ids'].dropna()).split(',')
))

In [None]:
total_unique_ids

In [None]:
# biblio_subjectUniqueTerms_v65['unique_ids_count'].nunique()

In [None]:
showdf = biblio_subjectUniqueTerms_v65[['indicator_0', 'count_of_records', 'strings', 'unique_strings']]
showdf.head(20)

# AUTHORITIES subject terms
- These terms are extracted from the IISH metadata using the public version of the OAI-PMH endpoint. For more information about what AUTHORITIES contains, see: https://confluence.socialhistoryservices.org/x/S4FeBw.
- The harvesting was done using the code from the "Metadata overviews" repository: https://github.com/lilimelgar/iisg-metadata-overviews
- The harvesting date was November 13th, 2024.
- Using another jupyter notebook (https://github.com/lilimelgar/iisg-metadata-overviews/blob/main/biblio/src/biblio_query.ipynb) I created a slice of the entire metadata selecting only the MARC fields that start with 6, because the 6XX block corresponds to the group of subject terms in MARC (https://www.loc.gov/marc/bibliographic/bd6xx.html)

## Preparation

In [None]:
# read csv as dataframe
authorities_subjectTerms_df_v0 = pd.read_csv(f'{data_downloads_authority}/subjects_authority_150_subfields_20250929-075346.csv', sep=",", low_memory=False)
# low_memory=False was set after this warning message: "/var/folders/3y/xbjxw0b94jxg6x2bcbyjsmmcgvnf7q/T/ipykernel_987/2912965462.py:3: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False."

# history of use (kept commented out so that it does not overwrite the file read above)
# authorities_subjectTerms_df_v0 = pd.read_csv(f'{data_downloads_authority}/subjects_authority_150_subfields_20250421-172541.csv', sep=",", low_memory=False)

In [None]:
# get an overview of the data
authorities_subjectTerms_df_v0.info(verbose = True, show_counts = True)

In [None]:
# convert datatypes and fill in empty values
df_columns = authorities_subjectTerms_df_v0.columns
for column in df_columns:
    dataType = authorities_subjectTerms_df_v0.dtypes[column]
    if dataType == np.float64:
        authorities_subjectTerms_df_v0[column] = authorities_subjectTerms_df_v0[column].fillna('null')
        authorities_subjectTerms_df_v0[column] = authorities_subjectTerms_df_v0[column].astype(str)
    if dataType == np.int_:
        authorities_subjectTerms_df_v0[column] = authorities_subjectTerms_df_v0[column].fillna('null')
        authorities_subjectTerms_df_v0[column] = authorities_subjectTerms_df_v0[column].astype(str)
    if dataType == object:
        authorities_subjectTerms_df_v0[column] = authorities_subjectTerms_df_v0[column].fillna('null')
        authorities_subjectTerms_df_v0[column] = authorities_subjectTerms_df_v0[column].astype(str)

In [None]:
# convert id to string
authorities_subjectTerms_df_v0['001'] = authorities_subjectTerms_df_v0['001'].astype(str)
authorities_subj_df_v1 = authorities_subjectTerms_df_v0.copy()

In [None]:
# get an overview of the data
authorities_subj_df_v1.shape

In [None]:
# get an overview of the data
authorities_subj_df_v1.tail(10)

In [None]:
# get an overview of the data
authorities_subj_df_v1.describe()

In [None]:
query_test12 = authorities_subj_df_v1[authorities_subj_df_v1['subfield_a'].str.contains("⑄", case=False, regex=True)] #
query_test12

In [None]:
query_test_venn = authorities_subj_df_v1[authorities_subj_df_v1['subfield_a'].str.contains("Recruitment of personnel", case=False, regex=True)] #
query_test_venn

In [None]:
# # TEMPORARILY DROP THE OUTLIER
# authorities_sub_df_v2 = authorities_subj_df_v1.drop(300).copy()

In [None]:
# authorities_sub_df_v2.shape

In [None]:
# query_test13 = authorities_sub_df_v2[authorities_sub_df_v2['150'].str.contains('"a":', case=False, regex=True)] #
# query_test13

In [None]:
# query_test13.shape

In [None]:
# authorities_sub_df_v2['150a'] = authorities_sub_df_v2['150'].map(lambda x: x.lstrip('"a":').rstrip(''))

In [None]:
# authorities_subj_df['150a'].unique()

In [None]:
authorities_subj_df = authorities_subj_df_v1.reset_index(drop=True)

In [None]:
authorities_subj_df.info()

In [None]:
authorities_subj_df.iloc[707] #Labour market

In [None]:
authorities_subj_df.head()

In [None]:
test_record = authorities_subj_df[authorities_subj_df['subfield_a'].str.contains('alca', case=False, regex=True)]
test_record

In [None]:
export_richard = authorities_subj_df[['001', 'subfield_a']]

In [None]:
export_richard.info()

In [None]:
# export
name_file = 'subject_terms_Authority_150a'


# ## or download to csv
export_richard.to_csv(f'{data_downloads_authority}/{name_file}.csv', index=False) # if too big, use compression='gzip'

## Normalize strings for Venn diagram

In [None]:
# normalize strings
authorities_subj_df['concept_string_normalized'] = (
    authorities_subj_df['subfield_a']
    .str.replace('-', ' ', regex=False)  # Replace dashes with spaces
    .str.title()                         # Convert to title case
    .str.replace(' ', '', regex=False)   # Remove spaces
)


In [None]:
query_test_venn2 = authorities_subj_df[authorities_subj_df['subfield_a'].str.contains("Typists", case=False, regex=True)] #
query_test_venn2

In [None]:
authority_list = authorities_subj_df['concept_string_normalized'].tolist()
# authority_list

## Get Ids for venn

In [None]:
authority_list_ids = authorities_subj_df_v1['001'].tolist()

In [None]:
len(authority_list_ids)

# POOLPARTY-THESAURUS
The thesaurus was exported from Poolparty as RDF-NQ using its export functionality. Some columns were then adapted with OpenRefine.

In [None]:
# read csv as dataframe
# poolparty_thes_df_v0 = pd.read_csv(f'{data_directory_subjects}/poolparty-thesaurus/pp-project-socialhistorytaxonomy-nq_STRINGS_ONLY_english.csv', sep=",", low_memory=False)
# low_memory=False was set after this warning message: "/var/folders/3y/xbjxw0b94jxg6x2bcbyjsmmcgvnf7q/T/ipykernel_987/2912965462.py:3: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False."

poolparty_thes_df_v0 = pd.read_csv(f'{data_directory_subjects}/poolparty-thesaurus/pp-project-socialhistorytaxonomy-nq_BasicColumns.csv', sep=",", low_memory=False)

In [None]:
poolparty_thes_df_v0.info()

In [None]:
# rename columns
poorparty_thes_df_v1 = poolparty_thes_df_v0.rename(columns={"subject": "subjectId", "http://www.w3.org/2004/02/skos/core#prefLabel": "prefLabel", "concept_string_normalized": "concept_string_normalized", "concept_string_language": "concept_string_language", "http://www.w3.org/2004/02/skos/core#notation": "authorityId"})

In [None]:
poorparty_thes_df_v1.head(1)

In [None]:
# convert datatypes and fill in empty values
# every float, integer, or object column gets 'null' for missing values and is cast to string
for column in poorparty_thes_df_v1.columns:
    dataType = poorparty_thes_df_v1.dtypes[column]
    if dataType in (np.float64, np.int_, object):
        poorparty_thes_df_v1[column] = poorparty_thes_df_v1[column].fillna('null').astype(str)

In [None]:
poorparty_thes_df_v1.info()

In [None]:
poorparty_thes_df = poorparty_thes_df_v1.reset_index(drop=True)

## Normalize strings and Ids for Venn

In [None]:
poolparty_thes_list = poorparty_thes_df['concept_string_normalized'].tolist()
poolparty_thes_list

In [None]:
type(poolparty_thes_list)

In [None]:
poolparty_authority_ids_notnull = poorparty_thes_df[poorparty_thes_df['authorityId'] != 'null']

In [None]:
# poolparty Id list
poolparty_authority_ids = poolparty_authority_ids_notnull['authorityId'].tolist()
poolparty_authority_ids

In [None]:
len(poolparty_authority_ids)

# PAPER-THESAURUS

In [None]:
# read csv as dataframe
paper_thesaurus_df_v0 = pd.read_csv(f'{data_directory_subjects}/paper-thesaurus/iish-thesaurus-pdf_STRINGS_ONLY.csv', sep=",", low_memory=False)
# low_memory=False was set after this warning message: "/var/folders/3y/xbjxw0b94jxg6x2bcbyjsmmcgvnf7q/T/ipykernel_987/2912965462.py:3: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False."

In [None]:
paper_thesaurus_df_v0.info()

In [None]:
paper_thes_list = paper_thesaurus_df_v0['concept_string_normalized'].tolist()
paper_thes_list

# COMPARISONS (VENN)

## Comparing Poolparty-thesaurus to Paper-thesaurus

In [None]:
# pip install matplotlib-venn

In [None]:
# convert to sets
set1 = set(paper_thes_list)
set2 = set(poolparty_thes_list)

# venn2(subsets = (3, 2, 1))

# venn2((set(['A', 'B', 'C', 'D']), set(['D', 'E', 'F'])))
venn2([set1, set2], set_labels=('Paper-thesaurus', 'Poolparty-thesaurus'))
plt.title("Venn Diagram of Paper-thesaurus and Poolparty-thesaurus")
plt.show()

In [None]:
# this roughly indicates that from the 2080 terms in Poolparty, 1062 match with the paper thesaurus
# 
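
The numbers shown in the diagram can also be obtained directly with set operations. The sketch below uses small made-up sets (the terms are assumptions, not the real thesaurus content) to show which expression corresponds to which region of the diagram.

```python
# Hypothetical normalized terms standing in for the two thesauri
paper_terms = {"Strikes", "TradeUnions", "Typists", "Wages"}
poolparty_terms = {"Strikes", "TradeUnions", "Refugees"}

shared = paper_terms & poolparty_terms          # terms in both sources
paper_only = paper_terms - poolparty_terms      # only in the paper thesaurus
poolparty_only = poolparty_terms - paper_terms  # only in Poolparty

print(len(paper_only), len(poolparty_only), len(shared))
```

Because sets drop duplicates, these counts refer to unique normalized strings and can be lower than the number of rows in the corresponding dataframe.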

In [None]:
from matplotlib import pyplot as plt
from matplotlib_venn import venn2

# Your sets
venn = venn2([set1, set2], set_labels=('Paper-thesaurus', 'Poolparty-thesaurus'))

# Custom colors
venn.get_patch_by_id('10').set_color('#66c2a5')  # Paper-only
venn.get_patch_by_id('01').set_color('#fc8d62')  # Poolparty-only
venn.get_patch_by_id('11').set_color('#8da0cb')  # Shared

# Optional: transparency
for subset in ('10', '01', '11'):
    patch = venn.get_patch_by_id(subset)
    if patch:
        patch.set_alpha(0.7)

# Title
plt.title("Venn Diagram of Paper-thesaurus and Poolparty-thesaurus")
plt.show()


In [None]:
from matplotlib import pyplot as plt
from matplotlib_venn import venn2

# Your sets
venn = venn2([set1, set2], set_labels=('Paper-thesaurus', 'Poolparty-thesaurus'))

# Set custom colors from the Lucid palette
colors = {
    '10': '#85bdff',  # Paper only
    '01': '#FF6F91',  # Poolparty only
    '11': '#845EC2',  # Overlap
}

# Apply colors
for subset_id, color in colors.items():
    patch = venn.get_patch_by_id(subset_id)
    if patch:
        patch.set_color(color)
        patch.set_alpha(0.35)

# Optional: larger, black set labels
for text in venn.set_labels:
    text.set_fontsize(12)
    text.set_color('black')

plt.title("Venn Diagram of Paper-thesaurus and Poolparty-thesaurus", fontsize=14)
plt.show()



In [None]:
# from matplotlib_venn import venn3

# set1 = set(['A', 'B', 'C', 'D'])
# set2 = set(['B', 'C', 'D', 'E'])
# set3 = set(['C', 'D',' E', 'F', 'G'])

# venn3([set1, set2, set3], ('Set1', 'Set2', 'Set3'))
# plt.show()

In [None]:
# Set difference: terms in the paper thesaurus but not in Poolparty
non_matching = set1 - set2  # or: set1.difference(set2)

# Print all (optional)
print("Terms in Paper-thesaurus not in poolparty_thes_list:")
print(non_matching)

In [None]:
# Set difference: terms in Poolparty but not in the paper thesaurus
non_matching = set2 - set1  # or: set2.difference(set1)

# Print all (optional)
print("Terms in Poolparty_thes_list not in Paper-thesaurus:")
print(non_matching)

## Comparing Authority subject terms & Poolparty-thesaurus

### Strings

In [None]:
# convert to sets
set1 = set(authority_list)
set2 = set(poolparty_thes_list)

# venn2(subsets = (3, 2, 1))

# # venn2((set(['A', 'B', 'C', 'D']), set(['D', 'E', 'F'])))
# venn2([set1, set2], set_labels=('Authority subject terms', 'Poolparty-thesaurus'))
# plt.title("Venn Diagram of Authority subject terms and Poolparty-thesaurus")
# plt.show()

# Your sets
venn = venn2([set1, set2], set_labels=('Authority subjects', 'Poolparty-thesaurus'))

# Set custom colors from the Lucid palette
colors = {
    '10': '#85bdff',  # Authority only
    '01': '#FF6F91',  # Poolparty only
    '11': '#845EC2',  # Overlap
}

# Apply colors
for subset_id, color in colors.items():
    patch = venn.get_patch_by_id(subset_id)
    if patch:
        patch.set_color(color)
        patch.set_alpha(0.35)

# Optional: larger, black set labels
for text in venn.set_labels:
    text.set_fontsize(12)
    text.set_color('black')


plt.title("Venn Diagram of Authority subjects and Poolparty-thesaurus", fontsize=14)
plt.show()


In [None]:
# Set difference: items in authority_list but not in poolparty_thes_list
non_matching = set1 - set2  # or: set1.difference(set2)

# Print all (optional)
print("Terms in authority_list not in poolparty_thes_list:")
print(non_matching)

In [None]:
# Set difference: terms in Poolparty but not in the authority list
non_matching = set2 - set1  # or: set2.difference(set1)

# Print all (optional)
print("Terms in Poolparty not in Authority:")
print(non_matching)

### Ids from Authorities

In [None]:
# convert to sets
set1 = set(authority_list_ids)
set2 = set(poolparty_authority_ids)

# venn2(subsets = (3, 2, 1))

# # venn2((set(['A', 'B', 'C', 'D']), set(['D', 'E', 'F'])))
# venn2([set1, set2], set_labels=('Authority subject terms', 'Poolparty-thesaurus'))
# plt.title("Venn Diagram of Authority subject terms and Poolparty-thesaurus")
# plt.show()

# Your sets
venn = venn2([set1, set2], set_labels=('Authority subject Ids', 'Poolparty-thesaurus Ids'))

# Set custom colors from the Lucid palette
colors = {
    '10': '#85bdff',  # AuthorityIds
    '01': '#FF6F91',  # Poolparty AuthorityIds
    '11': '#845EC2',  # Overlap
}

# Apply colors
for subset_id, color in colors.items():
    patch = venn.get_patch_by_id(subset_id)
    if patch:
        patch.set_color(color)
        patch.set_alpha(0.35)

# Optional: larger, black set labels
for text in venn.set_labels:
    text.set_fontsize(12)
    text.set_color('black')


plt.title("Venn Diagram of Authority subject Ids and Poolparty-thesaurus Authority Ids", fontsize=14)
plt.show()
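
A set comparison of identifiers only finds a match when both sides have exactly the same representation. As a sketch of one possible safeguard, the hypothetical helper `normalize_id` below turns every identifier into a plain string and strips a trailing `.0`, which can appear when a numeric column was read as float before being cast to string. Whether this applies to `authorityId` here is an assumption that should be checked against the data; the example values are made up.

```python
def normalize_id(value):
    """Return an identifier as a plain string, e.g. 123.0 becomes '123'."""
    text = str(value).strip()
    # A float-like string such as '123.0' loses its trailing '.0'
    if text.endswith(".0") and text[:-2].isdigit():
        return text[:-2]
    return text

authority_ids_example = ["101", "102", "103"]    # hypothetical values
poolparty_ids_example = [101.0, "102.0", "104"]  # hypothetical values

ids_a = {normalize_id(v) for v in authority_ids_example}
ids_b = {normalize_id(v) for v in poolparty_ids_example}

print(sorted(ids_a & ids_b))
```

Without such a step, `"101"` and `"101.0"` would be counted as two different identifiers and the overlap in the diagram would be underestimated.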

## Comparing biblio unique terms with authorities unique terms (strings)

In [None]:
# convert to sets
set1 = set(authority_list)
set2 = set(biblio_subjects_list)

# venn2(subsets = (3, 2, 1))

# # venn2((set(['A', 'B', 'C', 'D']), set(['D', 'E', 'F'])))
# venn2([set1, set2], set_labels=('Authority subject terms', 'Biblio subject terms'))
# plt.title("Venn Diagram of Authority subject terms and Biblio subject terms")
# plt.show()

# Your sets
venn = venn2([set1, set2], set_labels=('Authority subjects', 'Biblio subject terms'))

# Set custom colors from the Lucid palette
colors = {
    '10': '#85bdff',  # Authority only
    '01': '#FF6F91',  # Biblio only
    '11': '#845EC2',  # Overlap
}

# Apply colors
for subset_id, color in colors.items():
    patch = venn.get_patch_by_id(subset_id)
    if patch:
        patch.set_color(color)
        patch.set_alpha(0.35)

# Optional: larger, black set labels
for text in venn.set_labels:
    text.set_fontsize(12)
    text.set_color('black')


plt.title("Venn Diagram of Authority subject terms and Biblio subject terms", fontsize=14)
plt.show()


In [None]:
# Set difference: terms in authority_list but not in biblio_subjects_list
non_matching = set1 - set2  # or: set1.difference(set2)

# Print all (optional)
print("Terms in authority_list not in biblio_list:")
print(non_matching)

In [None]:
biblio_650_df.info()

### Compare the two dataframes

In [None]:
# get string column from Biblio
biblio_subject_strings_df_v0 = biblio_650_df[['001','subfield_a']]
biblio_subject_strings_df_v0.head()

In [None]:
biblio_subject_strings_df_v0.shape

In [None]:
biblio_subject_strings_df_v1 = biblio_subject_strings_df_v0[biblio_subject_strings_df_v0['subfield_a'] != 'null']
biblio_subject_strings_df_v1.head()

In [None]:
biblio_subject_strings_df = biblio_subject_strings_df_v1.reset_index(drop=True)

In [None]:
biblio_subject_strings_df.shape

In [None]:
authorities_subj_df.info()

In [None]:
# get string column from Authority
authority_subject_strings_df_v0 = authorities_subj_df[['001','subfield_a']]
authority_subject_strings_df_v0.head()

In [None]:
authority_subject_strings_df = authority_subject_strings_df_v0.reset_index(drop=True)

In [None]:
# # compare strings
# biblio_subject_strings_df['stringBiblio_in_Authority'] = biblio_subject_strings_df['subfield_a'].isin(authority_subject_strings_df['subfield_a'])

In [None]:
# biblio_subject_strings_df.head()

In [None]:
# Rename the authority column for clarity during the merge
authority_subject_strings_df_renamed = authority_subject_strings_df[['subfield_a']].rename(columns={'subfield_a': 'string_match'})

# Left-merge the biblio terms with the authority terms on matching strings
comparison_df = biblio_subject_strings_df.merge(
    authority_subject_strings_df_renamed,
    how='left',
    left_on='subfield_a',
    right_on='string_match'
)

# Add a boolean 'Matches' column
comparison_df['Matches'] = comparison_df['string_match'].notna()


comparison_df.head(200)
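
As an alternative sketch of the same comparison, pandas can flag matches itself through the `indicator` parameter of `merge`. The two small dataframes below are made-up stand-ins for the biblio and authority tables. The lookup side is deduplicated first, since a left join repeats a biblio row once for every identical string on the authority side.

```python
import pandas as pd

# Hypothetical stand-ins for the biblio and authority tables
biblio_example = pd.DataFrame({"001": ["b1", "b2", "b3"],
                               "subfield_a": ["Strikes", "Typists", "Strikes"]})
authority_example = pd.DataFrame({"subfield_a": ["Strikes", "Strikes", "Wages"]})

# Deduplicate the lookup side so the left join cannot multiply biblio rows
authority_unique = authority_example[["subfield_a"]].drop_duplicates()

merged = biblio_example.merge(authority_unique, how="left",
                              on="subfield_a", indicator=True)

# '_merge' is 'both' when the string exists in the authority table
merged["Matches"] = merged["_merge"] == "both"

print(merged[["001", "subfield_a", "Matches"]])
```

With this approach the number of rows in the result stays equal to the number of biblio rows, which makes the later counts easier to interpret.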

In [None]:
comparison_df.shape

In [None]:
comparison_df_unique = comparison_df.drop_duplicates()
comparison_df_unique.head(200)

In [None]:
test20 = comparison_df_unique[comparison_df_unique['subfield_a'].str.contains("Refugees*", case=False, regex=True)]
test20

In [None]:
comparison_df['001'].nunique()

## Comparing biblio unique terms with authorities unique terms (Ids)

In [None]:
# convert to sets
set1 = set(authority_list_ids)
set2 = set(biblio_subjects_Ids_list)

# venn2(subsets = (3, 2, 1))

# # venn2((set(['A', 'B', 'C', 'D']), set(['D', 'E', 'F'])))
# venn2([set1, set2], set_labels=('Authority subject terms (ids)', 'Biblio subject terms (ids)'))
# plt.title("Venn Diagram of Authority subject terms (ids) and Biblio subject terms (ids)")
# plt.show()

# Your sets
venn = venn2([set1, set2], set_labels=('Authority subject terms Ids', 'Biblio subject terms Ids'))

# Set custom colors from the Lucid palette
colors = {
    '10': '#85bdff',  # Authority Ids only
    '01': '#FF6F91',  # Biblio Ids only
    '11': '#845EC2',  # Overlap
}

# Apply colors
for subset_id, color in colors.items():
    patch = venn.get_patch_by_id(subset_id)
    if patch:
        patch.set_color(color)
        patch.set_alpha(0.35)

# Optional: larger, black set labels
for text in venn.set_labels:
    text.set_fontsize(12)
    text.set_color('black')


plt.title("Venn Diagram of Authority subject terms (ids) and Biblio subject terms (ids)", fontsize=14)
plt.show()

In [None]:
# Set difference: Ids in authority_list_ids but not in biblio_subjects_Ids_list
non_matching_1 = set1 - set2  # or: set1.difference(set2)

# Print all (optional)
print("Terms in authority_list NOT in biblio_list:")
print(non_matching_1)

In [None]:
len(non_matching_1)

In [None]:
# Set difference: Ids in biblio_subjects_Ids_list but not in authority_list_ids
non_matching_2 = set2 - set1  # or: set2.difference(set1)

# Print all (optional)
print("Terms in biblio_list NOT in authority_list:")
print(non_matching_2)

## Comparing main three (for updating thesaurus)

In [None]:

# # set4 = set(biblio_subjects_list)

In [None]:

# Example sets (replace with yours)
set1 = set(paper_thes_list)
set2 = set(poolparty_thes_list)
set3 = set(authority_list)

# Create Venn diagram
venn = venn3([set1, set2, set3],
             set_labels=('Paper thes. strings', 'Poolparty strings', 'Authority strings'))

# Custom color mapping (ids follow the set order: Paper, Poolparty, Authority)
colors = {
    '100': '#85bdff',   # Paper only
    '010': '#FF6F91',   # Poolparty only
    '001': '#FFC75F',   # Authority only
    '110': '#A1D5FF',   # Paper ∩ Poolparty
    '101': '#B39CD0',   # Paper ∩ Authority
    '011': '#F9F871',   # Poolparty ∩ Authority
    '111': '#845EC2',   # All three
}

# Apply colors
for subset_id, color in colors.items():
    patch = venn.get_patch_by_id(subset_id)
    if patch:
        patch.set_color(color)
        patch.set_alpha(0.5)

# Optional label tweaks
for text in venn.set_labels:
    if text:
        text.set_fontsize(11)
        text.set_color('black')

plt.title("Venn Diagram: Paper thesaurus, Poolparty, and Authority strings", fontsize=14)
plt.show()
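
The three-character ids used for the patches encode membership in the first, second, and third set, in that order (for example `'110'` is the region in the first two sets but not in the third). The sketch below computes the size of each of the seven regions with plain set operations. The function name `venn3_region_sizes` and the sample terms are assumptions made for illustration.

```python
def venn3_region_sizes(a, b, c):
    """Sizes of the seven regions, keyed like the three-set patch ids."""
    return {
        "100": len(a - b - c),    # only in a
        "010": len(b - a - c),    # only in b
        "001": len(c - a - b),    # only in c
        "110": len((a & b) - c),  # in a and b, not in c
        "101": len((a & c) - b),  # in a and c, not in b
        "011": len((b & c) - a),  # in b and c, not in a
        "111": len(a & b & c),    # in all three
    }

# Hypothetical normalized terms
paper = {"Strikes", "Wages", "Typists"}
poolparty = {"Strikes", "Wages", "Refugees"}
authority = {"Strikes", "Refugees", "Archives"}

print(venn3_region_sizes(paper, poolparty, authority))
```

Such a table of counts is a convenient check on the diagram, where small regions are sometimes hard to read.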


# (not used now) MAPPINGS authority - Poolparty

ListA = authorities_sub_df
ListB = poolparty_df

<!-- Authorities -->
<!-- #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   001     982 non-null    object
 1   150     982 non-null    object
 2   450     982 non-null    object
 3   550     982 non-null    object
 4   leader  982 non-null    object
 5   150a    982 non-null    object -->


<!-- Poolparty -->
<!-- #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Concept            2213 non-null   object
 1   PreferredLabels    2213 non-null   object
 2   AlternativeLabels  2213 non-null   object
 3   BroaderConcepts    2213 non-null   object
 4   NarrowerConcepts   2213 non-null   object -->

In [None]:
# This python script detects the string similarity between two lists of concepts/terms.

from tqdm import tqdm
from time import sleep
from fuzzywuzzy import fuzz


def compare_strings(dfA, dfB):
    '''Processes and maps candidate names
    Inputs are two dataframes of names
    Outputs a dataframe of candidates
    '''

    # # create an empty dataframe
    # mapped_candidates_df = pd.DataFrame()
    ##COLLECT ROWS
    rows = []

    ############################## CAPTURE VARIABLES FROM DFs #######################################
    # for indexB, rowB in dfB.iterrows():
    for indexB, rowB in tqdm(dfB.iterrows(), total=dfB.shape[0]):
        # Capture basic standard columns for the mapping dataset B (to be mapped) as variables
        idB = dfB.loc[indexB, 'Concept']
        stringB = dfB.loc[indexB, 'PreferredLabels']
        alternativeLabelsB = dfB.loc[indexB, 'AlternativeLabels']
        broaderConceptsB = dfB.loc[indexB, 'BroaderConcepts']
        narrowerConceptsB = dfB.loc[indexB, 'NarrowerConcepts']
        sleep(0.01)
        for indexA, rowA in dfA.iterrows():
            # Capture basic standard columns for the mapping dataset A (to be mapped to) as variables
            idA = dfA.loc[indexA, '001']
            stringA = dfA.loc[indexA, '150a']
            marc150 = dfA.loc[indexA, '150']
            marc450 = dfA.loc[indexA, '450']
            marc550 = dfA.loc[indexA, '550']        
            leader = dfA.loc[indexA, 'leader']

    ############################## SET STRING MATCHING SETTINGS #######################################

            # Algorithm to be used
            matchScore1 = fuzz.token_sort_ratio(stringA.lower(), stringB.lower())
            # matchScore2 = fuzz.token_set_ratio(stringA, stringB)
            # matchScore3 = fuzz.partial_ratio(nameStringA, nameStringB) # USE WITH casesNoisy (edit below) if names in both datasets are very similar. It compares parts of strings, low score is useful to avoid matches like this (('Carlieri Jacopo', 'Jacopo Battieri'))

            # String score ranges
            rangeScoreVeryLow = 80
            rangeScoreLow = 85
            rangeScoreMid = 90
            rangeScoreHigh = 100


    ############################## RUN STRING MATCHING #######################################
            # keep the pair as a candidate when the score falls within [rangeScoreVeryLow, rangeScoreHigh]
            if rangeScoreVeryLow <= matchScore1 <= rangeScoreHigh:
                rows.append({
                    'scoreString': matchScore1,
                    'scoreType': 'matchScore1',
                    'idA': idA,
                    'match_idB': idB,
                    'stringA': stringA,
                    'match_stringB': stringB,
                    'marc150': marc150,
                    'marc450': marc450,
                    'marc550': marc550,
                    'leader': leader,
                    'alternativeLabelsB': alternativeLabelsB,
                    'broaderConceptsB': broaderConceptsB,
                    'narrowerConceptsB': narrowerConceptsB
                })

    # build the result once, after all pairs have been compared
    df_mapped = pd.DataFrame(rows)

    return df_mapped
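
To illustrate the idea behind a token-sort comparison, here is a minimal standard-library sketch. It is an approximation and not the `fuzzywuzzy` implementation: `token_sort_score` is a hypothetical helper that lowercases both strings, sorts their words, and compares the results with `difflib.SequenceMatcher`. The library applies additional preprocessing, so its scores can differ from the ones produced here.

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    """Approximate token-sort similarity on a 0-100 scale (stdlib only)."""
    # Lowercase, split into words, sort them, and join again
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())

print(token_sort_score("Labour market", "market labour"))
print(token_sort_score("Strikes", "Archives"))
```

Sorting the words first makes the score insensitive to word order, which suits inverted headings. Pairs scoring below the lowest threshold (80 in the function above) would be discarded as candidates.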


In [None]:
# dfA = authorities_sub_df
# dfB = poolparty_df

mapped_candidates = compare_strings(authorities_sub_df, poolparty_df)

In [None]:
mapped_candidates.head()

In [None]:
mapped_candidates.info(verbose=True)

In [None]:
test12 = mapped_candidates[mapped_candidates['stringA'].str.contains("strike*|staking*", case=False, regex=True)]
test12

In [None]:
# NOT USED

In [None]:
# # how many terms are in Biblio that have a correspondent Authority record?
# biblio_subjectTerms_notnull = biblio_subjectTerms_df[biblio_subjectTerms_df['650'] == 'notnull']
# question2 = biblio_subjectTerms_notnull[biblio_subjectTerms_notnull['"0"'].str.contains("NL-AMISG)", case=True, regex=False)]
# question2_total = question2.shape[0]
# question2_total

In [None]:
# # how many terms are in Biblio that do not have a correspondent Authority record?
# biblio_subjectTerms_a_notnull = biblio_subjectTerms_df[biblio_subjectTerms_df['"a"'] is not ]
# question3 = biblio_subjectTerms_notnull[~biblio_subjectTerms_notnull['"0"'].str.contains("", case=True, regex=False)]
# question3_total = biblio_subjectTerms_null.shape[0]
# question3_total

In [None]:
# questionTest = biblio_subjectTerms_df['"0"'].value_counts()
# questionTest

In [None]:
# query_test = subject_terms_df[
#     subject_terms_df['"a"'].str.contains("collective", case=False, regex=True) & 
#     subject_terms_df["another_column"].str.contains("stringTest", case=False, regex=True)
# ]

In [None]:
# subject_terms_df['"x"'].unique()
# query_test = subject_terms_df[subject_terms_df['"a"'].str.contains("collective", case=False, regex=True)] #
# query_test
# query_test['"a"'].unique()