# Author Gender Tagging
Code to tag OhioLINK MARC records for books with their author's gender as recorded in the VIAF. 

The files "author-name-index.parquet" and "author-genders.parquet" were kindly shared with us by Michael Ekstrand for use in this project and are not stored in the OSF repository. The two files can be generated using the [PIReT Book Data Tools](https://bookdata.piret.info/data/viaf.html) and should be stored in the folder *Library Bias Analysis/Data/Author Data*. The PIReT tools are described in:
> Ekstrand, M. D., and Kluver, D. (2021). Exploring Author Gender in Book Rating and Recommendation. *User Modeling and User-Adapted Interaction, 31(3)*, 377–420. doi: https://doi.org/10.1007/s11257-020-09284-2.

Note that the VIAF dataset is incomplete and in some cases incorrect: the VIAF records an author's gender only as 'male', 'female', or 'unknown' and does not record any non-binary genders. Further analyses using this data are thus limited.

MARC records can link to multiple VIAF records. Following the linking process used by Ekstrand & Kluver (2021), MARC records with a matching author in the VIAF dataset are tagged as 'male', 'female', 'unknown', or 'ambiguous'. 'Ambiguous' is only used if a MARC record matches at least one male author record <u>and</u> at least one female author record.
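
The rule can be illustrated with a minimal sketch. The function name `tag_from_genders` is hypothetical and is not part of this notebook; the sketch assumes the gender labels are exactly the strings 'male', 'female', and 'unknown'.

```python
def tag_from_genders(genders):
    """Collapse the genders of all linked VIAF records into a single tag.

    Hypothetical helper for illustration; assumes labels are
    'male', 'female', or 'unknown'.
    """
    # Unknown records never override a known gender, so set them aside
    known = set(genders) - {'unknown'}
    if not known:
        return 'unknown'
    if len(known) == 1:
        return next(iter(known))
    # Both 'male' and 'female' appear among the linked records
    return 'ambiguous'


print(tag_from_genders(['female', 'unknown']))         # one known gender plus unknowns
print(tag_from_genders(['male', 'female', 'unknown'])) # conflicting known genders
```

Under these assumptions, a record linked to 'female' and 'unknown' entries is tagged 'female', while any mix of 'male' and 'female' is tagged 'ambiguous'.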


## Imports

In [1]:
import pickle as pk
import polars as pl
import pandas as pd
import regex as re

## Functions

In [2]:
def percent(count, total):
    return (count/total)*100

def renorm(num1, num2):
    return (num1 / (num1+num2))*100

'''
Link data from author-name-index.parquet and author-genders.parquet then
convert it to a pandas dataframe with a column for author names and a 
column for author genders. 
'''
def extract_auth_gens():
    auth_name_idx = pl.scan_parquet('Author Data/author-name-index.parquet')
    auth_gen = pl.scan_parquet('Author Data/author-genders.parquet')
    joined = auth_name_idx.join(auth_gen, left_on='rec_id', right_on='rec_id')
    name_w_gen = joined.select(['name', 'gender']).unique()
    return name_w_gen.collect().to_pandas()

'''
Preprocess names to match format in author-name-index.parquet. Method
intended to mirror name processing from Ekstrand & Kluver (2021).
'''
def process_name(name):
    # strip leading and trailing whitespace
    name = name.strip()
    # strip trailing punctuation
    name = name.rstrip('.,_-()')
    # remove remaining periods
    name = name.replace('.', '')
    # remove square brackets
    name = name.replace('[', '').replace(']', '')
    # remove dates (raw strings avoid invalid-escape warnings for \d)
    name = re.sub(r',?( d|,d)? ?\d{4} ?-? ?\d{,4}', '', name)
    # remove trailing numbers
    name = re.sub(r'\d*$| \d*$', '', name)
    # remove parenthesized text, including an unclosed parenthesis
    name = re.sub(r' ?\([^)]*\)?', '', name)
    # final strip
    name = name.strip()
    return name

'''
Tag a MARC record with its author's gender based on the genders listed
in the VIAF records it is linked to. Tagging convention of male, female,
ambiguous, or unknown follows the linking method used in Ekstrand &
Kluver (2021).
'''
def get_gender_tag(genList):
    # if only one VIAF record links to a MARC record
    if len(genList) == 1:
        gen = genList[0]
    else:
        # if all linked VIAF records list the same gender
        if genList.count(genList[0]) == len(genList):
            gen = genList[0]
        elif 'unknown' in genList:
            sub_vals = [val for val in genList if val != 'unknown']
            # if all linked VIAF records are either the same known gender or unknown 
            if sub_vals.count(sub_vals[0]) == len(sub_vals):
                gen = sub_vals[0]
            else:
                gen = 'ambiguous'
        # if linked VIAF records list both male and female
        else:
            gen = 'ambiguous'        
    return gen
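
The date and parenthesis patterns in `process_name` above can be checked in isolation. This is a minimal sketch, not the notebook's own code: it uses the standard-library `re` module rather than the `regex` package imported above (assuming the two behave the same for these patterns), the helper name `strip_dates_and_parens` is hypothetical, and the sample names are illustrative.

```python
import re  # standard library; the notebook imports the third-party `regex` package as `re`

# Patterns copied from process_name
DATE_PATTERN = r',?( d|,d)? ?\d{4} ?-? ?\d{,4}'
PAREN_PATTERN = r' ?\([^)]*\)?'


def strip_dates_and_parens(name):
    """Hypothetical helper: apply only the date and parenthesis steps."""
    name = re.sub(DATE_PATTERN, '', name)
    name = re.sub(PAREN_PATTERN, '', name)
    return name.strip()


examples = [
    'Austen, Jane, 1775-1817',   # birth and death years
    'Doe, John (John Q), 1950',  # parenthetical fuller name plus a year
    'Roe, Sam, d 1901',          # death-year marker
]
cleaned = [strip_dates_and_parens(n) for n in examples]
print(cleaned)
```

Each sample reduces to a bare "Surname, Forename" string, which is the form matched against the names in author-name-index.parquet.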

## Load Data

#### OhioLINK Data

In [3]:
with open('OhioLINK Data/marcData.pk', 'rb') as f:
    bookList = pk.load(f)
totalMARC = len(bookList)

#### VIAF Data

In [4]:
auth_gen_df = extract_auth_gens()
gender_counts = auth_gen_df['gender'].value_counts()
# Look counts up by label; positional lookup depends on the sort order
num_uk = gender_counts['unknown']
num_m = gender_counts['male']
num_f = gender_counts['female']
totalVIAF = num_uk + num_f + num_m

In [5]:
print(f'There are {totalVIAF:,} author records:\n\t{round(percent(num_f, totalVIAF), 2):.2f}% are female\
      \n\t{round(percent(num_m, totalVIAF), 2):.2f}% are male\n\t{round(percent(num_uk, totalVIAF), 2):.2f}% are unknown')
print(f'Renormalized:\n\t{round(renorm(num_f, num_m), 2):.2f}% are female\n\t{round(renorm(num_m, num_f), 2):.2f}% are male')

There are 44,923,746 author records:
	10.70% are female      
	26.79% are male
	62.50% are unknown
Renormalized:
	28.55% are female
	71.45% are male
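
The renormalized figures drop the unknown records and recompute the shares over male and female records only. A small worked example of the two helpers, using made-up counts (not taken from the VIAF data):

```python
def percent(count, total):
    return (count / total) * 100


def renorm(num1, num2):
    # Share of num1 among num1 + num2 only, ignoring every other category
    return (num1 / (num1 + num2)) * 100


# Hypothetical counts chosen for easy arithmetic
n_female, n_male, n_unknown = 30, 70, 100
n_total = n_female + n_male + n_unknown

share_of_all = percent(n_female, n_total)   # 30 / 200
share_of_known = renorm(n_female, n_male)   # 30 / (30 + 70)
print(share_of_all, share_of_known)
```

With these counts, female records are 15% of all records but 30% of the records with a known gender.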


## Link Data

In [6]:
# Process MARC record author names and collect OCLC numbers
names = [process_name(item['auth'][0]) for item in bookList if item['auth'] != []]
oclc = [item['oclc'] for item in bookList if item['auth'] != []]

In [7]:
# Join MARC records with VIAF records based on author names
book_df = pd.DataFrame({'oclc': oclc, 'name': names})
joined = book_df.set_index('name').join(auth_gen_df.set_index('name'))
joined = joined[joined['gender'].notnull()]
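
Because one processed name can match several VIAF records, the join can yield more than one row per OCLC number; those rows become the gender lists later passed to `get_gender_tag`. A toy illustration of the same join, with made-up names and OCLC numbers:

```python
import pandas as pd

# Hypothetical MARC side: one row per book
books = pd.DataFrame({'oclc': [1, 2, 3],
                      'name': ['Doe, Jane', 'Roe, Sam', 'Nobody, A']})
# Hypothetical VIAF side: 'Doe, Jane' appears in two records
viaf = pd.DataFrame({'name': ['Doe, Jane', 'Doe, Jane', 'Roe, Sam'],
                     'gender': ['female', 'unknown', 'male']})

# Left join on the name index, as in the cell above
toy_joined = books.set_index('name').join(viaf.set_index('name'))
# Drop books whose author name has no VIAF match
toy_matched = toy_joined[toy_joined['gender'].notnull()]
print(toy_matched)
```

Here OCLC 1 appears twice (once per matching VIAF record), and OCLC 3 is dropped because its author name has no match.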

In [8]:
# Collect (OCLC number, author gender) pairs and the unique OCLC numbers
gen_dat = [(x, y) for x, y in zip(joined['oclc'], 
                                  joined['gender'])]
oclcs = [item[0] for item in gen_dat]
uniq_oclc = list(set(oclcs))

In [9]:
# Build author gender lookup table
gen_hash = {oclc: [] for oclc in uniq_oclc}
for oclc, gen in gen_dat:
    gen_hash[oclc].append(gen)
for key, value in gen_hash.items():
    gender = get_gender_tag(value)
    gen_hash[key] = gender

In [10]:
with open('Author Data/authorGender.pk', 'wb') as f:
    pk.dump(gen_hash, f)

## Data Breakdown
Note: These results are the tagging breakdown for the entire set of MARC records, not just records with both a valid LCC and DDC. Thus the percentages will differ from the results reported for the item gender bias analyses.

In [11]:
noVIAF, noMARC, mCount, fCount, uCount, aCount = 0, 0, 0, 0, 0, 0
for item in bookList:
    if item['auth'] == []:
        noMARC += 1
    elif item['oclc'] not in gen_hash.keys():
        noVIAF += 1
    elif gen_hash[item['oclc']] == 'male':
        mCount += 1
    elif gen_hash[item['oclc']] == 'female':
        fCount += 1
    elif gen_hash[item['oclc']] == 'ambiguous':
        aCount += 1
    elif gen_hash[item['oclc']] == 'unknown':
        uCount += 1

In [12]:
print(f'In the dataset of {totalMARC:,} books:\n\t{round(percent(fCount+mCount, totalMARC), 2)}% have a known gender\n\t\
{round(percent(uCount, totalMARC), 2)}% are unknown\n\t{round(percent(aCount, totalMARC), 2)}% are ambiguous\
    \n\t{round(percent(noMARC, totalMARC), 2)}% have no author in MARC\n\t{round(percent(noVIAF, totalMARC), 2)}% have no author in VIAF')
print(f'Renormalized:\n\t{round(renorm(fCount, mCount), 2)}% are female\n\t{round(renorm(mCount, fCount), 2)}% are male')

In the dataset of 6,779,969 books:
	51.31% have a known gender
	11.86% are unknown
	0.63% are ambiguous    
	25.69% have no author in MARC
	10.5% have no author in VIAF
Renormalized:
	15.39% are female
	84.61% are male
