<a href="https://colab.research.google.com/github/julianikulski/director-experience/blob/main/preprocessing/biography_matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Matching biographies to directors
In this notebook I will match biographies to the directors who sat on the boards of the S&P 500 companies between 2011 and 2015. These directors were researched on Refinitiv Eikon, and the relevant directors were then identified in the notebook `director_company_data`. The biographies used in this notebook were taken from S&P Capital IQ. Because about 4600 director biographies are missing from the S&P 500 Capital IQ biography dataset, I wrote the entire list of relevant directors (with and without biographies) to an Excel file. For all of the directors included in this file (~7500), I manually researched the missing biographies from DEF 14As. At the same time, I researched the committee memberships and relevant CSR committees in these DEF 14As for each company.

In [55]:
# connecting to Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [56]:
import numpy as np
import pandas as pd

from glob import glob
import re
from functools import reduce


In [57]:
# set number of max rows
pd.set_option('display.max_rows', 13000)


## Reading in data

In [58]:
# read in the csv file containing all directors in my dataset
all_directors_df = pd.read_csv('/content/drive/My Drive/director-csr/all_directors.csv')
# drop the 'Unnamed: 0' column
all_directors_df.drop(columns='Unnamed: 0', inplace=True)

all_directors_df.head()


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin
0,ms. christina gold,72.0,independent director,0.0,0.0,1997.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089
1,mr. frank macinnis,72.0,independent chairman of the board,2011.0,2020.0,2001.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089
2,ms. denise ramos,63.0,"president, chief executive officer, director",2011.0,2019.0,2011.0,2019.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089
3,ms. karen larue,40.0,"controller, executive director",0.0,2018.0,0.0,2018.0,itt inc,itt,1.0,0.0,0.0,0.0,0.0,0.0,,,,us45073v1089
4,mr. g. peter d'aloia,,independent director,0.0,0.0,0.0,2017.0,itt inc,itt,1.0,0.0,0.0,0.0,0.0,0.0,,,,us45073v1089


In [59]:
# read in the excel files containing the biographies
all_files = glob('/content/drive/My Drive/director-csr/directors/*.xls')

list_df = []

for file in all_files:
    df_file = pd.read_excel(file, skiprows=7) # skipping the first 7 rows above the header
    list_df.append(df_file)
    
biographies_df = pd.concat(list_df, axis=0, ignore_index=True)

biographies_df.head()


Unnamed: 0,Person Name,Company Name [Any Professional Record] [Current Matching Results],Exchange:Ticker,Email Address,Professional Titles [Any Professional Record] [Current Matching Results],Colleges/Universities,Degrees,Graduation Year,Majors,Geographic Locations [Any Professional Record] [Current Matching Results],Primary Professional Record,Biographies,Person Locations [Any Professional Record] [Current Matching Results],Person Age,Person Name First,Person Name Last,Person Name Middle,Person Name Nickname,Person Name Prefix,Person Name Suffix,Person Notes,Specialties [Any Professional Record] [Current Matching Results],Year Born,CIK [Any Professional Record] [Current Matching Results],Company CUSIP [Any Professional Record] [Current Matching Results],Primary ISIN [Any Professional Record] [Current Matching Results],Security Tickers [Any Professional Record] [Current Matching Results],SIC Codes (Primary) [Any Professional Record] [Current Matching Results],Company Type [Any Professional Record] [Current Matching Results],Professional Job Functions [Any Professional Record] [Current Matching Results]
0,"Schwarzman, Stephen Allen (Prior Board)",PJT Partners Inc. (NYSE:PJT),NYSE:PJT,Schwarzman@blackstone.com,Former Chairman and Chief Executive Officer,Harvard Business School; Yale University; Quin...,Harvard Business School - MBA; Yale University...,Quinnipiac University (2012),-,United States and Canada (Primary),The Blackstone Group L.P. (NYSE:BX) (Board),"Mr. Stephen Allen Schwarzman, also known as St...",United States of America; Northeast; New York;...,68,Stephen,Schwarzman,Allen,Steve,Mr.,-,,-,1947,0001626115,69343T,US69343T1079,NYSE:PJT; BST:1PJ; DB:1PJ,6282 Investment advice,Public Company,Chief Executive Officer (Prior)
1,"Bovender, Jack O. (Prior Board)","HCA Holdings, Inc. (NYSE:HCA)",NYSE:HCA,-,Former Executive Chairman and Chairman of Exec...,Duke University,Duke University - Bachelor's Degree; Duke Univ...,Duke University (1967),Duke University - Psychology,United States and Canada (Primary),Duke University (Board),"Mr. Jack O. Bovender, Jr., served as the Chair...",United States of America; Southeast; Tennessee...,70,Jack,Bovender,O.,-,Mr.,Jr.,,-,1945,0000311314; 0000732872; 0000860730; 0001392778,40412C,US40412C1018,NYSE:HCA; BAYB:2BH,8062 General medical and surgical hospitals,Public Company,Chief Executive Officer (Prior)
2,"Mandaric, Milan (Prior Board)",Elexsys International,-,,Former Chairman of the Board and Chief Executi...,-,-,-,-,United States and Canada (Primary),"MM Holdings International, Inc. (Board)",Mr. Milan Mandaric serves as Chief Executive O...,United States of America; California; West Coa...,77,Milan,Mandaric,-,-,Mr.,-,,-,1938,0000727010,28626C,-,-,3672 Printed circuit boards,Public Company,Chief Executive Officer (Prior)
3,"Childs, John W. (Prior Board)",JWC Acquisition Corp.,-,jchilds@jwchilds.com,Chairman and Chief Executive Officer,Yale University; Columbia University,Yale University - BA; Columbia University - MBA,-,-,United States and Canada (Primary),"J.W. Childs Associates, L.P. (Board)",Mr. John W. Childs serves as the Chairman and ...,United States of America; Northeast; Massachus...,73,John,Childs,W.,-,Mr.,-,,-,1942,0001498157,46634Y,US46634Y1029,-,9995 Non-operating establishments,Public Company,Chief Executive Officer (Prior)
4,"Vota, John P. (Prior)","Insight Management Corporation, Prior to Rever...",-,-,Former Interim Chief Executive Officer and Int...,Columbia University; Fordham University; Schoo...,Columbia University - Bachelor's Degree; Fordh...,-,-,United States and Canada (Primary),Blackbird Capital Partners,Mr. John P. Vota serves as a Managing Partner ...,United States of America; Northeast; New York;...,76,John,Vota,P.,-,Mr.,-,,-,1939,-,45776Q,US45776Q3074,-,-,Public Company,Chief Executive Officer (Prior)


## Data Cleaning and Preprocessing

In [60]:
# rename some of the columns and drop others
new_columns = ['name', 'first_name', 'middle_name', 'nick_name', 'last_name', 'comp_name', 'ticker', 'education', 'prim_comp', 'biographies', 'age', 'isin', 'all_tickers']
old_columns = ['Person Name', 'Person Name First', 'Person Name Middle', 'Person Name Nickname', 'Person Name Last', 'Company Name [Any Professional Record] [Current Matching Results]', 
               'Exchange:Ticker', 'Colleges/Universities', 'Primary Professional Record', 'Biographies', 'Person Age', 
               'Primary ISIN [Any Professional Record] [Current Matching Results]', 'Security Tickers [Any Professional Record] [Current Matching Results]']
biographies_df = biographies_df[old_columns]
biographies_df.rename(columns=dict(zip(old_columns, new_columns)), inplace=True)
biographies_df


Unnamed: 0,name,first_name,middle_name,nick_name,last_name,comp_name,ticker,education,prim_comp,biographies,age,isin,all_tickers
0,"Schwarzman, Stephen Allen (Prior Board)",Stephen,Allen,Steve,Schwarzman,PJT Partners Inc. (NYSE:PJT),NYSE:PJT,Harvard Business School; Yale University; Quin...,The Blackstone Group L.P. (NYSE:BX) (Board),"Mr. Stephen Allen Schwarzman, also known as St...",68,US69343T1079,NYSE:PJT; BST:1PJ; DB:1PJ
1,"Bovender, Jack O. (Prior Board)",Jack,O.,-,Bovender,"HCA Holdings, Inc. (NYSE:HCA)",NYSE:HCA,Duke University,Duke University (Board),"Mr. Jack O. Bovender, Jr., served as the Chair...",70,US40412C1018,NYSE:HCA; BAYB:2BH
2,"Mandaric, Milan (Prior Board)",Milan,-,-,Mandaric,Elexsys International,-,-,"MM Holdings International, Inc. (Board)",Mr. Milan Mandaric serves as Chief Executive O...,77,-,-
3,"Childs, John W. (Prior Board)",John,W.,-,Childs,JWC Acquisition Corp.,-,Yale University; Columbia University,"J.W. Childs Associates, L.P. (Board)",Mr. John W. Childs serves as the Chairman and ...,73,US46634Y1029,-
4,"Vota, John P. (Prior)",John,P.,-,Vota,"Insight Management Corporation, Prior to Rever...",-,Columbia University; Fordham University; Schoo...,Blackbird Capital Partners,Mr. John P. Vota serves as a Managing Partner ...,76,US45776Q3074,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...
38822,"Falcone, Philip A. (Prior)",Philip,A.,-,Falcone,Fidelity & Guaranty Life (NYSE:FGL),NYSE:FGL,-,Fidelity & Guaranty Life (NYSE:FGL) (Prior),Mr. Philip A. Falcone served as the Chief Exec...,-,US3157851052,NYSE:FGL; DB:FGY
38823,"Blum, Erik (Board)",Erik,-,-,Blum,Golden Global Corp. (OTCPK:GLDG),OTCPK:GLDG,-,Golden Global Corp. (OTCPK:GLDG) (Board),Mr. Erik Blum has been Chief Executive Officer...,48,US3810572079,OTCPK:GLDG
38824,"Kelly, Christopher S.",Christopher,S.,-,Kelly,"Carbon Sciences, Inc. (OTCPK:CABN)",OTCPK:CABN,University of Denver College of Law,"Carbon Sciences, Inc. (OTCPK:CABN)",Mr. Christopher S. Kelly has been the Chief Ex...,-,US14115L2051,OTCPK:CABN
38825,"Louth, Christy (Board)",Christy,-,-,Louth,Golden Secret Ventures Ltd. (TSXV:GGS),TSXV:GGS,-,Golden Secret Ventures Ltd. (TSXV:GGS) (Board),Christy Louth has been the Chief Executive Off...,-,CA38117P1045,TSXV:GGS


In this biographies_df dataframe, I am really only interested in the biographies. I will keep the other columns as well because education, age, etc. may help to distinguish people with similar names. I will only drop fully duplicated rows and keep duplicate biographies: the biography text may be the same, but a different ISIN or company name may be listed for the same person, which might be valuable for the matching later on.

In [61]:
# drop any rows that are duplicate
print(biographies_df.shape)
biographies_df.drop_duplicates(inplace=True)
print(biographies_df.shape)


(38827, 13)
(38827, 13)
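
The difference between dropping duplicate rows and dropping duplicate biographies can be seen in a minimal sketch on made-up data. The name, ISINs, and biography text below are hypothetical and only serve the illustration.

```python
import pandas as pd

# toy data (hypothetical): one biography text listed under two different ISINs,
# with one row repeated verbatim
toy_bios = pd.DataFrame({
    'name': ['doe, jane', 'doe, jane', 'doe, jane'],
    'isin': ['US0000000001', 'US0000000002', 'US0000000002'],
    'biographies': ['Ms. Jane Doe serves as ...'] * 3,
})

# dropping fully identical rows keeps one row per (name, isin, biography)
rows_deduped = toy_bios.drop_duplicates()

# deduplicating on the biography text alone would also discard the second ISIN
bios_deduped = toy_bios.drop_duplicates(subset='biographies')

print(rows_deduped.shape)
print(bios_deduped.shape)
```

Keeping the row-level version preserves both ISINs, which is what the later merges on `isin` rely on.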


In [62]:
# create a copy of the uncleaned biographies_df
biographies_old = biographies_df.copy()
biographies_old['name'] = biographies_old['name'].apply(lambda x: x.lower() if not pd.isna(x) else x)


In [63]:
def clean_names(df, bio=True):
    '''
    Function to clean up the director names so that they can be matched
    Args: df = dataframe; containing director names
          bio = bool; True if the biographies dataframe is added, False otherwise 
    Returns: df = dataframe
    '''

    df['org_name'] = df.iloc[:,0]

    # change the strings to lower case
    df.iloc[:,0] = df.iloc[:,0].apply(lambda x: x.lower())

    # check if the names contain anything in parentheses and if so remove them and their content
    df.iloc[:,0] = df.iloc[:,0].apply(lambda x: re.sub(r'\([^()]*\)', '', x))

    # check if the names contain a title like ms. and mr. and if so remove them
    df.iloc[:,0] = df.iloc[:,0].apply(lambda x: re.sub(r'^\w{2,3}\. ?', '', x))

    # do two different things with the commas for the different dataframes
    if bio:
        # move the last name in the front of the comma to the back of the string and remove the comma
        df.iloc[:,0] = df.iloc[:,0].apply(lambda x: ' '.join([x.split(',')[1], x.split(',')[0]]))
    else:
        # create a new column that contains all the words after a comma at the end
        df['qualification'] = df.iloc[:,0].apply(lambda x: x.split(',')[-1] if len(x.split(',')) > 1 else None)
        df.iloc[:,0] = df.iloc[:,0].apply(lambda x: x.split(',')[0])

    # remove any initials or titles because they might be distracting when matching names
    df.iloc[:,0] = df.iloc[:,0].apply(lambda x: ' '.join([name if '.' not in name else '' for name in x.split()]))

    # remove a leading 'the' from names
    df.iloc[:,0] = df.iloc[:,0].apply(lambda x: re.sub(r'^the\s', '', x))

    # ensure that all white space is stripped
    df.iloc[:,0] = df.iloc[:,0].apply(lambda x: re.sub(' +', ' ', x).strip())

    return df
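
To illustrate what these cleaning steps do to a single name, the sketch below applies the same sequence of operations to one string. `clean_single_name` is a hypothetical helper written only for this illustration and is not part of the pipeline; the last input is a made-up name.

```python
import re

def clean_single_name(raw, bio=True):
    """Apply the steps of clean_names to one string (illustrative helper)."""
    name = raw.lower()
    # drop anything in parentheses, e.g. "(prior board)"
    name = re.sub(r'\([^()]*\)', '', name)
    # drop a leading title such as "mr. " or "ms. "
    name = re.sub(r'^\w{2,3}\. ?', '', name)
    if bio:
        # "last, first middle" -> "first middle last"
        parts = name.split(',')
        name = ' '.join([parts[1], parts[0]])
    else:
        # keep only the part before the first comma
        name = name.split(',')[0]
    # blank out every token that contains a period (initials, titles)
    name = ' '.join([tok if '.' not in tok else '' for tok in name.split()])
    name = re.sub(r'^the\s', '', name)
    return re.sub(' +', ' ', name).strip()

print(clean_single_name('Schwarzman, Stephen Allen (Prior Board)', bio=True))
print(clean_single_name('Bovender, Jack O. (Prior Board)', bio=True))
print(clean_single_name("mr. g. peter d'aloia", bio=False))
# made-up name consisting only of a title and initials
print(repr(clean_single_name('mr. j. r.', bio=False)))
```

The last call returns an empty string: when every remaining token contains a period, nothing is left of the name, so such entries have to be restored by hand.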


In [64]:
# clean both dataframes
all_directors_clean = clean_names(all_directors_df, bio=False)
biographies_old_clean = clean_names(biographies_df, bio=True)


In [65]:
# I need to manually re-add names that got lost in the cleaning function
index = all_directors_clean[all_directors_clean['name'] == ''].index
new_names = ['david owen', 'donald riegle', 'thomas niles']
for i in range(len(index)):
    all_directors_clean.at[index[i], 'name'] = new_names[i]
all_directors_clean[all_directors_clean['name'] == '']


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,qualification


In [66]:
# add last name column to dataframe
all_directors_clean['last_name'] = all_directors_clean['name'].apply(lambda x: x.split(' ')[-1])
# sort_values returns a sorted copy; the result is not assigned here, so the row order stays unchanged
all_directors_clean.sort_values(by='name', ascending=True)
all_directors_clean.head()


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,qualification,last_name
0,christina gold,72.0,independent director,0.0,0.0,1997.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. christina gold,,gold
1,frank macinnis,72.0,independent chairman of the board,2011.0,2020.0,2001.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,mr. frank macinnis,,macinnis
2,denise ramos,63.0,"president, chief executive officer, director",2011.0,2019.0,2011.0,2019.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. denise ramos,,ramos
3,karen larue,40.0,"controller, executive director",0.0,2018.0,0.0,2018.0,itt inc,itt,1.0,0.0,0.0,0.0,0.0,0.0,,,,us45073v1089,ms. karen larue,,larue
4,peter d'aloia,,independent director,0.0,0.0,0.0,2017.0,itt inc,itt,1.0,0.0,0.0,0.0,0.0,0.0,,,,us45073v1089,mr. g. peter d'aloia,,d'aloia


I will do some additional data cleaning to try to match more biographies to directors.

In [67]:
# additional data cleaning to merge more directors with biographies
biographies_clean = biographies_old_clean.copy()
# lower-case every string cell (this already covers the isin and last_name columns)
biographies_clean = biographies_clean.applymap(lambda x: x.lower() if isinstance(x, str) else x)
# keep only the symbol after the exchange prefix, e.g. 'nyse:pjt' -> 'pjt'
biographies_clean['ticker'] = biographies_clean['ticker'].apply(lambda x: x.split(':')[-1] if not pd.isna(x) else x)
# split the ticker string into a list of symbols (assumes the cell is always a string, '-' when missing)
biographies_clean['all_tickers'] = biographies_clean['all_tickers'].apply(lambda x: [ticker.split(':')[-1].lower() for ticker in x.split(';')])
# remove anything in parentheses from the company names
biographies_clean['comp_name'] = biographies_clean['comp_name'].apply(lambda x: re.sub(r'\([^()]*\)', '', x))
biographies_clean['prim_comp'] = biographies_clean['prim_comp'].apply(lambda x: re.sub(r'\([^()]*\)', '', x))
# now add the ticker value to the all_tickers list
biographies_clean['all_tickers'] = biographies_clean.apply(lambda x: x['all_tickers'] + [x['ticker']] if x['all_tickers'][0] != '-' else x['all_tickers'], axis=1)

# drop any biographies that are not populated
biographies_clean = biographies_clean[biographies_clean['biographies'] != '-']

biographies_clean.head()


Unnamed: 0,name,first_name,middle_name,nick_name,last_name,comp_name,ticker,education,prim_comp,biographies,age,isin,all_tickers,org_name
0,stephen allen schwarzman,stephen,allen,steve,schwarzman,pjt partners inc.,pjt,harvard business school; yale university; quin...,the blackstone group l.p.,"mr. stephen allen schwarzman, also known as st...",68,us69343t1079,"[pjt, 1pj, 1pj, pjt]","schwarzman, stephen allen (prior board)"
1,jack bovender,jack,o.,-,bovender,"hca holdings, inc.",hca,duke university,duke university,"mr. jack o. bovender, jr., served as the chair...",70,us40412c1018,"[hca, 2bh, hca]","bovender, jack o. (prior board)"
2,milan mandaric,milan,-,-,mandaric,elexsys international,-,-,"mm holdings international, inc.",mr. milan mandaric serves as chief executive o...,77,-,[-],"mandaric, milan (prior board)"
3,john childs,john,w.,-,childs,jwc acquisition corp.,-,yale university; columbia university,"j.w. childs associates, l.p.",mr. john w. childs serves as the chairman and ...,73,us46634y1029,[-],"childs, john w. (prior board)"
4,john vota,john,p.,-,vota,"insight management corporation, prior to rever...",-,columbia university; fordham university; schoo...,blackbird capital partners,mr. john p. vota serves as a managing partner ...,76,us45776q3074,[-],"vota, john p. (prior)"
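
As a small illustration of how the `all_tickers` lists in the output above come about, the sketch below processes a single pair of raw values. `normalize_tickers` is a hypothetical helper used only here; the input strings are taken from the first row of the raw data.

```python
def normalize_tickers(all_tickers, ticker):
    """Split 'EXCHANGE:SYMBOL; ...' into lower-case symbols and append the primary ticker."""
    # keep only the part after the exchange prefix of every entry
    symbols = [t.split(':')[-1].strip().lower() for t in all_tickers.split(';')]
    primary = ticker.split(':')[-1].lower()
    # '-' marks a missing ticker list, in which case nothing is appended
    if symbols[0] != '-':
        symbols.append(primary)
    return symbols

print(normalize_tickers('NYSE:PJT; BST:1PJ; DB:1PJ', 'NYSE:PJT'))
print(normalize_tickers('-', '-'))
```

The primary ticker ends up in the list twice when it is already part of `all_tickers`. This does no harm because the list is only used for membership checks.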


To distinguish the directors and avoid duplicate effort (researching or reviewing the biography of the same director twice because they sit on the boards of multiple companies), I will add a unique director ID based on the org_name (as displayed on the Reuters terminal) to the dataframe containing all directors. I will also add a unique ID to the biographies_clean dataframe, based on the biography text, so that identical biographies share one ID.

In [68]:
# assign unique director id based on org_name
all_directors_clean['unique_dir_id'] = all_directors_clean.groupby(['org_name']).ngroup()
all_directors_clean.head()


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,qualification,last_name,unique_dir_id
0,christina gold,72.0,independent director,0.0,0.0,1997.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. christina gold,,gold,7917
1,frank macinnis,72.0,independent chairman of the board,2011.0,2020.0,2001.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,mr. frank macinnis,,macinnis,3325
2,denise ramos,63.0,"president, chief executive officer, director",2011.0,2019.0,2011.0,2019.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. denise ramos,,ramos,7996
3,karen larue,40.0,"controller, executive director",0.0,2018.0,0.0,2018.0,itt inc,itt,1.0,0.0,0.0,0.0,0.0,0.0,,,,us45073v1089,ms. karen larue,,larue,8250
4,peter d'aloia,,independent director,0.0,0.0,0.0,2017.0,itt inc,itt,1.0,0.0,0.0,0.0,0.0,0.0,,,,us45073v1089,mr. g. peter d'aloia,,d'aloia,3393


In [69]:
# assign unique biography id based on the biography text
biographies_clean['unique_bio_id'] = biographies_clean.groupby(['biographies']).ngroup()
biographies_clean.head()


Unnamed: 0,name,first_name,middle_name,nick_name,last_name,comp_name,ticker,education,prim_comp,biographies,age,isin,all_tickers,org_name,unique_bio_id
0,stephen allen schwarzman,stephen,allen,steve,schwarzman,pjt partners inc.,pjt,harvard business school; yale university; quin...,the blackstone group l.p.,"mr. stephen allen schwarzman, also known as st...",68,us69343t1079,"[pjt, 1pj, 1pj, pjt]","schwarzman, stephen allen (prior board)",26212
1,jack bovender,jack,o.,-,bovender,"hca holdings, inc.",hca,duke university,duke university,"mr. jack o. bovender, jr., served as the chair...",70,us40412c1018,"[hca, 2bh, hca]","bovender, jack o. (prior board)",14284
2,milan mandaric,milan,-,-,mandaric,elexsys international,-,-,"mm holdings international, inc.",mr. milan mandaric serves as chief executive o...,77,-,[-],"mandaric, milan (prior board)",21079
3,john childs,john,w.,-,childs,jwc acquisition corp.,-,yale university; columbia university,"j.w. childs associates, l.p.",mr. john w. childs serves as the chairman and ...,73,us46634y1029,[-],"childs, john w. (prior board)",16996
4,john vota,john,p.,-,vota,"insight management corporation, prior to rever...",-,columbia university; fordham university; schoo...,blackbird capital partners,mr. john p. vota serves as a managing partner ...,76,us45776q3074,[-],"vota, john p. (prior)",16758
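
How `groupby(...).ngroup()` produces these IDs can be seen on a toy dataframe. This is a minimal sketch with made-up directors and companies.

```python
import pandas as pd

# toy data (hypothetical): one director sits on the boards of two companies
toy_dirs = pd.DataFrame({
    'org_name': ['ms. jane doe', 'mr. john roe', 'ms. jane doe'],
    'comp_name': ['company a', 'company a', 'company b'],
})

# every distinct org_name gets one integer, numbered in sorted order of the key
toy_dirs['unique_dir_id'] = toy_dirs.groupby(['org_name']).ngroup()

print(toy_dirs)
```

Both rows of the same director receive the same ID. The flip side is that two different people with an identical `org_name` would also share an ID, so the ID is only as unique as the name string.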


In [70]:
# drop any directors that do not have any of the years populated ('yes' in all_years means all five year columns are 0)
all_directors_clean['all_years'] = all_directors_clean.apply(lambda x: 'yes' if (x[['2011', '2012', '2013', '2014', '2015']] == 0).all() else 'no', axis=1)
all_directors_rel = all_directors_clean[all_directors_clean['all_years'] == 'no'].copy()

print(all_directors_rel.shape)
print(all_directors_clean.shape)


(6888, 24)
(12466, 24)
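
The same filter can also be written without a row-wise `apply`, which tends to be faster on larger dataframes. The following is a sketch on toy data, assuming the year columns contain only 0 and 1 as in the output above.

```python
import pandas as pd

year_cols = ['2011', '2012', '2013', '2014', '2015']

# toy data (hypothetical): the first director served in 2011 and 2012, the second never
toy_years = pd.DataFrame([[1.0, 1.0, 0.0, 0.0, 0.0],
                          [0.0, 0.0, 0.0, 0.0, 0.0]], columns=year_cols)

# True for rows in which every year column equals 0
no_years = (toy_years[year_cols] == 0).all(axis=1)

# keep only directors with at least one populated year
toy_rel = toy_years[~no_years].copy()

print(no_years.tolist())
print(toy_rel.shape)
```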


In [71]:
# manual name replacements to connect the right entries
# index may hold several rows, so use .loc rather than the scalar accessor .at
index = biographies_clean[biographies_clean['name'] == 'ahmet kent'].index
biographies_clean.loc[index, 'name'] = 'muhtar kent'

index = all_directors_rel[all_directors_rel['name'] == 'ahmet kent'].index
all_directors_rel.loc[index, 'name'] = 'muhtar kent'
all_directors_rel.loc[index, 'org_name'] = 'mr. muhtar kent'


In [72]:
# write all relevant directors to a csv file
all_directors_rel.to_csv('/content/drive/My Drive/director-csr/all_directors_rel.csv')


In [73]:
# this many unique directors are contained in my dataset
print(all_directors_rel['unique_dir_id'].nunique())

# keep only the unique_dir_id column
names_directors = all_directors_rel[['unique_dir_id']].copy()


5500


In [74]:
# show any duplicate biography entries in the biographies_clean dataframe
dupe_bios_df = biographies_clean[biographies_clean.duplicated(subset='biographies', keep=False)].copy()
print(dupe_bios_df.shape)
dupe_bios_df.head()


(10900, 15)


Unnamed: 0,name,first_name,middle_name,nick_name,last_name,comp_name,ticker,education,prim_comp,biographies,age,isin,all_tickers,org_name,unique_bio_id
7,alan donenfeld,alan,p.,-,donenfeld,"paneltech international holdings, inc.",pnlt,the fuqua school of business; tufts university,"bristol investment group, inc.",mr. alan p. donenfeld serves as the managing m...,58,us69841h1005,"[pnlt, pnlt]","donenfeld, alan p. (prior board)",5911
8,alan donenfeld,alan,p.,-,donenfeld,timberjack sporting supplies inc.,-,the fuqua school of business; tufts university,"bristol investment group, inc.",mr. alan p. donenfeld serves as the managing m...,58,us88708t1060,[-],"donenfeld, alan p. (prior board)",5911
9,alan donenfeld,alan,p.,-,donenfeld,"charleston basics, inc.",-,the fuqua school of business; tufts university,"bristol investment group, inc.",mr. alan p. donenfeld serves as the managing m...,58,us1600561070,[-],"donenfeld, alan p. (prior board)",5911
10,mark meller,mark,-,-,meller,wien group inc.,-,baruch college; state university of new york a...,"silversun technologies, inc.",mr. mark meller has been president of silversu...,55,us9676361010,[-],"meller, mark (prior board)",19595
11,mark meller,mark,-,-,meller,"silversun technologies, inc., prior to reverse...",-,baruch college; state university of new york a...,"silversun technologies, inc.",mr. mark meller has been president of silversu...,55,us82846h1086,[-],"meller, mark (prior board)",19595


In [75]:
# check whether there are any instances where the bio is the same but the name is different
unique_names_df = dupe_bios_df.groupby('unique_bio_id')['name'].apply(lambda x: x.unique()).reset_index()
unique_names_df['multiple_names'] = unique_names_df['name'].apply(lambda x: 'yes' if len(x) > 1 else 'no')
issue_df = unique_names_df[unique_names_df['multiple_names'] == 'yes']
print(issue_df.shape)


(0, 3)


I will keep these duplicates and include them when matching the directors with the biographies. I can then later remove any duplicates.

## Biography matching
Unfortunately, the biography and director data do not match well. I can only use the names of the directors to merge the two datasets, but the way the names are written differs greatly: middle names or nicknames are sometimes included and sometimes not, and the spelling can differ between the two datasets for the same person. Therefore, I will use several steps to merge the datasets. After each step, I will do a rough manual scan to check whether the method works and then move the correctly matched pairs to a separate dataframe. With the rest, I will do another merging step and repeat this process multiple times.

I hope this method will save me some time that I would otherwise have to invest in gathering ~4000 bios manually from DEF 14As.

In [76]:
def review_and_match(df_dir, df_bio, merge_on, sanity='ticker'):
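    '''
    Function to merge directors and biographies, run a sanity check, and
    split the result into probable matches and matches needing manual review
    Args: df_dir = df; directors
          df_bio = df; biographies
          merge_on = list; column names to merge on
          sanity = str or None; 'ticker' or 'age' selects the sanity check,
                   None flags only duplicate matches on merge_on
    Returns: df_matched = df; matches of directors with no row flagged for review
             df_review = df; rows failing the sanity check (or, if sanity is
                         None, rows with duplicate merge_on values)
    '''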

    # merge both dataframes on specified columns
    df_step = pd.merge(df_dir, df_bio, how='inner', on=merge_on, suffixes=['_dir', '_bio'])

    # assign column signifying how this director bio match was created
    df_step['_'.join(merge_on)+'_match'] = 1

    # check whether this list contains any duplicates
    df_review = df_step[df_step.duplicated(subset=merge_on, keep=False)].copy()

    # do the sanity check based on the entered value
    if sanity == 'ticker':
        df_step['ticker_match'] = df_step.apply(lambda x: int(x['ticker_dir'] in x['all_tickers']), axis=1)
        # rows failing the ticker check replace the duplicate-based df_review
        df_review = df_step[df_step['ticker_match'] == 0].copy()

    elif sanity == 'age':
        df_step['age_dir'] = df_step['age_dir'].apply(lambda x: 0 if pd.isna(x) else x)
        df_step['age_bio'] = df_step['age_bio'].apply(lambda x: 0 if x == '-' else x)
        df_step['age_match'] = df_step.apply(lambda x: 1 if abs(x['age_dir'] - x['age_bio']) 
                                                            <= 5 else 0, axis=1)
        # rows failing the age check replace the duplicate-based df_review
        df_review = df_step[df_step['age_match'] == 0].copy()

    # remove every director with at least one flagged row from the merged dataframe
    indices = df_review['unique_dir_id']
    df_step = df_step[~df_step['unique_dir_id'].isin(indices)]

    # define the dataframe with all the probably correct matches
    df_matched = df_step.copy()

    return df_matched, df_review
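
The two sanity checks inside `review_and_match` can be looked at in isolation. The sketch below re-implements them for single values; `ticker_check` and `age_check` are hypothetical helper names, and the `tolerance` parameter stands in for the hard-coded 5 years.

```python
import math

def ticker_check(ticker_dir, all_tickers):
    """Return 1 if the director's ticker is among the biography's tickers."""
    return int(ticker_dir in all_tickers)

def age_check(age_dir, age_bio, tolerance=5):
    """Return 1 if both ages differ by at most `tolerance` years."""
    # missing ages are encoded as NaN (directors) or '-' (biographies) and set to 0
    if isinstance(age_dir, float) and math.isnan(age_dir):
        age_dir = 0
    if age_bio == '-':
        age_bio = 0
    return 1 if abs(age_dir - age_bio) <= tolerance else 0

print(ticker_check('pjt', ['pjt', '1pj', '1pj', 'pjt']))  # 1
print(age_check(72.0, 70))                                # 1
print(age_check(float('nan'), 70))                        # 0
print(age_check(float('nan'), '-'))                       # 1
```

The last call shows a side effect of encoding missing ages as 0: if the age is missing in both datasets, the difference is 0 and the pair passes the age check even though no age was actually compared.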


### Merging on name and isin

I will check the results against the tickers and review any matches whose tickers do not agree.


In [77]:
# match both datasets on name and isin
df_matched_name_isin, df_review_name_isin = review_and_match(all_directors_rel, biographies_clean, ['name', 'isin'], sanity='ticker')

# this many matches were generated
print(df_matched_name_isin.shape)
# this many samples need to be reviewed
print(df_review_name_isin.shape)


(501, 39)
(8, 39)
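
To make the split into matched and review rows more tangible, here is a minimal sketch of the same flow on two made-up directors. The names, ISINs, and tickers are hypothetical, and the code only mirrors the ticker branch of `review_and_match`.

```python
import pandas as pd

# toy directors and biographies (hypothetical values)
toy_dirs = pd.DataFrame({
    'name': ['jane doe', 'john roe'],
    'isin': ['us0000000001', 'us0000000002'],
    'ticker': ['aaa', 'bbb'],
    'unique_dir_id': [0, 1],
})
toy_bios = pd.DataFrame({
    'name': ['jane doe', 'john roe'],
    'isin': ['us0000000001', 'us0000000002'],
    'all_tickers': [['aaa'], ['zzz']],
    'unique_bio_id': [10, 11],
})

# inner merge on name and isin
merged = pd.merge(toy_dirs, toy_bios, how='inner', on=['name', 'isin'])

# sanity check: is the director's ticker among the biography's tickers?
merged['ticker_match'] = merged.apply(lambda r: int(r['ticker'] in r['all_tickers']), axis=1)

# rows failing the check go to review; their directors are removed from the matches
toy_review = merged[merged['ticker_match'] == 0].copy()
toy_matched = merged[~merged['unique_dir_id'].isin(toy_review['unique_dir_id'])].copy()

print(toy_matched[['name', 'unique_dir_id', 'unique_bio_id']])
print(toy_review[['name', 'unique_dir_id', 'unique_bio_id']])
```

Here `jane doe` ends up in the matched dataframe, while `john roe` is flagged for review because the tickers do not agree.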


### Merging on last_name and isin

In [78]:
# match both datasets on last_name and isin
df_matched_last_name_isin, df_review_last_name_isin = review_and_match(all_directors_rel, biographies_clean, ['last_name', 'isin'], sanity='ticker')

# this many matches were generated
print(df_matched_last_name_isin.shape)
# this many samples need to be reviewed
print(df_review_last_name_isin.shape)


(582, 39)
(10, 39)


### Merging on last_name and ticker

In [79]:
# match both datasets on last_name and ticker
df_matched_last_name_ticker, df_review_last_name_ticker = review_and_match(all_directors_rel, biographies_clean, ['last_name', 'ticker'], sanity=None)

# this many matches were generated
print(df_matched_last_name_ticker.shape)
# this many samples need to be reviewed
print(df_review_last_name_ticker.shape)


(544, 38)
(49, 38)


### Merging on name with ticker sanity check

In [80]:
# match both datasets on name and do ticker sanity check
df_matched_name, df_review_name = review_and_match(all_directors_rel, biographies_clean, ['name'], sanity='ticker')

# this many matches were generated
print(df_matched_name.shape)
# this many samples need to be reviewed
print(df_review_name.shape)


(238, 40)
(2381, 40)


### Merging on name with age sanity check



In [81]:
# match both datasets on name and do age sanity check
df_matched_name_age, df_review_name_age = review_and_match(all_directors_rel, biographies_clean, ['name'], sanity='age')

# this many matches were generated
print(df_matched_name_age.shape)
# this many samples need to be reviewed
print(df_review_name_age.shape)


(1676, 40)
(910, 40)


### Merging only on name
I will not add the results of this merge to the other dataframes. Instead, I will check which matches are already included in the matched and review dataframes and add the remaining ones to the review dataframe.

In [82]:
# match both datasets on name only, without a sanity check
df_matched_name_none, df_review_name_none = review_and_match(all_directors_rel, biographies_clean, ['name'], sanity=None)

# this many matches were generated
print(df_matched_name_none.shape)
# this many samples need to be reviewed
print(df_review_name_none.shape)

(772, 39)
(2122, 39)


### Putting the individual matching dataframes together

In [83]:
# drop the columns that are not in every dataframe
same_cols = list(set(df_matched_name_isin) & set(df_matched_last_name_isin) & set(df_matched_name) & set(df_matched_name_age) & set(df_matched_last_name_ticker))
df_matched_name_isin = df_matched_name_isin[same_cols].copy()
df_matched_last_name_isin = df_matched_last_name_isin[same_cols].copy()
df_matched_name = df_matched_name[same_cols].copy()
df_matched_name_age = df_matched_name_age[same_cols].copy()
df_matched_last_name_ticker = df_matched_last_name_ticker[same_cols].copy()
df_matched_name_none = df_matched_name_none[same_cols].copy()

same_cols = list(set(df_review_name_isin) & set(df_review_last_name_isin) & set(df_review_name) & set(df_review_name_age) & set(df_review_last_name_ticker))
df_review_name_isin = df_review_name_isin[same_cols].copy()
df_review_last_name_isin = df_review_last_name_isin[same_cols].copy()
df_review_name = df_review_name[same_cols].copy()
df_review_name_age = df_review_name_age[same_cols].copy()
df_review_last_name_ticker = df_review_last_name_ticker[same_cols].copy()
df_review_name_none = df_review_name_none[same_cols].copy()


In [84]:
# concatenate all matched dataframes
df_matched_all = pd.concat([df_matched_name, df_matched_last_name_isin, df_matched_name_isin, df_matched_name_age, df_matched_last_name_ticker])
df_matched_all.drop_duplicates(subset=['unique_dir_id', 'unique_bio_id'], inplace=True)
print(df_matched_all.shape)


(1223, 31)


I will check which rows of the df_matched_name_none dataframe are already included in the df_matched_all dataframe.


In [85]:
# which of the samples matched only on name are already included in the all matched dataframe
indices = df_matched_all['unique_dir_id']
df_manual_review = df_matched_name_none[~df_matched_name_none['unique_dir_id'].isin(indices)]

# add the entries not in the df_matched_all dataframe to the review dataframe
df_review_name_none = pd.concat([df_review_name_none, df_manual_review])
df_review_name_none.drop_duplicates(subset=['unique_dir_id', 'unique_bio_id'], inplace=True)
df_review_name_none.shape


(1168, 31)

In [86]:
# concatenate all review dataframes
df_review_all = pd.concat([df_review_name_isin, df_review_last_name_isin, df_review_name, df_review_name_age, df_review_last_name_ticker, df_review_name_none])
df_review_all.drop_duplicates(subset=['unique_dir_id', 'unique_bio_id'], inplace=True)


In [87]:
# remove any entries from the review data set that are already in the matched dataset
indices = df_matched_all['unique_dir_id']
df_review_all = df_review_all[~df_review_all['unique_dir_id'].isin(indices)]

df_review_all.shape


(615, 31)

### Manually reviewing the flagged matches in the review dataframe

In [88]:
# sort the dataframe by original director name
df_review_all.sort_values(by=['org_name_dir'], inplace=True)


In [89]:
# add ticker, isin, last_name, and name columns again
df_review_merged = pd.merge(all_directors_rel[['name', 'ticker', 'last_name', 'isin', 'unique_dir_id']], df_review_all, how='right', on='unique_dir_id', suffixes=['_dir', '_review'])
df_review_merged = pd.merge(biographies_clean[['name', 'ticker', 'last_name', 'isin', 'unique_bio_id']], df_review_merged, how='right', on='unique_bio_id', suffixes=['_bio', '_review'])


In [90]:
# rearrange the columns
new_order = ['name_review', 'last_name_review', 'org_name_dir', 'comp_name_dir', 'ticker_review', 
             'org_name_bio', 'comp_name_bio', 'prim_comp', 'age_bio', 'age_dir','biographies', 'unique_dir_id', 'unique_bio_id',
             'first_name', 'nick_name', 'middle_name', 'last_name_bio']
rest_cols = [x for x in df_review_merged.columns if x not in new_order]
all_cols = new_order + rest_cols
df_review_merged = df_review_merged[all_cols]


In [91]:
df_review_all.shape

(615, 31)

In [92]:
# write this dataframe to excel file for manual review
df_review_merged.to_excel('/content/drive/My Drive/director-csr/dir_bio_manual_review.xlsx',
                                sheet_name='review')


I started to review these 615 directors. However, most of them were not a match, so I will continue my manual research of the biographies.

## Writing the basic match on the name column to Excel

In [93]:
# check the old uncleaned biographies for name
biographies_old[biographies_old['name'].str.contains('schwarzman')]


Unnamed: 0,name,first_name,middle_name,nick_name,last_name,comp_name,ticker,education,prim_comp,biographies,age,isin,all_tickers
0,"schwarzman, stephen allen (prior board)",Stephen,Allen,Steve,Schwarzman,PJT Partners Inc. (NYSE:PJT),NYSE:PJT,Harvard Business School; Yale University; Quin...,The Blackstone Group L.P. (NYSE:BX) (Board),"Mr. Stephen Allen Schwarzman, also known as St...",68,US69343T1079,NYSE:PJT; BST:1PJ; DB:1PJ


In [94]:
# check the cleaned biographies for name
biographies_old_clean[biographies_old_clean['name'] == 'wayne daley']


Unnamed: 0,name,first_name,middle_name,nick_name,last_name,comp_name,ticker,education,prim_comp,biographies,age,isin,all_tickers,org_name
16019,wayne daley,Wayne,-,-,Daley,Cascade Mountain Mining Co. Inc.,-,-,BioStem Inc. (Prior Board),Wayne B. Daley served as President and Chief E...,51,US1473301042,-,"Daley, Wayne (Prior Board)"
16020,wayne daley,Wayne,-,-,Daley,BioStem Inc.,-,-,BioStem Inc. (Prior Board),Wayne B. Daley served as President and Chief E...,51,US48122Q1040,-,"Daley, Wayne (Prior Board)"


In [95]:
# merge both dataframes on the name column
dir_bio_df = pd.merge(all_directors_rel, biographies_old_clean, how='left', on='name')
print(dir_bio_df['biographies'].isnull().sum(), 'are entries without biographies')
print(dir_bio_df.shape)


4867 are entries without biographies
(7761, 37)


It is important to note that the entries in this merged dataframe, including the 4867 without biographies, do not necessarily correspond to unique directors. In the case of director 'thomas brown', for example, there are three different people with this name, each with a different biography. Because I cannot (easily) distinguish which of them belongs to a specific company, all three are added for each company, and I need to determine manually which biography is the correct one. I will do this while reviewing the DEF 14As for missing biographies and CSR committees.
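
To illustrate why a left merge on the name alone multiplies rows, here is a small sketch with invented data (the company names and biography texts are placeholders, not entries from the real dataset). It also shows one possible way to list the name and company pairs that end up with more than one candidate biography:

```python
import pandas as pd

# hypothetical toy data: one director name shared by three different people
directors = pd.DataFrame({'name': ['thomas brown', 'jane doe'],
                          'comp_name': ['company a', 'company b']})
bios = pd.DataFrame({'name': ['thomas brown'] * 3,
                     'biographies': ['bio 1', 'bio 2', 'bio 3']})

# a left merge on name repeats the director row once per candidate biography
merged = pd.merge(directors, bios, how='left', on='name')
print(merged.shape)

# count distinct biographies per director and company; more than one means
# the correct biography has to be picked manually
bio_counts = merged.groupby(['name', 'comp_name'])['biographies'].nunique()
ambiguous = bio_counts[bio_counts > 1]
print(ambiguous)
```

In this sketch, 'jane doe' has no biography and therefore appears once with a missing value, while 'thomas brown' appears three times.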

In [96]:
# check how many directors (incl. duplicates) are sitting on the company boards
unique_bios = dir_bio_df.drop_duplicates(subset=['org_name_x', 'comp_name_x'])
unique_bios.shape


(6877, 37)

In [97]:
# number of unique biographies which are available
unique_bios = unique_bios.drop_duplicates(subset='biographies')
unique_bios.shape


(1405, 37)

This sample includes two companies which are not relevant for my investigation because they were only part of the S&P 500 in 2010. Therefore, I will remove them just to see how many biographies relate to all relevant companies. But I will keep them in the dir_bio_df dataframe in order not to alter the code that is based on this dataframe in other notebooks. The irrelevant companies are itt and mdp (those are their tickers).
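
Filtering with a boolean mask returns a new dataframe and leaves the original untouched, which is why the count can be checked without affecting code that relies on the full dataframe. A minimal sketch with made-up rows (the tickers 'aaa' and 'bbb' are placeholders for relevant companies):

```python
import pandas as pd

# hypothetical toy data; only the 'itt' and 'mdp' tickers come from the text above
bios = pd.DataFrame({'ticker_x': ['itt', 'mdp', 'aaa', 'bbb'],
                     'biographies': ['bio 1', 'bio 2', 'bio 3', 'bio 4']})

# select the rows of relevant companies without modifying bios itself
irrelevant = ['itt', 'mdp']
relevant = bios[~bios['ticker_x'].isin(irrelevant)]

print(relevant.shape, bios.shape)
```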

In [98]:
# check how many biographies are included in the S&P Capital IQ dataset without
# the irrelevant companies included
unique_bios[~unique_bios['ticker_x'].isin(['itt', 'mdp'])].shape


(1395, 37)

In [99]:
# remove any duplicate entries where the name, biography and comp_name are the same
dir_bio_df = dir_bio_df[~dir_bio_df.duplicated(subset=['name', 'biographies', 'comp_name_x'])]
dir_bio_df.shape


(7409, 37)

In [100]:
dir_bio_df.head()

Unnamed: 0,name,age_x,last_position,director_start,director_end,executive_start,executive_end,comp_name_x,ticker_x,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin_x,org_name_x,qualification,last_name_x,unique_dir_id,all_years,first_name,middle_name,nick_name,last_name_y,comp_name_y,ticker_y,education,prim_comp,biographies,age_y,isin_y,all_tickers,org_name_y
0,christina gold,72.0,independent director,0.0,0.0,1997.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. christina gold,,gold,7917,no,Christina,A.,-,Gold,The Western Union Company (NYSE:WU),NYSE:WU,Ecole des Hautes Etudes Commerciales de Montre...,First Data Merchant Services Corporation,Mrs. Christina A. Gold has been the Chief Exec...,67.0,US9598021098,NYSE:WU; BMV:WU *; BOVESPA:WUNI34; DB:W3U,"Gold, Christina A. (Prior Board)"
1,frank macinnis,72.0,independent chairman of the board,2011.0,2020.0,2001.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,mr. frank macinnis,,macinnis,3325,no,Frank,T.,-,MacInnis,EMCOR Group Inc. (NYSE:EME),NYSE:EME,University of Alberta; University Of Alberta L...,MES Holdings Corporation,Mr. Frank T. MacInnis serves as the President ...,68.0,US29084Q1004,NYSE:EME; DB:EM4,"MacInnis, Frank T. (Prior Board)"
2,denise ramos,63.0,"president, chief executive officer, director",2011.0,2019.0,2011.0,2019.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. denise ramos,,ramos,7996,no,Denise,L.,-,Ramos,ITT Corporation (NYSE:ITT),NYSE:ITT,The University of Chicago; Purdue University,ITT Corporation (NYSE:ITT) (Board),Ms. Denise L. Ramos serves as the Chief Execut...,58.0,US4509112011,NYSE:ITT; DB:ITTA,"Ramos, Denise L. (Board)"
3,orlando ashford,51.0,,,,,,itt inc,itt,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2012.0,us45073v1089,mr. orlando ashford,,ashford,5733,no,,,,,,,,,,,,,
4,donald defosset,72.0,,,,,,itt inc,itt,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2012.0,us45073v1089,"mr. donald (don) defosset , jr.",jr.,defosset,2984,no,Donald,-,Don,DeFosset,"Walter Energy, Inc.",-,Harvard Business School; Purdue University,ATL Partners,"Mr. Donald DeFosset, Jr., also known as Don, B...",66.0,US93317Q1058,OTCPK:WLTG.Q,"DeFosset, Donald (Prior Board)"


In [101]:
# write the dir_bio_df dataframe to an excel file
dir_bio_df.to_excel('/content/drive/My Drive/director-csr/director_bios_all.xlsx',
                                sheet_name='bios')
