<a href="https://colab.research.google.com/github/julianikulski/director-experience/blob/main/preprocessing/prepare_director_comp_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clean-up of the manually researched director bios
I have researched all missing biographies and I checked which companies have CSR related committees and which directors sat when on these committees. I reviewed DEF 14As for this information. I looked at DEF 14As between 2012 and 2016 for the committee data and at DEF 14As between 2008 (in rare cases) and 2016 for director biography information. Now that I have a long list of biographies, I need to check whether there are any directors which have the same name but are actually different people (with different biographies). And I also need to remove any duplicate biographies from directors which had a missing biography but sat on multiple boards and therefore, I researched multiple biographies as these differ per company (even if it is the same director, the layout, content and length of these biographies differs per company). In the latter case, I will keep the longest biography and remove any shorter ones for a director. This is a simple assumption that a longer biography will contain more information and will be more likely to include information on any CSR related previous career experience.

In [1]:
# connecting to Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import numpy as np
import pandas as pd

from glob import glob
import re
from functools import reduce


In [3]:
# set number of max rows
pd.set_option('display.max_rows', 13000)


# Read in the manual review list

In [4]:
# read in the excel file that contains the manually researched biographies and committee memberships
all_bios_comms_df = pd.read_excel('/content/drive/My Drive/director-csr/dir_bio_comm_all_added.xlsx')
all_bios_comms_df.drop(columns='Unnamed: 0', inplace=True)
print(all_bios_comms_df.shape)
all_bios_comms_df.head()


(7620, 67)


Unnamed: 0,name,org_name_x,org_name_y,irrelevant,missing_reason,board_committee,committee,comm_type,comm_start,comm_end,list_years_if_non_consecutive,comp_name,isin,biography_date,Biographies,2010,2011,2012,2013,2014,2015,age,last_position,director_start,director_end,executive_start,executive_end,ticker,missing_start_date,2011.1,2012.1,2013.1,2014.1,2015.1,current_position,dir_exec,in_position,qualification,all_years,Company Name [Any Professional Record] [Current Matching Results],Exchange:Ticker,Email Address,Professional Titles [Any Professional Record] [Current Matching Results],Colleges/Universities,Degrees,Graduation Year,Majors,Geographic Locations [Any Professional Record] [Current Matching Results],Primary Professional Record,Person Locations [Any Professional Record] [Current Matching Results],Person Age,Person Name First,Person Name Last,Person Name Middle,Person Name Nickname,Person Name Prefix,Person Name Suffix,Person Notes,Specialties [Any Professional Record] [Current Matching Results],Year Born,CIK [Any Professional Record] [Current Matching Results],Company CUSIP [Any Professional Record] [Current Matching Results],Primary ISIN [Any Professional Record] [Current Matching Results],Security Tickers [Any Professional Record] [Current Matching Results],SIC Codes (Primary) [Any Professional Record] [Current Matching Results],Company Type [Any Professional Record] [Current Matching Results],Professional Job Functions [Any Professional Record] [Current Matching Results]
0,thomas brown,mr. thomas (tony) brown,,,,No,,,,,,3m co,us88579y1010,,"Thomas ""Tony"" K. Brown, 59, Retired Group Vice...",y,y,y,y,y,y,64.0,,,,,,mmm,,0.0,0.0,0.0,1.0,1.0,independent director,2014.0,2014.0,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,thomas brown,mr. thomas (tony) brown,"Brown, Thomas R. (Prior Board)",Yes,,No,,,,,,3m co,us88579y1010,,"Thomas R. Brownm, also known as Tom, served as...",y,y,y,y,y,y,64.0,,,,,,mmm,,0.0,0.0,0.0,1.0,1.0,independent director,2014.0,2014.0,,no,"AGR Tools, Inc., Prior to Reverse Merger with ...",-,-,Former Director,-,-,-,-,United States and Canada (Primary),Marquis Ventures Inc. (TSXV:MQV.H) (Board),Canada; United States and Canada; British Colu...,52.0,Thomas,Brown,R.,Tom,Mr.,-,,-,1963.0,-,50545R,US0012361087,-,1000 Metal mining,Public Company,Chief Executive Officer (Prior)
2,linda alvarado,ms. linda alvarado,,,,No,,,,,,3m co,us88579y1010,2014.0,"Linda G. Alvarado, 63, President and Chief Exe...",y,y,y,y,y,y,66.0,independent director,0.0,0.0,2000.0,2016.0,mmm,0.0,1.0,1.0,1.0,1.0,1.0,,,,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,vance coffman,"dr. vance coffman , ph.d.","Coffman, Vance D. (Prior Board)",,,No,,,,,,3m co,us88579y1010,,Dr. Vance D. Coffman served as the Chief Execu...,y,y,y,y,y,y,74.0,independent director,0.0,0.0,2002.0,2018.0,mmm,0.0,1.0,1.0,1.0,1.0,1.0,,,,ph.d.,no,Lockheed Martin Corporation (NYSE:LMT),NYSE:LMT,-,"Former Chairman, Chief Executive Officer and M...",Stanford University; Iowa State University; Em...,Stanford University - Doctorate; Stanford Univ...,-,Stanford University - Aeronautics; Stanford Un...,United States and Canada (Primary),Amgen Inc. (NasdaqGS:AMGN) (Board),United States of America; Maryland; Midatlanti...,71.0,Vance,Coffman,D.,-,Dr.,-,,-,1944.0,0000060026; 0000936468,539830,US5398301094,NYSE:LMT; XTRA:LOM; SWX:LMT; BASE:LMT; BMV:LMT...,3760 Guided missiles and space vehicles and parts,Public Company,Chief Executive Officer (Prior)
4,robert ulrich,mr. robert (bob) ulrich,"Ulrich, Robert J. (Prior Board)",,,No,,,,,,3m co,us88579y1010,,"Mr. Robert J. Ulrich, also known as Bob, is th...",y,y,y,y,y,y,74.0,independent director,0.0,0.0,2008.0,2017.0,mmm,0.0,1.0,1.0,1.0,1.0,1.0,,,,,no,Target Corp. (NYSE:TGT),NYSE:TGT,-,Former Chairman Emeritus,University of Minnesota,University of Minnesota - BA,-,-,United States and Canada (Primary),The Musical Instrument Museum (Board),United States of America; Great Lakes; Minneso...,71.0,Robert,Ulrich,J.,Bob,Mr.,-,,-,1944.0,0000027419,87612E,US87612E1064,NYSE:TGT; SNSE:TGT; BMV:TGT *; BOVESPA:TGTB34;...,5331 Variety stores,Public Company,Chief Executive Officer (Prior)


### Remove all entries marked as irrelevant and duplicates in org_name_y column

There is one director which has an incorrect name. He is shown as Ahmet Kent in the `dir_bio_comm_all` dataframe, however, his actual name is shown in the `all_directors_rel` dataframe which is Muhtar Kent. I will need to replace this in the `dir_bio_comm_all` dataframe.

In [5]:
index = all_bios_comms_df[all_bios_comms_df['name'] == 'muhtar kent'].index
all_bios_comms_df.loc[index, 'org_name_x'] = 'mr. muhtar kent'


In [6]:
# convert the 'irrelevant' column to lower case
all_bios_comms_df['irrelevant'] = all_bios_comms_df['irrelevant'].apply(lambda x: x.lower().strip() if not pd.isna(x) else x)
# remove all directors that have 'irrelevant' flagged as 'yes'
all_bios_comms_irr = all_bios_comms_df[all_bios_comms_df['irrelevant'].isnull()].copy()
all_bios_comms_irr.shape


(6870, 67)

In [7]:
# remove all duplicate entries based on the org_name_y but keep the NaN rows
less_dupes_df = all_bios_comms_irr[(~all_bios_comms_irr.duplicated(subset=['org_name_y', 'org_name_x'])) | (all_bios_comms_irr['org_name_y'].isnull())]
less_dupes_df.shape


(6343, 67)

###  Create manual review list to determine false positive duplicates

In [8]:
# filter all duplicate entries based on the org_name_x and org_name_y fields
dupes_df = less_dupes_df[less_dupes_df.duplicated(subset=['org_name_x', 'org_name_y'], keep=False)].copy()
# filter all duplicates from that list based on the org_name_y and keep the first entry
dupes_df.sort_values(by=['name'], inplace=True)
dupes_df.head()


Unnamed: 0,name,org_name_x,org_name_y,irrelevant,missing_reason,board_committee,committee,comm_type,comm_start,comm_end,list_years_if_non_consecutive,comp_name,isin,biography_date,Biographies,2010,2011,2012,2013,2014,2015,age,last_position,director_start,director_end,executive_start,executive_end,ticker,missing_start_date,2011.1,2012.1,2013.1,2014.1,2015.1,current_position,dir_exec,in_position,qualification,all_years,Company Name [Any Professional Record] [Current Matching Results],Exchange:Ticker,Email Address,Professional Titles [Any Professional Record] [Current Matching Results],Colleges/Universities,Degrees,Graduation Year,Majors,Geographic Locations [Any Professional Record] [Current Matching Results],Primary Professional Record,Person Locations [Any Professional Record] [Current Matching Results],Person Age,Person Name First,Person Name Last,Person Name Middle,Person Name Nickname,Person Name Prefix,Person Name Suffix,Person Notes,Specialties [Any Professional Record] [Current Matching Results],Year Born,CIK [Any Professional Record] [Current Matching Results],Company CUSIP [Any Professional Record] [Current Matching Results],Primary ISIN [Any Professional Record] [Current Matching Results],Security Tickers [Any Professional Record] [Current Matching Results],SIC Codes (Primary) [Any Professional Record] [Current Matching Results],Company Type [Any Professional Record] [Current Matching Results],Professional Job Functions [Any Professional Record] [Current Matching Results]
1701,alain belda,sr. alain belda,,,,Yes,"nomination, governance and public affairs comm...","social, environmental",2011.0,2011.0,,citigroup inc,us1729674242,2011,Mr. Belda is an experienced executive and has ...,y,y,y,y,y,y,75.0,independent director,0.0,0.0,1997.0,2012.0,c,0.0,1.0,1.0,0.0,0.0,0.0,,,,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3838,alain belda,sr. alain belda,,,,Yes,directors and corporate governance committee,"social, environmental",2014.0,,,international business machines corp,us4592001014,2015,"Alain J.P. Belda, 71, is a managing director a...",y,y,y,y,y,y,75.0,independent director,0.0,0.0,2008.0,2016.0,ibm,0.0,1.0,1.0,1.0,1.0,1.0,,,,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
5385,alan batkin,mr. alan batkin,,,,No,,,,,,omnicom group inc,us6819191064,2016,The selection of Mr. Batkin as a director nomi...,y,y,y,n,n,n,76.0,independent director,0.0,0.0,2008.0,2020.0,omc,0.0,1.0,1.0,1.0,1.0,1.0,,,,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3489,alan batkin,mr. alan batkin,,,,Yes,"nominating, governance and social responsibili...","social, environmental",2015.0,,,hasbro inc,us4180561072,2016,Alan R. Batkin is Chairman and Chief Executive...,y,y,y,y,y,y,76.0,independent director,0.0,0.0,1992.0,2017.0,has,0.0,1.0,1.0,1.0,1.0,1.0,,,,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
810,alan boeckmann,mr. alan boeckmann,,,,No,,,,,,archer daniels midland co,us0394831020,2016,"Prior to retiring in February, 2012, Mr. Boeck...",,,,,,,71.0,independent director,0.0,0.0,2004.0,2019.0,adm,0.0,1.0,1.0,1.0,1.0,1.0,,,,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [9]:
# write all duplicates which need to be checked manually to an excel file
dupes_df.to_excel('/content/drive/My Drive/director-csr/dupes_directors.xlsx', 
                  sheet_name='dupes_directors')


### Determine longest bios for all unique directors
The manual review showed that only two directors have the same first and last name but are two different people. I will add these two separately after I have removed all other duplicates by keeping the longer biography of all duplicates.

In [10]:
# read in the manually checked data
checked_dupes_df = pd.read_excel('/content/drive/My Drive/director-csr/dupes_directors_checked.xlsx')
checked_dupes_df.rename(columns={'Unnamed: 0': 'old_index'}, inplace=True)
checked_dupes_df[checked_dupes_df['unique'].notnull()]


Unnamed: 0,old_index,name,org_name_x,org_name_y,unique,irrelevant,missing_reason,board_committee,committee,comm_type,comm_start,comm_end,list_years_if_non_consecutive,comp_name,isin,biography_date,Biographies,2010,2011,2012,2013,2014,2015,age,last_position,director_start,director_end,executive_start,executive_end,ticker,missing_start_date,2011.1,2012.1,2013.1,2014.1,2015.1,current_position,dir_exec,in_position,qualification,all_years,Company Name [Any Professional Record] [Current Matching Results],Exchange:Ticker,Email Address,Professional Titles [Any Professional Record] [Current Matching Results],Colleges/Universities,Degrees,Graduation Year,Majors,Geographic Locations [Any Professional Record] [Current Matching Results],Primary Professional Record,Person Locations [Any Professional Record] [Current Matching Results],Person Age,Person Name First,Person Name Last,Person Name Middle,Person Name Nickname,Person Name Prefix,Person Name Suffix,Person Notes,Specialties [Any Professional Record] [Current Matching Results],Year Born,CIK [Any Professional Record] [Current Matching Results],Company CUSIP [Any Professional Record] [Current Matching Results],Primary ISIN [Any Professional Record] [Current Matching Results],Security Tickers [Any Professional Record] [Current Matching Results],SIC Codes (Primary) [Any Professional Record] [Current Matching Results],Company Type [Any Professional Record] [Current Matching Results],Professional Job Functions [Any Professional Record] [Current Matching Results]
667,5283,john thompson,mr. john thompson,,yes_1,,,Yes,,,,,,nortonlifelock inc,us6687711084,2011.0,Mr. Thompson has served as our Group President...,y,y,y,y,y,y,71.0,chairman of the board,1999.0,2011.0,1999.0,2011.0,nlok,0.0,1,0,0,0,0,,,,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
668,5225,john thompson,mr. john thompson,,no_1,,,No,,,,,,norfolk southern corp,us6558441084,2015.0,Areas of Expertise: CEO/Senior Officer; Financ...,y,y,y,y,y,y,68.0,,,,,,nsc,,0,0,1,1,1,independent director,2013.0,2013.0,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
669,6968,john thompson,mr. john thompson,,yes_2,,,No,,,,,,united parcel service inc,us9113121068,2012.0,John has been Chief Executive Officer of Virtu...,y,y,y,y,y,y,70.0,independent director,0.0,0.0,2000.0,2013.0,ups,0.0,1,1,1,0,0,,,,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
670,4814,john thompson,mr. john thompson,,no_1,,,Yes,regulatory and public policy committee,"social, environmental",2011.0,,,microsoft corp,us5949181045,2016.0,"Mr. Thompson, previously lead independent dire...",y,y,y,y,y,y,71.0,,,,,,msft,,0,1,1,1,1,independent non-executive chairman of the board,2012.0,2014.0,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,
671,6215,john thompson,mr. john thompson,,no_1,,,No,,,,,,seagate technology plc,ie00b58jvz52,2009.0,Mr. Thompson is Chairman of the Board of Direc...,y,y,y,y,y,y,71.0,independent director,0.0,0.0,2000.0,2011.0,stx,0.0,1,0,0,0,0,,,,,no,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [11]:
# add a column that shows the biography length
checked_dupes_df['bio_len'] = checked_dupes_df['Biographies'].apply(lambda x: len(x) if not pd.isna(x) else 0)
# add a blank column to identify which biography to keep
checked_dupes_df['bio_keep'] = None

# iterate through the dataframe to identify the longest biography
for item in checked_dupes_df.groupby('org_name_x'):
    #if item[1]['missing_reason']

    # get a list of the biography lengths for all items in the item group
    len_list = item[1]['bio_len'].values
    # get the index of the row with the longest bio length
    bio_index = np.argmax(len_list)
    # get the row index of the relevant item in the group
    bio_org_index = list(item[1].index)[bio_index]
    # assign 'yes' to the bio_keep column for the row that includes the longest biography
    checked_dupes_df.at[bio_org_index, 'bio_keep'] = 'yes'


In [12]:
# check which of the false positive duplicates was kept
checked_dupes_df[checked_dupes_df['unique'].notnull()][['name', 'org_name_x', 'bio_len', 'unique', 'bio_keep']]


Unnamed: 0,name,org_name_x,bio_len,unique,bio_keep
667,john thompson,mr. john thompson,706,yes_1,
668,john thompson,mr. john thompson,457,no_1,
669,john thompson,mr. john thompson,1400,yes_2,
670,john thompson,mr. john thompson,1934,no_1,yes
671,john thompson,mr. john thompson,342,no_1,


In [13]:
# add the missing false positive duplicate director by assigning bio_keep with 'yes'
checked_dupes_df.at[670, 'bio_keep'] = 'yes'
checked_dupes_df.at[671, 'bio_keep'] = 'yes'
checked_dupes_df.at[672, 'bio_keep'] = 'yes'
checked_dupes_df.at[673, 'bio_keep'] = 'yes'

# add the correct biographies manually
checked_dupes_df.at[670, 'Biographies'] = checked_dupes_df.loc[671]['Biographies']
checked_dupes_df.at[672, 'Biographies'] = checked_dupes_df.loc[671]['Biographies']
checked_dupes_df.at[673, 'Biographies'] = checked_dupes_df.loc[671]['Biographies']


In [14]:
# check how many unique biographies are now left
print(checked_dupes_df[checked_dupes_df['bio_keep'].notnull()].shape)
# save the unique bio rows in a new dataframe
unique_bios = checked_dupes_df[checked_dupes_df['bio_keep'].notnull()].copy()


(655, 71)


In [15]:
# assign the column 'old_index' to a list to prepare for removing the entries from checked_dupes_df
# from the less_dupes_df
remove_indices = checked_dupes_df['old_index'].to_list()


In [16]:
# drop all duplicates from the less_dupes_df dataframe and add the ones that should be kept
less_dupes_df.drop(remove_indices, inplace=True)
less_dupes_df.shape


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


(4864, 67)

In [17]:
# add all biographies that should be added as unique director entries
unique_dir_bios_df = less_dupes_df.append(unique_bios)
unique_dir_bios_df.shape


(5519, 71)

In [18]:
# remove all entries from this dataframe which have a biography but have data in
# the 'missing_reason' column
unique_dir_bios_df.drop(unique_dir_bios_df[(unique_dir_bios_df['missing_reason'].notnull()) & (unique_dir_bios_df['org_name_y'].notnull())].index, inplace=True)
unique_dir_bios_df.reset_index(drop=True, inplace=True)
unique_dir_bios_df.shape


(5500, 71)

### Deal with missing_reason column to determine which companies/directors to remove from sample

In [19]:
# review the dataframe that has all irrelevant directors removed but still has all duplicates
# first step is to get all company names
all_missing_comps = list(set(list(all_bios_comms_df[all_bios_comms_df['missing_reason'].notnull()]['comp_name'])))
all_missing_comps


['medtronic plc',
 'intercontinental exchange inc',
 'fossil group inc',
 'coty inc',
 'zoetis inc',
 'dte energy co',
 'cme group inc',
 'broadcom inc',
 'e i du pont de nemours and co',
 'graham holdings co',
 'kraft heinz co',
 'cummins inc',
 'metlife inc',
 'at&t inc',
 'duke energy corp',
 'perrigo company plc',
 'federated hermes inc',
 'lam research corp',
 'vulcan materials co',
 'automatic data processing inc',
 'dollar tree inc',
 'cognizant technology solutions corp',
 'amg advanced metallurgical group nv',
 'discover financial services',
 'prudential plc',
 'newell brands inc',
 'international paper co',
 'skyworks solutions inc',
 'u.s. bancorp',
 'corning inc',
 'colgate-palmolive co',
 'jefferies financial group inc',
 'eaton corporation plc',
 'schlumberger nv',
 'lockheed martin corp',
 'accenture plc',
 'agilent technologies inc',
 'darden restaurants inc',
 'centerpoint energy inc',
 'pultegroup inc',
 'pepsico inc',
 'general motors co',
 'exelon corp',
 'honeywell

In [20]:
# get a list of all the missing reasons
all_missing_reasons = list(set(all_bios_comms_df['missing_reason']))
all_missing_reasons 


[nan,
 'no def 14a statements matching these director names',
 'only 1 month in office without bio',
 'less than one year in office; no bio available',
 'not a director',
 'not a director before 2018',
 'no proxy statements before 2015',
 'not in any documents',
 'no proxy statements available',
 'no bio information in any document',
 'very incomplete data',
 'no def 14a for 2016 and therefore no committe info on 2015',
 'duplicate',
 'no mention in any documents',
 'no bio information available in any document; not a director',
 'no bio information available in any document',
 'duplicate director entry',
 'no bio information in any documents',
 'no bio information available in any document; committee information cannot be used because no DEF 14As available before 2015',
 'was only director from may 2012 until september 2012 because of an acquisition',
 'no def 14as available',
 'no proxy statements']

Based on the above list of reasons, I will create a dataframe of any irrelevant directors and in certain cases, of all directors of certain companies. This dataframe will then be used to remove those irrelevant directors and companies from the overall sample in the dataframe which will be read in later on: `all_directors_rel`. Here are some explanations on what was done with the missing_reason populated rows

* `duplicate`: these are not added to the removal dataframe because they are 
relevant directors.

* `no mention in any documents` and `not a director` and `not a director before 2018` and `not in any documents` and `no bio information available in any document; not a director` and `no def 14a statements matching these director names` and `duplicate director entry`: keep those and add column to mark them to be removed by flagging 'individual'.

* `less than one year in office; no bio available` and `only 1 month in office without bio` and `was only director from may 2012 until september 2012 because of an acquisition` and `no bio information in any document` and `no bio information in any documents` and `no proxy statements before 2015` and `no bio information available in any document`: the respective companies will only be considered in the sample for the years t+2 or t with t being the years these directors are on the board. This way, their potential influence will not be included in the sample.

* `no proxy statements available` and `no proxy statements` and `no def 14as available` and `very incomplete data` and `no bio information available in any document; committee information cannot be used because no DEF 14As available before 2015`: the entire company needs to be removed from the sample because no data is available for them.

* `no def 14a for 2016 and therefore no committe info on 2015`: the entire company needs to be removed for the year 2015 (and therefore, also 2016) because there is not committee information.





In [21]:
# define the types of missing_reasons that require the same action
comps_irrel_years = ['only 1 month in office without bio', 
                     'less than one year in office; no bio available',
                     'was only director from may 2012 until september 2012 because of an acquisition',
                     'no bio information in any document',
                     'no bio information in any documents',
                     'no proxy statements before 2015',
                     'no bio information available in any document']

indiv_directors = ['no mention in any documents', 
                   'not a director', 
                   'not a director before 2018',
                   'not in any documents',
                   'no bio information available in any document; not a director',
                   'no def 14a statements matching these director names',
                   'duplicate director entry']

entire_comp = ['no proxy statements available',
               'no proxy statements',
               'no def 14as available', 
               'very incomplete data',
               'no bio information available in any document; committee information cannot be used because no DEF 14As available before 2015']

comp_year_committee = ['no def 14a for 2016 and therefore no committe info on 2015']


In [22]:
# check whether all missing reasons have been addressed
all_reasons_included = comps_irrel_years
all_reasons_included.extend(indiv_directors)
all_reasons_included.extend(entire_comp)
all_reasons_included.extend(comp_year_committee)
print(len(all_reasons_included))

# compare the overall list with the list which addresses the reasons
remaining_reasons = list(set(all_missing_reasons) - set(all_reasons_included))
remaining_reasons.remove(np.nan)
remaining_reasons.remove('duplicate')

# show the unaddressed reasons
remaining_reasons


20


[]

In [23]:
# check the remaining reasons which need to be addressed
all_bios_comms_df[all_bios_comms_df['missing_reason'].isin(remaining_reasons)]


Unnamed: 0,name,org_name_x,org_name_y,irrelevant,missing_reason,board_committee,committee,comm_type,comm_start,comm_end,list_years_if_non_consecutive,comp_name,isin,biography_date,Biographies,2010,2011,2012,2013,2014,2015,age,last_position,director_start,director_end,executive_start,executive_end,ticker,missing_start_date,2011.1,2012.1,2013.1,2014.1,2015.1,current_position,dir_exec,in_position,qualification,all_years,Company Name [Any Professional Record] [Current Matching Results],Exchange:Ticker,Email Address,Professional Titles [Any Professional Record] [Current Matching Results],Colleges/Universities,Degrees,Graduation Year,Majors,Geographic Locations [Any Professional Record] [Current Matching Results],Primary Professional Record,Person Locations [Any Professional Record] [Current Matching Results],Person Age,Person Name First,Person Name Last,Person Name Middle,Person Name Nickname,Person Name Prefix,Person Name Suffix,Person Notes,Specialties [Any Professional Record] [Current Matching Results],Year Born,CIK [Any Professional Record] [Current Matching Results],Company CUSIP [Any Professional Record] [Current Matching Results],Primary ISIN [Any Professional Record] [Current Matching Results],Security Tickers [Any Professional Record] [Current Matching Results],SIC Codes (Primary) [Any Professional Record] [Current Matching Results],Company Type [Any Professional Record] [Current Matching Results],Professional Job Functions [Any Professional Record] [Current Matching Results]


In [24]:
# remove the rows that are marked as duplicate because these directors are relevant
removal_df = all_bios_comms_df[all_bios_comms_df['missing_reason'].notnull()]
removal_df = removal_df[removal_df['missing_reason'] != 'duplicates']

# add flag that specifies that the entire company year related to the director and the entire company should be removed from sample
removal_df['what_to_remove'] = removal_df.apply(lambda x: 'entire company in t+1' if x['missing_reason'] in comps_irrel_years else np.nan, axis=1)

# add column to signify that individual directors should be removed
removal_df['what_to_remove'] = removal_df.apply(lambda x: 'only this director' if x['missing_reason'] in indiv_directors else x['what_to_remove'], axis=1)

# add flag that specifies that the entire company should be removed for all years
removal_df['what_to_remove'] = removal_df.apply(lambda x: 'entire company for all years' if x['missing_reason'] in entire_comp else x['what_to_remove'], axis=1)

# add flag that shows that company needs to be remove for director years 2015 because of missing committee data
removal_df['what_to_remove'] = removal_df.apply(lambda x: 'entire company for director year 2015 because of committee' if x['missing_reason'] in comp_year_committee else x['what_to_remove'], axis=1)


In [25]:
removal_df['what_to_remove'].unique()

array(['entire company in t+1', 'only this director',
       'entire company for all years', nan,
       'entire company for director year 2015 because of committee'],
      dtype=object)

## Add biographies to all relevant directors in sample

As I mentioned at the end of the `csr_committees` notebook, I unfortunately started the manual biography review whose result is displayed in the `all_bios_comms_df` dataframe while I was still doing some code changes and experiments in the `biography_matching` notebook. This means that the excel file generated by the `csr_committees` notebook is different from the excel file I actually used for the review. There are most likely some duplicate director entries in the new file from the `csr_committees` notebook compared to the file I reviewed (because the reviewed file had about 300 less entries). This is not a problem though because I will use the previously created list of all relevant directors and match the researched biographies with it. If there are still some biographies missing afterwards (and there is no reason for this marked in the manually reviewed excel file), then I will quickly try to research those and add them.

In [26]:
# read in the file that is generated at the end of the csr_committees notebook
all_directors_rel = pd.read_csv('/content/drive/My Drive/director-csr/all_directors_rel.csv')
all_directors_rel.drop(columns=['Unnamed: 0'], inplace=True)
print(all_directors_rel.shape)
all_directors_rel.head()


(6888, 24)


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,qualification,last_name,unique_dir_id,all_years
0,christina gold,72.0,independent director,0.0,0.0,1997.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. christina gold,,gold,7917,no
1,frank macinnis,72.0,independent chairman of the board,2011.0,2020.0,2001.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,mr. frank macinnis,,macinnis,3325,no
2,denise ramos,63.0,"president, chief executive officer, director",2011.0,2019.0,2011.0,2019.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. denise ramos,,ramos,7996,no
3,orlando ashford,51.0,,,,,,itt inc,itt,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2012.0,us45073v1089,mr. orlando ashford,,ashford,5733,no
4,donald defosset,72.0,,,,,,itt inc,itt,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2012.0,us45073v1089,"mr. donald (don) defosset , jr.",jr.,defosset,2984,no


In [27]:
# check whether there are any duplicates that can be removed right away
all_directors_rel[all_directors_rel.duplicated(subset=['name', 'comp_name'], keep=False)].head()


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,qualification,last_name,unique_dir_id,all_years
55,william mcdermott,58.0,independent director,0.0,0.0,2005.0,2020.0,under armour inc,ua,0.0,1.0,1.0,1.0,1.0,1.0,,,,us9043112062,mr. william (bill) mcdermott,,mcdermott,7517,no
56,byron adams,61.0,director,2011.0,2013.0,2003.0,2017.0,under armour inc,ua,0.0,1.0,1.0,1.0,1.0,1.0,,,,us9043112062,"mr. byron (chip) adams , jr.",jr.,adams,2325,no
57,anthony deering,72.0,independent director,0.0,0.0,2008.0,2017.0,under armour inc,ua,0.0,1.0,1.0,1.0,1.0,1.0,,,,us9043112062,mr. anthony (tony) deering,,deering,2063,no
58,thomas sippel,68.0,director,0.0,0.0,2001.0,2015.0,under armour inc,ua,0.0,1.0,1.0,1.0,1.0,1.0,,,,us9043112062,mr. thomas sippel,,sippel,7230,no
59,brenda piper,48.0,independent director,0.0,0.0,2012.0,2013.0,under armour inc,ua,0.0,0.0,1.0,1.0,0.0,0.0,,,,us9043112062,brenda piper,,piper,176,no


In [28]:
# remove the second under armour isin
all_directors_rel = all_directors_rel[~(all_directors_rel['isin'] == 'us9043112062')]


Merging the biographies with the directors in the `all_directors_rel` dataframe will happen in two steps. This is necessary because the some directors have the same names but are different people. For these cases, I will merge the bios with the directors dataframe on the org_name and on the isin columns. The remaining bios will be merged only on the org_name column. Some directors sit on multiple boards. However, the `unique_dir_bios_df` includes each director only once. Therefore, merging on both org_name and isin would leave a lot of directors in the `all_directors_rel` dataframe without a biography.

In [29]:
# get the duplicate names for different directors from the unique_dir_bios_df dataframe
unique_dir_bios_df.rename(columns={'org_name_x': 'org_name'}, inplace=True)
duplicate_names = unique_dir_bios_df[unique_dir_bios_df.duplicated(subset=['org_name'], keep=False)][['isin', 'Biographies', 'org_name']]#, 
                                                                                                  #'board_committee', 'committee', 'comm_type',
                                                                                                  #'comm_start', 'comm_end', 'list_years_if_non_consecutive']]
# merge these rows with all_directors_rel on org_name and isin
dupe_names_bios_rel = pd.merge(all_directors_rel, duplicate_names, how='left', on=['org_name', 'isin'])


In [30]:
# compare the shapes of the original dataframes and the merged dataframe
print(all_directors_rel.shape)
print(duplicate_names.shape)
print(dupe_names_bios_rel.shape)


(6877, 24)
(64, 3)
(6877, 25)


In [31]:
# split the dataframe to prepare for second merging step
rel_dirs_bios = dupe_names_bios_rel[dupe_names_bios_rel['Biographies'].notnull()]
remain_rel_dirs = dupe_names_bios_rel[dupe_names_bios_rel['Biographies'].isnull()]
remain_rel_dirs.drop(columns=['Biographies'], inplace=True)

# also remove the all rows in the duplicate_names dataframe from the unique_dir_bios_df
# this ensures that these names are not matched with any incorrect directors with 
# the same name
unique_dir_bios_df.drop(duplicate_names.index, inplace=True)
unique_dir_bios_df = unique_dir_bios_df[['name', 'Biographies', 'org_name']]#, 
                                         #'board_committee', 'committee', 'comm_type',
                                         #'comm_start', 'comm_end', 'list_years_if_non_consecutive']]

print(rel_dirs_bios.shape)
print(unique_dir_bios_df.shape)
print(remain_rel_dirs.shape)


(64, 25)
(5436, 3)
(6813, 24)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [32]:
# merge remain_rel_dirs with unique_dir_bios_df
remain_bios_dirs = pd.merge(remain_rel_dirs, unique_dir_bios_df, how='left', on=['org_name', 'name'])
remain_bios_dirs.shape


(6813, 25)

In [33]:
# merge both dataframes from step 1 and step 2
all_bios_dirs = remain_bios_dirs.append(rel_dirs_bios)
all_bios_dirs.shape


(6877, 25)

### After merging the bios with the directors, the sample needs to be cleaned from the flagged directors in the removal_df

In [34]:
# how many biographies are blank
all_bios_dirs[all_bios_dirs['Biographies'].isnull()].shape


(172, 25)

In [35]:
# these are the actions required to clean the sample
removal_df['what_to_remove'].unique()


array(['entire company in t+1', 'only this director',
       'entire company for all years', nan,
       'entire company for director year 2015 because of committee'],
      dtype=object)

In [36]:
# removing entire companies from dataset for all years
removal_comps = list(removal_df[removal_df['what_to_remove'] == 'entire company for all years']['comp_name'].unique())

# shape of all_bios_dirs
print(all_bios_dirs.shape)
# remove the defined companies
all_bios_dirs = all_bios_dirs[~all_bios_dirs['comp_name'].isin(removal_comps)]
# check the shape of all_bios_dirs afterwards
print(all_bios_dirs.shape)
# check how many missing biographies are left
print(all_bios_dirs[all_bios_dirs['Biographies'].isnull()].shape)


(6877, 25)
(6786, 25)
(120, 25)


In [37]:
# removing individual director from dataset
removal_directors = removal_df[removal_df['what_to_remove'] == 'only this director'][['org_name_x', 'comp_name']]

# shape of all_bios_dirs
print(all_bios_dirs.shape)
# remove the defined directors
all_bios_dirs.drop(all_bios_dirs[(all_bios_dirs['org_name'].isin(removal_directors['org_name_x'])) & (all_bios_dirs['comp_name'].isin(removal_directors['comp_name']))].index, inplace=True)
# check the shape of all_bios_dirs afterwards
print(all_bios_dirs.shape)
# check how many missing biographies are left
print(all_bios_dirs[all_bios_dirs['Biographies'].isnull()].shape)


(6786, 25)
(6721, 25)
(60, 25)


In [38]:
# removing directors and adding column specifying which years this company should be considered for
removal_directors_years = removal_df[removal_df['what_to_remove'] == 'entire company in t+1'][['org_name_x', 'comp_name']]
# creating a helper column to identify the correct directors and companies to be removed
removal_directors_years['compare_col'] = removal_directors_years.apply(lambda x: ' '.join([x['org_name_x'], x['comp_name']]), axis=1)

# shape of all_bios_dirs
print(all_bios_dirs.shape)
# add helper column for those companies in removal_directors_years that need to be
# removed for certain years
all_bios_dirs['remove_comp_for_years'] = all_bios_dirs.apply(lambda x: 'yes' if ' '.join([x['org_name'],x['comp_name']]) in list(removal_directors_years['compare_col'])
                                                                        else 'no', axis=1)

# add column that specifies which years need to be removed
all_bios_dirs['removal_years'] = all_bios_dirs.apply(lambda x: [i+2011 for i in range(5)
                                                                if x[['2011', '2012', '2013', '2014', '2015']].values.tolist()[i] == 1]
                                                                if x['remove_comp_for_years'] == 'yes' else np.nan, axis=1)

# create a list for each company which shows which years for those directors need to be excluded
matching_dict = {}
for group_item in all_bios_dirs[all_bios_dirs['removal_years'].notnull()].groupby('comp_name'):
    matching_dict[group_item[0]] = list(set([subitem for item in group_item[1]['removal_years'].values for subitem in item]))

# re-assign the values for each year to identify for which years the directors and companies should be considered
all_bios_dirs['2011'] = all_bios_dirs.apply(lambda x: 0 if x['comp_name'] in matching_dict.keys()
                                                        and 2011 in matching_dict[x['comp_name']]
                                                        else x['2011'], axis=1)
all_bios_dirs['2012'] = all_bios_dirs.apply(lambda x: 0 if x['comp_name'] in matching_dict.keys()
                                                        and 2012 in matching_dict[x['comp_name']]
                                                        else x['2012'], axis=1)
all_bios_dirs['2013'] = all_bios_dirs.apply(lambda x: 0 if x['comp_name'] in matching_dict.keys()
                                                        and 2013 in matching_dict[x['comp_name']]
                                                        else x['2013'], axis=1)
all_bios_dirs['2014'] = all_bios_dirs.apply(lambda x: 0 if x['comp_name'] in matching_dict.keys()
                                                        and 2014 in matching_dict[x['comp_name']]
                                                        else x['2014'], axis=1)
all_bios_dirs['2015'] = all_bios_dirs.apply(lambda x: 0 if x['comp_name'] in matching_dict.keys()
                                                        and 2015 in matching_dict[x['comp_name']]
                                                        else x['2015'], axis=1)

# drop any directors that have only 0s in all year columns
year_cols = ['2011', '2012', '2013', '2014', '2015']
all_bios_dirs = all_bios_dirs[~(all_bios_dirs[year_cols].sum(1) == 0)]

# check the shape of all_bios_dirs afterwards
print(all_bios_dirs.shape)
# check how many missing biographies are left
print(all_bios_dirs[all_bios_dirs['Biographies'].isnull()].shape)


(6721, 25)
(6596, 27)
(22, 27)


In [39]:
# removing the 2015 flag for all directors whose company does not have any information
# on the CSR committee
removal_directors_committee = removal_df[removal_df['what_to_remove'] == 'entire company for director year 2015 because of committee'][['org_name_x', 'comp_name']]
rel_comps = list(removal_directors_committee['comp_name'].unique())

# shape of all_bios_dirs
print(all_bios_dirs.shape)

# replace the 2015 column values with 0 for all relevant companies
all_bios_dirs['2015'] = all_bios_dirs.apply(lambda x: 0 if x['comp_name'] in rel_comps
                                                        else x['2015'], axis=1)

# shape of all_bios_dirs
print(all_bios_dirs.shape)
# check how many missing biographies are left
print(all_bios_dirs[all_bios_dirs['Biographies'].isnull()].shape)


(6596, 27)
(6596, 27)
(22, 27)


## Handle the remaining missing biographies

In [40]:
# create a dataframe for a manual merge
manual_merge_df = all_bios_dirs[all_bios_dirs['Biographies'].isnull()]
# remove those entries from the sample dataframe
all_comp_bios_df = all_bios_dirs[~all_bios_dirs['Biographies'].isnull()].copy()
print(all_comp_bios_df.shape)

# remove all irrelevant entries from the original bio dataframe
rel_bios_df = all_bios_comms_df[all_bios_comms_df['irrelevant'].isnull()][['name', 'org_name_x', 
                                                                           'comp_name', 'Biographies']].copy()
rel_bios_df.rename(columns={'org_name_x': 'org_name'}, inplace=True)

# now merge the bios with the remaining directors
completed_bios_df = pd.merge(manual_merge_df, rel_bios_df, how='left', on=['name', 'comp_name'])
completed_bios_df.drop(columns=['Biographies_x', 'org_name_y'], inplace=True)
completed_bios_df.rename(columns={'Biographies_y': 'Biographies',
                                  'org_name_x': 'org_name'}, inplace=True)
completed_bios_df.shape


(6574, 27)


(22, 27)

In [41]:
# one person was not removed even though he is flagged as irrelevant so I will do this manually now
completed_bios_df = completed_bios_df[~(completed_bios_df['name'] == 'robert kotick')]
completed_bios_df.shape


(21, 27)

In [42]:
# this is the final sample with no missing biographies now
complete_sample_df = all_comp_bios_df.append(completed_bios_df)
complete_sample_df[complete_sample_df['Biographies'].isnull()]


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,qualification,last_name,unique_dir_id,all_years,Biographies,remove_comp_for_years,removal_years


In [43]:
# final check if there are any duplicates per company
complete_sample_df[complete_sample_df.duplicated(subset=['name', 'comp_name'], keep=False)]


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,qualification,last_name,unique_dir_id,all_years,Biographies,remove_comp_for_years,removal_years
4059,john marriott,52.0,vice chairman of the board,0.0,2014.0,2002.0,2014.0,marriott international inc,mar,0.0,1.0,1.0,1.0,1.0,0.0,,,,us5719032022,"john marriott , iii",iii,marriott,1567,no,John W. Marriott III (Vice Chairman of the Boa...,no,
4062,john marriott,88.0,,,,,,marriott international inc,mar,,0.0,1.0,1.0,1.0,1.0,executive chairman of the board,1964.0,2012.0,us5719032022,"mr. john (bill) marriott , jr.",jr.,marriott,4340,no,"J.W. Marriott, Jr. (Executive Chairman of the ...",no,
5199,wayne hughes,60.0,trustee,0.0,0.0,1998.0,2020.0,public storage,psa,0.0,1.0,1.0,1.0,1.0,1.0,,,,us74460d1090,"mr. b. wayne hughes , jr.",jr.,hughes,2155,no,"Mr. Hughes, Jr. became a member of the Board i...",no,
15,wayne hughes,85.0,trustee,1980.0,2011.0,1980.0,2012.0,public storage,psa,0.0,1.0,1.0,0.0,0.0,0.0,,,,us74460d1090,mr. b. wayne hughes,,hughes,2154,no,"Mr. Hughes, Jr. became a member of the Board i...",no,


In [44]:
# drop any unnecessary columns
complete_sample_df.drop(columns=['removal_years', 'remove_comp_for_years', 'qualification', 'last_name'], inplace=True)


In [45]:
# rename the 'Biogrpahies' columns
complete_sample_df.rename(columns={'Biographies': 'biographies'}, inplace=True)
# after all of this merging I will reset the index
complete_sample_df.reset_index(drop=True, inplace=True)

# shape of the final sample
print(complete_sample_df.shape)
# overview of missing data
complete_sample_df.isnull().sum()


(6595, 23)


name                     0
age                    405
last_position         2735
director_start        2735
director_end          2735
executive_start       2735
executive_end         2735
comp_name                0
ticker                   0
missing_start_date    2735
2011                     0
2012                     0
2013                     0
2014                     0
2015                     0
current_position      3860
dir_exec              3860
in_position           3860
isin                     0
org_name                 0
unique_dir_id            0
all_years                0
biographies              0
dtype: int64

The four remaining duplicates are correct because these are different directors (fathers and sons). Therefore, the final sample has been created.


## Clean biography data
I will do some cleaning of the biography data so that I can use it later on without having to repeat the cleaning process

In [46]:
# I need to do a tiny bit of cleaning before continuing
complete_sample_df['biographies'] = complete_sample_df['biographies'].apply(lambda x: x.replace('\n', ' '))
complete_sample_df['biographies'] = complete_sample_df['biographies'].apply(lambda x: x.replace('\t', ' '))
complete_sample_df['biographies'] = complete_sample_df['biographies'].apply(lambda x: re.sub('\s+', ' ', x).strip())

complete_sample_df.head()


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,unique_dir_id,all_years,biographies
0,christina gold,72.0,independent director,0.0,0.0,1997.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. christina gold,7917,no,Mrs. Christina A. Gold has been the Chief Exec...
1,frank macinnis,72.0,independent chairman of the board,2011.0,2020.0,2001.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,mr. frank macinnis,3325,no,Mr. Frank T. MacInnis serves as the President ...
2,denise ramos,63.0,"president, chief executive officer, director",2011.0,2019.0,2011.0,2019.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. denise ramos,7996,no,Ms. Denise L. Ramos serves as the Chief Execut...
3,orlando ashford,51.0,,,,,,itt inc,itt,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2012.0,us45073v1089,mr. orlando ashford,5733,no,"Orlando D. Ashford, 47, has served as the Pres..."
4,donald defosset,72.0,,,,,,itt inc,itt,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2012.0,us45073v1089,"mr. donald (don) defosset , jr.",2984,no,"Mr. Donald DeFosset, Jr., also known as Don, B..."


## Re-assigning unique_dir_ids to indicate different directors
During the manual review, there were some directors which have the same org_name but are actually different people based on their biographies. Therefore, I will change the unique_dir_id values for these cases where the org_name is the same but the biographies are different, therefore indicating different directors.

In [47]:
# create a helper dataframe
bios_ids = complete_sample_df[['biographies', 'unique_dir_id']]
# check the shape of this dataframe
print(bios_ids.shape)
# drop all duplicate rows in this helper dataframe
bio_ids_unique = bios_ids.drop_duplicates()
# check the new shape
print(bio_ids_unique.shape)


(6595, 2)
(5336, 2)


In [48]:
# how many bios have multiple unique_dir_ids
multiple_dir_ids = bio_ids_unique[bio_ids_unique.duplicated(subset='biographies', keep=False)].sort_values(by='biographies')
multiple_dir_ids.head()


Unnamed: 0,biographies,unique_dir_id
4477,"Dr. Robert W. Lane, also known as Bob, served ...",920
2651,"Dr. Robert W. Lane, also known as Bob, served ...",6514
4251,John W. Adams served as the President and Chie...,4378
3467,John W. Adams served as the President and Chie...,680
1687,"Mr. Edward P. Campbell, also known as Ed, serv...",1079


In [49]:
# drop the duplicates from the multiple_dir_ids
multiple_dir_ids_no_dupes = multiple_dir_ids.drop_duplicates(subset='biographies')
multiple_dir_ids_no_dupes.head()


Unnamed: 0,biographies,unique_dir_id
4477,"Dr. Robert W. Lane, also known as Bob, served ...",920
4251,John W. Adams served as the President and Chie...,4378
1687,"Mr. Edward P. Campbell, also known as Ed, serv...",1079
333,"Mr. H. Lee Scott, Jr. serves as an Executive P...",3676
4956,"Mr. Hughes, Jr. became a member of the Board i...",2155


In [50]:
# re-assign the unique_dir_ids
for id in multiple_dir_ids_no_dupes.index:
    complete_sample_df['unique_dir_id'] = complete_sample_df.apply(lambda x: multiple_dir_ids_no_dupes.loc[id]['unique_dir_id']
                                                                             if x['biographies'] == multiple_dir_ids_no_dupes.loc[id]['biographies']
                                                                             else x['unique_dir_id'], axis=1)


In [51]:
# how many unique_dir_ids have multiple biographies
multiple_bio_ids = bio_ids_unique[bio_ids_unique.duplicated(subset='unique_dir_id', keep=False)].sort_values(by='unique_dir_id')
multiple_bio_ids.head()


Unnamed: 0,biographies,unique_dir_id
6547,William L. Cunningham served as President and ...,1038
6548,"William H. Cunningham, Ph.D. has been a profes...",1038
6544,"David A. Jones, Jr. was initially elected to t...",2808
6559,"Mr. David A. Jones, also known as Dave, III, h...",2808
6530,James J. Johnson was elected a Director of Cin...,3978


In [52]:
# what is the highest unique_dir_id
max_unique_id = max(list(complete_sample_df['unique_dir_id']))
print('Highest unique_dir_id value:', max_unique_id)

# how many unique bios are there
unique_bios = multiple_bio_ids['biographies']
print(len(unique_bios))

# create new unique_dir_ids
new_unique_dir_ids = list(range(max_unique_id+1, max_unique_id+1+len(unique_bios)))
print(new_unique_dir_ids)


Highest unique_dir_id value: 9461
66
[9462, 9463, 9464, 9465, 9466, 9467, 9468, 9469, 9470, 9471, 9472, 9473, 9474, 9475, 9476, 9477, 9478, 9479, 9480, 9481, 9482, 9483, 9484, 9485, 9486, 9487, 9488, 9489, 9490, 9491, 9492, 9493, 9494, 9495, 9496, 9497, 9498, 9499, 9500, 9501, 9502, 9503, 9504, 9505, 9506, 9507, 9508, 9509, 9510, 9511, 9512, 9513, 9514, 9515, 9516, 9517, 9518, 9519, 9520, 9521, 9522, 9523, 9524, 9525, 9526, 9527]


In [53]:
# create helper column with new unique_dir_ids
multiple_bio_ids['new_unique_dir_id'] = new_unique_dir_ids
multiple_bio_ids.head()


Unnamed: 0,biographies,unique_dir_id,new_unique_dir_id
6547,William L. Cunningham served as President and ...,1038,9462
6548,"William H. Cunningham, Ph.D. has been a profes...",1038,9463
6544,"David A. Jones, Jr. was initially elected to t...",2808,9464
6559,"Mr. David A. Jones, also known as Dave, III, h...",2808,9465
6530,James J. Johnson was elected a Director of Cin...,3978,9466


In [54]:
# re-assign the unique_dir_ids
for id in multiple_bio_ids.index:
    complete_sample_df['unique_dir_id'] = complete_sample_df.apply(lambda x: multiple_bio_ids.loc[id]['new_unique_dir_id']
                                                                             if x['biographies'] == multiple_bio_ids.loc[id]['biographies']
                                                                             else x['unique_dir_id'], axis=1)
    

In [55]:
# check whether this assignment worked
len(list(complete_sample_df['unique_dir_id'].unique()))


5321

In [56]:
# check whether this assignment worked
len(list(complete_sample_df['biographies'].unique()))


5321

## Take a look at the committee data
The committee data needs to be merged now, separately from the biography data because of the difficulties of directors having the same name but being different people with different biographies and on different company boards. Therefore, the committe information in the original manually reviewed biography dataset ( `all_bios_comms_df` ) will be merged with the `complete_sample_df`

In [57]:
# create dataframe that only includes the directors with committee memberships
comm_info_df = all_bios_comms_df[all_bios_comms_df['comm_start'].notnull()]
comm_info_df = comm_info_df[['name', 'isin', 'irrelevant',
                             'board_committee', 'committee', 'comm_type',
                             'comm_start', 'comm_end', 'list_years_if_non_consecutive']]
comm_info_df.shape


(1003, 9)

In [58]:
# remove any rows that have 'irrelevant' populated
comm_info_df = comm_info_df[comm_info_df['irrelevant'].isnull()]
print(comm_info_df.shape)

# drop the 'irrelevant' column
comm_info_df.drop(columns='irrelevant', inplace=True)


(999, 9)


In [59]:
# current shape of the complete sample df
print(complete_sample_df.shape)

# merge on name and isin
complete_sample_comm_df = pd.merge(complete_sample_df, comm_info_df, how='left', on=['name', 'isin'])
print(complete_sample_comm_df.shape)


(6595, 23)
(6595, 29)


In [60]:
complete_sample_comm_df.columns

Index(['name', 'age', 'last_position', 'director_start', 'director_end',
       'executive_start', 'executive_end', 'comp_name', 'ticker',
       'missing_start_date', '2011', '2012', '2013', '2014', '2015',
       'current_position', 'dir_exec', 'in_position', 'isin', 'org_name',
       'unique_dir_id', 'all_years', 'biographies', 'board_committee',
       'committee', 'comm_type', 'comm_start', 'comm_end',
       'list_years_if_non_consecutive'],
      dtype='object')

In [61]:
# turn everything to lower case in board_bommittee and committee columns
complete_sample_comm_df['board_committee'] = complete_sample_comm_df['board_committee'].apply(lambda x: x.lower().strip() if not pd.isna(x) else x)
complete_sample_comm_df['committee'] = complete_sample_comm_df['committee'].apply(lambda x: x.lower().strip() if not pd.isna(x) else x)


In [62]:
# how many companies have a CSR related board committee
len(list(complete_sample_comm_df[complete_sample_comm_df['comm_type'].notnull()]['comp_name'].unique()))


146

In [63]:
# create new cols for each committee year
year_list = ['2011', '2012', '2013', '2014', '2015']

# add year 2015 as last year if comm_start populated but comm_end empty
complete_sample_comm_df['comm_end'] = complete_sample_comm_df.apply(lambda x: 2015 if not pd.isna(x['comm_start']) 
                                                                            and pd.isna(x['comm_end']) 
                                                                            else x['comm_end'], axis=1)
# create a column for each year for the committee memberships
for year in year_list:
    complete_sample_comm_df[year+'_comm'] = complete_sample_comm_df.apply(lambda x: 1 if not pd.isna(x['comm_start']) 
                                                                                and int(x['comm_start']) <= int(year) 
                                                                                and int(x['comm_end']) >= int(year)
                                                                                else 0, axis=1)
    
    # in some cases, directors left the committe for a year or two and then join again
    # need to account for values in list_years_if_non_consecutive column
    complete_sample_comm_df[year+'_comm'] = complete_sample_comm_df.apply(lambda x: 0 if not pd.isna(x['list_years_if_non_consecutive'])
                                                                                    and year not in x['list_years_if_non_consecutive']
                                                                                    else x[year+'_comm'], axis=1)



In [64]:
# check whether the re-assignment of committee membership years depending on the 
# value in the 'list_years_if_non_consecutive' column worked
complete_sample_comm_df[complete_sample_comm_df['list_years_if_non_consecutive'].notnull()]


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,unique_dir_id,all_years,biographies,board_committee,committee,comm_type,comm_start,comm_end,list_years_if_non_consecutive,2011_comm,2012_comm,2013_comm,2014_comm,2015_comm
1152,james lash,72.0,independent director,0.0,0.0,2002.0,2017.0,baker hughes co,bkr,0.0,1.0,1.0,1.0,1.0,1.0,,,,us05722g1004,mr. james lash,4119,no,Chairman of Manchester Principal LLC and its p...,yes,governance and hs&e committee,"social, environmental",2011.0,2015.0,"2011, 2012, 2014, 2015",1,1,0,1,1
1159,lynn elsenhans,63.0,,,,,,baker hughes co,bkr,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2017.0,us05722g1004,ms. lynn elsenhans,8411,no,"Lynn L. Elsenhans, age 59, has served as a dir...",yes,governance and hs&e committee,"social, environmental",2012.0,2015.0,"2012, 2014, 2015",0,1,0,1,1
1299,michael o'neill,71.0,independent chairman of the board,2012.0,2019.0,2009.0,2019.0,citigroup inc,c,0.0,1.0,1.0,1.0,1.0,1.0,,,,us1729674242,mr. michael o'neill,5561,no,"Mr. Michael E. O'Neill, also known as Mike, se...",yes,"nomination, governance and public affairs comm...","social, environmental",2012.0,2015.0,"2012, 2014, 2015",0,1,0,1,1
1346,juan gallardo,72.0,,,,,,caterpillar inc,cat,,1.0,1.0,1.0,1.0,1.0,independent director,1998.0,1998.0,us1491231015,mr. juan gallardo,4820,no,Mr. Gallardo is the Chairman of Organización C...,yes,public policy committee,"social, environmental",2011.0,2015.0,"2011, 2013, 2014, 2015",1,0,1,1,1
1902,david dewalt,56.0,,,,,,delta air lines inc,dal,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2012.0,us2473617023,mr. david (dave) dewalt,2704,no,"Mr. David G. DeWalt, also known as Dave, has b...",yes,"['audit', 'csrresponsibility']",social,2011.0,2015.0,"2011, 2012, 2015",1,1,0,0,1
2452,dustan mccoy,70.0,,,,,,freeport-mcmoran inc,fcx,,1.0,1.0,1.0,1.0,1.0,lead independent director,2007.0,2020.0,us35671d8570,"mr. dustan mccoy , j.d.",3083,no,Mr. Dustan E. McCoy served as the Chairman and...,yes,"['audit', 'csrresponsibility', 'compensation']","social, environmental",2011.0,2015.0,"2011, 2012, 2015",1,1,0,0,1
2740,kathryn marinello,64.0,independent director,0.0,0.0,2009.0,2016.0,general motors co,gm,0.0,0.0,1.0,1.0,1.0,1.0,,,,us37045v1008,ms. kathryn (kathy) marinello,8293,no,Ms. Marinello has served as Senior Advisor of ...,yes,public policy committee; governance and corpor...,"social, environmental",2011.0,2015.0,"2011, 2012, 2013, 2015",1,1,1,0,1
3462,ronald rogers,69.0,independent director,0.0,0.0,2008.0,2018.0,keurig dr pepper inc,kdp,0.0,1.0,1.0,1.0,1.0,1.0,,,,us49271v1008,mr. ronald rogers,6682,no,"Mr. Rogers, age 67, has served as one of our d...",yes,corporate governance and nominating committee,"social, environmental",2013.0,2015.0,"2013, 2015",0,0,1,0,1
3707,hansel tookes,71.0,independent director,0.0,0.0,2005.0,2019.0,l3harris technologies inc,lhx,0.0,1.0,1.0,1.0,1.0,1.0,,,,us5024311095,"mr. hansel tookes , ii",3696,no,"Mr. Tookes retired from Raytheon Company, a co...",yes,business conduct and corporate responsibility ...,"social, environmental",2011.0,2015.0,"2011, 2015",1,0,0,0,1
4120,carlos represas,74.0,independent director,0.0,0.0,2009.0,2018.0,merck & co inc,mrk,0.0,1.0,1.0,1.0,1.0,1.0,,,,us58933y1055,mr. carlos represas,2371,no,"In deciding to nominate Mr. Represas, the Boar...",yes,public policy and social responsibility commit...,"social, environmental",2011.0,2015.0,"2011, 2013",1,0,1,0,0


In [65]:
# final overview of the complete sample dataframe including committee data
complete_sample_comm_df.head()


Unnamed: 0,name,age,last_position,director_start,director_end,executive_start,executive_end,comp_name,ticker,missing_start_date,2011,2012,2013,2014,2015,current_position,dir_exec,in_position,isin,org_name,unique_dir_id,all_years,biographies,board_committee,committee,comm_type,comm_start,comm_end,list_years_if_non_consecutive,2011_comm,2012_comm,2013_comm,2014_comm,2015_comm
0,christina gold,72.0,independent director,0.0,0.0,1997.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. christina gold,7917,no,Mrs. Christina A. Gold has been the Chief Exec...,,,,,,,0,0,0,0,0
1,frank macinnis,72.0,independent chairman of the board,2011.0,2020.0,2001.0,2020.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,mr. frank macinnis,3325,no,Mr. Frank T. MacInnis serves as the President ...,,,,,,,0,0,0,0,0
2,denise ramos,63.0,"president, chief executive officer, director",2011.0,2019.0,2011.0,2019.0,itt inc,itt,0.0,1.0,1.0,1.0,1.0,1.0,,,,us45073v1089,ms. denise ramos,7996,no,Ms. Denise L. Ramos serves as the Chief Execut...,,,,,,,0,0,0,0,0
3,orlando ashford,51.0,,,,,,itt inc,itt,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2012.0,us45073v1089,mr. orlando ashford,5733,no,"Orlando D. Ashford, 47, has served as the Pres...",,,,,,,0,0,0,0,0
4,donald defosset,72.0,,,,,,itt inc,itt,,0.0,1.0,1.0,1.0,1.0,independent director,2012.0,2012.0,us45073v1089,"mr. donald (don) defosset , jr.",2984,no,"Mr. Donald DeFosset, Jr., also known as Don, B...",,,,,,,0,0,0,0,0


In [66]:
# write the complete sample to a csv file
complete_sample_comm_df.to_excel('/content/drive/My Drive/director-csr/complete_sample.xlsx',
                                sheet_name='complete_sample')
