# Process Names: Get Author names from the dataset

In this notebook we create three functions: `load_data`, `get_author_names_list`, and `extract_names`.

## Loading the Data

In [None]:
#| default_exp process_names

In [3]:
#| export

import pandas as pd
import pprint
import re
import warnings

In [3]:
#| export

#Reading Excel file with pandas and choosing the sheet we want to work with
usecols = ['id', 'title', 'contact_email', 'contact_author_name', 'doi', 'author_names']

df = pd.read_excel(open('Catalogdatabase-till2018b.xlsx', 'rb'), sheet_name='publication', usecols=usecols)

# Practice functions with a small subset of the entire df
df_small = df.head()

#Displaying Relevant fields we'll work with

df_small

Unnamed: 0,id,title,contact_email,contact_author_name,doi,author_names
0,1.0,A river system modelling platform for Murray-D...,ang.yang@csiro.au,Ang Yang,10.2166/hydro.2012.153,"['Geoff Podger', 'Robert Power', 'Shane Seaton..."
1,2.0,Impact of Regulation and Network Topology on E...,lei@umd.edu,Lei Zhang,10.3141/2297-21,"['Dilya Yusufzyanova', 'Lei Zhang']"
2,3.0,Simulating Rural Environmentally and Socio-Eco...,msqalli@yahoo.com,Mehdi Saqalli,,"['Charles L. Bielders', 'Pierre Defourny', 'Br..."
3,4.0,A preliminary test of Hunt's General Theory of...,tay@udayton.edu,N.S.P Tay,10.1016/j.jbusres.2004.04.005,"['RF Lusch', 'NSP Tay']"
4,5.0,Human birthweight evolution across contrasting...,fthomas@mpl.ird.fr,FrÃ©dÃ©ric Thomas,10.1111/j.1420-9101.2004.00705.x,"['SP Brown', 'EV Budilova', 'JF Guegan', 'F Re..."


That looks really good, but can we turn that into a function so we don't have to call that every time we want to use the dataset in the notebook?

In [12]:
#| export
# TODO: write a function to load the data. Create a parameter called `small` that let's us chose whether to
# return the whole dataframe or just the head (DONE)

def load_data(small=False):
    #Reading Excel file with pandas and choosing the sheet we want to work with
    usecols = ['id', 'title', 'contact_email', 'contact_author_name', 'doi', 'author_names']
    
    df = pd.read_excel(open('Catalogdatabase-till2018b.xlsx', 'rb'), sheet_name='publication', usecols=usecols)   
    

    #Small or full data frame
    if small:
        return df.tail()
    else:
        return df

In [13]:
load_data(small = True)

Unnamed: 0,id,citations,mod,title,abstract,short_title,contact_email,email_sent_count,contact_author_name,is_primary,...,Sponsor (United States National Aeronautics and Space Administration (NASA)),Sponsor (United States National Institutes of Health (NIH)),Sponsor (United States National Oceanic and Atmospheric Administration (NOAA)),Sponsor (United States National Science Foundation (NSF)),Sponsor (United States Office of Naval Research (ONR)),Sponsor (Wellcome Trust),Sponsor Other,platformentioned,sponsormentioned,platform
7497,291375.0,0.0,57.0,Modelling the evolution of periodicity in the ...,Background: Periodical cicadas (Magicicada spp...,,jaakko.toivonen@alumni.helsinki.fi,0.0,Jaakko Toivonen,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,2.0
7498,291390.0,0.0,28.0,Rethinking Microfinance in a Dual Financial Sy...,Critics concerning the real impact of traditio...,,sarabourhime@gmail.com,0.0,Sara Bourhime,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
7499,291464.0,0.0,60.0,The Impact of Out-of-Stocks and Supply Chain D...,"In today's competitive environment, consumers ...",,whipple@broad.msu.edu,0.0,Judith M. Whipple,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7500,,,,,,,,,,,...,,,,,,,,,,
7501,,,,,,,,,,,...,,,,,,,,,,


In [None]:
# get the first row
single_publication = df_small.loc[0]
single_publication

In [None]:
# get the author list from the first row
author_names = single_publication['author_names']
author_names

In [None]:
# remove all the brackets and single quotes
_author_names = author_names.strip("[]").replace("'", "")
_author_names

In [None]:
# Split at ',' to get a list
_author_names = _author_names.split(', ')
_author_names

In [None]:
#| export
def get_author_names_list(author_names):
    author_names = author_names.strip("[]").replace("'", "")
    author_names_list = author_names.split(', ')
    return author_names_list

In [None]:
author_names = single_publication['author_names']
author_names_list = get_author_names_list(author_names)
author_names_list

In [None]:
# grab a single author name from the list of authors
single_author = author_names_list[3]
single_author

In [None]:
## Extract the first, middle, and last name
names = single_author.split(' ')
first_name = names[0]
last_name = names[-1]
middle_name = ' '.join(names[1:-1]) if len(names) > 2 else None

(last_name, first_name, middle_name)

That looks good! Let's put all that logic in a function we can reuse.

In [None]:
def extract_names(full_name):
    names = full_name.split(' ')
    first_name = names[0]
    last_name = names[-1]
    middle_name = ' '.join(names[1:-1]) if len(names) > 2 else None
    return (last_name, first_name, middle_name)

In [None]:
extract_names(single_author)

Now let's try that function on an author name that has a different format.

In [None]:
first_author_in_fourth_pub = get_author_names_list(df_small.loc[4]['author_names'])[0]
first_author_in_fourth_pub

In [None]:
extract_names(first_author_in_fourth_pub)

That didn't really do it. Let's try again.

In [None]:
def extract_names(full_name):
    names = full_name.split()
    first_name = names[0][0]
    middle_name = names[0][1:] if len(names[0]) > 1 else None
    last_name = names[-1]
    return (last_name, first_name, middle_name)

In [None]:
extract_names(first_author_in_fourth_pub)

Much better! What about Ang Yang? 

In [None]:
extract_names(single_author)

Not exactly. Try writing a function that works for both cases. Let's write some test cases to check against every time we iterate on the function we come up with. We will do this with `assert` statements, which are this simplest way to do unit testing. The code below will throw an error if the code after the `assert` keyword returns False. Let's try the 'SP Brown' example first since we know it's working. 

In [None]:
assert extract_names('SP Brown') == ('Brown', 'S', 'P'), "The extract_names function isn't working as expected. Run extract_names('SP Brown') in it's own cell to see what happened!"

Okay great it ran as expected! Now let's try an assert statement on something that doesn't run as expected. It will throw an error with a message that matches the string to the right of the assert statement (just after the last comma).

In [None]:
assert extract_names('Ang Yang') == ('Yang', 'Ang', None), "The extract_names function isn't working as expected. Run extract_names('Ang Yang') in it's own cell to see what happened!"

Okay! Now we know how to write assert stamements to check our code ever time we make a small change! That's awesome because it means we don't have to go through all those cells before to change our workflow a little and try another name. Instead, we can just make some up!

In [None]:
"""
# TODO: make sure this function works for the SP Brown and Ang Yang cases DONE 
def extract_names(full_name):
    # Check for any name with first two capital letters
    pattern_first_two_capital = re.compile(r'^([A-Z])([A-Z])\s+(.*)$')
    match_first_two_capital = pattern_first_two_capital.match(full_name)

    if match_first_two_capital:
        first_name = match_first_two_capital.group(1)
        middle_name = match_first_two_capital.group(2)
        last_name = match_first_two_capital.group(3)
    else:
        # Fallback to the original splitting
        names = full_name.split(' ')
        first_name = names[0]
        middle_name = ' '.join(names[1:-1]) if len(names) > 2 else None
        last_name = names[-1]

    return (last_name, first_name, middle_name) # note that I changed the order here"""

In [None]:
assert extract_names('SP Brown') == ('Brown', 'S', 'P')
assert extract_names('Ang Yang') == ('Yang', 'Ang', None)
assert extract_names('Ang F Yang') == ('Yang', 'Ang', 'F')

That looks great! But it still doesn't account for a lot of the funny edge cases we need to expect. Let's make sure that multiple middle names or initials are accounted for. Let's expand on the `extract_names` function you wrote above so that it accounts for more cases. Just below this function we will have some assert statements so you can check your work.

**Note**: for this version we will need to return more middle names. Let's assume that there are never more than three middle names. 

In [12]:
# TODO: copy-paste your function from above and modify it so that it accounts for new edge cases
    # Check for any name with first two capital letters

def extract_names(full_name):
    # Check for any name with first two capital letters
    pattern_first_two_capital = re.compile(r'^([A-Z])([A-Z])\s+(.*)$')
    match_first_two_capital = pattern_first_two_capital.match(full_name)

    if match_first_two_capital:
        last_name = match_first_two_capital.group(3)
        first_name = match_first_two_capital.group(1)
        middle_initials = match_first_two_capital.group(2)
    else:
        # Fallback to the original splitting
        names = full_name.split(' ')
        last_name = names[-1]
        first_name = names[0]
       
        ##NO middle_initials = ''.join(names[1:-1]) if len(names) > 2 else None

    # Extract individual initials from the middle initials
    if middle_initials:
        middle_initials_list = [initial.upper() for initial in middle_initials]
    else:
        middle_initials_list = []

    while len(middle_initials_list) < 2:
        middle_initials_list.append(None)

    return (last_name, first_name, *middle_initials_list)

# Test cases
print(extract_names('S LF Burnett'))


UnboundLocalError: local variable 'middle_initials' referenced before assignment

In [18]:
#| export

# THIS ONE!!!

def extract_names(full_name):

    full_name = full_name.replace('.','')
    names = re.sub( r"([A-Z])", r" \1", full_name).split()
    #print(names)
    
    last_name = names[-1]
    first_name = names[0]
    
    middle_name1 = None
    middle_name2 = None 
    middle_name3 = None

    if len(names) > 2:
        middle_name1 = names[1]
        if len(names) > 3:
            middle_name2 = names[2]
            if len(names) >4:
                middle_name3 = names[3]

    return (last_name, first_name, middle_name1, middle_name2, middle_name3)

    
#print(extract_names('Ang G.B. Burnett'))


### Downside of the re. package is that you can't check for edge cases such as MBtt and raise a warning

In [17]:
extract_names('Ang G.B. Burnett')

['Ang', 'G', 'B', 'Burnett']


('Burnett', 'Ang', 'G', 'B', None)

In [14]:
full_name = "Ang GB Burnett"

names = re.sub( r"([A-Z])", r" \1", full_name)
names[0]

' '

In [10]:
str("Ang G B. Burnett").replace('.','')

'Ang G B Burnett'

In [11]:
full_name = "Ang G B Yang"

names = full_name.split(' ')
print(names)
last_name = names[-1]

first_name = names[0]

middle_initials = ''.join(names[1:-1]) # if len(names) > 2 else None
print(middle_initials)

['Ang', 'G', 'B', 'Yang']
GB


In [None]:
assert extract_names('SP Brown') == ('Brown', 'S', 'P', None, None)
assert extract_names('Ang Yang') == ('Yang', 'Ang', None, None, None)
assert extract_names('RBBF Burnett') == ('Burnett', 'R', 'B', 'B', 'F')
assert extract_names('Rebecca BB Burnett') == ('Burnett', 'Rebecca', 'B', 'B', None)

In [None]:
# TODO: create some of your own assert statements to check funny edge cases
assert extract_names('Brown P. Brown') == ('Brown', 'Brown', 'P', 'P', None)
with warnings.catch_warnings(record = True) as w:
    extract_names('SMiddle Last')
    assert len(w) > 1

Nice! Now let's use that return value to create a new Author object.

## Review

Let's put everything we did all together.

In [None]:
single_publication = df.loc[0]
single_publication

In [None]:
author_names = single_publication['author_names']
author_names

In [None]:
author_names_list = get_author_names_list(author_names)
author_names_list

In [None]:
first_author = author_names_list[0]

In [None]:
last, first, middle, middle2, middle3 = extract_names(first_author)
(last, first, middle)

In [1]:
# TODO: when you are all finished...
# 1. Restart kernel and clear all outputs
# 2. save this notebook
# 3. Run this cell

from nbdev.export import nb_export
nb_export('process_names.ipynb', 'preprocessing')