# Parse Authors

This notebook contains a set of scripts that convert my publication and conference presentation lists into a standardized format that can be brought back into my `CV.numbers` file. This will make the spreadsheet easier to maintain going forward and pave the way for a collaborators file, which will make it possible to format my CV properly and to automatically generate an NSF COI file.

## Step 1.

Let's split the existing conference author lists into columns, one author per column, with standardized author names. We will keep any information we get about author names, though, and start building a collaborator dataframe as well (which we will use more extensively later).

There are four types of author lists:

1. Author1FirstName Author1LastName, Author2FirstName Author2LastName
1. Author1LastName, Author1Initials and Author2LastName, Author2Initials
1. Author1LastName, Author1Initials; Author2LastName, Author2Initials;
1. Author1LastName, Author1Initials., Author2LastName, Author2Initials.,


So the separator is one of `;`, ` and`, `.,`, or `,`. We should attempt each split in turn and use the first one that yields more than one author.
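The "try each split" idea can be sketched on its own before looking at the full parser. This is a minimal illustration, not the notebook's implementation: `pick_separator` is a hypothetical helper, and the example strings are made up. The plain `,` is left out of the candidates on purpose, because it also appears inside every "LastName, Initials" entry and cannot be told apart from a separator without more context.

```python
# Hypothetical helper for illustration only; not part of the notebook's code.
SEPARATORS = [';', ' and', '.,']  # candidates, tried in priority order


def pick_separator(author_string, separators=SEPARATORS):
    """Return the first separator that yields 2+ non-empty pieces, else None."""
    for sep in separators:
        # Ignore empty pieces left behind by a trailing separator.
        pieces = [p for p in author_string.split(sep) if p.strip()]
        if len(pieces) > 1:
            return sep
    return None


# Made-up author strings, one per list format.
examples = [
    "Caylor, K.K.; Smith, J.;",
    "Caylor, K.K. and Smith, J.",
    "Caylor, K.K., Smith, J.,",
    "Kelly Caylor, John Smith",
]
for example in examples:
    print(repr(pick_separator(example)), '<-', example)
```

The last example returns `None`: a list separated only by plain commas is not detected by this ordering and would need separate handling.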

In [183]:
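def split_authors(author_string, seq=None):
    """
    Splits a string containing an author list into a list of individual authors.
    Tries to do the right thing always.

    PARAMETERS
    ==========

        author_string = string containing the full author list
        seq = ';', ',' (optional) separator to split on. If given, the raw
              split strings are returned; otherwise each separator in
              seq_order is tried and a list of Name tuples is returned.
    """
    seq_order = [';', ' and', '.,']
    if seq:
        return author_string.split(seq)
    else:
        for seq in seq_order:
            author_list = author_string.split(seq)
            if len(author_list) > 1:
                # Skip empty pieces left behind by a trailing separator.
                author_tuples = [get_name(author) for author in author_list if author.strip()]
                return author_tuples
    # No separator matched (e.g. a single author): nothing to split.
    return None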

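def get_name(name_string):
    """
    Splits a string containing an author name into a named tuple of
    (First, Last, First_Initial, Second_Initial).
    Tries to do the right thing always. Returns None for "and others" entries.

    PARAMETERS
    ==========

        name_string = string containing a single author name
    """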
    from collections import namedtuple
    # Try to split on comma. If possible, this means we have either:
    # LastName, Initial
    # LastName, Initials
    # LastName, FirstName
    # LastName, FirstName Initial
    Name = namedtuple('Name', ['First','Last','First_Initial','Second_Initial'])
    last_name = None
    first_name = None
    first_initial = None
    second_initial = None
    if 'others' in name_string.lower():
        return None
    name_parts = [part.strip() for part in name_string.split(',')]
    if len(name_parts) > 1:
        last_name = name_parts[0]
        # Now let's see if the remainder is a name or initials:
        if name_parts[1].isupper():
            # We have initials. But how many?
            if len(name_parts[1]) > 1:
                # Let's first try to split if we can:
                initials = name_parts[1].split('.')
                if len(initials) > 1:
                    first_initial = initials[0].strip()
                    second_initial = initials[1].strip()
                else:
                    first_initial = name_parts[1][0]
                    second_initial = name_parts[1][1:]
            else:
                first_initial = name_parts[1]
                second_initial = None
        else:
            # It must be a full name. But maybe there is still an initial?
            # Let's try to split on spaces:
            first_names = name_parts[1].split()
            if len(first_names) > 1:
                # We have a first name and an initial (or second name??)
                first_name = first_names[0]
                first_initial = first_name[0]
                second_initial = first_names[1][0]
            else:
                first_name = name_parts[1]
                first_initial = first_name[0]
                second_initial = None
    else: # We have a regular old name.
        # split() with no argument drops leading/trailing and repeated spaces.
        name_parts = name_string.split()
        first_name = name_parts[0]
        last_name = name_parts[-1]
        first_initial = first_name[0]
        if len(name_parts) > 2:
            second_initial = name_parts[1][0]

    name = Name(
            First=first_name,
            Last=last_name,
            First_Initial=first_initial,
            Second_Initial=second_initial)
    return name

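def format_name(name, style='citation'):
    """
    Formats a name tuple into a standard name format.

    style = 'citation' gives "Last, F.S."; style = 'search' gives
    "First S. Last" (or "F. Last" when no first name is known).
    Returns None if name is None or the style is not recognized.
    """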
    name_string = None
    if name is not None:
        if style == 'citation':
            if not name.Second_Initial:
                name_string = name.Last + ', ' +  name.First_Initial + '.'
            else:
                name_string = name.Last + ', ' + name.First_Initial + '.' + name.Second_Initial + '.'
        elif style == 'search':
            if not name.First:
                name_string = name.First_Initial + '. ' + name.Last
            else:
                name_string = name.First
                if not name.Second_Initial:
                    name_string = name_string + ' ' + name.Last
                else:
                    name_string = name_string + ' ' + name.Second_Initial + '. ' + name.Last
    return name_string

In [120]:
import sys
sys.path.append('/Users/kellycaylor/Documents/dev/biobib')
from tables import Proceedings

In [121]:
table = Proceedings(csv_file='CV/Conference Abstracts-Table.csv', cumulative=True)

In [122]:
author_tuples = []
author_count = []
authors = table.df['Authors']
for author_list in authors:
    author_tuple = split_authors(author_list)
    author_num = len(author_tuple) if author_tuple else 0
    author_tuples.append(author_tuple)
    author_count.append(author_num)

In [123]:
table.df['author_tuples'] = author_tuples
table.df['author_count'] = author_count

## Assign Authors to Author Columns

This would seem to be the easiest way to assign authors, although I need to think about it a little more.


In [145]:
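def assign_author(authors, n=0, format='tuple'):
    # Return the n-th author (as a Name tuple or a formatted string),
    # or None when the list is empty or has fewer than n+1 authors.
    if authors:
        try:
            if format == 'tuple':
                return authors[n]
            else:
                return format_name(authors[n])
        except IndexError:
            return None
    return None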

In [146]:
author_columns = ['A'+str(i+1) for i in range(max(table.df['author_count']))]

for i, author_column  in enumerate(author_columns):
    table.df[author_column] =  table.df['author_tuples'].apply(lambda x: assign_author(x,n=i))
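
The column assignment above amounts to padding every author list to the length of the longest one. Here is a minimal pure-Python sketch of that padding step on made-up data; `pad`, `rows`, and `records` are hypothetical names used only for illustration.

```python
# Toy data (made up for illustration): one author list per row, possibly missing.
rows = [
    ['Caylor, K.K.', 'Smith, J.', 'Doe, A.'],
    ['Smith, J.'],
    None,
]


def pad(authors, width):
    """Return the author list padded with None up to `width` entries."""
    authors = list(authors or [])
    return authors + [None] * (width - len(authors))


# The longest list decides how many A1..An columns are needed.
width = max(len(r) if r else 0 for r in rows)
columns = ['A' + str(i + 1) for i in range(width)]

# One dict per row, keyed by column name; short rows are filled with None.
records = [dict(zip(columns, pad(r, width))) for r in rows]
for record in records:
    print(record)
```

A list of dicts like `records` can be passed to `pd.DataFrame(records)` to get the same one-author-per-column layout.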

### Search for an author using `scholarly`



In [184]:
from scholarly import scholarly
print(table.df['A2'][0])
name = format_name(table.df['A2'][0], style='search')
search_query = scholarly.search_author(name)

Name(First=None, Last='Caylor', First_Initial='K', Second_Initial='K')


In [185]:
this_author = next(search_query).fill()

In [187]:
print(name)

K. Caylor


In [157]:
print(author)

Name(First=None, Last='Caylor', First_Initial='K', Second_Initial='K')


In [158]:
print([pub.bib['title'] for pub in this_author.publications])

In [200]:
import pandas as pd
pubs = pd.DataFrame([pub.bib for pub in this_author.publications])

In [211]:
this_year = pubs[pubs['year'] == '2020']
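
The filter above matches only rows where `year` is stored as the string `'2020'`. Whether the field arrives as a string or an integer is an assumption that may differ between data sources, so a more forgiving comparison normalizes both sides to strings first. Here is a minimal sketch on made-up bib dictionaries; `published_in` is a hypothetical helper.

```python
# Made-up bib entries for illustration; real entries would come from `pub.bib`.
bibs = [
    {'title': 'Paper A', 'year': '2020'},
    {'title': 'Paper B', 'year': 2020},
    {'title': 'Paper C', 'year': '2019'},
    {'title': 'Paper D'},  # no year recorded
]


def published_in(bibs, year):
    """Return entries whose 'year' matches `year`, comparing as strings."""
    wanted = str(year).strip()
    return [b for b in bibs if str(b.get('year', '')).strip() == wanted]


print([b['title'] for b in published_in(bibs, 2020)])
```

The same idea in pandas would be `pubs[pubs['year'].astype(str) == '2020']`.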