# Create all Author and Publication Objects

In this module, we will define a Publication object which is associated with each row of the Excel sheet. Each publication is also associated with several authors. We create an Author object for each author, including the contact author, which has an associated email. If we encounter an author with the same name or initials as an author we've already created, we assume they are the same person and combine their data. Analysis in another notebook will show that this oversimplified approach may be "good enough" for our purposes. Finally, we pickle all of these objects so they can be used by another repository for network analysis.

In [4]:
#| default_exp objects

In [5]:
#| export
from preprocessing.process_names import load_data, get_author_names_list, extract_names
from preprocessing.author import Author
import pandas as pd

## Load Excel Data

The first step is to load our Excel data. The database contains 7501 publications and around 25,000 (undisambiguated) authors, so processing the whole list takes several minutes. To run an example on only the first five entries, pass `small=True` in the call below.


In [6]:
#| export
df = load_data(small=False) # small=True if you only want the first 5 entries
df.head()

Unnamed: 0,id,title,contact_email,contact_author_name,doi,author_names
0,1.0,A river system modelling platform for Murray-D...,ang.yang@csiro.au,Ang Yang,10.2166/hydro.2012.153,"['Geoff Podger', 'Robert Power', 'Shane Seaton..."
1,2.0,Impact of Regulation and Network Topology on E...,lei@umd.edu,Lei Zhang,10.3141/2297-21,"['Dilya Yusufzyanova', 'Lei Zhang']"
2,3.0,Simulating Rural Environmentally and Socio-Eco...,msqalli@yahoo.com,Mehdi Saqalli,,"['Charles L. Bielders', 'Pierre Defourny', 'Br..."
3,4.0,A preliminary test of Hunt's General Theory of...,tay@udayton.edu,N.S.P Tay,10.1016/j.jbusres.2004.04.005,"['RF Lusch', 'NSP Tay']"
4,5.0,Human birthweight evolution across contrasting...,fthomas@mpl.ird.fr,Frédéric Thomas,10.1111/j.1420-9101.2004.00705.x,"['SP Brown', 'EV Budilova', 'JF Guegan', 'F Re..."


In a previous module, we defined the Author class, but we also want a Publication object that we can associate with multiple authors. Our end goal is to make a graph with multiple representations, so we will build a list of authors, each holding its associated publications, and a list of publications that point to their associated authors.

In [7]:
#| export
class Publication:
    def __init__(self, id, title, doi):
        self.id = id
        self.title = title
        self.doi = doi
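
The Publication class above stores no authors itself, so one way to get the publication-to-author direction is to invert the authors' publication lists. The following is only a minimal sketch under that assumption: `SketchAuthor`, `SketchPublication`, and `authors_by_publication` are hypothetical names used for illustration, not part of the repository, and the title and names are toy values.

```python
from collections import defaultdict


class SketchAuthor:
    """Hypothetical stand-in for preprocessing.author.Author."""

    def __init__(self, name):
        self.name = name
        self.publications = []


class SketchPublication:
    """Stand-in with the same three fields as Publication above."""

    def __init__(self, id, title, doi):
        self.id = id
        self.title = title
        self.doi = doi


def authors_by_publication(authors):
    """Invert author -> publications into publication id -> author names."""
    index = defaultdict(list)
    for author in authors:
        for publication in author.publications:
            index[publication.id].append(author.name)
    return dict(index)


# Toy data: one publication shared by two authors
pub = SketchPublication(1, "Toy title", "10.0000/toy")
first, second = SketchAuthor("Jane Doe"), SketchAuthor("Mary Lou")
for author in (first, second):
    author.publications.append(pub)

lookup = authors_by_publication([first, second])
```

Keeping the link on the author side and deriving the other direction on demand avoids having to keep two sets of references in sync, at the cost of one pass over the author list.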

## Creating Publications and Authors

In the main loop, we go through all the rows in the dataframe, create a Publication object from the `id`, `title`, and `doi` columns, and create an Author object for each author in the `author_names` column. For each row, we
1. create a publication object
2. parse and split the `author_names` string into a list of author names
3. add the author to the list
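
Step 2 can be sketched as follows. The `df.head()` output suggests that `author_names` holds a stringified Python list, so this sketch assumes that format and parses it with `ast.literal_eval`. The helper name `parse_author_names` is hypothetical; the notebook itself uses `get_author_names_list`, whose implementation may differ.

```python
import ast


def parse_author_names(raw):
    """Turn a stringified list such as "['A B', 'C D']" into a list of names."""
    # Missing values (e.g. NaN) and empty lists yield no names
    if not isinstance(raw, str) or raw.strip("[]' ") == "":
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal; treat as unparseable
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(name).strip() for name in parsed if str(name).strip()]


names = parse_author_names("['Dilya Yusufzyanova', 'Lei Zhang']")
```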

Adding the author to the list is complicated by the fact that the Author could already be in the list, in which case we only need to *update* the author already in the list instead of appending a new one. We have extracted this part into its own function below.

In [8]:
#| export
# If an author with the same name is already in the list, combine their info;
# otherwise, append the new author to the list.

def add_author_in_list(author_list, new_author):
    for existing_author in author_list:
        if new_author == existing_author:
            # combine name information from both representations
            existing_author.merge_names(new_author)
            # combine emails
            existing_author.add_contact_author_info(new_author)
            # combine publications
            existing_author.publications.extend(new_author.publications)
            return
    # no match found: add new_author to the list
    author_list.append(new_author)

We can check that this function works on its own. We will create two authors, Jane and Mary.

In [None]:
mary = Author('Lou', 'Mary')
jane = Author('Doe', 'Jane')
toy_list = [jane, mary]
toy_list

Next, we create another Author object, `mary2`, whose name matches Mary's by initial. When we add `mary2` to `toy_list`, we expect the list not to grow, because the two Mary objects are merged. The clearest sign of the merge is that Mary's name is updated to combine both representations, so that as little information as possible is lost.

In [None]:
mary2 = Author('Lou', 'M', 'R', emails=['ml@asu.edu'])

assert mary.full_name() == 'Mary Lou'
assert mary2.full_name() == 'M R Lou'

In [None]:
add_author_in_list(toy_list, mary2)

assert mary.full_name() == 'Mary R Lou' # we can see both versions of mary were combined

In [None]:
mary.emails # and that mary is now associated with mary2's email
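
The toy example above depends on how the real `Author` class defines `==`. For readers without the repository at hand, here is a self-contained sketch of the same merge-or-append idea. Everything in it is an assumption made for illustration: `SketchAuthor`, its prefix-based matching rule, and `add_or_merge` are hypothetical and simpler than the actual `Author` logic.

```python
class SketchAuthor:
    """Hypothetical stand-in; the real class lives in preprocessing.author."""

    def __init__(self, last_name, first_name, emails=None):
        self.last_name = last_name
        self.first_name = first_name
        self.emails = list(emails or [])
        self.publications = []

    def __eq__(self, other):
        # Assumed rule: same last name, and one first name is a prefix
        # of the other (so 'M' matches 'Mary').
        if not isinstance(other, SketchAuthor):
            return NotImplemented
        if self.last_name.lower() != other.last_name.lower():
            return False
        a, b = self.first_name.lower(), other.first_name.lower()
        return a.startswith(b) or b.startswith(a)

    def merge(self, other):
        # Keep the longer first name and the union of emails and publications
        if len(other.first_name) > len(self.first_name):
            self.first_name = other.first_name
        self.emails += [e for e in other.emails if e not in self.emails]
        self.publications += other.publications


def add_or_merge(authors, new_author):
    """Merge new_author into its first match, or append it if there is none."""
    for existing in authors:
        if existing == new_author:
            existing.merge(new_author)
            return existing
    authors.append(new_author)
    return new_author


authors = [SketchAuthor("Doe", "Jane"), SketchAuthor("Lou", "Mary")]
merged = add_or_merge(authors, SketchAuthor("Lou", "M", emails=["ml@asu.edu"]))
```

Under this simplified rule, two different people who share a last name and a first initial would also be merged, which is the same trade-off the notebook accepts.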

Okay, time to put this all to use. Next, let's initialize the lists that all the publications and authors get loaded into.

In [9]:
#| export
publication_list = []
author_list = []

We also keep a few counters, both to check that we create the expected number of publications and authors and to learn a little more about the data.

In [10]:
#| export 
num_no_authors = 0
num_no_publication = 0

Let's get into the weeds now. If you aren't using the "small" dataframe, this will take a few minutes since we are searching for matches with brute force. 
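
To make "brute force" concrete: each new author is compared against every author already in the list, so the number of comparisons grows roughly quadratically with the number of authors. One possible way to cut this down, which the notebook does not use, is to group authors by last name so that each lookup scans only one bucket. The sketch below assumes that two equal authors always share the same last name, which may not hold for the real `Author.__eq__`. `SketchAuthor` and `AuthorIndex` are hypothetical names.

```python
from collections import defaultdict


class SketchAuthor:
    """Hypothetical stand-in for the real Author class (not its actual API)."""

    def __init__(self, last_name, first_name):
        self.last_name = last_name
        self.first_name = first_name
        self.publications = []

    def __eq__(self, other):
        # Assumed matching rule: same last name, and one first name
        # is a prefix of the other (so 'M' matches 'Mary').
        if not isinstance(other, SketchAuthor):
            return NotImplemented
        if self.last_name.lower() != other.last_name.lower():
            return False
        a, b = self.first_name.lower(), other.first_name.lower()
        return a.startswith(b) or b.startswith(a)


class AuthorIndex:
    """Group authors by lowercased last name so each lookup scans one bucket."""

    def __init__(self):
        self.buckets = defaultdict(list)
        self.comparisons = 0  # number of == checks performed so far

    def add(self, new_author):
        bucket = self.buckets[new_author.last_name.lower()]
        for existing in bucket:
            self.comparisons += 1
            if existing == new_author:
                existing.publications.extend(new_author.publications)
                return existing
        bucket.append(new_author)
        return new_author

    def authors(self):
        return [author for bucket in self.buckets.values() for author in bucket]


index = AuthorIndex()
for last, first in [("Doe", "Jane"), ("Lou", "Mary"), ("Smith", "Al"), ("Lou", "M")]:
    index.add(SketchAuthor(last, first))
```

In this toy run, only the second "Lou" triggers a comparison, because the other authors each land in an empty bucket.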

In [11]:
#| export 
for index, row in df.iterrows():
    
    # Process the row only if it has a title or a DOI
    if (row['title'] or row['doi']):
        author_row_list = [] #List of authors in each publication; author Object
        # create a new publication object
        publication = Publication(id=row['id'], title=row['title'], doi=row['doi'])
        # add the publication to the list
        publication_list.append(publication)

     
        author_names = row['author_names']
        
        if pd.isna(author_names) or (len(author_names) == 0) or (author_names).strip('[\'] ') == '':
            author_exists = False
        else:
            author_exists = True
            author_names_list = get_author_names_list(author_names)
            for author_name in author_names_list:
                last_name, first_name, middle_name1, middle_name2, middle_name3 = extract_names(author_name)
                # Create an Author object
                author = Author(last_name, first_name, middle_name1)
                # Add the publication to the Author's list of publications
                author.publications.append(publication) ##TO DO: Check if contact_author ends up having a pub
                author_row_list.append(author)

        # Create contact author
        contact_name = row["contact_author_name"]
        
        if pd.isna(contact_name) or (len(contact_name) == 0) or (contact_name.strip() == ''):
            contact_exists = False
        else: # A contact author name was found
            contact_exists = True
            contact_last, contact_first, contact_middle, _, _ = extract_names(contact_name)
            contact_author = Author(contact_last, contact_first, contact_middle, emails=[row["contact_email"]]) 
            
        
        # If there is no value in author names and contact name, then add to no_authors count
        if not author_exists and not contact_exists:
            num_no_authors = num_no_authors + 1
        elif not author_exists: #No author exists
            add_author_in_list(author_list, contact_author)            
        elif not contact_exists: #No contact exists
            for author in author_row_list:
                # Add the Author to the list of Authors
                add_author_in_list(author_list, author)
        elif author_exists and contact_exists: #Both author and contact exist
            # If that author is also the contact author, add an email
            for author in author_row_list:
                if author == contact_author:
                    author.add_contact_author_info(contact_author)
                add_author_in_list(author_list, author)

    else:
        # If there is neither a title nor a DOI, skip this entry (do not add to lists)
        num_no_publication = num_no_publication + 1
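
The loop above checks for missing values in several places. As a sketch of how those checks could be gathered into one helper, the function below treats `None`, NaN, and blank strings as missing. The name `is_blank` is hypothetical, and it uses `math.isnan` from the standard library rather than `pd.isna` so that the example runs without pandas. Note also that NaN is truthy in Python, so a test like `row['title'] or row['doi']` counts a missing value as present if the loader leaves it as NaN; whether `load_data` does so is not shown here.

```python
import math


def is_blank(value):
    """True for None, NaN, or a string that is empty once list punctuation is stripped."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # Strip brackets, quotes, and spaces so "[]" and "['']" count as blank
    return str(value).strip("[]' ") == ""


# NaN evaluates as True in a boolean context, unlike None or ''
nan_is_truthy = bool(float("nan"))
```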

In [12]:
small = False # set to True if load_data was called with small=True above
if small:
    display(author_list)
else:
    display(len(author_list))

In [None]:
from nbdev.export import nb_export
nb_export('create_objects.ipynb', 'preprocessing')