# Author: comparing and sorting names with and without initials 

> Outputs: `Name` class and `Author` class

## Big picture: name disambiguation

We need to check whether an author already exists before creating a new one, which means searching the existing dataset. To keep that lookup fast, the dataset should stay sorted alphabetically, so `Author` needs sorting and comparison dunder methods (`__eq__`, `__lt__`, and so on). With that in mind, we need to be able to compare two authors whose names appear in different forms.
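
As a rough illustration of why a sorted dataset helps, here is a minimal sketch that uses `bisect` on a list of plain `(last name, first initial)` tuples. The `index` list and the helpers `find_candidates` and `add_if_missing` are hypothetical names used only for this example; they are not part of the exported module.

```python
import bisect

# Hypothetical sorted index of (last name, first initial) keys.
index = [("Christie", "A"), ("Rowling", "J"), ("Tolkien", "J"), ("Twain", "M")]

def find_candidates(index, last, first):
    """Return every entry that shares the last name and first initial."""
    key = (last, first[0])
    lo = bisect.bisect_left(index, key)
    hi = bisect.bisect_right(index, key)
    return index[lo:hi]

def add_if_missing(index, last, first):
    """Insert the key at its sorted position unless it is already present."""
    key = (last, first[0])
    pos = bisect.bisect_left(index, key)
    if pos < len(index) and index[pos] == key:
        return False
    index.insert(pos, key)
    return True

print(find_candidates(index, "Rowling", "Joanne"))
print(add_if_missing(index, "Wells", "H"))
```

Each lookup is a binary search rather than a scan of the whole list, which is the reason for keeping the data ordered.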

In [None]:
#| default_exp author

## Comparing Two Different Forms of the Same Name

I want to determine how many authors have ambiguous names. A potential match is one that shares an identical last name and first initial. For two names to match, any additional information they carry, such as a full first name or middle initials, must not conflict. A full last name and a first initial are guaranteed. Many authors have multiple middle names.

To start off, we need to be able to compare full names to initials, and to be able to differentiate between the two. 
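
One way to count ambiguous names is to group them by last name and first initial. The sketch below assumes names are available as plain `(last, first)` tuples; `ambiguous_groups` and the sample `names` list are hypothetical and only illustrate the grouping rule.

```python
from collections import defaultdict

def ambiguous_groups(names):
    """Group (last, first) pairs by last name and first initial.

    Only groups containing more than one distinct form are returned.
    """
    groups = defaultdict(set)
    for last, first in names:
        groups[(last, first[0])].add((last, first))
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

# Made-up sample data for illustration.
names = [("Rowling", "J"), ("Rowling", "Joanne"), ("Tolkien", "J"), ("Twain", "Mark")]
ambiguous = ambiguous_groups(names)
print(ambiguous)
```

Every group returned here is only a set of potential matches; deciding whether two forms really refer to the same person still requires comparing the remaining name parts.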

In [None]:
#| export 

import pandas as pd
import pprint
import bisect

In [None]:
# TODO: import  get_author_names_list and extract_names functions from process_names.ipynb
# Import the functions from process_names.py
from preprocessing.process_names import get_author_names_list, extract_names

In [None]:
#| export

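class Name(str):
    """One part of a name: a full name, a single initial, or an empty string."""

    def matches(self, other):
        """Strict comparison: the two strings must be identical."""
        return str(self) == str(other)

    def __eq__(self, other):
        """Lenient comparison: an empty name matches anything, and an initial
        matches any name that starts with the same letter."""
        if len(self) == 0 or len(other) == 0:
            return True
        if len(self) == 1 or len(other) == 1:
            # At least one side is an initial, so compare first letters only.
            return str(self)[0] == str(other)[0]
        # Both sides are full names and must be identical.
        return str(self) == str(other)

    def longest(self, other):
        """Return the more informative (longer) of two equivalent names."""
        if not self == other:
            raise ValueError("cannot merge names that aren't equivalent")
        if len(other) > len(self):
            return other
        return self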

In [None]:
a = Name('a')
b = Name('b')
ab = Name('ab')
bc = Name('bc')
empty = Name('')

In [None]:
assert a == a
assert ab == a
assert a == ab
assert a == empty
assert empty == a
assert ab == empty
assert empty == ab
assert empty == empty
assert not a == b
assert not a == bc
assert not ab == b
assert not b == ab
assert not ab == bc

In [None]:
assert (a, b, ab) == (ab, bc, ab)
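
The tuple assertion works because Python compares tuples element by element using each element's `__eq__`. The sketch below uses `Initial`, a hypothetical and simplified stand-in for `Name` (it ignores the empty-string rule), to show this. It also shows that this kind of lenient equality is not transitive, which is worth keeping in mind when sorting or deduplicating.

```python
class Initial(str):
    """Hypothetical, simplified stand-in for Name (no empty-string rule)."""

    def __eq__(self, other):
        a, b = str(self), str(other)
        if len(a) == 1 or len(b) == 1:
            # An initial matches any name starting with the same letter.
            return a[:1] == b[:1]
        return a == b

# Tuples are compared pairwise, so each pair goes through Initial.__eq__.
left = (Initial('a'), Initial('b'))
right = (Initial('ab'), Initial('bc'))
tuples_equal = left == right

# Lenient equality is not transitive: both full names match the initial,
# but they do not match each other.
jane, j, john = Initial('Jane'), Initial('J'), Initial('John')
chain = (jane == j, j == john, jane == john)
print(tuples_equal, chain)
```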

In [None]:
assert a.matches('a')
assert not ab.matches('a')
assert a.matches(a)
assert not ab.matches(a)

In [None]:
assert a.longest(ab).matches('ab')
assert ab.longest(a).matches('ab')

In [None]:
#| export

class Author:
    
    def __init__(self, last, first, middle='', middle2='', middle3='', emails=None):
        self.first = Name(first)
        self.middle = Name(middle)
        self.middle2 = Name(middle2)
        self.middle3 = Name(middle3)
        self.last = Name(last)
        # Copy the list so instances never share a mutable default.
        self.emails = list(emails) if emails is not None else []
        self.publications = []

    def full_name(self):
        # Join the non-empty name parts in display order.
        parts = [getattr(self, attr) for attr in ('first', 'middle', 'middle2', 'middle3', 'last')]
        return ' '.join(part for part in parts if part)
        
    def __repr__(self):
        return self.full_name()
    
    def matches(self, other):
        return (self.first.matches(other.first)
            and self.middle.matches(other.middle)
            and self.middle2.matches(other.middle2)
            and self.middle3.matches(other.middle3)
            and self.last.matches(other.last))
    
    def add_contact_author_info(self, contact_author):
        # use the __eq__ function to make sure the author and contact_author are the same before merging them
        assert self == contact_author, 'author and contact_author do not have the same name'
        self.emails = self.emails + contact_author.emails
        self.merge_names(contact_author)
        
    def merge_names(self, other):
        self.first = self.first.longest(other.first)
        self.middle = self.middle.longest(other.middle)
        self.middle2 = self.middle2.longest(other.middle2)
        self.middle3 = self.middle3.longest(other.middle3)
        self.last = self.last.longest(other.last)
        
    def __eq__(self, other):
        return (self.last, self.first, self.middle, self.middle2, self.middle3) == (other.last, other.first, other.middle, other.middle2, other.middle3)
    
    def __lt__(self, other):
        return (self.last, self.first, self.middle, self.middle2, self.middle3) < (other.last, other.first, other.middle, other.middle2, other.middle3)

    def __le__(self, other):
        return (self.last, self.first, self.middle, self.middle2, self.middle3) <= (other.last, other.first, other.middle, other.middle2, other.middle3)

    def __gt__(self, other):
        return (self.last, self.first, self.middle, self.middle2, self.middle3) > (other.last, other.first, other.middle, other.middle2, other.middle3)

    def __ge__(self, other):
        return (self.last, self.first, self.middle, self.middle2, self.middle3) >= (other.last, other.first, other.middle, other.middle2, other.middle3)
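
The five comparison methods repeat the same tuple. One possible way to reduce the repetition is a shared key method plus `functools.total_ordering`, sketched below on `SimpleAuthor`, a hypothetical class with plain-string name parts. This is not the exported `Author`, and whether the derived operators behave identically under the lenient `Name` equality would need to be checked before adopting it.

```python
from functools import total_ordering

@total_ordering
class SimpleAuthor:
    """Hypothetical, simplified author with plain-string name parts."""

    def __init__(self, last, first, middle=''):
        self.last, self.first, self.middle = last, first, middle

    def _key(self):
        # Sort by last name, then first name, then middle name.
        return (self.last, self.first, self.middle)

    def __eq__(self, other):
        if not isinstance(other, SimpleAuthor):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, SimpleAuthor):
            return NotImplemented
        return self._key() < other._key()

authors = [SimpleAuthor('Smith', 'Jane'), SimpleAuthor('Doe', 'John', 'A')]
ordered = sorted(authors)
print([a.last for a in ordered])
```

`total_ordering` derives `<=`, `>`, and `>=` from `__eq__` and `__lt__`, so only two methods need to be written by hand.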

In [None]:
# Example usage (the constructor takes the last name first: Author(last, first, middle, ...)):
author0 = Author('Smith', 'J', 'S', emails=['j.smith@gmail.com'])
author1 = Author("Doe", "John", "A")
author2 = Author("Smith", "Jane")
author3 = Author("Johnson", "Alice", "B")

assert not author1 > author2
assert not author1 >= author2
assert author1 < author2
assert author1 <= author2

print('Combine:')
display(author0, author2)

assert author2 == author0
author2.add_contact_author_info(author0)
assert author2.emails == ['j.smith@gmail.com']
assert author2.first == 'Jane'
assert author2.middle == 'S'
assert author2.last == 'Smith'

author2

In [None]:
# Arguments follow the constructor order: Author(last, first, middle, ...)
author4 = Author("Rowling", "J", 'K')
author5 = Author("Twain", "Mark")
author6 = Author("Wells", "H", "G")
author7 = Author("Christie", "Agatha", "")
author8 = Author("Tolkien", "J", "R", "R")
author9 = Author("Rowling", "Joanne")
author10 = Author("Tolkien", "J")

print(author4)
print(author5)
print(author6)
print(author7)
print(author8)
print(author9)
print(author10)

In [None]:
(author4, author9)

In [None]:
# TODO: use nbdev to export this notebook to the preprocessing module - DONE

from nbdev.export import nb_export
nb_export('author.ipynb', 'preprocessing')