## AIM:

This notebook is important. I want to do two things here:

- Check whether the new `ieee_author_df` I scraped using bs4 is correct.
- Check whether fuzzy matching changed the order or content of the author names.

I will use two datasets. The first, `'../../data/processed/merged_author_df.csv'`, is the output of the new `get_merged_author_df.py`, which processes the new `ieee_author_df.csv` scraped using bs4 and employs fuzzy matching.

The other file is `merged_author_df_3.csv`. It was produced from the `ieee_author_df.csv` scraped using Selenium, without fuzzy matching on author names (I simply assumed the author position).
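To make the difference between the two approaches concrete, here is a minimal sketch of position assignment by fuzzy matching versus assumed order. It is not the logic in `get_merged_author_df.py` (which is not shown in this notebook); it uses the standard library's `difflib.SequenceMatcher` as a stand-in, and the helper names, the name variant "Mike Bostock", and the order of `other_names` are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_positions(ieee_names, other_names):
    """For each IEEE name, return the 1-based position of the most
    similar name in other_names (hypothetical helper)."""
    positions = []
    for name in ieee_names:
        scores = [similarity(name, other) for other in other_names]
        positions.append(scores.index(max(scores)) + 1)
    return positions


ieee_names = ["Michael Bostock", "Vadim Ogievetsky", "Jeffrey Heer"]
# Hypothetical second source: different order and one name variant.
other_names = ["Jeffrey Heer", "Mike Bostock", "Vadim Ogievetsky"]

fuzzy_positions = match_positions(ieee_names, other_names)
# Without fuzzy matching, the i-th IEEE author is assumed to be the i-th author.
assumed_positions = list(range(1, len(ieee_names) + 1))

print(fuzzy_positions, assumed_positions)
```

When the two sources list authors in the same order, both approaches give the same positions; they only diverge when the order or spelling differs, which is the case this notebook is checking for.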

In [1]:
import pandas as pd
pd.options.display.max_colwidth = None
pd.set_option('display.max_rows', 500)
import csv

In [2]:
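def read_txt(INPUT):
    """Read a text file and return the first CSV field of each non-empty line as a list."""
    # The with-block closes the file; newline="" is what the csv module recommends.
    with open(INPUT, "r", newline="") as raw:
        reader = csv.reader(raw)
        # Skip blank lines, which would otherwise raise IndexError on row[0].
        data = [row[0] for row in reader if row]
    return data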

In [3]:
papers = read_txt('../../data/processed/papers_to_study.txt')

In [4]:
df = pd.read_csv('../../data/processed/merged_author_df.csv')

In [5]:
DOIS = list(set(df.DOI.tolist()))

In [6]:
len(DOIS)

3233

In [7]:
# First, check whether any paper in merged_author_df lists the same author more than once.
# If so, something must be wrong with the fuzzy matching. No printed DOI means no duplicates.
for doi in DOIS:
    df1 = df[df.DOI == doi]
    unique_author_num = len(list(set(df1['IEEE Author Name'].tolist())))
    if unique_author_num != df1.shape[0]:
        print(doi)
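The per-DOI loop above can also be written as a single vectorized check. The sketch below runs on a small toy frame (hypothetical rows, not the real data) and uses `DataFrame.duplicated` on the `DOI` and `IEEE Author Name` columns. One caveat: `duplicated` treats missing names as equal to each other, so rows with missing names may be handled differently than in the `set`-based loop.

```python
import pandas as pd

# Toy frame standing in for merged_author_df (hypothetical rows).
toy = pd.DataFrame({
    "DOI": ["a", "a", "b", "b", "b"],
    "IEEE Author Name": ["X", "Y", "X", "Z", "Z"],
})

# Mark every row whose (DOI, author name) pair occurs more than once.
dup_mask = toy.duplicated(subset=["DOI", "IEEE Author Name"], keep=False)

# DOIs that contain at least one repeated author name.
dois_with_duplicates = toy.loc[dup_mask, "DOI"].unique().tolist()
print(dois_with_duplicates)
```

On the real data, replacing `toy` with `df` should give an empty list if the loop above printed nothing.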

In [8]:
# Rows where 'IEEE Author Position' differs from 'Author Position',
# i.e. the cases where fuzzy matching made a difference.
df[df['IEEE Author Position'] != df['Author Position']][[
    'IEEE Author Position', 'IEEE Number of Authors', 
    'Author Position', 'Number of Authors', 'IEEE Author Name', 'Author Name']].shape

(87, 6)

## Compare with _3

In [9]:
# df3 is the version without fuzzy matching; I simply assumed the author position in each paper.
# If the author name vector in df3 is exactly the same as in merged_author_df,
# then nothing is wrong with the fuzzy matching: at least it does not change the author names
# or the order of authors.
df3 = pd.read_csv('merged_author_df_3.csv')
df3

Unnamed: 0,Year,DOI,Title,IEEE Number of Authors,IEEE Author Position,IEEE Author Name,IEEE Author ID,IEEE Author Affiliation Updated,IEEE One Affiliation,Number of Authors,...,Author ORCID,Number of Affiliations,First Institution Name Updated,Raw Affiliation String Updated,First Institution ID,First Institution ROR,First Institution Type Updated,First Institution Country Code Updated,Binary Institution Type,IEEE Author Affiliation Filled
0,2011,10.1109/TVCG.2011.185,D³ Data-Driven Documents,3,1.0,Michael Bostock,https://ieeexplore.ieee.org/author/37591067400,"Computer Science Department, Stanford University, Stanford, CA, USA",True,3,...,,1.0,Stanford University,"Computer Science Department, Stanford University, Stanford, CA, USA#TAB#",https://openalex.org/I97018004,https://ror.org/00f54p054,education,US,education,"Computer Science Department, Stanford University, Stanford, CA, USA"
1,2011,10.1109/TVCG.2011.185,D³ Data-Driven Documents,3,2.0,Vadim Ogievetsky,https://ieeexplore.ieee.org/author/38016292400,"Computer Science Department, Stanford University, Stanford, CA, USA",True,3,...,,1.0,Stanford University,"Computer Science Department, Stanford University, Stanford, CA, USA#TAB#",https://openalex.org/I97018004,https://ror.org/00f54p054,education,US,education,"Computer Science Department, Stanford University, Stanford, CA, USA"
2,2011,10.1109/TVCG.2011.185,D³ Data-Driven Documents,3,3.0,Jeffrey Heer,https://ieeexplore.ieee.org/author/37550791300,"Computer Science Department, Stanford University, Stanford, CA, USA",True,3,...,https://orcid.org/0000-0002-6175-1655,1.0,Stanford University,"Computer Science Department, Stanford University, Stanford, CA, USA#TAB#",https://openalex.org/I97018004,https://ror.org/00f54p054,education,US,education,"Computer Science Department, Stanford University, Stanford, CA, USA"
3,1991,10.1109/VISUAL.1991.175815,Tree-maps: a space-filling approach to the visualization of hierarchical information structures,2,1.0,B. Johnson,https://ieeexplore.ieee.org/author/37381975300,"Department of Computer Science & Human-Computer Interaction Laboratory, University of Maryland, College Park, MD, USA",True,2,...,,1.0,"University of Maryland, College Park","Dept. of Comput Sci., Maryland Univ., College Park, MD, USA",https://openalex.org/I66946132,https://ror.org/047s2c258,education,US,education,"Department of Computer Science & Human-Computer Interaction Laboratory, University of Maryland, College Park, MD, USA"
4,1991,10.1109/VISUAL.1991.175815,Tree-maps: a space-filling approach to the visualization of hierarchical information structures,2,2.0,B. Shneiderman,https://ieeexplore.ieee.org/author/37283016400,"Department of Computer Science & Human-Computer Interaction Laboratory, University of Maryland, College Park, MD, USA",True,2,...,https://orcid.org/0000-0002-8298-1097,1.0,"University of Maryland, College Park","Dept. of Comput Sci., Maryland Univ., College Park, MD, USA",https://openalex.org/I66946132,https://ror.org/047s2c258,education,US,education,"Department of Computer Science & Human-Computer Interaction Laboratory, University of Maryland, College Park, MD, USA"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12408,2021,10.1109/TVCG.2020.3032984,Understanding Missing Links in Bipartite Networks with MissBiN,4,1.0,Jian Zhao,,"School of Computer Science, University of Waterloo, 8430 Waterloo, Ontario, Canada, (e-mail: jianzhao@uwaterloo.ca)",True,4,...,https://orcid.org/0000-0003-4134-2739,1.0,University of Waterloo,"[School of Computer Science, University of Waterloo, 8430 Waterloo, Ontario, Canada, (e-mail: jianzhao@uwaterloo.ca)]",https://openalex.org/I151746483,https://ror.org/01aff2v68,education,CA,education,"School of Computer Science, University of Waterloo, 8430 Waterloo, Ontario, Canada, (e-mail: jianzhao@uwaterloo.ca)"
12409,2021,10.1109/TVCG.2020.3032984,Understanding Missing Links in Bipartite Networks with MissBiN,4,2.0,Maoyuan Sun,,"Department of Computer Science, Northern Illinois University, 2848 DeKalb, Illinois, United States, (e-mail: smaoyuan@niu.edu)",True,4,...,https://orcid.org/0000-0002-0990-2620,1.0,Northern Illinois University,"Department of Computer Science, Northern Illinois University, 2848 DeKalb, Illinois, United States, (e-mail: smaoyuan@niu.edu)",https://openalex.org/I102502594,https://ror.org/012wxa772,education,US,education,"Department of Computer Science, Northern Illinois University, 2848 DeKalb, Illinois, United States, (e-mail: smaoyuan@niu.edu)"
12410,2021,10.1109/TVCG.2020.3032984,Understanding Missing Links in Bipartite Networks with MissBiN,4,3.0,Francine Chen,,"Research, FXPAL, Palo Alto, California, United States, (e-mail: chen@fxpal.com)",True,4,...,,1.0,"[Research, FXPAL, Palo Alto, California, United States, (e-mail: chen@fxpal.com)]","[Research, FXPAL, Palo Alto, California, United States, (e-mail: chen@fxpal.com)]",,,,,,"Research, FXPAL, Palo Alto, California, United States, (e-mail: chen@fxpal.com)"
12411,2021,10.1109/TVCG.2020.3032984,Understanding Missing Links in Bipartite Networks with MissBiN,4,4.0,Patrick Chui,,"Research, FXPAL, Palo Alto, California, United States, (e-mail: chiu@fxpal.com)",True,4,...,,1.0,"[Research, FXPAL, Palo Alto, California, United States, (e-mail: chiu@fxpal.com)]","[Research, FXPAL, Palo Alto, California, United States, (e-mail: chiu@fxpal.com)]",,,,,,"Research, FXPAL, Palo Alto, California, United States, (e-mail: chiu@fxpal.com)"


In [10]:
# True here means that fuzzy matching in get_merged_author_df.py did not change the order or content of any author name.
df['IEEE Author Name'].tolist() == df3['IEEE Author Name'].tolist()

True
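The list comparison above gives a single yes/no answer and relies on both files having the same rows in the same order. If it ever returns `False`, a small helper like the following sketch could show where the two name lists diverge. The function name and the swapped example list are hypothetical.

```python
def mismatched_positions(left, right):
    """Return (index, left_value, right_value) for every position
    where two equally long lists differ (hypothetical helper)."""
    if len(left) != len(right):
        raise ValueError(f"length mismatch: {len(left)} vs {len(right)}")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(left, right))
        if a != b
    ]


new_names = ["Jian Zhao", "Maoyuan Sun", "Francine Chen"]
# Hypothetical older list with two authors swapped.
old_names = ["Jian Zhao", "Francine Chen", "Maoyuan Sun"]

diffs = mismatched_positions(new_names, old_names)
print(diffs)
```

In this notebook it would be called as `mismatched_positions(df['IEEE Author Name'].tolist(), df3['IEEE Author Name'].tolist())`, which should return an empty list given the `True` above.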

## Result

Since the IEEE author name vectors are exactly the same, the new `ieee_author_df` scraped using bs4 matches the earlier Selenium scrape, so I treat it as correct. It also means that fuzzy matching does not change the author name order or content.

In [11]:
# for doi in df.DOI.tolist():
#     dff = df[df.DOI == doi]
#     df33 = df3[df3.DOI == doi]
#     if dff['IEEE Author Name'].tolist() != df33['IEEE Author Name'].tolist():
#         print(f'{doi}')

In [12]:
# df[df.DOI == '10.1109/TVCG.2021.3114876']['IEEE Author Name'].tolist()

In [13]:
# df3[df3.DOI == '10.1109/TVCG.2021.3114876']['IEEE Author Name'].tolist()