# General

## Importing Libraries and Modules

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

import pickle

from sqlalchemy import create_engine

import etl.extract.tb as tb_proc 
import etl.extract.wos as wos_parser
import test_units.test_extract as test_ext
import test_units.test_transform as test_tf

## CRISP-DM

## ETL

### (E)xtraction


#### Web of Science/Zoological Records
As far as I know, WoS does not provide an API, and it has strict access policies across all of its products. I therefore downloaded the search results manually through the GUI and coded a parser from scratch that loads the data from the 150+ files into a SQLite database.

The specifics of how I queried this system are in the Methods section of the report.
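
The module `etl.extract.wos` is not shown in this notebook, so the following is only a minimal sketch of what such a parser could look like. It assumes the downloaded files use a tagged plain-text layout (a two-letter field tag, then the value, indented continuation lines, and `ER` closing each record). The function name `parse_tagged` and the sample text are hypothetical.

```python
def parse_tagged(text):
    """Parse a tagged export into a list of {tag: value} dicts (one per record)."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        # Skip blank lines and file-level header/footer tags (assumed layout)
        if not line.strip() or line[:2] in ("FN", "VR", "EF"):
            continue
        if line.rstrip() == "ER":
            # End of record: store it and start a new one
            records.append(current)
            current, tag = {}, None
        elif line[:2].strip():
            # A new field: two-letter tag, a space, then the value
            tag = line[:2]
            current[tag] = line[3:].strip()
        elif tag is not None:
            # Indented line: continuation of the previous field
            current[tag] += " " + line.strip()
    return records


# Made-up sample in the assumed layout
sample = """FN Example export
VR 1.0
PT J
TI A made-up title that wraps
   onto a second line
PY 2010
ER

PT J
TI Another made-up title
PY 2011
ER
EF
"""

records = parse_tagged(sample)
```

A list of dicts like `records` can be passed directly to `pandas.DataFrame(records)` to get one row per record.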

In [175]:
# Parse the downloaded files and create the DataFrame
# Commented out while I was working on the notebook, to speed up reloading this dataset
#wos = wos_parser.create_df()

# Also commented out: it generates a file of over 200 MB that I couldn't push to GitHub. I left it here to show how the pickle was created
#wos_parser.to_pickle(wos, 'data/wos.p')

# Reload the dataset from the pickle file (not provided in the repository), if necessary
wos = pd.read_pickle('data/wos.p')

In [176]:
# Checking if it loaded
wos.head()

Unnamed: 0,Publication Type,Zoological Record Accession Number,Document Type,Title,Foreign Title,Authors,Source,Volume,Page Span,Publication Date,...,Journal URL,Item URL,Notes,NaN,Book Authors,Publisher,Publisher Address,International Standard Book Number (ISBN),Editors,Group Authors
0,J,ZOOR14707048981,Article,Orthopterans of the Reserva Biologica Alberto ...,Ortopteros de la Reserva Biologica Alberto Man...,"Barranco, Pablo (pbvega@ual.es)",Boletin de la SEA,47,21-32,31 Dec 2010,...,,,,,,,,,,
1,J,ZOOR14707048982,Article,Further considerations on the genus Ananteris ...,,"Lourenco, Wilson R. (arache@mnhn.fr), Duhem, B...",Boletin de la SEA,47,33-38,31 Dec 2010,...,,,,,,,,,,
2,J,ZOOR14707048983,Article,Description of a new species of the genus Mela...,Descripcion de una nueva especie del genero Me...,"Ferrer, Julio, Tovar, Alejandro Castro",Boletin de la SEA,47,39-44,31 Dec 2010,...,,,,,,,,,,
3,J,ZOOR14707048985,Article,New arachnids from Puerto Rico (Arachnida: Amb...,Nuevos aracnidos de Puerto Rico (Arachnida: Am...,"de Armas, Luis F. (zoologia.ies@ama.cu)",Boletin de la SEA,47,55-64,31 Dec 2010,...,,,,,,,,,,
4,J,ZOOR14707048986,Article,Description of new hypogean taxa of Laparoceru...,Descripcion de nuevos Laparocerus hipogeos de ...,"Machado, Antonio (a.machado@telefonica.net), G...",Boletin de la SEA,47,65-69,31 Dec 2010,...,,,,,,,,,,


#### TreatmentBank (Plazi)
TreatmentBank has a powerful API with a GUI. I used it to build the target query, and copied the resulting URL into my code. From there I processed the results and added them to a SQLite database.

In [4]:
# Process the hard-coded API call and create a database from it
# Commented out because the SQLite db is provided and contains a snapshot of the data I analyzed. As this involves a live API call, running it now might fetch new data and change my results

#tb_proc.create_db(tb_proc.get_data())

# Alternatively, I can simply load the .db
db_connect = create_engine('sqlite:///data/tb.db')

tb = pd.read_sql("docs", con=db_connect)
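
To illustrate why a stored snapshot makes the later per-year checks reproducible, here is a small sketch that uses only the standard-library `sqlite3` module instead of SQLAlchemy. The table name `docs` and the `journal` and `year` columns mirror the ones used in this notebook. The rows and the in-memory database are made up.

```python
import sqlite3

# In-memory database standing in for a snapshot file such as data/tb.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, journal TEXT, year TEXT)")

# Made-up rows; the real table is produced by the processing step
rows = [("Zootaxa", "2014"), ("ZooKeys", "2018"), ("Zootaxa", "2018")]
conn.executemany("INSERT INTO docs (journal, year) VALUES (?, ?)", rows)
conn.commit()

# Count records per year straight from the snapshot
per_year = dict(
    conn.execute("SELECT year, COUNT(*) FROM docs GROUP BY year ORDER BY year")
)
conn.close()
```

Because the counts come from the stored table rather than from a new API call, they stay the same no matter when the cell is re-run.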

In [5]:
# Checking if it loaded
tb.head()

Unnamed: 0,id,authors,year,title,journal,volume,issue,start_page,end_page,num_treats
0,1,"Fernandez-Triana, Jose L & Whitfield, James B ...",2014,First record of the genus Venanus (Hymenoptera...,Biodiversity Data Journal,2,0,4167,4167,4
1,2,"Barrales-Alcala, Diego A. & Francke, Oscar F.",2018,A new Sky Island species of Vaejovis C. L. Koc...,ZooKeys,760,0,37,53,2
2,3,"Guevara-Guerrero, Gonzalo & Bonito, Gregory & ...",2018,"Tuberaztecorum sp. nov., a truffle species fro...",MycoKeys,30,0,61,72,2
3,4,"Balke, Michael & Ruthensteiner, Bernhard & War...",2015,Two new species of Limbodessus diving beetles ...,Biodiversity Data Journal,3,0,7096,7096,4
4,5,"Costa, Wilson J. E. M. & Amorim, Pedro F.",2018,Cryptic species diversity in the Hypsolebiasma...,ZooKeys,777,0,141,158,4


#### Zoobank
As explained, I scraped the Zoobank website to collect its data. For this, I used a Scrapy project with a custom Item, ItemLoader, input and output processors, and ItemPipeline. As far as I know, these can't quite be run from here: I could only create a CrawlerProcess, which wouldn't have these functionalities. The scraped data was saved into a SQLite database, which is loaded here instead.

In [216]:
# Load the .db created from the web scraping
db_connect = create_engine('sqlite:///data/zb.db')

zb = pd.read_sql("docs", con=db_connect)

In [217]:
# Checking if it loaded
zb.head()

Unnamed: 0,id,authors,year,title,journal,bibref_details,volume,issue,start_page,end_page,from_year
0,1,"Achatz, Johannes G., Matthew D. Hooge, A. Wall...",2010,Systematic revision of acoels with 9+0 sperm u...,Journal of Zoological Systematics and Evolutio...,48(1): 9-32,48.0,1.0,9.0,32.0,2010
1,2,"Affilastro, Andrés A. O. & Ignacio Garcia-Mauro.",2010,"A new Bothriurus (Scorpiones, Bothriuridae) fr...",Zootaxa,2488: 52-64,2488.0,,52.0,64.0,2010
2,3,"Ahyong, Shane T.",2010,The marine fauna of New Zealand: king crabs of...,No,,,,,,2010
3,4,"Ahyong, Shane T., Keiji Baba, Enrique Macphers...",2010,A new classification of the Galatheoidea (Crus...,Zootaxa,2676: 57-68,2676.0,,57.0,68.0,2010
4,5,"Aibek, Ulykpan & Seiki Yamane.",2010,Discovery of the Subgenera Austrolasius and De...,Japanese Journal of systematic entomology,16(2): 197-202,16.0,2.0,197.0,202.0,2010


#### Exploratory Analyses: Extraction Tests
Here I run a series of tests to check that I got all the data I was supposed to get. For that, I use Python's assert and some manual observations.
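
The function `test_ext.check_consistency` is not shown in this notebook. As a rough illustration of the idea only, a per-year comparison could be sketched as below. The name `compare_year_counts` and the processed counts are hypothetical; the two official counts are taken from the WoS figures listed in the next section.

```python
def compare_year_counts(processed, official):
    """Return {year: (processed_count, official_count)} for every year that disagrees."""
    # Normalize keys to str so that 2010 and '2010' are treated as the same year
    processed = {str(year): int(count) for year, count in processed.items()}
    official = {str(year): int(count) for year, count in official.items()}

    mismatches = {}
    for year in sorted(set(processed) | set(official)):
        got, expected = processed.get(year, 0), official.get(year, 0)
        if got != expected:
            mismatches[year] = (got, expected)
    return mismatches


official = {"2010": 7207, "2011": 7480}
processed = {2010: 7207, 2011: 7479}  # made-up: one record missing in 2011

mismatches = compare_year_counts(processed, official)

# An assert like this one fails loudly as soon as any year disagrees
assert not compare_year_counts(official, official)
```

Since a `pandas.Series` also exposes `.items()`, the output of `value_counts()` can be passed as `processed` without conversion.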


##### Web of Science/Zoological Records
These were the observed results from the WoS GUI, per year:

- 2010: 7207 results
- 2011: 7480 results
- 2012: 7604 results
- 2013: 7655 results
- 2014: 7456 results
- 2015: 7983 results
- 2016: 7859 results
- 2017: 8023 results
- 2018: 7973 results
- 2019: 7664 results
- 2020: 3548 results


In [8]:
# Create a dict with year: official results, to be used on a data consistency test
wos_results = {
    '2010': 7207,
    '2011': 7480,
    '2012': 7604,
    '2013': 7655,
    '2014': 7456,
    '2015': 7983,
    '2016': 7859,
    '2017': 8023,
    '2018': 7973,
    '2019': 7664,
    '2020': 3548,
}

In [9]:
# Check the column names to get the right one: 'Year Published'
wos.columns

Index([                                     'Publication Type',
                          'Zoological Record Accession Number',
                                               'Document Type',
                                                       'Title',
                                               'Foreign Title',
                                                     'Authors',
                                                      'Source',
                                                      'Volume',
                                                   'Page Span',
                                            'Publication Date',
                                              'Year Published',
                                                    'Language',
                                 'Usage Count (Last 180 Days)',
                                    'Usage Count (Since 2013)',
                                                    'Abstract',
                                        

In [10]:
# Create a pandas.Series with the ordered counts of the 'Year Published' column
wos_years_count = wos['Year Published'].value_counts().sort_index()

In [11]:
# Calling the test with WoS/Zoological Record data
test_ext.check_consistency("WoS/Zoological Records", wos_years_count, wos_results)

--------------------------------------------------------------------------
Testing for... WOS/ZOOLOGICAL RECORDS
--------------------------------------------------------------------------
SUCCESS! Total count was equal between the processed and official results!
--------------------------------------------------------------------------
SUCCESS! the year 2010 seems to be fine!
SUCCESS! the year 2011 seems to be fine!
SUCCESS! the year 2012 seems to be fine!
SUCCESS! the year 2013 seems to be fine!
SUCCESS! the year 2014 seems to be fine!
SUCCESS! the year 2015 seems to be fine!
SUCCESS! the year 2016 seems to be fine!
SUCCESS! the year 2017 seems to be fine!
SUCCESS! the year 2018 seems to be fine!
SUCCESS! the year 2019 seems to be fine!
SUCCESS! the year 2020 seems to be fine!
--------------------------------------------------------------------------


##### TreatmentBank
I can't make new API calls: the data is live, so I would risk getting more data than I got when I first ran this, which could change the results.

That's why, besides the .db that results from processing this data, I saved a .pickle file with the entire API response from that routine call. This lets me compare the total number of records in both (and per year!) to check whether my processing missed anything, using the same function applied to the WoS data.

In [12]:
# Creating a dict with the counts of records per year, based on the total response saved from the API call to TreatmentBank

# Open the .pickle
with open('data/tb_response.p', 'rb') as p_file:
    tb_response = pickle.load(p_file)

tb_response_years_count = dict()

# Iterate over the full response, counting the records per year
for result in tb_response['data']:
    try:
        tb_response_years_count[result['BibYear']] += 1
    except KeyError:
        # First record seen for this year
        tb_response_years_count[result['BibYear']] = 1
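
The same per-year tally can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the code used for the results here. The `tb_response` dict below is a made-up stand-in for the saved response, and only the `'data'` list and the `'BibYear'` key are taken from the loop above.

```python
from collections import Counter

# Made-up stand-in for the unpickled API response
tb_response = {
    "data": [
        {"BibYear": "2014"},
        {"BibYear": "2018"},
        {"BibYear": "2014"},
    ]
}

# Counter tallies each year in one pass, with no explicit try/except
tb_response_years_count = Counter(result["BibYear"] for result in tb_response["data"])
```

A `Counter` is a `dict` subclass, so it can be used wherever a plain year-to-count dictionary is expected.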

In [13]:
# Create a pandas.Series with the ordered counts of the 'year' column
tb_years_count = tb['year'].value_counts().sort_index()

In [15]:
# Calling the test with TreatmentBank data
test_ext.check_consistency("TreatmentBank", tb_years_count, tb_response_years_count)

--------------------------------------------------------------------------
Testing for... TREATMENTBANK
--------------------------------------------------------------------------
SUCCESS! Total count was equal between the processed and official results!
--------------------------------------------------------------------------
SUCCESS! the year 2014 seems to be fine!
SUCCESS! the year 2018 seems to be fine!
SUCCESS! the year 2015 seems to be fine!
SUCCESS! the year 2016 seems to be fine!
SUCCESS! the year 2013 seems to be fine!
SUCCESS! the year 2017 seems to be fine!
SUCCESS! the year 2011 seems to be fine!
SUCCESS! the year 2012 seems to be fine!
SUCCESS! the year 2020 seems to be fine!
SUCCESS! the year 2010 seems to be fine!
SUCCESS! the year 2019 seems to be fine!
--------------------------------------------------------------------------


##### Zoobank
These were the observed results from Zoobank, per year:

- 2010: 1128 results
- 2011: 1430 results
- 2012: 2482 results
- 2013: 4293 results
- 2014: 4829 results
- 2015: 5552 results
- 2016: 5732 results
- 2017: 6098 results
- 2018: 6401 results
- 2019: 6386 results
- 2020: 4554 results

In [16]:
# Create a dict with year: official results, to be used on a data consistency test
zb_results = {
    '2010': 1128,
    '2011': 1430,
    '2012': 2482,
    '2013': 4293,
    '2014': 4829,
    '2015': 5552,
    '2016': 5732,
    '2017': 6098,
    '2018': 6401,
    '2019': 6386,
    '2020': 4554,
}

In [17]:
# Check the existing columns in this dataframe
list(zb.columns)

['id',
 'authors',
 'year',
 'title',
 'journal',
 'bibref_details',
 'volume',
 'issue',
 'start_page',
 'end_page',
 'from_year']

Here there is one interesting caveat. Zoobank allows users to register papers from the past, and when queried by year, it shows not the publications published in that year, but the publications **registered** in that year.

As such, we might have all sorts of years in the 'year' column.

In [18]:
# Create a pandas.Series with the ordered counts of the 'year' column
zb_years_count = zb['year'].value_counts().sort_index()

# Check results
zb_years_count

1847       2
1877       2
1878       1
1880       2
1881       1
1883       1
1887       2
1888       2
1889       3
1894       1
1896       2
1898       2
1902       1
1905       1
1917       1
1919       3
1923       1
1927       3
1932       1
1933       1
1936       1
1937       5
1951       4
1952       3
1960       1
1962       1
1967       1
1969       1
1972       2
1986       1
1989       1
1996       2
1999       4
2001       1
2002       1
2003       9
2004       1
2008       1
2009       4
2010    1023
2011    1318
2012    2387
2013    4253
2014    4774
2015    5550
2016    5779
2017    6119
2018    6434
2019    6497
2020    4674
Name: year, dtype: int64

To address this situation, I included a column in zb.db called 'from_year'. Inside the parser() of the spider used to scrape Zoobank, the year being queried is extracted from response.url and stored in this column.

This means that we can get the total number of records included per queried year, to compare with the official results.
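
The spider code is not part of this notebook, so the following is only a sketch of how a queried year could be pulled out of a request URL. It assumes the year travels as a query-string parameter. The parameter name `year`, the helper `year_from_url`, and the example.org URLs are all hypothetical and do not describe Zoobank's actual URL scheme.

```python
import re
from urllib.parse import urlparse, parse_qs


def year_from_url(url, param="year"):
    """Return the queried year as an int, or None if the URL carries no valid year."""
    values = parse_qs(urlparse(url).query).get(param)
    if not values:
        return None
    # Accept only four-digit years starting with 19 or 20
    if re.fullmatch(r"(19|20)\d{2}", values[0]):
        return int(values[0])
    return None


queried = year_from_url("https://example.org/search?term=sp.+nov.&year=2015")
missing = year_from_url("https://example.org/search?term=sp.+nov.")
```

Storing this value alongside each scraped record keeps the year of the query separate from the publication year reported by the record itself.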

In [19]:
# Create a pandas.Series with the ordered counts of the 'from_year' column
zb_years_count = zb['from_year'].value_counts().sort_index()

# Check results
zb_years_count

2010    1128
2011    1430
2012    2482
2013    4293
2014    4829
2015    5552
2016    5732
2017    6098
2018    6401
2019    6386
2020    4554
Name: from_year, dtype: int64

In [20]:
# Now that I have both the dictionary with the official results and the pandas.Series with the processed results, I can call the unit test function
test_ext.check_consistency("Zoobank", zb_years_count, zb_results) 

--------------------------------------------------------------------------
Testing for... ZOOBANK
--------------------------------------------------------------------------
SUCCESS! Total count was equal between the processed and official results!
--------------------------------------------------------------------------
SUCCESS! the year 2010 seems to be fine!
SUCCESS! the year 2011 seems to be fine!
SUCCESS! the year 2012 seems to be fine!
SUCCESS! the year 2013 seems to be fine!
SUCCESS! the year 2014 seems to be fine!
SUCCESS! the year 2015 seems to be fine!
SUCCESS! the year 2016 seems to be fine!
SUCCESS! the year 2017 seems to be fine!
SUCCESS! the year 2018 seems to be fine!
SUCCESS! the year 2019 seems to be fine!
SUCCESS! the year 2020 seems to be fine!
--------------------------------------------------------------------------


It seems that the data I processed into databases is consistent with the data observed at the sources, and that the processing completed without missing any records.

It's time to move on.

### (T)ransform

I'm aiming for a single pandas.DataFrame with the counts of papers for each of the three data sources, grouped by journal, with years on the columns. It will therefore be a MultiIndex dataframe (journal x Source/Year). Before getting there, I first want to check the following aspects of the data for all three sources:

1. Check for duplicates
2. Check the amount of NaN on the to-be-used 'year' columns
3. Group by journal and check the data quality of the journals' names
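
As an illustration of the target shape (journal rows, Source/Year columns), here is a minimal sketch using `pandas.crosstab`. The toy records are made up. The column names `Journal`, `Year` and `Source` follow the naming planned in the transformation summary further down, and the actual pipeline may build the table differently.

```python
import pandas as pd

# Made-up records, already reduced to the three columns of interest
records = pd.DataFrame({
    "Journal": ["Zootaxa", "Zootaxa", "ZooKeys", "Zootaxa"],
    "Year": [2010, 2010, 2011, 2011],
    "Source": ["WoS", "WoS", "WoS", "Zoobank"],
})

# Rows: journals. Columns: a (Source, Year) MultiIndex. Cells: paper counts.
counts = pd.crosstab(records["Journal"], [records["Source"], records["Year"]])
```

Combinations that never occur for a journal show up as 0, which keeps the table rectangular across sources.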


#### Exploratory Analyses: Before Transform


##### Web of Science/Zoological Records

In [177]:
# First, check the available columns again
wos.columns

Index([                                     'Publication Type',
                          'Zoological Record Accession Number',
                                               'Document Type',
                                                       'Title',
                                               'Foreign Title',
                                                     'Authors',
                                                      'Source',
                                                      'Volume',
                                                   'Page Span',
                                            'Publication Date',
                                              'Year Published',
                                                    'Language',
                                 'Usage Count (Last 180 Days)',
                                    'Usage Count (Since 2013)',
                                                    'Abstract',
                                        

###### Duplicates

From the column list, I can see that 'Digital Object Identifier (DOI)' might contain a persistent, unique identifier, but not all papers will have one. So I'll perform two different duplicate checks for this source:

1. Based on 'Digital Object Identifier (DOI)'
2. Based on the columns: 'Source', 'Year Published', 'Volume', 'Issue', 'Page Span'
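
The pattern behind both checks, sketched on made-up data: drop the rows that lack the key first, then flag every member of each repeated group with `keep=False`. The short column names `doi` and `title` are hypothetical stand-ins for the real ones.

```python
import pandas as pd

# Made-up papers; two of them share a DOI and two have none
papers = pd.DataFrame({
    "doi": ["10.1/a", "10.1/a", None, "10.1/b", None],
    "title": ["Paper A", "Paper A (reprint)", "Paper C", "Paper D", "Paper E"],
})

# Rows without a DOI are dropped first: pandas treats missing values as equal
# to each other, so they would otherwise be flagged as duplicates of one another
with_doi = papers.dropna(subset=["doi"])

# keep=False marks every row of a repeated group, not just the later ones
dup_mask = with_doi.duplicated(subset=["doi"], keep=False)
suspects = with_doi[dup_mask]
```

The flagged rows are only suspects. Whether they are real duplicates still has to be judged from the other fields.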

In [266]:
# Method 01 - DOI

# First, select the ones that have DOIs
wos_duplicates = wos.dropna(subset=['Digital Object Identifier (DOI)'], how='all', inplace=False)

# Then, search for duplicates. 'True' will indicate the existence of them.
wos_duplicates.duplicated(subset=['Digital Object Identifier (DOI)'], keep=False).value_counts()

False    7964
True        4
dtype: int64

In [267]:
# Checking these four duplicates...
wos_duplicates[wos_duplicates.duplicated(subset=['Digital Object Identifier (DOI)'], keep=False)]

Unnamed: 0,Publication Type,Zoological Record Accession Number,Document Type,Title,Foreign Title,Authors,Source,Volume,Page Span,Publication Date,...,Journal URL,Item URL,Notes,NaN,Book Authors,Publisher,Publisher Address,International Standard Book Number (ISBN),Editors,Group Authors
77,J,ZOOR14704030516,Article,A New Genus of the Spider Family Caponiidae (A...,,"Sanchez-Ruiz, Alexander (alex@bioeco.ciges.inf...",American Museum Novitates,3705,1-44,Dec 29 2010,...,,,,,,,,,,
10549,J,ZOOR14707054161,Article,Apseudes talpa revisited (Crustacea; Tanaidace...,,"Larsen, Kim (tanaids@hotmail.com), Bertocci, I...",Zootaxa,2886,19-30,23 May 2011,...,http://www.mapress.com/zootaxa/list/2011/2886....,http://www.mapress.com/zootaxa/2011/2/zt02886p...,,21594100.0,,,,,,
11930,J,ZOOR14705036582,Article,"The Goblin Spider Genus Aprusia Simon, 1893 (A...",,"Grismado, Cristian J. (grismado@macn.gov.ar), ...",American Museum Novitates,3706,1-21,Feb 3 2011,...,,,,,,,,,,
12433,J,ZOOR14705041437,Article,Tanaidaceans (Crustacea) from the Central Paci...,,"Larsen, Kim (tanaids@hotmail.com)",ZooKeys,87,19-41,2011,...,http://www.pensoft.net/journals/zookeys/http:/...,http://www.pensoft.net/journal_home_page.php?j...,,21594100.0,,,,,,


In [268]:
# They don't seem to be duplicates. Check their DOIs
wos_duplicates[wos_duplicates.duplicated(subset=['Digital Object Identifier (DOI)'], keep=False)]['Digital Object Identifier (DOI)']

77               10.1206/3705.2
10549    10.3897/zookeys.87.784
11930            10.1206/3705.2
12433    10.3897/zookeys.87.784
Name: Digital Object Identifier (DOI), dtype: object

They have the same DOI, but they are clearly different papers (look at the title, authors, and everything else!). So these are **false positives**, despite the shared DOIs. Moving on...

In [269]:
# Method 02

# Keep the records that have a value in at least one of the selected fields (drops rows that are NaN in all of them)
wos_duplicates = wos.dropna(subset=['Source', 'Year Published', 'Volume', 'Issue', 'Page Span'], how='all', inplace=False)

In [270]:
# Compare the length of the original df with this recently created one. Zero means no record was NaN in all of the selected fields
wos_duplicates.shape[0] - wos.shape[0]

0

In [283]:
# If there are no all-NaNs, then, check for duplicates
wos_duplicates.duplicated(subset=['Source', 'Year Published', 'Volume', 'Issue', 'Page Span'], keep=False).value_counts()

False    80194
True       258
dtype: int64

In [272]:
# A lot! Checking these 258...
wos_duplicates[wos_duplicates.duplicated(subset=['Source', 'Year Published', 'Volume', 'Issue', 'Page Span'], keep=False)]

Unnamed: 0,Publication Type,Zoological Record Accession Number,Document Type,Title,Foreign Title,Authors,Source,Volume,Page Span,Publication Date,...,Journal URL,Item URL,Notes,NaN,Book Authors,Publisher,Publisher Address,International Standard Book Number (ISBN),Editors,Group Authors
1296,J,ZOOR15410067947,Article,Bryozoans from the Jurginskaya Formation (Fame...,,"Tolokonnikova, Zoya (zalatoi@yandex.ru)",Geologos,16,139-152,Oct 2010,...,,,,,,,,,,
2504,J,ZOOR14801000011,Article,THE GENUS PSORTHASPIS (HYMENOPTERA: POMPILIDAE...,,"Rodriquez, Juanita (juanitarodrigueza@gmail.co...",Caldasia,32,435-441,July-Dec 2010,...,,,,,,,,,,
2505,J,ZOOR14801000012,Article,Three new species of Hyphessobrycon group hete...,Tres nuevas especies de Hyphessobrycon grupo h...,"Garcia-Alzate, Carlos A. (cagarcia@uniquindio....",Caldasia,32,443-461,July-Dec 2010,...,,,,,,,,,,
5670,J,ZOOR15009042744,Article,THE GENUS PSORTHASPIS (HYMENOPTERA: POMPILIDAE...,,"Rodriguez, Juanita (juanitarodrigueza@gmail.co...",Caldasia,32,435-441,2010,...,,,,,,,,,,
5671,J,ZOOR15009042745,Article,Three new species of Hyphessobrycon group hete...,TRES NUEVAS ESPECIES DE Hyphessobrycon GRUPO h...,"Garcia-Alzate, Carlos A. (cagarcia@uniquindio....",Caldasia,32,443-461,2010,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80071,J,ZOOR15604029227,Article,Description of a new species of the genus Acro...,,"Okamoto, Makoto (epigonidae@gmail.com), Willia...",Ichthyological Research,67,39-49,Jan 2020,...,,,,,,,,,,
80119,J,ZOOR15607047778,Article,"Cercyon Leach, 1817 (Coleoptera: Hydrophilidae...",,"Clarkson, Bruno, Mise, Kleber M., Almeida, Luc...",Revista Brasileira de Entomologia,64,1-8,2020,...,,,,,,,,,,
80121,J,ZOOR15607048143,Article,Veliidae (Insecta: Heteroptera: Gerromorpha) f...,,"Morales, Irina (irina.morales@uptc.edu.co), Mo...",Revista Brasileira de Entomologia,64,1-6,2020,...,,,,,,,,,,
80122,J,ZOOR15607047772,Article,"Three new species of Bruggmanniella Tavares, 1...",,"Rodrigues, Alene Ramos (alenerodrigues@yahoo.c...",Revista Brasileira de Entomologia,64,1-8,2020,...,,,,,,,,,,


In [273]:
# Checking if there are repeated titles among these 258 'duplicates'
wos_duplicates[wos_duplicates.duplicated(subset=['Source', 'Year Published', 'Volume', 'Issue', 'Page Span'], keep=False)]['Title'].value_counts()

A new species of the ant genus Leptogenys Roger, 1861 (Hymenoptera:, Formicidae) from India.                                                                                                                       3
Australoheros sanguineus sp n. - a new cichlid species from the rio, Cubatao basin, southern Brazil (Cichlidae: Heroini).                                                                                          2
A new genus and species of pinworm (Nematoda, Oxyuridae) from the gray, mouse opossum, Tlacuatzin canescens.                                                                                                       2
New species of Neotropical Acanthocinini (Coleoptera, Cerambycidae,, Lamiinae).                                                                                                                                    2
Redefinition of the genus Scarabacariphis Masan and new morphological, data for S. geotrupes comb. nov (Ishikawa) (Acari: Mesostigmata:, Eviphididae

Some indeed seem to be duplicates, with more than one occurrence of the title. But a lot seem to be metadata issues. Better to investigate.
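
A sketch of that split on made-up data: among the rows flagged by the metadata check, titles that occur only once point to a metadata collision, while repeated titles point to a true duplicate. The `flagged` frame below is hypothetical.

```python
import pandas as pd

# Made-up rows that a metadata-based check flagged as duplicates
flagged = pd.DataFrame({
    "Title": ["Paper A", "Paper A", "Paper B", "Paper C"],
    "Source": ["J1", "J1", "J2", "J2"],
    "Page Span": ["1-8", "1-8", "1-8", "1-8"],
})

title_counts = flagged["Title"].value_counts()

# Titles seen once: same metadata but different papers
single_titles = title_counts[title_counts == 1]

# Remaining flagged rows carry a title that is repeated
rows_with_repeated_titles = len(flagged) - len(single_titles)
```

Here `Paper B` and `Paper C` share journal and page span but are different papers, while the two `Paper A` rows look like a genuine duplicate.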

In [281]:
# Checking how many of these have no repeated titles, meaning they could be metadata issues alone
wos_duplicates_title_counts = wos_duplicates[wos_duplicates.duplicated(subset=['Source', 'Year Published', 'Volume', 'Issue', 'Page Span'], keep=False)]['Title'].value_counts()

wos_duplicates_single_titles = wos_duplicates_title_counts[wos_duplicates_title_counts < 2]

wos_duplicates_single_titles

The tetrapod fauna of the upper Permian Naobaogou Formation of China: 5., Caodeyao liuyufengi gen. et sp. nov., a new peculiar therocephalian.                                                                             1
New nomenclatural and taxonomic acts, and Comments (2015).                                                                                                                                                                 1
Atypical Wing Venation in Dialictus and Hemihalictus and Its, Implications for Subgeneric Classification of Lasioglossum.                                                                                                  1
A new nematode species, Chromadorina tangaroa sp. nov. (Chromadorida:, Chromadoridae) from the hull of a research vessel, New Zealand.                                                                                     1
Calicotyle hydrolagi n. sp. (Monogenea: Monocotylidae) infecting the, deep-sea Eastern Pacific black ghost shark Hyd

In [287]:
# How many of these 'duplicates' have repeated titles?
wos_duplicates.duplicated(subset=['Source', 'Year Published', 'Volume', 'Issue', 'Page Span'], keep=False).sum() - wos_duplicates_single_titles.shape[0]

129

In [290]:
# Getting info on one of the duplicates with non-repeated titles 
wos_duplicates[wos_duplicates['Title'] == 'New nomenclatural and taxonomic acts, and Comments (2015).'][['Title', 'Source', 'Year Published', 'Volume', 'Issue', 'Page Span']]

Unnamed: 0,Title,Source,Year Published,Volume,Issue,Page Span
37746,"New nomenclatural and taxonomic acts, and Comm...",Snudebiller,2015,16,,1-8


In [293]:
# Checking whether this 'duplicate' with a non-repeated title has metadata problems instead
wos_duplicates[(wos_duplicates['Source'] == 'Snudebiller') & (wos_duplicates['Year Published'] == '2015') & (wos_duplicates['Volume'] == '16') & (wos_duplicates['Page Span'] == '1-8')]

Unnamed: 0,Publication Type,Zoological Record Accession Number,Document Type,Title,Foreign Title,Authors,Source,Volume,Page Span,Publication Date,...,Journal URL,Item URL,Notes,NaN,Book Authors,Publisher,Publisher Address,International Standard Book Number (ISBN),Editors,Group Authors
37742,J,ZOOR15210068284,Article,Auletobius (Canarauletes) garajonay sp.n. from...,,"Stueben, Peter E. (P.Stueben@curci.de)",Snudebiller,16,1-8,1 December 2015,...,,,,,,,,,,
37744,J,ZOOR15210068264,Article,Torneuma korwitzi sp.n. from Madeira (Coleopte...,,"Stueben, Peter E. (P.Stueben@curci.de), Schuet...",Snudebiller,16,1-8,1 December 2015,...,,,,,,,,,,
37746,J,ZOOR15210068286,Article,"New nomenclatural and taxonomic acts, and Comm...",,"Stueben, Peter E. (P.Stueben@curci.de), Bayer,...",Snudebiller,16,1-8,1 December 2015,...,,,,,,,,,,


Tough call! They have the same author and metadata, but not the same title.

In [294]:
# Checking for DOIs...
wos_duplicates[(wos_duplicates['Source'] == 'Snudebiller') & (wos_duplicates['Year Published'] == '2015') & (wos_duplicates['Volume'] == '16') & (wos_duplicates['Page Span'] == '1-8')]['Digital Object Identifier (DOI)']

37742    NaN
37744    NaN
37746    NaN
Name: Digital Object Identifier (DOI), dtype: object

There are no DOIs, and everything except the title is the same. It's a tough call, but I'll exclude these potential duplicates too.

###### 'year' NaNs

This is a more straightforward check: simply count the NaNs in a specific column, 'Year Published'.

In [186]:
# Counting NaNs on 'Year Published'
wos['Year Published'].isna().value_counts()

False    80452
Name: Year Published, dtype: int64

Wow! It seems there isn't a single row without the year of publication. That's quite amazing.

![](https://i.imgur.com/uM1TOIt.png)

###### Journals' names

I just want to have a quick look at the (possible) mess that the journals' names in this source might be. I'm not expecting much of a mess, to be honest.


In [187]:
# Group the wos dataframe by 'Source', keeping only the 'Year Published' counts
wos_journals = wos.groupby('Source').count()['Year Published'].sort_values(ascending=False)

# Check the number of journals
wos_journals.shape[0]

2650

This is a LOT OF JOURNALS! I'm surprised, so I decided to dig a bit further:

1. Check the ones with one single occurrence, to see if there are any non-journals included in the dataset
2. Check if there are typos involving one of the most popular journals, if not the most popular: Zootaxa

In [188]:
# Take a look at the journals with fewer than 2 occurrences, as we might see some non-journals here
list(wos_journals[wos_journals < 2].sort_values().index)

Biological Research',
 'The carrion beetles of China (Coleoptera: Silphidae).',
 'The blackflies (Diptera: Simuliidae) of Brazil. [Aquatic Biodiversity in, Latin America. Volume 6.]',
 'Arnoldia Zimbabwe',
 'Anadolu Universitesi Bilim ve Teknoloji Dergisi C Yasam Bilimleri ve, Biyoteknoloji',
 'Ardea',
 'Vestnik Moskovskogo Universiteta Seriya XVI Biologiya',
 'Arctiid moths of India. Volume 1.',
 'Arctic Antarctic and Alpine Research',
 'Veroeffentlichungen des Museums fuer Naturkunde Chemnitz',
 'The Howard and Moore complete checklist of the birds of the world. 4th, edtion. Volume two: passerines.',
 'The Witt Catalogue. A taxonomic atlas of the Eurasian and North African, Noctuoidea: volume 4: Plusiinae II.',
 'Rhyzobius (Coleoptera: Coccinellidae) a revision of the world species., [Fauna Mundi Volume 2.]',
 '[The Turonian of the Uchaux and Ceze massifs (S.E. France): global, migration of ammonites and implications for international zonation,, rudists and massif correlations.]. [Me

In [189]:
# Checking for possible typos involving Zootaxa's name
wos_journals[['Zoot'.lower() in s.lower() for s in wos_journals.index]]

Source
Zootaxa                                                   16906
Acta Zootaxonomica Sinica                                   441
Arquivo Brasileiro de Medicina Veterinaria e Zootecnia       12
Veterinaria e Zootecnia                                       4
Acta Veterinaria et Zootechnica Sinica                        1
Name: Year Published, dtype: int64

We can see that there are a LOT of different journals in this dataset, but not all of them are actual journals. Some entries are mistakes (book titles, for instance), and they are random enough that coming up with an automated way of cleaning them is extremely hard, if not impossible, at this point.

On the bright side, there aren't many typos with the most common journal on the dataset, which is a good indication that we are not missing a lot of data due to this issue.
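
The check above relies on a substring match. A complementary way to look for misspelled variants, which is not the method used in this notebook, is fuzzy matching with the standard-library `difflib`. In the sketch below, `'Zotaxa'` is a made-up typo added for illustration, and the other names come from the output above.

```python
import difflib

journal_names = [
    "Zootaxa",
    "Zotaxa",  # made-up typo, not present in the dataset
    "Acta Zootaxonomica Sinica",
    "Veterinaria e Zootecnia",
]

# Names at least 80% similar to 'Zootaxa', most similar first
close = difflib.get_close_matches("Zootaxa", journal_names, n=5, cutoff=0.8)
```

Unlike a substring search, this ignores legitimate journals that merely contain "Zoot", and it would catch a typo that breaks the substring. The 0.8 cutoff is an arbitrary choice that would need tuning on the real names.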

Summary of transformations needed in this dataset (Web of Science/Zoological Records):

1. Remove duplicates based on 'Source', 'Year Published', 'Volume', 'Issue', 'Page Span' columns
2. Select only 'Source' and 'Year Published' columns
3. Rename these two columns to 'Journal' and 'Year'
4. Add a column named 'Source', with all rows = 'WoS/Zoological Records'
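
The four steps above could be chained roughly as follows. This is a sketch on made-up rows, not the final transform code. In particular, `drop_duplicates` keeps the first row of each duplicated group by default, while passing `keep=False` would drop every member, and the choice between the two is left open here.

```python
import pandas as pd

# Made-up rows with the relevant WoS columns; the first two share all key fields
raw = pd.DataFrame({
    "Source": ["Zootaxa", "Zootaxa", "ZooKeys"],
    "Year Published": ["2010", "2010", "2011"],
    "Volume": ["1", "1", "87"],
    "Issue": ["1", "1", "2"],
    "Page Span": ["1-8", "1-8", "19-41"],
    "Title": ["Paper A", "Paper A", "Paper B"],
})
key = ["Source", "Year Published", "Volume", "Issue", "Page Span"]

tidy = (
    raw.drop_duplicates(subset=key)               # 1. remove duplicates (keeps the first)
    .loc[:, ["Source", "Year Published"]]         # 2. select the two columns
    .rename(columns={"Source": "Journal", "Year Published": "Year"})  # 3. rename
    .assign(Source="WoS/Zoological Records")      # 4. tag the data source
)
```

The rename has to come before the new 'Source' column is added, because the original 'Source' column holds the journal name.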

##### TreatmentBank (Plazi)

In [190]:
# First, check the available columns again
tb.columns

Index(['id', 'authors', 'year', 'title', 'journal', 'volume', 'issue',
       'start_page', 'end_page', 'num_treats'],
      dtype='object')

###### Duplicates

There isn't much that we can easily compare here, so I'm adopting the second strategy applied to the WoS dataset. Thus:

1. Based on the columns: 'journal', 'year', 'volume', 'issue', 'start_page'

It's worth mentioning that I'm deliberately avoiding 'title' here, because I strongly believe that the more complex and variable a field is, the more likely it is to contain typos and other variations. As such, 'title' isn't a good fit for a duplicate check.

In [258]:
# Check for duplicates using the defined strategy

# First, remove possible rows with NaNs in all of these fields:
tb_duplicates = tb.dropna(subset=['journal', 'year', 'volume', 'issue', 'start_page'], how='all', inplace=False)

# Then, apply the strategy
tb_duplicates.duplicated(subset=['journal', 'year', 'volume', 'issue', 'start_page'], keep=False).value_counts()

False    18763
True       192
dtype: int64

In [259]:
# Take a careful look at some of these so-called duplicates
tb_duplicates[tb_duplicates.duplicated(subset=['journal', 'year', 'volume', 'issue', 'start_page'], keep=False)]

Unnamed: 0,id,authors,year,title,journal,volume,issue,start_page,end_page,num_treats
329,330,"Min, Gao & Zhang, Yalin",2013,Review of the leafhopper genus Oniella Matsumu...,Zootaxa,3693,1,0,0,1
357,358,"Miranda, Vinicius Rocha & Rodrigues, Andrielle...",2020,A new species of Ophryotrocha (Annelida: Dorvi...,Pap. Avulsos Zool.,60,0,1,8,1
452,453,"Molineri, Carlos & Granados-Martinez, Cristian E.",2019,Two new species of Campsurus Eaton (Ephemeropt...,Zootaxa,4543,1,0,0,2
504,505,"Monteiro, Nilton Juvencio Santiago & Esposito,...",2017,New species and new records of Calycomyza Hend...,Zootaxa,4338,3,0,0,2
642,643,Mostafa R. Sharaf & Joe Monks & Andrew Polaszek,2016,A remarkable new species of the genus Lepisiot...,Journal of Natural History,50,29,1,13,1
...,...,...,...,...,...,...,...,...,...,...
18571,18572,"Martynov, Alexander V. & Godunko, Roman J.",2017,Mayflies of the Caucasus Mountains. IV. New sp...,Zootaxa,4231,1,0,0,1
18603,18604,"Maruyama, Munetoshi & Kamezawa, Hiromu",2013,"Adinopsis nippon, a new species of marsh-dwell...",Zootaxa,3669,1,0,0,1
18868,18869,"Abolafia, Joaquín & Shokoohi, Ebrahim",2017,Description of Stegelletina lingulata sp. n. (...,Zootaxa,4358,3,0,0,1
18916,18917,"Adeldoost, Yaser & Heydari, Ramin & Pedram, Majid",2015,Morphological and molecular characterization o...,Zootaxa,4040,1,0,0,1


In [260]:
# Looking at the first record on that list, to see if we can find its duplicate
# (a single combined mask avoids the reindexing warning raised by chained boolean indexing)
tb_duplicates[(tb_duplicates['journal'] == 'Zootaxa') & (tb_duplicates['volume'] == '3693') & (tb_duplicates['issue'] == '1') & (tb_duplicates['start_page'] == '0')]

Unnamed: 0,id,authors,year,title,journal,volume,issue,start_page,end_page,num_treats
329,330,"Min, Gao & Zhang, Yalin",2013,Review of the leafhopper genus Oniella Matsumu...,Zootaxa,3693,1,0,0,1
14918,14919,"Huerta, Heron & Dzul, Felipe",2013,First record of the genus Abrhexosa Freeman fr...,Zootaxa,3693,1,0,0,3


Ok, this is clearly NOT a duplicate, but a metadata issue: two different papers share the same volume and issue, and both have 'start_page' set to 0. Let's check another one.

In [261]:
# Looking at another record on that list, to see if we can find its duplicate
# (a single combined mask avoids the reindexing warning raised by chained boolean indexing)
tb_duplicates[(tb_duplicates['journal'] == 'Journal of Natural History') & (tb_duplicates['volume'] == '50') & (tb_duplicates['issue'] == '29') & (tb_duplicates['start_page'] == '1')]

Unnamed: 0,id,authors,year,title,journal,volume,issue,start_page,end_page,num_treats
642,643,Mostafa R. Sharaf & Joe Monks & Andrew Polaszek,2016,A remarkable new species of the genus Lepisiot...,Journal of Natural History,50,29,1,13,1
14903,14904,Huei-Ping Shen & Chih-Han Chang & Wen-Jay Chih,2016,Amynthas (Megascolecidae: Oligochaeta) from so...,Journal of Natural History,50,29,1,22,1


Again, this is very likely another metadata issue: the two rows are different papers. I think we should keep the so-called 'duplicates' in this dataset.
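
Checking flagged rows one by one does not scale to all 192 of them. One possible shortcut, sketched here on made-up rows that mirror the cases inspected above, is to count the distinct titles inside each flagged key group. This is an assumption-laden heuristic rather than the method used in this notebook: because titles are prone to typos, a group with more than one title is only a *candidate* key collision, while a group with a single title is more likely a genuine duplicate.

```python
import pandas as pd

key_cols = ['journal', 'year', 'volume', 'issue', 'start_page']

# Made-up rows: one key collision (two papers) and one genuine duplicate
tb_sample = pd.DataFrame({
    'title': ['Review of Oniella', 'First record of Abrhexosa',
              'Same paper', 'Same paper'],
    'journal': ['Zootaxa', 'Zootaxa', 'ZooKeys', 'ZooKeys'],
    'year': [2013, 2013, 2015, 2015],
    'volume': ['3693', '3693', '500', '500'],
    'issue': ['1', '1', '0', '0'],
    'start_page': ['0', '0', '1', '1'],
})

# Rows sharing all key columns with at least one other row
flagged = tb_sample[tb_sample.duplicated(subset=key_cols, keep=False)]

# Number of distinct titles within each flagged key group
titles_per_key = flagged.groupby(key_cols)['title'].nunique()

key_collisions = titles_per_key[titles_per_key > 1]     # different papers, same key
likely_duplicates = titles_per_key[titles_per_key == 1]  # same paper repeated

print(len(key_collisions), len(likely_duplicates))
```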

###### 'year' NaNs

This is a more straightforward check: simply count the NaNs in a specific column, 'year'.

In [205]:
# Check the amount of NaN on column 'year'
tb['year'].isna().value_counts()

False    18955
Name: year, dtype: int64

Again! Not a problematic field here!

###### journals' names

I just want to have a quick look into the (possible) mess that the journals' names in this source might be. I think this should be the most accurate source.

In [206]:
# Group the tb dataframe by 'journal', counting the non-null 'year' values per journal
tb_journals = tb.groupby('journal').count()['year'].sort_values(ascending=False)

In [207]:
# Check the number of journals
tb_journals.shape[0]

196

BIG difference here: from over 2500 to 196. TreatmentBank seems to be not only a much smaller database but also a less diverse one.

In [208]:
# Take a look at the journals with fewer than 2 occurrences, as we might see some non-journals here
list(tb_journals[tb_journals < 2].sort_values().index)

['Communications Biology',
 'Journal Of The Kansas Entomological Society',
 'Journal of Bryology',
 'Israel Journal Of Entomology',
 'Florida Entomologist',
 'New Journal of Botany',
 'Occasional Papers, Museum of Texas Tech University',
 'Zootax',
 'Zoologischer Anzeiger',
 'Zoological Research',
 'Zoological Journal of the Linnean Society',
 'Zookeys',
 'J. Parasitol.',
 'ZooTaxa',
 'UPLB Museum Publications in Natural History',
 'Tropical Zoology',
 'The Veliger',
 'The Journal of Research on the Lepidoptera',
 'Taxon',
 'Systematics and Biodiversity',
 'Systematic Parasitology',
 'Stuttgarter Beiträge zur Naturkunde A, Neue Serie',
 'Soil Organisms',
 'Serket',
 'Orgamisms, Diversity & Evolution',
 'Vestnik Zoologii',
 'Oriental Insects',
 'J. Entomol. Res. Soc.',
 'J Zool Syst Evol Res',
 'Nature Communications',
 "Mémoires du Muséum national d'Histoire naturelle",
 'Mycol Progress',
 'Mitteilungen der Münchner Entomologischen Gesellschaft',
 'Memoirs of the Queensland Museu Natur

There are fewer entries that are not real journals (for example, 'Handbook of the Birds of the Wold Special volume: new species and global index') than in the WoS dataset, so this doesn't seem to be a big problem here.


In [212]:
# Checking for possible typos involving Zootaxa's name
tb_journals[['Zoot'.lower() in s.lower() for s in tb_journals.index]]

journal
Zootaxa    13134
ZooTaxa        1
Zootax         1
Name: year, dtype: int64

In [210]:
# Definitely non-significant, but I decided to check 'ZooKeys' too
tb_journals[['Zook'.lower() in s.lower() for s in tb_journals.index]]

journal
ZooKeys    3183
Zookeys       1
Name: year, dtype: int64

In [211]:
# And European Journal of Taxonomy
tb_journals[['Taxon'.lower() in s.lower() for s in tb_journals.index]]

journal
European Journal of Taxonomy    564
European Jornal of Taxonomy       1
Taxon                             1
Name: year, dtype: int64

It's clear to me that journals' names are not a problem in this dataset.

Summary of transformations needed in this dataset (TreatmentBank):

1. Select only 'journal' and 'year' columns
2. Rename these two columns to 'Journal' and 'Year', with capital initial letters
3. Add a column named 'Source', with all rows = 'TreatmentBank'
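
Since no rows are dropped for this source, the steps above reduce to a column selection and a rename. A minimal sketch, assuming a hypothetical helper name `transform_tb` and made-up sample rows:

```python
import pandas as pd


def transform_tb(tb: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the three listed steps to the TreatmentBank dataframe."""
    out = (
        tb.loc[:, ['journal', 'year']]                           # 1. keep two columns
        .rename(columns={'journal': 'Journal', 'year': 'Year'})  # 2. capitalise names
        .reset_index(drop=True)
    )
    out['Source'] = 'TreatmentBank'                              # 3. provenance column
    return out


# Made-up rows, only to exercise the function
tb_sample = pd.DataFrame({
    'id': [330, 14919],
    'journal': ['Zootaxa', 'Zootaxa'],
    'year': [2013, 2013],
    'volume': ['3693', '3693'],
})
tb_clean = transform_tb(tb_sample)
print(tb_clean)
```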

##### Zoobank


In [213]:
# First, let's check columns' names
zb.columns

Index(['id', 'authors', 'year', 'title', 'journal', 'bibref_details', 'volume',
       'issue', 'start_page', 'end_page', 'from_year'],
      dtype='object')

###### Duplicates

This dataset is very similar, in terms of structure, to the previous one. There aren't many fields that we can easily compare, so I'm again reusing the second strategy applied to the WoS dataset. Thus, duplicates are checked:

1. Based on the columns: 'journal', 'bibref_details'

The column 'bibref_details' carries the info for volume, issue, start_page and end_page combined. Thus, it's a good place to check for duplicates.
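
To make the relationship between 'bibref_details' and the separate columns concrete, here is a small sketch that splits a value such as `4353(2): 257–293` into its parts. The regular expression is an assumption inferred from the few values visible in this notebook (volume, issue in parentheses, then a page range separated by a hyphen or an en dash); the real data very likely contains other shapes, for which this hypothetical `parse_bibref` simply returns `None`.

```python
import re

# Assumed shape: "<volume>(<issue>): <start_page>-<end_page>", hyphen or en dash
BIBREF_RE = re.compile(
    r'^\s*(?P<volume>\d+)\((?P<issue>[^)]+)\):\s*'
    r'(?P<start_page>\d+)\s*[-–]\s*(?P<end_page>\d+)\s*$'
)


def parse_bibref(details):
    """Return a dict with volume, issue, start_page and end_page, or None."""
    if not isinstance(details, str):
        return None  # e.g. NaN for rows without bibliographic details
    match = BIBREF_RE.match(details)
    return match.groupdict() if match else None


print(parse_bibref('4353(2): 257–293'))
print(parse_bibref('11(4): 314-326'))
print(parse_bibref(None))
```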

In [262]:
# Check for duplicates using the defined strategy

# First, remove possible rows with NaNs in all of these fields:
zb_duplicates = zb.dropna(subset=['journal', 'bibref_details'], how='all', inplace=False)

# Then, apply the strategy
zb_duplicates.duplicated(subset=['journal', 'bibref_details'], keep=False).value_counts()

False    39360
True      9525
dtype: int64

Wow. That's a LOT of potential duplicates. Better check this more carefully.

![](https://i.imgur.com/dG0WSvp.png)

In [263]:
# Take a careful look into some of these so-called duplicates
zb_duplicates[zb_duplicates.duplicated(subset=['journal', 'bibref_details'], keep=False)]

Unnamed: 0,id,authors,year,title,journal,bibref_details,volume,issue,start_page,end_page,from_year
2,3,"Ahyong, Shane T.",2010,The marine fauna of New Zealand: king crabs of...,No,,,,,,2010
6,7,"Aldawood, Abdulrahman S., Mostafa R. Sharaf & ...",2010,Monomorium moathi sp,n,,,,,,2010
14,15,"Altuna, Álvaro.",2017,Deep-water scleractinian corals (Cnidaria: Ant...,Zootaxa,4353(2): 257–293,4353,2,257,293,2010
15,16,"Álvarez, Fernando, Thomas M. Iliffe & José L. ...",2014,A new species of the alpheid shrimp genus Tria...,Zootaxa,3768(1): 88–94,3768,1,88,94,2010
16,17,"Ambrus, R. & W. Grosser.",2012,Additions and corrections to the new Catalogue...,Löbl and A,,,,,,2010
...,...,...,...,...,...,...,...,...,...,...,...
48878,48879,"Żyła, Dagmara, Shȗhei Yamamoto & Josh Jenkins ...",2019,Total-evidence approach reveals an extinct lin...,Palaeontology,,,,,,2019
48879,48880,"Антонов, А. Л. & И. В. Костомарова.",2019,МИКИЖА PARASALMO MYKISS (SALMONIDAE) У ГРАНИЦ ...,,,,,,,2019
48880,48881,"Важенина, Н. В.",2019,Динамика населения герпетобионтных жесткокрылы...,Amurian Zoological Journal,11(4): 314-326,11,4,314,326,2019
48883,48884,"Озерский, П. В.",2019,К распространению некоторых видов кузнечиков (...,Amurian Zoological Journal,11(3): 240-246,11,3,240,246,2019


I can already see a real duplication on the IDs 48884 and 48885. As I have the 'from_year' column, I actually went back to [Zoobank#2019](http://zoobank.org/Search?search_term=2019) and found these two guys duplicated... 

![](https://i.imgur.com/HHyM254.png)

So we have duplicates. Let's check one more.

In [298]:
# Randomly checking for another case of duplication...
zb_duplicates[zb_duplicates['bibref_details'] == '3768(1): 88–94']

Unnamed: 0,id,authors,year,title,journal,bibref_details,volume,issue,start_page,end_page,from_year
15,16,"Álvarez, Fernando, Thomas M. Iliffe & José L. ...",2014,A new species of the alpheid shrimp genus Tria...,Zootaxa,3768(1): 88–94,3768,1,88,94,2010
1294,1295,"Álvarez, Fernando, Thomas M. Iliffe & José L. ...",2014,A new species of the alpheid shrimp genus Tria...,Zootaxa,3768(1): 88–94,3768,1,88,94,2014


This one is even more interesting as their 'from_year' are different! Let me see if I can find them on Zoobank...

[Zoobank#2010](http://zoobank.org/Search?search_term=2010)

![](https://i.imgur.com/qEV24g4.png)

[Zoobank#2014](http://zoobank.org/Search?search_term=2014)

![](https://i.imgur.com/EFY6Cxn.png)

Yep, duplicate! And a weird one: how come a paper that was only published in 2014 appears registered under 2010? **Better remove them all!**
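
One caveat before removing anything: pandas' `duplicated` treats missing values as equal to each other, which is why rows with an empty 'bibref_details' (such as the ones with journal 'No' or 'n' in the listing above) were flagged as well. The sketch below, on made-up rows, separates duplicates that are backed by real bibliographic details from those that only match on missing values. Keeping the first copy of each confirmed duplicate is my assumption about what "remove them all" means here, not something the notebook settles.

```python
import numpy as np
import pandas as pd

key_cols = ['journal', 'bibref_details']

# Made-up rows: one real duplicate pair and three rows without details
zb_sample = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'journal': ['Zootaxa', 'Zootaxa', 'n', 'n', 'Palaeontology'],
    'bibref_details': ['3768(1): 88–94', '3768(1): 88–94',
                       np.nan, np.nan, np.nan],
})

# NaN == NaN for duplicated(), so ids 3 and 4 are flagged too
flagged = zb_sample.duplicated(subset=key_cols, keep=False)

# Only trust duplicates that actually carry bibliographic details
has_details = zb_sample['bibref_details'].notna()
confirmed = flagged & has_details

# Drop every repeated copy of a confirmed duplicate, keeping the first one
repeat_copy = zb_sample.duplicated(subset=key_cols, keep='first') & has_details
zb_dedup = zb_sample[~repeat_copy]

print(flagged.sum(), confirmed.sum(), list(zb_dedup['id']))
```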

###### 'year' NaNs

This is a more straightforward check: simply count the NaNs in a specific column, 'year'.

In [299]:
# Check the amount of NaN on column 'year'
zb['year'].isna().value_counts()

False    48885
Name: year, dtype: int64

For a dataset built on a massive web scraping effort on a not reliably structured website, that just made my day.

![](https://media.giphy.com/media/OK27wINdQS5YQ/giphy-downsized.gif)


###### journals' names

I just want to have a quick look into the (possible) mess that the journals' names in this source might be. I think this must be the least accurate as the data wasn't structured enough for a safe extraction of the journals' names.

In [300]:
# Group the zb dataframe by 'journal', counting the non-null 'year' values per journal
zb_journals = zb.groupby('journal').count()['year'].sort_values(ascending=False)

In [301]:
# Check the number of journals
zb_journals.shape[0]

3981

Way too many. There are probably a LOT of errors. As this was the result of web scraping badly structured HTML, I probably still have highly ranked mistakes too.

In [302]:
# Take a look at the journals with fewer than 2 occurrences, as we might see some non-journals here
list(zb_journals[zb_journals < 2].sort_values().index)

it and E',
 'pseudogaster (Dahl,',
 'pseudolobata Kondo & Gullan',
 'procerus Kuiter,',
 'profundus (Crustacea: Decapoda: Paguridae)',
 'prolixa sp',
 'prophanaeus (C',
 'putnamae, with a revised key to Indo-Pacific species of Sphyraena lacking gill rakers (Sphyraenidae)',
 'pusilla species-group from Queensland, Australia (Coleoptera: Buprestidae)',
 'pusilla (Mertens,',
 'punctatonuchalis and C',
 'pseudogranulosus, a new cryptic species of Long-toed Water Beetles (Coleoptera: Dryopidae) from Indochinese peninsula, and proposal of a new synonym for Praehelichus sericatus (Waterhouse,',
 'punctata from northern Western Ghats of India',
 'pulverata Christoph,',
 'pulchra (Araneae: Salticidae: Plexippina) in Brazil',
 'pulcherrimus, two new riffle beetle species from Borneo, and discussion about elmid plastron structures (Coleoptera: Elmidae)',
 'pteridis from the Ryukyu Islands',
 'pseudouniformis Brame,',
 'pseudostylifer sp',
 'pseudospicatus (Diptera: Dolichopodidae): morphological 

As expected, most of these are not journals, which is a different situation from the one found in the two preceding datasets.

In [312]:
# Take a look at the top 40 journals in this dataset
zb_journals.head(40)

journal
Zootaxa                                           16093
ZooKeys                                            4047
nov                                                1661
Journal of Threatened Taxa                          921
n                                                   797
sp                                                  581
Systematic & applied acarology                      551
European Journal of Taxonomy(                       484
Public Library of Science, ONE                      430
Far Eastern Entomologist(                           365
Raffles Bulletin of Zoology                         352
                                                    347
Acta Entomologica Musei Nationalis Pragae           316
Journal of Hymenoptera Research                     310
Japanese Journal of Systematic Entomology           272
Journal of Systematic Palaeontology                 256
Insecta mundi(                                      251
Biodiversity Data Journal               

Well, the fact that the journal 'blank' got 347 hits is not a good sign. As expected, many issues here too.

![](https://media.giphy.com/media/3oEjI80DSa1grNPTDq/giphy.gif)

In [306]:
# Checking for possible typos involving Zootaxa's name. Also expecting many...
zb_journals[['Zoot'.lower() in s.lower() for s in zb_journals.index]]

journal
Zootaxa                                                                               16093
Zootaxa,                                                                                119
Zootaxa(                                                                                 66
Acta Zootaxonomica Sinica                                                                 8
curiosa group (Odonata: Platystictidae)  Zootaxa,                                         2
Zootaxa:                                                                                  2
defectus Ferris  (Hemiptera: Pseudococcidae) distinct species? Zootaxa                    1
Part III: North America and Greenland by Kaczmarek, Michalczyk & McInnes  (Zootaxa        1
Name: year, dtype: int64

In [307]:
# Checking for possible typos involving ZooKeys's name.
zb_journals[['Zook'.lower() in s.lower() for s in zb_journals.index]]

journal
ZooKeys     4047
ZooKeys(      11
ZooKeys:       4
Zookeys        2
Name: year, dtype: int64

In [308]:
# Checking for possible typos involving European Journal of Taxonomy's name.
zb_journals[['Taxon'.lower() in s.lower() for s in zb_journals.index]]

journal
European Journal of Taxonomy(                                                                                                                              484
European journal of taxonomy(                                                                                                                               54
European Journal of Taxonomy                                                                                                                                24
Taxon                                                                                                                                                        8
Acta Zootaxonomica Sinica                                                                                                                                    8
Studies and reports, Taxonomical series                                                                                                                      5
Entomotaxonomia                       

As expected, we do have more problems with typos in this database, but I guess not enough to be a concern if we are only looking at the top 10 or 20 journals. So, moving on!
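
Many of the variants above differ only by a trailing `(`, `,` or `:` left over from the scraping. If they ever become a concern, a light normalisation could merge them. The sketch below is not part of the notebook's pipeline: `normalise_journal` is a hypothetical helper, the counts are copied from the outputs above, and case variants such as 'Zookeys' are deliberately left untouched.

```python
import pandas as pd


def normalise_journal(name):
    """Hypothetical helper: strip whitespace and trailing '(', ',' or ':'."""
    if not isinstance(name, str):
        return name
    return name.strip().rstrip('(,:').strip()


# Counts copied from the Zootaxa and ZooKeys checks above
counts = pd.Series({
    'Zootaxa': 16093, 'Zootaxa,': 119, 'Zootaxa(': 66, 'Zootaxa:': 2,
    'ZooKeys': 4047, 'ZooKeys(': 11, 'ZooKeys:': 4, 'Zookeys': 2,
})

# groupby calls the function on each index label and sums matching groups
merged = counts.groupby(normalise_journal).sum()
print(merged)
```

On these numbers the merge adds 187 records to Zootaxa and 15 to ZooKeys, which supports the point that the effect on the top journals is small.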

Summary of transformations needed in this dataset (Zoobank):

1. Remove duplicates based on 'journal' and 'bibref_details' columns
2. Remove rows that contain highly ranked non-journals, such as "nov", "n", etc.
3. Select only 'journal' and 'year' columns
4. Rename these two columns to 'Journal' and 'Year', with capital initial letters
5. Add a column named 'Source', with all rows = 'Zoobank'
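
A possible shape for these five steps is sketched below. Everything named here is hypothetical: `transform_zb` is not the project's actual function, and the `NON_JOURNALS` blocklist is only seeded with the non-journal values visible in the top-40 listing (plus the empty name), so it would need to be extended by hand. Dropping rows with a missing journal is also my assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical blocklist: non-journal values seen among the top-ranked names
NON_JOURNALS = {'nov', 'n', 'sp', ''}


def transform_zb(zb: pd.DataFrame, non_journals=NON_JOURNALS) -> pd.DataFrame:
    """Hypothetical helper: apply the five listed steps to the Zoobank dataframe."""
    # 1. Remove duplicates (pandas treats missing values as equal here, so rows
    #    sharing a journal and lacking bibref_details collapse into one row)
    out = zb.drop_duplicates(subset=['journal', 'bibref_details'])

    # 2. Remove known non-journals; missing names are treated as empty (assumption)
    journal = out['journal'].fillna('').str.strip()
    out = out[~journal.isin(non_journals)]

    # 3. and 4. Keep two columns and capitalise their names
    out = (
        out.loc[:, ['journal', 'year']]
        .rename(columns={'journal': 'Journal', 'year': 'Year'})
        .reset_index(drop=True)
    )

    # 5. Provenance column
    out['Source'] = 'Zoobank'
    return out


# Made-up rows, only to exercise the function
zb_sample = pd.DataFrame({
    'journal': ['Zootaxa', 'Zootaxa', 'nov', 'ZooKeys', np.nan],
    'bibref_details': ['3768(1): 88–94', '3768(1): 88–94', np.nan,
                       '100: 1–10', np.nan],
    'year': [2014, 2014, 2012, 2011, 2019],
})
zb_clean = transform_zb(zb_sample)
print(zb_clean)
```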

#### Transform



#### SHERPA/RoMEO
At this point, I'm not sure if I'll include it.


#### Exploratory Analyses: After Transform


### (L)oad


## Analysis


### Exploratory Analyses: Final


### Question #1


### Question #2


### Question #3

### Question #4

### Question #5

## Conclusion

## References
I'm not sure if I'll maintain this section or only add these to the README.md