# General

## Importing Libraries and Modules

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

from sqlalchemy import create_engine

import etl.extract.tb as tb_proc 
import etl.extract.wos as wos_parser

## CRISP-DM

## ETL

### (E)xtraction


#### Web of Science/Zoological Records
WoS does not provide an API, as far as I know, and has strict access policies across all of its products. I therefore downloaded the search results manually through its GUI and coded a parser from scratch that loads the data from the 150+ files into a SQLite database.

The specifics of how I queried this system are in the Methods section of the report.
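A minimal sketch of the "many export files, one dataset" step, for illustration only. It is not the code in `etl.extract.wos`. It assumes the downloads are tab-delimited text files with a header row, and the function name `read_export_files` and the file names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def read_export_files(folder, pattern="*.txt"):
    """Collect the rows of every tab-delimited file in `folder` as dicts."""
    rows = []
    # Sort the paths so the result does not depend on filesystem order.
    for path in sorted(Path(folder).glob(pattern)):
        # utf-8-sig also reads files that start with a byte-order mark.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            rows.extend(csv.DictReader(handle, delimiter="\t"))
    return rows


# Demo with two tiny made-up files standing in for the real downloads.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "part_001.txt").write_text(
        "Title\tVolume\nFirst paper\t47\n", encoding="utf-8"
    )
    Path(folder, "part_002.txt").write_text(
        "Title\tVolume\nSecond paper\t48\n", encoding="utf-8"
    )
    rows = read_export_files(folder)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`.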

In [3]:
# Parse the data and create the DataFrame
wos = wos_parser.create_df()

# Commented out because it generates a file of over 200 MB, which I couldn't push to GitHub. I left it here to show that I worked on this step.

#wos_parser.to_pickle(wos, 'data/wos.p')

# As a last-minute alternative, the pandas.DataFrame will be saved into a SQLite database here.

#TODO: I still have to work on this.
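One possible way to fill in the SQLite export, sketched with `DataFrame.to_sql` and the standard-library `sqlite3` module. The helper names `save_frame` and `load_frame`, the table name `docs`, and the demo data are assumptions made for illustration. The real `wos` frame would take the place of `demo`.

```python
import sqlite3

import pandas as pd


def save_frame(df, conn, table="docs"):
    """Write `df` to a SQLite table, replacing any table with the same name."""
    df.to_sql(table, conn, if_exists="replace", index=False)
    conn.commit()


def load_frame(conn, table="docs"):
    """Read a whole SQLite table back into a DataFrame."""
    return pd.read_sql(f'SELECT * FROM "{table}"', conn)


# Demo with made-up data and an in-memory database.
conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({"Title": ["First paper", "Second paper"], "Volume": [47, 48]})
save_frame(demo, conn)
restored = load_frame(conn)
conn.close()
```

For a file on disk, `sqlite3.connect('data/wos.db')` would replace the in-memory connection (the path is hypothetical).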

In [10]:
# Checking if it loaded
wos.head()

Unnamed: 0,Publication Type,Zoological Record Accession Number,Document Type,Title,Foreign Title,Authors,Source,Volume,Page Span,Publication Date,...,Journal URL,Item URL,Notes,NaN,Book Authors,Publisher,Publisher Address,International Standard Book Number (ISBN),Editors,Group Authors
0,J,ZOOR14707048981,Article,Orthopterans of the Reserva Biologica Alberto ...,Ortopteros de la Reserva Biologica Alberto Man...,"Barranco, Pablo (pbvega@ual.es)",Boletin de la SEA,47,21-32,31 Dec 2010,...,,,,,,,,,,
1,J,ZOOR14707048982,Article,Further considerations on the genus Ananteris ...,,"Lourenco, Wilson R. (arache@mnhn.fr), Duhem, B...",Boletin de la SEA,47,33-38,31 Dec 2010,...,,,,,,,,,,
2,J,ZOOR14707048983,Article,Description of a new species of the genus Mela...,Descripcion de una nueva especie del genero Me...,"Ferrer, Julio, Tovar, Alejandro Castro",Boletin de la SEA,47,39-44,31 Dec 2010,...,,,,,,,,,,
3,J,ZOOR14707048985,Article,New arachnids from Puerto Rico (Arachnida: Amb...,Nuevos aracnidos de Puerto Rico (Arachnida: Am...,"de Armas, Luis F. (zoologia.ies@ama.cu)",Boletin de la SEA,47,55-64,31 Dec 2010,...,,,,,,,,,,
4,J,ZOOR14707048986,Article,Description of new hypogean taxa of Laparoceru...,Descripcion de nuevos Laparocerus hipogeos de ...,"Machado, Antonio (a.machado@telefonica.net), G...",Boletin de la SEA,47,65-69,31 Dec 2010,...,,,,,,,,,,


#### TreatmentBank (Plazi)
TreatmentBank has a powerful API with a GUI. I used the GUI to build the target query and pasted the resulting URL into my code. From there I processed the results and added them to a SQLite database.

In [6]:
# Process the hard-coded API call and create a database from it.
# Commented out because the SQLite db was provided and contains a snapshot of the data I analyzed. Since this step makes an API call, running it now might fetch new data and change my results.

#tb_proc.create_db(tb_proc.get_data())

# Alternatively, I can simply load the .db
db_connect = create_engine('sqlite:///data/tb.db')

tb = pd.read_sql("docs", con=db_connect)

In [11]:
# Checking if it loaded
tb.head()

Unnamed: 0,id,authors,year,title,journal,volume,issue,start_page,end_page,num_treats
0,1,"Fernandez-Triana, Jose L & Whitfield, James B ...",2014,First record of the genus Venanus (Hymenoptera...,Biodiversity Data Journal,2,0,4167,4167,4
1,2,"Barrales-Alcala, Diego A. & Francke, Oscar F.",2018,A new Sky Island species of Vaejovis C. L. Koc...,ZooKeys,760,0,37,53,2
2,3,"Guevara-Guerrero, Gonzalo & Bonito, Gregory & ...",2018,"Tuberaztecorum sp. nov., a truffle species fro...",MycoKeys,30,0,61,72,2
3,4,"Balke, Michael & Ruthensteiner, Bernhard & War...",2015,Two new species of Limbodessus diving beetles ...,Biodiversity Data Journal,3,0,7096,7096,4
4,5,"Costa, Wilson J. E. M. & Amorim, Pedro F.",2018,Cryptic species diversity in the Hypsolebiasma...,ZooKeys,777,0,141,158,4


#### Zoobank
As explained, I scraped Zoobank to collect its data. For this I used a Scrapy project with a custom Item, ItemLoader, input and output processors, and ItemPipeline. As far as I know, these can't quite be run from here: I could only create a CrawlerProcess, which wouldn't have these functionalities. The scraped data was saved into a SQLite database, which is loaded here instead.

In [12]:
# Load the .db created from the web scraping
db_connect = create_engine('sqlite:///data/zb.db')

zb = pd.read_sql("docs", con=db_connect)

In [14]:
# Checking if it loaded
zb.head()

Unnamed: 0,id,authors,year,title,journal,bibref_details,volume,issue,start_page,end_page,from_year
0,1,"Achatz, Johannes G., Matthew D. Hooge, A. Wall...",2010,Systematic revision of acoels with 9+0 sperm u...,Journal of Zoological Systematics and Evolutio...,48(1): 9-32,48.0,1.0,9.0,32.0,2010
1,2,"Affilastro, Andrés A. O. & Ignacio Garcia-Mauro.",2010,"A new Bothriurus (Scorpiones, Bothriuridae) fr...",Zootaxa,2488: 52-64,2488.0,,52.0,64.0,2010
2,3,"Ahyong, Shane T.",2010,The marine fauna of New Zealand: king crabs of...,No,,,,,,2010
3,4,"Ahyong, Shane T., Keiji Baba, Enrique Macphers...",2010,A new classification of the Galatheoidea (Crus...,Zootaxa,2676: 57-68,2676.0,,57.0,68.0,2010
4,5,"Aibek, Ulykpan & Seiki Yamane.",2010,Discovery of the Subgenera Austrolasius and De...,Japanese Journal of systematic entomology,16(2): 197-202,16.0,2.0,197.0,202.0,2010
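To illustrate the kind of work an input processor does, here is a hedged sketch of how a `bibref_details` string such as `48(1): 9-32` could be split into the `volume`, `issue`, `start_page` and `end_page` columns. It is a plain function, not the processor used in the Scrapy project. The name `parse_bibref` and the regular expression are assumptions, and they only cover the two patterns visible in the output above.

```python
import re

# Matches "48(1): 9-32" and "2488: 52-64"; the issue in parentheses is optional.
_BIBREF = re.compile(
    r"^\s*(?P<volume>\d+)\s*(?:\((?P<issue>[^)]+)\))?\s*:"
    r"\s*(?P<start>\d+)\s*[-–]\s*(?P<end>\d+)\s*$"
)


def parse_bibref(details):
    """Split a bibref string into volume, issue and page numbers.

    Returns None for every field when the string is empty or does not match.
    """
    empty = {"volume": None, "issue": None, "start_page": None, "end_page": None}
    if not details:
        return empty
    match = _BIBREF.match(details)
    if match is None:
        return empty
    return {
        "volume": int(match["volume"]),
        "issue": match["issue"],  # kept as text; it may not always be numeric
        "start_page": int(match["start"]),
        "end_page": int(match["end"]),
    }


with_issue = parse_bibref("48(1): 9-32")
without_issue = parse_bibref("2488: 52-64")
missing = parse_bibref("")
```

In a Scrapy project, a function like this could be wrapped in an input processor so that each field is filled when the item is loaded.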


#### Exploratory Analyses: Extraction


### (T)ransform


#### Web of Science/Zoological Records


#### TreatmentBank (Plazi)


#### Zoobank


#### SHERPA/RoMEO
At this point, I'm not sure if I'll include it.


#### Exploratory Analyses: Transform


### (L)oad


## Analysis


### Exploratory Analyses: Final


### Question #1


### Question #2


### Question #3

### Question #4

### Question #5

## Conclusion

## References
I'm not sure if I'll keep this section or only add these to the README.md.