## Notebook 1: Exploring *Aedes aegypti* occurrence data
####  Student: Pedro Correia de Siracusa
####  Academic advisors: Artur Ziviani & Fábio Porto
--------

In this first notebook I perform a basic exploratory analysis of a publicly available scientific dataset of *Aedes aegypti* occurrence records, hosted by the DRYAD Digital Repository (http://datadryad.org/resource/doi:10.5061/dryad.47v3c). The dataset was originally published by
>Kraemer MUG, Sinka ME, Duda KA, Mylne A, Shearer FM, Brady OJ, Messina JP, Barker CM, Moore CG, Carvalho RG, Coelho GE, Van Bortel W, Hendrickx G, Schaffner F, Wint GRW, Elyazar IRF, Teng H, Hay SI (2015) The global compendium of Aedes aegypti and Ae. albopictus occurrence. Scientific Data 2(7): 150035. http://dx.doi.org/10.1038/sdata.2015.35

Besides getting a general overview of the available data, my goal with this exercise is to become more familiar with some basic Python tools for data analysis. Let's start by importing the main dataset into the environment.

In [1]:
import urllib
import pandas as pd

In [2]:
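aegypti_data_url = 'http://datadryad.org/bitstream/handle/10255/dryad.88853/aegypti.csv?sequence=1'

# Download the dataset only if it is not already loaded in this session,
# so that re-running the cell does not trigger a new download.
try:
    aegypti_dataset
except NameError:
    aegypti_dataset = pd.read_csv(aegypti_data_url)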
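
The `try`/`except NameError` check above only avoids a second download within the same session. As a minimal sketch of an alternative, assuming a local file can be used as a cache, the hypothetical helper `load_cached_csv` below (not part of pandas or of the original notebook) reads from disk when a copy exists and downloads otherwise. The demonstration uses a tiny made-up table and a placeholder URL, so it needs no network access.

```python
import os
import tempfile

import pandas as pd


def load_cached_csv(url, cache_path):
    """Hypothetical helper: read cache_path if it exists, else download url and save a copy."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = pd.read_csv(url)
    df.to_csv(cache_path, index=False)
    return df


# Demonstration with a pre-populated cache; the URL is a placeholder and is never contacted.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "aegypti.csv")
    toy = pd.DataFrame({"VECTOR": ["Aedes aegypti"], "YEAR": [1960]})  # made-up toy row
    toy.to_csv(cache_path, index=False)
    cached = load_cached_csv("http://example.invalid/aegypti.csv", cache_path)

print(cached)
```

With the real dataset, the call would be something like `load_cached_csv(aegypti_data_url, 'aegypti.csv')`, where the local file name is an arbitrary choice.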

Let's take a quick look at this dataset:

In [3]:
print(aegypti_dataset.head(3))

          VECTOR  OCCURRENCE_ID  SOURCE_TYPE LOCATION_TYPE POLYGON_ADMIN  \
0  Aedes aegypti              1  unpublished       polygon             2   
1  Aedes aegypti              2  unpublished       polygon             2   
2  Aedes aegypti              3  unpublished       polygon             2   

       Y      X  YEAR                   COUNTRY COUNTRY_ID  GAUL_AD0  STATUS  
0  25.49 -80.99  1960  United States of America        USA       259     NaN  
1  26.12 -81.33  1960  United States of America        USA       259     NaN  
2  26.13 -97.55  1960  United States of America        USA       259     NaN  


Next, I list the five countries with the most occurrence records.

In [4]:
records_by_country = aegypti_dataset.groupby('COUNTRY').size()
records_by_country.sort_values(ascending=False)[:5]

COUNTRY
Taiwan                      9501
Brazil                      5057
Indonesia                    606
Thailand                     500
United States of America     444
dtype: int64
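
The same ranking can also be obtained with `Series.value_counts()`, which counts and sorts in descending order in one step. The sketch below uses a small made-up table (not the real dataset) to show that both approaches give the same counts.

```python
import pandas as pd

# Made-up toy records, only to compare the two approaches.
toy = pd.DataFrame(
    {"COUNTRY": ["Brazil", "Taiwan", "Taiwan", "Brazil", "Taiwan", "Thailand"]}
)

# Approach used in the notebook: group, count rows per group, then sort.
by_groupby = toy.groupby("COUNTRY").size().sort_values(ascending=False)

# Equivalent shortcut: value_counts() sorts by count in descending order by default.
by_value_counts = toy["COUNTRY"].value_counts()

print(by_value_counts.head(2))
```

On the real data, `aegypti_dataset['COUNTRY'].value_counts().head(5)` should list the same five countries as the cell above.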

As I'm mostly interested in the vector's geographic range in Brazil, let's work with the subset of Brazilian records.

In [5]:
aegypti_dataset_br = aegypti_dataset[aegypti_dataset.COUNTRY=='Brazil']
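
The subsetting step relies on boolean indexing: the comparison produces a `True`/`False` Series, and indexing with it keeps only the matching rows. A minimal sketch on made-up toy data follows. The `.copy()` call and the `DECADE` column are my own additions for illustration; copying is one way to make sure later edits to the subset do not affect, or raise warnings about, the original table.

```python
import pandas as pd

# Made-up toy records, only to illustrate boolean indexing.
toy = pd.DataFrame(
    {
        "COUNTRY": ["Brazil", "Taiwan", "Brazil", "Thailand"],
        "YEAR": [1985, 1992, 2004, 1960],
    }
)

# The comparison yields one boolean per row.
is_brazil = toy["COUNTRY"] == "Brazil"

# Indexing with the mask keeps only rows where it is True; .copy() detaches the subset.
toy_br = toy[is_brazil].copy()

# Adding a column to the copy leaves the original table unchanged.
toy_br["DECADE"] = (toy_br["YEAR"] // 10) * 10

print(toy_br)
```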