<p><font size="6"><b> CASE - Biodiversity data - analysis</b></font></p>


> *DS Data manipulation, analysis and visualisation in Python*  
> *December, 2019*

> *© 2016, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Note: matplotlib 3.6 renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')

## Reading in the enriched survey data set

<div class="alert alert-success">
    <b>EXERCISE</b>:

<ul>
  <li>Read in the 'survey_data_completed.csv' file and save the resulting DataFrame as variable <code>survey_data_processed</code> (if you did not complete the previous notebook, a version of the CSV file is available in the <code>../data</code> folder).</li>
  <li>Interpret the 'eventDate' column directly as a Python datetime object and make sure the 'occurrenceID' column is used as the index of the resulting DataFrame (both can be done at once when reading the CSV file, using parameters of the <code>read_csv</code> function).</li>
  <li>Inspect the resulting frame (remember <code>.head()</code> and <code>.info()</code>) and check that the 'eventDate' column indeed has a datetime data type.</li>
</ul> 
    
</div>
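
As a rough illustration of the two `read_csv` parameters hinted at above, here is a minimal sketch on a tiny in-memory CSV. The two data rows and the variable name `toy` are made up for the example; only the column names 'occurrenceID' and 'eventDate' come from the exercise.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real survey file
csv_text = """occurrenceID,eventDate,species
1,1977-07-16,albigula
2,1977-07-17,merriami
"""

toy = pd.read_csv(io.StringIO(csv_text),
                  parse_dates=["eventDate"],   # parse this column as datetime
                  index_col="occurrenceID")    # use this column as the index

toy.info()  # the eventDate column should be listed with a datetime64 dtype
```

The same two keyword arguments should work unchanged when a file path is passed instead of the `StringIO` object.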

In [None]:
# %load _solutions/case2_biodiversity_analysis1.py

In [None]:
# %load _solutions/case2_biodiversity_analysis2.py

In [None]:
# %load _solutions/case2_biodiversity_analysis3.py

## Tackle missing values (NaN) and duplicate values

<div class="alert alert-success">
    <b>EXERCISE</b>: How many records are in the data set without information on the 'species' name?
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis4.py

<div class="alert alert-success">
    <b>EXERCISE</b>: How many duplicate records are present in the dataset?

__Tip__: Pandas has a function to find `duplicated` values...
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis5.py

<div class="alert alert-success">
    <b>EXERCISE</b>: Extract a list of all duplicates, sort it on the columns <code>eventDate</code> and <code>verbatimLocality</code>, and show the first 10 records.
    
__Tip__: Check the documentation of `duplicated`.
</div>
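
The `keep` argument of `duplicated` is easy to overlook, so here is a small sketch of its effect on a made-up four-row table (the values and the name `toy` are invented for illustration, not taken from the survey data).

```python
import pandas as pd

toy = pd.DataFrame({
    "eventDate": ["1977-07-17", "1977-07-16", "1977-07-16", "1977-07-17"],
    "verbatimLocality": [2, 1, 1, 3],
})

# Default keep='first': only the repeated occurrence is flagged
n_dup = toy.duplicated().sum()

# keep=False flags every member of a duplicated group
all_dups = (toy[toy.duplicated(keep=False)]
            .sort_values(["eventDate", "verbatimLocality"]))

print(n_dup)       # 1
print(all_dups)    # both copies of the 1977-07-16 / plot 1 record
```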

In [None]:
# %load _solutions/case2_biodiversity_analysis6.py

<div class="alert alert-success">
    
<b>EXERCISE</b>: Exclude the duplicate values from the survey data set and save the result as <code>survey_data_unique</code>
    
__Tip__: Next to finding `duplicated` values, Pandas has a function to `drop duplicates`...
    
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis7.py

In [None]:
len(survey_data_unique)

<div class="alert alert-success">

<b>EXERCISE</b>: For how many records (rows) do we have all the information available (i.e. no NaN values in any of the columns)?

__Tip__: Just counting the NaN (null) values won't work; maybe `dropna` can help you?
    
</div>
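
To see why counting NaN values is not the same as counting complete rows, consider this minimal sketch on an invented three-row table (all names and values are hypothetical).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["albigula", np.nan, "merriami"],
    "sex": ["M", "F", np.nan],
    "plot": [1, 2, 3],
})

# Counts missing *cells* across the whole table
n_missing_cells = toy.isnull().sum().sum()

# dropna() removes every row with at least one NaN, so this counts complete rows
n_complete_rows = len(toy.dropna())

print(n_missing_cells, n_complete_rows)  # 2 1
```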

In [None]:
# %load _solutions/case2_biodiversity_analysis8.py

<div class="alert alert-success">
    
<b>EXERCISE</b>: Select the subset of records without a species name that do have information on the sex, and store the result as variable <code>not_identified</code>.
    
__Tip__: Next to `isnull`, `notnull` also exists...

</div>
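
Two boolean conditions can be combined element-wise with `&`. The sketch below shows the pattern on a made-up table; the data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["albigula", np.nan, np.nan],
    "sex": ["M", "F", np.nan],
})

# True only where species is missing AND sex is present
mask = toy["species"].isnull() & toy["sex"].notnull()
subset = toy[mask]

print(subset)  # only the second row remains
```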

In [None]:
# %load _solutions/case2_biodiversity_analysis9.py

In [None]:
not_identified.head()

<div class="alert alert-success">
    <b>EXERCISE</b>: Select only those records that do have species information and save them as the variable <code>survey_data</code>. Make sure <code>survey_data</code> is a copy of the original DataFrame. This is the DataFrame we will use in the further analyses.
</div>
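
Why insist on a copy? A minimal sketch, on invented data, of how `.copy()` keeps later modifications away from the original DataFrame (the column name `checked` is hypothetical).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"species": ["albigula", np.nan, "merriami"]})

# .copy() returns an independent DataFrame
with_species = toy[toy["species"].notnull()].copy()

# Adding a column changes the copy only; toy is left untouched
with_species["checked"] = True

print(toy.columns.tolist())           # ['species']
print(with_species.columns.tolist())  # ['species', 'checked']
```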

In [None]:
# %load _solutions/case2_biodiversity_analysis10.py

<div class="alert alert-danger">
    <b>NOTE</b>: For biodiversity studies, absence values (knowing that something is not present) are useful as well to normalize the observations, but this is out of scope for these exercises.
</div>

## Observations over time

<div class="alert alert-success">
    
<b>EXERCISE</b>: Make a plot visualizing the evolution of the number of observations for each of the individual years (i.e. annual counts).

__Tip__: The `pandas_04_time_series_data.ipynb` notebook introduced a powerful command to resample a time series.
    
</div>
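
As a reminder of how resampling turns individual records into counts per period, here is a small sketch with five made-up dates. It assumes the dates sit in a regular column (hence `on=`) and uses the year-start frequency alias `"YS"`.

```python
import pandas as pd

dates = pd.to_datetime(["1977-07-16", "1977-11-02",
                        "1978-01-08", "1978-03-12", "1978-05-20"])
toy = pd.DataFrame({"eventDate": dates, "species": list("abcde")})

# Bin the rows per calendar year and count the rows in each bin
per_year = toy.resample("YS", on="eventDate").size()

print(per_year)  # 2 records in 1977, 3 records in 1978
```

Calling `.plot()` on such a Series gives the yearly evolution as a line chart.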

In [None]:
# %load _solutions/case2_biodiversity_analysis11.py

To evaluate the intensity or number of occurrences during different time spans, a heatmap is an interesting representation. We can actually use the plotnine library as well to make heatmaps, as it provides the [`geom_tile`](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_tile.html) geometry. Loading the library:

In [None]:
import plotnine as pn

<div class="alert alert-success">
    
<b>EXERCISE</b>: Create a table, called <code>heatmap_prep_plotnine</code>, based on the <code>survey_data</code> DataFrame with a column for the years, a column for the months and a column with the counts (called `count`).

__Tip__: You have to count for each year/month combination. Also `reset_index` could be useful.
    
</div>
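
One possible way to count per year/month combination is sketched below on four invented dates. Renaming the two grouping Series to `year` and `month` is a choice made here so that `reset_index` yields distinct column names; other approaches work as well.

```python
import pandas as pd

toy = pd.DataFrame({"eventDate": pd.to_datetime(
    ["1977-07-16", "1977-07-20", "1977-08-01", "1978-07-03"])})

# Derive the grouping keys from the datetime column
year = toy["eventDate"].dt.year.rename("year")
month = toy["eventDate"].dt.month.rename("month")

# size() counts rows per group; reset_index turns the group keys into columns
counts = toy.groupby([year, month]).size().reset_index(name="count")

print(counts)
```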

In [None]:
# %load _solutions/case2_biodiversity_analysis12.py

In [None]:
# %load _solutions/case2_biodiversity_analysis13.py

<div class="alert alert-success">
    
<b>EXERCISE</b>: Based on <code>heatmap_prep_plotnine</code>, make a heatmap using the plotnine package. 


__Tip__: When in trouble, check [this section of the documentation](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_tile.html#Annotated-Heatmap)
    
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis14.py

Note that we started from a `tidy` data format (also called *long* format). 

The heatmap functionality is also provided by the plotting library [seaborn](http://seaborn.pydata.org/generated/seaborn.heatmap.html) (check the docs!). According to the documentation, seaborn's heatmap expects the *wide* format, with the years in the row index, the months in the columns, and the counts for each of these year/month combinations as values.

Let's reformat the `heatmap_prep_plotnine` data to be usable for the seaborn heatmap function:

<div class="alert alert-success">
    
<b>EXERCISE</b>: Create a table, called <code>heatmap_prep_sns</code>, based on the <code>heatmap_prep_plotnine</code> DataFrame, with the years in the row index, the months in the columns and, as values of the table, the counts for each of these year/month combinations.
    
__Tip__: The `pandas_07_reshaping_data.ipynb` notebook provides all you need to know
    
</div>
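
To make the long-to-wide reshaping concrete, here is a minimal sketch using `pivot` on a made-up three-row table (the variable names `tidy` and `wide` are assumptions for this example).

```python
import pandas as pd

tidy = pd.DataFrame({"year": [1977, 1977, 1978],
                     "month": [7, 8, 7],
                     "count": [2, 1, 1]})

# One row per year, one column per month, counts as cell values
wide = tidy.pivot(index="year", columns="month", values="count")

print(wide)  # the (1978, 8) cell is NaN because that combination is absent
```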

In [None]:
# %load _solutions/case2_biodiversity_analysis15.py

<div class="alert alert-success">
    <b>EXERCISE</b>: Using the seaborn <a href="http://seaborn.pydata.org/generated/seaborn.heatmap.html">documentation</a> make a heatmap starting from the <code>heatmap_prep_sns</code> variable.
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis16.py

<div class="alert alert-success">
    
<b>EXERCISE</b>: Based on the <code>heatmap_prep_sns</code> DataFrame, return to the <i>long</i> format of the table with the columns `year`, `month` and `count` and call the resulting variable <code>heatmap_tidy</code>.
    
__Tip__: The `pandas_07_reshaping_data.ipynb` notebook provides all you need to know, but a `reset_index` could be useful as well
    
</div>
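
Going back from wide to long can be done in several ways. The sketch below uses `melt` on an invented wide table; `stack` followed by `reset_index` is a common alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: years in the index, months in the columns
wide = pd.DataFrame({7: [2, 1], 8: [1, np.nan]},
                    index=pd.Index([1977, 1978], name="year"))
wide.columns.name = "month"

# Move the index back to a column, then unpivot the month columns
tidy = (wide.reset_index()
            .melt(id_vars="year", var_name="month", value_name="count")
            .dropna(subset=["count"]))   # drop combinations without records

print(tidy)
```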

In [None]:
# %load _solutions/case2_biodiversity_analysis17.py

## Species abundance for each of the plots

The name of the observed species consists of two parts: the 'genus' and 'species' columns. For the further analyses, we want the combined name. This is already available as the 'name' column if you completed the previous notebook, otherwise you can add this again in the following exercise.

<div class="alert alert-success">
    
<b>EXERCISE</b>: Make a new column 'name' that combines the 'genus' and 'species' columns (with a space in between).
    
__Tip__: Remember that you can add (concatenate) strings in Python: 'a' + 'b' gives 'ab'.
    
</div>
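
The `+` operator also works element-wise on string columns. A minimal sketch on two invented rows:

```python
import pandas as pd

toy = pd.DataFrame({"genus": ["Dipodomys", "Neotoma"],
                    "species": ["merriami", "albigula"]})

# Element-wise string concatenation, with a literal space in between
toy["name"] = toy["genus"] + " " + toy["species"]

print(toy["name"].tolist())  # ['Dipodomys merriami', 'Neotoma albigula']
```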

In [None]:
# %load _solutions/case2_biodiversity_analysis18.py

<div class="alert alert-success">

<b>EXERCISE</b>: Which 8 species have been observed most often?
    
__Tip__: Pandas provides a function to combine sorting and showing the first n records, see [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.nlargest.html)...
    
</div>
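
As a small sketch of the counting-then-ranking pattern, the example below applies `value_counts` and `nlargest` to an invented Series of names.

```python
import pandas as pd

names = pd.Series(["a", "b", "a", "c", "a", "b"], name="name")

# Count each distinct value, then keep the 2 largest counts
top2 = names.value_counts().nlargest(2)

print(top2)  # a: 3, b: 2
```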

In [None]:
# %load _solutions/case2_biodiversity_analysis19.py

In [None]:
# %load _solutions/case2_biodiversity_analysis20.py

<div class="alert alert-success">
    <b>EXERCISE</b>: How many records are available of each of the species in each of the plots (called <code>verbatimLocality</code>)? How would you visualize this information with seaborn?
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis21.py

In [None]:
# %load _solutions/case2_biodiversity_analysis22.py

<div class="alert alert-success">
    
<b>EXERCISE</b>: What is the number of different species in each of the plots? Make a bar chart, using the Pandas `plot` function, showing for each plot the diversity of species. Define a matplotlib figure and ax to make the plot, and change the y-label to 'plot number'.

__Tip__: next to `unique`, Pandas also provides a function `nunique`...
    
</div>
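
The difference between counting records and counting distinct values per group is sketched below on made-up data (five records spread over two plots).

```python
import pandas as pd

toy = pd.DataFrame({"verbatimLocality": [1, 1, 1, 2, 2],
                    "name": ["a", "b", "a", "a", "a"]})

# Number of distinct names within each plot
n_species = toy.groupby("verbatimLocality")["name"].nunique()

print(n_species)  # plot 1: 2 distinct names, plot 2: 1 distinct name
```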

In [None]:
# %load _solutions/case2_biodiversity_analysis23.py

<div class="alert alert-success">

<b>EXERCISE</b>: In how many plots has each species been observed? Make a horizontal bar chart using the Pandas `plot` function, showing for each species the number of plots in which it occurs, with the species names sorted by that number of plots.

</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis24.py

<div class="alert alert-success">

<b>EXERCISE</b>: First, exclude the NaN values from the `sex` column and save the result as a new variable called `subselection_sex`. Based on this variable `subselection_sex`, calculate the number of males and females present in each of the plots. Save the result (with the verbatimLocality as index and sex as column names) as a variable <code>n_plot_sex</code>.
    
__Tip__: Release the power of `unstack`...  
    
</div>
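
To illustrate what `unstack` does after a two-level `groupby`, here is a sketch on invented data. Passing `fill_value=0` is a choice made here so that absent combinations show up as 0 rather than NaN.

```python
import pandas as pd

toy = pd.DataFrame({"verbatimLocality": [1, 1, 1, 2, 2],
                    "sex": ["M", "F", "M", "F", "F"]})

# Count rows per (plot, sex) pair, then move the 'sex' level to the columns
counts = (toy.groupby(["verbatimLocality", "sex"])
             .size()
             .unstack(level="sex", fill_value=0))

print(counts)
```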

In [None]:
# %load _solutions/case2_biodiversity_analysis25.py

In [None]:
# %load _solutions/case2_biodiversity_analysis26.py

As such, we can use the variable `n_plot_sex` to plot the result:

In [None]:
n_plot_sex.plot(kind='bar', figsize=(12, 6), rot=0)

<div class="alert alert-success">

<b>EXERCISE</b>: Create the previous plot with the plotnine library, directly from the variable <code>subselection_sex</code>. 
    
__Tip__: When in trouble, check these [docs](http://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_col.html#Two-Variable-Bar-Plot).

</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis27.py

## Select subsets according to taxa of species

In [None]:
survey_data["taxa"].unique()

In [None]:
survey_data['taxa'].value_counts()
#survey_data.groupby('taxa').size()

<div class="alert alert-success">

<b>EXERCISE</b>: Select the records for which the `taxa` is equal to 'Rabbit', 'Bird' or 'Reptile'. Call the resulting variable `non_rodent_species`.
    
__Tip__: You do not have to combine three different conditions, as Pandas has a function to check if something is in a certain list of values    
    
</div>
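
A minimal sketch of list-membership filtering with `isin`, on an invented Series of taxa. The resulting boolean mask can also be inverted with `~`.

```python
import pandas as pd

taxa = pd.Series(["Rodent", "Bird", "Rabbit", "Rodent", "Reptile"])

# True where the value is one of the listed options
mask = taxa.isin(["Rabbit", "Bird", "Reptile"])

print(taxa[mask].tolist())   # ['Bird', 'Rabbit', 'Reptile']
print(taxa[~mask].tolist())  # ['Rodent', 'Rodent']
```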

In [None]:
# %load _solutions/case2_biodiversity_analysis28.py

In [None]:
len(non_rodent_species)

<div class="alert alert-success">

<b>EXERCISE</b>: Select the records for which the `taxa` starts with 'ro' (make sure it does not matter if a capital character is used in the 'taxa' name). Call the resulting variable <code>r_species</code>.

__Tip__: Remember the `.str.` construction to provide all kinds of string functionality?

</div>
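
String methods can be chained through the `.str` accessor. The sketch below, on invented values, lowercases first so that the prefix check ignores capitalisation.

```python
import pandas as pd

taxa = pd.Series(["Rodent", "rodent", "Bird", "ROBIN"])

# Normalise the case, then test the prefix
mask = taxa.str.lower().str.startswith("ro")

print(mask.tolist())  # [True, True, False, True]
```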

In [None]:
# %load _solutions/case2_biodiversity_analysis29.py

In [None]:
len(r_species)

<div class="alert alert-success">
    <b>EXERCISE</b>: Select the records that are not Birds. Call the resulting variable <code>non_bird_species</code>.
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis30.py

In [None]:
len(non_bird_species)

## (OPTIONAL SECTION) Evolution of species during monitoring period

*In this section, all plots can be made with the embedded Pandas plot function, unless specifically asked otherwise.*

<div class="alert alert-success">
    <b>EXERCISE</b>: Plot, using the Pandas <code>plot</code> function, the number of records for <code>Dipodomys merriami</code> per year over the monitoring period.
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis31.py

In [None]:
# %load _solutions/case2_biodiversity_analysis32.py

<div class="alert alert-danger">
    <b>NOTE</b>: Compare the following two graphs. What is different? Which one would you use?
</div>

In [None]:
merriami = survey_data[survey_data["species"] == "merriami"]
fig, ax = plt.subplots(2, 1, figsize=(14, 8))
merriami.groupby(merriami['eventDate']).size().plot(ax=ax[0], style="-") # top graph
merriami.resample("D", on="eventDate").size().plot(ax=ax[1], style="-") # lower graph

<div class="alert alert-success">

<b>EXERCISE</b>: Plot, for the species 'Dipodomys merriami', 'Dipodomys ordii', 'Reithrodontomys megalotis' and 'Chaetodipus baileyi', the monthly number of records as a function of time for the whole monitoring period. Plot each of the individual species in a separate subplot and give them all the same y-axis scale.
    
__Tip__: have a look at the documentation of the pandas plot function.
    
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis33.py

In [None]:
# %load _solutions/case2_biodiversity_analysis34.py

In [None]:
# %load _solutions/case2_biodiversity_analysis35.py

<div class="alert alert-success">
    <b>EXERCISE</b>: Reproduce the previous plot using the plotnine package.
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis36.py

In [None]:
# %load _solutions/case2_biodiversity_analysis37.py

<div class="alert alert-success">
    <b>EXERCISE</b>: Evaluate the yearly number of occurrences for each of the 'taxa' as a function of time.
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis38.py

<div class="alert alert-success">
    <b>EXERCISE</b>: Calculate the number of occurrences for each weekday, grouped by each year of the monitoring campaign, without using the <code>pivot</code> functionality. Call the variable <code>count_weekday_years</code>.
</div>

In [None]:
# %load _solutions/case2_biodiversity_analysis39.py

In [None]:
# %load _solutions/case2_biodiversity_analysis40.py

In [None]:
count_weekday_years.head()

In [None]:
count_weekday_years.plot()

<div class="alert alert-success">
    <b>EXERCISE</b>: Based on the variable <code>count_weekday_years</code>, calculate for each weekday the median number of records based on the yearly count values. Modify the labels of the plot to indicate the actual days of the week (instead of numbers).
</div>
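
A small sketch of a row-wise median with relabelled weekdays follows. It assumes a layout with the weekday numbers in the rows and the years in the columns, which may differ from your own `count_weekday_years`; the numbers themselves are invented. Pandas numbers the days of the week from 0 (Monday) to 6 (Sunday).

```python
import pandas as pd

# Hypothetical counts: rows = day of week (0 = Monday), columns = year
counts = pd.DataFrame({1977: [10, 0, 4],
                       1978: [20, 2, 4],
                       1979: [30, 1, 7]},
                      index=pd.Index([0, 1, 2], name="dayofweek"))

# Median over the years (axis=1) for each weekday
median_per_day = counts.median(axis=1)

# Replace the weekday numbers by readable labels
day_names = {0: "Monday", 1: "Tuesday", 2: "Wednesday"}
median_per_day.index = median_per_day.index.map(day_names)

print(median_per_day)
```

If the weekdays are in the columns instead, taking the median along `axis=0` gives the equivalent result.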

In [None]:
# %load _solutions/case2_biodiversity_analysis41.py

Nice work!