# Data generator: Pseudo-presences list with environmental covariates.
Given a taxon (e.g. bats), this notebook selects associated taxa based on their frequencies.
Use this if you want to produce a CSV list of presences of certain taxa associated in a list of taxonomic trees.


In [1]:
%matplotlib inline
import sys
sys.path.append('/apps')
import django
django.setup()
from drivers.tree_builder import TreeNeo
from drivers.graph_models import TreeNode, Order, Family, graph,Kingdom,Occurrence
from drivers.graph_models import Cell,Mex4km, countObjectsOf
from drivers.graph_models import pickNode
import matplotlib.pyplot as plt
import pandas as pd
import itertools as it
import numpy as np

## Use the ggplot style
plt.style.use('ggplot')

The following example selects the "bats" node in the tree of life and gets the associated taxonomic trees with their corresponding environmental covariates.

In [2]:
## Let's pick the bats node
bats = pickNode(Order,name='Chiroptera')

In [3]:
ids4bats = bats.getCellsById()

## Random selection of cells.

> Note: Data architecture. For storage reasons I couldn't load the complete world bioclimatic layers, so I loaded a regional subset that covers only the Mexican territory.
For this reason, any approach for selecting subsamples needs to be constrained (filtered) by this geometry.
We can do that as follows:

Obtain the list of cells within the Mexican territory.
> The attribute `mexican_cells.values` yields a lazily evaluated object of type QuerySet. We need to cast it to a list to load all the data in memory.


In [4]:
# Get all cell ids
#selected_cells = mexican_cells
selected_cells = ids4bats
#ids = list(selected_cells.values('pk'))

`UniformRandomCellSample` is a method for sampling cells. Its arguments are listed below and used in the example that follows.

# Select uniformly at random
We will load the submodule `sampling` from the module `traversals`. This submodule has some sampling methods for selecting objects from the *Knowledge Graph*.

Usage: 

> ` sampling.UniformRandomCellSample(list_of_cell_ids, CellNodeClass, sample_size=100, with_replacement=False, random_seed='') `

Returns:
> `<py2neo.ogm.Mex4kmSelection>`

In [5]:
from traversals import sampling as sm
cells_with_bats = bats.is_in.related_class
N = 200
sample_cells = sm.UniformRandomCellSample(ids4bats,cells_with_bats,sample_size=N)

INFO Compiling Query and asking the Graph Database
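The internals of the sampler are not shown in this notebook. As a rough illustration of what uniform sampling with or without replacement and a reproducible seed means, here is a minimal standard-library sketch. The helper `uniform_random_sample` and the toy ids are hypothetical; they do not query the graph database and are not the `sampling` implementation.

```python
import random


def uniform_random_sample(ids, sample_size=100, with_replacement=False,
                          random_seed=None):
    """Hypothetical helper: draw ids uniformly at random."""
    rng = random.Random(random_seed)  # private generator, reproducible by seed
    ids = list(ids)
    if with_replacement:
        # Each draw is independent, so an id may appear more than once.
        return [rng.choice(ids) for _ in range(sample_size)]
    # Without replacement we cannot draw more items than exist.
    k = min(sample_size, len(ids))
    return rng.sample(ids, k)


toy_ids = range(1000)  # stand-in for the cell ids returned by the graph
sample = uniform_random_sample(toy_ids, sample_size=200, random_seed=42)
repeat = uniform_random_sample(toy_ids, sample_size=200, random_seed=42)
```

With the same seed both calls return the same 200 distinct ids, which makes a sampled dataset reproducible.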


## Extract richness and Environmental covariates from cells at a given taxonomic level

Inside the module `traversals` there is a submodule `strategies`, which provides different traversal schemes for exploring the *Knowledge Graph*. Here we will use the function `getEnvironmentalCovariatesFromListOfCells(sample_cells)`, which returns a dataframe with the summary statistics of the environmental covariates for a list of Cell objects (`sample_cells`).



In [6]:
from traversals import strategies as st

In [7]:
%time data = st.getEnvironmentalCovariatesFromListOfCells(sample_cells)

CPU times: user 16.4 s, sys: 232 ms, total: 16.7 s
Wall time: 1min 53s


In [8]:
%time coords = st.getCentroidsFromListofCells(sample_cells)

CPU times: user 6.47 s, sys: 128 ms, total: 6.6 s
Wall time: 9.93 s


### Build the trees from these cells
Use the function `buildTreeNeo` to convert each cell into its taxonomic tree.

In [9]:
from drivers.tree_builder import buildTreeNeo

In [10]:
%time cells = list(sample_cells)

CPU times: user 6.52 s, sys: 52 ms, total: 6.57 s
Wall time: 9.94 s


In [11]:
%time trees = map(lambda c : buildTreeNeo(c),cells)

CPU times: user 6min 6s, sys: 10.3 s, total: 6min 16s
Wall time: 7min 48s


## Calculating Node frequencies
To achieve this we must first obtain the global tree within this region.
Frequencies can only be obtained relative to a context.
In this case the context is the totality of the trees. Given that each tree
is a subtree of the tree of life and all of them share at least one common ancestor (Root),
it is possible to take the union of these trees. As explained before [reference], the
set of taxonomic trees admits a monoid structure (the addition). The union of two trees
will then be called `addition` and will be denoted with the $+$ symbol (the `+` operator in the code).


> ### Implementation note:
It is appealing to `integrate` a list of trees with a reduce function (fold).
* `big_tree = reduce(lambda a,b : a + b ,trees)`

Doing so will require a much higher processing time.

The most efficient way of calculating the union (merge) of a list of trees is to first extract the occurrences and then build the resulting tree from them, making use of the taxonomic relationships.
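The difference between the two strategies can be illustrated with a toy model. This is only a sketch: `ToyTree` is a hypothetical stand-in that stores a flat list of occurrences and counts how many times a tree is constructed, and it is not the `TreeNeo` class.

```python
from functools import reduce
from itertools import chain


class ToyTree(object):
    """Hypothetical stand-in for a taxonomic tree: a list of occurrences."""
    builds = 0  # how many trees have been constructed so far

    def __init__(self, occurrences):
        ToyTree.builds += 1
        self.occurrences = list(occurrences)

    def __add__(self, other):
        # Merging two trees rebuilds a whole new tree from both lists.
        return ToyTree(self.occurrences + other.occurrences)


trees = [ToyTree([i, i + 1]) for i in range(50)]

# Strategy 1: fold the trees pairwise; each step builds an intermediate tree.
ToyTree.builds = 0
folded = reduce(lambda a, b: a + b, trees)
builds_fold = ToyTree.builds

# Strategy 2: gather all occurrences first, then build one single tree.
ToyTree.builds = 0
merged = ToyTree(chain.from_iterable(t.occurrences for t in trees))
builds_once = ToyTree.builds
```

Both strategies end with the same occurrences, but the fold constructs one intermediate tree per merge while the second strategy constructs the final tree only once. When building a tree is expensive, that difference dominates the running time.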



In [12]:
%time ocs = reduce(lambda a,b : a + b ,map(lambda t : t.occurrences, trees))

CPU times: user 236 ms, sys: 0 ns, total: 236 ms
Wall time: 233 ms


In [13]:
## Ohh! super fast (compared to the other method that takes more than 20 minutes for this sample size)
%time bigtree = TreeNeo(ocs,cell_objects=cells)

CPU times: user 3.13 s, sys: 20 ms, total: 3.15 s
Wall time: 3.15 s


## Let's first rank the most common nodes in the selected trees.
i.e. the nodes that are quite common given `bats`
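As a minimal sketch of what a node frequency could mean here, assume that the frequency of a taxon is the fraction of sampled cells (trees) in which it is present. This is an assumption about `countNodesFrequenciesOnList`, not its documented behavior. The toy data and the helper `node_frequencies` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: the set of taxa present in each sampled cell.
cell_taxa = [
    {"Chiroptera", "Fabaceae", "Poaceae"},
    {"Chiroptera", "Fabaceae"},
    {"Chiroptera", "Asteraceae"},
    {"Chiroptera"},
]


def node_frequencies(list_of_taxa_sets):
    """Fraction of cells in which each taxon appears at least once."""
    counts = Counter()
    for taxa in list_of_taxa_sets:
        counts.update(taxa)  # a set contributes at most one count per cell
    n = float(len(list_of_taxa_sets))
    return {taxon: c / n for taxon, c in counts.items()}


freqs = node_frequencies(cell_taxa)
# Rank from most to least frequent, breaking ties alphabetically.
ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
```

Under this reading, a taxon found in a single cell out of `N = 200` would have a frequency of 1/200 = 0.005.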
    

In [14]:
bigtree.countNodesFrequenciesOnList(trees)

INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.02
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.02
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.01
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.035
INFO Going deep 0.035
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.015
INFO Going deep 0.025
INFO Going deep 0.04
INFO Going deep 0.165
INFO Going deep 0.075
INFO Going deep 0.07
INFO G

0.05

## Obtaining presences of other families
Having ranked by frequency, we can select the 10 most abundant families associated with `Chiroptera`.
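The idea of picking the top families and turning them into one presence/absence row per cell can be sketched with plain pandas. The frequencies and cell contents below are toy values, and the table records plain presence (1) or absence (0); it does not reproduce whatever additional logic `pseudoPresenceAbsence` applies.

```python
import pandas as pd

# Toy frequencies per family (hypothetical values).
family_freq = {"Poaceae": 0.39, "Fabaceae": 0.54, "Asteraceae": 0.48}

# Keep the k most frequent families, here k = 2.
selected = sorted(family_freq, key=family_freq.get, reverse=True)[:2]

# Hypothetical families observed in each sampled cell.
cell_families = [
    {"Fabaceae", "Poaceae"},
    {"Asteraceae"},
    set(),
    {"Fabaceae", "Asteraceae", "Poaceae"},
]

# One Series per cell: 1 if the family is present, 0 otherwise.
rows = [pd.Series([int(f in fams) for f in selected], index=selected)
        for fams in cell_families]

# Series become columns under axis=1, so transpose to get one row per cell.
presence = pd.concat(rows, axis=1).transpose().reset_index(drop=True)
```

The result has one row per cell and one column per selected family, which is the same shape that the notebook later exports.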


In [15]:
freqs = pd.DataFrame(map(lambda t : t.n_presences_in_list, bigtree.families))

In [16]:
bigtree.genera.sort(key=lambda k:k.n_presences_in_list,reverse=True)

In [17]:
## Let's filter by plants
plants = bigtree.to_Plantae.plantTreeNode()

In [18]:
plants.countNodesFrequenciesOnList(list_of_trees=trees)

INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.005
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.02
INFO Going deep 0.045
INFO Going deep 0.045
INFO Going deep 0.045
INFO Going deep 0.01
INFO Going deep 0.005
INFO Going deep 0.015
INFO Going deep 0.015
INFO Going deep 0.06
INFO Going deep 0.06
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.01
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.015
INFO Going deep 0.015
INFO Going deep 0.005
INFO Going deep 0.015
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.02
INFO Going deep 0.005
INFO Going deep 0.005
INFO Going deep 0.02
INFO Going deep 0.015
INFO Going deep 0.04
INFO Going deep 0.015
INFO Going deep 0.015
INFO Going deep 0.005
INFO 

0.05

Sort taxa by deciles, choose the penultimate decile.
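Selecting the penultimate decile could be sketched with `pandas.qcut`. The family names and frequencies below are hypothetical, and ranking before binning is a choice made here so that tied frequencies do not produce duplicate bin edges.

```python
import pandas as pd

# Hypothetical frequencies for 20 families: 0.05, 0.10, ..., 1.00.
freqs = pd.Series({"fam_%02d" % i: (i + 1) / 20.0 for i in range(20)})

# Assign each family to a decile, labelled 0 (lowest) to 9 (highest).
# Ranking first keeps the bins well defined even when frequencies tie.
deciles = pd.qcut(freqs.rank(method="first"), 10, labels=False)

# The penultimate decile is label 8: frequent taxa, but not the most frequent.
penultimate = freqs[deciles == 8].sort_values(ascending=False)
```

With 20 families each decile holds two of them, so `penultimate` contains the third and fourth most frequent families.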


In [19]:
plants.families.sort(key=lambda k:k.n_presences_in_list,reverse=True)

In [20]:
famids = map(lambda fam : fam.id , plants.families[:10])

In [22]:
plants.families

[<LocalTree | Family: Fabaceae - n.count : 1273- | AF: 0.54 >,
 <LocalTree | Family: Asteraceae - n.count : 810- | AF: 0.48 >,
 <LocalTree | Family: Poaceae - n.count : 808- | AF: 0.39 >,
 <LocalTree | Family: Apocynaceae - n.count : 346- | AF: 0.34 >,
 <LocalTree | Family: Euphorbiaceae - n.count : 318- | AF: 0.33 >,
 <LocalTree | Family: Solanaceae - n.count : 269- | AF: 0.31 >,
 <LocalTree | Family: Rubiaceae - n.count : 276- | AF: 0.3 >,
 <LocalTree | Family: Malvaceae - n.count : 219- | AF: 0.27 >,
 <LocalTree | Family: Acanthaceae - n.count : 143- | AF: 0.23 >,
 <LocalTree | Family: Verbenaceae - n.count : 93- | AF: 0.22 >,
 <LocalTree | Family: Orchidaceae - n.count : 263- | AF: 0.21 >,
 <LocalTree | Family: Cyperaceae - n.count : 191- | AF: 0.205 >,
 <LocalTree | Family: Boraginaceae - n.count : 137- | AF: 0.2 >,
 <LocalTree | Family: Bignoniaceae - n.count : 134- | AF: 0.195 >,
 <LocalTree | Family: Lamiaceae - n.count : 109- | AF: 0.19 >,
 <LocalTree | Family: Convolvulaceae 

In [21]:
famnames = map(lambda fam : fam.name, plants.families[:10])

In [32]:

#bigtree.to_Animalia.to_Chordata.to_Mammalia.to_Chiroptera.level
pres = map(lambda tree : tree.pseudoPresenceAbsence(famids,5),trees)

In [33]:
presi = pd.concat(pres,axis=1).transpose()

In [37]:
# Prepare to export as CSV
# Change ids to names
# Reset the index
presi.columns = famnames
presi.reset_index(drop=True,inplace=True)

In [38]:
presi[:10]

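| | Fabaceae | Asteraceae | Poaceae | Apocynaceae | Solanaceae | Malvaceae | Euphorbiaceae | Rubiaceae | Lamiaceae | Acanthaceae |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |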


In [39]:
## Remember to attach the environmental variables and the coordinates:
fulldata = pd.concat([data,coords],axis=1)

In [41]:
datatot = pd.concat([presi,fulldata],axis=1)
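One point worth checking at this step: `pd.concat(..., axis=1)` aligns rows by index label, not by position. The notebook does not show which index `data` and `coords` carry, so the sketch below uses hypothetical frames (the column `MeanTemp` is invented for illustration) to show what happens when the indexes differ and how resetting them restores a row-by-row join.

```python
import pandas as pd

# Presences with a fresh 0..2 index, as after reset_index(drop=True).
presences = pd.DataFrame({"Fabaceae": [1, 0, 1]})

# Hypothetical covariates whose index labels are 10, 11, 12.
covariates = pd.DataFrame({"MeanTemp": [21.5, 18.0, 25.1]},
                          index=[10, 11, 12])

# Labels do not match: pandas takes the union of the labels and fills NaN.
misaligned = pd.concat([presences, covariates], axis=1)

# Resetting the index makes both frames share 0..2, so rows pair up.
aligned = pd.concat([presences, covariates.reset_index(drop=True)], axis=1)
```

A quick check of the shape and of missing values in the combined frame before exporting the CSV would reveal this kind of mismatch.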

In [42]:
## Save it for later
#datatot.to_csv("/outputs/sample_fams_bats.csv")