# Spatial Regression of Abundance Data
Here I show how to extract different taxonomic information at cell level.
Although there exists a method for building the taxonomic tree within a single cell, the process can be computationally intensive because it depends on extracting the total amount of occurrences in each cell. From there, it traverses fromtop to bottom the tree looking for the corresponding nodes.

The approach is usefull when one needs a small number of trees but it'll become increasingly slow if the amount of cells or occurrences increases. 

## Extracting specific taxonomic levels en each cells

The method studied here makes use of the relationship type `IS_IN` stored in the knowledge graph.

Note: *There was a problem with the design of the OGM implementation (py2neo.ogm). The retrieval of linked nodes based on a specific relation does not distinguish different labels. In other words it returns the totality of the data that has the  specific relationship given a node.*

The solution was to include extra methods for the class Cell `has_[taxas]`. This method/attribute returns a graph selector that points to the corresponding nodes.

Let's get started.
As usual we need to load the necessary modules


In [1]:
%matplotlib inline
import sys
sys.path.append('/apps')
import django
django.setup()
from drivers.tree_builder import TreeNeo
from drivers.graph_models import TreeNode, Order, Family, graph,Kingdom,Occurrence
from drivers.graph_models import Cell,Mex4km, countObjectsOf
import matplotlib.pyplot as plt
## Use the ggplot style
plt.style.use('ggplot')

## Random selection of cells.

*Note* : There was a big problem in the data arquitecture. For storage reasons I couldn't load the complete world bioclimatic layers. Therefore I needed to put a regional subset that comprises only the Mexican Territory. 
For this reason, it is necessary that any approach for selecting subsamples needs to be constrained (filter) by this geometry. 
We can do that with this:

In [2]:
from sketches.models import Country
Mexico = Country.objects.filter(name__contains="exico").get()
import pandas as pd

In [3]:
from mesh.models import MexMesh
mexican_cells = MexMesh.objects.filter(cell__intersects=Mexico.geom)

In [4]:
ids = list(mexican_cells.values('pk'))

In [5]:
ids = pd.DataFrame(ids)


In [6]:
ids.shape[0]

74200

The selection should be as follow:
    * Convert to pandas
    * GEnerate random numbers uniform on that range
    * use iloc to get the id values
    * Use the normal methodology.
    

In [7]:
ncells = countObjectsOf(Mex4km)
ncells = ids.shape[0]
import numpy as np
np.random.seed(12345)
sample_size = 200
choices = np.random.choice(range(1,ncells),sample_size,replace='False')

In [8]:
choices = list(ids.loc[choices].pk)

In [9]:
## This will stringify the id list to get the selected cells.
selection_of_cells = Mex4km.select(graph).where("_.id IN  %s "%str(list(choices)))

### Using iterators (imap + graphselector_iterator)

In [10]:
import itertools as it

## We will select the different Families here

In [11]:
%time families = it.imap(lambda c : c.has_families,selection_of_cells)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 11.9 µs


In [12]:
from traversals import strategies as st

## Extract richness and Environmental covariates from cells at a given taxonomic level
Options are: Family, Order, Spicies, etc

In [15]:
%time data = st.getEnvironmentAndRichnessFromListOfCells(list_of_cells=selection_of_cells,taxonomic_level_name='Family')

CPU times: user 10.1 s, sys: 388 ms, total: 10.5 s
Wall time: 1min 17s


It takes time because it need to calculate on the fly the summary statistic of each cell. It is using the postgis backend

In [16]:
data

Unnamed: 0,n.Family,Longitude,Latitude,Elevation_mean,MaxTemperature_mean,MeanTemperature_mean,MinTemperature_mean,Precipitation_mean,SolarRadiation_mean,Vapor_mean,WindSpeed_mean
0,4,-113.380563,27.54339,63.222222,30.225231,20.766204,30.317130,7.560185,18867.312500,1.750000,2.263889
1,0,-97.406563,25.68139,3.888889,27.693981,23.069444,27.807870,58.791667,16915.250000,2.212963,4.416667
2,2,-103.629563,22.83939,2353.000000,21.844722,13.783333,21.880556,47.480556,18051.152778,1.000000,3.113889
3,9,-107.255563,25.97539,1305.777778,27.772222,20.793981,27.831019,85.865741,18112.976852,1.300926,2.185185
4,1,-98.974563,24.35839,283.444444,30.418518,23.655093,30.504630,57.622685,17391.502315,1.974537,2.233796
5,0,-97.994563,24.16239,59.555556,29.541945,23.255556,29.461111,55.769444,17422.047222,2.333333,2.908333
6,0,-106.667563,29.06239,2240.333333,22.721111,13.372222,22.769444,43.850000,18295.166667,0.513889,3.533333
7,0,-100.493563,22.25139,1603.111111,25.656250,18.354167,25.703704,45.997685,18172.048611,1.365741,2.250000
8,0,-95.299563,16.61639,271.250000,31.701620,26.263889,31.724537,44.273148,17577.493056,2.388889,3.037037
9,0,-106.814563,31.41439,1211.666667,26.597222,17.104167,26.604167,19.215278,19579.583333,1.083333,3.250000


## Here development for getting environmental covariates for each cell
Let's bring the environmental value per cell

In [21]:
cc = selection_of_cells.first()

In [23]:
Sacar las celdas superiores hasta sacar una marca lloraras
luego extraer la informacion raster en matriz
usar la transformacion affin para generar coordenadas
aplicar al modelo. con pymc3
primero lineal y luego con gam o algo asi,


<py2neo.ogm.RelatedObjects at 0x7fc2bb575c50>

In [None]:
c = c_iter.next()

In [None]:
env_data = c.getAssociatedRasterAreaData('MeanTemperature')

In [None]:
c.polygon.wkt

In [None]:
env_data.getRaster()

In [None]:
rast = c.getAssociatedRasterAreaData('Elevation')

In [None]:
rast.display_field()

## Let's see if we can get data from the upper scales


In [None]:
c.upperCell.next()

In [None]:
big_cell = c.upperCell.next().upperCell.next().upperCell.next()

In [None]:
rr = big_cell.getAssociatedRasterAreaData('Elevation')

In [None]:
rr.display_field()

In [None]:
cells = list(selection_of_cells)

In [None]:
%time tii = st.getEnvironmentalCovariatesFromListOfCells(cells)

In [None]:
tii

### Benchmarking time for retrieval using explicit lists vs lazy-evaluation


In [None]:
%time samples = list(sel)

In [None]:
%time ocs2 = map(lambda c : list(c.has_occurrences),samples)

In [None]:
ocs2_l = filter(lambda k : k != [] ,ocs2)

In [None]:
len(ocs2_l)

In [None]:
lll = reduce(lambda a,b : a+b,ocs2_l)

In [None]:
lll

In [None]:
lll == ccc

In [None]:
sel = Mex4km.select(graph).where("_.id IN  %s "%str(c))

In [None]:
def _try_levelnames_extraction(relationship):
    """
    Extracts the end node relationship name.
    for use with map functions.
    """
    try:
        a = relationship.start_node()['levelname']
        return a
    except:
        return None
    
types = map(lambda r : map(lambda t : _try_levelnames_extraction(t),r),available_rels)

In [None]:
types

In [None]:
tt = tb.buildTreeNeo(samples[26])

In [None]:
#For now not run
#big_tree = reduce(lambda a,b : a+b , trees)
import seaborn as sns

In [None]:
t = trees[2]

In [None]:
ll = map(lambda t : t.richness , trees)

In [None]:
sns.distplot(ll)

In [None]:
tl.plotTree(tt)

In [None]:
import traversals.strategies as strg

In [None]:
type(root)

In [None]:
root = t.node

In [None]:
a = strg.getPresencesForNode(root,trees)

In [None]:
data_t = strg.getPresencesForListOfNodes([root],trees)

In [None]:
data_t

# The model

In [17]:
import pymc3 as pm

In [19]:
from pymc3 import find_MAP
map_estimate = find_MAP(model=model)
map_estimate

NameError: name 'model' is not defined

In [None]:
import pandas as pd

In [None]:
mapxy = pd.concat([data_t[['Longitude','Latitude']],pd.DataFrame({'map': map_estimate['latent_field']})],axis=1)

In [None]:
gmapxy = tools.toGeoDataFrame(mapxy,xcoord_name='Longitude',ycoord_name='Latitude')

In [None]:
fig, ax = plt.subplots(figsize=(14, 9));
gmapxy.plot(ax=ax,column='map')

## Prediction
The conditional method creates the conditional, or predictive, distribution over the latent function at arbitrary x∗x∗ input points, f(x∗)f(x∗). To construct the conditional distribution we write:

In [None]:
minx = min(data_t.Longitude)
maxx = max(data_t.Longitude)
miny = min(data_t.Latitude)
maxy = max(data_t.Latitude)

In [None]:
from external_plugins.spystats.spystats import tools

In [None]:
grid = tools.createGrid(grid_sizex=10,grid_sizey=10,minx=minx,miny=miny,maxx=maxx,maxy=maxy)

In [None]:
gp.predict(grid[['Lon','Lat']])

In [None]:
%time f_star = gp.conditional("f_star", X=grid[['Lon','Lat']])

In [None]:
getdata = lambda tree : tree.associatedData.getEnvironmentalVariablesCells()

In [None]:
ts[1].associatedData.getEnvironmentalVariablesCells()

In [None]:
list(choices)

In [None]:
n