This notebook uses the `records` package and its functions to retrieve a set of occurrences from the GBIF portal, then runs a machine learning algorithm (random forests) to ask whether the US state in which a `Bombus` species was recorded can be predicted from two features: elevation and species name.

### Import the `records` package

In [35]:
import records
import pandas as pd
import numpy as np
import toyplot

### Import occurrence records of `Bombus` from the biodiversity portal GBIF

#### We find all records of `Bombus` from 1935 to 1950 in the USA

In [13]:
rec = records.Records("Bombus", (1935, 1950), **{"country": "US"})
rec.df[["species", "country", "year","stateProvince"]].head(10)


Unnamed: 0,species,country,year,stateProvince
0,Bombus bifarius,United States,1935,Oregon
1,Bombus frigidus,United States,1937,Michigan
2,Bombus perplexus,United States,1936,Michigan
3,Bombus rufocinctus,United States,1937,Michigan
4,Bombus impatiens,United States,1940,Illinois
5,Bombus pensylvanicus,United States,1938,Illinois
6,Bombus impatiens,United States,1946,Illinois
7,Bombus variabilis,United States,1940,Illinois
8,Bombus ashtoni,United States,1936,Michigan
9,Bombus fervidus,United States,1935,Oregon


In [15]:
rec.df.shape # There are 11322 records from 1935 to 1950

(11322, 123)

#### Remove rows with NA values in the columns `species`, `stateProvince` and `elevation`

In [54]:
rec_filter = rec.df[rec.df.stateProvince.notna() & rec.df.species.notna() & rec.df.elevation.notna()]
rec_filter.shape # 721 rows and 123 columns remain after filtering

(721, 123)
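
The same filter can be written more compactly with `DataFrame.dropna` and its `subset` argument. This is a minimal sketch on a small made-up frame; the rows and the names `toy` and `toy_filter` are illustrative and are not GBIF data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for rec.df; the rows are invented for illustration only
toy = pd.DataFrame({
    "species": ["Bombus bifarius", None, "Bombus impatiens", "Bombus fervidus"],
    "stateProvince": ["Oregon", "Michigan", None, "Oregon"],
    "elevation": [780.0, 91.0, 135.0, np.nan],
})

# Keep only rows where all three columns are present
toy_filter = toy.dropna(subset=["species", "stateProvince", "elevation"])
print(toy_filter.shape)
```

Only the first toy row has all three values, so a single row survives the filter.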

#### Selecting only the three columns necessary for running a predictive algorithm using `scikit-learn`

In [59]:
# Copy the selection so the replacement below modifies an independent frame
rec_filter = rec_filter[["species", "elevation", "stateProvince"]].copy()
rec_filter.stateProvince.unique()  # inspect the state names present

# Replace the names of the states with integers for random forests classification
states = ['Alaska', 'Arizona', 'California', 'Colorado',
          'Idaho', 'New Hampshire', 'New Mexico', 'North Carolina',
          'Oregon', 'Tennessee', 'Utah', 'Virginia',
          'Washington', 'Wyoming']
rec_filter["stateProvince"] = rec_filter["stateProvince"].replace(states, list(range(1, 15)))


In [60]:
rec_filter.head(10)

Unnamed: 0,species,elevation,stateProvince
1467,Bombus rufocinctus,780.0,5
1822,Bombus rufocinctus,780.0,5
1823,Bombus rufocinctus,780.0,5
2042,Bombus occidentalis,91.0,1
2224,Bombus bifarius,780.0,5
2838,Bombus occidentalis,135.0,1
3092,Bombus rufocinctus,780.0,5
3232,Bombus rufocinctus,780.0,5
3710,Bombus occidentalis,135.0,1
3735,Bombus rufocinctus,780.0,5
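
Listing the states by hand works, but the list has to be kept in sync with the data. As a hedged alternative, `pandas.factorize` can derive integer codes from whatever labels are present. Its codes start at 0, so they will not match the 1 to 14 numbering used above. The toy values below are illustrative only.

```python
import pandas as pd

# Made-up state labels standing in for rec_filter.stateProvince
toy_states = pd.Series(["Idaho", "Idaho", "Alaska", "Colorado", "Alaska"])

# sort=True assigns codes in alphabetical order of the unique labels
codes, labels = pd.factorize(toy_states, sort=True)
print(codes)
print(list(labels))
```

Keeping `labels` around makes it possible to map predicted codes back to state names later, for example `labels[codes]`.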


In [95]:
from sklearn.model_selection import train_test_split

Data_X = pd.DataFrame(rec_filter, columns = ["species","elevation"])
Data_Y = pd.DataFrame(rec_filter, columns = ["stateProvince"])

# convert to a NumPy array (a column vector of shape (n, 1))
y = Data_Y.values

# show
print(y.shape)
print(y[:5])


(721, 1)
[[5]
 [5]
 [5]
 [1]
 [5]]
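
scikit-learn classifiers expect the target as a 1-D array of shape `(n_samples,)`. Passing a column vector such as the `(721, 1)` array above still works, but it typically triggers a `DataConversionWarning`. This is a minimal sketch of flattening the target with `ravel`, using a made-up label column:

```python
import pandas as pd

# Made-up labels standing in for Data_Y
toy_Y = pd.DataFrame({"stateProvince": [5, 5, 5, 1, 5]})

y_2d = toy_Y.values          # shape (5, 1): a column vector
y_1d = toy_Y.values.ravel()  # shape (5,): a flat 1-D array

print(y_2d.shape, y_1d.shape)
```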


In [96]:
Data_X_transform = pd.get_dummies(Data_X, columns=["species"])
Data_X_transform.shape

(721, 31)
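
`pd.get_dummies` replaces the `species` column with one indicator column per species and leaves `elevation` untouched. The 31 columns above are therefore elevation plus 30 species indicators. This is a minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows standing in for Data_X
toy_X = pd.DataFrame({
    "species": ["Bombus rufocinctus", "Bombus occidentalis", "Bombus rufocinctus"],
    "elevation": [780.0, 91.0, 780.0],
})

# One indicator column is created per distinct species
toy_X_transform = pd.get_dummies(toy_X, columns=["species"])
print(toy_X_transform.columns.tolist())
print(toy_X_transform.shape)
```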

In [97]:
x_train,x_test,y_train,y_test = train_test_split(Data_X_transform,y,test_size=0.3,random_state=1)

In [98]:
from sklearn.ensemble import RandomForestClassifier

In [99]:
model = RandomForestClassifier(max_depth=2, random_state=0)

In [100]:
model.fit(x_train, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [101]:
yfit = model.predict(x_test)
print(yfit)

[4 4 4 5 4 4 4 4 4 4 4 3 4 4 4 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]


The output above shows the predicted state for each `Bombus` record in the test set. Colorado (label 4) is predicted for the great majority of the test data, with poor accuracy as shown below.

In [102]:
print('RF accuracy: TRAINING', model.score(x_train, y_train))
print('RF accuracy: TESTING', model.score(x_test, y_test))

RF accuracy: TRAINING 0.4880952380952381
RF accuracy: TESTING 0.4700460829493088


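The accuracy of the model (about 49% on the training data and 47% on the test data) is well below what would be considered acceptable. The misclassification rate is high, probably because only two features are available and the trees are very shallow (`max_depth=2`), so the forest falls back on predicting Colorado for nearly every record.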
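
One way to put an accuracy figure like this in context is to compare it against a classifier that always predicts the most common class, and against a forest whose depth is not restricted. The sketch below uses synthetic data, not the GBIF records, so its numbers say nothing about `Bombus`. The data-generating choices and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (NOT the GBIF records): one numeric
# feature plus a three-level categorical feature, one-hot encoded
rng = np.random.default_rng(0)
n = 600
y = rng.choice([3, 4, 5], size=n, p=[0.2, 0.6, 0.2])
elevation = y * 500 + rng.normal(0, 400, size=n)
species = rng.integers(0, 3, size=n)
X = np.column_stack([elevation, np.eye(3)[species]])

# stratify=y keeps the class proportions similar in both splits
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "RF max_depth=2": RandomForestClassifier(max_depth=2, random_state=0),
    "RF unrestricted depth": RandomForestClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out test split
scores = {}
for name, clf in models.items():
    clf.fit(x_train, y_train)
    scores[name] = clf.score(x_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

If a forest scores close to the majority baseline, it has learned little beyond the class frequencies. That would suggest looking at more informative features or different hyperparameters before drawing conclusions.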