## Machine Learning to Predict *Bombus* Species Names Given Their Coordinates
The program used the `records` library to download occurrence data for three *Bombus* species: *occidentalis*, *pensylvanicus*, and *terricola*. Each record includes the longitude and latitude at which the sample was found. With this information, the program fitted the `KNeighborsClassifier` model from the `scikit-learn` library so that it could predict the species name from a longitude and latitude.

### 1. Download the data
Use the `records` library to download a series of occurrence records for *Bombus occidentalis* (taxon key: 1340429), *Bombus pensylvanicus* (taxon key: 1340416), and *Bombus terricola* (taxon key: 1340493).

In [1]:
import records

In [2]:
# Download data for occidentalis
kwargs = {"taxonKey": "1340429"}
occidentalis_data = records.Records("Bombus", interval=(1980, 2000), **kwargs)

In [7]:
occidentalis_data.df.shape

(2315, 112)

In [51]:
# Download data for pensylvanicus
kwargs = {"taxonKey": "1340416"}
pensylvanicus_data = records.Records("Bombus", interval=(1980, 2000), **kwargs)

In [52]:
pensylvanicus_data.df.shape

(2099, 116)

In [77]:
# Download data for terricola
kwargs = {"taxonKey": "1340493"}
terricola_data = records.Records("Bombus", interval=(1970, 2000), **kwargs)

In [78]:
terricola_data.df.shape

(2038, 108)

### 2. Prepare the data
Here we fetched three columns for each sample: `species`, `decimalLongitude`, and `decimalLatitude`. We concatenated the three data sets and shuffled the rows to form the overall data set for training and testing. We then used `decimalLongitude` and `decimalLatitude` as the features and `species` as the label. Finally, we split the data into a test set and a training set.

In [10]:
import numpy as np
import pandas as pd
import toyplot
toyplot.config.autoformat = "html"

In [53]:
o_data = occidentalis_data.df[["species", "decimalLongitude", "decimalLatitude"]]
o_data.head()

| | species | decimalLongitude | decimalLatitude |
|---|---|---|---|
| 0 | Bombus occidentalis | -111.83 | 41.73 |
| 1 | Bombus occidentalis | -103.82278 | 42.75806 |
| 2 | Bombus occidentalis | -117.5347 | 47.7952 |
| 3 | Bombus occidentalis | -103.82278 | 42.75806 |
| 4 | Bombus occidentalis | -103.82278 | 42.75806 |


In [54]:
i_data = pensylvanicus_data.df[["species", "decimalLongitude", "decimalLatitude"]]
i_data.head()

| | species | decimalLongitude | decimalLatitude |
|---|---|---|---|
| 0 | Bombus pensylvanicus | -88.1214 | 37.9749 |
| 1 | Bombus pensylvanicus | -87.78013 | 40.08299 |
| 2 | Bombus pensylvanicus | -88.29 | 42.17 |
| 3 | Bombus pensylvanicus | -90.9362 | 40.9849 |
| 4 | Bombus pensylvanicus | -105.551 | 38.9979 |


In [79]:
t_data = terricola_data.df[["species", "decimalLongitude", "decimalLatitude"]]
t_data.head()

| | species | decimalLongitude | decimalLatitude |
|---|---|---|---|
| 0 | Bombus terricola | -84.74758 | 45.55102 |
| 1 | Bombus terricola | -84.79261 | 45.62301 |
| 2 | Bombus terricola | -84.79261 | 45.62301 |
| 3 | Bombus terricola | -84.79261 | 45.62301 |
| 4 | Bombus terricola | -84.79261 | 45.62301 |


In [80]:
# Concatenate and shuffle the data
data = pd.concat([o_data, i_data, t_data])
data = data.sample(frac=1).reset_index(drop=True)
data.head(10)

| | species | decimalLongitude | decimalLatitude |
|---|---|---|---|
| 0 | Bombus pensylvanicus | -96.4718 | 42.2037 |
| 1 | Bombus pensylvanicus | -89.87 | 40.41 |
| 2 | Bombus occidentalis | -148.3015 | 64.704 |
| 3 | Bombus terricola | -76.53855 | 42.55368 |
| 4 | Bombus terricola | -84.79261 | 45.62301 |
| 5 | Bombus occidentalis | -110.589749 | 43.934762 |
| 6 | Bombus occidentalis | -123.555497 | 42.365499 |
| 7 | Bombus occidentalis | -122.52801 | 41.59474 |
| 8 | Bombus occidentalis | -122.137222 | 47.037604 |
| 9 | Bombus occidentalis | -111.6189 | 41.775 |


In [83]:
# plot the combined data to see what it looks like
cmap = {"Bombus pensylvanicus": "pink", "Bombus occidentalis": "grey", "Bombus terricola": "skyblue"}
toyplot.scatterplot(
    data.decimalLongitude, 
    data.decimalLatitude, 
    opacity=0.5,
    color=[cmap[i] for i in data.species], 
    width=300, 
    height=250);

In [84]:
# prepare the data features x and label y
x = data[["decimalLongitude", "decimalLatitude"]].values
y = data.species.values
data[["decimalLongitude", "decimalLatitude"]].shape

(6452, 2)

In [85]:
# of the 6452 samples, use the first 452 as the test set
# and the remaining 6000 as the training set
ntest = 452
xtest = x[:ntest]
xtrain = x[ntest:]
ytest = y[:ntest]
ytrain = y[ntest:]
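
The shuffle in this section calls `data.sample(frac=1)` without a seed, so the rows that end up in the test set change on every run. The following is a minimal sketch of a reproducible shuffle-and-split, assuming a fixed seed is acceptable. The helper name `shuffled_split`, the seed value, and the toy arrays are illustrative and not part of the original notebook.

```python
import numpy as np

def shuffled_split(x, y, ntest, seed=0):
    """Shuffle x and y together, then hold out the first `ntest` rows as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))   # one permutation, applied to both arrays
    x, y = x[order], y[order]
    return x[ntest:], x[:ntest], y[ntest:], y[:ntest]

# Toy stand-in for the (longitude, latitude) features and species labels.
x_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array(["a", "b"] * 5)

xtrain_demo, xtest_demo, ytrain_demo, ytest_demo = shuffled_split(x_demo, y_demo, ntest=3)
print(xtrain_demo.shape, xtest_demo.shape)
```

Calling the helper twice with the same seed returns the same split, which makes the accuracy values reported later repeatable.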

### 3. Initialize a model instance
Select `KNeighborsClassifier` from the `scikit-learn` library and create an instance.

In [86]:
from sklearn.neighbors import KNeighborsClassifier

In [87]:
knn = KNeighborsClassifier(n_neighbors=1)
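
To make the role of `n_neighbors` concrete, here is a rough pure-Python sketch of the idea behind k-nearest-neighbors classification: find the `k` training points closest to a query (Euclidean distance here) and return the most common label among them. The function `knn_predict` and the sample points are hypothetical. `scikit-learn` uses faster search structures and its own tie-breaking, so this illustrates the concept rather than the library's implementation.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=1):
    """Return the majority label among the k training points closest to `query`."""
    # Rank training indices by Euclidean distance to the query and keep the k nearest.
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (longitude, latitude) points: two western, two eastern.
points = [(-122.5, 41.6), (-117.5, 47.8), (-84.8, 45.6), (-76.5, 42.6)]
labels = [
    "Bombus occidentalis", "Bombus occidentalis",
    "Bombus terricola", "Bombus terricola",
]

print(knn_predict(points, labels, (-120.0, 44.0), k=1))  # Bombus occidentalis
```

With `k=1` the prediction is simply the label of the single closest training point. Larger values of `k` smooth the decision by letting several neighbors vote.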

### 4. Fit the model
Train the model on the training data set.

In [88]:
knn.fit(xtrain, ytrain)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

### 5. Get predictions
Apply the model to the test data set, predicting the label from features.

In [89]:
yfit = knn.predict(xtest)

### 6. Assess the goodness of fit 
Measure the accuracy of the model by comparing the predicted labels with the actual labels in the test set. We then tuned the model by scanning `n_neighbors` from 1 to 199 and scoring each value on the same test set.

In [90]:
np.mean(yfit == ytest)

0.9867256637168141
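
A single overall accuracy can hide a species that is predicted poorly, especially when the classes are not the same size. The following sketch breaks accuracy down per label. The helper name `per_class_accuracy` is an assumption, and the labels are made up for illustration; they are not results from this notebook.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        str(label): float(np.mean(y_pred[y_true == label] == label))
        for label in np.unique(y_true)
    }

# Made-up labels for illustration only.
y_true_demo = np.array(["occidentalis", "occidentalis", "terricola", "terricola", "pensylvanicus"])
y_pred_demo = np.array(["occidentalis", "occidentalis", "terricola", "pensylvanicus", "pensylvanicus"])

print(per_class_accuracy(y_true_demo, y_pred_demo))
```

In the notebook, the same function could be called as `per_class_accuracy(ytest, yfit)`.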

In [91]:
cmap = {"Bombus pensylvanicus": "pink", "Bombus occidentalis": "grey", "Bombus terricola": "skyblue"}
c, a, m = toyplot.scatterplot(
    data.decimalLongitude,
    data.decimalLatitude,
    opacity=0.5,
    color=[cmap[i] for i in data.species],
    width=300,
    height=250
);
a.scatterplot(
    xtest[:, 0], xtest[:, 1], size=8, 
    mstyle={"stroke": 'black'},
    color=[cmap[i] for i in ytest],
);

In [92]:
# Scan n_neighbors from 1 to 199 and record the test accuracy for each k.
# cvs[0] stays 0, so cvs.argmax() returns the best k directly.
cvs = np.zeros(200)
for k in range(1, 200):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(xtrain, ytrain)
    yfit = knn.predict(xtest)
    cvs[k] = np.mean(yfit == ytest)

In [93]:
toyplot.plot(cvs, height=300, width=350, ymin=0.75);

In [94]:
cvs.argmax()

6

In [95]:
knn = KNeighborsClassifier(n_neighbors=cvs.argmax())
knn.fit(xtrain, ytrain)
yfit = knn.predict(xtest)
np.mean(yfit == ytest)  # accuracy on the test set

0.9889380530973452
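
The scan above scores every `k` on the test set, so the accuracy reported for the chosen `k` comes from the same data that was used to choose it. One common alternative is k-fold cross-validation on the training data only, followed by a single evaluation on the test set. The sketch below uses `cross_val_score` from `scikit-learn` on synthetic data. The clusters, the range of candidate values, and `cv=5` are assumptions made for illustration; in the notebook, `xtrain` and `ytrain` would take the place of `x_demo` and `y_demo`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for xtrain / ytrain: two clusters of (longitude, latitude) points.
rng = np.random.default_rng(0)
x_demo = np.vstack([
    rng.normal(loc=[-120.0, 45.0], scale=3.0, size=(60, 2)),
    rng.normal(loc=[-85.0, 42.0], scale=3.0, size=(60, 2)),
])
y_demo = np.array(["west"] * 60 + ["east"] * 60)

candidate_ks = range(1, 16)
mean_scores = []
for k in candidate_ks:
    # 5-fold cross-validation: each fold is held out once while the other four train the model.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), x_demo, y_demo, cv=5)
    mean_scores.append(scores.mean())

# Pick the k with the highest mean cross-validated accuracy.
best_k = candidate_ks[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))
```

After choosing `best_k` this way, the model would be refitted on the full training set and scored once on the held-out test set.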

In [96]:
cmap = {"Bombus pensylvanicus": "pink", "Bombus occidentalis": "grey", "Bombus terricola": "skyblue"}
c, a, m = toyplot.scatterplot(
    data.decimalLongitude,
    data.decimalLatitude,
    opacity=0.5,
    color=[cmap[i] for i in data.species],
    width=300,
    height=250
);
a.scatterplot(
    xtest[:, 0], xtest[:, 1], size=8, 
    mstyle={"stroke": 'black'},
    color=[cmap[i] for i in ytest],
);

### 7. Summary
We used occurrence data for three *Bombus* species: *occidentalis*, *pensylvanicus*, and *terricola*. Species names served as the labels, and longitude and latitude as the features. We trained a k-nearest-neighbors (kNN) model to predict species identity from coordinates. With `n_neighbors` set to 6, the value selected by the scan over `n_neighbors`, the test accuracy was 0.9889380530973452, compared with 0.9867256637168141 for `n_neighbors=1`. This suggests that coordinates alone separate these three species well, although the score is likely somewhat optimistic because the same test set was used to choose `n_neighbors`.