## VanAcker Assignment 9

This notebook applies machine learning with scikit-learn to species occurrence data queried from GBIF. The records describe _Odocoileus virginianus_, the white-tailed deer, observed throughout the United States between 2016 and 2018.

### 1.) Use the records library to query GBIF

I searched for white-tailed deer occurrence data in the United States between 2016 and 2018.

In [163]:
import numpy as np
import toyplot
import requests
import pandas as pd
from sklearn.linear_model import LogisticRegression

In [164]:
import records 
rec = records.Records(q = "Odocoileus virginianus", interval = (2016, 2018),)

In [165]:
rec.df.shape

(1169, 120)

In [166]:
rec.sdf.head()

Unnamed: 0,species,year,country,stateProvince
0,Odocoileus virginianus,2016,United States,Oklahoma
1,Odocoileus virginianus,2017,United States,New Mexico
2,Odocoileus virginianus,2016,United States,Ohio
3,Odocoileus hemionus,2016,United States,New Mexico
4,Odocoileus hemionus,2017,United States,Colorado


In [131]:
list(rec.df.columns)

['accessRights',
 'associatedTaxa',
 'basisOfRecord',
 'bibliographicCitation',
 'catalogNumber',
 'class',
 'classKey',
 'collectionCode',
 'collectionID',
 'continent',
 'coordinateUncertaintyInMeters',
 'country',
 'countryCode',
 'county',
 'crawlId',
 'datasetID',
 'datasetKey',
 'datasetName',
 'dateIdentified',
 'day',
 'decimalLatitude',
 'decimalLongitude',
 'disposition',
 'dynamicProperties',
 'elevation',
 'elevationAccuracy',
 'endDayOfYear',
 'eventDate',
 'eventRemarks',
 'extensions',
 'facts',
 'family',
 'familyKey',
 'fieldNotes',
 'fieldNumber',
 'gbifID',
 'genericName',
 'genus',
 'genusKey',
 'geodeticDatum',
 'georeferenceProtocol',
 'georeferenceRemarks',
 'georeferenceSources',
 'georeferenceVerificationStatus',
 'georeferencedBy',
 'georeferencedDate',
 'habitat',
 'higherClassification',
 'higherGeography',
 'identificationID',
 'identificationQualifier',
 'identificationRemarks',
 'identificationVerificationStatus',
 'identifiedBy',
 'identifier',
 'identif

In [132]:
# Take a look at the data columns 

rec.df.describe()

Unnamed: 0,classKey,coordinateUncertaintyInMeters,crawlId,day,decimalLatitude,decimalLongitude,elevation,elevationAccuracy,familyKey,genusKey,individualCount,key,kingdomKey,month,orderKey,phylumKey,speciesKey,taxonKey,year
count,1157.0,136.0,1169.0,833.0,1169.0,1169.0,144.0,96.0,1158.0,1161.0,117.0,1169.0,1169.0,833.0,1145.0,1163.0,1072.0,1169.0,1169.0
mean,25176.74,730.407132,68.449102,19.338535,35.184143,-88.291784,1536.253472,9.390625,40453.95,3515617.0,1.0,1748903000.0,4.783576,9.992797,171176.6,954431.1,3980294.0,4064282.0,2016.347305
std,424330.5,3787.474119,98.859131,7.345895,2.800796,7.623307,635.358706,49.523689,398875.3,1883231.0,0.0,122958400.0,1.22483,3.194789,1095070.0,2539784.0,1721433.0,1806764.0,0.478112
min,178.0,3.0,14.0,1.0,25.5489,-122.975278,9.0,0.0,1945.0,2440964.0,1.0,1262256000.0,1.0,1.0,392.0,34.0,2439923.0,4342.0,2016.0
25%,180.0,19.0,56.0,19.0,33.3292,-87.50922,1476.25,0.5,4828.0,2599923.0,1.0,1796027000.0,5.0,8.0,1048.0,95.0,2607031.0,2607206.0,2016.0
50%,180.0,69.0,56.0,19.0,34.38944,-86.0639,1531.0,1.5,8305.0,2605982.0,1.0,1796606000.0,5.0,12.0,1048.0,95.0,3396209.0,3397574.0,2016.0
75%,194.0,235.25,56.0,22.0,34.97797,-85.67525,1674.75,3.5,8367.0,2988638.0,1.0,1797082000.0,5.0,12.0,1273.0,95.0,5332289.0,5474489.0,2017.0
max,7228684.0,41000.0,1164.0,31.0,46.133957,-67.109931,3015.0,444.5,9439202.0,9633346.0,1.0,1836911000.0,6.0,12.0,7381366.0,7707728.0,9502855.0,9502855.0,2018.0


In [167]:
# organize the dataframe with the columns of interest.
data = pd.DataFrame(rec.df, columns = ["year", "species", "month", "decimalLatitude",])
data.head()
data.shape

(1169, 4)

In [168]:
# Drop rows with NA values
data = data.dropna()
data.shape

(756, 4)
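The drop from 1169 to 756 rows depends on which columns contain missing values. A minimal sketch of how to check this, using a small made-up frame (`toy` and its values are hypothetical, not GBIF records): it counts missing values per column and compares dropping on all four columns against dropping only on the two columns the model later uses.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the four selected columns (not real GBIF data)
toy = pd.DataFrame({
    "year": [2016, 2017, 2016, 2018],
    "species": ["Odocoileus virginianus", None,
                "Odocoileus virginianus", "Odocoileus hemionus"],
    "month": [11.0, 5.0, np.nan, 10.0],
    "decimalLatitude": [35.2, 35.9, 39.8, 34.9],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop rows with a missing value in any column
all_cols = toy.dropna()

# Drop rows only when the modeled columns are missing
model_cols = toy.dropna(subset=["month", "decimalLatitude"])

print(len(all_cols), len(model_cols))
```

Restricting `dropna` to the modeled columns keeps rows that are only missing a value in a column the regression never reads.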

Create objects with the month and latitude data. 

In [172]:
month = data.loc[:, 'month']
month.head()

0    11.0
1     5.0
2     3.0
3    10.0
4     6.0
Name: month, dtype: float64

In [173]:
lat = data.loc[:, "decimalLatitude"]
lat.head()

0    35.215330
1    35.862467
2    39.807600
3    34.872922
4    39.089007
Name: decimalLatitude, dtype: float64

#### Examine the data through toyplot.

In [175]:
toyplot.scatterplot(month, lat, height = 250, width = 300, size = 3);

### 2.) Prepare the Data 
Select appropriate columns and format the data so that you have a column of labels (y) and one or more columns of features (X). Then split it into training and test data sets.

In [176]:
# Create a dataframe with the predictor (x = month) and the response (y = latitude).

data = pd.DataFrame({
    "x": month,
    "y": lat,
})
data.head()

Unnamed: 0,x,y
0,11.0,35.21533
1,5.0,35.862467
2,3.0,39.8076
3,10.0,34.872922
4,6.0,39.089007


In [177]:
print(data.shape)

(756, 2)


### 3.) Applying Machine Learning

#### Split the data set into a training set and a test set.

In [178]:
# Hold back 320 of the 756 rows (about 42%) for testing
tsize = 320

In [179]:
# convert to a 2d array
x = data.x.values[:, None]
x.shape

(756, 1)

In [180]:
# separate test from training
x_test = x[:tsize]
x_train = x[tsize:]

print(x.shape)
print(x[:5])

(756, 1)
[[ 11.]
 [  5.]
 [  3.]
 [ 10.]
 [  6.]]


In [181]:
# convert to a 1d array
y = data.y.values
y.shape

(756,)

In [182]:
# separate test from training
y_test = y[:tsize]
y_train = y[tsize:]

print(y.shape)
print(y[:5])

(756,)
[ 35.21533   35.862467  39.8076    34.872922  39.089007]
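Slicing off the first 320 rows gives a fair test set only if row order is unrelated to month and latitude, which may not hold for records returned by a query. A minimal sketch of a shuffled split follows; `shuffled_split` is a hypothetical helper and the arrays are made-up stand-ins with the same shapes as `x` and `y`. scikit-learn's `train_test_split` serves the same purpose.

```python
import numpy as np

def shuffled_split(x, y, test_size, seed=0):
    """Shuffle row indices, then return x_train, x_test, y_train, y_test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    test_idx, train_idx = idx[:test_size], idx[test_size:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Hypothetical stand-in data shaped like the notebook's x (756, 1) and y (756,)
x_demo = np.arange(756, dtype=float)[:, None]
y_demo = np.arange(756, dtype=float)

xtr, xte, ytr, yte = shuffled_split(x_demo, y_demo, test_size=320)
print(xtr.shape, xte.shape, ytr.shape, yte.shape)
```

Fixing the seed keeps the split reproducible between runs.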


#### Initialize a Model Instance

In [187]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()


#### Fit the model 

In [188]:
model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

#### Predict Y for new data X
Get predicted y values for the held-back data `x_test`.

In [189]:
yfit = model.predict(x_test)

#### Assessing the goodness of fit for the model 

In [192]:
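# Comparing predicted y to actual y
from sklearn.metrics import r2_score, mean_squared_error

# Both metrics take (y_true, y_pred). MSE is symmetric, but R2 is not,
# so the observed values must come first.
results = {
    "R2": r2_score(y_test, yfit),
    "MSE": mean_squared_error(y_test, yfit),
}
print(results)


{'R2': -11.2599664386897, 'MSE': 12.787825315001587}


This is an extremely poor model fit. The MSE of about 12.8 corresponds to a typical error of roughly 3.6 degrees of latitude. The R2 shown above was recorded with the arguments in the order `(yfit, y_test)`, so it is scaled by the variance of the predictions rather than of the observed latitudes; with the observed values first, R2 compares the model against simply predicting the mean latitude. The MSE is the same in either order.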
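To illustrate why argument order matters for R2 but not for MSE, here is a small sketch that implements both metrics by hand with NumPy. The numbers are made up for illustration and are not the notebook's data; `r2` and `mse` are hypothetical helpers that follow the usual definitions.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot from y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def mse(y_true, y_pred):
    """Mean squared error; unchanged if the arguments are swapped."""
    return np.mean((y_true - y_pred) ** 2)

# Made-up observed values and predictions with much less spread
y_obs = np.array([30.0, 34.0, 35.0, 36.0, 40.0])
y_hat = np.array([34.5, 35.0, 35.0, 35.0, 35.5])

print("MSE:", mse(y_obs, y_hat), mse(y_hat, y_obs))
print("R2 (observed first):", r2(y_obs, y_hat))
print("R2 (swapped):", r2(y_hat, y_obs))

# Predicting the mean everywhere gives R2 = 0, the baseline for comparison
baseline = np.full_like(y_obs, np.mean(y_obs))
print("R2 (mean baseline):", r2(y_obs, baseline))
```

When the predictions vary much less than the observations, the swapped call divides by a small total sum of squares and returns a strongly negative score even if the properly ordered score is modest.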

In [194]:
# build canvas
c = toyplot.Canvas(height=300, width=350)
a = c.cartesian()

# add training and test data points
a.scatterplot(x_train[:, 0], y_train, size=4, opacity=0.5);
a.scatterplot(x_test[:, 0], y_test, size=4, opacity=0.5);

# fitted line
a.plot(x_test[:, 0], yfit, color='black', style={"stroke-width": 2.5});


### Summary of Results:

I tried to predict the latitude of white-tailed deer occurrences from the month of observation with a linear regression model. The model was an extremely poor fit, as the R2 value shows. This could be because deer are clustered around specific latitudes and the populations do not actively migrate. In the plots, the majority of occurrence points are around 40 degrees latitude; this could reflect the natural history of the species or observation bias.

The model may work better with more data; I could expand the year interval for the GBIF query. Also, a truly continuous predictor such as elevation may be a better choice for linear regression, as month is effectively categorical.
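If month is kept as a predictor, one option is to account for the fact that it wraps around (December is adjacent to January) by encoding it as a point on a circle. This is a sketch of an alternative technique, not what the notebook does; `encode_month` is a hypothetical helper, and the sample values are the first five months shown earlier.

```python
import numpy as np

def encode_month(month):
    """Map month numbers 1..12 to (sin, cos) coordinates on the unit circle."""
    angle = 2.0 * np.pi * (np.asarray(month, dtype=float) - 1.0) / 12.0
    return np.column_stack([np.sin(angle), np.cos(angle)])

# First five values of the notebook's month column
months = np.array([11.0, 5.0, 3.0, 10.0, 6.0])
features = encode_month(months)
print(features.shape)

# December-to-January and January-to-February are the same distance apart
wrap = encode_month([12, 1, 2])
print(np.linalg.norm(wrap[0] - wrap[1]), np.linalg.norm(wrap[1] - wrap[2]))
```

The two resulting columns could replace the single month column as features, so a linear model can fit a seasonal cycle instead of a straight trend from January to December.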