# Predicting _Coccinella_ species by coordinates and year

**Assignment 9 by izrubin**

This notebook is a simple demonstration of supervised machine learning: given the coordinates and year of a US occurrence record from 1900 to 2000, it attempts to predict which species of the ladybug genus _Coccinella_ was recorded.

### Download occurrence records.

First, I use the `records` library to download a series of occurrence records for _Coccinella_, a genus of several ladybug species.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import records

In [2]:
# create an Epochs instance to sample 5-year intervals of Coccinella ladybugs in the U.S.
cep = records.Epochs("Coccinella", 1900, 2000, 5, country="US")
cep.df.shape

(1539, 120)

In [3]:
list(cep.df.columns)

['accessRights',
 'associatedTaxa',
 'basisOfRecord',
 'bibliographicCitation',
 'catalogNumber',
 'class',
 'classKey',
 'collectionCode',
 'collectionID',
 'continent',
 'coordinateUncertaintyInMeters',
 'country',
 'countryCode',
 'county',
 'crawlId',
 'datasetKey',
 'datasetName',
 'dateIdentified',
 'day',
 'decimalLatitude',
 'decimalLongitude',
 'depth',
 'disposition',
 'dynamicProperties',
 'elevation',
 'elevationAccuracy',
 'endDayOfYear',
 'epoch',
 'eventDate',
 'eventRemarks',
 'extensions',
 'facts',
 'family',
 'familyKey',
 'fieldNotes',
 'fieldNumber',
 'gbifID',
 'genericName',
 'genus',
 'genusKey',
 'geodeticDatum',
 'georeferenceProtocol',
 'georeferenceRemarks',
 'georeferenceSources',
 'georeferenceVerificationStatus',
 'georeferencedBy',
 'georeferencedDate',
 'habitat',
 'higherClassification',
 'higherGeography',
 'http://unknown.org/recordEnteredBy',
 'http://unknown.org/recordId',
 'identificationQualifier',
 'identificationRemarks',
 'identificationVerifi

Keep only the records whose genus is _Coccinella_.

In [4]:
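# filter on the genus column; from here on, cep is a plain DataFrame rather than an Epochs instance
cep = cep.df[cep.df["genus"] == "Coccinella"]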
cep.shape

(1533, 120)

In [5]:
cep.species.unique()

array(['Coccinella transversoguttata', 'Coccinella novemnotata',
       'Coccinella trifasciata', 'Coccinella californica',
       'Coccinella undecimpunctata', 'Coccinella monticola', nan,
       'Coccinella oculata', 'Coccinella prolongata',
       'Coccinella hieroglyphica', 'Coccinella venusta',
       'Coccinella johnsoni', 'Coccinella alta', 'Coccinella difficilis',
       'Coccinella repanda', 'Coccinella septempunctata'], dtype=object)

### Select appropriate columns and format the data so that you have a column of labels (y) and one or more columns of features (X).

In [6]:
cep = cep[['species', 'decimalLatitude', 'decimalLongitude', 'year']]
cep.head(10)

| | species | decimalLatitude | decimalLongitude | year |
|---|---|---|---|---|
| 0 | Coccinella transversoguttata | 39.739154 | -104.984703 | 1902 |
| 1 | Coccinella novemnotata | 39.683723 | -75.749657 | 1903 |
| 2 | Coccinella novemnotata | 42.441 | -76.497 | 1904 |
| 3 | Coccinella novemnotata | 42.441 | -76.497 | 1904 |
| 4 | Coccinella novemnotata | 42.441 | -76.497 | 1904 |
| 5 | Coccinella novemnotata | 42.441 | -76.497 | 1904 |
| 6 | Coccinella novemnotata | 46.344 | -122.528 | 1907 |
| 7 | Coccinella transversoguttata | 39.9872 | -107.6161 | 1909 |
| 8 | Coccinella trifasciata | 47.61264 | -122.32568 | 1909 |
| 9 | Coccinella trifasciata | 47.61264 | -122.32568 | 1909 |


Drop rows with missing values in any of the four columns. This removes 297 of the 1,533 records.

In [7]:
cep = cep.dropna()
cep.shape

(1236, 4)

**Feature matrix (X)**

In [8]:
X_cep = cep.drop('species', axis=1)
X_cep.shape

(1236, 3)

In [9]:
X_cep.head()

| | decimalLatitude | decimalLongitude | year |
|---|---|---|---|
| 0 | 39.739154 | -104.984703 | 1902 |
| 1 | 39.683723 | -75.749657 | 1903 |
| 2 | 42.441 | -76.497 | 1904 |
| 3 | 42.441 | -76.497 | 1904 |
| 4 | 42.441 | -76.497 | 1904 |


**Target vector (y)**

In [10]:
y_cep = cep['species']
y_cep.shape

(1236,)

In [11]:
y_cep.head()

0    Coccinella transversoguttata
1          Coccinella novemnotata
2          Coccinella novemnotata
3          Coccinella novemnotata
4          Coccinella novemnotata
Name: species, dtype: object
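Before modeling, it can help to check how the records are distributed across species, since a heavily imbalanced target changes how an accuracy score should be read. The sketch below uses a small made-up Series (`species_demo`, not the real `y_cep`) and relies only on pandas' `value_counts`.

```python
import pandas as pd

# Hypothetical stand-in for y_cep; the real labels come from the download above.
species_demo = pd.Series([
    "Coccinella novemnotata", "Coccinella novemnotata", "Coccinella novemnotata",
    "Coccinella trifasciata", "Coccinella transversoguttata",
], name="species")

# Number of records per species, most common first
counts = species_demo.value_counts()

# Share of the most common species: the accuracy of always guessing that species
majority_share = species_demo.value_counts(normalize=True).iloc[0]

print(counts)
print(f"majority share: {majority_share:.2f}")
```

On the notebook's data, `y_cep.value_counts()` would give the same kind of tally.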

### Split data into training and test data sets.

In [12]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_cep, y_cep, random_state=1)
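With no `test_size` argument, `train_test_split` holds out 25% of the rows, so the 1,236 records split into 927 training rows and 309 test rows. The sketch below checks that arithmetic on placeholder arrays of the same length and also shows the `stratify` option, which keeps class proportions similar in both subsets. The arrays and the two class labels are made up for illustration; on the real data, `stratify` would raise an error if any species had only one record.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cleaned table (values are made up)
X_demo = np.zeros((1236, 3))
y_demo = np.array(["common"] * 900 + ["rare"] * 336)

# Default split: 75% train, 25% test
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1)
print(Xtr.shape, Xte.shape)

# Stratified split: each class keeps roughly its overall share in both subsets
Xtr_s, Xte_s, ytr_s, yte_s = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)
rare_share_test = np.mean(yte_s == "rare")
print(round(rare_share_test, 3))
```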

### Select a machine learning class from scikit-learn.

I selected scikit-learn's Gaussian naive Bayes classifier, a generative model, to predict the species from the three variables in the feature matrix.
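Gaussian naive Bayes fits, for each class, an independent normal distribution to every feature and predicts the class with the highest posterior probability. A minimal NumPy sketch of that rule follows, checked against `GaussianNB` on synthetic data. The function `manual_gnb_predict` and the two made-up classes are illustrative assumptions, and the sketch omits the small variance smoothing that scikit-learn adds.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic (latitude, longitude, year) rows for two made-up classes
X_demo = np.vstack([
    rng.normal([40.0, -105.0, 1950.0], [2.0, 3.0, 20.0], size=(50, 3)),
    rng.normal([43.0, -76.0, 1930.0], [2.0, 3.0, 20.0], size=(50, 3)),
])
y_demo = np.array([0] * 50 + [1] * 50)


def manual_gnb_predict(X_train, y_train, X_new):
    """Predict classes with per-class, per-feature normal densities."""
    classes = np.unique(y_train)
    log_posteriors = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0)
        log_prior = np.log(len(Xc) / len(X_train))
        # Sum of per-feature log densities: the "naive" independence assumption
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1
        )
        log_posteriors.append(log_prior + log_lik)
    # Rows are samples, columns are classes; pick the best class per row
    return classes[np.argmax(np.column_stack(log_posteriors), axis=1)]


manual_pred = manual_gnb_predict(X_demo, y_demo, X_demo)
sklearn_pred = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
agreement = np.mean(manual_pred == sklearn_pred)
print(agreement)
```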

In [13]:
from sklearn.naive_bayes import GaussianNB

### Create an instance of that class.

There are no required hyperparameters for this model.

In [14]:
model = GaussianNB()

### Train the model on the training data set.

In [15]:
model.fit(Xtrain, ytrain)

GaussianNB(priors=None)

### Get predictions by applying your model to the test data set.

In [16]:
y_model = model.predict(Xtest)

### Measure the accuracy of your model by comparing the predicted values to the actual labels in your test data.

In [17]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

0.44660194174757284
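An accuracy score is easier to interpret next to a baseline. Assuming the default 25% hold-out of 1,236 rows (309 test rows), the value above corresponds to 138 correct predictions out of 309. The sketch below shows one way to get a majority-class baseline with scikit-learn's `DummyClassifier`; the tiny label arrays are invented for illustration, so the printed number says nothing about the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: "A" is the most common class in the training set
ytrain_demo = np.array(["A"] * 6 + ["B"] * 3 + ["C"] * 1)
ytest_demo = np.array(["A", "A", "B", "C", "A"])

# DummyClassifier ignores the features, so placeholder columns are enough
Xtrain_demo = np.zeros((len(ytrain_demo), 1))
Xtest_demo = np.zeros((len(ytest_demo), 1))

# Always predict the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(Xtrain_demo, ytrain_demo)
baseline_acc = accuracy_score(ytest_demo, baseline.predict(Xtest_demo))
print(baseline_acc)
```

On the notebook's data, the equivalent would be fitting the dummy model on `Xtrain, ytrain` and scoring it on `Xtest, ytest`; the naive Bayes score is only informative to the extent that it exceeds that baseline.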

### Describe the model that you tried to apply and the question that you tried to answer. How well do you think the model worked?

I tried to predict the species of _Coccinella_ (a genus of ladybugs) from latitude, longitude, and year using scikit-learn's Gaussian naive Bayes classifier. The model was only 44.7% accurate on the test set, which is not unexpected: three features carry limited information for separating many species, and naive Bayes assumes that the features are independent and normally distributed within each species. Accuracy could likely be improved by including additional explanatory variables and by using a more complex model. Candidate features include the concentration of certain pesticides in nearby air, water, or soil; the occurrence of aphids or other prey species; the presence of competitor species (from other ladybug genera); whether the species is introduced or native; and habitat (the habitat values in these records were not uniformly coded).
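As a rough illustration of what "a more complex model" could mean, the sketch below compares `GaussianNB` with a random forest using 5-fold cross-validation. Everything here is an assumption made for illustration: the data are synthetic, the labelling rule is deliberately constructed so that latitude and longitude matter only in combination, and the random forest is just one possible alternative, not a model used in this assignment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 400

# Synthetic features in roughly the same ranges as the real columns
lat = rng.uniform(30.0, 50.0, n)
lon = rng.uniform(-120.0, -70.0, n)
year = rng.integers(1900, 2001, n)
X_demo = np.column_stack([lat, lon, year])

# Made-up rule: the label depends on latitude and longitude jointly,
# which a model treating the features independently cannot capture
y_demo = ((lat > 40.0) ^ (lon > -95.0)).astype(int)

models = {
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}
mean_scores = {}
for name, clf in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores = cross_val_score(clf, X_demo, y_demo, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

The gap between the two models on this constructed example does not predict how they would compare on the real occurrence records; it only shows the mechanics of a like-for-like comparison.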

The _Python Data Science Handbook_ chapter on [Naive Bayes Classification](https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html) was a useful resource when completing this assignment.