# Machine Learning with the `records` Library

## Julia Zeh
#### 4/2/18

### 1. Download the data
Use the `records` library to download a series of occurrence records for a taxon of your choice over a period of time, or use *Bombus* as we have been using in class.

In [48]:
import records

In [113]:
kwargs = {
    "country": "US",
    "stateProvince": "New York",
    "basisOfRecord": "HUMAN_OBSERVATION",
    "hasCoordinates": "TRUE",
    "hasSpecies": "TRUE",
    "hasYear": "TRUE",
    "hasMonth": "TRUE",
}
rec = records.Records("Cetacea", (1900, 2018), **kwargs)
rec.sdf.shape

(253, 4)

In [129]:
rec.sdf.head()

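|   | species | year | country | stateProvince |
|---|---------|------|---------|---------------|
| 0 |  | 1991 | United States | New York |
| 1 |  | 1977 | United States | New York |
| 2 |  | 1980 | United States | New York |
| 3 |  | 1982 | United States | New York |
| 4 |  | 1979 | United States | New York |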


### 2. Preparing the data
Select appropriate columns and format the data so that you have a column of labels (y) and one or more columns of features (X). Then split the data into a training set and a test set.

In [120]:
cet = rec.df.astype(str)[["species", "year", "month", ]]

In [121]:
cet.head()

|   | species | year | month |
|---|---------|------|-------|
| 0 |  | 1991 | 6 |
| 1 |  | 1977 | 11 |
| 2 |  | 1980 | 8 |
| 3 |  | 1982 | 1 |
| 4 |  | 1979 | 4 |


In [122]:
X_cet = cet.drop('species', axis=1)
X_cet.shape

(253, 2)

In [123]:
y_cet = cet['species']
y_cet.shape

(253,)

In [124]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_cet, y_cet,
                                                random_state=1)
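
Since no `test_size` is passed above, the split falls back on scikit-learn's default. As a rough sketch of the resulting set sizes, assuming the default test fraction of 0.25 with the test count rounded up, the arithmetic for the 253 rows looks like this (the variable names here are illustrative, not part of the notebook):

```python
import math

n_samples = 253        # rows in X_cet, from the shape shown above
test_fraction = 0.25   # assumed: scikit-learn's default when no size is given

n_test = math.ceil(test_fraction * n_samples)  # test rows are rounded up
n_train = n_samples - n_test                   # the remainder is used for training

print(n_train, n_test)  # 189 64
```

With only 64 or so test rows spread over several species, each species contributes very few test examples, so the score can shift noticeably with a different `random_state`.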

### 3. Initialize a model instance
Select a machine learning class from scikit-learn. There are many available options to choose from. Look to your reading for examples, or to the scikit-learn documentation. A good approach is to find an example of the model being applied and substitute your data for the example data. Create an instance of the class.

In [125]:
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model

### 4. Fit the model
Train your model on your training data set (call `.fit()` with your model).

In [126]:
model.fit(Xtrain, ytrain)                  # 3. fit model to data

GaussianNB(priors=None)
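
To make the fitting step less of a black box, here is a minimal sketch of the idea behind Gaussian naive Bayes in plain Python: estimate a prior, a mean, and a variance per class and feature, then pick the class with the highest log posterior. This is not scikit-learn's implementation. The helper names (`fit_gnb`, `predict_gnb`), the `eps` smoothing term, and the toy data with made-up labels are all assumptions for illustration.

```python
import math
from collections import defaultdict

def fit_gnb(X, y, eps=1e-9):
    """Return {label: (log_prior, means, variances)} estimated from the training rows."""
    rows_by_label = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_label[label].append(row)

    params = {}
    for label, rows in rows_by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # eps keeps a variance positive when all values in a class are identical
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        params[label] = (math.log(n / len(X)), means, variances)
    return params

def predict_gnb(params, row):
    """Pick the label with the highest log prior plus summed Gaussian log-likelihood."""
    def log_posterior(label):
        log_prior, means, variances = params[label]
        log_lik = sum(-0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
                      for x, m, var in zip(row, means, variances))
        return log_prior + log_lik
    return max(params, key=log_posterior)

# Invented toy data: [year, month] rows with made-up labels
X_toy = [[1980, 1], [1982, 2], [1985, 1], [1981, 7], [1990, 8], [1988, 7]]
y_toy = ["winter_sp", "winter_sp", "winter_sp", "summer_sp", "summer_sp", "summer_sp"]

toy_params = fit_gnb(X_toy, y_toy)
print(predict_gnb(toy_params, [1984, 2]), predict_gnb(toy_params, [1984, 8]))
```

In the toy data the month separates the two labels cleanly, so the model does well. Species whose observations overlap in both year and month give the model little to work with.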

### 5. Get predictions
Get predictions by applying your model to the test data set (call `.predict()` with your model).

In [127]:
y_model = model.predict(Xtest)             # 4. predict on new data

### 6. Assess the goodness of fit (score)
Measure the accuracy of your model by comparing the predicted values to the actual labels in your test data.

In [128]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)

0.21875
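
An accuracy value is easier to judge next to a simple baseline. The sketch below computes accuracy by hand and compares it with always guessing the most common training label. The helper names and the toy label lists are invented stand-ins for `ytrain`, `ytest`, and `y_model`; they are not the notebook's data.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, n_predictions):
    """Predict the most common training label for every test row."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    return [most_common_label] * n_predictions

# Invented labels standing in for ytrain, ytest, and y_model
y_train_toy = ["A", "A", "A", "B", "B", "C"]
y_test_toy = ["A", "B", "A", "C"]
y_model_toy = ["A", "B", "B", "C"]

model_acc = accuracy(y_test_toy, y_model_toy)
baseline_acc = accuracy(y_test_toy, majority_baseline(y_train_toy, len(y_test_toy)))
print(model_acc, baseline_acc)  # 0.75 0.5
```

If the real model's score is close to or below such a baseline, the features are adding little beyond the class frequencies.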

### 7. Final summary
Describe the model that you tried to apply and the question that you tried to answer (e.g., I tried to use these features of the data to predict this). How well do you think the model worked?

I tried to use the year and month of each observation, with month serving as a measure of seasonal occurrence, to predict the species. I used species as the label and trained a Gaussian naive Bayes model (`GaussianNB`) on part of the labeled data to predict the labels of the rest. This did not work very well: the accuracy score was about 22%. I'm not sure whether this is because of the type of model I chose or whether the accuracy could have been improved with better preparation of the data (using more data points or selecting only certain species).
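
One way to try the "selecting only certain species" idea is to drop species with too few records before splitting. The sketch below does this in plain Python. The helper `keep_common_labels`, the threshold, and the placeholder labels are hypothetical; only the five [year, month] rows come from the table in section 2.

```python
from collections import Counter

def keep_common_labels(rows, labels, min_count):
    """Drop rows whose label occurs fewer than min_count times."""
    counts = Counter(labels)
    kept = [(row, label) for row, label in zip(rows, labels)
            if counts[label] >= min_count]
    kept_rows = [row for row, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_rows, kept_labels

# [year, month] rows from the head of the table; the labels are placeholders
rows_toy = [[1991, 6], [1977, 11], [1980, 8], [1982, 1], [1979, 4]]
labels_toy = ["sp_a", "sp_a", "sp_b", "sp_a", "sp_c"]

rows_kept, labels_kept = keep_common_labels(rows_toy, labels_toy, min_count=2)
print(rows_kept, labels_kept)
```

Filtering this way changes the question to "which of the common species is this?", so a score on the filtered data is not directly comparable with the 22% above.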