### Machine Learning assignment:
#### Predicting Oncorhynchus species occurrence based on states in the US

Oncorhynchus is a genus in the family Salmonidae that contains the Pacific salmon and Pacific trout. In this assignment, the `records` library was used to download over 8,000 Oncorhynchus occurrence records from [GBIF](https://www.gbif.org/). State names were then used to predict Oncorhynchus species occurrence with a nearest-neighbors classification method.

__1. Use the records library to obtain a series of occurrence records of *Oncorhynchus* between 1800 and the current year in 5-year intervals__

In [1]:
import records
import numpy as np
import pandas as pd

In [2]:
ep = records.Epochs("Oncorhynchus", 1800, 2018, 5, **{"country":"US"})

In [3]:
# view the shape of the dataframe
ep.sdf.shape

(8390, 5)

In [4]:
# view the first 5 rows (the default for head())
ep.sdf.head()

| | species | year | epoch | country | stateProvince |
|---|---|---|---|---|---|
| 0 | Oncorhynchus clarkii | 1868.0 | 1865 | United States | Utah |
| 1 | Oncorhynchus clarkii | 1871.0 | 1870 | United States | Montana |
| 2 | Oncorhynchus clarkii | 1871.0 | 1870 | United States | Montana |
| 3 | Oncorhynchus clarkii | 1871.0 | 1870 | United States | Montana |
| 4 | Oncorhynchus clarkii | 1871.0 | 1870 | United States | Montana |


In [5]:
# view all of the unique species 
ep.sdf.species.unique()

array(['Oncorhynchus clarkii', 'Oncorhynchus gorbuscha',
       'Ascorhynchus armatus', 'Oncorhynchus nerka', 'Oncorhynchus mykiss',
       'Oncorhynchus kisutch', 'Oncorhynchus aguabonita', nan,
       'Oncorhynchus tshawytscha', 'Oncorhynchus keta',
       'Oncorhynchus apache', 'Ascorhynchus latipes',
       'Otiorhynchus rugosostriatus', 'Otiorhynchus sulcatus',
       'Hypsiglena torquata', 'Otiorhynchus ovatus',
       'Brachyrhinus sulcatus', 'Brachyrhinus rugifrons',
       'Brachyrhinus ovatus', 'Lycodes palearis', 'Rhacochilus vacca',
       'Cymatogaster aggregata', 'Oncorhynchus gilae',
       'Ascorhynchus castellioides', 'Otiorhynchus rugifrons',
       'Otiorhynchus meridionalis', 'Ascorhynchus pyrginospinus',
       'Ascorhynchus japonicus', 'Thymallus arcticus',
       'Ascorhynchus horologum', 'Ascorhynchus athernus',
       'Ascorhynchus ovicoxa', 'Rocinela belliceps',
       'Ascorhynchus crenatus', 'Gonorynchus moseleyi',
       'Hexagrammos decagrammus', 'Notorync

In [6]:
# drop rows that contain NaN in any column
rainbow = ep.sdf.dropna(subset = list(ep.sdf.columns))

In [7]:
rainbow.shape

(7918, 5)
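The unique-species output above includes names outside the genus (for example Otiorhynchus and Ascorhynchus), and `dropna` alone does not remove them. One possible way to restrict the data to the genus is a string filter on the `species` column. The sketch below is not part of the original analysis; it uses a small made-up dataframe (`toy`) in place of `ep.sdf`, and its rows are illustrative rather than real GBIF records.

```python
import pandas as pd

# Toy stand-in for ep.sdf; these rows are illustrative, not real GBIF records.
toy = pd.DataFrame({
    "species": ["Oncorhynchus mykiss", "Otiorhynchus sulcatus",
                "Oncorhynchus nerka", "Ascorhynchus armatus", None],
    "stateProvince": ["Oregon", "Ohio", "Alaska", "California", "Utah"],
})

# Keep rows whose species name starts with the genus name followed by a space.
# na=False treats missing species names as non-matches.
is_genus = toy["species"].str.startswith("Oncorhynchus ", na=False)
genus_only = toy[is_genus].reset_index(drop=True)

print(genus_only)
```

Applied to the real dataframe, the same mask would be built from `ep.sdf["species"]`; the shapes and accuracy reported in this notebook were computed without such a filter.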

__2. Apply a machine learning method from the scikit-learn library to the data in the dataframe of my records.Epochs object__

__i. Select a column of labels (y) and one column of features (X), and split the dataset into a training and a test data set.__

In [9]:
y_rainbow = rainbow['species']

In [10]:
X_rainbow = rainbow['stateProvince']

In [11]:
# Convert the categorical feature 'stateProvince' into integers, then one-hot encode it
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
LabelEnc = LabelEncoder()
X_int = LabelEnc.fit_transform(X_rainbow)
# Note: scikit-learn >= 1.2 renames this argument to sparse_output=False
OnehotEnc = OneHotEncoder(sparse = False)
# OneHotEncoder expects a 2-D array of shape (n_samples, 1)
X_int = X_int.reshape(len(X_int), 1)
newX_rainbow = OnehotEnc.fit_transform(X_int)
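To make the encoding step concrete, here is a minimal sketch of what one-hot encoding does to a handful of state names. The four state values are made up for illustration, and the sketch assumes scikit-learn 0.20 or later, where `OneHotEncoder` accepts string categories directly. It calls `.toarray()` on the default sparse output instead of passing a `sparse` argument, since that argument's name differs between versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative state names, reshaped to the (n_samples, 1) layout the encoder expects.
states = np.array(["Utah", "Montana", "Montana", "Alaska"]).reshape(-1, 1)

enc = OneHotEncoder()
onehot = enc.fit_transform(states).toarray()  # dense copy of the sparse result

# Categories are sorted, so the columns are Alaska, Montana, Utah.
print(enc.categories_[0])
# Each row has a single 1 in the column of its state.
print(onehot)
```

Each state becomes its own binary column, so two records are at distance zero when they share a state and at the same nonzero distance otherwise.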

In [12]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(newX_rainbow, y_rainbow, random_state=1)



__ii. Select a machine learning class from scikit-learn__

In [13]:
from sklearn.neighbors import KNeighborsClassifier

__iii. Create an instance of that class__

In [14]:
knn = KNeighborsClassifier(n_neighbors=1)

__iv. Train the model on your training data set__

In [15]:
knn.fit(Xtrain, ytrain)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

__v. Get predictions by applying your model to the test data set__

In [16]:
yfit = knn.predict(Xtest)

__vi. Measure the accuracy of your model by comparing the predicted values to the actual labels in your test data.__

In [17]:
np.mean(yfit == ytest)

0.42878787878787877
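An accuracy figure is easier to interpret next to a simple baseline, such as always predicting the most common species in the training set. The sketch below shows one way to compute that baseline with scikit-learn's `DummyClassifier` and `accuracy_score`. The label arrays are made-up toy values, not the notebook's data, so the number it produces says nothing about the real baseline for this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels for illustration only; in the notebook these would be ytrain and ytest.
ytrain_toy = np.array(["mykiss", "mykiss", "mykiss", "clarkii", "nerka"])
ytest_toy = np.array(["mykiss", "clarkii", "mykiss", "nerka"])

# The baseline ignores the features, so placeholder columns of zeros are enough.
Xtrain_toy = np.zeros((len(ytrain_toy), 1))
Xtest_toy = np.zeros((len(ytest_toy), 1))

# Always predict the most frequent training label ("mykiss" here).
baseline = DummyClassifier(strategy="most_frequent").fit(Xtrain_toy, ytrain_toy)
baseline_acc = accuracy_score(ytest_toy, baseline.predict(Xtest_toy))

print(baseline_acc)  # 2 of the 4 toy test labels are "mykiss", so 0.5
```

If the model's accuracy is close to this kind of baseline on the real data, the feature is adding little beyond the class frequencies.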

__vii. Summary__

I used a locality feature (stateProvince) to predict the species name of Oncorhynchus records. The accuracy of the model, measured by comparing the predicted values with the actual labels in the held-out test data, was 43%. The keyword search for 'Oncorhynchus' also retrieved species outside the genus Oncorhynchus (e.g., Otiorhynchus and Ascorhynchus), which may be attributable to the inherent complexity of Salmonidae taxonomic classification. These records were not removed before training, so they are included among the labels. In addition, state names were not a good predictor of species occurrence, since some species are widely distributed across North America. Other methods, such as k-means and Gaussian mixture models, could also have been applied to this dataset.
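The 43% figure comes from a single train/test split. A k-fold cross-validation would average the accuracy over several splits and give a sense of its spread. The sketch below illustrates the idea with `cross_val_score` on synthetic data: the state names, species labels, the 20% noise rate, and the choice of `n_neighbors=5` are all assumptions made for the example, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: each state mostly maps to one species, with some label noise.
rng = np.random.RandomState(0)
states = rng.choice(["Alaska", "Oregon", "Utah"], size=90)
species_by_state = {"Alaska": "nerka", "Oregon": "mykiss", "Utah": "clarkii"}
species = np.array([species_by_state[s] for s in states])
noisy = rng.rand(len(species)) < 0.2
species[noisy] = "mykiss"  # a widespread species that shows up in every state

# One-hot encode the single categorical feature (sparse output is fine for KNN).
X = OneHotEncoder().fit_transform(states.reshape(-1, 1))

# 5-fold cross-validation: one accuracy score per held-out fold.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, species, cv=5)

print(scores)
print(scores.mean())
```

On the real data, `newX_rainbow` and `y_rainbow` would take the place of `X` and `species`, provided every species has enough records for the chosen number of folds.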

###### Machine learning map
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html