## Neha Savant's Machine Learning Assignment
### April 1, 2018

**Assignment Instructions:** Fork the `9-python-analysis1` repository and create a new Jupyter notebook
in the assignment directory. In your notebook, use Markdown text to create a title and explain the content
of your notebook to document your code. Use the `records` library to download a series of occurrence records for
a taxon of your choice over a period of time, or use *Bombus* as we have been
using in class. Try to apply a machine learning method from the `scikit-learn` library to
the data in the dataframe of your `records.Epochs` object. This will involve
the steps given in the section headings below. To submit your assignment, commit and push your notebook to the assignments directory and make a pull request to the class repository.

### My Question: Can the month in which *Eurycea* salamanders are found be used to predict the state they were found in?

In [302]:
import records
import pandas as pd
import numpy as np
import requests

# create an Epochs instance to sample seven 3-year intervals of Eurycea salamanders from the U.S.
eur = records.Epochs("Eurycea", 1996, 2017, 3, **{"country": "US"})
eur.df.shape

(3421, 125)

In [303]:
# define Deren's function to count # of unique species
def count_unique_values(series):
    "counts the number of unique species"
    return np.unique(series).size

# call the function to count number of unique species in each state.
(eur.df[eur.df.species.notna()]
 .groupby("stateProvince")
 .species
 .apply(count_unique_values)
)

stateProvince
Alabama            6
Arkansas          11
Connecticut        2
Florida            7
Georgia            7
Illinois           1
Kansas             3
Kentucky           5
Louisiana          1
Maine              1
Maryland           2
Massachusetts      1
Mississippi        2
Missouri           1
New Hampshire      1
New Jersey         2
New York           2
North Carolina     6
Ohio               2
Oklahoma           5
Pennsylvania       2
Rhode Island       1
South Carolina     6
Tennessee          6
Texas              6
Vermont            1
Virginia           4
West Virginia      1
Name: species, dtype: int64
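
pandas also has a built-in for this kind of count. The following is a minimal sketch on a made-up table (the rows are illustrative, not downloaded records), assuming the same `stateProvince` and `species` column names as above:

```python
import numpy as np
import pandas as pd

# toy records for illustration only; not real occurrence data
toy = pd.DataFrame({
    "stateProvince": ["Texas", "Texas", "Texas", "Ohio", "Ohio"],
    "species": ["Eurycea nana", "Eurycea rathbuni", "Eurycea nana",
                "Eurycea bislineata", np.nan],
})

# nunique() skips missing values by default, so no notna() filter is needed
species_per_state = toy.groupby("stateProvince").species.nunique()
print(species_per_state)
```

On the real dataframe this would be expected to give the same counts as the custom function, since both ignore records without a species name.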

In [304]:
#Also must remove non-Eurycea species
eurycea=["Eurycea bislineata", "Eurycea chamberlaini", "Eurycea cirrigera", 
         "Eurycea guttolineata", "Eurycea hillisi", "Eurycea junaluska", "Eurycea longicauda", 
         "Eurycea lucifuga", "Eurycea multiplicata", "Eurycea nana", "Eurycea neotenes", 
         "Eurycea pterophila", "Eurycea quadridigitata", "Eurycea rathbuni", "Eurycea spelaea", 
         "Eurycea sphagnicola", "Eurycea subfluvicola", "Eurycea troglodytes", "Eurycea tynerensis", 
         "Eurycea wallacei", "Eurycea wilderae"]

# species returned by the query that do not belong to the genus Eurycea
non_eurycea = ["Bothriocephalus rarus", "Bothriocephalus typhlotritonis",
               "Cepedietta michiganensis", "Clinostomum complanatum",
               "Desmognathinema nantahalaensis", "Fessisentis vancleavei",
               "Lithobates clamitans", "Desmognathus fuscus",
               "Omeia papillocauda", "Urspelerpes brucei"]

# drop those records and any record without a species name;
# comparing against the string "NaN" does not match missing values, so use notna()
keep = eur.df.species.notna() & ~eur.df.species.isin(non_eurycea)
eur11 = eur.df[keep].copy()

In [305]:
# Assign each state name an integer code (1-29)
states = ["Alabama", "Arkansas", "Connecticut", "Florida", "Georgia",
          "Illinois", "Indiana", "Kansas", "Kentucky", "Louisiana",
          "Maine", "Maryland", "Massachusetts", "Mississippi",
          "Missouri", "New Hampshire", "New Jersey", "New York",
          "North Carolina", "Ohio", "Oklahoma", "Pennsylvania",
          "Rhode Island", "South Carolina", "Tennessee", "Texas",
          "Vermont", "Virginia", "West Virginia"]
# assign the result back instead of calling replace(inplace=True) on a column of a filtered frame
eur11["stateProvince"] = eur11.stateProvince.replace(states, list(range(1, len(states) + 1)))

# Assign each Eurycea species an integer code (1-21), reusing the eurycea list defined above
eur11["species"] = eur11.species.replace(eurycea, list(range(1, len(eurycea) + 1)))
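
The integer codes above are typed by hand, which is easy to get out of step with the name lists. As a hedged alternative, `pd.factorize` can generate the codes from the data. This sketch uses a made-up column; the names `state_codes` and `lookup` are illustrative and do not appear in the notebook:

```python
import pandas as pd

# toy column for illustration only
states = pd.Series(["Texas", "Ohio", "Texas", "Alabama", None])

# sort=True numbers the categories alphabetically; missing values get code -1
codes, labels = pd.factorize(states, sort=True)

# shift to 1-based codes to match the hand-written lists, keeping missing values as NaN
state_codes = pd.Series(codes).where(codes >= 0).add(1)

# code -> state name, useful for translating predictions back to names
lookup = dict(enumerate(labels, start=1))
```

Note that the codes produced this way depend on which names occur in the data, so they would only match the hand-written numbering if every listed state is present.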

### Select appropriate columns and format the data so that you have a column of labels (y) and one or more columns of features (X). Then split it into a training and test data set.

In [306]:
# X holds the species codes (the feature used below); y1 (state) and y2 (month) are candidate label columns.
data = pd.DataFrame({
    "X": eur11.species,
    "y1": eur11.stateProvince, 
    "y2": eur11.month
})

In [307]:
data2 = data.dropna() #drop rows with NA values

In [308]:
data2

       X    y1   y2
0    7.0  21.0  9.0
1    1.0  12.0  6.0
2    7.0  21.0  9.0
3    3.0   4.0  3.0
4    1.0  12.0  6.0
6    8.0  21.0  9.0
7    8.0  21.0  9.0
8    7.0  21.0  9.0
9    7.0  21.0  9.0
12  13.0   4.0  4.0


In [341]:
#Splitting testing and training data

# test size: the first tsize rows are held out for testing, the rest are used for training
tsize = 600

# convert to a 2d array
X = data2.X.values[:, None]
y1 = data2.y1.values[:, None]

# separate test from training
X_test = X[:tsize]
y1_test = y1[:tsize]
X_train = X[tsize:]
y1_train = y1[tsize:]

#from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
#X_train, X_test, y1_train, y1_test = train_test_split(data2.X, data2.y1, random_state=1)
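
Because the slice above takes the first 600 rows as the test set, the split depends on the order of the records. A shuffled split is one way around that. This is a minimal sketch with `train_test_split` from `sklearn.model_selection` on made-up arrays; the `_toy` names and the 25% test fraction are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy feature/label arrays for illustration only
X_toy = np.arange(20, dtype=float)[:, None]   # 2-D: (n_samples, n_features)
y_toy = np.repeat([1, 2], 10)                 # 1-D: (n_samples,)

# shuffle before splitting; stratify keeps class proportions similar in both sets
Xtr, Xte, ytr, yte = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)
print(Xtr.shape, Xte.shape)
```

The `stratify` argument is optional and requires at least two records per class, which may not hold for states with very few observations.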

In [342]:
X

array([[7.],
       [1.],
       [7.],
       ...,
       [3.],
       [3.],
       [3.]])

### Select a machine learning class from scikit-learn. For this you can choose from many available options. Look to your reading for examples, or to the scikit-learn documentation. The best way is to find examples of the model being applied and to substitute your data in for the example data.

In [343]:
from sklearn.naive_bayes import GaussianNB

### Create an instance of that class. 

In [344]:
model = GaussianNB()

### Train your model on your training data set (call `.fit()` with your model).

In [345]:
# labels must be 1-D; ravel() flattens the (n, 1) column and avoids a DataConversionWarning
model.fit(X_train, y1_train.ravel())

GaussianNB(priors=None)

### Get predictions by applying your model to the test data set (call `.predict()` with your model). 

In [349]:
yfit = model.predict(X_test)

### Measure the accuracy of your model by comparing the predicted values to the actual labels in your test data. 

In [350]:
from sklearn.metrics import accuracy_score
accuracy_score(y1_test, yfit)

0.315
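
An accuracy value is easier to judge against a baseline. The sketch below, on made-up labels, computes the accuracy obtained by always predicting the most common class with scikit-learn's `DummyClassifier`. The `_toy` arrays are illustrative and the resulting number says nothing about the salamander data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# toy labels for illustration only: class 21 makes up 4 of the 10 test records
y_train_toy = np.array([21, 21, 21, 4, 12, 21, 4, 21])
y_test_toy = np.array([21, 4, 21, 12, 21, 4, 21, 12, 4, 12])

# the baseline ignores the features, so placeholder columns of zeros are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_acc = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print(baseline_acc)
```

Running the same baseline on `X_train`, `y1_train`, `X_test`, and `y1_test` would show whether 0.315 is better than simply guessing the most frequently recorded state.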

### Describe the model that you tried to apply and the question that you tried to answer (e.g., I tried to use these features of the data to predict this). How well do you think the model worked?

I wanted to use a Gaussian naive Bayes model to see whether I could predict the state (y1) in which a *Eurycea* salamander was observed from the month it was observed (y2), to explore how observation time may be affected by latitude through seasonality. I needed to remove non-*Eurycea* species and transform the species and state names into integer values. I also had trouble getting the data into the array dimensions that the Gaussian model expects. The accuracy of the model is 31.5%. However, the model as fit above uses the species code (X) as its only feature rather than the month (y2), so this accuracy measures how well species predicts state and does not yet test month as a predictor. I would like to use another model to explore a follow-up question: can you predict species (X) using both the state (y1) and month (y2)?
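
For the follow-up question, the main change is the shape of the inputs: two feature columns and a 1-D label. The following is only a sketch under stated assumptions. The records are made up, the decision tree is an arbitrary choice of "another model", and the names `X_two` and `y_species` are illustrative:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# toy records for illustration only; the integers stand in for the codes used above
toy = pd.DataFrame({
    "species": [7, 7, 1, 1, 3, 3, 7, 1],
    "state":   [21, 21, 12, 12, 4, 4, 21, 12],
    "month":   [9, 9, 6, 6, 3, 3, 9, 6],
})

# two feature columns give the (n_samples, 2) array scikit-learn expects;
# the label stays 1-D, so no [:, None] reshaping is needed
X_two = toy[["state", "month"]].values
y_species = toy["species"].values

tree = DecisionTreeClassifier(random_state=0).fit(X_two, y_species)
pred = tree.predict([[21, 9], [4, 3]])
print(pred)
```

Treating state codes as numbers is a simplification, since the integer order of the states carries no meaning; one-hot encoding the state column would be a reasonable refinement.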