# Insight Project --Birding Big Year--

In this project I intend to determine a way to win the Big Year competition run by the American Birding Association (ABA), following their rules. As part of their rules they give the list of eligible birds (1116). All the birds have to be seen between 12:00 AM on January 1st and 11:59 PM on December 31st of the same year.



In [None]:
import numpy as np
from datetime import datetime
import geopandas as gpd
from shapely.geometry import Point
import os
import struct
import pickle

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib
from mpl_toolkits.axes_grid1 import make_axes_locatable

from sklearn.cluster import KMeans

import pandas as pd
from pandas import json_normalize, read_json

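def save_fig(name, fig=None):
    # Fall back to the current figure when none is passed explicitly
    fig = fig if fig is not None else plt.gcf()
    fig.savefig(name,dpi=80,bbox_inches='tight', pad_inches=0.02, format = 'jpg')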

%matplotlib inline

# The eBird Data

I will start with a single state, Wyoming (WY). Since the eBird API limits the types of requests I can make, I have downloaded the data as a file (tab-delimited text) instead. I'm using the last two full years of data, but in reality the algorithm should be trained on more data and only tested on the last year.

In [None]:
dfAll = pd.read_csv('./ebd_US-WY_201801_201912_relApr-2020/ebd_US-WY_201801_201912_relApr-2020.txt'
                ,delimiter="\t")

# dfAll = pd.read_csv('./ebd_US-WI_201001_201812_relApr-2020/ebd_US-WI_201001_201812_relApr-2020.txt'
#                 ,delimiter="\t")

I add certain conditions to ensure the completeness of the data: complete checklists, approved records, public locations (hotspots), and only bird species (i.e. no hybrids). `dfReduce` will contain all the information I will be using.

In [None]:
dfAll = dfAll[(dfAll['CATEGORY'] == 'species') & (dfAll['LOCALITY TYPE'] == 'H')
              & (dfAll['ALL SPECIES REPORTED'] == 1)  & (dfAll['APPROVED'] == 1)]
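
For larger downloads (more years or more states), the same filter could be applied while reading the file in chunks, so that only the rows of interest stay in memory. The following is a minimal sketch under that assumption: the helper name `load_ebd_filtered`, the chunk size, and the in-memory sample are hypothetical and only stand in for the real eBird file.

```python
import io
import pandas as pd

# Columns needed for the filter plus the species name (subset chosen for this sketch)
COLUMNS = ['COMMON NAME', 'CATEGORY', 'LOCALITY', 'LOCALITY TYPE',
           'ALL SPECIES REPORTED', 'APPROVED']

def load_ebd_filtered(path_or_buffer, chunksize=100_000):
    """Hypothetical helper: read a tab-delimited export in chunks and keep
    approved, complete checklists of full species at hotspots."""
    kept = []
    reader = pd.read_csv(path_or_buffer, delimiter='\t', usecols=COLUMNS,
                         chunksize=chunksize)
    for chunk in reader:
        mask = ((chunk['CATEGORY'] == 'species')
                & (chunk['LOCALITY TYPE'] == 'H')
                & (chunk['ALL SPECIES REPORTED'] == 1)
                & (chunk['APPROVED'] == 1))
        if mask.any():
            kept.append(chunk[mask])
    if not kept:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(kept, ignore_index=True)

# Tiny made-up sample: one valid record, one hybrid, one personal location
sample = io.StringIO(
    "COMMON NAME\tCATEGORY\tLOCALITY\tLOCALITY TYPE\tALL SPECIES REPORTED\tAPPROVED\n"
    "Mallard\tspecies\tLake A\tH\t1\t1\n"
    "Mallard x Duck\thybrid\tLake A\tH\t1\t1\n"
    "Bald Eagle\tspecies\tBackyard\tP\t1\t1\n"
)
df_sample = load_ebd_filtered(sample, chunksize=2)
print(df_sample['COMMON NAME'].tolist())
```

Only the first sample row passes all four conditions, so the printed list contains a single species.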

In [None]:
dfReduce = dfAll.filter(['SAMPLING EVENT IDENTIFIER', 'COMMON NAME', 'LOCALITY', 'TIME OBSERVATIONS STARTED',
              'LATITUDE', 'LONGITUDE', 'OBSERVATION DATE', 'ALL SPECIES REPORTED']).copy()
dfReduce['OBSERVATION DATE'] = pd.to_datetime(dfReduce['OBSERVATION DATE'])
# Week of the year (00-53, weeks starting on Monday), stored as an integer
dfReduce['YEAR WEEK'] = pd.to_numeric(dfReduce['OBSERVATION DATE'].dt.strftime('%W'))
dfReduce['YEAR DAY'] = dfReduce['OBSERVATION DATE'].dt.dayofyear
# Keep the year as an integer so that it can be compared with 2019 when splitting the data
dfReduce['YEAR'] = dfReduce['OBSERVATION DATE'].dt.year

In [None]:
dfReduce.head(5)

`dfReduce` contains both my training set and my validation set. In this case I will use the last year (2019) as my validation set and all the previous data as my training set.

In [None]:
dfValidation = dfReduce[dfReduce['YEAR']==2019]

In [None]:
dfTrain = dfReduce[dfReduce['YEAR']!=2019]

# Let's do the k-means clustering

Using k-means clustering on the `dfTrain` data, I will select the clusters that will be used by the path finder. These clusters are fixed in space.

In [None]:
kmeans = KMeans(init='k-means++', n_clusters=11, n_init=10,random_state = 2345)
dfKMeans = dfTrain.filter(['LATITUDE', 'LONGITUDE', 'LOCALITY']).drop_duplicates()

In [None]:
kmeans.fit(dfKMeans.filter(['LATITUDE', 'LONGITUDE']))

In [None]:
centroids = kmeans.cluster_centers_

In [None]:
plotter = dfTrain.filter(['LATITUDE', 'LONGITUDE']).drop_duplicates()

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = np.min(plotter['LATITUDE']),  np.max(plotter['LATITUDE'])
y_min, y_max = np.min(plotter['LONGITUDE']), np.max(plotter['LONGITUDE'])
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)


In [None]:
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower', alpha = 0.5)
plt.scatter(plotter['LATITUDE'],plotter['LONGITUDE'], marker = '+')
plt.scatter(centroids[:,0], centroids[:,1])
plt.show()


In [None]:
kmeans.predict([[42.024710,-110.589578]])

In [None]:
dfKMeans['K-cluster'] = np.array(kmeans.labels_)

In [None]:
dfKMeans.head(5)

#### Now the bird probability.

`dfKMeans` has the information about where each of the hotspots lies, in terms of its cluster. To construct a path, it is important to mask the probabilities of seeing a particular bird with T or F on a weekly basis. This is critical in order to construct the sets.
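
As a small illustration of this masking step, the sketch below turns per-cluster observation frequencies into one set of "expected" species per cluster for a single week. The toy data and the `THRESHOLD` value are assumptions made for the example and are not taken from the eBird download.

```python
import pandas as pd

# Toy records: one row per species record (made-up data for illustration)
toy = pd.DataFrame({
    'COMMON NAME': ['Mallard', 'Mallard', 'Mallard', 'Bald Eagle',
                    'Mallard', 'Snowy Owl'],
    'K-cluster':   [0, 0, 0, 0, 1, 1],
})

THRESHOLD = 0.3  # hypothetical cut-off, chosen large because the toy set is tiny

# Records per (species, cluster), divided by all records in that cluster
pos = toy.groupby(['COMMON NAME', 'K-cluster']).size().reset_index(name='POS OBS')
tot = toy.groupby('K-cluster').size().reset_index(name='TOT OBS')
freq = pos.merge(tot, on='K-cluster', how='left')
freq['POS PROB'] = freq['POS OBS'] / freq['TOT OBS']

# True/False mask, then one set of species per cluster
present = freq[freq['POS PROB'] >= THRESHOLD]
week_sets = present.groupby('K-cluster')['COMMON NAME'].apply(set).to_dict()
print(week_sets)
```

In this toy case the Bald Eagle accounts for only a quarter of the records in cluster 0, so it falls below the cut-off and is left out of that cluster's set.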

In [None]:
dfProb = dfTrain.merge(dfKMeans.filter(['LOCALITY','K-cluster']), on='LOCALITY', how='left')
dfProb = dfProb.filter(['COMMON NAME','ALL SPECIES REPORTED','YEAR WEEK', 'K-cluster'])

In [None]:
dfProb.head(5)

In [None]:
nTime = 54
nLoc = dfKMeans['K-cluster'].unique().shape[0]
setMat = np.zeros((nTime,nLoc), dtype=object)

In [None]:
for week in range(0,nTime):
    dfProbA = dfProb[dfProb['YEAR WEEK']== week]
    # Positive observations: records of each species in each cluster this week
    dfProb1 = dfProbA.groupby(['COMMON NAME','K-cluster'])['ALL SPECIES REPORTED'].sum().reset_index()
    dfProb1.rename(columns = {'ALL SPECIES REPORTED':'POS OBS'}, inplace=True)
    # Total observations: all records in each cluster this week
    dfProb2 = dfProbA.groupby(['K-cluster'])['ALL SPECIES REPORTED'].sum().reset_index()
    dfProb2.rename(columns = {'ALL SPECIES REPORTED':'TOT OBS'}, inplace=True)
    dfProb3 = dfProb1.merge(dfProb2, on='K-cluster', how = 'left')
    dfProb3['POS PROB'] = dfProb3['POS OBS']/dfProb3['TOT OBS']
    for loc in range(0,nLoc):
        # Keep the species whose observation frequency is at least 2%
        likely = (dfProb3['K-cluster'] == loc) & (dfProb3['POS PROB'] >= 0.02)
        setMat[week,loc] = set(dfProb3.loc[likely, 'COMMON NAME'])
        

In [None]:
setMat

In [None]:
ToMakeUniverse = list(setMat.flatten())
Universe = set(e for s in ToMakeUniverse for e in s)

In [None]:
list(Universe)

# Here we go!!!!!

First the user inputs some coordinates.
Then the coordinates get translated to a k-cluster.
That gives us the first set (first week).
Then we obtain the rest of the sets. The key here is to backtrack a set to an actual 'x,t' entry so we can have a route.
Finally, display that list of locations in some way (probably using the centroid maps or coordinates).
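
One way to do the backtracking, sketched here as an assumption rather than as the method used in the rest of the notebook, is to treat each set's position in the flattened (week, cluster) grid as a single index and convert it back with `np.unravel_index`. The grid shape and the example indices below are hypothetical.

```python
import numpy as np

n_weeks, n_clusters = 54, 11  # assumed grid shape: weeks x clusters

# Flattening the grid row by row gives: index = week * n_clusters + cluster
flat_indices = [0, 12, 593]  # hypothetical picks returned by a set-cover step

for flat in flat_indices:
    week, cluster = np.unravel_index(flat, (n_weeks, n_clusters))
    print('week', week, 'cluster', cluster)
```

Because the indices start at 0, the first entry maps to week 0 and cluster 0, and the last entry of the grid (593) maps to week 53 and cluster 10.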

In [None]:
userInputLat,userInputLon = 44, -110
userInput = [userInputLat,userInputLon]
print(userInput)

In the first week I must see:

In [None]:
initialLocSet = setMat[0,kmeans.predict([userInput])[0]]
print(list(initialLocSet))

The whole list of birds that we are planning to see is:

In [None]:
# print(list(Universe))

In [None]:
print('With a total of', len(list(Universe)), 'birds')

In [None]:
def set_cover_mine(elements, subsets, initset):
    '''
    Greedy polynomial-time approximation of set cover: at each stage, choose
    the subset that contains the largest number of uncovered elements.

    Returns the chosen subsets and their indices in `subsets`.
    '''
    covered = initset.copy()
    cover = []
    listCover = []
    # Greedily add the subsets with the most uncovered points
    while not elements <= covered:
        subset = max(subsets, key=lambda s: len(s - covered))
        if not subset - covered:
            break  # no remaining subset adds anything new; avoid an infinite loop
        cover.append(subset)
        listCover.append(subsets.index(subset))
        covered |= subset

    return cover, listCover





In [None]:
aa, bb = set_cover_mine(Universe, ToMakeUniverse, initialLocSet)
print(bb)

In [None]:
bbb = np.sort(bb)

In [None]:
# Indices from set_cover_mine are 0-based positions in the flattened (week, location) grid
locMat = np.arange(nTime*nLoc).reshape(nTime,nLoc)

In [None]:
for element in bbb:
    a,b = np.where(locMat == element)
    print('On week:',a[0],'You need to be at location:',b[0])
    