### Tree Seed Dispersal modelling using ML

data: [HF120](https://harvardforest1.fas.harvard.edu/exist/apps/datasets/showData.html?id=HF120)

Use regression algorithms to predict the distance that each seed travelled from the point at which it was dropped.
 - height = height at which each seed was dropped (meter)
 - wind.velocity = wind velocity measured with a digital anemometer as each seed was dropped (metersPerSecond)
 - distance (target) = distance that each seed travelled from the point at which it was dropped (meter)
  
  
  
   
  
###### Tree Seed Dispersal modelling using Machine Learning 
Forests are critically important for biodiversity and provide substantial health and economic benefits. However, forests' response to direct mortality from infestation followed by defoliation, and to indirect mortality from pre-emptive logging, is not well understood. The efficacy of vegetation regeneration following hemlock decline depends on advance regeneration of seedlings and saplings, seed dispersal, and recruitment. In this study, we investigated whether two basic parameters, height of release and wind velocity, can be used to model seed dispersal distance in areas both with and without canopies. For modelling, we trained three SVM-based machine learning models that allow linear or nonlinear (polynomial and RBF) dependencies. The dispersal distances predicted by all three models did not fit the observed dispersal data well. The poor fit is likely due to the very small size of the training dataset. Future research should compare model results between open areas and areas with canopies, since canopies are known to diminish the effects of wind and height. More complex models and larger datasets are necessary to model highly non-linear seed dispersal patterns.
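
With a small dataset, a single train/test split can give an unstable picture of model quality. One possible way to compare the three kernels is k-fold cross-validation with feature scaling, sketched below. This is a minimal illustration, not the analysis used in this study: the data are synthetic, the toy relation between height, wind velocity and distance is an assumption, and the names `X_demo`, `y_demo`, `kernels` and `cv_r2` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the HF120 columns (NOT the real data):
# height in m, wind velocity in m/s, distance in m.
rng = np.random.default_rng(0)
n = 120
height = rng.uniform(1.0, 10.0, n)
wind = rng.uniform(0.0, 5.0, n)
# Assumed toy relation, chosen only so the example has some signal to fit.
distance = 0.3 * height * wind + rng.normal(0.0, 0.5, n)

X_demo = np.column_stack([height, wind])
y_demo = distance

# Same three kernel families as in the study.
kernels = {
    'rbf': SVR(kernel='rbf', C=100, gamma=0.1, epsilon=0.1),
    'linear': SVR(kernel='linear', C=100),
    'poly': SVR(kernel='poly', C=100, gamma='auto', degree=3,
                epsilon=0.1, coef0=1),
}

# 5-fold cross-validation; the scaler is refit inside each fold,
# so no information from the held-out fold leaks into training.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
cv_r2 = {}
for name, svr in kernels.items():
    pipe = make_pipeline(StandardScaler(), svr)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='r2')
    cv_r2[name] = scores
    print(f'{name}: mean R^2 = {scores.mean():.2f} (+/- {scores.std():.2f})')
```

Scaling matters here because SVR kernels are sensitive to feature ranges, and height and wind velocity are measured on different scales. The spread of the per-fold scores gives a rough sense of how much the result depends on which seeds end up in the test set.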


In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import sys, os, pathlib, shutil, platform
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier


from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
)
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import OrdinalEncoder

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [None]:
%matplotlib inline

In [None]:
species = ['maple', 'oak', 'birch']
dataFileRootName = ['hf120-01-', 'hf120-02-', 'hf120-03-']
dataFileName = [root + spc + '.csv' for root, spc in zip(dataFileRootName, species)]

# Read one CSV per species, tag every row with its species, and stack the tables.
dataDir = pathlib.Path('./../data/harvardf/HF120')
myData = pd.concat(
    [pd.read_csv(dataDir / f).assign(spp=spc) for f, spc in zip(dataFileName, species)],
    ignore_index=True,
)

In [None]:
myData.shape
myData.head(2)
myData.tail(2)

In [None]:
myData.info()
myData.isnull().sum()

In [None]:
myCols = ['height', 'spp', 'distance']
myData[myCols[0]].value_counts(dropna=False) 
myData[myCols[1]].value_counts(dropna=False)
myData[myCols[2]].value_counts(dropna=False)
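# Total distance per release height and species
# (the string 'sum' avoids the pandas deprecation warning for aggfunc=np.sum).
myData.pivot_table(index=[myCols[0]],
                   columns=myCols[1],
                   values=myCols[2],
                   aggfunc='sum', fill_value=0)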



In [None]:
myCols = ['height', 'spp', 'wind.velocity']
myData[myCols[0]].value_counts(dropna=False) 
myData[myCols[1]].value_counts(dropna=False)
myData[myCols[2]].value_counts(dropna=False)
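# Total wind velocity per release height and species
# (the string 'sum' avoids the pandas deprecation warning for aggfunc=np.sum).
myData.pivot_table(index=[myCols[0]],
                   columns=myCols[1],
                   values=myCols[2],
                   aggfunc='sum', fill_value=0)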

In [None]:
filteredDataML = myData[myData['spp'].isin(['maple','oak'])]
filteredDataML.shape
filteredDataML.head()


In [None]:
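plt.figure(figsize=(8,5))
# Plot dispersal distance (y) against release height (x) so the data match the axis labels.
X_data, y_data = (filteredDataML["height"].values, filteredDataML["distance"].values)
plt.plot(X_data, y_data, 'ro')
plt.suptitle('Dispersal distance vs. release height', y=1.02)
plt.ylabel('distance (m)')
plt.xlabel('height (m)')
plt.show()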

In [None]:
X_data, y_data = (filteredDataML[['height','wind.velocity']].values, filteredDataML['distance'].values)


In [None]:
import seaborn as sns
sns.pairplot(filteredDataML[['height','wind.velocity', 'distance', 'spp']])

In [None]:
(filteredDataML[['height','wind.velocity', 'distance', 'spp']]).isnull().sum()

# filteredDataML.dropna(inplace=True)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import metrics

X_train, X_test, Y_train, Y_test = train_test_split(X_data, y_data, test_size=0.2, random_state = 1)

In [None]:

model = linear_model.LinearRegression()
model.fit(X_train, Y_train)

In [None]:
Y_pred_train = model.predict(X_train)


In [None]:
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Mean squared error (MSE): %.2f'
      % metrics.mean_squared_error(Y_train, Y_pred_train))
print('Coefficient of determination (R^2): %.2f'
      % metrics.r2_score(Y_train, Y_pred_train))

In [None]:
Y_pred_test = model.predict(X_test)


In [None]:
print('Coefficients:', model.coef_)
print('Intercept:', model.intercept_)
print('Coefficient of determination (R^2): %.2f'
      % metrics.r2_score(Y_test, Y_pred_test))

print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_pred_test))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_pred_test))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred_test)))

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(5,11))

# 2 row, 1 column, plot 1
plt.subplot(2, 1, 1)
plt.scatter(x=Y_train, y=Y_pred_train, c="#7CAE00", alpha=0.3)

# Add trendline
# https://stackoverflow.com/questions/26447191/how-to-add-trendline-in-python-matplotlib-dot-scatter-graphs
z = np.polyfit(Y_train, Y_pred_train, 1)
p = np.poly1d(z)
# Draw the training trendline over the training targets (not the test targets).
plt.plot(Y_train, p(Y_train), "#F8766D")

plt.ylabel('Predicted distance (m)')


# 2 row, 1 column, plot 2
plt.subplot(2, 1, 2)
plt.scatter(x=Y_test, y=Y_pred_test, c="#619CFF", alpha=0.3)

z = np.polyfit(Y_test, Y_pred_test, 1)
p = np.poly1d(z)
plt.plot(Y_test,p(Y_test),"#F8766D")

plt.ylabel('Predicted distance (m)')
plt.xlabel('Observed distance (m)')

plt.savefig('plot_vertical_logS.png')
plt.savefig('plot_vertical_logS.pdf')
plt.show()

In [None]:
# https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
from sklearn.svm import SVR

In [None]:
# Fit regression model
svr_rbf = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1)
svr_lin = SVR(kernel='linear', C=100, gamma='auto')
svr_poly = SVR(kernel='poly', C=100, gamma='auto', degree=3, epsilon=.1,
               coef0=1)

In [None]:
svrs = [svr_rbf, svr_lin, svr_poly]
kernel_label = ['RBF', 'Linear', 'Polynomial']

model=list()
for ix, svr in enumerate(svrs):
    model.append(svr.fit(X_train, Y_train))

In [None]:
for ix, svr in enumerate(svrs):
#     print(model[ix].support_)
    pass


In [None]:
# plotted_col = 'height'
# X_train[:,0]

In [None]:
# Look at the results
lw = 2
plotted_col = 'height'
# 'height','wind.velocity'
# model.fit(X_train, Y_train)

# svrs = [svr_rbf, svr_lin, svr_poly]
# kernel_label = ['RBF', 'Linear', 'Polynomial']
model_color = ['m', 'c', 'g']

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 10), sharey=True)
for ix, svr in enumerate(svrs):
#     horiz_data = X_train[:,0]
    horiz_data = Y_train
    axes[ix].scatter(horiz_data, model[ix].predict(X_train), color=model_color[ix], lw=lw,
                  label='{} model'.format(kernel_label[ix]))
    axes[ix].scatter((horiz_data)[model[ix].support_], Y_train[model[ix].support_], facecolor="none",
                     edgecolor=model_color[ix], s=50,
                     label='{} support vectors'.format(kernel_label[ix]))
#     axes[ix].scatter(X[np.setdiff1d(np.arange(len(X)), svr.support_)],
#                      y[np.setdiff1d(np.arange(len(X)), svr.support_)],
#                      facecolor="none", edgecolor="k", s=50,
#                      label='other training data')
    axes[ix].legend(loc='upper center', bbox_to_anchor=(0.5, 1.1),
                    ncol=1, fancybox=True, shadow=True)

fig.text(0.5, 0.04, 'observed distance (m)', ha='center', va='center')
fig.text(0.06, 0.5, 'predicted distance (m)', ha='center', va='center', rotation='vertical')
fig.suptitle("Support Vector Regression", fontsize=14)
plt.show()

In [None]:
for ix, svr in enumerate(svrs):
    print(str(svr)+':')
    Y_pred_test = model[ix].predict(X_test)
    print('Coefficient of determination (R^2): %.2f'
          % metrics.r2_score(Y_test, Y_pred_test))

    print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_pred_test))
    print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_pred_test))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred_test)))