**README** <br>
To run this notebook, clone the repository at https://github.com/jdevenegas/plb_midterm and navigate to that folder on your machine. <br>
***Important note:*** This notebook was written on Mac OS X and may not be compatible with Windows. We did not have time to troubleshoot it on Windows before the midterm was due; before publication we would ensure that it works on both platforms.

**1. Formatting csv to test pipeline on easily differentiable species** <br>
All other datasets used in this notebook are formatted and exported in dataWrangle.ipynb; they are included in the plb_midterm repository along with the dataset created here. The code below was originally run in a notebook called dataWrangleSVM.ipynb (also in the repository); it is reproduced here to make our method clear and need not be run.

In [None]:
import pandas as pd

In [None]:
# make dataset with just rupestris and riparia
species = pd.read_csv('species_noOutgroupOrHybrids_prediction_dataset.csv')
species.head()

In [None]:
# drop everything but rupestris and riparia
keep = ['Vitis_rupestris', 'Vitis_riparia']
species = species[species['Class'].isin(keep)]
species.head()

In [None]:
# write out the file 
species.to_csv('species_RupRip_prediction.csv',index=False)
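
One way to confirm that only the two target species remain is to count the rows per class after filtering. The sketch below uses a small made-up table in place of the real CSV; the coordinate values and the names `toy`, `keep` and `class_counts` are illustrative assumptions, and only the `Class` column and the species labels come from the dataset above.

```python
import pandas as pd

# Toy stand-in for the species table; the coordinate values are made up.
toy = pd.DataFrame({
    'Class': ['Vitis_riparia', 'Vitis_rupestris', 'Vitis_cinerea', 'Vitis_riparia'],
    'x1': [0.31, 0.42, 0.37, 0.38],
    'y1': [0.80, 0.55, 0.61, 0.65],
})

# Keep only rows whose class is one of the two target species.
keep = ['Vitis_rupestris', 'Vitis_riparia']
filtered = toy[toy['Class'].isin(keep)]

# Count the rows left in each class after filtering.
class_counts = filtered['Class'].value_counts().to_dict()
print(class_counts)
```

Running the same count on the exported file is a cheap check before handing it to the pipeline.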

**2. Using SVM to predict species based on x-y coordinates** <br>
This notebook uses the Shiu Lab's ML-Pipeline. To run the following cells, go to https://github.com/ShiuLab/ML-Pipeline and follow the instructions in the readme to set up the environment the pipeline needs. Then clone the pipeline to your machine. <br>
We ran the pipeline in two steps. First, we held out 10% of the data as a test set using the script test_set.py. We then ran ML_classification.py to train our model. <br>
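
To make the hold-out step concrete, here is a minimal sketch of drawing roughly 10% of the instances from each class. It illustrates the idea only and is not the code in test_set.py; the function name `holdout_by_class`, the rounding rule, and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def holdout_by_class(labels, fraction=0.1, seed=0):
    """Return sorted indices of a test set drawn separately from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    test_indices = []
    for label in sorted(by_class):
        members = by_class[label]
        # Hold out at least one instance per class (an assumption of this sketch).
        n_test = max(1, round(len(members) * fraction))
        test_indices.extend(rng.sample(members, n_test))
    return sorted(test_indices)

# Toy labels: 30 instances of one class followed by 20 of the other.
labels = ['Vitis_riparia'] * 30 + ['Vitis_rupestris'] * 20
test_idx = holdout_by_class(labels)
print(len(test_idx), 'instances in test set')
```

Sampling within each class keeps the class proportions of the test set close to those of the full dataset.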

*Note:* the file structure for the following is: one folder called GitHub, which contains both ML-Pipeline and plb_midterm. Please be sure to check that the file paths are correct for your machine and for wherever you cloned the repositories. 
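
A quick way to check the layout before running anything is to test for the expected files. This is a sketch under the assumption that both repositories sit in `~/GitHub`; the helper name `missing_paths` is hypothetical, so adjust the root folder to match your machine.

```python
from pathlib import Path

def missing_paths(root):
    """Return the expected scripts and folders that are absent under root."""
    root = Path(root).expanduser()
    expected = [
        root / 'ML-Pipeline' / 'test_set.py',
        root / 'ML-Pipeline' / 'ML_classification.py',
        root / 'plb_midterm',
    ]
    return [str(path) for path in expected if not path.exists()]

missing = missing_paths('~/GitHub')
print('Missing:', missing if missing else 'nothing, layout looks correct')
```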

**2A. Predicting Species** <br>
Species were predicted in the following way: <br>
(i) Just *V. rupestris* and *V. riparia*, to test whether x-y coordinates would be good features. These are the most morphologically distinct species, so we hypothesized that they would separate well. <br>
(ii) Just *Vitis* core species, no hybrids or *Ampelopsis* outgroup <br>
(iii) Just *Vitis* core and *Ampelopsis* outgroup, still no hybrids <br>
(iv) All species
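
Because the four runs differ only in the input table and the name of the test-set file, the commands can be generated from a single table of file names. The sketch below only builds the command strings, using the flags and file names that appear in this notebook's cells; the `RUNS` dictionary, the `build_commands` helper, and the `~/GitHub/ML-Pipeline` location are illustrative assumptions. The run on *V. rupestris* and *V. riparia* additionally passes `-cl_train`, which is left out here.

```python
# Assumed location of the cloned pipeline; change it to match your machine.
PIPELINE = '~/GitHub/ML-Pipeline'

# Input table -> file in which the held-out test instances are saved.
RUNS = {
    'species_RupRip_prediction.csv': 'RupRip_test_instances.csv',
    'species_noOutgroupOrHybrids_prediction_dataset.csv': 'Vitis_core_test_instances.csv',
    'species_noHybrids_prediction_dataset.csv': 'Vitis_core_and_ampelopsis_test_instances.csv',
    'species_prediction_dataset.csv': 'All_species_test_instances.csv',
}

def build_commands(data_file, test_file, pipeline=PIPELINE):
    """Return the test-set command and the training command for one dataset."""
    make_test = (f"python {pipeline}/test_set.py -df {data_file} "
                 f"-type c -p 0.1 -save {test_file} -sep ','")
    train = (f"python {pipeline}/ML_classification.py -df {data_file} "
             f"-test {test_file} -alg SVM -sep ','")
    return make_test, train

for data_file, test_file in RUNS.items():
    for command in build_commands(data_file, test_file):
        print(command)
```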

In [3]:
# (i) rupestris and riparia
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df species_RupRip_prediction.csv -type c -p 0.1 -save RupRip_test_instances.csv -sep ','

Holding out 10.0 percent
Pulling test set from classes: ['Vitis_riparia' 'Vitis_rupestris']
257 instances in test set
finished!


In [4]:
# run model
!python ~/GitHub/ML-Pipeline/ML_classification.py -df species_RupRip_prediction.csv -test RupRip_test_instances.csv -cl_train Vitis_riparia,Vitis_rupestris -alg SVM -sep ','

Removing test instances to apply model on later...
Snapshot of data being used:
                  Class        x1        y1        x2        y2
Instance                                                       
2         Vitis_riparia  0.306808  0.799709  0.397394  0.774340
5         Vitis_riparia  0.366655  0.452072  0.448257  0.443414
6         Vitis_riparia  0.378910  0.649807  0.443035  0.646330
7         Vitis_riparia  0.302437  0.653963  0.377223  0.683326
8         Vitis_riparia  0.374232  0.677398  0.452144  0.662857


CLASSES: ['Vitis_riparia' 'Vitis_rupestris']
POS: Vitis_riparia type:  <class 'str'>
NEG: Vitis_rupestris type:  <class 'str'>

Balanced dataset will have 542 instances of each class


===>  Grid search started  <===
Round 1 of 10
Round 2 of 10
Round 3 of 10
Round 4 of 10
Round 5 of 10
Round 6 of 10
Round 7 of 10
Round 8 of 10
Round 9 of 10
Round 10 of 10
Parameter sweep time: 67.771247 seconds
Parameters selected: Kernel=Linear, C=10.0
Grid search done. Time: 68.50
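
The log line "Balanced dataset will have 542 instances of each class" suggests that the larger class is reduced to the size of the smaller one before training. The following is a minimal sketch of that idea, assuming random downsampling without replacement; the pipeline's actual procedure may differ, and the function name and toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balance_by_downsampling(labels, seed=0):
    """Return sorted indices that keep an equal number of instances per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    # Every class is cut down to the size of the smallest class.
    smallest = min(len(members) for members in by_class.values())
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], smallest))
    return sorted(kept)

# Toy labels: 8 instances of one class and 5 of the other.
labels = ['Vitis_riparia'] * 8 + ['Vitis_rupestris'] * 5
kept = balance_by_downsampling(labels)
balanced_counts = Counter(labels[i] for i in kept)
print(balanced_counts)
```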

In [5]:
# (ii) Vitis core 
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df species_noOutgroupOrHybrids_prediction_dataset.csv -type c -p 0.1 -save Vitis_core_test_instances.csv -sep ','

Holding out 10.0 percent
Pulling test set from classes: ['Vitis_acerifolia' 'Vitis_aestivalis' 'Vitis_riparia' 'Vitis_vulpina'
 'Vitis_coignetiae' 'Vitis_labrusca' 'Vitis_cinerea' 'Vitis_rupestris'
 'Vitis_sp' 'Vitis_amurensis' 'Vitis_palmata']
559 instances in test set
finished!


In [None]:
# run model
!python ~/GitHub/ML-Pipeline/ML_classification.py -df species_noOutgroupOrHybrids_prediction_dataset.csv -test Vitis_core_test_instances.csv -alg SVM -sep ','

Removing test instances to apply model on later...
Snapshot of data being used:
                     Class        x1        y1        x2        y2
Instance                                                          
0         Vitis_acerifolia  0.395183  0.464763  0.462163  0.483590
1         Vitis_aestivalis  0.461953  0.676650  0.535271  0.714615
2            Vitis_riparia  0.404128  0.770275  0.471012  0.774340
3         Vitis_acerifolia  0.405337  0.789025  0.469997  0.798167
4         Vitis_aestivalis  0.319783  0.584354  0.333134  0.580537


CLASSES: ['Vitis_acerifolia' 'Vitis_aestivalis' 'Vitis_amurensis' 'Vitis_cinerea'
 'Vitis_coignetiae' 'Vitis_labrusca' 'Vitis_palmata' 'Vitis_riparia'
 'Vitis_rupestris' 'Vitis_sp' 'Vitis_vulpina']
POS: 1 type:  <class 'int'>
NEG: multiclass_no_NEG type:  <class 'str'>

Balanced dataset will have 53 instances of each class


===>  Grid search started  <===
Round 1 of 10
Round 2 of 10
Round 3 of 10
Round 4 of 10
Round 5 of 10
Round 6 of 10
Round 

In [None]:
# (iii) Vitis core and Ampelopsis
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df species_noHybrids_prediction_dataset.csv -type c -p 0.1 -save Vitis_core_and_ampelopsis_test_instances.csv -sep ','

In [None]:
# run model
!python  ~/GitHub/ML-Pipeline/ML_classification.py -df species_noHybrids_prediction_dataset.csv -test Vitis_core_and_ampelopsis_test_instances.csv -alg SVM -sep ','

In [None]:
# (iv) All species
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df species_prediction_dataset.csv -type c -p 0.1 -save All_species_test_instances.csv -sep ','

In [None]:
# run model
!python ~/GitHub/ML-Pipeline/ML_classification.py -df species_prediction_dataset.csv -test All_species_test_instances.csv -alg SVM -sep ','

**2B. Predicting year**

In [None]:
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df year_prediction_dataset.csv -type c -p 0.1 -save year_test_instances.csv -sep ','

In [None]:
# run model 
!python ~/GitHub/ML-Pipeline/ML_classification.py -df year_prediction_dataset.csv -test year_test_instances.csv -alg SVM -sep ','

**2C. Predicting node counted from base**

In [None]:
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df node_from_base_prediction_dataset.csv -type c -p 0.1 -save node_from_base_test_instances.csv -sep ','

In [None]:
# run model 
!python ~/GitHub/ML-Pipeline/ML_classification.py -df node_from_base_prediction_dataset.csv -test node_from_base_test_instances.csv -alg SVM -sep ','

**2D. Predicting node counted from tip**

In [None]:
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df node_from_tip_prediction_dataset.csv -type c -p 0.1 -save node_from_tip_test_instances.csv -sep ','

In [None]:
# run model 
!python ~/GitHub/ML-Pipeline/ML_classification.py -df node_from_tip_prediction_dataset.csv -test node_from_tip_test_instances.csv -alg SVM -sep ','