**README** <br>
In order to run the following notebook, please clone the repository located at https://github.com/jdevenegas/plb_midterm. Navigate to this folder on your machine in order to run the following. <br>
***Important note:*** This script was written for Mac OSX, and may not be compatible with Windows. Before publication we would work to ensure that it works on both Mac and Windows, but lacked the necessary time for troubleshooting before the midterm was due. 

**1. Formatting csv to test pipeline on easily differentiable species** <br>
All other datasets used in this noebook are formatted and exported in dataWrangle.ipynb and are included, as well as the following, in the plb_midterm repository. Code is provided to facilitate clarity in our method, but need not be run. Original code was run in notebook called dataWrangleSVM.ipynb (also included in repository), but included here for clarity.

In [2]:
import pandas as pd

In [None]:
# make dataset with just rupestris and riparia
species = pd.read_csv('species_noOutgroupOrHybrids_prediction_dataset.csv')
species.head()

In [None]:
# drop everything but rupestris and riparia
for index,row in species.iterrows():
    if row['Class'] != 'Vitis_rupestris' and row['Class'] != 'Vitis_riparia':
        species.drop(labels=index,axis=0,inplace=True)
species.head()

In [None]:
# write out the file 
species.to_csv('species_RupRip_prediction.csv',index=False)

**2. Using SVM to predict species based on x-y coordinates** <br>
Shiu Lab's ML-pipeline is used in this script. In order to be able to run the following, please navigate to https://github.com/ShiuLab/ML-Pipeline and follow the instructions in the readme in order to set up the correct environment to run the pipeline. Then clone the pipeline to your machine. <br>
The pipeline entails two steps as we ran it. First, we created a testing subset of 10% of the data, using the script test_set.py. We then ran ML_correlation.py in order to train our model. <br>

*Note:* the file structure for the following is: one folder called GitHub, which contains both ML-Pipeline and plb_midterm. Please be sure to check that the file paths are correct for your machine and for wherever you cloned the repositories. 

**2A. Predicting Species** <br>
Species were predicted in the following way: <br>
(i) Just *V. rupestris* and *V. riparia* were predicted, in order to test whether x-y coordinates would be good features. *rupestris* and *riparia* are the most morphologically distinct species, and we hypothesized that there would be good separation <br>
(ii) Just *Vitis* core species, no hybrids or *Ampelopsis* outgroup <br>
(iii) Just *Vitis* core and *Ampelopsis* outgroup, still no hybrids <br>
(iv) All species

In [3]:
# (i) rupestris and riparia
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df species_RupRip_prediction.csv -type c -p 0.1 -save RupRip_test_instances.csv -sep ','

Holding out 10.0 percent
Pulling test set from classes: ['Vitis_riparia' 'Vitis_rupestris']
257 instances in test set
finished!


In [4]:
# run model
!python  /Users/SLotreck/GitHub/ML-Pipeline/ML_classification.py -df species_RupRip_prediction.csv -test RupRip_test_instances.csv -cl_train Vitis_riparia,Vitis_rupestris -alg SVM -sep ','

Removing test instances to apply model on later...
Snapshot of data being used:
                  Class        x1        y1        x2        y2
Instance                                                       
2         Vitis_riparia  0.306808  0.799709  0.397394  0.774340
5         Vitis_riparia  0.366655  0.452072  0.448257  0.443414
6         Vitis_riparia  0.378910  0.649807  0.443035  0.646330
7         Vitis_riparia  0.302437  0.653963  0.377223  0.683326
8         Vitis_riparia  0.374232  0.677398  0.452144  0.662857


CLASSES: ['Vitis_riparia' 'Vitis_rupestris']
POS: Vitis_riparia type:  <class 'str'>
NEG: Vitis_rupestris type:  <class 'str'>

Balanced dataset will have 542 instances of each class


===>  Grid search started  <===
Round 1 of 10
Round 2 of 10
Round 3 of 10
Round 4 of 10
Round 5 of 10
Round 6 of 10
Round 7 of 10
Round 8 of 10
Round 9 of 10
Round 10 of 10
Parameter sweep time: 67.771247 seconds
Parameters selected: Kernel=Linear, C=10.0
Grid search done. Time: 68.50

In [5]:
# (ii) Vitis core 
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df species_noOutgroupOrHybrids_prediction_dataset.csv -type c -p 0.1 -save Vitis_core_test_instances.csv -sep ','

Holding out 10.0 percent
Pulling test set from classes: ['Vitis_acerifolia' 'Vitis_aestivalis' 'Vitis_riparia' 'Vitis_vulpina'
 'Vitis_coignetiae' 'Vitis_labrusca' 'Vitis_cinerea' 'Vitis_rupestris'
 'Vitis_sp' 'Vitis_amurensis' 'Vitis_palmata']
559 instances in test set
finished!


In [6]:
# run model
!python  /Users/SLotreck/GitHub/ML-Pipeline/ML_classification.py -df species_noOutgroupOrHybrids_prediction_dataset.csv -test Vitis_core_test_instances.csv -alg SVM -sep ','

Removing test instances to apply model on later...
Snapshot of data being used:
                     Class        x1        y1        x2        y2
Instance                                                          
0         Vitis_acerifolia  0.395183  0.464763  0.462163  0.483590
1         Vitis_aestivalis  0.461953  0.676650  0.535271  0.714615
2            Vitis_riparia  0.404128  0.770275  0.471012  0.774340
3         Vitis_acerifolia  0.405337  0.789025  0.469997  0.798167
4         Vitis_aestivalis  0.319783  0.584354  0.333134  0.580537


CLASSES: ['Vitis_acerifolia' 'Vitis_aestivalis' 'Vitis_amurensis' 'Vitis_cinerea'
 'Vitis_coignetiae' 'Vitis_labrusca' 'Vitis_palmata' 'Vitis_riparia'
 'Vitis_rupestris' 'Vitis_sp' 'Vitis_vulpina']
POS: 1 type:  <class 'int'>
NEG: multiclass_no_NEG type:  <class 'str'>

Balanced dataset will have 53 instances of each class


===>  Grid search started  <===
Round 1 of 10
Round 2 of 10
Round 3 of 10
Round 4 of 10
Round 5 of 10
Round 6 of 10
Round 

In [7]:
# (iii) Vitis core and Ampelopsis
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df species_noHybrids_prediction_dataset.csv -type c -p 0.1 -save Vitis_core_and_ampelopsis_test_instances.csv -sep ','

Holding out 10.0 percent
Pulling test set from classes: ['Vitis_acerifolia' 'Vitis_aestivalis' 'Vitis_riparia' 'Vitis_vulpina'
 'Vitis_coignetiae' 'Vitis_labrusca' 'Vitis_cinerea' 'Vitis_rupestris'
 'Vitis_sp' 'Vitis_amurensis' 'Vitis_palmata'
 'Ampelopsis_brevipedunculata']
561 instances in test set
finished!


In [8]:
# run model
!python  ~/GitHub/ML-Pipeline/ML_classification.py -df species_noHybrids_prediction_dataset.csv -test Vitis_core_and_ampelopsis_test_instances.csv -alg SVM -sep ','

Removing test instances to apply model on later...
Snapshot of data being used:
                     Class        x1        y1        x2        y2
Instance                                                          
0         Vitis_acerifolia  0.395183  0.464763  0.462163  0.483590
1         Vitis_aestivalis  0.461953  0.676650  0.535271  0.714615
2            Vitis_riparia  0.404128  0.770275  0.471012  0.774340
4         Vitis_aestivalis  0.319783  0.584354  0.333134  0.580537
5            Vitis_riparia  0.455573  0.435433  0.515661  0.443414


CLASSES: ['Ampelopsis_brevipedunculata' 'Vitis_acerifolia' 'Vitis_aestivalis'
 'Vitis_amurensis' 'Vitis_cinerea' 'Vitis_coignetiae' 'Vitis_labrusca'
 'Vitis_palmata' 'Vitis_riparia' 'Vitis_rupestris' 'Vitis_sp'
 'Vitis_vulpina']
POS: 1 type:  <class 'int'>
NEG: multiclass_no_NEG type:  <class 'str'>

Balanced dataset will have 18 instances of each class


===>  Grid search started  <===
Round 1 of 10
Round 2 of 10
Round 3 of 10
Round 4 of 10
Rou

In [9]:
# (iv) All species
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df species_prediction_dataset.csv -type c -p 0.1 -save All_species_test_instances.csv -sep ','

Holding out 10.0 percent
Pulling test set from classes: ['Vitis_acerifolia' 'Vitis_aestivalis' 'Vitis_riparia' 'Vitis_vulpina'
 'Vitis_coignetiae' 'Vitis_labrusca' 'Vitis_cinerea' 'Vitis_x_doaniana'
 'Vitis_rupestris' 'Vitis_sp' 'Vitis_amurensis' 'Vitis_palmata'
 'Vitis_x_novae_angliae' 'Vitis_x_champinii' 'Vitis_x_andersonii'
 'Ampelopsis_brevipedunculata']
570 instances in test set
finished!


In [10]:
# run model
!python ~/GitHub/ML-Pipeline/ML_classification.py -df species_prediction_dataset.csv -test All_species_test_instances.csv -alg SVM -sep ','

Removing test instances to apply model on later...
Snapshot of data being used:
                     Class        x1        y1        x2        y2
Instance                                                          
0         Vitis_acerifolia  0.395183  0.464763  0.462163  0.477514
1         Vitis_aestivalis  0.461953  0.676650  0.535271  0.705636
2            Vitis_riparia  0.404128  0.770275  0.471012  0.764611
3         Vitis_acerifolia  0.405337  0.789025  0.469997  0.788139
5            Vitis_riparia  0.455573  0.435433  0.515661  0.437843


CLASSES: ['Ampelopsis_brevipedunculata' 'Vitis_acerifolia' 'Vitis_aestivalis'
 'Vitis_amurensis' 'Vitis_cinerea' 'Vitis_coignetiae' 'Vitis_labrusca'
 'Vitis_palmata' 'Vitis_riparia' 'Vitis_rupestris' 'Vitis_sp'
 'Vitis_vulpina' 'Vitis_x_andersonii' 'Vitis_x_champinii'
 'Vitis_x_doaniana' 'Vitis_x_novae_angliae']
POS: 1 type:  <class 'int'>
NEG: multiclass_no_NEG type:  <class 'str'>

Balanced dataset will have 15 instances of each class


===>  

**2B. Predicting year**

In [74]:
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df year_strings.csv -type C -p 0.1 -save year_strings_test_instances.csv -sep ','

Holding out 10.0 percent
Pulling test set from classes: ['a' 'b' 'c']
570 instances in test set
finished!


In [81]:
num_year_classes = {}
with open('year_strings.csv') as inputf:
    reader = csv.DictReader(inputf)
    
    for row in reader: 
        if row['Class'] not in num_year_classes:
            num_year_classes[row['Class']]=1
        else: num_year_classes[row['Class']] +=1
            
print(num_year_classes)

{'a': 1900, 'b': 1900, 'c': 1900}


In [4]:
year = pd.read_csv('year_prediction_dataset.csv')
year.replace(to_replace=[2013,2015,2016],value=['year2013','year2015','year2016'],inplace=True)
year.head()

Unnamed: 0,Instance,Class,x1,y1,x2,y2,x3,y3,x4,y4,...,x17,y17,x18,y18,x19,y19,x20,y20,x21,y21
0,0,year2013,579.014742,49.669286,564.11701,34.562869,586.781135,-8.11712,606.032488,-52.232172,...,-773.417167,530.360659,-1338.595312,-296.648911,-788.502868,-421.58033,-878.484789,-733.599371,-608.863391,-1612.438872
1,1,year2013,622.384266,158.865807,605.47828,137.443677,614.666475,111.062108,642.965277,89.000383,...,-725.211055,343.277655,-1028.649169,-398.067197,-729.232075,-398.355997,-911.901514,-677.70989,-1157.56763,-1652.086177
2,2,year2013,584.824913,207.115491,569.123583,164.040784,616.67527,113.001928,677.797976,66.534166,...,-589.139358,331.309077,-1035.774029,-448.443256,-715.676977,-300.604021,-865.491969,-611.708338,-1379.906517,-1614.081403
3,3,year2013,585.609852,216.778212,568.549214,174.651328,581.563425,114.867125,639.639037,60.326687,...,-717.318142,274.452196,-1178.749422,-518.663464,-794.905709,-317.480363,-897.419201,-587.381767,-1143.68586,-1598.494304
4,4,year2013,530.039098,111.300539,491.118785,77.735749,486.645484,20.775763,554.143172,-18.328012,...,-665.639109,478.490939,-1140.663125,-310.663543,-593.900401,-346.297023,-873.354059,-561.58817,-930.520913,-1909.580628


In [None]:
year.to_csv('year_strings_prediction_dataset.csv')

In [87]:
# run model 
!python ~/GitHub/ML-Pipeline/ML_classification.py -df year_strings.csv -alg SVM -sep ',' -min_size 1710 

Snapshot of data being used:
         Class        x1        y1        x2        y2
Instance                                              
0            a  0.395183  0.464763  0.462163  0.477514
1            a  0.461953  0.676650  0.535271  0.705636
2            a  0.404128  0.770275  0.471012  0.764611
3            a  0.405337  0.789025  0.469997  0.788139
4            a  0.319783  0.584354  0.333134  0.573243


CLASSES: ['a' 'b' 'c']
POS: 1 type:  <class 'int'>
NEG: multiclass_no_NEG type:  <class 'str'>

Balanced dataset will have 1710 instances of each class


===>  Grid search started  <===
Round 1 of 10
Done with round 1
Round 2 of 10
Done with round 2
Round 3 of 10
Done with round 3
Round 4 of 10
Done with round 4
Round 5 of 10
Done with round 5
Round 6 of 10
Done with round 6
Round 7 of 10
Done with round 7
Round 8 of 10
Done with round 8
Round 9 of 10
Done with round 9
Round 10 of 10
Done with round 10
Parameter sweep time: 972.300372 seconds
Parameters selected: Kernel=Linear,

**2C. Predicting node counted from base**

In [88]:
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df nodes_base_strings.csv -type C -p 0.1 -save node_from_base_strings_test_instances.csv -sep ','

Traceback (most recent call last):
  File "/Users/SLotreck/GitHub/ML-Pipeline/test_set.py", line 43, in <module>
    df = pd.read_csv(a.df, sep=a.sep, index_col = 0)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 1917, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 382, in pand

In [31]:
import csv 
num_classes = {}
with open('nodes_from_base_prediction_dataset.csv') as inputf:
    reader = csv.DictReader(inputf)
    
    for row in reader: 
        if row['Class'] not in num_classes:
            num_classes[row['Class']]=1
        else: num_classes[row['Class']] +=1
            
print(num_classes)

{'10': 246, '11': 129, '12': 51, '13': 15, '14': 3, '1': 639, '2': 597, '3': 606, '4': 621, '5': 615, '6': 603, '7': 606, '8': 552, '9': 417}


In [89]:
# run model 
!python ~/GitHub/ML-Pipeline/ML_classification.py -df nodes_base_strings.csv -test node_from_base_strings_test_instances.csv  -alg SVM -sep ',' -min_size 3 -cv 13

Traceback (most recent call last):
  File "/Users/SLotreck/GitHub/ML-Pipeline/ML_classification.py", line 783, in <module>
    main()
  File "/Users/SLotreck/GitHub/ML-Pipeline/ML_classification.py", line 142, in main
    df = pd.read_csv(args.df, sep=args.sep, index_col=0)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 1917, in __i

In [44]:
import pandas
pandas.DataFrame.sample??

[0;31mSignature:[0m
[0mpandas[0m[0;34m.[0m[0mDataFrame[0m[0;34m.[0m[0msample[0m[0;34m([0m[0;34m[0m
[0;34m[0m    [0mself[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mn[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mfrac[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mreplace[0m[0;34m=[0m[0;32mFalse[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mweights[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0mrandom_state[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m    [0maxis[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mSource:[0m   
    [0;32mdef[0m [0msample[0m[0;34m([0m[0;34m[0m
[0;34m[0m        [0mself[0m[0;34m,[0m[0;34m[0m
[0;34m[0m        [0mn[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[0m        [0mfrac[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m[0;34m[0m
[0;34m[

In [57]:
pandas.DataFrame._stat_axis_number??

[0;31mType:[0m        int
[0;31mString form:[0m 0
[0;31mDocstring:[0m  
int([x]) -> integer
int(x, base=10) -> integer

Convert a number or string to an integer, or return 0 if no arguments
are given.  If x is a number, return x.__int__().  For floating point
numbers, this truncates towards zero.

If x is not a number or if base is given, then x must be a string,
bytes, or bytearray instance representing an integer literal in the
given base.  The literal can be preceded by '+' or '-' and be surrounded
by whitespace.  The base defaults to 10.  Valid bases are 0 and 2-36.
Base 0 means to interpret the base from the string as an integer literal.
>>> int('0b100', base=0)
4


In [58]:
pandas.DataFrame._get_axis_number??

[0;31mSignature:[0m [0mpandas[0m[0;34m.[0m[0mDataFrame[0m[0;34m.[0m[0m_get_axis_number[0m[0;34m([0m[0maxis[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m <no docstring>
[0;31mSource:[0m   
    [0;34m@[0m[0mclassmethod[0m[0;34m[0m
[0;34m[0m    [0;32mdef[0m [0m_get_axis_number[0m[0;34m([0m[0mcls[0m[0;34m,[0m [0maxis[0m[0;34m)[0m[0;34m:[0m[0;34m[0m
[0;34m[0m        [0maxis[0m [0;34m=[0m [0mcls[0m[0;34m.[0m[0m_AXIS_ALIASES[0m[0;34m.[0m[0mget[0m[0;34m([0m[0maxis[0m[0;34m,[0m [0maxis[0m[0;34m)[0m[0;34m[0m
[0;34m[0m        [0;32mif[0m [0mis_integer[0m[0;34m([0m[0maxis[0m[0;34m)[0m[0;34m:[0m[0;34m[0m
[0;34m[0m            [0;32mif[0m [0maxis[0m [0;32min[0m [0mcls[0m[0;34m.[0m[0m_AXIS_NAMES[0m[0;34m:[0m[0;34m[0m
[0;34m[0m                [0;32mreturn[0m [0maxis[0m[0;34m[0m
[0;34m[0m        [0;32melse[0m[0;34m:[0m[0;34m[0m
[0;34m[0m            [0;32mtry[0m[

**2D. Predicting node counted from tip**

In [41]:
# create test set 
!python ~/GitHub/ML-Pipeline/test_set.py -df nodes_from_tip_prediction_dataset.csv -type C -p 0.1 -save node_from_tip_test_instances.csv -sep ','

Traceback (most recent call last):
  File "/Users/SLotreck/GitHub/ML-Pipeline/test_set.py", line 43, in <module>
    df = pd.read_csv(a.df, sep=a.sep, index_col = 0)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda3/envs/ML_pipeline/lib/python3.7/site-packages/pandas/io/parsers.py", line 1917, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 382, in pand

In [16]:
# run model 
!python /Users/SLotreck/GitHub/ML-Pipeline/test_set.py -df nodes_from_tip_prediction_dataset.csv -test node_from_tip_test_instances.csv -alg SVM -sep ','

python: can't open file '/Users/SLotreck/GitHub/Hub/ML-Pipeline/test_set.py': [Errno 2] No such file or directory
