### Dataset preparation

The analysis in the [Data-analisis](data-analisis.ipynb) notebook indicates that the relevant features for modeling are 'Species', 'Light_ISF', 'Soil', 'Sterile', 'Conspecific', 'AMF', 'EMF', 'Phenolics', 'Lignin', 'NSC' and 'Event', the last one being the target.

In [1]:
import pandas as pd

dataset = pd.read_csv("../dataset/Tree_Data.csv")

In [2]:
dataset_modif = dataset[['Species', 'Light_ISF', 'Soil', 'Sterile', 
                         'Conspecific', 'AMF', 'EMF', 'Phenolics', 'Lignin', 'NSC', 'Event']]

dataset_modif.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2783 entries, 0 to 2782
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Species      2783 non-null   object 
 1   Light_ISF    2783 non-null   float64
 2   Soil         2783 non-null   object 
 3   Sterile      2783 non-null   object 
 4   Conspecific  2783 non-null   object 
 5   AMF          2783 non-null   float64
 6   EMF          1283 non-null   float64
 7   Phenolics    2783 non-null   float64
 8   Lignin       2783 non-null   float64
 9   NSC          2783 non-null   float64
 10  Event        2782 non-null   float64
dtypes: float64(7), object(4)
memory usage: 239.3+ KB


Only one Event value is null, so that row is dropped. The 1,500 missing EMF values are set to 0.

In [3]:
dataset_modif = dataset_modif.dropna(subset=["Event"])
dataset_modif.fillna(value={"EMF": 0}, inplace=True)

dataset_modif

Unnamed: 0,Species,Light_ISF,Soil,Sterile,Conspecific,AMF,EMF,Phenolics,Lignin,NSC,Event
0,Acer saccharum,0.106,Prunus serotina,Non-Sterile,Heterospecific,22.00,0.00,-0.56,13.86,12.15,1.0
1,Quercus alba,0.106,Quercus rubra,Non-Sterile,Heterospecific,15.82,31.07,5.19,20.52,19.29,0.0
2,Quercus rubra,0.106,Prunus serotina,Non-Sterile,Heterospecific,24.45,28.19,3.36,24.74,15.01,1.0
3,Acer saccharum,0.080,Prunus serotina,Non-Sterile,Heterospecific,22.23,0.00,-0.71,14.29,12.36,1.0
4,Acer saccharum,0.060,Prunus serotina,Non-Sterile,Heterospecific,21.15,0.00,-0.58,10.85,11.20,1.0
...,...,...,...,...,...,...,...,...,...,...,...
2777,Quercus alba,0.122,Quercus rubra,Non-Sterile,Heterospecific,10.89,39.00,5.53,21.44,18.99,1.0
2778,Prunus serotina,0.111,Populus grandidentata,Non-Sterile,Heterospecific,40.89,0.00,0.83,9.15,11.88,1.0
2779,Quercus alba,0.118,Acer rubrum,Non-Sterile,Heterospecific,15.47,32.82,4.88,19.01,23.50,1.0
2780,Quercus alba,0.118,Quercus rubra,Non-Sterile,Heterospecific,11.96,37.67,5.51,21.13,19.10,1.0
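
The same two cleaning steps can be tried on a small frame before touching the real file. This is a minimal sketch, assuming a hypothetical `toy` DataFrame that mimics the gaps reported by `info()` (one missing target, some missing EMF values); it is not the author's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing Event and two missing EMF values.
toy = pd.DataFrame({
    "Species": ["Acer saccharum", "Quercus alba", "Quercus rubra", "Prunus serotina"],
    "EMF": [np.nan, 31.07, 28.19, np.nan],
    "Event": [1.0, 0.0, np.nan, 1.0],
})

# Drop rows whose target is missing, then fill the remaining EMF gaps with 0.
# Chaining returns a new frame, so no inplace modification is needed.
cleaned = (
    toy.dropna(subset=["Event"])
       .fillna({"EMF": 0})
       .reset_index(drop=True)
)

print(cleaned)
print(cleaned.isna().sum())
```

Filling EMF with 0 treats a missing measurement as "no colonization", which is a modeling assumption rather than a neutral default.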


In [5]:
# Save it to a CSV file

dataset_modif.to_csv("../dataset/data_Modif.csv", index=False)

In [8]:
# check that it was saved correctly
prueba_dataset_modif = pd.read_csv("../dataset/data_Modif.csv")

prueba_dataset_modif.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2782 entries, 0 to 2781
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Species      2782 non-null   object 
 1   Light_ISF    2782 non-null   float64
 2   Soil         2782 non-null   object 
 3   Sterile      2782 non-null   object 
 4   Conspecific  2782 non-null   object 
 5   AMF          2782 non-null   float64
 6   EMF          2782 non-null   float64
 7   Phenolics    2782 non-null   float64
 8   Lignin       2782 non-null   float64
 9   NSC          2782 non-null   float64
 10  Event        2782 non-null   float64
dtypes: float64(7), object(4)
memory usage: 239.2+ KB


### Modeling

In [15]:
import pandas as pd

dataset = pd.read_csv("../dataset/data_Modif.csv")

# drop any observation that still has a null value
dataset = dataset.dropna()

X = dataset.drop(['Event'], axis = 1)
y = dataset['Event']

print("X: " + str(X.shape))
print("y: " + str(y.shape))

X: (2782, 10)
y: (2782,)


In [31]:
# use a ColumnTransformer to handle the numeric and categorical columns,
# then a Pipeline to attach the model to the preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# define the numeric and categorical columns
num_cols = ['Light_ISF', 'AMF', 'EMF', 'Phenolics', 'Lignin', 'NSC']
cat_cols = ['Species', 'Soil', 'Sterile', 'Conspecific']
 
# build the ColumnTransformer
col_trans = ColumnTransformer([
    ('scalador_col_num', StandardScaler(), num_cols),
    ('one-hot_cat_num', OneHotEncoder(), cat_cols)
    ],
    remainder='drop') # drop any remaining columns

# build the pipeline
estimador = Pipeline([
    ('manejo de columnas', col_trans),
    ('core_model', LogisticRegression(random_state= 42))
])

display(estimador)

In [33]:
# train and evaluate with cross-validation
from sklearn.model_selection import cross_validate
import numpy as np

results = cross_validate(estimador, X, y, cv=10, return_train_score=True)

train_score = np.mean(results['train_score'])
test_score = np.mean(results['test_score'])

print(f'Train Score: {train_score}')
print(f'Test Score: {test_score}')

Train Score: 0.8026199550954191
Test Score: 0.8026546504731698


The score is good (a mean accuracy of about 0.80) on both the training and the test folds, and the near-identical values suggest there is no overfitting.
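
Because the score reported by `LogisticRegression` is accuracy, how good 0.80 is depends on the class balance of `Event`, which this section does not show. A common sanity check is to cross-validate a majority-class baseline next to the model. The sketch below runs on synthetic data with a hypothetical 80/20 label split and features that carry no signal; `X_demo`, `y_demo`, and the split are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(42)

# Synthetic features that are unrelated to the labels.
X_demo = rng.normal(size=(500, 3))
# Hypothetical labels: roughly 80% of the rows belong to class 1.0.
y_demo = (rng.random(500) < 0.8).astype(float)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
model = LogisticRegression(random_state=42)

baseline_cv = cross_validate(baseline, X_demo, y_demo, cv=10)
model_cv = cross_validate(model, X_demo, y_demo, cv=10)

baseline_acc = np.mean(baseline_cv["test_score"])
model_acc = np.mean(model_cv["test_score"])

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Model accuracy:    {model_acc:.4f}")
```

If the same comparison on the real `X` and `y` puts the baseline close to 0.80, the score mostly reflects the class balance rather than what the model has learned.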

### Model without AMF, EMF, Phenolics, Lignin and NSC

As noted in the data-analisis notebook, these values cannot be obtained a priori, so we build a model that leaves them out and analyze its performance.

In [36]:
num_cols_2 = ['Light_ISF']
cat_cols_2 = ['Species', 'Soil', 'Sterile', 'Conspecific']
 
col_trans_2 = ColumnTransformer([
    ('scalador_col_num', StandardScaler(), num_cols_2),
    ('one-hot_cat_num', OneHotEncoder(), cat_cols_2)
    ],
    remainder='drop') # drop any remaining columns

estimador_2 = Pipeline([
    ('manejo de columnas', col_trans_2),
    ('core_model', LogisticRegression(random_state= 42))
])

display(estimador_2)

In [40]:
X_2 = dataset.drop(['Event', 'AMF', 'EMF', 'Phenolics', 'Lignin', 'NSC'], axis = 1)
y_2 = dataset['Event']

results_2 = cross_validate(estimador_2, X_2, y_2, cv=10, return_train_score=True)

train_score_2 = np.mean(results_2['train_score'])
test_score_2 = np.mean(results_2['test_score'])

print('Model with all features:')
print(f'    Train Score: {train_score}')
print(f'    Test Score: {test_score}')
print('------------------------------------')
print('Model with fewer features:')
print(f'    Train Score: {train_score_2}')
print(f'    Test Score: {test_score_2}')

Model with all features:
    Train Score: 0.8026199550954191
    Test Score: 0.8026546504731698
------------------------------------
Model with fewer features:
    Train Score: 0.8026598911976555
    Test Score: 0.8026546504731698


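The difference is minimal (the mean test score is identical and the mean train score differs only in the fifth decimal place), so the AMF, EMF, Phenolics, Lignin and NSC features could be left out.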
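
Comparing only the mean scores hides how the two models behave on individual folds. A slightly more detailed check is to evaluate both on the same splits and look at the per-fold differences. The sketch below does this on synthetic data from `make_classification`; the data, the `cv` splitter, and the choice of the first two columns as the "reduced" feature set are all assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic data: with shuffle=False the first 2 columns are the informative
# ones and the remaining 4 are pure noise.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)

# A fixed, seeded splitter guarantees both models see identical folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

full = cross_validate(LogisticRegression(random_state=42), X_demo, y_demo, cv=cv)
reduced = cross_validate(LogisticRegression(random_state=42), X_demo[:, :2], y_demo, cv=cv)

# Per-fold difference in test accuracy (full model minus reduced model).
diff = full["test_score"] - reduced["test_score"]

print("Per-fold difference:", np.round(diff, 4))
print(f"Mean difference: {diff.mean():.4f} (std {diff.std():.4f})")
```

A mean difference that is small relative to its fold-to-fold spread supports dropping the extra features; a consistent gap in one direction would argue against it.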
