# **Demo**

How many species are there on Earth? Robert May put the question this way in an article published in Science: if some extraterrestrial version of the Starship Enterprise visited Earth, what might the visitors' first question be? I think it would be: "How many distinct life forms, species, does your planet have?" Embarrassingly, our best-guess answer would be in the range of 5 to 10 million eukaryotes (never mind the viruses and bacteria), but we could defend numbers exceeding 100 million, or as low as 3 million. Simply answering this question remains a challenge in biology, but fortunately the generation of DNA sequences is helping with the task.

In the same vein, and as one would expect, artificial intelligence algorithms can help automate this difficult task. The challenge we pose here is to teach them how. The species classification problem consists of assigning an unknown specimen to a known species by analyzing its DNA barcode. On that basis, we formulate the following problem:

Given a set of DNA sequences, which supervised machine learning method is most effective at classifying species from DNA barcode sequences?

# **Committee**

**Pastor Enmanuel Pérez Estigarribia** (Facultad Politécnica de la Universidad
Nacional de Asunción - FPUNA, Paraguay) holds a degree in Biology from the
Universidad Nacional de Asunción (UNA, Paraguay), a Master's degree with a
specialization in Zoology from the Universidad de Concepción (Chile), and a PhD
in Computer Science from FPUNA. He is a teaching researcher at FPUNA, and his
research interests are Bioinformatics, Mathematical Biology, Biostatistics, Zoology, and Mathematical Epidemiology.

**Diego Pedro Pinto Roa** (FPUNA) holds a degree in Electronic Engineering from the
Universidad Católica de Asunción (Paraguay) and a PhD in Computer Science from FPUNA.
He is a teaching researcher and director of research at FPUNA; his areas of
interest are Operations Research, Combinatorial and Multi-objective Optimization,
Metaheuristic Algorithms, Machine Learning, and Telecommunication Networks.

**José Domingo Colbes Sanabria** (FPUNA) holds a degree in Electronic Engineering
from UNA, a Master of Science in Computer Science from the Centro de Investigación
Científica y de Educación Superior de Ensenada (CICESE, Mexico), and a PhD in
Computer Science from CICESE. He is a teaching researcher at FPUNA, and his
research interests are combinatorial optimization, protein structure
prediction, protein design, and machine learning.

**Instructions for following this tutorial:**

Download the folder from the following link and place it in the Google Drive account linked to Colab

https://drive.google.com/drive/folders/1-M1T7ahcmJ56Pd0yiHEm7joJiW7ktArv?usp=sharing

or store the training data (Drosophila_train) and test data (Drosophila_test) in **your own** Google Sheets.

# **Note**

For a better understanding of the theory behind the problem we are trying to automate, you can consult:

https://doi.org/10.1186/1756-0381-7-4

# **Challenge details**

https://drive.google.com/file/d/1YsnvINJfRxKOxFYoGnrZig9nCOKJCROi/view?usp=sharing

# **Other similar datasets to experiment with**

http://dmb.iasi.cnr.it/supbarcodes.php

# 1. Connect Colab to Google Sheets

The first step is to connect Colaboratory to Google Sheets or, more precisely, to your Google Drive. You can use the following code to do so
(see https://medium.com/mlearning-ai/how-to-access-google-sheets-on-google-colaboratory-8766b3a0996f):

In [1]:
from google.colab import auth
import gspread
from google.auth import default
# Authenticate with Google
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# 2. Abrir el archivo que contiene los datos

Let's first open the test set (Drosophila_test).

In [2]:
import pandas as pd
#defining my worksheet
Drosophila_test = gc.open('Drosophila_test').sheet1
#get_all_values gives a list of rows
rows_test = Drosophila_test.get_all_values()
#Convert to a DataFrame
df_test = pd.DataFrame(rows_test)
df_test.columns = df_test.iloc[0]
df_test = df_test.iloc[1:]
print(df_test)



0                    ID S1 S2 S3 S4 S5 S6 S7 S8 S9  ... S654 S655 S656 S657  \
1    DQ471543_110189412  A  A  A  T  T  G  G  A  A  ...    C    A    A    C   
2    DQ471554_110189434  A  T  T  G  G  A  A  C  T  ...    A    C    A    T   
3     DQ383671_87475141  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
4     DQ383677_87475153  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
5     DQ383678_87475155  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
..                  ... .. .. .. .. .. .. .. .. ..  ...  ...  ...  ...  ...   
117            Auxiliar  A  A  A  A  A  A  A  A  A  ...    A    A    A    A   
118            Auxiliar  C  C  C  C  C  C  C  C  C  ...    C    C    C    C   
119            Auxiliar  G  G  G  G  G  G  G  G  G  ...    G    G    G    G   
120            Auxiliar  T  T  T  T  T  T  T  T  T  ...    T    T    T    T   
121            Auxiliar  N  N  N  N  N  N  N  N  N  ...    N    N    N    N   

0   S658 S659 S660 S661 S662 S663  
1      A    T  

Next, let's open the training set (Drosophila_train).

In [3]:
import pandas as pd
#defining my worksheet
Drosophila_train = gc.open('Drosophila_train').sheet1
#get_all_values gives a list of rows
rows_train = Drosophila_train.get_all_values()
#Convert to a DataFrame
df_train = pd.DataFrame(rows_train)
df_train.columns = df_train.iloc[0]
df_train = df_train.iloc[1:]
print(df_train)

0                    ID            target S1 S2 S3 S4 S5 S6 S7 S8  ... S654  \
1    DQ471538_110189402  Drosophila_angor  A  T  T  G  G  A  A  C  ...    A   
2    DQ471539_110189404  Drosophila_angor  A  T  C  G  G  A  A  C  ...    A   
3    DQ471540_110189406  Drosophila_angor  A  T  T  G  G  A  A  C  ...    A   
4    DQ471541_110189408  Drosophila_angor  A  T  C  G  G  T  A  C  ...    A   
5    DQ471542_110189410  Drosophila_angor  A  T  T  G  G  A  A  C  ...    A   
..                  ...               ... .. .. .. .. .. .. .. ..  ...  ...   
500            Auxiliar          Auxiliar  A  A  A  A  A  A  A  A  ...    A   
501            Auxiliar          Auxiliar  C  C  C  C  C  C  C  C  ...    C   
502            Auxiliar          Auxiliar  G  G  G  G  G  G  G  G  ...    G   
503            Auxiliar          Auxiliar  T  T  T  T  T  T  T  T  ...    T   
504            Auxiliar          Auxiliar  N  N  N  N  N  N  N  N  ...    N   

0   S655 S656 S657 S658 S659 S660 S661 S662 S663  
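
The two loading cells above repeat the same three steps: fetch the rows, use the first row as the header, and drop that row from the data. A minimal sketch of a helper that does this is shown below. The name `rows_to_dataframe` is hypothetical (it belongs to neither gspread nor pandas), and the sample rows are made up so that the example runs without a Google account.

```python
import pandas as pd

def rows_to_dataframe(rows):
    """Build a DataFrame from a list of rows whose first row is the header."""
    header, *records = rows
    df = pd.DataFrame(records, columns=header)
    # Start the index at 1 so it matches the row numbering printed in this notebook
    df.index = range(1, len(df) + 1)
    return df

# Made-up rows in the shape returned by a worksheet's get_all_values()
sample_rows = [
    ["ID", "S1", "S2"],
    ["seq_a", "A", "T"],
    ["seq_b", "N", "G"],
]
df_sample = rows_to_dataframe(sample_rows)
print(df_sample)
```

With the authenticated client from section 1, the same helper could be called as `rows_to_dataframe(gc.open('Drosophila_train').sheet1.get_all_values())`.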


# 3. Implement a supervised classification model

Here you can find a simple example of an ML project: https://towardsdatascience.com/build-your-first-machine-learning-model-with-zero-configuration-exploring-google-colab-5cc7263cfe28

## 3.1 Importing dependencies

We will mainly use the following libraries:

 - _scikit-learn_ : an ML library that provides a variety of data processing functions and ML algorithms (e.g., regression, classification, and clustering). This library is also known as sklearn, which is the name we will use when referring to it.
 - _pandas_ : a data science library that specializes mainly in preprocessing spreadsheet-like data before building ML models.

In [4]:
from sklearn import datasets, model_selection, metrics, ensemble

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.metrics import cohen_kappa_score

## 3.2 Exploring the working dataset

In [5]:
Y_train = df_train["target"]
Y_train

1      Drosophila_angor
2      Drosophila_angor
3      Drosophila_angor
4      Drosophila_angor
5      Drosophila_angor
             ...       
500            Auxiliar
501            Auxiliar
502            Auxiliar
503            Auxiliar
504            Auxiliar
Name: target, Length: 504, dtype: object

In [6]:
X_train = df_train.loc[:, 'S1':'S663']
X_train.head

<bound method NDFrame.head of 0   S1 S2 S3 S4 S5 S6 S7 S8 S9 S10  ... S654 S655 S656 S657 S658 S659 S660  \
1    A  T  T  G  G  A  A  C  T   T  ...    A    C    A    T    T    T    A   
2    A  T  C  G  G  A  A  C  T   T  ...    A    C    A    T    T    T    A   
3    A  T  T  G  G  A  A  C  T   T  ...    A    C    A    T    T    T    A   
4    A  T  C  G  G  T  A  C  A   C  ...    A    C    A    T    T    T    A   
5    A  T  T  G  G  A  A  C  T   T  ...    A    C    A    T    T    T    A   
..  .. .. .. .. .. .. .. .. ..  ..  ...  ...  ...  ...  ...  ...  ...  ...   
500  A  A  A  A  A  A  A  A  A   A  ...    A    A    A    A    A    A    A   
501  C  C  C  C  C  C  C  C  C   C  ...    C    C    C    C    C    C    C   
502  G  G  G  G  G  G  G  G  G   G  ...    G    G    G    G    G    G    G   
503  T  T  T  T  T  T  T  T  T   T  ...    T    T    T    T    T    T    T   
504  N  N  N  N  N  N  N  N  N   N  ...    N    N    N    N    N    N    N   

0   S661 S662 S663  
1      G    

## 3.3 Data preprocessing (one-hot encoding)

In [7]:
# list(data) or
df_train.columns

Index(['ID', 'target', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8',
       ...
       'S654', 'S655', 'S656', 'S657', 'S658', 'S659', 'S660', 'S661', 'S662',
       'S663'],
      dtype='object', name=0, length=665)

In [8]:
df = df_train.loc[:, 'target':'S663']
df.columns

Index(['target', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9',
       ...
       'S654', 'S655', 'S656', 'S657', 'S658', 'S659', 'S660', 'S661', 'S662',
       'S663'],
      dtype='object', name=0, length=664)

In [9]:
df_X = df_train.loc[:, 'S1':'S663']
df_X.columns

Index(['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10',
       ...
       'S654', 'S655', 'S656', 'S657', 'S658', 'S659', 'S660', 'S661', 'S662',
       'S663'],
      dtype='object', name=0, length=663)

In [10]:
one_hot_encoded_data = pd.get_dummies(df, columns = df_X.columns)
print(one_hot_encoded_data)

               target  S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  \
1    Drosophila_angor     1     0     0     0     0     0     0     0     0   
2    Drosophila_angor     1     0     0     0     0     0     0     0     0   
3    Drosophila_angor     1     0     0     0     0     0     0     0     0   
4    Drosophila_angor     1     0     0     0     0     0     0     0     0   
5    Drosophila_angor     1     0     0     0     0     0     0     0     0   
..                ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   
500          Auxiliar     1     0     0     0     0     1     0     0     0   
501          Auxiliar     0     1     0     0     0     0     1     0     0   
502          Auxiliar     0     0     1     0     0     0     0     1     0   
503          Auxiliar     0     0     0     0     1     0     0     0     0   
504          Auxiliar     0     0     0     1     0     0     0     0     1   

     ...  S662_A  S662_C  S662_G  S662_N  S662_T  S

Before continuing, we remove the five extra "Auxiliar" rows that were added so that the one-hot recoding covers the full domain of possible values (A, C, G, T, N) at every position.

In [11]:
# Drop the five trailing "Auxiliar" rows in a single slice
one_hot_encoded_data = one_hot_encoded_data.iloc[:-5, :]
print(one_hot_encoded_data)

                 target  S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  \
1      Drosophila_angor     1     0     0     0     0     0     0     0     0   
2      Drosophila_angor     1     0     0     0     0     0     0     0     0   
3      Drosophila_angor     1     0     0     0     0     0     0     0     0   
4      Drosophila_angor     1     0     0     0     0     0     0     0     0   
5      Drosophila_angor     1     0     0     0     0     0     0     0     0   
..                  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   
495  Drosophila_virilis     0     0     0     1     0     0     0     0     1   
496  Drosophila_virilis     0     0     0     1     0     0     0     0     1   
497  Drosophila_virilis     0     0     0     1     0     0     0     0     1   
498  Drosophila_virilis     1     0     0     0     0     0     0     0     0   
499  Drosophila_virilis     1     0     0     0     0     0     0     0     0   

     ...  S662_A  S662_C  S
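
The auxiliary rows exist only so that `pd.get_dummies` sees every symbol at every position. One possible alternative, sketched below, is to declare the alphabet explicitly as a categorical dtype, in which case no extra rows need to be added or removed. The function name `one_hot_fixed_alphabet`, the constant `ALPHABET`, and the toy frame are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Symbols that appear in the tables above
ALPHABET = ["A", "C", "G", "N", "T"]

def one_hot_fixed_alphabet(df_sites, alphabet=ALPHABET):
    """One-hot encode every column against the same fixed set of symbols."""
    # Every column gets the same categories, even if some never occur in it
    as_category = df_sites.astype(pd.CategoricalDtype(categories=alphabet))
    return pd.get_dummies(as_category, dtype=int)

# Toy data: two sequences, two positions
toy = pd.DataFrame({"S1": ["A", "T"], "S2": ["G", "G"]})
encoded = one_hot_fixed_alphabet(toy)
print(encoded.columns.tolist())
```

Each position yields five columns regardless of which symbols happen to occur, so training and test sets encoded this way would share the same column layout.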

Let's recode the classes (species): we replace the name strings with numeric labels.

In [12]:
le = preprocessing.LabelEncoder()
le.fit(one_hot_encoded_data.target)
list(le.classes_)
one_hot_encoded_data['target'] = le.transform(one_hot_encoded_data.target)
print(one_hot_encoded_data)

     target  S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  ...  \
1         0     1     0     0     0     0     0     0     0     0  ...   
2         0     1     0     0     0     0     0     0     0     0  ...   
3         0     1     0     0     0     0     0     0     0     0  ...   
4         0     1     0     0     0     0     0     0     0     0  ...   
5         0     1     0     0     0     0     0     0     0     0  ...   
..      ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   
495      18     0     0     0     1     0     0     0     0     1  ...   
496      18     0     0     0     1     0     0     0     0     1  ...   
497      18     0     0     0     1     0     0     0     0     1  ...   
498      18     1     0     0     0     0     0     0     0     0  ...   
499      18     1     0     0     0     0     0     0     0     0  ...   

     S662_A  S662_C  S662_G  S662_N  S662_T  S663_A  S663_C  S663_G  S663_N  \
1         0       0       0     

To reverse the transformation we can use:

In [13]:
le.inverse_transform(one_hot_encoded_data['target'])

array(['Drosophila_angor', 'Drosophila_angor', 'Drosophila_angor',
       'Drosophila_angor', 'Drosophila_angor', 'Drosophila_angor',
       'Drosophila_angor', 'Drosophila_angor', 'Drosophila_angor',
       'Drosophila_angor', 'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_barutani', 'Drosophila_barutani',
       'Drosophila_barutani', 'Drosophila_barutani',
       'Drosophila_barutani', 'Drosophila_beppui', 'Drosophila_beppui',
       'Drosophila_beppui', 'Drosophila_daruma', 'Drosophila_daruma',
       'Drosophila_daruma', 'Drosophila_daruma', 'Drosophila_falleni',
       'Drosophila_falleni', 'Drosophila_falleni', 'Drosophila_falleni',
       'Drosophi

In [14]:
target = one_hot_encoded_data['target']
one_hot_encoded_data_x = one_hot_encoded_data.loc[:, 'S1_A':'S663_T']
x_train, x_test, y_train, y_test = model_selection.train_test_split(one_hot_encoded_data_x, target, test_size = 0.3, random_state =0)
print("Training Dataset:", x_train.shape)
print("Test Dataset:", x_test.shape)



Training Dataset: (349, 3315)
Test Dataset: (150, 3315)
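
A purely random split does not guarantee that every species ends up in both parts. As a hedged sketch, `train_test_split` accepts a `stratify` argument that keeps class proportions similar in both subsets; it requires at least two samples per class, and whether the Drosophila data satisfies that has not been checked here. The arrays below are toy data, not the challenge dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 3 classes with 10, 6 and 4 members
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 6 + [2] * 4)

x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=0
)
# Per-class counts in each half
print(np.bincount(y_tr), np.bincount(y_te))
```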


## 3.4 Model training

In [15]:
classifier = ensemble.RandomForestClassifier()
model = classifier.fit(x_train,y_train)
print(model)

RandomForestClassifier()


In [16]:
print(x_test)

     S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  S2_T  ...  S662_A  \
91      0     0     0     1     0     0     0     0     1     0  ...       0   
255     1     0     0     0     0     0     0     0     0     1  ...       0   
284     0     0     0     1     0     0     0     0     1     0  ...       0   
445     0     0     0     1     0     0     0     0     1     0  ...       0   
475     0     0     0     1     0     0     0     0     1     0  ...       0   
..    ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...     ...   
5       1     0     0     0     0     0     0     0     0     1  ...       0   
319     0     0     0     1     0     0     0     0     1     0  ...       0   
331     0     0     0     1     0     0     0     0     1     0  ...       0   
246     0     0     0     1     0     0     0     0     1     0  ...       0   
6       1     0     0     0     0     0     0     0     0     1  ...       0   

     S662_C  S662_G  S662_N  S662_T  S6

In [17]:
model.score(x_test,y_test)

0.9733333333333334
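
A single hold-out score depends on the particular split and on the random seed of the forest, neither of which is the only reasonable choice. As a sketch under those caveats, k-fold cross-validation gives a mean and a spread instead of one number. The data below are made-up binary features standing in for the one-hot matrix; the fold count and `n_estimators` are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Made-up binary features: 3 classes, 20 samples each, 30 columns
rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], 20)
X_toy = rng.integers(0, 2, size=(60, 30))
# Make the first three columns informative: column k is 1 only for class k
for k in range(3):
    X_toy[:, k] = (y_toy == k).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, X_toy, y_toy, cv=cv)  # one accuracy per fold
print(scores.mean(), scores.std())
```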

In [18]:
y_pred = model.predict(x_test)
report = metrics.classification_report(y_test,y_pred)
print(report)
print(y_pred)

              precision    recall  f1-score   support

           0       1.00      0.83      0.91         6
           1       1.00      1.00      1.00         4
           2       1.00      1.00      1.00         1
           4       0.00      0.00      0.00         0
           5       1.00      1.00      1.00         3
           6       1.00      1.00      1.00         9
           7       1.00      0.50      0.67         2
           8       1.00      1.00      1.00         2
           9       1.00      1.00      1.00         4
          10       1.00      1.00      1.00         8
          11       1.00      1.00      1.00        10
          12       1.00      1.00      1.00         1
          13       1.00      1.00      1.00         2
          14       1.00      1.00      1.00        22
          15       0.94      1.00      0.97        32
          16       0.88      1.00      0.93         7
          17       1.00      0.94      0.97        36
          18       1.00    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [19]:
cohen_kappa_score(y_test,y_pred)

0.9689424918474041
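
Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it by hand from a confusion matrix and checks the result against `cohen_kappa_score`. The helper name `kappa_by_hand` and the toy labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def kappa_by_hand(y_true, y_pred):
    """Cohen's kappa computed directly from the confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)
    n = cm.sum()
    p_o = np.trace(cm) / n                                  # observed agreement
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Toy labels: 5 of 6 predictions agree with the truth
y_true_toy = ["a", "a", "b", "b", "c", "c"]
y_pred_toy = ["a", "a", "b", "c", "c", "c"]

manual = kappa_by_hand(y_true_toy, y_pred_toy)
reference = cohen_kappa_score(y_true_toy, y_pred_toy)
print(manual, reference)
```

Here p_o = 5/6 and p_e = 1/3, so kappa = 0.75; plain accuracy on the same labels would be about 0.83.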

In [20]:
le.inverse_transform(y_pred)

array(['Drosophila_mettleri', 'Drosophila_recens', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_subquinaria',
       'Drosophila_arizonae', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_montana',
       'Drosophila_montana', 'Drosophila_pachea', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_recens',
       'Drosophila_montana', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_recens',
       'Drosophila_montana', 'Drosophila_falleni', 'Drosophila_pachea',
       'Drosophila_simulans', 'Drosophila_pachea',
       'Drosophila_mojavensis', 'Drosophila_falleni',
       'Drosophila_subquinaria', 'Drosophila_arizonae',
       'Drosophila_simulans', 'Drosophila_mettleri', 'Drosophila_pachea',
       'Drosophila_subquinaria', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_pachea',
       'Drosophila_subquinaria', 'Drosophila_subquinaria',
       'Drosophila_recens', 'Drosophila_subquinaria', 'Dr

In [21]:
le.inverse_transform(y_test)

array(['Drosophila_mettleri', 'Drosophila_recens', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_subquinaria',
       'Drosophila_arizonae', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_montana',
       'Drosophila_montana', 'Drosophila_pachea', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_recens',
       'Drosophila_montana', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_recens',
       'Drosophila_montana', 'Drosophila_falleni', 'Drosophila_pachea',
       'Drosophila_simulans', 'Drosophila_pachea',
       'Drosophila_mojavensis', 'Drosophila_falleni',
       'Drosophila_subquinaria', 'Drosophila_arizonae',
       'Drosophila_simulans', 'Drosophila_mettleri', 'Drosophila_pachea',
       'Drosophila_subquinaria', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_pachea',
       'Drosophila_subquinaria', 'Drosophila_subquinaria',
       'Drosophila_recens', 'Drosophila_subquinaria', 'Dr

In [22]:
pred_df = pd.DataFrame()
pred_df['Esp'] = le.inverse_transform(y_pred)
pred_df['Obs'] = le.inverse_transform(y_test)
print(pred_df)

                        Esp                     Obs
0       Drosophila_mettleri     Drosophila_mettleri
1         Drosophila_recens       Drosophila_recens
2         Drosophila_recens       Drosophila_recens
3    Drosophila_subquinaria  Drosophila_subquinaria
4    Drosophila_subquinaria  Drosophila_subquinaria
..                      ...                     ...
145        Drosophila_angor        Drosophila_angor
146       Drosophila_recens       Drosophila_recens
147       Drosophila_recens       Drosophila_recens
148       Drosophila_pachea       Drosophila_pachea
149       Drosophila_daruma        Drosophila_angor

[150 rows x 2 columns]
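
With predicted and observed species side by side, the misclassified specimens can be listed directly. The sketch below uses a small made-up frame with the same `Esp` and `Obs` column names; the species names `sp_a`, `sp_b`, `sp_c` are placeholders.

```python
import pandas as pd

# Made-up frame with the same columns as pred_df
toy_pred_df = pd.DataFrame({
    "Esp": ["sp_a", "sp_b", "sp_c"],   # predicted
    "Obs": ["sp_a", "sp_b", "sp_a"],   # observed
})

# Rows where prediction and observation disagree
mismatches = toy_pred_df[toy_pred_df["Esp"] != toy_pred_df["Obs"]]
print(mismatches)

# Cross-tabulation: observed species in rows, predicted species in columns
confusion = pd.crosstab(toy_pred_df["Obs"], toy_pred_df["Esp"])
print(confusion)
```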


# 4. Evaluate the ML model

## 4.1 One-hot encoding of the challenge test set

In [23]:
df_test.head

<bound method NDFrame.head of 0                    ID S1 S2 S3 S4 S5 S6 S7 S8 S9  ... S654 S655 S656 S657  \
1    DQ471543_110189412  A  A  A  T  T  G  G  A  A  ...    C    A    A    C   
2    DQ471554_110189434  A  T  T  G  G  A  A  C  T  ...    A    C    A    T   
3     DQ383671_87475141  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
4     DQ383677_87475153  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
5     DQ383678_87475155  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
..                  ... .. .. .. .. .. .. .. .. ..  ...  ...  ...  ...  ...   
117            Auxiliar  A  A  A  A  A  A  A  A  A  ...    A    A    A    A   
118            Auxiliar  C  C  C  C  C  C  C  C  C  ...    C    C    C    C   
119            Auxiliar  G  G  G  G  G  G  G  G  G  ...    G    G    G    G   
120            Auxiliar  T  T  T  T  T  T  T  T  T  ...    T    T    T    T   
121            Auxiliar  N  N  N  N  N  N  N  N  N  ...    N    N    N    N   

0   S658 S659 S660 S6

In [24]:
df_X_test = df_test.loc[:, 'S1':'S663']
df_X_test.columns

Index(['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10',
       ...
       'S654', 'S655', 'S656', 'S657', 'S658', 'S659', 'S660', 'S661', 'S662',
       'S663'],
      dtype='object', name=0, length=663)

In [25]:
one_hot_encoded_test = pd.get_dummies(df_X_test, columns = df_X_test.columns)
print(one_hot_encoded_test)

     S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  S2_T  ...  S662_A  \
1       1     0     0     0     0     1     0     0     0     0  ...       1   
2       1     0     0     0     0     0     0     0     0     1  ...       0   
3       0     0     0     1     0     0     0     0     1     0  ...       0   
4       0     0     0     1     0     0     0     0     1     0  ...       0   
5       0     0     0     1     0     0     0     0     1     0  ...       0   
..    ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...     ...   
117     1     0     0     0     0     1     0     0     0     0  ...       1   
118     0     1     0     0     0     0     1     0     0     0  ...       0   
119     0     0     1     0     0     0     0     1     0     0  ...       0   
120     0     0     0     0     1     0     0     0     0     1  ...       0   
121     0     0     0     1     0     0     0     0     1     0  ...       0   

     S662_C  S662_G  S662_N  S662_T  S6

In [26]:
# Drop the five trailing "Auxiliar" rows in a single slice
one_hot_encoded_test = one_hot_encoded_test.iloc[:-5, :]
print(one_hot_encoded_test)

     S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  S2_T  ...  S662_A  \
1       1     0     0     0     0     1     0     0     0     0  ...       1   
2       1     0     0     0     0     0     0     0     0     1  ...       0   
3       0     0     0     1     0     0     0     0     1     0  ...       0   
4       0     0     0     1     0     0     0     0     1     0  ...       0   
5       0     0     0     1     0     0     0     0     1     0  ...       0   
..    ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...     ...   
112     0     0     0     1     0     0     0     0     1     0  ...       0   
113     0     0     0     1     0     0     0     0     1     0  ...       0   
114     0     0     0     1     0     0     0     0     1     0  ...       0   
115     0     0     0     1     0     0     0     0     1     0  ...       0   
116     0     0     0     1     0     0     0     0     1     0  ...       0   

     S662_C  S662_G  S662_N  S662_T  S6

In [27]:
one_hot_encoded_test = one_hot_encoded_test[x_train.columns]
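
Selecting with `[x_train.columns]` puts the test columns in the training order, but it raises a `KeyError` if any training column is missing from the test frame. A more tolerant sketch uses `DataFrame.reindex`, which fills missing columns with zeros and drops columns the model never saw. The column names and values below are toy data.

```python
import pandas as pd

# Column layout the model was trained on (toy example)
train_columns = ["S1_A", "S1_C", "S1_N", "S2_A", "S2_T"]

# Test frame with a different column order and without S1_C and S2_A
toy_test = pd.DataFrame({"S2_T": [1, 0], "S1_A": [0, 1], "S1_N": [1, 0]})

# Same order as training; absent columns are created and filled with 0
aligned = toy_test.reindex(columns=train_columns, fill_value=0)
print(aligned)
```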

In [28]:
pred = model.predict(one_hot_encoded_test)
le.inverse_transform(pred)

array(['Drosophila_mojavensis', 'Drosophila_angor', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_barutani', 'Drosophila_falleni', 'Drosophila_falleni',
       'Drosophila_falleni', 'Drosophila_innubila', 'Drosophila_innubila',
       'Drosophila_innubila', 'Drosophila_innubila',
       'Drosophila_melanogaster', 'Drosophila_melanogaster',
       'Drosophila_mettleri', 'Drosophila_mettleri',
       'Drosophila_mettleri', 'Drosophila_mettleri',
       'Drosophila_mettleri', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_mojavensis',
       'Drosophila_montana', 'Drosophila_montana', 'Drosophila_montana',
       'Drosophila_montana', 'Drosophila_montana', 'Drosophila_montana',
       'Drosophila_montana', 'Drosophila_montana',
       'Drosophila_nigr

In [29]:
print(df_test.shape)

(121, 664)


In [30]:
# Drop the five trailing "Auxiliar" rows in a single slice
df_test = df_test.iloc[:-5, :]

In [31]:
df_test['pred_spp'] = le.inverse_transform(pred)
df_test['pred_label'] = pred
print(df_test[['ID','pred_spp','pred_label']])

0                    ID                pred_spp  pred_label
1    DQ471543_110189412   Drosophila_mojavensis          10
2    DQ471554_110189434        Drosophila_angor           0
3     DQ383671_87475141     Drosophila_arizonae           1
4     DQ383677_87475153     Drosophila_arizonae           1
5     DQ383678_87475155     Drosophila_arizonae           1
..                  ...                     ...         ...
112  DQ851680_114187161  Drosophila_subquinaria          17
113  DQ851686_114187173  Drosophila_subquinaria          17
114  DQ851689_114187179  Drosophila_subquinaria          17
115   DQ426800_90018979      Drosophila_virilis          18
116   DQ426803_90018985      Drosophila_virilis          18

[116 rows x 3 columns]


**Important**

The challenge results are to be submitted as a spreadsheet (Google Sheets) together with a 5 to 10 minute video (unlisted on YouTube) explaining your implementation.

# 5. Evaluate each team's results in the challenge

This section implements a few lines of code to evaluate the teams. To do so we will:

1. Import each team's predictions
2. Estimate the performance of their models on the challenge test set
3. Consolidate the results
4. Visualize them
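
Steps 2 and 3 above could be sketched as one function that scores every team and collects the results in a single table. The function name `score_teams`, the team names, and the toy frames are hypothetical; the column names `target` and `pred` follow the team sheets imported in this section, and Cohen's kappa is used because it is the metric applied to each team there.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def score_teams(team_frames):
    """Return one row per team with its Cohen's kappa, best first."""
    rows = [
        {"Equipo": name, "kappa": cohen_kappa_score(df["target"], df["pred"])}
        for name, df in team_frames.items()
    ]
    return (
        pd.DataFrame(rows)
        .sort_values("kappa", ascending=False)
        .reset_index(drop=True)
    )

# Made-up predictions for two hypothetical teams
toy_frames = {
    "team_y": pd.DataFrame({"target": ["a", "a", "b", "b"],
                            "pred":   ["a", "b", "b", "b"]}),
    "team_x": pd.DataFrame({"target": ["a", "a", "b", "b"],
                            "pred":   ["a", "a", "b", "b"]}),
}
ranking = score_teams(toy_frames)
print(ranking)
```

The resulting table has one row per team, so it can be handed to any plotting tool for step 4.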

Predictions by team:

Download the folder from the following link and place it in the Google Drive account linked to Colab

https://drive.google.com/drive/folders/1BXKEvMV1hN3S7cwvGlalC2QBSxDxCkqM?usp=sharing

or store the teams' prediction spreadsheets (for example, Mapache_prediccion_final) in **your own** Google Sheets.



## 5.1. Import each team's predictions

In [39]:
import pandas as pd
#defining my worksheet
mapache = gc.open('Mapache_prediccion_final').sheet1
#get_all_values gives a list of rows
rows_mapache = mapache.get_all_values()
#Convert to a DataFrame
df_mapache = pd.DataFrame(rows_mapache)
df_mapache.columns = df_mapache.iloc[0]
df_mapache = df_mapache.iloc[1:]
print(df_mapache)

0               ID_pred                     pred           ID_target  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359   Drosophila_subquinaria  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_target  
1    Drosophila_melanogaster   TRUE       TRUE  
2    Drosophila_melanogaster   TRUE  

In [42]:
mapache_cohen_kappa_score = cohen_kappa_score(df_mapache['target'],df_mapache['pred'])
print(mapache_cohen_kappa_score)

0.9495958981489528


In [50]:
import pandas as pd
#defining my worksheet
Ajolotes_rosas = gc.open('Ajolotes_rosas').sheet1
#get_all_values gives a list of rows
rows_Ajolotes_rosas = Ajolotes_rosas.get_all_values()
#Convert to a DataFrame
df_Ajolotes_rosas = pd.DataFrame(rows_Ajolotes_rosas)
df_Ajolotes_rosas.columns = df_Ajolotes_rosas.iloc[0]
df_Ajolotes_rosas = df_Ajolotes_rosas.iloc[1:]
print(df_Ajolotes_rosas)

0               ID_pred                     pred           ID_target  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred          Equipo  
1    Drosophila_melanogaster   TRUE     TRUE  Ajolotes_rosas  
2    Dros

In [52]:
Ajolotes_rosas_cohen_kappa_score = cohen_kappa_score(df_Ajolotes_rosas['target'],df_Ajolotes_rosas['pred'])
print(Ajolotes_rosas_cohen_kappa_score)

0.9797944608953144


In [59]:
import pandas as pd
#defining my worksheet
frutilleros = gc.open('frutilleros').sheet1
#get_all_values gives a list of rows
rows_frutilleros = frutilleros.get_all_values()
#Convert to a DataFrame
df_frutilleros = pd.DataFrame(rows_frutilleros)
df_frutilleros.columns = df_frutilleros.iloc[0]
df_frutilleros = df_frutilleros.iloc[1:]
print(df_frutilleros)
frutilleros_cohen_kappa_score = cohen_kappa_score(df_frutilleros['target'],df_frutilleros['pred'])
print(frutilleros_cohen_kappa_score)

0               ID_pred                     pred           ID_target  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred       Equipo  
1    Drosophila_melanogaster   TRUE     TRUE  Frutilleros  
2    Drosophila

In [73]:
import pandas as pd
#defining my worksheet
Heliconius = gc.open('Heliconius').sheet1
#get_all_values gives a list of rows
rows_Heliconius = Heliconius.get_all_values()
#Convert to a DataFrame
df_Heliconius = pd.DataFrame(rows_Heliconius)
df_Heliconius.columns = df_Heliconius.iloc[0]
df_Heliconius = df_Heliconius.iloc[1:]
print(df_Heliconius)
Heliconius_cohen_kappa_score = cohen_kappa_score(df_Heliconius['target'],df_Heliconius['pred'])
print(Heliconius_cohen_kappa_score)

0               ID_pred                     pred           ID_target  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred      Equipo  
1    Drosophila_melanogaster   TRUE     TRUE  Heliconius  
2    Drosophila_m

In [78]:
import pandas as pd
#defining my worksheet
IntelliGen = gc.open('intelliGen').sheet1
#get_all_values gives a list of rows
rows_IntelliGen = IntelliGen.get_all_values()
#Convert to a DataFrame
df_IntelliGen = pd.DataFrame(rows_IntelliGen)
df_IntelliGen.columns = df_IntelliGen.iloc[0]
df_IntelliGen = df_IntelliGen.iloc[1:]
print(df_IntelliGen)
IntelliGen_cohen_kappa_score = cohen_kappa_score(df_IntelliGen['target'],df_IntelliGen['pred'])
print(IntelliGen_cohen_kappa_score)

0               ID_pred                     pred           ID_target  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred      Equipo  
1    Drosophila_melanogaster   TRUE     TRUE  IntelliGen  
2    Drosophila_m

In [83]:
import pandas as pd
#defining my worksheet
monocitos = gc.open('monocitos').sheet1
#get_all_values gives a list of rows
rows_monocitos = monocitos.get_all_values()
#Convert to a DataFrame
df_monocitos = pd.DataFrame(rows_monocitos)
df_monocitos.columns = df_monocitos.iloc[0]
df_monocitos = df_monocitos.iloc[1:]
print(df_monocitos)
monocitos_cohen_kappa_score = cohen_kappa_score(df_monocitos['target'],df_monocitos['pred_RL'])
print(monocitos_cohen_kappa_score)

0               ID_pred                  pred_RL                 pred_KNN  \
1      AF200828_8573373  Drosophila_melanogaster  Drosophila_melanogaster   
2      AF200829_8573387  Drosophila_melanogaster  Drosophila_melanogaster   
3      AF200839_8573527      Drosophila_simulans      Drosophila_simulans   
4      AF200842_8573569      Drosophila_simulans      Drosophila_simulans   
5      AF200846_8573625      Drosophila_simulans      Drosophila_simulans   
..                  ...                      ...                      ...   
112  DQ851751_114187303        Drosophila_recens        Drosophila_recens   
113  DQ851756_114187313        Drosophila_recens        Drosophila_recens   
114  DQ851761_114187323        Drosophila_recens        Drosophila_recens   
115  DQ851779_114187359        Drosophila_recens        Drosophila_recens   
116  DQ851803_114187407        Drosophila_recens        Drosophila_recens   

0                    pred_RF           ID_target                   target  
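
The monocitos sheet carries three prediction columns (`pred_RL`, `pred_KNN`, `pred_RF`), but the cell above scores only `pred_RL`. The following is a minimal sketch, not part of the original notebook, of how every column whose name starts with `pred` could be scored against `target`. The helper name `score_pred_columns`, the stand-in scorer `agreement`, and the toy rows are assumptions made for illustration; in the notebook, `cohen_kappa_score` would be passed as the scorer and `rows_monocitos` as the rows.

```python
def agreement(y_true, y_pred):
    """Stand-in scorer (fraction of matching labels); hypothetical, for the demo only."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def score_pred_columns(rows, scorer, target_col="target", prefix="pred"):
    """Score every column starting with `prefix` against `target_col`.

    `rows` is a list of rows with the header in the first row,
    the shape returned by worksheet.get_all_values().
    """
    header, *body = rows
    t = header.index(target_col)
    y_true = [r[t] for r in body]
    scores = {}
    for i, name in enumerate(header):
        # 'ID_pred' and 'Aux_pred' do not start with the prefix, so they are skipped
        if name.startswith(prefix):
            scores[name] = scorer(y_true, [r[i] for r in body])
    return scores


# Toy rows, made up for illustration (not contest data)
rows = [
    ["ID_pred", "pred_RL", "pred_KNN", "target"],
    ["s1", "A", "A", "A"],
    ["s2", "B", "A", "B"],
]
result = score_pred_columns(rows, agreement)
```

In the notebook this might be called as `score_pred_columns(rows_monocitos, cohen_kappa_score)`, which would make it visible which of the team's models was entered into the ranking.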

In [86]:
import pandas as pd
#defining my worksheet
Puma_Burros = gc.open('Puma_Burros').sheet1
#get_all_values gives a list of rows
rows_Puma_Burros = Puma_Burros.get_all_values()
#Convert to a DataFrame
df_Puma_Burros = pd.DataFrame(rows_Puma_Burros)
df_Puma_Burros.columns = df_Puma_Burros.iloc[0]
df_Puma_Burros = df_Puma_Burros.iloc[1:]
print(df_Puma_Burros)
Puma_Burros_cohen_kappa_score = cohen_kappa_score(df_Puma_Burros['target'],df_Puma_Burros['pred'])
print(Puma_Burros_cohen_kappa_score)

0               ID_pred                     pred           ID_target  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred       Equipo  
1    Drosophila_melanogaster   TRUE     TRUE  Puma_Burros  
2    Drosophila

In [90]:
import pandas as pd
#defining my worksheet
Sociedad_Arácnida = gc.open('Sociedad_Arácnida').sheet1
#get_all_values gives a list of rows
rows_Sociedad_Arácnida = Sociedad_Arácnida.get_all_values()
#Convert to a DataFrame
df_Sociedad_Arácnida = pd.DataFrame(rows_Sociedad_Arácnida)
df_Sociedad_Arácnida.columns = df_Sociedad_Arácnida.iloc[0]
df_Sociedad_Arácnida = df_Sociedad_Arácnida.iloc[1:]
print(df_Sociedad_Arácnida)
Sociedad_Arácnida_cohen_kappa_score = cohen_kappa_score(df_Sociedad_Arácnida['target'],df_Sociedad_Arácnida['pred_SVM'])
print(Sociedad_Arácnida_cohen_kappa_score)

0               ID_pred                 pred_SVM                  pred_RF  \
1      AF200828_8573373  Drosophila_melanogaster  Drosophila_melanogaster   
2      AF200829_8573387  Drosophila_melanogaster  Drosophila_melanogaster   
3      AF200839_8573527      Drosophila_simulans      Drosophila_simulans   
4      AF200842_8573569      Drosophila_simulans      Drosophila_simulans   
5      AF200846_8573625      Drosophila_simulans      Drosophila_simulans   
..                  ...                      ...                      ...   
112  DQ851751_114187303        Drosophila_recens        Drosophila_recens   
113  DQ851756_114187313        Drosophila_recens        Drosophila_recens   
114  DQ851761_114187323        Drosophila_recens        Drosophila_recens   
115  DQ851779_114187359        Drosophila_recens        Drosophila_recens   
116  DQ851803_114187407        Drosophila_recens        Drosophila_recens   

0             ID_target                   target Aux_ID Aux_target_SVM  \
1

In [94]:
import pandas as pd
#defining my worksheet
UnADN = gc.open('UnADN').sheet1
#get_all_values gives a list of rows
rows_UnADN = UnADN.get_all_values()
#Convert to a DataFrame
df_UnADN = pd.DataFrame(rows_UnADN)
df_UnADN.columns = df_UnADN.iloc[0]
df_UnADN = df_UnADN.iloc[1:]
print(df_UnADN)
UnADN_cohen_kappa_score = cohen_kappa_score(df_UnADN['target'],df_UnADN['pred'])
print(UnADN_cohen_kappa_score)

0               ID_pred                     pred           ID_target  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred  
1    Drosophila_melanogaster   TRUE     TRUE  
2    Drosophila_melanogaster   TRUE     T

In [96]:
import pandas as pd
#defining my worksheet
DeRichard = gc.open('DeRichard').sheet1
#get_all_values gives a list of rows
rows_DeRichard = DeRichard.get_all_values()
#Convert to a DataFrame
df_DeRichard = pd.DataFrame(rows_DeRichard)
df_DeRichard.columns = df_DeRichard.iloc[0]
df_DeRichard = df_DeRichard.iloc[1:]
print(df_DeRichard)
DeRichard_cohen_kappa_score = cohen_kappa_score(df_DeRichard['target'],df_DeRichard['pred'])
print(DeRichard_cohen_kappa_score)

0               ID_pred                     pred                  ID  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625    Drosophila_mauritiana    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred     Equipo  
1    Drosophila_melanogaster   TRUE     TRUE  DeRichard  
2    Drosophila_mel

In [98]:
import pandas as pd
#defining my worksheet
DeAdryanTrejo = gc.open('DeAdryanTrejo').sheet1
#get_all_values gives a list of rows
rows_DeAdryanTrejo = DeAdryanTrejo.get_all_values()
#Convert to a DataFrame
df_DeAdryanTrejo = pd.DataFrame(rows_DeAdryanTrejo)
df_DeAdryanTrejo.columns = df_DeAdryanTrejo.iloc[0]
df_DeAdryanTrejo = df_DeAdryanTrejo.iloc[1:]
print(df_DeAdryanTrejo)
DeAdryanTrejo_cohen_kappa_score = cohen_kappa_score(df_DeAdryanTrejo['target'],df_DeAdryanTrejo['pred'])
print(DeAdryanTrejo_cohen_kappa_score)

0                    ID                     pred                  ID  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred         Equipo  
1    Drosophila_melanogaster   TRUE     TRUE  DeAdryanTrejo  
2    Drosop

In [105]:
import pandas as pd
#defining my worksheet
Jakare = gc.open('Jaka').sheet1
#get_all_values gives a list of rows
rows_Jakare = Jakare.get_all_values()
#Convert to a DataFrame
df_Jakare = pd.DataFrame(rows_Jakare)
df_Jakare.columns = df_Jakare.iloc[0]
df_Jakare = df_Jakare.iloc[1:]
print(df_Jakare)
Jakare_cohen_kappa_score = cohen_kappa_score(df_Jakare['target'],df_Jakare['pred'])
print(Jakare_cohen_kappa_score)

0               ID_pred                     pred           ID_target  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred  Equipo  
1    Drosophila_melanogaster   TRUE     TRUE  Jakare  
2    Drosophila_melanogas

In [108]:
import pandas as pd
#defining my worksheet
ForeverAlone = gc.open('ForeverAlone').sheet1
#get_all_values gives a list of rows
rows_ForeverAlone = ForeverAlone.get_all_values()
#Convert to a DataFrame
df_ForeverAlone = pd.DataFrame(rows_ForeverAlone)
df_ForeverAlone.columns = df_ForeverAlone.iloc[0]
df_ForeverAlone = df_ForeverAlone.iloc[1:]
print(df_ForeverAlone)
ForeverAlone_cohen_kappa_score = cohen_kappa_score(df_ForeverAlone['target'],df_ForeverAlone['pred'])
print(ForeverAlone_cohen_kappa_score)

0                    ID                     pred                  ID  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID Aux_pred        Equipo  
1    Drosophila_melanogaster   TRUE     TRUE  ForeverAlone  
2    Drosophi

In [111]:
import pandas as pd
#defining my worksheet
slowly = gc.open('slowly').sheet1
#get_all_values gives a list of rows
rows_slowly = slowly.get_all_values()
#Convert to a DataFrame
df_slowly = pd.DataFrame(rows_slowly)
df_slowly.columns = df_slowly.iloc[0]
df_slowly = df_slowly.iloc[1:]
print(df_slowly)
slowly_cohen_kappa_score = cohen_kappa_score(df_slowly['target'],df_slowly['pred'])
print(slowly_cohen_kappa_score)

0                    ID                     pred                  ID  \
1      AF200828_8573373      Drosophila_arizonae    AF200828_8573373   
2      AF200829_8573387      Drosophila_arizonae    AF200829_8573387   
3      AF200839_8573527      Drosophila_mettleri    AF200839_8573527   
4      AF200842_8573569  Drosophila_melanogaster    AF200842_8573569   
5      AF200846_8573625      Drosophila_mettleri    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303    - Drosophila_innubila  DQ851751_114187303   
113  DQ851756_114187313    - Drosophila_innubila  DQ851756_114187313   
114  DQ851761_114187323    - Drosophila_innubila  DQ851761_114187323   
115  DQ851779_114187359    - Drosophila_innubila  DQ851779_114187359   
116  DQ851803_114187407    - Drosophila_innubila  DQ851803_114187407   

0                     target               
1    Drosophila_melanogaster  TRUE  FALSE  
2    Drosophila_melanogaster  TRUE  FALSE  
3  
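
Some predictions in the slowly sheet shown above carry a leading `- ` (for example `- Drosophila_innubila`). Because `cohen_kappa_score` compares labels as exact strings, such a prefix would count as a mismatch even where the species name is otherwise correct. The sketch below shows one way to normalize labels before scoring; `normalize_label` is a hypothetical helper, and whether the organizers applied any such cleaning is not stated in the notebook.

```python
import re


def normalize_label(label):
    """Drop surrounding whitespace and a leading '-' bullet from a species label."""
    return re.sub(r"^\s*-\s*", "", label).strip()


# Sample strings: the first one is the form seen in the output above
raw = ["- Drosophila_innubila", " Drosophila_recens ", "Drosophila_simulans"]
clean = [normalize_label(x) for x in raw]
```

With pandas this could be applied as `df_slowly['pred'].map(normalize_label)` before calling the scorer.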

In [113]:
import pandas as pd
#defining my worksheet
kamikaze = gc.open('kamikaze').sheet1
#get_all_values gives a list of rows
rows_kamikaze = kamikaze.get_all_values()
#Convert to a DataFrame
df_kamikaze = pd.DataFrame(rows_kamikaze)
df_kamikaze.columns = df_kamikaze.iloc[0]
df_kamikaze = df_kamikaze.iloc[1:]
print(df_kamikaze)
kamikaze_cohen_kappa_score = cohen_kappa_score(df_kamikaze['target'],df_kamikaze['pred'])
print(kamikaze_cohen_kappa_score)

0                    ID                     pred                  ID  \
1      AF200828_8573373  Drosophila_melanogaster    AF200828_8573373   
2      AF200829_8573387  Drosophila_melanogaster    AF200829_8573387   
3      AF200839_8573527      Drosophila_simulans    AF200839_8573527   
4      AF200842_8573569      Drosophila_simulans    AF200842_8573569   
5      AF200846_8573625      Drosophila_simulans    AF200846_8573625   
..                  ...                      ...                 ...   
112  DQ851751_114187303        Drosophila_recens  DQ851751_114187303   
113  DQ851756_114187313        Drosophila_recens  DQ851756_114187313   
114  DQ851761_114187323        Drosophila_recens  DQ851761_114187323   
115  DQ851779_114187359        Drosophila_recens  DQ851779_114187359   
116  DQ851803_114187407        Drosophila_recens  DQ851803_114187407   

0                     target Aux_ID  
1    Drosophila_melanogaster   TRUE  
2    Drosophila_melanogaster   TRUE  
3        Drosophila_s
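
The cells above repeat the same load-and-score steps once per team, changing only the sheet name and, for two teams, the prediction column. A possible refactoring is sketched here under stated assumptions: `score_team`, `score_all`, `fake_fetch` and `agreement` are hypothetical names, and the fetch function is passed in so that the sketch runs without Google Sheets access.

```python
def score_team(rows, pred_col, scorer, target_col="target"):
    """Score one sheet; `rows` has the header in its first row."""
    header, *body = rows
    t, p = header.index(target_col), header.index(pred_col)
    return scorer([r[t] for r in body], [r[p] for r in body])


def score_all(fetch_rows, pred_cols, scorer):
    """Map each sheet name to its score, using the given prediction column."""
    return {
        sheet: score_team(fetch_rows(sheet), col, scorer)
        for sheet, col in pred_cols.items()
    }


# Sheet name -> scored column, as used in the cells above (subset)
PRED_COLS = {"frutilleros": "pred", "monocitos": "pred_RL", "Sociedad_Arácnida": "pred_SVM"}

# In the notebook (gc and cohen_kappa_score come from earlier cells):
# scores = score_all(lambda name: gc.open(name).sheet1.get_all_values(),
#                    PRED_COLS, cohen_kappa_score)


# Offline demo with made-up rows and a stand-in scorer
def fake_fetch(name):
    return [["ID_pred", "pred", "target"], ["s1", "A", "A"], ["s2", "A", "B"]]


def agreement(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


demo = score_all(fake_fetch, {"team_x": "pred"}, agreement)
```

Keeping the sheet-to-column mapping in one dictionary also documents which model each multi-model team was scored on.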

## 5.3. Unifying the results

**Performance comparison metric**

"Cohen's kappa coefficient (κ, lowercase Greek kappa) is a statistic that is used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, as κ takes into account the possibility of the agreement occurring by chance."

In this notebook the two "raters" are the reference labels (`target`) and each team's predictions (`pred`).

See:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/#:~:text=Cohen%20suggested%20the%20Kappa%20result,1.00%20as%20almost%20perfect%20agreement.

https://en.wikipedia.org/wiki/Cohen%27s_kappa
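
To make the definition concrete, kappa can be written as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed fraction of agreement and p_e is the agreement expected by chance from the label frequencies of each rater. The sketch below computes it by hand and maps the result to the interpretation bands quoted in the first linked article. It is an illustration, not the notebook's method (the notebook uses scikit-learn's `cohen_kappa_score`); the function names and the toy labels are assumptions.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of positions where both labels match
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case (one label on both sides); this sketch returns 1.0
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def interpret_kappa(kappa):
    """Bands as quoted in the linked article (McHugh, 2012)."""
    if kappa <= 0:
        return "no agreement"
    if kappa <= 0.20:
        return "none to slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy labels: p_o = 3/4, p_e = (2*3 + 2*1) / 16 = 0.5, so kappa = 0.5
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "A", "B", "A"]
kappa = cohen_kappa(y_true, y_pred)
```

For this toy case three of four labels agree (75%), yet kappa is only 0.5, because half of that agreement would be expected by chance alone.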

In [151]:
# initialize list of lists
data = [[13,'Mapache',  mapache_cohen_kappa_score, ' ','Ricardo Perea Jacobo',1],
        [8,'Ajolotes rosas', Ajolotes_rosas_cohen_kappa_score, ' ','Alan Vázquez González',12],
        [7,'Frutilleros', frutilleros_cohen_kappa_score, ' ','Leobardo Tepalcapa San Miguel',3],
        [6,'Heliconius', Heliconius_cohen_kappa_score, 'K-Nearest Neighbor','Pérez Flores Luis Miguel',10],
        [1,'IntelliGen', IntelliGen_cohen_kappa_score, 'Redes Neuronales,SVM','Maria Stella Lopez Carvajal',3],
        [5,'Monocitos', monocitos_cohen_kappa_score, 'Función Logistica','Perla Itzel Alvarado Luis',5],
        [2,'Puma Burros', Puma_Burros_cohen_kappa_score, 'Redes Neuronales Artificiales','Luis Gabriel García Mejía',18],
        [4,'Sociedad Arácnida', Sociedad_Arácnida_cohen_kappa_score, 'Random Forest,SVM','Sergio Emiliano Reyes Guardado',13],
        [10,'UnADN', UnADN_cohen_kappa_score, ' ','Johan Sebastian Cano Garcia',10],
        [11,'DeRichard', DeRichard_cohen_kappa_score, ' ','Richard Arteaga Ospina',1],
        [9,'DeAdryanTrejo', DeAdryanTrejo_cohen_kappa_score, ' ','Adryan Trejo Mendoza',1],
        [12,'Jakare', Jakare_cohen_kappa_score, ' ','Luis Alberto Cañete Baez',1],
        [7,'ForeverAlone', ForeverAlone_cohen_kappa_score, ' ','Andres Felipe Sabogal Ramirez',1],
        [14,'slowly', slowly_cohen_kappa_score, ' ','Arciniega González Jorge Arturo',1],
        [3,'Kamikaze', kamikaze_cohen_kappa_score, 'RF y selección de atributos','Kendra Ramirez Acosta',1]]

# Create the pandas DataFrame
df_cohen_kappa_score = pd.DataFrame(data, columns=['Posición','Equipo', 'Cohen kappa Score', 'Modelo','Delegado','Integrantes'])

# Display the DataFrame sorted by position, then by score
df_cohen_kappa_score.sort_values(by=['Posición','Cohen kappa Score'], ascending=True)

| | Posición | Equipo | Cohen kappa Score | Modelo | Delegado | Integrantes |
|---|---|---|---|---|---|---|
| 4 | 1 | IntelliGen | 0.98991 | Redes Neuronales,SVM | Maria Stella Lopez Carvajal | 3 |
| 6 | 2 | Puma Burros | 0.98991 | Redes Neuronales Artificiales | Luis Gabriel García Mejía | 18 |
| 14 | 3 | Kamikaze | 0.98991 | RF y selección de atributos | Kendra Ramirez Acosta | 1 |
| 7 | 4 | Sociedad Arácnida | 0.98991 | Random Forest,SVM | Sergio Emiliano Reyes Guardado | 13 |
| 5 | 5 | Monocitos | 0.98991 | Función Logistica | Perla Itzel Alvarado Luis | 5 |
| 3 | 6 | Heliconius | 0.98991 | K-Nearest Neighbor | Pérez Flores Luis Miguel | 10 |
| 12 | 7 | ForeverAlone | 0.979809 | | Andres Felipe Sabogal Ramirez | 1 |
| 2 | 7 | Frutilleros | 0.979819 | | Leobardo Tepalcapa San Miguel | 3 |
| 1 | 8 | Ajolotes rosas | 0.979794 | | Alan Vázquez González | 12 |
| 10 | 9 | DeAdryanTrejo | 0.96986 | | Adryan Trejo Mendoza | 1 |
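
The `Posición` values are typed by hand in `data`, and the six top teams show the same displayed score while holding positions 1 to 6, so the ordering presumably reflects tie-breaking criteria not shown in this notebook. As a cross-check, the sketch below derives a dense rank directly from the displayed scores, so that equal scores share a rank. The helper `dense_rank` is hypothetical, and the values are the rounded figures from the output above, so ties in the full-precision scores cannot be confirmed from them.

```python
def dense_rank(scores):
    """Rank teams by descending score; equal scores share the same rank."""
    distinct = sorted(set(scores.values()), reverse=True)
    rank_of = {value: i + 1 for i, value in enumerate(distinct)}
    return {team: rank_of[value] for team, value in scores.items()}


# Scores as displayed in the output above (rounded by pandas)
scores = {
    "IntelliGen": 0.98991,
    "Puma Burros": 0.98991,
    "Kamikaze": 0.98991,
    "Sociedad Arácnida": 0.98991,
    "Monocitos": 0.98991,
    "Heliconius": 0.98991,
    "Frutilleros": 0.979819,
    "ForeverAlone": 0.979809,
    "Ajolotes rosas": 0.979794,
    "DeAdryanTrejo": 0.96986,
}
ranks = dense_rank(scores)
```

With pandas, `df_cohen_kappa_score['Cohen kappa Score'].rank(method='dense', ascending=False)` would give the same kind of ranking on the full-precision scores.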


In [150]:
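# Save the results DataFrame to a Google Sheet, creating the sheet only if needed
import gspread
from gspread_dataframe import set_with_dataframe
title = 'df_cohen_kappa_score'
try:
    spreadsheet = gc.open(title)
except gspread.exceptions.SpreadsheetNotFound:
    # gc.create always makes a new file, so call it only when open fails
    spreadsheet = gc.create(title)
sheet = spreadsheet.sheet1
set_with_dataframe(sheet, df_cohen_kappa_score[['Posición','Equipo', 'Cohen kappa Score', 'Modelo','Delegado','Integrantes']])
# Defaults used here: include_index=False, include_column_header=True, resize=False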