# **Demo**

How many species are there on Earth? As Robert May put it in an article published in Science: "If some alien version of the Starship Enterprise visited Earth, what might be the visitors' first question? I think it would be: 'How many distinct life forms, species, does your planet have?' Embarrassingly, our best-guess answer would be in the range of 5 to 10 million eukaryotes (never mind the viruses and bacteria), but we could defend numbers exceeding 100 million, or as low as 3 million."

Answering this question alone remains a challenge in biology, but fortunately DNA sequencing is helping with the task. Along the same lines, artificial intelligence algorithms can, as one would expect, help automate this difficult work. The challenge we set for you here is to teach them how.

The species classification problem consists of assigning an unknown specimen to a known species by analyzing its DNA barcode. On this basis we pose the following question: given a set of DNA sequences, which supervised machine learning method is most effective at classifying species from DNA barcode sequences?

**Instructions for following this tutorial:**

Download the folder from the following link and place it in the Google Drive account linked to Colab:

https://drive.google.com/drive/folders/1-M1T7ahcmJ56Pd0yiHEm7joJiW7ktArv?usp=sharing

or store the training (Drosophila_train) and test (Drosophila_test) data in **your** Google Sheets.

# **Important:**

You can take on this challenge individually or as a group. Teams must appoint a delegate, who will be responsible for communication with the instructors.

# **Note**

To better understand the theory behind the problem we are trying to automate, see:

https://doi.org/10.1186/1756-0381-7-4

# **Challenge details**

https://drive.google.com/file/d/1YsnvINJfRxKOxFYoGnrZig9nCOKJCROi/view?usp=sharing

# **Other similar datasets to experiment with**

http://dmb.iasi.cnr.it/supbarcodes.php

# 1. Connect Colab to Google Sheets

The first step is to connect Colaboratory to Google Sheets or, more precisely, to your Google Drive. You can use the following code to do so
(see https://medium.com/mlearning-ai/how-to-access-google-sheets-on-google-colaboratory-8766b3a0996f)

In [1]:
from google.colab import auth
import gspread
from google.auth import default
# authenticate with Google
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# 2. Open the file containing the data

Let's first open the test set (Drosophila_test)

In [None]:
import pandas as pd
#defining my worksheet
Drosophila_test = gc.open('Drosophila_test').sheet1
#get_all_values gives a list of rows
rows_test = Drosophila_test.get_all_values()
#Convert to a DataFrame
df_test = pd.DataFrame(rows_test)
df_test.columns = df_test.iloc[0]
df_test = df_test.iloc[1:]
print(df_test)



0                    ID S1 S2 S3 S4 S5 S6 S7 S8 S9  ... S654 S655 S656 S657  \
1    DQ471543_110189412  A  A  A  T  T  G  G  A  A  ...    C    A    A    C   
2    DQ471554_110189434  A  T  T  G  G  A  A  C  T  ...    A    C    A    T   
3     DQ383671_87475141  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
4     DQ383677_87475153  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
5     DQ383678_87475155  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
..                  ... .. .. .. .. .. .. .. .. ..  ...  ...  ...  ...  ...   
117            Auxiliar  A  A  A  A  A  A  A  A  A  ...    A    A    A    A   
118            Auxiliar  C  C  C  C  C  C  C  C  C  ...    C    C    C    C   
119            Auxiliar  G  G  G  G  G  G  G  G  G  ...    G    G    G    G   
120            Auxiliar  T  T  T  T  T  T  T  T  T  ...    T    T    T    T   
121            Auxiliar  N  N  N  N  N  N  N  N  N  ...    N    N    N    N   

0   S658 S659 S660 S661 S662 S663  
1      A    T  

Now let's open the training set (Drosophila_train)

In [None]:
import pandas as pd
#defining my worksheet
Drosophila_train = gc.open('Drosophila_train').sheet1
#get_all_values gives a list of rows
rows_train = Drosophila_train.get_all_values()
#Convert to a DataFrame
df_train = pd.DataFrame(rows_train)
df_train.columns = df_train.iloc[0]
df_train = df_train.iloc[1:]
print(df_train)

0                    ID            target S1 S2 S3 S4 S5 S6 S7 S8  ... S654  \
1    DQ471538_110189402  Drosophila_angor  A  T  T  G  G  A  A  C  ...    A   
2    DQ471539_110189404  Drosophila_angor  A  T  C  G  G  A  A  C  ...    A   
3    DQ471540_110189406  Drosophila_angor  A  T  T  G  G  A  A  C  ...    A   
4    DQ471541_110189408  Drosophila_angor  A  T  C  G  G  T  A  C  ...    A   
5    DQ471542_110189410  Drosophila_angor  A  T  T  G  G  A  A  C  ...    A   
..                  ...               ... .. .. .. .. .. .. .. ..  ...  ...   
500            Auxiliar          Auxiliar  A  A  A  A  A  A  A  A  ...    A   
501            Auxiliar          Auxiliar  C  C  C  C  C  C  C  C  ...    C   
502            Auxiliar          Auxiliar  G  G  G  G  G  G  G  G  ...    G   
503            Auxiliar          Auxiliar  T  T  T  T  T  T  T  T  ...    T   
504            Auxiliar          Auxiliar  N  N  N  N  N  N  N  N  ...    N   

0   S655 S656 S657 S658 S659 S660 S661 S662 S663  
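
Both loading cells repeat the same steps: build a DataFrame from the list of rows, promote the first row to the header, and drop it from the data. A minimal sketch of a reusable helper follows. The name `rows_to_frame` is hypothetical, and the sample rows are made up to stand in for what `get_all_values()` returns.

```python
import pandas as pd

def rows_to_frame(rows):
    """Turn a list of rows (first row = header) into a DataFrame."""
    header, *data = rows
    return pd.DataFrame(data, columns=header)

# Made-up rows imitating the list of lists returned by get_all_values()
sample_rows = [
    ["ID", "S1", "S2"],
    ["seq_a", "A", "T"],
    ["seq_b", "N", "C"],
]
df_sample = rows_to_frame(sample_rows)
print(df_sample)
```

Unlike the cells above, this version numbers the data rows from 0 rather than 1, which does not matter for the later steps because they select by column name.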


# 3. Implement a supervised classification model

Here you can find a simple example of an ML project: https://towardsdatascience.com/build-your-first-machine-learning-model-with-zero-configuration-exploring-google-colab-5cc7263cfe28

## 3.1 Importing dependencies

We will mainly use the following libraries:

- _scikit-learn_: an ML library comprising a variety of data-processing functions and ML algorithms (e.g., regression, classification, and clustering). This library is also known as sklearn, and we will refer to it as sklearn from here on.
- _pandas_: a data science library that specializes mainly in preprocessing spreadsheet-like data before building ML models.

In [None]:
from sklearn import datasets, model_selection, metrics, ensemble

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.metrics import cohen_kappa_score

## 3.2 Exploring the working dataset

In [None]:
Y_train = df_train["target"]
Y_train

1      Drosophila_angor
2      Drosophila_angor
3      Drosophila_angor
4      Drosophila_angor
5      Drosophila_angor
             ...       
500            Auxiliar
501            Auxiliar
502            Auxiliar
503            Auxiliar
504            Auxiliar
Name: target, Length: 504, dtype: object
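
When exploring the labels it also helps to know how many specimens each species contributes, since rare species are harder to learn and to evaluate. A minimal sketch on made-up labels (the real notebook would use `df_train["target"]`) that leaves out the `Auxiliar` rows before counting:

```python
import pandas as pd

# Made-up labels standing in for df_train["target"]
labels = pd.Series(
    ["Drosophila_angor"] * 3 + ["Drosophila_virilis"] * 2 + ["Auxiliar"],
    name="target",
)

# Drop the auxiliary rows, then count specimens per species
species_only = labels[labels != "Auxiliar"]
counts = species_only.value_counts()
print(counts)
```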

In [None]:
X_train = df_train.loc[:, 'S1':'S663']
print(X_train)

0   S1 S2 S3 S4 S5 S6 S7 S8 S9 S10  ... S654 S655 S656 S657 S658 S659 S660  \
1    A  T  T  G  G  A  A  C  T   T  ...    A    C    A    T    T    T    A   
2    A  T  C  G  G  A  A  C  T   T  ...    A    C    A    T    T    T    A   
3    A  T  T  G  G  A  A  C  T   T  ...    A    C    A    T    T    T    A   
4    A  T  C  G  G  T  A  C  A   C  ...    A    C    A    T    T    T    A   
5    A  T  T  G  G  A  A  C  T   T  ...    A    C    A    T    T    T    A   
..  .. .. .. .. .. .. .. .. ..  ..  ...  ...  ...  ...  ...  ...  ...  ...   
500  A  A  A  A  A  A  A  A  A   A  ...    A    A    A    A    A    A    A   
501  C  C  C  C  C  C  C  C  C   C  ...    C    C    C    C    C    C    C   
502  G  G  G  G  G  G  G  G  G   G  ...    G    G    G    G    G    G    G   
503  T  T  T  T  T  T  T  T  T   T  ...    T    T    T    T    T    T    T   
504  N  N  N  N  N  N  N  N  N   N  ...    N    N    N    N    N    N    N   

0   S661 S662 S663  
1      G    

## 3.3 Data preprocessing (one-hot encoding)

In [None]:
# list(data) or
df_train.columns

Index(['ID', 'target', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8',
       ...
       'S654', 'S655', 'S656', 'S657', 'S658', 'S659', 'S660', 'S661', 'S662',
       'S663'],
      dtype='object', name=0, length=665)

In [None]:
df = df_train.loc[:, 'target':'S663']
df.columns

Index(['target', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9',
       ...
       'S654', 'S655', 'S656', 'S657', 'S658', 'S659', 'S660', 'S661', 'S662',
       'S663'],
      dtype='object', name=0, length=664)

In [None]:
df_X = df_train.loc[:, 'S1':'S663']
df_X.columns

Index(['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10',
       ...
       'S654', 'S655', 'S656', 'S657', 'S658', 'S659', 'S660', 'S661', 'S662',
       'S663'],
      dtype='object', name=0, length=663)

In [None]:
one_hot_encoded_data = pd.get_dummies(df, columns = df_X.columns)
print(one_hot_encoded_data)

               target  S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  \
1    Drosophila_angor     1     0     0     0     0     0     0     0     0   
2    Drosophila_angor     1     0     0     0     0     0     0     0     0   
3    Drosophila_angor     1     0     0     0     0     0     0     0     0   
4    Drosophila_angor     1     0     0     0     0     0     0     0     0   
5    Drosophila_angor     1     0     0     0     0     0     0     0     0   
..                ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   
500          Auxiliar     1     0     0     0     0     1     0     0     0   
501          Auxiliar     0     1     0     0     0     0     1     0     0   
502          Auxiliar     0     0     1     0     0     0     0     1     0   
503          Auxiliar     0     0     0     0     1     0     0     0     0   
504          Auxiliar     0     0     0     1     0     0     0     0     1   

     ...  S662_A  S662_C  S662_G  S662_N  S662_T  S

Before continuing, we remove the five auxiliary rows that were appended so that the one-hot encoding covers the full domain of possible values (A, C, G, T, N)

In [None]:
# drop the last five rows (the auxiliary A, C, G, T, N rows)
one_hot_encoded_data = one_hot_encoded_data.iloc[:-5, :]
print(one_hot_encoded_data)

                 target  S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  \
1      Drosophila_angor     1     0     0     0     0     0     0     0     0   
2      Drosophila_angor     1     0     0     0     0     0     0     0     0   
3      Drosophila_angor     1     0     0     0     0     0     0     0     0   
4      Drosophila_angor     1     0     0     0     0     0     0     0     0   
5      Drosophila_angor     1     0     0     0     0     0     0     0     0   
..                  ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   
495  Drosophila_virilis     0     0     0     1     0     0     0     0     1   
496  Drosophila_virilis     0     0     0     1     0     0     0     0     1   
497  Drosophila_virilis     0     0     0     1     0     0     0     0     1   
498  Drosophila_virilis     1     0     0     0     0     0     0     0     0   
499  Drosophila_virilis     1     0     0     0     0     0     0     0     0   

     ...  S662_A  S662_C  S
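
The auxiliary rows exist so that every position gets a column for each symbol, even if a symbol never occurs there. An alternative, sketched below on made-up data and not part of the original notebook, is to declare the alphabet as a categorical dtype. `pd.get_dummies` then emits one column per declared category, so no auxiliary rows need to be added or removed. The name `ALPHABET` is an assumption introduced for the example.

```python
import pandas as pd

ALPHABET = ["A", "C", "G", "N", "T"]  # symbols seen in the tutorial's output

# Toy sequences: position S1 never shows C, G or N; S2 never shows A, G or N
toy = pd.DataFrame({"S1": ["A", "A", "T"], "S2": ["T", "C", "T"]})

# Declaring the categories up front makes get_dummies emit a column per symbol
toy_cat = toy.astype(pd.CategoricalDtype(categories=ALPHABET))
encoded = pd.get_dummies(toy_cat)
print(encoded.columns.tolist())
```

The same dtype can be applied to the test set, which should give train and test the same columns in the same order.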

Let's recode the classes (species), replacing the name strings with numeric labels

In [None]:
le = preprocessing.LabelEncoder()
le.fit(one_hot_encoded_data.target)
list(le.classes_)
one_hot_encoded_data['target'] = le.transform(one_hot_encoded_data.target)
print(one_hot_encoded_data)

     target  S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  ...  \
1         0     1     0     0     0     0     0     0     0     0  ...   
2         0     1     0     0     0     0     0     0     0     0  ...   
3         0     1     0     0     0     0     0     0     0     0  ...   
4         0     1     0     0     0     0     0     0     0     0  ...   
5         0     1     0     0     0     0     0     0     0     0  ...   
..      ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...   
495      18     0     0     0     1     0     0     0     0     1  ...   
496      18     0     0     0     1     0     0     0     0     1  ...   
497      18     0     0     0     1     0     0     0     0     1  ...   
498      18     1     0     0     0     0     0     0     0     0  ...   
499      18     1     0     0     0     0     0     0     0     0  ...   

     S662_A  S662_C  S662_G  S662_N  S662_T  S663_A  S663_C  S663_G  S663_N  \
1         0       0       0     

To invert the transformation we can use:

In [None]:
le.inverse_transform(one_hot_encoded_data['target'])

array(['Drosophila_angor', 'Drosophila_angor', 'Drosophila_angor',
       'Drosophila_angor', 'Drosophila_angor', 'Drosophila_angor',
       'Drosophila_angor', 'Drosophila_angor', 'Drosophila_angor',
       'Drosophila_angor', 'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_barutani', 'Drosophila_barutani',
       'Drosophila_barutani', 'Drosophila_barutani',
       'Drosophila_barutani', 'Drosophila_beppui', 'Drosophila_beppui',
       'Drosophila_beppui', 'Drosophila_daruma', 'Drosophila_daruma',
       'Drosophila_daruma', 'Drosophila_daruma', 'Drosophila_falleni',
       'Drosophila_falleni', 'Drosophila_falleni', 'Drosophila_falleni',
       'Drosophi
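
To make the round trip concrete, here is a small sketch on made-up species names. `LabelEncoder` sorts the class names, so each integer code is the position of the species in alphabetical order, and `inverse_transform` maps the codes back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up species list; the notebook fits on one_hot_encoded_data.target
species = ["Drosophila_virilis", "Drosophila_angor", "Drosophila_virilis"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

print(list(encoder.classes_))                     # sorted class names
print(codes.tolist())                             # [1, 0, 1]
print(encoder.inverse_transform(codes).tolist())  # back to the names
```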

In [None]:
target = one_hot_encoded_data['target']
one_hot_encoded_data_x = one_hot_encoded_data.loc[:, 'S1_A':'S663_T']
x_train, x_test, y_train, y_test = model_selection.train_test_split(one_hot_encoded_data_x, target, test_size = 0.3, random_state =0)
print("Training Dataset:", x_train.shape)
print("Test Dataset:", x_test.shape)



Training Dataset: (349, 3315)
Test Dataset: (150, 3315)
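
The split above is drawn purely at random, so a rare species can end up on only one side of it. A possible variation, sketched here on toy data, is to pass `stratify` so that each class keeps roughly the same share in both partitions. This assumes every class has at least two samples; otherwise `train_test_split` raises an error.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 3 classes with 10, 6 and 4 samples (made up for illustration)
y = np.array([0] * 10 + [1] * 6 + [2] * 4)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps each class's share equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)
print(np.bincount(y_tr), np.bincount(y_te))
```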


## 3.4 Training the model

In [None]:
classifier = ensemble.RandomForestClassifier()
model = classifier.fit(x_train,y_train)
print(model)

RandomForestClassifier()


In [None]:
print(x_test)

     S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  S2_T  ...  S662_A  \
91      0     0     0     1     0     0     0     0     1     0  ...       0   
255     1     0     0     0     0     0     0     0     0     1  ...       0   
284     0     0     0     1     0     0     0     0     1     0  ...       0   
445     0     0     0     1     0     0     0     0     1     0  ...       0   
475     0     0     0     1     0     0     0     0     1     0  ...       0   
..    ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...     ...   
5       1     0     0     0     0     0     0     0     0     1  ...       0   
319     0     0     0     1     0     0     0     0     1     0  ...       0   
331     0     0     0     1     0     0     0     0     1     0  ...       0   
246     0     0     0     1     0     0     0     0     1     0  ...       0   
6       1     0     0     0     0     0     0     0     0     1  ...       0   

     S662_C  S662_G  S662_N  S662_T  S6

In [None]:
model.score(x_test,y_test)

0.9733333333333334

In [None]:
y_pred = model.predict(x_test)
report = metrics.classification_report(y_test,y_pred)
print(report)
print(y_pred)

              precision    recall  f1-score   support

           0       1.00      0.83      0.91         6
           1       1.00      1.00      1.00         4
           2       1.00      1.00      1.00         1
           3       0.00      0.00      0.00         0
           5       1.00      1.00      1.00         3
           6       1.00      1.00      1.00         9
           7       1.00      0.50      0.67         2
           8       1.00      1.00      1.00         2
           9       1.00      1.00      1.00         4
          10       1.00      1.00      1.00         8
          11       1.00      1.00      1.00        10
          12       1.00      1.00      1.00         1
          13       1.00      1.00      1.00         2
          14       1.00      1.00      1.00        22
          15       0.94      1.00      0.97        32
          16       0.88      1.00      0.93         7
          17       1.00      0.94      0.97        36
          18       1.00    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
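
The `_warn_prf` lines appear to be the tail of scikit-learn warnings about undefined metrics: a score is undefined when a class has no true samples or no predicted samples, as with the class reported with support 0 above. A minimal sketch on toy labels shows how the `zero_division` parameter of `classification_report` sets the value reported in that case and silences the warning.

```python
from sklearn.metrics import classification_report

# Toy labels: class 2 is predicted once but never occurs in y_true,
# so its recall is undefined (0 true samples)
y_true = [0, 0, 1, 1]
y_hat = [0, 0, 1, 2]

# zero_division=0 reports undefined scores as 0.0 without emitting a warning
report = classification_report(y_true, y_hat, zero_division=0, output_dict=True)
print(report["2"])
```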


In [None]:
cohen_kappa_score(y_test,y_pred)

0.9689424918474041
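
Cohen's kappa corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below, on made-up labels, computes it by hand from the confusion matrix and compares the result with `cohen_kappa_score`.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Made-up labels: 5 of 6 predictions agree with the truth
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(y_true, y_hat)
n = cm.sum()
p_o = np.trace(cm) / n                                  # observed agreement
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # chance agreement
kappa_manual = (p_o - p_e) / (1 - p_e)

print(kappa_manual, cohen_kappa_score(y_true, y_hat))
```

Here p_o = 5/6 and p_e = 1/3, so kappa is 0.75, lower than the raw accuracy of about 0.83.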

In [None]:
le.inverse_transform(y_pred)

array(['Drosophila_mettleri', 'Drosophila_recens', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_subquinaria',
       'Drosophila_arizonae', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_montana',
       'Drosophila_montana', 'Drosophila_pachea', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_recens',
       'Drosophila_montana', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_recens',
       'Drosophila_montana', 'Drosophila_falleni', 'Drosophila_pachea',
       'Drosophila_simulans', 'Drosophila_pachea',
       'Drosophila_mojavensis', 'Drosophila_falleni',
       'Drosophila_subquinaria', 'Drosophila_arizonae',
       'Drosophila_simulans', 'Drosophila_mettleri', 'Drosophila_pachea',
       'Drosophila_subquinaria', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_pachea',
       'Drosophila_subquinaria', 'Drosophila_subquinaria',
       'Drosophila_recens', 'Drosophila_subquinaria', 'Dr

In [None]:
le.inverse_transform(y_test)

array(['Drosophila_mettleri', 'Drosophila_recens', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_subquinaria',
       'Drosophila_arizonae', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_montana',
       'Drosophila_montana', 'Drosophila_pachea', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_recens',
       'Drosophila_montana', 'Drosophila_recens',
       'Drosophila_subquinaria', 'Drosophila_recens',
       'Drosophila_montana', 'Drosophila_falleni', 'Drosophila_pachea',
       'Drosophila_simulans', 'Drosophila_pachea',
       'Drosophila_mojavensis', 'Drosophila_falleni',
       'Drosophila_subquinaria', 'Drosophila_arizonae',
       'Drosophila_simulans', 'Drosophila_mettleri', 'Drosophila_pachea',
       'Drosophila_subquinaria', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_pachea',
       'Drosophila_subquinaria', 'Drosophila_subquinaria',
       'Drosophila_recens', 'Drosophila_subquinaria', 'Dr

In [None]:
pred_df = pd.DataFrame()
pred_df['Esp'] = le.inverse_transform(y_pred)
pred_df['Obs'] = le.inverse_transform(y_test)
print(pred_df)

                        Esp                     Obs
0       Drosophila_mettleri     Drosophila_mettleri
1         Drosophila_recens       Drosophila_recens
2         Drosophila_recens       Drosophila_recens
3    Drosophila_subquinaria  Drosophila_subquinaria
4    Drosophila_subquinaria  Drosophila_subquinaria
..                      ...                     ...
145        Drosophila_angor        Drosophila_angor
146       Drosophila_recens       Drosophila_recens
147       Drosophila_recens       Drosophila_recens
148       Drosophila_pachea       Drosophila_pachea
149        Drosophila_angor        Drosophila_angor

[150 rows x 2 columns]
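
The challenge asks which supervised method works best, so one way to proceed is to score several classifiers under the same cross-validation scheme. The sketch below does this for three estimators that the notebook already imports or has available (`RandomForestClassifier`, `MultinomialNB`, and a decision tree). The data is synthetic and generated only for illustration, so the printed scores say nothing about the Drosophila barcodes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the barcodes: 2 "species", 60 sequences of 20 bases.
# Each species has its own consensus; about 10% of bases are redrawn at random.
bases = np.array(list("ACGT"))
consensus = rng.choice(bases, size=(2, 20))
y = np.repeat([0, 1], 30)
seqs = consensus[y].copy()
mutate = rng.random(seqs.shape) < 0.1
seqs[mutate] = rng.choice(bases, size=mutate.sum())
X = pd.get_dummies(pd.DataFrame(seqs).astype(str))

candidates = {
    "random_forest": RandomForestClassifier(random_state=0),
    "naive_bayes": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
# Mean 5-fold cross-validated accuracy per candidate
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On the real data, `X` and `y` would be `one_hot_encoded_data_x` and `target`. With very rare species the number of folds may need to be reduced, because stratified folds require enough samples per class.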


# 4. Evaluate the ML model

## 4.1 One-hot encoding of the challenge test set

In [None]:
print(df_test)

0                    ID S1 S2 S3 S4 S5 S6 S7 S8 S9  ... S654 S655 S656 S657  \
1    DQ471543_110189412  A  A  A  T  T  G  G  A  A  ...    C    A    A    C   
2    DQ471554_110189434  A  T  T  G  G  A  A  C  T  ...    A    C    A    T   
3     DQ383671_87475141  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
4     DQ383677_87475153  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
5     DQ383678_87475155  N  N  N  N  N  N  A  C  T  ...    N    N    N    N   
..                  ... .. .. .. .. .. .. .. .. ..  ...  ...  ...  ...  ...   
117            Auxiliar  A  A  A  A  A  A  A  A  A  ...    A    A    A    A   
118            Auxiliar  C  C  C  C  C  C  C  C  C  ...    C    C    C    C   
119            Auxiliar  G  G  G  G  G  G  G  G  G  ...    G    G    G    G   
120            Auxiliar  T  T  T  T  T  T  T  T  T  ...    T    T    T    T   
121            Auxiliar  N  N  N  N  N  N  N  N  N  ...    N    N    N    N   

0   S658 S659 S660 S6

In [None]:
df_X_test = df_test.loc[:, 'S1':'S663']
df_X_test.columns

Index(['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10',
       ...
       'S654', 'S655', 'S656', 'S657', 'S658', 'S659', 'S660', 'S661', 'S662',
       'S663'],
      dtype='object', name=0, length=663)

In [None]:
one_hot_encoded_test = pd.get_dummies(df_X_test, columns = df_X_test.columns)
print(one_hot_encoded_test)

     S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  S2_T  ...  S662_A  \
1       1     0     0     0     0     1     0     0     0     0  ...       1   
2       1     0     0     0     0     0     0     0     0     1  ...       0   
3       0     0     0     1     0     0     0     0     1     0  ...       0   
4       0     0     0     1     0     0     0     0     1     0  ...       0   
5       0     0     0     1     0     0     0     0     1     0  ...       0   
..    ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...     ...   
117     1     0     0     0     0     1     0     0     0     0  ...       1   
118     0     1     0     0     0     0     1     0     0     0  ...       0   
119     0     0     1     0     0     0     0     1     0     0  ...       0   
120     0     0     0     0     1     0     0     0     0     1  ...       0   
121     0     0     0     1     0     0     0     0     1     0  ...       0   

     S662_C  S662_G  S662_N  S662_T  S6

In [None]:
# drop the last five rows (the auxiliary A, C, G, T, N rows)
one_hot_encoded_test = one_hot_encoded_test.iloc[:-5, :]
print(one_hot_encoded_test)

     S1_A  S1_C  S1_G  S1_N  S1_T  S2_A  S2_C  S2_G  S2_N  S2_T  ...  S662_A  \
1       1     0     0     0     0     1     0     0     0     0  ...       1   
2       1     0     0     0     0     0     0     0     0     1  ...       0   
3       0     0     0     1     0     0     0     0     1     0  ...       0   
4       0     0     0     1     0     0     0     0     1     0  ...       0   
5       0     0     0     1     0     0     0     0     1     0  ...       0   
..    ...   ...   ...   ...   ...   ...   ...   ...   ...   ...  ...     ...   
112     0     0     0     1     0     0     0     0     1     0  ...       0   
113     0     0     0     1     0     0     0     0     1     0  ...       0   
114     0     0     0     1     0     0     0     0     1     0  ...       0   
115     0     0     0     1     0     0     0     0     1     0  ...       0   
116     0     0     0     1     0     0     0     0     1     0  ...       0   

     S662_C  S662_G  S662_N  S662_T  S6

In [None]:
one_hot_encoded_test = one_hot_encoded_test[x_train.columns]
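
Selecting with `x_train.columns` puts the test columns in the order the model was trained on. It works here because the auxiliary rows guarantee that every column exists in both sets; if a column were missing, the selection would raise a `KeyError`. A more tolerant variant, sketched on toy data, uses `reindex` to add missing columns as zeros and drop columns the model has never seen.

```python
import pandas as pd

# Toy encodings: the test set lacks S1_G and carries an extra column S1_N
train_cols = ["S1_A", "S1_C", "S1_G", "S1_T"]
test_encoded = pd.DataFrame({"S1_A": [1, 0], "S1_C": [0, 0],
                             "S1_N": [0, 1], "S1_T": [0, 0]})

# Same columns, same order as training; absent columns are filled with 0
aligned = test_encoded.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

Note that dropping an unseen column discards information: the second toy row becomes all zeros at position S1.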

In [None]:
pred = model.predict(one_hot_encoded_test)
le.inverse_transform(pred)

array(['Drosophila_montana', 'Drosophila_angor', 'Drosophila_arizonae',
       'Drosophila_arizonae', 'Drosophila_arizonae',
       'Drosophila_barutani', 'Drosophila_falleni', 'Drosophila_falleni',
       'Drosophila_falleni', 'Drosophila_innubila', 'Drosophila_innubila',
       'Drosophila_innubila', 'Drosophila_innubila',
       'Drosophila_melanogaster', 'Drosophila_melanogaster',
       'Drosophila_mettleri', 'Drosophila_mettleri',
       'Drosophila_mettleri', 'Drosophila_mettleri',
       'Drosophila_mettleri', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_mojavensis',
       'Drosophila_mojavensis', 'Drosophila_mojavensis',
       'Drosophila_montana', 'Drosophila_montana', 'Drosophila_montana',
       'Drosophila_montana', 'Drosophila_montana', 'Drosophila_montana',
       'Drosophila_montana', 'Drosophila_montana',
       'Drosophila_nigrosp

In [None]:
print(df_test.shape)

(121, 664)


In [None]:
# drop the last five rows (the auxiliary rows) so df_test matches the predictions
df_test = df_test.iloc[:-5, :]

In [None]:
df_test['pred_spp'] = le.inverse_transform(pred)
df_test['pred_label'] = pred
print(df_test[['ID','pred_spp','pred_label']])

0                    ID                pred_spp  pred_label
1    DQ471543_110189412      Drosophila_montana          11
2    DQ471554_110189434        Drosophila_angor           0
3     DQ383671_87475141     Drosophila_arizonae           1
4     DQ383677_87475153     Drosophila_arizonae           1
5     DQ383678_87475155     Drosophila_arizonae           1
..                  ...                     ...         ...
112  DQ851680_114187161  Drosophila_subquinaria          17
113  DQ851686_114187173  Drosophila_subquinaria          17
114  DQ851689_114187179  Drosophila_subquinaria          17
115   DQ426800_90018979      Drosophila_virilis          18
116   DQ426803_90018985      Drosophila_virilis          18

[116 rows x 3 columns]


Save it as a Google Sheets spreadsheet

In [None]:
# create, and save df
from gspread_dataframe import set_with_dataframe
title = 'Mukurei'
gc.create(title)  # creates a new spreadsheet; skip this line if it already exists
sheet = gc.open(title).sheet1
set_with_dataframe(sheet, df_test[['ID','pred_spp','pred_label']])
# include_index=False, include_column_header=True, resize=False