# UMAP: Uniform Manifold Approximation and Projection
## Computational Intelligence 2021-2, Group 8a
Nicolás Canales, Matías Vergara

The goal of this notebook is to apply UMAP to the light curves with computed features, reducing their dimensionality to 2D, and then to visualize the result as a scatterplot in search of clusters.

Recall that the objects we are working with are periodic ones, classified by ALeRCE as "LPV", "Periodic-Other", "RRL", "CEP", "E" or "DSCT".

### References:
umap-learn 0.5.1 project description on PyPI: https://pypi.org/project/umap-learn/

UMAP documentation, How to Use UMAP: https://umap-learn.readthedocs.io/en/latest/basic_usage.html


### IMPORTANT:
After MP4, the way the data is augmented changed.
- Augmentation is no longer applied to the alerts (before processing) but to the light curves (already processed).
- Two options are provided: one synthetic sample per entry (sintetic_features_v2.csv) or ten different synthetic samples per original entry (sintetic_features_v2_x10.csv).

As a result, the comments accompanying each code cell may no longer match the data in use, since they were written for the original synthetic data (sintetic_features.csv).

## Installing the required libraries

In [None]:
!pip install umap-learn[plot]
!pip install pandas_profiling
!pip install jupyter
!pip install iprogress
!pip install summarytools

Collecting umap-learn[plot]
  Downloading umap-learn-0.5.2.tar.gz (86 kB)
Collecting pynndescent>=0.5
  Downloading pynndescent-0.5.5.tar.gz (1.1 MB)
Collecting datashader
  Downloading datashader-0.13.0-py2.py3-none-any.whl (15.8 MB)
Collecting datashape>=0.5.1
  Downloading datashape-0.5.2.tar.gz (76 kB)
Collecting distributed>=2.0
  Downloading distributed-2021.12.0-py3-none-any.whl (802 kB)
Collecting partd>=0.3.10
  Downloading partd-1.2.0-py3-none-any.whl (19 kB)
Collecting fsspec>=0.6.0
  Downloading fsspec-2021.11.1-py3-none-any.whl (132 kB)
Collecting multipledispatch>=0.4.7
  Downloading multipledispatch-0.6

In [None]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import RobustScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
import umap 
from summarytools.summarytools import dfSummary
import pandas_profiling
import re
from io import BytesIO
from PIL import Image
import base64
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper
from bokeh.palettes import Spectral10, Spectral6


## Fetching the light curves with features

In [None]:
# fetch the light curves with their features
!gdown --id 19uB-u0gYCGKFlKFXCKIwsK5G4_MCFvV1 #lc_features.csv

# also fetch the synthetic light curves, which we will need to join with our
# original curves
!gdown --id 1zEo1QzFFhASgTv8XpsZKhirvDuCuaIs8 #sintetic_features_v2_x10.csv
!gdown --id 18VXFUELzdQWwDRaJAYM-TlLdVmYT0lgR #sintetic_features_v2.csv

# also fetch the labels_set.csv file, with the classifications
!gdown --id 1LU1sCIVXO8BQRMeKCCqu1vZGnceV6P5c #labels_set.csv

Downloading...
From: https://drive.google.com/uc?id=19uB-u0gYCGKFlKFXCKIwsK5G4_MCFvV1
To: /content/lc_features.csv
100% 143M/143M [00:01<00:00, 121MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zEo1QzFFhASgTv8XpsZKhirvDuCuaIs8
To: /content/sintetic_features_v2_x10.csv
100% 47.7M/47.7M [00:00<00:00, 115MB/s] 
Downloading...
From: https://drive.google.com/uc?id=18VXFUELzdQWwDRaJAYM-TlLdVmYT0lgR
To: /content/sintetic_features_v2.csv
100% 4.76M/4.76M [00:00<00:00, 41.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1LU1sCIVXO8BQRMeKCCqu1vZGnceV6P5c
To: /content/labels_set.csv
100% 10.2M/10.2M [00:00<00:00, 38.8MB/s]


## Preparing the data for UMAP

In [None]:
lc_features = pd.read_csv("lc_features.csv", sep=',', index_col=1).drop(columns='Unnamed: 0')
# we replace sintetic_features with sintetic_features_v2_x10, i.e.
# the new synthetic data generated AFTER processing
sintetic_features = pd.read_csv("sintetic_features_v2_x10.csv", index_col=0)
labels = pd.read_csv("labels_set.csv", index_col=0)

print(len(lc_features.index))
print(len(sintetic_features.index))

87003
26060


In [None]:
sintetic_features

oid,Multiband_period,PPE,Period_band_g,delta_period_g,Period_band_r,delta_period_r,GP_DRW_sigma_g,GP_DRW_tau_g,GP_DRW_sigma_r,GP_DRW_tau_r,Psi_CS_g,Psi_eta_g,Psi_CS_r,Psi_eta_r,Harmonics_mag_1_g,Harmonics_mag_2_g,Harmonics_mag_3_g,Harmonics_mag_4_g,Harmonics_mag_5_g,Harmonics_mag_6_g,Harmonics_mag_7_g,Harmonics_phase_2_g,Harmonics_phase_3_g,Harmonics_phase_4_g,Harmonics_phase_5_g,Harmonics_phase_6_g,Harmonics_phase_7_g,Harmonics_mse_g,Harmonics_mag_1_r,Harmonics_mag_2_r,Harmonics_mag_3_r,Harmonics_mag_4_r,Harmonics_mag_5_r,Harmonics_mag_6_r,Harmonics_mag_7_r,Harmonics_phase_2_r,Harmonics_phase_3_r,Harmonics_phase_4_r,Harmonics_phase_5_r,Harmonics_phase_6_r,...,PairSlopeTrend_g,PercentAmplitude_g,Q31_g,Rcs_g,Skew_g,SmallKurtosis_g,Std_g,StetsonK_g,Pvar_g,ExcessVar_g,SF_ML_amplitude_g,SF_ML_gamma_g,IAR_phi_g,LinearTrend_g,Amplitude_r,AndersonDarling_r,Autocor_length_r,Beyond1Std_r,Con_r,Eta_e_r,Gskew_r,MaxSlope_r,Mean_r,Meanvariance_r,MedianAbsDev_r,MedianBRP_r,PairSlopeTrend_r,PercentAmplitude_r,Q31_r,Rcs_r,Skew_r,SmallKurtosis_r,Std_r,StetsonK_r,Pvar_r,ExcessVar_r,SF_ML_amplitude_r,SF_ML_gamma_r,IAR_phi_r,LinearTrend_r
sinteticCEPS0,0.099112,0.089271,0.100329,0.000000,,,0.071626,0.159068,0.029779,0.364274,0.295159,1.553925,0.468459,2.096526,4.019349,4.061796,2.780384,6.108448,1.502323,0.681500,0.962506,6.139743,5.915208,2.650219,5.445091,2.389707,5.399856,4.103934e-27,4.134828,3.094456,1.517355,0.310286,1.998845,3.399733,4.174518,0.003566,0.028272,2.828940,3.066402,3.063408,...,0.000000,0.031171,0.554614,0.339576,-0.647699,-0.926821,0.268092,0.998039,0.997495,0.000197,0.555430,0.049751,0.000005,0.000366,,,,,,,,,,,,,,,,,,,,,,,,,,
sinteticCEPS1,0.099169,0.088147,0.099570,0.000000,,,0.071673,0.161797,0.029684,0.364339,0.295301,1.554834,0.470246,2.076452,4.051274,4.087754,2.770758,6.081897,1.500731,0.680811,0.964964,6.173608,5.974260,2.638696,5.383003,2.380650,5.375307,4.079439e-27,4.157774,3.097097,1.505164,0.312626,2.016052,3.370640,4.150844,0.003561,0.028183,2.777995,3.018388,3.050269,...,0.000000,0.031410,0.555056,0.342993,-0.652239,-0.920977,0.270101,0.996387,0.993448,0.000195,0.560848,0.049418,0.000005,0.000365,,,,,,,,,,,,,,,,,,,,,,,,,,
sinteticCEPS2,0.098699,0.087926,0.099406,0.000000,,,0.071598,0.159937,0.029804,0.366520,0.294160,1.536855,0.474844,2.079180,4.027345,4.077853,2.787175,6.115894,1.505516,0.678950,0.962230,6.109571,5.989581,2.640473,5.365952,2.393844,5.395031,4.065517e-27,4.117611,3.082682,1.510912,0.308327,2.009039,3.380112,4.115728,0.003510,0.028215,2.803217,3.043507,3.019906,...,0.000000,0.031182,0.559451,0.337666,-0.654363,-0.916446,0.269557,0.983110,0.992095,0.000195,0.551303,0.049217,0.000004,0.000369,,,,,,,,,,,,,,,,,,,,,,,,,,
sinteticCEPS3,0.098827,0.088224,0.098811,0.000000,,,0.071522,0.160640,0.029717,0.366719,0.298748,1.546658,0.473148,2.086706,3.993678,4.034017,2.764617,6.056875,1.498182,0.675883,0.977536,6.138937,5.957979,2.633409,5.361000,2.423897,5.347754,4.066460e-27,4.146558,3.058873,1.513719,0.307536,2.033466,3.383197,4.119971,0.003570,0.028363,2.790211,3.037399,3.030844,...,0.000000,0.031196,0.554902,0.339666,-0.655705,-0.919617,0.270398,0.985420,0.992847,0.000199,0.553648,0.049519,0.000005,0.000366,,,,,,,,,,,,,,,,,,,,,,,,,,
sinteticCEPS4,0.098656,0.089197,0.100549,0.000000,,,0.071681,0.161520,0.029476,0.363414,0.294348,1.556388,0.469969,2.095631,4.033606,4.038949,2.755078,6.144406,1.507560,0.670949,0.973785,6.202614,5.968141,2.632776,5.429138,2.403914,5.378743,4.106377e-27,4.134456,3.071846,1.501060,0.311035,2.014689,3.369042,4.190674,0.003513,0.028294,2.826105,3.064029,3.040411,...,0.000000,0.031776,0.552161,0.340904,-0.647998,-0.920594,0.270506,1.000705,1.003743,0.000195,0.553836,0.049298,0.000005,0.000366,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
sinteticPeriodicOther12555,0.079204,0.010173,0.055053,0.024158,0.091390,0.012417,0.001843,27.373111,0.008776,84.351168,0.307633,2.008275,0.443011,1.749511,1.905918,2.033370,2.155535,1.084262,2.441744,0.184809,1.561182,2.335036,5.842121,1.338555,4.677646,3.647727,4.053115,7.798272e-28,2.757993,3.910771,2.949149,1.690141,3.551470,2.785220,1.631147,0.100123,4.171450,5.244948,4.432017,0.751923,...,-0.132663,0.005879,0.073174,0.448150,-0.447688,-0.721829,0.041791,0.865222,0.996176,0.000005,0.593441,0.864813,0.961743,-0.000129,0.109653,0.999763,1.000266,0.284053,0.0,0.127771,-0.153757,0.031635,15.159038,0.005656,0.025704,0.428374,0.0,0.012192,0.112223,0.440379,-0.857291,0.752752,0.087481,0.919235,0.994129,0.000032,2.090607,1.911991,0.990544,-0.000281
sinteticPeriodicOther12556,0.079169,0.010238,0.055235,0.024188,0.090770,0.012389,0.001832,27.223455,0.008896,85.056560,0.307367,2.017428,0.443441,1.752042,1.933336,2.021091,2.167738,1.083606,2.436534,0.185365,1.564079,2.337264,5.746406,1.352475,4.643599,3.659963,4.049029,7.764879e-28,2.763567,3.915213,2.956226,1.679528,3.585335,2.781817,1.610575,0.100913,4.173663,5.224997,4.428951,0.750504,...,-0.134248,0.005793,0.073018,0.447975,-0.446903,-0.723803,0.042198,0.864314,0.992518,0.000005,0.592483,0.871878,0.976683,-0.000128,0.109681,0.988409,1.001883,0.284095,0.0,0.127849,-0.154106,0.031625,15.159919,0.005703,0.025751,0.426498,0.0,0.012093,0.111527,0.447064,-0.870914,0.742829,0.086402,0.911428,0.991962,0.000032,2.087452,1.886353,0.998666,-0.000278
sinteticPeriodicOther12557,0.078546,0.010372,0.054331,0.023985,0.091051,0.012448,0.001836,27.388109,0.008929,83.585437,0.305308,2.012049,0.447914,1.731012,1.910220,2.024124,2.177319,1.075106,2.452484,0.186056,1.537097,2.334359,5.763605,1.347915,4.652798,3.640524,4.070154,7.799927e-28,2.750797,3.964327,2.974408,1.674189,3.576024,2.760357,1.610736,0.099750,4.232674,5.215847,4.446951,0.748317,...,-0.132604,0.005767,0.072761,0.447481,-0.445398,-0.722104,0.042120,0.868847,0.993289,0.000005,0.585010,0.869238,0.964123,-0.000129,0.109094,0.986123,1.003373,0.286385,0.0,0.128641,-0.153112,0.031758,15.261932,0.005729,0.025883,0.429384,0.0,0.012217,0.112991,0.447422,-0.859808,0.741349,0.087012,0.910306,0.999674,0.000032,2.109561,1.917050,0.990761,-0.000282
sinteticPeriodicOther12558,0.079398,0.010343,0.054232,0.023886,0.091272,0.012574,0.001863,27.200525,0.008775,83.658690,0.307766,2.014368,0.445360,1.750033,1.924206,2.036220,2.149033,1.071289,2.446759,0.186406,1.549812,2.354160,5.774117,1.358008,4.693901,3.644562,4.086277,7.700087e-28,2.798424,3.941023,2.943619,1.664753,3.581871,2.777849,1.605610,0.100605,4.170911,5.229497,4.442694,0.745155,...,-0.134380,0.005828,0.072476,0.451837,-0.447533,-0.718877,0.041662,0.879791,0.992411,0.000005,0.583601,0.870710,0.963303,-0.000129,0.108956,1.003187,0.996130,0.283245,0.0,0.129519,-0.154149,0.031820,15.067379,0.005676,0.025798,0.429000,0.0,0.012238,0.111869,0.448570,-0.863237,0.741925,0.086724,0.919482,1.005106,0.000032,2.105609,1.916322,1.006069,-0.000279


We were able to compute features for 87,003 original light curves and 26,060 synthetic ones; however, many of them contain NaN values. If we simply call `dropna()`, we are left with little more than 20,000 rows, which discards far too much information. We need another way to handle this problem.

A viable approach is to check which features are non-null for most of the data. To do so we use pandas-profiling to generate an analysis of the variables, through the following (commented-out) code cell.

Important: pandas-profiling requires pandas 0.25, which is not the version Colab ships by default. The runtime must be restarted once it is installed.

In [None]:
#!pip install pandas==0.25
#profile = pandas_profiling.ProfileReport(lc_features)
#profile.to_file("profile_lc_features.html")

The cell above generates the file `profile_lc_features.html`, which we studied to address the problem. In particular, we note that:
- Eta_e_g has 48.8% missing values
- MaxSlope_g has 48.8% missing values
- Eta_e_r has 52.3% missing values
- MaxSlope_r has 52.3% missing values
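
The per-column missing fraction can also be obtained directly with pandas, without generating a full profiling report. The following is a minimal sketch on a made-up toy dataframe (the values, the NaN positions and the `threshold` cutoff are assumptions for illustration, not taken from `lc_features`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for lc_features; values and NaN positions are made up.
toy = pd.DataFrame({
    "Eta_e_g": [0.1, np.nan, np.nan, 0.4],
    "MaxSlope_g": [np.nan, np.nan, 0.3, 0.2],
    "Mean_g": [15.1, 16.2, np.nan, 15.8],
    "Std_g": [0.2, 0.3, 0.1, 0.4],
})

# Fraction of missing values per column, highest first.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns above a chosen cutoff are candidates to drop before dropna().
threshold = 0.4  # hypothetical cutoff
sparse_columns = missing_fraction[missing_fraction > threshold].index.tolist()

# Compare how many complete rows survive with and without those columns.
rows_plain = len(toy.dropna())
rows_after_drop = len(toy.drop(columns=sparse_columns).dropna())
print(rows_plain, rows_after_drop)
```

On the real data the same two lines (`isna().mean()` and the comparison of `len(...dropna())`) would show how much each sparse column costs in terms of retained rows.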

We will try dropping these columns before applying `dropna()`, to see whether this reduces the fraction of data lost.

In [None]:
lc_features.drop(axis=1, labels=['Eta_e_g', 'Eta_e_r', 'MaxSlope_g', 'MaxSlope_r'], inplace=True)
sintetic_features.drop(axis=1, labels=['Eta_e_g', 'Eta_e_r', 'MaxSlope_g', 'MaxSlope_r'], inplace=True)

Let's now see how many rows we keep after applying `dropna()`:

---



In [None]:
lc_features.dropna(inplace=True)
sintetic_features.dropna(inplace=True)
print(len(lc_features.index))
print(len(sintetic_features.index))

74908
21730


We retain 96,638 rows (74,908 original and 21,730 synthetic)! That is, we lost only about 14.5% of the data, which is good. However, we must also check that the rows we lost do not all belong to a single class (or, even worse, to our least represented classes). Let's see how many entries we have per class:
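
As a quick check, the retained fraction can be recomputed from the row counts printed by the cells above when using the x10 synthetic file (87003 and 26060 before `dropna()`, 74908 and 21730 after). This is only arithmetic on those printed outputs:

```python
# Row counts printed by the cells above (original + synthetic).
rows_before = 87003 + 26060
rows_after = 74908 + 21730

lost = rows_before - rows_after
lost_pct = 100 * lost / rows_before
print(f"kept {rows_after} of {rows_before} rows, lost {lost} ({lost_pct:.1f}%)")
```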

In [None]:
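# Count the entries per class, original and synthetic curves together
count = {"LPV": 0, "Periodic-Other": 0, "RRL": 0, "CEP": 0, "E": 0, "DSCT": 0}
for oid in lc_features.index.unique():
    try:
        alerce_class = labels.loc[oid].values[0]
        count[alerce_class] += 1
    except KeyError:
        # oid missing from labels_set.csv, or class outside the six above
        pass
for oid in sintetic_features.index.unique():
    # synthetic oids look like "sinteticCEPS0": drop the "sintetic" prefix
    # and keep only the letters
    alerce_class = "".join(re.findall("[a-zA-Z]+", oid[8:]))
    if alerce_class == "PeriodicOther":
        alerce_class = "Periodic-Other"
    elif alerce_class == "DSCTS":
        alerce_class = "DSCT"
    else:
        # assumes the synthetic set only contains CEP, DSCT and Periodic-Other
        alerce_class = "CEP"
    count[alerce_class] += 1
count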


{'CEP': 4983,
 'DSCT': 7348,
 'E': 33486,
 'LPV': 9799,
 'Periodic-Other': 11572,
 'RRL': 29450}

Recall that the distribution was previously as follows (see the data-augmentation notebook):

{'CEP': 1236,
 'DSCT': 1464,
 'E': 37900,
 'LPV': 14045,
 'Periodic-Other': 2512,
 'RRL': 32464}
 
We note that we have mainly lost entries from E, LPV and RRL, the three most represented classes. We will assume that we can continue without problems.
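
The per-class change can be made explicit by subtracting the two dictionaries shown above. This sketch assumes both counts refer to comparable data; it only does arithmetic on the printed numbers:

```python
# Distribution reported in the data-augmentation notebook (before) and
# the one computed in the cell above (after).
before = {"CEP": 1236, "DSCT": 1464, "E": 37900,
          "LPV": 14045, "Periodic-Other": 2512, "RRL": 32464}
after = {"CEP": 4983, "DSCT": 7348, "E": 33486,
         "LPV": 9799, "Periodic-Other": 11572, "RRL": 29450}

# Absolute change per class.
changes = {name: after[name] - before[name] for name in before}

for name in sorted(before):
    pct = 100 * changes[name] / before[name]
    print(f"{name:15s} {before[name]:6d} -> {after[name]:6d} "
          f"({changes[name]:+6d}, {pct:+.1f}%)")
```

Classes with a negative change lost rows to `dropna()`, while the classes that grew are the ones that received synthetic curves.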


So far our data has the form `oid, feature1, feature2, ..., featuren`. This is enough for UMAP to perform the dimensionality reduction and let us visualize the data. But there is a problem: the class of each object, the **target**, is not included.


**IMPORTANT: We want to include the target in the data in order to color the visualization produced by UMAP, not for training!**


To solve this we perform another data cross, this time between `labels_set.csv` and our dataframes of original and synthetic light curves, retrieving the ALeRCE class `classALeRCE` of each curve to add it to the dataframe and later use that column as the color target in the visualization we obtain with UMAP. For the synthetic curves, which are not included in the labels set, we obtain the class from their name.
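
Both lookups can be sketched on toy data. The helper `class_from_synthetic_oid`, the mapping `SYNTHETIC_CLASS` and the two toy dataframes are hypothetical names introduced here for illustration; the oid patterns (`sinteticCEPS...`, `sinteticDSCTS...`, `sinteticPeriodicOther...`) are assumed from the names and string comparisons that appear in this notebook:

```python
import re
import pandas as pd

# Hypothetical explicit mapping from the letters in a synthetic oid to a class.
SYNTHETIC_CLASS = {"PeriodicOther": "Periodic-Other", "DSCTS": "DSCT", "CEPS": "CEP"}

def class_from_synthetic_oid(oid):
    """Return the class encoded in a synthetic oid such as 'sinteticCEPS0'."""
    letters = "".join(re.findall("[a-zA-Z]+", oid[len("sintetic"):]))
    return SYNTHETIC_CLASS[letters]  # raises KeyError on an unexpected name

# Toy stand-ins for labels_set.csv and lc_features (two real oids, one feature).
labels_toy = pd.DataFrame(
    {"classALeRCE": ["E", "RRL"]},
    index=pd.Index(["ZTF17aadmidf", "ZTF17aadqpnu"], name="oid"),
)
features_toy = pd.DataFrame(
    {"PPE": [0.064055, 0.150302]},
    index=pd.Index(["ZTF17aadqpnu", "ZTF17aadmidf"], name="oid"),
)

# Align the labels to the feature index in one step instead of looping.
features_toy["target"] = (
    labels_toy["classALeRCE"].reindex(features_toy.index).to_numpy()
)
print(features_toy)
print(class_from_synthetic_oid("sinteticPeriodicOther12555"))
```

Using an explicit mapping makes an unexpected synthetic name fail loudly instead of being silently assigned to one class.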

In [None]:
# First we add the target to the original curves
lc_features["target"] = ""
for index in lc_features.index:
    # .loc writes into the dataframe itself, avoiding the SettingWithCopyWarning
    # raised by chained assignment (lc_features['target'][index] = ...)
    lc_features.loc[index, "target"] = labels.loc[index].values[0]

# And the same for the synthetic curves, whose class is encoded in the oid
sintetic_features["target"] = ""
for index in sintetic_features.index:
    alerce_class = "".join(re.findall("[a-zA-Z]+", index[8:]))
    if alerce_class == "PeriodicOther":
        alerce_class = "Periodic-Other"
    elif alerce_class == "DSCTS":
        alerce_class = "DSCT"
    else:
        alerce_class = "CEP"
    sintetic_features.loc[index, "target"] = alerce_class


In [None]:
lc_features.head()

oid,Multiband_period,PPE,Period_band_g,delta_period_g,Period_band_r,delta_period_r,GP_DRW_sigma_g,GP_DRW_tau_g,GP_DRW_sigma_r,GP_DRW_tau_r,Psi_CS_g,Psi_eta_g,Psi_CS_r,Psi_eta_r,Harmonics_mag_1_g,Harmonics_mag_2_g,Harmonics_mag_3_g,Harmonics_mag_4_g,Harmonics_mag_5_g,Harmonics_mag_6_g,Harmonics_mag_7_g,Harmonics_phase_2_g,Harmonics_phase_3_g,Harmonics_phase_4_g,Harmonics_phase_5_g,Harmonics_phase_6_g,Harmonics_phase_7_g,Harmonics_mse_g,Harmonics_mag_1_r,Harmonics_mag_2_r,Harmonics_mag_3_r,Harmonics_mag_4_r,Harmonics_mag_5_r,Harmonics_mag_6_r,Harmonics_mag_7_r,Harmonics_phase_2_r,Harmonics_phase_3_r,Harmonics_phase_4_r,Harmonics_phase_5_r,Harmonics_phase_6_r,...,MedianBRP_g,PairSlopeTrend_g,PercentAmplitude_g,Q31_g,Rcs_g,Skew_g,SmallKurtosis_g,Std_g,StetsonK_g,Pvar_g,ExcessVar_g,SF_ML_amplitude_g,SF_ML_gamma_g,IAR_phi_g,LinearTrend_g,Amplitude_r,AndersonDarling_r,Autocor_length_r,Beyond1Std_r,Con_r,Gskew_r,Mean_r,Meanvariance_r,MedianAbsDev_r,MedianBRP_r,PairSlopeTrend_r,PercentAmplitude_r,Q31_r,Rcs_r,Skew_r,SmallKurtosis_r,Std_r,StetsonK_r,Pvar_r,ExcessVar_r,SF_ML_amplitude_r,SF_ML_gamma_r,IAR_phi_r,LinearTrend_r,target
ZTF17aadmidf,0.149331,0.150302,0.149331,0.0,0.149331,0.0,0.012568,1.583378,0.009955,0.540005,0.280821,0.534771,0.446245,1.715562,0.226143,0.17336,0.160905,0.12237,0.105802,0.053345,0.040609,4.410925,1.914819,5.7089,3.399976,0.76343,4.947751,0.0003559182,3.322528,3.786361,0.630072,3.320083,3.456475,3.14531,1.485501,0.920114,0.877684,5.396031,4.185686,1.120278,...,0.482759,-0.133333,0.015222,0.224707,0.173023,0.461301,-1.505973,0.115035,0.860189,1.0,5e-05,0.182344,0.114451,0.601279,-0.000188,0.120463,0.99805,1.0,0.285714,0.0,-0.185051,15.844516,0.006337,0.027004,0.428571,0.0,0.013397,0.133858,0.329755,-0.872082,0.68752,0.100411,0.919013,1.0,3.9e-05,0.008849,-0.5,0.02979923,6.9e-05,E
ZTF17aadqpnu,0.544647,0.064055,0.544647,0.0,0.544647,0.0,0.127657,0.604882,0.062287,0.16038,0.227532,0.197661,0.264729,0.250459,0.386156,0.179297,0.108261,0.060052,0.035854,0.031842,0.042522,2.574642,4.798851,0.584667,2.626285,5.90124,1.947181,0.007086322,0.278139,0.135993,0.094853,0.058553,0.036914,0.024242,0.016701,2.317399,4.284166,0.008309,1.562713,3.695137,...,0.447059,-0.033333,0.053033,0.343159,0.11648,-1.114517,0.053458,0.354037,0.892795,1.0,0.0004,0.611645,0.114981,0.194326,3.7e-05,0.402421,1.0,1.0,0.5,0.0,-0.499046,16.622919,0.014997,0.086149,0.5,-0.033333,0.039674,0.278279,0.098373,-0.951954,-0.496121,0.249302,0.911707,1.0,0.000222,0.374863,0.037573,0.002108215,0.000121,RRL
ZTF18aapkwuy,0.578986,0.110494,0.578986,0.0,0.578986,0.0,0.100087,0.071747,0.056377,0.17247,0.262575,0.054283,0.238333,0.052322,0.341754,0.164201,0.114893,0.072789,0.040806,0.018547,0.011639,2.522284,4.543388,0.297373,2.282109,4.034967,0.63288,0.001907514,0.238908,0.100937,0.09743,0.039466,0.037239,0.009078,0.010867,2.078315,4.031695,5.594666,1.338155,2.12687,...,0.5,-0.033333,0.05322,0.554007,0.060306,-0.608783,-1.146912,0.319406,0.840497,1.0,0.000437,0.454134,-0.006394,2e-06,7e-05,0.350948,1.0,1.0,0.450355,0.0,0.128946,14.990062,0.016191,0.244117,0.244681,-0.1,0.032933,0.443769,0.110692,0.100949,-1.461597,0.242701,0.796048,1.0,0.000261,0.351549,0.016288,0.002633387,-2.6e-05,RRL
ZTF18aajoeri,0.620327,0.095026,0.620327,0.0,0.620327,0.0,0.074342,0.385647,0.041275,0.33111,0.247368,0.102651,0.269291,0.126154,0.318212,0.135723,0.079888,0.051215,0.025414,0.012199,0.021768,2.713206,4.243866,0.786959,1.229373,4.593619,4.457981,0.0005181999,0.215537,0.066414,0.086511,0.017082,0.030978,0.005716,0.011889,1.723358,3.504457,5.564069,6.254613,4.078106,...,0.241379,-0.166667,0.028925,0.546084,0.117069,0.227756,-1.476044,0.270579,0.846095,1.0,0.000248,0.421667,0.043597,0.074929,-9.5e-05,0.283624,1.0,1.0,0.444444,0.0,-0.247613,16.909373,0.012214,0.161923,0.208333,-0.1,0.024387,0.39287,0.131229,-0.128824,-1.734664,0.20653,0.920227,1.0,0.000147,0.272383,-0.043357,0.05270067,-8.9e-05,RRL
ZTF18adaiqlm,0.599899,0.025961,0.230787,0.369112,0.599899,0.0,0.0356,93.530163,0.022202,0.05785,0.352077,1.926255,0.356133,0.878343,115.866663,143.465582,137.550345,144.695345,184.65524,220.949264,170.59098,2.218201,2.143108,2.010562,4.68678,4.653281,1.258688,3.6164380000000003e-22,2711.360489,3010.013855,1943.892353,997.080236,1177.144468,853.748343,291.427822,2.633734,5.175253,2.193438,5.359701,1.473237,...,0.5,0.1,0.024009,0.090657,0.352077,-1.730831,8.226508,0.15394,0.818056,1.0,7.4e-05,0.120289,0.288155,0.984298,0.001207,0.232859,0.783907,1.0,0.3125,0.0,0.05707,16.653477,0.009094,0.139583,0.375,-0.033333,0.015696,0.286974,0.224736,0.17461,-0.606513,0.15145,0.936903,1.0,8.1e-05,0.173758,-0.05862,3.71382e-13,-0.000296,E


In [None]:
sintetic_features.tail()

oid,Multiband_period,PPE,Period_band_g,delta_period_g,Period_band_r,delta_period_r,GP_DRW_sigma_g,GP_DRW_tau_g,GP_DRW_sigma_r,GP_DRW_tau_r,Psi_CS_g,Psi_eta_g,Psi_CS_r,Psi_eta_r,Harmonics_mag_1_g,Harmonics_mag_2_g,Harmonics_mag_3_g,Harmonics_mag_4_g,Harmonics_mag_5_g,Harmonics_mag_6_g,Harmonics_mag_7_g,Harmonics_phase_2_g,Harmonics_phase_3_g,Harmonics_phase_4_g,Harmonics_phase_5_g,Harmonics_phase_6_g,Harmonics_phase_7_g,Harmonics_mse_g,Harmonics_mag_1_r,Harmonics_mag_2_r,Harmonics_mag_3_r,Harmonics_mag_4_r,Harmonics_mag_5_r,Harmonics_mag_6_r,Harmonics_mag_7_r,Harmonics_phase_2_r,Harmonics_phase_3_r,Harmonics_phase_4_r,Harmonics_phase_5_r,Harmonics_phase_6_r,...,MedianBRP_g,PairSlopeTrend_g,PercentAmplitude_g,Q31_g,Rcs_g,Skew_g,SmallKurtosis_g,Std_g,StetsonK_g,Pvar_g,ExcessVar_g,SF_ML_amplitude_g,SF_ML_gamma_g,IAR_phi_g,LinearTrend_g,Amplitude_r,AndersonDarling_r,Autocor_length_r,Beyond1Std_r,Con_r,Gskew_r,Mean_r,Meanvariance_r,MedianAbsDev_r,MedianBRP_r,PairSlopeTrend_r,PercentAmplitude_r,Q31_r,Rcs_r,Skew_r,SmallKurtosis_r,Std_r,StetsonK_r,Pvar_r,ExcessVar_r,SF_ML_amplitude_r,SF_ML_gamma_r,IAR_phi_r,LinearTrend_r,target
sinteticPeriodicOther12555,0.079204,0.010173,0.055053,0.024158,0.09139,0.012417,0.001843,27.373111,0.008776,84.351168,0.307633,2.008275,0.443011,1.749511,1.905918,2.03337,2.155535,1.084262,2.441744,0.184809,1.561182,2.335036,5.842121,1.338555,4.677646,3.647727,4.053115,7.798272e-28,2.757993,3.910771,2.949149,1.690141,3.55147,2.78522,1.631147,0.100123,4.17145,5.244948,4.432017,0.751923,...,0.229663,-0.132663,0.005879,0.073174,0.44815,-0.447688,-0.721829,0.041791,0.865222,0.996176,5e-06,0.593441,0.864813,0.961743,-0.000129,0.109653,0.999763,1.000266,0.284053,0.0,-0.153757,15.159038,0.005656,0.025704,0.428374,0.0,0.012192,0.112223,0.440379,-0.857291,0.752752,0.087481,0.919235,0.994129,3.2e-05,2.090607,1.911991,0.990544,-0.000281,Periodic-Other
sinteticPeriodicOther12556,0.079169,0.010238,0.055235,0.024188,0.09077,0.012389,0.001832,27.223455,0.008896,85.05656,0.307367,2.017428,0.443441,1.752042,1.933336,2.021091,2.167738,1.083606,2.436534,0.185365,1.564079,2.337264,5.746406,1.352475,4.643599,3.659963,4.049029,7.764879000000001e-28,2.763567,3.915213,2.956226,1.679528,3.585335,2.781817,1.610575,0.100913,4.173663,5.224997,4.428951,0.750504,...,0.232022,-0.134248,0.005793,0.073018,0.447975,-0.446903,-0.723803,0.042198,0.864314,0.992518,5e-06,0.592483,0.871878,0.976683,-0.000128,0.109681,0.988409,1.001883,0.284095,0.0,-0.154106,15.159919,0.005703,0.025751,0.426498,0.0,0.012093,0.111527,0.447064,-0.870914,0.742829,0.086402,0.911428,0.991962,3.2e-05,2.087452,1.886353,0.998666,-0.000278,Periodic-Other
sinteticPeriodicOther12557,0.078546,0.010372,0.054331,0.023985,0.091051,0.012448,0.001836,27.388109,0.008929,83.585437,0.305308,2.012049,0.447914,1.731012,1.91022,2.024124,2.177319,1.075106,2.452484,0.186056,1.537097,2.334359,5.763605,1.347915,4.652798,3.640524,4.070154,7.7999270000000005e-28,2.750797,3.964327,2.974408,1.674189,3.576024,2.760357,1.610736,0.09975,4.232674,5.215847,4.446951,0.748317,...,0.230974,-0.132604,0.005767,0.072761,0.447481,-0.445398,-0.722104,0.04212,0.868847,0.993289,5e-06,0.58501,0.869238,0.964123,-0.000129,0.109094,0.986123,1.003373,0.286385,0.0,-0.153112,15.261932,0.005729,0.025883,0.429384,0.0,0.012217,0.112991,0.447422,-0.859808,0.741349,0.087012,0.910306,0.999674,3.2e-05,2.109561,1.91705,0.990761,-0.000282,Periodic-Other
sinteticPeriodicOther12558,0.079398,0.010343,0.054232,0.023886,0.091272,0.012574,0.001863,27.200525,0.008775,83.65869,0.307766,2.014368,0.44536,1.750033,1.924206,2.03622,2.149033,1.071289,2.446759,0.186406,1.549812,2.35416,5.774117,1.358008,4.693901,3.644562,4.086277,7.700087e-28,2.798424,3.941023,2.943619,1.664753,3.581871,2.777849,1.60561,0.100605,4.170911,5.229497,4.442694,0.745155,...,0.228795,-0.13438,0.005828,0.072476,0.451837,-0.447533,-0.718877,0.041662,0.879791,0.992411,5e-06,0.583601,0.87071,0.963303,-0.000129,0.108956,1.003187,0.99613,0.283245,0.0,-0.154149,15.067379,0.005676,0.025798,0.429,0.0,0.012238,0.111869,0.44857,-0.863237,0.741925,0.086724,0.919482,1.005106,3.2e-05,2.105609,1.916322,1.006069,-0.000279,Periodic-Other
sinteticPeriodicOther12559,0.078538,0.010291,0.054594,0.023939,0.091194,0.012343,0.001839,27.048434,0.008923,83.815892,0.306164,1.99394,0.444848,1.722165,1.907349,2.036369,2.161784,1.070776,2.4232,0.185631,1.537335,2.338535,5.844413,1.341814,4.625073,3.673124,4.10615,7.8043900000000005e-28,2.761106,3.973219,2.953741,1.663992,3.531743,2.796998,1.625382,0.100679,4.180165,5.296142,4.494614,0.747057,...,0.228888,-0.13318,0.005787,0.073601,0.446123,-0.445697,-0.729669,0.041594,0.86594,1.000177,5e-06,0.594118,0.857325,0.967424,-0.000128,0.108869,1.003116,1.00191,0.285618,0.0,-0.15324,15.263966,0.005641,0.025899,0.42867,0.0,0.012183,0.11233,0.445032,-0.863235,0.740273,0.087476,0.918546,0.995998,3.2e-05,2.102531,1.906676,0.998035,-0.000278,Periodic-Other


Now that both datasets have their targets and the same columns, we can concatenate them and start working with UMAP.
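
Since `pd.concat` aligns on column names, it may be worth checking that both frames really share the same columns before concatenating; a mismatch would otherwise show up only as NaN-filled columns. A minimal sketch on two made-up single-row frames (`originals` and `synthetic` are hypothetical stand-ins, and their values are invented):

```python
import pandas as pd

# Toy stand-ins for lc_features and sintetic_features; values are made up.
originals = pd.DataFrame(
    {"PPE": [0.15], "Mean_r": [15.8], "target": ["E"]},
    index=["ZTF17aadmidf"],
)
synthetic = pd.DataFrame(
    {"PPE": [0.01], "Mean_r": [15.2], "target": ["Periodic-Other"]},
    index=["sinteticPeriodicOther12555"],
)

# Fail early if the column sets (or their order) differ.
assert list(originals.columns) == list(synthetic.columns)

combined = pd.concat([originals, synthetic])
print(combined["target"].value_counts())
```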

In [None]:
data = pd.concat([lc_features, sintetic_features])
print(len(data.index))
data.to_csv("augmented_features.csv")

96638


## UMAP in action

In [None]:
# Reproducibility: to run only this section, execute this cell
# to download the data concatenated in the previous section.
# !gdown --id 1HFEbip5SX591MCLi-S6DKw7LEx-CJFNt # data augmented with the original synthetic curves
# !gdown --id 1WNhzNbJF44Z2upm1XEwZE_gAWM3M2s9A # data augmented with the new synthetic curves
!gdown --id 1qY3HYGq7rH5ZzwM_vgshAK9lHQIzXsTf # data augmented with the new synthetic curves x 10

Downloading...
From: https://drive.google.com/uc?id=1qY3HYGq7rH5ZzwM_vgshAK9lHQIzXsTf
To: /content/augmented_features (2).csv
100% 169M/169M [00:01<00:00, 111MB/s]


To use UMAP for this task we first need to build a UMAP object that will do the work for us. Instantiating the class is enough:

In [None]:
reducer = umap.UMAP()


Before doing any work with the data it is useful to clean it up a little. Since the features are measured on completely different scales, we rescale each one so that they can be compared. Note that the cell below uses `QuantileTransformer`, which by default maps every feature to a uniform distribution on [0, 1] based on its ranks; z-scores (number of standard deviations from the mean) would instead come from `StandardScaler`.

Of course, before any processing we also drop the `target` column, since we are building an unsupervised model.

In [None]:
data = pd.read_csv("augmented_features.csv", index_col=0)
scaled_data = QuantileTransformer().fit_transform(data.drop(labels='target', axis=1))
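
To make the difference between the two rescaling options concrete, here is a minimal sketch on a single skewed toy feature. The log-normal data and the variable names are assumptions for illustration only, not part of the project data.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, StandardScaler

rng = np.random.default_rng(0)
# One heavily skewed toy feature (log-normal), 1000 samples
x = rng.lognormal(mean=0.0, sigma=2.0, size=(1000, 1))

z = StandardScaler().fit_transform(x)
q = QuantileTransformer(n_quantiles=1000).fit_transform(x)

# z-scores: zero mean and unit variance, but outliers stay far from the bulk
print(z.mean(), z.std(), z.max())
# quantile transform: values in [0, 1], roughly uniform, outliers compressed
print(q.min(), q.max(), q.mean())
```

Because UMAP builds its neighbor graph from distances, a few extreme values in a z-scored feature can dominate those distances, while the rank-based transform bounds every feature to the same range.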

Now we need to train our reducer so that it learns the manifold. UMAP follows the sklearn API, so it provides a `fit` method to which we pass the data we want the model to learn from.


In [None]:
reducer = umap.UMAP(random_state=13)
reducer.fit(scaled_data)

UMAP(random_state=13, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

Next comes the `transform` method, which returns the transformed data. For the training data, the fitted embedding is also stored in `reducer.embedding_` after `fit`.

In [None]:
embedding = reducer.transform(scaled_data)

Finally, we enable Bokeh's notebook output and build the scatterplot of our data.

In [None]:
output_notebook()

In [None]:
data_df = pd.DataFrame(embedding, columns=('x', 'y'))
data_df['target'] = [x for x in data.target]

datasource = ColumnDataSource(data_df)
color_mapping = CategoricalColorMapper(factors=["E", "RRL", "CEP", "DSCT", "LPV", "Periodic-Other"],
                                       palette=Spectral6)

plot_figure = figure(
    title='UMAP projection of the periodic light curves',
    plot_width=600,
    plot_height=600,
    tools=('pan, wheel_zoom, reset')
)

plot_figure.circle(
    'x',
    'y',
    source=datasource,
    color=dict(field='target', transform=color_mapping),
    line_alpha=0.6,
    fill_alpha=0.6,
    size=4,
    legend_field='target'  # `legend=` is deprecated since Bokeh 1.4
)
show(plot_figure)



In [None]:
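# ndarray.tofile without sep writes raw binary, so go through pandas to get a real CSV
pd.DataFrame(embedding, columns=['x', 'y']).to_csv("embedding_UMAP.csv", index=False)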
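
NumPy's `ndarray.tofile` with its default `sep=""` writes raw binary regardless of the file name, so a text writer is needed to obtain an actual CSV. A minimal sketch of a round trip with `np.savetxt` and `np.loadtxt`, using a hypothetical 3x2 embedding and a temporary directory:

```python
import os
import tempfile

import numpy as np

# Toy 2D embedding (hypothetical values)
embedding = np.array([[0.5, -1.25], [2.0, 3.75], [-0.1, 0.0]])

path = os.path.join(tempfile.mkdtemp(), "embedding_UMAP.csv")

# Write one row per sample, comma-separated, with a plain header line
np.savetxt(path, embedding, delimiter=",", header="x,y", comments="")

# Read it back, skipping the header, and check that nothing was lost
restored = np.loadtxt(path, delimiter=",", skiprows=1)
print(np.allclose(embedding, restored))
```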

## What if we select the most important features before UMAP?

Following the project statement, we now test what happens if we help the model by working with only a subset of the features.

To choose the features, we study the distribution, the missing entries and the correlation between columns by generating a `pandas_profiling` report of the data.

In [None]:
# Reproducibility: to run only this section, execute this cell
# to download the data concatenated in section 1.
# !gdown --id 1HFEbip5SX591MCLi-S6DKw7LEx-CJFNt # data augmented with the original synthetic samples
# !gdown --id 1WNhzNbJF44Z2upm1XEwZE_gAWM3M2s9A # data augmented with the new synthetic samples
!gdown --id 1qY3HYGq7rH5ZzwM_vgshAK9lHQIzXsTf # data augmented with the new synthetic samples x 10
# !gdown --id 1XCl8BiVOP7aheBYjOHIAM378s_34-8kl # reduced_data: same data, subsampling the
#                                               # over-represented classes

Downloading...
From: https://drive.google.com/uc?id=1qY3HYGq7rH5ZzwM_vgshAK9lHQIzXsTf
To: /content/augmented_features (2).csv
100% 169M/169M [00:01<00:00, 166MB/s]


In [None]:
data = pd.read_csv("augmented_features.csv", index_col=0)
#data = pd.read_csv("reduced_data.csv", index_col=0)

We generate the profiling report.

In [None]:
# !pip install pandas==0.25
# the profiler is only compatible with pandas 0.25;
# we recommend installing it and then restarting the runtime
# profile = pandas_profiling.ProfileReport(data)
# profile.to_file("profile_data_augmented.html")

# or, to fetch the result from the project's drive:
# !gdown --id 12IZp0_0A6yKDQdGfIQUe3F4F8ZlclItB

The "Overview" section of the generated report lists the features that the profiler suggests discarding, because they have a high percentage of zeros or missing values, because they are too skewed, or because they are highly correlated with other columns. We ran tests with different subsets of features, taking these recommendations into account.
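
The same kinds of warnings can be approximated directly in pandas when the report is not available. This is only a sketch on toy data: the column names `feat_a` to `feat_d` and the thresholds (0.5 for zeros or missing values, 0.95 for correlation) are illustrative assumptions, not the values used by the profiler or by us.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
df = pd.DataFrame({
    "feat_a": base,
    "feat_b": base * 2.0 + rng.normal(scale=0.01, size=n),  # almost a copy of feat_a
    "feat_c": rng.normal(size=n),
    "feat_d": np.where(rng.random(n) < 0.9, 0.0, 1.0),      # mostly zeros
})

# Fraction of zeros and of missing values per column
zero_frac = (df == 0).mean()
missing_frac = df.isna().mean()

# Upper triangle of the absolute correlation matrix, so each pair is counted once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

to_drop = set(zero_frac[zero_frac > 0.5].index)
to_drop |= set(missing_frac[missing_frac > 0.5].index)
# For each highly correlated pair, drop the later column and keep the earlier one
to_drop |= {col for col in upper.columns if (upper[col] > 0.95).any()}

print(sorted(to_drop))
```

Which member of a correlated pair to keep is a judgment call; here the earlier column wins simply because of the upper-triangle convention.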

In [None]:
interest_features = [
    'Multiband_period',
    'Mean_g',
    'Mean_r',
    'delta_period_g',
    'delta_period_r',
    'GP_DRW_sigma_r',
    'GP_DRW_tau_g',
    'GP_DRW_tau_r',
    'Harmonics_mag_1_g',
    'Harmonics_mag_1_r',
    'Harmonics_mse_r',  # commenting this one out gives another viable set
    'Harmonics_mse_g',  ##
    'Power_rate_1/4',  ##
    'AndersonDarling_g',  ##
    'AndersonDarling_r',  ##
    'iqr_g',
    'iqr_r',
    'Amplitude_g',
    'Meanvariance_g',
    'Amplitude_r',
    'Meanvariance_r',
    'PairSlopeTrend_g',
    'PairSlopeTrend_r',
    'LinearTrend_r',
    'ExcessVar_r',
    'LinearTrend_g',
    'ExcessVar_g',
    'target',
]



#data = data[interest_features]
data.head()

oid,Multiband_period,PPE,Period_band_g,delta_period_g,Period_band_r,delta_period_r,GP_DRW_sigma_g,GP_DRW_tau_g,GP_DRW_sigma_r,GP_DRW_tau_r,Psi_CS_g,Psi_eta_g,Psi_CS_r,Psi_eta_r,Harmonics_mag_1_g,Harmonics_mag_2_g,Harmonics_mag_3_g,Harmonics_mag_4_g,Harmonics_mag_5_g,Harmonics_mag_6_g,Harmonics_mag_7_g,Harmonics_phase_2_g,Harmonics_phase_3_g,Harmonics_phase_4_g,Harmonics_phase_5_g,Harmonics_phase_6_g,Harmonics_phase_7_g,Harmonics_mse_g,Harmonics_mag_1_r,Harmonics_mag_2_r,Harmonics_mag_3_r,Harmonics_mag_4_r,Harmonics_mag_5_r,Harmonics_mag_6_r,Harmonics_mag_7_r,Harmonics_phase_2_r,Harmonics_phase_3_r,Harmonics_phase_4_r,Harmonics_phase_5_r,Harmonics_phase_6_r,...,MedianBRP_g,PairSlopeTrend_g,PercentAmplitude_g,Q31_g,Rcs_g,Skew_g,SmallKurtosis_g,Std_g,StetsonK_g,Pvar_g,ExcessVar_g,SF_ML_amplitude_g,SF_ML_gamma_g,IAR_phi_g,LinearTrend_g,Amplitude_r,AndersonDarling_r,Autocor_length_r,Beyond1Std_r,Con_r,Gskew_r,Mean_r,Meanvariance_r,MedianAbsDev_r,MedianBRP_r,PairSlopeTrend_r,PercentAmplitude_r,Q31_r,Rcs_r,Skew_r,SmallKurtosis_r,Std_r,StetsonK_r,Pvar_r,ExcessVar_r,SF_ML_amplitude_r,SF_ML_gamma_r,IAR_phi_r,LinearTrend_r,target
ZTF17aadmidf,0.149331,0.150302,0.149331,0.0,0.149331,0.0,0.012568,1.583378,0.009955,0.540005,0.280821,0.534771,0.446245,1.715562,0.226143,0.17336,0.160905,0.12237,0.105802,0.053345,0.040609,4.410925,1.914819,5.7089,3.399976,0.76343,4.947751,0.0003559182,3.322528,3.786361,0.630072,3.320083,3.456475,3.14531,1.485501,0.920114,0.877684,5.396031,4.185686,1.120278,...,0.482759,-0.133333,0.015222,0.224707,0.173023,0.461301,-1.505973,0.115035,0.860189,1.0,5e-05,0.182344,0.114451,0.601279,-0.000188,0.120463,0.99805,1.0,0.285714,0.0,-0.185051,15.844516,0.006337,0.027004,0.428571,0.0,0.013397,0.133858,0.329755,-0.872082,0.68752,0.100411,0.919013,1.0,3.9e-05,0.008849,-0.5,0.02979923,6.9e-05,E
ZTF17aadqpnu,0.544647,0.064055,0.544647,0.0,0.544647,0.0,0.127657,0.604882,0.062287,0.16038,0.227532,0.197661,0.264729,0.250459,0.386156,0.179297,0.108261,0.060052,0.035854,0.031842,0.042522,2.574642,4.798851,0.584667,2.626285,5.90124,1.947181,0.007086322,0.278139,0.135993,0.094853,0.058553,0.036914,0.024242,0.016701,2.317399,4.284166,0.008309,1.562713,3.695137,...,0.447059,-0.033333,0.053033,0.343159,0.11648,-1.114517,0.053458,0.354037,0.892795,1.0,0.0004,0.611645,0.114981,0.194326,3.7e-05,0.402421,1.0,1.0,0.5,0.0,-0.499046,16.622919,0.014997,0.086149,0.5,-0.033333,0.039674,0.278279,0.098373,-0.951954,-0.496121,0.249302,0.911707,1.0,0.000222,0.374863,0.037573,0.002108215,0.000121,RRL
ZTF18aapkwuy,0.578986,0.110494,0.578986,0.0,0.578986,0.0,0.100087,0.071747,0.056377,0.17247,0.262575,0.054283,0.238333,0.052322,0.341754,0.164201,0.114893,0.072789,0.040806,0.018547,0.011639,2.522284,4.543388,0.297373,2.282109,4.034967,0.63288,0.001907514,0.238908,0.100937,0.09743,0.039466,0.037239,0.009078,0.010867,2.078315,4.031695,5.594666,1.338155,2.12687,...,0.5,-0.033333,0.05322,0.554007,0.060306,-0.608783,-1.146912,0.319406,0.840497,1.0,0.000437,0.454134,-0.006394,2e-06,7e-05,0.350948,1.0,1.0,0.450355,0.0,0.128946,14.990062,0.016191,0.244117,0.244681,-0.1,0.032933,0.443769,0.110692,0.100949,-1.461597,0.242701,0.796048,1.0,0.000261,0.351549,0.016288,0.002633387,-2.6e-05,RRL
ZTF18aajoeri,0.620327,0.095026,0.620327,0.0,0.620327,0.0,0.074342,0.385647,0.041275,0.33111,0.247368,0.102651,0.269291,0.126154,0.318212,0.135723,0.079888,0.051215,0.025414,0.012199,0.021768,2.713206,4.243866,0.786959,1.229373,4.593619,4.457981,0.0005181999,0.215537,0.066414,0.086511,0.017082,0.030978,0.005716,0.011889,1.723358,3.504457,5.564069,6.254613,4.078106,...,0.241379,-0.166667,0.028925,0.546084,0.117069,0.227756,-1.476044,0.270579,0.846095,1.0,0.000248,0.421667,0.043597,0.074929,-9.5e-05,0.283624,1.0,1.0,0.444444,0.0,-0.247613,16.909373,0.012214,0.161923,0.208333,-0.1,0.024387,0.39287,0.131229,-0.128824,-1.734664,0.20653,0.920227,1.0,0.000147,0.272383,-0.043357,0.05270067,-8.9e-05,RRL
ZTF18adaiqlm,0.599899,0.025961,0.230787,0.369112,0.599899,0.0,0.0356,93.530163,0.022202,0.05785,0.352077,1.926255,0.356133,0.878343,115.866663,143.465582,137.550345,144.695345,184.65524,220.949264,170.59098,2.218201,2.143108,2.010562,4.68678,4.653281,1.258688,3.6164380000000003e-22,2711.360489,3010.013855,1943.892353,997.080236,1177.144468,853.748343,291.427822,2.633734,5.175253,2.193438,5.359701,1.473237,...,0.5,0.1,0.024009,0.090657,0.352077,-1.730831,8.226508,0.15394,0.818056,1.0,7.4e-05,0.120289,0.288155,0.984298,0.001207,0.232859,0.783907,1.0,0.3125,0.0,0.05707,16.653477,0.009094,0.139583,0.375,-0.033333,0.015696,0.286974,0.224736,0.17461,-0.606513,0.15145,0.936903,1.0,8.1e-05,0.173758,-0.05862,3.71382e-13,-0.000296,E
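
A list used for column selection should not repeat names, since `data[names]` returns one column per list entry and a repeated feature would be counted twice in the distances UMAP computes. A minimal sketch of a defensive selection, on a toy frame with hypothetical values and a deliberately repeated name:

```python
import pandas as pd

# Toy frame with a few of the notebook's column names (hypothetical values)
data = pd.DataFrame({
    "Multiband_period": [0.15, 0.54],
    "Mean_g": [15.8, 16.6],
    "Mean_r": [15.6, 16.4],
    "PPE": [0.15, 0.06],
    "target": ["E", "RRL"],
})

interest_features = ["Multiband_period", "Mean_g", "Mean_r", "Mean_g", "target"]

# Drop repeated names while keeping the original order
unique_features = list(dict.fromkeys(interest_features))

# Names that are not columns of the frame would raise a KeyError on selection
missing = [name for name in unique_features if name not in data.columns]
assert not missing, f"unknown columns: {missing}"

subset = data[unique_features]
print(list(subset.columns))
```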


### We repeat the process with our manually reduced feature set.

In [None]:
scaler = QuantileTransformer()  # RobustScaler, QuantileTransformer or StandardScaler
scaled_data = scaler.fit_transform(data.drop(labels='target', axis=1))
# try n_neighbors in [15, 30, 60, 100]
reducer = umap.UMAP(n_neighbors=30)

In [None]:
reducer.fit(scaled_data)
embedding = reducer.transform(scaled_data)

In [None]:
reducer

UMAP(n_neighbors=30, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True})

Let us visualize the UMAP results for this new attempt.

In [None]:
output_notebook()

In [None]:
data_df = pd.DataFrame(embedding, columns=('x', 'y'))
data_df['target'] = [x for x in data.target]

datasource = ColumnDataSource(data_df)
color_mapping = CategoricalColorMapper(factors=["E", "RRL", "CEP", "DSCT", "LPV", "Periodic-Other"],
                                       palette=Spectral6)

plot_figure = figure(
    title='UMAP projection of the periodic light curves',
    plot_width=600,
    plot_height=600,
    tools=('pan, wheel_zoom, reset')
)

plot_figure.circle(
    'x',
    'y',
    source=datasource,
    color=dict(field='target', transform=color_mapping),
    line_alpha=0.6,
    fill_alpha=0.6,
    size=4,
    legend_field='target'  # `legend=` is deprecated since Bokeh 1.4
)
show(plot_figure)



In [None]:
data_df.to_csv("embedding.csv")