# UMAP: Uniform Manifold Approximation and Projection
## Computational Intelligence 2021-2, Group 8a
Nicolás Canales, Matías Vergara

This notebook applies UMAP to the light curves with computed features to reduce their dimensionality to 2D, and then visualizes the result in a scatterplot in search of clusters.

Recall that the objects we are working with are periodic, classified by ALeRCE as "LPV", "Periodic-Other", "RRL", "CEP", "E" or "DSCT".

### References:
umap-learn 0.5.1, project description on PyPI: https://pypi.org/project/umap-learn/

UMAP documentation, How to Use UMAP: https://umap-learn.readthedocs.io/en/latest/basic_usage.html


## Installing the required libraries

In [3]:
!pip install umap-learn[plot]
!pip install pandas_profiling
!pip install jupyter
!pip install iprogress
!pip install summarytools

Collecting umap-learn[plot]
  Downloading umap-learn-0.5.1.tar.gz (80 kB)
Collecting pynndescent>=0.5
  Downloading pynndescent-0.5.5.tar.gz (1.1 MB)
Collecting datashader
  Downloading datashader-0.13.0-py2.py3-none-any.whl (15.8 MB)
Collecting datashape>=0.5.1
  Downloading datashape-0.5.2.tar.gz (76 kB)
Collecting partd>=0.3.10
  Downloading partd-1.2.0-py3-none-any.whl (19 kB)
Collecting distributed>=2.0
  Downloading distributed-2021.10.0-py3-none-any.whl (791 kB)
Collecting fsspec>=0.6.0
  Downloading fsspec-2021.10.1-py3-none-any.whl (125 kB)
Collecting multipledispatch>=0.4.7
  Downloading multipledispatch-0.6.0-py

In [4]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
import umap 
from summarytools.summarytools import dfSummary
import pandas_profiling
import re
from io import BytesIO
from PIL import Image
import base64
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper
from bokeh.palettes import Spectral10, Spectral6


## Fetching the light curves with features

In [None]:
# fetch the light curves with their features
!gdown --id 19uB-u0gYCGKFlKFXCKIwsK5G4_MCFvV1
# also fetch the synthetic light curves, which we must merge with our
# original curves
!gdown --id 1r8BcRI5vJgi5s9qOGsja_mlAk3S6fW7P
# also fetch the labels_set.csv file, with the classifications
!gdown --id 1LU1sCIVXO8BQRMeKCCqu1vZGnceV6P5c

Downloading...
From: https://drive.google.com/uc?id=19uB-u0gYCGKFlKFXCKIwsK5G4_MCFvV1
To: /content/lc_features.csv
100% 143M/143M [00:00<00:00, 168MB/s]
Downloading...
From: https://drive.google.com/uc?id=1r8BcRI5vJgi5s9qOGsja_mlAk3S6fW7P
To: /content/sintetic_features.csv
100% 4.23M/4.23M [00:00<00:00, 136MB/s]
Downloading...
From: https://drive.google.com/uc?id=1LU1sCIVXO8BQRMeKCCqu1vZGnceV6P5c
To: /content/labels_set.csv
100% 10.2M/10.2M [00:00<00:00, 90.1MB/s]


## Preparing the data for UMAP

In [None]:
lc_features = pd.read_csv("lc_features.csv", sep=',', index_col=1)
sintetic_features = pd.read_csv("sintetic_features.csv", index_col=1)
labels = pd.read_csv("labels_set.csv", index_col=0)

print(len(lc_features.index))
print(len(sintetic_features.index))

87003
2606


Features could be computed for 87,003 original light curves and 2,606 synthetic ones; however, many of them contain NaN values. If we call `dropna()` directly, we are left with little more than 20,000 rows, which discards too much information. We need another way to handle this problem.

One viable approach is to check which features are non-null for most of the data. To do so we use pandas-profiling to generate an analysis of the variables, via the following (commented-out) code cell.

Important: pandas-profiling requires pandas 0.25, which is not the version Colab ships by default. The runtime must be restarted after installing it.

In [None]:
#!pip install pandas==0.25
#profile = pandas_profiling.ProfileReport(lc_features)
#profile.to_file("profile_lc_features.html")

The cell above generates the file `profile_lc_features.html`, which we studied to address this problem. In particular, we note that:
- Eta_e_g has 48.8% missing values
- MaxSlope_g has 48.8% missing values
- Eta_e_r has 52.3% missing values
- MaxSlope_r has 52.3% missing values
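
The same kind of percentages can be cross-checked directly in pandas, without the profiler. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration, not the project data) to compute the fraction of missing values per column and to compare how many rows survive `dropna()` with and without the sparsest columns. The 0.4 threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Toy stand-in for lc_features (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Multiband_period": [0.15, 0.54, 0.58, 0.62],
    "Eta_e_g": [np.nan, 1.2, np.nan, 0.7],
    "MaxSlope_g": [np.nan, 3.1, np.nan, 2.4],
})

# Fraction of missing values per column, sorted from worst to best
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Rows kept by dropna() with all columns, and after dropping the sparse ones
rows_all_columns = len(toy.dropna())
sparse = missing_fraction[missing_fraction > 0.4].index
rows_without_sparse = len(toy.drop(columns=sparse).dropna())
print(rows_all_columns, rows_without_sparse)
```

Dropping a few mostly-empty columns first means `dropna()` only removes rows that are incomplete in the remaining, better-populated features.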

We will try dropping these columns before applying `dropna()`, to see whether this reduces the fraction of data lost.

In [None]:
lc_features.drop(axis=1, labels=['Eta_e_g', 'Eta_e_r', 'MaxSlope_g', 'MaxSlope_r'], inplace=True)
sintetic_features.drop(axis=1, labels=['Eta_e_g', 'Eta_e_r', 'MaxSlope_g', 'MaxSlope_r'], inplace=True)

Let us now see how many rows we keep after applying `dropna()`:

---



In [None]:
lc_features.dropna(inplace=True)
sintetic_features.dropna(inplace=True)
print(len(lc_features.index))
print(len(sintetic_features.index))

74908
2173
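
As a quick arithmetic check, the retained and lost fractions can be derived from the four row counts printed by the cells above. This is only a sketch; the dictionary names are chosen here for illustration.

```python
# Row counts printed by the notebook before and after dropna()
before = {"original": 87003, "synthetic": 2606}
after = {"original": 74908, "synthetic": 2173}

total_before = sum(before.values())
total_after = sum(after.values())

# Fraction of rows removed by dropna() over both datasets
lost_fraction = 1 - total_after / total_before

print(total_before, total_after)
print(f"{100 * lost_fraction:.1f}% of the rows lost")
```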


We retain 77,081 rows! That is, we only lost about 14.0% of the data (12,528 of 89,609 rows), which is good. However, we must also check that the lost rows do not belong to a single class (or, even worse, to our least represented classes). Let us see how many entries we have per class:

In [None]:
count = {"LPV":0, "Periodic-Other": 0, "RRL": 0, "CEP": 0, "E": 0, "DSCT": 0}
for oid in lc_features.index.unique():
  try:
    alerce_class = labels.loc[oid].values[0]
    count[alerce_class] += 1
  except Exception:
    # oids without a usable label are skipped
    pass
for oid in sintetic_features.index.unique():
    alerce_class = oid[8:]
    alerce_class = "".join(re.findall("[a-zA-Z]+", alerce_class))
    if alerce_class == "PeriodicOther":
      alerce_class = "Periodic-Other"
    elif alerce_class == "DSCTS":
      alerce_class = "DSCT"
    else:
      alerce_class = "CEP"
    count[alerce_class] += 1
count


{'CEP': 906,
 'DSCT': 1336,
 'E': 33486,
 'LPV': 9799,
 'Periodic-Other': 2104,
 'RRL': 29450}

Recall that the distribution was previously as follows (see the data-augmentation notebook):

{'CEP': 1236,
 'DSCT': 1464,
 'E': 37900,
 'LPV': 14045,
 'Periodic-Other': 2512,
 'RRL': 32464}
 
In absolute terms, most of the lost entries belong to E, LPV and RRL, the three most represented classes. In relative terms, LPV (about 30%) and CEP (about 27%) lose the largest share. We will assume that we can continue without problems.
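
The per-class loss can be computed from the two dictionaries shown above. The sketch below copies those counts verbatim; the variable names are chosen here for illustration.

```python
# Class counts before dropna() (data-augmentation notebook) and after it
before = {"CEP": 1236, "DSCT": 1464, "E": 37900, "LPV": 14045,
          "Periodic-Other": 2512, "RRL": 32464}
after = {"CEP": 906, "DSCT": 1336, "E": 33486, "LPV": 9799,
         "Periodic-Other": 2104, "RRL": 29450}

# Absolute and relative loss per class
loss = {c: before[c] - after[c] for c in before}
relative = {c: loss[c] / before[c] for c in before}

# Report the classes from largest to smallest relative loss
for c in sorted(relative, key=relative.get, reverse=True):
    print(f"{c:15s} lost {loss[c]:5d} ({100 * relative[c]:.1f}%)")
```

Looking at both the absolute and the relative loss matters here, because a small class can lose a large share of its entries while contributing little to the total.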


So far our data has the form `oid, feature1, feature2, ..., featuren`. This is enough for UMAP to perform the dimensionality reduction and let us visualize the data. But there is a problem: the class of each data point, the **target**, is not included.


**IMPORTANT: We want to include the target in the data only to color the visualization produced by UMAP, not for training!**


To solve this we cross-reference the data again, this time between `labels_set.csv` and our dataframes of original and synthetic light curves, retrieving the ALeRCE class `classALeRCE` of each curve, adding it to the dataframe, and then using that column as the color target in the visualization obtained with UMAP. For the synthetic curves, which are not included in `labels_set.csv`, we obtain the class from their name.
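
The name-based rule for synthetic curves can be sketched as a small standalone function. The function name, the `prefix` parameter and the `"CEP"` key are assumptions made for this illustration: the notebook itself treats every fragment other than `PeriodicOther` and `DSCTS` as CEP, whereas this sketch raises an error on unexpected fragments.

```python
import re

# Mapping from the name fragment to the ALeRCE class. "PeriodicOther" and
# "DSCTS" come from the notebook; the "CEP" key is an assumption, since the
# notebook maps every other fragment to CEP.
NAME_TO_CLASS = {"PeriodicOther": "Periodic-Other", "DSCTS": "DSCT", "CEP": "CEP"}

def class_from_synthetic_oid(oid: str, prefix: str = "sintetic") -> str:
    """Return the class encoded in an identifier such as 'sinteticDSCTS491'."""
    # Drop the prefix, then keep only the letters (the trailing digits are a counter)
    fragment = "".join(re.findall("[a-zA-Z]+", oid[len(prefix):]))
    return NAME_TO_CLASS[fragment]  # raises KeyError on an unexpected fragment

print(class_from_synthetic_oid("sinteticPeriodicOther633"))
print(class_from_synthetic_oid("sinteticDSCTS491"))
```

Failing loudly on an unknown fragment is a design choice of this sketch; it avoids silently labelling a misnamed curve as CEP.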

In [None]:
# First we add the target to the original curves
lc_features["target"] = ""
for index, row in lc_features.iterrows():
    alerce_class = labels.loc[index].values[0]
    # .loc avoids the chained assignment that triggers SettingWithCopyWarning
    lc_features.loc[index, "target"] = alerce_class

# And the same for the synthetic curves
sintetic_features["target"] = ""
for index, row in sintetic_features.iterrows():
    alerce_class = index[8:]
    alerce_class = "".join(re.findall("[a-zA-Z]+", alerce_class))
    if alerce_class == "PeriodicOther":
        alerce_class = "Periodic-Other"
    elif alerce_class == "DSCTS":
        alerce_class = "DSCT"
    else:
        alerce_class = "CEP"
    sintetic_features.loc[index, "target"] = alerce_class


In [None]:
lc_features.head()

oid,Unnamed: 0,Multiband_period,PPE,Period_band_g,delta_period_g,Period_band_r,delta_period_r,GP_DRW_sigma_g,GP_DRW_tau_g,GP_DRW_sigma_r,GP_DRW_tau_r,Psi_CS_g,Psi_eta_g,Psi_CS_r,Psi_eta_r,Harmonics_mag_1_g,Harmonics_mag_2_g,Harmonics_mag_3_g,Harmonics_mag_4_g,Harmonics_mag_5_g,Harmonics_mag_6_g,Harmonics_mag_7_g,Harmonics_phase_2_g,Harmonics_phase_3_g,Harmonics_phase_4_g,Harmonics_phase_5_g,Harmonics_phase_6_g,Harmonics_phase_7_g,Harmonics_mse_g,Harmonics_mag_1_r,Harmonics_mag_2_r,Harmonics_mag_3_r,Harmonics_mag_4_r,Harmonics_mag_5_r,Harmonics_mag_6_r,Harmonics_mag_7_r,Harmonics_phase_2_r,Harmonics_phase_3_r,Harmonics_phase_4_r,Harmonics_phase_5_r,...,MedianBRP_g,PairSlopeTrend_g,PercentAmplitude_g,Q31_g,Rcs_g,Skew_g,SmallKurtosis_g,Std_g,StetsonK_g,Pvar_g,ExcessVar_g,SF_ML_amplitude_g,SF_ML_gamma_g,IAR_phi_g,LinearTrend_g,Amplitude_r,AndersonDarling_r,Autocor_length_r,Beyond1Std_r,Con_r,Gskew_r,Mean_r,Meanvariance_r,MedianAbsDev_r,MedianBRP_r,PairSlopeTrend_r,PercentAmplitude_r,Q31_r,Rcs_r,Skew_r,SmallKurtosis_r,Std_r,StetsonK_r,Pvar_r,ExcessVar_r,SF_ML_amplitude_r,SF_ML_gamma_r,IAR_phi_r,LinearTrend_r,target
ZTF17aadmidf,0,0.149331,0.150302,0.149331,0.0,0.149331,0.0,0.012568,1.583378,0.009955,0.540005,0.280821,0.534771,0.446245,1.715562,0.226143,0.17336,0.160905,0.12237,0.105802,0.053345,0.040609,4.410925,1.914819,5.7089,3.399976,0.76343,4.947751,0.0003559182,3.322528,3.786361,0.630072,3.320083,3.456475,3.14531,1.485501,0.920114,0.877684,5.396031,4.185686,...,0.482759,-0.133333,0.015222,0.224707,0.173023,0.461301,-1.505973,0.115035,0.860189,1.0,5e-05,0.182344,0.114451,0.601279,-0.000188,0.120463,0.99805,1.0,0.285714,0.0,-0.185051,15.844516,0.006337,0.027004,0.428571,0.0,0.013397,0.133858,0.329755,-0.872082,0.68752,0.100411,0.919013,1.0,3.9e-05,0.008849,-0.5,0.02979923,6.9e-05,E
ZTF17aadqpnu,0,0.544647,0.064055,0.544647,0.0,0.544647,0.0,0.127657,0.604882,0.062287,0.16038,0.227532,0.197661,0.264729,0.250459,0.386156,0.179297,0.108261,0.060052,0.035854,0.031842,0.042522,2.574642,4.798851,0.584667,2.626285,5.90124,1.947181,0.007086322,0.278139,0.135993,0.094853,0.058553,0.036914,0.024242,0.016701,2.317399,4.284166,0.008309,1.562713,...,0.447059,-0.033333,0.053033,0.343159,0.11648,-1.114517,0.053458,0.354037,0.892795,1.0,0.0004,0.611645,0.114981,0.194326,3.7e-05,0.402421,1.0,1.0,0.5,0.0,-0.499046,16.622919,0.014997,0.086149,0.5,-0.033333,0.039674,0.278279,0.098373,-0.951954,-0.496121,0.249302,0.911707,1.0,0.000222,0.374863,0.037573,0.002108215,0.000121,RRL
ZTF18aapkwuy,0,0.578986,0.110494,0.578986,0.0,0.578986,0.0,0.100087,0.071747,0.056377,0.17247,0.262575,0.054283,0.238333,0.052322,0.341754,0.164201,0.114893,0.072789,0.040806,0.018547,0.011639,2.522284,4.543388,0.297373,2.282109,4.034967,0.63288,0.001907514,0.238908,0.100937,0.09743,0.039466,0.037239,0.009078,0.010867,2.078315,4.031695,5.594666,1.338155,...,0.5,-0.033333,0.05322,0.554007,0.060306,-0.608783,-1.146912,0.319406,0.840497,1.0,0.000437,0.454134,-0.006394,2e-06,7e-05,0.350948,1.0,1.0,0.450355,0.0,0.128946,14.990062,0.016191,0.244117,0.244681,-0.1,0.032933,0.443769,0.110692,0.100949,-1.461597,0.242701,0.796048,1.0,0.000261,0.351549,0.016288,0.002633387,-2.6e-05,RRL
ZTF18aajoeri,0,0.620327,0.095026,0.620327,0.0,0.620327,0.0,0.074342,0.385647,0.041275,0.33111,0.247368,0.102651,0.269291,0.126154,0.318212,0.135723,0.079888,0.051215,0.025414,0.012199,0.021768,2.713206,4.243866,0.786959,1.229373,4.593619,4.457981,0.0005181999,0.215537,0.066414,0.086511,0.017082,0.030978,0.005716,0.011889,1.723358,3.504457,5.564069,6.254613,...,0.241379,-0.166667,0.028925,0.546084,0.117069,0.227756,-1.476044,0.270579,0.846095,1.0,0.000248,0.421667,0.043597,0.074929,-9.5e-05,0.283624,1.0,1.0,0.444444,0.0,-0.247613,16.909373,0.012214,0.161923,0.208333,-0.1,0.024387,0.39287,0.131229,-0.128824,-1.734664,0.20653,0.920227,1.0,0.000147,0.272383,-0.043357,0.05270067,-8.9e-05,RRL
ZTF18adaiqlm,0,0.599899,0.025961,0.230787,0.369112,0.599899,0.0,0.0356,93.530163,0.022202,0.05785,0.352077,1.926255,0.356133,0.878343,115.866663,143.465582,137.550345,144.695345,184.65524,220.949264,170.59098,2.218201,2.143108,2.010562,4.68678,4.653281,1.258688,3.6164380000000003e-22,2711.360489,3010.013855,1943.892353,997.080236,1177.144468,853.748343,291.427822,2.633734,5.175253,2.193438,5.359701,...,0.5,0.1,0.024009,0.090657,0.352077,-1.730831,8.226508,0.15394,0.818056,1.0,7.4e-05,0.120289,0.288155,0.984298,0.001207,0.232859,0.783907,1.0,0.3125,0.0,0.05707,16.653477,0.009094,0.139583,0.375,-0.033333,0.015696,0.286974,0.224736,0.17461,-0.606513,0.15145,0.936903,1.0,8.1e-05,0.173758,-0.05862,3.71382e-13,-0.000296,E


In [None]:
sintetic_features.tail()

oid,Unnamed: 0,Multiband_period,PPE,Period_band_g,delta_period_g,Period_band_r,delta_period_r,GP_DRW_sigma_g,GP_DRW_tau_g,GP_DRW_sigma_r,GP_DRW_tau_r,Psi_CS_g,Psi_eta_g,Psi_CS_r,Psi_eta_r,Harmonics_mag_1_g,Harmonics_mag_2_g,Harmonics_mag_3_g,Harmonics_mag_4_g,Harmonics_mag_5_g,Harmonics_mag_6_g,Harmonics_mag_7_g,Harmonics_phase_2_g,Harmonics_phase_3_g,Harmonics_phase_4_g,Harmonics_phase_5_g,Harmonics_phase_6_g,Harmonics_phase_7_g,Harmonics_mse_g,Harmonics_mag_1_r,Harmonics_mag_2_r,Harmonics_mag_3_r,Harmonics_mag_4_r,Harmonics_mag_5_r,Harmonics_mag_6_r,Harmonics_mag_7_r,Harmonics_phase_2_r,Harmonics_phase_3_r,Harmonics_phase_4_r,Harmonics_phase_5_r,...,MedianBRP_g,PairSlopeTrend_g,PercentAmplitude_g,Q31_g,Rcs_g,Skew_g,SmallKurtosis_g,Std_g,StetsonK_g,Pvar_g,ExcessVar_g,SF_ML_amplitude_g,SF_ML_gamma_g,IAR_phi_g,LinearTrend_g,Amplitude_r,AndersonDarling_r,Autocor_length_r,Beyond1Std_r,Con_r,Gskew_r,Mean_r,Meanvariance_r,MedianAbsDev_r,MedianBRP_r,PairSlopeTrend_r,PercentAmplitude_r,Q31_r,Rcs_r,Skew_r,SmallKurtosis_r,Std_r,StetsonK_r,Pvar_r,ExcessVar_r,SF_ML_amplitude_r,SF_ML_gamma_r,IAR_phi_r,LinearTrend_r,target
sinteticPeriodicOther633,0,0.127443,0.005017,0.152625,0.025182,0.127443,0.0,3.934079e-08,0.998934,4.196704e-08,0.999679,0.263516,2.499104,0.327747,1.16145,1.372516,1.890075,1.590491,0.950599,1.176937,0.756059,0.521274,2.556384,5.545694,1.892509,5.100773,1.389646,3.609022,1.818692e-05,5.98709,4.033693,1.339784,4.127114,5.074448,3.667979,2.588053,3.104508,1.726231,5.503727,2.811288,...,0.411765,-0.066667,0.003596,0.04044,0.248242,-1.203855,2.686932,0.042547,0.749447,6.36187e-09,-3.5e-05,-0.5,-0.5,0.000145,-9.3e-05,0.13947,0.258995,1.0,0.307692,0.0,-0.01756,32.96092,0.002211,0.050806,0.153846,-0.066667,0.004496,0.099388,0.312941,0.035321,0.460526,0.072865,0.843195,0.0002466146,-3.3e-05,-0.5,-0.5,1.0,-0.000156,Periodic-Other
sinteticDSCTS491,0,0.083514,0.137735,0.083514,0.0,0.091147,0.007633,3.742134e-08,0.592388,1.992906e-07,0.7603,0.33377,0.246206,0.321865,0.916986,0.101666,0.057051,0.015967,0.028457,0.0206,0.014218,0.007776,2.931082,1.383578,0.095799,3.621489,3.236329,5.928237,9.369493e-05,1218.59491,3653.235638,897.336787,2256.72005,660.41945,642.233798,276.802677,0.037789,0.379481,0.10779,1.156308,...,0.036364,-0.1,0.005188,0.181838,0.200711,0.107153,-1.865912,0.091855,0.979917,4.256703e-31,-0.000234,-0.5,-0.5,1.0,0.000102,0.086088,0.999993,1.0,0.428571,0.0,-0.090653,33.321064,0.001947,0.017339,0.5,0.1,0.00394,0.128389,0.269912,-0.536587,-1.368408,0.064884,0.942261,2.53195e-10,-0.000242,-0.5,-0.5,3.423586e-11,0.000144,DSCT
sinteticDSCTS8,0,0.098159,0.013782,0.111012,0.012854,0.104559,0.0064,1.608409e-07,0.966339,1.187407e-07,0.776796,0.477788,1.096832,0.308153,1.154244,12.730592,12.401894,11.883018,6.120208,19.936341,5.168524,15.743436,5.400054,3.5423,2.711066,1.717517,0.500552,4.68555,2.018828e-25,9.649632,8.418455,9.025389,12.019949,5.264975,7.485553,4.53525,6.27208,2.411248,4.591545,1.921039,...,0.0,-0.1,0.011194,0.289021,0.334076,-0.090374,-1.413512,0.162687,0.956014,0.04662794,-0.000123,-0.5,-0.5,0.002181,0.00029,0.166893,0.999687,1.0,0.363636,0.0,-0.197266,21.054536,0.006492,0.06826,0.181818,0.0,0.01256,0.268138,0.308644,-0.194821,-1.867828,0.136696,0.984484,0.008963273,-0.000144,-0.5,-0.5,8.055458e-08,4e-05,DSCT
sinteticDSCTS92,0,0.104379,0.10387,0.104379,0.0,0.104379,0.0,1.037372e-07,0.689132,4.689424e-07,1.018331,0.291852,0.241856,0.320504,0.821848,0.197345,0.056199,0.036105,0.021088,0.005824,0.004223,0.011139,2.382247,4.347025,1.14936,5.570234,6.202756,3.543052,0.0009061869,8076.133668,7174.326338,5568.592434,3505.13481,1749.833647,607.103517,136.276741,3.196495,0.178314,3.492908,0.552542,...,0.575758,0.166667,0.011674,0.220363,0.148818,-1.011943,-0.037191,0.14441,0.883211,4.5560230000000005e-33,-0.000292,-0.5,-0.5,1.0,-7.3e-05,0.190466,0.999717,1.0,0.588235,0.0,-0.188234,40.485894,0.003425,0.096349,0.117647,0.133333,0.007018,0.289347,0.315649,-0.201834,-1.615254,0.13867,0.959731,2.052954e-09,-0.000299,-0.5,-0.5,1.0,-0.00016,DSCT
sinteticDSCTS485,0,0.106283,0.033613,0.057405,0.048878,0.66269,0.556408,3.424709e-07,1.000652,5.264926e-07,0.999085,0.322591,3.172811,0.362486,3.509261,6.17418,4.075227,3.202519,4.452905,3.370464,1.84699,1.481397,4.995326,5.554025,1.263401,2.002958,4.945968,4.119609,2.797256e-08,3.357118,2.447763,7.302775,2.570187,1.8132,4.153707,3.35412,5.838108,4.965541,5.904749,0.317184,...,0.375,0.1,0.004965,0.086056,0.46759,-0.471773,-0.713994,0.047514,0.937346,1.557748e-06,-0.000489,-0.5,-0.5,0.953948,2.6e-05,0.300276,0.997753,1.0,0.166667,0.0,0.384714,19.739365,0.010327,0.052328,0.5,-0.033333,0.025019,0.082399,0.362486,1.570362,9.851841,0.203848,0.6906,0.03952101,-0.000436,-0.5,-0.5,0.02798729,0.000145,DSCT


Now that both datasets have their targets and the same columns, we can concatenate them and start working with UMAP.

In [None]:
data = pd.concat([lc_features, sintetic_features])
print(len(data.index))
data.to_csv("augmented_features.csv")

77081


## UMAP in action

In [None]:
# Reproducibility: to run only this section, run this cell
# to download the data concatenated in the previous section.
!gdown --id 1HFEbip5SX591MCLi-S6DKw7LEx-CJFNt

Downloading...
From: https://drive.google.com/uc?id=1HFEbip5SX591MCLi-S6DKw7LEx-CJFNt
To: /content/augmented_features.csv
100% 131M/131M [00:00<00:00, 144MB/s]


To use UMAP for this task we first need to construct a UMAP object that will do the work for us. Instantiating the class is enough:

In [None]:
reducer = umap.UMAP()


Before doing any work with the data it helps to clean it up a little. Since the measurements are on completely different scales, we convert each feature into z-scores (number of standard deviations from the mean) so that they are comparable.

Of course, before any processing we also drop the `target` column, since we want an unsupervised model.

In [None]:
data = pd.read_csv("augmented_features.csv", index_col=0)
scaled_data = StandardScaler().fit_transform(data.drop(labels='target', axis=1))
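
To make the z-score step concrete, the sketch below checks on a small made-up matrix (the values in `X` are assumptions for illustration) that `StandardScaler` matches the manual computation: subtract each column's mean and divide by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with columns on very different scales (made-up values)
X = np.array([[0.15, 16.1, 115.9],
              [0.54, 17.5, 0.39],
              [0.58, 15.3, 0.34],
              [0.62, 17.1, 0.32]])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean, divide by the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that UMAP uses by default.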

Now we need to train our reducer, letting it learn the manifold. UMAP follows the sklearn API and provides the `fit` method, to which we pass the data we want the model to learn from.


In [None]:
reducer = umap.UMAP(random_state=13)
reducer.fit(scaled_data)

UMAP(a=None, angular_rp_forest=False, b=None, dens_frac=0.0, dens_lambda=0.0,
     dens_var_shift=0.1, densmap=False, disconnection_distance=None,
     force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
     local_connectivity=1.0, low_memory=True, metric='euclidean',
     metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None, n_jobs=-1,
     n_neighbors=15, negative_sample_rate=5, output_dens=False,
     output_metric='euclidean', output_metric_kwds=None, random_state=13,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None, ...)

Then we have the `transform` method, which returns the transformed data.

In [None]:
embedding = reducer.transform(scaled_data)

Finally, we load the visualization library into the notebook and build the scatterplot of our data.

In [None]:
output_notebook()

In [None]:
data_df = pd.DataFrame(embedding, columns=('x', 'y'))
data_df['target'] = [x for x in data.target]

datasource = ColumnDataSource(data_df)
color_mapping = CategoricalColorMapper(factors=["E", "RRL", "CEP", "DSCT", "LPV", "Periodic-Other"],
                                       palette=Spectral6)

plot_figure = figure(
    title='UMAP projection of the periodic light curves',
    plot_width=600,
    plot_height=600,
    tools=('pan, wheel_zoom, reset')
)

plot_figure.circle(
    'x',
    'y',
    source=datasource,
    color=dict(field='target', transform=color_mapping),
    line_alpha=0.6,
    fill_alpha=0.6,
    size=4,
    legend_field='target'
)
show(plot_figure)



In [None]:
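# ndarray.tofile without a separator writes raw binary; savetxt writes a real CSV
np.savetxt("embedding_UMAP.csv", embedding, delimiter=",")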
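
When saving the embedding under a `.csv` name, note that `ndarray.tofile` writes raw binary unless a separator is given, while `np.savetxt` writes delimited text that `np.loadtxt` or pandas can read back. The sketch below shows a round trip on a made-up array (the values and the temporary path are assumptions for illustration).

```python
import os
import tempfile
import numpy as np

# Toy 2D embedding standing in for the UMAP output (made-up values)
embedding = np.array([[1.5, -2.0], [0.25, 3.75], [4.0, 0.5]])

path = os.path.join(tempfile.mkdtemp(), "embedding_UMAP.csv")

# savetxt writes one row per line, with values separated by the delimiter
np.savetxt(path, embedding, delimiter=",")

# loadtxt reads the file back into an array of the same shape
restored = np.loadtxt(path, delimiter=",")
print(restored.shape, np.allclose(embedding, restored))
```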

## What if we pick the most important features before UMAP?

Following the project statement, we now test what happens if we help the model by working with only a subset of the features.

To choose the features, we study the distribution, missing entries and correlation between columns by generating a pandas-profiling report of the data.

In [76]:
# Reproducibility: to run only this section, run this cell
# to download the data concatenated in section 1.
!gdown --id 1HFEbip5SX591MCLi-S6DKw7LEx-CJFNt #augmented_features, light curves with their features
                                              #includes original and synthetic curves
!gdown --id 1XCl8BiVOP7aheBYjOHIAM378s_34-8kl #reduced_data, same data but subsampling the
                                              #overrepresented classes

Downloading...
From: https://drive.google.com/uc?id=1HFEbip5SX591MCLi-S6DKw7LEx-CJFNt
To: /content/augmented_features.csv
100% 131M/131M [00:01<00:00, 93.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1XCl8BiVOP7aheBYjOHIAM378s_34-8kl
To: /content/reduced_data.csv
100% 23.9M/23.9M [00:00<00:00, 111MB/s]


In [102]:
data = pd.read_csv("augmented_features.csv", index_col=0)

We generate the profile report

In [78]:
#!pip install pandas==0.25
#the profiler is only compatible with pandas 0.25
#it is recommended to install it and then restart the runtime
#profile = pandas_profiling.ProfileReport(data)
#profile.to_file("profile_data_augmented.html")

# or, to fetch the result from the project's drive:
!gdown --id 12IZp0_0A6yKDQdGfIQUe3F4F8ZlclItB

Downloading...
From: https://drive.google.com/uc?id=12IZp0_0A6yKDQdGfIQUe3F4F8ZlclItB
To: /content/profile_data_augmented.html
100% 3.49M/3.49M [00:00<00:00, 111MB/s]


The "Overview" section of the generated report lists the features that the profiler suggests discarding, because they have a high percentage of zeros or missing values, because they are too skewed, or because they are highly correlated. We run tests with different subsets of features, taking these recommendations into account.
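
The three kinds of warnings mentioned above (many zeros, strong skew, high correlation) can be approximated with plain pandas. The sketch below uses synthetic columns built to trigger each one; the column names, the random data and the 0.95 correlation threshold are all assumptions for illustration, and the profiler's own criteria may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + rng.normal(scale=0.01, size=500),   # nearly collinear with "a"
    "skewed": rng.exponential(size=500) ** 3,                    # heavy right tail
    "mostly_zero": np.where(rng.random(500) < 0.9, 0.0, 1.0),    # roughly 90% zeros
})

zero_fraction = (toy == 0).mean()   # share of exact zeros per column
skewness = toy.skew()               # sample skewness per column
corr = toy.corr().abs()             # absolute Pearson correlation matrix

# Pairs of distinct columns whose absolute correlation exceeds the threshold
high_corr = [(c1, c2) for i, c1 in enumerate(corr.columns)
             for c2 in corr.columns[i + 1:] if corr.loc[c1, c2] > 0.95]

print(zero_fraction.round(2))
print(skewness.round(2))
print(high_corr)
```

For a highly correlated pair, keeping only one of the two columns removes redundancy without discarding much information.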

In [111]:
interest_features = [
                     'Multiband_period',
                     'delta_period_g',
                     'delta_period_r',
                 #    'GP_DRW_sigma_r', too skewed
                 #    'GP_DRW_tau_g',
                 #    'GP_DRW_sigma_r',
                 #    'GP_DRW_tau_r',
                     'Harmonics_mag_1_g',
                     'Harmonics_mag_1_r',
                     'Harmonics_mse_r', # commenting this one out gives another viable subset
                     'Harmonics_mse_g', ##
                     #'Power_rate_1/4', ##
                     #'AndersonDarling_g', ##
                     #'AndersonDarling_g' ##
                     'iqr_g',
                     'iqr_r',
                     'Amplitude_g',
                     'Mean_g',
                     'Meanvariance_g',
                     'Amplitude_r',
                     'Mean_r',
                     'Meanvariance_r',
                     'PairSlopeTrend_r',
                     'target',
                     'LinearTrend_r',
                     'ExcessVar_r',
                     'LinearTrend_g',
                     'ExcessVar_g'
                     
]


data = data[interest_features]
data.head()

oid,Multiband_period,delta_period_g,delta_period_r,Harmonics_mag_1_g,Harmonics_mag_1_r,Harmonics_mse_r,Harmonics_mse_g,iqr_g,iqr_r,Amplitude_g,Mean_g,Meanvariance_g,Amplitude_r,Mean_r,Meanvariance_r,PairSlopeTrend_r,target,LinearTrend_r,ExcessVar_r,LinearTrend_g,ExcessVar_g
ZTF17aadmidf,0.149331,0.0,0.0,0.226143,3.322528,7.618142e-29,0.0003559182,0.224707,0.133858,0.157265,16.096489,0.007147,0.120463,15.844516,0.006337,0.0,E,6.9e-05,3.9e-05,-0.000188,5e-05
ZTF17aadqpnu,0.544647,0.0,0.0,0.386156,0.278139,0.001774301,0.007086322,0.343159,0.278279,0.584628,17.479521,0.020254,0.402421,16.622919,0.014997,-0.033333,RRL,0.000121,0.000222,3.7e-05,0.0004
ZTF18aapkwuy,0.578986,0.0,0.0,0.341754,0.238908,0.001250283,0.001907514,0.554007,0.443769,0.47613,15.262718,0.020927,0.350948,14.990062,0.016191,-0.1,RRL,-2.6e-05,0.000261,7e-05,0.000437
ZTF18aajoeri,0.620327,0.0,0.0,0.318212,0.215537,0.000176957,0.0005181999,0.546084,0.39287,0.390076,17.094142,0.015829,0.283624,16.909373,0.012214,-0.1,RRL,-8.9e-05,0.000147,-9.5e-05,0.000248
ZTF18adaiqlm,0.599899,0.369112,0.0,115.866663,2711.360489,1.00099e-05,3.6164380000000003e-22,0.090657,0.286974,0.257177,17.396314,0.008849,0.232859,16.653477,0.009094,-0.033333,E,-0.000296,8.1e-05,0.001207,7.4e-05


### We repeat the process with our manually reduced feature set.

In [112]:
scaled_data = StandardScaler().fit_transform(data.drop(labels='target', axis=1))
reducer = umap.UMAP(random_state=13)

In [113]:
reducer.fit(scaled_data)
embedding = reducer.transform(scaled_data)

Let us visualize the UMAP results for this new attempt

In [109]:
output_notebook()

In [114]:
data_df = pd.DataFrame(embedding, columns=('x', 'y'))
data_df['target'] = [x for x in data.target]

datasource = ColumnDataSource(data_df)
color_mapping = CategoricalColorMapper(factors=["E", "RRL", "CEP", "DSCT", "LPV", "Periodic-Other"],
                                       palette=Spectral6)

plot_figure = figure(
    title='UMAP projection of the periodic light curves',
    plot_width=600,
    plot_height=600,
    tools=('pan, wheel_zoom, reset')
)

plot_figure.circle(
    'x',
    'y',
    source=datasource,
    color=dict(field='target', transform=color_mapping),
    line_alpha=0.6,
    fill_alpha=0.6,
    size=4,
    legend_field='target'
)
show(plot_figure)

