# Autoencoders
## Computational Intelligence 2021-2, Group 8a
Nicolás Canales, Matías Vergara

This notebook applies autoencoders to the features computed from the light curves and visualizes the resulting code in search of clusters. To that end, we use a two-dimensional bottleneck.

Recall that the objects we are working with are the periodic ones, classified by ALeRCE as "LPV", "Periodic-Other", "RRL", "CEP", "E" or "DSCT".

### References:
Adnan Karol, Introduction to 2 dimensional LSTM autoencoder - https://medium.com/analytics-vidhya/introduction-to-2-dimensional-lstm-autoencoder-47c238fd827f

B. Zong et al. “Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection”. ICLR 2018 

Jason Brownlee, A Gentle Introduction to LSTM Autoencoders - https://machinelearningmastery.com/lstm-autoencoders/

### Dependencies

In [1]:
import pandas as pd
import numpy as np

from tensorflow import keras
# Use the public tensorflow.keras API throughout: mixing it with the private
# tensorflow.python.keras modules can fail when layers from both are combined.
from tensorflow.keras.layers import (Input, Add, Dense, Activation, ZeroPadding2D,
                                     BatchNormalization, Flatten, Conv2D,
                                     AveragePooling2D, MaxPooling2D, Dropout,
                                     RepeatVector, TimeDistributed, LSTM)
from tensorflow.keras.models import Model, Sequential, load_model
from tensorflow.keras.initializers import glorot_uniform
from tensorflow.keras.optimizers import SGD

import matplotlib.pyplot as plt
%matplotlib inline

import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

from tensorflow.keras.callbacks import EarlyStopping


from io import BytesIO
from PIL import Image
import base64
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper
from bokeh.palettes import Spectral10, Spectral6

output_notebook()
early_stopping = EarlyStopping(monitor='loss', patience=4)


### Load the data

In [2]:
!gdown --id 1HFEbip5SX591MCLi-S6DKw7LEx-CJFNt #augmented_features.csv

Downloading...
From: https://drive.google.com/uc?id=1HFEbip5SX591MCLi-S6DKw7LEx-CJFNt
To: /content/augmented_features.csv
100% 131M/131M [00:00<00:00, 150MB/s]


In [3]:
data = pd.read_csv("augmented_features.csv", index_col=0)

We standardize the features, since they have very different scales:

In [None]:
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data.drop(labels="target", axis=1))
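As a minimal sketch of what this standardization does, the snippet below reproduces the z-score with plain NumPy on a tiny made-up matrix. The values in `toy` are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Toy matrix (hypothetical values): two features on very different scales.
toy = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 5000.0]])

# Column-wise z-score: subtract the mean and divide by the population
# standard deviation (ddof=0), which is what StandardScaler computes.
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns hold the same values, so neither dominates the mean squared error used to train the autoencoder.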

### Build the autoencoder model


In [None]:
encoding_dim = 9
input_df = Input(shape=(99,))


# The Glorot uniform initializer (also called Xavier uniform) draws weights from a
# uniform distribution in [-limit, limit], with limit = sqrt(6 / (fan_in + fan_out))

x = Dense(encoding_dim, activation='relu')(input_df)
x = Dense(120, activation='relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(120, activation='relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(90, activation='relu', kernel_initializer = 'glorot_uniform')(x)

encoded = Dense(2, activation='relu', kernel_initializer = 'glorot_uniform')(x)

x = Dense(90, activation='relu', kernel_initializer = 'glorot_uniform')(encoded)
x = Dense(160, activation='relu', kernel_initializer = 'glorot_uniform')(x)

decoded = Dense(99, kernel_initializer = 'glorot_uniform')(x)

# autoencoder
autoencoder = Model(input_df, decoded)

# encoder - used for our dimension reduction
encoder = Model(input_df, encoded)

autoencoder.compile(optimizer= 'adam', loss='mean_squared_error')
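The layers above use `kernel_initializer='glorot_uniform'`. As a rough sketch of what that initializer does, the NumPy code below draws a weight matrix from a uniform distribution whose range depends on the layer's input and output sizes. The helper names `glorot_uniform_limit` and `glorot_uniform_sample` are hypothetical and only illustrate the formula; Keras applies its own implementation internally.

```python
import numpy as np

def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the sampling interval: sqrt(6 / (fan_in + fan_out))."""
    return np.sqrt(6.0 / (fan_in + fan_out))

def glorot_uniform_sample(fan_in, fan_out, seed=0):
    """Draw a (fan_in, fan_out) weight matrix uniformly in [-limit, limit]."""
    rng = np.random.default_rng(seed)
    limit = glorot_uniform_limit(fan_in, fan_out)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: the bottleneck layer maps 90 units to 2 units.
limit = glorot_uniform_limit(90, 2)
w = glorot_uniform_sample(90, 2)
print(w.shape, limit)
```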

In [None]:
autoencoder.fit(data_scaled, data_scaled, batch_size = 64, epochs = 120,  verbose = 1, callbacks=[early_stopping])

Epoch 1/120
Epoch 2/120
Epoch 3/120
Epoch 4/120
Epoch 5/120
Epoch 6/120
Epoch 7/120
Epoch 8/120
Epoch 9/120
Epoch 10/120
Epoch 11/120
Epoch 12/120
Epoch 13/120


<keras.callbacks.History at 0x7f0c43404fd0>
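The training loss is the mean squared error between each input row and its reconstruction. A small sketch of how that error could be inspected per sample is shown below with toy arrays; in the notebook, `x` would be `data_scaled` and `x_hat` would come from `autoencoder.predict(data_scaled)`. The helper `per_sample_mse` is a hypothetical name, not part of Keras.

```python
import numpy as np

def per_sample_mse(x, x_hat):
    """Mean squared reconstruction error of each row."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return ((x - x_hat) ** 2).mean(axis=1)

# Toy inputs and reconstructions (hypothetical values).
x = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 1.0]])
x_hat = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 3.0]])

errors = per_sample_mse(x, x_hat)
print(errors)         # one error per row
print(errors.mean())  # overall MSE, the quantity the loss averages
```

Rows with a large error are the ones the two-dimensional code represents worst.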

In [None]:
pred = encoder.predict(data_scaled)

In [None]:
pred.shape

(77081, 2)

In [None]:
pred[0:20]

array([[ 1.3817878 ,  1.6406355 ],
       [ 0.45461386,  1.2749498 ],
       [ 0.599365  ,  1.3200868 ],
       [ 0.62364024,  1.4842097 ],
       [ 1.8596442 ,  1.6630875 ],
       [ 0.17116296,  1.0598805 ],
       [ 1.1707871 ,  1.2000608 ],
       [ 1.4414935 ,  1.4075136 ],
       [ 0.59204364,  1.4923215 ],
       [ 0.6546238 ,  1.2917267 ],
       [ 0.47445714,  1.242319  ],
       [ 1.0609846 ,  1.3968232 ],
       [ 1.0526092 ,  1.4474565 ],
       [ 2.225582  ,  2.2782984 ],
       [41.992897  , 60.195786  ],
       [ 0.9513678 ,  1.6687112 ],
       [ 1.2134728 ,  1.7045724 ],
       [ 1.952766  ,  1.5179607 ],
       [ 0.48694438,  1.1981642 ],
       [ 1.631479  ,  1.6549686 ]], dtype=float32)
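In the sample above, one code (41.99, 60.20) lies far from the rest, and points like it can compress the useful range of a scatter plot. One possible way to flag such points before plotting is sketched below, using five rows copied from that output. The median/MAD score and the threshold of 10 are illustrative choices that do not come from the original notebook.

```python
import numpy as np

# Five rows copied from the printed encoder output above.
codes = np.array([[1.3817878, 1.6406355],
                  [0.45461386, 1.2749498],
                  [2.225582, 2.2782984],
                  [41.992897, 60.195786],
                  [0.9513678, 1.6687112]])

# Robust score: distance to the per-axis median, in units of the
# median absolute deviation (MAD).
median = np.median(codes, axis=0)
mad = np.median(np.abs(codes - median), axis=0)
score = np.abs(codes - median) / mad

# Flag a point if it is extreme along either axis (threshold is an assumption).
is_outlier = (score > 10).any(axis=1)
print(is_outlier)
```

The flagged rows could then be plotted separately or excluded when choosing axis ranges.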

In [12]:
def show_scatter(pred):
  """Plot the 2-D codes in `pred`, colored by the class in the global `data`."""
  data_df = pd.DataFrame(pred, columns=('x', 'y'))
  # Copy the labels by position: `data` keeps its own index, while data_df
  # uses a fresh 0..n-1 index.
  data_df['target'] = data['target'].to_numpy()

  datasource = ColumnDataSource(data_df)
  color_mapping = CategoricalColorMapper(factors=["E", "RRL", "CEP", "DSCT", "LPV", "Periodic-Other"],
                                        palette=Spectral6)

  # `width`/`height` and `legend_field` replace `plot_width`/`plot_height`
  # and `legend`, which newer Bokeh releases no longer accept.
  plot_figure = figure(
      title='Autoencoder projection of the periodic light curves',
      width=1200,
      height=600,
      tools='pan, wheel_zoom, reset'
  )

  plot_figure.cross(
      'x',
      'y',
      source=datasource,
      color=dict(field='target', transform=color_mapping),
      line_alpha=0.5,
      fill_alpha=0,
      size=4,
      legend_field='target'
  )
  show(plot_figure)

show_scatter(pred)



### What if we try a subset of the features instead of all of them?
As we did with UMAP, we check whether the encoder output improves when only certain features are considered instead of all the available ones.

In [4]:
interest_features = [
                     'Multiband_period',
                     'Period_band_g',
                     'Period_band_r',
                     'GP_DRW_sigma_r',  # NOTE: listed again two entries below; 'GP_DRW_sigma_g' may have been intended
                     'GP_DRW_tau_g',
                     'GP_DRW_sigma_r',
                     'GP_DRW_tau_r',
                     'Harmonics_mag_1_g',
                     'Harmonics_mag_1_r',
                     'Harmonics_mse_r', # commenting this one out gives another viable set
                     'Harmonics_mse_g', ##
                     'Power_rate_1/4', ##
                     'Power_rate_1/3', ##
                     'Power_rate_1/2', ##
                     'Power_rate_2', ##
                     'Power_rate_3', ##
                     'Power_rate_4', ##
                     'AndersonDarling_g', ##
                     'AndersonDarling_r', ##
                     'Autocor_length_g',
                     'Autocor_length_r',
                     'IAR_phi_g',
                     'IAR_phi_r',
                     'Skew_g',
                     'Skew_r',
                     'StetsonK_r',
                     'StetsonK_g',
                     'iqr_g',
                     'iqr_r',
                     'Amplitude_g',
                     'Mean_g',
                     'Meanvariance_g',
                     'Amplitude_r',
                     'Mean_r',
                     'Meanvariance_r',
                     'PairSlopeTrend_r',
                     'target',
                     'LinearTrend_r',
                     'ExcessVar_r',
                     'LinearTrend_g',
                     'ExcessVar_g',
                     'PPE',
                     'Psi_CS_g',
                     'Psi_CS_r',
                     'Psi_eta_g',
                     'Psi_eta_r',
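                     'iqr_g',  # NOTE: duplicate of an earlier entry; it yields a repeated column
                     'iqr_r',  # NOTE: duplicate of an earlier entry; it yields a repeated column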
]

data = data[interest_features]
data.head()
data.shape

(77081, 48)
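The list above names `'GP_DRW_sigma_r'`, `'iqr_g'` and `'iqr_r'` twice. Selecting columns with a list that repeats a label makes pandas return that column twice, which is why 48 columns appear here. A minimal sketch of how the repeats could be detected and removed while keeping the original order is shown below; the shortened `features` list is hypothetical and only mimics the same kind of repetition.

```python
# Shortened, hypothetical feature list with the same kind of repeats.
features = ['Period_band_g', 'GP_DRW_sigma_r', 'GP_DRW_tau_g',
            'GP_DRW_sigma_r', 'iqr_g', 'target', 'iqr_g']

# Labels that appear more than once.
duplicates = sorted({f for f in features if features.count(f) > 1})

# dict.fromkeys keeps the first occurrence of each label, in order.
unique_features = list(dict.fromkeys(features))

# Network input size: every unique column except the label column.
input_dim = len([f for f in unique_features if f != 'target'])

print(duplicates)
print(unique_features)
print(input_dim)
```

Deriving `input_dim` from the deduplicated list avoids hard-coding a number that silently includes repeated columns.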

We update our data_scaled variable so that it is now computed from the columns of interest only.

In [None]:
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data.drop(labels="target", axis=1))

In [None]:
# This is the dimension of the original space: the 48 selected columns minus 'target'
input_dim = 47
# This is the dimension of the latent space (encoding space)
latent_dim = 2

encoder = Sequential([
    Dense(120, activation='relu', kernel_initializer = 'glorot_uniform', input_shape=(input_dim,)),
    Dense(120, activation='relu', kernel_initializer = 'glorot_uniform'),
    Dense(90, activation='relu', kernel_initializer = 'glorot_uniform'),
    Dense(latent_dim, activation='relu')
])

decoder = Sequential([
    Dense(90, activation='relu', input_shape=(latent_dim,), kernel_initializer = 'glorot_uniform'), 
    Dense(128, activation='relu', kernel_initializer = 'glorot_uniform'),
    Dense(256, activation='relu', kernel_initializer = 'glorot_uniform'),
    Dense(input_dim, activation=None)
])

autoencoder = keras.models.Sequential([encoder, decoder])
autoencoder.compile(loss='mse', optimizer='adam')

In [None]:
history = autoencoder.fit(data_scaled, data_scaled, epochs=50, batch_size=64,verbose=1, callbacks=[early_stopping])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50


In [None]:
pred = encoder.predict(data_scaled)

In [None]:
pred[0:20]

array([[ 0.50806236,  0.7747422 ],
       [ 0.6769413 ,  1.0023401 ],
       [ 0.7162702 ,  1.0616109 ],
       [ 0.6612924 ,  1.0178068 ],
       [ 1.0229675 ,  0.5481758 ],
       [ 0.89576256,  0.92377543],
       [ 1.649776  ,  1.3931658 ],
       [ 0.78747565,  1.2354195 ],
       [ 0.61165094,  0.8603741 ],
       [ 0.548795  ,  1.0146844 ],
       [ 0.66140044,  1.0145738 ],
       [ 0.5646633 ,  1.247732  ],
       [ 0.6478301 ,  1.3229676 ],
       [ 0.66227925,  0.7385375 ],
       [13.647603  , 15.397951  ],
       [ 0.41059864,  1.0502427 ],
       [ 0.43002743,  0.86859065],
       [ 1.1944492 ,  0.5545502 ],
       [ 0.6450565 ,  0.8017977 ],
       [ 0.6835917 ,  0.7767534 ]], dtype=float32)

In [None]:
show_scatter(pred)



### Hmm... what if we try the subsampled data?


In [5]:
!gdown --id 1XCl8BiVOP7aheBYjOHIAM378s_34-8kl #reduced_data: same data, but subsampling the
                                              #overrepresented classes


Downloading...
From: https://drive.google.com/uc?id=1XCl8BiVOP7aheBYjOHIAM378s_34-8kl
To: /content/reduced_data.csv
100% 23.9M/23.9M [00:00<00:00, 75.2MB/s]
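The reduced file contains the same data with the overrepresented classes subsampled. How it was built is not shown in this notebook; the sketch below is only one possible approach, assuming a hypothetical cap `max_per_class` on the number of rows kept per class. The function name `subsample_indices` and the toy labels are illustrative.

```python
import numpy as np

def subsample_indices(labels, max_per_class, seed=0):
    """Return sorted row indices keeping at most max_per_class rows per class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        if len(idx) > max_per_class:
            # Randomly drop rows only from classes above the cap.
            idx = rng.choice(idx, size=max_per_class, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))

# Toy labels (hypothetical): "E" is overrepresented.
labels = np.array(["E"] * 6 + ["RRL"] * 3 + ["CEP"] * 1)
kept = subsample_indices(labels, max_per_class=3)
print(labels[kept])
```

With a DataFrame, the returned positions could be passed to `data.iloc[kept]` to build the reduced table.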


In [31]:
data = pd.read_csv("reduced_data.csv", index_col=0)

In [7]:
interest_features = [
                     'Multiband_period',
                     'Period_band_g',
                     'Period_band_r',
                     'GP_DRW_sigma_r',  # NOTE: listed again two entries below; 'GP_DRW_sigma_g' may have been intended
                     'GP_DRW_tau_g',
                     'GP_DRW_sigma_r',
                     'GP_DRW_tau_r',
                     'Harmonics_mag_1_g',
                     'Harmonics_mag_1_r',
                     'Harmonics_mse_r', # commenting this one out gives another viable set
                     'Harmonics_mse_g', ##
                     'Power_rate_1/4', ##
                     'Power_rate_1/3', ##
                     'Power_rate_1/2', ##
                     'Power_rate_2', ##
                     'Power_rate_3', ##
                     'Power_rate_4', ##
                     'AndersonDarling_g', ##
                     'AndersonDarling_r', ##
                     'Autocor_length_g',
                     'Autocor_length_r',
                     'target'
                     
]

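# Uncomment to try with interest_features only (it does not improve the result).
# The shape printed below presumably comes from a run with these lines uncommented.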
#data = data[interest_features]
#data.head()
#data.shape

(14081, 22)

In [32]:
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data.drop(labels="target", axis=1))

In [34]:
# This is the dimension of the original space
input_dim = 99
# This is the dimension of the latent space (encoding space)
latent_dim = 2

encoder = Sequential([
    Dense(2048, activation='relu', kernel_initializer = 'glorot_uniform', input_shape=(input_dim,)),
    Dense(1024, activation='relu', kernel_initializer = 'glorot_uniform'),
    Dense(512, activation='relu', kernel_initializer = 'glorot_uniform'),
    Dense(256, activation='selu', kernel_initializer = 'glorot_uniform'),
    Dense(latent_dim, activation='sigmoid')
])

decoder = Sequential([
    Dense(512, activation='relu', input_shape=(latent_dim,), kernel_initializer = 'glorot_uniform'), 
    Dense(1024, activation='relu', kernel_initializer = 'glorot_uniform'),
    Dense(2048, activation='relu', kernel_initializer = 'glorot_uniform'),
    Dense(input_dim, activation=None)
])
autoencoder = keras.models.Sequential([encoder, decoder])
autoencoder.compile(loss='mse', optimizer='adam')
history = autoencoder.fit(data_scaled, data_scaled, epochs=50, batch_size=64,verbose=1, callbacks=[early_stopping])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
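This attempt switches the bottleneck activation from `relu` to `sigmoid`, which changes the region the two-dimensional code can occupy. A small NumPy sketch of the difference is shown below; the pre-activation values in `z` are made up for illustration.

```python
import numpy as np

def relu(z):
    """ReLU: negative inputs become 0, positive inputs are unbounded."""
    return np.maximum(z, 0.0)

def sigmoid(z):
    """Sigmoid: squashes any input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values of a bottleneck unit.
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

relu_codes = relu(z)
sigmoid_codes = sigmoid(z)
print(relu_codes)     # non-negative, can grow without bound
print(sigmoid_codes)  # always between 0 and 1
```

With `sigmoid`, every code falls inside the unit square, so a single extreme point can no longer stretch the axes of the scatter plot, although the codes may saturate near 0 or 1.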


In [35]:
pred = encoder.predict(data_scaled)

We use separate cells to plot the different attempts, so that we keep each one and can compare them as we go.

In [36]:
show_scatter(pred)



In [27]:
show_scatter(pred)



In [None]:
show_scatter(pred)



In [None]:
show_scatter(pred)

