# Feature extraction
## Computational Intelligence 2021-2, Group 8a
Nicolás Canales, Matías Vergara

This notebook reproduces the work of Ignacio Reyes Jainaga, a member of the ALeRCE astronomical broker team. The original work is public and available on GitHub: https://github.com/alercebroker/lc_classifier/blob/main/examples/feature_extraction.ipynb


### Installing dependencies

In [None]:
# pyarrow might be needed to read the data
!python -m pip install Cython
!python -m pip install -e git+https://git@github.com/alercebroker/turbo-fats#egg=turbofats
!python -m pip install -e git+https://git@github.com/alercebroker/mhps#egg=mhps
!python -m pip install -e git+https://git@github.com/alercebroker/P4J#egg=P4J
!python -m pip install pyarrow
!python -m pip install -e git+https://git@github.com/alercebroker/lc_classifier#egg=lc_classifier

Obtaining turbofats from git+https://****@github.com/alercebroker/turbo-fats#egg=turbofats
  Cloning https://****@github.com/alercebroker/turbo-fats to ./src/turbofats
  Running command git clone -q 'https://****@github.com/alercebroker/turbo-fats' /content/src/turbofats
Installing collected packages: turbofats
  Running setup.py develop for turbofats
Successfully installed turbofats-2.0.0
Obtaining mhps from git+https://****@github.com/alercebroker/mhps#egg=mhps
  Cloning https://****@github.com/alercebroker/mhps to ./src/mhps
  Running command git clone -q 'https://****@github.com/alercebroker/mhps' /content/src/mhps
Installing collected packages: mhps
  Running setup.py develop for mhps
Successfully installed mhps-0.0.1
Obtaining P4J from git+https://****@github.com/alercebroker/P4J#egg=P4J
  Cloning https://****@github.com/alercebroker/P4J to ./src/p4j
  Running command git clone -q 'https://****@github.com/alercebroker/P4J' /content/src/p4j
Installing collected packages: P4J
  Run

### Downloading the datasets

In [None]:
!gdown --id 1r80160ipZD6QoDlzqWgDLCGlmxr3qNDn

!gdown --id 1n6k-usORljXHVzm3_l1eYDlc7ev1ur3V

!gdown --id 1G_vyYdBw32heWTAme3AH_YPQ6BvUMUBZ

Downloading...
From: https://drive.google.com/uc?id=1r80160ipZD6QoDlzqWgDLCGlmxr3qNDn
To: /content/alerts_G.csv
100% 247M/247M [00:01<00:00, 186MB/s]
Downloading...
From: https://drive.google.com/uc?id=1n6k-usORljXHVzm3_l1eYDlc7ev1ur3V
To: /content/alerts_R.csv
100% 211M/211M [00:01<00:00, 189MB/s]
Downloading...
From: https://drive.google.com/uc?id=1G_vyYdBw32heWTAme3AH_YPQ6BvUMUBZ
To: /content/present_curves.csv
100% 1.13M/1.13M [00:00<00:00, 17.8MB/s]


### Required imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lc_classifier.utils import LightcurveBuilder
from sklearn import preprocessing

In [None]:
from lc_classifier.features import MHPSExtractor, PeriodExtractor, GPDRWExtractor
from lc_classifier.features import FoldedKimExtractor
from lc_classifier.features import HarmonicsExtractor, IQRExtractor
from lc_classifier.features import PowerRateExtractor
from lc_classifier.features import TurboFatsFeatureExtractor
from lc_classifier.features import FeatureExtractorComposer



In [None]:
alerts_G = pd.read_csv("/content/alerts_G.csv",
                       index_col=0)
alerts_R = pd.read_csv("/content/alerts_R.csv",
                       index_col=0)
present_curves = pd.read_csv("/content/present_curves.csv",
                             index_col=None)

### Structuring the data
The dataframes loaded from `alerts_G.csv` and `alerts_R.csv` contain observations of periodic stars in the g and r bands respectively. Their columns are oid, magnitude, time and error, which do not match the columns required by the library (oid, time, magnitude, error, band). This section restructures the data into the required form.

We start with the g-band alerts.

In [None]:
alerts_G.head()

oid,magnitude,time,error
ZTF18ablusst,17.612375,58606.479734,0.027428
ZTF18ablutwd,13.388574,58648.334549,100.0
ZTF18aawlhni,16.428589,58732.36287,100.0
ZTF18aawlhni,15.383415,58785.170544,0.015321
ZTF18aawlhrt,12.855704,58266.456551,100.0


In [None]:
# a scalar assignment is broadcast to every row, so no explicit loop is needed
alerts_G['band'] = 'g'
alerts_G.head()

oid,magnitude,time,error,band
ZTF18ablusst,17.612375,58606.479734,0.027428,g
ZTF18ablutwd,13.388574,58648.334549,100.0,g
ZTF18aawlhni,16.428589,58732.36287,100.0,g
ZTF18aawlhni,15.383415,58785.170544,0.015321,g
ZTF18aawlhrt,12.855704,58266.456551,100.0,g


In [None]:
columns_titles = ["time","magnitude","error", "band"]
alerts_G=alerts_G.reindex(columns=columns_titles)
alerts_G.head()

oid,time,magnitude,error,band
ZTF18ablusst,58606.479734,17.612375,0.027428,g
ZTF18ablutwd,58648.334549,13.388574,100.0,g
ZTF18aawlhni,58732.36287,16.428589,100.0,g
ZTF18aawlhni,58785.170544,15.383415,0.015321,g
ZTF18aawlhrt,58266.456551,12.855704,100.0,g


Now we process the r-band alerts.

In [None]:
alerts_R.head()

oid,magnitude,time,error
ZTF18ablutuy,14.735083,58334.337859,0.013329
ZTF18ablutvl,17.014446,58643.297141,0.015508
ZTF18abluulc,12.949492,58941.501921,100.0
ZTF18aawlhrt,12.59524,58663.343831,100.0
ZTF18aawlhrt,12.465376,59003.23875,100.0


In [None]:
# a scalar assignment is broadcast to every row, so no explicit loop is needed
alerts_R['band'] = 'r'

In [None]:
alerts_R=alerts_R.reindex(columns=columns_titles)
alerts_R.head()

oid,time,magnitude,error,band
ZTF18ablutuy,58334.337859,14.735083,0.013329,r
ZTF18ablutvl,58643.297141,17.014446,0.015508,r
ZTF18abluulc,58941.501921,12.949492,100.0,r
ZTF18aawlhrt,58663.343831,12.59524,100.0,r
ZTF18aawlhrt,59003.23875,12.465376,100.0,r
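
The g and r passes above apply the same two steps: add a constant `band` column, then reorder the columns. As a minimal sketch, both steps can be folded into one helper. The name `to_library_format` and the toy dataframe are illustrative assumptions, not part of `lc_classifier`.

```python
import pandas as pd

# Column order expected by the library, as listed earlier in this notebook.
LIBRARY_COLUMNS = ["time", "magnitude", "error", "band"]


def to_library_format(alerts: pd.DataFrame, band: str) -> pd.DataFrame:
    """Hypothetical helper: tag every row with `band` and reorder the columns."""
    # assign() returns a new dataframe, so the input is left untouched.
    return alerts.assign(band=band).reindex(columns=LIBRARY_COLUMNS)


# Toy stand-in for alerts_G, built from two rows shown above.
toy_g = pd.DataFrame(
    {
        "magnitude": [17.612375, 13.388574],
        "time": [58606.479734, 58648.334549],
        "error": [0.027428, 100.0],
    },
    index=pd.Index(["ZTF18ablusst", "ZTF18ablutwd"], name="oid"),
)
formatted_g = to_library_format(toy_g, "g")
print(formatted_g)
```

Calling the same helper with `"r"` on the r-band dataframe would give the second half of the input to `pd.concat`.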


We merge both dataframes into one so it can be passed to the extractor.

In [None]:
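light_curves = pd.concat([alerts_G, alerts_R])
print(len(light_curves.index))

# drop rows that contain NaN values
light_curves.dropna(inplace=True)
print(len(light_curves.index))

# keep only rows with strictly positive time, magnitude and error (none are removed here);
# a boolean mask is used because the oid index is not unique, so dropping by
# index label would remove every observation of the affected objects
df = light_curves[
    (light_curves.time > 0) & (light_curves.magnitude > 0) & (light_curves.error > 0)
].copy()
# replace errors equal to 100 with 0.5
df.loc[df.error == 100, "error"] = 0.5

# z-score standardization of the numeric columns (originally labelled min-max);
# this did not work either, so it is left disabled. The table shown below
# appears to come from a run in which this line was enabled.
# df.iloc[:,0:-1] = df.iloc[:,0:-1].apply(lambda x: (x-x.mean())/ x.std(), axis=0)
df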

9003067
8935910


oid,time,magnitude,error,band
ZTF18ablusst,-0.321858,0.364808,-0.031581,g
ZTF18ablutwd,-0.111454,-0.585980,0.023923,g
ZTF18aawlhni,0.310958,0.098335,0.023923,g
ZTF18aawlhni,0.576423,-0.136936,-0.033003,g
ZTF18aawlhrt,-2.031159,-0.705931,0.023923,g
...,...,...,...,...
ZTF18acnokhc,-0.167471,0.230571,-0.033272,r
ZTF18acnokib,-0.568789,0.522128,-0.028261,r
ZTF18acnokif,1.336652,-0.080962,0.023923,r
ZTF18acnoksr,-0.553816,-0.624579,0.023923,r
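
Because `oid` is the index and each object has many observations, index labels are not unique. The toy sketch below (made-up values, not taken from the dataset) illustrates why that matters when cleaning: `DataFrame.drop` with index labels removes every row sharing a label, whereas a boolean mask removes only the offending rows.

```python
import pandas as pd

# Toy data: object "A" has one valid row and one row with a non-positive error.
toy = pd.DataFrame(
    {
        "time": [1.0, 2.0, 3.0],
        "magnitude": [15.0, 15.2, 14.1],
        "error": [0.02, -1.0, 0.03],
    },
    index=pd.Index(["A", "A", "B"], name="oid"),
)

# Dropping by index label removes every row that carries the label "A".
dropped_by_label = toy.drop(toy[toy.error <= 0].index)

# A boolean mask removes only the single offending row.
filtered_by_mask = toy[toy.error > 0]

print(len(dropped_by_label), len(filtered_by_mask))  # prints: 1 2
```

In this notebook no rows fail the positivity checks, so both approaches would give the same result here; the difference only shows up once an invalid row exists.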


### Computing features on the dataset

To compute the features, we pass the dataframe of observations to the `compute_features` method of our feature extractor. The result has one row per object and one column per feature.
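
As a rough picture of that output shape, the sketch below uses a plain pandas aggregation as a stand-in for the extractors. The statistics and column names are illustrative assumptions; they are not the features `lc_classifier` computes.

```python
import pandas as pd

# Toy observations in the layout used above: oid as index, one row per observation.
toy_curves = pd.DataFrame(
    {
        "time": [1.0, 2.0, 3.0, 1.5, 2.5],
        "magnitude": [15.0, 15.4, 15.2, 12.0, 12.6],
        "error": [0.02, 0.03, 0.02, 0.5, 0.5],
        "band": ["g", "g", "r", "r", "r"],
    },
    index=pd.Index(["A", "A", "A", "B", "B"], name="oid"),
)

# Stand-in "features": simple per-object statistics, NOT the library's extractors.
toy_features = toy_curves.groupby(level="oid")["magnitude"].agg(
    mean_magnitude="mean",
    min_magnitude="min",
    max_magnitude="max",
    n_observations="count",
)
print(toy_features)
```

Five observation rows collapse into two object rows, indexed by `oid`, with one column per computed quantity.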

In [None]:
bands = ['g', 'r']
feature_extractor = FeatureExtractorComposer(
    [
        MHPSExtractor(bands),
        PeriodExtractor(bands),
        GPDRWExtractor(bands),
        FoldedKimExtractor(bands),
        HarmonicsExtractor(bands),
        IQRExtractor(bands),
        PowerRateExtractor(bands),
        TurboFatsFeatureExtractor(bands)
    ]
)

In [None]:
features = feature_extractor.compute_features(light_curves)
features

AssertionError: ignored
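
The output above does not say which assertion failed. One way to narrow it down is to check the input against the layout described earlier before calling `compute_features`. The sketch below is a hypothetical helper, `check_light_curves`; it is not part of `lc_classifier`, and the conditions it tests are assumptions drawn from this notebook (the required columns, the `oid` index, the two bands), not the library's actual assertions.

```python
import pandas as pd

REQUIRED_COLUMNS = ["time", "magnitude", "error", "band"]


def check_light_curves(df: pd.DataFrame, bands=("g", "r")) -> list:
    """Hypothetical pre-flight check; returns a list of human-readable problems."""
    problems = []
    if df.index.name != "oid":
        problems.append(f"index is named {df.index.name!r}, expected 'oid'")
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        # The remaining checks need these columns, so stop here.
        problems.append(f"missing columns: {missing}")
        return problems
    if df[REQUIRED_COLUMNS].isna().any().any():
        problems.append("NaN values present")
    unknown = sorted(set(df["band"]) - set(bands))
    if unknown:
        problems.append(f"unexpected bands: {unknown}")
    return problems


# Toy input with a band that is not in the expected set.
toy = pd.DataFrame(
    {"time": [1.0], "magnitude": [15.0], "error": [0.02], "band": ["i"]},
    index=pd.Index(["A"], name="oid"),
)
print(check_light_curves(toy))
```

An empty list means the input passed these particular checks; it does not guarantee that the extractor's own assertions will pass.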

In [None]:
period_feature_names = [f for f in features.columns if 'period' in f.lower()]

In [None]:
computed_periods = features['Multiband_period']
features.to_csv('featured_alerts.csv')

# Aside

In [None]:
alerts = pd.read_csv("filtered_alerts.csv")
alerts.head(10)

Unnamed: 0.1,Unnamed: 0,oid,candid,dec,fid,mjd,magpsf_corr,ra,sigmapsf_corr
0,43,ZTF18ablusst,852479735715015015,-6.465756,1,58606.479734,17.612375,285.974146,0.027428
1,66,ZTF18ablutuy,580337852915015006,-24.483665,2,58334.337859,14.735083,283.820669,0.013329
2,67,ZTF18ablutvl,889297145115010019,-7.30629,2,58643.297141,17.014446,284.240721,0.015508
3,70,ZTF18ablutwd,894334545615015010,-14.001698,1,58648.334549,13.388574,275.98252,100.0
4,77,ZTF18abluulc,1187501925815010001,-14.258699,2,58941.501921,12.949492,275.322053,100.0
5,108,ZTF18aawlhni,978362871815010007,60.938628,1,58732.36287,16.428589,291.99088,100.0
6,109,ZTF18aawlhni,1031170541815015004,60.93858,1,58785.170544,15.383415,291.990882,0.015321
7,114,ZTF18aawlhrt,512456551115015002,58.951151,1,58266.456551,12.855704,286.19255,100.0
8,115,ZTF18aawlhrt,909343831115015005,58.951104,2,58663.343831,12.59524,286.192925,100.0
9,116,ZTF18aawlhrt,985241991115010002,58.951175,1,58739.241991,13.448686,286.192633,100.0
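
The rows above appear to line up with the first rows of `alerts_G` and `alerts_R` (same oid, time and magnitude values), which suggests that `fid` 1 corresponds to the g band and `fid` 2 to the r band. Under that assumption, the sketch below shows one way to go from this table to the layout the library expects. The mapping `FID_TO_BAND` and the helper name are illustrative, inferred from the rows shown here rather than from any documentation.

```python
import pandas as pd

# Assumed mapping, inferred by comparing the rows shown in this notebook.
FID_TO_BAND = {1: "g", 2: "r"}


def alerts_to_light_curves(alerts: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rename columns and derive `band` from `fid`."""
    renamed = alerts.rename(
        columns={"mjd": "time", "magpsf_corr": "magnitude", "sigmapsf_corr": "error"}
    )
    renamed["band"] = renamed["fid"].map(FID_TO_BAND)
    return renamed.set_index("oid")[["time", "magnitude", "error", "band"]]


# Toy stand-in built from the first two rows of the table above.
toy_alerts = pd.DataFrame(
    {
        "oid": ["ZTF18ablusst", "ZTF18ablutuy"],
        "fid": [1, 2],
        "mjd": [58606.479734, 58334.337859],
        "magpsf_corr": [17.612375, 14.735083],
        "sigmapsf_corr": [0.027428, 0.013329],
    }
)
toy_light_curves = alerts_to_light_curves(toy_alerts)
print(toy_light_curves)
```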
