# Data augmentation
## Computational Intelligence 2021-2, Group 8a
Nicolás Canales, Matías Vergara

This notebook creates synthetic entries to partially balance the classes of periodic objects in the dataset.

Recall that periodic objects are those classified by ALeRCE as "LPV", "Periodic-Other", "RRL", "CEP", "E" or "DSCT". Let's see how many distinct objects the dataset contains for each class:


In [None]:
# required imports
import pandas as pd
import random

In [None]:
# download filtered_alerts.csv, which contains all the filtered alerts
!gdown --id 1bXgIR1nLk0iXrGtWaXBGBbl-goe23pXX
# also download labels_set.csv, which contains the classifications
!gdown --id 1LU1sCIVXO8BQRMeKCCqu1vZGnceV6P5c

Downloading...
From: https://drive.google.com/uc?id=1bXgIR1nLk0iXrGtWaXBGBbl-goe23pXX
To: /content/filtered_alerts.csv
100% 756M/756M [00:04<00:00, 187MB/s]
Downloading...
From: https://drive.google.com/uc?id=1LU1sCIVXO8BQRMeKCCqu1vZGnceV6P5c
To: /content/labels_set.csv
100% 10.2M/10.2M [00:00<00:00, 140MB/s]


In [None]:
# load the CSV files into pandas DataFrames
filtered_alerts = pd.read_csv("filtered_alerts.csv", index_col = 0).dropna()
labels = pd.read_csv("labels_set.csv", index_col = 0)

In [None]:
# count how many distinct oids in the filtered alerts belong
# to each periodic class in the labels
count = {"LPV": 0, "Periodic-Other": 0, "RRL": 0, "CEP": 0, "E": 0, "DSCT": 0}
for oid in filtered_alerts.index.unique():
  try:
    alerce_class = labels.loc[oid].values[0]
    count[alerce_class] += 1
  except (KeyError, TypeError):
    # skip oids with no usable label or with a non-periodic class
    pass
count

# 87015 periodic objects in total
# the synthetic objects add almost 3% more data

{'CEP': 618,
 'DSCT': 732,
 'E': 37900,
 'LPV': 14045,
 'Periodic-Other': 1256,
 'RRL': 32464}
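
The same per-class count can be sketched without an explicit loop. The block below is only an illustration on made-up data: `alerts`, `toy_labels` and their oids are hypothetical stand-ins for `filtered_alerts` and `labels`, and it assumes each oid appears at most once in the labels.

```python
import pandas as pd

# Toy stand-ins for filtered_alerts and labels (made-up oids and values).
alerts = pd.DataFrame(
    {"magpsf_corr": [18.1, 18.3, 14.2, 15.0, 16.5]},
    index=pd.Index(["A", "A", "B", "C", "D"], name="oid"),
)
toy_labels = pd.DataFrame(
    {"classALeRCE": ["CEP", "RRL", "CEP", "SNIa"]},
    index=pd.Index(["A", "B", "C", "Z"], name="oid"),
)

periodic = ["LPV", "Periodic-Other", "RRL", "CEP", "E", "DSCT"]

# Look up the class of every distinct object in the alerts;
# objects without a label get NaN.
classes = toy_labels["classALeRCE"].reindex(alerts.index.unique())

# Keep only periodic classes and count objects per class.
toy_count = classes[classes.isin(periodic)].value_counts().to_dict()
print(toy_count)
```

Classes with no objects are simply absent from the resulting dictionary, unlike the pre-filled `count` dictionary used in the cell above.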

There is a strong class imbalance: CEP, DSCT and Periodic-Other are heavily underrepresented. We will create synthetic light curves for these classes, based on the existing ones, by slightly perturbing their magnitude (`magpsf_corr`), time (`mjd`) and extended error (`sigmapsf_corr_ext`). We do not use the plain error attribute (`sigmapsf_corr`), following the teaching assistant's recommendation, so its value does not really matter: it is ignored during feature extraction and can be left unchanged or set to None.
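
To make the imbalance concrete, the following sketch computes each class's share from the counts printed above, plus the relative amount of data added by one synthetic copy of every minority object. The names `class_counts`, `shares`, `minority` and `added` are introduced here for illustration only.

```python
# Counts copied from the output above.
class_counts = {"CEP": 618, "DSCT": 732, "E": 37900,
                "LPV": 14045, "Periodic-Other": 1256, "RRL": 32464}

total = sum(class_counts.values())

# Share of each class, in percent of all periodic objects.
shares = {k: 100 * v / total for k, v in class_counts.items()}
for k, v in sorted(shares.items(), key=lambda kv: kv[1]):
    print("{:15s} {:6.2f}%".format(k, v))

# One synthetic copy per minority object doubles those three classes.
minority = ["CEP", "DSCT", "Periodic-Other"]
added = sum(class_counts[k] for k in minority)
print(total, added, round(100 * added / total, 2))
```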

In [None]:
# the least fancy (and, in my view, easiest) way to do it is to take
# the OIDs of the CEP, DSCT and Periodic-Other objects
# from labels_set.
CEPS = []
DSCTS = []
POS = []

for index, row in labels.iterrows():
  classf = row['classALeRCE']
  if classf == 'CEP':
    CEPS.append(index)
  elif classf == 'DSCT':
    DSCTS.append(index)
  elif classf == 'Periodic-Other':
    POS.append(index)
  else:
    pass

#print(CEPS)
print(len(CEPS))
print(len(DSCTS))
print(len(POS))



618
732
1256
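
As an alternative to iterating row by row, the OIDs of each class can be selected with a boolean mask. This is a minimal sketch on made-up data: `toy_labels` and its oids are hypothetical, and only the `classALeRCE` column name comes from the notebook.

```python
import pandas as pd

# Toy stand-in for labels (made-up oids).
toy_labels = pd.DataFrame(
    {"classALeRCE": ["CEP", "RRL", "DSCT", "CEP", "Periodic-Other"]},
    index=pd.Index(["oid1", "oid2", "oid3", "oid4", "oid5"], name="oid"),
)

minority = ["CEP", "DSCT", "Periodic-Other"]

# For each minority class, keep the index entries whose class matches.
oids_by_class = {
    c: toy_labels.index[toy_labels["classALeRCE"] == c].tolist()
    for c in minority
}
print({c: len(v) for c, v in oids_by_class.items()})
```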


In [None]:
# With those, go to filtered_alerts and call filtered_alerts.loc[OID];
# that returns a DataFrame with all the entries of that object.
# AZK123 | 12.434823 | 534912 | 100 | 0.55
# AZK123 | 14.524823 | 534102 | 0.3 | 0.63
# AZK123 | 10.042823 | 534400 | 0.5 | 0.65
# AZK123 | 19.104823 | 534568 | 100 | 0.59
# AZK123 | 10.348023 | 534964 | 100 | 0.51

#... if there are extra columns such as candid or the position columns, we ignore them
# (we do not drop them, so the same processing scripts can be reused)

# change the oid to something like "sinteticCEP01"

# sinteticCEP01 | 12.434823 | 534912 | 100 | 0.55
# sinteticCEP01 | 14.524823 | 534102 | 0.3 | 0.63
# sinteticCEP01 | 10.042823 | 534400 | 0.5 | 0.65
# sinteticCEP01 | 19.104823 | 534568 | 100 | 0.59
# sinteticCEP01 | 10.348023 | 534964 | 100 | 0.51

# and add a bit of random noise to each variable (keeping their scales)

# sinteticCEP01 | 11.394823 | 534600 | None | 0.53
# sinteticCEP01 | 13.694823 | 534795 | None | 0.60
# sinteticCEP01 | 12.122823 | 534604 | None | 0.66
# sinteticCEP01 | 17.144823 | 535100 | None | 0.57
# sinteticCEP01 | 19.348023 | 534503 | None | 0.51

# the plain error field can be left as is or set to None, because it is ignored
# when computing features

# Do the same for every CEP, DSCT and Periodic-Other object. That leaves us with
# twice as many CEP, DSCT and Periodic-Other objects (still few compared
# to the other classes, but at least it is something).


def createSintetic(base_objects, name):
  # columns to perturb and the maximum noise allowed for each one
  noise_caps = {'sigmapsf_corr_ext': 5, 'mjd': 200, 'magpsf_corr': 30}
  sintetics = []
  for i, obj in enumerate(base_objects):
    # .loc[[obj]] always returns a DataFrame (even for a single alert) and
    # .copy() keeps filtered_alerts untouched and avoids SettingWithCopyWarning
    obj_df = filtered_alerts.loc[[obj]].copy()
    obj_df.index = pd.Index(['sintetic{}{}'.format(name, i)] * len(obj_df),
                            name=filtered_alerts.index.name)
    for col, cap in noise_caps.items():
      # noise is bounded by the object's own range in that column and by the cap
      limit = min(obj_df[col].max() - obj_df[col].min(), cap)
      # one independent draw per row; assigning through .loc[index, col] would
      # hit every row at once, because all rows of an object share the same oid
      obj_df[col] += [random.uniform(0, limit) for _ in range(len(obj_df))]
    sintetics.append(obj_df)
  sintetics = pd.concat(sintetics)
  print(sintetics)
  sintetics.to_csv("sintetics_{}.csv".format(name))

createSintetic(CEPS, "CEPS")
createSintetic(DSCTS, "DSCTS")
createSintetic(POS, "PeriodicOther")



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(ilocs[0], value)


                             candid  fid  ...  sigmapsf_corr  sigmapsf_corr_ext
oid                                       ...                                  
sinteticCEPS0    963378614515015001    1  ...       0.043460           0.463625
sinteticCEPS0    951393324515010003    1  ...       0.059855           0.483000
sinteticCEPS0    892439904515010001    1  ...       0.084891           0.506429
sinteticCEPS0    882421164515015000    2  ...       0.025708           0.446361
sinteticCEPS0    918422634515010002    1  ...       0.059055           0.481978
...                             ...  ...  ...            ...                ...
sinteticCEPS617  588404833615015015    1  ...     100.000000          11.509691
sinteticCEPS617  957174983615015015    2  ...     100.000000          11.513635
sinteticCEPS617  996162673615015007    2  ...     100.000000          11.503433
sinteticCEPS617  980341273615015011    2  ...     100.000000          11.503037
sinteticCEPS617  983282903615010013    2


                               candid  fid  ...  sigmapsf_corr  sigmapsf_corr_ext
oid                                         ...                                  
sinteticDSCTS0     891295064115010001    2  ...     100.000000           0.904389
sinteticDSCTS0     920204944115015001    1  ...       0.003735           0.902682
sinteticDSCTS0     932189984115015004    1  ...     100.000000           0.899234
sinteticDSCTS0     840274574115015003    1  ...     100.000000           0.896716
sinteticDSCTS0    1119548254115010006    1  ...     100.000000           0.899430
...                               ...  ...  ...            ...                ...
sinteticDSCTS731  1237309203815015009    1  ...     100.000000           0.486996
sinteticDSCTS731   522247003815010003    1  ...     100.000000           0.482169
sinteticDSCTS731   726555693815010001    2  ...     100.000000           0.486311
sinteticDSCTS731   687508803815010000    1  ...     100.000000           0.485347
sinteticDSCTS731


                                        candid  ...  sigmapsf_corr_ext
oid                                             ...                   
sinteticPeriodicOther0     1081086850615015000  ...         176.846608
sinteticPeriodicOther0     1084093430615010000  ...         177.941402
sinteticPeriodicOther0     1031215530615010002  ...         177.107621
sinteticPeriodicOther0     1052211230615010001  ...         177.807796
sinteticPeriodicOther0     1057171000615015000  ...         176.787731
...                                        ...  ...                ...
sinteticPeriodicOther1255   757493372015010002  ...           0.264329
sinteticPeriodicOther1255   748541842015010000  ...           0.256404
sinteticPeriodicOther1255   733498622015010001  ...           0.263921
sinteticPeriodicOther1255   751540392015010000  ...           0.256744
sinteticPeriodicOther1255   722510682015010001  ...           0.257256

[67272 rows x 6 columns]


In [None]:
filtered_alerts.loc[CEPS[0]].head()

oid,candid,fid,mjd,magpsf_corr,sigmapsf_corr,sigmapsf_corr_ext
ZTF18abwwdsc,963378614515015001,1,58717.378611,18.361734,0.04346,0.045807
ZTF18abwwdsc,951393324515010003,1,58705.393322,18.989523,0.059855,0.065182
ZTF18abwwdsc,892439904515010001,1,58646.439907,18.972616,0.084891,0.088612
ZTF18abwwdsc,882421164515015000,2,58636.421169,18.11252,0.025708,0.028543
ZTF18abwwdsc,918422634515010002,1,58672.422639,18.958513,0.059055,0.064161
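
Comparing these original rows with the `sinteticCEPS0` rows printed earlier, `sigmapsf_corr_ext` differs by the same amount (about 0.4178) in every row: the printed synthetic object carries a constant offset rather than independent per-row noise. A constant offset of this kind is what label-based assignment produces when every row shares the same index label, as the alerts of one object do. The sketch below illustrates the effect on made-up values; `toy`, `by_label` and `by_position` are hypothetical names.

```python
import pandas as pd

# Three alerts of one object: every row has the same index label.
toy = pd.DataFrame(
    {"mjd": [10.0, 20.0, 30.0]},
    index=pd.Index(["obj", "obj", "obj"], name="oid"),
)

# Label-based update: "obj" matches all three rows, so each of the
# three iterations adds 1.0 to every row.
by_label = toy.copy()
for label, _ in by_label.iterrows():
    by_label.loc[label, "mjd"] += 1.0

# Position-based update: each iteration touches exactly one row.
by_position = toy.copy()
col = by_position.columns.get_loc("mjd")
for pos in range(len(by_position)):
    by_position.iloc[pos, col] += 1.0

print(by_label["mjd"].tolist())
print(by_position["mjd"].tolist())
```

With random draws in place of the fixed `1.0`, the label-based loop would add the sum of all draws to every row, while the positional loop would give each row its own draw.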


In [None]:
filtered_alerts.loc[POS[0]].tail()

oid,candid,fid,mjd,magpsf_corr,sigmapsf_corr,sigmapsf_corr_ext
ZTF18accqcwt,918462100615015000,2,58672.462106,18.78312,0.047196,0.049122
ZTF18accqcwt,703119680615010001,1,58457.119687,21.258444,1.045449,1.054233
ZTF18accqcwt,1027249160615015001,2,58781.249167,18.41144,0.086194,0.086736
ZTF18accqcwt,728095410615015001,2,58482.095417,18.096096,0.067301,0.067689
ZTF18accqcwt,1035206060615010001,1,58789.206065,21.7847,1.288545,1.307277


In [None]:
filtered_alerts.loc[DSCTS[0]].head()

oid,candid,fid,mjd,magpsf_corr,sigmapsf_corr,sigmapsf_corr_ext
ZTF18aavskgy,891295064115010001,2,58645.295069,14.181499,100.0,0.013843
ZTF18aavskgy,920204944115015001,1,58674.204942,13.931778,0.003735,0.012135
ZTF18aavskgy,932189984115015004,1,58686.189988,13.945841,100.0,0.008687
ZTF18aavskgy,840274574115015003,1,58594.274572,13.916053,100.0,0.00617
ZTF18aavskgy,1119548254115010006,1,58873.548252,14.198402,100.0,0.008883
