# tratando_anomalias
In TP1 we saw that the dataset, besides containing meaningless and null values, shows anomalies and is also imbalanced.

We say that it shows anomalies and is imbalanced because:
- Far more properties were loaded in **December 2016** than in any other year-month. We consider this an **anomaly** in the dataset, since there is no reason for the data to quadruple in the last month. We suspect that some of these records may not actually belong to that year-month.
- Something similar happens with the provinces, where **Distrito Federal** has the majority of the listings. This is not an anomaly: it is one of the most populous entities in Mexico and the most densely populated, so it makes sense that it has more listings. Even so, it makes the dataset **imbalanced**, which can make some algorithms perform worse.
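
Both observations can be checked directly from value counts. The following is a minimal sketch on synthetic data, not the TP1 analysis itself. It assumes the columns are called `aniomes` (the name used in the cells of this notebook) and `provincia` (an assumed name), and the cutoff of three times the median monthly count is an arbitrary illustrative choice.

```python
import pandas as pd

# Synthetic stand-in for the real dataset (values are made up for illustration).
df = pd.DataFrame({
    "aniomes": [201609] * 10 + [201610] * 12 + [201611] * 11 + [201612] * 45,
    "provincia": ["Distrito Federal"] * 40 + ["Jalisco"] * 20 + ["Puebla"] * 18,
})

# Listings per year-month, flagged when far above the typical month.
per_month = df["aniomes"].value_counts().sort_index()
threshold = 3 * per_month.median()  # arbitrary illustrative cutoff
anomalous_months = per_month[per_month > threshold].index.tolist()

# Share of listings per province, to quantify the imbalance.
province_share = df["provincia"].value_counts(normalize=True)

print(anomalous_months)
print(province_share.round(3))
```

On the real data, the same two calls to `value_counts` would show how far December 2016 and Distrito Federal deviate from the rest.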

In the following cells we try to propose some solutions to these problems:

In [1]:
import pandas as pd
import numpy as np
import math

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

seed = 7

In [2]:
import nbimporter

import pre_processing
import feature_generation
import feature_selection

Importing Jupyter notebook from pre_processing.ipynb
Importing Jupyter notebook from feature_generation.ipynb
Importing Jupyter notebook from feature_selection.ipynb


<hr>

### Model to use:

In [3]:
import lightgbm as lgb

params = {
    'objective': 'regression', 
    'boosting': 'gbdt',
    'metric': 'mae',
    'boost_from_average': False,
    'num_threads': 4,
    'learning_rate': 0.0081,
    'num_leaves': 97,
    'max_depth': 11,
    'feature_fraction': 0.041,
    'bagging_freq': 5,
    'bagging_fraction': 0.331,
    'min_data_in_leaf': 80,
    'min_sum_hessian_in_leaf': 10.0,
    'verbosity': 1,
    'num_iterations': 99999999,
    'seed': seed}


<hr>

# HANDLING THE DEC 2016 YEAR-MONTH

## First alternative: fill the date fields of those rows with nulls.
We evaluate this option by applying the treatment to the train dataset only, then to both datasets, and comparing the results.
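
Since the treatment may be applied to train alone or to both datasets, it can help to wrap it in a small function. The sketch below is one possible way to do that, not the notebook's actual code: `null_date_features` is a hypothetical helper, and `mes` and `dia` are made-up date feature names. It assumes `aniomes` itself is left out of the list of columns to null, since it is the column that identifies the affected rows.

```python
import pandas as pd

def null_date_features(df, date_cols, aniomes=201612):
    """Return a copy of df with date_cols set to NaN on rows of the given year-month."""
    out = df.copy()
    rows = out["aniomes"] == aniomes
    for col in date_cols:
        if col in out.columns:
            # Series.mask replaces the values where the condition is True with NaN.
            out[col] = out[col].mask(rows)
    return out

# Toy frames; 'mes' and 'dia' are hypothetical date features.
train = pd.DataFrame({"aniomes": [201611, 201612], "mes": [11, 12],
                      "dia": [3, 25], "precio": [1.0, 2.0]})
test = pd.DataFrame({"aniomes": [201612, 201610], "mes": [12, 10], "dia": [7, 9]})

train_treated = null_date_features(train, ["mes", "dia"])  # train only
test_treated = null_date_features(test, ["mes", "dia"])    # optionally test as well
```

Returning a copy keeps the untreated frame available for the baseline run on the same split.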

In [4]:
train,_ = pre_processing.load_featured_datasets()

In [5]:
features = feature_generation.get_features()
features_con_fecha = features['fecha']['all']
# Leave 'aniomes' out so the column used to locate the Dec 2016 rows is not nulled itself.
features_con_fecha.remove('aniomes')

In [6]:
train = feature_selection.get_selected_dataframe(train, aniomes=True)

In [7]:
train.shape

(240000, 143)

In [8]:
#train['precio'] = train['precio'].map(lambda x: math.log(x))
train_original = train.copy()

In [9]:
# Null the date features on the Dec 2016 rows (vectorized instead of a row-wise apply).
es_dic2016 = train['aniomes'] == 201612
for feature in features_con_fecha:
    if feature in train.columns:
        train[feature] = train[feature].mask(es_dic2016)

In [10]:
X = train.drop('precio', axis=1)
Y = train['precio']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [None]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid]
reg = lgb.train(params, d_train, valid_sets=watchlist, verbose_eval=1000, early_stopping_rounds=3000)



Training until validation scores don't improve for 3000 rounds
[1000]	valid_0's l1: 614496
[2000]	valid_0's l1: 565735
[3000]	valid_0's l1: 546628
[4000]	valid_0's l1: 537191
[5000]	valid_0's l1: 530130
[6000]	valid_0's l1: 525115
[7000]	valid_0's l1: 521696
[8000]	valid_0's l1: 518568
[9000]	valid_0's l1: 516598
[10000]	valid_0's l1: 514574


In [37]:
# Now, for comparison, we try the same method without the treatment applied:

In [38]:
X = train_original.drop('precio', axis=1)
Y = train_original['precio']

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=seed)

In [39]:
d_train = lgb.Dataset(X_train.values, label=Y_train.values)
d_valid = lgb.Dataset(X_val.values, label=Y_val.values)
watchlist = [d_valid]
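# NOTE: n_estimators is not defined in any cell shown here; it must be set before running this cell.
reg = lgb.train(params, d_train, n_estimators, watchlist, verbose_eval=1)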

[1]	valid_0's l1: 1.48531e+06
Training until validation scores don't improve for 50 rounds
[2]	valid_0's l1: 1.37867e+06
[3]	valid_0's l1: 1.28343e+06
[4]	valid_0's l1: 1.1988e+06
[5]	valid_0's l1: 1.12503e+06
[6]	valid_0's l1: 1.06057e+06
[7]	valid_0's l1: 1.00358e+06
[8]	valid_0's l1: 951820
[9]	valid_0's l1: 908145
[10]	valid_0's l1: 869109
[11]	valid_0's l1: 834903
[12]	valid_0's l1: 805767
[13]	valid_0's l1: 779900
[14]	valid_0's l1: 756476
[15]	valid_0's l1: 736496
[16]	valid_0's l1: 717973
[17]	valid_0's l1: 701968
[18]	valid_0's l1: 687948
[19]	valid_0's l1: 674741
[20]	valid_0's l1: 663199
[21]	valid_0's l1: 653236
[22]	valid_0's l1: 644082
[23]	valid_0's l1: 635083
[24]	valid_0's l1: 627052
[25]	valid_0's l1: 620826
[26]	valid_0's l1: 614663
[27]	valid_0's l1: 609286
[28]	valid_0's l1: 604209
[29]	valid_0's l1: 599506
[30]	valid_0's l1: 595395
[31]	valid_0's l1: 591809
[32]	valid_0's l1: 588717
[33]	valid_0's l1: 585798
[34]	valid_0's l1: 582989
[35]	valid_0's l1: 580582
...

[308]	valid_0's l1: 513105
[309]	valid_0's l1: 513034
[310]	valid_0's l1: 512941
[311]	valid_0's l1: 512874
[312]	valid_0's l1: 512840
[313]	valid_0's l1: 512801
[314]	valid_0's l1: 512774
[315]	valid_0's l1: 512692
[316]	valid_0's l1: 512635
[317]	valid_0's l1: 512580
[318]	valid_0's l1: 512511
[319]	valid_0's l1: 512507
[320]	valid_0's l1: 512475
[321]	valid_0's l1: 512417
[322]	valid_0's l1: 512293
[323]	valid_0's l1: 512214
[324]	valid_0's l1: 512207
[325]	valid_0's l1: 512137
[326]	valid_0's l1: 512090
[327]	valid_0's l1: 512009
[328]	valid_0's l1: 511959
[329]	valid_0's l1: 511947
[330]	valid_0's l1: 511871
[331]	valid_0's l1: 511804
[332]	valid_0's l1: 511734
[333]	valid_0's l1: 511690
[334]	valid_0's l1: 511674
[335]	valid_0's l1: 511639
[336]	valid_0's l1: 511517
[337]	valid_0's l1: 511458
[338]	valid_0's l1: 511376
[339]	valid_0's l1: 511387
[340]	valid_0's l1: 511326
[341]	valid_0's l1: 511322
[342]	valid_0's l1: 511284
[343]	valid_0's l1: 511242
[344]	valid_0's l1: 511148
...

[615]	valid_0's l1: 503200
[616]	valid_0's l1: 503195
[617]	valid_0's l1: 503178
[618]	valid_0's l1: 503180
[619]	valid_0's l1: 503116
[620]	valid_0's l1: 503095
[621]	valid_0's l1: 503084
[622]	valid_0's l1: 503066
[623]	valid_0's l1: 503025
[624]	valid_0's l1: 502971
[625]	valid_0's l1: 502925
[626]	valid_0's l1: 502923
[627]	valid_0's l1: 502874
[628]	valid_0's l1: 502856
[629]	valid_0's l1: 502844
[630]	valid_0's l1: 502823
[631]	valid_0's l1: 502825
[632]	valid_0's l1: 502831
[633]	valid_0's l1: 502792
[634]	valid_0's l1: 502766
[635]	valid_0's l1: 502737
[636]	valid_0's l1: 502747
[637]	valid_0's l1: 502732
[638]	valid_0's l1: 502741
[639]	valid_0's l1: 502730
[640]	valid_0's l1: 502714
[641]	valid_0's l1: 502684
[642]	valid_0's l1: 502666
[643]	valid_0's l1: 502674
[644]	valid_0's l1: 502646
[645]	valid_0's l1: 502615
[646]	valid_0's l1: 502589
[647]	valid_0's l1: 502573
[648]	valid_0's l1: 502569
[649]	valid_0's l1: 502571
[650]	valid_0's l1: 502536
[651]	valid_0's l1: 502512
...

[925]	valid_0's l1: 499236
[926]	valid_0's l1: 499236
[927]	valid_0's l1: 499240
[928]	valid_0's l1: 499234
[929]	valid_0's l1: 499225
[930]	valid_0's l1: 499253
[931]	valid_0's l1: 499228
[932]	valid_0's l1: 499211
[933]	valid_0's l1: 499198
[934]	valid_0's l1: 499181
[935]	valid_0's l1: 499152
[936]	valid_0's l1: 499124
[937]	valid_0's l1: 499142
[938]	valid_0's l1: 499155
[939]	valid_0's l1: 499140
[940]	valid_0's l1: 499137
[941]	valid_0's l1: 499117
[942]	valid_0's l1: 499119
[943]	valid_0's l1: 499091
[944]	valid_0's l1: 499068
[945]	valid_0's l1: 499028
[946]	valid_0's l1: 499022
[947]	valid_0's l1: 499006
[948]	valid_0's l1: 498996
[949]	valid_0's l1: 498976
[950]	valid_0's l1: 498946
[951]	valid_0's l1: 498937
[952]	valid_0's l1: 498934
[953]	valid_0's l1: 498903
[954]	valid_0's l1: 498901
[955]	valid_0's l1: 498893
[956]	valid_0's l1: 498895
[957]	valid_0's l1: 498878
[958]	valid_0's l1: 498862
[959]	valid_0's l1: 498873
[960]	valid_0's l1: 498862
[961]	valid_0's l1: 498860
...

[1225]	valid_0's l1: 497385
[1226]	valid_0's l1: 497396
[1227]	valid_0's l1: 497384
[1228]	valid_0's l1: 497362
[1229]	valid_0's l1: 497348
[1230]	valid_0's l1: 497324
[1231]	valid_0's l1: 497314
[1232]	valid_0's l1: 497298
[1233]	valid_0's l1: 497299
[1234]	valid_0's l1: 497293
[1235]	valid_0's l1: 497271
[1236]	valid_0's l1: 497291
[1237]	valid_0's l1: 497285
[1238]	valid_0's l1: 497254
[1239]	valid_0's l1: 497261
[1240]	valid_0's l1: 497271
[1241]	valid_0's l1: 497263
[1242]	valid_0's l1: 497274
[1243]	valid_0's l1: 497290
[1244]	valid_0's l1: 497297
[1245]	valid_0's l1: 497286
[1246]	valid_0's l1: 497287
[1247]	valid_0's l1: 497281
[1248]	valid_0's l1: 497253
[1249]	valid_0's l1: 497270
[1250]	valid_0's l1: 497251
[1251]	valid_0's l1: 497252
[1252]	valid_0's l1: 497250
[1253]	valid_0's l1: 497279
[1254]	valid_0's l1: 497264
[1255]	valid_0's l1: 497267
[1256]	valid_0's l1: 497268
[1257]	valid_0's l1: 497266
[1258]	valid_0's l1: 497252
[1259]	valid_0's l1: 497270
...

### Results of the first alternative:
We can see that when we fill the date fields with nulls, the validation MAE goes from **466k** to **462k**.
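
To put the two MAE figures quoted above in perspective, the gain can be expressed in relative terms. This is plain arithmetic on the rounded values, so the exact percentage on the unrounded scores may differ slightly.

```python
mae_original = 466_000  # MAE without the treatment (rounded figure quoted above)
mae_treated = 462_000   # MAE with the Dec 2016 date fields nulled (rounded)

absolute_gain = mae_original - mae_treated
relative_gain = absolute_gain / mae_original

print(f"absolute: {absolute_gain}, relative: {relative_gain:.2%}")
# prints "absolute: 4000, relative: 0.86%"
```

An improvement of under 1% is small enough that repeating the comparison with several seeds or splits would help confirm it is not noise.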

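# Code scrap bin

In [None]:
# Logarithmic prediction...
# Only valid when the model was trained on log(precio) (see the commented-out line in cell In [8]).

Y_pred = reg.predict(X_val.values)

# Undo the log transform so the MAE is measured in the original price units.
Y_pred = np.exp(Y_pred)
Y_val = np.exp(Y_val.values)
mean_absolute_error(Y_val, Y_pred)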
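
The scrap cell above relies on a log-transformed target: the model is fit on `log(precio)` and its predictions are exponentiated before scoring. The sketch below illustrates that round trip on synthetic prices. There is no real model here; the "prediction" is just the log target plus noise, standing in for the output of `reg.predict`.

```python
import numpy as np

rng = np.random.default_rng(7)
precio = rng.uniform(5e5, 5e6, size=1000)  # synthetic prices
log_precio = np.log(precio)                # target the model would be trained on

# Stand-in for reg.predict(...): the log target plus small noise.
log_pred = log_precio + rng.normal(0.0, 0.05, size=precio.size)

pred = np.exp(log_pred)                    # back to the original price scale
mae = np.mean(np.abs(precio - pred))       # comparable with the MAE values in the logs
mae_log = np.mean(np.abs(log_precio - log_pred))  # error in log units, not comparable

print(mae, mae_log)
```

Only the MAE computed after `np.exp` is in price units. The error measured in log space is on a different scale and should not be compared with the scores of the untransformed runs.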