## Competition: Weather Forecasting

### Features
For a first-hand description of the features, see [Lena Volzhina's video](https://youtu.be/ECALEJ79KHg?t=578).

Their sources fall into the following groups:
  * Forecasts from global weather models. In the exported data these are the CMC and GFS models
    * features whose names start with ```cmc_```
    * features whose names start with ```gfs_```
  * Forecasts from the regional WRF model
    * features whose names start with ```wrf_```
  * Climate data (from 40 years of observations)
    * features whose names start with ```climate_```
  * Other features
    * 'sun_elevation' -- the elevation of the sun above the horizon.
    * 'topography_bathymetry'
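
The prefix convention above makes it straightforward to group columns by source. The following is a minimal sketch that assumes only the prefixes listed here; the helper `group_by_source` and the example column names (apart from `sun_elevation` and `topography_bathymetry`) are hypothetical.

```python
from collections import defaultdict

# Prefixes taken from the list above; everything else falls into "other".
SOURCE_PREFIXES = ("cmc_", "gfs_", "wrf_", "climate_")


def group_by_source(columns):
    """Group column names by the forecast source encoded in their prefix."""
    groups = defaultdict(list)
    for name in columns:
        for prefix in SOURCE_PREFIXES:
            if name.startswith(prefix):
                groups[prefix.rstrip("_")].append(name)
                break
        else:
            # No known prefix matched.
            groups["other"].append(name)
    return dict(groups)


# Hypothetical column names, for illustration only.
example_columns = [
    "cmc_0_0_0_2",
    "gfs_temperature",
    "wrf_t2",
    "climate_temperature",
    "sun_elevation",
    "topography_bathymetry",
]
groups = group_by_source(example_columns)
```

In the notebook the same helper could be applied to a real frame, for example `group_by_source(train_2018_10.columns)`.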
  
  
CMC feature names contain 4 numbers that encode the physical quantity. The last number is the height at which the quantity is computed.

GFS and WRF features have human-readable names.

There are also technical attributes of each provider's forecast:
  * `_available` -- whether the provider was available at forecast time
  * ```_horizon_h``` -- the forecast horizon, in the provider's model, of the timestamp for which Meteum computed its forecast.
  * ```_next``` -- the value of this provider's next forecast for the same point.

In addition, some features combine the ones described above, for example
  * `_grad` -- the difference between the regular and `_next` temperature forecasts.

### Target variables
  * fact_temperature -- the temperature at a given point (a point is described by the coordinate pair 'fact_latitude', 'fact_longitude' and the timestamp 'fact_time')
  * fact_cwsm_class -- the cloudiness and precipitation code

### Competition targets
  * cloudiness classifier [an object has label l1 if fact_cwsm_class is non-zero]
      * __l1 = 1 if fact_cwsm_class != 0, else 0__
  * precipitation classifier [an object has label l2 if fact_cwsm_class is not in (0, 10, 20)]
      * __l2 = 1 if fact_cwsm_class not in (0, 10, 20), else 0__
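
The two label rules above map directly onto pandas expressions. The train files used below appear to already contain `l1` and `l2` columns, so this sketch only illustrates the rule on a toy frame; the class codes 11 and 30 are hypothetical values chosen for the example.

```python
import pandas as pd

# Toy frame; 11 and 30 are hypothetical codes, used only for illustration.
df = pd.DataFrame({"fact_cwsm_class": [0, 10, 20, 11, 30]})

# l1: the class code is non-zero.
df["l1"] = (df["fact_cwsm_class"] != 0).astype(int)

# l2: the class code is outside {0, 10, 20}.
df["l2"] = (~df["fact_cwsm_class"].isin([0, 10, 20])).astype(int)
```

Note that every object with `l2 = 1` also has `l1 = 1`, since 0 is one of the excluded codes in both rules.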

In [110]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import catboost
from sklearn.metrics import balanced_accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

import warnings
warnings.filterwarnings('ignore')

## EDA

In [2]:
%%time
train_2018_09 = pd.read_csv('data/train_2018_09.csv')
train_2018_10 = pd.read_csv('data/train_2018_10.csv')
train_2018_11 = pd.read_csv('data/train_2018_11.csv')
train_2018_12 = pd.read_csv('data/train_2018_12.csv')
train_2019_01 = pd.read_csv('data/train_2019_01.csv')
test = pd.read_csv('data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')

CPU times: user 35.6 s, sys: 2.94 s, total: 38.5 s
Wall time: 39 s


In [3]:
print('train_2018_09', train_2018_09.shape)
print('train_2018_10', train_2018_10.shape)
print('train_2018_11', train_2018_11.shape)
print('train_2018_12', train_2018_12.shape)
print('train_2019_01', train_2019_01.shape)
print('test (2019_02)', test.shape)
print('sample_submission', sample_submission.shape)

train_2018_09 (430436, 131)
train_2018_10 (441518, 131)
train_2018_11 (429419, 131)
train_2018_12 (440694, 131)
train_2019_01 (443112, 131)
test (400547, 128)
sample_submission (400547, 3)


In [142]:
train_all = pd.concat([train_2018_09, train_2018_10, train_2018_11, train_2018_12, train_2019_01], axis=0)

In [4]:
[x for x in train_2018_10.columns if x.startswith('fact')]

['fact_time',
 'fact_latitude',
 'fact_longitude',
 'fact_temperature',
 'fact_cwsm_class']

In [29]:
test.dtypes[test.dtypes == object]

climate    object
dtype: object

## Modeling

In [111]:
def print_metrics_class(y_true, y_pred, model=''):
    """Print balanced accuracy, precision, recall, and F1 for binary predictions."""
    model_ = '-->> ' + model if model else ''
    print("Balanced accuracy =".ljust(19), f'{balanced_accuracy_score(y_true, y_pred):.3f} {model_}')
    print("Precision =".ljust(19), f'{precision_score(y_true, y_pred):.3f} {model_}')
    print("Recall =".ljust(19), f'{recall_score(y_true, y_pred):.3f} {model_}')
    print("F1-score =".ljust(19), f'{f1_score(y_true, y_pred):.3f} {model_}')

def roc_aucs(roc_auc_l1, roc_auc_l2):
    """Weighted mean of the two ROC AUC scores; l2 gets twice the weight of l1."""
    return (roc_auc_l1 + roc_auc_l2 * 2) / 3
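
To make the weighting in `roc_aucs` concrete, here is a small worked example. The function body is copied from the cell above; the two AUC values are hypothetical and chosen only to show that the `l2` score counts twice.

```python
def roc_aucs(roc_auc_l1, roc_auc_l2):
    """Weighted mean of two ROC AUC values; l2 counts twice as much as l1."""
    return (roc_auc_l1 + roc_auc_l2 * 2) / 3


# Hypothetical AUC values: (0.9 + 2 * 0.6) / 3
score = roc_aucs(0.9, 0.6)

# Swapping the two inputs changes the result, because the weights differ:
# (0.6 + 2 * 0.9) / 3
swapped = roc_aucs(0.6, 0.9)
```

With these inputs `score` is about 0.7 and `swapped` is about 0.8, so improving the precipitation classifier moves the combined score twice as much as improving the cloudiness classifier by the same amount.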

### Train/eval

In [115]:
# Time-based holdout: fit on December 2018, evaluate on January 2019
X_train = train_2018_12.drop(['l1', 'l2', 'fact_cwsm_class', 'fact_temperature'], axis=1)
X_train['climate'] = X_train['climate'].astype(str)
columns = X_train.columns
y_train_1 = train_2018_12['l1']
y_train_2 = train_2018_12['l2']

# `climate` is the only non-numeric column, so it is passed as categorical
train_dataset_1 = catboost.Pool(data=X_train,
                                label=y_train_1,
                                cat_features=['climate'])
train_dataset_2 = catboost.Pool(data=X_train,
                                label=y_train_2,
                                cat_features=['climate'])

X_eval = train_2019_01.drop(['l1', 'l2', 'fact_cwsm_class', 'fact_temperature'], axis=1)
X_eval['climate'] = X_eval['climate'].astype(str)
y_eval_1 = train_2019_01['l1']
y_eval_2 = train_2019_01['l2']

# Test features, reordered to match the training columns
X_test = test.drop('ID', axis=1)
X_test['climate'] = X_test['climate'].astype(str)
X_test = X_test[columns]
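
The same three preparation steps (drop the targets, cast `climate` to string, align the column order) are repeated for the train, eval, and test frames. One possible way to factor them out is sketched below; the helper `prepare_features`, the toy frames, and the feature name `gfs_x` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Target-related columns, taken from the cell above.
TARGET_COLUMNS = ["l1", "l2", "fact_cwsm_class", "fact_temperature"]


def prepare_features(df, columns=None):
    """Return a feature frame without targets/ID and with `climate` cast to str.

    If `columns` is given, the result is reordered to that column list,
    which keeps train, eval, and test frames aligned.
    """
    drop = [c for c in TARGET_COLUMNS + ["ID"] if c in df.columns]
    X = df.drop(columns=drop)
    X["climate"] = X["climate"].astype(str)
    if columns is not None:
        X = X[list(columns)]
    return X


# Toy frames with a hypothetical feature `gfs_x`, for illustration only.
toy_train = pd.DataFrame({
    "gfs_x": [1.0, 2.0], "climate": ["a", "b"],
    "fact_cwsm_class": [0, 10], "fact_temperature": [1.5, 2.5],
    "l1": [0, 1], "l2": [0, 0],
})
toy_test = pd.DataFrame({"ID": [0, 1], "climate": ["a", "b"], "gfs_x": [3.0, 4.0]})

X_toy_train = prepare_features(toy_train)
X_toy_test = prepare_features(toy_test, columns=X_toy_train.columns)
```

Passing the training column list to every later call guards against the test file storing its columns in a different order.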

In [121]:
clf_1 = catboost.CatBoostClassifier()
clf_2 = catboost.CatBoostClassifier()

In [122]:
%%time
clf_1 = clf_1.fit(train_dataset_1, verbose=False)
clf_2 = clf_2.fit(train_dataset_2, verbose=False)

CPU times: user 41min 8s, sys: 1min, total: 42min 9s
Wall time: 3min 40s


In [132]:
y_eval_1_pred = clf_1.predict(X_eval)
y_eval_2_pred = clf_2.predict(X_eval)

y_eval_1_pred_prob = clf_1.predict_proba(X_eval)[:, 1]
y_eval_2_pred_prob = clf_2.predict_proba(X_eval)[:, 1]

In [133]:
print_metrics_class(y_eval_1, y_eval_1_pred)

Balanced accuracy = 0.807 
Precision =         0.852 
Recall =            0.888 
F1-score =          0.870 


In [135]:
print_metrics_class(y_eval_2, y_eval_2_pred)

Balanced accuracy = 0.648 
Precision =         0.605 
Recall =            0.316 
F1-score =          0.415 


In [137]:
roc_aucs(roc_auc_score(y_eval_1, y_eval_1_pred_prob), 
         roc_auc_score(y_eval_2, y_eval_2_pred_prob))

0.9059968630471014

### Test

In [144]:
train_all.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2185179 entries, 0 to 443111
Columns: 131 entries, fact_time to l2
dtypes: float64(128), int64(2), object(1)
memory usage: 2.3 GB


In [145]:
# Refit on all five training months before predicting on the test set
X_train = train_all.drop(['l1', 'l2', 'fact_cwsm_class', 'fact_temperature'], axis=1)
X_train['climate'] = X_train['climate'].astype(str)
columns = X_train.columns
y_train_1 = train_all['l1']
y_train_2 = train_all['l2']

train_dataset_1 = catboost.Pool(data=X_train,
                              label=y_train_1,
                              cat_features=['climate'])
train_dataset_2 = catboost.Pool(data=X_train,
                              label=y_train_2,
                              cat_features=['climate'])

X_test = test.drop('ID', axis=1)
X_test['climate'] = X_test['climate'].astype(str)
X_test = X_test[columns]

In [168]:
clf_1 = catboost.CatBoostClassifier(auto_class_weights='Balanced')
clf_2 = catboost.CatBoostClassifier(auto_class_weights='Balanced')
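
For intuition about `auto_class_weights='Balanced'`: as far as the CatBoost documentation describes it, each class is weighted by the total weight of the largest class divided by the total weight of that class. The sketch below reproduces that formula for unit object weights; `balanced_class_weights` is a hypothetical helper, not a CatBoost function, and the formula should be checked against the documentation for the installed version.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by (size of the largest class) / (size of this class).

    Assumes every object has weight 1.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}


# Toy labels: three negatives and one positive.
weights = balanced_class_weights([0, 0, 0, 1])
```

Here the majority class keeps weight 1.0 and the minority class gets weight 3.0, so both classes contribute equally to the loss in total.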

In [169]:
%%time
clf_1 = clf_1.fit(train_dataset_1, verbose=True)
clf_2 = clf_2.fit(train_dataset_2, verbose=True)

Learning rate set to 0.274723
0:	learn: 0.5876836	total: 643ms	remaining: 10m 42s
1:	learn: 0.5348586	total: 1.19s	remaining: 9m 55s
2:	learn: 0.5059390	total: 1.8s	remaining: 9m 59s
3:	learn: 0.4896562	total: 2.36s	remaining: 9m 46s
4:	learn: 0.4780088	total: 3.08s	remaining: 10m 13s
5:	learn: 0.4687113	total: 3.6s	remaining: 9m 57s
6:	learn: 0.4635521	total: 4.2s	remaining: 9m 56s
7:	learn: 0.4582030	total: 4.73s	remaining: 9m 47s
8:	learn: 0.4549010	total: 5.25s	remaining: 9m 38s
9:	learn: 0.4520480	total: 5.81s	remaining: 9m 35s
10:	learn: 0.4497765	total: 6.47s	remaining: 9m 41s
11:	learn: 0.4478328	total: 7.12s	remaining: 9m 46s
12:	learn: 0.4457460	total: 7.64s	remaining: 9m 39s
13:	learn: 0.4443169	total: 8.14s	remaining: 9m 33s
14:	learn: 0.4421366	total: 8.72s	remaining: 9m 32s
15:	learn: 0.4408172	total: 9.3s	remaining: 9m 32s
16:	learn: 0.4399331	total: 9.85s	remaining: 9m 29s
17:	learn: 0.4388373	total: 10.4s	remaining: 9m 27s
...
157:	learn: 0.3925162	total: 1m 34s	remaining: 8m 24s
...
311:	learn: 0.3752338	total: 3m 9s	remaining: 6m 57s
...
464:	learn: 0.3639010	total: 4m 40s	remaining: 5m 22s
...
618:	learn: 0.3551829	total: 6m 6s	remaining: 3m 45s
...
771:	learn: 0.3485089	total: 7m 31s	remaining: 2m 13s
...
943:	learn: 0.3418717	total: 9m 7s	remaining: 32.5s
...

82:	learn: 0.3716413	total: 46.1s	remaining: 8m 29s
...
237:	learn: 0.3510429	total: 2m 13s	remaining: 7m 6s
...
390:	learn: 0.3387405	total: 3m 44s	remaining: 5m 49s
...
543:	learn: 0.3287106	total: 5m 13s	remaining: 4m 23s
...
696:	learn: 0.3198239	total: 6m 48s	remaining: 2m 57s
...
866:	learn: 0.3112636	total: 8m 27s	remaining: 1m 17s
...

In [170]:
y_test_1_pred_prob = clf_1.predict_proba(X_test)[:, 1]
y_test_2_pred_prob = clf_2.predict_proba(X_test)[:, 1]

In [171]:
answer = pd.concat([test['ID'],
                    pd.Series(y_test_1_pred_prob, name='l1'), 
                    pd.Series(y_test_2_pred_prob, name='l2')], axis=1)

In [172]:
answer.head()

   ID        l1        l2
0   0  0.885523  0.127417
1   1  0.789655  0.048466
2   2  0.772024  0.422436
3   3  0.108486  0.007056
4   4  0.231150  0.003153


In [173]:
answer.to_csv('submission_2.csv', index=False)

In [177]:
answer_old = pd.read_csv('submission.csv')

### Mean predictions of two different models

In [185]:
merged = pd.merge(answer.rename(columns={'l1': 'l1_new', 'l2': 'l2_new'}),
                  answer_old.rename(columns={'l1': 'l1_old', 'l2': 'l2_old'}), on='ID')

In [187]:
merged['l1'] = (merged['l1_new'] + merged['l1_old']) / 2
merged['l2'] = (merged['l2_new'] + merged['l2_old']) / 2

In [190]:
merged[['ID', 'l1', 'l2']].to_csv('submission_3.csv', index=False)

### Add third model to averaging

In [196]:
from sklearn.linear_model import LogisticRegression

In [200]:
logit_1 = LogisticRegression()
logit_2 = LogisticRegression()

In [201]:
%%time
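# LogisticRegression accepts neither NaN nor string columns, so drop `climate` and fill gaps with 0
logit_1 = logit_1.fit(X_train.drop('climate', axis=1).fillna(0), y_train_1)
logit_2 = logit_2.fit(X_train.drop('climate', axis=1).fillna(0), y_train_2)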

CPU times: user 1min 53s, sys: 8.04 s, total: 2min 1s
Wall time: 24.2 s


In [202]:
y_test_1_pred_prob_logit = logit_1.predict_proba(X_test.drop('climate', axis=1).fillna(0))[:, 1]
y_test_2_pred_prob_logit = logit_2.predict_proba(X_test.drop('climate', axis=1).fillna(0))[:, 1]

In [213]:
answer_3 = pd.concat([test['ID'],
                    pd.Series(y_test_1_pred_prob_logit, name='l1_3'), 
                    pd.Series(y_test_2_pred_prob_logit, name='l2_3')], axis=1)

In [214]:
merged_3 = pd.merge(merged.drop(['l1', 'l2'], axis=1), answer_3, on='ID')

In [231]:
merged_3['l1'] = merged_3['l1_3']
merged_3['l2'] = merged_3['l2_3']

In [232]:
merged_3[['ID', 'l1', 'l2']].to_csv('submission_9.csv', index=False)
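
The last two cells above write the logistic-regression predictions on their own. If an equal-weight mean of all three models is wanted instead, it could be computed roughly as follows. The column names follow `merged_3` above; the toy frame and its probabilities are made up, and equal weighting is an assumption rather than something the notebook specifies.

```python
import pandas as pd

# Toy stand-in for `merged_3`; the probabilities are made up.
toy = pd.DataFrame({
    "ID": [0, 1],
    "l1_new": [0.9, 0.2], "l1_old": [0.7, 0.4], "l1_3": [0.8, 0.3],
    "l2_new": [0.1, 0.6], "l2_old": [0.3, 0.2], "l2_3": [0.2, 0.1],
})

# Equal-weight mean over the three models, computed separately per label.
for label in ("l1", "l2"):
    parts = [f"{label}_new", f"{label}_old", f"{label}_3"]
    toy[label] = toy[parts].mean(axis=1)

blend = toy[["ID", "l1", "l2"]]
```

Because the combined score is based on ROC AUC, only the ranking of the blended probabilities matters, so models with very different probability scales may need rank normalization before averaging.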