# FitFood sales analysis and predictions

**Semester Project for Decision Systems 2020/2021 Course**

The goal of the competition is to create an efficient model for predicting whether the total 14-day sales of a particular product, offered by the FitFood company at one of its FitBoxy locations in Poland, will exceed four pieces.

## Task description

The provided data describe a short-term sales history of products at various points of sale (PoS). The target attribute will_it_sell indicates whether the total sales of a given product at a particular location will be at least 4 pcs in the following 14-day period.

The data are very similar to those from the second graded task, except that the sets in this challenge do not contain any random probes (which were deliberately added to the data from the second graded task for evaluation purposes).

The data tables are provided as two CSV files with the ';' separator sign. They can be downloaded after the registration for the challenge. Both files (training and test sets) have exactly the same format but all the values from the will_it_sell column in the test set are missing.

The evaluation metric will be AUC. During the challenge, your solutions will be evaluated on a small fraction of the test set, and your best preliminary AUC score will be displayed on the public Leaderboard. 

The submission format: solutions need to be submitted as text files with predictions. The file should have exactly the same number of rows as the test data table. Each row should contain exactly one real number expressing the likelihood that the correct target value for the corresponding test set instance is 1.
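To make the metric and the submission format concrete, the sketch below computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, counting ties as one half. The helper name `pairwise_auc` and the toy scores are illustrative assumptions, not part of the competition materials; on real data a library routine such as scikit-learn's `roc_auc_score` would normally be used instead of this quadratic loop.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy example: four rows with made-up labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)

# One real number per row, in test-set order, as the submission format requires.
submission_text = "\n".join(str(s) for s in scores)
```

Because AUC depends only on the ordering of the scores, the submitted numbers do not have to be calibrated probabilities.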

## Solution

#### Author of the solution: Krzysztof Piesiewicz

#### Contents of the notebook:
  - [Loading the data](#Loading-the-data)
  - [Preliminary analysis and preprocessing](#Preliminary-analysis-and-preprocessing)
    - [Size attribute](#Size-attribute)
    - [Storage temperature attribute](#Storage-temperature-attribute)
    - [Generalizing some discrete attributes](#Generalizing-some-discrete-attributes)
    - [Preprocessing the data](#Preprocessing-the-data)
  - [Splitting data for training and validation](#Spliting-data-for-training-and-validation)
  - [Challenge test set preparation and answer saver](#Chalange-test-set-preparation-and-answer-saver)
  - [Features selection](#Features-selection)
    - [Analysis of correlations with target value](#Analysis-of-correlations-with-target-value)
    - [Feature importance with random forest](#Features-importance-with-random-forrest)
  - [Training on all the features](#Training-on-all-the-features)
    - [More samples in a leaf](#More-samples-in-a-leaf)
    - [Less samples in a leaf](#Less-samples-in-a-leaf)
    - [Training each estimator on more samples](#Training-each-estimator-on-more-samples)
  - [Fast comparison of feature selections with the same classifiers](#Fast-comparison-of-feature-selections-with-the-same-classifiers)
    - [The most important features with no information about locations](#The-most-important-features-with-no-information-about-locations)
    - [The most correlated with target value](#The-most-correlated-with-target-value)
  - [Training on the 85 most important features](#Training-on-the-85-most-important-features)
  - [Let's consider the time dependencies problem](#Let's-consider-time-dependecies-problem)
    - [Benchmark of ExtraTreesClassifiers with season features and without](#Benchmark-of-ExtraTreesClassifiers-with-season-features-and-without)
  - [ExtraTreesClassifier and GradientBoosting with no season features trained on all data](#ExtraTreesClassifier-and-GradientBoosting-with-no-season-features-trained-on-all-data)
    - [ExtraTreesClassifier](#ExtraTreesClassifier)
    - [GradientBoosting](#GradientBoosting)
  - [Combining multiple answers into a final one](#Combining-multiple-answers-into-a-final-one)
  - [Summary - lessons learned](#Summary-lessons-learned)

## Loading the data

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 150)

In [3]:
train_data = pd.read_csv("FitFood_competition_data_training.csv", sep=";")

## Preliminary analysis and preprocessing

In [4]:
df = train_data
attrs = set(df.keys())
df.describe(include='all')

Unnamed: 0,will_it_sell,pos_id,product_id_unified,company_id,category_id,category_name,product_name,partner_product,address_city,diet,size,cooking_time,cooking_mv,cooking_ov,storage_temp,vat,bialko_100,weglow_100,cukry_calk,tluszcz_nasyc_calk,energia_calk,energia_100,tluszcz_nasyc_100,blonnik_100,tluszcz_calk,bialko_calk,sol_100,weglow_calk,blonnik_calk,tluszcz_100,cukry_100,sol_calk,weekday,quarter,month,week,qty_lag1,qty_lag2,qty_lag3,qty_lag4,qty_lag5,qty_lag6,qty_lag7,qty_lag8,qty_lag9,qty_lag10,qty_lag11,qty_lag12,qty_lag13,qty_lag14,meanLastPeriod_lag1,meanLastPeriod_lag2,meanLastPeriod_lag3,meanLastPeriod_lag4,meanLastPeriod_lag5,meanLastPeriod_lag6,meanLastPeriod_lag7,meanLastPeriod_lag1_lag7_diff,sdLastPeriod_lag1,sdLastPeriod_lag2,sdLastPeriod_lag3,sdLastPeriod_lag4,sdLastPeriod_lag5,sdLastPeriod_lag6,sdLastPeriod_lag7,minLastPeriod_lag1,minLastPeriod_lag7,minLastPeriod_lag1_lag7_diff,maxLastPeriod_lag1,maxLastPeriod_lag7,maxLastPeriod_lag1_lag7_diff,diff1_lag1,diff1_lag7,diff1_lag1_lag7_diff,diffLagPeriod_lag1,diffLagPeriod_lag7,diffLagPeriod_lag1_lag7_diff,mean_diff1_lag1,mean_diff1_lag7,mean_diff1_lag1_lag7_diff,sum_qty,avg_discount_mean_value_lag1,avg_discount_count_lag1,avg_from_blik_lag1,avg_from_paypass_lag1,avg_from_payu_lag1,avg_total_lag1,avg_total_to_discount_lag1,avg_total_base_lag1,avg_sum_fv_lag1,avg_transaction_discount_count_lag1,roc1_lag1,rocPeriod_lag1,days_since_prev_delivery,sales_since_prev_delivery,available_products,is_delivery_day
count,5360496.0,5360496,5360496.0,5360496,5360496,5360496,5360496,5360496.0,5360496,2840752,5360496,3632502,2336346.0,2336346.0,3930214,5360496.0,4999443.0,4999443.0,4998511.0,4998511.0,4999443.0,4999443.0,4998511.0,4070153.0,4999443.0,4999443.0,4998511.0,4999443.0,4070153.0,4999443.0,4998511.0,4998511.0,5360496,5360496,5360496.0,5360496.0,5331152.0,5301896.0,5272670.0,5243488.0,5214306.0,5185124.0,5155944.0,5126791.0,5097737.0,5068781.0,5039988.0,5011318.0,4982648.0,4953978.0,5155944.0,5126791.0,5097737.0,5068781.0,5039988.0,5011318.0,4982648.0,4982648.0,5155944.0,5126791.0,5097737.0,5068781.0,5039988.0,5011318.0,4982648.0,5155944.0,4982648.0,4982648.0,5155944.0,4982648.0,4982648.0,5301896.0,5126791.0,5126791.0,5126791.0,4953978.0,4953978.0,5126791.0,4953978.0,4953978.0,5185124.0,5331152.0,5331152.0,5331152.0,5331152.0,5331152.0,5331152.0,5331152.0,5331152.0,5331152.0,5331152.0,5331152.0,5331152.0,4779475.0,4779475.0,4805004.0,5360496.0
unique,,375,,298,14,14,137,,33,10,22,1,,,5,,,,,,,,,,,,,,,,,,7,4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,,59fc5325c94b722506678bd1,,5a587706cf5c8134b3a9891d,5a6f110ca0899f5ca2f7d6e9,Dania Lunch Duże,Dyniowe curry z indykiem,,Warszawa,Dieta Samuraja,350g,2-3 min.,,,2-5 °C,,,,,,,,,,,,,,,,,,czwartek,Q3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,,25498,,101533,1144201,1144201,112303,,2353704,994953,1195365,3632502,,,3477491,,,,,,,,,,,,,,,,,,776919,2034382,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,0.1195335,,1104.991,,,,,0.09166092,,,,,1.0,1.0,,6.627107,7.470809,19.20212,8.337381,3.861783,296.1271,153.1352,2.124834,2.689635,8.034908,16.89004,0.5981112,174.6482,7.774678,4.951239,8.053772,1.905875,,,7.216968,29.60717,0.09486317,0.09478666,0.0947446,0.09506039,0.09545451,0.09583937,0.09582765,0.0956692,0.09549787,0.09540085,0.09530975,0.09563293,0.09593152,0.09612921,0.0932419,0.09334906,0.09344387,0.09353105,0.09356136,0.093582,0.09358882,-0.002614129,0.1639429,0.1640696,0.1641682,0.1642481,0.1642443,0.1642216,0.164172,6.128849e-05,6.221591e-05,-4.01393e-07,0.4068483,0.4073537,-0.01017913,-0.00128935,-0.001241322,0.001025788,-0.003887032,-0.00343421,-0.0001647161,-0.0005552903,-0.0004906014,-2.353087e-05,0.652401,0.001083268,0.00169012,0.04194721,0.1674438,0.02564613,2.541403,0.7954283,3.110736,0.0,0.09892658,-0.001643735,-0.008121109,16.86129,0.3515495,1.218998,0.1169069
std,0.3244153,,55.83013,,,,,0.2885467,,,,,0.0,0.0,,4.418595,6.744022,15.00976,7.253456,3.267804,159.0917,118.7398,2.980594,1.875738,5.511651,13.04603,0.3823469,228.7943,5.423359,5.725387,12.46568,1.499202,,,2.921179,12.69062,0.4513092,0.4513945,0.4515465,0.4524057,0.4533894,0.4543876,0.4546307,0.454537,0.4543946,0.4542686,0.4540837,0.454949,0.4557977,0.4564781,0.2466131,0.2469229,0.2472161,0.247494,0.2476744,0.2478502,0.2480214,0.1995848,0.3691396,0.3694786,0.3697978,0.3700779,0.3702582,0.3704252,0.3705731,0.007902433,0.007963431,0.0111369,0.9371576,0.9408787,0.8323831,0.5500891,0.5536175,0.7811386,0.5350323,0.5370825,0.7523585,0.07643318,0.07672608,0.1074798,1.724346,0.02766163,0.03947974,0.1704341,0.3541303,0.136802,5.225088,3.098942,6.169283,0.0,0.2925222,0.2389877,0.6231182,30.68664,1.024493,26.77701,0.3213094
min,0.0,,1004.0,,,,,0.0,,,,,1.0,1.0,,5.0,0.0,2.7,0.2,0.0,25.0,20.3,0.0,0.7,0.0,0.0,0.0,2.7,0.7,0.0,0.1,0.0,,,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-10.57143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.0,0.0,0.0,-29.0,-30.0,-30.0,-31.0,-30.0,-30.0,-32.0,-4.285714,-4.285714,-4.571429,0.0,-2.775558e-16,0.0,0.0,0.0,0.0,-1.172396e-13,-1.776357e-13,0.0,0.0,0.0,-3.044522,-3.465736,1.0,0.0,-24.0,0.0
25%,0.0,,1057.0,,,,,0.0,,,,,1.0,1.0,,5.0,3.2,9.2,3.0,1.5,162.0,76.0,0.7,1.3,4.4,4.08,0.2,22.5,4.0,1.6,0.9,0.5,,,5.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
50%,0.0,,1117.0,,,,,0.0,,,,,1.0,1.0,,5.0,6.8,13.0,5.9,3.0,266.8,108.0,1.1,2.1,6.7,14.0,0.7,43.8,6.9,2.5,2.0,1.9,,,8.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
75%,0.0,,1152.0,,,,,0.0,,,,,1.0,1.0,,5.0,8.9,23.4,12.6,5.1,423.5,162.0,2.2,3.5,10.5,29.9,0.8,240.0,9.7,4.9,7.3,3.3,,,10.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,2.0,0.0


In [5]:
df.sample(n=50)

Unnamed: 0,will_it_sell,pos_id,product_id_unified,company_id,category_id,category_name,product_name,partner_product,address_city,diet,size,cooking_time,cooking_mv,cooking_ov,storage_temp,vat,bialko_100,weglow_100,cukry_calk,tluszcz_nasyc_calk,energia_calk,energia_100,tluszcz_nasyc_100,blonnik_100,tluszcz_calk,bialko_calk,sol_100,weglow_calk,blonnik_calk,tluszcz_100,cukry_100,sol_calk,weekday,quarter,month,week,qty_lag1,qty_lag2,qty_lag3,qty_lag4,qty_lag5,qty_lag6,qty_lag7,qty_lag8,qty_lag9,qty_lag10,qty_lag11,qty_lag12,qty_lag13,qty_lag14,meanLastPeriod_lag1,meanLastPeriod_lag2,meanLastPeriod_lag3,meanLastPeriod_lag4,meanLastPeriod_lag5,meanLastPeriod_lag6,meanLastPeriod_lag7,meanLastPeriod_lag1_lag7_diff,sdLastPeriod_lag1,sdLastPeriod_lag2,sdLastPeriod_lag3,sdLastPeriod_lag4,sdLastPeriod_lag5,sdLastPeriod_lag6,sdLastPeriod_lag7,minLastPeriod_lag1,minLastPeriod_lag7,minLastPeriod_lag1_lag7_diff,maxLastPeriod_lag1,maxLastPeriod_lag7,maxLastPeriod_lag1_lag7_diff,diff1_lag1,diff1_lag7,diff1_lag1_lag7_diff,diffLagPeriod_lag1,diffLagPeriod_lag7,diffLagPeriod_lag1_lag7_diff,mean_diff1_lag1,mean_diff1_lag7,mean_diff1_lag1_lag7_diff,sum_qty,avg_discount_mean_value_lag1,avg_discount_count_lag1,avg_from_blik_lag1,avg_from_paypass_lag1,avg_from_payu_lag1,avg_total_lag1,avg_total_to_discount_lag1,avg_total_base_lag1,avg_sum_fv_lag1,avg_transaction_discount_count_lag1,roc1_lag1,rocPeriod_lag1,days_since_prev_delivery,sales_since_prev_delivery,available_products,is_delivery_day
4885010,0,5b4893ddee2a423f39dacbf6,1156,5b39bb530663ab48e336173e,591301c83dd75608a9c2ef1b,Napoje,Smoothie BeRAW Breakfast energy,0,Skawina,,250ml,,,,,5,,,,,,,,,,,,,,,,,sobota,Q4,10,43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0
2160445,0,5cab37f2a77669455cbbaa78,1067,5caaeaa8822b5e2d312ac807,5cd1a4d32b10792bc08dab31,Pan Pomidor - Pierogi,"Pierogi z dynią, jarmużem, quinoą i kolendrą",0,Wrocław,,240g,2-3 min.,,,1-6 °C,5,3.9,23.0,1.1,0.4,155.0,155.0,0.4,2.3,4.8,3.9,1.1,23.0,2.3,4.8,1.1,1.1,poniedziałek,Q3,9,37,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,31.0,0.0,0.0,0
514033,0,5a37f93f753e9f3591458dd7,1189,5af3e4736e089c600a14dd86,590053bdc5c79d3575eb44f6,Zupy,Barszcz z mlekiem kokosowym,0,Warszawa,Zupy,300g,2-3 min.,1.0,1.0,2-5 °C,5,1.2,6.6,12.6,4.1,130.3,43.4,1.4,1.9,4.7,3.5,0.4,199.0,5.6,1.6,4.2,1.3,wtorek,Q3,9,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,1.0,1
508094,0,5b2b8fa60663ab48e334dd2b,1063,5b30ea120663ab48e3354bfc,59005cd6c5c79d3575eb450d,Przekąski,BeRAW Baton protein 38% - surowe kakao w gorzk...,0,Kraków,,60g,,,,,23,38.0,36.0,13.2,3.24,229.8,383.0,5.4,,5.16,22.8,0.06,21.6,,8.6,22.0,0.036,środa,Q2,6,26,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285714,0.285714,0.285714,0.285714,0.285714,0.0,0.0,0.285714,0.755929,0.755929,0.755929,0.755929,0.755929,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,-8.881784e-16,0.0,6.99,0.0,1.0,0.0,1.94591,2.0,0.0,0.0,0
2429560,1,5c8bae66a7d3a504da7a4b6b,1137,5c7fb0d36b25e24bf5028f65,5d1b55aa5379175d45e9360a,Mr Thai,Sesame Beef,0,Katowice,,380g,2-3 min.,,,2-5 °C,5,8.9,26.1,3.9,0.7,162.0,162.0,0.7,0.7,2.1,8.9,0.1,26.1,0.7,2.1,3.9,0.1,piątek,Q3,8,32,4.0,1.0,0.0,0.0,0.0,0.0,3.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,1.142857,0.857143,0.857143,0.857143,0.857143,0.857143,0.857143,0.285714,1.676163,1.214986,1.214986,1.214986,1.214986,1.214986,1.214986,0.0,0.0,0.0,4.0,3.0,1.0,3.0,1.0,2.0,2.0,3.0,-1.0,0.285714,0.428571,-0.142857,5.0,0.0,0.0,0.375,0.625,0.0,15.99,0.0,15.99,0.0,0.0,0.287682,0.980829,2.0,5.0,0.0,0
5141712,0,5c79285d7b6c863a5d522c20,1146,5c7cd37e3ae6e53aff1548b4,59005cd6c5c79d3575eb450d,Przekąski,"Superfood SPORT - banan, białko",1,Warszawa,,35g,,,,,8,11.0,61.3,14.7,0.5,135.8,388.0,1.4,7.8,5.2,3.9,0.01,215.0,2.7,14.9,42.1,0.0,środa,Q2,4,17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,0.0,2.0,1
2334542,1,5caf375e2c9b633ae260d078,1181,5c924f69b6cb840dcac430fe,5cd1a4f40a544c2d0d156fea,Pan Pomidor - Zupy,"Marokańska z quinoą, batatem i kolendrą",0,Katowice,,400g,2-3 min.,,,2-6 °C,8,1.6,5.1,2.6,0.1,45.0,45.0,0.1,,1.5,1.6,0.78,5.1,,1.5,2.6,0.78,poniedziałek,Q3,7,28,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.428571,0.428571,0.428571,0.428571,0.428571,0.428571,0.0,0.428571,1.133893,1.133893,1.133893,1.133893,1.133893,1.133893,0.0,0.0,0.0,0.0,3.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,1.0,0.0,16.98333,3.666667,20.643333,0.0,0.666667,0.0,1.609438,5.0,0.0,2.0,0
867374,0,5bc9b10ba6c5c61a73d3fefa,1078,5a572369cf5c8134b3a98692,5cb9b8eedf68013fb09db8f0,Makarony,Wegański z masłem orzechowym,0,Warszawa,,250g,2-3 min.,,,2-5 °C,5,8.0,18.6,1.5,1.5,415.0,166.0,0.6,6.7,12.9,20.0,1.0,465.0,16.8,5.1,0.6,2.5,niedziela,Q4,10,43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,110.0,0.0,0.0,0
4047454,0,5cd56095aa566714bfc60c5c,1098,5c98d473bedc6e4786a40398,5a6f110ca0899f5ca2f7d6e9,Dania Lunch Duże,Pieczone placki ziemniaczane z gulaszem drobiowym,0,Warszawa,Kuchnia Słowiańska,400g,2-3 min.,1.0,1.0,2-5 °C,5,7.2,12.1,3.2,2.2,444.0,111.0,0.6,3.3,12.0,28.9,0.9,48.4,13.4,3.0,0.8,3.6,czwartek,Q3,9,39,0.0,1.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.571429,0.571429,0.428571,0.285714,0.285714,0.285714,0.285714,0.285714,0.786796,0.786796,0.786796,0.755929,0.755929,0.755929,0.755929,0.0,0.0,0.0,2.0,2.0,0.0,-1.0,2.0,-3.0,0.0,1.0,-1.0,0.0,0.142857,-0.142857,2.0,0.0,0.0,0.0,1.0,0.0,15.99,0.0,15.99,0.0,0.0,0.0,1.386294,3.0,2.0,0.0,0
1946567,0,5b1b963bcaef965005d0e6b0,1061,5b1f8a5ecaef965005d12d82,5abe0aed049e180557e22330,Sałatki,"FitSalad - Sałatka z fetą, oliwkami i pomidork...",0,Warszawa,--,350g,,,,2-5 °C,5,4.4,15.8,1.8,7.0,504.0,144.0,2.0,2.1,22.7,15.4,0.8,55.3,7.5,6.5,0.5,2.7,sobota,Q2,5,21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0


A first glance at the tables above suggests that the oven and microwave cooking attributes (cooking_ov and cooking_mv) do not vary: both have a mean of 1.0 and a standard deviation of 0.0.

In [6]:
all(df.cooking_mv.notna() == df.cooking_ov.notna())

True

In [7]:
inds = df[df.cooking_mv.notna()].index
all(df.cooking_mv[inds] == df.cooking_ov[inds])

True

The attributes cooking_mv and cooking_ov are indistinguishable: they are missing in the same rows and equal wherever they are present.
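The two checks above can also be expressed in a single call. The sketch below uses a small made-up frame in place of the real data; `Series.equals` treats missing values in the same positions as equal, so it covers both the missingness pattern and the values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the two cooking columns (values are made up).
toy = pd.DataFrame({
    "cooking_mv": [1.0, np.nan, 1.0, np.nan],
    "cooking_ov": [1.0, np.nan, 1.0, np.nan],
})

# Series.equals treats NaN in the same position as a match, so one call checks
# both the missing positions and the values where present.
columns_identical = toy["cooking_mv"].equals(toy["cooking_ov"])

# A single distinct non-missing value means the column only encodes missingness.
n_distinct = toy["cooking_mv"].nunique(dropna=True)
```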

In [8]:
def impute_alt_or_zero(x, y):
    if not np.isnan(x) and not np.isnan(y):
        return max(x, y)
    if not np.isnan(x):
        return x
    if not np.isnan(y):
        return y
    return .0

def join_cooking_mv_and_ov(df):
    df["cooking_mv_or_ov"] = np.vectorize(impute_alt_or_zero)(df.cooking_mv, df.cooking_ov)
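As a possible alternative to the element-wise `np.vectorize` call, the same rule (take the maximum when both values exist, the available one when only one exists, and zero otherwise) can be sketched with array operations. This is an assumption about an equivalent formulation, shown on made-up arrays rather than on the notebook's columns.

```python
import numpy as np

# Made-up values covering the cases: both present, one present, none present.
mv = np.array([1.0, np.nan, 2.0, np.nan])
ov = np.array([1.0, 3.0, np.nan, np.nan])

# np.fmax ignores a NaN when the other operand is a number.
merged = np.fmax(mv, ov)
# Rows where both inputs were missing are still NaN; map them to 0.0.
merged = np.where(np.isnan(merged), 0.0, merged)
```

On a table with several million rows this avoids one Python function call per row.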

In [9]:
attrs_to_be_removed = {'cooking_mv', 'cooking_ov'}
numeric_attrs = set(df.select_dtypes(include=np.number).columns.to_list())

In [10]:
attrs_with_ids = set(filter(lambda c: "id" in c, attrs))
attrs_with_ids

{'category_id', 'company_id', 'pos_id', 'product_id_unified'}

In [11]:
attrs_with_names = set(filter(lambda c: "name" in c, attrs))
attrs_with_names

{'category_name', 'product_name'}

In [12]:
numeric_attrs -= attrs_with_ids
discrete_attrs = attrs - numeric_attrs
discrete_attrs

{'address_city',
 'category_id',
 'category_name',
 'company_id',
 'cooking_time',
 'diet',
 'pos_id',
 'product_id_unified',
 'product_name',
 'quarter',
 'size',
 'storage_temp',
 'weekday'}

In [13]:
discrete_descr = df[list(discrete_attrs)].describe(include="all")  # recent pandas versions reject a set as an indexer
discrete_descr

Unnamed: 0,company_id,product_id_unified,diet,category_name,address_city,product_name,weekday,size,storage_temp,cooking_time,category_id,pos_id,quarter
count,5360496,5360496.0,2840752,5360496,5360496,5360496,5360496,5360496,3930214,3632502,5360496,5360496,5360496
unique,298,,10,14,33,137,7,22,5,1,14,375,4
top,5a587706cf5c8134b3a9891d,,Dieta Samuraja,Dania Lunch Duże,Warszawa,Dyniowe curry z indykiem,czwartek,350g,2-5 °C,2-3 min.,5a6f110ca0899f5ca2f7d6e9,59fc5325c94b722506678bd1,Q3
freq,101533,,994953,1144201,2353704,112303,776919,1195365,3477491,3632502,1144201,25498,2034382
mean,,1104.991,,,,,,,,,,,
std,,55.83013,,,,,,,,,,,
min,,1004.0,,,,,,,,,,,
25%,,1057.0,,,,,,,,,,,
50%,,1117.0,,,,,,,,,,,
75%,,1152.0,,,,,,,,,,,


I remove category and product ids in favour of the names.

In [14]:
attrs_to_be_removed.update({'category_id', 'product_id_unified'})
discrete_attrs = attrs - numeric_attrs - attrs_to_be_removed

### Size attribute

In [15]:
list(pd.unique(df["size"]))

['35g',
 '350g',
 '240g',
 '500g',
 '250ml',
 '180g',
 '50g',
 '300g',
 '230g',
 '400g',
 '40g',
 '60g',
 '250g',
 '370g',
 '380g',
 '300ml',
 '200ml',
 '120g',
 '85g',
 '220g',
 '2x68g',
 '100g']

The size attribute can be converted into numeric weight and volume attributes.

In [16]:
import re

def multiply_all_numbers_from_str(text):
    # e.g. "2x68g" -> 2 * 68 = 136.0; the parameter no longer shadows the built-in str
    return np.prod(list(map(float, re.findall(r"\d+", text))))

def get_weight(size):
    if not size.endswith("g"):
        return np.nan
    return multiply_all_numbers_from_str(size)

def get_volume(size):
    if not size.endswith("ml"):
        return np.nan
    return multiply_all_numbers_from_str(size)

def convert_size_to_weight_and_volume(df):
    df["weight"] = np.vectorize(get_weight)(df["size"])
    df["volume"] = np.vectorize(get_volume)(df["size"])
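A quick way to sanity-check the parsing rule is to run it on a few of the size strings listed above. The sketch below is a standard-library restatement of the same idea; the function name `parse_size` is hypothetical and is not used elsewhere in the notebook.

```python
import math
import re

def parse_size(size):
    """Return (weight_g, volume_ml); the unit that does not apply is NaN."""
    # Multiply all integers found in the string, so "2x68g" becomes 2 * 68.
    amount = math.prod(float(n) for n in re.findall(r"\d+", size))
    if size.endswith("ml"):
        return math.nan, amount
    if size.endswith("g"):
        return amount, math.nan
    return math.nan, math.nan

parsed = {s: parse_size(s) for s in ["350g", "2x68g", "250ml"]}
```

Note that the pattern `\d+` would split a decimal such as "1.5" into two numbers; none of the sizes listed above contains one, so this does not matter here.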

In [17]:
attrs_to_be_removed.add("size")
discrete_attrs.remove("size")
numeric_attrs.update({"weight", "volume"})

### Storage temperature attribute

In [18]:
pd.unique(df.storage_temp)

array([nan, '2-5 °C', '1-6 °C', '2-6 °C', '0-6 °C', '0-7 °C'],
      dtype=object)

I will replace the storage temperature with a binary stored_in_fridge attribute, which is 1 when a storage temperature is given and 0 otherwise.

In [19]:
def convert_storage_temp_to_in_fridge(df):
    df["stored_in_fridge"] = df["storage_temp"].notna() * 1.0

attrs_to_be_removed.add("storage_temp")
discrete_attrs.remove("storage_temp")
numeric_attrs.add("stored_in_fridge")

### Generalizing some discrete attributes

Some discrete attributes have no `NA` in any record, and for some of them the model should generalize to values that do not occur in the training data.

In [20]:
counts = discrete_descr.loc["count",]
counts[counts == df.shape[0]].keys().intersection(discrete_attrs)

Index(['company_id', 'category_name', 'address_city', 'product_name',
       'weekday', 'pos_id', 'quarter'],
      dtype='object')

The 'weekday' and 'quarter' attributes cannot take new values, so I will focus on 'pos_id', 'company_id', 'product_name', 'category_name' and 'address_city'.

#### Sales locations

In [21]:
pos_counts = df.pos_id.value_counts()
pos_counts

59fc5325c94b722506678bd1    25498
595e36122e84531cd204096b    24850
5b61699404852c4704a26f28    24658
5b3381160663ab48e335895f    24620
5b1b963bcaef965005d0e6b0    24590
                            ...  
5dc694443bf8f15dc97ec3c7       78
5dc69e723bf8f15dc97ec4de       46
5dcd4c06dacdf109c179d79b       31
5dcd2fb58fc7d108e6b8bc09       29
5dcbccb4aa5f032f78d77579       28
Name: pos_id, Length: 375, dtype: int64

In [22]:
pos_counts[pos_counts < 300]

5dbff693bb783c5423163b8d    292
5dbac6883bf8f15dc97d4251    210
5dc28ac6b902d15de1f0152b    193
5c7901d27b6c863a5d521f21     93
5dc694443bf8f15dc97ec3c7     78
5dc69e723bf8f15dc97ec4de     46
5dcd4c06dacdf109c179d79b     31
5dcd2fb58fc7d108e6b8bc09     29
5dcbccb4aa5f032f78d77579     28
Name: pos_id, dtype: int64

For training purposes, I will delete the location information for locations with fewer than 100 sales records. I expect the samples with the location information removed to help the model generalize to new data, because I want it to predict sales in completely new locations.

In [23]:
pos_id_cats = pos_counts[pos_counts >= 100].index
pos_ids_to_be_forgotten = set(pos_counts.index.difference(pos_id_cats))
np.sum(pos_counts[list(pos_ids_to_be_forgotten)])  # list, because recent pandas versions reject a set as an indexer

305

#### Companies

In [24]:
company_counts = df.company_id.value_counts()
company_counts

5a587706cf5c8134b3a9891d    101533
5c48270855f3af637b69b13f     94045
5b9fa5e201f82e03b412bcff     81488
5b503c4a9586df16bbb6e07a     77930
5c924f69b6cb840dcac430fe     75667
                             ...  
5dc534a2fdcd0622bd7b34e5       239
5dbff3dd9f38f239c2ffeb0c       188
5e4550dd177446520bcd2c91       101
5d764e24bf7a586310594da4        72
5d764e2fd0c59c62f781ff36        46
Name: company_id, Length: 298, dtype: int64

In [25]:
company_counts[company_counts<300]

5dbff5229f38f239c2ffeb55    270
5ce2969fc828ba60d8093297    245
5dc534a2fdcd0622bd7b34e5    239
5dbff3dd9f38f239c2ffeb0c    188
5e4550dd177446520bcd2c91    101
5d764e24bf7a586310594da4     72
5d764e2fd0c59c62f781ff36     46
Name: company_id, dtype: int64

In [26]:
company_id_cats = company_counts[company_counts >= 200].index
companies_to_be_forgotten = company_counts.index.difference(company_id_cats)
np.sum(company_counts[companies_to_be_forgotten])

407

#### Product names

In [27]:
product_counts = df.product_name.value_counts()
product_counts

Dyniowe curry z indykiem                 112303
Ostra wołowina z kaszą gryczaną          109728
Wołowina z marchewkowym puree            108298
Risotto z indykiem i cukinią             107248
Szaszłyk z ryżem w kurkumie              104697
                                          ...  
Silny Łasuch - Łasuch i jego orzeszki       153
FitElixir - Black                           144
Silny Łasuch - 2w1                           95
All'arrabiata                                71
Pomidorowa z chilli                           9
Name: product_name, Length: 137, dtype: int64

In [28]:
product_counts[product_counts<400]

Silny Łasuch - Marchew w tropikach       342
Silny Łasuch - Cynamonowe jabłuszko      342
Chili con carne z ryżen                  230
ROŚLEKO - jaglano - orzechowe            218
Allarrabbiata                            203
Superfood - ZDROWIE - kakao - maca       193
Kaszotto z soczewicą i grzybami          190
ROŚLEKO - kakao                          184
Silny Łasuch - Łasuch i jego orzeszki    153
FitElixir - Black                        144
Silny Łasuch - 2w1                        95
All'arrabiata                             71
Pomidorowa z chilli                        9
Name: product_name, dtype: int64

For training purposes, I will delete the names of products with fewer than 200 sales records. I would like to predict sales for completely new products.

In [29]:
product_name_cats = product_counts[product_counts >= 200].index
products_names_to_be_forgotten = product_counts.index.difference(product_name_cats)
np.sum(product_counts[products_names_to_be_forgotten])

1039

#### Categories

In [30]:
category_counts = df.category_name.value_counts()
category_counts

Dania Lunch Duże         1144201
Przekąski                 929010
Dania Lunch Małe          797960
Napoje                    498033
Zupy                      486330
Makarony                  412650
Śniadania                 265078
Pan Pomidor - Zupy        258243
Sałatki                   208141
Pan Pomidor - Pierogi     166430
Mr Thai                   157172
Lucky Fish                 28050
Desery                      5557
DayUp                       3641
Name: category_name, dtype: int64

In [31]:
len(category_counts)

14

There are only 14 product categories, so I leave them unchanged; I do not expect any new category to appear.

#### Address cities

In [32]:
address_counts = df.address_city.value_counts()
address_counts

Warszawa                           2353704
Kraków                             1577282
Wrocław                             694811
Katowice                            234123
Skawina                              71726
11                                   50702
Gliwice                              41412
Niepołomice                          39114
Wieliczka                            38747
Kobierzyce, Bielany Wrocławskie      28283
Zabierzów                            25683
Balice                               22416
Wroclaw                              20324
Jagiellońska 74                      16850
30-001                               15393
Bielany Wrocławskie                  14924
Waszawa                              14923
Ruda Śląska                          14873
Prądnicka 65                         14257
Podłęże, Kraków                      11648
Wysoka, Wrocław                      11360
Bytom                                 8362
Wrocław, Biskupice Podgórne           7342
Wrocławiu  

I will delete the address city information for cities with fewer than 200 sales records. I would like to predict sales at unknown places.

In [33]:
address_city_cats = address_counts[address_counts >= 200].index
addresses_to_be_forgotten = address_counts.index.difference(address_city_cats)
np.sum(address_counts[addresses_to_be_forgotten])

386

#### Saving the indices of the samples with attribute values to be forgotten
I will use them to maintain the generality of the model and to validate the training results.

In [34]:
missing_cats_samples_idxs = {}
missing_cats_samples_idxs["pos_id"] = np.where(df.pos_id.isin(pos_ids_to_be_forgotten))[0]
missing_cats_samples_idxs["company_id"] = np.where(df.company_id.isin(companies_to_be_forgotten))[0]
missing_cats_samples_idxs["product_name"] = np.where(df.product_name.isin(products_names_to_be_forgotten))[0]
missing_cats_samples_idxs["address_city"] = np.where(df.address_city.isin(addresses_to_be_forgotten))[0]

important_missings_attrs = ["pos_id", "company_id", "product_name", "address_city"]

### Preprocessing the data

In [35]:
cats_for_attr = {'pos_id': pos_id_cats,
                 'company_id': company_id_cats,
                 'product_name': product_name_cats,
                 'address_city': address_city_cats}

for attr in discrete_attrs - {'pos_id', 'company_id', 'product_name', 'address_city'}:
    _, uniques = pd.factorize(df[attr])
    cats_for_attr[attr] = uniques

In [36]:
def remove_unnecessary_attrs(df):
    for attr in attrs_to_be_removed:
        del df[attr]

def convert_to_categoricals(df):
    for attr, attr_cats in cats_for_attr.items():
        df[attr] = df[attr].astype(pd.CategoricalDtype(attr_cats))

def with_categoricals_as_dummies(df):
    return pd.get_dummies(df,dummy_na=True, sparse=True)

def preprocessed_data_in_place(df):
    convert_size_to_weight_and_volume(df)
    join_cooking_mv_and_ov(df)
    convert_storage_temp_to_in_fridge(df)
    convert_to_categoricals(df)
    remove_unnecessary_attrs(df)
    return with_categoricals_as_dummies(df)
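The effect of this preprocessing on rare or unseen values can be illustrated on a toy column. In the sketch below the data and the threshold of 2 are made up; the point is that a value outside the declared categories becomes `NaN` after the cast, and `dummy_na=True` then gives such rows their own indicator column.

```python
import pandas as pd

# Hypothetical toy column: "A" and "B" are frequent, "C" is rare.
toy = pd.DataFrame({"pos_id": ["A", "A", "A", "B", "B", "C"]})

counts = toy["pos_id"].value_counts()
kept = counts[counts >= 2].index  # categories frequent enough to keep

# Values outside the declared categories become NaN after the cast ...
toy["pos_id"] = toy["pos_id"].astype(pd.CategoricalDtype(kept))
# ... and dummy_na=True gives those rows their own indicator column.
dummies = pd.get_dummies(toy, dummy_na=True)
```

Applying the same categories to the test set means that a location never seen in training is encoded the same way as the rare locations that were deliberately forgotten.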

## Splitting data for training and validation

### Set weights of samples and extract important ones

In [37]:
samples_weights = np.empty(df.shape[0])
samples_weights.fill(0.01)
important_indices = set()

for attr in important_missings_attrs:
    for idx in missing_cats_samples_idxs[attr]:
        samples_weights[idx] += 0.24
    important_indices.update(missing_cats_samples_idxs[attr])

important_indices = pd.Index(important_indices).unique()
    
important_samples = df.loc[important_indices, ]

avg_samples = df.drop(important_indices, axis=0)
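The weighting rule in the cell above gives every sample a base weight of 0.01 plus 0.24 for each attribute in which it carries a rare value. A minimal sketch on made-up indices shows the resulting weights; the dictionary `rare_rows` is a hypothetical stand-in for `missing_cats_samples_idxs`.

```python
import numpy as np

n_rows = 6
# Hypothetical positional indices of rows with a rare value, per attribute.
rare_rows = {
    "pos_id": np.array([1, 4]),
    "product_name": np.array([4]),
}

weights = np.full(n_rows, 0.01)
for rows in rare_rows.values():
    # Indices within one attribute are unique, so fancy-index += is safe here.
    weights[rows] += 0.24

# Rows that are rare in at least one attribute form the "important" group.
important = np.unique(np.concatenate(list(rare_rows.values())))
```

Row 4 is rare in two attributes and therefore ends up with weight 0.49, while row 1 gets 0.25.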

### Create training and validation sets

In [38]:
from sklearn.model_selection import train_test_split
avg_samples_train, avg_samples_val = train_test_split(avg_samples, random_state=0, test_size=0.05)
impt_samples_train, impt_samples_val = train_test_split(important_samples, random_state=0, test_size=0.3)

In [39]:
samples_train = pd.concat([avg_samples_train, impt_samples_train]).sample(frac=1)
samples_val = pd.concat([avg_samples_val, impt_samples_val]).sample(frac=1)

In [40]:
def x_y_split(df):
    y = df.will_it_sell
    X = df.drop(["will_it_sell"], axis=1)
    return X, y
    
def as_sparse(df):
    return df.astype(pd.SparseDtype("float", np.nan))

In [41]:
X_train, y_train = x_y_split(samples_train)
X_val, y_val = x_y_split(samples_val)

X_train = as_sparse(preprocessed_data_in_place(X_train))
y_train = as_sparse(y_train)
X_val = as_sparse(preprocessed_data_in_place(X_val))
y_val = as_sparse(y_val)

In [4]:
X_impt_val, y_impt_val = x_y_split(impt_samples_val)
X_impt_val = as_sparse(preprocessed_data_in_place(X_impt_val))
y_impt_val = as_sparse(y_impt_val)

## Challenge test set preparation and answer saver

In [19]:
X_test = pd.read_csv("FitFood_competition_data_test.csv", sep=";")
X_test = preprocessed_data_in_place(X_test).drop(columns=["will_it_sell"])
X_test = X_test.astype(pd.SparseDtype("float", np.nan))

In [20]:
import datetime

def save_ans(ans, ans_prefix_name="ans"):
    cur_dt = datetime.datetime.today()
    str_dt = cur_dt.strftime("%y-%m-%d_%H-%M-%S")
    file_name = f"{ans_prefix_name}_{str_dt}"
    with open(file_name, "w") as f:
        for p in ans:
            print(p, file=f)
    print(f"The answer saved as '{file_name}'")
            
            
def predict_ans(cls, X_test):
    return cls.predict_proba(X_test)[:,1]

def predict_and_save_ans(cls, ans_prefix_name="ans", X_test=X_test):
    ans = predict_ans(cls, X_test)
    save_ans(ans, ans_prefix_name)

## Features selection

### Analysis of correlations with target value

In [43]:
df = preprocessed_data_in_place(df)
corrs = df.corrwith(df.will_it_sell)
corrs.sort_values(inplace=True, ascending=False)

In [9]:
corrs[abs(corrs)>=0.04]

will_it_sell                                       1.000000
sum_qty                                            0.512098
meanLastPeriod_lag1                                0.482397
sdLastPeriod_lag1                                  0.479201
meanLastPeriod_lag2                                0.471712
sdLastPeriod_lag2                                  0.467968
maxLastPeriod_lag1                                 0.467142
meanLastPeriod_lag3                                0.461228
sdLastPeriod_lag3                                  0.456877
meanLastPeriod_lag4                                0.450535
sdLastPeriod_lag4                                  0.445714
meanLastPeriod_lag5                                0.439661
sdLastPeriod_lag5                                  0.434390
meanLastPeriod_lag6                                0.428485
sdLastPeriod_lag6                                  0.422924
meanLastPeriod_lag7                                0.417058
sdLastPeriod_lag7                       

In [10]:
len(corrs[abs(corrs)>=0.04])

98

In [46]:
len(corrs[abs(corrs)>=0.01])

438

In [47]:
len(corrs[abs(corrs)>=0.025])

171

### Feature importance with a random forest

In [50]:
X_small = X_val
y_small = y_val

In [51]:
from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators=250, random_state=0)

forest.fit(X_small, y_small)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

In [8]:
print("Feature ranking:")

ftrs = X_small.keys()
for i in range(len(ftrs)):
    print(f"{i + 1}. feature {indices[i]} ({ftrs[indices[i]]}) - importance {importances[indices[i]]}")

Feature ranking:
1. feature 64 (sum_qty) - importance 0.02857332555538195
2. feature 19 (week) - importance 0.026891786531448355
3. feature 68 (avg_from_paypass_lag1) - importance 0.021251139815816302
4. feature 18 (month) - importance 0.020849031102915774
5. feature 79 (available_products) - importance 0.01962649657615115
6. feature 77 (days_since_prev_delivery) - importance 0.019343703637926778
7. feature 34 (meanLastPeriod_lag1) - importance 0.01388388024562896
8. feature 42 (sdLastPeriod_lag1) - importance 0.012889859686469315
9. feature 36 (meanLastPeriod_lag3) - importance 0.012379493106474331
10. feature 35 (meanLastPeriod_lag2) - importance 0.012144795825265719
11. feature 74 (avg_transaction_discount_count_lag1) - importance 0.011964420664176374
12. feature 45 (sdLastPeriod_lag4) - importance 0.011362275913776547
13. feature 76 (rocPeriod_lag1) - importance 0.011260772429316009
14. feature 72 (avg_total_base_lag1) - importance 0.011078064406763889
15. feature 78 (sales_since_p

In [18]:
from sklearn.metrics import roc_auc_score

def auc_score(cls, X_val=X_val, y_val=y_val):
    # AUC of the predicted positive-class probabilities on the given validation set
    y_val_pred = cls.predict_proba(X_val)[:, 1]
    return roc_auc_score(y_val, y_val_pred)

## Training on all the features

### More samples in a leaf

In [54]:
trees = ExtraTreesClassifier(n_estimators=100, random_state=0, bootstrap=True,
                             min_samples_leaf=100, max_samples=0.2, n_jobs=-1)
trees.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=True, max_samples=0.2, min_samples_leaf=100,
                     n_jobs=-1, random_state=0)

In [57]:
auc_score(trees)

0.9053929549124441

In [70]:
auc_score(trees, X_impt_val, y_impt_val)

0.8302667523702393

In [78]:
predict_and_save_ans(trees, "ans_trees")

### Fewer samples in a leaf

In [62]:
trees1 = ExtraTreesClassifier(n_estimators=100, random_state=1, bootstrap=True,
                             min_samples_leaf=30, max_samples=0.2, n_jobs=-1)
trees1.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=True, max_samples=0.2, min_samples_leaf=30,
                     n_jobs=-1, random_state=1)

In [63]:
auc_score(trees1)

0.9175393820524615

In [69]:
auc_score(trees1, X_impt_val, y_impt_val)

0.8448296641491242

In [79]:
predict_and_save_ans(trees1, "ans_trees1")

### Training each estimator on more samples

In [80]:
trees2 = ExtraTreesClassifier(n_estimators=100, random_state=1, bootstrap=True,
                             min_samples_leaf=30, max_samples=0.7, n_jobs=-1)
trees2.fit(X_train, y_train)

ExtraTreesClassifier(bootstrap=True, max_samples=0.7, min_samples_leaf=30,
                     n_jobs=-1, random_state=1)

In [81]:
auc_score(trees2)

0.9300189710101461

In [82]:
auc_score(trees2, X_impt_val, y_impt_val)

0.8582878033103005

In [87]:
predict_and_save_ans(trees2, "ans_trees2")

## Fast comparison of feature selections with the same classifiers

I pass sample weights at the fitting stage in order to focus on the important samples (the more general ones, with unknown locations, product names, category names or company IDs).
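
As a rough illustration of how per-sample weights enter the fit, here is a minimal sketch on synthetic data. The `0.24` increment mirrors the weighting loop earlier in the notebook; the base weight of `1.0`, the toy arrays and the choice of "important" rows are assumptions made only for this example.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the training data (hypothetical, not the FitFood set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Assumed base weight of 1.0 for every row; "important" rows get +0.24.
demo_weights = np.ones(200)
important_rows = np.arange(0, 200, 10)  # every 10th row, chosen arbitrarily
demo_weights[important_rows] += 0.24

clf = ExtraTreesClassifier(n_estimators=20, random_state=1, bootstrap=True,
                           min_samples_leaf=5, max_samples=0.5)
# The third argument of fit() is sample_weight; it is named explicitly here.
clf.fit(X_demo, y_demo, sample_weight=demo_weights)

proba = clf.predict_proba(X_demo)[:, 1]
```

Rows with a larger weight count for more when impurity is computed at each split, so the trees are nudged towards getting those rows right.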

### The most important features with no information about locations
*with the exception of Warsaw and Krakow*

In [9]:
impt_ftrs85 = X_small.keys()[indices[:85]]
not_impt_ftrs85 = X_small.keys()[indices[85:]]

In [10]:
impt_ftrs85

Index(['sum_qty', 'week', 'avg_from_paypass_lag1', 'month',
       'available_products', 'days_since_prev_delivery', 'meanLastPeriod_lag1',
       'sdLastPeriod_lag1', 'meanLastPeriod_lag3', 'meanLastPeriod_lag2',
       'avg_transaction_discount_count_lag1', 'sdLastPeriod_lag4',
       'rocPeriod_lag1', 'avg_total_base_lag1', 'sales_since_prev_delivery',
       'maxLastPeriod_lag1', 'meanLastPeriod_lag6', 'meanLastPeriod_lag4',
       'sdLastPeriod_lag7', 'avg_total_lag1', 'avg_from_blik_lag1',
       'meanLastPeriod_lag7', 'sdLastPeriod_lag5', 'sdLastPeriod_lag2',
       'meanLastPeriod_lag5', 'sdLastPeriod_lag3', 'sdLastPeriod_lag6',
       'product_name_Dyniowe curry z indykiem', 'maxLastPeriod_lag7',
       'qty_lag1', 'meanLastPeriod_lag1_lag7_diff', 'qty_lag3', 'qty_lag2',
       'quarter_Q3', 'diff1_lag1', 'qty_lag4', 'weekday_środa',
       'weekday_piątek', 'sol_calk', 'is_delivery_day', 'avg_from_payu_lag1',
       'weekday_czwartek', 'avg_total_to_discount_lag1', 'weekday_n

In [7]:
indices[:85]

array([ 64,  19,  68,  18,  79,  77,  34,  42,  36,  35,  74,  45,  76,
        72,  78,  52,  39,  37,  48,  70,  67,  40,  46,  43,  38,  44,
        47, 765,  53,  20,  41,  22,  21, 950,  55,  23, 941, 943,  17,
        80,  69, 940,  71, 945, 944, 942,  11, 946, 895, 951,  54,   2,
        16,   4,  24,   7,  57, 896,  60, 948,  75, 949, 761,   6,   3,
        25,  63,  12,  13,  81,  15,  30,  29,  32,  31,  26,  10,   8,
        84,   5,  14,   9,  28,  59,  33])

In [30]:
X_impt_ftrs85_train = X_train.drop(not_impt_ftrs85, axis=1)
X_impt_ftrs85_val = X_val.drop(not_impt_ftrs85, axis=1)
X_impt_ftrs85_impt_val = X_impt_val.drop(not_impt_ftrs85, axis=1)

In [9]:
import multiprocessing

multiprocessing.cpu_count()

40

In [43]:
trees_impt85_fast = ExtraTreesClassifier(n_estimators=20, random_state=1, bootstrap=True,
                                         max_features=30, min_samples_leaf=1e-4, max_samples=0.2, n_jobs=30,
                                         verbose=1)
trees_impt85_fast.fit(X_impt_ftrs85_train, y_train, samples_weights[X_impt_ftrs85_train.index,])

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done   3 out of  20 | elapsed: 24.0min remaining: 136.0min
[Parallel(n_jobs=30)]: Done  20 out of  20 | elapsed: 25.3min finished


ExtraTreesClassifier(bootstrap=True, max_features=30, max_samples=0.2,
                     min_samples_leaf=0.0001, n_estimators=20, n_jobs=30,
                     random_state=1, verbose=True)

In [44]:
auc_score(trees_impt85_fast, X_impt_ftrs85_val, y_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   2 out of  20 | elapsed:    0.8s remaining:    7.0s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    0.9s finished


0.9200937299058559

In [45]:
auc_score(trees_impt85_fast, X_impt_ftrs85_impt_val, y_impt_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   2 out of  20 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    0.0s finished


0.8938916117628153

### The most correlated with target value
#### Top 97 features

In [9]:
ftrs_corr97 = corrs[abs(corrs)>=0.04].keys().drop("will_it_sell")
not_ftrs_corr97 = corrs[abs(corrs)<0.04].keys()

In [10]:
len(ftrs_corr97.intersection(impt_ftrs85))

63

In [11]:
len(ftrs_corr97.union(impt_ftrs85))

119

The two selections overlap substantially: 63 of the 85 most important features are also among the 97 most correlated ones.
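
The overlap can be quantified directly from the three counts printed above. This is only arithmetic on those numbers; the variable names are introduced for the example.

```python
# Counts taken from the cell outputs above.
n_important = 85    # top features by ExtraTrees importance
n_correlated = 97   # features with |correlation| >= 0.04 (target excluded)
n_both = 63         # size of the intersection

# Inclusion-exclusion reproduces the size of the union.
n_union = n_important + n_correlated - n_both

share_of_important = n_both / n_important    # fraction of the 85 that are also correlated
share_of_correlated = n_both / n_correlated  # fraction of the 97 that are also important
jaccard = n_both / n_union                   # intersection over union

print(n_union, round(share_of_important, 3), round(share_of_correlated, 3), round(jaccard, 3))
```

So roughly three quarters of the importance-based selection is shared, while each selection still contributes features the other one lacks.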

In [26]:
X_ftrs_corr97_train = X_train.drop(not_ftrs_corr97, axis=1)
X_ftrs_corr97_val = X_val.drop(not_ftrs_corr97, axis=1)
X_ftrs_corr97_impt_val = X_impt_val.drop(not_ftrs_corr97, axis=1)

In [46]:
trees_corr97 = ExtraTreesClassifier(n_estimators=20, random_state=1, bootstrap=True,
                                    max_features=30, min_samples_leaf=1e-4, max_samples=0.2, n_jobs=30,
                                    verbose=1)
trees_corr97.fit(X_ftrs_corr97_train, y_train, samples_weights[X_ftrs_corr97_train.index,])

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done   3 out of  20 | elapsed: 17.3min remaining: 97.9min
[Parallel(n_jobs=30)]: Done  20 out of  20 | elapsed: 19.0min finished


ExtraTreesClassifier(bootstrap=True, max_features=30, max_samples=0.2,
                     min_samples_leaf=0.0001, n_estimators=20, n_jobs=30,
                     random_state=1, verbose=True)

In [47]:
auc_score(trees_corr97, X_ftrs_corr97_val, y_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   2 out of  20 | elapsed:    0.7s remaining:    6.2s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    0.8s finished


0.916871361045221

In [48]:
auc_score(trees_corr97, X_ftrs_corr97_impt_val, y_impt_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   2 out of  20 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    0.0s finished


0.8846717820986661

#### Top 170 features

In [2]:
ftrs_corr170 = corrs[abs(corrs)>=0.025].keys().drop("will_it_sell")
not_ftrs_corr170 = corrs[abs(corrs)<0.025].keys()

In [3]:
len(ftrs_corr170)

170

In [4]:
X_ftrs_corr170_train = X_train.drop(not_ftrs_corr170, axis=1)
X_ftrs_corr170_val = X_val.drop(not_ftrs_corr170, axis=1)
X_ftrs_corr170_impt_val = X_impt_val.drop(not_ftrs_corr170, axis=1)

In [5]:
trees_corr170 = ExtraTreesClassifier(n_estimators=20, random_state=1, bootstrap=True,
                                    max_features=30, min_samples_leaf=1e-4, max_samples=0.2, n_jobs=30,
                                    verbose=1)
trees_corr170.fit(X_ftrs_corr170_train, y_train, samples_weights[X_ftrs_corr170_train.index,])

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done   3 out of  20 | elapsed: 11.3min remaining: 63.8min
[Parallel(n_jobs=30)]: Done  20 out of  20 | elapsed: 12.0min finished


ExtraTreesClassifier(bootstrap=True, max_features=30, max_samples=0.2,
                     min_samples_leaf=0.0001, n_estimators=20, n_jobs=30,
                     random_state=1, verbose=True)

In [6]:
auc_score(trees_corr170, X_ftrs_corr170_val, y_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   2 out of  20 | elapsed:    0.6s remaining:    5.3s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    0.8s finished


0.9144692038135699

In [7]:
auc_score(trees_corr170, X_ftrs_corr170_impt_val, y_impt_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   2 out of  20 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    0.0s finished


0.8906275108468584

## Training on the 85 most important features
### Lots of estimators and high max_samples rate

In [31]:
trees_impt85 = ExtraTreesClassifier(n_estimators=100, random_state=1, bootstrap=True,
                                    max_features=30, min_samples_leaf=1e-4, max_samples=0.7, n_jobs=30,
                                    oob_score=True, verbose=1)
trees_impt85.fit(X_impt_ftrs85_train, y_train, samples_weights[X_impt_ftrs85_train.index,])

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done 100 out of 100 | elapsed: 447.6min finished


ExtraTreesClassifier(bootstrap=True, max_features=30, max_samples=0.7,
                     min_samples_leaf=0.0001, n_jobs=30, oob_score=True,
                     random_state=1, verbose=True)

In [32]:
auc_score(trees_impt85, X_impt_ftrs85_val, y_val)

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done 100 out of 100 | elapsed:    4.8s finished


0.9301058309562702

In [33]:
auc_score(trees_impt85, X_impt_ftrs85_impt_val, y_impt_val)

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done 100 out of 100 | elapsed:    0.0s finished


0.917754700305319

In [36]:
X_impt_ftrs85_test = X_test.drop(not_impt_ftrs85, axis=1)

In [37]:
predict_and_save_ans(trees_impt85, "ans_trees_impt85", X_impt_ftrs85_test)

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done 100 out of 100 | elapsed:   28.4s finished


### A few estimators with a high max_samples rate, with every feature considered at each split

In [13]:
trees_impt85_best_splitting = ExtraTreesClassifier(n_estimators=10, random_state=1, bootstrap=True,
                                    max_features=85, min_samples_leaf=1e-4, max_samples=0.95, n_jobs=30,
                                    oob_score=True, verbose=1)
trees_impt85_best_splitting.fit(X_impt_ftrs85_train, y_train, samples_weights[X_impt_ftrs85_train.index,])

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done   6 out of  10 | elapsed: 164.4min remaining: 109.6min
[Parallel(n_jobs=30)]: Done  10 out of  10 | elapsed: 167.0min finished
  warn("Some inputs do not have OOB scores. "
  decision = (predictions[k] /


ExtraTreesClassifier(bootstrap=True, max_features=85, max_samples=0.95,
                     min_samples_leaf=0.0001, n_estimators=10, n_jobs=30,
                     oob_score=True, random_state=1, verbose=True)

In [14]:
auc_score(trees_impt85_best_splitting, X_impt_ftrs85_val, y_val)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   2 out of  10 | elapsed:    0.4s remaining:    1.6s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    0.5s finished


0.9374549951022586

In [15]:
auc_score(trees_impt85_best_splitting, X_impt_ftrs85_impt_val, y_impt_val)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   2 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    0.0s finished


0.9264924473726499

In [17]:
predict_and_save_ans(trees_impt85_best_splitting, "ans_trees_impt85_best_splitting", X_impt_ftrs85_test)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   2 out of  10 | elapsed:    2.9s remaining:   11.5s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    3.0s finished


### Let's fit an estimator for predicting sales at known locations
Its test predictions for known locations could be combined with the predictions of a more general estimator for unknown locations.

I add the pos_id features to the 85 most important ones.
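
A minimal sketch of that combination idea, assuming two prediction vectors aligned with the test rows. All identifiers and values below are hypothetical and serve only to show the masking step.

```python
import numpy as np

# Hypothetical location IDs: those seen in training and those of the test rows.
known_pos_ids = np.array(["pos_a", "pos_b"])
test_pos_ids = np.array(["pos_a", "pos_x", "pos_b", "pos_y"])

# Hypothetical predictions from the location-aware and the general estimator.
pred_location_aware = np.array([0.9, 0.7, 0.2, 0.6])
pred_general = np.array([0.8, 0.3, 0.4, 0.5])

# True where the test row belongs to a location present in the training data.
is_known = np.isin(test_pos_ids, known_pos_ids)

# Use the location-aware prediction for known locations, the general one otherwise.
final_pred = np.where(is_known, pred_location_aware, pred_general)
print(final_pred)
```

A hard switch like this is the simplest option; a weighted blend of the two vectors on the known rows is an equally plausible alternative.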

In [25]:
impt_ftrs455 = X_small.keys()[indices[:455]]
not_impt_ftrs455 = X_small.keys()[indices[455:]]

In [26]:
impt_ftrs455

Index(['sum_qty', 'week', 'avg_from_paypass_lag1', 'month',
       'available_products', 'days_since_prev_delivery', 'meanLastPeriod_lag1',
       'sdLastPeriod_lag1', 'meanLastPeriod_lag3', 'meanLastPeriod_lag2',
       ...
       'company_id_5bbdc6a015c00937a0810fc7',
       'pos_id_5a44d792f9cdb15f7f7d9d49',
       'company_id_5b35c7a90663ab48e335be72',
       'pos_id_5b713f4fd8f80d6ebf31d49f', 'pos_id_5ba0d8dd01f82e03b41313a0',
       'company_id_5bd018b451478a069f6c34fa', 'product_name_Gnocchi primavera',
       'pos_id_5c94d70e1032d642b63487c2',
       'company_id_5c8f6c59c667da0ef079d9fc',
       'pos_id_5cc040375e70463ad9d520dd'],
      dtype='object', length=455)

In [27]:
X_impt_ftrs455_train = X_train.drop(not_impt_ftrs455, axis=1)
X_impt_ftrs455_val = X_val.drop(not_impt_ftrs455, axis=1)
X_impt_ftrs455_impt_val = X_impt_val.drop(not_impt_ftrs455, axis=1)

In [28]:
trees_impt455 = ExtraTreesClassifier(n_estimators=20, random_state=2, bootstrap=True,
                                     max_features=30, min_samples_leaf=1e-5, max_samples=0.7, n_jobs=30,
                                     verbose=1)
trees_impt455.fit(X_impt_ftrs455_train, y_train)

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done   3 out of  20 | elapsed: 57.7min remaining: 326.8min
[Parallel(n_jobs=30)]: Done  20 out of  20 | elapsed: 60.2min finished


ExtraTreesClassifier(bootstrap=True, max_features=30, max_samples=0.7,
                     min_samples_leaf=1e-05, n_estimators=20, n_jobs=30,
                     random_state=2, verbose=True)

In [29]:
auc_score(trees_impt455, X_impt_ftrs455_val, y_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   2 out of  20 | elapsed:    1.5s remaining:   13.9s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    1.6s finished


0.9314723596253695

In [30]:
auc_score(trees_impt455, X_impt_ftrs455_impt_val, y_impt_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   2 out of  20 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    0.0s finished


0.8778924955809095

### Let's try with Gradient Boosting

In [6]:
from sklearn.ensemble import GradientBoostingClassifier
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=8, min_samples_leaf=1e-4,
                                    subsample=0.01, validation_fraction=0.001,
                                    n_iter_no_change=3, random_state=3, verbose=3)
gb_clf.fit(X_impt_ftrs85_train, y_train)

      Iter       Train Loss      OOB Improve   Remaining Time 
         1           0.6617           0.0644           33.33m
         2           0.6287           0.0392           29.90m
         3           0.6041           0.0296           29.57m
         4           0.5772           0.0234           29.05m
         5           0.5567           0.0187           28.42m
         6           0.5374           0.0153           28.16m
         7           0.5212           0.0131           27.94m
         8           0.5193           0.0116           27.76m
         9           0.5130           0.0086           27.19m
        10           0.5037           0.0072           26.70m
        11           0.4985           0.0073           26.40m
        12           0.4888           0.0065           26.05m
        13           0.4850           0.0055           25.69m
        14           0.4833           0.0049           25.31m
        15           0.4709           0.0043           24.91m
       

GradientBoostingClassifier(max_depth=8, min_samples_leaf=0.0001,
                           n_iter_no_change=3, random_state=3, subsample=0.01,
                           validation_fraction=0.001, verbose=3)

In [7]:
auc_score(gb_clf, X_impt_ftrs85_val, y_val)

0.9233331500706501

In [8]:
auc_score(gb_clf, X_impt_ftrs85_impt_val, y_impt_val)

0.9132552627350152

In [5]:
predict_and_save_ans(gb_clf, "ans_gb_clf", X_impt_ftrs85_test)

## Let's consider the time dependencies problem
In the feature importance ranking, time features occupy high positions. Note that the test set samples come from only part of the year. I try to generalize better by removing time/season information (with the exception of weekdays).

In [26]:
np.sort(pd.unique(df.month))

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [25]:
np.sort(pd.unique(X_test.month).to_dense())

array([ 1.,  2., 11., 12.])

In [27]:
np.sort(pd.unique(df.week))

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
       52])

In [28]:
np.sort(pd.unique(X_test.week).to_dense())

array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 48., 49., 50., 51.,
       52.])

In [15]:
time_ftrs = ['quarter_Q1', 'quarter_Q2', 'quarter_Q3', 'quarter_Q4', 'month', 'week']
X_impt_ftrs85_no_time_train = X_impt_ftrs85_train.drop(time_ftrs, axis=1)
X_impt_ftrs85_no_time_val = X_impt_ftrs85_val.drop(time_ftrs, axis=1)
X_impt_ftrs85_no_time_impt_val = X_impt_ftrs85_impt_val.drop(time_ftrs, axis=1)

In [16]:
X_impt_ftrs85_no_time_val.keys()

Index(['bialko_100', 'weglow_100', 'cukry_calk', 'tluszcz_nasyc_calk',
       'energia_calk', 'energia_100', 'tluszcz_nasyc_100', 'blonnik_100',
       'tluszcz_calk', 'bialko_calk', 'sol_100', 'weglow_calk', 'blonnik_calk',
       'tluszcz_100', 'cukry_100', 'sol_calk', 'qty_lag1', 'qty_lag2',
       'qty_lag3', 'qty_lag4', 'qty_lag5', 'qty_lag6', 'qty_lag7', 'qty_lag9',
       'qty_lag10', 'qty_lag11', 'qty_lag12', 'qty_lag13', 'qty_lag14',
       'meanLastPeriod_lag1', 'meanLastPeriod_lag2', 'meanLastPeriod_lag3',
       'meanLastPeriod_lag4', 'meanLastPeriod_lag5', 'meanLastPeriod_lag6',
       'meanLastPeriod_lag7', 'meanLastPeriod_lag1_lag7_diff',
       'sdLastPeriod_lag1', 'sdLastPeriod_lag2', 'sdLastPeriod_lag3',
       'sdLastPeriod_lag4', 'sdLastPeriod_lag5', 'sdLastPeriod_lag6',
       'sdLastPeriod_lag7', 'maxLastPeriod_lag1', 'maxLastPeriod_lag7',
       'maxLastPeriod_lag1_lag7_diff', 'diff1_lag1', 'diff1_lag1_lag7_diff',
       'diffLagPeriod_lag7', 'diffLagPeriod_lag1_

### Benchmark of ExtraTreesClassifiers with season features and without
#### Benchmark on my validation sets
I will test dropping the time information with the same fast classifier that I first used with `X_impt_ftrs85`, where it scored `0.92009` and `0.89389` AUC on my validation sets.

In [17]:
trees_impt85_no_time_fast = ExtraTreesClassifier(n_estimators=20, random_state=1, bootstrap=True,
                                         max_features=30, min_samples_leaf=1e-4, max_samples=0.2, n_jobs=30,
                                         verbose=2)
trees_impt85_no_time_fast.fit(X_impt_ftrs85_no_time_train, y_train,
                              samples_weights[X_impt_ftrs85_no_time_train.index,])

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.


building tree 1 of 20
building tree 2 of 20
building tree 3 of 20
building tree 4 of 20
building tree 5 of 20
building tree 6 of 20
building tree 7 of 20
building tree 8 of 20
building tree 9 of 20
building tree 10 of 20
building tree 11 of 20
building tree 12 of 20
building tree 13 of 20
building tree 14 of 20
building tree 15 of 20
building tree 16 of 20
building tree 17 of 20
building tree 18 of 20
building tree 19 of 20
building tree 20 of 20


[Parallel(n_jobs=30)]: Done   5 out of  20 | elapsed: 23.4min remaining: 70.3min
[Parallel(n_jobs=30)]: Done  16 out of  20 | elapsed: 24.2min remaining:  6.1min
[Parallel(n_jobs=30)]: Done  20 out of  20 | elapsed: 24.3min finished


ExtraTreesClassifier(bootstrap=True, max_features=30, max_samples=0.2,
                     min_samples_leaf=0.0001, n_estimators=20, n_jobs=30,
                     random_state=1, verbose=2)

In [19]:
auc_score(trees_impt85_no_time_fast, X_impt_ftrs85_no_time_val, y_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   3 out of  20 | elapsed:    0.8s remaining:    4.3s
[Parallel(n_jobs=20)]: Done  14 out of  20 | elapsed:    0.8s remaining:    0.3s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    0.8s finished


0.9036358011163456

In [21]:
auc_score(trees_impt85_no_time_fast, X_impt_ftrs85_no_time_impt_val, y_impt_val)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   3 out of  20 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=20)]: Done  14 out of  20 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    0.0s finished


0.8815985055439498

In [22]:
X_impt_ftrs85_no_time_test = X_impt_ftrs85_test.drop(time_ftrs, axis=1)

In [23]:
predict_and_save_ans(trees_impt85_no_time_fast, "trees_impt85_no_time_fast", X_impt_ftrs85_no_time_test)

[Parallel(n_jobs=20)]: Using backend ThreadingBackend with 20 concurrent workers.
[Parallel(n_jobs=20)]: Done   3 out of  20 | elapsed:    5.5s remaining:   31.3s
[Parallel(n_jobs=20)]: Done  14 out of  20 | elapsed:    5.8s remaining:    2.5s
[Parallel(n_jobs=20)]: Done  20 out of  20 | elapsed:    5.8s finished


These results are worse, but both the training and validation sets contain samples from every season, so the trees in the previous model could simply be better fitted to specific times of the year. I should have split the samples by season for validation.

#### Benchmark on train and validation sets split by season
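
A season-based holdout of this kind can be sketched with a boolean mask, which selects rows by condition and therefore does not depend on the frame having a default integer index. The toy frame below is hypothetical; only the `month` and `will_it_sell` column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy frame with non-default index labels on purpose.
toy = pd.DataFrame(
    {
        "month": [1, 2, 3, 6, 10, 11, 12, 7],
        "sum_qty": [5, 0, 2, 7, 1, 4, 3, 6],
        "will_it_sell": [1, 0, 0, 1, 0, 1, 0, 1],
    },
    index=[10, 11, 12, 13, 14, 15, 16, 17],
)

# Months covered by the challenge test set (November to February).
test_like_months = [11, 12, 1, 2]
is_test_like = toy["month"].isin(test_like_months)

train_part = toy[~is_test_like]  # March to October: used for fitting
val_part = toy[is_test_like]     # November to February: used for validation

print(sorted(train_part["month"].unique()))
print(sorted(val_part["month"].unique()))
```

Validating on the months that resemble the test period gives a score that is closer to what the leaderboard measures than a random split across all seasons.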

In [5]:
df85_impt_ftrs85 = df.drop(not_impt_ftrs85, axis=1)

samples_months_3to10_indices = np.where(df85_impt_ftrs85.month.isin(range(3, 11)))[0]

samples_impt_ftrs85_months_3to10 = df85_impt_ftrs85.loc[samples_months_3to10_indices, ]
samples_impt_ftrs85_months_11to2 = df85_impt_ftrs85.drop(samples_months_3to10_indices, axis=0)

X_impt_ftrs85_months_3to10, y_months_3to10 = x_y_split(samples_impt_ftrs85_months_3to10)
X_impt_ftrs85_months_11to2, y_months_11to2 = x_y_split(samples_impt_ftrs85_months_11to2)

In [6]:
X_impt_ftrs85_months_3to10 = as_sparse(X_impt_ftrs85_months_3to10)
y_months_3to10 = as_sparse(y_months_3to10)
X_impt_ftrs85_months_11to2 = as_sparse(X_impt_ftrs85_months_11to2)
y_months_11to2 = as_sparse(y_months_11to2)

In [7]:
time_ftrs = ['quarter_Q1', 'quarter_Q2', 'quarter_Q3', 'quarter_Q4', 'month', 'week']

In [8]:
X_impt_ftrs85_no_time_months_3to10 = X_impt_ftrs85_months_3to10.drop(time_ftrs, axis=1)
X_impt_ftrs85_no_time_months_11to2 = X_impt_ftrs85_months_11to2.drop(time_ftrs, axis=1)

In [9]:
trees_impt85_very_fast = ExtraTreesClassifier(n_estimators=10, random_state=1, bootstrap=True,
                                              max_features=30, min_samples_leaf=1e-4, max_samples=0.1,
                                              n_jobs=30, verbose=2)
trees_impt85_very_fast.fit(X_impt_ftrs85_months_3to10, y_months_3to10,
                           samples_weights[X_impt_ftrs85_months_3to10.index,])

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.


building tree 1 of 10
building tree 2 of 10
building tree 3 of 10
building tree 4 of 10
building tree 5 of 10
building tree 6 of 10
building tree 7 of 10
building tree 8 of 10
building tree 9 of 10
building tree 10 of 10


[Parallel(n_jobs=30)]: Done   5 out of  10 | elapsed:  4.4min remaining:  4.4min
[Parallel(n_jobs=30)]: Done  10 out of  10 | elapsed:  4.4min finished


ExtraTreesClassifier(bootstrap=True, max_features=30, max_samples=0.1,
                     min_samples_leaf=0.0001, n_estimators=10, n_jobs=30,
                     random_state=1, verbose=2)

In [10]:
auc_score(trees_impt85_very_fast, X_impt_ftrs85_months_11to2, y_months_11to2)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   3 out of  10 | elapsed:    1.4s remaining:    3.4s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    1.5s finished


0.8418574024096729

In [11]:
trees_impt85_no_time_very_fast = ExtraTreesClassifier(n_estimators=10, random_state=1, bootstrap=True,
                                                      max_features=30, min_samples_leaf=1e-4, max_samples=0.1,
                                                      n_jobs=30, verbose=2)
trees_impt85_no_time_very_fast.fit(X_impt_ftrs85_no_time_months_3to10, y_months_3to10,
                                   samples_weights[X_impt_ftrs85_no_time_months_3to10.index,])

[Parallel(n_jobs=30)]: Using backend ThreadingBackend with 30 concurrent workers.


building tree 1 of 10
building tree 2 of 10
building tree 3 of 10
building tree 4 of 10
building tree 5 of 10
building tree 6 of 10
building tree 7 of 10
building tree 8 of 10
building tree 9 of 10
building tree 10 of 10


[Parallel(n_jobs=30)]: Done   5 out of  10 | elapsed:  4.3min remaining:  4.3min
[Parallel(n_jobs=30)]: Done  10 out of  10 | elapsed:  4.4min finished


ExtraTreesClassifier(bootstrap=True, max_features=30, max_samples=0.1,
                     min_samples_leaf=0.0001, n_estimators=10, n_jobs=30,
                     random_state=1, verbose=2)

In [12]:
auc_score(trees_impt85_no_time_very_fast, X_impt_ftrs85_no_time_months_11to2, y_months_11to2)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   3 out of  10 | elapsed:    1.4s remaining:    3.2s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    1.4s finished


0.8473781705546937

Training without the season features slightly outperformed training with them on the held-out months (`0.84738` vs `0.84186` AUC).

## ExtraTreesClassifier and GradientBoosting with no season features trained on all data
### ExtraTreesClassifier

In [3]:
not_impt_ftrs85_no_time = not_impt_ftrs85.append(pd.Index(time_ftrs))

In [4]:
X, y = x_y_split(df.drop(not_impt_ftrs85_no_time, axis=1))

In [6]:
X = as_sparse(X)
y = as_sparse(y)

In [23]:
trees_impt85_no_time_best_splitting = ExtraTreesClassifier(n_estimators=10, random_state=1, bootstrap=True,
                                                           max_features=79, min_samples_leaf=1e-4,
                                                           max_samples=0.95, n_jobs=40, verbose=2)
trees_impt85_no_time_best_splitting.fit(X, y, samples_weights[X.index,])

[Parallel(n_jobs=40)]: Using backend ThreadingBackend with 40 concurrent workers.


building tree 1 of 10
building tree 2 of 10
building tree 3 of 10
building tree 4 of 10
building tree 5 of 10
building tree 6 of 10
building tree 7 of 10
building tree 8 of 10
building tree 9 of 10
building tree 10 of 10


[Parallel(n_jobs=40)]: Done   3 out of  10 | elapsed: 99.2min remaining: 231.4min
[Parallel(n_jobs=40)]: Done  10 out of  10 | elapsed: 103.7min finished


ExtraTreesClassifier(bootstrap=True, max_features=79, max_samples=0.95,
                     min_samples_leaf=0.0001, n_estimators=10, n_jobs=40,
                     random_state=1, verbose=2)

In [7]:
X_test = X_test.drop(not_impt_ftrs85_no_time, axis=1)

In [26]:
predict_and_save_ans(trees_impt85_no_time_best_splitting, "ans_trees_impt85_no_time_best_splitting", X_test)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   3 out of  10 | elapsed:    3.8s remaining:    8.9s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    3.9s finished


### GradientBoosting

In [9]:
from sklearn.ensemble import GradientBoostingClassifier
gb_clf_no_time = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=8,
                                            min_samples_leaf=1e-4,
                                            subsample=0.01, validation_fraction=0.001,
                                            n_iter_no_change=3, random_state=3, verbose=3)
gb_clf_no_time.fit(X, y)

      Iter       Train Loss      OOB Improve   Remaining Time 
         1           0.6674           0.0645           47.86m
         2           0.6276           0.0398           40.72m
         3           0.6057           0.0289           38.36m
         4           0.5750           0.0225           37.08m
         5           0.5641           0.0180           36.08m
         6           0.5410           0.0153           35.19m
         7           0.5314           0.0126           34.39m
         8           0.5122           0.0107           33.68m
         9           0.5075           0.0093           33.02m
        10           0.4979           0.0080           32.41m
        11           0.4934           0.0070           31.93m
        12           0.4910           0.0056           31.37m
        13           0.4863           0.0046           30.78m
        14           0.4795           0.0044           30.38m
        15           0.4741           0.0035           29.92m
       

GradientBoostingClassifier(max_depth=8, min_samples_leaf=0.0001,
                           n_iter_no_change=3, random_state=3, subsample=0.01,
                           validation_fraction=0.001, verbose=3)

In [10]:
predict_and_save_ans(gb_clf_no_time, "ans_gb_clf_no_time", X_test)

## Combining multiple answers into a final one

In [24]:
def combine_answers(answers, weights, masks):
    cb_ans = np.zeros(answers[0].shape)
    weights_sum = np.zeros(answers[0].shape)
    for ans, w, mask in zip(answers, weights, masks):
        cb_ans += ans * w * mask
        weights_sum += w * mask
    return cb_ans / weights_sum

In [20]:
def load_answer(file_name):
    # Read one prediction per line into an array aligned with the rows of X_test.
    ans = np.empty(X_test.shape[0])
    with open(file_name, "r") as f:
        for i, p in enumerate(f):
            ans[i] = float(p)
    return ans

In the end, I decided not to combine multiple answers.
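
For reference, here is a minimal sketch of how a helper like `combine_answers` could be applied. The prediction vectors, weights and masks below are made-up illustrative values, not results from this notebook; the function body is repeated only so that the snippet runs on its own.

```python
import numpy as np

def combine_answers(answers, weights, masks):
    # Weighted average of prediction vectors; a mask value of 0 excludes
    # a given answer from the average for that row.
    cb_ans = np.zeros(answers[0].shape)
    weights_sum = np.zeros(answers[0].shape)
    for ans, w, mask in zip(answers, weights, masks):
        cb_ans += ans * w * mask
        weights_sum += w * mask
    return cb_ans / weights_sum

# Hypothetical predictions of two models for four test rows.
ans_gb = np.array([0.9, 0.2, 0.6, 0.4])
ans_trees = np.array([0.7, 0.4, 0.2, 0.8])

# The first model contributes to every row, the second only to the first two.
masks = [np.ones(4), np.array([1.0, 1.0, 0.0, 0.0])]
weights = [3.0, 1.0]

combined = combine_answers([ans_gb, ans_trees], weights, masks)
print(combined)
```

Every row needs at least one answer with a non-zero mask and weight; otherwise `weights_sum` is zero for that row and the division yields `nan`.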

## Summary - Lessons learned <a class="anchor" id="Summary-lessons-learned"></a>
To make a long story short, the most important thing is to build models that do not overfit and that use a limited number of features, so that the training samples cannot be split too finely on irrelevant ones. It is also necessary to set a minimum number of samples required at a leaf node.
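
The effect of the leaf-size constraint can be sketched on synthetic data. This is only an illustration under stated assumptions: the data come from `make_classification` (not the FitFood set), and the helper `train_val_auc` as well as all parameter values are hypothetical choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 informative features and 35 pure-noise ones,
# with 10% of the labels flipped at random.
X_syn, y_syn = make_classification(n_samples=2000, n_features=40, n_informative=5,
                                   n_redundant=0, flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, test_size=0.25,
                                            random_state=0)

def train_val_auc(min_samples_leaf):
    # Fit extra trees with the given leaf-size constraint and
    # return (training AUC, validation AUC).
    clf = ExtraTreesClassifier(n_estimators=50, min_samples_leaf=min_samples_leaf,
                               random_state=1)
    clf.fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1])
    auc_val = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    return auc_tr, auc_val

# Fully grown trees (one sample per leaf) versus leaves holding at least 5% of the samples.
auc_tr_full, auc_val_full = train_val_auc(1)
auc_tr_reg, auc_val_reg = train_val_auc(0.05)

print("min_samples_leaf=1:    train AUC %.3f, validation AUC %.3f" % (auc_tr_full, auc_val_full))
print("min_samples_leaf=0.05: train AUC %.3f, validation AUC %.3f" % (auc_tr_reg, auc_val_reg))
```

Fully grown trees can memorize the training set, so a large gap between the training and validation AUC is a sign of the overfitting described above. A float `min_samples_leaf` is interpreted by scikit-learn as a fraction of the training samples, which is how it is used in the models in this notebook.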

According to the contest site evaluation, my gradient boosting models outperform my extra trees classifiers. My own tests do not confirm this, but they are not trustworthy because my validation set is too similar to the training set. Another reason could be that I set a maximum depth for the gradient boosting trees.