# 'Trained' models
* The first reference is a naive model that uses no feature information: the probability that a test customer belongs to a class is taken to be the class frequency in the training sample, *n(customers per class)/all(customers)*.
* The second model is a multi-class boosted decision tree classifier.
* The third model uses a boosted decision tree to classify male or female and a regression model to predict the age.

---

Prediction scores, evaluated with the logistic loss:
* Naive entries per class model: 
   - loss: **2.42786222642**
   - score on kaggle: **2.42762**
* AdaBoost Classifier with: 
   - algorithm='SAMME.R'
   - DT(max_features=4, min_samples_leaf=74.645) 
   - learning_rate=0.15 , n_estimators=800 
   - random_state=666
   - _loss on training_: **2.39243** // **2.53330** (Kaggle)
   - _separate optimised BDTs for nEvts==0 and nEvts>=1_: **2.42243** (Kaggle)
* Gradient Boosting:
   - loss: deviance
   - max_features=None, min_samples_leaf=800, 
   - learning_rate=0.005, n_estimators=700
   - score on Kaggle: 2.39166 (local part of training data: 2.3988)
* SVM Solution:
   - C=1.1, gamma='auto', kernel='rbf'
   - score on Kaggle: 2.40673
   - but 80 minutes training
* Neural net (Lasagne&Theano):

In [3]:
import pandas as pd
from pandas import DataFrame as df
import seaborn as sns
import sys
import numpy as np
import time

%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import log_loss
from sklearn.externals import joblib
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

In [5]:
sns.set_style('ticks')

In [6]:
sys.path.append("/home/mschlupp/pythonTools")
tmp = %pwd
files_dir = tmp + "/files/" 
solution_dir = tmp+'/predictions/'

In [7]:
#train = pd.read_csv(files_dir+'gender_age_train.csv')
#test = pd.read_csv(files_dir+'gender_age_test.csv')

In [8]:
%ls files/

app_events.csv        phone_brand_device_model.csv
app_labels.csv        phone_brand_device_model_engl.csv
events.csv            sample_submission.csv
events_day_hour.csv   traintest_fullevt.csv
gender_age_test.csv   traintest_phone.csv
gender_age_train.csv  traintest_phone_day_hour.csv
label_categories.csv  traintest_phone_evts.csv


#### Let us first remind ourselves what we have in our file

In [9]:
new_set = pd.read_csv(files_dir+'traintest_fullevt.csv', nrows=0) # just read the header

In [10]:
cols=new_set.columns
print(cols)

Index(['age', 'device_id', 'gender', 'group', 'isTrain', 'phone_brand',
       'device_model', 'hasEvents', 'nEvts', 'longitude_mean',
       'longitude_variance', 'latitude_mean', 'latitude_variance',
       'usageTime_mean', 'usageTime_variance', 'usageDay_mean',
       'usageDay_variance'],
      dtype='object')


In [11]:
cols = cols.drop('hasEvents')

In [12]:
# let's read the data (all columns except hasEvents) and split in train and test sample
# we only want the training sample for now.
data = pd.read_csv(files_dir+'traintest_fullevt.csv', usecols=cols)

In [13]:
train=data[data.isTrain==1]

Prepare our multi-class labels.

In [14]:
true_classes = pd.DataFrame(np.zeros((len(train.device_id),len(train.group.unique()))))
true_classes.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')

In [15]:
le_groups = LabelEncoder()
le_groups.fit(data.group.unique())
le_groups.classes_

array(['F23-', 'F24-26', 'F27-28', 'F29-32', 'F33-42', 'F43+', 'M22-',
       'M23-26', 'M27-28', 'M29-31', 'M32-38', 'M39+', 'none'], dtype=object)

In [16]:
true_classes.columns = le_groups.inverse_transform(list(true_classes.columns))

In [17]:
true_classes.columns

Index(['F23-', 'F24-26', 'F27-28', 'F29-32', 'F33-42', 'F43+', 'M22-',
       'M23-26', 'M27-28', 'M29-31', 'M32-38', 'M39+'],
      dtype='object')

In [18]:
# There should be a smarter way. It's late, sorry --> sparse matrices
for i,row,x in zip(range(0,len(train.group)),true_classes.iterrows(), train.group):
    if i % 10001 == 0: 
        print('still alive... ', i)
    row[1][x]=1

still alive...  0
still alive...  10001
still alive...  20002
still alive...  30003
still alive...  40004
still alive...  50005
still alive...  60006
still alive...  70007


In [19]:
true_classes.columns = le_groups.transform(true_classes.columns)
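The row-by-row loop above can be avoided. A minimal sketch, assuming the integer class codes are available as a NumPy array (the `codes` array and the three toy classes are made up for illustration): indexing an identity matrix with the codes yields the one-hot matrix in a single step.

```python
import numpy as np

# hypothetical integer class codes, e.g. the result of LabelEncoder.transform
codes = np.array([2, 0, 1, 2])
n_classes = 3

# row k of the identity matrix is the one-hot vector of class k,
# so fancy indexing with the codes builds the whole matrix at once
one_hot = np.eye(n_classes)[codes]
print(one_hot)
```

For this notebook the equivalent would presumably be `np.eye(12)[le_groups.transform(train.group)]`. Note that `log_loss` also accepts the integer labels directly, so the one-hot matrix is not strictly required for scoring.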

In [20]:
# create the dataset we are actually using for training
in_data = train.drop(['age','gender','group','device_id','isTrain'], axis=1)

# now transform string variables into numerical categories 
le_phone = LabelEncoder()
le_device = LabelEncoder()
le_phone.fit(data['phone_brand'].unique())
le_device.fit(data['device_model'].unique())

in_data['device_model'] = le_device.transform(in_data['device_model'])
in_data['phone_brand'] = le_phone.transform(in_data['phone_brand'])

# let's check our dataset
in_data.head(1)


|   | phone_brand | device_model | nEvts | longitude_mean | longitude_variance | latitude_mean | latitude_variance | usageTime_mean | usageTime_variance | usageDay_mean | usageDay_variance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51 | 749 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |


In [21]:
true_class = le_groups.transform(train.group)
true_class

array([10, 10, 10, ...,  6, 10,  7])

# Reference model
The first reference is a naive model that uses no feature information.
The probability that a test customer belongs to a class is taken to be the class frequency in the training sample, *n(customers per class)/all(customers)*.

---

Prediction scores, evaluated with the logistic loss:
* Naive entries per class model: 
   - loss: **2.42786222642**
   - score on kaggle: **2.42762**

In [13]:
groups = train.groupby('group').count()

In [14]:
groups.device_id = groups.device_id/len(train.age)

In [15]:
groups.device_id # show our naive prediction

group
F23-      0.067654
F24-26    0.056132
F27-28    0.041771
F29-32    0.062000
F33-42    0.074499
F43+      0.056186
M22-      0.100315
M23-26    0.128676
M27-28    0.072945
M29-31    0.097917
M32-38    0.126948
M39+      0.114957
Name: device_id, dtype: float64

In [16]:
# build the prediction matrix
prediction = np.zeros((len(train.age),len(groups.device_id)))

In [17]:
# Let us use the log_loss 
# (sklearn's logistic loss / cross-entropy) 
# implementation to score our prediction

# first transform group into numerical classes
labelEnc = LabelEncoder()
labelEnc.fit(train.group)
true_group = labelEnc.transform(train.group)

In [18]:
dg = df(columns=groups.index.values)
probs_per_group = dg.append(groups.device_id)

In [17]:
# assign our probabilities to the prediction array
for i in range(0,prediction.shape[0]):
    prediction[i]=probs_per_group.values[0]

In [18]:
print("Logistic loss of our prediction is: ")
print(log_loss(true_group,prediction))

Logistic loss of our prediction is: 
2.42786222642
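This value can be cross-checked without building the prediction matrix. When every row receives the class frequencies as its prediction, a sample of class k contributes -ln(p_k) to the loss, and class k occurs with frequency p_k, so the mean loss is the entropy of the class distribution. A small sketch using the frequencies printed above (they are rounded to six digits, so the agreement is only approximate):

```python
import math

# class frequencies as printed for groups.device_id above
freqs = [0.067654, 0.056132, 0.041771, 0.062000, 0.074499, 0.056186,
         0.100315, 0.128676, 0.072945, 0.097917, 0.126948, 0.114957]

# mean log loss of the constant prediction = -sum_k p_k * ln(p_k),
# i.e. the entropy of the class distribution in nats
entropy = -sum(p * math.log(p) for p in freqs)
print(round(entropy, 4))
```

Any model that uses the features has to score below this number to be worth its training time.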


# Multi-class boosted decision tree classification

In [15]:
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, GradientBoostingClassifier
from sklearn.cross_validation import StratifiedKFold, KFold, LabelKFold


#### Build the ML algorithm

In [19]:
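# no hyper-parameter tuning, just from experience...
# (the fitted model printed below stems from an earlier configuration:
#  max_features=4, min_samples_leaf=74.645, learning_rate=0.15, random_state=666)
# each leaf has to hold at least 1% of the training rows
bdt = AdaBoostClassifier(DecisionTreeClassifier(min_samples_leaf=int(0.01*len(train))),
                         algorithm="SAMME.R",
                         learning_rate=0.05,
                         n_estimators=800)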

In [127]:
s=time.time()
bdt.fit(in_data,true_class)
print('training finished after: ', (time.time()-s)/60.0, ' minutes.' )

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=4, max_leaf_nodes=None, min_samples_leaf=74.645,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=0.15, n_estimators=800, random_state=666)

In [239]:
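s=time.time()
# predict_proba needs the purely numerical features: drop the 'group'
# column in case it has already been added to in_data (see the tuning cells below)
probas = bdt.predict_proba(in_data.drop('group', axis=1, errors='ignore'))
print('duration of predictions: ', (time.time()-s)/60.0, ' minutes')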

In [136]:
log_loss(true_classes,probas)

2.3924290904229317

Not really overwhelming: the training loss of 2.392 is only slightly below the naive reference of 2.428.
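The loss above is computed on the very rows the model was fitted on, so it is optimistic; the Kaggle score quoted at the top for this model is 2.53330. A held-out split gives a fairer local estimate. The following is a minimal sketch on synthetic data, assuming a current scikit-learn in which `train_test_split` lives in `sklearn.model_selection`; the data set, the classifier settings and all variable names are illustrative and not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# synthetic 3-class problem standing in for the real features
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# keep 25% of the rows aside; stratify so every class appears in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_fit, y_fit)

# loss on the rows used for fitting versus loss on unseen rows
loss_fit = log_loss(y_fit, clf.predict_proba(X_fit))
loss_val = log_loss(y_val, clf.predict_proba(X_val))
print("fit sample: %.3f, held-out sample: %.3f" % (loss_fit, loss_val))
```

The gap between the two numbers is a rough indicator of how far a training loss can be trusted as a predictor of the leaderboard score.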

In [175]:
bdt.get_params()

{'algorithm': 'SAMME.R',
 'base_estimator': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=4, max_leaf_nodes=None, min_samples_leaf=74.645,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=None, splitter='best'),
 'base_estimator__class_weight': None,
 'base_estimator__criterion': 'gini',
 'base_estimator__max_depth': None,
 'base_estimator__max_features': 4,
 'base_estimator__max_leaf_nodes': None,
 'base_estimator__min_samples_leaf': 74.645,
 'base_estimator__min_samples_split': 2,
 'base_estimator__min_weight_fraction_leaf': 0.0,
 'base_estimator__presort': False,
 'base_estimator__random_state': None,
 'base_estimator__splitter': 'best',
 'learning_rate': 0.15,
 'n_estimators': 800,
 'random_state': 666}

In [146]:
joblib.dump(bdt, 'trainedModels/ad_hoc_BDT.pkl',compress=3) 

['trainedModels/ad_hoc_BDT.pkl']

In [160]:
bdt = joblib.load('trainedModels/ad_hoc_BDT.pkl')

In [165]:
# let's prepare a submission
probs_naive_df = pd.DataFrame(bdt.predict_proba(test), index=test.index)



In [169]:
probs_naive_df.columns = le_groups.inverse_transform(probs_naive_df.columns)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [176]:
out = pd.concat([pd.DataFrame(data.device_id[data.isTrain==0]),probs_naive_df], axis=1)
out.to_csv(solution_dir+'naive_bdt.csv',index=False)

In our current data setup, two very different prediction scenarios are present: either a device has event entries, or it doesn't. We'll test whether two separately optimised algorithms, one for each scenario, help to reduce the loss score.

In [22]:
def optimisePars(mva, points, data , classes, fraction=0.7, score = 'log_loss', cvs=5):
    '''
    Funtion to optimise hyper-parameters. Follows sklearn example:
    "example-model-selection-grid-search-digits-py"
    
    Arguments:
    mva - multivariate method to optimise
    points - dictionary that holds optimisation poitns
    data - your training data
    classes - true categories/classes/labels 
    fraction - fraction of training/test split
    score - score function to optimies the classifier
    cvs - number of cross-validation folds or cross-validation generator
    
    To-Do:
    - classification report does not exactly work every time. 
    '''
    import time
    print("# Tuning hyper-parameters for log_loss score")
    
    # Splits data
    data_train, data_test, classes_train, classes_test =  train_test_split(
    data, classes, test_size=fraction, random_state=0)
    s =  time.time()
    clf = GridSearchCV(mva, points, cv=cvs,
                       scoring=score, n_jobs=4, 
                       verbose=2)
                       
    clf.fit(data_train, classes_train)

    print('GridSearch completed after ', (time.time()-s)/60.0, ' minutes.')
    print()
    print("Best parameters set found on training set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on training set:")
    print()
    for params, mean_score, scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full training set.")
    print("The scores are computed on the full test set.")
    print()
    y_true, y_pred = classes_test, clf.predict_proba(data_test)
    print("Log loss score on test sample: ", log_loss(y_true, y_pred))
    
    return clf



#### Optimise BDT Parameters
We use sklearn's GridSearchCV to scan the hyper-parameter grid.
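For readers on a current scikit-learn, the same kind of scan looks slightly different from the cells in this notebook: `GridSearchCV` is imported from `sklearn.model_selection`, the scorer is called `'neg_log_loss'`, and the per-point results are read from `cv_results_`. The sketch below runs on synthetic data with a deliberately tiny, hypothetical grid; it is meant to show the mechanics only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# small synthetic 3-class problem standing in for the real features
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# hypothetical grid, far smaller than the ones scanned in this notebook
grid = {'min_samples_leaf': [5, 20, 50]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid,
                      scoring='neg_log_loss', cv=3)
search.fit(X, y)

# scores are negated log losses: closer to zero is better
for params, score in zip(search.cv_results_['params'],
                         search.cv_results_['mean_test_score']):
    print("%0.3f for %r" % (score, params))
print(search.best_params_)
```

The sign convention also explains the negative grid scores listed in this notebook: scikit-learn always maximises the score, so a loss is reported with a minus sign.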

In [3]:
in_data['group'] = train.group
data_hasEvts = in_data.query('nEvts>0')
class_hasEvts =  le_groups.transform(data_hasEvts.group)
data_noEvts = in_data.query('nEvts==0').drop(['nEvts','longitude_mean', 'longitude_variance',
       'latitude_mean', 'latitude_variance', 'usageTime_mean',
       'usageTime_variance', 'usageDay_mean', 'usageDay_variance'],axis=1)
class_noEvts =  le_groups.transform(data_noEvts.group)

data_hasEvts = data_hasEvts.drop(['group'],axis=1);
data_noEvts = data_noEvts.drop(['group'],axis=1);


NameError: name 'train' is not defined

In [178]:
# tune new models
bdt_noEvts = AdaBoostClassifier(DecisionTreeClassifier(),
                         algorithm="SAMME.R")


bdt_hasEvts = AdaBoostClassifier(DecisionTreeClassifier(),
                         algorithm="SAMME.R")

In [198]:
# list of points considered in the optimisation
bdt_pars = {'learning_rate': [0.005, 0.001, 0.0005],
             'n_estimators': [300,800,500],
 #'base_estimator__min_samples_leaf': [400,200,300]
            'base_estimator__max_depth': [2,3,4]
           }

In [199]:
# first try:
# -2.472 (+/-0.001) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.05, 'n_estimators': 300}
# second try:
# -2.437 (+/-0.003) for {'base_estimator__min_samples_leaf': 300, 'learning_rate': 0.01, 'n_estimators': 300}
# third
# -2.411 (+/-0.013) for {'base_estimator__min_samples_leaf': 200, 'learning_rate': 0.005, 'n_estimators': 50}
# fourth
# -2.408 (+/-0.011) for {'base_estimator__min_samples_leaf': 300, 'learning_rate': 0.001, 'n_estimators': 300}
# fifth
# -2.408 (+/-0.009) for {'base_estimator__min_samples_leaf': 400, 'learning_rate': 0.001, 'n_estimators': 300}
# sixth
# -2.409 (+/-0.007) for {'learning_rate': 0.0005, 'n_estimators': 500, 'base_estimator__max_depth': 3}

'''
Grid scores on training set:
~52 min (36 points)

-4.124 (+/-0.069) for {'base_estimator__min_samples_leaf': 10, 'learning_rate': 0.15, 'n_estimators': 300}
-3.208 (+/-0.091) for {'base_estimator__min_samples_leaf': 10, 'learning_rate': 0.15, 'n_estimators': 800}
-3.615 (+/-0.116) for {'base_estimator__min_samples_leaf': 10, 'learning_rate': 0.15, 'n_estimators': 500}
-4.600 (+/-0.138) for {'base_estimator__min_samples_leaf': 10, 'learning_rate': 0.05, 'n_estimators': 300}
-4.220 (+/-0.079) for {'base_estimator__min_samples_leaf': 10, 'learning_rate': 0.05, 'n_estimators': 800}
-4.466 (+/-0.099) for {'base_estimator__min_samples_leaf': 10, 'learning_rate': 0.05, 'n_estimators': 500}
-3.059 (+/-0.086) for {'base_estimator__min_samples_leaf': 10, 'learning_rate': 0.5, 'n_estimators': 300}
-2.621 (+/-0.037) for {'base_estimator__min_samples_leaf': 10, 'learning_rate': 0.5, 'n_estimators': 800}
-2.774 (+/-0.054) for {'base_estimator__min_samples_leaf': 10, 'learning_rate': 0.5, 'n_estimators': 500}
-2.610 (+/-0.073) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.15, 'n_estimators': 300}
-2.613 (+/-0.073) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.15, 'n_estimators': 800}
-2.614 (+/-0.070) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.15, 'n_estimators': 500}
-2.607 (+/-0.104) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.05, 'n_estimators': 300}
-2.609 (+/-0.076) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.05, 'n_estimators': 800}
-2.607 (+/-0.091) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.05, 'n_estimators': 500}
-2.616 (+/-0.074) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.5, 'n_estimators': 300}
-2.594 (+/-0.067) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.5, 'n_estimators': 800}
-2.614 (+/-0.080) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.5, 'n_estimators': 500}
-2.482 (+/-0.021) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.15, 'n_estimators': 300}
-2.484 (+/-0.017) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.15, 'n_estimators': 800}
-2.483 (+/-0.018) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.15, 'n_estimators': 500}
-2.476 (+/-0.027) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.05, 'n_estimators': 300}
-2.482 (+/-0.022) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.05, 'n_estimators': 800}
-2.480 (+/-0.024) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.05, 'n_estimators': 500}
-2.485 (+/-0.022) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.5, 'n_estimators': 300}
-2.487 (+/-0.016) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.5, 'n_estimators': 800}
-2.487 (+/-0.022) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.5, 'n_estimators': 500}
-2.480 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.15, 'n_estimators': 300}
-2.483 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.15, 'n_estimators': 800}
-2.482 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.15, 'n_estimators': 500}
-2.472 (+/-0.001) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.05, 'n_estimators': 300}
-2.480 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.05, 'n_estimators': 800}
-2.477 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.05, 'n_estimators': 500}
-2.483 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.5, 'n_estimators': 300}
-2.484 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.5, 'n_estimators': 800}
-2.484 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.5, 'n_estimators': 500}

~20 min (18 points)
-2.471 (+/-0.001) for {'base_estimator__min_samples_leaf': 300, 'learning_rate': 0.05, 'n_estimators': 300}
-2.479 (+/-0.000) for {'base_estimator__min_samples_leaf': 300, 'learning_rate': 0.05, 'n_estimators': 800}
-2.476 (+/-0.000) for {'base_estimator__min_samples_leaf': 300, 'learning_rate': 0.05, 'n_estimators': 500}
-2.437 (+/-0.003) for {'base_estimator__min_samples_leaf': 300, 'learning_rate': 0.01, 'n_estimators': 300}
-2.462 (+/-0.001) for {'base_estimator__min_samples_leaf': 300, 'learning_rate': 0.01, 'n_estimators': 800}
-2.451 (+/-0.002) for {'base_estimator__min_samples_leaf': 300, 'learning_rate': 0.01, 'n_estimators': 500}
-2.472 (+/-0.001) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.05, 'n_estimators': 300}
-2.480 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.05, 'n_estimators': 800}
-2.477 (+/-0.000) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.05, 'n_estimators': 500}
-2.439 (+/-0.003) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.01, 'n_estimators': 300}
-2.463 (+/-0.001) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.01, 'n_estimators': 800}
-2.453 (+/-0.002) for {'base_estimator__min_samples_leaf': 500, 'learning_rate': 0.01, 'n_estimators': 500}
-2.473 (+/-0.001) for {'base_estimator__min_samples_leaf': 1000, 'learning_rate': 0.05, 'n_estimators': 300}
-2.480 (+/-0.000) for {'base_estimator__min_samples_leaf': 1000, 'learning_rate': 0.05, 'n_estimators': 800}
-2.478 (+/-0.000) for {'base_estimator__min_samples_leaf': 1000, 'learning_rate': 0.05, 'n_estimators': 500}
-2.440 (+/-0.003) for {'base_estimator__min_samples_leaf': 1000, 'learning_rate': 0.01, 'n_estimators': 300}
-2.464 (+/-0.001) for {'base_estimator__min_samples_leaf': 1000, 'learning_rate': 0.01, 'n_estimators': 800}
-2.454 (+/-0.001) for {'base_estimator__min_samples_leaf': 1000, 'learning_rate': 0.01, 'n_estimators': 500}

~3 min (18 points)
-2.610 (+/-0.118) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.005, 'n_estimators': 50}
-2.611 (+/-0.129) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.005, 'n_estimators': 100}
-2.612 (+/-0.134) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.005, 'n_estimators': 200}
-2.616 (+/-0.105) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.001, 'n_estimators': 50}
-2.613 (+/-0.106) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.001, 'n_estimators': 100}
-2.610 (+/-0.123) for {'base_estimator__min_samples_leaf': 50, 'learning_rate': 0.001, 'n_estimators': 200}
-2.442 (+/-0.055) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.005, 'n_estimators': 50}
-2.441 (+/-0.049) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.005, 'n_estimators': 100}
-2.439 (+/-0.046) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.005, 'n_estimators': 200}
-2.450 (+/-0.054) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.001, 'n_estimators': 50}
-2.447 (+/-0.053) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.001, 'n_estimators': 100}
-2.443 (+/-0.055) for {'base_estimator__min_samples_leaf': 100, 'learning_rate': 0.001, 'n_estimators': 200}
-2.411 (+/-0.013) for {'base_estimator__min_samples_leaf': 200, 'learning_rate': 0.005, 'n_estimators': 50}
-2.410 (+/-0.011) for {'base_estimator__min_samples_leaf': 200, 'learning_rate': 0.005, 'n_estimators': 100}
-2.413 (+/-0.009) for {'base_estimator__min_samples_leaf': 200, 'learning_rate': 0.005, 'n_estimators': 200}
-2.415 (+/-0.015) for {'base_estimator__min_samples_leaf': 200, 'learning_rate': 0.001, 'n_estimators': 50}
-2.413 (+/-0.014) for {'base_estimator__min_samples_leaf': 200, 'learning_rate': 0.001, 'n_estimators': 100}
-2.412 (+/-0.013) for {'base_estimator__min_samples_leaf': 200, 'learning_rate': 0.001, 'n_estimators': 200}

~ 11 min (18 points)
~ 21 min (27 points)
~ 26 min (27 points)
'''

grid_bdt_noEvts = optimisePars(bdt_noEvts, bdt_pars, data_noEvts, class_noEvts)

# Tuning hyper-parameters for log_loss score

GridSearch completed after  25.57805508375168  minutes.

Best parameters set found on training set:

{'learning_rate': 0.0005, 'n_estimators': 500, 'base_estimator__max_depth': 3}

Grid scores on training set:

-2.427 (+/-0.002) for {'learning_rate': 0.005, 'n_estimators': 300, 'base_estimator__max_depth': 2}
-2.451 (+/-0.001) for {'learning_rate': 0.005, 'n_estimators': 800, 'base_estimator__max_depth': 2}
-2.439 (+/-0.002) for {'learning_rate': 0.005, 'n_estimators': 500, 'base_estimator__max_depth': 2}
-2.413 (+/-0.004) for {'learning_rate': 0.001, 'n_estimators': 300, 'base_estimator__max_depth': 2}
-2.418 (+/-0.003) for {'learning_rate': 0.001, 'n_estimators': 800, 'base_estimator__max_depth': 2}
-2.415 (+/-0.003) for {'learning_rate': 0.001, 'n_estimators': 500, 'base_estimator__max_depth': 2}
-2.412 (+/-0.004) for {'learning_rate': 0.0005, 'n_estimators': 300, 'base_estimator__max_depth': 2}
-2.414 (+/-0.004) for {'learning_rate': 0.

ValueError: Found arrays with inconsistent numbers of samples: [ 7464 30802]

#### Use the AdaBoost DT with
**Loss: -2.408 (+/-0.009) for {'base_estimator__min_samples_leaf': 400, 'learning_rate': 0.001, 'n_estimators': 300}**


In [202]:
# optimisation resulted in error, so we set it by hand 
bdt_noEvts = AdaBoostClassifier(DecisionTreeClassifier(min_samples_leaf=400),
                                learning_rate=0.001,
                                n_estimators=300,
                                algorithm="SAMME.R")

In [204]:
# split in train and test sets
noE_train, noE_test, noE_class_train, noE_class_test = train_test_split(data_noEvts, class_noEvts, 
                                                                        test_size = 0.2, random_state=0,
                                                                        stratify=class_noEvts)

In [205]:
# train the bdt
bdt_noEvts.fit(noE_train, noE_class_train)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=400,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=0.001, n_estimators=300, random_state=None)

In [209]:
# predict test set
probas = bdt_noEvts.predict_proba(noE_test)

In [210]:
# print log_loss of this model
print('log loss score: ', log_loss(noE_class_test, probas))

log loss score:  2.40527688085


Great, this performs as expected: the held-out loss of 2.405 matches the ~2.408 found in the grid search.

In [212]:
# pickle result to be able to skip training the next time
joblib.dump(bdt_noEvts, 'trainedModels/optimised_bdtnoEvts.pkl',compress=3) 

['trainedModels/optimised_bdtnoEvts.pkl']

In [28]:
# load bdt
bdt_noEvts = joblib.load('trainedModels/optimised_bdtnoEvts.pkl')

In [233]:
# optimise bdt for the data with events
# list of points considered in the optimisation
bdt_pars = {'learning_rate': [0.005, 0.001, 0.0005, 0.01],
             'n_estimators': [300,800,500],
 #'base_estimator__min_samples_leaf': [400,200,300]
            'base_estimator__max_depth': [2,3,4,5,6]
           }

grid_bdt_hasEvts = optimisePars(bdt_hasEvts, bdt_pars, data_hasEvts, class_hasEvts)

# Tuning hyper-parameters for log_loss score

GridSearch completed after  56.55024749437968  minutes.

Best parameters set found on training set:

{'learning_rate': 0.0005, 'n_estimators': 500, 'base_estimator__max_depth': 2}

Grid scores on training set:

-2.401 (+/-0.003) for {'learning_rate': 0.005, 'n_estimators': 300, 'base_estimator__max_depth': 2}
-2.433 (+/-0.002) for {'learning_rate': 0.005, 'n_estimators': 800, 'base_estimator__max_depth': 2}
-2.417 (+/-0.003) for {'learning_rate': 0.005, 'n_estimators': 500, 'base_estimator__max_depth': 2}
-2.383 (+/-0.006) for {'learning_rate': 0.001, 'n_estimators': 300, 'base_estimator__max_depth': 2}
-2.389 (+/-0.004) for {'learning_rate': 0.001, 'n_estimators': 800, 'base_estimator__max_depth': 2}
-2.385 (+/-0.005) for {'learning_rate': 0.001, 'n_estimators': 500, 'base_estimator__max_depth': 2}
-2.383 (+/-0.007) for {'learning_rate': 0.0005, 'n_estimators': 300, 'base_estimator__max_depth': 2}
-2.384 (+/-0.006) for {'learning_rate': 0.

In [237]:
# pickle bdt result
joblib.dump(grid_bdt_hasEvts.best_estimator_, 'trainedModels/optimised_bdthasEvts.pkl',compress=3);

In [159]:
# load result (you can skip the training steps)
bdt_hasEvts = joblib.load('trainedModels/optimised_bdthasEvts.pkl')

#### Let's see how the two BDTs work together.

In [73]:
test = data[data.isTrain==0]

In [74]:
test = test.drop(['age','device_id','gender','group', 'isTrain'],axis=1)

In [75]:
test['phone_brand'] = le_phone.transform(test.phone_brand)
test['device_model'] = le_device.transform(test.device_model)

In [76]:
hasEvts_i = test[test.nEvts>=1].index
noEvts_i = test[test.nEvts==0].index

In [80]:
test_evts = test.loc[hasEvts_i]
test_noevts = test.loc[noEvts_i]

In [81]:
probas_evts = bdt_hasEvts.predict_proba(test_evts)

In [84]:
probas_noevts = bdt_noEvts.predict_proba(test_noevts.drop(['nEvts','longitude_mean', 'longitude_variance',
       'latitude_mean', 'latitude_variance', 'usageTime_mean',
       'usageTime_variance', 'usageDay_mean', 'usageDay_variance'],axis=1))

In [150]:
submission = pd.DataFrame(data.device_id[data.isTrain==0])

In [151]:
probas_evts_df = pd.DataFrame(probas_evts, index=hasEvts_i, columns=le_groups.inverse_transform(range(0,12)))

In [152]:
probas_noevts_df = pd.DataFrame(probas_noevts, index=noEvts_i, columns=le_groups.inverse_transform(range(0,12)))

In [153]:
n = pd.concat([submission.loc[hasEvts_i],probas_evts_df], join='inner', axis=1)

In [154]:
nn = pd.concat([submission.loc[noEvts_i],probas_noevts_df], join='inner', axis=1)

In [155]:
subm = pd.concat([n,nn], join='outer')

In [156]:
subm = subm.sort_index()  # sort_index returns a new frame, so assign the result

In [178]:
subm.to_csv(solution_dir+'separate_bdt.csv',index=False)
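The routing and recombination done in the cells above can be condensed. The following is a sketch with toy data, not the notebook's models: `fake_proba` is a hypothetical placeholder for `predict_proba`, and the device ids, index values and class names are invented. The point is that each partial prediction keeps the index of the rows it was made for, so the two parts can be concatenated and sorted back into the original order.

```python
import numpy as np
import pandas as pd

# toy stand-in for the test sample: device ids plus the routing feature
test = pd.DataFrame({'device_id': [11, 12, 13, 14],
                     'nEvts': [0, 5, 0, 2]},
                    index=[100, 101, 102, 103])
classes = ['A', 'B', 'C']  # hypothetical class names

def fake_proba(n_rows, seed):
    """Placeholder for model.predict_proba: random rows that sum to one."""
    rng = np.random.RandomState(seed)
    p = rng.rand(n_rows, len(classes))
    return p / p.sum(axis=1, keepdims=True)

has_evts = test.nEvts >= 1
parts = []
for mask, seed in [(has_evts, 0), (~has_evts, 1)]:
    subset = test[mask]
    # keep the original index so every row can be matched to its device id
    probs = pd.DataFrame(fake_proba(len(subset), seed),
                         index=subset.index, columns=classes)
    parts.append(pd.concat([subset[['device_id']], probs], axis=1))

# sort_index returns a new frame, so the result has to be assigned
submission = pd.concat(parts).sort_index()
print(submission)
```

A quick check that every row sums to one and that no device id is missing or duplicated is cheap insurance before uploading a file.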

---


# Try GradientBoosting
First, build a simple gradient boosting model.

In [37]:
gbdt = GradientBoostingClassifier(loss='deviance');

In [24]:
classes_gbtrain = le_groups.transform(train.group)

In [45]:
# dictionary holding optimisation points
GBpars = {'n_estimators': [900,1200],
          'min_samples_leaf' : [800,1300],
          'learning_rate' : [0.001, 0.0001]
          #,'max_features' : [None, 'sqrt']
         }

#grid_gbdt = optimisePars(gbdt, GBpars, in_data.drop(["group"], axis=1), classes_gbtrain, cvs=3)

'''
# Tuning hyper-parameters for log_loss score
Fitting 3 folds for each of 36 candidates, totalling 108 fits

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 53.2min
[Parallel(n_jobs=4)]: Done 108 out of 108 | elapsed: 137.8min finished

GridSearch completed after  146.37849520047504  minutes.

Best parameters set found on training set:

{'max_features': None, 'min_samples_leaf': 800, 'learning_rate': 0.005, 'n_estimators': 700}

Grid scores on training set:

-2.651 (+/-0.033) for {'max_features': None, 'min_samples_leaf': 50, 'learning_rate': 0.4, 'n_estimators': 300}
-2.850 (+/-0.044) for {'max_features': None, 'min_samples_leaf': 50, 'learning_rate': 0.4, 'n_estimators': 700}
-2.515 (+/-0.022) for {'max_features': None, 'min_samples_leaf': 300, 'learning_rate': 0.4, 'n_estimators': 300}
-2.619 (+/-0.025) for {'max_features': None, 'min_samples_leaf': 300, 'learning_rate': 0.4, 'n_estimators': 700}
-2.462 (+/-0.017) for {'max_features': None, 'min_samples_leaf': 800, 'learning_rate': 0.4, 'n_estimators': 300}
-2.518 (+/-0.023) for {'max_features': None, 'min_samples_leaf': 800, 'learning_rate': 0.4, 'n_estimators': 700}
-2.607 (+/-0.027) for {'max_features': 'sqrt', 'min_samples_leaf': 50, 'learning_rate': 0.4, 'n_estimators': 300}
-2.798 (+/-0.038) for {'max_features': 'sqrt', 'min_samples_leaf': 50, 'learning_rate': 0.4, 'n_estimators': 700}
-2.496 (+/-0.018) for {'max_features': 'sqrt', 'min_samples_leaf': 300, 'learning_rate': 0.4, 'n_estimators': 300}
-2.582 (+/-0.017) for {'max_features': 'sqrt', 'min_samples_leaf': 300, 'learning_rate': 0.4, 'n_estimators': 700}
-2.447 (+/-0.016) for {'max_features': 'sqrt', 'min_samples_leaf': 800, 'learning_rate': 0.4, 'n_estimators': 300}
-2.491 (+/-0.023) for {'max_features': 'sqrt', 'min_samples_leaf': 800, 'learning_rate': 0.4, 'n_estimators': 700}
-2.408 (+/-0.005) for {'max_features': None, 'min_samples_leaf': 50, 'learning_rate': 0.005, 'n_estimators': 300}
-2.405 (+/-0.007) for {'max_features': None, 'min_samples_leaf': 50, 'learning_rate': 0.005, 'n_estimators': 700}
-2.408 (+/-0.005) for {'max_features': None, 'min_samples_leaf': 300, 'learning_rate': 0.005, 'n_estimators': 300}
-2.403 (+/-0.007) for {'max_features': None, 'min_samples_leaf': 300, 'learning_rate': 0.005, 'n_estimators': 700}
-2.408 (+/-0.004) for {'max_features': None, 'min_samples_leaf': 800, 'learning_rate': 0.005, 'n_estimators': 300}
-2.403 (+/-0.005) for {'max_features': None, 'min_samples_leaf': 800, 'learning_rate': 0.005, 'n_estimators': 700}
-2.409 (+/-0.004) for {'max_features': 'sqrt', 'min_samples_leaf': 50, 'learning_rate': 0.005, 'n_estimators': 300}
-2.404 (+/-0.006) for {'max_features': 'sqrt', 'min_samples_leaf': 50, 'learning_rate': 0.005, 'n_estimators': 700}
-2.409 (+/-0.004) for {'max_features': 'sqrt', 'min_samples_leaf': 300, 'learning_rate': 0.005, 'n_estimators': 300}
-2.403 (+/-0.006) for {'max_features': 'sqrt', 'min_samples_leaf': 300, 'learning_rate': 0.005, 'n_estimators': 700}
-2.409 (+/-0.002) for {'max_features': 'sqrt', 'min_samples_leaf': 800, 'learning_rate': 0.005, 'n_estimators': 300}
-2.403 (+/-0.004) for {'max_features': 'sqrt', 'min_samples_leaf': 800, 'learning_rate': 0.005, 'n_estimators': 700}
-2.455 (+/-0.018) for {'max_features': None, 'min_samples_leaf': 50, 'learning_rate': 0.1, 'n_estimators': 300}
-2.525 (+/-0.024) for {'max_features': None, 'min_samples_leaf': 50, 'learning_rate': 0.1, 'n_estimators': 700}
-2.429 (+/-0.012) for {'max_features': None, 'min_samples_leaf': 300, 'learning_rate': 0.1, 'n_estimators': 300}
-2.459 (+/-0.016) for {'max_features': None, 'min_samples_leaf': 300, 'learning_rate': 0.1, 'n_estimators': 700}
-2.416 (+/-0.010) for {'max_features': None, 'min_samples_leaf': 800, 'learning_rate': 0.1, 'n_estimators': 300}
-2.433 (+/-0.012) for {'max_features': None, 'min_samples_leaf': 800, 'learning_rate': 0.1, 'n_estimators': 700}
-2.440 (+/-0.014) for {'max_features': 'sqrt', 'min_samples_leaf': 50, 'learning_rate': 0.1, 'n_estimators': 300}
-2.496 (+/-0.021) for {'max_features': 'sqrt', 'min_samples_leaf': 50, 'learning_rate': 0.1, 'n_estimators': 700}
-2.423 (+/-0.013) for {'max_features': 'sqrt', 'min_samples_leaf': 300, 'learning_rate': 0.1, 'n_estimators': 300}
-2.449 (+/-0.015) for {'max_features': 'sqrt', 'min_samples_leaf': 300, 'learning_rate': 0.1, 'n_estimators': 700}
-2.412 (+/-0.009) for {'max_features': 'sqrt', 'min_samples_leaf': 800, 'learning_rate': 0.1, 'n_estimators': 300}
-2.426 (+/-0.012) for {'max_features': 'sqrt', 'min_samples_leaf': 800, 'learning_rate': 0.1, 'n_estimators': 700}

Detailed classification report:

The model is trained on the full training set.
The scores are computed on the full test set.

Log loss score on test sample:  2.39880543668
'''


# Tuning hyper-parameters for log_loss score
Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=4)]: Done  24 out of  24 | elapsed: 69.7min finished


GridSearch completed after  77.24419992764791  minutes.

Best parameters set found on training set:

{'min_samples_leaf': 800, 'n_estimators': 1200, 'learning_rate': 0.001}

Grid scores on training set:

-2.416 (+/-0.003) for {'min_samples_leaf': 800, 'n_estimators': 900, 'learning_rate': 0.001}
-2.411 (+/-0.003) for {'min_samples_leaf': 800, 'n_estimators': 1200, 'learning_rate': 0.001}
-2.418 (+/-0.002) for {'min_samples_leaf': 1300, 'n_estimators': 900, 'learning_rate': 0.001}
-2.412 (+/-0.003) for {'min_samples_leaf': 1300, 'n_estimators': 1200, 'learning_rate': 0.001}
-2.464 (+/-0.000) for {'min_samples_leaf': 800, 'n_estimators': 900, 'learning_rate': 0.0001}
-2.460 (+/-0.000) for {'min_samples_leaf': 800, 'n_estimators': 1200, 'learning_rate': 0.0001}
-2.464 (+/-0.000) for {'min_samples_leaf': 1300, 'n_estimators': 900, 'learning_rate': 0.0001}
-2.461 (+/-0.000) for {'min_samples_leaf': 1300, 'n_estimators': 1200, 'learning_rate': 0.0001}

Detailed classification report:

The mo
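
The grid scores printed above are negated log losses, because scikit-learn always maximises its scorers, so the best candidate is the one closest to zero. A minimal sketch of that selection, using the eight mean scores from the second search; `grid_scores` and `best_candidate` are illustrative names and not part of the notebook:

```python
# Mean CV scores copied from the grid-search output above.
# Each entry: (negated log loss, spread as printed, parameter dict).
grid_scores = [
    (-2.416, 0.003, {'min_samples_leaf': 800, 'n_estimators': 900, 'learning_rate': 0.001}),
    (-2.411, 0.003, {'min_samples_leaf': 800, 'n_estimators': 1200, 'learning_rate': 0.001}),
    (-2.418, 0.002, {'min_samples_leaf': 1300, 'n_estimators': 900, 'learning_rate': 0.001}),
    (-2.412, 0.003, {'min_samples_leaf': 1300, 'n_estimators': 1200, 'learning_rate': 0.001}),
    (-2.464, 0.000, {'min_samples_leaf': 800, 'n_estimators': 900, 'learning_rate': 0.0001}),
    (-2.460, 0.000, {'min_samples_leaf': 800, 'n_estimators': 1200, 'learning_rate': 0.0001}),
    (-2.464, 0.000, {'min_samples_leaf': 1300, 'n_estimators': 900, 'learning_rate': 0.0001}),
    (-2.461, 0.000, {'min_samples_leaf': 1300, 'n_estimators': 1200, 'learning_rate': 0.0001}),
]


def best_candidate(scores):
    """Return the entry with the highest (least negative) mean score."""
    return max(scores, key=lambda entry: entry[0])


best_score, best_spread, best_params = best_candidate(grid_scores)
print(best_params)
print("log loss: %.3f (+/-%.3f)" % (-best_score, best_spread))
```

The runner-up differs from the winner by 0.001, which is smaller than the printed spread, so the two settings with `learning_rate=0.001` and `n_estimators=1200` are effectively tied.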

In [50]:
gbdt = GradientBoostingClassifier(loss='deviance', max_features=None
                                  , min_samples_leaf=800
                                  , learning_rate=0.005
                                  , n_estimators=700);

In [51]:
gbdt.fit(in_data.drop('group', axis=1), classes_gbtrain)

GradientBoostingClassifier(init=None, learning_rate=0.005, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=800, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=700,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [53]:
joblib.dump(gbdt, 'trainedModels/gbdt_fullset.pkl', compress=3)

['trainedModels/gbdt_fullset.pkl']

In [25]:
gbdt = joblib.load("trainedModels/gbdt_fullset.pkl")

In [44]:
test_sample = data[data.isTrain==0].drop(['age','gender','device_id','group','isTrain'],axis=1)
test_sample["phone_brand"] = le_phone.transform(test_sample.phone_brand)
test_sample["device_model"] = le_device.transform(test_sample.device_model)

In [38]:
probas_gbdt = gbdt.predict_proba(test_sample)

### Prepare the GBDT submission

In [40]:
probas_gbdt_df = pd.DataFrame(probas_gbdt, index=test_sample.index)
probas_gbdt_df.columns = le_groups.inverse_transform(probas_gbdt_df.columns)

In [44]:
out = pd.concat([pd.DataFrame(data[data.isTrain==0].device_id),probas_gbdt_df], axis=1);

In [49]:
out.to_csv('predictions/gbdt_full.csv', index=False)
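
Before uploading a file like this, it can help to check that the header is as expected and that each row of probabilities sums to one. The following is a minimal sketch using only the standard library; `check_submission` is a hypothetical helper, and the two-row sample with uniform probabilities is made up purely for illustration:

```python
import csv
import io

# Group labels in the order used by the submission header.
GROUPS = ['F23-', 'F24-26', 'F27-28', 'F29-32', 'F33-42', 'F43+',
          'M22-', 'M23-26', 'M27-28', 'M29-31', 'M32-38', 'M39+']


def check_submission(handle, tol=1e-4):
    """Return the number of valid rows; raise ValueError on the first problem."""
    reader = csv.reader(handle)
    header = next(reader)
    if header != ['device_id'] + GROUPS:
        raise ValueError("unexpected header: %r" % (header,))
    n_rows = 0
    for row in reader:
        probs = [float(value) for value in row[1:]]
        if len(probs) != len(GROUPS):
            raise ValueError("row %d has %d probabilities" % (n_rows, len(probs)))
        if any(p < 0.0 or p > 1.0 for p in probs):
            raise ValueError("row %d has a value outside [0, 1]" % n_rows)
        if abs(sum(probs) - 1.0) > tol:
            raise ValueError("row %d does not sum to 1" % n_rows)
        n_rows += 1
    return n_rows


# Illustrative two-row file with uniform probabilities (not real predictions).
uniform = ",".join(["%.6f" % (1.0 / 12)] * 12)
sample = "device_id," + ",".join(GROUPS) + "\n1," + uniform + "\n2," + uniform + "\n"
n_valid = check_submission(io.StringIO(sample))
print(n_valid, "rows passed the check")
```

For a real file the same function could be called as `check_submission(open('predictions/gbdt_full.csv', newline=''))`.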

# SVM solution

In [54]:
from sklearn.svm import SVC


In [63]:
svm = SVC(C=1.1, cache_size=500
    , decision_function_shape='ovo'
    , gamma='auto', kernel='rbf'
    , verbose=1, probability=True)

In [64]:
s = time.time()
svm.fit(in_data.drop('group', axis=1), classes_gbtrain)
print("SVM trained in ", (time.time()-s)/60.0, " minutes")

[LibSVM]SVM trained in  79.19006366729737  minutes


In [66]:
probas_svm_df =  pd.DataFrame(svm.predict_proba(test_sample)
                              , index=test_sample.index
                              , columns=le_groups.inverse_transform(range(0,12)))

In [68]:
out = pd.concat([pd.DataFrame(data[data.isTrain==0].device_id),probas_svm_df], axis=1);

In [69]:
out.to_csv('predictions/svm_full.csv', index=False)

# Try a neural net


In [22]:
from keras.models import Sequential
from keras.layers import Dense, Activation

Using Theano backend.


In [23]:
in_data.head(2)

Unnamed: 0,phone_brand,device_model,nEvts,longitude_mean,longitude_variance,latitude_mean,latitude_variance,usageTime_mean,usageTime_variance,usageDay_mean,usageDay_variance
0,51,749,0,-1,-1,-1,-1,-1,-1,-1,-1
1,51,749,0,-1,-1,-1,-1,-1,-1,-1,-1


In [24]:
true_classes_nn = le_groups.transform(train.group)

In [25]:
in_data.as_matrix().shape

(74645, 11)

In [26]:
len(true_classes.columns)

12

In [38]:
import theano.tensor as T
from theano.tensor import *

In [36]:
dim = in_data.as_matrix().shape  # (n_samples, n_features)

In [32]:
# In Keras we build the model layer by layer;
# here we use a plain Sequential model.
from keras.layers import Dropout  # Dropout was not imported above

model = Sequential()

# The first dense layer (standard NN layer) takes the 11 input features.
# input_shape excludes the sample axis, so it is (11,), not (11, 74645).
model.add(Dense(output_dim=20, input_shape=(11,)))
model.add(Activation("relu")) # no real motivation for relu here
model.add(Dropout(0.3))
model.add(Dense(output_dim=25))
model.add(Activation("relu"))
model.add(Dropout(0.2))
model.add(Dense(output_dim=15))
model.add(Activation("relu"))
model.add(Dropout(0.2))
model.add(Dense(output_dim=15))
model.add(Activation("relu"))
# The final layer needs 12 outputs, one per group.
model.add(Dense(output_dim=12))
model.add(Activation("softmax"))

model.summary()

In [31]:
# now we need to configure the learning process
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [34]:
# we would like to use the categorical_crossentropy (multiclass logloss),
# so let's convert the classes to categories
from keras.utils.np_utils import to_categorical

In [37]:
true_classes_nn = to_categorical(true_classes_nn)
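
As a rough illustration of what this conversion does and why the loss needs it: `to_categorical` turns each integer label into a one-hot row, and the categorical cross-entropy then picks out the log-probability the network assigned to the true class. The sketch below reimplements both in plain Python; `one_hot`, `categorical_crossentropy` and the three-class toy data are illustrative and not part of the notebook:

```python
import math


def one_hot(labels, n_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * n_classes
        row[label] = 1.0
        rows.append(row)
    return rows


def categorical_crossentropy(y_true, y_pred, eps=1e-15):
    """Mean over samples of -sum_k y_true[k] * log(y_pred[k])."""
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(truth, pred))
    return total / len(y_true)


labels = [0, 2, 1]  # toy labels for a 3-class problem
targets = one_hot(labels, 3)
preds = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.5, 0.3]]
loss = categorical_crossentropy(targets, preds)
print(targets)
print("loss: %.4f" % loss)
```

Only the probability of the true class contributes to each sample's loss, which is why this quantity is the same as the multiclass log loss used for the other models.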

In [41]:
s = time.time()
model.fit(in_data.as_matrix(),true_classes_nn
         , validation_split=0.2  # 80/20 split, matching the log below
         , verbose=2,nb_epoch=100)
print("NN trained in ", (time.time()-s)/60.0, " minutes")

Train on 59716 samples, validate on 14929 samples
Epoch 1/100
1s - loss: 2.4338 - acc: 0.1264 - val_loss: 2.4007 - val_acc: 0.1385
Epoch 2/100
1s - loss: 2.4335 - acc: 0.1241 - val_loss: 2.4040 - val_acc: 0.1523
Epoch 3/100
1s - loss: 2.4335 - acc: 0.1267 - val_loss: 2.3959 - val_acc: 0.1320
Epoch 4/100
1s - loss: 2.4333 - acc: 0.1268 - val_loss: 2.4035 - val_acc: 0.1565
Epoch 5/100
1s - loss: 2.4342 - acc: 0.1256 - val_loss: 2.4055 - val_acc: 0.1291
Epoch 6/100
1s - loss: 2.4333 - acc: 0.1270 - val_loss: 2.4078 - val_acc: 0.0919
Epoch 7/100
1s - loss: 2.4332 - acc: 0.1265 - val_loss: 2.4151 - val_acc: 0.0918
Epoch 8/100
1s - loss: 2.4335 - acc: 0.1262 - val_loss: 2.4029 - val_acc: 0.1504
Epoch 9/100
1s - loss: 2.4334 - acc: 0.1265 - val_loss: 2.3990 - val_acc: 0.1576
Epoch 10/100
1s - loss: 2.4328 - acc: 0.1265 - val_loss: 2.4014 - val_acc: 0.1538
Epoch 11/100
1s - loss: 2.4322 - acc: 0.1267 - val_loss: 2.4025 - val_acc: 0.1474
Epoch 12/100
1s - loss: 2.4312 - acc: 0.1273 - val_loss: 

In [42]:
loss_and_metrics = model.evaluate(in_data.as_matrix(),true_classes_nn)



In [43]:
loss_and_metrics

[2.422510975863196, 0.13706209391297622]
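
For context, a model that assigns equal probability to all 12 groups has a log loss of ln 12 and an expected accuracy of 1/12. The small sketch below compares those reference values with the two numbers returned by `model.evaluate` above. Note that the evaluation ran on the training data, so it is not an out-of-sample estimate:

```python
import math

n_classes = 12
uniform_loss = math.log(n_classes)  # log loss of a uniform guess
uniform_acc = 1.0 / n_classes       # expected accuracy of a uniform random guess

# Values copied from the model.evaluate output above.
nn_loss, nn_acc = 2.422510975863196, 0.13706209391297622

print("uniform: loss=%.4f acc=%.4f" % (uniform_loss, uniform_acc))
print("network: loss=%.4f acc=%.4f" % (nn_loss, nn_acc))
print("loss improvement over uniform: %.4f" % (uniform_loss - nn_loss))
```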

In [47]:
pd.DataFrame(model.predict_proba(test_sample.as_matrix())).shape



(112071, 12)

In [39]:
help(model.fit)

Help on method fit in module keras.models:

fit(x, y, batch_size=32, nb_epoch=10, verbose=1, callbacks=[], validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, **kwargs) method of keras.models.Sequential instance
    Trains the model for a fixed number of epochs.
    
    # Arguments
        x: input data, as a Numpy array or list of Numpy arrays
            (if the model has multiple inputs).
        y: labels, as a Numpy array.
        batch_size: integer. Number of samples per gradient update.
        nb_epoch: integer, the number of epochs to train the model.
        verbose: 0 for no logging to stdout,
            1 for progress bar logging, 2 for one log line per epoch.
        callbacks: list of `keras.callbacks.Callback` instances.
            List of callbacks to apply during training.
            See [callbacks](/callbacks).
        validation_split: float (0. < x < 1).
            Fraction of the data to use as held-out validation 

### ToDo: check multi-input NN with data from different stages!

# Prepare the submission
First create a matrix for the predictions of the test set.

In [66]:
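# requires the `test` DataFrame (its read_csv call is commented out further up)
prediction = np.zeros((len(test.device_id),len(groups.index.values)))
# assign the same per-group probabilities to every row of the prediction array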
for i in range(0,prediction.shape[0]):
    prediction[i]=probs_per_group.values[0]

NameError: name 'test' is not defined

#### Now define a function that prepares a valid submission CSV
It uses the test dataset and the prediction matrix as an input.

In [263]:
def prepareOutput(test, pred, label='talkingData'):
    '''
    Writes a valid submission file from the prediction matrix.
    The valid output must look like: 
    device_id,F23-,F24-26,F27-28,F29-32,F33-42,F43+,M22-,M23-26,M27-28,M29-31,M32-38,M39+
    (id, probabilities)

    Arguments:
    test  - the DataFrame with the device_ids to be tested
    pred  - the prediction matrix with pred.shape = (len(test.device_id), len(unique groups))
    label - prefix of the submission file
    
    Return:
    The merged submission dataset is returned.
    '''
    p = pd.DataFrame(pred)
    p.columns = labelEnc.inverse_transform(p.columns)
    i = pd.DataFrame(test.device_id.values) 
    i.columns = ['device_id']
    merged= pd.concat([i,p], axis=1)
    merged.to_csv(solution_dir+label+'_submission.csv', index=False)
    return merged

In [265]:
o = prepareOutput(test,prediction,'entriesPerClass')
%ls predictions/

entriesPerClass_submission.csv


In [266]:
o.head(2)

Unnamed: 0,device_id,F23-,F24-26,F27-28,F29-32,F33-42,F43+,M22-,M23-26,M27-28,M29-31,M32-38,M39+
0,1002079943728939269,0.067654,0.056132,0.041771,0.062,0.074499,0.056186,0.100315,0.128676,0.072945,0.097917,0.126948,0.114957
1,-1547860181818787117,0.067654,0.056132,0.041771,0.062,0.074499,0.056186,0.100315,0.128676,0.072945,0.097917,0.126948,0.114957
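
Every row carries the same vector of class frequencies. For such a constant prediction, the expected log loss equals the entropy of that vector, under the assumption that the evaluation sample has the same class frequencies. A short sketch using the twelve probabilities shown above; `class_priors` and `entropy` are illustrative names:

```python
import math

# Per-group probabilities copied from the submission rows above.
class_priors = [0.067654, 0.056132, 0.041771, 0.062, 0.074499, 0.056186,
                0.100315, 0.128676, 0.072945, 0.097917, 0.126948, 0.114957]


def entropy(probs):
    """Shannon entropy in nats: -sum_k p_k * log(p_k)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


prior_loss = entropy(class_priors)
print("sum of priors: %.6f" % sum(class_priors))
print("expected log loss: %.4f" % prior_loss)
```

The result should land at roughly 2.43, in line with the loss quoted for the naive entries-per-class model at the top of the notebook.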


#### This worked. The output can be submitted to Kaggle.