# 'Trained' models
* The first reference is a naive model that uses no knowledge about the individual customer. The probability that a test customer belongs to a class is assumed to be *n(customers per class)/all(customers)*, i.e. the class frequency in the training sample.
* The second model is a multi-class boosted decision tree (BDT) classifier.
* The third model uses a boosted decision tree to classify male or female and a regression model to predict the age.

---

Prediction scores, evaluated with the logistic loss (log loss):
* Naive entries per class model: 
   - loss: **2.42786222642**
   - score on Kaggle: **2.42762**
* AdaBoost Classifier with: 
   - algorithm='SAMME.R'
   - DT(max_features=4, min_samples_leaf=74.645) 
   - learning_rate=0.15 
   - n_estimators=800 
   - random_state=666
   - _loss on training_: **2.39243**
   - _with k-Folding, mean loss_: **--**
   - _separate optimised BDTs for nEvts==0 and nEvts>=1_: **--**
* Gradient Boosting:
   - loss: deviance

In [147]:
import pandas as pd
from pandas import DataFrame as df
import seaborn as sns
import sys
import numpy as np

%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import log_loss
from sklearn.externals import joblib
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

In [2]:
sns.set_style('ticks')

In [3]:
sys.path.append("/home/mschlupp/pythonTools")
tmp = %pwd
files_dir = tmp + "/files/" 
solution_dir = tmp+'/predictions/'

In [4]:
#train = pd.read_csv(files_dir+'gender_age_train.csv')
#test = pd.read_csv(files_dir+'gender_age_test.csv')

In [5]:
%ls files/

app_events.csv        phone_brand_device_model.csv
app_labels.csv        phone_brand_device_model_engl.csv
events.csv            sample_submission.csv
events_day_hour.csv   traintest_fullevt.csv
gender_age_test.csv   traintest_phone.csv
gender_age_train.csv  traintest_phone_day_hour.csv
label_categories.csv  traintest_phone_evts.csv


#### Let us first remind ourselves what we have in our file

In [6]:
new_set = pd.read_csv(files_dir+'traintest_fullevt.csv', nrows=0) # just read the header

In [7]:
cols=new_set.columns
print(cols)

Index(['age', 'device_id', 'gender', 'group', 'isTrain', 'phone_brand',
       'device_model', 'hasEvents', 'nEvts', 'longitude_mean',
       'longitude_variance', 'latitude_mean', 'latitude_variance',
       'usageTime_mean', 'usageTime_variance', 'usageDay_mean',
       'usageDay_variance'],
      dtype='object')


In [8]:
cols = cols.drop('hasEvents')  # redundant with nEvts, so we do not read it

In [9]:
# let's read the data chunkwise and split in train and test sample
# we only want the training sample for now.
iter_csv = pd.read_csv(files_dir+'traintest_fullevt.csv', usecols=cols, iterator=True, chunksize=1500)
train = pd.concat([chunk[chunk['isTrain'] ==1] for chunk in iter_csv])
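
The chunk-wise filter above keeps only the training rows. As a minimal sketch, assuming the same `isTrain` flag, both samples could be collected in a single pass over the file. The toy CSV text and the names `train_demo` and `test_demo` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# toy stand-in for traintest_fullevt.csv (hypothetical values)
csv_text = ("device_id,isTrain,nEvts\n"
            "11,1,0\n12,0,3\n13,1,5\n14,0,0\n15,1,2\n16,0,1\n")

train_parts, test_parts = [], []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    is_train = chunk['isTrain'] == 1
    train_parts.append(chunk[is_train])    # rows flagged as training data
    test_parts.append(chunk[~is_train])    # everything else is test data

train_demo = pd.concat(train_parts, ignore_index=True)
test_demo = pd.concat(test_parts, ignore_index=True)
print(len(train_demo), len(test_demo))
```

Reading the file once avoids a second scan when the test sample is needed later for the submission.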
                       

In [10]:
train.head(2)

| | age | device_id | gender | group | isTrain | phone_brand | device_model | nEvts | longitude_mean | longitude_variance | latitude_mean | latitude_variance | usageTime_mean | usageTime_variance | usageDay_mean | usageDay_variance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35 | -8076087639492063270 | M | M32-38 | 1 | 小米 | MI 2 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | 35 | -2897161552818060146 | M | M32-38 | 1 | 小米 | MI 2 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |


# Reference model
The first reference is a naive model that uses no knowledge about the individual customer.
The probability that a test customer belongs to a class is assumed to be *n(customers per class)/all(customers)*, i.e. the class frequency in the training sample.

---

Prediction scores, evaluated with the logistic loss (log loss):
* Naive entries per class model: 
   - loss: **2.42786222642**
   - score on Kaggle: **2.42762**

In [11]:
groups = train.groupby('group').count()

In [12]:
groups.device_id = groups.device_id/len(train.age)

In [13]:
groups.device_id # show our naive prediction

group
F23-      0.067654
F24-26    0.056132
F27-28    0.041771
F29-32    0.062000
F33-42    0.074499
F43+      0.056186
M22-      0.100315
M23-26    0.128676
M27-28    0.072945
M29-31    0.097917
M32-38    0.126948
M39+      0.114957
Name: device_id, dtype: float64

In [14]:
# build the prediction matrix
prediction = np.zeros((len(train.age),len(groups.device_id)))

In [15]:
# Let us use the log_loss 
# (sklearn's logistic loss / cross-entropy) 
# implementation to score our prediction

# first transform group into numerical classes
labelEnc = LabelEncoder()
labelEnc.fit(train.group)
true_group = labelEnc.transform(train.group)

In [16]:
dg = df(columns=groups.index.values)
probs_per_group = dg.append(groups.device_id)

In [17]:
# assign our probabilities to the prediction array
for i in range(0,prediction.shape[0]):
    prediction[i]=probs_per_group.values[0]

In [18]:
print("Logistic loss of our prediction is: ")
print(log_loss(true_group,prediction))

Logistic loss of our prediction is: 
2.42786222642
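
As a cross-check, the log loss of a constant prediction that equals the class frequencies of the scored sample is the Shannon entropy of those frequencies, -sum(p_k * ln(p_k)). The sketch below recomputes it from the rounded frequencies printed earlier; the names `priors` and `naive_loss` are illustrative and not part of the original notebook.

```python
import numpy as np

# class frequencies as printed above (rounded to six decimals)
priors = np.array([0.067654, 0.056132, 0.041771, 0.062000,
                   0.074499, 0.056186, 0.100315, 0.128676,
                   0.072945, 0.097917, 0.126948, 0.114957])

# every row of the prediction matrix equals `priors`, so the mean
# negative log-likelihood reduces to the entropy of the priors
naive_loss = -np.sum(priors * np.log(priors))
print(round(naive_loss, 3))
```

The result should be close to the `log_loss` value above, with small differences coming only from the rounding of the frequencies.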


# Multi-class boosted decision tree classification

In [19]:
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor
from sklearn.cross_validation import StratifiedKFold, KFold, LabelKFold
from sklearn.preprocessing import LabelEncoder

Prepare our multi-class labels.

In [119]:
true_classes = pd.DataFrame(np.zeros((len(train.device_id),len(train.group.unique()))))
true_classes.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')

In [120]:
le_groups = LabelEncoder()
le_groups.fit(train.group.unique())
le_groups.classes_

array(['F23-', 'F24-26', 'F27-28', 'F29-32', 'F33-42', 'F43+', 'M22-',
       'M23-26', 'M27-28', 'M29-31', 'M32-38', 'M39+'], dtype=object)

In [121]:
true_classes.columns = le_groups.inverse_transform(list(true_classes.columns))

In [122]:
true_classes.columns

Index(['F23-', 'F24-26', 'F27-28', 'F29-32', 'F33-42', 'F43+', 'M22-',
       'M23-26', 'M27-28', 'M29-31', 'M32-38', 'M39+'],
      dtype='object')

In [124]:
# There should be a smarter way (e.g. sparse matrices). It's late, sorry.
for i,row,x in zip(range(0,len(train.group)),true_classes.iterrows(), train.group):
    if i % 10001 == 0: 
        print('still alive... ', i)
    row[1][x]=1

still alive...  0
still alive...  10001
still alive...  20002
still alive...  30003
still alive...  40004
still alive...  50005
still alive...  60006
still alive...  70007
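
One possible "smarter way" to build the indicator matrix without a row loop is `pd.get_dummies`. This is a minimal sketch on a toy `group` column (the values are illustrative), not the code that produced the results in this notebook.

```python
import pandas as pd

# toy stand-in for train.group
group = pd.Series(['M32-38', 'F23-', 'M22-', 'M32-38'], name='group')

# one column per class label, 1.0 where the row belongs to that class
one_hot = pd.get_dummies(group).astype(float)
print(one_hot)
```

Note that `get_dummies` only creates columns for labels that actually occur in the input, so a class missing from a subsample would also be missing from the matrix.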


In [125]:
true_classes.columns = le_groups.transform(true_classes.columns)

#### Build the ML algorithm

In [100]:
# create the dataset we are actually using for training
in_data = train.drop(['age','gender','group','device_id','isTrain'], axis=1)

# now transform string variables into numerical categories 
le_phone = LabelEncoder()
le_device = LabelEncoder()
le_phone.fit(in_data['phone_brand'].unique())
le_device.fit(in_data['device_model'].unique())

in_data['device_model'] = le_device.transform(in_data['device_model'])
in_data['phone_brand'] = le_phone.transform(in_data['phone_brand'])

# let's check our dataset
in_data.head(1)


| | phone_brand | device_model | nEvts | longitude_mean | longitude_variance | latitude_mean | latitude_variance | usageTime_mean | usageTime_variance | usageDay_mean | usageDay_variance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 47 | 677 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |


In [126]:
true_class = le_groups.transform(train.group)
true_class

array([10, 10, 10, ...,  6, 10,  7])

In [103]:
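# no hyper-parameter tuning, just from experience...
# (values as reported for the fitted model; the leaf size is about 0.1% of the training rows)
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_features=4,
                                                min_samples_leaf=74.645),
                         algorithm="SAMME.R",
                         learning_rate=0.15,
                         n_estimators=800,
                         random_state=666)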

In [127]:
import time
s=time.time()
bdt.fit(in_data,true_class)
print('training finished after: ', (time.time()-s)/60.0, ' minutes.' )

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=4, max_leaf_nodes=None, min_samples_leaf=74.645,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=0.15, n_estimators=800, random_state=666)

In [132]:
s=time.time()
probas = bdt.predict_proba(in_data)
print('duration of predictions: ', (time.time()-s)/60.0, ' minutes')

duration of predictions:  0.8685107827186584  minutes


array([[ 0.08271853,  0.08349847,  0.08299848, ...,  0.083619  ,
         0.08381157,  0.08353482],
       [ 0.08271853,  0.08349847,  0.08299848, ...,  0.083619  ,
         0.08381157,  0.08353482],
       [ 0.07350266,  0.0755404 ,  0.06335669, ...,  0.0959348 ,
         0.10704796,  0.09117485],
       ..., 
       [ 0.08319621,  0.08313595,  0.08278658, ...,  0.0834848 ,
         0.08365674,  0.08358209],
       [ 0.07508997,  0.04637466,  0.04418379, ...,  0.09102956,
         0.11005088,  0.11972742],
       [ 0.0914771 ,  0.04563885,  0.04851729, ...,  0.10231873,
         0.09931683,  0.1022712 ]])

In [136]:
log_loss(true_classes,probas)

2.3924290904229317

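Not really overwhelming: the loss of 2.392 is evaluated on the training data itself and is only slightly better than the naive reference (2.428).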

#### k-fold cross-validation (does not make much sense any more)
from sklearn.base import clone

kf = KFold(len(true_class), n_folds=10)  # assumption: 10 folds, as in the final print
mean_loss = 0
clfs = []
for indx_train, indx_test in kf:
    s = time.time()
    data_train, data_test = in_data.iloc[indx_train], in_data.iloc[indx_test]
    class_train, class_test = true_class[indx_train], true_class[indx_test]
    clf = clone(bdt)  # fresh copy, so each fold keeps its own fitted model
    clf.fit(data_train,class_train)
    probs = clf.predict_proba(data_test)
    loss = log_loss(class_test,probs)
    mean_loss+=loss
    clfs.append(clf)
    print('fold yields log loss function value of: ', loss)
    print('fold trained in: ', (time.time()-s)/60., ' minutes.')
mean_loss/=len(kf)
print('average loss of BDT model with 10 folds: ', mean_loss)
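
For readers on a newer scikit-learn, the same k-fold idea can be sketched with `sklearn.model_selection`. This is an illustration under assumptions: the data are synthetic, a single small decision tree stands in for the BDT, and the stratified 5-fold split is my choice rather than the notebook's.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))      # synthetic stand-in for in_data
y = rng.randint(0, 3, size=300)    # synthetic stand-in for true_class

model = DecisionTreeClassifier(min_samples_leaf=20, random_state=0)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

losses = []
for train_idx, test_idx in folds.split(X, y):
    clf = clone(model)             # fresh, unfitted copy for every fold
    clf.fit(X[train_idx], y[train_idx])
    probs = clf.predict_proba(X[test_idx])
    # pass the label set explicitly so columns and classes line up
    losses.append(log_loss(y[test_idx], probs, labels=clf.classes_))

mean_loss = float(np.mean(losses))
print('mean log loss over', len(losses), 'folds:', mean_loss)
```

Stratified folds keep the class proportions similar in every fold, which matters when some of the twelve groups are much rarer than others.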

In our current data set-up, two very different prediction scenarios are present: either a device has event entries, or it doesn't. We'll test whether two separately optimised algorithms, one for each scenario, help to reduce the loss score.

In [175]:
bdt.get_params()

{'algorithm': 'SAMME.R',
 'base_estimator': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
             max_features=4, max_leaf_nodes=None, min_samples_leaf=74.645,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=None, splitter='best'),
 'base_estimator__class_weight': None,
 'base_estimator__criterion': 'gini',
 'base_estimator__max_depth': None,
 'base_estimator__max_features': 4,
 'base_estimator__max_leaf_nodes': None,
 'base_estimator__min_samples_leaf': 74.645,
 'base_estimator__min_samples_split': 2,
 'base_estimator__min_weight_fraction_leaf': 0.0,
 'base_estimator__presort': False,
 'base_estimator__random_state': None,
 'base_estimator__splitter': 'best',
 'learning_rate': 0.15,
 'n_estimators': 800,
 'random_state': 666}

In [146]:
joblib.dump(bdt, 'trainedModels/ad_hoc_BDT.pkl',compress=3) 

['trainedModels/ad_hoc_BDT.pkl']

In [179]:
def optimisePars(mva, points, data , classes, fraction=0.6, score = 'log_loss', cvs=5):
    import time
    print("# Tuning hyper-parameters for log_loss score")
    print()
    
    # Splits data
    data_train, data_test, classes_train, classes_test =  train_test_split(
    data, classes, test_size=fraction, random_state=0)
    s =  time.time()
    clf = GridSearchCV(mva, points, cv=cvs,
                       scoring=score, n_jobs=4)
                       
    clf.fit(data_train, classes_train)

    print('GridSearch completed after ', (time.time()-s)/60.0, ' minutes.')
    print()
    print("Best parameters set found on training set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on training set:")
    print()
    for params, mean_score, scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full training set.")
    print("The scores are computed on the full test set.")
    print()
    # classification_report expects class labels, not probabilities
    y_true, y_pred = classes_test, clf.predict(data_test)
    print(classification_report(y_true, y_pred))
    

#### Optimise BDT Parameters

In [150]:
# list of points considered in the optimisation
bdt_pars = {'learning_rate': [0.15,0.05,0.5],
 'n_estimators': [300,800,500],
 'base_estimator__min_samples_leaf': [10,50,100,500]}

In [172]:
in_data['group'] = train.group
data_hasEvts = in_data.query('nEvts>0')
class_hasEvts =  le_groups.transform(data_hasEvts.group)
data_noEvts = in_data.query('nEvts==0').drop(['nEvts','longitude_mean', 'longitude_variance',
       'latitude_mean', 'latitude_variance', 'usageTime_mean',
       'usageTime_variance', 'usageDay_mean', 'usageDay_variance'],axis=1)
class_noEvts =  le_groups.transform(data_noEvts.group)

data_hasEvts = data_hasEvts.drop(['group'],axis=1);
data_noEvts = data_noEvts.drop(['group'],axis=1);


In [178]:
# tune new models
bdt_noEvts = AdaBoostClassifier(DecisionTreeClassifier(),
                         algorithm="SAMME.R")


bdt_hasEvts = AdaBoostClassifier(DecisionTreeClassifier(),
                         algorithm="SAMME.R")

In [None]:
optimisePars(bdt_noEvts, bdt_pars, data_noEvts, class_noEvts)

# Tuning hyper-parameters for log_loss score



In [None]:
optimisePars(bdt_hasEvts, bdt_pars, data_hasEvts, class_hasEvts)
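
Once the two scenario-specific models exist, their predictions have to be merged back into one matrix in the original row order. The helper below is a hypothetical sketch of that step (`combine_predictions` is not part of the notebook) and assumes each probability block keeps the row order of its subsample.

```python
import numpy as np

def combine_predictions(has_events, probs_events, probs_no_events):
    """Merge per-scenario probability blocks into one matrix.

    has_events is a boolean mask over all rows; the two blocks hold the
    predictions for the rows with and without events, in original order.
    """
    has_events = np.asarray(has_events, dtype=bool)
    combined = np.empty((has_events.size, probs_events.shape[1]))
    combined[has_events] = probs_events
    combined[~has_events] = probs_no_events
    return combined

# toy example with two classes and four devices
n_evts = np.array([0, 3, 0, 7])
p_evt = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows with nEvts > 0
p_no = np.array([[0.5, 0.5], [0.5, 0.5]])    # rows with nEvts == 0
combined = combine_predictions(n_evts > 0, p_evt, p_no)
print(combined)
```

The combined matrix can then be scored with a single `log_loss` call, which makes the comparison with the one-model approach direct.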

# Try GradientBoosting


In [142]:
# dictionary holding optimisation points
GBpars = {'n_estimators': [100,300,400,700],
          'min_samples_leaf' : [10, 50, 100, 500],
           'learning_rate' : [0.4, 0.05, 0.1, 0.2]
         }
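
The notebook stops at the parameter grid. As a rough sketch of how such a grid could be scanned, the example below runs a scaled-down version on synthetic data. It assumes a scikit-learn release that provides `sklearn.model_selection` and the `'neg_log_loss'` scorer; the grid values and data are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.normal(size=(150, 4))      # synthetic features
y = rng.randint(0, 3, size=150)    # synthetic three-class target

# scaled-down stand-in for GBpars so the search finishes quickly
small_grid = {'n_estimators': [10, 20],
              'min_samples_leaf': [10, 50],
              'learning_rate': [0.05, 0.1]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      small_grid, scoring='neg_log_loss', cv=3)
search.fit(X, y)

# the scorer is the negated log loss, so flip the sign for reporting
print(search.best_params_, -search.best_score_)
```

Because the scorer is negated, a higher score means a lower log loss, which is why `best_score_` is never positive here.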

# Prepare the submission
First create a matrix for the predictions of the test set.

In [66]:
test = pd.read_csv(files_dir+'gender_age_test.csv')  # load the test device ids
prediction = np.zeros((len(test.device_id),len(groups.index.values)))
# assign our probabilities to the prediction array
for i in range(0,prediction.shape[0]):
    prediction[i]=probs_per_group.values[0]

#### Now define a function that prepares a valid submission csv
It takes the test dataset and the prediction matrix as input.

In [263]:
def prepareOutput(test, pred, label='talkingData'):
    '''
    Writes a valid submission file from the prediction matrix.
    The valid output must look like: 
    device_id,F23-,F24-26,F27-28,F29-32,F33-42,F43+,M22-,M23-26,M27-28,M29-31,M32-38,M39+
    (id, probabilities)

    Arguments:
    test  - the DataFrame with the device_ids to be tested
    pred  - the prediction matrix with pred.shape = (len(test.device_id), len(unique groups))
    label - prefix of the submission file
    
    Return:
    The merged submission dataset is returned.
    '''
    p = pd.DataFrame(pred)
    p.columns = labelEnc.inverse_transform(p.columns)
    i = pd.DataFrame(test.device_id.values) 
    i.columns = ['device_id']
    merged= pd.concat([i,p], axis=1)
    merged.to_csv(solution_dir+label+'_submission.csv', index=False)
    return merged
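
Before uploading, a quick sanity check of the submission frame can catch column or normalisation mistakes. The function below is a hypothetical helper, not part of the notebook, and assumes every row is meant to be a probability distribution over the classes.

```python
import numpy as np
import pandas as pd

def check_submission(sub, classes, tol=1e-4):
    """Return True if `sub` has the expected columns and valid probabilities."""
    expected = ['device_id'] + list(classes)
    if list(sub.columns) != expected:
        return False
    probs = sub[list(classes)].to_numpy(dtype=float)
    in_range = np.all((probs >= 0) & (probs <= 1))
    normalised = np.allclose(probs.sum(axis=1), 1.0, atol=tol)
    return bool(in_range and normalised)

# toy example with two classes
classes = ['F23-', 'M22-']
sub = pd.DataFrame({'device_id': [1, 2],
                    'F23-': [0.4, 0.3],
                    'M22-': [0.6, 0.7]})
print(check_submission(sub, classes))
```

The tolerance leaves room for frequencies that were rounded before being written out.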

In [265]:
o = prepareOutput(test,prediction,'entriesPerClass')
%ls predictions/

entriesPerClass_submission.csv


In [266]:
o.head(2)

| | device_id | F23- | F24-26 | F27-28 | F29-32 | F33-42 | F43+ | M22- | M23-26 | M27-28 | M29-31 | M32-38 | M39+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1002079943728939269 | 0.067654 | 0.056132 | 0.041771 | 0.062 | 0.074499 | 0.056186 | 0.100315 | 0.128676 | 0.072945 | 0.097917 | 0.126948 | 0.114957 |
| 1 | -1547860181818787117 | 0.067654 | 0.056132 | 0.041771 | 0.062 | 0.074499 | 0.056186 | 0.100315 | 0.128676 | 0.072945 | 0.097917 | 0.126948 | 0.114957 |


#### This worked. The output can be submitted to Kaggle.