# 'Trained' models
* The first reference is a naive model that uses no information about the individual customer. The probability that a test customer falls into a given class is assumed to be *n(customers in that class)/n(all customers)*.
* The second model is a multi-class boosted decision tree classifier.
* The third model uses a boosted decision tree to classify the gender (male or female) and a regression model to predict the age.

---

Scores under the logistic loss (cross-entropy):
* Naive entries-per-class model:
   - loss: **2.42786222642**
   - score on Kaggle: **2.42762**

In [1]:
import pandas as pd
from pandas import DataFrame as df
import seaborn as sns
import sys
import numpy as np

%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import log_loss

In [2]:
sns.set_style('ticks')

In [3]:
sys.path.append("/home/mschlupp/pythonTools")
tmp = %pwd
files_dir = tmp + "/files/" 
solution_dir = tmp+'/predictions/'

In [4]:
#train = pd.read_csv(files_dir+'gender_age_train.csv')
#test = pd.read_csv(files_dir+'gender_age_test.csv')

In [5]:
%ls files/

app_events.csv        phone_brand_device_model.csv
app_labels.csv        phone_brand_device_model_engl.csv
events.csv            sample_submission.csv
events_day_hour.csv   traintest_fullevt.csv
gender_age_test.csv   traintest_phone.csv
gender_age_train.csv  traintest_phone_day_hour.csv
label_categories.csv  traintest_phone_evts.csv


#### Let us first remind ourselves what we have in our file

In [6]:
new_set = pd.read_csv(files_dir+'traintest_fullevt.csv', nrows=0) # just read the header

In [7]:
cols=new_set.columns
print(cols)

Index(['age', 'device_id', 'gender', 'group', 'isTrain', 'phone_brand',
       'device_model', 'hasEvents', 'nEvts', 'longitude_mean',
       'longitude_variance', 'latitude_mean', 'latitude_variance',
       'usageTime_mean', 'usageTime_variance', 'usageDay_mean',
       'usageDay_variance'],
      dtype='object')


In [8]:
cols = cols.drop('hasEvents')

In [9]:
# read the data chunk-wise and keep only the training rows (isTrain == 1);
# the test sample is not needed for now.
iter_csv = pd.read_csv(files_dir+'traintest_fullevt.csv', usecols=cols, iterator=True, chunksize=1500)
train = pd.concat([chunk[chunk['isTrain'] ==1] for chunk in iter_csv])
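The chunk-wise filter can be tried out on a tiny in-memory CSV. This is only a sketch: the toy columns and values below are hypothetical stand-ins for `traintest_fullevt.csv`, and the chunk size is chosen small so that several chunks are produced.

```python
import io
import pandas as pd

# Hypothetical miniature of the real file: three training rows, two test rows.
toy_csv = io.StringIO(
    "device_id,group,isTrain\n"
    "1,M22-,1\n"
    "2,,0\n"
    "3,F23-,1\n"
    "4,,0\n"
    "5,M39+,1\n"
)

# chunksize makes read_csv return an iterator of DataFrames instead of one frame.
chunks = pd.read_csv(toy_csv, chunksize=2)

# Filter each chunk before concatenating, so rejected rows never accumulate in memory.
toy_train = pd.concat([chunk[chunk["isTrain"] == 1] for chunk in chunks])
print(toy_train)
```

Only the filtered rows of each chunk are kept, which is what keeps the memory footprint low when the full file is large.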
                       

In [10]:
train.head(2)

| | age | device_id | gender | group | isTrain | phone_brand | device_model | nEvts | longitude_mean | longitude_variance | latitude_mean | latitude_variance | usageTime_mean | usageTime_variance | usageDay_mean | usageDay_variance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35 | -8076087639492063270 | M | M32-38 | 1 | 小米 | MI 2 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | 35 | -2897161552818060146 | M | M32-38 | 1 | 小米 | MI 2 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |


# Reference  model
The first reference is a naive model that uses no information about the individual customer.
The probability that a test customer falls into a given class is assumed to be *n(customers in that class)/n(all customers)*.

---

Scores under the logistic loss (cross-entropy):
* Naive entries-per-class model:
   - loss: **2.42786222642**
   - score on Kaggle: **2.42762**

In [104]:
groups = train.groupby('group').count()

In [105]:
groups.device_id = groups.device_id/len(train.age)

In [106]:
groups.device_id # show our naive prediction

group
F23-      0.067654
F24-26    0.056132
F27-28    0.041771
F29-32    0.062000
F33-42    0.074499
F43+      0.056186
M22-      0.100315
M23-26    0.128676
M27-28    0.072945
M29-31    0.097917
M32-38    0.126948
M39+      0.114957
Name: device_id, dtype: float64

In [17]:
# build the prediction matrix
prediction = np.zeros((len(train.age),len(groups.device_id)))

In [15]:
# Let us use the log_loss 
# (sklearn's logistic loss / cross-entropy) 
# implementation to score our prediction

# first transform group into numerical classes
labelEnc = LabelEncoder()
labelEnc.fit(train.group)
true_group = labelEnc.transform(train.group)

In [146]:
# one-row DataFrame: columns are the group names, values are the class frequencies
probs_per_group = groups.device_id.to_frame().T

In [164]:
# assign our probabilities to the prediction array
for i in range(0,prediction.shape[0]):
    prediction[i]=probs_per_group.values[0]
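The row-by-row assignment can also be written as a single NumPy call. A minimal sketch, assuming a hypothetical three-class frequency vector in place of the twelve real group frequencies:

```python
import numpy as np

# Hypothetical class frequencies (the real ones come from the groupby above).
priors = np.array([0.2, 0.5, 0.3])
n_rows = 4  # stands in for the number of customers

# Repeat the same probability vector once per row.
prediction_toy = np.tile(priors, (n_rows, 1))
print(prediction_toy.shape)
```

Every row of the result is identical, which is exactly what the naive model predicts: the same class probabilities for every customer.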

In [167]:
print("Logistic loss of our prediction is: ")
print(log_loss(true_group,prediction))

Logistic loss of our prediction is: 
2.42786222642
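As a cross-check: when every customer receives the class frequencies as prediction, the mean log loss on the same sample reduces to the Shannon entropy of those frequencies, -Σ p·ln(p). The sketch below recomputes it from the rounded frequencies printed above, so it should land close to the reported loss, up to that rounding.

```python
import math

# Class frequencies as printed above (rounded to six decimals).
freqs = {
    "F23-": 0.067654, "F24-26": 0.056132, "F27-28": 0.041771,
    "F29-32": 0.062000, "F33-42": 0.074499, "F43+": 0.056186,
    "M22-": 0.100315, "M23-26": 0.128676, "M27-28": 0.072945,
    "M29-31": 0.097917, "M32-38": 0.126948, "M39+": 0.114957,
}

# Entropy in nats; log_loss also uses the natural logarithm.
entropy = -sum(p * math.log(p) for p in freqs.values())
print(entropy)
```

This also explains why the naive score is a useful floor: any model that scores worse than the entropy of the class distribution carries less information than the class frequencies alone.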


# Multi-class boosted decision tree classification

In [115]:
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor
from sklearn.cross_validation import StratifiedKFold, KFold, LabelKFold
from sklearn.preprocessing import LabelEncoder

Prepare our multi-class labels.

In [126]:
a = np.zeros((len(train.device_id),len(train.group.unique())))

In [68]:
true_class = pd.DataFrame(np.zeros((len(train.device_id),len(train.group.unique()))))
true_class.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')

In [69]:
le_groups = LabelEncoder()
le_groups.fit(train.group.unique())
le_groups.classes_

array(['F23-', 'F24-26', 'F27-28', 'F29-32', 'F33-42', 'F43+', 'M22-',
       'M23-26', 'M27-28', 'M29-31', 'M32-38', 'M39+'], dtype=object)

In [73]:
true_class.columns = le_groups.inverse_transform(list(true_class.columns))

In [138]:
# There should be a smarter way. It's late, sorry
for i,row,x,n in zip(range(0,len(train.group)),true_class.iterrows(), train.group, a):
    if i % 10001 == 0: 
        print('still alive... ', i)
    row[1][x]=1
    # transform expects an array-like, so wrap the single label in a list
    n[le_groups.transform([x])[0]]=1

still alive...  0
still alive...  10001
still alive...  20002
still alive...  30003
still alive...  40004
still alive...  50005
still alive...  60006
still alive...  70007
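One candidate for the "smarter way" is to index an identity matrix with the integer class codes, which builds the whole one-hot matrix without a Python loop. A minimal sketch on a hypothetical four-row label array; `np.unique` sorts the labels, which is the same ordering `LabelEncoder` produces:

```python
import numpy as np

# Hypothetical stand-in for train.group.
toy_groups = np.array(["M22-", "F23-", "M39+", "F23-"])

# classes: sorted unique labels; codes: integer class index of every row.
classes, codes = np.unique(toy_groups, return_inverse=True)

# Row k of the identity matrix is the one-hot vector of class k.
one_hot = np.eye(len(classes))[codes]
print(classes)
print(one_hot)
```

For a DataFrame with named columns, `pd.get_dummies(train.group)` would be an alternative that yields the same layout.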


In [119]:
# KFold indices
kf = KFold(len(true_class.index), n_folds=10, shuffle=True)

#### Build the ML algorithm

In [118]:
# no hyper-parameter tuning, just from experience...
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3,min_samples_split=int(0.008*len(true_class.index))),
                         algorithm="SAMME.R",
                         learning_rate=0.15,
                         random_state=666, # start reproducible 
                         n_estimators=800)

In [145]:
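# AdaBoostClassifier.fit expects 1-D class labels. Fitting on the one-hot
# frame `true_class` is the likely cause of the error seen on the first run:
#   ValueError: setting an array element with a sequence.
# so we use the integer-encoded group instead.
# NOTE: `in_data` must be a numeric feature matrix built from `train`;
# it is not defined in the cells above.
true_label = le_groups.transform(train.group)
mean_loss = 0
for indx_train, indx_test in kf:
    data_train, data_test = in_data.iloc[indx_train], in_data.iloc[indx_test]
    class_train, class_test = true_label[indx_train], true_label[indx_test]
    bdt.fit(data_train,class_train)
    probs = bdt.predict_proba(data_test)
    loss = log_loss(class_test,probs)
    mean_loss+=loss
    print('fold yields log loss function value of: ', loss)
mean_loss/=len(kf)
print('average loss of BDT model with 10 folds: ', mean_loss)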
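`AdaBoostClassifier.fit` expects a 1-D array of class labels rather than a one-hot matrix. The sketch below shows the cross-validation loop in that form on synthetic data. It assumes a recent scikit-learn, where the splitters live in `sklearn.model_selection`; the random features, the three classes and the small `n_estimators` are illustrative only and not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
y = rng.randint(0, 3, size=150)      # 1-D integer class labels
X = rng.normal(size=(150, 4))        # synthetic numeric features
X[:, 0] += y                         # make one feature informative

# Stratified folds keep every class present in each training split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
losses = []
for idx_train, idx_test in skf.split(X, y):
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                             n_estimators=20,
                             learning_rate=0.15,
                             random_state=0)
    clf.fit(X[idx_train], y[idx_train])
    probs = clf.predict_proba(X[idx_test])
    # labels= tells log_loss which class each probability column belongs to.
    losses.append(log_loss(y[idx_test], probs, labels=clf.classes_))

mean_loss = float(np.mean(losses))
print("average log loss over", len(losses), "folds:", mean_loss)
```

A fresh classifier is created per fold here so that no state carries over between fits.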

# Prepare the submission
First create a matrix for the predictions of the test set.

In [194]:
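# the test sample was not loaded above, so read its device ids here
test = pd.read_csv(files_dir+'gender_age_test.csv')
prediction = np.zeros((len(test.device_id),len(groups.index.values)))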
# assign our probabilities to the prediction array
for i in range(0,prediction.shape[0]):
    prediction[i]=probs_per_group.values[0]

#### Now define a function that prepares a valid submission CSV
It takes the test dataset and the prediction matrix as input.

In [263]:
def prepareOutput(test, pred, label='talkingData'):
    '''
    Writes a valid submission file from the prediction matrix.
    The valid output must look like: 
    device_id,F23-,F24-26,F27-28,F29-32,F33-42,F43+,M22-,M23-26,M27-28,M29-31,M32-38,M39+
    (id, probabilities)

    Arguments:
    test  - the DataFrame with the device_ids to be tested
    pred  - the prediction matrix with pred.shape = (len(test.device_id), len(unique groups))
    label - prefix of the submission file
    
    Return:
    The merged submission dataset is returned.
    '''
    p = pd.DataFrame(pred)
    p.columns = labelEnc.inverse_transform(p.columns)
    i = pd.DataFrame(test.device_id.values) 
    i.columns = ['device_id']
    merged= pd.concat([i,p], axis=1)
    merged.to_csv(solution_dir+label+'_submission.csv', index=False)
    return merged

In [265]:
o = prepareOutput(test,prediction,'entriesPerClass')
%ls predictions/

entriesPerClass_submission.csv


In [266]:
o.head(2)

| | device_id | F23- | F24-26 | F27-28 | F29-32 | F33-42 | F43+ | M22- | M23-26 | M27-28 | M29-31 | M32-38 | M39+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1002079943728939269 | 0.067654 | 0.056132 | 0.041771 | 0.062 | 0.074499 | 0.056186 | 0.100315 | 0.128676 | 0.072945 | 0.097917 | 0.126948 | 0.114957 |
| 1 | -1547860181818787117 | 0.067654 | 0.056132 | 0.041771 | 0.062 | 0.074499 | 0.056186 | 0.100315 | 0.128676 | 0.072945 | 0.097917 | 0.126948 | 0.114957 |


#### This worked. The output can be submitted to Kaggle.
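Before uploading, a few inexpensive checks on the submission frame can catch layout mistakes. The helper below is a hypothetical sketch, not part of the notebook: `check_submission` and the toy two-class frame are made up for illustration, and the checks encode an assumption about what "valid" means (expected column order, unique ids, non-negative probabilities, rows summing to about one).

```python
import numpy as np
import pandas as pd

def check_submission(sub, class_names):
    """Hypothetical sanity checks for a submission DataFrame."""
    # Column layout: the id first, then one probability column per class.
    assert list(sub.columns) == ["device_id"] + list(class_names)
    # Each device should appear exactly once.
    assert sub["device_id"].is_unique
    probs = sub[list(class_names)].to_numpy()
    # Probabilities are non-negative and each row sums to roughly one.
    assert np.all(probs >= 0)
    assert np.allclose(probs.sum(axis=1), 1.0, atol=1e-4)
    return True

# Toy example with two hypothetical classes.
toy_classes = ["F23-", "M22-"]
toy_sub = pd.DataFrame({
    "device_id": [11, 12, 13],
    "F23-": [0.4, 0.4, 0.4],
    "M22-": [0.6, 0.6, 0.6],
})
print(check_submission(toy_sub, toy_classes))
```

With the real data, the call would take the frame returned by `prepareOutput` and the class names from `labelEnc.classes_`.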