<h1>Using categorical data in machine learning with Python</h1>
https://blog.myyellowroad.com/using-categorical-data-in-machine-learning-with-python-from-dummy-variables-to-deep-category-66041f734512

Data fields (https://www.kaggle.com/c/avazu-ctr-prediction/data)

id: ad identifier
click: 0/1 for non-click/click
hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
C1 -- anonymized categorical variable
banner_pos
site_id
site_domain
site_category
app_id
app_domain
app_category
device_id
device_ip
device_model
device_type
device_conn_type
C14-C21 -- anonymized categorical variables
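
The `hour` field above packs a date and an hour into one number. As a minimal sketch (the helper name `parse_hour` is hypothetical and not part of the notebook), it can be decoded with the standard library:

```python
from datetime import datetime, timezone

def parse_hour(value):
    """Parse a YYMMDDHH value (int or str) into a timezone-aware UTC datetime."""
    # Zero-pad in case the value was read as an integer and lost a leading zero.
    text = str(value).zfill(8)
    return datetime.strptime(text, "%y%m%d%H").replace(tzinfo=timezone.utc)

parsed = parse_hour(14091123)
print(parsed.isoformat())
```

The notebook itself does not use `hour` as a feature (column index 2 is skipped in `features`), so this is only relevant if you want to derive hour-of-day or day-of-week categories.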

In [118]:
import numpy as np
import pandas
from sklearn.metrics import log_loss
import copy

train_file = "../input/click_through/train_obs10000.csv"
train = pandas.read_csv(train_file)
train.head()

Unnamed: 0,id,click,hour,C1,banner_pos,site_id,site_domain,site_category,app_id,app_domain,...,device_type,device_conn_type,C14,C15,C16,C17,C18,C19,C20,C21
0,1000009418151094273,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,2,15706,320,50,1722,0,35,-1,79
1,10000169349117863715,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
2,10000371904215119486,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15704,320,50,1722,0,35,100084,79
3,10000640724480838376,0,14102100,1005,0,1fbe01fe,f3845767,28905ebd,ecad2386,7801e8d9,...,1,0,15706,320,50,1722,0,35,100084,79
4,10000679056417042096,0,14102100,1005,1,fe8cc448,9166c161,0569f928,ecad2386,7801e8d9,...,1,0,18993,320,50,2161,0,35,-1,157


In [119]:
msk = np.random.rand(len(train)) < 0.8
print("msk shape: ", msk.shape, "msk: ", msk)
features = [3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23]
X_train = train[msk].iloc[:, features]
X_test = train[~msk].iloc[:, features]
y_train = train[msk].iloc[:,1]
y_test = train[~msk].iloc[:,1]

X_train_ori = copy.deepcopy(X_train)
X_test_ori = copy.deepcopy(X_test)
y_train_ori = copy.deepcopy(y_train)
y_test_ori = copy.deepcopy(y_test)


print("X_train shape: ", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)
print(X_train_ori.head(2))
print(y_train_ori.head(2))

msk shape:  (9999,) msk:  [ True  True  True ...  True  True  True]
X_train shape:  (8056, 20)
X_test shape:  (1943, 20)
y_train shape:  (8056,)
y_test shape:  (1943,)
     C1  banner_pos   site_id site_domain site_category    app_id app_domain  \
0  1005           0  1fbe01fe    f3845767      28905ebd  ecad2386   7801e8d9   
1  1005           0  1fbe01fe    f3845767      28905ebd  ecad2386   7801e8d9   

  app_category device_id device_model  device_type  device_conn_type    C14  \
0     07d7df22  a99f214a     44956a24            1                 2  15706   
1     07d7df22  a99f214a     711ee120            1                 0  15704   

   C15  C16   C17  C18  C19     C20  C21  
0  320   50  1722    0   35      -1   79  
1  320   50  1722    0   35  100084   79  
0    0
1    0
Name: click, dtype: int64
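
The mask above is drawn without a seed, so the split (and every score below) changes from run to run. A minimal sketch of a reproducible variant, using only the standard library; `split_mask` and its `seed` parameter are illustrative names, not something the notebook defines:

```python
import random

def split_mask(n_rows, train_fraction=0.8, seed=0):
    """Return a list of booleans: True means the row goes to the training set."""
    rng = random.Random(seed)  # a private generator, so the global state is untouched
    return [rng.random() < train_fraction for _ in range(n_rows)]

mask = split_mask(10, seed=42)
train_idx = [i for i, keep in enumerate(mask) if keep]
test_idx = [i for i, keep in enumerate(mask) if not keep]
print(train_idx, test_idx)
```

With NumPy the same effect can be had by seeding the generator before calling `np.random.rand`.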


In [120]:
print("Base line: ", log_loss(y_test, np.ones(len(y_test)) * y_train.mean()))

Base line:  0.47458841471057495
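
The baseline predicts the same probability (the training click rate) for every test row. For a constant prediction p and a test click rate q, the log loss reduces to -(q log p + (1 - q) log(1 - p)). The sketch below checks that by hand on toy labels; the data and the function name `constant_log_loss` are made up for illustration:

```python
import math

def constant_log_loss(y_true, p):
    """Average binary log loss when every row is predicted with probability p."""
    eps = 1e-15
    p = min(max(p, eps), 1 - eps)  # avoid log(0)
    total = 0.0
    for y in y_true:
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_train_toy = [0, 0, 0, 1, 0]   # training click rate 0.2
y_test_toy = [0, 1, 0, 0]       # test click rate 0.25
p = sum(y_train_toy) / len(y_train_toy)
baseline = constant_log_loss(y_test_toy, p)
print(baseline)
```

Any encoding that scores above this number on the real data is doing worse than ignoring the features altogether.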


<h2>Method 1: Encoding to ordinal variables</h2>

In [121]:
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X_train_ordinal = X_train.values
X_test_ordinal = X_test.values
les = []
l = LogisticRegression()
r = RandomForestClassifier(n_estimators=25, max_depth=10)
for i in  range(X_train_ordinal.shape[1]):
    le = preprocessing.LabelEncoder()
    le.fit(train.iloc[:, features].iloc[:, i])
    les.append(le)
    train_ordinal_embedding = le.transform(X_train_ordinal[:,i])
    X_train_ordinal[:, i] = train_ordinal_embedding
    print("train ordinal transform: ", train_ordinal_embedding)
    X_test_ordinal[:,i] = le.transform(X_test_ordinal[:,i])


train ordinal transform:  [2 2 2 ... 2 2 2]
train ordinal transform:  [0 0 0 ... 0 1 0]
train ordinal transform:  [ 43  43  43 ...  43 141 202]
train ordinal transform:  [301 301 301 ... 301  27 244]
train ordinal transform:  [ 2  2  2 ...  2 12  5]
train ordinal transform:  [293 293 293 ... 293 293 119]
train ordinal transform:  [13 13 13 ... 13 13  3]
train ordinal transform:  [0 0 0 ... 0 0 2]
train ordinal transform:  [722 722 722 ... 722 722 722]
train ordinal transform:  [315 488 604 ... 987 992 204]
train ordinal transform:  [1 1 1 ... 1 1 1]
train ordinal transform:  [1 0 0 ... 0 0 0]
train ordinal transform:  [ 67  65  65 ...  62  87 241]
train ordinal transform:  [2 2 2 ... 2 2 2]
train ordinal transform:  [1 1 1 ... 1 1 1]
train ordinal transform:  [ 27  27  27 ...  27  40 104]
train ordinal transform:  [0 0 0 ... 0 0 3]
train ordinal transform:  [ 0  0  0 ...  0 16 12]
train ordinal transform:  [ 0 42 42 ...  0  0 55]
train ordinal transform:  [15 15 15 ... 15 26 11]


In [122]:
l.fit(X_train_ordinal, y_train)
y_pred = l.predict_proba(X_test_ordinal)
print("Logistic Regression: ", log_loss(y_test, y_pred))
r.fit(X_train_ordinal, y_train)
y_pred = r.predict_proba(X_test_ordinal)
print("Random Forest: ", log_loss(y_test, y_pred))

Logistic Regression:  0.4542862899520212
Random Forest:  0.434011278356416


Logistic regression applies a coefficient to every explanatory feature, most of which contain merely noise because of the arbitrary encoding, whereas random forest includes an inherent feature selection mechanism. Most probably, random forest selected the features in which the encoding happened to correlate with the CTR and used mainly those features to explain the target.
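
To make the "arbitrary encoding" concrete, here is a minimal pure-Python sketch of what an ordinal (label) encoder does: it sorts the distinct categories and numbers them. The function names and the `-1` code for unseen values are assumptions of this sketch; the notebook instead fits `LabelEncoder` on all rows of `train`, so it never meets an unseen value.

```python
def fit_ordinal(values):
    """Map each distinct category to an integer by sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(values)))}

def transform_ordinal(values, mapping, unknown=-1):
    """Replace categories by their integer code; unseen categories get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

train_col = ["1fbe01fe", "fe8cc448", "1fbe01fe", "8fda644b"]
test_col = ["8fda644b", "d9750ee7"]

mapping = fit_ordinal(train_col)
train_codes = transform_ordinal(train_col, mapping)
test_codes = transform_ordinal(test_col, mapping)
print(train_codes, test_codes)
```

The integer assigned to a site ID depends only on how its hash string sorts, which is why a linear model has little to gain from it.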

<h2>Method 2: One hot encoding (or dummy variables)</h2>

In [123]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train_ordinal)
X_train_one_hot = enc.transform(X_train_ordinal)
X_test_one_hot = enc.transform(X_test_ordinal)
l.fit(X_train_one_hot, y_train)
y_pred = l.predict_proba(X_test_one_hot)
print("LR: ", log_loss(y_test, y_pred))
r.fit(X_train_one_hot, y_train)
y_pred = r.predict_proba(X_test_one_hot)
print("RF: ", log_loss(y_test, y_pred))
print(X_train_one_hot.shape)

LR:  0.4333707765043064
RF:  0.44193633413557315
(8056, 3446)


Why is logistic regression better than random forest here? This method represents the data with a huge number of features (the curse of dimensionality). With so many features we need a very simple classifier to avoid overfitting, and logistic regression is much simpler than random forest. Moreover, the sparsity of the data makes it very hard for the random forest to find good splits that help separate the classes.
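
The width of the one-hot matrix is simply the sum of the per-column cardinalities, which is where the thousands of columns come from. A minimal sketch in plain Python (helper names are hypothetical; unseen categories are encoded as all zeros, which is also what `handle_unknown='ignore'` does):

```python
def fit_one_hot(rows):
    """Assign one output column to every (column index, category) pair seen in training."""
    index = {}
    for row in rows:
        for col, value in enumerate(row):
            index.setdefault((col, value), len(index))
    return index

def transform_one_hot(rows, index):
    """Encode rows as dense 0/1 vectors; unseen categories stay all zeros."""
    encoded = []
    for row in rows:
        vec = [0] * len(index)
        for col, value in enumerate(row):
            pos = index.get((col, value))
            if pos is not None:
                vec[pos] = 1
        encoded.append(vec)
    return encoded

train_rows = [("1005", "a"), ("1005", "b"), ("1002", "c")]
index = fit_one_hot(train_rows)
width = len(index)  # 2 categories in column 0 + 3 in column 1
print(width, transform_one_hot([("1002", "zzz")], index))
```

A real implementation would keep the result sparse, as scikit-learn does, since each row has only one non-zero entry per original column.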

<h3>Reduce dimensions by grouping rare values</h3>

In [124]:
import copy
X_train_rare = copy.copy(X_train)
X_test_rare = copy.copy(X_test)
X_train_rare["test"]=0
X_test_rare["test"]=1
temp_df = pandas.concat([X_train_rare,X_test_rare],axis=0)
names = list(X_train_rare.columns.values)
for i in names:
    temp_df.loc[temp_df[i].value_counts()[temp_df[i]].values < 20, i] = "RARE_VALUE"
for i in range(temp_df.shape[1]):
    temp_df.iloc[:,i]=temp_df.iloc[:,i].astype('str')
    
X_train_rare = temp_df[temp_df["test"]=="0"].iloc[:,:-1].values
X_test_rare = temp_df[temp_df["test"]=="1"].iloc[:,:-1].values
for i in range(X_train_rare.shape[1]):
    le = preprocessing.LabelEncoder()
    le.fit(temp_df.iloc[:,:-1].iloc[:, i])
    les.append(le)
    X_train_rare[:, i] = le.transform(X_train_rare[:, i])
    X_test_rare[:, i] = le.transform(X_test_rare[:, i])
    
enc.fit(X_train_rare)
X_train_rare = enc.transform(X_train_rare)
X_test_rare = enc.transform(X_test_rare)
l.fit(X_train_rare,y_train)
y_pred = l.predict_proba(X_test_rare)
print("LR: ", log_loss(y_test,y_pred))
r.fit(X_train_rare,y_train)
y_pred = r.predict_proba(X_test_rare)
print("RF: ", log_loss(y_test,y_pred))
print(X_train_rare.shape)

LR:  0.4305982457848496
RF:  0.43399603103418616
(8056, 422)


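We can overcome some of these disadvantages by encoding all rare categories as a single feature ("RARE_VALUE"). In some datasets this reduces the dimensionality drastically with only a small decrease in performance, or even an increase. Here the one-hot matrix shrinks from 3446 to 422 columns, and the logistic regression log loss improves slightly, from 0.4334 to 0.4306.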
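
The rare-value step can be sketched without pandas. Note one deliberate difference, stated as an assumption of this sketch: the notebook counts values over train and test together, while the version below learns the set of frequent values from the training column only. The threshold of 20 matches the notebook; the helper names are hypothetical.

```python
from collections import Counter

def fit_rare(values, min_count=20):
    """Return the set of categories that occur at least `min_count` times."""
    counts = Counter(values)
    return {v for v, c in counts.items() if c >= min_count}

def apply_rare(values, frequent, placeholder="RARE_VALUE"):
    """Replace every category outside `frequent` (including unseen ones) by the placeholder."""
    return [v if v in frequent else placeholder for v in values]

train_col = ["a"] * 25 + ["b"] * 3 + ["c"] * 20
frequent = fit_rare(train_col)
test_out = apply_rare(["a", "b", "c", "d"], frequent)
print(sorted(frequent), test_out)
```

A side benefit is that categories never seen in training fall into the same bucket as the rare ones, so the encoder always has a column for them.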

<h2>Method 3: Feature hashing (a.k.a the hashing trick)</h2>

In [125]:
from sklearn.feature_extraction import FeatureHasher
X_train_hash = copy.copy(X_train)
X_test_hash = copy.copy(X_test)
print("X_train_hash: ", X_train_hash.shape)
print("X_test_hash: ", X_test_hash.shape)

X_train_hash:  (8056, 21)
X_test_hash:  (1943, 21)


In [126]:
# convert to an object type using "str"
for i in range(X_train_hash.shape[1]):
    X_train_hash.iloc[:,i] = X_train_hash.iloc[:,i].astype('str')
for i in range(X_test_hash.shape[1]):
    X_test_hash.iloc[:,i] = X_test_hash.iloc[:,i].astype('str')

In [127]:
print(X_train_hash.values[0])
print(X_test_hash.head(1))

['1005' '0' '1fbe01fe' 'f3845767' '28905ebd' 'ecad2386' '7801e8d9'
 '07d7df22' 'a99f214a' '44956a24' '1' '2' '15706' '320' '50' '1722' '0'
 '35' '-1' '79' '0']
     C1 banner_pos   site_id site_domain site_category    app_id app_domain  \
6  1005          0  8fda644b    25d4cfcd      f028772b  ecad2386   7801e8d9   

  app_category device_id device_model ...  device_conn_type    C14  C15 C16  \
6     07d7df22  a99f214a     be6db1d7 ...                 0  20362  320  50   

    C17 C18 C19 C20  C21 test  
6  2333   0  39  -1  157    1  

[1 rows x 21 columns]


In [128]:
h = FeatureHasher(n_features=100, input_type="string")

In [129]:
X_train_hash = h.transform(X_train_hash.values)
X_test_hash = h.transform(X_test_hash.values)


In [130]:
print(X_train_hash.shape)
print(type(X_train_hash))
print(X_train_hash[8000])
print(X_train_hash[:,0])
#print(X_train_hash.toarray)

(8056, 100)
<class 'scipy.sparse.csr.csr_matrix'>
  (0, 0)	0.0
  (0, 2)	-1.0
  (0, 7)	2.0
  (0, 28)	-1.0
  (0, 35)	1.0
  (0, 37)	3.0
  (0, 50)	-1.0
  (0, 54)	-1.0
  (0, 57)	-1.0
  (0, 60)	1.0
  (0, 75)	1.0
  (0, 79)	-1.0
  (0, 89)	-2.0
  (0, 96)	-1.0
  (6, 0)	1.0
  (8, 0)	1.0
  (16, 0)	1.0
  (21, 0)	1.0
  (23, 0)	1.0
  (26, 0)	1.0
  (27, 0)	1.0
  (31, 0)	1.0
  (36, 0)	1.0
  (37, 0)	1.0
  (44, 0)	1.0
  (45, 0)	1.0
  (47, 0)	-1.0
  (51, 0)	1.0
  (54, 0)	1.0
  (59, 0)	-1.0
  (69, 0)	1.0
  (74, 0)	1.0
  (80, 0)	1.0
  (82, 0)	1.0
  (84, 0)	1.0
  (88, 0)	1.0
  (91, 0)	1.0
  (96, 0)	1.0
  (97, 0)	1.0
  :	:
  (7947, 0)	-1.0
  (7949, 0)	1.0
  (7957, 0)	-1.0
  (7958, 0)	1.0
  (7966, 0)	1.0
  (7968, 0)	1.0
  (7972, 0)	2.0
  (7973, 0)	1.0
  (7975, 0)	1.0
  (7976, 0)	1.0
  (7978, 0)	1.0
  (7981, 0)	-1.0
  (7991, 0)	2.0
  (7993, 0)	1.0
  (7995, 0)	-1.0
  (8000, 0)	0.0
  (8004, 0)	1.0
  (8021, 0)	-1.0
  (8023, 0)	1.0
  (8025, 0)	0.0
  (8034, 0)	0.0
  (8035, 0)	1.0
  (8051, 0)	1.0
  (8052, 0)	-1.0
  (

In [131]:
l.fit(X_train_hash, y_train)
y_pred = l.predict_proba(X_test_hash)
print("LR: ", log_loss(y_test, y_pred))
r.fit(X_train_hash,y_train)
y_pred = r.predict_proba(X_test_hash)
print("RF: ", log_loss(y_test,y_pred))

LR:  0.4396245695633043
RF:  0.43471334823615365


<h2>Method 4: Encoding categories with dataset statistics</h2>

In [132]:
import copy
X_train_count = copy.copy(X_train)
X_test_count = copy.copy(X_test)
X_train_count["test"] = 0
X_test_count["test"] = 1
temp_df = pandas.concat([X_train_count, X_test_count], axis=0)
print("temp_df shape: ", temp_df.shape)
for i in range(temp_df.shape[1]):
    temp_df.iloc[:,i] = temp_df.iloc[:,i].astype('category')
X_train_count = temp_df[temp_df["test"]==0].iloc[:,:-1]
X_test_count = temp_df[temp_df["test"]==1].iloc[:,:-1]

temp_df shape:  (9999, 21)


In [133]:
counts = X_train_count.iloc[:,0].value_counts()
print(X_train_count.iloc[0:5,0])
print(X_train_count.shape)
print(counts.head(30))
print(counts.sort_index())
counts = counts.sort_index()

0    1005
1    1005
2    1005
3    1005
4    1005
Name: C1, dtype: category
Categories (6, object): [1001, 1002, 1005, 1007, 1008, 1010]
(8056, 20)
1005    7459
1002     301
1010     281
1007      10
1001       3
1008       2
Name: C1, dtype: int64
1001       3
1002     301
1005    7459
1007      10
1008       2
1010     281
Name: C1, dtype: int64


In [134]:
counts = counts.fillna(0)

In [135]:
np.random.rand(len(counts))

array([0.72946933, 0.48975975, 0.85769324, 0.45669378, 0.33565994,
       0.45931782])

In [136]:
np.random.rand(len(counts)) / 1000

array([9.66973216e-05, 7.96347800e-04, 4.05204558e-04, 3.94163354e-04,
       2.88750418e-04, 4.89610021e-04])

In [137]:
counts = counts + np.random.rand(len(counts)) / 1000
print(counts)

1001       3.000593
1002     301.000132
1005    7459.000842
1007      10.000260
1008       2.000495
1010     281.000607
Name: C1, dtype: float64


In [138]:
for i in range(X_train_count.shape[1]):
    counts = X_train_count.iloc[:,i].value_counts()
    counts = counts.sort_index()
    counts = counts.fillna(0)
    counts += np.random.rand(len(counts)) / 1000
    X_train_count.iloc[:,i].cat.categories = counts
    X_test_count.iloc[:,i].cat.categories = counts

In [139]:
print(X_train_count.iloc[:,0].cat.categories)

Float64Index([3.0009647617467285,  301.0005224054728, 7459.0001783582575,
               10.00098079563379,  2.000195049512278, 281.00011922963637],
             dtype='float64', name='C1')


In [140]:
l.fit(X_train_count, y_train)
y_pred = l.predict_proba(X_test_count)
print("LR: ", log_loss(y_test, y_pred))
r.fit(X_train_count,y_train)
y_pred = r.predict_proba(X_test_count)
print("RF: ", 
      log_loss(y_test,y_pred))

LR:  0.4477828125597004
RF:  0.43341654731879725
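
Count encoding replaces each category by how often it appears in the training data. A minimal sketch with the standard library follows; the helper names and the choice of `0` for unseen categories are assumptions of this sketch. The notebook additionally adds a tiny random jitter, presumably because pandas requires the new category labels to be unique even when two categories have the same count.

```python
from collections import Counter

def fit_counts(values):
    """Count how often each category occurs in the training column."""
    return Counter(values)

def transform_counts(values, counts):
    """Replace each category by its training count; unseen categories get 0."""
    return [counts.get(v, 0) for v in values]

train_col = [1005] * 5 + [1002] * 2 + [1010]
counts = fit_counts(train_col)
encoded_test = transform_counts([1005, 1010, 1007], counts)
print(encoded_test)
```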


In [141]:
X_train_ctr = copy.copy(X_train)
X_test_ctr = copy.copy(X_test)
X_train_ctr["test"]=0
X_test_ctr["test"]=1
temp_df = pandas.concat([X_train_ctr,X_test_ctr],axis=0)
for i in range(temp_df.shape[1]):
    temp_df.iloc[:,i]=temp_df.iloc[:,i].astype('category')
X_train_ctr=temp_df[temp_df["test"]==0].iloc[:,:-1]
X_test_ctr=temp_df[temp_df["test"]==1].iloc[:,:-1]
temp_df = pandas.concat([X_train_ctr,y_train],axis=1)
names = list(X_train_ctr.columns.values)

In [142]:
print(names)
print(temp_df.shape)
print(temp_df.iloc[0:2,:])

['C1', 'banner_pos', 'site_id', 'site_domain', 'site_category', 'app_id', 'app_domain', 'app_category', 'device_id', 'device_model', 'device_type', 'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']
(8056, 21)
     C1 banner_pos   site_id site_domain site_category    app_id app_domain  \
0  1005          0  1fbe01fe    f3845767      28905ebd  ecad2386   7801e8d9   
1  1005          0  1fbe01fe    f3845767      28905ebd  ecad2386   7801e8d9   

  app_category device_id device_model  ...  device_conn_type    C14  C15 C16  \
0     07d7df22  a99f214a     44956a24  ...                 2  15706  320  50   
1     07d7df22  a99f214a     711ee120  ...                 0  15704  320  50   

    C17 C18 C19     C20 C21 click  
0  1722   0  35      -1  79     0  
1  1722   0  35  100084  79     0  

[2 rows x 21 columns]


In [143]:
means = temp_df.groupby('C1')['click'].mean()
print(means)

C1
1001    0.000000
1002    0.182724
1005    0.171203
1007    0.000000
1008    1.000000
1010    0.067616
Name: click, dtype: float64


In [144]:
sum(temp_df['click']) / len(temp_df['click'])

0.1679493545183714

In [145]:
means = means.fillna(sum(temp_df['click'])/len(temp_df['click']))
print(means)

C1
1001    0.000000
1002    0.182724
1005    0.171203
1007    0.000000
1008    1.000000
1010    0.067616
Name: click, dtype: float64


In [146]:
for i in names:
    means = temp_df.groupby(i)['click'].mean()
    means = means.fillna(sum(temp_df['click'])/len(temp_df['click']))
    means += np.random.rand(len(means))/1000
    X_train_ctr[i].cat.categories = means
    X_test_ctr[i].cat.categories = means
                  
print(X_train_ctr.head(3))

         C1 banner_pos   site_id site_domain site_category    app_id  \
0  0.171707   0.164319  0.202797    0.202926      0.202333  0.187886   
1  0.171707   0.164319  0.202797    0.202926      0.202333  0.187886   
2  0.171707   0.164319  0.202797    0.202926      0.202333  0.187886   

  app_domain app_category device_id device_model device_type device_conn_type  \
0   0.182057     0.187599   0.17734     0.000260    0.171459         0.149092   
1   0.182057     0.187599   0.17734     0.181202    0.171459         0.171467   
2   0.182057     0.187599   0.17734     0.128727    0.171459         0.171467   

        C14       C15       C16     C17       C18       C19       C20  \
0  0.132530  0.157847  0.158131  0.1955  0.152665  0.159945  0.172641   
1  0.173228  0.157847  0.158131  0.1955  0.152665  0.159945  0.227386   
2  0.173228  0.157847  0.158131  0.1955  0.152665  0.159945  0.227386   

        C21  
0  0.195305  
1  0.195305  
2  0.195305  


In [147]:
l.fit(X_train_ctr,y_train)
y_pred = l.predict_proba(X_test_ctr)
print("LR: ", log_loss(y_test,y_pred))

LR:  0.4700630472558881


In [148]:
r.fit(X_train_ctr,y_train)
y_pred = r.predict_proba(X_test_ctr)
print("RF: ", log_loss(y_test,y_pred))

RF:  0.45554757757739295
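
One likely reason the CTR encoding scores poorly is visible in the `means` output above: category 1008 of `C1` has only 2 training rows and gets a CTR of exactly 1.0. A common remedy, which the notebook does not apply, is to shrink each category mean toward the global mean. The sketch below shows additive smoothing; `prior_weight` is a hypothetical tuning parameter and the toy data is invented.

```python
from collections import defaultdict

def fit_smoothed_ctr(categories, clicks, prior_weight=20.0):
    """Per-category click rate, shrunk toward the global rate for small categories."""
    global_mean = sum(clicks) / len(clicks)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, clicks):
        sums[cat] += y
        counts[cat] += 1
    table = {
        cat: (sums[cat] + prior_weight * global_mean) / (counts[cat] + prior_weight)
        for cat in counts
    }
    return table, global_mean

def transform_ctr(categories, table, global_mean):
    """Look up the smoothed rate; unseen categories fall back to the global rate."""
    return [table.get(c, global_mean) for c in categories]

cats = ["1005"] * 8 + ["1008"] * 2
clicks = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
table, global_mean = fit_smoothed_ctr(cats, clicks, prior_weight=10.0)
print(table, transform_ctr(["1008", "1007"], table, global_mean))
```

In this toy example the raw rate for `"1008"` would be 1.0, while the smoothed value is (2 + 10 * 0.3) / (2 + 10) = 5/12. Computing the statistics out-of-fold is another frequently used safeguard against the target leaking into the training features.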


<h2>Method 5: Cat2Vec</h2>

In [149]:
from gensim.models.word2vec import Word2Vec
from random import shuffle
size=6
window=8
x_w2v = copy.deepcopy(train.iloc[:,features])
names = list(x_w2v.columns.values)
for i in names:
    x_w2v[i]=x_w2v[i].astype('category')
    x_w2v[i].cat.categories = ["Feature %s %s" % (i,g) for g in x_w2v[i].cat.categories]
x_w2v = x_w2v.values.tolist()
print(len(x_w2v))
print(x_w2v[0:1])


9999
[['Feature C1 1005', 'Feature banner_pos 0', 'Feature site_id 1fbe01fe', 'Feature site_domain f3845767', 'Feature site_category 28905ebd', 'Feature app_id ecad2386', 'Feature app_domain 7801e8d9', 'Feature app_category 07d7df22', 'Feature device_id a99f214a', 'Feature device_model 44956a24', 'Feature device_type 1', 'Feature device_conn_type 2', 'Feature C14 15706', 'Feature C15 320', 'Feature C16 50', 'Feature C17 1722', 'Feature C18 0', 'Feature C19 35', 'Feature C20 -1', 'Feature C21 79']]


In [150]:
for i in x_w2v:
    shuffle(i) # shuffle columns per row
print(x_w2v[0:1])

[['Feature site_domain f3845767', 'Feature banner_pos 0', 'Feature device_model 44956a24', 'Feature device_type 1', 'Feature C20 -1', 'Feature device_id a99f214a', 'Feature device_conn_type 2', 'Feature C14 15706', 'Feature app_domain 7801e8d9', 'Feature app_category 07d7df22', 'Feature site_category 28905ebd', 'Feature site_id 1fbe01fe', 'Feature C15 320', 'Feature C19 35', 'Feature C1 1005', 'Feature C21 79', 'Feature app_id ecad2386', 'Feature C18 0', 'Feature C16 50', 'Feature C17 1722']]


In [151]:
w2v = Word2Vec(x_w2v,size=size,window=window) # create object, size means embedding dimensions

In [152]:
X_train_w2v = copy.copy(X_train)
X_test_w2v = copy.copy(X_test)
for i in names:
    X_train_w2v[i]=X_train_w2v[i].astype('category')
    X_train_w2v[i].cat.categories = ["Feature %s %s" % (i,g) for g in X_train_w2v[i].cat.categories]
for i in names:
    X_test_w2v[i]=X_test_w2v[i].astype('category')
    X_test_w2v[i].cat.categories = ["Feature %s %s" % (i,g) for g in X_test_w2v[i].cat.categories]

In [153]:
print(X_train_w2v.head(1))

                C1            banner_pos                   site_id  \
0  Feature C1 1005  Feature banner_pos 0  Feature site_id 1fbe01fe   

                    site_domain                   site_category  \
0  Feature site_domain f3845767  Feature site_category 28905ebd   

                    app_id                   app_domain  \
0  Feature app_id ecad2386  Feature app_domain 7801e8d9   

                    app_category                   device_id  \
0  Feature app_category 07d7df22  Feature device_id a99f214a   

                    device_model ...             device_conn_type  \
0  Feature device_model 44956a24 ...   Feature device_conn_type 2   

                 C14              C15             C16               C17  \
0  Feature C14 15706  Feature C15 320  Feature C16 50  Feature C17 1722   

             C18             C19             C20             C21 test  
0  Feature C18 0  Feature C19 35  Feature C20 -1  Feature C21 79    0  

[1 rows x 21 columns]


In [154]:
X_train_w2v = X_train_w2v.values # One list having all columns per row
X_test_w2v = X_test_w2v.values
print(X_train_w2v[0,:])

['Feature C1 1005' 'Feature banner_pos 0' 'Feature site_id 1fbe01fe'
 'Feature site_domain f3845767' 'Feature site_category 28905ebd'
 'Feature app_id ecad2386' 'Feature app_domain 7801e8d9'
 'Feature app_category 07d7df22' 'Feature device_id a99f214a'
 'Feature device_model 44956a24' 'Feature device_type 1'
 'Feature device_conn_type 2' 'Feature C14 15706' 'Feature C15 320'
 'Feature C16 50' 'Feature C17 1722' 'Feature C18 0' 'Feature C19 35'
 'Feature C20 -1' 'Feature C21 79' 0]


In [155]:
print(len(X_train_w2v))
print(X_train_w2v.shape[1])
print(size)

8056
21
6


In [156]:
# initialize an array of shape 8056 x 126
# 6 embedding dimensions * 21 columns = 126
x_w2v_train = np.random.random((len(X_train_w2v),size*X_train_w2v.shape[1]))
print(x_w2v_train.shape) 
print(w2v.vocabulary)
#print(help(w2v))
print(type(X_train_w2v[0,0]))
if (w2v.wv.__contains__("Feature C1 1005")):
    print("xxx")
    print(w2v.wv.__getitem__("Feature C1 1005"))


(8056, 126)
<gensim.models.word2vec.Word2VecVocab object at 0x10dce0588>
<class 'str'>
xxx
[-0.84423256  1.2427202  -0.26425478 -1.1727661   1.1989584  -0.8525728 ]


In [157]:
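for j in range(X_train_w2v.shape[1]):
    for i in range(X_train_w2v.shape[0]):
        # look tokens up through w2v.wv; tokens missing from the vocabulary keep their random initialization
        if X_train_w2v[i,j] in w2v.wv:
            x_w2v_train[i,j*size:(j+1)*size] = w2v.wv[X_train_w2v[i,j]]
            #print(w2v.wv[X_train_w2v[i,j]])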



In [158]:
x_w2v_test = np.random.random((len(X_test_w2v),size*X_test_w2v.shape[1]))
for j in range(X_test_w2v.shape[1]):
    for i in range(X_test_w2v.shape[0]):
        # same lookup as for the training rows, through w2v.wv
        if X_test_w2v[i,j] in w2v.wv:
            x_w2v_test[i,j*size:(j+1)*size] = w2v.wv[X_test_w2v[i,j]]




In [159]:
l.fit(x_w2v_train,y_train)
y_pred = l.predict_proba(x_w2v_test)
print("LR: ", log_loss(y_test,y_pred))
r.fit(x_w2v_train,y_train)
y_pred = r.predict_proba(x_w2v_test)
print("RF: ", log_loss(y_test,y_pred))

LR:  0.4330431221683492
RF:  0.4320668190285643


<h2>Method 6: Category embedding with deep learning</h2>

In [160]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Reshape
from keras.layers import Merge
from keras.layers.embeddings import Embedding
from keras.callbacks import ModelCheckpoint
import h5py
X_train_dnn = copy.copy(X_train_ori).values
X_test_dnn = copy.copy(X_test_ori).values
print(X_train_dnn.shape, X_test_dnn.shape)
print(X_train_dnn[0,:])
print(X_train_ori.head(1))



(8056, 20) (1943, 20)
[1005 0 '1fbe01fe' 'f3845767' '28905ebd' 'ecad2386' '7801e8d9' '07d7df22'
 'a99f214a' '44956a24' 1 2 15706 320 50 1722 0 35 -1 79]
     C1  banner_pos   site_id site_domain site_category    app_id app_domain  \
0  1005           0  1fbe01fe    f3845767      28905ebd  ecad2386   7801e8d9   

  app_category device_id device_model  device_type  device_conn_type    C14  \
0     07d7df22  a99f214a     44956a24            1                 2  15706   

   C15  C16   C17  C18  C19  C20  C21  
0  320   50  1722    0   35   -1   79  


In [161]:
les = []
for i in range(X_train_dnn.shape[1]):
    le = preprocessing.LabelEncoder()
    le.fit(train.iloc[:,features].iloc[:, i])
    les.append(le)
    X_train_dnn[:, i] = le.transform(X_train_dnn[:, i])
    X_test_dnn[:, i] = le.transform(X_test_dnn[:, i])
print(X_train_dnn.shape, X_test_dnn.shape)
print(X_train_dnn[0:2,:])



(8056, 20) (1943, 20)
[[2 0 43 301 2 293 13 0 722 315 1 1 67 2 1 27 0 0 0 15]
 [2 0 43 301 2 293 13 0 722 488 1 0 65 2 1 27 0 0 42 15]]


In [162]:
X = X_train_dnn
print(X[...,[0]])
print(X[0:3,[0]])


[[2]
 [2]
 [2]
 ...
 [2]
 [2]
 [2]]
[[2]
 [2]
 [2]]


In [174]:
def split_features(X):
    X_list = []
    
    C1 = X[..., [0]]
    X_list.append(C1)
    banner_pos = X[..., [1]]
    X_list.append(banner_pos)
    site_id = X[..., [2]]
    X_list.append(site_id)
    
    site_domain = X[..., [3]]
    X_list.append(site_domain)
    site_category = X[..., [4]]
    X_list.append(site_category)
    app_id = X[..., [5]]
    X_list.append(app_id)
    app_domain = X[..., [6]]
    X_list.append(app_domain)
    app_category = X[..., [7]]
    X_list.append(app_category)
    
    device_id = X[..., [8]]
    X_list.append(device_id)
    device_model = X[..., [9]]
    X_list.append(device_model)
    device_type = X[..., [10]]
    X_list.append(device_type)
    
    device_conn_type = X[..., [11]]
    X_list.append(device_conn_type)
    C14 = X[..., [12]]
    X_list.append(C14)
    
    C15 = X[..., [13]]
    X_list.append(C15)
    C16 = X[..., [14]]
    X_list.append(C16)
    C17 = X[..., [15]]
    X_list.append(C17)
    C18 = X[..., [16]]
    X_list.append(C18)
    C19 = X[..., [17]]
    X_list.append(C19)
    
    C20 = X[..., [18]]
    X_list.append(C20)
    C21 = X[..., [19]]
    X_list.append(C21)
    
    return X_list

In [183]:


class NN_with_EntityEmbedding(object):
    def __init__(self, X_train, y_train, X_val, y_val):
        self.nb_epoch = 1
        self.__build_keras_model()
        self.fit(X_train, y_train, X_val, y_val)
    def preprocessing(self, X):
        X_list = split_features(X)
        print("X shape: ", np.array(X).shape)
        print("X_list shape: ", np.array(X_list).shape)        
        return X_list
    def __build_keras_model(self):
        models = []
        
        model_C1 = Sequential()
        model_C1.add(Embedding(len(les[0].classes_), 3, input_length=1)) # Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
        model_C1.add(Reshape(target_shape=(3,)))
        models.append(model_C1)
        print("Label Encoder Size: ", len(les[0].classes_))
        
        model_banner_pos = Sequential()
        model_banner_pos.add(Embedding(len(les[1].classes_), 3, input_length=1))
        model_banner_pos.add(Reshape(target_shape=(3,)))
        models.append(model_banner_pos)
        print("Label Encoder Size: ", len(les[1].classes_))
            
        model_site_id = Sequential()
        model_site_id.add(Embedding(len(les[2].classes_), 8, input_length=1))
        model_site_id.add(Reshape(target_shape=(8,)))
        models.append(model_site_id)
        print("Label Encoder Size: ", len(les[2].classes_))        
        
        site_domain = Sequential()
        site_domain.add(Embedding(len(les[3].classes_), 8, input_length=1))
        site_domain.add(Reshape(target_shape=(8,)))
        models.append(site_domain)
        site_category = Sequential()
        site_category.add(Embedding(len(les[4].classes_), 3, input_length=1))
        site_category.add(Reshape(target_shape=(3,)))
        models.append(site_category)
        app_id = Sequential()
        app_id.add(Embedding(len(les[5].classes_), 8, input_length=1))
        app_id.add(Reshape(target_shape=(8,)))
        models.append(app_id)
        app_domain = Sequential()
        app_domain.add(Embedding(len(les[6].classes_), 4, input_length=1))
        app_domain.add(Reshape(target_shape=(4,)))
        models.append(app_domain)
        
        app_category = Sequential()
        app_category.add(Embedding(len(les[7].classes_), 3, input_length=1))
        app_category.add(Reshape(target_shape=(3,)))
        models.append(app_category)
        
        device_id = Sequential()
        device_id.add(Embedding(len(les[8].classes_), 10, input_length=1))
        device_id.add(Reshape(target_shape=(10,)))
        models.append(device_id)
        
        device_model = Sequential()
        device_model.add(Embedding(len(les[9].classes_), 8, input_length=1))
        device_model.add(Reshape(target_shape=(8,)))
        models.append(device_model)
        
        device_type = Sequential()
        device_type.add(Embedding(len(les[10].classes_), 2, input_length=1))
        device_type.add(Reshape(target_shape=(2,)))
        models.append(device_type)
        
        device_conn_type = Sequential()
        device_conn_type.add(Embedding(len(les[11].classes_), 2, input_length=1))
        device_conn_type.add(Reshape(target_shape=(2,)))
        models.append(device_conn_type)
        
        C14 = Sequential()
        C14.add(Embedding(len(les[12].classes_), 8, input_length=1))
        C14.add(Reshape(target_shape=(8,)))
        models.append(C14)
        
        C15 = Sequential()
        C15.add(Embedding(len(les[13].classes_), 3, input_length=1))
        C15.add(Reshape(target_shape=(3,)))
        models.append(C15)
        
        C16 = Sequential()
        C16.add(Embedding(len(les[14].classes_), 3, input_length=1))
        C16.add(Reshape(target_shape=(3,)))
        models.append(C16)
        
        C17 = Sequential()
        C17.add(Embedding(len(les[15].classes_), 4, input_length=1))
        C17.add(Reshape(target_shape=(4,)))
        models.append(C17)
        
        C18 = Sequential()
        C18.add(Embedding(len(les[16].classes_), 2, input_length=1))
        C18.add(Reshape(target_shape=(2,)))
        models.append(C18)
        
        C19 = Sequential()
        C19.add(Embedding(len(les[17].classes_), 4, input_length=1))
        C19.add(Reshape(target_shape=(4,)))
        models.append(C19)
        
        C20 = Sequential()
        C20.add(Embedding(len(les[18].classes_), 5, input_length=1))
        C20.add(Reshape(target_shape=(5,)))
        models.append(C20)
        
        C21 = Sequential()
        C21.add(Embedding(len(les[19].classes_), 4, input_length=1))
        C21.add(Reshape(target_shape=(4,)))
        models.append(C21)
        
        self.model = Sequential()
        self.model.add(Merge(models, mode = 'concat'))
        self.model.add(Dense(150, kernel_initializer='uniform'))
        self.model.add(Activation('relu'))
        self.model.add(Dense(250, kernel_initializer='uniform'))
        self.model.add(Activation('relu'))
        self.model.add(Dense(1))
        self.model.add(Activation('sigmoid'))
        self.model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['acc'])
        
        
        
    def fit(self, X_train, y_train, X_val, y_val):
        X_pre_train = self.preprocessing(X_train)
        print("X_pre_train shape: ", np.array(X_pre_train).shape)
        X_pre_val = self.preprocessing(X_val)   
        print("X_pre_val shape: ", np.array(X_pre_val).shape)        
        print("y_train shape: ", np.array(y_train).shape)        
        print("y_val shape: ", np.array(y_val).shape)        
        
        
        self.model.fit(X_pre_train, y_train,
                      validation_data=(X_pre_val, y_val),
                      epochs=self.nb_epoch, batch_size=128,)
        


In [186]:
dnn = NN_with_EntityEmbedding(X_train_dnn, y_train, X_test_dnn, y_test)
weights = dnn.model.get_weights()
n=0

dnn.model.summary()



Label Encoder Size:  6
Label Encoder Size:  4
Label Encoder Size:  381




X shape:  (8056, 20)
X_list shape:  (20, 8056, 1)
X_pre_train shape:  (20, 8056, 1)
X shape:  (1943, 20)
X_list shape:  (20, 1943, 1)
X_pre_val shape:  (20, 1943, 1)
y_train shape:  (8056,)
y_val shape:  (1943,)
Train on 8056 samples, validate on 1943 samples
Epoch 1/1
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
merge_19 (Merge)             (None, 95)                0         
_________________________________________________________________
dense_55 (Dense)             (None, 150)               14400     
_________________________________________________________________
activation_55 (Activation)   (None, 150)               0         
_________________________________________________________________
dense_56 (Dense)             (None, 250)               37750     
_________________________________________________________________
activation_56 (Activation)   (None, 250)               0         
____

In [181]:
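# Total width of the concatenated embedding vector: the sum of each feature's embedding size.
# weights[2*j][0] is used as the embedding matrix of feature j, shape (n_categories, dim).
n = 0  # reset so that re-running this cell does not double-count
for i in range(0, 40, 2):
    n = n + (weights[i][0].shape)[1]

# Replace every label-encoded category in the training set with its learned embedding row.
# Every entry is overwritten below, so the random initial values never reach the classifiers.
x_dnn_train = np.random.random((len(X_train_dnn), n))
start_ind = 0
for j in range(X_train_dnn.shape[1]):
    mat = weights[j*2][0]
    dim = mat.shape[1]
    for i in range(X_train_dnn.shape[0]):
        x_dnn_train[i, start_ind:start_ind + dim] = mat[X_train_dnn[i, j]]
    start_ind += dim

# Same lookup for the test set.
x_dnn_test = np.random.random((len(X_test_dnn), n))
start_ind = 0
for j in range(X_test_dnn.shape[1]):
    mat = weights[j*2][0]
    dim = mat.shape[1]
    for i in range(x_dnn_test.shape[0]):
        x_dnn_test[i, start_ind:start_ind + dim] = mat[X_test_dnn[i, j]]
    start_ind += dim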
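
The row-by-row loops above can also be written with NumPy integer-array indexing, which looks up all rows of a column at once. A minimal sketch on toy data, assuming each embedding matrix has shape `(n_categories, dim)` and the feature matrix holds label-encoded integers. `embed_features`, `toy_mats`, and `toy_X` are hypothetical names, not part of the notebook.

```python
import numpy as np


def embed_features(X, mats):
    """For each row of X, concatenate the embedding rows selected by its category codes."""
    # mats[j][X[:, j]] picks one embedding row per sample for feature j.
    return np.hstack([mats[j][X[:, j]] for j in range(X.shape[1])])


rng = np.random.RandomState(0)
toy_mats = [rng.rand(4, 3), rng.rand(6, 2)]   # two features, embedding sizes 3 and 2
toy_X = np.array([[0, 5], [3, 1], [2, 2]])    # label-encoded categories

vectorized = embed_features(toy_X, toy_mats)

# Reference result, built with the same loop structure as the cell above.
looped = np.zeros((len(toy_X), 5))
start_ind = 0
for j in range(toy_X.shape[1]):
    mat = toy_mats[j]
    dim = mat.shape[1]
    for i in range(toy_X.shape[0]):
        looped[i, start_ind:start_ind + dim] = mat[toy_X[i, j]]
    start_ind += dim

print(vectorized.shape)                 # (3, 5)
print(np.allclose(vectorized, looped))  # True
```

Under the indexing used in the cell above, the notebook's training features would presumably be `embed_features(X_train_dnn, [weights[2*j][0] for j in range(20)])`, provided `X_train_dnn` is an integer NumPy array.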

In [182]:
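# Fit the two classifiers defined earlier (l and r) on the embedding features
# and report the log loss on the held-out split.
l.fit(x_dnn_train, y_train)
y_pred = l.predict_proba(x_dnn_test)
print(log_loss(y_test, y_pred))
r.fit(x_dnn_train, y_train)
y_pred = r.predict_proba(x_dnn_test)
print(log_loss(y_test, y_pred))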



0.4753951686013886
0.4717856780089146
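
Both scores come from passing the full `predict_proba` output to `log_loss`. For a binary target, scikit-learn accepts either the two-column probability matrix or only the positive-class column, and the two give the same value. A small sketch with made-up labels and probabilities (not the notebook's data), which also recomputes the score from its definition:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0])
# Columns are P(class 0) and P(class 1), the layout returned by predict_proba.
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.4, 0.6],
                  [0.7, 0.3]])

full = log_loss(y_true, proba)
positive_only = log_loss(y_true, proba[:, 1])

# Definition: mean negative log of the probability assigned to the true class.
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

print(np.isclose(full, positive_only), np.isclose(full, manual))
```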


In [None]:
    
def split_features(X):
    X_list = []
    C1 = X[..., [0]]
    X_list.append(C1)
    banner_pos = X[..., [1]]
    X_list.append(banner_pos)
    site_id = X[..., [2]]
    X_list.append(site_id)

    site_domain = X[..., [3]]
    X_list.append(site_domain)
    site_category = X[..., [4]]
    X_list.append(site_category)
    app_id = X[..., [5]]
    X_list.append(app_id)
    app_domain = X[..., [6]]
    X_list.append(app_domain)
    app_category = X[..., [7]]
    X_list.append(app_category)

    device_id = X[..., [8]]
    X_list.append(device_id)
    device_model = X[..., [9]]
    X_list.append(device_model)
    device_type = X[..., [10]]
    X_list.append(device_type)

    device_conn_type = X[..., [11]]
    X_list.append(device_conn_type)
    C14 = X[..., [12]]
    X_list.append(C14)

    C15 = X[..., [13]]
    X_list.append(C15)
    C16 = X[..., [14]]
    X_list.append(C16)
    C17 = X[..., [15]]
    X_list.append(C17)
    C18 = X[..., [16]]
    X_list.append(C18)
    C19 = X[..., [17]]
    X_list.append(C19)

    C20 = X[..., [18]]
    X_list.append(C20)
    C21 = X[..., [19]]
    X_list.append(C21)
    return X_list

class NN_with_EntityEmbedding(object):
    def __init__(self, X_train, y_train, X_val, y_val):
        self.nb_epoch = 10
        self.__build_keras_model()
        self.fit(X_train, y_train, X_val, y_val)
    def preprocessing(self, X):
        X_list = split_features(X)
        return X_list
    def __build_keras_model(self):
        models = []
        model_C1= Sequential()
        model_C1.add(Embedding(len(les[0].classes_), 3, input_length=1))
        model_C1.add(Reshape(target_shape=(3,)))
        models.append(model_C1)
        
        model_banner_pos = Sequential()
        model_banner_pos.add(Embedding(len(les[1].classes_), 3, input_length=1))
        model_banner_pos.add(Reshape(target_shape=(3,)))
        models.append(model_banner_pos)
        
        model_site_id = Sequential()
        model_site_id.add(Embedding(len(les[2].classes_), 8, input_length=1))
        model_site_id.add(Reshape(target_shape=(8,)))
        models.append(model_site_id)
        
        site_domain = Sequential()
        site_domain.add(Embedding(len(les[3].classes_), 8, input_length=1))
        site_domain.add(Reshape(target_shape=(8,)))
        models.append(site_domain)
        site_category = Sequential()
        site_category.add(Embedding(len(les[4].classes_), 3, input_length=1))
        site_category.add(Reshape(target_shape=(3,)))
        models.append(site_category)
        app_id = Sequential()
        app_id.add(Embedding(len(les[5].classes_), 8, input_length=1))
        app_id.add(Reshape(target_shape=(8,)))
        models.append(app_id)
        app_domain = Sequential()
        app_domain.add(Embedding(len(les[6].classes_), 4, input_length=1))
        app_domain.add(Reshape(target_shape=(4,)))
        models.append(app_domain)
        
        app_category = Sequential()
        app_category.add(Embedding(len(les[7].classes_), 3, input_length=1))
        app_category.add(Reshape(target_shape=(3,)))
        models.append(app_category)
        
        device_id = Sequential()
        device_id.add(Embedding(len(les[8].classes_), 10, input_length=1))
        device_id.add(Reshape(target_shape=(10,)))
        models.append(device_id)
        
        device_model = Sequential()
        device_model.add(Embedding(len(les[9].classes_), 8, input_length=1))
        device_model.add(Reshape(target_shape=(8,)))
        models.append(device_model)
        
        device_type = Sequential()
        device_type.add(Embedding(len(les[10].classes_), 2, input_length=1))
        device_type.add(Reshape(target_shape=(2,)))
        models.append(device_type)
        
        device_conn_type = Sequential()
        device_conn_type.add(Embedding(len(les[11].classes_), 2, input_length=1))
        device_conn_type.add(Reshape(target_shape=(2,)))
        models.append(device_conn_type)
        C14 = Sequential()
        C14.add(Embedding(len(les[12].classes_), 8, input_length=1))
        C14.add(Reshape(target_shape=(8,)))
        models.append(C14)
        
        C15 = Sequential()
        C15.add(Embedding(len(les[13].classes_), 3, input_length=1))
        C15.add(Reshape(target_shape=(3,)))
        models.append(C15)
        
        C16 = Sequential()
        C16.add(Embedding(len(les[14].classes_), 3, input_length=1))
        C16.add(Reshape(target_shape=(3,)))
        models.append(C16)
        
        C17 = Sequential()
        C17.add(Embedding(len(les[15].classes_), 4, input_length=1))
        C17.add(Reshape(target_shape=(4,)))
        models.append(C17)
        
        C18 = Sequential()
        C18.add(Embedding(len(les[16].classes_), 2, input_length=1))
        C18.add(Reshape(target_shape=(2,)))
        models.append(C18)
        
        C19 = Sequential()
        C19.add(Embedding(len(les[17].classes_), 4, input_length=1))
        C19.add(Reshape(target_shape=(4,)))
        models.append(C19)
        
        C20 = Sequential()
        C20.add(Embedding(len(les[18].classes_), 5, input_length=1))
        C20.add(Reshape(target_shape=(5,)))
        models.append(C20)
        
        C21 = Sequential()
        C21.add(Embedding(len(les[19].classes_), 4, input_length=1))
        C21.add(Reshape(target_shape=(4,)))
        models.append(C21)
        self.model = Sequential()
        self.model.add(Merge(models, mode='concat'))
        self.model.add(Dense(150, kernel_initializer='uniform'))
        self.model.add(Activation('relu'))
        self.model.add(Dense(250, kernel_initializer='uniform'))
        self.model.add(Activation('relu'))
        self.model.add(Dense(1))
        self.model.add(Activation('sigmoid'))
        self.model.compile(loss='binary_crossentropy',
                           optimizer='adam',
                           metrics=['acc'])
        
    def fit(self, X_train, y_train, X_val, y_val):
        self.model.fit(self.preprocessing(X_train), y_train,
                       validation_data=(self.preprocessing(X_val), y_val),
                       epochs=self.nb_epoch, batch_size=128,
                       )

dnn = NN_with_EntityEmbedding(X_train_dnn, y_train, X_test_dnn, y_test)
weights = dnn.model.get_weights()
n = 0
for i in range(0,40,2):
    n+=(weights[i][0].shape)[1]
    
x_dnn_train = np.random.random((len(X_train_dnn),n))
start_ind=0
for j in range(X_train_dnn.shape[1]):
    mat = weights[j*2][0]
    dim = mat.shape[1]
    for i in range(X_train_dnn.shape[0]):
        x_dnn_train[i,start_ind:start_ind+dim]=mat[X_train_dnn[i,j]]
    start_ind += dim
x_dnn_test = np.random.random((len(X_test_dnn),n))
start_ind=0
for j in range(X_test_dnn.shape[1]):
    mat = weights[j*2][0]
    dim = mat.shape[1]
    for i in range(x_dnn_test.shape[0]):
        x_dnn_test[i,start_ind:start_ind+dim]=mat[X_test_dnn[i,j]]
    start_ind += dim
l.fit(x_dnn_train,y_train)
y_pred = l.predict_proba(x_dnn_test)
print(log_loss(y_test,y_pred))
r.fit(x_dnn_train,y_train)
y_pred = r.predict_proba(x_dnn_test)
print(log_loss(y_test,y_pred))