# Deep User Modeling for Inferred Demographics

This notebook provides the details for the blog post: <b> Deep User Modeling for Inferred Demographics </b>

tl;dr: we build a user representation by embedding multiple sequences of categorical information, and use it to predict several demographic attributes at once.


In [1]:
import keras
from keras.layers import Input, Dense, concatenate
from keras.models import Model
from keras.optimizers import Adam

import tensorflow as tf

from sklearn.preprocessing import LabelEncoder

import numpy as np
import pandas as pd 

Using TensorFlow backend.


Real-world user data is sensitive, so we will generate some fake data of our own! The type of data we are interested in is variable-length sequences of categorical information.

In [2]:
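def make_sequence(num_users, num_categories, max_len):
    # One random-length sequence of category ids per user.
    # Lengths are drawn from [1, max_len) so that no user ends up with an empty sequence.
    category_sequences = [
        np.random.randint(0, num_categories, np.random.randint(1, max_len))
        for _ in range(num_users)
    ]
    return category_sequences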

Here is an example of what this looks like:

In [3]:
num_users = 1000
max_len = 50
num_categories = 100
category_sequences = make_sequence(num_users, num_categories, max_len)
print(category_sequences[0])

[39 54 31 51 48 80  8 46 75 58 20 15 70 28 13 26 63 84 82 44  4 79 97 52 34
 25 79 68 51 21 40 62 63 76 59 51 62 17 64 31 57 26  4 39]


Keep in mind that these numbers represent categories (of a purchase, for instance). So in plain text this would read:
[coffee, supermarket, coffee, coffee, ...]

But we want to transform this into a probability vector containing the fraction of each category in the overall sequence.

In [None]:
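def to_fractions(sequences, num_categories=None):
    def row_to_fractions(row):
        # Share of each category id within one sequence
        return pd.Series(row).value_counts() / len(row)

    fractions = pd.DataFrame([row_to_fractions(seq) for seq in sequences]).fillna(0)
    # Order columns by category id; if num_categories is given, also add
    # all-zero columns for categories that never occur in any sequence
    columns = range(num_categories) if num_categories is not None else sorted(fractions.columns)
    return fractions.reindex(columns=columns, fill_value=0).values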

In [5]:
frac = to_fractions(category_sequences)
frac

array([[ 0.        ,  0.        ,  0.        , ...,  0.02272727,
         0.        ,  0.        ],
       [ 0.02564103,  0.        ,  0.        , ...,  0.        ,
         0.02564103,  0.02564103],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [ 0.125     ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.03448276,  0.03448276,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
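
To make the transformation concrete, here is a small sketch on a hand-written sequence. It uses `np.bincount` instead of the pandas-based `to_fractions` above, and the helper name `fractions_with_bincount` is made up for this illustration, so treat it as one possible way to compute the same vector rather than the notebook's method.

```python
import numpy as np

def fractions_with_bincount(sequence, num_categories):
    """Share of each category id in one sequence (hypothetical helper)."""
    sequence = np.asarray(sequence, dtype=int)
    if len(sequence) == 0:
        # An empty history carries no information: return all zeros
        return np.zeros(num_categories)
    # Count occurrences of every id in [0, num_categories), then normalise
    counts = np.bincount(sequence, minlength=num_categories)
    return counts / len(sequence)

# Category 0 appears 3 times out of 4, category 2 once, category 1 never,
# so the fractions are 0.75, 0.0 and 0.25
example = fractions_with_bincount([0, 2, 0, 0], num_categories=3)
print(example)
```

Passing `num_categories` explicitly keeps the vector width fixed even when some category never shows up in the data.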

Our model uses the matrix-multiplication approach to embedding (explained in detail in the blog post). For comparison, the traditional lookup embedding looks like this, first in Keras and then in TensorFlow:

In [None]:
embedding_dim = 100  # illustrative value; any positive integer works
keras.layers.Embedding(num_categories, embedding_dim)

In [None]:
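# TensorFlow 1.x API. `category_sequence_goes_here` is a placeholder for an integer tensor of category ids.
embedding_matrix = tf.get_variable("embeddings", [num_categories, embedding_dim])
embeddings = tf.nn.embedding_lookup(embedding_matrix, category_sequence_goes_here)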
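
The link between the two approaches can be sketched in plain NumPy. Under my reading of "matrix-multiplication approach" (the blog post has the author's own explanation), multiplying the fraction vector by an embedding matrix gives the same result as looking up every item in the sequence and averaging the rows. The sizes and the random matrix below are illustrative assumptions, and the sketch leaves out the bias and ReLU that a `Dense` layer adds on top.

```python
import numpy as np

rng = np.random.default_rng(0)
num_categories, embedding_dim = 5, 3  # small illustrative sizes
embedding_matrix = rng.normal(size=(num_categories, embedding_dim))

sequence = np.array([1, 3, 3, 4])

# Lookup approach: fetch one row per item, then average over the sequence
mean_of_lookups = embedding_matrix[sequence].mean(axis=0)

# Matrix-multiplication approach: fraction vector times the same matrix
fractions = np.bincount(sequence, minlength=num_categories) / len(sequence)
matmul_embedding = fractions @ embedding_matrix

# Both routes give the same vector
print(np.allclose(mean_of_lookups, matmul_embedding))
```

One consequence is that the order of items in the sequence is discarded: only how often each category occurs matters.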

## The model 

In [7]:
def deep_user_multiple_sequences(input_sizes, output_sizes, embedding_sizes, depth=(100, 100)):

    # The inputs are not the raw sequences: each one is the distribution
    # (category fractions) over the items of a sequence
    inputs = [Input(shape=(s,)) for s in input_sizes]

    # Each input is then embedded into its own space
    # (the ReLU activation is optional here)
    embeddings = [Dense(emb_size, activation='relu')(inp)
                  for emb_size, inp in zip(embedding_sizes, inputs)]

    # Concat everything
    everything = concatenate(embeddings)

    # Add in additional layers
    for layer_size in depth:
        everything = Dense(layer_size, activation='relu')(everything)

    # Go to output
    outputs = [Dense(out_size, activation='softmax')(everything) 
               for out_size in output_sizes]

    # Build, print, and return model
    model = Model(inputs=inputs, outputs=outputs)
    model.summary()
    model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
    return model

In [8]:
my_model = deep_user_multiple_sequences(input_sizes=(100, 50), 
                                        output_sizes=(4, 3), 
                                        embedding_sizes=(100, 100))

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 50)           0                                            
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 100)          10100       input_1[0][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 100)          5100        input_2[0][0]                    
__________________________________________________________________________________________________
concatenat
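
The parameter counts in the summary can be checked by hand: a fully connected layer with `n` inputs and `m` units has `n * m` weights plus `m` biases. A minimal sketch (the helper name `dense_param_count` is made up for this check):

```python
def dense_param_count(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# The two embedding layers shown in the summary
print(dense_param_count(100, 100))  # 10100, matches dense_1
print(dense_param_count(50, 100))   # 5100, matches dense_2

# Derived from the code rather than from the summary: the first hidden layer
# receives the 200-wide concatenation of both embeddings
print(dense_param_count(200, 100))  # 20100
```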

### Training 

We generate random user demographics and a couple of random sequences to train the model.

In [9]:
y_marital = np.random.choice(["single", "married", "divorced", "widowed"], num_users)
y_children = np.random.choice(["1", "2", "2+"], num_users)

In [10]:
y_marital = np.eye(4)[LabelEncoder().fit_transform(y_marital)]
y_children = np.eye(3)[LabelEncoder().fit_transform(y_children)]
print(y_marital)
print(y_children)

[[ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]
 ..., 
 [ 1.  0.  0.  0.]
 [ 1.  0.  0.  0.]
 [ 1.  0.  0.  0.]]
[[ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 ..., 
 [ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]]
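
To read these one-hot rows, it helps to know which column belongs to which label. `LabelEncoder` assigns integer codes in sorted order of the class names, so the marital columns are divorced, married, single, widowed. The sketch below reproduces the same encoding with `np.unique` on a short hand-written label list (an assumption made to keep the example NumPy-only) and then decodes it back.

```python
import numpy as np

labels = np.array(["single", "married", "divorced", "married"])

# np.unique returns the sorted classes and, for each label, its index in that list
classes, codes = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for code i
one_hot = np.eye(len(classes))[codes]

# argmax recovers the integer code, and indexing classes recovers the label
decoded = classes[one_hot.argmax(axis=1)]
print(classes)
print(decoded)
```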


In [11]:
seq1 = to_fractions(make_sequence(num_users, num_categories=100, max_len=500))
seq2 = to_fractions(make_sequence(num_users, num_categories=50, max_len=500))


In [12]:
my_model.fit([seq1, seq2], [y_marital, y_children], epochs=200)

Epoch 1/200
Epoch 2/200
...
Epoch 77/200
Epoch 78

<keras.callbacks.History at 0x11c54a290>
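
Once trained, a multi-output Keras model returns one probability array per output from `predict`. The sketch below shows how such arrays could be turned back into labels. The helper `decode_predictions` and the hand-written probabilities are assumptions standing in for `my_model.predict([seq1, seq2])`. Since the labels in this notebook are random, the point is the decoding mechanics, not the predictions themselves.

```python
import numpy as np

def decode_predictions(probabilities, class_names):
    """Map each output head's softmax rows to the most likely label (hypothetical helper)."""
    return [np.asarray(names)[probs.argmax(axis=1)]
            for probs, names in zip(probabilities, class_names)]

# Stand-in for the model output: two users, two output heads
fake_probabilities = [
    np.array([[0.1, 0.6, 0.2, 0.1], [0.7, 0.1, 0.1, 0.1]]),  # marital status
    np.array([[0.2, 0.3, 0.5], [0.8, 0.1, 0.1]]),            # children
]
# Class names in sorted order, which is how LabelEncoder numbers them
class_names = [
    ["divorced", "married", "single", "widowed"],
    ["1", "2", "2+"],
]

marital, children = decode_predictions(fake_probabilities, class_names)
print(marital, children)
```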