# Wide and Deep and Lasso'd
## Feature Selection using LASSO regression for Wide and Deep model

The Wide & Deep model combines the best features of linear models (memorization) and deep models (better generalization). 

The wide part of the W&D model relies on crossed features. For example, crossing "has iphone" AND "is female" yields the single feature "is female and has iphone". The standard W&D setup relies on human intuition to pick which features to cross (e.g., gender and phone device). This seemed arbitrary, so we tried to improve on it by selecting features before crossing them. For example, if "is female" is more predictive than "has iphone", we keep only "is female" to cross with other predictive features. Our hypothesis was that selecting the most predictive features would reduce the chance of the model fitting patterns that are just noise.
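To make the idea of a crossed feature concrete, here is a minimal sketch in plain Python. The toy rows and the helper name `cross_binary` are hypothetical and are not part of the notebook; the real model delegates crossing to `tf.feature_column.crossed_column`.

```python
# Hypothetical toy data: one row per user, two binary features.
users = [
    {"is_female": 1, "has_iphone": 1},
    {"is_female": 1, "has_iphone": 0},
    {"is_female": 0, "has_iphone": 1},
]

def cross_binary(row, left, right):
    """Return 1 only when both binary features are 1 (logical AND)."""
    return int(row[left] == 1 and row[right] == 1)

crossed = [cross_binary(u, "is_female", "has_iphone") for u in users]
print(crossed)  # [1, 0, 0]
```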

We considered several techniques for feature selection (PCA, SVD, Random Forest) but settled on LASSO regression (least absolute shrinkage and selection operator). LASSO is fit on all features but shrinks the regression coefficients of some of them to exactly 0. Features with a coefficient of 0 are dropped and all others are kept. The regularization strength alpha controls how aggressively coefficients are shrunk: the larger the alpha, the shorter the resulting list of retained features.
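The shrink-to-zero behaviour can be illustrated without fitting a full model. In the textbook special case of orthonormal features, each LASSO coefficient equals the least-squares coefficient passed through a soft-thresholding function. The sketch below assumes that special case; the coefficient values are made up for illustration, and `alpha` stands in for the penalty strength up to the scaling convention of the objective.

```python
def soft_threshold(coef, alpha):
    """Shrink coef toward zero by alpha; anything within alpha of zero becomes exactly 0."""
    if coef > alpha:
        return coef - alpha
    if coef < -alpha:
        return coef + alpha
    return 0.0

# Made-up least-squares coefficients for four hypothetical features.
ols_coefs = {"f1": 0.9, "f2": -0.4, "f3": 0.05, "f4": -0.02}

for alpha in (0.01, 0.1, 0.5):
    shrunk = {name: soft_threshold(c, alpha) for name, c in ols_coefs.items()}
    kept = [name for name, c in shrunk.items() if c != 0.0]
    print(alpha, kept)
```

With these numbers, all four features survive at alpha = 0.01, two survive at 0.1, and only `f1` survives at 0.5.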

We compared the performance of Wide & Deep & LASSO with the basic Wide & Deep model on MovieLens data and found that the basic Wide & Deep model outperformed our LASSO version by about 8 percentage points (34.7% versus 26.2% accuracy). Our modifications did not improve the model. See the notes below for more.

In [1]:
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import requests, zipfile, io
from sklearn.decomposition import PCA
#%matplotlib inline

In [2]:
str_cols = ["genre" ,"gender","age_desc","occ_desc"]
int_cols = ["age", "occupation"]
target_col = ["rating"]


**Bug Fix:** Column names used in TensorFlow must be short and follow a specific format, so we could not use the long names generated by the pandas one-hot encoding function. The cell below generates a list of unique 4-letter ids to replace those column names.


In [3]:
# create a list of unique ids to be used as column names later
import itertools
alphabet1 = list(map(chr, range(97, 123)))
iter_alphabet = list(itertools.combinations(alphabet1, 4))
rand_ids = np.array([])
for x in iter_alphabet:
    temp=''
    for y in x:
        temp = temp+y
    rand_ids = np.append(rand_ids,temp)
print(rand_ids.shape)

(14950,)
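The same identifiers can be produced more compactly with `str.join`, which also avoids the repeated array copies made by `np.append` inside a loop. This is a sketch of an equivalent approach, not the cell that was run; `make_ids` is a hypothetical helper name. Note that the ids are sequential letter combinations rather than random strings.

```python
import itertools
import string

def make_ids(length=4):
    """Return every sorted combination of `length` distinct lowercase letters."""
    return ["".join(combo)
            for combo in itertools.combinations(string.ascii_lowercase, length)]

ids = make_ids()
print(len(ids))   # 14950, i.e. C(26, 4)
print(ids[:3])    # ['abcd', 'abce', 'abcf']
```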


## Data import and preparation
The cell below downloads all the needed data files to your current working directory. Then it proceeds to prepare the data by doing the following:
- Merge all data files to a master file
- Re-index movie ID and user ID
- Generate one hot encoding for selected columns
- Re-label one hot encoded columns
- Split the data into a training set and a validation set
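Two of these steps, re-indexing IDs and the shuffled 80/20 split, can be sketched in plain Python. The toy IDs, the fixed seed, and the helper names are assumptions for illustration; the notebook itself performs these steps with pandas and does not set a seed.

```python
import random

def reindex(ids):
    """Map arbitrary IDs to contiguous 0-based indices, in sorted order."""
    return {old: new for new, old in enumerate(sorted(set(ids)))}

def shuffle_split(rows, ratio=0.8, seed=0):
    """Shuffle a copy of rows, then cut it into (train, valid) at `ratio`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_split = int(len(shuffled) * ratio)
    return shuffled[:n_split], shuffled[n_split:]

movie_ids = [10, 3952, 10, 47, 3952]   # hypothetical raw movie IDs with gaps
print(reindex(movie_ids))              # {10: 0, 47: 1, 3952: 2}

train, valid = shuffle_split(range(10))
print(len(train), len(valid))          # 8 2
```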



In [4]:
#Download and Import data
r = requests.get('http://files.grouplens.org/datasets/movielens/ml-1m.zip')
z = zipfile.ZipFile(io.BytesIO(r.content))
for i in z.filelist:
    z.extract(i.filename)
    if 'movies' in i.filename:
        movie_data = pd.read_csv(
            i.filename,
            sep='::', 
            engine='python', 
            encoding='latin-1',
            names=['movieid', 'title', 'genre'])
        print('importing: ' ,i.filename)
    if 'rating' in i.filename:
        rating_data = pd.read_csv(
            i.filename,
            sep="::",
            engine="python",
            encoding="latin-1",
            names=['userid', 'movieid', 'rating', 'timestamp'])
        print('importing: ' ,i.filename)
    if 'users' in i.filename:
        user_data = pd.read_csv(
            i.filename,
            sep='::', 
            engine='python', 
            encoding='latin-1',
            names=['userid', 'gender', 'age', 'occupation', 'zipcode'])
        print('importing: ' ,i.filename)

#categories found on readme file
age_desc = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44", 45: "45-49", 50: "50-55", 56: "56+"}
occupation_desc = {
    0: "other or not specified", 1: "academic/educator", 2: "artist", 3: "clerical/admin",
    4: "college/grad student", 5: "customer service", 6: "doctor/health care",
    7: "executive/managerial", 8: "farmer", 9: "homemaker", 10: "K-12 student", 11: "lawyer",
    12: "programmer", 13: "retired", 14: "sales/marketing", 15: "scientist", 16: "self-employed",
    17: "technician/engineer", 18: "tradesman/craftsman", 19: "unemployed", 20: "writer"
}
user_data['age_desc'] = user_data['age'].apply(lambda x: age_desc[x])
user_data['occ_desc'] = user_data['occupation'].apply(lambda x: occupation_desc[x])


#combine all dataset
temp = pd.merge(rating_data, movie_data, how="left", on="movieid")
dataset = pd.merge(temp, user_data, how="left", on="userid")
'''dataset.head()'''

# reindex the movieid column and the userid column

adj_col = dataset['movieid']

adj_col_uni = adj_col.sort_values().unique()

adj_df = pd.DataFrame(adj_col_uni).reset_index().rename(columns = {0:'movieid','index':'adj_movieid'})

dataset = pd.merge(adj_df,dataset,how="right", on="movieid")
dataset['adj_userid'] = dataset['userid'] - 1


# shuffle, one-hot encode, then split data into trainset and validset
ratio = .8
dataset = dataset.sample(frac=1, replace=False)
print('creating onehot encoding')
dataset_dummies = pd.get_dummies(dataset[str_cols])

# change names of columns to numbers
old_names = dataset_dummies.columns.values  # .get_values() was removed in later pandas releases
new_names = rand_ids[:len(old_names)]
dataset_dummies.columns = new_names

dataset = pd.concat([dataset,dataset_dummies],axis=1)
'''
# replace the "|" in the column names with "_"
new_names = dataset.columns.values
subs = lambda t: t.replace('|','_')
vfunc = np.vectorize(subs)
new_names = vfunc(new_names)
dataset.columns = new_names
'''

n_split = int(len(dataset)*ratio)
trainset = dataset[:n_split]
validset = dataset[n_split:]

start_col_i = dataset.shape[1]-dataset_dummies.shape[1]
trainset_dummies = trainset.iloc[:,start_col_i:]

print('trainset: ', trainset.shape)
print('trainset_dummies: ', trainset_dummies.shape)
print('validset: ', validset.shape)


importing:  ml-1m/movies.dat
importing:  ml-1m/ratings.dat
importing:  ml-1m/users.dat
creating onehot encoding
trainset:  (800167, 345)
trainset_dummies:  (800167, 331)
validset:  (200042, 345)


In [5]:
print("Below is the mapping from NEW names to OLD names ")
pd.DataFrame(old_names, new_names).rename(columns={0:'OLD Names'})

Below is the mapping from NEW names to OLD names 


Unnamed: 0,OLD Names
abcd,genre_Action
abce,genre_Action|Adventure
abcf,genre_Action|Adventure|Animation
abcg,genre_Action|Adventure|Animation|Children's|Fa...
abch,genre_Action|Adventure|Animation|Horror|Sci-Fi
abci,genre_Action|Adventure|Children's
abcj,genre_Action|Adventure|Children's|Comedy
abck,genre_Action|Adventure|Children's|Fantasy
abcl,genre_Action|Adventure|Children's|Sci-Fi
abcm,genre_Action|Adventure|Comedy


In [6]:
trainset.head()

Unnamed: 0,adj_movieid,movieid,userid,rating,timestamp,title,genre,gender,age,occupation,...,acfi,acfj,acfk,acfl,acfm,acfn,acfo,acfp,acfq,acfr
674106,2278,2471,3401,2,967433549,Crocodile Dundee II (1988),Adventure|Comedy,M,35,7,...,0,0,0,0,0,0,0,0,0,0
909408,3238,3471,5065,5,962476123,Close Encounters of the Third Kind (1977),Drama|Sci-Fi,M,25,14,...,0,0,0,1,0,0,0,0,0,0
438849,1432,1556,3067,2,969997472,Speed 2: Cruise Control (1997),Action|Romance|Thriller,F,25,0,...,1,0,0,0,0,0,0,0,0,0
886787,3134,3363,558,2,976050489,American Graffiti (1973),Comedy|Drama,M,35,20,...,0,0,0,0,0,0,0,0,0,1
308038,1113,1203,3588,5,966662808,12 Angry Men (1957),Drama,M,25,2,...,0,0,0,0,0,0,0,0,0,0


## Wide and Deep baseline model

First, let's run the baseline Wide and Deep model. TensorFlow already ships a Wide and Deep implementation (`tf.contrib.learn.DNNLinearCombinedClassifier`), so we used it as a starting point to understand how TensorFlow works before making our own extension to the model.

We changed some of the input functions to make it work for the MovieLens dataset. The baseline reuses `deep_columns`, which is defined in cell In [15] further down.



In [35]:
def make_wnd_inputs(dataframe):
    feature_inputs = {
        col_name: tf.SparseTensor(
            indices = [[i, 0] for i in range(len(dataframe[col_name]))],
            values = dataframe[col_name].values,
            dense_shape = [len(dataframe[col_name]), 1]
        )
        for col_name in str_cols + int_cols
    }
    label_input = tf.constant(dataframe[target_col].values-1)
    return (feature_inputs, label_input)
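
The `tf.SparseTensor` built for each column is a column vector written in coordinate form: row `i` holds the `i`-th value at position `[i, 0]`. The sketch below reproduces the three arguments in plain Python for a hypothetical three-row column, without importing TensorFlow; `sparse_column_parts` is an illustrative helper, not a TensorFlow function.

```python
def sparse_column_parts(values):
    """Return the (indices, values, dense_shape) triple for an n x 1 sparse column."""
    n = len(values)
    indices = [[i, 0] for i in range(n)]   # one entry per row, always in column 0
    dense_shape = [n, 1]
    return indices, list(values), dense_shape

indices, vals, shape = sparse_column_parts(["M", "F", "M"])
print(indices)  # [[0, 0], [1, 0], [2, 0]]
print(shape)    # [3, 1]
```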

In [36]:

crossed_columns = [
  tf.feature_column.crossed_column(
      ["genre", "occupation"], hash_bucket_size=1000),
  tf.feature_column.crossed_column(
      ["gender", "genre"], hash_bucket_size=1000),
]
wide_columns = crossed_columns
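
`hash_bucket_size=1000` means each crossed pair of values is hashed into one of 1,000 buckets rather than given its own column, so distinct pairs can collide. A rough sketch of the idea follows. It uses `hashlib.md5` purely for determinism, which is an assumption of this example and not the hash function TensorFlow uses; `crossed_bucket` is a hypothetical helper.

```python
import hashlib

def crossed_bucket(values, hash_bucket_size=1000):
    """Hash a list of feature values into a bucket id in [0, hash_bucket_size)."""
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % hash_bucket_size

bucket = crossed_bucket(["Comedy|Drama", "M"])
print(0 <= bucket < 1000)  # True
```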


In [37]:
#Base Wide & Deep
base_model = tf.contrib.learn.DNNLinearCombinedClassifier(
    n_classes=5,
    linear_feature_columns = wide_columns,
    dnn_feature_columns = deep_columns,
    dnn_hidden_units = [32, 16],
    fix_global_step_increment_bug=True,
    config = tf.contrib.learn.RunConfig(
        keep_checkpoint_max = 1,
        save_summary_steps = 10
    )
)


INFO:tensorflow:Using config: {'_is_chief': True, '_environment': 'local', '_model_dir': '/var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpn58ujj8h', '_master': '', '_tf_random_seed': None, '_save_checkpoints_steps': None, '_evaluation_master': '', '_num_worker_replicas': 0, '_keep_checkpoint_every_n_hours': 10000, '_num_ps_replicas': 0, '_keep_checkpoint_max': 1, '_task_type': None, '_task_id': 0, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_log_step_count_steps': 100, '_save_summary_steps': 10, '_save_checkpoints_secs': 600, '_session_config': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1ac5f84e48>}


In [38]:
base_model.fit(input_fn = lambda: make_wnd_inputs(trainset), steps=10)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpn58ujj8h/model.ckpt.
INFO:tensorflow:step = 1, loss = 1.59251
INFO:tensorflow:Saving checkpoints for 10 into /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpn58ujj8h/model.ckpt.
INFO:tensorflow:Loss for final step: 1.47004.


DNNLinearCombinedClassifier(params={'embedding_lr_multipliers': None, 'dnn_optimizer': None, 'linear_optimizer': None, 'input_layer_partitioner': None, 'dnn_activation_fn': <function relu at 0x10d7e5510>, 'head': <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead object at 0x1ac5f84c50>, 'dnn_hidden_units': [32, 16], 'gradient_clip_norm': None, 'dnn_feature_columns': (_EmbeddingColumn(categorical_column=_IdentityCategoricalColumn(key='occupation', num_buckets=1000, default_value=0), dimension=8, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x109f39080>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True), _EmbeddingColumn(categorical_column=_IdentityCategoricalColumn(key='age', num_buckets=1000, default_value=0), dimension=8, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x109f39908>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainab

In [40]:
results = base_model.evaluate(input_fn = lambda: make_wnd_inputs(validset), steps=1)

INFO:tensorflow:Starting evaluation at 2017-12-17-15:56:54
INFO:tensorflow:Restoring parameters from /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpn58ujj8h/model.ckpt-10
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2017-12-17-15:57:03
INFO:tensorflow:Saving dict for global step 10: accuracy = 0.346742, global_step = 10, loss = 1.46834


In [41]:
predictions = base_model.predict_classes(input_fn = lambda: make_wnd_inputs(validset))
probabilities = base_model.predict_proba(input_fn = lambda: make_wnd_inputs(validset))

INFO:tensorflow:Restoring parameters from /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpn58ujj8h/model.ckpt-10
INFO:tensorflow:Restoring parameters from /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpn58ujj8h/model.ckpt-10


In [42]:
for n, r in results.items():
    print("%s: %a"%(n, r))

accuracy: 0.34674218
global_step: 10
loss: 1.468339


In [43]:
predict = list(predictions)


In [44]:
prob = list(probabilities)


## Wide and Deep base model accuracy:

In [45]:
dnw_accuracy = np.sum(np.asarray(predict)+1 == validset.rating.values) / len(validset)
print("DNW Accuracy: %f%%"%(dnw_accuracy*100,))

DNW Accuracy: 34.674218%


In [46]:
results = validset[["gender","age_desc","occ_desc", "title", "genre", "rating"]].copy()
results["prediction"] = np.asarray(predict)+1
results["rating1"] = np.vstack(prob)[:,0]
results["rating2"] = np.vstack(prob)[:,1]
results["rating3"] = np.vstack(prob)[:,2]
results["rating4"] = np.vstack(prob)[:,3]
results["rating5"] = np.vstack(prob)[:,4]
results.tail(20)

Unnamed: 0,gender,age_desc,occ_desc,title,genre,rating,prediction,rating1,rating2,rating3,rating4,rating5
783566,M,50-55,doctor/health care,Body Heat (1981),Crime|Thriller,4,4,0.06564,0.103277,0.286258,0.310477,0.234347
393741,F,45-49,doctor/health care,Star Trek: First Contact (1996),Action|Adventure|Sci-Fi,3,4,0.082526,0.101586,0.264915,0.358767,0.192206
197195,M,35-44,other or not specified,"Cable Guy, The (1996)",Comedy,1,4,0.092056,0.139292,0.248253,0.319788,0.200612
531223,M,35-44,technician/engineer,Labyrinth (1986),Adventure|Children's|Fantasy,3,4,0.078198,0.137061,0.274792,0.292537,0.217411
222352,M,50-55,scientist,My Fair Lady (1964),Musical|Romance,5,3,0.12349,0.121893,0.270602,0.253798,0.230217
15275,M,25-34,lawyer,Twelve Monkeys (1995),Drama|Sci-Fi,4,4,0.106285,0.114929,0.193595,0.322558,0.262633
643421,M,25-34,executive/managerial,Lifeforce (1985),Horror|Sci-Fi,1,4,0.096543,0.133438,0.241871,0.296966,0.231182
918290,M,35-44,technician/engineer,"Odd Couple, The (1968)",Comedy,4,4,0.066885,0.123932,0.301229,0.306464,0.20149
905462,M,25-34,academic/educator,Guess Who's Coming to Dinner (1967),Comedy|Drama,4,4,0.088697,0.127995,0.240211,0.337708,0.20539
782387,F,18-24,writer,Total Recall (1990),Action|Adventure|Sci-Fi|Thriller,3,4,0.081379,0.097211,0.259884,0.338928,0.222598


As an aside, we noticed that many of the predictions cluster around 4. The model may simply be predicting 4 because it is the most commonly given rating. This would be worth investigating in future improvements to the model.
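One way to test this suspicion is to compare the model against a majority-class baseline and to look at the distribution of its predictions. The sketch below uses the ten sample rows displayed above as stand-in data; on the real validation set the same function would be applied to `validset.rating.values` and the full prediction array. The helper name is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# The ten sample rows shown in the table above (actual rating, model prediction).
ratings     = [4, 3, 1, 3, 5, 4, 1, 4, 4, 3]
predictions = [4, 4, 4, 4, 3, 4, 4, 4, 4, 4]

label, baseline_acc = majority_baseline_accuracy(ratings)
model_acc = sum(p == r for p, r in zip(predictions, ratings)) / len(ratings)

print(label, baseline_acc)   # 4 0.4
print(model_acc)             # 0.4
print(Counter(predictions))  # Counter({4: 9, 3: 1})
```

On this tiny sample the model does no better than always predicting 4, but ten rows are far too few to draw a conclusion.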

## LASSO cross feature selection
Next, we will use LASSO regression to select which cross features could potentially improve on the baseline TensorFlow “Wide and Deep” model. The one-hot encoded columns are passed through LASSO, and several alpha values are tested until the number of features is reduced to an acceptable amount.

**Bug Fix**: We noticed that passing more than 50 one-hot encoded features to the TensorFlow model crashed the kernel. We therefore increased the alpha hyperparameter of the LASSO model until only 23 one-hot columns remained; these are the columns used to build the crossed features.
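The alpha search amounts to counting non-zero coefficients for each candidate and stopping at the first one that fits the feature budget. A minimal sketch, assuming the coefficients are already available as a mapping from alpha to a coefficient list; the numbers and the helper `pick_alpha` are made up for illustration and are not the notebook's fitted values.

```python
def count_nonzero(coefs):
    """Number of features LASSO kept (coefficient not exactly zero)."""
    return sum(1 for c in coefs if c != 0.0)

def pick_alpha(coefs_by_alpha, max_features):
    """Return the smallest alpha whose model keeps at most max_features, or None."""
    for alpha in sorted(coefs_by_alpha):
        if count_nonzero(coefs_by_alpha[alpha]) <= max_features:
            return alpha
    return None

# Hypothetical coefficients for five features at three alphas.
coefs_by_alpha = {
    0.001: [-0.12, 0.04, 0.0, 0.06, -0.08],
    0.002: [-0.05, 0.0, 0.0, 0.04, -0.06],
    0.003: [0.0, 0.0, -0.0, 0.02, -0.04],
}
print(pick_alpha(coefs_by_alpha, max_features=2))  # 0.003
```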


In [7]:
# LASSO FOR LOOP

def lasso_loop(data_onehot, data_y, alpha_list):
    from sklearn import linear_model
    rows = []

    for alpha in alpha_list:
        print('Building model for alpha: ',alpha)
        clf = linear_model.Lasso(alpha=alpha)
        clf.fit(data_onehot, data_y)

        # one row per alpha: the alpha value followed by the fitted coefficients
        rows.append(np.append(alpha, clf.coef_))
    # name columns from the function argument instead of the global trainset_dummies;
    # DataFrame.append is avoided because it was removed in later pandas releases
    coefs = pd.DataFrame(rows, columns=['alpha'] + list(data_onehot.columns))
    return coefs

############
data_onehot = trainset_dummies
alpha_list= [.001, .002, .003]
#np.arange(.001,.01,.001)
data_y = trainset[target_col]

coefs_df = lasso_loop(data_onehot, data_y, alpha_list)
coefs_df


Building model for alpha:  0.001
Building model for alpha:  0.002
Building model for alpha:  0.003


Unnamed: 0,alpha,abcd,abce,abcf,abcg,abch,abci,abcj,abck,abcl,...,acfi,acfj,acfk,acfl,acfm,acfn,acfo,acfp,acfq,acfr
0,0.001,-0.122323,0.036501,0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,...,-0.032938,0.061018,0.0,0.012675,0.053974,-0.0,0.024127,-0.0,-0.064391,-0.081665
1,0.002,-0.054742,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,...,-0.028766,0.041778,0.0,0.0,0.011113,-0.0,0.00644,-0.0,-0.003449,-0.063905
2,0.003,-0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,...,-0.020175,0.024184,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.044049


In [8]:
# List of columns to keep
mask = coefs_df.loc[2,:] == 0.0
#trainset.iloc[]
col_drop = coefs_df.columns.values[mask]
col_keep = coefs_df.columns.values[~mask]
col_keep = col_keep[1:]  # drop the leading 'alpha' column, which is never 0
col_keep = list(col_keep)
print('columns that will be DROP: ', col_drop.shape[0])
print('columns that will be KEEP: ', len(col_keep))

columns that will be DROP:  308
columns that will be KEEP:  23


In [9]:
#dropping columns from training and validation
trainset = trainset.drop(col_drop, axis=1)
validset = validset.drop(col_drop, axis=1)
print(trainset.shape, ' New shape of trainset')
print(validset.shape, ' New shape of validset')

(800167, 37)  New shape of trainset
(200042, 37)  New shape of validset


## Wide and Deep model with LASSO cross feature selection

In [10]:
def make_inputs(dataframe):
    feature_inputs = {
        col_name: tf.SparseTensor(
            indices = [[i, 0] for i in range(len(dataframe[col_name]))],
            values = dataframe[col_name].values,
            dense_shape = [len(dataframe[col_name]), 1]
        )
        for col_name in str_cols + int_cols + col_keep
    }
    label_input = tf.constant(dataframe[target_col].values-1)
    return (feature_inputs, label_input)

In [11]:

genre = tf.feature_column.categorical_column_with_hash_bucket('genre', hash_bucket_size=1000) 
gender = tf.feature_column.categorical_column_with_hash_bucket('gender', hash_bucket_size=1000) 

In [12]:
#deal with int columns

age = tf.feature_column.categorical_column_with_identity('age', num_buckets=1000, default_value=0)
occupation = tf.feature_column.categorical_column_with_identity('occupation', num_buckets=1000, default_value=0)

In [13]:
def make_int_columns(col_keep):   

    int_columns = [
        tf.feature_column.categorical_column_with_identity(col_name, num_buckets=1000, default_value=0)
        for col_name in col_keep
    ]
    return int_columns

In [14]:
make_int_columns(col_keep)

[_IdentityCategoricalColumn(key='abdo', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='aber', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='abfn', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='abgh', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='abkp', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='ablr', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='abmn', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='abow', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='abqz', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='abtz', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='abuw', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='abxz', num_buckets=1000, default_value=0),
 _IdentityCategoricalColumn(key='acdg', num_buckets=1000, default_value=0),
 _IdentityCa

In [15]:
#make embedding columns

deep_columns = [
    
    #int columns:
  tf.feature_column.embedding_column(occupation, dimension = 8),
  tf.feature_column.embedding_column(age, dimension = 8),

  # hashed columns:
#   tf.feature_column.embedding_column(zipcode, dimension=8),
  tf.feature_column.embedding_column(genre, dimension=8),
  tf.feature_column.embedding_column(gender, dimension=8)


]


In [17]:
def make_crossed_columns(col_keep):
    crossed_columns = []
    for col1 in col_keep:
        for col2 in col_keep:
            # ids are 4 characters long, so [:5] compares the full names
            # and this check only skips crossing a column with itself
            if col1[:5] != col2[:5]:
                col1_col2 = tf.feature_column.crossed_column(
                    [col1, col2], hash_bucket_size=1000)
                crossed_columns.append(col1_col2)
    return crossed_columns

In [18]:

crossed_list = []

for col1 in col_keep:
    for col2 in col_keep:
        if col1[:5] != col2[:5]:
            crossed_list.append([str(col1), str(col2)])

crossed_dedupe = set(tuple(sorted(p)) for p in crossed_list)
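
Because the ids are only four characters long, the `[:5]` slice compares whole names, so the loop above skips only self-pairs and the `set` then removes mirrored duplicates. Under that reading, `itertools.combinations` yields the same unordered pairs directly. The sketch below uses 23 placeholder names, matching the number of columns kept by LASSO; the `_demo` variables are hypothetical.

```python
import itertools

# Placeholder names standing in for the 23 kept column ids.
col_keep_demo = ["c%02d" % i for i in range(23)]

# Nested-loop version, as in the notebook: ordered pairs, then de-duplicated.
crossed_list_demo = [[a, b] for a in col_keep_demo for b in col_keep_demo if a != b]
crossed_dedupe_demo = set(tuple(sorted(p)) for p in crossed_list_demo)

# Direct version: each unordered pair exactly once.
pairs = set(itertools.combinations(sorted(col_keep_demo), 2))

print(len(crossed_list_demo), len(pairs))  # 506 253
print(pairs == crossed_dedupe_demo)        # True
```

With 23 columns this gives 253 crossed columns in the wide part of the model.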

In [19]:
def make_wide_input_layers(crossed_dedupe):
    crossed_wide_input_layers = [
        tf.feature_column.crossed_column([c for c in cs], hash_bucket_size=1000)
        for cs in crossed_dedupe
    ]
    return crossed_wide_input_layers

In [20]:
#LASSO version
model = tf.contrib.learn.DNNLinearCombinedClassifier(
    n_classes=5,
    linear_feature_columns = make_wide_input_layers(crossed_dedupe),
    dnn_feature_columns = deep_columns,
    dnn_hidden_units = [32, 16],
    fix_global_step_increment_bug=True,
    config = tf.contrib.learn.RunConfig(
        keep_checkpoint_max = 1,
        save_summary_steps = 10
    )
)


INFO:tensorflow:Using config: {'_is_chief': True, '_environment': 'local', '_model_dir': '/var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpxr_8y64y', '_master': '', '_tf_random_seed': None, '_save_checkpoints_steps': None, '_evaluation_master': '', '_num_worker_replicas': 0, '_keep_checkpoint_every_n_hours': 10000, '_num_ps_replicas': 0, '_keep_checkpoint_max': 1, '_task_type': None, '_task_id': 0, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_log_step_count_steps': 100, '_save_summary_steps': 10, '_save_checkpoints_secs': 600, '_session_config': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1a2bb198d0>}


In [21]:
model.fit(input_fn = lambda: make_inputs(trainset), steps=10)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpxr_8y64y/model.ckpt.
INFO:tensorflow:step = 1, loss = 1.61574
INFO:tensorflow:Saving checkpoints for 10 into /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpxr_8y64y/model.ckpt.
INFO:tensorflow:Loss for final step: 3.87136.


DNNLinearCombinedClassifier(params={'embedding_lr_multipliers': None, 'dnn_optimizer': None, 'linear_optimizer': None, 'input_layer_partitioner': None, 'dnn_activation_fn': <function relu at 0x10d7e5510>, 'head': <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead object at 0x109f3f898>, 'dnn_hidden_units': [32, 16], 'gradient_clip_norm': None, 'dnn_feature_columns': (_EmbeddingColumn(categorical_column=_IdentityCategoricalColumn(key='occupation', num_buckets=1000, default_value=0), dimension=8, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x109f39080>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True), _EmbeddingColumn(categorical_column=_IdentityCategoricalColumn(key='age', num_buckets=1000, default_value=0), dimension=8, combiner='mean', initializer=<tensorflow.python.ops.init_ops.TruncatedNormal object at 0x109f39908>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainabl

In [22]:
results_lasso = model.evaluate(input_fn = lambda: make_inputs(validset), steps=1)

INFO:tensorflow:Starting evaluation at 2017-12-15-14:23:31
INFO:tensorflow:Restoring parameters from /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpxr_8y64y/model.ckpt-10
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2017-12-15-14:24:33
INFO:tensorflow:Saving dict for global step 10: accuracy = 0.261985, global_step = 10, loss = 3.99424


In [23]:
predictions = model.predict_classes(input_fn = lambda: make_inputs(validset))
probabilities = model.predict_proba(input_fn = lambda: make_inputs(validset))


INFO:tensorflow:Restoring parameters from /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpxr_8y64y/model.ckpt-10
INFO:tensorflow:Restoring parameters from /var/folders/dg/nljwh9354cvftwk_8g82fj640000gn/T/tmpxr_8y64y/model.ckpt-10


In [24]:

for n, r in results_lasso.items():
    print("%s: %a"%(n, r))

accuracy: 0.26198497
global_step: 10
loss: 3.9942381


In [25]:
predict = list(predictions)


In [26]:
prob = list(probabilities)


In [27]:
lasso_accuracy = np.sum(np.asarray(predict)+1 == validset.rating.values) / len(validset)
print("Lasso Accuracy: %f%%"%(lasso_accuracy*100,))

Lasso Accuracy: 26.198498%


In [28]:
result_lasso = validset[["gender","age_desc","occ_desc", "title", "genre", "rating"]].copy()
result_lasso["prediction"] = np.asarray(predict)+1
result_lasso["rating1"] = np.vstack(prob)[:,0]
result_lasso["rating2"] = np.vstack(prob)[:,1]
result_lasso["rating3"] = np.vstack(prob)[:,2]
result_lasso["rating4"] = np.vstack(prob)[:,3]
result_lasso["rating5"] = np.vstack(prob)[:,4]
result_lasso.tail(20)

Unnamed: 0,gender,age_desc,occ_desc,title,genre,rating,prediction,rating1,rating2,rating3,rating4,rating5
783566,M,50-55,doctor/health care,Body Heat (1981),Crime|Thriller,4,3,0.027752,0.095728,0.86215,0.014286,8.3e-05
393741,F,45-49,doctor/health care,Star Trek: First Contact (1996),Action|Adventure|Sci-Fi,3,3,0.041268,0.121213,0.80751,0.029811,0.000197
197195,M,35-44,other or not specified,"Cable Guy, The (1996)",Comedy,1,3,0.035603,0.124409,0.82195,0.017962,7.5e-05
531223,M,35-44,technician/engineer,Labyrinth (1986),Adventure|Children's|Fantasy,3,3,0.02099,0.103194,0.864674,0.011097,4.5e-05
222352,M,50-55,scientist,My Fair Lady (1964),Musical|Romance,5,3,0.027441,0.103342,0.851856,0.017287,7.4e-05
15275,M,25-34,lawyer,Twelve Monkeys (1995),Drama|Sci-Fi,4,3,0.029698,0.119144,0.83789,0.013235,3.3e-05
643421,M,25-34,executive/managerial,Lifeforce (1985),Horror|Sci-Fi,1,3,0.033416,0.156774,0.790872,0.018861,7.8e-05
918290,M,35-44,technician/engineer,"Odd Couple, The (1968)",Comedy,4,3,0.029864,0.119008,0.8372,0.013877,5e-05
905462,M,25-34,academic/educator,Guess Who's Coming to Dinner (1967),Comedy|Drama,4,3,0.032863,0.132371,0.816168,0.018501,9.7e-05
782387,F,18-24,writer,Total Recall (1990),Action|Adventure|Sci-Fi|Thriller,3,3,0.056286,0.114833,0.7997,0.02899,0.00019


## Now, let's compare to the basic "Wide and Deep" model

- Accuracy of the LASSO Wide and Deep is 26.20% versus 34.67% for the baseline Wide and Deep.
- LASSO Wide and Deep tended to underpredict ratings: its predictions clustered around 3, while the actual mean rating was 3.58.
- The cross-product implementation available in TensorFlow is much more time-efficient than pre-selecting cross columns and adding them to the model as extra columns.
- Proper utilization of the GPU was difficult to ensure. We assume that our code has a bottleneck somewhere in the data flow.

**Conclusion:**
- **Baseline Wide and Deep is superior to our modification in both accuracy and runtime.**