# Wide-and-Deep ML: Model Preparation

In this notebook, we train and evaluate the wide-and-deep collaborative filtering recommender using features engineered in the prior notebook.

## 1. Prepare the data

In [17]:
# !pip3 install tensorflow

In [14]:
# import required libraries

import pandas as pd
import tensorflow as tf
import math

In [9]:
# save models
train_df = pd.read_csv('../data/user_movie_interaction_train.csv')
val_df = pd.read_csv('../data/user_movie_interaction_val.csv')
test_df = pd.read_csv('../data/user_movie_interaction_train.csv')

In [10]:
train_df.head()

Unnamed: 0.1,Unnamed: 0,userId,movieId,rating,title,genres,avg_movie_rating,user_all_genres
0,23695,442,51662,0.4,300 (2007),action fantasy war imax,0.721622,fantasy sci-fi mystery animation documentary w...
1,37754,417,1027,0.4,Robin Hood: Prince of Thieves (1991),adventure drama,0.610526,fantasy sci-fi musical horror mystery western ...
2,18178,394,45499,0.5,X-Men: The Last Stand (2006),action sci-fi thriller,0.638095,sci-fi drama children thriller western film-no...
3,33268,271,60609,0.9,Death Note (2006),adventure crime drama horror mystery,0.9,sci-fi drama children thriller western film-no...
4,47465,489,3301,0.6,"Whole Nine Yards, The (2000)",comedy crime,0.641667,sci-fi drama children thriller western film-no...


In [11]:
# drop unnecessary columns
train_df.drop(['Unnamed: 0'], axis=1, inplace=True)
val_df.drop(['Unnamed: 0'], axis=1, inplace=True)
test_df.drop(['Unnamed: 0'], axis=1, inplace=True)
train_df.head()

Unnamed: 0,userId,movieId,rating,title,genres,avg_movie_rating,user_all_genres
0,442,51662,0.4,300 (2007),action fantasy war imax,0.721622,fantasy sci-fi mystery animation documentary w...
1,417,1027,0.4,Robin Hood: Prince of Thieves (1991),adventure drama,0.610526,fantasy sci-fi musical horror mystery western ...
2,394,45499,0.5,X-Men: The Last Stand (2006),action sci-fi thriller,0.638095,sci-fi drama children thriller western film-no...
3,271,60609,0.9,Death Note (2006),adventure crime drama horror mystery,0.9,sci-fi drama children thriller western film-no...
4,489,3301,0.6,"Whole Nine Yards, The (2000)",comedy crime,0.641667,sci-fi drama children thriller western film-no...


In [12]:
# convert `genres` feature data to categorical
train_df['genres'] = train_df['genres'].astype('category')
train_df['user_all_genres'] = train_df['user_all_genres'].astype('category')
train_df.dtypes

userId                 int64
movieId                int64
rating               float64
title                 object
genres              category
avg_movie_rating     float64
user_all_genres     category
dtype: object

### 1.1. Capture label and feature info

In [25]:
# calculate number of examples
user_num = len(train_df['userId'].unique())
movie_num = len(train_df['movieId'].unique())
genre_num = len(train_df['genres'].unique())
user_all_genres_num = len(train_df['genres'].unique())

# calculate manually added embedding dimensions
user_dim = int(round(math.pow(user_num, 1/3)))
movie_dim = int(round(math.pow(movie_num, 1/3)))
genre_dim = int(round(math.pow(genre_num, 1/3)))
user_all_genres_dim = int(round(math.pow(user_all_genres_num, 1/3)))

# variables to define wide and deep columns from the dataset
LABEL_COL = 'title'

NUMERIC_COLS = [
    'rating',
    'avg_movie_rating'
]

CATEGORICAL_COLS = [
    'userId',
    'movieId',
    'genres',
    'user_all_genres'
]

HASH_BUCKET_SIZES = {
    'userId': user_num,
    'movieId': movie_num,
    'genres': genre_num,
    'user_all_genres': user_all_genres_num
}

EMBEDDING_DIMENSIONS = {
    'userId': user_dim,
    'movieId': movie_dim,
    'genres': genre_dim,
    'user_all_genres': user_all_genres_dim,
}

# define wide and deep columns
def get_wide_and_deep_cols():
    wide_cols, deep_cols = [], []
    numeric_cols, numeric_buckets = [], []
    categorical_buckets = []
    cat_hash_bucket_size = genre_num * genre_dim
    left_bound = 3/5
    rght_bound = 4.5/5

    # embedding columns
    ID_COLS = ['userId', 'movieId']
    for col_name in CATEGORICAL_COLS:
        categorical_col = tf.feature_column.categorical_column_with_identity(
            col_name,
            num_buckets = HASH_BUCKET_SIZES[col_name])
        wrapped_col = tf.feature_column.embedding_column(
            categorical_col,
            dimension = EMBEDDING_DIMENSIONS[col_name],
            combiner = 'sqrtn')

        if col_name not in ID_COLS:
            categorical_buckets.append(col_name)

        wide_cols.append(categorical_col)
        deep_cols.append(wrapped_col)

    # numeric columns and respective cross columns
    for col_name in NUMERIC_COLS:
        col_name = tf.feature_column.numeric_column(
            col_name,
            shape = (1,),
            dtype = tf.float32)
        col_buckets = tf.feature_column.bucketized_column(
            col_name,
            boundaries=[left_bound, rght_bound])
        numeric_cols.append(col_name)
        numeric_buckets.append(col_buckets)

    numeric_cols_crossed = tf.feature_column.crossed_column(numeric_buckets, 12)
    categorical_cols_crossed = tf.feature_column.crossed_column(categorical_buckets, cat_hash_bucket_size)
    wide_cols.extend(numeric_buckets + numeric_cols_crossed + categorical_cols_crossed)
    deep_cols.extend(numeric_cols)

    return wide_cols, deep_cols

In [26]:
wide_columns, deep_columns = get_wide_and_deep_cols()

Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
Instructions for updating:
Use `tf.keras.layers.experimental.preprocessing.HashedCrossing` instead for feature crossing when preprocessing data to train a Keras model.


TypeError: can only concatenate list (not "CrossedColumn") to list

In [None]:
wide_columns

In [None]:
deep_columns

## Future development
- use libraries like `PorterStemmer` that allows counter vectorization to find word associations in big paragraphs.
- `sklearn`'s `CountVectorizer` does the vectorization, and `cosine_similarity` computes how closely the vectorized words relate.