# Wide-and-Deep ML: Model Preparation

In this notebook, we train and evaluate the wide-and-deep collaborative filtering recommender using features engineered in the prior notebook.

## 1. Prepare the data

In [17]:
# !pip3 install tensorflow

In [4]:
# import required libraries
import math

import pandas as pd
import tensorflow as tf

2023-11-29 18:04:03.369111: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-29 18:04:03.381450: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-29 18:04:03.548726: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-29 18:04:03.548772: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-29 18:04:03.560360: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to

In [5]:
# load the train, validation, and test splits
train_df = pd.read_csv('../data/user_movie_interaction_train.csv')
val_df = pd.read_csv('../data/user_movie_interaction_val.csv')
# NOTE: file name assumed from the train/val naming pattern
test_df = pd.read_csv('../data/user_movie_interaction_test.csv')
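
Because the splits are loaded from separate files, a quick leakage check can confirm that no user-movie pair appears in more than one split. The helper below is a minimal sketch in plain Python: `shared_pairs` is a hypothetical name, not part of the notebook, and it assumes each interaction is identified by its `(userId, movieId)` pair. The toy values are illustrative only.

```python
def shared_pairs(split_a, split_b):
    """Return the (userId, movieId) pairs present in both splits."""
    return set(split_a) & set(split_b)


# Toy interactions standing in for the real splits (illustrative values only).
toy_train = [(475, 442), (489, 2082), (330, 262)]
toy_val = [(394, 44191), (238, 4025)]
toy_leaky = [(475, 442), (238, 4025)]

print(len(shared_pairs(toy_train, toy_val)))    # splits with no pair in common
print(len(shared_pairs(toy_train, toy_leaky)))  # one pair also occurs in train
```

With the notebook's frames, the pairs could be built as `zip(train_df['userId'], train_df['movieId'])` and passed to the same helper.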

In [6]:
train_df.head()

| | Unnamed: 0 | userId | movieId | rating | title | genres | genre_freq | user_genre_rating |
|---|---|---|---|---|---|---|---|---|
| 0 | 11807 | 475 | 442 | 1.0 | Demolition Man (1993) | action adventure sci-fi | 0.021983 | 0.021983 |
| 1 | 63752 | 489 | 2082 | 0.6 | Mighty Ducks, The (1992) | children comedy | 0.006359 | 0.003815 |
| 2 | 41258 | 330 | 262 | 0.8 | Little Princess, A (1995) | children drama | 0.002963 | 0.00237 |
| 3 | 34631 | 394 | 44191 | 0.8 | V for Vendetta (2006) | action sci-fi thriller imax | 0.001048 | 0.000839 |
| 4 | 43674 | 238 | 4025 | 0.7 | Miss Congeniality (2000) | comedy crime | 0.01111 | 0.007777 |


In [7]:
# drop unnecessary columns
train_df.drop(['Unnamed: 0'], axis=1, inplace=True)
train_df.head()

| | userId | movieId | rating | title | genres | genre_freq | user_genre_rating |
|---|---|---|---|---|---|---|---|
| 0 | 475 | 442 | 1.0 | Demolition Man (1993) | action adventure sci-fi | 0.021983 | 0.021983 |
| 1 | 489 | 2082 | 0.6 | Mighty Ducks, The (1992) | children comedy | 0.006359 | 0.003815 |
| 2 | 330 | 262 | 0.8 | Little Princess, A (1995) | children drama | 0.002963 | 0.00237 |
| 3 | 394 | 44191 | 0.8 | V for Vendetta (2006) | action sci-fi thriller imax | 0.001048 | 0.000839 |
| 4 | 238 | 4025 | 0.7 | Miss Congeniality (2000) | comedy crime | 0.01111 | 0.007777 |


In [8]:
# convert `genres` feature data to categorical
train_df['genres'] = train_df['genres'].astype('category')
train_df.dtypes

userId                  int64
movieId                 int64
rating                float64
title                  object
genres               category
genre_freq            float64
user_genre_rating     float64
dtype: object
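
The `category` dtype stores each distinct genre string once and represents every row by an integer code into that vocabulary. The sketch below mirrors that idea in plain Python. `encode_categories` is a hypothetical helper, and it assumes the vocabulary is kept in sorted order, which is what pandas does by default when it infers categories from sortable values.

```python
def encode_categories(values):
    """Map each value to an integer code into a sorted vocabulary."""
    vocab = sorted(set(values))
    index = {value: code for code, value in enumerate(vocab)}
    return vocab, [index[value] for value in values]


# Toy genre strings taken from the sample rows shown earlier.
genres = ["comedy crime", "children drama", "comedy crime", "children comedy"]
vocab, codes = encode_categories(genres)

print(vocab)  # distinct genre strings, sorted
print(codes)  # one integer code per input row
```

On the real frame, `train_df['genres'].cat.categories` and `train_df['genres'].cat.codes` expose the same two pieces.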

### 1.1. Capture label and feature info

In [None]:
# count the distinct users, movies, and genre combinations
user_num = len(train_df['userId'].unique())
movie_num = len(train_df['movieId'].unique())
genre_num = len(train_df['genres'].unique())

# variables to define wide and deep columns from the dataset
LABEL_COL = 'title'

NUMERIC_COLS = [
    'rating',
    'genre_freq',
    'user_genre_rating'
]

CATEGORICAL_COL = [
    'userId',
    'movieId',
    'genres'
]

HASH_BUCKET_SIZES = {
    'userId': user_num,
    'movieId': movie_num,
    'genres': genre_num
}

EMBEDDING_DIMENSIONS = {
    'userId': int(round(math.pow(user_num, 1/3))),
    'movieId': int(round(math.pow(movie_num, 1/3))),
    'genres': int(round(math.pow(genre_num, 1/3))),
}

# define wide and deep columns
def get_wide_and_deep_cols():
    wide_cols, deep_cols = [], []

    # embedding columns
    for col_name in CATEGORICAL_COL:
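        # NOTE: identity columns expect integer IDs in [0, num_buckets), so raw
        # IDs and the string-valued `genres` column must be index-encoded first
        categorical_col = tf.feature_column.categorical_column_with_identity(
            col_name, num_buckets=HASH_BUCKET_SIZES[col_name]
        )
        wrapped_col = tf.feature_column.embedding_column(
            categorical_col,
            dimension=EMBEDDING_DIMENSIONS[col_name],
            combiner='sqrtn'
        )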

        wide_cols.append(categorical_col)
        deep_cols.append(wrapped_col)

    # TODO: consider crossed columns for ratings and genre frequency

    return wide_cols, deep_cols
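
The embedding sizes above follow a cube-root rule of thumb, and the bucket sizes equal the number of distinct values in each column. The sketch below isolates both ideas without TensorFlow. `embedding_dim` and `fits_identity_range` are hypothetical helpers, and the re-indexing step is one possible way, not necessarily the notebook's, to bring raw IDs into the `[0, num_buckets)` range that an identity-style categorical column accepts.

```python
def embedding_dim(vocab_size):
    """Cube-root heuristic, rounded to the nearest integer (at least 1)."""
    return max(1, int(round(vocab_size ** (1 / 3))))


def fits_identity_range(ids, num_buckets):
    """Check that every ID lies in [0, num_buckets)."""
    return all(0 <= i < num_buckets for i in ids)


raw_movie_ids = [442, 2082, 262, 44191, 4025]  # raw IDs from the sample rows
num_buckets = len(set(raw_movie_ids))          # number of distinct movies

# Re-index raw IDs to 0..num_buckets-1 so they fit the identity range.
remap = {raw: idx for idx, raw in enumerate(sorted(set(raw_movie_ids)))}
dense_movie_ids = [remap[raw] for raw in raw_movie_ids]

print(embedding_dim(1000))
print(fits_identity_range(raw_movie_ids, num_buckets))
print(fits_identity_range(dense_movie_ids, num_buckets))
```

The raw IDs fail the range check because values such as 44191 exceed the number of distinct movies, while the re-indexed IDs pass.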

## Future development
- Use a stemmer such as NLTK's `PorterStemmer` to normalize words before count vectorization, so that word associations can be found in long text passages.
- Use `sklearn`'s `CountVectorizer` for the vectorization and `cosine_similarity` to measure how closely the vectorized texts relate.
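
The count-vectorize-then-compare idea in the list above can be sketched with the standard library alone. This is a simplified stand-in, not `CountVectorizer` or `cosine_similarity` themselves: `count_vector` and `cosine` are hypothetical helpers, tokenization is a plain lowercase split on alphanumeric runs, and no stemming is applied.

```python
import math
import re
from collections import Counter


def count_vector(text):
    """Bag-of-words counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Genre strings from the sample rows serve as toy documents.
doc_a = count_vector("action adventure sci-fi")
doc_b = count_vector("action sci-fi thriller imax")
doc_c = count_vector("children drama")

print(round(cosine(doc_a, doc_b), 3))  # shares tokens, so similarity is positive
print(cosine(doc_a, doc_c))            # no shared tokens
```

Texts that share more tokens score closer to 1, and texts with no tokens in common score 0.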