# Final Capstone: Revisiting the Netflix Prize

## Notebook 2: Feature Engineering

With few given features to work with and a $1,000,000 reward up for grabs, contest participants looked for ways to extract information from the data that could explicitly represent user preferences and biases and relate them to movie attributes. The rating date adds a time dimension to these otherwise implicit relationships. The goal, then, is to find relationships that may improve prediction accuracy, determine the appropriate numerical calculation, code the operation, and use the output as a new feature in the dataset.

Tasks such as these are creative in nature, but working with big data requires the data science practitioner to stay aware of the available computational resources. Perhaps the most challenging aspect, however, is choosing the most efficient way to process the data.

In [1]:
import time
start_time = time.perf_counter()
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)

In [2]:
%%time
# retrieve data exported from first notebook
base_path = 'C:/Users/jnpol/Documents/DS/Data Science/UL/'
all_ratings = pd.read_parquet(base_path + 'all_ratings.parquet')
quindex = pd.read_parquet(base_path + 'quindex.parquet')
net = pd.read_parquet(base_path + 'net1.parquet')

all_ratings.info()
print()
quindex.info()
print()
net.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100480507 entries, 0 to 100480506
Data columns (total 1 columns):
 #   Column  Dtype
---  ------  -----
 0   rating  int8 
dtypes: int8(1)
memory usage: 862.4 MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1408395 entries, 0 to 1408394
Data columns (total 1 columns):
 #   Column   Non-Null Count    Dtype
---  ------   --------------    -----
 0   quindex  1408395 non-null  int64
dtypes: int64(1)
memory usage: 10.7 MB

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100480507 entries, 0 to 100480506
Data columns (total 5 columns):
 #   Column     Dtype  
---  ------     -----  
 0   mov_id     int16  
 1   cust_id    int32  
 2   rating     float64
 3   day_rated  int16  
 4   mov_year   int16  
dtypes: float64(1), int16(3), int32(1)
memory usage: 2.4 GB
Wall time: 3.37 s


## New Features: Counts
The features below are intended to reflect the patterns, tendencies, and behaviors (including temporal shifts) of the movie connoisseur, as well as to express movie attributes. Some features are designed to provide more clarity on general trends, while others seek to illuminate specific traits of each customer.

In [3]:
%%time
# add column indicating number of times movie was rated
net['mov_count'] = net.groupby(['mov_id'])['mov_id'].transform('count')

# add column indicating average number of ratings per movie per day
net['avg_rate_pm_pd'] = net.mov_count / net.day_rated.max()
net.avg_rate_pm_pd = net.avg_rate_pm_pd.astype(np.float32)
net.drop(columns=['mov_count'], inplace=True)

# add column indicating the number of movies rated per cust
net['rated_bycust'] = net.groupby(['cust_id'])['cust_id'].transform('count')

# add column indicating average number of ratings per customer per day
net['avg_rate_pc_pd'] = net.rated_bycust / net.day_rated.max()
net.avg_rate_pc_pd = net.avg_rate_pc_pd.astype(np.float32)
net.drop(columns=['rated_bycust'], inplace=True)

# add column indicating number of times cust rated on that day
net['cust_day_count'] = net.groupby(
    ['cust_id', 'day_rated'])['mov_id'].transform('count')
net.cust_day_count = net.cust_day_count.astype(np.int16)

# add column indicating number of days since customer's first rating
net['day_min'] = net.groupby(['cust_id'])['day_rated'].transform('min')
net['cust_days_since'] = net.day_rated - net.day_min

# add column indicating number of days since movie's first rating
net.day_min = net.groupby(['mov_id'])['day_rated'].transform('min')
net['mov_days_since'] = net.day_rated - net.day_min
net.drop(columns=['day_min'], inplace=True)

Wall time: 46.3 s
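To make the count features above concrete, here is a minimal sketch on a hypothetical five-row frame. The data values are invented for illustration; only the column names and the `groupby(...).transform(...)` pattern come from the cell above.

```python
import pandas as pd

# Hypothetical toy data that reuses the notebook's column names.
toy = pd.DataFrame({
    'mov_id':    [1, 1, 1, 2, 2],
    'cust_id':   [10, 11, 10, 10, 12],
    'day_rated': [5, 5, 9, 9, 9],
})

# transform('count') broadcasts each group's size back onto every row of the group
toy['mov_count'] = toy.groupby('mov_id')['mov_id'].transform('count')

# number of ratings the customer gave on the same day
toy['cust_day_count'] = toy.groupby(
    ['cust_id', 'day_rated'])['mov_id'].transform('count')

# days elapsed since the customer's first rating
first_day = toy.groupby('cust_id')['day_rated'].transform('min')
toy['cust_days_since'] = toy['day_rated'] - first_day

print(toy)
```

Because `transform` returns a result aligned to the original index, no merge back onto the frame is needed, which matters at the scale of 100 million rows.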


## Additional Features: Means
The additional features below can be calculated on the training set only. They cannot be computed directly on the quiz set, since its ratings are assumed to be unknown, but they can (and will) be estimated. In this way, even though the quiz set's true ratings cannot be used to train the model, features that average ratings from the training data can still be added to the quiz set.

Note that there are many opportunities to reduce memory consumption, such as casting each new column to a smaller dtype with `astype` as soon as it is created. This continues throughout the project.
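As a rough illustration of the memory point, the sketch below downcasts two columns of a hypothetical toy frame and compares the bytes used before and after. The values are invented; the `astype(np.int16)` and `astype(np.float32)` casts mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns stored at pandas' default 64-bit widths.
toy = pd.DataFrame({
    'cust_day_count': np.array([4, 11, 3, 7, 34], dtype=np.int64),
    'mov_avg_rating': np.array([3.73, 3.99, 3.84, 3.63, 3.90], dtype=np.float64),
})
before = toy.memory_usage(index=False).sum()

# Check that the integer values fit in int16 before casting.
assert toy['cust_day_count'].max() <= np.iinfo(np.int16).max

toy['cust_day_count'] = toy['cust_day_count'].astype(np.int16)
toy['mov_avg_rating'] = toy['mov_avg_rating'].astype(np.float32)
after = toy.memory_usage(index=False).sum()

print(f'bytes before: {before}, bytes after: {after}')
```

Casting to `float32` keeps roughly seven significant decimal digits, which is enough for averages of 1 to 5 star ratings. An integer cast silently wraps if a value exceeds the target range, hence the range check first.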

In [4]:
%%time
# used to select rows matching original quiz df index
quilist = list(quindex.quindex)

# add column indicating average rating per movie
net['mov_avg_rating'] = net.drop(quilist).groupby(
    ['mov_id'])['rating'].transform('mean')
net.mov_avg_rating = net.mov_avg_rating.astype(np.float32)

# add column indicating average rating per cust
net['cust_avg_rating'] = net.drop(quilist).groupby(
    ['cust_id'])['rating'].transform('mean')
net.cust_avg_rating = net.cust_avg_rating.astype(np.float32)

# add column indicating average rating per movie per day
net['mov_day_avg'] = net.drop(quilist).groupby(
    ['mov_id', 'day_rated'])['rating'].transform('mean')
net.mov_day_avg = net.mov_day_avg.astype(np.float32)

# add column indicating daily average rating by the cust
net['cust_day_avg'] = net.drop(quilist).groupby(
    ['cust_id', 'day_rated'])['rating'].transform('mean')
net.cust_day_avg = net.cust_day_avg.astype(np.float32)

# add column indicating average rating per release year
net['avg_rate_yr'] = net.drop(quilist).groupby(
    ['mov_year'])['rating'].transform('mean')
net.avg_rate_yr = net.avg_rate_yr.astype(np.float32)

# add column indicating average rating per customer per release year
net['avg_rate_cst_yr'] = net.drop(quilist).groupby(
    ['cust_id', 'mov_year'])['rating'].transform('mean')
net.avg_rate_cst_yr = net.avg_rate_cst_yr.astype(np.float32)

Wall time: 3min 31s
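The sketch below shows, on a hypothetical toy frame, why the quiz rows end up null after the cell above: the means are computed on a frame that excludes them, and assignment aligns on the index. It also shows an alternative that is not the notebook's method. Since pandas skips NaN when averaging, calling `transform('mean')` on the full frame gives quiz rows their group's training mean directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN in 'rating' marks a quiz row, as in `net`.
toy = pd.DataFrame({
    'mov_id': [1, 1, 1, 2, 2],
    'rating': [3.0, 5.0, np.nan, 4.0, np.nan],
})
is_quiz = toy['rating'].isna()

# Notebook-style: average over training rows only. Index alignment on
# assignment leaves the excluded quiz rows as NaN.
toy['mov_avg_train'] = toy[~is_quiz].groupby('mov_id')['rating'].transform('mean')

# Alternative (an assumption, not the author's approach): mean skips NaN,
# so the training mean is broadcast to every row of the group.
toy['mov_avg_all_rows'] = toy.groupby('mov_id')['rating'].transform('mean')

print(toy)
```

The two approaches differ for multi-key groups that contain no training rows, such as a customer and day with only quiz ratings. The alternative leaves those rows null, so they would still need some form of estimate.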


The code below shows the quantity of null values in each column. Most of these values will be filled shortly, but not all of them: the nulls in 'rating' remain. We will see that this allows for a convenient method of row selection.

In [5]:
%%time
net.info()
display(net.head())
net.isna().sum()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100480507 entries, 0 to 100480506
Data columns (total 16 columns):
 #   Column           Dtype  
---  ------           -----  
 0   mov_id           int16  
 1   cust_id          int32  
 2   rating           float64
 3   day_rated        int16  
 4   mov_year         int16  
 5   avg_rate_pm_pd   float32
 6   avg_rate_pc_pd   float32
 7   cust_day_count   int16  
 8   cust_days_since  int16  
 9   mov_days_since   int16  
 10  mov_avg_rating   float32
 11  cust_avg_rating  float32
 12  mov_day_avg      float32
 13  cust_day_avg     float32
 14  avg_rate_yr      float32
 15  avg_rate_cst_yr  float32
dtypes: float32(8), float64(1), int16(6), int32(1)
memory usage: 8.5 GB


Unnamed: 0,mov_id,cust_id,rating,day_rated,mov_year,avg_rate_pm_pd,avg_rate_pc_pd,cust_day_count,cust_days_since,mov_days_since,mov_avg_rating,cust_avg_rating,mov_day_avg,cust_day_avg,avg_rate_yr,avg_rate_cst_yr
0,1,1488844,3.0,2125,2003,0.24,0.98,4,200,590,3.73,3.26,4.0,3.25,3.51,3.22
1,1,822109,5.0,2009,2003,0.24,0.07,11,36,474,3.73,3.99,5.0,4.36,3.51,4.0
2,1,885013,4.0,2168,2003,0.24,0.16,3,157,633,3.73,3.84,4.0,4.0,3.51,3.47
3,1,30878,,2236,2003,0.24,0.58,7,1462,701,,,,,,
4,1,823519,3.0,1636,2003,0.24,0.29,34,41,101,3.73,3.9,3.0,3.91,3.51,3.96


Wall time: 2.52 s


mov_id                   0
cust_id                  0
rating             1408395
day_rated                0
mov_year                 0
avg_rate_pm_pd           0
avg_rate_pc_pd           0
cust_day_count           0
cust_days_since          0
mov_days_since           0
mov_avg_rating     1408395
cust_avg_rating    1408395
mov_day_avg        1408395
cust_day_avg       1408395
avg_rate_yr        1408395
avg_rate_cst_yr    1408395
dtype: int64

Before filling in missing values, the dataframe is sorted so that the value designated to fill each null sits in the previous row. All columns except for 'rating' are forward filled. The dataframe is re-sorted for each feature. Notice that the first three sorts all begin with 'cust_id' as the primary sort key. This is intentional, and helps expedite the task.
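The sort-then-forward-fill idea can be sketched on a hypothetical toy frame. It relies on the default `na_position='last'` of `sort_values`, which places null rows after the valid rows within each group. The data are invented; the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks quiz rows whose average is not yet known.
toy = pd.DataFrame({
    'cust_id':         [7, 3, 7, 3, 7],
    'cust_avg_rating': [4.0, np.nan, 4.0, 2.5, np.nan],
})

# Within each cust_id the NaN rows sort last, directly below a valid row.
toy = toy.sort_values(by=['cust_id', 'cust_avg_rating'])

# Forward fill copies the value from the row above into each NaN.
toy['cust_avg_rating'] = toy['cust_avg_rating'].ffill()

# Restore the original row order.
toy = toy.sort_index()
print(toy)
```

One caveat of this technique: if a group has no valid rows at all, the forward fill carries over the value from the preceding group. For the multi-key sorts this acts as the estimate for quiz rows with no exact match in the training data.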

In [6]:
%%time
# NaN sorts last within each group, so ffill copies the value from the
# nearest preceding non-null row into the quiz rows
net.sort_values(by=['cust_id', 'mov_year', 'avg_rate_cst_yr'], inplace=True)
net['avg_rate_cst_yr'] = net['avg_rate_cst_yr'].ffill()

net.sort_values(by=['cust_id', 'cust_avg_rating'], inplace=True)
net['cust_avg_rating'] = net['cust_avg_rating'].ffill()

net.sort_values(by=['cust_id', 'day_rated', 'cust_day_avg'], inplace=True)
net['cust_day_avg'] = net['cust_day_avg'].ffill()

net.sort_values(by=['mov_id', 'day_rated', 'mov_day_avg'], inplace=True)
net['mov_day_avg'] = net['mov_day_avg'].ffill()

net.sort_values(by=['mov_id', 'mov_avg_rating'], inplace=True)
net['mov_avg_rating'] = net['mov_avg_rating'].ffill()

net.sort_values(by=['mov_year', 'avg_rate_yr'], inplace=True)
net['avg_rate_yr'] = net['avg_rate_yr'].ffill()

# add baseline approximation: global mean plus the average of the customer's
# and the movie's same-day deviations from their overall means
net['bline_approx'] = (2*net.rating.mean() + net.cust_day_avg -
                       net.cust_avg_rating + net.mov_day_avg -
                       net.mov_avg_rating)/2
net.bline_approx = net.bline_approx.astype(np.float32)

Wall time: 3min 7s
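The `bline_approx` expression above is algebraically the global mean plus the average of two deviations: how far the customer's daily average sits from their overall average, and how far the movie's daily average sits from its overall average. The sketch below checks that equivalence numerically. The function names and the global mean of 3.6 are hypothetical placeholders; the other inputs are the rounded values displayed for row 0.

```python
import math


def bline_as_coded(mu, cust_day_avg, cust_avg, mov_day_avg, mov_avg):
    """Same arithmetic as the notebook cell."""
    return (2 * mu + cust_day_avg - cust_avg + mov_day_avg - mov_avg) / 2


def bline_rearranged(mu, cust_day_avg, cust_avg, mov_day_avg, mov_avg):
    """Global mean plus the average of the two same-day deviations."""
    cust_dev = cust_day_avg - cust_avg
    mov_dev = mov_day_avg - mov_avg
    return mu + (cust_dev + mov_dev) / 2


# mu=3.6 is an assumed placeholder, not a value reported in the notebook.
args = dict(mu=3.6, cust_day_avg=3.25, cust_avg=3.26,
            mov_day_avg=4.0, mov_avg=3.73)
a = bline_as_coded(**args)
b = bline_rearranged(**args)
print(a, b)
```

Note that `net.rating.mean()` skips the null quiz ratings, so the global mean in the cell above is computed from training rows only.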


In [7]:
%%time
net.sort_index(inplace=True)
net.info()
display(net.head())
net.isna().sum()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100480507 entries, 0 to 100480506
Data columns (total 17 columns):
 #   Column           Dtype  
---  ------           -----  
 0   mov_id           int16  
 1   cust_id          int32  
 2   rating           float64
 3   day_rated        int16  
 4   mov_year         int16  
 5   avg_rate_pm_pd   float32
 6   avg_rate_pc_pd   float32
 7   cust_day_count   int16  
 8   cust_days_since  int16  
 9   mov_days_since   int16  
 10  mov_avg_rating   float32
 11  cust_avg_rating  float32
 12  mov_day_avg      float32
 13  cust_day_avg     float32
 14  avg_rate_yr      float32
 15  avg_rate_cst_yr  float32
 16  bline_approx     float32
dtypes: float32(9), float64(1), int16(6), int32(1)
memory usage: 6.4 GB


Unnamed: 0,mov_id,cust_id,rating,day_rated,mov_year,avg_rate_pm_pd,avg_rate_pc_pd,cust_day_count,cust_days_since,mov_days_since,mov_avg_rating,cust_avg_rating,mov_day_avg,cust_day_avg,avg_rate_yr,avg_rate_cst_yr,bline_approx
0,1,1488844,3.0,2125,2003,0.24,0.98,4,200,590,3.73,3.26,4.0,3.25,3.51,3.22,3.74
1,1,822109,5.0,2009,2003,0.24,0.07,11,36,474,3.73,3.99,5.0,4.36,3.51,4.0,4.43
2,1,885013,4.0,2168,2003,0.24,0.16,3,157,633,3.73,3.84,4.0,4.0,3.51,3.47,3.82
3,1,30878,,2236,2003,0.24,0.58,7,1462,701,3.73,3.63,3.0,3.0,3.51,3.43,2.92
4,1,823519,3.0,1636,2003,0.24,0.29,34,41,101,3.73,3.9,3.0,3.91,3.51,3.96,3.24


Wall time: 18.2 s


mov_id                   0
cust_id                  0
rating             1408395
day_rated                0
mov_year                 0
avg_rate_pm_pd           0
avg_rate_pc_pd           0
cust_day_count           0
cust_days_since          0
mov_days_since           0
mov_avg_rating           0
cust_avg_rating          0
mov_day_avg              0
cust_day_avg             0
avg_rate_yr              0
avg_rate_cst_yr          0
bline_approx             0
dtype: int64
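With only 'rating' left null, the quiz rows can be selected with a boolean mask instead of an index list. A minimal sketch on a hypothetical toy frame follows; the variable names `quiz` and `train` are illustrative, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; after the fill only 'rating' still has nulls.
toy = pd.DataFrame({
    'mov_id':         [1, 1, 1, 1],
    'rating':         [3.0, 5.0, np.nan, 3.0],
    'mov_avg_rating': [3.73, 3.73, 3.73, 3.73],
})

quiz_mask = toy['rating'].isna()

quiz = toy[quiz_mask]      # rows whose rating must be predicted
train = toy[~quiz_mask]    # rows with a known rating

print(len(train), len(quiz))
```

The mask survives sorting and shuffling, since it depends on the row contents rather than on row positions.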

In [8]:
%%time
net['all_ratings'] = all_ratings.rating
net = net.sample(frac=1, random_state=171)
net.info()
net.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100480507 entries, 93047606 to 6426045
Data columns (total 18 columns):
 #   Column           Dtype  
---  ------           -----  
 0   mov_id           int16  
 1   cust_id          int32  
 2   rating           float64
 3   day_rated        int16  
 4   mov_year         int16  
 5   avg_rate_pm_pd   float32
 6   avg_rate_pc_pd   float32
 7   cust_day_count   int16  
 8   cust_days_since  int16  
 9   mov_days_since   int16  
 10  mov_avg_rating   float32
 11  cust_avg_rating  float32
 12  mov_day_avg      float32
 13  cust_day_avg     float32
 14  avg_rate_yr      float32
 15  avg_rate_cst_yr  float32
 16  bline_approx     float32
 17  all_ratings      int8   
dtypes: float32(9), float64(1), int16(6), int32(1), int8(1)
memory usage: 6.5 GB
Wall time: 46.4 s


Unnamed: 0,mov_id,cust_id,rating,day_rated,mov_year,avg_rate_pm_pd,avg_rate_pc_pd,cust_day_count,cust_days_since,mov_days_since,mov_avg_rating,cust_avg_rating,mov_day_avg,cust_day_avg,avg_rate_yr,avg_rate_cst_yr,bline_approx,all_ratings
93047606,16469,562200,4.0,1913,2004,19.74,0.02,3,79,286,3.41,3.82,3.51,3.33,3.51,3.52,3.41,4
95827348,16984,639827,1.0,1854,2004,16.95,0.28,4,512,292,3.22,3.38,3.12,2.0,3.51,2.62,2.86,1
56309414,10277,1828455,2.0,1854,1948,5.03,0.08,10,357,1797,3.93,2.88,3.67,2.4,3.83,2.0,3.23,2
69063104,12501,1583579,,2137,1984,19.07,0.09,6,501,1543,3.87,4.07,3.95,4.0,3.71,3.25,3.61,5
60852899,11149,458209,3.0,1506,2002,56.17,0.24,9,316,405,3.14,3.26,3.05,2.78,3.5,3.01,3.32,3
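The cell above relies on two pandas behaviors: assigning a Series as a new column aligns it on the index, and `sample(frac=1)` returns every row in shuffled order with its index label intact. The sketch below illustrates both on hypothetical toy data. The frame `full` stands in for `all_ratings`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'rating' is masked for the quiz row at index 1,
# while `full` holds the complete ratings under the same index.
toy = pd.DataFrame({'rating': [3.0, np.nan, 4.0, 1.0]})
full = pd.DataFrame({'rating': np.array([3, 5, 4, 1], dtype=np.int8)})

# Column assignment aligns on index labels, not on row position.
toy['all_ratings'] = full['rating']

# frac=1 shuffles all rows; random_state makes the order reproducible.
shuffled = toy.sample(frac=1, random_state=171)

print(shuffled)
```

Because index labels travel with the rows, the original order can be restored at any time with `sort_index()`.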


In [9]:
%%time
net.to_parquet('net2.parquet')

Wall time: 28 s
