# 2: FEATURE ENGINEERING

This section derives a composite score that accounts for popularity, ratings, and profitability. Target encoding is then used to convert categorical features into continuous values.


## STEP 2a: COMPOSITE SCORE

This step defines a single metric that captures the notion of a “hit movie”. The following features from the merged dataset were used to produce it:

- **NUMBER OF VOTES.** Serves as an indicator of the title's popularity and viewer engagement.

- **GROSS INCOME.** A key indicator of the title's financial success and market appeal.

- **AVERAGE RATINGS.** Reflects the overall viewer reception and satisfaction. Higher ratings often indicate a more favorable response from the audience.

In [6]:
import pandas as pd
import numpy as np

# Load data
data = pd.read_csv('processed_data/data.csv', index_col=0)
data.head()

tconst,primaryTitle,isAdult,startYear,runtimeMinutes,genres,averageRating,numVotes,actor,actress,casting_director,cinematographer,composer,director,editor,producer,production_designer,self,writer,gross_income
tt0000009,Miss Jerry,0,1894.0,45.0,Romance,5.4,211,"nm0183823,nm1309758",nm0063086,,nm0085156,,nm0085156,,nm0085156,,,nm0085156,0.0
tt0000147,The Corbett-Fitzsimmons Fight,0,1897.0,100.0,"Documentary,News,Sport",5.2,512,,,,nm0714557,,nm0714557,,nm0103755,,"nm0179163,nm0280615,nm4082222,nm4081458,nm2256592",,0.0
tt0000574,The Story of the Kelly Gang,0,1906.0,70.0,"Action,Adventure,Biography",6.0,900,"nm0846894,nm1431224,nm3002376,nm0143899,nm3001...","nm0846887,nm0170118",,"nm0425854,nm0675239,nm0675260",nm2421834,nm0846879,,"nm0317210,nm0425854,nm0846894,nm0846911",,,nm0846879,0.0
tt0000591,The Prodigal Son,0,1907.0,90.0,Drama,5.4,24,"nm0906197,nm0332182","nm1323543,nm1759558",,,,nm0141150,,,,,nm0141150,0.0
tt0000615,Robbery Under Arms,0,1907.0,,Drama,4.3,25,"nm3071427,nm0581353,nm0888988,nm0240418,nm0346387",nm0218953,,"nm0167619,nm0240418",,nm0533958,,,,,"nm0092809,nm0533958",0.0


In [3]:
# Log-scale gross income and vote counts; adding 1 keeps zero values defined (log10(1) = 0)
data['gross_income_log'] = np.log10(data['gross_income']+1)
data['numVotes_log'] = np.log10(data['numVotes']+1)

### THE CRITIC METHOD

The CRITIC (Criteria Importance Through Intercriteria Correlation) method, developed by Diakoulaki, Mavrotas, and Papayannakis in 1995, is an objective weighting technique used in Multi-Criteria Decision-Making (MCDM).
It assesses the importance of criteria based on their variability (standard deviation) and the correlation between them. Criteria that carry more distinct information, being less correlated with the others and more variable, receive higher weights, which makes the weighting less subjective.

#### STEPS

1. Normalize the decision matrix (min-max normalization).
2. Calculate the standard deviation of each criterion on the normalized matrix ($\sigma_i$).
3. Determine the correlation matrix $r_{ij}$.
4. Calculate the measure of conflict of each criterion with the others:
$\displaystyle\sum^{n}_{j=1}(1-r_{ij})$
5. Determine the quantity of information associated with each criterion:
$c_i=\sigma_i\displaystyle\sum^{n}_{j=1}(1-r_{ij})$
6. Determine the weights $w_i=\frac{c_i}{\sum^{n}_{k=1}c_k}$
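
The steps above can be sketched as a small standalone function. This is a minimal illustration on toy data, not the notebook's own implementation: the function name `critic_weights`, the column names, and the values are all hypothetical.

```python
import pandas as pd


def critic_weights(decision_matrix: pd.DataFrame) -> pd.Series:
    """Compute CRITIC weights for the columns (criteria) of a decision matrix."""
    # Step 1: min-max normalize each criterion to [0, 1]
    col_min = decision_matrix.min()
    col_range = decision_matrix.max() - col_min
    normalized = (decision_matrix - col_min) / col_range

    # Step 2: standard deviation of each normalized criterion
    sigma = normalized.std()

    # Step 3: Pearson correlation between criteria
    corr = normalized.corr()

    # Step 4: measure of conflict, the sum over j of (1 - r_ij)
    conflict = (1 - corr).sum(axis=1)

    # Step 5: quantity of information carried by each criterion
    info = sigma * conflict

    # Step 6: scale so that the weights sum to 1
    return info / info.sum()


# Toy data (hypothetical values, not taken from the movie dataset)
toy = pd.DataFrame({
    "votes": [1.0, 2.0, 3.0, 4.0, 5.0],
    "income": [2.0, 1.0, 4.0, 3.0, 5.0],
    "rating": [5.0, 3.0, 4.0, 1.0, 2.0],
})
weights = critic_weights(toy)
print(weights)
```

In this toy example all three criteria have the same spread, so the weights differ only through the conflict term: `rating` is negatively correlated with the other two columns and therefore receives the largest weight.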

In [4]:
# Shuffle the indices of the DataFrame
shuffled_indices = np.random.permutation(len(data))

# Calculate the sizes for each split
train_size = int(0.6 * len(data))
val_size = int(0.2 * len(data))
test_size = len(data) - train_size - val_size  # This ensures that rounding issues don't leave out any data

# Take the training portion and keep only the three criteria used for the composite score
data_train2 = data.iloc[shuffled_indices[:train_size]].copy()
data_train2 = data_train2[['numVotes_log','gross_income_log','averageRating']]

In [5]:
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler().fit(data_train2)

normalized_data_train2 = pd.DataFrame(mm.transform(data_train2),
                                         columns = data_train2.columns,
                                         index = data_train2.index)

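# Correlations are computed only on titles with non-zero gross income
normalized_data_train2_nozero = normalized_data_train2.loc[normalized_data_train2['gross_income_log']>0]
# Measure of conflict: sum of (1 - r_ij) over all criteria
conflict = 1 - normalized_data_train2_nozero.corr()
conflict['sum'] = conflict.sum(axis=1)
# Note: the standard deviation here is taken on the un-normalized training data
conflict['stdev'] = data_train2.std()
# Quantity of information and final weights
conflict['info_qty'] = conflict['sum']*conflict['stdev']
weights = conflict['info_qty']/conflict['info_qty'].sum()
weights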

numVotes_log        0.187335
gross_income_log    0.412424
averageRating       0.400241
Name: info_qty, dtype: float64

In [6]:
data2 = data[['numVotes_log','gross_income_log','averageRating']].copy()

normalized_data = pd.DataFrame(mm.transform(data2),
                               columns = data2.columns,
                               index = data2.index)

data['composite_score'] = (normalized_data*weights).sum(axis=1)
data['composite_score'].describe()

count    299371.000000
mean          0.296507
std           0.114048
min           0.000000
25%           0.233092
50%           0.279616
75%           0.325864
max           0.944252
Name: composite_score, dtype: float64

## STEP 2b: TARGET ENCODING

Target encoding replaces each category of a categorical variable with the average value of the target variable (here, `composite_score`) for that category. This transforms categorical features into continuous values.

### ADVANTAGES
* Keeps the number of features low compared to traditional one-hot encoding, which can help reduce overfitting.
* Handles high-cardinality categorical data effectively.

### PROCESS
1. Group data by categorical feature.
2. Calculate the mean of the target variable for each category.
3. Replace the categorical variable with the calculated mean.
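
The process above can be sketched on a toy single-valued column. The column names `genre` and `score` and all values below are hypothetical and are not taken from the movie dataset.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama", "Action", "Comedy", "Drama"],
    "score": [0.30, 0.20, 0.50, 0.70, 0.60, 0.10],
})

# Steps 1 and 2: group by the categorical feature and average the target
category_means = toy.groupby("genre")["score"].mean()

# Step 3: replace each category with its mean target value
toy["genre_TE"] = toy["genre"].map(category_means)

print(category_means)
print(toy)
```

The notebook applies the same idea to comma-separated, multi-valued columns by splitting each entry, computing the means on the exploded values, and averaging the means of all categories listed for a title.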

In [7]:
data_TE = data[['primaryTitle','isAdult','startYear',
                          'averageRating','runtimeMinutes',
                          'numVotes_log','gross_income_log','composite_score']].copy()
data_TE.head()

tconst,primaryTitle,isAdult,startYear,averageRating,runtimeMinutes,numVotes_log,gross_income_log,composite_score
tt0000009,Miss Jerry,0,1894.0,5.4,45.0,2.326336,0.0,0.246708
tt0000147,The Corbett-Fitzsimmons Fight,0,1897.0,5.2,100.0,2.710117,0.0,0.250465
tt0000574,The Story of the Kelly Gang,0,1906.0,6.0,70.0,2.954725,0.0,0.294105
tt0000591,The Prodigal Son,0,1907.0,5.4,90.0,1.39794,0.0,0.216104
tt0000615,Robbery Under Arms,0,1907.0,4.3,,1.414973,0.0,0.167747


In [8]:
fill_cols = ['genres','actor','actress','casting_director','cinematographer',
             'composer','director','editor','producer','production_designer',
             'self','writer']

cols_to_catmeans = {}

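for col in fill_cols:

    # Treat missing values as an empty-string category and split multi-valued entries
    data[col] = data[col].fillna('')
    data[f'{col}_temp'] = data[col].str.split(',')
    # Steps 1 and 2: one row per category, then the mean composite score per category
    category_means = (data[[f'{col}_temp','composite_score']]
                      .explode(f'{col}_temp')
                      .groupby(f'{col}_temp')['composite_score']
                      .mean())

    cols_to_catmeans[col] = category_means
    # Step 3: Replace categories with the average of their mean target values
    data_TE[f'{col}_TE'] = data[f'{col}_temp'].apply(lambda x: np.mean([category_means[c] for c in x]))

    # Number of categories listed for the title (1 for missing values, which become [''])
    data_TE[f'{col}_COUNT'] = data[f'{col}_temp'].apply(len)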

### SPLIT AND SAVE THE TRAIN, VALIDATION, AND TEST DATASETS

In [9]:
# Split the data into three parts, dropping rows with missing values
data_train = data_TE.iloc[shuffled_indices[:train_size]].copy().dropna()
data_val = data_TE.iloc[shuffled_indices[train_size:train_size + val_size]].copy().dropna()
data_test = data_TE.iloc[shuffled_indices[train_size + val_size:]].copy().dropna()

In [11]:
data_test.head()

tconst,primaryTitle,isAdult,startYear,averageRating,runtimeMinutes,numVotes_log,gross_income_log,composite_score,genres_TE,genres_COUNT,...,editor_TE,editor_COUNT,producer_TE,producer_COUNT,production_designer_TE,production_designer_COUNT,self_TE,self_COUNT,writer_TE,writer_COUNT
tt0892426,River Bottom,0,1993.0,8.6,95.0,1.041393,0.0,0.346659,0.305057,1,...,0.346659,3,0.318383,3,0.282329,1,0.294165,1,0.260397,1
tt0048623,Sincerely Yours,0,1955.0,5.4,115.0,2.663701,0.0,0.257829,0.318802,3,...,0.368614,1,0.431397,1,0.282329,1,0.294165,1,0.294902,2
tt0053727,College Confidential,0,1960.0,4.8,91.0,2.348305,0.0,0.22075,0.305057,1,...,0.334658,1,0.387285,1,0.282329,1,0.22075,1,0.291923,2
tt5018116,Onekotan: The Lost Island,0,2015.0,6.2,52.0,1.462398,0.0,0.253806,0.336889,3,...,0.263569,1,0.253806,3,0.282329,1,0.294165,1,0.253806,1
tt5825358,The Night-Time Winds,0,2017.0,6.1,45.0,1.176091,0.0,0.239921,0.298918,3,...,0.242426,1,0.251678,2,0.282329,1,0.294165,1,0.24638,1


In [12]:
data_train.to_csv('processed_data/train.csv')
data_val.to_csv('processed_data/val.csv')
data_test.to_csv('processed_data/test.csv')

## SAVE THE TARGET ENCODINGS
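The per-category means are saved so that the same encodings can be reused later, for example to encode new data or to map encoded values back to their categories.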

In [15]:
import pickle
from datetime import datetime

curr_time = datetime.today().strftime('%Y-%m-%d %H-%M-%S')

FILENAME = f'./models/{curr_time}_col_to_catmeans.pkl'

# Use a context manager so the file is closed after writing
with open(FILENAME, 'wb') as f:
    pickle.dump(cols_to_catmeans, f)
print(FILENAME)

./models/2024-05-11 18-59-46_col_to_catmeans.pkl
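
A saved dictionary of this shape could later be reloaded and applied to new values along the following lines. This is only a sketch under stated assumptions: the stand-in dictionary, the temporary file path, the helper `encode`, and the fallback to the overall mean for unseen categories are all hypothetical and do not appear in the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the saved dictionary: column name -> Series of category means
cols_to_catmeans = {
    "genres": pd.Series({"Drama": 0.30, "Comedy": 0.24, "": 0.21}),
}

# Round-trip through a pickle file (a temporary path is used for illustration)
path = os.path.join(tempfile.mkdtemp(), "col_to_catmeans.pkl")
with open(path, "wb") as f:
    pickle.dump(cols_to_catmeans, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)


def encode(value: str, category_means: pd.Series) -> float:
    """Average the stored means of a comma-separated value.

    Categories missing from the stored encodings fall back to the overall
    mean of the stored values (an assumption made for this sketch).
    """
    fallback = category_means.mean()
    parts = value.split(",")
    return float(np.mean([category_means.get(p, fallback) for p in parts]))


known = encode("Drama,Comedy", loaded["genres"])
unseen = encode("Drama,Western", loaded["genres"])
print(known, unseen)
```

How unseen categories should be handled is a design choice; the overall mean is used here only to keep the example short.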
