# 5: TOPIC ANALYSIS ON TITLES SIMILAR TO GLOBAL MAXIMA

Analyze keywords of titles whose features are similar to the global maximum of the composite score model.

## Step 5a: Recap of Global Maxima


In [17]:
# Load the pickled file
import pickle
import pandas as pd
pd.set_option('display.max_rows', 500)


# MODEL_FILENAME = './models/2024-05-08 19-28-45_Booster.pkl'
OPTIMALFEATURES_FILENAME = './models/2024-05-13 12-43-07_OPTIMALFEATURES.pkl'

with open(OPTIMALFEATURES_FILENAME, 'rb') as file:
    best_params = pickle.load(file)
    
best_params

{'isAdult': 0,
 'runtimeMinutes': 168.99999999999997,
 'genres_TE': 0.3844082717006556,
 'genres_COUNT': 3,
 'actor_TE': 0.6533680613825336,
 'actor_COUNT': 38,
 'actress_TE': 0.8189210693083288,
 'actress_COUNT': 1,
 'casting_director_TE': 0.8560667653884508,
 'casting_director_COUNT': 1,
 'cinematographer_TE': 0.0102518649082633,
 'cinematographer_COUNT': 1,
 'composer_TE': 0.8182351827308728,
 'composer_COUNT': 1,
 'director_TE': 0.8837858813422167,
 'director_COUNT': 1,
 'editor_TE': 0.8623019167399543,
 'editor_COUNT': 1,
 'producer_TE': 0.7347334584340207,
 'producer_COUNT': 1,
 'production_designer_TE': 0.9041301479025808,
 'production_designer_COUNT': 4,
 'self_TE': 0.7724457277542547,
 'self_COUNT': 30,
 'writer_TE': 0.736346111203962,
 'writer_COUNT': 3}

## Step 5b: Decode Target Encoded Values

We aim to find data points that are similar in feature values to the global maximum, based on Euclidean distance.

In [2]:
CATMEANS_FILENAME = './models/2024-05-11 18-59-46_col_to_catmeans.pkl'

with open(CATMEANS_FILENAME, 'rb') as file:
    col_to_catmeans = pickle.load(file)

In [3]:
# Function to find closest n values to x
def find_closest_values(series, x, n):
    # Calculate the absolute differences from x
    differences = series.sub(x).abs()
    # Sort these differences and get the top n closest
    closest_indices = differences.nsmallest(n).index
    return series.loc[closest_indices]

In [4]:
features = ['genres','actor','actress','casting_director','cinematographer',
             'composer','director','editor','producer','production_designer',
             'self','writer']

records = []
for feat in features:
    series = col_to_catmeans[feat.replace('_temp','')]
    x = best_params[f'{feat}_TE']
    n = best_params[f'{feat}_COUNT']
    
    for close_cat in find_closest_values(series, x, n).index:
        records.append([feat, close_cat])

Using `col_to_catmeans`, which stores the mean scores for each category, continuous values are decoded back to their respective categories.

Given that the values of the global minimum might not match exactly with the category means in `col_to_catmeans`, the closest target encoded value to the actual value will be selected.

**Example**

    'genres_TE': 0.3844082717006556,
    'genres_COUNT': 3
    
This conversion calls for decoding the value of `genres_TE` into the 3 closest target-encoded categories to the `genres_TE` value. Which means that the 3 closest categories are the following:
* Biography (`0.381871`)
* Sport (`0.363881`)
* History (`0.359789`)

In [5]:
col_to_catmeans['genres']

genres_temp
Action         0.322377
Adult          0.227223
Adventure      0.334267
Animation      0.351749
Biography      0.381871
Comedy         0.289577
Crime          0.343605
Documentary    0.312518
Drama          0.305057
Family         0.306678
Fantasy        0.327761
Film-Noir      0.425211
Game-Show      0.281996
History        0.359789
Horror         0.277046
Music          0.354070
Musical        0.311042
Mystery        0.340307
News           0.315217
Reality-TV     0.275147
Romance        0.297307
Sci-Fi         0.315240
Sport          0.363881
Talk-Show      0.274714
Thriller       0.314649
War            0.340275
Western        0.302530
Name: composite_score, dtype: float64

We do the same thing to decode the other feature values into categories.

In [6]:
import pandas as pd
crew = pd.DataFrame(records, columns=['feature', 'nconst'])

names = pd.read_csv('./raw_data/name.basics.tsv.gz', compression='gzip',
                    delimiter='\t', usecols=['nconst','primaryName'])

crew_with_names = pd.merge(crew, names, how='left', on='nconst')
crew_with_names['primaryName'] = crew_with_names['primaryName'].fillna(crew_with_names['nconst'])
crew_with_names = crew_with_names.drop('nconst',axis=1)

# Grouping by 'feature' and joining 'primaryName' values with commas
crew_with_names = crew_with_names.groupby('feature')['primaryName'].agg(', '.join).reset_index()
display(crew_with_names)

Unnamed: 0,feature,primaryName
0,actor,"Lawrence Ryle, Bobby Worrest, Aramis Hudson, M..."
1,actress,Jodie Farber
2,casting_director,Albert Tavares
3,cinematographer,Darko Stanojev
4,composer,Arcade Fire
5,director,Lee Unkrich
6,editor,Kevin Nolting
7,genres,"Biography, Sport, History"
8,producer,Bill Damaschke
9,production_designer,"Chris Sanders, Dan Cooper, Bob Pauley, Jim Pea..."


### Save dataset

In [7]:
crew_with_names.to_csv('./processed_data/ultimate_crossover.csv')

## Step 5c: Topic Analysis on Titles Similar to Global Maxima

## LOAD ENTIRE DATASET

In [8]:
# Load data
train = pd.read_csv('processed_data/train.csv', index_col=0).dropna()
val = pd.read_csv('processed_data/val.csv', index_col=0).dropna()
test = pd.read_csv('processed_data/test.csv', index_col=0).dropna()

data = pd.concat([train, val, test])
data_feat = data[list(best_params.keys())].copy()

Before getting the closest values, it is necessary to scale both the data and the global maximum first.

In [18]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_feat_mm = pd.DataFrame(scaler.fit_transform(data_feat),
                            columns=data_feat.columns,
                            index=data_feat.index)
data_feat_mm.head()

Unnamed: 0_level_0,isAdult,runtimeMinutes,genres_TE,genres_COUNT,actor_TE,actor_COUNT,actress_TE,actress_COUNT,casting_director_TE,casting_director_COUNT,cinematographer_TE,cinematographer_COUNT,composer_TE,composer_COUNT,director_TE,director_COUNT,editor_TE,editor_COUNT,producer_TE,producer_COUNT,production_designer_TE,production_designer_COUNT,self_TE,self_COUNT,writer_TE,writer_COUNT
tconst,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1
tt0097192,-0.120086,0.131018,-0.247498,-0.951871,0.07679,0.637351,-0.670992,0.000878,-1.247786,-0.226637,0.239683,-0.306884,0.002869,-0.32798,0.009835,-0.261377,0.31881,-0.377908,-0.72176,-0.634986,-0.368277,-0.179487,-0.106823,-0.25703,0.318651,0.275897
tt1588425,-0.120086,0.027601,0.263394,0.26638,0.110807,-1.604808,0.169335,-1.100954,-0.272904,-0.226637,-0.168444,1.707484,-0.351114,-0.32798,-0.31397,-0.261377,-0.170145,1.630531,-0.386767,-0.634986,-0.245896,-0.179487,-1.037798,4.773075,-0.203298,-0.786112
tt0065656,-0.120086,-0.009334,-0.781507,0.26638,0.912757,-0.483729,0.089135,2.204541,-0.272904,-0.226637,0.595486,-0.306884,-0.504568,1.730479,0.45335,-0.261377,0.100393,1.630531,0.619942,-0.634986,-0.245896,-0.179487,-0.106823,-0.25703,-0.016811,2.399916
tt0193617,-0.120086,-0.127526,-0.247498,-0.951871,-0.847507,0.263658,-0.727065,-1.100954,-0.272904,-0.226637,-0.592346,-0.306884,-0.837996,-0.32798,-0.497872,-0.261377,-0.528838,-0.377908,-0.386767,-0.634986,-0.864687,-0.179487,-0.106823,-0.25703,0.007107,-0.786112
tt8188734,-0.120086,0.263984,-0.247498,-0.951871,0.024815,0.263658,-0.039677,0.551794,-0.272904,-0.226637,-0.420344,-0.306884,-0.351114,-0.32798,-0.229908,-0.261377,-0.279689,-0.377908,0.025813,-0.634986,-0.245896,-0.179487,-0.106823,-0.25703,-0.209696,1.337907


Once scaled, the closest 100 titles with feature values similar to the global maximum as are listed below.

In [10]:
best_params_df = pd.DataFrame(pd.Series(best_params)).T
best_params_df = pd.DataFrame(scaler.transform(best_params_df[data_feat_mm.columns]),
                              columns=best_params_df.columns)
best_params_scaled = best_params_df.loc[0]

In [19]:
import pandas as pd
from scipy.spatial import distance

# Assuming data is your DataFrame with the dataset

# Function to calculate Euclidean distance
def calculate_distance(row):
    return distance.euclidean(row, best_params_scaled)

# Calculate distance from each row in the DataFrame to the best_params
data_feat_dist = data_feat.apply(calculate_distance, axis=1)

# Sort by distance and select the top 100 closest entries
closest_100 = data_feat_dist.sort_values().head(100)
movie_titles = data.loc[closest_100.index,'primaryTitle']
movie_titles.head(100)

tconst
tt0810779                                        Bound by Blood
tt1283967                                               Transit
tt0469119                                             Love Trap
tt1879071                                      The Fourth Reich
tt0025166                               George White's Scandals
tt8420478                                       The Eradication
tt13483480                          Tetonica Castro: Polar Star
tt19365132                                       Goodbye Tornio
tt32019988                                                Music
tt32019932                                                 Time
tt28028901                                             The Drop
tt31029252                                             The Vent
tt15438120                                                 Pupa
tt3052434                          Standard Operating Procedure
tt0480948          Vehicles and Weapons: Making 'Tomb Raider 2'
tt0273087                        

## Retrieving the most popular terms

TF-IDF, or Term Frequency-Inverse Document Frequency, is a statistical measure used to evaluate the importance of a word in a collection of documents. It helps identify the most relevant terms in each document.
* The **term frequency (TF)** part assesses how frequently a term appears in a specific document,
* while the **inverse document frequency (IDF)** part evaluates how rare or common the term is across the whole document set.
The combination of these two measures helps to prioritize terms that are frequent in a particular document but not too common across all documents, making them distinctive and potentially more relevant to the topic of the document. This makes TF-IDF particularly useful for extracting top terms that uniquely characterize each document.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,3))

# Fit and transform the series
tfidf_matrix = vectorizer.fit_transform(movie_titles)

# Get feature names to use as dataframe column headers
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame for the TF-IDF values
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

Using this approach, the top 3 terms are the following:

In [21]:
# Function to get top n terms with highest tf-idf scores for each document
def get_top_n_terms(dataframe, top_n=5):
    top_terms = {}
    for index, row in dataframe.iterrows():
        # Sort the TF-IDF values in descending order and select the top n
        sorted_row = row.sort_values(ascending=False)
        top_terms[index] = sorted_row.head(top_n).index.tolist()
    return top_terms

top_terms = 20
# Get top 5 terms for each document
top_terms_per_document = get_top_n_terms(df_tfidf, top_terms)

# If you want to find the overall top terms in the corpus:
top_terms_overall = df_tfidf.max(axis=0).sort_values(ascending=False).head(top_terms).index.tolist()

print("Top overall terms:", top_terms_overall)

Top overall terms: ['transit', 'time', 'maternity', 'epcot', 'extra', 'president', 'prosopagnosia', 'spooked', 'pupa', 'questionnaire', 'moishe', 'unhinged', 'music', 'dukkhito', 'drop', 'vent', 'eradication', 'the drop', 'the eradication', 'the vent']


## Suggested Title and Synopses from ChatGPT

The following top terms, I asked ChatGPT to generate titles and synopses with the following terms:

|     Title                                   |     Synopsis                                                                                                                                               |
|---------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
|     Transit   Through Time                  |     A   historical sports biography about an athlete overcoming adversities in   different eras.                                                           |
|     Extra   Presidential                    |     A   biographical film about a lesser-known but impactful president whose policies   left a lasting mark.                                               |
|     Prosopagnosia:   The Unseen Faces       |     A   deep, personal story about someone struggling with face blindness, possibly   an athlete or historical figure who must overcome this challenge.    |
|     Pupa:   Transformation of a Champion    |     A   biography focusing on an athlete's metamorphosis and trials.                                                                                       |
|     Unhinged   Melodies                     |     With   music by Arcade Fire, a film about an unstable but genius musician who   influenced a sport or historical event.                                |

## REFERENCES
* The CRITIC Method. *New Methods and Applications in Multiple Attribute Decision Making (MADM)*
* The Beauty of Bayesian Optimization, Explained in Simple Terms. https://towardsdatascience.com/the-beauty-of-bayesian-optimization-explained-in-simple-terms-81f3ee13b10f
* ChatGPT prompts to generate movie titles of top terms.