# Producing Movies

Imagine you're a movie producer. Can you use AI to determine what will be a hit? Not really, but the question makes a great demonstration of how generative AI and predictive AI can work together.
Throughout this hands-on lab, we will use movie performance data to explore how predictive AI (think XGBoost) can be combined with generative AI (think ChatGPT). ChatGPT lets us imagine all sorts of new combinations of movies, plots, and themes, while predictive AI provides a more objective test of the ideas you generate.

## The Data

[OpenML]() hosts a few movie datasets. Other datasets with movie performance are available on Kaggle and elsewhere, but the OpenML ones are listed as public domain, so we will stick with them for now.

In [1]:
from sklearn.datasets import fetch_openml
from difflib import get_close_matches
import numpy as np
import pandas as pd 
base_movies_and_ratings = fetch_openml(data_id=43603).frame
base_movie_revenue = fetch_openml(data_id=43113).frame



**Movies and Rating #43603**
This data has star ratings of movies from an online forum. It also has a "Description" column with a concise plot summary of each movie, which is exactly what we need.

In [2]:
base_movies_and_ratings.head()

Unnamed: 0,Unnamed:_0,Title,Year,Rating,Metascore,Votes,Description,Genre,Runtime_(Minutes),Revenue_(Millions),Actors,Director
0,0,Avengers: Endgame,2019,8.5,78.0,648248,After the devastating events of Avengers: ...,"Action, Adventure, Drama",181,858.37,"Robert Downey Jr., Chris Evans, Mark Ruffalo, ...","Anthony Russo, Joe Russo"
1,1,Spider-Man: Far from Home,2019,7.6,69.0,255849,Following the events of Avengers: Endgame ...,"Action, Adventure, Sci-Fi",129,388.53,"Tom Holland, Samuel L. Jackson, Jake Gyllenhaa...",Jon Watts
2,2,Toy Story 4,2019,7.9,84.0,146740,"When a new toy called ""Forky"" joins Woody ...","Animation, Adventure, Comedy",100,433.03,"Tom Hanks, Tim Allen, Annie Potts, Tony Hale",Josh Cooley
3,3,Jumanji: The Next Level,2019,7.0,58.0,63856,"In Jumanji: The Next Level, the gang is ba...","Action, Adventure, Comedy",123,0.0,"Dwayne Johnson, Jack Black, Kevin Hart, Karen ...",Jake Kasdan
4,4,The Lighthouse,2019,7.8,83.0,50595,Two lighthouse keepers try to maintain the...,"Drama, Fantasy, Horror",109,0.43,"Robert Pattinson, Willem Dafoe, Valeriia Karaman",Robert Eggers


**Movie Revenue #43113**
This data has information about the revenue of each movie. It helps us find our hits.

In [3]:
base_movie_revenue.head()

Unnamed: 0,index,genres,id,keywords,original_title,release_date,revenue,status,title,cast,director
0,0,Action Adventure Fantasy Science Fiction,19995,culture clash future space war space colony so...,Avatar,2009-12-10,2787965087,Released,Avatar,Sam Worthington Zoe Saldana Sigourney Weaver S...,James Cameron
1,1,Adventure Fantasy Action,285,ocean drug abuse exotic island east india trad...,Pirates of the Caribbean: At World's End,2007-05-19,961000000,Released,Pirates of the Caribbean: At World's End,Johnny Depp Orlando Bloom Keira Knightley Stel...,Gore Verbinski
2,2,Action Adventure Crime,206647,spy based on novel secret agent sequel mi6,Spectre,2015-10-26,880674609,Released,Spectre,Daniel Craig Christoph Waltz Léa Seydoux ...,Sam Mendes
3,3,Action Crime Drama Thriller,49026,dc comics crime fighter terrorist secret ident...,The Dark Knight Rises,2012-07-16,1084939099,Released,The Dark Knight Rises,Christian Bale Michael Caine Gary Oldman Anne ...,Christopher Nolan
4,4,Action Adventure Science Fiction,49529,based on novel mars medallion space travel pri...,John Carter,2012-03-07,284139100,Released,John Carter,Taylor Kitsch Lynn Collins Samantha Morton Wil...,Andrew Stanton


I noticed the titles differ slightly between the files, so we have to be a little fancy in joining them together. The next two cells will take a minute to run. You can skip over them and just load the prepared data from a URL if needed.

In [4]:
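# Set RUN_DATA to True to rebuild the joined dataset instead of loading the prepared CSV
RUN_DATA = False

title_possibilites = base_movie_revenue.title.values.tolist()
def get_match(e: str):
    """Return the row position of the closest title in base_movie_revenue, or None."""
    matches = get_close_matches(e, title_possibilites)
    if len(matches) == 0:
        return None
    else:
        # index() returns the first row that carries the best-matching title
        i = title_possibilites.index(matches[0])
        return i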


In [9]:

if RUN_DATA: 
    base_movies_and_ratings['lookup_index'] = base_movies_and_ratings.Title.apply(get_match)
    joined_data = base_movies_and_ratings.join(base_movie_revenue[['revenue', 'status']], on='lookup_index', how='inner').drop(columns=['Revenue_(Millions)'])	
else:
    joined_data = pd.read_csv("https://storage.googleapis.com/public-artifacts-datarobot/Producers%20Chair%20Dataset.csv")
    
joined_data.head()

Unnamed: 0,Unnamed:_0,Title,Year,Rating,Metascore,Votes,Description,Genre,Runtime_(Minutes),Actors,Director,lookup_index,revenue,status
0,0,Avengers: Endgame,2019,8.5,78.0,648248,After the devastating events of Avengers: ...,"Action, Adventure, Drama",181,"Robert Downey Jr., Chris Evans, Mark Ruffalo, ...","Anthony Russo, Joe Russo",2592.0,15843608,Released
3654,3105,Highlander: Endgame,2000,4.6,21.0,18737,Immortals Connor and Duncan MacLeod must j...,"Action, Adventure, Fantasy",87,"Christopher Lambert, Adrian Paul, Bruce Payne,...",Douglas Aarniokoski,2592.0,15843608,Released
2,2,Toy Story 4,2019,7.9,84.0,146740,"When a new toy called ""Forky"" joins Woody ...","Animation, Adventure, Comedy",100,"Tom Hanks, Tim Allen, Annie Potts, Tony Hale",Josh Cooley,42.0,1066969703,Released
1462,913,Toy Story 3,2010,8.3,92.0,714751,The toys are mistakenly delivered to a day...,"Animation, Adventure, Comedy",103,"Tom Hanks, Tim Allen, Joan Cusack, Ned Beatty",Lee Unkrich,42.0,1066969703,Released
4,4,The Lighthouse,2019,7.8,83.0,50595,Two lighthouse keepers try to maintain the...,"Drama, Fantasy, Horror",109,"Robert Pattinson, Willem Dafoe, Valeriia Karaman",Robert Eggers,2917.0,93617009,Released
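
In the joined output above, "Avengers: Endgame" and "Highlander: Endgame" share the same `lookup_index` and revenue, which suggests the fuzzy match can pair a title with the wrong film. The sketch below shows one possible way to tighten the match using the `cutoff` parameter of `get_close_matches`. The helper name `get_match_strict`, the 0.9 threshold, and the three-title list are assumptions for illustration, not part of the original notebook.

```python
from difflib import get_close_matches

# Toy stand-in for base_movie_revenue.title (illustrative only)
revenue_titles = ["Highlander: Endgame", "Toy Story 3", "Spectre"]

def get_match_strict(title, candidates, cutoff=0.9):
    """Return the index of the best candidate scoring at least `cutoff`, else None."""
    matches = get_close_matches(title, candidates, n=1, cutoff=cutoff)
    if not matches:
        return None
    return candidates.index(matches[0])

# The default cutoff (0.6) accepts the wrong "Endgame" film
loose = get_close_matches("Avengers: Endgame", revenue_titles)

# A stricter cutoff rejects it but still finds exact titles
strict = get_match_strict("Avengers: Endgame", revenue_titles)
exact = get_match_strict("Toy Story 3", revenue_titles)

print(loose, strict, exact)
```

A stricter cutoff trades coverage for precision: fewer rows survive the inner join, but the revenue attached to each surviving row is more likely to belong to the right film.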


### Framing Our Problem 

Now we have a lot of data here. But if you're a producer, what do you really know when you evaluate
whether to make a movie?

1. The title
2. Maybe one or two actors who are interested
3. Maybe the director
4. The Genre
5. The plot summary

Everything else is target leakage, because you don't know it at the time of prediction. You don't know the studio or even the runtime.

Furthermore, what is our goal? Do we want to identify "a movie that will make $150 million," or do we just want to make sure our movie will return "above average"? Because our dataset only includes the top movies from the last 40 years, the average earnings are about $120 million, so above average is really good.

Let's set up our target columns. Alongside the raw revenue, we will shape three derived targets:

1. **revenue_pct_mean:** What percentage of an average film would this return? REGRESSION
2. **high_gross:** Will this return above average? CLASSIFICATION
3. **revenue_log1p:** How much revenue will this movie return (log-transformed because of outliers such as Avatar)? REGRESSION
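
To make the three derived targets concrete, here is a small worked example in plain Python. It mirrors the pandas expressions used in the next cell (clip the revenue, divide by the unclipped mean, round, multiply by 100). The four revenue figures are made up for illustration and are not taken from the dataset.

```python
import math

# Hypothetical revenues in dollars (not from the dataset)
revenues = [5_000_000, 60_000_000, 120_000_000, 295_000_000]

# The mean is taken over the unclipped revenues, as in the notebook
mean_revenue = sum(revenues) / len(revenues)

def clip(value, lower=10_000_000, upper=300_000_000):
    """Limit a revenue to the [lower, upper] range."""
    return max(lower, min(value, upper))

# Percentage of the average film's revenue
revenue_pct_mean = [round(clip(r) / mean_revenue, 2) * 100 for r in revenues]

# Strictly above average counts as high grossing
high_gross = [pct > 100 for pct in revenue_pct_mean]

# log1p compresses the long right tail of blockbuster revenues
revenue_log1p = [math.log1p(r) for r in revenues]

for row in zip(revenues, revenue_pct_mean, high_gross, revenue_log1p):
    print(row)
```

Note that a film earning exactly the average gets `revenue_pct_mean` of 100 and is therefore not flagged as `high_gross`, because the comparison is strict.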

In [10]:
import re

def white_space_cleaner(in_str: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    pattern = r'\s+'
    result = re.sub(pattern, ' ', in_str)
    return result.strip()
    
TARGET_COLUMNS = {
    'Straight Revenue': 'revenue',
    'Revenue as PCT Above a Clipped Mean': 'revenue_pct_mean',
    'Log of Revenue': 'revenue_log1p',
    'High Gross Classifier': 'high_gross'
}


FEATURE_COLUMNS = ['Title', 'Year',
       'Description',  'Actors', 'Director','Genre']
if RUN_DATA:
    joined_data['Director'] = joined_data.Director.copy().apply(white_space_cleaner)
    joined_data['Actors'] = joined_data.Actors.copy().apply(white_space_cleaner)
    joined_data['revenue_pct_mean'] = (joined_data.revenue.clip(lower=10_000_000, upper=300_000_000) / joined_data.revenue.mean()).round(2)* 100
    joined_data['high_gross']  = joined_data['revenue_pct_mean'] > 100
    joined_data['revenue_log1p'] = np.log1p(joined_data.revenue)
    joined_data.to_csv("Producers Chair Dataset.csv", index=False)

In [12]:
import altair as alt

def genre_splitter(in_col):
    """Return the first listed genre that is one of the headline genres, else "Other"."""
    genres = ["Action", "Comedy", "Drama", "Sci-Fi", "Horror"]
    for item in in_col.split(","):
        # Genres are separated by ", ", so strip the leading space before comparing
        item = item.strip()
        if item in genres:
            return item
    return "Other"


alt.Chart(joined_data.sample(5000).assign(primary_genre=lambda df: df.Genre.apply(genre_splitter)), title="Exploring Revenue", width=800, height=600).mark_bar().encode(
    alt.X('revenue:Q', bin=True),
    alt.Y('count():Q', title="Number of films", scale=alt.Scale(type='sqrt')),
    alt.Color("primary_genre:N")
)



## Vector Database 

In addition to the training data we just created, we also want to create a knowledge base of the movies. To do that, we are going to create a series of text documents from a template, one per movie. We will then zip up all these documents to create our vector database.

Note: if you're not using DataRobot, the code that creates the vector index is coming up in the next notebook (#2).

In [13]:
import re
import zipfile
from jinja2 import Template

movie_file_template = """
Title: {{Title}}

Description: {{Description}}



Earnings:

This movie earned {{revenue_pct_mean}}% of what the average movie earned.
"""
template = Template(movie_file_template)


def replace_non_alphanumeric(text):
  """Replaces non-alphanumeric characters in a string with underscores.

  Args:
      text: The input string.

  Returns:
      A new string with non-alphanumeric characters replaced by underscores.
  """
  return re.sub(r"\W", "_", text)



def write_file(zfile: zipfile.ZipFile, title: str, text: str, text_file_number: int):
    # Append the running number so that movies with the same title get distinct files
    filename = f'''{replace_non_alphanumeric(title) + str(text_file_number)}.txt'''
    with zfile.open(filename, 'w') as ifile:
        ifile.write(text.encode())
    return filename

with zipfile.ZipFile("VectorDatabaseData.zip", 'w') as zfile:
    text_file_numbers = 0
    for row in joined_data.itertuples():
        text = template.render(**row._asdict())
        filename = write_file(zfile,row.Title,  text, text_file_numbers)
        text_file_numbers += 1
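
As a quick sanity check of the naming scheme, the sketch below writes two documents to an in-memory zip and reads the file names back. It uses only the standard library; the helper `safe_name` and the two sample documents are assumptions for illustration that imitate `replace_non_alphanumeric` plus the running counter.

```python
import io
import re
import zipfile

def safe_name(title: str, number: int) -> str:
    """Replace non-word characters with underscores and append the counter."""
    return re.sub(r"\W", "_", title) + str(number) + ".txt"

# Two hypothetical documents standing in for rendered templates
docs = [
    ("Toy Story 4", "Title: Toy Story 4\n"),
    ("Spider-Man: Far from Home", "Title: Spider-Man: Far from Home\n"),
]

# Write the archive to memory instead of disk
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zfile:
    for number, (title, text) in enumerate(docs):
        zfile.writestr(safe_name(title, number), text)

# Read it back to confirm names and content survived the round trip
with zipfile.ZipFile(buffer) as zfile:
    names = zfile.namelist()
    first_doc = zfile.read(names[0]).decode()

print(names)
```

Because the counter is appended without a separator, "Toy Story 4" with counter 0 becomes `Toy_Story_40.txt`. That is harmless for uniqueness, but worth knowing when you look up a file by name.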

In [14]:

import datarobot as dr
# if running outside datarobot add your authentication here 
# dr.Client(token=??, endpoint=?? )

UPLOAD_NEW_DATA = False

if UPLOAD_NEW_DATA:
    ds = dr.Dataset.create_from_file("VectorDatabaseData.zip")
    ds.modify(name="Movie Information Vector DB")
    ds.update()


In [15]:
from datarobot.models.genai.vector_database import VectorDatabase

if UPLOAD_NEW_DATA:
    chunking_params = {
        "embeddingModel": "jinaai/jina-embedding-t-en-v1",
        "chunkingMethod": "recursive",
        "chunkSize": 256,
        "chunkOverlapPercentage": 20,
        "separators": [
            "\n\n",
            "\n",
            ""
        ],
        "isSeparatorRegex": False
    }
    vector_db = VectorDatabase.create(dataset_id=ds.id,  name="Movies The Vector DB", chunking_parameters=chunking_params)
    print("Vector DB Created")
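
To build some intuition for what `chunkSize` and `chunkOverlapPercentage` control, here is a deliberately simplified sketch of overlapping chunking. It is not DataRobot's implementation: it counts characters for simplicity, ignores the `separators` list, and the function name `chunk_text` is an assumption made for this illustration.

```python
def chunk_text(text: str, chunk_size: int = 256, overlap_pct: int = 20) -> list[str]:
    """Split text into fixed-size character windows that overlap by overlap_pct percent."""
    overlap = chunk_size * overlap_pct // 100
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# A 600-character dummy document
sample = "".join(chr(97 + i % 26) for i in range(600))
chunks = chunk_text(sample)
overlap_chars = 256 * 20 // 100

print(len(chunks), [len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.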

## Generating Predictive Models

The code below uses DataRobot to create four new projects, one per entry in `TARGET_COLUMNS`: straight revenue, the revenue percentage prediction, log of revenue, and the high-grossing classifier. We call this pattern a model factory. These experiments will all run simultaneously, building close to 75 models in total. Use the links to access them.

In [16]:
import datarobot as dr
from IPython.display import display, HTML
from datetime import date 
TODAY = date.today().isoformat()

RUN_PROJECTS_IN_DR = True


if RUN_PROJECTS_IN_DR:
    for title, target in TARGET_COLUMNS.items():
        title = title + " | " + TODAY
        ds_name = f"Data: {title}"
        # Reuse a dataset with this name if one already exists; otherwise upload it
        existing = [d for d in dr.Dataset.list() if d.name == ds_name]
        if existing:
            ds = existing[0]
        else:
            print(f"Creating Data For: {title}")
            df = joined_data[FEATURE_COLUMNS + [target]]
            ds = dr.Dataset.create_from_in_memory_data(df)
            ds.modify(name=ds_name)
            ds.update()
        print(f"Starting Experiment: {title}")
        project = dr.Project.create_from_dataset(ds.id, project_name=title)
        project.analyze_and_model(
            target=target,
            worker_count=-1
        )
        display(
            HTML(f"""<div style="text-align:center;padding:.75rem;">
        <a href="{project.get_uri()}" target="_blank" style="background-color:#5371BF;color:white;padding:.66rem .75rem;border-radius:5px;cursor: pointer;">Open Experiment in DataRobot</a>
    </div>""")
        )


Creating Data For: Straight Revenue | 2024-04-24
Starting Experiment: Straight Revenue | 2024-04-24


Creating Data For: Revenue as PCT Above a Clipped Mean | 2024-04-24
Starting Experiment: Revenue as PCT Above a Clipped Mean | 2024-04-24


Creating Data For: Log of Revenue | 2024-04-24
Starting Experiment: Log of Revenue | 2024-04-24


Creating Data For: High Gross Classifier | 2024-04-24
Starting Experiment: High Gross Classifier | 2024-04-24


### Model in scikit-learn

If you are not using DataRobot, the pipeline below recreates the percentage of revenue regressor. 

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin

# The DataFrame is 'joined_data'
# and the target variable is 'revenue_pct_mean'

# Separate the target variable from the features
X = joined_data[FEATURE_COLUMNS]
y = joined_data['revenue_pct_mean']

ALL_GENRES = joined_data['Genre'].str.replace(" ", "").str.get_dummies(",").columns.values.tolist()
genres_shape = joined_data['Genre'].str.replace(" ", "").str.get_dummies(",")

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


class GenreEncoder(BaseEstimator, TransformerMixin):
    """Expand the comma-separated Genre column into per-genre indicators, then one-hot encode them."""
    def __init__(self):
        self.one_hot_encoder = OneHotEncoder(handle_unknown='infrequent_if_exist', min_frequency=10)

    def fit(self, X, y=None):
        # Fit on the indicators of the full dataset (genres_shape) so every genre
        # column is known, even if a genre is missing from the training split
        self.one_hot_encoder.fit(genres_shape)
        self.genres = genres_shape
        return self

    def transform(self, X):
        genres = X['Genre'].str.replace(" ", "").str.get_dummies(",")
        # Add any genre column that does not occur in this batch
        for col in self.genres.columns:
            if col not in genres.columns:
                genres[col] = 0
        # Reorder the columns to match what the encoder saw during fit
        return self.one_hot_encoder.transform(genres[self.genres.columns])

# Step 1: Define the column transformers
numeric_transformer = SimpleImputer(strategy='mean')
text_transformer = TfidfVectorizer()
director_transformer = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=10_000)
genre_transformer = GenreEncoder()

# Step 2: Create the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['Year']),
        ('genre', genre_transformer, ['Genre']),
        ('title', text_transformer, 'Title'),
        ('desc', text_transformer, 'Description'),
        ('actors', text_transformer, 'Actors'),
        ('dir', director_transformer, ['Director'])
    ])

# Step 3: Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=10))
])
# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)



In [15]:
pipeline.score(X_test,y_test)

0.08913741093728145
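
For a regressor, `pipeline.score` returns the coefficient of determination (R²), so a value near 0.09 means the model explains only about 9% of the variance in `revenue_pct_mean` on the held-out films. The sketch below spells out the definition in plain Python; the function name `r2_score_manual` and the four target values are made up for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical revenue_pct_mean values for four films
y_true = [8.0, 50.0, 100.0, 246.0]

# Baseline: always predict the mean of the targets
baseline = [sum(y_true) / len(y_true)] * len(y_true)

r2_baseline = r2_score_manual(y_true, baseline)
r2_perfect = r2_score_manual(y_true, y_true)

print(r2_baseline, r2_perfect)
```

Always predicting the mean scores 0 and a perfect model scores 1, so the pipeline above is only slightly better than guessing the average. That fits the opening caveat that AI cannot really pick a hit.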