
# W266 Final Project: How does a computer describe a movie?

Data-to-text generation is an important field in NLP. The ability to translate raw structured data into human-readable sentences becomes more valuable as the volume of data available to data scientists continues to grow. Recent advances in data-to-text generation use a sequence-to-sequence encoder-decoder architecture ([Dusek and Jurcicek](https://aclanthology.org/P16-2008.pdf), 2016) to generate readable text from “meaning representations” that tell the model what to include in the string and where to include it. Later work has improved on this architecture by adding higher-level sentence planning and content selection ([Puduppully et al.](https://www.aaai.org/ojs/index.php/AAAI/article/view/4668), 2019), imposing pragmatic information-preservation techniques and metrics ([Shen et al.](https://aclanthology.org/N19-1410.pdf), 2019), and using hierarchical encodings of the structured input data ([Rebuffel et al.](https://link.springer.com/chapter/10.1007/978-3-030-45439-5_5), 2020).

Most developments in data-to-text generation have focused on generating uniform, factual summary sentences from highly structured inputs (e.g., “The Atlanta Hawks (46-12) defeated the Orlando Magic (19-41) 95-88 on Monday” or “The three star coffee shop, The Eagle, gives families a mid-priced dining experience”). But what happens when the structured input describes something more ambiguous, like a movie? Can a data-to-text model give a short description of a movie's plot based on its genre and more specific keywords related to its content? After all, much of the data that a model may need to summarize is not as fact-based as a sports game recap. For this project, I plan to use the popular [Movies Dataset](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv) from Kaggle to train a sequence-to-sequence model and evaluate how well it summarizes movie plots from the information in the dataset.

In [1]:
!pip install -q sentencepiece==0.1.94
!pip install -q transformers==3.4.0
!pip install -q simpletransformers==0.49.7
!pip install rouge
!pip install wikipedia

import random
import itertools
import csv
import wikipedia
import ast
import pandas as pd
import numpy as np
import altair as alt
import tensorflow as tf
from tensorflow import keras
from transformers import T5Tokenizer, T5Model, T5ForConditionalGeneration, AdamW
from rouge import Rouge
from pprint import pprint



## Function Definitions

In [2]:
def clean_list(str_):
    # cleans a stringified list of dicts from the movie metadata
    # returns a comma-separated string of the object names, or NaN if there are none

    # empty strings return nan
    if not str_:
        return np.nan
    # parsing string as list
    list_ = ast.literal_eval(str_)
    # empty lists return nan
    if not list_:
        return np.nan
    # keep just the name of each object (e.g., genre, keyword, etc.), leave out ID
    names = [obj.get("name") for obj in list_]
    # return the names joined into a single string, not a list of dicts
    return ", ".join(names)

def clean_crew(str_):
    # cleans the stringified list of crew dicts from the movie metadata
    # returns a dict of job-name pairs, or NaN if the input is empty

    # empty strings return nan
    if not str_:
        return np.nan
    # parsing string as list
    list_ = ast.literal_eval(str_)
    # empty lists return nan
    if not list_:
        return np.nan
    crew = {}
    for obj in list_:
        # add job-name pairs to the crew dict, skipping entries with no job listed
        # note: if several people share a job title, the last one listed wins
        if obj.get("job"):
            crew[obj.get("job")] = obj.get("name")
    # return dict of job-name pairs
    return crew

def wikipedia_overview(title_year):
  # title_year looks like "Title (YYYY)": split off the title and the year
  title = title_year[:-7]
  year = int(title_year[-5:-1])
  # candidate page names, tried from most to least specific
  # (release years on Wikipedia can be off by one from the dataset)
  candidates = [
      f"{title} ({year} film)",
      f"{title} ({year + 1} film)",
      f"{title} ({year - 1} film)",
      f"{title} (film)",
      title,
  ]
  for candidate in candidates:
    try:
      # first sentence of the matching Wikipedia page
      return wikipedia.summary(candidate, auto_suggest=False, sentences=1)
    except wikipedia.exceptions.PageError:
      # no page with this name: fall through to the next candidate
      continue
    except (wikipedia.exceptions.DisambiguationError, KeyError):
      # ambiguous title or malformed response: give up on this movie
      return np.nan
  # none of the candidate page names exist
  return np.nan
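
To make the cleaning step concrete, here is a minimal sketch of what a raw metadata cell looks like before and after parsing. `names_from_dict_list` is a hypothetical stand-in for `clean_list` that returns `None` instead of NaN so it runs without NumPy, and the sample string (including its ID values) is made up for illustration.

```python
import ast

def names_from_dict_list(raw):
    """Hypothetical stand-in for clean_list: join the "name" fields."""
    # empty strings carry no information
    if not raw:
        return None
    # the CSV stores each list of dicts as a Python-literal string
    records = ast.literal_eval(raw)
    # empty lists carry no information either
    if not records:
        return None
    # keep only the names and drop the IDs
    return ", ".join(record["name"] for record in records)

# illustrative cell in the same shape as the "genres" column
raw_genres = "[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name': 'Crime'}]"
print(names_from_dict_list(raw_genres))  # Drama, Crime
print(names_from_dict_list("[]"))        # None
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which is safer for strings read from a file.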

## Reading/Cleaning Datasets

In [None]:
# reading in data: overall metadata, keywords, and cast/crew
movies_meta = pd.read_csv("movies_metadata.csv").set_index("id")
keywords = pd.read_csv("keywords.csv").set_index("id")
credits = pd.read_csv("credits.csv").set_index("id")

# dropping three films with data read issues
movies_meta.drop(["1997-08-20", "2012-09-29", "2014-01-01"], inplace=True)
# converting movies_meta index to int for join
movies_meta.index = movies_meta.index.astype(int)


In [None]:
# cleaning several columns in data for easier NLP parsing
meta_cols = ["genres", "spoken_languages", "production_companies"]
for col in meta_cols:
    movies_meta[col] = movies_meta[col].fillna("").apply(clean_list)
keywords["keywords"] = keywords["keywords"].fillna("").apply(clean_list)
credits["cast"] = credits["cast"].fillna("").apply(clean_list)
credits["crew"] = credits["crew"].fillna("").apply(clean_crew)

In [None]:
# printing index values that are not ints
# issues with data read for three films?
for i in movies_meta.index:
    try:
        i = int(i)
    except (ValueError, TypeError):
        print(i)

In [None]:
# joining movies_meta, keywords, and credits DataFrames on movie ID
movies_joined = movies_meta.join(keywords).join(credits)
def get_director(crew):
  try:
    return crew.get("Director")
  except AttributeError:
    return np.nan
movies_joined["director"] = movies_joined["crew"].apply(get_director)
movies_joined.dropna(subset=["title", "genres", "keywords", "cast", "crew", "overview", "director"], inplace=True)
movies_joined.reset_index(inplace=True)
movies_joined["title_year"] = movies_joined["title"] + " (" + movies_joined["release_date"].str.slice(start=0, stop=4) + ")"
# movies_joined

In [None]:
movies_joined.to_csv("movies_joined.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)

In [None]:
movies_subset = movies_joined.sample(n=500)
movies_subset["wikipedia"] = movies_subset["title_year"].apply(wikipedia_overview)
movies_subset.dropna(subset=["wikipedia"], inplace=True)
# movies_subset

In [None]:
movies_subset.to_csv("movies_subset.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)

In [None]:
movies_test = movies_joined.sample(n=500)
movies_test["wikipedia"] = movies_test["title_year"].apply(wikipedia_overview)
movies_test.dropna(subset=["wikipedia"], inplace=True)
# movies_test

In [None]:
# dropping any test movies that also appear in the validation subset
# (`x in series` checks the index, not the values, so isin is used instead)
movies_test = movies_test[~movies_test["id"].isin(movies_subset["id"])]

In [None]:
movies_test.to_csv("movies_test.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)
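
An alternative way to get disjoint validation and test sets is to shuffle the movie IDs once and slice the result, so no overlap check is needed afterwards. The sketch below is only an illustration under that assumption: `split_ids` and its `seed` parameter are hypothetical names, and it works on plain IDs rather than on the DataFrame used in this notebook.

```python
import random

def split_ids(ids, n_validation, n_test, seed=0):
    """Hypothetical helper: shuffle once, then take two non-overlapping slices."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    validation = shuffled[:n_validation]
    test = shuffled[n_validation:n_validation + n_test]
    return validation, test

# stand-in IDs; in the notebook these would come from movies_joined["id"]
validation_ids, test_ids = split_ids(range(30189), n_validation=500, n_test=500)
print(len(validation_ids), len(test_ids))
print(set(validation_ids) & set(test_ids))  # empty set: no shared movies
```

The Wikipedia lookup would then run on each slice separately, exactly as in the cells above.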

## Reading Checkpointed Datasets

In [2]:
movies_joined = pd.read_csv("movies_joined.csv")
movies_joined.head()

Unnamed: 0,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,keywords,cast,crew,director,title_year
0,2,False,,0,"Drama, Crime",,tt0094675,fi,Ariel,Taisto Kasurinen is a Finnish coal miner whose...,...,,Ariel,False,7.1,44.0,"underdog, prison, factory worker, prisoner, he...","Turo Pajala, Susanna Haavisto, Matti Pellonpää...","{'Screenplay': 'Aki Kaurismäki', 'Director': '...",Aki Kaurismäki,Ariel (1988)
1,3,False,,0,"Drama, Comedy",,tt0092149,fi,Varjoja paratiisissa,"An episode in the life of Nikander, a garbage ...",...,,Shadows in Paradise,False,7.1,35.0,"salesclerk, helsinki, garbage, independent film","Matti Pellonpää, Kati Outinen, Sakari Kuosmane...","{'Screenplay': 'Aki Kaurismäki', 'Director': '...",Aki Kaurismäki,Shadows in Paradise (1986)
2,5,False,,4000000,"Crime, Comedy",,tt0113101,en,Four Rooms,It's Ted the Bellhop's first night on the job....,...,Twelve outrageous guests. Four scandalous requ...,Four Rooms,False,6.5,539.0,"hotel, new year's eve, witch, bet, hotel room,...","Tim Roth, Antonio Banderas, Jennifer Beals, Ma...",{'Original Music Composer': 'Combustible Ediso...,Quentin Tarantino,Four Rooms (1995)
3,6,False,,0,"Action, Thriller, Crime",,tt0107286,en,Judgment Night,"While racing to a boxing match, Frank, Mike, J...",...,Don't move. Don't whisper. Don't even breathe.,Judgment Night,False,6.4,79.0,"chicago, drug dealer, boxing match, escape, on...","Emilio Estevez, Cuba Gooding Jr., Denis Leary,...","{'Director': 'Stephen Hopkins', 'Screenplay': ...",Stephen Hopkins,Judgment Night (1993)
4,11,False,"{'id': 10, 'name': 'Star Wars Collection', 'po...",11000000,"Adventure, Action, Science Fiction",http://www.starwars.com/films/star-wars-episod...,tt0076759,en,Star Wars,Princess Leia is captured and held hostage by ...,...,"A long time ago in a galaxy far, far away...",Star Wars,False,8.1,6778.0,"android, galaxy, hermit, death star, lightsabe...","Mark Hamill, Harrison Ford, Carrie Fisher, Pet...","{'Director': 'George Lucas', 'Executive Produc...",George Lucas,Star Wars (1977)


In [3]:
# printing the number of films with columns of interest filled
len(movies_joined.dropna(subset=["keywords", "genres", "cast", "crew", "overview"]))

30189

In [4]:
movies_subset = pd.read_csv("movies_subset.csv")
movies_subset.head()

Unnamed: 0,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,...,title,video,vote_average,vote_count,keywords,cast,crew,director,title_year,wikipedia
0,106194,False,,0,"Documentary, Foreign",,tt0469772,sv,"Bergman och filmen, Bergman och teatern, Bergm...",This is the first time ever that a film maker ...,...,Bergman Island,False,7.0,4.0,woman director,"Ingmar Bergman, Marie Nyreröd, Erland Josephson",{'Director': 'Marie Nyreröd'},Marie Nyreröd,Bergman Island (2004),Bergman Island is a 2021 romantic drama film w...
1,1599,False,,22000000,"Comedy, Romance",,tt0356721,en,I Heart Huckabees,"A husband-and-wife team play detective, but no...",...,I Heart Huckabees,False,6.2,245.0,"sex, detective, jealousy, humor, protest, wife...","Jason Schwartzman, Dustin Hoffman, Lily Tomlin...","{'Music': 'Jon Brion', 'Director of Photograph...",David O. Russell,I Heart Huckabees (2004),I Heart Huckabees (stylized as I ♥ Huckabees; ...
2,35405,False,,0,"Action, Comedy",,tt0072107,en,S*P*Y*S,"Two CIA bunglers (Donald Sutherland, Elliott G...",...,S*P*Y*S,False,5.9,5.0,"murder, espionage, hijinks","Elliott Gould, Donald Sutherland, Zouzou, Joss...","{'Director': 'Irvin Kershner', 'Screenplay': '...",Irvin Kershner,S*P*Y*S (1974),S*P*Y*S is a 1974 American spy comedy film dir...
3,70971,False,,0,"Drama, Romance",,tt0082404,en,Four Friends,Also known as Moritorium and Georgia's Friends...,...,Four Friends,False,7.0,4.0,"adolescence, 1960s","Craig Wasson, Jodi Thelen, Michael Huddleston,...","{'Production Design': 'David Chapman', 'Costum...",Arthur Penn,Four Friends (1981),Four Friends is a 1981 American comedy-drama f...
4,40718,False,,0,"Action, Drama",,tt0040946,en,Wake of the Red Witch,Captain Ralls fights Dutch shipping magnate Ma...,...,Wake of the Red Witch,False,5.2,12.0,"based on novel, captain, fight, love, shipping...","John Wayne, Gail Russell, Gig Young, Adele Mar...","{'Screenplay': 'Kenneth Gamet', 'Director of P...",Edward Ludwig,Wake of the Red Witch (1948),Wake of the Red Witch is a 1948 American adven...


In [5]:
movies_test = pd.read_csv("movies_test.csv")
movies_test.head()

Unnamed: 0,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,...,title,video,vote_average,vote_count,keywords,cast,crew,director,title_year,wikipedia
0,7510,False,,12000000,"Drama, Comedy",,tt0439289,en,Running with Scissors,Young Augusten Burroughs absorbs experiences t...,...,Running with Scissors,False,5.8,80.0,"gay, sister sister relationship, wife husband ...","Annette Bening, Brian Cox, Joseph Cross, Josep...","{'Producer': 'Bonnie Weis', 'Casting': 'Mali F...",Ryan Murphy,Running with Scissors (2006),Running with Scissors is a 2006 American comed...
1,115162,False,,0,"Crime, Drama",,tt0056921,fr,Chair de poule,"In this French crime drama, two safe-crackers ...",...,Chair de poule,False,6.1,4.0,french noir,"Robert Hossein, Jean Sorel, Catherine Rouvel, ...","{'Adaptation': 'Julien Duvivier', 'Dialogue': ...",Julien Duvivier,Chair de poule (1963),"Chair de poule (French for ""goosebumps"") is a ..."
2,368006,False,,10000000,"Action, Science Fiction, Thriller",,tt4981966,ta,24,A scientist invents a time machine but his evi...,...,24,False,7.1,22.0,time travel,"Suriya , Samantha Akkineni, Nithya Menen, Sara...","{'Director': 'Vikram Kumar', 'Editor': 'Prawin...",Vikram Kumar,24 (2016),24 is a 2016 Indian Tamil-language science fic...
3,26235,False,"{'id': 251383, 'name': 'The Gamers Collection'...",0,"Adventure, Comedy, Fantasy",http://deadgentlemen.com/projects/the-gamers/t...,tt0447166,en,The Gamers: Dorkness Rising,All Lodge wants is for his gaming group to fin...,...,The Gamers: Dorkness Rising,False,7.6,14.0,independent film,"Nathan Rice, Carol Roscoe, Brian Lewis, Scott ...","{'Stunt Coordinator': 'Geoff Gibbs', 'Director...",Matt Vancil,The Gamers: Dorkness Rising (2008),The Gamers: Dorkness Rising is a feature-lengt...
4,30068,False,,0,Documentary,,tt0293088,en,Devil's Playground,The Devil's Playground is a fascinating and mo...,...,Devil's Playground,False,6.7,14.0,"amish, woman director, rumspringa","Velda Bontrager, Mark Bontrager, Dewayne Chupp...",{'Director': 'Lucy Walker'},Lucy Walker,Devil's Playground (2002),Devil's Playground is a 2002 American document...


## Creating Model Inputs

In [7]:
# creating validation set
wiki_pairs = []
tmdb_pairs = []
for i in movies_subset.index:
#     input_string = f'''\
# Title[{movies_subset.loc[i, 'title']}], \
# Genres[{movies_subset.loc[i, 'genres']}], \
# Keywords[{movies_subset.loc[i, 'keywords']}], \
# Cast[{movies_subset.loc[i, 'cast']}], \
# {", ".join([f"{job}[{name}]" for job, name in movies_subset.loc[i, "crew"].items()])}\
# '''
    input_string = f'''\
{movies_subset.loc[i, 'title']} is a \
{movies_subset.loc[i, 'genres'].lower().replace(',', ' ')} film \
directed by {movies_subset.loc[i, 'director']}. \
{movies_subset.loc[i, 'title']} was released in {movies_subset.loc[i, 'title_year'][-5:-1]}. \
{movies_subset.loc[i, 'title']} stars {movies_subset.loc[i, 'cast'].split(', ')[0]}. \
{movies_subset.loc[i, 'title']} contains {movies_subset.loc[i, 'keywords']}. \
'''
    wiki_overview = movies_subset.loc[i, "wikipedia"]
    wiki_pair = (input_string, wiki_overview)
    wiki_pairs.append(wiki_pair)
    tmdb_overview = movies_subset.loc[i, "overview"]
    tmdb_pair = (input_string, tmdb_overview)
    tmdb_pairs.append(tmdb_pair)

In [8]:
# creating test set
wiki_test = []
tmdb_test = []
for i in movies_test.index:
#     input_string = f'''\
# Title[{movies_subset.loc[i, 'title']}], \
# Genres[{movies_subset.loc[i, 'genres']}], \
# Keywords[{movies_subset.loc[i, 'keywords']}], \
# Cast[{movies_subset.loc[i, 'cast']}], \
# {", ".join([f"{job}[{name}]" for job, name in movies_subset.loc[i, "crew"].items()])}\
# '''
    input_string = f'''\
{movies_test.loc[i, 'title']} is a \
{movies_test.loc[i, 'genres'].lower().replace(',', ' ')} film \
directed by {movies_test.loc[i, 'director']}. \
{movies_test.loc[i, 'title']} was released in {movies_test.loc[i, 'title_year'][-5:-1]}. \
{movies_test.loc[i, 'title']} stars {movies_test.loc[i, 'cast'].split(', ')[0]}. \
{movies_test.loc[i, 'title']} contains {movies_test.loc[i, 'keywords']}. \
'''
    wiki_overview = movies_test.loc[i, "wikipedia"]
    wiki_pair = (input_string, wiki_overview)
    wiki_test.append(wiki_pair)
    tmdb_overview = movies_test.loc[i, "overview"]
    tmdb_pair = (input_string, tmdb_overview)
    tmdb_test.append(tmdb_pair)

In [9]:
wiki_pairs[:5]

[('Bergman Island is a documentary  foreign film directed by Marie Nyreröd. Bergman Island was released in 2004. Bergman Island stars Ingmar Bergman. Bergman Island contains woman director. ',
  'Bergman Island is a 2021 romantic drama film written and directed by Mia Hansen-Løve.'),
 ('I Heart Huckabees is a comedy  romance film directed by David O. Russell. I Heart Huckabees was released in 2004. I Heart Huckabees stars Jason Schwartzman. I Heart Huckabees contains sex, detective, jealousy, humor, protest, wife, celebrity, rivalry, independent film, religion, universe, anger, nature, husband, existentialism, enviroment, issues. ',
  'I Heart Huckabees (stylized as I ♥ Huckabees; also I Love Huckabees) is a 2004 independent black comedy film directed and produced by David O. Russell, who co-wrote the screenplay with Jeff Baena.'),
 ('S*P*Y*S is a action  comedy film directed by Irvin Kershner. S*P*Y*S was released in 1974. S*P*Y*S stars Elliott Gould. S*P*Y*S contains murder, espionag

In [10]:
tmdb_pairs[:5]

[('Bergman Island is a documentary  foreign film directed by Marie Nyreröd. Bergman Island was released in 2004. Bergman Island stars Ingmar Bergman. Bergman Island contains woman director. ',
  "This is the first time ever that a film maker has access to Ingmar Bergman in his home at the small island Fårö in the Baltic Sea. Bergman and the Cinema starts with Frenzy from 1944 and ends with Saraband from 2003. It contains unique behind-the-scenes material from Bergman's private archive. Bergman and the Theatre is about some of Bergman's 125 theatrical stagings and about his delight with the TV medium with successes as Scenes from a marriage. In Bergman and Fårö Island he talks about the childhood that shaped him. He shows where he shot his film Persona and fell in love - and he lists his worst demons!"),
 ('I Heart Huckabees is a comedy  romance film directed by David O. Russell. I Heart Huckabees was released in 2004. I Heart Huckabees stars Jason Schwartzman. I Heart Huckabees contain
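
The template used in the two loops above can also be written as a small function, which makes it easier to reuse for the validation and test sets. This is a sketch rather than the notebook's own code: `build_input_string` is a hypothetical name, it takes a plain dict instead of a DataFrame row, and unlike the loops above it collapses the genre list to single spaces and drops the trailing space.

```python
def build_input_string(record):
    """Hypothetical helper: turn one movie record into the model's input text."""
    title = record["title"]
    # "Drama, Comedy" -> "drama comedy"
    genres = " ".join(g.strip().lower() for g in record["genres"].split(","))
    # title_year looks like "Title (YYYY)"
    year = record["title_year"][-5:-1]
    # only the first-billed cast member is used
    lead_actor = record["cast"].split(", ")[0]
    return (
        f"{title} is a {genres} film directed by {record['director']}. "
        f"{title} was released in {year}. "
        f"{title} stars {lead_actor}. "
        f"{title} contains {record['keywords']}."
    )

# values taken from the "Shadows in Paradise" row shown earlier (cast shortened)
example = {
    "title": "Shadows in Paradise",
    "genres": "Drama, Comedy",
    "director": "Aki Kaurismäki",
    "title_year": "Shadows in Paradise (1986)",
    "cast": "Matti Pellonpää, Kati Outinen",
    "keywords": "salesclerk, helsinki, garbage, independent film",
}
print(build_input_string(example))
```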

## Creating & Fine-Tuning T5 Model

In [76]:
# creating clean pre-trained T5 model
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

In [77]:
# optimizer (both parameter groups use weight_decay=0.0, so the split has no effect as written)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in t5_model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
    {
        "params": [p for n, p in t5_model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-4, eps=1e-8)

In [78]:
# fine-tune loop
t5_model.train()

epochs = 5

few_shot_train = random.sample(wiki_pairs, 15)
pprint(few_shot_train)

for epoch in range(epochs):
  print ("epoch ",epoch)
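  for source_text, target_text in few_shot_train:
    input_sent = f"describe: {source_text} </s>"
    output_sent = target_text + " </s>"

    # truncation=True makes the 192-token limit explicit (it was previously applied by default, with a warning)
    tokenized_inp = t5_tokenizer.encode_plus(input_sent, max_length=192, pad_to_max_length=True, truncation=True, return_tensors="pt")
    tokenized_output = t5_tokenizer.encode_plus(output_sent, max_length=192, pad_to_max_length=True, truncation=True, return_tensors="pt")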

    input_ids  = tokenized_inp["input_ids"]
    attention_mask = tokenized_inp["attention_mask"]

    lm_labels= tokenized_output["input_ids"]
    decoder_attention_mask=  tokenized_output["attention_mask"]

    # the forward function automatically creates the correct decoder_input_ids
    output = t5_model(input_ids=input_ids, labels=lm_labels,decoder_attention_mask=decoder_attention_mask,attention_mask=attention_mask)
    loss = output[0]

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()


[('Mystery Train is a crime  drama  comedy film directed by Jim Jarmusch. '
  'Mystery Train was released in 1989. Mystery Train stars Masatoshi Nagase. '
  'Mystery Train contains hotel room, memphis, widow, elvis, firearm, '
  'independent film. ',
  'Mystery Train is a 1989 comedy-drama anthology film written and directed by '
  'Jim Jarmusch and set in Memphis, Tennessee.'),
 ('Crouching Tiger, Hidden Dragon: Sword of Destiny is a action  adventure  '
  'drama film directed by Yuen Woo-ping. Crouching Tiger, Hidden Dragon: Sword '
  'of Destiny was released in 2016. Crouching Tiger, Hidden Dragon: Sword of '
  'Destiny stars Michelle Yeoh. Crouching Tiger, Hidden Dragon: Sword of '
  'Destiny contains martial arts, wuxia. ',
  'Crouching Tiger, Hidden Dragon: Sword of Destiny (Chinese: 卧虎藏龙：青冥宝剑) is a '
  '2016 American-Chinese wuxia film directed by Yuen Woo-ping and written by '
  'John Fusco, based on the novel Iron Knight, Silver Vase by Wang Dulu.'),
 ('A Successful Calamity i


epoch  1
epoch  2
epoch  3
epoch  4
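
One detail worth noting about the loop above: the labels are padded to 192 tokens, and the padding positions are passed to the model as ordinary targets. A common refinement with Hugging Face models is to replace padding IDs in the labels with -100, which the loss function skips. The sketch below shows the idea on plain Python lists; `mask_padding` is a hypothetical helper, the token IDs are made up, and the results reported in this notebook were produced without this step.

```python
PAD_TOKEN_ID = 0      # T5's padding token id
IGNORE_INDEX = -100   # label value that the cross-entropy loss ignores

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Hypothetical helper: hide padding positions from the loss."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]

# made-up label ids: four content tokens, </s> (id 1), then three padding tokens
padded_labels = [8774, 19, 3, 9, 1, 0, 0, 0]
print(mask_padding(padded_labels))
# [8774, 19, 3, 9, 1, -100, -100, -100]
```

With tensors, the equivalent would be an in-place assignment such as `lm_labels[lm_labels == t5_tokenizer.pad_token_id] = -100` before the forward pass.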


## Evaluating Fine-Tuned Model

In [80]:
# unit testing to see what the model creates for an individual movie
test_ind = 159
# input_text = "John Wick is a action  thriller film directed by David Leitch. John Wick was released in 2014. John Wick stars Keanu Reeves. John Wick contains hitman, russian mafia, revenge, murder, gangster, dog, retired, widower."
# test_sent = f'describe: {input_text} </s>'
test_sent = f'describe: {wiki_pairs[test_ind][0]} </s>'
print(test_sent)
test_tokenized = t5_tokenizer.encode_plus(test_sent, return_tensors="pt")

test_input_ids  = test_tokenized["input_ids"]
test_attention_mask = test_tokenized["attention_mask"]

t5_model.eval()
beam_outputs = t5_model.generate(
    input_ids=test_input_ids,attention_mask=test_attention_mask,
    min_length=40,
    max_length=120,
    early_stopping=True,
    num_beams=9,
    num_return_sequences=3,
    no_repeat_ngram_size=1
)

for beam_output in beam_outputs:
    sent = t5_tokenizer.decode(beam_output, skip_special_tokens=True,clean_up_tokenization_spaces=True)
    pprint(sent)

describe: Ai Weiwei: Never Sorry is a documentary film directed by Alison Klayman. Ai Weiwei: Never Sorry was released in 2012. Ai Weiwei: Never Sorry stars Ai Weiwei. Ai Weiwei: Never Sorry contains freedom of speech, social activism, dissident, chinese communists, sichuan earthquake, contemporary art, twitter, social media, duringcreditsstinger, woman director.  </s>
('Never Sorry is a 2012 American-chinese documentary film directed by Alison '
 'Klayman, which has been released in China since the mid-1990s. It also '
 'features interviews with famous Chinese artists and activists who have taken '
 'to social media for fear of being ridiculed as "the most powerful person on '
 'earth."')
('Never Sorry is a 2012 American-chinese documentary film directed by Alison '
 'Klayman, which has been released in China since the mid-1990s. It also '
 'features interviews with famous Chinese artists and activists who have taken '
 'to social media for fear of being ridiculed as "the most powerfu

In [73]:
# ROUGE scores with baseline plot overview
pprint(tmdb_pairs[test_ind][1])
pprint(Rouge().get_scores(t5_tokenizer.decode(beam_outputs[0], skip_special_tokens=True,clean_up_tokenization_spaces=True), tmdb_pairs[test_ind][1]))

('Ai Weiwei is known for many things – great architecture, subversive '
 'in-your-face art, and political activism. He has also called for greater '
 'transparency on the part of the Chinese state. Director Alison Klayman '
 'chronicles the complexities of Ai’s life for three years, beginning with his '
 'rise to public prominence via blog and Twitter after he questioned the '
 'deaths of more than 5,000 students in the 2008 Sichuan earthquake. The '
 'record continues through his widely publicized arrest in Beijing in April of '
 '2011. As Ai prepares various works of art for major international '
 'exhibitions, his activism heats up, and his run-ins with China’s authorities '
 'become more and more frequent.')
[{'rouge-1': {'f': 0.15652173519243864,
              'p': 0.2903225806451613,
              'r': 0.10714285714285714},
  'rouge-2': {'f': 0.014598536725452341,
              'p': 0.03333333333333333,
              'r': 0.009345794392523364},
  'rouge-l': {'f': 0.13913043084461

In [74]:
# ROUGE scores with Wikipedia description
pprint(wiki_pairs[test_ind][1])
pprint(Rouge().get_scores(t5_tokenizer.decode(beam_outputs[0], skip_special_tokens=True,clean_up_tokenization_spaces=True), wiki_pairs[test_ind][1]))

('Ai Weiwei: Never Sorry (in Chinese 艾未未：道歉你妹; official title in Taiwan '
 '艾未未：草泥馬) is a 2012 documentary film about Chinese artist and activist Ai '
 'Weiwei, directed by American filmmaker Alison Klayman.')
[{'rouge-1': {'f': 0.5084745712841139,
              'p': 0.4838709677419355,
              'r': 0.5357142857142857},
  'rouge-2': {'f': 0.20338982550991105, 'p': 0.2, 'r': 0.20689655172413793},
  'rouge-l': {'f': 0.4406779611146223,
              'p': 0.41935483870967744,
              'r': 0.4642857142857143}}]
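
To make the `p`, `r`, and `f` values above easier to read, here is a minimal sketch of ROUGE-1 as unigram overlap between a generated sentence and a reference. It assumes lowercased whitespace tokenization and clipped counts; the `rouge` package may tokenize and count differently, so this illustrates the metric and is not a reimplementation of the library. The two sentences are shortened, made-up examples.

```python
from collections import Counter

def rouge1(hypothesis, reference):
    """Sketch of ROUGE-1: unigram precision, recall, and F1."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # each shared word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return {"p": 0.0, "r": 0.0, "f": 0.0}
    precision = overlap / sum(hyp_counts.values())   # share of generated words found in the reference
    recall = overlap / sum(ref_counts.values())      # share of reference words that were generated
    f1 = 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

scores = rouge1(
    "a 2012 documentary film directed by Alison Klayman",
    "a 2012 documentary film about Ai Weiwei",
)
print(scores)  # 4 shared words: p = 4/8, r = 4/7, f = 8/15
```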


In [None]:
# evaluating on the validation set (sampled from 500 movies; rows without a Wikipedia summary were dropped)
rouge1_wiki_highest = []
rouge1_tmdb_highest = []
for test_ind in range(len(wiki_pairs)):
  test_sent = f'describe: {wiki_pairs[test_ind][0]} </s>'
  test_tokenized = t5_tokenizer.encode_plus(test_sent, return_tensors="pt")

  test_input_ids  = test_tokenized["input_ids"]
  test_attention_mask = test_tokenized["attention_mask"]

  t5_model.eval()
  beam_outputs = t5_model.generate(
      input_ids=test_input_ids,attention_mask=test_attention_mask,
      min_length=40,
      max_length=120,
      early_stopping=True,
      num_beams=9,
      num_return_sequences=3,
      no_repeat_ngram_size=1
  )

  rouge1_wiki = []
  rouge1_tmdb = []
  for beam_output_n in beam_outputs:
    score_n = Rouge().get_scores(
        t5_tokenizer.decode(beam_output_n, skip_special_tokens=True,clean_up_tokenization_spaces=True),
        wiki_pairs[test_ind][1]
    )[0]["rouge-1"]["f"]
    rouge1_wiki.append(score_n)
  
    score_n = Rouge().get_scores(
        t5_tokenizer.decode(beam_output_n, skip_special_tokens=True,clean_up_tokenization_spaces=True),
        tmdb_pairs[test_ind][1]
    )[0]["rouge-1"]["f"]
    rouge1_tmdb.append(score_n)
  
  rouge1_wiki_highest.append(max(rouge1_wiki))
  rouge1_tmdb_highest.append(max(rouge1_tmdb))

print(np.mean(rouge1_wiki_highest))
print(np.mean(rouge1_tmdb_highest))



In [12]:
# few-shot training size tuning
for training_data in [tmdb_pairs, wiki_pairs]:
  if training_data == tmdb_pairs:
    print("Trained on: BASELINE PLOT OVERVIEW")
  else:
    print("Trained on: WIKIPEDIA OVERVIEW")
  for shot in [10, 5]:
    print(f"{shot}-shot")
    # hyper-parameter adjustments
    epochs = 5
    few_shot_train = random.sample(training_data, shot)

    # creating clean model
    t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
    t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

    # optimizer
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in t5_model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
        {
            "params": [p for n, p in t5_model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=3e-4, eps=1e-8)

    # fine-tune loop
    t5_model.train()

    for epoch in range(epochs):
      print ("epoch ",epoch)
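      for source_text, target_text in few_shot_train:
        input_sent = f"describe: {source_text} </s>"
        output_sent = target_text + " </s>"

        # truncation=True makes the 192-token limit explicit (it was previously applied by default, with a warning)
        tokenized_inp = t5_tokenizer.encode_plus(input_sent, max_length=192, pad_to_max_length=True, truncation=True, return_tensors="pt")
        tokenized_output = t5_tokenizer.encode_plus(output_sent, max_length=192, pad_to_max_length=True, truncation=True, return_tensors="pt")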

        input_ids  = tokenized_inp["input_ids"]
        attention_mask = tokenized_inp["attention_mask"]

        lm_labels= tokenized_output["input_ids"]
        decoder_attention_mask=  tokenized_output["attention_mask"]

        # the forward function automatically creates the correct decoder_input_ids
        output = t5_model(input_ids=input_ids, labels=lm_labels,decoder_attention_mask=decoder_attention_mask,attention_mask=attention_mask)
        loss = output[0]

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    
    # evaluation (note: the first 200 pairs may include the few-shot training examples)
    rouge1_wiki_highest = []
    rouge1_tmdb_highest = []
    
    t5_model.eval()  # switch off dropout before generating
    rouge_scorer = Rouge()  # one scorer instance is reused for every comparison

    for test_ind in range(len(training_data[:200])):
      test_sent = f'describe: {training_data[test_ind][0]} </s>'
      test_tokenized = t5_tokenizer.encode_plus(test_sent, return_tensors="pt")

      test_input_ids = test_tokenized["input_ids"]
      test_attention_mask = test_tokenized["attention_mask"]

      # beam search over 9 beams, returning the 3 best candidates
      beam_outputs = t5_model.generate(
          input_ids=test_input_ids,
          attention_mask=test_attention_mask,
          min_length=40,
          max_length=120,
          early_stopping=True,
          num_beams=9,
          num_return_sequences=3,
          no_repeat_ngram_size=1
      )

      # score every candidate against both references
      rouge1_wiki = []
      rouge1_tmdb = []
      for beam_output_n in beam_outputs:
        candidate = t5_tokenizer.decode(beam_output_n, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        rouge1_wiki.append(rouge_scorer.get_scores(candidate, wiki_pairs[test_ind][1])[0]["rouge-1"]["f"])
        rouge1_tmdb.append(rouge_scorer.get_scores(candidate, tmdb_pairs[test_ind][1])[0]["rouge-1"]["f"])

      # keep only the best-scoring candidate for each example
      rouge1_wiki_highest.append(max(rouge1_wiki))
      rouge1_tmdb_highest.append(max(rouge1_tmdb))

    print(f"Baseline score: {np.mean(rouge1_tmdb_highest)}")
    print(f"Wikipedia score: {np.mean(rouge1_wiki_highest)}")
    print("-----------------------")

  print("-----------------------")

10-shot


Truncation was not explicitely activated but `max_length` is provided a specific value, please use `truncation=True` to explicitely truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


epoch  0




epoch  1
epoch  2
epoch  3
epoch  4




Baseline score: 0.13168859127968108
Wikipedia score: 0.4428841432545135
-----------------------
5-shot




epoch  0
epoch  1
epoch  2
epoch  3
epoch  4
Baseline score: 0.16593677211229646
Wikipedia score: 0.3720387182263813
-----------------------
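
The training step in the cell above pads each target to 192 tokens and passes the padded IDs directly as `labels`, so padding positions may contribute to the loss. A common practice with Hugging Face sequence-to-sequence models is to replace padding IDs in the labels with -100, which the cross-entropy loss ignores by default. This is not what the runs above used. The sketch below shows the idea on plain Python lists; `mask_pad_labels` is a hypothetical helper, the token IDs are made up, and the padding ID of 0 is an assumption that matches the standard T5 tokenizer.

```python
from typing import List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_pad_labels(label_ids: List[int], pad_token_id: int = 0) -> List[int]:
    """Replace padding IDs with IGNORE_INDEX so they do not count toward the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


# made-up token IDs followed by three padding positions (assumed pad ID: 0)
padded_labels = [363, 19, 8, 1, 0, 0, 0]
masked_labels = mask_pad_labels(padded_labels)
print(masked_labels)
```

With tensors, the same masking would be applied to `lm_labels` before the forward pass; scores could change as a result, so the numbers reported here would need to be re-run.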


In [None]:
# evaluating on 500-movie test set
rouge1_wiki_highest = []
rouge1_tmdb_highest = []
t5_model.eval()  # switch off dropout before generating
rouge_scorer = Rouge()  # one scorer instance is reused for every comparison

for test_ind in range(len(wiki_test)):
  test_sent = f'describe: {wiki_test[test_ind][0]} </s>'
  test_tokenized = t5_tokenizer.encode_plus(test_sent, return_tensors="pt")

  test_input_ids = test_tokenized["input_ids"]
  test_attention_mask = test_tokenized["attention_mask"]

  # beam search over 9 beams, returning the 3 best candidates
  beam_outputs = t5_model.generate(
      input_ids=test_input_ids,
      attention_mask=test_attention_mask,
      min_length=40,
      max_length=120,
      early_stopping=True,
      num_beams=9,
      num_return_sequences=3,
      no_repeat_ngram_size=1
  )

  # score every candidate against both references
  rouge1_wiki = []
  rouge1_tmdb = []
  for beam_output_n in beam_outputs:
    candidate = t5_tokenizer.decode(beam_output_n, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    rouge1_wiki.append(rouge_scorer.get_scores(candidate, wiki_test[test_ind][1])[0]["rouge-1"]["f"])
    rouge1_tmdb.append(rouge_scorer.get_scores(candidate, tmdb_test[test_ind][1])[0]["rouge-1"]["f"])

  # keep only the best-scoring candidate for each example
  rouge1_wiki_highest.append(max(rouge1_wiki))
  rouge1_tmdb_highest.append(max(rouge1_tmdb))

print(f"Wikipedia ROUGE-1: {np.mean(rouge1_wiki_highest)}")
print(f"TMDB plot overview ROUGE-1: {np.mean(rouge1_tmdb_highest)}")
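
The evaluation cells above score each of the three returned beams against a reference, keep the best ROUGE-1 F score per movie, and average those best scores. The sketch below illustrates that aggregation with a simplified ROUGE-1 F score based on lowercased, whitespace-split unigram counts. It is an assumption-laden stand-in: `rouge1_f` and `best_of_n_mean` are hypothetical helpers, and the `rouge` package used in the notebook may tokenize and count differently, so the values will not match its output exactly.

```python
from collections import Counter
from statistics import mean
from typing import List


def rouge1_f(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F score from clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def best_of_n_mean(candidates_per_example: List[List[str]], references: List[str]) -> float:
    """Take the best-scoring candidate per example, then average over examples."""
    best_scores = [
        max(rouge1_f(cand, ref) for cand in cands)
        for cands, ref in zip(candidates_per_example, references)
    ]
    return mean(best_scores)


# toy data: one example with two candidate descriptions
toy_candidates = [["a 2004 comedy film", "a drama film"]]
toy_references = ["a 2004 comedy film directed by david"]
toy_score = best_of_n_mean(toy_candidates, toy_references)
print(toy_score)
```

Taking the maximum over candidates makes the reported number an upper bound on what a single generated description would score, which is worth keeping in mind when comparing configurations.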

## Visualization for Final Report

In [13]:
# storing values to copy into final report
# (distinct names so the improved-model values do not overwrite the baseline ones)
# 5 epochs, 5-shot baseline
baseline_tmdb_55 = 0.16487877249859986
baseline_wiki_55 = 0.35897335625458804

# 5 epochs, 10-shot baseline
baseline_tmdb_510 = 0.17880860306318402
baseline_wiki_510 = 0.2769135366969937

# 5 epochs, 15-shot baseline
baseline_tmdb_515 = 0.18237719012673384
baseline_wiki_515 = 0.3127183926456598

# 5 epochs, 5-shot improved
improved_tmdb_55 = 0.16593677211229646
improved_wiki_55 = 0.3720387182263813

# 5 epochs, 10-shot improved
improved_tmdb_510 = 0.13168859127968108
improved_wiki_510 = 0.4428841432545135

# 5 epochs, 15-shot improved
improved_tmdb_515 = 0.14667901548095327
improved_wiki_515 = 0.45286786315175664

# 10 epochs, 10-shot improved
improved_tmdb_1010 = 0.13916713592122615
improved_wiki_1010 = 0.4376975009417915

In [56]:
# creating long-form pandas dataframes for altair charts
# number of training examples (shots) per run; every run here used 5 epochs
n_train_examples = [5, 10, 15]
baseline_tmdb = [0.16487877249859986, 0.17880860306318402, 0.18237719012673384]
baseline_wiki = [0.35897335625458804, 0.2769135366969937, 0.3127183926456598]
improved_tmdb = [0.16593677211229646, 0.13168859127968108, 0.14667901548095327]
improved_wiki = [0.3720387182263813, 0.4428841432545135, 0.45286786315175664]

# one row per (reference text, number of training examples) combination
eval_outputs = ["Plot Overview"] * 3 + ["Wikipedia Description"] * 3

baseline_metrics = pd.DataFrame({
    "eval_output": eval_outputs,
    "n_train_examples": n_train_examples * 2,
    "rouge1": baseline_tmdb + baseline_wiki,
})

improved_metrics = pd.DataFrame({
    "eval_output": eval_outputs,
    "n_train_examples": n_train_examples * 2,
    "rouge1": improved_tmdb + improved_wiki,
})
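
Before charting, it can help to read the same numbers as differences between the improved and baseline models. The sketch below is one way to do that in plain Python, using only the scores recorded in this notebook; `score_deltas` is a hypothetical helper name, and the lists are redefined so the snippet runs on its own.

```python
from typing import List

# ROUGE-1 scores copied from the cells above (5 epochs; 5-, 10-, and 15-shot)
shots = [5, 10, 15]
baseline_tmdb = [0.16487877249859986, 0.17880860306318402, 0.18237719012673384]
baseline_wiki = [0.35897335625458804, 0.2769135366969937, 0.3127183926456598]
improved_tmdb = [0.16593677211229646, 0.13168859127968108, 0.14667901548095327]
improved_wiki = [0.3720387182263813, 0.4428841432545135, 0.45286786315175664]


def score_deltas(improved: List[float], baseline: List[float]) -> List[float]:
    """Element-wise difference: positive means the improved model scored higher."""
    return [imp - base for imp, base in zip(improved, baseline)]


wiki_deltas = score_deltas(improved_wiki, baseline_wiki)
tmdb_deltas = score_deltas(improved_tmdb, baseline_tmdb)

for n, d_wiki, d_tmdb in zip(shots, wiki_deltas, tmdb_deltas):
    print(f"{n}-shot: Wikipedia {d_wiki:+.3f}, plot overview {d_tmdb:+.3f}")
```

These are differences between single runs, so they indicate direction rather than statistical significance.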

In [71]:
# altair chart for report
baseline_chart = alt.Chart(baseline_metrics).mark_bar().encode(
    x = alt.X("eval_output:N", title="ROUGE-1 overlap"),
    y = alt.Y("rouge1:Q", title="ROUGE-1 Score"),
    column = alt.Column("n_train_examples:O", title="Number of Training Examples"),
    color = alt.Color("eval_output:N", title="Evaluated Against:")
).properties(
    width = 150,
    height = 350,
    title = "Baseline Model"
)

improved_chart = alt.Chart(improved_metrics).mark_bar().encode(
    x = alt.X("eval_output:N", title="ROUGE-1 overlap"),
    y = alt.Y("rouge1:Q", title="ROUGE-1 Score"),
    column = alt.Column("n_train_examples:O", title="Number of Training Examples"),
    color = alt.Color("eval_output:N", title="Evaluated Against:")
).properties(
    width = 150,
    height = 350,
    title = "Improved Model"
)

(baseline_chart | improved_chart).resolve_scale(y="shared").properties(
    title="ROUGE-1 scores for baseline and improved models, compared against plot overviews and Wikipedia descriptions:"
).configure_title(fontSize=16)

In [75]:
wiki_pairs[:25]

[('Bergman Island is a documentary  foreign film directed by Marie Nyreröd. Bergman Island was released in 2004. Bergman Island stars Ingmar Bergman. Bergman Island contains woman director. ',
  'Bergman Island is a 2021 romantic drama film written and directed by Mia Hansen-Løve.'),
 ('I Heart Huckabees is a comedy  romance film directed by David O. Russell. I Heart Huckabees was released in 2004. I Heart Huckabees stars Jason Schwartzman. I Heart Huckabees contains sex, detective, jealousy, humor, protest, wife, celebrity, rivalry, independent film, religion, universe, anger, nature, husband, existentialism, enviroment, issues. ',
  'I Heart Huckabees (stylized as I ♥ Huckabees; also I Love Huckabees) is a 2004 independent black comedy film directed and produced by David O. Russell, who co-wrote the screenplay with Jeff Baena.'),
 ('S*P*Y*S is a action  comedy film directed by Irvin Kershner. S*P*Y*S was released in 1974. S*P*Y*S stars Elliott Gould. S*P*Y*S contains murder, espionag