In [92]:
# Suggested imports. Do not use import any modules that are not in the requirements.txt file on the VLE.

%matplotlib inline

import numpy as np
import pandas as pd
import torch
import collections
import random
import matplotlib.pyplot as plt
import sklearn.model_selection
import sklearn.metrics

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Movie titles assignment

Table of contents:

* [Data filtering and splitting (10%)](#Data-filtering-and-splitting-(10%))
* [Title classification (25%)](#Title-classification-(25%))
* [Title generation (25%)](#Title-generation-(25%))
* [Language models as classifiers (30%)](#Language-models-as-classifiers-(30%))
* [Conclusion (10%)](#Conclusion-(10%))

Information:

This assignment is 100% of your assessment.
You are to follow the instructions below and fill each cell as instructed.
Once ready, submit this notebook on VLE with all the outputs included (run all your code and don't clear any output cells).
Do not submit anything else apart from the notebook and do not use any extra data apart from what is requested.

## Introduction

A big shot Hollywood producer is looking for a way to automatically generate new movie titles for future movies and you have been employed to do this (in exchange for millions of dollars!).
A data set of movie details has already been collected from IMDb for you and your task is to create the model and the algorithms necessary to use it.

## Data filtering and splitting (10%)

Start by downloading the CSV file `filmtv_movies - ENG.csv` from [this kaggle data set](https://www.kaggle.com/datasets/stefanoleone992/filmtv-movies-dataset).

The CSV file needs to be filtered as the producer is only interested in certain types of movie titles.
Load the file and filter it so that only movies with the following criteria are kept:

* The country needs to be `United States` (and no other country should be mentioned).
* The genre should be `Action`, `Horror`, `Fantasy`, `Western`, and `Adventure`.
* The title should not have more than 20 characters.

In [93]:
df = pd.read_csv('filmtv_movies - ENG.csv')

# Filtering
# Country
df = df[df['country'] == 'United States']  

# Genre
genre_list = ['Action', 'Horror', 'Fantasy', 'Western', 'Adventure']
df = df[df['genre'].isin(genre_list)]

# Title Length
df = df[df['title'].str.len() <= 20]

df.to_csv('testFilter.csv') # testing if filter worked

Split the filtered data into 80% train, 10% validation, and 10% test.
You will only need the title and genre columns.

In [94]:
from sklearn.model_selection import train_test_split
# splitting data : training data (80%) / validation + testing (20%)
train, rem = train_test_split(df, train_size = 0.8)

# split the remaining 20% in half: validation (10%) / testing (10%)
valid, test = train_test_split(rem, train_size = 0.5)

# Test to see if percentages were split correcrtly
# train.to_csv('trainT.csv') 
# valid.to_csv('validT.csv') 
# test.to_csv('testT.csv') 

From your processed data set, display:

* the amount of movies in each genre and split
* 5 examples of movie titles from each genre and split

In [99]:
def countMoviesByGenre(df):
    action = len(df[df.genre.str.contains('Action')])
    horror = len(df[df.genre.str.contains('Horror')])
    fantasy = len(df[df.genre.str.contains('Fantasy')])
    western = len(df[df.genre.str.contains('Western')])
    adventure = len(df[df.genre.str.contains('Adventure')])

    return [action, horror, fantasy, western, adventure]


def examplesByGenre(df):
    actionM= df[df['genre'] == 'Action']
    rand_actionM = actionM['title'].sample(n=5)

    horrorM = df[df['genre'] == 'Horror'] 
    rand_horrorM = horrorM['title'].sample(n=5)

    fantasyM = df[df['genre'] == 'Fantasy']
    rand_fantasyM = fantasyM['title'].sample(n=5)

    westernM = df[df['genre'] == 'Western']
    rand_westernM = westernM['title'].sample(n=5)

    adventureM = df[df['genre'] == 'Adventure']
    rand_adventureM = adventureM['title'].sample(n=5)

    return [rand_actionM, rand_horrorM, rand_fantasyM, rand_westernM, rand_adventureM]


def outputStatistics(train, valid, test):
    # Part 1 - Amount of Movies in each genre and split
    print("Amount of Movies in each genre: ")
    
    # call function that calculates amount of movies per genre (called for each split). List is returned for every split
    movieCountTrain= countMoviesByGenre(train)
    movieCountValid = countMoviesByGenre(valid)
    movieCountTest = countMoviesByGenre(test)
    
    # Lists containing the amount of movies per Genre are appended into a list of lists (used to make the dataframe)
    stats = []
    stats.append(movieCountTrain)
    stats.append(movieCountValid)
    stats.append(movieCountTest)

    # Dataframe containing amount of movies in each genre and split is created and outputted
    stats_df = pd.DataFrame(stats, columns=['Action', 'Horror', 'Fantasy', 'Western', 'Adventure'])
    stats_df.index = ['Training Split', 'Validation Split', 'Testing Split']
    print(stats_df)
    


    # Part 2 - 5 Examples of movie titles from each genre and split

    print("\n5 Examples of Movie titles from each genre:")

    movieExTrain = examplesByGenre(train)
    movieExValid = examplesByGenre(valid)
    movieExTest = examplesByGenre(test)


    # FIND A WAY TO OUTPUT TABLE

    # examples = []
    # examples.append(movieExTrain)
    # examples.append(movieExValid)
    # examples.append(movieExTest)
    
    # examples_df = pd.DataFrame(examples, columns=['Action', 'Horror', 'Fantasy', 'Western', 'Adventure'])
    # examples_df.index = ['Training Split', 'Validation Split', 'Testing Split']
    # print(examples_df)


    

outputStatistics(train, valid, test)
# outputStatistics(valid)
# outputStatistics(test)






Amount of Movies in each genre: 
                  Action  Horror  Fantasy  Western  Adventure
Training Split       702     663      423      430        381
Validation Split      94      75       61       55         40
Testing Split         92      80       58       52         43

5 Examples of Movie titles from each genre:
                                                             Action  \
Training Split    9908         Prime Cut
6557         The Swarm
...   
Validation Split  21324                Sniper 3
22968     Comman...   
Testing Split     4332          Black Belt
26943      Machete Ki...   

                                                             Horror  \
Training Split    29543         Preservation
20871              ...   
Validation Split  14962              Possessed
12646            ...   
Testing Split     7925            The Manitou
29529       Alien ...   

                                                            Fantasy  \
Training Split    582      Beauty

## Title classification (25%)

Your first task is to prove that a neural network can identify the genre of a movie based on its title.

You will note that many titles are just a single word or two words long so you need to work at the character level instead of the word level, that is, a token would be a single character, including punctuation marks and spaces.
You must also lowercase the titles.
Preprocess the data sets, create a neural network, and train it to classify the movie titles into their genre.
Plot a graph of the **accuracy** of the model on the train and validation sets after each epoch.

Measure the F1 score performance of the model when applied on the test set.
Also plot a confusion matrix showing how often each genre is mistaken as another genre.

## Title generation (25%)

Now that you've proven that titles and genre are related, make a model that can generate a title given a genre.

Again, you need to generate tokens at the character level instead of the word level and the titles must be lowercased.
Preprocess the data sets, create a neural network, and train it to generate the movie titles given their genre.
Plot a graph of the **perplexity** of the model on the train and validation sets after each epoch.

Generate 3 titles for every genre.
Make sure that the titles are not all the same.

## Language models as classifiers (30%)

It occurs to you that the movie title generator can also be used as a classifier by doing the following:

* Let title $t$ be the title that you want to classify.
* For every genre $g$,
    * Use the generator as a language model to get the probability of $t$ (the whole title) using genre $g$.
* Pick the genre that makes the language model give the largest probability.

The producer is thrilled to not need two separate models and now you have to implement this.
**Use the preprocessed test set from the previous task** in order to find the genre that makes the language model give the largest probability.
There is no need to plot anything here.

Just like in the classification task, measure the F1 score and plot the confusion matrix of this new classifier.

Write a paragraph or psuedo code to describe what your code above does.

In [96]:
'''

'''

'\n\n'

## Conclusion (10%)

The producer's funders are asking for a report about this new technology they invested in.
In 300 words, write your interpretation of the results together with what you think could make the model perform better.

In [97]:
'''

'''

'\n\n'