# Predicting the Primary Genre of Movies/TV Shows Using Plot Text

Everyone loves binge-watching their favorite movies and TV shows. Modern titles often draw on several genres (e.g., action, adventure, comedy) through themes intertwined within a single plot. For example, a movie can be primarily an action movie while also containing undercurrents of romance and comedy (as in Thor: Ragnarok and Thor: Love and Thunder). Most online platforms that maintain movie and TV-show details (e.g., IMDb, Rotten Tomatoes) list all of a title's genres but do not single out a "primary genre." We tackle this problem by using machine learning to assign each movie or TV show a single primary genre that best represents its plot.

Highlighting a movie's primary genre has several commercial applications, including improved content recommendation and a more precise understanding of actors' performance and affinity for particular genres (and, consequently, across niche fan bases).

As stated, we will use **machine learning** to **predict the primary genre** of movies and TV shows. Most titles come with a plot summary of a few lines of text, which we use to predict their genres. We frame this as a **multi-label classification task**, since each title can carry several genres, and train a simple one-vs-rest **Logistic Regression** model on the publicly available **CMU Movie Summary Corpus**. The plot text is encoded with **TF-IDF**, and the data is split into train (90%) and test (10%) subsets.

Performance is measured with the micro-averaged **F1-score**, which was **0.436** on the test set; we consider this a workable baseline for a simple linear model. The model outputs a probability for each genre, and we take the genre with the highest probability as the "primary genre" of the title. The full pipeline is also packaged as a single class that can be run by instantiating it and calling its `run()` method.
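As a rough illustration of the approach described above, the following minimal sketch trains a one-vs-rest logistic regression on TF-IDF features and takes the most probable label as the primary genre. The four toy plots, their labels, and all variable names are invented for illustration and do not come from the CMU corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical, for illustration only)
toy_plots = [
    "gun chase explosion",
    "explosion gun fight",
    "love kiss wedding",
    "wedding love letter",
]
toy_genres = [["Action"], ["Action"], ["Romance"], ["Romance"]]

binarizer = MultiLabelBinarizer()
Y_toy = binarizer.fit_transform(toy_genres)   # one 0/1 column per genre

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(toy_plots)

# One independent binary classifier per genre
model = OneVsRestClassifier(LogisticRegression())
model.fit(X_toy, Y_toy)

# Score a new plot and keep the most probable genre as the primary genre
probs = model.predict_proba(vectorizer.transform(["a gun fight and a chase"]))[0]
primary = binarizer.classes_[int(np.argmax(probs))]
print(primary, probs.round(2))
```

Because each genre gets its own binary classifier, the probabilities are independent of one another and need not sum to 1.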

In [1]:
import pandas as pd
import numpy as np
import json
import nltk
import re
import csv
import gzip
import matplotlib.pyplot as plt 
import seaborn as sns
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## Prepare Data

### Movie ID to Genre Mapping

In [2]:
meta = pd.read_csv("MovieSummaries/movie.metadata.tsv", sep = '\t', header = None)
meta.columns = ["movie_id",1,"movie_name",3,4,5,6,7,"genre"]
meta['movie_id'] = meta['movie_id'].astype(str)
meta.head()

Unnamed: 0,movie_id,1,movie_name,3,4,5,6,7,genre
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/01jfsb"": ""Thriller"", ""/m/06n90"": ""Science..."
1,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
2,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
3,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
4,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"


### Loading Movie Plots

In [3]:
plots = []

with gzip.open("MovieSummaries/plot_summaries.txt", 'rt', encoding = 'UTF-8') as f:
    reader = csv.reader(f, dialect='excel-tab') 
    for row in tqdm(reader):
        plots.append(row)

42303it [00:00, 46527.89it/s]


In [4]:
movie_id = []
plot = []

# extract movie Ids and plot summaries
for i in tqdm(plots):
    movie_id.append(i[0])
    plot.append(i[1])

# create dataframe
movies = pd.DataFrame({'movie_id': movie_id, 'plot': plot})

movies.head()

100%|███████████████████████████████████████████████████████████████████████| 42303/42303 [00:00<00:00, 1626321.19it/s]


Unnamed: 0,movie_id,plot
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha..."
1,31186339,The nation of Panem consists of a wealthy Capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...
3,2231378,"The Lemon Drop Kid , a New York City swindler,..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...


### Merging Movie Plots and Genres

In [5]:
movies = pd.merge(movies, meta[['movie_id', 'movie_name', 'genre']], on = 'movie_id')

movies.head()

Unnamed: 0,movie_id,plot,movie_name,genre
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha...",Taxi Blues,"{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci..."
1,31186339,The nation of Panem consists of a wealthy Capi...,The Hunger Games,"{""/m/03btsm8"": ""Action/Adventure"", ""/m/06n90"":..."
2,20663735,Poovalli Induchoodan is sentenced for six yea...,Narasimham,"{""/m/04t36"": ""Musical"", ""/m/02kdv5l"": ""Action""..."
3,2231378,"The Lemon Drop Kid , a New York City swindler,...",The Lemon Drop Kid,"{""/m/06qm3"": ""Screwball comedy"", ""/m/01z4y"": ""..."
4,595909,Seventh-day Adventist Church pastor Michael Ch...,A Cry in the Dark,"{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
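The merge above uses the pandas default inner join, so plots without a matching metadata row are dropped. This is presumably why the row count falls from the 42,303 plots loaded earlier to the 42,204 rows reported by `Y.shape` further down. A minimal sketch with made-up IDs shows how `indicator=True` can make the unmatched rows visible; the toy frames and their names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames standing in for `movies` and `meta`
toy_plots = pd.DataFrame({"movie_id": ["1", "2", "3"],
                          "plot": ["plot a", "plot b", "plot c"]})
toy_meta = pd.DataFrame({"movie_id": ["1", "3", "4"],
                         "genre": ['{"g1": "Drama"}', '{"g2": "Comedy"}', '{"g3": "War"}']})

# Default how="inner": only IDs present in both frames survive
inner = pd.merge(toy_plots, toy_meta, on="movie_id")

# A left join with indicator=True keeps every plot and flags unmatched ones
checked = pd.merge(toy_plots, toy_meta, on="movie_id", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "movie_id"].tolist()

print(len(inner), unmatched)
```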


### Existing Genres of Titles

In [6]:
existing_peerlogix_genres = pd.read_csv("peerlogix_genres.csv")

existing_peerlogix_genres = existing_peerlogix_genres.groupby("imdb_id").aggregate(lambda x: list(x)).reset_index()

print("Existing Genres : {}".format(existing_peerlogix_genres.shape))

existing_peerlogix_genres.head()

Existing Genres : (49947, 2)


Unnamed: 0,imdb_id,genre
0,tt0000004,[Animation]
1,tt0000007,[Documentary]
2,tt0000008,[Documentary]
3,tt0000246,[Comedy]
4,tt0000417,"[Adventure, Fantasy, Science Fiction]"


In [7]:
existing_peerlogix_genres = dict(zip(existing_peerlogix_genres["imdb_id"].values.tolist(), existing_peerlogix_genres["genre"].values.tolist()))

In [8]:
list(existing_peerlogix_genres.items())[:5]

[('tt0000004', ['Animation']),
 ('tt0000007', ['Documentary']),
 ('tt0000008', ['Documentary']),
 ('tt0000246', ['Comedy']),
 ('tt0000417', ['Adventure', 'Fantasy', 'Science Fiction'])]
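The two cells above turn a long-format table (one row per title and genre) into a lookup from IMDb ID to a list of genres. A minimal sketch of the same idea on made-up IDs, using `groupby(...).apply(list).to_dict()` as a more compact alternative to the `aggregate` plus `zip` combination:

```python
import pandas as pd

# Hypothetical long-format genre table: one row per (title, genre) pair
toy = pd.DataFrame({
    "imdb_id": ["tt1", "tt2", "tt2", "tt3"],
    "genre": ["Comedy", "Drama", "Romance", "Horror"],
})

# Collect the genres of each title into a list, then map id -> list
genre_lookup = toy.groupby("imdb_id")["genre"].apply(list).to_dict()

print(genre_lookup.get("tt2"))
print(genre_lookup.get("tt9", []))  # unknown ids fall back to an empty list
```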

### Cleaning Genre Data

In [9]:
# Hand-picked baseline list of genres (superseded by the CSV-derived list below)
valid_genres = [
            'Action', 'Action & Adventure', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
            'Drama', 'Family', 'Fantasy', 'Horror', 'Musical', 'Mystery', 'Romance', 'Science Fiction',
            'Thriller', 'War', ]

# Override with every unique genre in the PeerLogix data (around 8-10 more unique labels)
valid_genres = pd.read_csv("peerlogix_genres.csv").genre.unique().tolist()

genre_corrections = {
                    'Action/Adventure' : 'Action & Adventure',
                    'Crime Fiction' : 'Crime',
                    'Family Film' : 'Family',
                    'Romance Film' : 'Romance',
                    'War Film' : 'War',
                    'Comedy Film' : 'Comedy'
                    }

only_peerlogix_valid_genres, genres = True, [] 

for i in movies['genre']: 
    movie_genres = list(json.loads(i).values())
    
    movie_genres = [genre_corrections.get(genre, genre) for genre in movie_genres] ## Genre Correction in Data
    
    if only_peerlogix_valid_genres: ## Keep only PeerLogix valid genres
        movie_genres = [genre for genre in movie_genres if genre in valid_genres]
        
    genres.append(movie_genres) 

movies['genre_new'] = genres

movies.head()

Unnamed: 0,movie_id,plot,movie_name,genre,genre_new
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha...",Taxi Blues,"{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",[Drama]
1,31186339,The nation of Panem consists of a wealthy Capi...,The Hunger Games,"{""/m/03btsm8"": ""Action/Adventure"", ""/m/06n90"":...","[Action & Adventure, Science Fiction, Action, ..."
2,20663735,Poovalli Induchoodan is sentenced for six yea...,Narasimham,"{""/m/04t36"": ""Musical"", ""/m/02kdv5l"": ""Action""...","[Musical, Action, Drama]"
3,2231378,"The Lemon Drop Kid , a New York City swindler,...",The Lemon Drop Kid,"{""/m/06qm3"": ""Screwball comedy"", ""/m/01z4y"": ""...",[Comedy]
4,595909,Seventh-day Adventist Church pastor Michael Ch...,A Cry in the Dark,"{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...","[Crime, Drama]"
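The cleaning step above does three things per title: parse the Freebase-style JSON, rename a few genres, and drop anything outside the allowed list. A minimal sketch of that logic as a standalone function, using shortened, hypothetical versions of the correction map and the valid-genre list:

```python
import json

# Hypothetical, shortened versions of the mappings used above
corrections = {"Crime Fiction": "Crime", "Romance Film": "Romance"}
allowed = {"Crime", "Drama", "Romance"}   # a set gives constant-time membership tests

def clean_genres(raw_json):
    """Parse a Freebase-style genre JSON string, rename, then filter."""
    names = json.loads(raw_json).values()
    renamed = [corrections.get(name, name) for name in names]
    return [name for name in renamed if name in allowed]

raw = '{"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "Drama", "/m/03q4nz": "World cinema"}'
print(clean_genres(raw))
```

Note that a title whose genres are all filtered out ends up with an empty list rather than being removed.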


### Cleaning Plot Data

In [10]:
movies['clean_plot'] = movies['plot'].apply(lambda x: " ".join(re.findall("[a-zA-Z]+", x.lower())))

movies.head()

Unnamed: 0,movie_id,plot,movie_name,genre,genre_new,clean_plot
0,23890098,"Shlykov, a hard-working taxi driver and Lyosha...",Taxi Blues,"{""/m/07s9rl0"": ""Drama"", ""/m/03q4nz"": ""World ci...",[Drama],shlykov a hard working taxi driver and lyosha ...
1,31186339,The nation of Panem consists of a wealthy Capi...,The Hunger Games,"{""/m/03btsm8"": ""Action/Adventure"", ""/m/06n90"":...","[Action & Adventure, Science Fiction, Action, ...",the nation of panem consists of a wealthy capi...
2,20663735,Poovalli Induchoodan is sentenced for six yea...,Narasimham,"{""/m/04t36"": ""Musical"", ""/m/02kdv5l"": ""Action""...","[Musical, Action, Drama]",poovalli induchoodan is sentenced for six year...
3,2231378,"The Lemon Drop Kid , a New York City swindler,...",The Lemon Drop Kid,"{""/m/06qm3"": ""Screwball comedy"", ""/m/01z4y"": ""...",[Comedy],the lemon drop kid a new york city swindler is...
4,595909,Seventh-day Adventist Church pastor Michael Ch...,A Cry in the Dark,"{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D...","[Crime, Drama]",seventh day adventist church pastor michael ch...
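The cleaning expression above lowercases the plot and keeps only runs of ASCII letters, which removes digits and punctuation and splits hyphenated words. A small sketch on sample strings (the helper name `clean_text` is ours) makes the effect concrete, including the side effect on accented characters:

```python
import re

def clean_text(text):
    """Lowercase the text and keep only runs of ASCII letters."""
    return " ".join(re.findall("[a-zA-Z]+", text.lower()))

print(clean_text("Shlykov, a hard-working taxi driver (b. 1950)"))
print(clean_text("JonBen\u00e9t"))  # the accented letter is dropped and splits the word
```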


## Create and Train Model

### Create Multi-Labels

In [11]:
from sklearn.preprocessing import MultiLabelBinarizer

multilabel_binarizer = MultiLabelBinarizer()
multilabel_binarizer.fit(movies['genre_new'])

# transform target variable
Y = multilabel_binarizer.transform(movies['genre_new'])

Y.shape

(42204, 20)
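`MultiLabelBinarizer` turns each list of genres into a fixed-length 0/1 row, one column per unique label, which is why `Y` has 20 columns. A minimal sketch on toy labels (not the real data) shows the column order and what happens to a title with no remaining genres:

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_labels = [["Drama"], ["Action", "Comedy"], []]  # last title has no valid genre

mlb = MultiLabelBinarizer()
Y_toy = mlb.fit_transform(toy_labels)

print(mlb.classes_)                  # columns are the sorted unique labels
print(Y_toy)                         # an empty label list becomes an all-zero row
print(mlb.inverse_transform(Y_toy))  # round trip back to label tuples
```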

### Define Vectorizer

In [12]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, stop_words="english", max_features=50000)

### Split Data into Train/Test

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(movies['clean_plot'], Y, test_size=0.1, random_state=123)

### Vectorize Data

In [14]:
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

X_train_tfidf.shape, X_test_tfidf.shape

((37983, 50000), (4221, 50000))
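The vectorizer is fitted on the training split only and then reused for the test split, so both matrices share the same columns and words unseen during training are simply ignored. A minimal sketch with toy documents (default `TfidfVectorizer` settings, not the `max_df`/`max_features` values used above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["space ship battle", "love story wedding", "space love"]  # toy data
test_docs = ["alien space"]                                             # "alien" is unseen

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)   # learns vocabulary and IDF weights
X_te = vec.transform(test_docs)        # reuses them; never refit on test data

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)
print(X_te.nnz)                        # number of non-zero entries in the test row
```

Fitting on the training split alone keeps information from the test plots out of the vocabulary and IDF weights.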

### Train Classifier

In [15]:
%%time

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

# Fit one binary logistic regression per genre (one-vs-rest)
lr = LogisticRegression()
clf1 = OneVsRestClassifier(lr)

clf1.fit(X_train_tfidf, Y_train)

Wall time: 20.6 s


OneVsRestClassifier(estimator=LogisticRegression())

In [16]:
Y_test_pred1 = clf1.predict(X_test_tfidf)

In [17]:
f1_score(Y_test, Y_test_pred1, average="micro")

0.4363823051253529
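With `average="micro"`, the F1-score pools true positives, false positives, and false negatives over every (title, genre) cell before computing precision and recall, so frequent genres weigh more than rare ones. A small worked example on made-up label matrices, checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy ground truth and predictions: 2 titles x 3 genres
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1]])

# Pool every (title, genre) cell before counting
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

sklearn_f1 = f1_score(y_true, y_pred, average="micro")
print(manual_f1, sklearn_f1)
```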

### Create Prediction Function

In [18]:
def make_prediction(text, threshold = 0.0):
    """Return the five most probable genres for a plot, most probable first."""
    # Non-string input (e.g. a missing description) cannot be scored
    if not isinstance(text, str):
        return ["NA", ] * 5

    cleaned_text = " ".join(re.findall("[a-zA-Z]+", text.lower()))
    X = tfidf_vectorizer.transform([cleaned_text])
    probs = clf1.predict_proba(X)[0] ## Per-genre probabilities/likelihoods

    # Indices of the five most probable genres, in descending order of probability
    top_idx = probs.argsort()[::-1][:5]

    # Discard results if even the top genre is below the threshold
    if probs[top_idx[0]] < threshold:
        return ["NA", ] * 5

    # Else, return the top genre followed by the next four
    top_genres = [multilabel_binarizer.classes_[idx] for idx in top_idx]
    return tuple(genre_corrections.get(genre, genre) for genre in top_genres)

In [19]:
make_prediction("the nation of panem consists of a wealthy capitalists")

('Drama', 'Comedy', 'Romance', 'Action', 'Action & Adventure')

In [20]:
make_prediction("the nation of panem consists of a wealthy capitalists", threshold=0.3)

['NA', 'NA', 'NA', 'NA', 'NA']

In [21]:
make_prediction("the nation of panem consists of a wealthy capitalists", threshold=0.29)

('Drama', 'Comedy', 'Romance', 'Action', 'Action & Adventure')
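The three calls above show that the result flips between `threshold=0.29` and `threshold=0.3`, so the top genre's probability for this short text lies between those two values. A minimal standalone sketch of the ranking and thresholding logic, with a hypothetical helper `top_k_labels` and made-up probabilities:

```python
import numpy as np

def top_k_labels(probs, classes, k=5, threshold=0.0):
    """Return the k most probable labels, or k 'NA' values if the best is below threshold."""
    probs = np.asarray(probs)
    order = np.argsort(probs)[::-1][:k]   # indices, most probable first
    if probs[order[0]] < threshold:
        return ["NA"] * k
    return [classes[i] for i in order]

classes = ["Action", "Comedy", "Drama", "Romance"]
probs = [0.10, 0.25, 0.29, 0.05]   # hypothetical per-genre probabilities (need not sum to 1)

print(top_k_labels(probs, classes, k=2))
print(top_k_labels(probs, classes, k=2, threshold=0.3))
```

Only the top probability is compared with the threshold; lower-ranked genres are returned even when their own probabilities are small.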

### Predicting Primary Genre for PeerLogix Dataset (original function)

In [22]:
peerlogix = pd.read_csv("PeerLogix Titles (IMDb Metadata).csv")

peerlogix.head()

Unnamed: 0,imdb_id,poster_url,trailer_url,release_year,description,imdb_score,rating,genres
0,tt0068817,https://m.media-amazon.com/images/M/MV5BYzk2Mj...,-VBLDluFlnQ,1972,Ogami Itto volunteers to be tortured by the ya...,7.5,R,"['Drama', 'Action']"
1,tt0069047,https://m.media-amazon.com/images/M/MV5BODM3MD...,iquHLv-Usrg,1973,A busload of tourists stops in to visit a smal...,4.8,R,['Horror']
2,tt0385990,https://m.media-amazon.com/images/M/MV5BZTZkYT...,1JusFfwHSgM,2004,Two time travelers with opposing views are sen...,3.1,R,"['Action', 'Science Fiction']"
3,tt1686768,https://m.media-amazon.com/images/M/MV5BODM1MD...,tSIbHII-N44,2011,An English teacher brings soccer from England ...,6.8,,['Drama']
4,tt2655470,https://m.media-amazon.com/images/M/MV5BMjE5Nz...,l7PKbUab0Xs,2013–2014,"Centers on Terry Gannon, a recently divorced s...",6.4,,"['Comedy', 'Family']"


In [23]:
primary_genre = []

for i, (imdb_id, plot) in enumerate(peerlogix[["imdb_id","description"]].values):
    existing_genres = existing_peerlogix_genres.get(imdb_id, []) ## Retrieve Existing Genres for id
    predicted_genres = make_prediction(plot) ## Make Prediction on Plot
    
    if existing_genres: ## If Genres present for IMDB ID then choose from it else append predicted one.
        if len(existing_genres) == 1: ## If single Genre then it'll be primary Genre
            primary_genre.append(existing_genres[0])
        else:
            selected_genre = None
            for genre in predicted_genres: ### Check for predicted Genre in existing Genres
                if genre in existing_genres:
                    selected_genre = genre
                    break    
            primary_genre.append(selected_genre if selected_genre else "NA") 
            
    else:
        primary_genre.append(predicted_genres[0]) ## Append first one which is primary
    
    if (i+1)%1000 == 0:
        print("{} iteration completed".format(i+1))

1000 iteration completed
2000 iteration completed
3000 iteration completed
...
38000 iteration completed
39000 iteration compl
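The loop above combines the known genres of a title with the model's ranked predictions. Its decision rule can be sketched as a small pure function, which is easier to test in isolation; the name `select_primary` and the sample inputs are ours.

```python
def select_primary(existing, predicted):
    """Pick a primary genre from known genres, guided by ranked predictions."""
    if not existing:            # nothing known: trust the top prediction
        return predicted[0]
    if len(existing) == 1:      # a single known genre is primary by definition
        return existing[0]
    for genre in predicted:     # first ranked prediction that is also a known genre
        if genre in existing:
            return genre
    return "NA"                 # no overlap between predictions and known genres

ranked = ("Drama", "Comedy", "Romance", "Action", "Action & Adventure")

print(select_primary(["Action", "Science Fiction"], ranked))
print(select_primary(["Horror", "Western"], ranked))
print(select_primary([], ranked))
```

The "NA" case arises when a title has several known genres but none of them appears among the five predicted genres.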

In [24]:
#peerlogix["Primary_Genre1"] = peerlogix["description"].apply(make_prediction)
peerlogix["Primary_Genre1"] = primary_genre

In [25]:
peerlogix.head()

Unnamed: 0,imdb_id,poster_url,trailer_url,release_year,description,imdb_score,rating,genres,Primary_Genre1
0,tt0068817,https://m.media-amazon.com/images/M/MV5BYzk2Mj...,-VBLDluFlnQ,1972,Ogami Itto volunteers to be tortured by the ya...,7.5,R,"['Drama', 'Action']",Action
1,tt0069047,https://m.media-amazon.com/images/M/MV5BODM3MD...,iquHLv-Usrg,1973,A busload of tourists stops in to visit a smal...,4.8,R,['Horror'],Horror
2,tt0385990,https://m.media-amazon.com/images/M/MV5BZTZkYT...,1JusFfwHSgM,2004,Two time travelers with opposing views are sen...,3.1,R,"['Action', 'Science Fiction']",Science Fiction
3,tt1686768,https://m.media-amazon.com/images/M/MV5BODM1MD...,tSIbHII-N44,2011,An English teacher brings soccer from England ...,6.8,,['Drama'],Drama
4,tt2655470,https://m.media-amazon.com/images/M/MV5BMjE5Nz...,l7PKbUab0Xs,2013–2014,"Centers on Terry Gannon, a recently divorced s...",6.4,,"['Comedy', 'Family']",Comedy


In [26]:
peerlogix.to_csv("Peerlogix_with_Genre.csv")

In [27]:
from collections import Counter

Counter(peerlogix.Primary_Genre1)

Counter({'Action': 2924,
         'Horror': 3341,
         'Science Fiction': 1096,
         'Drama': 18857,
         'Comedy': 8654,
         'Thriller': 3076,
         'NA': 1868,
         'Documentary': 3417,
         'Romance': 1539,
         'Animation': 771,
         'Crime': 1348,
         'Family': 746,
         'Mystery': 285,
         'Western': 402,
         'Reality': 743,
         'Adventure': 819,
         'Action & Adventure': 206,
         'Talk': 47,
         'Fantasy': 366,
         'War': 41,
         'News': 18,
         'Sci-Fi & Fantasy': 65,
         'Kids': 12,
         'Music': 85,
         'War & Politics': 6,
         'History': 17,
         'Soap': 13,
         'TV Movie': 4,
         'Musical': 4})