# Background

I enjoy watching movies of all sorts, but over the past couple of years I have often found myself saying **"I enjoyed the movie, but I would have enjoyed it more if they cut a bit of stuff out"**. It felt like the average runtime of movies had increased over the years, so I did what any reasonable person would do and looked at the data.

## Data Sources
All data was retrieved from the IMDb website on 08-02-2024.

https://datasets.imdbws.com/

Data Documentation: http://www.imdb.com/interfaces/

**title.basics.tsv.gz – Contains the following information for titles:**\
    tconst (string) – alphanumeric unique identifier of the title\
    titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)\
    primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release\
    originalTitle (string) – original title, in the original language\
    isAdult (boolean) – 0: non-adult title; 1: adult title\
    startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year\
    endYear (YYYY) – TV Series end year. ‘\N’ for all other title types\
    runtimeMinutes – primary runtime of the title, in minutes\
    genres (string array) – includes up to three genres associated with the title
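
Because `genres` arrives as a single comma-separated string, filtering by genre means either substring matching or splitting the field. Below is a minimal sketch of the splitting approach. The rows are invented for illustration (they are not taken from the dataset), and the snake_case column names are an assumption about how the table looks once loaded.

```python
import pandas as pd

# Toy rows shaped like title.basics after loading; all values are invented
toy = pd.DataFrame(
    {
        "primary_title": ["Film A", "Film B", "Film C"],
        "genres": ["Action,Drama", "Romance", None],
    }
)

# Split the comma-separated string into a list, then give each genre its own row
exploded = toy.assign(genre=toy["genres"].str.split(",")).explode("genre")

# Count titles per genre; rows with a missing genre are skipped by value_counts
genre_counts = exploded["genre"].value_counts()
print(genre_counts)
```

A title with several genres appears once per genre after `explode`, so per-genre summaries built this way count such a title in each of its genres.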

# Environment Setup

In [None]:
# Data handling and cleaning
import pandas as pd
import numpy as np
import janitor
import json
import requests
import re

# secrets file
import constants

# statistics library
import statistics as stats

# Graphics libraries
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotnine import *

# Data Processing

In [None]:
movies = (
    pd.read_csv(
        "data/title.basics.tsv.gz",
        sep="\t",
        low_memory=False,
        na_values=[r"\N", ",,"],
        compression="gzip",
    )
    .clean_names(case_type="snake")
    .remove_empty()
)

# make sure that it imported correctly
movies.shape

# further inspection showed that the data also includes miniseries and TV shows, so keep only
# movies with a known runtime and start year; .copy() avoids a SettingWithCopyWarning on later assignments
movies = movies.loc[
    (movies["title_type"] == "movie")
    & ~(pd.isna(movies["runtime_minutes"]))
    & ~(pd.isna(movies["start_year"]))
].copy()

movies.shape

# change runtime to numeric
movies["runtime_minutes"] = pd.to_numeric(movies["runtime_minutes"], errors="coerce")

movies["start_year"].describe()

movies["runtime_minutes"].describe()

plt.hist(movies["runtime_minutes"])
plt.show()

# some strange runtime and year values that we will have to investigate and remove

# let's see these movies with strange runtimes
movies[movies.runtime_minutes >= 1000]

# I'm not interested in unreleased movies, so drop anything with a start year after 2025 (and anything before 1900)
# for now let's keep the runtime_minutes outliers in; they won't change much and can always be removed later
movies = movies[(movies.start_year <= 2025) & (movies.start_year >= 1900)]
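
Whether the long-runtime outliers really "won't change much" can be checked by comparing the mean with and without them, and against the median, which is less sensitive to extreme values. Here is a small sketch of that check. The runtimes are invented, and the 1000-minute cutoff simply mirrors the inspection above.

```python
import statistics as stats

# Invented runtimes in minutes, including one extreme outlier
runtimes = [85, 90, 92, 95, 100, 1200]

mean_all = stats.mean(runtimes)
median_all = stats.median(runtimes)

# Same summary after dropping anything at or above the 1000-minute cutoff
trimmed = [r for r in runtimes if r < 1000]
mean_trimmed = stats.mean(trimmed)

print(f"mean with outliers:    {mean_all:.1f}")
print(f"mean without outliers: {mean_trimmed:.1f}")
print(f"median with outliers:  {median_all:.1f}")
```

With only six values a single outlier dominates the mean. In a table with far more rows, a handful of extreme runtimes would shift it much less, but the comparison is the same.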

# Overall Movies

Below is a plot of the mean runtime per year for all movies in the database, from 1900 to 2024.

In [None]:
mean_runtime = movies.groupby("start_year", as_index=False).agg(
    mean_runtime=("runtime_minutes", "mean"),
    n=("start_year", "count"),
)

stats.mean(movies.runtime_minutes)

mean_runtime.dtypes

plot = (
    ggplot(mean_runtime, aes(x="start_year", y="mean_runtime", colour="n"))
    + geom_line(size=2, lineend="round", linejoin="mitre")
    + geom_hline(
        alpha=0.8, yintercept=89.60, linetype="solid", size=1.15, color="darkgrey"
    )
    + scale_y_continuous(limits=(0, 120), breaks=(0, 30, 60, 90, 120), expand=(0, 5))
    + scale_x_continuous(expand=(0, 1))
    + scale_colour_gradient(
        low="#E7B800",
        high="#FC4E07",
        breaks=(0, 4000, 8000, 12000),
    )
    + labs(
        x="Year",
        y="Mean Runtime (in Minutes)",
        colour="Number of Movies",
        title="Mean runtime for all movies from 1900 to 2024",
        subtitle="Grey line is the overall mean runtime (89.60 min). Data from IMDb",
    )
    + theme_minimal()
    + theme(
        legend_position=(0.75, 0.25),
        legend_direction="horizontal",
        legend_text=element_text(size=10),
        axis_text_x=element_text(face="bold"),
        panel_grid_minor=element_blank(),
        text=element_text(size=10),
        axis_title_x=element_text(size=10, face="bold"),
        axis_title_y=element_text(size=10, face="bold"),
        axis_text_y=element_text(face="bold"),
    )
)

plot.show()

Let's zoom in and look only at movies released since I was born (1994). From 2020 onward, the average runtime of movies appears to be increasing.

In [None]:
mean_runtime = (
    movies[(movies.start_year >= 1994)]
    .groupby("start_year", as_index=False)
    .agg(
        mean_runtime=("runtime_minutes", "mean"),
        n=("start_year", "count"),
    )
)

stats.mean(movies.runtime_minutes[(movies.start_year >= 1994)])


plot = (
    ggplot(mean_runtime, aes(x="start_year", y="mean_runtime", colour="n"))
    + geom_line(size=2, lineend="round", linejoin="mitre")
    + geom_hline(
        alpha=0.8, yintercept=90.16, linetype="solid", size=1.15, color="darkgrey"
    )
    + geom_line(stat="smooth", method="lm", alpha=0.1)
    + scale_y_continuous(limits=(0, 120), breaks=(0, 30, 60, 90, 120), expand=(0, 5))
    + scale_x_continuous(expand=(0, 1))
    + scale_colour_gradient(
        low="#E7B800",
        high="#FC4E07",
        breaks=(0, 4000, 8000, 12000),
    )
    + labs(
        x="Year",
        y="Mean Runtime (in Minutes)",
        colour="Number of Movies",
        title="Mean runtime for all movies from 1994 to 2024",
        subtitle="Grey line is the overall mean runtime (90.16 min). Data from IMDb",
    )
    + theme_minimal()
    + theme(
        legend_position=(0.75, 0.25),
        legend_direction="horizontal",
        panel_grid_minor=element_blank(),
        text=element_text(size=10),
        axis_title_x=element_text(size=10, face="bold"),
        axis_title_y=element_text(size=10, face="bold"),
        axis_text_y=element_text(face="bold"),
        axis_text_x=element_text(face="bold"),
    )
)

plot.show()
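
The faint trend line in the plot above comes from a linear fit over the yearly means. To put a number on that slope, a least-squares line can be fitted directly. This is a sketch on invented yearly means, using `numpy.polyfit` as a stand-in rather than anything from the original analysis.

```python
import numpy as np

# Invented yearly mean runtimes (minutes); not values from the IMDb data
years = np.array([2018, 2019, 2020, 2021, 2022, 2023])
mean_minutes = np.array([90.0, 90.5, 89.5, 91.0, 92.0, 93.0])

# Centre the years first to keep the fit numerically well conditioned
centred_years = years - years.mean()

# Degree-1 polynomial fit: returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(centred_years, mean_minutes, 1)

print(f"trend: {slope:+.2f} minutes per year")
```

Because the years are centred, the intercept is the fitted value at the middle of the period. Note that a fit on yearly means gives each year equal weight, regardless of how many movies it contains.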

# Per Genre
Now we produce the same plot as the overall plot above, but for three genres: action, romance, and drama.

In [None]:
mean_runtime = (
    movies[(movies["genres"].str.contains("Action", na=False))]
    .groupby("start_year", as_index=False)
    .agg(
        mean_runtime=("runtime_minutes", "mean"),
        n=("start_year", "count"),
    )
)

mean_runtime.head()

# The same plot but now for action movies
action = (
    ggplot(mean_runtime, aes(x="start_year", y="mean_runtime", colour="n"))
    + geom_line(size=1, lineend="round", linejoin="mitre")
    + geom_hline(
        alpha=0.8, yintercept=89.60, linetype="solid", size=1.15, color="darkgrey"
    )
    + scale_y_continuous(limits=(0, 280), breaks=(0, 30, 60, 90, 120), expand=(0, 5))
    + scale_x_continuous(expand=(0, 1))
    + scale_colour_gradient(
        low="#E7B800",
        high="#FC4E07",
    )
    + labs(
        x="Year",
        y="Mean Runtime (in Minutes)",
        colour="Number of Movies",
        title="Action Movies",
    )
    + theme_minimal()
    + theme(
        legend_position=(0.75, 0.2),
        legend_direction="horizontal",
        panel_grid_minor=element_blank(),
        text=element_text(size=10),
        axis_title_x=element_text(size=10, face="bold"),
        axis_title_y=element_text(size=10, face="bold"),
        axis_text_y=element_text(face="bold"),
        axis_text_x=element_text(face="bold"),
    )
)

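# The same plot but now for romance movies (with its own per-year summary)
romance_runtime = (
    movies[(movies["genres"].str.contains("Romance", na=False))]
    .groupby("start_year", as_index=False)
    .agg(
        mean_runtime=("runtime_minutes", "mean"),
        n=("start_year", "count"),
    )
)
romance = (
    ggplot(romance_runtime, aes(x="start_year", y="mean_runtime", colour="n"))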
    + geom_line(size=1, lineend="round", linejoin="mitre")
    + geom_hline(
        alpha=0.8, yintercept=89.60, linetype="solid", size=1.15, color="darkgrey"
    )
    + scale_y_continuous(limits=(0, 120), breaks=(0, 30, 60, 90, 120), expand=(0, 5))
    + scale_x_continuous(expand=(0, 1))
    + scale_colour_gradient(
        low="#E7B800",
        high="#FC4E07",
    )
    + labs(
        x="Year",
        y="Mean Runtime (in Minutes)",
        colour="Number of Movies",
        title="Romance movies",
    )
    + theme_minimal()
    + theme(
        legend_position=(0.75, 0.2),
        legend_direction="horizontal",
        panel_grid_minor=element_blank(),
        text=element_text(size=10),
        axis_title_x=element_text(size=10, face="bold"),
        axis_title_y=element_text(size=10, face="bold"),
        axis_text_y=element_text(face="bold"),
        axis_text_x=element_text(face="bold"),
    )
)

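# The same plot but now for drama movies (with its own per-year summary)
drama_runtime = (
    movies[(movies["genres"].str.contains("Drama", na=False))]
    .groupby("start_year", as_index=False)
    .agg(
        mean_runtime=("runtime_minutes", "mean"),
        n=("start_year", "count"),
    )
)
drama = (
    ggplot(drama_runtime, aes(x="start_year", y="mean_runtime", colour="n"))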
    + geom_line(size=1, lineend="round", linejoin="mitre")
    + geom_hline(
        alpha=0.8, yintercept=89.60, linetype="solid", size=1.15, color="darkgrey"
    )
    + scale_y_continuous(limits=(0, 120), breaks=[0, 30, 60, 90, 120], expand=(0, 5))
    + scale_x_continuous(expand=(0, 1))
    + scale_colour_gradient(
        low="#E7B800",
        high="#FC4E07",
    )
    + labs(
        x="Year",
        y="Mean Runtime (in Minutes)",
        colour="Number of Movies",
        title="Drama movies",
    )
    + theme_minimal()
    + theme(
        legend_position=(0.75, 0.2),
        legend_direction="horizontal",
        panel_grid_minor=element_blank(),
        text=element_text(size=10),
        axis_title_x=element_text(size=10, face="bold"),
        axis_title_y=element_text(size=10, face="bold"),
        axis_text_y=element_text(face="bold"),
        axis_text_x=element_text(face="bold"),
    )
)

action.show()
romance.show()
drama.show()
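
Since the genre plots differ only in the genre they filter on, the per-year summary can be factored into a helper. Below is a sketch of that idea. `yearly_mean_runtime` is a hypothetical helper name, the toy rows are invented, and the code assumes the `start_year`, `runtime_minutes`, and `genres` columns used above.

```python
import pandas as pd


def yearly_mean_runtime(df: pd.DataFrame, genre: str) -> pd.DataFrame:
    """Mean runtime and title count per year for titles tagged with `genre`."""
    # regex=False treats the genre name as a literal substring
    tagged = df[df["genres"].str.contains(genre, na=False, regex=False)]
    return tagged.groupby("start_year", as_index=False).agg(
        mean_runtime=("runtime_minutes", "mean"),
        # number of titles with a non-missing runtime in that year
        n=("runtime_minutes", "count"),
    )


# Invented rows with the same column names as the cleaned IMDb table
toy = pd.DataFrame(
    {
        "start_year": [2000, 2000, 2001, 2001],
        "runtime_minutes": [100, 120, 90, 95],
        "genres": ["Action,Drama", "Action", "Romance", "Drama,Romance"],
    }
)

# One summary table per genre, keyed by genre name
per_genre = {g: yearly_mean_runtime(toy, g) for g in ["Action", "Romance", "Drama"]}
print(per_genre["Action"])
```

Keeping one summary table per genre also makes it harder to plot one genre's data under another genre's title by accident.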


# Movies I Watched

We have now seen that although movie runtime has trended upward since 1900, the runtime of movies released since I was born has remained relatively flat, with an increasing trend since 2020. Thankfully, I keep extensive records of the movies I have watched (although record keeping only started around 2017), so we can check whether the movies **I have specifically watched** have increased in runtime over time.


In [None]:
api_url = "https://movies.lukehannan.com"
response = requests.get(
    api_url, headers={"X-Api-Key": constants.API_KEY}, timeout=30
)

# Stop on an HTTP error instead of continuing without `data`
response.raise_for_status()

# Parse the JSON data and convert it to a pandas DataFrame
data = pd.DataFrame(response.json())

data.shape
data.dtypes

# change runtime to numeric
data["runtime"] = pd.to_numeric(data["runtime"].str.extract(r"(\d+)", expand=False))

data["year"] = pd.to_numeric(data["year"])


mean_runtime = data.groupby("year", as_index=False).agg(
    mean_runtime=("runtime", "mean"),
    n=("year", "count"),
)


mean_runtime.head()

# note: this is the unweighted mean of the yearly means, so every year counts equally
stats.mean(mean_runtime.mean_runtime)

plot = (
    ggplot(mean_runtime, aes(x="year", y="mean_runtime", colour="n"))
    + geom_line(size=2, lineend="round", linejoin="mitre")
    + geom_hline(
        alpha=0.8, yintercept=117.67, linetype="solid", size=1.15, color="darkgrey"
    )
    + scale_y_continuous(
        limits=[0, 180], breaks=(0, 30, 60, 90, 120, 150, 180), expand=(0, 5)
    )
    + scale_x_continuous(expand=(0, 1))
    + scale_colour_gradient(
        low="#E7B800",
        high="#FC4E07",
    )
    + labs(
        x="Year",
        y="Mean Runtime (in Minutes)",
        colour="Number of Movies",
        title="Mean runtime for all movies I have watched",
        subtitle="Grey line is the mean of the yearly mean runtimes (117.67 min). Data from me",
    )
    + theme_minimal()
    + theme(
        legend_position=(0.75, 0.25),
        legend_direction="horizontal",
        panel_grid_minor=element_blank(),
        text=element_text(size=10),
        axis_title_x=element_text(size=10, face="bold"),
        axis_title_y=element_text(size=10, face="bold"),
        axis_text_y=element_text(face="bold"),
        axis_text_x=element_text(face="bold"),
    )
)

plot.show()
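
One caveat about the reference line: averaging the yearly means gives every year the same weight, however many movies it contains, so it can differ from the mean over all individual movies. A small sketch with invented numbers shows the gap.

```python
import statistics as stats

# Invented runtimes grouped by year: one busy year and one sparse year
runtimes_by_year = {
    2017: [100, 110, 120, 130],
    2018: [160],
}

# Unweighted: average the per-year means, each year counting once
yearly_means = [stats.mean(v) for v in runtimes_by_year.values()]
mean_of_means = stats.mean(yearly_means)

# Weighted: average over every movie, so busy years count for more
all_runtimes = [r for v in runtimes_by_year.values() for r in v]
overall_mean = stats.mean(all_runtimes)

print(mean_of_means, overall_mean)
```

Here the sparse year pulls the mean of means well above the per-movie mean. Which of the two is the better reference depends on whether the question is about a typical year or a typical movie.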


# Conclusion

Across all the figures, it seems that mean movie runtime has increased slightly since 1900, while the mean runtime of the movies I have watched has increased more noticeably over time.