# 1 Motivation

### 1.1 What is your dataset?

Our dataset is the TMDB (The Movie Database) dataset, downloaded from Kaggle ("Full TMDB Movies Dataset 2024 (1M Movies)", approximately 472 MB), where a user publishes a daily updated extract of the database. The version used for this report was extracted on 1 May 2024. TMDB itself (https://www.themoviedb.org/) is an organization based in Canada that provides a publicly accessible database of information about movies.

### 1.2 Why did you choose this/these particular dataset(s)?

We considered a range of datasets but chose this one for two reasons: its size (over one million entries and 24 columns) lets us explore several dimensions, and it deals with movies, which we both have a passion for. As shown in the second part of this notebook, the columns cover specific fields such as movie quality, finances, and production locations, which gives us the basis for a visualization website that covers several different plot types. Additionally, the dataset includes all the columns we were interested in analyzing in the first place, such as budget, rating, genre, and revenue.

### 1.3 What was your goal for the end user's experience?

On the one hand, we wanted to give the user a basic overview of what is going on in the movie industry: how it developed over time, where it stands today, and some specific connections within the data.

On the other hand, we want to convey the message that this dataset mirrors specific characteristics of our society. All of this should be conveyed interactively, so that users can decide for themselves what to focus on, while we still give some basic guidance on where to look.

# 2 Basic stats

### 2.1 Write about your choices in data cleaning and preprocessing

For data cleaning, we later (see 3.1.2) set some limits for the world map visualization in order to exclude outlier data. The relevant code is shown below:

In [None]:
# Limits
'''
min_moviesPerCountry = 50
budget_range = [1, 5000000]
revenue_range = [1, 100000000]
min_rating = 5.5
'''

Some countries, such as Aruba, appear only rarely in the dataset but are listed as a production_country of a few high-budget movies. Aruba, for example, has a budget range that clearly exceeds that of all the countries after it, so including it would have distorted the color scale.
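
A minimum-count limit of this kind can be sketched as follows. This is a minimal illustration on made-up data, not the filtering code used for the map; the helper name `countries_with_enough_movies` and the sample counts are assumptions.

```python
from collections import Counter

MIN_MOVIES_PER_COUNTRY = 50  # same value as min_moviesPerCountry in the limits above


def countries_with_enough_movies(production_countries, min_movies):
    """Return the set of countries that occur at least `min_movies` times."""
    counts = Counter(production_countries)
    return {country for country, n in counts.items() if n >= min_movies}


# Made-up sample: one production country per movie
sample = ["Aruba"] * 3 + ["France"] * 60 + ["Japan"] * 50
kept = countries_with_enough_movies(sample, MIN_MOVIES_PER_COUNTRY)
print(sorted(kept))
```

With these sample counts, the rarely occurring country is dropped before any color scale would be computed.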

For preprocessing, we created dummy variables for the genres and countries (see all of part 3). For the map visualization, we also merged the Soviet Union and Russia data (as we do not know any more detailed production location that would let us split it among the other successor states) and merged the East and West Germany data into one.
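
The merging step can be pictured as folding the counts of a historical state into a present-day region. The sketch below uses plain Python and made-up counts; the mapping `MERGE_INTO` and the function name are assumptions for illustration, and the actual notebook may merge the dummy columns differently.

```python
from collections import Counter

# Assumed mapping from historical states to the map region they are merged into
MERGE_INTO = {"SovietUnion": "Russia", "EastGermany": "Germany"}


def merge_historical_states(movie_counts):
    """Return a Counter in which historical states are added to their target region."""
    merged = Counter()
    for country, count in movie_counts.items():
        merged[MERGE_INTO.get(country, country)] += count
    return merged


# Made-up movie counts per production country
counts = {"Russia": 120, "SovietUnion": 80, "Germany": 300, "EastGermany": 40, "France": 500}
merged = merge_historical_states(counts)
print(merged["Russia"], merged["Germany"], merged["France"])
```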

### 2.2 Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

In [None]:
df = pd.read_csv("TMDB_movie_dataset_v11.csv")

As we can see below, we have over a million entries. Several columns contain only non-null values, while others, such as release_date, homepage, and imdb_id, have many missing entries.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030001 entries, 0 to 1030000
Data columns (total 24 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   id                    1030001 non-null  int64  
 1   title                 1029989 non-null  object 
 2   vote_average          1030001 non-null  float64
 3   vote_count            1030001 non-null  int64  
 4   status                1030001 non-null  object 
 5   release_date          900626 non-null   object 
 6   revenue               1030001 non-null  int64  
 7   runtime               1030001 non-null  int64  
 8   adult                 1030001 non-null  bool   
 9   backdrop_path         286900 non-null   object 
 10  budget                1030001 non-null  int64  
 11  homepage              111396 non-null   object 
 12  imdb_id               578618 non-null   object 
 13  original_language     1030001 non-null  object 
 14  original_title        1029989 non-

In [None]:
df.head()

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,2010-07-15,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,...,The Dark Knight,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f..."
3,19995,Avatar,7.573,29815,Released,2009-12-15,2923706026,162,False,/vL5LR6WdxWPjLPFRLe133jXWsh5.jpg,...,Avatar,"In the 22nd century, a paraplegic Marine is di...",79.932,/kyeqWdyUXW608qlYkRqosgbbJyK.jpg,Enter the world of Pandora.,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish","future, society, culture clash, space travel, ..."
4,24428,The Avengers,7.71,29166,Released,2012-04-25,1518815515,143,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,...,The Avengers,When an unexpected enemy emerges and threatens...,98.082,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,Some assembly required.,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian","new york city, superhero, shield, based on com..."


We observed some entries that are probably wrong. For example, the "earliest" movie listed below is A Farsa de Inês Pereira, which was released as a movie later, in the 20th century, but whose release year is given as 1800. Because there is no algorithm that can reliably filter out such wrong entries, we decided to leave them in. We assume that they have only a minor influence on our quantitative visualizations, given the large total number of movies in the dataset. By not excluding the individual wrong entries we happened to find, we keep the treatment of wrong entries consistent across the dataset.

It is still worth mentioning that the earliest movie we could confirm in our dataset is a motion picture of Felix Nadar spinning in his chair from 1865. You can watch it here: https://www.youtube.com/watch?v=uq9Yv9RwMZ4

In [None]:
df.sort_values(by="release_date")

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
590407,1233885,A Farsa de Inês Pereira,0.0,0,Released,1800-09-11,0,56,False,,...,A Farsa de Inês Pereira,The farsades of ines pereira by the goat of my...,0.600,,,,,,Portuguese,
387888,1244969,Felix Nadar Spinning in his Chair,0.0,0,Released,1865-01-01,0,1,False,,...,Felix Nadar Spinning in his Chair,Felix Nadar Spinning in his Chair,0.600,,,Documentary,Paris Nadar Studio,,No Language,
331097,1181748,Felix Nadar Spinning in his Chair,10.0,1,Released,1865-01-01,0,1,False,,...,Felix Nadar Spinning in his Chair,Revolving portrait of French photographer Feli...,1.162,/zjqn8AjirFf3434cKVAm6SV9fLH.jpg,,Documentary,,France,,
369073,1256924,Felix Nadar Spinning in his Chair,0.0,0,Released,1865-01-01,0,1,False,,...,Felix Nadar Spinning in his Chair,Felix Nadar Spinning in his Chair,0.000,,The frames that spun!,,,,,
522801,1208472,The Frontier,0.0,0,Released,1867-01-01,0,5,False,,...,La Frontera,"De Armas' last production, La frontera (1967),...",0.600,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1029994,680500,Gemmy Brown and the Multiverse,0.0,0,Planned,,0,85,False,,...,Gemmy Brown à la conquête du Multivers,Eleven years old Gemmy Brown has a single obse...,0.600,/aniIYFM4kuAnq4z92GwNW7C9oPe.jpg,,"Animation, Animation, Family, Science Fiction",Ozone Studio,France,French,
1029996,680502,探案錄財寶大劫案,0.0,0,Released,,0,0,False,,...,探案錄財寶大劫案,,0.600,,,,,,,
1029997,680504,寄生遊戲,0.0,0,Released,,0,0,False,,...,寄生遊戲,,0.600,,,,,,,
1029999,680507,Paris by Night 60,0.0,0,Released,,0,0,False,,...,Paris by Night 60,Paris by Night 60,0.600,/fgh8LGGkDl9TVIAJ3dUnBx6EIPz.jpg,,,,,,


The longest movie in this database is a 240-hour documentary (a runtime of 14400 minutes) about the Stora Enso headquarters in Helsinki.

In [None]:
df.sort_values(by="runtime", ascending=False)

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
203924,251800,Modern Times Forever,7.0,2,Released,2011-03-23,0,14400,False,,...,"Stora Enso Building, Helsinki","The film shows centuries of decay, compressed ...",0.896,/wLO187Bh5udP3nYINc7D314sS5X.jpg,,Documentary,,"Denmark, Finland, Vietnam",Danish,
988608,710874,Svalbard minutt for minutt,0.0,0,Released,2020-01-31,0,13319,False,/ejSYxDnjPUIElj2V0SCsvrE7B2J.jpg,...,Svalbard minutt for minutt,A documentary trying to relive the 10 days of ...,0.600,/vi6ADwjwvvrVjAa08b9PxWvgbyV.jpg,,Documentary,,,Norwegian,real time
105677,272074,Cinématon,4.3,6,Released,1978-12-20,0,12480,False,/6fJgLOFJO5AAlieLZxfmTyZKOEy.jpg,...,Cinématon,Cinématon is a 156-hour long experimental film...,1.968,,,Documentary,"K.O.C.K. Production, Les Amis de Cinématon",France,No Language,
341905,197299,Beijing 2003,1.0,1,Released,2004-01-01,0,9000,False,,...,Beijing 2003,Beijing 2003 is a video about the city that th...,0.707,,,Documentary,Ai Weiwei Studio,China,Mandarin,
1007214,717019,Untitled #125 (Hickory),0.0,0,Released,2011-01-01,0,7200,False,,...,Untitled #125 (Hickory),In 2011 Azzarella released Untitled #125 (Hick...,0.600,/krgZwsqAJbXcKmjdio1oBeaYQb7.jpg,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
843382,343499,The Grey Robber,0.0,0,Released,1957-01-25,0,0,False,,...,Серый разбойник,A Soviet Union film from 1957.,0.655,,,Adventure,,Soviet Union,Russian,wolf
432881,1273687,Геля,0.0,0,Released,,0,0,False,,...,Геля,A crime action comedy on a Landwagen with a fa...,0.000,,,,,,,
843384,343502,The Adventures of Frontier Fremont,0.0,0,Released,1976-01-14,0,0,False,,...,The Adventures of Frontier Fremont,The true story of one man's struggle to make t...,0.748,/kj8C2oPu6SxF1kus82UUwxJM1OM.jpg,,,,,,
1030000,1282048,流転の海,0.0,0,Released,,0,0,False,,...,流転の海,"Set in Osaka just after the end of the war, th...",0.000,,,,,,,


# 3 Data Analysis

We worked on the analysis in three different documents, which are provided below.

### 3.1 Describe your data analysis and explain what you've learned about the dataset.

In our data analysis, we first tried to get an overview of the data and an idea of some rough overall relations and basic statistics. We looked for something interesting for our story and for characteristics in the movie data that reflect our society. We therefore started by plotting a Network Graph, which gives a rough idea of how movies are distributed across genres and how genres relate to each other (that is, how often, relatively, two genres are assigned to the same movie).

#### 3.1.3 Network Graph and Decade World Map

In this file we generated a Network Graph. To be transparent about how we proceeded, we left the whole process in, so there are several versions of the Network Graph, ending with the one used for the web visualization.

In [None]:
import pandas as pd
import networkx as nx
import plotly.graph_objects as go
from itertools import combinations
import community as community_louvain

In [None]:
df = pd.read_csv("TMDB_movie_dataset_v11.csv")

In [None]:
# Here we create dummy variables for each genre; the genres are given in the genres column and are separated by commas
genres_expanded = df['genres'].str.replace(" ", "").str.get_dummies(sep=',')
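
To make the effect of this step concrete, here is a plain-Python sketch of the same idea on made-up strings. It is an illustration only: the function name `genre_indicators` is an assumption, and missing values are treated here as "no genre".

```python
def genre_indicators(genre_strings):
    """Turn comma-separated genre strings into one 0/1 indicator dict per movie."""
    # Remove all spaces first (so "Science Fiction" becomes "ScienceFiction"), then split on commas
    parsed = [
        set(s.replace(" ", "").split(",")) if isinstance(s, str) and s else set()
        for s in genre_strings
    ]
    all_genres = sorted(set().union(*parsed))
    return [{g: int(g in genres) for g in all_genres} for genres in parsed]


# Made-up sample in the same format as the genres column; None stands for a missing value
rows = genre_indicators(["Action, Science Fiction", "Drama", None])
print(rows[0])
```

Removing the spaces before splitting is what produces column names without spaces, such as ScienceFiction.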

Here we are generating the first version of our Network Graph

In [None]:
df = genres_expanded

G = nx.Graph()

# Add nodes with the size attribute
for genre in df.columns:
    G.add_node(genre, size=df[genre].sum())

# Add an edge for each pair of genres that appear together on at least one movie
# In a later version of our Network Graph, we encode how often a genre combination occurs in the edge width
# Here the edge width is constant, so an edge is drawn whenever at least one co-occurrence exists, which connects almost every pair (not optimal yet)
for (genre1, genre2) in combinations(df.columns, 2):
    weight = (df[genre1] & df[genre2]).sum()
    if weight > 0:
        G.add_edge(genre1, genre2, weight=weight)

# Here we specify the positioning of the nodes
pos = nx.spring_layout(G, k=0.3, seed = 111)

# Here the edges are created
edge_x = []
edge_y = []
weights = []
for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.append(x0)
    edge_x.append(x1)
    edge_x.append(None)
    edge_y.append(y0)
    edge_y.append(y1)
    edge_y.append(None)
    weights.append(G.edges[edge]['weight'])

# Here the nodes are created
node_x = []
node_y = []
sizes = []
for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)
    sizes.append(G.nodes[node]['size'])

# Creating edge traces
edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=0.5, color='grey'),
    hoverinfo='none',
    mode='lines')

# Creating node traces
node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    marker=dict(
        showscale=True,
        colorscale='YlGnBu',
        size=sizes,
        sizemode='area',
        sizeref=2.*max(sizes)/(40.**2),
        sizemin=4
    ),
    text=list(G.nodes()))

# Final Figure
fig = go.Figure(data=[edge_trace, node_trace],
                layout=go.Layout(
                    title='Network Graph of Movie Genres',
                    titlefont_size=16,
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20, l=5, r=5, t=40),
                    annotations=[dict(
                        text="",
                        showarrow=False,
                        xref="paper", yref="paper",
                        x=0.005, y=-0.002)],
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                )
fig.show()
fig.write_html('network_graph_of_movie_genres.html')
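
The edge weights in the cell above count how many movies carry both genres of a pair. A minimal plain-Python sketch of that counting, on made-up data and with an assumed helper name `cooccurrence_counts`, could look like this:

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(genre_sets):
    """Count, for every unordered genre pair, how many movies carry both genres."""
    counts = Counter()
    for genres in genre_sets:
        # Sorting gives every pair a canonical (alphabetical) order
        for pair in combinations(sorted(genres), 2):
            counts[pair] += 1
    return counts


# Made-up sample: one set of genres per movie
movies = [{"Action", "Adventure"}, {"Action", "Adventure", "Drama"}, {"Drama"}]
pairs = cooccurrence_counts(movies)
print(pairs[("Action", "Adventure")])
```

Pairs that never occur together have a count of zero and therefore get no edge.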

Here we applied the Louvain method to detect communities, in order to gain insight into groups of genres that are closely related to each other.

In [None]:
df = genres_expanded

G = nx.Graph()

for genre in df.columns:
    total_movies = df[genre].sum()
    G.add_node(genre, size=total_movies, total_movies=total_movies)

for (genre1, genre2) in combinations(df.columns, 2):
    weight = (df[genre1] & df[genre2]).sum()
    if weight > 0:
        G.add_edge(genre1, genre2, weight=weight)

# Community detection
partition = community_louvain.best_partition(G)

pos = nx.spring_layout(G, k=0.3, seed=101)

edge_x, edge_y = [], []
for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.extend([x0, x1, None])
    edge_y.extend([y0, y1, None])

node_x, node_y, sizes, hover_texts, node_colors = [], [], [], [], []
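for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)
    sizes.append(G.nodes[node]['size'])
    community_id = partition[node]
    node_colors.append(community_id)
    # Fill the hover text, which would otherwise stay empty
    hover_texts.append(f"{node}<br>Total Movies: {G.nodes[node]['total_movies']}")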

edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=0.5, color='grey'),
    hoverinfo='none',
    mode='lines')

node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    text=hover_texts,
    marker=dict(
        showscale=False,  # Hide the color bar
        colorscale='Viridis',  # Color scale
        size=sizes,
        color=node_colors,  # Assign community colors to nodes
        sizemode='area',
        sizeref=2.*max(sizes)/(40.**2),
        sizemin=4
    ))

fig = go.Figure(data=[edge_trace, node_trace],
                layout=go.Layout(
                    title='Network Graph of Movie Genres',
                    titlefont_size=16,
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20, l=5, r=5, t=40),
                    annotations=[{
                        "text": "",
                        "showarrow": False,
                        "xref": "paper",
                        "yref": "paper",
                        "x": 0.005,
                        "y": -0.002
                    }],
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)))

fig.show()
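
The Louvain method searches for a partition of the nodes with high modularity, a score that compares the edge weight inside communities with what would be expected if edges were placed at random. The sketch below computes that score for a given partition on a toy graph. It is not part of the notebook pipeline; the function name and the toy graph are assumptions, and self-loops are not handled.

```python
def modularity(edges, partition):
    """Modularity of an undirected weighted graph for a node -> community mapping."""
    total_weight = sum(w for _, _, w in edges)
    internal = {}  # community -> weight of edges with both ends inside it
    degree = {}    # community -> summed weighted degree of its nodes
    for u, v, w in edges:
        cu, cv = partition[u], partition[v]
        degree[cu] = degree.get(cu, 0) + w
        degree[cv] = degree.get(cv, 0) + w
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + w
    return sum(
        internal.get(c, 0) / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )


# Toy graph: two triangles joined by a single edge, all weights 1
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
         ("c", "d", 1)]
q_split = modularity(edges, {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1})
q_single = modularity(edges, {node: 0 for node in "abcdef"})
print(q_split, q_single)
```

Splitting the toy graph into its two triangles scores higher than putting all nodes into one community, which is the kind of grouping the community colors in the graph are meant to show.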

In the next graph we experimented with the edge width to indicate stronger and weaker co-occurrences between the genres.

In [None]:
df = genres_expanded

G = nx.Graph()

for genre in df.columns:
    total_movies = df[genre].sum()
    G.add_node(genre, size=total_movies)

for (genre1, genre2) in combinations(df.columns, 2):
    weight = (df[genre1] & df[genre2]).sum()
    if weight > 0:
        G.add_edge(genre1, genre2, weight=weight)

partition = community_louvain.best_partition(G)

pos = nx.spring_layout(G, k=0.3, seed=101)

fig = go.Figure()

# Different weight classes
weight_classes = {0.5: [], 1: [], 2: [], 3: []}
for edge in G.edges(data=True):
    normalized_weight = edge[2]['weight'] // 100
    if normalized_weight > 0:
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        weight_class = min(normalized_weight, max(weight_classes.keys()))
        weight_classes[weight_class].extend([(x0, y0, x1, y1)])

for weight, edges in weight_classes.items():
    x = []
    y = []
    for (x0, y0, x1, y1) in edges:
        x.extend([x0, x1, None])
        y.extend([y0, y1, None])
    fig.add_trace(go.Scatter(
        x=x, y=y,
        mode='lines',
        line=dict(width=weight, color='grey'),
        hoverinfo='none'
    ))

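# Build node coordinates and sizes from this cell's graph instead of reusing variables from the previous cell
node_x = [pos[node][0] for node in G.nodes()]
node_y = [pos[node][1] for node in G.nodes()]
sizes = [G.nodes[node]['size'] for node in G.nodes()]
max_node_size = max(sizes)

fig.add_trace(go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    marker=dict(
        size=sizes,
        # Look up each node's community explicitly so colors follow the node order
        color=[partition[node] for node in G.nodes()],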
        colorscale='Viridis',
        sizemode='area',
        sizeref=2. * max_node_size / (10. ** 4),
        sizemin=4
    ),
    text=[f"{node}<br>{G.nodes[node]['size']} movies" for node in G.nodes()],
))

fig.update_layout(
    title='Network Graph of Movie Genres',
    titlefont_size=16,
    hovermode='closest',
    showlegend=False,
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
)

fig.show()
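
The width classes in the cell above come from an integer division of the co-occurrence count by 100, capped at the largest class. A small sketch of that rule on made-up counts (the function name `edge_width_class` is an assumption) makes the consequences visible:

```python
def edge_width_class(weight, step=100, max_class=3):
    """Return the width class for a co-occurrence count, or None if the edge is skipped."""
    normalized = weight // step
    if normalized <= 0:
        return None  # fewer than `step` shared movies: no line is drawn
    return min(normalized, max_class)


# Made-up co-occurrence counts
for w in (40, 150, 275, 9000):
    print(w, edge_width_class(w))
```

Under this rule, pairs with fewer than 100 shared movies are not drawn at all, and every pair with 300 or more shared movies gets the same width, which is one reason a continuous scaling is easier to read.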

In the next graph we adjusted the design and the scaling of the edge width to make it look nicer on the webpage.

In [None]:
df = genres_expanded

G = nx.Graph()

for genre in df.columns:
    total_movies = df[genre].sum()
    G.add_node(genre, size=total_movies, total_movies=total_movies)

for (genre1, genre2) in combinations(df.columns, 2):
    weight = (df[genre1] & df[genre2]).sum()
    if weight > 0:
        G.add_edge(genre1, genre2, weight=weight)

partition = community_louvain.best_partition(G)

pos = nx.spring_layout(G, k=0.3, seed=101)

fig = go.Figure()

# We chose these min and max widths because they result in a clear visualization of the relations
min_width = 0.05
max_width = 10

# Maximum weight for normalization
max_weight = max(edge[2]['weight'] for edge in G.edges(data=True))

for edge in G.edges(data=True):
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    weight = edge[2]['weight']
    # Normalize the weight and scale it within the min_width to max_width range
    edge_width = (weight / max_weight) * (max_width - min_width) + min_width

    fig.add_trace(go.Scatter(
        x=[x0, x1, None],
        y=[y0, y1, None],
        mode='lines',
        line=dict(width=edge_width, color='pink'),
        hoverinfo='none'
    ))

# Node sizes for visualization
sizes = [G.nodes[node]['size']*10 for node in G.nodes()] 

node_trace = go.Scatter(
    x=[pos[node][0] for node in G.nodes()],
    y=[pos[node][1] for node in G.nodes()],
    mode='markers',
    marker=dict(
        showscale=False,
        colorscale='Viridis',
        size=sizes,
        color=list(partition.values()),
        sizemode='area',
        sizeref=2.*max(sizes)/(40.**2),
        sizemin=4
    ),
    text=[f"{node}<br>Total Movies: {G.nodes[node]['total_movies']}" for node in G.nodes()],
    hoverinfo='text'
)

fig.add_trace(node_trace)

fig.update_layout(
    title='Network Graph of Movie Genres with Weighted Edges',
    titlefont_size=16,
    hovermode='closest',
    showlegend=False,
    margin=dict(b=20, l=5, r=5, t=40),
    #Transparent background:
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)',
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
)

fig.show()
#fig.write_html('network_graph_of_movie_genres.html')
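
The edge width in the cell above is a linear function of the co-occurrence count. The sketch below isolates that formula in a helper (the name `scale_edge_width` and the sample weights are assumptions) so the mapping can be checked on a few values:

```python
def scale_edge_width(weight, max_weight, min_width=0.05, max_width=10):
    """Linearly map a co-occurrence count in (0, max_weight] to a line width."""
    return (weight / max_weight) * (max_width - min_width) + min_width


# Made-up co-occurrence counts
weights = [5, 2500, 10000]
widths = [scale_edge_width(w, max(weights)) for w in weights]
print(widths)
```

Because the count is divided by the maximum only, the strongest pair is drawn at max_width, while the weakest pair ends up slightly above min_width rather than exactly at it.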


This is the final version with the final font.

In [None]:
import pandas as pd
import networkx as nx
import plotly.graph_objects as go
from itertools import combinations
import community as community_louvain

df = genres_expanded

G = nx.Graph()

for genre in df.columns:
    total_movies = df[genre].sum()
    G.add_node(genre, size=total_movies, total_movies=total_movies)

for (genre1, genre2) in combinations(df.columns, 2):
    weight = (df[genre1] & df[genre2]).sum()
    if weight > 0:
        G.add_edge(genre1, genre2, weight=weight)

partition = community_louvain.best_partition(G)

pos = nx.spring_layout(G, k=0.3, seed=101)

fig = go.Figure()

min_width = 0.05
max_width = 10

max_weight = max(edge[2]['weight'] for edge in G.edges(data=True))

for edge in G.edges(data=True):
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    weight = edge[2]['weight']
    edge_width = (weight / max_weight) * (max_width - min_width) + min_width

    fig.add_trace(go.Scatter(
        x=[x0, x1, None],
        y=[y0, y1, None],
        mode='lines',
        line=dict(width=edge_width, color='pink'),
        hoverinfo='none'
    ))

sizes = [G.nodes[node]['size']*10 for node in G.nodes()]
node_trace = go.Scatter(
    x=[pos[node][0] for node in G.nodes()],
    y=[pos[node][1] for node in G.nodes()],
    mode='markers+text',
    marker=dict(
        showscale=False,
        colorscale='Viridis',
        size=sizes,
        color=list(partition.values()),
        sizemode='area',
        sizeref=2.*max(sizes)/(40.**2),
        sizemin=4
    ),
    text=[f"{node}" for node in G.nodes()],
    textposition="top center",
    hoverinfo='text',
    hovertext=[f"Total Movies: {G.nodes[node]['total_movies']}" for node in G.nodes()]
)

fig.add_trace(node_trace)

fig.update_layout(
    title='Genre Network Graph',
    # New Font
    font=dict(
        family="Courier New, monospace",
    ),
    title_font = dict(
        size = 16
    ),
    hovermode='closest',
    showlegend=False,
    margin=dict(b=20, l=5, r=5, t=40),
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)',
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
)

fig.show()
fig.write_html('network_graph_of_movie_genres.html')


#### 3.1.2 Rating Bar Plot and World Maps

Here we first wanted to create a bar plot showing the overall distribution of movies per genre, together with the average rating of each genre, in the hope of observing something interesting.

In [None]:
import pandas as pd
import plotly.graph_objs as go
from plotly.subplots import make_subplots

In [None]:
df = pd.read_csv("TMDB_movie_dataset_v11.csv")

Again, dummy variables are created.

In [None]:
genres_expanded = df['genres'].str.replace(" ", "").str.get_dummies(sep=',')
df = pd.concat([df, genres_expanded], axis=1)

We distinguish here between the average rating (the mean of vote_average over all movies in a genre) and the weighted average rating (the mean over all individual votes in a genre, i.e. each movie's rating weighted by its vote count).
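
The difference between the two averages can be illustrated on made-up numbers. The helper below is a sketch with an assumed name, not the notebook's implementation, which follows in the next cell.

```python
def plain_and_weighted_average(movies):
    """movies: list of (vote_average, vote_count) tuples for one genre."""
    plain = sum(avg for avg, _ in movies) / len(movies)
    total_votes = sum(count for _, count in movies)
    # Fall back to 0 when the genre has no votes at all
    weighted = sum(avg * count for avg, count in movies) / total_votes if total_votes else 0
    return plain, weighted


# Made-up genre: one movie rated 10.0 by a single voter, one rated 6.0 by 99 voters
plain, weighted = plain_and_weighted_average([(10.0, 1), (6.0, 99)])
print(plain, weighted)
```

The plain average treats both movies equally, while the weighted average is dominated by the movie with many votes. A movie without any votes still enters the plain average with its stored vote_average but has no effect on the weighted one.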

In [None]:
genre_columns = df.columns[24:]
vote_average_col = 'vote_average'
vote_count_col = 'vote_count'

weighted_avgs = pd.Series(index=genre_columns, dtype=float)

# We calculate genre counts and initial average votes
genre_counts = df[genre_columns].sum().sort_values(ascending=False)
genre_avg_vote = (df[genre_columns].multiply(df[vote_average_col], axis=0)).sum() / genre_counts

# We calculate the weighted averages per genre
for genre in genre_columns:
    genre_data = df[df[genre] == 1]
    weighted_score = (genre_data[vote_average_col] * genre_data[vote_count_col]).sum()
    total_votes = genre_data[vote_count_col].sum()
    weighted_avgs[genre] = weighted_score / total_votes if total_votes != 0 else 0

# Sorting
sorted_genres = genre_counts.sort_values(ascending=False).index
genre_counts = genre_counts[sorted_genres]
genre_avg_vote = genre_avg_vote[sorted_genres]
weighted_avgs = weighted_avgs[sorted_genres]

# Two y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Bar(x=genre_counts.index, y=genre_counts, name='# of Movies',
           hoverinfo='name+y', marker=dict(color='skyblue')),
    secondary_y=False,
)

# Line plot for the average votes
fig.add_trace(
    go.Scatter(x=genre_avg_vote.index, y=genre_avg_vote, name='Average Vote', mode='lines+markers',
               hovertemplate='Genre: %{x}<br>Average Vote: %{y:.2f}<extra></extra>', marker=dict(color='red')),
    secondary_y=True,
)

# Line plot for the weighted average votes
fig.add_trace(
    go.Scatter(x=weighted_avgs.index, y=weighted_avgs, name='Weighted Average Vote', mode='lines+markers',
               hovertemplate='Genre: %{x}<br>Weighted Average Vote: %{y:.2f}<extra></extra>', marker=dict(color='green')),
    secondary_y=True,
)

# Titles and Labels
fig.update_layout(
    title_text='Number of Movies and Voting Metrics by Genre',
    font=dict(
        family="Courier New, monospace",
    ),
    title_font = dict(
        size = 16
    ),
    xaxis_title="Genres",
    xaxis_tickangle=-45,
    height=575,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
)

fig.update_yaxes(title_text="Number of Movies", secondary_y=False)
fig.update_yaxes(title_text="Vote Average", secondary_y=True)

fig.show()
fig.write_html('genreNumberAverageBarplot.html')


In the following, we create world maps to visualize overall data per country.

In [None]:
df.columns

Index(['id', 'title', 'vote_average', 'vote_count', 'status', 'release_date',
       'revenue', 'runtime', 'adult', 'backdrop_path', 'budget', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'tagline', 'genres',
       'production_companies', 'production_countries', 'spoken_languages',
       'keywords', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime',
       'Documentary', 'Drama', 'Family', 'Fantasy', 'History', 'Horror',
       'Music', 'Mystery', 'Romance', 'ScienceFiction', 'TVMovie', 'Thriller',
       'War', 'Western'],
      dtype='object')

We used ChatGPT to generate the country_codes dictionary below: we gave it the list of all country dummy names, and it returned a country code for each of them, which we later use to display our data on the map.
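
Since the dictionary was generated rather than derived from the data, it can be worth checking that every country name has a code. The following sketch shows one way to do that; the helper name, the three-entry excerpt of the dictionary, and the placeholder name "Atlantis" are assumptions for illustration.

```python
def names_without_code(country_names, codes):
    """Return the sorted country names that have no entry in the code dictionary."""
    return sorted(set(country_names) - set(codes))


# Small excerpt of a name -> code dictionary in the same format as country_codes
codes_excerpt = {"Aruba": "ABW", "France": "FRA", "SouthKorea": "KOR"}

# Made-up list of dummy column names; "Atlantis" is a placeholder for an unmapped name
missing = names_without_code(["France", "SouthKorea", "Atlantis", "France"], codes_excerpt)
print(missing)
```

An empty result would mean that every country column can be placed on the map.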

In [None]:
countries_expanded = df['production_countries'].str.replace(" ", "").str.get_dummies(sep=',')
df = pd.concat([df, countries_expanded], axis=1)

country_codes = {
    'Afghanistan': 'AFG',
    'Albania': 'ALB',
    'Algeria': 'DZA',
    'AmericanSamoa': 'ASM',
    'Andorra': 'AND',
    'Angola': 'AGO',
    'Anguilla': 'AIA',
    'Antarctica': 'ATA',
    'AntiguaandBarbuda': 'ATG',
    'Argentina': 'ARG',
    'Armenia': 'ARM',
    'Aruba': 'ABW',
    'Australia': 'AUS',
    'Austria': 'AUT',
    'Azerbaijan': 'AZE',
    'Bahamas': 'BHS',
    'Bahrain': 'BHR',
    'Bangladesh': 'BGD',
    'Barbados': 'BRB',
    'Belarus': 'BLR',
    'Belgium': 'BEL',
    'Belize': 'BLZ',
    'Benin': 'BEN',
    'Bermuda': 'BMU',
    'Bhutan': 'BTN',
    'Bolivia': 'BOL',
    'BosniaandHerzegovina': 'BIH',
    'Botswana': 'BWA',
    'BouvetIsland': 'BVT',
    'Brazil': 'BRA',
    'BritishIndianOceanTerritory': 'IOT',
    'BritishVirginIslands': 'VGB',
    'BruneiDarussalam': 'BRN',
    'Bulgaria': 'BGR',
    'BurkinaFaso': 'BFA',
    'Burundi': 'BDI',
    'Cambodia': 'KHM',
    'Cameroon': 'CMR',
    'Canada': 'CAN',
    'CapeVerde': 'CPV',
    'CaymanIslands': 'CYM',
    'CentralAfricanRepublic': 'CAF',
    'Chad': 'TCD',
    'Chile': 'CHL',
    'China': 'CHN',
    'ChristmasIsland': 'CXR',
    'CocosIslands': 'CCK',
    'Colombia': 'COL',
    'Comoros': 'COM',
    'Congo': 'COG',
    'CookIslands': 'COK',
    'CostaRica': 'CRI',
    "CoteD'Ivoire": 'CIV',
    'Croatia': 'HRV',
    'Cuba': 'CUB',
    'Cyprus': 'CYP',
    'CzechRepublic': 'CZE',
    'Czechoslovakia': 'CSK',
    'Denmark': 'DNK',
    'Djibouti': 'DJI',
    'Dominica': 'DMA',
    'DominicanRepublic': 'DOM',
    'EastGermany': 'DDR',  # Later on merged to a single Germany
    'EastTimor': 'TLS',
    'Ecuador': 'ECU',
    'Egypt': 'EGY',
    'ElSalvador': 'SLV',
    'EquatorialGuinea': 'GNQ',
    'Eritrea': 'ERI',
    'Estonia': 'EST',
    'Ethiopia': 'ETH',
    'FaeroeIslands': 'FRO',
    'FalklandIslands': 'FLK',
    'Fiji': 'FJI',
    'Finland': 'FIN',
    'France': 'FRA',
    'FrenchGuiana': 'GUF',
    'FrenchPolynesia': 'PYF',
    'FrenchSouthernTerritories': 'ATF',
    'Gabon': 'GAB',
    'Gambia': 'GMB',
    'Georgia': 'GEO',
    'Germany': 'DEU',
    'Ghana': 'GHA',
    'Gibraltar': 'GIB',
    'Greece': 'GRC',
    'Greenland': 'GRL',
    'Grenada': 'GRD',
    'Guadaloupe': 'GLP',
    'Guam': 'GUM',
    'Guatemala': 'GTM',
    'Guinea': 'GIN',
    'Guinea-Bissau': 'GNB',
    'Guyana': 'GUY',
    'Haiti': 'HTI',
    'HeardandMcDonaldIslands': 'HMD',
    'HolySee': 'VAT',
    'Honduras': 'HND',
    'HongKong': 'HKG',
    'Hungary': 'HUN',
    'Iceland': 'ISL',
    'India': 'IND',
    'Indonesia': 'IDN',
    'Iran': 'IRN',
    'Iraq': 'IRQ',
    'Ireland': 'IRL',
    'Israel': 'ISR',
    'Italy': 'ITA',
    'Jamaica': 'JAM',
    'Japan': 'JPN',
    'Jordan': 'JOR',
    'Kazakhstan': 'KAZ',
    'Kenya': 'KEN',
    'Kiribati': 'KIR',
    'Kosovo': 'XKX',
    'Kuwait': 'KWT',
    'KyrgyzRepublic': 'KGZ',
    "LaoPeople'sDemocraticRepublic": 'LAO',
    'Latvia': 'LVA',
    'Lebanon': 'LBN',
    'Lesotho': 'LSO',
    'Liberia': 'LBR',
    'LibyanArabJamahiriya': 'LBY',
    'Liechtenstein': 'LIE',
    'Lithuania': 'LTU',
    'Luxembourg': 'LUX',
    'Macao': 'MAC',
    'Macedonia': 'MKD',
    'Madagascar': 'MDG',
    'Malawi': 'MWI',
    'Malaysia': 'MYS',
    'Maldives': 'MDV',
    'Mali': 'MLI',
    'Malta': 'MLT',
    'MarshallIslands': 'MHL',
    'Martinique': 'MTQ',
    'Mauritania': 'MRT',
    'Mauritius': 'MUS',
    'Mayotte': 'MYT',
    'Mexico': 'MEX',
    'Micronesia': 'FSM',
    'Moldova': 'MDA',
    'Monaco': 'MCO',
    'Mongolia': 'MNG',
    'Montenegro': 'MNE',
    'Montserrat': 'MSR',
    'Morocco': 'MAR',
    'Mozambique': 'MOZ',
    'Myanmar': 'MMR',
    'Namibia': 'NAM',
    'Nauru': 'NRU',
    'Nepal': 'NPL',
    'Netherlands': 'NLD',
    'NetherlandsAntilles': 'ANT',
    'NewCaledonia': 'NCL',
    'NewZealand': 'NZL',
    'Nicaragua': 'NIC',
    'Niger': 'NER',
    'Nigeria': 'NGA',
    'Niue': 'NIU',
    'NorfolkIsland': 'NFK',
    'NorthKorea': 'PRK',
    'NorthernIreland': 'NIR',
    'NorthernMarianaIslands': 'MNP',
    'Norway': 'NOR',
    'Oman': 'OMN',
    'Pakistan': 'PAK',
    'Palau': 'PLW',
    'PalestinianTerritory': 'PSE',
    'Panama': 'PAN',
    'PapuaNewGuinea': 'PNG',
    'Paraguay': 'PRY',
    'Peru': 'PER',
    'Philippines': 'PHL',
    'PitcairnIsland': 'PCN',
    'Poland': 'POL',
    'Portugal': 'PRT',
    'PuertoRico': 'PRI',
    'Qatar': 'QAT',
    'Reunion': 'REU',
    'Romania': 'ROU',
    'Russia': 'RUS',
    'Rwanda': 'RWA',
    'Samoa': 'WSM',
    'SanMarino': 'SMR',
    'SaoTomeandPrincipe': 'STP',
    'SaudiArabia': 'SAU',
    'Senegal': 'SEN',
    'Serbia': 'SRB',
    'SerbiaandMontenegro': 'SCG',
    'Seychelles': 'SYC',
    'SierraLeone': 'SLE',
    'Singapore': 'SGP',
    'Slovakia': 'SVK',
    'Slovenia': 'SVN',
    'SolomonIslands': 'SLB',
    'Somalia': 'SOM',
    'SouthAfrica': 'ZAF',
    'SouthGeorgiaandtheSouthSandwichIslands': 'SGS',
    'SouthKorea': 'KOR',
    'SouthSudan': 'SSD',
    'SovietUnion': 'SUN',  # SU and Russia values are later assigned to the same region of the world map, since no more detailed assignment to the SU's successor states was possible
    'Spain': 'ESP',
    'SriLanka': 'LKA',
    'St.Helena': 'SHN',
    'St.KittsandNevis': 'KNA',
    'St.Lucia': 'LCA',
    'St.PierreandMiquelon': 'SPM',
    'St.VincentandtheGrenadines': 'VCT',
    'Sudan': 'SDN',
    'Suriname': 'SUR',
    'Svalbard&JanMayenIslands': 'SJM',
    'Swaziland': 'SWZ',
    'Sweden': 'SWE',
    'Switzerland': 'CHE',
    'SyrianArabRepublic': 'SYR',
    'Taiwan': 'TWN',
    'Tajikistan': 'TJK',
    'Tanzania': 'TZA',
    'Thailand': 'THA',
    'Timor-Leste': 'TLS',
    'Togo': 'TGO',
    'Tokelau': 'TKL',
    'Tonga': 'TON',
    'TrinidadandTobago': 'TTO',
    'Tunisia': 'TUN',
    'Turkey': 'TUR',
    'Turkmenistan': 'TKM',
    'TurksandCaicosIslands': 'TCA',
    'Tuvalu': 'TUV',
    'USVirginIslands': 'VIR',
    'Uganda': 'UGA',
    'Ukraine': 'UKR',
    'UnitedArabEmirates': 'ARE',
    'UnitedKingdom': 'GBR',
    'UnitedStatesMinorOutlyingIslands': 'UMI',
    'UnitedStatesofAmerica': 'USA',
    'Uruguay': 'URY',
    'Uzbekistan': 'UZB',
    'Vanuatu': 'VUT',
    'Venezuela': 'VEN',
    'Vietnam': 'VNM',
    'WallisandFutunaIslands': 'WLF',
    'WesternSahara': 'ESH',
    'Yemen': 'YEM',
    'Yugoslavia': 'YUG',
    'Zaire': 'ZAR',
    'Zambia': 'ZMB',
    'Zimbabwe': 'ZWE'
}

print({k: country_codes[k] for k in list(country_codes)[:10]})  # print first 10 entries

{'Afghanistan': 'AFG', 'Albania': 'ALB', 'Algeria': 'DZA', 'AmericanSamoa': 'ASM', 'Andorra': 'AND', 'Angola': 'AGO', 'Anguilla': 'AIA', 'Antarctica': 'ATA', 'AntiguaandBarbuda': 'ATG', 'Argentina': 'ARG'}


Here we create a new DataFrame, country_metrics_df, which collects all relevant country-level metrics.

In [None]:
import re
import pandas as pd

# Merge the two largest historical (no longer existing) entities into their present-day countries. Lacking more specific data, we assign the Soviet Union data to Russia only and do not split it among the other successor states.
country_codes['Russia (SU until 1991)'] = 'RUS'
if 'SovietUnion' in country_codes:
    del country_codes['SovietUnion']
if 'Russia' in country_codes:
    del country_codes['Russia']

df['Russia (SU until 1991)'] = df['SovietUnion'].fillna(0) + df['Russia'].fillna(0)
df['Russia (SU until 1991)'] = df['Russia (SU until 1991)'].apply(lambda x: 1 if x > 0 else 0)

country_codes['Germany (including East Germany)'] = 'DEU'
if 'EastGermany' in country_codes:
    del country_codes['EastGermany']
if 'Germany' in country_codes:
    del country_codes['Germany']

df['Germany (including East Germany)'] = df['EastGermany'].fillna(0) + df['Germany'].fillna(0)
df['Germany (including East Germany)'] = df['Germany (including East Germany)'].apply(lambda x: 1 if x > 0 else 0)

df.drop(['SovietUnion', 'Russia'], axis=1, inplace=True, errors='ignore')
df.drop(['EastGermany', 'Germany'], axis=1, inplace=True, errors='ignore')

# Calculate metrics for each country
metrics = []
for country, code in country_codes.items():
    country_df = df[df[country] == 1]
    
    total_movies = len(country_df)
    
    avg_budget = country_df['budget'].mean()
    avg_revenue = country_df['revenue'].mean()
    avg_runtime = country_df['runtime'].mean()
    
    weighted_scores = (country_df['vote_average'] * country_df['vote_count']).sum()
    total_votes = country_df['vote_count'].sum()
    if total_votes > 0:
        weighted_avg_rating = weighted_scores / total_votes
    else:
        weighted_avg_rating = 0
    
    metrics.append({
        'Country': country,
        'ISO_Code': code,
        'Total_Movies': total_movies,
        'Average_Budget': avg_budget,
        "Average_Revenue": avg_revenue,
        "Average_Runtime": avg_runtime,
        'Weighted_Average_Rating': weighted_avg_rating
    })

country_metrics_df = pd.DataFrame(metrics)

def add_space(value):
    if isinstance(value, str):
        return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', value)
    else:
        return value

# Nicer display of countries again
country_metrics_df['Country'] = country_metrics_df['Country'].apply(add_space)

# Calculating the revenue-to-budget ratio as an additional metric
country_metrics_df["Average_Revenue/Budget"] = country_metrics_df["Average_Revenue"] / country_metrics_df["Average_Budget"]

print(country_metrics_df)


                              Country ISO_Code  Total_Movies  Average_Budget  \
0                         Afghanistan      AFG           203    13639.793103   
1                             Albania      ALB           366    52889.245902   
2                             Algeria      DZA           458   127721.694323   
3                      American Samoa      ASM             6     1905.000000   
4                             Andorra      AND            23    52173.913043   
..                                ...      ...           ...             ...   
242                             Zaire      ZAR             1        0.000000   
243                            Zambia      ZMB            40    17750.000000   
244                          Zimbabwe      ZWE            66   712243.939394   
245            Russia (SU until 1991)      RUS         18855   138397.341713   
246  Germany (including East Germany)      DEU         41464   462752.363761   

     Average_Revenue  Average_Runtime  
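
The vote-count-weighted rating used in the loop above can be checked on a small example. The following is a minimal sketch with made-up numbers (not taken from the TMDB data); the helper name `weighted_rating` is hypothetical and only mirrors the logic of the cell above, including the fallback to 0 when a country has no votes.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "vote_average": [8.0, 4.0, 6.0],
    "vote_count": [100, 10, 0],
})

def weighted_rating(frame: pd.DataFrame) -> float:
    """Vote-count-weighted mean of vote_average; 0.0 if there are no votes."""
    total_votes = frame["vote_count"].sum()
    if total_votes == 0:
        return 0.0
    weighted_scores = (frame["vote_average"] * frame["vote_count"]).sum()
    return float(weighted_scores / total_votes)

plain_mean = toy["vote_average"].mean()   # every movie counts equally
weighted = weighted_rating(toy)           # movies with many votes dominate
print(plain_mean, round(weighted, 3))
```

The weighting keeps a movie with a handful of votes from shifting a country's rating as much as a movie with thousands of votes.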

Here we create our first world map. We chose plotly over folium because it offers a convenient way to display country data keyed by ISO country code, and we did not need any more local geographic data.

At the beginning of the following cell, we set limits on the number of movies, the budget, the revenue and the rating. These exclude countries with too few movies, extreme averages, or no ratings at all (which would leave the rating range from 0 to 6 almost unused), and make the remaining data easier to distinguish. An example is Aruba, which has only a few movies but appears on some very high-budget productions; this stretches the color scale without giving us any interesting insight in return.
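
To illustrate why such limits help, here is a minimal sketch on invented per-country values (the countries "A" to "D" and all numbers are hypothetical). It applies the same kind of thresholds with a single boolean mask and shows how removing a small, extreme country shrinks the range the color scale has to cover.

```python
import pandas as pd

# Hypothetical per-country metrics, invented for illustration
toy_metrics = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Total_Movies": [12, 500, 3000, 80],
    "Average_Budget": [9_000_000.0, 250_000.0, 1_200_000.0, 0.0],
    "Weighted_Average_Rating": [7.9, 6.4, 6.8, 0.0],
})

min_movies = 50
budget_range = (1, 5_000_000)
min_rating = 5.5

# Combine all limits into one mask; Series.between is inclusive on both ends
mask = (
    (toy_metrics["Total_Movies"] > min_movies)
    & toy_metrics["Average_Budget"].between(*budget_range)
    & (toy_metrics["Weighted_Average_Rating"] >= min_rating)
)
filtered = toy_metrics[mask]

# The upper end of the color scale before and after filtering
print(toy_metrics["Average_Budget"].max(), filtered["Average_Budget"].max())
```

A lower budget limit of 1 also removes countries whose average budget is 0, which would otherwise produce an infinite or undefined revenue-to-budget ratio.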

In [None]:
import plotly.express as px

# Limits
min_moviesPerCountry = 50
budget_range = [1, 5000000]
revenue_range = [1, 100000000]
min_rating = 5.5

country_metrics_df = country_metrics_df[country_metrics_df["Total_Movies"] > min_moviesPerCountry]
country_metrics_df = country_metrics_df[(country_metrics_df["Average_Budget"] >= budget_range[0]) & (country_metrics_df["Average_Budget"] <= budget_range[1])]
country_metrics_df = country_metrics_df[(country_metrics_df["Average_Revenue"] >= revenue_range[0]) & (country_metrics_df["Average_Revenue"] <= revenue_range[1])]
country_metrics_df = country_metrics_df[country_metrics_df["Weighted_Average_Rating"] >= min_rating]


# The initial setup
fig = px.choropleth(
    country_metrics_df,
    locations="ISO_Code",
    color="Total_Movies",
    hover_name="Country",
    color_continuous_scale=px.colors.diverging.RdYlGn,
    title="Interactive World Map",
    projection="orthographic",
    width=600,
    height=500,
    hover_data={
        "Total_Movies": ":,.0f"
    }
)

# Initially shown layout data and style
fig.update_layout(
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='orthographic',
        bgcolor='rgba(0,0,0,0)'
    ),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    margin=dict(l=0, r=0, t=50, b=0),
    font=dict(family="Courier New, monospace"),
    title=dict(font=dict(size=16, family="Courier New, monospace"))
)

fig.update_traces(
    hovertemplate="<b>%{hovertext}</b><br>Total Movies: %{z:,.0f}"
)

# Remove color bar title
fig.update_layout(coloraxis_colorbar=dict(title=""))

# Configure the dropdown menu (the geo, background, margin and font settings were already applied above)
fig.update_layout(
    # Defining the dropdown menu
    updatemenus=[{
        'buttons': [
            {
                'args': [{"z": [country_metrics_df["Total_Movies"]],
                         "coloraxis_cmin": country_metrics_df["Total_Movies"].min(),
                         "coloraxis_cmax": country_metrics_df["Total_Movies"].max(),
                         "hovertemplate": "<b>%{hovertext}</b><br>Total Movies: %{z:,.0f}"}],
                'label': 'Total Movies',
                'method': 'restyle'
            },
            {
                'args': [{"z": [country_metrics_df["Average_Budget"]],
                         "coloraxis_cmin": country_metrics_df["Average_Budget"].min(),
                         "coloraxis_cmax": country_metrics_df["Average_Budget"].max(),
                         "hovertemplate": "<b>%{hovertext}</b><br>Average Budget: $%{z:,.0f}"}],
                'label': 'Average Budget',
                'method': 'restyle'
            },
            {
                'args': [{"z": [country_metrics_df["Average_Revenue"]],
                         "coloraxis_cmin": country_metrics_df["Average_Revenue"].min(),
                         "coloraxis_cmax": country_metrics_df["Average_Revenue"].max(),
                         "hovertemplate": "<b>%{hovertext}</b><br>Average Revenue: $%{z:,.0f}"}],
                'label': 'Average Revenue',
                'method': 'restyle'
            },
            {
                'args': [{"z": [country_metrics_df["Average_Revenue/Budget"]],
                         "coloraxis_cmin": country_metrics_df["Average_Revenue/Budget"].min(),
                         "coloraxis_cmax": country_metrics_df["Average_Revenue/Budget"].max(),
                         "hovertemplate": "<b>%{hovertext}</b><br>Average Revenue/Budget: %{z:,.3f}"}],
                'label': 'Average Revenue/Budget',
                'method': 'restyle'
            },
            {
                'args': [{"z": [country_metrics_df["Average_Runtime"]],
                         "coloraxis_cmin": country_metrics_df["Average_Runtime"].min(),
                         "coloraxis_cmax": country_metrics_df["Average_Runtime"].max(),
                         "hovertemplate": "<b>%{hovertext}</b><br>Average Runtime: %{z:,.0f}min"}],
                'label': 'Average Runtime',
                'method': 'restyle'
            },
            {
                'args': [{"z": [country_metrics_df["Weighted_Average_Rating"]],
                         "coloraxis_cmin": country_metrics_df["Weighted_Average_Rating"].min(),
                         "coloraxis_cmax": country_metrics_df["Weighted_Average_Rating"].max(),
                         "hovertemplate": "<b>%{hovertext}</b><br>Weighted Average Rating: %{z:.2f}"}],
                'label': 'Weighted Average Rating',
                'method': 'restyle'
            }
        ],
        'direction': 'down',
        'pad': {'r': 10, 't': 10},
        'showactive': True,
        'x': 0.92,
        'xanchor': 'left',
        'y': 1.1,
        'yanchor': 'top',
        'bgcolor': 'white' 
    }]
)

# Show the plot
fig.show()
fig.write_html('worldMap.html')
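
The six dropdown buttons above differ only in column, label and number format, so they could also be generated by a small helper. This is a sketch, not the code we used: `make_button` is a hypothetical name, the toy DataFrame stands in for country_metrics_df, and the dictionary keys simply mirror the ones in the cell above.

```python
import pandas as pd

def make_button(frame: pd.DataFrame, column: str, label: str, value_format: str) -> dict:
    """Build one 'restyle' dropdown entry that switches the displayed metric."""
    return {
        "label": label,
        "method": "restyle",
        "args": [{
            "z": [frame[column]],
            "coloraxis_cmin": frame[column].min(),
            "coloraxis_cmax": frame[column].max(),
            # Doubled braces keep the literal %{hovertext} inside the f-string
            "hovertemplate": f"<b>%{{hovertext}}</b><br>{label}: {value_format}",
        }],
    }

# (column, label, hover number format) for each dropdown entry
specs = [
    ("Total_Movies", "Total Movies", "%{z:,.0f}"),
    ("Average_Budget", "Average Budget", "$%{z:,.0f}"),
]

# Toy stand-in for country_metrics_df
toy = pd.DataFrame({"Total_Movies": [10, 20], "Average_Budget": [1000.0, 3000.0]})
buttons = [make_button(toy, col, label, fmt) for col, label, fmt in specs]
print(buttons[1]["args"][0]["hovertemplate"])
```

The resulting list could be passed as the `'buttons'` entry of `updatemenus`; adding a metric then only requires one more tuple in `specs`.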


#### 3.1.1 Time Series

Here we create several time series to investigate specific fields more closely and to gain insight into their development over time and their current state.

In [None]:
import pandas as pd
import plotly.express as px

In [None]:
df = pd.read_csv("TMDB_movie_dataset_v11.csv")

In [None]:
genres_expanded = df['genres'].str.replace(" ", "").str.get_dummies(sep=',')
df_dummy = pd.concat([df, genres_expanded], axis=1)

This is the genre-dependent time series of the Weighted Average Rating.
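
As a hedged alternative to looping over one dummy column per genre, the same per-genre, per-year weighted average can be sketched with `explode` and a single `groupby`. The toy data below is invented, and the column names are assumed to match the dataset's `genres`, `vote_average` and `vote_count` columns.

```python
import pandas as pd

# Toy movies, invented for illustration; genres are comma-separated strings
toy = pd.DataFrame({
    "genres": ["Action, Drama", "Drama", "Action", None],
    "year": [2000, 2000, 2001, 2001],
    "vote_average": [8.0, 6.0, 5.0, 9.0],
    "vote_count": [10, 30, 4, 2],
})

# One row per (movie, genre) pair; movies without a genre are dropped
long_df = (
    toy.dropna(subset=["genres"])
       .assign(Genre=lambda d: d["genres"].str.replace(" ", "").str.split(","))
       .explode("Genre")
)
long_df["weighted_score"] = long_df["vote_average"] * long_df["vote_count"]

# Sum scores and votes per genre and year, then divide
summary = (
    long_df.groupby(["Genre", "year"], as_index=False)[["weighted_score", "vote_count"]]
           .sum()
)
summary["weighted_average"] = summary["weighted_score"] / summary["vote_count"]
print(summary[["Genre", "year", "weighted_average"]])
```

Because `explode` returns a new DataFrame, adding the `weighted_score` column does not write to a slice of the original data.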

In [None]:
df_dummy['release_date'] = pd.to_datetime(df_dummy['release_date'])
df_dummy['year'] = df_dummy['release_date'].dt.year
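# Take the genre names from the dummy frame; slicing df_dummy.columns[24:] would also pick up the new 'year' column
genre_columns = list(genres_expanded.columns)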
yearly_genre_avg = pd.DataFrame()

# Calculating the weighted scores per genre
for genre in genre_columns:
    genre_data = df_dummy[df_dummy[genre] == 1].copy()  # copy, so the new column is not written to a slice of df_dummy

    genre_data['weighted_score'] = genre_data['vote_average'] * genre_data['vote_count']
    summary = genre_data.groupby('year').agg({
        'weighted_score': 'sum',
        'vote_count': 'sum'
    }).reset_index()

    summary['weighted_average'] = summary['weighted_score'] / summary['vote_count']
    summary['Genre'] = genre  # Add a column for the genre name

    yearly_genre_avg = pd.concat([yearly_genre_avg, summary[['year', 'weighted_average', 'Genre']]], ignore_index=True)

# Creating an interactive diagram
fig = px.line(yearly_genre_avg, x='year', y='weighted_average', color='Genre',
              title='Weighted Average Rating per Genre and Year',
              labels={'weighted_average': 'Weighted Average Rating', 'year': 'Year'},
              hover_data={'year': '|%Y'})

# Here we are defining the initially shown genres
hidden_genres = set(genre_columns)
hidden_genres -= {"Animation", "Horror"}

for trace in fig.data:
    if trace.name in hidden_genres:
        trace.visible = 'legendonly'

# Updating the Layout
fig.update_layout(
    font=dict(
        family="Courier New, monospace",
    ),
    title_font = dict(
        size = 16
    ),
    xaxis_title='Year',
    yaxis_title='Weighted Average Rating',
    legend_title='Genre',
    xaxis=dict(range=[1864, 2023]),
    height=575,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)',
    legend=dict(
        title_font=dict(family="Courier New, monospace",size=10),
        font=dict(family="Courier New, monospace",size=9)
    )
)

fig.show()
fig.write_html('time_series.html')




This is the genre-dependent time series of the Budget.

In [None]:
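# Take the genre names from the dummy frame; slicing df_dummy.columns[24:] would also pick up the 'year' column
genre_columns = list(genres_expanded.columns)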

yearly_genre_budget = pd.DataFrame()

for genre in genre_columns:
    genre_data = df_dummy[df_dummy[genre] == 1]
    summary = genre_data.groupby('year').agg({
        'budget': 'sum'
    }).reset_index()
    summary['Genre'] = genre 
    yearly_genre_budget = pd.concat([yearly_genre_budget, summary[['year', 'budget', 'Genre']]], ignore_index=True)

print(yearly_genre_budget)

fig = px.line(yearly_genre_budget, x='year', y='budget', color='Genre',
              title='Total Budget per Genre and Year',
              labels={'budget': 'Total Budget ($)', 'year': 'Year'},
              hover_data={'year': '|%Y'})

hidden_genres = set(genre_columns)
hidden_genres -= {"Action", "Documentary", "Drama", "Thriller"}

for trace in fig.data:
    if trace.name in hidden_genres:
        trace.visible = 'legendonly'

fig.update_layout(
    font=dict(
        family="Courier New, monospace",
    ),
    title_font=dict(
        size=16
    ),
    xaxis_title='Year',
    yaxis_title='Total Budget ($)',
    legend_title='Genre',
    xaxis=dict(range=[1900, 2023]),
    height=575,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)',
    legend=dict(
        title_font=dict(family="Courier New, monospace", size=10),
        font=dict(family="Courier New, monospace", size=9)
    )
)

fig.show()
fig.write_html('time_series_budget.html')


        year    budget    Genre
0     1894.0         0   Action
1     1895.0         0   Action
2     1896.0         0   Action
3     1897.0         0   Action
4     1898.0         0   Action
...      ...       ...      ...
2458  2022.0  57816693  Western
2459  2023.0    146540  Western
2460  2024.0    226023  Western
2461  2025.0         0  Western
2462  2034.0         0  Western

[2463 rows x 3 columns]






And the same thing for the Revenue.

In [None]:
import pandas as pd

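# Take the genre names from the dummy frame; slicing df_dummy.columns[24:] would also pick up the 'year' column
genre_columns = list(genres_expanded.columns)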
yearly_genre_revenue = pd.DataFrame()

for genre in genre_columns:
    genre_data = df_dummy[df_dummy[genre] == 1]
    summary = genre_data.groupby('year').agg({
        'revenue': 'sum'
    }).reset_index()

    summary['Genre'] = genre
    yearly_genre_revenue = pd.concat([yearly_genre_revenue, summary[['year', 'revenue', 'Genre']]], ignore_index=True)

print(yearly_genre_revenue)

fig = px.line(yearly_genre_revenue, x='year', y='revenue', color='Genre',
              title='Total Revenue per Genre and Year',
              labels={'revenue': 'Total Revenue ($)', 'year': 'Year'},
              hover_data={'year': '|%Y'})

hidden_genres = set(genre_columns)
hidden_genres -= {"Action", "Documentary", "Drama", "Thriller"}
for trace in fig.data:
    if trace.name in hidden_genres:
        trace.visible = 'legendonly'

fig.update_layout(
    font=dict(
        family="Courier New, monospace",
    ),
    title_font=dict(
        size=16
    ),
    xaxis_title='Year',
    yaxis_title='Total Revenue ($)',
    legend_title='Genre',
    xaxis=dict(range=[1900, 2023]),
    height=575,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)',
    legend=dict(
        title_font=dict(family="Courier New, monospace", size=10),
        font=dict(family="Courier New, monospace", size=9)
    )
)

fig.show()
fig.write_html('time_series_revenue.html')

        year   revenue    Genre
0     1894.0         0   Action
1     1895.0         0   Action
2     1896.0         0   Action
3     1897.0         0   Action
4     1898.0         0   Action
...      ...       ...      ...
2458  2022.0  51681478  Western
2459  2023.0        77  Western
2460  2024.0     10000  Western
2461  2025.0         0  Western
2462  2034.0         0  Western

[2463 rows x 3 columns]






We later decided to include a diagram with the total number of movies per genre in the database, to give a better overview of how movies are distributed in the dataset, and used the same logic for it.

In [None]:
import pandas as pd

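# Take the genre names from the dummy frame; slicing df_dummy.columns[24:] would also pick up the 'year' column
genre_columns = list(genres_expanded.columns)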
yearly_genre_count = pd.DataFrame()

for genre in genre_columns:
    genre_data = df_dummy[df_dummy[genre] == 1]
    summary = genre_data.groupby('year').agg({
        'title': 'count'
    }).reset_index()

    summary['Genre'] = genre
    yearly_genre_count = pd.concat([yearly_genre_count, summary[['year', 'title', 'Genre']]], ignore_index=True)

print(yearly_genre_count)

fig = px.line(yearly_genre_count, x='year', y='title', color='Genre',
              title='Number of Movies per Genre and Year',
              labels={'title': 'Number of Movies', 'year': 'Year'},
              hover_data={'year': '|%Y'})

hidden_genres = set()

for trace in fig.data:
    if trace.name in hidden_genres:
        trace.visible = 'legendonly'

fig.update_layout(
    font=dict(
        family="Courier New, monospace",
    ),
    title_font=dict(
        size=16
    ),
    xaxis_title='Year',
    yaxis_title='Number of Movies',
    legend_title='Genre',
    xaxis=dict(range=[1900, 2023]),
    height=575,
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)',
    legend=dict(
        title_font=dict(family="Courier New, monospace", size=10),
        font=dict(family="Courier New, monospace", size=9)
    )
)

fig.show()
fig.write_html('time_series_movie_count.html')

        year  title    Genre
0     1894.0      1   Action
1     1895.0      1   Action
2     1896.0      1   Action
3     1897.0      4   Action
4     1898.0      3   Action
...      ...    ...      ...
2458  2022.0    112  Western
2459  2023.0    113  Western
2460  2024.0     25  Western
2461  2025.0      1  Western
2462  2034.0      1  Western

[2463 rows x 3 columns]






We came up with the initial ideas and the initial code setup, and used ChatGPT to implement subsequent changes to accommodate our needs. Examples are adding two different y-axis scales to one plot, setting up the background, and creating the initial design of the dropdown menu for the world map.

# 4 Genre

### 4.1 Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer)? Why?

We chose these tools of Visual Narrative:

* Consistent Visual Platform
* Feature Distinction
* We didn't use any transition tools; instead, we wrote a plain one-page article.

We wanted our website to read like an online article: easy to follow, with a story told through our text and figures. The article has a consistent look and is filled with graphs that help the reader understand the text. All of them are interactive, which makes them more interesting than plain static graphs. We did not use transitions because it is a straightforward one-page article that needs neither transitions between text and figures nor pagination.

### 4.2 Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer)? Why?

We chose these tools of Narrative Structure:

* Linear
* Very Limited Interactivity

As mentioned above, we chose to write a simple yet interesting article with limited interactivity in some plots (e.g. the interactive map and globe). The article reads linearly, while the interactive plots let readers explore the data points themselves and draw their own conclusions.

# 5 Visualizations

### 5.1 Explain the visualizations you've chosen.

* Visualization no.1: We chose this network graph because it is a good introduction to the genres of the movie industry and also shows the relationships between genres. A network graph suits us well because it gives the reader an interesting web-like picture.
* Visualization no.2: This graph shows the average and weighted average rating of each genre, indicating audience acceptance and preference, along with the number of movies per genre. Nothing special here, just an informative graph that leads the reader into the topic of the article.
* Visualization no.3, no.7, no.8: These are basic graphs that support the points we want to make in our story. They offer limited interactivity, letting the user choose which genres to inspect for budget and revenue and draw their own conclusions.
* Visualization no.4, no.5: We thought these visually appealing graphs would be an interesting way to show our dataset on a global scale with interactive plotting. Both the globe and the interactive map offer a different, more spherical perspective on our dataset and add a nice touch to our narrative.
* Visualization no.6: This graph gives a clearer view of the Weighted Average Rating per Genre and Year and enables the reader to compare the genres.

### 5.2 Why are they right for the story you want to tell?

We chose basic yet interesting graphs that are informative and visually appealing. Interactive graphs that let users look up information such as budget, rating and revenue for specific genres can be quite helpful while reading the article. By selecting 'limited interactivity' graphs, we therefore aim to make the article more engaging than it would be with static figures simply embedded in the website. We aimed to select graphs that are neither boring nor too sophisticated, in order to maintain simplicity and visual appeal.

# 6 Discussion

### 6.1 What went well?

We both liked the idea of working with a dataset about movies, since we are both cinephiles. We also communicated well throughout the course and the project. As for our website, we believe we found some interesting points in our dataset and built a nice story about movies and genres in general. We also agreed to add small personal details and references to the article to make it feel more unique and ours (e.g. the Lord of the Rings mention).

### 6.2 What is still missing? What could be improved? Why?

We did not use the full capacity of the dataset: we said little about individual movies and instead discussed the general patterns we recognised in the genres, the budgets, the average runtimes, etc. The dataset is large, with many columns and a variety of information for each movie, which would allow us to tell more stories. To sum up, we did not use every column that could have been used. We also thought about drawing some connections with the keywords, but left that out to focus on the broader patterns. Combining the analysis with more qualitative insights from smaller examples could give an improved overview.

# 7 Contributions

Both group members worked on all tasks together and helped each other, with an overall equal amount of effort.

* Nico was mostly responsible for the code and specific visualization choices, along with the introductory parts of the article.
* Manos was mostly responsible for the display and data selection for some figures, in order to build a narrative and keep the data being shown consistent.

These contributions do not indicate a higher workload for either person in a specific area; they state who was responsible for delegating tasks and getting things done.

# 8 References

We had two links inside our article that helped us with the narrative.

* https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies
* https://en.wikipedia.org/wiki/History_of_animation#1980s

We also used ChatGPT for some grammar correction and for code assistance with plot configuration and code syntax.