# Network Analysis

This notebook covers the analysis part of our project. It assumes that you already have a DuckDB database populated with the necessary tables. If not, run `setup.ipynb` first.

Let's start with basic imports and connecting to our database.

In [2]:
import duckdb
import pandas as pd

# Connect to a persistent DuckDB database file
conn = duckdb.connect("imdb.duckdb")

As mentioned in the setup, we are dealing with 7 tables:
1. `name_basics`
2. `title_akas`
3. `title_basics`
4. `title_crew`
5. `title_episode`
6. `title_principals`
7. `title_ratings`
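
Before running the heavier queries, it can help to confirm that all seven tables are actually present. The following is a minimal sketch, not part of the original setup: the helper name `missing_tables` and the constant `EXPECTED_TABLES` are hypothetical, and the function only assumes that `conn` behaves like the DuckDB connection created above (an `execute(...)` call whose result offers `fetchall()`).

```python
# Hypothetical helper: the table names are taken from the list above.
EXPECTED_TABLES = {
    "name_basics", "title_akas", "title_basics", "title_crew",
    "title_episode", "title_principals", "title_ratings",
}

def missing_tables(conn, expected=EXPECTED_TABLES):
    """Return the expected table names not found in the 'main' schema, sorted."""
    rows = conn.execute("""
        SELECT table_name
        FROM information_schema.tables
        WHERE table_schema = 'main'
    """).fetchall()
    existing = {row[0] for row in rows}
    return sorted(set(expected) - existing)

# In the notebook, with `conn` from the cell above:
# missing = missing_tables(conn)
# if missing:
#     print("Missing tables, run setup.ipynb first:", missing)
```

An empty list means every expected table exists; anything else names the tables that `setup.ipynb` still needs to create.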

The following query gives an overview of our schema: every column of every table in the `main` schema, together with its data type.

In [5]:
df = conn.execute("""
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'main'
ORDER BY table_name, ordinal_position;
""").df()

display(df)

Unnamed: 0,table_name,column_name,data_type
0,name_basics,nconst,VARCHAR
1,name_basics,primary_name,VARCHAR
2,name_basics,birth_year,INTEGER
3,name_basics,death_year,INTEGER
4,name_basics,primary_profession,VARCHAR[]
5,name_basics,known_for_titles,VARCHAR[]
6,title_akas,title_id,VARCHAR
7,title_akas,CAST(ordering AS INTEGER),INTEGER
8,title_akas,title,VARCHAR
9,title_akas,region,VARCHAR
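
Note that the overview lists a `title_akas` column literally named `CAST(ordering AS INTEGER)`. This suggests that the cast was not given an alias when the table was created, although that is an assumption, since `setup.ipynb` is not shown here. A column with such a name has to be double-quoted in every query that uses it. The sketch below shows one possible way to rename it, assuming the intended name is `ordering` and that changing the database from this notebook is acceptable. The helper name `rename_ordering_column` is hypothetical.

```python
def rename_ordering_column(conn):
    """Rename title_akas."CAST(ordering AS INTEGER)" to "ordering", if present.

    Returns True if the column was renamed, False if there was nothing to do.
    """
    rows = conn.execute("""
        SELECT column_name
        FROM information_schema.columns
        WHERE table_schema = 'main' AND table_name = 'title_akas'
    """).fetchall()
    columns = {row[0] for row in rows}

    old_name = "CAST(ordering AS INTEGER)"
    # Skip if the odd column is absent or the target name is already taken.
    if old_name not in columns or "ordering" in columns:
        return False

    # Double quotes make DuckDB treat the whole string as a single identifier.
    conn.execute(
        f'ALTER TABLE title_akas RENAME COLUMN "{old_name}" TO "ordering"'
    )
    return True

# In the notebook: rename_ordering_column(conn)
```

Fixing the alias in the setup step would be the cleaner option. The rename is only a workaround for an existing database file.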


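The following cell joins titles, ratings, principals, and names to list movies released in 2010 or later that have a US entry in `title_akas`, with one row per credited actor or actress. Each row also carries other useful information (average rating, number of votes, runtime, genres, region, etc.). The rows are sorted by rating and then by number of votes, and only the first 100 are kept. This should be a useful starting point for the analysis.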

In [8]:
df = conn.execute("""
    -- One row per (movie, credited actor or actress) pair
    SELECT
        tb.tconst,
        tb.primary_title AS movie_title,
        tb.start_year,
        tb.runtime_minutes,
        tb.genres,
        tr.average_rating,
        tr.num_votes,
        nb.primary_name AS actor_name,
        nb.birth_year,
        nb.primary_profession,
        tp.category,
        tp.characters,
        ta.region
    FROM title_basics tb
    -- Inner join: titles without a rating are dropped
    JOIN title_ratings tr
        ON tb.tconst = tr.tconst
    JOIN title_principals tp
        ON tb.tconst = tp.tconst
    JOIN name_basics nb
        ON tp.nconst = nb.nconst
    JOIN title_akas ta
        ON tb.tconst = ta.title_id
    WHERE tb.title_type = 'movie'
        AND tb.start_year >= 2010
        AND ta.region = 'US'
        AND tp.category IN ('actor', 'actress')
    -- Highest rating first; ties are broken by number of votes
    ORDER BY tr.average_rating DESC, tr.num_votes DESC
    LIMIT 100
""").df()

display(df)

Unnamed: 0,tconst,movie_title,start_year,runtime_minutes,genres,average_rating,num_votes,actor_name,birth_year,primary_profession,category,characters,region
0,tt20625192,We Rise,2022,,,10.0,36,Sheera Paloma Eskenazi,,[actress],actress,"[""Singer""]",US
1,tt7989204,P Rell: Crucifix,2013,5,[Music],10.0,15,Alessio Giorgetti,,"[editor, director, actor]",actor,"[""Guitar Player""]",US
2,tt7989204,P Rell: Crucifix,2013,5,[Music],10.0,15,Marcus Shamgar Haywood,,"[actor, composer]",actor,"[""P Rell""]",US
3,tt15461714,King B.'s Hate... Love,2021,85,[Drama],10.0,14,Olivia Gant,,[actress],actress,"[""Eve""]",US
4,tt15461714,King B.'s Hate... Love,2021,85,[Drama],10.0,14,Tiphanie Nichole Rae,,[actress],actress,"[""Rachael""]",US
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,tt36205855,Pain Through Lyric,2025,,[Drama],9.9,27,da Creepa,,"[editor, visual_effects, composer]",actor,"[""Da Ceepa""]",US
96,tt36205855,Pain Through Lyric,2025,,[Drama],9.9,27,Adrianna Kymbrashia,,[actress],actress,"[""Lyric""]",US
97,tt36205855,Pain Through Lyric,2025,,[Drama],9.9,27,Gabriella Arlene,,[actress],actress,"[""LaLa""]",US
98,tt36205855,Pain Through Lyric,2025,,[Drama],9.9,27,P Tank Jones,,"[actor, producer, location_management]",actor,"[""Restaurant extra""]",US
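
The top of this result is dominated by titles with very few votes, for example a 10.0 rating based on 14 votes. One way to make the ranking more robust is to require a minimum number of votes. The sketch below is an assumption-laden variant, not the query used above: the helper name `top_rated_movies` and the default threshold of 1,000 votes are hypothetical, and the column list is shortened for readability. It also replaces the join on `title_akas` with an `EXISTS` check, since a plain join can repeat rows when a title has more than one US entry.

```python
def top_rated_movies(conn, min_votes=1000):
    """Top 100 (movie, actor) rows for US movies since 2010 with enough votes."""
    sql = """
        SELECT
            tb.tconst,
            tb.primary_title AS movie_title,
            tb.start_year,
            tr.average_rating,
            tr.num_votes,
            nb.primary_name AS actor_name,
            tp.category
        FROM title_basics tb
        JOIN title_ratings tr ON tb.tconst = tr.tconst
        JOIN title_principals tp ON tb.tconst = tp.tconst
        JOIN name_basics nb ON tp.nconst = nb.nconst
        WHERE tb.title_type = 'movie'
            AND tb.start_year >= 2010
            AND tp.category IN ('actor', 'actress')
            AND tr.num_votes >= ?   -- bound to min_votes below
            -- EXISTS instead of JOIN: several US entries in title_akas
            -- for the same title do not multiply the rows
            AND EXISTS (
                SELECT 1 FROM title_akas ta
                WHERE ta.title_id = tb.tconst AND ta.region = 'US'
            )
        ORDER BY tr.average_rating DESC, tr.num_votes DESC
        LIMIT 100
    """
    # The threshold is passed as a prepared-statement parameter.
    return conn.execute(sql, [min_votes]).df()

# In the notebook:
# df = top_rated_movies(conn, min_votes=1000)
# display(df)
```

The right threshold depends on the analysis. A lower value keeps more niche titles, while a higher one restricts the result to widely rated movies.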
