## BIG DATA Project

### Project Participants

- **BEMMOUSSAT Marwan**
- **BOUKAOUI Mohamed**
- **EL AICHOUNI Yahya**

# IMDB Data Analysis & Wikimedia Stream Processing

This notebook is part of a Big Data project dedicated to the analysis of public IMDB datasets and the exploration of real-time Wikipedia events.

The first part of the project focuses on loading, cleaning, and analyzing large-scale movie and people data from IMDB. The objective is to answer a set of structured analytical questions related to movie characteristics, genres, ratings, demographic aspects, and metadata consistency. All datasets are loaded programmatically to ensure reproducibility and ease of execution.

The second part of the project introduces a lightweight stream processing pipeline based on the Wikimedia EventStreams platform. IMDB-related entities (such as movies, people, or genres) are monitored in real time to compute simple metrics and trigger alert-type events, simulating a real-world monitoring use case.

The project emphasizes clean code organization, reproducibility, and clear documentation, with all results generated directly from code and supported by explanatory markdown cells.


In [55]:
import requests
from pathlib import Path

# Base URL for IMDB datasets
BASE_URL = "https://datasets.imdbws.com/"

# Required datasets
FILES = [
    "name.basics.tsv.gz",
    "title.basics.tsv.gz",
    "title.ratings.tsv.gz",
    "title.crew.tsv.gz",
    "title.akas.tsv.gz"
]

# Create data directory
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

def download_file(filename):
    file_path = data_dir / filename
    if file_path.exists():
        print(f"✅ {filename} already exists")
        return

    print(f"⬇️ Downloading {filename}...")
    # A timeout prevents the cell from hanging indefinitely if the server stalls
    response = requests.get(BASE_URL + filename, stream=True, timeout=60)
    response.raise_for_status()

    with open(file_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            # The file is downloaded in chunks (1 MB at a time) to avoid loading large files entirely into memory.
            # This approach ensures better memory management and allows the code to scale efficiently
            # when handling large datasets such as the IMDB files.
            if chunk:
                f.write(chunk)

    print(f"✅ Downloaded {filename}")

# Download all datasets
for file in FILES:
    download_file(file)

✅ name.basics.tsv.gz already exists
✅ title.basics.tsv.gz already exists
✅ title.ratings.tsv.gz already exists
✅ title.crew.tsv.gz already exists
✅ title.akas.tsv.gz already exists


### Question 1 – Data Loading

The IMDB datasets were downloaded programmatically from the official IMDB dataset endpoint
(`https://datasets.imdbws.com/`) in a notebook cell. No manual file import was performed.
The files are stored locally in a dedicated `data/` directory, and files that already exist
are skipped on re-execution, which keeps the notebook reproducible.

The following datasets were loaded automatically:
- name.basics.tsv.gz
- title.basics.tsv.gz
- title.ratings.tsv.gz
- title.crew.tsv.gz
- title.akas.tsv.gz

All other datasets were deemed unnecessary for the scope of this analysis.
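
One caveat of skipping files that already exist is that an interrupted download leaves a truncated file behind, which the next run would treat as complete. A minimal sketch of one way to guard against this is to write to a temporary file and rename it only once the transfer has finished. The helper name `save_atomically` and the `.part` suffix are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path
import tempfile


def save_atomically(chunks, dest):
    """Write byte chunks to `dest` through a temporary `.part` file.

    The final name only appears once every chunk has been written, so an
    interrupted transfer never leaves a file that looks complete.
    """
    dest = Path(dest)
    tmp = dest.with_name(dest.name + ".part")
    try:
        with open(tmp, "wb") as f:
            for chunk in chunks:
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)
        tmp.replace(dest)  # rename within the same directory
    finally:
        # Remove the leftover partial file if anything failed before the rename
        tmp.unlink(missing_ok=True)
    return dest


# Demonstration with in-memory chunks instead of a network response
demo_dir = Path(tempfile.mkdtemp())
target = save_atomically([b"abc", b"", b"def"], demo_dir / "demo.bin")
print(target.read_bytes())
```

In the download cell, the same helper could be fed with `response.iter_content(chunk_size=1024 * 1024)` in place of the in-memory list.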

In [56]:
import pandas as pd

people_df = pd.read_csv(
    "data/name.basics.tsv.gz",
    sep="\t",
    na_values="\\N",
    low_memory=False
)

people_df.shape

(14948799, 6)

In [57]:
total_people = people_df.shape[0]
total_people

14948799

### Question 2 – Total Number of People

The IMDB dataset `name.basics.tsv.gz` contains **14,948,799** individuals.
Each row corresponds to a unique person identified by an IMDB identifier (`nconst`).


In [58]:
# Convert birthYear to numeric (invalid values become NaN)
people_df["birthYear"] = pd.to_numeric(people_df["birthYear"], errors="coerce")

# Earliest year of birth
earliest_birth_year = int(people_df["birthYear"].min())

earliest_birth_year

4

### Question 3 – Earliest Year of Birth

The earliest recorded year of birth in the IMDB dataset is **4**.
Only valid numeric birth years were considered; missing or invalid values were excluded from the analysis.

In [59]:
from datetime import datetime

current_year = datetime.now().year
years_ago = current_year - earliest_birth_year

years_ago

2021

### Question 4 – How Many Years Ago Was This Person Born?

Based on the earliest recorded birth year (**4**) and the current year (**2025**),
this person was born approximately **2021 years ago**.

In [60]:
# Identify the person(s) with the earliest birth year
earliest_people = people_df[people_df["birthYear"] == earliest_birth_year]

earliest_people[["nconst", "primaryName", "birthYear", "deathYear"]].head()

| | nconst | primaryName | birthYear | deathYear |
|---|---|---|---|---|
| 737944 | nm0784172 | Lucio Anneo Seneca | 4.0 | 65.0 |


### Question 5 – Validity of the Earliest Birth Date

Based solely on the information available in the IMDB dataset, the earliest recorded year of birth
is **unlikely to be accurate**.

The dataset shows historical inconsistencies in early records, particularly for individuals with
very old birth dates. Such entries often relate to figures with limited or uncertain biographical
information, and IMDB does not guarantee the reliability of early historical data.

As a result, when relying exclusively on this dataset and without external validation, this birth
date should be interpreted with caution.


### Question 6 – Reasoning Behind the Birth Date Validity Assessment

The earliest year of birth identified in the dataset is **4**, which corresponds to the year AD 4.
While this value is present in the `birthYear` column, its validity is questionable when examined
solely through the lens of the dataset itself.

The IMDB `name.basics` dataset aggregates biographical information from a wide range of sources,
including historical, legendary, or poorly documented figures. For very early dates, the dataset
does not provide any mechanism to verify historical accuracy or consistency with professional
activity timelines.

Additionally, IMDB-related records primarily concern individuals associated with film and television,
industries that emerged many centuries after the year 4. The presence of such an early birth year
therefore suggests a limitation in data quality rather than a reliable biographical fact.

As a result, based only on the internal consistency and context of the dataset, this birth date
should be considered unreliable and treated with caution in further analyses.
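
The internal-consistency reasoning above can be sketched as a simple screening rule. The example below is only an illustration on a toy table: the first row is the record shown earlier, the other rows are made-up placeholders, and the cutoff year `EARLIEST_PLAUSIBLE_BIRTH` is an assumed value rather than an IMDB rule. A flagged row is a candidate for review, not proof of an error.

```python
import pandas as pd

# First row is the record displayed above; the remaining rows are placeholders.
sample_people = pd.DataFrame({
    "nconst": ["nm0784172", "nm_demo_1", "nm_demo_2", "nm_demo_3"],
    "primaryName": ["Lucio Anneo Seneca", "Demo Person 1",
                    "Demo Person 2", "Demo Person 3"],
    "birthYear": [4.0, 1950.0, None, 1985.0],
    "deathYear": [65.0, None, None, 1980.0],
})

# Assumed cutoff: births long before the film era are worth a manual look.
EARLIEST_PLAUSIBLE_BIRTH = 1800

# Comparisons involving NaN evaluate to False, so missing years are not flagged.
born_before_cutoff = sample_people["birthYear"] < EARLIEST_PLAUSIBLE_BIRTH
died_before_born = sample_people["deathYear"] < sample_people["birthYear"]

suspect_people = sample_people[born_before_cutoff | died_before_born]
print(suspect_people[["nconst", "birthYear", "deathYear"]])
```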

In [61]:
# Ensure birthYear is numeric
people_df["birthYear"] = pd.to_numeric(people_df["birthYear"], errors="coerce")

# Most recent year of birth
most_recent_birth_year = int(people_df["birthYear"].max())

most_recent_birth_year

2025

### Question 7 – Most Recent Year of Birth

The most recent recorded year of birth in the IMDB dataset is **2025**.
This value reflects the maximum birth year present in the dataset and does not necessarily
indicate a biologically plausible date, highlighting potential data quality limitations.

In [62]:
# Number of people without a listed birth year
missing_birth_years = people_df["birthYear"].isna().sum()

# Total number of people
total_people = people_df.shape[0]

# Percentage of missing birth years
percentage_missing = (missing_birth_years / total_people) * 100

missing_birth_years, percentage_missing

(14287661, 95.57731694700023)

### Question 8 – Percentage of Missing Birth Years

Out of all individuals listed in the IMDB dataset, **approximately 95.58%** do not have a
recorded year of birth. This corresponds to **14,287,661** people with a missing `birthYear`
value, highlighting how sparsely this field is populated.
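
As a quick check of the arithmetic, the percentage can be recomputed from the two counts reported above. The second half of the sketch shows an equivalent shortcut, `isna().mean()`, on a small toy series whose values are placeholders and not IMDB data.

```python
import pandas as pd

# Counts taken from the cell output above
missing_birth_years = 14_287_661
total_people = 14_948_799

percentage_missing = missing_birth_years / total_people * 100
print(round(percentage_missing, 2))

# Equivalent shortcut: the mean of a boolean mask is the share of True values.
toy_birth_years = pd.Series([1950.0, None, None, 1980.0])  # placeholder values
toy_share_missing = toy_birth_years.isna().mean() * 100
print(toy_share_missing)
```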

In [63]:
import pandas as pd

titles_df = pd.read_csv(
    "data/title.basics.tsv.gz",
    sep="\t",
    na_values="\\N",
    low_memory=False
)

In [64]:
# Convert columns to numeric where needed
titles_df["startYear"] = pd.to_numeric(titles_df["startYear"], errors="coerce")
titles_df["runtimeMinutes"] = pd.to_numeric(titles_df["runtimeMinutes"], errors="coerce")

# Filter shorts released after 1900 with a valid runtime
shorts_after_1900 = titles_df[
    (titles_df["titleType"] == "short") &
    (titles_df["startYear"] > 1900) &
    (titles_df["runtimeMinutes"].notna())
]

# Longest short
longest_short_runtime = int(shorts_after_1900["runtimeMinutes"].max())

longest_short_runtime

1311

### Question 9 – Longest Short Film After 1900

The longest recorded short film released after 1900 in the IMDB dataset has a runtime of
**1311 minutes**.

Only titles classified as `short`, released after 1900, and with a valid runtime were
considered in this analysis.
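
The cell above reports only the runtime. To see which title it belongs to, the same filter can be combined with `idxmax()`, as in the sketch below. It runs on a toy table with placeholder identifiers and titles, so the values are illustrative only.

```python
import pandas as pd

# Placeholder rows standing in for title.basics
sample_titles = pd.DataFrame({
    "tconst": ["tt_demo_1", "tt_demo_2", "tt_demo_3", "tt_demo_4"],
    "titleType": ["short", "short", "movie", "short"],
    "primaryTitle": ["Old Short", "Brief Short", "Feature", "Long Short"],
    "startYear": [1899.0, 1950.0, 1960.0, 2001.0],
    "runtimeMinutes": [500.0, 12.0, 90.0, 45.0],
})

# Same three conditions as in the analysis cell
is_recent_short = (
    (sample_titles["titleType"] == "short")
    & (sample_titles["startYear"] > 1900)
    & (sample_titles["runtimeMinutes"].notna())
)
toy_shorts = sample_titles[is_recent_short]

# idxmax() returns the index label of the maximum, so .loc gives the full row
longest_short = toy_shorts.loc[toy_shorts["runtimeMinutes"].idxmax()]
print(longest_short[["tconst", "primaryTitle", "runtimeMinutes"]])
```

Inspecting the full row makes it easier to judge whether an extreme runtime reflects a real title or a data-entry problem.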

In [65]:
# titles_df["startYear"] and titles_df["runtimeMinutes"] were already converted
# to numeric in the previous cell, so the conversion is not repeated here.

# Filter movies released after 1900 with a valid runtime
movies_after_1900 = titles_df[
    (titles_df["titleType"] == "movie") &
    (titles_df["startYear"] > 1900) &
    (titles_df["runtimeMinutes"].notna())
]

# Shortest movie
shortest_movie_runtime = int(movies_after_1900["runtimeMinutes"].min())

shortest_movie_runtime

1

### Question 10 – Shortest Movie After 1900

The shortest recorded movie released after 1900 in the IMDB dataset has a runtime of
**1 minute**.

This result is based on titles classified as `movie`, released after 1900, and having a
valid runtime value. The presence of such a short duration highlights potential classification
or data quality limitations within the dataset.

In [66]:
# Extract genres column, drop missing values
genres_series = titles_df["genres"].dropna()

# Split genres (comma-separated) and get unique values
all_genres = set(
    genre
    for genres in genres_series
    for genre in genres.split(",")
)

# Sort genres alphabetically
all_genres = sorted(all_genres)

all_genres

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

### Question 11 – List of Genres Represented

The IMDB dataset contains the following genres:

- Action
- Adult
- Adventure
- Animation
- Biography
- Comedy
- Crime
- Documentary
- Drama
- Family
- Fantasy
- Film-Noir
- Game-Show
- History
- Horror
- Music
- Musical
- Mystery
- News
- Reality-TV
- Romance
- Sci-Fi
- Short
- Sport
- Talk-Show
- Thriller
- War
- Western

This list was obtained by extracting and splitting the `genres` field for all titles and
keeping only unique genre values.
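
The split-and-deduplicate step can also be written with vectorized pandas operations. The sketch below applies `str.split` and `explode` to a toy series (placeholder values) and shows that the same intermediate result also gives per-genre counts.

```python
import pandas as pd

# Placeholder values mimicking the comma-separated `genres` column
toy_genres = pd.Series(["Comedy,Drama", None, "Drama", "Action,Comedy"])

# Split each string into a list, then turn every list element into its own row
exploded = toy_genres.dropna().str.split(",").explode()

unique_genres = sorted(exploded.unique())
genre_counts = exploded.value_counts()

print(unique_genres)
print(genre_counts)
```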

In [67]:
ratings_df = pd.read_csv(
    "data/title.ratings.tsv.gz",
    sep="\t",
    na_values="\\N"
)

In [68]:
# Keep only movies
movies_df = titles_df[titles_df["titleType"] == "movie"]

# Keep only comedy movies
comedy_movies = movies_df[
    movies_df["genres"].notna() &
    movies_df["genres"].str.contains("Comedy")
]

comedy_movies.head()

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|
| 1016 | tt0001028 | movie | Salome Mad | Salome Mad | 0 | 1909.0 | NaN | NaN | Comedy |
| 1329 | tt0001341 | movie | Jarní sen starého mládence | Jarní sen starého mládence | 0 | 1913.0 | NaN | NaN | Comedy |
| 2649 | tt0002676 | movie | El bello Arturo | El bello Arturo | 0 | 1913.0 | NaN | NaN | Comedy |
| 2719 | tt0002746 | movie | Checkers | Checkers | 0 | 1913.0 | NaN | NaN | Comedy,Drama |
| 2771 | tt0002798 | movie | Le dernier pardon | Le dernier pardon | 0 | 1913.0 | NaN | NaN | Comedy |


In [69]:
# Merge comedy movies with ratings
comedy_with_ratings = comedy_movies.merge(
    ratings_df,
    on="tconst",
    how="inner"
)

comedy_with_ratings.head()

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | averageRating | numVotes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0001028 | movie | Salome Mad | Salome Mad | 0 | 1909.0 | NaN | NaN | Comedy | 3.8 | 26 |
| 1 | tt0002996 | movie | My Husband's Getting Married | Házasodik az uram | 0 | 1913.0 | NaN | NaN | Comedy | 3.6 | 38 |
| 2 | tt0003015 | movie | Die Insel der Seligen | Die Insel der Seligen | 0 | 1913.0 | NaN | 49.0 | Comedy,Fantasy | 4.6 | 81 |
| 3 | tt0003324 | movie | A Regiment of Two | A Regiment of Two | 0 | 1913.0 | NaN | 57.0 | Comedy,Drama | 6.3 | 32 |
| 4 | tt0003565 | movie | Where Is Coletti? | Wo ist Coletti? | 0 | 1913.0 | NaN | 86.0 | Comedy,Crime | 6.4 | 56 |


In [70]:
# Sort by rating, then by number of votes
top_comedy = comedy_with_ratings.sort_values(
    by=["averageRating", "numVotes"],
    ascending=[False, False]
).iloc[0]

top_comedy[
    ["primaryTitle", "averageRating", "numVotes"]
]

primaryTitle     Space Melody
averageRating            10.0
numVotes                    6
Name: 66269, dtype: object

### Question 12 – Highest Rated Comedy Movie

The highest rated comedy movie in the IMDB dataset is **Space Melody**,
with an average rating of **10** based on **6** votes.

In cases where multiple comedy movies share the same rating, the tie was broken by selecting
the movie with the highest number of votes, as specified in the assignment.
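
Because the top result rests on only 6 votes, it can be useful to see how the ranking behaves when a minimum vote count is required. The sketch below is a sensitivity check only and does not replace the assignment's rule. The threshold `MIN_VOTES` is an assumed value, and apart from the first row (taken from the output above) the rows are hypothetical.

```python
import pandas as pd

# First row comes from the result above; the other rows are hypothetical.
toy_ratings = pd.DataFrame({
    "primaryTitle": ["Space Melody", "Hypothetical Comedy A", "Hypothetical Comedy B"],
    "averageRating": [10.0, 8.9, 8.9],
    "numVotes": [6, 120000, 4000],
})

MIN_VOTES = 1000  # assumed threshold, chosen only for illustration

eligible = toy_ratings[toy_ratings["numVotes"] >= MIN_VOTES]

# Same ordering as the analysis cell: rating first, vote count as tie-breaker
top_with_threshold = eligible.sort_values(
    by=["averageRating", "numVotes"],
    ascending=[False, False],
).iloc[0]

print(top_with_threshold["primaryTitle"])
```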

In [71]:
crew_df = pd.read_csv(
    "data/title.crew.tsv.gz",
    sep="\t",
    na_values="\\N"
)

In [72]:
# Get tconst of Space Melody
space_melody_tconst = top_comedy["tconst"]

# Find director(s)
director_ids = crew_df[
    crew_df["tconst"] == space_melody_tconst
]["directors"].iloc[0]

director_ids

'nm4492923'

In [73]:
# Split director IDs
director_ids_list = director_ids.split(",")

# Retrieve director names
directors = people_df[
    people_df["nconst"].isin(director_ids_list)
]["primaryName"].tolist()

directors

['Leonardo Thimo']

### Question 13 – Director of the Movie

The `title.crew` entry for **Space Melody** lists a single director:

- **Leonardo Thimo**

This information was obtained by linking the movie to the `title.crew` dataset and matching
the director identifiers with the corresponding names in the `name.basics` dataset.
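
When a title has several directors, filtering with `isin` returns the names in the row order of `name.basics`, not in the order listed in `title.crew`. A possible alternative, sketched below on toy tables with placeholder identifiers and names, is to explode the comma-separated IDs and left-merge them with the people table, which keeps the credit order.

```python
import pandas as pd

# Placeholder tables standing in for title.crew and name.basics
toy_crew = pd.DataFrame({
    "tconst": ["tt_demo_1"],
    "directors": ["nm_demo_b,nm_demo_a"],
})
toy_people = pd.DataFrame({
    "nconst": ["nm_demo_a", "nm_demo_b"],
    "primaryName": ["Person A", "Person B"],
})

# One row per (title, director) pair, in the order given by title.crew
director_rows = toy_crew.assign(
    nconst=toy_crew["directors"].str.split(",")
).explode("nconst")

# A left merge preserves the row order of the left table
named_directors = director_rows.merge(toy_people, on="nconst", how="left")
director_names = named_directors["primaryName"].tolist()

print(director_names)
```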

In [74]:
akas_df = pd.read_csv(
    "data/title.akas.tsv.gz",
    sep="\t",
    na_values="\\N",
    low_memory=False
)

# Get alternate titles for Space Melody
alternate_titles = akas_df[
    akas_df["titleId"] == space_melody_tconst
]["title"].unique()

alternate_titles

array(['Space Melody', 'H Melwdia Tou Diastimatos',
       "Leonardo Thimo's Space Melody"], dtype=object)

### Question 14 – Alternate Titles

The following alternate titles were found for **Space Melody**:

- *Space Melody*
- *H Melwdia Tou Diastimatos*
- *Leonardo Thimo's Space Melody*

These titles were retrieved from the `title.akas` dataset using the movie’s IMDB identifier.
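
`title.akas` is a large file, and loading it entirely just to look up one identifier uses a lot of memory. A minimal sketch of a lighter approach is to read the file in chunks and keep only the matching rows. The helper name `titles_for` and the chunk size are assumptions, and the example reads from an in-memory toy TSV with placeholder rows so that it runs without the real file.

```python
import io
import pandas as pd

# Toy TSV with placeholder rows; "\N" marks a missing value as in the IMDB files
sample_tsv = (
    "titleId\tordering\ttitle\tregion\n"
    "tt_demo_1\t1\tFirst Title\tUS\n"
    "tt_demo_2\t1\tOther Film\tFR\n"
    "tt_demo_1\t2\tPremier Titre\tFR\n"
    "tt_demo_1\t3\tFirst Title\t\\N\n"
)


def titles_for(source, title_id, chunksize=2):
    """Collect the distinct titles of `title_id`, reading `source` chunk by chunk."""
    matches = []
    reader = pd.read_csv(source, sep="\t", na_values="\\N", chunksize=chunksize)
    for chunk in reader:
        # Keep only the rows of interest before moving to the next chunk
        matches.append(chunk.loc[chunk["titleId"] == title_id, "title"])
    return pd.concat(matches).unique().tolist()


toy_alternate_titles = titles_for(io.StringIO(sample_tsv), "tt_demo_1")
print(toy_alternate_titles)
```

For the real file, `source` would be `"data/title.akas.tsv.gz"` and a much larger `chunksize` would be more appropriate.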