# Colab Demo

- Student: Aldiouma MBAYE
- Professor: Mr. DERRAZ Foued
- Module: Data Warehousing and ETL

- Run the cells in order (1 → 6).
- Cell 1 clones the GitHub repository, or pulls it if it is already present.
- Cell 2 installs a JDK, `pyspark`, and `requests`.
- Cell 3 downloads the IMDb files into `raw/`.
- Cell 4 runs the pipeline with `--show-counts`.
- Cells 5 and 6 check the Parquet outputs and display a preview.


In [1]:
# 1) Set the GitHub repository URL, then clone it (or pull if it is already cloned)
REPO_URL = "https://github.com/maldiouma/Pipeline-PySpark-ETL-IMDb.git"
REPO_DIR = "/content/Pipeline-PySpark-ETL-IMDb"

import os
print('REPO_URL:', REPO_URL)
if not os.path.exists(REPO_DIR):
    !git clone $REPO_URL $REPO_DIR
else:
    print('Repository already present, pulling...')
    # `git -C` runs the pull inside REPO_DIR without an extra %cd
    !git -C $REPO_DIR pull
%cd $REPO_DIR
print('Current directory:', os.getcwd())

REPO_URL: https://github.com/maldiouma/Pipeline-PySpark-ETL-IMDb.git
Cloning into '/content/Pipeline-PySpark-ETL-IMDb'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 14 (delta 3), reused 14 (delta 3), pack-reused 0 (from 0)
Receiving objects: 100% (14/14), 88.00 KiB | 8.80 MiB/s, done.
Resolving deltas: 100% (3/3), done.
/content/Pipeline-PySpark-ETL-IMDb
Current directory: /content/Pipeline-PySpark-ETL-IMDb
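
Outside a notebook, the `!git` and `%cd` magics are not available. The clone-or-pull step can be sketched in plain Python as follows. The function names `git_sync_command` and `sync_repo` are hypothetical and are not part of the repository; the `--ff-only` flag is an added safety choice, not something the cell above uses.

```python
import os
import subprocess


def git_sync_command(repo_url: str, repo_dir: str) -> list[str]:
    """Return the git command that brings repo_dir up to date.

    Clone when no checkout exists yet, otherwise pull in place.
    """
    if os.path.isdir(os.path.join(repo_dir, ".git")):
        # `git -C <dir>` runs the command inside <dir> without a chdir.
        return ["git", "-C", repo_dir, "pull", "--ff-only"]
    return ["git", "clone", repo_url, repo_dir]


def sync_repo(repo_url: str, repo_dir: str) -> None:
    """Run the command, raise CalledProcessError if git fails, then cd."""
    subprocess.run(git_sync_command(repo_url, repo_dir), check=True)
    os.chdir(repo_dir)
```

Calling `sync_repo(REPO_URL, REPO_DIR)` would then stand in for the `if`/`else` block of the cell. Separating the command construction from its execution keeps the decision logic testable without network access.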


In [10]:
# 2) Install a JDK for Spark, then the Python dependencies (pyspark, requests)
!apt-get update -y >/dev/null 2>&1 && apt-get install -y openjdk-17-jdk-headless >/dev/null 2>&1
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ["PATH"] = f"{os.environ['JAVA_HOME']}/bin:" + os.environ['PATH']
!java -version
# Prefer the repository's requirements.txt; fall back to installing the two packages directly
!pip install -q -r requirements.txt || pip install -q pyspark requests

openjdk version "17.0.17" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17+10-Ubuntu-122.04)
OpenJDK 64-Bit Server VM (build 17.0.17+10-Ubuntu-122.04, mixed mode, sharing)
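
Because the JDK setup appears in more than one cell, each run prepends the same `bin` directory to `PATH` again. A minimal sketch of an idempotent variant is shown below. The helper names `with_java_home` and `java_is_available` are hypothetical; the only library calls used are `os.path.join` and `shutil.which`.

```python
import os
import shutil


def with_java_home(env: dict[str, str], java_home: str) -> dict[str, str]:
    """Return a copy of env with JAVA_HOME set and its bin/ first on PATH.

    Re-applying the function does not duplicate the PATH entry.
    """
    updated = dict(env)
    updated["JAVA_HOME"] = java_home
    java_bin = os.path.join(java_home, "bin")
    parts = [
        p for p in updated.get("PATH", "").split(os.pathsep)
        if p and p != java_bin
    ]
    updated["PATH"] = os.pathsep.join([java_bin, *parts])
    return updated


def java_is_available(env: dict[str, str]) -> bool:
    """True when a `java` executable can be found on env's PATH."""
    return shutil.which("java", path=env.get("PATH", "")) is not None
```

In the notebook this could be applied with `os.environ.update(with_java_home(dict(os.environ), "/usr/lib/jvm/java-17-openjdk-amd64"))`, followed by a check of `java_is_available(dict(os.environ))` before starting Spark.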


In [11]:
# 3) Create the folders and download the IMDb files
import os
os.makedirs('raw', exist_ok=True)

!wget -O raw/title.basics.tsv.gz https://datasets.imdbws.com/title.basics.tsv.gz
!wget -O raw/title.ratings.tsv.gz https://datasets.imdbws.com/title.ratings.tsv.gz

--2026-01-08 10:22:09--  https://datasets.imdbws.com/title.basics.tsv.gz
Resolving datasets.imdbws.com (datasets.imdbws.com)... 18.239.69.2, 18.239.69.29, 18.239.69.88, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|18.239.69.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 216047421 (206M) [binary/octet-stream]
Saving to: ‘raw/title.basics.tsv.gz’


2026-01-08 10:22:09 (345 MB/s) - ‘raw/title.basics.tsv.gz’ saved [216047421/216047421]

--2026-01-08 10:22:10--  https://datasets.imdbws.com/title.ratings.tsv.gz
Resolving datasets.imdbws.com (datasets.imdbws.com)... 18.239.69.2, 18.239.69.29, 18.239.69.88, ...
Connecting to datasets.imdbws.com (datasets.imdbws.com)|18.239.69.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8192377 (7.8M) [binary/octet-stream]
Saving to: ‘raw/title.ratings.tsv.gz’


2026-01-08 10:22:10 (183 MB/s) - ‘raw/title.ratings.tsv.gz’ saved [8192377/8192377]
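
Before launching the pipeline, it can help to confirm that the downloaded archives are readable. The sketch below previews the first rows of a gzipped TSV using only the standard library. The function name `preview_tsv_gz` is hypothetical. It assumes the files are tab-separated with a header row and use `\N` as the null marker, as IMDb documents for its dataset files.

```python
import csv
import gzip
from itertools import islice


def preview_tsv_gz(path, n_rows=3):
    """Return the first n_rows of a gzipped TSV as dicts, mapping \\N to None."""
    rows = []
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: quote characters inside titles are plain data, not delimiters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in islice(reader, n_rows):
            rows.append({k: (None if v == r"\N" else v) for k, v in row.items()})
    return rows
```

For example, `preview_tsv_gz('raw/title.ratings.tsv.gz')` would return up to three rows keyed by the column names in the file's header.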



In [12]:
# 4) Run the pipeline and print row counts (about 12 minutes on Colab)
# Install a JDK for Spark on Colab (same setup as in step 2)
!apt-get update -y >/dev/null 2>&1 && apt-get install -y openjdk-17-jdk-headless >/dev/null 2>&1
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"
os.environ["PATH"] = f"{os.environ['JAVA_HOME']}/bin:" + os.environ["PATH"]
!java -version
!python src/etl_imdb.py --raw-dir raw --dw-dir dw --marts-dir marts --show-counts

openjdk version "17.0.17" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17+10-Ubuntu-122.04)
OpenJDK 64-Bit Server VM (build 17.0.17+10-Ubuntu-122.04, mixed mode, sharing)
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/08 10:22:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[stats] titles_stg: 736156
[stats] ratings_stg: 1621609
[stats] dim_year: 136
[stats] dim_title: 736156
[stats] dim_genre: 28
[stats] bridge_title_genre: 1008943
[stats] fact_ratings: 337781
[stats] mart_year_kpi: 133
[stats] mart_top_genre_year: 17193
[stats] mart_top_year_by_rating: 1112
[stats] mart_rating_distribution: 1996
ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/py4j
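
The `[stats]` lines printed by `--show-counts` lend themselves to a quick sanity check. The sketch below parses them into a dictionary. The name `parse_stats` and the regular expression are assumptions based only on the line format visible in the output above, not on the pipeline's source.

```python
import re

# Matches lines such as "[stats] dim_title: 736156".
STATS_LINE = re.compile(r"^\[stats\]\s+(\w+):\s+(\d+)\s*$")


def parse_stats(log_text: str) -> dict[str, int]:
    """Collect table name -> row count from '[stats]' lines; ignore other lines."""
    counts = {}
    for line in log_text.splitlines():
        match = STATS_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts
```

In the run logged above, `dim_title` matches `titles_stg` (736156 rows each), and `fact_ratings` (337781) is smaller than both staging tables. With the counts parsed, relations like these can be expressed as assertions. In IPython, the command output can be captured for parsing with an assignment such as `log = !python src/etl_imdb.py ...`, then joined with `"\n".join(log)`.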

In [14]:
# 6) Preview of the top 10 by genre and year
import pandas as pd
df_top = pd.read_parquet('marts/mart_top_genre_year')
df_top.sort_values(['yearkey','genrekey','rk']).head(20)

Unnamed: 0,yearkey,genrekey,titlekey,avg_rating,num_votes,rk
0,1906,action,tt0000574,6.0,1048,1
1,1906,adventure,tt0000574,6.0,1048,1
2,1906,biography,tt0000574,6.0,1048,1
3,1911,adventure,tt0002130,7.1,4016,1
4,1911,drama,tt0002130,7.1,4016,1
5,1911,fantasy,tt0002130,7.1,4016,1
6,1913,crime,tt0002844,6.9,2695,1
7,1913,crime,tt0003037,6.9,1831,2
8,1913,crime,tt0003165,6.9,1459,3
9,1913,drama,tt0002844,6.9,2695,1
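
The `rk` column restarts at 1 for each `(yearkey, genrekey)` pair. In the three 1913 crime rows, which share a rating of 6.9, the rank follows the vote count in descending order. The sketch below reproduces that ordering in plain Python. It is an illustration under assumptions: the pipeline's actual ranking logic is not shown here, the tie-break rule is inferred from those three rows only, and `rank_within_groups` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter


def rank_within_groups(rows):
    """Add 'rk' per (yearkey, genrekey): best rating first, then most votes."""
    ordered = sorted(
        rows,
        key=lambda r: (r["yearkey"], r["genrekey"], -r["avg_rating"], -r["num_votes"]),
    )
    ranked = []
    # groupby needs its input sorted by the grouping key, which `ordered` is.
    for _, group in groupby(ordered, key=itemgetter("yearkey", "genrekey")):
        for position, row in enumerate(group, start=1):
            ranked.append({**row, "rk": position})
    return ranked
```

Keeping only rows with `rk <= 10` would then give a top 10 per genre and year, assuming ties are broken the same way throughout.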


In [17]:
# Preview of the year KPI mart
df_kpi = pd.read_parquet('marts/mart_year_kpi')
df_kpi.sort_values('yearkey').head(10)

Unnamed: 0,yearkey,n_movies,mean_rating,total_votes
1,1894.0,1,5.2,232
2,1896.0,1,3.6,29
3,1897.0,2,4.65,656
4,1898.0,6,2.266667,116
5,1899.0,8,2.375,245
6,1900.0,8,3.2375,260
7,1901.0,7,2.8,186
8,1902.0,3,1.766667,43
9,1903.0,4,3.7,876
10,1904.0,2,3.6,401
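
The columns `n_movies`, `mean_rating`, and `total_votes` point to a per-year aggregation over rated titles. The sketch below shows one way such a table could be computed in plain Python. It assumes `mean_rating` is an unweighted average of per-title ratings, which the output alone does not confirm. The function name `year_kpis` and the input field names are hypothetical, chosen to mirror the columns seen in the previews.

```python
from collections import defaultdict


def year_kpis(rows):
    """Aggregate rated titles per year: count, unweighted mean rating, vote sum."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["yearkey"]].append(row)

    result = []
    for year in sorted(buckets):
        titles = buckets[year]
        result.append({
            "yearkey": year,
            "n_movies": len(titles),
            "mean_rating": sum(t["avg_rating"] for t in titles) / len(titles),
            "total_votes": sum(t["num_votes"] for t in titles),
        })
    return result
```

A vote-weighted mean would replace the `mean_rating` expression with the sum of `avg_rating * num_votes` divided by `total_votes`. Comparing the two for a few years is one way to test the assumption against the mart.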
