# Exploratory Data Analysis (EDA) — Movie Recommendation System

**Goal:** Understand the data structure and key patterns (ratings distribution, user activity, movie popularity, sparsity) using **locally saved Parquet files** to avoid repeated BigQuery queries.

**Inputs (from `data/`):**
- `interactions.parquet` — filtered interactions (`userId`, `movieId`, `rating`, `timestamp`)
- `movie_stats.parquet` — per-movie stats (`movieId`, `n_ratings`, `avg_rating`, `std_rating`)
- `user_stats.parquet` — per-user stats (`userId`, `n_ratings`, `avg_rating`, `first_ts`, `last_ts`)

In [19]:
from pathlib import Path
import pandas as pd

PROJECT_ROOT = Path.cwd().parent  # the notebook runs from a subfolder, so the project root is one level up
DATA_DIR = PROJECT_ROOT / "data"

print("PROJECT_ROOT:", PROJECT_ROOT)
print("DATA_DIR:", DATA_DIR)

PROJECT_ROOT: /home/jupyter/IslemFatma_Yassine/Personalized_Movie_Recommendation_System
DATA_DIR: /home/jupyter/IslemFatma_Yassine/Personalized_Movie_Recommendation_System/data


In [20]:
interactions_path = DATA_DIR / "interactions.parquet"
movie_stats_path = DATA_DIR / "movie_stats.parquet"
user_stats_path = DATA_DIR / "user_stats.parquet"

# Fail early, showing the offending path, if an expected file is missing
assert interactions_path.exists(), interactions_path
assert movie_stats_path.exists(), movie_stats_path
assert user_stats_path.exists(), user_stats_path

interactions = pd.read_parquet(interactions_path)
movie_stats = pd.read_parquet(movie_stats_path)
user_stats = pd.read_parquet(user_stats_path)

print("Loaded:")
print(" - interactions:", interactions.shape)
print(" - movie_stats:", movie_stats.shape)
print(" - user_stats:", user_stats.shape)


Loaded:
 - interactions: (70513, 4)
 - movie_stats: (1322, 4)
 - user_stats: (668, 5)
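Before going further, it can help to confirm that each table has the columns listed under **Inputs**. The snippet below is a minimal sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the column lists are copied from the inputs list above.

```python
import pandas as pd

# Expected columns, copied from the "Inputs" list at the top of this notebook.
EXPECTED_COLUMNS = {
    "interactions": ["userId", "movieId", "rating", "timestamp"],
    "movie_stats": ["movieId", "n_ratings", "avg_rating", "std_rating"],
    "user_stats": ["userId", "n_ratings", "avg_rating", "first_ts", "last_ts"],
}


def missing_columns(df, expected):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]
```

In the notebook this would be called as `missing_columns(interactions, EXPECTED_COLUMNS["interactions"])`; an empty list means the schema matches.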


In [15]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", palette="mako")
plt.rcParams["figure.figsize"] = (10, 5)

## Dataset Overview
### Basic Statistics

In [12]:
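# Counts come from the per-user and per-movie stats tables and the interactions table
num_users = user_stats.shape[0]
num_movies = movie_stats.shape[0]
num_ratings = interactions.shape[0]

# Sparsity: share of the user x movie matrix that has no rating
total_possible = num_users * num_movies
sparsity = 1 - (num_ratings / total_possible)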

num_users, num_movies, num_ratings, sparsity

(668, 1322, 70513, 0.9201525089005046)
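The sparsity figure can be reproduced from the three counts alone. The sketch below wraps the same formula in a small function; `matrix_sparsity` is a hypothetical name, and the calculation assumes each (user, movie) pair appears at most once in `interactions`, which this notebook has not verified yet.

```python
def matrix_sparsity(n_users, n_movies, n_ratings):
    """Fraction of user-movie pairs without a rating (hypothetical helper)."""
    total_possible = n_users * n_movies
    if total_possible == 0:
        raise ValueError("need at least one user and one movie")
    return 1 - n_ratings / total_possible


# Counts taken from the cell output above.
sparsity_check = matrix_sparsity(668, 1322, 70513)
density_check = 1 - sparsity_check
print(f"sparsity: {sparsity_check:.2%}, density: {density_check:.2%}")
```

So roughly 92% of the user-movie matrix is empty, and only about 8% of the possible pairs carry a rating.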

In [None]:
## User Activity Analysis
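One possible starting point for this section is the distribution of ratings per user. The sketch below recomputes the counts from `interactions` so they can be compared against `user_stats["n_ratings"]`; `user_activity_summary` is a hypothetical helper, and the chosen percentiles are only a suggestion.

```python
import pandas as pd


def user_activity_summary(interactions):
    """Summary statistics of the number of ratings per user (hypothetical helper)."""
    # One row per rating, so the group size is the user's rating count.
    ratings_per_user = interactions.groupby("userId").size()
    return ratings_per_user.describe(percentiles=[0.5, 0.9, 0.99])
```

A large gap between the median and the upper percentiles would suggest that a small group of heavy users contributes a large share of the ratings.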

In [None]:
## Movie Popularity Analysis
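A simple way to quantify popularity skew is the share of all ratings that goes to the most-rated movies. The sketch below assumes `movie_stats["n_ratings"]` holds per-movie rating counts, as described in the inputs list; `top_share` and its default of 10% are hypothetical choices.

```python
import pandas as pd


def top_share(n_ratings, top_fraction=0.1):
    """Share of all ratings received by the most-rated `top_fraction` of movies."""
    counts = n_ratings.sort_values(ascending=False)
    # Always keep at least one movie so the slice is never empty.
    n_top = max(1, int(len(counts) * top_fraction))
    return counts.iloc[:n_top].sum() / counts.sum()
```

In the notebook this would be called as `top_share(movie_stats["n_ratings"])`. A value far above `top_fraction` points to a long-tail distribution.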

In [None]:
## Genre Analysis
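None of the three input files listed above contains genre information, so this section needs an extra table. The sketch below assumes a hypothetical `movies` DataFrame with a `movieId` column and a `genres` column holding separator-delimited strings such as `"Comedy|Drama"`; both the table and its format are assumptions, not something this notebook loads.

```python
import pandas as pd


def ratings_per_genre(interactions, movies, sep="|"):
    """Rating count and mean rating per genre (hypothetical helper).

    Assumes `movies` has `movieId` and a `genres` column of `sep`-delimited strings.
    """
    genres = movies[["movieId", "genres"]].copy()
    # Split the genre string into a list, then emit one row per (movie, genre).
    genres["genre"] = genres["genres"].str.split(sep)
    genres = genres.explode("genre")[["movieId", "genre"]]
    merged = interactions.merge(genres, on="movieId", how="inner")
    stats = merged.groupby("genre")["rating"].agg(n_ratings="count", avg_rating="mean")
    return stats.sort_values("n_ratings", ascending=False)
```

Because a movie can belong to several genres, a single rating is counted once per genre, so the per-genre counts add up to more than the number of ratings.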

In [None]:
## Temporal Analysis
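A first look at rating volume over time could count ratings per calendar year. The sketch below assumes `timestamp` stores Unix epoch seconds, which the notebook has not confirmed; if the column is in another unit, the hypothetical `unit` parameter would need to change accordingly.

```python
import pandas as pd


def ratings_per_year(interactions, unit="s"):
    """Number of ratings per calendar year (hypothetical helper).

    Assumes `timestamp` is a numeric Unix epoch value expressed in `unit`.
    """
    ts = pd.to_datetime(interactions["timestamp"], unit=unit)
    return ts.dt.year.value_counts().sort_index()
```

The resulting Series is indexed by year, so it can be passed straight to a bar plot.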

In [None]:
## Data Quality Check
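Basic checks for this section could cover missing values, repeated (user, movie) pairs, and ratings outside the expected scale. In the sketch below, `quality_report` is a hypothetical helper and the default bounds of 0.5 and 5.0 are an assumption about the rating scale that should be adjusted to the actual data.

```python
import pandas as pd


def quality_report(interactions, min_rating=0.5, max_rating=5.0):
    """Count common data problems in the interactions table (hypothetical helper).

    The rating bounds are assumed defaults, not values confirmed by this notebook.
    """
    rating = interactions["rating"]
    # Missing ratings are reported under missing_values, not as out of range.
    out_of_range = rating.notna() & ~rating.between(min_rating, max_rating)
    duplicates = interactions.duplicated(subset=["userId", "movieId"])
    return {
        "missing_values": int(interactions.isna().sum().sum()),
        "duplicate_user_movie_pairs": int(duplicates.sum()),
        "ratings_out_of_range": int(out_of_range.sum()),
    }
```

All three counts being zero would also support the sparsity calculation earlier, which relies on each (user, movie) pair appearing at most once.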