# Day 1: ML Exploration for Movie Recommender System

This notebook explores the MovieLens dataset and sets up simple baseline models as the foundation for a movie recommendation system.

## Contents
1. Setup and Imports
2. Dataset Loading and Inspection
3. Exploratory Data Analysis (EDA)
4. Data Preprocessing
5. Initial Modeling Setup
6. Next Steps

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Import our custom data inspection module
import sys
sys.path.append('..')
from src.data_inspection import (
    load_ratings, load_movies, load_tags,
    print_dataset_summary, get_ratings_statistics,
    get_movies_statistics, get_tags_statistics
)

print('Libraries imported successfully!')

## 2. Dataset Loading and Inspection

We'll load the three main datasets from MovieLens:
- **ratings.csv**: User ratings for movies
- **movies.csv**: Movie information (title, genres)
- **tags.csv**: User-generated tags for movies

In [None]:
# Define data paths (adjust as needed)
DATA_PATH = '../data/'

# Load datasets
# Uncomment the lines below once data files are available
# ratings = load_ratings(DATA_PATH + 'ratings.csv')
# movies = load_movies(DATA_PATH + 'movies.csv')
# tags = load_tags(DATA_PATH + 'tags.csv')

print('Note: Uncomment the data loading lines once the MovieLens dataset is available.')
print('Expected files: ratings.csv, movies.csv, tags.csv in the data/ directory')
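
Before uncommenting the loaders, it can help to confirm that the expected files are actually in place. The following is a minimal sketch that assumes the three file names listed above; `find_missing_files` is a hypothetical helper for illustration and is not part of `src.data_inspection`.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = ("ratings.csv", "movies.csv", "tags.csv")

def find_missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]

missing = find_missing_files("../data/")
if missing:
    print(f"Missing files: {', '.join(missing)}")
else:
    print("All expected files found.")
```

An empty list means all three files were found and the loading lines can be uncommented.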

### 2.1 Ratings Dataset Inspection

In [None]:
# Inspect ratings dataset
# Uncomment once data is loaded
# print_dataset_summary(ratings, 'Ratings')
# ratings_stats = get_ratings_statistics(ratings)
# print("\nRatings Statistics:")
# for key, value in ratings_stats.items():
#     if key != 'rating_distribution':
#         print(f"  {key}: {value}")

### 2.2 Movies Dataset Inspection

In [None]:
# Inspect movies dataset
# Uncomment once data is loaded
# print_dataset_summary(movies, 'Movies')
# movies_stats = get_movies_statistics(movies)
# print("\nMovies Statistics:")
# for key, value in movies_stats.items():
#     if key != 'genre_distribution':
#         print(f"  {key}: {value}")

### 2.3 Tags Dataset Inspection

In [None]:
# Inspect tags dataset
# Uncomment once data is loaded
# print_dataset_summary(tags, 'Tags')
# tags_stats = get_tags_statistics(tags)
# print("\nTags Statistics:")
# for key, value in tags_stats.items():
#     if key != 'top_10_tags':
#         print(f"  {key}: {value}")

## 3. Exploratory Data Analysis (EDA)

### 3.1 Rating Distribution

In [None]:
# Visualize rating distribution
# Uncomment once data is loaded
# plt.figure(figsize=(10, 6))
# ratings['rating'].value_counts().sort_index().plot(kind='bar', color='steelblue')
# plt.title('Distribution of Movie Ratings')
# plt.xlabel('Rating')
# plt.ylabel('Count')
# plt.xticks(rotation=0)
# plt.tight_layout()
# plt.show()

### 3.2 Genre Distribution

In [None]:
# Visualize genre distribution
# Uncomment once data is loaded
# genres = movies['genres'].str.split('|').explode()
# plt.figure(figsize=(12, 6))
# genres.value_counts().plot(kind='bar', color='steelblue')
# plt.title('Distribution of Movie Genres')
# plt.xlabel('Genre')
# plt.ylabel('Count')
# plt.xticks(rotation=45, ha='right')
# plt.tight_layout()
# plt.show()

### 3.3 User Activity Analysis

In [None]:
# Analyze user activity
# Uncomment once data is loaded
# user_activity = ratings.groupby('userId').size()
# print("User Activity Statistics:")
# print(f"  Min ratings per user: {user_activity.min()}")
# print(f"  Max ratings per user: {user_activity.max()}")
# print(f"  Mean ratings per user: {user_activity.mean():.2f}")
# print(f"  Median ratings per user: {user_activity.median():.2f}")

# plt.figure(figsize=(10, 6))
# plt.hist(user_activity, bins=50, color='steelblue', edgecolor='black')
# plt.title('Distribution of Ratings per User')
# plt.xlabel('Number of Ratings')
# plt.ylabel('Number of Users')
# plt.tight_layout()
# plt.show()

### 3.4 Movie Popularity Analysis

In [None]:
# Analyze movie popularity
# Uncomment once data is loaded
# movie_popularity = ratings.groupby('movieId').agg({
#     'rating': ['count', 'mean']
# }).reset_index()
# movie_popularity.columns = ['movieId', 'rating_count', 'rating_mean']
# movie_popularity = movie_popularity.merge(movies[['movieId', 'title']], on='movieId')

# print("Top 10 Most Rated Movies:")
# print(movie_popularity.nlargest(10, 'rating_count')[['title', 'rating_count', 'rating_mean']])

## 4. Data Preprocessing

### 4.1 Create User-Item Matrix

In [None]:
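# Create user-item rating matrix
# Uncomment once data is loaded
# Note: fill_value=0 marks unrated (user, movie) pairs. It is not a real rating,
# so exclude these cells when averaging or computing similarities.
# user_item_matrix = ratings.pivot_table(
#     index='userId',
#     columns='movieId',
#     values='rating',
#     fill_value=0
# )
# n_users, n_items = user_item_matrix.shape
# # Sparsity = share of user-movie cells with no rating (assumes one rating per pair)
# sparsity = 1 - len(ratings) / (n_users * n_items)
# print(f"User-Item Matrix Shape: {user_item_matrix.shape}")
# print(f"Sparsity: {sparsity * 100:.2f}%")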
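
To make the sparsity figure concrete, here is a small worked example in plain Python. The rating triples are made up for illustration and are not MovieLens data, and the calculation assumes each (user, movie) pair appears at most once.

```python
# Toy (userId, movieId, rating) triples; illustrative values only.
toy_ratings = [
    (1, 10, 4.0),
    (1, 20, 3.5),
    (2, 10, 5.0),
    (3, 30, 2.0),
]

n_users = len({user for user, _, _ in toy_ratings})
n_items = len({movie for _, movie, _ in toy_ratings})
n_cells = n_users * n_items

# Sparsity = share of user-movie cells that have no rating.
sparsity = 1 - len(toy_ratings) / n_cells

print(f"Matrix shape: ({n_users}, {n_items})")
print(f"Sparsity: {sparsity * 100:.2f}%")
```

Here 4 of the 9 cells are filled, so the sparsity is about 55.56%. Real rating data is typically far sparser than this toy case.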

### 4.2 Train-Test Split

In [None]:
# Split data for modeling
# Uncomment once data is loaded
# from sklearn.model_selection import train_test_split

# train_data, test_data = train_test_split(
#     ratings,
#     test_size=0.2,
#     random_state=42
# )
# print(f"Training set size: {len(train_data)}")
# print(f"Test set size: {len(test_data)}")
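
A purely random split can place all of a user's ratings in the test set, leaving no training history for that user. One alternative, sketched below, holds out a fraction of each user's ratings instead. This is not what the cell above does: `per_user_split` is a hypothetical helper, and the toy rows are invented for illustration.

```python
import random
from collections import defaultdict

def per_user_split(rows, test_fraction=0.2, seed=42):
    """Split (userId, movieId, rating) rows so every user keeps at least one training row."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user_rows in by_user.values():
        shuffled = user_rows[:]
        rng.shuffle(shuffled)
        # Never hold out all of a user's ratings.
        n_test = min(int(len(shuffled) * test_fraction), len(shuffled) - 1)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test

# Three users with ten ratings each, plus one user with a single rating.
toy_rows = [(u, m, 3.0) for u in range(1, 4) for m in range(10)]
toy_rows.append((4, 0, 5.0))

train_rows, test_rows = per_user_split(toy_rows)
print(f"Training set size: {len(train_rows)}")
print(f"Test set size: {len(test_rows)}")
```

With this split, every user in the test set also appears in the training set, so a per-user predictor never needs a fallback for an unseen user.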

## 5. Initial Modeling Setup

### 5.1 Baseline Model: Mean Rating Predictor

In [None]:
# Baseline model: predict the global mean training rating for every test row
# Uncomment once data is loaded
# from sklearn.metrics import mean_squared_error, mean_absolute_error

# global_mean = train_data['rating'].mean()
# baseline_predictions = np.full(len(test_data), global_mean)

# rmse = np.sqrt(mean_squared_error(test_data['rating'], baseline_predictions))
# mae = mean_absolute_error(test_data['rating'], baseline_predictions)

# print("Baseline Model (Global Mean) Performance:")
# print(f"  RMSE: {rmse:.4f}")
# print(f"  MAE: {mae:.4f}")
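
For reference, the two metrics can be written out by hand. The sketch below uses plain Python and made-up ratings to show what RMSE and MAE compute for a global-mean baseline. The helper functions `rmse` and `mae` are illustrative stand-ins, not the scikit-learn functions used above.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: average absolute difference."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy values for illustration only.
train_ratings = [4.0, 3.0, 5.0, 2.0]
test_ratings = [3.0, 5.0]

# The baseline predicts the training mean for every test row.
global_mean = sum(train_ratings) / len(train_ratings)
predictions = [global_mean] * len(test_ratings)

print(f"RMSE: {rmse(test_ratings, predictions):.4f}")
print(f"MAE: {mae(test_ratings, predictions):.4f}")
```

The errors here are 0.5 and 1.5, giving an MAE of 1.0 and an RMSE of about 1.118. RMSE is larger because squaring weights the bigger error more heavily.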

### 5.2 User-Based Mean Predictor

In [None]:
# User-based mean predictor
# Uncomment once data is loaded
# user_means = train_data.groupby('userId')['rating'].mean()
# user_predictions = test_data['userId'].map(user_means).fillna(global_mean)

# rmse_user = np.sqrt(mean_squared_error(test_data['rating'], user_predictions))
# mae_user = mean_absolute_error(test_data['rating'], user_predictions)

# print("User Mean Predictor Performance:")
# print(f"  RMSE: {rmse_user:.4f}")
# print(f"  MAE: {mae_user:.4f}")

### 5.3 Item-Based Mean Predictor

In [None]:
# Item-based mean predictor
# Uncomment once data is loaded
# item_means = train_data.groupby('movieId')['rating'].mean()
# item_predictions = test_data['movieId'].map(item_means).fillna(global_mean)

# rmse_item = np.sqrt(mean_squared_error(test_data['rating'], item_predictions))
# mae_item = mean_absolute_error(test_data['rating'], item_predictions)

# print("Item Mean Predictor Performance:")
# print(f"  RMSE: {rmse_item:.4f}")
# print(f"  MAE: {mae_item:.4f}")
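
The user and item mean predictors share one pattern: look up a per-group mean and fall back to the global mean for anything unseen in training. A minimal plain-Python sketch of that pattern follows. `fit_group_means` is a hypothetical helper and the rows are invented for illustration.

```python
from collections import defaultdict

def fit_group_means(rows, key_index):
    """Average rating per key (userId if key_index=0, movieId if key_index=1)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[2]
        counts[row[key_index]] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Toy (userId, movieId, rating) rows; illustrative values only.
train_rows = [(1, 10, 4.0), (1, 20, 2.0), (2, 10, 5.0)]
test_rows = [(2, 20, 3.0), (1, 30, 4.0)]

global_mean = sum(rating for _, _, rating in train_rows) / len(train_rows)
item_means = fit_group_means(train_rows, key_index=1)

# Movie 30 never appears in training, so its prediction falls back to the global mean.
item_predictions = [item_means.get(movie, global_mean) for _, movie, _ in test_rows]
print(item_predictions)
```

Passing `key_index=0` instead gives the user-mean variant with the same fallback.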

## 6. Next Steps

After this initial exploration, the following steps are planned:

1. **Collaborative Filtering**: Implement user-based and item-based collaborative filtering
2. **Matrix Factorization**: Apply SVD and ALS for latent factor models
3. **Deep Learning**: Explore neural collaborative filtering approaches
4. **Hybrid Models**: Combine content-based and collaborative filtering
5. **Evaluation**: Implement comprehensive evaluation metrics (NDCG, precision@k, recall@k)
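
As a preview of the ranking metrics named in step 5, the sketch below computes precision@k, recall@k, and NDCG@k for a single user in plain Python. It assumes binary relevance (an item is either relevant or not) and uses precision@k divided by k. The function names and the toy lists are illustrative, not part of this project's code.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """DCG of the top-k list divided by the DCG of an ideal ordering (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy movie IDs: a ranked recommendation list and the set the user actually liked.
recommended = [10, 20, 30, 40, 50]
relevant = {10, 30, 60}

print(f"precision@3: {precision_at_k(recommended, relevant, 3):.4f}")
print(f"recall@3: {recall_at_k(recommended, relevant, 3):.4f}")
print(f"NDCG@3: {ndcg_at_k(recommended, relevant, 3):.4f}")
```

In practice these scores are computed per user and then averaged across users.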

In [None]:
print("Day 1 ML Exploration Complete!")
print("Next: Add MovieLens dataset and run the analysis.")