# Movie Recommendation System - Data Exploration

This notebook explores the movie dataset and user ratings to understand the data characteristics before building the neuro-fuzzy recommendation model.

## Setup

Import necessary libraries and set up the environment.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set plot style
sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

# Project paths (assumes the notebook runs from a direct subdirectory of the project root)
PROJECT_DIR = Path().resolve().parent
RAW_DATA_DIR = PROJECT_DIR / 'data' / 'raw'

print(f'Project directory: {PROJECT_DIR}')
print(f'Raw data directory: {RAW_DATA_DIR}')

## Load Data

Load the movie and ratings data from the raw data directory.

In [None]:
# Function to load data
def load_data():
    try:
        movies_path = RAW_DATA_DIR / 'movies.csv'
        ratings_path = RAW_DATA_DIR / 'ratings.csv'
        
        if not movies_path.exists() or not ratings_path.exists():
            print("Raw data files not found. Please download the dataset first.")
            return None, None
            
        movies_df = pd.read_csv(movies_path)
        ratings_df = pd.read_csv(ratings_path)
        
        print(f"Loaded {len(movies_df)} movies and {len(ratings_df)} ratings")
        return movies_df, ratings_df
        
    except Exception as e:
        print(f"Error loading data: {e}")
        return None, None

# Load the data
movies_df, ratings_df = load_data()

# Display the first few rows of each dataframe
if movies_df is not None and ratings_df is not None:
    print("\nMovies dataframe:")
    display(movies_df.head())
    
    print("\nRatings dataframe:")
    display(ratings_df.head())

## Data Overview

Examine the basic characteristics of the datasets.

In [None]:
if movies_df is not None and ratings_df is not None:
    # Movies dataframe info
    print("\nMovies dataframe info:")
    movies_df.info()  # info() prints its summary directly and returns None
    print("\nMovies dataframe description:")
    print(movies_df.describe(include='all'))
    
    # Ratings dataframe info
    print("\nRatings dataframe info:")
    ratings_df.info()  # info() prints its summary directly and returns None
    print("\nRatings dataframe description:")
    print(ratings_df.describe())
    
    # Check for missing values
    print("\nMissing values in movies dataframe:")
    print(movies_df.isnull().sum())
    
    print("\nMissing values in ratings dataframe:")
    print(ratings_df.isnull().sum())

## Movie Analysis

Analyze the movie dataset to understand the distribution of genres and release years.

In [None]:
if movies_df is not None:
    # Extract the release year from a trailing "(YYYY)", tolerating trailing whitespace
    movies_df['year'] = movies_df['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
    movies_df['clean_title'] = movies_df['title'].str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)
    
    # Convert year to numeric
    movies_df['year'] = pd.to_numeric(movies_df['year'], errors='coerce')
    
    # Plot distribution of movie release years
    plt.figure(figsize=(14, 6))
    sns.histplot(movies_df['year'].dropna(), bins=50, kde=True)
    plt.title('Distribution of Movie Release Years')
    plt.xlabel('Year')
    plt.ylabel('Count')
    plt.show()
    
    # Analyze genres
    # Split the genres string into a list
    movies_df['genres_list'] = movies_df['genres'].str.split('|')
    
    # Count genre occurrences
    all_genres = []
    for genres in movies_df['genres_list']:
        if isinstance(genres, list):
            all_genres.extend(genres)
    
    genre_counts = pd.Series(all_genres).value_counts()
    
    # Plot top genres
    plt.figure(figsize=(14, 8))
    sns.barplot(x=genre_counts.index[:15], y=genre_counts.values[:15])
    plt.title('Top 15 Movie Genres')
    plt.xlabel('Genre')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
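
To make the title parsing in the cell above concrete, here is a minimal pandas-free sketch of the same trailing "(YYYY)" extraction on a few made-up titles. The sample titles, `YEAR_PATTERN`, and `split_title` are illustrative names chosen for this example; they are not part of the dataset or the project code.

```python
import re

# Invented titles in a "Title (Year)" format; not taken from the dataset.
sample_titles = [
    "Movie A (1995)",
    "Movie B (Director's Cut) (2003) ",  # extra parentheses and trailing space
    "Untitled Project",                  # no year at all
]

# A four-digit year in parentheses at the very end, allowing trailing whitespace.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def split_title(title):
    """Return (clean_title, year); year is None when no trailing (YYYY) exists."""
    match = YEAR_PATTERN.search(title)
    if match is None:
        return title.strip(), None
    return title[:match.start()].strip(), int(match.group(1))


parsed = [split_title(title) for title in sample_titles]
for clean_title, year in parsed:
    print(f"{clean_title!r}: {year}")
```

Titles without a trailing year yield `None` here, which corresponds to the missing values left in the `year` column for titles that do not match the pattern.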

## Ratings Analysis

Analyze the ratings dataset to understand user behavior and rating patterns.

In [None]:
if ratings_df is not None:
    # Distribution of ratings
    plt.figure(figsize=(10, 6))
    sns.countplot(x='rating', data=ratings_df)
    plt.title('Distribution of Ratings')
    plt.xlabel('Rating')
    plt.ylabel('Count')
    plt.show()
    
    # Calculate rating statistics per user
    user_ratings = ratings_df.groupby('userId')['rating'].agg(['count', 'mean', 'std'])
    
    # Plot distribution of number of ratings per user
    plt.figure(figsize=(12, 6))
    # Bin on a log scale so the bars keep a constant width on the log axis
    sns.histplot(user_ratings['count'], bins=50, kde=True, log_scale=True)
    plt.title('Distribution of Number of Ratings per User')
    plt.xlabel('Number of Ratings (log scale)')
    plt.ylabel('Number of Users')
    plt.show()
    
    # Plot distribution of average rating per user
    plt.figure(figsize=(12, 6))
    sns.histplot(user_ratings['mean'], bins=50, kde=True)
    plt.title('Distribution of Average Rating per User')
    plt.xlabel('Average Rating')
    plt.ylabel('Count')
    plt.show()
    
    # Calculate rating statistics per movie
    movie_ratings = ratings_df.groupby('movieId')['rating'].agg(['count', 'mean', 'std'])
    
    # Plot distribution of number of ratings per movie
    plt.figure(figsize=(12, 6))
    # Bin on a log scale so the bars keep a constant width on the log axis
    sns.histplot(movie_ratings['count'], bins=50, kde=True, log_scale=True)
    plt.title('Distribution of Number of Ratings per Movie')
    plt.xlabel('Number of Ratings (log scale)')
    plt.ylabel('Number of Movies')
    plt.show()
    
    # Plot distribution of average rating per movie
    plt.figure(figsize=(12, 6))
    sns.histplot(movie_ratings['mean'], bins=50, kde=True)
    plt.title('Distribution of Average Rating per Movie')
    plt.xlabel('Average Rating')
    plt.ylabel('Count')
    plt.show()
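
The `agg(['count', 'mean', 'std'])` calls in the cell above pack three statistics into one line. The sketch below recomputes them in plain Python on a tiny invented ratings list, mainly to show one detail: pandas computes `std` as the sample standard deviation (`ddof=1`), so a user with a single rating gets a missing value. `ratings_demo` and `user_stats_demo` are hypothetical names used only here.

```python
from collections import defaultdict
from statistics import mean, stdev

# Invented (userId, movieId, rating) triples; not taken from the dataset.
ratings_demo = [
    (1, 10, 4.0),
    (1, 20, 5.0),
    (1, 30, 3.0),
    (2, 10, 2.0),  # user 2 has rated only one movie
]

# Group ratings by user.
ratings_by_user = defaultdict(list)
for user_id, _movie_id, rating in ratings_demo:
    ratings_by_user[user_id].append(rating)

# Count, mean, and sample standard deviation per user.
# The sample standard deviation is undefined for one observation, so use NaN.
user_stats_demo = {
    user_id: {
        "count": len(values),
        "mean": mean(values),
        "std": stdev(values) if len(values) > 1 else float("nan"),
    }
    for user_id, values in ratings_by_user.items()
}

for user_id, stats in user_stats_demo.items():
    print(user_id, stats)
```

Users with very few ratings therefore carry little or no information about rating spread, which is worth keeping in mind when using `std` as a user feature.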

## User-Item Interaction Analysis

Analyze the interaction between users and movies.

In [None]:
if ratings_df is not None and movies_df is not None:
    # Create a pivot table of user-item ratings
    user_item_matrix = ratings_df.pivot(index='userId', columns='movieId', values='rating')
    
    # Calculate sparsity
    sparsity = 1.0 - len(ratings_df) / (user_item_matrix.shape[0] * user_item_matrix.shape[1])
    print(f"Sparsity of the user-item matrix: {sparsity:.4f} ({sparsity*100:.2f}%)")
    
    # Display a small sample of the user-item matrix
    print("\nSample of the user-item matrix (first 5 users, first 5 movies):")
    display(user_item_matrix.iloc[:5, :5])
    
    # Visualize the sparsity pattern
    plt.figure(figsize=(10, 8))
    plt.spy(user_item_matrix.iloc[:100, :100], precision=0.1, markersize=2)
    plt.title('Sparsity Pattern (First 100 Users and Movies)')
    plt.xlabel('Movie (column position)')
    plt.ylabel('User (row position)')
    plt.show()
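
The sparsity printed in the cell above is one minus the fraction of user-movie pairs that actually have a rating. As a small worked example, the sketch below computes it for four invented ratings without building the dense pivot table, which may matter for memory on larger datasets. All names in the block are illustrative.

```python
# Invented (userId, movieId, rating) triples; not taken from the dataset.
ratings_demo = [
    (1, 10, 4.0),
    (1, 20, 5.0),
    (2, 10, 2.0),
    (3, 30, 3.5),
]

# Distinct users and movies define the shape of the user-item matrix.
n_users = len({user_id for user_id, _, _ in ratings_demo})
n_movies = len({movie_id for _, movie_id, _ in ratings_demo})

# Each distinct (user, movie) pair fills one cell of that matrix.
n_observed = len({(user_id, movie_id) for user_id, movie_id, _ in ratings_demo})

sparsity_demo = 1.0 - n_observed / (n_users * n_movies)
print(f"{n_observed} of {n_users * n_movies} cells filled, sparsity = {sparsity_demo:.4f}")
```

In pandas, the same two dimensions could be obtained with `ratings_df['userId'].nunique()` and `ratings_df['movieId'].nunique()`.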

## Temporal Analysis

Analyze the temporal patterns in the ratings data.

In [None]:
if ratings_df is not None:
    # Convert timestamp to datetime
    ratings_df['timestamp'] = pd.to_datetime(ratings_df['timestamp'], unit='s')
    
    # Extract time components (timestamps are in UTC, so 'hour' is not the user's local hour)
    ratings_df['year'] = ratings_df['timestamp'].dt.year
    ratings_df['month'] = ratings_df['timestamp'].dt.month
    ratings_df['day'] = ratings_df['timestamp'].dt.day
    ratings_df['hour'] = ratings_df['timestamp'].dt.hour
    
    # Plot ratings over time (by year)
    yearly_ratings = ratings_df.groupby('year').size()
    plt.figure(figsize=(12, 6))
    yearly_ratings.plot(kind='bar')
    plt.title('Number of Ratings by Year')
    plt.xlabel('Year')
    plt.ylabel('Number of Ratings')
    plt.tight_layout()
    plt.show()
    
    # Plot average rating by year
    yearly_avg_ratings = ratings_df.groupby('year')['rating'].mean()
    plt.figure(figsize=(12, 6))
    yearly_avg_ratings.plot(kind='line', marker='o')
    plt.title('Average Rating by Year')
    plt.xlabel('Year')
    plt.ylabel('Average Rating')
    plt.grid(True)
    plt.tight_layout()
    plt.show()
    
    # Plot ratings by hour of day
    hourly_ratings = ratings_df.groupby('hour').size()
    plt.figure(figsize=(12, 6))
    hourly_ratings.plot(kind='bar')
    plt.title('Number of Ratings by Hour of Day')
    plt.xlabel('Hour')
    plt.ylabel('Number of Ratings')
    plt.tight_layout()
    plt.show()
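
The conversion in the cell above treats each timestamp as seconds since the Unix epoch, which yields UTC times. The following standard-library sketch shows the same conversion and the hour and year counts on three invented timestamps; `timestamps_demo` and `to_utc` are hypothetical names for this example.

```python
from collections import Counter
from datetime import datetime, timezone

# Invented Unix timestamps, in seconds since 1970-01-01 00:00:00 UTC.
timestamps_demo = [
    946684800,          # 2000-01-01 00:00:00 UTC
    946684800 + 3600,   # one hour later
    946684800 + 86400,  # one day later
]


def to_utc(ts):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


converted = [to_utc(ts) for ts in timestamps_demo]

# Count ratings per UTC hour of day and per calendar year.
ratings_per_hour = Counter(dt.hour for dt in converted)
ratings_per_year = Counter(dt.year for dt in converted)

print(dict(ratings_per_hour))
print(dict(ratings_per_year))
```

Because the hours are in UTC, an hourly pattern mixes users from different time zones unless local offsets are known.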

## Genre Preference Analysis

Analyze user preferences for different movie genres.

In [None]:
if ratings_df is not None and movies_df is not None:
    # Merge ratings with movies to get genre information
    ratings_with_genres = pd.merge(ratings_df, movies_df, on='movieId')
    
    # Create a dataframe with one row per rating-genre combination.
    # explode() is vectorized and much faster than iterating over every rating row.
    genre_ratings_df = (
        ratings_with_genres[['userId', 'movieId', 'rating', 'genres']]
        .assign(genre=lambda df: df['genres'].str.split('|'))
        .explode('genre')
        .drop(columns='genres')
        .reset_index(drop=True)
    )
    
    # Calculate average rating by genre
    genre_avg_ratings = genre_ratings_df.groupby('genre')['rating'].agg(['mean', 'count'])
    genre_avg_ratings = genre_avg_ratings.sort_values('count', ascending=False)
    
    # Plot average rating by genre (for top genres)
    top_genres = genre_avg_ratings.head(15).index
    
    plt.figure(figsize=(14, 8))
    sns.barplot(x=top_genres, y=genre_avg_ratings.loc[top_genres, 'mean'])
    plt.title('Average Rating by Genre (Top 15 Genres)')
    plt.xlabel('Genre')
    plt.ylabel('Average Rating')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
    # Plot number of ratings by genre
    plt.figure(figsize=(14, 8))
    sns.barplot(x=top_genres, y=genre_avg_ratings.loc[top_genres, 'count'])
    plt.title('Number of Ratings by Genre (Top 15 Genres)')
    plt.xlabel('Genre')
    plt.ylabel('Number of Ratings')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
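
One consequence of the per-genre aggregation above is that a rating for a multi-genre movie counts once for every genre it belongs to. The sketch below works this through on three invented ratings; `movie_genres_demo`, `genre_means_demo`, and `genre_counts_demo` are illustrative names only.

```python
from collections import defaultdict

# Invented data: pipe-separated genres per movie and (userId, movieId, rating) triples.
movie_genres_demo = {10: "Action|Comedy", 20: "Comedy"}
ratings_demo = [
    (1, 10, 4.0),
    (2, 10, 2.0),
    (1, 20, 5.0),
]

# Accumulate the rating sum and rating count for every genre of the rated movie.
rating_sums = defaultdict(float)
genre_counts_demo = defaultdict(int)
for _user_id, movie_id, rating in ratings_demo:
    for genre in movie_genres_demo[movie_id].split("|"):
        rating_sums[genre] += rating
        genre_counts_demo[genre] += 1

genre_means_demo = {
    genre: rating_sums[genre] / genre_counts_demo[genre]
    for genre in rating_sums
}

print(genre_means_demo)
print(dict(genre_counts_demo))
```

The genre counts add up to five even though there are only three ratings, so the "Number of Ratings by Genre" totals should not be summed across genres.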

## Correlation Analysis

Analyze correlations between movie features and ratings.

In [None]:
if ratings_df is not None and movies_df is not None:
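    # Merge ratings with movies. The temporal analysis adds a rating 'year' column to
    # ratings_df, so rename it here to keep 'year' unambiguous as the movie release year
    merged_df = pd.merge(
        ratings_df.rename(columns={'year': 'rating_year'}), movies_df, on='movieId'
    )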
    
    # Calculate correlation between movie year and rating
    if 'year' in merged_df.columns:
        year_rating_corr = merged_df['year'].corr(merged_df['rating'])
        print(f"Correlation between movie year and rating: {year_rating_corr:.4f}")
        
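        # Plot year vs. rating on a random sample, capped at the number of rows available
        plot_df = merged_df.dropna(subset=['year'])
        plot_df = plot_df.sample(n=min(10000, len(plot_df)), random_state=42)
        plt.figure(figsize=(12, 6))
        sns.boxplot(x=plot_df['year'].astype(int), y=plot_df['rating'])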
        plt.title('Rating Distribution by Movie Year')
        plt.xlabel('Year')
        plt.ylabel('Rating')
        plt.xticks(rotation=90)
        plt.tight_layout()
        plt.show()
    
    # Calculate correlation between number of ratings and average rating
    movie_stats = merged_df.groupby('movieId').agg({
        'rating': ['count', 'mean']
    })
    movie_stats.columns = ['rating_count', 'rating_mean']
    
    count_rating_corr = movie_stats['rating_count'].corr(movie_stats['rating_mean'])
    print(f"Correlation between number of ratings and average rating: {count_rating_corr:.4f}")
    
    # Plot number of ratings vs. average rating
    plt.figure(figsize=(10, 6))
    plt.scatter(movie_stats['rating_count'], movie_stats['rating_mean'], alpha=0.5)
    plt.title('Average Rating vs. Number of Ratings')
    plt.xlabel('Number of Ratings')
    plt.ylabel('Average Rating')
    plt.xscale('log')
    plt.grid(True)
    plt.tight_layout()
    plt.show()
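
`Series.corr` computes the Pearson correlation by default, which measures linear association only. Since the scatter plot above uses a log-scaled x-axis, it may be informative to compare the correlation on raw counts with the correlation on log-transformed counts. The sketch below does this by hand on four invented movies; the `pearson` helper and the demo values are assumptions made for illustration, not results from the dataset.

```python
from math import log10, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented per-movie statistics with a heavy-tailed rating count.
rating_counts_demo = [1, 10, 100, 1000]
rating_means_demo = [2.0, 3.0, 3.5, 4.0]

r_raw = pearson(rating_counts_demo, rating_means_demo)
r_log = pearson([log10(c) for c in rating_counts_demo], rating_means_demo)

print(f"raw counts: {r_raw:.3f}, log10 counts: {r_log:.3f}")
```

In this toy case the relationship is closer to linear in the logarithm of the count, so the raw-count coefficient understates it. Whether that holds for the real data would need to be checked.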

## Conclusion

This section summarizes the key findings from the data exploration and discusses their implications for the recommendation model.

Based on the exploratory data analysis, we can draw the following conclusions:

1. **Data Characteristics**:
   - The dataset contains a large number of movies and ratings
   - The user-item matrix is highly sparse, which is typical for recommendation systems
   - There is a wide range of movie release years and genres

2. **Rating Patterns**:
   - Ratings are not uniformly distributed across the rating scale; some values occur more often than others
   - Users have varying levels of activity, with some users rating many movies and others rating only a few
   - Movies also have varying levels of popularity, with some movies receiving many ratings and others receiving few

3. **Temporal Patterns**:
   - Rating activity varies over time, with certain years showing more activity than others
   - There may be hourly patterns in rating behavior

4. **Genre Preferences**:
   - Different genres have different average ratings
   - Some genres are more popular (receive more ratings) than others

5. **Correlations**:
   - There may be correlations between movie features (e.g., year, genre) and ratings
   - There may be correlations between popularity (number of ratings) and average rating

**Implications for the Neuro-Fuzzy Recommendation Model**:

1. **Feature Engineering**:
   - We should include movie features such as genre and release year
   - We may want to include user features such as activity level and rating patterns
   - Temporal features may also be useful

2. **Model Design**:
   - The model should handle the high sparsity of the user-item matrix
   - We may want to use a hybrid approach that combines collaborative filtering with content-based filtering
   - The fuzzy component can help capture the uncertainty in user preferences

3. **Evaluation**:
   - We should evaluate the model on different user segments (e.g., active vs. inactive users)
   - We should evaluate the model on different movie segments (e.g., popular vs. unpopular movies)
   - We should consider both accuracy metrics (e.g., RMSE) and ranking metrics (e.g., precision and recall over the top-k recommendations)
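
As a rough illustration of the metrics named in the evaluation list, the sketch below implements RMSE and precision/recall over the top-k recommendations in plain Python. The function names, the cutoff `k`, and the sample values are hypothetical; the project may define relevance and ranking differently.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision and recall of the first k recommended items.

    ranked_items: item ids ordered from most to least recommended.
    relevant_items: set of item ids the user actually liked.
    """
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# Invented example values.
error = rmse([4.0, 3.0], [3.0, 5.0])
precision, recall = precision_recall_at_k([10, 20, 30, 40], {20, 40, 50}, k=2)

print(f"RMSE: {error:.4f}")
print(f"precision@2: {precision:.2f}, recall@2: {recall:.2f}")
```

Computing these per user or per movie segment, rather than only globally, would support the segment-level evaluation described above.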

Next steps include preprocessing the data, engineering features, and building the neuro-fuzzy recommendation model.