# Data Preprocessing and Exploratory Data Analysis

This notebook loads the movie and ratings datasets with our custom data loader functions, then explores them to understand their structure, missing values, and basic statistics.

We inspect the movie metadata and the user ratings, then plot the distribution of ratings.

In [1]:
import sys
import os
import pandas as pd
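import matplotlib.pyplot as plt
# display() is injected automatically in Jupyter; importing it keeps the cell runnable elsewhere
from IPython.display import display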

# Add the src directory to the system path to import custom modules
sys.path.insert(0, os.path.abspath('../src'))

from data_loader import load_movies, load_ratings

# Load the movies and ratings datasets
movies = load_movies('../data/movies.csv')
ratings = load_ratings('../data/ratings.csv')

# Display the first few rows of the movies dataset
print('Movies Data:')
display(movies.head())

# Display the first few rows of the ratings dataset
print('Ratings Data:')
display(ratings.head())

FileNotFoundError: [Errno 2] No such file or directory: '../data/movies.csv'
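
The error above means that `'../data/movies.csv'` did not resolve to an existing file from the notebook's working directory. The sketch below is one possible way to fail earlier with a more informative message. `require_file` is a hypothetical helper, not part of `data_loader`, and it assumes nothing about the loaders beyond the fact that they take a path.

```python
from pathlib import Path


def require_file(path):
    """Return the resolved path, or raise a descriptive error if it is missing.

    Hypothetical helper for illustration; not part of the project's data_loader module.
    """
    resolved = Path(path).resolve()
    if not resolved.is_file():
        # Include the working directory, since relative paths depend on it
        raise FileNotFoundError(
            f"Expected data file at {resolved} (working directory: {Path.cwd()})"
        )
    return resolved
```

If the loaders accept a string path, a call could look like `load_movies(str(require_file('../data/movies.csv')))`. Whether they also accept `Path` objects is not known from this notebook.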

In [None]:
# Display basic information and summary statistics
print('Movies Info:')
movies.info()

print('\nMovies Summary Statistics:')
display(movies.describe(include='all'))

print('\nRatings Info:')
ratings.info()

print('\nRatings Summary Statistics:')
display(ratings.describe())
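
The introduction mentions missing values, which `info()` reports only indirectly through non-null counts. A minimal sketch of a more direct check follows. It uses a small made-up table, because the real column names are not visible in this notebook; `userId`, `movieId`, and `rating` are assumptions, and the cleaning step shown is only one possible choice.

```python
import pandas as pd

# Hypothetical stand-in for the ratings table; the real column names may differ
sample_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 10, 10, 20, 30],
    "rating": [4.0, 4.0, None, 3.5, 5.0],
})

# Count missing values in each column
missing_per_column = sample_ratings.isna().sum()

# Count rows that exactly repeat an earlier row
duplicate_rows = int(sample_ratings.duplicated().sum())

# One possible cleaning step: drop exact duplicates, then rows without a rating
cleaned = sample_ratings.drop_duplicates().dropna(subset=["rating"])

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Rows after cleaning:", len(cleaned))
```

The same three calls (`isna().sum()`, `duplicated().sum()`, `drop_duplicates().dropna(...)`) could be applied to `movies` and `ratings` once they load successfully.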

In [None]:
# Visualize the distribution of ratings
plt.figure(figsize=(8, 5))
plt.hist(ratings['rating'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of User Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()
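
If the ratings take only a handful of discrete values (an assumption; the rating scale is not shown here), a 20-bin histogram can leave some bins empty. Counting each distinct value is a possible alternative view. The sketch below uses made-up ratings rather than the real `ratings['rating']` column.

```python
import pandas as pd

# Hypothetical ratings for illustration; not taken from the real dataset
sample_ratings = pd.Series([4.0, 3.5, 4.0, 5.0, 3.5, 4.0], name="rating")

# Count how often each distinct rating occurs, ordered by rating value
rating_counts = sample_ratings.value_counts().sort_index()

print(rating_counts)
```

The resulting series can be drawn as a bar chart with `rating_counts.plot(kind='bar')`, which gives one bar per distinct rating value.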

## Conclusion

In this notebook, we loaded the movie and ratings datasets, inspected their structure, and performed initial exploratory analysis. This helps us understand the data before moving on to model development in the next notebook.