# Data preprocessing
In this notebook, we prepare the data that will be used to train the recommendation system.

In [1]:
import pandas as pd

## Step 1: Load data
We load three CSV files from The Movies Dataset: ratings, links, and movie metadata.

In [2]:
movies_df = pd.read_csv('The_Movies_Dataset/movies_metadata.csv', low_memory=False, dtype={'id': 'str'})
ratings_df = pd.read_csv('The_Movies_Dataset/ratings_small.csv')
links_df = pd.read_csv('The_Movies_Dataset/links_small.csv')

## Step 2: Convert IDs
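In the metadata, each movie is identified by `id`, which corresponds to `tmdbId` in the links table. The column was loaded as strings, so we convert it to numeric format to make it possible to merge the tables. With `errors='coerce'`, any value that cannot be parsed as a number becomes `NaN` instead of raising an error.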

In [3]:
movies_df['id'] = pd.to_numeric(movies_df['id'], errors='coerce')
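
To illustrate what the conversion does, here is a minimal sketch on a toy series. The values are made up for illustration and are not taken from the dataset; the point is only that unparseable entries end up as `NaN` and can be counted afterwards.

```python
import pandas as pd

# Toy stand-in for the `id` column (hypothetical values, not real data).
ids = pd.Series(["862", "8844", "not-an-id"])

# Same call as in the cell above: invalid entries are coerced to NaN.
numeric_ids = pd.to_numeric(ids, errors="coerce")

# Count how many IDs could not be parsed.
n_invalid = numeric_ids.isna().sum()
print(numeric_ids.tolist())
print("unparseable ids:", n_invalid)
```

Running a similar count on `movies_df['id']` is one way to check how many metadata rows will be unusable as merge keys.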

## Step 3: Merge tables
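Here we merge the ratings data with the movie titles from the metadata, using the links table to map `movieId` to `tmdbId`. Both merges are left joins, so every rating is kept, and a rating without a matching movie gets a missing title. This gives us a dataset where each row includes `userId`, `movieId`, `rating`, and `title`.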

In [4]:
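ratings_meta = ratings_df.merge(links_df[['movieId', 'tmdbId']], on='movieId', how='left')
# Drop metadata rows with a NaN id first: pandas matches NaN keys to each other in a merge.
movies_lookup = movies_df.dropna(subset=['id'])[['id', 'title']]
ratings_meta = ratings_meta.merge(movies_lookup, left_on='tmdbId', right_on='id', how='left')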
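
The two-step left join can be sketched on small toy tables. All frames and values below are hypothetical and only mirror the column names used in this notebook; the sketch shows that the row count of the ratings is preserved and that unmatched ratings end up with a missing title.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating": [4.0, 3.5, 5.0],
})
links = pd.DataFrame({
    "movieId": [10, 20, 30],
    "tmdbId": [100.0, 200.0, float("nan")],  # movie 30 has no TMDB ID
})
movies = pd.DataFrame({
    "id": [100.0, 999.0],
    "title": ["Movie A", "Movie Z"],
})

# Step 1: attach tmdbId to every rating. Step 2: look up the title by tmdbId.
merged = ratings.merge(links, on="movieId", how="left")
merged = merged.merge(movies, left_on="tmdbId", right_on="id", how="left")

# Left joins keep all ratings; ratings without a match have NaN in `title`.
n_missing = merged["title"].isna().sum()
print(merged[["userId", "movieId", "rating", "title"]])
print("ratings without a title:", n_missing)
```

Comparing `len(ratings_meta)` with `len(ratings_df)` after the merge is a quick way to spot rows multiplied by duplicate IDs in the metadata.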

## Step 4: Clean NaN and duplicates
We drop exact duplicate rows, remove columns that are no longer needed (`timestamp`, `tmdbId`, and `id`), and remove rows where the movie title is missing, since they cannot be used in recommendations.

In [5]:
ratings_meta = ratings_meta.drop_duplicates()
ratings_meta = ratings_meta.drop(columns=["timestamp", "tmdbId", "id"])
ratings_meta = ratings_meta.dropna(subset=["title"])
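
As a rough illustration of the three cleaning steps, the sketch below applies them in the same order to a toy frame. The data is invented for the example: it contains one exact duplicate row and one row without a title.

```python
import pandas as pd

# Hypothetical merged frame: rows 0 and 1 are exact duplicates, row 3 has no title.
merged = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 10, 20, 30],
    "rating": [4.0, 4.0, 3.0, 5.0],
    "timestamp": [111, 111, 222, 333],
    "tmdbId": [100.0, 100.0, 200.0, float("nan")],
    "id": [100.0, 100.0, 200.0, float("nan")],
    "title": ["Movie A", "Movie A", "Movie B", None],
})

cleaned = (
    merged.drop_duplicates()                       # remove exact duplicate rows
    .drop(columns=["timestamp", "tmdbId", "id"])   # keep only the columns needed later
    .dropna(subset=["title"])                      # remove ratings without a title
)
print(cleaned)
```

Of the four toy rows, two remain: one copy of the duplicated rating and the rating for "Movie B".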

## Step 5: Save processed dataset
We save the result as `ratings_meta_small.csv`, which will be used in the next step when training the model.

In [6]:
ratings_meta.to_csv('ratings_meta_small.csv', index=False)
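
A small sketch of why `index=False` is used when saving: without it, pandas also writes the row index, which shows up as an extra unnamed column when the file is read back. The example uses an in-memory buffer and made-up rows instead of the real file.

```python
import io

import pandas as pd

# Hypothetical two-row frame with the same columns as the processed dataset.
df = pd.DataFrame({
    "userId": [1, 2],
    "movieId": [10, 20],
    "rating": [4.0, 3.0],
    "title": ["Movie A", "Movie B"],
})

# Write to an in-memory buffer without the index, then read it back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded frame has exactly the original columns and row count.
print(list(reloaded.columns))
print(reloaded.shape)
```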