# Movie Data Processing and Preprocessing

## Goal

The goal is to prepare and preprocess movie-related data to enable downstream tasks such as genre prediction or movie recommendation. To accomplish this, we aim to construct a large, clean, and well-structured dataset by combining and refining two publicly available data sources. This dataset will serve as the foundation for later model training and evaluation using natural language processing techniques.


## Datasets Used

We use two publicly available datasets from Kaggle:

1. **Millions of Movies** (`akshaypawar7/millions-of-movies`):
   This dataset contains various metadata about movies, including essential fields like `id`, `title`, `release_date`, `overview`, `genres`, and `recommendations`.
   Additionally, it includes other attributes such as `budget`, `revenue`, `runtime`, `popularity`, and `vote_average`, which provide more context but are not critical for our current project goals.

2. **Wikipedia Movie Plots** (`jrobischon/wikipedia-movie-plots`): This dataset includes `Title`, `Release Year`, and detailed plot summaries (`Plot`) for over 34,000 movies.

## Data Cleaning and Merging

- Titles and release years were cleaned and normalized.
- The two datasets were merged using `title` and `release_year` as keys to enrich the primary dataset with additional plot data.
- If a plot was missing, we used the overview as a fallback to ensure textual input for each movie.

## Additional Cleaning Steps

- Removed duplicates using `id` and `title`
- Replaced invalid or empty values such as `"nan"` or `"None"` in the `plot` field with the movie's `overview`
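
The placeholder replacement can be illustrated on toy data. This is a minimal sketch rather than the notebook's exact cell: the `movies` frame and its values are invented for illustration, and only the column names `plot` and `overview` come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values); only the column names mirror the real dataset.
movies = pd.DataFrame({
    "plot": ["A detailed plot.", "nan", "None", ""],
    "overview": ["Short A.", "Short B.", "Short C.", "Short D."],
})

# Map placeholder strings to a real missing value, then fall back to the overview.
placeholders = ["nan", "None", ""]
cleaned = movies["plot"].replace(placeholders, pd.NA)
movies["plot"] = cleaned.fillna(movies["overview"])

print(movies["plot"].tolist())
```

Rows whose `overview` is also missing would still end up without text, so a later `dropna` on `overview` remains necessary.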

## Test Dataset Construction for Recommendations

Based on the `recommendations` field, we created two test datasets:

- One containing roughly 5,000 movies (5,003 rows in the run recorded in this notebook)
- One containing 10,000 movies

For each movie, we ensured:

- A valid plot or overview
- That the referenced recommendations exist within the dataset

## Exported Files

All cleaned and structured datasets were exported as JSON files for further use:

- `../data/all_data.json`
  Contains the full merged dataset with all available movies, including metadata (such as IDs, titles, release dates, and other attributes), plots, overviews, genres, and recommendations.

- `../data/recommendation_with_plot.json`
  Contains the columns needed for recommendation tasks (`id`, `title`, `overview`, `recommendations`, `plot`, `genres`) for every movie that has an `id`, a `title`, and an `overview`.

- `../data/recommendation_with_plot_test_5000.json`
  A test dataset subset with 5,000 movies, filtered to ensure valid plots and recommendations for evaluation purposes.

- `../data/recommendation_with_plot_test_10000.json`
  Similar to the 5,000-movie subset but larger, containing 10,000 movies for testing and validation.

- `../data/merged_wiki_plot.json`
  Contains only those movie entries for which detailed Wikipedia plot summaries were available and successfully merged, ensuring high-quality, extended plot descriptions.


In [38]:
import kagglehub
import shutil
import os
import pandas as pd
from tqdm import tqdm
tqdm.pandas()


## Download of the Data

For this project, we utilize two large publicly available datasets from Kaggle. The first dataset, **Millions of Movies** (`akshaypawar7/millions-of-movies`), contains extensive metadata on over 500,000 movies, including details such as movie IDs, titles, release dates, overviews, genres, and recommendations.

The second dataset, **Wikipedia Movie Plots** (`jrobischon/wikipedia-movie-plots`), provides detailed plot summaries for about 34,000 movies. This dataset will be used to enrich the primary dataset with more comprehensive plot descriptions.


In [39]:
def download_kaggle_dataset(dataset_name):
    dataset_path = kagglehub.dataset_download(dataset_name)
    target_path = "data"
    os.makedirs(target_path, exist_ok=True)

    for filename in os.listdir(dataset_path):
        full_file_name = os.path.join(dataset_path, filename)
        if os.path.isfile(full_file_name):
            shutil.copy(full_file_name, target_path)


In [40]:
download_kaggle_dataset("akshaypawar7/millions-of-movies")



In [41]:
df = pd.read_csv('data/movies.csv')

In [42]:
download_kaggle_dataset("jrobischon/wikipedia-movie-plots")
df_wiki = pd.read_csv('data/wiki_movie_plots_deduped.csv')



## Merging of the Datasets

To enrich the movie data with detailed plot descriptions, we merged the **Millions of Movies** dataset with the **Wikipedia Movie Plots** dataset. The Wikipedia dataset does not include recommendation information and is not used as a source of genre labels, both of which are crucial for our NLP tasks, so we use it only to supplement the plot data.

### Preprocessing Steps Before Merging

- **Normalization of Title and Year**:
  To ensure accurate matching, we cleaned and normalized the `title` and `release_year` fields in both datasets. This involved:
  - Converting titles to lowercase
  - Stripping whitespace
  - Extracting the release year from the full date string

- **Column Renaming for Consistency**:
  We renamed relevant columns in the Wikipedia dataset (`Title` → `title`, `Release Year` → `release_year`, `Plot` → `plot_wiki`) to match the naming conventions in the main dataset.
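
The normalization steps above can be sketched on two one-row frames. The helper `normalize_keys` is a hypothetical name introduced here, and it takes the first four characters of the date string instead of splitting on `-` as the notebook does; treat it as an illustration of the idea, not the notebook's implementation.

```python
import pandas as pd

def normalize_keys(frame, title_col, year_col):
    """Return lowercase, stripped titles and four-character year strings."""
    out = frame.copy()
    out["title"] = out[title_col].astype(str).str.strip().str.lower()
    # Works for both "1979-05-25" (date string) and 1979 (integer year).
    out["release_year"] = out[year_col].astype(str).str.strip().str[:4]
    return out[["title", "release_year"]]

# Toy rows (hypothetical); column names follow the two source datasets.
tmdb = pd.DataFrame({"title": ["  Alien "], "release_date": ["1979-05-25"]})
wiki = pd.DataFrame({"Title": ["alien"], "Release Year": [1979]})

left = normalize_keys(tmdb, "title", "release_date")
right = normalize_keys(wiki, "Title", "Release Year")
print(left.values.tolist(), right.values.tolist())
```

After normalization both frames carry identical key values, which is what the merge on `title` and `release_year` relies on.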

### Merge Strategy

Since movie titles are not always unique, especially in the case of remakes, we included `release_year` as an additional key when merging the datasets. This increases the likelihood of accurate matches between the two sources.

The merge was performed as a **left join** on `title` and `release_year`. This ensures that we retain all records from the primary dataset (`Millions of Movies`), and only add additional plot information if a match is found in the Wikipedia dataset.
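
One way to check how many rows actually found a Wikipedia match is the `indicator` option of `DataFrame.merge`. The sketch below uses invented toy frames; the notebook itself does not use `indicator`, so this is an optional diagnostic rather than part of the original pipeline.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows).
primary = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["alien", "heat", "solaris"],
    "release_year": ["1979", "1995", "2002"],
})
wiki = pd.DataFrame({
    "title": ["alien", "solaris"],
    "release_year": ["1979", "1972"],  # the 1972 film, not the 2002 remake
    "plot_wiki": ["Plot of Alien.", "Plot of Solaris (1972)."],
})

# Left join keeps every primary row; `_merge` records whether a match was found.
merged = primary.merge(wiki, on=["title", "release_year"], how="left", indicator=True)
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Here the 2002 "solaris" row stays unmatched because the year differs, which is the behavior the composite key is meant to produce.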


### Final Dataset

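We removed duplicates using the `id` column and then the `title` column to improve data quality and consistency. Because the second pass considers `title` alone, only the first of several movies sharing a normalized title (for example, an original and its remake) is kept. The resulting dataset, stored as `merged_wiki_plot`, contains only entries with non-missing values for: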
- `id`
- `title`
- `genres`
- `overview`
- `plot`

This final merged dataset includes **21,287** entries.
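
The effect of deduplicating on `title` alone can be seen on a small example. The frame below is invented for illustration; the second variant, which also keys on `release_year`, is an alternative one might consider and is not what the notebook does.

```python
import pandas as pd

# Toy rows (hypothetical IDs): two films share a title but differ in year.
movies = pd.DataFrame({
    "id": [10, 11, 12],
    "title": ["dune", "dune", "heat"],
    "release_year": ["1984", "2021", "1995"],
})

# Title-only deduplication keeps the first "dune" and drops the second.
by_title = movies.drop_duplicates(subset="title", keep="first")

# Alternative (assumption, not used in the notebook): keep both versions.
by_title_year = movies.drop_duplicates(subset=["title", "release_year"], keep="first")

print(len(by_title), len(by_title_year))
```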


In [43]:
df_wiki.head(3)

| | Release Year | Title | Origin/Ethnicity | Director | Cast | Genre | Wiki Page | Plot |
|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Kansas Saloon Smashers | American | Unknown | | unknown | https://en.wikipedia.org/wiki/Kansas_Saloon_Sm... | A bartender is working at a saloon, serving dr... |
| 1 | 1901 | Love by the Light of the Moon | American | Unknown | | unknown | https://en.wikipedia.org/wiki/Love_by_the_Ligh... | The moon, painted with a smiling face hangs ov... |
| 2 | 1901 | The Martyred Presidents | American | Unknown | | unknown | https://en.wikipedia.org/wiki/The_Martyred_Pre... | The film, just over a minute long, is composed... |


In [44]:
df['release_year'] = df['release_date'].apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else None)

In [45]:
df['title'] = df['title'].astype(str).str.strip().str.lower()
df['release_year'] = df['release_year'].astype(str).str.strip()

# Align the Wikipedia column names with the primary dataset.
df_wiki.rename(
    columns={'Title': 'title', 'Release Year': 'release_year', 'Plot': 'plot_wiki'},
    inplace=True,
)

df_wiki['title'] = df_wiki['title'].astype(str).str.strip().str.lower()
df_wiki['release_year'] = df_wiki['release_year'].astype(str).str.strip()

plot_info = df_wiki[['title', 'release_year', 'plot_wiki']]

df = df.merge(plot_info, on=['title', 'release_year'], how='left', suffixes=('', '_wiki'))

if 'plot' in df.columns:
    df['plot'] = df['plot_wiki'].combine_first(df['plot'])
else:
    df['plot'] = df['plot_wiki']

df.drop(columns=['plot_wiki'], inplace=True)
df = df.drop_duplicates(subset='id', keep='first')
df = df.drop_duplicates(subset='title', keep='first')


In [46]:
merged_wiki_plot = df[['id', 'title', 'genres', 'overview', 'plot']].dropna(subset=['id', 'title','genres', 'overview', 'plot'])
merged_wiki_plot.shape

(21287, 5)

## Creation of Test Datasets

Since the full dataset is quite large and may result in long processing times for NLP tasks, we created two smaller test datasets containing approximately 5,000 and 10,000 movies.

To ensure that the recommendation functionality works properly, we applied the following criteria:

- Only movies with a valid `plot` are included.
- Each movie must contain `recommendations`, and all referenced movies must also be present within the dataset.
- No duplicate entries are allowed.
- At most 20 recommended movies per seed movie are pulled into the subset.

After creating the subsets, we ensured that all recommendation IDs within each dataset point to valid movie entries. We did this by filtering out any invalid or non-existent recommendation IDs. Additionally, we replaced missing or invalid `plot` values (e.g., `"nan"`, `"None"`, or empty strings) with the corresponding `overview` to ensure every entry has usable text.

These final cleaning steps guarantee that:

- The recommendation references are internally consistent.
- All entries contain valid textual descriptions.
- The datasets are ready for downstream NLP applications such as content-based recommendation or genre prediction.
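
Internal consistency of the recommendation references can be verified with a small check. The function name `dangling_recommendations` and the toy frame are hypothetical; the sketch only assumes the hyphen-separated ID format that the notebook's parsing code uses.

```python
import pandas as pd

def dangling_recommendations(frame):
    """Map each movie id to the recommended ids that are absent from the frame."""
    valid = set(frame["id"].astype(str))
    dangling = {}
    for movie_id, rec_str in zip(frame["id"], frame["recommendations"]):
        # Skip empty fragments so that "" counts as "no recommendations".
        missing = [r for r in str(rec_str).split("-") if r and r not in valid]
        if missing:
            dangling[movie_id] = missing
    return dangling

# Toy subset (hypothetical): movie 2 points at id 99, which is not present.
subset = pd.DataFrame({
    "id": [1, 2, 3],
    "recommendations": ["2-3", "1-99", ""],
})
problems = dangling_recommendations(subset)
print(problems)
```

An empty result indicates that every referenced ID exists in the subset. Note that a stringified missing value such as `"nan"` would be reported as dangling by this check.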



In [47]:
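def build_expanded_dataset(test_df, target_size=5000):
    """Grow a subset from seed movies that have a plot, pulling in up to 20
    recommended movies per seed. The size check runs once per seed, so the
    result can exceed target_size by up to 20 rows."""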
    test_df = test_df.dropna(subset=['id', 'recommendations']).copy()

    def parse_recommendations(val):
        if isinstance(val, str):
            return [int(x) for x in val.split('-') if x.isdigit()]
        elif isinstance(val, int):
            return [val]
        elif isinstance(val, list):
            return val
        else:
            return []

    test_df['parsed_recommendations'] = test_df['recommendations'].apply(parse_recommendations)
    test_df = test_df.drop_duplicates(subset='id')

    id_map = test_df.set_index('id').to_dict(orient='index')

    result_ids = set()
    visited_ids = set()

    start_ids = test_df[test_df['plot'].notna()]['id'].tolist()
    idx = 0

    while len(result_ids) < target_size and idx < len(start_ids):
        current_id = start_ids[idx]
        idx += 1

        if current_id in visited_ids:
            continue

        visited_ids.add(current_id)
        result_ids.add(current_id)

        recs = id_map.get(current_id, {}).get('parsed_recommendations', [])

        added = 0
        for rec_id in recs:
            if rec_id in id_map and rec_id not in result_ids:
                result_ids.add(rec_id)
                added += 1
                if added >= 20:
                    break

    result_test_df = test_df[test_df['id'].isin(result_ids)].drop_duplicates(subset='id').copy()

    result_test_df['recommendations'] = result_test_df['parsed_recommendations'].apply(
        lambda recs: '-'.join(str(r) for r in recs)
    )

    result_test_df.drop(columns=['parsed_recommendations'], inplace=True)

    return result_test_df


In [48]:
df_recommendation_with_plot_test_5000 = build_expanded_dataset(df, target_size=5000)
print(df_recommendation_with_plot_test_5000.shape)

df_recommendation_with_plot_test_10000 = build_expanded_dataset(df, target_size=10000)
print(df_recommendation_with_plot_test_10000.shape)


(5003, 22)
(10000, 22)


In [49]:
valid_ids_5000 = set(df_recommendation_with_plot_test_5000['id'].astype(str))

valid_ids_10000 = set(df_recommendation_with_plot_test_10000['id'].astype(str))

def filter_recommendations(rec_str, valid_ids):
    rec_list = rec_str.split('-')
    filtered = [r for r in rec_list if r in valid_ids]
    return '-'.join(filtered)

df_recommendation_with_plot_test_5000['recommendations'] = (
    df_recommendation_with_plot_test_5000['recommendations'].astype(str)
    .apply(filter_recommendations, args=(valid_ids_5000,))
)

df_recommendation_with_plot_test_10000['recommendations'] = (
    df_recommendation_with_plot_test_10000['recommendations'].astype(str)
    .apply(filter_recommendations, args=(valid_ids_10000,))
)


In [50]:
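# Count movies left without any recommendation after filtering. A notebook cell
# displays only its last expression, so the output refers to the 10,000-movie set.
(df_recommendation_with_plot_test_5000['recommendations'].isna() | (df_recommendation_with_plot_test_5000['recommendations'] == '')).sum()
(df_recommendation_with_plot_test_10000['recommendations'].isna() | (df_recommendation_with_plot_test_10000['recommendations'] == '')).sum()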

206

In [51]:
df['plot'] = df['plot'].astype(str)
df['plot'] = df['plot'].replace(["nan", "None", ""], pd.NA).fillna(df['overview'])
df['recommendations'] = df['recommendations'].astype(str)

df_recommendation_with_plot_test_5000['plot'] = df_recommendation_with_plot_test_5000['plot'].astype(str)
df_recommendation_with_plot_test_5000['recommendations'] = df_recommendation_with_plot_test_5000['recommendations'].astype(str)

df_recommendation_with_plot_test_5000['plot'] = df_recommendation_with_plot_test_5000['plot'].replace(["nan", "None", ""], pd.NA).fillna(df_recommendation_with_plot_test_5000['overview'])

print(df_recommendation_with_plot_test_5000.isna().sum())
print(df_recommendation_with_plot_test_5000.shape)

df_recommendation_with_plot_test_10000['plot'] = df_recommendation_with_plot_test_10000['plot'].astype(str)
df_recommendation_with_plot_test_10000['recommendations'] = df_recommendation_with_plot_test_10000['recommendations'].astype(str)

df_recommendation_with_plot_test_10000['plot'] = df_recommendation_with_plot_test_10000['plot'].replace(["nan", "None", ""], pd.NA).fillna(df_recommendation_with_plot_test_10000['overview'])

id                        0
title                     0
genres                    9
original_language         0
overview                 16
popularity                0
production_companies    146
release_date              0
budget                    0
revenue                   0
runtime                   1
status                    0
tagline                 896
vote_average              0
vote_count                0
credits                  17
keywords                360
poster_path               5
backdrop_path            59
recommendations           0
release_year              0
plot                     15
dtype: int64
(5003, 22)


In [52]:
df_recommendation_with_plot = df.dropna(subset=['id', 'title', 'overview'])[['id', 'title', 'overview', 'recommendations', 'plot', 'genres']]

df_recommendation_with_plot_test_5000 = df_recommendation_with_plot_test_5000.dropna(subset=['id', 'title', 'overview'])[['id', 'title', 'overview', 'recommendations', 'plot', 'genres']]

df_recommendation_with_plot_test_10000 = df_recommendation_with_plot_test_10000.dropna(subset=['id', 'title', 'overview'])[['id', 'title', 'overview', 'recommendations', 'plot', 'genres']]


## Exporting Final Datasets

After cleaning, merging, and filtering the data, all final datasets were exported as JSON files for further use in downstream NLP tasks and recommendation systems.

The following files were generated:

- `../data/all_data.json`: Contains the full cleaned dataset with metadata, plots, and recommendations. This includes some entries without full plot information but is useful for broader analyses.
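- `../data/recommendation_with_plot.json`: The full dataset reduced to the columns `id`, `title`, `overview`, `recommendations`, `plot`, and `genres`. Rows missing `id`, `title`, or `overview` are dropped; since `plot` falls back to `overview`, every remaining row has text. Rows are not filtered on `recommendations`, which is cast to a string beforehand, so movies without recommendations may remain in this file.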
- `../data/recommendation_with_plot_test_5000.json`: A compact dataset with approximately 5,000 entries, cleaned and filtered to maintain valid recommendation links and non-empty plots. Ideal for quick experimentation.
- `../data/recommendation_with_plot_test_10000.json`: A larger test dataset with about 10,000 entries, following the same quality criteria as the 5k set.
- `../data/merged_wiki_plot.json`: Includes only the movies that were successfully merged with the Wikipedia dataset and contain enriched plot information.

These exports ensure the datasets are readily available in a structured format and can be easily loaded for modeling, evaluation, and other analysis steps.
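
Because the files are written with `orient='records', lines=True`, they are JSON Lines files and need `lines=True` when read back. The round trip below runs on an in-memory toy frame instead of the real files; passing `dtype` for `recommendations` is a precaution added here (an assumption about downstream needs) so that the hyphen-separated IDs stay strings.

```python
import io
import pandas as pd

# Toy frame (hypothetical rows) with the same export settings as the notebook.
sample = pd.DataFrame({
    "id": [1, 2],
    "title": ["toy story", "heat"],
    "recommendations": ["2-3", "1-3"],
})

# With no path given, to_json returns the JSON Lines text as a string.
buffer = io.StringIO(sample.to_json(orient="records", lines=True))

# Read it back; one JSON object per line requires lines=True.
loaded = pd.read_json(buffer, lines=True, dtype={"recommendations": str})
print(loaded)
```

For the exported files, the same call would take a path such as `'../data/all_data.json'` in place of the buffer.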


In [53]:
df.to_json('../data/all_data.json', orient='records', lines=True)
df_recommendation_with_plot.to_json('../data/recommendation_with_plot.json', orient='records', lines=True)
df_recommendation_with_plot_test_5000.to_json('../data/recommendation_with_plot_test_5000.json', orient='records', lines=True)
df_recommendation_with_plot_test_10000.to_json('../data/recommendation_with_plot_test_10000.json', orient='records', lines=True)

merged_wiki_plot.to_json('../data/merged_wiki_plot.json', orient='records', lines=True)