## Description

#### Purpose: Calculate a production company score for each movie in the dataset, following a methodology similar to that of *The Numbers*.

#### Input: `3.3.2b_Merged_Data_Star_Scores.csv`

#### Outputs: `3.3.3_Merged_Data_Prod_Scores.csv`, `prod_company_df.csv` (raw per-company scores)

Within our dataset, we found the 100 movies with the highest domestic revenues for each year from 2010 to 2023. We assigned “contribution points” to those movies based on their place in that ranking: the movie with the highest domestic revenue in a given year is assigned 100 points, the movie with the 2nd highest is assigned 99 points, and so on down to 1 point for the 100th. Movies outside the top 100 contribute 0 points. Then, we obtained the set of all unique production company IDs in the entire dataset. We calculated a “production company score” for each production company for each year using the same scoring scheme that The Numbers used: for each year from 2012 onward, we summed the “contribution points” of all the movies the company had worked on in that year and in the two years before. Finally, each movie released after 2012 receives the total and the average of its production companies' scores for the year before its release.
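
As a small illustration of the scoring scheme described above, the following sketch computes contribution points from a rank and a three-year company score from per-year point totals. The function names and the toy ranks are hypothetical and are not part of the notebook's pipeline.

```python
def contribution_points(rank, top_n=100):
    """Points for a movie ranked `rank` (1 = highest revenue) in its release year."""
    return top_n + 1 - rank if 1 <= rank <= top_n else 0


def company_score(points_by_year, year):
    """Sum a company's contribution points for `year` and the two years before it."""
    return sum(points_by_year.get(y, 0) for y in (year - 2, year - 1, year))


# Hypothetical company: ranks of its movies within each year's revenue ranking
ranks_by_year = {2010: [1, 40], 2011: [100], 2012: [150], 2013: [3]}

# Total contribution points the company earned in each year
points_by_year = {
    year: sum(contribution_points(r) for r in ranks)
    for year, ranks in ranks_by_year.items()
}

print(points_by_year)
print(company_score(points_by_year, 2012))
print(company_score(points_by_year, 2013))
```

A movie ranked 150th earns nothing, so the 2012 score here comes entirely from the two earlier years.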

In [None]:
from tmdbv3api import TMDb
from tmdbv3api import Movie
from tmdbv3api.exceptions import TMDbException
import random
import pandas as pd
import csv
import numpy as np
from math import exp
import ast
tmdb = TMDb()
tmdb.api_key = ' '  # API key redacted

In [None]:
# Initialize csv file path
csv_file_path= '../3.3.2 Calculate Star Score/Outputs/3.3.2b_Merged_Data_Star_Scores.csv'
df = pd.read_csv(csv_file_path)

In [None]:
df['Movie Contribution to Director and Production Scores'] = 0

# Calculates the contribution of a movie to production company scores based on whether or not it was a top 100 domestic grossing movie of that year

release_years = range(2010, 2024)
# iterates through the years
for year in release_years:
    df_year = df[df["Release Year"] == year]
    print(df_year.head())
    # sorts by revenue (descending)
    df_year = df_year.sort_values(by=['Merged Revenue'], ascending=False)
    # iterate through the top 100 movies (or fewer, if the year has fewer than 100 movies) and give them points based on the ranking (100 to the top grossing, 99 to the 2nd top, ..., 1 to the 100th)
    for i in range(min(100, len(df_year))):
        imdb_id_to_update = df_year['IMDB ID'].iloc[i]
        # record the contribution in the dataframe
        df.loc[df['IMDB ID'] == imdb_id_to_update, 'Movie Contribution to Director and Production Scores'] = 100 - i
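
The ranking loop above can also be expressed without explicit iteration. The sketch below is an alternative formulation, not the notebook's code: it ranks revenue within each release year with `groupby(...).rank(...)` and converts the ranks to points. The helper name `add_contribution_points`, the `top_n` parameter, and the toy data are hypothetical. Unlike the loop, it scores every release year present in the frame and always gives 0 points to movies with missing revenue.

```python
import pandas as pd


def add_contribution_points(frame, top_n=100):
    """Return a copy of `frame` with rank-based contribution points per release year."""
    out = frame.copy()
    # Rank revenue within each release year (1 = highest). Ties are broken by
    # row order, and rows with missing revenue get a NaN rank.
    rank = out.groupby("Release Year")["Merged Revenue"].rank(
        method="first", ascending=False
    )
    # top_n points for rank 1 down to 1 point for rank top_n; 0 otherwise
    points = (top_n + 1 - rank).clip(lower=0).fillna(0).astype(int)
    out["Movie Contribution to Director and Production Scores"] = points
    return out


# Toy data; top_n=2 keeps the example small
toy = pd.DataFrame({
    "IMDB ID": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "Release Year": [2010, 2010, 2010, 2011, 2011],
    "Merged Revenue": [50.0, 200.0, 10.0, 75.0, None],
})
scored = add_contribution_points(toy, top_n=2)
print(scored)
```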

In [None]:
# Create a DataFrame with unique production company IDs
prod_ids = []

for index, row in df.iterrows():
    prod_ids_str = row['Production Company ID']
    
    # Check for NaN values and skip them
    if pd.isna(prod_ids_str):
        continue

    # Safely evaluate the content of 'prod_ids' if it's not NaN
    prod_ids += ast.literal_eval(prod_ids_str)

unique_prod_ids = list(set(prod_ids))

# Create a list of years from 2010 to 2023
years = [str(year) for year in range(2010, 2024)]

# Initialize the data with zeros
data = {f'star_{year}': [0] * len(unique_prod_ids) for year in years}
data['ids'] = unique_prod_ids

# Create the 'prod_df' DataFrame
prod_df = pd.DataFrame(data)

# Reorder columns with 'ids' as the first column
prod_df = prod_df[['ids'] + [col for col in prod_df.columns if col != 'ids']]

# Print the first few rows of the 'prod_df' DataFrame for debugging
print(prod_df.head())

In [None]:
# Iterate through df to calculate production company scores
for index, row in df.iterrows():
    prod_ids_str = row['Production Company ID']
    # Check for NaN values and skip them
    if pd.isna(prod_ids_str):
        continue
    # Safely evaluate the content of 'prod_ids' if it's not NaN
    prod_ids = ast.literal_eval(prod_ids_str)
    release_year = row['Release Year']
    score_contribution = row['Movie Contribution to Director and Production Scores']
    # Iterate through each prod_id in the prod_ids array
    for prod_id in prod_ids:
        # Find the corresponding row in prod_df
        prod_df_row = prod_df[prod_df['ids'] == prod_id]
        if not prod_df_row.empty:
            prod_score = prod_df_row[f'star_{release_year}'].values[0]
            # Add the contribution from the movie to the total production company score for that company for that year
            if not pd.isna(score_contribution):
                prod_score += score_contribution
            # Assign the updated production company score to the corresponding 'star_yyyy' column
            prod_df.loc[prod_df['ids'] == prod_id, f'star_{release_year}'] = prod_score


output_prods = prod_df.copy(deep=True)

# For each year from 2012 onward, sums the year's own score with the scores of the two previous years
for prod_id in prod_df['ids']:
    for column in prod_df.columns:
        if column.startswith("star_"):
            release_year = int(column.split("_")[1])
            if release_year > 2011:
                previous_year = release_year - 1
                year_before_previous = release_year - 2
                prod_df_row = prod_df[prod_df['ids'] == prod_id]
                if not prod_df_row.empty:
                    prod_score = prod_df_row[column].values[0]
                    # Calculate score from the previous year
                    previous_year_score = prod_df_row[f'star_{previous_year}'].values[0]
                    prod_score += previous_year_score
                    # Calculate score from the year before the previous year
                    year_before_previous_score = prod_df_row[f'star_{year_before_previous}'].values[0]
                    prod_score += year_before_previous_score
                    # Assign the three-year production company score to the corresponding 'star_yyyy' column
                    output_prods.loc[output_prods['ids'] == prod_id, column] = prod_score
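
The three-year window above can also be computed for all companies at once. This is a sketch under the assumption that the `star_<year>` columns are in chronological order; the toy table and variable names are hypothetical. It transposes the table so that years run down the rows, applies a rolling sum of width 3, and keeps the single-year values for the first two years, as the loop does.

```python
import pandas as pd

# Toy per-year contribution totals for two hypothetical companies
yearly = pd.DataFrame({
    "ids": [1, 2],
    "star_2010": [100, 0],
    "star_2011": [50, 10],
    "star_2012": [0, 20],
    "star_2013": [30, 0],
})

year_cols = [c for c in yearly.columns if c.startswith("star_")]
single_year = yearly[year_cols]

# After transposing, each row is a year; the first two years have no
# complete three-year window, so their rolling sum is NaN
windowed = single_year.T.rolling(window=3).sum().T
# Keep the single-year values where the window is incomplete
windowed = windowed.fillna(single_year).astype(int)

windowed_scores = pd.concat([yearly[["ids"]], windowed], axis=1)
print(windowed_scores)
```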

In [None]:
# Save Production Company Raw Data
output_prods.to_csv('prod_company_df.csv', index=False)

In [None]:
# Create an output dataframe
df_output = df[df['Release Year'] > 2012].copy()  # Copy the rows for movies released after 2012
df_output['Total Production Company Score'] = 0.0  # initialize the total production company score column
df_output['Avg Production Company Score'] = 0.0  # initialize the average production company score column

for index, row in df_output.iterrows():
    row_prod_info = row['Production Company ID']  # Extract prod_ids array
    if not pd.isna(row_prod_info):  # Check for NaN
        row_prod_info = ast.literal_eval(row_prod_info)
    else:
        row_prod_info = []  # Set to an empty list if NaN
    # computes sum of production company scores
    for prod_id in row_prod_info:
        release_year = row['Release Year']
        previous_year = release_year - 1
        prod_info_row = output_prods[output_prods['ids'] == prod_id]  # Get the row containing the queried prod id
        score_column_name = f'star_{previous_year}'  # get the score column for the previous year
        prod_score = prod_info_row[score_column_name].values[0]  # Use the previous year's production company score
        if pd.notna(prod_score):
            df_output.loc[df_output['IMDB ID'] == row['IMDB ID'], 'Total Production Company Score'] += prod_score
    # computes avg of production company scores
    if len(row_prod_info) > 0:
        mask = df_output['IMDB ID'] == row['IMDB ID']
        df_output.loc[mask, 'Avg Production Company Score'] = df_output.loc[mask, 'Total Production Company Score'] / len(row_prod_info)

# Save to csv
print(df_output.head())
df_output.to_csv('Outputs/3.3.3_Merged_Data_Prod_Scores.csv', index=False)
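
The last cell gives each movie the sum and the mean of its production companies' scores for the year before its release. Below is a minimal sketch of that lookup on plain dictionaries, with hypothetical company IDs and scores. Treating an unknown company or year as 0 is an assumption of this sketch, not the notebook's behavior.

```python
# Hypothetical three-year scores: company id -> {year: score}
scores_by_company = {
    10: {2012: 162, 2013: 99},
    20: {2012: 30, 2013: 30},
}


def movie_prod_scores(prod_ids, release_year, scores):
    """Return (total, average) of the companies' scores for the year before release."""
    if not prod_ids:
        return 0, 0.0
    # Assumption: a company or year missing from `scores` counts as 0
    total = sum(scores.get(pid, {}).get(release_year - 1, 0) for pid in prod_ids)
    return total, total / len(prod_ids)


print(movie_prod_scores([10, 20], 2013, scores_by_company))
print(movie_prod_scores([], 2014, scores_by_company))
```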