# Data: Hollywood Theatrical Market Synopsis 1995-2021

The original dataset and notebook can be found on [Kaggle](https://www.kaggle.com/mpwolke/holywood-theatrical-market-1995-2021). Notebook by Majid Bazrpach

## About this Dataset

Hollywood Theatrical Market Synopsis 1995 to 2021 | North American Domestic Movies Theatrical Market Synopsis 

Source: https://www.kaggle.com/majidbazrpach/hollywood-theatrical-market-synopsis-1995-to-2021/data

![](https://images7.alphacoders.com/116/thumb-350-1165584.jpg)

### Context

This Dataset contains the data of market analysis built on The Numbers unique categorization system, which uses 6 different criteria to identify a movie. All movies released since 1995 are categorized according to the following attributes: Creative type (factual, contemporary fiction, fantasy etc.), Source (book, play, original screenplay etc.), Genre (drama, horror, documentary etc.), MPAA rating, Production method (live action, digital animation etc.) and Distributor. In order to provide a fair comparison between movies released in different years, all rankings are based on ticket sales, which are calculated using average ticket prices announced by the MPAA in their annual state of the industry report.
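
As a rough illustration of that conversion, estimated ticket sales are a gross divided by the average ticket price for the same year. The function name `estimate_tickets` and all figures below are made up for the example and are not taken from the dataset.

```python
def estimate_tickets(gross_usd: float, avg_ticket_price_usd: float) -> float:
    """Estimate tickets sold from a gross and that year's average ticket price."""
    if avg_ticket_price_usd <= 0:
        raise ValueError("average ticket price must be positive")
    return gross_usd / avg_ticket_price_usd


# Hypothetical figures: the same gross in two years with different ticket prices.
tickets_early = estimate_tickets(100_000_000, 5.00)
tickets_late = estimate_tickets(100_000_000, 10.00)
print(tickets_early, tickets_late)
```

The same dollar gross corresponds to fewer tickets when prices are higher, which is why ranking by tickets makes years comparable.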

### Content

The Dataset contains various files illustrating statistics such as annual ticket sales, highest grossers each year since 1995, top grossing creative types, top grossing distributors, top grossing genres, top grossing MPAA ratings, top grossing sources, top grossing production methods and the number of wide releases each year by various distributors.

### Acknowledgements

The data was obtained from The Numbers website. Their theatrical market pages are based on the domestic theatrical market performance only. The domestic market is defined as the North American movie region (consisting of the United States, Canada, Puerto Rico and Guam). This data can be found from the website https://www.the-numbers.com/market/ with detailed analysis.

### Inspiration

2020 and 2021 have been rough years for the movie industry. Being a huge movie fan inspired me to share a dataset showing the growth of box office collections and ticket sales over time, and the decline from 2020 onward due to the Covid-19 pandemic, which may also say something indirectly about the quality of modern-day films. This Dataset can also be used to study which genres attract the largest audiences, and to encourage you to create an amazing genre-specific plot and take one step closer to becoming the next most successful director!

# Python packages

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import chart_studio
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import cufflinks as cf
from scipy import stats
from plotly import plot
%matplotlib inline

# init_notebook_mode(connected = True)
# cf.go_offline()

In [None]:
# Install the required Python package (run this first if `import chart_studio` fails)

!pip install chart_studio

# Data Input

## Importing Data

In [None]:
PopularCreativeTypes = pd.read_csv("PopularCreativeTypes.csv")
HighestGrossers = pd.read_csv("HighestGrossers.csv")
Data_Annual_Ticket = pd.read_csv("AnnualTicketSales.csv", thousands = ',')

## Cleaning and Tidying Data

In [None]:
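# Series.replace only matches whole cell values, so use .str.replace to strip commas.
# thousands=',' may already have parsed this column as numbers, hence the cast to str.
Data_Annual_Ticket["TICKETS SOLD"] = Data_Annual_Ticket["TICKETS SOLD"].astype(str).str.replace(',','', regex=False)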

Data_Annual_Ticket["TOTAL BOX OFFICE"] = Data_Annual_Ticket["TOTAL BOX OFFICE"].str.replace(',','', regex=False)
Data_Annual_Ticket["TOTAL BOX OFFICE"] = Data_Annual_Ticket["TOTAL BOX OFFICE"].str.replace('$','', regex=False)

Data_Annual_Ticket["TOTAL INFLATION ADJUSTED BOX OFFICE"] = Data_Annual_Ticket["TOTAL INFLATION ADJUSTED BOX OFFICE"].str.replace(',','', regex=False)
Data_Annual_Ticket["TOTAL INFLATION ADJUSTED BOX OFFICE"] = Data_Annual_Ticket["TOTAL INFLATION ADJUSTED BOX OFFICE"].str.replace('$','', regex=False)

Data_Annual_Ticket["AVERAGE TICKET PRICE"] = Data_Annual_Ticket["AVERAGE TICKET PRICE"].str.replace('$','', regex=False)

Data_Annual_Ticket = Data_Annual_Ticket.drop(labels = "Unnamed: 5",axis = 1)
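
The repeated `str.replace` calls above could be folded into one helper. The following is only a sketch under that assumption: `to_number` is a hypothetical name, and the toy values are invented to resemble the CSV columns rather than copied from them.

```python
import pandas as pd


def to_number(series: pd.Series) -> pd.Series:
    """Strip '$', ',' and '%' from a column and convert it to float."""
    cleaned = series.astype(str).str.replace(r"[$,%]", "", regex=True).str.strip()
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce").astype(float)


# Toy frame with made-up values in the same style as the CSV columns.
toy = pd.DataFrame({
    "TOTAL BOX OFFICE": ["$1,234,567", "$89,000"],
    "AVERAGE TICKET PRICE": ["$9.16", "$8.50"],
})
for col in toy.columns:
    toy[col] = to_number(toy[col])
print(toy.dtypes)
```

Doing the cleanup and the type conversion in one step also avoids leaving some cleaned columns as strings.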

In [None]:
Data_Annual_Ticket.head(5)

**Changing the data type (object to float)**

In [None]:
Data_Annual_Ticket['TICKETS SOLD'] = Data_Annual_Ticket['TICKETS SOLD'].astype(float)
Data_Annual_Ticket['TOTAL BOX OFFICE'] = Data_Annual_Ticket['TOTAL BOX OFFICE'].astype(float)

# Visualize

**Using a bar chart to illustrate the total box office each year**

In [None]:
px.bar(Data_Annual_Ticket
            ,x = 'YEAR'
            ,y = 'TOTAL BOX OFFICE'
            ,title = 'Total Box Office vs. Year')

**Estimating what the total box office would have been if the last two years had been normal years (*using linear regression*)**

In [None]:
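# This relies on the rows running from 2021 down to 1995 (27 rows, newest first).
x = list(range(0, 2020 - 1995))  # 0..24 stands for 1995..2019
y = list(Data_Annual_Ticket['TOTAL BOX OFFICE'])
y.reverse()  # oldest year first
y.pop()  # drop 2021
y.pop()  # drop 2020
# Fit a straight line to the pre-pandemic years only
slope, intercept, r, p, std_err = stats.linregress(x, y)
x1 = list(range(0, 2022 - 1995))  # 0..26 stands for 1995..2021
y1 = [slope * xi + intercept for xi in x1]
y1.reverse()  # back to newest first, matching the DataFrame row order
Data_Annual_Ticket['TOTAL BOX OFFICE WITHOUT COVID'] = y1
Data_Annual_Ticket["Diff"] = Data_Annual_Ticket['TOTAL BOX OFFICE WITHOUT COVID']-Data_Annual_Ticket['TOTAL BOX OFFICE']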
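
The cell above depends on row order and on popping exactly two values. A possible alternative, sketched here on synthetic data, selects the training rows by year and fits against the actual year values. It uses `np.polyfit` in place of `stats.linregress`, and the column names `BASELINE` and `SHORTFALL` are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

# Synthetic data (not from the dataset): a perfectly linear trend up to 2019,
# then two depressed years, listed newest first.
years = np.arange(2021, 1994, -1)
gross = 5e9 + 2e8 * (years - 1995)
gross[years == 2020] = 2e9
gross[years == 2021] = 4e9
df = pd.DataFrame({"YEAR": years, "TOTAL BOX OFFICE": gross})

# Fit only on pre-2020 rows, selected by year rather than by position.
train = df[df["YEAR"] < 2020]
slope, intercept = np.polyfit(train["YEAR"], train["TOTAL BOX OFFICE"], deg=1)

# Predict for every row; the result stays aligned with df whatever the row order.
df["BASELINE"] = slope * df["YEAR"] + intercept
df["SHORTFALL"] = df["BASELINE"] - df["TOTAL BOX OFFICE"]
print(df.loc[df["YEAR"] >= 2020, ["YEAR", "BASELINE", "SHORTFALL"]])
```

Because the fit and the prediction both use `YEAR` directly, no reversing is needed and extra or missing rows do not silently shift the result.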

**Illustrating the difference between the actual total box office and the estimate without Covid**

In [None]:
px.line(Data_Annual_Ticket
       ,x = 'YEAR'
       ,y = ["TOTAL BOX OFFICE","TOTAL BOX OFFICE WITHOUT COVID"]
       ,labels = {'YEAR' :"Years", "value": "Total Sale"}
       ,title = 'TOTAL BOX OFFICE vs TOTAL BOX OFFICE WITHOUT COVID')

**How much did Covid-19 affect the last two years?**

In [None]:
px.bar(Data_Annual_Ticket
       ,x = 'YEAR'
       ,y = "Diff"
       ,labels = {'YEAR' :"Year", "Diff": "Financial Loss"}
       ,title = 'Financial Loss (only the last two years are relevant)'
       ,barmode='group')


**By what percentage did Covid-19 reduce the total box office in the last two years?**

In [None]:
Data_Annual_Ticket["Percentage of Financial Loss"] = (Data_Annual_Ticket["TOTAL BOX OFFICE WITHOUT COVID"]-Data_Annual_Ticket["TOTAL BOX OFFICE"])/Data_Annual_Ticket["TOTAL BOX OFFICE WITHOUT COVID"]*100

px.bar(Data_Annual_Ticket
       , x = 'YEAR'
       , y = "Percentage of Financial Loss"
       ,labels = {'YEAR' :"Year", "Percentage of Financial Loss": "Percentage of Financial Loss %"}
       ,title = 'Financial Loss % (only the last two years are relevant)')

# Now Visualizing the Highest Grossers

**Cleaning and Tidying Data**

In [None]:
HighestGrossers["TOTAL IN 2019 DOLLARS"] = HighestGrossers["TOTAL IN 2019 DOLLARS"].str.replace(',','', regex=False)
HighestGrossers["TOTAL IN 2019 DOLLARS"] = HighestGrossers["TOTAL IN 2019 DOLLARS"].str.replace('$','', regex=False)

HighestGrossers["TICKETS SOLD"] = HighestGrossers["TICKETS SOLD"].str.replace(',','', regex=False)

HighestGrossers['TOTAL IN 2019 DOLLARS'] = HighestGrossers['TOTAL IN 2019 DOLLARS'].astype(float)
HighestGrossers['TICKETS SOLD'] = HighestGrossers['TICKETS SOLD'].astype(float)

In [None]:
HighestGrossers.head(5)

**Because of inflation, we use only the TOTAL IN 2019 DOLLARS column**

**We use pie charts to show each category's share of the total**

In [None]:
px.pie(HighestGrossers
       ,values = 'TOTAL IN 2019 DOLLARS' 
       ,names = 'DISTRIBUTOR'
       ,title = 'Share of Each Distributor in Total Gross (2019 Dollars)'
       ,color_discrete_sequence = px.colors.sequential.RdBu
       ,height = 600
       )

In [None]:
px.pie(HighestGrossers
       ,values = 'TOTAL IN 2019 DOLLARS' 
       ,names = 'MPAA RATING'
       ,title = 'Share of Each MPAA Rating in Total Gross (2019 Dollars)'
       ,color_discrete_sequence = px.colors.sequential.RdBu
       ,height = 600
       )

**Using a bar chart to show the total tickets sold for each distributor and genre**

In [None]:
df_g = HighestGrossers.groupby(by =['DISTRIBUTOR','GENRE'])['TICKETS SOLD'].sum()
df_g = df_g.reset_index()

px.bar(df_g
       ,x = 'DISTRIBUTOR'
       ,y = 'TICKETS SOLD'
       ,barmode='group'
       ,color = 'GENRE')

**Using a bar chart to show the number of films for each distributor and genre**

In [None]:
df_g = HighestGrossers.groupby(by =['DISTRIBUTOR','GENRE'])['TICKETS SOLD'].count()

df_g = df_g.reset_index()

px.bar(df_g
       ,x = 'DISTRIBUTOR'
       ,y = 'TICKETS SOLD'
       ,barmode='group'
       ,color = 'GENRE')
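
The two groupings above differ only in the aggregation: `sum` adds up tickets, while `count` counts the rows (films) in each group. As a sketch, both can be computed in one pass with named aggregation. The rows below are invented, and `total_tickets` and `n_films` are placeholder column names.

```python
import pandas as pd

# Made-up rows; distributor and genre names are placeholders.
toy = pd.DataFrame({
    "DISTRIBUTOR": ["A", "A", "A", "B"],
    "GENRE": ["Action", "Action", "Drama", "Action"],
    "TICKETS SOLD": [10.0, 20.0, 5.0, 7.0],
})

summary = (
    toy.groupby(["DISTRIBUTOR", "GENRE"])["TICKETS SOLD"]
       .agg(total_tickets="sum", n_films="count")
       .reset_index()
)
print(summary)
```

Giving the count its own column name avoids a chart axis labelled `TICKETS SOLD` that actually shows a number of films.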

**Doing the same for the MPAA rating**


In [None]:
df_g = HighestGrossers.groupby(by =['DISTRIBUTOR','MPAA RATING'])['TICKETS SOLD'].sum()

df_g = df_g.reset_index()

px.bar(df_g
       ,x = 'DISTRIBUTOR'
       ,y = 'TICKETS SOLD'
       ,barmode='group'
       ,color = 'MPAA RATING')

In [None]:
df_g = HighestGrossers.groupby(by =['DISTRIBUTOR','MPAA RATING'])['TICKETS SOLD'].count()

df_g = df_g.reset_index()

px.bar(df_g
       ,x = 'DISTRIBUTOR'
       ,y = 'TICKETS SOLD'
       ,barmode='group'
       ,color = 'MPAA RATING')

**Now visualizing the Popular Creative Types**

In [None]:
PopularCreativeTypes.head(5)

In [None]:
PopularCreativeTypes["TOTAL GROSS"] = PopularCreativeTypes["TOTAL GROSS"].str.replace(',','', regex=False)
PopularCreativeTypes["TOTAL GROSS"] = PopularCreativeTypes["TOTAL GROSS"].str.replace('$','', regex=False)

PopularCreativeTypes["AVERAGE GROSS"] = PopularCreativeTypes["AVERAGE GROSS"].str.replace(',','', regex=False)
PopularCreativeTypes["AVERAGE GROSS"] = PopularCreativeTypes["AVERAGE GROSS"].str.replace('$','', regex=False)

PopularCreativeTypes["MARKET SHARE"] = PopularCreativeTypes["MARKET SHARE"].str.replace('%','', regex=False)

PopularCreativeTypes["MOVIES"] = PopularCreativeTypes["MOVIES"].str.replace(',','', regex=False)

In [None]:
PopularCreativeTypes = PopularCreativeTypes.drop(index = 9,axis = 0)

In [None]:
PopularCreativeTypes["MOVIES"] = PopularCreativeTypes["MOVIES"].astype(float)
PopularCreativeTypes["TOTAL GROSS"] = PopularCreativeTypes["TOTAL GROSS"].astype(float)
PopularCreativeTypes["AVERAGE GROSS"] = PopularCreativeTypes["AVERAGE GROSS"].astype(float)
PopularCreativeTypes["MARKET SHARE"] = PopularCreativeTypes["MARKET SHARE"].astype(float)

In [None]:
px.pie(PopularCreativeTypes
       ,values = 'TOTAL GROSS' 
       ,names = 'CREATIVE TYPES'
       ,title = 'Percentage of Creative Types in Total Gross'
       ,color_discrete_sequence = px.colors.sequential.RdBu
       ,height = 600
       )
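
A pie chart of `TOTAL GROSS` displays each category's share of the column total. The sketch below shows that calculation on invented numbers; the category labels and the `SHARE %` column name are placeholders for this example.

```python
import pandas as pd

# Invented totals; the labels are placeholders, not real creative types.
toy = pd.DataFrame({
    "CREATIVE TYPES": ["Type A", "Type B", "Type C"],
    "TOTAL GROSS": [50.0, 25.0, 25.0],
})

# Each slice of the pie is the row value divided by the column sum.
toy["SHARE %"] = toy["TOTAL GROSS"] / toy["TOTAL GROSS"].sum() * 100
print(toy)
```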

In [None]:
px.bar(PopularCreativeTypes
      ,x = "TOTAL GROSS"
      ,y ="CREATIVE TYPES"
      ,title = "Total Gross by Creative Type")

In [None]:
px.pie(PopularCreativeTypes
       ,values = 'AVERAGE GROSS' 
       ,names = 'CREATIVE TYPES'
       ,title = 'Percentage of Creative Types in Average Gross'
       ,color_discrete_sequence = px.colors.sequential.RdBu
       ,height = 600
       )

In [None]:
px.bar(PopularCreativeTypes
      ,x = "AVERAGE GROSS"
      ,y = "CREATIVE TYPES"
      ,title = "Average Gross by Creative Type")

In [None]:
px.pie(PopularCreativeTypes
       ,values = 'MOVIES' 
       ,names = 'CREATIVE TYPES'
       ,title = 'Share of the Number of Movies by Creative Type'
       ,color_discrete_sequence = px.colors.sequential.RdBu
       ,height = 600
       )

In [None]:
px.bar(PopularCreativeTypes
      ,x = "MOVIES"
      ,y ="CREATIVE TYPES"
      ,title = "Number of Movies by Creative Type")