# Analysis of Movie Piracy

Peter Currie, Otto Pfefferkorn, Andy Zheng

## Introduction

Movies are a popular form of entertainment that has been enjoyed by people all around the world for many years. With the rise of the internet, it has become easier than ever to watch movies online. This has also led to a large rise in movie piracy, which is the watching or reproduction of movies without the permission of the copyright owner. More information about movie piracy can be found here: https://www.legalmatch.com/law-library/article/movie-piracy.html#:~:text=In%20short%2C%20movie%20piracy%20is,umbrella%20of%20intellectual%20property%20laws.

Piracy has become a major issue for the movie industry, as it results in a significant financial hit for both filmmakers and movie studios. Pirated movies are often sold or distributed online for a fraction of the price of a legitimate copy, and are sometimes shown for free on ad-supported websites, which makes piracy an attractive option for people who are looking for an easy way to watch a movie they want to see.

The piracy of movies affects not only the finances of the movie industry but also the quality of the movies people end up watching. When people pirate movies, they often watch them on websites crowded with ads or download them from questionable sources, and the copy is usually low quality, which makes for a poor viewing experience. Piracy is not only illegal; it also takes away from the creative and innovative work of the people involved in making a movie. Filmmakers rely on revenue from theatre ticket sales and from legitimate DVD/Blu-ray and streaming purchases to fund their projects and create future films.

This project examines movie piracy by industry and looks at how much time passes between a movie's release and the date it is pirated.

In [170]:
import pandas as pd
import numpy as np

movie_data = pd.read_csv("movies_dataset.csv", sep=',')

# Movie piracy dataset: drop the index/id columns and the fields not used in this analysis
movie_data = movie_data.drop(columns=['Unnamed: 0', 'id'])
movie_data = movie_data.drop(columns=['storyline', 'writer', 'director'])
movie_data = movie_data.drop(columns=['run_time'])

movie_data.head()



Unnamed: 0,IMDb-rating,appropriate_for,downloads,industry,language,posted_date,release_date,title,views
0,4.8,R,304,Hollywood / English,English,"20 Feb, 2023",Jan 28 2023,Little Dixie,2794
1,6.4,TV-PG,73,Hollywood / English,English,"20 Feb, 2023",Feb 05 2023,Grilling Season: A Curious Caterer Mystery,1002
2,5.2,R,1427,Hollywood / English,"English,Hindi","20 Apr, 2021",Jun 18 2021,In the Earth,14419
3,8.1,,1549,Tollywood,Hindi,"20 Feb, 2023",Feb 17 2023,Vaathi,4878
4,4.6,,657,Tollywood,Hindi,"20 Feb, 2023",Jan 26 2023,Alone,2438
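
The introduction states that the project looks at the time between a movie's release and its piracy. A minimal sketch of that calculation follows, assuming `posted_date` marks when the pirated copy was posted and that the two columns use the formats visible in the rows above (`"20 Feb, 2023"` and `"Jan 28 2023"`). The column name `days_to_piracy` is a hypothetical name introduced here, and the two sample rows are copied from the output above.

```python
import pandas as pd

# Two rows copied from the head() output above, so the sketch runs on its own
sample = pd.DataFrame({
    "title": ["Little Dixie", "In the Earth"],
    "posted_date": ["20 Feb, 2023", "20 Apr, 2021"],
    "release_date": ["Jan 28 2023", "Jun 18 2021"],
})

# Parse each column with its own format; errors="coerce" turns unparseable values into NaT
posted = pd.to_datetime(sample["posted_date"], format="%d %b, %Y", errors="coerce")
released = pd.to_datetime(sample["release_date"], format="%b %d %Y", errors="coerce")

# days_to_piracy is a hypothetical column name: posted date minus release date, in whole days
sample["days_to_piracy"] = (posted - released).dt.days

print(sample[["title", "days_to_piracy"]])
```

A negative value means the posted date is earlier than the listed release date, as in the "In the Earth" row, so such rows may need separate handling before any averaging.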


In [178]:


# Several columns contain a significant number of null values

# Count NaN values in each column
na_counts = movie_data.isna().sum()

# List of industries to iterate over, excluding the NaN entry
industries = movie_data['industry'].dropna().unique().tolist()

# Group the rows by industry
movies_by_industry = movie_data.groupby('industry')

# Map each industry name to its own DataFrame;
# the percentage calculations in later cells look up industries in this dict
industry_hash = {}
for industry in industries:
    industry_hash[industry] = movies_by_industry.get_group(industry)

na_counts


IMDb-rating         841
appropriate_for    9476
downloads             1
industry              1
language            546
posted_date           1
release_date          1
title                 1
views                 1
dtype: int64

In [176]:
na_by_industry = {}
# Percentage of NaN values in the appropriate_for column, keyed by industry
for industry in industries:
    appropriate_column = industry_hash[industry]['appropriate_for']
    # NaN count divided by the number of rows, as a percentage rounded to 3 decimal places
    temp_na_percentage = round(appropriate_column.isna().sum() / appropriate_column.size * 100, 3)
    na_by_industry[industry] = temp_na_percentage

na_by_industry

{'Hollywood / English': 40.064,
 'Tollywood': 70.563,
 'Wrestling': 96.998,
 'Bollywood / Indian': 58.866,
 'Punjabi': 87.952,
 'Anime / Kids': 28.122,
 'Dub / Dual Audio': 8.889,
 'Pakistani': 89.13,
 'Stage shows': 100.0,
 '3D Movies': 0.0}
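
The same per-industry NaN percentage can likely be computed without an explicit loop. The sketch below is an alternative, not the method used in this notebook: it groups the boolean `isna()` mask by industry and takes the mean. The toy DataFrame, the industry names `A` and `B`, and the variable name `na_pct` are hypothetical and exist only to make the example self-contained.

```python
import pandas as pd

# Hypothetical toy data standing in for movie_data
toy = pd.DataFrame({
    "industry": ["A", "A", "B", "B", "B", None],
    "appropriate_for": ["R", None, None, None, "PG", "R"],
})

# The mean of a boolean mask is the fraction of True values, so this gives
# the share of missing ratings per industry; rows with a NaN industry are
# dropped by groupby by default
na_pct = (
    toy["appropriate_for"].isna()
    .groupby(toy["industry"])
    .mean()
    .mul(100)
    .round(3)
)

print(na_pct.to_dict())
```

On the real data this would be called with `movie_data` in place of `toy`, and should give the same figures as the loop above if the assumptions hold.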

In [175]:
industry_percentage = {}
# Percentage of movies in the dataset that belong to each industry
for industry in industries:
    # Compare row counts; len() gives the number of rows in a DataFrame
    percentage = round(len(industry_hash[industry]) / len(movie_data) * 100, 3)
    industry_percentage[industry] = percentage

industry_percentage

{'Hollywood / English': 71.292,
 'Tollywood': 5.704,
 'Wrestling': 2.107,
 'Bollywood / Indian': 12.872,
 'Punjabi': 1.616,
 'Anime / Kids': 5.105,
 'Dub / Dual Audio': 0.219,
 'Pakistani': 0.448,
 'Stage shows': 0.628,
 '3D Movies': 0.005}
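
As a cross-check on the shares above, `value_counts` offers a shorter route. This is a sketch on hypothetical toy data (the industry names `A` and `B` and the variable names are illustrative only). Note the difference in denominators: `normalize=True` divides by the number of rows with a known industry, while the loop above divides by all rows, including the one with a missing industry.

```python
import pandas as pd

# Hypothetical toy data: three movies in A, one in B, one with no industry
toy = pd.DataFrame({"industry": ["A", "A", "A", "B", None]})

# Share among rows with a known industry (NaN is excluded by default)
share_known = toy["industry"].value_counts(normalize=True).mul(100).round(3)

# Share of all rows, matching the denominator used in the loop above
share_all = toy["industry"].value_counts().div(len(toy)).mul(100).round(3)

print(share_known.to_dict())
print(share_all.to_dict())
```

With only one missing industry value in the dataset, the two versions should differ very little on the real data.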