# Project proposal

---

Group name: Lena Binder, Julian Kulinsky, Sven Ritzmann (Gruppe C)

---


## Introduction

### Future licenses for Netflix to gain competitive advantage

Accelerated by COVID, direct-to-consumer entertainment services have reshaped the movie and entertainment industry. Disney, Netflix and Amazon, to name just a few, invest heavily in expanding their movie and series selection to compete for the trust and loyalty of customers. They not only buy licenses for existing movies and shows but also spend large sums on producing their own movies and series in order to offer exclusive content on their platforms. In short, owning and retaining such intellectual property in perpetuity has become more important for brands as a way to attract and retain customers (Havard, 2021).
With our analytical approach, we want to investigate which movies 1) are likely to provide the best possible customer experience, measured by i) their directors and ii) their featured stars, 2) have the highest potential to generate high revenue and 3) should be promoted and advertised more to attract customers (if already in the portfolio).

### Motivation

As mentioned in the introduction, Disney, Netflix, Amazon and others are investing heavily in expanding their movie and series offerings to compete for customer trust and loyalty. The literature suggests that content acquisition and retention are critical to maintaining customer loyalty and to driving audience migration towards video-on-demand. Accordingly, this is a timely topic that matters to the film and entertainment industry, which is why we chose it for our use case. We want to find out which movies are best suited to retain existing customers or attract new ones.

<b>Literature</b>:

1. Havard, Cody T. (2021) "Disney, Netflix, and Amazon Oh My! An Analysis of Streaming Brand Competition and the Impact on the Future of Consumer Entertainment," Findings in Sport, Hospitality, Entertainment, and Event Management: Vol. 1, Article 7. Available at: https://digitalcommons.memphis.edu/finsheem/vol1/iss1/7

    This analysis paper discusses the impact of brands such as Disney, Netflix, Amazon, and others on the growing streaming service space. The paper mentions that the importance of owning and retaining intellectual property in perpetuity has also increased among brands as a way to attract and retain customers.

2. Mazzolini, P. (2016). Netflix: Financial Position Analysis and Evolution in the Market for Online Streaming Services. In Google Scholar. Libera Università Internazionale degli Studi Sociali. http://tesi.luiss.it/18543/1/174501_MAZZOLINI_PIERFRANCESCO.pdf

    The thesis examines the progress that the online television industry has made in recent years from a business and financial perspective, with a particular focus on the streaming platform Netflix. It concludes that content acquisition and retention are critical to maintaining customer loyalty.


3. Hidalgo-Marí, T., Segarra-Saavedra, J. & Palomares-Sánchez, P. (2021). In-depth study of Netflix’s original content of fictional series. Forms, styles and trends in the new streaming scene. Communication & Society, 34(3), 1-13. 

    This article presents an analysis of the original content of fictional series produced by Netflix. Among other things, it is concluded that the acquisition of licenses and rights to broadcast other content continues to be a decisive factor, even if the data of recent years show a certain decline. Nevertheless, this strategic resource guarantees the superiority of Netflix and favors the migration of the audience towards VOD.


### Research question: Which films, stars and genres should Netflix invest in to maintain and expand its market advantage?

These include 1) licences for films and actors that Netflix does not yet have in its portfolio and 2) obtaining existing licences as well as promoting and advertising the films and stars that appear in them.

### Hypotheses
#### How concrete should we be here?

1. Certain movie genres have an impact on customer rating

2. Certain movie stars have an impact on customer rating

3. Certain movie stars tend to generate more revenue than others.

4. Certain movie genres tend to generate more revenue than others.

5. There is a negative correlation between runtime and rating.

6. tbd 
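
Hypothesis 5 could be checked with a Pearson correlation coefficient. The following is a minimal sketch in plain Python; the `pearson` helper and the sample values are illustrative assumptions and are not taken from the IMDb data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative values only (not taken from the data set)
runtime = [90, 100, 120, 150, 180]
rating = [7.5, 7.0, 6.8, 6.0, 5.5]

r = pearson(runtime, rating)
print(f"Pearson r = {r:.2f}")  # a value below 0 would support hypothesis 5
```

On the real data, once `runtime` has been converted to a numeric column, `df['runtime'].corr(df['rating'])` should give the same coefficient.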



## Data description

In this section, you will describe the data set you wish to explore. This includes

-   description of the observations in the data set,
-   description of how the data was originally collected (not how you found the data but how the original curator of the data collected it).

Our dataset contains information about movies originally collected by IMDb. IMDb is a US database that contains information and statistics about movies, television series, video productions and computer games, as well as the people who worked on them. The database, which has been operated by Amazon since 1998, gets its information from various sources such as filmmakers, movie studios, credits and other official sources. Much of the information, however, comes from IMDb users themselves and is checked for accuracy by IMDb before publication.

Our dataset contains a total of 5000 entries for movies that can be found on the IMDb website. For each movie, the dataset contains the following information:

- <b>Link to movie poster</b>: Poster of the movie in image format
- <b>Title</b>: Name of the movie
- <b>Certificate</b>: Age restriction of the movie
- <b>Runtime</b>: Length of the movie in minutes
- <b>Genre</b>: Film category of the movie to classify similar movies
- <b>Rating</b>: User rating expressed on a scale from 0 to 10, where 10 represents an extraordinary performance
- <b>About</b>: Description of the movie and its contents
- <b>Director</b>: Director of the movie, responsible for its recording
- <b>Stars</b>: Famous celebrities playing a role in the movie
- <b>Votes</b>: Number of votes people have submitted
- <b>Gross earnings</b>: Revenue generated by the movie, excl. production costs and taxes

### Data Quantity

In [None]:
#e.g.
# Format of the data
# How large is the database (Numbers of rows and columns)

#### Import libraries

In [1]:
# Import relevant libraries for data analysis
import pandas as pd
from pathlib import Path
import numpy as np
import warnings
import altair as alt

#### Handling of warnings

In [2]:
warnings.simplefilter(action='ignore', category=FutureWarning)

#### Import data

In [3]:
# Create path
PARENT_PATH = str(Path().resolve().parent) + "/"
PATH = "data/"
SUBPATH = "raw/"
FILE = "movies_netflix"
FORMAT = ".csv"

# Import CSV file with pandas
df = pd.read_csv(PARENT_PATH + PATH + SUBPATH + FILE + FORMAT)

# Brief data overview
df.head()

Unnamed: 0,poster,title,certificate,runtime,genre,rating,about,director,stars,votes,gross_earn
0,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Shawshank Redemption,15,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,Frank Darabont,"('Tim Robbins',), ('Morgan Freeman',), ('Bob G...",2626905,$28.34M
1,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Dark Knight,12A,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,"('Christian Bale',), ('Heath Ledger',), ('Aaro...",2598173,$534.86M
2,https://m.media-amazon.com/images/S/sash/4Fyxw...,Inception,12A,148 min,"Action, Adventure, Sci-Fi",8.8,A thief who steals corporate secrets through t...,Christopher Nolan,"('Leonardo DiCaprio',), ('Joseph Gordon-Levitt...",2304062,$292.58M
3,https://m.media-amazon.com/images/S/sash/4Fyxw...,Fight Club,18,139 min,Drama,8.8,An insomniac office worker and a devil-may-car...,David Fincher,"('Brad Pitt',), ('Edward Norton',), ('Meat Loa...",2071088,$37.03M
4,https://m.media-amazon.com/images/S/sash/4Fyxw...,Pulp Fiction,18,154 min,"Crime, Drama",8.9,"The lives of two mob hitmen, a boxer, a gangst...",Quentin Tarantino,"('John Travolta',), ('Uma Thurman',), ('Samuel...",2011013,$107.93M


Finding:
As can be seen, the values of the feature `votes` are stored in an inconsistent number format. For this reason, the formatting must be adjusted, e.g. to '2.626.905' instead of '26,26,905'.

See source: https://www.imdb.com/title/tt0111161/ratings/?ref_=tt_ov_rt 
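
One way to normalize such values is to drop every non-digit character before converting. This is only a sketch: `parse_votes` is a hypothetical helper, and it assumes that vote counts are whole numbers without a decimal part.

```python
import re


def parse_votes(raw):
    """Drop every non-digit character and return the vote count as int.

    Assumption: votes are whole numbers, so '.' and ',' are only ever
    used as grouping separators. Empty values are returned as None.
    """
    digits = re.sub(r"\D", "", str(raw))
    return int(digits) if digits else None


# Different spellings of the same count, plus an empty value
samples = ["26,26,905", "2.626.905", "2,626,905", ""]
parsed = [parse_votes(s) for s in samples]
print(parsed)
```

All three spellings of the Shawshank Redemption vote count end up as the same integer, which makes the column safe to pass to `pd.to_numeric` afterwards.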

### Data Quality

In [None]:
# e.g.
# Does the data include characteristics relevant of the business question
# What data types are present (symbolic, numeric, etc.)
# Are there anomalies (missing data / human errors? / not uniform column types)

#### Overview

In [None]:
# Dtypes of each feature
# df.dtypes
df.info()

In [None]:
# Description of numerical values
df.describe()

Finding:

Through `.describe()` we can see that so far only one variable (`rating`) is stored as a numerical value and can be summarized.

This also means that we cannot yet relate the rating to the other variables, e.g. whether the length of a film goes along with a better rating.

#### Description of the observations in the dataset

Finding:

As can be seen, all features are marked as `object` - exception: `rating` (float64).

The features `runtime` (without the suffix 'min'), `votes` and `gross_earn` hold numerical data but are stored as text. Consequently, they must be converted into an appropriate numerical data type (`rating` already is numerical).

In addition, new columns should be created for the variables `genre` and `stars`, as each entry contains several genres or persons. This will be important later for the more in-depth evaluation of the findings.
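
A possible way to split both columns is sketched below on a toy DataFrame. It assumes that `stars` always has the tuple-like format visible in the preview and that genres are separated by ", "; `parse_stars` and the toy rows are illustrative and not part of the project code.

```python
from ast import literal_eval

import pandas as pd


def parse_stars(raw):
    """Turn "('A',), ('B',)" into ['A', 'B']; return [] for missing values."""
    if not isinstance(raw, str):
        return []
    parsed = literal_eval(raw)
    # A single star is parsed as ('A',), several stars as (('A',), ('B',))
    if parsed and isinstance(parsed[0], str):
        parsed = (parsed,)
    return [name for group in parsed for name in group]


toy = pd.DataFrame({
    "genre": ["Drama", "Action, Crime, Drama", None],
    "stars": [
        "('Tim Robbins',), ('Morgan Freeman',)",
        "('Christian Bale',)",
        None,
    ],
})

# One Python list of names per movie
toy["stars_list"] = toy["stars"].apply(parse_stars)

# One indicator column per genre; missing genres give a row of zeros
genre_dummies = toy["genre"].str.get_dummies(sep=", ")
print(genre_dummies)
```

`Series.str.get_dummies` produces the same kind of 0/1 genre columns that can also be built by hand with a loop, so it may serve as a cross-check for the manual transformation.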

#### Conclusion

- It would be useful if the dataset also contained the release year of each movie.
- The variable `certificate` contains age restriction values that are not uniform, and some of them are wrong.
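
Misplaced values in `certificate` could be flagged with a simple pattern check. The sketch below assumes that the typical error is a runtime such as '144 min' ending up in the certificate column; `looks_like_runtime` and the sample list are hypothetical.

```python
import re

# Matches strings such as '144 min' that belong into the runtime column
RUNTIME_PATTERN = re.compile(r"^\d+\s*min$")


def looks_like_runtime(value):
    """Return True if a certificate value actually looks like a runtime."""
    return isinstance(value, str) and bool(RUNTIME_PATTERN.match(value.strip()))


# Sample certificate values in the style of the data preview
certificates = ["15", "12A", "144 min", "PG", "X"]
suspicious = [c for c in certificates if looks_like_runtime(c)]
print(suspicious)
```

On the full DataFrame the same check could be applied with `df['certificate'].apply(looks_like_runtime)` to list the affected rows before deciding how to repair them.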

#### Original collected Dataset

- Crawled from the IMDb website in 2022

## Analysis approach

In this section, you will provide a brief overview of your analysis approach. This includes:

-   Description of the relevant variable.
-   Exploratory data analysis and summary statistics for the relevant variables.
-   The visualization types (what kind of visualizations you will use)

### Relevant Variables

Description

Hypothesis 1
- Dependent variable: Customer ratings (suspected effect)

- Independent variable: movie genres (cause)

Hypothesis 2
- Dependent variable: Customer ratings (suspected effect)

- Independent variable: movie stars (cause)
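
For hypothesis 1, the dependent variable can be aggregated per value of the independent variable. The sketch below uses a toy DataFrame in which `genre` already holds a list per movie; the rows are made up for illustration.

```python
import pandas as pd

# Toy data: genre is a list per movie, as after the split into lists
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": [["Drama"], ["Action", "Drama"], ["Action"]],
    "rating": [9.0, 8.0, 6.0],
})

# One row per (movie, genre) pair
exploded = toy.explode("genre")

# Mean rating and number of movies per genre, best-rated genre first
rating_by_genre = (
    exploded.groupby("genre")["rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(rating_by_genre)
```

The same pattern should work for hypothesis 2 by exploding a list column of stars instead of genres. The `count` column helps to spot groups with too few movies for a meaningful mean.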


In [None]:
# Drop useless variables (Tell why)
# poster (does not give additional insight)

### Exploratory data analysis

In [None]:
#	Understanding your variables
#	Cleaning your dataset
#	Analyzing relationships between variables 


#### Transform data

In [None]:
# Show data again
df

In [None]:
# Show datatypes of each feature
df.info()

This step is important for the Exploratory data analysis

In [4]:
# Convert runtime, votes and gross_earn into numerical dtypes
# runtime: strip letters, commas, spaces and hyphens (e.g. '142 min' -> '142')
df['runtime'] = df['runtime'].str.replace(r'[a-zA-Z,]| |-', '', regex=True)
df['runtime'] = pd.to_numeric(df['runtime'])

# votes: delete the thousands separators (plain text replacement, no regex)
df['votes'] = df['votes'].str.replace(',', '', regex=False)

# convert the votes to numeric
df['votes'] = pd.to_numeric(df['votes'])

# gross_earn: strip '$', 'M' and spaces (e.g. '$28.34M' -> '28.34', in million $)
df['gross_earn'] = df['gross_earn'].str.replace(r'[M$]| ', '', regex=True)
df['gross_earn'] = pd.to_numeric(df['gross_earn'])


In [5]:
from ast import literal_eval

# We observed 30 float entries (missing values) in df['genre'], so we convert the column to str.
# astype(str) assigns the whole column at once and avoids chained assignment
# (and the SettingWithCopyWarning it triggers).
df['genre'] = df['genre'].astype(str)


In [6]:
# Convert each string entry to list
df['genre'] = df['genre'].str.strip('()').str.split(',')
# df['genre']

In [7]:
import re
# Regex to remove all non-alphabetic characters
regex = re.compile('[^a-zA-Z]')

# Count number of genres to be converted into column
genres = list()
for i in range(len(df['genre'])):
   for genre in df['genre'][i]:
      genre = regex.sub('', genre)
      genre = genre.replace(" ", "")
      if genre not in genres:
         genres.append(genre)

# exclude nan
# len(genres) - 1 # = 21

In [8]:
# Create column for each genre, fill with 0
data = dict()
for element in genres:
   data[element] = 0
genres_as_columns = pd.DataFrame.from_dict(data, orient='index')
genres_as_columns = genres_as_columns.T
genres_as_columns

Unnamed: 0,Drama,Action,Crime,Adventure,SciFi,Romance,Mystery,Western,Fantasy,Thriller,...,History,Comedy,Animation,Family,Horror,Music,Sport,Musical,FilmNoir,nan
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
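# Concatenate both dfs
# Note: pd.concat appends the single row of genres_as_columns as an extra row
# (NaN in all movie columns) every time this cell is run; these rows are dropped further below.
df = pd.concat([df, genres_as_columns], ignore_index=True)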
for genre in genres:
    df[genre] = 0

Unnamed: 0,poster,title,certificate,runtime,genre,rating,about,director,stars,votes,...,History,Comedy,Animation,Family,Horror,Music,Sport,Musical,FilmNoir,nan
0,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Shawshank Redemption,15,142.0,[Drama],9.3,Two imprisoned men bond over a number of years...,Frank Darabont,"('Tim Robbins',), ('Morgan Freeman',), ('Bob G...",2626905.0,...,0,0,0,0,0,0,0,0,0,0
1,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Dark Knight,12A,152.0,"[Action, Crime, Drama]",9.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,"('Christian Bale',), ('Heath Ledger',), ('Aaro...",2598173.0,...,0,0,0,0,0,0,0,0,0,0
2,https://m.media-amazon.com/images/S/sash/4Fyxw...,Inception,12A,148.0,"[Action, Adventure, Sci-Fi]",8.8,A thief who steals corporate secrets through t...,Christopher Nolan,"('Leonardo DiCaprio',), ('Joseph Gordon-Levitt...",2304062.0,...,0,0,0,0,0,0,0,0,0,0
3,https://m.media-amazon.com/images/S/sash/4Fyxw...,Fight Club,18,139.0,[Drama],8.8,An insomniac office worker and a devil-may-car...,David Fincher,"('Brad Pitt',), ('Edward Norton',), ('Meat Loa...",2071088.0,...,0,0,0,0,0,0,0,0,0,0
4,https://m.media-amazon.com/images/S/sash/4Fyxw...,Pulp Fiction,18,154.0,"[Crime, Drama]",8.9,"The lives of two mob hitmen, a boxer, a gangst...",Quentin Tarantino,"('John Travolta',), ('Uma Thurman',), ('Samuel...",2011013.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5003,,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
5004,,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
5005,,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
5006,,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0


In [41]:
# Insert 1 into every column when contained in genre
for entry_index in range(len(df['genre'])):
    if type(df['genre'][entry_index]) != float:
        for genre in df['genre'][entry_index]:
            if genre.strip() == 'Sci-Fi':
                genre = 'SciFi'
            elif genre.strip() == 'Film-Noir':
                genre = 'FilmNoir'
            label = genre.strip()
            # .loc writes into df itself instead of into a temporary copy
            df.loc[entry_index, label] = 1
    else:
        # Can be ignored because it will get cut out later
        print(df['genre'][entry_index])

nan
nan
nan
nan
nan
nan
nan
nan


Unnamed: 0,poster,title,certificate,runtime,genre,rating,about,director,stars,votes,...,History,Comedy,Animation,Family,Horror,Music,Sport,Musical,FilmNoir,nan
0,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Shawshank Redemption,15,142.0,[Drama],9.3,Two imprisoned men bond over a number of years...,Frank Darabont,"('Tim Robbins',), ('Morgan Freeman',), ('Bob G...",2626905.0,...,0,0,0,0,0,0,0,0,0,0
1,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Dark Knight,12A,152.0,"[Action, Crime, Drama]",9.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,"('Christian Bale',), ('Heath Ledger',), ('Aaro...",2598173.0,...,0,0,0,0,0,0,0,0,0,0
2,https://m.media-amazon.com/images/S/sash/4Fyxw...,Inception,12A,148.0,"[Action, Adventure, Sci-Fi]",8.8,A thief who steals corporate secrets through t...,Christopher Nolan,"('Leonardo DiCaprio',), ('Joseph Gordon-Levitt...",2304062.0,...,0,0,0,0,0,0,0,0,0,0
3,https://m.media-amazon.com/images/S/sash/4Fyxw...,Fight Club,18,139.0,[Drama],8.8,An insomniac office worker and a devil-may-car...,David Fincher,"('Brad Pitt',), ('Edward Norton',), ('Meat Loa...",2071088.0,...,0,0,0,0,0,0,0,0,0,0
4,https://m.media-amazon.com/images/S/sash/4Fyxw...,Pulp Fiction,18,154.0,"[Crime, Drama]",8.9,"The lives of two mob hitmen, a boxer, a gangst...",Quentin Tarantino,"('John Travolta',), ('Uma Thurman',), ('Samuel...",2011013.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5003,,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
5004,,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
5005,,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
5006,,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0


In [46]:
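# Drop the trailing NaN rows (8 in this run).
# Selecting rows without a title avoids hard-coding their number;
# assumption: every real movie entry has a title.
df.drop(df[df['title'].isna()].index, inplace=True)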

Unnamed: 0,poster,title,certificate,runtime,genre,rating,about,director,stars,votes,...,History,Comedy,Animation,Family,Horror,Music,Sport,Musical,FilmNoir,nan
0,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Shawshank Redemption,15,142.0,[Drama],9.3,Two imprisoned men bond over a number of years...,Frank Darabont,"('Tim Robbins',), ('Morgan Freeman',), ('Bob G...",2626905.0,...,0,0,0,0,0,0,0,0,0,0
1,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Dark Knight,12A,152.0,"[Action, Crime, Drama]",9.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,"('Christian Bale',), ('Heath Ledger',), ('Aaro...",2598173.0,...,0,0,0,0,0,0,0,0,0,0
2,https://m.media-amazon.com/images/S/sash/4Fyxw...,Inception,12A,148.0,"[Action, Adventure, Sci-Fi]",8.8,A thief who steals corporate secrets through t...,Christopher Nolan,"('Leonardo DiCaprio',), ('Joseph Gordon-Levitt...",2304062.0,...,0,0,0,0,0,0,0,0,0,0
3,https://m.media-amazon.com/images/S/sash/4Fyxw...,Fight Club,18,139.0,[Drama],8.8,An insomniac office worker and a devil-may-car...,David Fincher,"('Brad Pitt',), ('Edward Norton',), ('Meat Loa...",2071088.0,...,0,0,0,0,0,0,0,0,0,0
4,https://m.media-amazon.com/images/S/sash/4Fyxw...,Pulp Fiction,18,154.0,"[Crime, Drama]",8.9,"The lives of two mob hitmen, a boxer, a gangst...",Quentin Tarantino,"('John Travolta',), ('Uma Thurman',), ('Samuel...",2011013.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4996,https://m.media-amazon.com/images/S/sash/4Fyxw...,Playing It Cool,15,94.0,"[Comedy, Romance]",5.9,Unrequited love motivates a guy to write about...,Justin Reardon,"('Chris Evans',), ('Michelle Monaghan',), ('To...",28515.0,...,0,1,0,0,0,0,0,0,0,0
4997,https://m.media-amazon.com/images/S/sash/4Fyxw...,A Moment to Remember,144 min,,[nan],8.1,A young couple's enduring love is tested when ...,John H. Lee,"('Jung Woo-sung',), ('Son Ye-jin',), ('Jong-ha...",23957.0,...,0,0,0,0,0,0,0,0,0,1
4998,https://m.media-amazon.com/images/S/sash/4Fyxw...,Hairspray,PG,92.0,"[Comedy, Drama, Family]",7.0,A 'pleasantly plump' teenager teaches 1962 Bal...,John Waters,"('Sonny Bono',), ('Ruth Brown',), ('Divine',),...",21536.0,...,0,1,0,1,0,0,0,0,0,0
4999,https://m.media-amazon.com/images/S/sash/4Fyxw...,McCabe & Mrs. Miller,X,120.0,"[Drama, Western]",7.6,A gambler and a prostitute become business par...,Robert Altman,"('Warren Beatty',), ('Julie Christie',), ('Ren...",25097.0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Show results of data type conversion
df.info()

In [None]:
# Show the data with density plot for visualization of the rating
alt.Chart(df).transform_density(
    'rating',
    as_=['rating', 'density'],
).mark_area().encode(
    x="rating:Q",
    y='density:Q',
).properties(
    title='Rating'
)

In [None]:
# Show the data with density plot for visualization of the runtime
alt.Chart(df).transform_density(
    'runtime',
    as_=['runtime', 'density'],
).mark_area().encode(
    x="runtime:Q",
    y='density:Q',
).properties(
    title='RunTime'
)

In [None]:
# Show the data with density plot for visualization of the votes
alt.Chart(df).transform_density(
    'votes',
    as_=['votes', 'density'],
).mark_area().encode(
    x="votes:Q",
    y='density:Q',
).properties(
    title='Votes'
)

In [None]:
# Show the data with density plot for visualization of the gross earnings
alt.Chart(df).transform_density(
    'gross_earn',
    as_=['gross_earn', 'density'],
).mark_area().encode(
    x="gross_earn:Q",
    y='density:Q',
).properties(
    title='Gross Earn'
)

In [None]:
# Split Genre and Stars in their own Variables

#### Identify outliers, nan-values, human-errors

In [None]:
# outliers
# nan-values
# human-errors (e.g. runtime of the movie [there are Genres inside])
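
Outliers and missing values could be identified as sketched below. The 1.5 x IQR rule is one common convention and an assumption here, not a decision already made for this project; `iqr_outliers` and the sample series are illustrative.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Illustrative runtimes in minutes with one extreme and one missing value
runtime = pd.Series([90, 95, 100, 105, 110, 115, 120, 400, None])

outliers = iqr_outliers(runtime)
n_missing = int(runtime.isna().sum())
print(outliers.tolist(), n_missing)
```

Missing values are ignored by `quantile` and never satisfy the comparisons, so they have to be counted separately with `isna()`.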

#### Save Transformed DataFrame

In [None]:
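# Save cleaned DataFrame as CSV in the interim folder, named like [NR]_[NAME]_[NAME], e.g. 01_SR_movies_netflix_transformed
SUBPATH_INTERIM = 'interim/'
FILE_INTERIM = '01_SR_movies_netflix_transformed'
df.to_csv(PARENT_PATH + PATH + SUBPATH_INTERIM + FILE_INTERIM + FORMAT, index=False)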

#### Relationship between variables

e.g. Relationship Matrix

In [None]:
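# work in progress (e.g. heatmap in Altair)
# Restrict to numeric columns so that text columns such as title do not break corr()
corrmat = df.select_dtypes(include='number').corr()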

corrmat

#### Visualization Types

https://www.data-to-viz.com/

Description:

### Summary statistics 

In [None]:
# e.g.
# Measure of location
# - Mean
# - Geometric mean
# - Trimmed Mean
# - Median
# Measure of spread
# - Range
# - Interquartile range
# - Quartiles
# - Skewness
# - Kurtosis
# Graphs / charts
# - Visualizations (e.g. Histogram / Box plot etc.)
# - https://www.data-to-viz.com/ 
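
The measures listed above could be computed as in the following sketch. `summarize` is a hypothetical helper; the 10 % trimming fraction is an assumption, and the geometric mean is only defined for positive values.

```python
import numpy as np
import pandas as pd


def summarize(series, trim=0.1):
    """Measures of location and spread for a numeric Series."""
    values = series.dropna().sort_values().to_numpy(dtype=float)
    # Cut the same number of values from both ends for the trimmed mean
    cut = int(len(values) * trim)
    trimmed = values[cut:len(values) - cut]
    q1, q3 = np.quantile(values, [0.25, 0.75])
    return {
        "mean": values.mean(),
        "geometric_mean": float(np.exp(np.log(values).mean())),
        "trimmed_mean": trimmed.mean(),
        "median": float(np.median(values)),
        "range": values.max() - values.min(),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": series.skew(),
        "kurtosis": series.kurt(),  # excess kurtosis as reported by pandas
    }


# Illustrative values with one extreme observation
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
summary = summarize(sample)
print(summary)
```

Comparing the mean with the median and the trimmed mean shows how strongly a single extreme value pulls the mean, which is relevant for skewed variables such as `votes` or `gross_earn`.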


### Visualization Types

describe data (what kind of visualizations you will use) (and why)

## Data dictionary

*Create a data dictionary for all the variables in your data set. You may fill out the data description table or create your own table with Pandas:*

<br>


| Name  |   Description	   	| Type   	|  Format 	|
|---	|---	          	|---	    |---	|
|   	|   	            |   	    |   	|
|   	|   	       	    |   	    |   	|
|   	|   	       	    |   	    |   	|


<br>


- `Type`: nominal, ordinal or numeric

- `Format`: int, float, string, category, date or object

In [None]:
# Only the original columns are documented here
# (not the index column or the genre indicator columns)
name_list = [
    'poster', 'title', 'certificate', 'runtime', 'genre', 'rating',
    'about', 'director', 'stars', 'votes', 'gross_earn'
]
description_list = [
    'Poster of the movie in image format. Used to advertise the movie.',
    'Name of the movie.',
    'Age restriction of the movie.',
    'Length in minutes of the movie.',
    'Film category of the movie to classify similar movies.',
    'User rating expressed on a scale from 0 to 10, where 10 represents an extraordinary performance.',
    'Description of the movie and its contents.',
    'Director of the movie, responsible for its recording.',
    'Famous celebrities playing a role in the movie.',
    'Number of votes people have submitted.',
    'Revenue generated by the movie, excl. production costs and taxes.'
]

type_list = [
    'nominal',
    'nominal',
    'ordinal',
    'numeric',
    'nominal',
    'ordinal',
    'nominal',
    'nominal',
    'nominal',
    'numeric',
    'numeric'
]

# Take the dtypes of exactly these columns so that all four lists have the same length
format_list = df[name_list].dtypes.to_list()

data_dictionary = pd.DataFrame({
    'Name': name_list,
    'Description': description_list,
    'Type': type_list,
    'Format': format_list,
})
data_dictionary