## Data description

Our dataset contains information about movies originally collected by IMDb. IMDb is a US database with information and statistics about movies, television series, video productions and computer games, as well as the people who worked on them. It has been operated by Amazon since 1998 and draws its information from sources such as filmmakers, movie studios, credits and other official sources. Much of the information, however, comes from IMDb users themselves and is checked for accuracy by IMDb before publication.

Our dataset contains a total of 5000 entries for movies that can be found on the IMDb website. For each movie, the dataset contains the following information:

| Feature  | Description   |  
|---|---|
| Link to movie poster  | Poster of the movie in image format  | 
| Title | Name of the movie  | 
| Certificate  | Age restriction of the movie  | 
| Runtime  | Length of the movie in minutes  | 
| Genre  | Film category of the movie to classify similar movies  | 
| Rating  | User rating expressed on a scale from 0 to 10, where 10 represents an extraordinary movie  | 
| About  | Description of the movie and its contents  | 
| Director  | Director of the movie, responsible for its recording  | 
| Stars  | Leading actors appearing in the movie  | 
| Votes  | Number of votes people have submitted  | 
| Gross earnings  | Revenue generated by the movie, excluding production costs and taxes  | 

### Data Import

#### Import libraries

In [2]:
# Import relevant libraries for data analysis
import pandas as pd
from pathlib import Path
import numpy as np
import warnings
import altair as alt

#### Handling of warnings

In [3]:
warnings.simplefilter(action='ignore', category=FutureWarning)

#### Import data

In [5]:
# Create path
PARENT_PATH = str(Path().resolve().parent) + "/adst/"
PATH = "data/"
SUBPATH = "raw/"
FILE = "movies_netflix"
FORMAT = ".csv"

# Import CSV file with pandas
df = pd.read_csv(PARENT_PATH + PATH + SUBPATH + FILE + FORMAT)

#### Brief data overview

In [6]:
# Brief overview of the data
df.head()

Unnamed: 0,poster,title,certificate,runtime,genre,rating,about,director,stars,votes,gross_earn
0,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Shawshank Redemption,15,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,Frank Darabont,"('Tim Robbins',), ('Morgan Freeman',), ('Bob G...",2626905,$28.34M
1,https://m.media-amazon.com/images/S/sash/4Fyxw...,The Dark Knight,12A,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,Christopher Nolan,"('Christian Bale',), ('Heath Ledger',), ('Aaro...",2598173,$534.86M
2,https://m.media-amazon.com/images/S/sash/4Fyxw...,Inception,12A,148 min,"Action, Adventure, Sci-Fi",8.8,A thief who steals corporate secrets through t...,Christopher Nolan,"('Leonardo DiCaprio',), ('Joseph Gordon-Levitt...",2304062,$292.58M
3,https://m.media-amazon.com/images/S/sash/4Fyxw...,Fight Club,18,139 min,Drama,8.8,An insomniac office worker and a devil-may-car...,David Fincher,"('Brad Pitt',), ('Edward Norton',), ('Meat Loa...",2071088,$37.03M
4,https://m.media-amazon.com/images/S/sash/4Fyxw...,Pulp Fiction,18,154 min,"Crime, Drama",8.9,"The lives of two mob hitmen, a boxer, a gangst...",Quentin Tarantino,"('John Travolta',), ('Uma Thurman',), ('Samuel...",2011013,$107.93M


<b>Findings</b>:
As can be seen, the values of the feature `votes` are stored in an inconsistent number format. Formatting adjustments are therefore required, e.g. '2.626.905' instead of '26,26,905'.

See source: https://www.imdb.com/title/tt0111161/ratings/?ref_=tt_ov_rt 
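The formatting adjustment for `votes` could look roughly like the sketch below. It assumes that the raw strings contain commas as grouping characters (as in '26,26,905'); the sample values and the column name `votes_clean` are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical sample; the raw strings are assumed to use commas as grouping characters
sample = pd.DataFrame({"votes": ["26,26,905", "25,98,173", "25,112"]})

# Remove every comma, then convert the remaining digits to integers
sample["votes_clean"] = (
    sample["votes"]
    .str.replace(",", "", regex=False)
    .astype("int64")
)

print(sample)
```

Removing the separators entirely sidesteps the question of where they were placed, so the result is the same regardless of the grouping convention used in the source.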

### Data Quantity

The data collected from IMDb is stored in a CSV file. <br>
The dataset contains 5000 rows and 11 features, which are described in more detail above.

In [7]:
# Number of rows and columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   poster       5000 non-null   object 
 1   title        5000 non-null   object 
 2   certificate  5000 non-null   object 
 3   runtime      5000 non-null   object 
 4   genre        4970 non-null   object 
 5   rating       5000 non-null   float64
 6   about        4986 non-null   object 
 7   director     5000 non-null   object 
 8   stars        5000 non-null   object 
 9   votes        5000 non-null   object 
 10  gross_earn   4585 non-null   object 
dtypes: float64(1), object(10)
memory usage: 429.8+ KB


### Data Quality

#### Data types of each feature

In [8]:
# Show data types of the features
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   poster       5000 non-null   object 
 1   title        5000 non-null   object 
 2   certificate  5000 non-null   object 
 3   runtime      5000 non-null   object 
 4   genre        4970 non-null   object 
 5   rating       5000 non-null   float64
 6   about        4986 non-null   object 
 7   director     5000 non-null   object 
 8   stars        5000 non-null   object 
 9   votes        5000 non-null   object 
 10  gross_earn   4585 non-null   object 
dtypes: float64(1), object(10)
memory usage: 429.8+ KB


In [9]:
# Description of numerical features
df.describe()

Unnamed: 0,rating
count,5000.0
mean,6.6923
std,0.973649
min,1.5
25%,6.1
50%,6.8
75%,7.4
max,9.3


<b>Findings</b>:
Through `.info()` and `.describe()` we can see that there is only one numerical variable (`rating`) so far. All other variables have the data type `object`.

This means that we cannot yet determine, for example, whether the length of a movie is associated with a better rating or whether certain genres tend to generate more revenue than others.

For this reason, some of the variables of the data type `object` have to be converted to numeric data types. These are: `runtime`, `votes` and `gross_earn`.
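A minimal sketch of such a conversion for `runtime` and `gross_earn` is shown here. It assumes the formats visible in the overview ('142 min' and '$28.34M'); the sample frame and the new column names `runtime_min` and `gross_earn_musd` are hypothetical.

```python
import pandas as pd

# Hypothetical sample mirroring the formats seen in df.head()
sample = pd.DataFrame({
    "runtime": ["142 min", "152 min", "94 min"],
    "gross_earn": ["$28.34M", "$534.86M", None],
})

# Extract the leading digits of the runtime; non-matching values become NaN
sample["runtime_min"] = pd.to_numeric(
    sample["runtime"].str.extract(r"(\d+)", expand=False),
    errors="coerce",
)

# Strip the currency sign and the 'M' suffix; the result is in millions of dollars
sample["gross_earn_musd"] = pd.to_numeric(
    sample["gross_earn"].str.replace(r"[$M]", "", regex=True),
    errors="coerce",
)

print(sample.dtypes)
```

Using `errors="coerce"` turns values that do not fit the expected pattern into `NaN` instead of raising an error, which keeps unexpected entries visible for later inspection.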

#### Anomalies

In [11]:
# Check for missing values
df.isnull().sum()

poster           0
title            0
certificate      0
runtime          0
genre           30
rating           0
about           14
director         0
stars            0
votes            0
gross_earn     415
dtype: int64

<b>Findings</b>:
Through `.isnull()` and `.sum()` we can see that the variable `genre` contains 30 missing values, the variable `about` contains 14 missing values and the variable `gross_earn` contains 415 missing values.

In total, 459 values are therefore missing in the complete dataset.

Since missing values should not simply be ignored, we need to handle them properly. Depending on the characteristics of the dataset and the task, we can choose to:

- Drop missing values
- Replace missing values
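Both options can be sketched on a small, hypothetical sample. Which columns to drop on and which placeholder to use ('Unknown' here) are assumptions for illustration, not decisions taken in this analysis.

```python
import pandas as pd

# Hypothetical sample with missing values in two columns
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Drama", None, "Comedy"],
    "gross_earn": ["$1.00M", "$2.00M", None],
})

# Total number of missing values across all columns
total_missing = int(sample.isnull().sum().sum())

# Option 1: drop rows that lack a value in a selected column
dropped = sample.dropna(subset=["gross_earn"])

# Option 2: replace missing values in a selected column with a placeholder
filled = sample.fillna({"genre": "Unknown"})

print(total_missing)
print(dropped)
print(filled)
```

Dropping is the simpler choice when few rows are affected, while replacing keeps all rows available for analyses that do not depend on the affected column.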

In [12]:
# Check for non-uniform values
df.tail()

Unnamed: 0,poster,title,certificate,runtime,genre,rating,about,director,stars,votes,gross_earn
4995,https://m.media-amazon.com/images/S/sash/4Fyxw...,Seven Brides for Seven Brothers,U,102 min,"Comedy, Drama, Musical",7.3,"In 1850 Oregon, when a backwoodsman brings a w...",Stanley Donen,"('Jane Powell',), ('Howard Keel',), ('Jeff Ric...",25112,$9.40M
4996,https://m.media-amazon.com/images/S/sash/4Fyxw...,Playing It Cool,15,94 min,"Comedy, Romance",5.9,Unrequited love motivates a guy to write about...,Justin Reardon,"('Chris Evans',), ('Michelle Monaghan',), ('To...",28515,
4997,https://m.media-amazon.com/images/S/sash/4Fyxw...,A Moment to Remember,144 min,"Drama, Romance",,8.1,A young couple's enduring love is tested when ...,John H. Lee,"('Jung Woo-sung',), ('Son Ye-jin',), ('Jong-ha...",23957,
4998,https://m.media-amazon.com/images/S/sash/4Fyxw...,Hairspray,PG,92 min,"Comedy, Drama, Family",7.0,A 'pleasantly plump' teenager teaches 1962 Bal...,John Waters,"('Sonny Bono',), ('Ruth Brown',), ('Divine',),...",21536,$6.67M
4999,https://m.media-amazon.com/images/S/sash/4Fyxw...,McCabe & Mrs. Miller,X,120 min,"Drama, Western",7.6,A gambler and a prostitute become business par...,Robert Altman,"('Warren Beatty',), ('Julie Christie',), ('Ren...",25097,$8.20M


<b>Findings</b>:
By taking a look at the last rows of the dataset, we can see that the variable `certificate` contains values that aren't uniform. For example, row number 4995 uses a letter (U) to specify the age restriction, whereas row number 4996 uses a number (15). Furthermore, this column contains values that have nothing to do with age restrictions: row number 4997 contains the `runtime` in place of the age restriction and the `genre` in place of the `runtime`.

For this reason, a way must be found to make the values of the column `certificate` consistent. Likewise, the remaining columns must be examined more closely to identify incorrect values.
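One possible way to locate such rows is sketched below. It assumes that a valid `runtime` always has the form '<digits> min' and that a shifted row can be recognised by this pattern appearing in `certificate`; the sample frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample modelled on the rows shown by df.tail()
sample = pd.DataFrame({
    "title": ["Seven Brides for Seven Brothers", "Playing It Cool",
              "A Moment to Remember", "Hairspray"],
    "certificate": ["U", "15", "144 min", "PG"],
    "runtime": ["102 min", "94 min", "Drama, Romance", "92 min"],
})

# Frequency of each certificate value gives a quick view of the different notations
print(sample["certificate"].value_counts())

# A runtime pattern inside `certificate` indicates shifted columns
shifted = sample["certificate"].str.fullmatch(r"\d+ min", na=False)

# A `runtime` value that does not match the pattern is suspicious as well
bad_runtime = ~sample["runtime"].str.fullmatch(r"\d+ min", na=False)

suspect = sample[shifted | bad_runtime]
print(suspect)
```

The flagged rows can then be inspected manually before deciding whether to repair or drop them.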

#### Further findings

In addition, new columns should be created for the variables `genre` and `stars`, as each entry contains several values (multiple genres or multiple persons). This will be important later in the more in-depth evaluation of findings.
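A rough sketch of how these columns could be split is given below. It assumes that genres are separated by ', ' and that `stars` holds a string of one-element tuples, as the overview suggests; the helper `parse_stars` and the column names `genre_list` and `stars_list` are hypothetical.

```python
import ast

import pandas as pd

# Hypothetical sample using the formats visible in the overview
sample = pd.DataFrame({
    "title": ["The Dark Knight", "Fight Club"],
    "genre": ["Action, Crime, Drama", "Drama"],
    "stars": ["('Christian Bale',), ('Heath Ledger',)",
              "('Brad Pitt',), ('Edward Norton',)"],
})


def parse_stars(raw):
    """Turn a string of one-element tuples into a flat list of names."""
    parsed = ast.literal_eval(raw)
    names = []
    for item in parsed:
        if isinstance(item, tuple):
            names.extend(item)  # several tuples: unpack each one
        else:
            names.append(item)  # a single tuple: items are already names
    return names


# One list per movie
sample["genre_list"] = sample["genre"].str.split(", ")
sample["stars_list"] = sample["stars"].apply(parse_stars)

# Long format: one row per movie and genre
genre_long = sample.explode("genre_list")[["title", "genre_list"]]
print(genre_long)
```

The long format makes it straightforward to group by individual genres later, while the list columns keep one row per movie.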