## Import libraries

In [1]:
import pandas as pd
import numpy as np
# visualization
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Collecting data

## What subject is your data about? What is the source of your data?

Our data covers movies and TV shows available on Netflix. It comes from [Kaggle](https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies?select=credits.csv).

## Do the authors of this data allow this kind of use?

Yes. The dataset is released under the CC0: Public Domain license, so we are allowed to use it this way.

## How did the author collect the data?

The author collected the data from the API of the JustWatch website, using only the `requests` package. ([Source](https://github.com/victor-soeiro/WebScraping-Projects/blob/main/08%20-%20justwatch/main.ipynb))

# 2. Exploring data

* How many rows and how many columns?
* What is the meaning of each row?
* Are there duplicated rows?
* What is the meaning of each column?
* What is the current data type of each column? Are there columns having inappropriate data types?
* With each numerical column, how are values distributed?
    * What is the percentage of missing values?
    * Min? max? Are they abnormal?
* With each categorical column, how are values distributed?
    * What is the percentage of missing values?
    * How many different values? Show a few
    * Are they abnormal?

### Load Dataset

In [2]:
credits_df = pd.read_csv('credits.csv')
titles_df = pd.read_csv('titles.csv')

### How many rows and how many columns?

In [3]:
credits_df.shape

(77801, 5)
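
Only the shape of the credits table is printed above. A minimal sketch for reporting both tables at once is shown below. The toy frames and the `report_shapes` helper are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for credits_df and titles_df (hypothetical data).
toy_credits = pd.DataFrame({"person_id": [1, 2, 3], "id": ["tm1", "tm1", "ts2"]})
toy_titles = pd.DataFrame({"id": ["tm1", "ts2"], "title": ["A", "B"], "runtime": [90, 45]})

def report_shapes(frames):
    """Return a small table with the row and column count of each frame."""
    return pd.DataFrame(
        {name: {"rows": df.shape[0], "columns": df.shape[1]} for name, df in frames.items()}
    ).T

shapes = report_shapes({"credits": toy_credits, "titles": toy_titles})
print(shapes)
```

With the real data, the call would be `report_shapes({'credits': credits_df, 'titles': titles_df})`.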

### What is the meaning of each row?

In [4]:
credits_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77801 entries, 0 to 77800
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   person_id  77801 non-null  int64 
 1   id         77801 non-null  object
 2   name       77801 non-null  object
 3   character  68029 non-null  object
 4   role       77801 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.0+ MB


In [5]:
titles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5850 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5850 non-null   object 
 1   title                 5849 non-null   object 
 2   type                  5850 non-null   object 
 3   description           5832 non-null   object 
 4   release_year          5850 non-null   int64  
 5   age_certification     3231 non-null   object 
 6   runtime               5850 non-null   int64  
 7   genres                5850 non-null   object 
 8   production_countries  5850 non-null   object 
 9   seasons               2106 non-null   float64
 10  imdb_id               5447 non-null   object 
 11  imdb_score            5368 non-null   float64
 12  imdb_votes            5352 non-null   float64
 13  tmdb_popularity       5759 non-null   float64
 14  tmdb_score            5539 non-null   float64
dtypes: float64(5), int64(2), object(8)

### Are there duplicated rows?

In [6]:
print('Duplicated rows in credits_df:', credits_df.duplicated().sum())
print('Duplicated rows in titles_df:', titles_df.duplicated().sum())

Duplicated rows in credits_df: 0
Duplicated rows in titles_df: 0
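
Called without arguments, `duplicated()` only flags rows that match on every column. The sketch below also checks a key column on toy data. It assumes that `id` is meant to identify a title uniquely, which the notebook does not state.

```python
import pandas as pd

# Toy titles table (hypothetical data): two rows share an id but differ in title.
toy_titles = pd.DataFrame({
    "id": ["tm1", "tm2", "tm2"],
    "title": ["A", "B", "B (re-release)"],
})

# Rows identical in every column.
full_row_dupes = toy_titles.duplicated().sum()

# Rows that repeat an id that already appeared earlier.
key_dupes = toy_titles.duplicated(subset=["id"]).sum()

print(full_row_dupes, key_dupes)
```

A full-row check can report zero duplicates even when a key column repeats, so both checks may be worth running.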


### What is the meaning of each column?

1. Credits column information


| Column name | Information                                   |
|-------------|-----------------------------------------------|
| person_id   | The ID of the person                          |
| id          | The ID of the title the person is credited in |
| name        | The name of the person                        |
| character   | The character played in the title             |
| role        | The role of the person (ACTOR or DIRECTOR)    |

2. Titles column information

| Column name          | Information                                              |
|----------------------|----------------------------------------------------------|
| id                   | The ID of the title                                      |
| title                | The name of the title                                    |
| type                 | The type of title (movie or TV show)                     |
| description          | A description of the title                               |
| release_year         | The year the title was released                          |
| age_certification    | The age certification of the title                       |
| runtime              | The duration of the title                                |
| genres               | The genres of the title                                  |
| production_countries | The countries that produced the title                    |
| seasons              | The number of seasons (TV shows only)                    |
| imdb_id              | The ID of the title on the Internet Movie Database (IMDb) |
| imdb_score           | The score of the title on IMDb                           |
| imdb_votes           | The number of votes for the title on IMDb                |
| tmdb_popularity      | The popularity of the title on The Movie Database (TMDB) |
| tmdb_score           | The score of the title on TMDB                           |


### What is the current data type of each column? Are there columns having inappropriate data types?


In [7]:
credits_df.dtypes

person_id     int64
id           object
name         object
character    object
role         object
dtype: object

In [8]:
titles_df.dtypes

id                       object
title                    object
type                     object
description              object
release_year              int64
age_certification        object
runtime                   int64
genres                   object
production_countries     object
seasons                 float64
imdb_id                  object
imdb_score              float64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score              float64
dtype: object

Convert release_year to datetime and person_id to string.

In [10]:
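# Parse the integer year explicitly; a direct cast would read it as nanoseconds since the epoch
titles_df['release_year'] = pd.to_datetime(titles_df['release_year'].astype(str), format='%Y')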
credits_df['person_id'] = credits_df['person_id'].astype('str')

After these conversions, the data type of each column is appropriate.
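
Integer years need to be parsed as years, because pandas reads bare integers as nanoseconds since 1970-01-01 when converting to datetime. A minimal sketch with toy values (the `years` series is illustrative, not taken from the dataset):

```python
import pandas as pd

years = pd.Series([1976, 2020], name="release_year")

# Without a format, integers are read as nanoseconds since 1970-01-01.
as_nanoseconds = pd.to_datetime(years)

# Parsing the year as text with an explicit format keeps the intended year.
as_years = pd.to_datetime(years.astype(str), format="%Y")

print(as_nanoseconds.dt.year.tolist())
print(as_years.dt.year.tolist())
```

A quick look at `.dt.year` after the conversion is a cheap way to confirm that the years survived.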

### With each numerical column, how are values distributed?

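In the titles dataset, we have 6 numerical columns: runtime, seasons, imdb_score, imdb_votes, tmdb_popularity, and tmdb_score (release_year was converted to datetime above). In the credits dataset, we have no numerical columns.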

In [11]:
# release_year is not included because it is no longer a numeric column
titles_df.describe(include=[np.number])

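|       | runtime   | seasons  | imdb_score | imdb_votes | tmdb_popularity | tmdb_score |
|-------|-----------|----------|------------|------------|-----------------|------------|
| count | 5850.0    | 2106.0   | 5368.0     | 5352.0     | 5759.0          | 5539.0     |
| mean  | 76.888889 | 2.162868 | 6.510861   | 23439.38   | 22.637925       | 6.829175   |
| std   | 39.002509 | 2.689041 | 1.163826   | 95820.47   | 81.680263       | 1.170391   |
| min   | 0.0       | 1.0      | 1.5        | 5.0        | 0.009442        | 0.5        |
| 25%   | 44.0      | 1.0      | 5.8        | 516.75     | 2.7285          | 6.1        |
| 50%   | 83.0      | 1.0      | 6.6        | 2233.5     | 6.821           | 6.9        |
| 75%   | 104.0     | 2.0      | 7.3        | 9494.0     | 16.59           | 7.5375     |
| max   | 240.0     | 42.0     | 9.6        | 2294231.0  | 2274.044        | 10.0       |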
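
The summary includes a minimum runtime of 0 and a maximum of 42 seasons, both of which are worth a closer look. The sketch below shows one way to pull out such rows on toy data. The threshold of 20 seasons is an illustrative assumption, not a rule from the notebook.

```python
import numpy as np
import pandas as pd

# Toy titles table (hypothetical data) with one zero runtime and one large season count.
toy_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [0, 95, 42, 120],
    "seasons": [np.nan, np.nan, 42.0, np.nan],
})

# Titles whose runtime is recorded as zero.
zero_runtime = toy_titles[toy_titles["runtime"] == 0]

# Titles with an unusually high number of seasons (threshold is an assumption).
many_seasons = toy_titles[toy_titles["seasons"] > 20]

print(zero_runtime["title"].tolist(), many_seasons["title"].tolist())
```

Inspecting the matching rows in the real data would show whether these values are data-entry problems or genuine extremes.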


### What is the percentage of missing values?

In [12]:

missing_value_titles_df = pd.DataFrame(
    {'The_percentage_of_missing_value': (titles_df.isnull().sum()/titles_df.shape[0]) * 100}
)
missing_value_titles_df

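|                      | The_percentage_of_missing_value |
|----------------------|---------------------------------|
| id                   | 0.0                             |
| title                | 0.017094                        |
| type                 | 0.0                             |
| description          | 0.307692                        |
| release_year         | 0.0                             |
| age_certification    | 44.769231                       |
| runtime              | 0.0                             |
| genres               | 0.0                             |
| production_countries | 0.0                             |
| seasons              | 64.0                            |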


In [13]:
missing_value_credits_df = pd.DataFrame(
    {'The_percentage_of_missing_value': (credits_df.isnull().sum()/credits_df.shape[0]) * 100}
).T
missing_value_credits_df

|                                 | person_id | id  | name | character | role |
|---------------------------------|-----------|-----|------|-----------|------|
| The_percentage_of_missing_value | 0.0       | 0.0 | 0.0  | 12.56025  | 0.0  |
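
The percentage of missing values is the share of null entries per column, so `isnull().mean() * 100` gives the same result as dividing the null count by the number of rows. A small sketch on toy data (the `toy` frame is illustrative):

```python
import pandas as pd

# Toy credits table (hypothetical data): half of the character values are missing.
toy = pd.DataFrame({
    "name": ["x", "y", "z", "w"],
    "character": ["a", None, None, "b"],
})

# Null count divided by the number of rows, as in the cells above.
by_sum = toy.isnull().sum() / toy.shape[0] * 100

# Equivalent shortcut: the mean of a boolean mask is the fraction of True values.
by_mean = toy.isnull().mean() * 100

print(by_mean)
```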


### With each categorical column, how are values distributed?

In [14]:
credits_nonnumeric_df = credits_df.select_dtypes(exclude=[np.number])
credits_nonnumeric_df


|       | person_id | id        | name               | character               | role  |
|-------|-----------|-----------|--------------------|-------------------------|-------|
| 0     | 3748      | tm84618   | Robert De Niro     | Travis Bickle           | ACTOR |
| 1     | 14658     | tm84618   | Jodie Foster       | Iris Steensma           | ACTOR |
| 2     | 7064      | tm84618   | Albert Brooks      | Tom                     | ACTOR |
| 3     | 3739      | tm84618   | Harvey Keitel      | Matthew 'Sport' Higgins | ACTOR |
| 4     | 48933     | tm84618   | Cybill Shepherd    | Betsy                   | ACTOR |
| ...   | ...       | ...       | ...                | ...                     | ...   |
| 77796 | 736339    | tm1059008 | Adelaida Buscato   | María Paz               | ACTOR |
| 77797 | 399499    | tm1059008 | Luz Stella Luengas | Karen Bayona            | ACTOR |
| 77798 | 373198    | tm1059008 | Inés Prieto        | Fanny                   | ACTOR |
| 77799 | 378132    | tm1059008 | Isabel Gaona       | Cacica                  | ACTOR |


In [15]:
non_numeric_col_info_credit = pd.DataFrame({
     'missing_percentage': ((credits_nonnumeric_df.isnull().sum()/credits_nonnumeric_df.shape[0]) * 100),
     # Note: this counts distinct complete rows, so every column gets the same number;
     # credits_nonnumeric_df.nunique() would give the distinct values per column instead.
     'num_values': len(credits_nonnumeric_df.value_counts().keys()),
     'value_counts': {'person_id': pd.Series(credits_nonnumeric_df['person_id'].value_counts().to_dict()),
                      'id': pd.Series(credits_nonnumeric_df['id'].value_counts().to_dict()),
                      'name': pd.Series(credits_nonnumeric_df['name'].value_counts().to_dict()),
                      'character': pd.Series(credits_nonnumeric_df['character'].value_counts().to_dict()),
                      'role': pd.Series(credits_nonnumeric_df['role'].value_counts().to_dict())}}).T
non_numeric_col_info_credit

|                    | person_id | id | name | character | role |
|--------------------|-----------|----|------|-----------|------|
| missing_percentage | 0.0 | 0.0 | 0.0 | 12.56025 | 0.0 |
| num_values         | 68029 | 68029 | 68029 | 68029 | 68029 |
| value_counts       | 38636 25 48004 25 38632 23 138 ... | tm32982 208 tm244149 174 tm39888 ... | Boman Irani 25 Kareena Kapoor Khan ... | Self 1950 Himself ... | ACTOR 73251 DIRECTOR 4550 dtype: int64 |
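
The `num_values` row reports 68029 for every column, although `role` only shows two values (ACTOR and DIRECTOR). This is consistent with `DataFrame.value_counts()` counting distinct complete rows instead of distinct values per column. A per-column count can be sketched with `nunique()`, shown here on toy data (the `toy_credits` frame is illustrative):

```python
import pandas as pd

# Toy credits table (hypothetical data) with one missing character.
toy_credits = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "character": ["Self", None, "Hero", "Self"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR", "ACTOR"],
})

# Number of distinct rows; rows containing a missing value are dropped by default.
distinct_rows = len(toy_credits.value_counts())

# Number of distinct non-null values in each column.
distinct_per_column = toy_credits.nunique()

print(distinct_rows)
print(distinct_per_column)
```

On the real data, `credits_nonnumeric_df.nunique()` would be the corresponding call.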


# 3. Asking meaningful questions

# 4. Preprocessing and analyzing data to answer each question

# 5. Reflection

# 6. References