# Netflix Movies Project

This project has two main parts. First, I will do an in-depth analysis and visualization of the data. Second, I will create a recommendation system for a user.

**Part One**
- Data Preprocessing
- Data Cleaning
- Exploratory Data Analysis

In [62]:
# import libraries
import numpy as np
import pandas as pd

In [63]:
# read data
data = pd.read_csv("./netflix_titles.csv")
print(f"Data has {data.shape[0]} rows and {data.shape[1]} columns.")
print(data.dtypes)
data.head(2)

Data has 8807 rows and 12 columns.
show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."


In [64]:
# check null values
data.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

**Missingness in Data**

There are 3 types of missing data:
- Missing Completely at Random (MCAR): every variable and observation has the same chance of being missing. Whether a value is missing is unrelated to the data, observed or not.
- Missing at Random (MAR): the probability of being missing depends only on data we have observed, so it is the same within a given group defined by that data. e.g. in a survey where we recorded which students are weak in math, the weaker students skip more of the math questions. The probability of a value being missing is linked to a group we can identify.
- Missing Not at Random (MNAR): the probability of being missing depends on the missing value itself or on factors we did not record. e.g. in a survey, people with weaker opinions respond less often. This is different from MAR because we know which students are weak in math, but we don't know which people have weaker opinions (because they did not answer the survey).
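To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the `survey` frame, the `weak_in_math` flag, the score distributions, and the missingness probabilities are assumptions, not part of the Netflix data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# hypothetical survey: an observed group flag and a score that may go unrecorded
weak_in_math = rng.random(n) < 0.3
score = np.where(weak_in_math, rng.normal(40, 10, n), rng.normal(70, 10, n))
survey = pd.DataFrame({"weak_in_math": weak_in_math, "score": score})

# MCAR: every score has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of being missing depends only on the observed group flag
mar_mask = rng.random(n) < np.where(weak_in_math, 0.5, 0.1)

# MNAR: the chance of being missing depends on the score itself
mnar_mask = rng.random(n) < np.where(score < 50, 0.5, 0.1)

# compare the mean of the values that remain with the true mean
true_mean = survey["score"].mean()
observed_means = {
    name: survey["score"].mask(mask).mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}
print(f"true mean: {true_mean:.2f}")
for name, value in observed_means.items():
    print(f"{name}: observed mean {value:.2f}")
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means drift upward because low scores go missing more often. Under MAR the drift can be corrected by working within each `weak_in_math` group. Under MNAR there is no observed column to correct with.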

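Ways to handle missing data (numerical)
- Imputation: mean, median, mode, or ffill and bfill (time-series data)
- Delete rows (not recommended in general; unbiased under MCAR, but can bias results under MAR unless the analysis accounts for the variables that drive the missingness)
- Delete the column (if the proportion of missing values exceeds a threshold)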
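The imputation options listed above can be sketched with pandas as follows. The series `s` is a hypothetical numeric column made up for this example. It is not taken from `netflix_titles.csv`.

```python
import numpy as np
import pandas as pd

# hypothetical numeric column with two gaps
s = pd.Series([1.0, np.nan, 3.0, 3.0, np.nan, 10.0])

# fill with a summary statistic of the observed values
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())
mode_filled = s.fillna(s.mode().iloc[0])  # mode() returns a Series; take the first mode

# for ordered (time-series) data, carry a neighbouring value across the gap
ffilled = s.ffill()  # previous observed value
bfilled = s.bfill()  # next observed value

# deleting rows simply removes the gaps
rows_dropped = s.dropna()
```

The mean is pulled toward the outlier `10.0`, while the median and mode are not, which is one reason to check the distribution before choosing a fill value.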

**Drop 'Director'**

The `director` column is dropped because about 30% of its values are missing (2,634 of 8,807 rows), well above the 10% mark. Imputing that many values would skew the distribution of the data points. Furthermore, a director is not something you can guess from the other columns, although you can look the director up online. A possible solution is a script that scrapes or queries an online source to find the director of each title, but I will leave this as an enhancement.
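A threshold rule like this can be checked programmatically. The sketch below uses a small made-up frame, `toy`, rather than the real data, and the 10% `threshold` is the figure from the text above.

```python
import numpy as np
import pandas as pd

# hypothetical 20-row frame: 'director' is 30% missing, 'rating' is 5% missing
toy = pd.DataFrame({
    "title": [f"t{i}" for i in range(20)],
    "director": [np.nan] * 6 + ["x"] * 14,
    "rating": [np.nan] * 1 + ["PG"] * 19,
})

# share of missing values per column (the mean of a boolean mask is a proportion)
missing_share = toy.isna().mean()

threshold = 0.10
cols_to_drop = missing_share[missing_share > threshold].index.tolist()

# drop() returns a new frame; toy itself is left unchanged
reduced = toy.drop(columns=cols_to_drop)
print(missing_share)
print("dropped:", cols_to_drop)
```

The same two lines, `data.isna().mean()` and the comparison against `threshold`, should work on the full dataset.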

In [65]:
# drop 'Director' column
data.drop(['director'], inplace = True, axis = 1)

In [66]:
data.isna().sum()

show_id           0
type              0
title             0
cast            825
country         831
date_added       10
release_year      0
rating            4
duration          3
listed_in         0
description       0
dtype: int64

**Cast and Country**

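Look for patterns between cast and country.

Attempt to impute country by cast:
1. Go through all the cast members and tag each one with the countries of the titles they appear in.
2. For a title with a missing country, guess the country that is most common among its cast.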

In [67]:
# drop rows where 'cast' is missing
print(f"Original: {data.shape[0]} rows, {data.shape[1]} columns")
# .copy() so later in-place edits do not trigger SettingWithCopyWarning
cast_country_cleaned = data.dropna(subset=['cast']).copy()
print(f"Cleaned: {cast_country_cleaned.shape[0]} rows, {cast_country_cleaned.shape[1]} columns")
cast_country_cleaned.head()

Original: 8807 rows, 11 columns
Cleaned: 7982 rows, 11 columns


Unnamed: 0,show_id,type,title,cast,country,date_added,release_year,rating,duration,listed_in,description
1,s2,TV Show,Blood & Water,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
4,s5,TV Show,Kota Factory,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,TV Show,Midnight Mass,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
6,s7,Movie,My Little Pony: A New Generation,"Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...


In [60]:
# create dictionary mapping 'cast_name' -> list of 'countries'
def cast_to_country(df):
    """
    Description of function
    - Function maps each cast member's name to the countries of the titles they have acted in.
    - Used to guess which country a movie belongs to by looking at the countries its cast has acted in.
    - If the majority of the cast are from a certain country, say China, then the movie/tv show is likely to be filmed in China.

    Input
    - df: dataframe with 'cast' and 'country' columns

    Output
    - dict: {key: value} == {cast_name: [country_names]}
    """
    d = {}
    for idx, row in df.iterrows():
        # skip rows where either value is missing
        if not pd.isna(row.country) and not pd.isna(row.cast):
            cast_lst = row.cast.split(", ")
            for cast in cast_lst:
                if cast not in d:
                    d[cast] = []
                d[cast].append(row.country)
    return d

x = cast_to_country(cast_country_cleaned)

# check if every cast only has one country or more than one country
# new_d = {}
# for k, v in x.items():
#     new_d[k] = len(v)

# new_d

In [61]:
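# guess which country the movie is made in based on its cast
from collections import Counter

def fillCountry(df):
    """
    Description
    - Function guesses which country the movie/tv show is made in by the casts acting in it
    - The guess is the country that appears most often across the cast's other titles

    Input
    - df: dataframe with 'cast' and 'country' columns

    Output
    - None, but df changed inplace
    """
    d = cast_to_country(df)
    for idx, row in df.iterrows():
        # only fill rows where country is missing and cast is known
        if not pd.isna(row.country) or pd.isna(row.cast):
            continue
        country_count = Counter()
        for cast in row.cast.split(", "):
            # assumption: an entry such as "United States, India" counts once per country
            for entry in d.get(cast, []):
                country_count.update(c for c in entry.split(", ") if c)
        if country_count:
            # ties resolve to the country that was counted first
            df.at[idx, "country"] = country_count.most_common(1)[0][0]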
