<h2>Dataset Content</h2>
<p>This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.</p>

<p>In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period. It will be interesting to explore what other insights can be obtained from the same dataset.</p>

<p>Inspiration</p>
<ul>
    <li>Understanding what content is available in different countries</li>
    <li>Identifying similar content by matching text-based features</li>
    <li>Network analysis of actors / directors to find interesting insights</li>
    <li>Is Netflix increasingly focusing on TV rather than movies in recent years?</li>
</ul>

In [2]:
import pandas as pd
import numpy as np

In [57]:
df = pd.read_csv("netflix_titles.csv")
df.head(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...


In [4]:
# how many rows and columns are in this dataset?


In [6]:
df.shape

(8807, 12)

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          8807 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [20]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [None]:
# look at some information about the data and check whether there are nulls


Some questions you should ask yourself:
<br>
1- Are there any duplicates?
<br>
2- What about the nulls?
<br>
3- Do all columns have a correct format in their values? If not, how should you improve it?
<br>
4- Datatypes?
<br>
5- Before starting, after seeing some info about the dataset and from a first look at it, which columns do you think will not be necessary? (i.e., which columns would be better to drop?)
<br>
Feel free to write only their names in the next cell.

Double Click here to start writing
1. 
2. 
3. 
4. 

In [42]:
# 

In [8]:
# show the number of duplicates here
df.duplicated().sum()



0
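A count of 0 here only means that no two rows are identical in every column. Since show_id is unique per row, it can hide repeated titles. The sketch below, on a tiny made-up sample (not the real data), shows how a subset check might look; the choice of `type` and `title` as the subset is an assumption.

```python
import pandas as pd

# Made-up sample for illustration; in the notebook this would be df.
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Example Title", "Example Title", "Another Title"],
})

# No fully identical rows, because show_id differs on every row.
full_row_duplicates = int(sample.duplicated().sum())

# Ignoring show_id reveals the repeated (type, title) pair.
content_duplicates = int(sample.duplicated(subset=["type", "title"]).sum())

print(full_row_duplicates, content_duplicates)
```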

In [9]:
# show number of nulls
df.isnull().sum() 


show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
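Raw counts are easier to judge as a share of all rows. A minimal sketch on a made-up sample (the column values are invented for illustration) that ranks columns by their percentage of missing values:

```python
import numpy as np
import pandas as pd

# Made-up sample for illustration; in the notebook this would be df.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "Japan", np.nan, "India"],
})

# isnull().mean() is the fraction of missing values per column;
# multiply by 100 for a percentage and sort the worst columns first.
null_share = (sample.isnull().mean() * 100).round(1).sort_values(ascending=False)

print(null_share)
```

Columns with a large share of nulls (director in the real data) may deserve a placeholder or a drop rather than a mode fill.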

In [14]:
# the cast column has many nulls; replace the nulls with "Unknown"
# make a new column that describes the number of people in the cast (i.e., how many people are in the cast? if it is unknown, make it 0)
# hint --> write a separate function and use the apply method



In [17]:
# assign the result back instead of calling fillna with inplace=True on a column selection
df["cast"] = df["cast"].fillna('Unknown')

In [18]:
df['cast'].head(2)

0                                              Unknown
1    Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...
Name: cast, dtype: object

In [43]:
def people_in_cast(cast):
    # nulls were already replaced with 'Unknown', so count that placeholder as an empty cast
    if pd.isnull(cast) or cast == 'Unknown':
        return 0
    # names are comma-separated, so the number of pieces is the number of people
    return len(cast.split(','))

In [44]:
df['num_people_in_cast'] = df['cast'].apply(people_in_cast)

In [45]:
df[['cast', 'num_people_in_cast']]

Unnamed: 0,cast,num_people_in_cast
0,Unknown,0
1,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",19
2,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",9
3,Unknown,0
4,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",8
...,...,...
8802,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",10
8803,Unknown,0
8804,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",7
8805,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",9
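The exercise asks for a count of 0 when the cast is unknown. As an alternative to a row-by-row apply, the same count can be sketched with vectorized string methods. The sample below is made up, and it assumes nulls have already been replaced with the "Unknown" placeholder.

```python
import pandas as pd

# Made-up sample for illustration; in the notebook this would be df.
sample = pd.DataFrame({
    "cast": ["Unknown", "Ama Qamata, Khosi Ngema, Gail Mabalane", "Tim Allen"],
})

# Number of commas + 1 gives the number of names in each cast string.
counts = sample["cast"].str.count(",") + 1

# Keep the count where the cast is known; use 0 for the "Unknown" placeholder.
sample["num_people_in_cast"] = counts.where(sample["cast"] != "Unknown", 0)

print(sample)
```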


Now it's time to get rid of these nulls.

In [None]:
# let's start with date_added: see the number of nulls in it, replace these nulls with the mode of this column, and finally
# convert this column to a suitable datetime format (hint --> use the fillna method)



In [46]:
df['date_added'].isnull().sum()

10

In [47]:
date_added_mode = df['date_added'].mode().iloc[0]

In [48]:
# assign the result back instead of calling fillna with inplace=True on a column selection
df['date_added'] = df['date_added'].fillna(date_added_mode)


In [49]:
# errors='coerce' turns strings that cannot be parsed into NaT instead of raising an error
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

In [51]:
df[['date_added']]

Unnamed: 0,date_added
0,2021-09-25
1,2021-09-24
2,2021-09-24
3,2021-09-24
4,2021-09-24
...,...
8802,2019-11-20
8803,2019-07-01
8804,2019-11-01
8805,2020-01-11
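Because errors='coerce' silently produces NaT, it is worth checking how many values failed to parse. The sketch below uses a made-up sample and assumes that some date strings might carry stray leading whitespace (an assumption, not something shown in the output above). It strips the strings, fills with the mode, parses with an explicit format, and counts what is left unparsed.

```python
import pandas as pd

# Made-up sample for illustration; in the notebook this would be df.
sample = pd.DataFrame({
    "date_added": ["September 25, 2021", " August 4, 2017", None, "September 25, 2021"],
})

# Remove stray whitespace first so it cannot break parsing or distort the mode.
cleaned = sample["date_added"].str.strip()

# Fill missing values with the most frequent date string.
cleaned = cleaned.fillna(cleaned.mode().iloc[0])

# An explicit format documents the expected layout, e.g. "September 25, 2021".
sample["date_added"] = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Anything that still failed to parse shows up as NaT here.
n_unparsed = int(sample["date_added"].isna().sum())

print(sample, n_unparsed)
```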


In [22]:
# now it's time for country
# when you look closer at the dataset you will find that most of the null values in country have the value "Anime" in the listed_in column
# make a function that checks if "Anime" is in the listed_in column
# and if it is, replace the null in the country column of this row with "Japan"
# if it is not, replace the null with the most frequent value (i.e., the mode)
# I will give you a starting structure

In [56]:
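# compute the mode once, rather than once per row inside the function
country_mode = df["country"].mode().iloc[0]

def country_null(data):
    # data is a single row; only rows with a missing country are changed
    if pd.isnull(data["country"]):
        if "Anime" in data["listed_in"]:
            data["country"] = "Japan"
        else:
            data["country"] = country_mode
    return data

df = df.apply(country_null, axis=1)
df[['country']]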


Unnamed: 0,country
0,United States
1,South Africa
2,United States
3,United States
4,India
...,...
8802,United States
8803,United States
8804,United States
8805,United States
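Applying a function to every row works, but it can be slow on larger frames. As a rough alternative, the same rule can be sketched with boolean masks. The sample is made up for illustration, and the mask names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample for illustration; in the notebook this would be df.
sample = pd.DataFrame({
    "country": ["United States", np.nan, np.nan, "United States", "India"],
    "listed_in": ["Dramas", "Anime Series, Kids' TV", "Crime TV Shows", "Comedies", "Dramas"],
})

# Compute the fill value and both masks before changing anything.
country_mode = sample["country"].mode().iloc[0]
missing = sample["country"].isnull()
is_anime = sample["listed_in"].str.contains("Anime", na=False)

# Missing country + Anime genre -> Japan; every other missing country -> mode.
sample.loc[missing & is_anime, "country"] = "Japan"
sample.loc[missing & ~is_anime, "country"] = country_mode

print(sample)
```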


In [None]:
# director, duration and rating columns: fill the nulls with the mode
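One way this step might look, sketched on a made-up sample (the values are invented for illustration): loop over the three columns and fill each with its own most frequent value.

```python
import numpy as np
import pandas as pd

# Made-up sample for illustration; in the notebook this would be df.
sample = pd.DataFrame({
    "director": ["A", np.nan, "A", "B"],
    "duration": ["90 min", "90 min", np.nan, "1 Season"],
    "rating": ["TV-MA", "TV-MA", "PG", np.nan],
})

for column in ["director", "duration", "rating"]:
    # mode() returns a Series, so take its first entry as the fill value.
    most_common = sample[column].mode().iloc[0]
    sample[column] = sample[column].fillna(most_common)

remaining_nulls = int(sample.isnull().sum().sum())

print(sample, remaining_nulls)
```

For a column with many nulls, such as director, a mode fill credits one person with many titles, so a placeholder like "Unknown" may be a more honest choice.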

<p>Let's now google the rating categories and explore them</p>
<ul>
    <li>TV-MA: This program is specifically designed to be viewed by adults and therefore may be unsuitable for children under 17.</li>
    <li>TV-14: This program contains some material that many parents would find unsuitable for children under 14 years of age.</li>
    <li>TV-PG: This program contains material that parents may find unsuitable for younger children.</li>
    <li>R: Under 17 requires accompanying parent or adult guardian. Parents are urged to learn more about the film before taking their young children with them.</li>
    <li>PG-13: Some material may be inappropriate for children under 13. Parents are urged to be cautious. Some material may be inappropriate for pre-teenagers.</li>
    <li>NR or UR: Used when a film has not been submitted for a rating or is an uncut version of a film that was submitted.</li>
    <li>PG: Some material may not be suitable for children. May contain some material parents might not like for their young children.</li>
    <li>TV-Y7: This program is designed for children age 7 and above.</li>
    <li>TV-G: This program is suitable for all ages.</li>
    <li>TV-Y: Programs rated TV-Y are designed to be appropriate for children of all ages. The thematic elements portrayed in programs with this rating are specifically designed for a very young audience, including children ages 2-6.</li>
    <li>TV-Y7-FV: Recommended for ages 7 and older, with the unique advisory that the program contains fantasy violence.</li>
    <li>G: All ages admitted. Nothing that would offend parents for viewing by children.</li>
    <li>NC-17: No one 17 and under admitted. Clearly adult. Children are not admitted.</li>
</ul>

<p>Here we discover that UR and NR are the same rating (Unrated, Not Rated).<br>Uncut/extended versions of films that are labeled "Unrated" also contain warnings saying that the uncut version of the film contains content that differs from the theatrical release and might not be suitable for minors.<br>So we have to fix this.</p>

In [105]:
# in the rating column, UR and NR mean the same thing (Unrated and Not Rated), so fix this
df['rating'] = df['rating'].replace('UR', 'NR')


In [106]:
df['rating'].unique()

array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R',
       'TV-G', 'G', 'NC-17', '74 min', '84 min', '66 min', 'NR', nan,
       'TV-Y7-FV'], dtype=object)
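The unique values above include '74 min', '84 min' and '66 min', which look like runtimes rather than ratings. Since the duration column has exactly 3 nulls, one plausible reading (an assumption, not confirmed here) is that these values slipped into the wrong column. A minimal sketch on a made-up sample of how they might be moved back:

```python
import numpy as np
import pandas as pd

# Made-up sample for illustration; in the notebook this would be df.
sample = pd.DataFrame({
    "rating": ["PG-13", "74 min", "TV-MA"],
    "duration": ["90 min", np.nan, "2 Seasons"],
})

# Rows whose rating looks like a runtime, e.g. "74 min".
misplaced = sample["rating"].str.fullmatch(r"\d+ min", na=False)

# Move the runtime into duration only where duration is empty.
move = misplaced & sample["duration"].isnull()
sample.loc[move, "duration"] = sample.loc[move, "rating"]

# The rating of those rows is then unknown, so mark it as missing.
sample.loc[misplaced, "rating"] = np.nan

print(sample)
```

After this step the affected ratings are null and can be filled like the other missing ratings.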

In [82]:
# at the end of this notebook, some columns look unnecessary,
# what do you think we should do with something like the show_id column?
# feel free to do the same for the columns you think are not necessary, and please write an explanation of why you think they are not important


In [101]:
# I dropped some columns like description, show_id and release_year; as a user I think these data are not important
df.columns

Index(['type', 'title', 'director', 'cast', 'country', 'date_added', 'rating',
       'duration', 'listed_in'],
      dtype='object')
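The drop call itself is not shown in the cell above. A minimal sketch of how the three columns named in the comment might be removed, using a made-up one-row sample:

```python
import pandas as pd

# Made-up sample for illustration; in the notebook this would be df.
sample = pd.DataFrame({
    "show_id": ["s1"],
    "type": ["Movie"],
    "title": ["Dick Johnson Is Dead"],
    "release_year": [2020],
    "description": ["Sample description text"],
})

columns_to_drop = ["show_id", "release_year", "description"]

# errors="ignore" lets the cell be re-run without raising if the columns are already gone.
trimmed = sample.drop(columns=columns_to_drop, errors="ignore")

print(trimmed.columns.tolist())
```

Whether release_year is really unnecessary depends on the questions asked later; it would be needed, for example, to compare movies and TV shows over time.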

In [102]:
df = df.drop('director',axis=1)

In [103]:
df.head(2)

Unnamed: 0,type,title,cast,country,date_added,rating,duration,listed_in
0,Movie,Dick Johnson Is Dead,,United States,"September 25, 2021",PG-13,90 min,Documentaries
1,TV Show,Blood & Water,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries"


In [104]:
df.columns

Index(['type', 'title', 'cast', 'country', 'date_added', 'rating', 'duration',
       'listed_in'],
      dtype='object')