# Exploratory Data Analysis 1 Project

### Turning Data Into Insights

This project inspects Netflix's library to expose patterns in content type, genres, countries, and release trends—transforming raw data into meaningful information through exploratory data analysis.

##### This Project Will Cover

* Which genres are most common in movies vs. TV shows
* How the addition of new content has evolved over time
* Which countries dominate certain genres
* Patterns in ratings based on the content type
* “The global reach of Netflix”
* “Netflix through the years”
* “Is Netflix family-friendly?”

This analysis explores the rich landscape of content available on Netflix, one of the world's leading streaming entertainment services. The dataset contains detailed information about movies and TV shows in the Netflix catalog, allowing us to uncover patterns and insights about content distribution, genres, ratings, and more.

##### Dataset Overview
> The analysis uses a comprehensive dataset of Netflix movies and TV shows that includes:

* Content types (Movies vs. TV Shows)
* Titles and descriptions
* Release years
* Content ratings (TV-MA, PG-13, etc.)
* Duration (minutes for movies, seasons for TV shows)
* Genres/categories
* Cast and director information
* Country of origin
* Date added to Netflix

##### Project Objectives
*This exploratory data analysis aims to:*

1. *Understand content distribution:* Analyze the balance between movies and TV shows, and how this has evolved over time
2. *Explore content ratings:* Investigate the distribution of ratings across different genres and content types
3. *Analyze genre preferences:* Identify the most common genres and their characteristics
4. *Examine regional content strategies:* Explore how content varies by country of origin
5. *Investigate temporal patterns:* Analyze when content is added to Netflix and release year trends

*GitHub:* https://github.com/imelinc

#### --- IMPORTING LIBRARIES ---

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

#### --- LOAD THE DATASET ---

In [2]:
df = pd.read_csv("data/netflix_titles.csv")

In [3]:
# Let's take a look at the dataset
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


As we can see, the dataset has many rows and several columns that will be handy for our exploratory data analysis. Now let's see what is going on with the *MISSING DATA*.

#### --- MISSING DATA ---

In [None]:
# First we get the info of the dataframe to see how many non-null values each column has and what data types we are dealing with
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


So we have 8807 rows, and some columns have missing values. Let's find out how many values are missing in each column.

In [None]:
df.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

That is how many values are missing in each column. None of the columns is missing more than 80% of its 8807 values (the worst, `director`, is missing 2634, which is about 30%), so dropping those columns is not justified.
The simplest way to handle these missing values is to fill them with a placeholder, such as the word `"Unknown"`. Let's get to it...
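
To make the comparison against the 80% threshold concrete, here is a small sketch that turns the counts printed above into percentages. The counts and the row total are copied from the outputs above; the variable names are only illustrative.

```python
import pandas as pd

# Missing-value counts copied from the `df.isna().sum()` output above
missing_counts = pd.Series(
    {"director": 2634, "cast": 825, "country": 831,
     "date_added": 10, "rating": 4, "duration": 3}
)
total_rows = 8807  # row count reported by df.info()

# Share of missing values per column, as a percentage
missing_pct = (missing_counts / total_rows * 100).round(1)
print(missing_pct.sort_values(ascending=False))

# On the real dataframe the same figures should come from:
# (df.isna().mean() * 100).round(1)
```

The column with the most gaps, `director`, is missing roughly 30% of its values, which is well under the 80% bar.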

In [6]:
# Filling the missing values with 'Unknown'
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
df['date_added'] = df['date_added'].fillna('Unknown')
df['rating'] = df['rating'].fillna('Unknown')
df['duration'] = df['duration'].fillna('Unknown')
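
One side effect of this fill is worth keeping in mind: `date_added` now mixes real dates with the string `'Unknown'`, so any later date parsing has to tolerate that placeholder. A minimal sketch on toy values, assuming the dates follow the "Month day, year" pattern seen in the table above and that some values may carry stray whitespace:

```python
import pandas as pd

# Toy stand-in for df['date_added'] after the fill (values are illustrative)
date_added = pd.Series(["September 25, 2021", " July 1, 2019", "Unknown"])

# Strip stray whitespace, then parse; anything that does not match the
# format (such as 'Unknown') becomes NaT instead of raising an error
parsed = pd.to_datetime(date_added.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed.dt.year)
```

With `errors="coerce"`, the placeholder rows end up as `NaT` and can be dropped or counted separately in any time-based analysis.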

After filling, let's look at our dataframe again and check whether any missing values remain

In [7]:
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Unknown,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,Unknown,Unknown,Unknown,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,Unknown,Unknown,Unknown,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


In [8]:
df.isna().sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

Now that the missing values are handled, we can get to work on the project itself.

#### --- GENRE DISTRIBUTION --- 

In this section, we explore how different genres are distributed across Netflix's movies and TV shows. This analysis shows where the catalog is concentrated, which hints at Netflix's content priorities and, indirectly, at the audience preferences it is trying to serve.
The visualizations below compare the most popular genres in movies versus TV shows, highlighting the differences in content strategy between these formats.
By understanding these genre patterns, we gain a useful perspective on how Netflix caters to diverse viewer interests.

In [9]:
# Filter the DataFrame for movies only
movies_df = df[df['type'] == 'Movie'].copy()
# Filter the dataframe for tv shows only
tv_shows_df = df[df['type'] == 'TV Show'].copy()
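
A quick sanity check for this kind of split is to confirm that the two subsets together account for every row. The sketch below uses a toy dataframe (the rows are made up) rather than the real dataset:

```python
import pandas as pd

# Toy stand-in for the Netflix dataframe (rows are illustrative)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
})

movies = toy[toy["type"] == "Movie"].copy()
shows = toy[toy["type"] == "TV Show"].copy()

# Every row should land in exactly one of the two subsets
assert len(movies) + len(shows) == len(toy)
print(toy["type"].value_counts())
```

If the assertion failed on the real data, it would mean the `type` column contains a value other than `'Movie'` or `'TV Show'`.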

First we're going to define some helper functions that will be useful not only in this section, but in upcoming ones too

In [10]:
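def expand_dataframe(df, column):
    """
    Expand a comma-separated column so that each value gets its own row
    Args:
        df (pandas.DataFrame): dataset
        column (str): name of the column we want to expand
    Returns:
        pandas.DataFrame: new dataframe with one row per value (the input is left unchanged)
    """
    # Split the strings into lists on a copy, so the caller's dataframe is not modified
    # (modifying it in place would break a second call on the same column)
    df_split = df.assign(**{column: df[column].str.split(",")})
    # Turn each list element into its own row
    df_exploded = df_split.explode(column)
    # Remove the whitespace left around each value by the split
    df_exploded[column] = df_exploded[column].str.strip()

    return df_exploded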
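
To see what the split, explode and strip steps do, here is a small standalone sketch on two rows taken from the table shown earlier. It repeats the steps inline instead of calling the function, so it can be run on its own:

```python
import pandas as pd

# Two rows from the dataset preview above, reduced to two columns
toy = pd.DataFrame({
    "title": ["Zombieland", "Zoom"],
    "listed_in": ["Comedies, Horror Movies", "Children & Family Movies, Comedies"],
})

# Same three steps as expand_dataframe: split, explode, strip
exploded = toy.assign(listed_in=toy["listed_in"].str.split(",")).explode("listed_in")
exploded["listed_in"] = exploded["listed_in"].str.strip()

print(exploded)
```

Each title now appears once per genre, and the original index is repeated for rows that came from the same title.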

In [11]:
def get_count(df, column):
    """
    Count how many times each value appears in a comma-separated column
    Args:
        df (pandas.DataFrame): dataset
        column (str): name of the column whose values we want to count
    Returns:
        pandas.Series: number of occurrences per value, sorted from most to least frequent
    """
    # Expand the dataframe
    new_df = expand_dataframe(df, column)
    # Count the number of occurrences of each value
    value_counts = new_df[column].value_counts()
    
    return value_counts
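
As a rough illustration of the kind of result this counting step produces, the sketch below uses a hypothetical compact stand-in, `count_values`, on three genre strings taken from the table shown earlier. It is not the notebook's function, just an equivalent one-liner for demonstration:

```python
import pandas as pd

def count_values(df, column):
    """Hypothetical compact stand-in for get_count; does not modify df."""
    values = df[column].str.split(",").explode().str.strip()
    return values.value_counts()

# Genre strings copied from the dataset preview above
toy = pd.DataFrame({
    "listed_in": [
        "Comedies, Horror Movies",
        "Children & Family Movies, Comedies",
        "Cult Movies, Dramas, Thrillers",
    ],
})

counts = count_values(toy, "listed_in")
print(counts.head(3))
```

Because `value_counts` sorts in descending order, `.head(n)` gives the `n` most common values directly, which is convenient for "top genres" plots.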