# Project 2: Netflix Data Analysis

In this project we work with a dataset of Netflix titles. Using Pandas, we answer a set of questions about the titles, their directors, and their countries of production. We also use Matplotlib to draw two visualizations: a pie chart of titles per country and a line chart of average movie duration per year. The data is stored in a CSV file named `netflix_titles.csv`.

Data extracted from: https://www.kaggle.com/datasets/shivamb/netflix-shows (with some cleaning and modifications).


### Project Tasks:

- `2.1.` Load the data using Pandas `read_csv`, passing `show_id` as the `index_col` parameter.

- `2.2.` What are the minimum and maximum release years?

- `2.3.` How many director names are missing (NaN)?

- `2.4.` How many different countries are there in the data?

- `2.5.` How long are the titles on average, in characters? (Create a new column with the title lengths if needed.)

- `2.6.` For a given year, make a pie chart of the number of movies and series combined made by each country, limited to the top 10 countries.

- `2.7.` Make a line chart of the average duration of movies (not TV shows) in minutes for each release year. (Hint: create a new column with the duration in minutes as a number, then group by year and take the mean of that column.)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Ex 2.1: Load the data using Pandas read_csv, use `show_id` as the index_col parameter. 

data_path = "../data/netflix_titles.csv"

movies_df = pd.read_csv(data_path, index_col="show_id")

movies_df
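
To illustrate what `index_col` does, here is a minimal sketch that loads a tiny in-memory CSV instead of the real file. The two sample rows and the names `sample_csv` and `sample_df` are made up for this example; only the column names `show_id`, `type`, `title`, and `release_year` come from the project.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real file has more rows and columns
sample_csv = io.StringIO(
    "show_id,type,title,release_year\n"
    "s1,Movie,Example Movie,2005\n"
    "s2,TV Show,Example Show,2010\n"
)

# index_col turns the show_id column into the row index
sample_df = pd.read_csv(sample_csv, index_col="show_id")

print(sample_df.index.name)          # the index is named after the column
print(sample_df.loc["s2", "title"])  # rows can be looked up by show_id
```

Because `show_id` becomes the index, it no longer appears among the regular columns, and `.loc` lookups use the ID strings directly.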

In [None]:
# Ex 2.2: What are the minimum and maximum release years?

min_year = movies_df['release_year'].min()
max_year = movies_df['release_year'].max()

print(f"Min year: {min_year}, Max year: {max_year}")

In [None]:
# Ex 2.3: How many director names are missing (NaN)?

num_missing_directors = movies_df['director'].isna().sum()

print(f"Number of missing directors: {num_missing_directors}")
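
The same `isna().sum()` pattern also works on a whole DataFrame, which gives the missing count for every column at once. The sketch below uses a made-up three-row frame (`toy_df` and its values are assumptions for illustration, not rows from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", np.nan, np.nan],
    "country": ["France", np.nan, "Japan"],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing_per_column = toy_df.isna().sum()

# mean() of the boolean mask gives the fraction of missing values
missing_share = toy_df["director"].isna().mean()

print(missing_per_column)
print(f"Share of missing directors: {missing_share:.2f}")
```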

In [None]:
# Ex 2.4: How many different countries are there in the data?

# Drop missing values so that a placeholder is not counted as a country
country_data = movies_df['country'].dropna()

# Create a list to store all individual countries
all_countries = []

# An entry may list several countries separated by commas
for countries in country_data:
    for country in countries.split(","):
        name = country.strip()
        # Skip empty names left behind by stray commas
        if name:
            all_countries.append(name)

# Get unique countries and count them
unique_countries = set(all_countries)
n_countries = len(unique_countries)

print(f"There are {n_countries} different countries in the data")
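
The loop above can also be written with vectorized Pandas string methods. The following is a minimal sketch on made-up data (`toy_countries` is hypothetical); it drops missing entries rather than counting them and assumes that countries are separated by commas, as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: some entries list several countries
toy_countries = pd.Series(
    ["United States, Canada", "France", np.nan, "Canada,  France", "India"]
)

individual = (
    toy_countries.dropna()   # ignore missing entries
    .str.split(",")          # "A, B" -> ["A", " B"]
    .explode()               # one row per country
    .str.strip()             # remove surrounding whitespace
)

# Discard empty names that stray commas could produce
individual = individual[individual != ""]

n_unique = individual.nunique()
print(f"There are {n_unique} different countries in the toy data")
```

`explode` turns each list element into its own row, so `nunique` then counts individual country names rather than combined strings.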

In [None]:
# Ex 2.5: How long are the titles on average, in characters?

# Create a new column with title lengths (str.len() also tolerates missing titles)
movies_df['title_length'] = movies_df['title'].str.len()

# Calculate average title length
avg_title_length = movies_df['title_length'].mean()

print(f"The average title length is {avg_title_length:.2f} characters")

In [None]:
# Ex 2.6: For a given year, count how many movies and series combined were made by each country, limited to the top 10 countries, and plot the result as a pie chart.

year = 2005   # change the year to see the results for other years

# Filter data by year
year_data = movies_df.loc[movies_df['release_year'] == year]

# Get value counts for the country column and limit to the top 10.
# Note: entries that list several countries are counted under their combined label.
top_10_countries = year_data['country'].value_counts().head(10)

print(top_10_countries)

# Plot the pie chart from the counts computed above
fig = plt.figure(figsize=(8, 8))
plt.pie(top_10_countries, labels=top_10_countries.index, autopct="%.2f%%")
plt.title(f"Top 10 Countries in {year}")

plt.show()
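
If co-productions should count toward each participating country separately, the country strings can be split before counting. The helper below is a hypothetical sketch (`top_countries_for_year` and `toy_df` are not part of the project template) and assumes comma-separated country names.

```python
import pandas as pd


def top_countries_for_year(df, year, n=10):
    """Count titles per individual country for one release year."""
    # Keep the country entries of the requested year, ignoring missing ones
    year_rows = df.loc[df["release_year"] == year, "country"].dropna()
    # One row per country, with surrounding whitespace removed
    countries = year_rows.str.split(",").explode().str.strip()
    countries = countries[countries != ""]
    return countries.value_counts().head(n)


# Hypothetical toy data: the first title is a co-production
toy_df = pd.DataFrame({
    "release_year": [2005, 2005, 2005, 2006],
    "country": ["United States, Canada", "United States", "France", "Canada"],
})

toy_top = top_countries_for_year(toy_df, 2005, n=3)
print(toy_top)
```

With this approach a co-production is counted once per country, so the counts can add up to more than the number of titles released that year.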

In [None]:
# Ex 2.7: Make a line chart of the average duration of movies (not TV shows) in minutes for every year across all the years. 

# Filter only movies (not TV shows)
movies_only = movies_df[movies_df['type'] == 'Movie'].copy()

# Create a new column with the duration in minutes.
# Durations are expected to look like "90 min"; anything else becomes NaN,
# so it is ignored by the mean instead of being counted as 0 minutes.
movies_only['duration_minutes'] = movies_only['duration'].str.extract(r'(\d+)\s*min', expand=False).astype(float)

# Group by year and calculate average duration
movies_avg_duration_per_year = movies_only.groupby('release_year')['duration_minutes'].mean()

fig = plt.figure(figsize=(9, 6))

# Generate the line plot
plt.plot(movies_avg_duration_per_year.index, movies_avg_duration_per_year.values)
plt.xlabel("Year")
plt.ylabel("Average Duration (minutes)")
plt.title("Average Duration of Movies Across Years")

plt.show()
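
A missing or malformed duration can be treated either as NaN or as 0, and the choice changes the yearly average. The sketch below compares both options on made-up data (`toy_movies` and its values are assumptions for illustration), assuming durations formatted like "90 min".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one movie from 2001 has no duration
toy_movies = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001],
    "duration": ["90 min", "110 min", "100 min", np.nan],
})

# Extract the leading number of minutes; rows that do not match become NaN
minutes = (
    toy_movies["duration"]
    .str.extract(r"(\d+)\s*min", expand=False)
    .astype(float)
)

# Option 1: keep NaN, which mean() skips
avg_nan = (
    toy_movies.assign(duration_minutes=minutes)
    .groupby("release_year")["duration_minutes"]
    .mean()
)

# Option 2: replace NaN with 0, which pulls the average down
avg_zero = (
    toy_movies.assign(duration_minutes=minutes.fillna(0))
    .groupby("release_year")["duration_minutes"]
    .mean()
)

print(avg_nan)
print(avg_zero)
```

In this toy example the 2001 average is 100 minutes when the missing value is skipped and 50 minutes when it is counted as 0, which is why NaN is usually the safer placeholder for an average.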