This project involves a comprehensive analysis of Netflix's movies and TV shows data using SQL. The goal is to extract valuable insights and answer various business questions based on the dataset. The following README provides a detailed account of the project's objectives, business problems, solutions, findings, and conclusions.
- Analyze the distribution of content types (Movies vs TV shows).
- Identify the most common ratings for movies and TV shows.
- List and analyze content based on release years, countries, and durations.
- Explore and categorize content based on specific criteria and keywords.
The data for this project is sourced from the Kaggle dataset:
- Dataset Link: Movies Dataset
DROP TABLE IF EXISTS netflix;
CREATE TABLE netflix
(
show_id VARCHAR(20),
type VARCHAR(20),
title VARCHAR(150),
director VARCHAR(210),
casts VARCHAR(1000),
country VARCHAR(150),
date_added VARCHAR(50),
release_year INT,
rating VARCHAR(10),
duration VARCHAR(20),
listed_in VARCHAR(100),
description VARCHAR(250)
);SELECT
type,
COUNT(*) as total_content
FROM netflix
GROUP BY typeObjective: Determine the distribution of content types on Netflix.
SELECT
type,
rating
FROM
(
SELECT
type,
rating,
COUNT(*),
RANK() OVER(PARTITION BY type ORDER BY COUNT(*) DESC) as ranking
FROM netflix
GROUP BY 1, 2
) as t1
WHERE ranking = 1Objective: Identify the most frequently occurring rating for each type of content.
SELECT * FROM netflix
WHERE
type = 'Movie'
AND
release_year = 2020Objective: Retrieve all movies released in a specific year.
SELECT
UNNEST(STRING_TO_ARRAY(country, ',')) as new_country,
COUNT(show_id) as total_content
FROM netflix
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5Let's FIX the Multiple Values Received on a String List at the 'Country' Column:
SELECT
UNNEST(STRING_TO_ARRAY(country, ',')) as new_country
FROM netflixObjective: Identify the top 5 countries with the highest number of content items.
SELECT * FROM netflix
WHERE
type = 'Movie'
AND
duration = (SELECT MAX(duration) FROM netflix)Objective: Find the movie with the longest duration.
SELECT * FROM netflix
WHERE
TO_DATE(date_added, 'Month DD, YYYY') >= CURRENT_DATE - INTERVAL '5 years'Objective: Retrieve content added to Netflix in the last 5 years.
SELECT * FROM netflix
WHERE
director = 'Clint Eastwood'Objective: List all content directed by 'Clint Eastwood'.
SELECT
type,
title,
duration
FROM netflix
WHERE
type = 'TV Show'
AND
SPLIT_PART(duration, ' ', 1)::numeric > 5
ORDER BY 3 DESCObjective: Identify TV shows with more than 5 seasons.
SELECT
UNNEST(STRING_TO_ARRAY(listed_in, ',')) as genre,
COUNT(show_id) as "Number of Content"
FROM netflix
GROUP BY 1
ORDER BY 2 DESCObjective: Count the number of content items in each genre.
return top 5 year with highest avg content release!
SELECT
EXTRACT(YEAR FROM TO_DATE(date_added, 'Month DD, YYYY')) as Year,
COUNT(*) as yearly_content,
ROUND(
COUNT(*)::numeric/ (SELECT COUNT(*) FROM netflix WHERE country = 'United States')::numeric * 100 , 2) as "AVG_content_yearly"
FROM netflix
WHERE country = 'United States'
GROUP BY 1
ORDER BY 3 DESC
LIMIT 5Objective: Calculate and rank years by the average number of content releases by United States.
SELECT
type,
title,
listed_in
FROM netflix
WHERE
type = 'Movie'
AND
listed_in ILIKE '%documentaries%'ONLY Documentaries results, no mixed styles:
SELECT
type,
title,
listed_in
FROM netflix
WHERE
type = 'Movie'
AND
listed_in LIKE 'Documentaries'Objective: Retrieve all movies classified as Documentaries.
SELECT *
FROM netflix
WHERE
director IS NULLObjective: List content that does not have a director.
SELECT DISTINCT
show_id,
type,
title,
casts,
release_year
FROM netflix
WHERE
casts ILIKE '%Salman Khan%'
AND
release_year > EXTRACT(YEAR FROM CURRENT_DATE) - 15
ORDER BY release_year DESCObjective: Count the number of movies featuring 'Salman Khan' in the last 10 years.
14. Find the Top 10 Actors Who Have Appeared in the Highest Number of Movies Produced in United States.
SELECT
UNNEST(STRING_TO_ARRAY(casts, ',')) as actors,
COUNT(*) as total_content
FROM netflix
WHERE country ILIKE '%United States'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10WITH Categorize_Table
AS
(
SELECT *,
CASE
WHEN
description ILIKE '%kill%' OR
description ILIKE '%violence%' THEN 'Bad_Content'
ELSE 'Good Content'
END as category
FROM netflix
)
SELECT
category,
COUNT(*) as total_content
FROM Categorize_Table
GROUP BY 1- Content Distribution: The dataset contains a diverse range of movies and TV shows with varying ratings and genres.
- Common Ratings: Insights into the most common ratings provide an understanding of the content's target audience.
- Geographical Insights: The top countries and the average content releases by India highlight regional content distribution.
- Content Categorization: Categorizing content based on specific keywords helps in understanding the nature of content available on Netflix.
This analysis provides a comprehensive view of Netflix's content and can help inform content strategy and decision-making.
Thank you if you made it to the end!
