## Introduction

This dataset was provided by DecoderBot, a platform that offers IT solutions, design and development services, and internship programs for IT enthusiasts.

The dataset consists of 11 columns and 5283 entries covering Netflix movies and TV shows released between 1953 and 2022, each with an IMDb rating. The data attributes include:

- **Title**: Name of the TV show or movie.
- **Type**: Category of each entry, either TV show or movie.
- **Description**: Summary of the plot or storyline of each TV show or movie.
- **Release_year**: Year in which each movie or TV show was released.
- **Age_certification**: Age rating assigned to each title, indicating whether it is suitable for a general audience or restricted due to adult content.
- **Runtime**: Duration of each movie or TV show episode, in minutes.
- **Imdb_score**: Score assigned to each movie or TV show, representing its overall quality and popularity on IMDb.
- **Imdb_votes**: Number of votes received by each movie or TV show on IMDb.
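
As a quick sanity check after loading, the column names can be compared against the attributes listed above. The sketch below is illustrative only: it uses a one-row stand-in frame rather than the real CSV, and `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names that are not part of the original notebook.

```python
import pandas as pd

# Hypothetical list of the columns the analysis relies on (lowercase, as in the CSV).
EXPECTED_COLUMNS = [
    "title", "type", "description", "release_year",
    "age_certification", "runtime", "imdb_score", "imdb_votes",
]


def missing_columns(df, expected):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# One-row stand-in for the real dataset (first row of the CSV, description truncated).
sample = pd.DataFrame({
    "title": ["Taxi Driver"],
    "type": ["MOVIE"],
    "description": ["A mentally unstable Vietnam War veteran works ..."],
    "release_year": [1976],
    "age_certification": ["R"],
    "runtime": [113],
    "imdb_score": [8.3],
    "imdb_votes": [795222.0],
})

print(missing_columns(sample, EXPECTED_COLUMNS))  # []
```

An empty list means every expected column is present; any name in the list is a column that downstream cells would fail on.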

In [1]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud


# load the Netflix TV shows and movies dataset into the pandas dataframe

# Define the file path
file_path = r"C:\Users\ianiy\OneDrive\Desktop\Data Analyst GIGs\DecoderBot\Task 1- Netflix TV Shows and Movies.csv"

# Load the CSV file into a DataFrame
Movies_rating = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
Movies_rating.head()

| | index | id | title | type | description | release_year | age_certification | runtime | imdb_id | imdb_score | imdb_votes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tm84618 | Taxi Driver | MOVIE | A mentally unstable Vietnam War veteran works ... | 1976 | R | 113 | tt0075314 | 8.3 | 795222.0 |
| 1 | 1 | tm127384 | Monty Python and the Holy Grail | MOVIE | King Arthur, accompanied by his squire, recrui... | 1975 | PG | 91 | tt0071853 | 8.2 | 530877.0 |
| 2 | 2 | tm70993 | Life of Brian | MOVIE | Brian Cohen is an average young Jewish man, bu... | 1979 | R | 94 | tt0079470 | 8.0 | 392419.0 |
| 3 | 3 | tm190788 | The Exorcist | MOVIE | 12-year-old Regan MacNeil begins to adapt an e... | 1973 | R | 133 | tt0070047 | 8.1 | 391942.0 |
| 4 | 4 | ts22164 | Monty Python's Flying Circus | SHOW | A British sketch comedy series with the shows ... | 1969 | TV-14 | 30 | tt0063929 | 8.8 | 72895.0 |


In [2]:
# Display basic information about dataset
Movies_rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5283 entries, 0 to 5282
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              5283 non-null   int64  
 1   id                 5283 non-null   object 
 2   title              5283 non-null   object 
 3   type               5283 non-null   object 
 4   description        5278 non-null   object 
 5   release_year       5283 non-null   int64  
 6   age_certification  2998 non-null   object 
 7   runtime            5283 non-null   int64  
 8   imdb_id            5283 non-null   object 
 9   imdb_score         5283 non-null   float64
 10  imdb_votes         5267 non-null   float64
dtypes: float64(2), int64(3), object(6)
memory usage: 454.1+ KB


## Data Cleaning

In [3]:
# Check the columns with the null values 
Movies_rating.isna().sum()

index                   0
id                      0
title                   0
type                    0
description             5
release_year            0
age_certification    2285
runtime                 0
imdb_id                 0
imdb_score              0
imdb_votes             16
dtype: int64

The output shows missing values in three columns: **'description'** (5), **'age_certification'** (2285), and **'imdb_votes'** (16). Some columns are also not useful for this analysis and will be dropped.
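
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up toy frame (`toy` and its values are assumptions for illustration, not rows from the dataset); the same expression could be applied to `Movies_rating`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Movies_rating; the values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "age_certification": ["R", np.nan, np.nan, "PG"],
    "imdb_votes": [100.0, 250.0, np.nan, 40.0],
})

# isna() yields booleans; the column-wise mean of booleans is the fraction missing.
missing_pct = (toy.isna().mean() * 100).round(1)

print(missing_pct.sort_values(ascending=False))
```

For the real data, 2285 missing values out of 5283 rows means roughly 43% of the age certifications are absent, while the gaps in the other two columns are well under 1%.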

In [4]:
# Check for duplicates

Movies_rating.duplicated().sum()

0

In [5]:
# Count occurrences of each distinct row (rows containing NaN are excluded by default)

Movies_rating.value_counts()

index  id        title                   type   description                                                                                                                                                                                                                                                                                                                                                                               release_year  age_certification  runtime  imdb_id     imdb_score  imdb_votes
0      tm84618   Taxi Driver             MOVIE  A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.                                                                                                                                                       1976          R                  113      tt0075314   8.3         79522

In [6]:
# Check for data consistency and format

for column in Movies_rating.columns:
    unique_values = Movies_rating[column].unique()
    print(f"Unique values in {column}: {unique_values}")

Unique values in index: [   0    1    2 ... 5280 5281 5282]
Unique values in id: ['tm84618' 'tm127384' 'tm70993' ... 'tm1045018' 'tm1098060' 'ts271048']
Unique values in title: ['Taxi Driver' 'Monty Python and the Holy Grail' 'Life of Brian' ...
 'Clash' 'Shadow Parties' 'Mighty Little Bheem: Kite Festival']
Unique values in type: ['MOVIE' 'SHOW']
Unique values in description: ['A mentally unstable Vietnam War veteran works as a night-time taxi driver in New York City where the perceived decadence and sleaze feed his urge for violent action, attempting to save a preadolescent prostitute in the process.'
 'King Arthur, accompanied by his squire, recruits his Knights of the Round Table, including Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Robin the Not-Quite-So-Brave-As-Sir-Lancelot and Sir Galahad the Pure. On the way, Arthur battles the Black Knight who, despite having had all his limbs chopped off, insists he can still fight. They reach Camelot, but Arthur decides not  to ente
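
Printing every unique value gets verbose for free-text columns such as the description. A more compact check, sketched here on a made-up toy frame (`toy`, `summary`, and the cardinality threshold of 2 are assumptions for illustration), is to report the dtype and number of distinct values per column and only list the values of low-cardinality columns.

```python
import pandas as pd

# Toy frame standing in for Movies_rating; the values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["MOVIE", "MOVIE", "SHOW"],
    "release_year": [1990, 2005, 2020],
})

# One row per column: its dtype and how many distinct values it holds.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
})
print(summary)

# Only list the actual values for columns with few distinct entries.
low_card = [col for col in toy.columns if toy[col].nunique() <= 2]
for col in low_card:
    print(col, sorted(toy[col].unique()))
```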

In [7]:
# Drop identifier columns that are not needed for the analysis
Movies_rating = Movies_rating.drop(columns=['index', 'id', 'imdb_id'])

Movies_rating.head()

| | title | type | description | release_year | age_certification | runtime | imdb_score | imdb_votes |
|---|---|---|---|---|---|---|---|---|
| 0 | Taxi Driver | MOVIE | A mentally unstable Vietnam War veteran works ... | 1976 | R | 113 | 8.3 | 795222.0 |
| 1 | Monty Python and the Holy Grail | MOVIE | King Arthur, accompanied by his squire, recrui... | 1975 | PG | 91 | 8.2 | 530877.0 |
| 2 | Life of Brian | MOVIE | Brian Cohen is an average young Jewish man, bu... | 1979 | R | 94 | 8.0 | 392419.0 |
| 3 | The Exorcist | MOVIE | 12-year-old Regan MacNeil begins to adapt an e... | 1973 | R | 133 | 8.1 | 391942.0 |
| 4 | Monty Python's Flying Circus | SHOW | A British sketch comedy series with the shows ... | 1969 | TV-14 | 30 | 8.8 | 72895.0 |


In [8]:
# group title by movie or show 

grouped = Movies_rating.groupby('type').size()

print(grouped)

type
MOVIE    3407
SHOW     1876
dtype: int64


In [9]:
# group title by age_certification

grouped = Movies_rating.groupby('age_certification').size()

print(grouped)

age_certification
G        105
NC-17     13
PG       238
PG-13    424
R        548
TV-14    436
TV-G      72
TV-MA    792
TV-PG    172
TV-Y      94
TV-Y7    104
dtype: int64


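There are 3407 movies and 1876 TV shows, with age ratings that include General Audiences (G), Parental Guidance (PG), Restricted (R), and All Children (TV-Y), among others. Age certification is missing for 2285 titles. Before proceeding with the analysis, these gaps are filled with the forward-fill method, which copies the most recent non-null certification from the rows above, so the filled values are estimates based on row order rather than each title's actual rating: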

In [10]:
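# Forward-fill missing certifications; assigning the result back avoids the
# deprecated fillna(method=...) form and chained in-place modification
Movies_rating['age_certification'] = Movies_rating['age_certification'].ffill()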
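
To make the effect of forward fill concrete, here is a minimal sketch on a made-up Series (`cert` and its values are assumptions for illustration). It also shows one possible alternative, an explicit `"Unknown"` placeholder, which is not what the notebook does but keeps imputed rows distinguishable.

```python
import numpy as np
import pandas as pd

# Toy certifications with gaps; invented values, not rows from the dataset.
cert = pd.Series(["R", np.nan, np.nan, "TV-14", np.nan], name="age_certification")

# Forward fill: each gap takes the last non-null value above it.
filled_forward = cert.ffill()
print(filled_forward.tolist())  # ['R', 'R', 'R', 'TV-14', 'TV-14']

# Alternative (an assumption, not the notebook's method): label gaps explicitly.
filled_label = cert.fillna("Unknown")
print(filled_label.tolist())  # ['R', 'Unknown', 'Unknown', 'TV-14', 'Unknown']
```

Because forward fill depends only on row order, re-sorting the frame before filling would produce different certifications for the same titles.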

In [11]:
# Fill missing values of the IMDb votes using the mean value

average_votes = Movies_rating['imdb_votes'].mean()

# Assign the result back instead of using chained inplace=True
Movies_rating['imdb_votes'] = Movies_rating['imdb_votes'].fillna(average_votes)
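
The mean is sensitive to a few very large values, which matters for vote counts. The sketch below uses a made-up Series (`votes` is an assumption for illustration) to compare the mean with the median as a fill value; the median is shown as a possible alternative, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy vote counts with one very popular title and one gap; invented values.
votes = pd.Series([100.0, 200.0, 300.0, 1_000_000.0, np.nan], name="imdb_votes")

mean_fill = votes.mean()      # pulled upward by the single large value
median_fill = votes.median()  # stays close to a typical title

print(mean_fill)    # 250150.0
print(median_fill)  # 250.0

# Filling with the median keeps the imputed row near the bulk of the data.
votes_filled = votes.fillna(median_fill)
```

In this toy case the mean-based fill would give the missing title more votes than three of the four observed titles.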

In [12]:
# Treat literal 'NA' strings as missing values
Movies_rating['description'] = Movies_rating['description'].replace('NA', np.nan)

# Fill missing values in the 'description' column with a placeholder
Movies_rating['description'] = Movies_rating['description'].fillna('Not provided')

In [13]:
# Check the columns with the null values 
Movies_rating.isna().sum()

title                0
type                 0
description          0
release_year         0
age_certification    0
runtime              0
imdb_score           0
imdb_votes           0
dtype: int64

In [14]:
# group title by age_certification

grouped = Movies_rating.groupby('age_certification').size()

print(grouped)

age_certification
G         303
NC-17      22
PG        458
PG-13     748
R         838
TV-14     746
TV-G      117
TV-MA    1334
TV-PG     314
TV-Y      200
TV-Y7     203
dtype: int64
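
One way to judge how much the forward fill changed the picture is to compare each certification's share before and after. The sketch below re-enters the counts from the two `groupby` outputs above by hand; the variable names are assumptions for illustration.

```python
import pandas as pd

labels = ["G", "NC-17", "PG", "PG-13", "R", "TV-14",
          "TV-G", "TV-MA", "TV-PG", "TV-Y", "TV-Y7"]

# Counts copied from the groupby output before and after the forward fill.
before = pd.Series([105, 13, 238, 424, 548, 436, 72, 792, 172, 94, 104], index=labels)
after = pd.Series([303, 22, 458, 748, 838, 746, 117, 1334, 314, 200, 203], index=labels)

# Share of each certification among the rated titles, in percent.
shares = pd.DataFrame({
    "before_pct": before / before.sum() * 100,
    "after_pct": after / after.sum() * 100,
}).round(1)
shares["change_pct_points"] = (shares["after_pct"] - shares["before_pct"]).round(1)

print(shares)
```

The totals are 2998 rated titles before the fill and 5283 after it. Large shifts in a category's share would suggest that the fill is distorting the distribution rather than preserving it.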


In [15]:
# Display the statistical description of the dataframe and check for outliers

Movies_rating.describe()

| | release_year | runtime | imdb_score | imdb_votes |
|---|---|---|---|---|
| count | 5283.0 | 5283.0 | 5283.0 | 5283.0 |
| mean | 2015.879992 | 79.199886 | 6.533447 | 23407.19 |
| std | 7.346098 | 38.915974 | 1.160932 | 87002.24 |
| min | 1953.0 | 0.0 | 1.5 | 5.0 |
| 25% | 2015.0 | 45.0 | 5.8 | 522.0 |
| 50% | 2018.0 | 87.0 | 6.6 | 2285.0 |
| 75% | 2020.0 | 106.0 | 7.4 | 10276.0 |
| max | 2022.0 | 235.0 | 9.6 | 2268288.0 |


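The catalogue skews recent: the median release year is 2018, and at least three quarters of the titles were released in 2015 or later. The median runtime is 87 minutes, while the minimum of 0 minutes points to entries that are likely missing or invalid. The median IMDb score is 6.6, with the middle half of titles scoring between 5.8 and 7.4. IMDb votes are strongly right-skewed: the median is 2285 votes, the mean is about 23407, and the maximum is 2268288, so a small number of very popular titles act as outliers.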
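
A common rule of thumb for flagging outliers is Tukey's fences: values more than 1.5 times the interquartile range below the first quartile or above the third. The sketch below applies that rule to the quartiles from the `describe()` output; `iqr_fences` is a hypothetical helper, and the 1.5 multiplier is a conventional choice rather than something the notebook specifies.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the describe() output (25% and 75% rows).
runtime_low, runtime_high = iqr_fences(45.0, 106.0)
votes_low, votes_high = iqr_fences(522.0, 10276.0)

print(runtime_high)  # 197.5
print(votes_high)    # 24907.0

# Maxima from the same table, compared against the upper fences.
print(235.0 > runtime_high)      # True
print(2268288.0 > votes_high)    # True
```

Under this rule the longest runtime (235 minutes) and the highest vote count both fall outside the upper fence, while the lower fences are negative and so flag nothing.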

In [16]:
# save the dataframe as a csv file

Movies_rating.to_csv('Movies_rating.csv', index=False)