# Netflix Top 10 Titles

## Background
Each week, Netflix publishes its top 10 titles, along with how many hours users spent watching each one. These are split up into four categories: Films (English), Films (Non-English), TV (English), and TV (Non-English). Investors have a number of key questions about Netflix that this data can help address. For example:
+ Is Netflix producing and licensing engaging content?
+ Are Netflix's content investments in new genres or geographies generating significant viewership?
+ How is viewership trending over time, and what implications does this have for Netflix's subscriber numbers?

To answer these questions, the Netflix Top 10 website is scraped every week, and IMDb is then scraped for supplementary information such as each movie or show's running time and ratings. Once collected, the data is cleaned and analyzed to provide insights to clients.

## Import libraries

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np

## Load the Data

In [2]:
# read netflix dataset
nflx_top_10 = pd.read_excel('Prescreening Files/Data File.xlsx', sheet_name='NFLX Top 10')
# read imdb dataset
imdb_top_10 = pd.read_excel('Prescreening Files/Data File.xlsx', sheet_name='IMDB Rating')

# check datasets
print(nflx_top_10.info())
print(imdb_top_10.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   date_added                  520 non-null    datetime64[ns]
 1   week                        520 non-null    datetime64[ns]
 2   category                    520 non-null    object        
 3   show_title                  520 non-null    object        
 4   season_title                249 non-null    object        
 5   weekly_rank                 520 non-null    int64         
 6   cumulative_weeks_in_top_10  520 non-null    int64         
 7   weekly_hours_viewed         520 non-null    int64         
dtypes: datetime64[ns](2), int64(3), object(3)
memory usage: 32.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15879 entries, 0 to 15878
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  

## Join movie and ratings datasets

Because the ratings table contributes only one field of interest (the rating), the two tables can be joined so that each title's rating sits alongside its viewing data, which helps answer the questions at hand.

In [None]:
# join movie and ratings datasets
nflx_top_10 = pd.merge(nflx_top_10, imdb_top_10, how='left', left_on='show_title', right_on='title')
print(nflx_top_10.info())
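A left join on title text can silently add rows when the right-hand table lists the same title more than once, so it is worth checking the row count after merging. The snippet below is a minimal sketch on invented toy data, not the notebook's actual sheets. The `rating` column name is an assumption (the `info()` output for the IMDb sheet is cut off), and keeping the first rating per title is only one possible de-duplication rule.

```python
import pandas as pd

# Toy stand-ins for the two sheets; all values are invented for illustration.
top10 = pd.DataFrame({
    "show_title": ["Title A", "Title B", "Title C"],
    "weekly_rank": [1, 2, 3],
})
ratings = pd.DataFrame({
    "title": ["Title A", "Title A", "Title B"],  # "Title A" appears twice
    "rating": [7.1, 7.3, 6.4],                   # assumed column name
})

# A plain left join repeats the left row once per matching rating row.
naive = top10.merge(ratings, how="left", left_on="show_title", right_on="title")
print(len(top10), len(naive))

# Keep one rating per title (here: the first), then require a many-to-one join.
ratings_unique = ratings.drop_duplicates(subset="title", keep="first")
checked = top10.merge(
    ratings_unique,
    how="left",
    left_on="show_title",
    right_on="title",
    validate="many_to_one",  # raises MergeError if `title` is not unique on the right
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Titles that found no rating at all
unmatched = checked.loc[checked["_merge"] == "left_only", "show_title"].tolist()
print(len(checked), unmatched)
```

If the merged row count in the real data exceeds 520, duplicated titles in the ratings sheet are the likely cause.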

The dataset is now ready for analysis. Before answering the questions, make sure the date fields are parsed correctly.

In [4]:
# Ensure date parsing is correct
nflx_top_10['date_added'] = pd.to_datetime(nflx_top_10['date_added'])
nflx_top_10['week'] = pd.to_datetime(nflx_top_10['week'])

## Analysis

This section explores the dataset to answer a few specific questions.

### _1. Within the most recent week of data, which English title had the highest total weeks in the top 10?_

To answer this, we first need to keep only the English titles, across both films and TV.

In [5]:
# Print the number of rows in each category
print(f"Counts of titles by category:{nflx_top_10['category'].value_counts()}")
# Filter to English titles (raw string so the escaped parentheses match literally)
eng_titles = nflx_top_10[nflx_top_10['category'].str.contains(r'\(English\)')]
print(f"Number of English titles: {eng_titles.shape[0]}")
# the title with the most appearances in the dataset (across all categories, not only films)
most_viewed = nflx_top_10['show_title'].value_counts().idxmax()
# avg weekly hours viewed of the above title

Counts of titles by category:category
Films (English)        139
TV (English)           133
Films (Non-English)    132
TV (Non-English)       130
Name: count, dtype: int64
Number of English titles: 272
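With the English rows isolated, the question itself reduces to two more steps: restrict to the latest value of `week`, then take the row with the largest `cumulative_weeks_in_top_10`. The following is a minimal sketch on invented toy data that reuses the column names from the `info()` output above; it is one possible approach, not necessarily the one used later in this notebook.

```python
import pandas as pd

# Toy data with the notebook's column names; all values are invented.
df = pd.DataFrame({
    "week": pd.to_datetime(
        ["2022-01-02", "2022-01-02", "2022-01-09", "2022-01-09", "2022-01-09"]
    ),
    "category": [
        "Films (English)", "TV (English)",
        "Films (English)", "TV (Non-English)", "TV (English)",
    ],
    "show_title": ["Film A", "Show B", "Film A", "Show C", "Show D"],
    "cumulative_weeks_in_top_10": [1, 4, 2, 9, 5],
})

# Most recent week present in the data
latest_week = df["week"].max()

# Literal substring match; "(Non-English)" does not contain "(English)"
is_english = df["category"].str.contains("(English)", regex=False)
latest_english = df[(df["week"] == latest_week) & is_english]

# Row with the highest cumulative weeks; idxmax returns the first row on ties
top_row = latest_english.loc[latest_english["cumulative_weeks_in_top_10"].idxmax()]
print(top_row["show_title"], top_row["cumulative_weeks_in_top_10"])
```

Because `idxmax` returns only the first maximum, ties would need a separate check, for example by comparing each row against the column's maximum value.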
