# Disney+ Movies and TV Shows #


You are working as a Junior Data Analyst at Disney Plus. A combined dataset of movies and TV shows is available on [Kaggle](https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows).

The Content Manager wants you to clean the data and provide insights that answer questions such as:

1. What is the unique list of ratings, and how many movies and TV shows are listed under each rating?
2. What is the average duration of movies and TV shows?
3. How many movies and TV shows have been released so far in Germany? Give the list year-wise.
4. Which director has directed the maximum number of movies, and in which genre?

The task has been handed over to you by your team lead in the data department. They expect you to apply your Python knowledge of strings and descriptive statistics, and to build reusable functions.

Your analysis will help the Content Manager at Disney Plus to make a decision on the direction and future investments the company makes in movies as well as TV shows and will influence what gets released.

## Data Dictionary ##

1. show_id - Unique id
2. type - Movie or TV Show
3. title - Name of the movie/show
4. director - Directors of the movie/show
5. cast - Main cast of the movie/show
6. country - Country of production
7. date_added - Date added on Disney+
8. release_year - Original release year of the movie/show
9. rating - Rating of the movie/show
10. duration - Total duration of the movie/show
11. listed_in - Genres the movie/show is listed under
12. description - Short description of the movie/show

## Step 1: Read the file and display first 5 rows ##

In [1]:
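from csv import reader

# Open with a context manager so the file is closed once it has been read.
with open('./Data/disney_plus_titles.csv', encoding="utf-8", newline="") as opened_file:
    dp = list(reader(opened_file))

dp_header = dp[0]  # first row holds the column names
dp = dp[1:]        # remaining rows are the data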

In [2]:
print(dp_header)

['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']
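
Later steps address columns by position (for example `8` for `rating` and `10` for `listed_in`), so a name-to-index lookup built from the header may make those calls easier to read. The sketch below is only illustrative: `col_index` is a hypothetical name that is not part of the original notebook, and the header list is copied from the output above.

```python
# Header row as printed above.
dp_header = ['show_id', 'type', 'title', 'director', 'cast', 'country',
             'date_added', 'release_year', 'rating', 'duration',
             'listed_in', 'description']

# Map each column name to its position (col_index is a hypothetical helper name).
col_index = {name: position for position, name in enumerate(dp_header)}

print(col_index['rating'])     # 8
print(col_index['listed_in'])  # 10
```

With such a lookup, a call like `list_of_elements(dp, col_index['rating'])` would state its intent more clearly than a bare `8`.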


### Create Function `explore_data` ###
Create a function `explore_data` that takes the dataset, a start row, an end row, and a boolean flag that controls whether the total number of rows and columns is displayed.

It should print the rows from start to end and, when the flag is set, the number of rows and columns as well.

For example, display the first five rows together with the total number of rows and columns in the input data. Make sure to use local variables.

In [3]:
def explore_data(data_list_l, start_l, end_l, rows_and_columns_l=False):
    if end_l is None:  # no end row given: slice through the last row
        data_slice_l = data_list_l[start_l:]
    else:
        data_slice_l = data_list_l[start_l:end_l]
        
    for row_l in data_slice_l:
        print(row_l)
        print('\n')
        
    if rows_and_columns_l:
        print('no. of rows:', len(data_list_l))
        print('no. of columns:', len(data_list_l[0]))
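
A side note on the `end_l` check: Python slices accept `None` as a bound, so `data[start:None]` behaves like `data[start:]`. A minimal sketch with made-up rows (all names here are illustrative, not from the notebook):

```python
rows = [['s1'], ['s2'], ['s3'], ['s4']]  # made-up rows

# A slice bound of None means "no bound", so both forms select the same rows.
last_two_explicit = rows[-2:]
last_two_with_none = rows[-2:None]

print(last_two_explicit == last_two_with_none)  # True
```

Under that assumption, the explicit `None` branch in `explore_data` is a readability choice rather than a necessity.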

Use the `explore_data` function to display the first 5 rows along with the number of rows and columns.

In [4]:
explore_data(dp,0,5,True)

['s1', 'Movie - Animation', 'Duck the Halls: A Mickey Mouse Christmas Special', 'Alonso Ramirez Ramos, Dave Wasson', 'Chris Diamantopoulos, Tony Anselmo, Tress MacNeille, Bill Farmer, Russi Taylor, Corey Burton', '', 'November 26, 2021', '2016', 'TV-G', '23 min', 'Animation, Family', 'Join Mickey and the gang as they duck the halls!']


['s2', 'Movie-Comedy', 'Ernest Saves Christmas', 'John Cherry', 'Jim Varney, Noelle Parker, Douglas Seale', '', 'November 26, 2021', '1988', 'PG', '91 min', 'Comedy', 'Santa Claus passes his magic bag to a new St. Nic.']


['s3', 'Movie', 'Ice Age: A Mammoth Christmas', 'Karen Disher', 'Raymond Albert Romano, John Leguizamo, Denis Leary, Queen Latifah', 'United States', 'November 26, 2021', '2011', 'TV-G', '23 min', 'Animation, Comedy, Family', "Sid the Sloth is on Santa's naughty list."]


['s4', 'Movie', 'The Queen Family Singalong', 'Hamish Hamilton', 'Darren Criss, Adam Lambert, Derek Hough, Alexander Jean, Fall Out Boy, Jimmie Allen', '', 'November

Again use the `explore_data` function to display the last 5 rows. 

This time you don't have to display the number of rows and columns. 

Make sure that the title in the last row is 'Captain Sparky vs. The Flying Saucers'

In [5]:
explore_data(dp,-5, None, False)

['s1446', 'Movie', 'X-Men Origins: Wolverine', 'Gavin Hood', 'Hugh Jackman, Liev Schreiber, Danny Huston, will.i.am , Lynn Collins, Kevin Durand', 'United States, United Kingdom', 'June 4, 2021', '2009', 'PG-13', '108 min', 'Action-Adventure, Family, Science Fiction', 'Wolverine unites with legendary X-Men to fight against forces determined to eliminate mutants.']


['s1447', 'Movie', 'Night at the Museum: Battle of the Smithsonian', 'Shawn Levy', 'Ben Stiller, Amy Adams, Owen Wilson, Hank Azaria, Christopher Guest, Alain Chabat', 'United States, Canada', 'April 2, 2021', '2009', 'PG', '106 min', 'Action-Adventure, Comedy, Family', 'Larry Daley returns to rescue some old friends while the Smithsonian Institution comes alive.']


['s1448', 'Movie', 'Eddie the Eagle', 'Dexter Fletcher', 'Tom Costello, Jo Hartley, Keith Allen, Dickon Tolson, Jack Costello, Taron Egerton', 'United Kingdom, Germany, United States', 'December 18, 2020', '2016', 'PG-13', '107 min', 'Biographical, Comedy, Dram

## Step 2: Separate Movies and TV Shows ##

Write a `for` loop that separates movies into `disney_movies` and TV shows into `disney_shows`, and collects any row that is neither into `others`.

Also, ensure that the `type` column has only 2 values, either `Movie` or `TV Show`.

In [6]:
disney_movies = []
disney_shows = []
others = []

for row in dp:
    show_type = row[1].lower()  # avoid shadowing the built-in `type`
    if show_type.startswith("movie"):
        row[1] = "Movie"
        disney_movies.append(row)
    elif show_type.startswith("tv"):
        disney_shows.append(row)
    else:
        others.append(row)

In [7]:
print("Disney Movies: ", len(disney_movies))
print("Disney TV Shows: ", len(disney_shows))
print("Disney Others: ", len(others))

Disney Movies:  1052
Disney TV Shows:  398
Disney Others:  0
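
The loop above rewrites the `type` value only for movies, so TV show rows keep values such as `'TV Show - Docuseries'`. One possible way to collapse both categories to exactly two values is sketched below; `normalize_type` is a hypothetical helper name, and the sample strings are taken from the rows printed in this notebook.

```python
def normalize_type(raw_type):
    """Collapse a raw `type` value to 'Movie', 'TV Show', or None if unrecognized."""
    lowered = raw_type.strip().lower()
    if lowered.startswith("movie"):
        return "Movie"
    if lowered.startswith("tv"):
        return "TV Show"
    return None

samples = ['Movie - Animation', 'Movie-Comedy', 'Movie',
           'TV Show - Docuseries', 'TV Show - Season 1', 'TV Show']
print(sorted({normalize_type(value) for value in samples}))  # ['Movie', 'TV Show']
```

Returning `None` for anything unrecognized keeps the "others" check possible without a third string value.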


In [8]:
explore_data(disney_movies,0,5,False)

['s1', 'Movie', 'Duck the Halls: A Mickey Mouse Christmas Special', 'Alonso Ramirez Ramos, Dave Wasson', 'Chris Diamantopoulos, Tony Anselmo, Tress MacNeille, Bill Farmer, Russi Taylor, Corey Burton', '', 'November 26, 2021', '2016', 'TV-G', '23 min', 'Animation, Family', 'Join Mickey and the gang as they duck the halls!']


['s2', 'Movie', 'Ernest Saves Christmas', 'John Cherry', 'Jim Varney, Noelle Parker, Douglas Seale', '', 'November 26, 2021', '1988', 'PG', '91 min', 'Comedy', 'Santa Claus passes his magic bag to a new St. Nic.']


['s3', 'Movie', 'Ice Age: A Mammoth Christmas', 'Karen Disher', 'Raymond Albert Romano, John Leguizamo, Denis Leary, Queen Latifah', 'United States', 'November 26, 2021', '2011', 'TV-G', '23 min', 'Animation, Comedy, Family', "Sid the Sloth is on Santa's naughty list."]


['s4', 'Movie', 'The Queen Family Singalong', 'Hamish Hamilton', 'Darren Criss, Adam Lambert, Derek Hough, Alexander Jean, Fall Out Boy, Jimmie Allen', '', 'November 26, 2021', '2021',

In [9]:
explore_data(disney_shows,0,5,False)

['s5', 'TV Show - Docuseries', 'The Beatles: Get Back', '', 'John Lennon, Paul McCartney, George Harrison, Ringo Starr', '', 'November 25, 2021', '2021', '', '1 Season', 'Docuseries, Historical, Music', 'A three-part documentary from Peter Jackson capturing a moment in music history with The Beatles.']


['s7', 'TV Show - Season 1', 'Hawkeye', '', 'Jeremy Renner, Hailee Steinfeld, Vera Farmiga, Fra Fee, Tony Dalton, Zahn McClarnon', '', 'November 24, 2021', '2021', 'TV-14', '1 Season', 'Action-Adventure, Superhero', 'Clint Barton/Hawkeye must team up with skilled archer Kate Bishop to unravel a criminal conspiracy.']


['s8', 'TV Show', 'Port Protection Alaska', '', 'Gary Muehlberger, Mary Miller, Curly Leach, Sam Carlson, Stuart Andrews, David Squibb', 'United States', 'November 24, 2021', '2015', 'TV-14', '2 Seasons', 'Docuseries, Reality, Survival', 'Residents of Port Protection must combat volatile conditions to survive and thrive in Alaska.']


['s9', 'TV Show', 'Secrets of the Zo

## Step 3: Get a list of ratings ##

Create a function that returns the list of unique values in a given column of the dataset.

In [10]:
def list_of_elements(data_l, loc_l):
    # Collect each distinct value of column loc_l in order of first appearance.
    unique_values_l = []
    for row_l in data_l:
        if row_l[loc_l] not in unique_values_l:
            unique_values_l.append(row_l[loc_l])
    return unique_values_l

Use the function created above to generate the list of ratings in the original dataset.

In [11]:
print(list_of_elements(dp,8))

['TV-G', 'PG', 'TV-PG', '', 'PG-13', 'TV-14', 'G', 'TV-Y7', 'TV-Y', 'TV-Y7-FV']
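
As an aside, the same order-preserving de-duplication can be sketched with `dict.fromkeys`, which relies on dictionaries keeping insertion order (Python 3.7+). The sample list below is made up for illustration.

```python
ratings = ['TV-G', 'PG', 'TV-G', '', 'PG', 'G']  # made-up sample column

# Dictionary keys are unique and keep insertion order (Python 3.7+).
unique_ratings = list(dict.fromkeys(ratings))
print(unique_ratings)  # ['TV-G', 'PG', '', 'G']
```

Note that the empty string survives as its own "rating", which matches the `''` entry in the output above and flags rows with a missing rating.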


## Step 4: For each corresponding rating get the number of movies and shows. ##

Write a function `rating_count` that builds a dictionary in which each key is a rating and each value is the number of movies or TV shows with that rating. The output should be sorted in descending order of count.

In [12]:
def rating_count(dataset_l):
    rating_count_l = {}
    for row_l in dataset_l:
        rating_l = row_l[8]
        if rating_l in rating_count_l:
            rating_count_l[rating_l] += 1
        else:
            rating_count_l[rating_l] = 1
    sorted_rating_count = dict(sorted(rating_count_l.items(), key=lambda item: item[1], reverse=True))
    return sorted_rating_count

Print the Movie Ratings Count and TV Shows Ratings Count

In [13]:
print("Movie Ratings Count:", rating_count(disney_movies))
print("Shows Ratings Count:", rating_count(disney_shows))

Movie Ratings Count: {'G': 253, 'PG': 235, 'TV-G': 233, 'TV-PG': 181, 'PG-13': 66, 'TV-14': 37, 'TV-Y7': 36, 'TV-Y7-FV': 7, 'TV-Y': 3, '': 1}
Shows Ratings Count: {'TV-PG': 120, 'TV-Y7': 95, 'TV-G': 85, 'TV-Y': 47, 'TV-14': 42, 'TV-Y7-FV': 6, '': 2, 'PG': 1}
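
For comparison, the counting-and-sorting pattern used in `rating_count` can also be sketched with `collections.Counter` from the standard library. This is an alternative, not the notebook's method, and the sample list is made up.

```python
from collections import Counter

ratings = ['G', 'PG', 'G', 'TV-G', 'G', 'PG']  # made-up sample column

# most_common() returns (value, count) pairs sorted by descending count.
rating_counts = dict(Counter(ratings).most_common())
print(rating_counts)  # {'G': 3, 'PG': 2, 'TV-G': 1}
```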


## Step 5: Get the list of categories for "listed_in" column ##

Get the list of unique values in the 'listed_in' column. With `list_of_elements` in place, this is now a single line of code.

In [14]:
print(list_of_elements(dp,10))

['Animation, Family', 'Comedy', 'Animation, Comedy, Family', 'Musical', 'Docuseries, Historical, Music', 'Biographical, Documentary', 'Action-Adventure, Superhero', 'Docuseries, Reality, Survival', 'Animals & Nature, Docuseries, Family', 'Comedy, Family, Musical', 'Documentary', 'Comedy, Family, Music', 'Documentary, Family', 'Action-Adventure, Animals & Nature, Docuseries', 'Animals & Nature', 'Animation', 'Animation, Kids', 'Comedy, Coming of Age, Drama', 'Comedy, Family, Fantasy', 'Animation, Comedy, Drama', 'Animation, Family, Fantasy', 'Action-Adventure, Animation, Comedy', 'Comedy, Family', 'Action-Adventure, Comedy, Family', 'Lifestyle', 'Movies', 'Action-Adventure, Science Fiction', 'Action-Adventure, Fantasy, Superhero', 'Coming of Age, Music', 'Animation, Drama', 'Concert Film, Music', 'Animation, Comedy, Coming of Age', 'Animation, Comedy', 'Animation, Crime, Family', 'Science Fiction', 'Action-Adventure, Fantasy', 'Comedy, Fantasy, Kids', 'Action-Adventure, Comedy, Kids', '

## Step 6: Get the unique list of above categories and then the count in each category ##
For this, generalize `list_of_elements` to take the data, column location, and separator as input. Let's name it `list_of_items`.

In [15]:
def list_of_items(data_l, loc_l, sep_l):
    items_list_l = []
    for row_l in data_l:
        # Split the multi-valued cell into its individual items.
        row_list_l = row_l[loc_l].split(sep_l)
        for item_l in row_list_l:
            if item_l not in items_list_l:
                items_list_l.append(item_l)
    return items_list_l

Use the function created above to print the unique list of genres ('listed_in').

In [16]:
print(list_of_items(dp,10,", "))

['Animation', 'Family', 'Comedy', 'Musical', 'Docuseries', 'Historical', 'Music', 'Biographical', 'Documentary', 'Action-Adventure', 'Superhero', 'Reality', 'Survival', 'Animals & Nature', 'Kids', 'Coming of Age', 'Drama', 'Fantasy', 'Lifestyle', 'Movies', 'Science Fiction', 'Concert Film', 'Crime', 'Sports', 'Anthology', 'Medical', 'Variety', 'Spy/Espionage', 'Buddy', 'Parody', 'Game Show / Competition', 'Romance', 'Anime', 'Romantic Comedy', 'Thriller', 'Police/Cop', 'Talk Show', 'Western', 'Dance', 'Series', 'Mystery', 'Soap Opera / Melodrama', 'Disaster', 'Travel']
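
The call above depends on every genre being separated by exactly `", "`. A hedged sketch of why splitting on `","` and stripping each piece can be more forgiving follows; the cell value is made up to show inconsistent spacing, and the variable names are illustrative.

```python
cell = 'Animation,Comedy, Family'  # made-up cell with inconsistent spacing

# Splitting on ", " misses the comma that has no trailing space.
naive = cell.split(', ')
print(naive)   # ['Animation,Comedy', 'Family']

# Splitting on "," and stripping each piece handles both spellings.
robust = [part.strip() for part in cell.split(',')]
print(robust)  # ['Animation', 'Comedy', 'Family']
```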


Generalize the `rating_count` function to take the data, column location, and separator as input. Let's name it `items_count`.

In [17]:
def items_count(dataset_l, loc_l, sep_l):
    items_count_l = {}
    for row_l in dataset_l:
        items_list_l = row_l[loc_l].split(sep_l)
        for item_l in items_list_l:
            if item_l in items_count_l:
                items_count_l[item_l] += 1
            else:
                items_count_l[item_l] = 1
    sorted_items_count = dict(sorted(items_count_l.items(), key=lambda item: item[1], reverse=True))
    return sorted_items_count

Get the count of each genre in movies and tv shows

In [18]:
print("Movie Category Count:", items_count(disney_movies, 10, ", "))
print("Shows Category Count:", items_count(disney_shows, 10, ", "))

Movie Category Count: {'Family': 533, 'Comedy': 407, 'Animation': 381, 'Action-Adventure': 314, 'Documentary': 174, 'Fantasy': 158, 'Coming of Age': 153, 'Animals & Nature': 130, 'Drama': 121, 'Science Fiction': 76, 'Biographical': 41, 'Musical': 40, 'Kids': 39, 'Music': 38, 'Sports': 38, 'Historical': 38, 'Buddy': 24, 'Romance': 19, 'Superhero': 16, 'Crime': 16, 'Mystery': 8, 'Concert Film': 7, 'Variety': 7, 'Parody': 7, 'Anthology': 7, 'Dance': 6, 'Thriller': 5, 'Western': 5, 'Reality': 4, 'Lifestyle': 3, 'Movies': 3, 'Survival': 3, 'Spy/Espionage': 2, 'Romantic Comedy': 2, 'Medical': 2, 'Disaster': 2}
Shows Category Count: {'Animation': 161, 'Action-Adventure': 138, 'Docuseries': 122, 'Comedy': 119, 'Kids': 102, 'Family': 99, 'Animals & Nature': 78, 'Coming of Age': 52, 'Fantasy': 34, 'Reality': 22, 'Anthology': 21, 'Buddy': 16, 'Historical': 15, 'Science Fiction': 15, 'Drama': 13, 'Music': 10, 'Game Show / Competition': 10, 'Survival': 6, 'Sports': 5, 'Lifestyle': 5, 'Variety': 5,

Most movies are listed under Family (533), followed by Comedy (407) and Animation (381).

Most shows are listed under Animation (161), followed by Action-Adventure (138) and Docuseries (122).

## Step 7: What is the average duration of movies and shows ##

The duration of movies is given in minutes and the duration of shows in seasons. Remove the "min" and "Season(s)" suffixes and convert the values to numeric. Create a function for it.

In [19]:
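def duration_conversion(data_l):
    # Note: this overwrites column 9 of each row in place.
    for row_l in data_l:
        # '23 min' -> ['23', 'min'], '2 Seasons' -> ['2', 'Seasons']
        new_values = row_l[9].split(" ")
        row_l[9] = int(new_values[0])
    return data_l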
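
Because `duration_conversion` overwrites column 9 in place, running the same cell a second time would fail: the value is already an `int`, which has no `split` method. A minimal sketch of a per-value parser that tolerates repeated calls is shown below; `parse_duration` is a hypothetical name, not part of the notebook.

```python
def parse_duration(value):
    """Return the leading number of a duration such as '23 min' or '2 Seasons'."""
    if isinstance(value, int):
        return value  # already converted, so calling this twice is safe
    return int(value.split(" ")[0])

print(parse_duration('23 min'))     # 23
print(parse_duration('2 Seasons'))  # 2
print(parse_duration(23))           # 23
```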

Create a function `average` that computes the mean of all the values in a column of the given dataset.

In [20]:
def average(data_l,loc_l):
    sum_l = 0
    count_l = 0
    for row_l in data_l:
        sum_l += row_l[loc_l]
        count_l += 1
    return sum_l / count_l
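
The `average` function divides by the row count, so an empty dataset would raise `ZeroDivisionError`. A hedged alternative using `statistics.mean` from the standard library, with an explicit guard for the empty case, might look like this; `column_mean` and the two-column rows are made up for illustration.

```python
from statistics import mean

def column_mean(rows, loc):
    """Mean of the numeric values in column `loc`, or None when `rows` is empty."""
    values = [row[loc] for row in rows]
    return mean(values) if values else None

rows = [['s1', 23], ['s2', 91]]  # made-up two-column rows
print(column_mean(rows, 1))  # 57
print(column_mean([], 1))    # None
```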

Use the newly created functions to calculate the average duration for movies.

In [21]:
disney_movies_mod = duration_conversion(disney_movies)

In [22]:
explore_data(disney_movies_mod,0,2,True)

['s1', 'Movie', 'Duck the Halls: A Mickey Mouse Christmas Special', 'Alonso Ramirez Ramos, Dave Wasson', 'Chris Diamantopoulos, Tony Anselmo, Tress MacNeille, Bill Farmer, Russi Taylor, Corey Burton', '', 'November 26, 2021', '2016', 'TV-G', 23, 'Animation, Family', 'Join Mickey and the gang as they duck the halls!']


['s2', 'Movie', 'Ernest Saves Christmas', 'John Cherry', 'Jim Varney, Noelle Parker, Douglas Seale', '', 'November 26, 2021', '1988', 'PG', 91, 'Comedy', 'Santa Claus passes his magic bag to a new St. Nic.']


no. of rows: 1052
no. of columns: 12


In [23]:
print("Average Duration of Movies in minutes: ", average(disney_movies_mod, 9))

Average Duration of Movies in minutes:  71.9106463878327


Now calculate the average duration for TV shows.

In [24]:
disney_show_mod = duration_conversion(disney_shows)
explore_data(disney_show_mod,0,2,True)

['s5', 'TV Show - Docuseries', 'The Beatles: Get Back', '', 'John Lennon, Paul McCartney, George Harrison, Ringo Starr', '', 'November 25, 2021', '2021', '', 1, 'Docuseries, Historical, Music', 'A three-part documentary from Peter Jackson capturing a moment in music history with The Beatles.']


['s7', 'TV Show - Season 1', 'Hawkeye', '', 'Jeremy Renner, Hailee Steinfeld, Vera Farmiga, Fra Fee, Tony Dalton, Zahn McClarnon', '', 'November 24, 2021', '2021', 'TV-14', 1, 'Action-Adventure, Superhero', 'Clint Barton/Hawkeye must team up with skilled archer Kate Bishop to unravel a criminal conspiracy.']


no. of rows: 398
no. of columns: 12


In [25]:
print("Average Duration of Shows in seasons: ", average(disney_show_mod, 9))

Average Duration of Shows in seasons:  2.1180904522613067


The average duration of movies is about 72 minutes (71.9).

The average duration of TV shows is about 2 seasons (2.12).

## Step 8: Can you make your function more robust? ##

Write a helper function `filter_data` that decides whether a row satisfies the filtering criteria passed as input, then call it inside `items_count` to generate statistics for the filtered data.

For example, suppose the Content Manager asks for the count of movies by genre for Germany only.

You start working on the function, and some time later the request is narrowed to the year 2004.

Because the filtering criteria keep changing, you decide to make the function accept a dynamic set of filters: a dictionary that maps a column index to the required value.

In [26]:
def filter_data(row_ll, filters_ll):
    # filters_ll maps a column index to the value that column must contain.
    for loc_filter_ll, val_filter_ll in filters_ll.items():
        # Multi-valued cells such as 'United States, Canada' are split on commas;
        # a single-valued cell simply becomes a one-element list.
        parts_ll = [part.strip() for part in row_ll[loc_filter_ll].split(',')]
        if val_filter_ll not in parts_ll:
            return False
    return True
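
The same row check can be sketched more compactly with `all()` and a generator expression. `matches_filters` is a hypothetical name, and the row below is shortened to the two columns the filters touch; the country value is taken from the 'Eddie the Eagle' row shown earlier.

```python
def matches_filters(row, filters):
    """True when every {column index: required value} pair matches the row."""
    return all(
        required in [part.strip() for part in row[loc].split(',')]
        for loc, required in filters.items()
    )

# Shortened row: only columns 5 (country) and 7 (release_year) are filled in.
row = ['', '', '', '', '', 'United Kingdom, Germany, United States', '', '2016']

print(matches_filters(row, {5: 'Germany'}))             # True
print(matches_filters(row, {5: 'Germany', 7: '2004'}))  # False
print(matches_filters(row, {}))                         # True
```

An empty filter dictionary matches every row because `all()` over an empty sequence is `True`.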

In [27]:
def items_count(dataset_l, loc_l, sep_l, filters_l):
    items_count_l = {}
    for row_l in dataset_l:
        # An empty filters dictionary keeps every row.
        if filters_l and not filter_data(row_l, filters_l):
            continue
        for item_l in row_l[loc_l].split(sep_l):
            item_l = item_l.strip()  # tolerate separators with or without a trailing space
            items_count_l[item_l] = items_count_l.get(item_l, 0) + 1
    sorted_items_count = dict(sorted(items_count_l.items(), key=lambda item: item[1], reverse=True))
    return sorted_items_count

Print the count of movies based on genre and without any filtering criteria

In [28]:
print(items_count(disney_movies, 10, ',', {}))

{'Family': 533, 'Comedy': 407, 'Animation': 381, 'Action-Adventure': 314, 'Documentary': 174, 'Fantasy': 158, 'Coming of Age': 153, 'Animals & Nature': 130, 'Drama': 121, 'Science Fiction': 76, 'Biographical': 41, 'Musical': 40, 'Kids': 39, 'Music': 38, 'Sports': 38, 'Historical': 38, 'Buddy': 24, 'Romance': 19, 'Superhero': 16, 'Crime': 16, 'Mystery': 8, 'Concert Film': 7, 'Variety': 7, 'Parody': 7, 'Anthology': 7, 'Dance': 6, 'Thriller': 5, 'Western': 5, 'Reality': 4, 'Lifestyle': 3, 'Movies': 3, 'Survival': 3, 'Spy/Espionage': 2, 'Romantic Comedy': 2, 'Medical': 2, 'Disaster': 2}


Now, print the count of movies based on genre and only for Germany

In [30]:
print(items_count(disney_movies, 10, ',', {5: 'Germany'}))

{'Comedy': 5, 'Action-Adventure': 4, 'Family': 4, 'Science Fiction': 2, 'Coming of Age': 2, 'Animation': 1, 'Animals & Nature': 1, 'Documentary': 1, 'Biographical': 1, 'Drama': 1, 'Buddy': 1}


Now print the count of movies by genre for Germany, restricted to the year 2004.

In [31]:
print(items_count(disney_movies, 10, ',', {5: 'Germany', 7: '2004'}))

{'Comedy': 2, 'Action-Adventure': 1, 'Family': 1, 'Coming of Age': 1}
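
The same filtering idea extends to the year-wise question about Germany from the introduction. The sketch below is self-contained and runs on made-up rows, so its counts are not results from the real dataset; `titles_per_year` and `make_row` are hypothetical helper names.

```python
from collections import Counter

COUNTRY, RELEASE_YEAR = 5, 7  # column positions from the header

def titles_per_year(rows, country):
    """Count rows produced in `country`, keyed by release year in ascending order."""
    counts = Counter()
    for row in rows:
        countries = [part.strip() for part in row[COUNTRY].split(',')]
        if country in countries:
            counts[row[RELEASE_YEAR]] += 1
    return dict(sorted(counts.items()))

def make_row(country, year):
    """Build a made-up 12-column row with only country and year filled in."""
    row = [''] * 12
    row[COUNTRY], row[RELEASE_YEAR] = country, year
    return row

toy_rows = [
    make_row('United Kingdom, Germany, United States', '2016'),
    make_row('Germany', '2004'),
    make_row('United States, Germany', '2004'),
    make_row('United States', '2009'),
]
print(titles_per_year(toy_rows, 'Germany'))  # {'2004': 2, '2016': 1}
```

On the real data, calling such a helper once with `disney_movies` and once with `disney_shows` would give the two year-wise lists.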
