## Practice Exercise 2

In this assignment, you will try to find some interesting insights into a few movies released between 1916 and 2016, using Python. You will have to download a movie dataset, write Python code to explore the data, gain insights into the movies, actors, directors, and collections, and submit the code.

#### Some tips before starting the assignment

1. Identify the task to be performed correctly, and only then proceed to write the required code. Don’t perform any incorrect analysis or look for information that isn’t required for the assignment.
2. In some cases, the variable names have already been assigned, and you just need to write code against them. In other cases, the names to be given are mentioned in the instructions. We strongly advise you to use the mentioned names only.
3. Always keep inspecting your data frame after you have performed a particular set of operations.
4. There are some checkpoints given in the IPython notebook provided. They're just useful pieces of information you can use to check if the result you have obtained after performing a particular task is correct or not.
5. Note that you will be asked to refer to documentation for solving some of the questions. That is done on purpose for you to learn new commands and also how to use the documentation.

In [4]:
# Import the numpy and pandas packages

import numpy as np
import pandas as pd

### Task 1: Reading and Inspection

**Subtask 1.1: Import and read**

Import and read the movie database. Store it in a variable called `movies`.

In [7]:
# Write your code for importing the csv file here
movies =  pd.read_csv("Movies.csv", header = 0)
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000


**Subtask 1.2: Inspect the dataframe**

Inspect the dataframe's columns, shapes, variable types etc.

In [10]:
# Write your code for inspection here
movies.shape

(3853, 28)

#### <font color='red'>Question 1: How many rows and columns are present in the dataframe? </font>
-  <font color='red'>(3821, 26)</font>
-  <font color='red'>(3879, 28)</font>
-  <font color='red'>(3853, 28)</font>
-  <font color='red'>(3866, 26)</font>

#### <font color='red'>Question 2: How many columns have null values present in them? Try writing code for this instead of counting them manually.</font>

-  <font color='red'>3</font>
-  <font color='red'>6</font>
-  <font color='red'>9</font>
-  <font color='red'>12</font>

In [14]:

print(movies.isnull().any().sum()) 

12
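
Beyond the count, it can help to see which columns contain the nulls. The following is a minimal sketch on a small made-up frame (the `toy` data and its values are assumptions for illustration, not the movie dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real movie data
toy = pd.DataFrame({
    "color": ["Color", None, "Color"],
    "duration": [178.0, np.nan, 148.0],
    "gross": [1.0, 2.0, 3.0],
})

# One boolean per column: True if the column holds at least one null
has_null = toy.isnull().any()

# Names of the columns that contain nulls, and how many such columns there are
cols_with_nulls = has_null[has_null].index.tolist()
n_cols_with_nulls = int(has_null.sum())

print(cols_with_nulls)
print(n_cols_with_nulls)
```

Applied to `movies`, the same two expressions would list the affected columns alongside the count printed above.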


In [16]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3853 entries, 0 to 3852
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      3851 non-null   object 
 1   director_name              3853 non-null   object 
 2   num_critic_for_reviews     3852 non-null   float64
 3   duration                   3852 non-null   float64
 4   director_facebook_likes    3853 non-null   float64
 5   actor_3_facebook_likes     3847 non-null   float64
 6   actor_2_name               3852 non-null   object 
 7   actor_1_facebook_likes     3853 non-null   float64
 8   gross                      3853 non-null   float64
 9   genres                     3853 non-null   object 
 10  actor_1_name               3853 non-null   object 
 11  movie_title                3853 non-null   object 
 12  num_voted_users            3853 non-null   int64  
 13  cast_total_facebook_likes  3853 non-null   int64

### Task 2: Cleaning the Data

**Subtask 2.1: Drop unnecessary columns**

For this assignment, you will mostly analyze the movies with respect to ratings, gross collection, popularity, and so on, so many of the columns in this dataframe are not required. Drop the following columns:
-  color
-  director_facebook_likes
-  actor_1_facebook_likes
-  actor_2_facebook_likes
-  actor_3_facebook_likes
-  actor_2_name
-  cast_total_facebook_likes
-  actor_3_name
-  duration
-  facenumber_in_poster
-  content_rating
-  country
-  movie_imdb_link
-  aspect_ratio
-  plot_keywords

In [19]:
# Check the 'drop' function in the Pandas library - dataframe.drop(list_of_unnecessary_columns, axis = )
# Write your code for dropping the columns here. It is advised to keep inspecting the dataframe after each set of operations

cols_to_drop = [
    'color', 'director_facebook_likes', 'actor_1_facebook_likes',
    'actor_2_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
    'cast_total_facebook_likes', 'actor_3_name', 'duration',
    'facenumber_in_poster', 'content_rating', 'country',
    'movie_imdb_link', 'aspect_ratio', 'plot_keywords',
]
movies.drop(cols_to_drop, axis=1, inplace=True)
movies.head()

Unnamed: 0,director_name,num_critic_for_reviews,gross,genres,actor_1_name,movie_title,num_voted_users,num_user_for_reviews,language,budget,title_year,imdb_score,movie_facebook_likes
0,James Cameron,723.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,3054.0,English,237000000.0,2009.0,7.9,33000
1,Gore Verbinski,302.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,1238.0,English,300000000.0,2007.0,7.1,0
2,Sam Mendes,602.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,994.0,English,245000000.0,2015.0,6.8,85000
3,Christopher Nolan,813.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,2701.0,English,250000000.0,2012.0,8.5,164000
4,Andrew Stanton,462.0,73058679.0,Action|Adventure|Sci-Fi,Daryl Sabara,John Carter,212204,738.0,English,263700000.0,2012.0,6.6,24000


#### <font color='red'>Question 3: What is the count of columns in the new dataframe? </font>
-  <font color='red'>10</font>
-  <font color='red'>13</font>
-  <font color='red'>15</font>
-  <font color='red'>17</font>

In [22]:
len(movies.columns)

13

**Subtask 2.2: Inspect Null values**

As you have seen above, there are null values in multiple columns of the dataframe 'movies'. Find out the percentage of null values in each column of the dataframe 'movies'. 

In [25]:
# Write your code here
null_percent = (movies.isnull().sum() / len(movies)) * 100
null_percent = null_percent.round(2)  # Round to 2 decimal places
print(null_percent)
# This gives column-wise null percentages, because:
# movies.isnull() creates a DataFrame of the same shape with True for nulls.
# .sum() sums down each column, giving the total number of nulls per column.
# / len(movies) divides each column's count by the total number of rows.

director_name             0.00
num_critic_for_reviews    0.03
gross                     0.00
genres                    0.00
actor_1_name              0.00
movie_title               0.00
num_voted_users           0.00
num_user_for_reviews      0.00
language                  0.10
budget                    0.00
title_year                0.00
imdb_score                0.00
movie_facebook_likes      0.00
dtype: float64


In [27]:
round(100 * (movies.isnull().sum() / len(movies.index)), 2)

director_name             0.00
num_critic_for_reviews    0.03
gross                     0.00
genres                    0.00
actor_1_name              0.00
movie_title               0.00
num_voted_users           0.00
num_user_for_reviews      0.00
language                  0.10
budget                    0.00
title_year                0.00
imdb_score                0.00
movie_facebook_likes      0.00
dtype: float64

#### <font color='red'>Question 4: Which column has the highest percentage of null values? </font>
-  <font color='red'>language</font>
-  <font color='red'>genres</font>
-  <font color='red'>num_critic_for_reviews</font>
-  <font color='red'>imdb_score</font>

If you want the overall null percentage across the whole DataFrame:

total_cells = movies.size  # total number of cells
total_nulls = movies.isnull().sum().sum()  # total null values
overall_null_percentage = (total_nulls / total_cells) * 100

print(f"Overall null percentage: {overall_null_percentage:.2f}%")

Difference:
-  Column-level: helpful to see which features are incomplete.
-  Overall: helpful to understand how sparse the dataset is as a whole.

**Subtask 2.3: Fill NaN values**

You might notice that the `language` column has some NaN values. Here, on inspection, you will see that it is safe to replace all the missing values with `'English'`.

In [32]:
# Write your code for filling the NaN values in the 'language' column here
# Assign the result back to the column. Calling fillna(..., inplace=True) on a
# column selection raises a FutureWarning and will stop modifying `movies` in pandas 3.0.
movies['language'] = movies['language'].fillna('English')


#### <font color='red'>Question 5: What is the count of movies made in English language after replacing the NaN values with English? </font>
-  <font color='red'>3670</font>
-  <font color='red'>3674</font>
-  <font color='red'>3668</font>
-  <font color='red'>3672</font>

In [35]:
len(movies[movies['language']=='English'])

3675

### Task 3: Data Analysis

**Subtask 3.1: Change the unit of columns**

Convert the unit of the `budget` and `gross` columns from `$` to `million $`.

In [38]:
# Write your code for unit conversion here
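
The cell above was left empty. One way to do the conversion is sketched below on a two-row frame; `toy` is a hypothetical name, and its values are copied from the first two rows of the `head()` output earlier in this notebook. Dividing both columns by one million in a single vectorized step is an assumption about the intended approach, not the assignment's reference solution:

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the assignment
toy = pd.DataFrame({
    "budget": [237000000.0, 300000000.0],
    "gross": [760505847.0, 309404152.0],
})

# Divide both columns by one million so the unit becomes "million $"
toy[["budget", "gross"]] = toy[["budget", "gross"]] / 1_000_000

print(toy)
```

On the real data the same line would be applied to `movies` instead of `toy`, before the `profit` column is created.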


**Subtask 3.2: Find the movies with highest profit**

   1. Create a new column called `profit` which contains the difference of the two columns: `gross` and `budget`.
   2. Sort the dataframe using the `profit` column as reference. (Find which command can be used here to sort entries from the documentation)
   3. Extract the top ten profiting movies in descending order and store them in a new dataframe - `top10`

In [41]:
# Write your code for creating the profit column here
movies["profit"]=movies["gross"]-movies["budget"]

In [43]:
# Write your code for sorting the dataframe here
movies_sorted = movies.sort_values(by='profit', ascending=False)

In [45]:
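# Keep only the ten most profitable movies
top10 = movies_sorted.head(10)
# Display the full sorted dataframe for reference
movies_sorted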

Unnamed: 0,director_name,num_critic_for_reviews,gross,genres,actor_1_name,movie_title,num_voted_users,num_user_for_reviews,language,budget,title_year,imdb_score,movie_facebook_likes,profit
0,James Cameron,723.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,3054.0,English,2.370000e+08,2009.0,7.9,33000,5.235058e+08
28,Colin Trevorrow,644.0,652177271.0,Action|Adventure|Sci-Fi|Thriller,Bryce Dallas Howard,Jurassic World,418214,1290.0,English,1.500000e+08,2015.0,7.0,150000,5.021773e+08
25,James Cameron,315.0,658672302.0,Drama|Romance,Leonardo DiCaprio,Titanic,793059,2528.0,English,2.000000e+08,1997.0,7.7,26000,4.586723e+08
2704,George Lucas,282.0,460935665.0,Action|Adventure|Fantasy|Sci-Fi,Harrison Ford,Star Wars: Episode IV - A New Hope,911097,1470.0,English,1.100000e+07,1977.0,8.7,33000,4.499357e+08
2748,Steven Spielberg,215.0,434949459.0,Family|Sci-Fi,Henry Thomas,E.T. the Extra-Terrestrial,281842,515.0,English,1.050000e+07,1982.0,7.9,34000,4.244495e+08
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2147,Katsuhiro Ôtomo,105.0,410388.0,Action|Adventure|Animation|Family|Sci-Fi|Thriller,William Hootkins,Steamboy,13727,79.0,Japanese,2.127520e+09,2004.0,6.9,973,-2.127110e+09
2136,Hayao Miyazaki,174.0,2298191.0,Adventure|Animation|Fantasy,Minnie Driver,Princess Mononoke,221552,570.0,Japanese,2.400000e+09,1997.0,8.4,11000,-2.397702e+09
2693,Lajos Koltai,73.0,195888.0,Drama|Romance|War,Marcell Nagy,Fateless,5603,45.0,Hungarian,2.500000e+09,2005.0,7.1,607,-2.499804e+09
3282,Chan-wook Park,202.0,211667.0,Crime|Drama,Min-sik Choi,Lady Vengeance,53508,131.0,Korean,4.200000e+09,2005.0,7.7,4000,-4.199788e+09
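
Sorting and then taking the first rows can also be expressed with `DataFrame.nlargest`. The sketch below uses a made-up four-row frame (titles and amounts are placeholders, not dataset values) and picks the top two rows by `profit`:

```python
import pandas as pd

# Hypothetical data: titles and amounts are placeholders
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "gross": [500.0, 120.0, 300.0, 50.0],
    "budget": [200.0, 100.0, 20.0, 90.0],
})

# profit = gross - budget, giving 300, 20, 280 and -40
toy["profit"] = toy["gross"] - toy["budget"]

# The two rows with the highest profit, already in descending order
top2 = toy.nlargest(2, "profit")

print(top2["movie_title"].tolist())
```

For the assignment, `movies.nlargest(10, 'profit')` should select the same rows as `movies.sort_values(by='profit', ascending=False).head(10)`, although the order of rows with equal profit may differ.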


**Checkpoint:** You might spot two movies directed by `James Cameron` in the list.

#### <font color='red'>Question 6: Which movie is ranked 5th from the top in the list obtained? </font>
-  <font color='red'>E.T. the Extra-Terrestrial</font>
-  <font color='red'>The Avengers</font>
-  <font color='red'>The Dark Knight</font>
-  <font color='red'>Titanic</font>

**Subtask 3.3: Find IMDb Top 250**

Create a new dataframe `IMDb_Top_250` and store the top 250 movies with the highest IMDb Rating (corresponding to the column: `imdb_score`). Also make sure that for all of these movies, the `num_voted_users` is greater than 25,000. 

Also add a `Rank` column containing the values 1 to 250 indicating the ranks of the corresponding films.

In [50]:
# Write your code for extracting the top 250 movies as per the IMDb score here. Make sure that you store it in a new dataframe 
# and name that dataframe as 'IMDb_Top_250'
# Filter movies with more than 25,000 votes
filtered_movies = movies[movies['num_voted_users'] > 25000]

# Sort by IMDb score in descending order and take the top 250
IMDb_Top_250 = filtered_movies.sort_values(by='imdb_score', ascending=False).head(250).copy()

# Add a Rank column
IMDb_Top_250['Rank'] = range(1, 251)

# View the top 5 rows
print(IMDb_Top_250.head())
print("next line")
print(IMDb_Top_250[['Rank', 'imdb_score', 'num_voted_users']].head())


             director_name  num_critic_for_reviews        gross  \
1795        Frank Darabont                   199.0   28341469.0   
3016  Francis Ford Coppola                   208.0  134821952.0   
64       Christopher Nolan                   645.0  533316061.0   
2543  Francis Ford Coppola                   149.0   57300000.0   
325          Peter Jackson                   328.0  377019252.0   

                              genres    actor_1_name  \
1795                     Crime|Drama  Morgan Freeman   
3016                     Crime|Drama       Al Pacino   
64       Action|Crime|Drama|Thriller  Christian Bale   
2543                     Crime|Drama  Robert De Niro   
325   Action|Adventure|Drama|Fantasy   Orlando Bloom   

                                         movie_title  num_voted_users  \
1795                       The Shawshank Redemption           1689764   
3016                                  The Godfather           1155770   
64                                  The D

#### <font color='red'>Question 7: Suppose movies are divided into 5 buckets based on the IMDb ratings: </font>
-  <font color='red'>7.5 to 8</font>
-  <font color='red'>8 to 8.5</font>
-  <font color='red'>8.5 to 9</font>
-  <font color='red'>9 to 9.5</font>
-  <font color='red'>9.5 to 10</font>

<font color = 'red'> Which bucket holds the maximum number of movies from *IMDb_Top_250*? </font>

In [53]:
import pandas as pd

# Define the function to assign buckets based on imdb_score
def assign_bucket(score):
    if 7.5 <= score < 8:
        return '7.5 - 8'
    elif 8 <= score < 8.5:
        return '8 - 8.5'
    elif 8.5 <= score < 9:
        return '8.5 - 9'
    elif 9 <= score < 9.5:
        return '9 - 9.5'
    elif 9.5 <= score <= 10:
        return '9.5 - 10'
    else:
        return 'Others'

# Apply the function to create a new column
IMDb_Top_250['bucket'] = IMDb_Top_250['imdb_score'].apply(assign_bucket)

# Count movies in each bucket
bucket_counts = IMDb_Top_250['bucket'].value_counts()

# Find the bucket with the most movies
most_common_bucket = bucket_counts.idxmax()
most_common_count = bucket_counts.max()

# Display the result
print("Bucket with the most movies:", most_common_bucket)
print("Number of movies in that bucket:", most_common_count)


Bucket with the most movies: 8 - 8.5
Number of movies in that bucket: 159
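
The same bucketing can be done without a hand-written function by using `pd.cut`. This is a sketch on a handful of made-up scores (the `scores` values are assumptions, not taken from the dataset); `right=False` makes every bin closed on the left, matching the `<=` / `<` pattern in `assign_bucket`:

```python
import pandas as pd

# Hypothetical scores, not taken from the dataset
scores = pd.Series([7.6, 8.0, 8.1, 8.4, 8.5, 9.3], name="imdb_score")

edges = [7.5, 8.0, 8.5, 9.0, 9.5, 10.0]
labels = ["7.5 - 8", "8 - 8.5", "8.5 - 9", "9 - 9.5", "9.5 - 10"]

# right=False gives the bins [7.5, 8), [8, 8.5), [8.5, 9), [9, 9.5), [9.5, 10)
buckets = pd.cut(scores, bins=edges, labels=labels, right=False)

# Count the scores per bucket and pick the fullest one
bucket_counts = buckets.value_counts()
print(bucket_counts.idxmax(), bucket_counts.max())
```

One difference from `assign_bucket`: with these edges a score of exactly 10.0 falls outside the last bin and becomes NaN, whereas the function above assigns it to `'9.5 - 10'`.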


**Subtask 3.4: Find the critic-favorite and audience-favorite actors**

   1. Create three new dataframes namely, `Meryl_Streep`, `Leo_Caprio`, and `Brad_Pitt` which contain the movies in which the actors: 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' are the lead actors. Use only the `actor_1_name` column for extraction. Also, make sure that you use the names 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' for the said extraction.
   2. Append the rows of all these dataframes and store them in a new dataframe named `Combined`.
   3. Group the combined dataframe using the `actor_1_name` column.
   4. Find the mean of the `num_critic_for_reviews` and `num_user_for_reviews` columns and identify the actors with the highest means.

In [56]:
Meryl_Streep = movies.loc[movies.actor_1_name == 'Meryl Streep']
Leo_Caprio = movies.loc[movies.actor_1_name == 'Leonardo DiCaprio']
Brad_Pitt = movies.loc[movies.actor_1_name == 'Brad Pitt']
Combined = pd.concat([Meryl_Streep, Brad_Pitt, Leo_Caprio])
Combined_by_segment = Combined.groupby('actor_1_name')
Combined_by_segment['num_critic_for_reviews'].mean()  # critic side; only the last expression in a cell is displayed
Combined_by_segment['num_user_for_reviews'].mean()

actor_1_name
Brad Pitt            742.352941
Leonardo DiCaprio    914.476190
Meryl Streep         297.181818
Name: num_user_for_reviews, dtype: float64

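# Write your code for creating three new dataframes here
Meryl_Streep = movies[movies['actor_1_name'] == 'Meryl Streep']  # all movies in which Meryl Streep is the lead

Leo_Caprio = movies[movies['actor_1_name'] == 'Leonardo DiCaprio']  # all movies in which Leonardo DiCaprio is the lead

In [None]:
Brad_Pitt = movies[movies['actor_1_name'] == 'Brad Pitt']  # all movies in which Brad Pitt is the lead

# Write your code for combining the three dataframes here
Combined = pd.concat([Meryl_Streep, Leo_Caprio, Brad_Pitt])

In [None]:
# Write your code for grouping the combined dataframe here
Combined_by_segment = Combined.groupby('actor_1_name')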


In [66]:
# Write the code for finding the mean of critic reviews and audience reviews here
Combined_by_segment['imdb_score'].mean()

actor_1_name
Brad Pitt            7.152941
Leonardo DiCaprio    7.495238
Meryl Streep         6.745455
Name: imdb_score, dtype: float64
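
The cell above averages `imdb_score`, while step 4 of the subtask asks about the two review-count columns. A minimal sketch of computing both means in one pass and reading off the favorites is shown below; the frame `toy`, the placeholder actor names, and all numbers are assumptions for illustration:

```python
import pandas as pd

# Hypothetical rows; actor names and counts are placeholders
toy = pd.DataFrame({
    "actor_1_name": ["Actor A", "Actor A", "Actor B", "Actor B"],
    "num_critic_for_reviews": [100.0, 300.0, 150.0, 50.0],
    "num_user_for_reviews": [400.0, 600.0, 900.0, 700.0],
})

# Mean of both review-count columns per lead actor
means = toy.groupby("actor_1_name")[
    ["num_critic_for_reviews", "num_user_for_reviews"]
].mean()

# idxmax returns the group label (actor) with the highest mean
critic_favorite = means["num_critic_for_reviews"].idxmax()
audience_favorite = means["num_user_for_reviews"].idxmax()

print(means)
print(critic_favorite, audience_favorite)
```

With the real `Combined` dataframe in place of `toy`, the two `idxmax` calls would name the critic-favorite and audience-favorite actors directly.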

#### <font color='red'>Question 8: Which actor is highest rated among the three actors according to the user reviews? </font>
-  <font color='red'>Meryl Streep</font>
-  <font color='red'>Leonardo DiCaprio</font>
-  <font color='red'>Brad Pitt</font>

#### <font color='red'>Question 9: Which actor is highest rated among the three actors according to the critics?</font>
-  <font color='red'>Meryl Streep</font>
-  <font color='red'>Leonardo DiCaprio</font>
-  <font color='red'>Brad Pitt</font>

The two columns `actor_1_facebook_likes` and `actor_2_facebook_likes` have quite a few missing values, but since you have few data points, you don't want to drop many rows. For the hypothetical analysis you're conducting, it is acceptable for one of the two values to be missing, but not both. Your aim is to find the indices of the rows in which both columns have missing values simultaneously.

Expected Output:

- First print the indices of the rows where both these columns have missing values. The print statement has been provided in the stub. You just need to fill it. 

- After you have printed the above indices, drop these particular rows and print the number of retained rows in the dataframe.

A sample output would look like the following:

[389, 1019, 1178, 3400, 4012]

4847

Here, the list on the first line is a sample of the indices of the rows where both columns have missing values, and the second line is the number of rows remaining in the dataframe after those rows are dropped.

In [70]:
# Importing the pandas package
import pandas as pd 

# Reading the dataframe
movies1 =  pd.read_csv("Movies.csv", header = 0)

# Find rows where both actor_1_facebook_likes and actor_2_facebook_likes are NaN
missing_both = movies1[movies1['actor_1_facebook_likes'].isna() & movies1['actor_2_facebook_likes'].isna()]

# Print their indices
print(list(missing_both.index))

# Drop those rows from the original DataFrame
movies1 = movies1.drop(missing_both.index)

# Print the number of rows left
print(len(movies1))


[]
3853
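
The recorded run prints an empty list and 3853, so no row in that file had both values missing. To see the logic on data where it does trigger, here is a sketch on a made-up four-row frame (the `toy` values are assumptions). It also shows `dropna(subset=..., how='all')` as an alternative way to drop the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is the only one with both values missing
toy = pd.DataFrame({
    "actor_1_facebook_likes": [1000.0, np.nan, np.nan, 640.0],
    "actor_2_facebook_likes": [936.0, 5000.0, np.nan, np.nan],
})

cols = ["actor_1_facebook_likes", "actor_2_facebook_likes"]

# True only for rows where every column in `cols` is NaN
both_missing = toy[cols].isna().all(axis=1)
print(toy.index[both_missing].tolist())

# how="all" drops a row only if all of the listed columns are NaN
cleaned = toy.dropna(subset=cols, how="all")
print(len(cleaned))
```

Rows 1 and 3 survive because only one of their two values is missing.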


In [72]:
# Convert column names from lower case to title case
# Importing the pandas package
import pandas as pd

# Reading the dataframe
df =pd.read_csv("Movies.csv", header = 0)

# Convert all column names to title case (e.g. 'director_name' -> 'Director_Name')
df.columns = df.columns.str.title()

# Display the updated DataFrame
print(df.columns)


# Printing the final columns. Do not edit this part.


Index(['Color', 'Director_Name', 'Num_Critic_For_Reviews', 'Duration',
       'Director_Facebook_Likes', 'Actor_3_Facebook_Likes', 'Actor_2_Name',
       'Actor_1_Facebook_Likes', 'Gross', 'Genres', 'Actor_1_Name',
       'Movie_Title', 'Num_Voted_Users', 'Cast_Total_Facebook_Likes',
       'Actor_3_Name', 'Facenumber_In_Poster', 'Plot_Keywords',
       'Movie_Imdb_Link', 'Num_User_For_Reviews', 'Language', 'Country',
       'Content_Rating', 'Budget', 'Title_Year', 'Actor_2_Facebook_Likes',
       'Imdb_Score', 'Aspect_Ratio', 'Movie_Facebook_Likes'],
      dtype='object')


There are a lot of columns that aren't visible. But you might have noticed straight away that there are quite a few missing values in the data frame. Two columns, for instance, `aspect_ratio` and `facenumber_in_poster`, also have a few missing values (NaN). Replace the missing values with the median of the respective column and print the null value count for both.

Expected Output: First print the number of missing values in both of these columns, then output the median in both the columns and then impute the missing values with the respective medians and print the count of missing values again. Store all of these in a dictionary format like the following:

{'aspect_ratio_mv': 431, 'facenumber_in_poster_mv': 97}

{'aspect_ratio_median': 1.44, 'facenumber_in_poster_median': 2.0}

{'aspect_ratio_final': 0, 'facenumber_in_poster_final': 0}

The code for the same has been provided in the stub; you just need to complete these dictionaries.
Note: You don't need to use any print statement. The print statements have already been written; you just need to complete the dictionaries provided in the stub.

In [None]:
# Importing the pandas package
import pandas as pd 

# Reading the movies dataframe
movies = pd.read_csv('https://query.data.world/s/zlr77ctyxez3kv6zqn6nn5six42cfq')

# Your aim is to complete the following three print statements after all the colons

# Get the null value counts in both the columns
aspect_ratio_mv = movies['aspect_ratio'].isnull().sum()
facenumber_in_poster_mv = movies['facenumber_in_poster'].isnull().sum()

# Complete the first dictionary
mv = {'aspect_ratio_mv': aspect_ratio_mv, 'facenumber_in_poster_mv': facenumber_in_poster_mv}

# Get the median of both the columns
aspect_ratio_median = movies['aspect_ratio'].median()
facenumber_in_poster_median = movies['facenumber_in_poster'].median()

# Complete the second dictionary
median = {'aspect_ratio_median': aspect_ratio_median, 'facenumber_in_poster_median': facenumber_in_poster_median}

# Impute the missing values with the medians computed above
movies['aspect_ratio'] = movies['aspect_ratio'].fillna(aspect_ratio_median)
movies['facenumber_in_poster'] = movies['facenumber_in_poster'].fillna(facenumber_in_poster_median)

# Get the final null value count of both the columns
aspect_ratio_final = movies['aspect_ratio'].isnull().sum()
facenumber_in_poster_final = movies['facenumber_in_poster'].isnull().sum()

# Complete the third dictionary
final = {'aspect_ratio_final': aspect_ratio_final, 'facenumber_in_poster_final': facenumber_in_poster_final}

# Printing the values in the three dictionaries. Please do not edit this part
print(sorted(mv.values()))
print(sorted(median.values()))
print(sorted(final.values()))
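
The count, impute, recount sequence above can be checked on a small frame where the medians are easy to verify by hand. This is a sketch only: `toy` and its values are made up, and filling both columns at once by passing a Series of medians to `fillna` is a variation on the cell above rather than the stub's expected form:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
toy = pd.DataFrame({
    "aspect_ratio": [1.78, 2.35, np.nan, 2.35],
    "facenumber_in_poster": [0.0, np.nan, 1.0, 3.0],
})
cols = ["aspect_ratio", "facenumber_in_poster"]

# median() skips NaN by default: 2.35 and 1.0 here
medians = toy[cols].median()
before = toy[cols].isnull().sum()

# Passing a Series fills each column with the value under its own name
toy[cols] = toy[cols].fillna(medians)
after = toy[cols].isnull().sum()

print(before.to_dict(), medians.to_dict(), after.to_dict())
```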