## Data Analysis on "Movie" Data

Insights into a set of movies released between 1916 and 2016, using Python.
Source: Movie dataset
Python code is used to explore the data and gain insights into the movies, actors, directors, and collections.

In [1]:
# Import the numpy and pandas packages

import numpy as np
import pandas as pd

### Task 1: Reading and Inspection

**Subtask 1.1: Import and read**

Import and read the movie database. Store it in a variable called `movies`.

In [2]:
# Write your code for importing the csv file here
movies = pd.read_csv("movies.csv")
print(movies)

      color      director_name  num_critic_for_reviews  duration  \
0     Color      James Cameron                   723.0     178.0   
1     Color     Gore Verbinski                   302.0     169.0   
2     Color         Sam Mendes                   602.0     148.0   
3     Color  Christopher Nolan                   813.0     164.0   
4     Color     Andrew Stanton                   462.0     132.0   
...     ...                ...                     ...       ...   
3848  Color      Shane Carruth                   143.0      77.0   
3849  Color   Neill Dela Llana                    35.0      80.0   
3850  Color   Robert Rodriguez                    56.0      81.0   
3851  Color       Edward Burns                    14.0      95.0   
3852  Color           Jon Gunn                    43.0      90.0   

      director_facebook_likes  actor_3_facebook_likes        actor_2_name  \
0                         0.0                   855.0    Joel David Moore   
1                       563.0
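
Printing a whole dataframe produces long, wrapped output like the above. As a minimal sketch of a lighter first look, the snippet below reads a small, made-up CSV from memory (the three rows and the name `sample` are illustrative assumptions, not part of the movie dataset) and prints only the first rows with `head()`.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for movies.csv
csv_text = """director_name,gross,budget,language
Director A,100.0,50.0,English
Director B,,20.0,
Director C,30.0,40.0,French
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# head() shows only the first rows, which is easier to scan than the full frame
print(sample.head(2))
```

Empty fields in the CSV are read as NaN, which is what the null-value checks later in this notebook look for.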

**Subtask 1.2: Inspect the dataframe**

Inspect the dataframe's columns, shapes, variable types etc.

In [4]:

# Write your code for inspection here
movies = pd.read_csv("movies.csv")
print(movies)
# How many rows and columns are present in the dataframe? 
print(movies.shape)

# How many columns have null values present in them?
# Count the columns that contain at least one NaN
print((movies.isnull().sum() > 0).sum())

# Count all NaN values in the whole dataframe
print(movies.isnull().sum().sum())

# Count the NaN values in each column
print(movies.isnull().sum(axis=0))

      color      director_name  num_critic_for_reviews  duration  \
0     Color      James Cameron                   723.0     178.0   
1     Color     Gore Verbinski                   302.0     169.0   
2     Color         Sam Mendes                   602.0     148.0   
3     Color  Christopher Nolan                   813.0     164.0   
4     Color     Andrew Stanton                   462.0     132.0   
...     ...                ...                     ...       ...   
3848  Color      Shane Carruth                   143.0      77.0   
3849  Color   Neill Dela Llana                    35.0      80.0   
3850  Color   Robert Rodriguez                    56.0      81.0   
3851  Color       Edward Burns                    14.0      95.0   
3852  Color           Jon Gunn                    43.0      90.0   

      director_facebook_likes  actor_3_facebook_likes        actor_2_name  \
0                         0.0                   855.0    Joel David Moore   
1                       563.0
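
The subtask also asks for variable types, which `dtypes` reports. The following is a small sketch on a made-up frame (the name `sample` and its values are assumptions for illustration) that collects shape, column types, and null counts in one place.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for illustration
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "gross": [100.0, np.nan, 30.0],
    "language": ["English", None, "French"],
})

n_rows, n_cols = sample.shape             # number of rows and columns
dtypes = sample.dtypes                    # type of each column
nulls_per_column = sample.isnull().sum()  # NaN count per column
columns_with_nulls = int((nulls_per_column > 0).sum())
total_nulls = int(nulls_per_column.sum())

print(n_rows, n_cols)
print(dtypes)
print(nulls_per_column)
print(columns_with_nulls, total_nulls)
```

`sample.info()` prints a similar summary (non-null counts and dtypes) in a single call.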

### Task 2: Cleaning the Data

**Subtask 2.1: Drop unnecessary columns**

For this assignment, you will mostly analyze the movies with respect to ratings, gross collection, popularity, and so on, so many of the columns in this dataframe are not required. Drop the following columns:
-  color
-  director_facebook_likes
-  actor_1_facebook_likes
-  actor_2_facebook_likes
-  actor_3_facebook_likes
-  actor_2_name
-  cast_total_facebook_likes
-  actor_3_name
-  duration
-  facenumber_in_poster
-  content_rating
-  country
-  movie_imdb_link
-  aspect_ratio
-  plot_keywords

In [15]:
# Check the 'drop' function in the Pandas library - dataframe.drop(list_of_unnecessary_columns, axis = )
# Write your code for dropping the columns here. It is advised to keep inspecting the dataframe after each set of operations

cols_to_drop = ['color', 'director_facebook_likes', 'actor_1_facebook_likes',
                'actor_2_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
                'cast_total_facebook_likes', 'actor_3_name', 'duration',
                'facenumber_in_poster', 'content_rating', 'country',
                'movie_imdb_link', 'aspect_ratio', 'plot_keywords']

# Drop the columns once, then confirm the new shape
movies = movies.drop(cols_to_drop, axis=1)
print(movies.shape)


(3853, 13)
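
Re-running a drop cell fails with a `KeyError` once the columns are already gone. One possible way to make the step repeatable, sketched here on a made-up frame (the names `sample`, `unwanted`, and `trimmed` are illustrative assumptions), is to pass `errors="ignore"` and report which requested columns were absent.

```python
import pandas as pd

# Hypothetical frame used only for illustration
sample = pd.DataFrame({
    "movie_title": ["A", "B"],
    "color": ["Color", "Color"],
    "duration": [120.0, 95.0],
    "imdb_score": [7.9, 6.5],
})

# 'aspect_ratio' is deliberately not in the frame
unwanted = ["color", "duration", "aspect_ratio"]

# Report requested columns that do not exist instead of failing on them
missing = [c for c in unwanted if c not in sample.columns]
print("Not found:", missing)

# errors="ignore" skips labels that are not present
trimmed = sample.drop(columns=unwanted, errors="ignore")
print(trimmed.shape)
```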


**Subtask 2.2: Inspect Null values**

As you have seen above, there are null values in multiple columns of the dataframe 'movies'. Find out the percentage of null values in each column of the dataframe 'movies'. 

In [17]:
# Which column has the highest percentage of null values
print(round(100*(movies.isnull().sum()/len(movies.index)), 2))


director_name             0.00
num_critic_for_reviews    0.03
gross                     0.00
genres                    0.00
actor_1_name              0.00
movie_title               0.00
num_voted_users           0.00
num_user_for_reviews      0.00
language                  0.08
budget                    0.00
title_year                0.00
imdb_score                0.00
movie_facebook_likes      0.00
dtype: float64
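
The same percentages can be computed with `isnull().mean()`, since the mean of a boolean column is the fraction of `True` values. This is a sketch on a made-up frame (`sample` and its values are assumptions); sorting puts the column with the most missing data first.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for illustration
sample = pd.DataFrame({
    "gross": [100.0, np.nan, np.nan, 30.0],
    "language": ["English", None, "French", "English"],
    "budget": [50.0, 20.0, 40.0, 10.0],
})

# Fraction of nulls per column, expressed as a percentage
pct_null = (sample.isnull().mean() * 100).round(2).sort_values(ascending=False)
print(pct_null)

# Column with the highest share of nulls
worst_column = pct_null.idxmax()
print(worst_column)
```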


**Subtask 2.3: Fill NaN values**

You might notice that the `language` column has some NaN values. Here, on inspection, you will see that it is safe to replace all the missing values with `'English'`.

In [19]:
# Write your code for filling the NaN values in the 'language' column here

# What is the count of movies made in the English language after replacing the NaN values with 'English'?

# Replace the missing entries in the 'language' column with 'English'
movies.loc[pd.isnull(movies['language']), ['language']] = 'English'
# Next, count the movies made in English
print((movies.language == 'English').sum())


3674
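
An equivalent and somewhat more common way to fill missing values is `Series.fillna`. The sketch below uses a made-up frame (`sample` is an assumption for illustration) and counts English-language rows before and after the fill.

```python
import pandas as pd

# Hypothetical frame used only for illustration
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "language": ["English", None, "French", None],
})

english_before = int((sample["language"] == "English").sum())

# fillna replaces only the missing entries and leaves other values untouched
sample["language"] = sample["language"].fillna("English")

english_after = int((sample["language"] == "English").sum())
print(english_before, english_after)
```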


### Task 3: Data Analysis

**Subtask 3.1: Change the unit of columns**

Convert the unit of the `budget` and `gross` columns from `$` to `million $`.

In [20]:
# Write your code for unit conversion here
# Arithmetic on a column is vectorized, so no apply/lambda is needed
movies["gross"] = movies["gross"] / 1000000
movies["budget"] = movies["budget"] / 1000000
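
Both columns can also be rescaled in a single statement by selecting them together. A minimal sketch on made-up values (`sample` and `money_cols` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical dollar amounts used only for illustration
sample = pd.DataFrame({
    "movie_title": ["A", "B"],
    "gross": [2_500_000.0, 10_000_000.0],
    "budget": [1_000_000.0, 4_000_000.0],
})

money_cols = ["gross", "budget"]

# Divide both columns at once to convert $ to million $
sample[money_cols] = sample[money_cols] / 1_000_000
print(sample)
```

Note that running such a cell twice divides the values twice, so it is worth checking the magnitude of the columns after conversion.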

**Subtask 3.2: Find the movies with highest profit**

   1. Create a new column called `profit` which contains the difference of the two columns: `gross` and `budget`.
   2. Sort the dataframe using the `profit` column as reference. (Find which command can be used here to sort entries from the documentation)
   3. Extract the top ten profiting movies in descending order and store them in a new dataframe - `top10`

In [21]:
# Write your code for creating the profit column here
movies["profit"] = movies["gross"] - movies["budget"]

In [24]:
# Write your code for sorting the dataframe here
top10 = movies.sort_values(by='profit', ascending=False).head(10)

In [27]:
# Write your code to get the top 10 profiting movies here
# Which movie is ranked 5th from the top in the list obtained?
movies_final = top10['movie_title']
print(movies_final.head(5))




0                                   Avatar 
28                          Jurassic World 
25                                 Titanic 
2704    Star Wars: Episode IV - A New Hope 
2748            E.T. the Extra-Terrestrial 
Name: movie_title, dtype: object
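
Sorting and then taking the head can also be written with `DataFrame.nlargest`, which returns the rows with the largest values already in descending order. The sketch below uses made-up figures (`sample` and `top3` are assumptions for illustration, not values from the dataset).

```python
import pandas as pd

# Hypothetical figures in million $, used only for illustration
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "gross": [100.0, 300.0, 20.0, 500.0],
    "budget": [50.0, 100.0, 40.0, 400.0],
})

sample["profit"] = sample["gross"] - sample["budget"]

# Rows with the three largest profits, highest first
top3 = sample.nlargest(3, "profit")
print(top3[["movie_title", "profit"]])
```

The original row labels are kept, which is why the output above shows index values such as 0, 28, and 25 rather than 1 to 5.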


**Checkpoint:** You might spot two movies directed by `James Cameron` in the list.

**Subtask 3.3: Find IMDb Top 250**

Create a new dataframe `IMDb_Top_250` and store the top 250 movies with the highest IMDb Rating (corresponding to the column: `imdb_score`). Also make sure that for all of these movies, the `num_voted_users` is greater than 25,000. 

Also add a `Rank` column containing the values 1 to 250 indicating the ranks of the corresponding films.

In [29]:
# Write your code for extracting the top 250 movies as per the IMDb score here. Make sure that you store it in a new dataframe 
# and name that dataframe as 'IMDb_Top_250'
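IMDb_Top_250 = movies[movies["num_voted_users"] > 25000].sort_values(by='imdb_score', ascending=False).head(250).copy()
# Add the Rank column (1 = highest IMDb score)
IMDb_Top_250["Rank"] = range(1, len(IMDb_Top_250) + 1)
print(IMDb_Top_250)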

# Suppose movies are divided into 5 buckets based on the IMDb ratings:
# [7.5, 8), [8, 8.5), [8.5, 9), [9, 9.5), and [9.5, 10).
# Which bucket holds the maximum number of movies from IMDb_Top_250?

case1 = IMDb_Top_250[(IMDb_Top_250["imdb_score"] >= 7.5) & (IMDb_Top_250["imdb_score"] < 8)]
print(case1.shape)
case2 = IMDb_Top_250[(IMDb_Top_250["imdb_score"] >= 8) & (IMDb_Top_250["imdb_score"] < 8.5)]
print(case2.shape)
case3 = IMDb_Top_250[(IMDb_Top_250["imdb_score"] >= 8.5) & (IMDb_Top_250["imdb_score"] < 9)]
print(case3.shape)
case4 = IMDb_Top_250[(IMDb_Top_250["imdb_score"] >= 9) & (IMDb_Top_250["imdb_score"] < 9.5)]
print(case4.shape)
case5 = IMDb_Top_250[(IMDb_Top_250["imdb_score"] >= 9.5) & (IMDb_Top_250["imdb_score"] < 10)]
print(case5.shape)

             director_name  num_critic_for_reviews       gross  \
1795        Frank Darabont                   199.0   28.341469   
3016  Francis Ford Coppola                   208.0  134.821952   
64       Christopher Nolan                   645.0  533.316061   
2543  Francis Ford Coppola                   149.0   57.300000   
325          Peter Jackson                   328.0  377.019252   
...                    ...                     ...         ...   
67             Jon Favreau                   486.0  318.298180   
73              Doug Liman                   585.0  100.189501   
3052           Yash Chopra                    29.0    2.921738   
95           Peter Jackson                   645.0  303.001229   
3170  Christophe Barratier                   112.0    3.629758   

                              genres           actor_1_name  \
1795                     Crime|Drama         Morgan Freeman   
3016                     Crime|Drama              Al Pacino   
64       Action|Cr
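
The ranking and bucketing steps can be sketched compactly with `reset_index` and `pd.cut`. The frame below is made up, and the bucket edges and labels are assumptions chosen to mirror half-point rating bands; it keeps the top 3 instead of the top 250 to stay readable.

```python
import pandas as pd

# Hypothetical frame used only for illustration
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D", "E"],
    "imdb_score": [9.3, 8.1, 7.6, 8.4, 6.0],
    "num_voted_users": [900000, 30000, 26000, 10000, 50000],
})

# Keep well-voted movies, sort by score, take the top 3
top = (sample[sample["num_voted_users"] > 25000]
       .sort_values("imdb_score", ascending=False)
       .head(3)
       .reset_index(drop=True))

# After reset_index the positions are 0..n-1, so rank is position + 1
top["Rank"] = top.index + 1

# Half-open buckets [7.5, 8), [8, 8.5), ... (a score of exactly 10 would fall outside)
bins = [7.5, 8.0, 8.5, 9.0, 9.5, 10.0]
labels = ["7.5-8", "8-8.5", "8.5-9", "9-9.5", "9.5-10"]
top["bucket"] = pd.cut(top["imdb_score"], bins=bins, labels=labels, right=False)

bucket_counts = top["bucket"].value_counts()
print(top)
print(bucket_counts)
```

Counting with `value_counts` answers the "which bucket holds the most movies" question without writing one filter per bucket.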

**Subtask 3.4: Find the critic-favorite and audience-favorite actors**

   1. Create three new dataframes namely, `Meryl_Streep`, `Leo_Caprio`, and `Brad_Pitt` which contain the movies in which the actors: 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' are the lead actors. Use only the `actor_1_name` column for extraction. Also, make sure that you use the names 'Meryl Streep', 'Leonardo DiCaprio', and 'Brad Pitt' for the said extraction.
   2. Append the rows of all these dataframes and store them in a new dataframe named `Combined`.
   3. Group the combined dataframe using the `actor_1_name` column.
   4. Find the mean of the `num_critic_for_reviews` and `num_user_for_reviews` columns and identify the actors who have the highest mean.

In [30]:
# Write your code for creating three new dataframes here
# Include all movies in which Meryl_Streep is the lead
Meryl_Streep=movies[movies["actor_1_name"] =='Meryl Streep']
print(Meryl_Streep)

       director_name  num_critic_for_reviews       gross  \
392     Nancy Meyers                   187.0  112.703470   
1038   Curtis Hanson                    42.0   46.815748   
1132     Nora Ephron                   252.0   94.125426   
1322   David Frankel                   208.0  124.732962   
1390  Robert Redford                   227.0   14.998070   
1471  Sydney Pollack                    66.0   87.100000   
1514   David Frankel                   234.0   63.536011   
1563   Carl Franklin                    64.0   23.209440   
1784  Stephen Daldry                   174.0   41.597830   
2500  Phyllida Lloyd                   331.0   29.959436   
2793   Robert Altman                   211.0   20.338609   

                               genres  actor_1_name  \
392              Comedy|Drama|Romance  Meryl Streep   
1038  Action|Adventure|Crime|Thriller  Meryl Streep   
1132          Biography|Drama|Romance  Meryl Streep   
1322             Comedy|Drama|Romance  Meryl Streep   
1390

In [31]:
# Include all movies in which Leo_Caprio is the lead
Leo_Caprio=movies[movies["actor_1_name"] =='Leonardo DiCaprio']
print(Leo_Caprio)

              director_name  num_critic_for_reviews       gross  \
25            James Cameron                   315.0  658.672302   
49             Baz Luhrmann                   490.0  144.812796   
94        Christopher Nolan                   642.0  292.568851   
173   Alejandro G. Iñárritu                   556.0  183.635922   
246         Martin Scorsese                   267.0  102.608827   
283       Quentin Tarantino                   765.0  162.804648   
293            Edward Zwick                   166.0   57.366262   
294         Martin Scorsese                   606.0  116.866727   
312         Martin Scorsese                   233.0   77.679638   
347         Martin Scorsese                   352.0  132.373442   
431         Martin Scorsese                   490.0  127.968405   
608            Ridley Scott                   238.0   39.380442   
859        Steven Spielberg                   194.0  164.435221   
934             Danny Boyle                   118.0   39.77859

In [32]:
# Include all movies in which Brad_Pitt is the lead
Brad_Pitt=movies[movies["actor_1_name"] =='Brad Pitt']
print(Brad_Pitt)

              director_name  num_critic_for_reviews       gross  \
97            David Fincher                   362.0  127.490802   
142       Wolfgang Petersen                   220.0  133.228348   
243       Steven Soderbergh                   198.0  125.531634   
244              Doug Liman                   233.0  186.336103   
367              Tony Scott                   142.0    0.026871   
383       Steven Soderbergh                   186.0  183.405771   
448              David Ayer                   406.0   85.707116   
579     Jean-Jacques Annaud                    76.0   37.901509   
646           David Fincher                   315.0   37.023395   
749         Patrick Gilmore                    98.0   26.288320   
887             Neil Jordan                   120.0  105.264608   
1397        Terrence Malick                   584.0   13.303319   
1608         Andrew Dominik                   273.0    3.904982   
2030  Alejandro G. Iñárritu                   285.0   34.30077

In [33]:
# Write your code for combining the three dataframes here
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
Combined = pd.concat([Meryl_Streep, Leo_Caprio, Brad_Pitt])


In [39]:
# Write your code for grouping the combined dataframe here
# Write the code for finding the mean of critic reviews and audience reviews here
# Which actor is highest rated among the three actors according to the user reviews?
# Which actor is highest rated among the three actors according to the critics?
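# Select the numeric review columns before averaging; recent pandas versions raise an error on text columns
print(Combined.groupby("actor_1_name")[['num_critic_for_reviews', 'num_user_for_reviews']].mean())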

                   num_critic_for_reviews  num_user_for_reviews
actor_1_name                                                   
Brad Pitt                      245.000000            742.352941
Leonardo DiCaprio              330.190476            914.476190
Meryl Streep                   181.454545            297.181818
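
The filter, combine, and group steps can also be done in one pass with `isin`, and `idxmax` then names the actor with the highest mean in each column. This is a sketch on made-up data (the actor names, values, and variable names are assumptions for illustration).

```python
import pandas as pd

# Hypothetical frame used only for illustration
sample = pd.DataFrame({
    "actor_1_name": ["Actor X", "Actor X", "Actor Y", "Actor Z"],
    "num_critic_for_reviews": [100.0, 300.0, 150.0, 400.0],
    "num_user_for_reviews": [500.0, 700.0, 900.0, 50.0],
})

# Keep only the lead actors of interest in a single filter
leads = ["Actor X", "Actor Y"]
combined = sample[sample["actor_1_name"].isin(leads)]

review_cols = ["num_critic_for_reviews", "num_user_for_reviews"]
means = combined.groupby("actor_1_name")[review_cols].mean()
print(means)

# idxmax returns the group label with the largest mean
critic_favorite = means["num_critic_for_reviews"].idxmax()
audience_favorite = means["num_user_for_reviews"].idxmax()
print(critic_favorite, audience_favorite)
```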
