## As a data scientist, you have been hired by a rookie movie producer to help him decide what type of movies to produce and which actors to cast. Your task is to analyze the data he has provided, which includes information on 3,000 movies, and use your findings to make recommendations.

## To do this, you will first need to explore and clean the data to ensure that it is accurate and complete. This may involve checking for missing values, duplicate data, and outliers.

## Once the data is clean, you can start to identify trends and patterns. For example, you could look at the most profitable movies, the most popular genres, and the actors who have starred in the most successful films.

## You could also use the data to create predictive models. For example, you could develop a model that can predict the profitability of a movie based on its genre, budget, and cast.

## Once you have a good understanding of the data, you can start to make recommendations to the movie producer. For example, you could recommend that he produce movies in certain genres or that he cast certain actors.

## You could also provide him with more specific recommendations, such as suggesting that he produce a remake of a successful film or that he create a new franchise based on a popular book series.

## Ultimately, the goal of your analysis is to help the movie producer make informed decisions about his business. By providing him with valuable insights and recommendations, you can help him to increase his chances of success.

## Further, you have to answer the following questions:
1. ### <b> Which movie made the highest profit? Who were its producer and director? Identify the actors in that film.</b>
2. ### <b>This data has information about movies made in different languages. Which language has the highest average ROI (return on investment)? </b>
3. ### <b> Find out the unique genres of movies in this dataset.</b>
4. ### <b> Make a table of all the producers and directors of each movie. Find the top 3 producers who have produced movies with the highest average ROI. </b>
5. ### <b> Which actor has acted in the most movies? Deep dive into the movies, genres and profits corresponding to this actor. </b>
6. ### <b> Which actors do the top 3 directors prefer the most? </b>

In [None]:

#Import package
import pandas as pd
import numpy as np

In [None]:

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
imdb_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/imdb_data.csv')

In [None]:
imdb_df.head(2)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,...,8/6/04,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435


In [None]:
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     3000 non-null   int64  
 1   belongs_to_collection  604 non-null    object 
 2   budget                 3000 non-null   int64  
 3   genres                 2993 non-null   object 
 4   homepage               946 non-null    object 
 5   imdb_id                3000 non-null   object 
 6   original_language      3000 non-null   object 
 7   original_title         3000 non-null   object 
 8   overview               2992 non-null   object 
 9   popularity             3000 non-null   float64
 10  poster_path            2999 non-null   object 
 11  production_companies   2844 non-null   object 
 12  production_countries   2945 non-null   object 
 13  release_date           3000 non-null   object 
 14  runtime                2998 non-null   float64
 15  spok

In [None]:
print(imdb_df.columns)

Index(['id', 'belongs_to_collection', 'budget', 'genres', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'runtime', 'spoken_languages',
       'status', 'tagline', 'title', 'Keywords', 'cast', 'crew', 'revenue'],
      dtype='object')


## After reading all of the questions, identify the columns of data that are necessary to answer each question in order to obtain accurate insights. Keep all non-null columns.

In [None]:
columns_to_keep= ['budget', 'genres','original_language', 'original_title','cast', 'crew', 'revenue']

In [None]:
imdb_df = imdb_df[columns_to_keep]

In [None]:
print(imdb_df.columns)

Index(['budget', 'genres', 'original_language', 'original_title', 'cast',
       'crew', 'revenue'],
      dtype='object')


In [None]:
#find all the row indexes for which genres is not null
imdb_df.loc[~imdb_df['genres'].isna(),'genres']

0                          [{'id': 35, 'name': 'Comedy'}]
1       [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
2                           [{'id': 18, 'name': 'Drama'}]
3       [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...
4       [{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...
                              ...                        
2995    [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...
2996    [{'id': 18, 'name': 'Drama'}, {'id': 10402, 'n...
2997    [{'id': 80, 'name': 'Crime'}, {'id': 28, 'name...
2998    [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...
2999    [{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...
Name: genres, Length: 2993, dtype: object

In [None]:
type(imdb_df.loc[0,'genres'])

str

In [None]:
type(imdb_df.loc[0,'cast'])

str

The eval() function in Python is a built-in function that evaluates a string as a Python expression. This means that you can use the eval() function to execute arbitrary Python code from a string.



In [None]:
expression = "1 + 2"
result = eval(expression)
print(result)

3


In [None]:
def convert_to_list(text):
  #the parameter is not named `str`, which would shadow the built-in type
  return eval(text)
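
Because `eval()` runs any expression it is given, a more restrictive option for strings that hold only Python literals is `ast.literal_eval` from the standard library. The sketch below assumes the column strings contain nothing but literals (lists, dicts, strings, numbers), as the printed samples suggest; `parse_literal` is a hypothetical helper name, not part of the notebook.

```python
import ast

def parse_literal(text):
    """Parse a string that holds a Python literal (list, dict, number, ...)."""
    # literal_eval accepts only literal syntax and raises ValueError
    # for anything else, such as function calls.
    return ast.literal_eval(text)

sample = "[{'id': 35, 'name': 'Comedy'}]"
parsed = parse_literal(sample)
print(type(parsed).__name__, parsed[0]['name'])

# A string containing a call is rejected instead of being executed.
try:
    parse_literal("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```

If the assumption holds, `parse_literal` could be passed to `.apply()` in place of `convert_to_list`.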

In [None]:
#apply the above function only on non null values in genres column
imdb_df.loc[~imdb_df['genres'].isna(),'genres']= imdb_df.loc[~imdb_df['genres'].isna(),'genres'].apply(convert_to_list)

In [None]:
#apply the above function only on non null values in cast column
imdb_df.loc[~imdb_df['cast'].isna(),'cast']= imdb_df.loc[~imdb_df['cast'].isna(),'cast'].apply(convert_to_list)

In [None]:
#apply the above function only on non null values in crew column
imdb_df.loc[~imdb_df['crew'].isna(),'crew']= imdb_df.loc[~imdb_df['crew'].isna(),'crew'].apply(convert_to_list)

In [None]:
type(imdb_df.loc[0,'cast'])

list

In [None]:
imdb_df_new = imdb_df.copy()

#Q1.Which movie made the highest profit? Who were its producer and director? Identify the actors in that film.

In [None]:
#sanity check on the budget and revenue columns (outliers, implausible values, etc.)
imdb_df_new.describe()
#a movie's budget cannot realistically be 0, so such values are replaced with the median below

Unnamed: 0,budget,revenue
count,3000.0,3000.0
mean,22531330.0,66725850.0
std,37026090.0,137532300.0
min,0.0,1.0
25%,0.0,2379808.0
50%,8000000.0,16807070.0
75%,29000000.0,68919200.0
max,380000000.0,1519558000.0


In [None]:

imdb_df_new[imdb_df_new['budget']==0].head(3)

Unnamed: 0,budget,genres,original_language,original_title,cast,crew,revenue
4,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",ko,마린보이,"[{'cast_id': 3, 'character': 'Chun-soo', 'cred...","[{'credit_id': '52fe464b9251416c75073b43', 'de...",3923970
7,0,"[{'id': 99, 'name': 'Documentary'}]",en,Control Room,"[{'cast_id': 2, 'character': 'Himself', 'credi...","[{'credit_id': '52fe47a69251416c750a0daf', 'de...",2586511
8,0,"[{'id': 28, 'name': 'Action'}, {'id': 35, 'nam...",en,Muppet Treasure Island,"[{'cast_id': 1, 'character': 'Long John Silver...","[{'credit_id': '52fe43c89251416c7501deb3', 'de...",34327391


In [None]:
imdb_df_new['budget'].median()

8000000.0

In [None]:
#Replace extremely low values (below 1000) in the budget and revenue columns with the median of the respective column
imdb_df_new.loc[imdb_df_new['budget']<1000,'budget']= imdb_df_new['budget'].median()

imdb_df_new.loc[imdb_df_new['revenue']<1000,'revenue']= imdb_df_new['revenue'].median()
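
One detail of this step is that the median is computed over all rows, including the placeholder values being replaced. A possible variant, sketched here on made-up numbers, computes the median from the plausible rows only. The toy frame, the `threshold` name and the values are assumptions for illustration.

```python
import pandas as pd

# Toy budgets; 0 stands in for "not recorded" (illustrative values only).
toy = pd.DataFrame({'budget': [0, 0, 5_000_000, 10_000_000, 30_000_000]})

threshold = 1000
is_placeholder = toy['budget'] < threshold

median_all = toy['budget'].median()                          # placeholders included
median_valid = toy.loc[~is_placeholder, 'budget'].median()   # placeholders excluded

# Replace the placeholders with the median of the valid rows.
toy['budget'] = toy['budget'].mask(is_placeholder, median_valid)
print(median_all, median_valid)
print(toy['budget'].tolist())
```

Which median is more appropriate depends on how many placeholder values the column contains; the notebook keeps the all-rows median.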



In [None]:
imdb_df_new.describe() #now fine

Unnamed: 0,budget,revenue
count,3000.0,3000.0
mean,24744670.0,67045180.0
std,35832540.0,137396400.0
min,2500.0,1404.0
25%,8000000.0,2947600.0
50%,8000000.0,16808730.0
75%,29000000.0,68919200.0
max,380000000.0,1519558000.0


In [None]:
imdb_df_new['genres'].isnull().sum()

7

In [None]:
#create profit and ROI column
imdb_df_new['profit'] = imdb_df_new['revenue'] - imdb_df_new['budget']
imdb_df_new['roi']= 100* (imdb_df_new['profit']/imdb_df_new['budget'])
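
The two formulas are profit = revenue - budget and ROI = 100 * profit / budget. As a quick check, the sketch below applies them to the budget and revenue of the first row shown by `imdb_df.head(2)`; `profit_and_roi` is a hypothetical helper used only for this illustration.

```python
def profit_and_roi(budget, revenue):
    """Return (profit, ROI in percent) for a single movie."""
    profit = revenue - budget
    roi = 100 * profit / budget
    return profit, roi

# Budget and revenue of 'Hot Tub Time Machine 2' as printed by imdb_df.head(2)
profit, roi = profit_and_roi(budget=14_000_000, revenue=12_314_651)
print(profit, round(roi, 6))
```

A negative ROI means the film earned less than it cost.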

In [None]:

imdb_df_new.head(2)

Unnamed: 0,budget,genres,original_language,original_title,cast,crew,revenue,profit,roi
0,14000000,"[{'id': 35, 'name': 'Comedy'}]",en,Hot Tub Time Machine 2,"[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651,-1685349,-12.038207
1,40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,The Princess Diaries 2: Royal Engagement,"[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435,55149435,137.873588


In [None]:
#maximum profit
imdb_df_new['profit'].max()

1316249360

In [None]:
#find the row with the max profit using .idxmax()
#.idxmax() returns the index label of the max value; with the default RangeIndex the label equals the row position
imdb_df_new['profit'].idxmax()

1761
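
`idxmax()` returns an index label, which matches the row position only when the frame has a default `RangeIndex`, as `imdb_df_new` does. A small sketch with a made-up Series and a non-default index shows the difference between the label (for `.loc`) and the position (for `.iloc`).

```python
import pandas as pd

# Toy profits with a non-default index (illustrative values only).
profits = pd.Series([10, 50, 30], index=[7, 3, 9], name='profit')

label = profits.idxmax()      # index label of the maximum
position = profits.argmax()   # integer position of the maximum

print(label, position)
# .loc takes the label, .iloc takes the position; both reach the same value.
print(profits.loc[label], profits.iloc[position])
```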

###The movie which made the highest profit is:

In [None]:
imdb_df_new.loc[imdb_df_new['profit'].idxmax(),'original_title']

'Furious 7'

In [None]:
#.idxmax() returns an index label, so select the row with .loc
max_profit_movie_df = imdb_df_new.loc[imdb_df_new['profit'].idxmax()]

In [None]:
max_profit_movie_df.head()

budget                                                       190000000
genres                                  [{'id': 28, 'name': 'Action'}]
original_language                                                   en
original_title                                               Furious 7
cast                 [{'cast_id': 17, 'character': 'Dominic Toretto...
Name: 1761, dtype: object

In [None]:
max_profit_movie_df.loc['cast'][0]['name']

'Vin Diesel'

In [None]:
crew_list= max_profit_movie_df.loc['crew']
crew_list[0:3]

[{'credit_id': '52fe4cc8c3a36847f823e681',
  'department': 'Production',
  'gender': 2,
  'id': 12835,
  'job': 'Producer',
  'name': 'Vin Diesel',
  'profile_path': '/7rwSXluNWZAluYMOEWBxkPmckES.jpg'},
 {'credit_id': '52fe4cc8c3a36847f823e687',
  'department': 'Production',
  'gender': 2,
  'id': 11874,
  'job': 'Producer',
  'name': 'Neal H. Moritz',
  'profile_path': '/cNcsEYmoS4niCz3UkVAA09dUIob.jpg'},
 {'credit_id': '52fe4cc8c3a36847f823e68d',
  'department': 'Writing',
  'gender': 2,
  'id': 58191,
  'job': 'Writer',
  'name': 'Chris Morgan',
  'profile_path': '/dUGxIwFBLrSFLImxjeda1krndMO.jpg'}]

In [None]:
producer_list=[]
director_list=[]
for elem in crew_list:
  if elem['job']=='Producer':
    producer_list.append(elem['name'])
  if elem['job']=='Director':
    director_list.append(elem['name'])


In [None]:
print(f'PRODUCERS : {producer_list}')
print(f'DIRECTORS : {director_list}')

PRODUCERS : ['Vin Diesel', 'Neal H. Moritz', 'Michael Fottrell', 'Brandon Birtell']
DIRECTORS : ['James Wan']
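
The same filtering can be written as a small reusable helper, which is convenient because later questions need producers and directors for every movie. This is a sketch: `names_by_job` is a hypothetical name, and the sample entries are the first three crew records shown above, reduced to the two keys the helper reads.

```python
def names_by_job(crew, job):
    """Return the names of all crew members whose 'job' equals `job`."""
    return [member['name'] for member in crew if member.get('job') == job]

# First three crew entries of the film, reduced to 'job' and 'name'
sample_crew = [
    {'job': 'Producer', 'name': 'Vin Diesel'},
    {'job': 'Producer', 'name': 'Neal H. Moritz'},
    {'job': 'Writer', 'name': 'Chris Morgan'},
]
print(names_by_job(sample_crew, 'Producer'))
print(names_by_job(sample_crew, 'Director'))
```

An empty list is returned when no crew member has the requested job.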


In [None]:
cast_list =max_profit_movie_df['cast']

In [None]:
cast_list[0:3]

[{'cast_id': 17,
  'character': 'Dominic Toretto',
  'credit_id': '5431dfd10e0a265915002c34',
  'gender': 2,
  'id': 12835,
  'name': 'Vin Diesel',
  'order': 0,
  'profile_path': '/7rwSXluNWZAluYMOEWBxkPmckES.jpg'},
 {'cast_id': 19,
  'character': "Brian O'Conner",
  'credit_id': '5431dfe4c3a3681143002b98',
  'gender': 2,
  'id': 8167,
  'name': 'Paul Walker',
  'order': 1,
  'profile_path': '/iqvYezRoEY5k8wnlfHriHQfl5dX.jpg'},
 {'cast_id': 18,
  'character': 'Hobbs',
  'credit_id': '5431dfdbc3a36831a6004376',
  'gender': 2,
  'id': 18918,
  'name': 'Dwayne Johnson',
  'order': 2,
  'profile_path': '/kuqFzlYMc2IrsOyPznMd1FroeGq.jpg'}]

###Actors in the Highest profit movie

In [None]:
actor_list=[]
for elem in cast_list:
  actor_list.append(elem['name'])

In [None]:
#actors
print(f'Actors of the movie are :')
actor_list

Actors of the movie are :


['Vin Diesel',
 'Paul Walker',
 'Dwayne Johnson',
 'Michelle Rodriguez',
 'Tyrese Gibson',
 'Ludacris',
 'Jordana Brewster',
 'Djimon Hounsou',
 'Tony Jaa',
 'Ronda Rousey',
 'Nathalie Emmanuel',
 'Kurt Russell',
 'Jason Statham',
 'Sung Kang',
 'Gal Gadot',
 'Lucas Black',
 'Elsa Pataky',
 'Noel Gugliemi',
 'John Brotherton',
 'Luke Evans',
 'Ali Fazal',
 'Miller Kimsey',
 'Charlie Kimsey',
 'Eden Estrella',
 'Gentry White',
 'Iggy Azalea',
 'Jon Lee Brody',
 'Levy Tran',
 'Anna Colwell',
 'Viktor Hernandez',
 'Steve Coulter',
 'Robert Pralgo',
 'Antwan Mills',
 'J.J. Phillips',
 'Jorge Ferragut',
 'Sara Sohn',
 'Benjamin Blankenship',
 'D.J. Hapa',
 'T-Pain',
 'Brian Mahoney',
 'Brittney Alger',
 'Romeo Santos',
 'Jocelin Donahue',
 'Stephanie Langston',
 'Jorge-Luis Pallo',
 'Tego Calderón',
 'Nathalie Kelley',
 'Shad Moss',
 'Don Omar',
 'Klement Tinaj',
 'Caleb Walker',
 'Cody Walker']

#Q2.This data has information about movies made in different languages. Which language has the highest average ROI (return on investment)?

In [None]:
 #we already calculated roi above
 #df['roi'] = 100 * df['profit']/df['budget']

In [None]:
#Use groupby function on movie languages and ROI and finding mean
imdb_df_new.groupby('original_language')['roi'].mean().reset_index().sort_values(by='roi',ascending=False).head(3)

Unnamed: 0,original_language,roi
18,ko,11309.685605
6,el,5198.013245
28,sr,3261.4136
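
An average ROI can be dominated by a single film when a language has only a few movies. The output above does not show how many films each language has, so the sketch below uses made-up numbers to show how the count and median could be reported next to the mean.

```python
import pandas as pd

# Toy data (illustrative values only, not taken from the dataset)
toy = pd.DataFrame({
    'original_language': ['en', 'en', 'en', 'ko', 'fr', 'fr'],
    'roi': [50.0, 150.0, 100.0, 900.0, 20.0, 40.0],
})

# Mean, median and number of films per language, highest mean first
summary = (
    toy.groupby('original_language')['roi']
       .agg(['mean', 'median', 'count'])
       .sort_values('mean', ascending=False)
)
print(summary)
```

In this toy example the top language rests on a single film, which is the kind of situation the `count` column makes visible.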


In [None]:

print('Language with highest average roi is')
imdb_df_new.groupby('original_language')['roi'].mean().reset_index().sort_values(by='roi',ascending=False).iloc[0,0]

Language with highest average roi is


'ko'

#Q3.Find out the unique genres of movies in this dataset.

In [None]:
#considering only those rows in genres column which have no null values
no_na_genres = imdb_df_new[~imdb_df_new['genres'].isna()]

In [None]:
type(no_na_genres)

pandas.core.frame.DataFrame

In [None]:
no_na_genres.loc[0,'genres']

[{'id': 35, 'name': 'Comedy'}]

In [None]:
no_na_genres.loc[3,'genres'][0]

{'id': 53, 'name': 'Thriller'}

In [None]:
no_na_genres.loc[3,'genres']

[{'id': 53, 'name': 'Thriller'}, {'id': 18, 'name': 'Drama'}]

In [None]:
#collect every genre name by iterating over the rows with .iterrows()
#.iterrows() yields (index, row) pairs, much like enumerate() does for a list
gen_list=[]
for index,row in no_na_genres.iterrows():
  genre = row['genres']
  for k in genre:
    gen_list.append(k['name'])

#the unique genres are:
pd.DataFrame(list(set(gen_list)),columns=['Unique Genres'])

Unnamed: 0,Unique Genres
0,Family
1,Comedy
2,Crime
3,Science Fiction
4,Documentary
5,Adventure
6,Romance
7,Drama
8,Western
9,War
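
The nested loop can also be written as a set comprehension, which removes duplicates while collecting. The sketch below uses three sample rows in the same shape as the parsed `genres` column; `genre_rows` is a hypothetical stand-in for that column.

```python
# Sample rows in the same shape as the parsed 'genres' column
genre_rows = [
    [{'id': 35, 'name': 'Comedy'}],
    [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}],
    [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'name': 'Drama'}],
]

# Flatten the rows and keep each genre name once, in alphabetical order
unique_genres = sorted({g['name'] for row in genre_rows for g in row})
print(unique_genres)
```

On the notebook's data, `no_na_genres['genres']` could be used in place of `genre_rows`.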


#Q4.Make a table of all the producers and directors of each movie. Find the top 3 producers who have produced movies with the highest average ROI?


In [None]:
#consider only those rows in the crew column which have no null values
no_na_crew=imdb_df_new[~imdb_df_new['crew'].isna()]

In [None]:
no_na_crew.shape

(2984, 9)

In [None]:
#A simple function to extract the list of all producers for a given movie index
def create_producer_list(index):
  movie_row=no_na_crew.iloc[index]
  crew_list=movie_row.loc['crew']
  producer_list=[]
  for elem in crew_list:
    #job titles are capitalised in the data, e.g. 'Producer'
    if elem['job']=='Producer':
      producer_list.append(elem['name'])
  #return only after the whole crew list has been scanned
  return producer_list




In [None]:
create_producer_list(62)

In [None]:
#A simple function to extract the first listed director for a given movie index
def create_Director_list(index):
  movie_row=no_na_crew.iloc[index]
  crew_list=movie_row.loc['crew']
  for elem in crew_list:
    if elem['job']=='Director':
      return elem['name']



In [None]:
create_Director_list(61)

'Carol Reed'
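
One way the per-movie helpers could be combined into the table this question asks for is sketched below on a made-up frame shaped like `no_na_crew`. The titles, names and ROI values are invented, and `names_for` is a hypothetical helper. `explode` turns each producer list into one row per producer so that the ROI can be averaged per person.

```python
import pandas as pd

# Toy frame shaped like no_na_crew (titles, names and numbers are made up)
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'roi': [100.0, 300.0, 50.0],
    'crew': [
        [{'job': 'Producer', 'name': 'P1'}, {'job': 'Director', 'name': 'D1'}],
        [{'job': 'Producer', 'name': 'P1'}, {'job': 'Producer', 'name': 'P2'},
         {'job': 'Director', 'name': 'D2'}],
        [{'job': 'Producer', 'name': 'P3'}, {'job': 'Director', 'name': 'D1'}],
    ],
})

def names_for(crew, job):
    """Return the names of all crew members with the given job."""
    return [m['name'] for m in crew if m.get('job') == job]

# Table of producers and directors for each movie
toy['producers'] = toy['crew'].apply(names_for, job='Producer')
toy['directors'] = toy['crew'].apply(names_for, job='Director')
table = toy[['original_title', 'producers', 'directors']]

# One row per (movie, producer), then the average ROI per producer
top_producers = (
    toy.explode('producers')
       .groupby('producers')['roi']
       .mean()
       .sort_values(ascending=False)
       .head(3)
)
print(table)
print(top_producers)
```

As with the language averages, a producer with a single film can top this ranking, so checking the number of films per producer alongside the mean may be worthwhile.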