## You have been hired by a rookie movie producer to help him decide what type of movies to produce and which actors to cast. You must back your recommendations with a thorough analysis of the data he shared with you, which lists 3000 movies and their corresponding details.

## As a data scientist, you have to first explore the data and check its sanity.

## Further, you have to answer the following questions:
1. ### <b> Which movie made the highest profit? Who were its producer and director? Identify the actors in that film.</b>
2. ### <b>This data has information about movies made in different languages. Which language has the highest average ROI (return on investment)? </b>
3. ### <b> Find out the unique genres of movies in this dataset.</b>
4. ### <b> Make a table of all the producers and directors of each movie. Find the top 3 producers who have produced movies with the highest average RoI? </b>
5. ### <b> Which actor has acted in the most number of movies? Deep dive into the movies, genres and profits corresponding to this actor. </b>
6. ### <b>Top 3 directors prefer which actors the most? </b>



# Data Exploration

In [None]:
#Import packages
import pandas as pd
import numpy as np
import ast

In [None]:
#mount drive
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
file_path='/content/gdrive/MyDrive/Colab Notebooks/Datas/imdb_data.csv'
df=pd.read_csv(file_path)

In [None]:
df.head(4)

In [None]:
df.info()
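
Beyond `df.info()`, a few explicit counts can help check the sanity of the data. The snippet below is a minimal sketch on a made-up table; `sample` and `sanity_report` are hypothetical names, and the columns are assumed to resemble the `budget` and `revenue` columns used later in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the movie table; the real notebook would pass `df`.
sample = pd.DataFrame({
    "title": ["A", "B", "B", "C"],
    "budget": [100, 0, 0, 50],
    "revenue": [250, 40, 40, None],
})

def sanity_report(frame):
    """Return a few basic data-quality counts for a movie table."""
    return {
        "rows": len(frame),
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isna().sum().to_dict(),
        # A zero budget makes ROI undefined, so it is worth counting up front.
        "zero_budget_rows": int((frame["budget"] == 0).sum()),
    }

report = sanity_report(sample)
print(report)
```

Zero budgets matter here because ROI divides by the budget.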

**Extract actors, directors, producers, and genres from the dataset**

In [None]:
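df1=df.copy() #Work on a real copy so the raw dataframe stays untouched
#The crew, cast and genres columns hold stringified lists of dicts; parse them into Python objects
try:
  df1['crew']=df['crew'].fillna('[]').apply(ast.literal_eval)
  df1['cast']=df['cast'].fillna('[]').apply(ast.literal_eval)
  df1['genres']=df['genres'].fillna('[]').apply(ast.literal_eval)
except (ValueError,SyntaxError,TypeError) as err:
  print("Can't apply ast.literal_eval:",err)

#Keep only the names of the cast members and genres
df1['new_cast']=df1['cast'].apply(lambda x:[i['name'] for i in x] if isinstance(x,list) else [])
df1['new_genre']=df1['genres'].apply(lambda x:[i['name'] for i in x] if isinstance(x,list) else [])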

In [None]:
def extract_data(data,column,condition):
  #For each row, collect the names of the crew members whose job equals `condition`
  lst=[]
  for members in data[column]:
    names=[m['name'] for m in members if m.get('job')==condition]
    lst.append(names)
  return lst
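
To make the parsing step concrete, here is a minimal sketch of how one stringified crew entry can be parsed and filtered by job. The value of `raw_crew` is invented for illustration, and `names_for_job` is a hypothetical helper, not part of the notebook.

```python
import ast

# Hypothetical raw value shaped like one cell of the `crew` column.
raw_crew = (
    "[{'job': 'Director', 'name': 'Jane Doe'}, "
    "{'job': 'Producer', 'name': 'John Roe'}, "
    "{'job': 'Producer', 'name': 'Ann Poe'}]"
)

def names_for_job(raw, job):
    """Parse a stringified list of crew dicts and keep names with the given job."""
    # Missing cells (None/NaN) are treated as an empty crew list.
    members = ast.literal_eval(raw) if isinstance(raw, str) else []
    return [m["name"] for m in members if m.get("job") == job]

directors = names_for_job(raw_crew, "Director")
producers = names_for_job(raw_crew, "Producer")
missing = names_for_job(None, "Director")
print(directors, producers, missing)
```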

In [None]:
director=extract_data(df1,'crew','Director')
producer=extract_data(df1,'crew','Producer')
df1['director']=director
df1['producer']=producer
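#Profit is revenue minus budget; ROI is profit as a percentage of the budget
df1['profit']=df1['revenue']-df1['budget']
df1['roi']=(df1['profit']/df1['budget']*100).round(3)
#A zero budget makes the ROI infinite; those values are set to 0
df1['roi']=df1['roi'].replace([np.inf,-np.inf],0)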
pd.options.display.float_format = '{:.4f}'.format
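
The profit and ROI arithmetic can be checked on a tiny table. This is a sketch with invented numbers; the `movies` frame is hypothetical and only mirrors the `budget` and `revenue` columns used above.

```python
import numpy as np
import pandas as pd

# Hypothetical budgets and revenues; the second row has a zero budget.
movies = pd.DataFrame({"budget": [100, 0, 200], "revenue": [250, 40, 100]})

movies["profit"] = movies["revenue"] - movies["budget"]
# ROI in percent; dividing by a zero budget yields inf in pandas.
movies["roi"] = (movies["profit"] / movies["budget"] * 100).round(3)
# Map the infinite values to 0, following the convention used in this notebook.
movies["roi"] = movies["roi"].replace([np.inf, -np.inf], 0)
print(movies)
```

Setting the undefined values to 0 keeps those movies in later averages; replacing them with `NaN` instead would exclude them, which is a different modelling choice.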


In [None]:
df1.head(1)

In [None]:
#Copy the selected columns so later edits do not trigger a SettingWithCopyWarning
df_final=df1[['original_language','title','new_cast','new_genre','director','producer','profit','roi']].copy()
df_final.head(4)

**Which movie made the highest profit? Who were its producer and director? Identify the actors in that film**

In [None]:
top_gross=df_final.sort_values(by='profit',ascending=False).head(10).reset_index()
top_gross['profit'] = top_gross['profit'].apply("{:,}".format)
top_gross[['title','producer','director','new_cast','profit']]
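
The table above lists the ten most profitable movies. If only the single best one is needed, `idxmax` is one way to pick it. The snippet is a sketch on invented rows; `films` is a hypothetical stand-in for `df_final`.

```python
import pandas as pd

# Hypothetical rows standing in for df_final.
films = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": [["D1"], ["D2"], ["D3"]],
    "producer": [["P1"], ["P2", "P3"], []],
    "profit": [150, 900, -20],
})

# idxmax returns the index label of the largest profit; loc fetches that row.
best = films.loc[films["profit"].idxmax()]
print(best["title"], best["director"], best["producer"])
```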

**This data has information about movies made in different languages. Which language has the highest average ROI (return on investment)?**

In [None]:
#Average ROI per language; infinite or missing ROI values (zero budget) count as 0
roi=df1[['original_language','roi']].replace([np.inf,-np.inf],0).fillna(0)
data=roi.groupby('original_language')['roi'].mean().rename('roi_avg').reset_index()
data.sort_values(by='roi_avg',ascending=False)
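
An average per language is the sum of the ROI values divided by the number of movies in that language, which is what a grouped mean computes. The sketch below checks this on invented rows; the `rows` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ROI values for two languages.
rows = pd.DataFrame({
    "original_language": ["en", "en", "fr"],
    "roi": [150.0, -50.0, 20.0],
})

counts = rows["original_language"].value_counts()
sums = rows.groupby("original_language")["roi"].sum()
manual_avg = sums / counts  # division aligns on the language label
direct_avg = rows.groupby("original_language")["roi"].mean()
print(direct_avg.sort_values(ascending=False))
```

The two results agree only when no ROI value is missing, because `sum` skips `NaN` while `value_counts` still counts the row.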

**Find out the unique genres of movies in this dataset.**

In [None]:
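new_df=df_final.explode('new_genre')
#Movies without genre information explode to NaN, so drop those before listing the genres
x=pd.DataFrame(new_df['new_genre'].dropna().unique(),columns=['genre'])
x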
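
As a small illustration of what `explode` does to a list column, the sketch below uses three invented movies, one of which has no genre. The `films` frame is hypothetical.

```python
import pandas as pd

# Hypothetical rows; the last movie has no genre information.
films = pd.DataFrame({
    "title": ["A", "B", "C"],
    "new_genre": [["Drama", "Comedy"], ["Drama"], []],
})

# Each list element becomes its own row; an empty list becomes a single NaN row.
exploded = films.explode("new_genre")
unique_genres = sorted(exploded["new_genre"].dropna().unique())
print(unique_genres)
```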

# **Which actor has acted in the most number of movies? Deep dive into the movies, genres and profits corresponding to this actor.**

In [None]:
df_final.head(1)

In [None]:
new_df = df_final.explode('new_cast').reset_index()
top_actors = pd.DataFrame(new_df.new_cast.value_counts()).reset_index()
top_actors.columns = ['Actors','Number Of Movies']
top_actors.head(3)

# **Movies by Robert De Niro**

In [None]:
new_df.head(2)

In [None]:
def eda(cast):
  #Collect [actor, title, genres, profit] for every movie in new_df that features `cast`
  rows=new_df[new_df['new_cast']==cast]
  return [list(r) for r in zip(rows['new_cast'],rows['title'],rows['new_genre'],rows['profit'])]

In [None]:
m=eda('Robert De Niro')
data=pd.DataFrame(m)
data.columns=['actor','movie','genre','profit']
data.sort_values(by='profit',ascending=False,inplace=True)
data["profit"]=data["profit"].map('{:,d}'.format) #Format after sorting; plain assignment lets the column become text
data.head(5)

# **Movies by Samuel L. Jackson**

In [None]:
m=eda('Samuel L. Jackson')
data=pd.DataFrame(m)
data.columns=['actor','movie','genre','profit']
data.sort_values(by='profit',ascending=False,inplace=True)
data["profit"]=data["profit"].map('{:,d}'.format) #Format after sorting; plain assignment lets the column become text
data.head(5)

# **Movies by Morgan Freeman**

In [None]:
m=eda('Morgan Freeman')
data=pd.DataFrame(m)
data.columns=['actor','movie','genre','profit']
data.sort_values(by='profit',ascending=False,inplace=True)
data["profit"]=data["profit"].map('{:,d}'.format) #Format after sorting; plain assignment lets the column become text
data.head(5)
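
The tables above list each actor's movies with their genres and profits. One possible way to take the genre side of the deep dive further is to count movies and sum profit per genre for a single actor. This is a sketch on invented data; `cast_rows` and `genre_summary` are hypothetical names, with `cast_rows` assumed to have the same shape as the cast-exploded `new_df`.

```python
import pandas as pd

# Hypothetical exploded cast table: one row per (movie, actor) pair.
cast_rows = pd.DataFrame({
    "new_cast": ["Actor X", "Actor X", "Actor Y"],
    "title": ["A", "B", "A"],
    "new_genre": [["Drama", "Crime"], ["Drama"], ["Drama", "Crime"]],
    "profit": [150, -20, 150],
})

def genre_summary(frame, actor):
    """Count movies and sum profit per genre for one actor."""
    # Keep the actor's rows, then give every genre of a movie its own row.
    rows = frame[frame["new_cast"] == actor].explode("new_genre")
    return (rows.groupby("new_genre")["profit"]
                .agg(movies="count", total_profit="sum")
                .sort_values("movies", ascending=False))

summary = genre_summary(cast_rows, "Actor X")
print(summary)
```

A movie with several genres contributes its full profit to each of them, so the per-genre totals should not be added together.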

# **Make a table of all the producers and directors of each movie. Find the top 3 producers who have produced movies with the highest average RoI?**

In [None]:
df_final[['title','director','producer']]

In [None]:
df1

In [None]:
df2=df_final.explode('producer').reset_index()
df2['roi']=df2['roi'].replace([np.inf,-np.inf],0)
#Number of movies per producer
df3=df2['producer'].value_counts().reset_index()
df3.columns=['producer','counts']
#Total ROI per producer; summing only the roi column avoids aggregating the text and list columns
df4=df2.groupby('producer')['roi'].sum().reset_index()
data=pd.merge(df3,df4,on='producer')
data['avg_roi']=data['roi']/data['counts']
data[['producer','avg_roi']].sort_values(by='avg_roi',ascending=False).head(3)

# **Top 3 directors prefer which actors the most?**

In [None]:
df1.head(1)

In [None]:
#Convert a list of names into one comma-separated string
def list_to_string(s):
    return ','.join(s)

lis=[list_to_string(s) for s in df1['director']]

In [None]:
df1['director']=lis
#Rank directors by the mean popularity of their movies; the top 3 are used below
data=df1.groupby('director')['popularity'].mean().sort_values(ascending=False).reset_index()
data.head(3)

In [None]:
data_explode=df1.explode('new_cast').reset_index()
data_explode.head(5)

In [None]:
def dir_pref(data,condition):
  #Return the cast members (one per exploded row) of every movie directed by `condition`
  return data.loc[data['director']==condition,'new_cast'].tolist()

**Tim Miller**

In [None]:
x=dir_pref(data_explode,'Tim Miller')
x=list(set(x))
x

**Edgar Wright**

In [None]:
x=dir_pref(data_explode,'Edgar Wright')
x=list(set(x))
x

**James Gunn**

In [None]:
x=dir_pref(data_explode,'James Gunn')
x=list(set(x))
x
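
The lists above show every actor who worked with each director. To judge preference, one option is to count how many of a director's movies each actor appears in. The snippet is a sketch under that assumption; the `movies` data and the `favourite_actors` helper are hypothetical.

```python
from collections import Counter

# Hypothetical (director, cast) pairs, one per movie.
movies = [
    ("Director D", ["Actor X", "Actor Y"]),
    ("Director D", ["Actor X", "Actor Z"]),
    ("Director E", ["Actor Y"]),
]

def favourite_actors(rows, director, top_n=3):
    """Count in how many of one director's movies each actor appears."""
    counts = Counter()
    for name, cast in rows:
        if name == director:
            counts.update(set(cast))  # count each actor once per movie
    return counts.most_common(top_n)

top = favourite_actors(movies, "Director D")
print(top)
```

For a director with a single movie in the data, every actor has a count of 1, so no preference can be read from the result.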