

# Project: Investigate a Dataset (Analysing TMDB Movie Data)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **TMDB Movie Data**: This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
>
> **Questions:**
>
> * Which movie has the highest profit? Which movie has the lowest profit?
>
> * How many documentary movies are there? Is investing in them a waste of money?
>
> * Which directors, cast members, and genres are in great demand?
>
> * Have movie profits increased or decreased over the years? Is it a good investment?

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
import re
import collections

<a id='wrangling'></a>
## Data Wrangling


In [None]:
# Load tmdb-movies into pandas 
tmdb_df=pd.read_csv("tmdb-movies.csv")
tmdb_df.head()

In [None]:
#review columns 
tmdb_df.info()

In [None]:
# check is there any duplication
tmdb_df.duplicated().value_counts()

In [None]:
# show statistics of whole data
tmdb_df.describe()


### Data Cleaning
> * Remove duplicate rows.
> * Drop the columns that are not needed: [id, imdb_id, popularity].
> * Convert release_date to a datetime type.
> * Convert budget_adj and revenue_adj to integers, since fractions of a dollar do not matter at the scale of millions.
> * budget, revenue and budget_adj, revenue_adj describe the same quantities (the _adj columns are adjusted for inflation), but both pairs are kept: budget and revenue as the raw data, budget_adj and revenue_adj for the statistics.
> * Drop all rows whose budget_adj or revenue_adj is 100$ or less, as such values seem to be wrong.
> * Some movies have a runtime of 0; no separate step is taken, as these rows are expected to be removed together with the zero-budget rows.

In [None]:
# Make copy of data to clean
tmdb_df_cleaned=tmdb_df.copy()

In [None]:
#Drop duplicate values
tmdb_df_cleaned.drop_duplicates(inplace=True)

In [None]:
#Drop columns that will not be used in the analysis: id, imdb_id, popularity
tmdb_df_cleaned.drop(columns=['id','imdb_id','popularity'],inplace=True)

In [None]:
#test the drop operation 
tmdb_df_cleaned.head(2)

In [None]:
#Change Data type of  release_date to date time
tmdb_df_cleaned['release_date']=pd.to_datetime(tmdb_df_cleaned['release_date'])
tmdb_df_cleaned.head(2)
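
One thing worth checking after this conversion: if `release_date` stores two-digit years (for example `11/15/66`, which is an assumption about the CSV here), the `%y` convention maps 00-68 to 2000-2068, so older films can end up dated in the future. A minimal sketch of one possible repair, on a small made-up sample, takes the four-digit year from the `release_year` column instead:

```python
import pandas as pd

# Hypothetical sample rows; the real values would come from tmdb-movies.csv
sample = pd.DataFrame({
    "release_date": ["6/9/15", "11/15/66"],
    "release_year": [2015, 1966],
})

# %y follows the POSIX convention: 69-99 -> 1969-1999, 00-68 -> 2000-2068
parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Rebuild each date from the four-digit release_year plus the parsed month and day
fixed = pd.to_datetime({
    "year": sample["release_year"],
    "month": parsed.dt.month,
    "day": parsed.dt.day,
})
print(fixed)
```

Whether this step is needed depends on how the dates are actually stored, so comparing `release_date.dt.year` with `release_year` first is a reasonable check.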

In [None]:
#test Data type of  release_date
tmdb_df_cleaned.info()

In [None]:
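#Convert budget_adj and revenue_adj to integers; fractions do not matter at the scale of millions or billions.
#int64 is requested explicitly: plain int can be 32-bit on some platforms, which cannot hold values above 2,147,483,647.
col=['budget_adj','revenue_adj']
tmdb_df_cleaned[col]=tmdb_df_cleaned[col].astype('int64')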
tmdb_df_cleaned.head(2)
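
A caveat on the integer cast: inflation-adjusted revenues can run into the billions, while a 32-bit integer tops out at 2,147,483,647. On platforms where the default integer is 32 bits wide (an assumption about the environment, for example Windows with NumPy older than 2.0), such values would not survive a plain `astype(int)`. A small sketch with made-up numbers that checks the range and then casts to an explicit 64-bit type:

```python
import numpy as np
import pandas as pd

# Made-up adjusted revenues in dollars; the second one exceeds the int32 range
revenue_adj = pd.Series([1.907e9, 2.79e9])

int32_max = np.iinfo(np.int32).max
fits_int32 = bool((revenue_adj <= int32_max).all())

# An explicit 64-bit type holds these values regardless of the platform default
revenue_int = revenue_adj.astype("int64")
print(fits_int32, revenue_int.tolist())
```

If the check reports that the column does not fit, results computed after a 32-bit cast (such as the ranking of the most profitable movies) would be worth re-running.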

In [None]:
#test Change data type of budget_adj and revenue_adj
tmdb_df_cleaned.info()

In [None]:
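# Keep only rows where both budget_adj and revenue_adj are greater than 100$
# (.copy() avoids a SettingWithCopyWarning when columns are added later)
tmdb_df_cleaned=tmdb_df_cleaned[(tmdb_df_cleaned['budget_adj']>100) & (tmdb_df_cleaned['revenue_adj']>100)].copy()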
tmdb_df_cleaned.head(2)

In [None]:
#test the drop operation 
tmdb_df_cleaned.info()

<a id='eda'></a>
## Exploratory Data Analysis

> **Questions:**
>
> * Which movie has the highest profit? Which movie has the lowest profit?
>
> * How many documentary movies are there? Is investing in them a waste of money?
>
> * Which directors, cast members, and genres are in great demand?
>
> * Have movie profits increased or decreased over the years? Is it a good investment?

### Research Question 1 (Which movie has the highest Profit? Which movie has the lowest Profit?)

In [None]:
# calculate the profit for all  movies
tmdb_df_cleaned['profit']=tmdb_df_cleaned['revenue_adj']-tmdb_df_cleaned['budget_adj']

In [None]:
# The movie with the highest profit is 'Jaws', with a profit of 1878643093
tmdb_df_cleaned[tmdb_df_cleaned['profit']==tmdb_df_cleaned['profit'].max()]

In [None]:
tmdb_df_cleaned['profit'].describe()

In [None]:
# The movie with the lowest profit is "The Warrior's Way", which lost 413912431 $
tmdb_df_cleaned[tmdb_df_cleaned['profit']==tmdb_df_cleaned['profit'].min()]

In [None]:
# Get highest 10 movie in profit
tmdb_df_cleaned['profit'].nlargest(10)

In [None]:
#Get the 10 most profitable movies as a DataFrame (no hard-coded profit threshold needed)
tmdb_top_10_revenue=tmdb_df_cleaned.nlargest(10,'profit')
tmdb_top_10_revenue.groupby('original_title')['profit'].sum()

In [None]:
# Get lowest 10 movie in profit
tmdb_df_cleaned['profit'].nsmallest(10)

In [None]:
#Get the 10 least profitable movies as a DataFrame (no hard-coded profit threshold needed)
tmdb_min_10_revenue=tmdb_df_cleaned.nsmallest(10,'profit')
tmdb_min_10_revenue.groupby('original_title')['profit'].sum()

In [None]:
#Plot a horizontal bar chart from a Series (index = names, values = amounts) with the given labels
def plot_relation_rate(stage,xlabel,ylabel,message):
    # Sort ascending so the largest value ends up at the top of the chart
    stage=stage.sort_values()
    plt.figure(figsize=(14, 9))
    plt.barh(stage.index,stage.values)
    plt.xlabel(xlabel, fontsize = 14)
    plt.ylabel(ylabel, fontsize = 14)
    plt.title('Relation between {} and {} {}'.format(xlabel,ylabel,message), fontsize = 16)
    plt.show()

In [None]:
# Plot the 10 movies with the highest profit
top_stage=tmdb_top_10_revenue.groupby(['original_title'])['profit'].sum()
plot_relation_rate(top_stage,'Profit','Movie Name',"top 10")

In [None]:
# Plot the directors of the 10 most profitable movies
top_stage=tmdb_top_10_revenue.groupby(['director'])['profit'].sum()
plot_relation_rate(top_stage,'Profit','Director Name',"TOP 10")

In [None]:
# Plot the directors of the 10 least profitable movies
top_stage=tmdb_min_10_revenue.groupby(['director'])['profit'].sum()
plot_relation_rate(top_stage,'profit','Director','Lowest 10')

In [None]:
# Plot the 10 movies with the lowest profit
top_stage=tmdb_min_10_revenue.groupby(['original_title'])['profit'].sum()
plot_relation_rate(top_stage,'profit','Movie name',"Lowest 10")

### Research Question 2  (How many documentary movies are there? Is investing in them a waste of money?)

In [None]:
# How many documentary movies are there? There are 35.
# Note: tmdb_df is reused from here on to hold only the documentary subset.
tmdb_df=tmdb_df_cleaned[tmdb_df_cleaned['genres'].str.contains('Documentary', na=False)]
tmdb_df.info()

In [None]:
# Plot the profit of each of the 35 documentary movies
top_stage=tmdb_df.groupby(['original_title'])['profit'].sum()
plot_relation_rate(top_stage,'Profit','Documentary',"Movie")

> ####  The figure shows that most documentary movies made a profit
> * Next, compare the percentage of losing movies among all movies with the percentage among documentary movies

In [None]:
#Compare the percentage of losing movies (profit <= 0) among documentaries and among all movies
#The mean of a boolean column is the fraction of True values
percentage_fail_documentry_movies=(tmdb_df['profit']<=0).mean()*100
percentage_fail_all_movies=(tmdb_df_cleaned['profit']<=0).mean()*100
plt.figure(figsize=(14, 9))
plt.barh(['Documentary movies','All movies'],[percentage_fail_documentry_movies,percentage_fail_all_movies])
plt.xlabel('Percentage of losing movies', fontsize = 14)
plt.title('Percentage of losing movies: documentaries vs. all movies', fontsize = 16)
plt.show()

In [None]:
print ("The percentage_fail_documentry_movies = ",percentage_fail_documentry_movies,'and percentage_fail_all_movies =',percentage_fail_all_movies)

> ####  The figure shows that the share of documentary movies that lose money is slightly lower than the share for movies in general, so we can conclude that documentaries are a good investment
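Note that the "all movies" figure in this comparison also contains the documentaries themselves. A minimal sketch of a cleaner split, on a tiny made-up table (the genres and profits below are invented for illustration), labels each row as documentary or other and takes the share of rows with `profit <= 0` in each group:

```python
import pandas as pd

# Invented rows, only to illustrate the grouping
movies = pd.DataFrame({
    "genres": ["Documentary", "Drama|Comedy", "Documentary|Music", "Action", "Drama"],
    "profit": [5_000_000, -2_000_000, -100_000, 40_000_000, -7_000_000],
})

is_documentary = movies["genres"].str.contains("Documentary", na=False)
group = is_documentary.map({True: "documentary", False: "other"})
lost_money = movies["profit"] <= 0

# The mean of a boolean column is the fraction of True values
loss_rate = lost_money.groupby(group).mean() * 100
print(loss_rate)
```

With only 35 documentaries in the cleaned data, a difference of a few percentage points between the two groups should be read with caution.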

### Research Question 3  (Which directors, cast members, and genres are in great demand?)

In [None]:
#Plot a bar chart from a dictionary (keys = names, values = counts) with the given labels
def plot_relation_rate_dic(dic,xlabel,ylabel,message):
    # Sort the entries by value so the bars appear in ascending order
    dic=dict(sorted(dic.items(), key=lambda item: item[1]))
    plt.figure(figsize=(22, 10), dpi = 130)
    plt.xlabel(xlabel ,fontsize = 14)
    plt.ylabel(ylabel, fontsize = 14)
    plt.bar(list(dic.keys()),list(dic.values()))
    plt.title('Relation between {} and {} {}'.format(xlabel,ylabel,message), fontsize = 16)
    plt.show()

In [None]:
#Count how often each '|'-separated value (genre, cast member, director) occurs in a column
def data_occurance_rate_due_profit(data):
    counts={}
    for entry in data:
        # Skip missing values (NaN is a float, not a string) and empty strings
        if not isinstance(entry,str) or not entry:
            continue
        for name in entry.split('|'):
            counts[name]=counts.get(name,0)+1
    return counts

In [None]:
#Get the most common movie genres, sorted from most to least frequent
geners=data_occurance_rate_due_profit(tmdb_df_cleaned['genres'])
dict(sorted(geners.items(), key=lambda item: item[1], reverse=True))

> #### As we can see, Drama and then Comedy are in great demand
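The same kind of count can also be obtained with pandas alone. The following is a sketch on a small made-up Series, not the notebook's own method: it splits each entry on `|`, puts every value on its own row with `explode`, and counts the rows with `value_counts`.

```python
import pandas as pd

# Invented sample in the same '|'-separated layout as the genres column
genres = pd.Series(["Drama|Comedy", "Drama", None, "Action|Drama|Thriller"])

genre_counts = (
    genres.dropna()            # ignore missing entries
          .str.split("|")      # a single-character pattern is split literally
          .explode()           # one row per genre
          .value_counts()      # sorted from most to least frequent
)
print(genre_counts)
```

The same chain would apply to the `cast` and `director` columns, assuming they use the same separator.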

In [None]:
# Plot a bar chart of how often each movie genre occurs
plot_relation_rate_dic(geners,'Movie Type','Occurrence',"of data")

In [None]:
#Get the 10 cast members who appear most often
cast=data_occurance_rate_due_profit(tmdb_df_cleaned['cast'])
collections.Counter(cast).most_common(10)

> #### As we can see, 'Robert De Niro' and the other actors listed above are in great demand

In [None]:
#Get the 10 directors who appear most often
director=data_occurance_rate_due_profit(tmdb_df_cleaned['director'])
collections.Counter(director).most_common(10)

> #### As we can see, 'Steven Spielberg' and the other directors listed above are in great demand
> * #### Combining cast members and one of the directors above in a drama, comedy, or thriller movie looks like a great investment

### Research Question 4  (Have movie profits increased or decreased over the years? Is it a good investment?)

In [None]:
# Plot the total profit of all movies per release year
profit_per_year=tmdb_df_cleaned.groupby('release_year')['profit'].sum()
plt.figure(figsize=(12,6))
plt.xlabel('Release year',fontsize = 12)
plt.ylabel('Movies Profit',fontsize = 14)
plt.title('Calculating Profits of all movies per year')
plt.plot(profit_per_year)

### As we can see, total movie profits show a strong upward trend over the years
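A yearly total grows with the number of movies released as well as with the profit of each movie. A minimal sketch, using invented numbers rather than the TMDb data, puts the total next to the mean profit per movie and the movie count so the two effects can be told apart:

```python
import pandas as pd

# Invented figures: the second year has more movies but a lower average profit
movies = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001, 2001, 2001],
    "profit": [10, 30, 10, 10, 20, 20],
})

per_year = movies.groupby("release_year")["profit"].agg(["sum", "mean", "count"])
print(per_year)
```

In this toy example the total rises from one year to the next while the mean per movie falls, which is why the mean is a useful companion to the total when judging an investment.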

<a id='conclusions'></a>
## Conclusions

> ### Which movie has the highest profit? Which movie has the lowest profit?
>
> The movie with the highest profit is 'Jaws', with a profit of 1878643093.
>
> The movie with the lowest profit is "The Warrior's Way", which lost 413912431.
>
> ### How many documentary movies are there? Is investing in them a waste of money?
>
>   There are 35 documentary movies.
>
>   The share of documentary movies that lose money is slightly lower than the share for movies in general.
>
> *   We can conclude that documentaries are a good investment.
>
> ### Which directors, cast members, and genres are in great demand?
>
>   Genres: 'Drama': 1742, 'Comedy': 1340, 'Thriller': 1192, 'Action': 1074, 'Adventure': 741
>
> *   As we can see, Drama and then Comedy are in great demand.
>
>   Cast: ('Robert De Niro', 52), ('Bruce Willis', 46), ('Samuel L. Jackson', 3), ('Nicolas Cage', 43), ('Matt Damon', 36), ('Johnny Depp', 35), ('Brad Pitt', 34), ('Tom Hanks', 34), ('Harrison Ford', 33), ('Tom Cruise', 33)
>
> *   As we can see, 'Robert De Niro' and the other actors listed above are in great demand.
>
>   Directors: ('Steven Spielberg', 28), ('Clint Eastwood', 24), ('Ridley Scott', 21), ('Woody Allen', 18), ('Robert Rodriguez', 17), ('Tim Burton', 17), ('Steven Soderbergh', 17), ('Martin Scorsese', 17), ('Robert Zemeckis', 15), ('Renny Harlin', 15)
>
> *   As we can see, 'Steven Spielberg' and the other directors listed above are in great demand.
> #### Combining cast members and one of the directors above in a drama, comedy, or thriller movie looks like a great investment.
>
> ### Have movie profits increased or decreased over the years? Is it a good investment?
>
>   As we can see, total movie profits show a strong upward trend over the years.

