
# THE MONEY MAKERS
### (Investigating the TMDb movie database to find the highest-grossing movies and their characteristics)


## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

>This project analyses the TMDb movie dataset to learn what the public responds to: which movie genres are popular, and what can be said about the success of a movie before it is released.

> Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. The project also asks which genres are popular, whether particular production houses are associated with the highest revenues, and whether movies with famous casts actually receive the highest ratings.

#### The project starts by importing the libraries used for data analysis in Python.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

csv_path = 'tmdb-movies.csv'
df = pd.read_csv(csv_path)
df.head()

<a id='wrangling'></a>
## Data Wrangling
> #### Gathering knowledge about the dataset

In [None]:
df.info()

The output above shows a total of 10866 entries. There are a few null values in the cast, director and genres columns, and many null values in the homepage, tagline, keywords and production_companies columns. Dropping every row with a null in those columns would shrink the data considerably and could distort the results, so I decided to drop the homepage and tagline columns instead, as they are not needed for the analysis.

In [None]:
df.describe()

The table above shows many zero values in the budget, revenue and runtime columns. Every listed movie has a release_year, so all of them have been released, and a released movie is very unlikely to have zero budget or zero runtime. These zeros are almost certainly missing values that need to be handled:

In [None]:
df_rev_zero=df.query('revenue==0')
df_rev_zero['revenue'].value_counts()

There are 6016 entries with zero revenue. Dropping them would discard too much of the data, so I decided to keep these rows and replace the zeros with null values.

In [None]:
df_budget_zero=df.query('budget==0')
df_budget_zero['budget'].value_counts()

Here too, 5696 values are zero, so I will keep these rows and replace the zeros with null values.

In [None]:
df_runtime_zero=df.query('runtime==0')
df_runtime_zero['runtime'].value_counts()

Only a few rows have zero runtime, so they can be dropped in the data-cleaning stage.
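The three zero-value checks above can also be done in one pass. The following is a minimal sketch on a made-up toy frame (the `toy` values are assumptions for illustration, not rows from the TMDb file): comparing the columns to 0 gives a boolean frame, and summing it counts the zeros per column.

```python
import pandas as pd

# Toy stand-in for the TMDb frame; the values are made up for illustration.
toy = pd.DataFrame({
    "budget":  [0, 150, 0, 30],
    "revenue": [0, 400, 0, 0],
    "runtime": [90, 120, 0, 100],
})

# Comparing to 0 yields True/False per cell; sum() counts the True values per column.
zero_counts = (toy[["budget", "revenue", "runtime"]] == 0).sum()
print(zero_counts)
```

On the real data the same expression would be applied to `df` instead of `toy`.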

# Data Cleaning

In this step we remove the unwanted data that could mislead our findings. The cleaning is based on what was learned in the data-wrangling step above.
    

In [None]:
df.drop(['id','imdb_id','homepage','tagline','overview'],axis=1, inplace=True)
df.head()

Dropping columns that are not used in the analysis: id, imdb_id, homepage, tagline and overview.

In [None]:
sum(df.duplicated())

In [None]:
df.drop_duplicates(inplace=True)

One duplicate entry was found and removed in this step.

In [None]:
cal2 = ['cast', 'director', 'genres']
df.dropna(subset = cal2, inplace=True)
df.info()

Dropping the rows with null values in the cast, director and genres columns.

In [None]:
df.isnull().sum()

Checking whether the null values have been dropped.

### Finally, replacing the zero values with null in the budget and revenue columns:

In [None]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
df['budget'] = df['budget'].replace(0, np.nan)
df['revenue'] = df['revenue'].replace(0, np.nan)
df.info()
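A small illustration of why the zeros are replaced with NaN rather than kept: pandas aggregations such as `mean()` skip NaN by default, so the placeholder zeros no longer pull the statistics down. This is a sketch on made-up numbers (the `toy` frame is an assumption, not TMDb data).

```python
import numpy as np
import pandas as pd

# Made-up budgets; the zeros stand for "value not recorded".
toy = pd.DataFrame({"budget": [0, 100, 0, 300]})

# With zeros left in place they count as real budgets of 0.
mean_with_zeros = toy["budget"].mean()

# After the replacement, mean() ignores the missing entries.
cleaned = toy["budget"].replace(0, np.nan)
mean_without_zeros = cleaned.mean()
missing = cleaned.isna().sum()

print(mean_with_zeros, mean_without_zeros, missing)
```

The rows themselves are kept, so their other columns (cast, genres, popularity and so on) remain available to the rest of the analysis.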

In [None]:
df1=df.query('runtime == 0')
df1['runtime'].value_counts()

#### Finally, dropping the few rows with zero runtime:

In [None]:
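# .copy() makes df an independent frame, so columns added later do not trigger SettingWithCopyWarning
df = df[df['runtime'] != 0].copy()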
df.info()
df.shape

### Data-cleaning summary:
After dropping the unnecessary columns and the duplicate entry, dropping the rows with null values in the cast, director and genres columns, replacing the zero values in the budget and revenue columns with NaN, and dropping the rows with zero runtime, we are left with 10703 rows and 16 columns, which can now be used for analysis.

<a id='eda'></a>
# Exploratory Data Analysis

> In this section we use the cleaned data to draw insights about which characteristics are associated with a successful movie, and whether audiences are likely to rate such a movie highly.

## Research Part 1: Finding the characteristics and properties related to successful movies
> Question 1: What properties are associated with movies that have high popularity?  
> Question 2: What properties are associated with movies that have a high voting score?

In [None]:
# Quartile-binning helper:
def cut_into_quantile(dfname, column_name):
    """Add a '<column_name>_levels' column that bins the values into quartiles."""
    # Quartile, min and max values (NaN entries are ignored)
    min_value = dfname[column_name].min()
    first_quantile = dfname[column_name].quantile(0.25)
    second_quantile = dfname[column_name].quantile(0.50)
    third_quantile = dfname[column_name].quantile(0.75)
    max_value = dfname[column_name].max()
    # Bin edges that will be used to "cut" the data into groups
    bin_edges = [min_value, first_quantile, second_quantile, third_quantile, max_value]
    # Labels for the four level groups
    bin_names = ['Low', 'Medium', 'Moderately High', 'High']
    # Create the '<column_name>_levels' column
    name = '{}_levels'.format(column_name)
    dfname[name] = pd.cut(dfname[column_name], bin_edges, labels=bin_names, include_lowest=True)
    return dfname
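As an alternative to computing the bin edges by hand, pandas offers `pd.qcut`, which bins directly by quantile. The sketch below is not the notebook's own method; it uses a made-up `toy` frame and the same four labels to show the idea, including how a missing budget stays missing in the new column.

```python
import numpy as np
import pandas as pd

# Made-up budgets, with one missing value as in the cleaned TMDb data.
toy = pd.DataFrame({"budget": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, np.nan]})

labels = ["Low", "Medium", "Moderately High", "High"]

# q=4 splits the non-missing values into four equally sized quantile bins.
toy["budget_levels"] = pd.qcut(toy["budget"], q=4, labels=labels)

print(toy)
```

One difference worth knowing: `pd.qcut` raises an error when two quantile edges coincide unless `duplicates='drop'` is passed, so heavily tied columns may need that option.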

In [None]:
# Splitting pipe-separated values and counting them:
def find_top(dataframe_col, num=3):
    # join every entry with '|' and split the result into one flat list
    # (missing values are skipped by str.cat)
    alist = dataframe_col.str.cat(sep='|').split('|')
    # put the list into a dataframe
    new = pd.DataFrame({'top': alist})
    # count how often each value appears and keep the `num` most frequent
    top = new['top'].value_counts().head(num)
    return top
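The same counting can be sketched without building one long string, by splitting each entry into a list and using `explode` to give every value its own row. This is an alternative illustration on a made-up `genres` series, not a replacement for `find_top`.

```python
import pandas as pd

# Made-up pipe-separated entries, including one missing value.
genres = pd.Series(["Drama|Comedy", "Drama", None, "Thriller|Drama|Comedy"])

# Split each entry into a list, put each list element on its own row, then count.
counts = genres.dropna().str.split("|").explode().value_counts()
top2 = counts.head(2)

print(top2)
```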

In [None]:
# Select the 100 most popular movies of each year.
df_top_p = df.sort_values(['release_year','popularity'], ascending=[True, False])
# group by year and keep the first 100 rows (the most popular) of each year
df_top_p = df_top_p.groupby('release_year').head(100).reset_index(drop=True)
df_top_p.tail(5)

In [None]:
# Select the 100 highest-revenue movies of each year.
df_top_r = df.sort_values(['release_year','revenue'], ascending=[True, False])
df_top_r = df_top_r.groupby('release_year').head(100).reset_index(drop=True)
df_top_r.head(5)

In [None]:
# Select the 100 highest-rated movies of each year.
df_top_s = df.sort_values(['release_year','vote_average'], ascending=[True, False])
df_top_s = df_top_s.groupby('release_year').head(100).reset_index(drop=True)
df_top_s.head(5)
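The sort-then-`groupby().head(n)` pattern used in the three cells above can be seen more easily on a tiny example. This sketch uses a made-up `toy` frame and n=2 instead of 100.

```python
import pandas as pd

# Made-up movies; titles and scores are placeholders.
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2000, 2001, 2001],
    "original_title": ["A", "B", "C", "D", "E"],
    "popularity": [1.0, 3.0, 2.0, 5.0, 4.0],
})

# Sort by year, then by popularity descending, so the best rows come first within each year.
# head(2) then keeps the first two rows of every year group.
top2 = (
    toy.sort_values(["release_year", "popularity"], ascending=[True, False])
       .groupby("release_year")
       .head(2)
       .reset_index(drop=True)
)

print(top2)
```

Note that `head(n)` keeps every row of a group that has fewer than n rows, so years with fewer than 100 movies contribute all of their movies.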

## Question 1: What properties are associated with movies that have high popularity?
> 1.1 What budget level is associated with movies that have high popularity?

In [None]:
df = cut_into_quantile(df,'budget')
df.tail(10)

In [None]:
result_mean = df.groupby('budget_levels')['popularity'].mean()
result_mean

In [None]:
result_median = df.groupby('budget_levels')['popularity'].median()
result_median

In [None]:
# the x locations for the groups
ind = np.arange(len(result_mean))  
# the width of the bars
width = 0.5  

In [None]:
# plotting a bar graph:
sns.set_style('darkgrid')
bars = plt.bar(ind, result_mean, width, color='g', alpha=.7, label='mean')

# title and labels
plt.ylabel('popularity')
plt.xlabel('budget levels')
plt.title('Popularity with Budget Levels')
locations = ind
labels = result_median.index  
plt.xticks(locations, labels)
# legend
plt.legend();

### The graph above shows that mean popularity rises with budget level: higher-budget movies tend to be more popular.
This is plausible, since a higher budget may include higher promotion and advertising spending, and heavily promoted movies are more likely to reach a wide audience. Higher-budget movies may also feature popular casts, which in turn draws their fans to the movie. The chart shows an association only; it does not establish that a larger budget causes higher popularity.

## 1.2 How is runtime associated with movies that have high popularity?

In [None]:
df = cut_into_quantile(df,'runtime')
df.head(5)

In [None]:
result_mean = df.groupby('runtime_levels')['popularity'].mean()
result_mean

In [None]:
result_median = df.groupby('runtime_levels')['popularity'].median()
result_median

In [None]:
ind = np.arange(len(result_median))  # the x locations for the groups
width = 0.5  

In [None]:
# plotting bars
bars = plt.bar(ind, result_median, width, color='#1ea2bc', alpha=.7, label='median')

# title and labels
plt.ylabel('popularity')
plt.xlabel('runtime levels')
plt.title('Popularity with Runtime Levels')
locations = ind
labels = result_median.index  
plt.xticks(locations, labels)
# legend
plt.legend();

### The graph above suggests that movies with longer runtimes tend to have higher median popularity.

## 1.3 Which cast members, directors, keywords, genres and production companies are associated with high-popularity movies?

In [None]:
df_top_p.tail(5)    # df_top_p holds the 100 most popular movies of each year

In [None]:
# find top three cast
a = find_top(df_top_p.cast)
# find top three director
b = find_top(df_top_p.director)
# find top three keywords
c = find_top(df_top_p.keywords)
# find top three genres
d = find_top(df_top_p.genres)
# find top three production companies
e = find_top(df_top_p.production_companies)

In [None]:
#Use the result above to create a summary dataframe.
df_popular = pd.DataFrame({'popular_cast': a.index, 'popular_director': b.index, 'popular_keywords': c.index, 'popular_genres': d.index, 'popular_producer': e.index})
df_popular

#### From the table above, Robert De Niro is the cast member who appears most often among the most popular movies, and the most frequent genres among them are Drama, Comedy and Thriller. The most frequent keywords among the most popular movies are based on novel, sex and dystopia. Warner Bros., Universal Pictures and Paramount Pictures are the production companies most often associated with the most popular movies; all three are long-established names in the industry.

## Question 2: What properties are associated with movies that have a high voting score?

### What budget level is associated with the highest-voted movies?

In [None]:
# Find the mean and median voting score of each level with groupby
result_mean = df.groupby('budget_levels')['vote_average'].mean()
result_mean

In [None]:
result_median = df.groupby('budget_levels')['vote_average'].median()
result_median

In [None]:
sns.set_style('darkgrid')
ind = np.arange(len(result_median))  # the x locations for the groups
width = 0.5       # the width of the bars

# plot bars (median rating per budget level)
plt.subplots(figsize=(8, 6))
bars = plt.bar(ind, result_median, width, color='y', alpha=.7, label='median')

# title and labels
plt.ylabel('rating')
plt.xlabel('budget levels')
plt.title('Rating with Budget Levels')
locations = ind  # xtick locations
labels = result_median.index
plt.xticks(locations, labels)
# legend
plt.legend(loc='upper left');

There is no big difference in the median voting score across budget levels. This suggests that a high budget is not necessary for a movie to be rated well.
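To put a number on "no big difference", the mean, median and group size can be tabulated together, and the gap between the highest and lowest group median read off directly. The following is a sketch on made-up ratings (the `toy` frame and its values are assumptions, not TMDb results).

```python
import pandas as pd

# Made-up ratings for two budget levels.
toy = pd.DataFrame({
    "budget_levels": ["Low", "Low", "High", "High", "High"],
    "vote_average": [6.0, 7.0, 5.0, 6.0, 10.0],
})

# One table with mean, median and number of movies per level.
summary = toy.groupby("budget_levels")["vote_average"].agg(["mean", "median", "count"])

# Gap between the highest and the lowest group median.
spread = summary["median"].max() - summary["median"].min()

print(summary)
print(spread)
```

Including the count matters because a level with very few movies gives a less reliable mean or median.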

### How is runtime associated with the highest-voted movies?

In [None]:
# Find the mean voting score of each runtime level with groupby
result_mean = df.groupby('runtime_levels')['vote_average'].mean()
result_mean

In [None]:
result_median = df.groupby('runtime_levels')['vote_average'].median()
result_median

In [None]:
sns.set_style('darkgrid')
ind = np.arange(len(result_median))  # the x locations for the groups
width = 0.5       # the width of the bars

# plot bars (median rating per runtime level)
bars = plt.bar(ind, result_median, width, color='g', alpha=.7, label='median')

# title and labels
plt.ylabel('rating')
plt.xlabel('runtime levels')
plt.title('Rating with Runtime Levels')
locations = ind  # xtick locations
labels = result_median.index
plt.xticks(locations, labels)
# legend
plt.legend();

The runtime of a movie does not appear to have a significant effect on its voting score. A long runtime does not seem to be what makes a movie rated well or badly.

## Research Part 2: Top Keywords and Genres Trends by Generation

>Question 1: Number of movies released year by year  
Question 2: Keywords Trends by Generation  
Question 3: Genres Trends by Generation

### Question 1: Number of movies released year by year

In [None]:
movie_count = df.groupby('release_year').count()['original_title']
movie_count.tail(10)

In [None]:
#set style
sns.set_style('darkgrid')
#set x, y axis data
# x is movie release year
x = movie_count.index
# y is number of movie released
y = movie_count
#set size
plt.figure(figsize=(10, 5))
#plot line chart (yearly movie counts)
plt.plot(x, y, color = 'g')
#set title and labels
plt.title('Number of Movies Released Year by Year')
plt.xlabel('Year')
plt.ylabel('Number of Movies Released');

The graph above shows that the number of movies released per year has grown steeply over the decades, in line with how large the entertainment industry has become.
>  In the 1960s fewer than 50 movies were released each year, and in the 1970s the yearly count stayed below 100. In the 1980s it rose from below 100 to below 200; one possible factor is the number of movies in that decade featuring nudity and sex. In the 1990s many independent directors entered the industry, which in turn meant more movies. From the 2000s onward the yearly count climbed from about 200 to about 680. Possible contributors, not tested here, include newer lenses and cameras and technologies such as 3D and VR.
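One way to check whether growth of this kind is close to exponential is to fit a straight line to the logarithm of the yearly counts: exponential growth appears as a straight line on a log scale, and the slope gives the average annual growth rate. The sketch below uses synthetic counts that grow by exactly 10% per year (an assumption for illustration); on the real data, `movie_count.index` and `movie_count.values` would take the place of `years` and `counts`.

```python
import numpy as np

# Synthetic yearly counts growing by exactly 10% per year (not TMDb data).
years = np.arange(2000, 2006)
counts = 100 * 1.1 ** (years - 2000)

# Fit log(count) = slope * (year - first_year) + intercept.
slope, intercept = np.polyfit(years - years[0], np.log(counts), 1)

# Convert the slope on the log scale back to a yearly growth rate.
annual_growth = np.exp(slope) - 1

print(round(annual_growth, 4))
```

If the real counts bend away from a straight line on the log scale, the growth is steep but not strictly exponential.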

### Question 2: Keywords Trends by Generation

In [None]:
# sorting the movie release year list.
dfyear= df.release_year.unique()
dfyear= np.sort(dfyear)
dfyear

In [None]:
# year list of 1960s
y1960s =dfyear[:10]
# year list of 1970s
y1970s =dfyear[10:20]
# year list of 1980s
y1980s =dfyear[20:30]
# year list of 1990s
y1990s = dfyear[30:40]
# year list from 2000 onward
y2000 = dfyear[40:]

In [None]:
# year list of each generation
times = [y1960s, y1970s, y1980s, y1990s, y2000]
# generation names
names = ['1960s', '1970s', '1980s', '1990s', 'after2000']
# collect one small dataframe per generation
frames = []
for name, years in zip(names, times):
    # filter the dataframe to the selected generation
    dfn = df[df.release_year.isin(years)]
    # apply find_top to that generation's keywords and store the top keyword with its count
    dfn2 = pd.DataFrame({'year': name, 'top': find_top(dfn.keywords, 1)})
    frames.append(dfn2)
# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
df_r3 = pd.concat(frames)
df_r3
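The generation split can also be derived from the year itself rather than from positions in the sorted year list, which avoids relying on every year being present. This is a sketch of that alternative on a made-up frame; the `generation` helper and the `kw_*` keywords are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Made-up years and placeholder keywords.
toy = pd.DataFrame({
    "release_year": [1961, 1969, 1975, 1999, 2004, 2015],
    "keywords": ["kw_a|kw_b", "kw_a", "kw_c", "kw_d", "kw_e|kw_f", "kw_e"],
})

def generation(year):
    """Map a year to its decade label, grouping 2000 onward as 'after2000'."""
    if year >= 2000:
        return "after2000"
    return "{}s".format(year // 10 * 10)

toy["generation"] = toy["release_year"].apply(generation)

# One row per (movie, keyword), then the most frequent keyword in each generation.
exploded = toy.assign(keyword=toy["keywords"].str.split("|")).explode("keyword")
top_keyword = exploded.groupby("generation")["keyword"].agg(
    lambda s: s.value_counts().idxmax()
)

print(top_keyword)
```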

<a id='conclusions'></a>
# Conclusions
### Part one: Finding the properties associated with successful movies.
In this part, I first looked at the properties associated with high-popularity movies: they tend to have high budget levels and longer runtimes. The cast members most associated with high-popularity movies are Robert De Niro and Bruce Willis; the director is Steven Spielberg; the genres are drama, comedy and thriller; the keywords are based on novel and dystopia; and the production companies are Warner Bros., Universal Pictures and Paramount Pictures. For voting score, neither budget level nor runtime showed a big difference between groups.

### Part two: Finding top keywords and genres associated with popular movies by generation.

In this part it was found that the number of movies released per year has grown steeply over the period studied, which was divided into five generations: 1960s, 1970s, 1980s, 1990s and 2000s.
The top keyword in each generation was:   
    1960s - Based on novel  
    1970s - Based on novel  
    1980s - Nudity  
    1990s - Independent film  
    2000s - Woman director  


## Limitations:
In this project, "Analysing the TMDb dataset", some changes were made to the dataset to make it cleaner while keeping its integrity.
<br>1. Some columns were not useful for the analysis and were dropped: 'id', 'imdb_id', 'homepage', 'tagline' and 'overview'.
<br>2. There were many zero values in the budget, revenue and runtime columns. Looking up some of these movies showed that they did have a budget and revenue, and certainly a runtime, so the zeros are missing values. Because the budget and revenue zeros were so numerous, dropping those rows could have led to wrong conclusions, so the zeros were replaced with NumPy NaN values and the other variables of those rows could still be used.
<br>3. Only a few rows had zero runtime, so these rows were dropped from the dataset.
<br>4. There were also a few null values in the cast, director and genres columns; these rows were dropped so that they could not affect the observations.
<br>5. There was also one duplicate entry, which was dropped.
<br>Without these steps we might have arrived at different observations, but there would be no assurance that they were correct.

In [None]:
from subprocess import call
# recent nbconvert releases require an explicit output format via --to
call(['python', '-m', 'nbconvert', '--to', 'html', 'Investigate_a_Dataset.ipynb'])