# Project: IMDB MOVIE DATA INVESTIGATION

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
df=pd.read_csv('../input/imdb-data/imdb-movies.csv')
df.head()

<a id='intro'></a>
## Introduction

The IMDb dataset includes information about movies such as directors, genres, release date and year, budget, and production companies.
I will try to answer two questions:
1. Which genres are most popular from year to year?
2. Does budget affect the popularity of a movie? How?

In this section, I checked the dataset information and the missing values, and I defined my questions according to what was missing. The budget, release_date and popularity columns have no missing values, but the genres column has 23. So I only need to drop 23 rows to complete my study.

In [None]:
# Missing value check
df.isnull().sum()

In [None]:
# Data info check
df.info()

<a id='wrangling'></a>
## Data Wrangling


### Duplicated Items

In [None]:
#duplicated lines check
sum(df.duplicated())

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
#Check whether drop_duplicates worked - result should be 0
sum(df.duplicated())


### Missing Value

In [None]:
# The genres column has 23 missing values. It is an object (string) column, so the gaps cannot be filled with a mean.
# I will drop the 23 rows with missing genres.
df = df[df['genres'].notna()]

In [None]:
#Recheck the missing values of genres
#I only dropped the rows with missing genres and kept the other missing values.
#If I had used dropna on the whole dataframe, every row with any missing value would be deleted and the dataset would lose a lot of data.
#genres has 0 missing values now.
df.isnull().sum()
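The difference between dropping only the rows with a missing genre and dropping every row with any missing value can be seen on a small example. This is a minimal sketch on made-up data: the `toy` frame and its values are assumptions for illustration, and only the column names mirror the IMDb CSV.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); column names mirror the IMDb CSV.
toy = pd.DataFrame({
    'genres': ['Drama', np.nan, 'Comedy', 'Horror'],
    'homepage': [np.nan, 'http://example.com', np.nan, 'http://example.org'],
    'popularity': [1.2, 0.4, 2.5, 0.9],
})

# Drop only the rows where 'genres' is missing (same effect as the notna() filter).
genres_only = toy.dropna(subset=['genres'])

# Drop every row that has a missing value in any column.
any_missing = toy.dropna()

print(len(toy), len(genres_only), len(any_missing))
```

On the toy frame the first approach keeps three of the four rows, while the second keeps only one, which is why the notebook restricts the drop to the genres column.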

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1 : Which genres are most popular from year to year?

In [None]:
#df filtered for question 1
df_1=df.filter(items=['genres','popularity','release_year'])
df_1.head()

In [None]:
df_1.info()

In [None]:
#Checked the genre categories and their counts. There are many categories, but some of them contain only 1 sample.
df_1['genres'].value_counts()

In [None]:
#Showed only the categories that have more than 200 samples
df_1['genres'].value_counts()[df_1['genres'].value_counts()>200]

In [None]:
# Each category is grouped by release year and the mean popularity per year is calculated
# The lines are plotted to view how the mean popularity of each category changes per year
drama=df_1.query('genres=="Drama"').groupby(['release_year'])['popularity'].mean()
comedy=df_1.query('genres=="Comedy"').groupby(['release_year'])['popularity'].mean()
documentary=df_1.query('genres=="Documentary"').groupby(['release_year'])['popularity'].mean()
drama_romance=df_1.query('genres=="Drama|Romance"').groupby(['release_year'])['popularity'].mean()
horror=df_1.query('genres=="Horror"').groupby(['release_year'])['popularity'].mean()
plt.plot(drama, label="Drama")
plt.plot(comedy, label="Comedy")
plt.plot(documentary,label="Documentary")
plt.plot(drama_romance, label="Drama and Romance")
plt.plot(horror, label="Horror")
plt.title('Popularity Mean vs Years', fontsize=24)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Popularity Mean', fontsize=16)
plt.grid(True)
plt.legend()

The plot shows the mean popularity per year for 5 categories, and how it changes from year to year.
For example, the popularity of the Drama and Romance category increased a lot after 2010. My detailed observations are under <a href="#conclusions">Conclusions</a>.
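To read off the most popular genre for each year directly, rather than from the lines of a plot, one option is to build a year-by-genre table of mean popularity and take the column with the maximum in each row. The sketch below runs on made-up data; the `toy` frame and its values are assumptions, and only the column names mirror `df_1`.

```python
import pandas as pd

# Toy data (made up for illustration); columns mirror df_1.
toy = pd.DataFrame({
    'genres': ['Drama', 'Drama', 'Comedy', 'Comedy', 'Horror', 'Horror'],
    'release_year': [2010, 2011, 2010, 2011, 2010, 2011],
    'popularity': [0.5, 0.9, 0.7, 0.4, 0.6, 0.8],
})

# Mean popularity with one row per year and one column per genre.
yearly = (
    toy.groupby(['release_year', 'genres'])['popularity']
    .mean()
    .unstack('genres')
)

# For each year, the genre whose mean popularity is highest.
top_genre = yearly.idxmax(axis=1)
print(top_genre)
```

The same `yearly` table could also be passed to `yearly.plot()` to draw one line per genre in a single call.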

In [None]:
#Selected 7 categories and took the mean popularity of each
selected_genres=['Comedy','Comedy|Drama','Comedy|Romance','Documentary','Drama','Drama|Romance','Horror|Thriller']
df_1_filtered=df_1[df_1['genres'].isin(selected_genres)]
popularity_mean=df_1_filtered.groupby('genres')['popularity'].mean()
#groupby sorts genres alphabetically, so the labels are built from the index to match the slices
labels=[genre.replace('|',' and ') for genre in popularity_mean.index]

fig = plt.figure(figsize =(10, 7)) 
plt.pie(popularity_mean, labels=labels, autopct='%.2f')
plt.title('Popularity Mean of Genres',fontsize=24 )

The pie chart shows the mean popularity of each selected genre.
Comedy and Drama has the highest mean popularity, and Documentary has the lowest.


### Research Question 2 : Does budget affect popularity of movie? How?

In [None]:
#Created a new dataframe to answer question 2
#genres, popularity and budget are included
df_2=df.filter(items=['genres','popularity','budget'])
df_2.head()

In [None]:
#Grouped by genres, took the mean budget per genre and sorted in descending order
grouped_budget=df_2.groupby(['genres'])['budget'].mean().sort_values(ascending=False).reset_index()
#Kept the genres whose mean budget is at least 190000000
filtered_budget=grouped_budget[grouped_budget['budget']>=190000000.0]
filtered_budget

In [None]:
#Created a pie chart of the highest-budget categories (mean budget of at least 190000000)
#Labels are taken from the filtered data so that they always match the slices
labels=filtered_budget['genres']
#Only the first (largest) slice is pulled out of the pie
explode=[0.1]+[0.0]*(len(filtered_budget)-1)
fig = plt.figure(figsize =(10, 7)) 
plt.pie(filtered_budget['budget'],labels=labels, autopct='%1.1f%%', explode=explode)
plt.title('Budget Distribution of Genres', fontsize=24)
#Adventure|Fantasy|Action|Western|Thriller category has a greater budget than the other categories

The pie chart shows the distribution of mean budget among the highest-budget genre categories.
The Adventure, Fantasy, Action, Western, Thriller category has the highest mean budget, and the Action, Family, Science Fiction, Adventure, Mystery category has the lowest among them.

In [None]:
#The Comedy category has 712 samples
df_2['genres'].value_counts()

In [None]:
#Only the Comedy category is kept
comedy=df_2[df_2['genres']=='Comedy']
comedy['popularity'].sort_values(ascending=False)

In [None]:
#Popularity of comedy movies plotted in dataset row order
popularity=comedy['popularity']
budget=comedy['budget']
plt.plot(popularity, color='orange')

The plot shows how popularity varies across the movies in the Comedy category (in dataset row order).

In [None]:
#Budget of comedy movies plotted in dataset row order
plt.plot(budget)

The plot shows how budget varies across the movies in the Comedy category.

When I compared the two plots, I could not find a similarity that would let me draw any conclusion. The plots look quite different from each other.

So I changed my method. I decided to group popularity into six ranges (1, 2, 3, 4, 5, 6) and find the mean budget for each popularity range.

Please see the analysis below.

In [None]:
#The highest popularity value is 6.6
comedy['popularity'].sort_values(ascending=False)

In [None]:
#Samples are grouped into popularity ranges 0-1, 1-2, 2-3, 3-4, 4-5 and 5+, and the mean budget is taken per range
#Lower bounds use >= so that movies whose popularity is exactly 1, 2, 3, 4 or 5 are not dropped
comedy_1=comedy[comedy['popularity']<1]
budget_1=comedy_1['budget'].mean()
comedy_2=comedy[(comedy['popularity']>=1) & (comedy['popularity']<2) ]
budget_2=comedy_2['budget'].mean()
comedy_3=comedy[(comedy['popularity']>=2) & (comedy['popularity']<3) ]
budget_3=comedy_3['budget'].mean()
comedy_4=comedy[(comedy['popularity']>=3) & (comedy['popularity']<4) ]
budget_4=comedy_4['budget'].mean()
comedy_5=comedy[(comedy['popularity']>=4) & (comedy['popularity']<5) ]
budget_5=comedy_5['budget'].mean()
comedy_6=comedy[comedy['popularity']>=5 ]
budget_6=comedy_6['budget'].mean()

In [None]:
#Plotted mean budget vs popularity range
budget_list=[budget_1,budget_2,budget_3,budget_4,budget_5,budget_6]
popularity_list=[1,2,3,4,5,6]
#The line is labeled so that plt.legend() has an entry to show
plt.plot(popularity_list,budget_list, label='Budget Mean (Comedy)')
plt.title('Popularity vs Budget', fontsize=20)
plt.xlabel('Popularity Rate', fontsize=16)
plt.ylabel('Budget Mean', fontsize=16)
plt.legend()
plt.grid(True)

The plot shows how the mean budget changes across the popularity ranges (1-6).
There is no clear linear or non-linear relationship between popularity and mean budget for the Comedy category.
You can find my detailed observations and comments under <a href="#conclusions">Conclusions</a>.
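The binning used above (split popularity into ranges, then take the mean budget per range) can also be written more compactly with `pd.cut`. This is a sketch under assumptions, not the method used in this notebook: the `toy` frame and its values are made up, and the half-open ranges 0-1, 1-2, ..., 5+ are one possible choice of bin edges.

```python
import pandas as pd

# Toy data (made up for illustration); columns mirror the 'comedy' frame.
toy = pd.DataFrame({
    'popularity': [0.3, 0.8, 1.5, 2.2, 2.9, 5.4],
    'budget': [10, 30, 50, 20, 40, 90],
})

# Bin edges for the ranges 0-1, 1-2, 2-3, 3-4, 4-5 and 5+.
# right=False makes every bin include its lower edge and exclude its upper edge.
edges = [0, 1, 2, 3, 4, 5, float('inf')]
bins = pd.cut(toy['popularity'], bins=edges, labels=[1, 2, 3, 4, 5, 6], right=False)

# Mean budget per popularity bin; observed=False keeps empty bins in the result (as NaN).
mean_budget = toy.groupby(bins, observed=False)['budget'].mean()
print(mean_budget)
```

Because every value falls into exactly one bin, no movie is silently dropped at a boundary, and empty ranges stay visible as missing values instead of disappearing from the result.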


<a id='conclusions'></a>
## Conclusions

### **Question 1:**
**My observations:**
 - The popularity of Drama and Romance movies increased a lot after 2010.
 - The popularity of Drama movies increased about sixfold between 1970 and 1980.
 - The Documentary category has the lowest popularity.
 - Comedy was the most popular category at the beginning of the 1960s.
 - Horror was the most popular category at the beginning of the 1990s.
 
**Notes:**
- The CSV data is enough to answer this question for some categories.
- Most movies are labeled with several genres. At first I planned to separate the categories. For example, if a movie is labeled as Drama and Romance, I thought I could add 2 rows for this movie, the first for Drama and the second for Romance. But that seemed too complicated to me and I gave up.
- Later I decided to use the genre categories that have more samples than the others.
- Some categories have only 1 sample, which is not enough to answer this question.
- I faced some difficulties when I tried to draw the plots, mostly because of lack of practice. Later I overcame them.
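The row-per-genre idea described in the notes above (one row for Drama and one for Romance when a movie is labeled `Drama|Romance`) can be sketched with pandas `str.split` and `explode`. This was not done in this notebook; the `toy` frame below is made up for illustration, and only the column names and the pipe-separated format mirror the dataset.

```python
import pandas as pd

# Toy data (made up for illustration); the pipe-separated format mirrors the 'genres' column.
toy = pd.DataFrame({
    'genres': ['Drama|Romance', 'Comedy', 'Horror|Thriller'],
    'popularity': [1.0, 2.0, 3.0],
    'release_year': [2010, 2011, 2012],
})

# Split each pipe-separated string into a list, then emit one row per list element.
# The other columns are repeated for every genre of the same movie.
one_per_genre = (
    toy.assign(genres=toy['genres'].str.split('|'))
    .explode('genres')
)
print(one_per_genre)
```

With this layout each movie counts once for every genre it carries, so single-genre groups such as Drama would include multi-genre movies too, and the results could differ from the ones reported here.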


### **Question 2:**

**My observations:**
* I tried to observe whether budget affects the popularity of a movie.
* For this, I chose to analyze the Comedy category because it has more samples than the others.
* When I plotted budget and popularity, I could not observe any meaningful result.
* So I decided to divide popularity into six ranges (1, 2, 3, 4, 5, 6) and I took the mean budget for each range.
* The last plot showed that there is no linear or otherwise consistent relationship between these two variables.
* For example, when popularity increased from 1 to 2, the mean budget also increased.
* But when popularity increased from 2 to 3, the mean budget decreased.
* So we cannot say that the mean budget directly affects popularity for the Comedy category.

**Notes:**
- I selected the Comedy category to assess the effect of budget on popularity.
- When I plotted popularity and budget with all the data, I could not observe a relationship between the plots.
- Later I decided to group popularity into six ranges (1, 2, 3, 4, 5, 6). I calculated the mean budget for each popularity range.
- Finally I drew a line chart of mean budget vs popularity range. Here too I could not observe a direct relationship between the two. We cannot say that a budget increase improves popularity, or the opposite.
- In my opinion, the CSV data is enough to evaluate this question, but according to my observations there is no linear or non-linear relationship between budget and popularity.
- When I first plotted popularity and budget I could not observe a relationship, and I felt stuck because I could not see how to answer this question. Later I thought I could group popularity into ranges and try to create a clearer plot than the first one. That made the comparison possible.