# **Exploratory Data Analysis (EDA) + Time Series** <a id="0"></a> <br>

#### **Content:**
* 1-[Time Series](#1)
    * a-[Number of titles per year](#2)
    * b-[Revenue per year](#3)
    * c-[Budget per year](#4)
    * d-[Runtime per year](#5)    
* 2-[EDA](#6)

***THIS KERNEL IS UNDER CONSTRUCTION***

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train = pd.read_csv('../input/train.csv')

In [None]:
train.head(3)

In [None]:
shape = train.shape
shape

In [None]:
pct_nans = round(train.isnull().sum()/shape[0]*100,1)
print("Percentage of missing data in each column\n", pct_nans)

## 1 - TIME SERIES [^](#0) <a id="1"></a> <br>

In [None]:
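# Work on a copy so that adding columns does not trigger SettingWithCopyWarning
TS = train.loc[:,["original_title","release_date","budget","runtime","revenue"]].copy()
TS = TS.dropna() # dropna() returns a new frame, so the result has to be assigned

TS["release_date"] = pd.to_datetime(TS["release_date"])
TS["Year"] = TS["release_date"].dt.year
TS["Month"] = TS["release_date"].dt.month
TS = TS[TS.Year<2018] # keep only releases dated before 2018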
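
The `Year<2018` filter drops rows rather than repairing them. If `release_date` is stored with two-digit years (an assumption, not verified here), old releases may be parsed into the future, and a possible alternative is to shift those dates back by a century. A minimal sketch on made-up dates, where the `%m/%d/%y` format and the 2018 cut-off are both assumptions:

```python
import pandas as pd

# Hypothetical two-digit-year dates, not taken from train.csv
dates = pd.Series(["2/20/15", "7/4/68", "12/25/99"])

# With %y, values 00-68 are read as 2000-2068 and 69-99 as 1969-1999
parsed = pd.to_datetime(dates, format="%m/%d/%y")

# Assumed cut-off: anything dated 2018 or later is treated as a mis-parsed old release
cutoff = pd.Timestamp("2018-01-01")
fixed = parsed.where(parsed < cutoff, parsed - pd.DateOffset(years=100))

print(fixed.dt.year.tolist())
```

Whether this is preferable to filtering depends on how many rows are affected; it keeps old titles in the time series instead of discarding them.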

### a) Number of titles per year [^](#0) <a id="2"></a> <br>

In [None]:
titles = TS.groupby("Year")["original_title"].count()
titles.plot(figsize=(15,8))
plt.xlabel("Year of release")
plt.ylabel("Number of titles released")
plt.xticks(np.arange(1970,2025,5))

plt.show()

### b) Revenue per year [^](#0) <a id="3"></a> <br>

In [None]:
revenues = TS.groupby("Year")["revenue"].aggregate(["min","mean","max"])
revenues.plot(figsize=(15,8))
plt.xlabel("Year of release")
plt.ylabel("Revenue")
plt.xticks(np.arange(1970,2025,5))
plt.show()

### c) Budget per year [^](#0) <a id="4"></a> <br>

In [None]:
budgets = TS.groupby("Year")["budget"].aggregate(["min","mean","max"])
budgets.plot(figsize=(15,8))
plt.xlabel("Year of release")
plt.ylabel("Budget")
plt.xticks(np.arange(1970,2025,5))
plt.show()

Some movies have a budget of 0. Let's check them.

In [None]:
b_zeros = TS[TS.budget==0]
b_zeros.head()

A quick Google search makes it clear that these **0s are actually missing values**, e.g.:

<a href="https://en.wikipedia.org/wiki/Control_Room_(film)">"Control Room"</a> - the actual budget is around $60,000.

<a href="https://en.wikipedia.org/wiki/The_Invisible_Woman_(2013_film)">"The Invisible Woman"</a> - the actual budget is around $12,000,000.
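
Since the zeros stand in for unknown budgets, one option is to recode them as `NaN` so that pandas aggregations skip them. A minimal sketch on a small made-up frame (the titles, values and the `budget_clean` column name are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [0, 1_000_000, 0, 3_000_000],
})

# Treat a budget of 0 as missing; mean(), min() and max() skip NaN by default
toy["budget_clean"] = toy["budget"].replace(0, np.nan)

print(toy["budget"].mean())        # zeros pull the mean down
print(toy["budget_clean"].mean())  # mean over the known budgets only
```

Applied to the real data, this would also stop the yearly "min" curve from sitting at 0.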


### d) Runtime per year [^](#0) <a id="5"></a> <br>

In [None]:
runtimes = TS.groupby("Year")["runtime"].aggregate(["min","mean","max"])
runtimes.plot(figsize=(15,8))
plt.xlabel("Year of release")
plt.ylabel("Runtime")
plt.xticks(np.arange(1970,2025,5))
plt.show()

It seems we have some movies with a runtime of 0 minutes. Let's investigate.

In [None]:
r_zeros = TS[TS.runtime==0]
r_zeros.head()

Once again, a Google search on these titles shows that the 0s are missing values, e.g.:

<a href="https://en.wikipedia.org/wiki/The_Worst_Christmas_of_My_Life_(film)">"Il peggior Natale della mia vita (The Worst Christmas of My Life)"</a> - the actual runtime is 86 minutes.

<a href="https://en.wikipedia.org/wiki/The_Worst_Week_of_My_Life_(film)">"La peggior settimana della mia vita (The Worst Week of My Life)"</a> - the actual runtime is 93 minutes.
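
To gauge how widespread such placeholder zeros are, one could count them per column before deciding how to handle them. A rough sketch on invented data (the frame below is hypothetical, not taken from train.csv):

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({
    "budget": [0, 5_000_000, 0, 12_000_000],
    "runtime": [0.0, 93.0, 86.0, 120.0],
    "revenue": [100_000, 2_000_000, 500_000, 30_000_000],
})

# Number of exact zeros in each numeric column
cols = ["budget", "runtime", "revenue"]
zero_counts = (toy[cols] == 0).sum()
print(zero_counts)

# mask() replaces entries where the condition is True with NaN
runtime_clean = toy["runtime"].mask(toy["runtime"] == 0)
print(runtime_clean.min())  # smallest known runtime, ignoring the placeholder 0
```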


## 2 - EDA [^](#0) <a id="6"></a> <br>

In [None]:
train.plot(x="runtime",y="budget", kind="scatter",figsize=(12,8))
plt.show()

In [None]:
train.plot(x="popularity",y="budget", kind="scatter",figsize=(12,8))
plt.show()

For clarity, let's remove the outliers (movies with a popularity of 50 or more) and then look at the most popular movies.

In [None]:
pop = train[train.popularity<50]
pop.plot(x="popularity",y="budget", kind="scatter",figsize=(12,8))
plt.show()
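
The cut-off of 50 used above is a fixed, hand-picked threshold. As an alternative sketch (not what this notebook does), a quantile-based bound adapts to the data. The 0.99 quantile is an arbitrary choice for illustration, and the data below is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "popularity" values with one planted extreme point
rng = np.random.default_rng(0)
toy = pd.DataFrame({"popularity": rng.exponential(scale=8.0, size=1000)})
toy.loc[0, "popularity"] = 300.0

# Keep only rows strictly below the 99th percentile
upper = toy["popularity"].quantile(0.99)
trimmed = toy[toy["popularity"] < upper]

print(len(toy), len(trimmed))
```

A quantile bound always removes a fixed share of rows, so it trims the tail even when no true outliers are present; a fixed threshold does not.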

In [None]:
top3 = train.popularity.nlargest(3)
idx3 = top3.index.tolist()
# Select by index label and column name rather than by position
train.loc[idx3,["original_title","popularity"]]

In [None]:
top3 = train.budget.nlargest(3)
idx3 = top3.index.tolist()
# Select by index label and column name rather than by position
train.loc[idx3,["original_title","budget"]]

In [None]:
pop.plot(x="popularity",y="revenue", kind="scatter",figsize=(12,8))
plt.show()

In [None]:
top3 = train.revenue.nlargest(3)
idx3 = top3.index.tolist()
# Select by index label and column name rather than by position
train.loc[idx3,["original_title","revenue"]]

In [None]:
f,ax = plt.subplots(figsize=(10, 8))
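# numeric_only=True skips the title and date columns, which recent pandas versions refuse to correlate
sns.heatmap(TS.corr(numeric_only=True), annot=True, linewidths=.5, fmt= '.2f',ax=ax)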
plt.show()

In [None]:
train.original_language.nunique()

In [None]:
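plt.figure(figsize=(15,5))
# Pass the series as x explicitly; recent seaborn versions treat the first positional argument as data
sns.countplot(x=train['original_language'].sort_values())
plt.show()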