# Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)

# Project Brief for Self-Coders

Here you'll have the opportunity to code major parts of Project 1 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code. <br> <br>
Keep in mind that it's all about __getting the right results/conclusions__, not about reproducing identical code. Things can be coded in many different ways, so even if you reach the same conclusions, it's very unlikely that your code will match ours exactly.

## Data Import and first Inspection

1. __Import__ the movies dataset from the CSV file "movies_complete.csv". __Inspect__ the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


In [None]:
df = pd.read_csv("movies_complete.csv", parse_dates=["release_date"])

In [None]:
df

__Some additional information on Features/Columns__:

* **id:** The ID of the movie (clear/unique identifier).
* **title:** The Official Title of the movie.
* **tagline:** The tagline of the movie.
* **release_date:** Theatrical Release Date of the movie.
* **genres:** Genres associated with the movie.
* **belongs_to_collection:** Gives information on the movie series/franchise the particular film belongs to.
* **original_language:** The language in which the movie was originally shot.
* **budget_musd:** The budget of the movie in million dollars.
* **revenue_musd:** The total revenue of the movie in million dollars.
* **production_companies:** Production companies involved with the making of the movie.
* **production_countries:** Countries where the movie was shot/produced.
* **vote_count:** The number of votes by users, as counted by TMDB.
* **vote_average:** The average rating of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **runtime:** The runtime of the movie in minutes.
* **overview:** A brief blurb of the movie.
* **spoken_languages:** Spoken languages in the film.
* **poster_path:** The URL of the poster image.
* **cast:** (Main) Actors appearing in the movie.
* **cast_size:** Number of actors appearing in the movie.
* **director:** Director of the movie.
* **crew_size:** Size of the film crew (incl. director, excl. actors).

In [None]:
df.info()

In [None]:
df.genres[1]

In [None]:
df.cast[3]

In [None]:
df.describe()

In [None]:
df.hist(figsize=(20,12),bins=100)
plt.show()

__Most frequent values: Budget, Revenue, Rating and Votes__

In [None]:
df.budget_musd.value_counts(dropna=False).head(20)

In [None]:
df.revenue_musd.value_counts(dropna=False).head(20)

In [None]:
df.vote_average.value_counts(dropna=False)

In [None]:
df.vote_count.value_counts()

In [None]:
df.describe(include="all")

In [None]:
df.describe(include="object")

In [None]:
df[df.title=="Cinderella"]

## The best and the worst movies...

2. __Filter__ the Dataset and __find the best/worst n Movies__ with the


- Highest Revenue
- Highest Budget
- Highest Profit (=Revenue - Budget)
- Lowest Profit (=Revenue - Budget)
- Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10) 
- Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
- Highest number of Votes
- Highest Rating (only movies with 10 or more Ratings)
- Lowest Rating (only movies with 10 or more Ratings)
- Highest Popularity

__Define__ an appropriate __user-defined function__ to reuse code.


In [None]:
from IPython.display import HTML

In [None]:
df_best=df[["poster_path","title","budget_musd","revenue_musd","vote_count","vote_average","popularity"]].copy()
df_best

In [None]:
df_best["profit_musd"]=df.revenue_musd.sub(df.budget_musd)
df_best["return"]=df.revenue_musd.div(df.budget_musd)
df_best

In [None]:
df_best.columns=["","Title","Budget","Revenue","Votes","Average Rating","Pupularity","Profit","ROI"]
df_best.set_index("Title", inplace= True)
df_best

In [None]:
df_best.iloc[0,0]

In [None]:
# Select the first five rows and the first two columns (poster and budget)
subset = df_best.iloc[:5, :2]
subset


In [None]:
HTML(subset.to_html(escape=False))

In [None]:
# Sort all movies by average rating (no minimum vote count applied yet)
df_best.sort_values(by="Average Rating", ascending=False)

In [None]:
df_best.sort_values(by= "ROI", ascending=False)


In [None]:
df_best.loc[df_best.Budget >=5].sort_values(by="ROI",ascending=False)

In [None]:
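# Assign the result back instead of using inplace=True on a column accessor,
# which may operate on a temporary copy and leave df_best unchanged
df_best["Budget"] = df_best["Budget"].fillna(0)
df_best["Votes"] = df_best["Votes"].fillna(0)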


In [None]:
df_best.info()

## Functions

In [None]:
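def best_worst(n, by, ascending=False, min_bud=0, min_votes=0):
    """Return the first n movies sorted by `by` as an HTML table with posters."""
    # Keep only movies that meet the minimum budget and minimum vote count
    mask = (df_best.Budget >= min_bud) & (df_best.Votes >= min_votes)
    df2 = df_best.loc[mask, ["", by]].sort_values(by=by, ascending=ascending).head(n)
    return HTML(df2.to_html(escape=False))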

### End Functions

__Movies Top 5 Highest Revenue__

In [None]:
best_worst(5,"Revenue")

__Movies Top 5 - Highest Budget__

In [None]:
best_worst(5,"Budget")

__Movies Top 5 - Highest Profit__

In [None]:
best_worst(5,"Profit")

__Movies Top 5 - Lowest Profit__

In [None]:
best_worst(5,by="Profit",ascending=True)

__Movies Top 5 - Highest ROI__

In [None]:
best_worst(5, "ROI", min_bud=10)

__Movies Top 5 - Lowest ROI__

In [None]:
best_worst(5, "ROI", ascending=True, min_bud=10)

__Movies Top 5 - Most Votes__

In [None]:
best_worst(5,"Votes")

__Movies Top 5 - Highest Rating__

In [None]:
best_worst(5,"Average Rating",min_votes=10)

__Movies Top 5 - Lowest Rating__

In [None]:
best_worst(5, "Average Rating", ascending=True, min_votes=10)

In [None]:
best_worst(5,"Average Rating", ascending=True,min_votes=20,min_bud=0)

__Movies Top 5 - Most Popular__

In [None]:
best_worst(5,"Pupularity")

## Find your next Movie

3. __Filter__ the Dataset for movies that meet the following conditions:

__Search 1: Science Fiction Action Movie with Bruce Willis (sorted from high to low Rating)__

In [None]:
df.info()

In [None]:
df.genres[0]

In [None]:
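# na=False treats movies without genre information as non-matches
aux_genres = df.genres.str.contains("Action", na=False) & df.genres.str.contains("Science Fiction", na=False)
aux_genres

In [None]:
# na=False treats movies without cast information as non-matches
aux_actor = df.cast.str.contains("Bruce Willis", na=False)
aux_actor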

In [None]:
df.loc[aux_actor & aux_genres,["title","vote_average","cast"]].sort_values(by="vote_average", ascending=False)

In [None]:
bruce=df.loc[aux_actor & aux_genres,["title","poster_path","vote_average"]].sort_values(by="vote_average", ascending=False)
HTML(bruce.to_html(escape=False))

__Search 2: Movies with Uma Thurman and directed by Quentin Tarantino (sorted from short to long runtime)__

In [None]:
df.info()

In [None]:
df.director[0]

In [None]:
mask_director=df.director=="Quentin Tarantino"
mask_director

In [None]:
mask_actor = df.cast.str.contains("Uma Thurman", na=False)
mask_actor.info()

In [None]:
quentin=df.loc[mask_director & mask_actor,["title","vote_average","poster_path","runtime"]].sort_values(by="runtime",ascending=True).set_index("title")
HTML(quentin.to_html(escape=False))

In [None]:
df.head(3)

__Search 3: Most Successful Pixar Studio Movies between 2010 and 2015 (sorted from high to low Revenue)__

In [None]:
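mask_studio = df.production_companies.str.contains("Pixar Animation Studios", na=False)
mask_time = df.release_date.between("2010-01-01", "2015-12-31")

In [None]:
# Include the revenue and sort from high to low, as the search description asks
df.loc[mask_studio & mask_time, ["title","poster_path","revenue_musd","production_companies"]].sort_values(by="revenue_musd", ascending=False)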

__Search 4: Action or Thriller Movie with original language English and minimum Rating of 7.5 (most recent movies first)__
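
One possible way to build this search is sketched below. It runs on a small made-up DataFrame so that it works on its own; the sample rows, the `|` genre separator and the `"en"` language code are assumptions, so check the real values first (for example with `df.genres[0]` and `df.original_language.value_counts()`).

```python
import pandas as pd

# Tiny stand-in for the movies DataFrame; all values are invented for illustration.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "genres": ["Action|Drama", "Thriller", "Comedy", "Action|Thriller", None],
    "original_language": ["en", "en", "en", "fr", "en"],
    "vote_average": [7.9, 7.5, 8.8, 8.1, 9.0],
    "release_date": pd.to_datetime(
        ["2001-05-01", "2015-03-10", "2010-07-16", "2012-11-02", "1999-01-01"]
    ),
})

# "Action|Thriller" is read as a regular expression: Action OR Thriller.
# na=False treats movies without genre information as non-matches.
mask_genre = sample.genres.str.contains("Action|Thriller", na=False)
mask_lang = sample.original_language == "en"   # assumed language code
mask_rating = sample.vote_average >= 7.5

# Combine all three conditions and show the most recent movies first.
search4 = sample.loc[
    mask_genre & mask_lang & mask_rating,
    ["title", "genres", "vote_average", "release_date"],
].sort_values(by="release_date", ascending=False)

print(search4)
```

On the real data, the same three masks can be built from `df` instead of `sample`.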

## Are Franchises more successful?

4. __Analyze__ the Dataset and __find out whether Franchises (Movies that belong to a collection) are more successful than stand-alone movies__ in terms of:

- mean revenue
- median Return on Investment
- mean budget raised
- mean popularity
- mean rating

hint: use groupby()

__Franchise vs. Stand-alone: Average Revenue__

__Franchise vs. Stand-alone: Return on Investment / Profitability (median)__

__Franchise vs. Stand-alone: Average Budget__

__Franchise vs. Stand-alone: Average Popularity__

__Franchise vs. Stand-alone: Average Rating__
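
The five comparisons above can be sketched with a single `groupby()`, following the hint. The example uses a small invented DataFrame, and the helper column names `Type` and `ROI` are assumptions introduced here, not part of the dataset.

```python
import pandas as pd

# Tiny stand-in for the movies DataFrame; all values are invented for illustration.
sample = pd.DataFrame({
    "title": ["A1", "A2", "B", "C", "D"],
    "belongs_to_collection": ["A Collection", "A Collection", None, None, None],
    "budget_musd": [100.0, 150.0, 20.0, 50.0, None],
    "revenue_musd": [400.0, 300.0, 10.0, 200.0, 5.0],
    "popularity": [30.0, 20.0, 5.0, 10.0, 1.0],
    "vote_average": [7.0, 6.0, 8.0, 5.0, 6.5],
})

# A movie counts as part of a franchise if it belongs to a collection.
sample["Type"] = sample.belongs_to_collection.notna().map(
    {True: "Franchise", False: "Stand-alone"}
)

# Return on investment as defined in the brief: revenue / budget.
sample["ROI"] = sample.revenue_musd.div(sample.budget_musd)

# One aggregation per requested metric; missing values are skipped.
comparison = sample.groupby("Type").agg(
    mean_revenue=("revenue_musd", "mean"),
    median_roi=("ROI", "median"),
    mean_budget=("budget_musd", "mean"),
    mean_popularity=("popularity", "mean"),
    mean_rating=("vote_average", "mean"),
)

print(comparison)
```

Using the median for ROI keeps a few extreme ratios (tiny budgets with large revenues) from dominating the comparison.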

## Most Successful Franchises

5. __Find__ the __most successful Franchises__ in terms of

- __total number of movies__
- __total & mean budget__
- __total & mean revenue__
- __mean rating__
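
A minimal sketch of this aggregation, again on invented sample data: grouping by `belongs_to_collection` drops stand-alone movies automatically, because `groupby()` ignores missing group keys by default. The output column names are assumptions chosen for readability.

```python
import pandas as pd

# Tiny stand-in for the movies DataFrame; all values are invented for illustration.
sample = pd.DataFrame({
    "title": ["A1", "A2", "A3", "B1", "B2", "C"],
    "belongs_to_collection": ["A Collection", "A Collection", "A Collection",
                              "B Collection", "B Collection", None],
    "budget_musd": [100.0, 120.0, 140.0, 50.0, None, 10.0],
    "revenue_musd": [300.0, 400.0, 500.0, 900.0, 700.0, 20.0],
    "vote_average": [7.0, 7.5, 8.0, 6.0, 6.5, 9.0],
})

# One row per franchise with all requested metrics.
franchises = sample.groupby("belongs_to_collection").agg(
    total_movies=("title", "count"),
    total_budget=("budget_musd", "sum"),
    mean_budget=("budget_musd", "mean"),
    total_revenue=("revenue_musd", "sum"),
    mean_revenue=("revenue_musd", "mean"),
    mean_rating=("vote_average", "mean"),
)

# Rank the franchises by any metric, here by total revenue.
top_by_revenue = franchises.nlargest(2, "total_revenue")

print(top_by_revenue)
```

Replacing `"total_revenue"` in `nlargest()` with another column name gives the ranking for that metric.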

## Most Successful Directors

6. __Find__ the __most successful Directors__ in terms of

- __total number of movies__
- __total revenue__
- __mean rating__
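
The director ranking can be sketched the same way. The sample data is invented, and the minimum of two movies applied to the rating ranking is an assumption added here (it is not part of the brief) to keep directors with a single highly rated movie from topping the list.

```python
import pandas as pd

# Tiny stand-in for the movies DataFrame; all values are invented for illustration.
sample = pd.DataFrame({
    "title": ["M1", "M2", "M3", "M4", "M5", "M6", "M7"],
    "director": ["Dir X", "Dir X", "Dir Y", "Dir Y", "Dir Y", "Dir Z", None],
    "revenue_musd": [500.0, 300.0, 100.0, None, 50.0, 900.0, 10.0],
    "vote_average": [8.0, 7.0, 6.0, 6.5, 5.5, 9.5, 4.0],
})

# One row per director; movies without a director are dropped by groupby().
directors = sample.groupby("director").agg(
    total_movies=("title", "count"),
    total_revenue=("revenue_musd", "sum"),
    mean_rating=("vote_average", "mean"),
)

most_movies = directors.nlargest(1, "total_movies")
highest_revenue = directors.nlargest(1, "total_revenue")

# Hypothetical threshold: only rank directors with at least two movies by rating.
min_movies = 2
best_rated = directors.loc[directors.total_movies >= min_movies].nlargest(1, "mean_rating")

print(most_movies, highest_revenue, best_rated, sep="\n\n")
```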