![example](images/director_shot.jpeg)

# Project Title

**Authors:** Melvin Garcia
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

The aim of this report is to derive insights into what makes a movie successful (and what does not), using data from IMDB, Rotten Tomatoes, Box Office Mojo, TheMovieDB.org and the-numbers.com. These insights are intended to help Microsoft stakeholders make informed decisions about how to enter the movie industry and strategically begin creating a new movie studio.

Points to include and add afterward:
- Data
- Methods
- Results
- Recommendations

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [4]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [5]:
# Here you run your code to explore the data

# Connect to DB
import sqlite3
# Forward slash avoids the invalid '\m' escape sequence and works on every OS
conn = sqlite3.connect('data/movies.db')
cur  = conn.cursor()

## Import schema of DB to reference and explore

- Box Office Mojo (bom)
- IMDB (imdb)
- Rotten Tomatoes (rotten_tomatoes)
- TheMovieDB.org (tmdb)
- the-numbers.com (tn)

![movies.db schema](images/movies_db_schema.png)
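
The schema can also be read directly from the database instead of from the image. The following is a minimal sketch, assuming a standard SQLite file; the helper names `list_tables` and `list_columns` are hypothetical, and the demo runs against an in-memory database with two simplified stand-in tables rather than the real `movies.db`.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    ).fetchall()
    return [row[0] for row in rows]


def list_columns(conn, table):
    """Return (column_name, declared_type) pairs for one table."""
    # PRAGMA statements do not accept bound parameters, so only pass
    # table names that came from list_tables().
    rows = conn.execute(f'PRAGMA table_info("{table}");').fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk)
    return [(row[1], row[2]) for row in rows]


# Demo on an in-memory database (stand-in tables, not the real schema)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE bom_movie_gross "
    "(title TEXT, studio TEXT, domestic_gross REAL, foreign_gross TEXT, year INTEGER);"
)
demo_conn.execute(
    "CREATE TABLE imdb_title_basics (tconst TEXT, primary_title TEXT, start_year INTEGER);"
)

tables = list_tables(demo_conn)
schema = {table: list_columns(demo_conn, table) for table in tables}
for table, columns in schema.items():
    print(table, columns)
```

With the notebook's own connection, `list_tables(conn)` should return every table in `movies.db`, which is a quick way to confirm the names used in later queries.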

## Explore the data

In [6]:
sql_query = cur.execute("""SELECT * FROM imdb_title_basics LIMIT 5;""").fetchall()
df = pd.DataFrame(sql_query)
df.columns = [x[0] for x in cur.description]
df.head()

| | idx | tconst | primary_title | original_title | start_year | runtime_minutes | genres |
|---|---|---|---|---|---|---|---|
| 0 | 0 | tt0063540 | Sunghursh | Sunghursh | 2013 | 175.0 | Action,Crime,Drama |
| 1 | 1 | tt0066787 | One Day Before the Rainy Season | Ashad Ka Ek Din | 2019 | 114.0 | Biography,Drama |
| 2 | 2 | tt0069049 | The Other Side of the Wind | The Other Side of the Wind | 2018 | 122.0 | Drama |
| 3 | 3 | tt0069204 | Sabse Bada Sukh | Sabse Bada Sukh | 2018 | NaN | Comedy,Drama |
| 4 | 4 | tt0100275 | The Wandering Soap Opera | La Telenovela Errante | 2017 | 80.0 | Comedy,Drama,Fantasy |


In [7]:
sql_query = cur.execute("""SELECT * FROM bom_movie_gross LIMIT 5;""").fetchall()
df = pd.DataFrame(sql_query)
df.columns = [x[0] for x in cur.description]
df.head()

| | idx | title | studio | domestic_gross | foreign_gross | year |
|---|---|---|---|---|---|---|
| 0 | 0 | Toy Story 3 | BV | 415000000.0 | 652000000 | 2010 |
| 1 | 1 | Alice in Wonderland (2010) | BV | 334200000.0 | 691300000 | 2010 |
| 2 | 2 | Harry Potter and the Deathly Hallows Part 1 | WB | 296000000.0 | 664300000 | 2010 |
| 3 | 3 | Inception | WB | 292600000.0 | 535700000 | 2010 |
| 4 | 4 | Shrek Forever After | P/DW | 238700000.0 | 513900000 | 2010 |


In [11]:
sql_query = cur.execute("""SELECT * FROM tmdb_movies LIMIT 5;""").fetchall()
df = pd.DataFrame(sql_query)
df.columns = [x[0] for x in cur.description]
df.head()

| | idx | genre_ids | id | original_language | original_title | popularity | release_date | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | [12, 14, 10751] | 12444 | en | Harry Potter and the Deathly Hallows: Part 1 | 33.533 | 2010-11-19 | Harry Potter and the Deathly Hallows: Part 1 | 7.7 | 10788 |
| 1 | 1 | [14, 12, 16, 10751] | 10191 | en | How to Train Your Dragon | 28.734 | 2010-03-26 | How to Train Your Dragon | 7.7 | 7610 |
| 2 | 2 | [12, 28, 878] | 10138 | en | Iron Man 2 | 28.515 | 2010-05-07 | Iron Man 2 | 6.8 | 12368 |
| 3 | 3 | [16, 35, 10751] | 862 | en | Toy Story | 28.005 | 1995-11-22 | Toy Story | 7.9 | 10174 |
| 4 | 4 | [28, 878, 12] | 27205 | en | Inception | 27.92 | 2010-07-16 | Inception | 8.3 | 22186 |
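
The `genre_ids` values look like lists, but in a SQLite table they are most likely stored as text. The sketch below assumes that is the case and shows one way to turn them into real Python lists and then into one row per movie and genre id. The frame `tmdb_demo` is a small stand-in built from two of the rows shown above, not the real query result.

```python
import ast

import pandas as pd

# Stand-in for two rows of tmdb_movies; genre_ids is assumed to be stored as text
tmdb_demo = pd.DataFrame(
    {
        "title": ["Inception", "Toy Story"],
        "genre_ids": ["[28, 878, 12]", "[16, 35, 10751]"],
    }
)

# literal_eval safely parses the text "[28, 878, 12]" into the list [28, 878, 12]
tmdb_demo["genre_ids"] = tmdb_demo["genre_ids"].apply(ast.literal_eval)

# explode() gives one row per (movie, genre id), which is easier to group and count
tmdb_long = tmdb_demo.explode("genre_ids").reset_index(drop=True)
print(tmdb_long)
```

If the column already arrives as lists, the `literal_eval` step should be skipped, since it only accepts strings.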



For example, you could identify the top performing genres and then only investigate the characteristics of films with those genres.
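
One possible way to rank genres is sketched below. It assumes a frame with a comma-separated `genres` column and a `domestic_gross` column, as in the tables shown earlier; the titles and gross figures in `genre_demo` are made up purely for illustration.

```python
import pandas as pd

# Illustrative data only: titles and gross values are invented
genre_demo = pd.DataFrame(
    {
        "title": ["a", "b", "c"],
        "genres": ["Action,Crime,Drama", "Comedy,Drama", None],
        "domestic_gross": [100.0, 40.0, 10.0],
    }
)

# Split the comma-separated genres and create one row per (movie, genre)
genre_long = (
    genre_demo.dropna(subset=["genres"])
    .assign(genre=lambda d: d["genres"].str.split(","))
    .explode("genre")
)

# Count movies and take the median gross per genre; the median is less
# sensitive to a handful of blockbusters than the mean
genre_summary = (
    genre_long.groupby("genre")["domestic_gross"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(genre_summary)
```

A movie with several genres contributes to each of them, so the per-genre counts add up to more than the number of movies.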

- Aim to define 'successful' in a number of ways: top grossing, Rotten Tomatoes ratings, IMDB ratings, etc.
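
As a rough sketch of the "top grossing" definition, the code below computes a worldwide gross and flags the top quartile. It assumes `foreign_gross` may be stored as text (possibly with thousands separators), and the 75th-percentile cutoff is an arbitrary choice for illustration. The first three rows reuse figures from the `bom_movie_gross` sample above; the last row is hypothetical.

```python
import pandas as pd

gross_demo = pd.DataFrame(
    {
        "title": [
            "Toy Story 3",
            "Inception",
            "Shrek Forever After",
            "Hypothetical Small Film",  # invented row for illustration
        ],
        "domestic_gross": [415000000.0, 292600000.0, 238700000.0, 1200000.0],
        "foreign_gross": ["652000000", "535700000", "513900000", None],
    }
)

# Assumption: foreign_gross is text; strip commas and coerce to numbers
gross_demo["foreign_gross"] = pd.to_numeric(
    gross_demo["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Treat a missing foreign gross as 0 so the movie is not dropped
gross_demo["worldwide_gross"] = (
    gross_demo["domestic_gross"] + gross_demo["foreign_gross"].fillna(0)
)

# Flag movies at or above the 75th percentile of worldwide gross
cutoff = gross_demo["worldwide_gross"].quantile(0.75)
gross_demo["top_grossing"] = gross_demo["worldwide_gross"] >= cutoff
print(gross_demo[["title", "worldwide_gross", "top_grossing"]])
```

Rating-based definitions could follow the same pattern by swapping the gross column for a ratings column once those tables are joined in.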

In [2]:
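def pandas_df_sql(query):
    """Run a SQL query on the open cursor and return the result as a DataFrame."""
    rows = cur.execute(query).fetchall()
    # cur.description holds one tuple per result column; its first item is the column name
    columns = [x[0] for x in cur.description]
    # Passing columns here also works when the query returns no rows
    return pd.DataFrame(rows, columns=columns)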

In [None]:
# Match movies by lowercased title across imdb_title_basics and bom_movie_gross
# (tmdb_movies and tn_movie_budgets are not joined yet)

movie_budgetgross_query = """
SELECT 
    *
FROM 
    (SELECT
        tconst,
        lower(primary_title) imdb_title,
        lower(original_title) imdb_original_title,
        start_year,
        runtime_minutes,
        genres
    FROM imdb_title_basics
    ) itb
JOIN 
    (SELECT 
        lower(title) bmg_title,
        studio,
        domestic_gross,
        foreign_gross,
        year
    FROM bom_movie_gross
    ) bmg 
ON itb.imdb_title = bmg.bmg_title;
"""

movie_gross_df = pandas_df_sql(movie_budgetgross_query)
movie_gross_df.info()
movie_gross_df.head()
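
The query above matches on lowercased titles only. Two things may reduce match quality: some Box Office Mojo titles carry a year suffix, such as "Alice in Wonderland (2010)", and different movies can share a title across years. The following is a sketch of one possible refinement in pandas, assuming that IMDB `start_year` and Box Office Mojo `year` refer to the same release year. The helper `normalize_title` and the `tt_demo_*` ids are hypothetical.

```python
import re

import pandas as pd


def normalize_title(title):
    """Lowercase a title and drop a trailing '(YYYY)' suffix if present."""
    title = re.sub(r"\s*\(\d{4}\)\s*$", "", title)
    return title.strip().lower()


# Stand-in rows; the tconst values are placeholders, not real IMDB ids
imdb_demo = pd.DataFrame(
    {
        "tconst": ["tt_demo_1", "tt_demo_2"],
        "primary_title": ["Alice in Wonderland", "Alice in Wonderland"],
        "start_year": [2010, 2015],
    }
)
bom_demo = pd.DataFrame(
    {
        "title": ["Alice in Wonderland (2010)"],
        "domestic_gross": [334200000.0],
        "year": [2010],
    }
)

imdb_demo["join_title"] = imdb_demo["primary_title"].map(normalize_title)
bom_demo["join_title"] = bom_demo["title"].map(normalize_title)

# Join on both the normalized title and the year to avoid same-title collisions
merged = imdb_demo.merge(
    bom_demo.rename(columns={"year": "start_year"}),
    on=["join_title", "start_year"],
    how="inner",
)
print(merged[["tconst", "primary_title", "start_year", "domestic_gross"]])
```

Requiring an exact year match can also drop valid pairs when the two sources disagree by a year, so the number of rows lost is worth checking before adopting it.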

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***
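
As a starting point for the missing-value question, the sketch below counts missing values per column and fills a missing runtime with the median while keeping a flag for the filled rows. It assumes `runtime_minutes` can be missing, as in the `imdb_title_basics` sample shown earlier. Median imputation is only one option, and dropping the rows may suit the analysis better.

```python
import numpy as np
import pandas as pd

# Three rows taken from the imdb_title_basics sample; one runtime is missing
prep_demo = pd.DataFrame(
    {
        "primary_title": ["Sunghursh", "Sabse Bada Sukh", "The Wandering Soap Opera"],
        "runtime_minutes": [175.0, np.nan, 80.0],
        "genres": ["Action,Crime,Drama", "Comedy,Drama", "Comedy,Drama,Fantasy"],
    }
)

# Number of missing values in each column
missing_counts = prep_demo.isna().sum()
print(missing_counts)

# Keep a flag so imputed rows can be excluded or inspected later
prep_demo["runtime_missing"] = prep_demo["runtime_minutes"].isna()

# Fill missing runtimes with the median of the observed runtimes
median_runtime = prep_demo["runtime_minutes"].median()
prep_demo["runtime_filled"] = prep_demo["runtime_minutes"].fillna(median_runtime)
print(prep_demo)
```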

In [9]:
# Here you run your code to clean the data

## Data Modeling

[EDIT - Not needed; replace with Data Analysis / Visualization]

Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [10]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***