![example](images/director_shot.jpeg)

# Microsoft Movie Analysis - Project 1

**Authors:** Scott Graham
***

## Overview

This project gives [Microsoft](https://www.microsoft.com/en-au/movies-and-tv?activetab=movies%3aprimaryr2) insight into which recent box office films are the most popular and most successful, so that the company can use this information to produce quality content at its new movie studio. A detailed analysis of what is "hot or not" helps ensure that resources go toward content that will be widely accepted and deliver the best return on investment for Microsoft.

## Business Problem

Microsoft needs to produce movie content that is relevant to current trends, to ensure that its resources are allocated where they will deliver the best results. This analysis uses the [IMDB](https://www.imdb.com/) database to gain insights from fan reviews and critical reviews, and to provide the best available information about what is trending with Microsoft's intended audience.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

[IMDB](https://www.imdb.com/) is one of the largest databases of information about movies and TV series. It includes information about the cast, crew, plot summaries, ratings, and reviews, both critical and fan-based. We will use this information to determine whether specific actors, genres, movie themes, or other factors are associated with the best content, giving Microsoft a clear indicator of how to invest resources in its movie studio.
Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

In [1]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Here you run your code to explore the data
bom_gross = pd.read_csv('data/zippedData/bom.movie_gross.csv.gz')
imdb_basics = pd.read_csv('data/zippedData/imdb.title.basics.csv.gz')
imdb_ratings = pd.read_csv('data/zippedData/imdb.title.ratings.csv.gz')

In [None]:
bom_gross.info()

In [None]:
imdb_basics.info()

In [None]:
imdb_ratings.info()

In [None]:
# import sqlite3
# con1 = sqlite3.connect('data.sqlite')
# con2 = sqlite3.connect('imdb_ratings')

In [None]:
# imdb_combined = """
# SELECT *
# FROM imdb_ratings
#     JOIN imdb_basics
#     USING(tconst)
# LIMIT 10
# ;
# """
# pd.read_sql(imdb_combined, con1)
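
The commented-out SQL join above can only run if both tables live in the same SQLite database. A minimal sketch of one way to do that, assuming the two DataFrames are first written to an in-memory database with `to_sql`. The rows and IDs below are made up for illustration and are not taken from the IMDB files.

```python
import sqlite3
import pandas as pd

# Toy stand-ins for imdb_basics and imdb_ratings (hypothetical rows and IDs)
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primary_title": ["Film A", "Film B", "Film C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000003"],
    "averagerating": [6.5, 7.1],
})

# Write both tables into a single in-memory SQLite database
con = sqlite3.connect(":memory:")
basics.to_sql("imdb_basics", con, index=False)
ratings.to_sql("imdb_ratings", con, index=False)

query = """
SELECT *
FROM imdb_ratings
    JOIN imdb_basics
    USING(tconst)
LIMIT 10
;
"""
joined = pd.read_sql(query, con)
con.close()
print(joined)
```

Only titles present in both tables survive the join, which is the same behaviour as an inner merge in pandas.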

### BOM Gross Data
General information relevant to the BOM Gross data.

In [None]:
bom_gross.head()

In [None]:
bom_gross.info()

In [3]:
#Convert foreign gross to float to match domestic gross

bom_gross['foreign_gross'] = pd.to_numeric(bom_gross['foreign_gross'], errors='coerce')
#If we want to change NaN values to 0:
# bom_gross = bom_gross.replace(np.nan, 0, regex=True)
print(bom_gross.dtypes)

title              object
studio             object
domestic_gross    float64
foreign_gross     float64
year                int64
dtype: object
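
Because `errors='coerce'` silently turns anything unparseable into `NaN`, it can be worth checking which raw values were lost in the conversion. The sketch below uses made-up strings. The comma-formatted entry is an assumption about what might appear in `foreign_gross`, not something verified against the file.

```python
import pandas as pd

# Hypothetical raw column: one entry contains a thousands separator
raw = pd.Series(["652000000", "1,131.6", None, "535700000"], name="foreign_gross")
converted = pd.to_numeric(raw, errors="coerce")

# Entries that had a value before conversion but are NaN afterwards
lost = raw[raw.notna() & converted.isna()]
print(lost)
```

If `lost` is non-empty, those strings may need cleaning (for example, removing commas) before conversion, depending on what they actually represent.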


In [None]:
#Checking that all titles are unique
bom_gross.duplicated('title').value_counts()

In [None]:
#Identify the repeated title (reported as 'top' by describe())
bom_gross['title'].describe()
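
`describe()` only reports the single most frequent title. To see every row that shares a title, `duplicated` with `keep=False` can be used instead. This is a small sketch on a toy frame. The studio codes and years are placeholders, not values from the BOM data.

```python
import pandas as pd

# Toy frame: two rows share a title but differ in the other columns (placeholder values)
titles = pd.DataFrame({
    "title": ["Bluebeard", "Inception", "Bluebeard"],
    "studio": ["A", "WB", "B"],
    "year": [2010, 2010, 2017],
})

# keep=False marks every member of a duplicate group, not just the later ones
repeats = titles[titles.duplicated("title", keep=False)].sort_values("title")
print(repeats)

# Rows that differ in any column are not removed by a plain drop_duplicates()
print(len(titles.drop_duplicates()))
```

Looking at the full rows helps decide whether the repeat is a true duplicate or two different films with the same name.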

In [None]:
bom_gross['studio'].describe()

In [None]:
bom_gross['domestic_gross'].describe()

In [None]:
bom_gross['foreign_gross'].describe()

### IMDB Basics Data
General information relevant to the IMDB Basics data.

In [None]:
imdb_basics.head()

In [None]:
imdb_basics.info()

In [None]:
#To check that all rows have a unique tconst
imdb_basics.duplicated('tconst').value_counts()

In [None]:
#Checking for repeats
imdb_basics.duplicated('primary_title').value_counts()

In [None]:
#Checking for repeats
imdb_basics.duplicated('original_title').value_counts()

In [None]:
#Checking the frequency of each genre.
imdb_basics['genres'].value_counts()
#Result of this shows that I need to separate values when multiple genres are grouped together
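
One possible way to separate the grouped genres is to split the comma-delimited string into a list and then use `explode` so each genre gets its own row. The sketch below uses three rows taken from the `imdb_basics` preview. The `genre` column name is an assumption introduced here.

```python
import pandas as pd

# Three rows from the imdb_basics preview
basics = pd.DataFrame({
    "primary_title": ["Sunghursh", "The Other Side of the Wind", "Sabse Bada Sukh"],
    "genres": ["Action,Crime,Drama", "Drama", "Comedy,Drama"],
})

# Split "Action,Crime,Drama" into a list, then give each genre its own row
genre_rows = basics.assign(genre=basics["genres"].str.split(",")).explode("genre")

# Frequency of each individual genre
genre_counts = genre_rows["genre"].value_counts()
print(genre_counts)
```

Note that a film with several genres now appears on several rows, so any per-film statistic should be computed before exploding or deduplicated afterwards.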

In [None]:
imdb_basics['runtime_minutes'].describe()

### IMDB Ratings Data
General information relevant to the IMDB Ratings data.

In [None]:
imdb_ratings.head()

In [None]:
#To test that all rows are unique
imdb_ratings.duplicated('tconst').value_counts()

In [None]:
imdb_ratings['averagerating'].describe()

In [None]:
imdb_ratings['numvotes'].describe()

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

### Data Cleaning
I need to:

* Compare the primary and original titles and determine which one to use
* Remove symbols before the primary title name
* Remove the `original_title` column
* Break apart the genres
* Combine the foreign and domestic gross to give a total gross column

In [4]:
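#Need to remove the repeat title in bom_gross 'Bluebeard'
#drop_duplicates() with no arguments only drops fully identical rows,
#so compare on 'title' and keep the first occurrence
bom_gross = bom_gross.drop_duplicates(subset='title', keep='first')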

In [15]:
#Rename tconst to reviewid
imdb_basics.rename(columns={'tconst':'reviewid'}, inplace=True)
imdb_basics.head()

Unnamed: 0,reviewid,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [16]:
#Rename tconst to reviewid
imdb_ratings.rename(columns={'tconst':'reviewid'}, inplace=True)
imdb_ratings.head()

Unnamed: 0,reviewid,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


In [17]:
#Removing the symbols '!', ',' and '#' from primary title name
#regex=True makes the character class a pattern rather than a literal string
imdb_basics['primary_title'] = imdb_basics['primary_title'].str.replace(r"[!,#]", "", regex=True)
# imdb_basics


In [18]:
imdb_basics.sort_values('primary_title').head(10)

Unnamed: 0,reviewid,primary_title,original_title,start_year,runtime_minutes,genres
88974,tt5144238,$2 a Day,$2 a Day,2015,52.0,Documentary
28591,tt2106284,$50K and a Call Girl: A Love Story,$50K and a Call Girl: A Love Story,2014,90.0,"Action,Adventure,Comedy"
140532,tt9118844,$MOKE,$MOKE,2019,,
70067,tt4004608,$elfie Shootout,$elfie Shootout,2016,86.0,Comedy
33792,tt2258233,$ellebrity,$ellebrity,2012,89.0,Documentary
76834,tt4397606,$kumbagz,$kumbagz,2015,71.0,"Crime,Thriller"
39648,tt2410904,$tiffed or How I Learned to Deal with Dissapoi...,$tiffed or How I Learned to Deal with Dissapoi...,2012,51.0,Comedy
112668,tt6608094,&,&,2017,,
121005,tt7288662,& Jara Hatke,& Jara Hatke,2016,110.0,"Drama,Family,Romance"
36537,tt2332503,&Me,&Me,2013,88.0,Romance
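
The sorted output shows that titles beginning with `$` or `&` are still present. If the goal is to strip symbols only from the start of a title, one option is an anchored regular expression. This is a sketch under that assumption, using titles from the output above. Whether stripping is desirable at all is a judgment call, since a title such as `&` becomes empty.

```python
import pandas as pd

# Titles copied from the sorted output above
titles = pd.Series(["$2 a Day", "$MOKE", "&", "& Jara Hatke", "&Me"])

# ^\W+ matches one or more non-word characters (symbols, spaces) at the start only
stripped = titles.str.replace(r"^\W+", "", regex=True)
print(stripped.tolist())
```

Symbols inside a title are left alone, because the pattern is anchored to the start of the string.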


In [19]:
#Remove unnecessary columns
imdb_basics.drop(columns = 'original_title', inplace = True)

In [20]:
imdb_basics.head()

Unnamed: 0,reviewid,primary_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,2017,80.0,"Comedy,Drama,Fantasy"


In [23]:
#Add a new column that has the total gross
bom_gross['total_gross'] = bom_gross['domestic_gross'] + bom_gross['foreign_gross']
bom_gross.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year,total_gross
0,Toy Story 3,BV,415000000.0,652000000.0,2010,1067000000.0
1,Alice in Wonderland (2010),BV,334200000.0,691300000.0,2010,1025500000.0
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000.0,2010,960300000.0
3,Inception,WB,292600000.0,535700000.0,2010,828300000.0
4,Shrek Forever After,P/DW,238700000.0,513900000.0,2010,752600000.0
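
Adding two columns with `+` returns `NaN` whenever either value is missing, so films with no recorded foreign gross would have no total. A minimal sketch of an alternative, assuming a missing value should be treated as "no recorded gross" rather than dropping the row. The second row is hypothetical.

```python
import numpy as np
import pandas as pd

gross = pd.DataFrame({
    "title": ["Toy Story 3", "Hypothetical Film"],   # second row is made up
    "domestic_gross": [415000000.0, 1000000.0],
    "foreign_gross": [652000000.0, np.nan],
})

# Plain addition propagates NaN
plain_sum = gross["domestic_gross"] + gross["foreign_gross"]

# Row-wise sum skips NaN; min_count=1 keeps NaN only when both values are missing
filled_sum = gross[["domestic_gross", "foreign_gross"]].sum(axis=1, min_count=1)

print(pd.DataFrame({"plain": plain_sum, "filled": filled_sum}))
```

Which behaviour is appropriate depends on whether a missing foreign gross means zero revenue or simply unreported data.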


### Merging Datasets
Merging the IMDB datasets for easy comparison of the values.

In [21]:
# to_concat = [imdb_basics, imdb_ratings]
# imdb_comb = pd.concat(to_concat)
# imdb_comb
#pd.concat stacks the two tables instead of matching rows by key,
#so an inner merge on the shared 'reviewid' column is used instead.

imdb_comb = pd.merge(imdb_basics, imdb_ratings, on='reviewid', how='inner')
imdb_comb.sort_values('primary_title')

Unnamed: 0,reviewid,primary_title,start_year,runtime_minutes,genres,averagerating,numvotes
17401,tt2106284,$50K and a Call Girl: A Love Story,2014,90.0,"Action,Adventure,Comedy",6.8,1818
41912,tt4004608,$elfie Shootout,2016,86.0,Comedy,3.5,101
20738,tt2258233,$ellebrity,2012,89.0,Documentary,5.5,1001
45518,tt4397606,$kumbagz,2015,71.0,"Crime,Thriller",6.4,16
66086,tt7288662,& Jara Hatke,2016,110.0,"Drama,Family,Romance",6.5,13
...,...,...,...,...,...,...,...
71004,tt8514766,Üç Harflilerin Musallat Oldugu Büyülü Konakta ...,2018,80.0,"Comedy,Horror,Thriller",4.8,51
45685,tt4422510,Üç Iki Bir... Kestik,2014,92.0,Comedy,4.3,88
52301,tt5217114,à propos: philosophie,2016,80.0,Documentary,8.2,5
36816,tt3509772,ärtico,2014,78.0,Drama,6.6,101
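
An inner merge quietly drops titles that have no rating. To see how many rows are lost and to confirm the key is unique on both sides, `pd.merge` accepts the `validate` and `indicator` arguments. This is a sketch on toy data with hypothetical IDs and titles.

```python
import pandas as pd

# Toy stand-ins for imdb_basics and imdb_ratings (hypothetical IDs and titles)
basics = pd.DataFrame({
    "reviewid": ["tt1", "tt2", "tt3"],
    "primary_title": ["Film A", "Film B", "Film C"],
})
ratings = pd.DataFrame({
    "reviewid": ["tt1", "tt3"],
    "averagerating": [6.8, 3.5],
    "numvotes": [1818, 101],
})

# validate raises MergeError if reviewid repeats on either side;
# indicator adds a _merge column showing where each row came from
merged = pd.merge(basics, ratings, on="reviewid", how="left",
                  validate="one_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

Rows marked `left_only` are the titles an inner merge would have removed.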


In [None]:
#Next, build a table keyed on the BOM gross titles, since each title there is an individual film,
#then add in the columns from the imdb_comb table. Many of those columns will not be needed.
#Purpose: compare each film's gross against its rating and number of reviews.

bom_gross.sort_values('title')
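
The plan described in the comments can be sketched as a merge on title and release year. Some BOM titles carry a year suffix, such as `Alice in Wonderland (2010)`, so the sketch strips it first. The `clean_title` column and all rating values are assumptions for illustration. The ratings are placeholders, not real IMDB data.

```python
import pandas as pd

bom = pd.DataFrame({
    "title": ["Toy Story 3", "Alice in Wonderland (2010)", "Inception"],
    "year": [2010, 2010, 2010],
    "total_gross": [1067000000.0, 1025500000.0, 828300000.0],
})
imdb = pd.DataFrame({
    "primary_title": ["Alice in Wonderland", "Inception", "Hypothetical Unmatched Film"],
    "start_year": [2010, 2010, 2012],
    "averagerating": [5.0, 5.0, 5.0],   # placeholder ratings
})

# Strip a trailing " (YYYY)" so the title can match the IMDB spelling
bom["clean_title"] = bom["title"].str.replace(r"\s\(\d{4}\)$", "", regex=True)

# Matching on year as well reduces false matches between films that share a name
combined = bom.merge(
    imdb,
    left_on=["clean_title", "year"],
    right_on=["primary_title", "start_year"],
    how="inner",
)
print(combined[["clean_title", "year", "total_gross", "averagerating"]])
```

Titles that differ in punctuation or spelling between the two sources will still fail to match, so the number of unmatched rows is worth checking.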

In [None]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***