# Microsoft Movie Studio Data Analysis
---

Authors: Andrew Bernklau, [Kelsey Lane](mailto:kelsklane@gmail.com), Lenore Perconti

## Overview
---

This project analyzes several movie datasets to formulate three relevant recommendations for the direction of Microsoft's new movie studio. By looking at **columns**, the data show that **basic results**. Microsoft can use these recommendations to help decide which direction to take its new studio in terms of what types of movies to create for **intended direction ie) gross/rating**.

## Business Problem
---
For its future movies, Microsoft should **summarize recommendations**. By following these recommendations, Microsoft can **target audience/money/aka how use**. **Summarize implications of project for prob/stakeholder**

## Data
---
[IMDb](https://www.imdb.com) is a public, online database with information about video media content. The datasets provide information about directors, writers, ratings, and runtime that can be used to track the success of different films. **edit features** The datasets used in the analysis include ones pertaining to **list used datasets**.
- **Present size of datasets and descriptive stats for features used in analysis**
- **Justify features inclusion based on properties + project relevance**
- **Identify any data limitations that have project implications - nothing earlier than 2018 (?)**

# Initial Exploration

Load the libraries used throughout the notebook:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Read in Files:

In [2]:
#Columns: title, studio, domestic_gross, foreign_gross, year
bom_movie_gross = pd.read_csv('Data/bom.movie_gross.csv.gz', compression='gzip')

#Columns: nconst, primary_name, birth_year, death_year, primary_profession, known_for_titles
imdb_name_basics = pd.read_csv('Data/imdb.name.basics.csv.gz', compression='gzip')

#Columns: title_id, ordering, title, region, language, types, attributes, is_original_title
imdb_title_akas = pd.read_csv('Data/imdb.title.akas.csv.gz', compression='gzip')

#Columns: tconst, primary_title, original_title, start_year, runtime_minutes, genres
imdb_title_basics = pd.read_csv('Data/imdb.title.basics.csv.gz', compression='gzip')

#Columns: tconst, directors, writers
imdb_title_crew = pd.read_csv('Data/imdb.title.crew.csv.gz', compression='gzip')

#Columns: tconst, ordering, nconst, category, job, characters
imdb_title_principals = pd.read_csv('Data/imdb.title.principals.csv.gz', compression='gzip')

#Columns: tconst, averagerating, numvotes
imdb_title_ratings = pd.read_csv('Data/imdb.title.ratings.csv.gz', compression='gzip')

#Columns: id, synopsis, rating, genre, director, writer, theater_date, dvd_date, currency, box_office, runtime, studio
rt_movie_info = pd.read_csv('Data/rt.movie_info.tsv.gz', compression='gzip', sep = '\t')

#Columns: id, review, rating, fresh, critic, top_critic, publisher, date
rt_reviews = pd.read_csv('Data/rt.reviews.tsv.gz', compression='gzip', sep = '\t', encoding = 'latin1')

#Columns: Unnamed:0, genre_ids, original_language, original_title, popularity, release_date, title, vote_average, vote_count
tmdb_movies = pd.read_csv('Data/tmdb.movies.csv.gz', compression='gzip')

#Columns: id, release_date, movie, production_budget, domestic_gross, worldwide_gross
tn_movie_budgets = pd.read_csv('Data/tn.movie_budgets.csv.gz', compression='gzip')


# Box Office Mojo Data Rundown

In [10]:
#bom_movie_gross['year'].max()

2018

foreign_gross is stored as a string and needs to be converted to a numeric type
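A minimal sketch of that conversion, using made-up sample rows rather than the real file. It assumes some values may contain thousands separators (for example `"1,131.6"`); that formatting is an assumption to verify against `bom_movie_gross` before relying on it. Because the column has missing values, the result is a float column rather than a plain int.

```python
import pandas as pd

# Hypothetical sample rows standing in for bom_movie_gross
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "foreign_gross": ["652000000", "1,131.6", None],
})

# Remove thousands separators, then coerce to numbers.
# Missing or unparseable entries become NaN instead of raising an error.
sample["foreign_gross_num"] = pd.to_numeric(
    sample["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(sample[["title", "foreign_gross_num"]])
```

If whole-dollar integers are required, the nullable `Int64` dtype is one option once the missing values have been dealt with.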

Studio has 5 missing values, domestic gross is missing 28 values, foreign gross is missing 1350

Two movies named Bluebeard (same or different movies?)

Could look into domestic vs foreign markets, most profitable studios + trends they follow?

# IMDb Data Rundown

In [4]:
#imdb_name_basics.head()
#imdb_title_akas.head()
#imdb_title_basics['start_year'].sort_values(ascending = False)
#imdb_title_crew['writers'].value_counts()
#imdb_title_principals['characters'].value_counts()
#imdb_title_ratings.info()

**Probably need to match movie titles/names instead of tt and nm tags**

**name_basics**: List of people and their primary profession as well as titles they're known for
- many birth years and death years missing - can drop these columns? don't seem to provide useful info
- 51,340 rows missing profession
- 30,204 rows missing known for titles
- many names are repeated multiple times - would have to consolidate this info
- could be useful to get list of current trending directors/actors etc to recommend for a film?
- would need to clean up primary_profession column: separate on commas, then add professions to a list so the data is easier to access
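One possible way to do the primary_profession cleanup described above, sketched on made-up rows. The names and professions here are hypothetical placeholders, not values taken from `imdb_name_basics`.

```python
import pandas as pd

# Hypothetical sample rows standing in for imdb_name_basics
people = pd.DataFrame({
    "primary_name": ["Person A", "Person B", "Person C"],
    "primary_profession": ["actor,producer", "director", None],
})

# Split the comma-separated string into a Python list per row.
# Rows with a missing profession stay missing (NaN).
people["profession_list"] = people["primary_profession"].str.split(",")

# One row per (person, profession) pair makes counting and filtering easier.
exploded = people.explode("profession_list")
profession_counts = exploded["profession_list"].value_counts()

print(profession_counts)
```

Keeping both the list column and the exploded frame is a design choice: the list is handy for lookups per person, while the exploded form works better with `groupby` and `value_counts`.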

**title_akas**: international info? titles, regions, language, and lists if original or not
- may have issues with this one due to foreign text, ditch it?
- region is missing info and language missing a lot of info, types and attributes also missing info
- lists US info, types marks if in festival/dvd/etc, attributes has info on if new/alt spelling/complete title
- could be used to clarify info in other tables perhaps?

**title_basics**: List of titles (original and marketed) plus the runtime, start_year, and genres
- original title is missing 21 entries, runtime missing 31,739 entries, genre missing 5,408
- primary and original title has many repeats of names
- one of the films has a start_year of 2115? Either an error or the year the film's story starts in-universe
- would need to split genres on commas and make a list for easier readability of the data
- would be good to groupby year and look at runtime/genre trends?
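The genre split and per-year grouping above could look roughly like this. The rows are made up, and `MAX_PLAUSIBLE_YEAR` is a hypothetical cutoff introduced only to set aside entries like the 2115 one; the right cutoff is a judgment call.

```python
import pandas as pd

# Hypothetical sample rows standing in for imdb_title_basics
titles = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "start_year": [2015, 2015, 2016, 2016, 2115],
    "runtime_minutes": [100.0, 120.0, None, 95.0, 90.0],
    "genres": ["Action,Drama", "Drama", None, "Comedy", "Sci-Fi"],
})

# Assumed cutoff for filtering out implausible start years
MAX_PLAUSIBLE_YEAR = 2030
plausible = titles[titles["start_year"] <= MAX_PLAUSIBLE_YEAR].copy()

# Turn "Action,Drama" into ["Action", "Drama"]; missing genres stay NaN
plausible["genre_list"] = plausible["genres"].str.split(",")

# Average runtime per year (NaN runtimes are skipped by mean)
mean_runtime_by_year = plausible.groupby("start_year")["runtime_minutes"].mean()

# Number of titles per (year, genre); rows with missing genre are dropped by groupby
genre_counts_by_year = (
    plausible.explode("genre_list")
    .groupby(["start_year", "genre_list"])
    .size()
)

print(mean_runtime_by_year)
print(genre_counts_by_year)
```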

**title_crew**: List of titles and their directors and writers (references name_basic)
- Directors missing 5,727 entries and writers missing 35,883 entries
- basically useful if we want to recommend a list of directors/writers of top movies
- could also look at high # of director/writer and see what projects they worked on ie do it backwards

**title_principals**: List of movies and people, for actors lists the role, ordering = importance?
- job and character predictably have null values
- Useful to see what actor played what character ie) wanna only nab main cast + not supporting

**title_ratings**: Has average movie rating and number of votes
- Basically where to grab rating info - merge w/ title_basics?
- no null values! But doesn't contain same number of films as original list
    - probably because there are duplicate movies in original list potentially?
    - only look at movies with ratings??
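To explore the questions above, a left merge on `tconst` with `indicator=True` shows which titles have no rating, and a duplicate check on `tconst` tests the guess about repeated movies. This is a sketch on made-up rows, not the real tables.

```python
import pandas as pd

# Hypothetical sample rows standing in for imdb_title_basics and imdb_title_ratings
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primary_title": ["Film A", "Film B", "Film C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt3"],
    "averagerating": [7.1, 5.4],
    "numvotes": [1200, 85],
})

# Left merge keeps every title; _merge marks rows with no matching rating
merged = basics.merge(ratings, on="tconst", how="left", indicator=True)
unrated = merged[merged["_merge"] == "left_only"]

# Inner merge keeps only titles that have a rating
rated_only = basics.merge(ratings, on="tconst", how="inner")

# Count repeated ids to see whether duplicates explain the row-count gap
duplicate_ids = basics["tconst"].duplicated().sum()

print(unrated[["tconst", "primary_title"]])
print(len(rated_only), duplicate_ids)
```

If `duplicate_ids` turns out to be zero on the real data, the gap is more likely explained by titles that simply have no ratings than by duplicates.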

# Rotten Tomatoes Rundown

In [58]:
#rt_movie_info['genre'].value_counts()
#rt_reviews.info()

**movie_info**: general info on films as well as theater release/dvd_date/box office/studio
- missing info in every column but id, mostly in currency/box office and studio
- could use for rating info - separate into children/adult movies?
    - mostly R/NR movies, around 1/6 or so are PG and even less are G

**reviews**: contains fresh/rotten review, top critic + publisher as well as date of review
- rating, review, publisher, and critic all missing info
- rating would need to get converted to int
- who to reach out for to review film? or which publisher?

# TheMovieDatabase Rundown

In [69]:
#tmdb_movies['vote_average'].sort_values(ascending = False)

- no missing values! data actually seems pretty clean/well formatted
- There are duplicate movie entries though
- popularity ranking, vote avg and vote count could be useful popularity metrics?
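A quick sketch for finding and dropping the duplicate entries noted above. Treating rows with the same `title` and `release_date` as duplicates is an assumption; the real `tmdb_movies` table may call for a different key. The rows are made up.

```python
import pandas as pd

# Hypothetical sample rows standing in for tmdb_movies
movies = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B"],
    "release_date": ["2015-01-01", "2015-01-01", "2016-05-05"],
    "vote_average": [7.0, 7.0, 6.2],
})

# Assumed definition of a duplicate: same title and release date
key_cols = ["title", "release_date"]

# keep=False flags every copy, which is useful for inspecting them first
dup_mask = movies.duplicated(subset=key_cols, keep=False)

# Keep the first copy of each duplicated movie
deduped = movies.drop_duplicates(subset=key_cols, keep="first")

print(movies[dup_mask])
print(deduped)
```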

# The Numbers Rundown

In [68]:
#tn_movie_budgets['movie'].value_counts()

- also no missing data!
- need to convert budget and gross columns into ints though
- some movies also in here multiple times
- budgets alongside domestic and worldwide gross
    - does worldwide include domestic?
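A sketch of the int conversion, assuming the money columns are strings formatted like `"$1,234,567"` (an assumption to confirm against `tn_movie_budgets`). The rows are made up. The final comparison is only a rough sanity check on the worldwide-vs-domestic question: if worldwide is never smaller than domestic, that is consistent with worldwide including domestic, but it does not prove it.

```python
import pandas as pd

# Hypothetical sample rows standing in for tn_movie_budgets
budgets = pd.DataFrame({
    "movie": ["Film A", "Film B"],
    "production_budget": ["$425,000,000", "$1,500,000"],
    "domestic_gross": ["$760,507,625", "$0"],
    "worldwide_gross": ["$2,776,345,279", "$900,000"],
})

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
for col in money_cols:
    # Strip dollar signs and commas, then cast to 64-bit integers
    budgets[col] = (
        budgets[col].str.replace(r"[$,]", "", regex=True).astype("int64")
    )

# Rough check: is worldwide gross always at least the domestic gross?
worldwide_covers_domestic = bool(
    (budgets["worldwide_gross"] >= budgets["domestic_gross"]).all()
)

print(budgets.dtypes)
print(worldwide_covers_domestic)
```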

# Possible recommendations to work towards

1. Good genre/genre combos
    - also consider quantity per year grouped by studio
    - maybe look into both domestic + international market
    - so one rec if want corner domestic market and one rec for international?
2. Based on top ratings OR top grossing films - what cast/director/writer to angle for? Studio to work with?
3. Recommendation for top grossing + different for top rated?
    - that way could cater to if just want money or interested in like, artistic value
4. Could also separate out recommendations based on audience ie) PG vs R