# <p style="color:black"> EDA OF MOVIE RAW DATA

<figure>
    <img src="movie header.jpg"
         alt="Movie Studio"
         width="400"
         height="200">
    <figcaption><center><b>Microsoft's new movie studio!</b></center></figcaption>
</figure>

The purpose of this notebook is to perform exploratory data analysis of the movie studio information that was provided to our group for the Phase 1 Project. The data can be broken up into two types:

1. An extract from IMDB, which is presented in a .db database file, and
2. Various CSV style files from different movie analysis websites.

This notebook's analysis is structured as follows:

- Analysis of the database file
- Analysis of the high priority CSV files (mainly, `bom.movie_gross.csv.gz`)
- Analysis of the remaining CSV files
- Preliminary thoughts on combination of files, and
- Preliminary thoughts on the group project story

# <p style="color:black"> DATABASE SECTION


## <p style="color:black"> Database - Short summary of findings


In short, the database gives a lot of color and background for each movie_id. From the movie_id, we can determine genre, geography, language, and the people involved in the film by joining tables with SQL.

This database will likely become relevant once we can attach more numbers and analysis to the performance of the movies, which will likely come from the CSV files.

See the section on the CSV files for more analysis.

## <p style="color:black"> Analysis of the database file

To begin, we will import the SQLite package and connect to the database:

In [2]:
import sqlite3
import pandas as pd
conn = sqlite3.connect("Raw Data/im.db")
cur = conn.cursor()

Let's take a look at the names of all the tables, and compare it to the schema that was presented in the intro materials:

In [3]:
df = pd.read_sql("""
SELECT name as table_name
FROM sqlite_master
WHERE type = 'table';
""", conn)
df



<figure>
    <img src="movie_data_erd.jpeg"
         alt="Database Schema"
         width="600"
         height="300">
    <figcaption><center><b>Comparing the table names with this schema shows that the schema image is accurate and the database is loaded</b></center></figcaption>
</figure>
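Beyond the table names, the size of each table is a useful first check against the schema. The following is a minimal sketch on an in-memory toy database (the tables and rows are invented for illustration, and `toy_conn` is a hypothetical stand-in for `conn`); it lists every table from `sqlite_master` together with its row count.

```python
import sqlite3
import pandas as pd

# In-memory toy database; tables and rows are invented for illustration.
toy_conn = sqlite3.connect(":memory:")
toy_conn.executescript("""
CREATE TABLE persons (person_id TEXT);
CREATE TABLE writers (movie_id TEXT, person_id TEXT);
INSERT INTO persons VALUES ('nm001'), ('nm002');
INSERT INTO writers VALUES ('tt01', 'nm001');
""")

# Same sqlite_master query as above, sorted for a stable order.
tables = pd.read_sql(
    "SELECT name AS table_name FROM sqlite_master WHERE type = 'table' ORDER BY name;",
    toy_conn,
)

# Table names come from sqlite_master itself, so quoting them is enough here.
tables["row_count"] = [
    toy_conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    for name in tables["table_name"]
]
print(tables)
toy_conn.close()
```

Pointing the same two steps at `conn` instead of `toy_conn` should give the counts for the real database.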

### <p style="color:black"> Table: Persons

In [None]:
df_persons = pd.read_sql("""
SELECT *
FROM persons
""", conn)
df_persons.info()

In [None]:
df_persons.head(2)

Conclusion:
- The primary key of this data is likely person_id
- This table looks like a mostly complete list of persons, their names, and their professions
- primary_profession has a list of values embedded in a single field
- The most useful columns are likely person_id and primary_profession, as they will link certain tables together
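A list embedded in one column can be unpacked into one row per value. The sketch below uses a toy frame, not the real table, and assumes the professions are stored as a single comma-separated string; the rows and the `toy_persons` name are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df_persons; the rows are invented for illustration.
toy_persons = pd.DataFrame({
    "person_id": ["nm001", "nm002", "nm003"],
    "primary_profession": ["actor,producer", "writer", None],
})

# Assumption: professions are stored as one comma-separated string per person.
professions = (
    toy_persons
    .assign(profession=toy_persons["primary_profession"].str.split(","))
    .explode("profession")            # one row per (person, profession)
    .dropna(subset=["profession"])    # drop persons with no profession listed
    .loc[:, ["person_id", "profession"]]
    .reset_index(drop=True)
)
print(professions)
```

The long format makes it straightforward to count or filter by a single profession.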

### <p style="color:black"> Table: Principals

In [None]:
df_principals = pd.read_sql("""
SELECT *
FROM principals
""", conn)
df_principals.info()

In [None]:
df_principals.tail(5)

Conclusion:
- person_id identifies the person on each row, but because this table relates persons to movies, it is unlikely to be unique on its own and should be checked before it is treated as the primary key
- This table details which person is related to which movie
- characters lists the relevant characters for an actor
- The most useful columns are likely person_id, category, and movie_id, as they could provide links for data attributes
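Whether a column, or a combination of columns, can serve as a key is something we can test rather than guess. This is a minimal sketch with a hypothetical helper, `is_unique_key`, run on invented toy rows rather than the real table.

```python
import pandas as pd

def is_unique_key(df: pd.DataFrame, columns: list[str]) -> bool:
    """Return True if no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()

# Toy stand-in for df_principals; rows are invented for illustration.
toy_principals = pd.DataFrame({
    "movie_id": ["tt01", "tt01", "tt02"],
    "person_id": ["nm001", "nm002", "nm001"],
    "category": ["actor", "director", "actor"],
})

# person_id alone repeats in the toy data, so it is not a key there.
print(is_unique_key(toy_principals, ["person_id"]))
# The (movie_id, person_id) pair is unique in the toy data.
print(is_unique_key(toy_principals, ["movie_id", "person_id"]))
```

On the real data the same call would be `is_unique_key(df_principals, ["person_id"])`; the result there may differ from the toy example.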

### <p style="color:black"> Table: Known For, Directors, and Writers

Note: These tables appear similar in nature, in that they are connector tables or subsets of the Principals table. This section explores whether and how these tables can be combined.

In [None]:
df_known_for = pd.read_sql("""
SELECT *
FROM known_for
""", conn)
df_known_for.info()

In [None]:
df_known_for.head(2)

In [None]:
df_directors = pd.read_sql("""
SELECT *
FROM directors
""", conn)
df_directors.info()

In [None]:
df_directors.head(2)

In [None]:
df_writers = pd.read_sql("""
SELECT *
FROM writers
""", conn)
df_writers.info()

In [None]:
df_writers.head(2)

Conclusion:
- All three tables contain movie_id and person_id fields and are complete
- The principals and directors tables appear to have the same number of records (about 1.6m); writers has about a quarter of that
- The benefit of combining these tables is not immediately clear. Once we formulate our hypothesis, the value of linking movies to their associated people (writers, directors, etc.) may become more evident
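If linking movies to their people does turn out to be useful, a join through the connector table is one way to do it. The sketch below runs against an in-memory toy database, not `im.db`; the rows are invented, and `primary_title` and `primary_name` are assumed column names that should be checked against the `.info()` output above.

```python
import sqlite3
import pandas as pd

# In-memory toy database. Table names follow the notebook; `primary_title`
# and `primary_name` are assumed column names, and all rows are invented.
toy_conn = sqlite3.connect(":memory:")
toy_conn.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT);
CREATE TABLE persons (person_id TEXT, primary_name TEXT);
CREATE TABLE directors (movie_id TEXT, person_id TEXT);
INSERT INTO movie_basics VALUES ('tt01', 'Movie A'), ('tt02', 'Movie B');
INSERT INTO persons VALUES ('nm001', 'Person X'), ('nm002', 'Person Y');
INSERT INTO directors VALUES ('tt01', 'nm001'), ('tt01', 'nm001'), ('tt02', 'nm002');
""")

# DISTINCT guards against repeated (movie, person) pairs in the connector table.
df_movie_directors = pd.read_sql("""
SELECT DISTINCT mb.movie_id, mb.primary_title, p.primary_name AS director
FROM movie_basics AS mb
JOIN directors AS d ON d.movie_id = mb.movie_id
JOIN persons AS p ON p.person_id = d.person_id
ORDER BY mb.movie_id;
""", toy_conn)
print(df_movie_directors)
toy_conn.close()
```

Swapping `directors` for `writers` or `known_for` gives the same kind of link for the other connector tables.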

### <p style="color:black"> Table: Movie Basics

In [None]:
df_movie_basics = pd.read_sql("""
SELECT *
FROM movie_basics
""", conn)
df_movie_basics.info()

In [None]:
df_movie_basics.head(2)

Conclusion:
- Seems like a solid index for basic movie information
- Primary key is likely 'movie_id'
- genres holds a list of genres in a single field
- Mostly complete except for runtime and genres, which look about 90% complete
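Completeness can be read off `.info()`, but a per-column percentage is easier to compare across tables. This is a minimal sketch on invented toy rows; `runtime_minutes` is an assumed name for the runtime column.

```python
import pandas as pd

# Toy stand-in for df_movie_basics; rows are invented for illustration and
# `runtime_minutes` is an assumed column name.
toy_basics = pd.DataFrame({
    "movie_id": ["tt01", "tt02", "tt03", "tt04"],
    "runtime_minutes": [90.0, None, 120.0, 101.0],
    "genres": ["Drama", "Comedy,Drama", None, None],
})

# Share of non-null values per column, as a percentage.
completeness = toy_basics.notna().mean().mul(100).round(1)
print(completeness)
```

The same one-liner works on any of the dataframes loaded in this notebook.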

### <p style="color:black"> Table: Movie Ratings

In [None]:
df_movie_ratings = pd.read_sql("""
SELECT *
FROM movie_ratings
""", conn)
df_movie_ratings.info()

In [None]:
df_movie_ratings.head(2)

Conclusion:
- Contains rating information by movie ID
- Data set appears complete
- Could be useful for understanding a movie's reception versus its revenue

### <p style="color:black"> Table: Movie AKAs

In [None]:
df_movie_akas = pd.read_sql("""
SELECT *
FROM movie_akas
""", conn)
df_movie_akas.info()

In [None]:
df_movie_akas.tail(15)

Conclusion:
- Could be useful for understanding the region and language a particular movie was distributed in
- Mostly complete, but contains a lot of missing values in the language, attributes, and types fields

# <p style="color:black"> CSV SECTION

## <p style="color:black"> Short Summary of CSV Files

The CSV files will take some work to combine, but they contain important pieces of information we can use for our story. In my view, the following data points are important; the file each one can come from is shown in parentheses.

For each movie:
- The movie ID (tn.movie_budgets.csv.gz)
- The title (tn.movie_budgets.csv.gz)
- The studio (bom.movie_gross.csv.gz)
- The domestic gross (bom.movie_gross.csv.gz or tn.movie_budgets.csv.gz)
- The international gross (bom.movie_gross.csv.gz or tn.movie_budgets.csv.gz)
- The year it came out (bom.movie_gross.csv.gz)
- The genre (rt.movie_info.tsv.gz)
- The director (rt.movie_info.tsv.gz)
- The writer (rt.movie_info.tsv.gz)

For each studio:
- Movies published by that studio (bom.movie_gross.csv.gz)
- Domestic and international gross of that studio over time (bom.movie_gross.csv.gz)
 - Possibly also which genres each studio excels in (rt.movie_info.tsv.gz)




## <p style="color:black"> Analysis of CSV files

The CSV data is made up of five files:
- bom.movie_gross.csv.gz
- rt.movie_info.tsv.gz
- rt.reviews.tsv.gz
- tmdb.movies.csv.gz
- tn.movie_budgets.csv.gz

## <p style="color:black"> CSV File: bom.movie_gross.csv.gz


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df_movie_gross = pd.read_csv("Raw Data/bom.movie_gross.csv.gz")

In [None]:
df_movie_gross.info()

In [None]:
df_movie_gross.head()

In [None]:
df_movie_gross.pivot_table(index='studio', columns='year', values='domestic_gross', aggfunc='sum')

Conclusion:
- This will be a very important data set, as it contains the domestic and foreign gross for each movie
- This data is not complete, and it has no movie_id that could link each movie back to the IMDB movie database
- Additionally, we do not know which currency the foreign gross is denominated in
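Without a movie_id, one possible way to link this file to the IMDB data is to match on title and year. The following is a rough sketch under stated assumptions, not a tested linking strategy: the rows are invented, `primary_title` and `start_year` are assumed names for the IMDB title and release-year columns, and `normalize_title` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-ins; rows are invented. `primary_title` and `start_year` are
# assumed names for the IMDB title and release-year columns.
toy_gross = pd.DataFrame({
    "title": ["Movie A", "movie b ", "Movie C"],
    "year": [2010, 2011, 2012],
    "domestic_gross": [1_000_000.0, 2_500_000.0, 300_000.0],
})
toy_basics = pd.DataFrame({
    "movie_id": ["tt01", "tt02"],
    "primary_title": ["Movie A", "Movie B"],
    "start_year": [2010, 2011],
})

def normalize_title(titles: pd.Series) -> pd.Series:
    """Lower-case and trim titles so trivial differences do not block a match."""
    return titles.str.strip().str.lower()

left = toy_gross.assign(title_key=normalize_title(toy_gross["title"]))
right = toy_basics.assign(
    title_key=normalize_title(toy_basics["primary_title"]),
    year=toy_basics["start_year"],
).loc[:, ["movie_id", "title_key", "year"]]

# A left merge keeps every gross record; `indicator=True` adds a `_merge`
# column showing which rows found an IMDB match.
linked = left.merge(right, on=["title_key", "year"], how="left", indicator=True)
print(linked[["title", "year", "movie_id", "_merge"]])
```

Rows marked `left_only` are the ones that would need a closer look, for example remakes, retitled releases, or release years that differ between sources.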

## <p style="color:black"> CSV File: rt.movie_info.tsv.gz

In [None]:
df_movie_info = pd.read_csv("Raw Data/rt.movie_info.tsv.gz", sep='\t')

In [None]:
df_movie_info.info()

In [None]:
df_movie_info.head()

Conclusion:
- This data set is hard to work with as provided
- There is no movie title, so we will likely need to combine the director, year, and studio of each record to link it to the other data tables
- It is unclear what value this information adds for our hypothesis

## <p style="color:black"> CSV File: rt.reviews.tsv.gz

In [None]:
df_movie_reviews = pd.read_csv("Raw Data/rt.reviews.tsv.gz", sep='\t', encoding='latin1')

In [None]:
df_movie_reviews.info()

In [None]:
df_movie_reviews.tail()

Conclusion:
- Again, this data is pretty rough. There is no good way to identify the movie title or the studio it is affiliated with
- It does give a rating for each movie, but identifying which movie a rating relates to will be tough


## <p style="color:black"> CSV File: tmdb.movies.csv.gz

In [None]:
df_movie_db2 = pd.read_csv("Raw Data/tmdb.movies.csv.gz")

In [None]:
df_movie_db2.info()

In [None]:
df_movie_db2.head()

Conclusion:
- Complete list of data for movies. It has some type of ID, but it is unclear what kind of ID this is
- Contains popularity data, possibly some kind of user-generated reviews


## <p style="color:black"> CSV File: tn.movie_budgets.csv.gz

In [None]:
df_budgets = pd.read_csv("Raw Data/tn.movie_budgets.csv.gz")

In [None]:
df_budgets['id'].value_counts()

In [None]:
df_budgets.loc[df_budgets['id'] == 65]

Conclusion:
- This might be the ID we need to link an ID to a title, which could in turn link us to the rest of the data
- We should compare the domestic_gross reported here with the domestic_gross in `bom.movie_gross.csv.gz` to see how they differ
- This data set is likely to be very valuable for linking the different CSV files together
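The comparison of the two domestic_gross figures could be sketched as follows. This is illustrative only: the rows are invented, the `movie` title column and the `$1,234,567` string format are assumptions about `tn.movie_budgets.csv.gz`, and `parse_money` is a hypothetical helper.

```python
import pandas as pd

def parse_money(values: pd.Series) -> pd.Series:
    """Turn strings such as '$1,234,567' into floats; unparseable values become NaN."""
    cleaned = values.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Toy stand-ins; rows are invented for illustration.
toy_budgets = pd.DataFrame({
    "movie": ["Movie A", "Movie B"],
    "domestic_gross": ["$1,000,000", "$2,400,000"],  # assumed string format
})
toy_bom = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "domestic_gross": [1_000_000.0, 2_500_000.0],
})

# Parse the budget file's gross, then line it up with the other file by title.
comparison = toy_budgets.assign(
    tn_gross=parse_money(toy_budgets["domestic_gross"])
).merge(
    toy_bom.rename(columns={"title": "movie", "domestic_gross": "bom_gross"}),
    on="movie",
    how="inner",
)
comparison["difference"] = comparison["tn_gross"] - comparison["bom_gross"]
print(comparison[["movie", "tn_gross", "bom_gross", "difference"]])
```

A `difference` column that is mostly zero would suggest the two sources agree and either can be used.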