# <p style="color:black"> EDA OF MOVIE RAW DATA

<figure>
    <img src="movie header.jpg"
         alt="Movie Studio"
         width="400"
         height="200">
    <figcaption><center><bold>Microsoft's new movie studio!</figcaption>
</figure>

The purpose of this notebook is to perform exploratory data anlysis of the movie studio information that was provided to our group for the Phase 1 Project. When looking at the data, it can be broken up into two types:

1. An extract from IMDB, which is presented in a .db database file, and
2. Various CSV style files from different movie analysis websites.

This notebook's analysis will be structured as such:

- Analysis of the database file
- Analysis of the high priority CSV files (mainly, `bom.movie_gross.csv.gz`)
- Analysis of the remaining CSV files
- Preliminary thoughts on combination of files, and
- Preliminary thoughts on the group project story

# <p style="color:black"> DATABASE SECTION


## <p style="color:black"> Database - Short summary of findings


In short, the database information gives a lot of color and background to each of the movie_id's. From the movie_id, we can determine genre, geography, language, and the people who are involved in the film through use of SQL combination scripts.

This database will likely become relevant once we are able to put more numbers and analysis to the performance of the movies, which will likely come from the csv files.

See the section on the CSV files for more analysis

## <p style="color:black"> Analysis of the database file

To begin, we will import the SQL lite database package and import the database

In [214]:
import sqlite3
import pandas as pd
import numpy as np
conn = sqlite3.connect("Raw Data/im.db")
cur = conn.cursor()

Let's take a look at the names of all the tables, and compare it to the schema that was presented in the intro materials:

In [215]:
df = pd.read_sql("""
SELECT name as table_name
FROM sqlite_master
WHERE type = 'table';
""", conn)
df

Unnamed: 0,table_name
0,movie_basics
1,directors
2,known_for
3,movie_akas
4,movie_ratings
5,persons
6,principals
7,writers


In [216]:
df.head()

Unnamed: 0,table_name
0,movie_basics
1,directors
2,known_for
3,movie_akas
4,movie_ratings


<figure>
    <img src="movie_data_erd.jpeg"
         alt="Database Schema"
         width="600"
         height="300">
    <figcaption><center><bold>It is clear from analyzing the tables that the schema jpg is accurate and the database is loaded</figcaption>
</figure>

### <p style="color:black"> Table: Persons

In [217]:
df_persons = pd.read_sql("""
SELECT *
FROM persons
""", conn)
df_persons.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 606648 entries, 0 to 606647
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   person_id           606648 non-null  object 
 1   primary_name        606648 non-null  object 
 2   birth_year          82736 non-null   float64
 3   death_year          6783 non-null    float64
 4   primary_profession  555308 non-null  object 
dtypes: float64(2), object(3)
memory usage: 23.1+ MB


In [218]:
df_persons.head(5)

Unnamed: 0,person_id,primary_name,birth_year,death_year,primary_profession
0,nm0061671,Mary Ellen Bauder,,,"miscellaneous,production_manager,producer"
1,nm0061865,Joseph Bauer,,,"composer,music_department,sound_department"
2,nm0062070,Bruce Baum,,,"miscellaneous,actor,writer"
3,nm0062195,Axel Baumann,,,"camera_department,cinematographer,art_department"
4,nm0062798,Pete Baxter,,,"production_designer,art_department,set_decorator"


In [219]:
df_persons_IDname = df_persons.filter(['person_id', 'primary_name'], axis=1)
df_persons_IDname.head(10)

Unnamed: 0,person_id,primary_name
0,nm0061671,Mary Ellen Bauder
1,nm0061865,Joseph Bauer
2,nm0062070,Bruce Baum
3,nm0062195,Axel Baumann
4,nm0062798,Pete Baxter
5,nm0062879,Ruel S. Bayani
6,nm0063198,Bayou
7,nm0063432,Stevie Be-Zet
8,nm0063618,Jeff Beal
9,nm0063750,Lindsay Beamish


Conclusion:
- The primary key of this data is likely person_id
- This table looks like a mostly complete list of persons, names, and their professions
- Primary professions has an embedded list of data inside of it
- The most useful columns for this are likely person_id and primary_profession, as they will link certain tables together

### <p style="color:black"> Table: Principals

In [220]:
df_principals = pd.read_sql("""
SELECT *
FROM principals
""", conn)
df_principals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   movie_id    1028186 non-null  object
 1   ordering    1028186 non-null  int64 
 2   person_id   1028186 non-null  object
 3   category    1028186 non-null  object
 4   job         177684 non-null   object
 5   characters  393360 non-null   object
dtypes: int64(1), object(5)
memory usage: 47.1+ MB


In [221]:
df_principals.tail(5)

Unnamed: 0,movie_id,ordering,person_id,category,job,characters
1028181,tt9692684,1,nm0186469,actor,,"[""Ebenezer Scrooge""]"
1028182,tt9692684,2,nm4929530,self,,"[""Herself"",""Regan""]"
1028183,tt9692684,3,nm10441594,director,,
1028184,tt9692684,4,nm6009913,writer,writer,
1028185,tt9692684,5,nm10441595,producer,producer,


In [222]:
df_principals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   movie_id    1028186 non-null  object
 1   ordering    1028186 non-null  int64 
 2   person_id   1028186 non-null  object
 3   category    1028186 non-null  object
 4   job         177684 non-null   object
 5   characters  393360 non-null   object
dtypes: int64(1), object(5)
memory usage: 47.1+ MB


In [223]:
df_principals['category'].value_counts()

actor                  256718
director               146393
actress                146208
producer               113724
cinematographer         80091
composer                77063
writer                  74357
self                    65424
editor                  55512
production_designer      9373
archive_footage          3307
archive_sound              16
Name: category, dtype: int64

In [224]:
df_p_dirmovieID = df_principals.filter(['person_id', 'movie_id', 'category'], axis=1)

In [225]:
df_p_dirmovieID.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028186 entries, 0 to 1028185
Data columns (total 3 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   person_id  1028186 non-null  object
 1   movie_id   1028186 non-null  object
 2   category   1028186 non-null  object
dtypes: object(3)
memory usage: 23.5+ MB


In [226]:
df_directors_IDs = df_p_dirmovieID.loc[df_p_dirmovieID['category'].isin(['director'])]

In [227]:
df_directors_IDs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 146393 entries, 1 to 1028183
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   person_id  146393 non-null  object
 1   movie_id   146393 non-null  object
 2   category   146393 non-null  object
dtypes: object(3)
memory usage: 4.5+ MB


In [228]:
df_directors_IDs.head(12)

Unnamed: 0,person_id,movie_id,category
1,nm0398271,tt0111414,director
8,nm0362736,tt0323808,director
18,nm1145057,tt0417610,director
28,nm0707738,tt0469152,director
35,nm0776090,tt0473032,director
41,nm0001053,tt0475290,director
42,nm0001054,tt0475290,director
51,nm0197636,tt0477302,director
61,nm0007082,tt0780548,director
71,nm1275939,tt0879405,director


In [229]:
df_directors_id = df_directors_IDs.merge(df_persons_IDname, on='person_id', how='inner')
df_directors_id.head(7)

Unnamed: 0,person_id,movie_id,category,primary_name
0,nm0398271,tt0111414,director,Frank Howson
1,nm0398271,tt5573596,director,Frank Howson
2,nm0362736,tt0323808,director,Robin Hardy
3,nm1145057,tt0417610,director,Alejandro Chomski
4,nm1145057,tt5291716,director,Alejandro Chomski
5,nm1145057,tt4551544,director,Alejandro Chomski
6,nm0707738,tt0469152,director,Alyssa R. Bennett


Conclusion:
- The primary key of this table appears to be person_id
- This table goes into detail on which person is related to which movie
- Characters is a list of all relevant characters for an actor
- The most useful columns for this are likely person_id, category, and movie_id, as they could provide links for data attributes

### <p style="color:black"> Table: Known For, Directors, and Writers

Note: These tables appear similar in nature in that they are connector tables or subsets of the Principals table. This section will explore how/if these tables can be combined

In [230]:
df_principals = pd.read_sql("""
SELECT *
FROM known_for
""", conn)
df_principals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1638260 entries, 0 to 1638259
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   person_id  1638260 non-null  object
 1   movie_id   1638260 non-null  object
dtypes: object(2)
memory usage: 25.0+ MB


In [231]:
df_principals.head(2)

Unnamed: 0,person_id,movie_id
0,nm0061671,tt0837562
1,nm0061671,tt2398241


In [232]:
df_directors = pd.read_sql("""
SELECT *
FROM directors
""", conn)
df_principals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1638260 entries, 0 to 1638259
Data columns (total 2 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   person_id  1638260 non-null  object
 1   movie_id   1638260 non-null  object
dtypes: object(2)
memory usage: 25.0+ MB


In [233]:
df_directors.head(2)

Unnamed: 0,movie_id,person_id
0,tt0285252,nm0899854
1,tt0462036,nm1940585


In [234]:
df_writers = pd.read_sql("""
SELECT *
FROM writers
""", conn)
df_writers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255873 entries, 0 to 255872
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   movie_id   255873 non-null  object
 1   person_id  255873 non-null  object
dtypes: object(2)
memory usage: 3.9+ MB


In [235]:
df_writers.head(2)

Unnamed: 0,movie_id,person_id
0,tt0285252,nm0899854
1,tt0438973,nm0175726


Conclusion:
- All tables contain movie_id and person_id fields and are complete
- Principals and directors tables have the same numer of records (1.6m), Writers has about a fourth of that
- It is not immediately clear the benefit of combining these together, however when we formulate our hypothesis, perhaps it will become more evident of the value of linking movies and their associated people (writers, directors, etc)

### <p style="color:black"> Table: Movie Basics

In [236]:
df_movie_basics = pd.read_sql("""
SELECT *
FROM movie_basics
""", conn)
df_movie_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [237]:
df_movie_basics.head(4)

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"


In [238]:
df_mbasics = df_movie_basics.filter(['movie_id', 'primary_title', 'genres'], axis=1)
df_mbasics.head()

Unnamed: 0,movie_id,primary_title,genres
0,tt0063540,Sunghursh,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,Drama
3,tt0069204,Sabse Bada Sukh,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,"Comedy,Drama,Fantasy"


Conclusion:
- Seems like a solid index for basic movie information
- Primary key is likely 'movie_id'
- Genres contains a list of genres
- Mostly complete except for runtime and genres, which look about 90% complete

### <p style="color:black"> Table: Movie Ratings

In [239]:
df_movie_ratings = pd.read_sql("""
SELECT *
FROM movie_ratings
""", conn)
df_movie_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [240]:
df_movie_ratings.head(10)

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21
5,tt1069246,6.2,326
6,tt1094666,7.0,1613
7,tt1130982,6.4,571
8,tt1156528,7.2,265
9,tt1161457,4.2,148


In [241]:
df_movie_ratings['averagerating'] = df_movie_ratings['averagerating']
df_high_rtg = df_movie_ratings[df_movie_ratings['averagerating'] >= 7.0]
df_high_rtgs = df_high_rtg[df_high_rtg['numvotes'] >= 5000]
df_high_rtgs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1322 entries, 12 to 73727
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       1322 non-null   object 
 1   averagerating  1322 non-null   float64
 2   numvotes       1322 non-null   int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 41.3+ KB


In [242]:
df_high_rtgs.head()

Unnamed: 0,movie_id,averagerating,numvotes
12,tt1181840,7.0,5494
16,tt1210166,7.6,326657
19,tt1229238,7.4,428142
20,tt1232829,7.2,477771
59,tt1403981,7.1,129443


In [243]:
df_title_rtg = df_mbasics.merge(df_high_rtgs, on='movie_id', how='inner')
df_title_rtg.head(2)

Unnamed: 0,movie_id,primary_title,genres,averagerating,numvotes
0,tt0315642,Wazir,"Action,Crime,Drama",7.1,15378
1,tt0359950,The Secret Life of Walter Mitty,"Adventure,Comedy,Drama",7.3,275300


In [244]:
df_title_rtg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1322 entries, 0 to 1321
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       1322 non-null   object 
 1   primary_title  1322 non-null   object 
 2   genres         1322 non-null   object 
 3   averagerating  1322 non-null   float64
 4   numvotes       1322 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 62.0+ KB


In [245]:
df_movies_info = df_title_rtg.merge(df_directors_id, on='movie_id', how='inner')
df_movies_info.head()

Unnamed: 0,movie_id,primary_title,genres,averagerating,numvotes,person_id,category,primary_name
0,tt0315642,Wazir,"Action,Crime,Drama",7.1,15378,nm2349060,director,Bejoy Nambiar
1,tt0369610,Jurassic World,"Action,Adventure,Sci-Fi",7.0,539338,nm1119880,director,Colin Trevorrow
2,tt0398286,Tangled,"Adventure,Animation,Comedy",7.8,366366,nm1977355,director,Nathan Greno
3,tt0398286,Tangled,"Adventure,Animation,Comedy",7.8,366366,nm0397174,director,Byron Howard
4,tt0433035,Real Steel,"Action,Drama,Family",7.1,283534,nm0506613,director,Shawn Levy


In [246]:
df_movies_info.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1393 entries, 0 to 1392
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       1393 non-null   object 
 1   primary_title  1393 non-null   object 
 2   genres         1393 non-null   object 
 3   averagerating  1393 non-null   float64
 4   numvotes       1393 non-null   int64  
 5   person_id      1393 non-null   object 
 6   category       1393 non-null   object 
 7   primary_name   1393 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 97.9+ KB


In [247]:
df_movies_info.value_counts(['genres'])

genres                     
Drama                          114
Comedy,Drama                    80
Drama,Romance                   50
Adventure,Animation,Comedy      48
Comedy,Drama,Romance            47
                              ... 
Documentary,Drama,News           1
Animation,Crime,Documentary      1
Documentary,Drama,History        1
Animation,Drama                  1
Romance,Sci-Fi,Thriller          1
Length: 221, dtype: int64

In [248]:
df_movies_info.value_counts(['averagerating'])

averagerating
7.2              174
7.1              167
7.0              140
7.4              130
7.3              128
7.5              108
7.6               97
7.7               88
7.8               85
8.1               52
8.0               51
7.9               48
8.2               34
8.3               32
8.5               15
8.4               13
8.8                9
8.7                7
8.6                4
9.3                3
9.4                3
9.5                2
8.9                1
9.2                1
9.7                1
dtype: int64

Conclusion:
- Contains rating information by movie ID
- Data set appears complete
- Could be useful for understanding a movie's reception vs it's revenue

### <p style="color:black"> Table: Movie AKA's

In [249]:
df_movie_akas = pd.read_sql("""
SELECT *
FROM movie_akas
""", conn)
df_movie_akas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331703 entries, 0 to 331702
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   movie_id           331703 non-null  object 
 1   ordering           331703 non-null  int64  
 2   title              331703 non-null  object 
 3   region             278410 non-null  object 
 4   language           41715 non-null   object 
 5   types              168447 non-null  object 
 6   attributes         14925 non-null   object 
 7   is_original_title  331678 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 20.2+ MB


In [250]:
df_movie_akas.tail(15)

Unnamed: 0,movie_id,ordering,title,region,language,types,attributes,is_original_title
331688,tt9705860,3,Dusan Vukotic hrvatski okarovac,HR,,,,0.0
331689,tt9723084,1,Anderswo. Allein in Afrika,DE,,imdbDisplay,,0.0
331690,tt9723084,2,Anderswo. Allein in Afrika,,,original,,1.0
331691,tt9726638,1,Qi Tian Da Sheng 2,CN,yue,imdbDisplay,,0.0
331692,tt9726638,2,Monkey King: The Volcano,,,original,,1.0
331693,tt9726638,3,Qi Tian Da Sheng Huo Yan Shan,CN,yue,imdbDisplay,,0.0
331694,tt9755806,1,Big Shark,US,,,,0.0
331695,tt9755806,2,Большая Акула,RU,,,,0.0
331696,tt9755806,3,Big Shark,,,original,,1.0
331697,tt9827784,1,Sayonara kuchibiru,JP,,,,0.0


Conclusion:
- Could be useful for understanding the region and language a particular movie was distributed in
- Mostly complete, but contains a lot of missing values in the language, attributes, and types fields

# <p style="color:black"> CSV SECTION

## <p style="color:black"> Short Summary of CSV Files

The CSV files will take some work to combine, but at the end of the day, they will contain important pieces of information we can use for our story. In my mind, the following data points are important and we can get them from the following (files)

For each movie:
- The movie ID (tn.movie_budgets.csv.gz)
- The title (movie_budgets.csv.gz)
- The studio (bom.movie_gross.csv.gz)
- The domestic gross (bom.movie_gross.csv.gz -or- movie_budgets.csv.gz)
- The international gross (bom.movie_gross.csv.gz -or- movie_budgets.csv.gz)
- The year it came out (bom.movie_gross.csv.gz)
- The genre (movie_info.tsv.gz)
- The director (movie_info.tsv.gz)
- The writer (rt.movie_info.tsv.gz)

For each studio:
- Movies published by that studio (bom.movie_gross.csv.gz)
- Domestic and international gross of that studio over time (rt.movie_info.tsv.gz)
 - Maybe we can throw in what kind of genres each studio excel in 




## <p style="color:black"> Analysis of CSV files

The CSV files are made up of the five files
- Bom.Movie_gross.csv.gz
- rt.movie_info_tsv.gz
- rt.reviews.tsv.gz
- tmdb.movies.csv.gz
- tn.movie_budgets.csv.gz

## <p style="color:black"> CSV File: bom.movie_gross.csv.gz


In [251]:
df_movie_gross = pd.read_csv("Raw Data/bom.movie_gross.csv.gz")
import seaborn as sns
import matplotlib.pyplot as plt 

In [252]:
df_movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [253]:
df_movie_gross.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [254]:
df_movie_gross.pivot_table(index='studio', columns='year', values='domestic_gross', aggfunc='sum')

year,2010,2011,2012,2013,2014,2015,2016,2017,2018
studio,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
3D,6100000.0,,,,,,,,
A23,,151000.0,,13200.0,,,,,
A24,,,,27845400.0,18988300.0,59675100.0,76431700.0,95405700.0,45848000.0
ADC,,,,,,228000.0,,20200.0,
AF,,,1155000.0,76900.0,558000.0,353000.0,,,
...,...,...,...,...,...,...,...,...,...
XL,,,,117000.0,341000.0,,,,
YFG,,,,,,,1100000.0,,
Yash,43800.0,496000.0,5352600.0,8000000.0,,2379000.0,8500000.0,6530000.0,330000.0
Zee,,,,,,,1100000.0,,


Conclusion:
- This will be a very important data set, as it contains the domestic and foreign gross for each movie
- This data is not complete, and does not have a movie_id for movies which could link it back to the IMDB movie database
- Additionally, we do not know what currency the foreign currency is denominated in

## <p style="color:black"> CSV File: rt.movie_info.tsv.gz

In [255]:
df_movie_info = pd.read_csv("Raw Data/rt.movie_info.tsv.gz", sep='\t')

In [256]:
df_movie_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


In [257]:
df_movie_info.head()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


Conclusion:
- Man, this data sucks
- There is no movie title! More likely, we will need to pull together the director, year, and studio of each of these in order to link it to other data tables
- Unclear what kind of value add information this creates for our hypothesis

## <p style="color:black"> CSV File: rt.reviews.tsv.gz

In [258]:
df_movie_reviews = pd.read_csv("Raw Data/rt.reviews.tsv.gz", sep='\t', encoding='latin1')

In [259]:
df_movie_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [260]:
df_movie_reviews.tail()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"
54431,2000,,3/5,fresh,Nicolas Lacroix,0,Showbizz.net,"November 12, 2002"


Conclusion:
- Again, this data is pretty rough. There isn't a good way to identify what the movie title is, nor what studio it is affiliated with.
- It does give a rating for each movie, but identifying what movie it related to will be tough


## <p style="color:black"> CSV File: tmdb.movies.csv.gz

In [261]:
df_movie_db2 = pd.read_csv("Raw Data/tmdb.movies.csv.gz")

In [262]:
df_movie_db2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [263]:
df_movie_db2.head()

Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


Conclusion:
- Complete list of data for movies. Has some type of ID, but unclear what kind of ID this is
- Contains popularity data, likely some kind of user generated reviews, possibily?


## <p style="color:black"> CSV File: tn.movie_budgets.csv.gz

In [264]:
df_budgets = pd.read_csv("Raw Data/tn.movie_budgets.csv.gz")

In [265]:
df_budgets.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [266]:
##
##Financial Data CSV

In [267]:
df_financial_data = pd.read_csv('financial_data.csv')

In [268]:
df_financial_data.head()

Unnamed: 0.1,Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross,year_x,profit,profit_margin,...,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,year_y
0,0,1,2009-12-18,Avatar,425000000.0,760507625.0,2776345000.0,2009,2351345000.0,84.692106,...,tt0499549,movie,Avatar,Avatar,0.0,2009-01-01 00:00:00,\N,162.0,"Action,Adventure,Fantasy",2009.0
1,1,2,2011-05-20,Pirates of the Caribbean On Stranger Tides,410600000.0,241063875.0,1045664000.0,2011,635063900.0,60.73308,...,tt1298650,movie,Pirates of the Caribbean On Stranger Tides,Pirates of the Caribbean: On Stranger Tides,0.0,2011-01-01 00:00:00,\N,137.0,"Action,Adventure,Fantasy",2011.0
2,2,3,2019-06-07,Dark Phoenix,350000000.0,42762350.0,149762400.0,2019,-200237600.0,-133.703598,...,,,,,,,,,,
3,3,4,2015-05-01,Avengers Age of Ultron,330600000.0,459005868.0,1403014000.0,2015,1072414000.0,76.436443,...,tt2395427,movie,Avengers Age of Ultron,Avengers: Age of Ultron,0.0,2015-01-01 00:00:00,\N,141.0,"Action,Adventure,Sci-Fi",2015.0
4,4,5,2017-12-15,Star Wars Ep VIII The Last Jedi,317000000.0,620181382.0,1316722000.0,2017,999721700.0,75.925058,...,,,,,,,,,,


In [269]:
df_financial_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3889 entries, 0 to 3888
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         3889 non-null   int64  
 1   id                 3889 non-null   int64  
 2   release_date       3889 non-null   object 
 3   movie              3889 non-null   object 
 4   production_budget  3889 non-null   float64
 5   domestic_gross     3889 non-null   float64
 6   worldwide_gross    3889 non-null   float64
 7   year_x             3889 non-null   int64  
 8   profit             3889 non-null   float64
 9   profit_margin      3889 non-null   float64
 10  movie_and_year     3889 non-null   object 
 11  tconst             2870 non-null   object 
 12  titleType          2870 non-null   object 
 13  primaryTitle       2870 non-null   object 
 14  originalTitle      2870 non-null   object 
 15  isAdult            2870 non-null   float64
 16  startYear          2870 

In [270]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import string
punct = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{}~'

In [271]:
transtab = str.maketrans(dict.fromkeys(punct, ''))

df_movies_info['primary_title'] = '|'.join(df_movies_info['primary_title'].tolist()).translate(transtab).split('|')

In [272]:
df_movies_info.head(5)

Unnamed: 0,movie_id,primary_title,genres,averagerating,numvotes,person_id,category,primary_name
0,tt0315642,Wazir,"Action,Crime,Drama",7.1,15378,nm2349060,director,Bejoy Nambiar
1,tt0369610,Jurassic World,"Action,Adventure,Sci-Fi",7.0,539338,nm1119880,director,Colin Trevorrow
2,tt0398286,Tangled,"Adventure,Animation,Comedy",7.8,366366,nm1977355,director,Nathan Greno
3,tt0398286,Tangled,"Adventure,Animation,Comedy",7.8,366366,nm0397174,director,Byron Howard
4,tt0433035,Real Steel,"Action,Drama,Family",7.1,283534,nm0506613,director,Shawn Levy


In [274]:
df_financial_data = df_financial_data.dropna()

In [275]:
df_financial_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2870 entries, 0 to 3888
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         2870 non-null   int64  
 1   id                 2870 non-null   int64  
 2   release_date       2870 non-null   object 
 3   movie              2870 non-null   object 
 4   production_budget  2870 non-null   float64
 5   domestic_gross     2870 non-null   float64
 6   worldwide_gross    2870 non-null   float64
 7   year_x             2870 non-null   int64  
 8   profit             2870 non-null   float64
 9   profit_margin      2870 non-null   float64
 10  movie_and_year     2870 non-null   object 
 11  tconst             2870 non-null   object 
 12  titleType          2870 non-null   object 
 13  primaryTitle       2870 non-null   object 
 14  originalTitle      2870 non-null   object 
 15  isAdult            2870 non-null   float64
 16  startYear          2870 

In [276]:
df_financial_data = df_financial_data.drop(['Unnamed: 0', 'id', 'startYear', 'endYear', 'year_y', 'originalTitle'], axis=1)

In [277]:
df_financial_data = df_financial_data.rename(columns={'tconst': 'movie_id', 'primaryTitle': 'primary_title'})

In [278]:
df_mf_combo = df_movies_info.merge(df_financial_data, on=['movie_id', 'primary_title', 'genres'], how='inner')

In [279]:
df_mf_combo.head()

Unnamed: 0,movie_id,primary_title,genres,averagerating,numvotes,person_id,category,primary_name,release_date,movie,production_budget,domestic_gross,worldwide_gross,year_x,profit,profit_margin,movie_and_year,titleType,isAdult,runtimeMinutes
0,tt0369610,Jurassic World,"Action,Adventure,Sci-Fi",7.0,539338,nm1119880,director,Colin Trevorrow,2015-06-12,Jurassic World,215000000.0,652270625.0,1648855000.0,2015,1433855000.0,86.960647,Jurassic World - 2015,movie,0.0,124
1,tt0398286,Tangled,"Adventure,Animation,Comedy",7.8,366366,nm1977355,director,Nathan Greno,2010-11-24,Tangled,260000000.0,200821936.0,586477200.0,2010,326477200.0,55.667504,Tangled - 2010,movie,0.0,100
2,tt0398286,Tangled,"Adventure,Animation,Comedy",7.8,366366,nm0397174,director,Byron Howard,2010-11-24,Tangled,260000000.0,200821936.0,586477200.0,2010,326477200.0,55.667504,Tangled - 2010,movie,0.0,100
3,tt0435761,Toy Story 3,"Adventure,Animation,Comedy",8.3,682218,nm0881279,director,Lee Unkrich,2010-06-18,Toy Story 3,200000000.0,415004880.0,1068880000.0,2010,868879500.0,81.288817,Toy Story 3 - 2010,movie,0.0,103
4,tt0437086,Alita Battle Angel,"Action,Adventure,Sci-Fi",7.5,88207,nm0001675,director,Robert Rodriguez,2019-02-14,Alita Battle Angel,170000000.0,85710210.0,402976000.0,2019,232976000.0,57.813869,Alita Battle Angel - 2019,movie,0.0,122


In [211]:
df_mf_combo.value_counts(['genres'])

genres                    
Adventure,Animation,Comedy    33
Action,Adventure,Sci-Fi       30
Comedy,Drama                  16
Action,Adventure,Animation    16
Comedy,Drama,Romance          16
                              ..
Mystery,Thriller               1
Biography,Drama,Music          1
Biography,Comedy,Crime         1
Adventure,Mystery,Sci-Fi       1
Biography,Drama,Musical        1
Length: 95, dtype: int64

In [212]:
df_mf_combo.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 338 entries, 0 to 337
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   movie_id           338 non-null    object 
 1   primary_title      338 non-null    object 
 2   genres             338 non-null    object 
 3   averagerating      338 non-null    float64
 4   numvotes           338 non-null    int64  
 5   person_id          338 non-null    object 
 6   category           338 non-null    object 
 7   primary_name       338 non-null    object 
 8   release_date       338 non-null    object 
 9   movie              338 non-null    object 
 10  production_budget  338 non-null    float64
 11  domestic_gross     338 non-null    float64
 12  worldwide_gross    338 non-null    float64
 13  year_x             338 non-null    int64  
 14  profit             338 non-null    float64
 15  profit_margin      338 non-null    float64
 16  movie_and_year     338 non

Conclusion:
- Would ya look at that, this might be the ID we need to link an ID to a title, which could link us to the rest of the data
- We should probably look and see what the difference is between the domestic_gross represented here versus domestic_gross in "bom.movie_gross.csv.gz"
- This data set likely is going to be very valuable in linking together the different csv files