# CSV File Analysis and SQL Schema
This notebook checks the quality of the downloaded data and determines whether any cleaning or transformations are needed. I will also start exploring what the SQL schema will look like.

In [1]:
from pathlib import Path 
import pandas as pd

# ROOT is the parent of the working directory, so this assumes the notebook
# runs from a folder directly under the project root
ROOT = Path().resolve().parent
DATA_DIR = ROOT / "data"
MOVIES_CSV = DATA_DIR / "movies.csv"
MOVIE_DETAILS_CSV = DATA_DIR / "movie_details.csv"
CAST_CSV = DATA_DIR / "cast.csv"
GENRES_CSV = DATA_DIR / "genres.csv"
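
Because `ROOT` depends on where Jupyter was started, it can help to confirm that the expected files are actually present before loading anything. A minimal sketch, where `find_missing` is a hypothetical helper that is not part of the original notebook:

```python
from pathlib import Path


def find_missing(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]


# Hypothetical usage with the constants defined above:
# missing = find_missing([MOVIES_CSV, MOVIE_DETAILS_CSV, CAST_CSV, GENRES_CSV])
# if missing:
#     raise FileNotFoundError(f"Missing CSV files: {missing}")
```

Failing early here gives a clearer error than the one `pd.read_csv` would raise later.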

## CSV Exploration
### Movies CSV
I explored this file briefly in the first EDA notebook, but I will do a more thorough exploration here and consider which features to use for the database.

In [2]:
# the python engine is needed to parse this file
df_movies = pd.read_csv(MOVIES_CSV, engine="python")
print(df_movies.head())

   adult                     backdrop_path                genre_ids     id  \
0  False  /Ar7QuJ7sJEiC0oP3I8fKBKIQD9u.jpg             [28, 18, 12]     98   
1  False  /7isarjYDEKZ5t1CgcvbuqEUby8P.jpg                     [27]   9532   
2  False   /zvmsyAMr3cVDdIu7UvDLSmRXlF.jpg          [35, 18, 10749]  22705   
3  False  /mZj8EUr6F1x2PWZjKPxaeYd5WRw.jpg  [12, 16, 35, 10751, 14]  11688   
4  False  /uHZRTGMFb1RLmgWcqlIOZsGbDCT.jpg                     [35]   4247   

  original_language            original_title  \
0                en                 Gladiator   
1                en         Final Destination   
2                it             Tra(sgre)dire   
3                en  The Emperor's New Groove   
4                en               Scary Movie   

                                            overview  popularity  \
0  After the death of Emperor Marcus Aurelius, hi...     16.1208   
1  After a teenager has a terrifying vision of hi...     15.7876   
2  While scouting out apartments

In [3]:
print(df_movies.describe())

                 id    popularity  vote_average    vote_count
count  5.768600e+04  57686.000000  57686.000000  57686.000000
mean   3.758800e+05      0.954345      6.042615    319.600284
std    3.296253e+05      2.461093      1.073200   1392.720682
min    8.000000e+00      0.000000      0.000000     10.000000
25%    7.021475e+04      0.241500      5.400000     16.000000
50%    3.244285e+05      0.423800      6.134500     32.000000
75%    5.740810e+05      0.838800      6.800000    104.000000
max    1.471337e+06    172.901400     10.000000  37661.000000


In [4]:
print(df_movies.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57686 entries, 0 to 57685
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   adult              57686 non-null  bool   
 1   backdrop_path      53592 non-null  object 
 2   genre_ids          57686 non-null  object 
 3   id                 57686 non-null  int64  
 4   original_language  57686 non-null  object 
 5   original_title     57685 non-null  object 
 6   overview           57150 non-null  object 
 7   popularity         57686 non-null  float64
 8   poster_path        57502 non-null  object 
 9   release_date       57686 non-null  object 
 10  title              57685 non-null  object 
 11  video              57686 non-null  bool   
 12  vote_average       57686 non-null  float64
 13  vote_count         57686 non-null  int64  
dtypes: bool(2), float64(2), int64(2), object(8)
memory usage: 5.4+ MB
None
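
The `info()` output above shows `release_date` and `genre_ids` stored as `object`, meaning they were read as plain strings. Before loading into a database they would probably need converting. The sketch below shows one possible approach on a toy frame; the first row reuses values from the outputs in this notebook, the second row is made up to show how bad values are handled, and the `movie_genres` name is only an illustration.

```python
import ast

import pandas as pd

# Toy frame: row 1 uses values seen in this notebook, row 2 is invented
toy = pd.DataFrame(
    {
        "id": [1161605, 1],
        "genre_ids": ["[28, 18]", "[]"],
        "release_date": ["2021-05-21", "not a date"],
    }
)

# Turn the string "[28, 18]" into a real Python list
toy["genre_ids"] = toy["genre_ids"].apply(ast.literal_eval)

# Unparseable dates become NaT instead of raising
toy["release_date"] = pd.to_datetime(
    toy["release_date"], format="%Y-%m-%d", errors="coerce"
)

# One row per (movie, genre) pair, which maps naturally onto a junction table
movie_genres = (
    toy[["id", "genre_ids"]]
    .explode("genre_ids")
    .dropna()
    .rename(columns={"id": "movie_id", "genre_ids": "genre_id"})
)
print(movie_genres)
```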


In [5]:
# let's check for missing values
print("Count of missing values:")
print(df_movies.isnull().sum())
print("\nPercent missing values:")
print(df_movies.isnull().sum()/len(df_movies))

Count of missing values:
adult                   0
backdrop_path        4094
genre_ids               0
id                      0
original_language       0
original_title          1
overview              536
popularity              0
poster_path           184
release_date            0
title                   1
video                   0
vote_average            0
vote_count              0
dtype: int64

Percent missing values:
adult                0.000000
backdrop_path        0.070970
genre_ids            0.000000
id                   0.000000
original_language    0.000000
original_title       0.000017
overview             0.009292
popularity           0.000000
poster_path          0.003190
release_date         0.000000
title                0.000017
video                0.000000
vote_average         0.000000
vote_count           0.000000
dtype: float64
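
Note that the second table holds fractions rather than percentages: 0.070970 for `backdrop_path` means roughly 7.1%. A small sketch of one way to get counts and true percentages side by side using `isna().mean()`; the toy frame is made up for illustration.

```python
import pandas as pd

# Made-up rows, only to demonstrate the calculation
toy = pd.DataFrame(
    {
        "title": ["Gladiator", None, "Scary Movie", "Inception"],
        "backdrop_path": ["/a.jpg", None, None, "/d.jpg"],
    }
)

# isna().mean() is the fraction of missing values per column
missing = pd.DataFrame(
    {
        "n_missing": toy.isna().sum(),
        "pct_missing": (toy.isna().mean() * 100).round(2),
    }
).sort_values("n_missing", ascending=False)
print(missing)
```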


In [8]:
# overall, it's a pretty complete dataset with most fields filled
# the attribute with the most missing values is backdrop_path (about 7%),
# but it's unneeded for now

# let's check entries with a missing title
missing_title = (df_movies["original_title"].isnull()) | (df_movies["title"].isnull())
print(df_movies[missing_title])

       adult                     backdrop_path genre_ids       id  \
49583  False  /ptI1nhBl0uqN6vvMw7Og0JEVGPK.jpg  [28, 18]  1161605   

      original_language original_title  \
49583                en            NaN   

                                                overview  popularity  \
49583  A hitman is tasked to take out ex-mobsters whe...      0.3825   

                            poster_path release_date title  video  \
49583  /oDFMsLYPPRquWxF7zFuhe9qHwGa.jpg   2021-05-21   NaN  False   

       vote_average  vote_count  
49583           9.2          44  


In [14]:
# there's only one and it seems to be an obscure title, so it shouldn't affect
# our future analysis much

# lastly, let's check for fully duplicated rows
print(df_movies[df_movies.duplicated()])

Empty DataFrame
Columns: [adult, backdrop_path, genre_ids, id, original_language, original_title, overview, popularity, poster_path, release_date, title, video, vote_average, vote_count]
Index: []
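
`duplicated()` with no arguments only flags rows that match on every column, so two rows for the same movie with a slightly different `popularity` would not be caught. Since `id` is a likely candidate for the primary key, a check on that column alone may be worth adding. A sketch on made-up rows:

```python
import pandas as pd

# Made-up rows: id 98 appears twice with different popularity values
toy = pd.DataFrame(
    {"id": [98, 9532, 98], "popularity": [16.1208, 15.7876, 16.0]}
)

# Full-row check, as used in the notebook: finds nothing here
full_row_dupes = toy[toy.duplicated()]

# Key-only check: keep=False returns every row that shares an id
id_dupes = toy[toy.duplicated(subset="id", keep=False)]

print(len(full_row_dupes), len(id_dupes))
print(toy["id"].is_unique)
```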


### Movie Details CSV
Now let's check the movie details CSV.

In [9]:
df_movie_details = pd.read_csv(MOVIE_DETAILS_CSV, engine="python")

In [11]:
print(df_movie_details.head())

   adult                     backdrop_path  \
0  False  /Ar7QuJ7sJEiC0oP3I8fKBKIQD9u.jpg   
1  False  /7isarjYDEKZ5t1CgcvbuqEUby8P.jpg   
2  False   /zvmsyAMr3cVDdIu7UvDLSmRXlF.jpg   
3  False  /mZj8EUr6F1x2PWZjKPxaeYd5WRw.jpg   
4  False  /uHZRTGMFb1RLmgWcqlIOZsGbDCT.jpg   

                               belongs_to_collection     budget  \
0  {"id": 1069584, "name": "Gladiator Collection"...  103000000   
1  {"id": 8864, "name": "Final Destination Collec...   23000000   
2                                                NaN    2100000   
3  {"id": 178117, "name": "The Emperor's New Groo...  100000000   
4  {"id": 4246, "name": "Scary Movie Collection",...   19000000   

                                              genres homepage     id  \
0  [{"id": 28, "name": "Action"}, {"id": 18, "nam...      NaN     98   
1                     [{"id": 27, "name": "Horror"}]      NaN   9532   
2  [{"id": 35, "name": "Comedy"}, {"id": 18, "nam...      NaN  22705   
3  [{"id": 12, "name": "Adventur

In [12]:
df_movie_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57686 entries, 0 to 57685
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  57686 non-null  bool   
 1   backdrop_path          53594 non-null  object 
 2   belongs_to_collection  7129 non-null   object 
 3   budget                 57686 non-null  int64  
 4   genres                 57686 non-null  object 
 5   homepage               18282 non-null  object 
 6   id                     57686 non-null  int64  
 7   imdb_id                57232 non-null  object 
 8   origin_country         57686 non-null  object 
 9   original_language      57686 non-null  object 
 10  original_title         57685 non-null  object 
 11  overview               57151 non-null  object 
 12  popularity             57686 non-null  float64
 13  poster_path            57501 non-null  object 
 14  production_companies   57686 non-null  object 
 15  pr

In [15]:
df_movie_details.describe()

            budget         id    popularity       revenue       runtime  vote_average    vote_count
count      57686.0    57686.0       57686.0       57686.0       57686.0       57686.0       57686.0
mean     3982117.0   375880.0      1.005197    10433840.0     94.097979      6.042688    319.684967
std     18307510.0   329625.3      2.432796    68096520.0     31.100197      1.073186    1393.07261
min            0.0        8.0           0.0           0.0           0.0           0.0          10.0
25%            0.0   70214.75      0.246725           0.0          85.0           5.4          16.0
50%            0.0   324428.5         0.454           0.0          94.0         6.134          32.0
75%            0.0   574081.0      0.928375           0.0         107.0           6.8         104.0
max    465400000.0  1471337.0      162.5804  2923706000.0         999.0          10.0       37669.0
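
The quartiles show that at least 75% of rows have a `budget` and a `revenue` of 0, and the minimum `runtime` is also 0. These zeros are probably placeholders for unknown values rather than real measurements, although that is an assumption and has not been verified here. One way to quantify them and, if desired, treat them as missing is sketched below on made-up rows.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame(
    {
        "budget": [103000000, 0, 2100000, 0],
        "revenue": [0, 0, 0, 839030630],
        "runtime": [148, 0, 94, 107],
    }
)
cols = ["budget", "revenue", "runtime"]

# Share of rows where each column is exactly zero
zero_share = (toy[cols] == 0).mean()
print(zero_share)

# Assumption: zero means "unknown", so mask it out as NaN
cleaned = toy[cols].mask(toy[cols] == 0)
print(cleaned.isna().sum())
```

Masking the zeros keeps them from dragging down means and medians in later analysis.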


In [19]:
# the max vote_count value is a little high, let's see if it's plausible
# (i.e. a big blockbuster)
df_movie_details[df_movie_details["vote_count"] == 37669]

       adult                     backdrop_path belongs_to_collection     budget  \
15805  False  /8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg                   NaN  160000000   

                                                  genres  \
15805  [{"id": 28, "name": "Action"}, {"id": 878, "na...   

                                          homepage     id    imdb_id  \
15805  https://www.warnerbros.com/movies/inception  27205  tt1375666   

      origin_country original_language  ... release_date    revenue  runtime  \
15805   ["US", "GB"]                en  ...   2010-07-15  839030630      148   

                                        spoken_languages    status  \
15805  [{"english_name": "English", "iso_639_1": "en"...  Released   

                                    tagline      title  video  vote_average  \
15805  Your mind is the scene of the crime.  Inception  False         8.369   

       vote_count  
15805       37669  
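
Hard-coding 37669 works for this snapshot of the data but would silently return an empty frame if the file were refreshed. An alternative sketch looks the row up with `idxmax()` instead. The Inception figure is taken from the output above; the second row is made up.

```python
import pandas as pd

# Row 1 uses the value from this notebook, row 2 is invented
toy = pd.DataFrame(
    {"title": ["Inception", "Some Other Film"], "vote_count": [37669, 1200]}
)

# idxmax() returns the index label of the largest value
top = toy.loc[toy["vote_count"].idxmax()]
print(top["title"], top["vote_count"])

# nlargest() gives the top few rows when more than one is of interest
top_rows = toy.nlargest(5, "vote_count")
```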


In [21]:
# it's Inception, so very believable
# just like the discover/movie dataset, this seems to be pretty complete with
# most missing values occurring in "non-essential" attributes such as homepage,
# tagline, etc.

# last check, let's see what the belongs_to_collection field looks like
df_movie_details["belongs_to_collection"].dropna().head()

0    {"id": 1069584, "name": "Gladiator Collection"...
1    {"id": 8864, "name": "Final Destination Collec...
3    {"id": 178117, "name": "The Emperor's New Groo...
4    {"id": 4246, "name": "Scary Movie Collection",...
6    {"id": 87359, "name": "Mission: Impossible Col...
Name: belongs_to_collection, dtype: object

In [26]:
# this field appears to give the franchise a film belongs to
# this could be interesting for future analysis, but for now I might just
# add a flag marking whether a film belongs to a franchise or not

# lastly, let's see if any values are duplicated
df_movie_details[df_movie_details.duplicated()]

Empty DataFrame
Columns: [adult, backdrop_path, belongs_to_collection, budget, genres, homepage, id, imdb_id, origin_country, original_language, ..., release_date, revenue, runtime, spoken_languages, status, tagline, title, video, vote_average, vote_count]
Index: []
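
The franchise flag mentioned above could be derived directly from whether `belongs_to_collection` is null. The sketch below also pulls out the collection id in case it is wanted later. The JSON string is a shortened stand-in for the real (truncated) values shown earlier, and the `in_collection` and `collection_id` column names are hypothetical.

```python
import json

import pandas as pd

# Shortened stand-in for the real JSON, which has more keys than shown here
toy = pd.DataFrame(
    {
        "id": [98, 22705],
        "belongs_to_collection": [
            '{"id": 1069584, "name": "Gladiator Collection"}',
            None,
        ],
    }
)

# Simple boolean flag: does the film belong to any collection?
toy["in_collection"] = toy["belongs_to_collection"].notna()


def collection_id(raw):
    """Return the collection id from a JSON string, or None if missing."""
    if pd.isna(raw):
        return None
    return json.loads(raw)["id"]


# Nullable integer dtype keeps ids as integers despite the missing values
toy["collection_id"] = (
    toy["belongs_to_collection"].map(collection_id).astype("Int64")
)
print(toy[["id", "in_collection", "collection_id"]])
```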


### Cast CSV

In [22]:
df_cast = pd.read_csv(CAST_CSV, engine="python")

In [23]:
df_cast.head()

   adult  gender     id known_for_department             name    original_name  \
0  False       2    934               Acting    Russell Crowe    Russell Crowe   
1  False       2  73421               Acting  Joaquin Phoenix  Joaquin Phoenix   
2  False       1    935               Acting   Connie Nielsen   Connie Nielsen   
3  False       2    936               Acting      Oliver Reed      Oliver Reed   
4  False       2    194               Acting   Richard Harris   Richard Harris   

   popularity                      profile_path  cast_id        character  \
0      4.0598  /rsxGCRtPu42uKDJZlz7qknvz8h6.jpg        8          Maximus   
1      2.4213  /u38k3hQBDwNX0VA22aQceDp9Iyv.jpg        9         Commodus   
2      2.9302  /lvQypTfeH2Gn2PTbzq6XkT2PLmn.jpg       10          Lucilla   
3       1.669  /dWfotc1X71wNCGyPO9hXpv8U9Gw.jpg       11          Proximo   
4      3.0321  /lCvcVMuxrg1f5A8OMqY9AqkkcZR.jpg       12  Marcus Aurelius   

                  credit_id  order  movie_id  
0  52fe4217c3a36847f8003435      0        98  
1  52fe4217c3a36847f8003439      1        98  
2  52fe4217c3a36847f800343d      2        98  
3  52fe4217c3a36847f8003441      3        98  
4  52fe4217c3a36847f8003445      4        98  


In [24]:
df_cast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1108528 entries, 0 to 1108527
Data columns (total 13 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   adult                 1108528 non-null  bool   
 1   gender                1108528 non-null  int64  
 2   id                    1108528 non-null  int64  
 3   known_for_department  1108527 non-null  object 
 4   name                  1108528 non-null  object 
 5   original_name         1108528 non-null  object 
 6   popularity            1108528 non-null  float64
 7   profile_path          766673 non-null   object 
 8   cast_id               1108528 non-null  int64  
 9   character             1050288 non-null  object 
 10  credit_id             1108528 non-null  object 
 11  order                 1108528 non-null  int64  
 12  movie_id              1108528 non-null  int64  
dtypes: bool(1), float64(1), int64(5), object(6)
memory usage: 102.5+ MB
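
Since `credit_id` and `movie_id` are non-null for every row, one possible shape for a cast table is sketched below with the standard-library `sqlite3` module. The table names, column names, and types are assumptions for illustration only, and the uniqueness of `credit_id` has not been verified in this notebook. The sample row comes from the `head()` output above.

```python
import sqlite3

# Hypothetical schema: names and types are illustrative, not final
SCHEMA = """
CREATE TABLE movies (
    id INTEGER PRIMARY KEY,
    title TEXT
);
CREATE TABLE cast_credits (
    credit_id TEXT PRIMARY KEY,
    movie_id INTEGER NOT NULL REFERENCES movies(id),
    person_id INTEGER NOT NULL,
    name TEXT NOT NULL,
    character_name TEXT,           -- nullable: some rows have no character
    cast_order INTEGER NOT NULL    -- "order" is a reserved word in SQL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

conn.execute("INSERT INTO movies VALUES (?, ?)", (98, "Gladiator"))
conn.execute(
    "INSERT INTO cast_credits VALUES (?, ?, ?, ?, ?, ?)",
    ("52fe4217c3a36847f8003435", 98, 934, "Russell Crowe", "Maximus", 0),
)

# Join each credit back to its movie
rows = conn.execute(
    """
    SELECT m.title, c.name, c.character_name
    FROM cast_credits AS c
    JOIN movies AS m ON m.id = c.movie_id
    ORDER BY c.cast_order
    """
).fetchall()
print(rows)
```

Renaming `order` and `id` on the way in avoids a reserved word and makes it clear that the cast file's `id` refers to a person rather than a movie.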


In [None]:
df_cast.describe()