# CSV File Analysis and SQL Schema
The purpose of this notebook is to check the quality of the data that's downloaded and see if any cleaning/transformations are needed. I will also start exploring what the SQL schema will look like.

In [2]:
import json

from pathlib import Path 
import pandas as pd

# this notebook assumes you're running Jupyter from root
ROOT = Path().resolve().parent
DATA_DIR = ROOT / "data"
MOVIES_FILE = DATA_DIR / "movies" / "movies_2000.parquet"
MOVIE_DETAILS_CSV = DATA_DIR / "movie_details.csv"
CREDITS_CSV = DATA_DIR / "credits.csv"
GENRES_CSV = DATA_DIR / "genres.csv"

## CSV Exploration
### Movies CSV
I explored this file a little in the first EDA notebook, but I will perform a more thorough one here and consider what features to use for the database.

In [3]:
# need python engine
df_movies = pd.read_parquet(MOVIES_FILE, engine="pyarrow")
print(df_movies.head())

   0
0  [
1  {
2  "
3  a
4  d


In [7]:
print(df_movies.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60801 entries, 0 to 60800
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   adult              60801 non-null  bool   
 1   backdrop_path      56674 non-null  object 
 2   genre_ids          60801 non-null  object 
 3   id                 60801 non-null  int64  
 4   original_language  60801 non-null  object 
 5   original_title     60801 non-null  object 
 6   overview           60801 non-null  object 
 7   popularity         60801 non-null  float64
 8   poster_path        60622 non-null  object 
 9   release_date       60801 non-null  object 
 10  title              60801 non-null  object 
 11  video              60801 non-null  bool   
 12  vote_average       60801 non-null  float64
 13  vote_count         60801 non-null  int64  
dtypes: bool(2), float64(2), int64(2), object(8)
memory usage: 5.7+ MB
None


In [8]:
print(df_movies.describe())

                 id    popularity  vote_average    vote_count
count  6.080100e+04  60801.000000  60801.000000  60801.000000
mean   3.956722e+05      2.757176      5.994734    313.229273
std    3.501588e+05      5.206090      1.098148   1383.043970
min    8.000000e+00      0.000000      1.200000     10.000000
25%    7.354800e+04      1.059300      5.318000     16.000000
50%    3.360260e+05      2.060500      6.100000     31.000000
75%    5.951080e+05      3.621400      6.796000    101.000000
max    1.576287e+06    496.923000     10.000000  38261.000000


In [9]:
# let's check for missing values
print("Count of missing values:")
print(df_movies.isnull().sum())
print("\nPercent missing values:")
print(df_movies.isnull().sum()/len(df_movies))

Count of missing values:
adult                   0
backdrop_path        4127
genre_ids               0
id                      0
original_language       0
original_title          0
overview                0
popularity              0
poster_path           179
release_date            0
title                   0
video                   0
vote_average            0
vote_count              0
dtype: int64

Percent missing values:
adult                0.000000
backdrop_path        0.067877
genre_ids            0.000000
id                   0.000000
original_language    0.000000
original_title       0.000000
overview             0.000000
popularity           0.000000
poster_path          0.002944
release_date         0.000000
title                0.000000
video                0.000000
vote_average         0.000000
vote_count           0.000000
dtype: float64


In [10]:
# overall, it's a pretty complete dataset with most fields filled
# the attribute with the most missing values is backdrop, but it's unneeded for now

# let's check entries with a missing title
missing_title = (df_movies["original_title"].isnull()) | (df_movies["title"].isnull())
print(df_movies[missing_title])

Empty DataFrame
Columns: [adult, backdrop_path, genre_ids, id, original_language, original_title, overview, popularity, poster_path, release_date, title, video, vote_average, vote_count]
Index: []


In [13]:
# lastly let's check duplicates
print(df_movies[df_movies["id"].duplicated()])

      adult                     backdrop_path genre_ids      id  \
7466  False  /iy7yd28D9cEKeWa8cepYxg4t4U3.jpg  [27, 28]  267913   

     original_language    original_title  \
7466                en  Vampire Assassin   

                                               overview  popularity  \
7466  Martial artist Ron Hall stars in this dark vam...      1.4216   

                           poster_path release_date             title  video  \
7466  /zHgrYaOKGhOTU0m8lVyGMANiYZe.jpg   2005-08-09  Vampire Assassin  False   

      vote_average  vote_count  
7466         2.136          11  


In [8]:
df = pd.read_csv(MOVIES_CSV)
dupes = df[df.duplicated(subset=['id'], keep=False)].sort_values('id')
print(dupes[['id', 'title', 'release_date']].head(20))
print(f"\nTotal duplicates: {dupes['id'].nunique()} movies")

Empty DataFrame
Columns: [id, title, release_date]
Index: []

Total duplicates: 0 movies


### Movie Details CSV
Now let's check the movie details csv

In [9]:
df_movie_details = pd.read_csv(MOVIE_DETAILS_CSV, engine="python")

In [10]:
print(df_movie_details.head())

   adult                     backdrop_path belongs_to_collection   budget  \
0  False  /wFV6uAU2SJXj9E9w3FDFGAJyshG.jpg                   NaN        0   
1  False  /dD90r6NQ8cFgYjjYGSLRQLCdJWN.jpg                   NaN        0   
2  False                               NaN                   NaN        0   
3  False  /ifq88qw3vgoKlUyw0OAmPQCSqBc.jpg                   NaN        0   
4  False                               NaN                   NaN  3231400   

                                              genres homepage      id  \
0  [{"id": 18, "name": "Drama"}, {"id": 80, "name...      NaN  537605   
1                      [{"id": 18, "name": "Drama"}]      NaN  515728   
2                                                 []      NaN  365504   
3                     [{"id": 27, "name": "Horror"}]      NaN  300236   
4  [{"id": 14, "name": "Fantasy"}, {"id": 18, "na...      NaN  289673   

     imdb_id origin_country original_language  ... release_date revenue  \
0  tt2473052         ["

In [11]:
df_movie_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60767 entries, 0 to 60766
Data columns (total 26 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  60767 non-null  bool   
 1   backdrop_path          56642 non-null  object 
 2   belongs_to_collection  7416 non-null   object 
 3   budget                 60767 non-null  int64  
 4   genres                 60767 non-null  object 
 5   homepage               19342 non-null  object 
 6   id                     60767 non-null  int64  
 7   imdb_id                60303 non-null  object 
 8   origin_country         60767 non-null  object 
 9   original_language      60767 non-null  object 
 10  original_title         60766 non-null  object 
 11  overview               60264 non-null  object 
 12  popularity             60767 non-null  float64
 13  poster_path            60589 non-null  object 
 14  production_companies   60767 non-null  object 
 15  pr

In [11]:
df_movie_details.describe()

Unnamed: 0,budget,id,popularity,revenue,runtime,vote_average,vote_count
count,60557.0,60557.0,60557.0,60557.0,60557.0,60557.0,60557.0
mean,3940526.0,393426.3,2.998853,10240290.0,94.00786,5.993948,313.350992
std,18385930.0,347880.0,6.552957,67856540.0,31.055907,1.098821,1382.497375
min,0.0,8.0,0.0,0.0,0.0,1.25,10.0
25%,0.0,73210.0,1.2395,0.0,85.0,5.318,16.0
50%,0.0,334684.0,2.1893,0.0,94.0,6.1,31.0
75%,0.0,592678.0,3.277,0.0,107.0,6.798,101.0
max,583900000.0,1571470.0,357.8025,2923706000.0,999.0,10.0,38155.0


In [12]:
# the max vote_count value is a little high, let's see if it's plausible
# (i.e. a big blockbuster)
df_movie_details[df_movie_details["vote_count"] == 37670]

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,origin_country,original_language,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count


In [13]:
# it's Inception, so very believable
# just like the discover/movie dataset, this seems to be pretty complete with
# most missing values occuring in "non-essential" attributes such as homepage,
# tagline, etc

# there are several nested jsons in this file, namely "belongs_to_collection" and "production_companies"
# let's get a sample of how that looks
json.loads(df_movie_details["belongs_to_collection"][0])

{'id': 290973,
 'name': 'Crouching Tiger, Hidden Dragon Collection',
 'poster_path': '/8x9ajHWUm5K5rdMRvXe2vhjdLAk.jpg',
 'backdrop_path': '/fWbparTMpxYDgutCf9LLlcJgFZT.jpg'}

In [14]:
json.loads(df_movie_details["production_companies"][0])

[{'id': 58,
  'logo_path': '/voYCwlBHJQANtjvm5MNIkCF1dDH.png',
  'name': 'Sony Pictures Classics',
  'origin_country': 'US'},
 {'id': 76795,
  'logo_path': '/g3iItU50K4SUdDekNqBhU9O43Xe.png',
  'name': 'Columbia Pictures Film Production Asia',
  'origin_country': 'HK'},
 {'id': 10284,
  'logo_path': '/u0FCdiR026xbEbuY4yqaKj9Lf2O.png',
  'name': 'Edko Films',
  'origin_country': 'HK'},
 {'id': 97292, 'logo_path': None, 'name': 'Zoom Hunt', 'origin_country': 'TW'},
 {'id': 2269,
  'logo_path': None,
  'name': 'China Film Co-Production Corp.',
  'origin_country': 'CN'},
 {'id': 10565,
  'logo_path': '/5djnxodjmgbzdXiNRllwhQPxANY.png',
  'name': 'Good Machine',
  'origin_country': 'US'}]

In [12]:
# both fields look interesting, but for now we can keep the production_companies in
# our initial database schema, as it has the two fields probably most relevant for our
# initial analysis. the collection field might require me to rework my movie script
# and ingest from the collections endpoint

# last check: duplicates
df_movie_details[df_movie_details.duplicated()]

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,origin_country,original_language,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count


In [16]:
### Cast CSV

In [17]:
df_cast = pd.read_csv(CREDITS_CSV, engine="python")

FileNotFoundError: [Errno 2] No such file or directory: '/home/jeff/movie_data_analysis/data/cast.csv'

In [None]:
df_cast.head()

In [None]:
df_cast.info()

In [None]:
df_cast.describe()

In [None]:
# let's check the known_for_department feature
print(df_cast["known_for_department"].unique())

In [None]:
# interestingly, i was expecting most of the folk here to be labeled as 'acting'
# let's get an idea of how many in 'cast' are labeled otherwise
df_cast_nonactors = df_cast[df_cast["known_for_department"] != "Acting"]
print(len(df_cast_nonactors)/len(df_cast))

In [None]:
# let's see what the first 5 entries look like
df_cast_nonactors.head()

In [None]:
# looks like non-actors seem to be casted as extras. because of this,
# we can safely not include the known_for_department feature for now

# duplicates check
df_cast[df_cast.duplicated()]

In [None]:
# let's see if some actors had multiple roles
df_cast[df_cast.duplicated(["id", "movie_id"], keep=False)].sort_values(by="popularity", ascending=False)

### Genres CSV
Lastly, let's check out the genres csv. It's a fairly small file so we can just print it

In [None]:
df_genres = pd.read_csv(GENRES_CSV, engine="python")
print(df_genres)

### CSV Summary
Overall, it looks like the CSV files are largely complete with few missing values in key features. There are some such as backdrop_path and profile_path that might be useful in the future, but we can exclude those for now. The following are the features I'll keep for now.

- **Movies CSV**:
  
genre_ids, id, original_language, original_title, overview, popularity, release_date, title, vote_average, vote_count
- **Movie Details CSV**:

budget, genres, id, origin_country, original_language, original_title, production_companies, revenue, runtime, tagline, status
- **Cast CSV**:

id, name, original_name, popularity, character, order, movie_id

- **Genres CSV**

id, name