# CSV File Analysis and SQL Schema
The purpose of this notebook is to check the quality of the data that's downloaded and see if any cleaning/transformations are needed. I will also start exploring what the SQL schema will look like.

In [1]:
from pathlib import Path 
import pandas as pd

# this notebook assumes you're running Jupyter from root
ROOT = Path().resolve().parent
DATA_DIR = ROOT / "data"
MOVIES_CSV = DATA_DIR / "movies.csv"
MOVIE_DETAILS_CSV = DATA_DIR / "movie_details.csv"
CAST_CSV = DATA_DIR / "cast.csv"
GENRES_CSV = DATA_DIR / "genres.csv"

## Movies CSV
I explored this file a little in the first EDA notebook, but I will perform a more thorough one here and consider what features to use for the database.

In [2]:
# need python engine
df_movies = pd.read_csv(MOVIES_CSV, engine="python")
print(df_movies.head())



   adult                     backdrop_path                genre_ids     id  \
0  False  /7isarjYDEKZ5t1CgcvbuqEUby8P.jpg                     [27]   9532   
1  False  /Ar7QuJ7sJEiC0oP3I8fKBKIQD9u.jpg             [28, 18, 12]     98   
2  False   /zvmsyAMr3cVDdIu7UvDLSmRXlF.jpg          [35, 18, 10749]  22705   
3  False  /uHZRTGMFb1RLmgWcqlIOZsGbDCT.jpg                     [35]   4247   
4  False  /mZj8EUr6F1x2PWZjKPxaeYd5WRw.jpg  [12, 16, 35, 10751, 14]  11688   

  original_language            original_title  \
0                en         Final Destination   
1                en                 Gladiator   
2                it             Tra(sgre)dire   
3                en               Scary Movie   
4                en  The Emperor's New Groove   

                                            overview  popularity  \
0  After a teenager has a terrifying vision of hi...     17.5062   
1  After the death of Emperor Marcus Aurelius, hi...     15.7859   
2  While scouting out apartments

In [3]:
print(df_movies.describe())

                 id    popularity  vote_average    vote_count
count  5.764000e+04  57640.000000  57640.000000  57640.000000
mean   3.755458e+05      1.037799      6.042354    319.575607
std    3.293602e+05      2.320150      1.073040   1392.171684
min    8.000000e+00      0.000800      1.200000     10.000000
25%    7.013550e+04      0.291975      5.400000     16.000000
50%    3.242900e+05      0.519200      6.132000     32.000000
75%    5.737698e+05      0.983100      6.800000    104.000000
max    1.471337e+06    121.988500     10.000000  37639.000000


In [4]:
print(df_movies.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57640 entries, 0 to 57639
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   adult              57640 non-null  bool   
 1   backdrop_path      53532 non-null  object 
 2   genre_ids          57640 non-null  object 
 3   id                 57640 non-null  int64  
 4   original_language  57640 non-null  object 
 5   original_title     57639 non-null  object 
 6   overview           57096 non-null  object 
 7   popularity         57640 non-null  float64
 8   poster_path        57456 non-null  object 
 9   release_date       57640 non-null  object 
 10  title              57639 non-null  object 
 11  video              57640 non-null  bool   
 12  vote_average       57640 non-null  float64
 13  vote_count         57640 non-null  int64  
dtypes: bool(2), float64(2), int64(2), object(8)
memory usage: 5.4+ MB
None


In [9]:
# let's check for missing values
print("Count of missing values:")
print(df_movies.isnull().sum())
print("\nPercent missing values:")
print(df_movies.isnull().sum()/len(df_movies))

Count of missing values:
adult                   0
backdrop_path        4108
genre_ids               0
id                      0
original_language       0
original_title          1
overview              544
popularity              0
poster_path           184
release_date            0
title                   1
video                   0
vote_average            0
vote_count              0
dtype: int64

Percent missing values:
adult                0.000000
backdrop_path        0.071270
genre_ids            0.000000
id                   0.000000
original_language    0.000000
original_title       0.000017
overview             0.009438
popularity           0.000000
poster_path          0.003192
release_date         0.000000
title                0.000017
video                0.000000
vote_average         0.000000
vote_count           0.000000
dtype: float64
