# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [29]:
# Do all imports and installs here
import os

import pandas as pd
from IPython.display import display  # explicit import so info() also works outside a notebook
from pyspark.sql import SparkSession

movies_path="data/movies"
print(os.listdir(movies_path))

['keywords.csv', 'credits.csv', 'ratings.csv', 'movies_metadata.csv', 'links.csv', 'links_small.csv', 'ratings_small.csv']


#### Common functions

In [30]:
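def info(df):
    """Print a quick profile of a pandas DataFrame: schema, null counts, statistics, and head."""
    # df.info() prints the schema itself and returns None, so display() also shows "None".
    display(df.info())

    # Show only the columns that contain at least one null value.
    print("Null values count:")
    print()
    c=df.isnull().sum()
    print(c[c>0])

    # include='all' adds the non-numeric columns (count, unique, top, freq).
    print()
    print("Statistics:")
    display(df.describe(include='all'))

    print("Head:")
    display(df.head())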

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did they come from? What type of information is included?

In [31]:
# Read in the data here

In [32]:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

In [33]:


# Load each source CSV with a header row. inferSchema=True makes Spark take an
# extra pass over each file to guess the column types.
credits_df=spark.read.csv(f"{movies_path}/credits.csv", inferSchema=True, header=True)

keywords_df=spark.read.csv(f"{movies_path}/keywords.csv", inferSchema=True, header=True)

links_df=spark.read.csv(f"{movies_path}/links.csv", inferSchema=True, header=True)

movies_metadata_df=spark.read.csv(f"{movies_path}/movies_metadata.csv", inferSchema=True, header=True)

# Note: this reads ratings_small.csv, not the full ratings.csv.
ratings_df=spark.read.csv(f"{movies_path}/ratings_small.csv", inferSchema=True, header=True)

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

#### Credits


In [34]:
info(credits_df.toPandas())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45475 non-null  object
 2   id      45476 non-null  object
dtypes: object(3)
memory usage: 1.0+ MB


None

Null values count:

crew    1
dtype: int64

Statistics:


Unnamed: 0,cast,crew,id
count,45476,45475,45476
unique,42987,41476,31677
top,[],[],'gender': 2
freq,2418,742,4113


Head:


Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","""[{'credit_id': '52fe4284c3a36847f8024f49', 'd...",'profile_path': None}
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"""[{'cast_id': 1, 'character': """"Savannah 'Vann...",'credit_id': '52fe44779251416c91011aad','gender': 1
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


#### TODO: Parse
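
The head output above suggests two things about `credits.csv`: embedded double quotes are escaped by doubling them, and `cast` and `crew` hold stringified Python lists of dicts. The following is a minimal sketch of one way to parse such rows, using only the standard library. The sample row, `parse_literal_list`, and `read_credits` are hypothetical names and data made up for illustration; they are not taken from the real file.

```python
import ast
import csv
import io

# Hypothetical sample in the same shape as credits.csv: cast and crew hold
# Python-literal lists of dicts, and embedded double quotes are doubled.
SAMPLE_CSV = (
    "cast,crew,id\n"
    "\"[{'cast_id': 1, 'character': \"\"Example 'Ex' Name\"\", 'name': 'Some Actor'}]\","
    "\"[{'credit_id': 'abc123', 'job': 'Director'}]\",42\n"
)


def parse_literal_list(text):
    """Parse a stringified Python list; return [] if it is empty or malformed."""
    if not text:
        return []
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []


def read_credits(csv_text):
    """Read credits rows, turning cast and crew into lists of dicts."""
    rows = []
    # csv.DictReader treats a doubled quote inside a quoted field as one quote.
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "id": int(row["id"]),
            "cast": parse_literal_list(row["cast"]),
            "crew": parse_literal_list(row["crew"]),
        })
    return rows


credits = read_credits(SAMPLE_CSV)
print(credits[0]["cast"][0]["character"])
```

`ast.literal_eval` is used rather than `json.loads` because the values use single-quoted Python syntax, which is not valid JSON. On the Spark side, passing `quote='"'`, `escape='"'`, and `multiLine=True` to `spark.read.csv` may fix the shifted columns seen above; that is an assumption to verify against the row counts, not something tested here.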

#### Keywords

In [35]:
info(keywords_df.toPandas())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int32 
 1   keywords  46419 non-null  object
dtypes: int32(1), object(1)
memory usage: 544.1+ KB


None

Null values count:

Series([], dtype: int64)

Statistics:


Unnamed: 0,id,keywords
count,46419.0,46419
unique,,25951
top,,[]
freq,,14795
mean,109769.951873,
std,113045.780256,
min,2.0,
25%,26810.5,
50%,61198.0,
75%,159908.5,


Head:


Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"""[{'id': 10090, 'name': 'board game'}, {'id': ..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


#### Links

In [36]:
info(links_df.toPandas())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45843 entries, 0 to 45842
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  45843 non-null  int32  
 1   imdbId   45843 non-null  int32  
 2   tmdbId   45624 non-null  float64
dtypes: float64(1), int32(2)
memory usage: 716.4 KB


None

Null values count:

tmdbId    219
dtype: int64

Statistics:


Unnamed: 0,movieId,imdbId,tmdbId
count,45843.0,45843.0,45624.0
mean,96578.775626,993708.0,108661.382847
std,57216.863469,1361924.0,112665.97083
min,1.0,1.0,2.0
25%,49202.5,83330.5,26502.75
50%,108799.0,283991.0,60178.0
75%,145270.5,1538311.0,157849.5
max,176279.0,7158814.0,469172.0


Head:


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
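
`tmdbId` comes out as `float64` rather than an integer type, most likely because of its 219 null values. One possible cleaning step is to convert the values back to integers while keeping the missing ones explicit. This is a minimal sketch; `to_optional_int` is a hypothetical helper, not part of the notebook.

```python
import math


def to_optional_int(value):
    """Convert an integer-valued number such as 862.0 to int; map None/NaN to None."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if float(value) != int(value):
        # Refuse to silently truncate something like 862.5.
        raise ValueError(f"not an integer-valued number: {value!r}")
    return int(value)


print([to_optional_int(v) for v in (862.0, float("nan"), 8844.0)])
```

In pandas, casting the column to the nullable `Int64` dtype is an alternative that achieves a similar result without a Python-level loop.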


#### Movies metadata

In [37]:
info(movies_metadata_df.toPandas())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45572 entries, 0 to 45571
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   adult                  45572 non-null  object
 1   belongs_to_collection  4591 non-null   object
 2   budget                 45555 non-null  object
 3   genres                 45549 non-null  object
 4   homepage               7937 non-null   object
 5   id                     45541 non-null  object
 6   imdb_id                45447 non-null  object
 7   original_language      45527 non-null  object
 8   original_title         45540 non-null  object
 9   overview               44587 non-null  object
 10  popularity             45452 non-null  object
 11  poster_path            45091 non-null  object
 12  production_companies   45448 non-null  object
 13  production_countries   45452 non-null  object
 14  release_date           45378 non-null  object
 15  revenue            

None

Null values count:

belongs_to_collection    40981
budget                      17
genres                      23
homepage                 37635
id                          31
imdb_id                    125
original_language           45
original_title              32
overview                   985
popularity                 120
poster_path                481
production_companies       124
production_countries       120
release_date               194
revenue                    136
runtime                    404
spoken_languages           164
status                     245
tagline                  23132
title                      778
video                      553
vote_average               498
vote_count                 389
dtype: int64

Statistics:


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
count,45572,4591,45555,45549,7937,45541,45447,45527,45540,44587,...,45378,45436,45168.0,45408,45327,22440,44794,45019,45074.0,45183
unique,111,1795,1358,4172,7740,45469,45386,246,43418,44253,...,19065,9169,2535.0,3613,1367,19588,40102,2102,1573.0,2898
top,False,"{'id': 415931, 'name': 'The Bowery Boys', 'pos...",0,"[{'id': 18, 'name': 'Drama'}]",0,"[{'id': 35, 'name': 'Comedy'}]",0,en,0,No overview found.,...,"[{'iso_3166_1': 'US', 'name': 'United States o...",0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Released,Released,False,0.0,1
freq,45454,29,36509,4996,61,15,20,32185,16,133,...,478,34799,2334.0,20523,41194,1090,749,41496,2756.0,2972


Head:


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"""Cheated on, mistreated and stepped on, the wo...",...,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173
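
The statistics above report 111 unique values in `adult` and a non-numeric `top` value for `id`, and row 3 of the head shows a poster path in the `release_date` column. Together these suggest that some rows were split in the wrong place when the file was read. A minimal sketch of one way to separate well-formed rows from shifted ones follows; `is_well_formed` and `split_rows` are hypothetical helpers, and the second sample row is invented to illustrate a shifted record.

```python
import re

DIGITS = re.compile(r"[0-9]+")


def is_well_formed(row):
    """Return True when `adult` is a boolean literal and `id` is purely numeric."""
    adult = row.get("adult")
    movie_id = row.get("id")
    return (
        adult in ("True", "False")
        and isinstance(movie_id, str)
        and DIGITS.fullmatch(movie_id) is not None
    )


def split_rows(rows):
    """Partition rows into (well_formed, suspicious) lists."""
    good, bad = [], []
    for row in rows:
        (good if is_well_formed(row) else bad).append(row)
    return good, bad


sample = [
    {"adult": "False", "id": "862", "title": "Toy Story"},
    # Invented example of a shifted row: a genres fragment landed in `id`.
    {"adult": "False", "id": "[{'id': 35, 'name': 'Comedy'}]", "title": None},
]
good, bad = split_rows(sample)
print(len(good), len(bad))
```

Keeping the suspicious rows in a separate list rather than dropping them makes it easier to check later whether a change to the CSV reader options has fixed them.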


#### Ratings

In [41]:
info(ratings_df.toPandas())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100004 non-null  int32  
 1   movieId    100004 non-null  int32  
 2   rating     100004 non-null  float64
 3   timestamp  100004 non-null  int32  
dtypes: float64(1), int32(3)
memory usage: 1.9 MB


None

Null values count:

Series([], dtype: int64)

Statistics:


Unnamed: 0,userId,movieId,rating,timestamp
count,100004.0,100004.0,100004.0,100004.0
mean,347.01131,12548.664363,3.543608,1129639000.0
std,195.163838,26369.198969,1.058064,191685800.0
min,1.0,1.0,0.5,789652000.0
25%,182.0,1028.0,3.0,965847800.0
50%,367.0,2406.5,4.0,1110422000.0
75%,520.0,5418.0,4.0,1296192000.0
max,671.0,163949.0,5.0,1476641000.0


Head:


Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
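
The `timestamp` column is loaded as a plain integer. Its range (roughly 7.9e8 to 1.5e9) is consistent with Unix epoch seconds, although the notebook does not confirm this. Assuming that interpretation, a minimal sketch of the conversion to timezone-aware datetimes looks like this; `epoch_to_utc` is a hypothetical helper.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# First timestamp from the ratings head output above.
first_rating = epoch_to_utc(1260759144)
print(first_rating.isoformat())  # 2009-12-14T02:52:24+00:00
```

Passing `tz=timezone.utc` explicitly avoids results that depend on the local timezone of the machine running the notebook.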


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [39]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
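
The checks listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: rows are represented as dicts, and `duplicate_keys` and `run_quality_checks` are hypothetical helpers rather than part of the pipeline. The sample data is made up.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the sorted key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def run_quality_checks(table_name, rows, key, expected_count):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    # Source/count check: the target should hold as many rows as the source.
    if len(rows) != expected_count:
        failures.append(
            f"{table_name}: expected {expected_count} rows, found {len(rows)}"
        )
    # Integrity check: the key column must not contain missing values.
    missing = sum(1 for row in rows if row.get(key) is None)
    if missing:
        failures.append(f"{table_name}: {missing} rows have a null {key}")
    # Integrity check: the key column must be unique.
    dupes = duplicate_keys([r for r in rows if r.get(key) is not None], key)
    if dupes:
        failures.append(f"{table_name}: duplicate {key} values {dupes}")
    return failures


sample_rows = [{"movieId": 1}, {"movieId": 2}, {"movieId": 2}, {"movieId": None}]
for message in run_quality_checks("links", sample_rows, "movieId", 4):
    print(message)
```

Returning messages instead of raising on the first failure lets a single run report every problem in a table at once.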
 
Run Quality Checks

In [40]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.
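
One lightweight way to keep the data dictionary in the notebook is to render it from a list of entries. The sketch below uses the ratings columns and the pandas dtypes reported in Step 2; the descriptions are deliberately left as `TODO` placeholders, and `render_data_dictionary` is a hypothetical helper.

```python
def render_data_dictionary(entries):
    """Render (field, dtype, source, description) tuples as a Markdown table."""
    lines = [
        "| Field | Type | Source | Description |",
        "| --- | --- | --- | --- |",
    ]
    for field, dtype, source, description in entries:
        lines.append(f"| {field} | {dtype} | {source} | {description} |")
    return "\n".join(lines)


# Column names and dtypes come from the ratings info() output in Step 2.
RATINGS_ENTRIES = [
    ("userId", "int32", "ratings_small.csv", "TODO"),
    ("movieId", "int32", "ratings_small.csv", "TODO"),
    ("rating", "float64", "ratings_small.csv", "TODO"),
    ("timestamp", "int32", "ratings_small.csv", "TODO"),
]

table = render_data_dictionary(RATINGS_ENTRIES)
print(table)
```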

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated every day by 7am.
    * The database needed to be accessed by 100+ people.