mcarter-00/Movies-ETL


Movies-ETL

Perform ETL on several movie datasets to predict popular films for a streaming service.

Challenge

The "process_ETL" function runs the full data pipeline: extraction, transformation, and loading. In this example, the function extracts data from three files: wikipedia.movies.json, movies_metadata.csv, and ratings.csv. It then performs the transformation steps, which include cleaning the data from both Wikipedia and Kaggle. Once the data is cleaned, the Wikipedia and Kaggle movie datasets are merged into one dataset, while the ratings dataset is kept separate. The transformed datasets are then loaded into a SQL database for further analysis.
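
The README does not show the function body, so the following is only a minimal sketch of the extract, transform, and load flow described above, not the repository's implementation. It makes several assumptions: it uses the Python standard library only, an SQLite connection stands in for the PostgreSQL database, the two movie sources are joined on a hypothetical shared `imdb_id` field, and the ratings file has `userId`, `movieId`, and `rating` columns. The name `process_etl_sketch` and the helper names are illustrative.

```python
import csv
import json
import sqlite3


def extract(wiki_path, kaggle_path, ratings_path):
    """Read the three source files into lists of dicts."""
    with open(wiki_path, encoding="utf-8") as f:
        wiki = json.load(f)
    with open(kaggle_path, newline="", encoding="utf-8") as f:
        kaggle = list(csv.DictReader(f))
    with open(ratings_path, newline="", encoding="utf-8") as f:
        ratings = list(csv.DictReader(f))
    return wiki, kaggle, ratings


def transform(wiki, kaggle):
    """Clean both movie sources and merge them into one list of rows."""
    # Assumption: both sources carry an "imdb_id" field to join on.
    # Wikipedia records without an ID cannot be matched, so they are dropped.
    wiki_by_id = {m["imdb_id"]: m for m in wiki if m.get("imdb_id")}
    merged = []
    for row in kaggle:
        match = wiki_by_id.get(row.get("imdb_id"))
        if match is None:
            continue  # keep only movies present in both sources
        merged.append({
            "imdb_id": row["imdb_id"],
            "title": row.get("title") or match.get("title"),
            # Missing numeric values are replaced with zero.
            "box_office": match.get("box_office") or 0,
        })
    return merged


def load(conn, movies, ratings):
    """Write the merged movies and the separate ratings to SQL tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies "
        "(imdb_id TEXT, title TEXT, box_office REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings "
        "(user_id INTEGER, movie_id INTEGER, rating REAL)"
    )
    conn.executemany(
        "INSERT INTO movies VALUES (:imdb_id, :title, :box_office)", movies
    )
    # Assumption: ratings.csv has userId, movieId, and rating columns.
    conn.executemany(
        "INSERT INTO ratings VALUES (:userId, :movieId, :rating)", ratings
    )
    conn.commit()


def process_etl_sketch(wiki_path, kaggle_path, ratings_path, conn):
    """Run extract, transform, and load; return the row counts loaded."""
    wiki, kaggle, ratings = extract(wiki_path, kaggle_path, ratings_path)
    movies = transform(wiki, kaggle)
    load(conn, movies, ratings)
    return len(movies), len(ratings)
```

With the three files in the working directory, a call might look like `process_etl_sketch("wikipedia.movies.json", "movies_metadata.csv", "ratings.csv", sqlite3.connect("movies.db"))`, where the database filename is a placeholder. Against PostgreSQL, the load step would need a PostgreSQL driver and its own placeholder syntax.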

Assumptions:

  • The user starts with data files, such as .json and .csv files, to use this function.
  • Exploratory data analysis is done prior to or apart from using this function.
  • It is permissible to remove null data, given that half the columns in the Wikipedia dataset contain more than 6,000 null values.
  • It is permissible to replace NaN with zeros to make better use of the dataset.
  • The user is able to calculate statistics for the Ratings dataset, given that it is grouped by "MovieID" and "Rating".
  • There is a connection to a PostgreSQL database into which this function loads the datasets.
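
The null-handling and grouping assumptions above can be illustrated with a small standard-library sketch. It is a hypothetical example, not code from the repository: the helper names, the `max_nulls` parameter, and the `movieId` and `rating` keys are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def drop_sparse_columns(rows, max_nulls):
    """Drop every column whose count of null values exceeds max_nulls."""
    # Collect all column names in first-seen order.
    columns = list(dict.fromkeys(col for row in rows for col in row))
    null_counts = Counter()
    for row in rows:
        for col in columns:
            if row.get(col) is None:  # a missing key counts as null
                null_counts[col] += 1
    keep = [col for col in columns if null_counts[col] <= max_nulls]
    return [{col: row.get(col) for col in keep} for row in rows]


def fill_missing_with_zero(rows, numeric_columns):
    """Replace null values with 0 in the named numeric columns."""
    return [
        {
            col: 0 if col in numeric_columns and value is None else value
            for col, value in row.items()
        }
        for row in rows
    ]


def rating_counts(ratings):
    """Count how many times each movie received each rating value."""
    counts = defaultdict(Counter)
    for r in ratings:
        counts[r["movieId"]][r["rating"]] += 1
    return counts
```

For the Wikipedia data, a threshold in line with the figure quoted above would be `drop_sparse_columns(rows, max_nulls=6000)`. Because `Counter` returns 0 for a rating value a movie never received, the grouped result has no gaps to fill.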
