In this project, we investigate the 'TMDb Movie Data' dataset. The dataset was selected from Here to fulfill the requirements of Project no. 2 of the Data Analysis - Professional Nanodegree course, part of the Egyptian MCIT / ITIDA FWD initiative and taken through the Udacity online platform. The goal of this project is to analyze, explore, and visualize the dataset in a way that supports conclusions that help answer related questions.
This dataset represents information about 10,000 movies collected from a specific source. The goal is to answer the following questions:
- Which genres are most popular from year to year?
- What properties are associated with high-revenue movies?
To run this project's code successfully, the following packages must be installed and imported into the project.
- numpy, pandas, matplotlib and seaborn:
All of them are considered standard packages within this course and are assumed to be available already. In case any of those packages is not available, please install it (using the method appropriate to the available setup) first.
- IMDbPy:
- As in any dataset, missing values are not uncommon. Most of the missing values in this dataset cannot be inferred using the standard pandas filling methods, because they are mainly characteristics of the movie being analyzed. For example, how would one fill in a missing 'genre' of a movie?
- It therefore became essential to obtain such data from another source. The source in this case is the IMDb website.
- There are many ways to scrape the website:
- Screen-scrape a webpage: requires a massive amount of work and is highly prone to errors.
- Use an API: most of the APIs found require registering for a key before they can be used. This imposes constraints on code sharing and reproducibility.
- Weapon of Choice: Use the IMDbPy Python package and its Documentation. This is a GPL-licensed Python package for retrieving and managing the data of the IMDb movie database. Follow the Documentation reference to install it so that this project's code can run.
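
The reasoning in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names `imdb_id` and `genres`, the pipe-separated genre format, and the helper names `fetch_genres`, `fill_missing_genres` and `stub_lookup` are all assumptions made for this example. The IMDb lookup relies on IMDbPy's `IMDb()` and `get_movie()` and needs network access, so the demo at the bottom swaps in an offline stand-in.

```python
import pandas as pd


def fetch_genres(imdb_id):
    """Look up a movie's genres on IMDb via IMDbPy (needs network access).

    Assumes ids of the form 'tt0133093'; IMDbPy expects the digits only.
    """
    from imdb import IMDb  # imported lazily so the rest runs without IMDbPy

    numeric_id = imdb_id[2:] if imdb_id.startswith("tt") else imdb_id
    movie = IMDb().get_movie(numeric_id)
    return "|".join(movie.get("genres", []))


def fill_missing_genres(df, lookup=fetch_genres):
    """Return a copy of df whose missing 'genres' are filled by lookup(imdb_id)."""
    filled = df.copy()
    missing = filled["genres"].isna() & filled["imdb_id"].notna()
    filled.loc[missing, "genres"] = filled.loc[missing, "imdb_id"].map(lookup)
    return filled


# Made-up placeholder rows; the ids and genres are not real data.
movies = pd.DataFrame({
    "imdb_id": ["tt-demo-1", "tt-demo-2", "tt-demo-3"],
    "genres": ["Drama", None, "Comedy"],
})

# A standard pandas fill just copies the previous row's genre,
# which says nothing about the movie whose genre is missing.
naive = movies["genres"].ffill()


def stub_lookup(imdb_id):
    """Offline stand-in for fetch_genres, used only in this demo."""
    return {"tt-demo-2": "Action|Thriller"}.get(imdb_id)


filled = fill_missing_genres(movies, lookup=stub_lookup)
print(naive.tolist())
print(filled["genres"].tolist())
```

Passing the lookup as a parameter keeps the filling logic testable offline; with IMDbPy installed and a network connection, calling `fill_missing_genres(movies)` would use `fetch_genres` instead, assuming the ids are real IMDb ids.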