Final Project for 2020 Georgia Tech Data Analytics Bootcamp

Have you ever watched a movie and ended up disappointed with the outcome? Our project focused on designing models to predict IMDB movie ratings based on genre, year of release, budget, duration, and director.

Objective

For this project, we compared several models to find the best method for predicting a movie's IMDB rating from historical data. We sourced our data from Kaggle.com, which provides a set of CSV files with information on movies rated by IMDB users, covering releases from 1906 to 2019.

Link to Dataset: https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset/

Prototype/Inspirations: https://netflix.com

Heroku app: https://clarke-imdb.herokuapp.com/

Methods Used

Data Cleaning
Data Visualization

Technologies

Python
Pandas
Jupyter
JavaScript
D3
HTML
CSS
Machine Learning Models:
- Support Vector Machine (SVM)
- Deep Learning
- Logistic Regression
- Random Forest
DataTables
Flask
PostgreSQL

Project Description

Site Design

Our site was designed using HTML and CSS. We wanted it to be easy to navigate and fun to use. We highlighted important elements on the homepage and laid out our project process through navigation links.

Data Cleaning

Tableau Visualizations

Release Year Dashboard

Genre Dashboard

Director Dashboard

The directors shown here are the more popular ones, with the most votes and data. We also used director as one of the features in our prediction models. This dashboard compares director data in two views: gender vs. rating and budget vs. rating.

In the gender vs. rating view, ratings from 0 to 7.3 are shown in green and ratings from 7.4 to 8 in blue. All of these particular directors are male, and male voters voted for the highest-rated directors. On the far left of this view is Martin Scorsese, known for Goodfellas, who has the highest rating but not the highest movie budget.

In the budget vs. rating view, Anthony Russo, known for the Avengers and Captain America films, has the highest average budget at 230 million with an average vote of 8, yet he is not the highest-rated director. A higher budget therefore does not mean a director is the most highly rated.
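
The comparison above was built in Tableau. As a rough pandas equivalent, the sketch below aggregates budget and rating per director and checks whether the director with the highest average budget is also the highest rated. The column names (`director`, `budget`, `avg_vote`, `votes`) and every value in the table are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are NOT values from the IMDB dataset.
# Column names are assumptions about how the cleaned data might be laid out.
movies = pd.DataFrame({
    "director": ["Director A", "Director A", "Director B", "Director B", "Director C"],
    "budget":   [200e6, 260e6, 20e6, 40e6, 90e6],
    "avg_vote": [7.9, 8.1, 8.4, 8.6, 7.0],
    "votes":    [500_000, 700_000, 300_000, 400_000, 100_000],
})

# Aggregate per director: mean budget, mean rating, and total vote count.
by_director = movies.groupby("director").agg(
    avg_budget=("budget", "mean"),
    avg_rating=("avg_vote", "mean"),
    total_votes=("votes", "sum"),
)

# Find which director leads on each measure.
top_budget = by_director["avg_budget"].idxmax()
top_rating = by_director["avg_rating"].idxmax()

print(by_director)
print("Highest average budget:", top_budget)
print("Highest average rating:", top_rating)
```

In this toy table the two leaders differ, which is the same pattern the dashboard shows: the largest average budget does not guarantee the top rating.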

Machine Learning Model

Based on our Tableau analysis, we decided to predict a movie's rating class from release year, duration, budget, genre, and director. We started with a linear regression model and ElasticNet, but the accuracy was only around 10%, and we concluded that no amount of hyperparameter tuning would raise it much. For some models we tried using only the numerical features (year, duration, and budget), but the results were no better, so we moved forward with both numerical and categorical data.
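
To make the setup concrete, here is a minimal sketch of how a rating class and the two feature sets (numerical only versus numerical plus categorical) could be built with pandas. The rating cutoffs, the column names, and all rows are assumptions for illustration; the project's actual thresholds are not stated here.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions.
movies = pd.DataFrame({
    "year":     [1995, 2005, 2012, 2018],
    "duration": [120, 150, 95, 180],
    "budget":   [20e6, 150e6, 5e6, 300e6],
    "genre":    ["Drama", "Action", "Comedy", "Action"],
    "director": ["Director A", "Director B", "Director C", "Director B"],
    "avg_vote": [8.5, 8.2, 5.1, 6.8],
})

# Hypothetical cutoffs (not from the project):
# (0, 6] -> bad, (6, 8] -> good, (8, 10] -> excellent.
movies["rating_class"] = pd.cut(
    movies["avg_vote"],
    bins=[0, 6, 8, 10],
    labels=["bad", "good", "excellent"],
)

numeric_cols = ["year", "duration", "budget"]
categorical_cols = ["genre", "director"]

# Option 1: numerical features only.
X_numeric = movies[numeric_cols]

# Option 2: numerical features plus dummy (one-hot) encoded categorical features.
X_full = pd.get_dummies(
    movies[numeric_cols + categorical_cols],
    columns=categorical_cols,
)

print(movies[["avg_vote", "rating_class"]])
print(X_full.columns.tolist())
```

Dummy encoding adds one column per distinct genre and director, so the full feature matrix grows quickly when there are many directors.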

Deep Learning

Support Vector Machine (SVM)

For the SVM, we prepared the data with label encoding and dummy encoding to improve accuracy on our target rating classes: bad, good, and excellent. The SVM reached a training accuracy of 0.91 (about 91%) and a testing accuracy of 0.64 (about 64%).
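
The encoding and training steps can be sketched roughly as follows, assuming scikit-learn. This is not the project's actual code: the data is randomly generated, the column names are assumptions, and the kernel, scaling step, and split ratio are illustrative choices. Because the labels here are random, the printed scores carry no meaning beyond showing how the two accuracies are computed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 300

# Synthetic data for illustration only; not drawn from the IMDB dataset.
movies = pd.DataFrame({
    "year": rng.integers(1950, 2020, n),
    "duration": rng.integers(70, 200, n),
    "budget": rng.uniform(1e5, 2e8, n),
    "genre": rng.choice(["Drama", "Comedy", "Action"], n),
    "director": rng.choice(["Director A", "Director B", "Director C"], n),
    "rating_class": rng.choice(["bad", "good", "excellent"], n),
})

# Dummy (one-hot) encode the categorical features.
X = pd.get_dummies(
    movies.drop(columns="rating_class"),
    columns=["genre", "director"],
    dtype=float,
)

# Label-encode the target classes as integers.
encoder = LabelEncoder()
y = encoder.fit_transform(movies["rating_class"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# SVMs are sensitive to feature scale, so standardize using training statistics.
scaler = StandardScaler().fit(X_train)
model = SVC(kernel="rbf")
model.fit(scaler.transform(X_train), y_train)

train_acc = model.score(scaler.transform(X_train), y_train)
test_acc = model.score(scaler.transform(X_test), y_test)
print(f"SVM training accuracy: {train_acc:.2f}")
print(f"SVM testing accuracy: {test_acc:.2f}")
```

Reporting both scores side by side is a quick check for overfitting: a training accuracy well above the testing accuracy suggests the model has partly memorized the training set.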

Logistic Regression

Random Forest
