# Project: How Do Movie Metrics Affect Viewer Ratings on IMDb?

## Table of Contents:
* [Introduction](#1)
* [Wrangling](#2)
* [Exploratory Visuals](#3)
* [Explanatory Visuals](#4)
* [Conclusion](#5)

## Introduction:<a class="anchor" id="1"></a>
Have you ever browsed IMDb for good movies to watch, sorted by rating? Or looked up a movie you just watched, only to find that it has a shockingly low or high viewer rating? What can we say about why a movie is rated high or low on IMDb?

We dive into a movie dataset hosted on Kaggle (the original IMDb data has since been replaced with TMDB ratings following a DMCA takedown: https://www.kaggle.com/tmdb/tmdb-movie-metadata/home) to see what's going on behind these scores!

In [18]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from patsy import dmatrices
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

%matplotlib inline

In [24]:
# Large dataset, we want to see all columns
pd.set_option('display.max_columns', None)

In [20]:
# read in .csv
df_og = pd.read_csv('parking-violations-issued-fiscal-year-2018.csv', low_memory = False)

## Wrangling:<a class="anchor" id="2"></a>

In [22]:
df = df_og.copy()

In [26]:
df.tail()

Unnamed: 0,Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
5906118,8723550041,CFG3656,NY,PAS,2018-12-25T00:00:00.000,70,4DSD,HONDA,T,17530,24890,13612,20181211.0,13.0,13,13,367416,T102,N,0155P,,NY,F,145,E 27th St,,0,408,C,,YYYYYYY,,,WH,,2012,,0,7,,,,
5906119,8723550053,W64JAB,NJ,PAS,2018-12-25T00:00:00.000,40,SUBN,JEEP,T,10210,0,0,88880088.0,13.0,13,13,367416,T102,N,0158P,,NY,I,W,3rd Ave,10ft N/of E 27th St,0,408,J3,,YYYYYYY,,,BLACK,,0,,5,7,,,,
5906120,8723550065,88122,NY,MED,2018-12-25T00:00:00.000,71,SUBN,KIA,T,10010,0,0,20190423.0,13.0,13,13,367416,T102,N,0209P,,NY,I,W,1st Ave,15ft N/of E 24th St,0,408,E2,,YYYYYYY,,,BK,,2018,,0,7,,,,
5906121,8723550089,FNJ4864,QB,PAS,2018-12-25T00:00:00.000,40,4DSD,MAZDA,T,17470,10210,10110,88880088.0,13.0,13,13,367416,T102,N,0219P,,NY,F,215,E 24th St,,0,408,D,,YYYYYYY,,,OTHER,,0,,0,7,,,,
5906122,8723550107,22799MH,NY,COM,2018-12-26T00:00:00.000,46,VAN,ME/BE,T,34270,10910,11010,88880088.0,10.0,10,10,367416,T102,N,1228P,,NY,F,420,W 25th St,,0,408,D1,,YYYYYYY,,,WHITE,,2015,,0,99,,,,


In [31]:
# Keep only the columns we want to work with; .copy() gives an independent
# DataFrame so later edits do not trigger SettingWithCopyWarning
df = df[['Registration State', 'Issue Date', 'Violation Time', 'Violation Code', 'Street Name', 'Sub Division',
        'Vehicle Body Type', 'Vehicle Make', 'Vehicle Color', 'Vehicle Year']].copy()

### Data Issues
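* column names: replace spaces with `_` and convert to lower case
* `Violation Code` is stored as int; cast it to string, since it is a categorical label rather than a quantity
* combine `Issue Date` and `Violation Time` into a single datetime column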
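
The three fixes listed above could be sketched roughly as follows. This is a minimal illustration on a two-row stand-in frame, not the notebook's actual cleaning code: the `sample` frame and the `issued_at` column name are hypothetical, and the time format (`HHMM` followed by `A` or `P`) is assumed from the values visible in the `df.tail()` output. Times that do not fit that pattern become `NaT` via `errors='coerce'`.

```python
import pandas as pd

# Hypothetical stand-in frame; values copied from the df.tail() output above
sample = pd.DataFrame({
    'Issue Date': ['2018-12-25T00:00:00.000', '2018-12-26T00:00:00.000'],
    'Violation Time': ['0155P', '1228P'],
    'Violation Code': [70, 46],
})

# 1. Column names: lower case, spaces replaced with underscores
sample.columns = (
    sample.columns.str.strip().str.lower().str.replace(' ', '_', regex=False)
)

# 2. Violation code: int -> string (it is a label, not a quantity)
sample['violation_code'] = sample['violation_code'].astype(str)

# 3. Combine date and time into one datetime column.
#    Assumed time format: 'HHMM' + 'A'/'P', so appending 'M' yields AM/PM.
date_part = pd.to_datetime(sample['issue_date']).dt.strftime('%Y-%m-%d')
time_part = sample['violation_time'] + 'M'
sample['issued_at'] = pd.to_datetime(
    date_part + ' ' + time_part,
    format='%Y-%m-%d %I%M%p',
    errors='coerce',  # malformed times become NaT instead of raising
)

print(sample.dtypes)
```

On the full dataset, it would be worth counting the `NaT` values in the combined column before dropping anything, since the share of malformed times is not known from the rows shown here.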

In [33]:
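# Drop exact duplicate rows, then rows with any missing value;
# reassigning avoids in-place modification of a slice of df_og
df = df.drop_duplicates().dropna()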

In [35]:
df_og.shape

(5906123, 43)