# Project: How Do Movie Metrics Affect Viewer Ratings on IMDb?

## Table of Contents:
* [Introduction](#1)
* [Wrangling](#2)
* [Exploratory Visuals](#3)
* [Explanatory Visuals](#4)
* [Conclusion](#5)

## Introduction:<a class="anchor" id="1"></a>
Have you ever browsed IMDb for good movies to watch, sorted by rating? Or looked up a movie you just watched, only to find that it has a shockingly low or high viewer rating? What can we say about why a movie is rated high or low on IMDb?

We dive into a dataset hosted on Kaggle to see what's going on behind these scores! (The original dataset has since been replaced with TMDB ratings due to a DMCA takedown; see https://www.kaggle.com/tmdb/tmdb-movie-metadata/home)

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from patsy import dmatrices
import statsmodels.api as sm
from datetime import datetime, timedelta
from statsmodels.stats.outliers_influence import variance_inflation_factor

%matplotlib inline

In [10]:
# Large dataset, we want to see all columns
pd.set_option('display.max_columns', None)

In [11]:
# read in .csv
df_og = pd.read_csv('parking-violations-issued-fiscal-year-2018.csv', low_memory = False)

## Wrangling:<a class="anchor" id="2"></a>

In [12]:
df = df_og.copy()

In [13]:
# Taking the columns we want to work with
df = df[['Registration State', 'Issue Date', 'Violation Time', 'Violation Code', 'Street Name', 'Sub Division',
        'Vehicle Body Type', 'Vehicle Make', 'Vehicle Color', 'Vehicle Year']]

In [14]:
df.head()

Unnamed: 0,Registration State,Issue Date,Violation Time,Violation Code,Street Name,Sub Division,Vehicle Body Type,Vehicle Make,Vehicle Color,Vehicle Year
0,NY,2018-07-03T00:00:00.000,0811P,14,HANSON PLACE,D1,SDN,HONDA,BLUE,2006
1,NY,2018-06-28T00:00:00.000,1145A,46,AUSTIN ST,C,SDN,NISSA,GRY,2017
2,NY,2018-06-08T00:00:00.000,0355P,24,GREAT KILLS BOAT LAU,D5,SUBN,JEEP,GREEN,0
3,NC,2018-06-07T00:00:00.000,0123P,24,GREAT KILLS PARK BOA,D5,P-U,FORD,WHITE,0
4,NY,2018-06-29T00:00:00.000,0514P,17,HANSON PLACE,C4,SUBN,HYUND,GREEN,2007


### Data Issues
* Column names: convert to lower case and replace spaces with `_`
* `Violation Code` is stored as int; cast it to string
* Combine `Issue Date` and `Violation Time` into a single datetime
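
The first two issues above could be handled along these lines. This is a minimal sketch on a two-row stand-in frame (`sample` is a hypothetical name, with values copied from the `head()` output), not the notebook's final cleaning step.

```python
import pandas as pd

# Hypothetical stand-in frame using two of the notebook's column names
sample = pd.DataFrame({
    'Registration State': ['NY', 'NC'],
    'Violation Code': [14, 24],
})

# Column names: lower case, with spaces replaced by underscores
sample.columns = sample.columns.str.lower().str.replace(' ', '_', regex=False)

# The violation code is a label rather than a quantity, so store it as a string
sample['violation_code'] = sample['violation_code'].astype(str)

print(sample.columns.tolist())
print(sample['violation_code'].tolist())
```

The same two statements should apply to `df` unchanged, assuming its column names contain only letters and single spaces as shown in the `head()` output.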

In [15]:
df.drop_duplicates(inplace = True)
df.dropna(inplace = True)

In [16]:
df_og.shape

(5906123, 43)

In [24]:
# Experimenting with datetime and timedelta objects to see how they combine

x = datetime(2020, 5, 17)
x = x + timedelta(hours=9, minutes=59)

print(x)

2020-05-17 09:59:00


In [18]:
# Convert Issue Date to datetime object
df['Issue Date'] = pd.to_datetime(df['Issue Date'])

In [19]:
df.head()

Unnamed: 0,Registration State,Issue Date,Violation Time,Violation Code,Street Name,Sub Division,Vehicle Body Type,Vehicle Make,Vehicle Color,Vehicle Year
0,NY,2018-07-03,0811P,14,HANSON PLACE,D1,SDN,HONDA,BLUE,2006
1,NY,2018-06-28,1145A,46,AUSTIN ST,C,SDN,NISSA,GRY,2017
2,NY,2018-06-08,0355P,24,GREAT KILLS BOAT LAU,D5,SUBN,JEEP,GREEN,0
3,NC,2018-06-07,0123P,24,GREAT KILLS PARK BOA,D5,P-U,FORD,WHITE,0
4,NY,2018-06-29,0514P,17,HANSON PLACE,C4,SUBN,HYUND,GREEN,2007


In [21]:
# Use .loc instead of chained indexing so the assignment targets df itself
df.loc[0, 'Issue Date'] = df.loc[0, 'Issue Date'] + timedelta(hours=9, minutes=59)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [22]:
df.head(1)

Unnamed: 0,Registration State,Issue Date,Violation Time,Violation Code,Street Name,Sub Division,Vehicle Body Type,Vehicle Make,Vehicle Color,Vehicle Year
0,NY,2018-07-03 09:59:00,0811P,14,HANSON PLACE,D1,SDN,HONDA,BLUE,2006


In [None]:
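# Combine issue date and violation time (format 'HHMM' + 'A'/'P', e.g. '0811P')
vt = df['Violation Time'].str.strip().str.upper()
hour = pd.to_numeric(vt.str[:2], errors='coerce')     # first 2 characters -> hour
minute = pd.to_numeric(vt.str[2:4], errors='coerce')  # next 2 characters -> minute
# 12-hour to 24-hour: 12xxA -> 00xx, 12xxP stays 12xx, other PM hours get +12
hour = hour % 12 + np.where(vt.str[-1] == 'P', 12, 0)
# normalize() resets any time already added during the experiment above;
# rows with malformed times end up as NaT
df['Issue Date'] = (df['Issue Date'].dt.normalize()
                    + pd.to_timedelta(hour, unit='h')
                    + pd.to_timedelta(minute, unit='m'))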