![example](images/director_shot.jpeg)

# Project Title

**Authors:** Ian Butler, Ashli Dougherty, Nicolas Pierce
***

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

***
Questions to consider:
* What are the business's pain points related to this project?
* How did you pick the data analysis question(s) that you did?
* Why are these questions important from a business perspective?
***

## Data Understanding

Describe the data being used for this project.
***
Questions to consider:
* Where did the data come from, and how do they relate to the data analysis questions?
* What do the data represent? Who is in the sample and what variables are included?
* What is the target variable?
* What are the properties of the variables you intend to use?
***

The cell below imports the standard packages and extracts the zipped IMDb database (im.db.zip) into ./zippedData.

In [84]:
# Import standard packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

# from Kevin Rivera
from zipfile import ZipFile
# path to the zipped IMDb database
file_name = "./zippedData/im.db.zip"
# open the zip file in read mode (named zip_file to avoid shadowing the built-in zip)
with ZipFile(file_name, 'r') as zip_file:
    # print the contents of the zip file
    zip_file.printdir()
    # extract everything to the same directory as the other data
    print('Extracting all the files now...')
    zip_file.extractall(path='./zippedData')
    print('Done!')

# render matplotlib figures inline in the notebook
%matplotlib inline

File Name                                             Modified             Size
im.db                                          2021-12-20 16:31:38    169443328
Extracting all the files now...
Done!


In [85]:
# Open a connection to the extracted SQLite database
conn = sqlite3.connect('./zippedData/im.db')
cursor = conn.cursor()
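
Before querying individual tables, it can help to list what the database contains. The sketch below is illustrative only: it builds a small in-memory database (the `demo_conn` name and its two empty tables are assumptions standing in for im.db) and reads table names from SQLite's built-in `sqlite_master` catalog.

```python
import sqlite3

# Hypothetical in-memory database standing in for ./zippedData/im.db
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT)")
demo_conn.execute("CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL)")

# sqlite_master is SQLite's schema catalog; one row per table, index, or view
table_names = [
    row[0]
    for row in demo_conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(table_names)
demo_conn.close()
```

Running the same SELECT against `conn` should list the tables actually present in im.db.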

In [86]:
testq = """

select
    *
from
    movie_basics

"""

In [87]:
testq_results = pd.read_sql(testq, conn)

In [88]:
testq_results.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


In [89]:
# Here you run your code to explore the data

## IAN'S DATA EXPLORATION BEGINS HERE

Define a SQL query that selects every row and column of the movie_basics table.

In [90]:
imdb_movie_basics_query = """
select
    *
from
    movie_basics
"""

Run the movie_basics query and load the result into a pandas data frame.

In [91]:
movie_basics_df = pd.read_sql(imdb_movie_basics_query, conn)

Render the head of the movie_basics pandas data frame.

In [92]:
movie_basics_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama"
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama"
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama"
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy"


Confirm that the query result is a pandas data frame.

In [93]:
type(movie_basics_df)

pandas.core.frame.DataFrame

Explore the movie_basics data frame.

In [94]:
movie_basics_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [95]:
movie_basics_df.describe()

Unnamed: 0,start_year,runtime_minutes
count,146144.0,114405.0
mean,2014.621798,86.187247
std,2.733583,166.36059
min,2010.0,1.0
25%,2012.0,70.0
50%,2015.0,87.0
75%,2017.0,99.0
max,2115.0,51420.0
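
The summary above shows a maximum start_year of 2115 and a maximum runtime_minutes of 51420, which look implausible. One possible way to screen such rows is sketched below on a toy frame; the `demo_basics` data and the `max_year` and `max_runtime` limits are assumptions chosen for illustration, not values taken from this project.

```python
import pandas as pd

# Toy stand-in for movie_basics_df (values are illustrative)
demo_basics = pd.DataFrame({
    "movie_id": ["a", "b", "c", "d", "e"],
    "start_year": [2013, 2019, 2115, 2017, 2018],
    "runtime_minutes": [175.0, 114.0, 90.0, 51420.0, None],
})

# Assumed plausibility limits; adjust to the business question
max_year = 2022
max_runtime = 600

runtime = demo_basics["runtime_minutes"]
plausible = demo_basics[
    (demo_basics["start_year"] <= max_year)
    # keep rows with a missing runtime so they can be handled separately
    & (runtime.isna() | (runtime <= max_runtime))
]
print(plausible)
```

Rows with a missing runtime are kept here on purpose, so that the decision to drop or impute them stays a separate step.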


Define a SQL query that selects every row and column of the movie_ratings table.

In [96]:
imdb_movie_ratings_query = """
select
    *
from
    movie_ratings
"""

Run the movie_ratings query and load the result into a pandas data frame.

In [97]:
movie_ratings_df = pd.read_sql(imdb_movie_ratings_query, conn)

Render the head of the movie_ratings data frame.

In [98]:
movie_ratings_df.head()

Unnamed: 0,movie_id,averagerating,numvotes
0,tt10356526,8.3,31
1,tt10384606,8.9,559
2,tt1042974,6.4,20
3,tt1043726,4.2,50352
4,tt1060240,6.5,21


Confirm that the query result is a pandas data frame.

In [99]:
type(movie_ratings_df)

pandas.core.frame.DataFrame

Explore the movie_ratings data frame.

In [100]:
movie_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [101]:
movie_ratings_df.describe()

Unnamed: 0,averagerating,numvotes
count,73856.0,73856.0
mean,6.332729,3523.662
std,1.474978,30294.02
min,1.0,5.0
25%,5.5,14.0
50%,6.5,49.0
75%,7.4,282.0
max,10.0,1841066.0


In [125]:
# Keep movies with more votes than the 75th percentile (282 votes, from describe() above)
fourth_quartile_movie_ratings_df = movie_ratings_df[
    movie_ratings_df['numvotes'] > 282]
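
The vote cutoff used in the filter is the 75th percentile reported by describe(). As a sketch of an alternative, the threshold can be computed with `Series.quantile` so that it follows the data if the table changes. The `demo_ratings` frame below is a toy stand-in with made-up rows.

```python
import pandas as pd

# Toy stand-in for movie_ratings_df (values are illustrative)
demo_ratings = pd.DataFrame({
    "movie_id": ["a", "b", "c", "d", "e"],
    "numvotes": [5, 14, 49, 282, 50352],
})

# 75th percentile of the vote counts (linear interpolation by default)
vote_threshold = demo_ratings["numvotes"].quantile(0.75)

# Strictly greater than the threshold, matching the filter used in this notebook
top_quartile = demo_ratings[demo_ratings["numvotes"] > vote_threshold]
print(vote_threshold)
print(top_quartile)
```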

In [126]:
fourth_quartile_movie_ratings_df.head()

Unnamed: 0,movie_id,averagerating,numvotes
1,tt10384606,8.9,559
3,tt1043726,4.2,50352
5,tt1069246,6.2,326
6,tt1094666,7.0,1613
7,tt1130982,6.4,571


In [127]:
fourth_quartile_movie_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18445 entries, 1 to 73844
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       18445 non-null  object 
 1   averagerating  18445 non-null  float64
 2   numvotes       18445 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 576.4+ KB


In [128]:
fourth_quartile_movie_ratings_df.describe()

Unnamed: 0,averagerating,numvotes
count,18445.0,18445.0
mean,6.025422,13946.07
std,1.297828,59414.11
min,1.0,283.0
25%,5.3,503.0
50%,6.2,1087.0
75%,6.9,3705.0
max,9.9,1841066.0


Define a SQL query that inner-joins the movie_basics table to the movie_ratings table on movie_id.<br>List the columns explicitly so that movie_id is not duplicated, and rename averagerating and numvotes to average_rating and num_votes.

In [129]:
imdb_movie_basics_and_ratings_query = """
select
    mb.movie_id,
    mb.primary_title,
    mb.original_title,
    mb.start_year,
    mb.runtime_minutes,
    mb.genres,
    mr.averagerating as average_rating,
    mr.numvotes as num_votes
from
    movie_basics as mb
join movie_ratings as mr
    on mb.movie_id = mr.movie_id
"""

Run the joined query and load the result into a pandas data frame.

In [130]:
movie_basics_and_ratings_df = pd.read_sql(imdb_movie_basics_and_ratings_query, conn)
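
The same inner join can also be expressed in pandas. The sketch below uses toy frames (`demo_basics` and `demo_ratings` are assumptions) and passes `validate="one_to_one"`, which makes `merge` raise an error if movie_id repeats on either side. Whether movie_id really is unique in both tables is an assumption worth checking rather than a given.

```python
import pandas as pd

# Toy stand-ins for the two IMDb tables (values are illustrative)
demo_basics = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "primary_title": ["A", "B", "C"],
})
demo_ratings = pd.DataFrame({
    "movie_id": ["tt1", "tt3"],
    "averagerating": [7.0, 6.1],
    "numvotes": [77, 13],
})

joined = demo_basics.merge(
    # rename to snake_case, mirroring the aliases in the SQL query
    demo_ratings.rename(
        columns={"averagerating": "average_rating", "numvotes": "num_votes"}
    ),
    on="movie_id",
    how="inner",
    validate="one_to_one",  # fail loudly if movie_id is duplicated on either side
)
print(joined)
```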

Render the head of the movie_basics_and_ratings data frame.

In [131]:
movie_basics_and_ratings_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,average_rating,num_votes
0,tt0063540,Sunghursh,Sunghursh,2013,175.0,"Action,Crime,Drama",7.0,77
1,tt0066787,One Day Before the Rainy Season,Ashad Ka Ek Din,2019,114.0,"Biography,Drama",7.2,43
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
3,tt0069204,Sabse Bada Sukh,Sabse Bada Sukh,2018,,"Comedy,Drama",6.1,13
4,tt0100275,The Wandering Soap Opera,La Telenovela Errante,2017,80.0,"Comedy,Drama,Fantasy",6.5,119


Confirm the data type of the movie_basics_and_ratings_df data frame.

In [132]:
type(movie_basics_and_ratings_df)

pandas.core.frame.DataFrame

Explore the movie_basics_and_ratings_df data frame.

In [133]:
movie_basics_and_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         73856 non-null  object 
 1   primary_title    73856 non-null  object 
 2   original_title   73856 non-null  object 
 3   start_year       73856 non-null  int64  
 4   runtime_minutes  66236 non-null  float64
 5   genres           73052 non-null  object 
 6   average_rating   73856 non-null  float64
 7   num_votes        73856 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 4.5+ MB


In [134]:
movie_basics_and_ratings_df.describe()

Unnamed: 0,start_year,runtime_minutes,average_rating,num_votes
count,73856.0,66236.0,73856.0,73856.0
mean,2014.276132,94.65404,6.332729,3523.662
std,2.614807,208.574111,1.474978,30294.02
min,2010.0,3.0,1.0,5.0
25%,2012.0,81.0,5.5,14.0
50%,2014.0,91.0,6.5,49.0
75%,2016.0,104.0,7.4,282.0
max,2019.0,51420.0,10.0,1841066.0


In [135]:
# Keep movies with more votes than the 75th percentile (282 votes, from describe() above)
fourth_quartile_movie_basics_and_ratings_df = movie_basics_and_ratings_df[
    movie_basics_and_ratings_df['num_votes'] > 282
]

In [136]:
fourth_quartile_movie_basics_and_ratings_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,average_rating,num_votes
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
7,tt0146592,Pál Adrienn,Pál Adrienn,2010,136.0,Drama,6.8,451
12,tt0176694,The Tragedy of Man,Az ember tragédiája,2011,160.0,"Animation,Drama,History",7.8,584
16,tt0249516,Foodfight!,Foodfight!,2012,91.0,"Action,Animation,Comedy",1.9,8248
27,tt0293069,Dark Blood,Dark Blood,2012,86.0,Thriller,6.6,1053


In [137]:
fourth_quartile_movie_basics_and_ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18445 entries, 2 to 73849
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         18445 non-null  object 
 1   primary_title    18445 non-null  object 
 2   original_title   18445 non-null  object 
 3   start_year       18445 non-null  int64  
 4   runtime_minutes  18282 non-null  float64
 5   genres           18438 non-null  object 
 6   average_rating   18445 non-null  float64
 7   num_votes        18445 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 1.3+ MB


In [138]:
fourth_quartile_movie_basics_and_ratings_df.describe()

Unnamed: 0,start_year,runtime_minutes,average_rating,num_votes
count,18445.0,18282.0,18445.0,18445.0
mean,2014.197343,103.046166,6.025422,13946.07
std,2.561388,21.040001,1.297828,59414.11
min,2010.0,39.0,1.0,283.0
25%,2012.0,90.0,5.3,503.0
50%,2014.0,98.0,6.2,1087.0
75%,2016.0,112.0,6.9,3705.0
max,2019.0,467.0,9.9,1841066.0


In [139]:
fourth_quartile_movie_basics_and_ratings_df['primary_title'].duplicated(keep=False).value_counts()

False    17477
True       968
Name: primary_title, dtype: int64

In [140]:
# Drop every movie whose primary_title appears more than once (keep=False flags all copies)
fourth_quartile_movie_basics_and_ratings_df_cleaned = fourth_quartile_movie_basics_and_ratings_df[
    ~fourth_quartile_movie_basics_and_ratings_df['primary_title'].duplicated(keep=False)
]
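
Filtering on `duplicated(keep=False)` removes every movie that shares a title with another, including the one that might be the intended match. The toy example below (made-up rows in a hypothetical `demo` frame) contrasts that with one possible alternative: keeping the most-voted movie per title. Which rule fits better depends on the business question; this is only a sketch.

```python
import pandas as pd

# Toy frame with one repeated title (values are illustrative)
demo = pd.DataFrame({
    "primary_title": ["On the Road", "On the Road", "Tangled"],
    "num_votes": [37886, 584, 366366],
})

# keep=False marks every occurrence of a repeated title, not just the later ones
is_repeated = demo["primary_title"].duplicated(keep=False)

# Approach used in this notebook: drop all movies with a repeated title
only_unique_titles = demo[~is_repeated]

# Alternative sketch: keep the row with the most votes for each title
most_voted_per_title = (
    demo.sort_values("num_votes", ascending=False)
        .drop_duplicates(subset="primary_title", keep="first")
)
print(only_unique_titles)
print(most_voted_per_title)
```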

In [141]:
fourth_quartile_movie_basics_and_ratings_df_cleaned.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,average_rating,num_votes
2,tt0069049,The Other Side of the Wind,The Other Side of the Wind,2018,122.0,Drama,6.9,4517
7,tt0146592,Pál Adrienn,Pál Adrienn,2010,136.0,Drama,6.8,451
12,tt0176694,The Tragedy of Man,Az ember tragédiája,2011,160.0,"Animation,Drama,History",7.8,584
16,tt0249516,Foodfight!,Foodfight!,2012,91.0,"Action,Animation,Comedy",1.9,8248
27,tt0293069,Dark Blood,Dark Blood,2012,86.0,Thriller,6.6,1053


In [142]:
fourth_quartile_movie_basics_and_ratings_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17477 entries, 2 to 73849
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         17477 non-null  object 
 1   primary_title    17477 non-null  object 
 2   original_title   17477 non-null  object 
 3   start_year       17477 non-null  int64  
 4   runtime_minutes  17316 non-null  float64
 5   genres           17470 non-null  object 
 6   average_rating   17477 non-null  float64
 7   num_votes        17477 non-null  int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 1.2+ MB


In [143]:
fourth_quartile_movie_basics_and_ratings_df_cleaned.describe()

Unnamed: 0,start_year,runtime_minutes,average_rating,num_votes
count,17477.0,17316.0,17477.0,17477.0
mean,2014.198947,103.044641,6.032786,14150.34
std,2.559805,21.141866,1.299882,60404.23
min,2010.0,39.0,1.0,283.0
25%,2012.0,90.0,5.3,502.0
50%,2014.0,98.0,6.2,1077.0
75%,2016.0,112.0,7.0,3671.0
max,2019.0,467.0,9.9,1841066.0


Load the gzipped movie_gross CSV into a pandas data frame.

In [121]:
movie_gross_df = pd.read_csv('./zippedData/bom.movie_gross.csv.gz')

Render the movie_gross_df data frame (pandas truncates the display to the first and last rows).

In [122]:
movie_gross_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


Confirm the data type of the movie_gross data frame.

In [123]:
type(movie_gross_df)

pandas.core.frame.DataFrame

Merge the IMDb data (movie basics and ratings) with movie_gross_df by matching primary_title to title.

## MESS AROUND BELOW HERE

In [78]:
fourth_quartile_movie_basics_ratings_and_gross_df_cleaned = pd.merge(
    left=fourth_quartile_movie_basics_and_ratings_df_cleaned,
    right=movie_gross_df,
    left_on='primary_title',
    right_on='title')

In [144]:
fourth_quartile_movie_basics_ratings_and_gross_df_cleaned

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,average_rating,num_votes,title,studio,domestic_gross,foreign_gross,year
0,tt0398286,Tangled,Tangled,2010,100.0,"Adventure,Animation,Comedy",7.8,366366,Tangled,BV,200800000.0,391000000,2010
1,tt0435761,Toy Story 3,Toy Story 3,2010,103.0,"Adventure,Animation,Comedy",8.3,682218,Toy Story 3,BV,415000000.0,652000000,2010
2,tt0446029,Scott Pilgrim vs. the World,Scott Pilgrim vs. the World,2010,112.0,"Action,Comedy,Fantasy",7.5,339338,Scott Pilgrim vs. the World,Uni.,31500000.0,16100000,2010
3,tt0451279,Wonder Woman,Wonder Woman,2017,141.0,"Action,Adventure,Fantasy",7.5,487527,Wonder Woman,WB,412600000.0,409300000,2017
4,tt0454876,Life of Pi,Life of Pi,2012,127.0,"Adventure,Drama,Fantasy",7.9,535836,Life of Pi,Fox,125000000.0,484000000,2012
...,...,...,...,...,...,...,...,...,...,...,...,...,...
379,tt8671762,Jackpot,Jackpot,2018,150.0,"Comedy,Romance",7.8,18,Jackpot,DR,800.0,1100000,2014
380,tt8851190,Red,Red,2018,90.0,Drama,8.1,26,Red,Sum.,90400000.0,108600000,2010
381,tt9042690,The Negotiation,The Negotiation,2018,89.0,"Documentary,History,War",7.6,43,The Negotiation,CJ,111000.0,,2018
382,tt9151704,Burn the Stage: The Movie,Burn the Stage: The Movie,2018,84.0,"Documentary,Music",8.8,2067,Burn the Stage: The Movie,Trafalgar,4200000.0,16100000,2018


In [149]:
fourth_quartile_movie_basics_ratings_and_gross_df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 384 entries, 0 to 383
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         384 non-null    object 
 1   primary_title    384 non-null    object 
 2   original_title   384 non-null    object 
 3   start_year       384 non-null    int64  
 4   runtime_minutes  377 non-null    float64
 5   genres           384 non-null    object 
 6   average_rating   384 non-null    float64
 7   num_votes        384 non-null    int64  
 8   title            384 non-null    object 
 9   studio           383 non-null    object 
 10  domestic_gross   381 non-null    float64
 11  foreign_gross    232 non-null    object 
 12  year             384 non-null    int64  
dtypes: float64(3), int64(3), object(7)
memory usage: 42.0+ KB
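
The info() output shows foreign_gross stored as object (strings) while domestic_gross is float64, so the two columns cannot yet be compared or summed. A minimal sketch of a conversion follows, assuming the strings may contain thousands separators; the `demo_gross` frame and the `foreign_gross_num` column name are hypothetical.

```python
import pandas as pd

# Toy stand-in for movie_gross_df; the comma-formatted string is an assumption
demo_gross = pd.DataFrame({
    "title": ["Toy Story 3", "Avengers: Infinity War", "The Quake"],
    "foreign_gross": ["652000000", "1,369.5", None],
})

# Strip thousands separators, then convert; unparseable values become NaN
demo_gross["foreign_gross_num"] = pd.to_numeric(
    demo_gross["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)
print(demo_gross)
```

Values such as 1369.5 look out of scale next to the other rows, so the units would need a separate check before any totals are computed.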


## WORKING HERE WHEN BACK FROM LUNCH

In [148]:
fourth_quartile_movie_basics_ratings_and_gross_df_cleaned.sort_values(by='domestic_gross', ascending=False)

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,average_rating,num_votes,title,studio,domestic_gross,foreign_gross,year
288,tt4154756,Avengers: Infinity War,Avengers: Infinity War,2018,149.0,"Action,Adventure,Sci-Fi",8.5,670926,Avengers: Infinity War,BV,678800000.0,1369.5,2018
254,tt3606756,Incredibles 2,Incredibles 2,2018,118.0,"Action,Adventure,Animation",7.7,203510,Incredibles 2,BV,608600000.0,634200000,2018
265,tt3748528,Rogue One: A Star Wars Story,Rogue One,2016,133.0,"Action,Adventure,Sci-Fi",7.8,478592,Rogue One: A Star Wars Story,BV,532200000.0,523900000,2016
53,tt1345836,The Dark Knight Rises,The Dark Knight Rises,2012,164.0,"Action,Thriller",8.4,1387769,The Dark Knight Rises,WB,448100000.0,636800000,2012
142,tt1951264,The Hunger Games: Catching Fire,The Hunger Games: Catching Fire,2013,146.0,"Action,Adventure,Sci-Fi",7.5,575455,The Hunger Games: Catching Fire,LGF,424700000.0,440300000,2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...
83,tt1576702,Skin Trade,Skin Trade,2010,78.0,Documentary,8.8,31,Skin Trade,Magn.,1200.0,,2015
379,tt8671762,Jackpot,Jackpot,2018,150.0,"Comedy,Romance",7.8,18,Jackpot,DR,800.0,1100000,2014
82,tt1570982,Celine: Through the Eyes of the World,Celine: Through the Eyes of the World,2010,120.0,"Documentary,Music",7.9,349,Celine: Through the Eyes of the World,Sony,,119000,2010
100,tt1667130,The Green Wave,The Green Wave,2010,80.0,Documentary,7.6,290,The Green Wave,RF,,70100,2012


## PICK UP HERE: THE FOLLOWING DATA HAS DUPLICATE TITLES THAT STILL NEED TO BE ADDRESSED

In [21]:
movie_basics_ratings_and_gross_df = pd.merge(
    left=movie_basics_and_ratings_df,
    right=movie_gross_df,
    left_on='primary_title',
    right_on='title')

Render the head of the movie_basics_ratings_and_gross_df data frame.

In [22]:
movie_basics_ratings_and_gross_df.head()

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,average_rating,num_votes,title,studio,domestic_gross,foreign_gross,year
0,tt0315642,Wazir,Wazir,2016,103.0,"Action,Crime,Drama",7.1,15378,Wazir,Relbig.,1100000.0,,2016
1,tt0337692,On the Road,On the Road,2012,124.0,"Adventure,Drama,Romance",6.1,37886,On the Road,IFC,744000.0,8000000.0,2012
2,tt4339118,On the Road,On the Road,2014,89.0,Drama,6.0,6,On the Road,IFC,744000.0,8000000.0,2012
3,tt5647250,On the Road,On the Road,2016,121.0,Drama,5.7,127,On the Road,IFC,744000.0,8000000.0,2012
4,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",7.3,275300,The Secret Life of Walter Mitty,Fox,58200000.0,129900000.0,2013


In [27]:
movie_basics_ratings_and_gross_df['title'].duplicated()

0       False
1       False
2        True
3        True
4       False
        ...  
3022    False
3023    False
3024    False
3025    False
3026    False
Name: title, Length: 3027, dtype: bool

In [23]:
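# Preview the frame without the redundant title column (the result is not assigned, so the original is unchanged)
movie_basics_ratings_and_gross_df.drop('title', axis=1)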

Unnamed: 0,movie_id,primary_title,original_title,start_year,runtime_minutes,genres,average_rating,num_votes,studio,domestic_gross,foreign_gross,year
0,tt0315642,Wazir,Wazir,2016,103.0,"Action,Crime,Drama",7.1,15378,Relbig.,1100000.0,,2016
1,tt0337692,On the Road,On the Road,2012,124.0,"Adventure,Drama,Romance",6.1,37886,IFC,744000.0,8000000,2012
2,tt4339118,On the Road,On the Road,2014,89.0,Drama,6.0,6,IFC,744000.0,8000000,2012
3,tt5647250,On the Road,On the Road,2016,121.0,Drama,5.7,127,IFC,744000.0,8000000,2012
4,tt0359950,The Secret Life of Walter Mitty,The Secret Life of Walter Mitty,2013,114.0,"Adventure,Comedy,Drama",7.3,275300,Fox,58200000.0,129900000,2013
...,...,...,...,...,...,...,...,...,...,...,...,...
3022,tt8331988,The Chambermaid,La camarista,2018,102.0,Drama,7.1,147,FM,300.0,,2015
3023,tt8404272,How Long Will I Love U,Chao shi kong tong ju,2018,101.0,Romance,6.5,607,WGUSA,747000.0,82100000,2018
3024,tt8427036,Helicopter Eela,Helicopter Eela,2018,135.0,Drama,5.4,673,Eros,72000.0,,2018
3025,tt9078374,Last Letter,"Ni hao, Zhihua",2018,114.0,"Drama,Romance",6.4,322,CL,181000.0,,2018


In [24]:
movie_basics_ratings_and_gross_df['primary_title'].value_counts()

One Day                    6
Split                      6
Eden                       6
Gold                       6
Anna                       6
                          ..
Raavan                     1
Wrath of the Titans        1
A Late Quartet             1
Silver Linings Playbook    1
Un Padre No Tan Padre      1
Name: primary_title, Length: 2598, dtype: int64
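
Many of the repeated titles come from one gross record matching several IMDb movies that share a name, as with the three "On the Road" rows above. One possible way to narrow those matches is to merge on both title and year. The sketch below uses toy frames built from the "On the Road" rows shown earlier; it assumes start_year and year refer to the same release year, which does not hold for every movie.

```python
import pandas as pd

# IMDb side: three different movies that share a title
demo_imdb = pd.DataFrame({
    "movie_id": ["tt0337692", "tt4339118", "tt5647250"],
    "primary_title": ["On the Road", "On the Road", "On the Road"],
    "start_year": [2012, 2014, 2016],
})
# Gross side: a single record for the 2012 release
demo_bom = pd.DataFrame({
    "title": ["On the Road"],
    "year": [2012],
    "domestic_gross": [744000.0],
})

# Title-only merge: the one gross record is attached to all three movies
title_only = pd.merge(
    demo_imdb, demo_bom, left_on="primary_title", right_on="title"
)

# Title-and-year merge: only the movie from the matching year is kept
title_and_year = pd.merge(
    demo_imdb,
    demo_bom,
    left_on=["primary_title", "start_year"],
    right_on=["title", "year"],
)
print(len(title_only), len(title_and_year))
```

A strict year match would also drop genuine pairs whose years differ between the two sources, so the unmatched rows deserve a look before this rule is adopted.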

## IAN'S DATA EXPLORATION ENDS HERE

## Data Preparation

Describe and justify the process for preparing the data for analysis.

***
Questions to consider:
* Were there variables you dropped or created?
* How did you address missing values or outliers?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to clean the data

## Data Modeling
Describe and justify the process for analyzing or modeling the data.

***
Questions to consider:
* How did you analyze or model the data?
* How did you iterate on your initial approach to make it better?
* Why are these choices appropriate given the data and the business problem?
***

In [None]:
# Here you run your code to model the data


## Evaluation
Evaluate how well your work solves the stated business problem.

***
Questions to consider:
* How do you interpret the results?
* How well does your model fit your data? How much better is this than your baseline model?
* How confident are you that your results would generalize beyond the data you have?
* How confident are you that this model would benefit the business if put into use?
***

## Conclusions
Provide your conclusions about the work you've done, including any limitations or next steps.

***
Questions to consider:
* What would you recommend the business do as a result of this work?
* What are some reasons why your analysis might not fully solve the business problem?
* What else could you do in the future to improve this project?
***