# Phase 2 Review

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from statsmodels.formula.api import ols
import scipy.stats as stats
import math

pd.set_option('display.max_columns', 100)

### Check Your Data … Quickly
The first thing to do with a new dataset is to quickly verify its contents with the `.head()` method.

In [2]:
df = pd.read_csv('movie_metadata.csv')
print(df.shape)
df.head()

(5043, 28)


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens ...,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


In [3]:
df.columns

Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

## Question 1

A Hollywood executive wants to know how much an R-rated movie released after 2000 will earn. The data above is a sample of some of the movies with that rating during that timeframe, as well as other movies. How would you go about answering her question? Talk through it theoretically and then do it in code.

What is the 95% confidence interval for a post-2000 R-rated movie's box office gross?

In [4]:
# talk through your answer here
# The variables of interest are title_year, content_rating and gross.
# First isolate the R-rated movies released after 2000 that have a reported gross.
# Then use the sample mean and standard error of gross to build a 95% confidence interval for the average gross.

In [5]:
pd.set_option('float_format', '{:f}'.format)

In [6]:
df['gross'].describe()

count        4159.000000
mean     48468407.526809
std      68452990.438753
min           162.000000
25%       5340987.500000
50%      25517500.000000
75%      62309437.500000
max     760505847.000000
Name: gross, dtype: float64

In [7]:
df['title_year'].value_counts()

2009.000000    260
2014.000000    252
2006.000000    239
2013.000000    237
2010.000000    230
              ... 
1930.000000      1
1944.000000      1
1958.000000      1
1935.000000      1
1927.000000      1
Name: title_year, Length: 91, dtype: int64

In [8]:
# >= keeps films released in 2000 itself; use > 2000 for strictly later releases
after_2000 = df[df['title_year'] >= 2000]

In [9]:
after_2000['gross'].shape

(3597,)

In [10]:
after_2000['title_year'].shape

(3597,)

In [11]:
after_2000['gross'].isna().sum()

527

In [12]:
after_2000_1 = after_2000.dropna(subset=['gross'])

In [14]:
after_2000_1['gross'].describe()

count        3070.000000
mean     48848305.844300
std      70884486.330287
min           162.000000
25%       4436326.000000
50%      24994342.000000
75%      61238951.250000
max     760505847.000000
Name: gross, dtype: float64

In [15]:
after_2000_1['content_rating'].value_counts()

R            1285
PG-13        1171
PG            428
G              57
Not Rated      47
Unrated        27
NC-17           3
Name: content_rating, dtype: int64

In [16]:
R_after_2000 = after_2000_1[after_2000_1['content_rating'] == 'R']

In [17]:
# do it in code here
n = R_after_2000['gross'].count()
mean = R_after_2000['gross'].mean()
std = R_after_2000['gross'].std()
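# q = 0.95 gives z ≈ 1.645, the critical value for a two-sided 90% interval;
# a two-sided 95% interval needs q = 0.975 (z ≈ 1.96)
z = stats.norm.ppf(q = 0.95)
margin_of_error = z * (std/math.sqrt(n))

In [18]:
# confidence interval built from the z above (two-sided 90% as computed)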
conf = (mean - margin_of_error, mean + margin_of_error)
conf

(25815366.009060513, 29380056.22440252)
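
For a two-sided interval, the critical value has to leave half of the excluded probability in each tail: a 95% interval uses the 0.975 quantile of the standard normal (z ≈ 1.96), while the 0.95 quantile (z ≈ 1.645) corresponds to a 90% interval. A minimal sketch on made-up numbers, using only the standard library; the `sample` values and the helper name `mean_confidence_interval` are hypothetical and not taken from the movie data:

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `values`."""
    n = len(values)
    # Two-sided interval: split the excluded probability across both tails.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * stdev(values) / math.sqrt(n)
    center = mean(values)
    return center - margin, center + margin

# Hypothetical gross figures (in millions), for illustration only.
sample = [12.0, 30.5, 8.2, 55.1, 21.7, 40.3, 17.9, 26.4]
low, high = mean_confidence_interval(sample)
z95 = NormalDist().inv_cdf(0.975)
```

With a sample this small a t critical value would be the better choice; with more than a thousand observations, as in the R-rated subset, the normal and t critical values are nearly identical.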

## Question 2a

Your ability to answer the first question has the executive excited, and now she has many other questions about the types of movies being made and the differences in those movies' budgets and gross amounts.

Read through the questions below and **determine what type of statistical test you should use** for each question and **write down the null and alternative hypothesis for those tests**.

Use a t-test when the population standard deviation is unknown.

- Is there a relationship between the number of Facebook likes for a cast and the box office gross of the movie?
**Answer:** H_null: $\rho$ = 0  H_alt: $\rho$ ≠ 0
t-test on the Pearson correlation coefficient, two-tailed (the question asks only whether a relationship exists, not its direction)
If H_null is rejected, there is sufficient evidence of a relationship between the number of Facebook likes for a cast and the box office gross of the movie. If we fail to reject, there is insufficient evidence to conclude that the correlation coefficient differs from 0; this does not prove that no relationship exists.


- Do foreign films perform differently at the box office than non-foreign films?
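**Answer:** H_null: $\mu_{foreign}$ = $\mu_{domestic}$  H_alt: $\mu_{foreign}$ ≠ $\mu_{domestic}$
two-tailed, two-sample t-test
compares the mean box office gross of foreign films with that of non-foreign films
The question asks whether the two groups perform "differently" without naming a direction, so the null uses = and the alternative uses ≠.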


- Of all movies created are 40% rated R?
**Answer:** H_null: p = 0.4  H_alt: p ≠ 0.4
two-tailed, one-sample z-test for a proportion


- Is there a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film?
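**Answer:** chi-square test of independence. H_null: language and content rating are independent  H_alt: language and content rating are not independent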

- Is there a relationship between the content rating of a film and its budget? 
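**Answer:** one-way ANOVA, since content rating is categorical and budget is numeric. H_null: the mean budget is the same for every content rating  H_alt: at least one rating has a different mean budget. Equivalently, regress budget on dummy variables for the ratings and use the overall F-test of whether all the dummy coefficients are 0.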
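
As one worked instance of the tests listed above, the 40% R-rated question can be checked with a one-sample z-test for a proportion. This is a sketch under the normal approximation, using only the standard library; the counts (430 of 1000) and the helper name `one_proportion_ztest` are hypothetical:

```python
import math
from statistics import NormalDist

def one_proportion_ztest(successes, n, p0):
    """Two-sided z-test of H0: p = p0 using the normal approximation."""
    p_hat = successes / n
    # The standard error is computed under the null hypothesis.
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 430 R-rated films out of 1000 sampled.
z_stat, p_value = one_proportion_ztest(successes=430, n=1000, p0=0.4)
```

A p-value below the chosen significance level would lead to rejecting H_null: p = 0.4.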

In [19]:
# your answers here


## Question 2b

Calculate the answer for the second question:

- Do foreign films perform differently at the box office than non-foreign films?

In [20]:
df['country'].value_counts()

USA          3807
UK            448
France        154
Canada        126
Germany        97
             ... 
Indonesia       1
New Line        1
Cambodia        1
Slovakia        1
Slovenia        1
Name: country, Length: 65, dtype: int64

In [21]:
df['foreign_film'] = df['country'] != 'USA'

In [22]:
df['foreign_film'].value_counts()

False    3807
True     1236
Name: foreign_film, dtype: int64

In [23]:
gross_mu_USA = df.groupby(['foreign_film'])['gross'].mean().values[0]
gross_mu_foreign = df.groupby(['foreign_film'])['gross'].mean().values[1]

In [47]:
df[df['country'] == 'USA']

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,foreign_film
0,Color,James Cameron,723.000000,178.000000,0.000000,855.000000,Joel David Moore,1000.000000,760505847.000000,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.000000,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.000000,English,USA,PG-13,237000000.000000,2009.000000,936.000000,7.900000,1.780000,33000,False
1,Color,Gore Verbinski,302.000000,169.000000,563.000000,1000.000000,Orlando Bloom,40000.000000,309404152.000000,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.000000,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.000000,English,USA,PG-13,300000000.000000,2007.000000,5000.000000,7.100000,2.350000,0,False
3,Color,Christopher Nolan,813.000000,164.000000,22000.000000,23000.000000,Christian Bale,27000.000000,448130642.000000,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.000000,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.000000,English,USA,PG-13,250000000.000000,2012.000000,23000.000000,8.500000,2.350000,164000,False
5,Color,Andrew Stanton,462.000000,132.000000,475.000000,530.000000,Samantha Morton,640.000000,73058679.000000,Action|Adventure|Sci-Fi,Daryl Sabara,John Carter,212204,1873,Polly Walker,1.000000,alien|american civil war|male nipple|mars|prin...,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,738.000000,English,USA,PG-13,263700000.000000,2012.000000,632.000000,6.600000,2.350000,24000,False
6,Color,Sam Raimi,392.000000,156.000000,0.000000,4000.000000,James Franco,24000.000000,336530303.000000,Action|Adventure|Romance,J.K. Simmons,Spider-Man 3,383056,46055,Kirsten Dunst,0.000000,sandman|spider man|symbiote|venom|villain,http://www.imdb.com/title/tt0413300/?ref_=fn_t...,1902.000000,English,USA,PG-13,258000000.000000,2007.000000,11000.000000,6.200000,2.350000,0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5037,Color,Edward Burns,14.000000,95.000000,0.000000,133.000000,Caitlin FitzGerald,296.000000,4584.000000,Comedy|Drama,Kerry Bishé,Newlyweds,1338,690,Daniella Pineda,1.000000,written and directed by cast member,http://www.imdb.com/title/tt1880418/?ref_=fn_t...,14.000000,English,USA,Not Rated,9000.000000,2011.000000,205.000000,6.400000,,413,False
5039,Color,,43.000000,43.000000,,319.000000,Valorie Curry,841.000000,,Crime|Drama|Mystery|Thriller,Natalie Zea,The Following,73839,1753,Sam Underwood,1.000000,cult|fbi|hideout|prison escape|serial killer,http://www.imdb.com/title/tt2071645/?ref_=fn_t...,359.000000,English,USA,TV-14,,,593.000000,7.500000,16.000000,32000,False
5040,Color,Benjamin Roberds,13.000000,76.000000,0.000000,0.000000,Maxwell Moody,0.000000,,Drama|Horror|Thriller,Eva Boehnke,A Plague So Pleasant,38,0,David Chandler,0.000000,,http://www.imdb.com/title/tt2107644/?ref_=fn_t...,3.000000,English,USA,,1400.000000,2013.000000,0.000000,6.300000,,16,False
5041,Color,Daniel Hsia,14.000000,100.000000,0.000000,489.000000,Daniel Henney,946.000000,10443.000000,Comedy|Drama|Romance,Alan Ruck,Shanghai Calling,1255,2386,Eliza Coupe,5.000000,,http://www.imdb.com/title/tt2070597/?ref_=fn_t...,9.000000,English,USA,PG-13,,2012.000000,719.000000,6.300000,2.350000,660,False


In [48]:
df_USA = df[df['country'] == 'USA']
df_foreign = df[df['country'] != 'USA']

In [51]:
# gross has missing values; nan_policy='omit' skips them instead of returning nan
stats.ttest_ind(df_USA['gross'], df_foreign['gross'], nan_policy='omit')

In [24]:
# your answer here
# percentage by which the foreign mean gross falls short of the USA mean gross
pct_less = round((gross_mu_USA - gross_mu_foreign) / gross_mu_USA * 100)
print('The average earnings of a foreign film at the box office is {}% \n less '
      'than the average earnings of a non-foreign film'.format(pct_less))

The average earnings of a foreign film at the box office is 55% 
 less than the average earnings of a non-foreign film
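
A difference in sample means does not by itself show that the groups differ; a two-sample t-test supplies the p-value. `scipy.stats.ttest_ind` propagates NaN by default, so missing `gross` values have to be dropped or skipped first. A small sketch on made-up arrays (the `domestic` and `foreign` values are hypothetical), using Welch's variant, which does not assume equal variances:

```python
import numpy as np
from scipy import stats

# Hypothetical gross values; np.nan marks films with no reported gross.
domestic = np.array([52.0, 61.5, np.nan, 48.3, 70.2, 55.9, 66.1])
foreign = np.array([20.1, np.nan, 25.7, 18.4, 30.2, 22.8])

# Default nan_policy='propagate': a single NaN makes the whole result NaN.
naive = stats.ttest_ind(domestic, foreign)

# Welch's t-test (equal_var=False) with missing values ignored.
welch = stats.ttest_ind(domestic, foreign, equal_var=False, nan_policy='omit')
```

Welch's test is a reasonable default here because the two groups differ greatly in size and there is no reason to assume equal spread.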


## Question 3

Now that you have answered all of those questions, the executive wants you to create a model that predicts the money a movie will make if it is released next year in the US. She wants to use this to evaluate different scripts and then decide which one has the largest revenue potential. 

Below is a list of potential features you could use in the model. Create a new frame containing only those variables.

Would you use all of these features in the model?

Identify which features you might drop and why.

*Remember you want to be able to use this model to predict the box office gross of a film **before** anyone has seen it.*

- **budget**: The amount of money spent to make the movie
- **title_year**: The year the movie first came out in the box office
- **years_old**: How long has it been since the movie was released
- **genre**: Each movie is assigned one genre category like action, horror, comedy
- **avg_user_rating**: This rating is taken from Rotten Tomatoes, and is the average rating given to the movie by the audience
- **actor_1_facebook_likes**: The number of likes that the most popular actor in the movie has
- **cast_total_facebook_likes**: The sum of likes for the three most popular actors in the movie
- **language**: the original spoken language of the film


In [36]:
# your answer here
df_1 = df.drop(labels= ['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'movie_title', 'num_voted_users',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'country',
       'content_rating', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes', 'foreign_film', 'actor_1_name'], axis = 1)

In [37]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 7 columns):
actor_1_facebook_likes       5036 non-null float64
gross                        4159 non-null float64
genres                       5043 non-null object
cast_total_facebook_likes    5043 non-null int64
language                     5031 non-null object
budget                       4551 non-null float64
title_year                   4935 non-null float64
dtypes: float64(4), int64(1), object(2)
memory usage: 275.9+ KB


In [38]:
df_1.corr()
## use corr, intuition and possibly visualization

Unnamed: 0,actor_1_facebook_likes,gross,cast_total_facebook_likes,budget,title_year
actor_1_facebook_likes,1.0,0.154468,0.951661,0.022639,0.086873
gross,0.154468,1.0,0.2474,0.102179,0.030886
cast_total_facebook_likes,0.951661,0.2474,1.0,0.036557,0.109971
budget,0.022639,0.102179,0.036557,1.0,0.045726
title_year,0.086873,0.030886,0.109971,0.045726,1.0


In [27]:
df_1.dropna(subset=['gross','budget', 'language'], inplace = True)

In [28]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3888 entries, 0 to 5042
Data columns (total 7 columns):
actor_1_facebook_likes       3885 non-null float64
gross                        3888 non-null float64
genres                       3888 non-null object
cast_total_facebook_likes    3888 non-null int64
language                     3888 non-null object
budget                       3888 non-null float64
title_year                   3888 non-null float64
dtypes: float64(4), int64(1), object(2)
memory usage: 243.0+ KB


**Possibly drop**: *I might drop the two variables concerned with Facebook likes. Facebook likes might give a biased view of how well a cast or actor is liked, since they reflect only Facebook users, and over the last decade or so the demographics of each social media platform have shifted.*
*Also drop language and genres for now, since they are categorical and would need dummy encoding before they could enter the model.*
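
Categorical columns do not have to be discarded; one common alternative is to dummy-encode them. A minimal sketch on a made-up three-row frame (the `frame` contents are hypothetical), assuming pipe-separated genre strings like those in `genres`:

```python
import pandas as pd

# Hypothetical rows shaped like the genres and language columns.
frame = pd.DataFrame({
    'genres': ['Action|Adventure', 'Comedy', 'Action|Thriller'],
    'language': ['English', 'French', 'English'],
})

# Split the pipe-separated genres into one 0/1 column per genre.
genre_dummies = frame['genres'].str.get_dummies(sep='|')

# One 0/1 column per language; drop_first omits a baseline to avoid collinearity.
language_dummies = pd.get_dummies(frame['language'], prefix='language', drop_first=True)

encoded = pd.concat([genre_dummies, language_dummies], axis=1)
```

With 65 countries and many languages in the real data, rare categories would likely need to be grouped before encoding.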

In [29]:
df_1.drop(labels=['actor_1_facebook_likes'], axis = 1, inplace=True)
# highly correlated with cast_total_facebook_likes and less corr to gross than cast_total_facebook

In [30]:
df_1.drop(labels=['language', 'genres'], axis = 1, inplace=True)

In [31]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3888 entries, 0 to 5042
Data columns (total 3 columns):
gross         3888 non-null float64
budget        3888 non-null float64
title_year    3888 non-null float64
dtypes: float64(3)
memory usage: 121.5 KB


In [35]:
df_1.corr()

Unnamed: 0,gross,budget,title_year
gross,1.0,0.102101,0.045648
budget,0.102101,1.0,0.044989
title_year,0.045648,0.044989,1.0


## Question 4a

Create the following variables:

- `years_old`: The number of years since the film was released.
- Dummy categories for each of the following ratings:
    - `G`
    - `PG`
    - `R`
    
Once you have those variables, create a summary output for the following OLS model:

`gross~cast_total_facebook_likes+budget+years_old+G+PG+R`

In [32]:
df_1['title_year'].isna().sum()

0

In [33]:
# df.dropna(subset=['title_year'], inplace = True)

In [34]:
# your answer here
# use df_1 here: its title_year has no NaN (see the check above), whereas df still does
df_1['years_old'] = (2020 - df_1['title_year']).astype(int)

In [None]:
lr_model = ols(formula='gross~budget+years_old', data=df_1).fit()
lr_model.summary()
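
The formula requested in the question needs `years_old` and the rating dummies to exist as columns of the frame passed to `ols`. A sketch of the whole pipeline on synthetic data follows; the `movies` frame, its values, and the reference year 2020 are assumptions for illustration, and PG-13 serves as the omitted baseline because the question lists only `G`, `PG`, and `R`:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the movie data; every value here is made up.
rng = np.random.default_rng(0)
n = 200
movies = pd.DataFrame({
    'budget': rng.uniform(1e6, 2e8, n),
    'cast_total_facebook_likes': rng.integers(0, 50_000, n),
    'title_year': rng.integers(1990, 2017, n),
    'content_rating': rng.choice(['G', 'PG', 'PG-13', 'R'], n),
})
movies['gross'] = 0.8 * movies['budget'] + rng.normal(0, 2e7, n)

# Years since release, relative to an assumed reference year of 2020.
movies['years_old'] = 2020 - movies['title_year']

# One 0/1 column per rating; PG-13 is left out as the baseline category.
for rating in ['G', 'PG', 'R']:
    movies[rating] = (movies['content_rating'] == rating).astype(int)

model = ols('gross ~ cast_total_facebook_likes + budget + years_old + G + PG + R',
            data=movies).fit()
```

Calling `model.summary()` then displays the coefficient table, R-squared, and p-values for each term.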

## Question 4b

Below is the summary output you should have gotten above. Identify any key takeaways from it.
- How ‘good’ is this model?
- Which features help to explain the variance in the target variable? 
    - Which do not? 


<img src="ols_summary.png" style="width:300px;">

# your answer here
The model fit is questionable here, based on the R-squared value. With an R-squared of 0.079, only 7.9% of the variance in the target variable is explained by the model.

## Question 5

**Bayes Theorem**

An advertising executive is studying television viewing habits of married men and women during prime time hours. Based on the past viewing records he has determined that during prime time wives are watching television 60% of the time. It has also been determined that when the wife is watching television, 40% of the time the husband is also watching. When the wife is not watching the television, 30% of the time the husband is watching the television. Find the probability that if the husband is watching the television, the wife is also watching the television.
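
The calculation can be laid out with Bayes' theorem, P(W | H) = P(H | W) P(W) / P(H), where W means the wife is watching and H means the husband is watching. A short sketch using only the probabilities stated in the question (the variable names are illustrative):

```python
# Probabilities stated in the question.
p_wife = 0.60                 # P(W): wife is watching
p_husband_given_wife = 0.40   # P(H | W)
p_husband_given_not = 0.30    # P(H | not W)

# Law of total probability: P(H) = P(H|W) P(W) + P(H|not W) P(not W)
p_husband = p_husband_given_wife * p_wife + p_husband_given_not * (1 - p_wife)

# Bayes' theorem: P(W | H) = P(H | W) P(W) / P(H)
p_wife_given_husband = p_husband_given_wife * p_wife / p_husband
print(round(p_wife_given_husband, 4))  # 0.6667
```

That is 0.24 / 0.36 = 2/3: when the husband is watching, the wife is also watching about 67% of the time.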

In [None]:
# your answer here


## Question 6

Explain what a Type I error is and how it relates to the significance level when doing a statistical test. 
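
A Type I error is rejecting a null hypothesis that is actually true. The significance level α is the probability of that error that the analyst accepts before running the test, so with α = 0.05 about 5% of tests on true nulls reject purely by chance. A small simulation under assumed settings (standard normal data, known standard deviation, 2,000 repetitions, all chosen for illustration) shows the rejection rate landing near α:

```python
import random
from statistics import NormalDist, mean

random.seed(42)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value

n, trials = 30, 2000
false_positives = 0
for _ in range(trials):
    # The null hypothesis (mean 0, sd 1) is true by construction.
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = mean(sample) * n ** 0.5  # z = (sample mean - 0) / (1 / sqrt(n))
    if abs(z) > z_crit:
        false_positives += 1     # rejected a true null: a Type I error

type_1_rate = false_positives / trials
```

Lowering α reduces the Type I error rate but makes a real effect harder to detect.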

In [None]:
# your answer here
