### Permutation Tests

##### Prompt: Are movies that contain John Goodman significantly different in number of box office hits (either higher or lower) than those that do not?
Consider the following hypothesis:
Movies that contain John Goodman have a significantly different number of box office hits (either higher
or lower) than those that do not.

Code and execute a permutation test evaluating this hypothesis. Can the null hypothesis (that movies featuring
John Goodman have the same proportion of box office hits as those that do not) be rejected with a significance level
of α = 0.01? If so, what is the size and direction (positive or negative) of the effect?
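
For reference, the two-sided test asked for above can be written out as follows. The notation is introduced here for illustration and is not part of the original prompt: $\hat{p}_X$ and $\hat{p}_Y$ are the proportions of box office hits among movies with and without John Goodman, $T_b$ is the same difference recomputed on the $b$-th random relabelling, and $B$ is the number of permutations.

```latex
\begin{aligned}
T_{\text{obs}} &= \hat{p}_X - \hat{p}_Y \\
H_0 &: p_X = p_Y \qquad H_1 : p_X \neq p_Y \\
\widehat{p\text{-value}} &= \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left[\, |T_b| \ge |T_{\text{obs}}| \,\right]
\end{aligned}
```

The null hypothesis is rejected at level $\alpha = 0.01$ when the estimated p-value falls below 0.01, and the sign of $T_{\text{obs}}$ gives the direction of the effect.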

In [1]:
#Imports
import numpy as np
import pandas as pd

#Initializations
B = 10000 #number of permutations

In [2]:
#read data file (tab-separated, no header row; header=-1 is rejected by current pandas, so use header=None)
data = pd.read_csv("movie.features.txt",sep='\t',header=None)
data.head()

                 0  1       2
0           Horror  1  975900
1  Science_Fiction  1  975900
2     Supernatural  1  975900
3        Adventure  1  975900
4           Action  1  975900


In [3]:
data.shape

(39360, 3)

In [4]:
#Create array 'movies'(all movies), X_observed (movies that have John Goodman) and  Y_observed (movies that don't have J. Goodman)
movies = data[2].unique()

#Select all the rows that have 'John_Goodman' in their first column
rows_with_John = data.loc[data[0] == 'John_Goodman'] 
X_observed = rows_with_John[2].unique()
X_size = len(X_observed)

#Get Y by excluding movies in X from the set of all movies
Y_observed = set(movies) - set(X_observed)

print('All Movies: ',len(movies),'\nMovies with John:',X_size,'\nMovies without John:',len(Y_observed))


All Movies:  8304 
Movies with John: 59 
Movies without John: 8245


In [5]:
#Calculate observed difference in means    
# 1. Read second data file with box office hit indicators
df = pd.read_csv("movie.box_office.txt",sep='\t',header=None)
df.head()

          0  1
0    975900  0
1  10408933  0
2    171005  0
3     77856  1
4    612710  0


In [6]:
# 2. Function that looks up the box-office hit values (1/0) for a collection of movie IDs
def get_box_office_hits(array):
    #Select all the rows whose movie ID (column 0) is in the passed collection.
    #isin() replaces DataFrame.select, which was removed in pandas 1.0.
    box_office = df.loc[df[0].isin(list(array))]
    return box_office[1]
    
# 3. Function that returns the difference in mean box-office hit values (X minus Y)
def mean_difference(X_hits,Y_hits):
    #An empty group has no mean, so return None instead of dividing by zero
    if len(X_hits) == 0 or len(Y_hits) == 0:
        return None
    mean_X = sum(X_hits)/len(X_hits)
    mean_Y = sum(Y_hits)/len(Y_hits)
    return mean_X-mean_Y

# Observed_mean_difference 
Observed_mean_difference = mean_difference (get_box_office_hits(X_observed),get_box_office_hits(Y_observed) )

Observed_mean_difference

0.15765898181743432

###### Permutations
Now that we have the observed mean difference, we can build the null distribution by permutation. Under the null hypothesis the labels X and Y (whether or not John Goodman acted in the movie) do not matter, so we shuffle the movies, split them at random into one group of size X_size and another of size Y_size, and recompute the mean difference each time.
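
The shuffling procedure described above can be sketched in a self-contained form. This is a minimal illustration on synthetic data, not the notebook's own implementation: the function name `permutation_test`, its parameters, and the made-up hit counts are all assumptions introduced here.

```python
import numpy as np

def permutation_test(hits, in_group, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in hit proportions.

    hits:     0/1 array, one entry per movie (hypothetical input).
    in_group: boolean array marking the movies that belong to group X.
    """
    rng = np.random.default_rng(seed)
    hits = np.asarray(hits, dtype=float)
    in_group = np.asarray(in_group, dtype=bool)
    n_x = int(in_group.sum())

    # Observed statistic: proportion of hits in X minus proportion in Y.
    observed = hits[in_group].mean() - hits[~in_group].mean()

    perm_diffs = np.empty(n_permutations)
    for b in range(n_permutations):
        # Shuffling the outcomes is equivalent to shuffling the group labels.
        shuffled = rng.permutation(hits)
        perm_diffs[b] = shuffled[:n_x].mean() - shuffled[n_x:].mean()

    # Fraction of permutations at least as extreme as the observed value.
    p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
    return observed, p_value, perm_diffs

# Synthetic example: 50 movies in X (20 hits) and 950 movies in Y (95 hits).
hits = np.concatenate([np.repeat([1, 0], [20, 30]), np.repeat([1, 0], [95, 855])])
in_group = np.arange(hits.size) < 50
observed, p_value, perm_diffs = permutation_test(hits, in_group, n_permutations=2000)
print(observed, p_value)
```

Working on a plain 0/1 array avoids a DataFrame lookup inside the loop, which is the main reason a sketch like this stays fast even for thousands of permutations.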


In [7]:
#With B = 10,000 the permutations take a significant amount of time to run,
#so B is reduced here for demonstration purposes.

permutated_diffs = []
all_permutated_diffs = []
B=10
def permutations(): 
    for x in range(B):
        movies_perm = np.random.permutation(movies) #randomize
        #The first X_size shuffled movies play the role of X, the rest the role of Y
        X_perm = movies_perm[:X_size]
        Y_perm = movies_perm[X_size:]
        permutated_mean_difference = mean_difference (get_box_office_hits(X_perm),get_box_office_hits(Y_perm) )
        all_permutated_diffs.append(permutated_mean_difference)
        #Two-sided test: count differences at least as extreme as the observed one in either direction
        if abs(permutated_mean_difference) >= abs(Observed_mean_difference):
            permutated_diffs.append(permutated_mean_difference)
    return all_permutated_diffs, permutated_diffs   

perm = permutations()
perm

([-0.038165914627252273,
  0.013045399882825742,
  0.047186276222877743,
  0.081327152562929772,
  0.047186276222877743,
  0.013045399882825742,
  -0.089377229137330261,
  0.081327152562929772,
  0.030115838052851757,
  -0.10644766730735628],
 [])

In [8]:
p_value = len(perm[1])/float(B)
print("%.2f" % p_value)

0.00
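
The p-value above is the count of extreme permutations divided by B, so it can only take values in steps of 1/B and can come out as exactly zero. A common convention counts the observed labelling as one of the permutations, which keeps the estimate strictly positive. The helper below is a hypothetical sketch of that convention (the name `permutation_p_value` and its arguments are assumptions, not part of the notebook).

```python
def permutation_p_value(n_extreme, n_permutations, add_one=True):
    """Estimate a permutation p-value from a count of extreme permutations.

    n_extreme:      permutations at least as extreme as the observed statistic.
    n_permutations: total number of permutations drawn (B).
    add_one:        if True, count the observed labelling itself as one
                    permutation, so the estimate can never be exactly zero.
    """
    if n_permutations <= 0:
        raise ValueError("n_permutations must be positive")
    if add_one:
        return (n_extreme + 1) / (n_permutations + 1)
    return n_extreme / n_permutations

# With zero extreme permutations, B = 10 cannot support a claim at alpha = 0.01,
# whereas B = 10,000 can.
print(permutation_p_value(0, 10))
print(permutation_p_value(0, 10_000))
```

With this convention, the smallest reportable p-value is 1/(B + 1), which makes explicit how many permutations a given significance level requires.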


In [9]:
#calculate the effect size by comparing the observed mean with that of the permutated values

overall_permutated_mean_difference = sum(perm[0])/len(perm[0])
effect_size = Observed_mean_difference - overall_permutated_mean_difference
print(effect_size)

0.149734713386


### Conclusion

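The estimated p-value is 0.00: none of the permuted mean differences was at least as extreme as the observed difference of 0.158. This does not mean the null hypothesis is impossible, only that the observed difference would be very unusual if it were true. Note that the B = 10 permutations shown here can only resolve the p-value in steps of 0.1, so a comparison against α = 0.01 needs the full B = 10,000 run. Provided that run also gives a p-value below 0.01, the result is statistically significant at the 0.01 level and we can reject the null hypothesis that movies featuring John Goodman have the same proportion of box office hits as those that do not. The effect size is 0.149734713386 in the positive direction: movies with John Goodman have a proportion of box office hits roughly 15 percentage points higher than movies without him.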
