Hypothesis Formulation & Testing

🧪 Step 1: Formulate a Hypothesis
Let’s say we want to test:

🎯 Hypothesis Example 1:
H₀ (Null): The average revenue of Netflix content is $400M
H₁ (Alternative): The average revenue is not $400M

We'll use a two-sided one-sample t-test at the 5% significance level (α = 0.05).



In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ttest_1samp
from sklearn.preprocessing import LabelEncoder

# Load the dataset
df = pd.read_csv('netflix_movies_series_final.csv')

# Remove missing values
revenue = df['Revenue (Million $)'].dropna()

# Hypothesized mean
mu_0 = 400

# Perform t-test
t_stat, p_value = ttest_1samp(revenue, mu_0)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpretation
if p_value < 0.05:
    print("Reject the null hypothesis: average revenue is significantly different from 400M.")
else:
    print("Fail to reject the null hypothesis: no significant difference from 400M.")


t-statistic: 5.7908
p-value: 0.0000
Reject the null hypothesis: average revenue is significantly different from 400M.
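
To make the statistic concrete, here is a minimal sketch of how a one-sample t-statistic is formed: t = (x̄ − μ₀) / (s / √n), with n − 1 degrees of freedom. The helper name `one_sample_t` and the toy revenue figures are made up for illustration and are not taken from the dataset. The sketch uses only the standard library; `ttest_1samp` additionally derives the p-value from the t distribution.

```python
import math
import statistics


def one_sample_t(sample, mu_0):
    """Return (t_statistic, degrees_of_freedom) for a one-sample t-test."""
    n = len(sample)
    mean = statistics.mean(sample)
    # statistics.stdev is the sample standard deviation (n - 1 denominator)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return (mean - mu_0) / std_err, n - 1


# Hypothetical revenues in million $ (illustrative only, not from the dataset)
toy_revenue = [410, 395, 420, 405, 430]
t_stat, dof = one_sample_t(toy_revenue, mu_0=400)
print(f"t-statistic: {t_stat:.4f} (df = {dof})")
```

A larger |t| means the sample mean lies more standard errors away from μ₀. In the run above, t = 5.7908 yields a p-value that prints as 0.0000 only because it is rounded to four decimals; it is below 0.00005, not exactly zero.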


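Two-Sample t-test: Revenue of High vs Low IMDb Rating Movies

H₀ (Null): Movies rated above 7 and movies rated 7 or below have the same average revenue
H₁ (Alternative): The average revenue differs between the two groups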

In [14]:
from scipy.stats import ttest_ind

# Split into high (> 7) and low (<= 7) IMDb rating groups
group1 = df[df['IMDB Rating'] > 7]['Revenue (Million $)'].dropna()
group2 = df[df['IMDB Rating'] <= 7]['Revenue (Million $)'].dropna()

# Perform Welch's two-sample t-test (equal_var=False: equal variances are not assumed)
t_stat, p_value = ttest_ind(group1, group2, equal_var=False)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject H₀: Significant difference in revenue between high and low IMDb rating movies.")
else:
    print("Fail to reject H₀: No significant difference in revenue.")


t-statistic: -1.4912
p-value: 0.1359
Fail to reject H₀: No significant difference in revenue.
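
Because `equal_var=False` is passed, `ttest_ind` runs Welch's t-test rather than the pooled-variance Student's version. A rough sketch of the statistic and its Welch-Satterthwaite degrees of freedom follows. The function name `welch_t` and the toy groups are illustrative assumptions, not values from the dataset.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t_statistic, degrees_of_freedom) for Welch's two-sample t-test."""
    n1, n2 = len(a), len(b)
    # Variance of each group mean: s^2 / n (sample variance, n - 1 denominator)
    v1 = statistics.variance(a) / n1
    v2 = statistics.variance(b) / n2
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof


# Hypothetical revenues in million $ (illustrative only, not from the dataset)
high_rated = [520, 480, 610, 450, 500]
low_rated = [400, 470, 530, 390]
t_stat, dof = welch_t(high_rated, low_rated)
print(f"t-statistic: {t_stat:.4f} (df ≈ {dof:.2f})")
```

The degrees of freedom are generally not an integer and shrink when the group with the larger variance is also the smaller group, which is what makes the test robust to unequal variances.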


Z-test (Optional; suited to a known population std or a large sample)

In [17]:
from statsmodels.stats.weightstats import ztest

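# Example: test whether mean watch time differs from 120 minutes (H₀: mean = 120)
# Note: ztest estimates the standard deviation from the sample, so this is a
# large-sample approximation rather than a known-sigma z-test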
watch_time = df['Avg User Watch Time (Minutes)'].dropna()

z_stat, p_value = ztest(watch_time, value=120)
print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")


z-statistic: 0.2476
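p-value: 0.8045

Since p > 0.05, we fail to reject H₀: there is no significant evidence that the average watch time differs from 120 minutes.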
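
The z-statistic has the same form as the one-sample t-statistic, but the p-value comes from the standard normal distribution instead of the t distribution. The sketch below assumes the standard deviation is estimated from the sample, which appears to match the default behaviour of `ztest`. The helper names `two_sided_p_from_z` and `one_sample_z` and the toy watch times are hypothetical.

```python
import math
import statistics


def two_sided_p_from_z(z):
    """Two-sided p-value under the standard normal distribution."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


def one_sample_z(sample, value):
    """Return (z_statistic, p_value), estimating the std from the sample."""
    n = len(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.mean(sample) - value) / std_err
    return z, two_sided_p_from_z(z)


# Hypothetical watch times in minutes (illustrative only, not from the dataset)
toy_watch_time = [118, 125, 121, 116, 130, 119, 123, 122]
z_stat, p_value = one_sample_z(toy_watch_time, value=120)
print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```

Plugging the z-statistic reported above (0.2476) into `two_sided_p_from_z` gives roughly 0.804, in line with the printed p-value.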
