Introduction to notebook:
In this notebook, we address the following problem: videos that were posted more recently (closer to March 1, 2025) have been available for less time and will therefore tend to have fewer views.

We want to exclude from our data all videos that are so recent that they have significantly fewer views than they "should."
We only need to exclude this data when we are investigating views. 

Conclusion:
We ran two t-tests: one to check whether videos from the most recent month (the 30 days before March 1, 2025) had lower views, and one to check whether videos from the second most recent month had lower views. We used a significance level of alpha = 0.005. We found that the first month did have lower views but the second month did not. We excluded the first month of data from our data set.

To avoid running many t-tests, we did not run a finer analysis, such as checking whether individual weeks had lower views.
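
The reason for keeping the number of tests small can be sketched numerically. The snippet below is a minimal illustration, not part of the analysis: it assumes independent tests for the family-wise error rate formula (the tests in this notebook share data, so this is only a rough guide), and the overall level of 0.01 passed to the Bonferroni helper is a hypothetical value. Both function names are made up for this example.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(overall_alpha, n_tests):
    """Per-test level that keeps the overall false-positive rate at or below overall_alpha."""
    return overall_alpha / n_tests


alpha = 0.005  # per-test level used in this notebook
for label, n_tests in [("monthly", 2), ("weekly", 8), ("daily", 60)]:
    rate = familywise_error_rate(alpha, n_tests)
    print(f"{label:8s} {n_tests:3d} tests -> family-wise error rate {rate:.4f}")

# Hypothetical overall level of 0.01 split across two monthly tests
print(bonferroni_threshold(0.01, 2))
```

The family-wise rate grows with every extra test, which is why a weekly or daily breakdown would need a much stricter per-test threshold.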

In [1]:
import pandas as pd
import numpy as np
import datetime

df = pd.read_csv("../data/cleaned_data_correct_dates.csv")

march_1_2025_str = '2025-03-01T00:00:00.000Z'
march_1_2025 = datetime.datetime.fromisoformat(march_1_2025_str) #This is an "aware" datetime object in the same format as the others 

df["date"] = df["date"].apply(datetime.datetime.fromisoformat) #changes from string to datetime object, one element at a time (apply is not vectorized); parsing a trailing "Z" needs Python 3.11+
df["date"]

0       2025-01-01 17:03:25+00:00
1       2024-09-26 17:09:34+00:00
2       2024-12-13 17:06:39+00:00
3       2025-02-16 18:19:19+00:00
4       2024-07-07 17:33:45+00:00
                   ...           
15869   2024-12-03 21:02:47+00:00
15870   2024-12-03 21:13:12+00:00
15871   2024-12-03 20:57:23+00:00
15872   2024-07-31 00:01:49+00:00
15873   2024-04-12 17:35:38+00:00
Name: date, Length: 15874, dtype: datetime64[ns, UTC]
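
A vectorized alternative to the element-wise `apply` is `pd.to_datetime` with `utc=True`. The sketch below is only an illustration: the three strings are hypothetical samples written in the same ISO 8601 "Z" format as the cutoff string, and `age_in_days` is a name introduced here, not a column in the data set.

```python
import pandas as pd

# Hypothetical sample strings in the same format as the March 1 cutoff string
raw = pd.Series([
    "2025-01-01T17:03:25.000Z",
    "2024-09-26T17:09:34.000Z",
    "2025-02-16T18:19:19.000Z",
])

# Parse the whole column at once; utc=True yields timezone-aware UTC timestamps
dates = pd.to_datetime(raw, utc=True)

# The trailing "Z" makes this Timestamp timezone-aware as well
cutoff = pd.Timestamp("2025-03-01T00:00:00.000Z")

# Whole days each video had been online by the cutoff
age_in_days = (cutoff - dates).dt.days
print(age_in_days)
```

Because both sides are timezone-aware, the subtraction and any later comparisons against the cutoff work without further conversion.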

In [60]:
#We are running t-tests on the full data set. We want to run as few t-tests as possible. 
#This means we want to take things one month at a time rather than weekly or daily.

In [61]:
#Now we want to run some t-tests to see if the first two months or so are significantly lower. 
#We use a significance level of alpha = 0.005
import scipy.stats as stats

#First month: videos posted in the 30 days before March 1, 2025
first_month_df = df.loc[(df["date"] <= march_1_2025) & (df["date"] >= march_1_2025 - datetime.timedelta(days=30))]

#Everything older than 30 days
not_first_month_df = df.loc[df["date"] < march_1_2025 - datetime.timedelta(days=30)]

not_first_month = not_first_month_df["viewCount"]
first_month = first_month_df["viewCount"]

#ttest_ind defaults to a two-sided test with pooled variance; the sign of t_stat gives the direction of the difference
t_stat, p_value = stats.ttest_ind( first_month, not_first_month ) 
print(p_value) 

8.537449850065806e-05
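
By default, `stats.ttest_ind` runs a two-sided test with pooled variance, so a small p-value on its own says the means differ, not which one is lower. A minimal sketch of a one-sided Welch variant follows. It uses synthetic data: the lognormal view counts and the sample sizes are invented for illustration and are not taken from the data set. The `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Synthetic, heavy-tailed "view counts"; the recent group is given a lower mean on purpose
older_views = rng.lognormal(mean=10.0, sigma=1.0, size=2000)
recent_views = rng.lognormal(mean=9.5, sigma=1.0, size=300)

# Welch's t-test (equal_var=False) does not assume the two groups share a variance
two_sided = stats.ttest_ind(recent_views, older_views, equal_var=False)

# alternative="less" tests whether the first sample's mean is lower than the second's
one_sided = stats.ttest_ind(recent_views, older_views, equal_var=False, alternative="less")

print("t statistic:", two_sided.statistic)
print("two-sided p:", two_sided.pvalue)
print("one-sided p:", one_sided.pvalue)
```

A negative t statistic together with a small one-sided p-value is what supports the word "lower" in the conclusion.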


In [62]:
#Conclusion: Yes, the first month has significantly lower views! Now let's look at the second month. 
#We use a significance level of alpha = 0.005

In [63]:
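#Second month: videos posted between 60 and 30 days before March 1, 2025
second_month_df = df.loc[(df["date"] <= march_1_2025 - datetime.timedelta(days=30)) & (df["date"] >= march_1_2025 - datetime.timedelta(days=60))]
second_month = second_month_df["viewCount"]

#Everything outside the second month (older videos and the first month)
not_second_month_df = df.loc[(df["date"] < march_1_2025 - datetime.timedelta(days=60)) | (df["date"] > march_1_2025 - datetime.timedelta(days=30))]
not_second_month = not_second_month_df["viewCount"] #was first_month_df["viewCount"], which compared only the first month against the second; the p-value printed below comes from that version

t_stat, p_value = stats.ttest_ind( not_second_month, second_month ) 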
print(p_value) 

#Conclusion: No, the second month does not have significantly lower views. 

0.08537943923226873
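
The month filters above use `<=` and `>=` on both ends, so a video stamped exactly 30 days before March 1 lands in both the first and the second month. One way to avoid that is a half-open window, sketched below. This is an illustration only: `window_mask` is a hypothetical helper, the four timestamps are invented, and placing the boundary in the more recent window is a choice made to match the strict `<` used for `not_first_month_df`.

```python
import pandas as pd

WINDOW = pd.Timedelta(days=30)  # a "month" here is a 30-day window


def window_mask(dates, end, k):
    """Mask for the k-th 30-day window before `end` (k=0 is the most recent).

    The interval is [end - (k+1)*WINDOW, end - k*WINDOW): the lower bound is
    included and the upper bound excluded, so adjacent windows never share a row.
    A timestamp exactly equal to `end` falls outside every window.
    """
    upper = end - k * WINDOW
    lower = end - (k + 1) * WINDOW
    return (dates >= lower) & (dates < upper)


end = pd.Timestamp("2025-03-01", tz="UTC")

# Invented timestamps, including one exactly on the 30-day boundary
dates = pd.Series(pd.to_datetime([
    "2025-02-20T12:00:00Z",  # inside the first window
    "2025-01-30T00:00:00Z",  # exactly 30 days before `end`
    "2025-01-15T08:30:00Z",  # inside the second window
    "2024-11-01T00:00:00Z",  # older than both windows
], utc=True))

first = window_mask(dates, end, 0)
second = window_mask(dates, end, 1)
print(first.tolist())
print(second.tolist())
```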


In [65]:
final_df = df.loc[df["date"] < march_1_2025 - datetime.timedelta(days=30)] #drop only the first (most recent) 30 days, matching not_first_month_df
final_df["date"]
final_df.to_csv("no_early_dates_30_days.csv")