Introduction to notebook:
In this notebook, we address the following problem: videos posted more recently (closer to the reference date of March 1, 2025) have been online for less time and thus have accumulated fewer views.

We want to exclude from our data all videos that are so recent that they have significantly fewer views than they "should."

Summary of notebook and conclusions:
We found that views vary so much across the whole data set that one standard deviation below the mean is a negative number. With that much variability, it is hard to say that any individual video has abnormally low views without further investigation.

Next, we computed the mean number of views for each week of the data set, counting weeks backward from March 1, 2025. "Average views per week" is a new variable with its own mean (mu) and its own, much smaller, standard deviation. We then checked whether each week's mean was significantly lower than mu. The 8 most recent weeks (the first 8 counting back from March 1, 2025) each had a mean more than one standard deviation below mu, while most other weeks had a mean within one standard deviation of mu. This pattern indicates that the "newer" data has lower views.

Almost all weeks were within two standard deviations of the mean.

In conclusion, to be safe we excluded all data from the 8 most recent weeks. One standard deviation is a low bar for calling a week "abnormally low," so this cutoff may discard more than strictly necessary, but there is no shortage of data for our analysis, so we can afford to exclude these weeks.

In [28]:
import pandas as pd
import numpy as np
import datetime

df = pd.read_csv(r"C:\Users\arubi\Desktop\datascience2025\erdos-uncreatives\Data Cleaning and Visualization\correct_dates_keywords_primetime.csv")

march_1_2025_str = '2025-03-01T00:00:00.000Z'
march_1_2025 = datetime.datetime.fromisoformat(march_1_2025_str) #This is an "aware" (UTC) datetime object in the same format as the others. Parsing the trailing "Z" requires Python 3.11+.

df["datetime_date"] = df["datetime_date"].apply(datetime.datetime.fromisoformat) #converts each string to a datetime object (element-wise via apply, not vectorized)


#Flag rows whose title or text contains an ad-related keyword. case=False already ignores case, so no .str.lower() is needed.
ad_pattern = 'ad|sponsored|collaboration|promo|partner|affiliate|paid|gift'
df['hasAdinTitle'] = df['title'].str.contains(ad_pattern, case=False, na=False).astype(int)
df['hasAdinText'] = df['text'].str.contains(ad_pattern, case=False, na=False).astype(int)
df[

0        0
1        0
2        0
3        0
4        0
        ..
15041    0
15042    0
15043    0
15044    0
15045    0
Name: hasAdinTitle, Length: 15046, dtype: int64
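
One caveat on the keyword flags above: the pattern is matched as a substring, so 'ad' also matches inside words such as "made" or "already". The sketch below shows a whole-word variant using only Python's re module. It assumes whole-word matching is what is wanted; the helper name has_ad_keyword and the sample titles are hypothetical and do not come from the data set.

```python
import re

# Same keyword list as in the notebook, wrapped in word boundaries (\b)
# so that "ad" no longer matches inside "made" or "already".
AD_PATTERN = re.compile(
    r"\b(?:ad|sponsored|collaboration|promo|partner|affiliate|paid|gift)\b",
    flags=re.IGNORECASE,
)

def has_ad_keyword(text):
    """Return 1 if text contains one of the keywords as a whole word, else 0."""
    if not isinstance(text, str):  # treat missing values (NaN/None) as no match
        return 0
    return int(AD_PATTERN.search(text) is not None)

# Hypothetical sample titles, for illustration only.
samples = ["I made bread today", "This video is an AD", "Paid partnership haul", None]
flags = [has_ad_keyword(s) for s in samples]
print(flags)
```

The trade-off is that whole-word matching also stops matching plurals and derived forms such as "ads" or "gifted", so whether it is an improvement depends on the titles in the data.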

In [7]:
print(df["viewCount"].mean())

for n in range(1, 50):
    #Week n is the 7-day window ending (n-1) weeks before March 1, 2025. Both bounds are inclusive.
    week_end = march_1_2025 - datetime.timedelta(days=(n-1)*7)
    week_start = march_1_2025 - datetime.timedelta(days=n*7)
    new_df = df.loc[(df["datetime_date"] <= week_end) & (df["datetime_date"] >= week_start)]
    #Note to self: When filtering rows using boolean conditions, use the BITWISE | (or) and & (and) operators
    print("n is ", n)
    print(new_df["viewCount"].mean())

352041.8183570384
n is  1
106693.91089108911
n is  2
104200.1026392962
n is  3
124624.29936305732
n is  4
164792.07453416148
n is  5
108707.05214723926
n is  6
169356.88888888888
n is  7
157007.09841269843
n is  8
144583.39285714287
n is  9
189381.925
n is  10
161341.66666666666
n is  11
136594.78032786885
n is  12
395536.4671532847
n is  13
177529.8141025641
n is  14
369129.59922178986
n is  15
156443.4448051948
n is  16
263757.18076923076
n is  17
303307.10638297873
n is  18
282643.94042553194
n is  19
508668.1167315175
n is  20
213507.70955882352
n is  21
551365.1857142857
n is  22
489490.1505376344
n is  23
478914.6384615385
n is  24
429081.56934306567
n is  25
400297.6188811189
n is  26
367830.7913385827
n is  27
333414.15328467154
n is  28
412499.7037037037
n is  29
404252.71768707485
n is  30
253099.0
n is  31
325268.492481203
n is  32
544226.560311284
n is  33
592086.12267658
n is  34
407123.8550185874
n is  35
317366.25418060203
n is  36
652613.6102941176
n is  37
625872.35955
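
The loop above uses inclusive bounds on both sides, so a video posted exactly on a week boundary is counted in two adjacent weeks. One possible alternative is to give each video a single week index by integer division and then group by it. This is only a sketch on hypothetical toy data: it assumes datetime_date is a timezone-aware pandas datetime column (for example, parsed with pd.to_datetime(..., utc=True)), and the names toy, cutoff, and weeks_back are illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV (column names match the notebook).
toy = pd.DataFrame({
    "datetime_date": pd.to_datetime(
        ["2025-02-28T12:00:00Z", "2025-02-25T00:00:00Z",
         "2025-02-22T00:00:00Z", "2025-02-20T00:00:00Z",
         "2025-02-10T00:00:00Z"],
        utc=True,
    ),
    "viewCount": [100, 300, 500, 700, 900],
})

cutoff = pd.Timestamp("2025-03-01T00:00:00Z")

# Whole days between each video and the cutoff (partial days are dropped).
age_days = (cutoff - toy["datetime_date"]).dt.days

# Week 1 covers ages of 0-6 days, week 2 covers 7-13 days, and so on,
# so every video lands in exactly one week.
toy["weeks_back"] = age_days // 7 + 1

weekly_means = toy.groupby("weeks_back")["viewCount"].mean()
print(weekly_means)
```

Because the bins do not overlap, the weekly means from this approach can differ slightly from the loop's output whenever a video sits exactly on a boundary.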

In [8]:
#By visual inspection, going back n = 6 to 12 weeks looks like a good amount. That's about 1.5 to 3 months.

In [9]:
#let's check months (approximated as 30-day windows counted back from March 1, 2025)

for n in range(1, 10):
    month_end = march_1_2025 - datetime.timedelta(days=(n-1)*30)
    month_start = march_1_2025 - datetime.timedelta(days=n*30)
    new_df = df.loc[(df["datetime_date"] <= month_end) & (df["datetime_date"] >= month_start)]
    print("n is ", n)
    print(new_df["viewCount"].mean())

n is  1
120677.71669106881
n is  2
156768.20244565216
n is  3
212837.92896174864
n is  4
264830.159252669
n is  5
394971.3832599119
n is  6
433995.4845890411
n is  7
357507.0155979203
n is  8
461036.5
n is  9
532178.6404023471


In [10]:
#Two months back is probably fine.
#A formal t-test is awkward here: each month is a subset of the whole sample (so the two groups are not independent), and the view counts are heavily skewed.
#Let's use some common sense instead and look at the standard deviation of the whole sample.

print(df["viewCount"].std())

print(df["viewCount"].mean() - df["viewCount"].std())

#Conclusion: The standard deviation is SO LARGE that one standard deviation below the mean would be negative. That's crazy!!! 
#Okay, let's try a new approach. 

2326561.154682883
-1974519.3363258447
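
A negative value here does not indicate a bug; it is what heavily right-skewed data looks like. As a small illustration with made-up numbers (not taken from the data set), a single viral video among a few ordinary ones is enough to push the standard deviation above the mean:

```python
import statistics

# Hypothetical view counts: four ordinary videos and one viral outlier.
views = [10, 20, 30, 40, 10000]

mean_views = statistics.mean(views)      # 2020
std_views = statistics.stdev(views)      # sample std (ddof=1), like pandas .std()
median_views = statistics.median(views)  # 30

# The outlier inflates both the mean and the std, but the std more,
# so "one standard deviation below the mean" ends up negative.
one_std_below = mean_views - std_views
print(mean_views, median_views, one_std_below < 0)
```

In this toy case the median sits far below the mean, which is another sign that the mean and standard deviation of raw view counts describe the outliers more than the typical video.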


In [11]:
week_means = []

for n in range(1, 53):
    week_end = march_1_2025 - datetime.timedelta(days=(n-1)*7)
    week_start = march_1_2025 - datetime.timedelta(days=n*7)
    new_df = df.loc[(df["datetime_date"] <= week_end) & (df["datetime_date"] >= week_start)]
    week_means.append(new_df["viewCount"].mean())

week_means_arr = np.array(week_means)
week_means_arr.std() #NumPy's default is the population std (ddof=0); pandas .std() uses ddof=1

#The question is now: Which weeks have a mean significantly above or below the typical weekly mean?

np.float64(181734.74856885787)

In [12]:
target_low = week_means_arr.mean() - week_means_arr.std()
target_high = week_means_arr.mean() + week_means_arr.std()

#Reuse the weekly means computed in the previous cell instead of filtering the data again.
for n, week_mean in enumerate(week_means, start=1):
    print("n is ", n)
    print(week_mean)

    if target_low < week_mean < target_high:
        print("Target within one std")
    else:
        print("Target outside range")

n is  1
106693.91089108911
Target outside range
n is  2
104200.1026392962
Target outside range
n is  3
124624.29936305732
Target outside range
n is  4
164792.07453416148
Target outside range
n is  5
108707.05214723926
Target outside range
n is  6
169356.88888888888
Target outside range
n is  7
157007.09841269843
Target outside range
n is  8
144583.39285714287
Target outside range
n is  9
189381.925
Target within one std
n is  10
161341.66666666666
Target outside range
n is  11
136594.78032786885
Target outside range
n is  12
395536.4671532847
Target within one std
n is  13
177529.8141025641
Target outside range
n is  14
369129.59922178986
Target within one std
n is  15
156443.4448051948
Target outside range
n is  16
263757.18076923076
Target within one std
n is  17
303307.10638297873
Target within one std
n is  18
282643.94042553194
Target within one std
n is  19
508668.1167315175
Target within one std
n is  20
213507.70955882352
Target within one std
n is  21
551365.1857142857
Target 

In [13]:
#This means that, to be safe, we should exclude the 8 most recent weeks (n = 1 through 8).
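
The number of weeks to drop was read off the printed output by eye. A sketch of how the same count could be computed is below: it measures the length of the initial run of weeks whose mean falls below the one-standard-deviation band. The numbers are hypothetical toy values, not the notebook's weekly means, and n_recent_low is an illustrative name.

```python
import numpy as np

# Hypothetical weekly means, ordered from most recent (week 1) to oldest.
week_means_toy = np.array([100, 110, 120, 400, 90, 420, 380, 450, 410, 430], dtype=float)

mu = week_means_toy.mean()
sigma = week_means_toy.std()  # population std (ddof=0), as in the notebook
target_low = mu - sigma

# True where a week's mean is more than one std below mu.
below = week_means_toy < target_low

# Length of the leading run of True values. np.argmin on a boolean array
# returns the index of the first False; the guard covers the all-True case.
n_recent_low = len(below) if below.all() else int(np.argmin(below))
print(n_recent_low)
```

This only counts the unbroken run at the start, so an isolated low week further back (like the fifth value in the toy data) does not extend the exclusion window.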

In [14]:
final_df = df.loc[ (df["datetime_date"] <= (march_1_2025 - datetime.timedelta(days = 8*7)))]
final_df["datetime_date"]

#Note: the ad features (hasAdinTitle, hasAdinText) were missing from this data set. They are added in the first cell above, at the last minute, before saving.

final_df.to_csv("no_early_dates.csv")