# <p style ="text-align: center">The People & Blogs Category</p>

### Imports

In [1]:
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.offline as pyo
import plotly.express as px
from sklearn.linear_model import LinearRegression
import numpy as np
import os 
import pandas as pd
import seaborn as sns
import sys

Join us on a journey through YouTube's dynamic landscape as we unravel the transformation of channels within the People & Blogs category over the past decade. From sudden pivots to strategic shifts, we delve into the reasons behind channels morphing in this vibrant space.

What motivates a channel's transition? Was it audience demand, a quest for reinvention, or a strategic play? Our quest is to uncover the motivations and quality that drove these channel metamorphoses. Get ready to decipher the narrative behind YouTube's shape-shifting channels as we navigate through data, decode patterns, and reveal the untold stories behind this intriguing evolution.

# Our Dataset


This data story relies on the YouNiverse dataset, a comprehensive repository of YouTube channel data spanning 2005 to 2019. It includes key metrics such as views, likes, comments, subscriber counts, and video metadata. The dataset's breadth and depth make it well suited to our analysis, enabling a nuanced exploration of channel transitions within the People & Blogs category.

Let's start by loading the general channel data, as well as the associated time series. The original dataset only contains channels with at least 10 videos and 10,000 subscribers, so we won't need to filter out any channels. These thresholds ensure that every channel we consider has a meaningful amount of content and an audience large enough to be relevant.
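
As a quick sanity check on these thresholds, a minimal sketch along the following lines could be run once the data is loaded. The helper `check_thresholds` and the toy frame are hypothetical and only illustrate the idea; the column names `videos_cc` and `subscribers_cc` are the ones used in the channel table.

```python
import pandas as pd

def check_thresholds(df, min_videos=10, min_subscribers=10_000):
    """Return the rows that fall below either threshold (hypothetical helper)."""
    below = (df["videos_cc"] < min_videos) | (df["subscribers_cc"] < min_subscribers)
    return df[below]

# Toy data standing in for df_channels_en (illustrative values only)
toy = pd.DataFrame({
    "channel": ["a", "b", "c"],
    "videos_cc": [12, 9, 300],
    "subscribers_cc": [15_000, 50_000, 9_999],
})

violations = check_thresholds(toy)
print(violations["channel"].tolist())
```

An empty result on the real table would confirm that no additional filtering is needed.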

In [2]:
NOTEBOOK_PATH = os.getcwd()
DIR_PATH = os.path.dirname(NOTEBOOK_PATH)
DATA_PATH = os.path.join(DIR_PATH, "Data_youniverse")
UTILS_PATH = os.path.join(DIR_PATH, "utils")

In [3]:
df_channels_en = pd.read_csv(f"{DATA_PATH}/df_channels_en.tsv.gz", compression="infer", sep="\t") 
df_timeseries_en = pd.read_csv(f"{DATA_PATH}/df_timeseries_en.tsv.gz", compression="infer", sep="\t")

In [4]:
df_channels_en.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136470 entries, 0 to 136469
Data columns (total 8 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   category_cc         136342 non-null  object 
 1   join_date           136469 non-null  object 
 2   channel             136470 non-null  object 
 3   name_cc             136460 non-null  object 
 4   subscribers_cc      136470 non-null  int64  
 5   videos_cc           136470 non-null  int64  
 6   subscriber_rank_sb  136470 non-null  float64
 7   weights             136470 non-null  float64
dtypes: float64(2), int64(2), object(4)
memory usage: 8.3+ MB


In [5]:
df_channels_en['category_cc'].value_counts()

category_cc
Music                    24285
Entertainment            22951
Gaming                   20143
People & Blogs           18413
Howto & Style            11875
Education                 7803
Film and Animation        6875
Sports                    5148
Science & Technology      4864
Comedy                    3767
Autos & Vehicles          3705
News & Politics           2263
Travel & Events           1989
Pets & Animals            1292
Nonprofits & Activism      969
Name: count, dtype: int64

In [16]:
# Plotly histogram of the number of channels per category, with axis labels
# (each row of df_channels_en is a channel, so the bars count channels, not videos)
fig = px.histogram(df_channels_en, x="category_cc", title="Distribution of channels per category", labels={'category_cc': 'Category'}, color="category_cc").update_layout(yaxis_title="Number of channels")
fig.show()

pyo.plot(fig, filename='category_distribution.html')

'category_distribution.html'

These first numbers already outline a few heavyweight categories of the YouTube scene over that timescale, such as Music and Entertainment. Our analysis centers on the People & Blogs category due to its prominent representation and unique content dynamics. This category not only boasts a substantial channel count but also offers a diverse range of human-centric content, including personal narratives, vlogs, and informational videos. Moreover, it serves as an engagement magnet, drawing audiences seeking relatable content while fostering interaction through comments, shares, and discussions. Finally, People & Blogs stands out as a transition hotspot, historically attracting channels that diversify their content, which makes it an intriguing focal point for exploring transitions within the YouTube ecosystem.

# The People and Blogs Category

The People & Blogs category on YouTube serves as a diverse hub, housing a wide spectrum of content centered on personal narratives, experiences, and informational dialogues. It covers vlogs, storytelling, lifestyle advice, commentary, and discussion-oriented videos, offering viewers a glimpse into diverse perspectives and human experiences. Let's dive in!

In [18]:
df_pewdipie = df_channels_en[df_channels_en['name_cc'] == 'PewDiePie']
df_pewdipie = df_pewdipie.reset_index(drop=True)
df_pewdipie['category_cc'].value_counts()

category_cc
Gaming    1
Name: count, dtype: int64

In [19]:
df_channels_en[df_channels_en['subscriber_rank_sb']==3.0]

Unnamed: 0,category_cc,join_date,channel,name_cc,subscribers_cc,videos_cc,subscriber_rank_sb,weights
0,Gaming,2010-04-29,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,101000000,3956,3.0,2.087


In [20]:
CATEGORY = "People & Blogs"

In [21]:
df_people_and_blogs = df_channels_en[df_channels_en['category_cc'] == CATEGORY].copy()

In [22]:
df_people_and_blogs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18413 entries, 89 to 136469
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   category_cc         18413 non-null  object 
 1   join_date           18413 non-null  object 
 2   channel             18413 non-null  object 
 3   name_cc             18413 non-null  object 
 4   subscribers_cc      18413 non-null  int64  
 5   videos_cc           18413 non-null  int64  
 6   subscriber_rank_sb  18413 non-null  float64
 7   weights             18413 non-null  float64
dtypes: float64(2), int64(2), object(4)
memory usage: 1.3+ MB


### Channel Join Dates

We will begin our analysis with a focus on channel join dates within the People & Blogs category on YouTube. By examining these dates, we aim to uncover trends and notable periods of channel creation, potentially unveiling influential milestones or temporal patterns shaping the category's evolution.

In [23]:
# extract the join date from the dataframe
join_dates = pd.to_datetime(df_people_and_blogs['join_date'])

# get join counts for each date
join_dates_counts = join_dates.dt.date.value_counts().sort_index()

fig = go.Figure()
fig.add_trace(go.Scatter(x=join_dates_counts.index, y=join_dates_counts.values, mode='lines', name='Channel Join Dates'))

# Fit LinReg
X = np.arange(len(join_dates_counts)).reshape(-1, 1)
y = join_dates_counts.values.reshape(-1, 1)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

fig.add_trace(go.Scatter(x=join_dates_counts.index, y=y_pred.flatten(), mode='lines', name='Linear Regression'))

# update layout 
fig.update_layout(title='Distribution of Channel Join Dates Over Time', xaxis_title='Join Date', yaxis_title='Number of Channels')

fig.show()

In [28]:
# per year plot
df_people_and_blogs['Year'] = join_dates.dt.year
channel_join_count = df_people_and_blogs.groupby('Year').size().reset_index(name='Channel Count')

fig = px.line(channel_join_count, x='Year', y='Channel Count', title='Increase in New People & Blogs Channels Over Time')
fig.update_traces(marker=dict(color='blue', line=dict(color='black', width=1)))

# fit linear regression
X = np.arange(len(channel_join_count)).reshape(-1, 1)
y = channel_join_count['Channel Count'].values.reshape(-1, 1)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

fig.add_trace(go.Scatter(x=channel_join_count['Year'], y=y_pred.flatten(), mode='lines', name='Linear Regression'))

fig.update_layout(xaxis_title='Year', yaxis_title='Number of Channels Joined')
fig.show()

# pyo.plot(fig, filename='PB_Channel_Joins.html')

'PB_Channel_Joins.html'

This first visualization of the category's join dates reveals a steady increase in channel creation over time, with a notable spike in 2016. This spike is plausibly due to the rise of vlogging and the emergence of new content creators, as well as the increasing popularity of YouTube as a platform. We do, however, notice a marked dip from 2017 onward, which we will attempt to explore further. We also fit a linear regression to the data to better visualize the trend; its positive slope aligns with the general perception of the genre's growth over that time frame.
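
To put a number on that slope, a sketch like the one below may help. It swaps `LinearRegression` for `np.polyfit` with degree 1, which solves the same least-squares problem, and it regresses on the year itself rather than on the row index, so the slope reads as channels per year. The yearly counts are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Illustrative yearly join counts (not the real data)
years = np.array([2010, 2011, 2012, 2013, 2014])
counts = np.array([100, 150, 180, 260, 310])

# Degree-1 least-squares fit; np.polyfit returns the coefficients
# from highest to lowest degree, i.e. [slope, intercept]
slope, intercept = np.polyfit(years, counts, deg=1)

# Fitted trend line evaluated at each year
trend = slope * years + intercept
print(f"Estimated change: {slope:.1f} channels per year")
```

With the notebook's own fit, the equivalent quantity would be `model.coef_`, expressed per step of the `np.arange` index used as `X`.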

Let's pursue this analysis by examining the channel join dates in more detail, breaking them down by month within each year.

In [35]:
df_people_and_blogs['join_date'] = pd.to_datetime(df_people_and_blogs['join_date'])

# extract year and month from join dates
df_people_and_blogs.loc[:, 'Year'] = df_people_and_blogs['join_date'].dt.year
df_people_and_blogs.loc[:, 'Month'] = df_people_and_blogs['join_date'].dt.month

fig = px.histogram(df_people_and_blogs, x='Year', color='Month', title='Channel Join Dates Distribution by Month/Year')
fig.update_layout(barmode='group', xaxis_title='Year', yaxis_title='Number of Channels')
fig.show()

These new results align with the general trend of increasing channel creation over time. The monthly join counts seem to be distributed relatively evenly within each year, with two exceptions. The first is the spike in October 2011, which we explore in more detail below. The second is the absence of channel joins in late 2019, which is likely due to the dataset being incomplete for that year, as it was collected in October 2019.

In [36]:
df_people_and_blogs['join_date'] = pd.to_datetime(df_people_and_blogs['join_date'])

# filter data for October 2011
october_2011 = df_people_and_blogs[(df_people_and_blogs['join_date'].dt.year == 2011) & (df_people_and_blogs['join_date'].dt.month == 10)]

# Group by day and count channel joins
daily_join_counts = october_2011['join_date'].dt.day.value_counts().sort_index()

fig = px.line(x=daily_join_counts.index, y=daily_join_counts.values, title='Daily Channel Join Counts in October 2011')
fig.update_layout(xaxis_title='Day', yaxis_title='Number of Channels Joined')
fig.show()

In [37]:
df_channels_en['join_date'] = pd.to_datetime(df_channels_en['join_date'])

# extract data for October 2011
october_2011 = df_channels_en[(df_channels_en['join_date'].dt.year == 2011) &
                               (df_channels_en['join_date'].dt.month == 10)]

# Group by category and count channel joins for October 2011
category_join_counts = october_2011.groupby('category_cc').size().reset_index(name='Join Counts')

fig = px.bar(category_join_counts, x='category_cc', y='Join Counts', 
             title='Channel Join Counts by Category in October 2011',
             labels={'category_cc': 'Category', 'Join Counts': 'Number of Channels Joined'},
             color='category_cc')
fig.update_layout(xaxis_title='Category', yaxis_title='Number of Channels')
fig.show()

In [39]:
df_channels_en['join_date'] = pd.to_datetime(df_channels_en['join_date'])

# extract data for the year 2011
df_channels_2011 = df_channels_en[df_channels_en['join_date'].dt.year == 2011].copy()

df_channels_2011.loc[:, 'Year'] = df_channels_2011['join_date'].dt.year
df_channels_2011.loc[:, 'Month'] = df_channels_2011['join_date'].dt.month

# Group by Year, Month, and Category and count channel joins for the year 2011
category_join_counts_2011 = df_channels_2011.groupby(['Year', 'Month', 'category_cc']).size().reset_index(name='Join Counts')

fig = px.bar(category_join_counts_2011, x='Month', y='Join Counts', 
             title='Channel Join Counts by Category in 2011',
             labels={'Month': 'Month', 'Join Counts': 'Number of Channels Joined'},
             color='category_cc', barmode='group', facet_col='Year')
fig.update_layout(xaxis_title='Month', yaxis_title='Number of Channels')
fig.show()

After taking a closer look at the join spike in October 2011, we see that the increase in channel joins is not specific to the People & Blogs category, as it is also observed in other categories. This suggests that the spike is not due to a specific event or trend in the People & Blogs category, but rather to a general trend in the YouTube ecosystem.
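
One way to check this numerically rather than visually is to compare, for each category, the October count with that category's median monthly count for the year. The sketch below is only an illustration: `month_vs_median` is a hypothetical helper, and the toy rows are invented. Months with no joins in any category do not appear as columns, so they are ignored by the median.

```python
import pandas as pd

def month_vs_median(df, year=2011, month=10):
    """Ratio of one month's joins to the median monthly joins that year, per category."""
    dates = pd.to_datetime(df["join_date"])
    in_year = df.loc[dates.dt.year == year].assign(join_month=dates.dt.month)
    # Rows: categories, columns: months, values: number of channel joins
    monthly = in_year.groupby(["category_cc", "join_month"]).size().unstack(fill_value=0)
    return monthly[month] / monthly.median(axis=1)

# Toy data: both categories see three times their usual joins in October
rows = (
    [("A", "2011-09-05")] * 1 + [("A", "2011-10-05")] * 3 + [("A", "2011-11-05")] * 1
    + [("B", "2011-09-05")] * 2 + [("B", "2011-10-05")] * 6 + [("B", "2011-11-05")] * 2
)
toy = pd.DataFrame(rows, columns=["category_cc", "join_date"])

ratios = month_vs_median(toy)
print(ratios)
```

Ratios well above 1 in most categories would support the reading that the spike is platform-wide.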

### Average channel metrics

Let's now take a look at the general distribution of channel metrics within the People & Blogs category, i.e. the video and subscriber counts of its channels.

In [40]:
# Calculate the average values
avg_videos = df_people_and_blogs['videos_cc'].mean()
avg_subscribers = df_people_and_blogs['subscribers_cc'].mean()

# Print the average values
print(f'Average number of videos: {avg_videos:.2f}')
print(f'Average number of subscribers: {avg_subscribers:.2f}')

Average number of videos: 322.88
Average number of subscribers: 155921.60


In [35]:
# Print the average number of videos and subscribers for each category
# (dropna() skips channels whose category is missing)
for category in df_channels_en['category_cc'].dropna().unique():
    in_category = df_channels_en[df_channels_en['category_cc'] == category]
    avg_videos = in_category['videos_cc'].mean()
    avg_subscribers = in_category['subscribers_cc'].mean()
    print(f'Average number of videos for {category}: {avg_videos:.2f}')
    print(f'Average number of subscribers for {category}: {avg_subscribers:.2f}')

Average number of videos for Gaming: 750.96
Average number of subscribers for Gaming: 202022.44
Average number of videos for Education: 552.90
Average number of subscribers for Education: 268202.64
Average number of videos for Entertainment: 645.80
Average number of subscribers for Entertainment: 351383.38
Average number of videos for Howto & Style: 392.43
Average number of subscribers for Howto & Style: 233022.90
Average number of videos for Sports: 1053.80
Average number of subscribers for Sports: 184476.34
Average number of videos for Music: 666.22
Average number of subscribers for Music: 292134.64
Average number of videos for Film and Animation: 325.53
Average number of subscribers for Film and Animation: 228242.99
Average number of videos for Comedy: 287.11
Average number of subscribers for Comedy: 432108.48
Average number of videos for Nonprofits & Activism: 902.77
Average number of subscribers for Nonprofits & Activism: 94647.70
Average number of videos for People & Blogs: 322.8

These values put the average People & Blogs channel at roughly 156,000 subscribers and 323 videos. Among the categories printed above, this is on the lower end: only Nonprofits & Activism has a lower average subscriber count, and only Comedy has a lower average video count. These averages are also likely skewed by a few outliers, as we can see in the following boxplots.
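
Before looking at the boxplots, a simple way to gauge the effect of outliers is to compare the mean with the median, which is much less sensitive to extreme values. The sketch below uses invented numbers purely for illustration; on the real data, the same `agg` call could be applied to `df_people_and_blogs`.

```python
import pandas as pd

# Toy data with one very large channel (illustrative values only)
toy = pd.DataFrame({
    "videos_cc": [20, 35, 50, 80, 5000],
    "subscribers_cc": [11_000, 14_000, 20_000, 45_000, 3_000_000],
})

# One row per statistic, one column per metric
summary = toy[["videos_cc", "subscribers_cc"]].agg(["mean", "median"])
print(summary)
```

A mean far above the median is a typical sign of a right-skewed distribution.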

In [36]:
# Boxplot for 'videos_cc' column
fig_videos_box = px.box(df_people_and_blogs, y='videos_cc', title='Boxplot of Videos')
fig_videos_box.update_traces(marker=dict(color='blue', line=dict(color='black', width=1)))
fig_videos_box.update_layout(yaxis_title='Number of Videos')
fig_videos_box.show()

pyo.plot(fig_videos_box, filename='videos_boxplot.html')

# Boxplot for 'subscribers_cc' column
fig_subscribers_box = px.box(df_people_and_blogs, y='subscribers_cc', title='Boxplot of Subscribers')
fig_subscribers_box.update_traces(marker=dict(color='green', line=dict(color='black', width=1)))
fig_subscribers_box.update_layout(yaxis_title='Number of Subscribers')
fig_subscribers_box.show()

pyo.plot(fig_subscribers_box, filename='subscribers_boxplot.html')

'subscribers_boxplot.html'

These plots confirm that the distributions of videos and subscribers are highly skewed, with a few channels having a very high number of videos or subscribers. This is a common pattern in social networks, where a few accounts have a very high number of followers while the majority have relatively few. This aligns with the findings of [James Zern](https://www.telegraph.co.uk/technology/news/8464418/Almost-all-YouTube-views-come-from-just-30-of-films.html), who stated in 2011 that 30 percent of videos accounted for 99 percent of views on the site.
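
A comparable concentration figure can be computed for channels. The sketch below measures the share of a total held by the top fraction of entries; note that it looks at subscribers per channel, not views per video, so it is not the same statistic as the one quoted above. `top_share` is a hypothetical helper and the values are invented.

```python
import pandas as pd

def top_share(values, top_fraction=0.3):
    """Share of the total held by the top `top_fraction` of entries."""
    ordered = values.sort_values(ascending=False)
    n_top = max(1, int(round(len(ordered) * top_fraction)))
    return ordered.iloc[:n_top].sum() / ordered.sum()

# Toy subscriber counts: most channels are small, one is very large
toy_subs = pd.Series([10, 10, 10, 10, 10, 10, 10, 30, 100, 800])

share = top_share(toy_subs, 0.3)
print(f"Top 30% of channels hold {share:.0%} of subscribers")
```

On the real data, the same function could be applied to `df_people_and_blogs['subscribers_cc']`.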

### Number of Videos and Subscribers: an underlying correlation?

Since both the number of videos and the number of subscribers are heavily skewed, we can wonder whether the two metrics are correlated. Let's take a look at the scatterplot of these two metrics.

In [37]:
fig = px.scatter(df_people_and_blogs, x='videos_cc', y='subscribers_cc',
                 title='Scatter Plot of Number of Videos and Subscribers',
                 labels={'videos_cc': 'Number of Videos', 'subscribers_cc': 'Number of Subscribers'})
fig.update_traces(marker=dict(color='blue', size=8, line=dict(width=1, color='DarkSlateGrey')))
fig.show()

pyo.plot(fig, filename='correlation.html')

'correlation.html'

The above plot does not indicate a clear correlation between the number of videos and the number of subscribers. Most channels have relatively few videos and subscribers, while a few channels have either a very high number of videos or a very high number of subscribers. Notably, for the channels in our dataset, none of the most active channels in terms of number of videos breaks the 2.5M subscriber mark. This could suggest a limit to the number of videos a channel can produce while maintaining a high level of quality and engagement.

In [43]:
numerical_columns = ['videos_cc', 'subscribers_cc']  # Add other relevant columns if needed

# Calculating the correlation matrix
correlation_matrix = df_people_and_blogs[numerical_columns].corr()

# Creating a heatmap to visualize the correlation matrix
fig = px.imshow(correlation_matrix,
                labels=dict(color="Correlation"),
                x=numerical_columns,
                y=numerical_columns,
                title='Correlation Heatmap of Numerical Columns')
fig.update_layout(width=600, height=500)
fig.show()
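
Because `DataFrame.corr()` defaults to the Pearson coefficient, which is sensitive to extreme values, a rank-based (Spearman) coefficient may be a useful complement on data this skewed. The sketch below computes it as the Pearson correlation of the ranks, which is the definition of Spearman's coefficient; the toy values are invented for illustration.

```python
import pandas as pd

# Toy data with one extreme subscriber count (illustrative values only)
toy = pd.DataFrame({
    "videos_cc": [10, 20, 40, 80, 160],
    "subscribers_cc": [1e4, 3e4, 2e4, 5e6, 9e4],
})
cols = ["videos_cc", "subscribers_cc"]

# Pearson correlation on the raw values
pearson = toy[cols].corr().loc["videos_cc", "subscribers_cc"]

# Spearman correlation: Pearson correlation computed on the ranks
spearman = toy[cols].rank().corr().loc["videos_cc", "subscribers_cc"]

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

A large gap between the two coefficients would indicate that a handful of extreme channels drives the Pearson value.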

Having explored the fundamental metrics within the People and Blogs category, let's delve into the intricacies of tag analysis. Tags play a pivotal role in content categorization and viewer engagement on YouTube. Analyzing the usage, diversity, and impact of tags can unearth valuable insights into content dynamics, audience preferences, and channel visibility. Let's investigate the tag semantics and their influence within this category to gain a deeper understanding of content strategies and audience interactions.