# **HORIZON CONSULTANCY AND MARKETING GROUP:** *PREDICTING HIGH-GROWTH YOUTUBE CHANNELS USING A CLASSIFICATION MODEL*

## **Business Understanding**

## Overview

Horizon Consultancy and Marketing Group plans to partner with influential YouTubers to promote several brands that the company will be working with over the next few months.
The company's Business Development Manager needs to identify YouTube channels that are likely to experience high subscriber growth in the next month so that they can be targeted for early brand partnerships. Predictions from a classification model will support these data-driven decisions.

## **Data Understanding**

The dataset used in this project is Global YouTube Statistics, which offers a way to analyze the platform's leading channels and draw insights from them. It contains several categorical variables, such as category and channel type, and supports derived classes such as High vs. Low Growth Channel or Monetizable vs. Non-monetizable Channel, which makes it suitable for classification.

## **Data Preparation for Classification**

In [40]:
import pandas as pd 
import numpy as np

In [41]:
df = pd.read_csv('Global YouTube Statistics.csv', encoding='latin1')


In [42]:
df.head()

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,...,2000000.0,2006.0,Mar,13.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
1,2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,...,,2006.0,Mar,5.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
2,3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,...,8000000.0,2012.0,Feb,20.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,...,1000000.0,2006.0,Sep,1.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
4,5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,...,1000000.0,2006.0,Sep,20.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288


In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-n

Our target variable is `high_growth_channel`, which is derived from 'subscribers_for_last_30_days'. This column measures recent subscriber growth, which is exactly what the business problem asks us to predict. It also contains a high share of missing values, so we replace them with the column median below. Because the label is later defined with a strict `>` comparison against that same median, every imputed row ends up in the low-growth class (0).
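The effect of filling with the median and then labeling with `>` can be checked on a toy series. The values below are made up for illustration; only the `fillna` / median / `np.where` pattern is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy 30-day subscriber gains (hypothetical values), two of them missing.
toy = pd.Series([100.0, 200.0, np.nan, 300.0, np.nan, 500.0])

# Same pattern as the notebook: fill gaps with the median of the observed values.
filled = toy.fillna(toy.median())

# Label a row 1 only when it is strictly above the median of the filled column.
labels = np.where(filled > filled.median(), 1, 0)

print(filled.tolist())
print(labels.tolist())
```

Both imputed rows land in class 0, so in the real data a missing growth figure is effectively treated as low growth.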

In [44]:
df['subscribers_for_last_30_days'] = (
    df['subscribers_for_last_30_days']
    .fillna(df['subscribers_for_last_30_days'].median())
)


In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-n

In [46]:
#Creating a new High Growth Channel column

df['high_growth_channel'] = np.where(df['subscribers_for_last_30_days'] > df['subscribers_for_last_30_days'].median(), 1, 0)


In [47]:
df.head()

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude,high_growth_channel
0,1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,...,2006.0,Mar,13.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288,1
1,2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,...,2006.0,Mar,5.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891,0
2,3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,...,2012.0,Feb,20.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891,1
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,...,2006.0,Sep,1.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891,1
4,5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,...,2006.0,Sep,20.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288,1


In [48]:
# Dropping unnecessary columns for modeling
cols_to_drop = [
    'Youtuber',
    'title',  # no effect: the column is named 'Title' and is dropped later, before encoding
    'subscribers_for_last_30_days',
    'lowest_monthly_earnings',
    'highest_monthly_earnings',
    'lowest_yearly_earnings',
    'highest_yearly_earnings',
    'created_year',
    'created_month',
    'created_date',
    'rank',
    'video_views_rank',
    'country_rank',
    'channel_type_rank',
    'Latitude',
    'Longitude',
    'Abbreviation'
]

df_model = df.drop(columns=[col for col in cols_to_drop if col in df.columns])
df_model


Unnamed: 0,subscribers,video views,category,Title,uploads,Country,channel_type,video_views_for_the_last_30_days,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,high_growth_channel
0,245000000,2.280000e+11,Music,T-Series,20082,India,Music,2.258000e+09,28.1,1.366418e+09,5.36,471031528.0,1
1,170000000,0.000000e+00,Film & Animation,youtubemovies,1,United States,Games,1.200000e+01,88.2,3.282395e+08,14.70,270663028.0,0
2,166000000,2.836884e+10,Entertainment,MrBeast,741,United States,Entertainment,1.348000e+09,88.2,3.282395e+08,14.70,270663028.0,1
3,162000000,1.640000e+11,Education,Cocomelon - Nursery Rhymes,966,United States,Education,1.975000e+09,88.2,3.282395e+08,14.70,270663028.0,1
4,159000000,1.480000e+11,Shows,SET India,116536,India,Entertainment,1.824000e+09,28.1,1.366418e+09,5.36,471031528.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,12300000,9.029610e+09,Sports,Natan por Aï¿,1200,Brazil,Entertainment,5.525130e+08,51.3,2.125594e+08,12.08,183241641.0,1
991,12300000,1.674410e+09,People & Blogs,Free Fire India Official,1500,India,Games,6.473500e+07,28.1,1.366418e+09,5.36,471031528.0,1
992,12300000,2.214684e+09,,HybridPanda,2452,United Kingdom,Games,6.703500e+04,60.0,6.683440e+07,3.85,55908316.0,0
993,12300000,3.741235e+08,Gaming,RobTopGames,39,Sweden,Games,3.871000e+06,67.0,1.028545e+07,6.48,9021165.0,0


In [49]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 13 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   subscribers                              995 non-null    int64  
 1   video views                              995 non-null    float64
 2   category                                 949 non-null    object 
 3   Title                                    995 non-null    object 
 4   uploads                                  995 non-null    int64  
 5   Country                                  873 non-null    object 
 6   channel_type                             965 non-null    object 
 7   video_views_for_the_last_30_days         939 non-null    float64
 8   Gross tertiary education enrollment (%)  872 non-null    float64
 9   Population                               872 non-null    float64
 10  Unemployment rate                        872 non-n

In [50]:
df_model.duplicated().sum()

0

In [51]:
df_model.isna().sum()

subscribers                                  0
video views                                  0
category                                    46
Title                                        0
uploads                                      0
Country                                    122
channel_type                                30
video_views_for_the_last_30_days            56
Gross tertiary education enrollment (%)    123
Population                                 123
Unemployment rate                          123
Urban_population                           123
high_growth_channel                          0
dtype: int64

In [52]:
# Checking percentage of missing values for each column
missing_percentage = (df_model.isna().sum() / len(df_model)) * 100
missing_percentage

subscribers                                 0.000000
video views                                 0.000000
category                                    4.623116
Title                                       0.000000
uploads                                     0.000000
Country                                    12.261307
channel_type                                3.015075
video_views_for_the_last_30_days            5.628141
Gross tertiary education enrollment (%)    12.361809
Population                                 12.361809
Unemployment rate                          12.361809
Urban_population                           12.361809
high_growth_channel                         0.000000
dtype: float64

In [53]:
# Dropping rows with a missing category or channel_type
df_model = df_model.dropna(subset = ['category', 'channel_type'])

In [54]:
# Dropping rows with missing 30-day video views
df_model = df_model.dropna(subset = ['video_views_for_the_last_30_days'])

In [55]:
# Filling missing Country values with an 'Unknown' placeholder (numeric columns are median-filled in the next cells)
df_model['Country'] = df_model['Country'].fillna('Unknown')

In [56]:
df_model['Gross tertiary education enrollment (%)'] = (
    df_model['Gross tertiary education enrollment (%)']
    .fillna(df_model['Gross tertiary education enrollment (%)'].median())
)


In [57]:
df_model['Population'] = (
    df_model['Population']
    .fillna(df_model['Population'].median())
)


In [58]:
df_model['Unemployment rate'] = (
    df_model['Unemployment rate']
    .fillna(df_model['Unemployment rate'].median())
)


In [59]:
df_model['Urban_population'] = (
    df_model['Urban_population']
    .fillna(df_model['Urban_population'].median())
)
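The four median fills above follow the same pattern and could be written as one loop. This is a minimal sketch on a toy frame with made-up values; `numeric_cols` is a name introduced here, and in the notebook the loop would run over `df_model` instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_model (hypothetical values).
toy = pd.DataFrame({
    'Population': [1.0e6, np.nan, 3.0e6],
    'Unemployment rate': [5.0, 7.0, np.nan],
})

numeric_cols = ['Population', 'Unemployment rate']

# Fill each numeric column with its own median, one column at a time.
for col in numeric_cols:
    toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```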


In [60]:
df_model.isna().sum()

subscribers                                0
video views                                0
category                                   0
Title                                      0
uploads                                    0
Country                                    0
channel_type                               0
video_views_for_the_last_30_days           0
Gross tertiary education enrollment (%)    0
Population                                 0
Unemployment rate                          0
Urban_population                           0
high_growth_channel                        0
dtype: int64

## **Modeling**

The categorical variables (Country, category, and channel_type) were transformed using one-hot encoding to convert them into a numerical format suitable for classification models. The 'Title' feature was excluded from this process because titles are largely unique to each channel, so one-hot encoding it would create hundreds of sparse columns and slow training.
We therefore drop the column before modeling. This should have little effect on the model, since a title identifies a channel rather than describing its growth.
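What one-hot encoding with `drop_first=True` produces can be seen on a small frame. The rows below are hypothetical; the column names mirror the notebook's.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column.
toy = pd.DataFrame({
    'uploads': [10, 20, 30, 40],
    'channel_type': ['Music', 'Games', 'Music', 'Education'],
})

# drop_first=True drops the first category in sorted order ('Education'),
# which becomes the implicit baseline for the remaining dummy columns.
encoded = pd.get_dummies(toy, columns=['channel_type'], drop_first=True)

print(list(encoded.columns))
```

A column with k distinct values therefore adds k - 1 dummy columns, which is why a near-unique column such as 'Title' is not a good candidate for this encoding.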

In [61]:
df_model = df_model.drop(columns=['Title'])


In [62]:
df_model_encoded = pd.get_dummies(
    df_model,
    columns=['Country', 'category', 'channel_type'],
    drop_first=True
)
df_model_encoded.head()

Unnamed: 0,subscribers,video views,uploads,video_views_for_the_last_30_days,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,high_growth_channel,Country_Argentina,...,channel_type_Entertainment,channel_type_Film,channel_type_Games,channel_type_Howto,channel_type_Music,channel_type_News,channel_type_Nonprofit,channel_type_People,channel_type_Sports,channel_type_Tech
0,245000000,228000000000.0,20082,2258000000.0,28.1,1366418000.0,5.36,471031528.0,1,0,...,0,0,0,0,1,0,0,0,0,0
1,170000000,0.0,1,12.0,88.2,328239500.0,14.7,270663028.0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,166000000,28368840000.0,741,1348000000.0,88.2,328239500.0,14.7,270663028.0,1,0,...,1,0,0,0,0,0,0,0,0,0
3,162000000,164000000000.0,966,1975000000.0,88.2,328239500.0,14.7,270663028.0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,159000000,148000000000.0,116536,1824000000.0,28.1,1366418000.0,5.36,471031528.0,1,0,...,1,0,0,0,0,0,0,0,0,0


In [63]:
df_model_encoded.shape

(889, 87)

## **Baseline Model (Logistic Regression)**


A logistic regression model was used as a baseline due to its interpretability and effectiveness for binary classification problems.

In [64]:
#Separating features and target variable
X = df_model_encoded.drop(columns='high_growth_channel')
y = df_model_encoded['high_growth_channel']


In [65]:
#split before training to prevent data leakage
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
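The `stratify=y` argument keeps the class ratio similar in both splits, which matters here because high-growth channels are the minority. A minimal sketch with synthetic labels (the 25% positive rate and the placeholder feature are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 250 positives out of 1000 (hypothetical 25% positive rate).
y = np.array([1] * 250 + [0] * 750)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both splits keep roughly the same share of positives.
print(y_train.mean(), y_test.mean())
```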


In [66]:
#Scaling features 
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [67]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    max_iter=1000,
    random_state=42
)

log_reg.fit(X_train_scaled, y_train)


LogisticRegression(max_iter=1000, random_state=42)

In [68]:
y_train_pred = log_reg.predict(X_train_scaled)
y_test_pred = log_reg.predict(X_test_scaled)


**Classification Metrics**

In [69]:
from sklearn.metrics import accuracy_score, classification_report

print("Training Accuracy:", accuracy_score(y_train, y_train_pred))
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))

print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_pred))


Training Accuracy: 0.870604781997187
Test Accuracy: 0.8089887640449438

Classification Report (Test Set):
              precision    recall  f1-score   support

           0       0.84      0.93      0.88       134
           1       0.67      0.45      0.54        44

    accuracy                           0.81       178
   macro avg       0.75      0.69      0.71       178
weighted avg       0.80      0.81      0.80       178



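Model performance was evaluated using accuracy, precision, recall, and F1-score to assess how well the model identifies high-growth channels. On the test set the baseline reaches an accuracy of 0.81, but recall for the high-growth class is only 0.45, so more than half of the high-growth channels are missed.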
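The class 1 figures in the report can be reproduced by hand. The counts below are not printed in the notebook; they are inferred from the reported support (134 and 44), recall, and accuracy of the baseline model.

```python
# Confusion-matrix counts for the test set, inferred from the baseline report
# (they are not printed in the notebook): 178 rows, 44 of them high-growth.
tp, fn = 20, 24   # high-growth channels found / missed
tn, fp = 124, 10  # low-growth channels kept / wrongly flagged

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# → 0.67 0.45 0.54 0.81
```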

### **Tuned Logistic Regression Model**

Hyperparameter tuning allows us to control model complexity and reduce overfitting, potentially improving performance on unseen data.
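How `C` controls complexity can be seen by counting nonzero coefficients under an L1 penalty. This is a sketch on synthetic data from `make_classification`, not the YouTube dataset; the two `C` values are picked from the ends of the grid used below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the YouTube dataset): 20 features, only 4 informative.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4,
    n_redundant=0, random_state=0
)
X_scaled = StandardScaler().fit_transform(X)

nonzero = {}
for C in [0.01, 100]:
    model = LogisticRegression(
        C=C, penalty='l1', solver='liblinear', max_iter=1000, random_state=42
    )
    model.fit(X_scaled, y)
    # Smaller C means stronger regularization, which zeroes out more coefficients.
    nonzero[C] = int((model.coef_ != 0).sum())

print(nonzero)
```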

In [70]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [71]:
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('log_reg', LogisticRegression(max_iter=1000, random_state=42))
])

In [72]:
param_grid = {
    'log_reg__C': [0.01, 0.1, 1, 10, 100],
    'log_reg__penalty': ['l1', 'l2'],
    'log_reg__solver': ['liblinear']
}

In [73]:
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1
)

In [74]:
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('log_reg',
                                        LogisticRegression(max_iter=1000,
                                                           random_state=42))]),
             n_jobs=-1,
             param_grid={'log_reg__C': [0.01, 0.1, 1, 10, 100],
                         'log_reg__penalty': ['l1', 'l2'],
                         'log_reg__solver': ['liblinear']},
             scoring='f1')

In [75]:
best_model = grid_search.best_estimator_

print("Best Parameters:")
print(grid_search.best_params_)

Best Parameters:
{'log_reg__C': 1, 'log_reg__penalty': 'l1', 'log_reg__solver': 'liblinear'}


In [76]:
from sklearn.metrics import accuracy_score, classification_report

y_test_pred_tuned = best_model.predict(X_test)

print("Tuned Model Test Accuracy:", accuracy_score(y_test, y_test_pred_tuned))

print("\nClassification Report (Tuned Model):")
print(classification_report(y_test, y_test_pred_tuned))

Tuned Model Test Accuracy: 0.8202247191011236

Classification Report (Tuned Model):
              precision    recall  f1-score   support

           0       0.84      0.93      0.89       134
           1       0.70      0.48      0.57        44

    accuracy                           0.82       178
   macro avg       0.77      0.71      0.73       178
weighted avg       0.81      0.82      0.81       178



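After tuning the regularization parameters, the logistic regression model improved slightly on the test set: accuracy rose from 0.81 to 0.82 and the F1-score for the high-growth class from 0.54 to 0.57. The selected model uses an L1 penalty with C = 1, which suggests that a sparser set of coefficients generalizes somewhat better than the default L2 fit.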

## **Decision Tree**

A decision tree is a nonparametric model that can capture nonlinear relationships and interactions between features without requiring feature scaling.
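The claim that trees do not need feature scaling can be checked by fitting the same tree on raw and standardized features. This is a sketch on synthetic data from `make_classification`, not the YouTube dataset; `raw_tree` and `scaled_tree` are names introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the YouTube dataset), used only to compare the two fits.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Tree fitted on the raw features.
raw_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Tree fitted on standardized features.
scaler = StandardScaler().fit(X_train)
scaled_tree = DecisionTreeClassifier(random_state=42).fit(
    scaler.transform(X_train), y_train
)

# Splits depend on the ordering of values, which standardization preserves,
# so the two trees should agree on (almost) every test prediction.
agreement = (
    raw_tree.predict(X_test) == scaled_tree.predict(scaler.transform(X_test))
).mean()
print(agreement)
```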

In [77]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report


In [78]:
dt_model = DecisionTreeClassifier(
    random_state=42
)


In [79]:
dt_model.fit(X_train, y_train)


DecisionTreeClassifier(random_state=42)

In [80]:
y_test_pred_dt = dt_model.predict(X_test)

print("Decision Tree Test Accuracy:", accuracy_score(y_test, y_test_pred_dt))
print("\nClassification Report (Decision Tree):")
print(classification_report(y_test, y_test_pred_dt))


Decision Tree Test Accuracy: 0.8426966292134831

Classification Report (Decision Tree):
              precision    recall  f1-score   support

           0       0.88      0.91      0.90       134
           1       0.70      0.64      0.67        44

    accuracy                           0.84       178
   macro avg       0.79      0.77      0.78       178
weighted avg       0.84      0.84      0.84       178



In [81]:
feature_importances = pd.Series(
    dt_model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

feature_importances.head(10)


video_views_for_the_last_30_days           0.654387
video views                                0.101487
Gross tertiary education enrollment (%)    0.042733
uploads                                    0.041950
subscribers                                0.039018
channel_type_People                        0.017030
category_News & Politics                   0.013831
Unemployment rate                          0.012030
Country_Mexico                             0.008119
Country_Ecuador                            0.006993
dtype: float64

## **Tuned Decision Tree**

While decision trees can capture complex nonlinear relationships, they are prone to overfitting. Hyperparameter tuning helps control model complexity and improve generalization to unseen data.
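The overfitting tendency shows up as a gap between training and test accuracy. A minimal sketch on noisy synthetic data (not the YouTube dataset), comparing an unrestricted tree with one limited to depth 3; both settings are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (not the YouTube dataset); flip_y adds label noise.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scores = {}
for depth in [None, 3]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for each depth setting.
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

print(scores)
```

An unrestricted tree fits the training split perfectly, so the distance to its test accuracy is the signal that depth or leaf-size limits are worth tuning. The notebook reports only test metrics for its trees, so printing the training accuracy as well would make this gap visible.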

In [82]:
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10]
}


In [83]:
dt = DecisionTreeClassifier(random_state=42)

grid_search_dt = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1
)


In [84]:
grid_search_dt.fit(X_train, y_train)


GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=42), n_jobs=-1,
             param_grid={'max_depth': [3, 5, 10, None],
                         'min_samples_leaf': [1, 5, 10],
                         'min_samples_split': [2, 5, 10]},
             scoring='f1')

In [85]:
grid_search_dt.best_params_


{'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}

In [86]:
best_dt_model = grid_search_dt.best_estimator_


In [87]:
y_test_pred_best_dt = best_dt_model.predict(X_test)

print("Tuned Decision Tree Test Accuracy:",
      accuracy_score(y_test, y_test_pred_best_dt))

print("\nClassification Report (Tuned Decision Tree):")
print(classification_report(y_test, y_test_pred_best_dt))


Tuned Decision Tree Test Accuracy: 0.8876404494382022

Classification Report (Tuned Decision Tree):
              precision    recall  f1-score   support

           0       0.90      0.96      0.93       134
           1       0.85      0.66      0.74        44

    accuracy                           0.89       178
   macro avg       0.87      0.81      0.84       178
weighted avg       0.89      0.89      0.88       178



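After tuning, the decision tree improves on every test metric for the high-growth class: precision rises from 0.70 to 0.85, recall from 0.64 to 0.66, and F1 from 0.67 to 0.74, while accuracy rises from 0.84 to 0.89. Limiting the depth to 5 mainly reduces false positives, which indicates better generalization than the untuned model.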

In [None]:
#Updated Feature Importances after tuning
feature_importances_tuned = pd.Series(
    best_dt_model.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)

feature_importances_tuned.head(10)


video_views_for_the_last_30_days           0.738844
video views                                0.098868
Gross tertiary education enrollment (%)    0.049926
subscribers                                0.026715
uploads                                    0.015080
channel_type_People                        0.014840
Country_Mexico                             0.010012
category_Music                             0.009850
Country_Ecuador                            0.008623
Country_Ukraine                            0.008584
dtype: float64
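Because Country, category, and channel_type are spread over many dummy columns, their importance is split across those columns. The sketch below sums importances back to the original feature; `source_column` is a hypothetical helper, and only the ten values printed above are used, so the grouped totals are lower bounds.

```python
import pandas as pd

# The ten importances printed above for the tuned tree; columns outside the
# top ten are not shown, so grouped totals here are lower bounds.
top10 = pd.Series({
    'video_views_for_the_last_30_days': 0.738844,
    'video views': 0.098868,
    'Gross tertiary education enrollment (%)': 0.049926,
    'subscribers': 0.026715,
    'uploads': 0.015080,
    'channel_type_People': 0.014840,
    'Country_Mexico': 0.010012,
    'category_Music': 0.009850,
    'Country_Ecuador': 0.008623,
    'Country_Ukraine': 0.008584,
})

def source_column(name):
    """Map a dummy column such as 'Country_Mexico' back to 'Country'."""
    for prefix in ('Country', 'category', 'channel_type'):
        if name.startswith(prefix + '_'):
            return prefix
    return name

# Sum importances that belong to the same original feature.
grouped = top10.groupby(source_column).sum().sort_values(ascending=False)
print(grouped)
```

In the notebook the same grouping could be applied to the full `feature_importances_tuned` series rather than to the top ten only.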