# **HORIZON CONSULTANCY AND MARKETING GROUP:** *PREDICTING HIGH-GROWTH YOUTUBE CHANNELS USING A CLASSIFICATION MODEL*

## **Business Understanding**

## Overview

Horizon Consultancy and Marketing Group plans to partner with influential YouTubers to promote several brands that the company will be working with over the next few months.
The company's Business Development Manager needs to identify YouTube channels that are likely to experience high subscriber growth in the next month so that they can be targeted for early brand partnerships. Predictions from a classification model will support these data-driven decisions.

## **Data Understanding**

The dataset used in this project is Global YouTube Statistics. It offers a good basis for analyzing the leading creators on the platform and gaining insights from them. It contains several categorical variables, such as category and channel type, and supports derived classes such as High vs Low Growth Channel or Monetizable vs Non-monetizable Channel, which makes it suitable for classification.
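A derived class such as High vs Low Growth Channel has to be built from an existing numeric column. The sketch below shows one possible way to do that on a toy frame. It assumes that the label comes from `subscribers_for_last_30_days` and that the cutoff is the median of that column; both the threshold and the `high_growth` column name are hypothetical and are not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D", "E"],
    "subscribers_for_last_30_days": [2_000_000, 100_000, None, 800_000, 300_000],
})

# Assumption: channels with unknown recent growth cannot be labelled, so drop them.
labelled = toy.dropna(subset=["subscribers_for_last_30_days"]).copy()

# Assumption: "high growth" means above the median 30-day subscriber gain.
threshold = labelled["subscribers_for_last_30_days"].median()
labelled["high_growth"] = (
    labelled["subscribers_for_last_30_days"] > threshold
).astype(int)

print(labelled[["Youtuber", "high_growth"]])
```

If the label is derived this way, the source column should not also be used as a feature, since it would reveal the label directly.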

## **Data Preparation for Classification**

In [49]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


In [50]:
df = pd.read_csv('Global YouTube Statistics.csv', encoding='latin1')


In [51]:
df.head()

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,...,2000000.0,2006.0,Mar,13.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
1,2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,...,,2006.0,Mar,5.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
2,3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,...,8000000.0,2012.0,Feb,20.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,...,1000000.0,2006.0,Sep,1.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
4,5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,...,1000000.0,2006.0,Sep,20.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288


In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-n

In [None]:
# Dropping unnecessary columns
cols_to_drop = [
    'Youtuber',
    'title',  # note: the actual column is named 'Title', so this entry matches nothing and the column is kept
    'subscribers_for_last_30_days',
    'lowest_monthly_earnings',
    'highest_monthly_earnings',
    'lowest_yearly_earnings',
    'highest_yearly_earnings',
    'created_year',
    'created_month',
    'created_date',
    'rank',
    'video_views_rank',
    'country_rank',
    'channel_type_rank',
    'Latitude',
    'Longitude',
    'Abbreviation'
]

df_model = df.drop(columns=[col for col in cols_to_drop if col in df.columns])
df_model


Unnamed: 0,subscribers,video views,category,Title,uploads,Country,channel_type,video_views_for_the_last_30_days,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population
0,245000000,2.280000e+11,Music,T-Series,20082,India,Music,2.258000e+09,28.1,1.366418e+09,5.36,471031528.0
1,170000000,0.000000e+00,Film & Animation,youtubemovies,1,United States,Games,1.200000e+01,88.2,3.282395e+08,14.70,270663028.0
2,166000000,2.836884e+10,Entertainment,MrBeast,741,United States,Entertainment,1.348000e+09,88.2,3.282395e+08,14.70,270663028.0
3,162000000,1.640000e+11,Education,Cocomelon - Nursery Rhymes,966,United States,Education,1.975000e+09,88.2,3.282395e+08,14.70,270663028.0
4,159000000,1.480000e+11,Shows,SET India,116536,India,Entertainment,1.824000e+09,28.1,1.366418e+09,5.36,471031528.0
...,...,...,...,...,...,...,...,...,...,...,...,...
990,12300000,9.029610e+09,Sports,Natan por Aï¿,1200,Brazil,Entertainment,5.525130e+08,51.3,2.125594e+08,12.08,183241641.0
991,12300000,1.674410e+09,People & Blogs,Free Fire India Official,1500,India,Games,6.473500e+07,28.1,1.366418e+09,5.36,471031528.0
992,12300000,2.214684e+09,,HybridPanda,2452,United Kingdom,Games,6.703500e+04,60.0,6.683440e+07,3.85,55908316.0
993,12300000,3.741235e+08,Gaming,RobTopGames,39,Sweden,Games,3.871000e+06,67.0,1.028545e+07,6.48,9021165.0
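Because the drop above filters the list with `if col in df.columns`, a misspelled or differently cased name is skipped silently. The following sketch, on a toy frame with hypothetical values, shows one way to surface such names before dropping. It uses `errors="ignore"`, a standard option of `DataFrame.drop`, in place of the list comprehension.

```python
import pandas as pd

# Toy frame with a subset of the real column names.
toy = pd.DataFrame({
    "Youtuber": ["T-Series"],
    "Title": ["T-Series"],
    "rank": [1],
    "subscribers": [245000000],
})

cols_to_drop = ["Youtuber", "title", "rank"]

# Names requested for removal that do not exist in the frame (case-sensitive).
unmatched = [col for col in cols_to_drop if col not in toy.columns]
print("Not found:", unmatched)

# errors="ignore" skips missing labels instead of raising KeyError.
trimmed = toy.drop(columns=cols_to_drop, errors="ignore")
print(trimmed.columns.tolist())
```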


In [54]:
df_model.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 12 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   subscribers                              995 non-null    int64  
 1   video views                              995 non-null    float64
 2   category                                 949 non-null    object 
 3   Title                                    995 non-null    object 
 4   uploads                                  995 non-null    int64  
 5   Country                                  873 non-null    object 
 6   channel_type                             965 non-null    object 
 7   video_views_for_the_last_30_days         939 non-null    float64
 8   Gross tertiary education enrollment (%)  872 non-null    float64
 9   Population                               872 non-null    float64
 10  Unemployment rate                        872 non-n

In [55]:
df_model.duplicated().sum()

0

In [56]:
df_model.isna().sum()

subscribers                                  0
video views                                  0
category                                    46
Title                                        0
uploads                                      0
Country                                    122
channel_type                                30
video_views_for_the_last_30_days            56
Gross tertiary education enrollment (%)    123
Population                                 123
Unemployment rate                          123
Urban_population                           123
dtype: int64

In [None]:
# Checking percentage of missing values for each column
missing_percentage = (df_model.isna().sum() / len(df_model)) * 100
missing_percentage

subscribers                                 0.000000
video views                                 0.000000
category                                    4.623116
Title                                       0.000000
uploads                                     0.000000
Country                                    12.261307
channel_type                                3.015075
video_views_for_the_last_30_days            5.628141
Gross tertiary education enrollment (%)    12.361809
Population                                 12.361809
Unemployment rate                          12.361809
Urban_population                           12.361809
dtype: float64
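The percentages above can guide whether to drop rows or impute a column. As a minimal sketch, the code below computes the same percentages with `isna().mean()` and splits columns by a cutoff. The cutoff value and the names `CUTOFF`, `drop_rows_cols` and `impute_cols` are assumptions made for illustration, not a rule stated in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: 10% missing in category, 30% missing in Population, none in uploads.
toy = pd.DataFrame({
    "category": ["Music", None, "Sports", "Music", "Gaming",
                 "Music", "Shows", "Music", "News", "Music"],
    "Population": [1.0, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0, 8.0, 9.0, 10.0],
    "uploads": list(range(10)),
})

# The mean of a boolean mask is the fraction of missing values per column.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

CUTOFF = 20.0  # hypothetical threshold, in percent

# Few missing values: dropping the affected rows loses little data.
drop_rows_cols = missing_pct[(missing_pct > 0) & (missing_pct <= CUTOFF)].index.tolist()
# Many missing values: imputing keeps more rows.
impute_cols = missing_pct[missing_pct > CUTOFF].index.tolist()

print(missing_pct)
print("Drop rows for:", drop_rows_cols)
print("Impute:", impute_cols)
```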

In [58]:
# Dropping rows with missing category or channel_type
df_model = df_model.dropna(subset = ['category', 'channel_type'])

In [59]:
# Dropping rows with missing video_views_for_the_last_30_days
df_model = df_model.dropna(subset = ['video_views_for_the_last_30_days'])
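The two `dropna` calls can also be expressed as a single call, which makes it easy to report how many rows were removed. This is a sketch on toy data; the `required`, `cleaned` and `removed` names are illustrative, and resetting the index is an optional choice that the notebook does not make.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "category": ["Music", None, "Sports", "Gaming", "Shows"],
    "channel_type": ["Music", "Games", None, "Games", "Entertainment"],
    "video_views_for_the_last_30_days": [1.0, 2.0, 3.0, np.nan, 5.0],
    "Country": ["India", None, "Brazil", "Sweden", None],
})

# Columns that must be present for a row to be kept.
required = ["category", "channel_type", "video_views_for_the_last_30_days"]

before = len(toy)
# Missing values in columns outside `required` (here Country) do not drop a row.
cleaned = toy.dropna(subset=required).reset_index(drop=True)
removed = before - len(cleaned)

print(f"Removed {removed} of {before} rows")
```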

In [None]:
# Filling missing Country values with the placeholder 'Unknown'
df_model['Country'] = df_model['Country'].fillna('Unknown')

In [70]:
# Filling missing values with the median for numerical columns
df_model['Gross tertiary education enrollment (%)'] = (
    df_model['Gross tertiary education enrollment (%)']
    .fillna(df_model['Gross tertiary education enrollment (%)'].median())
)


In [63]:
df_model['Population'] = (
    df_model['Population']
    .fillna(df_model['Population'].median())
)


In [65]:
df_model['Unemployment rate'] = (
    df_model['Unemployment rate']
    .fillna(df_model['Unemployment rate'].median())
)


In [66]:
df_model['Urban_population'] = (
    df_model['Urban_population']
    .fillna(df_model['Urban_population'].median())
)


In [68]:
df_model.isna().sum()

subscribers                                0
video views                                0
category                                   0
Title                                      0
uploads                                    0
Country                                    0
channel_type                               0
video_views_for_the_last_30_days           0
Gross tertiary education enrollment (%)    0
Population                                 0
Unemployment rate                          0
Urban_population                           0
dtype: int64
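The per-column median fills above follow the same pattern, so they can be condensed. A minimal sketch on toy data is shown below. It relies on `DataFrame.fillna` accepting a Series keyed by column name; the `numeric_cols`, `medians` and `filled` names are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Population": [10.0, np.nan, 30.0, 50.0],
    "Unemployment rate": [5.0, 7.0, np.nan, 9.0],
    "Country": ["India", None, "Brazil", "Sweden"],
})

numeric_cols = ["Population", "Unemployment rate"]

# One median per numeric column, indexed by column name.
medians = toy[numeric_cols].median()

# fillna with a Series fills each column with the value under its own label;
# columns not present in `medians` (Country) are left untouched.
filled = toy.fillna(medians)
filled["Country"] = filled["Country"].fillna("Unknown")

print(filled)
```

If the imputation values should not be influenced by test rows, `medians` can instead be computed on the training split only and then applied to both splits.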

## **Modeling**