# Exploratory Analysis of the Top 200 YouTubers

In this project, data on the 200 largest YouTube channels (by number of views) was imported and preprocessed, followed by an exploratory data analysis with visualization of the results.

<b>Data dictionary:</b>

 - `Country`: Country code ('IN', 'US', 'KR', 'CA', 'BR', 'MX', 'SV', 'CL', 'NO', 'PR', 'BY', 'RU', 'PH', 'TH', 'AE', 'CO', 'ES', 'GB', 'AR', 'ID', 'NL', 'IE', 'PK', 'AU', 'KW', 'SO')
 - `Channel Name`: Name of the channel
 - `Category`: Category ('Gaming & Apps', 'Sports', 'Music', 'Beauty & Fashion', 'Science & Tech', 'Fashion', 'LifeStyle')
 - `Main Video Category`: Main category of the channel's videos ('Music', 'Education', 'Shows', 'Gaming', 'Entertainment', 'People & Blogs', 'Sports', 'Howto & Style', 'Film & Animation', 'News & Politics', 'Pop music', 'Comedy', 'Nonprofits & Activism', 'Action-adventure game', 'Strategy video game', 'TV shows')
 - `username`: Username of the YouTube channel
 - `followers`: Number of followers
 - `Main topic`: Most frequently discussed topic
 - `More topics`: Topics other than the main topic
 - `Likes`: Total likes
 - `Boost Index`: Boost index value
 - `Engagement Rate`: Rate of engagement with the users
 - `Engagement Rate 60days`: Rate of engagement with the users over 60 days
 - `Views`: Total views
 - `Views Avg.`: Average views
 - `Avg. 1 Day`: Average views for 1 day
 - `Avg. 3 Day`: Average views for 3 days
 - `Avg. 7 Day`: Average views for 7 days
 - `Avg. 14 Day`: Average views for 14 days
 - `Avg. 30 day`: Average views for 30 days
 - `Avg. 60 day`: Average views for 60 days
 - `Comments Avg`: Average number of comments
 - `Youtube Link`: Link to the channel



## Data preprocessing


### Importing libraries
<b>First, import all the libraries we need:</b>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
pd.set_option('display.max_columns', 22)

### Importing data
<b>To begin, load the data into a DataFrame and look at its summary.</b>

In [2]:
folders = Path.cwd()
file_name = Path(folders, 'top_200_youtubers.csv')

df = pd.read_csv(file_name)

Find the size of the table:

In [3]:
size = df.shape
display(size)

(857, 22)

Look at the summary of the data it contains:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857 entries, 0 to 856
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 707 non-null    object 
 1   Channel Name            857 non-null    object 
 2   Category                736 non-null    object 
 3   Main Video Category     855 non-null    object 
 4   username                857 non-null    object 
 5   followers               857 non-null    int64  
 6   Main topic              855 non-null    object 
 7   More topics             855 non-null    object 
 8   Likes                   857 non-null    float64
 9   Boost Index             857 non-null    int64  
 10  Engagement Rate         855 non-null    float64
 11  Engagement Rate 60days  857 non-null    float64
 12  Views                   857 non-null    int64  
 13  Views Avg.              855 non-null    float64
 14  Avg. 1 Day              494 non-null    fl

Print the first 10 rows of the table to look at the data:

In [5]:
df.head(10)

Unnamed: 0,Country,Channel Name,Category,Main Video Category,username,followers,Main topic,More topics,Likes,Boost Index,Engagement Rate,Engagement Rate 60days,Views,Views Avg.,Avg. 1 Day,Avg. 3 Day,Avg. 7 Day,Avg. 14 Day,Avg. 30 day,Avg. 60 day,Comments Avg,Youtube Link
0,IN,T-Series,Gaming & Apps,Music,T-Series,220000000,Music of Asia,"Entertainment,Music of Asia,Music,Movies",1602680000.0,83,0.033463,0.010879,195660744416,2095329.0,152244.8,2134569.625,1809830.0,2306178.0,1676330.0,2295416.0,4493.984146,UCq-Fj5jknLsUf-MWSy4_brA
1,US,ABCkidTV - Nursery Rhymes,Gaming & Apps,Education,ABCkidTV - Nursery Rhymes,138000000,Movies,"Entertainment,Music,Movies",220990100.0,63,0.641716,0.116004,133025325473,70271260.0,1837916.0,1837916.0,4891832.0,7052576.0,12654330.0,15722840.0,146.700252,UCbCmjCuTUZos6Inko4u57UQ
2,IN,SET India,Gaming & Apps,Shows,SET India,137000000,Movies,"Entertainment,TV shows,Music,Movies",174875200.0,79,0.001206,0.002366,121741739317,109572.9,,586040.0,280127.6,343788.1,353601.9,322033.6,76.244316,UCpEhnqL0y41EpW2TvWAHD7Q
3,US,PewDiePie,Gaming & Apps,Gaming,PewDiePie,111000000,Lifestyle,"Gaming,Action game,Lifestyle,Action-adventure ...",2191406000.0,88,0.063426,0.044846,28424113942,7718345.0,,,3497395.0,3094440.0,3620274.0,4454120.0,35839.781347,UC-lHJZR3Gqxm24_Vd_AJ5Yw
4,US,MrBeast,Gaming & Apps,Entertainment,MrBeast,98100000,Lifestyle,"Entertainment,Lifestyle,Technology",1731833000.0,60,0.72921,0.57027,16242634269,98762500.0,,,29941020.0,29941020.0,29941020.0,53434730.0,113432.373684,UCX6OQ3DkcsbYNE6H8uQQuVA
5,US,Like Nastya,,People & Blogs,Like Nastya,97300000,Hobby,"Lifestyle,Hobby",280877700.0,79,1.026761,0.231388,80111555805,138163700.0,,3946868.0,5829929.0,9480138.0,13756910.0,22108750.0,0.29249,UCJplp5SjeGSdVdwsfb9Q7lQ
6,US,✿ Kids Diana Show,Gaming & Apps,Entertainment,✿ Kids Diana Show,97200000,Lifestyle,"Lifestyle,Hobby",235190400.0,67,0.690342,0.243853,77340155581,86417350.0,,,7539490.0,9148088.0,14347390.0,23467100.0,0.0,UCk8GzjMOrta8yxDcKfylJYw
7,IN,T-Series,Gaming & Apps,Music,T-Series,220000000,Music of Asia,"Entertainment,Music of Asia,Music,Movies",1602680000.0,83,0.033463,0.010879,195660744416,2095329.0,152244.8,2134569.625,1809830.0,2306178.0,1676330.0,2295416.0,4493.984146,UCq-Fj5jknLsUf-MWSy4_brA
8,US,ABCkidTV - Nursery Rhymes,Gaming & Apps,Education,ABCkidTV - Nursery Rhymes,138000000,Movies,"Entertainment,Music,Movies",220990100.0,63,0.641716,0.116004,133025325473,70271260.0,1837916.0,1837916.0,4891832.0,7052576.0,12654330.0,15722840.0,146.700252,UCbCmjCuTUZos6Inko4u57UQ
9,IN,SET India,Gaming & Apps,Shows,SET India,137000000,Movies,"Entertainment,TV shows,Music,Movies",174875200.0,79,0.001206,0.002366,121741739317,109572.9,,586040.0,280127.6,343788.1,353601.9,322033.6,76.244316,UCpEhnqL0y41EpW2TvWAHD7Q


The data clearly contains duplicates, and some fields have missing values.

### Removing duplicates
Drop the duplicates and see how the dataset has changed:

In [6]:
df = df.drop_duplicates().reset_index(drop=True)
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 164 non-null    object 
 1   Channel Name            200 non-null    object 
 2   Category                164 non-null    object 
 3   Main Video Category     199 non-null    object 
 4   username                200 non-null    object 
 5   followers               200 non-null    int64  
 6   Main topic              199 non-null    object 
 7   More topics             199 non-null    object 
 8   Likes                   200 non-null    float64
 9   Boost Index             200 non-null    int64  
 10  Engagement Rate         199 non-null    float64
 11  Engagement Rate 60days  200 non-null    float64
 12  Views                   200 non-null    int64  
 13  Views Avg.              199 non-null    float64
 14  Avg. 1 Day              117 non-null    fl

(200, 22)
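
Before dropping duplicates it can be useful to count them, so that the change in row count is not a surprise. The following is a minimal sketch on a made-up toy frame; the `toy` data and the variable names are illustrative assumptions, not part of the dataset:

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Channel Name": ["A", "B", "A", "C", "B"],
    "followers": [10, 20, 10, 30, 20],
})

# duplicated() flags every repeat of an earlier row; first occurrences stay False
n_duplicates = int(toy.duplicated().sum())

# Same call chain as in the notebook: drop repeats and rebuild a clean index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)        # number of rows that get dropped
print(deduplicated.shape)  # remaining rows and columns
```

The number of flagged rows plus the number of remaining rows should always equal the original row count.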

After removing duplicates, the table shrank considerably, from 857 to 200 rows. However, the columns <b><i>Country, Category, Main Video Category, Main topic, More topics, Engagement Rate, Views Avg., Avg. 1 Day, Avg. 3 Day, Avg. 7 Day, Avg. 14 Day, Avg. 30 day, Comments Avg</i></b> still contain missing values. These need to be handled.

### Handling missing values

Of particular interest are the columns that are missing exactly one value (such as Main topic, More topics, Engagement Rate, Comments Avg, and others). Perhaps the dataset contains a single row in which all of these fields are empty? Let's check:

In [7]:
display(df[df['Main topic'].isna()])

Unnamed: 0,Country,Channel Name,Category,Main Video Category,username,followers,Main topic,More topics,Likes,Boost Index,Engagement Rate,Engagement Rate 60days,Views,Views Avg.,Avg. 1 Day,Avg. 3 Day,Avg. 7 Day,Avg. 14 Day,Avg. 30 day,Avg. 60 day,Comments Avg,Youtube Link
75,,News,,,News,36100000,,,0.0,85,,0.0,0,,0.0,0.0,0.0,0.0,0.0,0.0,,UCYfdidRxbB8Qhf0Nx7ioOYw


Indeed, almost every field in this row is empty. A row like this is not representative, so it can be dropped:

In [8]:
df = df.drop(labels=75, axis=0)
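
Row 75 was found here by filtering on one column. A more general way to spot mostly empty rows is to count the missing fields per row. This is a sketch under assumptions: the toy data is invented, and the "at least half of the fields" threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one row that is almost entirely empty (made-up values)
toy = pd.DataFrame({
    "Channel Name": ["A", "News", "C"],
    "Main topic": ["Music", np.nan, "Hobby"],
    "Engagement Rate": [0.1, np.nan, 0.3],
    "Comments Avg": [5.0, np.nan, np.nan],
})

# Number of empty fields in each row
missing_per_row = toy.isna().sum(axis=1)

# Rows where at least half of the fields are empty (threshold is an assumption)
mostly_empty = toy[missing_per_row >= toy.shape[1] / 2]

print(mostly_empty.index.tolist())
```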

<b>Find the percentage of missing values in each column:</b>

In [9]:
count_col = df.shape[0]
round(((1-df.count()/count_col)*100),1).sort_values(ascending=False)

Avg. 1 Day                41.7
Avg. 3 Day                23.6
Country                   17.6
Category                  17.6
Avg. 7 Day                14.1
Avg. 14 Day                9.0
Avg. 30 day                4.5
Views                      0.0
Comments Avg               0.0
Avg. 60 day                0.0
Views Avg.                 0.0
Engagement Rate 60days     0.0
Channel Name               0.0
Engagement Rate            0.0
Boost Index                0.0
Likes                      0.0
More topics                0.0
Main topic                 0.0
followers                  0.0
username                   0.0
Main Video Category        0.0
Youtube Link               0.0
dtype: float64
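
The same percentages can be obtained more directly from `isna().mean()`, since the mean of a boolean mask is the share of `True` values. A small sketch on invented data, comparing it with the formula used in the cell above:

```python
import numpy as np
import pandas as pd

# Toy frame with a known share of gaps per column (made-up values)
toy = pd.DataFrame({
    "Country": ["US", np.nan, "IN", np.nan],
    "Avg. 1 Day": [1.0, np.nan, np.nan, np.nan],
    "Views": [10, 20, 30, 40],
})

# Share of missing values per column, as a percentage
missing_pct = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)

# The notebook's formula: 1 minus the share of non-null values
original_formula = ((1 - toy.count() / toy.shape[0]) * 100).round(1).sort_values(ascending=False)

print(missing_pct)
```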

The fields with missing values include both categorical variables, such as <b><i>Country, Category</i></b>, and quantitative ones, such as <b><i>Avg. 1 Day, Avg. 3 Day, Avg. 7 Day, Avg. 14 Day, Avg. 30 day</i></b>.
To fill the quantitative variables, we use the mean value within each main topic (`Main topic`):

In [11]:
# Fill gaps in each average-views column with the mean of its 'Main topic' group
avg_cols = ['Avg. 1 Day', 'Avg. 3 Day', 'Avg. 7 Day', 'Avg. 14 Day', 'Avg. 30 day']
for col in avg_cols:
    df[col] = df.groupby('Main topic')[col].transform(lambda x: x.fillna(x.mean()))

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 0 to 199
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 164 non-null    object 
 1   Channel Name            199 non-null    object 
 2   Category                164 non-null    object 
 3   Main Video Category     199 non-null    object 
 4   username                199 non-null    object 
 5   followers               199 non-null    int64  
 6   Main topic              199 non-null    object 
 7   More topics             199 non-null    object 
 8   Likes                   199 non-null    float64
 9   Boost Index             199 non-null    int64  
 10  Engagement Rate         199 non-null    float64
 11  Engagement Rate 60days  199 non-null    float64
 12  Views                   199 non-null    int64  
 13  Views Avg.              199 non-null    float64
 14  Avg. 1 Day              193 non-null    fl

The columns <b><i>Avg. 1 Day, Avg. 3 Day, Avg. 7 Day</i></b> still contain missing values. This happens because for some topics every value was missing, so the group mean could not be computed. For these we use the mean over the whole column:

In [13]:
df['Avg. 1 Day'] = df['Avg. 1 Day'].fillna(df['Avg. 1 Day'].mean())
df['Avg. 3 Day'] = df['Avg. 3 Day'].fillna(df['Avg. 3 Day'].mean())
df['Avg. 7 Day'] = df['Avg. 7 Day'].fillna(df['Avg. 7 Day'].mean())
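
The two-stage fill can be traced on a tiny example. The sketch below uses invented numbers and a single column; it shows why a topic whose values are all missing survives the group-wise stage and only gets filled by the column-wide fallback.

```python
import numpy as np
import pandas as pd

# Toy data: the 'News' topic has no known value at all (made-up numbers)
toy = pd.DataFrame({
    "Main topic": ["Music", "Music", "Hobby", "Hobby", "News"],
    "Avg. 1 Day": [100.0, np.nan, 10.0, 30.0, np.nan],
})

# Stage 1: fill each gap with the mean of the row's own topic.
# The mean of an all-NaN group is NaN, so the 'News' row stays empty.
group_mean = toy.groupby("Main topic")["Avg. 1 Day"].transform("mean")
stage1 = toy["Avg. 1 Day"].fillna(group_mean)

# Stage 2: fall back to the mean of the whole column after stage 1
stage2 = stage1.fillna(stage1.mean())

print(stage1.tolist())
print(stage2.tolist())
```

Note that mean imputation pulls the filled values toward the center of the distribution; for heavily skewed view counts the median may be a reasonable alternative, although the notebook uses the mean.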

In [14]:
df.describe()

Unnamed: 0,followers,Likes,Boost Index,Engagement Rate,Engagement Rate 60days,Views,Views Avg.,Avg. 1 Day,Avg. 3 Day,Avg. 7 Day,Avg. 14 Day,Avg. 30 day,Avg. 60 day,Comments Avg
count,199.0,199.0,199.0,199.0,199.0,199.0,199.0,199.0,199.0,199.0,199.0,199.0,199.0,199.0
mean,39494470.0,182212700.0,64.874372,0.460653,0.077,20810410000.0,16639300.0,167526.6,464716.5,930116.6,1740344.0,2219552.0,2727980.0,13733.15046
std,22678560.0,308929500.0,16.467271,1.072026,0.177622,21251820000.0,40821300.0,340593.5,891252.6,2548751.0,5796456.0,6416406.0,7032095.0,29601.444784
min,24000000.0,0.0,1.0,0.000261,2.8e-05,994418000.0,680.6065,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,27400000.0,24790220.0,60.0,0.0275,0.00623,11550300000.0,686548.9,10585.06,17117.29,26049.73,47694.66,76823.01,94922.76,51.86865
50%,32500000.0,71000740.0,70.0,0.120258,0.025626,16380870000.0,2911653.0,73708.5,135974.0,206363.5,278023.1,426330.3,545527.3,1209.178571
75%,42150000.0,185718100.0,76.0,0.463765,0.065306,23228440000.0,14333920.0,188980.8,571304.5,700223.5,1098052.0,1884944.0,2053867.0,12120.383323
max,220000000.0,2191406000.0,88.0,10.584084,1.519044,195660700000.0,423923500.0,3472638.0,6596001.0,29941020.0,68777320.0,68777320.0,53835230.0,199523.467742


<b>Look at the rows of the dataset in which the channel category is not specified:</b>

In [16]:
display(df[df['Category'].isna()])

Unnamed: 0,Country,Channel Name,Category,Main Video Category,username,followers,Main topic,More topics,Likes,Boost Index,Engagement Rate,Engagement Rate 60days,Views,Views Avg.,Avg. 1 Day,Avg. 3 Day,Avg. 7 Day,Avg. 14 Day,Avg. 30 day,Avg. 60 day,Comments Avg,Youtube Link
5,US,Like Nastya,,People & Blogs,Like Nastya,97300000,Hobby,"Lifestyle,Hobby",280877700.0,79,1.026761,0.231388,80111555805,138163700.0,167526.6,3946868.0,5829929.0,9480138.0,13756910.0,22108750.0,0.29249,UCJplp5SjeGSdVdwsfb9Q7lQ
9,US,Vlad and Niki,,Entertainment,Vlad and Niki,83500000,Hobby,"Lifestyle,Hobby",146245400.0,82,0.976381,0.121066,65143080313,108949700.0,167526.6,4354700.0,3010898.0,3141541.0,6909962.0,9793169.0,0.0,UCvlE5gTbOvjiolFlEm-c_Ow
32,,Ariana Grande,,Pop music,Ariana Grande,51500000,Pop music,"Music,Pop music",44484410.0,33,0.241983,0.006167,21945548174,6972101.0,0.0,0.0,0.0,0.0,0.0,0.0,12890.231884,UC9CoOnJkIBMdeijd9qYoT_g
52,US,Like Nastya Show,,Entertainment,Like Nastya Show,41300000,Lifestyle,"Music,Lifestyle,Hobby",39269150.0,80,0.379554,0.008067,18673777146,15267910.0,173034.3,17459.0,233881.0,144042.0,164313.7,266031.5,0.0,UCS94J1s6-qc8v7btCdS2pNg
64,IN,Voot Kids,,Entertainment,Voot Kids,38100000,Entertainment,"Movies,TV shows,Entertainment",46360340.0,51,0.098878,0.120771,17243982736,4136043.0,26555.0,1309208.0,1824389.0,1679224.0,2486546.0,3656634.0,18.612973,UCJg19noZp7-BYIGvypu_cow
67,US,xxxtentacion,,Music,xxxtentacion,37400000,Hip hop music,"Music,Hip hop music,Pop music,Electronic music",33550310.0,65,0.1649,0.038311,9632080442,6642088.0,7635.5,9783.375,14908.5,81946.05,480777.9,1147973.0,18591.939759,UCM9r1xn6s30OnlJWb-jc3Sw
72,,Shakira,,People & Blogs,Shakira,36700000,Pop music,"Music,Pop music,Music of Latin America",16923110.0,82,0.120258,0.04664,23356797129,894883.5,202614.7,989748.0,316558.0,316558.0,293879.4,1644056.0,2526.352941,UCYLNGLIzMhRTi6ZOLjAPSmw
74,US,Toys and Colors,,Entertainment,Toys and Colors,36300000,Lifestyle,"Hobby,Lifestyle,Entertainment",50988590.0,79,0.651807,0.094289,40261342099,27828990.0,173034.3,397653.0,337525.7,1325537.0,2690516.0,3359177.0,0.0,UCgFXm4TI8htWmCyJ6cVPG_A
77,,Maroon 5,,Music,Maroon 5,35700000,Music,"Music,Pop music",3395907.0,43,0.018715,0.000612,20285080239,382719.0,0.0,0.0,0.0,0.0,0.0,0.0,1030.571429,UCBVjMGOIkavEAhyqpxJ73Dw
79,AE,shfa2 - شفا,,People & Blogs,shfa2 - شفا,35600000,Lifestyle,"Lifestyle,Hobby",33139060.0,25,0.151214,0.027335,20513944057,3087191.0,14977.0,258225.3,328154.9,381734.2,562335.6,947859.5,0.0,UCQ7x25F6YXY9DvGeHFxLhRQ


It is reasonably safe to say that channels whose `Main Video Category` is <b><i>Music</i></b> or <b><i>Pop music</i></b> belong to the <b><i>Music</i></b> category. Likewise, <b><i>Gaming, Strategy video game</i></b> and <b><i>Action-adventure game</i></b> can be assigned to <b><i>Gaming & Apps</i></b>, and <b><i>Sports</i></b> to <b><i>Sports</i></b>. The code also assigns <b><i>People & Blogs</i></b> and <b><i>Howto & Style</i></b> to <b><i>LifeStyle</i></b>. Finally, at quite a stretch, <b><i>Education</i></b> can be assigned to <b><i>Science & Tech:</i></b>

In [17]:
# Fill missing Category values based on Main Video Category
df.loc[df['Category'].isna() &
       df['Main Video Category'].isin(['Music', 'Pop music']),
       'Category'] = 'Music'

df.loc[df['Category'].isna() &
       df['Main Video Category'].isin(['Gaming', 'Strategy video game', 'Action-adventure game']),
       'Category'] = 'Gaming & Apps'

df.loc[df['Category'].isna() & (df['Main Video Category'] == 'Sports'), 'Category'] = 'Sports'

df.loc[df['Category'].isna() & (df['Main Video Category'] == 'Education'), 'Category'] = 'Science & Tech'

df.loc[df['Category'].isna() &
       df['Main Video Category'].isin(['People & Blogs', 'Howto & Style']),
       'Category'] = 'LifeStyle'



However, some video categories are still left over, such as <b><i>Shows, TV shows, Entertainment, Comedy</i></b>, and others. I therefore propose introducing two additional channel categories, <b><i>Entertainment</i></b> and <b><i>Other:</i></b>

In [18]:
df.loc[df['Category'].isna() &
       df['Main Video Category'].isin(['Shows', 'TV shows', 'Entertainment',
                                       'Comedy', 'Film & Animation']),
       'Category'] = 'Entertainment'

df.loc[df['Category'].isna() &
       df['Main Video Category'].isin(['News & Politics', 'Nonprofits & Activism']),
       'Category'] = 'Other'
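
The same rules can also be written as a single lookup table, which keeps the whole mapping in one place. This is an alternative sketch, not the notebook's method: the dictionary name `CATEGORY_BY_VIDEO_CATEGORY` and the toy frame are assumptions introduced for illustration, while the mapping itself mirrors the rules above.

```python
import numpy as np
import pandas as pd

# Lookup table mirroring the rules described above (name is hypothetical)
CATEGORY_BY_VIDEO_CATEGORY = {
    "Music": "Music", "Pop music": "Music",
    "Gaming": "Gaming & Apps", "Strategy video game": "Gaming & Apps",
    "Action-adventure game": "Gaming & Apps",
    "Sports": "Sports",
    "Education": "Science & Tech",
    "People & Blogs": "LifeStyle", "Howto & Style": "LifeStyle",
    "Shows": "Entertainment", "TV shows": "Entertainment",
    "Entertainment": "Entertainment", "Comedy": "Entertainment",
    "Film & Animation": "Entertainment",
    "News & Politics": "Other", "Nonprofits & Activism": "Other",
}

# Toy frame: one filled row, two fillable gaps, one unmapped video category
toy = pd.DataFrame({
    "Category": ["Sports", np.nan, np.nan, np.nan],
    "Main Video Category": ["Music", "Pop music", "Comedy", "Unknown"],
})

# fillna only touches the gaps, so existing Category values are preserved;
# video categories absent from the table map to NaN and stay missing
toy["Category"] = toy["Category"].fillna(
    toy["Main Video Category"].map(CATEGORY_BY_VIDEO_CATEGORY)
)

print(toy["Category"].tolist())
```

A lookup table makes it easy to see at a glance which video categories are still unmapped.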


Unfortunately, we cannot fill the gaps in `Country` from the current dataset. We cannot drop the rows with a missing `Country` either, since they make up about 18% of the sample. We could, of course, fill them in on the assumption that Maroon 5, for example, being an American band, would list the USA as its country when creating a YouTube channel, and Shakira would list Colombia. But since our goal is an initial exploratory analysis, I think it is acceptable to replace the missing values with <b><i>unfilled</i></b>. If a deeper study is needed later, a more suitable way of filling in the region can be chosen.

In [24]:
df['Country'] = df['Country'].fillna('unfilled')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 0 to 199
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 199 non-null    object 
 1   Channel Name            199 non-null    object 
 2   Category                199 non-null    object 
 3   Main Video Category     199 non-null    object 
 4   username                199 non-null    object 
 5   followers               199 non-null    int64  
 6   Main topic              199 non-null    object 
 7   More topics             199 non-null    object 
 8   Likes                   199 non-null    float64
 9   Boost Index             199 non-null    int64  
 10  Engagement Rate         199 non-null    float64
 11  Engagement Rate 60days  199 non-null    float64
 12  Views                   199 non-null    int64  
 13  Views Avg.              199 non-null    float64
 14  Avg. 1 Day              199 non-null    fl
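
Instead of reading the `info()` output by eye, the absence of gaps can be checked programmatically. A minimal sketch on an invented two-row frame; the `remaining` variable name is an assumption:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing country (made-up values)
toy = pd.DataFrame({"Country": ["US", np.nan], "Views": [1, 2]})

# Same placeholder as in the notebook
toy["Country"] = toy["Country"].fillna("unfilled")

# Total number of missing cells across the whole frame
remaining = int(toy.isna().sum().sum())

print(remaining)
```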

Great! All values are filled in, so we can move on to visualizing the data.

## Analysis and visualization