In this notebook we load the relevant data from the YouTube trending data set, which for this analysis means the videos from the United States and Canada. First we load the necessary packages and then read in the CSV files.

In [72]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data_us = pd.read_csv('data/USvideos.csv')
data_ca = pd.read_csv('data/CAvideos.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
data = pd.concat([data_us, data_ca])
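As a small illustration of the combining step, here is a sketch on toy frames rather than the real files. Stacking two frames keeps each frame's original row labels unless `ignore_index=True` is passed, so the combined frame can contain repeated index values. The `country` column is a hypothetical addition for illustration and is not part of the analysis in this notebook.

```python
import pandas as pd

# Toy stand-ins for the US and CA files (values are made up for illustration).
toy_us = pd.DataFrame({"video_id": ["a1", "a2"], "views": [100, 200]})
toy_ca = pd.DataFrame({"video_id": ["b1", "b2"], "views": [300, 400]})

# Hypothetical extra column recording where each row came from.
toy_us["country"] = "US"
toy_ca["country"] = "CA"

# Default behaviour: row labels from both frames are kept, so they repeat.
stacked = pd.concat([toy_us, toy_ca])

# ignore_index=True renumbers the rows from 0 to n-1 instead.
renumbered = pd.concat([toy_us, toy_ca], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Whether to renumber is a design choice: repeated labels are harmless for boolean-mask filtering, but label-based lookups with `.loc` can then return more than one row.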

In [75]:
nltk.download('stopwords')
nltk.download('punkt')
print(data.head())
print(data.describe())
print(data.shape)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jonathan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jonathan\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


      video_id trending_date  \
0  2kyS6SvSYSE      17.14.11   
1  1ZAPwfrtAFY      17.14.11   
2  5qpjK5DgCt4      17.14.11   
3  puqaWrEC7tY      17.14.11   
4  d380meD0W0M      17.14.11   

                                               title          channel_title  \
0                 WE WANT TO TALK ABOUT OUR MARRIAGE           CaseyNeistat   
1  The Trump Presidency: Last Week Tonight with J...        LastWeekTonight   
2  Racist Superman | Rudy Mancuso, King Bach & Le...           Rudy Mancuso   
3                   Nickelback Lyrics: Real or Fake?  Good Mythical Morning   
4                           I Dare You: GOING BALD!?               nigahiga   

   category_id              publish_time  \
0           22  2017-11-13T17:13:01.000Z   
1           24  2017-11-13T07:30:00.000Z   
2           23  2017-11-12T19:05:24.000Z   
3           24  2017-11-13T11:00:04.000Z   
4           24  2017-11-12T18:01:41.000Z   

                                                tags    views   lik

In [15]:
print(data.isna().sum())

video_id                     0
trending_date                0
title                        0
channel_title                0
category_id                  0
publish_time                 0
tags                         0
views                        0
likes                        0
dislikes                     0
comment_count                0
thumbnail_link               0
comments_disabled            0
ratings_disabled             0
video_error_or_removed       0
description               1866
dtype: int64
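Raw counts of missing values are easier to judge next to the share of rows they represent. The sketch below shows both on a toy frame; the column values are made up, and only the `isna().sum()` call comes from the cell above.

```python
import pandas as pd

# Toy frame with two missing descriptions out of four rows (made-up data).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "description": ["text", None, "more text", None],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```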


This data is very clean right off the bat. Only 1,866 rows have no description, and since that is not a column of interest for us, we can disregard it. Now we can make sure that the data types are correct.

In [17]:
data.dtypes

video_id                  object
trending_date             object
title                     object
channel_title             object
category_id                int64
publish_time              object
tags                      object
views                      int64
likes                      int64
dislikes                   int64
comment_count              int64
thumbnail_link            object
comments_disabled           bool
ratings_disabled            bool
video_error_or_removed      bool
description               object
dtype: object

In [24]:
data['trending_date'] = pd.to_datetime(data['trending_date'], format='%y.%d.%m')
data.dtypes

video_id                          object
trending_date             datetime64[ns]
title                             object
channel_title                     object
category_id                        int64
publish_time                      object
tags                              object
views                              int64
likes                              int64
dislikes                           int64
comment_count                      int64
thumbnail_link                    object
comments_disabled                   bool
ratings_disabled                    bool
video_error_or_removed              bool
description                       object
dtype: object

In [26]:
data['publish_time'] = pd.to_datetime(data['publish_time']) 
data.dtypes

video_id                               object
trending_date                  datetime64[ns]
title                                  object
channel_title                          object
category_id                             int64
publish_time              datetime64[ns, UTC]
tags                                   object
views                                   int64
likes                                   int64
dislikes                                int64
comment_count                           int64
thumbnail_link                         object
comments_disabled                        bool
ratings_disabled                         bool
video_error_or_removed                   bool
description                            object
dtype: object
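The two date columns need different handling: `trending_date` uses an unusual year.day.month layout that has to be spelled out with a format string, while `publish_time` is an ISO 8601 timestamp that pandas can parse on its own. A minimal sketch on two toy rows (the second row is made up) shows what each conversion produces.

```python
import pandas as pd

sample = pd.DataFrame({
    "trending_date": ["17.14.11", "18.13.06"],
    "publish_time": ["2017-11-13T17:13:01.000Z", "2018-06-12T09:00:00.000Z"],
})

# '%y.%d.%m' reads two-digit year, then day, then month.
sample["trending_date"] = pd.to_datetime(sample["trending_date"], format="%y.%d.%m")

# The trailing 'Z' marks UTC; utc=True makes the resulting column timezone-aware.
sample["publish_time"] = pd.to_datetime(sample["publish_time"], utc=True)

print(sample.dtypes)
print(sample)
```

Note that one column ends up timezone-naive and the other timezone-aware, so they cannot be subtracted from each other directly without first aligning them.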

Now we make some potentially enlightening groupings for analysis.

In [43]:
# numeric_only=True skips text columns, which pandas 2.0+ refuses to average
channels = data.groupby('channel_title').mean(numeric_only=True)
print(channels.views)
full_views_by_channel = channels.views
%store full_views_by_channel

channel_title
#AndresSTyle             4.433970e+05
#Mind Warehouse          1.160725e+06
#SeekingTheTruth         1.190110e+05
* Martyna *              1.499100e+04
- 欢迎订阅 -浙江卫视【奔跑吧】官方频道    1.354697e+06
                             ...     
창조영감클럽                   7.595300e+04
타우TV                     4.719190e+05
포스트쉐어                    1.658045e+06
포크포크                     6.059360e+05
활력소TV                    8.207900e+05
Name: views, Length: 6251, dtype: float64
Stored 'full_views_by_channel' (Series)
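The grouping above can be sketched on a toy frame as follows. Selecting the `views` column before aggregating avoids averaging columns we do not need; the data here is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "channel_title": ["A", "A", "B"],
    "title": ["x", "y", "z"],
    "views": [100, 300, 50],
})

# Mean views per channel: one row per distinct channel_title.
mean_views = toy.groupby("channel_title")["views"].mean()

# The length of the grouped result equals the number of distinct channels.
n_channels = toy["channel_title"].nunique()

print(mean_views)
print(n_channels)
```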


We see that the 81,830 trending videos over the time period were created by only 6,251 channels. We will now cut that down further by removing channels that exclusively post professional music, since their patterns cannot be followed by a typical YouTube creator. We identify these channels by the string 'VEVO' in the channel name.

In [44]:
strings_drop = 'VEVO'

music_channels = data['channel_title'].str.contains(strings_drop)
print(data[~music_channels].shape)
data1 = data[~music_channels]

(77631, 16)
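The filter above is a boolean mask built from a substring match and then negated. A minimal sketch on made-up channel names shows the mechanics and how to count the rows removed; `regex=False` and `na=False` are optional safeguards added here, not settings from the original cell.

```python
import pandas as pd

# Made-up channel names for illustration.
toy = pd.DataFrame({
    "channel_title": ["ExampleArtistVEVO", "nigahiga", "AnotherBandVEVO", "CaseyNeistat"],
})

# regex=False treats 'VEVO' as a plain substring; na=False counts missing names as non-matches.
is_music = toy["channel_title"].str.contains("VEVO", regex=False, na=False)

kept = toy[~is_music]
removed = len(toy) - len(kept)

print(kept["channel_title"].tolist())
print(removed)
```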


This cuts out about 4,200 videos (81,830 down to 77,631), but will hopefully leave us with a more instructive model. Next we want to remove the titles with non-English characters, since this will be an NLP analysis.

In [64]:
english = data1.title.map(lambda x: x.isascii())
print(data1.title[~english])
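# .copy() gives data2 its own data, so adding columns later does not trigger SettingWithCopyWarning
data2 = data1[english].copy()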

9        Why the rise of the robots won’t mean the end ...
57       Kellyanne Conway on Roy Moore This Week Abc: T...
68       Watch Norman Reedus Come Face to Face with his...
75       Rosie O’Donnell On Donald Trump’s Hostility To...
76       Mayo Clinic's first face transplant patient me...
                               ...                        
40866            سوحليفة: الحلقة 28 | Souhlifa: Episode 28
40872    NCT 미니게임천국 #3: 최강 손가락 컨트롤러 (Professional Finge...
40875          Вечер с Владимиром Соловьевым от 13.06.2018
40878    KINGDOM HEARTS III – SQUARE ENIX E3 SHOWCASE 2...
40880                     【完整版】遇到恐怖情人該怎麼辦？2018.06.13小明星大跟班
Name: title, Length: 10779, dtype: object
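The output above shows that the ASCII check also flags English titles that contain typographic punctuation such as the curly apostrophe in "won’t". One possible refinement, sketched below under our own assumptions, is to map such characters to ASCII equivalents before testing. The `normalize_punctuation` helper and its character table are hypothetical and were not used in this notebook; the toy titles are shortened from the output above.

```python
import pandas as pd

# Hypothetical table mapping common typographic punctuation to ASCII equivalents.
TYPOGRAPHIC_TO_ASCII = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def normalize_punctuation(title: str) -> str:
    """Replace typographic punctuation so it no longer fails the ASCII check."""
    return title.translate(TYPOGRAPHIC_TO_ASCII)


titles = pd.Series([
    "Rosie O\u2019Donnell On Donald Trump",
    "Nickelback Lyrics: Real or Fake?",
    "Вечер с Владимиром Соловьевым",
])

strict = titles.map(str.isascii)                              # check used in the notebook
relaxed = titles.map(normalize_punctuation).map(str.isascii)  # check after normalizing

print(strict.tolist())
print(relaxed.tolist())
```

The relaxed check keeps the first title while still dropping the Cyrillic one. Titles with accented letters or emoji would still be removed under either check.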


In [65]:
# numeric_only=True skips text columns, which pandas 2.0+ refuses to average
channels = data2.groupby('channel_title').mean(numeric_only=True)
print(channels.views)
views_by_channel = channels.views
%store views_by_channel

channel_title
#AndresSTyle          4.433970e+05
#Mind Warehouse       1.160725e+06
* Martyna *           1.499100e+04
078jordan1            1.746905e+05
0b1knob               6.463600e+04
                          ...     
Émile Roy             5.847500e+03
Легендарный Киллер    5.669440e+05
Никита Ордынский      1.313062e+06
ХайП                  1.342340e+05
【JindaRK】             2.496670e+05
Name: views, Length: 4965, dtype: float64
Stored 'views_by_channel' (Series)


I want to preserve the information that titles carry through all-caps and through punctuation like exclamation points and question marks, so I will create a separate column for each before preparing the title column for NLP.

In [67]:
data2['all_caps'] = data2['title'].str.isupper()
# regex=False matches the literal characters, so no backslash escaping is needed
data2['exclamation'] = data2['title'].str.contains('!', regex=False)
data2['question'] = data2['title'].str.contains('?', regex=False)

print(data2.head())
print(data2.shape)

      video_id trending_date  \
0  2kyS6SvSYSE    2017-11-14   
1  1ZAPwfrtAFY    2017-11-14   
2  5qpjK5DgCt4    2017-11-14   
3  puqaWrEC7tY    2017-11-14   
4  d380meD0W0M    2017-11-14   

                                               title          channel_title  \
0                 WE WANT TO TALK ABOUT OUR MARRIAGE           CaseyNeistat   
1  The Trump Presidency: Last Week Tonight with J...        LastWeekTonight   
2  Racist Superman | Rudy Mancuso, King Bach & Le...           Rudy Mancuso   
3                   Nickelback Lyrics: Real or Fake?  Good Mythical Morning   
4                           I Dare You: GOING BALD!?               nigahiga   

   category_id              publish_time  \
0           22 2017-11-13 17:13:01+00:00   
1           24 2017-11-13 07:30:00+00:00   
2           23 2017-11-12 19:05:24+00:00   
3           24 2017-11-13 11:00:04+00:00   
4           24 2017-11-12 18:01:41+00:00   

                                                tags    views   lik
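To make the three flag columns concrete, here is a sketch applied to three of the titles shown above. `str.isupper()` is true only when every cased character in the title is uppercase, so a title with a single lowercase letter is not counted as all-caps.

```python
import pandas as pd

titles = pd.Series([
    "WE WANT TO TALK ABOUT OUR MARRIAGE",
    "Nickelback Lyrics: Real or Fake?",
    "I Dare You: GOING BALD!?",
])

flags = pd.DataFrame({
    "title": titles,
    "all_caps": titles.str.isupper(),
    # regex=False searches for the literal character.
    "exclamation": titles.str.contains("!", regex=False),
    "question": titles.str.contains("?", regex=False),
})

print(flags)
```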



Now we do some pre-processing specific to NLP. This involves removing stop words, punctuation, and capitalization.

In [84]:
#example_sent = 'This is a sample sentence, showing off the stop words filtration'
stop_words = set(stopwords.words('english'))

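data2['title_stop'] = data2['title'].apply(lambda x: ' '.join(word for word in word_tokenize(x.lower()) if word not in stop_words))
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
data2['title_simple'] = data2['title_stop'].str.replace(r'[^\w\s]', '', regex=True)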
print(data2.head())

      video_id trending_date  \
0  2kyS6SvSYSE    2017-11-14   
1  1ZAPwfrtAFY    2017-11-14   
2  5qpjK5DgCt4    2017-11-14   
3  puqaWrEC7tY    2017-11-14   
4  d380meD0W0M    2017-11-14   

                                               title          channel_title  \
0                 WE WANT TO TALK ABOUT OUR MARRIAGE           CaseyNeistat   
1  The Trump Presidency: Last Week Tonight with J...        LastWeekTonight   
2  Racist Superman | Rudy Mancuso, King Bach & Le...           Rudy Mancuso   
3                   Nickelback Lyrics: Real or Fake?  Good Mythical Morning   
4                           I Dare You: GOING BALD!?               nigahiga   

   category_id              publish_time  \
0           22 2017-11-13 17:13:01+00:00   
1           24 2017-11-13 07:30:00+00:00   
2           23 2017-11-12 19:05:24+00:00   
3           24 2017-11-13 11:00:04+00:00   
4           24 2017-11-12 18:01:41+00:00   

                                                tags    views   lik
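The pre-processing above can be sketched without any downloads as follows. This sketch swaps NLTK's tokenizer and stop word list for a regular expression and a tiny hand-written stop word set, so it runs anywhere but will differ from the notebook's results on contractions and on words outside the toy list. `STOP_WORDS` and `simplify_title` are hypothetical names introduced for illustration.

```python
import re

import pandas as pd

# Illustrative subset only; the notebook uses NLTK's full English stop word list.
STOP_WORDS = {"i", "you", "the", "a", "to", "of", "our", "we", "about", "or", "with"}


def simplify_title(title: str) -> str:
    """Lowercase, keep only runs of word characters, and drop stop words."""
    tokens = re.findall(r"\w+", title.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)


titles = pd.Series([
    "WE WANT TO TALK ABOUT OUR MARRIAGE",
    "I Dare You: GOING BALD!?",
])

simplified = titles.apply(simplify_title)
print(simplified.tolist())
```

Extracting word tokens with a regular expression removes punctuation in the same pass, which is why no separate replacement step appears in this sketch.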



Finally we write this DataFrame to a CSV file for the next stage of analysis.

In [85]:
import os

# to_csv does not create missing directories, so make sure the target folder exists first
os.makedirs('clean_data', exist_ok=True)
data2.to_csv('clean_data/clean_data.csv')
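Two details matter when writing the file: the target directory has to exist, and CSV stores dates as plain text, so they need to be parsed again on reading. The sketch below shows both on a toy frame, writing to a temporary directory so that nothing is left behind. The use of `index=False` is our choice for this sketch; the cell above writes the index as well.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b"],
    "trending_date": pd.to_datetime(["2017-11-14", "2017-11-15"]),
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "clean_data"           # hypothetical location for this sketch
    out_dir.mkdir(parents=True, exist_ok=True)   # create the folder before writing
    out_path = out_dir / "clean_data.csv"

    toy.to_csv(out_path, index=False)

    # parse_dates restores the datetime dtype that CSV cannot store.
    restored = pd.read_csv(out_path, parse_dates=["trending_date"])

print(restored.dtypes)
```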