# Chapter 1. Getting the data

The first step of the project was to choose carefully which YouTube channel to use for fitting our future model. The choice fell on the [TED channel](https://www.youtube.com/user/TEDtalksDirector): almost every video there has subtitles, there are no advertisements, the video content covers a wide range of topics, and the channel has more than 3,500 videos, which is clearly favorable for our purposes.

## Imports

To get metadata about the videos we need to interact with the Google API (using the `google-api-python-client` module, version 2.2.0). The `pprint` module is optional; I decided to use it to make some outputs easier to read.

So, we want to download captions, a vast amount of them. Unfortunately, that is hard to do with the Google API alone. One call of the caption download method costs 50 quota units (you are given 10,000 a day), so it would take roughly two and a half weeks to get the job done for 3.5K videos. That is why we use another module for the download, in our case `pytube`; it will handle everything within a single day.
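
The estimate above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the text (50 units per download, 10,000 units per day, about 3,500 videos); the constant names are made up for illustration.

```python
import math

VIDEOS = 3500            # approximate channel size quoted above
COST_PER_DOWNLOAD = 50   # quota units per caption download, as quoted above
DAILY_QUOTA = 10_000     # quota units available per day, as quoted above

total_units = VIDEOS * COST_PER_DOWNLOAD
days_needed = math.ceil(total_units / DAILY_QUOTA)

print(f"{total_units} units -> about {days_needed} days")
```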

`utils` is a custom file that contains various functions and constants which will be needed throughout the project.

In [3]:
import pandas as pd
import googleapiclient.discovery
import pprint

from pytube import YouTube
from utils import DATA, TEMP, MIME, google_const, upload_to_googledrive

In [6]:
pp = pprint.PrettyPrinter(indent=4)

## Downloading initial metadata

First of all we create a `youtube` object that will be responsible for making queries. It requires your own Google API key which can be created on [Google Cloud Platform](https://console.cloud.google.com/). 

In [7]:
youtube = googleapiclient.discovery.build(serviceName="youtube", version="v3", developerKey=google_const['apikey'])

Next we construct our first request. This one is going to retrieve some metadata from the [all-videos playlist](https://www.youtube.com/playlist?list=UUAuUUnT6oDeKwE6v1NGQxug) of the TED channel. For each video in the playlist we want its id, title, description and a thumbnail link (just in case). That will be enough for now.

In [8]:
videos_request = youtube.playlistItems().list(
    part = "id, snippet, contentDetails, status",
    playlistId = google_const['playlist_id'],
    maxResults = 50
)

In [9]:
# Each page is fetched inside the loop below; an extra execute() here would only spend quota.
playlist_items = []

while videos_request is not None:
    videos_response = videos_request.execute()
    playlist_items += videos_response["items"]
    videos_request = youtube.playlistItems().list_next(videos_request, videos_response)
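
The loop above keeps requesting pages until there is no next page. The idea can be sketched without an API key, assuming a hypothetical `fetch_page` function and made-up `FAKE_PAGES` data standing in for the real endpoint; only the `items` and `nextPageToken` keys mirror the shape of the real responses.

```python
# Hypothetical stand-in for a paginated endpoint: three pages of fake items.
FAKE_PAGES = {
    None: {"items": [1, 2], "nextPageToken": "p2"},
    "p2": {"items": [3, 4], "nextPageToken": "p3"},
    "p3": {"items": [5]},  # last page: no nextPageToken
}


def fetch_page(page_token=None):
    """Return one fake page; a real client would call the API here."""
    return FAKE_PAGES[page_token]


def fetch_all():
    """Collect items from every page by following nextPageToken."""
    items, token = [], None
    while True:
        page = fetch_page(token)
        items += page["items"]
        token = page.get("nextPageToken")
        if token is None:
            return items


all_items = fetch_all()
print(all_items)
```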

In [10]:
print(f"total: {len(playlist_items)}")

total: 3689


Wow, we did really get something! Let's take a look at the first metadata piece.

In [11]:
pp.pprint(playlist_items[0])

{   'contentDetails': {   'videoId': 'XQJhRDbsDzI',
                          'videoPublishedAt': '2021-07-10T14:00:12Z'},
    'etag': 'wxvFVE9dT64ljmYX8wfyv6RX4K0',
    'id': 'VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLlhRSmhSRGJzRHpJ',
    'kind': 'youtube#playlistItem',
    'snippet': {   'channelId': 'UCAuUUnT6oDeKwE6v1NGQxug',
                   'channelTitle': 'TED',
                   'description': 'Watch the full talk: '
                                  'http://tedtalks.social/superpower\n'
                                  '\n'
                                  'A clip from America Ferrera\'s TED Talk "My '
                                  'identity is a superpower -- not an '
                                  'obstacle" from TED2019\n'
                                  '\n'
                                  'Hollywood needs to stop resisting what the '
                                  'world actually looks like, says actor, '
                                  'director and activist

**Note:** there are two types of IDs: `id` uniquely identifies a video within the playlist, and `videoId` is the global identifier used, for example, in video links. At this point we will keep both of them.
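
As a side observation, the `id` of the sample item printed above happens to be the Base64 encoding of the playlist id and the `videoId` joined by a dot. This is only what this sample shows, not a documented guarantee of the API, so the sketch below should not be relied on in place of the `videoId` field.

```python
import base64

# The playlist item id from the sample output above.
playlist_item_id = "VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLlhRSmhSRGJzRHpJ"

decoded = base64.b64decode(playlist_item_id).decode("ascii")
playlist_part, video_part = decoded.split(".", 1)

print(playlist_part)
print(video_part)
```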

In [12]:
ids, titles, descriptions, thumbnails, video_ids = [], [], [], [], []

for dictionary in playlist_items:
    ids.append(dictionary["id"])
    titles.append(dictionary["snippet"]["title"])
    descriptions.append(dictionary["snippet"]["description"])
    thumbnails.append(dictionary["snippet"]["thumbnails"]["default"]["url"])
    video_ids.append(dictionary["snippet"]["resourceId"]["videoId"])
    
assert len(video_ids) == len(titles) == len(descriptions) == len(thumbnails)

Now let's make a dataframe out of the data we've got and take a look at what we've done. Saving the data to a temporary directory as a checkpoint is also a good idea.

In [13]:
vids_ted = pd.DataFrame(list(zip(ids, titles, descriptions, thumbnails, video_ids)), 
                        columns=["id", "title", "description", "thumbnail", "video_id"])

In [14]:
vids_ted.to_csv(DATA + TEMP + "videos_demo.csv", sep=',')

In [15]:
upload_to_googledrive("videos_demo.csv", google_const['temp_data_folder_id'], path=DATA+TEMP, mime=MIME)

In [16]:
vids_ted.head()

| | id | title | description | thumbnail | video_id |
|---|---|---|---|---|---|
| 0 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLlhRSmhSRGJzRHpJ | Your identity is your superpower \| America Fer... | Watch the full talk: http://tedtalks.social/su... | https://i.ytimg.com/vi/XQJhRDbsDzI/default.jpg | XQJhRDbsDzI |
| 1 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLnh2RzNmdEV2NXJN | Documentary films that explore trauma -- and m... | Visit http://TED.com/shapeyourfuture to watch ... | https://i.ytimg.com/vi/xvG3ftEv5rM/default.jpg | xvG3ftEv5rM |
| 2 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLkRFU0NjalNRU0tZ | A cleanse won't detox your body -- but here's ... | Put down the cayenne-lemon water and step away... | https://i.ytimg.com/vi/DESCcjSQSKY/default.jpg | DESCcjSQSKY |
| 3 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLkQzTEZWbEk1aXI4 | What should humans take to space (and leave be... | Visit http://TED.com/shapeyourfuture to watch ... | https://i.ytimg.com/vi/D3LFVlI5ir8/default.jpg | D3LFVlI5ir8 |
| 4 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLkVmZWlLUFNCS2Zn | How to be a professional troublemaker \| Luvvie... | Visit http://TED.com to get our entire library... | https://i.ytimg.com/vi/EfeiKPSBKfg/default.jpg | EfeiKPSBKfg |


## Are there any subtitles at all?

That is the question that can be easily answered using one more Google API query. 

**Note:** the maximum value of the `maxResults` parameter is 50, which means that a single query can return metadata for at most 50 videos. That is why we need a `chunks()` function that splits a list into chunks of any desired size. We'll use it to create a list of lists, each holding up to 50 `videoId`s, and iterate over it so as to make as few queries as possible. Thus, we're actually going to make not one but several queries.

In [17]:
def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
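
A quick check of how `chunks()` behaves on a small list, plus the number of queries it implies for the 3689 videos found earlier. The function is repeated here only so that the example is self-contained.

```python
import math


def chunks(lst, n):
    """Yield successive slices of lst, each at most n items long."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


demo = list(chunks(list(range(7)), 3))
print(demo)  # [[0, 1, 2], [3, 4, 5], [6]]

# 3689 ids in chunks of 50 need this many queries:
n_queries = math.ceil(3689 / 50)
print(n_queries)  # 74
```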

In [18]:
video_ids = list(chunks(video_ids, 50))

In [19]:
ifcapts_items = []

for video_id in video_ids:

    ifcapts_request = youtube.videos().list(
        part = "id, contentDetails",
        id = video_id,
        maxResults = 50
    )

    while ifcapts_request is not None:
        ifcapts_response = ifcapts_request.execute()
        ifcapts_items.append(ifcapts_response["items"])
        ifcapts_request = youtube.videos().list_next(ifcapts_request, ifcapts_response)

By the way, we shouldn't forget that our `ifcapts_items` list is actually a list of lists, and we need to flatten it.

In [20]:
ifcapts_items = [item for sublist in ifcapts_items for item in sublist]
print(f"total: {len(ifcapts_items)}")

total: 3689


Now we can add a boolean column to our `vids_ted` dataframe to point out whether a certain video has captions.

In [21]:
capts_available = []

for dictionary in ifcapts_items:
    capts_available.append(dictionary["contentDetails"]["caption"])

vids_ted['capts_available'] = pd.Series(capts_available)
print(vids_ted['capts_available'].value_counts())

true     3593
false      96
Name: capts_available, dtype: int64
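
Note that the output above shows the lowercase values `true` and `false`, which suggests that the `caption` field holds strings rather than Python booleans. If a real boolean column is preferred, the conversion could look like the sketch below; `sample_items` is made-up data that only imitates the shape of the API items.

```python
from collections import Counter

# Made-up items imitating the structure of the real responses.
sample_items = [
    {"contentDetails": {"caption": "true"}},
    {"contentDetails": {"caption": "false"}},
    {"contentDetails": {"caption": "true"}},
]

# Compare against the string "true" to get real booleans.
has_captions = [item["contentDetails"]["caption"] == "true" for item in sample_items]
counts = Counter(has_captions)

print(has_captions)
print(counts)
```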


Just as presumed, only a few videos lack subtitles. We can drop these 96 rows of data without a twinge of conscience.

In [22]:
vids_ted = vids_ted[vids_ted['capts_available'] != 'false']
vids_ted = vids_ted.drop(columns=['capts_available'])
vids_ted.reset_index(drop=True, inplace=True)

In [23]:
vids_ted.head()

| | id | title | description | thumbnail | video_id |
|---|---|---|---|---|---|
| 0 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLlhRSmhSRGJzRHpJ | Your identity is your superpower \| America Fer... | Watch the full talk: http://tedtalks.social/su... | https://i.ytimg.com/vi/XQJhRDbsDzI/default.jpg | XQJhRDbsDzI |
| 1 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLnh2RzNmdEV2NXJN | Documentary films that explore trauma -- and m... | Visit http://TED.com/shapeyourfuture to watch ... | https://i.ytimg.com/vi/xvG3ftEv5rM/default.jpg | xvG3ftEv5rM |
| 2 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLkRFU0NjalNRU0tZ | A cleanse won't detox your body -- but here's ... | Put down the cayenne-lemon water and step away... | https://i.ytimg.com/vi/DESCcjSQSKY/default.jpg | DESCcjSQSKY |
| 3 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLkQzTEZWbEk1aXI4 | What should humans take to space (and leave be... | Visit http://TED.com/shapeyourfuture to watch ... | https://i.ytimg.com/vi/D3LFVlI5ir8/default.jpg | D3LFVlI5ir8 |
| 4 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLkVmZWlLUFNCS2Zn | How to be a professional troublemaker \| Luvvie... | Visit http://TED.com to get our entire library... | https://i.ytimg.com/vi/EfeiKPSBKfg/default.jpg | EfeiKPSBKfg |


In [24]:
vids_with_capts = vids_ted["video_id"].tolist()
print(f'Total videos with captions: {len(vids_with_capts)}')

Total videos with captions: 3593


## Downloading subtitles

This is the moment to switch to the aforementioned `pytube` for data retrieval. Here we're also going to split the data into chunks, but mainly to keep track of how the download is going. Though it won't take a week, it will still take some time, and I want to see the progress so that I know something is happening and the cell is not just frozen.

By the way, for such a massive task it's really important to foresee the possibility of exceptions. You never know what they may be, so I wrote an `except` block that catches any type of exception. If something goes wrong with a caption, the `videoId` of the video being processed is appended to a special `failed_captions` list, and an empty string is appended to the `captions` list. That way the download isn't interrupted while in progress, the data isn't lost or spoiled, and I don't have to resume the download manually.
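
The failure-tolerant pattern described here can be tried out in isolation. The sketch below uses a hypothetical `fetch_caption` function with made-up data instead of a real download, and it stores results in a dictionary keyed by video id, which is a variation on the parallel lists used in this notebook, not the notebook's own approach.

```python
def fetch_caption(video_id):
    """Hypothetical stand-in for the real download; fails for unknown ids."""
    fake_store = {"a1": "caption A", "c3": "caption C"}
    return fake_store[video_id]  # raises KeyError for unknown ids


video_ids = ["a1", "b2", "c3"]
captions_by_id, failed = {}, []

for video_id in video_ids:
    try:
        captions_by_id[video_id] = fetch_caption(video_id)
    except Exception as exc:
        # Record the failure and move on instead of stopping the whole run.
        print(f"{type(exc).__name__} for video {video_id}")
        failed.append(video_id)

print(captions_by_id)
print(failed)
```

Keying the results by video id keeps each caption tied to its video even when some downloads fail, so nothing depends on two lists staying in the same order.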

In [25]:
vids_with_capts = list(chunks(vids_with_capts, 50))

In [26]:
captions, failed_captions = [], []

In [27]:
for chunk in vids_with_capts:
    for video in chunk:
        try:
            yt = YouTube(f'http://youtube.com/watch?v={video}')
            caption = yt.captions['en']
            captions.append(caption.generate_srt_captions())
        except Exception as exception: 
            exception_name = type(exception).__name__
            print(f"{exception_name} Exception occured for video {video}.")
            captions.append("")
            failed_captions.append(video)
            
    print(f"Captions: {len(captions)}")
    
print(f"Total captions: {len(captions)}")

KeyError Exception occured for video wybp16nlroo.
Captions: 50
KeyError Exception occured for video eKkI6-HeWXo.
KeyError Exception occured for video OcMQ1-P6RdM.
KeyError Exception occured for video CQL3ogC4qS8.
Captions: 100
Captions: 150
Captions: 200
Captions: 250
Captions: 300
Captions: 350
Captions: 400
Captions: 450
Captions: 500
Captions: 550
Captions: 600
Captions: 650
Captions: 700
Captions: 750
Captions: 800
Captions: 850
Captions: 900
Captions: 950
Captions: 1000
Captions: 1050
Captions: 1100
Captions: 1150
Captions: 1200
Captions: 1250
Captions: 1300
Captions: 1350
Captions: 1400
Captions: 1450
Captions: 1500
Captions: 1550
Captions: 1600
Captions: 1650
Captions: 1700
Captions: 1750
Captions: 1800
Captions: 1850
Captions: 1900
Captions: 1950
Captions: 2000
Captions: 2050
Captions: 2100
Captions: 2150
Captions: 2200
Captions: 2250
Captions: 2300
Captions: 2350
Captions: 2400
Captions: 2450
Captions: 2500
Captions: 2550
Captions: 2600
Captions: 2650
Captions: 2700
Captions: 

In [28]:
captions[:] = [x for x in captions if x]

success = captions.copy()
fail = failed_captions.copy()

print(f"Total successful captions to save: {len(success)}")
print(f"Total failed captions to save: {len(fail)}")

Total successful captions to save: 3589
Total failed captions to save: 4


So, approximately three hours later we're back here with a bunch of new juicy text data! I'm going to make a backup of everything that we've got and save that to a temporary directory.

In [29]:
print(f"Saving...\n")
    
success = pd.Series(success)
success.to_csv(DATA + TEMP + "raw_captions.csv", sep=',')
upload_to_googledrive("raw_captions.csv", google_const['temp_data_folder_id'], path=DATA+TEMP)
print(f"Saved! [1]")

print(f"\n...\n")

fail = pd.Series(fail)
fail.to_csv(DATA + TEMP + "failed.csv", sep=',')
upload_to_googledrive("failed.csv", google_const['temp_data_folder_id'], path=DATA+TEMP)
print(f"Saved! [2]")

Saving...

Saved! [1]

...

Saved! [2]


And we should remove the failed rows, of course.

In [54]:
print(vids_ted.shape)

for video in failed_captions:
    vids_ted = vids_ted[vids_ted['video_id'] != video]
    
vids_ted.reset_index(drop=True, inplace=True)

print(vids_ted.shape)

(3589, 13)
(3589, 13)


In [58]:
vids_ted['captions'] = captions

## Downloading tags & additional metadata

Now that we know which videos we're going to work with, it's time to download additional metadata that may be helpful. First of all, tags: they are extremely important for the project, as they will be the target value for fitting a model. Besides that, we would also like to see some statistics in our dataframe, such as views and likes, as well as the publication date and the channel id.

In [59]:
video_ids = list(vids_ted['video_id'])
video_ids = list(chunks(video_ids, 50))

In [60]:
additional_items = []

for chunk_id in video_ids:

    additional_request = youtube.videos().list(
        id = chunk_id,
        part = "statistics, snippet",
        maxResults = 50
    )

    while additional_request is not None:
        additional_response = additional_request.execute()
        additional_items += additional_response["items"]
        additional_request = youtube.videos().list_next(additional_request, additional_response)

In [61]:
pp.pprint(additional_items[3484])

{   'etag': 'RTiIaQ4ujfNby6fxzRIHp_Exc98',
    'id': 'e3NA-aKpgFk',
    'kind': 'youtube#video',
    'snippet': {   'categoryId': '20',
                   'channelId': 'UCAuUUnT6oDeKwE6v1NGQxug',
                   'channelTitle': 'TED',
                   'defaultLanguage': 'en',
                   'description': 'http://www.ted.com In a friendly, '
                                  'high-speed presentation, Will Wright demos '
                                  'his newest game, Spore, which promises to '
                                  'dazzle users even more than his previous '
                                  'masterpieces.\r\n'
                                  '\r\n'
                                  'TEDTalks is a daily video podcast of the '
                                  'best talks and performances from the TED '
                                  "Conference, where the world's leading "
                                  'thinkers and doers are invited to give the '
    

Quite a big output. Luckily, we won't need most of it.

**Note:** some videos have no comments, or comments are turned off; the same goes for likes and dislikes. Because of that I've made a `try_to_append` function to handle the `KeyError` exceptions that will inevitably occur.

In [62]:
def try_to_append(data_list, dictionary, thing):
    try:
        data_list.append(dictionary[thing])
    except KeyError:
        data_list.append(None)
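
To see what `try_to_append` does with a missing key, here is a small self-contained check. The `stats` dictionary is made-up sample data, and the `dict.get` lines show an alternative way to get the same values, not a replacement used in this notebook.

```python
def try_to_append(data_list, dictionary, thing):
    """Append dictionary[thing], or None when the key is missing."""
    try:
        data_list.append(dictionary[thing])
    except KeyError:
        data_list.append(None)


# Made-up statistics: commentCount is deliberately absent.
stats = {"viewCount": "100", "likeCount": "7"}

likes, comments = [], []
try_to_append(likes, stats, "likeCount")
try_to_append(comments, stats, "commentCount")

# dict.get returns None for a missing key, so it yields the same values here.
likes_alt = [stats.get("likeCount")]
comments_alt = [stats.get("commentCount")]

print(likes, comments)
```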

In [63]:
datetime, tags, channel_id, likes, dislikes, views, comments = [], [], [], [], [], [], []

for dictionary in additional_items:
    datetime.append(dictionary['snippet']['publishedAt'])
    tags.append(dictionary['snippet']['tags'])
    channel_id.append(dictionary['snippet']['channelId'])
    views.append(dictionary['statistics']['viewCount'])
    try_to_append(likes, dictionary['statistics'], 'likeCount')
    try_to_append(dislikes, dictionary['statistics'], 'dislikeCount')
    try_to_append(comments, dictionary['statistics'], 'commentCount')

So how many `KeyError` exceptions did we handle, and how many `None` values will we have to put up with?

In [64]:
likes.count(None), dislikes.count(None), comments.count(None)

(10, 10, 37)

Now we can say that we're done with downloading data. It's time to form a final dataframe and save it to a .csv file.

In [65]:
vids_ted['published_at'] = datetime
vids_ted['tags'] = tags
vids_ted['channel_id'] = channel_id
vids_ted['views'] = views
vids_ted['likes'] = likes
vids_ted['dislikes'] = dislikes
vids_ted['comments_count'] = comments
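
One thing to keep in mind for later analysis: the counts in the `statistics` part may arrive as strings rather than numbers (this is an assumption about the response format, worth verifying on the real data). If so, a conversion that keeps the `None` placeholders could look like this sketch, where `to_int_or_none` and `raw_likes` are names made up for the example.

```python
def to_int_or_none(value):
    """Convert a count to int, leaving missing values as None."""
    return None if value is None else int(value)


# Made-up sample mixing string counts and a missing value.
raw_likes = ["2991", None, "1076"]
clean_likes = [to_int_or_none(v) for v in raw_likes]

print(clean_likes)
```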

In [66]:
vids_ted.head()

| | id | title | description | thumbnail | video_id | captions | published_at | tags | channel_id | views | likes | dislikes | comments_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLlhRSmhSRGJzRHpJ | Your identity is your superpower \| America Fer... | Watch the full talk: http://tedtalks.social/su... | https://i.ytimg.com/vi/XQJhRDbsDzI/default.jpg | XQJhRDbsDzI | `1\n00:00:00,350 --> 00:00:06,470\nWhen I was 1...` | 2021-07-10T14:00:12Z | [TEDTalk, TEDTalks, TED Talk, TED Talks, busin... | UCAuUUnT6oDeKwE6v1NGQxug | 69496 | 2991 | 385 | 8 |
| 1 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLnh2RzNmdEV2NXJN | Documentary films that explore trauma -- and m... | Visit http://TED.com/shapeyourfuture to watch ... | https://i.ytimg.com/vi/xvG3ftEv5rM/default.jpg | xvG3ftEv5rM | `1\n00:00:14,331 --> 00:00:16,542\n[SHAPE YOUR ...` | 2021-07-09T15:30:06Z | [TEDTalk, TEDTalks, TED Talk, TED Talks, story... | UCAuUUnT6oDeKwE6v1NGQxug | 26010 | 1076 | 35 | 59 |
| 2 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLkRFU0NjalNRU0tZ | A cleanse won't detox your body -- but here's ... | Put down the cayenne-lemon water and step away... | https://i.ytimg.com/vi/DESCcjSQSKY/default.jpg | DESCcjSQSKY | `1\n00:00:00,000 --> 00:00:07,000\nTranscriber:...` | 2021-07-08T15:15:00Z | [TEDTalk, TEDTalks, TED Talk, TED Talks, food,... | UCAuUUnT6oDeKwE6v1NGQxug | 457410 | 23220 | 452 | 828 |
| 3 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLkQzTEZWbEk1aXI4 | What should humans take to space (and leave be... | Visit http://TED.com/shapeyourfuture to watch ... | https://i.ytimg.com/vi/D3LFVlI5ir8/default.jpg | D3LFVlI5ir8 | `1\n00:00:14,871 --> 00:00:16,579\n[SHAPE YOUR ...` | 2021-07-07T15:14:57Z | [TEDTalk, TEDTalks, TED Talk, TED Talks, cultu... | UCAuUUnT6oDeKwE6v1NGQxug | 23259 | 628 | 153 | 145 |
| 4 | VVVBdVVVblQ2b0RlS3dFNnYxTkdReHVnLkVmZWlLUFNCS2Zn | How to be a professional troublemaker \| Luvvie... | Visit http://TED.com to get our entire library... | https://i.ytimg.com/vi/EfeiKPSBKfg/default.jpg | EfeiKPSBKfg | `1\n00:00:00,000 --> 00:00:07,000\nTranscriber:...` | 2021-07-06T16:45:00Z | [TEDTalk, TEDTalks, TED Talk, TED Talks, Socia... | UCAuUUnT6oDeKwE6v1NGQxug | 25494 | 844 | 96 | 95 |


In [67]:
vids_ted.to_csv(DATA + "videos.csv", sep=',')
upload_to_googledrive("videos.csv", google_const['data_folder_id'])