# Step 1: Get metadata for Sydney's January 2023 videos

0. Set up access to YouTube's Data API
1. Create a dataframe with channel metadata
2. Augment dataframe with video metadata
3. Acquire transcripts for each video

### 0. Set up access to YouTube's Data API
First things first: follow the steps in [YouTube's Data API Overview](https://developers.google.com/youtube/v3/getting-started) to obtain an API key. You will need a Google Developer account -- I balked at signing up for this but eventually decided to go with it. Read up on it and make the decision that works for you.

Once you get your API key, put it somewhere safe! **Do not hard code your key into your code.** You can get fancy and use a key manager, but if you are relatively new to coding, put it in a file somewhere that will NOT be pushed to git. My two-step method -- storing keys in a config folder in my project and adding that folder name to .gitignore -- is maybe not the best way, but it works.

In [1]:
with open('../config/youtube_api') as file:
    # strip() drops the trailing newline that readline() keeps
    DEVELOPER_KEY = file.readline().strip()

Now my key is safely loaded into memory, but no one reading my code can see the actual value. Another way to do this would be with an environment variable. If you are dealing with keys at work, please follow the best practices of your workplace.
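For anyone curious about the environment-variable route, here is a minimal sketch. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are my own inventions for illustration, not anything the YouTube client requires.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY", fallback_path=None):
    """Return the API key from an environment variable, or from a file as a fallback.

    Both the variable name and the fallback path are hypothetical choices.
    """
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    if fallback_path is not None:
        with open(fallback_path) as file:
            return file.readline().strip()
    raise RuntimeError(f"No API key found in {env_var} or in a fallback file")
```

With this in place, `DEVELOPER_KEY = load_api_key(fallback_path='../config/youtube_api')` would try the environment first and the config file second.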

**If you have not already followed the steps on [the Python Quickstart page](https://developers.google.com/youtube/v3/quickstart/python), take a break and go do that first.** This tutorial will still be here! Come back when you are ready. 

These next two cells get you set up to make requests with the YouTube API.

In [2]:
# API client library
import googleapiclient.discovery
# API information
api_service_name = "youtube"
api_version = "v3"

In [3]:
# API client
youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey = DEVELOPER_KEY)

The next cell specifies the YouTube channel id. Perhaps this will be obvious to others, but it took me a while to figure out that this value is NOT the name of the YouTube channel but the opaque identifier string that goes along with it.

If you want to query some other channel's data, consider this an invitation to go figure out the channel id! 

In [4]:
CHANNEL_ID = 'UCVQJZE_on7It_pEv6tn-jdA' # Sydney's channel id

I wanted to see what a request to the YouTube client returns, so I formulated a request to list the channel's first video of the year.

Please note the format of the datetime value is [International Standard ISO 8601](https://en.wikipedia.org/wiki/ISO_8601). You will certainly encounter other formats in your data science career, but I would be surprised if this is the last time you see this one.

In [5]:
request = youtube.search().list(
    part="snippet",
    channelId=CHANNEL_ID,  # defined in the previous cell
    maxResults=31,
    publishedAfter="2023-01-02T00:00:00Z",
    publishedBefore="2023-01-03T00:00:00Z",
    type="video"
)
response = request.execute()
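The two timestamps in that request can also be generated rather than typed by hand. This is a small sketch using only the standard library; `month_bounds` is a hypothetical helper name of mine.

```python
from datetime import datetime, timezone

def month_bounds(year, month):
    """Return (start, end) ISO 8601 UTC timestamps covering one calendar month."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # The end bound is the first day of the following month;
    # December rolls over into January of the next year.
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start.strftime(fmt), end.strftime(fmt)

published_after, published_before = month_bounds(2023, 1)
print(published_after, published_before)
# 2023-01-01T00:00:00Z 2023-02-01T00:00:00Z
```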

Let's see what it looks like.

In [6]:
response

{'kind': 'youtube#searchListResponse',
 'etag': 'vj5RWVnb9NxGiBPbM23M08JWY8s',
 'regionCode': 'US',
 'pageInfo': {'totalResults': 3, 'resultsPerPage': 1},
 'items': [{'kind': 'youtube#searchResult',
   'etag': 'pC-CzKtb-Sq3vGfXqccypRHXUjo',
   'id': {'kind': 'youtube#video', 'videoId': 'qLKflJjWDi4'},
   'snippet': {'publishedAt': '2023-01-02T10:00:18Z',
    'channelId': 'UCVQJZE_on7It_pEv6tn-jdA',
    'title': '30 Minute Full Body Strong &amp; Fit Workout | EFFORT - Day 1',
    'description': "It's time to put in the EFFORT! DAY 1 of our PROCESS program kicks today off with a full body quick and sweaty workout to get us ...",
    'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/qLKflJjWDi4/default.jpg',
      'width': 120,
      'height': 90},
     'medium': {'url': 'https://i.ytimg.com/vi/qLKflJjWDi4/mqdefault.jpg',
      'width': 320,
      'height': 180},
     'high': {'url': 'https://i.ytimg.com/vi/qLKflJjWDi4/hqdefault.jpg',
      'width': 480,
      'height': 360}},
   

The response is a dictionary that includes lots of metadata about the video. Let me see how to extract some of the information. For example, start with the title. 

In [7]:
response['items'][0]['snippet']['title']

'30 Minute Full Body Strong &amp; Fit Workout | EFFORT - Day 1'
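The `&amp;` in that title is an HTML entity standing in for `&`. I leave the titles as they are in this notebook, but if you want plain text, the standard library's `html.unescape` does the conversion. A quick sketch:

```python
import html

raw_title = '30 Minute Full Body Strong &amp; Fit Workout | EFFORT - Day 1'
clean_title = html.unescape(raw_title)
print(clean_title)
# 30 Minute Full Body Strong & Fit Workout | EFFORT - Day 1
```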

And the description:

In [8]:
response['items'][0]['snippet']['description']

"It's time to put in the EFFORT! DAY 1 of our PROCESS program kicks today off with a full body quick and sweaty workout to get us ..."

### 1. Create a dataframe with channel metadata

Did someone say dataframe? My package of choice is Pandas. You might prefer something else. That's fine! Do what works for you.

In [9]:
import pandas as pd

To minimize the number of calls to YouTube, widen the date range in the request so that a single call covers all of January.

In [10]:
request = youtube.search().list(
    part="snippet",
    channelId=CHANNEL_ID,  # same channel as before
    maxResults=31,
    publishedAfter="2023-01-01T00:00:00Z",
    publishedBefore="2023-02-01T00:00:00Z",
    type="video"
)
response = request.execute()

Remember that the response has lots of metadata! If you are interested, you can dive into the other keys in the response dictionary, but for this project, I only care about `items`.

In [11]:
response.keys()

dict_keys(['kind', 'etag', 'regionCode', 'pageInfo', 'items'])

I'll check that the response has the number of items I expect: Sydney published 20 videos in January 2023.

In [12]:
len(response['items'])

20
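Twenty videos fit comfortably in one page of results. A busier channel or a longer date range may not, in which case the response carries a `nextPageToken` that can be sent back as `pageToken` to fetch the next page. Here is a sketch of that loop, written against a hypothetical `fetch_page` callable so it can be tried without touching the API.

```python
def collect_all_items(fetch_page):
    """Gather 'items' across every page of results.

    fetch_page is a hypothetical callable: it takes a page token
    (None for the first page) and returns one response dictionary.
    """
    items = []
    token = None
    while True:
        response = fetch_page(token)
        items.extend(response.get('items', []))
        # No nextPageToken means this was the last page.
        token = response.get('nextPageToken')
        if not token:
            return items

# Fake three-page response, purely for illustration
fake_pages = {
    None: {'items': ['a', 'b'], 'nextPageToken': 'p2'},
    'p2': {'items': ['c'], 'nextPageToken': 'p3'},
    'p3': {'items': ['d']},
}
print(collect_all_items(lambda token: fake_pages[token]))
# ['a', 'b', 'c', 'd']
```

With the real client, `fetch_page` would wrap the `youtube.search().list(...)` call from the cell above, adding `pageToken=token` and calling `.execute()`. I have not needed this for January's data, so treat it as untested against the live API.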

Great! Now I am ready to create the dataframe of metadata for the month. I'll start with the video ids, their titles, their publication dates, and descriptions.

You'll notice that I do a couple of checks after extracting each piece of data: how long the list is and what its first element looks like.

In [13]:
ids = []
for item in response['items']:
    vid = item['id']['videoId']  # `vid` avoids shadowing Python's built-in id()
    ids.append(vid)

In [14]:
len(ids)

20

In [15]:
ids[0]

'WbkHfqIuhcE'

In [16]:
titles = []
for item in response['items']:
    title = item['snippet']['title']
    titles.append(title)

In [17]:
len(titles)

20

In [18]:
titles[0]

'30 Minute Lower Body &amp; Jump Rope Bootcamp | EFFORT - Day 10'

In [19]:
times = []
for item in response['items']:
    time = item['snippet']['publishedAt']
    times.append(time)

In [20]:
len(times)

20

In [21]:
times[0]

'2023-01-14T10:00:20Z'

You may be thinking that this is getting a little repetitive and that I should write a function to get this done. If I were putting this into production somewhere, I would agree with you.

In [22]:
descs = []
for item in response['items']:
    desc = item['snippet']['description']
    descs.append(desc)

In [23]:
len(descs)

20

In [24]:
descs[0]

"Let's work today everyone! It's DAY 10 of our PROCESS program and this workout is going to push you by working your lower ..."

We can now create the dataframe of metadata about Sydney's videos for the month of January.

In [25]:
data = pd.DataFrame({'vid':ids,'title':titles,'published':times,
             'description':descs})
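For readers who would rather have the function I mentioned, the four loops could be folded into a single pass like the sketch below. `items_to_records` is a hypothetical name, and the sample item is a trimmed-down copy of the response shown earlier with the description abbreviated.

```python
def items_to_records(items):
    """Flatten search results into one dictionary per video."""
    records = []
    for item in items:
        snippet = item['snippet']
        records.append({
            'vid': item['id']['videoId'],
            'title': snippet['title'],
            'published': snippet['publishedAt'],
            'description': snippet['description'],
        })
    return records

# Stand-in for response['items'] (abbreviated for illustration)
sample_items = [{
    'id': {'kind': 'youtube#video', 'videoId': 'qLKflJjWDi4'},
    'snippet': {
        'publishedAt': '2023-01-02T10:00:18Z',
        'title': '30 Minute Full Body Strong &amp; Fit Workout | EFFORT - Day 1',
        'description': "It's time to put in the EFFORT! ...",
    },
}]
records = items_to_records(sample_items)
print(records[0]['vid'], records[0]['published'])
# qLKflJjWDi4 2023-01-02T10:00:18Z
```

A list of dictionaries like `records` can be handed straight to `pd.DataFrame(records)` to get the same four columns.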

In [26]:
data.head()

,vid,title,published,description
0,WbkHfqIuhcE,30 Minute Lower Body &amp; Jump Rope Bootcamp ...,2023-01-14T10:00:20Z,Let's work today everyone! It's DAY 10 of our ...
1,5lXhLLaSQh8,30 Minute Arms &amp; Abs HIIT Workout | EFFORT...,2023-01-17T10:00:01Z,This week is a BIG WEEK FOR YOU! You're still ...
2,uq3ElaQPAVw,40 Minute Cardio &amp; Upper Body Sweat Workou...,2023-01-24T10:00:23Z,It's a conditioning and upper body sweat sessi...
3,lijTEQmIo4o,45 Minute Greatest Glutes &amp; Cardio Workout...,2023-01-28T10:00:27Z,It's the final workout of EFFORT in January 20...
4,kU8M0Sc_WTQ,40 Minute Legs &amp; Abs Sculpt Workout | EFFO...,2023-01-25T10:00:32Z,Let's work today everyone! It's DAY 18 of our ...


Looks good! I notice, however, that the `published` values are not in chronological order. This is probably going to bug me so...

In [27]:
data = data.sort_values('published')

In [53]:
data.head()

date,vid,title,description,viewCount,likeCount,commentCount,time
2023-01-02,qLKflJjWDi4,30 Minute Full Body Strong &amp; Fit Workout |...,It's time to put in the EFFORT! DAY 1 of our P...,229361,16373,1605,30
2023-01-03,NzIylrLhkJ4,40 Minute Upper Body and Plank Challenge | EFF...,It's time to put in the EFFORT! DAY 2 of our P...,113614,10380,561,40
2023-01-04,A5kHPgAi3I0,30 Minute Lean Legs &amp; Cardio Workout | EFF...,Oh my Quad! It's DAY 3 of our PROCESS program ...,119897,11288,772,30
2023-01-06,A67VMX9hlmc,30 Minute Upper Body &amp; HIIT Cardio Workout...,It's time to put in the EFFORT! DAY 4 of our P...,108729,10261,504,30
2023-01-07,lpv4RdNefyU,45 Minute Strong Glutes &amp; Abs Workout | EF...,These glutes are on FIYAH! It's DAY 5 of our P...,99108,8675,468,45


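That's better. Note the cell number, though: this output comes from a later re-run of the cell, so it already shows the `date` index and the statistics columns that get added in the next section.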

### 2. Augment dataframe with video metadata

To get more information about each video, we formulate a new request. 

In [28]:
def get_vid_stats(vid):
    request = youtube.videos().list(
        part="statistics",
        id=vid
    )
    response = request.execute()
    return response

Let's see what kind of information we get about one of the videos in our dataframe.

In [29]:
response = get_vid_stats(data.loc[0,'vid'])
response

{'kind': 'youtube#videoListResponse',
 'etag': 'CVwEmYwnUmk3l0wQ8GRIWoavsjA',
 'items': [{'kind': 'youtube#video',
   'etag': 'uT2FDFTPSgEz6JInvYCkbfWws5k',
   'id': 'WbkHfqIuhcE',
   'statistics': {'viewCount': '69261',
    'likeCount': '7862',
    'favoriteCount': '0',
    'commentCount': '714'}}],
 'pageInfo': {'totalResults': 1, 'resultsPerPage': 1}}

I spy `statistics` in the `items` list. How to extract them?

In [30]:
response['items'][0]['statistics']

{'viewCount': '69261',
 'likeCount': '7862',
 'favoriteCount': '0',
 'commentCount': '714'}

In [31]:
keys = list(response['items'][0]['statistics'].keys())
keys

['viewCount', 'likeCount', 'favoriteCount', 'commentCount']

For this I will write a function.

In [32]:
def extract_stats(row):
    '''
    Add new columns with video statistics to the row of a pd.DataFrame
    '''
    response = get_vid_stats(row['vid'])
    for key in keys:
        row[key] = response['items'][0]['statistics'][key]
    return row

And test it out...

In [33]:
extract_stats(data.loc[0])

vid                                                    WbkHfqIuhcE
title            30 Minute Lower Body &amp; Jump Rope Bootcamp ...
published                                     2023-01-14T10:00:20Z
description      Let's work today everyone! It's DAY 10 of our ...
viewCount                                                    69261
likeCount                                                     7862
favoriteCount                                                    0
commentCount                                                   714
Name: 0, dtype: object

It works as expected, so apply to all the rows:

In [34]:
data = data.apply(extract_stats, axis=1)
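That `apply` makes one request per video, which is fine for twenty rows. The `id` parameter of `videos().list` also accepts several ids joined by commas (my understanding is that the limit is 50 per request), so a larger job could be batched. A sketch of the batching part only, with hypothetical helper names:

```python
def chunked(seq, size=50):
    """Yield consecutive slices of seq with at most `size` elements each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

def id_batches(video_ids, size=50):
    """Return comma-separated id strings, one string per request."""
    return [','.join(chunk) for chunk in chunked(video_ids, size)]

# Tiny batch size so the splitting is visible
print(id_batches(['WbkHfqIuhcE', '5lXhLLaSQh8', 'uq3ElaQPAVw'], size=2))
# ['WbkHfqIuhcE,5lXhLLaSQh8', 'uq3ElaQPAVw']
```

Each string could be passed to `get_vid_stats`. The response would then hold one entry per video in `items`, so the extraction step would need to match on each entry's `id` instead of always taking `items[0]`.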

See what this looks like...

In [35]:
data.head()

,vid,title,published,description,viewCount,likeCount,favoriteCount,commentCount
18,qLKflJjWDi4,30 Minute Full Body Strong &amp; Fit Workout |...,2023-01-02T10:00:18Z,It's time to put in the EFFORT! DAY 1 of our P...,229361,16373,0,1605
12,NzIylrLhkJ4,40 Minute Upper Body and Plank Challenge | EFF...,2023-01-03T10:00:11Z,It's time to put in the EFFORT! DAY 2 of our P...,113614,10380,0,561
13,A5kHPgAi3I0,30 Minute Lean Legs &amp; Cardio Workout | EFF...,2023-01-04T10:00:04Z,Oh my Quad! It's DAY 3 of our PROCESS program ...,119897,11288,0,772
17,A67VMX9hlmc,30 Minute Upper Body &amp; HIIT Cardio Workout...,2023-01-06T10:00:10Z,It's time to put in the EFFORT! DAY 4 of our P...,108729,10261,0,504
16,lpv4RdNefyU,45 Minute Strong Glutes &amp; Abs Workout | EF...,2023-01-07T10:00:17Z,These glutes are on FIYAH! It's DAY 5 of our P...,99108,8675,0,468
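One thing the table hides: the API returns these counts as strings (look at the quotes in the raw `statistics` dictionary above). If you plan to do arithmetic on them, convert first. Below is a sketch with a hypothetical helper, `parse_stats`. It also assumes that a field such as `likeCount` might be missing from some responses, which is a guess on my part and did not happen with this data.

```python
def parse_stats(statistics, keys=('viewCount', 'likeCount', 'commentCount')):
    """Convert string counts to integers; keys that are absent become None."""
    parsed = {}
    for key in keys:
        value = statistics.get(key)
        parsed[key] = int(value) if value is not None else None
    return parsed

stats = {'viewCount': '69261', 'likeCount': '7862',
         'favoriteCount': '0', 'commentCount': '714'}
print(parse_stats(stats))
# {'viewCount': 69261, 'likeCount': 7862, 'commentCount': 714}
```

Inside pandas, calling `astype(int)` on the count columns would do the same job when no values are missing.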


At this point, my data science brain is popcorning with hypotheses about what day of the week has the most plays, if there are time series trends, how this January compares to last January...but I'm trying to stay on topic.

Just a little more metadata that interests me: how long is each workout? Conveniently, the answer is in the title of each video.

I'll use a regular expression to extract the minutes. My favorite resource for testing out regex strings is [regex101.com](https://regex101.com). If you use this resource, don't put any sensitive information in the test string!!!

In [36]:
import re

Test this out...I expect to get the number 30.

In [37]:
match = r'(\d+)'
matches = re.search(match, data.loc[0,'title'])
matches.groups()[0]

'30'

Subtle but important to note ... this `'30'` is a string and not an integer. I'll make sure to convert it before storing in the dataframe.

In [38]:
def get_workout_time(title):
    '''
    Return the first number in the title as an integer, or -1 if there is none
    '''
    matches = re.search(r'(\d+)', title)
    if matches is None:
        return -1
    return int(matches.group(1))

In [39]:
data['time'] = data['title'].apply(get_workout_time)

Do a quick check:

In [40]:
data[['title','time']].head()

,title,time
18,30 Minute Full Body Strong &amp; Fit Workout |...,30
12,40 Minute Upper Body and Plank Challenge | EFF...,40
13,30 Minute Lean Legs &amp; Cardio Workout | EFF...,30
17,30 Minute Upper Body &amp; HIIT Cardio Workout...,30
16,45 Minute Strong Glutes &amp; Abs Workout | EF...,45


Are you following along with a channel of your own interest? Look for the total video time with a different request. Consult the YouTube Data API documentation: you should find `duration` in the `part` called `contentDetails`.

The `duration` value comes back as an ISO 8601 duration string of the form `PT#M#S` (for example, `PT30M15S`), so you'll have to get creative to extract the information you're looking for.
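If you go down that road, the duration strings can be unpacked with a regular expression. The sketch below handles hours, minutes, and seconds only and deliberately rejects anything else (a day component, for instance). The example values are made up, not taken from Sydney's channel.

```python
import re

# Matches strings such as PT45S, PT30M15S, or PT1H2M3S
DURATION_PATTERN = re.compile(
    r'^PT(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?$'
)

def duration_to_seconds(duration):
    """Convert an ISO 8601 time-only duration string to a number of seconds."""
    match = DURATION_PATTERN.match(duration)
    if match is None:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    # Components that are absent come back as None and count as zero.
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return parts['hours'] * 3600 + parts['minutes'] * 60 + parts['seconds']

print(duration_to_seconds('PT30M15S'))
# 1815
print(duration_to_seconds('PT1H2M3S'))
# 3723
```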

The last bit of metadata that I want is the date from the `published` column. Conveniently, Pandas has this kind of conversion built in. If you can remember the syntax, more power to you. I have to google it every single time.

Here's the right syntax, tested out on the first five rows of the `published` column.

In [41]:
pd.to_datetime(data['published'].head()).dt.date

18    2023-01-02
12    2023-01-03
13    2023-01-04
17    2023-01-06
16    2023-01-07
Name: published, dtype: object
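If you are curious what that conversion amounts to for a single value, here is roughly the same thing with only the standard library. The format string assumes the timestamps always look like the ones above (UTC with a trailing `Z` and no fractional seconds).

```python
from datetime import datetime

def published_date(timestamp):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string and return its calendar date."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").date()

day = published_date('2023-01-02T10:00:18Z')
print(day.isoformat())
# 2023-01-02
print(day.weekday())  # Monday is 0
# 0
```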

Polish this dataframe up: extract the publication date of each video, and then use these dates as the index. 

In [43]:
data['date'] = pd.to_datetime(data['published']).dt.date

In [47]:
data = data.set_index('date')
data.head()

date,vid,title,published,description,viewCount,likeCount,favoriteCount,commentCount,time
2023-01-02,qLKflJjWDi4,30 Minute Full Body Strong &amp; Fit Workout |...,2023-01-02T10:00:18Z,It's time to put in the EFFORT! DAY 1 of our P...,229361,16373,0,1605,30
2023-01-03,NzIylrLhkJ4,40 Minute Upper Body and Plank Challenge | EFF...,2023-01-03T10:00:11Z,It's time to put in the EFFORT! DAY 2 of our P...,113614,10380,0,561,40
2023-01-04,A5kHPgAi3I0,30 Minute Lean Legs &amp; Cardio Workout | EFF...,2023-01-04T10:00:04Z,Oh my Quad! It's DAY 3 of our PROCESS program ...,119897,11288,0,772,30
2023-01-06,A67VMX9hlmc,30 Minute Upper Body &amp; HIIT Cardio Workout...,2023-01-06T10:00:10Z,It's time to put in the EFFORT! DAY 4 of our P...,108729,10261,0,504,30
2023-01-07,lpv4RdNefyU,45 Minute Strong Glutes &amp; Abs Workout | EF...,2023-01-07T10:00:17Z,These glutes are on FIYAH! It's DAY 5 of our P...,99108,8675,0,468,45


Before I save this dataframe, I'm going to drop a couple of columns that don't look useful.

In [48]:
drop = ['published','favoriteCount']

In [50]:
data = data.drop(columns=drop)

In [51]:
data.head()

date,vid,title,description,viewCount,likeCount,commentCount,time
2023-01-02,qLKflJjWDi4,30 Minute Full Body Strong &amp; Fit Workout |...,It's time to put in the EFFORT! DAY 1 of our P...,229361,16373,1605,30
2023-01-03,NzIylrLhkJ4,40 Minute Upper Body and Plank Challenge | EFF...,It's time to put in the EFFORT! DAY 2 of our P...,113614,10380,561,40
2023-01-04,A5kHPgAi3I0,30 Minute Lean Legs &amp; Cardio Workout | EFF...,Oh my Quad! It's DAY 3 of our PROCESS program ...,119897,11288,772,30
2023-01-06,A67VMX9hlmc,30 Minute Upper Body &amp; HIIT Cardio Workout...,It's time to put in the EFFORT! DAY 4 of our P...,108729,10261,504,30
2023-01-07,lpv4RdNefyU,45 Minute Strong Glutes &amp; Abs Workout | EF...,These glutes are on FIYAH! It's DAY 5 of our P...,99108,8675,468,45


And now it's time to save! I will continue augmenting this dataframe with transcript information in the next notebook.

In [52]:
data.to_csv('../data/2023-01-effort-metadata.csv')