# Step 1: Get the transcripts for Sydney's January 2023 videos

0. Set up access to YouTube's Data API
1. Create a dataframe with channel metadata
2. Augment dataframe with video metadata
3. Acquire transcripts for each video
4. Isolate the motivational talk from each transcript
5. Save dataframe to a file (if you want to get fancy, you could create a database)

### 0. Set up access to YouTube's Data API
First things first: follow the steps in [YouTube's Data API Overview](https://developers.google.com/youtube/v3/getting-started) to obtain an API key. You will need a Google Developer account. I balked at signing up for one but decided it was fine for me. Read up on it and make the decision that works for you.

Once you get your API key, put it somewhere safe! **Do not hard code your key into your code.** You can get fancy and use a key manager, but if you are relatively new to coding, put it in a file that will NOT be pushed to git. My two-step method is maybe not the best way, but it works: store the key in a config folder in my project, then add that folder name to .gitignore.

In [2]:
with open('../config/youtube_api') as file:
    # strip() drops the trailing newline that readline() keeps
    DEVELOPER_KEY = file.readline().strip()

Now my key is safely loaded into memory, but no one reading my code can see the actual value. Another way to do this would be with an environment variable. If you are dealing with keys at work, please follow the best practices of your workplace.
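Here is a minimal sketch of the environment variable approach. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are my own placeholders, not anything the API requires.

```python
import os

# YOUTUBE_API_KEY is a hypothetical name; use whatever you exported in your shell.
def load_api_key(var_name="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key

# Usage, once the variable is set:
# DEVELOPER_KEY = load_api_key()
```

Failing loudly when the variable is missing is a design choice: it beats sending an empty key to the API and puzzling over the error that comes back.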

**If you have not already followed the steps on [the Python Quickstart page](https://developers.google.com/youtube/v3/quickstart/python), take a break and go do that first.**

This tutorial will still be here! Come back when you are ready. 

These next two cells get you set up to make requests with the YouTube API.

In [4]:
# API client library
import googleapiclient.discovery
# API information
api_service_name = "youtube"
api_version = "v3"

In [5]:
# API client
youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey = DEVELOPER_KEY)

The next cell specifies the YouTube channel id. Perhaps this will be obvious to others, but it took me a while to figure out that this value is NOT the name of the YouTube channel, but the opaque identifier string that goes along with it, and a while longer to track it down.

If you want to query some other channel's data, consider this an invitation to go figure out the channel id! 

In [6]:
CHANNEL_ID = 'UCVQJZE_on7It_pEv6tn-jdA' # Sydney's channel id

I wanted to see what a request to the youtube client returned, so I formulated a request to list the channel's first video of the year.

Please note the format of the datetime value is [International Standard ISO 8601](https://en.wikipedia.org/wiki/ISO_8601). You will certainly encounter other formats in your data science career, but I would be surprised if this is the last time you see this one.
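If you would rather build these timestamp strings than type them by hand, Python's `datetime` module can produce them. This is a small sketch; the helper name `to_api_timestamp` is made up for illustration.

```python
from datetime import datetime, timezone

def to_api_timestamp(dt):
    """Format a timezone-aware datetime as a UTC string like 2023-01-02T00:00:00Z."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# The one-day window used for the first video of the year
start = to_api_timestamp(datetime(2023, 1, 2, tzinfo=timezone.utc))
end = to_api_timestamp(datetime(2023, 1, 3, tzinfo=timezone.utc))
print(start, end)
```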

In [15]:
request = youtube.search().list(
    part="snippet",
    channelId=CHANNEL_ID,
    maxResults=31,
    publishedAfter="2023-01-02T00:00:00Z",
    publishedBefore="2023-01-03T00:00:00Z",
    type="video"
)
response = request.execute()

Let's see what it looks like.

In [16]:
response

{'kind': 'youtube#searchListResponse',
 'etag': 'vj5RWVnb9NxGiBPbM23M08JWY8s',
 'regionCode': 'US',
 'pageInfo': {'totalResults': 3, 'resultsPerPage': 1},
 'items': [{'kind': 'youtube#searchResult',
   'etag': 'pC-CzKtb-Sq3vGfXqccypRHXUjo',
   'id': {'kind': 'youtube#video', 'videoId': 'qLKflJjWDi4'},
   'snippet': {'publishedAt': '2023-01-02T10:00:18Z',
    'channelId': 'UCVQJZE_on7It_pEv6tn-jdA',
    'title': '30 Minute Full Body Strong &amp; Fit Workout | EFFORT - Day 1',
    'description': "It's time to put in the EFFORT! DAY 1 of our PROCESS program kicks today off with a full body quick and sweaty workout to get us ...",
    'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/qLKflJjWDi4/default.jpg',
      'width': 120,
      'height': 90},
     'medium': {'url': 'https://i.ytimg.com/vi/qLKflJjWDi4/mqdefault.jpg',
      'width': 320,
      'height': 180},
     'high': {'url': 'https://i.ytimg.com/vi/qLKflJjWDi4/hqdefault.jpg',
      'width': 480,
      'height': 360}},
   

The response is a dictionary that includes lots of metadata about the video. Let me see how to extract some of the information. For example, start with the title. 

In [17]:
response['items'][0]['snippet']['title']

'30 Minute Full Body Strong &amp; Fit Workout | EFFORT - Day 1'

And the description:

In [18]:
response['items'][0]['snippet']['description']

"It's time to put in the EFFORT! DAY 1 of our PROCESS program kicks today off with a full body quick and sweaty workout to get us ..."
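One detail in the title above: the ampersand comes back as the HTML entity `&amp;`. If you want plain text, the standard library's `html.unescape` converts it. A quick sketch using the title from the response:

```python
import html

raw_title = '30 Minute Full Body Strong &amp; Fit Workout | EFFORT - Day 1'

# Convert HTML entities such as &amp; back into plain characters
clean_title = html.unescape(raw_title)
print(clean_title)
```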

### 1. Create a dataframe with channel metadata

Did someone say dataframe? My package of choice is Pandas. You might prefer something else. That's fine! Do what works for you.

In [36]:
import pandas as pd

To minimize the number of calls to YouTube, widen the date range so that a single request covers all of January.

In [20]:
request = youtube.search().list(
    part="snippet",
    channelId=CHANNEL_ID,
    maxResults=31,
    publishedAfter="2023-01-01T00:00:00Z",
    publishedBefore="2023-02-01T00:00:00Z",
    type="video"
)
response = request.execute()

Remember that the response has lots of metadata! If you are interested, you can dive into the other keys in the response dictionary, but for this project, I only care about `items`.

In [21]:
response.keys()

dict_keys(['kind', 'etag', 'regionCode', 'pageInfo', 'items'])

I'll check that the response has the number of items I expect: Sydney published 20 videos in January 2023.

In [22]:
len(response['items'])

20
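Here all 20 videos fit in a single page of results. If a date range ever held more videos than one page returns, the count would come up short and you would need to follow the pagination. The sketch below is one way to do that, assuming the `list_next` method that the Google API Python client provides on paginated resources. The function name `collect_all_items` is my own, and I did not need it for this project.

```python
def collect_all_items(resource, request):
    """Follow pagination and gather the 'items' from every page of results."""
    items = []
    while request is not None:
        response = request.execute()
        items.extend(response.get("items", []))
        # list_next returns None when there is no further page
        request = resource.list_next(request, response)
    return items

# Usage with the client and request built earlier (each page is another API call):
# all_items = collect_all_items(youtube.search(), request)
```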

Great! Now I am ready to create the dataframe of metadata for the month. I'll start with the video ids, their titles, their publication dates, and descriptions.

You'll notice that I do a couple of checks after extracting the data: the length of the list and what the content looks like.

In [23]:
ids = []
for item in response['items']:
    # video_id rather than id, so the built-in id() is not shadowed
    video_id = item['id']['videoId']
    ids.append(video_id)

In [24]:
len(ids)

20

In [25]:
ids[0]

'A5kHPgAi3I0'

In [26]:
titles = []
for item in response['items']:
    title = item['snippet']['title']
    titles.append(title)

In [27]:
len(titles)

20

In [28]:
titles[0]

'30 Minute Lean Legs &amp; Cardio Workout | EFFORT - DAY 3'

In [29]:
times = []
for item in response['items']:
    time = item['snippet']['publishedAt']
    times.append(time)

In [30]:
len(times)

20

In [31]:
times[0]

'2023-01-04T10:00:04Z'

You'll notice that this is getting a little repetitive and you might be thinking I should write a function to get this done. If I were putting this into production somewhere, I would agree with you.
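For the curious, such a function might look like the sketch below. The name `extract_field` is hypothetical, and the sample item is trimmed down from the response above.

```python
def extract_field(items, *keys):
    """Return one value per item by walking the given keys in order."""
    values = []
    for item in items:
        value = item
        for key in keys:
            value = value[key]
        values.append(value)
    return values

# A one-item stand-in for response['items']
sample_items = [
    {"id": {"videoId": "A5kHPgAi3I0"},
     "snippet": {"title": "30 Minute Lean Legs &amp; Cardio Workout | EFFORT - DAY 3",
                 "publishedAt": "2023-01-04T10:00:04Z"}},
]

sample_ids = extract_field(sample_items, "id", "videoId")
sample_times = extract_field(sample_items, "snippet", "publishedAt")
print(sample_ids, sample_times)
```

With the real response, each of the loops in this section would become a single call such as `extract_field(response['items'], 'snippet', 'title')`.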

In [32]:
descs = []
for item in response['items']:
    desc = item['snippet']['description']
    descs.append(desc)

In [33]:
len(descs)

20

In [34]:
descs[0]

"Oh my Quad! It's DAY 3 of our PROCESS program and this workout is going to push you by getting your heart rate up and working ..."

We can now create the dataframe of metadata about Sydney's videos for the month of January.

In [37]:
data = pd.DataFrame({'vid':ids,'title':titles,'published':times,
             'description':descs})

In [38]:
data.head()

,vid,title,published,description
0,A5kHPgAi3I0,30 Minute Lean Legs &amp; Cardio Workout | EFF...,2023-01-04T10:00:04Z,Oh my Quad! It's DAY 3 of our PROCESS program ...
1,DhnmowHRYjk,40 Minute Legs &amp; HIIT Cardio Sweat Workout...,2023-01-18T10:00:12Z,Let's work today everyone! It's DAY 13 of our ...
2,WbkHfqIuhcE,30 Minute Lower Body &amp; Jump Rope Bootcamp ...,2023-01-14T10:00:20Z,Let's work today everyone! It's DAY 10 of our ...
3,kU8M0Sc_WTQ,40 Minute Legs &amp; Abs Sculpt Workout | EFFO...,2023-01-25T10:00:32Z,Let's work today everyone! It's DAY 18 of our ...
4,lijTEQmIo4o,45 Minute Greatest Glutes &amp; Cardio Workout...,2023-01-28T10:00:27Z,It's the final workout of EFFORT in January 20...


Looks good! I notice, however, that the `published` values are not in chronological order. Just something to keep in mind in the future.
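If you do need chronological order later, one option is to parse the timestamps and sort. This sketch uses a three-row stand-in built from the values above rather than the full `data` dataframe.

```python
import pandas as pd

# Three rows copied from the output above, in the order the API returned them
sample = pd.DataFrame({
    "vid": ["A5kHPgAi3I0", "DhnmowHRYjk", "WbkHfqIuhcE"],
    "published": ["2023-01-04T10:00:04Z",
                  "2023-01-18T10:00:12Z",
                  "2023-01-14T10:00:20Z"],
})

# Parse the ISO 8601 strings into timezone-aware timestamps, then sort
sample["published"] = pd.to_datetime(sample["published"], utc=True)
sample_sorted = sample.sort_values("published").reset_index(drop=True)
print(sample_sorted)
```

Another option, which I have not tried here, is to ask the API for date ordering by passing `order="date"` in the search request.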