# Video Transcripts

The main goal for this pet project is to capture Sydney's motivational talks. To do that, I need access to the transcripts for each video. This notebook captures that process.

Note: Section links work in JupyterLab but do not work when viewed in GitHub.

1. [Load](#Load-video-ids) video ids from the output of the [previous notebook](https://github.com/ltran17/motivational_messages/blob/main/notebooks/01-metadata.ipynb)
2. [Acquire](#Acquire-transcripts-for-each-video) transcripts for each video
3. [Extract](#Extract-the-motivational-talk-from-each-transcript) the motivational talk from each transcript
4. [Save](#Save-dataframe-to-a-file) dataframe to a file 

### Load video ids

In [1]:
import pandas as pd

I'm going to take a little side-trip here and talk about data hygiene.

* Raw data should never be modified. Results of data cleaning or other transformations should never overwrite the original file. Best practices for data pipelines sometimes refer to [Bronze, Silver, and Gold level data](https://www.linkedin.com/pulse/consider-gold-silver-bronze-your-data-just-olympics-ruaidhri-hallinan/). These are code-words for data in raw, interim, or processed stages.

* This is a small enough project that I'm not going to create separate folders for each stage of data. For any project larger than this, I would use the [Cookiecutter Data Science file structure](https://drivendata.github.io/cookiecutter-data-science/#directory-structure).

* **GitHub is a code repository and not a data repository.** I do not push data to GitHub. Data versioning tools exist! I encourage you to look into those. 

All that is to say: you will not find the output of the previous notebook in this repo. You will need to run the code yourself and save the data where you want it to live.
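The "never overwrite" rule can be enforced in code rather than by habit. As a minimal sketch (the helper name `write_new_file` is hypothetical and not part of this project), Python's exclusive-creation mode `'x'` refuses to open a file that already exists:

```python
from pathlib import Path
import tempfile


def write_new_file(path, text):
    """Write text to path, failing if the file already exists."""
    # Mode 'x' raises FileExistsError instead of silently overwriting.
    with open(path, 'x', encoding='utf-8') as f:
        f.write(text)


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / 'raw.csv'
    write_new_file(raw, 'date,vid\n')
    try:
        write_new_file(raw, 'oops')
        overwritten = True
    except FileExistsError:
        overwritten = False
    raw_contents = raw.read_text(encoding='utf-8')

print(overwritten)
```

The second write fails, so the first version of the file survives untouched.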

* Here is where I stored the data:

In [2]:
datafile = '../data/2023-01-effort-metadata.csv'

* Since I know what my data looks like, I will take advantage of the parameters in the Pandas [`.read_csv` function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

In [3]:
data = pd.read_csv(datafile, parse_dates=['date'], index_col='date')
data.head()

| date | vid | title | description | viewCount | likeCount | commentCount | time |
|---|---|---|---|---|---|---|---|
| 2023-01-02 | qLKflJjWDi4 | 30 Minute Full Body Strong &amp; Fit Workout \|... | It's time to put in the EFFORT! DAY 1 of our P... | 246754 | 16921 | 1624 | 30 |
| 2023-01-03 | NzIylrLhkJ4 | 40 Minute Upper Body and Plank Challenge \| EFF... | It's time to put in the EFFORT! DAY 2 of our P... | 118414 | 10624 | 565 | 40 |
| 2023-01-04 | A5kHPgAi3I0 | 30 Minute Lean Legs &amp; Cardio Workout \| EFF... | Oh my Quad! It's DAY 3 of our PROCESS program ... | 125434 | 11581 | 780 | 30 |
| 2023-01-06 | A67VMX9hlmc | 30 Minute Upper Body &amp; HIIT Cardio Workout... | It's time to put in the EFFORT! DAY 4 of our P... | 114556 | 10591 | 511 | 30 |
| 2023-01-07 | lpv4RdNefyU | 45 Minute Strong Glutes &amp; Abs Workout \| EF... | These glutes are on FIYAH! It's DAY 5 of our P... | 104764 | 8970 | 475 | 45 |
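
To see what those two `read_csv` parameters do without the real file, here is a small self-contained sketch. The two-row CSV is made up for illustration and is read from memory rather than from disk:

```python
import io

import pandas as pd

# Made-up stand-in for the metadata file (not the real data).
csv_text = """date,vid,time
2023-01-02,video_a,30
2023-01-03,video_b,40
"""

# parse_dates turns the 'date' strings into timestamps;
# index_col makes that column the row index.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], index_col='date')

print(sample.index.dtype)
print(sample['vid'].iloc[0])
```

With a date index, rows are looked up by date label, and positional access goes through `.iloc`.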


* And now I have the video ids.

In [5]:
vids = data['vid'].copy()
test_vid = vids.iloc[0]  # positional access; the index holds dates, not integers
test_vid

'qLKflJjWDi4'

[Return to top](#Video-Transcripts)

### Acquire transcripts for each video

The YouTube API does not make it easy to get the transcripts from videos! As I googled around to figure it out, I landed on [YouTube Transcript API](https://pypi.org/project/youtube-transcript-api/) (available through pypi.org). 

* This package makes the transcript acquisition task straightforward. I encourage you to familiarize yourself with the package through the link above or [fork it yourself from the GitHub page](https://github.com/jdepoix/youtube-transcript-api).

* **Caveat emptor**: The package author notes, "This code uses an undocumented part of the YouTube API, which is called by the YouTube web-client. So there is no guarantee that it won't stop working tomorrow, if they change how things work." Good to know.

In [4]:
from youtube_transcript_api import YouTubeTranscriptApi

* I'm going to test out YouTube Transcript API's `.get_transcript` function.

In [6]:
transcript = YouTubeTranscriptApi.get_transcript(test_vid)

In [7]:
type(transcript)

list

In [8]:
len(transcript)

562

* That's a long list! I'll inspect just the first few elements.

In [9]:
transcript[:3]

[{'text': "what's up everyone it's Sydney welcome",
  'start': 0.0,
  'duration': 4.86},
 {'text': 'to process this is stage one the effort',
  'start': 2.159,
  'duration': 4.381},
 {'text': 'so grab some dumbbells for our 30 minute',
  'start': 4.86,
  'duration': 5.48}]

* `transcript` is a list of dictionaries, each with three keys: `text`, `start`, and `duration`. The `start` and `duration` values are in seconds, with up to three decimal places.

* I could write a function to extract the text values ... but the YouTube Transcript API has already done this! How did I know it was there? I read [the documentation](https://github.com/jdepoix/youtube-transcript-api#using-formatters). This is a thoughtfully created package.
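
For comparison, the hand-rolled version is short. This sketch assumes the only job is to concatenate the `text` values; the helper name `join_text` is mine, and the sample entries are the three shown above:

```python
sample_transcript = [
    {'text': "what's up everyone it's Sydney welcome", 'start': 0.0, 'duration': 4.86},
    {'text': 'to process this is stage one the effort', 'start': 2.159, 'duration': 4.381},
    {'text': 'so grab some dumbbells for our 30 minute', 'start': 4.86, 'duration': 5.48},
]


def join_text(transcript, sep=' '):
    """Concatenate the 'text' value of every transcript entry."""
    return sep.join(entry['text'] for entry in transcript)


print(join_text(sample_transcript))
```

Passing `sep='\n'` instead would put each caption on its own line.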

In [10]:
from youtube_transcript_api.formatters import TextFormatter

In [11]:
formatter = TextFormatter()
text = formatter.format_transcript(transcript)

In [12]:
type(text)

str

In [13]:
len(text)

14895

In [14]:
text[:100]

"what's up everyone it's Sydney welcome\nto process this is stage one the effort\nso grab some dumbbell"

* This formatter joins the `text` values from `transcript` with a newline character, `\n`, between them. I'll replace those newlines with spaces.

In [15]:
text = text.replace('\n', ' ')

In [16]:
text[:100]

"what's up everyone it's Sydney welcome to process this is stage one the effort so grab some dumbbell"

* Now that I know how to get the transcript for each video, I'll write a function.

* Because this is not part of the official YouTube API, I am going to be extra careful with these calls and not hammer the server with transcript requests. Maybe I'm being excessively cautious but I really don't want my YouTube access throttled.

In [17]:
import time

In [18]:
def get_transcript(vid, sleep=3):
    '''
    Return the transcript for vid as a single string.
    Pause for sleep seconds before returning, to space out requests.
    '''
    transcript = YouTubeTranscriptApi.get_transcript(vid)
    text = formatter.format_transcript(transcript)
    text = text.replace('\n', ' ')
    time.sleep(sleep)  # use the sleep argument rather than a fixed 3 seconds
    return text

* Because of the pause in the function, this next cell takes a minute to run. If you are braver than I am, you can use a smaller value for the `sleep` parameter.

In [19]:
transcripts = vids.apply(get_transcript)
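
A single failed request would abort the whole `apply` call. A more defensive loop is sketched below. Everything in it is an assumption for illustration: `fetch_all` is a hypothetical helper, and `fetch` stands for any function that takes a video id and returns its text. It catches a broad `Exception` only because the specific errors the package raises are not covered here.

```python
import time


def fetch_all(vids, fetch, pause=3, retries=2):
    """Call fetch(vid) for each vid, pausing after every request.

    Returns (results, failures): results maps vid -> text,
    failures maps vid -> the last error message.
    """
    results, failures = {}, {}
    for vid in vids:
        for _attempt in range(retries + 1):
            try:
                results[vid] = fetch(vid)
                failures.pop(vid, None)  # clear errors from earlier attempts
                succeeded = True
            except Exception as err:  # broad on purpose in this sketch
                failures[vid] = str(err)
                succeeded = False
            time.sleep(pause)  # pause whether or not the request worked
            if succeeded:
                break
    return results, failures


# Stand-in fetcher so the sketch runs offline: 'bad' always fails.
def fake_fetch(vid):
    if vid == 'bad':
        raise ValueError('no transcript')
    return f'transcript for {vid}'


results, failures = fetch_all(['a', 'bad', 'b'], fake_fetch, pause=0)
print(results)
print(failures)
```

Collecting failures separately means one unavailable transcript does not cost the transcripts already downloaded.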

* Just how long are these transcripts? Too long to read through them all 😬

In [20]:
transcripts.str.len()

date
2023-01-02    14895
2023-01-03    20794
2023-01-04    16269
2023-01-06    16901
2023-01-07    23054
2023-01-09    19550
2023-01-10    15901
2023-01-11    21334
2023-01-13    20445
2023-01-14    16964
2023-01-16    19286
2023-01-17    15765
2023-01-18    22121
2023-01-20    15751
2023-01-21    24206
2023-01-23    19954
2023-01-24    18046
2023-01-25    21238
2023-01-27    19219
2023-01-28    21330
Name: vid, dtype: int64

[Return to top](#Video-Transcripts)

### Extract the motivational talk from each transcript

* Because I have done (literally) hundreds of Sydney's workouts, I know that the motivational talk almost always begins after the phrase, "You have made it to your cool down!"

* I explored the transcripts and discovered that sometimes this phrase is captured as "cool down" and sometimes as "cooldown." 

* There are also occasions where Sydney uses a different phrase, like "As the clock winds down to zero," or "Before you leave this workout."
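
One way to cover both spellings at once, and to pick whichever marker occurs earliest in the transcript, is a regular expression. This is a sketch of an alternative approach, not what this notebook uses; the names `CLOSING_PATTERN` and `split_after_marker` are hypothetical.

```python
import re

# 'cool ?down' matches both "cool down" and "cooldown".
CLOSING_PATTERN = re.compile(
    r'to your cool ?down|as the clock|leave this workout'
)


def split_after_marker(transcript):
    """Return the lowercased text after the earliest closing marker, or None."""
    text = transcript.lower()
    match = CLOSING_PATTERN.search(text)  # search finds the leftmost match
    if match is None:
        return None
    return text[match.end():]


print(split_after_marker('You have made it to your cooldown! Big exhale'))
print(split_after_marker('No marker here'))
```

Returning `None` rather than a message string keeps "not found" distinguishable from real text.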

In [21]:
phrases = ['to your cool down',
           'to your cooldown',
           'as the clock',
           'leave this workout']

In [22]:
def get_motivational_text(transcript):
    '''
    Return the lowercased text that follows the first phrase
    in phrases found in the transcript.
    '''
    text = transcript.lower()
    for phrase in phrases:
        loc = text.find(phrase)
        if loc != -1:  # str.find returns -1 when the phrase is absent
            return text[loc+len(phrase):]
    return 'closing phrase not found'

Test out the function:

In [23]:
test_transcript = transcripts.iloc[0]  # positional access; the index holds dates
len(test_transcript)

14895

In [24]:
motivation = get_motivational_text(test_transcript)
len(motivation)

2133

Looks like it worked, but just to be sure...I will give it a read.

In [25]:
motivation

" big exhale shift your hips up and forward and before you head out if this is your first workout or your 100th workout with me i want you to make sure you're subscribed to the channel it helps youtube show more workouts from our channel to more people so that more people can have access to high quality fitness all over the world don't forget to grab your workout calendar as well in the description so you know what's coming it's a big month it's the start of a big series walk it back to your toes come on up bend your knees slowly come up to the top great job a lot of things very exciting this month cross your arm over your chest and as we get into this effort i want you to apply this everywhere okay not just in the workouts effort is a vigorous we're determined attempt at this lifestyle is what i want i want your mind to say i'm determined not i'm motivated not i'm in the mood not i'm new year's hype i'm determined i'm dedicated cross it over do not let this effort fade when the mood f

* That's still a lot of words! Many of them are boilerplate filler or directional -- "subscribe to the channel", "give this video a big thumbs up", "bend your knees slowly".

* This is good enough for me. I'm going to leave it to ChatGPT to filter out what I'm looking for.

* Time to get all the motivations.

In [26]:
motivations = transcripts.apply(get_motivational_text)

In [27]:
motivations

date
2023-01-02     big exhale shift your hips up and forward and...
2023-01-03     amazing job rest back on your glutes you did ...
2023-01-04     break deep breath nice wide stance hinge forw...
2023-01-06     give me two minutes here nice wide stance exh...
2023-01-07     amazing work give me two minutes do not leave...
2023-01-09     hands on your knees do not leave [music] don'...
2023-01-10     stay here don't move hands on your knees now ...
2023-01-11     amazing work come on over hands and knees big...
2023-01-13     hands down exhale and press up beautiful job ...
2023-01-14     toss your rope to the side wide stance exhale...
2023-01-16     amazing work don't leave yet come on down han...
2023-01-17                             closing phrase not found
2023-01-18     hands on your knees bend your legs just a lit...
2023-01-20     amazing job hands up overhead you've made it ...
2023-01-21     but give me time for that cool down shift you...
2023-01-23     we've got a nice lon

* Hmm...The motivation for January 17 is missing!

* Here's where you, data scientist, get to decide how to deal with missing data. How important is it to have this particular day's motivation? Do you want to impute the value, or can you drop the record?

* My choice is to drop the record. There's enough motivation to go around with the other 19 videos.

* Prepare the dataframe.

In [28]:
data = pd.DataFrame({'vid':vids,
              'motivation':motivations
             })
data

| date | vid | motivation |
|---|---|---|
| 2023-01-02 | qLKflJjWDi4 | big exhale shift your hips up and forward and... |
| 2023-01-03 | NzIylrLhkJ4 | amazing job rest back on your glutes you did ... |
| 2023-01-04 | A5kHPgAi3I0 | break deep breath nice wide stance hinge forw... |
| 2023-01-06 | A67VMX9hlmc | give me two minutes here nice wide stance exh... |
| 2023-01-07 | lpv4RdNefyU | amazing work give me two minutes do not leave... |
| 2023-01-09 | m3mM77Nx5EQ | hands on your knees do not leave [music] don'... |
| 2023-01-10 | LF_XV7KwO5s | stay here don't move hands on your knees now ... |
| 2023-01-11 | oMZuE6a_5-I | amazing work come on over hands and knees big... |
| 2023-01-13 | Yvphe6Y6ejE | hands down exhale and press up beautiful job ... |
| 2023-01-14 | WbkHfqIuhcE | toss your rope to the side wide stance exhale... |


* I could hard code dropping the record for January 17 ... but future me will be really annoyed by this when I want motivations from, say, May 2024.

* My solution is to define a function to use as a filter.

In [29]:
def has_motivation(motivation):
    return motivation != 'closing phrase not found'

In [30]:
filt = data['motivation'].apply(has_motivation)
data = data[filt]
data

| date | vid | motivation |
|---|---|---|
| 2023-01-02 | qLKflJjWDi4 | big exhale shift your hips up and forward and... |
| 2023-01-03 | NzIylrLhkJ4 | amazing job rest back on your glutes you did ... |
| 2023-01-04 | A5kHPgAi3I0 | break deep breath nice wide stance hinge forw... |
| 2023-01-06 | A67VMX9hlmc | give me two minutes here nice wide stance exh... |
| 2023-01-07 | lpv4RdNefyU | amazing work give me two minutes do not leave... |
| 2023-01-09 | m3mM77Nx5EQ | hands on your knees do not leave [music] don'... |
| 2023-01-10 | LF_XV7KwO5s | stay here don't move hands on your knees now ... |
| 2023-01-11 | oMZuE6a_5-I | amazing work come on over hands and knees big... |
| 2023-01-13 | Yvphe6Y6ejE | hands down exhale and press up beautiful job ... |
| 2023-01-14 | WbkHfqIuhcE | toss your rope to the side wide stance exhale... |
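
The same filter can also be written as a boolean mask, which makes it easy to list what was dropped. A sketch on a made-up three-row frame (the ids and texts are placeholders, and `SENTINEL` is a name introduced here):

```python
import pandas as pd

SENTINEL = 'closing phrase not found'

# Placeholder rows standing in for the real dataframe.
demo = pd.DataFrame({
    'vid': ['video_a', 'video_b', 'video_c'],
    'motivation': ['keep going', SENTINEL, 'stay determined'],
})

missing = demo['motivation'] == SENTINEL  # True where extraction failed
dropped_vids = demo.loc[missing, 'vid'].tolist()
kept = demo[~missing]

print(dropped_vids)
print(len(kept))
```

Keeping the list of dropped ids around makes it simple to revisit those videos later and add their closing phrases.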


[Return to top](#Video-Transcripts)

### Save dataframe to a file 

* This is ready to save! In the next notebook, I will ask ChatGPT to summarize these motivational talks.

In [31]:
data.to_csv('../data/2023-01-effort-motivations.csv')

[Return to top](#Video-Transcripts)