In [53]:
import os
import pandas as pd
import numpy as np

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api._errors import TranscriptsDisabled
from dotenv import load_dotenv

# Get API keys
load_dotenv()
YOUTUBE_APIKEY = os.environ['YOUTUBE_APIKEY']
OPENAI_APIKEY = os.environ['OPENAI_APIKEY']

# Load the data

In [2]:
fp = '../data/nutrition_facts_videos_2597_2024-04-10.csv'
df = pd.read_csv(fp)

In [3]:
df

Unnamed: 0.1,Unnamed: 0,videoId,title,description,publishedAt
0,0,r3PuCQ8CxTc,Benefits and Side Effects of the Pneumonia Vac...,Randomized controlled trials have found that p...,2024-04-10T11:59:52Z
1,1,Oc8T8OGKnZ8,Live Q&A with Dr. Greger,"Join Dr. Greger on Thursday, April 25 at 3:00 ...",2024-04-09T19:50:43Z
2,2,oa6UtySJKFE,Benefits and Side Effects of the Flu Vaccine,Flu shots can prevent more than just the flu. ...,2024-04-08T11:59:55Z
3,3,wZLgy4dvM1Y,New Sponsorship: Big Broccoli,Help keep us ad- and sponsorship-free by makin...,2024-04-07T15:59:56Z
4,4,Et0lozAIbI8,Friday Favorites: Removing Warts with Duct Tape,Duct tape beat out cryotherapy (freezing) and ...,2024-04-05T11:59:52Z
...,...,...,...,...,...
2592,2592,P_X3exQtuGA,The Healthiest Herbal Tea,New subscribers to our e-newsletter always rec...,2011-02-08T14:59:42Z
2593,2593,ce-pvksbiwM,Update on Yerba Maté,New subscribers to our e-newsletter always rec...,2011-02-08T14:57:18Z
2594,2594,1Yb5MjU38ng,Update on MSG,New subscribers to our e-newsletter always rec...,2011-02-08T14:48:31Z
2595,2595,lD2RzNJYGxQ,Update on Juice Plus+®,New subscribers to our e-newsletter always rec...,2011-02-08T05:41:12Z


# Explore Video List

In [7]:
# Let's look at the titles alongside their descriptions
list(zip(df.title.to_list(), df.description.to_list()))

[('Benefits and Side Effects of the Pneumonia Vaccine',
  'Randomized controlled trials have found that pneumonia vaccines significantly reduce the risk of pneumococcal pneumonia in people 65 and older.\n\nIf you missed it, check out the previous video: Benefits and Side Effects of the Flu Vaccine (https://nutritionfacts.org/video/benefits-and-side-effects-of-the-flu-vaccine). Next? Benefits and Side Effects of the Shingles Vaccine (https://nutritionfacts.org/video/benefits-and-side-effects-of-the-shingles-vaccine). \n\nNew subscribers to our e-newsletter always receive a free gift. Get yours here: https://nutritionfacts.org/subscribe/.\n\nHave a question about this video? Leave it in the comment section at http://nutritionfacts.org/video/benefits-and-side-effects-of-the-pneumonia-vaccine and someone on the NutritionFacts.org team will try to answer it.\n\nWant to get a list of links to all the scientific sources used in this video? Click on Sources Cited at https://nutritionfacts.org/

There seem to be many different types of videos: podcasts, Q&As, regular topic videos (which could be about a food or a disease), update videos, live videos, and flashback videos.
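One rough way to check this impression is to tag each title by its prefix. The sketch below is a heuristic under stated assumptions: the prefix list and the `classify_title` helper are hypothetical, chosen from titles visible in this notebook (the "flashback" prefix is a guess), and are not an official taxonomy of the channel.

```python
from collections import Counter

# Hypothetical prefix -> label pairs, based only on titles seen in this notebook
PREFIX_LABELS = [
    ("podcast:", "podcast"),
    ("friday favorites:", "friday_favorites"),
    ("live q&a", "live_qa"),
    ("q&a:", "qa"),
    ("update on", "update"),
    ("flashback", "flashback"),  # assumed prefix, not confirmed by the output above
]


def classify_title(title):
    """Return a coarse video-type label for a title (heuristic)."""
    lowered = title.strip().lower()
    for prefix, label in PREFIX_LABELS:
        if lowered.startswith(prefix):
            return label
    return "regular"


sample_titles = [
    "Live Q&A with Dr. Greger",
    "Podcast: Hot Flashes",
    "Friday Favorites: Removing Warts with Duct Tape",
    "Update on MSG",
    "How to Preserve Your Sense of Smell",
]
type_counts = Counter(classify_title(t) for t in sample_titles)
print(type_counts)
```

On the full table, the same helper could be applied with `df['title'].map(classify_title).value_counts()`.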

In [31]:
df.title.to_list()

['Benefits and Side Effects of the Pneumonia Vaccine',
 'Live Q&A with Dr. Greger',
 'Benefits and Side Effects of the Flu Vaccine',
 'New Sponsorship: Big Broccoli',
 'Friday Favorites: Removing Warts with Duct Tape',
 'Podcast: Hot Flashes',
 'How to Preserve Your Sense of Smell',
 'Ginkgo and Nicotinamide for Glaucoma Treatment',
 'Dietary Sources of the ‘Longevity Vitamin’ Ergothioneine  #antiaging',
 'Friday Favorites: Do Apricot Seeds Work as an Alternative Cancer Cure?',
 'Q&A: Dr. Greger talks Vitamin Needs, Longevity, and Lowering Cholesterol',
 'Podcast: What’s on That Label?',
 'Adrenal Fatigue: What It Is and How to Treat It',
 'Your challenge is to add at least three servings of fruit to your day.#dailydozenchallenge',
 'The Secret to Weight Loss Through Exercise',
 'Friday Favorites: The Effects of Marijuana on Car Accidents',
 'Rosemary oil for hair loss? #nutritionfactsorg #hairloss #aging #antiaging',
 'Podcast: How Not to Age (Part 2)',
 'The Exercise “Myth” for Weigh

**Used Vertex AI to try out Gemini 1.5 Pro (preview 409, 1M-token context window) to find topics and themes**

## Key Topics and Themes in Dr. Greger's YouTube Videos:
Based on the provided titles, here's a breakdown of the key topics and recurring themes:

**Diet and Health:**

- **Plant-based diets:** This is the most dominant theme, exploring the benefits of plant-based diets for preventing and treating various diseases, including heart disease, cancer, diabetes, osteoporosis, autoimmune diseases, and more.
- **Specific foods and nutrients:** Numerous videos delve into the health effects of individual foods and nutrients, like berries, greens, beans, whole grains, nuts, seeds, spices, and supplements like vitamin B12 and vitamin D.
- **Weight management:** Exploring strategies for weight loss and healthy weight maintenance, including dietary changes, exercise, fasting, and debunking weight loss myths.
- **Specific health conditions:** Addressing dietary and lifestyle approaches to managing various health conditions, such as high blood pressure, high cholesterol, menopause, Alzheimer's disease, Parkinson's disease, arthritis, acne, and gut health issues.
- **Food industry and policy:** Discussing the influence of the food industry on dietary guidelines and public health, advocating for policies that promote healthier food choices.

**Lifestyle and Longevity:**

- **Anti-aging strategies:** Investigating evidence-based methods to slow down the aging process and promote longevity, including diet, exercise, sleep, stress management, and exposure to nature.
- **Healthy habits:** Promoting healthy lifestyle habits, such as regular exercise, adequate sleep, stress reduction techniques, and avoiding harmful substances like tobacco and excessive alcohol.
- **Environmental impact of food choices:** Highlighting the environmental consequences of different dietary patterns and advocating for sustainable food choices.

**Video Types:**

- **Live Q&A:** Dr. Greger interacts with viewers in real-time, answering questions about nutrition and health.
- **Podcasts:** In-depth discussions on various health and nutrition topics, often featuring guest experts.
- **Presentations:** Slideshow-style presentations summarizing research findings on specific topics.
- **Cooking demonstrations:** Dr. Greger demonstrates how to prepare healthy plant-based recipes.
- **Friday Favorites:** Short videos discussing interesting research findings or debunking common nutrition myths.
- **Shorts:** Short, engaging videos highlighting key nutrition facts.

**Additional Observations:**

- **Focus on scientific evidence:** Dr. Greger consistently emphasizes the importance of evidence-based information and relies on scientific studies to support his recommendations.
- **Emphasis on whole foods:** The content promotes consuming whole, unprocessed plant foods as the foundation of a healthy diet.
- **Empowerment and self-care:** Dr. Greger encourages viewers to take control of their health through informed choices and healthy habits.

In [14]:
# Let's look at dates
# Parse the timestamps once, then create date and year columns
published = pd.to_datetime(df['publishedAt'])
df['date'] = published.dt.date
df['year'] = published.dt.year
df[['date', 'year']]

Unnamed: 0,date,year
0,2024-04-10,2024
1,2024-04-09,2024
2,2024-04-08,2024
3,2024-04-07,2024
4,2024-04-05,2024
...,...,...
2592,2011-02-08,2011
2593,2011-02-08,2011
2594,2011-02-08,2011
2595,2011-02-08,2011


In [21]:
# Group by year to see how many videos have been published each year
df.groupby('year').size()

year
2011    181
2012    257
2013    157
2014    157
2015    162
2016    156
2017    177
2018    171
2019    159
2020    186
2021    215
2022    249
2023    289
2024     81
dtype: int64

# Get transcripts

In [38]:
def get_transcript(video_id):
    try:
        transcript_list = YouTubeTranscriptApi.get_transcript(video_id)
        # Combine the text of all segments into one large string
        transcript_text = ' '.join([segment['text'].replace('\n',' ') for segment in transcript_list])
        return transcript_text
    except Exception as e:
        print(f"Error for video {video_id}: {e}")
        return None



# video_id = 'your_video_id_here'
# try:
#     transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
#     print(transcript_list)
# except Exception as e:
#     print("An error occurred:", e)
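The broad `except Exception` in `get_transcript` prints each failure and moves on, which makes it hard to see afterwards which videos failed and why. A minimal sketch of one alternative, assuming you want a record of failures: the fetcher is passed in as a parameter (`fetch`), so the logic can be exercised without network access. `join_segments`, `fetch_transcripts`, and `fake_fetch` are hypothetical names and are not part of `youtube_transcript_api`.

```python
def join_segments(segments):
    """Join transcript segments (dicts with a 'text' key) into one string."""
    return ' '.join(seg['text'].replace('\n', ' ') for seg in segments)


def fetch_transcripts(video_ids, fetch):
    """Fetch transcripts for many videos, collecting failures instead of printing.

    `fetch` is any callable that takes a video ID and returns a list of
    segment dicts, such as the get_transcript call used in this notebook.
    """
    transcripts, failures = {}, {}
    for video_id in video_ids:
        try:
            transcripts[video_id] = join_segments(fetch(video_id))
        except Exception as exc:  # still broad: several error types are possible
            transcripts[video_id] = None
            failures[video_id] = type(exc).__name__
    return transcripts, failures


# Stand-in fetcher for illustration only; it does not contact YouTube
def fake_fetch(video_id):
    if video_id == 'no_subs':
        raise RuntimeError('Subtitles are disabled for this video')
    return [{'text': 'first\nline'}, {'text': 'second line'}]


transcripts, failures = fetch_transcripts(['abc', 'no_subs'], fake_fetch)
print(transcripts)
print(failures)
```

Keeping `failures` as a dictionary makes it easy to count error types or retry only the failed IDs later.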

In [39]:
# Run a small test
df_test = df.head(20)
# Apply the function to each video ID in the DataFrame
df_test['transcript'] = df_test['videoId'].apply(get_transcript)

Error for video Oc8T8OGKnZ8: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=Oc8T8OGKnZ8! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test['transcript'] = df_test['videoId'].apply(get_transcript)
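The `SettingWithCopyWarning` above appears because `df_test` was derived from `df` by slicing, so pandas cannot tell whether the new column is meant for the subset alone. A common way to avoid it, sketched here on a small stand-in table (the `demo` data is made up for illustration), is to take an explicit copy before adding the column.

```python
import pandas as pd

# Made-up stand-in for the real video table
demo = pd.DataFrame({'videoId': ['a1', 'b2', 'c3'], 'title': ['One', 'Two', 'Three']})

# .copy() makes the subset an independent DataFrame, so assigning a new
# column to it is unambiguous
demo_test = demo.head(2).copy()
demo_test['transcript'] = demo_test['videoId'].apply(lambda vid: f'transcript for {vid}')

print(demo_test)
print('transcript' in demo.columns)  # the original frame is left unchanged
```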


In [40]:
# View the transcripts
df_test.loc[2, "transcript"]

'In this 3-video series, I show the science behind the pros and cons of the flu vaccine, pneumonia vaccine, and shingles vaccine. "Benefits and Side Effects of the Flu Vaccine" Every year, influenza typically kills between 4,000 and 20,000 Americans, though the death toll for the 2017 to 2018 season was estimated at 80,000, making it one of the deadliest in the last half century. Most hospitalizations and 90 percent of flu-related mortality occur in those 65 and older, and most over the age of 75. Mortality rates for the flu at ages 75 and older are 50 times higher than those below age 65. Nonetheless, the CDC recommends everyone over the age of 6 months get a routine annual flu shot every year, if for no other reason than to help prevent transmission to the more vulnerable. The cruel irony is that older adults, the ones who need protection the most, acquire less robust protection from flu shots due to waning immunity with age. Depending on the season, vaccination typically reduces the

In [41]:
# Save this small sample
# df_test.to_csv('../data/transcripts_20.csv')

In [43]:
# Get the transcripts for all videos
df['transcript'] = df['videoId'].apply(get_transcript)
# Save the transcripts
df.to_csv('../data/transcripts_all_2024-04-10')

Error for video Oc8T8OGKnZ8: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=Oc8T8OGKnZ8! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!
Error for video P1IGY-UYETE: 
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=P1IGY-UYETE! This is most likely caused by:

Subtitles are disabled for this video

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Pl

In [44]:
df

Unnamed: 0.1,Unnamed: 0,videoId,title,description,publishedAt,date,year,transcript
0,0,r3PuCQ8CxTc,Benefits and Side Effects of the Pneumonia Vac...,Randomized controlled trials have found that p...,2024-04-10T11:59:52Z,2024-04-10,2024,"""Benefits and Side Effects of the Pneumonia Va..."
1,1,Oc8T8OGKnZ8,Live Q&A with Dr. Greger,"Join Dr. Greger on Thursday, April 25 at 3:00 ...",2024-04-09T19:50:43Z,2024-04-09,2024,
2,2,oa6UtySJKFE,Benefits and Side Effects of the Flu Vaccine,Flu shots can prevent more than just the flu. ...,2024-04-08T11:59:55Z,2024-04-08,2024,"In this 3-video series, I show the science beh..."
3,3,wZLgy4dvM1Y,New Sponsorship: Big Broccoli,Help keep us ad- and sponsorship-free by makin...,2024-04-07T15:59:56Z,2024-04-07,2024,I’m Dr. Michael Greger and ever since I starte...
4,4,Et0lozAIbI8,Friday Favorites: Removing Warts with Duct Tape,Duct tape beat out cryotherapy (freezing) and ...,2024-04-05T11:59:52Z,2024-04-05,2024,You can find home remedies for all sorts of ai...
...,...,...,...,...,...,...,...,...
2592,2592,P_X3exQtuGA,The Healthiest Herbal Tea,New subscribers to our e-newsletter always rec...,2011-02-08T14:59:42Z,2011-02-08,2011,"""The Healthiest Herbal Tea"" Walking through th..."
2593,2593,ce-pvksbiwM,Update on Yerba Maté,New subscribers to our e-newsletter always rec...,2011-02-08T14:57:18Z,2011-02-08,2011,"""Update on Yerba Maté"" And finally, what about..."
2594,2594,1Yb5MjU38ng,Update on MSG,New subscribers to our e-newsletter always rec...,2011-02-08T14:48:31Z,2011-02-08,2011,"""Update on MSG"" What about MSG? The scientific..."
2595,2595,lD2RzNJYGxQ,Update on Juice Plus+®,New subscribers to our e-newsletter always rec...,2011-02-08T05:41:12Z,2011-02-08,2011,"""Update on Juice Plus+"" What about Juice Plus+..."


In [56]:
(df['transcript'].isna()).sum()

28
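To see which videos are behind that count, the rows with missing transcripts can be selected directly. A small sketch on made-up data (the `demo` frame is illustrative; on the real data the same mask would be built from `df['transcript']`):

```python
import pandas as pd

# Made-up rows standing in for the real table
demo = pd.DataFrame({
    'videoId': ['a1', 'b2', 'c3'],
    'title': ['Live Q&A', 'Regular video', 'Podcast'],
    'transcript': [None, 'some text', None],
})

# Boolean mask marking rows that have no transcript
missing_mask = demo['transcript'].isna()
missing = demo.loc[missing_mask, ['videoId', 'title']]

print(missing_mask.sum())
print(missing)
```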

In [47]:
df_check = pd.read_csv('../data/transcripts_all_2024-04-10')

In [48]:
df_check

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,videoId,title,description,publishedAt,date,year,transcript
0,0,0,r3PuCQ8CxTc,Benefits and Side Effects of the Pneumonia Vac...,Randomized controlled trials have found that p...,2024-04-10T11:59:52Z,2024-04-10,2024,"""Benefits and Side Effects of the Pneumonia Va..."
1,1,1,Oc8T8OGKnZ8,Live Q&A with Dr. Greger,"Join Dr. Greger on Thursday, April 25 at 3:00 ...",2024-04-09T19:50:43Z,2024-04-09,2024,
2,2,2,oa6UtySJKFE,Benefits and Side Effects of the Flu Vaccine,Flu shots can prevent more than just the flu. ...,2024-04-08T11:59:55Z,2024-04-08,2024,"In this 3-video series, I show the science beh..."
3,3,3,wZLgy4dvM1Y,New Sponsorship: Big Broccoli,Help keep us ad- and sponsorship-free by makin...,2024-04-07T15:59:56Z,2024-04-07,2024,I’m Dr. Michael Greger and ever since I starte...
4,4,4,Et0lozAIbI8,Friday Favorites: Removing Warts with Duct Tape,Duct tape beat out cryotherapy (freezing) and ...,2024-04-05T11:59:52Z,2024-04-05,2024,You can find home remedies for all sorts of ai...
...,...,...,...,...,...,...,...,...,...
2592,2592,2592,P_X3exQtuGA,The Healthiest Herbal Tea,New subscribers to our e-newsletter always rec...,2011-02-08T14:59:42Z,2011-02-08,2011,"""The Healthiest Herbal Tea"" Walking through th..."
2593,2593,2593,ce-pvksbiwM,Update on Yerba Maté,New subscribers to our e-newsletter always rec...,2011-02-08T14:57:18Z,2011-02-08,2011,"""Update on Yerba Maté"" And finally, what about..."
2594,2594,2594,1Yb5MjU38ng,Update on MSG,New subscribers to our e-newsletter always rec...,2011-02-08T14:48:31Z,2011-02-08,2011,"""Update on MSG"" What about MSG? The scientific..."
2595,2595,2595,lD2RzNJYGxQ,Update on Juice Plus+®,New subscribers to our e-newsletter always rec...,2011-02-08T05:41:12Z,2011-02-08,2011,"""Update on Juice Plus+"" What about Juice Plus+..."


In [55]:
(df_check['transcript'].isna()).sum()

28
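The `Unnamed: 0`, `Unnamed: 0.1`, and `Unnamed: 0.2` columns in the reloaded table accumulate because each `to_csv` call writes the row index as an unnamed column and each `read_csv` reads it back as data. A sketch of the round trip with `index=False`, using an in-memory buffer and made-up rows rather than the real file:

```python
import io
import pandas as pd

# Made-up rows standing in for the real table
demo = pd.DataFrame({'videoId': ['a1', 'b2'], 'transcript': ['some text', None]})

# index=False keeps the row index out of the CSV
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
print(reloaded['transcript'].isna().sum())
```

Missing transcripts are written as empty fields and come back as `NaN`, so the null count survives the round trip.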

In [58]:
# Drop the null values and resave
df = df[df['transcript'].notna()]
# df.to_csv('../data/transcripts_all_2024-04-10_cleaned')