# Introduction
YouTube is an influential platform for health-related content, but given the sheer volume of videos, misinformation spreads easily. This project uses Generative AI (GenAI) to extract health claims from YouTube video transcripts and validate them by cross-referencing scientific studies from PubMed.

In this notebook, we'll demonstrate how AI:
1. Identifies and extracts health-related claims from video transcripts.
2. Finds relevant keywords to search for articles for fact-checking.
3. Assesses the credibility of the claims based on those articles.

The goal is to help users find accurate and reliable health information on YouTube.

# Imports and environment setup
This section imports libraries for web scraping, data handling, and interacting with APIs, including YouTube and Google’s generative AI. It also sets up Google Cloud credentials and securely fetches API keys using kaggle_secrets.
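
The secret lookup only works inside Kaggle. As a minimal sketch for running the notebook elsewhere, the helper below falls back to an environment variable when `kaggle_secrets` cannot be imported. The name `get_secret` and the convention of upper-casing the name and replacing hyphens with underscores are assumptions made for illustration; they are not part of the notebook.

```python
import os


def get_secret(name, default=None):
    """Return a secret from Kaggle when available, else from the environment.

    `get_secret` is a hypothetical helper; the notebook itself calls
    UserSecretsClient directly.
    """
    try:
        # kaggle_secrets is only installed inside Kaggle notebooks
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Assumed convention: "gemini-api-key" -> "GEMINI_API_KEY"
        env_name = name.upper().replace("-", "_")
        return os.environ.get(env_name, default)


# Usage outside Kaggle: the value is read from the environment
os.environ["DEMO_API_KEY"] = "abc123"
demo_key = get_secret("demo-api-key")
```

Catching a broad `Exception` keeps the sketch short; a stricter version would catch `ImportError` and the specific Kaggle error separately.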

In [112]:
# Importing necessary libraries for various tasks such as data manipulation, web scraping, NLP, and API interactions
import pandas as pd  # For data manipulation and analysis using DataFrames
import time  # To handle delays and timing-related functions
import os  # For interacting with the operating system (e.g., file handling)
import requests  # For making HTTP requests to fetch data from external APIs
import xml.etree.ElementTree as ET  # For parsing XML data, useful for handling API responses in XML format
from bs4 import BeautifulSoup  # For parsing HTML content and extracting data, typically used for web scraping
import re  # For using regular expressions to match patterns in text
import nltk  # Natural Language Toolkit for NLP tasks
from IPython.display import Markdown, display, YouTubeVideo  # For displaying Markdown and embedding YouTube videos in Jupyter notebooks
import google.generativeai as genai  # For interacting with Google's Generative AI API (e.g., for text generation or completion tasks)
from google.generativeai import caching  # For caching results from Google's Generative AI to speed up repeated requests
from google.auth.credentials import AnonymousCredentials  # For using anonymous credentials with Google APIs
from google.auth import compute_engine  # For using compute engine credentials (typically in cloud environments)
from kaggle_secrets import UserSecretsClient  # To securely access secrets from Kaggle (e.g., API keys or credentials)
from nltk.corpus import wordnet  # For accessing WordNet, a lexical database useful for synonym and antonym lookups

# Install required dependencies (for video search and transcript fetching)
! pip install -q youtube-search-python  # Installing the YouTube Search Python library to search YouTube videos programmatically
! pip install youtube-transcript-api  # Installing the YouTube Transcript API library to fetch video transcripts

# Importing libraries for searching YouTube videos and fetching transcripts
from youtubesearchpython import VideosSearch  # To search for YouTube videos based on search queries
from youtube_transcript_api import YouTubeTranscriptApi  # For fetching transcripts of YouTube videos
from youtube_transcript_api.formatters import JSONFormatter  # For formatting the fetched YouTube transcript data in JSON format




In [113]:
def mock_get_universe_domain(request):
    return "googleapis.com"

# Override the original metadata fetching function
compute_engine._metadata.get_universe_domain = mock_get_universe_domain

# Set the Google application credentials manually (replace with the actual path)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/kaggle/input/google-cloud-key/esoteric-cab-443306-n6-8b6ccf376c34.json"

In [114]:
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("GEMINI-API-KEY")
secret_value_1 = user_secrets.get_secret("ncbi_api_key")

In [115]:
genai.configure(api_key=secret_value_0)

In [116]:
save_dataset =  True

# Fetching PubMed Data for Fact-Checking
This section defines functions to retrieve research articles from PubMed to help validate health claims.
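
To make the parsing step concrete without calling the network, here is a small sketch that runs the same ElementTree lookups on a hand-written ESearch-style response. The sample XML, its PMIDs, and the `WebEnv` value are made up for illustration, and `parse_esearch` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an ESearch reply (values are invented)
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>MCID_example</WebEnv>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Extract PMIDs, WebEnv, and QueryKey from an ESearch XML string."""
    root = ET.fromstring(xml_text)
    pmids = [node.text for node in root.findall("IdList/Id")]
    # findtext returns None instead of raising when an element is absent
    webenv = root.findtext("WebEnv")
    query_key = root.findtext("QueryKey")
    return pmids, webenv, query_key


pmids, webenv, query_key = parse_esearch(SAMPLE_ESEARCH_XML)
```

Using `findtext` rather than `find(...).text` means a reply without `WebEnv` yields `None` instead of an `AttributeError`.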

In [117]:
# Function to fetch PMIDs based on a search query
def fetch_pmid_list(query, max_results=100):
    """
    Fetch PMIDs from PubMed based on a search query.
    """
    base_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key={secret_value_1}"
    params = {
        "db": "pubmed",  # Search in PubMed database
        "term": query,   # Search term (query)
        "retmax": max_results,  # Max number of results to fetch
        "usehistory": "y"  # Use history for subsequent queries
    }
    
    # Make a request to the PubMed API
    response = requests.get(base_url, params=params)
    time.sleep(1)  # Adding delay to prevent hitting API rate limit
    
    if response.status_code == 200:
        # Parse the XML response to extract PMIDs, WebEnv, and QueryKey
        root = ET.fromstring(response.content)
        webenv = root.findtext("WebEnv")  # None if the element is missing
        query_key = root.findtext("QueryKey")
        pmids = [id_node.text for id_node in root.findall("IdList/Id")]  # Extract PMIDs from the response
        return pmids, webenv, query_key
    else:
        print("Error fetching PMIDs.")
        return [], None, None

# Function to fetch article details using PMIDs
def fetch_article_details(pmids, webenv, query_key, retstart=0, retmax=100):
    """
    Fetch article details (title, authors, abstract) for given PMIDs.
    """
    base_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?api_key={secret_value_1}"
    ids = ",".join(pmids)  # Join PMIDs for batch request
    params = {
        "db": "pubmed",  # PubMed database
        "id": ids,  # List of PMIDs
        "retstart": retstart,  # Start index for fetching results
        "retmax": retmax,  # Max number of results per fetch
        "WebEnv": webenv,  # Web environment for query continuation
        "query_key": query_key,  # Query key for reference
        "rettype": "xml",  # Return results in XML format
        "retmode": "xml"  # XML response mode
    }
    
    # Make a request to fetch article details
    response = requests.get(base_url, params=params)
    time.sleep(1)  # Adding delay to prevent hitting API rate limit
    
    if response.status_code == 200:
        # Parse the XML response to extract article details
        root = ET.fromstring(response.content)
        articles = []
        
        for docsum in root.findall("PubmedArticle"):
            article = {}
            medline_citation = docsum.find("MedlineCitation")
            if medline_citation is not None:
                article["pmid"] = medline_citation.findtext("PMID")
                article["title"] = medline_citation.findtext("Article/ArticleTitle")  # None if the element is missing
                article["source"] = medline_citation.findtext("Article/Journal/Title")
                article["authors"] = [
                    f"{author.find('ForeName').text} {author.find('LastName').text}"
                    for author in medline_citation.findall("Article/AuthorList/Author")
                    if author.find("LastName") is not None and author.find("ForeName") is not None
                ]
                article["abstract"] = medline_citation.find("Article/Abstract/AbstractText")
                if article["abstract"] is not None:
                    article["abstract"] = article["abstract"].text
                articles.append(article)
        
        return articles
    else:
        print("Error fetching article details.")
        return []

# Function to fetch the PubMed page of an article based on its PMID
def fetch_content(pmid):
    """
    Fetch the HTML of an article's PubMed page by its PMID.
    """
    # The public PubMed site takes no API key; the key is only used with the E-utilities endpoints
    url = f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }
    
    response = requests.get(url, headers=headers)
    time.sleep(1)  # Adding delay to prevent hitting API rate limit
    
    if response.status_code == 200:
        return response.text  # Return the full HTML content of the article
    else:
        print(f"Error fetching content for PMID {pmid}")
        return None

# Function to extract specific sections (Introduction, Clinical case, Methods, Results, Conclusion) from PubMed articles
def extract_article_sections(query, max_results=100):
    """
    Extract sections (Introduction, Methods, etc.) from PubMed articles.
    """
    # Fetch the PMIDs and necessary query details
    pmids, webenv, query_key = fetch_pmid_list(query, max_results=max_results)
    
    # Fetch detailed article information based on PMIDs (skip the request if the search returned nothing)
    articles = fetch_article_details(pmids, webenv, query_key) if pmids else []
    
    # Loop through each article, fetch its content, and extract sections
    for article in articles:
        pmid = article.get('pmid')
        if pmid:
            # Fetch the full HTML content of the article
            html_content = fetch_content(pmid)
            if html_content:
                # Parse the HTML content using BeautifulSoup
                soup = BeautifulSoup(html_content, 'html.parser')
                paragraphs = soup.find_all('p')  # Find all paragraphs in the article
                
                # Loop through the paragraphs and look for specific sections
                if paragraphs:
                    for para in paragraphs:
                        para_content = para.text.strip()
                        if para_content.startswith('Introduction'):
                            article.update({'Introduction': para_content})
                        elif para_content.startswith('Clinical case'):
                            article.update({'Clinical case': para_content})
                        elif para_content.startswith('Methods'):
                            article.update({'Methods': para_content})
                        elif para_content.startswith('Results'):
                            article.update({'Results': para_content})
                        elif para_content.startswith('Conclusion'):
                            article.update({'Conclusion': para_content})
    
    # Convert the list of articles into a DataFrame for easy manipulation
    df = pd.DataFrame(articles)
    return df


# Extracting Claims and Keywords from YouTube Transcripts
This section processes YouTube video transcripts to extract health-related claims and identify key phrases for fact-checking.
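
Both extraction steps depend on the model replying in a numbered, bold-titled Markdown layout. The sketch below applies the same two regular expressions used in this section to hand-written sample replies, so the expected shape is visible. The sample text is invented, and real model output may not follow this layout, in which case the patterns return empty lists.

```python
import re

# Same patterns as in find_top_claims_and_keywords
CLAIM_PATTERN = r"\d+\.\s\*\*(.*?)\*\*\s*(.*?)(?=\n\d+\.|\Z)"
KEYWORD_PATTERN = r"\*\*(.*?)(?::)?\*\*"

# Invented replies in the layout the patterns expect
sample_claims_reply = (
    "1. **Sleep and memory:** Sleeping enough improves memory consolidation.\n"
    "2. **Hydration:** Drinking water supports kidney function."
)
sample_keywords_reply = "* **Sleep duration:** core exposure\n* **Memory consolidation**"

# Each match is a (bold title, trailing text) pair; join them into one claim string
claims = [
    f"{title} {body}".strip()
    for title, body in re.findall(CLAIM_PATTERN, sample_claims_reply, re.DOTALL)
]

# Keywords are the bold spans, with an optional trailing colon dropped
keywords = re.findall(KEYWORD_PATTERN, sample_keywords_reply)
```

Because the parsing is this format-sensitive, an empty `claims` list is worth treating as "unexpected reply format" rather than "no claims in the video".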

In [118]:
# Function to get the transcript of a YouTube video
def get_transcript(video_id):
    try:
        # Fetch the transcript
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return " ".join([entry['text'] for entry in transcript])
    except Exception as e:
        # Error handling
        print(f"Error fetching transcript for video {video_id}: {e}")
        return None

def find_top_claims_and_keywords(video):
    model = genai.GenerativeModel("gemini-1.5-flash")
    
    # Extract video metadata
    title = video['title']
    description_snippet = video.get('descriptionSnippet', [])
    description = "".join([desc['text'] for desc in description_snippet]) if isinstance(description_snippet, list) else ""
    video_link = video['link']
    video_id = video['id']
    print(f"Processing video: {title} (ID: {video_id})")

    # Fetch transcript
    transcript = get_transcript(video_id)
    if not transcript:
        print(f"No transcript found for video: {title}")
        return None
        
    if save_dataset:
        with open("transcript.txt", "w") as file:
            file.write(transcript)
        
    claims_list = []

    # Process transcript 
    claims_prompt = f"Extract up to 3 unique, health-related, evidence-based claims from the following transcript chunk:\n\n{transcript}"
    claims_response = model.generate_content([claims_prompt])
    claims_text = claims_response.text.strip()

    if claims_text:
        claims_list.append(claims_text)
    else:
        print("No claims found in chunk")

    # Combine claims from all chunks
    all_claims = "\n".join(claims_list)

    # Extract individual claims using regex
    pattern = r"\d+\.\s\*\*(.*?)\*\*\s*(.*?)(?=\n\d+\.|\Z)"
    claims = re.findall(pattern, all_claims, re.DOTALL)
    claims_list = [f"{claim[0]} {claim[1]}" for claim in claims]

    # Generate keywords for each claim
    keyword_dict = {}
    for claim in claims_list:
        time.sleep(1)
        keywords_prompt = (
            f"Identify 1 to 4 specific keywords or phrases related to the following claim that would help validate its accuracy with academic evidence:\n\n"
            f"Claim: {claim}\n\n"
            f"Choose keywords or phrases that are directly related to the central topic of the claim, and are likely to be found in reliable academic sources."
        )
        keywords_response = model.generate_content([keywords_prompt])
        keywords_text = keywords_response.text.strip()
        keywords = re.findall(r'\*\*(.*?)(?::)?\*\*', keywords_text)
        keyword_dict[claim] = keywords
    
    # Display results
    if len(claims_list) > 0:
        display(Markdown(
            f"### {title}\n"
            f"[Watch Video]({video_link})\n\n"
            f"**Top Claims:**\n\n" +
            "\n\n".join([f"{idx + 1}. {claim}" for idx, claim in enumerate(claims_list)]) 
        ))
    else:
        print("No claims found.")
   
    return {
        "title": title,
        "link": video_link,
        "claims": claims_list,
        "keywords": keyword_dict,
        'claim_list': claims_list
    }


# Fact-Checking Claims with Gemini AI
This section defines a function to use the Gemini AI model to perform fact-checking on health-related claims using relevant articles. 
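
The function below asks the model for a verdict, a confidence score, and an explanation, then recovers them with regular expressions. As a standalone sketch of that parsing step, the helper here runs on a hand-written reply. `parse_fact_check` is a hypothetical name, the sample reply is invented, and the confidence pattern is slightly stricter than the notebook's so that a trailing period cannot break `float()`.

```python
import re

VERDICTS = ("True", "False", "Not able to validate", "Conflicting results reported")


def parse_fact_check(response_text):
    """Pull verdict, confidence, and explanation out of a Markdown-formatted reply."""
    verdict = re.search(
        r"\*\*Fact-check result:\*\*\s*(" + "|".join(VERDICTS) + ")", response_text
    )
    confidence = re.search(r"\*\*Confidence score:\*\*\s*(\d+(?:\.\d+)?)", response_text)
    explanation = re.search(r"\*\*Explanation:\*\*\s*([\s\S]+)", response_text)
    return {
        "result": verdict.group(1) if verdict else None,
        "confidence_score": float(confidence.group(1)) if confidence else None,
        "explanation": explanation.group(1).strip() if explanation else None,
    }


# Invented reply in the layout the patterns expect
sample_reply = (
    "1. **Fact-check result:** True\n"
    "2. **Confidence score:** 0.8\n"
    "3. **Explanation:** Several abstracts report the same association."
)
parsed = parse_fact_check(sample_reply)
```

Returning `None` for any field that is missing lets the caller tell a malformed reply apart from a genuine low-confidence verdict.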

In [119]:
# Function to perform fact-checking using Gemini AI
def fact_check_claims_with_confidence(articles_df, claim):
    # Initialize a Gemini model (make sure you have the appropriate API and model)
    model = genai.GenerativeModel("gemini-1.5-flash")

    # Initialize a dictionary to store the fact-check results
    fact_check_results = {}

    # Extract relevant sections from the articles (e.g., abstract, results, and conclusion)
    articles_text = ""
    for _, row in articles_df.iterrows():
        article_content = (
            f"PMID: {row.get('pmid', 'N/A')}\n"
            f"Title: {row.get('title', 'N/A')}\n"
            f"Abstract: {row.get('abstract', 'N/A')}\n"
            f"Methods: {row.get('Methods', 'N/A')}\n"
            f"Results: {row.get('Results', 'N/A')}\n"
            f"Conclusion: {row.get('Conclusion', 'N/A')}\n"
        )
        articles_text += article_content + "\n"

    # Prepare the prompt for Gemini AI to fact-check the claim based on articles
    time.sleep(1)
    
    prompt = f"""
    Fact-check the following claim based on web data and provided articles. 
    Provide the fact-check result (True/False/Not able to validate/Conflicting results reported) 
    and a confidence score between 0 and 1:

    Claim: '{claim}'

    Articles:
    {articles_text}

    Please respond with:
    1. The fact-check result: True/False/Not able to validate/Conflicting results reported
    2. The confidence score: A numerical value between 0 and 1
    3. Explanation: Clearly explain the reasoning behind your fact-check result and the confidence score, referencing specific evidence from the articles where applicable.
    """
    print(model.count_tokens(prompt))
        
    # Generate response from Gemini AI
    response = model.generate_content([prompt])

    # Extract the fact-checking result and confidence score from the response text
    result_text = response.text.strip()
   
    # Extract Fact-check result
    fact_check_match = re.search(r"\*\*Fact-check result:\*\* (True|False|Not able to validate|Conflicting results reported)", result_text)
    fact_check_result = fact_check_match.group(1) if fact_check_match else "Fact-check result not found"
    
    # Extract Confidence score
    confidence_match = re.search(r"\*\*Confidence score:\*\* (\d+(?:\.\d+)?)", result_text)  # digits only, so a trailing period cannot break float()
    confidence_score = float(confidence_match.group(1)) if confidence_match else None  # None if not found
    
    # Extract Explanation
    explanation_match = re.search(r"\*\*Explanation:\*\*([\s\S]+)", result_text)
    explanation = explanation_match.group(1).strip() if explanation_match else "Explanation not found"
    
    # Print extracted values
    print("Fact-check result:", fact_check_result)
    print("Confidence score:", confidence_score)
    print("Explanation:", explanation)


    # Store the values extracted above; flag the claim if the response did not match the expected format
    if fact_check_match and confidence_score is not None:
        fact_check_results[claim] = {
            'result': fact_check_result,
            'confidence_score': confidence_score,
            'explanation': explanation,
        }
    else:
        fact_check_results[claim] = {'result': 'Error', 'confidence_score': 0.0, 'error': 'Unexpected response format'}

    return fact_check_results

# Main: Searching, Extracting Claims, and Fact-Checking
This section performs the primary workflow of searching for health-related videos, extracting claims from video transcripts, identifying relevant keywords, and fact-checking these claims using academic sources.
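
The control flow of the main loop can be sketched independently of YouTube, PubMed, and Gemini by passing the search and judging steps in as plain functions. Everything here is illustrative: `run_fact_check`, `fake_search`, and `fake_judge` are hypothetical names, and de-duplicating by PMID is a suggested refinement that the notebook's loop does not perform.

```python
def run_fact_check(claims_to_keywords, search_articles, judge):
    """For each claim, gather articles across its keywords, then judge once.

    claims_to_keywords: dict mapping a claim to a list of search keywords
    search_articles:    callable(keyword) -> list of dicts with a "pmid" key
    judge:              callable(claim, articles) -> verdict
    """
    verdicts = {}
    for claim, keywords in claims_to_keywords.items():
        seen_pmids, articles = set(), []
        for keyword in keywords:
            for article in search_articles(keyword):
                # The same paper can match several keywords; keep it once
                if article["pmid"] not in seen_pmids:
                    seen_pmids.add(article["pmid"])
                    articles.append(article)
        # No evidence found: record None instead of asking for a verdict
        verdicts[claim] = judge(claim, articles) if articles else None
    return verdicts


# Stand-ins for the PubMed search and the model call
def fake_search(keyword):
    index = {"k1": [{"pmid": "1"}, {"pmid": "2"}], "k2": [{"pmid": "2"}]}
    return index.get(keyword, [])


def fake_judge(claim, articles):
    return len(articles)


verdicts = run_fact_check({"claim A": ["k1", "k2"], "claim B": ["k3"]}, fake_search, fake_judge)
```

Injecting the two steps as callables makes the loop testable offline and keeps the number of articles sent to the model, and therefore the token count, smaller when keywords overlap.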

In [120]:
search = VideosSearch('health', limit=1)  # Adjust the limit as needed
results = search.result()


# For each video: extract claims and keywords, fetch PubMed articles per keyword, then fact-check each claim
for video in results['result']:
    claim_results = find_top_claims_and_keywords(video)
    if claim_results:
        for ii, claim in enumerate(claim_results['claim_list']):
            print('--------------------------------------------------------------------------------------------')
            print(f"\n\033[1mFact check results for claim: {claim.split(':')[0]}\033[0m.")  # Title before the first colon; whole claim if there is none
            articles_df = pd.DataFrame()
            if claim_results['keywords'][claim]:
                keywords = claim_results['keywords'][claim]
                for item in keywords:
                    if len(item)>0:
                        new_articles_df = extract_article_sections(query=item, max_results=200)
                        articles_df = pd.concat([articles_df, new_articles_df])
                        if len(new_articles_df)>0:
                            print(f"Fetched {len(new_articles_df)} articles with keywords: {item}.")
                if len(articles_df)>0:
                    fact_check_results = fact_check_claims_with_confidence(articles_df, claim)
                    if save_dataset:
                        file_name = f'articles_{ii}.csv'
                        articles_df.to_csv(file_name)
            
                else:
                    print(f"No articles found for the claim with keywords: {claim_results['keywords'][claim]}.\n")
                    # No fallback search is defined in this notebook, so the claim is left unchecked

Processing video: America’s Health Crisis EXPOSED - Why Toxic Food Industry FEARS RFK Jr. (ID: vfI5xQo7XiY)


### America’s Health Crisis EXPOSED - Why Toxic Food Industry FEARS RFK Jr.
[Watch Video](https://www.youtube.com/watch?v=vfI5xQo7XiY)

**Top Claims:**

1. Higher healthcare spending, lower life expectancy: The US spends a significantly higher percentage of its GDP on healthcare than other comparable countries (e.g., Germany, France, Korea), yet has a lower life expectancy.  This is supported by the presenter's charts comparing GDP expenditure on healthcare to life expectancy across multiple nations.


2. Higher rates of avoidable deaths: The US has a considerably higher rate of avoidable deaths (due to factors like smoking, drinking, and poor diet) per 100,000 population compared to other developed nations. This is shown through presented charts comparing avoidable death rates.


3. Higher rates of infant and maternal mortality: The US possesses substantially higher rates of infant and maternal mortality than the OECD average, a discrepancy highlighted with supporting data illustrating the significant difference in rates.

--------------------------------------------------------------------------------------------

Fact check results for claim: Higher healthcare spending, lower life expectancy.
Fetched 100 articles with keywrods: Healthcare expenditure and life expectancy.
Fetched 100 articles with keywrods: Cross-national healthcare comparisons.
Fetched 98 articles with keywrods: US healthcare system performance.
Fetched 100 articles with keywrods: GDP per capita and health outcomes.
total_tokens: 153916

Fact-check result: True
Confidence score: 0.8
Explanation: The claim states that the US spends a significantly higher percentage of its GDP on healthcare than comparable countries but has lower life expectancy.  Several of the provided articles support this assertion, albeit indirectly.  None directly present the presenter's charts, so the visual evidence remains unverified. However, the textual data strongly suggests the claim's accuracy.

**Supporting Evidence:**

* **Multiple articles (PMID: