<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 2 - Phase 2 - Elaine

## Prerequisites

### Required Python packages

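- beautifulsoup4
- lxml
- numpy
- pandas
- requests
- openpyxl (or another Excel writer engine supported by `pandas`, needed for the `.xlsx` export)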

### Importing the required libraries

#### For [TED Talks](https://www.ted.com/talks) data extraction

In [1]:
import pandas as pd
import os
import sys  # Needed for sys.exit() in the directory-creation cells
import requests
import logging
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

#### For scraping the `TED Talks` URLs

In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import os
import json
import html  # Import the HTML module for decoding HTML entities
import numpy as np # Import NumPy for handling missing values if necessary

### Defining input variables

In [3]:
html_talks_dir = 'html_talks'
txt_dir = 'txt'
input_file = 'df_tedtalks_urls3.jsonl'
output_file1 = 'df_tedtalks_urls3_enriched'
output_file2 = 'spreadsheet_elaine'
log_filename = 'cl_st2_elaine.log'

### Setting up logging

In [4]:
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename = log_filename
)

### Creating output directories

#### For [TED Talks](https://www.ted.com/talks) data extraction

In [5]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(html_talks_dir):
    logging.info(f"Output directory {html_talks_dir} already exists.")
else:
    try:
        os.makedirs(html_talks_dir)
        logging.info(f"Output directory {html_talks_dir} successfully created.")
    except OSError as e:
        logging.error(f"Failed to create the {html_talks_dir} directory: {e}")
        sys.exit(1)

#### For scraping the `TED Talks` URLs

In [6]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(txt_dir):
    logging.info(f"Output directory {txt_dir} already exists.")
else:
    try:
        os.makedirs(txt_dir)
        logging.info(f"Output directory {txt_dir} successfully created.")
    except OSError as e:
        logging.error(f"Failed to create the {txt_dir} directory: {e}")
        sys.exit(1)

## [TED Talks](https://www.ted.com/talks) data extraction - Part 2

The study covers talks from 2020 to 2025 (raw data extracted on 14/03/2025 at 11:38 am, Brasília time).

### Importing the data into a DataFrame

In [7]:
df_tedtalks_urls3 = pd.read_json(input_file, lines=True)

In [8]:
# Ensure the 'File ID' column is treated as strings
df_tedtalks_urls3['File ID'] = df_tedtalks_urls3['File ID'].astype('str')

In [9]:
# Pad the values in the 'File ID' column with leading zeros to make them 6 digits
df_tedtalks_urls3['File ID'] = df_tedtalks_urls3['File ID'].str.zfill(6)
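
As a small illustration of the two steps above, the following sketch shows what zero-padding does to a few made-up identifiers. It uses plain Python strings rather than a DataFrame; the sample values are hypothetical and only stand in for the `File ID` column.

```python
# Hypothetical sample identifiers, as they might look after reading the JSONL file
sample_ids = [1, 42, 4081]

# str.zfill pads on the left with zeros up to the requested width
padded = [str(i).zfill(6) for i in sample_ids]
print(padded)  # ['000001', '000042', '004081']
```

The fixed width keeps the identifiers sortable as text and makes the later file names (`<File ID>.html`, `<File ID>.txt`) uniform.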

In [10]:
df_tedtalks_urls3.dtypes

File ID          object
TED Talks URL    object
dtype: object

In [11]:
df_tedtalks_urls3

| | File ID | TED Talks URL |
|---|---|---|
| 0 | 000001 | https://www.ted.com/talks/nazzy_pakpour_this_i... |
| 1 | 000002 | https://www.ted.com/talks/ryan_gilliam_a_concr... |
| 2 | 000003 | https://www.ted.com/talks/sharon_zicherman_wha... |
| 3 | 000004 | https://www.ted.com/talks/rachel_yang_how_gian... |
| 4 | 000005 | https://www.ted.com/talks/leo_villareal_how_li... |
| ... | ... | ... |
| 4077 | 004078 | https://www.ted.com/talks/sir_ken_robinson_do_... |
| 4078 | 004079 | https://www.ted.com/talks/majora_carter_greeni... |
| 4079 | 004080 | https://www.ted.com/talks/david_pogue_simplici... |
| 4080 | 004081 | https://www.ted.com/talks/al_gore_averting_the... |


### Getting the `TED Talks` URLs

In [None]:
# Retry mechanism setup
retry_strategy = Retry(
    total=3,  # Retry up to 3 times
    backoff_factor=2,  # Exponential backoff: wait 2s, 4s, 8s...
    status_forcelist=[429, 500, 502, 503, 504],  # Retry on these HTTP error codes
)
adapter = HTTPAdapter(max_retries=retry_strategy)
http = requests.Session()
http.mount("https://", adapter)

# Loop through the DataFrame rows
for _, row in df_tedtalks_urls3.iterrows():
    file_id = row['File ID']  # Get the File ID
    url = row['TED Talks URL']  # Get the URL

    try:
        # Log the start of the request
        logging.info(f"Fetching HTML for File ID: {file_id} from URL: {url}")

        # Fetch the HTML content from the URL
        response = http.get(url, timeout=30)  # Without a timeout, a stalled connection could block the loop indefinitely
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Save the HTML content to a file in the 'html_talks' directory
        file_path = os.path.join(html_talks_dir, f"{file_id}.html")
        with open(file_path, 'w', encoding='utf-8') as html_file:
            html_file.write(response.text)

        # Log success
        logging.info(f"Successfully saved HTML for File ID: {file_id} to {file_path}")

    except requests.exceptions.RequestException as e:
        # Log any failures or retries
        logging.error(f"Failed to fetch HTML for File ID: {file_id} from URL: {url}: {e}")

#### Adapting the programme for command line

The programme was named `geturls.py`.

In [None]:
import pandas as pd
import os
import sys  # Needed for sys.exit() in main()
import requests
import logging
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def main():
    # Defining input variables
    html_talks_dir = 'html_talks'
    input_file = 'df_tedtalks_urls3.jsonl'
    log_filename = 'cl_st2_elaine.log'
    
    # Setting up logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s',
        filename = log_filename
    )

    # Creating output directories
    if os.path.exists(html_talks_dir):
        logging.info(f"Output directory {html_talks_dir} already exists.")
    else:
        try:
            os.makedirs(html_talks_dir)
            logging.info(f"Output directory {html_talks_dir} successfully created.")
        except OSError as e:
            logging.error(f"Failed to create the {html_talks_dir} directory: {e}")
            sys.exit(1)
        
    # Importing the data into a DataFrame
    df_tedtalks_urls3 = pd.read_json(input_file, lines=True)
    
    # Ensure the 'File ID' column is treated as strings
    df_tedtalks_urls3['File ID'] = df_tedtalks_urls3['File ID'].astype('str')
    
    # Pad the values in the 'File ID' column with leading zeros to make them 6 digits
    df_tedtalks_urls3['File ID'] = df_tedtalks_urls3['File ID'].str.zfill(6)
    
    # Getting the `TED Talks` URLs
    # Retry mechanism setup
    retry_strategy = Retry(
        total=3,  # Retry up to 3 times
        backoff_factor=2,  # Exponential backoff: wait 2s, 4s, 8s...
        status_forcelist=[429, 500, 502, 503, 504],  # Retry on these HTTP error codes
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    http = requests.Session()
    http.mount("https://", adapter)
    
    # Loop through the DataFrame rows
    for _, row in df_tedtalks_urls3.iterrows():
        file_id = row['File ID']  # Get the File ID
        url = row['TED Talks URL']  # Get the URL
    
        try:
            # Log the start of the request
            logging.info(f"Fetching HTML for File ID: {file_id} from URL: {url}")
    
            # Fetch the HTML content from the URL
            response = http.get(url, timeout=30)  # Without a timeout, a stalled connection could block the loop indefinitely
            response.raise_for_status()  # Raise an exception for HTTP errors
    
            # Save the HTML content to a file in the 'html_talks' directory
            file_path = os.path.join(html_talks_dir, f"{file_id}.html")
            with open(file_path, 'w', encoding='utf-8') as html_file:
                html_file.write(response.text)
    
            # Log success
            logging.info(f"Successfully saved HTML for File ID: {file_id} to {file_path}")
    
        except requests.exceptions.RequestException as e:
            # Log any failures or retries
            logging.error(f"Failed to fetch HTML for File ID: {file_id} from URL: {url}: {e}")

if __name__ == "__main__":
    main()

## Scraping the `TED Talks` URLs

In [12]:
# Initialize new columns in the DataFrame
new_columns = ['Speaker', 'Title', 'Duration', 'Tags', 'Views', 'Year', 'Talk', 'Video', 'Event', 'TED_ID', 'Transcript', 'Transcript Available']
for col in new_columns:
    df_tedtalks_urls3[col] = None

# Iterate through each row in the DataFrame
for idx, row in df_tedtalks_urls3.iterrows():
    file_id = row['File ID']
    file_path = f"{html_talks_dir}/{file_id}.html"
    
    try:
        # Open and parse the corresponding HTML file
        with open(file_path, 'r', encoding='utf-8') as html_file:
            soup = BeautifulSoup(html_file, 'lxml')
        
        # Extract data from the first <script type="application/ld+json"> tag
        transcript_script = soup.find('script', {'type': 'application/ld+json'})
        transcript_json = json.loads(transcript_script.string)
        transcript = transcript_json.get('transcript', "")

        # Decode the HTML entities in the transcript
        transcript_plain_text = html.unescape(transcript)

        # Extract data from the <script id="__NEXT_DATA__" type="application/json"> tag
        data_script = soup.find('script', {'id': '__NEXT_DATA__', 'type': 'application/json'})
        data_json = json.loads(data_script.string)
        video_data = data_json['props']['pageProps']['videoData']

        # Parse the playerData JSON (nested string)
        player_data = json.loads(video_data.get('playerData', "{}"))

        # Safely access the targeting object
        targeting_data = player_data.get('targeting', {})

        # Populate the new columns with extracted data
        df_tedtalks_urls3.at[idx, 'Speaker'] = player_data.get('speaker')
        df_tedtalks_urls3.at[idx, 'Title'] = player_data.get('title')
        df_tedtalks_urls3.at[idx, 'Duration'] = player_data.get('duration')
        df_tedtalks_urls3.at[idx, 'Tags'] = targeting_data.get('tag', '')
        df_tedtalks_urls3.at[idx, 'Views'] = video_data.get('viewedCount', 0)
        df_tedtalks_urls3.at[idx, 'Year'] = targeting_data.get('year', '')
        df_tedtalks_urls3.at[idx, 'Talk'] = targeting_data.get('talk', '')
#        df_tedtalks_urls3.at[idx, 'Video'] = player_data.get('resources', {}).get('h264', [{}])[0].get('file', '')

        # Handle Video (h264 resource) safely
        resources = player_data.get('resources', {})
        h264_resources = resources.get('h264', None)  # Safely retrieve 'h264'
        
        if isinstance(h264_resources, list) and len(h264_resources) > 0:
            # If 'h264' is a non-empty list, retrieve the 'file'
            df_tedtalks_urls3.at[idx, 'Video'] = h264_resources[0].get('file', '')
        else:
            # If 'h264' is null or not a valid list, leave 'Video' empty
            df_tedtalks_urls3.at[idx, 'Video'] = None

        df_tedtalks_urls3.at[idx, 'Event'] = targeting_data.get('event', '')
        df_tedtalks_urls3.at[idx, 'TED_ID'] = targeting_data.get('id', '')
        df_tedtalks_urls3.at[idx, 'Transcript'] = transcript_plain_text
        df_tedtalks_urls3.at[idx, 'Transcript Available'] = 'Yes' if transcript_plain_text.strip() else 'No'
        
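    # AttributeError and TypeError cover pages where an expected <script> tag is missing (soup.find returns None)
    except (FileNotFoundError, json.JSONDecodeError, KeyError, AttributeError, TypeError) as e: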
        logging.error(f"Error processing file {file_path}: {e}")
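
The cell above decodes JSON twice, because `playerData` is stored as a JSON string inside the `__NEXT_DATA__` JSON document. The sketch below reproduces that pattern on a hypothetical miniature payload, using only the standard library; the field values (`Jane Doe`, `2024`) are invented for illustration and the real pages contain many more keys.

```python
import html
import json

# Hypothetical miniature of the '__NEXT_DATA__' payload: 'playerData' is itself
# a JSON document stored as a string, so it needs a second json.loads call.
inner = {"speaker": "Jane Doe", "targeting": {"year": "2024"}, "resources": {"h264": None}}
outer_text = json.dumps({"props": {"pageProps": {"videoData": {"playerData": json.dumps(inner)}}}})

video_data = json.loads(outer_text)["props"]["pageProps"]["videoData"]
player_data = json.loads(video_data.get("playerData", "{}"))

speaker = player_data.get("speaker")
year = player_data.get("targeting", {}).get("year", "")

# 'h264' may be null in the payload, so check the type before indexing
h264 = player_data.get("resources", {}).get("h264")
video = h264[0].get("file", "") if isinstance(h264, list) and h264 else None

# html.unescape turns entities such as &amp; and &#39; back into characters
transcript = html.unescape("Simon &amp; Garfunkel&#39;s song")

print(speaker, year, video, transcript)
```

Chaining `.get()` calls with default values keeps a missing key from raising `KeyError`, which is why only the mandatory path (`props` → `pageProps` → `videoData`) is accessed with square brackets.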

## Adding the column `Word Count`

In [13]:
# Create a new column named 'Word Count'
df_tedtalks_urls3['Word Count'] = np.where(
    df_tedtalks_urls3['Transcript Available'] == 'Yes',  # Condition
    df_tedtalks_urls3['Transcript'].apply(lambda x: len(str(x).split())),  # Word count if condition is met
    None  # Leave as None if 'Transcript Available' is not 'Yes'
)
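
The count above relies on `str.split()` with no arguments, which splits on runs of whitespace. The following sketch, on made-up sentences, shows what that implies for the figures in `Word Count`; the helper name `whitespace_word_count` is hypothetical.

```python
def whitespace_word_count(text):
    """Count whitespace-separated tokens, mirroring len(str(x).split())."""
    return len(str(text).split())

# Double spaces do not inflate the count, but a free-standing '--' is a token
sample = "This chimpanzee stumbles  across a windfall -- it's overripe."
print(whitespace_word_count(sample))  # 9

# An empty transcript yields 0, while a missing value becomes the string 'None'
print(whitespace_word_count(""))    # 0
print(whitespace_word_count(None))  # 1
```

The last case is the reason for the `np.where` guard: without it, rows lacking a transcript could receive a spurious count instead of staying empty. Punctuation remains attached to words, so the figures are token counts in the whitespace sense rather than linguistically tokenised word counts.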

In [14]:
df_tedtalks_urls3

| | File ID | TED Talks URL | Speaker | Title | Duration | Tags | Views | Year | Talk | Video | Event | TED_ID | Transcript | Transcript Available | Word Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000001 | https://www.ted.com/talks/nazzy_pakpour_this_i... | Nazzy Pakpour | This is the most common way to get head lice | 273 | TED-Ed,education,animation,health,public healt... | 671 | 2025 | nazzy_pakpour_this_is_the_most_common_way_to_g... | https://py.tedcdn.com/consus/projects/00/75/96... | TED-Ed | 144934 | The six-legged creature creeps down the canopy... | Yes | 650 |
| 1 | 000002 | https://www.ted.com/talks/ryan_gilliam_a_concr... | Ryan Gilliam | A concrete plan for sustainable cement | 352 | sustainability,climate change,environment,poll... | 202527 | 2024 | ryan_gilliam_a_concrete_plan_for_sustainable_c... | https://py.tedcdn.com/consus/projects/00/76/49... | TED Countdown: Overcoming Dilemmas in the Gree... | 146336 | | No | |
| 2 | 000003 | https://www.ted.com/talks/sharon_zicherman_wha... | Sharon Zicherman | What you miss when you focus on the average | 342 | decision-making,math,health | 232690 | 2024 | sharon_zicherman_what_you_miss_when_you_focus_... | https://py.tedcdn.com/consus/projects/00/76/15... | TED@BCG | 145334 | | No | |
| 3 | 000004 | https://www.ted.com/talks/rachel_yang_how_gian... | Rachel Yang | The century-old technology that could change t... | 245 | TED-Ed,education,animation,technology,sustaina... | 71804 | 2025 | rachel_yang_the_century_old_technology_that_co... | https://py.tedcdn.com/consus/projects/00/76/44... | TED-Ed | 146163 | In 1919, American mechanic Charles Strite inve... | Yes | 664 |
| 4 | 000005 | https://www.ted.com/talks/leo_villareal_how_li... | Leo Villareal | How light and code can transform a city | 525 | art,design,public space,cities,beauty,creativi... | 249481 | 2024 | leo_villareal_how_light_and_code_can_transform... | https://py.tedcdn.com/consus/projects/00/77/09... | TEDNext 2024 | 147273 | | No | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4077 | 004078 | https://www.ted.com/talks/sir_ken_robinson_do_... | Sir Ken Robinson | Do schools kill creativity? | 1148 | creativity,culture,dance,education,parenting,t... | 78343046 | 2006 | sir_ken_robinson_do_schools_kill_creativity | https://py.tedcdn.com/consus/projects/00/12/51... | TED2006 | 66 | Good morning. How are you? (Audience) Good. It... | Yes | 3170 |
| 4078 | 004079 | https://www.ted.com/talks/majora_carter_greeni... | Majora Carter | Greening the ghetto | 1116 | activism,business,cities,environment,politics,... | 3179156 | 2006 | majora_carter_greening_the_ghetto | https://py.tedcdn.com/consus/projects/00/08/09... | TED2006 | 53 | If you're here today -- and I'm very happy tha... | Yes | 3071 |
| 4079 | 004080 | https://www.ted.com/talks/david_pogue_simplici... | David Pogue | Simplicity sells | 1286 | computers,entertainment,media,music,performanc... | 2036894 | 2006 | david_pogue_simplicity_sells | https://py.tedcdn.com/consus/projects/00/17/84... | TED2006 | 7 | (Music: "The Sound of Silence," Simon & Garfun... | Yes | 3371 |
| 4080 | 004081 | https://www.ted.com/talks/al_gore_averting_the... | Al Gore | Averting the climate crisis | 954 | climate change,culture,environment,global issu... | 3755382 | 2006 | al_gore_averting_the_climate_crisis | https://py.tedcdn.com/consus/projects/00/21/03... | TED2006 | 1 | Thank you so much, Chris. And it's truly a gre... | Yes | 2153 |


## Checking the range of years

In [15]:
df_tedtalks_urls3['Year'].unique()

array(['2025', '2024', '2023', '2020', '2022', '2021', '2019', '2017',
       '2015', '2016', '2018', '2013', '2014', '2012', '2009', '2011',
       '2010', '2006', '1972', '2007', '1983', '2005', '2008', '2004',
       '2003', '2002', '2001', '1998', '1990', '1984'], dtype=object)

## Exporting the enriched DataFrame to a `JSONL` file

In [16]:
df_tedtalks_urls3.to_json(f"{output_file1}.jsonl", orient='records', lines=True)
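
With `orient='records'` and `lines=True`, the export above writes one JSON object per line (the JSON Lines layout). The sketch below shows that layout with the standard `json` module on two made-up records, writing to a temporary file and reading it back; the file name `demo.jsonl` and the record contents are hypothetical.

```python
import json
import os
import tempfile

# Two invented records standing in for DataFrame rows
records = [
    {"File ID": "000001", "Year": "2025"},
    {"File ID": "000002", "Year": "2024"},
]

with tempfile.TemporaryDirectory() as demo_dir:
    path = os.path.join(demo_dir, "demo.jsonl")

    # One JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Each non-empty line parses independently of the others
    with open(path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```

Because every line is a complete JSON document, a file in this layout can be inspected or processed line by line without loading the whole corpus into memory.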

## Inspecting a few texts

In [17]:
print(df_tedtalks_urls3.loc[0,'Transcript'])

The six-legged creature creeps down the canopy, extends its slender trunk, and pierces the ground. Up comes blood. This is no regular forest. Living where the scalp meets the hair, these nightmarish figures are, in fact, sesame seed-sized insects, otherwise known as head lice. The earliest archaeological evidence of humans’ close-knit relationship with lice is a fully preserved egg, discovered in the hair of a 10,000-year-old Brazilian mummy. And it seems that for as long as we’ve had lice, we’ve fought hard to get rid of them. Nit combs, the fine-tooth brushes used to remove lice and their sticky eggs have been found among the ancient remains of cultures across the globe. This battle continues today, as it's estimated we spend billions of dollars each year treating infestations. So, why are lice so difficult to get rid of? There are at are at least several thousand louse species, as nearly all mammals deal with these parasites. Humans are pestered by three different types, each specia

In [18]:
print(df_tedtalks_urls3.loc[2568,'Transcript'])

This chimpanzee stumbles  across a windfall of overripe plums. Many of them have split open, drawing him  to their intoxicating fruity odor. He gorges himself and begins to experience some…  strange effects. This unwitting ape  has stumbled on a process that humans will eventually harness to create  beer, wine, and other alcoholic drinks. The sugars in overripe fruit  attract microscopic organisms known as yeasts. As the yeasts feed on the fruit sugars  they produce a compound called ethanol— the type of alcohol  in alcoholic beverages. This process is called fermentation. Nobody knows exactly when humans began  to create fermented beverages. The earliest known evidence  comes from 7,000 BCE in China, where residue in clay pots has revealed that people  were making an alcoholic beverage from fermented rice, millet,  grapes, and honey. Within a few thousand years, cultures all over the world  were fermenting their own drinks. Ancient Mesopotamians and Egyptians  made beer throughout the

## Slicing `df_tedtalks_urls3` to create the `df_tedtalks_urls4` DataFrame

Conditions:
- The column 'Transcript Available' equals 'Yes';
- The column 'Year' equals '2020' or later.

In [19]:
# Filter rows where 'Transcript Available' equals 'Yes' and 'Year' is 2020 or later
df_tedtalks_urls4 = df_tedtalks_urls3[
    (df_tedtalks_urls3['Transcript Available'] == 'Yes') & 
    (df_tedtalks_urls3['Year'].astype(int) >= 2020)
].copy()  # Use .copy() to create a new independent DataFrame
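
The filter above converts `Year` with `.astype(int)`, which assumes every row holds a numeric string. If a row kept `None` in `Year` (for instance because its HTML file could not be parsed), that conversion would likely raise an error. The sketch below illustrates a more tolerant comparison on plain Python values; the helper name `year_at_least` and the sample rows are hypothetical.

```python
def year_at_least(value, minimum=2020):
    """Return True if 'value' can be read as an integer year >= minimum."""
    try:
        return int(value) >= minimum
    except (TypeError, ValueError):
        # Missing or non-numeric years are treated as not matching
        return False

# Invented rows: one recent, one old, one with a missing year
rows = [
    {"File ID": "000001", "Year": "2025", "Transcript Available": "Yes"},
    {"File ID": "004078", "Year": "2006", "Transcript Available": "Yes"},
    {"File ID": "009999", "Year": None, "Transcript Available": None},
]

kept = [
    row["File ID"]
    for row in rows
    if row["Transcript Available"] == "Yes" and year_at_least(row["Year"])
]
print(kept)  # ['000001']
```

Within pandas, a comparable effect might be obtained with `pd.to_numeric(df_tedtalks_urls3['Year'], errors='coerce') >= 2020`, since missing values become `NaN` and compare as `False`.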

In [20]:
df_tedtalks_urls4

| | File ID | TED Talks URL | Speaker | Title | Duration | Tags | Views | Year | Talk | Video | Event | TED_ID | Transcript | Transcript Available | Word Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000001 | https://www.ted.com/talks/nazzy_pakpour_this_i... | Nazzy Pakpour | This is the most common way to get head lice | 273 | TED-Ed,education,animation,health,public healt... | 671 | 2025 | nazzy_pakpour_this_is_the_most_common_way_to_g... | https://py.tedcdn.com/consus/projects/00/75/96... | TED-Ed | 144934 | The six-legged creature creeps down the canopy... | Yes | 650 |
| 3 | 000004 | https://www.ted.com/talks/rachel_yang_how_gian... | Rachel Yang | The century-old technology that could change t... | 245 | TED-Ed,education,animation,technology,sustaina... | 71804 | 2025 | rachel_yang_the_century_old_technology_that_co... | https://py.tedcdn.com/consus/projects/00/76/44... | TED-Ed | 146163 | In 1919, American mechanic Charles Strite inve... | Yes | 664 |
| 6 | 000007 | https://www.ted.com/talks/ben_proudfoot_the_tr... | Ben Proudfoot | The true story of the iconic tagline “Because ... | 1055 | film,social change,women,media,culture,marketi... | 55663 | 2025 | ben_proudfoot_the_true_story_of_the_iconic_tag... | https://py.tedcdn.com/consus/projects/00/77/10... | TED Docs | 147074 | - I don’t remember exactly, but ... I use the ... | Yes | 1743 |
| 7 | 000008 | https://www.ted.com/talks/m_alejandra_perotti_... | M. Alejandra Perotti | You might be surprised by what you'd find in y... | 303 | TED-Ed,education,animation,health,human body,s... | 186310 | 2025 | m_alejandra_perotti_you_might_be_surprised_by_... | https://py.tedcdn.com/consus/projects/00/75/97... | TED-Ed | 144912 | In 1841, German anatomist Jacob Henle was exam... | Yes | 647 |
| 11 | 000012 | https://www.ted.com/talks/malay_bera_the_tale_... | Malay Bera | The tale of the brothers who outwitted the dem... | 326 | TED-Ed,education,animation,storytelling,culture | 195933 | 2025 | malay_bera_the_tale_of_the_brothers_who_outwit... | https://py.tedcdn.com/consus/projects/00/75/69... | TED-Ed | 144589 | The kingdom of Achinpur was on the precipice o... | Yes | 657 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2553 | 002554 | https://www.ted.com/talks/alex_rosenthal_the_a... | Alex Rosenthal | The Artists \| Think Like A Coder, Ep 5 | 385 | animation,TED-Ed,education,engineering,code,sc... | 523871 | 2020 | alex_rosenthal_the_artists_think_like_a_coder_... | https://py.tedcdn.com/consus/projects/00/50/88... | TED-Ed | 56535 | Dawn and the train are both breaking when Eth... | Yes | 962 |
| 2556 | 002557 | https://www.ted.com/talks/alex_gendler_can_you... | Alex Gendler | Can you solve the dragon jousting riddle? | 263 | TED-Ed,animation,education,math | 2839810 | 2020 | alex_gendler_can_you_solve_the_dragon_jousting... | https://py.tedcdn.com/consus/projects/00/45/99... | TED-Ed | 55981 | After centuries of war, the world’s kingdoms ... | Yes | 664 |
| 2559 | 002560 | https://www.ted.com/talks/eden_girma_the_myste... | Eden Girma | The mysterious life and death of Rasputin | 292 | TED-Ed,education,history,religion,animation,de... | 4169442 | 2020 | eden_girma_the_mysterious_life_and_death_of_ra... | https://py.tedcdn.com/consus/projects/00/45/96... | TED-Ed | 55593 | On a cold winter night in 1916, Felix Yusupov ... | Yes | 633 |
| 2563 | 002564 | https://www.ted.com/talks/julian_burschka_coul... | Julian Burschka | Could a breathalyzer detect cancer? | 257 | TED-Ed,cancer,education,animation,illness,dise... | 269985 | 2020 | julian_burschka_could_a_breathalyzer_detect_ca... | https://py.tedcdn.com/consus/projects/00/45/98... | TED-Ed | 55793 | How is it that a breathalyzer can measure the... | Yes | 601 |


## Exporting the column `Transcript` to individual text files

In [21]:
# Export each 'Transcript' to a file named '<File ID>.txt'
for idx, row in df_tedtalks_urls4.iterrows():
    file_id = row['File ID']  # Extract the File ID
    transcript = row['Transcript']  # Get the transcript text
    file_path = os.path.join(txt_dir, f"{file_id}.txt")  # Create the file path
    with open(file_path, 'w', encoding='utf-8') as txt_file:
        txt_file.write(transcript)  # Write the transcript to the file
    logging.info(f"Transcript for File ID {file_id} (Year {row['Year']}) exported to {file_path}")

## Exporting the filtered DataFrame to a `JSONL` file

In [22]:
df_tedtalks_urls4[['File ID', 'TED Talks URL', 'Speaker', 'Word Count', 'Title', 'Duration', 'Tags', 'Views', 'Year', 'Talk', 'Video', 'Event', 'TED_ID', 'Transcript']].to_json(f"{output_file2}.jsonl", orient='records', lines=True)

## Exporting the filtered DataFrame to an `Excel` file

The column `Transcript` was excluded because Excel limits a cell to 32,767 characters, and longer transcripts would not fit in a single cell.

In [24]:
df_tedtalks_urls4[['File ID', 'TED Talks URL', 'Speaker', 'Word Count', 'Title', 'Duration', 'Tags', 'Views', 'Year', 'Talk', 'Video', 'Event', 'TED_ID']].to_excel(f"{output_file2}.xlsx", index=False)