<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 2 - Phase 1 - Elaine

## [TED Talks](https://www.ted.com/talks) data extraction

This notebook considers the period 2020 to 2025 (raw data extracted on 14/03/2025 at 11:38 a.m., Brasília time).

## Required Python packages

- beautifulsoup4
- lxml
- pandas

## Importing the required libraries

In [1]:
from bs4 import BeautifulSoup
import pandas as pd

## Defining input variables

In [2]:
input_directory = 'indexes'
input_file = 'TED_7061_of_7061_20250314.html'

## Scraping the `TED Talks` URLs

### Scraping the input HTML file

In [3]:
with open(f'{input_directory}/{input_file}', 'r', encoding='utf8', newline='\n') as html_doc:
    soup = BeautifulSoup(html_doc, 'lxml')

In [4]:
# Find all 'a' tags with the class 'relative'
tedtalks_urls = soup.find_all('a', class_='relative')

In [5]:
# Extract the 'href' attribute of each link and store them in a list
tedtalks_urls_list = [tedtalks_url.get('href') for tedtalks_url in tedtalks_urls if tedtalks_url.get('href')]
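
BeautifulSoup treats `class` as a multi-valued attribute, so `class_='relative'` matches any `a` tag whose class list includes `relative`, not only tags whose class is exactly that string. The sketch below illustrates the same selection rule with the standard library's `html.parser` instead of BeautifulSoup, which can serve as an independent cross-check on the number of links found. The class `TalkLinkParser` and the sample markup are hypothetical and are not taken from the real index page.

```python
from html.parser import HTMLParser


class TalkLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class list includes a given class."""

    def __init__(self, css_class="relative"):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # A class attribute may hold several space-separated class names
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if self.css_class in classes and href:
            self.hrefs.append(href)


# Toy markup for illustration only (assumed, not the real TED index page)
sample_html = """
<a class="relative block" href="https://www.ted.com/talks/example_talk_one">One</a>
<a class="absolute" href="https://www.ted.com/talks/example_talk_two">Two</a>
<a class="relative">No href</a>
<a class="relative" href="https://www.ted.com/playlists/1">Playlist</a>
"""

parser = TalkLinkParser()
parser.feed(sample_html)

# Keep only the links that point at individual talks
talk_links = [h for h in parser.hrefs if h.startswith("https://www.ted.com/talks/")]
print(talk_links)
```

Comparing the length of such a list with the length of `tedtalks_urls_list` is one way to confirm that no links were dropped by the parser.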

### Exporting the data into a DataFrame

In [6]:
# Create a Pandas DataFrame from the list of links
df_tedtalks_urls1 = pd.DataFrame(tedtalks_urls_list, columns=['TED Talks URL'])

# Keep only rows where 'TED Talks URL' contains 'https://www.ted.com/talks/'
# (regex=False matches the text literally, so '.' is not treated as a wildcard)
df_tedtalks_urls1 = df_tedtalks_urls1[df_tedtalks_urls1['TED Talks URL'].str.contains('https://www.ted.com/talks/', regex=False)]
df_tedtalks_urls1 = df_tedtalks_urls1.reset_index(drop=True)

In [7]:
# Append '/transcript?language=en' to each 'TED Talks URL'
df_tedtalks_urls1['TED Talks URL'] = df_tedtalks_urls1['TED Talks URL'] + '/transcript?language=en'

In [8]:
df_tedtalks_urls1

| | TED Talks URL |
|---|---|
| 0 | https://www.ted.com/talks/nazzy_pakpour_this_i... |
| 1 | https://www.ted.com/talks/ryan_gilliam_a_concr... |
| 2 | https://www.ted.com/talks/sharon_zicherman_wha... |
| 3 | https://www.ted.com/talks/rachel_yang_how_gian... |
| 4 | https://www.ted.com/talks/leo_villareal_how_li... |
| ... | ... |
| 7056 | https://www.ted.com/talks/sir_ken_robinson_do_... |
| 7057 | https://www.ted.com/talks/majora_carter_greeni... |
| 7058 | https://www.ted.com/talks/david_pogue_simplici... |
| 7059 | https://www.ted.com/talks/al_gore_averting_the... |


### Inspecting a few rows

In [9]:
df_tedtalks_urls1.loc[1000, 'TED Talks URL']

'https://www.ted.com/talks/miguel_goncalves_how_millennials_and_gen_z_can_invest_in_a_better_future/transcript?language=en'
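
The URL above has the shape `<talk URL>/transcript?language=en`. Plain string concatenation produces that shape only if every scraped `href` has no trailing slash and no query string of its own. As a hedged alternative, the sketch below builds the transcript URL with `urllib.parse`, which normalises both cases. The helper name `to_transcript_url` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def to_transcript_url(talk_url, language="en"):
    """Return the transcript URL for a talk URL, ignoring any existing query string."""
    parts = urlsplit(talk_url)
    # Drop a trailing slash so the path is not doubled up
    path = parts.path.rstrip("/")
    # Avoid appending '/transcript' twice if the helper is applied again
    if not path.endswith("/transcript"):
        path += "/transcript"
    return urlunsplit((parts.scheme, parts.netloc, path, f"language={language}", ""))


# Hypothetical inputs covering the clean case and two irregular ones
print(to_transcript_url("https://www.ted.com/talks/example_talk"))
print(to_transcript_url("https://www.ted.com/talks/example_talk/"))
print(to_transcript_url("https://www.ted.com/talks/example_talk?utm_source=x"))
```

All three calls yield the same transcript URL, so later comparisons between URL lists are less likely to miss a match because of formatting differences.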

## Importing the previous study's `TED Talks` URLs

In [10]:
# Read the tab-separated file (no header row) into a DataFrame
df_tedtalks_urls2 = pd.read_csv('previous_study_valid_urls', delimiter='\t', header=None)

# Define the column names
df_tedtalks_urls2.columns = ['File ID', 'TED Talks URL']

In [11]:
df_tedtalks_urls2

| | File ID | TED Talks URL |
|---|---|---|
| 0 | 1 | https://www.ted.com/talks/alex_gendler_a_brief... |
| 1 | 2 | https://www.ted.com/talks/alex_gendler_a_day_i... |
| 2 | 3 | https://www.ted.com/talks/alex_gendler_can_you... |
| 3 | 4 | https://www.ted.com/talks/andrew_marantz_insid... |
| 4 | 5 | https://www.ted.com/talks/anne_f_broadbridge_t... |
| ... | ... | ... |
| 3974 | 3975 | https://www.ted.com/talks/steve_truglia_a_leap... |
| 3975 | 3976 | https://www.ted.com/talks/stewart_brand_procla... |
| 3976 | 3977 | https://www.ted.com/talks/tom_wujec_on_3_ways_... |
| 3977 | 3978 | https://www.ted.com/talks/vishal_vaid_s_hypnot... |


## Creating `df_tedtalks_urls3` by excluding rows from `df_tedtalks_urls1` that are present in `df_tedtalks_urls2`

This step removes every talk that was already analysed in the previous study, so that only new talks remain.

In [12]:
df_tedtalks_urls3 = df_tedtalks_urls1[~df_tedtalks_urls1['TED Talks URL'].isin(df_tedtalks_urls2['TED Talks URL'])]
df_tedtalks_urls3 = df_tedtalks_urls3.reset_index(drop=True)

In [13]:
df_tedtalks_urls3

| | TED Talks URL |
|---|---|
| 0 | https://www.ted.com/talks/nazzy_pakpour_this_i... |
| 1 | https://www.ted.com/talks/ryan_gilliam_a_concr... |
| 2 | https://www.ted.com/talks/sharon_zicherman_wha... |
| 3 | https://www.ted.com/talks/rachel_yang_how_gian... |
| 4 | https://www.ted.com/talks/leo_villareal_how_li... |
| ... | ... |
| 4077 | https://www.ted.com/talks/sir_ken_robinson_do_... |
| 4078 | https://www.ted.com/talks/majora_carter_greeni... |
| 4079 | https://www.ted.com/talks/david_pogue_simplici... |
| 4080 | https://www.ted.com/talks/al_gore_averting_the... |
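
The exclusion step relies on exact string equality between the two URL columns, so a previous-study URL that differs in any character (for instance a trailing slash or a different `language` parameter) would not be excluded. The sketch below reproduces the same anti-join logic in plain Python on toy data and also lists the previous-study URLs that found no match, which may be worth inspecting. The function name `exclude_seen` and all URLs are hypothetical.

```python
def exclude_seen(candidate_urls, previous_urls):
    """Return the candidate URLs that do not appear in previous_urls, keeping order."""
    seen = set(previous_urls)
    return [url for url in candidate_urls if url not in seen]


# Toy data for illustration only (assumed, not from the real files)
candidates = [
    "https://www.ted.com/talks/talk_a/transcript?language=en",
    "https://www.ted.com/talks/talk_b/transcript?language=en",
    "https://www.ted.com/talks/talk_c/transcript?language=en",
]
previous = [
    "https://www.ted.com/talks/talk_b/transcript?language=en",
    "https://www.ted.com/talks/talk_z/transcript?language=en",
]

new_urls = exclude_seen(candidates, previous)

# Previous-study URLs that are absent from the current index
unmatched_previous = sorted(set(previous) - set(candidates))

print(len(candidates), len(new_urls), unmatched_previous)
```

If `unmatched_previous` is unexpectedly large on the real data, the two sources may format their URLs differently.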


### Creating the column 'File ID'

In [14]:
# Create the 'File ID' column starting from '000001'
df_tedtalks_urls3['File ID'] = (df_tedtalks_urls3.index + 1).astype(str).str.zfill(6)

In [15]:
df_tedtalks_urls3

| | TED Talks URL | File ID |
|---|---|---|
| 0 | https://www.ted.com/talks/nazzy_pakpour_this_i... | 000001 |
| 1 | https://www.ted.com/talks/ryan_gilliam_a_concr... | 000002 |
| 2 | https://www.ted.com/talks/sharon_zicherman_wha... | 000003 |
| 3 | https://www.ted.com/talks/rachel_yang_how_gian... | 000004 |
| 4 | https://www.ted.com/talks/leo_villareal_how_li... | 000005 |
| ... | ... | ... |
| 4077 | https://www.ted.com/talks/sir_ken_robinson_do_... | 004078 |
| 4078 | https://www.ted.com/talks/majora_carter_greeni... | 004079 |
| 4079 | https://www.ted.com/talks/david_pogue_simplici... | 004080 |
| 4080 | https://www.ted.com/talks/al_gore_averting_the... | 004081 |


## Creating the file `valid_urls`

In [16]:
df_tedtalks_urls3[['File ID', 'TED Talks URL']].to_csv('valid_urls', sep='\t', index=False, header=False, encoding='utf-8', lineterminator='\n')
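
The file `valid_urls` is tab-separated, has no header row, and stores the `File ID` as a zero-padded string. When such a file is read back with pandas, the ID column is inferred as an integer unless `dtype=str` is passed, which drops the leading zeros. The sketch below shows a round trip with the standard `csv` module, which always returns strings and therefore keeps the padding. The sample rows and the temporary file location are assumptions made for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows in the same two-column layout as 'valid_urls'
rows = [
    ("000001", "https://www.ted.com/talks/example_talk_one/transcript?language=en"),
    ("000002", "https://www.ted.com/talks/example_talk_two/transcript?language=en"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "valid_urls")

    # Write tab-separated lines without a header, ending each line with '\n'
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)

    # Read the file back; every field comes back as a string
    with open(path, "r", encoding="utf-8", newline="") as handle:
        loaded = [tuple(record) for record in csv.reader(handle, delimiter="\t")]

print(loaded[0][0])
```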

## Exporting to a `JSONL` file

In [17]:
df_tedtalks_urls3[['File ID', 'TED Talks URL']].to_json('df_tedtalks_urls3.jsonl', orient='records', lines=True)

In [18]:
df_tedtalks_urls3.dtypes

TED Talks URL    object
File ID          object
dtype: object
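
A JSONL file holds one JSON object per line, so it can be read back line by line without loading the whole file at once. pandas may write forward slashes in escaped form (`\/`); this is valid JSON and decodes to a plain `/`. The sketch below parses two sample lines with the standard `json` module. The helper name `parse_jsonl_lines` and the sample lines are hypothetical and only mimic the two exported fields.

```python
import json


def parse_jsonl_lines(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Sample lines for illustration only, with slashes escaped as '\/'
sample_lines = [
    '{"File ID":"000001","TED Talks URL":"https:\\/\\/www.ted.com\\/talks\\/example_talk_one\\/transcript?language=en"}',
    '{"File ID":"000002","TED Talks URL":"https:\\/\\/www.ted.com\\/talks\\/example_talk_two\\/transcript?language=en"}',
    '',
]

records = parse_jsonl_lines(sample_lines)
print(records[0]["File ID"], records[0]["TED Talks URL"])
```

With a real file, the same helper can be applied to an open file handle, since iterating over a handle yields one line at a time.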

## Adapting the programme for the command line

The programme was saved as `buildurls.py`.

In [None]:
from bs4 import BeautifulSoup
import pandas as pd

def main():
    # Defining input variables
    input_directory = 'indexes'
    input_file = 'TED_7061_of_7061_20250314.html'

    # Scraping the 'TED Talks' URLs
    with open(f'{input_directory}/{input_file}', 'r', encoding='utf8', newline='\n') as html_doc:
        soup = BeautifulSoup(html_doc, 'lxml')

    # Find all 'a' tags with the class 'relative'
    tedtalks_urls = soup.find_all('a', class_='relative')

    # Extract the 'href' attribute of each link and store them in a list
    tedtalks_urls_list = [tedtalks_url.get('href') for tedtalks_url in tedtalks_urls if tedtalks_url.get('href')]

    # Create a Pandas DataFrame from the list of links
    df_tedtalks_urls1 = pd.DataFrame(tedtalks_urls_list, columns=['TED Talks URL'])

    # Keep only rows where 'TED Talks URL' contains 'https://www.ted.com/talks/'
    # (regex=False matches the text literally, so '.' is not treated as a wildcard)
    df_tedtalks_urls1 = df_tedtalks_urls1[df_tedtalks_urls1['TED Talks URL'].str.contains('https://www.ted.com/talks/', regex=False)]
    df_tedtalks_urls1 = df_tedtalks_urls1.reset_index(drop=True)

    # Append '/transcript?language=en' to each 'TED Talks URL'
    df_tedtalks_urls1['TED Talks URL'] = df_tedtalks_urls1['TED Talks URL'] + '/transcript?language=en'

    # Importing the previous study's TED Talks URLs
    df_tedtalks_urls2 = pd.read_csv('previous_study_valid_urls', delimiter='\t', header=None)
    df_tedtalks_urls2.columns = ['File ID', 'TED Talks URL']

    # Creating df_tedtalks_urls3 by excluding rows from df_tedtalks_urls1 that are present in df_tedtalks_urls2
    df_tedtalks_urls3 = df_tedtalks_urls1[~df_tedtalks_urls1['TED Talks URL'].isin(df_tedtalks_urls2['TED Talks URL'])]
    df_tedtalks_urls3 = df_tedtalks_urls3.reset_index(drop=True)

    # Create the 'File ID' column starting from '000001'
    df_tedtalks_urls3['File ID'] = (df_tedtalks_urls3.index + 1).astype(str).str.zfill(6)

    # Creating the file 'valid_urls'
    df_tedtalks_urls3[['File ID', 'TED Talks URL']].to_csv('valid_urls', sep='\t', index=False, header=False, encoding='utf-8', lineterminator='\n')

if __name__ == "__main__":
    main()
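
The script above keeps its input and output names hard-coded inside `main()`. One possible refinement, sketched below under stated assumptions, is to expose them as command-line options with `argparse`, using the current values as defaults so that running the script without arguments behaves as before. The option names (`--input-directory`, `--input-file`, `--previous-urls`, `--output-file`) and the function `parse_args` are hypothetical and are not part of `buildurls.py`.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; defaults mirror the hard-coded values in the script."""
    parser = argparse.ArgumentParser(
        description="Build the list of TED Talks transcript URLs."
    )
    parser.add_argument("--input-directory", default="indexes",
                        help="directory that holds the saved index page")
    parser.add_argument("--input-file", default="TED_7061_of_7061_20250314.html",
                        help="file name of the saved index page")
    parser.add_argument("--previous-urls", default="previous_study_valid_urls",
                        help="tab-separated file with the previous study's URLs")
    parser.add_argument("--output-file", default="valid_urls",
                        help="tab-separated output file")
    return parser.parse_args(argv)


# Example: override only the input file and keep the other defaults
example_args = parse_args(["--input-file", "another_index.html"])
print(example_args.input_directory, example_args.input_file, example_args.output_file)
```

Inside `main()`, the parsed values would replace the corresponding literals, for example `args.input_directory` in place of `'indexes'`.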