# Metadata
- This notebook processes the information scraped from each data source and combines it into one `metadata.csv` per source
- For each data source, the file is stored at `data/raw/{data source}/metadata.csv`

In [1]:
import pandas as pd
from tqdm import tqdm
tqdm.pandas()
from multiprocessing import Pool
import os
import requests

In [2]:
# change working directory to the project root directory
current_dir = os.getcwd()
os.chdir(current_dir + '/../../')
# this should be the project root directory
os.getcwd()

'/home/ben/projects/SaoPauloBrazilChapter_BrazilianSignLanguage'

## INES

- Uses the existing columns in `INES_Metadata.csv`
- Checks whether the video file exists
- Adds the column `file_exists`

Load `INES_Metadata.csv` to get the video URLs

In [3]:
ines_csv = pd.read_csv('data/raw/INES/INES_Metadata.csv')
ines_csv.head()

Unnamed: 0,Letter,Word,Video URL,Image URL
0,B,B,https://www.ines.gov.br/dicionario-de-libras/p...,No Image Available
1,B,BABA,https://www.ines.gov.br/dicionario-de-libras/p...,No Image Available
2,B,BABÁ1,https://www.ines.gov.br/dicionario-de-libras/p...,No Image Available
3,B,BABÁ2,https://www.ines.gov.br/dicionario-de-libras/p...,No Image Available
4,B,BABADO,https://www.ines.gov.br/dicionario-de-libras/p...,No Image Available


Find a difference in the response between URLs for words that do and don't have a video file. The `Content-Type` header turns out to differ:

In [4]:
# invalid url
invalid = 'https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/baba1Sm_Prog001.mp4'
response = requests.get(invalid, stream=True)
response.headers['Content-Type']

'text/html;charset=utf-8'

In [5]:
# valid url
valid = 'https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/babadorSm_Prog001.mp4'
response = requests.get(valid, stream=True)
response.headers['Content-Type']

'video/mp4'
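
The difference in `Content-Type` can be wrapped in a small predicate. This is a minimal sketch: the helper name `is_video_response` is hypothetical, and it assumes a missing file is always answered with an HTML page, as in the two responses above. It takes the header mapping rather than a URL, so it can be tried without a network connection.

```python
def is_video_response(headers):
    """Return True if the response headers describe an MP4 video.

    Assumption: the server answers with 'video/mp4' for existing files
    and with an HTML page ('text/html;charset=utf-8') otherwise.
    """
    # Drop any parameters such as ';charset=utf-8' before comparing
    content_type = headers.get('Content-Type', '')
    media_type = content_type.split(';')[0].strip().lower()
    return media_type == 'video/mp4'


print(is_video_response({'Content-Type': 'video/mp4'}))
print(is_video_response({'Content-Type': 'text/html;charset=utf-8'}))
```

With `requests`, `response.headers` can be passed in directly. Using `.get` with a default avoids a `KeyError` when the header is missing.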

This takes a while to run since it checks every URL, so prefer loading the results from `video_file_exists.csv` or `metadata.csv` instead of re-running it

*(took X minutes on Ben's PC/connection)*
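
The check below bounds the number of simultaneous requests with a semaphore and retries failed ones. A stripped-down sketch of that pattern, with the network call swapped for a stand-in, might look like this. `flaky_check`, the `failures` dictionary, and the short URLs are all hypothetical and exist only so the example runs offline. Unlike the notebook code, it returns `None` when a URL could not be checked, so that case stays distinct from `False`.

```python
import asyncio
import random


async def flaky_check(url, failures):
    """Stand-in for a network request (hypothetical): fails a set number of times per URL."""
    await asyncio.sleep(0)
    if failures.get(url, 0) > 0:
        failures[url] -= 1
        raise asyncio.TimeoutError()
    return url.endswith('.mp4')


async def check_with_retry(url, semaphore, failures, retry_count=3):
    async with semaphore:  # at most `max_concurrent` checks run at the same time
        for attempt in range(retry_count):
            try:
                return await flaky_check(url, failures)
            except asyncio.TimeoutError:
                if attempt == retry_count - 1:
                    return None  # could not be checked (distinct from False)
                # short randomised pause before retrying
                await asyncio.sleep(random.uniform(0.0, 0.01))


async def check_all(urls, failures, max_concurrent=2):
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [check_with_retry(u, semaphore, failures) for u in urls]
    # gather returns results in the same order as `urls`
    return await asyncio.gather(*tasks)


urls = ['a.mp4', 'b.html', 'c.mp4']
# In a notebook, use `await check_all(...)` instead of asyncio.run
results = asyncio.run(check_all(urls, failures={'a.mp4': 1, 'c.mp4': 5}))
print(results)
```

Because `asyncio.gather` preserves input order, the results can be assigned straight back to the rows the URLs came from.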

In [10]:
import asyncio
import aiohttp
import random
from aiohttp import ClientConnectorError, ClientTimeout

In [None]:
video_urls = ines_csv['Video URL'].tolist()

file_exists = []

In [13]:
# Limit concurrent requests
MAX_CONCURRENT = 10  # Reduce this if the server still returns errors
RETRY_COUNT = 5      # Attempts per URL (previously 3)

async def check_url(url, index, total, semaphore, retry_count=RETRY_COUNT):
    timeout = ClientTimeout(total=30)
    
    async with semaphore:  # Control concurrent requests
        for attempt in range(retry_count):
            try:
                # Longer delay between requests
                await asyncio.sleep(random.uniform(0.5, 1.0))
                
                async with aiohttp.ClientSession(timeout=timeout) as session:
                    async with session.get(url) as response:
                        exists = response.headers['Content-Type'] == 'video/mp4'
                        percent = (index + 1) / total * 100
                        if percent % 5 < (1 / total * 100):
                            print(f"Progress: {percent:.1f}% ({index + 1}/{total} URLs checked)")
                        return exists
                        
            except (ClientConnectorError, asyncio.TimeoutError) as e:
                if attempt == retry_count - 1:  # Last attempt
                    print(f"\nFailed to check URL after {retry_count} attempts: {url}\nError: {str(e)}")
                    return False
                print(f"\nRetrying URL {index + 1}/{total} (attempt {attempt + 2}/{retry_count})")
                # Longer wait between retries
                await asyncio.sleep(random.uniform(2, 3))
            except Exception as e:
                print(f"\nUnexpected error for URL {index + 1}/{total}: {str(e)}")
                return False

async def check_all_urls(urls):
    total = len(urls)
    print(f"Starting to check {total} URLs...")
    
    # Create semaphore to limit concurrent requests
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    
    tasks = [check_url(url, i, total, semaphore) for i, url in enumerate(urls)]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    
    # Filter out any exceptions and count successes
    valid_results = [r for r in results if isinstance(r, bool)]
    exists_count = sum(1 for r in valid_results if r)
    
    print(f"\nFinished checking URLs. Found {exists_count} existing videos out of {total}")
    return results

# Run the check (top-level await works in Jupyter)
file_exists = await check_all_urls(video_urls)

Starting to check 13778 URLs...
Progress: 5.0% (689/13778 URLs checked)

Retrying URL 1066/13778 (attempt 2/5)

Retrying URL 1074/13778 (attempt 2/5)

Retrying URL 1070/13778 (attempt 2/5)

Retrying URL 1090/13778 (attempt 2/5)

Retrying URL 1160/13778 (attempt 2/5)

Retrying URL 1174/13778 (attempt 2/5)

Retrying URL 1066/13778 (attempt 3/5)

Retrying URL 1074/13778 (attempt 3/5)

Retrying URL 1070/13778 (attempt 3/5)

Retrying URL 1090/13778 (attempt 3/5)

Retrying URL 1216/13778 (attempt 2/5)

Retrying URL 1229/13778 (attempt 2/5)
Progress: 10.0% (1378/13778 URLs checked)
Progress: 15.0% (2067/13778 URLs checked)
Progress: 20.0% (2756/13778 URLs checked)
Progress: 25.0% (3445/13778 URLs checked)
Progress: 30.0% (4134/13778 URLs checked)
Progress: 35.0% (4823/13778 URLs checked)
Progress: 40.0% (5512/13778 URLs checked)
Progress: 45.0% (6201/13778 URLs checked)
Progress: 50.0% (6889/13778 URLs checked)
Progress: 55.0% (7578/13778 URLs checked)

Retrying URL 8060/13778 (attempt 2/5)

In [14]:
len(file_exists)

13778

Save this as `video_file_exists.csv` so that if `metadata.csv` is overwritten, we don't lose this info

In [16]:
file_exists_df = ines_csv[['Word', 'Video URL']].copy()
file_exists_df.rename(columns={'Word':'label','Video URL': 'video_url'}, inplace=True)
file_exists_df['file_exists'] = file_exists

file_exists_df.to_csv('data/raw/INES/video_file_exists.csv', index=False)

Make `metadata_df` using the already saved `video_file_exists.csv` to avoid running the requests code again

In [17]:
metadata_df = pd.read_csv('data/raw/INES/video_file_exists.csv')

# Join on any other info
# (placeholder: there are no additional columns to join yet)

# Save the metadata.csv file
metadata_df.to_csv('data/raw/INES/metadata.csv', index=False)
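
The save-then-reload steps above amount to a simple cache: run the slow check once, write the result to disk, and read from disk afterwards. A minimal sketch of that idea follows. The helper `load_or_compute`, the stand-in `slow_check`, and the sample row are hypothetical, and it uses the standard library `csv` module instead of pandas so it runs without extra dependencies.

```python
import csv
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return rows from the CSV cache, running `compute` first if the file is missing."""
    cache_path = Path(cache_path)
    if not cache_path.exists():
        rows = compute()
        with cache_path.open('w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    # Always read back from disk so both paths return the same (string) types
    with cache_path.open(newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))


calls = []


def slow_check():
    """Stand-in for the URL check; records how often it was run."""
    calls.append(1)
    return [{'label': 'BABADOR', 'file_exists': True}]  # illustrative row


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'video_file_exists.csv'
    first = load_or_compute(cache, slow_check)
    second = load_or_compute(cache, slow_check)

print(first == second, len(calls))
```

The second call never touches `slow_check`, which is the property that matters when the real check takes a long time.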

## V-Librasil

- Uses the existing information in `v_librasil_words_n_links.txt`
- Uses the links to each word's page
- Collects the video file links for each interpreter
- Combines it into a `.csv` file

#### Fixing `v_librasil_words_n_links.txt`

In [27]:
# load the .txt file as a list of lines
with open('data/raw/V-Librasil/words/v_librasil_words_n_links.txt', 'r', encoding='utf-8') as f:
    words_and_links = f.read().split('\n')

In [29]:
# one line contained two entries run together
for line in words_and_links:
    if len(line.split('https://')) > 2:
        print(line)

Bigode https://libras.cin.ufpe.br/sign/826Bilhão https://libras.cin.ufpe.br/sign/105


In [31]:
# I edited the file directly to fix it
# reload the .txt file as a list of lines
with open('data/raw/V-Librasil/words/v_librasil_words_n_links.txt', 'r', encoding='utf-8') as f:
    words_and_links = f.read().split('\n')

# confirm the two entries are now on separate lines
for line in words_and_links:
    if 'Bigode' in line:
        print(line)
    if 'Bilhão' in line:
        print(line)

Bigode https://libras.cin.ufpe.br/sign/826
Bilhão https://libras.cin.ufpe.br/sign/105
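
As an alternative to editing the file by hand, a merged line could also be split in code. The sketch below is one way to do that, assuming every URL ends in `/sign/<digits>` as in the lines shown above. The names `ENTRY` and `split_entries` are hypothetical. It would misread a label that begins with a digit directly after a merged URL, so it is a convenience rather than a general parser.

```python
import re

# Assumption: every URL ends in '/sign/<digits>', as in the lines shown above
ENTRY = re.compile(r'(.+?)\s+(https://\S+?/sign/\d+)')


def split_entries(line):
    """Return every (label, url) pair found on one line."""
    return [(label.strip(), url) for label, url in ENTRY.findall(line)]


merged = 'Bigode https://libras.cin.ufpe.br/sign/826Bilhão https://libras.cin.ufpe.br/sign/105'
print(split_entries(merged))
```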


#### Making `metadata.csv`

Turn `v_librasil_words_n_links.txt` into a DataFrame

In [3]:
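words_and_links = pd.read_csv('data/raw/V-Librasil/words/v_librasil_words_n_links.txt', sep='https', header=None, engine='python')
words_and_links.columns = ['label', 'sign_url']
# splitting on 'https' leaves a trailing space on the label and drops the scheme from the URL
words_and_links.label = words_and_links.label.str.strip()
words_and_links.sign_url = words_and_links.sign_url.apply(lambda x: 'https' + x)
words_and_links.head()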

Unnamed: 0,label,sign_url
0,À noite toda,https://libras.cin.ufpe.br/sign/885
1,À tarde toda,https://libras.cin.ufpe.br/sign/100
2,Abacaxi,https://libras.cin.ufpe.br/sign/817
3,Abanar,https://libras.cin.ufpe.br/sign/1536
4,Abandonar,https://libras.cin.ufpe.br/sign/71


Get the video URLs (~3 for each sign)

In [None]:
from bs4 import BeautifulSoup

def get_video_urls(url):
    response = requests.get(url)

    if response.status_code == 200:
        video_links = []
        signer_numbers = []
        signer_order = ''
        html = response.content
        soup = BeautifulSoup(html, 'html.parser')
        
        # go inside div class container
        container = soup.find('section', class_='page-section').find('div', class_='container')
        # go inside div class row
        rows = [child for child in container.children if child.name == 'div']

        # get the video and signer number
        for row in rows:
            signer_number = row.find('h2').text.strip().split(' ')[1]
            link = row.find('source').get('src')
            video_links.append(link)
            signer_numbers.append(signer_number)
            signer_order += signer_number
        return video_links, signer_numbers, signer_order

    else:
        # report the requested page; `link` is not defined on this branch
        print(f"Response code != 200, Failed to get video URLs for {url}")
        return None
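
To make the parsing step easier to follow, here is a sketch run against a made-up page fragment. The markup is hypothetical and only mirrors what the selectors above imply: an `h2` whose second word is the signer number, and a `source` tag carrying the video link. The heading word `Signer` and the `example.org` links are placeholders. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies.

```python
from html.parser import HTMLParser


class SignerVideoParser(HTMLParser):
    """Collect signer numbers from <h2> text and links from <source src=...>."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.signer_numbers = []
        self.video_links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h2':
            self._in_h2 = True
        elif tag == 'source':
            src = dict(attrs).get('src')
            if src:
                self.video_links.append(src)

    def handle_endtag(self, tag):
        if tag == 'h2':
            self._in_h2 = False

    def handle_data(self, data):
        words = data.split()
        # the second word of the heading is taken as the signer number
        if self._in_h2 and len(words) > 1:
            self.signer_numbers.append(words[1])


# Hypothetical markup, not copied from the real site
SAMPLE_HTML = """
<div><h2>Signer 3</h2><video><source src="https://example.org/a.mp4"></video></div>
<div><h2>Signer 1</h2><video><source src="https://example.org/b.mp4"></video></div>
"""

parser = SignerVideoParser()
parser.feed(SAMPLE_HTML)
signer_order = ''.join(parser.signer_numbers)
print(parser.video_links, parser.signer_numbers, signer_order)
```

The three values correspond to `video_links`, `signer_numbers` and `signer_order` in `get_video_urls`.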

In [None]:
words_and_links['all_video_info'] = words_and_links['sign_url'].progress_apply(get_video_urls)

Fetching the pages one by one takes a while, so I asked an AI assistant to produce an async version

In [9]:
df = words_and_links.copy()

In [13]:
import pandas as pd
import asyncio
import aiohttp
from tqdm import tqdm
import logging
import random
from typing import Dict, Any
from bs4 import BeautifulSoup

class AsyncRequestProcessor:
    def __init__(
        self,
        max_concurrent: int = 10,
        max_retries: int = 3,
        retry_delay: float = 1.0,
        timeout: int = 30
    ):
        self.max_concurrent = max_concurrent
        self.max_retries = max_retries
        self.retry_delay = retry_delay
        self.timeout = timeout
        self.setup_logging()

    def setup_logging(self):
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)

    async def process_url(
        self,
        url: str,
        session: aiohttp.ClientSession,
        semaphore: asyncio.Semaphore
    ) -> Dict[str, Any]:
        for attempt in range(self.max_retries):
            try:
                async with semaphore:
                    # Random delay between requests
                    await asyncio.sleep(random.uniform(0.1, 0.3))
                    
                    # Call get_video_urls function
                    video_links, signer_numbers, signer_order = await get_video_urls(url, session)
                    return {
                        'url': url,
                        'status': 'success',
                        'video_links': video_links,
                        'signer_numbers': signer_numbers,
                        'signer_order': signer_order,
                        'attempts': attempt + 1
                    }
                    
            # retry on timeouts as well as on client errors
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                if attempt == self.max_retries - 1:
                    return {
                        'url': url,
                        'status': 'error',
                        'error': f'Network error: {str(e)}',
                        'attempts': attempt + 1
                    }
                await asyncio.sleep(self.retry_delay * (attempt + 1))
            except Exception as e:
                return {
                    'url': url,
                    'status': 'error',
                    'error': str(e),
                    'attempts': attempt + 1
                }

    async def process_urls(self, urls: list) -> list:
        semaphore = asyncio.Semaphore(self.max_concurrent)
        timeout = aiohttp.ClientTimeout(total=self.timeout)
        
        async with aiohttp.ClientSession(timeout=timeout) as session:
            tasks = [
                self.process_url(url, session, semaphore)
                for url in urls
            ]
            
            results = []
            failed = 0
            for f in tqdm(
                asyncio.as_completed(tasks),
                total=len(tasks),
                desc="Processing URLs"
            ):
                result = await f
                if result['status'] == 'error':
                    failed += 1
                    self.logger.warning(
                        f"Failed to process {result['url']}: {result['error']}"
                    )
                results.append(result)
            
            self.logger.info(
                f"Completed processing {len(results)} URLs. "
                f"Failed: {failed}"
            )
            return results

async def get_video_urls(url: str, session: aiohttp.ClientSession):
    async with session.get(url) as response:
        if response.status == 200:
            video_links = []
            signer_numbers = []
            signer_order = ''
            
            # Get the HTML content
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            
            # go inside div class container
            container = soup.find('section', class_='page-section').find('div', class_='container')
            # go inside div class row
            rows = [child for child in container.children if child.name == 'div']

            # get the video and signer number
            for row in rows:
                signer_number = row.find('h2').text.strip().split(' ')[1]
                link = row.find('source').get('src')
                video_links.append(link)
                signer_numbers.append(signer_number)
                signer_order += signer_number
            return video_links, signer_numbers, signer_order
        else:
            raise aiohttp.ClientError(f"Response code {response.status}, Failed to get video URLs for {url}")

# For Jupyter notebook execution
async def main(df):
    processor = AsyncRequestProcessor(
        max_concurrent=10,
        max_retries=3,
        retry_delay=1.0,
        timeout=30
    )
    
    results = await processor.process_urls(df['sign_url'].tolist())
    return pd.DataFrame(results)

# Execute in Jupyter
try:
    results_df = await main(df)  # top-level await works in Jupyter

    # Process results into final dataframe
    successful_results = results_df[results_df['status'] == 'success']

    # Attach the video links and signer information to df.
    # Results arrive in completion order, so match rows on the URL
    # rather than on the positional index.
    df_expanded = df.merge(
        successful_results[['url', 'video_links', 'signer_numbers', 'signer_order']],
        left_on='sign_url',
        right_on='url',
        how='left'
    ).drop(columns='url')

except Exception as e:
    logging.error(f"Processing failed: {str(e)}")

Processing URLs: 100%|██████████| 1000/1000 [02:31<00:00,  6.59it/s]


In [16]:
results_df.head()

Unnamed: 0,url,status,video_links,signer_numbers,signer_order,attempts
0,https://libras.cin.ufpe.br/sign/1070,success,[https://libras.cin.ufpe.br/storage/videos/202...,"[3, 1, 2]",312,1
1,https://libras.cin.ufpe.br/sign/220,success,[https://libras.cin.ufpe.br/storage/videos/202...,"[2, 1, 3]",213,1
2,https://libras.cin.ufpe.br/sign/655,success,[https://libras.cin.ufpe.br/storage/videos/202...,"[2, 1, 3]",213,1
3,https://libras.cin.ufpe.br/sign/439,success,[https://libras.cin.ufpe.br/storage/videos/202...,"[1, 2, 3]",123,1
4,https://libras.cin.ufpe.br/sign/115,success,[https://libras.cin.ufpe.br/storage/videos/202...,"[1, 2, 3]",123,1


In [15]:
results_df.status.value_counts()

status
success    1000
Name: count, dtype: int64

In [18]:
results_df = results_df.explode(['video_links', 'signer_numbers'])
results_df.head()

Unnamed: 0,url,status,video_links,signer_numbers,signer_order,attempts
0,https://libras.cin.ufpe.br/sign/1070,success,https://libras.cin.ufpe.br/storage/videos/2021...,3,312,1
0,https://libras.cin.ufpe.br/sign/1070,success,https://libras.cin.ufpe.br/storage/videos/2021...,1,312,1
0,https://libras.cin.ufpe.br/sign/1070,success,https://libras.cin.ufpe.br/storage/videos/2021...,2,312,1
1,https://libras.cin.ufpe.br/sign/220,success,https://libras.cin.ufpe.br/storage/videos/2020...,2,213,1
1,https://libras.cin.ufpe.br/sign/220,success,https://libras.cin.ufpe.br/storage/videos/2020...,1,213,1
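
Calling `explode` on two columns pairs up the list elements position by position and repeats the remaining columns on each new row. A plain-Python sketch of the same reshaping is shown below. It is not pandas itself: `explode_record` is a hypothetical helper and the video links are placeholders.

```python
def explode_record(record, list_keys):
    """Yield one row per position of the parallel lists named in list_keys."""
    lengths = {len(record[k]) for k in list_keys}
    if len(lengths) != 1:
        raise ValueError('lists must have matching lengths')
    for i in range(lengths.pop()):
        row = dict(record)  # copy the scalar fields
        for k in list_keys:
            row[k] = record[k][i]
        yield row


record = {
    'url': 'https://libras.cin.ufpe.br/sign/1070',
    'video_links': ['v3.mp4', 'v1.mp4', 'v2.mp4'],  # placeholder links
    'signer_numbers': ['3', '1', '2'],
    'signer_order': '312',
}
rows = list(explode_record(record, ['video_links', 'signer_numbers']))
for row in rows:
    print(row['signer_numbers'], row['video_links'], row['signer_order'])
```

Each signer number stays paired with its own video link, while `signer_order` is repeated unchanged on every row.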


In [21]:
words_and_links.head()

Unnamed: 0,label,sign_url
0,À noite toda,https://libras.cin.ufpe.br/sign/885
1,À tarde toda,https://libras.cin.ufpe.br/sign/100
2,Abacaxi,https://libras.cin.ufpe.br/sign/817
3,Abanar,https://libras.cin.ufpe.br/sign/1536
4,Abandonar,https://libras.cin.ufpe.br/sign/71


In [22]:
# join back to words_and_links on sign_url
results_df.rename(columns={'url':'sign_url', 'video_links':'video_url', 'signer_numbers':'signer_number'}, inplace=True)
metadata_df = words_and_links.merge(results_df, on='sign_url', how='left')
metadata_df.head()

Unnamed: 0,label,sign_url,status,video_url,signer_number,signer_order,attempts
0,À noite toda,https://libras.cin.ufpe.br/sign/885,success,https://libras.cin.ufpe.br/storage/videos/2021...,1,132,1
1,À noite toda,https://libras.cin.ufpe.br/sign/885,success,https://libras.cin.ufpe.br/storage/videos/2021...,3,132,1
2,À noite toda,https://libras.cin.ufpe.br/sign/885,success,https://libras.cin.ufpe.br/storage/videos/2021...,2,132,1
3,À tarde toda,https://libras.cin.ufpe.br/sign/100,success,https://libras.cin.ufpe.br/storage/videos/2020...,1,123,1
4,À tarde toda,https://libras.cin.ufpe.br/sign/100,success,https://libras.cin.ufpe.br/storage/videos/2020...,2,123,1


In [23]:
metadata_df.columns

Index(['label', 'sign_url', 'status', 'video_url', 'signer_number',
       'signer_order', 'attempts'],
      dtype='object')

Save selected columns

In [24]:
metadata_df[
    ['label', 'signer_number', 'video_url', 'sign_url', 'signer_order']
       ].to_csv('data/raw/V-Librasil/metadata.csv', index=False)

## SignBank

To be completed.