# Video Scraping from Links
- Use this notebook once the labels and video links have already been collected.

- Downloading every video takes a long time, so we only download selected videos.

- With this notebook, we can use `download_videos.py` to:
    1. Download all videos for one word from one data source
        - (for review during data organisation and cleaning)
    2. Download all videos for one word from all data sources
        - (for review during data organisation and cleaning)
    3. Download all videos for a collection of words from all data sources
        - (for creating our raw combined dataset, after we have decided our target words)

In [2]:
import pathlib
import os
import pandas as pd
import requests
import download_videos as dv

In [3]:
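# change working directory to the project root directory
# (the notebook folder sits two levels below the root; run this cell only once,
#  because every extra run moves two more levels up)
os.chdir(pathlib.Path.cwd().parents[1])
# this should be the project root directory
os.getcwd()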

'/home/ben/projects/SaoPauloBrazilChapter_BrazilianSignLanguage'
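
The relative `chdir` above depends on where the notebook is started from. As a minimal sketch of a more defensive alternative, the helper below walks upwards until it finds a folder that contains a marker entry. The function name `find_project_root` and the choice of `data` as the marker are assumptions made for illustration (the notebook only tells us that `data/raw/...` lives under the project root).

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found.

    Hypothetical helper: `marker` is assumed to be an entry that only exists
    in the project root (here the top-level `data` folder).
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker!r}")
```

It could be used as `os.chdir(find_project_root(Path.cwd()))`, which is safe to re-run because it always resolves to the same folder.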

## *for review during the data organisation and cleaning*

In [3]:
# use the unclean metadata csv file for now
metadata_df = pd.read_csv('data/raw/combined/metadata_combined_unclean.csv')

### 1. Download all videos for one word from one data source

Set the data source

In [4]:
# choose from 'INES', 'V-Librasil', 'SignBank' or 'UFV' using 'ne', 'vl', 'sb' or 'uf'
data_source_key = 'ne'

Set the label for the word/video

In [5]:
# choose a word that exists in the metadata csv file
label = 'ABSOLVER'

Download the video
- during the review stage, videos are downloaded to the `data/raw/{data_source}/videos/` folder
- during the creation of the (raw) combined dataset, videos are downloaded to the `data/raw/combined/videos/` folder

Download the video(s) to that data source's `videos/` folder, with the filename `{label}_{data_source_key}_{i}.mp4`

In [6]:
dv.download_videos_from_metadata(
    label = label,
    metadata = metadata_df,
    data_source_key = data_source_key,
    verbose=True,
    verify_ssl=True
)

Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/absolverSm_Prog001.mp4
Video successfully downloaded to data/raw/INES/videos/ABSOLVER_ne_1.mp4
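
The path printed above follows the naming rule described before the cell. As a rough sketch of that rule (not the code in `download_videos.py`, which is not shown here), the hypothetical helper below builds the expected path. The folder name `INES` is taken from the output above; the folder names for the other three sources are assumptions based on the data source names.

```python
from pathlib import Path

# Folder per data source key. 'INES' matches the path printed in the notebook;
# the other three folder names are assumptions for illustration.
SOURCE_FOLDERS = {"ne": "INES", "vl": "V-Librasil", "sb": "SignBank", "uf": "UFV"}


def expected_video_path(label: str, data_source_key: str, i: int,
                        combined: bool = False) -> Path:
    """Return the path a video is expected to be saved under (hypothetical helper)."""
    folder = "combined" if combined else SOURCE_FOLDERS[data_source_key]
    return Path("data/raw") / folder / "videos" / f"{label}_{data_source_key}_{i}.mp4"
```

A helper like this can be handy for checking whether a file already exists before downloading it again.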



### 2. Download all videos for one word from all data sources

In [7]:
label = 'ABACATE'

In [8]:
dv.download_videos_from_metadata(
    label = label,
    metadata = metadata_df,
    data_source_key = None,
    verbose=True,
    verify_ssl=True
)

Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/abacateSm_Prog001.mp4
Video successfully downloaded to data/raw/INES/videos/ABACATE_ne_1.mp4

Downloading video 2 from https://videos.nals.cce.ufsc.br/SignBank/Vídeos/ABACATE.mp4
Error downloading video from https://videos.nals.cce.ufsc.br/SignBank/Vídeos/ABACATE.mp4: HTTPSConnectionPool(host='videos.nals.cce.ufsc.br', port=443): Max retries exceeded with url: /SignBank/V%C3%ADdeos/ABACATE.mp4 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))
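
The SignBank download fails because the server's certificate chain cannot be verified against the local trust store. How `download_videos.py` applies `verify_ssl` is not shown in this notebook, so the following is only a standard-library sketch of what switching verification off means; the function name `make_ssl_context` is an assumption.

```python
import ssl


def make_ssl_context(verify_ssl: bool = True) -> ssl.SSLContext:
    """Build an SSL context, optionally without certificate verification."""
    context = ssl.create_default_context()
    if not verify_ssl:
        # check_hostname must be disabled before verify_mode can be CERT_NONE
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

Such a context can be passed to `urllib.request.urlopen(url, context=...)`. Turning verification off removes the protection against an impersonated server, so it is best limited to hosts we already trust.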



## *for creating our combined raw dataset, after we have decided our target words*

### 3. Download all videos for a collection of words from all data sources

Define the list of target words

In [49]:
metadata_df = pd.read_csv('data/raw/combined/metadata_combined_unclean.csv')
metadata_df.label = metadata_df.label.str.upper().str.strip()

In [50]:
sample_words = ['ABACAXI', 'BANANA', 'CAFÉ', 'CARNE', 'CEBOLA']
ne_sb_vl_sample_metadata = metadata_df[metadata_df.label.isin(sample_words)].sort_values(['label', 'data_source']).reset_index(drop=True)
ne_sb_vl_sample_metadata.label.value_counts()

label
ABACAXI    7
CEBOLA     6
BANANA     5
CAFÉ       5
CARNE      5
Name: count, dtype: int64

In [51]:
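# drop rows 2, 3 and 24 (two ABACAXI rows, one CEBOLA row) so each word keeps 5 videos
# the index values refer to the sorted, re-indexed sample built in the previous cell
ne_sb_vl_sample_metadata = ne_sb_vl_sample_metadata.drop(index=[2, 3, 24])
ne_sb_vl_sample_metadata.label.value_counts()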

label
ABACAXI    5
BANANA     5
CAFÉ       5
CARNE      5
CEBOLA     5
Name: count, dtype: int64
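
Dropping rows by hard-coded index only works for this exact sample. If any five rows per word are acceptable, a generic cap can be sketched as below. This keeps the first rows of each label in the current order, which is not necessarily the same selection as the rows removed by hand above; the function name is hypothetical.

```python
import pandas as pd


def cap_rows_per_label(df: pd.DataFrame, max_per_label: int = 5) -> pd.DataFrame:
    """Keep at most `max_per_label` rows per label, preserving the row order."""
    return df.groupby("label").head(max_per_label).reset_index(drop=True)
```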

Download the videos

In [52]:
for word in sample_words:
    print('-----')
    print(f'Downloading videos for {word}')
    print('-----')
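    # verify_ssl=False skips certificate checks; with verification on,
    # the SignBank download failed with an SSL error in section 2
    dv.download_videos_from_metadata(
        label = word,
        metadata = ne_sb_vl_sample_metadata,
        combined = True,
        verbose=True,
        verify_ssl=False)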

-----
Downloading videos for ABACAXI
-----
Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/abacaxiSm_Prog001.mp4
Video successfully downloaded to data/raw/combined/videos/ABACAXI_ne_1.mp4
Downloading video 2 from https://videos.nals.cce.ufsc.br/SignBank/Vídeos/ABACAXI.mp4
Video successfully downloaded to data/raw/combined/videos/ABACAXI_sb_2.mp4
Downloading video 3 from https://libras.cin.ufpe.br/storage/videos/20210127091036_6011583c87073.mp4
Video successfully downloaded to data/raw/combined/videos/ABACAXI_vl_3.mp4
Downloading video 4 from https://libras.cin.ufpe.br/storage/videos/20210128115432_601378e8d032d.mp4
Video successfully downloaded to data/raw/combined/videos/ABACAXI_vl_4.mp4
Downloading video 5 from https://libras.cin.ufpe.br/storage/videos/20210929043133_6154bf1508f4a.mp4
Video successfully downloaded to data/raw/combined/videos/ABACAXI_vl_5.mp4
-----
Downloading videos for BANANA
-----
Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/bananaSm_Prog001.mp4
Video successfully downloaded to data/raw/combined/videos/BANANA_ne_1.mp4
Downloading video 2 from https://videos.nals.cce.ufsc.br/SignBank/V%C3%ADdeos/BANANA.mp4
Video successfully downloaded to data/raw/combined/videos/BANANA_sb_2.mp4
Downloading video 3 from https://libras.cin.ufpe.br/storage/videos/20201222122443_5fe20fbb47921.mp4
Video successfully downloaded to data/raw/combined/videos/BANANA_vl_3.mp4
Downloading video 4 from https://libras.cin.ufpe.br/storage/videos/20201222122443_5fe20fbb554a7.mp4
Video successfully downloaded to data/raw/combined/videos/BANANA_vl_4.mp4
Downloading video 5 from https://libras.cin.ufpe.br/storage/videos/20210329110319_6061de2742ee7.mp4
Video successfully downloaded to data/raw/combined/videos/BANANA_vl_5.mp4
-----
Downloading videos for CAFÉ
-----
Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/cafeSm_Prog001.mp4
Video successfully downloaded to data/raw/combined/videos/CAFÉ_ne_1.mp4
Downloading video 2 from https://videos.nals.cce.ufsc.br/SignBank/V%C3%ADdeos/CAF%C3%89.mp4
Video successfully downloaded to data/raw/combined/videos/CAFÉ_sb_2.mp4
Downloading video 3 from https://libras.cin.ufpe.br/storage/videos/20210126070737_601092a939fcf.mp4
Video successfully downloaded to data/raw/combined/videos/CAFÉ_vl_3.mp4
Downloading video 4 from https://libras.cin.ufpe.br/storage/videos/20210411080241_6072d751c1a00.mp4
Video successfully downloaded to data/raw/combined/videos/CAFÉ_vl_4.mp4
Downloading video 5 from https://libras.cin.ufpe.br/storage/videos/20210613113637_60c617f5a4e11.mp4
Video successfully downloaded to data/raw/combined/videos/CAFÉ_vl_5.mp4
-----
Downloading videos for CARNE
-----
Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/carneSm_Prog001.mp4
Video successfully downloaded to data/raw/combined/videos/CARNE_ne_1.mp4
Downloading video 2 from https://videos.nals.cce.ufsc.br/SignBank/V%C3%ADdeos/CARNE.mp4
Video successfully downloaded to data/raw/combined/videos/CARNE_sb_2.mp4
Downloading video 3 from https://libras.cin.ufpe.br/storage/videos/20210126070759_601092bf73513.mp4
Video successfully downloaded to data/raw/combined/videos/CARNE_vl_3.mp4
Downloading video 4 from https://libras.cin.ufpe.br/storage/videos/20210411080306_6072d76a749a2.mp4
Video successfully downloaded to data/raw/combined/videos/CARNE_vl_4.mp4
Downloading video 5 from https://libras.cin.ufpe.br/storage/videos/20210613111901_60c613d510001.mp4
Video successfully downloaded to data/raw/combined/videos/CARNE_vl_5.mp4
-----
Downloading videos for CEBOLA
-----
Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/cebolaSm_Prog001.mp4
Video successfully downloaded to data/raw/combined/videos/CEBOLA_ne_1.mp4
Downloading video 2 from https://videos.nals.cce.ufsc.br/SignBank/V%C3%ADdeos/CEBOLA.mp4
Video successfully downloaded to data/raw/combined/videos/CEBOLA_sb_2.mp4
Downloading video 3 from https://libras.cin.ufpe.br/storage/videos/20210124043727_600dcc77de427.mp4
Video successfully downloaded to data/raw/combined/videos/CEBOLA_vl_3.mp4
Downloading video 4 from https://libras.cin.ufpe.br/storage/videos/20210126072950_601097dec5bd3.mp4
Video successfully downloaded to data/raw/combined/videos/CEBOLA_vl_4.mp4
Downloading video 5 from https://libras.cin.ufpe.br/storage/videos/20210613125025_60c629417c513.mp4
Video successfully downloaded to data/raw/combined/videos/CEBOLA_vl_5.mp4
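
The log above shows that the SignBank links are stored inconsistently: the ABACAXI link contains a raw `í`, while the others are already percent-encoded (`V%C3%ADdeos`). Both forms downloaded successfully here, but a single form is easier to compare or de-duplicate. A minimal standard-library sketch, with a hypothetical function name, could look like this:

```python
from urllib.parse import quote


def normalise_url(url: str) -> str:
    """Percent-encode non-ASCII characters, leaving existing escapes untouched.

    '%' is kept in `safe` so that already-encoded URLs are not encoded twice.
    """
    return quote(url, safe=":/?&=#%")
```

This assumes the links contain no literal `%` characters that would themselves need escaping.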


In [57]:
ne_sb_vl_sample_metadata['filename'] = '-'

In [58]:
# record the filename each video was saved under (the index restarts at 1 for every word)
for word in sample_words:
    filtered_metadata_df = ne_sb_vl_sample_metadata[ne_sb_vl_sample_metadata.label == word]
    i = 1
    for df_index, row in filtered_metadata_df.iterrows():
        video_name = dv.make_video_filename(row, i)
        ne_sb_vl_sample_metadata.loc[df_index, 'filename'] = video_name
        i+=1   
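
The loop above can also be written without `iterrows`. The sketch below assumes that `dv.make_video_filename` produces `{label}_{data_source}_{i}.mp4`, which matches the paths printed during the download but is not shown in this notebook, and that the `data_source` column holds the short keys (`ne`, `sb`, `vl`). The function name `add_filenames` is hypothetical.

```python
import pandas as pd


def add_filenames(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with a `filename` column built per label."""
    out = df.copy()
    # 1-based position of each row within its label, in the current row order
    position = out.groupby("label").cumcount() + 1
    out["filename"] = (
        out["label"] + "_" + out["data_source"] + "_" + position.astype(str) + ".mp4"
    )
    return out
```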

In [97]:
ufv_sample_metadata = pd.DataFrame({
    'label': ['ABACAXI', 'BANANA', 'CAFÉ', 'CARNE', 'CEBOLA'],
    'data_source': ['uf', 'uf', 'uf', 'uf', 'uf'],
    'filename': ['ABACAXI_uf_6.mp4', 'BANANA_uf_6.mp4', 'CAFÉ_uf_6.mp4', 'CARNE_uf_6.mp4', 'CEBOLA_uf_6.mp4']
})

In [98]:
sample_metadata = pd.concat([ne_sb_vl_sample_metadata, ufv_sample_metadata], ignore_index=True)
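# regex=False: treat '.mp4' as a literal string ('.' would otherwise match any character in older pandas)
sample_metadata['filename_index'] = sample_metadata.filename.str.split('_').str[-1].str.replace('.mp4', '', regex=False).astype(int)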
sample_metadata.sort_values(['label', 'filename_index'], inplace=True)
sample_metadata.reset_index(drop=True, inplace=True)
sample_metadata.label.value_counts()

label
ABACAXI    6
BANANA     6
CAFÉ       6
CARNE      6
CEBOLA     6
Name: count, dtype: int64
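
The `filename_index` column is recovered from the filename itself. As a small illustrative sketch, the same information can be extracted and validated with an anchored regular expression, which fails loudly on names that do not follow the `{label}_{source}_{index}.mp4` convention. The pattern and function name are assumptions for illustration.

```python
import re

# label (may itself contain underscores), two-letter source key, numeric index
FILENAME_PATTERN = re.compile(r"^(?P<label>.+)_(?P<source>[a-z]{2})_(?P<index>\d+)\.mp4$")


def parse_video_filename(filename: str) -> dict:
    """Split a video filename into label, source key and integer index."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected filename: {filename!r}")
    parts = match.groupdict()
    parts["index"] = int(parts["index"])
    return parts
```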

In [100]:
sample_metadata.to_csv('data/raw/combined/target_dataset_metadata(sample).csv', index=False)

## *for collecting metadata from the videos*

In [101]:
metadata_df = pd.read_csv('data/raw/combined/target_dataset_metadata(sample).csv')

In [103]:
video_metadata = dv.collect_metadata_from_directory('data/raw/combined/videos')

In [104]:
video_metadata[0]

{'filename': 'ABACAXI_ne_1.mp4',
 'frame_count': 63,
 'fps': 12.0,
 'width': 240,
 'height': 176,
 'duration_sec': 5}
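
The record above is consistent with `duration_sec` being the frame count divided by the frame rate and truncated to whole seconds (63 / 12.0 = 5.25, stored as 5). The merged table further down fits the same rule, for example 84 frames at 29.97003 fps gives 2.80 and is stored as 2. How `dv.collect_metadata_from_directory` actually computes the value is not shown, so the sketch below is only an assumption that reproduces the numbers.

```python
def duration_in_whole_seconds(frame_count: int, fps: float) -> int:
    """Duration in seconds, truncated towards zero (assumed rule, see note above)."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return int(frame_count / fps)
```

Because of the truncation, videos that differ by almost a second can share the same `duration_sec`; `frame_count / fps` gives the exact value when that matters.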

In [105]:
# print the position of any entry that came back as None
for i, d in enumerate(video_metadata):
    if d is None:
        print(i)
# nothing should be printed

In [108]:
video_metadata_df = pd.DataFrame(video_metadata)
video_metadata_df = video_metadata_df.merge(metadata_df, on='filename', how='left').sort_values(['label', 'filename_index'])
video_metadata_df.head()

|  | filename | frame_count | fps | width | height | duration_sec | label | video_url | signer_number | data_source | ... | scraped_label | scraped_video_url | sign_variant | signer_number.1 | video_url_root | video_url_ext | number_in_label | sign_url | signer_order | filename_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABACAXI_ne_1.mp4 | 63 | 12.0 | 240 | 176 | 5 | ABACAXI | https://www.ines.gov.br/dicionario-de-libras/p... | 0.0 | ne | ... |  |  |  |  |  |  |  |  |  | 1 |
| 1 | ABACAXI_sb_2.mp4 | 84 | 29.97003 | 1280 | 720 | 2 | ABACAXI | https://videos.nals.cce.ufsc.br/SignBank/Vídeo... | 2.0 | sb | ... | ABACAXI | https://levantelab.storage.googleapis.com/libr... | 1.0 | 2.0 | levantelab | mp4 | False |  |  | 2 |
| 3 | ABACAXI_vl_3.mp4 | 173 | 29.97003 | 1920 | 1080 | 5 | ABACAXI | https://libras.cin.ufpe.br/storage/videos/2021... | 3.0 | vl | ... |  |  |  |  |  |  |  | https://libras.cin.ufpe.br/sign/817 | 312.0 | 3 |
| 4 | ABACAXI_vl_4.mp4 | 172 | 29.97003 | 1920 | 1080 | 5 | ABACAXI | https://libras.cin.ufpe.br/storage/videos/2021... | 1.0 | vl | ... |  |  |  |  |  |  |  | https://libras.cin.ufpe.br/sign/817 | 312.0 | 4 |
| 5 | ABACAXI_vl_5.mp4 | 135 | 29.97003 | 1920 | 1080 | 4 | ABACAXI | https://libras.cin.ufpe.br/storage/videos/2021... | 2.0 | vl | ... |  |  |  |  |  |  |  | https://libras.cin.ufpe.br/sign/817 | 312.0 | 5 |


In [109]:
video_metadata_df.columns

Index(['filename', 'frame_count', 'fps', 'width', 'height', 'duration_sec',
       'label', 'video_url', 'signer_number', 'data_source', 'file_exists',
       'letter', 'assuntos', 'acepção', 'exemplo', 'exemplo libras',
       'classe gramatical', 'origem', 'scraped_label', 'scraped_video_url',
       'sign_variant', 'signer_number.1', 'video_url_root', 'video_url_ext',
       'number_in_label', 'sign_url', 'signer_order', 'filename_index'],
      dtype='object')
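
The left merge that produced these columns keeps every video file, even if its filename has no row in the metadata CSV; such rows would silently end up with empty metadata. A small sketch of a check for that case, using the `indicator` and `validate` options of `DataFrame.merge`, is given below. The function name is hypothetical.

```python
import pandas as pd


def unmatched_filenames(video_df: pd.DataFrame, metadata_df: pd.DataFrame) -> list:
    """Return filenames present in `video_df` but missing from `metadata_df`."""
    merged = video_df.merge(
        metadata_df,
        on="filename",
        how="left",
        indicator=True,          # adds a `_merge` column describing each row's origin
        validate="one_to_one",   # raises if a filename is duplicated on either side
    )
    return merged.loc[merged["_merge"] == "left_only", "filename"].tolist()
```

An empty list means every downloaded video was matched to exactly one metadata row.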

In [111]:
video_metadata_df[[
    'filename',
    'label',
    'data_source',
    'signer_number',
    'filename_index',
    'frame_count',
    'fps',
    'duration_sec',
    'width',
    'height'
]].to_csv('data/raw/combined/target_dataset_video_metadata(sample).csv', index=False)