# Video Scraping from Links
- This notebook assumes the labels and video links have already been collected

- Downloading all videos takes a long time, so we will only scrape selected videos

- With this notebook, we can use `download_videos.py` to:
    1. Download all videos for one word from one data source 
        - (for review during the data organisation and cleaning)
    2. Download all videos for one word from all data sources 
        - (for review during the data organisation and cleaning)
    3. Download all videos for a collection of words from all data sources
        - (for creating our raw combined dataset, after we have decided our target words)

In [1]:
import pathlib
import os
import pandas as pd
import requests
import download_videos as dv

In [2]:
# change working directory to the project root directory
current_dir = os.getcwd()
os.chdir(current_dir + '/../../')
# this should be the project root directory
os.getcwd()

'/home/ben/projects/SaoPauloBrazilChapter_BrazilianSignLanguage'
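
Because the cell above changes directory relative to the current one, running it a second time moves two more levels up. A minimal sketch of an alternative that is safe to re-run, assuming the project root can be recognised by its `data/` directory. The helper `find_project_root` and the choice of marker are hypothetical, not part of the project:

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


# Usage in the notebook (uncomment to apply):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Once the working directory is the project root, the search returns the same directory again, so re-running the cell has no further effect.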

## *for review during the data organisation and cleaning*

In [3]:
# use the unclean metadata csv file for now
metadata_df = pd.read_csv('data/raw/combined/metadata_combined_unclean.csv')

### 1. Download all videos for one word from one data source

Set the data source

In [4]:
# choose from 'INES', 'V-Librasil', 'SignBank' or 'UFV' using the keys 'ne', 'vl', 'sb' or 'uf'
data_source_key = 'ne'

Set the label for the word/video

In [5]:
# choose a word that exists in the metadata csv file
label = 'ABSOLVER'

Download the video
- during the review stage, videos will be downloaded to the `data/raw/{data_source}/videos/` folder
- during the creation of the (raw) combined dataset, videos will be downloaded to the `data/raw/combined/videos/` folder

Download the video(s) to that data source's `videos/` folder, with the filename `{label}_{data_source_key}_{i}.mp4`
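
The paths logged in this notebook suggest how the output location is put together. A minimal sketch of that layout, where `build_video_path` and `SOURCE_FOLDERS` are hypothetical names for illustration and not the actual API of `download_videos.py`. Only the INES folder name appears in the logged output; the other three folder names are assumptions:

```python
from pathlib import Path

# Data source key -> folder name under data/raw/.
# Only "INES" is confirmed by the notebook output; the rest are assumed
# to match the data source names.
SOURCE_FOLDERS = {
    "ne": "INES",
    "vl": "V-Librasil",
    "sb": "SignBank",
    "uf": "UFV",
}


def build_video_path(label, data_source_key, index, combined=False, root="data/raw"):
    """Return the expected path of a downloaded video."""
    # Review stage: per-source folder. Combined dataset: shared folder.
    folder = "combined" if combined else SOURCE_FOLDERS[data_source_key]
    filename = f"{label}_{data_source_key}_{index}.mp4"
    return Path(root) / folder / "videos" / filename


print(build_video_path("ABSOLVER", "ne", 1).as_posix())
print(build_video_path("ABACATE", "ne", 1, combined=True).as_posix())
```

The two printed paths correspond to the review-stage and combined-dataset locations that appear in the download logs.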

In [6]:
dv.download_videos_from_metadata(
    label = label,
    metadata = metadata_df,
    data_source_key = data_source_key,
    verbose=True,
    verify_ssl=True
)

Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/absolverSm_Prog001.mp4
Video successfully downloaded to data/raw/INES/videos/ABSOLVER_ne_1.mp4



### 2. Download all videos for one word from all data sources

In [7]:
label = 'ABACATE'

In [8]:
dv.download_videos_from_metadata(
    label = label,
    metadata = metadata_df,
    data_source_key = None,
    verbose=True,
    verify_ssl=True
)

Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/abacateSm_Prog001.mp4
Video successfully downloaded to data/raw/INES/videos/ABACATE_ne_1.mp4

Downloading video 2 from https://videos.nals.cce.ufsc.br/SignBank/Vídeos/ABACATE.mp4
Error downloading video from https://videos.nals.cce.ufsc.br/SignBank/Vídeos/ABACATE.mp4: HTTPSConnectionPool(host='videos.nals.cce.ufsc.br', port=443): Max retries exceeded with url: /SignBank/V%C3%ADdeos/ABACATE.mp4 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))
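
The SignBank request fails because the server's certificate could not be verified against the locally available CA certificates. How `download_videos.py` handles `verify_ssl` internally is not shown in this notebook. As a rough sketch, a flag like this would typically be forwarded to the `verify` argument of `requests.get`. The helper `download_file` and its `get` parameter (which lets a stand-in replace `requests.get`) are hypothetical:

```python
from pathlib import Path


def download_file(url, dest, verify_ssl=True, get=None, timeout=30):
    """Stream `url` to `dest`, optionally skipping certificate verification."""
    if get is None:
        import requests  # imported lazily so a stand-in `get` can be used instead
        get = requests.get
    # verify=False turns off certificate checking for this request only.
    response = get(url, stream=True, verify=verify_ssl, timeout=timeout)
    response.raise_for_status()
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with open(dest, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    return dest
```

Setting `verify_ssl=False` removes the protection against a tampered connection, so it is only reasonable for hosts you already trust. `requests` also accepts a path to a CA bundle as the `verify` value, which keeps verification on.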



## *for creating our combined raw dataset, after we have decided our target words*

### 3. Download all videos for a collection of words from all data sources

Define the list of target words

In [9]:
target_words = ['ABACATE', 'ABSOLVER']

Download the videos

In [12]:
for word in target_words:
    print('-----')
    print(f'Downloading videos for {word}')
    print('-----')
    dv.download_videos_from_metadata(
        label = word,
        metadata = metadata_df,
        combined = True,
        verbose=True,
        verify_ssl=True)

-----
Downloading videos for ABACATE
-----
Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/abacateSm_Prog001.mp4
Video successfully downloaded to data/raw/combined/videos/ABACATE_ne_1.mp4

Downloading video 2 from https://videos.nals.cce.ufsc.br/SignBank/Vídeos/ABACATE.mp4
Error downloading video from https://videos.nals.cce.ufsc.br/SignBank/Vídeos/ABACATE.mp4: HTTPSConnectionPool(host='videos.nals.cce.ufsc.br', port=443): Max retries exceeded with url: /SignBank/V%C3%ADdeos/ABACATE.mp4 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))

-----
Downloading videos for ABSOLVER
-----
Downloading video 1 from https://www.ines.gov.br/dicionario-de-libras/public/media/palavras/videos/absolverSm_Prog001.mp4
Video successfully downloaded to data/raw/combined/videos/ABSOLVER_ne_1.mp4
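
Since some downloads fail (the SignBank video for ABACATE in the log above), it can help to check which target words actually ended up with files on disk. A minimal sketch, assuming the `{label}_{data_source_key}_{i}.mp4` naming seen in the logged paths; `count_downloaded_videos` is a hypothetical helper, not part of `download_videos.py`:

```python
from pathlib import Path


def count_downloaded_videos(videos_dir, target_words):
    """Count the .mp4 files present in `videos_dir` for each target word."""
    counts = {word: 0 for word in target_words}
    for path in Path(videos_dir).glob("*.mp4"):
        # Split from the right so labels that contain underscores stay intact:
        # "ABACATE_ne_1" -> ["ABACATE", "ne", "1"]
        parts = path.stem.rsplit("_", 2)
        if len(parts) == 3 and parts[0] in counts:
            counts[parts[0]] += 1
    return counts


# Usage in the notebook:
# count_downloaded_videos("data/raw/combined/videos", target_words)
```

Words with a count of zero, or a count lower than the number of links in the metadata, are the ones to retry.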

