<a href="https://colab.research.google.com/github/juliuskoenning/deepsl/blob/main/DS/SignLanuageMSASLDownload.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. General

## Dataset:
- MS-ASL:
  - Official MS-Site: https://www.microsoft.com/en-us/research/project/ms-asl/
  - Kaggle: https://www.kaggle.com/datasets/saurabhshahane/american-sign-language-dataset

This notebook downloads the MS-ASL dataset. Only part of the dataset is downloaded per run, because downloading everything at once leads to memory issues on Google Colab. The notebook can be modified to download the validation set, the test set, or (a part of) the training set. Further preprocessing steps, such as converting the videos to frames, are optional and are currently also done during training.

# 1. Downloads

## 1.1 Download meta-dataset

In [None]:
# Connect to Google Drive. This allows the kaggle.json API token to be stored in the Drive's main folder,
#   so that it doesn't have to be uploaded every time.
# Alternatively, kaggle.json can be uploaded to /content/.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
! pip install kaggle
! mkdir -p ~/.kaggle/
# if kaggle.json is stored in the Drive
! cp drive/MyDrive/kaggle.json ~/.kaggle/
# if kaggle.json was uploaded to the root dir (/content/) instead
# ! cp kaggle.json ~/.kaggle/
# ! chmod 600 ~/.kaggle/kaggle.json
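
The token setup above can also be done from Python. The following is a minimal sketch, not part of the original notebook: `install_kaggle_token` is a hypothetical helper that copies a `kaggle.json` file into a target directory and restricts it to owner read/write, which is the effect of `chmod 600`.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src, dest_dir):
    """Copy a kaggle.json token into dest_dir and make it owner-only (mode 600).

    Hypothetical helper for illustration; returns the path of the copied file.
    """
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"No API token found at {src}")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    os.chmod(dest, 0o600)  # same effect as `chmod 600`
    return dest
```

In this notebook it could be called as `install_kaggle_token("/content/drive/MyDrive/kaggle.json", Path.home() / ".kaggle")`, assuming the token sits in the Drive's main folder.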



In [None]:
# download the MS-ASL metadata (JSON files) from Kaggle and unzip it into the ms_asl directory
! kaggle datasets download -d saurabhshahane/american-sign-language-dataset
! unzip -n -q american-sign-language-dataset.zip -d ms_asl

Downloading american-sign-language-dataset.zip to /content
 52% 1.00M/1.91M [00:00<00:00, 2.00MB/s]
100% 1.91M/1.91M [00:00<00:00, 3.31MB/s]


In [None]:
! pip install git+https://github.com/oncename/pytube # workaround: install a pytube fork whose cipher.py file has been changed
# ! pip install pytube
# Docs: https://pytube.io/en/latest/

Collecting git+https://github.com/oncename/pytube
  Cloning https://github.com/oncename/pytube to /tmp/pip-req-build-pat9gvw0
  Running command git clone --filter=blob:none --quiet https://github.com/oncename/pytube /tmp/pip-req-build-pat9gvw0
  Resolved https://github.com/oncename/pytube to commit 6c45936b9703ce986ccb8d0d3595c7974716f94b
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pytube
  Building wheel for pytube (setup.py) ... done
  Created wheel for pytube: filename=pytube-15.0.0-py3-none-any.whl size=57637 sha256=4d0a705e3b2e2ea07541ae0402029d320a7fc5162b7819ac80044b13b0fb998e
  Stored in directory: /tmp/pip-ephem-wheel-cache-egcxvu25/wheels/66/7c/fb/bd87b6a83eae32b56a81095a542e8d9722a3f73d92b6576a5b
Successfully built pytube
Installing collected packages: pytube
Successfully installed pytube-15.0.0


In [None]:
! pip install remotezip tqdm opencv-python
! pip install -q git+https://github.com/tensorflow/docs

Collecting remotezip
  Downloading remotezip-0.12.1.tar.gz (7.5 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: remotezip
  Building wheel for remotezip (setup.py) ... done
  Created wheel for remotezip: filename=remotezip-0.12.1-py3-none-any.whl size=7933 sha256=2f105bf1c0bbeeda6477ff9294363bdf9d58a04322817bf860a28070ba9ebdf0
  Stored in directory: /root/.cache/pip/wheels/fc/76/04/beed1a6df4eb7430ee13c3900746edd517e5e597298d1f73f3
Successfully built remotezip
Installing collected packages: remotezip
Successfully installed remotezip-0.12.1
  Preparing metadata (setup.py) ... done
  Building wheel for tensorflow-docs (setup.py) ... done


## 1.2 Imports

In [None]:
# Imports
import json
import os
from logging import error
from pytube import YouTube

import tqdm
import random
import pathlib
import itertools
import collections

import cv2
import numpy as np
import pandas as pd
import remotezip as rz

import tensorflow as tf

# Some modules to display an animation using imageio.
import imageio
from IPython import display
from urllib import request
from tensorflow_docs.vis import embed

In [None]:
# Data Dirs
DATA_DIR = "/content/ms_asl/"

# Video dir
VIDEO_DIR = "/content/video/"
VIDEO_CUTTED_DIR = "/content/video_cutted/"

# Classes
MSASL_classes = DATA_DIR + 'MSASL_classes.json'

# Train Files
MSASL_train = DATA_DIR + 'MSASL_train.json'

# Val Files
MSASL_val = DATA_DIR + 'MSASL_val.json'

# Test Files
MSASL_test = DATA_DIR + 'MSASL_test.json'

data_files = [MSASL_train, MSASL_val, MSASL_classes, MSASL_test]

## 1.3 Download videos and general data setup

In [None]:
# data_files is ordered as [train, val, classes, test]
def load_json(path):
  with open(path) as user_file:
    return json.load(user_file)

train_json, val_json, classes_json, test_json = [load_json(path) for path in data_files]

In [None]:
MSASL100 = classes_json[:100]
len(MSASL100)

100

In [None]:
with open(data_files[0]) as user_file:
    file_contents = user_file.read()

In [None]:
train_json[0]['label']

830

In [None]:
MSASL100_train_v2, MSASL100_val_v2, MSASL100_test_v2 = [], [], []

for x in train_json:
  if x['label'] < 100:
    MSASL100_train_v2.append(x)

for x in val_json:
  if x['label'] < 100:
    MSASL100_val_v2.append(x)

for x in test_json:
  if x['label'] < 100:
    MSASL100_test_v2.append(x)

print('Train set: ', len(MSASL100_train_v2))
print('Val set: ', len(MSASL100_val_v2))
print('Test set: ', len(MSASL100_test_v2))

Train set:  3790
Val set:  1190
Test set:  757
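
The three filter loops above follow the same pattern, so they can be expressed with one helper. This is a minimal sketch under the assumption that every entry is a dict with an integer `label` key, as in the cells above; `filter_by_label` and `clips_per_label` are hypothetical names, and the sample entries are made up.

```python
from collections import Counter


def filter_by_label(entries, num_classes=100):
    """Keep only entries whose integer 'label' is below num_classes."""
    return [e for e in entries if e["label"] < num_classes]


def clips_per_label(entries):
    """Count how many clips each label has (useful to spot class imbalance)."""
    return Counter(e["label"] for e in entries)


# Tiny made-up sample with the same 'label'/'url' keys as the MS-ASL JSON files.
sample = [
    {"label": 3, "url": "https://www.youtube.com/watch?v=aaa"},
    {"label": 830, "url": "https://www.youtube.com/watch?v=bbb"},
    {"label": 3, "url": "https://www.youtube.com/watch?v=ccc"},
]
subset = filter_by_label(sample)
print(len(subset), clips_per_label(subset))
```

With the real data this would read, for example, `filter_by_label(train_json)`.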


In [None]:
def Download(dataset, link):
    """Download the highest-resolution stream of `link` into VIDEO_DIR/<dataset>; return True on success."""
    try:
      stream = YouTube(link).streams.get_highest_resolution()
      # The file name is the part after the first "=", so trailing URL parameters (e.g. "&t") stay in the name.
      stream.download(VIDEO_DIR+dataset, filename=f'{link.split("=")[1]}.mp4')
      print("Download is completed successfully")
      return True
    except Exception:
      # Any download failure ends up here, for example when the video is no longer available.
      print("An error has occurred")
      return False
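
`link.split("=")[1]` keeps whatever follows the video id up to the next `=`, which is why file names such as `HUMEcnkvhJU&t.mp4` show up later in this notebook. A minimal sketch of a stricter alternative using only the standard library follows; `video_id_from_url` is a hypothetical helper, and the `t=12s` suffix in the example is made up. If it were swapped in, the preprocessing step would have to derive file names the same way.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Return the value of the `v` query parameter of a YouTube watch URL.

    Hypothetical helper; raises ValueError if the URL has no `v` parameter.
    """
    query = parse_qs(urlparse(url).query)
    if "v" not in query:
        raise ValueError(f"No video id found in {url!r}")
    return query["v"][0]


print(video_id_from_url("https://www.youtube.com/watch?v=N4n5CDpyX3w"))
print(video_id_from_url("https://www.youtube.com/watch?v=HUMEcnkvhJU&t=12s"))
```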

In [None]:
dataset = 'val'
link = 'https://www.youtube.com/watch?v=N4n5CDpyX3w'

youtubeObject = YouTube(link)
youtubeObject = youtubeObject.streams.get_highest_resolution()
youtubeObject.download(VIDEO_DIR+dataset, filename=f'{link.split("=")[1]}.mp4')

'/content/video/val/N4n5CDpyX3w.mp4'

In [None]:
MSASL100_test_df = pd.DataFrame(MSASL100_test_v2)
missing_video_url = set()
dataset = 'test'
for url in MSASL100_test_df['url'].unique():
  if url not in missing_video_url:
    filled = Download(dataset, url)
    if not filled:
      missing_video_url.add(url)

An error has occurred
An error has occurred
An error has occurred
An error has occurred
Download is completed successfully
Download is completed successfully
An error has occurred
An error has occurred
An error has occurred
An error has occurred
Download is completed successfully
Download is completed successfully
Download is completed successfully
Download is completed successfully
An error has occurred
Download is completed successfully
Download is completed successfully
An error has occurred
Download is completed successfully
Download is completed successfully
Download is completed successfully
An error has occurred
Download is completed successfully
An error has occurred
An error has occurred
An error has occurred
Download is completed successfully
Download is completed successfully
An error has occurred
An error has occurred
An error has occurred
An error has occurred
An error has occurred
Download is completed successfully
An error has occurred
Download is completed successfully


In [None]:
len(missing_video_url)

228

In [None]:
!mkdir -p "/content/drive/MyDrive/DeepSL/MSASL100/Test"
!cp -r "/content/video/test" "/content/drive/MyDrive/DeepSL/MSASL100/Test"


In [None]:
len(MSASL100_test_df['url'].unique())

349

In [None]:
new_test_set = []
for i in range(len(test_json)):
  url = test_json[i]['url']
  if url not in missing_video_url:
    new_test_set.append(test_json[i])
print(len(new_test_set))

3943


In [None]:
print(len(test_json))
# print(len(missing_videos))

4172
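
A Colab runtime does not keep Python variables across restarts, so `missing_video_url` has to be rebuilt by retrying every download. One way around that, sketched here as an assumption rather than as part of the original workflow, is to store the set as JSON. `save_missing_urls` and `load_missing_urls` are hypothetical helpers, and the file location is up to the user.

```python
import json


def save_missing_urls(urls, path):
    """Write the set of failed URLs to a JSON file as a sorted list."""
    with open(path, "w") as f:
        json.dump(sorted(urls), f, indent=2)


def load_missing_urls(path):
    """Read the list back as a set; return an empty set if the file is missing."""
    try:
        with open(path) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()
```

A later session could then start with `missing_video_url = load_missing_urls(path)` instead of an empty set.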


## 1.4 Preprocessing

In [None]:
!mkdir -p /content/video_new

In [None]:
!cp /content/drive/MyDrive/DeepSL/MSASL100/Test/test/* /content/video_new

In [None]:
!cp /content/drive/MyDrive/DeepSL/MSASL100/Val/val/* /content/video_new

In [None]:
!cp /content/drive/MyDrive/DeepSL/MSASL100/Train/train/* /content/video_new

In [None]:
print("Test videos:", len(os.listdir("/content/drive/MyDrive/DeepSL/MSASL100/Test/test/")))
print("Train videos:", len(os.listdir("/content/drive/MyDrive/DeepSL/MSASL100/Train/train/")))
print("Validation videos:", len(os.listdir("/content/drive/MyDrive/DeepSL/MSASL100/Val/val/")))

Test videos: 121
Train videos: 1025
Validation videos: 160


In [None]:
JSON_PATH = '/content/ms_asl/'
test_f = JSON_PATH + "/MSASL_test.json"
with open(test_f) as f:
  test_json = json.load(f)
print(len(test_json))
print(len(test_json[:100]))

4172
100


In [None]:
from IPython.display import clear_output

def print_progress_bar(iteration, total, prefix='', suffix='', length=50, fill='█'):
    # Draw a text progress bar for `iteration` out of `total` steps, then clear the cell output on the next print.
    percent = "{:.1f}".format(100 * (iteration / float(total)))
    filled_length = int(length * iteration // total)
    bar = fill * filled_length + '-' * (length - filled_length)
    print(f'\r{prefix} |{bar}| {percent}% {suffix}')
    clear_output(wait=True)
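
The formatting logic of the progress bar can be separated from the printing, which makes it easy to check outside a notebook. The following is a sketch of that idea: `format_progress_bar` is a hypothetical variant that returns the string, clamps the iteration to the valid range, and rejects a non-positive `total` (these guards are additions, not behaviour of the function above).

```python
def format_progress_bar(iteration, total, prefix="", suffix="", length=50, fill="█"):
    """Return the progress bar as a string instead of printing it."""
    if total <= 0:
        raise ValueError("total must be positive")
    done = min(max(iteration, 0), total)  # clamp to [0, total]
    filled_length = length * done // total
    bar = fill * filled_length + "-" * (length - filled_length)
    return f"{prefix} |{bar}| {100 * done / total:.1f}% {suffix}"


print(format_progress_bar(3, 10, prefix="Progress:", suffix="Complete", length=10))
```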

Reference for the preprocessing code below: CS230_DataProcessing.py
https://github.com/gerardodekay/Real-time-ASL-to-English-text-translation/blob/main/DataPreprocessing/CS230_DataProcessing.py

In [None]:
import sys
import logging
logging.basicConfig()
logging.getLogger().setLevel(logging.ERROR)
import os
import pickle
import json
import shutil
import time
import subprocess as sp
from os import listdir
from os.path import join, exists, isfile
from random import shuffle
from math import floor
from concurrent import futures

import pandas as pd
import numpy as np
import imageio_ffmpeg
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip
import moviepy.video.fx.all as vfx
from moviepy.editor import VideoFileClip
from moviepy.video.fx.all import crop
from IPython.display import clear_output

# Read the JSON files containing the video URLs and the other details required for preprocessing (start time, end time, etc.)
MSASL_trainData = pd.read_json(DATA_DIR + 'MSASL_train.json')
MSASL_valData = pd.read_json(DATA_DIR + 'MSASL_val.json')
MSASL_testData = pd.read_json(DATA_DIR + 'MSASL_test.json')
MSASL_classes = pd.read_json(DATA_DIR + 'MSASL_classes.json')
MSASL_classes.columns = ['class']

MSASL_Data = pd.concat([MSASL_trainData, MSASL_valData, MSASL_testData], ignore_index=True)

# new data frame with the url split at the first "=" to get the video name
split_df = MSASL_Data["url"].str.split("=", n = 1, expand = True)

# make a separate VideoName column from the new data frame
MSASL_Data["VideoName"]= split_df[1]

def TrimVideoClip(data_dir):
    files = [f for f in listdir(data_dir) if isfile(join(data_dir, f))]

    for file_name in files:
        fileName = (file_name[:-4])
        VideoNameDF = MSASL_Data.loc[MSASL_Data['VideoName'] == fileName] #Filter for the file name in the df
        if VideoNameDF.empty:
            continue
        # Read the start and end time for the video from the df. min() and max() are just a proxy:
        # we expect the start and end time to be the same for a given video name if multiple entries are present.
        start_time = VideoNameDF['start_time'].min()
        end_time = VideoNameDF['end_time'].max()
        # print(fileName,start_time, end_time)
        videoInput_path = data_dir + file_name
        TrimmedVideo_TargetPath = data_dir + "TrimmedVideos/"

        os.makedirs(TrimmedVideo_TargetPath, exist_ok=True)

        ffmpeg_extract_subclip(videoInput_path, start_time, end_time, targetname=TrimmedVideo_TargetPath+file_name)


def copy_split(split_json, split_name="train", crop_images: bool = True, process_only_100_top_classses: bool = True):
    split_classes = []
    split_misses = []
    dir_split_name = f"{TARGET_PATH}/{split_name}/"
    dir_split_name_cropped = f"{TARGET_PATH}/{split_name}_cropped/"

    if not os.path.exists(TARGET_PATH):
        os.mkdir(TARGET_PATH)
    if not os.path.exists(dir_split_name):
        os.mkdir(dir_split_name)
    if not os.path.exists(dir_split_name_cropped):
        os.mkdir(dir_split_name_cropped)
    for counter, t in enumerate(split_json):
        print(counter/len(split_json), end='\r')
        print_progress_bar(counter, len(split_json), prefix='Progress:', suffix='Complete', length=30)
        if (process_only_100_top_classses and t['label'] < 100) or not process_only_100_top_classses:
            url = t["url"]
            file_name = url.split("=")[1] + ".mp4"
            file_path = VIDEOS_PATH + "/" + file_name
            target_dir = dir_split_name + t["clean_text"] + "/"
            target_dir_cropped = dir_split_name_cropped + t["clean_text"] + "/"
            target_path = target_dir + file_name
            target_path_cropped = target_dir_cropped + file_name

            if os.path.exists(file_path):
                if not os.path.exists(target_dir):
                    os.mkdir(target_dir)
                if not os.path.exists(target_path):
                    split_classes.append(t["clean_text"])
                    start_time = str(t["start_time"])
                    end_time = str(t["end_time"])
                    ffmpeg_path = imageio_ffmpeg.get_ffmpeg_exe()
                    sp.call([ffmpeg_path, '-ss', start_time, '-to', end_time, '-i', file_path, target_path, '-y'])
                    # ffmpeg_extract_subclip(file_path, start_time, end_time, targetname=target_path)


            if os.path.exists(target_path):
                if not os.path.exists(target_dir_cropped):
                    os.mkdir(target_dir_cropped)
                if crop_images:
                    # print(target_path)
                    if not os.path.exists(target_path_cropped):
                        clip = VideoFileClip(target_path)
                        y1, x1, y2, x2 = t["box"]
                        w = clip.w
                        h = clip.h
                        x1, x2 = x1 * w, x2 * w
                        y1, y2 = y1 * h, y2 * h
                        # print(x1)
                        new_clip = crop(clip, x1=x1, y1=y1, x2=x2, y2=y2)
                        new_clip.write_videofile(target_path_cropped, codec='mpeg4', audio=False, logger=None)
            else:
                split_misses.append((file_name, url))
    return split_classes, split_misses


def split_data_partly(number_of_splits: int = 10):
    train_f = JSON_PATH + "/MSASL_train.json"
    classes_f = JSON_PATH + "/MSASL_classes.json"
    test_f = JSON_PATH + "/MSASL_test.json"
    val_f = JSON_PATH + "/MSASL_val.json"

    with open(train_f) as f:
        train_json = json.load(f)
    with open(classes_f) as f:
        classes_json = json.load(f)
    with open(test_f) as f:
        test_json = json.load(f)
    with open(val_f) as f:
        val_json = json.load(f)

    for i in range(number_of_splits):
      items_per_batch = len(train_json)//number_of_splits
      start_time = time.time()

      train_classes, train_misses = copy_split(train_json[i*items_per_batch:(i+1)*items_per_batch], "train")
      # val_classes, val_misses = copy_split(val_json, "val")
      # test_classes, test_misses = copy_split(test_json, "test")
      end_time = time.time()
      print(f"Took {end_time - start_time} seconds to execute")

#     {train_classes}, {train_misses}
# Validation:
# {val_classes}, {val_misses}

      print(f"""Train:
  {train_classes}, {train_misses}
  Processed videos:
  {len(train_classes)}, {len(train_misses)}
      """)
      # make_archive appends ".zip" itself, so the base name must not include the extension
      shutil.make_archive(f"drive/MyDrive/SignLanguagev2/TrainMSData{i}", 'zip', "/content/MSData/")
      shutil.rmtree("/content/MSData")

def split_data():
    train_f = JSON_PATH + "/MSASL_train.json"
    classes_f = JSON_PATH + "/MSASL_classes.json"
    test_f = JSON_PATH + "/MSASL_test.json"
    val_f = JSON_PATH + "/MSASL_val.json"

    with open(train_f) as f:
        train_json = json.load(f)
    with open(classes_f) as f:
        classes_json = json.load(f)
    with open(test_f) as f:
        test_json = json.load(f)
    with open(val_f) as f:
        val_json = json.load(f)

    start_time = time.time()
    # train_classes, train_misses = copy_split(train_json, "train")
    val_classes, val_misses = copy_split(val_json, "val")
    # test_classes, test_misses = copy_split(test_json, "test")
    end_time = time.time()
    print(f"Took {end_time - start_time} seconds to execute")

#     {train_classes}, {train_misses}
# Validation:
# {val_classes}, {val_misses}

    print(f"""Train:
Test:
{val_classes}, {val_misses}
Processed videos:
{len(val_classes)}, {len(val_misses)}
    """)

if __name__ == '__main__':
    TARGET_PATH = '/content/MSData/'
    JSON_PATH = '/content/ms_asl/'
    VIDEOS_PATH = '/content/video_new/'
    TrimmedVideos_PATH = VIDEOS_PATH + "TrimmedVideos/"

    # TrimVideoClip(VIDEOS_PATH)
    split_data()

Took 3622.7983288764954 seconds to execute
Train:
Test:
['big', 'bathroom', 'table', 'walk', 'here', 'white', 'black', 'yellow', 'orange', 'red', 'pink', 'blue', 'green', 'brown', 'no', 'doctor', 'nurse', 'teacher', 'grandmother', 'sign', 'night', 'happy', 'sad', 'hungry', 'tired', 'sick', 'like', 'want', 'know', 'how', 'computer', 'family', 'go', 'where', 'lost', 'drink', 'milk', 'milk', 'finish', 'green', 'woman', 'man', 'lost', 'forget', 'happy', 'sad', 'like', 'work', 'same', 'different', 'white', 'tired', 'forget', 'nice', 'student', 'work', 'school', 'doctor', 'nurse', 'family', 'grandmother', 'grandfather', 'cousin', 'water', 'boy', 'girl', 'sister', 'brother', 'cousin', 'mother', 'father', 'family', 'what', 'fish', 'play', 'computer', 'eat', 'yes', 'school', 'help', 'beautiful', 'big', 'deaf', 'happy', 'hearing', 'hello', 'tired', 'you', 'pencil', 'bad', 'sad', 'forget', 'finish', 'fine', 'sit', 'table', 'deaf', 'hearing', 'woman', 'girl', 'man', 'boy', 'teacher', 'learn', 'stu
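
`split_data_partly` computes `len(train_json)//number_of_splits` items per batch, so when the length is not divisible by the number of splits, the trailing entries are never passed to `copy_split`. A minimal sketch of batch boundaries that cover every item is shown below; `chunk_bounds` is a hypothetical helper, not part of the original code.

```python
def chunk_bounds(n_items, n_chunks):
    """Yield (start, stop) index pairs that cover range(n_items) in n_chunks parts.

    The first n_items % n_chunks chunks get one extra item, so nothing is dropped.
    """
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(n_items, n_chunks)
    start = 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        yield start, stop
        start = stop


print(list(chunk_bounds(10, 3)))  # [(0, 4), (4, 7), (7, 10)]
```

Inside the batch loop, the slice would then be `train_json[start:stop]` for each pair.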

In [None]:
len(os.listdir("/content/MSData/val_cropped/"))

105
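
The cropping step in `copy_split` unpacks `t["box"]` as `y1, x1, y2, x2` and multiplies the values by the clip width and height, which implies that the box is stored in normalized coordinates. The sketch below isolates that conversion; `box_to_pixels` is a hypothetical helper, and the rounding to integers and the clamping to the frame are additions that the original code does not perform.

```python
def box_to_pixels(box, width, height):
    """Convert a normalized (y1, x1, y2, x2) box to integer pixel coordinates.

    Returns (x1, y1, x2, y2), clamped to the frame.
    """
    y1, x1, y2, x2 = box

    def scale(value, size):
        # Clamp to [0, 1] first so slightly out-of-range boxes stay inside the frame.
        return int(round(min(max(value, 0.0), 1.0) * size))

    return scale(x1, width), scale(y1, height), scale(x2, width), scale(y2, height)


print(box_to_pixels([0.1, 0.25, 0.9, 0.75], width=640, height=360))  # (160, 36, 480, 324)
```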

In [None]:
shutil.rmtree("/content/MSData")

/content/MSData//train/day/my4mxg6lXYQ.mp4
my4mxg6lXYQ

In [None]:
def convert_to_frames(input_data_path, word_count, input_type):
    """
    Takes the raw input dataset and converts each video into frames.
    Note: word_count is currently unused (see the commented-out slice below).
    """
    # change image_data to write the frames of a different conversion elsewhere
    image_data = os.path.join("MS_Data_Pictures", "image_data_" + input_type)
    if (not exists(image_data)):
        os.makedirs(image_data)

    frame_count = 0

    input_dir = os.path.join(input_data_path, input_type)

    # Get all files with raw data for words, only keep how many you want
    gesture_list = os.listdir(input_dir)
    #gesture_list = gesture_list[:word_count]
    print(gesture_list)
    for gesture in gesture_list:
        gesture_path = os.path.join(input_dir, gesture)
        #print("gesture_path: ",gesture_path)

        # Create directory to store images
        frames = os.path.join(image_data, gesture)
        if(not os.path.exists(frames)):
            os.makedirs(frames)
        #print("frames", frames)

        videos = os.listdir(gesture_path)
        videos = [video for video in videos if(os.path.isfile(gesture_path + '/' +  video))]
        # print(videos)
        for video in videos:
            video_name = video[:-4] #removing .mp4 from the video name
            #print("video_name: ", video_name)
            vidcap = cv2.VideoCapture(gesture_path + '/' +  video)
            success,image = vidcap.read()
            frame_count = 0
            # if success:
            #   print("reading video sucessful")
            # else:
            #   print(success)
            while success:
              # image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) # to convert image to grayscale
              cv2.imwrite(f"{image_data}/{gesture}/{video_name}_frame{frame_count}.jpg", image)     # save frame as JPEG file
              success,image = vidcap.read()
              # print('read a new frame: ', success)
              frame_count += 1
            vidcap.release()  # free the capture handle before opening the next video

In [None]:
# convert_to_frames("MSData/",10,"test_cropped")
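
`convert_to_frames` names its output `{video_name}_frame{frame_count}.jpg`. With unpadded counters, alphabetical sorting puts `frame10` before `frame2`, which matters if a later step reads frames with `sorted(os.listdir(...))`. A possible fix, sketched here as an assumption about downstream use, is to zero-pad the index; `frame_filename` and its `digits` parameter are hypothetical.

```python
def frame_filename(video_name, frame_index, digits=5):
    """Build a frame file name whose alphabetical order matches frame order."""
    return f"{video_name}_frame{frame_index:0{digits}d}.jpg"


names = [frame_filename("N4n5CDpyX3w", i) for i in (10, 2, 1)]
print(sorted(names))
```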

## 1.5 Zip and store data


In [None]:
!mkdir drive/MyDrive/SignLanguagev2/

mkdir: cannot create directory ‘drive/MyDrive/SignLanguagev2/’: File exists


In [None]:
!zip -r drive/MyDrive/SignLanguagev2/MSData.zip /content/MSData

In [None]:
!zip -r drive/MyDrive/SignLanguagev2/ValMSData.zip /content/MSData

  adding: content/MSData/ (stored 0%)
  adding: content/MSData/val_cropped/ (stored 0%)
  adding: content/MSData/val_cropped/like/ (stored 0%)
  adding: content/MSData/val_cropped/like/cxXEULq9Jpc.mp4 (deflated 11%)
  adding: content/MSData/val_cropped/like/CxTSVyM-ij0.mp4 (deflated 7%)
  adding: content/MSData/val_cropped/like/EyWuC4JL4Tk.mp4 (deflated 8%)
  adding: content/MSData/val_cropped/like/GmxS5HkNc3o.mp4 (deflated 3%)
  adding: content/MSData/val_cropped/like/HUMEcnkvhJU&t.mp4 (deflated 13%)
  adding: content/MSData/val_cropped/like/1bj72qXSy8c.mp4 (deflated 7%)
  adding: content/MSData/val_cropped/like/0qeMFifNqC4.mp4 (deflated 24%)
  adding: content/MSData/val_cropped/like/N4n5CDpyX3w.mp4 (deflated 19%)
  adding: content/MSData/val_cropped/girl/ (stored 0%)
  adding: content/MSData/val_cropped/girl/0bIF7jh6lnE.mp4 (deflated 1%)
  adding: content/MSData/val_cropped/girl/KLD5f35qEDI&t.mp4 (deflated 11%)
  adding: content/MSData/val_cropped/girl/FRQbKDZeRzM.mp4 (deflated 7%)
 

In [None]:
!zip -r drive/MyDrive/SignLanguage/MSDataPictures.zip /content/MS_Data_Pictures
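
The zip steps can also be run from Python with `shutil.make_archive`. The sketch below uses a temporary directory and a made-up file so that it runs anywhere; `archive_directory` is a hypothetical wrapper. Two points are worth noting: `make_archive` appends the `.zip` extension itself, and with `root_dir` the archive contains paths relative to that directory, unlike `zip -r ... /content/MSData`, which stores the `content/MSData/` prefix.

```python
import shutil
import tempfile
from pathlib import Path


def archive_directory(source_dir, archive_base):
    """Zip source_dir into '<archive_base>.zip' and return the archive path.

    shutil.make_archive appends the extension itself, so archive_base
    must not end in '.zip'.
    """
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(source_dir))


# Self-contained demo on a temporary directory (the file name is made up).
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "MSData"
    data_dir.mkdir()
    (data_dir / "example.txt").write_text("placeholder")
    archive_path = archive_directory(data_dir, Path(tmp) / "ValMSData")
    print(Path(archive_path).name)
```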