# ASL Dataset Analysis

The ASLG-PC12 dataset consists of ASL gloss and English sentence pairs.
The ASLLVD dataset has video segments of sign terms produced by 6 unique signers. We do some exploratory analysis of common words and phrases in the ASLG-PC12 dataset and look at how much of the ASLG-PC12 vocabulary is covered by ASLLVD.

In [0]:
import pandas as pd
import numpy as np

In [0]:
# Download 2 datasets
!git clone https://github.com/imatge-upc/speech2signs-2017-nmt.git # ASLG-PC12 dataset
!wget http://www.bu.edu/asllrp/dai-asllvd-BU_glossing_with_variations_HS_information-extended-urls-RU.xlsx # ASLLVD dataset

In [0]:
# glosses is the gloss vocabulary from ASLLVD. There are 4164 vocab words
glosses_data = pd.read_excel("dai-asllvd-BU_glossing_with_variations_HS_information-extended-urls-RU.xlsx")
removed = ['============', False]
glosses = glosses_data["Gloss Variant"].dropna().unique()
glosses = [gloss for gloss in glosses if gloss not in removed]
glosses = [gloss.rstrip("+") for gloss in glosses] # a trailing "+" signifies that the sign was repeated several times in the video segment
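
As a small illustration of the cleanup above, here is a minimal sketch on made-up values (the entries in `raw_glosses` are hypothetical and not taken from the spreadsheet). It also shows that stripping the trailing "+" can produce duplicates, which may matter if the list length is used as the vocabulary size; the `dict.fromkeys` deduplication step is an addition, not part of the cell above.

```python
# Hypothetical raw values standing in for the "Gloss Variant" column.
raw_glosses = ["BOOK", "BOOK+", "WALK++", "============", False, "IX-1p"]
removed = ["============", False]

cleaned = [g for g in raw_glosses if g not in removed]
cleaned = [g.rstrip("+") for g in cleaned]  # rstrip drops every trailing "+"

# "BOOK" and "BOOK+" now collide, so deduplicate while keeping first-seen order.
unique_glosses = list(dict.fromkeys(cleaned))
print(unique_glosses)
```

The coverage computation later in the notebook converts `glosses` to a set, so duplicates like these would not change the coverage numbers.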

## Clean up training corpus

In [0]:
with open("speech2signs-2017-nmt/ASLG-PC12/ENG-ASL_Train_0.046.asl") as file_obj:
  separated_lines = file_obj.readlines()

lines = "".join(separated_lines)
separated_lines = [line.split() for line in separated_lines]

# Possible phrase-level mapping (not applied here): WOULD LIKE -> WANT
removed_words = [".", ",", "?", "BE", "TO", "MR"] # These aren't ASL gloss terms

# ASLLVD and ASLG-PC12 use different gloss conventions for some terms
replacement_mapping = {
    "X-I": "IX-1p",
    "X-WE": "IX-1p-pl-arc",
    "X-IT": "IX:i",
    "X-HE": "IX:i",
    "X-YOU": "IX-2p",
    "X-Y": "IX-2p", # X-Y is "you" in the general sense, so this mapping may be inaccurate
    "THIS": "IX:i",
    "EU": "ns-EUROPE",
    "EUROPE": "ns-EUROPE",
    "EUROPEAN": "ns-EUROPE", # This might not be right to assume
    "WILL": "FUTURE",
    "NEED": "SHOULD",
    "DESC-NOT": "NOT",
    "DESC-ALSO": "ALSO"
}

# Drop non-gloss tokens, apply the mapping, then remove the DESC- prefix from the remaining terms
words = lines.split()
words = [word for word in words if word not in removed_words]
words = [word if word not in replacement_mapping else replacement_mapping[word] for word in words]
words = [word if not word.startswith("DESC-") else word[5:] for word in words]
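
To make the order of the three cleanup steps concrete, here is a minimal sketch that applies them to a single made-up gloss line. The `normalize` helper and the `sample` sentence are hypothetical, and only a subset of the mapping above is copied in.

```python
removed_words = [".", ",", "?", "BE", "TO", "MR"]
# Subset of the replacement mapping from the cell above.
replacement_mapping = {"X-I": "IX-1p", "X-YOU": "IX-2p", "DESC-NOT": "NOT"}

def normalize(tokens):
    # 1. drop punctuation and non-gloss terms
    tokens = [t for t in tokens if t not in removed_words]
    # 2. map ASLG-PC12 conventions onto ASLLVD conventions
    tokens = [replacement_mapping.get(t, t) for t in tokens]
    # 3. strip the DESC- prefix from whatever still carries it
    return [t[len("DESC-"):] if t.startswith("DESC-") else t for t in tokens]

# Made-up example line, not taken from the corpus.
sample = "X-I DESC-NOT WANT TO THANK X-YOU DESC-VERY MUCH .".split()
normalized = normalize(sample)
print(normalized)
```

Because the mapping runs before the prefix is stripped, explicit entries such as `DESC-NOT` take precedence; here the generic prefix removal would give the same result for them.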

## Word and trigram count

In [0]:
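from collections import Counter

import nltk  # needed for nltk.word_tokenize below
from nltk import ngrams

def extract_phrases(text, phrase_counter, length):
    # Tokenize the text and count every n-gram of the given length.
    # nltk.word_tokenize requires NLTK's Punkt tokenizer data to be installed.
    words = nltk.word_tokenize(text)
    for phrase in ngrams(words, length):
        phrase_counter[phrase] += 1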
                
phrase_counter = Counter()
extract_phrases(" ".join(words).replace(":", "-"), phrase_counter, 3)

word_counter = Counter()
extract_phrases(" ".join(words).replace(":", "-"), word_counter, 1)
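
Note that the cell above counts n-grams over the whole corpus joined into one string, so some trigrams can span the boundary between two consecutive lines. If sentence-internal phrases are preferred, one option is to count per line. The following is a minimal sketch of that alternative using only the standard library; `count_ngrams` and the toy sentences are hypothetical and not part of the original analysis.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams within each token list, never across lists."""
    counter = Counter()
    for tokens in sentences:
        # zip over n shifted views of the list yields consecutive n-grams
        for gram in zip(*(tokens[i:] for i in range(n))):
            counter[gram] += 1
    return counter

# Made-up, already-cleaned gloss lines.
toy_sentences = [
    ["IX-1p", "WOULD", "LIKE", "THANK", "IX-2p"],
    ["IX-1p", "WOULD", "LIKE", "VOTE"],
]
trigram_counts = count_ngrams(toy_sentences, 3)
print(trigram_counts.most_common(1))
```

With this variant, a trigram such as `('THANK', 'IX-2p', 'IX-1p')`, which only exists across the line break, is not counted.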

In [34]:
most_common_words = word_counter.most_common(25)
for k,v in most_common_words:
    print('{0: <5}'.format(v), k)

28427 ('IX-i',)
21806 ('IX-1p-pl-arc',)
19799 ('IN',)
19602 ('AND',)
15421 ('THAT',)
14813 ('IX-1p',)
14255 ('HAVE',)
11636 ('FOR',)
10883 ('ns-EUROPE',)
10665 ('IX-2p',)
9168  ('NOT',)
8370  ('ON',)
7318  ('FUTURE',)
6801  ('SHOULD',)
5183  ('WITH',)
5027  ('DO',)
4464  ('MUST',)
4280  ('AS',)
4163  ('ALSO',)
3682  ('CAN',)
3599  ('RE',)
3590  ('VOTE',)
3552  ('BY',)
3505  ('WOULD',)
3384  ('AT',)


In [35]:
most_common_phrases = phrase_counter.most_common(25)
for k,v in most_common_phrases:
    print('{0: <5}'.format(v), k)

1239  ('IX-1p', 'WOULD', 'LIKE')
482   ('SIT', 'SEE', 'MINUTE')
470   ('IX-1p', 'BELIEVE', 'THAT')
421   ('IX-1p-pl-arc', 'CAN', 'NOT')
409   ('VOTE', 'IN', 'FAVOR')
370   ('LADY', 'AND', 'GENTLEMAN')
361   ('FUTURE', 'TAKE', 'PLACE')
360   ('IX-1p', 'DO', 'NOT')
333   ('VOTE', 'FUTURE', 'TAKE')
329   ('IX-1p', 'THINK', 'THAT')
320   ('THANK', 'IX-2p', 'FOR')
319   ('IX-1p-pl-arc', 'DO', 'NOT')
309   ('CUT', 'OFF', 'SPEAKER')
308   ('PRESIDENT', 'CUT', 'OFF')
300   ('CLOSE', 'VOTE', 'FUTURE')
281   ('THAT', 'IX-1p-pl-arc', 'HAVE')
274   ('DEBATE', 'CLOSE', 'VOTE')
257   ('WRITE', 'STATEMENT', 'RULE')
254   ('WOULD', 'LIKE', 'THANK')
248   ('THAT', 'IX-1p-pl-arc', 'SHOULD')
241   ('MADAM', 'PRESIDENT', 'IX-1p')
241   ('IX-1p', 'HOPE', 'THAT')
225   ('IN', 'IX-i', 'REGARD')
224   ('THANK', 'IX-2p', 'VERY')
224   ('IX-2p', 'VERY', 'MUCH')


## ASLLVD vocabulary coverage of training corpus

In [0]:
# Count occurrences of each cleaned word (Counter is imported in the n-gram cell above)
word_count = Counter(words)

In [37]:
glosses_set = set(glosses)
sorted_word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)

distinct_words = len(sorted_word_count)
total_words = len(words)
word_overlap = [count for word, count in word_count.items() if word in glosses_set]
distinct_word_overlap = len(word_overlap)
total_word_overlap = sum(word_overlap)

print("Vocabulary coverage in ASLG-PC12: {}".format(distinct_word_overlap / distinct_words))
print("Word coverage in ASLG-PC12: {}".format(total_word_overlap / total_words))

Vocabulary coverage in ASLG-PC12: 0.07809234277028311
Word coverage in ASLG-PC12: 0.5590735139884557
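
The two numbers measure different things: vocabulary coverage is the fraction of distinct words found in ASLLVD, while word coverage weights each word by how often it occurs. A minimal sketch on made-up data (the names `toy_words` and `toy_glosses` are hypothetical) shows how a small covered vocabulary can still account for a large share of the running text:

```python
from collections import Counter

# Made-up corpus and gloss vocabulary for illustration only.
toy_words = ["IX-1p", "VOTE", "IX-1p", "COMMISSION", "IX-1p", "DEBATE"]
toy_glosses = {"IX-1p", "VOTE"}

counts = Counter(toy_words)
covered = [c for w, c in counts.items() if w in toy_glosses]

vocab_coverage = len(covered) / len(counts)    # 2 of 4 distinct words
word_coverage = sum(covered) / len(toy_words)  # 4 of 6 occurrences
print(vocab_coverage, word_coverage)
```

The same effect appears in the real numbers above: only about 8% of distinct words are covered, yet they make up about 56% of all word occurrences, because the covered words include very frequent terms.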


In [38]:
# Words among the 100 most frequent that are not in the ASLLVD vocabulary
[(word, count) for word, count in sorted_word_count[:100] if word not in glosses_set]

[('AS', 4280),
 ('RE', 3582),
 ('BY', 3552),
 ('WOULD', 3505),
 ('AT', 3384),
 ('SE', 3015),
 ('X-POSS', 2949),
 ('OR', 2833),
 ('COMMISSION', 2700),
 ('HOWEVER', 2210),
 ('UNION', 2175),
 ('REFORE', 2015),
 ('DEBATE', 1804),
 ('X-MY', 1800),
 ('NO', 1769),
 ('X-ITS', 1751),
 ('SO', 1625),
 ('ISSUE', 1589),
 ('COUNCIL', 1514),
 ('THANK', 1510),
 ('ONLY', 1508),
 ('CRISIS', 1380),
 ('POINT', 1233),
 ('AGREEMENT', 1222),
 ('RIGHTS', 1165),
 ('AREA', 1153),
 ('TODAY', 1143),
 ('PROPOSAL', 1091),
 ('BELIEVE', 1082),
 ('SUCH', 1041)]