# ASL Playground
In this notebook we experiment with the data structures used in the ASL project.

## ASL Dataset
First, let's look at `asl_data.py`, which defines:

* `AslDb`: a database for ASL
* `WordsData`: a data structure that provides loading and getters for the ASL database in a format suitable for `hmmlearn`
* `SinglesData`: similar to `WordsData`

In [2]:
import numpy as np
import pandas as pd
from asl_data import AslDb


asl = AslDb() # initializes the database
asl.df.head() # displays the first five rows of the asl database, indexed by video and frame

video,frame,left-x,left-y,right-x,right-y,nose-x,nose-y,speaker
98,0,149,181,170,175,161,62,woman-1
98,1,149,181,170,175,161,62,woman-1
98,2,149,181,170,175,161,62,woman-1
98,3,149,181,170,175,161,62,woman-1
98,4,149,181,170,175,161,62,woman-1


Looking at the code, we see that
* `AslDb` simply loads data from a CSV file and stores it in a `pd.DataFrame` indexed by [`video`, `frame`]
* `AslDb` can build training and testing data, represented by `WordsData` and `SinglesData` respectively

Note that `asl.df` is easy to modify since it is a `pd.DataFrame`, e.g. we can compute each hand's position relative to the nose:

In [3]:
asl.df['grnd-ry'] = asl.df['right-y'] - asl.df['nose-y']
asl.df['grnd-rx'] = asl.df['right-x'] - asl.df['nose-x']
asl.df['grnd-ly'] = asl.df['left-y']  - asl.df['nose-y']
asl.df['grnd-lx'] = asl.df['left-x']  - asl.df['nose-x']
asl.df.head()

video,frame,left-x,left-y,right-x,right-y,nose-x,nose-y,speaker,grnd-ry,grnd-rx,grnd-ly,grnd-lx
98,0,149,181,170,175,161,62,woman-1,113,9,119,-12
98,1,149,181,170,175,161,62,woman-1,113,9,119,-12
98,2,149,181,170,175,161,62,woman-1,113,9,119,-12
98,3,149,181,170,175,161,62,woman-1,113,9,119,-12
98,4,149,181,170,175,161,62,woman-1,113,9,119,-12


To build the training data, we need to provide a `feature_list`, i.e. the column names to be used as features, e.g.

In [4]:
features_ground = ['grnd-rx','grnd-ry','grnd-lx','grnd-ly']
training = asl.build_training(features_ground)

Above, `training` is an instance of `WordsData` with the following members
* `self._data`: a dictionary mapping each word to a list of sequences, where each sequence holds the observed feature values for one occurrence of that word
* `self._hmm_data`: for each word, the observed sequences concatenated into one sequence, together with the length of each original sequence, e.g.

      'I' : [[1,2,3], [4,5]] will become ([1,2,3,4,5], [3, 2])

  This is the format that `hmmlearn` expects.
* `self.num_items`: number of words
* `self.words`: list of all words
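
The concatenation described above can be sketched in a few lines. This is a minimal illustration and not the actual code in `asl_data.py`; the helper name `to_hmm_format` is hypothetical.

```python
def to_hmm_format(sequences):
    """Concatenate a list of sequences and record each sequence's length.

    NOTE: hypothetical helper for illustration; not part of asl_data.py.
    """
    # Length of every original sequence, in order.
    lengths = [len(seq) for seq in sequences]
    # All frames of all sequences, laid end to end.
    X = [frame for seq in sequences for frame in seq]
    return X, lengths


X, lengths = to_hmm_format([[1, 2, 3], [4, 5]])
print(X, lengths)  # [1, 2, 3, 4, 5] [3, 2]
```

In the real data each frame is a list of feature values rather than a single number, so `X` has one row per frame. The `lengths` list is what lets `hmmlearn` split `X` back into the original sequences.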

In [5]:
word0 = training.words[0]
print('word0:            {}'.format(word0))
print('number of seq for {} = {}\n'.format(word0, len(training.get_word_sequences(word0))))
X, lengths = training.get_word_Xlengths(word0)
print('concate seq has len   = {}'.format(len(X)))
print('first 10-lengths      = {}'.format(lengths[:10]))
print('mean/min/max length   = {:.3f}/{}/{}'.format(np.mean(lengths), np.min(lengths), np.max(lengths)))

word0:            JOHN
number of seq for JOHN = 113

concate seq has len   = 1189
first 10-lengths      = [10, 12, 12, 22, 14, 16, 9, 9, 9, 7]
mean/min/max length   = 10.522/5/27


Let's look at the testing dataset, which can be created via `AslDb.build_test` given a `feature_list`.

In [8]:
test_set =  asl.build_test(features_ground)

Above, `test_set` is an instance of `SinglesData` with the following members
* `self.df`: a `pd.DataFrame` loaded from a CSV file
* `self._data`: same as above, except that the key is now an integer item index instead of a word
* `self._hmm_data`: same as above, except that the key is now an integer item index instead of a word
* `self.sentences_index`: a map from video index to the list of item indices (rows of `self.df`) that belong to that video, e.g.
    $$2 \rightarrow [0, 1, 2]:\text{ means items 0, 1, 2 belong to video index 2}$$
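
As a rough sketch of how such an index could be built (an assumption about the approach, not the code in `asl_data.py`), we can group row positions by their `video` value. The list below copies the `video` column from the first five rows of `test_set.df` printed in the next cell; the function name `build_sentences_index` is hypothetical.

```python
from collections import defaultdict


def build_sentences_index(video_column):
    """Map each video index to the positions of its rows.

    NOTE: hypothetical helper for illustration; not part of asl_data.py.
    """
    index = defaultdict(list)
    for item, video in enumerate(video_column):
        # `item` is the row position, `video` is the video that row belongs to.
        index[video].append(item)
    return dict(index)


videos = [2, 2, 2, 7, 7]
print(build_sentences_index(videos))  # {2: [0, 1, 2], 7: [3, 4]}
```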

In [28]:
print('Number of frame {}\n'.format(len(test_set.df)))
print('Few row of data\n{}\n'.format(test_set.df.head()))

print('Number of video {}\n'.format(len(test_set.sentences_index)))

videos = list(test_set.sentences_index.keys())

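# Print the item indices and the words for 10 randomly chosen videos
# (sampled with replacement, so a video may appear more than once).
for _ in range(10):
    video = videos[np.random.randint(len(videos))]
    print('Video [{}] has sequence {}'.format(video, test_set.sentences_index[video]))
    print('Video [{}] has words    {}\n'.format(video, [test_set.wordlist[item] for item in test_set.sentences_index[video]]))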

Number of frame 178

Few row of data
   video  speaker      word  startframe  endframe
0      2  woman-1      JOHN           7        20
1      2  woman-1     WRITE          23        36
2      2  woman-1  HOMEWORK          38        63
3      7    man-1      JOHN          22        39
4      7    man-1       CAN          42        47

Number of video 40

Video [28] has sequence [24, 25, 26, 27, 28]
Video [28] has words    ['JOHN', 'LIKE', 'IX', 'IX', 'IX']

Video [167] has sequence [141, 142, 143, 144, 145]
Video [167] has words    ['JOHN', 'IX', 'SAY', 'LOVE', 'MARY']

Video [92] has sequence [96, 97, 98, 99, 100, 101]
Video [92] has words    ['JOHN', 'GIVE', 'IX', 'SOMETHING-ONE', 'WOMAN', 'BOOK']

Video [119] has sequence [120, 121, 122, 123, 124]
Video [119] has words    ['SUE', 'BUY', 'IX', 'CAR', 'BLUE']

Video [199] has sequence [169, 170, 171]
Video [199] has words    ['LIKE', 'CHOCOLATE', 'WHO']

Video [89] has sequence [83, 84, 85, 86, 87, 88, 89]
Video [89] has words    ['J

The getters `self.get_item_sequences` and `self.get_item_Xlengths` now take an integer argument.

In [32]:
print(test_set.get_all_Xlengths().keys())

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177])


## Model Selector
For our problem, we have the following parameters

* the number of states in our HMM
* the list of features to be used in training

To decide which parameters to use, one can compute a score for each parameter set and then choose the set with the best score. There are a few ways to score a model:

* **Bayesian Information Criterion** ([BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion)): we select the model with the lowest BIC score
$$
\texttt{score}^{BIC} = -2\log L + p \log N
$$
where $\log L$ is the log-likelihood, $p$ is the number of free parameters and $N$ is the sample size.

* **Discriminative Information Criterion** (DIC): we select the model with the highest DIC score
$$
\texttt{score}^{DIC} = \log P(X_i) - \frac{1}{M-1}\sum_{j\neq i}\log P(X_j)
$$
where $X_i$ is the data for the word being modeled, $X_j$ is the data for each of the other words, and $M$ is the total number of words.
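
A minimal sketch of the DIC arithmetic, assuming we already have the log-likelihood of each word's data under the candidate model for word $i$. The function name `dic_score` and the numbers are hypothetical and only illustrate the formula.

```python
def dic_score(log_l_own, log_l_others):
    """DIC for one candidate model.

    log_l_own    -- log P(X_i): log-likelihood of the model's own word
    log_l_others -- list of log P(X_j) for the other M-1 words

    NOTE: hypothetical helper for illustration only.
    """
    # Own log-likelihood minus the average log-likelihood of the other words.
    return log_l_own - sum(log_l_others) / len(log_l_others)


# Hypothetical log-likelihoods, chosen only to make the arithmetic easy to follow.
score = dic_score(-100.0, [-150.0, -130.0, -170.0])
print(score)  # 50.0
```

The score grows as the model explains its own word better than it explains the competing words.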

### BIC
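In the above formula for **BIC**, the number of free parameters $p$ for a Gaussian HMM is given as follows. It counts $\texttt{numstates}-1$ initial-state probabilities, $\texttt{numstates}\times(\texttt{numstates}-1)$ transition probabilities, and one mean and one variance per feature per state (diagonal covariance):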
$$
\begin{split}
 p &= \texttt{numstates} - 1 + \texttt{numstates} \times (\texttt{numstates}-1) + 2\times \texttt{numfeatures} \times \texttt{numstates}\\
  &= \texttt{numstates}^2 - 1 + 2\times \texttt{numfeatures} \times \texttt{numstates}
\end{split}
$$
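
The parameter count and the BIC score can be checked with a short sketch. The function names are hypothetical, the log-likelihood value is made up for illustration, and taking $N$ to be the number of frames in the concatenated sequence is an assumption about what "sample size" means here.

```python
import math


def n_free_params(n_states, n_features):
    """Free parameters of a Gaussian HMM with diagonal covariances."""
    initial = n_states - 1                   # initial-state probabilities
    transitions = n_states * (n_states - 1)  # transition probabilities
    emissions = 2 * n_features * n_states    # one mean and one variance per feature per state
    return initial + transitions + emissions


def bic_score(log_l, n_params, n_samples):
    """BIC = -2 log L + p log N (lower is better)."""
    return -2.0 * log_l + n_params * math.log(n_samples)


# 3 states and the 4 ground features used earlier in this notebook.
p = n_free_params(n_states=3, n_features=4)
print(p)  # 32

# log_l is a made-up value; 1189 is the concatenated length printed for JOHN above.
print(bic_score(log_l=-500.0, n_params=p, n_samples=1189))
```

Because the penalty term $p \log N$ grows roughly quadratically with the number of states, BIC favors smaller models unless the extra states raise the log-likelihood enough to pay for themselves.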