# Human Social Interaction Modeling Using Temporal Deep Networks

# Contents
* ABSTRACT
* Keywords
* 1 INTRODUCTION
* 2 RELATED WORK
    - Social Psychology
    - Affective Computing
    - Hybrid Models
    - Deep Networks
* 3 APPROACH
    - 3.1 Review of Prior Models
        - Restricted Boltzmann Machines
        - Discriminative Restricted Boltzmann Machines
        - Conditional Restricted Boltzmann Machines
    - 3.2 Model
        - Discriminative Conditional Restricted Boltzmann Machines
    - 3.3 Inference and Learning
        - Inference
        - Learning
* 4 EXPERIMENTS
    - 4.1 Datasets
        - Tower Game Dataset
        - Capture Setup
        - Data Annotation
    - 4.2 Quantitative Results
        - Implementation Details
        - Results
* 5 CONCLUSIONS AND FUTURE WORK

# ABSTRACT

* We present a novel approach to computational modeling of social interactions based on modeling of essential social interaction predicates (ESIPs) such as joint attention and entrainment.
* Based on sound social psychological theory and methodology, we collect a new “Tower Game” dataset consisting of audio-visual capture of dyadic interactions labeled with the ESIPs.
* We propose a novel joint Discriminative Conditional Restricted Boltzmann Machine (DCRBM) model that combines a discriminative component with the generative power of CRBMs.

# Keywords

* Hybrid Models
* Deep Learning
* DCRBMs
* Social Interaction
* Computational Social Psychology
* Tower Game Dataset

# 1 INTRODUCTION

* This research brings together multiple disciplines to explore the problem of social interaction modeling. 
* The goal of this work is to leverage research in social psychology, computer vision, signal processing, and machine learning to better understand human social interactions.
* We focus on social predicates that support rapport: 
    - joint attention, 
    - temporal synchrony, 
    - mimicry, and 
    - coordination.
* Our approach is guided by two key insights. 
    - The first is that, apart from inferring the mental state of the other, social interactions require individuals to attend to each other’s movements, utterances, and context in order to coordinate actions jointly. 
    - The second insight is that social interactions involve reciprocal acts and joint behaviors, along with nested events (e.g., speech, eye gaze, gestures) at various timescales, and therefore demand adaptive and cooperative behavior from their participants.

#### ESIPs
* We focus on detecting 
    - rhythmic coupling 
        - (also known as entrainment and attunement), 
    - mimicry (behavioral matching), 
    - movement simultaneity, 
    - kinematic turn taking patterns, and 
    - other measurable features of engaged social interaction.
* We established that behaviors such as <font color="blue">joint attention</font> and <font color="blue">entrainment</font> were the <font color="red">essential predicates of social interaction (ESIPs)</font>.
* With this in mind, we focus on developing computational models of social interaction that use <font color="red">multimodal sensing</font> and <font color="red">temporal deep learning models</font> <font color="blue">to detect and recognize</font> <font color="red">these ESIPs</font>, as well as to discover their actionable constituents.


#### Models
* Discriminative models 
    - focus on maximizing the separation between classes; however, they are often uninterpretable. 
* On the other hand, generative models 
    - focus solely on modeling distributions and are often unable to incorporate higher level knowledge. 
* Hybrid models 
    - tend to address these problems by combining the advantages of discriminative and generative models. 
    - They encode higher level knowledge as well as model the distribution from a discriminative perspective. 
* We propose a novel hybrid model that allows us to recognize classes, correlate features, and generate social interaction data.

<font color="red">This paper proposes a new approach to machine learning that answers questions posed by social psychology.</font>

<img src="figures/cap1.png" width=800 />

# 2 RELATED WORK
* Social Psychology
* Affective Computing
* Hybrid Models
* Deep Networks

#### Social Psychology

#### Affective Computing

#### Hybrid Models

#### Deep Networks

# 3 APPROACH
* 3.1 Review of Prior Models
* 3.2 Model
* 3.3 Inference and Learning

## 3.1 Review of Prior Models
* Restricted Boltzmann Machines
* Discriminative Restricted Boltzmann Machines
* Conditional Restricted Boltzmann Machines

#### Reference
* [2] Machine Learning Study (19) Deep Learning - RBM, DBN, CNN - http://sanghyukchun.github.io/75/

<img src="figures/cap5.png" width=600 />
<img src="figures/cap6.png" width=600 />
<img src="figures/cap7.png" width=600 />

### Restricted Boltzmann Machines

<img src="figures/cap2.png" width=600 />

<img src="figures/eq2.png" width=600 />

<img src="figures/cap3.png" width=600 />

<img src="figures/cap4.png" width=600 />

### Discriminative Restricted Boltzmann Machines

<img src="figures/cap8.png" width=600 />

<img src="figures/cap9.png" width=600 />

<img src="figures/cap10.png" width=600 />

### Conditional Restricted Boltzmann Machines

<img src="figures/cap11.png" width=600 />

<img src="figures/cap12.png" width=600 />

<img src="figures/cap13.png" width=600 />

## 3.2 Model
* Discriminative Conditional Restricted Boltzmann Machines

### Discriminative Conditional Restricted Boltzmann Machines

<img src="figures/cap14.png" width=600 />

<img src="figures/cap15.png" width=600 />

<img src="figures/cap16.png" width=600 />

## 3.3 Inference and Learning
* Inference
* Learning

### Inference

<img src="figures/cap17.png" width=600 />

### Learning

<img src="figures/cap18.png" width=600 />

where $\langle \cdot \rangle_{data}$ is the expectation with respect to the data distribution and $\langle \cdot \rangle_{recon}$ is the expectation with respect to the reconstructed data.
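To make the data-versus-reconstruction difference concrete, here is a minimal sketch of one contrastive divergence (CD-1) step for a plain binary RBM. It is not the DCRBM update from the paper: the binary units, the function names (`cd1_update`, `sample_hidden`, `visible_probs`), and the learning rate `lr` are all assumptions made for illustration, and the label and history terms of the DCRBM are omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, W, c, rng):
    """Return p(h_j = 1 | v) for every hidden unit and one binary sample."""
    probs = [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(c))]
    return probs, [1.0 if rng.random() < p else 0.0 for p in probs]


def visible_probs(h, W, b):
    """Return p(v_i = 1 | h) for every visible unit."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def cd1_update(v0, W, b, c, lr, rng):
    """One CD-1 step (assumed binary RBM): <v h>_data - <v h>_recon."""
    ph0, h0 = sample_hidden(v0, W, c, rng)   # statistics under the data
    v1 = visible_probs(h0, W, b)             # one-step reconstruction
    ph1, _ = sample_hidden(v1, W, c, rng)    # statistics under the reconstruction
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])
    return v1
```

Each parameter moves in the direction of its statistic measured on the data minus the same statistic measured on the reconstruction, which is the pattern the expectations above describe.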

<img src="figures/cap24.png" width=800 />
<img src="figures/cap25.png" width=800 />
<img src="figures/cap26.png" width=800 />

# 4 EXPERIMENTS
* 4.1 Datasets
* 4.2 Quantitative Results

## 4.1 Datasets
* Tower Game Dataset
* Capture Setup
* Data Annotation

### Tower Game Dataset
* architect-builder
* distinct-objective

#### Reference
* [3] Tower Game Dataset: A multimodal dataset for analyzing social interaction predicates - http://www.infomus.org/Events/proceedings/ACII2015/papers/Main_Conference/M2_Poster/Poster_Teaser_5/ACII2015_submission_19.pdf

<img src="figures/cap1.png" width=800 />

### Capture Setup

<img src="figures/cap19.png" width=600 />

### Data Annotation

* Since our focus is on joint attention and entrainment, we annotated 112 videos, divided into 1213 10-second segments, indicating the presence or absence of these two behaviors in each segment.
* To annotate the videos, we developed an innovative annotation schema drawn from concepts in the social psychology literature.
* Joint attention is the shared focus of two individuals on a common subject, and it involves eye gaze (on a person and on an object) and body language.
* Entrainment is the alignment in the behavior of two individuals, and it involves simultaneous movement, tempo similarity, and coordination.
* Each measure was rated on a low/medium/high scale for the entire 10-second segment. 
* We hired six undergraduate sociology and psychology students to annotate the videos. 

## 4.2 Quantitative Results
* Implementation Details
* Results

### Implementation Details

* For our experiments, we relied only on the skeleton features. We use the 11 joints from the upper body of the two players since the tower game almost entirely involves only upper body actions.
* Using the 11 joints we extracted a set of first order static and dynamic handcrafted skeleton features.
    - The static features 
        - are computed per frame. 
        - They consist of
            - relationships between all pairs of joints of a single actor, and
            - relationships between all pairs of joints across both actors.
    - The dynamic features 
        - are extracted per window (a set of 300 frames). 
        - In each window, we compute
            - first- and second-order dynamics (velocities and accelerations) of each joint, and
            - relative velocities and accelerations of pairs of joints, both per actor and across actors.
* The static and dynamic features are 257,400-dimensional. 
* To reduce their dimensionality, we use Principal Component Analysis (PCA) (100 D) and Bag-of-Words (BoW) (100 and 300 D). 
* We also extracted deep learning features using RBMs and CRBMs (50 dimensions).
* For the DRBM and DCRBM, we used the raw joint locations normalized with respect to a selected origin point. We used the same dimensionality for both models: $D(v) = 66$, $D(h) = 50$. 
* For the DCRBM, we empirically evaluated history windows of different sizes and found that a window of size n = 15 works best.
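The feature pipeline above can be sketched roughly as follows. The notes do not say which "relationship" between joints is used, so Euclidean distance is an assumption here, as are the function names (`static_features`, `finite_difference`, `with_history`) and the 3D tuple representation of a joint. The sketch does not attempt to reproduce the 257,400-dimensional feature vector; it only illustrates the three ideas: per-frame pairwise features, per-window dynamics, and the length-15 history used for conditioning.

```python
import math


def pairwise_distances(joints):
    """Distances between every unordered pair of joints of one actor."""
    return [math.dist(joints[i], joints[j])
            for i in range(len(joints))
            for j in range(i + 1, len(joints))]


def static_features(actor_a, actor_b):
    """Per-frame features: within-actor pairs for each actor, then cross-actor pairs."""
    cross = [math.dist(p, q) for p in actor_a for q in actor_b]
    return pairwise_distances(actor_a) + pairwise_distances(actor_b) + cross


def finite_difference(track):
    """Frame-to-frame difference of one joint track (a list of coordinate tuples).

    Applied once it approximates velocity; applied twice, acceleration.
    """
    return [tuple(b - a for a, b in zip(p, q)) for p, q in zip(track, track[1:])]


def with_history(frames, n=15):
    """Pair each frame t >= n with its n preceding frames (the conditioning history)."""
    return [(frames[t], frames[t - n:t]) for t in range(n, len(frames))]
```

With 11 joints per player, `static_features` returns 55 + 55 + 121 = 231 values per frame under these assumptions. The visible size of 66 quoted above is consistent with 11 joints × 3 coordinates × 2 players, although the notes do not state that breakdown explicitly.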

### Results

For the purposes of this paper, we focused on three ESIPs:
* Coordination, 
* Simultaneous Movement, and 
* Tempo Similarity.

#### Classification Task

The evaluation is done with respect to the six annotators $\{A_1, A_2, \dots, A_6\}$ as well as the mean annotation.
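As a rough illustration of evaluating against each annotator and against the mean annotation, the sketch below encodes the low/medium/high ratings as 0/1/2, averages them per segment, and rounds to the nearest level. Both the rounding rule and the use of plain accuracy are assumptions; the notes do not specify how the mean annotation is discretized or which metric the result tables report.

```python
def accuracy(predicted, reference):
    """Fraction of segments where the prediction matches the reference label."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def mean_annotation(ratings_per_annotator):
    """Average the annotators' ratings per segment and round half up.

    ratings_per_annotator: one list of integer ratings (0=low, 1=medium, 2=high)
    per annotator, all of the same length.
    """
    n = len(ratings_per_annotator)
    return [int(sum(segment) / n + 0.5) for segment in zip(*ratings_per_annotator)]


def evaluate(predicted, ratings_per_annotator):
    """Accuracy against each annotator, followed by accuracy against the mean."""
    scores = [accuracy(predicted, ratings) for ratings in ratings_per_annotator]
    scores.append(accuracy(predicted, mean_annotation(ratings_per_annotator)))
    return scores
```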

<img src="figures/cap20.png" width=600 />
<img src="figures/cap21.png" width=600 />
<img src="figures/cap22.png" width=600 />
<img src="figures/cap23.png" width=600 />

#### Generation Task

* The second task is the Generation Task, where we are given the class label and our goal is to generate the data (i.e. the raw features) for that label. 
    - This task allows us to visualize what the classifier has learned.
* For generation, we initialize the model using 15 frames for each person, and then generate sequences of lengths varying from 16 to 300 frames.
* We measure the mean error between the ground-truth data and the generated data for each class label over 50 video instances.
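The per-label error measurement could be computed along these lines. The notes do not define the error, so mean absolute difference per feature value is an assumption, and `mean_error` and `mean_error_by_label` are hypothetical helper names.

```python
def mean_error(groundtruth, generated):
    """Mean absolute difference over all frames and features of one sequence."""
    total, count = 0.0, 0
    for gt_frame, gen_frame in zip(groundtruth, generated):
        for g, x in zip(gt_frame, gen_frame):
            total += abs(g - x)
            count += 1
    return total / count


def mean_error_by_label(instances):
    """Average the per-sequence error separately for each class label.

    instances: iterable of (label, groundtruth_sequence, generated_sequence).
    """
    sums, counts = {}, {}
    for label, groundtruth, generated in instances:
        sums[label] = sums.get(label, 0.0) + mean_error(groundtruth, generated)
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}
```

In the setup described above, `instances` would hold the 50 video instances, each paired with the sequence generated for its label.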

<img src="figures/cap27.png" width=600 />

Generation is done in two different settings.
* In the first setting, 
    - given partial visible player data (one player’s features) as well as the class label, 
    - the goal is to generate the other player’s data.
* In the second setting, 
    - given only the class label, 
    - the goal is to generate the entire visible layer data (i.e. the raw features for both the players).
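The two settings differ only in which visible units are held fixed during sampling. The sketch below shows that idea on a plain binary RBM with alternating Gibbs sweeps. It is a simplification rather than the paper's DCRBM procedure: the history connections are omitted, the units are binary, and the class label is assumed to be represented as extra visible units that stay clamped. The name `gibbs_fill_in` and the `clamped` mask are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gibbs_fill_in(v_init, clamped, W, b, c, steps, rng):
    """Alternate h ~ p(h | v) and v = p(v | h), restoring clamped units each sweep.

    v_init:  initial visible vector (clamped entries carry the known values)
    clamped: list of booleans, True where the visible unit is held fixed
    """
    v = list(v_init)
    for _ in range(steps):
        # Sample binary hidden units given the current visible vector.
        h = [1.0 if rng.random() < sigmoid(
                 c[j] + sum(v[i] * W[i][j] for i in range(len(v)))) else 0.0
             for j in range(len(c))]
        # Mean-field visible probabilities given the sampled hidden units.
        p_v = [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
               for i in range(len(b))]
        # Keep known units fixed; overwrite only the free ones.
        v = [v_init[i] if clamped[i] else p_v[i] for i in range(len(v))]
    return v
```

For the first setting, `clamped` would be True for the label units and for one player's features. For the second, it would be True for the label units only, so the features of both players are generated.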

<img src="figures/cap24.png" width=800 />
<img src="figures/cap25.png" width=800 />
<img src="figures/cap26.png" width=800 />

# 5 CONCLUSIONS AND FUTURE WORK

# References
* [1] Human Social Interaction Modeling Using Temporal Deep Networks - https://arxiv.org/abs/1505.02137
* [2] Machine Learning Study (19) Deep Learning - RBM, DBN, CNN - http://sanghyukchun.github.io/75/
* [3] Tower Game Dataset: A multimodal dataset for analyzing social interaction predicates - http://www.infomus.org/Events/proceedings/ACII2015/papers/Main_Conference/M2_Poster/Poster_Teaser_5/ACII2015_submission_19.pdf