Autonomous Production of Gestures on a Social Robot using Deep Learning

Thesis

Installation

The simplest way to get started is to use a virtualenv managed with Pipenv:

pip install --user pipenv
pipenv install
pipenv shell

This will install all the Python dependencies and could take a while. The last command opens the virtualenv so you can start using them.

This project was developed on a machine running Ubuntu 16.04 LTS. Most things will work on other distributions too. However, I’ve found that Choregraphe, SoftBank’s robot simulator, might not work on distributions with newer versions of certain packages (I can’t remember which, however on my machine running the latest Fedora it doesn’t work). Some parts of the project, like the Video Picker, also require system-level dependencies which aren’t managed by Pipenv.

If you have difficulties with dependencies, you can run the code in a Docker container instead. There are two options:

  • The machine specified in Dockerfile requires installation of nvidia-docker and thus requires an NVIDIA GPU. This also uses tensorflow-gpu to exploit your graphics card.
  • The machine specified in Dockerfile.cpu uses the regular Ubuntu 16.04 image and runs the CPU variant of TensorFlow.

To set up the Docker machine, run make build or make cpubuild and to start it, run make start or make cpustart, respectively.

Note that these Docker containers install the NAOqi SDKs and Choregraphe, but since I can’t host them here, you should download them yourself from the SoftBank website and place them in the root directory of this repository. Double-check the file names in the Dockerfile to make sure the version mentioned there is the same as the one you downloaded.
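
A quick way to double-check the versions is to grep the Dockerfiles (the exact archive names and extensions depend on what SoftBank offers for download):

grep -n -i -E 'naoqi|choregraphe' Dockerfile Dockerfile.cpu  # lists the archive names and versions the build expects
ls  # the downloaded archives should sit in this directory, next to the Dockerfiles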

Running OpenPose

For OpenPose, a separate Dockerfile is present in src/openpose. This builds OpenPose as part of the Docker build process, so you can immediately start using the compiled binaries.

Installing nvidia-docker

If you are running Ubuntu, installation is easy:

git clone git@github.com:ryanolson/bootstrap.git
cd bootstrap
./bootstrap.sh

If you have another setup, the installation is probably quite simple too (but not this simple 😉).
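
Once it’s installed, a common smoke test (this is generic nvidia-docker usage, not something from this repository; it assumes the nvidia runtime is registered with Docker) is:

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi  # should list your GPU(s)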

Running

Building the Dataset

Collect video

First, find a suitable video on YouTube. It should have the following properties:

  • Is in English (actually this is not strictly necessary, but don’t start messing with multiple languages 😉)
  • Has subtitles available (the download script downloads the automatically generated subtitles, so if there are built-in ones you want to use, you have to modify it)
  • Has at least one shot of a person who is fully visible in frame (from head to toe)

For these steps you need youtube-dl and ffmpeg installed. Open a command line in the src/video directory and run:

./download.sh '$YOUTUBE_URL'
./detect-shots.sh $NEW_VIDEO_FILE

If there are quotes in the file name created by download.sh, they might cause trouble in the next step, so I recommend removing them first. Keep the YouTube ID at the end of the file name, since it is used by other parts of the project.
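
For example, with a hypothetical file name (the trailing part is the YouTube ID, which must stay intact):

mv "Let's Talk Robots-AbCdEfGhIjK.mp4" "Lets Talk Robots-AbCdEfGhIjK.mp4"  # drop the quote, keep the ID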

Extract Video Clips

The Video Picker assists with selecting the right parts of a video and saves the initial data for the dataset. You need some system-level dependencies such as FFmpeg, Cairo, GStreamer and the Python GTK libraries. You can refer to the Dockerfiles to see which packages these are, or just run this from the Docker container if you don’t want to install them yourself.

Browse to the src/video-picker directory and run main.py:

python main.py

Click the Open icon or press Ctrl+O to open a video file. Then, the video will start playing and the program should look something like this:

(Screenshot: img/video-picker-screenshot.png)

The most ergonomic way to extract video is to hold your left hand above your keyboard (you only need its left half anyways) and hold your mouse with your right hand.

Point the cursor roughly at the hip of the person you’re interested in. You can adjust the size of the rectangle by scrolling. (This information is saved but no longer used; it was only relevant in a previous version, so the size of the rectangle doesn’t really matter.)

Some terminology: the detect-shots.sh script you ran above runs a scene detection algorithm which detects the shots in a video. A shot is a single continuous piece of video, so there is a new shot whenever the camera cuts to another angle, for example. Sometimes these changes are not detected, for example when there is a fading transition between shots.

Navigate and record with the following shortcuts:

  • Ctrl+D go to the previous shot
  • Ctrl+F go to the next shot
  • Ctrl+R record the current shot (rewinds to the start of this shot first)
  • Ctrl+R (while recording) stop recording immediately. Use this when there is a new shot the scene detection algorithm didn’t pick up.

While Video Picker is recording, just let it play until it stops recording. The cursor changes color according to what state it’s in:

  • Red cursor: recording
  • Semi-transparent cursor: this clip is unusable (probably because the subtitle is present during a change of shot) and is thus not being recorded
  • Green cursor: this clip has already been recorded

Perform 2D Pose Detection with OpenPose

Move the src/video-picker/images folder to src/openpose/src/images.
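
From the repository root, that is simply:

mv src/video-picker/images src/openpose/src/images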

Go to src/openpose in your command line and set up the container:

make build
make start

Once you’re in the container, run OpenPose on the images extracted by the Video Picker:

cd openpose
./build/examples/openpose/openpose.bin --image_dir ~/dev/images/ --write_json ~/dev/output/
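
OpenPose writes one keypoints JSON file per input image, so as a quick sanity check the output directory should now contain roughly as many *_keypoints.json files as there were images:

ls ~/dev/output/ | head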

Lift the poses to 3D with 3D Pose Baseline

The 3D Pose Baseline will read the output directory from OpenPose and save the lifted 3D poses into the clips.jsonl file created by the Video Picker.

Go to the src/openpose-baseline folder and simply run make.
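
In other words, from the repository root:

cd src/openpose-baseline
make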

Clean the data

Go to the src/ folder and run:

./util.py dataset preprocess

Detect the clusters

Go to the src/clustering folder and run:

R detect-clusters.r

You need to have R installed, but the R dependencies will be installed automatically via the pacman package.
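
If your R installation refuses to run the script with the command above, Rscript (which ships with R) does the same job:

Rscript detect-clusters.r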

Create the TFRecord dataset

Go to the src/ folder and run:

./util.py create-tfrecords

Phew! The dataset is ready.

Using the model

Training

Go to the src/learning folder and run:

python model.py --train

Modify the parameters at the bottom of model.py if you want to.

Evaluating

Results are automatically evaluated at the end of training. You can inspect them by starting TensorBoard:

make board

Then, navigate to http://localhost:6006. Note that you can also run TensorBoard during training and watch the results as they come in.
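
make board presumably just wraps the TensorBoard CLI, so something along these lines should be equivalent (the log directory here is a guess; check the Makefile for the actual path):

tensorboard --logdir ./log --port 6006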

Inference

To plot the pose for a subtitle of your choice, run:

python model.py --predict --subtitle 'robots are smarter with machine learning'

Performing gestures on a robot

In src/util.py a few functions are implemented to play back poses. To specify how to connect to the NAO, use the --bot_address and --bot_port options. Defaults are 127.0.0.1 and 9559, respectively.

./util.py bot play-random-clip  # Take a random clip and play the ground truth gesture
./util.py bot play-clusters     # Play the clusters from `cluster-centers.json`
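
For example, to play a random clip on a physical robot (the IP address is a placeholder; check ./util.py --help for the exact flag placement):

./util.py bot play-random-clip --bot_address 192.168.1.42 --bot_port 9559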

TODO: Add a method to play back predictions.

Preparing the survey

There is a single create_question function that prepares the gestures needed for a question in the survey. Such a question contains a video recording of the robot performing three clips back to back, in four different scenarios:

  • Ground truth (3D pose detections)
  • Baseline (built-in robot animations)
  • Classification-based prediction (uses the clusters)
  • Sequence-based prediction (directly predicts the gesture)

create_question will make the robot perform these scenarios one after the other. While performing a gesture, its eye LEDs will be active and in between performances they will be turned off. It will also print the associated (combined) subtitle and save the metadata for the question in questions.jsonl.

Go to src/ and run:

./util.py survey create-question

Using a virtual robot

It is possible to generate a question using a virtual robot from a running Choregraphe instance.

Run the create_question function in src/survey.py with do_record_screen=True, do_generate_tts=True. You will probably need to update the code to make sure the correct region of your display is captured. In order to generate the TTS speech, the IBM Watson API is used (since the SoftBank TTS engine is not available in the simulator). For that to work, you need to sign up for an account and set up the following environment variables:

export WATSON_TTS_USERNAME='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
export WATSON_TTS_PASSWORD='xxxxxxxxxxxx'

Tip: Save this in a file .env in the root directory of this project. Pipenv will automatically load the environment variables when running pipenv shell. You’ll need to load them manually, though, if you’re running this in a Docker container (since there’s no virtual environment in that case).
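
A minimal sketch of what that could look like (the values are placeholders, as above):

# .env in the repository root; picked up automatically by pipenv shell
export WATSON_TTS_USERNAME='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
export WATSON_TTS_PASSWORD='xxxxxxxxxxxx'

# Inside a Docker container, load it by hand:
source .env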