Electric speech

Electric speech is a cloud-reading machine that translates abstract cloudy shapes into proper English. It is part of an ongoing speculative transmedia documentary started in the Republic of Tuva in 2015; it is also a web publication and the remnant of one of the first machine happenings.

Electric speech is inspired both by Philip K. Dick's Do Androids Dream of Electric Sheep? and by the relationship between data, speech and images in politics and contemporary mass media as described by Adam Curtis in HyperNormalisation, where they appear as part of a risk-management system initiated with Aladdin, a supercomputer dedicated to the risk management division of the world's largest investment management corporation, BlackRock, Inc. Electric speech attempts to turn a useful system, one used to shape our financial and digital realities, into a poietic counter-system.

This proposal builds on my previous work, including From and Spleen, which address the boundaries between languages and things and the way speech mediates those spaces. Most similar is unnarrative, for which I worked with the BOX gallery and Le Centre d'Art du Parc Saint-Léger to create an automatically, randomly generated movie made from isolated, zoomed and extracted sequences of extras from regular movies. The result is a low-definition-like movie with no story, no actors, no heroes and no climaxes. When it is shown, unnarrative is accompanied by a sound proposal made by a different artist for each occasion.

In the work of others, I was inspired by Walter Benjamin's The Translator's Task, Mathelinda Nabugodi's Pure Language 2.0 and Giovanni Anselmo's Particolare. While I was developing the proposal, a friend showed me Sucking on Words by Kenneth Goldsmith, which probably confirmed the work in its current shape. Another interesting reference is Avital Ronell's The Telephone Book, which links technology and schizophrenia by exploring the deep origins of well-known technologies.

![](http://mathieu-arbez-hermoso.net/wp-content/uploads/2017/01/vlcsnap-2017-01-29-20h50m38s173.jpg)

Early on, we used Andrej Karpathy's Torch implementation of the models proposed by Vinyals et al. from Google (CNN + LSTM) and by Karpathy and Fei-Fei from Stanford (CNN + RNN). Both models take an image and predict its sentence description through a recurrent neural network (either an LSTM or an RNN). We first trained the models on a custom dataset of 1 million Flickr images with their related hashtags as labels. This led the language model to gracefully fail at expressing complex concepts in proper English, but at the same time it was absorbing as a poetic proposal and process. We finally decided to use im2txt and the Inception v3 model, released on GitHub in 2016. We took a checkpoint pretrained on the ImageNet dataset and fine-tuned it for months on the MS-COCO dataset. We wanted to use an "alternative intelligence" as close as possible to the way it is used in industrial and competitive contexts.

Technical details

All footage was recorded over 144 minutes at 4K, 24 fps on a Panasonic GH4 fitted with a Sigma 24-70mm lens. The European version of the GH4 outputs short clips (30 minutes) that are then stripped of audio and concatenated with ffmpeg. Before being concatenated, the clips are copied to a temporary folder on the internal SSD, which reduces the processing time from days to minutes. All sequences are then edited, color graded and exported to H.264. We use ffprobe and ffmpeg to extract keyframes and their timestamps and send them to im2txt, which generates an English text translation for each image. Finally, the videos and captions are uploaded to YouTube, which handles the streaming and buffering for the online version.
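
As a sketch of this step (file names below are hypothetical, not the project's actual scripts), the audio stripping, concatenation and keyframe extraction could look like this:

# Hypothetical file names for illustration; the actual scripts may differ.
# Strip audio and concatenate the 30-minute clips with the concat demuxer.
printf "file '%s'\n" clip_*.MP4 > clips.txt
ffmpeg -f concat -safe 0 -i clips.txt -c copy -an full_take.mp4

# List keyframe (I-frame) timestamps with ffprobe.
ffprobe -v error -select_streams v:0 \
  -show_entries frame=pkt_pts_time,pict_type \
  -of csv full_take.mp4 | grep ',I' > keyframe_times.csv

# Extract the keyframes themselves as JPEG images for im2txt.
mkdir -p keyframes
ffmpeg -i full_take.mp4 -vf "select=eq(pict_type\,I)" -vsync vfr keyframes/keyframe_%05d.jpg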

Firstly, we wanted to present a live stream of the A.I. performing, but since the learning phase is separated from the performing phase in current neural network solutions, we did not see any reason to go that way and decided instead to present a recorded stream of a machine happening that occurred on 03/01/2017 at 5:17pm (Paris time).

Software details

Install Required Packages

Make sure you have installed the following required packages: Bazel, TensorFlow (1.0 or greater), NumPy and the Natural Language Toolkit (NLTK) together with its punkt tokenizer data.
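
If any of them are missing, a minimal setup sketch (assuming a standard Python/pip environment; Bazel itself is installed separately by following its own documentation) could be:

# Install the Python dependencies (assuming pip is available).
pip install tensorflow numpy nltk

# Download the punkt tokenizer models used when tokenizing the captions.
python -c "import nltk; nltk.download('punkt')"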

Prepare the Training Data

To train the model you will need to provide training data in native TFRecord format. The TFRecord format consists of a set of sharded files containing serialized tf.SequenceExample protocol buffers. Each tf.SequenceExample proto contains an image (JPEG format), a caption and metadata such as the image id.

Each caption is a list of words. During preprocessing, a dictionary is created that assigns each word in the vocabulary to an integer-valued id. Each caption is encoded as a list of integer word ids in the tf.SequenceExample protos.

The Google Brain team provides a script to download and preprocess the [MSCOCO](http://mscoco.org/) image captioning data set into this format. Downloading and preprocessing the data may take several hours depending on your network and computer speed.

Before running the script, ensure that your hard disk has at least 150GB of available space for storing the downloaded and processed data.

# Location to save the MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"

# Build the preprocessing script.
bazel build im2txt/download_and_preprocess_mscoco

# Run the preprocessing script.
bazel-bin/im2txt/download_and_preprocess_mscoco "${MSCOCO_DIR}"

The final line of the output should read:

2016-09-01 16:47:47.296630: Finished processing all 20267 image-caption pairs in data set 'test'.

When the script finishes you will find 256 training, 4 validation and 8 testing files in ${MSCOCO_DIR}. The files will match the patterns train-?????-of-00256, val-?????-of-00004 and test-?????-of-00008, respectively.
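
If you want to double-check the result, counting the shards (with the MSCOCO_DIR used above) should give exactly those numbers:

# Count the sharded TFRecord files produced by the preprocessing script.
ls ${MSCOCO_DIR}/train-?????-of-00256 | wc -l   # should print 256
ls ${MSCOCO_DIR}/val-?????-of-00004 | wc -l     # should print 4
ls ${MSCOCO_DIR}/test-?????-of-00008 | wc -l    # should print 8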

Download the Inception v3 Checkpoint

The Show and Tell model requires a pretrained Inception v3 checkpoint file to initialize the parameters of its image encoder submodel.

This checkpoint file is provided by the TensorFlow-Slim image classification library, which provides a suite of pre-trained image classification models. You can read more about the models provided by the library in the TensorFlow-Slim repository.

Run the following commands to download the Inception v3 checkpoint.

# Location to save the Inception v3 checkpoint.
INCEPTION_DIR="${HOME}/im2txt/data"
mkdir -p ${INCEPTION_DIR}

wget "http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz"
tar -xvf "inception_v3_2016_08_28.tar.gz" -C ${INCEPTION_DIR}
rm "inception_v3_2016_08_28.tar.gz"

Note that the Inception v3 checkpoint will only be used for initializing the parameters of the Show and Tell model. Once the Show and Tell model starts training it will save its own checkpoint files containing the values of all its parameters (including copies of the Inception v3 parameters). If training is stopped and restarted, the parameter values will be restored from the latest Show and Tell checkpoint and the Inception v3 checkpoint will be ignored. In other words, the Inception v3 checkpoint is only used in the 0-th global step (initialization) of training the Show and Tell model.

Initial Training

To launch the initial training phase (with the Inception v3 weights frozen), you can run the training script manually:

# Directory containing preprocessed MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"

# Inception v3 checkpoint file.
INCEPTION_CHECKPOINT="${HOME}/im2txt/data/inception_v3.ckpt"

# Directory to save the model.
MODEL_DIR="${HOME}/im2txt/model"

# Build the model.
bazel build -c opt im2txt/...

# Run the training script.
bazel-bin/im2txt/train \
  --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
  --inception_checkpoint_file="${INCEPTION_CHECKPOINT}" \
  --train_dir="${MODEL_DIR}/train" \
  --train_inception=false \
  --number_of_steps=1000000

Or run it automatically with:

$ ./train.sh

To launch the second training phase (now fine-tuning the Inception v3 parameters as well), you can run the training script manually:

# Restart the training script with --train_inception=true.
bazel-bin/im2txt/train \
  --input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
  --train_dir="${MODEL_DIR}/train" \
  --train_inception=true \
  --number_of_steps=3000000  # Additional 2M steps (assuming 1M in initial training).

Or run it automatically with:

$ ./train2.sh

To run the evaluation, you can run the eval script manually:

MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
MODEL_DIR="${HOME}/im2txt/model"

# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""

# Run the evaluation script. This will run in a loop, periodically loading the
# latest model checkpoint file and computing evaluation metrics.
bazel-bin/im2txt/evaluate \
  --input_file_pattern="${MSCOCO_DIR}/val-?????-of-00004" \
  --checkpoint_dir="${MODEL_DIR}/train" \
  --eval_dir="${MODEL_DIR}/eval"

Or run it automatically with:

$ ./eval.sh

You should run the evaluation script in a separate process. This will log evaluation metrics to TensorBoard which allows training progress to be monitored in real-time.

Note that you may run out of memory if you run the evaluation script on the same GPU as the training script. You can run the command export CUDA_VISIBLE_DEVICES="" to force the evaluation script to run on CPU. If evaluation runs too slowly on CPU, you can decrease the value of --num_eval_examples.
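
For example (the value of --num_eval_examples below is only illustrative):

# Force evaluation onto the CPU and evaluate a smaller sample of the validation set.
export CUDA_VISIBLE_DEVICES=""
bazel-bin/im2txt/evaluate \
  --input_file_pattern="${MSCOCO_DIR}/val-?????-of-00004" \
  --checkpoint_dir="${MODEL_DIR}/train" \
  --eval_dir="${MODEL_DIR}/eval" \
  --num_eval_examples=1000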

To monitor training with TensorBoard, you can launch it manually:

MODEL_DIR="${HOME}/im2txt/model"

# Run a TensorBoard server.
tensorboard --logdir="${MODEL_DIR}"

Or run it automatically with:

$ ./tensorboard.sh

To caption an image, you can run the caption script manually:

# Directory containing model checkpoints.
CHECKPOINT_DIR="${HOME}/im2txt/model/train"

# Vocabulary file generated by the preprocessing script.
VOCAB_FILE="${HOME}/im2txt/data/mscoco/word_counts.txt"

# JPEG image file to caption.
IMAGE_FILE="${HOME}/im2txt/data/mscoco/raw-data/val2014/COCO_val2014_000000224477.jpg"

# Build the inference binary.
bazel build -c opt im2txt/run_inference

# Ignore GPU devices (only necessary if your GPU is currently memory
# constrained, for example, by running the training script).
export CUDA_VISIBLE_DEVICES=""

# Run inference to generate captions.
bazel-bin/im2txt/run_inference \
  --checkpoint_path=${CHECKPOINT_DIR} \
  --vocab_file=${VOCAB_FILE} \
  --input_files=${IMAGE_FILE}

Or run it automatically with:

$ ./caption.sh

Using im2txt and the Inception v3 model on videos

To run the translation script, which takes a video, extracts each keyframe, captions the keyframes and generates a .srt file with the matching timestamps, use the command below. ${MODEL PATH} and ${KEYFRAME_TEMP_FOLDER} must be folders; ${.SRT OUTPUT PATH}, ${VIDEO FILE PATH} and ${MS COCO WORD_COUNT.TXT PATH} must be files:

$ ./translator.sh ${MODEL PATH} ${KEYFRAME_TEMP_FOLDER} ${.SRT OUTPUT PATH} ${VIDEO FILE PATH} ${MS COCO WORD_COUNT.TXT PATH}
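
For example, with hypothetical paths (adapt them to your own setup):

$ ./translator.sh ${HOME}/im2txt/model/train \
    /tmp/keyframes \
    ${HOME}/electric-speech/clouds.srt \
    ${HOME}/electric-speech/clouds.mp4 \
    ${HOME}/im2txt/data/mscoco/word_counts.txt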

## Credits

ELECTRIC SPEECH (2016)
by MATHIEU ARBEZ HERMOSO
with DORIAN FAUCON

SUBSIDIZED by CONSEIL GENERAL DE COTE D'OR and DIRECTION REGIONALE DES AFFAIRES CULTURELLES

SPECIAL THANKS to ANTONIN RENAULT, GAËLLE LE FLOCH, DELPHINE PAUL, SHAWN QUIRK and THE TUVAN CULTURAL CENTER