<a href="https://colab.research.google.com/github/jcalandra/audiosynthesis_dl/blob/master/src/Pict2Audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pict2Audio: A Neural Network that Associates Pictures with Audio Descriptors

This project associates a picture drawn by a composer with a sound that has one or more characteristics defined by audio descriptors. The long-term goal is to let composers develop their own composition language and associate it with sounds from effects banks.

The first descriptor studied is pitch. A Convolutional Neural Network is trained for classification on a database of pairs:
- an image drawn by the composer,
- the label of the sound extract corresponding to the associated note pitch.

After training, the network should return a sound for any image given as input.
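
As a rough illustration of that last step, the sketch below turns a predicted class index into a note name and a frequency. It assumes that label 0 is C, that the 12 labels follow the chromatic scale, and that tuning is equal temperament with A4 = 440 Hz. The names `NOTE_NAMES` and `class_to_pitch`, and the default octave, are hypothetical and do not come from the project.

```python
# Chromatic note names; assumption: label 0 corresponds to C.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def class_to_pitch(label, octave=4):
    """Return (note name, frequency in Hz) for a class label in 0..11."""
    if not 0 <= label < 12:
        raise ValueError("label must be between 0 and 11")
    # MIDI note number: C4 is 60, A4 is 69.
    midi = 12 * (octave + 1) + label
    # Equal temperament relative to A4 = 440 Hz.
    frequency = 440.0 * 2 ** ((midi - 69) / 12)
    return NOTE_NAMES[label] + str(octave), frequency


name, frequency = class_to_pitch(9)
print(name, frequency)
```

With this mapping, the network output only needs to be reduced to a class index (for example with `argmax`) before a sound file or synthesis frequency is selected.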

Validation tests will be conducted by verifying that the sounds obtained correspond to the desired descriptors.
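
One possible form for such a test on pitch is sketched below: estimate the dominant frequency of a sound with an FFT and compare its pitch class with the expected label. This is only a sketch under assumptions. The sound is a synthetic sine wave rather than a file from the dataset, label 0 is assumed to be C, and `dominant_frequency` and `pitch_class` are hypothetical helper names.

```python
import numpy as np


def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest bin in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])


def pitch_class(frequency):
    """Map a frequency to a pitch class 0..11 (assumption: 0 = C, A4 = 440 Hz)."""
    midi = int(round(69 + 12 * np.log2(frequency / 440.0)))
    return midi % 12


# One second of a synthetic 440 Hz tone stands in for a generated sound.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

estimated = dominant_frequency(tone, sample_rate)
expected_label = 9  # A, under the labelling assumption above
print(estimated, pitch_class(estimated) == expected_label)
```

A peak-picking estimate like this is adequate for clean tones; sounds with strong harmonics or noise would need a more robust pitch estimator.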

## Importing the libraries

First, we import all the packages and libraries needed to run the code.

Keras is used with the TensorFlow backend to implement the neural network.

In [1]:
from __future__ import print_function

from PIL import Image
import os, sys

import tensorflow as tf
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten
from keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing import image
from os import walk

import matplotlib.pyplot as plt
import numpy as np

Using TensorFlow backend.



## Importing the dataset

The dataset is imported from GitHub, from the repository audiosynthesis_dl. This repository also contains documentation about sound synthesis using neural networks.

In [2]:
! git clone https://github.com/jcalandra/audiosynthesis_dl.git

Cloning into 'audiosynthesis_dl'...
remote: Enumerating objects: 249, done.
remote: Counting objects: 100% (249/249), done.
remote: Compressing objects: 100% (245/245), done.
remote: Total 588 (delta 6), reused 246 (delta 4), pack-reused 339
Receiving objects: 100% (588/588), 9.70 MiB | 31.54 MiB/s, done.
Resolving deltas: 100% (77/77), done.


In [3]:
print('tensorflow:', tf.__version__)
print('keras:', keras.__version__)

tensorflow: 1.13.1
keras: 2.2.4


## Loading the data


In [4]:
# Load the pictures as 400x400 grayscale images.
# os.listdir returns names in arbitrary order, while the labels below are
# assigned by index, so the names are sorted here (assumption: the file names
# sort in pitch order).
x_train = np.empty((0,400,400,1))
for imgP in sorted(os.listdir("./audiosynthesis_dl/data/pitch_img/img_train")):
  if imgP.split(".")[-1] != "git":
    img = image.load_img( "./audiosynthesis_dl/data/pitch_img/img_train/"+imgP, 
                             target_size=(400, 400),
                             color_mode='grayscale')
    # The Conv2D layer expects a channel axis, so each image is reshaped
    # from (400, 400) to (1, 400, 400, 1), where the last 1 is the number of
    # channels, before being appended to the dataset
    x_train = np.concatenate((x_train,np.reshape(img,(1,400,400,1))),axis=0)

x_test = np.empty((0,400,400,1))
for imgP in sorted(os.listdir("./audiosynthesis_dl/data/pitch_img/img_test")):
  if imgP.split(".")[-1] != "git":
    img = image.load_img( "./audiosynthesis_dl/data/pitch_img/img_test/"+imgP, 
                             target_size=(400, 400),
                             color_mode='grayscale')
    x_test = np.concatenate((x_test,np.reshape(img,(1,400,400,1))),axis=0)


# x_train : 252 images of size 400x400, i.e., x_train.shape = (252, 400, 400, 1)
# y_train : 252 sound labels (from 0 to 11, one per pitch in one octave)
# x_test  : 84 images of size 400x400, i.e., x_test.shape = (84, 400, 400, 1)
# y_test  : 84 sound labels

print('x_train.shape=', x_train.shape) #252 elements
print('x_test.shape=', x_test.shape)   #84 elements


#Convert to float
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

#Normalize inputs from [0; 255] to [0; 1]
x_train = x_train / 255
x_test = x_test / 255

#Global values
num_classes = 12

TRAIN_SIZE = x_train.shape[0]
TEST_SIZE = x_test.shape[0]


# Number of images per class (integer division keeps the labels integral)
NB_VERSIONS_TRAIN = TRAIN_SIZE // num_classes
NB_VERSIONS_TEST = TEST_SIZE // num_classes

# The labels form an array in which each element is the label of the image
# at the same index
y_train = np.empty(TRAIN_SIZE)
for i in range(TRAIN_SIZE):
  y_train[i] = i//NB_VERSIONS_TRAIN

y_test = np.empty(TEST_SIZE)
for i in range(TEST_SIZE):
  y_test[i] = i//NB_VERSIONS_TEST

print(len(x_train[0][0]))
print(len(x_test))  
  
print('y_train.shape=', y_train.shape)
print('y_test.shape=', y_test.shape)


#Convert class vectors to binary class matrices ("one hot encoding")
## Doc : https://keras.io/utils/#to_categorical
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

x_train.shape= (252, 400, 400, 1)
x_test.shape= (84, 400, 400, 1)
400
84
y_train.shape= (252,)
y_test.shape= (84,)
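
The index-based labelling and the one-hot encoding used in the cell above can be sketched with NumPy alone. The sketch assumes, as the cell does, that the images are stored class by class with the same number of images per class; the sample count of 24 is a hypothetical stand-in for the real dataset, and `np.eye` indexing is used here in place of `keras.utils.to_categorical`.

```python
import numpy as np

num_classes = 12
n_samples = 24  # hypothetical dataset size: 2 images per class

# Image i belongs to class i // versions_per_class, which is only valid
# if the images are ordered class by class.
versions_per_class = n_samples // num_classes
labels = np.arange(n_samples) // versions_per_class

# One-hot encoding: row k of the identity matrix encodes class k.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(labels)
print(one_hot.shape)
```

If the file order does not match the class order, every label after the first mismatch is wrong, so it is worth printing a few file names next to their labels before training.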


## The Convolutional Neural Network

Now we need to create and compile the CNN that will classify our data.

In [0]:

# Creation of the convolutional network
# Input: 400x400 grayscale images (one channel)

model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3,3), strides=1, padding='valid', activation='relu', input_shape=(400,400,1)))
model.add(MaxPooling2D(pool_size=(2,2), strides=None, padding='valid', data_format=None))
model.add(Conv2D(filters=64, kernel_size=(3,3), strides=1, padding='valid', activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2), strides=None, padding='valid', data_format=None))
#model.add(Conv2D(filters=128, kernel_size=(3,3), strides=1, padding='valid', activation='relu'))
#model.add(MaxPooling2D(pool_size=(2,2), strides=None, padding='valid', data_format=None))
model.add(Dropout(0.75))
model.add(Flatten(data_format=None))
model.add(Dense(units=128, activation='sigmoid'))
model.add(Dense(units=num_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Note: the test set is also used as validation data during training
hist = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs= 30, batch_size=6)
loss_and_metrics = model.evaluate(x_test, y_test, batch_size=6)
print('loss =', loss_and_metrics[0], 'accuracy =', loss_and_metrics[1])

model.summary()
# Predicted class probabilities for each test image, shape (TEST_SIZE, num_classes)
classes = model.predict(x_test, batch_size=72)


Train on 252 samples, validate on 84 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
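
Since the output of `model.summary()` is not shown above, the sketch below recomputes the feature-map sizes and parameter counts of this architecture by hand. It assumes the layer settings of the cell above (3x3 'valid' convolutions with stride 1, 2x2 max pooling, a 400x400x1 input); the helper names `conv_out` and `pool_out` are hypothetical.

```python
def conv_out(size, kernel=3, stride=1):
    """Output size of a 'valid' convolution along one spatial axis."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Output size of 'valid' max pooling whose stride equals the pool size."""
    return size // pool


size, channels = 400, 1
total_params = 0
for filters in (32, 64):
    # Each filter has 3*3*channels weights plus one bias.
    total_params += (3 * 3 * channels + 1) * filters
    size = pool_out(conv_out(size))
    channels = filters

flat = size * size * channels        # features after Flatten
dense_params = (flat + 1) * 128      # Dense(128): weights plus biases
output_params = (128 + 1) * 12       # Dense(12): weights plus biases
total_params += dense_params + output_params

print(size, flat, dense_params, total_params)
```

Under these assumptions the flattened vector has 614,656 features and the first dense layer alone holds about 78.7 million of the roughly 78.7 million total parameters, which is very large compared with 252 training images.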

## Loss and Accuracy


In [0]:
import matplotlib
import matplotlib.pyplot as plt

# Display one test image together with its one-hot label
i = 0
print('x_test.shape', x_test.shape, 'dtype', x_test.dtype)
print('y[{}]={}'.format(i, y_test[i]))
plt.imshow(x_test[i,:].reshape(400,400), cmap = matplotlib.cm.binary)
plt.axis("off")
plt.show()
plt.gcf().clear()

# summarize history for accuracy
plt.plot(hist.history['acc'])
plt.plot(hist.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
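
The accuracy curves give a single number per epoch. To see which pitches are confused with which, the predicted probabilities can be summarized in a confusion matrix, as in the minimal sketch below. It uses toy data with 3 classes instead of the notebook's arrays, and `confusion_matrix` is a hypothetical helper written here with NumPy only.

```python
import numpy as np


def confusion_matrix(true_one_hot, predicted_probs, num_classes=12):
    """Count samples per (true class, predicted class) pair."""
    true_labels = np.argmax(true_one_hot, axis=1)
    predicted_labels = np.argmax(predicted_probs, axis=1)
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly when the same cell appears several times.
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix


# Toy example: 4 samples, 3 classes.
y_true = np.eye(3)[[0, 1, 2, 2]]
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])

cm = confusion_matrix(y_true, probs, num_classes=3)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)
```

In the notebook, the equivalent call would be `confusion_matrix(y_test, classes)`, since `y_test` is one-hot encoded and `classes` holds the output of `model.predict`.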


## Results


1) loss: 0.2628 - acc: 0.9722 - val_loss: 3.9648 - val_acc: 0.0278

2) after adding dropout:
loss: 0.0153 - acc: 1.0000 - val_loss: 4.3796 - val_acc: 0.2222

3) after data augmentation:
loss: 0.0141 - acc: 1.0000 - val_loss: 4.2034 - val_acc: 0.2500

4) after further data augmentation:
the results get worse :'(
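
The notebook does not show how the data were augmented. As one possible approach, the sketch below augments a batch of images by shifting them horizontally with NumPy. It rests on a strong assumption: that a horizontal shift does not change the pitch a drawing represents, which depends on how the composer encodes pitch. `augment_with_shifts` is a hypothetical helper, not the method used for the results above.

```python
import numpy as np


def augment_with_shifts(images, labels, shifts=(-10, 10)):
    """Return the originals plus horizontally shifted copies with the same labels.

    images has shape (n, height, width, channels); axis 2 is the width.
    np.roll wraps pixels around the border, which is harmless only if the
    image margins are empty.
    """
    augmented_images = [images]
    augmented_labels = [labels]
    for shift in shifts:
        augmented_images.append(np.roll(images, shift, axis=2))
        augmented_labels.append(labels)
    return np.concatenate(augmented_images), np.concatenate(augmented_labels)


# Toy batch: two 4x4 images with a single bright pixel at row 1, column 1.
images = np.zeros((2, 4, 4, 1), dtype="float32")
images[:, 1, 1, 0] = 1.0
labels = np.array([0, 1])

aug_x, aug_y = augment_with_shifts(images, labels, shifts=(1,))
print(aug_x.shape, aug_y)
```

Only the training set should be augmented this way; the test set is left untouched so that the validation figures stay comparable between runs.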

---



Add Dropout?

PROBLEM: the wrong image is displayed