# Exercise 8: Sound transformations

In this exercise you will use the HPS model to creatively transform sounds. There are two parts in this exercise. In the first one you should perform a natural sounding transformation on the speech sound that you used in the previous exercise (E7). In the second part you should select a sound of your choice and do a "creative" transformation. You will have to write a short description of the sound and of the transformation you did, giving the link to the original sound and uploading several transformed sounds.

For this exercise, you can use `transformations_GUI.py` (in `software/transformations_interface/`) to try things out; once you have decided on the parameters, you can fill in the code in this file. You can also do everything from here and add any new code you wish.

In order to perform a good/interesting transformation you should make sure that you have performed an analysis that is adequate for the type of transformation you want to do. Not every HPS analysis representation will work for every type of sound transformation. There will be things in the analysis that, when modified, will result in undesired artifacts. In general, for any transformation, it is best to have the harmonic values as smooth and continuous as possible and a stochastic representation as smooth and with as few values as possible. It might be much better to start with an analysis representation that does not result in the best reconstruction in exchange for having smoother and more compact data.

To help you with the exercise, we give a brief description of the transformation parameters used by the HPS transformation function:

1. `freqScaling`: frequency scaling factors to be applied to the harmonics of the sound, in time-value pairs (where a value of 1 is no scaling). The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The scaling factor is a multiplicative factor, thus a value of 1 is no change. Example: to transpose the sound up an octave you can specify `[0, 2, 1, 2]`.
2. `freqStretching`: frequency stretching factors to be applied to the harmonics of the sound, in time-value pairs (a value of 1 is no stretching). The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The stretching factor is a multiplicative factor whose effect depends on the harmonic number, higher harmonics being more affected than lower ones, thus resulting in an inharmonic effect. A value of 1 results in no transformation. Example: an array like `[0, 1.2, 1, 1.2]` will result in a perceptually large inharmonic effect.
3. `timbrePreservation`: 1 preserves the original timbre, 0 does not. It can only have a value of 0 or of 1. By setting the value to 1 the spectral shape of the original sound is preserved even when the frequencies of the sound are modified. In the case of speech it would correspond to the idea of preserving the identity of the speaker after the transformation.
4. `timeScaling`: time scaling factors to be applied to the whole sound, in time-value pairs (value of 1 is no scaling). The time values can be normalized, from 0 to 1, or can correspond to the times in seconds of the input sound. The time scaling factor is a multiplicative factor, thus 1 is no change. Example: to stretch the original sound to twice the original duration, we can specify `[0, 0, 1, 2]`.
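
As a rough illustration of how frequency scaling and frequency stretching differ, the sketch below applies one factor of each kind to a single frame of harmonics. It assumes that the stretching factor is raised to the power of the zero-based harmonic index; this is an assumption about how `harmonicFreqScaling` behaves and should be checked against the sms-tools source. `scale_and_stretch` is a hypothetical helper, not part of sms-tools.

```python
def scale_and_stretch(harmonics, scaling=1.0, stretching=1.0):
    """Apply frequency scaling and stretching to one frame of harmonic frequencies (Hz)."""
    # Assumption: harmonic k (zero-based) is multiplied by scaling * stretching**k,
    # so the fundamental is untouched by stretching and higher harmonics move more.
    return [f * scaling * stretching ** k for k, f in enumerate(harmonics)]


frame = [100.0, 200.0, 300.0, 400.0]  # a perfectly harmonic frame, f0 = 100 Hz
octave_up = scale_and_stretch(frame, scaling=2.0)      # stays harmonic
inharmonic = scale_and_stretch(frame, stretching=1.2)  # ratios are no longer integers
print(octave_up)
print(inharmonic)
```

With scaling alone every harmonic keeps an integer ratio to the fundamental, whereas with stretching the ratios grow with the harmonic number, which is what produces the inharmonic effect.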

All the transformation values can have as many points as desired, but they have to be in the form of an array with time-value pairs, so of even size. For example a good array for a frequency stretching of a sound that has a duration of 3.146 seconds could be: `[0, 1.2, 2.01, 1.2, 2.679, 0.7, 3.146, 0.7]`.
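
To make the time-value format concrete, the following sketch evaluates such an array at an arbitrary time. It assumes that values between breakpoints are interpolated linearly and held constant outside the given range; `envelope_at` is a hypothetical helper written for illustration, not an sms-tools function.

```python
def envelope_at(pairs, t):
    """Linearly interpolate a flat [t0, v0, t1, v1, ...] array at time t."""
    if len(pairs) < 2 or len(pairs) % 2 != 0:
        raise ValueError("expected a non-empty array of time-value pairs (even size)")
    times, values = pairs[0::2], pairs[1::2]
    if t <= times[0]:
        return values[0]  # before the first breakpoint: hold the first value
    points = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return values[-1]  # after the last breakpoint: hold the last value


stretching = [0, 1.2, 2.01, 1.2, 2.679, 0.7, 3.146, 0.7]
print(envelope_at(stretching, 1.0))     # inside the first flat segment
print(envelope_at(stretching, 2.3445))  # halfway down the ramp
print(envelope_at(stretching, 3.0))     # inside the last flat segment
```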

## Part 1. Perform natural sounding transformations of a speech sound

Use the HPS model with the sound `speech-female.wav`, available in the sounds directory, to first analyze and then obtain a natural sounding transformation of the sound. The synthesized sound should sound as different as possible from the original sound while still sounding natural. By natural we mean that it should sound like speech: it could have been produced by a human, and on listening we should consider it a speech sound, even though we might not be able to understand it. You should first make sure that you start from a good analysis, then you can do time and/or frequency scaling transformations. The transformation should be done in a single pass, with no mixing of sounds coming from different transformations. Since you used the same sound in A7, use that experience to get a good analysis, but consider that the analysis, given that we now want to use it for applying a very strong transformation, might be done differently than what you did in A7.

Write a short paragraph for every transformation, explaining what you wanted to obtain and explaining the transformations you did, giving both the analysis and transformation parameter values (sufficiently detailed for the evaluator to be able to reproduce the analysis and transformation).

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import get_window
import sys, os
sys.path.append('../software/models/')
sys.path.append('../software/transformations/')
import utilFunctions as UF
import stft as STFT
import hpsModel as HPS
import hpsTransformations as HPST
import harmonicTransformations as HT
import IPython.display as ipd

In [57]:
# 1.1 perform an analysis/synthesis using the HPS model

input_file = '../sounds/speech-female.wav'

### set the parameters
window ='hamming'
M = 1401
N = 2048
t = -100
minSineDur = 0.1
nH = 100
minf0 = 135
maxf0 = 245
f0et = 7
harmDevSlope = 0.07
stocf = 0.2

# no need to modify anything after this
Ns = 512
H = 128

(fs, x) = UF.wavread(input_file)
w = get_window(window, M, fftbins=True)
hfreq, hmag, hphase, stocEnv = HPS.hpsModelAnal(x, fs, w, N, H, t, nH, minf0, maxf0, f0et, harmDevSlope, minSineDur, Ns, stocf)
y, yh, yst = HPS.hpsModelSynth(hfreq, hmag, hphase, stocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=x, rate=fs))
ipd.display(ipd.Audio(data=y, rate=fs))

In [58]:
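def convert_to_note(time_start, time_end, position, duration, total_beats, length_track, pitch):
    """Map one syllable onto a note of the target rhythm.

    Returns two flat time-value arrays: a constant pitch factor over the
    syllable, and the mapping of its normalized input span onto its
    normalized output span on the beat grid.
    """
    start_norm = time_start/length_track
    end_norm = time_end/length_track
    beat_start = position / float(total_beats)
    beat_end = (position + duration) / float(total_beats)
    return (
        np.array([start_norm, pitch, end_norm, pitch]),
        np.array([start_norm, beat_start, end_norm, beat_end])
    )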

In [59]:
# cow rap:


# i en
# ---------
# l’al
# tre
#
#
#
#
#
# se
# ---------
# liha
# po
# sat
# un
# tel.
#
#
# La
# ---------
# va
# ca
#
#
#
#
#
# és
# ---------
# ce
# ga
#
#
#
#
#


beats = 32
beat_count = 0
track_length = 3.994
pitch_array = np.array([])
time_array = np.array([])

# start and end time of each syllable
times = [
    [0.118, 0.379],
    [0.379, 0.64],
    [0.64, 0.854],
    [0.854, 1.035],
    [1.035, 1.292],
    [1.292, 1.42],
    [1.42, 1.528],
    [1.528, 1.688],
    [1.688, 1.945],
    [1.945, 2.261],
    [2.261, 2.372],
    [2.372, 2.525],
    [2.525, 2.757],
    [2.757, 3],
    [3, 3.226],
    [3.226, 3.417],
    [3.417, 3.65],
    [3.65, 3.848],
    [3.848, track_length]
]

# pitch of each note
pitch = [
    0.9,
    1.1,
    1.15,
    1,
    0.9,
    0.9,
    1.2,
    1.2,
    1.2,
    1,
    1,
    1,
    1.2,
    1.2,
    1,
    1,
    1,
    1,
    1
]

# duration of each note
dur = [
    1,
    1,
    1,
    5,
    1,
    1,
    1,
    1,
    1,
    1,
    2,
    1,
    1,
    1,
    5,
    1,
    1,
    1,
    5
]

i = 0
while i < len(dur):
    beat_count += dur[i]
    pitch_e, time = convert_to_note(times[i][0],times[i][1], beat_count - dur[i], dur[i], beats, track_length, pitch[i])
    pitch_array = np.append(pitch_array, pitch_e)
    time_array = np.append(time_array, time)
    i += 1

print(beat_count)

32


In [65]:
# 1.2 Perform a transformation from the previous analysis

### define the transformations
# the final *0.9 multiplies every element of the array, time stamps as well as pitch factors
freqScaling = np.append(pitch_array, np.array([1,1]))*0.9
freqStretching = np.array([0, 1, 1, 1])
timbrePreservation = 1
timeScaling = np.append(time_array, np.array([1,1]))

# no need to modify the following code 
Ns = 512
H = 128

# frequency scaling of the harmonics 
hfreqt, hmagt = HT.harmonicFreqScaling(hfreq, hmag, freqScaling, freqStretching, timbrePreservation, fs)

# time scaling the sound
yhfreq, yhmag, ystocEnv = HPST.hpsTimeScale(hfreqt, hmagt, stocEnv, timeScaling)

# synthesis from the transformed hps representation 
y, yh, yst = HPS.hpsModelSynth(yhfreq, yhmag, np.array([]), ystocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=y, rate=fs))

### Explain Part 1

The analysis parameters were chosen following the same criteria as in the A7 assignment. However, this assignment does not require a compressed representation of the audio, so I chose 100 harmonics instead of 40 in order to get a better harmonic representation.

* M = 1401: small enough that it does not blur the transients, but large enough to follow the fs/F0*#bins(hamming) = 1260 rule. Then the closest k * 100 + 1 value is chosen.

* N = 2048: the closest power of 2 above M.

* t = -100: a threshold low enough for the quietest syllables to be detected.

* minSineDur = 0.1: the shortest syllable identified (80 milliseconds) plus a margin.

* nH = 100: trial and error combined with analysis of the spectrogram. 100 harmonics gave perceptually better results than the 40 harmonics chosen for A7.

* minf0 = 135: the minimum F0 identified was 140 Hz, minus a margin.

* maxf0 = 245: the maximum F0 identified was 220 Hz, plus a margin.

* f0et = 7: trial and error to obtain the best HPS synthesis.

* harmDevSlope = 0.07: trial and error to obtain the best HPS synthesis.

* stocf = 0.2: trial and error to obtain the best HPS synthesis. The higher the value, the more frequency filters are created, at the cost of a heavier stochastic representation.
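
The window-size reasoning behind M and N can be written as a small calculation. This is only a sketch, assuming a sampling rate of 44100 Hz and a main-lobe width of 4 bins for the Hamming window; `min_window_size` and `next_power_of_two` are names introduced here for illustration.

```python
import math


def min_window_size(fs, f0, main_lobe_bins=4):
    """Smallest window (in samples) that resolves harmonics spaced f0 Hz apart."""
    return math.ceil(main_lobe_bins * fs / f0)


def next_power_of_two(n):
    """Smallest power of 2 that is greater than or equal to n."""
    return 1 << (n - 1).bit_length()


print(min_window_size(44100, 140))  # lower bound for M with the lowest F0 found
print(next_power_of_two(1401))      # FFT size for M = 1401
```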

For the transformations, I wanted the female voice to "rap" the poem of the cow. To do this, I wrote the function `convert_to_note`, which takes the start and end times of a syllable in seconds, the position and duration of the note in beats, the total number of beats, the length of the track, and the pitch scaling factor.

With this data I created an evenly spaced grid based on the total number of beats and the length of the track. The position and duration are then used to create the time stretch of each note: the duration is the length of the note and the position is the point in time where the note stretching occurs. Finally, the pitch value changes the pitch of that particular note.
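
As a worked example, the first syllable (0.118 s to 0.379 s) placed on the first of the 32 beats gives the pairs computed below. The function is a list-based restatement of `convert_to_note` so that the snippet runs without NumPy; the name `note_pairs` is introduced here only for illustration.

```python
def note_pairs(time_start, time_end, position, duration, total_beats, length_track, pitch):
    """Return (pitch pairs, time pairs) for one syllable, as flat time-value lists."""
    start_norm = time_start / length_track      # syllable start, normalized to 0..1
    end_norm = time_end / length_track          # syllable end, normalized to 0..1
    beat_start = position / float(total_beats)  # note start on the beat grid, 0..1
    beat_end = (position + duration) / float(total_beats)
    return ([start_norm, pitch, end_norm, pitch],
            [start_norm, beat_start, end_norm, beat_end])


pitch_pairs, time_pairs = note_pairs(0.118, 0.379, 0, 1, 32, 3.994, 0.9)
print([round(v, 4) for v in pitch_pairs])
print([round(v, 4) for v in time_pairs])
```

The first list holds the pitch factor constant over the syllable, and the second maps the syllable's span in the input onto one beat of the output grid.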

With this function, I composed a rhythmic pattern in 4/4 with small pitch changes to represent the rap intonation of the woman. The final rhythmic pattern has 32 beats (4 bars).
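
The beat bookkeeping can be checked independently of the audio. The sketch below recomputes the start position of every note from the durations in the `dur` list of the transformation cell and confirms that they fill the 32-beat pattern; it uses only the standard library.

```python
from itertools import accumulate

# note durations in beats, copied from the `dur` list of the transformation cell
dur = [1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 5, 1, 1, 1, 5]
total_beats = 32

# start position of each note = sum of all previous durations
positions = [end - d for end, d in zip(accumulate(dur), dur)]

assert sum(dur) == total_beats, "the notes must fill the pattern exactly"
for pos, d in zip(positions, dur):
    start, end = pos / total_beats, (pos + d) / total_beats
    print(f"beat {pos:2d} -> {pos + d:2d}  (normalized {start:.4f} to {end:.4f})")
```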

I did not want to change `freqStretching` because it changes the relationship between the harmonics, and the voice is no longer recognized as human. Also, `timbrePreservation` is set to 1 so that the formants are preserved.

## Part 2. Perform creative transformations with a sound of your choice

Pick any natural and harmonic sound from Freesound and use the HPS model to do the most creative and interesting transformation you can come up with, sounding as different as possible from the original sound.

It is essential that you start with a natural harmonic sound. Examples include (but are not limited to) any acoustic harmonic instrument, speech, harmonic sounds from nature, etc. As long as the sound has a harmonic structure, you can use it. You can even reuse the sound you used in A7-Part2 or upload your own sound to Freesound and then use it.

The sound from Freesound can be in any format, but to use the sms-tools software you will first have to convert it to a monophonic file (one channel) with a sampling rate of 44100 Hz and 16-bit samples.
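
Before running the analysis it can help to confirm that the converted file has the expected format. The following is a minimal sketch using Python's standard `wave` module; `is_sms_tools_ready` is a hypothetical helper name, and the check only covers uncompressed PCM WAV files.

```python
import wave


def is_sms_tools_ready(path):
    """Return True if the WAV file is mono, 44100 Hz, with 16-bit samples."""
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels() == 1
                and wav.getframerate() == 44100
                and wav.getsampwidth() == 2)  # sample width in bytes: 2 bytes = 16 bits
```

Calling it on the path of the converted file should return `True`; if it returns `False`, the file still needs to be converted.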

You can do any interesting transformation with a single pass. It is not allowed to mix sounds obtained from different transformations. The transformed sound need not sound natural. So, time to show some creativity!

Write a short paragraph for every transformation, explaining what you wanted to obtain and explaining the transformations you did, giving both the analysis and transformation parameter values (sufficiently detailed for the evaluator to be able to reproduce the analysis and transformation).

In [71]:
# 2.1 perform an analysis/synthesis using the HPS model

### set the parameters
input_file = 'throat.wav'
window ='hamming'
M = 2001
N = 4096
t = -70
minSineDur = 4
nH = 20
minf0 = 80
maxf0 = 95
f0et = 7
harmDevSlope = 0.7
stocf = 0.3

# no need to modify anything after this
Ns = 512
H = 128

(fs, x) = UF.wavread(input_file)
w = get_window(window, M, fftbins=True)
hfreq, hmag, hphase, stocEnv = HPS.hpsModelAnal(x, fs, w, N, H, t, nH, minf0, maxf0, f0et, harmDevSlope, minSineDur, Ns, stocf)
y, yh, yst = HPS.hpsModelSynth(hfreq, hmag, hphase, stocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=x, rate=fs))
ipd.display(ipd.Audio(data=y, rate=fs))

In [67]:
times_wo_index = np.linspace(0, track_length, 19)
times = np.zeros(19)
times_wo_index

array([0.        , 0.22188889, 0.44377778, 0.66566667, 0.88755556,
       1.10944444, 1.33133333, 1.55322222, 1.77511111, 1.997     ,
       2.21888889, 2.44077778, 2.66266667, 2.88455556, 3.10644444,
       3.32833333, 3.55022222, 3.77211111, 3.994     ])

In [68]:

beats = 32
beat_count = 0
track_length = len(x)/float(fs)
pitch_array = np.array([])
time_array = np.array([])

# beginning and end of each note
times = [
    [0.26804989, 0.53609977],
    [0.53609977, 0.80414966],
    [0.80414966, 1.07219955],
    [1.07219955, 1.34024943],
    [1.34024943, 1.60829932],
    [1.60829932, 1.87634921],
    [1.87634921, 2.14439909],
    [2.14439909, 2.41244898],
    [2.41244898, 2.68049887],
    [2.68049887, 2.94854875],
    [2.94854875, 3.21659864],
    [3.21659864, 3.48464853],
    [3.48464853, 3.75269841],
    [3.75269841, 3.8515948 ],
    [3.85159481, 4.0207483 ],
    [4.0207483 , 4.28879819],
    [4.28879819, 4.55684807],
    [4.55684807, 4.82489796],
    [4.82489796, track_length]
]

# pitch of each note
pitch = [
    0.9,
    1.5,
    1.4,
    1.8,
    0.9,
    1,
    1.2,
    1.6,
    2,
    1.6,
    1.3,
    1,
    1.6,
    1.3,
    1,
    1.6,
    1.7,
    1.7,
    1.3
]

# duration of each note
dur = [
    1,
    1,
    1,
    5,
    1,
    1,
    1,
    1,
    1,
    1,
    2,
    1,
    1,
    1,
    5,
    1,
    1,
    1,
    5
]
i = 0
while i < len(dur):
    beat_count += dur[i]
    pitch_e, time = convert_to_note(times[i][0],times[i][1], beat_count - dur[i], dur[i], beats, track_length, pitch[i])
    pitch_array = np.append(pitch_array, pitch_e)
    time_array = np.append(time_array, time)
    i += 1

print(beat_count)

32


In [69]:
# 2.2 Perform a transformation from the previous analysis

### define the transformations
freqScaling = np.append(pitch_array, np.array([1,1]))
freqStretching = np.array([0, 4, 0.7, 0.8, 1, 1.2])
timbrePreservation = 0
timeScaling = np.append(time_array, np.array([1,1]))


# no need to modify anything after this
Ns = 512
H = 128

# frequency scaling of the harmonics 
hfreqt, hmagt = HT.harmonicFreqScaling(hfreq, hmag, freqScaling, freqStretching, timbrePreservation, fs)

# time scaling the sound
yhfreq, yhmag, ystocEnv = HPST.hpsTimeScale(hfreqt, hmagt, stocEnv, timeScaling)

# synthesis from the transformed hps representation 
y, yh, yst = HPS.hpsModelSynth(yhfreq, yhmag, np.array([]), ystocEnv, Ns, H, fs)

ipd.display(ipd.Audio(data=y, rate=fs))

### Explain Part 2

I chose the same sound from Freesound used in A7: https://freesound.org/people/djgriffin/sounds/22023

* window = 'hamming': this is the default window for speech analysis. This is not speech, but it is a mouth-generated sound.

* M = 2001: we need at least 44100/90*4 ≈ 1960; the next k * 100 + 1 value is 2001.

* N = 4096: a power of 2 above M; zero-padding interpolation is needed.

* t = -70: a threshold low enough for the quietest harmonic variations to be detected.

* minSineDur = 4: since the model identifies the harmonics with a harmDevSlope, a small value works well so that the model identifies the variation of the frequencies in amplitude as a single harmonic with a high frequency variation.

* nH = 20: trial and error combined with analysis of the spectrogram. 20 harmonics gave me the best results without sacrificing the high end.

* minf0 = 80: the minimum F0 identified was 85 Hz, minus a margin.

* maxf0 = 95: the maximum F0 identified was 85 Hz, plus a margin.

* f0et = 7: trial and error to obtain the best HPS synthesis.

* harmDevSlope = 0.7: trial and error to obtain the best HPS synthesis.

* stocf = 0.3: trial and error to obtain the best HPS synthesis. The higher the value, the more frequency filters are created, at the cost of a heavier stochastic representation.

I used the same `convert_to_note` function to create a distorted melody from the original sound. The pitch transformations here are more drastic than in the previous section of the assignment in order to obtain a melody. Since the fundamental frequency of the sound is stable over time, the pitch transformations can be seen as absolute transpositions of the tone. The rhythmic pattern was designed to obtain a rhythmically rich 4-bar melody in 4/4.
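
Because the scaling factors are frequency ratios, they can be read as musical intervals. A small sketch of the conversion, assuming equal temperament (12 semitones per octave); `ratio_to_semitones` is a name introduced here for illustration.

```python
import math


def ratio_to_semitones(ratio):
    """Convert a frequency scaling factor into an interval in equal-tempered semitones."""
    return 12 * math.log2(ratio)


# a few of the pitch factors used in the `pitch` list of the transformation cell
for factor in (0.9, 1.0, 1.5, 2.0):
    print(f"{factor:>4} -> {ratio_to_semitones(factor):+.2f} semitones")
```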

Finally, I did not want to preserve the formants of the voice, so the `timbrePreservation` argument was set to 0. I also applied the frequency stretching as a V-shaped ramp that does not go too low, to make it more interesting.