# Mboshi/French Corpus: Preliminary analysis

## References

+ https://github.com/besacier/mboshi-french-parallel-corpus

+ https://github.com/hadware/voxpopuli

+ https://www.phon.ucl.ac.uk/home/sampa/

## Extracting the French sentences

The following snippet gathers all the French sentences of the corpus into a single text file:

In [29]:
import os

# git clone https://github.com/besacier/mboshi-french-parallel-corpus
RESSOURCES_DIR = "/home/kaoro/Documents/vocal_synthesis/mboshi-french-parallel-corpus/full_corpus_newsplit/all/"
CORPUS_PATH = "french_corpus.txt"

# Test for an existing corpus *before* opening it: mode 'w' truncates the file,
# so a size check made after open() would always report 0.
if os.path.exists(CORPUS_PATH) and os.path.getsize(CORPUS_PATH) > 0:
    print("File already existing")
else:
    with open(CORPUS_PATH, 'w') as corpus:
        # Sorted so that the sentence order is the same on every run
        for path in sorted(os.listdir(RESSOURCES_DIR)):
            if path.endswith(".fr"):
                with open(os.path.join(RESSOURCES_DIR, path), 'r') as sentence:
                    corpus.write(sentence.read())
    print("Corpus generated")

Corpus generated
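
A quick sanity check can confirm that the generated file matches the source files. The sketch below is not part of the original notebook: `count_sentences` is a hypothetical helper, and it assumes that each `.fr` file holds exactly one sentence terminated by a newline.

```python
import os


def count_sentences(resources_dir, corpus_path):
    """Return (number of .fr files, number of non-empty lines in the corpus)."""
    n_files = sum(1 for name in os.listdir(resources_dir) if name.endswith(".fr"))
    with open(corpus_path, "r", encoding="utf-8") as corpus:
        n_lines = sum(1 for line in corpus if line.strip())
    return n_files, n_lines
```

Calling `count_sentences(RESSOURCES_DIR, CORPUS_PATH)` should return two equal numbers. If they differ, some source file probably lacks a trailing newline or spans several lines.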


## Working with the corpus

The generated file, which contains all of the ~5130 French sentences, can then be used to extract useful information about phonemes. This information should help us determine what kind of processing is needed to make the corpus more suitable for Mbrola voice creation.

In [31]:
import sys
sys.path.append('venv/lib/python3.6/site-packages')
from voxpopuli import Voice

CORPUS_PATH = "french_corpus.txt"
PHONETIZED_CORPUS_PATH = "french_corpus_phonemes.txt"

voice = Voice(lang="fr")
phonetized_sentences = []

with open(CORPUS_PATH, "r") as corpus:
    # One sentence per line; strip the trailing newline before phonemization
    for line in corpus:
        phonetized_sentences.append(voice.to_phonemes(line.strip()))

print("Phonetized sentences list generated")
print("Corpus size : {}".format(len(phonetized_sentences)))

Phonetized sentences list generated
Corpus size : 5130
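
The cell above defines `PHONETIZED_CORPUS_PATH` but never writes to it. Since phonemization is the slow step, one option is to cache the phoneme names on disk. The following is only a sketch: `save_phoneme_names` and `load_phoneme_names` are hypothetical helpers, `Phoneme` is a stand-in namedtuple rather than the voxpopuli class, and the only thing assumed about real phonemes is the `.name` attribute used elsewhere in this notebook. Durations and pitch information are discarded.

```python
from collections import namedtuple

# Stand-in for a voxpopuli phoneme: only the .name attribute is needed here.
Phoneme = namedtuple("Phoneme", ["name"])


def save_phoneme_names(sentences, path):
    """Write one sentence per line, phoneme names separated by spaces."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            out.write(" ".join(p.name for p in sentence) + "\n")


def load_phoneme_names(path):
    """Read the file back as a list of lists of phoneme names."""
    with open(path, "r", encoding="utf-8") as src:
        return [line.split() for line in src]
```

With the real data, this would be `save_phoneme_names(phonetized_sentences, PHONETIZED_CORPUS_PATH)`. The space separator works because the phoneme names seen in this corpus contain no spaces.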


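## Average sentence length and phoneme distribution
We can now extract some basic statistics from this corpus, i.e. the average sentence length in phonemes and the share of each phoneme in percent. Note that `_` is the Mbrola silence symbol, so pauses are counted alongside the actual phonemes: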

In [32]:
from collections import Counter

# Number of phonemes in each sentence
phonetized_sentences_count = [len(x) for x in phonetized_sentences]

print("Average sentence length in phonemes : {}".format(sum(phonetized_sentences_count) /
                                                       len(phonetized_sentences_count)))

# Occurrences of each phoneme name over the whole corpus
phonemes_freq = Counter(y.name for x in phonetized_sentences for y in x)
# Computed once instead of on every iteration of the print loop
total = sum(phonemes_freq.values())

print("Representation (in %) for each phoneme in the corpus :\n")
for key, value in phonemes_freq.most_common():
    print("{} - {} ({}%)".format(key, value, round(value / total * 100, 3)))


Average sentence length in phonemes : 27.151656920077972
Representation (in %) for each phoneme in the corpus :

_ - 12618 (9.059%)
a - 12432 (8.925%)
l - 9896 (7.105%)
R - 9264 (6.651%)
E - 6777 (4.865%)
t - 6595 (4.735%)
i - 6374 (4.576%)
s - 6345 (4.555%)
e - 6228 (4.471%)
d - 5369 (3.855%)
@ - 5261 (3.777%)
a~ - 4311 (3.095%)
m - 4303 (3.089%)
p - 4131 (2.966%)
k - 3647 (2.618%)
n - 3431 (2.463%)
y - 3378 (2.425%)
o~ - 2764 (1.984%)
b - 2600 (1.867%)
v - 2534 (1.819%)
u - 2441 (1.752%)
O - 2414 (1.733%)
f - 2369 (1.701%)
j - 2342 (1.681%)
z - 2205 (1.583%)
Z - 1947 (1.398%)
o - 1608 (1.154%)
S - 1229 (0.882%)
w - 949 (0.681%)
e~ - 897 (0.644%)
g - 881 (0.633%)
9~ - 600 (0.431%)
9 - 588 (0.422%)
2 - 553 (0.397%)
N - 7 (0.005%)
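
The tail of this table is probably the most relevant part for voice creation: a phoneme seen only a handful of times (`N`, with 7 occurrences) is likely to be under-represented in any recording set built from the corpus. The sketch below flags such phonemes. `rare_phonemes` is a hypothetical helper, and the threshold of 100 is an arbitrary illustration, not a value taken from the Mbrola documentation.

```python
def rare_phonemes(freq, min_count):
    """Return (name, count) pairs seen fewer than min_count times, rarest first."""
    rare = [(name, count) for name, count in freq.items() if count < min_count]
    return sorted(rare, key=lambda item: item[1])


# A few counts copied from the table above; pass the full phonemes_freq in practice.
sample = {"a": 12432, "9~": 600, "9": 588, "2": 553, "N": 7}
print(rare_phonemes(sample, min_count=100))  # → [('N', 7)]
```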


## Pandi Pandas

Let's load the corpus into a Series to leverage the power of Pandas for visualization:

In [67]:
import pandas as pd

# Keep only the phoneme names of each sentence
phonetized_sentences_converted = [[y.name for y in x] for x in phonetized_sentences]

sentences = pd.Series(phonetized_sentences_converted)
# A Series of lists has no numeric mean. The original call sentences.mean(axis=x)
# also passed the leftover loop variable x (a PhonemeList) as the axis, which
# raised "TypeError: unhashable type: 'PhonemeList'". Work on lengths instead.
lengths = sentences.apply(len)

print(sentences.axes)
print("Average sentence size in phonemes : {}".format(lengths.mean()))
print("Max sentence size in phonemes : {}".format(lengths.max()))
print("Min sentence size in phonemes : {}".format(lengths.min()))

[RangeIndex(start=0, stop=5130, step=1)]
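
Numeric statistics are not meaningful on a `Series` whose elements are Python lists, so one approach is to derive two numeric views from it: sentence lengths, and a flattened series with one row per phoneme. The sketch below uses toy data in place of `phonetized_sentences_converted` and assumes pandas 0.25 or later for `Series.explode`.

```python
import pandas as pd

# Toy data standing in for phonetized_sentences_converted (lists of phoneme names).
toy_sentences = pd.Series([["b", "o~", "Z", "u", "R"],
                           ["s", "a", "l", "y"],
                           ["a", "l", "o"]])

# Sentence lengths form a numeric Series, so mean/max/min are well defined.
lengths = toy_sentences.apply(len)
print(lengths.mean(), lengths.max(), lengths.min())

# One row per phoneme, then the relative frequency of each phoneme in percent.
phoneme_share = toy_sentences.explode().value_counts(normalize=True) * 100
print(phoneme_share.round(3))
```

If matplotlib is installed, `phoneme_share.plot.bar()` should give a quick bar chart of the distribution.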