# Mboshi/French Corpus: Preliminary analysis

## References

+ https://github.com/besacier/mboshi-french-parallel-corpus

+ https://github.com/hadware/voxpopuli

+ https://www.phon.ucl.ac.uk/home/sampa/

## Extracting the French sentences

The following snippet gathers all the French sentences of the corpus into a single text file:

In [3]:
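import os

RESSOURCES_DIR = "/home/kaoro/Documents/vocal_synthesis/mboshi-french-parallel-corpus/full_corpus_newsplit/all/"
CORPUS_PATH = "french_corpus.txt"

paths = os.listdir(RESSOURCES_DIR)

# Check for an existing corpus *before* opening it in write mode:
# opening with 'w' truncates the file, so a size test made afterwards always sees 0.
if os.path.exists(CORPUS_PATH) and os.stat(CORPUS_PATH).st_size > 0:
    print("File already exists")
else:
    with open(CORPUS_PATH, 'w') as corpus:
        for path in paths:
            # Only the .fr files hold the French sentences.
            if path.endswith(".fr"):
                with open(os.path.join(RESSOURCES_DIR, path), 'r') as sentence:
                    corpus.write(sentence.read())
    print("Corpus generated")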

Corpus generated
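
One thing to watch with plain concatenation is that a `.fr` file lacking a trailing newline would fuse its sentence with the next one. Below is a minimal sketch of a more defensive variant that writes exactly one sentence per line. The helper name `collect_sentences`, its parameters, and the UTF-8 encoding are assumptions made for illustration, not part of the corpus tooling:

```python
from pathlib import Path


def collect_sentences(source_dir, output_path, suffix=".fr"):
    """Write every non-empty line of the matching files to output_path, one per line.

    Returns the number of sentences written.
    """
    count = 0
    with open(output_path, "w", encoding="utf-8") as corpus:
        # Sorted so that the output order does not depend on the filesystem.
        for path in sorted(Path(source_dir).iterdir()):
            if path.suffix != suffix:
                continue
            for line in path.read_text(encoding="utf-8").splitlines():
                line = line.strip()
                if line:
                    # Always terminate the sentence ourselves.
                    corpus.write(line + "\n")
                    count += 1
    return count
```

Calling `collect_sentences(RESSOURCES_DIR, "french_corpus.txt")` would then return the number of sentences gathered, which gives a quick sanity check against the expected corpus size.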


## Working with the corpus

The generated file, which contains all of the ~5100 French sentences, can then be used to extract valuable information about phonemes. This data will help us determine what kind of processing is needed to make the corpus more suitable for Mbrola voice creation.

In [11]:
import sys
# Make the packages of the local virtualenv importable from the notebook kernel.
sys.path.append('venv/lib/python3.6/site-packages')
from voxpopuli import Voice

CORPUS_PATH = "french_corpus.txt"
PHONETIZED_CORPUS_PATH = "french_corpus_phonemes.txt"

voice = Voice(lang="fr")
phonetized_sentences = []

# Phonemize the corpus one line (one sentence) at a time.
with open(CORPUS_PATH, "r") as corpus:
    for line in corpus:
        phonetized_sentences.append(voice.to_phonemes(line.strip()))

print("Phonetized sentences list generated")
print("Corpus size : {}".format(len(phonetized_sentences)))

Phonetized sentences list generated
Corpus size : 80
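
`PHONETIZED_CORPUS_PATH` is defined above but nothing is written to it yet. Below is a minimal sketch of saving the phonemized corpus, assuming each sentence has first been reduced to a list of phoneme names (for instance through the `name` attribute of each phoneme). The helpers `write_phoneme_names` and `read_phoneme_names`, and the one-sentence-per-line, space-separated layout, are illustrative choices rather than a format required by voxpopuli or Mbrola:

```python
def write_phoneme_names(sentences, output_path):
    """Write one sentence per line, phoneme names separated by single spaces."""
    with open(output_path, "w", encoding="utf-8") as out:
        for names in sentences:
            out.write(" ".join(names) + "\n")


def read_phoneme_names(input_path):
    """Read back a file produced by write_phoneme_names."""
    with open(input_path, "r", encoding="utf-8") as src:
        # split() with no argument drops the trailing newline as well.
        return [line.split() for line in src]
```

In this notebook the call could look like `write_phoneme_names([[p.name for p in s] for s in phonetized_sentences], PHONETIZED_CORPUS_PATH)`. The layout relies on SAMPA symbols containing no spaces.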


Now we can extract some basic statistics from this corpus, i.e. the average sentence length in phonemes and the share of each phoneme in percent:

In [27]:
phonetized_sentences_count = []
phonemes_freq = {}

# Number of phonemes in each sentence.
for x in phonetized_sentences:
    phonetized_sentences_count.append(len(x))

print("Average sentence length in phonemes : {}".format(sum(phonetized_sentences_count) /
                                                        len(phonetized_sentences_count)))

# Count the occurrences of each phoneme over the whole corpus.
for x in phonetized_sentences:
    for y in x:
        phonemes_freq[y.name] = phonemes_freq.get(y.name, 0) + 1

# Compute the total once rather than on every iteration of the printing loop.
total_phonemes = sum(phonemes_freq.values())

print("Representation (in %) for each phoneme in the corpus :\n")
sorted_freq = sorted(phonemes_freq.items(), key=lambda kv: kv[1], reverse=True)
for key, value in sorted_freq:
    print("{} - {} ({}%)".format(key, value, round(value / total_phonemes * 100, 3)))


Average sentence length in phonemes : 27.3875
Representation (in %) for each phoneme in the corpus :

a - 201 (9.174%)
_ - 198 (9.037%)
l - 167 (7.622%)
R - 122 (5.568%)
i - 109 (4.975%)
t - 109 (4.975%)
s - 108 (4.929%)
E - 106 (4.838%)
e - 102 (4.655%)
@ - 83 (3.788%)
d - 75 (3.423%)
m - 75 (3.423%)
a~ - 60 (2.738%)
k - 57 (2.602%)
n - 54 (2.465%)
y - 50 (2.282%)
o~ - 48 (2.191%)
p - 47 (2.145%)
z - 44 (2.008%)
j - 43 (1.963%)
f - 40 (1.826%)
u - 39 (1.78%)
b - 37 (1.689%)
v - 36 (1.643%)
O - 36 (1.643%)
o - 29 (1.324%)
Z - 27 (1.232%)
w - 18 (0.822%)
S - 17 (0.776%)
2 - 14 (0.639%)
g - 12 (0.548%)
e~ - 12 (0.548%)
9 - 9 (0.411%)
9~ - 7 (0.319%)
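
The same table can be produced more compactly with `collections.Counter`. The sketch below is an illustration that works on plain lists of phoneme names rather than voxpopuli objects. The function names and the default 1% threshold are hypothetical choices, shown only as one possible way to spot phonemes that may be under-represented in the corpus:

```python
from collections import Counter


def phoneme_shares(sentences):
    """Return (name, count, percent) tuples sorted by decreasing count."""
    counts = Counter(name for sentence in sentences for name in sentence)
    total = sum(counts.values())
    return [(name, count, round(count / total * 100, 3))
            for name, count in counts.most_common()]


def rare_phonemes(sentences, threshold_percent=1.0):
    """Return the names of phonemes whose share is below threshold_percent."""
    return [name for name, _, percent in phoneme_shares(sentences)
            if percent < threshold_percent]
```

Applied to the figures listed above, a 1% threshold would single out `w`, `S`, `2`, `g`, `e~`, `9` and `9~`.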
