In [2]:
# check if ABG corpus was already downloaded
# if not, download it
if [ ! -f /tmp/Corpus_ABG.csv ]; then 
  wget -q https://raw.githubusercontent.com/SauronGuide/corpusABG/master/Corpus_ABG_Completo_Versao3.csv -O /tmp/Corpus_ABG.csv
fi

Check ABG Corpus file type.

In [3]:
file /tmp/Corpus_ABG.csv

/tmp/Corpus_ABG.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators


Since it uses CRLF line terminators, it was probably created on Windows. The carriage return (```\r```) before each line feed is unnecessary on Linux, where it may show up as ```^M``` and cause problems further on.

The BOM is also redundant: UTF-8 is a byte-oriented encoding, so there is no endianness to signal. In UTF-8 the BOM consists of the three initial bytes ```0xEF```, ```0xBB```, ```0xBF```.

So let's start by removing both of them.

In [4]:
tail --bytes=+4 /tmp/Corpus_ABG.csv | tr -d '\r' > /tmp/CorpusABG.csv
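Note that `tail --bytes=+4` drops the first three bytes unconditionally, so it would eat real data if the file had no BOM. A minimal sketch of a guarded version follows. The helper name `strip_bom_crlf` and the sample file are assumptions made for illustration; they are not part of the original notebook.

```bash
# Hypothetical helper: strip a UTF-8 BOM only if it is present, then drop carriage returns.
strip_bom_crlf() {
  local src=$1
  if [ "$(head -c 3 "$src" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
    tail -c +4 "$src"      # skip the three BOM bytes
  else
    cat "$src"             # no BOM: keep every byte
  fi | tr -d '\r'
}

# Made-up sample: BOM, CRLF line endings, two rows.
sample=$(mktemp)
printf '\357\273\277ID,PALAVRA\r\n1,casa\r\n' > "$sample"
cleaned=$(strip_bom_crlf "$sample")
printf '%s\n' "$cleaned"
rm -f "$sample"
```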

Unfortunately, the ABG corpus has many errors. We will try to fix some of them; many others will be left as they are.

The first error we found is empty lines, which appear as: ```,,,,,,,,,,,,,,,,```.

Let's list them and then remove them. We will use grep to find them, with the ```-n``` parameter to print the line number along with the match.

In [5]:
grep -n '^,' /tmp/CorpusABG.csv 

[32m[K20608[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K20747[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K20957[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K48422[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K48764[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K49034[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K49705[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K49838[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K50175[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K51587[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K53205[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K92625[m[K[36m[K:[m[K[01;31m[K,[m[K,   ,,,,,,,,,,,,,,


Now, let's remove those lines. We use the same command as above and cut the result to keep only the line number. Then we loop over those numbers to build a ```sed``` script that deletes all of them at once.

In [6]:
LINES=""
for i in $(grep -n '^,' /tmp/CorpusABG.csv | cut -d: -f1); do
  LINES="${LINES}${i}d;"
done
# -i -e rather than -ie: GNU sed would read the "e" as a backup suffix
sed -i -e "${LINES}" /tmp/CorpusABG.csv
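The same rows can also be dropped in a single pass, without collecting line numbers first. The sketch below uses a made-up sample file and GNU `sed -i`, as the rest of this notebook does; it is an alternative under those assumptions, not the method used above.

```bash
# Made-up sample with two "empty" rows (lines that start with a comma).
sample=$(mktemp)
printf '%s\n' 'ID,PALAVRA' '1,casa' ',,,,' '2,vida' ',   ,,,' > "$sample"

# Delete every line matching the same pattern that grep used above.
sed -i '/^,/d' "$sample"

remaining=$(cat "$sample")
printf '%s\n' "$remaining"
rm -f "$sample"
```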

The corpus has a sequence of commas ```,,,,,``` at the end of each line. We will remove it, since it is unused.

In [7]:
sed -i 's/,,,,,$//g' /tmp/CorpusABG.csv

The fields (columns) in the ABG corpus are listed on the first line.

In [8]:
head -n 1 /tmp/CorpusABG.csv 

ID,PALAVRA,CATMORF,LEMA,TRANSCRICAO,ACENTUACAO,ESTSILABICA,CATACENTUAL,FREQGERAL,FREQORAL,FREQESCRITA,Freq_Nivel


There are 12 fields: ID (1), PALAVRA (2), CATMORF (3), LEMA (4), TRANSCRICAO (5), ACENTUACAO (6), ESTSILABICA (7), CATACENTUAL (8), FREQGERAL (9), FREQORAL (10), FREQESCRITA (11), Freq_Nivel (12).

Another type of error we found in the database is lines that don't have exactly 12 fields (3 have more than 12 and 3 have fewer). The authors probably inserted erroneous commas or missed some, creating a malformed file. The script below lists the rows that do not have 12 fields. We also print the line number, so that we can once again remove the bad data.

In [9]:
cat /tmp/CorpusABG.csv | sed 's/,\+\s*$//' | awk -F, 'NF!=12{print NR":"$0}'

24089:24093,Jereissati,   F   jereissati,   &je-reJ-sa-Ti*,   &je-reJ-s1-Ty*,   &CV-CVG-CV-CV*,   parox?tona,6,   0o,6,2
35806:35816,Fuentes,   F   Fuentes,   &fE-tes*,   &fE-t4s*,   &CV-CVS*,   ox?tona,3,   0o,3,2
46495:46508,prociss?es,   NP,,   prociss?es,   &pro-si-sOJs*,   &pro-si-s#Js*,   &CCV-CV-CVGS*,   ox?tona,2,2,   0e,2
51341:51359,Josh,   F   Josh,   &jo-Sy*,   &j9-Sy*,   &CV-CV*,   parox?tona,2,   0o,2,2
54571:54607,Juvenile,   NOM,   Juvenile,   &ju-ve-ni-le*,   &ju-ve-n7-ly*,   &CV-CV-CV-CV*,   parox?tona,1,   0o,1,,54607
77325:77959,bulut,   F,   bulut,   &bu-lut*,   &bu-l$t*,   &CV-CVC*,   ox?tona,1,   0o,1,1,,1,   0o,1,1


Let's then remove those lines, just as we did before.

In [10]:
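LINES=""
# NOTE: this loop reads the original download (/tmp/Corpus_ABG.csv), not the cleaned
# /tmp/CorpusABG.csv, so its line numbers are shifted by the empty rows removed above;
# the malformed rows listed in the previous cell are still present in later cells.
for i in $(sed 's/,\+\s*$//' < /tmp/Corpus_ABG.csv | awk -F, 'NF!=12{print NR":"$0}' | cut -d: -f1); do
  LINES="${LINES}${i}d;"
done
sed -i -e "${LINES}" /tmp/CorpusABG.csv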

There are also rows where the FREQGERAL is zero or not a number. 

In [11]:
awk -F, '$9<1{print NR":"$0}' < /tmp/CorpusABG.csv

24086:24093,Jereissati,   F   jereissati,   &je-reJ-sa-Ti*,   &je-reJ-s1-Ty*,   &CV-CVG-CV-CV*,   parox?tona,6,   0o,6,2
35802:35816,Fuentes,   F   Fuentes,   &fE-tes*,   &fE-t4s*,   &CV-CVS*,   ox?tona,3,   0o,3,2
46490:46508,prociss?es,   NP,,   prociss?es,   &pro-si-sOJs*,   &pro-si-s#Js*,   &CCV-CV-CVGS*,   ox?tona,2,2,   0e,2
51329:51359,Josh,   F   Josh,   &jo-Sy*,   &j9-Sy*,   &CV-CV*,   parox?tona,2,   0o,2,2


We're going to remove these data as well.

In [12]:
LINES=""
for i in $(awk -F, '$9<1{print NR":"$0}' < /tmp/CorpusABG.csv | cut -d: -f1); do
  LINES="${LINES}${i}d;"
done
sed -i -e "${LINES}" /tmp/CorpusABG.csv
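The rows matched by `$9<1` are not necessarily rows whose FREQGERAL is below one: when a field does not look like a number, awk falls back to a string comparison, and a field that starts with a space sorts before `1`. A minimal sketch of a more explicit test, on made-up sample rows, could be:

```bash
# Made-up rows with 12 fields; column 9 plays the role of FREQGERAL.
rows='1,a,b,c,d,e,f,g,12,3,9,2
2,a,b,c,d,e,f,g,   0o,2,2,1
3,a,b,c,d,e,f,g,0,0,0,1'

# Flag a row when column 9 is not a plain integer, or is an integer below 1.
bad=$(printf '%s\n' "$rows" |
  awk -F',[ ]*' '$9 !~ /^[0-9]+$/ || $9+0 < 1 {print $1}')
printf '%s\n' "$bad"
```

Here row 2 is flagged because its ninth field is text, and row 3 because its frequency is zero.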

There are other frequency fields (FREQORAL (10), FREQESCRITA (11), Freq_Nivel (12)), and their values should be numeric and positive. But many have invalid values.

In [13]:
awk -F, '($10<1||$11<1||$12<1){print NR":"$0}' < /tmp/CorpusABG.csv | wc -l

71043


This number represents a large share of the total data, so we will leave these rows as they are.

In [14]:
TOTAL=$(wc -l /tmp/CorpusABG.csv | cut -d' ' -f1)
COUNTNN=$(awk -F, '($10<1||$11<1||$12<1){print NR":"$0}' < /tmp/CorpusABG.csv | wc -l)
echo "$COUNTNN/$TOTAL" | bc -l

.76726931052358735095
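The same ratio can be obtained in a single awk pass, without `wc` and `bc`. The sketch below runs on four made-up rows, two of which have a non-numeric value in columns 10 or 11.

```bash
# Made-up rows: rows 2 and 3 carry the "0o" / "0e" markers seen in the corpus.
rows='1,w,c,l,t,a,e,s,5,2,3,1
2,w,c,l,t,a,e,s,5,0o,5,1
3,w,c,l,t,a,e,s,5,2,0e,1
4,w,c,l,t,a,e,s,7,3,4,2'

# Count the offending rows and divide by the number of rows read (NR).
ratio=$(printf '%s\n' "$rows" |
  awk -F, '($10<1||$11<1||$12<1){bad++} END{printf "%.4f\n", bad/NR}')
echo "$ratio"
```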


The morphological category (column 3) and the stress category (column 8) are categorical variables: each takes values from a finite set, and the value assigns each sample to a group. Let's check the values used in the corpus.


Morphological category (column 3):

In [15]:
awk -F',[ ]*' '{print $3}' /tmp/CorpusABG.csv | sort | uniq -c | sort -rn

  36563 NOM
  32776 V
  11681 ADJ
   5067 F
   2188 V+P
   1456 C
   1061 ADV
    717 
    430 G
    184 P
    179 I
     80 PREP+P
     55 CONJ
     42 NUM
     39 PREP
     38 PREP+DET
     27 DET
      5 PREP+ADV
      1 V+p
      1 FNOM
      1 f
      1 CATMORF


We see that there are 717 entries with no morphological category assigned, one "V+p" that is probably "V+P", and one "f" that is probably "F". The last two are easy to correct.

In [16]:
awk -F',[ ]*' 'BEGIN{OFS=", "} ($3=="f"){$3="F"} ($3=="V+p"){$3="V+P"} {print}' /tmp/CorpusABG.csv > /tmp/tmpABG
mv /tmp/tmpABG /tmp/CorpusABG.csv
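Since every other category label is already upper case, a more general variant is to upper-case column 3 whenever it differs from its upper-case form. This is only a sketch on made-up rows; it assumes no category is legitimately lower case.

```bash
rows='ID,PALAVRA,CATMORF
1,casa,   NOM
2,efe,   f
3,da-lhe,   V+p'

# Rewrite column 3 only when upper-casing changes it; untouched rows are printed as they came.
fixed=$(printf '%s\n' "$rows" |
  awk -F',[ ]*' 'BEGIN{OFS=","} NR>1 && $3!=toupper($3){$3=toupper($3)} {print}')
printf '%s\n' "$fixed"
```

Note that awk rebuilds a record with `OFS` only when one of its fields is assigned, so the padding after the commas disappears only on the corrected rows.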

The list of the 717 entries with an empty morphological category (column 3) is given below:

In [17]:
awk -F',[ ]*' '($3==""){print $1,$2}' /tmp/CorpusABG.csv | column

143 legal		395 crian?as		634 logo
149 segundo		396 pol?cia		635 estados
150 vida		397 sala		636 velho
151 outros		398 p?blico		637 m?dico
152 nunca		399 enquanto		638 a??o
153 foram		400 nosso		639 mal
154 ?poca		401 junto		640 votos
155 ia			402 sul			641 sociedade
156 S?o			403 falava		642 alunos
157 disse		404 quarto		643 relacionadas
158 est?o		405 situa??o		644 pequena
159 fui			406 talvez		645 aquelas
160 tipo		407 social		646 debate
161 ir			408 conhece		647 gostoso
162 quer		409 seguran?a		648 trabalhando
163 dentro		410 umas		649 antigamente
164 pessoal		411 idade		650 quiser
165 ficou		412 m?s			651 dilma
166 d?			413 dados		652 papel
167 ver			414 for			653 l?ngua
168 dizer		415 podem		654 equipe
169 falei		416 Mas			655 saiu
170 maior		417 bairros		656 seguinte
171 seus		418 importante		657 cinema
172 entendeu		419 filha		658 pensar
173 teve		420 pouquinho		659 ruim
174 jeito		421 conhe?o		660 eleitoral
175 tanto		422 curso		661 cidades
176 essas		423 processo		662 ideia
17

391 professor		630 entrevista		869 obrigada
392 viol?ncia		631 condi??es		870 Petrobras
393 viu			632 certeza		871 guerra
394 zona		633 baixo		872 poucos


Stress category (column 8):

In [18]:
awk -F',[ ]*' '{print $8}' /tmp/CorpusABG.csv | sort | uniq -c | sort -rn

  55262 parox?tona
  21694 ox?tona
   6800 paroxitona
   3470 oxitona
   3392 proparox?tona
   1530 mono
    396 proparoxitona
     40 4
      2 PARox?tona
      2 ox?tono
      1 quatro
      1 parox?ton
      1 parox?ona
      1 CATACENTUAL


There are 40 entries with value 4 for the stress category, one entry with value 'quatro' (probably the same as 4), and many misspelled labels, which we can correct with simple substitutions.

In [19]:
# fix stress category
sed -i 's/ox?tona/oxitona/g' /tmp/CorpusABG.csv
sed -i 's/ox?tono/oxitona/g' /tmp/CorpusABG.csv
sed -i 's/PARoxitona/paroxitona/g' /tmp/CorpusABG.csv
sed -i 's/parox?ona/paroxitona/g' /tmp/CorpusABG.csv
sed -i 's/parox?ton/paroxitona/g' /tmp/CorpusABG.csv
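The `sed` substitutions above act on the whole line, so they would also touch a word or lemma that happens to contain one of those strings. A sketch that restricts the normalisation to column 8 is shown below, on made-up rows; it assumes the damaged character in this column always stands for an accented i.

```bash
rows='ID,PALAVRA,C3,C4,C5,C6,C7,CATACENTUAL
1,a,b,c,d,e,f,   parox?tona
2,a,b,c,d,e,f,   PARox?tona
3,a,b,c,d,e,f,   ox?tono
4,a,b,c,d,e,f,   parox?ona
5,t?cnico,b,c,d,e,f,   proparox?tona'

fixed=$(printf '%s\n' "$rows" | awk -F',[ ]*' 'BEGIN{OFS=","}
  NR>1 {
    c = tolower($8)
    sub(/\?/, "i", c)                          # assumed: "?" replaces an accented i here
    if (c == "oxitono") c = "oxitona"          # known typo
    if (c == "paroxiton" || c == "paroxiona") c = "paroxitona"
    $8 = c                                     # only column 8 is rewritten
  }
  {print}')
printf '%s\n' "$fixed"
```

The `?` in the word `t?cnico` (column 2) is left alone.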

Now, let's check some of those entries with value 4 in the stress category.

In [20]:
awk -F',[ ]*' '($8==4){print}' /tmp/CorpusABG.csv | head -n 5

719,t?cnico,,   t?cnico,   &t5-ky-ni-ko*,   &t5-ky-ny-kw*,   &CV-CV-CV-CV*,4,500,151,349,3
1485,t?cnica,   NOM,   t?cnica,   &t5-ky-ni-ka*,   &t5-ky-ny-k@*,   &CV-CV-CV-CV*,4,231,70,161,3
2666,t?cnicos,   ADJ,   t?cnicos,   &t5-ky-ni-kos*,   &t5-ky-ny-kws*,   &CV-CV-CV-CVS*,4,121,13,108,3
4111,t?cnicas,   ADJ,   t?cnicas,   &t5-ky-ni-kas*,   &t5-ky-ny-k@s*,   &CV-CV-CV-CVC*,4,73,11,62,2
5060,d?ficit,   NOM,   d?ficit,   &d5-fi-si-ty*,   &d5-fy-sy-ty*,   &CV-CV-CV-CV*,4,56,3,53,2


And the complete list of words with value 4 (or 'quatro') in stress category is:

In [21]:
awk -F',[ ]*' '($8==4||$8=="quatro"){print $1,$2}' /tmp/CorpusABG.csv | column

719 t?cnico		31949 Polit?cnica	56961 antiss?ptico
1485 t?cnica		34472 eletrot?cnico	57174 antiss?pticos
2666 t?cnicos		34935 ?tnicas		57581 apocal?pticas
4111 t?cnicas		41512 aut?ctones	59961 c?psula
5060 d?ficit		42759 eletrot?cnica	64142 el?ptico
13979 l?xico		43225 el?ptica		64157 epil?ptica
16379 polit?cnica	44164 ex-t?cnico	64278 epil?pticos
16904 ?tnica		46546 logar?tmica	72143 lepid?pteros
19590 inc?gnita		50151 pan?ptico		74419 g?ngsteres
21275 ?tnicos		51463 r?tmico		77274 multi?tnico
23465 ?tnico		51627 sociot?cnica	83538 d?couvert
26420 inc?gnito		56107 h?bitat		84891 pirot?cnicos
28814 apocal?ptica	56835 anal?pticos	87837 Polit?cnico
29173 c?psulas		56954 antiss?ptica


All of them seem to be proparoxytone ('proparoxitona') in fact, so let's correct them.

In [22]:
awk -F',[ ]*' 'BEGIN{OFS=", "} ($8==4||$8=="quatro"){$8="proparoxitona"} {print}' /tmp/CorpusABG.csv > /tmp/tmpABG
mv /tmp/tmpABG /tmp/CorpusABG.csv

Now let's make a histogram of the stress category.

In [23]:
awk -F',[ ]*' '{print $8}' /tmp/CorpusABG.csv | sort | uniq -c | sort -rn | 
  head -n -1 | nl| 
  gnuplot -e "set terminal png; set output 'images/stress_category.png'; set xlabel 'categoria acentual'; set ylabel 'frequencia'; set style fill solid; set boxwidth 1; set title 'corpus abg'; set xtics rotate by 45 right; plot '/dev/stdin' using 1:2:xtic(3) with boxes notitle"

![](images/stress_category.png)

Every row should have 12 fields, but there are two rows with a different number. One with 16 and another with 17.

In [24]:
awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","} {print NF}' /tmp/CorpusABG.csv | sort | uniq -c

  92590 12
      1 16
      1 17


These rows are shown below:

In [25]:
awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","} (NF==16||NF==17){print}' /tmp/CorpusABG.csv

54607,Juvenile,   NOM,   Juvenile,   &ju-ve-ni-le*,   &ju-ve-n7-ly*,   &CV-CV-CV-CV*,   paroxitona,1,   0o,1,,54607,,,
77959,bulut,   F,   bulut,   &bu-lut*,   &bu-l$t*,   &CV-CVC*,   oxitona,1,   0o,1,1,,1,   0o,1,1


Let's correct those lines by squeezing repeated commas (,) and removing the comma (,) at the end of the line. After that, we print only the first 12 fields, since the line for **bulut** has repeated values at the end.

In [26]:
awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","} (NF>12){system("echo \""$0"\" | tr -s ',' | sed 's/,$//'")} (NF==12){print}' /tmp/CorpusABG.csv > /tmp/abgtmp
mv /tmp/abgtmp /tmp/CorpusABG.csv 
awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","} (NF==12){print} (NF>12){print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' /tmp/CorpusABG.csv > /tmp/abgtmp
mv /tmp/abgtmp /tmp/CorpusABG.csv 
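The two passes above can also be written as one awk program, without calling `tr` and `sed` through `system()`. The sketch below is an alternative under that assumption; the helper name `fix_extra_fields` and the sample rows are made up for illustration.

```bash
# Hypothetical helper: squeeze repeated commas on over-long rows and keep the first 12 fields.
fix_extra_fields() {
  awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","}
    NF>12 {
      gsub(/,+/, ",")        # squeeze runs of commas, like tr -s ","
      sub(/,$/, "")          # drop a trailing comma
      out = $1               # $0 was modified, so the fields are split again
      for (i = 2; i <= 12; i++) out = out OFS $i
      print out
      next
    }
    {print}'
}

rows='1,casa,NOM,casa,t1,t2,cv,paroxitona,5,2,3,1
9,bulut,F,bulut,t1,t2,cv,oxitona,1,0o,1,1,,1,0o,1,1'
fixed=$(printf '%s\n' "$rows" | fix_extra_fields)
printf '%s\n' "$fixed"
```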

Fields 9, 10, 11 and 12 should be numeric, since they represent frequencies. Let's check how many are not numeric.

In [27]:
tail -n +2 /tmp/CorpusABG.csv | awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","} (!($9~/^[0-9]+$/)){count9++} (!($10~/^[0-9]+$/)){count10++} (!($11~/^[0-9]+$/)){count11++} (!($12~/^[0-9]+$/)){count12++} END{printf "number of non-numeric values in fields: 9:%d, 10:%d, 11:%d, 12:%d\n",count9,count10,count11,count12}'

number of non-numeric values in fields: 9:0, 10:56110, 11:14933, 12:0


Let's check what is written in these fields that were supposed to be numeric.

In [28]:
tail -n +2 /tmp/CorpusABG.csv | awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","} (!($9~/^[0-9]+$/)){print $9} (!($10~/^[0-9]+$/)){print $10} (!($11~/^[0-9]+$/)){print $11} (!($12~/^[0-9]+$/)){print $12}' | sort | uniq -c

  14933 0e
  56109 0o
      1 o


It seems we may replace those values (0e, 0o and o) by 0 (zero). So let's do it.

In [29]:
awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","} (NR>1&&!($9~/^[0-9]+$/)){$9=0} (NR>1&&!($10~/^[0-9]+$/)){$10=0} (NR>1&&!($11~/^[0-9]+$/)){$11=0} (NR>1&&!($12~/^[0-9]+$/)){$12=0} {print}' /tmp/CorpusABG.csv > /tmp/tmpabg 
mv /tmp/tmpabg /tmp/CorpusABG.csv 
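The four nearly identical rules can be folded into a loop over fields 9 to 12. A minimal sketch on made-up rows:

```bash
rows='ID,W,C,L,T,A,E,S,FREQGERAL,FREQORAL,FREQESCRITA,Freq_Nivel
1,a,b,c,d,e,f,g,6,   0o,6,2
2,a,b,c,d,e,f,g,2,2,   0e,2
3,a,b,c,d,e,f,g,9,4,5,3'

# Skip the header (NR>1) and zero every frequency field that is not a plain integer.
fixed=$(printf '%s\n' "$rows" | awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","}
  NR>1 { for (i = 9; i <= 12; i++) if ($i !~ /^[0-9]+$/) $i = 0 }
  {print}')
printf '%s\n' "$fixed"
```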

Many entries have ```?``` in the word (column 2) or in the lemma (column 4). Their number is:

In [30]:
awk -F',[ ]*' '($2~/\?/)||($4~/\?/){print $2, $4, $5}' /tmp/CorpusABG.csv | wc -l

18007


and some examples are given below:

In [31]:
awk -F',[ ]*' '($2~/\?/)||($4~/\?/){print $2, $4, $5}' /tmp/CorpusABG.csv | head

l? l? &l1*
tamb?m tamb?m &tA-b6*
est? est? &es-t1*
s?o s?o &sAW*
j? j? &j1*
s? s? &s!*
? ? a&a*
at? at? &a-t5*
?s ?s &as*
m?e m?e &mA-e*


The ABG Corpus repository contains an archive, ```7- Acentuador.zip```, with files such as ```Corpus_Tag_Freq_Trans.txt``` and ```Corpus_Transcrito.xlsx``` (probably intermediate files), whose data might be used to fix those ```?``` in the corpus file.

Let's first download the archive and extract the spreadsheet:

In [32]:
if [ ! -f /tmp/Acentuador.zip ]; then 
  wget -q https://github.com/SauronGuide/corpusABG/raw/master/7-%20Acentuador.zip -O /tmp/Acentuador.zip
fi 
#unzip -p /tmp/Acentuador.zip "7- Acentuador/Corpus_Tag_Freq_Trans.txt" > /tmp/Corpus_Tag_Freq_Trans.txt
unzip -p /tmp/Acentuador.zip "7- Acentuador/Corpus_Transcrito.xlsx" > /tmp/Corpus_Transcrito.xlsx

Now we're going to convert the .xlsx file into a .csv file by using libreoffice.

In [33]:
libreoffice --headless --convert-to csv --outdir /tmp/ /tmp/Corpus_Transcrito.xlsx

convert /tmp/Corpus_Transcrito.xlsx -> /tmp/Corpus_Transcrito.csv using filter : Text - txt - csv (StarCalc)
Overwriting: /tmp/Corpus_Transcrito.csv


Let's check the type of file created.

In [34]:
file /tmp/Corpus_Transcrito.csv 

/tmp/Corpus_Transcrito.csv: Non-ISO extended-ASCII text


It is a problematic file. We're going to assume it is encoded in ISO 8859-1 (Latin-1), a common legacy encoding for Portuguese text, and convert it to UTF-8 with the *translit* option (when a character cannot be represented in the target encoding, iconv picks a similar-looking one).

In [35]:
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT < /tmp/Corpus_Transcrito.csv  > /tmp/Corpus_Transcrito_utf8.csv
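The encoding guess can be checked on a few bytes before converting the whole file. The sketch below builds a made-up Latin-1 sample (the word "região", with byte `0xE3` for the a-tilde) and inspects the bytes that `iconv` produces; in UTF-8 that letter should become the pair `c3 a3`.

```bash
# Made-up one-line sample encoded in ISO 8859-1.
latin1=$(mktemp)
printf 'regi\343o\n' > "$latin1"     # octal 343 = 0xE3, a-tilde in ISO 8859-1

# Convert and dump the result as hexadecimal, so the check does not depend on the terminal.
utf8_hex=$(iconv -f ISO-8859-1 -t UTF-8 < "$latin1" | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"
rm -f "$latin1"
```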

Let's create a list of all entries that still need a fix.

In [36]:
awk -F',[ ]*' 'BEGIN{OFS=","} ($2~/\?/){print NR, $2, $4, $5, $9, $10, $11}' /tmp/CorpusABG.csv > /tmp/needfixlist

And now let's write a script to fix it.

For each entry that has ```?``` in it (listed in the file ```/tmp/needfixlist``` created above), we look for the corresponding entry (same pronunciation and frequencies) in the downloaded file and replace the misspelled word with the version found there. If we find the corresponding word, we fix the word and, if the lemma is equal to the word, we fix the lemma as well.

In [38]:
rm -f /tmp/not_found.csv
# needfixlist columns: line number, word, lemma, transcription, FREQGERAL, FREQORAL, FREQESCRITA
# (reading the fields directly avoids the unquoted "echo $line", which could glob on ? and *)
while IFS=, read -r LINNUM PWORD LEMMA TRANSC F1 F2 F3; do
  # candidates: same transcription and same frequencies in the reference file
  # (a non-numeric frequency there is accepted when ours is 0)
  WORD=$(awk -F'[ ]*,[ ]*' -v prn="$TRANSC" -v f1="$F1" -v f2="$F2" -v f3="$F3" 'BEGIN{OFS=","} $4==prn&&($8==f1||($8!~/^[0-9]+$/&&f1==0))&&($9==f2||($9!~/^[0-9]+$/&&f2==0))&&($10==f3||($10!~/^[0-9]+$/&&f3==0)){print $1}' /tmp/Corpus_Transcrito_utf8.csv)
  # keep a candidate only if it has as many characters as the damaged word
  THEWORD=""
  for word in $WORD; do
    if [ "${#word}" -eq "${#PWORD}" ]; then
      THEWORD="$word"
    fi
  done
  if [ -n "$THEWORD" ]; then
     # replace the word (and the lemma, when it equals the word) on that line only
     awk -F',[ ]*' -v linnum="$LINNUM" -v word="$THEWORD" 'BEGIN{OFS=","} (NR==linnum&&$2==$4){$4=tolower(word)} NR==linnum{$2=tolower(word)} {$1=$1}1' /tmp/CorpusABG.csv > /tmp/tmpabg
     mv /tmp/tmpabg /tmp/CorpusABG.csv
  else
     echo "$PWORD, $TRANSC" >> /tmp/not_found.csv
  fi
done < /tmp/needfixlist
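The loop above rewrites the whole corpus once per damaged entry. The lookup itself can be sketched as a single awk run over two files, using the `NR==FNR` idiom to load the reference file into an array first. This is a simplified illustration, not the method used above: it matches on the transcription only and leaves out the frequency and length checks, and the file layouts and transcriptions in the sample are made up.

```bash
ref=$(mktemp); corpus=$(mktemp)
# Reference rows (assumed layout): word in column 1, transcription in column 4.
printf '%s\n' 'região,x,x,&he-ji-AW*' 'lá,x,x,&l1*' > "$ref"
# Corpus rows: ID, word, category, lemma, transcription.
printf '%s\n' '1,regi?o,NOM,regi?o,&he-ji-AW*' \
              '2,casa,NOM,casa,&ka-za*' \
              '3,l?,ADV,l?,&l1*' > "$corpus"

fixed=$(awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","}
  NR==FNR { word[$4] = $1; next }          # first file: transcription -> word
  index($2, "?") && ($5 in word) {         # second file: damaged word, known transcription
    if ($4 == $2) $4 = word[$5]            # fix the lemma when it equals the word
    $2 = word[$5]
  }
  {print}' "$ref" "$corpus")
printf '%s\n' "$fixed"
rm -f "$ref" "$corpus"
```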

Many entries still need to be fixed.

In [39]:
echo $(wc -l /tmp/not_found.csv | cut -d ' ' -f1)"/"$(wc -l /tmp/needfixlist | cut -d ' ' -f1) | bc -l

.20504874063221953006


In [40]:
wc -l /tmp/not_found.csv 

3639 /tmp/not_found.csv


3639 entries (20.5%) were not fixed.

Let's see 10 random entries that were not fixed:

In [41]:
shuf -n 10 /tmp/not_found.csv

policial-rob?, &po-li-si-aW-ro-bo*
encoura?ado, &E-koW-ra-sa-do*
h?ngaros, &%-ga-ros*
circula??o, &sir-ku-la-sAW*
rememora??o, &he-me-mo-ra-sAW*
an?lises, &a-n1-li-es*
incipit?rio, &I-si-pi-t1-rJo*
DESIST?NCIA, &de-zis-tE-sJa*
manipul?-lo, &ma-ni-pu-l1-lo*
Execu??es, &e-ze-ku-sOJs*


In [43]:
# add number of phonemes and syllables
awk -F','  'BEGIN{OFS=","} NR==1{printf "%s",$0; printf "%s",",NUMFONES,NUMSILABAS\n"} NR>1{$2=tolower($2); $4=tolower($4); gsub(/^\&|\*$/,"",$5); gsub(/^\&|\*$/,"",$6); gsub(/^\&|\*$/,"",$7); printf "%s,",$0; gsub(/-/,"",$5); printf "%d,%d\n", length($5), gsub(/-/,"",$7)+1}' /tmp/CorpusABG.csv > /tmp/CorpusABGv2.csv
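To make the two new columns concrete, here is a small sketch that counts phonemes and syllables for a single transcription. It assumes, as the cell above does, that each remaining character stands for one phoneme and that hyphens separate syllables; the helper name `count_units` is made up for illustration, and unlike the cell it takes both counts from the same string.

```bash
# Hypothetical helper: print "<phonemes> <syllables>" for one transcription.
count_units() {
  printf '%s\n' "$1" | awk '{
    gsub(/^&|\*$/, "")            # drop the leading & and the trailing *
    nsyl = gsub(/-/, "") + 1      # gsub returns the number of hyphens removed
    print length($0), nsyl
  }'
}

count_units '&t5-ky-ni-ko*'
count_units '&l1*'
```

For `&t5-ky-ni-ko*` this gives 8 phonemes in 4 syllables, and for `&l1*` 2 phonemes in 1 syllable.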

Using a spelling corrector to fix the remaining entries with ```?```:

In [None]:
awk -F, 'BEGIN{OFS=FS} $2~/\?/{"echo \""$2"\" | ./spell.py"|& getline $2} $4~/\?/{"echo \""$4"\" | ./spell.py"|& getline $4} {$1=$1; print}' /tmp/CorpusABGv2.csv > /tmp/CorpusABGv2spell.csv

The substitutions made in field 2 are shown below:

In [44]:
diff <(awk -F, '{print $2}' /tmp/CorpusABGv2.csv) <(awk -F, '{print $2}' /tmp/CorpusABGv2spell.csv) | grep -v "^---" | grep -v "^[0-9c0-9]" | column 

< regi?o		< dic??o		< disposi??es
> região		> dicção		> disposições
< ningu?m		< flex?vel		< a?ucarado
> ninguém		> flexível		> açucarado
< qu?			< cal?ado		< s?c
> que			> calado		> sc
< opini?o		< desconstru??o		< rebeli?es
> opinião		> desconstrução		> rebeliões
< produ??o		< injusti?a		< radiois?topo
> produção		> injustiça		> radioisótopo
< s?rie			< sedu??o		< reconstr?i
> série			> sedução		> reconstrói
< servi?o		< viol?ncias		< brit?nicas
> serviço		> violências		> britânicas
< portugu?s		< destr?i		< desconex?o
> português		> destrói		> desconexo
< pr?prio		< descri??es		< comunica??o
> próprio		> descrições		> comunicação
< aten??o		< g?meos		< cal?ad?es
> atenção		> gêmeos		> calçadões
< condi??es		< escorpi?o		< injusti?ada
> condições		> escorpião		> injustiçada
< justi?a		< uni?es		< posti?os
> justiça		> uniões		> postiços
< constru??o		< extradi??o		< desperdi?a
> construção		> extradição		> desperdiça
< reuni?o		< redu??es		< degluti??o
> reunião		> reduções		> deglut

< cal?adas		< term?metros		< novi?a
> calçadas		> termômetros		> noviça
< prefer?ncias		< m?ller		< cadei?o
> preferências		> muller		> cadeião
< competi??es		< obsess?es		< sobreposi??es
> competições		> obsessões		> sobreposições
< resid?ncias		< mani?oba		< cr?nicos
> residências		> maniçoba		> crônicos
< licen?as		< mesti?agem		< realiza??es
> licenças		> mestiçagem		> realizações
< descal?o		< cal?amento		< prospec??o
> descalço		> calçamento		> prospecção
< inscri??es		< neurocirurgi?o	< transg?neros
> inscrições		> neurocirurgião	> transgêneros
< din?micas		< desservi?o		< hortali?as
> dinâmicas		> desserviço		> hortaliças
< dami?o		< antag?nicos		< redefini??o
> damião		> antagônicos		> redefinição
< v?nus			< desperdi?ou		< desuni?o
> vênus			> desperdiçou		> desunido
< influ?ncias		< pux?o			< predi?o
> influências		> puxão			> predigo
< obsess?o		< super-her?i		< dentu?o
> obsessão		> superherói		> dentuço
< caix?o		< descal?a		< coment?rios
> caixão		> descalça		> comentári

Unfortunately, some of them went wrong...

cal?ado > calado

s?c > sc

al?ar > aliar

lix?o > lixo

etc

There are many entries in the database which have at least one empty field.

There are still many entries that were not fixed:

In [45]:
awk -F, '$2~/\?/{print $2}' /tmp/CorpusABGv2spell.csv | wc -l

3222


For example:

In [46]:
awk -F, '$2~/\?/{print $2}' /tmp/CorpusABGv2spell.csv | column | head

condi??o			beb?o
justi?a				?s
opini?o				reprercuss?o
uni?o				?ngulos
servi?os			sarad?es
atl?tico-mg			compreend?-los
produ??o			interse??es
servi?o				liter?rios
?				equalit?rio
introdu??o			colch?o


As we may check, now every entry has 14 fields.

In [1]:
awk -F'[ ]*,[ ]*' 'BEGIN{OFS=","} {print NF}' /tmp/CorpusABGv2spell.csv | sort | uniq -c

  92592 14


And there are empty fields (717 of them) only on the 3rd field (morphological category).

In [2]:
awk -F',[ ]*' 'BEGIN{OFS=","} {for (i = 1; i <= NF; i++) if ($i=="") print i}' /tmp/CorpusABGv2spell.csv | sort | uniq -c

    717 3


## Syllables frequency

Let's split the transcription on the hyphen (-) and count the occurrences of each syllable.

In [3]:
tail -n +2 /tmp/CorpusABGv2spell.csv | awk -F',' '{n=split($5,syl,"-"); for (i = 0; ++i <= n;) print syl[i]}' | sort | uniq -c | sort -rn | column

   9291 a	     33 DJas	      6 5r	      2 nip	      1 oJs
   7220 do	     32 TJe	      6 4	      2 nex	      1 OJ
   6682 ta	     32 rU	      5 zus	      2 neWf	      1 ob
   5937 ka	     32 reW	      5 zIs	      2 nat	      1 nyO
   5536 Ti	     32 pJe	      5 zIg	      2 nAs	      1 nyIg
   5319 de	     32 n$	      5 zaJs	      2 naS	      1 nya
   5277 se	     32 jir	      5 WE	      2 Nan	      1 nWan
   5276 da	     32 bJE	      5 W1	      2 naJr	      1 nUs
   5052 te	     32 b5	      5 vri	      2 naf	      1 nun
   4315 ra	     31 z5	      5 vEs	      2 nad	      1 noz
   4095 ko	     31 rir	      5 tW7	      2 nab	      1 nol
   3979 he	     31 ran	      5 tr1s	      2 n7W	      1 noh
   3742 to	     31 p$	      5 toWr	      2 n6	      1 nod
   3741 si	     31 LoW	      5 tley	      2 mys	      1 NO
   3629 na	     31 kr7	      5 TJos	      2 mWos	      1 nJu
   3597 ma	     31 jJas	      5 TJE	      2 muy	      1 nJov
   3597 li	     31 his	      5 tay	      2 mul	      1 nJO

    484 dE	     19 ay	      4 pJes	      2 glJos	      1 lib
    476 La	     19 As	      4 pJE	      2 gler	      1 leWrs
    468 veW	     19 2	      4 pIs	      2 gleJ	      1 lEt
    461 bra	     19 1W	      4 pin	      2 glAd	      1 lEs
    431 DJa	     18 zen	      4 pIg	      2 gl$	      1 lep
    429 tor	     18 truJ	      4 pen	      2 gJe	      1 leJr
    421 Si	     18 sen	      4 oWr	      2 giW	      1 leJb
    418 kre	     18 SaW	      4 nur	      2 gIs	      1 leg
    416 Dis	     18 pru	      4 not	      2 gis	      1 lef
    412 1	     18 plJa	      4 Nor	      2 gAg	      1 lays
    404 tAW	     18 plaW	      4 NJa	      2 g4	      1 laWr
    402 par	     18 n9	      4 niWs	      2 g$s	      1 LaWd
    400 Dy	     18 lJas	      4 Nis	      2 frW7	      1 lAt
    396 lA	     18 Jor	      4 neWr	      2 frJos	      1 lat
    395 lis	     18 jJos	      4 Nes	      2 frJas	      1 laJt
    393 oW	     18 hay	      4 nay	      2 frJar	      1 laj
    389 vE	     18 freJ	   

    188 aJs	     12 nJE	      3 sWo	      1 zay	      1 j1s
    186 fre	     12 m1r	      3 sWas	      1 zAWs	      1 iyah
    184 zer	     12 lJer	      3 Sre	      1 zav	      1 iWls
    184 mO	     12 lIs	      3 Sot	      1 zAt	      1 Iv
    184 kuW	     12 kWo	      3 Sor	      1 zat	      1 Ist
    183 Ne	     12 koWr	      3 sob	      1 zars	      1 Iss
    183 l7	     12 klJE	      3 SO	      1 zaJr	      1 iss
    180 sJas	     12 klis	      3 slo	      1 zah	      1 irs
    180 mes	     12 klA	      3 sli	      1 zaf	      1 Ip
    179 gWa	     12 kJo	      3 sle	      1 zab	      1 ip
    179 er	     12 jO	      3 SJor	      1 z9	      1 ild
    179 dA	     12 JA	      3 sip	      1 z6s	      1 ih
    177 pA	     12 hey	      3 sid	      1 z$	      1 ig
    176 roW	     12 geW	      3 Set	      1 z#	      1 if
    176 DJo	     12 g1s	      3 sat	      1 WyIg	      1 id
    174 zas	     12 frJa	      3 sAs	      1 WO	      1 hWo
    174 gas	     12 bJA	      3 san	      1 WE

    102 baW	      9 h$	      3 ik	      1 traS	      1 frJed
    100 beW	      9 gys	      3 huy	      1 trab	      1 fris
     99 frA	      9 gur	      3 huJs	      1 toz	      1 frid
     99 boW	      9 grI	      3 hOJs	      1 toy	      1 frar
     96 lJo	      9 glJo	      3 hiz	      1 toWt	      1 fraJ
     96 kE	      9 fun	      3 hiS	      1 toS	      1 fr5
     96 d7	      9 fJan	      3 heJs	      1 ton	      1 fr1s
     95 kI	      9 fIs	      3 han	      1 toJr	      1 foy
     95 faW	      9 et	      3 haJs	      1 tog	      1 foWs
     95 fA	      9 dWos	      3 haf	      1 tof	      1 fOk
     94 ler	      9 dros	      3 h7s	      1 tmJe	      1 flyIg
     94 jU	      9 dras	      3 h6	      1 tmaW	      1 flyI
     93 mAW	      9 DJE	      3 gWaJs	      1 tlus	      1 flWos
     92 TiW	      9 dAs	      3 guy	      1 tlO	      1 flWa
     92 p5	      9 dad	      3 grU	      1 tlJer	      1 flW7
     91 zeJ	      9 buJr	      3 grO	      1 tlJe	      1 floJ
     91 sJE	

     55 blo	      7 mJos	      2 toWs	      1 SaWJ	      1 buz
     54 s!	      7 mAWs	      2 tov	      1 SaWd	      1 buyW
     54 g5	      7 mad	      2 tlo	      1 Sars	      1 but
     53 LAW	      7 m5s	      2 tles	      1 sars	      1 bUs
     53 kuJ	      7 lyo	      2 tled	      1 SAp	      1 burs
     52 zO	      7 luJr	      2 tlas	      1 sals	      1 bun
     52 kWA	      7 Lu	      2 tlAd	      1 saJW	      1 bulls
     52 jI	      7 lir	      2 tjer	      1 SaJss	      1 buJW
     52 fru	      7 LeW	      2 TJar	      1 SaJs	      1 buJn
     52 frO	      7 kWaW	      2 Tip	      1 SaJrs	      1 bso
     52 DJA	      7 kUs	      2 teWr	      1 sah	      1 bSe
     51 zOJs	      7 ksJa	      2 teS	      1 sAg	      1 bse
     51 sJA	      7 kov	      2 tem	      1 sag	      1 bryIg
     51 Ser	      7 kok	      2 tEf	      1 saf	      1 bryAt
     51 Sas	      7 klI	      2 tEd	      1 s9	      1 bruy
     51 brI	      7 kay	      2 taz	      1 s7s	      1 bruJ
     50 N

In [4]:
tail -n +2 /tmp/CorpusABGv2spell.csv | awk -F',' '{n=split($5,syl,"-"); for (i = 0; ++i <= n;) print syl[i]}' | sort | uniq -c | sort -rn | tee >/tmp/syllables_freq_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/syllables_freq_abg_loglog.png'; set xlabel 'syllables'; set ylabel 'frequency'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column)


In [8]:
cp /tmp/syllables_freq_abg_loglog.png images/
cp /tmp/syllables_freq_abg .

![](images/syllables_freq_abg_loglog.png)

Let's now consider only syllables at the beginning of a word. The frequencies and the plot are shown below.

In [6]:
tail -n +2 /tmp/CorpusABGv2spell.csv | awk -F',' '{n=split($5,syl,"-"); print syl[1]}' | sort | uniq -c | sort -rn | column

   7466 a	     20 p$	      4 k!r	      2 Diz	      1 lig
   3400 he	     20 naW	      4 kok	      2 DiWr	      1 let
   2856 de	     20 in	      4 klaWs	      2 di	      1 Les
   2838 e	     20 haJ	      4 keW	      2 dep	      1 lEs
   2757 I	     20 gO	      4 kay	      2 dAW	      1 Ler
   2654 kO	     20 fur	      4 kAW	      2 dAs	      1 leJr
   2284 es	     20 bis	      4 kars	      2 d7s	      1 leJb
   2002 E	     19 zu	      4 jEJ	      2 d5r	      1 lEJ
   1821 ka	     19 zA	      4 h$s	      2 d1r	      1 led
   1777 i	     19 vJe	      4 gU	      2 byr	      1 Le
   1760 ko	     19 toW	      4 ger	      2 bWar	      1 lay
   1582 o	     19 teJ	      4 g5r	      2 bWa	      1 laWr
   1390 ma	     19 ses	      4 g1s	      2 b!s	      1 lAW
   1156 pa	     19 meW	      4 flWo	      2 bros	      1 laJt
   1142 se	     18 var	      4 flor	      2 braW	      1 laJs
   1079 pro	     18 sWa	      4 et	      2 brAs	      1 lAg
   1037 Di	     18 s7	      4 duW	      2 braJt	      1

    100 DJa	     10 noW	      3 dWar	      1 tEs	      1 hayA
     99 sur	     10 niW	      3 dr1s	      1 teS	      1 haWh
     99 kri	     10 krJan	      3 dek	      1 tells	      1 hAW
     98 maW	     10 klJE	      3 dEJ	      1 tEf	      1 haS
     96 pla	     10 jO	      3 bym	      1 tays	      1 hAp
     96 kla	     10 jeW	      3 bWe	      1 taWt	      1 haJW
     95 So	     10 im	      3 brU	      1 tat	      1 haJo
     94 bri	     10 his	      3 broW	      1 tam	      1 haj
     93 nu	     10 h7	      3 br1	      1 tak	      1 h5s
     92 mor	     10 gli	      3 boy	      1 taJW	      1 !h
     92 mer	     10 fro	      3 bOs	      1 tah	      1 gWiW
     92 ber	     10 f!	      3 bles	      1 tAg	      1 gWe
     91 Su	     10 d$	      3 bl1	      1 tab	      1 gWas
     89 heJ	     10 blA	      3 big	      1 t1s	      1 guz
     88 deJ	     10 bJe	      3 baJr	      1 t1r	      1 guJW
     87 tro	     10 b7	      3 art	      1 sWor	      1 guJs
     87 frA	      9 vas	    

     39 fI	      6 gay	      2 m7s	      1 ped	      1 DJan
     39 8	      6 frJe	      2 m1s	      1 paz	      1 d!J
     38 mas	      6 flaW	      2 lWo	      1 paWs	      1 Diy
     38 lan	      6 drU	      2 loWr	      1 pap	      1 Did
     38 koJ	      6 b$	      2 loWJ	      1 pan	      1 dEs
     37 vir	      6 2	      2 lJOs	      1 pal	      1 den
     37 kuJ	      5 z5	      2 lJo	      1 paes	      1 deJk
     37 kI	      5 vEJ	      2 lJeS	      1 pad	      1 dehs
     37 ir	      5 v5	      2 lJe	      1 p9s	      1 days
     37 eJ	      5 tSe	      2 liW	      1 p6	      1 daS
     36 sor	      5 tr7	      2 laz	      1 p5W	      1 daJs
     36 siW	      5 tAW	      2 laWb	      1 p5s	      1 dad
     36 moW	      5 tas	      2 lars	      1 p4	      1 d9
     35 tar	      5 taJs	      2 kys	      1 p2	      1 byars
     35 m1	      5 t5r	      2 kWAW	      1 p1g	      1 bWo
     35 l7	      5 syW	      2 kWan	      1 p$W	      1 buz
     35 jiW	      5 SU	      2 krys	 

In [7]:
tail -n +2 /tmp/CorpusABGv2spell.csv | awk -F',' '{n=split($5,syl,"-"); print syl[1]}' | sort | uniq -c | sort -rn | tee >/tmp/start_syllables_freq_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/start_syllables_freq_abg_loglog.png'; set xlabel 'start syllables'; set ylabel 'frequency'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column)


In [9]:
cp /tmp/start_syllables_freq_abg_loglog.png images/
cp /tmp/start_syllables_freq_abg .

![](images/start_syllables_freq_abg_loglog.png)

We can do the same for word-final syllables.

In [11]:
tail -n +2 /tmp/CorpusABGv2spell.csv | awk -F',' '{n=split($5,syl,"-"); print syl[n]}' | sort | uniq -c | sort -rn | tee >/tmp/end_syllables_freq_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/end_syllables_freq_abg_loglog.png'; set xlabel 'end syllables'; set ylabel 'frequency'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column)

   5353 do	     13 uW	      4 dog	      2 giW	      1 may
   2912 da	     13 trEJ	      4 dley	      2 gIs	      1 mass
   2469 te	     13 TIs	      4 DJA	      2 gAg	      1 map
   2323 se	     13 teJs	      4 DIg	      2 g4	      1 mans
   2100 sAW	     13 rus	      4 dAWs	      2 fuW	      1 malls
   1988 to	     13 r1s	      4 buW	      2 fur	      1 maJr
   1823 dos	     13 r!	      4 breW	      2 frJos	      1 maJd
   1710 ta	     13 pJos	      4 bOs	      2 frJas	      1 maJ
   1671 ra	     13 mJas	      4 blog	      2 frJar	      1 m7
   1543 rAW	     13 lu	      4 blo	      2 fri	      1 m2
   1490 das	     13 kus	      4 bla	      2 frey	      1 lWAsk
   1399 va	     13 heJ	      4 bl1	      2 freWd	      1 lWas
   1314 ka	     13 guJ	      4 bJe	      2 freW	      1 lWar
   1246 mos	    


In [12]:
cp /tmp/end_syllables_freq_abg_loglog.png images/.
cp /tmp/end_syllables_freq_abg .

![](images/end_syllables_freq_abg_loglog.png)
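
The start-syllable and end-syllable pipelines above differ only in which element of the `split` array they print: `syl[1]` is the first syllable and `syl[n]` the last, where `n` is the number of pieces returned by `split`. A small sketch on made-up syllabified strings (assumed for illustration, not corpus rows) makes this concrete:

```shell
# Hypothetical hyphen-separated syllabifications, one word per line.
words=$(printf '%s\n' 'ka-za' 'a-mi-go' 'sol')

# split() returns the number of pieces; awk arrays are 1-indexed,
# so syl[1] is the first syllable and syl[n] the last.
first_last=$(printf '%s\n' "$words" |
  awk '{n = split($0, syl, "-"); print syl[1], syl[n]}')

printf '%s\n' "$first_last"
```

For a one-syllable word, `n` is 1 and both indices select the same element, so such words contribute to both the start and the end counts.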

In the previous results we counted how often each syllable occurs in the vocabulary, without taking into account how often each word occurs.

Now we repeat the analysis, weighting each word by its frequency of occurrence, which lets us approximate the frequency of occurrence of syllables in the language.
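
The difference between the two counts can be seen on a made-up three-word vocabulary. In this sketch the first column stands in for the syllabified word (field 5 in the corpus) and the second for its frequency of occurrence (field 9); both the rows and the two-column layout are assumptions made for illustration.

```shell
# Hypothetical rows: syllabified word, frequency of occurrence.
toy=$(printf '%s\n' 'ka-za,10' 'ka-ma,3' 'za-ga,1')

# Vocabulary count: each word adds 1 for every syllable it contains.
type_counts=$(printf '%s\n' "$toy" |
  awk -F',' '{n = split($1, syl, "-"); for (i = 1; i <= n; i++) c[syl[i]]++}
             END {for (s in c) print c[s] "\t" s}' | sort -k1,1rn -k2,2)

# Weighted count: each word adds its frequency for every syllable it contains.
token_counts=$(printf '%s\n' "$toy" |
  awk -F',' '{n = split($1, syl, "-"); for (i = 1; i <= n; i++) c[syl[i]] += $2}
             END {for (s in c) print c[s] "\t" s}' | sort -k1,1rn -k2,2)

printf 'vocabulary:\n%s\nweighted:\n%s\n' "$type_counts" "$token_counts"
```

In this toy data `ka` and `za` tie in the vocabulary count, but `ka` pulls ahead once word frequencies are added in.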

In [13]:
tail -n +2 /tmp/CorpusABGv2spell.csv | 
  awk -F',' '{n=split($5,syl,"-"); for (i=0;++i<=n;) syllable[syl[i]]+=$9} END{for (s in syllable) print syllable[s]"\t"s}' | 
  sort -rn | tee >/tmp/syllables_freqocc_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/syllables_freqocc_abg_loglog.png'; set xlabel 'syllables'; set ylabel 'frequency of occurrence'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column) 

cp /tmp/syllables_freqocc_abg .
cp /tmp/syllables_freqocc_abg_loglog.png images/.

350604	a	489	!r	29	Ud	6	bWas	2	bulls
229808	de	486	bJa	29	trWo	6	b!s	2	buJW
205651	e	482	gles	29	trWa	6	boJs	2	bSe
178051	do	480	kaJs	29	sEs	6	bat	2	brJas
153426	ke	479	pJor	29	mak	6	b$W	2	brJar
143340	o	477	b1	29	maJls	6	an	2	brJA
132518	da	475	NJa	29	lir	5	zab	2	brik
129315	se	474	rJe	29	iJ	5	Wis	2	brAt
115955	te	474	n%	29	fJA	5	Wi	2	brap
109927	to	470	p7	29	bWar	5	Ward	2	brah
108515	ta	470	bos	28	zJar	5	vroW	2	brad
101279	ra	469	dre	28	tSe	5	vrOJs	2	boWt
96925	ka	466	klJE	28	s1b	5	vok	2	bok
92259	ma	462	kris	28	prAW	5	vOJs	2	bma
90309	kO	460	LoW	28	paes	5	tWos	2	blus
83973	ko	457	zir	28	neWs	5	tud	2	bloW
78212	Ti	453	tOJs	28	n6	5	tSa	2	blis
76054	na	451	w	28	muy	5	trov	2	bler
70182	no	451	gor	28	mJar	5	trolls	2	bled
65073	es	448	ksu	28	loy	5	tlJe	2	blAW
65031	pa	447	g7	28	klar	5	tI	2	blaJr
63285	la	446	jer	28	key	5	sWe	2	bJar
62721	3	445	mOs	28	jy	5	sOJ	2	bit
58566	Di	442	d$s	28	hWa	5	soJ	2	bik
57972	E	442	br1s	28	hey	5	sli	2	bey
57716	sAW	441	z!	28	glos	5	SJes	2	beJt
57577	so	441	r

5340	ir	159	leW	15	plays	4	fIg	1	SaJss
5244	sA	158	g1	15	pJE	4	faJr	1	SaJs
5221	rE	158	dWa	15	oWr	4	f1s	1	SaJrs
5203	pos	157	hor	15	noWs	4	f1r	1	sah
5144	O	156	hJa	15	ners	4	eWs	1	s9
5105	kas	155	pus	15	my	4	esk	1	s7s
5097	raW	155	fus	15	lJes	4	ert	1	s7r
5066	las	154	plos	15	kIs	4	eJrd	1	s6W
5036	jEJ	152	gun	15	joy	4	drJe	1	s2W
4966	deW	151	t1s	15	hEs	4	drag	1	s1s
4959	Ja	151	seJs	15	gWo	4	dmO	1	s$W
4904	hes	151	5r	15	gW7	4	dlA	1	rys
4873	pri	149	nJos	15	gray	4	deh	1	ryIg
4842	grA	148	SOJs	15	frAW	4	deb	1	ryA
4813	mU	148	nJe	15	faz	4	dak	1	rW7
4796	ju	148	j$	15	et	4	d5W	1	rW
4756	DJa	147	DJE	15	DJO	4	bug	1	rut
4747	ker	147	b!	14	zJE	4	bser	1	ruj
4743	sEJ	146	v8	14	vus	4	brU	1	rub
4702	kaW	146	hEJ	14	up	4	brJoz	1	rs
4697	koW	143	dWo	14	tlo	4	braJt	1	rols
4624	bro	142	LeW	14	t1r	4	br5	1	rof
4580	kri	141	vJas	14	SO	4	bOb	1	rJu
4571	bu	139	vJar	14	prat	4	bJas	1	rJen
4532	m7	139	gAWs	14	plus	4	biz	1	rJ1
4521	zAW	139	feW	14	plI	4	bib	1	rit
4405	by	138	TJos	14	plAt	4	bayo	1	rin
4383	vAW	138	g

1482	ran	62	5Js	9	hot	2	troJt	1	jaWss
1481	sO	61	teJs	9	hJe	2	troh	1	Jat
1468	NA	61	priW	9	hiz	2	trik	1	jAt
1468	fre	61	Nan	9	hik	2	trEd	1	jAs
1455	TJa	61	gon	9	gWiW	2	traz	1	jak
1455	!	60	klI	9	gret	2	trak	1	jah
1453	suW	59	saJs	9	gox	2	toy	1	J7
1449	Nos	59	play	9	gOs	2	toWt	1	j5r
1440	jor	59	ly	9	goJ	2	toS	1	j2
1432	f1	59	gays	9	f%	2	ton	1	iyah
1431	jis	58	vres	9	ed	2	tof	1	iWls
1429	mis	58	v6	9	Dys	2	tmaW	1	Ist
1428	kE	58	Os	9	duJ	2	tlis	1	ih
1425	nJo	58	Ok	9	d!s	2	tlAd	1	ig
1398	jar	58	nIt	9	dr5s	2	TJar	1	if
1392	kis	58	lOJs	9	dOs	2	Tiv	1	hWo
1381	pI	58	LE	9	DJap	2	TiJs	1	huI
1379	g6J	58	gli	9	daJs	2	teWz	1	h!s
1367	sWas	58	DJar	9	d5s	2	tess	1	hrE
1360	nJa	58	daJ	9	bris	2	teS	1	hoWl
1350	heJ	57	kres	9	b1W	2	tem	1	hors
1340	mEJ	57	ah	9	9	2	taz	1	hleJ
1333	l!	56	We	8	zAg	2	tak	1	hJu
1332	tr1s	56	trEs	8	z9	2	taJW	1	hJosp
1330	tO	56	Sef	8	z$s	2	tag	1	hJers
1329	sJe	56	plE	8	v7r	2	sup	1	hJad
1326	fiW	56	moJ	8	tmJe	2	Sri	1	hIg
1321	kAW	56	kir	8	Tif	2	SreJ	1	hen
1315	h7	56	frJa	8	tAg	2	so

![](images/syllables_freqocc_abg_loglog.png)

Now let's consider only the starting syllables of words.

In [14]:
tail -n +2 /tmp/CorpusABGv2spell.csv | 
  awk -F',' '{n=split($5,syl,"-"); syllable[syl[1]]+=$9} END{for (s in syllable) print syllable[s]"\t"s}' | 
  sort -rn | tee >/tmp/start_syllables_freqocc_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/start_syllables_freqocc_abg_loglog.png'; set xlabel 'start syllables'; set ylabel 'frequency of occurrence'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column)
  
cp /tmp/start_syllables_freqocc_abg .
cp /tmp/start_syllables_freqocc_abg_loglog.png images/.

316886	a	382	fO	25	v5	5	mass	2	byr
196716	e	380	gre	25	tos	5	lWAsk	2	bWo
171456	de	378	fri	25	sU	5	ley	2	buyW
124998	o	371	gWa	25	fl!	5	lEs	2	bulls
122409	ke	366	pros	24	Sis	5	lAW	2	buJW
82294	kO	365	kl1	24	!s	5	kyW	2	brJo
62720	3	365	hA	24	p7	5	kros	2	bris
60933	es	364	flo	24	his	5	krJo	2	breW
53599	E	359	k!	24	gla	5	kOp	2	brAt
52734	se	357	uW	24	eJs	5	kiss	2	braJ
52278	do	357	krJar	24	bym	5	kAt	2	brah
49318	ko	357	gu	24	b1r	5	k1Ws	2	brad
48416	da	354	eJ	23	v4	5	k%	2	br7
47749	eW	351	guJ	23	piW	5	jJe	2	blO
47233	pa	349	bro	23	maWs	5	j5	2	bler
43984	nAW	346	gro	23	krJA	5	iz	2	blaW
42648	no	345	lus	23	klo	5	Iv	2	blaJr
42512	fa	343	tO	23	kl1W	5	ik	2	beJt
41646	na	339	vos	23	jaJ	5	hob	2	bad
41119	he	327	b1	23	hey	5	hip	2	ayWr
40679	u	326	7	23	fler	5	hayA	2	ayW
39089	pe	325	sJE	23	ars	5	gU	2	ayes
38473	ka	324	v!	22	zeJ	5	greW	2	At
37482	por	320	juW	22	tran	5	goWr	2	arS
37054	EJ	320	d7	22	sAt	5	gleb	2	Ar
35760	U	318	ze	22	praJ	5	gE	2	7s
35678	Di	317	frO	22	plas	5	gaz	2	$r
34796	vo	315	jiW	2

1593	t1	63	toWr	9	kWo	3	brEt	1	klJa
1575	DJas	62	t$	9	klay	3	brAW	1	kleW
1573	aJ	62	fy	9	juJ	3	bOb	1	kler
1543	pes	62	fro	9	jEJ	3	blEd	1	klAW
1539	tri	61	taJ	9	hiz	3	bJa	1	klAs
1535	tAW	61	l5	9	haW	3	bet	1	kJess
1505	jus	60	vJas	9	gy	3	bay	1	kJes
1484	hoW	60	gon	9	gret	3	asp	1	kJa
1437	nI	59	zA	9	gox	3	an	1	kit
1430	vJo	59	Ty	9	gaJ	3	am	1	kip
1405	tes	59	las	9	duW	3	alt	1	kik
1394	jor	59	fr1	9	DJap	3	alls	1	kiJ
1391	mor	59	fJa	9	brAs	3	Ak	1	kaw
1375	deW	58	pop	9	b1W	3	aJs	1	kaJd
1367	sWas	58	Ok	9	%	3	aJg	1	k$s
1298	fI	58	hay	8	ziW	3	#	1	jUg
1292	ki	57	sJe	8	zEJ	2	zla	1	joy
1274	us	57	dWar	8	zaW	2	zeW	1	jOh
1262	k1	56	trEs	8	v7r	2	z!	1	job
1259	bri	56	SJa	8	troJ	2	vur	1	jJyad
1252	f1	55	j4	8	tok	2	vAWs	1	jJas
1230	saW	54	Sef	8	Tis	2	vaJs	1	jIz
1221	zo	54	mJa	8	tam	2	Ut	1	jIs
1212	frA	54	kaos	8	sip	2	uh	1	jEs
1177	krJa	54	jeW	8	pys	2	Tyz	1	jEg
1173	saJr	53	luJ	8	pyO	2	tuf	1	jaz
1170	vI	53	leW	8	nis	2	tS5	1	jays
1160	pos	52	nar	8	lars	2	truz	1	jaWss
1159	fer	52	mJo	8	krok	2	troh	1	jAt
115

![](images/start_syllables_freqocc_abg_loglog.png)

And now the ending syllables of words.

In [15]:
tail -n +2 /tmp/CorpusABGv2spell.csv | 
  awk -F',' '{n=split($5,syl,"-"); syllable[syl[n]]+=$9} END{for (s in syllable) print syllable[s]"\t"s}' | 
  sort -rn | tee >/tmp/end_syllables_freqocc_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/end_syllables_freqocc_abg_loglog.png'; set xlabel 'end syllables'; set ylabel 'frequency of occurrence'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column)
  
cp /tmp/end_syllables_freqocc_abg .
cp /tmp/end_syllables_freqocc_abg_loglog.png images/.

171307	de	222	beJ	22	h!	5	ey	1	zU
162809	do	221	bros	22	freWd	5	eJss	1	zoWr
136511	ke	218	krus	22	dek	5	dris	1	zOs
128110	a	217	ges	22	blo	5	dley	1	zok
104896	o	216	zO	22	beJs	5	DIg	1	zod
93148	e	213	kso	21	tri	5	Dig	1	zJOs
88039	da	208	kaJr	21	taJ	5	dep	1	zJes
85741	to	207	dres	21	sey	5	bWo	1	zJA
77246	ra	206	jos	21	pu	5	boyer	1	zIs
76538	se	205	ner	21	prO	5	bJer	1	zeWp
75383	te	205	dO	21	play	5	biWs	1	zev
61550	3	205	bres	21	naJ	5	bet	1	zEt
57688	sAW	203	gaJs	21	mOJs	5	baWd	1	zep
55552	ma	203	freW	21	kok	5	bAs	1	zay
50044	no	202	juJs	21	juk	5	ap	1	zav
49452	na	201	klo	21	eWr	5	$s	1	zAt
47472	eW	200	riW	20	tors	4	zWar	1	zat
46892	ta	200	kro	20	tah	4	zov	1	zaJr
44733	nAW	200	bu	20	set	4	zet	1	zah
38577	la	199	bes	20	piW	4	zAWs	1	z1
37952	EJ	198	heJs	20	neJs	4	vrov	1	z#
35747	U	198	fre	20	mOs	4	vreJ	1	WyIg
31598	Na	193	niW	20	kle	4	vlI	1	Wes
31275	mo	193	gEJ	20	kJev	4	vek	1	WaJs
29970	dos	191	TIs	20	jik	4	vah	1	W5s
29745	kO	189	joW	20	jeJs	4	tSo	1	W5
29414	ka	189	blog	20	hap	4	trWo	1	vr

1757	Nor	67	pys	10	bob	3	got	1	my
1746	p!s	67	lJw	10	Ad	3	gOg	1	mWo
1744	mJa	66	pop	9	zJer	3	glJos	1	muW
1725	DJas	66	keW	9	zIg	3	gli	1	mud
1723	kWaW	65	toWr	9	tok	3	gAg	1	moWs
1722	pos	65	ted	9	TJer	3	g8	1	mols
1714	veJs	65	SaW	9	sub	3	fUd	1	mlet
1705	Nas	65	sA	9	soJs	3	frJo	1	mJO
1691	NAW	65	peJ	9	sJer	3	frid	1	mJEt
1689	var	64	top	9	SeWs	3	flip	1	mJes
1680	l6	64	draW	9	sek	3	fAWs	1	mJer
1674	Us	63	fos	9	rJO	3	eyes	1	mJah
1661	kaW	63	fi	9	reJs	3	eWs	1	miss
1658	maW	63	boJ	9	pr1	3	erj	1	mim
1652	nu	62	or	9	p1g	3	ells	1	miJ
1649	LOJs	62	lJAs	9	ors	3	eg	1	mib
1644	boW	62	kWo	9	nOJs	3	Ed	1	merh
1616	ge	62	blJa	9	n7	3	Dit	1	map
1611	teW	62	5Js	9	mur	3	deye	1	maJr
1595	fI	61	teJs	9	loJ	3	ded	1	maJd
1594	jo	61	priW	9	krAW	3	dav	1	m7
1587	teJ	61	dri	9	kJar	3	brEt	1	lWas
1580	pe	60	TJes	9	ket	3	bot	1	luW
1575	paJs	60	s1	9	k5	3	boss	1	lUSs
1541	zos	60	pI	9	jus	3	blJo	1	lub
1531	keJ	60	far	9	It	3	blEs	1	lop
1505	gUs	59	saJs	9	hot	3	blEd	1	lOg
1497	Sar	59	nuJ	9	hik	3	blaW	1	LOes
1473	6	59	gays	9

297	mJo	25	prEJ	5	nWos	2	dUs	1	brI
295	bir	25	neWs	5	NAs	2	dU	1	boyJsh
294	Les	25	n!	5	mlI	2	drus	1	boWs
293	kJa	25	mus	5	mills	2	droW	1	boWr
291	ok	25	ksos	5	mIh	2	drIg	1	blos
290	oJ	25	klar	5	meg	2	doyEJ	1	blok
290	dre	25	key	5	mass	2	days	1	bliW
288	meJ	25	haW	5	m$	2	dag	1	blId
284	tu	25	glJa	5	lWAsk	2	bye	1	blEJ
281	graW	25	eJs	5	loWs	2	bulls	1	bleh
280	has	24	zJO	5	lof	2	bru	1	blAt
278	broW	24	zik	5	lib	2	brJas	1	blat
276	mi	24	ult	5	lEs	2	brJar	1	bJes
276	gras	24	Sris	5	leJW	2	brik	1	b!J
271	dus	24	sir	5	laJs	2	brAt	1	bIg
270	vros	24	S5	5	kur	2	brap	1	bells
269	biW	24	med	5	kroW	2	brah	1	bek
268	n1	24	LaW	5	kross	2	brad	1	bEd
267	Si	24	klos	5	kler	2	boWt	1	bE
263	pros	24	bev	5	kiss	2	bok	1	baSs
261	bAW	23	sAt	5	k3J	2	bma	1	bars
258	trEJ	23	m5	5	jur	2	blus	1	b8
258	rJe	23	lah	5	jOJs	2	bloW	1	b7
257	Og	23	l7	5	jIg	2	bler	1	b$r
257	dr5	23	kup	5	Iv	2	bled	1	b$
252	bas	23	jis	5	ik	2	blAW	1	ayev
250	kI	23	gWar	5	hob	2	blaJr	1	aye
247	zuW	23	grid	5	hJors	2	bJO	1	ayA
245	ksAW	23	grAd	5	h

![](images/end_syllables_freqocc_abg_loglog.png)

## Syllabic structure frequency

We now repeat the same analysis, with the same scripts, on the 7th field (ESTSILABICA).
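
Since only the field number changes between the syllable and the syllabic-structure analyses, the counting step could also be wrapped in a small helper that takes the field as a parameter. The sketch below is one hypothetical way to do it; the function name `count_units` and the toy rows are assumptions and are not part of the original notebook, which inlines this logic in each cell.

```shell
# count_units FIELD: read comma-separated rows on stdin, split the given
# field on "-", and print how many times each piece occurs.
# (Hypothetical helper, for illustration only.)
count_units() {
  awk -F',' -v f="$1" '{n = split($f, u, "-"); for (i = 1; i <= n; i++) print u[i]}' |
    sort | uniq -c | sort -rn
}

# Made-up rows with 7 fields: field 5 holds syllables, field 7 the structure.
rows=$(printf '%s\n' 'x,x,x,x,ka-za,x,CV-CV' 'x,x,x,x,ar-te,x,VC-CV')

# Count syllabic structures (field 7); passing 5 instead would count syllables.
printf '%s\n' "$rows" | count_units 7
```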

In [17]:
tail -n +2 /tmp/CorpusABGv2spell.csv | 
  awk -F',' '{n=split($7,syl,"-"); for(i=0;++i<=n;) print syl[i]}' | 
  sort | uniq -c | sort -rn | tee >/tmp/syllabic_struct_freq_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/syllabic_struct_freq_abg_loglog.png'; set xlabel 'syllabic structure'; set ylabel 'frequency'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column) 
cp /tmp/syllabic_struct_freq_abg .
cp /tmp/syllabic_struct_freq_abg_loglog.png images/.

 194786 CV	    512 VS	     31 CCGVS	      6 CVGV	      2 CCVCCS
  26005 V	    474 CVGC	     23 CVGG	      5 CGVGC	      2 
  23325 CVC	    450 CCGV	     20 CC	      4 CVVS	      1 VGGG
  20525 CVG	    344 CCVS	     19 VCS	      4 CCVVC	      1 VGGC
  15298 CCV	    208 VGS	     17 CGVCC	      3 VGVS	      1 VGCC
  14210 CVS	    151 CVCS	     16 CVV	      3 GVG	      1 VCG
   9786 CGV	    108 VGC	     13 CVCCS	      3 GVCC	      1 GyVC
   5023 VC	     84 GVS	     11 CVVC	      3 CVGGS	      1 GVGS
   3349 VG	     70 VCC	     11 CCVCS	      3 CGyVC	      1 CVGVG
   1570 CVGS	     69 GVC	     10 CVGCS	      3 CGVGS	      1 CVGGCC
   1116 CCVC	     62 CGVG	     10 CGVCS	      2 VGVC	      1 CVGGC
   1099 CGVS	     62 CCVGC	      9 VGG	      2 VGCS	      1 CS
    840 CCVG	     42 CCVGS	      7 G	      2 GVCS	      1 CGyV
    734 GV	     41 CVCC	      6 VGV	      2 CVVCS	      1 CCVV
    693 CGVC	     40 CCGVC	      6 VCCS	      2 CVGGG	      1 CCVGCS


![](images/syllabic_struct_freq_abg_loglog.png)

Now considering only syllabic structures at the beginning of a word.

In [18]:
tail -n +2 /tmp/CorpusABGv2spell.csv | 
  awk -F',' '{n=split($7,syl,"-"); print syl[1]}' | 
  sort | uniq -c | sort -rn | tee >/tmp/start_syllabic_struct_freq_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/start_syllabic_struct_freq_abg_loglog.png'; set xlabel 'start syllabic structure'; set ylabel 'frequency'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column)
cp /tmp/start_syllabic_struct_freq_abg .
cp /tmp/start_syllabic_struct_freq_abg_loglog.png images/.

  45399 CV	    116 CVS	     13 GV	      3 VGVS	      1 VGGG
  20269 V	     63 VGC	      9 VGG	      3 CVGCS	      1 VGGC
   8156 CVC	     45 VCC	      9 VCS	      2 VGCS	      1 VGCC
   6094 CCV	     45 CCVS	      8 VGS	      2 GVC	      1 VCG
   4285 VC	     41 CVGS	      8 CVCCS	      2 CVVCS	      1 GVG
   3813 CVG	     29 VS	      6 VCCS	      2 CVVC	      1 CVVS
   1585 VG	     27 CCVGC	      6 G	      2 CVV	      1 CVGVG
    997 CGV	     24 CGVG	      6 CGVCC	      2 CVGGS	      1 CVGGCC
    629 CCVC	     22 CVCS	      5 VGV	      2 CVGGG	      1 CVGGC
    306 CCVG	     22 CGVS	      5 CGVGC	      2 CGyVC	      1 CGyV
    177 CGVC	     20 CVCC	      5 CCGVS	      2 CGVCS	      1 CCVV
    141 CVGC	     16 CCGVC	      4 CVGV	      2 CCVVC	      1 CCVCS
    128 CCGV	     14 CVGG	      4 CCVGS	      2 


![](images/start_syllabic_struct_freq_abg_loglog.png)

And now considering only those at the end of a word.

In [19]:
tail -n +2 /tmp/CorpusABGv2spell.csv | 
  awk -F',' '{n=split($7,syl,"-"); print syl[n]}' | 
  sort | uniq -c | sort -rn | tee >/tmp/end_syllabic_struct_freq_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/end_syllabic_struct_freq_abg_loglog.png'; set xlabel 'end syllabic structure'; set ylabel 'frequency'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column)
cp /tmp/end_syllabic_struct_freq_abg .
cp /tmp/end_syllabic_struct_freq_abg_loglog.png images/.

  43985 CV	    399 CCVG	     24 CGVG	      4 VGV	      2 CGVGC
  14210 CVS	    352 CCVC	     19 VCS	      4 CVVS	      2 CCVCCS
  12279 CVG	    345 CVGC	     19 CCGVC	      4 CCVVC	      1 VGGG
   6741 CVC	    344 CCVS	     18 CC	      3 VGVS	      1 VGCC
   4642 CGV	    208 VGS	     14 CVGG	      3 GVCC	      1 GyVC
   1570 CVGS	    151 CVCS	     13 CVCCS	      3 CVGV	      1 GVGS
   1482 VG	    134 CCGV	     13 CGVCC	      3 CVGGS	      1 G
   1252 V	     84 GVS	     11 CCVCS	      3 CVCC	      1 CVGVG
   1099 CGVS	     60 VCC	     10 CVGCS	      3 CGVGS	      1 CVGGCC
    993 CCV	     54 VGC	     10 CGVCS	      2 VGVC	      1 CS
    512 VS	     44 CCVGC	      8 CVV	      2 VGCS	      1 CCVGCS
    494 VC	     42 CCVGS	      7 CVVC	      2 GVCS
    409 CGVC	     41 GVC	      6 VCCS	      2 CVVCS
    404 GV	     31 CCGVS	      5 VGG	      2 CGyVC


![](images/end_syllabic_struct_freq_abg_loglog.png)

And now let's take into account the frequency of occurrence of the words in the corpus.

In [20]:
tail -n +2 /tmp/CorpusABGv2spell.csv | 
  awk -F',' '{n=split($7,syl,"-"); for(i=0;++i<=n;) syllable[syl[i]]+=$9} END{for (s in syllable) print syllable[s]"\t"s}' | 
  sort -rn | tee >/tmp/syllabic_struct_freqocc_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/syllabic_struct_freqocc_abg_loglog.png'; set xlabel 'syllabic structure'; set ylabel 'frequency of occurrence'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column) 
cp /tmp/syllabic_struct_freqocc_abg .
cp /tmp/syllabic_struct_freqocc_abg_loglog.png images/.

4105107	CV	10055	CVGS	301	CCVGS	20	G	2	GVCS
1031151	V	6583	CGVS	255	CCGVS	19	CVGGS	2	CVVCS
647318	CVC	5145	CCGV	171	CVGG	17	VGG	2	CVGVG
567467	CVG	4701	CGVG	147	GVG	17	CVGV	2	CVGGG
275931	CCV	3274	GVC	125	CC	14	VCCS	2	
162688	VG	3202	CCGVC	91	CVV	12	VGV	1	VGGG
152647	VC	2572	VS	74	CGVGS	9	CCVVC	1	VCG
146831	CGV	2222	CCVS	59	VCS	6	VGCS	1	GyVC
80644	CVS	1782	VGS	46	CVCCS	5	VGVS	1	GVGS
51313	CVGC	696	CCVGC	42	CVGCS	5	GVCC	1	CVGGCC
25641	CCVC	678	CGVGC	31	CVVS	5	CCVCCS	1	CVGGC
21685	CGVC	676	GVS	31	CGVCS	4	VGCC	1	CS
14417	GV	387	VCC	29	CCVCS	3	CGyVC	1	CGyV
11488	CCVG	346	CVCS	28	CGVCC	2	VGVC	1	CCVV
10240	VGC	327	CVCC	26	CVVC	2	VGGC	1	CCVGCS


![](images/syllabic_struct_freqocc_abg_loglog.png)

Now let's consider only syllabic structures at the beginning of a word.

In [21]:
tail -n +2 /tmp/CorpusABGv2spell.csv | 
  awk -F',' '{n=split($7,syl,"-"); syllable[syl[1]]+=$9} END{for (s in syllable) print syllable[s]"\t"s}' | 
  sort -rn | tee >/tmp/start_syllabic_struct_freqocc_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/start_syllabic_struct_freqocc_abg_loglog.png'; set xlabel 'start syllabic structure'; set ylabel 'frequency of occurrence'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column) 
cp /tmp/start_syllabic_struct_freqocc_abg .
cp /tmp/start_syllabic_struct_freqocc_abg_loglog.png images/.

1644869	CV	2818	CCGVC	107	CCVGC	14	VCCS	2	CVGVG
927315	V	1978	CCGV	61	CCVGS	11	VGV	2	CVGGG
272956	CVG	720	CVS	59	CVCS	11	CVGV	2	CGyVC
269642	CVC	678	CGVGC	52	CCGVS	10	CGVCC	2	CCVVC
142196	CCV	365	CVGS	41	VGS	9	CVV	2	
136148	VG	353	VGC	41	VCS	6	VGCS	1	VGGG
130300	VC	329	VCC	41	CVGG	5	VGVS	1	VCG
30140	CGV	305	CCVS	33	CVCCS	4	VGCC	1	CVGGCC
29994	CVGC	296	CVCC	28	CVVS	3	CVGCS	1	CVGGC
9439	CCVC	216	CGVS	21	GV	3	CGVCS	1	CGyV
7190	CGVC	180	GVC	19	G	2	VGGC	1	CCVV
3392	CGVG	136	VS	18	CVGGS	2	CVVCS	1	CCVCS
2879	CCVG	136	GVG	17	VGG	2	CVVC


![](images/start_syllabic_struct_freqocc_abg_loglog.png)

And now, syllabic structures at the end of words.

In [22]:
tail -n +2 /tmp/CorpusABGv2spell.csv | 
  awk -F',' '{n=split($7,syl,"-"); syllable[syl[n]]+=$9} END{for (s in syllable) print syllable[s]"\t"s}' | 
  sort -rn | tee >/tmp/end_syllabic_struct_freqocc_abg >(nl | gnuplot -e "set terminal png; set output '/tmp/end_syllabic_struct_freqocc_abg_loglog.png'; set xlabel 'end syllabic structure'; set ylabel 'frequency of occurrence'; set logscale xy; plot '/dev/stdin' with lines title 'abgcorpus'") >(column) 
cp /tmp/end_syllabic_struct_freqocc_abg .
cp /tmp/end_syllabic_struct_freqocc_abg_loglog.png images/.

1816902	CV	9966	VGC	346	CVCS	14	VCCS	2	GVCS
445946	V	8704	CCVG	301	CCVGS	14	CVVC	2	CVVCS
398640	CVG	6583	CGVS	255	CCGVS	12	VGG	2	CVGVG
376240	CVC	3031	GVC	156	CVGG	11	CVV	2	CGyVC
118656	VG	2785	CGVG	123	CC	9	CCVVC	1	VGGG
88534	CGV	2572	VS	74	CGVGS	8	CVGV	1	GyVC
80644	CVS	2222	CCVS	59	VCS	6	VGV	1	GVGS
72341	CCV	1841	CCGV	46	CVCCS	6	VGCS	1	G
64024	VC	1782	VGS	42	CVGCS	5	VGVS	1	CVGGCC
50962	CVGC	709	CCGVC	31	CVVS	5	GVCC	1	CS
19904	CCVC	676	GVS	31	CGVCS	5	CCVCCS	1	CCVGCS
17120	CGVC	650	CCVGC	29	CCVCS	4	VGCC
11454	GV	649	CGVGC	21	CGVCC	3	CVCC
10055	CVGS	363	VCC	19	CVGGS	2	VGVC


![](images/end_syllabic_struct_freqocc_abg_loglog.png)