In [66]:
# check if ABG corpus was already downloaded
# if not, download it
if [ ! -f /tmp/Corpus_ABG.csv ]; then 
  wget -q https://raw.githubusercontent.com/SauronGuide/corpusABG/master/Corpus_ABG_Completo_Versao3.csv -O /tmp/Corpus_ABG.csv
fi

Check ABG Corpus file type.

In [67]:
file /tmp/Corpus_ABG.csv

/tmp/Corpus_ABG.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators


Since it uses CRLF line terminators, it was probably created on Windows. This double line terminator is unnacessary and carries an extra byte (```\r```, carriage return) that might appear as ```^M``` on Linux and create further problem.

The BOM marker is also redundant, since it is possible to infer the endianness by a simple analysis of the data. The BOM marker (in UTF-8) is made of the three inicial bytes ```0xEF```,```0xBB```,```0xBF```.

So let's start by removing both of them.

In [68]:
tail --bytes=+4 /tmp/Corpus_ABG.csv | tr -d '\r' > /tmp/CorpusABG.csv

Unfortunately the ABG corpus has many erros. We will try to fix some of them and leave many behind.

The first error we found is empty lines, that appear as: ```,,,,,,,,,,,,,,,,```.

Lets list them and remove them. We will use grep to find them and the parameter ```-n``` to print the line number, along the match.

In [69]:
grep -n '^,' /tmp/CorpusABG.csv 

[32m[K20608[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K20747[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K20957[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K48422[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K48764[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K49034[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K49705[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K49838[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K50175[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K51587[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K53205[m[K[36m[K:[m[K[01;31m[K,[m[K,,,,,,,,,,,,,,,
[32m[K92625[m[K[36m[K:[m[K[01;31m[K,[m[K,   ,,,,,,,,,,,,,,


Now, lets remove those lines. We're using the same command from above and cutting the result to get only the line number. Then we peform a loop on those numbers to create a string with everything we will remove using ```sed```.

In [70]:
LINES=""; 
for i in `grep -n '^,' /tmp/CorpusABG.csv | cut -d: -f1`; do 
  LINES="${LINES}${i}d;"; 
done; 
sed -ie "${LINES}" /tmp/CorpusABG.csv

The corpus has a sequence of commas ```,,,,,``` in the end of each line. We will remove it, since it is unused.

In [71]:
sed -i 's/,,,,,$//g' /tmp/CorpusABG.csv

The fields (columns) in the ABG corpus are listed on the first line.

In [72]:
head -n 1 /tmp/CorpusABG.csv 

ID,PALAVRA,CATMORF,LEMA,TRANSCRICAO,ACENTUACAO,ESTSILABICA,CATACENTUAL,FREQGERAL,FREQORAL,FREQESCRITA,Freq_Nivel


There are 12 fields: ID (1), PALAVRA (2), CATMORF (3), LEMA (4), TRANSCRICAO (5), ACENTUACAO (6), ESTSILABICA (7), CATACENTUAL (8), FREQGERAL (9), FREQORAL (10), FREQESCRITA (11), Freq_Nivel (12).

Another type of error we found in the database is lines that don't have exactly 12 fields (3 have more than 12 and 3 have less than 12). Probably the authors have inserted erroneous commas or have missed some, creating a malformed file. The script bellow present the rows that do not have 12 fields. We also print the line number, so we may once again remove the bad data.

In [73]:
cat /tmp/CorpusABG.csv | sed 's/,\+\s*$//' | awk -F, 'NF!=12{print NR":"$0}'

24089:24093,Jereissati,   F   jereissati,   &je-reJ-sa-Ti*,   &je-reJ-s1-Ty*,   &CV-CVG-CV-CV*,   parox?tona,6,   0o,6,2
35806:35816,Fuentes,   F   Fuentes,   &fE-tes*,   &fE-t4s*,   &CV-CVS*,   ox?tona,3,   0o,3,2
46495:46508,prociss?es,   NP,,   prociss?es,   &pro-si-sOJs*,   &pro-si-s#Js*,   &CCV-CV-CVGS*,   ox?tona,2,2,   0e,2
51341:51359,Josh,   F   Josh,   &jo-Sy*,   &j9-Sy*,   &CV-CV*,   parox?tona,2,   0o,2,2
54571:54607,Juvenile,   NOM,   Juvenile,   &ju-ve-ni-le*,   &ju-ve-n7-ly*,   &CV-CV-CV-CV*,   parox?tona,1,   0o,1,,54607
77325:77959,bulut,   F,   bulut,   &bu-lut*,   &bu-l$t*,   &CV-CVC*,   ox?tona,1,   0o,1,1,,1,   0o,1,1


Lets then remove those lines, just as we did before.

In [74]:
LINES=""; 
for i in `sed 's/,\+\s*$//' < /tmp/Corpus_ABG.csv | awk -F, 'NF!=12{print NR":"$0}' | cut -d: -f1`; do 
  LINES="${LINES}${i}d;"; 
done; 
sed -ie "${LINES}" /tmp/CorpusABG.csv

There are also rows where the FREQGERAL is zero or not a number. 

In [75]:
awk -F, '$9<1{print NR":"$0}' < /tmp/CorpusABG.csv

24086:24093,Jereissati,   F   jereissati,   &je-reJ-sa-Ti*,   &je-reJ-s1-Ty*,   &CV-CVG-CV-CV*,   parox?tona,6,   0o,6,2
35802:35816,Fuentes,   F   Fuentes,   &fE-tes*,   &fE-t4s*,   &CV-CVS*,   ox?tona,3,   0o,3,2
46490:46508,prociss?es,   NP,,   prociss?es,   &pro-si-sOJs*,   &pro-si-s#Js*,   &CCV-CV-CVGS*,   ox?tona,2,2,   0e,2
51329:51359,Josh,   F   Josh,   &jo-Sy*,   &j9-Sy*,   &CV-CV*,   parox?tona,2,   0o,2,2


We're going to remove these data as well.

In [76]:
LINES=""; 
for i in `awk -F, '$9<1{print NR":"$0}' < /tmp/CorpusABG.csv | cut -d: -f1`; do 
  LINES="${LINES}${i}d;"; 
done; 
sed -ie "${LINES}" /tmp/CorpusABG.csv

There are other frequency fields (FREQORAL (10), FREQESCRITA (11), Freq_Nivel (12)) and theit values should be numeric and positive. But many have non valid values.

In [77]:
awk -F, '($10<1||$11<1||$12<1){print NR":"$0}' < /tmp/CorpusABG.csv | wc -l

71043


This number represents a large amount of the total dada. We will leave it be.

In [78]:
TOTAL=$(wc -l /tmp/CorpusABG.csv | cut -d' ' -f1)
COUNTNN=$(awk -F, '($10<1||$11<1||$12<1){print NR":"$0}' < /tmp/CorpusABG.csv | wc -l)
echo "$COUNTNN/$TOTAL" | bc -l

.76726931052358735095


The morphological category (column 3) and stress category (column 8) are categorical variables, they might have values in a finite set and their value assigning each sample to a different group. Lets check the values used in the corpus.


Morphological category (column 3):

In [79]:
awk -F',[ ]*' '{print $3}' /tmp/CorpusABG.csv | sort | uniq -c | sort -rn

  36563 NOM
  32776 V
  11681 ADJ
   5067 F
   2188 V+P
   1456 C
   1061 ADV
    717 
    430 G
    184 P
    179 I
     80 PREP+P
     55 CONJ
     42 NUM
     39 PREP
     38 PREP+DET
     27 DET
      5 PREP+ADV
      1 V+p
      1 FNOM
      1 f
      1 CATMORF


We see there 717 with no morphological category assigned, there is one "V+p" that might be "V+P" and a "f" that might be "F". We might easily correct those two.

In [80]:
awk -F',[ ]*' 'BEGIN{OFS=", "} ($3=="f"){$3="F"} ($3=="V+p"){$3="V+P"} {print}' /tmp/CorpusABG.csv > /tmp/tmpABG
mv /tmp/tmpABG /tmp/CorpusABG.csv

The list of the 717 of entries with empty morphological category (column 3) is given bellow:

In [81]:
awk -F',[ ]*' '($3==""){print $1,$2}' /tmp/CorpusABG.csv | column

143 legal		395 crian?as		634 logo
149 segundo		396 pol?cia		635 estados
150 vida		397 sala		636 velho
151 outros		398 p?blico		637 m?dico
152 nunca		399 enquanto		638 a??o
153 foram		400 nosso		639 mal
154 ?poca		401 junto		640 votos
155 ia			402 sul			641 sociedade
156 S?o			403 falava		642 alunos
157 disse		404 quarto		643 relacionadas
158 est?o		405 situa??o		644 pequena
159 fui			406 talvez		645 aquelas
160 tipo		407 social		646 debate
161 ir			408 conhece		647 gostoso
162 quer		409 seguran?a		648 trabalhando
163 dentro		410 umas		649 antigamente
164 pessoal		411 idade		650 quiser
165 ficou		412 m?s			651 dilma
166 d?			413 dados		652 papel
167 ver			414 for			653 l?ngua
168 dizer		415 podem		654 equipe
169 falei		416 Mas			655 saiu
170 maior		417 bairros		656 seguinte
171 seus		418 importante		657 cinema
172 entendeu		419 filha		658 pensar
173 teve		420 pouquinho		659 ruim
174 jeito		421 conhe?o		660 eleitoral
175 tanto		422 curso		661 cidades
176 essas		423 processo		662 ideia
17

391 professor		630 entrevista		869 obrigada
392 viol?ncia		631 condi??es		870 Petrobras
393 viu			632 certeza		871 guerra
394 zona		633 baixo		872 poucos


Stress category (column 8):

In [82]:
awk -F',[ ]*' '{print $8}' /tmp/CorpusABG.csv | sort | uniq -c | sort -rn

  55262 parox?tona
  21694 ox?tona
   6800 paroxitona
   3470 oxitona
   3392 proparox?tona
   1530 mono
    396 proparoxitona
     40 4
      2 PARox?tona
      2 ox?tono
      1 quatro
      1 parox?ton
      1 parox?ona
      1 CATACENTUAL


There are 40 entries with value 4 for stress category, one entry with value 'quatro' (might be the same as 4) and there are many mistyped names which we might correct with a simple substitution.

In [83]:
# fix stress category
sed -i 's/ox?tona/oxitona/g' /tmp/CorpusABG.csv
sed -i 's/ox?tono/oxitona/g' /tmp/CorpusABG.csv
sed -i 's/PARoxitona/paroxitona/g' /tmp/CorpusABG.csv
sed -i 's/parox?ona/paroxitona/g' /tmp/CorpusABG.csv
sed -i 's/parox?ton/paroxitona/g' /tmp/CorpusABG.csv

Now, lets check some of those entries with value 4 in stress category.

In [84]:
awk -F',[ ]*' '($8==4){print}' /tmp/CorpusABG.csv | head -n 5

719,t?cnico,,   t?cnico,   &t5-ky-ni-ko*,   &t5-ky-ny-kw*,   &CV-CV-CV-CV*,4,500,151,349,3
1485,t?cnica,   NOM,   t?cnica,   &t5-ky-ni-ka*,   &t5-ky-ny-k@*,   &CV-CV-CV-CV*,4,231,70,161,3
2666,t?cnicos,   ADJ,   t?cnicos,   &t5-ky-ni-kos*,   &t5-ky-ny-kws*,   &CV-CV-CV-CVS*,4,121,13,108,3
4111,t?cnicas,   ADJ,   t?cnicas,   &t5-ky-ni-kas*,   &t5-ky-ny-k@s*,   &CV-CV-CV-CVC*,4,73,11,62,2
5060,d?ficit,   NOM,   d?ficit,   &d5-fi-si-ty*,   &d5-fy-sy-ty*,   &CV-CV-CV-CV*,4,56,3,53,2


And the complete list of words with value 4 (or 'quatro') in stress category is:

In [85]:
awk -F',[ ]*' '($8==4||$8=="quatro"){print $1,$2}' /tmp/CorpusABG.csv | column

719 t?cnico		31949 Polit?cnica	56961 antiss?ptico
1485 t?cnica		34472 eletrot?cnico	57174 antiss?pticos
2666 t?cnicos		34935 ?tnicas		57581 apocal?pticas
4111 t?cnicas		41512 aut?ctones	59961 c?psula
5060 d?ficit		42759 eletrot?cnica	64142 el?ptico
13979 l?xico		43225 el?ptica		64157 epil?ptica
16379 polit?cnica	44164 ex-t?cnico	64278 epil?pticos
16904 ?tnica		46546 logar?tmica	72143 lepid?pteros
19590 inc?gnita		50151 pan?ptico		74419 g?ngsteres
21275 ?tnicos		51463 r?tmico		77274 multi?tnico
23465 ?tnico		51627 sociot?cnica	83538 d?couvert
26420 inc?gnito		56107 h?bitat		84891 pirot?cnicos
28814 apocal?ptica	56835 anal?pticos	87837 Polit?cnico
29173 c?psulas		56954 antiss?ptica


It seems all of them are proparoxytone ('proparoxitona') in fact. So lets correct them.

In [86]:
awk -F',[ ]*' 'BEGIN{OFS=", "} ($8==4||$8=="quatro"){$8="proparoxitona"} {print}' /tmp/CorpusABG.csv > /tmp/tmpABG
mv /tmp/tmpABG /tmp/CorpusABG.csv

Now lets make an histogram for the stress category.

In [87]:
awk -F',[ ]*' '{print $8}' /tmp/CorpusABG.csv | sort | uniq -c | sort -rn | 
  head -n -1 | nl| 
  gnuplot -e "set terminal png; set output 'images/stress_category.png'; set xlabel 'categoria acentual'; set ylabel 'frequencia'; set style fill solid; set boxwidth 1; set title 'corpus abg'; set xtics rotate by 45 right; plot '/dev/stdin' using 1:2:xtic(3) with boxes notitle"

![](images/stress_category.png)

Many entries have ```?``` in the word (column 2) or lemma (column 5) transcription. The amount is:

In [88]:
awk -F',[ ]*' '($2~/?/)||($4~/?/){print $2, $4, $5}' /tmp/CorpusABG.csv | wc -l

18007


and some examples are given bellow:

In [89]:
awk -F',[ ]*' '($2~/?/)||($4~/?/){print $2, $4, $5}' /tmp/CorpusABG.csv | head

l? l? &l1*
tamb?m tamb?m &tA-b6*
est? est? &es-t1*
s?o s?o &sAW*
j? j? &j1*
s? s? &s!*
? ? a&a*
at? at? &a-t5*
?s ?s &as*
m?e m?e &mA-e*


Lets create a list of all entries that still need a fix (we will not fix lema).

In [90]:
#awk -F',[ ]*' 'BEGIN{OFS=", "} ($2~/?/){print NR, $2, $4, $5}' /tmp/CorpusABG.csv > /tmp/needfixlist

In the ABG Corpus repository there is a file ```Corpus_Tag_Freq_Trans.txt``` (probably a intermediary file) which has some data that might be used to fix those ```?``` in the corpus file. 

Lets first download this file:

In [91]:
if [ ! -f /tmp/Acentuador.zip ]; then 
  wget -q https://github.com/SauronGuide/corpusABG/raw/master/7-%20Acentuador.zip -O /tmp/Acentuador.zip
fi 
#unzip -p /tmp/Acentuador.zip "7- Acentuador/Corpus_Tag_Freq_Trans.txt" > /tmp/Corpus_Tag_Freq_Trans.txt
unzip -p /tmp/Acentuador.zip "7- Acentuador/Corpus_Transcrito.xlsx" > /tmp/Corpus_Transcrito.xlsx

And not lets make a script to fix it.

For each entry that has ```?``` in it, which is listed in file ```/tmp/needfixlist``` created above, we will find the corresponding entry (same pronounciation) and replace the mispelled word by the vertion in the downloaded file.

In [92]:
#while read line; do
#  TRANSC=$(echo $line | awk -F', ' '{print $4}')
#  PATT=${TRANSC/\*/\\\*}
#  WORD=$(grep "$PATT" /tmp/Corpus_Tag_Freq_Trans.txt | cut -f1 | sed -r 's/[^[:alnum:]]//g' | awk '{print tolower($0)}' | grep "[àáãâéêíóõôú]" | head -n 1)
#  LINNUM=$(echo $line | awk -F', ' '{print $1}')
#  if [ -z "$WORD" ]; then
#    awk -F',[ ]*' -v linnum="$LINNUM" -v word="$WORD" 'BEGIN{OFS=","} NR==linnum{$2=word} {$1=$1}1' /tmp/CorpusABG.csv > /tmp/tmpabg
#    mv /tmp/tmpabg /tmp/CorpusABG.csv  
#  fi
#done < /tmp/needfixlist

Lets create a list of all entries that still need a fix.

In [93]:
awk -F',[ ]*' 'BEGIN{OFS=","} ($2~/?/){print NR, $2, $4, $5, $9, $10, $11}' /tmp/CorpusABG.csv > /tmp/needfixlist

In [94]:
libreoffice --headless --convert-to csv --outdir /tmp/ /tmp/Corpus_Transcrito.xlsx

convert /tmp/Corpus_Transcrito.xlsx -> /tmp/Corpus_Transcrito.csv using filter : Text - txt - csv (StarCalc)
Overwriting: /tmp/Corpus_Transcrito.csv


In [95]:
file /tmp/Corpus_Transcrito.csv 

/tmp/Corpus_Transcrito.csv: Non-ISO extended-ASCII text


In [96]:
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT < /tmp/Corpus_Transcrito.csv  > /tmp/Corpus_Transcrito_utf8.csv

In [97]:
while read line; do
  LINNUM=$(echo $line | awk -F',' '{print $1}')
  TRANSC=$(echo $line | awk -F',' '{print $4}')
  F1=$(echo $line | awk -F',' '{print $5}')
  F2=$(echo $line | awk -F',' '{print $6}')
  F3=$(echo $line | awk -F',' '{print $7}')
  WORD=$(awk -F'[ ]*,[ ]*' -v prn="$TRANSC" -v f1="$F1" -v f2="$F2" -v f3="$F3" 'BEGIN{OFS=","} $4==prn&&$8==f1&&$9==f2&&$10==f3{print $1}' /tmp/Corpus_Transcrito_utf8.csv)
  if [ ! -z "$WORD" ]; then
     awk -F',[ ]*' -v linnum="$LINNUM" -v word="$WORD" 'BEGIN{OFS=","} NR==linnum{$2=word} {$1=$1}1' /tmp/CorpusABG.csv > /tmp/tmpabg
     mv /tmp/tmpabg /tmp/CorpusABG.csv
  fi
done < /tmp/needfixlist

Keyboard Interrupt


In [98]:
# add number of phonemes and syllables
awk -F','  'BEGIN{OFS=","} NR==1{printf "%s",$0; printf "%s",",NUMFONES,NUMSILABAS\n"} NR>1{gsub(/^\&|\*$/,"",$5); gsub(/^\&|\*$/,"",$6); gsub(/^\&|\*$/,"",$7); printf "%s,",$0; gsub(/-/,"",$5); printf "%d,%d\n", length($5), gsub(/-/,"",$7)+1}' /tmp/CorpusABG.csv > /tmp/CorpusABGv2.csv