In [1]:
"""

https://github.com/nlpyang/geval/tree/main/prompts/summeval


Coherence:

Coherence  (1-5): 

1. Read the original text carefully and identify the main topic and key points.
2. Read the paraphrased text and compare it to the original text. Check if the paraphrased text covers the main topic and key points of the original text, and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.

--------------------------------

Factual consistency

Consistency (1-5): 

the factual alignment between the paraphrased text and the original text. 
A factually consistent paraphrased text contains only statements that are entailed by the original text. 
Annotators were also asked to penalize paraphrased text that contained hallucinated facts. 

1. Read the original text carefully and identify the main facts and details it presents.
2. Read the paraphrased text and compare it to the original text. Check if the paraphrased text contains any factual errors that are not supported by the original text.
3. Assign a score for consistency based on the Evaluation Criteria.

--------------------------------

Fluency


Fluency (1-3): the quality of the text in terms of grammar, spelling, punctuation, word choice, and sentence structure.

- 1: Poor. The text has many errors that make it hard to understand or sound unnatural.
- 2: Fair. The text has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.
- 3: Good. The text has few or no errors and is easy to read and follow.

--------------------------------

Relevance

Relevance (1-5) - selection of important content from the original text. The paraphrased text should include only important information from the original text.
Annotators were instructed to penalize paraphrased text which contained redundancies and excess information.

1. Read the paraphrased text and the original text carefully.
2. Compare the paraphrased text to the original text and identify the main points of the text.
3. Assess how well the paraphrased text covers the main points of the original text, and how much irrelevant or redundant information it contains.
4. Assign a relevance score from 1 to 5.


"""





In [None]:
"""
Answer Correctness:


The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. 
This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. 
A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. 
These aspects are combined using a weighted scheme to formulate the answer correctness score. 
Users also have the option to employ a ‘threshold’ value to round the resulting score to binary, if desired.
Let’s calculate the answer correctness for the answer with low answer correctness. 
It is computed as the weighted sum of factual correctness and the semantic similarity between the given answer and the ground truth.

Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer.


https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html
"""

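The weighted combination described in the note above can be sketched as follows. This is a minimal illustration, not the Ragas implementation: the F1 formulation of factual overlap, the default weights `(0.75, 0.25)`, and the function names are assumptions made for the example.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Factual overlap expressed as an F1 score over statement counts.

    tp: statements present in both the answer and the ground truth
    fp: statements only in the answer
    fn: statements only in the ground truth
    (Using F1 here is an assumption made for this sketch.)
    """
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom > 0 else 0.0


def answer_correctness(factual, semantic, weights=(0.75, 0.25), threshold=None):
    """Weighted average of factual correctness and semantic similarity.

    Both inputs are expected in [0, 1]. The default weights are illustrative.
    If `threshold` is given, the score is rounded to 0.0 or 1.0.
    """
    w_factual, w_semantic = weights
    score = (w_factual * factual + w_semantic * semantic) / (w_factual + w_semantic)
    if threshold is not None:
        return float(score >= threshold)
    return score
```

For example, `answer_correctness(f1_from_counts(2, 1, 1), 0.9)` combines a factual F1 of about 0.67 with a semantic similarity of 0.9; passing `threshold=0.5` turns the result into a binary decision.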


In [None]:
"""
Paraphrase (1-3): 

Checks whether a pair of phrases conveys approximately the same meaning using different words. Annotators were asked to mark the extent to which a sentence could be considered a paraphrase of its counterpart, as well as its fluency.

1. Read the paraphrased text and the original text carefully.
2. Read the paraphrased text and compare it to the original text. Check the paraphrased text for synonym errors that are not supported by the original text.
3. Evaluate the extent to which the paraphrased text covers the main semantic meaning of the original text.
4. Assign a score for paraphrase based on the Evaluation Criteria.

"""

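The metric notes above all share the same shape: a name, a score range, a definition, and numbered evaluation steps. One way to turn them into a prompt for an LLM judge is sketched below. The function name, the section labels, and the layout of the prompt are hypothetical and are not taken from the G-Eval repository.

```python
def build_eval_prompt(metric, scale, definition, steps, original, paraphrase):
    """Assemble an evaluation prompt from a metric definition and its steps.

    metric:     metric name, e.g. "Fluency"
    scale:      (lowest, highest) score, e.g. (1, 3)
    definition: one-sentence description of the metric
    steps:      list of evaluation steps (numbered automatically)
    """
    low, high = scale
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"{metric} ({low}-{high}): {definition}\n\n"
        f"Evaluation Steps:\n{numbered}\n\n"
        f"Original text:\n{original}\n\n"
        f"Paraphrased text:\n{paraphrase}\n\n"
        f"{metric} score ({low}-{high}):"
    )
```

Keeping the definitions as data and the template as a function makes it easy to evaluate the same text pair under every metric with one loop.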


In [1]:

import pandas as pd
pd.set_option('display.max_colwidth', None)

df_english    = pd.read_csv("./data/FILTERED/english.tsv", sep="\t", index_col=0)
df_french     = pd.read_csv("./data/FILTERED/french.tsv", sep="\t", index_col=0)
df_spanish    = pd.read_csv("./data/FILTERED/spanish.tsv", sep="\t", index_col=0)
df_italian    = pd.read_csv("./data/FILTERED/italian.tsv", sep="\t", index_col=0)
df_portuguese = pd.read_csv("./data/FILTERED/portuguese.tsv", sep="\t", index_col=0)


df_english["language"]    = "EN"
df_french["language"]     = "FR"
df_spanish["language"]    = "ES"
df_italian["language"]    = "IT"
df_portuguese["language"] = "PT"

In [2]:
"""
Remove everything from the "Footnotes" marker onward.

"""
import re

# Apply the same substitution to both text columns of every language frame.
for df in (df_portuguese, df_english, df_french, df_spanish, df_italian):
    for col in ("texto_x", "texto_y"):
        df[col] = [re.sub(r'Footnotes.*$', '', str(x)) for x in df[col]]


# Remove everything inside square brackets, brackets included.
for df in (df_italian, df_spanish, df_french, df_english, df_portuguese):
    for col in ("texto_x", "texto_y"):
        df[col] = [re.sub(r'\[([^\]]*)\]', '', str(x)) for x in df[col]]




# Drop Spanish rows whose source or target style code contains JBSDHH or RVR1690.
for code in ("JBSDHH", "RVR1690"):
    for col in ("estilo_x", "estilo_y"):
        df_spanish = df_spanish[~df_spanish[col].str.contains(code)]



def filtrar_numeros_hifen(column):
    regex = r'\d+-\d+'  # Digits joined by a hyphen, e.g. "12-14"
    return column.str.contains(regex)
def filtrar_numeros_pontos(column):
    regex = r'\d+:\d+'  # Digits joined by a colon, e.g. "3:16"
    return column.str.contains(regex)


df_english = df_english[~filtrar_numeros_hifen(df_english['texto_x'])]
df_english = df_english[~filtrar_numeros_pontos(df_english['texto_x'])]

df_spanish = df_spanish[~filtrar_numeros_hifen(df_spanish['texto_x'])]
df_spanish = df_spanish[~filtrar_numeros_pontos(df_spanish['texto_x'])]

df_french = df_french[~filtrar_numeros_hifen(df_french['texto_x'])]
df_french = df_french[~filtrar_numeros_pontos(df_french['texto_x'])]

df_portuguese = df_portuguese[~filtrar_numeros_hifen(df_portuguese['texto_x'])]
df_portuguese = df_portuguese[~filtrar_numeros_pontos(df_portuguese['texto_x'])]

df_italian = df_italian[~filtrar_numeros_hifen(df_italian['texto_x'])]
df_italian = df_italian[~filtrar_numeros_pontos(df_italian['texto_x'])]



#df_portuguese = df_portuguese[df_portuguese["targetLen"] / df_portuguese["sourceLen"] >= 0.78]
#df_portuguese = df_portuguese[df_portuguese["sourceLen"] / df_portuguese["targetLen"] > 0.78]

#df_spanish = df_spanish[df_spanish["targetLen"] / df_spanish["sourceLen"] >= 0.78]
#df_spanish = df_spanish[df_spanish["sourceLen"] / df_spanish["targetLen"] > 0.78]

#df_italian = df_italian[df_italian["targetLen"] / df_italian["sourceLen"] >= 0.78]
#df_italian = df_italian[df_italian["sourceLen"] / df_italian["targetLen"] > 0.78]

#df_french = df_french[df_french["targetLen"] / df_french["sourceLen"] >= 0.78]
#df_french = df_french[df_french["sourceLen"] / df_french["targetLen"] > 0.78]

#df_english = df_english[df_english["targetLen"] / df_english["sourceLen"] >= 0.78]
#df_english = df_english[df_english["sourceLen"] / df_english["targetLen"] > 0.78]
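The cleaning steps in the cell above (strip footnotes, strip bracketed text, drop rows whose source text contains a numeric reference) could also be bundled into one helper. The sketch below is an illustration under assumptions: the name `clean_pairs` is hypothetical, and the two digit patterns are merged into the single pattern `\d+[-:]\d+`, which matches the same rows.

```python
import re

import pandas as pd

FOOTNOTES = re.compile(r"Footnotes.*$")   # "Footnotes" and everything after it
BRACKETS = re.compile(r"\[[^\]]*\]")      # any [...] span, brackets included
NUMERIC_REF = r"\d+[-:]\d+"               # e.g. "12-14" or "3:16"


def clean_pairs(df, source_col="texto_x", target_col="texto_y"):
    """Return a cleaned copy of a sentence-pair frame.

    1. Remove footnotes and bracketed text from both text columns.
    2. Drop rows whose source text still contains a numeric reference.
    """
    out = df.copy()
    for col in (source_col, target_col):
        out[col] = [BRACKETS.sub("", FOOTNOTES.sub("", str(x))) for x in out[col]]
    has_ref = out[source_col].str.contains(NUMERIC_REF, regex=True)
    return out[~has_ref]
```

Working on a copy leaves the input frame untouched, so the helper can be rerun safely while the filters are being tuned.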





In [3]:
# Remove rows that mention a book name, for English, Italian, Spanish and French
import utils

print(df_french.shape)
#french
for livro in utils.dicFenchNew.keys():

    filter = df_french['texto_x'].str.contains(" - "+livro)
    df_french = df_french[~filter]

for livro in utils.dicFenchNew.keys():

    filter = df_french['texto_y'].str.contains(" - "+livro)
    df_french = df_french[~filter]


for livro in utils.dictFrenchOld.keys():

    filter = df_french['texto_x'].str.contains(" - "+livro)
    df_french = df_french[~filter]

for livro in utils.dictFrenchOld.keys():

    filter = df_french['texto_y'].str.contains(" - "+livro)
    df_french = df_french[~filter]
print(df_french.shape)


(153194, 12)
(151387, 12)


In [4]:
print(df_italian.shape)
#italian
for livro in utils.dictItalianNew.keys():

    filter = df_italian['texto_x'].str.contains(" - "+livro)
    df_italian = df_italian[~filter]

for livro in utils.dictItalianNew.keys():

    filter = df_italian['texto_y'].str.contains(" - "+livro)
    df_italian = df_italian[~filter]


for livro in utils.dictItalianOld.keys():

    filter = df_italian['texto_x'].str.contains(" - "+livro)
    df_italian = df_italian[~filter]

for livro in utils.dictItalianOld.keys():

    filter = df_italian['texto_y'].str.contains(" - "+livro)
    df_italian = df_italian[~filter]
print(df_italian.shape)

(181260, 12)
(175940, 12)


In [5]:
print(df_english.shape)
#English
for livro in utils.dicEnglishNew.keys():

    filter = df_english['texto_x'].str.contains(" - "+livro)
    df_english = df_english[~filter]

for livro in utils.dicEnglishNew.keys():

    filter = df_english['texto_y'].str.contains(" - "+livro)
    df_english = df_english[~filter]


for livro in utils.dictEnglishOld.keys():

    filter = df_english['texto_x'].str.contains(" - "+livro)
    df_english = df_english[~filter]

for livro in utils.dictEnglishOld.keys():

    filter = df_english['texto_y'].str.contains(" - "+livro)
    df_english = df_english[~filter]
print(df_english.shape)

(10547933, 12)
(10523875, 12)


In [6]:
print(df_spanish.shape)
#Spanish
for livro in utils.dicSpanishNew.keys():

    filter = df_spanish['texto_x'].str.contains(" - "+livro)
    df_spanish = df_spanish[~filter]

for livro in utils.dicSpanishNew.keys():

    filter = df_spanish['texto_y'].str.contains(" - "+livro)
    df_spanish = df_spanish[~filter]


for livro in utils.dictSpanishOld.keys():

    filter = df_spanish['texto_x'].str.contains(" - "+livro)
    df_spanish = df_spanish[~filter]

for livro in utils.dictSpanishOld.keys():

    filter = df_spanish['texto_y'].str.contains(" - "+livro)
    df_spanish = df_spanish[~filter]
print(df_spanish.shape)

(2239902, 12)
(2239011, 12)
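The four cells above repeat the same filter for each language and dictionary. A single helper could do the same in one pass per column, sketched below under assumptions: `drop_rows_mentioning` is a hypothetical name, and unlike the cells above it escapes each book name with `re.escape`, so names are matched literally rather than as regular expressions.

```python
import re

import numpy as np
import pandas as pd


def drop_rows_mentioning(df, names, cols=("texto_x", "texto_y"), prefix=" - "):
    """Drop rows where any of `cols` contains `prefix + name` for any name."""
    names = list(names)
    if not names:
        return df
    # One alternation pattern instead of one pass per book name.
    pattern = "|".join(re.escape(prefix + name) for name in names)
    mask = np.zeros(len(df), dtype=bool)
    for col in cols:
        mask |= df[col].str.contains(pattern, regex=True, na=False).to_numpy()
    return df[~mask]
```

A call such as `drop_rows_mentioning(df_french, list(utils.dicFenchNew) + list(utils.dictFrenchOld))` would then replace the four loops of the French cell, assuming the dictionary keys are plain book names.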


In [7]:
import style

df_style_en = pd.DataFrame.from_dict(
style.dict_styles_en
)

df_style_fr = pd.DataFrame.from_dict(
style.dict_styles_fr
)

df_style_it = pd.DataFrame.from_dict(
style.dict_styles_it
)

df_style_pt = pd.DataFrame.from_dict(
style.dict_styles_pt
)

df_style_es = pd.DataFrame.from_dict(
style.dict_styles_es
)




In [8]:
df_style_pt

Unnamed: 0,livro,forma_traducao,linguagem,formal/informal
0,ARC,literal,arcaica,formal
1,VFL,dinâmica,moderna,informal
2,NTLH,dinâmica,moderna,informal
3,NVT,dinâmica,moderna,formal
4,NVI-PT,dinâmica,moderna,formal
5,OL,literal,arcaica,formal


In [9]:
#Italian 
for line in df_italian.groupby(["estilo_x", "estilo_y"]).count().reset_index(["estilo_x", "estilo_y"])[["estilo_x", "estilo_y"]].values:

    estilo_origem = df_style_it.loc[df_style_it['livro'] == line[0]]["forma_traducao"].values
    estilo_alvo   =  df_style_it.loc[df_style_it['livro'] == line[1]]["forma_traducao"].values

    linguagem_origem = df_style_it.loc[df_style_it['livro'] == line[0]]["linguagem"].values
    linguagem_alvo   =  df_style_it.loc[df_style_it['livro'] == line[1]]["linguagem"].values

    forinf_origem = df_style_it.loc[df_style_it['livro'] == line[0]]["formal/informal"].values
    forinf_alvo   =  df_style_it.loc[df_style_it['livro'] == line[1]]["formal/informal"].values
    
    df_italian.loc[df_italian["estilo_x"] == line[0], "forma_x"] = estilo_origem[0]
    df_italian.loc[df_italian["estilo_y"] == line[1], "forma_y"] = estilo_alvo[0]

    df_italian.loc[df_italian["estilo_x"] == line[0], "linguagem_x"] = linguagem_origem[0]
    df_italian.loc[df_italian["estilo_y"] == line[1], "linguagem_y"] = linguagem_alvo[0]

    df_italian.loc[df_italian["estilo_x"] == line[0], "formal/informal_x"] = forinf_origem[0]
    df_italian.loc[df_italian["estilo_y"] == line[1], "formal/informal_y"] = forinf_alvo[0]


#Spanish 
for line in df_spanish.groupby(["estilo_x", "estilo_y"]).count().reset_index(["estilo_x", "estilo_y"])[["estilo_x", "estilo_y"]].values:

    estilo_origem = df_style_es.loc[df_style_es['livro'] == line[0]]["forma_traducao"].values
    estilo_alvo   =  df_style_es.loc[df_style_es['livro'] == line[1]]["forma_traducao"].values

    linguagem_origem = df_style_es.loc[df_style_es['livro'] == line[0]]["linguagem"].values
    linguagem_alvo   =  df_style_es.loc[df_style_es['livro'] == line[1]]["linguagem"].values

    forinf_origem = df_style_es.loc[df_style_es['livro'] == line[0]]["formal/informal"].values
    forinf_alvo   =  df_style_es.loc[df_style_es['livro'] == line[1]]["formal/informal"].values
    
    df_spanish.loc[df_spanish["estilo_x"] == line[0], "forma_x"] = estilo_origem[0]
    df_spanish.loc[df_spanish["estilo_y"] == line[1], "forma_y"] = estilo_alvo[0]

    df_spanish.loc[df_spanish["estilo_x"] == line[0], "linguagem_x"] = linguagem_origem[0]
    df_spanish.loc[df_spanish["estilo_y"] == line[1], "linguagem_y"] = linguagem_alvo[0]

    df_spanish.loc[df_spanish["estilo_x"] == line[0], "formal/informal_x"] = forinf_origem[0]
    df_spanish.loc[df_spanish["estilo_y"] == line[1], "formal/informal_y"] = forinf_alvo[0]


#English 
for line in df_english.groupby(["estilo_x", "estilo_y"]).count().reset_index(["estilo_x", "estilo_y"])[["estilo_x", "estilo_y"]].values:

    estilo_origem = df_style_en.loc[df_style_en['livro'] == line[0]]["forma_traducao"].values
    estilo_alvo   =  df_style_en.loc[df_style_en['livro'] == line[1]]["forma_traducao"].values

    linguagem_origem = df_style_en.loc[df_style_en['livro'] == line[0]]["linguagem"].values
    linguagem_alvo   =  df_style_en.loc[df_style_en['livro'] == line[1]]["linguagem"].values

    forinf_origem = df_style_en.loc[df_style_en['livro'] == line[0]]["formal/informal"].values
    forinf_alvo   =  df_style_en.loc[df_style_en['livro'] == line[1]]["formal/informal"].values
    
    df_english.loc[df_english["estilo_x"] == line[0], "forma_x"] = estilo_origem[0]
    df_english.loc[df_english["estilo_y"] == line[1], "forma_y"] = estilo_alvo[0]

    df_english.loc[df_english["estilo_x"] == line[0], "linguagem_x"] = linguagem_origem[0]
    df_english.loc[df_english["estilo_y"] == line[1], "linguagem_y"] = linguagem_alvo[0]

    df_english.loc[df_english["estilo_x"] == line[0], "formal/informal_x"] = forinf_origem[0]
    df_english.loc[df_english["estilo_y"] == line[1], "formal/informal_y"] = forinf_alvo[0]


#french 
for line in df_french.groupby(["estilo_x", "estilo_y"]).count().reset_index(["estilo_x", "estilo_y"])[["estilo_x", "estilo_y"]].values:

    estilo_origem = df_style_fr.loc[df_style_fr['livro'] == line[0]]["forma_traducao"].values
    estilo_alvo   =  df_style_fr.loc[df_style_fr['livro'] == line[1]]["forma_traducao"].values

    linguagem_origem = df_style_fr.loc[df_style_fr['livro'] == line[0]]["linguagem"].values
    linguagem_alvo   =  df_style_fr.loc[df_style_fr['livro'] == line[1]]["linguagem"].values

    forinf_origem = df_style_fr.loc[df_style_fr['livro'] == line[0]]["formal/informal"].values
    forinf_alvo   =  df_style_fr.loc[df_style_fr['livro'] == line[1]]["formal/informal"].values
    
    df_french.loc[df_french["estilo_x"] == line[0], "forma_x"] = estilo_origem[0]
    df_french.loc[df_french["estilo_y"] == line[1], "forma_y"] = estilo_alvo[0]

    df_french.loc[df_french["estilo_x"] == line[0], "linguagem_x"] = linguagem_origem[0]
    df_french.loc[df_french["estilo_y"] == line[1], "linguagem_y"] = linguagem_alvo[0]

    df_french.loc[df_french["estilo_x"] == line[0], "formal/informal_x"] = forinf_origem[0]
    df_french.loc[df_french["estilo_y"] == line[1], "formal/informal_y"] = forinf_alvo[0]

#portuguese 
for line in df_portuguese.groupby(["estilo_x", "estilo_y"]).count().reset_index(["estilo_x", "estilo_y"])[["estilo_x", "estilo_y"]].values:
    estilo_origem = df_style_pt.loc[df_style_pt['livro'] == line[0]]["forma_traducao"].values
    estilo_alvo   =  df_style_pt.loc[df_style_pt['livro'] == line[1]]["forma_traducao"].values

    linguagem_origem = df_style_pt.loc[df_style_pt['livro'] == line[0]]["linguagem"].values
    linguagem_alvo   =  df_style_pt.loc[df_style_pt['livro'] == line[1]]["linguagem"].values

    forinf_origem = df_style_pt.loc[df_style_pt['livro'] == line[0]]["formal/informal"].values
    forinf_alvo   =  df_style_pt.loc[df_style_pt['livro'] == line[1]]["formal/informal"].values
    
    df_portuguese.loc[df_portuguese["estilo_x"] == line[0], "forma_x"] = estilo_origem[0]
    df_portuguese.loc[df_portuguese["estilo_y"] == line[1], "forma_y"] = estilo_alvo[0]

    df_portuguese.loc[df_portuguese["estilo_x"] == line[0], "linguagem_x"] = linguagem_origem[0]
    df_portuguese.loc[df_portuguese["estilo_y"] == line[1], "linguagem_y"] = linguagem_alvo[0]

    df_portuguese.loc[df_portuguese["estilo_x"] == line[0], "formal/informal_x"] = forinf_origem[0]
    df_portuguese.loc[df_portuguese["estilo_y"] == line[1], "formal/informal_y"] = forinf_alvo[0]
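The five loops above attach the same three style attributes to each language frame. A lookup with `Series.map` gives the same columns without looping over style pairs. This is a sketch under assumptions: `attach_style` is a hypothetical name, `livro` values are assumed to be unique in the style table, and a style code missing from the table yields NaN here instead of raising an error.

```python
import pandas as pd

# Style-table column -> prefix of the column created on the pair frame.
STYLE_COLUMNS = {
    "forma_traducao": "forma",
    "linguagem": "linguagem",
    "formal/informal": "formal/informal",
}


def attach_style(df, df_style):
    """Add forma_*, linguagem_* and formal/informal_* columns for x and y."""
    lookup = df_style.set_index("livro")
    out = df.copy()
    for side in ("x", "y"):
        for attr, name in STYLE_COLUMNS.items():
            out[f"{name}_{side}"] = out[f"estilo_{side}"].map(lookup[attr])
    return out
```

With this helper, each language would need a single call such as `df_italian = attach_style(df_italian, df_style_it)`.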



In [10]:
df_concat = pd.concat([df_portuguese, df_french, df_english, df_spanish, df_italian])


In [11]:
df_concat = df_concat.drop_duplicates(["texto_x"])

In [12]:
df_concat

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
17,ARC,Gênesis,1,18,"e para governar o dia e a noite , e para fazer separação entre a luz e as trevas . E viu Deus que era bom .",NTLH,para governarem o dia e a noite e para separarem a luz da escuridão . E Deus viu que o que havia feito era bom .,27,26,0.433962,VELHO,PT,literal,dinâmica,arcaica,moderna,formal,informal
18,ARC,Gênesis,1,19,E foi a tarde e a manhã : o dia quarto .,NTLH,"A noite passou , e veio a manhã . Esse foi o quarto dia .",12,15,0.310345,VELHO,PT,literal,dinâmica,arcaica,moderna,formal,informal
19,ARC,Gênesis,1,20,E disse Deus : Produzam as águas abundantemente répteis de alma vivente ; e voem as aves sobre a face da expansão dos céus .,NTLH,"Depois Deus disse : — Que as águas fiquem cheias de todo tipo de seres vivos , e que na terra haja aves que voem no ar !",25,28,0.181818,VELHO,PT,literal,dinâmica,arcaica,moderna,formal,informal
20,ARC,Gênesis,1,21,"E Deus criou as grandes baleias , e todo réptil de alma vivente que as águas abundantemente produziram conforme as suas espécies , e toda ave de asas conforme a sua espécie . E viu Deus que era bom .",NTLH,"Assim Deus criou os grandes monstros do mar , e todas as espécies de seres vivos que em grande quantidade se movem nas águas , e criou também todas as espécies de aves . E Deus viu que o que havia feito era bom .",40,45,0.235955,VELHO,PT,literal,dinâmica,arcaica,moderna,formal,informal
21,ARC,Gênesis,1,22,"E Deus os abençoou , dizendo : Frutificai , e multiplicai-vos , e enchei as águas nos mares ; e as aves se multipliquem na terra .",NTLH,Ele abençoou os seres vivos do mar e disse : — Aumentem muito em número e encham as águas dos mares ! E que as aves se multipliquem na terra !,27,31,0.311475,VELHO,PT,literal,dinâmica,arcaica,moderna,formal,informal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7940,NR1994,Apocalisse,22,8,"Io , * Giovanni , sono quello che ha udito e visto queste cose . E , dopo averle viste e udite , mi prostrai ai piedi dell&apos; angelo che me le aveva mostrate , per adorarlo .",NR2006,"Io , Giovanni , sono quello che ha udito e visto queste cose . E , dopo averle viste e udite , mi prostrai ai piedi dell ’ angelo che me le aveva mostrate , per adorarlo .",38,38,0.790123,NOVO,IT,literal,literal,moderna,moderna,formal,formal
7948,NR1994,Apocalisse,22,16,"Io , Gesú , ho mandato il mio angelo per attestarvi queste cose in seno alle chiese . Io sono la radice e la discendenza di * Davide , la lucente stella del mattino » .",NR2006,"Io , Gesù , ho mandato il mio angelo per attestarvi queste cose in seno alle chiese . Io sono la radice e la discendenza di Davide , la lucente stella del mattino » .",36,35,0.830986,NOVO,IT,literal,literal,moderna,moderna,formal,formal
7949,NR1994,Apocalisse,22,17,"( C ) Lo Spirito e la * sposa dicono : « Vieni » . E chi ode , dica : « Vieni » . Chi ha sete , venga ; chi vuole , prenda in dono dell&apos; acqua della vita .",NR2006,"( C ) Lo Spirito e la sposa dicono : « Vieni ! » E chi ode , dica : « Vieni ! » Chi ha sete , venga ; chi vuole , prenda in dono dell ’ acqua della vita .",42,42,0.696629,NOVO,IT,literal,literal,moderna,moderna,formal,formal
7950,NR1994,Apocalisse,22,18,"Io lo dichiaro a chiunque ode le parole della profezia di questo libro : se qualcuno vi aggiunge qualcosa , Dio aggiungerà ai suoi mali i flagelli descritti in questo libro ;",NR2006,"Io lo dichiaro a chiunque ode le parole della profezia di questo libro : se qualcuno vi aggiunge qualcosa , Dio aggiungerà ai suoi mali i flagelli descritti in questo libro ;",32,32,0.952381,NOVO,IT,literal,literal,moderna,moderna,formal,formal


In [25]:
import itertools
def get_vocab(x):
    return x.split(" ")





print(len(set((df_concat[df_concat["language"] == "EN"]["estilo_x"]))))
print(len((df_concat[df_concat["language"] == "EN"]["texto_x"])))
print(df_concat[df_concat["language"] == "EN"]["sourceLen"].mean())
print("English : ", len(set(list(itertools.chain(*df_concat[df_concat["language"] == "EN"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))

print("--------")
print(len(set((df_concat[df_concat["language"] == "ES"]["estilo_x"]))))
print(len((df_concat[df_concat["language"] == "ES"]["texto_x"])))
print(df_concat[df_concat["language"] == "ES"]["sourceLen"].mean())
print("Spanish : ", len(set(list(itertools.chain(*df_concat[df_concat["language"] == "ES"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))

print("--------")
print(len(set((df_concat[df_concat["language"] == "PT"]["estilo_x"]))))
print(len((df_concat[df_concat["language"] == "PT"]["texto_x"])))
print(df_concat[df_concat["language"] == "PT"]["sourceLen"].mean())
print("Portuguese : ", len(set(list(itertools.chain(*df_concat[df_concat["language"] == "PT"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))

print("--------")
print(len(set((df_concat[df_concat["language"] == "IT"]["estilo_x"]))))
print(len((df_concat[df_concat["language"] == "IT"]["texto_x"])))
print(df_concat[df_concat["language"] == "IT"]["sourceLen"].mean())
print("Italian : ", len(set(list(itertools.chain(*df_concat[df_concat["language"] == "IT"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))


print("--------")
print(len(set((df_concat[df_concat["language"] == "FR"]["estilo_x"]))))
print(len((df_concat[df_concat["language"] == "FR"]["texto_x"])))
print(df_concat[df_concat["language"] == "FR"]["sourceLen"].mean())
print("French : ", len(set(list(itertools.chain(*df_concat[df_concat["language"] == "FR"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))






31
646838
33.43115277704773
English :  70130
--------
13
287684
29.016705830007925
Spanish :  85873
--------
5
111490
27.220010763297157
Portuguese :  53481
--------
4
84071
27.04393905151598
Italian :  47447
--------
3
73716
32.26888328178414
French :  35089
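The per-language figures printed above (number of styles, number of sentences, mean source length, vocabulary size) can also be collected into one table. The sketch below assumes the same column names as `df_concat`; the function name `corpus_stats` is hypothetical.

```python
import pandas as pd


def corpus_stats(df, by="language"):
    """One row per group: style count, sentence count, mean length, vocabulary size."""
    rows = {}
    for key, group in df.groupby(by):
        vocab = set()
        for text in group["texto_x"]:
            vocab.update(text.split(" "))  # whitespace tokens, as in get_vocab
        rows[key] = {
            "styles": group["estilo_x"].nunique(),
            "sentences": len(group),
            "mean_source_len": group["sourceLen"].mean(),
            "vocab_size": len(vocab),
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

Because the grouping column is a parameter, `corpus_stats(df_concat, by="forma_x")` would give the same summary per translation form.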


In [None]:
"""
        Translation form:
"""





Characterize the style types (averages)
# Remove the duplicates from X
# mean length and standard deviation of each style, in tokens
# number of words in each style
# Look at the PoS tags
    # mean count and standard deviation of verbs in the text
    # mean count and standard deviation of prepositions in the text
    # mean count and standard deviation of punctuation in the text
    # mean count and standard deviation of adverbs in the text
    .....






In [26]:
df_concat.groupby(["forma_x"])["sourceLen"].mean()

forma_x
dinâmica    31.264231
literal     31.181419
livre       31.596585
Name: sourceLen, dtype: float64

In [28]:
df_concat["forma_x"].value_counts()

forma_x
literal     548680
dinâmica    447893
livre       207226
Name: count, dtype: int64
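The two cells above report the mean length and the row count separately, while the notes also ask for the standard deviation. A single aggregation can return all three. This is a small sketch; `length_stats` is a hypothetical name.

```python
import pandas as pd


def length_stats(df, by="forma_x", col="sourceLen"):
    """Mean, sample standard deviation and row count of `col` per group."""
    return df.groupby(by)[col].agg(["mean", "std", "count"])
```

Called as `length_stats(df_concat)`, this would give one row per translation form; a group with a single row gets NaN as its standard deviation.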

In [29]:
# number of distinct words in each style
import itertools
def get_vocab(x):
    return x.split(" ")


print("literal : ", len(set(list(itertools.chain(*df_concat[df_concat["forma_x"] == "literal"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))
print("dinâmica : ", len(set(list(itertools.chain(*df_concat[df_concat["forma_x"] == "dinâmica"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))
print("livre : ", len(set(list(itertools.chain(*df_concat[df_concat["forma_x"] == "livre"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))


literal :  178095
dinâmica :  159326
livre :  92304


In [30]:
# Look at the PoS tags
    # mean count and standard deviation of verbs in the text
    # mean count and standard deviation of prepositions in the text
    # mean count and standard deviation of punctuation in the text
    # mean count and standard deviation of adverbs in the text
import spacy
from tqdm import tqdm
import statistics

nlp = spacy.load("it_core_news_lg")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

def get_avg_std(lista, busca):
    ocorrencias = []
    
    for sentenca in lista:
        ocorrencias.append(sentenca.count(busca))
    if len(ocorrencias) == 0:
        return 0.0, 0.0
    else:
        return statistics.mean(ocorrencias), statistics.stdev(ocorrencias)
    
print("--------Italian--------")

pos_it = df_concat[(df_concat["language"] == "IT") & (df_concat["forma_x"] == "literal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------literal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------dinâmica: ")

pos_it = df_concat[(df_concat["language"] == "IT") & (df_concat["forma_x"] == "dinâmica")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))


print("--------livre: ")

pos_it = df_concat[(df_concat["language"] == "IT") & (df_concat["forma_x"] == "livre")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))

print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))



--------Italian--------


100%|██████████| 77678/77678 [05:36<00:00, 230.74it/s]


--------literal: 
Counter({'NOUN': 380531, 'PUNCT': 348301, 'ADP': 264056, 'VERB': 247748, 'DET': 240530, 'PRON': 176261, 'ADV': 105423, 'PROPN': 94821, 'CCONJ': 93524, 'AUX': 79999, 'ADJ': 57091, 'SCONJ': 35903, 'NUM': 9582, 'SYM': 3136, 'INTJ': 961, 'X': 383, 'PART': 1})
avg. and std. NOUN (4.89882592239759, 2.9916437060344845)
avg. and std. PUNCT (4.483907927598548, 2.729057812131463)
avg. and std. ADP (3.399366616030279, 2.4593790360498216)
avg. and std. VERB (3.1894230026519734, 1.9614276427415727)
avg. and std. DET (3.0965009397770284, 2.374321793694466)
avg. and std. PRON (2.269123818841886, 1.998753961806364)
avg. and std. ADV (1.3571796390226318, 1.4125879443171692)
avg. and std. PROPN (1.2206931177424754, 1.6600268203108355)
avg. and std. CCONJ (1.2039959834187286, 1.0968725379487736)
avg. and std. ADJ (0.7349700043770437, 0.9914138092624931)
--------dinâmica: 


100%|██████████| 6393/6393 [00:23<00:00, 266.88it/s]


Counter({'PUNCT': 28488, 'NOUN': 26334, 'VERB': 24699, 'ADP': 21662, 'DET': 17669, 'PRON': 16261, 'ADV': 11864, 'AUX': 8893, 'CCONJ': 7035, 'PROPN': 7011, 'ADJ': 6855, 'SCONJ': 4733, 'NUM': 546, 'INTJ': 164, 'X': 17, 'SYM': 10, 'PART': 1})
avg. and std. NOUN (4.119192867198499, 2.5105900556011296)
avg. and std. PUNCT (4.456123885499766, 2.494673385193473)
avg. and std. ADP (3.388393555451275, 2.271077702368692)
avg. and std. VERB (3.8634443923040824, 2.0100571195036507)
avg. and std. DET (2.763804160800876, 1.994177392693757)
avg. and std. PRON (2.54356327232911, 2.0527355000752388)
avg. and std. ADV (1.8557797591115281, 1.6376662388699421)
avg. and std. PROPN (1.0966682308775222, 1.346335302734845)
avg. and std. CCONJ (1.1004223369310182, 1.0103910118477082)
avg. and std. ADJ (1.0722665415297983, 1.2072873259471244)
--------livre: 


0it [00:00, ?it/s]

Counter()
avg. and std. NOUN (0.0, 0.0)
avg. and std. PUNCT (0.0, 0.0)
avg. and std. ADP (0.0, 0.0)
avg. and std. VERB (0.0, 0.0)
avg. and std. DET (0.0, 0.0)
avg. and std. PRON (0.0, 0.0)
avg. and std. ADV (0.0, 0.0)
avg. and std. PROPN (0.0, 0.0)
avg. and std. CCONJ (0.0, 0.0)
avg. and std. ADJ (0.0, 0.0)
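The per-tag statistics above are printed one tag at a time, three times per language. The sketch below gathers them in one dictionary and tags texts in batches with `nlp.pipe`. It is an illustration under assumptions: the function names and the batch size are hypothetical, `nlp` is any already loaded spaCy pipeline, and a group with fewer than two sentences gets a standard deviation of 0.0 because the sample standard deviation is undefined there.

```python
import statistics
from collections import Counter

POS_TAGS = ("NOUN", "PUNCT", "ADP", "VERB", "DET",
            "PRON", "ADV", "PROPN", "CCONJ", "ADJ")


def tag_texts(nlp, texts, batch_size=256):
    """Return one list of coarse PoS tags per text, processing texts in batches."""
    return [[tok.pos_ for tok in doc] for doc in nlp.pipe(texts, batch_size=batch_size)]


def pos_summary(tagged, tags=POS_TAGS):
    """Map each tag to (mean, standard deviation) of its per-sentence count."""
    counts = [Counter(sentence) for sentence in tagged]
    summary = {}
    for tag in tags:
        per_sentence = [c[tag] for c in counts]
        if len(per_sentence) < 2:
            mean = float(per_sentence[0]) if per_sentence else 0.0
            summary[tag] = (mean, 0.0)
        else:
            summary[tag] = (statistics.mean(per_sentence),
                            statistics.stdev(per_sentence))
    return summary
```

An empty selection, such as the Italian "livre" group above, yields `(0.0, 0.0)` for every tag, which matches the printed output.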





In [31]:
import spacy
from tqdm import tqdm
import statistics

nlp = spacy.load("es_dep_news_trf")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

def get_avg_std(lista, busca):
    ocorrencias = []
    
    for sentenca in lista:
        ocorrencias.append(sentenca.count(busca))
    if len(ocorrencias) == 0:
        return 0.0, 0.0
    else:
        return statistics.mean(ocorrencias), statistics.stdev(ocorrencias)
    
print("--------Spanish--------")

pos_it = df_concat[(df_concat["language"] == "ES") & (df_concat["forma_x"] == "literal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------literal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------dinâmica: ")

pos_it = df_concat[(df_concat["language"] == "ES") & (df_concat["forma_x"] == "dinâmica")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))


print("--------livre: ")

pos_it = df_concat[(df_concat["language"] == "ES") & (df_concat["forma_x"] == "livre")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))

print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

--------Spanish--------


100%|██████████| 120863/120863 [1:52:16<00:00, 17.94it/s] 


--------literal: 
Counter({'NOUN': 530033, 'PUNCT': 528065, 'ADP': 457106, 'DET': 410231, 'VERB': 394898, 'PRON': 288498, 'PROPN': 175220, 'CCONJ': 172267, 'ADV': 116939, 'AUX': 109846, 'SCONJ': 89736, 'ADJ': 83309, 'SYM': 51063, 'NUM': 18031, 'SPACE': 11603, 'INTJ': 8806, 'PART': 570})
avg. and std. NOUN (4.385403307877514, 2.649542762863567)
avg. and std. PUNCT (4.36912040905819, 2.545557875413242)
avg. and std. ADP (3.7820176563547157, 2.4768208390132127)
avg. and std. VERB (3.267319196114609, 1.9211857905537564)
avg. and std. DET (3.394181842251144, 2.227337682255591)
avg. and std. PRON (2.3869836095413817, 2.018333674032288)
avg. and std. ADV (0.9675334883297618, 1.0954232164037057)
avg. and std. PROPN (1.4497406154075276, 1.7626501611897571)
avg. and std. CCONJ (1.42530799334784, 1.2173333890445281)
avg. and std. ADJ (0.6892845618592952, 0.9587536814538508)
--------dinâmica: 


100%|██████████| 59780/59780 [35:27<00:00, 28.10it/s]  


Counter({'PUNCT': 303230, 'NOUN': 272099, 'ADP': 223026, 'DET': 203537, 'VERB': 200040, 'PRON': 148159, 'PROPN': 109733, 'CCONJ': 80132, 'ADV': 57750, 'AUX': 53317, 'SCONJ': 45498, 'ADJ': 44881, 'SYM': 24726, 'SPACE': 17730, 'NUM': 9429, 'INTJ': 5013, 'PART': 389})
avg. and std. NOUN (4.551672800267648, 2.722011659637018)
avg. and std. PUNCT (5.07243225158916, 4.972186089600906)
avg. and std. ADP (3.730779524924724, 2.4426754398419113)
avg. and std. VERB (3.346269655403145, 1.9347587512754687)
avg. and std. DET (3.404767480762797, 2.2039218766404223)
avg. and std. PRON (2.478404148544664, 2.0591185531373997)
avg. and std. ADV (0.9660421545667447, 1.1012347330603762)
avg. and std. PROPN (1.835613917698227, 9.346936478447383)
avg. and std. CCONJ (1.3404483104717297, 1.1828949525696653)
avg. and std. ADJ (0.7507694881231181, 0.9942011781299583)
--------livre: 


100%|██████████| 107041/107041 [1:02:13<00:00, 28.67it/s]


Counter({'NOUN': 453965, 'ADP': 388171, 'PUNCT': 373233, 'VERB': 371676, 'DET': 365231, 'PRON': 265703, 'PROPN': 157328, 'CCONJ': 118076, 'ADV': 108817, 'ADJ': 99127, 'AUX': 97138, 'SCONJ': 86540, 'SYM': 46984, 'NUM': 13403, 'INTJ': 5310, 'SPACE': 4950, 'PART': 301})
avg. and std. NOUN (4.241038480582207, 2.6166470885129014)
avg. and std. PUNCT (3.486822806214441, 1.9980457183277125)
avg. and std. ADP (3.6263768088863144, 2.409464335285695)
avg. and std. VERB (3.4722769779804, 1.992502721649018)
avg. and std. DET (3.4120664044618416, 2.2079118143384573)
avg. and std. PRON (2.4822544632430565, 2.081330597206193)
avg. and std. ADV (1.0165917732457657, 1.1419297197195335)
avg. and std. PROPN (1.4697919488794013, 1.8391619170292348)
avg. and std. CCONJ (1.1030913388327837, 1.0176753817882558)
avg. and std. ADJ (0.9260657131379565, 1.1115933603205126)
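
The cells above call `list.count` once per tag for every sentence and repeat the ten `print` lines for each subset. A minimal sketch of an equivalent single-pass summary follows, assuming the input has the shape returned by `get_pos` (one list of tag strings per sentence). The helper name `summarize_pos` and the toy input are hypothetical and not part of the notebook. The sketch returns 0.0 as the standard deviation for a single-sentence group, since `statistics.stdev` requires at least two data points.

```python
import statistics
from collections import Counter

TAGS = ["NOUN", "PUNCT", "ADP", "VERB", "DET", "PRON", "ADV", "PROPN", "CCONJ", "ADJ"]

def summarize_pos(tag_lists, tags=TAGS):
    """Return (total tag counts, {tag: (mean, std)}) over per-sentence tag lists."""
    # Count every tag once per sentence instead of calling list.count per tag.
    per_sentence = [Counter(sentence) for sentence in tag_lists]
    totals = Counter()
    for counts in per_sentence:
        totals.update(counts)
    stats = {}
    for tag in tags:
        values = [counts[tag] for counts in per_sentence]
        if not values:
            stats[tag] = (0.0, 0.0)
        else:
            # stdev is undefined for a single observation, so fall back to 0.0.
            std = statistics.stdev(values) if len(values) > 1 else 0.0
            stats[tag] = (statistics.mean(values), std)
    return totals, stats

# Toy input standing in for the output of get_pos (hypothetical data).
toy = [["DET", "NOUN", "VERB", "PUNCT"], ["NOUN", "NOUN", "PUNCT"]]
totals, stats = summarize_pos(toy)
print(totals)
for tag in TAGS:
    print("avg. and std.", tag, stats[tag])
```

With a helper of this kind, each language cell would reduce to one call per subset, which keeps the tag list in a single place.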


In [32]:
import spacy
from tqdm import tqdm
import statistics


nlp = spacy.load("fr_dep_news_trf")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

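def get_avg_std(lista, busca):
    # Count how often the tag `busca` occurs in each sentence's tag list.
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    # statistics.stdev needs at least two data points.
    desvio = statistics.stdev(ocorrencias) if len(ocorrencias) > 1 else 0.0
    return statistics.mean(ocorrencias), desvio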
    
print("--------French--------")

pos_it = df_concat[(df_concat["language"] == "FR") & (df_concat["forma_x"] == "literal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------literal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------dinâmica: ")

pos_it = df_concat[(df_concat["language"] == "FR") & (df_concat["forma_x"] == "dinâmica")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))


print("--------livre: ")

pos_it = df_concat[(df_concat["language"] == "FR") & (df_concat["forma_x"] == "livre")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))

print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

--------French--------


100%|██████████| 26866/26866 [15:49<00:00, 28.31it/s]


--------literal: 
Counter({'PUNCT': 151388, 'NOUN': 124110, 'PRON': 104069, 'ADP': 102625, 'DET': 98043, 'VERB': 93561, 'CCONJ': 35577, 'PROPN': 34584, 'ADV': 30380, 'AUX': 28306, 'ADJ': 22030, 'SCONJ': 12015, 'NUM': 3328, 'INTJ': 280, 'X': 9, 'SYM': 8})
avg. and std. NOUN (4.619593538301198, 2.6867906637431815)
avg. and std. PUNCT (5.634928906424477, 2.8475985534248)
avg. and std. ADP (3.8198838680860567, 2.5814079785980732)
avg. and std. VERB (3.48250576937393, 1.9903571532027255)
avg. and std. DET (3.649333730365518, 2.2346223525228366)
avg. and std. PRON (3.8736321000521103, 2.860253958519746)
avg. and std. ADV (1.1307972902553414, 1.4624066135553018)
avg. and std. PROPN (1.287277599940445, 1.6627960804840196)
avg. and std. CCONJ (1.3242388148589295, 1.1199518030190276)
avg. and std. ADJ (0.8199955333879253, 1.0377124550571588)
--------dinâmica: 


100%|██████████| 46850/46850 [28:01<00:00, 27.86it/s]


Counter({'NOUN': 263204, 'PUNCT': 228673, 'PRON': 226155, 'DET': 204981, 'ADP': 203272, 'VERB': 175438, 'CCONJ': 71405, 'ADV': 69628, 'PROPN': 55231, 'AUX': 52291, 'ADJ': 43381, 'NUM': 31909, 'SCONJ': 29957, 'INTJ': 866, 'SYM': 11, 'X': 9})
avg. and std. NOUN (5.618014941302028, 3.3079734147435387)
avg. and std. PUNCT (4.880960512273212, 3.080508510842169)
avg. and std. ADP (4.338783351120598, 2.9635984846046086)
avg. and std. VERB (3.7446744930629667, 2.0907795120184063)
avg. and std. DET (4.3752614727854855, 2.7539013251058546)
avg. and std. PRON (4.827214514407684, 3.5278515180812002)
avg. and std. ADV (1.4861899679829242, 1.8486083621092442)
avg. and std. PROPN (1.17889007470651, 1.6538074654007033)
avg. and std. CCONJ (1.524119530416222, 1.2813487323352162)
avg. and std. ADJ (0.9259551760939168, 1.1126839931217567)
--------livre: 


0it [00:00, ?it/s]

Counter()
avg. and std. NOUN (0.0, 0.0)
avg. and std. PUNCT (0.0, 0.0)
avg. and std. ADP (0.0, 0.0)
avg. and std. VERB (0.0, 0.0)
avg. and std. DET (0.0, 0.0)
avg. and std. PRON (0.0, 0.0)
avg. and std. ADV (0.0, 0.0)
avg. and std. PROPN (0.0, 0.0)
avg. and std. CCONJ (0.0, 0.0)
avg. and std. ADJ (0.0, 0.0)





In [33]:
import spacy
from tqdm import tqdm
import statistics


nlp = spacy.load("pt_core_news_lg")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

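def get_avg_std(lista, busca):
    # Count how often the tag `busca` occurs in each sentence's tag list.
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    # statistics.stdev needs at least two data points.
    desvio = statistics.stdev(ocorrencias) if len(ocorrencias) > 1 else 0.0
    return statistics.mean(ocorrencias), desvio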
    
print("--------Portuguese--------")

pos_it = df_concat[(df_concat["language"] == "PT") & (df_concat["forma_x"] == "literal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------literal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------dinâmica: ")

pos_it = df_concat[(df_concat["language"] == "PT") & (df_concat["forma_x"] == "dinâmica")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))


print("--------livre: ")

pos_it = df_concat[(df_concat["language"] == "PT") & (df_concat["forma_x"] == "livre")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))

print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

--------Portuguese--------


100%|██████████| 33668/33668 [02:01<00:00, 277.23it/s]


--------literal: 
Counter({'PUNCT': 144795, 'NOUN': 143051, 'ADP': 115177, 'VERB': 108710, 'DET': 105653, 'PRON': 68643, 'CCONJ': 55460, 'PROPN': 52169, 'ADV': 42590, 'SCONJ': 28242, 'AUX': 21856, 'ADJ': 20546, 'NUM': 5865, 'INTJ': 932, 'SPACE': 92, 'X': 22, 'SYM': 15})
avg. and std. NOUN (4.248871331828442, 2.571632197420271)
avg. and std. PUNCT (4.300671260544137, 2.271730604166547)
avg. and std. ADP (3.4209635261969824, 2.407369577194469)
avg. and std. VERB (3.228882024474278, 1.9417990402255079)
avg. and std. DET (3.138083640251871, 2.329695769002729)
avg. and std. PRON (2.0388202447427823, 1.8690477600942166)
avg. and std. ADV (1.2649994059641203, 1.3052495767700836)
avg. and std. PROPN (1.5495128905785909, 1.7256722268583673)
avg. and std. CCONJ (1.6472614945942734, 1.410584608742226)
avg. and std. ADJ (0.6102530592847808, 0.8849144885585934)
--------dinâmica: 


100%|██████████| 77822/77822 [04:37<00:00, 280.15it/s]


Counter({'NOUN': 338767, 'PUNCT': 316121, 'VERB': 278555, 'ADP': 252244, 'DET': 235437, 'PRON': 178859, 'PROPN': 121201, 'ADV': 93650, 'CCONJ': 90864, 'SCONJ': 64336, 'AUX': 62205, 'ADJ': 57902, 'NUM': 12448, 'SPACE': 4505, 'INTJ': 1914, 'X': 65, 'SYM': 16, 'PART': 3})
avg. and std. NOUN (4.353100665621547, 2.6235820015730367)
avg. and std. PUNCT (4.0621032612885815, 2.3606506141012202)
avg. and std. ADP (3.2412942355632084, 2.2892026862032937)
avg. and std. VERB (3.579386291794094, 2.039417817790462)
avg. and std. DET (3.02532702834674, 2.145846656727196)
avg. and std. PRON (2.2983089614761893, 2.045202290918983)
avg. and std. ADV (1.2033872169823443, 1.2626598329565577)
avg. and std. PROPN (1.5574130708540002, 1.764552390687047)
avg. and std. CCONJ (1.16758757163784, 1.0310475782447444)
avg. and std. ADJ (0.7440312508031148, 0.9583070515672081)
--------livre: 


0it [00:00, ?it/s]

Counter()
avg. and std. NOUN (0.0, 0.0)
avg. and std. PUNCT (0.0, 0.0)
avg. and std. ADP (0.0, 0.0)
avg. and std. VERB (0.0, 0.0)
avg. and std. DET (0.0, 0.0)
avg. and std. PRON (0.0, 0.0)
avg. and std. ADV (0.0, 0.0)
avg. and std. PROPN (0.0, 0.0)
avg. and std. CCONJ (0.0, 0.0)
avg. and std. ADJ (0.0, 0.0)





In [34]:
import spacy
from tqdm import tqdm
import statistics


nlp = spacy.load("en_core_web_trf")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

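def get_avg_std(lista, busca):
    # Count how often the tag `busca` occurs in each sentence's tag list.
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    # statistics.stdev needs at least two data points.
    desvio = statistics.stdev(ocorrencias) if len(ocorrencias) > 1 else 0.0
    return statistics.mean(ocorrencias), desvio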
    
print("--------English--------")

pos_it = df_concat[(df_concat["language"] == "EN") & (df_concat["forma_x"] == "literal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------literal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------dinâmica: ")

pos_it = df_concat[(df_concat["language"] == "EN") & (df_concat["forma_x"] == "dinâmica")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))


print("--------livre: ")

pos_it = df_concat[(df_concat["language"] == "EN") & (df_concat["forma_x"] == "livre")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))

print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

--------English--------


100%|██████████| 289605/289605 [2:48:08<00:00, 28.71it/s]   


--------literal: 
Counter({'PUNCT': 1557865, 'NOUN': 1271660, 'PRON': 1155337, 'ADP': 1006125, 'VERB': 973033, 'DET': 788733, 'PROPN': 568482, 'AUX': 530419, 'CCONJ': 521806, 'ADV': 238832, 'ADJ': 225675, 'SCONJ': 195171, 'PART': 161393, 'SPACE': 84991, 'NUM': 59627, 'X': 29025, 'INTJ': 19470, 'SYM': 432})
avg. and std. NOUN (4.391015348491911, 2.7200032932480234)
avg. and std. PUNCT (5.379275219695793, 5.547938150242381)
avg. and std. ADP (3.4741285544103175, 2.3340084330127233)
avg. and std. VERB (3.359862571433504, 1.9642562252341462)
avg. and std. DET (2.723478531102709, 2.1756553228037867)
avg. and std. PRON (3.9893544655651665, 2.7290300357022774)
avg. and std. ADV (0.824681894304311, 1.0262967822312097)
avg. and std. PROPN (1.9629564406691873, 7.936949967020898)
avg. and std. CCONJ (1.8017851901728217, 1.4594722481643752)
avg. and std. ADJ (0.7792510488423887, 1.0354810960965184)
--------dinâmica: 


100%|██████████| 257048/257048 [2:27:29<00:00, 29.05it/s]  


Counter({'PUNCT': 1040207, 'NOUN': 1038411, 'PRON': 1008151, 'VERB': 919907, 'ADP': 784274, 'DET': 633718, 'PROPN': 504039, 'AUX': 491259, 'CCONJ': 294803, 'ADV': 258100, 'ADJ': 250316, 'PART': 176923, 'SCONJ': 155458, 'NUM': 48595, 'X': 38775, 'SPACE': 36572, 'INTJ': 9448, 'SYM': 4159})
avg. and std. NOUN (4.039755220814789, 2.6257262866199507)
avg. and std. PUNCT (4.046742242693972, 2.5225661035990576)
avg. and std. ADP (3.051079953938564, 2.175733545813376)
avg. and std. VERB (3.5787362671563288, 2.0859563302783393)
avg. and std. DET (2.4653683358750116, 2.0507890095499572)
avg. and std. PRON (3.922034016992935, 2.768199295409124)
avg. and std. ADV (1.004092620833463, 1.2657573268734437)
avg. and std. PROPN (1.960875011670972, 2.9191765037246205)
avg. and std. CCONJ (1.146879182098285, 1.1072311951530651)
avg. and std. ADJ (0.9738103389250257, 1.1669636786330935)
--------livre: 


100%|██████████| 100185/100185 [57:20<00:00, 29.12it/s]


Counter({'NOUN': 419175, 'PUNCT': 417361, 'PRON': 383173, 'VERB': 365767, 'ADP': 311897, 'DET': 255077, 'AUX': 187087, 'PROPN': 167598, 'CCONJ': 154548, 'ADJ': 105997, 'ADV': 99534, 'SPACE': 98646, 'PART': 64797, 'SCONJ': 64225, 'NUM': 18294, 'X': 16118, 'INTJ': 5770, 'SYM': 1242})
avg. and std. NOUN (4.184009582272795, 2.6949627524164583)
avg. and std. PUNCT (4.1659030793032885, 2.5246875059507907)
avg. and std. ADP (3.113210560463143, 2.1911106501067468)
avg. and std. VERB (3.6509158057593454, 2.16711408080425)
avg. and std. DET (2.5460597893896293, 2.065513735745255)
avg. and std. PRON (3.824654389379648, 2.711663198251019)
avg. and std. ADV (0.9935020212606678, 1.219897333005017)
avg. and std. PROPN (1.6728851624494685, 1.9820064989490038)
avg. and std. CCONJ (1.5426261416379698, 1.3662750122097715)
avg. and std. ADJ (1.0580126765483855, 1.3107734165948166)


In [40]:
"""
        Language (linguagem_x: arcaica vs. moderna):
"""

Characterize the style types (averages)
# Remove the duplicates from X
# mean and standard deviation of the length, in tokens, for each style
# number of words in each one
# Look at the PoS tags
    # mean and std. of the number of verbs per text
    # mean and std. of the number of prepositions per text
    # mean and std. of the number of punctuation marks per text
    # mean and std. of the number of adverbs per text
    .....
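
The first two items of the checklist above (deduplication, then mean and standard deviation of length per style) can be sketched with the standard library alone. This is only an illustration under stated assumptions: each record is a `(style, text)` pair, whitespace tokens stand in for real tokenizer output, and the function name `length_stats_by_style` and the sample records are hypothetical.

```python
import statistics
from collections import defaultdict

def length_stats_by_style(records):
    """Map each style to (mean, std) of text length in whitespace tokens."""
    # Drop exact duplicate (style, text) pairs while keeping first-seen order.
    unique = list(dict.fromkeys(records))
    lengths = defaultdict(list)
    for style, text in unique:
        lengths[style].append(len(text.split()))
    result = {}
    for style, values in lengths.items():
        # stdev needs at least two values; report 0.0 for singleton groups.
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        result[style] = (statistics.mean(values), std)
    return result

# Hypothetical records; in the notebook these would come from df_concat.
records = [
    ("moderna", "the cat sat on the mat"),
    ("moderna", "the cat sat on the mat"),  # exact duplicate, removed
    ("moderna", "a dog barked"),
    ("arcaica", "thou art here"),
]
print(length_stats_by_style(records))
```

Deduplicating before aggregating matters because a repeated text would otherwise be counted several times in both the mean and the deviation.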





('A', 'AP', 'B', 'BN', 'B_PC', 'CC', 'CS', 'DD', 'DE', 'DI', 'DQ', 'DR', 'E', 'E_RD', 'FB', 'FC', 'FF', 'FS', 'I', 'N', 'NO', 'PART', 'PC', 'PC_PC', 'PD', 'PE', 'PI', 'PP', 'PQ', 'PR', 'RD', 'RI', 'S', 'SP', 'SW', 'SYM', 'T', 'V', 'VA', 'VA_PC', 'VM', 'VM_PC', 'VM_PC_PC', 'V_B', 'V_PC', 'V_PC_PC', 'X', '_SP')


In [35]:
df_concat.groupby(["linguagem_x"])["sourceLen"].mean()

linguagem_x
arcaica    31.440088
moderna    31.240523
Name: sourceLen, dtype: float64

In [36]:
df_concat["linguagem_x"].value_counts()

linguagem_x
moderna    943358
arcaica    260441
Name: count, dtype: int64

In [37]:
# Vocabulary size (number of distinct space-separated tokens) for each style
import itertools
def get_vocab(x):
    return x.split(" ")


def vocab_size(textos):
    # Flatten the per-text token lists and count the distinct tokens.
    return len(set(itertools.chain.from_iterable(textos.apply(get_vocab))))


print("moderna : ", vocab_size(df_concat[df_concat["linguagem_x"] == "moderna"]["texto_x"]))
print("arcaica : ", vocab_size(df_concat[df_concat["linguagem_x"] == "arcaica"]["texto_x"]))

moderna :  240696
arcaica :  119591

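
Splitting on a single space treats `casa`, `casa,` and `Casa` as three different words, and consecutive spaces produce empty strings, so the two counts above are an upper bound on the number of distinct word forms. The sketch below compares the raw count with a normalized one. Lowercasing and stripping surrounding ASCII punctuation are assumptions made for illustration, not something the notebook does, and `distinct_tokens` and the sample sentences are hypothetical.

```python
import string

def distinct_tokens(texts, normalize=False):
    """Count distinct space-separated tokens, optionally normalized."""
    vocab = set()
    for text in texts:
        for token in text.split(" "):
            if normalize:
                # string.punctuation covers ASCII punctuation only.
                token = token.strip(string.punctuation).lower()
                if not token:
                    continue  # skip empty strings and bare punctuation
            vocab.add(token)
    return len(vocab)

sample = ["La casa, la Casa.", "la  casa"]
print("raw        : ", distinct_tokens(sample))
print("normalized : ", distinct_tokens(sample, normalize=True))
```

Because the two groups also differ in size (943358 versus 260441 rows), raw vocabulary sizes are not directly comparable between them.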

In [38]:
import spacy
from tqdm import tqdm
import statistics

nlp = spacy.load("it_core_news_lg")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

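def get_avg_std(lista, busca):
    # Count how often the tag `busca` occurs in each sentence's tag list.
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    # statistics.stdev needs at least two data points.
    desvio = statistics.stdev(ocorrencias) if len(ocorrencias) > 1 else 0.0
    return statistics.mean(ocorrencias), desvio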
    
print("--------Italian--------")

pos_it = df_concat[(df_concat["language"] == "IT") & (df_concat["linguagem_x"] == "moderna")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------moderna: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------arcaica: ")

pos_it = df_concat[(df_concat["language"] == "IT") & (df_concat["linguagem_x"] == "arcaica")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))



--------Italian--------


100%|██████████| 58601/58601 [03:25<00:00, 285.34it/s]


--------moderna: 
Counter({'NOUN': 281829, 'PUNCT': 262303, 'ADP': 199046, 'VERB': 189871, 'DET': 179450, 'PRON': 132218, 'ADV': 79792, 'PROPN': 70257, 'CCONJ': 66192, 'AUX': 61920, 'ADJ': 43865, 'SCONJ': 28094, 'NUM': 7094, 'SYM': 1181, 'INTJ': 798, 'X': 286, 'PART': 1})
avg. and std. NOUN (4.80928653094657, 2.9978905731470826)
avg. and std. PUNCT (4.4760840258698655, 2.7197756751073063)
avg. and std. ADP (3.3966314568010785, 2.4834197461682215)
avg. and std. VERB (3.2400641627275983, 2.0046140350027915)
avg. and std. DET (3.0622344328595075, 2.3633219484018624)
avg. and std. PRON (2.256241361068924, 2.0057352907830697)
avg. and std. ADV (1.3616149895052985, 1.4395914330647663)
avg. and std. PROPN (1.1989044555553658, 1.6298493914069907)
avg. and std. CCONJ (1.1295370386170884, 1.0696215108383353)
avg. and std. ADJ (0.7485367143905394, 1.0165393094378623)
--------arcaica: 


100%|██████████| 25470/25470 [01:30<00:00, 282.99it/s]


Counter({'NOUN': 125036, 'PUNCT': 114486, 'ADP': 86672, 'VERB': 82576, 'DET': 78749, 'PRON': 60304, 'ADV': 37495, 'CCONJ': 34367, 'PROPN': 31575, 'AUX': 26972, 'ADJ': 20081, 'SCONJ': 12542, 'NUM': 3034, 'SYM': 1965, 'INTJ': 327, 'X': 114, 'PART': 1})
avg. and std. NOUN (4.909148017275226, 2.8867928420026794)
avg. and std. PUNCT (4.494935217903416, 2.6938405175865476)
avg. and std. ADP (3.40290537887711, 2.3561807265986525)
avg. and std. VERB (3.242088731841382, 1.8991958687705188)
avg. and std. DET (3.091833529642717, 2.3163742344836713)
avg. and std. PRON (2.3676482135846095, 1.9985992005155644)
avg. and std. ADV (1.472124067530428, 1.4281612622497766)
avg. and std. PROPN (1.2396937573616018, 1.658264089339224)
avg. and std. CCONJ (1.34931291715744, 1.1234101889123724)
avg. and std. ADJ (0.7884177463682764, 1.0055947626645068)


In [39]:
import spacy
from tqdm import tqdm
import statistics

nlp = spacy.load("es_dep_news_trf")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

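def get_avg_std(lista, busca):
    # Count how often the tag `busca` occurs in each sentence's tag list.
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    # statistics.stdev needs at least two data points.
    desvio = statistics.stdev(ocorrencias) if len(ocorrencias) > 1 else 0.0
    return statistics.mean(ocorrencias), desvio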
    
print("--------Spanish--------")

pos_it = df_concat[(df_concat["language"] == "ES") & (df_concat["linguagem_x"] == "moderna")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------Moderna: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------Arcaica: ")

pos_it = df_concat[(df_concat["language"] == "ES") & (df_concat["linguagem_x"] == "arcaica")]["texto_x"].progress_apply(get_pos).tolist()

print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))



--------Spanish--------


100%|██████████| 265406/265406 [2:34:01<00:00, 28.72it/s]   


--------Moderna: 
Counter({'NOUN': 1145690, 'PUNCT': 1057410, 'ADP': 983746, 'DET': 901282, 'VERB': 894320, 'PRON': 649508, 'PROPN': 398459, 'CCONJ': 337993, 'ADV': 262313, 'AUX': 239502, 'ADJ': 211865, 'SCONJ': 205025, 'SYM': 107592, 'NUM': 38333, 'SPACE': 25226, 'INTJ': 16216, 'PART': 1115})
avg. and std. NOUN (4.316744911569445, 2.63182073263636)
avg. and std. PUNCT (3.9841224388295666, 3.0148374890947425)
avg. and std. ADP (3.7065703111459425, 2.4421308831575006)
avg. and std. VERB (3.369629925472672, 1.9565046435205515)
avg. and std. DET (3.3958614349336487, 2.2106858201830097)
avg. and std. PRON (2.447224252654424, 2.0550681314252266)
avg. and std. ADV (0.9883461564546393, 1.1170945384525537)
avg. and std. PROPN (1.501318734316481, 4.700940495069977)
avg. and std. CCONJ (1.2734941938011952, 1.1408367766297993)
avg. and std. ADJ (0.798267559889377, 1.036428917074861)
--------Arcaica: 


100%|██████████| 22278/22278 [12:57<00:00, 28.65it/s] 


Counter({'PUNCT': 147118, 'NOUN': 110407, 'ADP': 84557, 'DET': 77717, 'VERB': 72294, 'PRON': 52852, 'PROPN': 43822, 'CCONJ': 32482, 'ADV': 21193, 'AUX': 20799, 'SCONJ': 16749, 'ADJ': 15452, 'SYM': 15181, 'SPACE': 9057, 'INTJ': 2913, 'NUM': 2530, 'PART': 145})
avg. and std. NOUN (4.955875751862824, 2.853027976413892)
avg. and std. PUNCT (6.603734626088518, 3.2711119958533237)
avg. and std. ADP (3.7955381991202084, 2.4880887541927676)
avg. and std. VERB (3.245084837058982, 1.906319676935849)
avg. and std. DET (3.4885088428045608, 2.267540679010472)
avg. and std. PRON (2.3723853128647097, 2.000142422595091)
avg. and std. ADV (0.9512972439177664, 1.081210185307889)
avg. and std. PROPN (1.967052697728701, 2.090060706915245)
avg. and std. CCONJ (1.4580303438369693, 1.2278336423248637)
avg. and std. ADJ (0.6935990663434779, 0.9576128471771913)


In [40]:
import spacy
from tqdm import tqdm
import statistics


nlp = spacy.load("fr_dep_news_trf")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

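def get_avg_std(lista, busca):
    # Count how often the tag `busca` occurs in each sentence's tag list.
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    # statistics.stdev needs at least two data points.
    desvio = statistics.stdev(ocorrencias) if len(ocorrencias) > 1 else 0.0
    return statistics.mean(ocorrencias), desvio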
    
print("--------French--------")

pos_it = df_concat[(df_concat["language"] == "FR") & (df_concat["linguagem_x"] == "moderna")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------Moderna: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------Arcaica: ")

pos_it = df_concat[(df_concat["language"] == "FR") & (df_concat["linguagem_x"] == "arcaica")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))


--------French--------


100%|██████████| 27342/27342 [17:57<00:00, 25.38it/s]  


--------Moderna: 
Counter({'NOUN': 167025, 'PUNCT': 146655, 'PRON': 133537, 'ADP': 117547, 'DET': 117062, 'VERB': 105069, 'CCONJ': 44758, 'ADV': 42834, 'PROPN': 30602, 'AUX': 30376, 'NUM': 29731, 'ADJ': 27582, 'SCONJ': 17147, 'INTJ': 585, 'SYM': 11, 'X': 2})
avg. and std. NOUN (6.108733816107088, 3.5765070179214753)
avg. and std. PUNCT (5.363726135615536, 3.5744714145955085)
avg. and std. ADP (4.299136859044693, 2.94175167478981)
avg. and std. VERB (3.8427693658108404, 2.1500109980522795)
avg. and std. DET (4.2813985809377515, 2.744678871718367)
avg. and std. PRON (4.883951430034379, 3.6182990472862615)
avg. and std. ADV (1.56660083388194, 1.8917933979154569)
avg. and std. PROPN (1.1192304878940824, 1.5983114948579717)
avg. and std. CCONJ (1.6369687660010241, 1.3558606917928682)
avg. and std. ADJ (1.0087777046302393, 1.159272112254435)
--------Arcaica: 


100%|██████████| 46374/46374 [1:35:39<00:00,  8.08it/s]  


Counter({'PUNCT': 233406, 'NOUN': 220289, 'PRON': 196687, 'ADP': 188350, 'DET': 185962, 'VERB': 163930, 'CCONJ': 62224, 'PROPN': 59213, 'ADV': 57174, 'AUX': 50221, 'ADJ': 37829, 'SCONJ': 24825, 'NUM': 5506, 'INTJ': 561, 'X': 16, 'SYM': 8})
avg. and std. NOUN (4.750269547591323, 2.716235017147298)
avg. and std. PUNCT (5.033122008021737, 2.630638327060437)
avg. and std. ADP (4.061543106050804, 2.7765897556526267)
avg. and std. VERB (3.534954931642731, 1.9939898376637366)
avg. and std. DET (4.010048734204511, 2.5059707764624015)
avg. and std. PRON (4.241320567559408, 3.1266041543873473)
avg. and std. ADV (1.2328891189028335, 1.6082668547958952)
avg. and std. PROPN (1.2768577219993962, 1.6893395346907143)
avg. and std. CCONJ (1.341786345797214, 1.1330778078283155)
avg. and std. ADJ (0.8157372665717859, 1.0356514945349355)


In [41]:
import spacy
from tqdm import tqdm
import statistics


nlp = spacy.load("pt_core_news_lg")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

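def get_avg_std(lista, busca):
    # Count how many times the tag `busca` occurs in each text
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    if len(ocorrencias) == 1:
        # statistics.stdev needs at least two values
        return float(ocorrencias[0]), 0.0
    return statistics.mean(ocorrencias), statistics.stdev(ocorrencias)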
    
print("--------Portuguese--------")

pos_it = df_concat[(df_concat["language"] == "PT") & (df_concat["linguagem_x"] == "moderna")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------moderna: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------Arcaica: ")

pos_it = df_concat[(df_concat["language"] == "PT") & (df_concat["linguagem_x"] == "arcaica")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))




--------Portuguese--------


100%|██████████| 77822/77822 [05:15<00:00, 246.71it/s]


--------moderna: 
Counter({'NOUN': 338767, 'PUNCT': 316121, 'VERB': 278555, 'ADP': 252244, 'DET': 235437, 'PRON': 178859, 'PROPN': 121201, 'ADV': 93650, 'CCONJ': 90864, 'SCONJ': 64336, 'AUX': 62205, 'ADJ': 57902, 'NUM': 12448, 'SPACE': 4505, 'INTJ': 1914, 'X': 65, 'SYM': 16, 'PART': 3})
avg. and std. NOUN (4.353100665621547, 2.6235820015730367)
avg. and std. PUNCT (4.0621032612885815, 2.3606506141012202)
avg. and std. ADP (3.2412942355632084, 2.2892026862032937)
avg. and std. VERB (3.579386291794094, 2.039417817790462)
avg. and std. DET (3.02532702834674, 2.145846656727196)
avg. and std. PRON (2.2983089614761893, 2.045202290918983)
avg. and std. ADV (1.2033872169823443, 1.2626598329565577)
avg. and std. PROPN (1.5574130708540002, 1.764552390687047)
avg. and std. CCONJ (1.16758757163784, 1.0310475782447444)
avg. and std. ADJ (0.7440312508031148, 0.9583070515672081)
--------Arcaica: 


100%|██████████| 33668/33668 [02:21<00:00, 237.14it/s]


Counter({'PUNCT': 144795, 'NOUN': 143051, 'ADP': 115177, 'VERB': 108710, 'DET': 105653, 'PRON': 68643, 'CCONJ': 55460, 'PROPN': 52169, 'ADV': 42590, 'SCONJ': 28242, 'AUX': 21856, 'ADJ': 20546, 'NUM': 5865, 'INTJ': 932, 'SPACE': 92, 'X': 22, 'SYM': 15})
avg. and std. NOUN (4.248871331828442, 2.571632197420271)
avg. and std. PUNCT (4.300671260544137, 2.271730604166547)
avg. and std. ADP (3.4209635261969824, 2.407369577194469)
avg. and std. VERB (3.228882024474278, 1.9417990402255079)
avg. and std. DET (3.138083640251871, 2.329695769002729)
avg. and std. PRON (2.0388202447427823, 1.8690477600942166)
avg. and std. ADV (1.2649994059641203, 1.3052495767700836)
avg. and std. PROPN (1.5495128905785909, 1.7256722268583673)
avg. and std. CCONJ (1.6472614945942734, 1.410584608742226)
avg. and std. ADJ (0.6102530592847808, 0.8849144885585934)


In [42]:
import spacy
from tqdm import tqdm
import statistics


nlp = spacy.load("en_core_web_trf")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

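def get_avg_std(lista, busca):
    # Count how many times the tag `busca` occurs in each text
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    if len(ocorrencias) == 1:
        # statistics.stdev needs at least two values
        return float(ocorrencias[0]), 0.0
    return statistics.mean(ocorrencias), statistics.stdev(ocorrencias)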
    
print("--------English--------")

pos_it = df_concat[(df_concat["language"] == "EN") & (df_concat["linguagem_x"] == "moderna")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------moderna: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------arcaica: ")

pos_it = df_concat[(df_concat["language"] == "EN") & (df_concat["linguagem_x"] == "arcaica")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))


--------English--------


100%|██████████| 514187/514187 [7:08:02<00:00, 20.02it/s]   


--------moderna: 
Counter({'PUNCT': 2471796, 'NOUN': 2178943, 'PRON': 2022997, 'VERB': 1817813, 'ADP': 1648567, 'DET': 1318327, 'PROPN': 1035177, 'AUX': 970032, 'CCONJ': 712990, 'ADV': 492116, 'ADJ': 482333, 'PART': 330580, 'SCONJ': 324298, 'SPACE': 187014, 'NUM': 98308, 'X': 82446, 'INTJ': 25914, 'SYM': 5472})
avg. and std. NOUN (4.237647003911029, 2.710340976004248)
avg. and std. PUNCT (4.807192713934813, 4.59788484129409)
avg. and std. ADP (3.206162349495417, 2.2359155719830324)
avg. and std. VERB (3.5353149729573095, 2.0653723019134973)
avg. and std. DET (2.5639057385737094, 2.093304037553453)
avg. and std. PRON (3.9343604564098276, 2.7463356275629)
avg. and std. ADV (0.9570759276294422, 1.1940468690367103)
avg. and std. PROPN (2.0132305950947806, 6.286629310115079)
avg. and std. CCONJ (1.3866356014446106, 1.2756519710711824)
avg. and std. ADJ (0.9380497756652735, 1.1677183631230086)
--------arcaica: 


100%|██████████| 132651/132651 [1:42:30<00:00, 21.57it/s]  


Counter({'NOUN': 550303, 'PUNCT': 543637, 'PRON': 523664, 'ADP': 453729, 'VERB': 440894, 'DET': 359201, 'CCONJ': 258167, 'AUX': 238733, 'PROPN': 204942, 'ADV': 104350, 'ADJ': 99655, 'SCONJ': 90556, 'PART': 72533, 'SPACE': 33195, 'NUM': 28208, 'INTJ': 8774, 'X': 1472, 'SYM': 361})
avg. and std. NOUN (4.148502461345938, 2.5776353349912786)
avg. and std. PUNCT (4.098250295889213, 2.0100435790173825)
avg. and std. ADP (3.4204717642535676, 2.340437178881853)
avg. and std. VERB (3.323714106942277, 1.9756382690603023)
avg. and std. DET (2.7078649991330637, 2.185093563670272)
avg. and std. PRON (3.947682263985948, 2.727994110174976)
avg. and std. ADV (0.7866506848798728, 1.0079779800363249)
avg. and std. PROPN (1.5449713910939231, 1.9221730250112459)
avg. and std. CCONJ (1.9462122411440548, 1.5234793245965454)
avg. and std. ADJ (0.7512570579942858, 1.0126023762813592)
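
The cells above repeat the same tagging and printing code for each language, and the modern English subset alone took about seven hours to tag. A minimal sketch of one way to factor this out, assuming spaCy's `nlp.pipe` batching is acceptable here. `POS_TAGS`, `tag_texts`, `pos_stats` and the `batch_size` value are hypothetical names and settings, not part of the original notebook:

```python
import statistics
from collections import Counter

# The ten tags reported in the cells above
POS_TAGS = ("NOUN", "PUNCT", "ADP", "VERB", "DET",
            "PRON", "ADV", "PROPN", "CCONJ", "ADJ")


def tag_texts(texts, model_name, batch_size=64):
    """Return one list of coarse POS tags per text, tagging in batches."""
    import spacy  # imported lazily so pos_stats works without spaCy installed
    nlp = spacy.load(model_name)
    return [[tok.pos_ for tok in doc]
            for doc in nlp.pipe(texts, batch_size=batch_size)]


def pos_stats(tag_lists, tags=POS_TAGS):
    """Map each tag to (mean, sample std) of its count per text."""
    per_text = [Counter(tag_list) for tag_list in tag_lists]
    stats = {}
    for tag in tags:
        counts = [c[tag] for c in per_text]  # a missing tag counts as 0
        if not counts:
            stats[tag] = (0.0, 0.0)
        elif len(counts) == 1:
            # statistics.stdev needs at least two values
            stats[tag] = (float(counts[0]), 0.0)
        else:
            stats[tag] = (statistics.mean(counts), statistics.stdev(counts))
    return stats
```

With this, a cell would reduce to something like `pos_stats(tag_texts(texts, "en_core_web_trf"))`, where `texts` is the filtered `texto_x` column as a list. Whether batching shortens the transformer runs noticeably depends on the hardware and was not measured here.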


In [None]:
"""
        Formal/informal:
"""





Characterize the style types (averages)
# Remove the duplicates from X
# mean and standard deviation of the length, in tokens, for each style
# number of words in each style
# Look at the PoS tags
    # mean and std. of the number of verbs per text
    # mean and std. of the number of prepositions per text
    # mean and std. of the number of punctuation marks per text
    # mean and std. of the number of adverbs per text
    .....

In [43]:
df_concat.groupby(["formal/informal_x"])["sourceLen"].mean()

formal/informal_x
formal      31.813412
informal    29.603942
Name: sourceLen, dtype: float64

In [44]:
df_concat["formal/informal_x"].value_counts()

formal/informal_x
formal      915192
informal    288607
Name: count, dtype: int64

In [45]:
# number of distinct words in each style
import itertools
def get_vocab(x):
    return x.split(" ")


print("formal : ", len(set(list(itertools.chain(*df_concat[df_concat["formal/informal_x"] == "formal"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))
print("informal : ", len(set(list(itertools.chain(*df_concat[df_concat["formal/informal_x"] == "informal"]["texto_x"].apply(lambda x: get_vocab(x)).tolist())))))

formal :  224928
informal :  140786
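
`x.split(" ")` keeps an empty string for every run of consecutive spaces and treats `The` and `the` as different words, so the vocabulary sizes above may be slightly inflated. A minimal sketch of an alternative count, assuming whitespace tokenization is still adequate for these pre-tokenized texts. `vocab_size` and its `lowercase` flag are hypothetical and not part of the original notebook; `vocab_size_naive` only mirrors the behaviour of the cell above for comparison:

```python
def vocab_size(texts, lowercase=True):
    """Count distinct whitespace-separated tokens across all texts."""
    vocab = set()
    for text in texts:
        if lowercase:
            text = text.lower()
        # split() with no argument collapses repeated whitespace
        vocab.update(text.split())
    return len(vocab)


def vocab_size_naive(texts):
    """Same count using split(" "), which keeps empty strings."""
    return len({tok for text in texts for tok in text.split(" ")})
```

On the notebook's data this would be called as `vocab_size(df_concat[mask]["texto_x"])`, with `mask` selecting one style.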


In [46]:
import spacy
from tqdm import tqdm
import statistics

nlp = spacy.load("it_core_news_lg")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

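def get_avg_std(lista, busca):
    # Count how many times the tag `busca` occurs in each text
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    if len(ocorrencias) == 1:
        # statistics.stdev needs at least two values
        return float(ocorrencias[0]), 0.0
    return statistics.mean(ocorrencias), statistics.stdev(ocorrencias)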
    
print("--------Italian--------")

pos_it = df_concat[(df_concat["language"] == "IT") & (df_concat["formal/informal_x"] == "formal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------formal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------informal: ")

pos_it = df_concat[(df_concat["language"] == "IT") & (df_concat["formal/informal_x"] == "informal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))


--------Italian--------


100%|██████████| 77678/77678 [04:48<00:00, 269.18it/s]


--------formal: 
Counter({'NOUN': 380531, 'PUNCT': 348301, 'ADP': 264056, 'VERB': 247748, 'DET': 240530, 'PRON': 176261, 'ADV': 105423, 'PROPN': 94821, 'CCONJ': 93524, 'AUX': 79999, 'ADJ': 57091, 'SCONJ': 35903, 'NUM': 9582, 'SYM': 3136, 'INTJ': 961, 'X': 383, 'PART': 1})
avg. and std. NOUN (4.89882592239759, 2.9916437060344845)
avg. and std. PUNCT (4.483907927598548, 2.729057812131463)
avg. and std. ADP (3.399366616030279, 2.4593790360498216)
avg. and std. VERB (3.1894230026519734, 1.9614276427415727)
avg. and std. DET (3.0965009397770284, 2.374321793694466)
avg. and std. PRON (2.269123818841886, 1.998753961806364)
avg. and std. ADV (1.3571796390226318, 1.4125879443171692)
avg. and std. PROPN (1.2206931177424754, 1.6600268203108355)
avg. and std. CCONJ (1.2039959834187286, 1.0968725379487736)
avg. and std. ADJ (0.7349700043770437, 0.9914138092624931)
--------informal: 


100%|██████████| 6393/6393 [00:28<00:00, 227.17it/s]


Counter({'PUNCT': 28488, 'NOUN': 26334, 'VERB': 24699, 'ADP': 21662, 'DET': 17669, 'PRON': 16261, 'ADV': 11864, 'AUX': 8893, 'CCONJ': 7035, 'PROPN': 7011, 'ADJ': 6855, 'SCONJ': 4733, 'NUM': 546, 'INTJ': 164, 'X': 17, 'SYM': 10, 'PART': 1})
avg. and std. NOUN (4.119192867198499, 2.5105900556011296)
avg. and std. PUNCT (4.456123885499766, 2.494673385193473)
avg. and std. ADP (3.388393555451275, 2.271077702368692)
avg. and std. VERB (3.8634443923040824, 2.0100571195036507)
avg. and std. DET (2.763804160800876, 1.994177392693757)
avg. and std. PRON (2.54356327232911, 2.0527355000752388)
avg. and std. ADV (1.8557797591115281, 1.6376662388699421)
avg. and std. PROPN (1.0966682308775222, 1.346335302734845)
avg. and std. CCONJ (1.1004223369310182, 1.0103910118477082)
avg. and std. ADJ (1.0722665415297983, 1.2072873259471244)
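
The means above differ between formal and informal texts, but the raw numbers do not say how large a difference is relative to the spread. One common way to express this is Cohen's d with a pooled standard deviation. The sketch below is an added illustration, not part of the original analysis, and `cohens_d` is a hypothetical helper. It assumes the progress-bar totals (77678 formal, 6393 informal) are the number of texts in each group:

```python
import math


def cohens_d(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Standardized mean difference using the pooled sample variance."""
    pooled_var = ((n_a - 1) * std_a ** 2 + (n_b - 1) * std_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# VERB counts per text for Italian, copied from the output above (rounded):
# informal first, formal second, so a positive d means more verbs in informal texts
d_verb = cohens_d(3.8634, 2.0101, 6393, 3.1894, 1.9614, 77678)
print(round(d_verb, 2))
```

The same call works for any other tag by swapping in its two `(mean, std)` pairs.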


In [47]:
import spacy
from tqdm import tqdm
import statistics

nlp = spacy.load("es_dep_news_trf")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

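def get_avg_std(lista, busca):
    # Count how many times the tag `busca` occurs in each text
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    if len(ocorrencias) == 1:
        # statistics.stdev needs at least two values
        return float(ocorrencias[0]), 0.0
    return statistics.mean(ocorrencias), statistics.stdev(ocorrencias)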
    
print("--------Spanish--------")

pos_it = df_concat[(df_concat["language"] == "ES") & (df_concat["formal/informal_x"] == "formal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------Formal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------Informal: ")

pos_it = df_concat[(df_concat["language"] == "ES") & (df_concat["formal/informal_x"] == "informal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

--------Spanish--------


100%|██████████| 180643/180643 [1:46:29<00:00, 28.27it/s]


--------Formal: 
Counter({'PUNCT': 831295, 'NOUN': 802132, 'ADP': 680132, 'DET': 613768, 'VERB': 594938, 'PRON': 436657, 'PROPN': 284953, 'CCONJ': 252399, 'ADV': 174689, 'AUX': 163163, 'SCONJ': 135234, 'ADJ': 128190, 'SYM': 75789, 'SPACE': 29333, 'NUM': 27460, 'INTJ': 13819, 'PART': 959})
avg. and std. NOUN (4.4404266979622795, 2.6748791827562948)
avg. and std. PUNCT (4.6018666651904585, 3.5533568924631362)
avg. and std. ADP (3.7650614748426454, 2.4656846079022996)
avg. and std. VERB (3.293446189445481, 1.9260410257178096)
avg. and std. DET (3.397684936587634, 2.219615550808448)
avg. and std. PRON (2.4172373133749994, 2.032370817899358)
avg. and std. ADV (0.967039962799555, 1.0973469989997071)
avg. and std. PROPN (1.5774372657672868, 5.569836772182662)
avg. and std. CCONJ (1.3972254668046922, 1.2067031585887453)
avg. and std. ADJ (0.709631704522179, 0.9710559490632815)
--------Informal: 


100%|██████████| 107041/107041 [1:01:24<00:00, 29.05it/s]


Counter({'NOUN': 453965, 'ADP': 388171, 'PUNCT': 373233, 'VERB': 371676, 'DET': 365231, 'PRON': 265703, 'PROPN': 157328, 'CCONJ': 118076, 'ADV': 108817, 'ADJ': 99127, 'AUX': 97138, 'SCONJ': 86540, 'SYM': 46984, 'NUM': 13403, 'INTJ': 5310, 'SPACE': 4950, 'PART': 301})
avg. and std. NOUN (4.241038480582207, 2.6166470885129014)
avg. and std. PUNCT (3.486822806214441, 1.9980457183277125)
avg. and std. ADP (3.6263768088863144, 2.409464335285695)
avg. and std. VERB (3.4722769779804, 1.992502721649018)
avg. and std. DET (3.4120664044618416, 2.2079118143384573)
avg. and std. PRON (2.4822544632430565, 2.081330597206193)
avg. and std. ADV (1.0165917732457657, 1.1419297197195335)
avg. and std. PROPN (1.4697919488794013, 1.8391619170292348)
avg. and std. CCONJ (1.1030913388327837, 1.0176753817882558)
avg. and std. ADJ (0.9260657131379565, 1.1115933603205126)


In [48]:
import spacy
from tqdm import tqdm
import statistics


nlp = spacy.load("fr_dep_news_trf")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

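def get_avg_std(lista, busca):
    # Count how many times the tag `busca` occurs in each text
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    if len(ocorrencias) == 1:
        # statistics.stdev needs at least two values
        return float(ocorrencias[0]), 0.0
    return statistics.mean(ocorrencias), statistics.stdev(ocorrencias)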
    
print("--------French--------")

pos_it = df_concat[(df_concat["language"] == "FR") & (df_concat["formal/informal_x"] == "formal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------Formal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------Informal: ")

pos_it = df_concat[(df_concat["language"] == "FR") & (df_concat["formal/informal_x"] == "informal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

--------French--------


100%|██████████| 46374/46374 [28:19<00:00, 27.29it/s]


--------Formal: 
Counter({'PUNCT': 233406, 'NOUN': 220289, 'PRON': 196687, 'ADP': 188350, 'DET': 185962, 'VERB': 163930, 'CCONJ': 62224, 'PROPN': 59213, 'ADV': 57174, 'AUX': 50221, 'ADJ': 37829, 'SCONJ': 24825, 'NUM': 5506, 'INTJ': 561, 'X': 16, 'SYM': 8})
avg. and std. NOUN (4.750269547591323, 2.716235017147298)
avg. and std. PUNCT (5.033122008021737, 2.630638327060437)
avg. and std. ADP (4.061543106050804, 2.7765897556526267)
avg. and std. VERB (3.534954931642731, 1.9939898376637366)
avg. and std. DET (4.010048734204511, 2.5059707764624015)
avg. and std. PRON (4.241320567559408, 3.1266041543873473)
avg. and std. ADV (1.2328891189028335, 1.6082668547958952)
avg. and std. PROPN (1.2768577219993962, 1.6893395346907143)
avg. and std. CCONJ (1.341786345797214, 1.1330778078283155)
avg. and std. ADJ (0.8157372665717859, 1.0356514945349355)
--------Informal: 


100%|██████████| 27342/27342 [17:10<00:00, 26.53it/s]


Counter({'NOUN': 167025, 'PUNCT': 146655, 'PRON': 133537, 'ADP': 117547, 'DET': 117062, 'VERB': 105069, 'CCONJ': 44758, 'ADV': 42834, 'PROPN': 30602, 'AUX': 30376, 'NUM': 29731, 'ADJ': 27582, 'SCONJ': 17147, 'INTJ': 585, 'SYM': 11, 'X': 2})
avg. and std. NOUN (6.108733816107088, 3.5765070179214753)
avg. and std. PUNCT (5.363726135615536, 3.5744714145955085)
avg. and std. ADP (4.299136859044693, 2.94175167478981)
avg. and std. VERB (3.8427693658108404, 2.1500109980522795)
avg. and std. DET (4.2813985809377515, 2.744678871718367)
avg. and std. PRON (4.883951430034379, 3.6182990472862615)
avg. and std. ADV (1.56660083388194, 1.8917933979154569)
avg. and std. PROPN (1.1192304878940824, 1.5983114948579717)
avg. and std. CCONJ (1.6369687660010241, 1.3558606917928682)
avg. and std. ADJ (1.0087777046302393, 1.159272112254435)


In [49]:
import spacy
from tqdm import tqdm
import statistics


nlp = spacy.load("pt_core_news_lg")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

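def get_avg_std(lista, busca):
    # Count how many times the tag `busca` occurs in each text
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    if len(ocorrencias) == 1:
        # statistics.stdev needs at least two values
        return float(ocorrencias[0]), 0.0
    return statistics.mean(ocorrencias), statistics.stdev(ocorrencias)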
    
print("--------Portuguese--------")

pos_it = df_concat[(df_concat["language"] == "PT") & (df_concat["formal/informal_x"] == "formal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------Formal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------Informal: ")

pos_it = df_concat[(df_concat["language"] == "PT") & (df_concat["formal/informal_x"] == "informal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))




--------Portuguese--------


100%|██████████| 85650/85650 [05:11<00:00, 274.93it/s]


--------Formal: 
Counter({'NOUN': 368029, 'PUNCT': 359903, 'VERB': 285483, 'ADP': 282931, 'DET': 256792, 'PRON': 178533, 'PROPN': 130715, 'CCONJ': 113505, 'ADV': 101525, 'SCONJ': 67593, 'ADJ': 59399, 'AUX': 57744, 'NUM': 13753, 'SPACE': 4592, 'INTJ': 1984, 'X': 79, 'SYM': 26, 'PART': 2})
avg. and std. NOUN (4.296894337419731, 2.5798399545519644)
avg. and std. PUNCT (4.202019848219498, 2.3627263014371804)
avg. and std. ADP (3.3033391710449505, 2.3149437220398097)
avg. and std. VERB (3.333134851138354, 1.9435808719844074)
avg. and std. DET (2.9981552831290132, 2.1830116091382696)
avg. and std. PRON (2.0844483362521893, 1.910458618024813)
avg. and std. ADV (1.1853473438412143, 1.2505605272981049)
avg. and std. PROPN (1.5261529480443665, 1.7355287057761113)
avg. and std. CCONJ (1.3252189141856392, 1.2123007263064045)
avg. and std. ADJ (0.6935084646818447, 0.9363276591551016)
--------Informal: 


100%|██████████| 25840/25840 [01:36<00:00, 268.70it/s]


Counter({'NOUN': 113789, 'VERB': 101782, 'PUNCT': 101013, 'ADP': 84490, 'DET': 84298, 'PRON': 68969, 'PROPN': 42655, 'ADV': 34715, 'CCONJ': 32819, 'AUX': 26317, 'SCONJ': 24985, 'ADJ': 19049, 'NUM': 4560, 'INTJ': 862, 'X': 8, 'SYM': 5, 'SPACE': 5, 'PART': 1})
avg. and std. NOUN (4.40359907120743, 2.6994664104472723)
avg. and std. PUNCT (3.909171826625387, 2.2336853963757273)
avg. and std. ADP (3.2697368421052633, 2.366264556137251)
avg. and std. VERB (3.9389318885448916, 2.1789733860758425)
avg. and std. DET (3.2623065015479877, 2.2586365716932755)
avg. and std. PRON (2.669078947368421, 2.2022376736569425)
avg. and std. ADV (1.3434597523219813, 1.3497594176845074)
avg. and std. PROPN (1.650735294117647, 1.8060157871165055)
avg. and std. CCONJ (1.2700851393188854, 1.0628326364375906)
avg. and std. ADJ (0.7371904024767801, 0.9460107491166083)


In [50]:
import spacy
from tqdm import tqdm
import statistics


nlp = spacy.load("en_core_web_trf")
tqdm.pandas()
def get_pos(x):
    return [w.pos_ for w in nlp(x)]

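def get_avg_std(lista, busca):
    # Count how many times the tag `busca` occurs in each text
    ocorrencias = [sentenca.count(busca) for sentenca in lista]
    if not ocorrencias:
        return 0.0, 0.0
    if len(ocorrencias) == 1:
        # statistics.stdev needs at least two values
        return float(ocorrencias[0]), 0.0
    return statistics.mean(ocorrencias), statistics.stdev(ocorrencias)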
    
print("--------English--------")

pos_it = df_concat[(df_concat["language"] == "EN") & (df_concat["formal/informal_x"] == "formal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain



print("--------formal: ")
print(Counter(chain(*pos_it)))



print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))

print("--------informal: ")

pos_it = df_concat[(df_concat["language"] == "EN") & (df_concat["formal/informal_x"] == "informal")]["texto_x"].progress_apply(get_pos).tolist()

from collections import Counter
from itertools import chain
print(Counter(chain(*pos_it)))


print("avg. and std. NOUN", get_avg_std(pos_it, "NOUN"))
print("avg. and std. PUNCT", get_avg_std(pos_it, "PUNCT"))
print("avg. and std. ADP", get_avg_std(pos_it, "ADP"))
print("avg. and std. VERB", get_avg_std(pos_it, "VERB"))
print("avg. and std. DET", get_avg_std(pos_it, "DET"))
print("avg. and std. PRON", get_avg_std(pos_it, "PRON"))
print("avg. and std. ADV", get_avg_std(pos_it, "ADV"))
print("avg. and std. PROPN", get_avg_std(pos_it, "PROPN"))
print("avg. and std. CCONJ", get_avg_std(pos_it, "CCONJ"))
print("avg. and std. ADJ", get_avg_std(pos_it, "ADJ"))


--------English--------


100%|██████████| 524847/524847 [5:16:40<00:00, 27.62it/s]   


--------formal: 
Counter({'PUNCT': 2517487, 'NOUN': 2251808, 'PRON': 2052112, 'VERB': 1794872, 'ADP': 1752342, 'DET': 1390547, 'PROPN': 1019733, 'AUX': 960702, 'CCONJ': 842397, 'ADJ': 452134, 'ADV': 447889, 'SCONJ': 341094, 'PART': 314482, 'SPACE': 213551, 'NUM': 108237, 'X': 82240, 'INTJ': 30267, 'SYM': 5112})
avg. and std. NOUN (4.2904084428414375, 2.689915642030412)
avg. and std. PUNCT (4.796611202883888, 4.4834290082010675)
avg. and std. ADP (3.3387672978982446, 2.275315615190581)
avg. and std. VERB (3.4198004370797586, 1.9955861933840198)
avg. and std. DET (2.6494330728764766, 2.1280454657452776)
avg. and std. PRON (3.9099242255362037, 2.7027508126166695)
avg. and std. ADV (0.8533706013371516, 1.0761023372813139)
avg. and std. PROPN (1.942914792310902, 6.196439474233375)
avg. and std. CCONJ (1.6050334668960669, 1.3900278708540814)
avg. and std. ADJ (0.861458672717954, 1.1166934892065536)
--------informal: 


100%|██████████| 121991/121991 [1:12:50<00:00, 27.92it/s]


Counter({'PUNCT': 497946, 'PRON': 494549, 'NOUN': 477438, 'VERB': 463835, 'ADP': 349954, 'DET': 286981, 'AUX': 248063, 'PROPN': 220386, 'ADV': 148577, 'ADJ': 129854, 'CCONJ': 128760, 'PART': 88631, 'SCONJ': 73760, 'NUM': 18279, 'SPACE': 6658, 'INTJ': 4421, 'X': 1678, 'SYM': 721})
avg. and std. NOUN (3.9137149461845544, 2.636128643198193)
avg. and std. PUNCT (4.081825708453902, 2.6528821221854373)
avg. and std. ADP (2.868687034289415, 2.1482849600434752)
avg. and std. VERB (3.8022067201678813, 2.2384797129133074)
avg. and std. DET (2.352476822060644, 2.0308483501075694)
avg. and std. PRON (4.053979391922355, 2.904874265286586)
avg. and std. ADV (1.2179341098933527, 1.4315888748984942)
avg. and std. PROPN (1.8065758949430695, 2.362687736925302)
avg. and std. CCONJ (1.0554876999122886, 1.0466510157217248)
avg. and std. ADJ (1.0644555745915683, 1.2223034089582667)
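
The averages above are consistent with a per-text count: for the formal English subset, 2251808 NOUN tokens over 524847 texts gives about 4.2904, which matches the reported NOUN average. `get_avg_std` is defined elsewhere in the notebook, so the following is only a minimal sketch of what such a helper could look like. It assumes the input is a list of per-text POS tag lists, and it assumes the population standard deviation, which the output does not confirm. The name `get_avg_std_sketch` and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def get_avg_std_sketch(pos_per_text, tag):
    """Return (mean, population std) of how often `tag` occurs in each text.

    Assumption: `pos_per_text` is a list with one list of POS tags per text.
    """
    counts = [sum(1 for t in tags if t == tag) for tags in pos_per_text]
    return mean(counts), pstdev(counts)


# Toy input, not taken from the corpus: three texts with 2, 0 and 1 nouns.
toy = [["NOUN", "VERB", "NOUN"], ["PRON", "VERB"], ["NOUN"]]
result = get_avg_std_sketch(toy, "NOUN")
print(result)
```

Because the formal and informal subsets differ in size (524847 versus 121991 texts), per-text statistics like these are easier to compare across subsets than the raw `Counter` totals.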


In [19]:
df_concat["livro"]

17         Gênesis
18         Gênesis
19         Gênesis
20         Gênesis
21         Gênesis
           ...    
7940    Apocalisse
7948    Apocalisse
7949    Apocalisse
7950    Apocalisse
7953    Apocalisse
Name: livro, Length: 1219810, dtype: object

In [49]:
""" Exemplos de textos pareados em ingles e portugues, estilo forma de tradução """

df_concat[(df_concat["language"] == "EN") & (df_concat["forma_x"] == "literal") & (df_concat["livro"] == "1%20Corinthians") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]



Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4101,AKJV,1%20Corinthians,1,1,"Paul , called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",ASV,Corinthians 1:31 Jer . 9:23f .,21,6,0.0,NOVO,EN,literal,literal,arcaica,arcaica,formal,formal
4101,ASV,1%20Corinthians,1,1,Corinthians 1:31 Jer . 9:23f .,BRG,"Paul , called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",6,21,0.0,NOVO,EN,literal,literal,arcaica,moderna,formal,formal
4108,DARBY,1%20Corinthians,1,1,"Paul , [ a ] called apostle of Jesus Christ , by God 's will , and Sosthenes the brother ,",DRA,"Paul , called to be an apostle of Jesus Christ by the will of God , and Sosthenes a brother ,",21,21,0.511628,NOVO,EN,literal,literal,arcaica,arcaica,formal,formal
4068,DRA,1%20Corinthians,1,1,"Paul , called to be an apostle of Jesus Christ by the will of God , and Sosthenes a brother ,",ERV,Corinthians 1:31 Quote from Jer . 9:24 .,21,8,0.0,NOVO,EN,literal,dinâmica,arcaica,moderna,formal,informal
4110,GNV,1%20Corinthians,1,1,"Corinthians 1:31 Let him yield all to God and give him thanks : and so by this place is man ’ s free will beaten down , which the Papist so dream of . Geneva Bible ,",KJ21,"Paul , called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",37,21,0.090909,NOVO,EN,literal,literal,arcaica,moderna,formal,formal
4113,KJV,1%20Corinthians,1,1,"Paul called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",LEB,Corinthians 1:31 A quotation from Jer 9:24,20,7,0.0,NOVO,EN,literal,dinâmica,arcaica,moderna,formal,formal
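
The `livro` values used in these filters (`1%20Corinthians`, `1%20Coríntios`) look percent-encoded, so the space in the book name is stored as `%20`. As a small sketch, the standard-library `urllib.parse.unquote` can turn them into readable names. The helper name `decode_book_name` is hypothetical, and whether the notebook decodes these values anywhere is not shown here.

```python
from urllib.parse import unquote


def decode_book_name(raw):
    """Decode a percent-encoded book name, e.g. '1%20Corinthians'."""
    return unquote(raw)


for raw in ["1%20Corinthians", "1%20Coríntios", "Gênesis"]:
    print(raw, "->", decode_book_name(raw))
```

If the column were decoded this way, the filters would have to compare against the decoded names (for example `"1 Corinthians"`) instead of the encoded ones.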


In [48]:
df_concat[(df_concat["language"] == "PT") & (df_concat["forma_x"] == "literal") & (df_concat["livro"] == "1%20Coríntios") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4990,ARC,1%20Coríntios,1,1,"Paulo ( chamado apóstolo de Jesus Cristo , pela vontade de Deus ) e o irmão Sóstenes ,",NTLH,"1-2 Eu , Paulo , que fui chamado pela vontade de Deus para ser apóstolo de Cristo Jesus , escrevo , junto com o irmão Sóstenes , esta carta à igreja de Deus que está na cidade de Corinto . Escrevo a todos os que , pela sua união com Cristo Jesus , foram chamados para pertencerem ao povo de Deus . Esta carta é também para aqueles que em todos os lugares adoram o nosso Senhor Jesus Cristo , Senhor deles e nosso .",18,85,0.142012,NOVO,PT,literal,dinâmica,arcaica,moderna,formal,informal
4967,OL,1%20Coríntios,1,1,"Paulo , escolhido por Deus para ser o apóstolo de Jesus Cristo , e o irmão Sóstenes .",VFL,"De Paulo , chamado para ser apóstolo de Cristo Jesus [ a ] pela vontade de Deus , e também de Sóstenes , o nosso irmão em Cristo .",18,29,0.315789,NOVO,PT,literal,dinâmica,arcaica,moderna,formal,informal


In [53]:
df_concat[(df_concat["language"] == "PT") & (df_concat["forma_x"] == "dinâmica") & (df_concat["livro"] == "1%20Coríntios") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
5120,NTLH,1%20Coríntios,1,1,"1-2 Eu , Paulo , que fui chamado pela vontade de Deus para ser apóstolo de Cristo Jesus , escrevo , junto com o irmão Sóstenes , esta carta à igreja de Deus que está na cidade de Corinto . Escrevo a todos os que , pela sua união com Cristo Jesus , foram chamados para pertencerem ao povo de Deus . Esta carta é também para aqueles que em todos os lugares adoram o nosso Senhor Jesus Cristo , Senhor deles e nosso .",NVI-PT,"Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , e o irmão Sóstenes ,",85,19,0.16568,NOVO,PT,dinâmica,dinâmica,moderna,moderna,informal,formal
5021,NVI-PT,1%20Coríntios,1,1,"Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , e o irmão Sóstenes ,",NVT,"Eu , Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , escrevo esta carta , com nosso irmão Sóstenes ,",19,25,0.591837,NOVO,PT,dinâmica,dinâmica,moderna,moderna,formal,formal
4987,NVT,1%20Coríntios,1,1,"Eu , Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , escrevo esta carta , com nosso irmão Sóstenes ,",OL,"Paulo , escolhido por Deus para ser o apóstolo de Jesus Cristo , e o irmão Sóstenes .",25,18,0.306122,NOVO,PT,dinâmica,literal,moderna,arcaica,formal,formal


In [52]:
df_concat[(df_concat["language"] == "EN") & (df_concat["forma_x"] == "dinâmica") & (df_concat["livro"] == "1%20Corinthians") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4004,CEB,1%20Corinthians,1,1,"From Paul , called by God ’ s will to be an apostle of Jesus Christ , and from Sosthenes our brother .",CEV,"From Paul , chosen by God to be an apostle of Christ Jesus , and from Sosthenes , who is also a follower .",23,24,0.553191,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,informal
4005,CEV,1%20Corinthians,1,1,"From Paul , chosen by God to be an apostle of Christ Jesus , and from Sosthenes , who is also a follower .",CJB,Corinthians 1:31 Jeremiah 9:23 (,24,5,0.0,NOVO,EN,dinâmica,dinâmica,moderna,moderna,informal,informal
4112,CJB,1%20Corinthians,1,1,Corinthians 1:31 Jeremiah 9:23 (,DARBY,"Paul , [ a ] called apostle of Jesus Christ , by God 's will , and Sosthenes the brother ,",5,21,0.0,NOVO,EN,dinâmica,literal,moderna,arcaica,informal,formal
4065,ERV,1%20Corinthians,1,1,Corinthians 1:31 Quote from Jer . 9:24 .,EXB,"From Paul . ·God called me [ L …called ] to be an ·apostle [ messenger ] of Christ Jesus ·because that is what God wanted [ L by the will of God ] . Also from Sosthenes [ C a coworker ; Paul may be dictating the letter to him ; see 16:21 ; perhaps the synagogue leader mentioned in Acts 18:15–17 ] , our ·brother in Christ [ L brother ] .",8,74,0.017751,NOVO,EN,dinâmica,livre,moderna,moderna,informal,formal
4101,LEB,1%20Corinthians,1,1,Corinthians 1:31 A quotation from Jer 9:24,MOUNCE,"Paul Paulos , called klētos to be an apostle apostolos of Christ Christos Jesus Iēsous by dia the will thelēma of God theos , and kai Sosthenes Sōsthenēs our ho brother adelphos ,",7,33,0.0,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,formal
4100,MOUNCE,1%20Corinthians,1,1,"Paul Paulos , called klētos to be an apostle apostolos of Christ Christos Jesus Iēsous by dia the will thelēma of God theos , and kai Sosthenes Sōsthenēs our ho brother adelphos ,",NABRE,"Cor 1:31 ) , on the other hand , is the acknowledgment that we live only from God and for God . Scripture texts , prefaces , introductions , footnotes and cross references used in this work are taken from the New American Bible , revised edition ©",33,48,0.040404,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,formal
4220,NABRE,1%20Corinthians,1,1,"Cor 1:31 ) , on the other hand , is the acknowledgment that we live only from God and for God . Scripture texts , prefaces , introductions , footnotes and cross references used in this work are taken from the New American Bible , revised edition ©",NCB,"Corinthians 1:29 “ Boasting ” refers to a person ’ s sin in thinking that one is saved by oneself . The truth is that we live only from God and for God . Hence , the only “ boasting ” possible is “ boasting in the Lord . ”",48,50,0.242718,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,formal
4114,NCB,1%20Corinthians,1,1,"Corinthians 1:29 “ Boasting ” refers to a person ’ s sin in thinking that one is saved by oneself . The truth is that we live only from God and for God . Hence , the only “ boasting ” possible is “ boasting in the Lord . ”",NCV,"From Paul . God called me to be an apostle of Christ Jesus because that is what God wanted . Also from Sosthenes , our brother in Christ .",50,29,0.07767,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,formal
4084,NCV,1%20Corinthians,1,1,"From Paul . God called me to be an apostle of Christ Jesus because that is what God wanted . Also from Sosthenes , our brother in Christ .",NIRV,"I , Paul , am writing this letter . I have been chosen to be an apostle of Christ Jesus just as God planned . Our brother Sosthenes joins me in writing .",29,33,0.323077,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,informal
4089,NIRV,1%20Corinthians,1,1,"I , Paul , am writing this letter . I have been chosen to be an apostle of Christ Jesus just as God planned . Our brother Sosthenes joins me in writing .",YLT,"Paul , a called apostle of Jesus Christ , through the will of God , and Sosthenes the brother ,",33,20,0.169231,NOVO,EN,dinâmica,literal,moderna,arcaica,informal,formal


In [54]:
df_concat[(df_concat["language"] == "EN") & (df_concat["forma_x"] == "livre") & (df_concat["livro"] == "1%20Corinthians") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4110,EXB,1%20Corinthians,1,1,"From Paul . ·God called me [ L …called ] to be an ·apostle [ messenger ] of Christ Jesus ·because that is what God wanted [ L by the will of God ] . Also from Sosthenes [ C a coworker ; Paul may be dictating the letter to him ; see 16:21 ; perhaps the synagogue leader mentioned in Acts 18:15–17 ] , our ·brother in Christ [ L brother ] .",GNV,"Corinthians 1:31 Let him yield all to God and give him thanks : and so by this place is man ’ s free will beaten down , which the Papist so dream of . Geneva Bible ,",74,37,0.065089,NOVO,EN,livre,literal,moderna,arcaica,formal,formal


In [55]:
df_concat[(df_concat["language"] == "PT") & (df_concat["forma_x"] == "livre") & (df_concat["livro"] == "1%20Coríntios") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y


In [None]:
""" Exemplos de textos pareados em ingles e portugues, estilo linguagem """

In [56]:
df_concat[(df_concat["language"] == "PT") & (df_concat["linguagem_x"] == "moderna") & (df_concat["livro"] == "1%20Coríntios") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
5120,NTLH,1%20Coríntios,1,1,"1-2 Eu , Paulo , que fui chamado pela vontade de Deus para ser apóstolo de Cristo Jesus , escrevo , junto com o irmão Sóstenes , esta carta à igreja de Deus que está na cidade de Corinto . Escrevo a todos os que , pela sua união com Cristo Jesus , foram chamados para pertencerem ao povo de Deus . Esta carta é também para aqueles que em todos os lugares adoram o nosso Senhor Jesus Cristo , Senhor deles e nosso .",NVI-PT,"Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , e o irmão Sóstenes ,",85,19,0.16568,NOVO,PT,dinâmica,dinâmica,moderna,moderna,informal,formal
5021,NVI-PT,1%20Coríntios,1,1,"Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , e o irmão Sóstenes ,",NVT,"Eu , Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , escrevo esta carta , com nosso irmão Sóstenes ,",19,25,0.591837,NOVO,PT,dinâmica,dinâmica,moderna,moderna,formal,formal
4987,NVT,1%20Coríntios,1,1,"Eu , Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , escrevo esta carta , com nosso irmão Sóstenes ,",OL,"Paulo , escolhido por Deus para ser o apóstolo de Jesus Cristo , e o irmão Sóstenes .",25,18,0.306122,NOVO,PT,dinâmica,literal,moderna,arcaica,formal,formal


In [51]:
df_concat[(df_concat["language"] == "EN") & (df_concat["linguagem_x"] == "moderna") & (df_concat["livro"] == "1%20Corinthians") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4004,CEB,1%20Corinthians,1,1,"From Paul , called by God ’ s will to be an apostle of Jesus Christ , and from Sosthenes our brother .",CEV,"From Paul , chosen by God to be an apostle of Christ Jesus , and from Sosthenes , who is also a follower .",23,24,0.553191,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,informal
4005,CEV,1%20Corinthians,1,1,"From Paul , chosen by God to be an apostle of Christ Jesus , and from Sosthenes , who is also a follower .",CJB,Corinthians 1:31 Jeremiah 9:23 (,24,5,0.0,NOVO,EN,dinâmica,dinâmica,moderna,moderna,informal,informal
4110,EXB,1%20Corinthians,1,1,"From Paul . ·God called me [ L …called ] to be an ·apostle [ messenger ] of Christ Jesus ·because that is what God wanted [ L by the will of God ] . Also from Sosthenes [ C a coworker ; Paul may be dictating the letter to him ; see 16:21 ; perhaps the synagogue leader mentioned in Acts 18:15–17 ] , our ·brother in Christ [ L brother ] .",GNV,"Corinthians 1:31 Let him yield all to God and give him thanks : and so by this place is man ’ s free will beaten down , which the Papist so dream of . Geneva Bible ,",74,37,0.065089,NOVO,EN,livre,literal,moderna,arcaica,formal,formal
4100,MOUNCE,1%20Corinthians,1,1,"Paul Paulos , called klētos to be an apostle apostolos of Christ Christos Jesus Iēsous by dia the will thelēma of God theos , and kai Sosthenes Sōsthenēs our ho brother adelphos ,",NABRE,"Cor 1:31 ) , on the other hand , is the acknowledgment that we live only from God and for God . Scripture texts , prefaces , introductions , footnotes and cross references used in this work are taken from the New American Bible , revised edition ©",33,48,0.040404,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,formal
4084,NCV,1%20Corinthians,1,1,"From Paul . God called me to be an apostle of Christ Jesus because that is what God wanted . Also from Sosthenes , our brother in Christ .",NIRV,"I , Paul , am writing this letter . I have been chosen to be an apostle of Christ Jesus just as God planned . Our brother Sosthenes joins me in writing .",29,33,0.323077,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,informal
4089,NIRV,1%20Corinthians,1,1,"I , Paul , am writing this letter . I have been chosen to be an apostle of Christ Jesus just as God planned . Our brother Sosthenes joins me in writing .",YLT,"Paul , a called apostle of Jesus Christ , through the will of God , and Sosthenes the brother ,",33,20,0.169231,NOVO,EN,dinâmica,literal,moderna,arcaica,informal,formal


In [58]:
df_concat[(df_concat["language"] == "PT") & (df_concat["linguagem_x"] == "arcaica") & (df_concat["livro"] == "1%20Coríntios") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4990,ARC,1%20Coríntios,1,1,"Paulo ( chamado apóstolo de Jesus Cristo , pela vontade de Deus ) e o irmão Sóstenes ,",NTLH,"1-2 Eu , Paulo , que fui chamado pela vontade de Deus para ser apóstolo de Cristo Jesus , escrevo , junto com o irmão Sóstenes , esta carta à igreja de Deus que está na cidade de Corinto . Escrevo a todos os que , pela sua união com Cristo Jesus , foram chamados para pertencerem ao povo de Deus . Esta carta é também para aqueles que em todos os lugares adoram o nosso Senhor Jesus Cristo , Senhor deles e nosso .",18,85,0.142012,NOVO,PT,literal,dinâmica,arcaica,moderna,formal,informal
4967,OL,1%20Coríntios,1,1,"Paulo , escolhido por Deus para ser o apóstolo de Jesus Cristo , e o irmão Sóstenes .",VFL,"De Paulo , chamado para ser apóstolo de Cristo Jesus [ a ] pela vontade de Deus , e também de Sóstenes , o nosso irmão em Cristo .",18,29,0.315789,NOVO,PT,literal,dinâmica,arcaica,moderna,formal,informal


In [59]:
df_concat[(df_concat["language"] == "EN") & (df_concat["linguagem_x"] == "arcaica") & (df_concat["livro"] == "1%20Corinthians") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4101,AKJV,1%20Corinthians,1,1,"Paul , called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",ASV,Corinthians 1:31 Jer . 9:23f .,21,6,0.0,NOVO,EN,literal,literal,arcaica,arcaica,formal,formal
4101,ASV,1%20Corinthians,1,1,Corinthians 1:31 Jer . 9:23f .,BRG,"Paul , called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",6,21,0.0,NOVO,EN,literal,literal,arcaica,moderna,formal,formal
4108,DARBY,1%20Corinthians,1,1,"Paul , [ a ] called apostle of Jesus Christ , by God 's will , and Sosthenes the brother ,",DRA,"Paul , called to be an apostle of Jesus Christ by the will of God , and Sosthenes a brother ,",21,21,0.511628,NOVO,EN,literal,literal,arcaica,arcaica,formal,formal
4068,DRA,1%20Corinthians,1,1,"Paul , called to be an apostle of Jesus Christ by the will of God , and Sosthenes a brother ,",ERV,Corinthians 1:31 Quote from Jer . 9:24 .,21,8,0.0,NOVO,EN,literal,dinâmica,arcaica,moderna,formal,informal
4110,GNV,1%20Corinthians,1,1,"Corinthians 1:31 Let him yield all to God and give him thanks : and so by this place is man ’ s free will beaten down , which the Papist so dream of . Geneva Bible ,",KJ21,"Paul , called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",37,21,0.090909,NOVO,EN,literal,literal,arcaica,moderna,formal,formal
4113,KJV,1%20Corinthians,1,1,"Paul called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",LEB,Corinthians 1:31 A quotation from Jer 9:24,20,7,0.0,NOVO,EN,literal,dinâmica,arcaica,moderna,formal,formal


In [None]:
""" Exemplos de textos pareados em ingles e portugues, estilo formal/informal """

In [61]:
df_concat[(df_concat["language"] == "EN") & (df_concat["formal/informal_x"] == "informal") & (df_concat["livro"] == "1%20Corinthians") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4005,CEV,1%20Corinthians,1,1,"From Paul , chosen by God to be an apostle of Christ Jesus , and from Sosthenes , who is also a follower .",CJB,Corinthians 1:31 Jeremiah 9:23 (,24,5,0.0,NOVO,EN,dinâmica,dinâmica,moderna,moderna,informal,informal
4112,CJB,1%20Corinthians,1,1,Corinthians 1:31 Jeremiah 9:23 (,DARBY,"Paul , [ a ] called apostle of Jesus Christ , by God 's will , and Sosthenes the brother ,",5,21,0.0,NOVO,EN,dinâmica,literal,moderna,arcaica,informal,formal
4065,ERV,1%20Corinthians,1,1,Corinthians 1:31 Quote from Jer . 9:24 .,EXB,"From Paul . ·God called me [ L …called ] to be an ·apostle [ messenger ] of Christ Jesus ·because that is what God wanted [ L by the will of God ] . Also from Sosthenes [ C a coworker ; Paul may be dictating the letter to him ; see 16:21 ; perhaps the synagogue leader mentioned in Acts 18:15–17 ] , our ·brother in Christ [ L brother ] .",8,74,0.017751,NOVO,EN,dinâmica,livre,moderna,moderna,informal,formal
4089,NIRV,1%20Corinthians,1,1,"I , Paul , am writing this letter . I have been chosen to be an apostle of Christ Jesus just as God planned . Our brother Sosthenes joins me in writing .",YLT,"Paul , a called apostle of Jesus Christ , through the will of God , and Sosthenes the brother ,",33,20,0.169231,NOVO,EN,dinâmica,literal,moderna,arcaica,informal,formal


In [62]:
df_concat[(df_concat["language"] == "PT") & (df_concat["formal/informal_x"] == "informal") & (df_concat["livro"] == "1%20Coríntios") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
5120,NTLH,1%20Coríntios,1,1,"1-2 Eu , Paulo , que fui chamado pela vontade de Deus para ser apóstolo de Cristo Jesus , escrevo , junto com o irmão Sóstenes , esta carta à igreja de Deus que está na cidade de Corinto . Escrevo a todos os que , pela sua união com Cristo Jesus , foram chamados para pertencerem ao povo de Deus . Esta carta é também para aqueles que em todos os lugares adoram o nosso Senhor Jesus Cristo , Senhor deles e nosso .",NVI-PT,"Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , e o irmão Sóstenes ,",85,19,0.16568,NOVO,PT,dinâmica,dinâmica,moderna,moderna,informal,formal


In [63]:
df_concat[(df_concat["language"] == "PT") & (df_concat["formal/informal_x"] == "formal") & (df_concat["livro"] == "1%20Coríntios") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4990,ARC,1%20Coríntios,1,1,"Paulo ( chamado apóstolo de Jesus Cristo , pela vontade de Deus ) e o irmão Sóstenes ,",NTLH,"1-2 Eu , Paulo , que fui chamado pela vontade de Deus para ser apóstolo de Cristo Jesus , escrevo , junto com o irmão Sóstenes , esta carta à igreja de Deus que está na cidade de Corinto . Escrevo a todos os que , pela sua união com Cristo Jesus , foram chamados para pertencerem ao povo de Deus . Esta carta é também para aqueles que em todos os lugares adoram o nosso Senhor Jesus Cristo , Senhor deles e nosso .",18,85,0.142012,NOVO,PT,literal,dinâmica,arcaica,moderna,formal,informal
5021,NVI-PT,1%20Coríntios,1,1,"Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , e o irmão Sóstenes ,",NVT,"Eu , Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , escrevo esta carta , com nosso irmão Sóstenes ,",19,25,0.591837,NOVO,PT,dinâmica,dinâmica,moderna,moderna,formal,formal
4987,NVT,1%20Coríntios,1,1,"Eu , Paulo , chamado para ser apóstolo de Cristo Jesus pela vontade de Deus , escrevo esta carta , com nosso irmão Sóstenes ,",OL,"Paulo , escolhido por Deus para ser o apóstolo de Jesus Cristo , e o irmão Sóstenes .",25,18,0.306122,NOVO,PT,dinâmica,literal,moderna,arcaica,formal,formal
4967,OL,1%20Coríntios,1,1,"Paulo , escolhido por Deus para ser o apóstolo de Jesus Cristo , e o irmão Sóstenes .",VFL,"De Paulo , chamado para ser apóstolo de Cristo Jesus [ a ] pela vontade de Deus , e também de Sóstenes , o nosso irmão em Cristo .",18,29,0.315789,NOVO,PT,literal,dinâmica,arcaica,moderna,formal,informal


In [64]:
df_concat[(df_concat["language"] == "EN") & (df_concat["formal/informal_x"] == "formal") & (df_concat["livro"] == "1%20Corinthians") & (df_concat["capitulo"] == 1) & (df_concat["versiculo"] == 1)]

Unnamed: 0,estilo_x,livro,capitulo,versiculo,texto_x,estilo_y,texto_y,sourceLen,targetLen,overlap,VERSAO,language,forma_x,forma_y,linguagem_x,linguagem_y,formal/informal_x,formal/informal_y
4101,AKJV,1%20Corinthians,1,1,"Paul , called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",ASV,Corinthians 1:31 Jer . 9:23f .,21,6,0.0,NOVO,EN,literal,literal,arcaica,arcaica,formal,formal
4101,ASV,1%20Corinthians,1,1,Corinthians 1:31 Jer . 9:23f .,BRG,"Paul , called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",6,21,0.0,NOVO,EN,literal,literal,arcaica,moderna,formal,formal
4004,CEB,1%20Corinthians,1,1,"From Paul , called by God ’ s will to be an apostle of Jesus Christ , and from Sosthenes our brother .",CEV,"From Paul , chosen by God to be an apostle of Christ Jesus , and from Sosthenes , who is also a follower .",23,24,0.553191,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,informal
4108,DARBY,1%20Corinthians,1,1,"Paul , [ a ] called apostle of Jesus Christ , by God 's will , and Sosthenes the brother ,",DRA,"Paul , called to be an apostle of Jesus Christ by the will of God , and Sosthenes a brother ,",21,21,0.511628,NOVO,EN,literal,literal,arcaica,arcaica,formal,formal
4068,DRA,1%20Corinthians,1,1,"Paul , called to be an apostle of Jesus Christ by the will of God , and Sosthenes a brother ,",ERV,Corinthians 1:31 Quote from Jer . 9:24 .,21,8,0.0,NOVO,EN,literal,dinâmica,arcaica,moderna,formal,informal
4110,EXB,1%20Corinthians,1,1,"From Paul . ·God called me [ L …called ] to be an ·apostle [ messenger ] of Christ Jesus ·because that is what God wanted [ L by the will of God ] . Also from Sosthenes [ C a coworker ; Paul may be dictating the letter to him ; see 16:21 ; perhaps the synagogue leader mentioned in Acts 18:15–17 ] , our ·brother in Christ [ L brother ] .",GNV,"Corinthians 1:31 Let him yield all to God and give him thanks : and so by this place is man ’ s free will beaten down , which the Papist so dream of . Geneva Bible ,",74,37,0.065089,NOVO,EN,livre,literal,moderna,arcaica,formal,formal
4110,GNV,1%20Corinthians,1,1,"Corinthians 1:31 Let him yield all to God and give him thanks : and so by this place is man ’ s free will beaten down , which the Papist so dream of . Geneva Bible ,",KJ21,"Paul , called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",37,21,0.090909,NOVO,EN,literal,literal,arcaica,moderna,formal,formal
4113,KJV,1%20Corinthians,1,1,"Paul called to be an apostle of Jesus Christ through the will of God , and Sosthenes our brother ,",LEB,Corinthians 1:31 A quotation from Jer 9:24,20,7,0.0,NOVO,EN,literal,dinâmica,arcaica,moderna,formal,formal
4101,LEB,1%20Corinthians,1,1,Corinthians 1:31 A quotation from Jer 9:24,MOUNCE,"Paul Paulos , called klētos to be an apostle apostolos of Christ Christos Jesus Iēsous by dia the will thelēma of God theos , and kai Sosthenes Sōsthenēs our ho brother adelphos ,",7,33,0.0,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,formal
4100,MOUNCE,1%20Corinthians,1,1,"Paul Paulos , called klētos to be an apostle apostolos of Christ Christos Jesus Iēsous by dia the will thelēma of God theos , and kai Sosthenes Sōsthenēs our ho brother adelphos ,",NABRE,"Cor 1:31 ) , on the other hand , is the acknowledgment that we live only from God and for God . Scripture texts , prefaces , introductions , footnotes and cross references used in this work are taken from the New American Bible , revised edition ©",33,48,0.040404,NOVO,EN,dinâmica,dinâmica,moderna,moderna,formal,formal


In [22]:
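# Keep only the source-side ("_x") columns and save them as gzip-compressed Parquet.
target_side_columns = ["estilo_y", "targetLen", "overlap", "forma_y", "linguagem_y", "texto_y", "formal/informal_y"]
df_concat.drop(columns=target_side_columns).to_parquet("./data/complete.parquet", compression="gzip")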

In [23]:
import polars

polars.read_parquet("./data/complete.parquet")

estilo_x,livro,capitulo,versiculo,texto_x,sourceLen,VERSAO,language,forma_x,linguagem_x,formal/informal_x,__index_level_0__
str,str,i64,i64,str,i64,str,str,str,str,str,i64
"""ARC""","""Gênesis""",1,18,"""e para governa…",27,"""VELHO""","""PT""","""literal""","""arcaica""","""formal""",17
"""ARC""","""Gênesis""",1,19,"""E foi a tarde …",12,"""VELHO""","""PT""","""literal""","""arcaica""","""formal""",18
"""ARC""","""Gênesis""",1,20,"""E disse Deus :…",25,"""VELHO""","""PT""","""literal""","""arcaica""","""formal""",19
"""ARC""","""Gênesis""",1,21,"""E Deus criou a…",40,"""VELHO""","""PT""","""literal""","""arcaica""","""formal""",20
"""ARC""","""Gênesis""",1,22,"""E Deus os aben…",27,"""VELHO""","""PT""","""literal""","""arcaica""","""formal""",21
…,…,…,…,…,…,…,…,…,…,…,…
"""NR1994""","""Apocalisse""",22,8,"""Io , * Giovann…",38,"""NOVO""","""IT""","""literal""","""moderna""","""formal""",7940
"""NR1994""","""Apocalisse""",22,16,"""Io , Gesú , ho…",36,"""NOVO""","""IT""","""literal""","""moderna""","""formal""",7948
"""NR1994""","""Apocalisse""",22,17,"""( C ) Lo Spiri…",42,"""NOVO""","""IT""","""literal""","""moderna""","""formal""",7949
"""NR1994""","""Apocalisse""",22,18,"""Io lo dichiaro…",32,"""NOVO""","""IT""","""literal""","""moderna""","""formal""",7950
