# Data Processing

## Filter Dataset

In this notebook, I create a filtered DataFrame from the full dataset of translated sentences. I keep only the sentence pairs in which both the Portuguese and the English sentence are 6 to 8 tokens long. Restricting the length range should make the training task easier.

First I import the two libraries I need for this notebook:
- `re` for regular expressions, used to tokenize the sentences
- `pandas` to create a DataFrame of the sentences

In [1]:
import re
import pandas as pd

## Data

I first downloaded and extracted the two text files from [this `tgz` archive](https://www.statmt.org/europarl/v7/pt-en.tgz), then placed them in the same directory as this notebook.

Note: make sure the two text files are in this notebook's directory before running the following cells.

First I read the Portuguese and English sentences into their own variables:

In [2]:
with open('europarl-v7.pt-en.pt', 'r', encoding='utf-8') as f:
    pt_text = f.read() # Portuguese sentences
with open('europarl-v7.pt-en.en', 'r', encoding='utf-8') as f:
    en_text = f.read() # English sentences

I make all the text lowercase so that capitalization won't be an issue during training:

In [3]:
# lowercase
pt_text = pt_text.lower()
en_text = en_text.lower()

In [4]:
len(pt_text), len(en_text)

(317385553, 295365281)

Then I split the texts into sentences:

In [5]:
pt_sents = pt_text.split('\n')
en_sents = en_text.split('\n')
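One detail of `str.split('\n')` is worth keeping in mind: if the text ends with a newline, the resulting list ends with an empty string, which later becomes an empty row. The sketch below shows this on a made-up two-line string (`toy_text` and its contents are illustrative, not taken from the corpus) and compares it with `str.splitlines()`:

```python
# Toy text ending in a newline, as text files often do (illustrative data).
toy_text = "first sentence.\nsecond sentence.\n"

split_sents = toy_text.split('\n')   # keeps an empty string after the final newline
line_sents = toy_text.splitlines()   # no trailing empty string

print(split_sents)
print(line_sents)
```

Note that `splitlines()` also breaks on a few other line-boundary characters, so on a parallel corpus it could split the two files differently. Whichever method is used, it is worth checking that both lists end up with the same length.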

## Tokenize

Before tokenizing, I rejoin each list of sentences into a single string per language:

In [6]:
# combined texts
pt_text = '\n'.join(pt_sents)
en_text = '\n'.join(en_sents)

Then I define a tokenizer that uses `re` to extract every word and every run of punctuation from a sentence:

In [7]:
def tokenize_sentence(sent):
    return re.findall(r'\w+|[^\w\s]+', sent) # matches words or punctuation
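To make the tokenizer's behaviour concrete, here is a small sketch that applies the same pattern to one of the sentences from the dataset. It repeats the function definition only so that the snippet runs on its own:

```python
import re

def tokenize_sentence(sent):
    # Same pattern as above: runs of word characters, or runs of punctuation.
    return re.findall(r'\w+|[^\w\s]+', sent)

toks = tokenize_sentence('o meu voto era "a favor".')
print(toks)
# ['o', 'meu', 'voto', 'era', '"', 'a', 'favor', '".']
```

Two things stand out: adjacent punctuation marks such as `".` are kept together as a single token, and because `\w` matches Unicode letters in Python 3, accented words such as `não` are not split.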

## DataFrame

I create a DataFrame of the sentences as strings:

In [8]:
df = pd.DataFrame({'pt':pt_sents, 'en':en_sents})
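Building the DataFrame this way relies on the two lists being aligned line by line, so that row *i* holds a sentence and its translation. pandas raises a `ValueError` if the lists in such a dict differ in length, but an explicit check gives a clearer message. A minimal sketch with two toy lists (the `toy_*` names are illustrative):

```python
import pandas as pd

# Two toy parallel lists (sentences taken from the outputs shown in this notebook).
toy_pt = ['reinício da sessão', 'não há lugar para mudanças.']
toy_en = ['resumption of the session', 'there is no room for amendments.']

# Row i is only a translation pair if both lists are aligned line by line.
assert len(toy_pt) == len(toy_en), 'parallel lists differ in length'

toy_df = pd.DataFrame({'pt': toy_pt, 'en': toy_en})
print(toy_df.shape)
```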

In [9]:
len(df)

1960408

In [10]:
df.head()

| | pt | en |
|---|---|---|
| 0 | reinício da sessão | resumption of the session |
| 1 | declaro reaberta a sessão do parlamento europe... | i declare resumed the session of the european ... |
| 2 | como puderam constatar, o grande "bug do ano 2... | although, as you will have seen, the dreaded '... |
| 3 | os senhores manifestaram o desejo de se proced... | you have requested a debate on this subject in... |
| 4 | entretanto, gostaria - como também me foi pedi... | in the meantime, i should like to observe a mi... |


Then I add columns with the sentences as lists of tokens:

In [11]:
df['pt_toks'] = df['pt'].apply(tokenize_sentence)
df['en_toks'] = df['en'].apply(tokenize_sentence)

Then I add columns with the length of each sentence (its number of tokens):

In [13]:
df['pt_len'] = df['pt_toks'].apply(len)
df['en_len'] = df['en_toks'].apply(len)

I then filter the DataFrame to create a much smaller one. I keep only the rows where both conditions hold:
- the Portuguese sentence is between 6 and 8 tokens long (inclusive)
- the English sentence is between 6 and 8 tokens long (inclusive)

In [23]:
short_df = df[(df['pt_len'] >= 6) 
              & (df['pt_len'] <= 8) 
              & (df['en_len'] >= 6) 
              & (df['en_len'] <= 8)]
short_df

| | pt | en | pt_toks | en_toks | pt_len | en_len |
|---|---|---|---|---|---|---|
| 27 | é o caso de alexander nikitin. | it is the case of alexander nikitin. | [é, o, caso, de, alexander, nikitin, .] | [it, is, the, case, of, alexander, nikitin, .] | 7 | 8 |
| 84 | (aplausos da bancada do grupo pse) | (applause from the pse group) | [(, aplausos, da, bancada, do, grupo, pse, )] | [(, applause, from, the, pse, group, )] | 8 | 7 |
| 128 | obrigada, senhor deputado poettering. | thank you, mr poettering. | [obrigada, ,, senhor, deputado, poettering, .] | [thank, you, ,, mr, poettering, .] | 6 | 6 |
| 140 | o meu voto era "a favor". | my vote was "in favour" . | [o, meu, voto, era, ", a, favor, ".] | [my, vote, was, ", in, favour, ", .] | 8 | 8 |
| 143 | não há lugar para mudanças. | there is no room for amendments. | [não, há, lugar, para, mudanças, .] | [there, is, no, room, for, amendments, .] | 6 | 7 |
| ... | ... | ... | ... | ... | ... | ... |
| 1960216 | extensão ao tajiquistão de assistência finance... | extension of exceptional financial assistance ... | [extensão, ao, tajiquistão, de, assistência, f... | [extension, of, exceptional, financial, assist... | 7 | 7 |
| 1960366 | vamos agora proceder à votação. | we shall now proceed to the vote. | [vamos, agora, proceder, à, votação, .] | [we, shall, now, proceed, to, the, vote, .] | 6 | 8 |
| 1960367 | após a votação das alterações: | following the vote on the amendments | [após, a, votação, das, alterações, :] | [following, the, vote, on, the, amendments] | 6 | 6 |
| 1960382 | ­ muito obrigado, senhora deputada thyssen. | thank you very much, mrs thyssen. | [­, muito, obrigado, ,, senhora, deputada, thy... | [thank, you, very, much, ,, mrs, thyssen, .] | 8 | 8 |
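The tokenize, count, and filter steps can be tried end to end on a tiny frame. The sketch below uses three sentence pairs from the outputs in this notebook. The `toy_*` names are illustrative, and `Series.between` is used here as an equivalent way to write the inclusive `>=` / `<=` comparisons, not as the method used above:

```python
import re
import pandas as pd

def tokenize_sentence(sent):
    # Same tokenizer as in the notebook: words or runs of punctuation.
    return re.findall(r'\w+|[^\w\s]+', sent)

toy_df = pd.DataFrame({
    'pt': ['reinício da sessão',
           'não há lugar para mudanças.',
           'vamos agora proceder à votação.'],
    'en': ['resumption of the session',
           'there is no room for amendments.',
           'we shall now proceed to the vote.'],
})

# Token count per sentence, for each language.
for lang in ('pt', 'en'):
    toy_df[f'{lang}_len'] = toy_df[lang].apply(tokenize_sentence).apply(len)

# Series.between is inclusive on both ends by default.
mask = toy_df['pt_len'].between(6, 8) & toy_df['en_len'].between(6, 8)
toy_short = toy_df[mask]
print(toy_short[['pt_len', 'en_len']])
```

The first pair is dropped because both of its sentences are shorter than 6 tokens. The other two pairs are kept, and the filtered frame retains their original index labels, just as `short_df` does.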


I then save this small DataFrame for use in the following notebook:

In [24]:
# save the filtered DataFrame as a pickle file
short_df.to_pickle('short_df.p')
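A pickle file keeps Python objects intact, so the token columns come back as lists when the file is loaded with `pd.read_pickle`. A text format such as CSV would store them as strings instead. A minimal round-trip sketch on a toy frame, written to a temporary directory (the `toy_df.p` file name is hypothetical):

```python
import os
import tempfile
import pandas as pd

toy_df = pd.DataFrame({'pt_toks': [['não', 'há', 'lugar']], 'pt_len': [3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_df.p')  # hypothetical file name
    toy_df.to_pickle(path)
    restored = pd.read_pickle(path)

# The token column is still a list after the round trip.
print(type(restored.loc[0, 'pt_toks']))
print(restored['pt_toks'].tolist())
```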

The training and inference happen in the next notebook, titled "Translation".